# Cross-Validation with CUAnalytics

This notebook demonstrates how to use `cross_validate` to evaluate several supervised models, first classifiers and then regressors.

In [12]:
import cuanalytics as ca
import pandas as pd


## Classification Models (Breast Cancer)

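We'll run cross-validation for four classifiers on the same dataset: logistic regression (`fit_logit`), a support vector machine (`fit_svm`), k-nearest neighbors (`fit_knn_classifier`), and a neural network (`fit_nn`). Every call stratifies the folds on `diagnosis`. Note that the logistic regression run uses `k=10`, while the other three use `k=5`.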

In [13]:
df_cls = ca.load_breast_cancer_data()
df_cls.head()


| Unnamed: 0 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | M |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | M |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | M |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | M |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | M |


In [14]:
cv_logit = ca.cross_validate(
    ca.fit_logit,
    df_cls,
    formula='diagnosis ~ .',
    k=10,
    stratify_on='diagnosis',
    max_iter=5000
)
cv_logit


{'task': 'classification',
 'k': 10,
 'stratify_on': 'diagnosis',
 'folds': [{'fold': 1,
   'accuracy': 0.9649122807017544,
   'kappa': 0.925974025974026,
   'macro_f1': 0.9629870129870131},
  {'fold': 2,
   'accuracy': 0.9473684210526315,
   'kappa': 0.8917036098796707,
   'macro_f1': 0.9456970466814862},
  {'fold': 3,
   'accuracy': 0.9473684210526315,
   'kappa': 0.8857715430861723,
   'macro_f1': 0.942866688940862},
  {'fold': 4,
   'accuracy': 0.8771929824561403,
   'kappa': 0.7223382045929019,
   'macro_f1': 0.85995085995086},
  {'fold': 5,
   'accuracy': 0.9649122807017544,
   'kappa': 0.9230769230769231,
   'macro_f1': 0.9614864864864865},
  {'fold': 6,
   'accuracy': 0.9473684210526315,
   'kappa': 0.8834355828220859,
   'macro_f1': 0.9415384615384615},
  {'fold': 7,
   'accuracy': 0.9649122807017544,
   'kappa': 0.9246031746031746,
   'macro_f1': 0.9623015873015872},
  {'fold': 8,
   'accuracy': 0.9298245614035088,
   'kappa': 0.8521400778210116,
   'macro_f1': 0.925974025974
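
The per-fold entries shown above can be reduced to a mean and a spread for each metric. The following is a minimal sketch, assuming `cv_logit['folds']` is a list of dicts shaped like the output above. The helper name `summarize_folds` is hypothetical and not part of CUAnalytics. Only the first three folds are copied in so the example runs on its own.

```python
from statistics import mean, stdev

# First three folds copied from the `cv_logit` output above. In the notebook,
# pass `cv_logit['folds']` instead (assumed to keep this list-of-dicts shape).
folds = [
    {'fold': 1, 'accuracy': 0.9649122807017544,
     'kappa': 0.925974025974026, 'macro_f1': 0.9629870129870131},
    {'fold': 2, 'accuracy': 0.9473684210526315,
     'kappa': 0.8917036098796707, 'macro_f1': 0.9456970466814862},
    {'fold': 3, 'accuracy': 0.9473684210526315,
     'kappa': 0.8857715430861723, 'macro_f1': 0.942866688940862},
]


def summarize_folds(folds):
    """Return {metric: {'mean': ..., 'std': ...}} for every metric key."""
    metrics = [key for key in folds[0] if key != 'fold']
    summary = {}
    for metric in metrics:
        values = [fold[metric] for fold in folds]
        # stdev is the sample standard deviation and needs >= 2 folds.
        summary[metric] = {'mean': mean(values), 'std': stdev(values)}
    return summary


fold_summary = summarize_folds(folds)
for metric, stats in fold_summary.items():
    print(f"{metric}: mean={stats['mean']:.4f}, std={stats['std']:.4f}")
```

Reporting the standard deviation next to the mean shows how much a score depends on which rows land in the test fold.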

In [15]:
cv_svm = ca.cross_validate(
    ca.fit_svm,
    df_cls,
    formula='diagnosis ~ .',
    k=5,
    stratify_on='diagnosis',
    C=1.0,
)
cv_svm['summary']['mean']


{'accuracy': 0.9473063188945815,
 'kappa': 0.8865181337371659,
 'macro_f1': 0.9432162291897941}
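
The calls above pass `stratify_on='diagnosis'`. As a rough illustration of what stratified fold assignment generally means, the sketch below deals the rows of each class round-robin across the folds, so every fold keeps roughly the overall class ratio. This is a generic technique written with the standard library; how CUAnalytics builds its folds internally is not shown in this notebook, and `stratified_fold_ids` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_ids(labels, k, seed=0):
    """Assign each row a fold id in 0..k-1, spreading every class evenly."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_ids = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_ids[idx] = position % k
    return fold_ids


# Toy labels reusing the class names of `diagnosis`; the counts are made up.
labels = ['M'] * 20 + ['B'] * 30
fold_ids = stratified_fold_ids(labels, k=5)

fold_counts = [
    Counter(label for label, fold in zip(labels, fold_ids) if fold == f)
    for f in range(5)
]
for f, counts in enumerate(fold_counts):
    print(f, dict(counts))
```

Without stratification, a small fold can end up with very few rows of the minority class, which makes per-fold metrics such as kappa and macro F1 noisier.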

In [16]:
cv_knn = ca.cross_validate(
    ca.fit_knn_classifier,
    df_cls,
    formula='diagnosis ~ .',
    k=5,
    stratify_on='diagnosis',
)
cv_knn['summary']['mean']


{'accuracy': 0.9350100916006833,
 'kappa': 0.8595097817578076,
 'macro_f1': 0.9296289544147232}
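
The summaries above report `accuracy`, `kappa`, and `macro_f1`. The sketch below computes the standard textbook versions of these three metrics from true and predicted labels. It assumes `kappa` refers to Cohen's kappa and that CUAnalytics follows the usual definitions, which this notebook does not confirm. The function name `classification_metrics` is hypothetical.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Compute accuracy, Cohen's kappa, and macro F1 from two label lists."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))

    # Accuracy: share of exact matches.
    accuracy = sum(t == p for t, p in pairs) / n

    # Cohen's kappa: agreement beyond what label frequencies alone predict.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in classes) / n ** 2
    # Kappa is undefined when expected agreement is 1; this sketch returns 0.0.
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    # Macro F1: unweighted mean of the per-class F1 scores.
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    return {'accuracy': accuracy, 'kappa': kappa, 'macro_f1': macro_f1}


# Made-up labels: one malignant case is misclassified as benign.
y_true = ['M', 'M', 'B', 'B', 'B', 'B']
y_pred = ['M', 'B', 'B', 'B', 'B', 'B']
cls_scores = classification_metrics(y_true, y_pred)
print(cls_scores)
```

Kappa and macro F1 both penalize a model that does well only on the majority class, which is why they sit below accuracy in every summary above.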

In [17]:
cv_nn = ca.cross_validate(
    ca.fit_nn,
    df_cls,
    formula='diagnosis ~ .',
    k=5,
    stratify_on='diagnosis',
    hidden_layers=[10, 5],
    max_iter=10000,
    random_state=42,
)
cv_nn['summary']['mean']


{'accuracy': 0.9349945660611707,
 'kappa': 0.8589420169518303,
 'macro_f1': 0.929369836610733}
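
To compare the classifiers side by side, the summary means can be collected into one dict and ranked. The sketch below copies the three `cv_*['summary']['mean']` outputs shown above; `rank_models` is a hypothetical helper. The logistic regression run is left out because its summary is not displayed above and it used a different `k`.

```python
# Summary means copied from the SVM, k-NN, and neural network outputs above.
summary_means = {
    'svm': {'accuracy': 0.9473063188945815,
            'kappa': 0.8865181337371659,
            'macro_f1': 0.9432162291897941},
    'knn': {'accuracy': 0.9350100916006833,
            'kappa': 0.8595097817578076,
            'macro_f1': 0.9296289544147232},
    'nn': {'accuracy': 0.9349945660611707,
           'kappa': 0.8589420169518303,
           'macro_f1': 0.929369836610733},
}


def rank_models(summary_means, metric):
    """Sort model names from best to worst on a higher-is-better metric."""
    return sorted(
        summary_means,
        key=lambda name: summary_means[name][metric],
        reverse=True,
    )


ranking = rank_models(summary_means, 'accuracy')
for name in ranking:
    scores = summary_means[name]
    print(f"{name:>4}: accuracy={scores['accuracy']:.4f}, "
          f"kappa={scores['kappa']:.4f}, macro_f1={scores['macro_f1']:.4f}")
```

The gap between k-NN and the neural network is far smaller than the fold-to-fold variation seen in the logistic regression output, so a ranking like this is best read as a rough guide.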

## Regression Models (Real Estate)

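Next, we cross-validate two regression models on the Real Estate Valuation dataset: a linear model (`fit_lm`) and a k-nearest-neighbors regressor (`fit_knn_regressor`).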

In [18]:
df_reg = ca.load_real_estate_data()
df_reg.head()


Downloading Real Estate Valuation dataset from UCI...
Dataset loaded successfully!


| Unnamed: 0 | transaction_date | house_age | distance_to_MRT | num_convenience_stores | latitude | longitude | price_per_unit |
|---|---|---|---|---|---|---|---|
| 0 | 2012.916667 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 1 | 2013.583333 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 2 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 3 | 2012.833333 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |
| 4 | 2012.666667 | 7.1 | 2175.03 | 3 | 24.96305 | 121.51254 | 32.1 |


In [19]:
cv_lm = ca.cross_validate(
    ca.fit_lm,
    df_reg,
    formula='price_per_unit ~ .',
    k=5,
)
cv_lm['summary']['mean']


{'r2': 0.5701514116900681, 'rmse': 8.85711306019509, 'mae': 6.236755122472933}
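
The regression summary reports `r2`, `rmse`, and `mae`. The sketch below computes the standard definitions of these metrics from true and predicted values. It assumes CUAnalytics uses the usual formulas, which this notebook does not confirm, and `regression_metrics` is a hypothetical name.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from two equal-length lists of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(r * r for r in residuals)              # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation

    return {
        'r2': 1 - ss_res / ss_tot,                  # share of variation explained
        'rmse': sqrt(ss_res / n),                   # penalizes large errors more
        'mae': sum(abs(r) for r in residuals) / n,  # average absolute error
    }


# Made-up values, chosen so the arithmetic is easy to check by hand.
reg_scores = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(reg_scores)
```

RMSE and MAE are in the units of the target (`price_per_unit` here), while R² is unitless.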

In [20]:
cv_knn_reg = ca.cross_validate(
    ca.fit_knn_regressor,
    df_reg,
    formula='price_per_unit ~ .',
    k=5,
)
cv_knn_reg['summary']['mean']


{'r2': 0.6120228021351977,
 'rmse': 8.404148224456701,
 'mae': 5.7446041727887165}
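
The two regression summaries can be compared by expressing the k-NN scores as a relative change from the linear model. This is a small sketch using the values copied from the outputs above; `relative_change` is a hypothetical helper.

```python
# Summary means copied from the `cv_lm` and `cv_knn_reg` outputs above.
lm_means = {'r2': 0.5701514116900681,
            'rmse': 8.85711306019509,
            'mae': 6.236755122472933}
knn_means = {'r2': 0.6120228021351977,
             'rmse': 8.404148224456701,
             'mae': 5.7446041727887165}


def relative_change(baseline, candidate):
    """Return (candidate - baseline) / baseline for every shared metric."""
    return {
        metric: (candidate[metric] - baseline[metric]) / baseline[metric]
        for metric in baseline
    }


change = relative_change(lm_means, knn_means)
for metric, value in change.items():
    # Positive is better for r2; negative is better for rmse and mae.
    print(f"{metric}: {value:+.1%}")
```

On these folds the k-NN regressor has a higher R² and lower RMSE and MAE than the linear model.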

## Notes

- `cross_validate` computes its metrics from each model's `predict` output and returns them without printing anything.
- To evaluate a model on a single dataset with printed output, use `model.score(df)`; for a silent version, use `model.get_score(df)`.
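
For readers who want to see the general pattern behind these notes, here is a minimal, generic k-fold loop: fit on the training rows, call `predict` on the held-out rows, and score each fold. It is not CUAnalytics code. `k_fold_rmse`, `MeanModel`, and `fit_mean` are hypothetical names, and the folds are assigned by row position without shuffling to keep the sketch short.

```python
from math import sqrt


class MeanModel:
    """Toy model that always predicts the training mean of the target."""

    def __init__(self, mean):
        self.mean = mean

    def predict(self, row):
        return self.mean


def fit_mean(train_rows, target):
    """Stand-in for a fit function: returns an object with `predict`."""
    return MeanModel(sum(r[target] for r in train_rows) / len(train_rows))


def k_fold_rmse(rows, target, fit, k=5):
    """Run k-fold cross-validation and return the RMSE of each fold."""
    scores = []
    for fold in range(k):
        # Row i is held out in fold i % k and used for training otherwise.
        test = [r for i, r in enumerate(rows) if i % k == fold]
        train = [r for i, r in enumerate(rows) if i % k != fold]

        model = fit(train, target)
        errors = [r[target] - model.predict(r) for r in test]
        scores.append(sqrt(sum(e * e for e in errors) / len(errors)))
    return scores


# Made-up data: the target cycles through 0.0, 1.0, 2.0.
rows = [{'x': i, 'y': float(i % 3)} for i in range(15)]
fold_rmse = k_fold_rmse(rows, 'y', fit_mean, k=5)
print(fold_rmse)
```

Every row is used for testing exactly once, so averaging the per-fold scores gives an estimate based on the whole dataset.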