# Model evaluation using cross-validation

## Data preparation

In [1]:
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")

# Separate data and target
data, target = adult_census.drop(columns="class"), adult_census["class"]

# Select the numerical columns
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]

## Creating a machine learning pipeline

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

In [4]:
model = make_pipeline(StandardScaler(), LogisticRegression())

## Cross-validation
The measured performance of a model can vary depending on how the data is split into a training set and a test set.

Cross-validation is one approach to address this issue. It repeats the train-test split several times, so that each repetition uses a different training set and test set. In other words, we create different partitions of the data.
Across iterations the training sets overlap, while the test set changes each time (5 iterations in the KFold diagram below).
The result is one accuracy score per iteration (partition).

In KFold, the model is trained K times (for example 5 or 10 times), so the computational cost grows with K. This matters on large datasets, where a single fit can already take a long time.

![Cross-validation diagram](../figures/cross_validation_diagram.png)
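
To make the partitioning concrete, here is a minimal sketch that prints the train and test indices produced by `KFold` on a hypothetical toy array of 10 samples (`X_toy` is made up for illustration and is not the census data). Note that when `cv=5` is passed to `cross_validate` with a classifier, scikit-learn uses a stratified variant of K-fold by default, so the exact folds used on the census data will differ from this plain `KFold` illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy dataset: 10 samples, each identified by its index.
X_toy = np.arange(10).reshape(-1, 1)

# 5 folds without shuffling: each fold holds out a contiguous block of 2 samples.
kfold = KFold(n_splits=5)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy)):
    test_folds.append(test_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample lands in the test set exactly once, while any two training sets share most of their samples.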

In [9]:
from sklearn.model_selection import cross_validate

# data_numeric is the full (unsplit) numeric dataset
cv_result = cross_validate(model, data_numeric, target, cv=5)

In [10]:
cv_result

{'fit_time': array([0.07347465, 0.10056591, 0.10927868, 0.09904099, 0.08461881]),
 'score_time': array([0.01770091, 0.02382255, 0.01950097, 0.02127552, 0.01615644]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])}

The goal here is not to obtain a fitted model, but to evaluate the **accuracy of the chosen model** on this dataset.
The output contains the scores, **not a fitted model!**
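
If the models fitted on each fold are needed for inspection, `cross_validate` accepts a `return_estimator=True` argument. The sketch below illustrates this on synthetic stand-in data from `make_classification` (an assumption for the sake of a self-contained example; `X_demo`, `y_demo`, and `demo_model` are hypothetical names, not part of the notebook above).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is not the adult census dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = make_pipeline(StandardScaler(), LogisticRegression())
demo_result = cross_validate(demo_model, X_demo, y_demo, cv=5, return_estimator=True)

# One fitted pipeline per fold is returned under the "estimator" key.
fitted_per_fold = demo_result["estimator"]
print(len(fitted_per_fold))

# The pipeline object passed in is cloned internally and stays unfitted.
print(hasattr(demo_model[-1], "coef_"))
```

Each of the returned pipelines was trained on only part of the data, so they are useful for inspection rather than as a final model.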

### Now that we have the scores, we can summarize them, for instance by taking the mean.

In [12]:
scores = cv_result["test_score"]
scores

array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])

In [13]:
scores.mean()

0.7998035202866813

In [14]:
scores.std()

0.0029046240751883553
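
The mean and standard deviation are often reported together as a single summary. A minimal sketch, reusing the five test scores printed above (copied by hand into an array so the snippet runs on its own):

```python
import numpy as np

# Test scores copied from the cross-validation output above.
scores = np.array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])

# Mean accuracy across folds, with the spread between folds as a rough
# indication of how much the estimate depends on the split.
summary = f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}"
print(summary)  # Mean accuracy: 0.800 +/- 0.003
```

The small standard deviation suggests that, on this dataset, the accuracy estimate changes little from one partition to another.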