# What is Cross Validation?

Cross-validation (CV) is a resampling technique for estimating how well a model will perform on unseen data.

Instead of training on one part of the dataset and testing once,
we train and test several times on different splits of the dataset and average the resulting scores (for example, accuracy).

When you train a model, it's easy for it to "memorize" the training data (overfitting). If you only test it on one small slice of data, you might get lucky, or unlucky, with that specific slice. Cross-validation solves this by rotating which parts of the data are used for training and which are used for testing.

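👉 It helps to:

* Detect overfitting (training scores much higher than validation scores)
* Detect underfitting (low scores on every fold)
* Get a more reliable performance estimate
* Use the available data effectively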
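To make the "lucky or unlucky slice" point concrete, here is a minimal sketch that repeats a single 80/20 hold-out split with ten different random seeds. The dataset (iris), the model, and the variable name `single_split_scores` are illustrative assumptions; the point is only that the score depends on which rows end up in the test slice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat one 80/20 hold-out split with ten different random seeds.
single_split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    single_split_scores.append(model.score(X_test, y_test))

single_split_scores = np.array(single_split_scores)
print("Hold-out accuracy per seed:", single_split_scores.round(3))
print("Lowest:", single_split_scores.min(), "Highest:", single_split_scores.max())
```

Any spread between the lowest and highest value comes purely from the choice of test slice, which is the noise that averaging over folds is meant to reduce.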

# What is it?
It is a statistical method of dividing data into multiple subsets. You train the model on some subsets and validate it on the remaining ones, repeating this process multiple times.

# Why use it?
* Detects overfitting: it shows whether the model generalizes to data it was not trained on. Cross-validation does not prevent overfitting by itself; it reveals it.
* Reliability: it provides a more stable estimate of model performance (accuracy, precision, etc.) than a single train/test split.
* Small datasets: it lets you use limited data more effectively, because every data point is used for both training and validation.

# When to use it?
* During model selection (comparing Model A vs. Model B).
* During hyperparameter tuning (finding the best settings for your model).
* Whenever you have a limited amount of data and want a robust performance metric.
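As a hedged illustration of model selection, the sketch below scores two candidate models on the same folds and compares their mean accuracy. The two candidates, the iris dataset, and the names `candidates` and `results` are assumptions made for the example, not a recommendation of either model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Use the same folds for every candidate so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "Model A (logistic regression)": LogisticRegression(max_iter=1000),
    "Model B (decision tree)": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.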

# Types of Cross Validation
✅ K-Fold (Most Common)

* Divides the data into K roughly equal parts (optionally after shuffling)

✅ Stratified K-Fold

* Used for classification
* Maintains the same class percentages in each fold

✅ Leave One Out (LOOCV)

* Each sample is the test set exactly once
* Low-bias estimate, but very slow on large datasets

✅ Time Series Cross Validation

* No shuffling
* Past → Train
* Future → Test
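The "past trains, future tests" rule can be sketched with scikit-learn's `TimeSeriesSplit`, which produces expanding training windows. The 12-row array `X_time` is a made-up stand-in for time-ordered data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (row 0 is the oldest).
X_time = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_time))

for i, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index comes before every test index.
    print(f"Split {i}: train rows {train_idx.tolist()} -> test rows {test_idx.tolist()}")
```

Because the splitter never shuffles, a model is always evaluated on rows that come after the ones it was trained on.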

# Types of Cross-Validation

* K-Fold CV (The Standard): The data is split into $k$ roughly equal parts (folds). The model is trained $k$ times, each time using a different fold as the test set and the others as the training set.
* Stratified K-Fold: Used for classification. It ensures that each fold has approximately the same percentage of each class (e.g., 50% "Yes", 50% "No") as the whole dataset.
* Leave-One-Out (LOOCV): Only one data point is used for testing, and the rest for training. This is repeated for every single point in the dataset, so it is very slow for large data.
* Time-Series CV: Since time data is ordered, you cannot shuffle it. This method uses a "sliding window" or "expanding window" to train on the past and test on the future.
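To show why Leave-One-Out is slow, the following sketch runs it on iris with scikit-learn's `LeaveOneOut` splitter: the model is fitted once per sample. The dataset and the name `loo_scores` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fit per sample: each test set contains exactly one row.
loo = LeaveOneOut()
loo_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

# Each per-fold score is either 0.0 (wrong) or 1.0 (correct).
print("Number of model fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

The number of fits equals the number of rows, so the cost grows linearly with dataset size on top of the cost of each fit.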

# When Should You Use Cross Validation?

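Use CV when:

* Data is small or limited
* You want a more reliable performance estimate
* You are comparing models
* You are doing hyperparameter tuning
* You are entering machine learning competitions

Avoid or adapt CV when:

* Data is extremely large (repeated training takes too much time)
* You are forecasting a time series (standard shuffled K-Fold leaks future data; use Time Series Cross Validation instead)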
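For the hyperparameter tuning case, a minimal sketch with `GridSearchCV` is shown here: every candidate setting is scored with 5-fold cross-validation and the best mean score wins. The grid of `C` values is an arbitrary assumption chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values for the regularization strength C (illustrative only).
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Each candidate is evaluated with 5-fold cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

Note that `best_score_` is chosen as the maximum over the grid, so it is an optimistic estimate; a separate held-out test set gives a cleaner final number.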

# K-Fold Cross Validation (Most Used)

* The dataset is divided into K roughly equal parts (folds)
* The model is trained K times

Each time:

* One fold → Test set
* Remaining K−1 folds → Training set

Finally → take the average accuracy across the K runs

Example:
If K = 5

* Run1: Train on Folds 2-5 → Test on Fold 1
* Run2: Train on Folds 1, 3-5 → Test on Fold 2
* Run3: Train on Folds 1-2, 4-5 → Test on Fold 3
* Run4: Train on Folds 1-3, 5 → Test on Fold 4
* Run5: Train on Folds 1-4 → Test on Fold 5
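The rotation in the K = 5 example can be reproduced on a toy dataset. This sketch uses an unshuffled `KFold` on ten made-up rows (`X_demo`), so each fold is a consecutive block of two rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy rows, so each of the 5 folds holds 2 rows.
X_demo = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks
folds = list(kf.split(X_demo))

for run, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Run{run}: train on rows {train_idx.tolist()} -> test on rows {test_idx.tolist()}")
```

Every row appears in exactly one test fold and in the training set of the other four runs.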

# Key Takeaway

* We don't manually choose the test data
* Cross validation automatically rotates the test fold
* Ensures fair testing
* Helps detect overfitting (it does not prevent it by itself)
* Uses all data for both training and testing

In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

In [38]:
data = load_iris()
# Wrap the feature matrix in a DataFrame for a quick visual check
df = pd.DataFrame(data.data, columns=data.feature_names)
df

index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [39]:
X = data.data     # features
y = data.target   # label

In [40]:
model = LogisticRegression(max_iter=1000)

In [41]:
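# Shuffle before splitting: the iris rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
# cross_val_score fits a fresh clone of the model on each training split
scores = cross_val_score(model, X, y, cv=kfold)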

print("Scores for each fold:", scores)
print("Average Accuracy:", np.mean(scores))

Scores for each fold: [0.96666667 0.96666667 0.93333333 0.93333333 1.        ]
Average Accuracy: 0.96


# Stratified K-Fold (Best for Classification)

Stratification keeps the class ratio the same in each fold.
It is useful for datasets such as:

* Fraud detection
* Disease prediction
* Any imbalanced class problem
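The effect on an imbalanced problem can be checked by counting minority-class samples per test fold. The sketch below uses a synthetic label vector `y_imb` with 90 negatives and 10 positives, which is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 90 samples of class 0, 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # features are irrelevant for counting

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority (class 1) samples land in each test fold.
plain_counts = [int(y_imb[test].sum()) for _, test in plain.split(X_imb)]
strat_counts = [int(y_imb[test].sum()) for _, test in strat.split(X_imb, y_imb)]

print("Minority samples per fold, plain K-Fold:     ", plain_counts)
print("Minority samples per fold, Stratified K-Fold:", strat_counts)
```

With stratification each fold receives the same share of the minority class, whereas plain K-Fold leaves that share to chance.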

In [7]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=skf)

print("Scores:", scores)
print("Average Accuracy:", np.mean(scores))

Scores: [0.96666667 1.         0.93333333 1.         0.9       ]
Average Accuracy: 0.9600000000000002


# Cross Validation in One Line

In [8]:
from sklearn.model_selection import cross_val_score

# An integer cv on a classifier uses StratifiedKFold (no shuffling) with that many folds
scores = cross_val_score(model, X, y, cv=10)
print("Scores:", scores)
print("Average:", scores.mean())

Scores: [1.         0.93333333 1.         1.         0.93333333 0.93333333
 0.93333333 1.         1.         1.        ]
Average: 0.9733333333333334
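When more than one metric is needed (accuracy, precision, and so on), scikit-learn's `cross_validate` accepts a list of scorers in a single call. The sketch below is one possible setup; the choice of macro-averaged precision and recall is an assumption suited to the three-class iris labels.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# Evaluate several metrics on the same 10 folds in one call.
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=10,
    scoring=["accuracy", "precision_macro", "recall_macro"],
)

# Results are stored under "test_<scorer name>" keys, one value per fold.
for metric in ["accuracy", "precision_macro", "recall_macro"]:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: mean {fold_scores.mean():.3f}")
```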


# Cross Validation for Regression

`load_boston` was removed from scikit-learn in version 1.2 because of ethical problems with the Boston housing dataset, so `from sklearn.datasets import load_boston` now raises an `ImportError`. The scikit-learn maintainers suggest the California housing dataset (`fetch_california_housing`) or the Ames housing dataset as alternatives. The regression example uses California housing with `LinearRegression`.


In [13]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

In [14]:
data = fetch_california_housing()
X = data.data
y = data.target

In [15]:
model = LinearRegression()

# An integer cv on a regressor uses plain KFold without shuffling; the default score is R2
scores = cross_val_score(model, X, y, cv=5)

print("Scores for each fold:", scores)
print("Average Score:", scores.mean())

Scores for each fold: [0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
Average Score: 0.5530311140279226
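R2 is only the default; an error metric can be requested through the `scoring` parameter. The sketch below swaps in the small diabetes dataset bundled with scikit-learn (an assumption, chosen because it needs no download) and reports both R2 and RMSE on shuffled folds.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Shuffled folds, fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=1)

r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)
rmse = np.sqrt(-neg_mse)

print("R2 per fold:", r2.round(3), "mean:", r2.mean().round(3))
print("RMSE per fold:", rmse.round(2), "mean:", rmse.mean().round(2))
```

RMSE is expressed in the units of the target, which often makes it easier to interpret than R2 when communicating how large the typical error is.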
