<a href="https://colab.research.google.com/github/johnsanterre/santerreAI/blob/main/003.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 3: Cross-Validation

Cross-validation is a technique for estimating how well a machine learning model will perform on data it has not seen. It does not prevent overfitting by itself, but it helps detect it: a model that scores well on its training data and poorly on held-out data is overfitting. In k-fold cross-validation, the data is split into k folds; the model is trained on k-1 of them and evaluated on the one that was held out. This process is repeated k times so that each fold serves as the evaluation set exactly once, and the k scores are then averaged.

This tutorial shows how to perform cross-validation in Python with scikit-learn. We start by importing the necessary libraries and loading a sample dataset with the load_diabetes function from the sklearn.datasets module.


In [2]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the feature matrix (X) and the target vector (y) as NumPy arrays
X, y = load_diabetes(return_X_y=True)
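
Before building a model, it can help to confirm what was loaded. The short check below is an addition to the tutorial rather than part of the original notebook; it assumes the standard copy of the diabetes dataset bundled with scikit-learn, which has 442 samples and 10 features.

```python
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# One row per sample and one column per feature; y holds one target per sample.
print(X.shape)  # (442, 10)
print(y.shape)  # (442,)
```

With 442 samples, each of the five default folds holds roughly 88 samples, so every fold score is computed on a fairly small evaluation set.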

Next, we will create a model that we want to evaluate using cross-validation. For this tutorial, we will use a random forest regressor.

In [3]:
# Default hyperparameters; no random_state is set, so results vary slightly between runs
model = RandomForestRegressor()

Now we can use the cross_val_score function from the sklearn.model_selection module to perform cross-validation. It takes the model, the input data, and the target data, and returns an array with one score per fold. When no scoring argument is given, each score comes from the model's own score method, which for a regressor is the coefficient of determination (R²). By default, cross_val_score uses 5-fold cross-validation; this can be changed with the cv parameter.

In [4]:
scores = cross_val_score(model, X, y)
print(scores)

[0.38502019 0.50405881 0.43881749 0.37877011 0.40456854]
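
To make the fold-by-fold procedure concrete, here is a minimal sketch of what cross_val_score does for a regressor with default settings, written as an explicit loop. The random_state value and the variable names (manual_scores, library_scores) are assumptions introduced for this illustration; the seed is there only so that the manual loop and the library call can be compared.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Assumption: a fixed seed, so both approaches train identical forests.
model = RandomForestRegressor(random_state=0)

# Five consecutive, unshuffled folds: the default split for a regressor.
kfold = KFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    # For regressors, score() returns R^2 on the held-out fold.
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=kfold)

print(manual_scores)
print(np.allclose(manual_scores, library_scores))
```

Cloning the model inside the loop matters: reusing one fitted object across folds would be easy to get wrong, whereas a fresh copy guarantees that each fold's model has only seen that fold's training data.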


The scores array contains one score (R² for a regressor) for each fold of the cross-validation. Because the random forest is not seeded, the exact values will vary slightly from run to run. We can calculate the mean and standard deviation of these scores to summarize the model's performance. A high mean with a low standard deviation means that the model performs well and consistently across folds. A high standard deviation means that performance depends heavily on which samples land in the evaluation fold, so the mean is a less reliable estimate; this can happen with small datasets or with a model that is overly sensitive to its training data, which is a symptom of overfitting.

In [5]:
print(f"Mean: {scores.mean()}")
print(f"Standard Deviation: {scores.std()}")

Mean: 0.42224702744364784
Standard Deviation: 0.045948572219829495
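
The cv parameter mentioned earlier accepts either an integer or a splitter object. The sketch below is one possible variation, not part of the original notebook: it uses ten shuffled folds and reports mean absolute error instead of R². The scoring argument, the seeds, and the names cv, neg_mae and mae are choices made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Assumption: seeds are fixed only to make the example repeatable.
model = RandomForestRegressor(random_state=0)

# Ten shuffled folds instead of the default five consecutive ones.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so error metrics are negated.
neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
mae = -neg_mae  # flip the sign back to an ordinary error

print(f"MAE per fold: {mae.round(1)}")
print(f"Mean MAE: {mae.mean():.1f} +/- {mae.std():.1f}")
```

More folds give each model more training data but make each evaluation fold smaller, so the per-fold scores tend to be noisier. An error metric such as MAE is expressed in the units of the target, which some readers find easier to interpret than R².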
