# Cross-validation

Hello, welcome to this chapter where we will explore some other scikit-learn tools that will help us create more and better models.

The first thing we'll look at is called cross-validation.

Cross-validation is a technique that allows us to evaluate the performance of a machine learning model.

We already know that training a model with a dataset and testing it on the same dataset is a mistake, which is why we usually divide the data into two sets: training and testing.

Cross-validation goes a step further: instead of dividing the data into two static training and test sets as we're normally accustomed to, it divides the data into multiple subsets called "folds".

Let's say you have this data. What cross-validation does is divide it into N groups of roughly similar size.

So the distribution of the data looks like this:

It then runs N iterations. In each one, a different group is held out for evaluation and the model is trained on the remaining N - 1 groups, so every group is used for evaluation exactly once. This will give us N measures of how good our algorithm is.
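
To make the rotation of groups concrete, here is a minimal sketch using scikit-learn's `KFold` splitter on ten toy samples. The toy array and the choice of 5 folds are assumptions made for illustration; they are not part of the chapter's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their index (illustrative data only)
data = np.arange(10)

# Split into 5 folds of equal size, without shuffling
kfold = KFold(n_splits=5)

# Each split is a pair of index arrays: (training indices, test indices)
splits = list(kfold.split(data))
for i, (train_idx, test_idx) in enumerate(splits):
    print(f"Iteration {i}: train={train_idx}, test={test_idx}")
```

Every sample lands in the test group exactly once and in the training group in all other iterations.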

This evaluation method gives a more reliable estimate of how the model will perform on unseen data, which becomes more relevant when we don't have a large dataset.

The go-to method for applying and calculating cross-validation in scikit-learn is the `cross_val_score` function.

## The `cross_val_score` function

One of the easiest ways to use cross-validation is through the `cross_val_score` function. It trains and tests a model on each of the folds generated by cross-validation.

Inside the function, each iteration creates and trains a fresh copy of the model, evaluates it, and stores the resulting score in an array. The function returns this array as its result.

To see it with an example, let's load a dataset and a machine learning model. In a future chapter we'll look at models in more detail. For now, simply execute the code:

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


In [None]:
iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target

X[:10]


Now, to use `cross_val_score`, you need to import the function from `sklearn.model_selection`:

In [None]:
from sklearn.model_selection import cross_val_score


And we proceed to use it by passing the untrained model, the input data, and the labels as arguments:

In [None]:
model = RandomForestClassifier()

scores = cross_val_score(model, X, y, cv=5)


Depending on the size of your data, the function may take some time.

Remember what is happening internally: your data is being divided into 5 segments, a value that we are setting in this case with the `cv` argument. Each of these segments will be used to test the performance of a model trained on the remaining segments.
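
What happens internally can be sketched as an explicit loop. This is only an approximation of the behavior, not the library's actual implementation. It assumes a fixed `random_state=0` (not used in the chapter's own code) so that the manual loop and `cross_val_score` produce comparable numbers, and it relies on the fact that, for a classifier and an integer `cv`, scikit-learn uses stratified folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Fixed seed (an assumption for this sketch) so both runs are reproducible
model = RandomForestClassifier(random_state=0)

# Stratified folds keep the class proportions similar in every segment
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other 4 segments
    # Evaluate on the held-out segment using the estimator's own score method
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5)

print(manual_scores)
print(library_scores)
```

With the seed fixed, the two arrays should contain the same five values.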

If we display the function's result, we see 5 values, one per segment.

In [None]:
scores


Each of these values represents the model's score in each of these cross-validation segments. In this case, the model obtained a high score in most segments, suggesting that the model we passed to the function has acceptable performance.
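
A common way to condense these values is to report their mean and standard deviation. The sketch below assumes a fixed `random_state=0` for reproducibility, which the chapter's own code does not set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# One score per segment, as in the example above
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

# The mean summarizes overall performance; the standard deviation
# shows how much the score varies from one segment to another
mean_score = scores.mean()
std_score = scores.std()
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

A large standard deviation relative to the mean is a hint that the model's performance depends heavily on which samples it was trained on.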

### How do I train a model with `cross_val_score`?

While the function trains models, it does so only to evaluate performance: it fits internal copies and leaves the model you passed in unfitted. If all the scores returned by the function are acceptable, you can proceed to train a final model using all your training data. This is the model you should then evaluate on your test set and subsequently move to production.

In [None]:
model.fit(X, y)
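
The full workflow, including the held-out test set mentioned above, might look like the following sketch. The 80/20 split, the stratification, and the fixed seeds are illustrative assumptions rather than choices made in this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set first; split size and seeds are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0)

# Cross-validate on the training data only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation scores:", cv_scores)

# cross_val_score fitted internal copies, so `model` is still unfitted here;
# now train the final model on all the training data
model.fit(X_train, y_train)

# Final check on data that played no part in cross-validation
test_score = model.score(X_test, y_test)
print("test score:", test_score)
```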


## Other arguments

The base arguments of the function are the estimator (the model to test), the input variables, and the labels. Optionally, we can specify how many segments we want to use by passing an integer to the `cv` argument.

You might be wondering which metric it uses to measure performance. By default, `cross_val_score` calls the estimator's own `score` method, so the metric depends on the estimator. For a classifier such as logistic regression, the default metric is accuracy (`accuracy`), while for a regression model it is the coefficient of determination, R-squared (`R^2`). If you want a different evaluation measure, you can specify it through the `scoring` argument.
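
The defaults can be checked with a small experiment. In this sketch, `LogisticRegression`, `LinearRegression`, the diabetes dataset, and `max_iter=1000` are choices made for illustration; none of them appear in the chapter's own code.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Classifier: the default score should match scoring="accuracy"
X_clf, y_clf = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
default_clf = cross_val_score(clf, X_clf, y_clf, cv=5)
accuracy_clf = cross_val_score(clf, X_clf, y_clf, cv=5, scoring="accuracy")

# Regressor: the default score should match scoring="r2"
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression()
default_reg = cross_val_score(reg, X_reg, y_reg, cv=5)
r2_reg = cross_val_score(reg, X_reg, y_reg, cv=5, scoring="r2")

print(np.allclose(default_clf, accuracy_clf))
print(np.allclose(default_reg, r2_reg))
```

Both models are deterministic, so each pair of calls sees the same folds and the same fitted models, and the only thing that differs is how the metric is requested.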

For example, let's say you want to use 6 segments and, instead of `accuracy`, you're more interested in the precision of the model. You would call `cross_val_score` like this:

In [None]:
model = RandomForestClassifier()

scores = cross_val_score(model, X, y, cv=6, scoring='precision_macro', verbose=3)


Notice that I'm also setting `verbose` to 3. This argument controls how much information is printed during cross-validation: a higher value prints more, a lower value prints less. It's useful when you want to know what's happening while the function runs.

## The `cross_validate` function

There's another function, more general than `cross_val_score`, that accepts the same arguments. The main difference between them is that `cross_validate` allows you to specify multiple metrics for evaluation and returns more information, such as how long each fit and each scoring step took:

In [None]:
from sklearn.model_selection import cross_validate


In [None]:
model = RandomForestClassifier()

scores = cross_validate(
    model, X, y, cv=6,
    scoring=[
        'precision_macro',
        'precision_micro',
        'accuracy',
    ],
    verbose=3,
)


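And the result is a dictionary with more information: `fit_time`, `score_time`, and one `test_<metric>` entry for each requested metric, each holding one value per segment: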

In [None]:
for key, values in scores.items():
    print(key)
    print(values)
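
To compare metrics at a glance, the per-segment values can be averaged. This sketch assumes a fixed `random_state=0` and uses the hypothetical names `results` and `summary`, which do not appear in the code above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

results = cross_validate(
    RandomForestClassifier(random_state=0), X, y, cv=6,
    scoring=["precision_macro", "precision_micro", "accuracy"],
)

# Keys starting with "test_" hold one score per segment for each metric;
# "fit_time" and "score_time" hold timings and are skipped here
summary = {
    key: values.mean()
    for key, values in results.items()
    if key.startswith("test_")
}

for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```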


And there you have it: this section covered cross-validation and how it can be used in scikit-learn. Remember that this is an important technique in machine learning, and it's recommended to use it whenever possible. Keep in mind, however, that the model is trained once per segment, so the computational cost grows with the number of segments and the amount of data.

Cross-validation is especially useful when you have a small dataset and want to estimate the performance of a model reliably, because every sample gets used for both training and evaluation.

Join me in the next chapter, where we'll discuss the hyperparameters of our model.