<a href="https://colab.research.google.com/github/natalia7244/Machine-Learning-Exercises/blob/main/Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Machine learning is an iterative process. At each step, we must choose things like:

- What data (columns) to use
- What type of model to try
- What settings (parameters) to give the model

To see which model is better, we test each one. We don't test on the same data the model learned from. Instead, we hold back part of the data for testing. This held-back part is called a validation set.

For example, if we have 5000 rows of data, we might use 1000 rows to test and 4000 to train. This helps us see how well the model works on new data.
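A minimal sketch of that 4000/1000 split, assuming scikit-learn's `train_test_split` and a synthetic dataset. The names `X_demo` and `y_demo` are made up for illustration and are not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 5000 rows with 3 feature columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=5000)

# Hold back 1000 rows for validation; the other 4000 rows are used for training
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=1000, random_state=0
)

print(X_train.shape, X_valid.shape)
```

Passing an integer as `test_size` sets the number of validation rows directly. A fraction such as `test_size=0.2` would give the same split sizes here.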

But there's a problem: the results can change depending on which 1000 rows we choose for testing. One model might look good just by luck — not because it’s really better.
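The effect can be seen in a small experiment. The sketch below is only an illustration under assumptions: it uses synthetic data (`X_demo`, `y_demo`) and a `LinearRegression` model as a stand-in, and it scores the same model on five different random validation sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 noisy rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

split_maes = []
for seed in range(5):
    # Each seed picks a different set of 40 validation rows
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=40, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    split_maes.append(mean_absolute_error(y_va, model.predict(X_va)))

# Same data and same model type, but the score changes with the split
print(split_maes)
```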

If we test on only 1 row, the score depends almost entirely on chance.

If we use a big test set, we get more stable (less random) results. But then we have less data to train on, and that can make the model worse.

So we must find a balance: enough test data to trust the result, but enough training data to build a good model.

# What is cross-validation?

Cross-validation tests a model several times, each time holding back a different part of the data.

We start by splitting our data into equal parts, for example 5. These parts are called folds.

Then we do 5 tests (experiments):

- In the first test, we use the first part to check the model (validation), and the other 4 parts to train the model.
- In the second test, we use the second part to check the model, and the rest to train.
- We repeat this, each time changing which part we use for checking.

In the end:

- Every row in the data is used exactly once to test the model.
- We get 5 different scores (one from each fold).
- We can take the average of those scores to understand how good the model is.

This way, the result is fairer and more reliable, because every row is used for validation, just not all at the same time.
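The 5-fold procedure described above can be written out by hand. This is a sketch under assumptions, not the method used later in this notebook: it uses scikit-learn's `KFold`, synthetic data (`X_demo`, `y_demo`), and a `LinearRegression` stand-in model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Hypothetical stand-in data: 100 rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

# Without shuffling, fold 1 is the first 20 rows, fold 2 the next 20, and so on
kfold = KFold(n_splits=5)

fold_maes = []
times_validated = np.zeros(len(X_demo), dtype=int)

for train_idx, valid_idx in kfold.split(X_demo):
    # Train on 4 folds, validate on the remaining fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[valid_idx])
    fold_maes.append(mean_absolute_error(y_demo[valid_idx], preds))
    times_validated[valid_idx] += 1

print("MAE per fold:", fold_maes)
print("Average MAE:", np.mean(fold_maes))
print("Every row validated exactly once:", bool((times_validated == 1).all()))
```

The `times_validated` counter is only there to confirm that each row lands in a validation fold exactly once.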

# When should you use cross-validation?

Cross-validation gives a more reliable measure of how good your model is. This matters most when you need to make many decisions about the model.

But cross-validation is slower, because it trains the model once for each fold.

So, when should you use cross-validation, and when just one validation set?

- If you have a small dataset, and your code is not very slow, then use cross-validation. It gives a more reliable estimate of model quality.
- If you have a big dataset, using just one validation set is usually enough. It runs faster, and the estimate is still good.

There is no exact rule for what is small or big data. But:

- If your model runs in a few minutes or less, then it's probably OK to use cross-validation.
- You can also try cross-validation once. If the scores from all the folds are similar, then using one validation set later is probably fine.
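One way to run both checks is to time a single cross-validation run and look at how much the fold scores vary. The sketch below assumes synthetic data (`X_demo`, `y_demo`) and a `LinearRegression` stand-in model. The `relative_spread` measure is just one possible choice, and how small it must be to count as "similar" is a judgment call.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: 500 rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

start = time.perf_counter()
fold_maes = -1 * cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
elapsed_seconds = time.perf_counter() - start

# Standard deviation of the fold scores, relative to their mean
relative_spread = fold_maes.std() / fold_maes.mean()

print("Seconds for 5 folds:", elapsed_seconds)
print("MAE per fold:", fold_maes)
print("Relative spread:", relative_spread)
```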

# Example

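import pandas as pd

# Load the data (this path assumes Google Drive is mounted in Colab)
data = pd.read_csv('/content/drive/MyDrive/Data_sets/melb_data.csv')

# Select the predictor columns
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select the target (the Price column)
y = data.Price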


# Using a pipeline will make the code remarkably straightforward

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

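# SimpleImputer fills in missing values (with the column mean by default),
# then the random forest makes the predictions
my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])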

# Cross_val_score

from sklearn.model_selection import cross_val_score

# scikit-learn returns negative MAE here, so multiply by -1 to get positive MAE values
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

print("MAE scores: \n", scores)

print("Average MAE score:")
print(scores.mean())

MAE scores: 
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]
Average MAE score:
277707.3795913405
