## Intermediate Machine Learning - Kaggle

https://www.kaggle.com/learn/intermediate-machine-learning

### Cross-Validation

https://www.kaggle.com/code/alexisbcook/cross-validation

**Cross-validation** is running the modeling process on different subsets of the data to get multiple measures of model quality. 

Start by dividing the data into 5 pieces, or **folds**, each containing 20% of the full dataset. Then, run one experiment for each fold:
- **Experiment 1**: use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
- **Experiment 2**: hold out data from the second fold and use everything else to train the model.
- The process is repeated, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset, even though not all rows are held out at the same time.
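The fold procedure above can be sketched with scikit-learn's `KFold` splitter. This is a minimal illustration on a made-up 10-row array (`X_toy` is an assumption, not the housing data used later); it only shows which rows land in the holdout set in each experiment.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): 10 rows, so each of the 5 folds holds out 2 rows (20%).
X_toy = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5)  # no shuffling: folds are contiguous blocks of rows
holdout_counts = np.zeros(len(X_toy), dtype=int)

for experiment, (train_idx, valid_idx) in enumerate(kfold.split(X_toy), start=1):
    # Count how often each row appears in a holdout set.
    holdout_counts[valid_idx] += 1
    print(f"Experiment {experiment}: train rows {train_idx}, holdout rows {valid_idx}")

# Every row is held out exactly once across the 5 experiments.
print(holdout_counts)
```

Each row is used for validation in exactly one experiment and for training in the other four.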

### When should cross-validation be used? 

Cross-validation gives a more accurate measure of model quality, but it can take longer to run because it trains multiple models (one for each fold).

- *For small datasets*, where extra computational burden isn't a big deal, you should run cross-validation.
- *For larger datasets*, a single validation set is sufficient. The code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

There's no simple threshold for what counts as a large vs. small dataset. If your model takes a couple of minutes or less to run, it is probably worth switching to cross-validation.

Alternatively, run cross-validation and check whether the scores for each experiment are close. If every experiment yields roughly the same result, a single validation set is probably sufficient.
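One way to judge how close the experiment scores are is to compare their spread to their mean. The sketch below uses the five MAE values printed in the example output later in this section; the choice of standard deviation divided by mean as the measure is an assumption for illustration, not something the course prescribes.

```python
import numpy as np

# Fold scores copied from the example output later in this section.
fold_scores = np.array([301628.7893587, 303164.4782723, 287298.331666,
                        236061.84754543, 260383.45111427])

# Relative spread: standard deviation of the fold scores divided by their mean.
relative_spread = fold_scores.std() / fold_scores.mean()
print(f"Relative spread across folds: {relative_spread:.1%}")
```

How small the spread must be before a single validation set is "enough" is a judgment call that depends on how finely you need to distinguish between models.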

In [1]:
import pandas as pd

# Read the data
data = pd.read_csv('./melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

Next, define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions. 

Cross-validation can be done without pipelines, but doing so is considerably more difficult. Using a pipeline keeps the code straightforward.
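To make the difficulty concrete, here is a rough sketch of cross-validation written by hand, without a pipeline. It uses synthetic data (`X_demo` and `y_demo` are made up for illustration) and a smaller forest than the one below so it runs quickly. The point to notice is that the imputer has to be re-fitted on the training rows inside every fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic data with some missing values (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    # Fit the imputer on the training rows only, then apply it to both parts.
    imputer = SimpleImputer()
    X_train = imputer.fit_transform(X_demo[train_idx])
    X_valid = imputer.transform(X_demo[valid_idx])

    # Train a fresh model for this fold and score it on the holdout rows.
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_demo[train_idx])
    preds = model.predict(X_valid)
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], preds))

print("Manual MAE scores:", np.round(manual_scores, 3))
```

With a pipeline, the per-fold fitting of the imputer and the model is handled for you, which removes most of this bookkeeping.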

In [2]:
from sklearn.ensemble import RandomForestRegressor 
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))
                              ])

Cross-validation scores are obtained with the `cross_val_score()` function from scikit-learn. Set the number of folds with the `cv` parameter. 

In [6]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5, 
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


The `scoring` parameter chooses the measure of model quality to report. In this case, it is the negative mean absolute error (`neg_mean_absolute_error`).
A list of options can be found in the [docs for scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html).

Scikit-learn has a convention where all metrics are defined so that a higher number is better. Because a lower MAE is better, the scorer reports the *negative* MAE, and multiplying by -1 recovers the usual MAE.
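The sign convention is easy to check on a small example. The sketch below uses synthetic data and `LinearRegression` as a quick stand-in model (both are assumptions for illustration, not part of the course example). The raw scores come back as zero or negative, and negating them gives the MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=60)

# Raw scores follow the "higher is better" convention, so they are negative MAE.
raw_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=5, scoring='neg_mean_absolute_error')

# Flip the sign to get the usual (non-negative) MAE for each fold.
mae_scores = -raw_scores

print("Raw scores (negative MAE):", raw_scores)
print("MAE scores:", mae_scores)
```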

Typically, we want a single measure of model quality to compare alternative models, so we take the average across experiments.

In [8]:
print("Average MAE score (across experiments): ")
print(scores.mean())

Average MAE score (across experiments): 
277707.3795913405


### Conclusion

Using cross-validation yields a better measure of model quality, and we no longer have to keep track of separate training and validation sets. For small datasets in particular, it's a good improvement.
