### How to Detect and Avoid Overfitting

The easiest way to avoid overfitting is to increase your sample size by collecting more data. If you can’t do that, the second option is to reduce the number of predictors in your model — either by combining or eliminating them. Factor Analysis is one method you can use to identify related predictors that might be candidates for combining.

#### Cross-Validation

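Use cross-validation to detect overfitting: it partitions your data into a training part and a held-out part, fits the model on the training part, and checks how well it generalizes to the held-out part, so you can choose the model that works best on data it has not seen. One form of cross-validation is predicted R-squared. Most good statistical software will include this statistic, which is calculated by: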

1. Removing one observation at a time from your data,
2. Estimating the regression equation for each iteration,
3. Using the regression equation to predict the removed observation.

Cross-validation isn’t a magic cure for small data sets though, and sometimes a clear model isn’t identified even with an adequate sample size.
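
The leave-one-out procedure behind predicted R-squared can be sketched as follows. This is a minimal illustration, not the routine any particular statistics package uses: the helper name `predicted_r_squared` and the synthetic data are assumptions made for the example, and the statistic is computed as 1 - PRESS / SST, where PRESS is the sum of squared leave-one-out prediction errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut


def predicted_r_squared(X, y):
    """Hypothetical helper: 1 - PRESS / SST from leave-one-out fits."""
    press = 0.0
    for train_index, test_index in LeaveOneOut().split(X):
        # Fit the regression without the held-out observation...
        model = LinearRegression().fit(X[train_index], y[train_index])
        # ...then predict that observation and accumulate the squared error.
        residual = y[test_index] - model.predict(X[test_index])
        press += float(residual[0] ** 2)
    sst = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / sst


# Synthetic data (assumed for illustration): 20 observations, one predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=20)

r2 = LinearRegression().fit(X, y).score(X, y)
pred_r2 = predicted_r_squared(X, y)
print(f"R-squared: {r2:.3f}, predicted R-squared: {pred_r2:.3f}")
```

A predicted R-squared that falls well below the ordinary R-squared suggests the model fits the sample better than it predicts new observations.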

Here are the steps involved in cross-validation:

1. Reserve a sample of the data set.
2. Train the model using the remaining part of the data set.
3. Test the model on the reserved (validation) sample. This helps you gauge how well the model performs on data it has not seen. If the model performs well on the validation data, go ahead with the current model. It rocks!

#### A few common methods used for Cross Validation

**The validation set approach**

In this approach, we reserve 50% of the dataset for validation and use the remaining 50% for model training. A major disadvantage is that, because the model is trained on only half of the data, it may miss useful patterns in the other half, which leads to higher bias.

Python Code:

`from sklearn.model_selection import train_test_split
train, validation = train_test_split(data, test_size=0.50, random_state=5)`
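
To show how a 50/50 split can reveal overfitting, here is a minimal sketch under stated assumptions: the data is synthetic, and an unpruned `DecisionTreeRegressor` is used only because it tends to memorize its training set. Neither choice comes from the text above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a noisy sine curve.
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Reserve 50% of the rows for validation, as in the split above.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.50, random_state=5
)

# An unpruned tree can fit the training rows almost perfectly.
model = DecisionTreeRegressor(random_state=5).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)
print(f"Train R-squared: {train_r2:.3f}, validation R-squared: {val_r2:.3f}")
```

A large gap between the training score and the validation score is the usual sign that the model has overfit the training half.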

**k-fold cross validation**

1. Randomly split your entire dataset into k "folds".
2. For each fold, build your model on the other k - 1 folds of the dataset. Then test the model on the held-out fold to check its effectiveness.
3. Record the error you see on each of the predictions.
4. Repeat this until each of the k folds has served as the test set.
5. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model.
    
![image.png](attachment:image.png)

Python Code:

`from sklearn.model_selection import RepeatedKFold
kf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)  # 5 folds, repeated 10 times
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]`
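
The loop above only produces the index splits. As a minimal sketch of the last step (averaging the k recorded errors), the example below uses scikit-learn's `cross_val_score` with plain `KFold`. The synthetic data, the linear model, and mean squared error as the error measure are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)

# shuffle=True gives the random split into k folds described in step 1.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MSE so that larger is always better;
# flip the sign to get one error value per fold.
fold_mse = -cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)

# The cross-validation error is the average of the k fold errors.
cv_error = fold_mse.mean()
print("Per-fold MSE:", np.round(fold_mse, 3))
print(f"Cross-validation error: {cv_error:.3f}")
```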


**Stratified k-fold cross validation**

Stratification is the process of rearranging the data so that each fold is a good representative of the whole. For example, in a binary classification problem where each class makes up 50% of the data, it is best to arrange the data so that, in every fold, each class makes up about half of the instances.

![image.png](attachment:image.png)

Python Code:

`from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)
for train_index, val_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]`

Note: X is the feature set and y is the target. The indexing in the code above assumes both are NumPy arrays.
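
To check that stratification keeps class proportions the same in every fold, here is a small sketch. The imbalanced labels (80% class 0, 20% class 1) and the placeholder feature column are assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy labels: 80 samples of class 0 and 20 samples of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positive_rates = []
for train_index, val_index in skf.split(X, y):
    # Share of class 1 in this validation fold.
    fold_positive_rates.append(float(y[val_index].mean()))

print("Overall share of class 1:", y.mean())
print("Share of class 1 per fold:", fold_positive_rates)
```

Each validation fold should carry the same share of class 1 as the full dataset, which plain k-fold does not guarantee.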

### Exercise:

What is the purpose of performing cross-validation?

- A. To assess the predictive performance of the models
- B. To judge how the trained model performs outside the sample on test data
- C. Both A and B