### __Cross-Validation__
Cross-validation is a technique for estimating how well a machine learning model performs on unseen data. It shows how well the model generalizes and helps detect overfitting (it does not prevent overfitting by itself). This section explains why cross-validation is needed and walks through the common variants with code examples.

**Why is Cross-Validation Needed?** 

When we train a machine learning model, we typically split our data into a training set and a testing set. We train the model on the training set and evaluate its performance on the testing set. However, the score from a single train-test split depends on which data points happen to land in each set, so it can be an unreliable estimate of the model's performance.

Cross-validation addresses this issue by performing multiple train-test splits and averaging the results. This provides a more robust and reliable estimate of the model's performance on unseen data.
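To make the sensitivity of a single split concrete, the sketch below trains the same model on ten different random train-test splits of the Iris dataset and records the test accuracy of each. The 80/20 split ratio and the seeds 0 to 9 are arbitrary choices made for illustration, not values from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

split_scores = []
for seed in range(10):  # arbitrary seeds, chosen only for illustration
    # Same data and same model; only the random split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))  # accuracy on the held-out set

print("Accuracy per split:", [round(s, 3) for s in split_scores])
print("Lowest:", min(split_scores), "Highest:", max(split_scores))
```

Any gap between the lowest and highest accuracy comes purely from how the data was split, which is the uncertainty that averaging over several splits is meant to reduce.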

__Types of Cross-Validation:__

1. __k-Fold Cross-Validation:__

- The data is divided into k folds of (approximately) equal size.
- The model is trained k times, each time using k-1 folds for training and the remaining fold for testing.
- The performance metric (e.g., accuracy, mean squared error) is calculated on each test fold, and the average of these k values is taken as the final estimate.
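The fold mechanics described above can be sketched without any library. The helper `k_fold_indices` below is a hypothetical function written for this illustration (it is not part of scikit-learn); it assumes no shuffling and gives the first few folds one extra sample when the data does not divide evenly.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first n_samples % k folds get one extra sample so every sample is used.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # the held-out fold
        train = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train, test) in enumerate(folds):
    print(f"Fold {i}: train={train} test={test}")
```

Every sample appears in exactly one test fold, so each data point is used for evaluation once and for training k-1 times.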

In [4]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target


In [18]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       ...  (output truncated)

In [20]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [33]:
kf = KFold(n_splits=5, shuffle=True, random_state=42) # 5-fold cross validation
model = LogisticRegression(max_iter=1000) # Increased max_iter to avoid convergence warnings
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print("Cross-validation scores:", scores)
print("Mean cross-validation score:", scores.mean())

Cross-validation scores: [1.         1.         0.93333333 0.96666667 0.96666667]
Mean cross-validation score: 0.9733333333333334


2. __Stratified k-Fold Cross-Validation:__

* Similar to k-Fold, but it ensures that each fold has approximately the same proportion of target classes as the original dataset. This is particularly important for imbalanced datasets, where a plain random split could leave a fold with few or no samples of a rare class.
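The effect of stratification can be checked by counting the classes in each test fold. The sketch below compares `KFold` and `StratifiedKFold` on Iris with 3 folds and no shuffling; this setup is chosen for illustration because the Iris labels are stored sorted by class, which makes the difference easy to see. The helper `class_counts_per_test_fold` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts_per_test_fold(splitter):
    """Return the number of samples of each class in every test fold."""
    return [np.bincount(y[test_idx], minlength=3) for _, test_idx in splitter.split(X, y)]

# No shuffling in either splitter, so only stratification differs.
plain_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

for i, (plain, strat) in enumerate(zip(plain_counts, stratified_counts)):
    print(f"Fold {i}: KFold classes={plain.tolist()}  StratifiedKFold classes={strat.tolist()}")
```

Because the labels are sorted, each unshuffled `KFold` test fold here contains a single class, while every stratified test fold keeps the three classes in nearly equal proportion.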

In [37]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

print("Stratified cross-validation scores:", scores)
print("Mean stratified cross-validation score:", scores.mean())

Stratified cross-validation scores: [1.         0.96666667 0.93333333 1.         0.93333333]
Mean stratified cross-validation score: 0.9666666666666668


3. **Leave-One-Out Cross-Validation (LOOCV):**

* A special case of k-Fold where k is equal to the number of data points.
* Each data point is used exactly once as a single-sample test set, with all remaining data points used for training.
* Computationally expensive for large datasets, because the model is fit once per data point.
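The "special case of k-Fold" relationship can be verified directly. The sketch below uses a made-up array of 6 samples (`X_small`, an illustrative stand-in rather than real data) and compares the splits from `LeaveOneOut` with those from an unshuffled `KFold` whose number of folds equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_small = np.arange(12).reshape(6, 2)  # 6 toy samples with 2 features each

loo_splits = [
    (train.tolist(), test.tolist()) for train, test in LeaveOneOut().split(X_small)
]
kfold_splits = [
    (train.tolist(), test.tolist())
    for train, test in KFold(n_splits=len(X_small)).split(X_small)
]

for train, test in loo_splits:
    print(f"train={train} test={test}")
print("Same splits as KFold with k = n:", loo_splits == kfold_splits)
```

Each split holds out one sample, so the number of model fits grows linearly with the dataset size.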


In [45]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

loo = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')

print("LOOCV scores:", scores)
print("Mean LOOCV score:", scores.mean())

LOOCV scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Mean LOOCV score: 0.9666666666666667


### **Choosing the Right Cross-Validation Technique:**
* __k-Fold:__ General purpose, good for most cases.
* __Stratified k-Fold:__ Use for classification, especially with imbalanced datasets.
* __LOOCV:__ Use for small datasets when computational resources are not a major concern. Provides a nearly unbiased estimate but has high variance.
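One way to weigh these options on a given dataset is to run them side by side. The sketch below reuses the Iris data and logistic regression model from the earlier cells and reports, for each strategy, the number of model fits and the mean accuracy. The `strategies` and `summary` dictionaries are illustrative names, and the numbers it prints describe this small dataset only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

strategies = {
    "k-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified k-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "LOOCV": LeaveOneOut(),
}

summary = {}
for name, cv in strategies.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # One score per split, so len(scores) is the number of times the model was fit.
    summary[name] = {"fits": len(scores), "mean_accuracy": scores.mean()}
    print(f"{name}: fits={len(scores)}, mean accuracy={scores.mean():.4f}")
```

The fit count makes the cost trade-off visible: LOOCV fits the model once per sample, while the 5-fold strategies fit it five times regardless of dataset size.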
