## K-Fold Cross Validation

Measuring accuracy on a single train/test split doesn't account for the variance in the data and might give misleading results. K-Fold cross-validation splits the data set into k parts (folds). In each iteration, one fold is held out for testing and the model is trained on the remaining k - 1 folds, so every fold serves as the test set exactly once. After all k iterations, the accuracy is averaged.
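
To make the fold rotation concrete, here is a minimal sketch on ten toy samples. The names `X_demo`, `kfold_demo` and `folds` are illustrative only and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, used only to show which indices land in each fold
X_demo = np.arange(10).reshape(-1, 1)

# Five folds of two samples each; shuffle=False (the default) keeps the original order
kfold_demo = KFold(n_splits=5)

folds = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("Train:", train_idx.tolist(), "Test:", test_idx.tolist())
```

Each sample index appears in exactly one test fold, and the training indices are simply everything else.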

In [6]:
# Import necessary libraries
# pandas is used for data manipulation and analysis
# matplotlib.pyplot is used for plotting graphs
# numpy is used for numerical operations
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset into a pandas DataFrame
# read_csv is a pandas function that reads a CSV file and loads it into a DataFrame
# 'Social_Network_Ads.csv' is the path to the dataset
df = pd.read_csv('Social_Network_Ads.csv')

# Extract features and target variable from the DataFrame
# iloc is a pandas method for integer-location based indexing
# df.iloc[:, 2:4] extracts the 3rd and 4th columns (index 2 and 3) as features
# This slicing operation selects all rows and columns at indices 2 and 3, resulting in a two-column DataFrame
X = df.iloc[:, 2:4]

# df.iloc[:, 4] extracts the 5th column (index 4) as the target variable
# This slicing operation selects all rows and the column at index 4, resulting in a pandas Series
y = df.iloc[:, 4]

# Display the first few rows of the DataFrame
# head is a pandas method that returns the first n rows of the DataFrame (default is 5)
# This allows us to preview the data and ensure it has been loaded correctly
df.head()



| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [8]:
# Import the StandardScaler class from sklearn.preprocessing
# StandardScaler is used for scaling features to have mean = 0 and variance = 1
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler
# X_sca is an object of the StandardScaler class
X_sca = StandardScaler()

# Fit the StandardScaler to the data and transform it
# fit_transform computes each feature's mean and standard deviation, then scales the data accordingly
# Note: the scaler is fitted on the full dataset here, not only on a training split,
# so the samples later used for testing also influence the scaling statistics
X = X_sca.fit_transform(X)



In [9]:
# Import division from the __future__ module to ensure true division in Python 2
# This line is not necessary if you are using Python 3
from __future__ import division

# Import necessary classes and functions from sklearn
# KFold is used for k-fold cross-validation
# accuracy_score is used to calculate the accuracy of the model
# SVC is the Support Vector Classifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Initialize k-fold cross-validation with 10 splits
# n_splits=10 specifies the number of folds
kfold_cv = KFold(n_splits=10)

# Initialize the running sum of per-fold accuracies (correct) and the number of folds (total)
correct = 0
total = 0

# Perform k-fold cross-validation
# kfold_cv.split(X) splits the data into train and test indices for each fold
for train_indices, test_indices in kfold_cv.split(X):
    # Split the data into training and test sets
    # X_train and y_train are the training features and target variable
    # X_test and y_test are the test features and target variable
    X_train, X_test, y_train, y_test = X[train_indices], X[test_indices], y[train_indices], y[test_indices]
    
    # Initialize and fit the Support Vector Classifier
    # kernel='linear' specifies a linear kernel for the SVC
    # random_state=0 sets the random seed for reproducibility
    clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    
    # Calculate the accuracy for the current fold and add to the correct counter
    # accuracy_score(y_test, clf.predict(X_test)) calculates the accuracy for the test set
    correct += accuracy_score(y_test, clf.predict(X_test))
    
    # Increment the total counter
    total += 1

# Calculate and print the average accuracy across all folds
# correct/total gives the average accuracy
# format function is used to format the accuracy to 2 decimal places
print("Accuracy: {0:.2f}".format(correct / total))



Accuracy: 0.82


In [10]:
# Import the Support Vector Classifier from sklearn.svm
# SVC is a powerful classifier that works well for both linear and non-linear classification problems
from sklearn.svm import SVC

# Initialize and fit the Support Vector Classifier
# kernel='linear' specifies the use of a linear kernel, which is effective for linearly separable data
# random_state=0 sets the random seed to ensure reproducibility of results
# Note: X_train and y_train still hold the training portion of the last fold from the loop above
clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)



In [12]:
# Import the cross_val_score function from sklearn.model_selection
# cross_val_score is used to evaluate a model using k-fold cross-validation
from sklearn.model_selection import cross_val_score

# Apply k-fold cross-validation to the Support Vector Classifier (clf)
# cross_val_score clones clf, then refits and scores the clone on each of the 10 folds
# X_train and y_train are the training portion left over from the last fold of the manual loop, not the full dataset
# cv=10 specifies the number of folds; for a classifier, an integer cv uses stratified folds
accuracies = cross_val_score(clf, X_train, y_train, cv=10)

# Print the individual accuracy scores for each fold
# This shows the accuracy score for each of the 10 folds
print(accuracies)

# Print the mean accuracy across all folds
# The mean method calculates the average accuracy score across the 10 folds
print(accuracies.mean())

# Print the standard deviation of the accuracy scores
# The std method calculates the standard deviation of the accuracy scores across the 10 folds
print(accuracies.std())


[0.72222222 0.69444444 0.94444444 0.94444444 0.97222222 0.94444444
 0.83333333 0.75       0.80555556 0.91666667]
0.8527777777777776
0.0994196120454071
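
One caveat with the cells above is that the scaler was fitted on the whole dataset before cross-validation. A minimal sketch of one way to avoid this is to wrap the scaler and the classifier in a pipeline, so that the scaling statistics are recomputed from the training folds of each split. It assumes synthetic data from `make_classification` as a stand-in for `Social_Network_Ads.csv`, and the names `X_syn`, `y_syn` and `model` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the two features and the binary target (assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=0)

# Scaler and classifier are refit together on the training folds of every split
model = make_pipeline(StandardScaler(), SVC(kernel='linear', random_state=0))

scores = cross_val_score(model, X_syn, y_syn, cv=10)
print("Mean accuracy: {0:.2f} (std {1:.2f})".format(scores.mean(), scores.std()))
```

With the real data, `X_syn` and `y_syn` would be replaced by the unscaled features and the target.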


## Leave-One-Out Cross Validation

Another type of cross-validation is **Leave-One-Out Cross Validation (LOOCV)**. In this method, out of the \( n \) samples, one sample is left out as the validation set, and the model is trained on the remaining \( n - 1 \) samples. This process is repeated \( n \) times, with each sample being used exactly once as the validation data.
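
As a small illustration of the procedure, the sketch below lists the splits that `LeaveOneOut` produces for five toy samples. The names `X_demo`, `loo_demo` and `loo_splits` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Five toy samples, used only to show the index pattern
X_demo = np.arange(5).reshape(-1, 1)

loo_demo = LeaveOneOut()

# The number of splits equals the number of samples
print("Number of splits:", loo_demo.get_n_splits(X_demo))

loo_splits = [(train_idx.tolist(), test_idx.tolist())
              for train_idx, test_idx in loo_demo.split(X_demo)]
for train_idx, test_idx in loo_splits:
    print("Train:", train_idx, "Test:", test_idx)
```

Because the model is refit once per sample, the cost grows with the size of the dataset.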



In [14]:
# Load the dataset into a pandas DataFrame
# read_csv is a pandas function that reads a CSV file and loads it into a DataFrame
# 'Social_Network_Ads.csv' is the path to the dataset
df = pd.read_csv('Social_Network_Ads.csv')

# Extract features and target variable from the DataFrame
# iloc is a pandas method for integer-location based indexing
# df.iloc[:, 2:4] extracts the 3rd and 4th columns (index 2 and 3) as features
# This slicing operation selects all rows and columns at indices 2 and 3, resulting in a two-column DataFrame
X = df.iloc[:, 2:4]

# df.iloc[:, 4] extracts the 5th column (index 4) as the target variable
# This slicing operation selects all rows and the column at index 4, resulting in a pandas Series
y = df.iloc[:, 4]

# Display the first few rows of the DataFrame
# head is a pandas method that returns the first n rows of the DataFrame (default is 5)
# This allows us to preview the data and ensure it has been loaded correctly
df.head()


| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [15]:
# Import the StandardScaler class from sklearn.preprocessing
# StandardScaler is used for scaling features to have mean = 0 and standard deviation = 1
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler
# X_sca is an object of the StandardScaler class
X_sca = StandardScaler()

# Fit the StandardScaler to the data and transform it
# fit_transform computes each feature's mean and standard deviation, then scales the data accordingly
# Note: the scaler is fitted on the full dataset here, not only on a training split,
# so the samples later used for testing also influence the scaling statistics
X = X_sca.fit_transform(X)


In [16]:
# Import division from the __future__ module to ensure true division in Python 2
# This line is not necessary if you are using Python 3
from __future__ import division

# Import necessary classes and functions from sklearn
# LeaveOneOut is used for leave-one-out cross-validation
# accuracy_score is used to calculate the accuracy of the model
# SVC is the Support Vector Classifier
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Initialize leave-one-out cross-validation
# LeaveOneOut splits the data such that each sample is used once as a test set
loo_cv = LeaveOneOut()

# Initialize the running sum of per-sample accuracies (correct) and the number of iterations (total)
correct = 0
total = 0

# Perform leave-one-out cross-validation
# loo_cv.split(X) splits the data into train and test indices for each sample
for train_indices, test_indices in loo_cv.split(X):
    # Uncomment these lines to print splits
    # print("Train Indices: {}...".format(train_indices[:4]))
    # print("Test Indices: {}...".format(test_indices[:4]))
    # print("Training SVC model using this configuration")
    
    # Split the data into training and test sets
    # X_train and y_train are the training features and target variable
    # X_test and y_test are the test features and target variable
    X_train, X_test, y_train, y_test = X[train_indices], X[test_indices], y[train_indices], y[test_indices]
    
    # Initialize and fit the Support Vector Classifier
    # kernel='linear' specifies a linear kernel for the SVC
    # random_state=0 sets the random seed for reproducibility
    clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    
    # Calculate the accuracy for the current iteration and add to the correct counter
    # accuracy_score(y_test, clf.predict(X_test)) calculates the accuracy for the test set
    correct += accuracy_score(y_test, clf.predict(X_test))
    
    # Increment the total counter
    total += 1

# Calculate and print the average accuracy across all iterations
# correct/total gives the average accuracy
# format function is used to format the accuracy to 2 decimal places
print("Accuracy: {0:.2f}".format(correct / total))



Accuracy: 0.84


## Stratified KFold

### Introduction
KFold validation does not preserve the class proportions of the output variable when splitting the data into k folds. This can cause problems, especially when the distribution of the output variable matters, as with imbalanced classes.

### Example Scenario
Imagine training a Naive Bayes classifier using KFold validation with 10 samples, where 5 are positive and 5 are negative. KFold splits the data without looking at the labels, so it could produce an unfortunate split in which the training set contains only positive samples and the test set contains only negative samples.

### Problem with KFold
In such a scenario, the Naive Bayes classifier estimates the prior probability of the positive class as 100% from its training set, so the model incorrectly concludes that the output is always positive. This is clearly an inaccurate representation of the data.

### Solution: Stratified KFold
To address this issue, we use **Stratified KFold**. This method ensures that the class proportions of the original dataset are preserved in every fold. Essentially, if the original dataset has 50% positive and 50% negative outputs, then each training and test set will also have approximately 50% positive and 50% negative outputs.

### Benefits of Stratified KFold
- **Preserves Distribution**: Maintains the balance of the output variable across the folds.
- **Reduces Bias**: Prevents the model from learning skewed distributions that could occur with random splits.

Stratified KFold is particularly useful in scenarios where the output variable has a skewed distribution, ensuring that each fold is representative of the overall dataset.
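
The scenario above can be reproduced with a small sketch. It assumes ten toy labels sorted by class (5 positive, then 5 negative) and counts the classes that end up in each training set. The helper `train_class_counts` and the names `X_toy` and `y_toy` are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Ten toy labels sorted by class: 5 positive followed by 5 negative
y_toy = np.array([1] * 5 + [0] * 5)
X_toy = np.zeros((10, 1))  # the feature values do not matter for the split

def train_class_counts(splitter):
    """Return (positives, negatives) in the training set of each split."""
    counts = []
    for train_idx, _ in splitter.split(X_toy, y_toy):
        positives = int(y_toy[train_idx].sum())
        counts.append((positives, len(train_idx) - positives))
    return counts

kfold_counts = train_class_counts(KFold(n_splits=2))
strat_counts = train_class_counts(StratifiedKFold(n_splits=2))

print("KFold training sets (pos, neg):          ", kfold_counts)
print("StratifiedKFold training sets (pos, neg):", strat_counts)
```

With plain `KFold`, each training set here holds a single class, while the stratified splitter keeps both classes in every training set.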


In [19]:
# Import necessary libraries for data manipulation, plotting, and numerical operations
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset into a pandas DataFrame
# 'Social_Network_Ads.csv' is the path to the dataset
df = pd.read_csv('Social_Network_Ads.csv')

# Extract features and target variable from the DataFrame
# Use iloc for integer-location based indexing
# df.iloc[:, 2:4] selects all rows and columns at indices 2 and 3, resulting in a 2D array
X = df.iloc[:, 2:4].values

# df.iloc[:, 4] selects all rows and the column at index 4, resulting in a 1D array
y = df.iloc[:, 4].values

# Display the first few rows of the DataFrame to preview the data
df.head()


| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [21]:
# Import the StandardScaler class from sklearn.preprocessing
# StandardScaler is used for scaling features to have mean = 0 and standard deviation = 1
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler
# X_sca is an object of the StandardScaler class
X_sca = StandardScaler()

# Fit the StandardScaler to the data and transform it
# fit_transform computes each feature's mean and standard deviation, then scales the data accordingly
# Note: the scaler is fitted on the full dataset here, not only on a training split
X = X_sca.fit_transform(X)
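
The cells in this section prepare the data but stop before the stratified evaluation itself. A minimal sketch of how that step might look, mirroring the earlier K-Fold loop, is shown below. It assumes a synthetic, mildly imbalanced dataset from `make_classification` in place of `Social_Network_Ads.csv`, and the names `X_syn`, `y_syn`, `skf`, `clf_s` and `fold_scores` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in with roughly 65% / 35% class balance (assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=2, n_informative=2,
                                   n_redundant=0, weights=[0.65, 0.35],
                                   random_state=0)

# Unlike KFold, the stratified splitter needs the labels to build its folds
skf = StratifiedKFold(n_splits=10)

fold_scores = []
for train_indices, test_indices in skf.split(X_syn, y_syn):
    clf_s = SVC(kernel='linear', random_state=0)
    clf_s.fit(X_syn[train_indices], y_syn[train_indices])
    predictions = clf_s.predict(X_syn[test_indices])
    fold_scores.append(accuracy_score(y_syn[test_indices], predictions))

print("Accuracy: {0:.2f}".format(np.mean(fold_scores)))
```

With the notebook's data, `X_syn` and `y_syn` would be replaced by the `X` and `y` arrays prepared above.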


## Validating Time Series Data

Time series data consists of observations ordered in time, such as stock prices. The goal is typically to predict future values, for example future stock prices, from historical data.

### Challenge with Traditional Splitting Techniques
Traditional splitting techniques such as K-Fold ignore temporal order, so a model may be trained on later observations and tested on earlier ones, which should be avoided. Predictions should always be made from past to future data.

### Solution: TimeSeriesSplit
This issue can be addressed by using **TimeSeriesSplit**, which ensures that the training data precedes the testing data, maintaining the temporal order of observations.

### Key Points
- **Time Series Data**: Data associated with a specific time frame.
- **Prediction Goal**: Predict future values based on historical data.
- **Avoiding Data Leakage**: Ensure that future data is not used to predict past values.
- **TimeSeriesSplit**: A technique that splits data such that the training set precedes the test set, preserving the temporal sequence of observations.

By using TimeSeriesSplit, we can validate time series models under the same conditions they face in practice, which gives a more realistic estimate of their accuracy.


In [23]:
# Import the TimeSeriesSplit class from sklearn.model_selection
# TimeSeriesSplit is used to split data specifically for time series analysis
from sklearn.model_selection import TimeSeriesSplit

# Import the numpy library for numerical operations
import numpy as np

# Generate a random dataset with 10 samples and 2 features
# np.random.rand(10, 2) generates an array of shape (10, 2) with random values between 0 and 1
X = np.random.rand(10, 2)

# Generate a random target variable with 10 samples
# np.random.rand(10) generates an array of shape (10,) with random values between 0 and 1
y = np.random.rand(10)

# Print the generated features (X)
print(X)

# Print the generated target variable (y)
print(y)



[[0.83117734 0.06808227]
 [0.8052737  0.72567031]
 [0.46656173 0.30350335]
 [0.19552339 0.11696537]
 [0.76339186 0.28476993]
 [0.57415496 0.69678205]
 [0.62575038 0.20006848]
 [0.52509505 0.85671262]
 [0.95508043 0.20284131]
 [0.7384194  0.12211289]]
[0.94828076 0.63947491 0.13084306 0.87328322 0.94336473 0.91112029
 0.54470935 0.16892438 0.14068336 0.9679299 ]


In [24]:
# Import the TimeSeriesSplit class from sklearn.model_selection
# TimeSeriesSplit is used to split data specifically for time series analysis
from sklearn.model_selection import TimeSeriesSplit

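# Initialize the TimeSeriesSplit with 7 splits
# n_splits=7 specifies the number of splits
# With 10 samples, each test set holds a single sample and the training set grows by one sample per split
tss = TimeSeriesSplit(n_splits=7)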

# Iterate through the train and test indices generated by TimeSeriesSplit
# tss.split(X) generates the train and test indices for each split
for train_indices, test_indices in tss.split(X):
    # Print the train and test indices for each split
    # format method is used to format the train and test indices
    print("Train indices: {0} Test indices: {1}".format(train_indices, test_indices))



Train indices: [0 1 2] Test indices: [3]
Train indices: [0 1 2 3] Test indices: [4]
Train indices: [0 1 2 3 4] Test indices: [5]
Train indices: [0 1 2 3 4 5] Test indices: [6]
Train indices: [0 1 2 3 4 5 6] Test indices: [7]
Train indices: [0 1 2 3 4 5 6 7] Test indices: [8]
Train indices: [0 1 2 3 4 5 6 7 8] Test indices: [9]
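
The index pairs above can also drive a walk-forward evaluation. The following is a minimal sketch under stated assumptions: a synthetic noisy linear trend stands in for a real series, `LinearRegression` is an arbitrary choice of model, and the names `t`, `series`, `tss_demo` and `errors` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series: a linear trend plus noise (assumption, not real stock data)
rng = np.random.RandomState(0)
t = np.arange(40).reshape(-1, 1)  # the time index is the only feature
series = 0.5 * t.ravel() + rng.normal(scale=1.0, size=40)

tss_demo = TimeSeriesSplit(n_splits=5)

errors = []
ordered = []
for train_idx, test_idx in tss_demo.split(t):
    # Record whether every training index precedes every test index
    ordered.append(train_idx.max() < test_idx.min())
    model = LinearRegression().fit(t[train_idx], series[train_idx])
    errors.append(mean_absolute_error(series[test_idx], model.predict(t[test_idx])))

print("All splits respect time order:", all(ordered))
print("Mean absolute error per split:", np.round(errors, 2))
```

Each model only ever sees observations that come before the ones it is asked to predict.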
