<a href="https://colab.research.google.com/github/proffranciscofernando/introduction-to-data-science/blob/main/03-data-split-and-cross-validation-lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demonstration of Data Splitting and Cross-Validation

## 1. Introduction
This notebook uses linear regression to show how the choice of evaluation method affects the estimate of a model's performance. We will create a synthetic dataset, fit a linear regression model, and evaluate its performance in four different scenarios:
1. Without splitting the data
2. With data split into training and testing sets
3. Using cross-validation
4. Comparing different numbers of folds in cross-validation

## 2. Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

## 3. Creating the Synthetic Dataset
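We generate 100 samples in which the feature is drawn uniformly from [0, 2) and the target follows y = 4 + 3x plus standard normal noise. The random seed is fixed so that the results are reproducible.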

In [None]:
# Generating random data
n_samples = 100
np.random.seed(42)
X = 2 * np.random.rand(n_samples, 1)
y = 4 + 3 * X + np.random.randn(n_samples, 1)

# Creating a DataFrame
data = pd.DataFrame(data=np.hstack((X, y)), columns=['Feature', 'Target'])

# Displaying the first few rows of the dataset
data.head()

## 4. Exploratory Data Analysis (EDA)
Visualising the relationship between Feature and Target.

In [None]:
plt.scatter(data['Feature'], data['Target'])
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Feature vs Target')
plt.show()

## 5. Scenario 1: Without Splitting the Data

In [None]:
# Training the Linear Regression Model
model = LinearRegression()
model.fit(data[['Feature']], data['Target'])

In [None]:
# Evaluating the Model
y_pred = model.predict(data[['Feature']])
mse1 = mean_squared_error(data['Target'], y_pred)
r2_1 = r2_score(data['Target'], y_pred)
print(f'Scenario 1 - MSE: {mse1}')
print(f'Scenario 1 - R²: {r2_1}')
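
Because the data were generated as y = 4 + 3x plus noise, one quick sanity check is to compare the fitted intercept and slope with those values. The following is a minimal sketch, not part of the original lab: it regenerates the dataset with the same seed so it runs on its own, and the names `check_model`, `fitted_intercept`, and `fitted_slope` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the synthetic data with the same seed and formula as Section 3
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

# Fit on all the data, as in Scenario 1
check_model = LinearRegression().fit(X, y)
fitted_intercept = float(check_model.intercept_)
fitted_slope = float(check_model.coef_[0])

# The estimates should be close to, but not exactly, 4 and 3 because of the noise
print(f'Fitted intercept: {fitted_intercept:.3f} (generating value: 4)')
print(f'Fitted slope: {fitted_slope:.3f} (generating value: 3)')
```

Estimates close to the generating values suggest the model has captured the underlying relationship, but they say nothing about how well it predicts unseen data, which is what the next scenarios address.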

## 6. Scenario 2: Splitting the Data into Training and Testing Sets

In [None]:
# Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(data[['Feature']], data['Target'], test_size=0.2, random_state=42)

# Training the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluating the Model on Training Data
y_train_pred = model.predict(X_train)
mse2_train = mean_squared_error(y_train, y_train_pred)
r2_2_train = r2_score(y_train, y_train_pred)
print(f'Scenario 2 - Training MSE: {mse2_train}')
print(f'Scenario 2 - Training R²: {r2_2_train}')

In [None]:
# Evaluating the Model on Testing Data
y_test_pred = model.predict(X_test)
mse2_test = mean_squared_error(y_test, y_test_pred)
r2_2_test = r2_score(y_test, y_test_pred)
print(f'Scenario 2 - Testing MSE: {mse2_test}')
print(f'Scenario 2 - Testing R²: {r2_2_test}')
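
A single train/test split gives one estimate, and that estimate depends on which rows happen to land in the test set. The sketch below repeats the 80/20 split with a few arbitrary `random_state` values to show the spread. The seed list and the names `seeds`, `split_model`, and `test_mses` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Regenerate the synthetic data with the same seed and formula as Section 3
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

seeds = [0, 1, 2, 3, 42]  # arbitrary choices, only to vary the split
test_mses = []
for seed in seeds:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    split_model = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, split_model.predict(X_te))
    test_mses.append(mse)
    print(f'random_state={seed}: test MSE = {mse:.3f}')

print(f'Spread (max - min): {max(test_mses) - min(test_mses):.3f}')
```

This variability between splits is one motivation for cross-validation, which averages over several held-out subsets.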

## 7. Scenario 3: Using Cross-Validation

In [None]:
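# 5-fold cross-validation. cross_val_score clones the estimator and refits it on
# each training fold, so the fit from Scenario 2 is not reused.
# scikit-learn reports negated MSE (higher is better), hence the sign flip.
cv_mse = -cross_val_score(model, data[['Feature']], data['Target'], cv=5, scoring='neg_mean_squared_error')
cv_r2 = cross_val_score(model, data[['Feature']], data['Target'], cv=5, scoring='r2')

# Averaging the per-fold scores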
mse3 = cv_mse.mean()
r2_3 = cv_r2.mean()
print(f'Scenario 3 - MSE: {mse3}')
print(f'Scenario 3 - R²: {r2_3}')
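
To make the mechanics of the cell above concrete, the sketch below writes the 5-fold procedure out by hand with `KFold` and compares it with `cross_val_score`. For a regressor, an integer `cv` uses unshuffled `KFold`, so the two should agree. The names `manual_mse`, `auto_mse`, and `fold_model` are illustrative, and the data are regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Regenerate the synthetic data with the same seed and formula as Section 3
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

# Manual 5-fold loop: fit on four folds, score on the held-out fold
kf = KFold(n_splits=5)
manual_mse = []
for train_idx, val_idx in kf.split(X):
    fold_model = LinearRegression().fit(X[train_idx], y[train_idx])
    val_pred = fold_model.predict(X[val_idx])
    manual_mse.append(mean_squared_error(y[val_idx], val_pred))
manual_mse = np.array(manual_mse)

# The same computation through cross_val_score
auto_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                            scoring='neg_mean_squared_error')

print('Manual per-fold MSE:', np.round(manual_mse, 3))
print('cross_val_score MSE:', np.round(auto_mse, 3))
print('Match:', np.allclose(manual_mse, auto_mse))
```

Every sample is used for validation exactly once, and each fold's model never sees the samples it is scored on.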

## 8. Scenario 4: Comparing Different Numbers of Folds in Cross-Validation

In [None]:
# Comparing different numbers of folds in cross-validation
folds = [3, 5, 10, 20]
mse_scores = []
r2_scores = []
for k in folds:
    cv_mse = -cross_val_score(model, data[['Feature']], data['Target'], cv=k, scoring='neg_mean_squared_error')
    cv_r2 = cross_val_score(model, data[['Feature']], data['Target'], cv=k, scoring='r2')
    mse_scores.append(cv_mse.mean())
    r2_scores.append(cv_r2.mean())
    print(f'{k}-fold CV - MSE: {cv_mse.mean()}')
    print(f'{k}-fold CV - R²: {cv_r2.mean()}')
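
The loop above calls `cross_val_score` twice per value of `k`, which fits every fold twice. One possible alternative, sketched here under the assumption that only the mean and spread of each metric are needed, is `cross_validate`, which accepts several scorers in a single pass. The dictionary `fold_summary` and its keys are hypothetical names chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Regenerate the synthetic data with the same seed and formula as Section 3
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

fold_summary = {}
for k in [3, 5, 10, 20]:
    # One pass per k: each fold is fitted once and scored with both metrics
    results = cross_validate(LinearRegression(), X, y, cv=k,
                             scoring=('neg_mean_squared_error', 'r2'))
    mse_per_fold = -results['test_neg_mean_squared_error']
    r2_per_fold = results['test_r2']
    fold_summary[k] = {
        'mse_mean': mse_per_fold.mean(),
        'mse_std': mse_per_fold.std(),
        'r2_mean': r2_per_fold.mean(),
        'r2_std': r2_per_fold.std(),
    }
    print(f'{k}-fold CV - MSE: {mse_per_fold.mean():.3f} (std {mse_per_fold.std():.3f}), '
          f'R²: {r2_per_fold.mean():.3f} (std {r2_per_fold.std():.3f})')
```

Reporting the standard deviation alongside the mean helps when comparing fold counts: with 100 samples, 20 folds leave only 5 samples per validation fold, so per-fold scores can vary considerably.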

In [None]:
# Plotting the results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(folds, mse_scores, marker='o')
plt.title('MSE vs. Number of Folds')
plt.xlabel('Number of Folds')
plt.ylabel('MSE')

plt.subplot(1, 2, 2)
plt.plot(folds, r2_scores, marker='o')
plt.title('R² vs. Number of Folds')
plt.xlabel('Number of Folds')
plt.ylabel('R²')

plt.tight_layout()
plt.show()

## 9. Comparison of All Scenarios

In [None]:
# Comparing all scenarios
# The 5-fold entry of mse_scores / r2_scores is the same computation as Scenario 3,
# so Scenario 3 is not listed separately.

scenarios = ['No Split', 'Train/Test Split (Train)', 'Train/Test Split (Test)', '3-Fold CV', '5-Fold CV', '10-Fold CV', '20-Fold CV']
mse_all = [mse1, mse2_train, mse2_test] + mse_scores
r2_all = [r2_1, r2_2_train, r2_2_test] + r2_scores

# Plotting the comparison
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.barh(scenarios, mse_all, color='skyblue')
plt.xlabel('MSE')
plt.title('Comparison of MSE across Scenarios')

plt.subplot(1, 2, 2)
plt.barh(scenarios, r2_all, color='lightgreen')
plt.xlabel('R²')
plt.title('Comparison of R² across Scenarios')

plt.tight_layout()
plt.show()

## 10. Conclusion
In this notebook, we created a synthetic dataset and fitted a linear regression model. We evaluated the model's performance using MSE and R² in four different scenarios:
1. Without splitting the data
2. With data split into training and testing sets
3. Using cross-validation
4. Comparing different numbers of folds in cross-validation

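Finally, we compared the results of all scenarios to understand how the evaluation method and the number of folds affect the estimated performance. The model itself is the same in every case; what changes is how trustworthy the estimate is. Scenario 1 scores the model on the data it was fitted to, so it tends to be optimistic, whereas the held-out test set and cross-validation estimate performance on data the model has not seen.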