### Lab 5 - Cross-Validation for Model Selection
Task 1:
* Utilize the diabetes dataset from lab 4. 
* Perform cross-validation on nine polynomial models, ranging from degree 0 to 8.

In [1]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

Task 2:
* Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. 
* Include the R-Squared, Mean Absolute Error (MAE) and MAPE metrics for each model. 
* Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values.

In [2]:
# Initialize lists to store metrics
results = []

# Loop through polynomial degrees from 0 to 8
for degree in range(9):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)

    # Create a linear regression model
    model = LinearRegression()

    # Perform cross-validation predictions with 5 folds
    y_pred = cross_val_predict(model, X_poly, y, cv=5)

    # Calculate R-Squared
    r2 = r2_score(y, y_pred)

    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(y, y_pred)

    # Calculate Mean Absolute Percentage Error (MAPE)
    mape = mean_absolute_percentage_error(y, y_pred)

    results.append([degree, r2, mae, mape])

# Create a DataFrame to summarize the results
results_table = pd.DataFrame(
    results, columns=['Polynomial Degree', 'R-Squared', 'MAE', 'MAPE'])

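# Calculate the mean and standard deviation of each metric across the nine models
# (note: this aggregates over polynomial degrees, not over the cross-validation folds)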
mean_metrics = results_table.mean()
std_metrics = results_table.std()

# Add a row for mean and standard deviation to the table
mean_std_row = pd.DataFrame({
    'Polynomial Degree': 'Mean ± Std',
    'R-Squared': f'{mean_metrics["R-Squared"]:.4f} ± {std_metrics["R-Squared"]:.4f}',
    'MAE': f'{mean_metrics["MAE"]:.4f} ± {std_metrics["MAE"]:.4f}',
    'MAPE': f'{mean_metrics["MAPE"]:.4f} ± {std_metrics["MAPE"]:.4f}'
}, index=[len(results)])

# Keep a numeric-only copy (without the summary row) for sorting later
raw_results_table = results_table.copy()
results_table = pd.concat([results_table, mean_std_row])

# Print the results table
print("Summary of Cross-Validation Results:")
print(results_table)

Summary of Cross-Validation Results:
  Polynomial Degree           R-Squared                  MAE             MAPE
0                 0           -0.008824             66.03925         0.623684
1                 1            0.495322            44.274856         0.394893
2                 2            0.410853            46.602887          0.40275
3                 3          -170.75562           342.032729          2.32316
4                 4           -71.85994           303.102402         2.453773
5                 5          -68.544073           295.638158         2.405314
6                 6          -68.610219           295.584336         2.405038
7                 7          -68.611392           295.582874         2.405036
8                 8           -68.60516           295.533335         2.404673
9        Mean ± Std  -57.3432 ± 54.2533  220.4879 ± 127.1308  1.7576 ± 0.9656


Task 3: 
* Identification of the Best Model: Identify the model that performs best according to the R-Squared, MAE and MAPE metrics. 
* Provide an explanation for choosing this specific model. 

In [3]:
best_model_r2 = raw_results_table.sort_values(by='R-Squared', ascending=False).iloc[0]
best_model_mae = raw_results_table.sort_values(by='MAE', ascending=True).iloc[0]
best_model_mape = raw_results_table.sort_values(by='MAPE', ascending=True).iloc[0]


# Print the best models
print("\nBest Model Based on R-Squared:")
print(best_model_r2)
print("\nBest Model Based on MAE:")
print(best_model_mae)
print("\nBest Model Based on MAPE:")
print(best_model_mape)


Best Model Based on R-Squared:
Polynomial Degree     1.000000
R-Squared             0.495322
MAE                  44.274856
MAPE                  0.394893
Name: 1, dtype: float64

Best Model Based on MAE:
Polynomial Degree     1.000000
R-Squared             0.495322
MAE                  44.274856
MAPE                  0.394893
Name: 1, dtype: float64

Best Model Based on MAPE:
Polynomial Degree     1.000000
R-Squared             0.495322
MAE                  44.274856
MAPE                  0.394893
Name: 1, dtype: float64


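I am choosing the degree-1 (linear) model, which has the lowest MAPE (Mean Absolute Percentage Error) and also the highest R-Squared and the lowest MAE, so all three metrics agree. MAPE is my primary criterion for these reasons: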

* Percentage Accuracy: MAPE measures the average size of the prediction errors relative to the actual values, so a low MAPE means the predictions are close to the actual values in percentage terms.

* Easy Understanding: MAPE is easy to understand. It directly tells you how far off your predictions are in terms of percentage errors, making it simple to communicate with non-technical people.

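* Scale Independence: MAPE is a relative measure, so it can be compared across targets of different scales. It is, however, sensitive to actual values close to zero, where a small absolute error becomes a large percentage error.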

* Compliance: In some industries or for regulatory purposes, there may be specific requirements for percentage accuracy. Choosing the lowest MAPE helps meet such requirements.
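
To make the relative nature of MAPE concrete, here is a small sketch in plain Python that compares it with MAE. The helper functions `mae` and `mape` and all numbers are hypothetical, chosen only for illustration. Like `mean_absolute_percentage_error` in scikit-learn, `mape` returns a fraction rather than a percentage, so the 0.3949 reported for the degree-1 model corresponds to roughly 39.5%.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (assumes no zero targets)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: every prediction is off by exactly 10 units.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]

# Same absolute errors, but the first actual value is now small.
y_true_small = [10.0, 200.0, 300.0]
y_pred_small = [20.0, 190.0, 310.0]

print(mae(y_true, y_pred), mape(y_true, y_pred))
print(mae(y_true_small, y_pred_small), mape(y_true_small, y_pred_small))
```

MAE is identical in both cases, while MAPE rises sharply in the second one because the 10-unit error is as large as the small actual value itself. This is worth keeping in mind when the target contains small values.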

Task 4:
* Additional analysis and interpretation of the models' performances. You may explore further insights beyond the required metrics. 
* The analysis should provide at least one relevant insight about the choice of the best model, or about the characteristics of the chosen one (for example, an analysis of the instances in which it fails).

In [4]:
# Extract the best model based on the lowest MAPE
best_model = raw_results_table[raw_results_table['MAPE']
                               == raw_results_table['MAPE'].min()]

# Additional analysis and insights
if not best_model.empty:
    best_degree = best_model['Polynomial Degree'].values[0]
    print(f"Chosen Model (Polynomial Degree {best_degree}):")
    print(best_model)

    # Perform further analysis, if needed
    if best_degree == 1:
        print("Insight: The chosen model is a simple linear regression model.")
        print("Interpretability: Linear models are more interpretable and suitable for relative accuracy.")
        print("Model Simplicity: The linear model suggests that more complex polynomials may not significantly improve relative accuracy.")
    else:
        print("Insight: The chosen model has a polynomial degree higher than 1.")
        print("Considerations: Polynomial models capture non-linear relationships but may have higher complexity.")

    # Additional analysis steps can be added as needed, such as handling outliers, exploring data characteristics, or assessing model robustness.

else:
    print("No model with the lowest MAPE found in the results.")

Chosen Model (Polynomial Degree 1):
   Polynomial Degree  R-Squared        MAE      MAPE
1                  1   0.495322  44.274856  0.394893
Insight: The chosen model is a simple linear regression model.
Interpretability: Linear models are more interpretable and suitable for relative accuracy.
Model Simplicity: The linear model suggests that more complex polynomials may not significantly improve relative accuracy.
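
One further insight about the choice of the best model comes from counting the features each polynomial degree generates. With `n` input features, `PolynomialFeatures(degree=d)` (bias column included) produces `C(n + d, d)` columns. The sketch below compares that count with the size of a training fold, assuming the diabetes dataset's 442 samples and 10 features and the 5-fold split used above. The variable names are illustrative.

```python
from math import comb

N_SAMPLES = 442   # rows in the diabetes dataset
N_FEATURES = 10   # input features in the diabetes dataset
N_FOLDS = 5       # folds used in the cross-validation

# Smallest training set seen in 5-fold CV (the largest test fold has 89 rows).
train_size = N_SAMPLES * (N_FOLDS - 1) // N_FOLDS

# Number of polynomial features (including the bias column) per degree.
feature_counts = {d: comb(N_FEATURES + d, d) for d in range(9)}

# First degree at which there are more features than training samples.
first_over = min(d for d, count in feature_counts.items() if count > train_size)

for degree, count in feature_counts.items():
    print(f"degree {degree}: {count:>6} features vs {train_size} training samples")
print(f"Features exceed training samples from degree {first_over} onward.")
```

Degree 3 already yields 286 features for about 353 training samples, and from degree 4 onward the features outnumber the samples. An unregularized linear regression can then fit the training folds almost exactly while generalizing poorly, which is consistent with the strongly negative R-Squared values for degrees 3 to 8 and supports choosing the degree-1 model.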
