# Project - Model Development, Evaluation, and Improvement

**Instructions:**

4. Modeling and Evaluation:

Model data using one of the supervised machine learning algorithms learned.
Evaluate the model's performance.

In [3]:
# Suppress all warnings (not only FutureWarning) for easier analysis
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Imports:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import KFold, HalvingGridSearchCV

In [5]:
# Split cleaned data into training, validation, and testing
data = pd.read_csv('C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/cleaned_data.csv')

features = data.drop(columns=['inflation', 'country', 'country_code'])
condition = data['inflation']

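# train_test_split defaults to test_size=0.25, so the first call keeps about 75% for training
x_train, x_valtest, y_train, y_valtest = train_test_split(features, condition, random_state=42)
# The held-out 25% is split again: about 18.75% validation and 6.25% testing (of the full data)
x_val, x_test, y_val, y_test = train_test_split(x_valtest, y_valtest, random_state=42)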

x_train.to_csv("C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Train_features.csv", index=False)
y_train.to_csv("C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Train_labels.csv", index=False)
x_val.to_csv("C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Val_features.csv", index=False)
y_val.to_csv("C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Val_labels.csv", index=False)
x_test.to_csv("C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Test_features.csv", index=False)
y_test.to_csv("C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Test_labels.csv", index=False)
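
The two-stage split above relies on the default `test_size=0.25` in both calls. As a minimal sketch of what that implies for the split sizes, the following uses a synthetic stand-in for the cleaned data; the 1600-row array and the variable names are assumptions for illustration only, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (1600 rows is an arbitrary choice).
X = np.arange(1600).reshape(-1, 1)
y = np.arange(1600)

# Same two-stage pattern as the notebook cell, relying on the default test_size=0.25.
X_train, X_valtest, y_train, y_valtest = train_test_split(X, y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_valtest, y_valtest, random_state=42)

# Absolute sizes and shares of the full dataset for each split.
split_sizes = {"train": len(X_train), "val": len(X_val), "test": len(X_test)}
split_shares = {name: size / len(X) for name, size in split_sizes.items()}

print(split_sizes)
print(split_shares)
```

With these defaults the testing split ends up as the smallest of the three, so a test MSE computed on it is based on relatively few rows.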

In [6]:
# Load training and testing data - features and labels (the validation split is saved above but not loaded here)
train_features = pd.read_csv('C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Train_features.csv')
train_label = pd.read_csv('C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Train_labels.csv')

test_features = pd.read_csv('C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Test_features.csv')
test_label = pd.read_csv('C:/Users/miche/OneDrive/Desktop/Comp-2040 Python Essentials/Final Project/Train,Val,Test Data/Test_labels.csv')

In [7]:
# Use the HalvingGridSearchCV module for easier evaluation of data - Decision Tree Evaluation
param_grid = {
    'criterion': ['squared_error', 'absolute_error', 'friedman_mse'],
    'max_depth': [1, 5, 10, 15, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 15, 30],  # an integer value must be >= 2; 1 is rejected by scikit-learn
    'min_samples_leaf': [1, 5, 10, 15, 30]
}

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
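# make_scorer defaults to greater_is_better=True, so the search ranks candidates by
# highest MSE; pass greater_is_better=False to select the lowest-MSE candidate instead.
mse_score = make_scorer(mean_squared_error)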

dt_grid_search = HalvingGridSearchCV(estimator=DecisionTreeRegressor(), param_grid=param_grid, cv=kfold, scoring=mse_score, return_train_score=True)
dt_grid_search.fit(train_features, train_label)

print("Best parameters:", dt_grid_search.best_params_)
print("Weighted mean training score:", dt_grid_search.best_score_)

pred = dt_grid_search.predict(train_features)
print("MSE:", mean_squared_error(train_label, pred))

Best parameters: {'criterion': 'squared_error', 'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 5}
Weighted mean training score: 553.1622460463002
MSE: 50.605621792277674
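
Because the search above scores candidates with a plain MSE scorer, it helps to see how the same kind of search behaves when the score direction is made explicit. The following is a minimal sketch, assuming synthetic data from `make_regression` and a deliberately small grid; the names `X_demo`, `y_demo`, and `demo_search` are hypothetical and none of the numbers it produces relate to the results recorded in this notebook. It uses scikit-learn's built-in `"neg_mean_squared_error"` scoring string, under which a higher (less negative) score means a lower MSE.

```python
from sklearn.datasets import make_regression
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must precede the import below)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import HalvingGridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the project data (assumption).
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=42)

demo_search = HalvingGridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [1, 5, 10], "min_samples_leaf": [1, 5, 10]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",  # higher score == lower MSE
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the selected candidate;
# negate it to read it as an MSE.
best_cv_mse = -demo_search.best_score_
train_mse = mean_squared_error(y_demo, demo_search.predict(X_demo))

print("Best parameters:", demo_search.best_params_)
print("Mean cross-validated MSE:", best_cv_mse)
print("Training MSE:", train_mse)
```

Reporting the cross-validated MSE next to the training MSE makes it easier to see how much of the training fit carries over to held-out folds.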


In [8]:
# Use the HalvingGridSearchCV module for easier evaluation of data - SVM Evaluation
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR())
])

param_grid = {
    'svr__C': [5, 10, 15, 25, 50],
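    'svr__degree': [1, 5, 10, 15, 30],  # only used by the 'poly' kernel; ignored by 'linear' and 'rbf'
    'svr__gamma': [1, 5, 10, 15, 30],  # used by 'rbf'; ignored by 'linear'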
    'svr__kernel': ['linear', 'rbf']
}

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
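# As in the decision tree search, this scorer treats a higher MSE as better (greater_is_better=True by default)
mse_score = make_scorer(mean_squared_error)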

svm_grid_search = HalvingGridSearchCV(pipeline, param_grid, cv=kfold, scoring=mse_score, return_train_score=True)
svm_grid_search.fit(train_features, train_label)

print("Best parameters:", svm_grid_search.best_params_)
print("Weighted mean training score:", svm_grid_search.best_score_)

pred = svm_grid_search.predict(train_features)
print("MSE:", mean_squared_error(train_label, pred))

Best parameters: {'svr__C': 10, 'svr__degree': 1, 'svr__gamma': 5, 'svr__kernel': 'linear'}
Weighted mean training score: 334.59208716199817
MSE: 329.48693317948795


In [9]:
# Evaluate the best model using testing data
best_model = dt_grid_search.best_estimator_

preds = best_model.predict(test_features)
print("MSE:", mean_squared_error(test_label, preds))

MSE: 49.630083075257474
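
The notebook saves a validation split but evaluates only the decision tree on the testing data. One possible way to use the validation split is to compare both model types on it and reserve the testing data for the selected model alone. The sketch below illustrates that idea under stated assumptions: the data is synthetic, the hyperparameters are arbitrary placeholders rather than the tuned values, and all variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and the same two-stage split pattern as the notebook (assumption).
X_all, y_all = make_regression(n_samples=600, n_features=5, noise=10.0, random_state=42)
X_tr, X_rest, y_tr, y_rest = train_test_split(X_all, y_all, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, random_state=42)

# Placeholder candidates; in practice these would be the tuned estimators.
candidates = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "svr": Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="linear", C=10))]),
}

# Fit each candidate on the training split and score it on the validation split.
val_mse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_mse[name] = mean_squared_error(y_va, model.predict(X_va))

# Pick the candidate with the lowest validation MSE, then touch the test split once.
selected = min(val_mse, key=val_mse.get)
test_mse = mean_squared_error(y_te, candidates[selected].predict(X_te))

print("Validation MSE:", val_mse)
print("Selected model:", selected)
print("Test MSE of selected model:", test_mse)
```

Choosing on validation data and reporting on test data keeps the final test MSE from influencing which model is picked.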


**Explanation:**

To determine the best model, I used the HalvingGridSearchCV module for efficient training and evaluation. I tested two models, a decision tree and an SVR model, and obtained the following results:

**Decision Tree:**
<pre>
Best parameters: {'criterion': 'squared_error', 'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 5}
Weighted mean training score: 553.1622460463002
MSE: 50.60562179227767</pre>

**SVR Model:**
<pre>
Best parameters: {'svr__C': 10, 'svr__degree': 1, 'svr__gamma': 5, 'svr__kernel': 'linear'}
Weighted mean training score: 334.59208716199817
MSE: 329.4869331794879</pre>

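After testing various combinations of hyperparameters, the decision tree fit the training data far better, with a training MSE of about 50.61 versus 329.49 for the SVR model. The score reported by the search is also an MSE, so lower is better there as well; on that measure the SVR model (334.59) scored lower than the decision tree (553.16), and the gap between the tree's search score and its training MSE may indicate some overfitting. Evaluated on the testing data, the decision tree reached an MSE of about 49.63, in line with its training MSE, so it remained my choice as the final model.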
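
Since MSE is expressed in squared units of the target, it can be easier to compare these figures as root mean squared errors (RMSE), which are in the same units as `inflation`. The short sketch below only takes square roots of the MSE values printed earlier in this notebook; the dictionary keys are labels chosen for illustration.

```python
import math

# MSE values copied from the notebook outputs above.
reported_mse = {
    "decision_tree_train": 50.605621792277674,
    "svr_train": 329.48693317948795,
    "decision_tree_test": 49.630083075257474,
}

# RMSE is the square root of MSE, in the same units as the target variable.
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
```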
