# Random forest model after combining numeric and categorical variables

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

# Import essential models and functions from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv('student-por.csv')

In [39]:
# One-hot encoding
paid = pd.DataFrame(data['paid'])

# Encode the categorical 'paid' column. get_dummies creates one 0/1 indicator
# column per category, so this yields two columns: paid_no and paid_yes
one_hot_encoded = pd.get_dummies(paid, dtype=int)

# Defining the features and target variable
combined = pd.concat([data[['studytime']], one_hot_encoded], axis=1)
y = data['G3']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(combined, y, test_size=0.2, random_state=4)

# Define the number of folds
k = 5

# Initialize the KFold
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# List to store cross-validation scores
cv_scores = []

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[test_index]
    
    # Train the model
    model.fit(X_train_fold, y_train_fold)
    
    # Evaluate on the validation fold (score() returns R^2 for a regressor)
    score = model.score(X_val_fold, y_val_fold)
    
    # Append the score to the list of cross-validation scores
    cv_scores.append(score)

# Calculate and print the mean of the cross-validation scores
print("Mean R^2 Score:", np.mean(cv_scores))

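# Predict on the test set
# Note: at this point the model was last fit on the final fold's training split,
# not on all of X_train
y_pred_test = model.predict(X_test)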

# Calculate and print the mean squared error on the test set
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred_test))


Mean R^2 Score: 0.029098006759019034
Mean Squared Error (MSE): 10.414881422945598
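
The manual loop above leaves `model` fitted on the last fold's training split only, and `cross_val_score` is imported but never used. A minimal sketch of an alternative is shown here: score the folds with `cross_val_score`, then refit on the whole training set before touching the test set. The data is a synthetic stand-in (the `_demo` names and generated values are assumptions, not `student-por.csv`), so the printed numbers are not comparable to the results above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the real features (hypothetical values)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "studytime": rng.integers(1, 5, size=n),
    "paid_no": rng.integers(0, 2, size=n),
})
X_demo["paid_yes"] = 1 - X_demo["paid_no"]
y_demo = pd.Series(10 + X_demo["studytime"] + rng.normal(0, 3, size=n), name="G3")

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)

# cross_val_score clones the estimator for each fold; with no scoring argument
# it uses the estimator's default score, which is R^2 for a regressor
cv_r2 = cross_val_score(rf_demo, X_tr, y_tr, cv=kf_demo)
print("Mean R^2 Score (CV):", cv_r2.mean())

# Refit on all training rows so the test evaluation uses the full training set
rf_demo.fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, rf_demo.predict(X_te))
print("Mean Squared Error (MSE):", test_mse)
```

Refitting before the final evaluation keeps the cross-validation estimate and the test estimate tied to the same training data.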


Justification for not removing outliers in studytime:
1. Model Robustness: Random Forest is an ensemble method: it combines the predictions of many decision trees to obtain better predictive performance than any single tree. Averaging over many trees reduces the impact of outliers, so the predictions are not significantly affected by them.
2. Valuable Insights: Outliers may provide valuable insights into students' performance patterns. Education is not one-size-fits-all, and some students will perform significantly better or worse than others. Removing outliers could therefore hide complex, non-linear relationships that our model might otherwise detect.
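
Before deciding whether to keep outliers, it can help to see how many values would actually be flagged. The sketch below applies the common 1.5 × IQR rule to a small hypothetical series (`studytime_demo` is made up for illustration, not taken from the dataset). If studytime is a 1 to 4 ordinal code, as in the dataset's published description, any flagged value is a legitimate category rather than a measurement error.

```python
import pandas as pd

# Hypothetical studytime values on a 1-4 scale (not the real data)
studytime_demo = pd.Series([1, 2, 2, 2, 3, 2, 1, 4, 2, 3, 1, 2])

# 1.5 * IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = studytime_demo.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (studytime_demo < lower) | (studytime_demo > upper)

# Frequency of each level, then the values the rule would flag
print(studytime_demo.value_counts().sort_index())
print("Flagged values:", studytime_demo[outlier_mask].tolist())
```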

# Conclusion - interpreting the results:
Mean R² Score:

studytime : 0.04641826761747649

paid : Not applicable for regression

combined : 0.029098006759019034

A higher R² score indicates a better fit of the model to the data. Here the studytime-only model has a slightly higher R² (about 0.046) than the combined model (about 0.029), so it explains a slightly larger proportion of the variance in G3 scores. Both values are close to zero, however: neither model explains more than about 5% of the variance.
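
As a reminder of what R² measures, the sketch below computes it by hand as 1 − SS_res / SS_tot and checks the result against `sklearn.metrics.r2_score`. The arrays are small made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions (not from the dataset)
y_true = np.array([10.0, 12.0, 14.0, 11.0, 13.0])
y_hat = np.array([11.0, 12.0, 13.0, 12.0, 12.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores exactly 0
r2_mean_only = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print("Manual R^2:", r2_manual)
print("sklearn R^2:", r2_score(y_true, y_hat))
print("Mean-only R^2:", r2_mean_only)
```

Read this way, an R² near 0.03 to 0.05 means the model does only marginally better than always predicting the mean grade.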

---------------------------------------------------------------------------

Mean Squared Error (MSE):

studytime : 6.381007474288847

paid : Not applicable for regression

combined : 10.414881422945598

A lower MSE indicates more accurate predictions. The combined model has a higher MSE (about 10.41) than the studytime-only model (about 6.38), so its predictions carry more error. One possible reason is that additional variables can contribute noise or unnecessary complexity to a model, which can result in a higher MSE.
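
An MSE comparison is easiest to interpret when every model is scored on the same test rows and set against a simple baseline. The sketch below does this on synthetic data; the `demo` frame, the feature-set names, and the mean baseline are assumptions for illustration, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical values, not student-por.csv)
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "studytime": rng.integers(1, 5, size=n),
    "paid_yes": rng.integers(0, 2, size=n),
})
demo["G3"] = 10 + demo["studytime"] + rng.normal(0, 3, size=n)

# One split shared by every model, so the MSE values are directly comparable
train_df, test_df = train_test_split(demo, test_size=0.2, random_state=4)

feature_sets = {"studytime": ["studytime"], "combined": ["studytime", "paid_yes"]}
results = {}
for name, cols in feature_sets.items():
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(train_df[cols], train_df["G3"])
    mse = mean_squared_error(test_df["G3"], rf.predict(test_df[cols]))
    # RMSE is in the same units as G3, which makes it easier to read
    results[name] = {"MSE": mse, "RMSE": float(np.sqrt(mse))}

# Baseline: always predict the training-set mean of G3
baseline_pred = np.full(len(test_df), train_df["G3"].mean())
baseline_mse = mean_squared_error(test_df["G3"], baseline_pred)
results["mean baseline"] = {"MSE": baseline_mse, "RMSE": float(np.sqrt(baseline_mse))}

print(pd.DataFrame(results).T)
```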

------------------------------------------------------------------------

Based on R² Score and MSE: Despite the higher classification accuracy of paid, on R² score and MSE (the metrics most directly relevant to the regression task) studytime alone performs slightly better than the combined model. paid has no R² or MSE of its own here, so it cannot be compared on these metrics.

In conclusion, studytime alone may offer a simpler and more interpretable model than the combined model, which adds categorical encoding and possible interactions between variables.
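
To make the categorical encoding step concrete, the sketch below shows what `pd.get_dummies` produces for a yes/no column. The `paid_demo` rows are hypothetical. With two categories, the default output has two columns that always sum to 1, and `drop_first=True` is one way to keep a single 0/1 column instead.

```python
import pandas as pd

# Hypothetical rows for a yes/no column (not the real data)
paid_demo = pd.DataFrame({"paid": ["yes", "no", "no", "yes"]})

# Default: one indicator column per category
two_cols = pd.get_dummies(paid_demo, dtype=int)

# drop_first=True drops the first category ('no'), leaving a single indicator
one_col = pd.get_dummies(paid_demo, dtype=int, drop_first=True)

print(two_cols.columns.tolist())  # ['paid_no', 'paid_yes']
print(one_col.columns.tolist())   # ['paid_yes']
print(one_col["paid_yes"].tolist())  # [1, 0, 0, 1]
```

For a tree-based model the redundant column is usually harmless, but a single indicator keeps the feature set easier to interpret.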