In [1]:
import pandas as pd
import re

import warnings
warnings.filterwarnings("ignore")

# Data Analysis
## Putting Data Together

Here, I put together the input data for my algorithm. The goal is to train a classifier that uses all homework, lab, and project scores to predict the Midterm 1 letter grade (A+ to E). I also keep the numeric Midterm 1 score as a label for regression.

In [2]:
data = pd.read_csv('pre_mt1_train.csv')

In [3]:
input_data = data.filter(regex='^(Homework|Lab|Project).*')
print(input_data.columns)

Index(['Homework 1', 'Homework 2', 'Homework 3', 'Lab 02', 'Lab 03', 'Lab 04',
       'Lab 05', 'Lab 06', 'Lab 07', 'Lab 08', 'Lab 09', 'Lab 10', 'Project 1',
       'Project 2A'],
      dtype='object')
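
As a quick check on the column selection above, the same pattern can be tried against a plain list of column names with `re`. This is a minimal sketch: the column `'Name'` is a hypothetical stand-in for a non-feature column, and it relies on `DataFrame.filter(regex=...)` keeping a label whenever `re.search` finds a match in it.

```python
import re

# 'Homework 1' through 'Project 2A' appear in the printed Index above and
# the two 'Midterm 1' columns are used as labels later; 'Name' is a
# hypothetical non-feature column added for illustration.
columns = ['Homework 1', 'Lab 02', 'Project 1', 'Project 2A',
           'Midterm 1', 'Midterm 1 Grade', 'Name']

pattern = re.compile(r'^(Homework|Lab|Project).*')

# Keep a column when the pattern matches its name, as filter(regex=...) does.
selected = [col for col in columns if pattern.search(col)]
print(selected)
```

Because the pattern is anchored with `^`, only names that start with one of the three prefixes are kept, so the midterm columns never leak into the inputs.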


In [6]:
regr_label_data = data[['Midterm 1']]
print(regr_label_data.columns)

Index(['Midterm 1'], dtype='object')


In [5]:
class_label_data = data[['Midterm 1 Grade']]
print(class_label_data.columns)

Index(['Midterm 1 Grade'], dtype='object')


## Classification

Here, I fit 3 different classification models to the data, starting with a random forest:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X = input_data.to_numpy()
# Flatten the single-column label frame to 1-D, the shape scikit-learn expects for y.
y = class_label_data.to_numpy().ravel()


param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


random_forest_clf = RandomForestClassifier(random_state=42)

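# Score with accuracy: the labels are letter grades, so a regression metric
# such as 'neg_mean_squared_error' does not apply to this classifier.
grid_search = GridSearchCV(estimator=random_forest_clf, param_grid=param_grid, cv=5, scoring='accuracy')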
grid_search.fit(X, y)

print(f"Best Parameters: {grid_search.best_params_}")

best_random_forest_clf = grid_search.best_estimator_
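
To get a feel for how much work this grid search does, the candidate combinations can be enumerated by hand. This is a small sketch using only the standard library; `cv_folds`, `candidates`, `n_candidates`, and `n_fits` are names introduced here for illustration.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination of one value per parameter is a candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold, before the final refit

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

With three values for each of four parameters, that is 81 candidates and 405 cross-validation fits, so the search time grows quickly if more values are added to the grid.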

In [None]:
from sklearn.metrics import accuracy_score, f1_score
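
The imports above suggest scoring the classifier with accuracy and F1. Below is a minimal sketch of how that could look, assuming a held-out test split. It runs on synthetic data from `make_classification` rather than the course data, and every name ending in `_demo`, plus `clf`, `y_hat`, `accuracy`, and `macro_f1`, is introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment scores and letter grades (NOT the course data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, n_classes=3, random_state=42
)

# Hold out 20% of the rows, keeping class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

accuracy = accuracy_score(y_te, y_hat)
# Macro averaging weights every class equally, which matters when some grades are rare.
macro_f1 = f1_score(y_te, y_hat, average='macro')

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```

Reporting macro F1 next to accuracy is a design choice: with letter grades, a model can reach decent accuracy by favoring the most common grade, and macro F1 makes that visible.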



## Regression

Here's where the juicy stuff happens. I used a random forest regressor to predict the numeric Midterm 1 score, as I've never used one before and wanted to try it. I also think it fits our data reasonably well. I split the data 80/20 into training and test sets, then ran a grid search over the parameters in `param_grid`, with 5-fold cross-validation on the training set, to pick the best set of parameters for the model. The best parameters are printed below.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

X = input_data.to_numpy()
y = regr_label_data.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


random_forest_model = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(estimator=random_forest_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")

best_random_forest_model = grid_search.best_estimator_

It looks like it did pretty well. The values for R^2 and MSE on the test set aren't bad at all, and the sample prediction, taken from the test set, is pretty close to the actual score. This model only predicts Midterm 1. Predictions for Midterm 2 and the Final would make more sense as separate models, since those grades are temporally dependent on each other. Even without that consideration, this model did really well.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = best_random_forest_model.predict(X_test)

r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse}")

# The model predicts a single target (Midterm 1), so the output is labeled accordingly.
print(f"Student 1 predicted value (Midterm 1): {best_random_forest_model.predict(X_test[0,:].reshape(1, -1))}")
print(f"Student 1 actual value (Midterm 1): {y_test[0,:]}")
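
To make the two metrics above concrete, here is a small sketch that computes them by hand. The scores in `actual` and `predicted` are made up for illustration and are not taken from the dataset, and the two helper functions are hypothetical names, not part of scikit-learn.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up exam scores for illustration only.
actual = [70.0, 80.0, 90.0, 60.0]
predicted = [72.0, 78.0, 85.0, 65.0]

mse_demo = mean_squared_error_manual(actual, predicted)
r2_demo = r2_manual(actual, predicted)
print(f"MSE: {mse_demo}, R-squared: {r2_demo}")
```

MSE is in squared score units, so its square root is often easier to read: an MSE of 14.5 corresponds to a typical error of about 3.8 points. R-squared compares the model against always predicting the mean score, so a value near 1 means most of the variation in scores is explained.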