To-Do:
- [x] Make the train_test_split same for all three sub-models.
- [ ] Correct the remaining issues in the split by recreating it from scratch.
    - Create both splits with 'Target' as the stratify column to preserve the class distribution, then drop the extra columns from the arrays.

# Student Success and Failure Lookalike Project

School is one of the best ways to raise an individual's earning potential. If we can identify students who are at risk in school, we may be able to intervene beforehand. If we can identify "lookalikes" of students who failed, we could intervene with those students and tailor solutions to the features that identify them.

_"A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies. The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters. The data is used to build classification models to predict students' dropout and academic success. The problem is formulated as a three category classification task, in which there is a strong imbalance towards one of the classes."_

_Data Source: https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success_

This dataset was provided by the Polytechnic Institute of Portalegre and SATDAP - Capacitação da Administração Pública.

Citation: M.V.Martins, D. Tolledo, J. Machado, L. M.T. Baptista, V.Realinho. (2021) "Early prediction of student’s performance in higher education: a case study" Trends and Applications in Information Systems and Technologies, vol.1, in Advances in Intelligent Systems and Computing series. Springer. DOI: 10.1007/978-3-030-72657-7_16

Note: While this group certainly did their own prediction, I have not and will not read the paper until after I complete my own data model.

<div style="margin-left: 50px;">
    <img src="./poly_screenshot.png" width="600" />
</div>

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, FunctionTransformer, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, roc_auc_score, RocCurveDisplay, accuracy_score, precision_score, recall_score
from numpy import log1p, arange, where
from pandas import Series, DataFrame
from sklearn.base import BaseEstimator, ClassifierMixin
import joblib

SD = 42

In [2]:
#Load the data for a first pass
datapath = './data/student_data.csv'
raw_df = pd.read_csv(datapath, low_memory=False, delimiter=';')
print(raw_df.shape)

(4424, 37)


Here I encode the target variable into numerical classes. (Random forests can in principle handle categorical variables, but sklearn.ensemble.RandomForestClassifier complains if they are left as strings.)

In [3]:
le = LabelEncoder()
le.fit(raw_df['Target'])
target_map = dict(zip(le.classes_, le.transform(le.classes_)))
print(f"Target Column Conversion: {target_map}")

#Map each label to its integer code
raw_df["Target"] = raw_df["Target"].map(target_map)
raw_df['dropout'] = np.where(raw_df["Target"]==0,1,0)
raw_df['enrolled'] = np.where(raw_df["Target"]==1,1,0)
raw_df['graduate'] = np.where(raw_df["Target"]==2,1,0)
display(raw_df.head(10))

Target Column Conversion: {'Dropout': 0, 'Enrolled': 1, 'Graduate': 2}


Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target,dropout,enrolled,graduate
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0.0,0,10.8,1.4,1.74,0,1,0,0
1,1,15,1,9254,1,1,160.0,1,1,3,...,6,13.666667,0,13.9,-0.3,0.79,2,0,0,1
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,0.0,0,10.8,1.4,1.74,0,1,0,0
3,1,17,2,9773,1,1,122.0,1,38,37,...,5,12.4,0,9.4,-0.8,-3.12,2,0,0,1
4,2,39,1,8014,0,1,100.0,1,37,38,...,6,13.0,0,13.9,-0.3,0.79,2,0,0,1
5,2,39,1,9991,0,19,133.1,1,37,37,...,5,11.5,5,16.2,0.3,-0.92,2,0,0,1
6,1,1,1,9500,1,1,142.0,1,19,38,...,8,14.345,0,15.5,2.8,-4.06,2,0,0,1
7,1,18,4,9254,1,1,119.0,1,37,37,...,0,0.0,0,15.5,2.8,-4.06,0,1,0,0
8,1,1,3,9238,1,1,137.0,62,1,1,...,6,14.142857,0,16.2,0.3,-0.92,2,0,0,1
9,1,1,1,9238,1,1,138.0,1,1,19,...,2,13.5,0,8.9,1.4,3.51,0,1,0,0
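
The mapping printed above follows from how `LabelEncoder` works: it assigns integer codes to the distinct labels in sorted order. Below is a minimal pure-Python sketch of the same encoding and the three indicator columns. The `labels` list and the `indicator` helper are made up for illustration and are not part of the notebook.

```python
# Illustrative labels only; the notebook reads them from raw_df['Target'].
labels = ["Graduate", "Dropout", "Enrolled", "Dropout", "Graduate"]

# Sorted distinct labels get consecutive integer codes, mirroring LabelEncoder.
target_map = {label: code for code, label in enumerate(sorted(set(labels)))}
encoded = [target_map[label] for label in labels]

def indicator(codes, positive_code):
    """Return 1 where the code equals positive_code, else 0 (one-vs-rest target)."""
    return [1 if c == positive_code else 0 for c in codes]

dropout = indicator(encoded, target_map["Dropout"])
enrolled = indicator(encoded, target_map["Enrolled"])
graduate = indicator(encoded, target_map["Graduate"])

print(target_map)
print(encoded)
```

Each row ends up with exactly one of the three indicators set, which is what lets the binary models be combined by voting later on.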


In [4]:
#split out a test set based on Target; stratify each split based on the Target distribution.
X_temp, X_test, y_temp, y_test = train_test_split(raw_df, raw_df['Target'], stratify=raw_df['Target'], test_size=0.2, random_state=SD)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, stratify=y_temp, test_size=0.2, random_state=SD)
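
Because the second split takes 20% of the remaining 80%, the effective proportions are roughly 64% train, 16% validation, and 20% test. Here is a quick arithmetic sketch, assuming scikit-learn rounds a fractional `test_size` up to the next whole row; the `split_sizes` helper is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_remaining, n_held_out), rounding the held-out count up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_total = 4424  # rows reported by raw_df.shape
n_temp, n_test = split_sizes(n_total, 0.2)   # first split: temp vs. test
n_train, n_val = split_sizes(n_temp, 0.2)    # second split: train vs. val

print(n_train, n_val, n_test)
for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n / n_total:.1%}")
```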


# Dropout - 0

In [5]:
#Split
#y = raw_df['dropout']
#real_labels = raw_df['Target']
#X = raw_df.drop(["Target", "enrolled","graduate","dropout"], axis=1)

#do_X_train, do_X_val, do_y_train, do_y_val = splitter(X,y)

#display(do_X_train.head())


##Create the GB model using 1's and 0's for dropouts
#do_GB = GradientBoostingClassifier(class_weight='balanced')
#do_GB.fit(X=do_X_train, y=do_y_train)
##score
#do_y_preds = do_GB.predict(do_X_val)
#print(f"GB Accuracy: {accuracy_score(do_y_val, do_y_preds)}")
#rmse = mean_squared_error(do_y_val, do_y_preds)**(1/2)
#print(f"GB RMSE: {rmse}")
#prec_rec = precision_score(do_y_val, do_y_preds), recall_score(do_y_val, do_y_preds)
#print(f"Precision/Recall: {prec_rec}")
##display(pd.DataFrame({'vals':do_y_val, 'preds':do_y_preds}))
##test_preds = do_GB.predict(X_test.drop(['Target','dropout','enrolled','graduate'],axis=1))
##test_prec_rec = precision_score(y_test, test_preds), recall_score(y_test, test_preds)
##print(f"Precision/Recall: {test_prec_rec}")



#Create the RF model using 1's and 0's for dropouts
real_y_val = X_val['dropout']
real_X_val = X_val.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_train = X_train['dropout']
real_X_train = X_train.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_test = X_test['dropout']
real_X_test = X_test.drop(['Target','dropout','enrolled','graduate'], axis=1)

do_RF = RandomForestClassifier(class_weight='balanced')
do_RF.fit(X=real_X_train, y=real_y_train)
#score
y_preds = do_RF.predict(real_X_val)
print(f"RF Accuracy: {accuracy_score(real_y_val, y_preds)}")
rmse = mean_squared_error(real_y_val, y_preds)**(1/2)
print(f"RF RMSE: {rmse}")

#Save the model as a pickle when it looks good.
#joblib.dump(do_GB,'./models/distinct_gb_dropout.pkl')
joblib.dump(do_RF,'./models/distinct_rf_dropout.pkl')

RF Accuracy: 0.8757062146892656
RF RMSE: 0.35255323755531515


['./models/distinct_rf_dropout.pkl']
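
For a binary 0/1 target, each squared error is either 0 or 1, so the RMSE is simply the square root of the error rate and carries no information beyond accuracy. A small sketch of that identity, checked against the figures printed above (the function name is illustrative):

```python
import math

def rmse_from_accuracy(accuracy):
    """RMSE of 0/1 predictions against 0/1 labels, given the accuracy."""
    # Mean squared error equals the fraction of wrong predictions.
    return math.sqrt(1.0 - accuracy)

rf_accuracy = 0.8757062146892656  # validation accuracy printed above
print(rmse_from_accuracy(rf_accuracy))
```

Precision and recall on the positive class may therefore be a more informative companion to accuracy for the binary models.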

# Enrolled - 1

In [6]:
#Split
#y = raw_df['enrolled']
#X = raw_df.drop(["Target", "enrolled","graduate","dropout"], axis=1)
#er_X_train, er_X_val, er_y_train, er_y_val = splitter(X,y)

##Create the GB model using 1's and 0's for enrolled students
#er_GB = GradientBoostingClassifier()
#er_GB.fit(X=er_X_train, y=er_y_train)
##score
#er_y_preds = er_GB.predict(er_X_val)
#print(f"GB Accuracy: {accuracy_score(er_y_val, er_y_preds)}")
#rmse = mean_squared_error(er_y_val, er_y_preds)**(1/2)
#print(f"GB RMSE: {rmse}")

#Resplitting multi-classes
real_y_val = X_val['enrolled']
real_X_val = X_val.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_train = X_train['enrolled']
real_X_train = X_train.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_test = X_test['enrolled']
real_X_test = X_test.drop(['Target','dropout','enrolled','graduate'], axis=1)

#Create the RF model using 1's and 0's for enrolled students
er_RF = RandomForestClassifier(class_weight='balanced')
er_RF.fit(X=real_X_train, y=real_y_train)
#score
er_y_preds = er_RF.predict(real_X_val)
print(f"RF Accuracy: {accuracy_score(real_y_val, er_y_preds)}")
rmse = mean_squared_error(real_y_val, er_y_preds)**(1/2)
print(f"RF RMSE: {rmse}")

#Save the model as a pickle when it looks good.
#joblib.dump(er_GB,'./models/distinct_gb_enrolled.pkl')
joblib.dump(er_RF,'./models/distinct_rf_enrolled.pkl')

RF Accuracy: 0.8248587570621468
RF RMSE: 0.4184987968176887


['./models/distinct_rf_enrolled.pkl']

# Graduated - 2

In [7]:
#Split
#y = raw_df['graduate']
#X = raw_df.drop(["Target", "enrolled","graduate","dropout"], axis=1)
#gr_X_train, gr_X_val, gr_y_train, gr_y_val = splitter(X,y)

##Create the GB model using 1's and 0's for graduates
#gr_GB = GradientBoostingClassifier()
#gr_GB.fit(X=gr_X_train, y=gr_y_train)
##score
#gr_y_preds = gr_GB.predict(gr_X_val)
#print(f"GB Accuracy: {accuracy_score(gr_y_val, gr_y_preds)}")
#rmse = mean_squared_error(gr_y_val, gr_y_preds)**(1/2)
#print(f"GB RMSE: {rmse}")

#Resplitting multi-classes
real_y_val = X_val['graduate']
real_X_val = X_val.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_train = X_train['graduate']
real_X_train = X_train.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_test = X_test['graduate']
real_X_test = X_test.drop(['Target','dropout','enrolled','graduate'], axis=1)

#Create the RF model using 1's and 0's for graduates
gr_RF = RandomForestClassifier(class_weight='balanced')
gr_RF.fit(X=real_X_train, y=real_y_train)
#score
gr_y_preds = gr_RF.predict(real_X_val)
print(f"RF Accuracy: {accuracy_score(real_y_val, gr_y_preds)}")
rmse = mean_squared_error(real_y_val, gr_y_preds)**(1/2)
print(f"RF RMSE: {rmse}")

#Save the model as a pickle when it looks good.
#joblib.dump(gr_GB,'./models/distinct_gb_graduated.pkl')
joblib.dump(gr_RF,'./models/distinct_rf_graduated.pkl')

RF Accuracy: 0.8531073446327684
RF RMSE: 0.3832657764100933


['./models/distinct_rf_graduated.pkl']

# Base Model

In [14]:
#Split
#y = raw_df['Target']
#X = raw_df.drop(["Target", "enrolled","graduate","dropout"], axis=1)
#targ_X_train, targ_X_val, targ_y_train, targ_y_val = splitter(X,y)

#Resplitting multi-classes
real_y_val = X_val['Target']
real_X_val = X_val.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_train = X_train['Target']
real_X_train = X_train.drop(['Target','dropout','enrolled','graduate'], axis=1)

real_y_test = X_test['Target']
real_X_test = X_test.drop(['Target','dropout','enrolled','graduate'], axis=1)

#Create a base RF classifier for all classes.
targ_RF = RandomForestClassifier(class_weight='balanced')
targ_RF.fit(X=real_X_train, y=real_y_train)

#score
targ_y_preds = targ_RF.predict(real_X_val)
print(f"RF Accuracy: {accuracy_score(real_y_val, targ_y_preds)}")
rmse = mean_squared_error(real_y_val, targ_y_preds)**(1/2)
print(f"RF RMSE: {rmse}")

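#score on the held-out test set
targ_test_preds = targ_RF.predict(real_X_test)
print(f"TEST - RF Accuracy: {accuracy_score(real_y_test, targ_test_preds)}")
test_rmse = mean_squared_error(real_y_test, targ_test_preds)**(1/2)
print(f"TEST - RF RMSE: {test_rmse}")

joblib.dump(targ_RF, './models/distinct_rf_base.pkl')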

RF Accuracy: 0.7782485875706214
RF RMSE: 0.6520285599096508
TEST - RF Accuracy: 0.768361581920904
TEST - RF RMSE: 0.6621314666258092


['./models/distinct_rf_base.pkl']

# Voting

In [18]:
#Our test set: X_test, y_test
real_X_test = X_test.drop(['Target','dropout','enrolled','graduate'],axis=1)
real_y_test_dropout = X_test['dropout']
real_y_test_enrolled = X_test['enrolled']
real_y_test_graduate = X_test['graduate']
real_y_test_target = X_test['Target']

#Our Models: do_RF, er_RF, gr_RF, targ_RF
#Dropout Preds
dropout_preds = do_RF.predict(real_X_test)
#print(f"Dropout Acc: {accuracy_score(real_y_test_dropout, dropout_preds)}")

#Enrollment Preds
enrolled_preds = er_RF.predict(real_X_test)

#Graduated Preds
graduate_preds = gr_RF.predict(real_X_test)

#Base Preds
targ_preds = targ_RF.predict(real_X_test)

#Make a DataFrame
vote_df = pd.DataFrame()
vote_df['True_Target'] = y_test
vote_df['dropout_pred'] = dropout_preds
vote_df['enrolled_pred'] = enrolled_preds
vote_df['graduate_pred'] = graduate_preds
vote_df['target_pred'] = targ_preds

display(vote_df.sample(10))

Unnamed: 0,True_Target,dropout_pred,enrolled_pred,graduate_pred,target_pred
955,0,1,0,0,0
3480,2,0,0,1,2
1731,0,1,0,0,0
3875,0,1,0,0,0
1776,2,0,0,1,2
1606,2,0,0,0,2
1820,1,1,0,0,0
4392,1,0,0,1,2
2002,2,0,0,1,2
3104,2,0,0,1,2


# Voting Analysis
We could consider different ways to set up our voting:
- Unanimous: Accept a prediction only when every model agrees (very strict).
- One vs. Rest: Take the class of whichever binary model predicts "true", and use target_pred to break ties or to decide when all three say "false".

Let's try both and compare them against the single base model.
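
The two schemes listed above can be sketched for a single student as follows. This is a minimal pure-Python illustration, not the notebook's pandas implementation. The function names are hypothetical, and the one-vs-rest version assumes a fallback to the base model's class whenever the binary votes do not single out exactly one class.

```python
CLASSES = (0, 1, 2)  # 0 = Dropout, 1 = Enrolled, 2 = Graduate

def is_unanimous(binary_votes, base_pred):
    """True when exactly one binary model fires and it matches the base model."""
    return sum(binary_votes) == 1 and binary_votes[base_pred] == 1

def ovr_vote(binary_votes, base_pred):
    """Pick the single class voted 'true'; otherwise defer to the base model."""
    fired = [c for c in CLASSES if binary_votes[c] == 1]
    return fired[0] if len(fired) == 1 else base_pred

# Arguments: (dropout_pred, enrolled_pred, graduate_pred), target_pred
print(is_unanimous((1, 0, 0), 0), ovr_vote((1, 0, 0), 0))  # all models agree
print(is_unanimous((0, 0, 1), 1), ovr_vote((0, 0, 1), 1))  # binary vote overrides base
print(is_unanimous((0, 0, 0), 1), ovr_vote((0, 0, 0), 1))  # no vote: fall back to base
```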

In [78]:
#Add columns for voting schemes

#Unanimous voting
vote_df['unanimous_voting'] = vote_df[['dropout_pred','enrolled_pred','graduate_pred']].sum(axis=1) == 1
check_targ_w_binary = ((vote_df['target_pred']==0) & (vote_df['dropout_pred']==1)) | ((vote_df['target_pred']==1) & (vote_df['enrolled_pred']==1)) | ((vote_df['target_pred']==2) & (vote_df['graduate_pred']==1))
vote_df['unanimous_voting'] = vote_df['unanimous_voting'] & check_targ_w_binary
#Unanimous prediction
#Note: rows without unanimity fall back to target_pred, so unanimous_pred always equals target_pred.
vote_df['unanimous_pred'] = vote_df.loc[(vote_df['unanimous_voting']), 'target_pred']
vote_df['unanimous_pred'] = np.where(vote_df['unanimous_pred'].isna(), vote_df['target_pred'],vote_df['unanimous_pred'])

#One vs. Rest voting
#Take the column of the single "true" binary vote; mark every other row as a tie.
vote_df['ovr_voting'] = vote_df[['dropout_pred','enrolled_pred','graduate_pred']].idxmax(axis=1).where(vote_df[['dropout_pred','enrolled_pred','graduate_pred']].sum(axis=1) == 1, "tie")
ovr_map = {'dropout_pred':0, 'enrolled_pred':1,'graduate_pred':2}
vote_df['ovr_pred'] = vote_df['ovr_voting'].replace(to_replace=ovr_map)
#Note: ties (and rows with no "true" vote) are currently assigned class 1 (Enrolled) rather than target_pred.
vote_df['ovr_pred'] = np.where(vote_df['ovr_pred']=="tie",1, vote_df['ovr_pred'])
#Save the non-unanimous rows for inspection
vote_df.loc[~vote_df['unanimous_voting']].to_csv('_temp.csv')
display(vote_df.sample(10))

Unnamed: 0,True_Target,dropout_pred,enrolled_pred,graduate_pred,target_pred,unanimous_voting,unanimous_pred,ovr_voting,ovr_pred
1843,1,0,0,1,2,True,2.0,graduate_pred,2
3828,0,0,0,0,0,False,0.0,tie,1
1425,2,0,0,1,2,True,2.0,graduate_pred,2
966,0,1,0,0,0,True,0.0,dropout_pred,0
1859,1,1,0,0,0,True,0.0,dropout_pred,0
1735,2,0,0,1,2,True,2.0,graduate_pred,2
4041,2,0,0,1,2,True,2.0,graduate_pred,2
2399,2,0,0,1,2,True,2.0,graduate_pred,2
3971,2,0,0,1,2,True,2.0,graduate_pred,2
637,1,0,0,0,1,False,1.0,tie,1


In [77]:
#Evaluate voting
eval_df = vote_df.copy()[['True_Target','target_pred','unanimous_pred','ovr_pred']]
display(eval_df.head())

#RMSE
tar_rmse = mean_squared_error(eval_df['True_Target'],eval_df['target_pred'])**(1/2)
print(f"Single RF voting RMSE: {tar_rmse}")
una_rmse = mean_squared_error(eval_df['True_Target'],eval_df['unanimous_pred'])**(1/2)
print(f"Unanimous voting RMSE: {una_rmse}")
ovr_rmse = mean_squared_error(eval_df['True_Target'],eval_df['ovr_pred'])**(1/2)
print(f"One vs Rest voting RMSE: {ovr_rmse}")

#Accuracy Score
tar_acc = accuracy_score(eval_df['True_Target'],eval_df['target_pred'])
print(tar_acc)
una_acc = accuracy_score(eval_df['True_Target'],eval_df['unanimous_pred'])
print(una_acc)
ovr_acc = accuracy_score(eval_df['True_Target'],eval_df['ovr_pred'].astype(int))
print(ovr_acc)
#print(eval_df['ovr_pred'].unique())



Unnamed: 0,True_Target,target_pred,unanimous_pred,ovr_pred
1853,2,2,2.0,2
2399,2,2,2.0,2
510,1,0,0.0,0
242,2,2,2.0,2
3392,2,2,2.0,2


Single RF voting RMSE: 0.6621314666258092
Unanimous voting RMSE: 0.6621314666258092
One vs Rest voting RMSE: 0.7561424551163481
0.768361581920904
0.768361581920904
0.7299435028248588
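
Accuracy and RMSE hide which classes are being confused with each other. A confusion matrix makes that visible. Below is a small dependency-free sketch using only the five rows shown in the `eval_df.head()` output above; the helper name is illustrative, and scikit-learn's `confusion_matrix` would do the same job on the full test set.

```python
def confusion_counts(y_true, y_pred, n_classes=3):
    """Return an n_classes x n_classes table: rows are true labels, columns are predictions."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        table[t][p] += 1
    return table

# True_Target and ovr_pred for the five rows displayed above.
y_true = [2, 2, 1, 2, 2]
y_pred = [2, 2, 0, 2, 2]

for row in confusion_counts(y_true, y_pred):
    print(row)
```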


In [None]:
#https://docs.google.com/spreadsheets/d/1D9pZBTCTDLqvc7OvQI6YX_9Y9PNQi3Wzvseoi-m2I_I/edit?gid=1951636908#gid=1951636908