
# DSCI 531 Project Literature Review

Predicting student dropout with machine learning lets educational institutions identify at-risk students early and intervene in time. However, these predictive models can inadvertently introduce biases that lead to unfair treatment of certain student groups. This project analyzes and aims to improve the fairness of student dropout prediction models using the "Predict Students' Dropout and Academic Success" dataset from the UC Irvine Machine Learning Repository (https://doi.org/10.24432/C5MC89). The target column of this dataset is imbalanced: Graduate accounts for 50\% of the records, compared to 32\% for Dropout and 18\% for Enrolled. Specifically, we investigate potential biases related to demographic and socioeconomic factors and explore mitigation strategies to ensure equitable predictions. Such potential biases include algorithmic bias (e.g. age, gender, and racial biases) and measurement bias (e.g. uneven numbers of students across the undergraduate degrees).

Research Questions:

1. Bias Identification: Does the dropout prediction model exhibit biases against specific demographic groups, such as gender, socioeconomic status, or first-generation college students?

2. Fairness Metrics Application: How can fairness metrics, such as demographic parity, equalized odds, and disparate impact, be applied to assess and quantify bias in dropout prediction models?

3. Bias Mitigation Techniques: What methods can be employed to mitigate identified biases in student dropout predictions while maintaining or improving model performance?

Methodology:

1. Data Preprocessing:

Dataset Overview: The dataset comprises information available at the time of student enrollment, including academic paths, demographic details, and socioeconomic factors, along with academic performance data at the end of the first and second semesters.

Data Cleaning: Address missing values, remove duplicates, and encode categorical variables appropriately.

Feature Selection: Identify and select relevant features that contribute significantly to predicting student dropout rates.
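
The encoding step mentioned under Data Cleaning can be sketched as follows. This is a minimal illustration on a toy frame, assuming that an integer-coded nominal column such as `Course` should be one-hot encoded so that a model does not read the codes as ordered quantities; the values are made up and the choice of `pd.get_dummies` is an assumption, not necessarily what the project uses.

```python
import pandas as pd

# Toy frame (illustrative values): one nominal code column, one numeric column.
toy = pd.DataFrame({
    "Course": [171, 9254, 9070, 9254],
    "Previous qualification (grade)": [122.0, 160.0, 122.0, 100.0],
})

# One-hot encode the nominal column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Course"], prefix="Course")
print(encoded.columns.tolist())
```

Tree-based models can often work with the raw integer codes, so this kind of encoding matters mostly for linear models and neural networks.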

2. Model Training:

Algorithm Selection: Implement several machine learning models (logistic regression, random forest and balanced random forest, XGBoost, and neural networks) to predict student dropout. We also compare their performance and record how the effect of bias mitigation differs between algorithms.

Training and Validation: Split the data into training and testing sets (e.g. 70\% train, 30\% test) and employ k-fold cross-validation to ensure model robustness.
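
The split-and-validate step can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the project's pipeline. Stratifying the split and using `StratifiedKFold` rather than plain `KFold` are assumptions, chosen here because the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the dropout data: three imbalanced classes (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.5, 0.32, 0.18], random_state=42,
)

# 70/30 split, stratified so each class keeps its share in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# 10-fold cross-validation on the training part only; the test part stays untouched.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_tr, y_tr, cv=cv)
print("mean CV accuracy: %.3f (+/- %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

Keeping cross-validation inside the training portion means the held-out 30\% is used only once, for the final comparison between models.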

3. Fairness Analysis:

Metric Calculation: Compute fairness metrics, including demographic parity (ensuring equal positive prediction rates across groups), equalized odds (equalizing true positive and false positive rates across groups), and disparate impact (assessing the ratio of positive outcomes between groups).

Bias Detection: Analyze metric results to identify any significant disparities in model predictions across different demographic/nationality groups (which include Portuguese, German, Angolan, Guinean, Mexican, and Cuban in our dataset).
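
The three metrics named above can be sketched for the simplest case: a binary prediction and two groups. The helper names `group_rates` and `fairness_report` are hypothetical, the toy arrays are made up, and treating one class (for example Dropout) as the positive outcome is an assumption about how the three-class target would be binarized.

```python
import numpy as np

def group_rates(y_true, y_pred, group, value):
    """Selection rate, true positive rate and false positive rate for one group."""
    mask = group == value
    yt, yp = y_true[mask], y_pred[mask]
    selection = yp.mean()          # P(pred = 1 | group)
    tpr = yp[yt == 1].mean()       # P(pred = 1 | true = 1, group)
    fpr = yp[yt == 0].mean()       # P(pred = 1 | true = 0, group)
    return selection, tpr, fpr

def fairness_report(y_true, y_pred, group, privileged, unprivileged):
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, privileged)
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group, unprivileged)
    return {
        # difference in positive prediction rates (0 means parity)
        "demographic_parity_diff": sel_u - sel_p,
        # ratio of positive prediction rates (1 means parity)
        "disparate_impact": sel_u / sel_p,
        # largest gap in TPR or FPR between the groups (0 means equalized odds)
        "equalized_odds_diff": max(abs(tpr_u - tpr_p), abs(fpr_u - fpr_p)),
    }

# Toy data: 1 = positive outcome, two groups "a" and "b".
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 1, 0, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

report = fairness_report(y_true, y_pred, group, privileged="a", unprivileged="b")
print(report)
```

The sketch assumes every group contains both true positives and true negatives and that the privileged group has a non-zero selection rate; small groups would need explicit handling of empty slices.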

4. Bias Mitigation:

Techniques Implementation: Apply methods such as reweighting data samples, adversarial debiasing, and fairness-aware learning algorithms to reduce identified biases.

Evaluation: Assess the effectiveness of mitigation strategies by comparing pre- and post-mitigation fairness metrics and model performance indicators.
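
Reweighting data samples, the first technique listed above, can be sketched as follows. This is one common formulation, in which each row is weighted by P(group) * P(label) / P(group, label) so that group and label become independent under the weights. The function name `reweighing_weights` and the toy data are hypothetical, and the project may use a different scheme.

```python
import pandas as pd

def reweighing_weights(group: pd.Series, label: pd.Series) -> pd.Series:
    """Weight each row by P(group) * P(label) / P(group, label)."""
    df = pd.DataFrame({"g": group.to_numpy(), "y": label.to_numpy()})
    p_g = df["g"].map(df["g"].value_counts(normalize=True))
    p_y = df["y"].map(df["y"].value_counts(normalize=True))
    p_gy = df.groupby(["g", "y"])["g"].transform("size") / len(df)
    return p_g * p_y / p_gy

# Toy data: group "f" has a higher share of positive labels than group "m".
group = pd.Series(["f", "f", "f", "m", "m", "m", "m", "m"])
label = pd.Series([1, 1, 0, 1, 0, 0, 0, 0])

weights = reweighing_weights(group, label)
print(weights.round(4).tolist())
```

The resulting weights can be passed as `sample_weight` to the `fit` method of estimators that support it, such as scikit-learn's `LogisticRegression` and `RandomForestClassifier`.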

Evaluation Plan:

Performance Metrics: Evaluate models using accuracy, precision, recall, and F1-score to measure predictive performance.

Fairness Assessment: Utilize the aforementioned fairness metrics to quantify bias levels before and after applying mitigation techniques.

Trade-off Analysis: Examine the balance between model fairness and predictive accuracy to ensure that bias reduction does not significantly compromise performance.

Dataset Creation and Usage:

Creation: The dataset was compiled from a higher education institution, integrating data from several disjoint databases related to students enrolled in different undergraduate degree programs such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

Previous Applications: Prior studies have used this dataset to develop classification models aimed at predicting student dropout and academic success. For example, researchers have explored the impact of different variables on dropout rates and academic performance, implementing machine learning techniques to identify key predictors.

Related Work:

Bias in Predictive Models: Previous research has highlighted concerns regarding biases in educational predictive models. A study from Cornell University examined the implications of including or excluding protected attributes (e.g. gender, race) in dropout prediction models and found that removing such attributes did not necessarily improve fairness or accuracy. A second study argued that accuracy improvements must be weighed carefully against potential algorithmic discrimination; its authors obtained parity in GFPR (Generalized False Positive Rate) while preserving calibration, so that probability scores do not need group-dependent interpretation. After bias mitigation, the GFPR ratio for both dropout and dropout-or-underperformance predictions moved close to the ideal value of 1 across most groups (defined by nationality, gender, high school type and location, and admission grade), and parity in other metrics (AUC (Area Under the Curve) and GFNR (Generalized False Negative Rate)) also improved for most groups compared with the unmitigated model. A third study noted that binarizing sensitive attributes for mitigation may not reduce fairness gaps for every group, so unfairness should be evaluated at the subgroup level. A fourth study found that slicing analysis can be used to improve model fairness without necessarily sacrificing performance; its authors found no evidence of a strict tradeoff between performance and fairness.

Fairness and Uncertainty: Another study investigated the integration of fairness and uncertainty in student dropout predictions, emphasizing the need for models that not only predict outcomes accurately but also provide uncertainty estimates to inform decision-making.

Cross-Institutional Transferability: Research has also delved into the transferability of predictive models across institutions, assessing how models trained in one context perform in another and the fairness implications therein.

Novel Contribution: While existing studies have addressed bias in student dropout prediction models, this project distinguishes itself by:

Dataset-Specific Analysis: Conducting a comprehensive fairness analysis using the specified UCI dataset, which encompasses a diverse student population across various disciplines.

Mitigation Strategy Evaluation: Systematically implementing and evaluating multiple bias mitigation techniques to determine their effectiveness in enhancing model fairness without compromising accuracy.

Holistic Approach: Combining traditional performance metrics with fairness assessments to provide a balanced evaluation of different predictive models in educational settings.

# DSCI 531 Project Code

## Installations and Importing the Dataset

The installation and import steps below follow the instructions given on the UC Irvine Machine Learning Repository page for the "Predict Students' Dropout and Academic Success" dataset (https://doi.org/10.24432/C5MC89).

In [None]:
!pip install ucimlrepo  # Install the ucimlrepo package

# Import dataset into code
from ucimlrepo import fetch_ucirepo

# fetch dataset
predict_students_dropout_and_academic_success = fetch_ucirepo(id=697)

# data (as pandas dataframes)
X = predict_students_dropout_and_academic_success.data.features
y = predict_students_dropout_and_academic_success.data.targets

# metadata
print(predict_students_dropout_and_academic_success.metadata)

# variable information
print(predict_students_dropout_and_academic_success.variables)

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
{'uci_id': 697, 'name': "Predict Students' Dropout and Academic Success", 'repository_url': 'https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success', 'data_url': 'https://archive.ics.uci.edu/static/public/697/data.csv', 'abstract': "A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.\nThe dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters. \nThe data is used to build clas

In [None]:
# inspect the full fetched object; pandas truncates each dataframe to its first 5 and last 5 rows
predict_students_dropout_and_academic_success

{'data': {'ids': None,
  'features':       Marital Status  Application mode  Application order  Course  \
  0                  1                17                  5     171   
  1                  1                15                  1    9254   
  2                  1                 1                  5    9070   
  3                  1                17                  2    9773   
  4                  2                39                  1    8014   
  ...              ...               ...                ...     ...   
  4419               1                 1                  6    9773   
  4420               1                 1                  2    9773   
  4421               1                 1                  1    9500   
  4422               1                 1                  1    9147   
  4423               1                10                  1    9773   
  
        Daytime/evening attendance  Previous qualification  \
  0                              1              

In [None]:
X  # view the features dataframe

Unnamed: 0,Marital Status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0,0.000000,0,10.8,1.4,1.74
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,0,6,0,0,0.000000,0,10.8,1.4,1.74
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,0,6,10,5,12.400000,0,9.4,-0.8,-3.12
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,0,6,6,6,13.000000,0,13.9,-0.3,0.79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,1,1,6,9773,1,1,125.0,1,1,1,...,0,0,6,8,5,12.666667,0,15.5,2.8,-4.06
4420,1,1,2,9773,1,1,120.0,105,1,1,...,0,0,6,6,2,11.000000,0,11.1,0.6,2.02
4421,1,1,1,9500,1,1,154.0,1,37,37,...,0,0,8,9,1,13.500000,0,13.9,-0.3,0.79
4422,1,1,1,9147,1,1,180.0,1,37,37,...,0,0,5,6,5,12.000000,0,9.4,-0.8,-3.12


In [None]:
X.isna().sum() # check for missing values

Unnamed: 0,0
Marital Status,0
Application mode,0
Application order,0
Course,0
Daytime/evening attendance,0
Previous qualification,0
Previous qualification (grade),0
Nacionality,0
Mother's qualification,0
Father's qualification,0


None of the columns appear to have any missing values.

In [None]:
int(X.duplicated().sum()) # check for duplicates

0

There do not appear to be any duplicate rows.

In [None]:
y # view the target dataframe

Unnamed: 0,Target
0,Dropout
1,Graduate
2,Dropout
3,Graduate
4,Graduate
...,...
4419,Graduate
4420,Dropout
4421,Dropout
4422,Graduate
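
The class shares quoted in the introduction can be checked directly from the target column. The sketch below uses a hypothetical helper `class_shares` on toy labels so that it runs on its own; in this notebook the equivalent call would presumably be `class_shares(y["Target"])`.

```python
import pandas as pd

def class_shares(target: pd.Series) -> pd.Series:
    """Return each class's share of the records, largest class first."""
    return target.value_counts(normalize=True)

# Toy labels for illustration only, not the real counts.
toy_target = pd.Series(["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2)
shares = class_shares(toy_target)
print(shares.round(2))
```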


In [None]:
# another way to view/access the data
import pandas as pd
uci_data = pd.read_csv("data.csv", sep = ";")
uci_data

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.000000,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.000000,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.400000,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.000000,0,13.9,-0.3,0.79,Graduate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,1,1,6,9773,1,1,125.0,1,1,1,...,0,6,8,5,12.666667,0,15.5,2.8,-4.06,Graduate
4420,1,1,2,9773,1,1,120.0,105,1,1,...,0,6,6,2,11.000000,0,11.1,0.6,2.02,Dropout
4421,1,1,1,9500,1,1,154.0,1,37,37,...,0,8,9,1,13.500000,0,13.9,-0.3,0.79,Dropout
4422,1,1,1,9147,1,1,180.0,1,37,37,...,0,5,6,5,12.000000,0,9.4,-0.8,-3.12,Graduate


In [None]:
uci_data.isna().sum() # check for missing values

Unnamed: 0,0
Marital status,0
Application mode,0
Application order,0
Course,0
Daytime/evening attendance\t,0
Previous qualification,0
Previous qualification (grade),0
Nacionality,0
Mother's qualification,0
Father's qualification,0


In [None]:
sum(uci_data.duplicated()) # check for duplicated values

0
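
The CSV header shown above contains a trailing tab in `Daytime/evening attendance\t`, and `Marital status` is capitalized differently from the `ucimlrepo` version, so column lookups by name can fail silently. A minimal sketch of one way to clean the header, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame whose header mimics the trailing tab seen in the CSV (illustrative data).
toy = pd.DataFrame({"Daytime/evening attendance\t": [1, 0], "Course": [171, 9254]})

# Drop leading and trailing whitespace, including tabs, from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

Applying the same `str.strip()` call to `uci_data.columns` is an assumption about how the project would handle it; the rest of this notebook uses `X` and `y` from `ucimlrepo`, where the stray tab does not appear.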

## Model Training
We implement several machine learning models (logistic regression, random forest and balanced random forest, XGBoost, and a neural network) to predict student dropout, then compare their performance and record how the effect of bias mitigation differs between algorithms. We split the data into training and testing sets (70\% train, 30\% test) and set up k-fold cross-validation to check model robustness.
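
The features in this dataset sit on very different scales (course codes in the thousands, rates close to zero), which can slow down gradient-based models such as logistic regression and neural networks. A minimal sketch of one common remedy, standardizing inside a pipeline, is shown here on synthetic data. The name `scaled_log_model` and the `max_iter` value are assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data with one feature blown up to a much larger scale.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_demo[:, 0] *= 10_000

# The scaler is fitted together with the model, so it only ever sees the data passed to fit().
scaled_log_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
)
scaled_log_model.fit(X_demo, y_demo)
print("training accuracy: %.3f" % scaled_log_model.score(X_demo, y_demo))
```

Because the scaler lives inside the pipeline, the same object can be passed to cross-validation without leaking statistics from the validation folds.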

In [None]:
# split the data into train and test sets (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
from sklearn.model_selection import train_test_split
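# 70% train, 30% test; pass the target as a 1-D Series (y["Target"]) so the estimators do not warn about a column vector
X_train, X_test, y_train, y_test = train_test_split(X, y["Target"], test_size=0.3, random_state=42)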

# define a 10-fold cross-validation splitter (https://www.geeksforgeeks.org/cross-validation-using-k-fold-with-scikit-learn/)
# note: kf is only defined here; it is not passed to any of the models below
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)

# implement logistic regression, random forest/balanced random forest, XGBoost, and neural networks models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import to_categorical
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

# create logistic regression model
log_model = LogisticRegression(random_state=42)
log_model.fit(X_train, y_train) # fit the model
# Make predictions and evaluate
log_y_pred = log_model.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, log_y_pred))
print("Logistic Regression Classification Report:\n", classification_report(y_test, log_y_pred))

# create random forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train) # fit the model
# Make predictions and evaluate
rf_y_pred = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, rf_y_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_y_pred))

# create balanced random forest model (https://medium.com/@fadleemt/balanced-random-forest-d5dc9c896bb4)
brf_model = BalancedRandomForestClassifier(random_state=42)
brf_model.fit(X_train, y_train) # fit the model
# Make predictions and evaluate
brf_y_pred = brf_model.predict(X_test)

print("Balanced Random Forest Accuracy:", accuracy_score(y_test, brf_y_pred))
print("Balanced Random Forest Classification Report:\n", classification_report(y_test, brf_y_pred))

# initialize LabelEncoder (XGBoost requires target variable to be numerical)
le = LabelEncoder() # https://xgboosting.com/label-encode-categorical-target-variable-for-xgboost/
# Fit and transform the target variable
y_train_encoded = le.fit_transform(y_train)
# create XGBoost model (https://www.projectpro.io/recipes/use-xgboost-classifier-and-regressor-in-python)
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train_encoded) # fit the model
# make predictions and transform labels back to original encoding to evaluate
xgb_y_pred = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, le.inverse_transform(xgb_y_pred)))
print("XGBoost Classification Report:\n", classification_report(y_test, le.inverse_transform(xgb_y_pred)))

# create neural network model (https://machinelearningmastery.com/how-to-calculate-precision-recall-f1-and-more-for-deep-learning-models/)
from keras.layers import Input
nn_model = Sequential()
nn_model.add(Input(shape=(X_train.shape[1],)))  # one input per feature (36 here)
nn_model.add(Dense(100, activation='relu'))
nn_model.add(Dense(3, activation='softmax'))
# compile model
nn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# encode the test labels with the encoder already fitted on the training labels
y_test_encoded = le.transform(np.ravel(y_test))
# convert integer labels to one-hot encoded vectors (https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical)
y_train_vec = to_categorical(y_train_encoded, num_classes = 3)
y_test_vec = to_categorical(y_test_encoded, num_classes = 3)
history = nn_model.fit(X_train, y_train_vec, validation_data=(X_test, y_test_vec), epochs=300, verbose=0) # fit the model
# evaluate the model
_, train_accuracy = nn_model.evaluate(X_train, y_train_vec, verbose=0)
_, test_accuracy = nn_model.evaluate(X_test, y_test_vec, verbose=0)
print('Neural Network Training Data Accuracy: %.3f, Neural Network Test Accuracy: %.3f' % (train_accuracy, test_accuracy))
# predict class probabilities for test set (shape: n_samples x 3)
nn_yhat_probs = nn_model.predict(X_test, verbose=0)
# predict classes for test set (https://stackoverflow.com/questions/68776790/model-predict-classes-is-deprecated-what-to-use-instead)
nn_yhat_classes = np.argmax(nn_yhat_probs, axis=1)
# compare against the integer labels: flattening the one-hot matrix gives 3 values per sample
# and raises "inconsistent numbers of samples" in the metric functions
print("Neural Network Classification Report:\n", classification_report(y_test_encoded, nn_yhat_classes, target_names=le.classes_))
# multiclass ROC AUC for Neural Network (one-vs-rest, macro-averaged)
from sklearn.metrics import roc_auc_score, confusion_matrix
nn_auc = roc_auc_score(y_test_encoded, nn_yhat_probs, multi_class='ovr')
print('ROC AUC for Neural Network: %f' % nn_auc)
# Neural Network confusion matrix
nn_matrix = confusion_matrix(y_test_encoded, nn_yhat_classes)
print("Neural Network Confusion Matrix")
print(nn_matrix)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  return fit_method(estimator, *args, **kwargs)


Logistic Regression Accuracy: 0.6701807228915663
Logistic Regression Classification Report:
               precision    recall  f1-score   support

     Dropout       0.75      0.63      0.68       441
    Enrolled       0.42      0.05      0.09       245
    Graduate       0.65      0.94      0.77       642

    accuracy                           0.67      1328
   macro avg       0.60      0.54      0.51      1328
weighted avg       0.64      0.67      0.61      1328

Random Forest Accuracy: 0.759789156626506
Random Forest Classification Report:
               precision    recall  f1-score   support

     Dropout       0.80      0.79      0.79       441
    Enrolled       0.54      0.29      0.37       245
    Graduate       0.77      0.92      0.84       642

    accuracy                           0.76      1328
   macro avg       0.71      0.66      0.67      1328
weighted avg       0.74      0.76      0.74      1328



  return fit_method(estimator, *args, **kwargs)


Balanced Random Forest Accuracy: 0.7620481927710844
Balanced Random Forest Classification Report:
               precision    recall  f1-score   support

     Dropout       0.84      0.74      0.79       441
    Enrolled       0.49      0.57      0.53       245
    Graduate       0.83      0.85      0.84       642

    accuracy                           0.76      1328
   macro avg       0.72      0.72      0.72      1328
weighted avg       0.77      0.76      0.77      1328



  y = column_or_1d(y, warn=True)


XGBoost Accuracy: 0.7725903614457831
XGBoost Classification Report:
               precision    recall  f1-score   support

     Dropout       0.83      0.77      0.80       441
    Enrolled       0.58      0.43      0.49       245
    Graduate       0.79      0.91      0.84       642

    accuracy                           0.77      1328
   macro avg       0.73      0.70      0.71      1328
weighted avg       0.76      0.77      0.76      1328



  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
  y = column_or_1d(y, warn=True)


Neural Network Training Data Accuracy: 0.426, Neural Network Test Accuracy: 0.403


ValueError: Found input variables with inconsistent numbers of samples: [3984, 1328]