In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Welcome to my machine learning project!

In this notebook, I take you through the entire development process of a machine learning and data science project, starting from data exploration to model evaluation. Initially, this notebook was created as part of an interview process, and I've adapted it here for a broader audience. I hope you find it insightful and useful. If you appreciate the work, feel free to leave feedback or an upvote. Thank you!

The project started with a choice of problem: build a clustering model to profile the students, or build a predictive model to forecast whether a student would pass or fail their modules.

I chose the prediction problem. The data provided was the same dataset used here; a more complete description of it is available at https://analyse.kmi.open.ac.uk/open_dataset.

# Model Evaluation: Choosing the Best Approach

To assess the performance of our models, we'll fit them using different feature sets: one with both `weighted_grade` and `pass_rate`, and then two versions where we remove one of the features in each dataset. This will help us understand the impact of these features.

In [None]:
studInfo=pd.read_csv("/kaggle/input/open-university-learning-analytics-dataset/anonymiseddata/studentInfo.csv")
assessments=pd.read_csv("/kaggle/input/open-university-learning-analytics-dataset/anonymiseddata/assessments.csv")
studAss=pd.read_csv("/kaggle/input/open-university-learning-analytics-dataset/anonymiseddata/studentAssessment.csv")
studVle=pd.read_csv("/kaggle/input/open-university-learning-analytics-dataset/anonymiseddata/studentVle.csv")
vle=pd.read_csv("/kaggle/input/open-university-learning-analytics-dataset/anonymiseddata/vle.csv")

# Model 1: Logistic Regression
Logistic Regression is a linear model used for binary classification tasks. It will serve as our first baseline for predicting student performance.

# Part 1: Feature Engineering

Here we discuss how the raw tables were turned into features suitable for building the model.

# Model 2: Linear Discriminant Analysis (LDA)
LDA is another linear classification technique that works by finding a linear combination of features that best separates the classes.

In [None]:
exams=assessments[assessments["assessment_type"]=="Exam"]
others=assessments[assessments["assessment_type"]!="Exam"]
#Number of non-exam assessments per module presentation
amounts=others.groupby(["code_module","code_presentation"])["id_assessment"].count()
amounts=amounts.reset_index()
amounts.head()

# Model 3: Random Forest
Random Forest is an ensemble learning method that combines multiple decision trees. It's useful for both classification and regression tasks, and it tends to improve model accuracy by reducing overfitting.

In [None]:
stud_ass

# Model 4: Neural Network Classifier
The Neural Network model mimics the human brain by using layers of interconnected neurons. It is well-suited for capturing complex relationships in data, although it can be prone to overfitting with small datasets.

In [None]:
#Pass rate per student per module
pass_rate=pd.merge((stud_ass[stud_ass["pass"]==True].groupby(["id_student","code_module","code_presentation"]).count()["pass"]).reset_index(),amounts,how="left",on=["code_module","code_presentation"])
pass_rate["pass_rate"]=pass_rate["pass"]/pass_rate["id_assessment"]
pass_rate.drop(["pass","id_assessment"], axis=1,inplace=True)
pass_rate.head()
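To make the pass-rate calculation concrete, here is a minimal sketch on a toy table. The rows are invented, the `toy_` names are hypothetical, and the pass rule (a score of at least 40) is an assumption, since the cell that builds `stud_ass` is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for stud_ass: invented rows, same key columns as the notebook.
toy_stud_ass = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2, 2],
    "code_module": ["AAA"] * 6,
    "code_presentation": ["2013J"] * 6,
    "score": [80, 35, 60, 20, 30, 10],
})
# Assumption: an assessment counts as passed when the score is at least 40.
toy_stud_ass["pass"] = toy_stud_ass["score"] >= 40

# Toy stand-in for amounts: non-exam assessments per module presentation.
toy_amounts = pd.DataFrame({
    "code_module": ["AAA"],
    "code_presentation": ["2013J"],
    "id_assessment": [3],
})

keys = ["id_student", "code_module", "code_presentation"]
# Summing the boolean column counts the passed assessments per student.
passed = toy_stud_ass.groupby(keys)["pass"].sum().reset_index()
toy_pass_rate = passed.merge(toy_amounts, how="left", on=["code_module", "code_presentation"])
toy_pass_rate["pass_rate"] = toy_pass_rate["pass"] / toy_pass_rate["id_assessment"]
print(toy_pass_rate[keys + ["pass_rate"]])
```

One design difference is worth noting: filtering on `pass==True` before counting drops students who passed nothing, while summing the boolean column keeps them with a pass rate of 0.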

# Final Reflections on Model Performance

* The models that included both `weighted_grade` and `pass_rate` generally outperformed the versions with either feature removed, indicating that both features are valuable predictors.

* Neural networks struggled to predict student failures accurately but demonstrated strong overall performance, likely benefiting from class imbalance.

* Other models like Logistic Regression and Random Forest can be useful in scenarios such as identifying top-performing students for awards or scholarships.

* There are many additional features that could be engineered to further enhance these models. Feel free to explore the dataset further and experiment with other features!

# VLE

The VLE (Virtual Learning Environment) tables record how students interacted with the reference content made available throughout each module presentation. From this data we can infer how engaged a student was with their subjects, whether they studied on a regular basis, and how they used the content.

In [None]:
vle

In [None]:
vle[~vle["week_from"].isna()]
#Only 1121 of the 6364 entries have a reference week for the material (the week in which it would be used in the course).
#With so few reference weeks available, building a metric to track study commitment becomes impractical

In [None]:
studVle.head()

In [None]:
#Here we track, per student and material, the average time after the start of the course at which the material was used
#and the average number of clicks per material
avg_per_site=studVle.groupby(["id_student","id_site","code_module","code_presentation"]).mean().reset_index()
avg_per_site.head()

In [None]:
#General average per student per module
avg_per_student=avg_per_site.groupby(["id_student","code_module","code_presentation"])[["date","sum_click"]].mean().reset_index()
avg_per_student.head()
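Averaging in two steps (first per material, then per student) gives every material the same weight, however often it was opened. The sketch below uses an invented click log with hypothetical names to show how this "mean of means" can differ from a single average over the raw rows.

```python
import pandas as pd

# Invented click log: one student, two materials with very different visit counts.
toy_vle = pd.DataFrame({
    "id_student": [1, 1, 1, 1],
    "id_site": [10, 10, 10, 20],
    "sum_click": [1, 1, 1, 9],
})

# Step 1: average clicks per material, as in avg_per_site.
per_site = toy_vle.groupby(["id_student", "id_site"])["sum_click"].mean().reset_index()
# Step 2: average of those per-material averages, as in avg_per_student.
mean_of_means = per_site.groupby("id_student")["sum_click"].mean()
# For comparison: a single average over all raw rows.
overall_mean = toy_vle.groupby("id_student")["sum_click"].mean()

print(mean_of_means.loc[1], overall_mean.loc[1])
```

Here the two-step average is 5.0 while the raw average is 3.0, because the rarely visited material counts as much as the frequently visited one.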

# StudentInfo

The studentInfo table contains various information about the students, but the relevant fields for this analysis are:

* The number of times the student has previously attempted the module
* The student's final result

The latter is the target variable for our prediction model.

In [None]:
#Removing the cases where the student withdrew their registration from the module
studInfo=studInfo[studInfo["final_result"]!="Withdrawn"]
studInfo=studInfo[["code_module","code_presentation","id_student","num_of_prev_attempts","final_result"]]
studInfo.head()

# Compiling all relevant tables

In [None]:
df_1=pd.merge(avg_grade,pass_rate,how="inner",on=["id_student","code_module","code_presentation"])
assessment_info=pd.merge(df_1, stud_exams, how="inner", on=["id_student","code_module","code_presentation"])
assessment_info.head()

In [None]:
df_2=pd.merge(studInfo,assessment_info,how="inner",on=["id_student","code_module","code_presentation"])
final_df=pd.merge(df_2,avg_per_student,how="inner", on=["id_student","code_module","code_presentation"])
final_df.drop(["id_student","code_module","code_presentation"],axis=1,inplace=True)
final_df.head()
#The final dataframe only has information relevant to the problem

# Part 2: EDA

We start the exploratory data analysis by checking the integrity of the dataframe.

In [None]:
final_df.describe()

In [None]:
final_df.info()

# Because the target feature is categorical, it cannot be included in a correlation matrix, but we can see a tendency towards correlation among the grading features (*weighted_grade*, *pass_rate* and *exam_score*)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8,6))
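#numeric_only=True skips the categorical final_result column, which would otherwise raise an error in pandas 2.0 and later
sns.heatmap(final_df.corr(numeric_only=True),annot=True)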
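If you still want a rough look at how a numeric feature relates to the categorical target, one option is to one-hot encode the target and correlate each indicator column with the feature. This is a minimal sketch on invented rows, not a step from the original analysis; only the column names follow the notebook.

```python
import pandas as pd

# Invented rows; only the column names follow the notebook.
toy_df = pd.DataFrame({
    "weighted_grade": [20, 35, 55, 60, 85, 90],
    "final_result": ["Fail", "Fail", "Pass", "Pass", "Distinction", "Distinction"],
})

# One indicator column per class avoids imposing an order on the labels.
dummies = pd.get_dummies(toy_df["final_result"], dtype=float)
combined = pd.concat([toy_df[["weighted_grade"]], dummies], axis=1)
target_corr = combined.corr()["weighted_grade"].drop("weighted_grade")
print(target_corr)
```

Each value is the correlation between the grade and membership of one class, so a negative value for `Fail` and a positive one for `Distinction` is what you would expect if grades drive the outcome.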

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(data=final_df, x="final_result")

# With a "Pass" count much higher than the other labels, we must pay attention to the performance metrics for the models and analyse the least represented cases more closely

In [None]:
sns.pairplot(final_df)

# On the pairplot we can detect two outliers: one with an average click count far above the rest, and another that is the sole occurrence of its number of previous attempts. To keep our data as consistent as possible, these cases will be removed

In [None]:
final_df[final_df["sum_click"]>10]

In [None]:
final_df[final_df["num_of_prev_attempts"]>4]

In [None]:
final_df=final_df[final_df["sum_click"]<=10]
final_df=final_df[final_df["num_of_prev_attempts"]<=4]
final_df.head()

# Part 3: Modeling

For the modeling step we will use the following techniques and models:

* A hold-out train/test split paired with classification reports and confusion matrices to evaluate model performance
* Logistic Regression
* Linear Discriminant Analysis
* Random Forest
* Neural Network Classifier
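The cells in this notebook evaluate each model on a single hold-out split. A k-fold cross-validation variant could be sketched as follows; the data is synthetic (`make_classification`), and the fold count and scoring choice are assumptions rather than settings from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the notebook's feature matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, weights=[0.3, 0.7], random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```

Stratified folds keep the class proportions similar in every fold, which matters when one label dominates.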

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
X=final_df.drop("final_result", axis=1)
y=final_df["final_result"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
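Because the labels are imbalanced, a plain random split can leave the test set with a different class mix from the training set. A possible refinement, shown here on invented labels with hypothetical names, is to pass `stratify` and a fixed `random_state` to `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented imbalanced labels: 80% "Pass", 20% "Fail".
y_demo = pd.Series(["Pass"] * 80 + ["Fail"] * 20)
X_demo = pd.DataFrame({"feature": range(100)})

# stratify keeps the class proportions equal in both parts;
# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
print(y_te.value_counts(normalize=True))
```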

# Given the high correlation between *weighted_grade* and *pass_rate*, the models will be fit to three feature sets: one with both features, and two with one of them removed

In [None]:
#1 contains both, 2 just pass_rate and 3 just weighted_grade
X1_test=X_test
X1_train=X_train
X2_test=X_test.drop("weighted_grade",axis=1)
X2_train=X_train.drop("weighted_grade",axis=1)
X3_test=X_test.drop("pass_rate",axis=1)
X3_train=X_train.drop("pass_rate",axis=1)

In [None]:
scaler1=MinMaxScaler()
scaler2=MinMaxScaler()
scaler3=MinMaxScaler()

In [None]:
X1_train=scaler1.fit_transform(X1_train)
X1_test=scaler1.transform(X1_test)
X2_train=scaler2.fit_transform(X2_train)
X2_test=scaler2.transform(X2_test)
X3_train=scaler3.fit_transform(X3_train)
X3_test=scaler3.transform(X3_test)
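The scalers are fit on the training data only and then reused on the test data, so no information from the test set leaks into preprocessing. A small sketch with made-up numbers shows one consequence: test values outside the training range end up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from the training data only
test_scaled = scaler.transform(test)        # reuses those training statistics

print(test_scaled.ravel())  # 12.0 maps to about 1.2, outside [0, 1]
```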

# Model 1: Logistic Regression

$$
\text{precision} = \frac{TP}{TP + FP}
$$

$$
\text{recall} = \frac{TP}{TP + FN}
$$

$$
\text{F1-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}
$$
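As a quick check of these formulas, the sketch below computes them by hand from a confusion matrix and compares the result with scikit-learn's own functions. The labels are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented binary labels: 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

With more than two classes, `classification_report` applies the same definitions once per class, treating that class as positive and all others as negative.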
 

In [None]:
lr1=LogisticRegression(max_iter=10000)
lr1.fit(X1_train,y_train)
result_lr1=lr1.predict(X1_test)
print(confusion_matrix(y_test,result_lr1))
print("\n")
print(classification_report(y_test,result_lr1))

In [None]:
lr2=LogisticRegression(max_iter=10000)
lr2.fit(X2_train,y_train)
result_lr2=lr2.predict(X2_test)
print(confusion_matrix(y_test,result_lr2))
print("\n")
print(classification_report(y_test,result_lr2))

In [None]:
lr3=LogisticRegression(max_iter=10000)
lr3.fit(X3_train,y_train)
result_lr3=lr3.predict(X3_test)
print(confusion_matrix(y_test,result_lr3))
print("\n")
print(classification_report(y_test,result_lr3))

# Model 2: LDA

In [None]:
lda1=LinearDiscriminantAnalysis()
lda1.fit(X1_train,y_train)
result_lda1=lda1.predict(X1_test)
print(confusion_matrix(y_test,result_lda1))
print("\n")
print(classification_report(y_test,result_lda1))

In [None]:
lda2=LinearDiscriminantAnalysis()
lda2.fit(X2_train,y_train)
result_lda2=lda2.predict(X2_test)
print(confusion_matrix(y_test,result_lda2))
print("\n")
print(classification_report(y_test,result_lda2))

In [None]:
lda3=LinearDiscriminantAnalysis()
lda3.fit(X3_train,y_train)
result_lda3=lda3.predict(X3_test)
print(confusion_matrix(y_test,result_lda3))
print("\n")
print(classification_report(y_test,result_lda3))

# Model 3: Random Forest

In [None]:
rf1=RandomForestClassifier(n_estimators=300)
rf1.fit(X1_train,y_train)
result_rf1=rf1.predict(X1_test)
print(confusion_matrix(y_test,result_rf1))
print("\n")
print(classification_report(y_test,result_rf1))

In [None]:
rf2=RandomForestClassifier(n_estimators=300)
rf2.fit(X2_train,y_train)
result_rf2=rf2.predict(X2_test)
print(confusion_matrix(y_test,result_rf2))
print("\n")
print(classification_report(y_test,result_rf2))

In [None]:
rf3=RandomForestClassifier(n_estimators=300)
rf3.fit(X3_train,y_train)
result_rf3=rf3.predict(X3_test)
print(confusion_matrix(y_test,result_rf3))
print("\n")
print(classification_report(y_test,result_rf3))

# Model 4: Neural Network Classifier

In [None]:
model1=Sequential()

model1.add(Dense(6, activation="relu"))
model1.add(Dropout(0.5))
model1.add(Dense(3, activation="relu"))
model1.add(Dense(1, activation="sigmoid"))

model1.compile(loss="binary_crossentropy", optimizer="adam")

In [None]:
model2=Sequential()

model2.add(Dense(5, activation="relu"))
model2.add(Dropout(0.5))
model2.add(Dense(3, activation="relu"))
model2.add(Dense(1, activation="sigmoid"))

model2.compile(loss="binary_crossentropy", optimizer="adam")

In [None]:
model3=Sequential()

model3.add(Dense(5, activation="relu"))
model3.add(Dropout(0.5))
model3.add(Dense(3, activation="relu"))
model3.add(Dense(1, activation="sigmoid"))

model3.compile(loss="binary_crossentropy", optimizer="adam")

In [None]:
#For the neural network training, the outputs needed to be encoded, and in order to avoid imposing an order
#on the classes I chose to classify the distinction cases together with the pass cases
def categories(cat):
    if cat=="Fail":
        return 0
    if cat=="Pass":
        return 1
    if cat=="Distinction":
        return 1
    
y_test=list(map(categories,y_test))
y_train=list(map(categories,y_train))
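The same encoding can also be written as a dictionary lookup with `Series.map`. This is an alternative sketch on invented labels, not the code used above; the names `labels`, `to_binary` and `encoded` are hypothetical.

```python
import pandas as pd

# Invented labels using the same three classes as final_result.
labels = pd.Series(["Fail", "Pass", "Distinction", "Pass"])

# Distinction is merged into the positive class, as in categories().
to_binary = {"Fail": 0, "Pass": 1, "Distinction": 1}
encoded = labels.map(to_binary)

# An unexpected label (for example "Withdrawn") would become NaN, so check for it.
assert encoded.notna().all()
print(encoded.to_numpy())
```

The explicit NaN check makes an unmapped label fail loudly instead of silently passing `None` values to the model.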

In [None]:
y_train=np.asarray(y_train)
y_test=np.asarray(y_test)
early_stop=EarlyStopping(monitor="val_loss", mode="min", verbose=1, patience=25)

In [None]:
model1.fit(x=X1_train, y=y_train, epochs=2000, validation_data=(X1_test,y_test),callbacks=[early_stop])

In [None]:
losses=pd.DataFrame(model1.history.history)
losses.plot()

In [None]:
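#predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output instead
predictions=(model1.predict(X1_test)>0.5).astype("int32")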
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

In [None]:
model2.fit(x=X2_train, y=y_train, epochs=2000, validation_data=(X2_test,y_test),callbacks=[early_stop])

In [None]:
losses=pd.DataFrame(model2.history.history)
losses.plot()

In [None]:
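#predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output instead
predictions=(model2.predict(X2_test)>0.5).astype("int32")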
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

In [None]:
model3.fit(x=X3_train, y=y_train, epochs=2000, validation_data=(X3_test,y_test),callbacks=[early_stop])

In [None]:
losses=pd.DataFrame(model3.history.history)
losses.plot()

In [None]:
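#predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output instead
predictions=(model3.predict(X3_test)>0.5).astype("int32")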
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

# We have successfully developed models to predict student performance, but how can we choose one?

* First of all, the models that included both *weighted_grade* and *pass_rate* performed better overall than their counterparts with one feature removed, suggesting that our hypothesis (that the two highly correlated features were redundant) was wrong

* The neural network classifiers had difficulty predicting the failure cases, but overall performed better than the other models, probably because merging the Distinction cases into Pass removed one class.

* The other models could be used in headhunting programs that select students who are very likely to graduate with distinction and offer them scholarships, jobs, etc.

* Although I created a lot of features, I wonder how many others could be created for this problem and how they would improve model performance. If you are curious too, fork this notebook and give it a try!
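One way to make this choice more systematic is to collect a few summary metrics for every candidate in a single table. The sketch below does so on synthetic data with two of the model types used here; the metric choices (macro F1 and recall on the minority class) are assumptions, and the numbers it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 0 plays the role of the minority "Fail" class.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, weights=[0.25, 0.75], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=10000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "macro_f1": f1_score(y_te, pred, average="macro"),
        "minority_recall": recall_score(y_te, pred, pos_label=0),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.sort_values("macro_f1", ascending=False))
```

Macro F1 weights every class equally, so a model that ignores the rare class cannot hide behind a high overall accuracy.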

Thanks for reading my kernel and keep learning!