# Models – Predicting Student Performance Using Machine Learning

This notebook implements machine learning regression models to predict student academic performance from features such as Moodle or Caronte activity, past grades, participation, and other contextual factors. By anticipating students' future marks, the system aims to identify at-risk students early and send alerts so that intervention and support arrive in time.

The ultimate goal of this predictive approach is to enhance student outcomes by allowing educators and support systems to take data-driven, proactive actions rather than reactive ones.

### Objectives:

- Build and evaluate multiple regression models (e.g., Linear Regression, Random Forest, Gradient Boosting).

- Select the best-performing model for prediction.

- Provide predicted marks for individual students.

- Enable early warning systems based on risk thresholds.
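
As a rough illustration of the last objective, the sketch below flags rows whose predicted mark falls below a threshold. The pass mark of 5.0 on a 0-10 scale and the helper `flag_at_risk` are assumptions made for this example, not part of the notebook; the column name `predicted_nota_assignatura` matches the one created later in Step 4, and the toy values are made up.

```python
import pandas as pd

PASS_MARK = 5.0  # assumption: marks are on a 0-10 scale with 5.0 as the pass mark


def flag_at_risk(preds: pd.DataFrame, threshold: float = PASS_MARK) -> pd.DataFrame:
    """Return a copy of `preds` with a boolean `at_risk` column.

    A row is flagged when its predicted mark is strictly below `threshold`.
    """
    out = preds.copy()
    out["at_risk"] = out["predicted_nota_assignatura"] < threshold
    return out


# Toy predictions (made-up values, for illustration only)
toy_preds = pd.DataFrame({
    "id_anonim": ["a", "a", "b"],
    "assignatura": ["Àlgebra", "Càlcul", "Àlgebra"],
    "predicted_nota_assignatura": [3.2, 6.1, 4.9],
})
flagged = flag_at_risk(toy_preds)
```

In practice the threshold would probably need tuning against the model's error, since a model that systematically under-predicts would flag too many students.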

### Step 1: Importing data and splitting it into train (historical) and test (2023/24)

Do NOT include any 2023/24 data in training: doing so would leak future information into the model.

In [2]:
"""
Note: I am not sure the right dataframe is selected here; this is a first pass at the modelling code.
"""
from DATASET import clean_df
import pandas as pd
import numpy as np

# Load dataset
df = clean_df.copy()
df = df.dropna(subset=['nota_assignatura'])

# Separate current year data (to predict)
df_train = df[df['curs_academic'] != '2023/24'].copy()
df_pred_target = df[df['curs_academic'] == '2023/24'].copy()

                              Estudi Curs acadèmic  \
0  Graduat en Enginyeria Informàtica       2020/21   
1  Graduat en Enginyeria Informàtica       2020/21   
2  Graduat en Enginyeria Informàtica       2020/21   
3  Graduat en Enginyeria Informàtica       2020/21   
4  Graduat en Enginyeria Informàtica       2020/21   

                          Id Anonim  Sexe                 Assignatura  \
0  1DFB71F2B000D1421808D0B3F67B335E  Home                     Àlgebra   
1  1DFB71F2B000D1421808D0B3F67B335E  Home                      Càlcul   
2  1DFB71F2B000D1421808D0B3F67B335E  Home  Electricitat i Electrònica   
3  1DFB71F2B000D1421808D0B3F67B335E  Home      Fonaments d'Enginyeria   
4  1DFB71F2B000D1421808D0B3F67B335E  Home     Fonaments d'Informàtica   

   Codi assignatura  Nota_assignatura     Via Accés Estudi  Nota d'accés  \
0            103801               0.0  BATX. / COU AMB PAU        10.798   
1            103802               0.0  BATX. / COU AMB PAU        10.798   
2       
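
The split above could also be wrapped in a small function that checks for leakage. This is only a sketch on toy data: `split_by_year` is a hypothetical helper, and the check assumes that year labels such as `2022/23` sort chronologically as strings and that none are missing.

```python
import pandas as pd

TARGET_YEAR = "2023/24"  # the academic year held out for prediction


def split_by_year(df: pd.DataFrame, year_col: str = "curs_academic",
                  target_year: str = TARGET_YEAR):
    """Split `df` into (historical rows, rows for `target_year`)."""
    is_target = df[year_col] == target_year
    train, test = df[~is_target].copy(), df[is_target].copy()
    # Guard against leakage: every training year must precede the held-out year.
    # Assumes labels like "2022/23" compare chronologically as strings.
    assert (train[year_col] < target_year).all(), "training data is not strictly historical"
    return train, test


# Toy frame (made-up rows, for illustration only)
toy = pd.DataFrame({
    "curs_academic": ["2021/22", "2022/23", "2023/24", "2023/24"],
    "nota_assignatura": [6.0, 7.5, 5.6, 5.4],
})
toy_train, toy_test = split_by_year(toy)
```

The assertion would also catch rows from a later year (for example `2024/25`) slipping into the training set, which a plain `!=` filter lets through.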

### Step 2: Feature Engineering

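Some additional features are needed, such as each student's average past grade, the student's performance in similar courses, and the historical mean grade per subject (`assignatura`). The cell below computes the first and the last of these, using the training data only.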

In [3]:
# Average past grade per student
avg_student_grade = df_train.groupby('id_anonim')['nota_assignatura'].mean().reset_index()
avg_student_grade.columns = ['id_anonim', 'avg_nota_estudiant']

# Average grade per assignatura
avg_subject_grade = df_train.groupby('assignatura')['nota_assignatura'].mean().reset_index()
avg_subject_grade.columns = ['assignatura', 'avg_nota_assignatura']

# Merge with training data
df_train = df_train.merge(avg_student_grade, on='id_anonim', how='left')
df_train = df_train.merge(avg_subject_grade, on='assignatura', how='left')

# Apply same to prediction target data
df_pred_target = df_pred_target.merge(avg_student_grade, on='id_anonim', how='left')
df_pred_target = df_pred_target.merge(avg_subject_grade, on='assignatura', how='left')
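
One caveat with the cell above: on training rows, `avg_nota_estudiant` is computed over all of that student's training grades, including the grade of the row itself, so the feature partly encodes the target. A possible alternative, sketched here on toy data, is a leave-one-out mean. The helper `loo_group_mean` and the column `avg_nota_estudiant_loo` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd


def loo_group_mean(df: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Mean of `value_col` within each group, excluding the current row.

    Rows that are alone in their group get NaN (no other rows to average).
    """
    grouped = df.groupby(group_col)[value_col]
    total = grouped.transform("sum")
    count = grouped.transform("count")
    # Subtract the row's own value, then divide by the remaining row count
    return (total - df[value_col]) / (count - 1).replace(0, np.nan)


# Toy frame (made-up grades, for illustration only)
toy = pd.DataFrame({
    "id_anonim": ["a", "a", "a", "b"],
    "nota_assignatura": [4.0, 6.0, 8.0, 5.0],
})
toy["avg_nota_estudiant_loo"] = loo_group_mean(toy, "id_anonim", "nota_assignatura")
```

For the 2023/24 rows the plain historical mean remains appropriate, because those grades are never part of the average.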


### Step 3: Preparing features

In [12]:
# Select features
features = ['assignatura', 'sexe', 'via_acces_estudi', 'dedicacio_de_l_estudiant', 'beca_concedida',
            'estudis_mare', 'estudis_pare', 'avg_nota_estudiant', 'avg_nota_assignatura']
target = 'nota_assignatura'

# Split into X, y
X_train = df_train[features]
y_train = df_train[target]
X_test = df_pred_target[features]
X_test

| | assignatura | sexe | via_acces_estudi | dedicacio_de_l_estudiant | beca_concedida | estudis_mare | estudis_pare | avg_nota_estudiant | avg_nota_assignatura |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Àlgebra | Dona | Batx. / cou amb pau | Temps complet | No | Graus universitaris o diplomatures universitàries | Batxillerat | 2.763636 | 6.313152 |
| 1 | Àlgebra | Dona | Batx. / cou amb pau | Temps complet | No | Graus universitaris o diplomatures universitàries | Batxillerat | 2.763636 | 6.313152 |
| 2 | Àlgebra | Dona | Batx. / cou amb pau | Temps complet | No | Graus universitaris o diplomatures universitàries | Batxillerat | 2.763636 | 6.313152 |
| 3 | Àlgebra | Dona | Batx. / cou amb pau | Temps complet | No | Graus universitaris o diplomatures universitàries | Batxillerat | 2.763636 | 6.313152 |
| 4 | Càlcul | Dona | Batx. / cou amb pau | Temps complet | No | Graus universitaris o diplomatures universitàries | Batxillerat | 2.763636 | 5.609170 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114728 | Visió i aprenentatge | Dona | Batx. / cou amb pau | Temps complet | No | Diplomat o enginyer tèc. | Diplomat o enginyer tèc. | 4.777778 | 7.156175 |
| 114729 | Visió i aprenentatge | Dona | Batx. / cou amb pau | Temps complet | No | Diplomat o enginyer tèc. | Diplomat o enginyer tèc. | 4.777778 | 7.156175 |
| 114730 | Visió i aprenentatge | Dona | Batx. / cou amb pau | Temps complet | No | Diplomat o enginyer tèc. | Diplomat o enginyer tèc. | 4.777778 | 7.156175 |
| 114731 | Visió i aprenentatge | Dona | Batx. / cou amb pau | Temps complet | No | Diplomat o enginyer tèc. | Diplomat o enginyer tèc. | 4.777778 | 7.156175 |
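
Because the averages are merged with `how='left'`, any 2023/24 student or subject with no historical rows ends up with `NaN` in `avg_nota_estudiant` or `avg_nota_assignatura`, and scikit-learn's `LinearRegression` does not accept `NaN` inputs. The sketch below counts the gaps and fills them with means computed on the training set only. Falling back to the training mean is an assumption chosen for simplicity, and `fill_with_train_mean` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def fill_with_train_mean(X_train: pd.DataFrame, X_test: pd.DataFrame, cols):
    """Fill NaN in `cols` of both frames using means computed on X_train only."""
    train_means = X_train[cols].mean()
    X_train_filled = X_train.copy()
    X_test_filled = X_test.copy()
    X_train_filled[cols] = X_train_filled[cols].fillna(train_means)
    X_test_filled[cols] = X_test_filled[cols].fillna(train_means)
    return X_train_filled, X_test_filled


# Toy frames (made-up values, for illustration only)
toy_train = pd.DataFrame({"avg_nota_estudiant": [4.0, 6.0],
                          "avg_nota_assignatura": [5.0, 7.0]})
toy_test = pd.DataFrame({"avg_nota_estudiant": [np.nan, 8.0],
                         "avg_nota_assignatura": [6.0, np.nan]})

# Count missing values per column before filling
missing_before = toy_test.isna().sum()
_, toy_test_filled = fill_with_train_mean(
    toy_train, toy_test, ["avg_nota_estudiant", "avg_nota_assignatura"]
)
```

Computing the fill values on the training set alone keeps the 2023/24 data out of the fitted statistics, in line with the rule from Step 1.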


### Step 4: Models

...
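
One way the candidate models named in the objectives could be compared is sketched below, with each regressor wrapped in the same kind of one-hot encoding pipeline as the Linear Regression cell that follows. The data is synthetic, `build_pipeline` is a hypothetical helper, and the settings (`n_estimators=50`, `cv=3`, `random_state=0`) are illustrative assumptions rather than tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (not the notebook's dataset)
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    "assignatura": rng.choice(["Àlgebra", "Càlcul"], size=n),
    "avg_nota_estudiant": rng.uniform(0, 10, size=n),
})
y_toy = 0.8 * X_toy["avg_nota_estudiant"] + rng.normal(0, 1, size=n)


def build_pipeline(regressor, categorical_cols):
    """Wrap `regressor` behind one-hot encoding of the categorical columns."""
    pre = ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough",
    )
    return Pipeline(steps=[("preprocessor", pre), ("regressor", regressor)])


candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Cross-validated RMSE per model (lower is better)
rmse = {}
for name, reg in candidates.items():
    scores = cross_val_score(build_pipeline(reg, ["assignatura"]), X_toy, y_toy,
                             cv=3, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

best_name = min(rmse, key=rmse.get)
```

Random K-fold splits ignore the time ordering of academic years, so for the real data a validation split by year (for example, holding out the most recent historical year) would likely be more faithful to the Step 1 rule.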

### Linear Regression


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Encode categorical variables
categorical_features = X_train.select_dtypes(include='object').columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)],
    remainder='passthrough'
)

# Create pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # TODO: this performs very poorly, so it needs to be reviewed
    ('regressor', LinearRegression())
])

# Train model
model.fit(X_train, y_train)

# Align the test set with the cleaned features
df_pred_target = df_pred_target.loc[X_test.index].copy()

# Predict for 2023/24
df_pred_target['predicted_nota_assignatura'] = model.predict(X_test)

Reviewing the predictions of the Linear Regression model against the actual 2023/24 marks.

In [16]:
# Show predictions for 2023/24 students
df_pred_target[['id_anonim', 'assignatura', 'nota_assignatura', 'predicted_nota_assignatura']][:20]

| | id_anonim | assignatura | nota_assignatura | predicted_nota_assignatura |
|---|---|---|---|---|
| 0 | 0f45e92fec9ab96236dc5ad7f96713e2 | Àlgebra | 5.6 | 3.19561 |
| 1 | 0f45e92fec9ab96236dc5ad7f96713e2 | Àlgebra | 5.6 | 3.19561 |
| 2 | 0f45e92fec9ab96236dc5ad7f96713e2 | Àlgebra | 5.6 | 3.19561 |
| 3 | 0f45e92fec9ab96236dc5ad7f96713e2 | Àlgebra | 5.6 | 3.19561 |
| 4 | 0f45e92fec9ab96236dc5ad7f96713e2 | Càlcul | 5.4 | 2.487509 |
| 5 | 0f45e92fec9ab96236dc5ad7f96713e2 | Càlcul | 5.4 | 2.487509 |
| 6 | 0f45e92fec9ab96236dc5ad7f96713e2 | Càlcul | 5.4 | 2.487509 |
| 7 | 0f45e92fec9ab96236dc5ad7f96713e2 | Càlcul | 5.4 | 2.487509 |
| 8 | 0f45e92fec9ab96236dc5ad7f96713e2 | Electricitat i electrònica | 6.3 | 3.306296 |
| 9 | 0f45e92fec9ab96236dc5ad7f96713e2 | Electricitat i electrònica | 6.3 | 3.306296 |
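
Since the 2023/24 rows carry actual marks, the predictions can be scored rather than only inspected. The sketch below computes RMSE, MAE, and mean bias on the three distinct rows visible in the output above; applying it to the full `df_pred_target` would follow the same pattern. The choice of these three metrics is an assumption, as the notebook does not name an evaluation metric.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# The three distinct rows visible in the output above
scored = pd.DataFrame({
    "assignatura": ["Àlgebra", "Càlcul", "Electricitat i electrònica"],
    "nota_assignatura": [5.6, 5.4, 6.3],
    "predicted_nota_assignatura": [3.19561, 2.487509, 3.306296],
})

y_true = scored["nota_assignatura"]
y_pred = scored["predicted_nota_assignatura"]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large errors more
mae = mean_absolute_error(y_true, y_pred)            # average absolute error
bias = (y_pred - y_true).mean()                      # negative means under-prediction
```

On these rows every prediction falls below the actual mark by roughly 2.4 to 3 points, so the bias is negative. The output also appears to repeat identical rows, so it may be worth checking for duplicates before computing metrics on the full table.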
