## Title

In [1]:
import zipfile

import altair as alt
import numpy as np
import pandas as pd
import requests
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Summary

# Introduction

# Methods

## Data

The data set used in this project is the Student Performance data set created by Paulo Cortez from the University of Minho, Portugal (Cortez and Silva 2008). It was sourced from the UCI Machine Learning Repository (Cortez 2008) and can be found [here](https://archive.ics.uci.edu/dataset/320/student+performance). The data were collected from students at two Portuguese secondary schools through school reports and questionnaires. The repository provides two files: one for the Mathematics course (395 students) and one for the Portuguese language course (649 students). This analysis uses the Mathematics file (`student-mat.csv`). Each row represents a student with 30 features, including demographic information (age, sex, family size), educational background (parental education, past failures, study time), social factors (going out, romantic relationships, alcohol consumption), and school-related features (absences, extra support, desire for higher education). The data set also includes the first-period grade (G1), the second-period grade (G2), and the final grade (G3); G3 is the target variable for prediction.


## Analysis

Linear regression was used to build a predictive model of students' final grades (G3). All other variables in the original data set, including the period grades G1 and G2, were used as predictors. The data were split into a training set (75%) and a test set (25%). Before model fitting, all numerical variables were standardized and all categorical variables were one-hot encoded, with the preprocessing fitted on the training set only. Root mean squared error (RMSE), mean absolute error (MAE), and the $R^2$ score were used to evaluate model performance on the test set.
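
For reference, the three evaluation metrics follow their standard definitions, sketched below. The symbols are introduced here for illustration and do not appear elsewhere in this report: $y_i$ is the observed final grade of test student $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean observed grade in the test set, and $n$ is the number of test students.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are expressed in the same units as G3, and RMSE penalizes large errors more heavily than MAE. $R^2$ is unitless and compares the model against always predicting $\bar{y}$.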

# Results & Discussion

In [2]:
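# Download the data as a zip archive and extract it
import os

url = "https://archive.ics.uci.edu/static/public/320/student+performance.zip"

# Make sure the target directory exists before writing to it
os.makedirs("../data/raw", exist_ok=True)

request = requests.get(url)
request.raise_for_status()  # stop early if the download failed
with open("../data/raw/student+performance.zip", 'wb') as f:
    f.write(request.content)

# Extract the downloaded archive
with zipfile.ZipFile("../data/raw/student+performance.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/raw")

# Extract the nested student.zip archive
with zipfile.ZipFile("../data/raw/student.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/raw")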


In [3]:
# Read the Math performance data
student_df = pd.read_csv("../data/raw/student-mat.csv", sep=';')

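# Separate the predictors from the target (final grade, G3)
X = student_df.drop('G3', axis=1)
y = student_df['G3']

# Record which columns are numeric and which are categorical for preprocessing
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Hold out 25% of the rows as a test set; random_state makes the split reproducible
np.random.seed(522)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=522
)

# Reattach the target so each saved file holds predictors and G3 together
student_train = pd.concat([X_train, y_train], axis=1)
student_test = pd.concat([X_test, y_test], axis=1)

student_train.to_csv("../data/processed/student_train.csv", index=False)
student_test.to_csv("../data/processed/student_test.csv", index=False)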

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")


Training set size: 296
Test set size: 99


In [4]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(drop='if_binary', handle_unknown='ignore'), categorical_features),
    verbose_feature_names_out=False
)

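# Fit the preprocessing on the training set only, then apply it to both sets,
# so that no information from the test set leaks into the transformation
preprocessor.fit(X_train)
scaled_X_train = preprocessor.transform(X_train)
scaled_X_test = preprocessor.transform(X_test)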

scaled_student_train = pd.DataFrame(
    scaled_X_train, 
    columns=preprocessor.get_feature_names_out()
)
scaled_student_train['G3'] = y_train.values

scaled_student_test = pd.DataFrame(
    scaled_X_test,
    columns=preprocessor.get_feature_names_out()
)
scaled_student_test['G3'] = y_test.values

scaled_student_train.to_csv("../data/processed/scaled_student_train.csv", index=False)
scaled_student_test.to_csv("../data/processed/scaled_student_test.csv", index=False)

# References

Cortez, P. (2008). _Student Performance_ [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5TG7T.

Cortez, P. & Silva, A. (2008). Using data mining to predict secondary school student performance. _EUROSIS._ https://doi.org/10.24432/C5TG7T.

Ma, Y., Liu, B., Wong, C., Yu, P., & Lee, S. (2000). Targeting the right students using data mining. _Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining,_ 457-464. https://doi.org/10.1145/347090.347184.

Pritchard, M. & Wilson, G. (2003). Using Emotional and Social Factors to Predict Student Success. _Journal of College Student Development,_ 44, 18-28. https://doi.org/10.1353/csd.2003.0008.

Johora, F. T., Hasan, M. N., Rajbongshi, A., Ashrafuzzaman, M., & Akter, F. (2025). An explainable AI-based approach for predicting undergraduate students academic performance. _Array,_ 26, 100384. https://doi.org/10.1016/j.array.2025.100384.