# KNN Regression: Predicting Student Final Grades

## Project Overview

This project uses **K-Nearest Neighbors (KNN) regression** to predict students’ final math grades (G3) based on study behavior, attendance, and prior academic performance.

Beyond prediction, the project examines how **model complexity (choice of k)** affects performance and illustrates the **bias–variance tradeoff** in a supervised learning setting. The analysis also highlights the dominant predictive role of prior academic outcomes in educational data.

**Why this matters:**  
Knowing which factors most strongly predict academic performance helps distinguish models that are merely accurate from models that are actionable for early intervention.

## Dataset
The analysis uses the Student Performance (Math) dataset from the UCI Machine Learning Repository.  
The target variable is **G3** (final grade), and predictors include:
- Study time
- Absences
- Grades from earlier terms (G1, G2)

## Methods Overview
- Data cleaning and feature selection  
- Train/test split  
- Feature scaling  
- KNN regression  
- Hyperparameter tuning using RMSE  
- Visualization using Altair

## Data Preparation

In [2]:
# Data handling
import pandas as pd

# Preprocessing, pipelines, and model selection
from sklearn.model_selection import GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import set_config

# import the K-NN regression model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Visualization
import altair as alt

In [3]:
pip install ucimlrepo


In [4]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
student_performance = fetch_ucirepo(id=320)

# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets


In [5]:
# combine data to make it a single dataframe (for reference)
student_df = pd.concat([X, y], axis="columns")
student_df.columns
student_df.head()

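|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |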


## Train-Test Split


In [6]:
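# Keep the selected predictors and the target, then drop rows with missing values.
# Chaining dropna() avoids calling inplace=True on a slice of the original frame.
student_df = student_df[['studytime', 'absences', 'G1', 'G2', 'G3']].dropna()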

student_X = student_df.drop(columns='G3')
student_y = student_df['G3']

In [7]:
# Split the data into training and testing sets
student_X_train, student_X_test, student_y_train, student_y_test = train_test_split(
    student_X, student_y,
    test_size=0.2,
    random_state=42
)

## Model Pipeline and Hyperparameter Tuning

KNN regression was chosen because it is a non-parametric method that can capture nonlinear relationships between prior grades and final performance without assuming a fixed functional form.
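
To make the mechanics concrete, here is a minimal NumPy sketch of how a KNN regressor forms a prediction: it averages the targets of the k closest training points. The function name `knn_predict` and the toy data are illustrative assumptions, not part of the project code, which relies on scikit-learn's `KNeighborsRegressor`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query point to every training row
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy data (hypothetical): two already-scaled features and a grade-like target
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
y_train = np.array([8.0, 10.0, 12.0, 18.0])

prediction = knn_predict(X_train, y_train, np.array([1.2, 1.1]), k=2)
print(prediction)
```

With `k=1` the prediction copies the single closest target (low bias, high variance); as `k` approaches the size of the training set it converges to the overall mean (high bias, low variance).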


In [8]:
# Standardize every predictor, then chain the scaler and the KNN regressor
# in one pipeline so scaling is refit on each cross-validation training fold
preprocessor = make_column_transformer(
    (StandardScaler(), student_X.columns),
    remainder="drop"
)
pipeline = make_pipeline(
    preprocessor,
    KNeighborsRegressor()
)
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 31)
}

gridsearch = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error"
)
gridsearch.fit(student_X_train, student_y_train)

In [9]:
results = pd.DataFrame(gridsearch.cv_results_)
results.head()

|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsregressor__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.043306 | 0.024744 | 0.031206 | 0.022015 | 1 | `{'kneighborsregressor__n_neighbors': 1}` | -2.579505 | -2.20794 | -1.839732 | -1.562788 | -1.765363 | -1.991066 | 0.360711 | 30 |
| 1 | 0.010428 | 0.00152 | 0.008248 | 0.001149 | 2 | `{'kneighborsregressor__n_neighbors': 2}` | -1.782662 | -2.062136 | -1.606777 | -1.478207 | -1.309969 | -1.64795 | 0.258626 | 29 |
| 2 | 0.009211 | 0.000288 | 0.007001 | 0.001065 | 3 | `{'kneighborsregressor__n_neighbors': 3}` | -1.556624 | -1.942826 | -1.446703 | -1.339728 | -1.11332 | -1.47984 | 0.273977 | 15 |
| 3 | 0.005753 | 0.000127 | 0.004568 | 0.000126 | 4 | `{'kneighborsregressor__n_neighbors': 4}` | -1.501802 | -1.802776 | -1.543052 | -1.316729 | -1.22152 | -1.477176 | 0.20103 | 13 |
| 4 | 0.009313 | 0.00288 | 0.008151 | 0.001638 | 5 | `{'kneighborsregressor__n_neighbors': 5}` | -1.473745 | -1.766679 | -1.52366 | -1.341211 | -1.132572 | -1.447573 | 0.209209 | 9 |
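
`GridSearchCV` always maximizes its score, which is why RMSE appears negated in `mean_test_score`. The sketch below flips the sign back to make the table easier to read. The numbers are copied from the five `cv_results_` rows displayed above (k = 1 to 5 only), and the column name `mean_rmse` is an assumption introduced here for readability.

```python
import pandas as pd

# Mean and spread of the CV scores for k = 1..5, copied from the rows shown above
cv_scores = pd.DataFrame({
    "n_neighbors": [1, 2, 3, 4, 5],
    "mean_test_score": [-1.991066, -1.647950, -1.479840, -1.477176, -1.447573],
    "std_test_score": [0.360711, 0.258626, 0.273977, 0.201030, 0.209209],
})

# scikit-learn stores RMSE negated so that "higher is better"; undo that here
cv_scores["mean_rmse"] = -cv_scores["mean_test_score"]

# Row with the lowest cross-validated RMSE among the values listed
best_row = cv_scores.loc[cv_scores["mean_rmse"].idxmin()]
best_k_in_subset = int(best_row["n_neighbors"])

print(cv_scores[["n_neighbors", "mean_rmse", "std_test_score"]])
print(best_k_in_subset)
```

Within these five rows, error falls steadily as k grows from 1 to 5, which is the high-variance end of the bias–variance tradeoff. The same transformation applied to the full `results` frame covers all 30 values of k.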


## Results

In [10]:
gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 6}

In [11]:
predicted = gridsearch.predict(student_X_test)

rmse_knn = mean_squared_error(
    y_true=student_y_test,
    y_pred=predicted
) ** 0.5

rmse_knn

1.3764658076603717

Hyperparameter tuning using 5-fold cross-validation identified **k = 6** as the optimal number of neighbors.

On the held-out test set, the final KNN model achieved:

- **RMSE ≈ 1.38**

Because RMSE is expressed in the same units as the target (grades on a 0–20 scale), this corresponds to a typical prediction error of roughly **1.4 grade points**.
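
RMSE squares each error before averaging, so it weights large misses more heavily than the mean absolute error (MAE) does and is never smaller than it. The sketch below illustrates the difference on a handful of hypothetical grades. These values are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted grades (illustrative only)
actual = np.array([10.0, 12.0, 14.0, 8.0, 16.0])
predicted_grades = np.array([11.0, 12.0, 13.0, 8.0, 12.0])

errors = predicted_grades - actual        # [1, 0, -1, 0, -4]
rmse = np.sqrt(np.mean(errors ** 2))      # sqrt((1 + 0 + 1 + 0 + 16) / 5)
mae = np.mean(np.abs(errors))             # (1 + 0 + 1 + 0 + 4) / 5

print(rmse, mae)
```

Here a single four-point miss pulls RMSE well above MAE, so reporting both can show whether a model's error is driven by a few large misses.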

## Plots


In [12]:
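# Pair each test-set grade with its prediction
results_plot_df = pd.DataFrame({
    "Actual Grade": student_y_test,
    "Predicted Grade": predicted
})
scatter = alt.Chart(results_plot_df).mark_circle(size=60, opacity=0.6).encode(
    x=alt.X("Actual Grade:Q", title="Actual Final Grade (G3)"),
    y=alt.Y("Predicted Grade:Q", title="Predicted Final Grade"),
    tooltip=["Actual Grade", "Predicted Grade"]
)
# Dashed y = x reference line marking perfect prediction (grades range from 0 to 20)
diagonal = alt.Chart(
    pd.DataFrame({"Actual Grade": [0, 20], "Predicted Grade": [0, 20]})
).mark_line(color="gray", strokeDash=[4, 4]).encode(
    x="Actual Grade:Q",
    y="Predicted Grade:Q"
)
results_plot = (scatter + diagonal).properties(
    title="Predicted vs Actual Student Final Grades",
    width=500,
    height=500
)
results_plot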

The predicted vs. actual grade plot shows a strong positive relationship, with most points clustering near the diagonal line representing perfect prediction. This indicates that the KNN model captures key patterns in student performance. Some dispersion is visible for mid-range grades, reflecting natural variability in educational outcomes and unobserved factors not included in the model. Overall, the visualization aligns with the observed RMSE of approximately 1.38.

In [13]:
error_df = pd.DataFrame({
    "Prediction Error": predicted - student_y_test
})
error_plot = alt.Chart(error_df).mark_bar().encode(
    x=alt.X(
        "Prediction Error:Q",
        bin=alt.Bin(maxbins=30),
        title="Prediction Error (Predicted − Actual)"
    ),
    y=alt.Y("count()", title="Number of Students"),
    tooltip=["count()"]
).properties(
    title="Distribution of Prediction Errors",
    width=500,
    height=400
)
error_plot

With a test RMSE of approximately 1.38, the final KNN model's typical prediction error is about 1.4 grade points. Visual inspection of predicted versus actual values and of the error distribution suggests stable performance without strong systematic bias.


## Limitations

- The model relies on prior grades (`G1`, `G2`), which are only available once earlier terms are complete, limiting its usefulness for early intervention.
- KNN performance is sensitive to the choice of distance metric and to feature scaling.
- The dataset is relatively small and specific to one subject and population.
- Important qualitative factors (e.g., motivation, teaching quality) are not observed.
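
The sensitivity to feature scaling noted above can be seen in a small sketch: the same query point has a different nearest neighbor before and after standardization. The three students and the query are hypothetical values chosen for illustration, and `nearest_index` is a helper introduced here, not part of the project code.

```python
import numpy as np

# Hypothetical students: columns are studytime (1-4 scale) and absences (days)
X = np.array([
    [1.0, 10.0],   # student A
    [4.0, 12.0],   # student B
    [1.0, 20.0],   # student C
])
query = np.array([1.0, 14.0])

def nearest_index(points, target):
    """Return the row index of the point closest to target (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - target, axis=1)))

# Raw units: the three-step studytime gap looks small next to absences in days
raw_nearest = nearest_index(X, query)

# Standardize each column to mean 0 and standard deviation 1, as StandardScaler does
mean, std = X.mean(axis=0), X.std(axis=0)
scaled_nearest = nearest_index((X - mean) / std, (query - mean) / std)

print(raw_nearest, scaled_nearest)
```

In raw units student B is closest, but after standardization the full-range difference in study time counts for more and student A becomes the nearest neighbor.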

## Next Steps

- Evaluate performance without prior grades to assess early-term predictive power  
- Compare KNN with linear regression and tree-based models  
- Investigate feature importance and partial dependence  
- Extend the analysis to additional student performance datasets
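
As a starting point for the model comparison listed above, the sketch below scores KNN, linear regression, and a decision tree with the same 5-fold cross-validated RMSE. It runs on synthetic stand-in data so that it is self-contained. The data, the `models` dictionary, and the tree depth are assumptions for illustration, and the resulting numbers say nothing about the student dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical); swap in student_X_train / student_y_train
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 2] + 3.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

# Candidate models; scaling matters for KNN and is harmless for the others
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=6)),
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Score every model with the same folds and metric so results are comparable
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()  # flip the sign back to a positive RMSE

print(cv_rmse)
```

The same loop can be rerun on a feature subset that excludes `G1` and `G2` to gauge early-term predictive power.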