# KNN Regression: Predicting Student Final Grades

## Project Overview
This project applies K-Nearest Neighbors (KNN) regression to predict a student's final math grade
(G3) based on study habits, attendance, and prior academic performance.

The goal is to explore how model performance changes with different values of *k* and to demonstrate the bias–variance tradeoff using a supervised learning approach.

## Dataset
The analysis uses the Student Performance (Math) dataset from the UCI Machine Learning Repository.  
The target variable is **G3** (final grade), and predictors include:
- Study time
- Absences
- Grades from earlier terms (G1, G2)

## Methods Overview
- Data cleaning and feature selection  
- Train/test split  
- Feature scaling  
- KNN regression  
- Hyperparameter tuning of *k* using 5-fold cross-validated RMSE  
- Visualization using Altair
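
Feature scaling matters for KNN because neighbors are chosen by distance, so a predictor measured in large units can dominate the others. The sketch below illustrates this with three hypothetical students (the values are invented for illustration, not taken from the dataset), comparing Euclidean distances before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [studytime (1-4 ordinal code), absences (raw count)]
X_demo = np.array([
    [1.0, 0.0],
    [4.0, 2.0],
    [1.0, 30.0],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the gap in absences swamps the gap in study time.
raw_01 = euclidean(X_demo[0], X_demo[1])  # students differ mostly in studytime
raw_02 = euclidean(X_demo[0], X_demo[2])  # students differ only in absences

# After standardization both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

In raw units student 2 looks far more distant from student 0 than student 1 does; after scaling the two distances are of similar size.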

## Data Preparation

In [23]:
# Data handling
import pandas as pd

# Model selection, preprocessing, and pipeline utilities
from sklearn.model_selection import GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import set_config

# K-NN regression model and evaluation metric
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Visualization
import altair as alt

In [24]:
%pip install ucimlrepo



In [25]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
student_performance = fetch_ucirepo(id=320)

# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets


In [26]:
# combine data to make it a single dataframe (for reference)
student_df = pd.concat([X, y], axis="columns")
student_df.head()

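| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |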


## Train-Test Split


In [27]:
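# Keep the selected predictors and the target, dropping incomplete rows.
# Chaining .dropna() returns a new frame and avoids SettingWithCopyWarning.
student_df = student_df[['studytime', 'absences', 'G1', 'G2', 'G3']].dropna()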

student_X = student_df.drop(columns='G3')
student_y = student_df['G3']

In [35]:
# Split the data into training and testing sets
student_X_train, student_X_test, student_y_train, student_y_test = train_test_split(
    student_X, student_y,
    test_size=0.2,
    random_state=42
)

## Model Pipeline and Hyperparameter Tuning

KNN regression was chosen because it is a non-parametric method that can capture nonlinear relationships between prior grades and final performance without assuming a fixed functional form.
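
To make the mechanics concrete, here is a minimal from-scratch sketch of a KNN regression prediction: find the *k* closest training points and average their targets. The helper `knn_predict` and the toy arrays are hypothetical and only for illustration; the notebook itself relies on scikit-learn's `KNeighborsRegressor`, whose defaults (Euclidean distance, uniform weights) this sketch mirrors.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict by averaging the targets of the k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    return float(y_train[nearest].mean())                  # uniform-weight average

# Hypothetical, already-scaled features and final grades
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_toy = np.array([10.0, 12.0, 11.0, 18.0])

query = np.array([0.2, 0.1])
prediction_k1 = knn_predict(X_toy, y_toy, query, k=1)
prediction_k3 = knn_predict(X_toy, y_toy, query, k=3)
print(prediction_k1, prediction_k3)
```

No functional form is fitted here: the prediction depends only on which training points are nearby, which is why the method is called non-parametric.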


In [29]:
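# --- model

# Standardize every predictor inside the pipeline so that each CV fold is
# scaled using only its own training portion, then fit K-NN on the result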
preprocessor = make_column_transformer(
    (StandardScaler(), student_X.columns),
    remainder="drop"
)
pipeline = make_pipeline(
    preprocessor,
    KNeighborsRegressor()
)
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 31)
}

gridsearch = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error"
)

In [30]:
gridsearch.fit(student_X_train, student_y_train)

In [31]:
results = pd.DataFrame(gridsearch.cv_results_)
results.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsregressor__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006047 | 0.002102 | 0.004737 | 0.001528 | 1 | {'kneighborsregressor__n_neighbors': 1} | -2.579505 | -2.20794 | -1.839732 | -1.562788 | -1.765363 | -1.991066 | 0.360711 | 30 |
| 1 | 0.008106 | 0.000694 | 0.006274 | 0.000858 | 2 | {'kneighborsregressor__n_neighbors': 2} | -1.782662 | -2.062136 | -1.606777 | -1.478207 | -1.309969 | -1.64795 | 0.258626 | 29 |
| 2 | 0.007712 | 0.000379 | 0.006271 | 0.000493 | 3 | {'kneighborsregressor__n_neighbors': 3} | -1.556624 | -1.942826 | -1.446703 | -1.339728 | -1.11332 | -1.47984 | 0.273977 | 15 |
| 3 | 0.007475 | 0.000372 | 0.006131 | 0.000257 | 4 | {'kneighborsregressor__n_neighbors': 4} | -1.501802 | -1.802776 | -1.543052 | -1.316729 | -1.22152 | -1.477176 | 0.20103 | 13 |
| 4 | 0.007537 | 0.000496 | 0.006226 | 0.000589 | 5 | {'kneighborsregressor__n_neighbors': 5} | -1.473745 | -1.766679 | -1.52366 | -1.341211 | -1.132572 | -1.447573 | 0.209209 | 9 |
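
The scores above are negative because scikit-learn maximizes scores, so `neg_root_mean_squared_error` reports RMSE with its sign flipped. The sketch below shows one way to turn `cv_results_` into a positive RMSE per *k*. It uses synthetic data from `make_regression` as a stand-in, and the names `X_demo`, `search`, and `rmse_by_k` are assumptions made for this example rather than objects from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": range(1, 11)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

cv_table = pd.DataFrame(search.cv_results_)
rmse_by_k = pd.DataFrame({
    "k": cv_table["param_kneighborsregressor__n_neighbors"].astype(int),
    "cv_rmse": -cv_table["mean_test_score"],  # flip the sign back to a positive RMSE
}).sort_values("cv_rmse")

best_k = int(rmse_by_k.iloc[0]["k"])  # smallest cross-validated RMSE
print(rmse_by_k.head())
```

The same two-column frame could be passed to an Altair line chart to visualize how cross-validated error changes with *k*.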


## Results

In [32]:
gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 6}
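
An intermediate *k* is what the bias–variance tradeoff would lead one to expect. As a rough illustration on synthetic data (a noisy sine curve, not the student dataset), the sketch below compares training RMSE at the two extremes: with `k=1` every training point is its own nearest neighbor, so the training error is zero (low bias, high variance), while with *k* equal to the sample size every prediction collapses to the global mean (high bias, low variance). The helper `train_rmse` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=100)

def train_rmse(k):
    """RMSE of a k-NN model evaluated on the data it was fitted to."""
    model = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo)
    residuals = model.predict(X_demo) - y_demo
    return float(np.sqrt(np.mean(residuals ** 2)))

rmse_k1 = train_rmse(1)      # memorizes the training set
rmse_kall = train_rmse(100)  # predicts the mean of y for every point

print(f"k=1:   training RMSE = {rmse_k1:.3f}")
print(f"k=100: training RMSE = {rmse_kall:.3f} (std of y = {np.std(y_demo):.3f})")
```

Training error alone would always favor `k=1`, which is why the tuning step relies on cross-validated error instead.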

In [33]:
predicted = gridsearch.predict(student_X_test)

rmse_knn = mean_squared_error(
    y_true=student_y_test,
    y_pred=predicted
) ** 0.5

rmse_knn

1.3764658076603717
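
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The short sketch below works through the arithmetic on four hypothetical grades and checks it against the `mean_squared_error(...) ** 0.5` pattern used in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical grades, for illustration only
y_true = np.array([10.0, 12.0, 15.0, 8.0])
y_pred = np.array([11.0, 12.0, 13.0, 9.0])

squared_errors = (y_pred - y_true) ** 2               # [1, 0, 4, 1]
rmse_manual = float(np.sqrt(squared_errors.mean()))   # sqrt(6 / 4)
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.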

## Plots


In [38]:
results_plot_df = pd.DataFrame({
    "Actual Grade": student_y_test,
    "Predicted Grade": predicted
})
results_plot = alt.Chart(results_plot_df).mark_circle(size=60, opacity=0.6).encode(
    x=alt.X("Actual Grade:Q", title="Actual Final Grade (G3)"),
    y=alt.Y("Predicted Grade:Q", title="Predicted Final Grade"),
    tooltip=["Actual Grade", "Predicted Grade"]
).properties(
    title="Predicted vs Actual Student Final Grades",
    width=500,
    height=500
)
results_plot

The predicted vs. actual grade plot shows a strong positive relationship, with most points clustering near the diagonal where the predicted grade equals the actual grade (perfect prediction). This indicates that the KNN model captures key patterns in student performance. Some dispersion is visible for mid-range grades, reflecting natural variability in educational outcomes and factors not included in the model. Overall, the visualization is consistent with the test RMSE of approximately 1.38.

In [39]:
error_df = pd.DataFrame({
    "Prediction Error": predicted - student_y_test
})
error_plot = alt.Chart(error_df).mark_bar().encode(
    x=alt.X(
        "Prediction Error:Q",
        bin=alt.Bin(maxbins=30),
        title="Prediction Error (Predicted − Actual)"
    ),
    y=alt.Y("count()", title="Number of Students"),
    tooltip=["count()"]
).properties(
    title="Distribution of Prediction Errors",
    width=500,
    height=400
)
error_plot

The final KNN regression model (*k* = 6) achieved an RMSE of approximately 1.38 on the test set, meaning a typical prediction misses the true final grade by roughly 1.4 points on the 0–20 grading scale. Visual inspection of predicted versus actual values and the error distribution suggests stable performance without strong systematic bias.
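
The "no strong systematic bias" reading can also be checked numerically. A minimal sketch, assuming an array of signed errors (predicted minus actual) like the one plotted above; the values here are hypothetical. A mean error near zero suggests the model does not consistently over- or under-predict, while MAE and RMSE summarize the typical size of a miss.

```python
import numpy as np

# Hypothetical signed errors (predicted - actual), for illustration only
errors = np.array([-1.5, 0.5, 1.0, -0.5, 0.5, 0.0])

mean_error = float(errors.mean())             # signed: positive and negative misses cancel
mae = float(np.abs(errors).mean())            # average absolute miss
rmse = float(np.sqrt((errors ** 2).mean()))   # penalizes large misses more heavily

print(f"mean error = {mean_error:.3f}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

In the notebook, the same three lines could be applied to `predicted - student_y_test`.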
