## **CSI4142 - A3: Part 2**

**Group:** 9

**Members:** 
- Jay Ghosh (300243766) 
- Alexander Azizi-Martin (300236257)

**Introduction**

This notebook presents our solution to Part 2 of Assignment 3. It begins by defining helper functions that simulate the three missing-data mechanisms (MCAR, MAR, MNAR) and evaluate imputation methods. It then explores the dataset, converts categorical features to a numerical format so they can be used in the analysis, and concludes with an evaluation of median, KNN, and MICE imputation.
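
As a rough illustration of the three missingness mechanisms named above, the following is a minimal sketch and not the notebook's actual helper functions. The function names (`simulate_mcar`, `simulate_mar`, `simulate_mnar`), the rank-based masking probability, and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def simulate_mcar(df, target, frac, rng):
    """MCAR: every row has the same probability `frac` of losing `target`."""
    out = df.copy()
    mask = rng.random(len(out)) < frac
    out[target] = out[target].astype(float)
    out.loc[mask, target] = np.nan
    return out


def simulate_mar(df, target, driver, frac, rng):
    """MAR: the chance of losing `target` grows with the observed column `driver`."""
    out = df.copy()
    # Percentile ranks average about 0.5, so scaling by 2 * frac keeps the
    # overall missing rate close to `frac` (assumes frac <= 0.5).
    ranks = out[driver].rank(pct=True).to_numpy()
    prob = np.clip(2 * frac * ranks, 0, 1)
    mask = rng.random(len(out)) < prob
    out[target] = out[target].astype(float)
    out.loc[mask, target] = np.nan
    return out


def simulate_mnar(df, target, frac, rng):
    """MNAR: the chance of losing `target` depends on the value of `target` itself."""
    return simulate_mar(df, target, driver=target, frac=frac, rng=rng)


# Hypothetical toy data that only mimics two of the dataset's columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Study_Hours_per_Week": rng.integers(5, 51, size=1000),
    "Exam_Score (%)": rng.integers(40, 101, size=1000),
})

mcar = simulate_mcar(toy, "Exam_Score (%)", 0.2, rng)
mar = simulate_mar(toy, "Exam_Score (%)", "Study_Hours_per_Week", 0.2, rng)
mnar = simulate_mnar(toy, "Exam_Score (%)", 0.2, rng)
```

Each function returns a copy, so the complete data stays available as ground truth when imputation error is measured later.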

In [1]:
import kagglehub
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

#### **Dataset description**
**Dataset name:** Student Performance & Learning Style [1]

**Authors:** Adil Shamim [1]

**Purpose:** This dataset examines the relationship between study habits, learning preferences, and external factors in shaping student performance. It provides insights into how study time, participation, sleep, and technology use influence academic success, supporting research in education and predictive modeling for student outcomes.

**Shape:** 10,000 rows and 15 columns. [1]

In [2]:
# Loading dataset
df_path = kagglehub.dataset_download("adilshamim8/student-performance-and-learning-style")
df = pd.read_csv(f"{df_path}/student_performance_large_dataset.csv")
df.head()



| | Student_ID | Age | Gender | Study_Hours_per_Week | Preferred_Learning_Style | Online_Courses_Completed | Participation_in_Discussions | Assignment_Completion_Rate (%) | Exam_Score (%) | Attendance_Rate (%) | Use_of_Educational_Tech | Self_Reported_Stress_Level | Time_Spent_on_Social_Media (hours/week) | Sleep_Hours_per_Night | Final_Grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S00001 | 18 | Female | 48 | Kinesthetic | 14 | Yes | 100 | 69 | 66 | Yes | High | 9 | 8 | C |
| 1 | S00002 | 29 | Female | 30 | Reading/Writing | 20 | No | 71 | 40 | 57 | Yes | Medium | 28 | 8 | D |
| 2 | S00003 | 20 | Female | 47 | Kinesthetic | 11 | No | 60 | 43 | 79 | Yes | Low | 13 | 7 | D |
| 3 | S00004 | 23 | Female | 13 | Auditory | 0 | Yes | 63 | 70 | 60 | Yes | Low | 24 | 10 | B |
| 4 | S00005 | 19 | Female | 24 | Auditory | 19 | Yes | 59 | 63 | 93 | Yes | Medium | 26 | 8 | C |


In [3]:
df.shape

(10000, 15)

The dataset contains 10,000 rows and 15 columns, as reported by the `shape` property.

In [4]:
# Running info() to get a basic understanding of the data and the types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Student_ID                               10000 non-null  object
 1   Age                                      10000 non-null  int64 
 2   Gender                                   10000 non-null  object
 3   Study_Hours_per_Week                     10000 non-null  int64 
 4   Preferred_Learning_Style                 10000 non-null  object
 5   Online_Courses_Completed                 10000 non-null  int64 
 6   Participation_in_Discussions             10000 non-null  object
 7   Assignment_Completion_Rate (%)           10000 non-null  int64 
 8   Exam_Score (%)                           10000 non-null  int64 
 9   Attendance_Rate (%)                      10000 non-null  int64 
 10  Use_of_Educational_Tech                  10000 non-null  ob

The `.info()` method reveals that all columns are complete, with no missing values across all 10000 records. This is further corroborated by the Kaggle page, which also reports 100% completeness [1].

### Feature Details

**Student_ID**  
- Type: Categorical (string identifier, e.g. `S00001`).  
- Description: Unique identifier assigned to each student.  

**Age**  
- Type: Numerical.  
- Description: Student's age at the time of data collection.  
- Range: 18-30 years.  

**Gender**  
- Type: Categorical.  
- Description: Student's self-identified gender.  
- Categories: Male, Female, Other.  

**Study_Hours_per_Week**  
- Type: Numerical.  
- Description: Total hours a student studies per week.  
- Range: 5-50 hours.  
- Unit: Hours.  

**Preferred_Learning_Style**  
- Type: Categorical.  
- Description: Primary learning method preferred by the student.  
- Categories: Visual, Auditory, Reading/Writing, Kinesthetic.  

**Online_Courses_Completed**  
- Type: Numerical.  
- Description: Number of online courses the student has completed.  
- Range: 0-20.  

**Participation_in_Discussions**  
- Type: Categorical.  
- Description: Whether the student actively participates in academic discussions.  
- Categories: Yes, No.  

**Assignment_Completion_Rate (%)**  
- Type: Numerical.  
- Description: Percentage of assignments completed by the student.  
- Range: 50%-100%.  
- Unit: Percentage.  

**Exam_Score (%)**  
- Type: Numerical.  
- Description: Student's final exam score.  
- Range: 40%-100%.  
- Unit: Percentage.  

**Attendance_Rate (%)**  
- Type: Numerical.  
- Description: Percentage of classes attended by the student.  
- Range: 50%-100%.  
- Unit: Percentage.  

### **Feature Engineering**

In [None]:
# Exam score earned per weekly study hour; the score column name includes the " (%)" suffix
df['Study_Efficiency'] = df['Exam_Score (%)'] / df['Study_Hours_per_Week']

### **Conclusion**


Median imputation is computationally efficient and suitable for MCAR or low-correlation MAR data, where missingness is random and variable relationships are negligible. However, its simplicity becomes a weakness in high-correlation MAR and MNAR scenarios, as it ignores interdependencies between variables, leading to skewed imputations.

KNN imputation offers a middle ground. It improves on median imputation in high-correlation MAR and MNAR settings by leveraging local similarity, but at a higher computational cost that grows with dataset size and depends on the number of neighbours chosen. Its sensitivity to noise and sparse data limits its usefulness in MCAR or low-correlation MAR cases, where median imputation is both faster and comparably accurate.

MICE excels in accuracy across all missingness mechanisms, particularly where variable relationships matter (high-correlation MAR, MNAR), thanks to its iterative modeling of interdependencies. However, this comes with high computational complexity, making it less practical for large datasets or time-sensitive applications.
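
The comparison summarized above can be sketched on synthetic data as follows. This is an illustrative setup under stated assumptions and not the evaluation code used in this notebook: the three synthetic columns, the 20% MCAR mask, and the hyperparameters are invented for the example, and scikit-learn's `IterativeImputer` is used as a MICE-style single imputer.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: "b" is strongly correlated with "a", "c" is unrelated noise.
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
truth = pd.DataFrame({
    "a": x,
    "b": 0.8 * x + rng.normal(scale=0.3, size=n),
    "c": rng.normal(size=n),
})

# Scale to [0, 1] so KNN distances are not dominated by one column.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(truth), columns=truth.columns)

# Hide 20% of column "b" completely at random (MCAR).
mask = rng.random(n) < 0.2
observed = scaled.copy()
observed.loc[mask, "b"] = np.nan

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "mice": IterativeImputer(max_iter=10, random_state=0),
}

# Score each imputer only on the cells that were hidden.
b_index = scaled.columns.get_loc("b")
scores = {}
for name, imputer in imputers.items():
    filled = imputer.fit_transform(observed)
    y_true = scaled.loc[mask, "b"]
    y_pred = filled[mask, b_index]
    scores[name] = {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
    }

print(pd.DataFrame(scores).T)
```

Scoring only the masked cells keeps the metrics from being diluted by values that were never missing.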

### **References**

[1] A. Shamim, "Student Performance & Learning Style," Kaggle. https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style