# Data Preparation

## Getting the System Ready and Loading the Data (Step 3)

### Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load Raw Data
We load the raw dataset from a CSV file and immediately drop the StudentID column. StudentID is a unique row identifier, so it carries no information about student performance that a model could generalize from; at worst, a model could use it to memorize individual rows. Dropping it up front keeps the dataset limited to meaningful variables.

In [3]:
# Load the raw dataset
data = pd.read_csv("../data/raw/student_performance_data.csv")

# Remove the StudentID column, as it is a non-predictive unique identifier
data = data.drop(columns=['StudentID'])

# Display the first few rows and dataset info to verify the change
print("First few rows after removing StudentID:")
print(data.head())
print("\nDataset info:")
data.info()

First few rows after removing StudentID:
   Age  Gender  Ethnicity  ParentalEducation  StudyTimeWeekly  Absences  \
0   17       1          0                  2        19.833723         7   
1   18       0          0                  1        15.408756         0   
2   15       0          2                  3         4.210570        26   
3   17       1          0                  3        10.028829        14   
4   17       1          0                  2         4.672495        17   

   Tutoring  ParentalSupport  Extracurricular  Sports  Music  Volunteering  \
0         1                2                0       0      1             0   
1         0                1                0       0      0             0   
2         0                2                0       0      0             0   
3         0                3                1       0      0             0   
4         1                3                0       0      0             0   

        GPA  GradeClass  
0  2.929196  
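
Before dropping an identifier column, it can be worth confirming that it really is one. The sketch below is a minimal illustration, not part of the original notebook: `drop_identifier` is a hypothetical helper, and the rows in `sample` are made up for the example rather than taken from the dataset. It drops the column only if every value is unique, and does nothing if the column is already gone (for example, when the cell is re-run).

```python
import pandas as pd


def drop_identifier(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop `column` only if it exists and is unique per row (hypothetical helper)."""
    if column not in df.columns:
        # Already removed, e.g. when the cell is executed a second time
        return df
    if not df[column].is_unique:
        raise ValueError(f"{column!r} has duplicates; it may not be a pure identifier")
    return df.drop(columns=[column])


# Illustrative rows only; these are not real records from the dataset
sample = pd.DataFrame({
    "StudentID": [1001, 1002, 1003],
    "Age": [17, 18, 15],
    "Absences": [7, 0, 26],
})

cleaned = drop_identifier(sample, "StudentID")
print(cleaned.columns.tolist())
```

Because the helper returns the frame unchanged when the column is missing, running it twice does not raise a `KeyError`, unlike a bare `drop(columns=[...])`.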

## Understanding the Data (Step 4)

In [11]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| StudentID | 2392.0 | 2196.5 | 690.655244 | 1001.0 | 1598.75 | 2196.5 | 2794.25 | 3392.0 |
| Age | 2392.0 | 16.468645 | 1.123798 | 15.0 | 15.0 | 16.0 | 17.0 | 18.0 |
| Gender | 2392.0 | 0.51087 | 0.499986 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Ethnicity | 2392.0 | 0.877508 | 1.028476 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 |
| ParentalEducation | 2392.0 | 1.746237 | 1.000411 | 0.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| StudyTimeWeekly | 2392.0 | 9.771992 | 5.652774 | 0.001057 | 5.043079 | 9.705363 | 14.40841 | 19.978094 |
| Absences | 2392.0 | 14.541388 | 8.467417 | 0.0 | 7.0 | 15.0 | 22.0 | 29.0 |
| Tutoring | 2392.0 | 0.301421 | 0.458971 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ParentalSupport | 2392.0 | 2.122074 | 1.122813 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| Extracurricular | 2392.0 | 0.383361 | 0.486307 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
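
The `min` and `max` columns of a `describe().T` summary can also serve as a quick range check. The following is a rough sketch under stated assumptions: the rows in `sample` are made up for illustration, and the bounds in `expected` are assumed valid ranges read off the summary shown here, not documented limits of the dataset.

```python
import pandas as pd

# Illustrative rows only; these are not real records from the dataset
sample = pd.DataFrame({
    "Age": [17, 18, 15, 16],
    "Tutoring": [1, 0, 0, 1],
    "Absences": [7, 0, 26, 14],
})

# Assumed valid ranges (lower, upper) for each column; adjust as needed
expected = {"Age": (15, 18), "Tutoring": (0, 1), "Absences": (0, 29)}

# One row per feature, with count/mean/std/min/quartiles/max as columns
summary = sample.describe().T

# Collect columns whose observed min or max falls outside the assumed range
out_of_range = [
    col for col, (lower, upper) in expected.items()
    if summary.loc[col, "min"] < lower or summary.loc[col, "max"] > upper
]
print(out_of_range)
```

An empty list means every checked column stays within its assumed bounds; any name that appears is a candidate for closer inspection.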


## Missing Value and Outlier Treatment (Step 6)