# Student Performance Data - Initial Exploration and Cleaning

This notebook loads the raw student performance dataset, performs basic exploratory checks, converts appropriate columns to categorical data types, and saves a cleaned version of the data for downstream analysis and modeling.

In [None]:
import pandas as pd


## 1. Load the dataset

With `pandas` imported, we load the raw student performance data from CSV into a dataframe. The first few rows and the overall structure are inspected in the next section.

In [16]:
data = pd.read_csv("../data/student_performance.csv")



## 2. Initial data exploration

Here we:
- Inspect the first few rows (`.head()`)
- Check column types and non-null counts (`.info()`)
- View summary statistics for numeric columns (`.describe()`)
- Check for missing values in each column (`.isnull().sum()`)

This helps us understand the shape and quality of the raw data.

In [18]:
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14003 entries, 0 to 14002
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   StudyHours            14003 non-null  int64
 1   Attendance            14003 non-null  int64
 2   Resources             14003 non-null  int64
 3   Extracurricular       14003 non-null  int64
 4   Motivation            14003 non-null  int64
 5   Internet              14003 non-null  int64
 6   Gender                14003 non-null  int64
 7   Age                   14003 non-null  int64
 8   LearningStyle         14003 non-null  int64
 9   OnlineCourses         14003 non-null  int64
 10  Discussions           14003 non-null  int64
 11  AssignmentCompletion  14003 non-null  int64
 12  ExamScore             14003 non-null  int64
 13  EduTech               14003 non-null  int64
 14  StressLevel           14003 non-null  int64
 15  FinalGrade            14003 non-null  int64
dtypes: int64(16)
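
One detail worth noting about the cell above: a notebook only auto-displays the value of a cell's last expression, so a bare `data.head()` placed before `data.info()` produces no visible output. A minimal sketch of one workaround, using a small made-up dataframe (`toy` and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"StudyHours": [19, 22, 15], "Attendance": [64, 88, 71]})

# A bare `toy.head()` in the middle of a cell is evaluated and discarded.
# Printing it explicitly makes the preview appear alongside `.info()`.
preview = toy.head(2)
print(preview)
toy.info()
```

Inside Jupyter, `display(preview)` from `IPython.display` should give the richer HTML rendering; splitting the two calls into separate cells works as well.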

In [11]:
data.describe()

| | StudyHours | Attendance | Resources | Extracurricular | Motivation | Internet | Gender | Age | LearningStyle | OnlineCourses | Discussions | AssignmentCompletion | ExamScore | EduTech | StressLevel | FinalGrade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 | 14003.0 |
| mean | 19.987431 | 80.194316 | 1.104406 | 0.594158 | 0.905806 | 0.925516 | 0.551953 | 23.532172 | 1.515461 | 9.891952 | 0.60587 | 74.502535 | 70.346926 | 0.709062 | 1.304363 | 1.447904 |
| std | 5.890637 | 11.472181 | 0.697362 | 0.491072 | 0.695896 | 0.262566 | 0.497311 | 3.514293 | 1.112941 | 6.112801 | 0.48868 | 14.632177 | 17.688113 | 0.454211 | 0.785383 | 1.12155 |
| min | 5.0 | 60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.0 | 0.0 | 0.0 | 0.0 | 50.0 | 40.0 | 0.0 | 0.0 | 0.0 |
| 25% | 16.0 | 70.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 20.0 | 1.0 | 5.0 | 0.0 | 62.0 | 55.0 | 0.0 | 1.0 | 0.0 |
| 50% | 20.0 | 80.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 24.0 | 2.0 | 10.0 | 1.0 | 74.0 | 70.0 | 1.0 | 2.0 | 1.0 |
| 75% | 24.0 | 90.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 27.0 | 3.0 | 15.0 | 1.0 | 87.0 | 86.0 | 1.0 | 2.0 | 2.0 |
| max | 44.0 | 100.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 29.0 | 3.0 | 20.0 | 1.0 | 100.0 | 100.0 | 1.0 | 2.0 | 3.0 |


In [12]:
data.isnull().sum()

StudyHours              0
Attendance              0
Resources               0
Extracurricular         0
Motivation              0
Internet                0
Gender                  0
Age                     0
LearningStyle           0
OnlineCourses           0
Discussions             0
AssignmentCompletion    0
ExamScore               0
EduTech                 0
StressLevel             0
FinalGrade              0
dtype: int64

## 3. Defining and converting categorical features

Some columns hold integer codes for categories rather than measured quantities (e.g., `Gender`, `Motivation`, `LearningStyle`, `FinalGrade`).

We explicitly mark these columns as `category` dtype so that:
- Pandas stores them more compactly than `int64`
- They are easier to encode later for machine learning
- Summary statistics report counts, unique values, and the most frequent value instead of means and quartiles
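
To make the storage point concrete, here is a rough sketch that compares the memory footprint of a small-integer column stored as `int64` and as `category`. The series is synthetic (random codes 0 to 3 at the same length as the dataset), so the numbers illustrate the effect and are not measurements on the real data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column such as LearningStyle (codes 0-3).
rng = np.random.default_rng(0)
codes = pd.Series(rng.integers(0, 4, size=14_003), name="LearningStyle")

# Bytes used by the values only (index excluded).
as_int = codes.memory_usage(index=False, deep=True)
as_cat = codes.astype("category").memory_usage(index=False, deep=True)

print(f"int64:    {as_int:,} bytes")
print(f"category: {as_cat:,} bytes")
```

With only a handful of distinct values, pandas can store each row as a small integer code plus one shared list of categories, which is where the saving comes from.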

In [13]:
# Columns whose integer values are category codes, not quantities
categorical_cols = [
    'Resources', 'Extracurricular', 'Motivation', 'Internet', 'Gender',
    'LearningStyle', 'Discussions', 'EduTech', 'FinalGrade',
]
for col in categorical_cols:
    data[col] = data[col].astype('category')

data.dtypes

StudyHours                 int64
Attendance                 int64
Resources               category
Extracurricular         category
Motivation              category
Internet                category
Gender                  category
Age                        int64
LearningStyle           category
OnlineCourses              int64
Discussions             category
AssignmentCompletion       int64
ExamScore                  int64
EduTech                 category
StressLevel                int64
FinalGrade              category
dtype: object

In [14]:
data.describe(include='category')

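| | Resources | Extracurricular | Motivation | Internet | Gender | LearningStyle | Discussions | EduTech | FinalGrade |
|---|---|---|---|---|---|---|---|---|---|
| count | 14003 | 14003 | 14003 | 14003 | 14003 | 14003 | 14003 | 14003 | 14003 |
| unique | 3 | 2 | 3 | 2 | 2 | 4 | 2 | 2 | 4 |
| top | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| freq | 7041 | 8320 | 7098 | 12960 | 7729 | 3580 | 8484 | 9929 | 3832 |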


In [15]:
for col in categorical_cols:
    print(col, data[col].unique())

Resources [1, 0, 2]
Categories (3, int64): [0, 1, 2]
Extracurricular [0, 1]
Categories (2, int64): [0, 1]
Motivation [0, 1, 2]
Categories (3, int64): [0, 1, 2]
Internet [1, 0]
Categories (2, int64): [0, 1]
Gender [0, 1]
Categories (2, int64): [0, 1]
LearningStyle [2, 3, 1, 0]
Categories (4, int64): [0, 1, 2, 3]
Discussions [1, 0]
Categories (2, int64): [0, 1]
EduTech [0, 1]
Categories (2, int64): [0, 1]
FinalGrade [3, 2, 0, 1]
Categories (4, int64): [0, 1, 2, 3]
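
The integer codes printed above carry no labels, and this notebook does not say whether any of them are ordered. If a column such as `FinalGrade` turns out to be ordinal, an ordered categorical may be a better fit than the default unordered one. A sketch under that assumption, on toy values (the ordering `0 < 1 < 2 < 3` is hypothetical and should be checked against the dataset's documentation):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy FinalGrade codes; what 0-3 mean is not documented in this notebook.
grades = pd.Series([3, 2, 0, 1, 0], name="FinalGrade")

# ASSUMPTION: the codes are ordinal with 0 < 1 < 2 < 3.
grade_dtype = CategoricalDtype(categories=[0, 1, 2, 3], ordered=True)
ordered = grades.astype(grade_dtype)

# Ordered categoricals support min/max and comparisons against a category.
print(ordered.min(), ordered.max())
print((ordered >= 2).tolist())
```

Columns without a natural order (for example `Gender` or `LearningStyle`) would stay unordered, as in the conversion cell above.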


In [20]:
data.to_csv("../data/cleaned_student_performance.csv", index=False)
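
One caveat about saving to CSV: the format stores plain text, so the `category` dtype set earlier is not preserved and the columns are read back as integers. The sketch below illustrates this with a toy dataframe and an in-memory buffer instead of the real file; the column subset and values are illustrative only. It re-applies the conversion after loading, because passing `dtype='category'` directly to `read_csv` would parse the categories as strings.

```python
import io
import pandas as pd

categorical_cols = ["Gender", "FinalGrade"]  # subset of the notebook's list, for brevity

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({"Gender": [0, 1, 1], "FinalGrade": [3, 0, 2], "Age": [18, 24, 29]})
toy = toy.astype({col: "category" for col in categorical_cols})

# Round-trip through CSV text in memory rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

plain = pd.read_csv(buffer)  # category information is lost here
restored = plain.astype({col: "category" for col in categorical_cols})

print(plain.dtypes)
print(restored.dtypes)
```

Downstream notebooks that read `cleaned_student_performance.csv` would therefore need to repeat the conversion, or keep `categorical_cols` in a shared place.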