# Student Performance Indicator

### Life Cycle of Machine Learning Project
- Understanding the Problem Statement
- Data Collection
- Exploratory Data Analysis
- Data Pre-processing
- Model Training
- Choose Best Model

### 1 - Problem Statement
- This project examines how a student's performance (test scores) is affected by other variables such as Gender, Ethnicity, Parental Level of Education, Lunch and Test Preparation Course.

### 2 - Data Collection
- Data Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
- The data consists of 8 columns and 1000 rows.

#### 2.1 - Import Data and Required Packages

##### 2.1.1 - Import Pandas, NumPy, Matplotlib, Seaborn and Warnings Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

##### 2.1.2 - Import the CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv('data/stud.csv')

##### 2.1.3 - Show Top 5 Records

In [3]:
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|-|-|-|-|-|-|-|-|-|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


##### 2.1.4 - Shape of Dataset

In [5]:
df.shape

(1000, 8)

#### 2.2 - Dataset Information

Category | Definition | Values
-|-|-
Gender | Gender of the student | Male/Female
Race/Ethnicity | Ethnicity of the student | Group A/B/C/D/E
Parental Level of Education | Parent's highest qualification | Bachelor's Degree/Master's Degree/Some College/Associate's Degree/High School/Some High School
Lunch | Type of lunch the student receives | Standard or Free/Reduced
Test Preparation Course | Whether the student completed the test preparation course | Completed/None
Math Score | Score in the math test | Integer
Reading Score | Score in the reading test | Integer
Writing Score | Score in the writing test | Integer

### 3 - Data Checks to Perform

- Check Missing Values
- Check Duplicates
- Check Data Types
- Check Number of Unique Values in each Column
- Check Statistics of Dataset
- Check type of Categories present
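
The checks listed above can be gathered into one helper so they are easy to rerun after the data changes. The following is a minimal sketch, not part of the original notebook: `run_basic_checks` is a hypothetical name, and `sample` is a small stand-in frame built from the five rows shown by `df.head()`; in the notebook the full `df` would be passed instead.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() earlier in this notebook.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race_ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "parental_level_of_education": [
        "bachelor's degree", "some college", "master's degree",
        "associate's degree", "some college",
    ],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: collect the basic data checks in one dictionary."""
    return {
        # Number of missing values in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Data type of each column, as a string
        "dtypes": frame.dtypes.astype(str).to_dict(),
        # Number of distinct values in each column
        "unique_per_column": frame.nunique().to_dict(),
    }


report = run_basic_checks(sample)
for check_name, result in report.items():
    print(check_name, "->", result)
```

Returning a dictionary rather than printing inside the helper keeps the results available for later assertions or logging.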

#### 3.1 - Check Missing Values

In [6]:
df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

There are no missing values in the dataset.

#### 3.2 - Check Duplicates

In [7]:
df.duplicated().sum()

0

There are no duplicate rows in the dataset.

#### 3.3 - Check Data Types

In [8]:
# Show non-null counts and dtypes for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
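
The summary above reports three integer columns and five object (text) columns. A sketch of splitting the column names by type with `select_dtypes` follows; the names `numeric_cols` and `categorical_cols` are illustrative, and `sample` is a stand-in built from the first five rows of the data rather than the full `df`.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() earlier in this notebook.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race_ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "parental_level_of_education": [
        "bachelor's degree", "some college", "master's degree",
        "associate's degree", "some college",
    ],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# Columns holding numbers (the three score columns)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical (text) here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"{len(numeric_cols)} numerical features: {numeric_cols}")
print(f"{len(categorical_cols)} categorical features: {categorical_cols}")
```

Selecting on `"number"` and its complement, rather than on `"object"`, is a choice made for this sketch so that it does not depend on how a given pandas version stores text columns.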


#### 3.4 - Check Number of Unique Values in each Feature

In [9]:
df.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

#### 3.5 - Check Statistics of Numerical Features in Dataset

In [10]:
df.describe()

| | math_score | reading_score | writing_score |
|-|-|-|-|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


##### 3.5.1 - Insight
1. Mean scores across Math, Reading and Writing are close (66.1 to 69.2).
2. On average, students are weakest in Math (mean 66.1), followed by Writing (mean 68.1) and then Reading (mean 69.2).
3. The minimum Math score is 0, so at least one student scored 0 in Math; the minimums for Reading and Writing are 17 and 10.
4. The maximum in every subject is 100, so at least one student scored full marks in each subject.
5. Standard deviations are close (14.6 to 15.2).
6. Quartile scores are close across subjects, but Math is the lowest at the 25th, 50th and 75th percentiles.
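
Observations like these can also be checked directly instead of being read off the `describe()` table. The sketch below ranks the subjects by mean and counts zero and full-mark scores; `scores` is a stand-in frame holding only the five rows shown by `df.head()`, so its numbers differ from the full-dataset statistics above, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in data: score columns from the five rows shown by df.head().
scores = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# Mean per subject, sorted so the weakest subject comes first
subject_means = scores[score_cols].mean().sort_values()

# Number of students per subject who scored 0 or full marks (100)
zero_counts = (scores[score_cols] == 0).sum()
full_mark_counts = (scores[score_cols] == 100).sum()

print("Subjects from weakest to strongest mean:")
print(subject_means)
print("Students scoring 0 per subject:")
print(zero_counts)
print("Students scoring 100 per subject:")
print(full_mark_counts)
```

Run against the full `df`, the same three expressions would confirm or refute insights 2 to 4 without manual inspection.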

#### 3.7 - Exploring Data

In [13]:
print("Categories in 'gender' variable:     ", end=" ")
print(df['gender'].unique())

print("Categories in 'race_ethnicity' variable:     ", end=" ")
print(df['race_ethnicity'].unique())

print("Categories in 'parental_level_of_education' variable:     ", end=" ")
print(df['parental_level_of_education'].unique())

print("Categories in 'lunch' variable:     ", end=" ")
print(df['lunch'].unique())

print("Categories in 'test_preparation_course' variable:     ", end=" ")
print(df['test_preparation_course'].unique())

Categories in 'gender' variable:      ['female' 'male']
Categories in 'race_ethnicity' variable:      ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental_level_of_education' variable:      ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch' variable:      ['standard' 'free/reduced']
Categories in 'test_preparation_course' variable:      ['none' 'completed']
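
Writing one `print` pair per column makes it easy to pair a label with the wrong column. A loop over the column names keeps the label and the data in sync. This is a sketch under assumptions: `sample` is a stand-in built from the five rows shown by `df.head()`, so it contains fewer categories than the full dataset, and `categories` is an illustrative name.

```python
import pandas as pd

# Stand-in data: categorical columns from the five rows shown by df.head().
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race_ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "parental_level_of_education": [
        "bachelor's degree", "some college", "master's degree",
        "associate's degree", "some college",
    ],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
})

categorical_cols = [
    "gender", "race_ethnicity", "parental_level_of_education",
    "lunch", "test_preparation_course",
]

# Map each column name to its distinct values, in order of first appearance
categories = {col: sample[col].unique().tolist() for col in categorical_cols}

for col, values in categories.items():
    print(f"Categories in '{col}' variable: {values}")
```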
