# Student Performance Indicator

### Life cycle of a Machine Learning Project

- Understanding the Problem statement

- Data Collection

- Data Checks to perform

- Exploratory data analysis

- Data Pre-Processing

- Model Training

- Choose best model

## 1) Problem statement

This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch and test preparation course.

## 2) Data Collection

- __Dataset Source__ - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977


- The data consists of 8 columns and 1000 rows.

### 2.1 Import Data and Required Packages
- Importing Pandas, NumPy, Matplotlib, Seaborn and the warnings library.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Import the CSV Data as Pandas DataFrame

In [14]:
df = pd.read_csv('data/stud.csv')  # forward slash avoids the invalid '\s' escape and works on every OS
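
A backslash inside a plain string literal can be interpreted as an escape sequence, and it only acts as a path separator on Windows. One portable alternative is to build the path with `pathlib`. The sketch below assumes the CSV sits at `data/stud.csv` relative to the notebook's working directory; the `csv_path` name and the existence check are illustrative additions, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the CSV relative to the notebook's working directory.
csv_path = Path("data") / "stud.csv"

# Only read the file when it is present, so the cell reports a clear message otherwise.
if csv_path.exists():
    df = pd.read_csv(csv_path)
    print(df.shape)
else:
    print(f"File not found: {csv_path}")
```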

### Shape of data

In [15]:
df.shape

(1000, 8)

### Show Top 5 Records

In [11]:
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


### 2.2 Dataset information

- gender : sex of the student -> (male/female)

- race/ethnicity : ethnicity of the student -> (group A, B, C, D, E)

- parental level of education : parents' final education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)

- lunch : type of lunch before the test -> (standard or free/reduced)

- test preparation course : whether the course was completed before the test -> (none or completed)

- math score

- reading score

- writing score

## 3) Data Checks to perform

- Check Missing values

- Check Duplicates

- Check data type

- Check the number of unique values of each column

- Check statistics of data set

- Check the various categories present in each categorical column
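
The checklist above can be gathered into a single helper so that every check runs in one call. This is a minimal sketch, not the notebook's own code: `run_basic_checks` is a hypothetical helper name, and `sample` reuses a few columns from the five rows displayed earlier as a stand-in for the full DataFrame.

```python
import pandas as pd

# A few columns from the first five rows shown earlier, used as a small stand-in.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race_ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks into one dictionary (hypothetical helper)."""
    return {
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }


checks = run_basic_checks(sample)
for name, value in checks.items():
    print(name, "->", value)
```

In the notebook itself the same helper could be called as `run_basic_checks(df)`; the summary statistics step is left to `df.describe()`.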

### 3.1 Check data types

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


### 3.2 Data statistics 

In [16]:
df.describe()

| | math_score | reading_score | writing_score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


Insight

- From the description of the numerical data above, all means are close to each other, between 66.09 and 69.17.

- All standard deviations are also close, between 14.60 and 15.20.

- The minimum math score is 0, while the minimums for writing (10) and reading (17) are higher.
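
The spread of the means and standard deviations can be read off the `describe()` table programmatically instead of by eye. The sketch below copies the `mean` and `std` rows from the output above into a small frame; the names `summary` and `ranges` are illustrative assumptions.

```python
import pandas as pd

# The mean and std rows copied from the df.describe() output above.
summary = pd.DataFrame(
    {
        "math_score": [66.089, 15.16308],
        "reading_score": [69.169, 14.600192],
        "writing_score": [68.054, 15.195657],
    },
    index=["mean", "std"],
)

# For each statistic, find the smallest and largest value across the three subjects.
ranges = pd.DataFrame(
    {"lowest": summary.min(axis=1), "highest": summary.max(axis=1)}
).round(2)
print(ranges)
```

In the notebook, `summary` could be taken directly as `df.describe().loc[["mean", "std"]]` rather than typed in by hand.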

### 3.3 Check Missing values

In [17]:
df.isnull().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

There are no missing values in the data set.

### 3.4 Check Duplicates

In [19]:
df.duplicated().sum()

0

There are no duplicate rows in the data set.

### 3.5 Checking the number of unique values of each column

In [20]:
df.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

### 3.6 Exploring Data

In [22]:
df.columns

Index(['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch',
       'test_preparation_course', 'math_score', 'reading_score',
       'writing_score'],
      dtype='object')

In [30]:
for i in df.columns:
    print('column name :',i)
    print(df[i].value_counts(normalize=True))
    print()

column name : gender
gender
female    0.518
male      0.482
Name: proportion, dtype: float64

column name : race_ethnicity
race_ethnicity
group C    0.319
group D    0.262
group B    0.190
group E    0.140
group A    0.089
Name: proportion, dtype: float64

column name : parental_level_of_education
parental_level_of_education
some college          0.226
associate's degree    0.222
high school           0.196
some high school      0.179
bachelor's degree     0.118
master's degree       0.059
Name: proportion, dtype: float64

column name : lunch
lunch
standard        0.645
free/reduced    0.355
Name: proportion, dtype: float64

column name : test_preparation_course
test_preparation_course
none         0.642
completed    0.358
Name: proportion, dtype: float64

column name : math_score
math_score
65    0.036
62    0.035
69    0.032
59    0.032
61    0.027
      ...  
24    0.001
28    0.001
33    0.001
18    0.001
8     0.001
Name: proportion, Length: 81, dtype: float64

column name : readi

In [51]:
# numerical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
numeric_features


['math_score', 'reading_score', 'writing_score']

In [52]:
# categorical columns
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']
categorical_features 


['gender',
 'race_ethnicity',
 'parental_level_of_education',
 'lunch',
 'test_preparation_course']
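
An alternative to comparing each column's dtype against `'O'` is `DataFrame.select_dtypes`, which selects columns by dtype family. The sketch below is one possible way to get the same two lists; `sample` is a small stand-in built from rows shown earlier, not the full data set.

```python
import pandas as pd

# Small stand-in frame with two text columns and three score columns.
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced"],
    "math_score": [72, 69, 47],
    "reading_score": [72, 90, 57],
    "writing_score": [74, 88, 44],
})

# Columns holding numbers versus everything else (here: text categories).
numeric_features = sample.select_dtypes(include="number").columns.tolist()
categorical_features = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```

Selecting by `"number"` rather than by the object dtype keeps the split working if text columns are stored with a dedicated string dtype.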

#### Categories in each categorical column

In [50]:
for i in df.columns:
    if df[i].dtype=='O':
        print(i,'has ' ,df[i].nunique(),'categories')
        print(set(df[i]))
        print()

gender has  2 categories
{'male', 'female'}

race_ethnicity has  5 categories
{'group C', 'group E', 'group D', 'group B', 'group A'}

parental_level_of_education has  6 categories
{"master's degree", 'high school', "associate's degree", 'some high school', "bachelor's degree", 'some college'}

lunch has  2 categories
{'free/reduced', 'standard'}

test_preparation_course has  2 categories
{'none', 'completed'}
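
Because a Python `set` has no guaranteed order, the printed categories can appear in a different order on each run. A sorted list per column gives stable output. This is a minimal sketch on a stand-in frame; the `categories` dictionary is an illustrative name.

```python
import pandas as pd

# Small stand-in frame built from rows shown earlier.
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math_score": [72, 69, 47, 76],
})

# Map each non-numeric column to its sorted list of distinct values.
categories = {
    column: sorted(sample[column].unique())
    for column in sample.select_dtypes(exclude="number").columns
}

for column, values in categories.items():
    print(f"{column} has {len(values)} categories: {values}")
```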



In [53]:
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


### 3.7 Adding total and average score for each student

In [66]:
df['total_scored'] = df['math_score'] + df['reading_score'] + df['writing_score']
df['averge_score'] = np.round(df['total_scored']/3,2)
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score | total_scored | averge_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 | 72.67 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 | 82.33 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 278 | 92.67 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 148 | 49.33 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 229 | 76.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 | 282 | 94.00 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 | 172 | 57.33 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 | 195 | 65.00 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 | 223 | 74.33 |
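
The total and average can also be computed row-wise over a list of score columns, which avoids repeating column names and the hard-coded divisor of 3. The sketch below uses the first five rows shown above; `score_columns`, `total_score` and `average_score` are illustrative names and differ from the `total_scored` and `averge_score` columns created in the notebook.

```python
import pandas as pd

# Scores from the first five rows shown above.
sample = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

score_columns = ["math_score", "reading_score", "writing_score"]

# axis=1 aggregates across the columns of each row, giving one value per student.
sample["total_score"] = sample[score_columns].sum(axis=1)
sample["average_score"] = sample[score_columns].mean(axis=1).round(2)

print(sample)
```

If another subject were added later, only `score_columns` would need to change.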


In [74]:
reading_full = df[df['reading_score'] == 100]['reading_score'].count()
reading_full

17

In [78]:
# Count students with a perfect score (100) in each subject
reading_full = df[df['reading_score'] == 100]['reading_score'].count()
writing_full = df[df['writing_score'] == 100]['writing_score'].count()
math_full = df[df['math_score'] == 100]['math_score'].count()



print(f'Number of students with full marks in Maths: {math_full}')
print()
print(f'Number of students with full marks in Writing: {writing_full}')
print()
print(f'Number of students with full marks in Reading: {reading_full}')


Number of students with full marks in Maths: 7

Number of students with full marks in Writing: 14

Number of students with full marks in Reading: 17
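
Since a comparison such as `scores == 100` yields booleans, the three counts can also be obtained in one step by summing the `True` values per column. The sketch below uses a small made-up frame purely for illustration (its values are not taken from the data set), and `full_marks` is an assumed name.

```python
import pandas as pd

# Made-up scores for illustration only; not rows from the actual data set.
scores = pd.DataFrame({
    "math_score": [100, 72, 100, 47],
    "reading_score": [100, 90, 95, 57],
    "writing_score": [99, 100, 100, 44],
})

# Each True counts as 1, so the column sums are the number of perfect scores.
full_marks = (scores == 100).sum()

for subject, count in full_marks.items():
    print(f"Number of students with full marks in {subject}: {count}")
```

In the notebook the same expression could be applied as `(df[['math_score', 'reading_score', 'writing_score']] == 100).sum()`.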
