# Introduction

#### In this notebook I try to determine which features affect students' per-subject and overall performance, by looking at the overall data distribution and at clusters formed from the scores. The features involved are:

* gender : sex of students
* race/ethnicity : ethnicity of students
* parental level of education : parents' final education
* lunch : standard or free/reduced
* test preparation course : whether a course to prepare for the test was completed or not

# Importing Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.cluster import KMeans
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
%config InteractiveShell.ast_node_interactivity = 'all'

# Loading Dataset

In [None]:
data = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
data.head()

# Basic EDA

In [None]:
data.describe()

In [None]:
data.describe(include = 'object')

In [None]:
data.shape

In [None]:
data.duplicated().sum()

In [None]:
data.isnull().sum()

In [None]:
data['sum'] = data['math score'] + data['reading score'] + data['writing score'] # making new feature which takes overall score

In [None]:
data['parental level of education'].unique()

In [None]:
data['race/ethnicity'].unique()

In [None]:
lst = ['math score','reading score','writing score','sum']
features_cat = ['gender','race/ethnicity','parental level of education','test preparation course','lunch']
features_num = ['math score','reading score','writing score','sum']
data_copy = data.drop( features_cat,axis = 'columns')

# KMeans Clustering 

##### To group students by their scores and check which features affect the scores significantly

In [None]:
preprocessor = make_column_transformer(
    (StandardScaler(), features_num ))
X = preprocessor.fit_transform(data_copy)
wcss=[]
for i in range(1,30):
    kmeans = KMeans(i)
    kmeans.fit(X)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

number_clusters = range(1,30)
plt.figure(figsize = (10,10))
plt.plot(number_clusters,wcss,marker = 'o')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
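
The WCSS plotted above is what scikit-learn exposes as `inertia_`: the sum of squared distances from each point to its assigned centroid. A minimal sketch on synthetic data recomputes it by hand; `X_demo`, `km` and `manual_wcss` are illustrative names and the random data is only a stand-in for the scaled scores, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled score matrix (assumption: 200 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

# Fixed seed so the demo is repeatable; k=4 is an arbitrary choice for illustration
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# WCSS: squared distance from every point to the centroid of its own cluster, summed
assigned_centroids = km.cluster_centers_[km.labels_]
manual_wcss = float(((X_demo - assigned_centroids) ** 2).sum())

print(manual_wcss, km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why a flattening `inertia_` curve is read as diminishing returns from adding clusters.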

In [None]:
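kmeans = KMeans(7)
# fit_predict fits the model and returns the labels, so a separate fit() call is not needed.
# Cluster ids are arbitrary and can change between runs unless random_state is set.
identified_clusters = kmeans.fit_predict(X)
data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters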

In [None]:
# %config InteractiveShell.ast_node_interactivity = 'last'
for i in lst:
    # describe() of score i for each cluster, one column per cluster id
    df = pd.concat(
        [data_with_clusters.loc[data_with_clusters['Clusters'] == c, i].describe() for c in range(7)],
        axis = 'columns')
    df.columns = list(range(7))
    # expressions inside a loop are not auto-displayed, so display the styled table explicitly
    display(df.style.set_caption(i.title()))

#### Here we can see that, based on overall scores, the clusters in descending order are 6, 2, 4, 0, 3, 5, 1. Cluster ids are arbitrary, so this order applies to the run recorded here and may differ on a rerun.

#### Below we group parental level of education into 3 classes: low, medium and high. We also shorten the entries in the race/ethnicity column.

In [None]:
# Collapse the six education levels into three classes and shorten 'group X' to 'X'
education_map = {'some high school': 'low', 'high school': 'low', 'some college': 'low',
                 "associate's degree": 'medium', "bachelor's degree": 'high', "master's degree": 'high'}
race_map = {f'group {g}': g for g in 'ABCDE'}
for frame in (data, data_with_clusters):
    frame['parental level of education v2'] = frame['parental level of education'].replace(education_map)
    frame['race/ethnicity v2'] = frame['race/ethnicity'].replace(race_map)

#### Now we rank the clusters by their mean overall score

In [None]:
# df still holds the per-cluster describe() table for 'sum' from the loop above
ranksdf = df.sort_values(by = 'mean',axis = 1,ascending = False)
ranks = ranksdf.columns
ranks
# Rank 1 = cluster with the highest mean overall score
data_with_clusters['Ranks'] = data_with_clusters['Clusters'].map(dict(zip(ranks, range(1, 8))))
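
The cluster-to-rank step can be checked on a toy frame. This is a minimal sketch with made-up values; `toy`, `cluster_means` and `rank_of` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy data: cluster ids are arbitrary labels, 'sum' is the total score (illustrative values)
toy = pd.DataFrame({
    'Clusters': [0, 0, 1, 1, 2, 2],
    'sum':      [150, 170, 280, 260, 90, 110],
})

# Mean total per cluster, highest first
cluster_means = toy.groupby('Clusters')['sum'].mean().sort_values(ascending=False)

# Rank 1 goes to the best-scoring cluster, rank 2 to the next, and so on
rank_of = {cluster: rank for rank, cluster in enumerate(cluster_means.index, start=1)}
toy['Ranks'] = toy['Clusters'].map(rank_of)

print(toy)
```

Mapping through a dictionary keeps every cluster id tied to exactly one rank, which makes the result easy to verify by eye.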

# Analysis based on Gender 

In [None]:
plt.figure(figsize = (5,6))
ax = plt.subplot(111)
sns.countplot(x = data['gender'],order = ['male','female'],palette = ['blue','pink'])  # pass the column as x; recent seaborn treats the first positional argument as data
ax.bar_label(ax.containers[0])
ax.set_ylabel('Count', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Gender', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Here we see that female students outnumber male students, but not by much.

#### Some descriptive stats for individual and overall scores can be seen below

In [None]:
data.groupby(['gender'])['math score'].describe()

In [None]:
data.groupby(['gender'])['reading score'].describe()

In [None]:
data.groupby(['gender'])['writing score'].describe()

In [None]:
data.groupby(['gender'])['sum'].describe()

In [None]:
for i in lst:
    desc = data.groupby(['gender'])[i].describe()
    desc.reset_index(level = 0,inplace = True)
    plt.figure(figsize = (5,6))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.barplot(x = 'gender',y = 'mean' , data = desc, order = ['male','female'],palette = ['blue','pink'])
    ax.bar_label(ax.containers[0])
    ax.set_ylabel(f'Avg. {i.title()}', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax.set_xlabel('Gender', size = 'large',backgroundcolor = 'yellow',labelpad = 20)


#### These plots indicate that on average male students did better in math but female students did better in reading and writing and hence, overall average scores were better for female students.

#### Distribution of scores is represented below to get a better idea in boxplot form and kdeplot form.

In [None]:
for i in lst:
    plt.figure(figsize = (10,10))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.boxplot(x = data['gender'], y = data[i], palette = ['blue','pink'],order = ['male','female'])
    tick_spacing = 5
    ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    ax.set_xlabel('Gender', size = 'large',backgroundcolor = 'yellow',labelpad = 20)
    ax.set_ylabel(i.title(), size = 'large',backgroundcolor = 'yellow',labelpad = 20)
    
data_female = data.loc[data['gender']== 'female']
data_male = data.loc[data['gender'] == 'male']
for i in lst:
    plt.figure(figsize = (20,5))
    ax1 = plt.subplot(111)
    sns.kdeplot(data_male[i], color = 'blue',multiple = 'stack')
    sns.kdeplot(data_female[i], color = 'pink', multiple = 'stack')
    tick_spacing = 5
    ax1.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    ax1.set_ylabel('Density', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax1.set_xlabel(i.title(), size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Next comes the analysis based on our kmeans clustering.

In [None]:
desc = data_with_clusters.groupby(['Ranks','gender'])['sum'].describe()
desc.reset_index(level = [0,1],inplace = True)
# Share of each gender that falls in each rank (a fraction of 1, despite the column name)
desc['percentage'] = desc['count'] / desc.groupby('gender')['count'].transform('sum')
desc
plt.figure(figsize = (10,10))
ax = plt.subplot(111)
sns.barplot(x = 'Ranks',y ='percentage' ,data = desc,hue = 'gender',palette = ['pink','blue'])
ax.set_ylabel('Percentage', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Ranks', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### This is the distribution across our kmeans clusters. There is no clear trend across the score-based ranks, so gender does not seem to be a major factor, even though the average overall score was higher for female students.
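
The within-group shares plotted above can also be obtained in one step with `pd.crosstab`. This is a minimal sketch on made-up rows; `toy` and `share` are hypothetical names, and `normalize='columns'` divides each column by its own total so every gender sums to 1.

```python
import pandas as pd

# Toy data: one row per student with an assigned rank (illustrative values)
toy = pd.DataFrame({
    'Ranks':  [1, 1, 2, 2, 2, 3],
    'gender': ['female', 'male', 'female', 'female', 'male', 'male'],
})

# Rows are ranks, columns are genders, cells are the share of that gender in that rank
share = pd.crosstab(toy['Ranks'], toy['gender'], normalize='columns')

print(share)
```

Comparing shares rather than raw counts keeps the comparison fair when the groups differ in size.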

# Analysis based on Test Preparation Course

In [None]:
plt.figure(figsize = (5,6))
ax = plt.subplot(111)
sns.countplot(x = data['test preparation course'],order = ['none','completed'],palette = ['red','green'])  # explicit order keeps red = none, green = completed
ax.bar_label(ax.containers[0])
ax.set_ylabel('Count', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Test Preparation Course', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Far fewer students have completed a test preparation course than have not.

#### Below are some descriptive stats based on individual and overall scores.

In [None]:
data.groupby(['test preparation course'])['math score'].describe()

In [None]:
data.groupby(['test preparation course'])['reading score'].describe()

In [None]:
data.groupby(['test preparation course'])['writing score'].describe()

In [None]:
data.groupby(['test preparation course'])['sum'].describe()

In [None]:
for i in lst:
    desc = data.groupby(['test preparation course'])[i].describe()
    desc.reset_index(level = 0,inplace = True)
    plt.figure(figsize = (5,6))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.barplot(x = 'test preparation course',y = 'mean' , data = desc, order = ['none','completed'],palette = ['red','green'])
    ax.bar_label(ax.containers[0])
    ax.set_ylabel(f'Avg. {i.title()}', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax.set_xlabel('Test Preparation Course', size = 'large',backgroundcolor = 'yellow',labelpad = 20)


#### On average we can say that students who did some preparation course scored better.

#### Below we can see the distribution of the scores through boxplot and kdeplot forms.

In [None]:
data_completed = data.loc[data['test preparation course']== 'completed']
data_none = data.loc[data['test preparation course'] == 'none']

In [None]:
for i in lst:
    plt.figure(figsize = (10,10))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.boxplot(x = data['test preparation course'], y = data[i], palette = ['red','green'],order = ['none','completed'])
    tick_spacing = 5
    ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    ax.set_xlabel('Test Preparation Course', size = 'large',backgroundcolor = 'yellow',labelpad = 20)
    ax.set_ylabel(i.title(), size = 'large',backgroundcolor = 'yellow',labelpad = 20)
    
for i in lst:
    plt.figure(figsize = (20,5))
    ax1 = plt.subplot(111)
    sns.kdeplot(data_none[i], color = 'red',multiple = 'stack')
    sns.kdeplot(data_completed[i], color = 'green',multiple = 'stack' )
    tick_spacing = 5
    ax1.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    ax1.set_ylabel('Density', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax1.set_xlabel(i.title(), size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Next comes the analysis based on our kmeans clustering.

In [None]:
desc = data_with_clusters.groupby(['Ranks','test preparation course'])['sum'].describe()
desc.reset_index(level = [0,1],inplace = True)
# Share of each course group that falls in each rank (a fraction of 1, despite the column name)
desc['percentage'] = desc['count'] / desc.groupby('test preparation course')['count'].transform('sum')
desc
plt.figure(figsize = (10,10))
ax = plt.subplot(111)
sns.barplot(x = 'Ranks',y ='percentage' ,data = desc,hue = 'test preparation course',palette = ['green','red'])
ax.set_ylabel('Percentage', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Ranks', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Here we can clearly see that the clusters with high scoring students had high percentages of students who had done some test preparation course.

# Analysis based on Lunch type

In [None]:
plt.figure(figsize = (5,6))
ax = plt.subplot(111)
sns.countplot(x = data['lunch'])  # pass the column as x; recent seaborn treats the first positional argument as data
ax.bar_label(ax.containers[0])
ax.set_ylabel('Count', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Lunch', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Far more students get a standard lunch than a free/reduced lunch.

#### Below are some descriptive stats for individual and overall scores.

In [None]:
data.groupby(['lunch'])['math score'].describe()

In [None]:
data.groupby(['lunch'])['reading score'].describe()

In [None]:
data.groupby(['lunch'])['writing score'].describe()

In [None]:
data.groupby(['lunch'])['sum'].describe()

In [None]:
for i in lst:
    desc = data.groupby(['lunch'])[i].describe()
    desc.reset_index(level = 0,inplace = True)
    plt.figure(figsize = (5,6))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.barplot(x = 'lunch',y = 'mean' , data = desc, order = ['standard','free/reduced'])
    ax.bar_label(ax.containers[0])
    ax.set_ylabel(f'Avg. {i.title()}', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax.set_xlabel('Lunch', size = 'large',backgroundcolor = 'yellow',labelpad = 20)


#### Here we can see that on average students eating a standard lunch scored better.

#### Below we can see the distribution of the scores through boxplot and kdeplot forms.

In [None]:
data_free = data.loc[data['lunch']== 'free/reduced']
data_standard = data.loc[data['lunch'] == 'standard']

In [None]:
for i in lst:
    plt.figure(figsize = (10,10))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.boxplot(x = data['lunch'], y = data[i])
    tick_spacing = 5
    ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    ax.set_xlabel('Lunch', size = 'large',backgroundcolor = 'yellow',labelpad = 20)
    ax.set_ylabel(i.title(), size = 'large',backgroundcolor = 'yellow',labelpad = 20)
    
for i in lst:
    plt.figure(figsize = (20,5))
    ax1 = plt.subplot(111)
    sns.kdeplot(data_standard[i],multiple = 'stack')
    sns.kdeplot(data_free[i],multiple = 'stack' )
    tick_spacing = 5
    ax1.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    ax1.set_ylabel('Density', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax1.set_xlabel(i.title(), size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Now comes the analysis based on our kmeans clusters.

In [None]:
desc = data_with_clusters.groupby(['Ranks','lunch'])['sum'].describe()
desc.reset_index(level = [0,1],inplace = True)
# Share of each lunch group that falls in each rank (a fraction of 1, despite the column name)
desc['percentage'] = desc['count'] / desc.groupby('lunch')['count'].transform('sum')
desc
plt.figure(figsize = (10,10))
ax = plt.subplot(111)
sns.barplot(x = 'Ranks',y ='percentage' ,data = desc,hue = 'lunch')
ax.set_ylabel('Percentage', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Ranks', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### We can easily notice the trend that, among the students who eat a standard lunch, a high percentage score well. Hence proper nourishment is important.

# Analysis based on Parental Level of Education

In [None]:
plt.figure(figsize = (20,6))
ax = plt.subplot(111)
sns.countplot(x = data['parental level of education v2'],palette = 'Greens',order = ['low','medium','high'] )  # pass the column as x
ax.bar_label(ax.containers[0])
ax.set_ylabel('Count', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Parental level of Education', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Here we can see that most of the parents fall in the low class, meaning a high school education or some college without a degree.

#### Some descriptive stats for individual and overall scores can be seen below

In [None]:
data.groupby(['parental level of education v2'])['math score'].describe()

In [None]:
data.groupby(['parental level of education v2'])['reading score'].describe()

In [None]:
data.groupby(['parental level of education v2'])['writing score'].describe()

In [None]:
for i in lst:
    desc = data.groupby(['parental level of education v2'])[i].describe()
    desc.reset_index(level = 0,inplace = True)
    plt.figure(figsize = (10,6))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.barplot(x = 'parental level of education v2',y = 'mean' , data = desc, order = ['low','medium','high'], palette = 'Greens')
    ax.bar_label(ax.containers[0])
    ax.set_ylabel(f'Avg. {i.title()}', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax.set_xlabel('Parental Level of education', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Here we see that on average students whose parents have a higher level of education scored better. 

#### Now comes the analysis based on our kmeans clusters.

In [None]:
desc = data_with_clusters.groupby(['Ranks','parental level of education v2'])['sum'].describe()
desc.reset_index(level = [0,1],inplace = True)
# Share of each education class that falls in each rank (a fraction of 1, despite the column name)
desc['percentage'] = desc['count'] / desc.groupby('parental level of education v2')['count'].transform('sum')
desc
plt.figure(figsize = (10,10))
sns.barplot(x = 'Ranks',y ='percentage' ,data = desc,hue = 'parental level of education v2', palette = 'Greens')

#### Here we can see the trend that, among the students whose parents have a higher level of education, a higher share do well in the examinations. However, this trend is weaker than the strong trends we have seen before.

# Analysis based on Race/Ethnicity

In [None]:
plt.figure(figsize = (20,6))
ax = plt.subplot(111)
sns.countplot(x = data['race/ethnicity v2'],order = ['A','B','C','D','E'] )  # pass the column as x
ax.bar_label(ax.containers[0])
ax.set_ylabel('Count', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Race/Ethnicity', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Most of the students belong to group C and group D.

#### Some descriptive stats for individual and overall scores can be seen below

In [None]:
data.groupby(['race/ethnicity v2'])['math score'].describe()

In [None]:
data.groupby(['race/ethnicity v2'])['reading score'].describe()

In [None]:
data.groupby(['race/ethnicity v2'])['writing score'].describe()

In [None]:
data.groupby(['race/ethnicity v2'])['sum'].describe()

In [None]:
for i in lst:
    desc = data.groupby(['race/ethnicity v2'])[i].describe()
    desc.reset_index(level = 0,inplace = True)
    plt.figure(figsize = (10,6))
    ax = plt.subplot(111)
    ax.title.set_text(i.title())
    sns.barplot(x = 'race/ethnicity v2',y = 'mean' , data = desc)
    ax.bar_label(ax.containers[0])
    ax.set_ylabel(f'Avg. {i.title()}', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
    ax.set_xlabel('Race/Ethnicity ', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### Here we can see a consistent trend: on average, students from group E tend to do well while students from group A do not.

#### Now comes the analysis based on our kmeans clusters.

In [None]:
desc = data_with_clusters.groupby(['Ranks','race/ethnicity v2'])['sum'].describe()
desc.reset_index(level = [0,1],inplace = True)
# Share of each race/ethnicity group that falls in each rank (a fraction of 1, despite the column name)
desc['percentage'] = desc['count'] / desc.groupby('race/ethnicity v2')['count'].transform('sum')
desc
plt.figure(figsize = (10,10))
ax = plt.subplot(111)
sns.barplot(x = 'Ranks',y ='percentage' ,data = desc,hue = 'race/ethnicity v2')
ax.set_ylabel('Percentage', size = 'large', backgroundcolor = 'yellow',labelpad = 20)
ax.set_xlabel('Ranks', size = 'large',backgroundcolor = 'yellow',labelpad = 20)

#### The trends here are the same as the trends seen for the averages.

# Conclusion

1. The gender ratio is fairly balanced, with slightly more female than male students. When it comes to scores, gender does not play a significant or consistent role, so a student can do well regardless of gender.

2. Far fewer students have taken a test preparation course than have not. According to our analysis this needs to change, and students should start using these courses to do well in examinations.

3. Lunch type, which is linked with the nourishment of the student, also plays an important role in how the student does in the examinations. Fortunately, most students get a standard lunch and hence proper nourishment.

4. Conclusions from the race/ethnicity feature cannot be drawn directly. We would first have to look at the conditions of the different groups, comparing group E with the others, to see why those students perform better.

5. Parental level of education seems to be a relevant factor but not a major one. It is also a factor that is linked with other conditions.
 
Overall, taking test preparation courses and proper nourishment are beneficial for the students.