## Student Performance Indicator


#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project examines how students' performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch and test preparation course.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The data consists of 8 columns and 1000 rows.

### 2.1 Import Data and Required Packages
#### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [None]:
df = pd.read_csv('data/stud.csv')

#### Show Top 5 Records

In [None]:
df.head()

#### Shape of the dataset

In [None]:
df.shape

### 2.2 Dataset information

- gender : sex of students -> (male/female)
- race/ethnicity : ethnicity of students -> (group A, B, C, D, E)
- parental level of education : parents' final education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)
- lunch : having lunch before test (standard or free/reduced)
- test preparation course : completed or not completed before test
- math score
- reading score
- writing score
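
The column names listed above contain spaces and a slash, while the code cells in this notebook use underscore names such as `race_ethnicity` and `math_score`. A minimal sketch of one way to normalize such headers is shown here; the one-row frame and the helper name `to_snake_case` are made up for illustration, and `data/stud.csv` may already ship with the underscore names.

```python
import pandas as pd

# Hypothetical one-row frame that uses the column names listed above.
raw = pd.DataFrame(
    [["female", "group B", "some college", "standard", "none", 70, 75, 73]],
    columns=[
        "gender", "race/ethnicity", "parental level of education", "lunch",
        "test preparation course", "math score", "reading score", "writing score",
    ],
)


def to_snake_case(name: str) -> str:
    """Replace slashes and spaces with underscores (illustrative helper)."""
    return name.strip().replace("/", "_").replace(" ", "_")


# rename() accepts a function and applies it to every column label.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```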

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column
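
The checks listed above can also be gathered in a single helper. This is only a sketch: `run_data_checks` is a hypothetical name and the toy frame is invented so that it contains one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd


def run_data_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic quality checks in one dictionary (illustrative helper)."""
    return {
        "missing_per_column": frame.isna().sum(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes,
        "unique_per_column": frame.nunique(),
    }


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math_score": [70.0, 55.0, 55.0, np.nan],
})

checks = run_data_checks(toy)
print(checks["missing_per_column"])
print("duplicate rows:", checks["duplicate_rows"])
```

Calling `run_data_checks(df)` on the real DataFrame would return the same four summaries that sections 3.1 to 3.4 compute one at a time.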

### 3.1 Check Missing values

In [None]:
# Checking for missing values in the DataFrame

# The isna() function returns a DataFrame of the same shape as df, 
# with Boolean values indicating where values are missing (NaN).
missing_values = df.isna()

# The sum() function, when called on a DataFrame, 
# returns the sum of values for each column. 
# Since the DataFrame contains Boolean values (True for NaN, False otherwise),
# summing them will count the number of missing values in each column.
missing_values_count = missing_values.sum()

# Displaying the count of missing values for each column
missing_values_count


#### There are no missing values in the data set

### 3.2 Check Duplicates

In [None]:
# Checking for duplicated rows in the DataFrame

# The duplicated() function returns a Boolean Series where each element is True 
# if the corresponding row is a duplicate (has identical values in all columns as another row).
duplicated_rows = df.duplicated()
#print(duplicated_rows)
# The sum() function, when called on this Boolean Series, 
# counts the number of True values, which represents the number of duplicated rows.
duplicated_rows_count = duplicated_rows.sum()

# Displaying the count of duplicated rows
duplicated_rows_count


#### There are no duplicate values in the data set

### 3.3 Check data types

In [None]:
# Displaying summary information about the DataFrame

# The info() function provides a concise summary of the DataFrame.
# It includes the following details:
# 1. Index range (i.e., the number of rows).
# 2. Column names and their data types.
# 3. The number of non-null (non-missing) values in each column.
# 4. Memory usage of the DataFrame.
df.info()


### 3.4 Checking the number of unique values of each column

In [None]:
# Checking the number of unique values in each column of the DataFrame

# The nunique() function returns a Series with the number of unique values for each column.
# This can help in understanding the variability and uniqueness of data in each column.
unique_values_count = df.nunique()

# Displaying the count of unique values for each column
unique_values_count


### 3.5 Check statistics of data set

In [None]:
# Generating descriptive statistics of the DataFrame

# The describe() function returns a summary of statistics for numerical columns in the DataFrame.
# By default, it includes count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 
# 75th percentile, and maximum for each numerical column.
# For categorical columns, it can provide statistics if the 'include' parameter is set to 'all' or 'object'.
descriptive_stats = df.describe()

# Displaying the summary of statistics
descriptive_stats


#### Insight
- From the above description of the numerical data, all means are very close to each other, between 66 and 68.05.
- All standard deviations are also close, between 14.6 and 15.19.
- The minimum score is 0 for math, while the minimums are much higher for writing (10) and reading (17).
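
`describe()` summarizes only the numerical columns by default. As a hedged sketch on invented toy data, passing `include="all"` adds the `unique`, `top` and `freq` rows for the categorical columns as well:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "standard", "free/reduced"],
    "math_score": [70, 55, 62],
})

# include="all" keeps every column; numeric-only rows (mean, std, ...) are NaN
# for the text columns, and unique/top/freq are NaN for the numeric ones.
all_stats = toy.describe(include="all")
print(all_stats)
```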

### 3.7 Exploring Data

In [None]:
df.head()

### Displaying unique categories in specific categorical columns of the DataFrame

In [None]:
# Printing unique categories in the 'gender' variable
print("Categories in 'gender' variable:     ", end=" ")
# The unique() function returns an array of unique values in the specified column
print(df['gender'].unique())

# Printing unique categories in the 'race_ethnicity' variable
print("Categories in 'race_ethnicity' variable:  ", end=" ")
print(df['race_ethnicity'].unique())

# Printing unique categories in the 'parental level of education' variable
print("Categories in 'parental level of education' variable:", end=" ")
print(df['parental_level_of_education'].unique())

# Printing unique categories in the 'lunch' variable
print("Categories in 'lunch' variable:     ", end=" ")
print(df['lunch'].unique())

# Printing unique categories in the 'test preparation course' variable
print("Categories in 'test preparation course' variable:     ", end=" ")
print(df['test_preparation_course'].unique())


In [None]:
# Defining numerical and categorical columns

# Creating a list of numerical features by checking the data type of each column
# If the data type of a column is not 'O' (which stands for 'object', typically used for strings in pandas),
# it is considered a numerical feature.
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']

# Creating a list of categorical features by checking the data type of each column
# If the data type of a column is 'O', it is considered a categorical feature.
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# Printing the number of numerical features and their names
print('We have {} numerical features: {}'.format(len(numeric_features), numeric_features))

# Printing the number of categorical features and their names
print('\nWe have {} categorical features: {}'.format(len(categorical_features), categorical_features))
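
The same split can be sketched with `select_dtypes`, which avoids comparing each dtype to `'O'` by hand. The toy frame below is invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math_score": [70, 55],
    "reading_score": [81, 60],
})

# "number" matches every integer and float dtype; everything else is
# treated as categorical here.
numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```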


In [None]:
df.head(2)

### 3.8 Adding columns for "Total Score" and "Average"

In [None]:
# Creating a new column 'total score' by summing up the scores from 'math_score', 'reading_score', and 'writing_score'
df['total score'] = df['math_score'] + df['reading_score'] + df['writing_score']

# Creating a new column 'average' by dividing the 'total score' by 3
# This calculates the average score across the three subjects
df['average'] = df['total score'] / 3

# Displaying the first 5 rows of the DataFrame to verify the new columns
df.head()


In [None]:
# Counting the number of students with full marks (100) in each subject

# Counting the number of students with a reading score of 100
# This filters the DataFrame to include only rows where 'reading_score' is 100,
# and then counts the number of such rows by using the count() method on the 'average' column.
reading_full = df[df['reading_score'] == 100]['average'].count()

# Counting the number of students with a writing score of 100
# This filters the DataFrame to include only rows where 'writing_score' is 100,
# and then counts the number of such rows by using the count() method on the 'average' column.
writing_full = df[df['writing_score'] == 100]['average'].count()

# Counting the number of students with a math score of 100
# This filters the DataFrame to include only rows where 'math_score' is 100,
# and then counts the number of such rows by using the count() method on the 'average' column.
math_full = df[df['math_score'] == 100]['average'].count()

# Printing the number of students with full marks in each subject
print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')


In [None]:
# Counting the number of students with scores less than or equal to 20 in each subject

# Counting the number of students with a reading score less than or equal to 20
# This filters the DataFrame to include only rows where 'reading_score' is <= 20,
# and then counts the number of such rows by using the count() method on the 'average' column.
reading_less_20 = df[df['reading_score'] <= 20]['average'].count()

# Counting the number of students with a writing score less than or equal to 20
# This filters the DataFrame to include only rows where 'writing_score' is <= 20,
# and then counts the number of such rows by using the count() method on the 'average' column.
writing_less_20 = df[df['writing_score'] <= 20]['average'].count()

# Counting the number of students with a math score less than or equal to 20
# This filters the DataFrame to include only rows where 'math_score' is <= 20,
# and then counts the number of such rows by using the count() method on the 'average' column.
math_less_20 = df[df['math_score'] <= 20]['average'].count()

# Printing the number of students with less than or equal to 20 marks in each subject
print(f'Number of students with less than or equal to 20 marks in Maths: {math_less_20}')
print(f'Number of students with less than or equal to 20 marks in Writing: {writing_less_20}')
print(f'Number of students with less than or equal to 20 marks in Reading: {reading_less_20}')


#####  Insights
- From the above values, students have performed worst in Maths.
- The best performance is in the reading section.
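
The per-subject counts above can also be computed for all three subjects at once. This is a sketch on invented toy scores, not the notebook's actual numbers.

```python
import pandas as pd

# Toy scores, invented for illustration.
toy = pd.DataFrame({
    "math_score": [100, 18, 64, 0],
    "reading_score": [100, 75, 100, 17],
    "writing_score": [93, 20, 88, 10],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# Comparing a DataFrame to a scalar gives a Boolean frame; summing it
# counts the True values column by column.
full_marks = (toy[score_cols] == 100).sum()
at_most_20 = (toy[score_cols] <= 20).sum()

summary = pd.DataFrame({"full_marks": full_marks, "at_most_20": at_most_20})
print(summary)
```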

### 4. Exploring Data ( Visualization )
#### 4.1 Visualize the average score distribution to draw some conclusions.
- Histogram
- Kernel Density Estimate (KDE)

#### 4.1.1 Histogram & KDE

In [None]:
# Creating a figure with two subplots side by side

# fig, axs = plt.subplots(1, 2, figsize=(15, 7)) initializes a figure with 1 row and 2 columns of subplots.
# figsize=(15, 7) sets the size of the entire figure to be 15 inches wide and 7 inches tall.
fig, axs = plt.subplots(1, 2, figsize=(15, 7))

# Plotting the first subplot
# plt.subplot(121) specifies that the following plot will be drawn in the first subplot (1st row, 1st column)
plt.subplot(121)
# sns.histplot() creates a histogram with KDE (Kernel Density Estimate) for the 'average' column of the DataFrame 'df'.
# bins=30 sets the number of bins for the histogram.
# kde=True adds a KDE line to the histogram.
# color='g' sets the color of the histogram to green.
sns.histplot(data=df, x='average', bins=30, kde=True, color='g')

# Plotting the second subplot
# plt.subplot(122) specifies that the following plot will be drawn in the second subplot (1st row, 2nd column)
plt.subplot(122)
# sns.histplot() creates a histogram with KDE for the 'average' column of the DataFrame 'df'.
# kde=True adds a KDE line to the histogram.
# hue='gender' adds a hue dimension, splitting the data by 'gender' and coloring the histogram differently for each gender.
sns.histplot(data=df, x='average', kde=True, hue='gender')

# Displaying the plots
plt.show()


In [None]:
# Creating a figure with two subplots side by side

# fig, axs = plt.subplots(1, 2, figsize=(15, 7)) initializes a figure with 1 row and 2 columns of subplots.
# figsize=(15, 7) sets the size of the entire figure to be 15 inches wide and 7 inches tall.
fig, axs = plt.subplots(1, 2, figsize=(15, 7))

# Plotting the first subplot
# plt.subplot(121) specifies that the following plot will be drawn in the first subplot (1st row, 1st column)
plt.subplot(121)
# sns.histplot() creates a histogram with KDE (Kernel Density Estimate) for the 'total score' column of the DataFrame 'df'.
# bins=30 sets the number of bins for the histogram.
# kde=True adds a KDE line to the histogram.
# color='g' sets the color of the histogram to green.
sns.histplot(data=df, x='total score', bins=30, kde=True, color='g')

# Plotting the second subplot
# plt.subplot(122) specifies that the following plot will be drawn in the second subplot (1st row, 2nd column)
plt.subplot(122)
# sns.histplot() creates a histogram with KDE for the 'total score' column of the DataFrame 'df'.
# kde=True adds a KDE line to the histogram.
# hue='gender' adds a hue dimension, splitting the data by 'gender' and coloring the histogram differently for each gender.
sns.histplot(data=df, x='total score', kde=True, hue='gender')

# Displaying the plots
plt.show()


#####  Insights
- Female students tend to perform better than male students.

In [None]:
# Creating a figure with three subplots side by side

# plt.subplots(1, 3, figsize=(25, 6)) initializes a figure with 1 row and 3 columns of subplots.
# figsize=(25, 6) sets the size of the entire figure to be 25 inches wide and 6 inches tall.
plt.subplots(1, 3, figsize=(25, 6))

# Plotting the first subplot
# plt.subplot(131) specifies that the following plot will be drawn in the first subplot (1st row, 3 columns, 1st position)
plt.subplot(131)
# sns.histplot() creates a histogram with KDE (Kernel Density Estimate) for the 'average' column of the DataFrame 'df'.
# kde=True adds a KDE line to the histogram.
# hue='lunch' adds a hue dimension, splitting the data by 'lunch' and coloring the histogram differently for each lunch type.
sns.histplot(data=df, x='average', kde=True, hue='lunch')

# Plotting the second subplot
# plt.subplot(132) specifies that the following plot will be drawn in the second subplot (1st row, 3 columns, 2nd position)
plt.subplot(132)
# sns.histplot() creates a histogram with KDE for the 'average' column of the DataFrame 'df' filtered to include only female students.
# kde=True adds a KDE line to the histogram.
# hue='lunch' adds a hue dimension, splitting the data by 'lunch' and coloring the histogram differently for each lunch type.
sns.histplot(data=df[df.gender == 'female'], x='average', kde=True, hue='lunch')

# Plotting the third subplot
# plt.subplot(133) specifies that the following plot will be drawn in the third subplot (1st row, 3 columns, 3rd position)
plt.subplot(133)
# sns.histplot() creates a histogram with KDE for the 'average' column of the DataFrame 'df' filtered to include only male students.
# kde=True adds a KDE line to the histogram.
# hue='lunch' adds a hue dimension, splitting the data by 'lunch' and coloring the histogram differently for each lunch type.
sns.histplot(data=df[df.gender == 'male'], x='average', kde=True, hue='lunch')

# Displaying the plots
plt.show()


#####  Insights
- Students with standard lunch tend to perform better in exams.
- This holds for both male and female students.
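
The histograms suggest a gap between lunch types. One way to put a number on it is a two-level `groupby`. The sketch below uses invented toy averages, so its figures say nothing about the real data.

```python
import pandas as pd

# Toy averages, invented for illustration.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "average": [74.0, 66.0, 70.0, 61.0],
})

# Mean 'average' per (gender, lunch) pair, with lunch types as columns.
mean_by_group = toy.groupby(["gender", "lunch"])["average"].mean().unstack("lunch")

# Difference between the two lunch types within each gender.
mean_by_group["gap"] = mean_by_group["standard"] - mean_by_group["free/reduced"]
print(mean_by_group)
```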

In [None]:
# Creating a figure with three subplots side by side

# plt.subplots(1, 3, figsize=(25, 6)) initializes a figure with 1 row and 3 columns of subplots.
# figsize=(25, 6) sets the size of the entire figure to be 25 inches wide and 6 inches tall.
plt.subplots(1, 3, figsize=(25, 6))

# Plotting the first subplot
# plt.subplot(131) specifies that the following plot will be drawn in the first subplot (1st row, 3 columns, 1st position)
plt.subplot(131)
# sns.histplot() creates a histogram with KDE (Kernel Density Estimate) for the 'average' column of the DataFrame 'df'.
# kde=True adds a KDE line to the histogram.
# hue='parental_level_of_education' adds a hue dimension, splitting the data by 'parental_level_of_education' and coloring the histogram differently for each education level.
ax = sns.histplot(data=df, x='average', kde=True, hue='parental_level_of_education')

# Plotting the second subplot
# plt.subplot(132) specifies that the following plot will be drawn in the second subplot (1st row, 3 columns, 2nd position)
plt.subplot(132)
# sns.histplot() creates a histogram with KDE for the 'average' column of the DataFrame 'df' filtered to include only male students.
# kde=True adds a KDE line to the histogram.
# hue='parental_level_of_education' adds a hue dimension, splitting the data by 'parental_level_of_education' and coloring the histogram differently for each education level.
ax = sns.histplot(data=df[df.gender == 'male'], x='average', kde=True, hue='parental_level_of_education')

# Plotting the third subplot
# plt.subplot(133) specifies that the following plot will be drawn in the third subplot (1st row, 3 columns, 3rd position)
plt.subplot(133)
# sns.histplot() creates a histogram with KDE for the 'average' column of the DataFrame 'df' filtered to include only female students.
# kde=True adds a KDE line to the histogram.
# hue='parental_level_of_education' adds a hue dimension, splitting the data by 'parental_level_of_education' and coloring the histogram differently for each education level.
ax = sns.histplot(data=df[df.gender == 'female'], x='average', kde=True, hue='parental_level_of_education')

# Displaying the plots
plt.show()


#####  Insights
- In general, parents' education level does not appear to help students perform well in exams.
- The 2nd plot shows that male students whose parents hold an associate's or master's degree tend to perform well in exams.
- The 3rd plot shows no visible effect of parents' education on female students.

In [None]:
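# Creating a figure with three subplots side by side (1 row, 3 columns)
plt.subplots(1, 3, figsize=(25, 6))
# First subplot: all students, split by race/ethnicity group
plt.subplot(131)
ax = sns.histplot(data=df, x='average', kde=True, hue='race_ethnicity')
# Second subplot: female students only
plt.subplot(132)
ax = sns.histplot(data=df[df.gender == 'female'], x='average', kde=True, hue='race_ethnicity')
# Third subplot: male students only
plt.subplot(133)
ax = sns.histplot(data=df[df.gender == 'male'], x='average', kde=True, hue='race_ethnicity')
plt.show()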

#####  Insights
- Students of group A and group B tend to perform poorly in exams.
- This holds irrespective of whether they are male or female.
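
The last three cells repeat the same three-panel layout (all students, female, male) with a different hue column each time. A possible refactoring is sketched below. `plot_average_by` is a hypothetical helper, it uses plain Matplotlib histograms rather than `sns.histplot`, and the toy data is invented.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_average_by(frame, hue):
    """Draw 'average' histograms split by `hue` for all, female and male students."""
    panels = {
        "all students": frame,
        "female": frame[frame["gender"] == "female"],
        "male": frame[frame["gender"] == "male"],
    }
    # One row, three columns; each panel gets its own Axes object.
    fig, axes = plt.subplots(1, 3, figsize=(25, 6), sharex=True)
    for ax, (title, subset) in zip(axes, panels.items()):
        # One semi-transparent histogram per category of the hue column.
        for level, group in subset.groupby(hue):
            ax.hist(group["average"], bins=10, alpha=0.5, label=str(level))
        ax.set_title(title)
        ax.set_xlabel("average")
        ax.legend(title=hue)
    return fig, axes


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "race_ethnicity": ["group A", "group C", "group A", "group C", "group C", "group A"],
    "average": [58.0, 72.5, 55.0, 69.0, 80.0, 61.5],
})

fig, axes = plot_average_by(toy, "race_ethnicity")
```

Addressing each panel through its own `Axes` object keeps the grid shape in one place, so the number of panels and the positions drawn into cannot drift apart.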

#### 4.2 Maximum score of students in all three subjects

In [None]:
# Creating a figure with a specific size
plt.figure(figsize=(18, 8))

# Plotting the first subplot
# plt.subplot(1, 4, 1) specifies that the following plot will be drawn in the first subplot (1st row, 4 columns, 1st position)
plt.subplot(1, 4, 1)
# Setting the title for the first subplot
plt.title('MATH SCORES')
# sns.violinplot() creates a violin plot for the 'math_score' column of the DataFrame 'df'.
# y='math_score' specifies that the 'math_score' values will be plotted on the y-axis.
# color='red' sets the color of the violin plot to red.
# linewidth=3 sets the width of the lines in the violin plot to 3.
sns.violinplot(y='math_score', data=df, color='red', linewidth=3)

# Plotting the second subplot
# plt.subplot(1, 4, 2) specifies that the following plot will be drawn in the second subplot (1st row, 4 columns, 2nd position)
plt.subplot(1, 4, 2)
# Setting the title for the second subplot
plt.title('READING SCORES')
# sns.violinplot() creates a violin plot for the 'reading_score' column of the DataFrame 'df'.
# y='reading_score' specifies that the 'reading_score' values will be plotted on the y-axis.
# color='green' sets the color of the violin plot to green.
# linewidth=3 sets the width of the lines in the violin plot to 3.
sns.violinplot(y='reading_score', data=df, color='green', linewidth=3)

# Plotting the third subplot
# plt.subplot(1, 4, 3) specifies that the following plot will be drawn in the third subplot (1st row, 4 columns, 3rd position)
plt.subplot(1, 4, 3)
# Setting the title for the third subplot
plt.title('WRITING SCORES')
# sns.violinplot() creates a violin plot for the 'writing_score' column of the DataFrame 'df'.
# y='writing_score' specifies that the 'writing_score' values will be plotted on the y-axis.
# color='blue' sets the color of the violin plot to blue.
# linewidth=3 sets the width of the lines in the violin plot to 3.
sns.violinplot(y='writing_score', data=df, color='blue', linewidth=3)

# Displaying the plots
plt.show()


#### Insights
- From the above three plots it is clearly visible that most students score between 60 and 80 in Maths, whereas in reading and writing most of them score between 50 and 80.
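
A reading taken from a violin plot can be cross-checked numerically. The sketch below computes the quartiles and the share of scores inside a band, here 60 to 80 inclusive. The scores are invented, so the output does not describe the real data set.

```python
import pandas as pd

# Toy scores, invented for illustration.
toy = pd.DataFrame({
    "math_score": [45, 62, 68, 75, 91],
    "reading_score": [52, 58, 71, 79, 95],
    "writing_score": [50, 60, 70, 80, 90],
})

# 25th, 50th and 75th percentiles of every subject.
quartiles = toy.quantile([0.25, 0.5, 0.75])

# between() is inclusive on both ends by default; the mean of the
# resulting Booleans is the fraction of students inside the band.
share_60_80 = toy.apply(lambda col: col.between(60, 80).mean())

print(quartiles)
print(share_60_80)
```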

#### 4.3 Multivariate analysis using pie plots

In [None]:
# Setting the figure size for all subplots
plt.rcParams['figure.figsize'] = (30, 12)

# Plotting the first pie chart for gender distribution
plt.subplot(1, 5, 1)
# Calculating the value counts for the 'gender' column
size = df['gender'].value_counts()
# Defining labels and colors for the pie chart
labels = 'Female', 'Male'
color = ['red', 'green']
# Creating the pie chart
plt.pie(size, colors=color, labels=labels, autopct='%.2f%%')
# Setting the title for the pie chart
plt.title('Gender', fontsize=20)
# Turning off the axis
plt.axis('off')

# Plotting the second pie chart for race/ethnicity distribution
plt.subplot(1, 5, 2)
# Calculating the value counts for the 'race_ethnicity' column
size = df['race_ethnicity'].value_counts()
# Defining labels and colors for the pie chart
labels = 'Group C', 'Group D', 'Group B', 'Group E', 'Group A'
color = ['red', 'green', 'blue', 'cyan', 'orange']
# Creating the pie chart
plt.pie(size, colors=color, labels=labels, autopct='%.2f%%')
# Setting the title for the pie chart
plt.title('Race/Ethnicity', fontsize=20)
# Turning off the axis
plt.axis('off')

# Plotting the third pie chart for lunch type distribution
plt.subplot(1, 5, 3)
# Calculating the value counts for the 'lunch' column
size = df['lunch'].value_counts()
# Defining labels and colors for the pie chart
labels = 'Standard', 'Free'
color = ['red', 'green']
# Creating the pie chart
plt.pie(size, colors=color, labels=labels, autopct='%.2f%%')
# Setting the title for the pie chart
plt.title('Lunch', fontsize=20)
# Turning off the axis
plt.axis('off')

# Plotting the fourth pie chart for test preparation course completion
plt.subplot(1, 5, 4)
# Calculating the value counts for the 'test_preparation_course' column
size = df['test_preparation_course'].value_counts()
# Defining labels and colors for the pie chart
labels = 'None', 'Completed'
color = ['red', 'green']
# Creating the pie chart
plt.pie(size, colors=color, labels=labels, autopct='%.2f%%')
# Setting the title for the pie chart
plt.title('Test Course', fontsize=20)
# Turning off the axis
plt.axis('off')

# Plotting the fifth pie chart for parental level of education
plt.subplot(1, 5, 5)
# Calculating the value counts for the 'parental_level_of_education' column
size = df['parental_level_of_education'].value_counts()
# Defining labels and colors for the pie chart
labels = 'Some College', "Associate's Degree", 'High School', 'Some High School', "Bachelor's Degree", "Master's Degree"
color = ['red', 'green', 'blue', 'cyan', 'orange', 'grey']
# Creating the pie chart
plt.pie(size, colors=color, labels=labels, autopct='%.2f%%')
# Setting the title for the pie chart
plt.title('Parental Education', fontsize=20)
# Turning off the axis
plt.axis('off')

# Adjusting layout to prevent overlap
plt.tight_layout()

# Adding a grid to the entire figure (optional and might not be very useful for pie charts)
plt.grid()

# Displaying all the plots
plt.show()


#####  Insights
- The number of male and female students is almost equal.
- The number of students is greatest in Group C.
- More students have standard lunch than free lunch.
- More students have not enrolled in any test preparation course than have completed one.
- "Some College" is the most common parental education level, followed closely by "Associate's Degree".
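
The pie charts above rely on hard-coded label tuples, which are only correct while they match the order returned by `value_counts()`. A minimal sketch of reading both the labels and the percentages from the counts themselves, on invented toy data:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"gender": ["female", "male", "female", "female", "male"]})

# value_counts() sorts from most to least frequent, so the index
# is already in the same order as the wedge sizes.
counts = toy["gender"].value_counts()
labels = counts.index.tolist()

# normalize=True returns fractions; scale to percentages for display.
percentages = toy["gender"].value_counts(normalize=True).mul(100).round(2)

print(labels)
print(percentages)
```

Passing `labels=counts.index` to `plt.pie` would then keep each label attached to its wedge even if the category frequencies change.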

#### 4.4 Feature Wise Visualization
#### 4.4.1 GENDER COLUMN
- How is gender distributed?
- Does gender have any impact on students' performance?

#### UNIVARIATE ANALYSIS (How is gender distributed?)

In [None]:
# Creating a figure with two subplots side by side
f, ax = plt.subplots(1, 2, figsize=(20, 10))

# Plotting a count plot for gender distribution
# sns.countplot() creates a bar plot that shows the count of observations in each categorical bin using bars.
# x=df['gender'] specifies the data to be plotted on the x-axis (the 'gender' column of the DataFrame 'df').
# data=df specifies the DataFrame containing the data.
# palette='bright' sets the color palette to 'bright'.
# ax=ax[0] specifies that the plot should be drawn on the first subplot.
# saturation=0.95 sets the saturation level of the colors.
sns.countplot(x=df['gender'], data=df, palette='bright', ax=ax[0], saturation=0.95)

# Adding labels to the bars in the count plot
# for container in ax[0].containers iterates over all the containers (bars) in the first subplot.
# ax[0].bar_label(container, color='black', size=20) adds labels to the bars with specified color and size.
for container in ax[0].containers:
    ax[0].bar_label(container, color='black', size=20)

# Plotting a pie chart for gender distribution on the second subplot
# gender_counts holds the count of each gender, sorted from most to least frequent.
# labels=gender_counts.index takes the labels from the counts so that they always match the wedges.
# explode=[0, 0.1] "explodes" the second wedge out slightly to emphasize it.
# autopct='%1.1f%%' formats the labels to show percentages with one decimal place.
# shadow=True adds a shadow to the pie chart.
# colors=['#ff4d4d', '#ff8000'] specifies the colors of the wedges.
gender_counts = df['gender'].value_counts()
ax[1].pie(x=gender_counts, labels=gender_counts.index, explode=[0, 0.1], autopct='%1.1f%%', shadow=True, colors=['#ff4d4d', '#ff8000'])

# Displaying the plots
plt.show()


#### BIVARIATE ANALYSIS (Does gender have any impact on students' performance?)

In [None]:
# Selecting only the numeric columns from the DataFrame
numeric_df = df.select_dtypes(include=[np.number])

# Grouping the DataFrame by the 'gender' column and calculating the mean for each group
gender_group = df.groupby('gender')[numeric_df.columns].mean()

# Displaying the resulting DataFrame
gender_group


In [None]:
# Setting the size of the figure
plt.figure(figsize=(10, 8))

# Defining the categories for the x-axis
X = ['Total Average', 'Math Average']

# Extracting the average scores for females and males from the gender_group DataFrame
# .loc selects rows by label, so the values do not depend on the row order
female_scores = [gender_group.loc['female', 'average'], gender_group.loc['female', 'math_score']]
male_scores = [gender_group.loc['male', 'average'], gender_group.loc['male', 'math_score']]

# Creating the x-axis positions
X_axis = np.arange(len(X))

# Plotting the bar chart
# plt.bar() creates bar charts
# X_axis - 0.2 shifts the male bars slightly to the left
# male_scores is the height of the bars for males
# 0.4 is the width of the bars
# label='Male' provides a label for the legend
plt.bar(X_axis - 0.2, male_scores, 0.4, label='Male')

# X_axis + 0.2 shifts the female bars slightly to the right
# female_scores is the height of the bars for females
plt.bar(X_axis + 0.2, female_scores, 0.4, label='Female')

# Setting the x-axis tick positions and labels
plt.xticks(X_axis, X)

# Labeling the y-axis
plt.ylabel("Marks")

# Setting the title of the plot
plt.title("Total average vs Math average marks of both the genders", fontweight='bold')

# Displaying the legend
plt.legend()

# Displaying the plot
plt.show()


#### Insights 
- On average, females have a better overall score than males.
- Males, however, have scored higher in Maths.

#### 4.4.2 RACE/ETHNICITY COLUMN
- How are students distributed across groups?
- Does race/ethnicity have any impact on students' performance?

#### UNIVARIATE ANALYSIS (How are students distributed across groups?)

In [None]:
# Creating a figure with two subplots side by side
f, ax = plt.subplots(1, 2, figsize=(20, 10))

# Plotting a count plot for race/ethnicity distribution
# sns.countplot() creates a bar plot that shows the count of observations in each categorical bin using bars.
# x=df['race_ethnicity'] specifies the data to be plotted on the x-axis (the 'race_ethnicity' column of the DataFrame 'df').
# data=df specifies the DataFrame containing the data.
# palette='bright' sets the color palette to 'bright'.
# ax=ax[0] specifies that the plot should be drawn on the first subplot.
# saturation=0.95 sets the saturation level of the colors.
sns.countplot(x=df['race_ethnicity'], data=df, palette='bright', ax=ax[0], saturation=0.95)

# Adding labels to the bars in the count plot
# for container in ax[0].containers iterates over all the containers (bars) in the first subplot.
# ax[0].bar_label(container, color='black', size=20) adds labels to the bars with specified color and size.
for container in ax[0].containers:
    ax[0].bar_label(container, color='black', size=20)

# Plotting a pie chart for race/ethnicity distribution
# plt.pie() creates a pie chart.
# x=df['race_ethnicity'].value_counts() provides the sizes of the wedges (the count of each race/ethnicity group).
# labels=df['race_ethnicity'].value_counts().index specifies the labels for the wedges.
# explode=[0.1, 0, 0, 0, 0] "explodes" the first wedge out slightly to emphasize it.
# autopct='%1.1f%%' formats the labels to show percentages with one decimal place.
# shadow=True adds a shadow to the pie chart.
plt.pie(x=df['race_ethnicity'].value_counts(), 
        labels=df['race_ethnicity'].value_counts().index, 
        explode=[0.1, 0, 0, 0, 0], 
        autopct='%1.1f%%', 
        shadow=True)

# Displaying the plots
plt.show()


#### Insights 
- Most of the students belong to group C or group D.
- The lowest number of students belong to group A.

#### BIVARIATE ANALYSIS (Does race/ethnicity have any impact on students' performance?)

In [None]:
# Grouping the DataFrame by the 'race_ethnicity' column
Group_data2 = df.groupby('race_ethnicity')

# Creating a figure with three subplots side by side
f, ax = plt.subplots(1, 3, figsize=(20, 8))

# Plotting the bar plot for average math scores by race/ethnicity
sns.barplot(x=Group_data2['math_score'].mean().index, 
            y=Group_data2['math_score'].mean().values, 
            palette='mako', 
            ax=ax[0])
ax[0].set_title('Math score', color='#005ce6', size=20)

# Adding labels to the bars in the math score bar plot
for container in ax[0].containers:
    ax[0].bar_label(container, color='black', size=15)

# Plotting the bar plot for average reading scores by race/ethnicity
sns.barplot(x=Group_data2['reading_score'].mean().index, 
            y=Group_data2['reading_score'].mean().values, 
            palette='flare', 
            ax=ax[1])
ax[1].set_title('Reading score', color='#005ce6', size=20)

# Adding labels to the bars in the reading score bar plot
for container in ax[1].containers:
    ax[1].bar_label(container, color='black', size=15)

# Plotting the bar plot for average writing scores by race/ethnicity
sns.barplot(x=Group_data2['writing_score'].mean().index, 
            y=Group_data2['writing_score'].mean().values, 
            palette='coolwarm', 
            ax=ax[2])
ax[2].set_title('Writing score', color='#005ce6', size=20)

# Adding labels to the bars in the writing score bar plot
for container in ax[2].containers:
    ax[2].bar_label(container, color='black', size=15)

# Displaying the plots
plt.show()


#### Insights 
- Group E students have scored the highest marks.
- Group A students have scored the lowest marks.
- Students from a lower socioeconomic status have a lower average in all course subjects.
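
The highest and lowest scoring groups can be read off programmatically instead of from the bar labels. The sketch below uses `idxmax` and `idxmin` on invented toy scores, not on the real data.

```python
import pandas as pd

# Toy scores, invented for illustration.
toy = pd.DataFrame({
    "race_ethnicity": ["group A", "group A", "group C", "group E", "group E"],
    "math_score": [55, 61, 66, 78, 74],
    "reading_score": [60, 64, 70, 75, 73],
    "writing_score": [58, 62, 68, 72, 70],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# One row per group, one column per subject.
group_means = toy.groupby("race_ethnicity")[score_cols].mean()

# idxmax/idxmin return, for each subject, the group label with the
# highest/lowest mean score.
highest = group_means.idxmax()
lowest = group_means.idxmin()

print(group_means)
print(highest)
print(lowest)
```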

#### 4.4.3 PARENTAL LEVEL OF EDUCATION COLUMN
- What is the educational background of students' parents?
- Does parental education have any impact on students' performance?

#### UNIVARIATE ANALYSIS (What is the educational background of students' parents?)

In [None]:
# Ensure the DataFrame has only strings in 'parental_level_of_education'
df['parental_level_of_education'] = df['parental_level_of_education'].astype(str)

# Setting the default figure size for all plots
plt.rcParams['figure.figsize'] = (15, 9)

# Applying the 'fivethirtyeight' style to the plot
plt.style.use('fivethirtyeight')

# Creating a count plot for the 'parental_level_of_education' column
# sns.countplot() creates a bar plot that shows the count of observations in each categorical bin using bars.
# x=df['parental_level_of_education'] specifies the data to be plotted on the x-axis
# (recent seaborn releases require this to be passed by keyword).
# palette='Blues' sets the color palette to 'Blues'.
sns.countplot(x=df['parental_level_of_education'], palette='Blues')

# Setting the title of the plot
plt.title('Comparison of Parental Education', fontweight=30, fontsize=20)

# Labeling the x-axis
plt.xlabel('Degree')

# Labeling the y-axis
plt.ylabel('Count')

# Displaying the plot
plt.show()

#### Insights 
- The largest group of parents has a "some college" level of education.
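
The counts behind the plot can be tabulated directly, together with each category's share. This is a minimal sketch on a made-up column (`education` is hypothetical toy data); in the notebook, `df['parental_level_of_education']` would take its place.

```python
import pandas as pd

# Hypothetical toy column standing in for df['parental_level_of_education']
education = pd.Series(
    ["some college", "some college", "high school", "master's degree"],
    name="parental_level_of_education",
)

# Absolute counts and percentage share of each category
counts = education.value_counts()
shares = education.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```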

#### BIVARIATE ANALYSIS ( Does parental education have any impact on student performance? )

In [None]:
# Select the numeric columns (the score columns) so that mean() is only applied to numbers
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Group by 'parental_level_of_education' and calculate the mean for numeric columns
grouped_df = df.groupby('parental_level_of_education')[numeric_cols].mean()

# Plot the result as a horizontal bar chart
grouped_df.plot(kind='barh', figsize=(10, 10))

# Add legend outside of the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Display the plot
plt.show()

#### Insights 
- Students whose parents hold a master's or bachelor's degree score higher than the others.

#### 4.4.4 LUNCH COLUMN 
- Which type of lunch is most common among students?
- What is the effect of lunch type on test results?


#### UNIVARIATE ANALYSIS ( Which type of lunch is most common among students? )

In [None]:
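# Setting the default figure size for all plots
plt.rcParams['figure.figsize'] = (15, 9)

# Matplotlib 3.6+ names this style 'seaborn-v0_8-talk'; older releases use 'seaborn-talk'
plt.style.use('seaborn-v0_8-talk')

# Count plot of lunch types; the column is passed by keyword as x
sns.countplot(x=df['lunch'], palette='PuBu')
plt.title('Comparison of different types of lunch', fontweight=30, fontsize=20)
plt.xlabel('Types of lunch')
plt.ylabel('Count')
plt.show()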

#### Insights 
- More students are served standard lunch than free/reduced lunch.

#### BIVARIATE ANALYSIS ( Does lunch type have any impact on student performance? )

In [None]:
# Two side-by-side count plots of parental education, split by test preparation course and by lunch type
# Column names use underscores, matching the cells above
f, ax = plt.subplots(1, 2, figsize=(20, 8))

sns.countplot(x='parental_level_of_education', data=df, palette='bright', hue='test_preparation_course', saturation=0.95, ax=ax[0])
ax[0].set_title('Students vs test preparation course', color='black', size=25)
for container in ax[0].containers:
    ax[0].bar_label(container, color='black', size=20)

sns.countplot(x='parental_level_of_education', data=df, palette='bright', hue='lunch', saturation=0.95, ax=ax[1])
ax[1].set_title('Students vs lunch type', color='black', size=25)
for container in ax[1].containers:
    ax[1].bar_label(container, color='black', size=20)

plt.show()

#### Insights 
- Students who get standard lunch tend to perform better than students who get free/reduced lunch.
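
The count plots in this section show group sizes rather than scores, so the insight is easier to check with a direct comparison of mean scores by lunch type. The sketch below uses a made-up DataFrame (`toy` and its values are hypothetical); in the notebook the same steps would run on `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are illustrative only
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "math_score": [70, 72, 58, 60],
    "reading_score": [71, 73, 64, 66],
    "writing_score": [70, 72, 62, 64],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# Mean score per lunch type
lunch_means = toy.groupby("lunch")[score_cols].mean()

# Difference in mean score: standard minus free/reduced
gap = lunch_means.loc["standard"] - lunch_means.loc["free/reduced"]
print(lunch_means)
print(gap)
```

A positive value in `gap` for every subject would support the insight; a mix of signs would not.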

#### 4.4.5 TEST PREPARATION COURSE COLUMN 
- Which test preparation course status is most common among students?
- Does the test preparation course have any impact on student performance?

#### BIVARIATE ANALYSIS ( Does the test preparation course have any impact on student performance? )

In [None]:
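# Mean score per lunch type, split by test preparation course, one panel per subject
plt.figure(figsize=(12, 6))
plt.subplot(2, 2, 1)
sns.barplot(x=df['lunch'], y=df['math_score'], hue=df['test_preparation_course'])
plt.subplot(2, 2, 2)
sns.barplot(x=df['lunch'], y=df['reading_score'], hue=df['test_preparation_course'])
plt.subplot(2, 2, 3)
sns.barplot(x=df['lunch'], y=df['writing_score'], hue=df['test_preparation_course'])

# Avoid overlapping labels between panels, then display the figure
plt.tight_layout()
plt.show()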

#### Insights  
- Students who have completed the test preparation course score higher in all three categories than those who haven't taken the course.
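
The grouped bars can be summarised as a pivot table with one row per lunch type and one column per course status. This is a minimal sketch on made-up data; `toy`, its values, and the labels `completed` and `none` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are illustrative only
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "test_preparation_course": ["completed", "none", "completed", "none"],
    "math_score": [74, 68, 63, 57],
})

# Mean math score for every (lunch, course status) combination
pivot = toy.pivot_table(
    index="lunch",
    columns="test_preparation_course",
    values="math_score",
    aggfunc="mean",
)

# Gain associated with completing the course, within each lunch type
pivot["gain"] = pivot["completed"] - pivot["none"]
print(pivot)
```

Computing the gain within each lunch type keeps the two factors separate, which mirrors the `hue` split in the bar plots.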

#### 4.4.6 CHECKING OUTLIERS

In [None]:
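# Box plots of each score and the average, side by side, to spot outliers
# Columns are passed by keyword and use the underscore names from the cells above
plt.figure(figsize=(16, 5))
plt.subplot(141)
sns.boxplot(x=df['math_score'], color='skyblue')
plt.subplot(142)
sns.boxplot(x=df['reading_score'], color='hotpink')
plt.subplot(143)
sns.boxplot(x=df['writing_score'], color='yellow')
plt.subplot(144)
sns.boxplot(x=df['average'], color='lightgreen')
plt.show()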
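
By default, box plot whiskers extend to 1.5 times the interquartile range (IQR), and points beyond them are drawn as outliers. The same rule can be applied numerically to list those points. The sketch below is an illustration on made-up scores; `iqr_outliers` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def iqr_outliers(scores: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = scores.quantile(0.25)
    q3 = scores.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return scores[(scores < lower) | (scores > upper)]


# Hypothetical toy scores with one unusually low value
toy_scores = pd.Series([55, 60, 62, 65, 66, 68, 70, 72, 75, 10], name="math_score")

outliers = iqr_outliers(toy_scores)
print(outliers.tolist())
```

In the notebook, calling the helper on each score column would give the count of outliers per subject.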

#### 4.4.7 MULTIVARIATE ANALYSIS USING PAIRPLOT

In [None]:
sns.pairplot(df,hue = 'gender')
plt.show()

#### Insights
- From the above plot it is clear that all the scores increase linearly with each other.
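
The linear relationship seen in the pair plot can be quantified with Pearson correlation coefficients. This is a minimal sketch on made-up scores (`toy` is hypothetical); in the notebook the three score columns of `df` would be used instead.

```python
import pandas as pd

# Hypothetical toy data standing in for the three score columns of df
toy = pd.DataFrame({
    "math_score": [50, 60, 70, 80, 90],
    "reading_score": [55, 63, 72, 79, 92],
    "writing_score": [52, 61, 73, 78, 91],
})

# Pairwise Pearson correlation between the score columns
corr = toy.corr(method="pearson")
print(corr.round(3))
```

Values close to 1 indicate that two scores rise together almost linearly.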

### 5. Conclusions
- Student performance is related to lunch type, race/ethnicity, and parental level of education.
- Females lead in pass percentage and are also the top scorers.
- Student performance is not strongly related to the test preparation course.
- Finishing the preparation course is nonetheless beneficial.