## Import Libraries

In [19]:
import kagglehub
import pandas as pd
import os

## Download Dataset from Kaggle

In [20]:
path = kagglehub.dataset_download("spscientist/students-performance-in-exams")
files = os.listdir(path)
print(files)
print("Path to dataset files:", path)

['StudentsPerformance.csv']
Path to dataset files: /kaggle/input/students-performance-in-exams


## Read Dataset

In [34]:
df = pd.read_csv(os.path.join(path, "StudentsPerformance.csv"))

print(df.head())

print('Columns')
df.columns.to_list()

#print(df.info())

   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75  
Columns


['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course',
 'math score',
 'reading score',
 'writing score']

### 1. Which parental education level is linked with the highest average math score?


In [23]:
print("Average math score by parental education:")
print(df.groupby('parental level of education')['math score'].mean().sort_values(ascending=False))

Average math score by parental education:
parental level of education
master's degree       69.745763
bachelor's degree     69.389831
associate's degree    67.882883
some college          67.128319
some high school      63.497207
high school           62.137755
Name: math score, dtype: float64


### 2. Is there a significant score difference between males and females across all subjects?


In [24]:
print("Average scores by gender:")
print(df.groupby('gender')[['math score', 'reading score', 'writing score']].mean())

Average scores by gender:
        math score  reading score  writing score
gender                                          
female   63.633205      72.608108      72.467181
male     68.728216      65.473029      63.311203
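
The group means above show the direction of the gap but not whether it is statistically significant. As a minimal sketch, assuming Welch's t-test is an acceptable check (the notebook itself does not run one), the statistic can be computed with pandas alone. The `welch_t` helper and the toy data are hypothetical and are not taken from `StudentsPerformance.csv`.

```python
import math
import pandas as pd

def welch_t(a: pd.Series, b: pd.Series) -> float:
    """Welch's t-statistic for the difference in means of two samples."""
    # Sample variances (ddof=1) scaled by each group's size
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Toy data for illustration only (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male"],
    "math score": [60, 65, 70, 68, 72, 76],
})
female = toy.loc[toy["gender"] == "female", "math score"]
male = toy.loc[toy["gender"] == "male", "math score"]

# A negative value means the first group (female) has the lower mean
t_stat = welch_t(female, male)
print(round(t_stat, 3))
```

On the real data the same helper could be called once per subject, passing the female and male score columns from `df`. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` should return the same statistic together with a p-value.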


### 3. How much does completing the test preparation course improve performance in each subject?

In [25]:
print("Score improvement from test preparation:")
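# diff() subtracts the 'completed' row from the 'none' row, so negative values mean students who completed the course scored higher on average
print(df.groupby('test preparation course')[['math score', 'reading score', 'writing score']].mean().diff().dropna())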

Score improvement from test preparation:
                         math score  reading score  writing score
test preparation course                                          
none                      -5.617649      -7.359587      -9.914322
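
The table above is the `none` row minus the `completed` row, so the improvement from completing the course is the same numbers with the sign flipped. A small sketch of computing that gain directly, with explicit row labels so the sign is unambiguous, might look like this. The `prep_gain` helper and the toy data are hypothetical.

```python
import pandas as pd

SCORES = ["math score", "reading score", "writing score"]

def prep_gain(frame: pd.DataFrame) -> pd.Series:
    """Mean score of students who completed the course minus those who did not."""
    means = frame.groupby("test preparation course")[SCORES].mean()
    return means.loc["completed"] - means.loc["none"]

# Toy data for illustration only (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "test preparation course": ["none", "none", "completed", "completed"],
    "math score": [60, 70, 70, 80],
    "reading score": [62, 68, 75, 85],
    "writing score": [58, 66, 74, 86],
})
gain = prep_gain(toy)
print(gain)
```

Selecting the rows by label avoids relying on the alphabetical group order that `diff()` depends on.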


### 4. Which combination of gender, lunch type, and test preparation status produces the top 10% of scores?
- First, calculate the average score for each student

In [26]:
df['average score'] = df[['math score', 'reading score', 'writing score']].mean(axis=1)

- Determine the threshold for the top 10%

In [27]:
top_10_percent_threshold = df['average score'].quantile(0.90)

- Filter the DataFrame to include only students in the top 10%

In [28]:
top_performers = df[df['average score'] >= top_10_percent_threshold]

- Find the most common combination among top performers

In [29]:
top_combination = top_performers.groupby(['gender', 'lunch', 'test preparation course']).size().sort_values(ascending=False).index[0]

print("Combination of gender, lunch type, and test preparation status that produces the top 10% of scores:")
top_combination

Combination of gender, lunch type, and test preparation status that produces the top 10% of scores:


('female', 'standard', 'none')
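
The most common combination among top performers partly reflects how many students belong to each combination in the first place. A complementary view, sketched here under that assumption, is the share of each combination that reaches the top 10%. The `top_share_by_group` helper and the toy data are hypothetical; the toy uses only two grouping keys to stay short.

```python
import pandas as pd

def top_share_by_group(frame, keys, score_col="average score", q=0.90):
    """Fraction of each group's students at or above the q-th quantile."""
    threshold = frame[score_col].quantile(q)
    flagged = frame.assign(is_top=frame[score_col] >= threshold)
    # The mean of a boolean column is the proportion of True values
    return flagged.groupby(keys)["is_top"].mean().sort_values(ascending=False)

# Toy data for illustration only (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "female", "male",
               "male", "male", "male", "male", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard", "standard",
              "standard", "free/reduced", "free/reduced", "standard", "standard"],
    "average score": [90, 85, 60, 62, 70, 95, 50, 55, 65, 60],
})
shares = top_share_by_group(toy, ["gender", "lunch"])
print(shares)
```

On the notebook's data the call would pass `df` and the three keys `['gender', 'lunch', 'test preparation course']`.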

### 5. Does lunch type have a uniform impact across all race/ethnicity groups, or does its effect vary?

In [30]:
print("Average scores by race/ethnicity and lunch type:")
print(df.groupby(['race/ethnicity', 'lunch'])[['math score', 'reading score', 'writing score']].mean())

Average scores by race/ethnicity and lunch type:
                             math score  reading score  writing score
race/ethnicity lunch                                                 
group A        free/reduced   55.222222      60.555556      57.194444
               standard       65.981132      67.471698      66.396226
group B        free/reduced   57.434783      63.971014      61.521739
               standard       66.884298      69.280992      67.925620
group C        free/reduced   56.412281      63.412281      61.412281
               standard       68.941463      72.268293      71.395122
group D        free/reduced   61.115789      66.431579      66.452632
               standard       70.916168      72.077844      72.245509
group E        free/reduced   66.560976      68.731707      67.195122
               standard       76.828283      74.808081      73.151515


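- Students with standard lunch score higher than free/reduced-lunch students in every race/ethnicity group, but the size of the gap varies. In math, for example, it ranges from about 9.4 points (group B) to about 12.5 points (group C), so the impact is not uniform.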
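
One way to quantify how the lunch effect differs by group is to subtract the free/reduced mean from the standard mean within each race/ethnicity group. The sketch below does this for math only, using the group means copied from the printed table above; the variable names are illustrative.

```python
import pandas as pd

# Mean math scores copied from the grouped output above
means = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group B", "group B", "group C",
                       "group C", "group D", "group D", "group E", "group E"],
    "lunch": ["free/reduced", "standard"] * 5,
    "math score": [55.222222, 65.981132, 57.434783, 66.884298, 56.412281,
                   68.941463, 61.115789, 70.916168, 66.560976, 76.828283],
})

# One row per group, one column per lunch type
wide = means.pivot(index="race/ethnicity", columns="lunch", values="math score")

# Positive values mean standard-lunch students score higher
gap = wide["standard"] - wide["free/reduced"]
print(gap.round(2))
```

On the raw data, the same table can be produced with `df.groupby(['race/ethnicity', 'lunch'])['math score'].mean().unstack('lunch')`.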

### 6. What is the correlation between reading and writing scores? Is it stronger than math and writing?


In [31]:
print("Correlation matrix:")
correlation_matrix = df[['math score', 'reading score', 'writing score']].corr()
print(correlation_matrix)

Correlation matrix:
               math score  reading score  writing score
math score       1.000000       0.817580       0.802642
reading score    0.817580       1.000000       0.954598
writing score    0.802642       0.954598       1.000000


- The reading-writing correlation is the value in the 'reading score' row and 'writing score' column (or vice versa): 0.955. That is stronger than the math-writing correlation of 0.803.
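
The two values being compared can also be pulled out directly rather than read off the matrix. This is a small sketch using `Series.corr` (Pearson by default) on hypothetical toy data, not the real dataset.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "math score": [1, 3, 2, 4],
    "reading score": [1, 2, 3, 4],
    "writing score": [2, 4, 6, 8],
})

# Pairwise Pearson correlations
r_reading_writing = toy["reading score"].corr(toy["writing score"])
r_math_writing = toy["math score"].corr(toy["writing score"])

print(r_reading_writing > r_math_writing)
```

With the notebook's `correlation_matrix`, the equivalent lookups would be `correlation_matrix.loc['reading score', 'writing score']` and `correlation_matrix.loc['math score', 'writing score']`.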

### 7. Identify the top 5% performing students and analyze their demographic profiles. What patterns emerge?


In [32]:
top_5_percent_threshold = df['average score'].quantile(0.95)
top_5_percent_df = df[df['average score'] >= top_5_percent_threshold]
print("Demographic profile of top 5% performers:")
print(top_5_percent_df.describe(include='object')) # Describe categorical columns
print("\nCounts within demographic categories for top 5% performers:")
for col in ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']:
  print(f"\n{col}:")
  print(top_5_percent_df[col].value_counts())

Demographic profile of top 5% performers:
        gender race/ethnicity parental level of education     lunch  \
count       50             50                          50        50   
unique       2              5                           6         2   
top     female        group E          associate's degree  standard   
freq        36             14                          16        46   

       test preparation course  
count                       50  
unique                       2  
top                  completed  
freq                        33  

Counts within demographic categories for top 5% performers:

gender:
gender
female    36
male      14
Name: count, dtype: int64

race/ethnicity:
race/ethnicity
group E    14
group C    13
group D    12
group B     7
group A     4
Name: count, dtype: int64

parental level of education:
parental level of education
associate's degree    16
bachelor's degree     13
some college          10
master's degree        6
some high school      

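- Within the top 5% (50 students), 36 are female (72%), 46 have standard lunch (92%), and 33 completed the test preparation course (66%). Group E is the most common race/ethnicity (14 students) and associate's degree the most common parental education level (16 students). To judge whether these are real patterns rather than base rates, compare each share with the same share in the overall dataset.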
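
Comparing the top group with the overall dataset can be sketched by putting the two sets of shares side by side. The `share_comparison` helper, the `lift` column, and the toy data are hypothetical; a lift above 1 means the category is over-represented among top performers.

```python
import pandas as pd

def share_comparison(top: pd.DataFrame, overall: pd.DataFrame, col: str) -> pd.DataFrame:
    """Category shares in the top group next to the shares in the full data."""
    table = pd.DataFrame({
        "top share": top[col].value_counts(normalize=True),
        "overall share": overall[col].value_counts(normalize=True),
    }).fillna(0.0)  # categories absent from the top group get a share of 0
    table["lift"] = table["top share"] / table["overall share"]
    return table

# Toy data for illustration only (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "average score": [90, 92, 60, 70, 85, 50, 55, 65],
})
toy_top = toy[toy["average score"] >= toy["average score"].quantile(0.75)]
comparison = share_comparison(toy_top, toy, "gender")
print(comparison)
```

In the notebook this would be called with `top_5_percent_df`, `df`, and each demographic column in turn.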

### 8. Can we cluster students into performance categories (e.g., low, medium, high performers) using just Pandas logic? If yes, how?
- Yes, using quantile-based binning.

In [33]:
df['performance_category'] = pd.qcut(df['average score'], q=3, labels=['low', 'medium', 'high'])
print("Student performance categories (using quantiles):")
print(df['performance_category'].value_counts())
print("Average scores by performance category:")
print(df.groupby('performance_category', observed=False)['average score'].mean())

Student performance categories (using quantiles):
performance_category
low       336
medium    332
high      332
Name: count, dtype: int64
Average scores by performance category:
performance_category
low       51.942460
medium    68.502008
high      83.058233
Name: average score, dtype: float64


### **Power BI dashboard file name** = Nauman_Khalid_Student_Performance_Analysis_Project03.pbix