🎓 **Welcome to My Student Performance Analysis Portfolio!** 📚  

Ever wondered what truly impacts a student’s success in exams? 🌟 In this project, I explored a rich dataset from Kaggle, formatted as CSV, which dives deep into various factors shaping academic performance. From study habits 📖 and attendance 🎯 to parental involvement 🤝 and educational resources 📋, this dataset paints a comprehensive picture of what drives student achievement.  

My journey began with **data wrangling** 🛠️, where I tackled challenges like missing values, inconsistent data formats, and duplicate entries. Once the data was polished and ready, I dove into **Exploratory Data Analysis (EDA)** to uncover insightful trends, such as the relationship between motivational levels 💡, resource availability, and exam performance.  

To bring it all together, I crafted and answered key questions, including:  
- 📈 **Correlation Analysis:** Which factor has the strongest influence on exam scores?  
- ⏳ **Study Time by School Type:** How do private and public school students compare in hours studied per week?  
- 🧑‍🏫 **Teacher Quality:** How does the quality of teachers affect average exam scores?  
- 🎭 **Extracurricular Participation:** Do students who engage in extracurricular activities outperform their peers?  
- 📝 **Tutoring Insights:** How many tutoring sessions do top-performing students attend?  

This analysis highlights the power of data in understanding the intricate dynamics of education. Ready to explore the stories hidden in student performance data? Let’s dive in! 🎓✨

# Data Gathering

Data source: https://www.kaggle.com/datasets/lainguyn123/student-performance-factors

In [3]:
import pandas as pd
import numpy as np

In [4]:
student_performance_df = pd.read_csv('/Users/anakagungngurahanandasuryawedhana/Documents/learning material/DS/surya-wedhana-data-analysis-portofolio/StudentPerformanceFactors.csv')
student_performance_df.sample(5)

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
2238,18,68,Medium,High,No,7,69,Medium,Yes,1,Low,High,Public,Neutral,4,Yes,High School,Moderate,Female,63
4031,12,77,Low,Low,Yes,7,85,Medium,Yes,1,Medium,Medium,Public,Positive,3,No,College,Far,Male,62
2616,16,86,High,Medium,Yes,7,78,Low,Yes,1,Low,Medium,Private,Positive,4,No,Postgraduate,Moderate,Male,68
988,19,95,High,Low,No,10,58,Low,Yes,2,High,Medium,Public,Positive,3,No,High School,Near,Male,69
2404,30,67,Medium,Medium,Yes,8,94,Medium,Yes,1,Medium,Medium,Public,Neutral,2,No,College,Near,Female,69


# Data Wrangling

In [None]:
student_performance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

In [None]:
student_performance_df.isna().sum()

Unnamed: 0,0
Hours_Studied,0
Attendance,0
Parental_Involvement,0
Access_to_Resources,0
Extracurricular_Activities,0
Sleep_Hours,0
Previous_Scores,0
Motivation_Level,0
Internet_Access,0
Tutoring_Sessions,0


In [None]:
student_performance_df.Teacher_Quality.unique()

array(['Medium', 'High', 'Low', nan], dtype=object)

In [None]:
# Fill missing values with the most frequent category; assign back instead of using inplace on a column selection
student_performance_df['Teacher_Quality'] = student_performance_df['Teacher_Quality'].fillna(student_performance_df['Teacher_Quality'].mode().iloc[0])


In [None]:
student_performance_df.Parental_Education_Level.unique()

array(['High School', 'College', 'Postgraduate', nan], dtype=object)

In [None]:
# Fill missing values with the most frequent category; assign back instead of using inplace on a column selection
student_performance_df['Parental_Education_Level'] = student_performance_df['Parental_Education_Level'].fillna(student_performance_df['Parental_Education_Level'].mode().iloc[0])


In [None]:
student_performance_df.Distance_from_Home.unique()

array(['Near', 'Moderate', 'Far', nan], dtype=object)

In [None]:
# Fill missing values with the most frequent category; assign back instead of using inplace on a column selection
student_performance_df['Distance_from_Home'] = student_performance_df['Distance_from_Home'].fillna(student_performance_df['Distance_from_Home'].mode().iloc[0])
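
The three imputation cells in this section follow the same pattern, so they could be folded into one loop. A minimal sketch, assuming a small made-up frame (`demo_df`) in place of the real dataset; the column names match the notebook, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
demo_df = pd.DataFrame({
    "Teacher_Quality": ["Medium", "High", np.nan, "Medium"],
    "Distance_from_Home": ["Near", np.nan, "Near", "Far"],
})

# Fill each categorical column with its most frequent value (the mode).
# Assigning the result back to the column avoids calling an inplace
# method on an intermediate selection.
for column in ["Teacher_Quality", "Distance_from_Home"]:
    most_common = demo_df[column].mode().iloc[0]
    demo_df[column] = demo_df[column].fillna(most_common)

missing_after = int(demo_df.isna().sum().sum())
print(missing_after)
```

Mode imputation keeps every row but inflates the most frequent category slightly, which is worth keeping in mind when reading the category counts later on.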


In [None]:
print(student_performance_df.duplicated().sum())

0


In [None]:
student_performance_df.describe()

Unnamed: 0,Hours_Studied,Attendance,Sleep_Hours,Previous_Scores,Tutoring_Sessions,Physical_Activity,Exam_Score
count,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0
mean,19.975329,79.977448,7.02906,75.070531,1.493719,2.96761,67.235659
std,5.990594,11.547475,1.46812,14.399784,1.23057,1.031231,3.890456
min,1.0,60.0,4.0,50.0,0.0,0.0,55.0
25%,16.0,70.0,6.0,63.0,1.0,2.0,65.0
50%,20.0,80.0,7.0,75.0,1.0,3.0,67.0
75%,24.0,90.0,8.0,88.0,2.0,4.0,69.0
max,44.0,100.0,10.0,100.0,8.0,6.0,101.0


**Table Description**

1. The **highest number of hours studied** by a student is **44 hours per week**, while the **minimum is 1 hour per week**, with an **average of about 20.0 hours per week**.  
2. The **minimum attendance** recorded is **60%**.  
3. The **maximum sleep duration** is **10 hours per night**, while the **minimum is 4 hours per night**.  
4. The **lowest exam score** is **55**, and the **highest is recorded as 101**, which is likely an error because a score should not exceed 100. It would be better to correct this to **100**. The **average exam score is 67.2**.  
5. The **maximum number of tutoring sessions** attended by a student is **8 times per month**.  
6. The **highest average number of hours of physical activity per week** is **6 hours**.  
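
Item 4 proposes capping the out-of-range score at 100. One way to cap any value above the maximum, not only 101, is `Series.clip`. A minimal sketch on made-up scores (`scores` is hypothetical); whether capping is preferable to dropping or investigating the row is a judgment call:

```python
import pandas as pd

# Hypothetical scores, one of them above the valid maximum.
scores = pd.Series([55, 67, 101, 100], name="Exam_Score")

# Cap everything above 100 at 100; values at or below 100 are unchanged.
capped = scores.clip(upper=100)
print(capped.tolist())
```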

In [None]:
# Cap the out-of-range score; assign back instead of using inplace on a column selection
student_performance_df['Exam_Score'] = student_performance_df['Exam_Score'].replace(101, 100)


# EDA

In [None]:
student_performance_df = student_performance_df.reset_index()
# Turn the row index into a student_id column so that students can be counted per group
student_performance_df.rename(columns={'index': 'student_id'}, inplace=True)

In [None]:
student_performance_df.groupby(by='Motivation_Level').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Motivation_Level,student_id
0,Medium,3351
1,Low,1937
2,High,1319


**Student Motivational Level**

- From the table above, **medium motivational level** is the most common among students, with **3,351 students**.  
- The second most common is **low motivational level**, with **1,937 students**.  
- Lastly, **high motivational level** is observed in **1,319 students**.  
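
Counts like these can also be read off with `value_counts`, which can return shares alongside the raw counts. A sketch on invented data (`demo_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column name comes from the dataset.
demo_df = pd.DataFrame(
    {"Motivation_Level": ["Medium", "Low", "Medium", "High", "Medium"]}
)

# Raw counts per category, sorted from most to least frequent.
counts = demo_df["Motivation_Level"].value_counts()

# The same counts expressed as a fraction of all rows.
shares = demo_df["Motivation_Level"].value_counts(normalize=True)

summary = pd.DataFrame({"count": counts, "share": shares.round(2)})
print(summary)
```

The same two calls apply unchanged to the other categorical columns summarized in this section.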

In [None]:
student_performance_df.groupby(by='Access_to_Resources').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Access_to_Resources,student_id
0,Medium,3319
1,High,1975
2,Low,1313


**Availability of Educational Resources (Low, Medium, High)**

- From the table above, **medium availability of educational resources** is the most common, observed among **3,319 students**.  
- The second most common is **high availability of educational resources**, with **1,975 students**.  
- Lastly, **low availability of educational resources** is observed among **1,313 students**.  

In [None]:
student_performance_df.groupby(by='Parental_Involvement').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Parental_Involvement,student_id
0,Medium,3362
1,High,1908
2,Low,1337


**Level of parental involvement in the student's education (Low, Medium, High)**

- From the table above, **medium parental involvement in student's education** is the most common, recorded among **3,362 students**.  
- The second most common is **high parental involvement in student's education**, with **1,908 students**.  
- Lastly, **low parental involvement in student's education** is observed among **1,337 students**.  

In [None]:
student_performance_df.groupby(by='Extracurricular_Activities').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Extracurricular_Activities,student_id
0,Yes,3938
1,No,2669


**Student's Extracurricular Activities**

- From the table above, **3,938 students** are participating in extracurricular activities.  
- Conversely, **2,669 students** are not participating in extracurricular activities.  

In [None]:
student_performance_df.groupby(by='Internet_Access').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Internet_Access,student_id
0,Yes,6108
1,No,499


**Student's Internet Access**

- From the table above, **6,108 students** have internet access.  
- This is a significantly higher number compared to the **499 students** who do not have internet access.  

In [None]:
student_performance_df.groupby(by='Family_Income').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Family_Income,student_id
0,Low,2672
1,Medium,2666
2,High,1269


**Student's Family Income**

- According to the **Student's Family Income** table, the most common family income group is **low family income**, with **2,672 students**.  
- The second most common group is **medium family income**, with **2,666 students**.  
- Lastly, **high family income** is the least common, with only **1,269 students**.  

In [None]:
student_performance_df.groupby(by='Teacher_Quality').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Teacher_Quality,student_id
0,Medium,4003
1,High,1947
2,Low,657


**Student's Teacher Quality**

- According to the **Student's Teacher Quality** table, the most common teacher quality is **medium**, with **4,003 students** being taught by medium-quality teachers.  
- The second most common is **high-quality teachers**, who teach **1,947 students**.  
- Lastly, **low-quality teachers** teach only **657 students**, making this the lowest number among the teacher quality groups.

In [None]:
student_performance_df.groupby(by='Peer_Influence').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Peer_Influence,student_id
0,Positive,2638
1,Neutral,2592
2,Negative,1377


**Influence of Peers on Academic Performance**

- According to the table, **2,638 students** find that peers have a **positive influence** on their academic performance.  
- **2,592 students** consider the influence to be **neutral**.  
- Lastly, **1,377 students** believe that peers have a **negative influence** on their academic performance.  

In [None]:
student_performance_df.groupby(by='Parental_Education_Level').student_id.count().sort_values(ascending=False).reset_index()

Unnamed: 0,Parental_Education_Level,student_id
0,High School,3313
1,College,1989
2,Postgraduate,1305


**Parental Education Level**

- According to the table, the **most common parental education level** is **High School**, recorded for the parents of **3,313 students**.  
- The **second most common level** is **College**, with **1,989 students**.  
- Lastly, the **least common level** is **Postgraduate**, with **1,305 students**.  

# Questions

1. **Highest Correlation to Exam Score**  
   Identify which column has the **strongest correlation** with exam scores, and explain the relationship.  

2. **Hours Studied per Week by School Type**  
   Compare the **average, maximum, and minimum hours studied per week** between students from **private** and **public schools**.  

3. **Average Exam Scores by Teacher Quality**  
   Compare the **average exam scores** among students based on the quality of their teachers (**high**, **medium**, or **low**).  

4. **Exam Performance and Extracurricular Participation**  
   Determine which group of students, those who **participate in extracurricular activities** or those who do not, has more students with **exam scores of 75 or higher**.  

5. **Tutoring Sessions for High-Scoring Students**  
   Analyze the **number of tutoring sessions attended** by students who scored at or above a chosen threshold (here, **exam scores of 90 or higher**).  


In [None]:
import scipy.stats
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

student_performance_correlation_df = student_performance_df.copy()

# Encode each categorical column from its own values.
# Note: LabelEncoder assigns codes alphabetically (e.g. High=0, Low=1, Medium=2).
categorical_columns = [
    'Parental_Involvement', 'Access_to_Resources', 'Extracurricular_Activities',
    'Motivation_Level', 'Internet_Access', 'Family_Income', 'Teacher_Quality',
    'School_Type', 'Peer_Influence', 'Learning_Disabilities',
    'Parental_Education_Level', 'Distance_from_Home', 'Gender',
]
for column in categorical_columns:
    student_performance_correlation_df[column] = label_encoder.fit_transform(student_performance_correlation_df[column])


correlation_columns = student_performance_correlation_df.iloc[:,1:-1].columns

correlation_value = []
for column in correlation_columns:
    correlation_value.append(scipy.stats.spearmanr(student_performance_correlation_df[column], student_performance_correlation_df['Exam_Score'])[0])

correlation_data = {'columns_name': student_performance_correlation_df.iloc[:,1:-1].columns,
        'dependent_correlation': correlation_value}

correlation_table = pd.DataFrame(correlation_data)

correlation_table.sort_values(by='dependent_correlation', ascending=False).head(3)

Unnamed: 0,columns_name,dependent_correlation
1,Attendance,0.672366
0,Hours_Studied,0.480956
6,Previous_Scores,0.191941


**Exam Score Correlation Analysis**  

- Using the **Spearman correlation method**, the **Percentage of Classes Attended** shows the **highest correlation** with exam scores, with a correlation coefficient of **0.672366**.  
- The **second highest correlation** with exam scores is the **Number of Hours Spent Studying per Week**, which has a correlation coefficient of **0.480956**.  
- These positive correlations (**0.672366** and **0.480956**) indicate that as the percentage of classes attended and the number of study hours per week increase, exam scores tend to improve.  
- The results suggest a **moderate positive relationship** between these factors and exam scores, emphasizing the importance of both attendance and study habits, while acknowledging that other variables might also significantly impact performance.  
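
One caveat on the encoding step: `LabelEncoder` assigns integer codes in alphabetical order (`High`=0, `Low`=1, `Medium`=2), which does not reflect the natural order of these levels, and a rank correlation is sensitive to that order. A hedged sketch of an alternative that maps the levels explicitly before ranking. The `level_order` mapping and the toy values are assumptions for illustration, and the Spearman coefficient is computed here as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
demo_df = pd.DataFrame({
    "Motivation_Level": ["Low", "Medium", "High", "Low", "High", "Medium"],
    "Exam_Score": [60, 66, 72, 61, 70, 65],
})

# Explicit ordinal mapping (an assumption about the intended order).
level_order = {"Low": 0, "Medium": 1, "High": 2}
encoded = demo_df["Motivation_Level"].map(level_order)

# Spearman's rho equals the Pearson correlation of the rank-transformed data.
# rank() assigns average ranks to ties, as the Spearman definition requires.
rho = encoded.rank().corr(demo_df["Exam_Score"].rank())
print(round(rho, 3))
```

With an explicit mapping, the sign of the coefficient can be read as "higher level goes with higher score", which is not guaranteed under alphabetical codes.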

In [None]:
student_performance_df.groupby(by='School_Type').agg({
    'Hours_Studied' : ['max','min','mean']
})

School_Type,Hours_Studied (max),Hours_Studied (min),Hours_Studied (mean)
Private,44,1,19.972623
Public,43,1,19.976512


**Hours Studied per Week by School Type**  

- According to the data, there is no meaningful difference in study time between the two school types. The maximum is **44 hours per week** for private schools and **43 hours per week** for public schools.  
- The **minimum time spent studying** per week is **1 hour** for both school types.  
- The **average** is also nearly identical: **19.97 hours** (private) versus **19.98 hours** (public).  
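
The grouped summary can also be written with named aggregation, which yields flat, self-describing column names instead of a two-level header. A sketch on made-up rows; the output column names (`max_hours`, `min_hours`, `mean_hours`) are my own choice:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
demo_df = pd.DataFrame({
    "School_Type": ["Private", "Private", "Public", "Public"],
    "Hours_Studied": [10, 30, 12, 20],
})

# Named aggregation: output_column=(input_column, aggregation).
hours_by_school = demo_df.groupby("School_Type").agg(
    max_hours=("Hours_Studied", "max"),
    min_hours=("Hours_Studied", "min"),
    mean_hours=("Hours_Studied", "mean"),
)
print(hours_by_school)
```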

In [None]:
student_performance_df.groupby(by='Teacher_Quality').agg({
    'Exam_Score' : ['max','min','mean']
})

Teacher_Quality,Exam_Score (max),Exam_Score (min),Exam_Score (mean)
High,100,58,67.676425
Low,94,58,66.753425
Medium,100,55,67.100175


**Teacher Quality and Exam Scores**  

- According to the data, students with **high** and **medium teacher quality** reach the **same maximum exam score** of **100**, while the maximum for **low teacher quality** is **94**.  
- The **average exam scores** are very similar across all three categories: **67.68** (high), **67.10** (medium), and **66.75** (low).  
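
The grouped output lists the categories alphabetically (High, Low, Medium). If a Low-to-High ordering is easier to read, the column can be converted to an ordered categorical before grouping. A sketch on invented rows (`demo_df` and `quality_levels` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
demo_df = pd.DataFrame({
    "Teacher_Quality": ["High", "Low", "Medium", "High", "Low", "Medium"],
    "Exam_Score": [70, 64, 66, 72, 66, 68],
})

# Declare the natural order of the levels.
quality_levels = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
demo_df["Teacher_Quality"] = demo_df["Teacher_Quality"].astype(quality_levels)

# Groups now appear in category order instead of alphabetical order.
mean_by_quality = demo_df.groupby("Teacher_Quality", observed=True)["Exam_Score"].mean()
print(mean_by_quality)
```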

In [None]:
high_exam_score_df = student_performance_df[student_performance_df['Exam_Score'] >= 75].copy()

In [None]:
high_exam_score_df.groupby(by='Extracurricular_Activities').agg({
    'student_id' : 'count'
})

Extracurricular_Activities,student_id
No,37
Yes,87


**Extracurricular Activities and Exam Score**

- Based on the data, **87 students** who scored **75 or higher** participate in **extracurricular activities**. That is about **2.2%** of the 3,938 participants.  
- Meanwhile, **37 students** who do not participate in extracurricular activities also scored **75 or higher**, which is about **1.4%** of the 2,669 non-participants.  
- This suggests that participating in extracurricular activities may be associated with better academic performance, though non-participants can still achieve high scores.  
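
Since the two groups differ in size, the share of high scorers within each group is a fairer comparison than the raw counts, and it can be computed directly. A sketch on invented rows, using the same threshold of 75 as the cell above (`demo_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
demo_df = pd.DataFrame({
    "Extracurricular_Activities": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Exam_Score": [80, 70, 76, 60, 75, 65, 60, 50],
})

# Boolean flag per student: True when the score is 75 or higher.
is_high = demo_df["Exam_Score"] >= 75

# The mean of a boolean flag within a group is the share of True values.
high_share = is_high.groupby(demo_df["Extracurricular_Activities"]).mean()
print(high_share)
```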

In [None]:
high_exam_score_df = student_performance_df[student_performance_df['Exam_Score'] >= 90].copy()
print('most common tutoring session per month is ',high_exam_score_df['Tutoring_Sessions'].mode()[0], 'times')

most common tutoring session per month is  1 times


**Tutoring Sessions per Month for Students with High Exam Scores**  

- Among students with **high exam scores** (90 or higher), the most common number of tutoring sessions is **one per month**. This is the mode, so it is the single most frequent value and not necessarily a majority of top-performing students.
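
The mode alone does not show how dominant that value is. A sketch of how the full distribution could be inspected, on made-up session counts for the top scorers (`top_sessions` is hypothetical):

```python
import pandas as pd

# Hypothetical tutoring-session counts for high-scoring students.
top_sessions = pd.Series([1, 1, 2, 0, 3, 1, 2], name="Tutoring_Sessions")

# Share of students at each session count, ordered by session count.
distribution = top_sessions.value_counts(normalize=True).sort_index()

# Most frequent value and the share of students it covers.
mode_value = top_sessions.mode().iloc[0]
mode_share = distribution.loc[mode_value]

print(distribution)
print(mode_value, round(mode_share, 2))
```

In this toy example the mode covers fewer than half of the rows, which is why it helps to report the share next to the mode.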