
# Descriptive Analysis of Vietnam National High School Graduation Exam Scores (2024)

This notebook presents a descriptive analysis of the 2024 Vietnam National High School Graduation Exam scores. It explores the distribution and variability of scores in the major subjects, the correlations between subjects, and the implications for higher education admissions.

Dataset Source: Kaggle - [Vietnam National Exam 2024](https://www.kaggle.com/datasets/hakudan/im-thi-thpt-nm-2024)


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Load dataset
df = pd.read_csv("diem_thi_thpt_2024.csv")
df.head()


## Basic Dataset Overview

In [None]:

df.info()


## Missing Values by Subject

In [None]:

df.isnull().sum().sort_values(ascending=False)
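
Raw missing counts are hard to compare without the table size, so a percentage view can help. The sketch below uses a small synthetic frame (`demo`, a hypothetical stand-in for `df`) and assumes that a missing score simply means the student did not sit that subject.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the exam table; in the notebook this would be `df`.
demo = pd.DataFrame({
    "toan": [8.0, 7.2, np.nan, 6.4],
    "vat_li": [7.5, np.nan, np.nan, np.nan],
    "lich_su": [np.nan, 6.0, 5.5, 7.0],
})

# isna().mean() is the fraction of missing rows per column;
# multiply by 100 for a percentage and sort largest first.
missing_pct = (demo.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Replacing `demo` with `df` should give the same view for the real data.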


## Summary Statistics of All Subjects

In [None]:

df.describe()


## Distribution of Core Subject Scores

In [None]:

core_subjects = ['toan', 'ngu_van', 'ngoai_ngu']

for subject in core_subjects:
    sns.histplot(df[subject], bins=50, kde=True)
    plt.title(f"Distribution of {subject.title()} scores")
    plt.xlabel("Score")
    plt.ylabel("Frequency")
    plt.show()


## Comparison Between Natural and Social Science Combinations

In [None]:

# Flag students who have scores in all three subjects of a combination, then count the flags
df['KHTN'] = df[['vat_li', 'hoa_hoc', 'sinh_hoc']].notnull().all(axis=1)
df['KHXH'] = df[['lich_su', 'dia_li', 'gdcd']].notnull().all(axis=1)
df[['KHTN', 'KHXH']].sum()
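
The two sums do not show whether any student appears in both groups or in neither. A cross-tabulation of the two flags answers that directly. This is a minimal sketch on synthetic rows; `demo` and the yes/no labels are illustrative assumptions, while the column names follow the ones used above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two KHTN students, two KHXH students, one with neither.
demo = pd.DataFrame({
    "vat_li":   [7.0, 6.5, np.nan, np.nan, np.nan],
    "hoa_hoc":  [8.0, 7.0, np.nan, np.nan, np.nan],
    "sinh_hoc": [6.0, 5.5, np.nan, np.nan, np.nan],
    "lich_su":  [np.nan, np.nan, 6.0, 7.5, np.nan],
    "dia_li":   [np.nan, np.nan, 7.0, 8.0, np.nan],
    "gdcd":     [np.nan, np.nan, 8.5, 9.0, np.nan],
})

# Same flag logic as the cell above: all three subjects must be present.
khtn = demo[["vat_li", "hoa_hoc", "sinh_hoc"]].notna().all(axis=1)
khxh = demo[["lich_su", "dia_li", "gdcd"]].notna().all(axis=1)

# Convert the flags to readable labels, then count every combination.
labels = {True: "yes", False: "no"}
combo_table = pd.crosstab(
    khtn.map(labels).rename("KHTN"),
    khxh.map(labels).rename("KHXH"),
)
print(combo_table)
```

In the resulting table, the yes/yes cell counts students with both combinations and the no/no cell counts students with neither.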


## Score Distribution of Core Subjects (Boxplots)

In [None]:

sns.boxplot(data=df[['toan', 'ngu_van', 'ngoai_ngu']])
plt.title("Boxplot of Core Subject Scores")
plt.ylabel("Score")
plt.show()


## Skewness and Standard Deviation Analysis

In [None]:

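# Restrict to the subject score columns so IDs, boolean flags, and text columns are excluded
subject_cols = ['toan', 'ngu_van', 'ngoai_ngu', 'vat_li', 'hoa_hoc',
                'sinh_hoc', 'lich_su', 'dia_li', 'gdcd']
skew_std = pd.DataFrame({
    'std_dev': df[subject_cols].std(),
    'skewness': df[subject_cols].skew()
}).dropna().sort_values(by='std_dev', ascending=False)
skew_std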
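
To make the skewness column easier to interpret, the sketch below recomputes it by hand and compares it with pandas. It assumes the adjusted Fisher-Pearson coefficient, which is what `Series.skew()` reports; the helper `adjusted_skewness` and the sample scores are hypothetical.

```python
import numpy as np
import pandas as pd

def adjusted_skewness(values):
    """Adjusted Fisher-Pearson sample skewness, ignoring missing values."""
    x = pd.Series(values, dtype="float64").dropna().to_numpy()
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second central moment
    m3 = np.mean(dev ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5      # unadjusted skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

# Synthetic scores with one very low value, i.e. a long left tail.
scores = pd.Series([2.0, 6.5, 7.0, 7.5, 8.0, 8.25, 8.5, 9.0])
manual = adjusted_skewness(scores)
builtin = scores.skew()
print(manual, builtin)
```

A negative value indicates a longer tail toward low scores, a positive value a longer tail toward high scores, and a value near zero a roughly symmetric distribution.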


## Correlation Heatmap Between Subjects

In [None]:

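# Correlate only the subject score columns; each pair uses the rows where both scores exist
subject_cols = ['toan', 'ngu_van', 'ngoai_ngu', 'vat_li', 'hoa_hoc',
                'sinh_hoc', 'lich_su', 'dia_li', 'gdcd']
plt.figure(figsize=(10, 8))
corr = df[subject_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix of Subjects")
plt.show()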
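
A heatmap shows every pair twice and can be hard to rank by eye. One way to list the strongest pairs is to keep only the upper triangle of the correlation matrix and sort it. The sketch below does this on synthetic data: `demo` is a hypothetical stand-in in which two columns share a common component and the third is independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)  # shared component for the two correlated columns

# Synthetic stand-in for the score table.
demo = pd.DataFrame({
    "vat_li": base + rng.normal(scale=0.3, size=200),
    "hoa_hoc": base + rng.normal(scale=0.3, size=200),
    "ngu_van": rng.normal(size=200),
})

corr = demo.corr()

# Keep entries strictly above the diagonal so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper_mask)
        .stack()
        .dropna()
        .sort_values(ascending=False)
)
print(pairs)
```

Applied to the real correlation matrix, the top rows of `pairs` would name the most strongly related subjects.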



## Key Insights and Implications

- **Toán (Math)** scores are roughly bell-shaped, with high participation.
- **Ngữ Văn (Literature)** has skewness closer to zero, indicating a more symmetric distribution around the mean.
- **Ngoại Ngữ (Foreign Language)** shows a higher standard deviation than the other core subjects, suggesting wide variation in performance.
- Many students have a complete set of scores for only one combination, Natural Sciences (KHTN) or Social Sciences (KHXH), not both.
- Strong positive correlations appear among the Natural Science subjects, while Literature is less correlated with the other subjects.
- With few students scoring near the minimum or maximum, the distributions suggest that the **difficulty level** of the 2024 exam was moderate and that the test differentiated candidates reasonably well.
- Implications for university admissions:
  - Highly competitive programs should account for the spread of scores in each subject when setting admission cutoffs.
  - Foreign language scores may be a strong differentiator because of their wide distribution.
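
The observation about few students scoring near the minimum or maximum can be checked numerically rather than read off a histogram. The sketch below is one way to do it on synthetic scores. The cutoffs of 1.0 and 9.0 are arbitrary assumptions, and `scores` is a hypothetical stand-in for a column such as `df['toan']`.

```python
import numpy as np
import pandas as pd

# Synthetic scores on the 0-10 scale, including one missing value.
scores = pd.Series([0.5, 3.2, 5.0, 6.4, 7.0, 7.8, 8.2, 9.6, np.nan, 6.0])

# Drop students without a score so they do not dilute the shares.
valid = scores.dropna()

# Assumed cutoffs for "near the minimum" and "near the maximum".
low_cutoff, high_cutoff = 1.0, 9.0
share_low = (valid <= low_cutoff).mean()
share_high = (valid >= high_cutoff).mean()

print(f"Share at or below {low_cutoff}: {share_low:.1%}")
print(f"Share at or above {high_cutoff}: {share_high:.1%}")
```

Small shares at both ends would be consistent with a test that spreads candidates across the scale, although the choice of cutoffs affects the result.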
