# Capstone Project: Global Primary School Enrollment Analysis


**Course**: INSY 8413 – Introduction to Big Data Analytics  
**Student**: Clever Karenzi  
**Dataset**: World Bank – School enrollment, primary (% gross)  
**Link**: https://data.worldbank.org/indicator/SE.PRM.ENRR  
**Objective**: Analyze global primary school enrollment trends from 2000 to 2020 and identify clusters of countries based on similar patterns.


In [None]:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


## Load Dataset

In [None]:

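# Load World Bank CSV (adjust filename if necessary)
df = pd.read_csv("API_SE.PRM.ENRR_DS2_en_csv_v2_6302777.csv", skiprows=4)
years = [str(year) for year in range(2000, 2021)]
# Keep only the numeric year columns, indexed by country name. The string
# "Country Code" column would break the plotting, median, and scaling steps below.
df = df.set_index("Country Name")[years]
# Keep countries with at least 10 non-missing yearly values
df = df.dropna(thresh=10)
df.head()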
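The `dropna(thresh=10)` filter above keeps a row only if it has at least 10 non-missing values. A minimal sketch of that behavior on a toy frame follows; the row names `Sparse` and `Dense` and their values are made up for illustration and are not taken from the World Bank file.

```python
import numpy as np
import pandas as pd

years = [str(y) for y in range(2000, 2021)]

# Hypothetical rows: one with 9 observed years, one with 10
toy = pd.DataFrame(np.nan, index=["Sparse", "Dense"], columns=years)
toy.loc["Sparse", years[:9]] = 100.0
toy.loc["Dense", years[:10]] = 100.0

# Number of observed (non-missing) years per row
coverage = toy.notna().sum(axis=1)

# Rows with fewer than 10 observed values are dropped
kept = toy.dropna(thresh=10)

print(coverage.to_dict())
print(list(kept.index))
```

Note that `thresh` counts every non-missing column in the row, so any non-year column left in the frame also counts toward the threshold.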


## Exploratory Analysis

In [None]:

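# Visualize enrollment trends for selected countries
countries = ["Rwanda", "Finland", "India", "United States"]
# Plot year columns only; drop the non-numeric "Country Code" column if it is present
df.loc[countries].drop(columns="Country Code", errors="ignore").T.plot(figsize=(10, 6))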
plt.title("Primary School Enrollment (% Gross) [2000–2020]")
plt.xlabel("Year")
plt.ylabel("Gross enrollment (%)")
plt.grid(True)
plt.tight_layout()
plt.show()


## Clustering Countries

In [None]:

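# Fill missing values with each year's cross-country median, then scale
features = df.select_dtypes(include="number")  # numeric year columns only
df_clean = features.fillna(features.median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clean)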

# KMeans clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df_clean['Cluster'] = clusters
df_clean[['Cluster']].head()


## Evaluation

In [None]:

# Evaluate clustering quality
score = silhouette_score(X_scaled, clusters)
print("Silhouette Score:", round(score, 3))
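The notebook fixes the number of clusters at 3. One common way to sanity-check that choice is to compare silhouette scores across several values of k. The sketch below uses a hypothetical helper, `silhouette_by_k`, and runs on synthetic data standing in for `X_scaled`, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(X, k_values=range(2, 7), random_state=42):
    """Return {k: silhouette score} for a KMeans fit at each k (hypothetical helper)."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in for X_scaled: three separated groups of 21-year profiles
rng = np.random.default_rng(0)
centers = [70.0, 100.0, 130.0]
demo = np.vstack([rng.normal(c, 2.0, size=(30, 21)) for c in centers])
demo_scaled = StandardScaler().fit_transform(demo)

scores = silhouette_by_k(demo_scaled)
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("Best k:", best_k)
```

To apply this to the notebook, pass `X_scaled` in place of `demo_scaled`. A higher score suggests better-separated clusters, but it is only one criterion and should be read alongside the interpretability of the groups.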


## Insight Summary


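- Cluster 0: countries with low but improving enrollment  
- Cluster 1: countries with high but declining or stagnant enrollment  
- Cluster 2: countries with consistently high enrollment  
- Recommendation: use Cluster 2 countries as benchmarks for policy comparison.  
- KMeans assigns cluster numbers arbitrarily, so each description should be confirmed against that cluster's mean enrollment and trend.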
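The cluster descriptions above can be checked by summarizing each cluster's size, average enrollment level, and average change between the first and last year. The sketch below assumes a frame shaped like `df_clean` (year columns plus a `Cluster` column). The helper `cluster_profiles` and the demo rows `A` to `D` are hypothetical and for illustration only.

```python
import numpy as np
import pandas as pd


def cluster_profiles(frame, cluster_col="Cluster"):
    """Summarize each cluster: size, mean level, mean first-to-last-year change."""
    year_cols = [c for c in frame.columns if c != cluster_col]
    change = frame[year_cols[-1]] - frame[year_cols[0]]
    summary = pd.DataFrame({
        "countries": frame.groupby(cluster_col).size(),
        "mean_enrollment": frame.groupby(cluster_col)[year_cols].mean().mean(axis=1),
        "mean_change": change.groupby(frame[cluster_col]).mean(),
    })
    return summary.round(2)


# Hypothetical demo data, not taken from the World Bank file
years = [str(y) for y in range(2000, 2021)]
demo = pd.DataFrame(
    [
        np.linspace(60, 90, 21),    # low and improving
        np.linspace(65, 95, 21),    # low and improving
        np.linspace(110, 100, 21),  # high and declining
        np.full(21, 102.0),         # high and flat
    ],
    index=["A", "B", "C", "D"],
    columns=years,
)
demo["Cluster"] = [0, 0, 1, 2]

profiles = cluster_profiles(demo)
print(profiles)
```

Calling `cluster_profiles(df_clean)` after the clustering cell would give the corresponding table for the real data, which can then be used to decide which label matches which description.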
