
# Project 2: Female Tertiary Enrollment vs. CO₂ Per Capita

I'm exploring whether countries with higher female tertiary enrollment also tend to have higher CO₂ emissions per person, or if the two move independently. My hunch is that industrialization and income growth can raise both emissions and education access, but energy mix and policy might break the link.



## What I'm trying to answer
- For China, India, the US, and the UK, do female tertiary enrollment and CO₂ per capita rise together in overlapping years?
- Over time, who moves in tandem and who diverges?
- I want one combined visualization (animated scatter + a static snapshot) that shows both measures on the same chart.



## Data I'm using (2 provided + 1 external)
- **Dataset A (provided)** `co2-emissions-per-capita.csv`: annual CO₂ per person for China, India, UK, US, World.
- **Dataset B (provided)** `owid-co2-data.csv`: I only use this to grab ISO codes so the country names line up cleanly.
- **Dataset C (external API)** World Bank indicator `SE.TER.ENRR.FE`: female gross tertiary enrollment (%).

Why I need Dataset C: the provided files don't include education. To meet the "two datasets in one chart" requirement, I have to pull education from the World Bank, which requires internet.



## My plan
1) Load Dataset A, standardize column names, and attach ISO codes from Dataset B; drop the World aggregate.
2) Download Dataset C from the World Bank, reshape wide → long, and keep numeric years/values.
3) Merge A and C on ISO + year so I only keep overlapping observations.
4) Check Pearson correlation; build a 1990+ animated scatter with an OLS trendline.
5) Add a latest-year static scatter for easy screenshots; jot down what I see and what’s missing.


## Environment & libraries

In [None]:

import io
import zipfile
import urllib.request

import pandas as pd
import plotly.express as px  # trendline='ols' (used below) also requires statsmodels to be installed

pd.set_option('display.max_columns', 10)


## Load Dataset A: CO₂ per capita

In [None]:

co2_pc = pd.read_csv('co2-emissions-per-capita.csv')
co2_pc = co2_pc.rename(columns={
    'Entity': 'country',
    'Year': 'year',
    'Annual CO₂ emissions (per capita)': 'co2_per_capita'
})
co2_pc.head()
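
One thing worth guarding against here: `DataFrame.rename` silently ignores keys that are not present, so a header that differs by a single character (for example `CO2` instead of `CO₂`) would leave `co2_per_capita` missing without raising anything. A minimal sketch of a guard on a toy frame; `require_columns` is a hypothetical helper of my own, not part of pandas, and the toy values are made up:

```python
import pandas as pd

def require_columns(df, expected):
    """Raise a clear error if any expected column is missing (hypothetical helper)."""
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns after rename: {missing}; found {list(df.columns)}")
    return df

# Toy frame standing in for the CSV; the header deliberately uses 'CO2', not 'CO₂'
toy = pd.DataFrame({
    'Entity': ['India'],
    'Year': [2000],
    'Annual CO2 emissions (per capita)': [0.9],
})
renamed = toy.rename(columns={
    'Entity': 'country',
    'Year': 'year',
    'Annual CO₂ emissions (per capita)': 'co2_per_capita',  # no match, silently skipped
})

try:
    require_columns(renamed, ['country', 'year', 'co2_per_capita'])
    guard_message = None
except KeyError as err:
    guard_message = str(err)
print(guard_message)
```

In the notebook the same call would be `require_columns(co2_pc, ['country', 'year', 'co2_per_capita'])` right after the rename.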


## Attach ISO codes from Dataset B (clean country names)

In [None]:

iso_lookup = (
    pd.read_csv('owid-co2-data.csv', usecols=['country', 'iso_code'])
    .dropna()
    .drop_duplicates()
)
co2_pc = co2_pc.merge(iso_lookup, on='country', how='left')
# drop World aggregate and rows without ISO
co2_pc = co2_pc[co2_pc['iso_code'].notna() & (co2_pc['country'] != 'World')]
co2_pc.sample(5, random_state=0)
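
Because the left join is followed by a filter on `iso_code`, a country whose name is spelled differently in the two files would be dropped without any message. A small sketch of how I could check that before filtering, using the `indicator` option of `merge`; the frames and values below are toy stand-ins, not the real files:

```python
import pandas as pd

# Toy stand-ins for Dataset A and the ISO lookup (values are illustrative only)
co2_toy = pd.DataFrame({
    'country': ['China', 'India', 'USA'],   # 'USA' will not match the lookup spelling
    'year': [2000, 2000, 2000],
    'co2_per_capita': [2.7, 0.9, 20.0],
})
iso_toy = pd.DataFrame({
    'country': ['China', 'India', 'United States'],
    'iso_code': ['CHN', 'IND', 'USA'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
checked = co2_toy.merge(iso_toy, on='country', how='left', indicator=True)
unmatched = sorted(checked.loc[checked['_merge'] == 'left_only', 'country'].unique())
print('Countries without an ISO match:', unmatched)
```

If the list is empty for the real `co2_pc` and `iso_lookup`, the later filter only removes the World aggregate and nothing else.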



## Download Dataset C: female tertiary enrollment (World Bank `SE.TER.ENRR.FE`)
I'm downloading the indicator ZIP, picking the CSV, converting wide → long, and keeping numeric years/values plus ISO codes.


In [None]:

url = 'https://api.worldbank.org/v2/en/indicator/SE.TER.ENRR.FE?downloadformat=csv'
with urllib.request.urlopen(url) as resp:
    z = zipfile.ZipFile(io.BytesIO(resp.read()))
    csv_name = [n for n in z.namelist() if n.startswith('API_SE.TER.ENRR.FE') and n.endswith('.csv')][0]
    female_raw = pd.read_csv(z.open(csv_name), skiprows=4)

female_long = female_raw.melt(
    id_vars=['Country Name', 'Country Code'],
    var_name='year',
    value_name='female_tertiary_enrollment'
)
female_long['year'] = pd.to_numeric(female_long['year'], errors='coerce')
female_long['female_tertiary_enrollment'] = pd.to_numeric(
    female_long['female_tertiary_enrollment'], errors='coerce'
)
female_long = (
    female_long
    .dropna(subset=['year', 'female_tertiary_enrollment'])
    .astype({'year': int})  # integer years, so the merge key has the same dtype as Dataset A
)
female_long = female_long.rename(columns={'Country Name': 'country_name', 'Country Code': 'iso_code'})
female_long.head()
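
As far as I can tell, the World Bank CSV also carries `Indicator Name` and `Indicator Code` columns and sometimes a trailing empty column. With only two `id_vars`, those get melted as if they were years and are only discarded later by the numeric coercion. A sketch of a tidier wide → long step that selects the year columns explicitly; the layout is my assumption about the download and the values are made up:

```python
import pandas as pd

# Toy frame mimicking the assumed wide layout of the World Bank CSV (made-up values)
wide = pd.DataFrame({
    'Country Name': ['India', 'China'],
    'Country Code': ['IND', 'CHN'],
    'Indicator Name': ['School enrollment, tertiary, female (% gross)'] * 2,
    'Indicator Code': ['SE.TER.ENRR.FE'] * 2,
    '2000': [7.0, None],
    '2001': [8.0, 10.0],
    'Unnamed: 6': [None, None],
})

id_cols = ['Country Name', 'Country Code']
year_cols = [c for c in wide.columns if c.isdigit()]   # keep only columns named like years

long = wide.melt(
    id_vars=id_cols,
    value_vars=year_cols,
    var_name='year',
    value_name='female_tertiary_enrollment',
)
long = long.dropna(subset=['female_tertiary_enrollment'])
long['year'] = long['year'].astype(int)                # integer years merge cleanly on ISO + year
long = long.rename(columns={'Country Name': 'country_name', 'Country Code': 'iso_code'})
print(long.sort_values(['iso_code', 'year']))
```

The result has the same columns as `female_long`, just without the detour through non-year columns.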



## Filter to CO₂ countries and merge (ISO + year)
I keep only the countries present in the CO₂ file, then merge on ISO and year so name differences don’t cause trouble.


In [None]:

focus_iso = co2_pc['iso_code'].unique()
female_focus = female_long[female_long['iso_code'].isin(focus_iso)]

merged = co2_pc.merge(
    female_focus,
    on=['iso_code', 'year'],
    how='inner',
    suffixes=('_co2', '_edu')
)
merged = merged[['country', 'iso_code', 'year', 'co2_per_capita', 'female_tertiary_enrollment']]

print('Countries:', merged['country'].unique())
print('Rows merged:', len(merged))
merged.head()
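
The inner join keeps only years present in both sources, so it is worth seeing how many overlapping years each country actually contributes. A quick coverage check, sketched on a toy frame that reuses the notebook's column names (the numbers are made up):

```python
import pandas as pd

# Toy merged frame with the notebook's column names (values are made up)
merged_toy = pd.DataFrame({
    'country': ['India', 'India', 'India', 'China', 'China'],
    'year': [1990, 1991, 1995, 1990, 1991],
    'co2_per_capita': [0.6, 0.7, 0.8, 2.0, 2.1],
    'female_tertiary_enrollment': [4.0, 4.2, 5.0, 2.0, 2.2],
})

# Each country-year pair should appear at most once after the merge
has_duplicates = bool(merged_toy.duplicated(['country', 'year']).any())

coverage = merged_toy.groupby('country')['year'].agg(first='min', last='max', n_years='count')
# Years missing inside each country's own span (gaps in the overlapping series)
coverage['gaps'] = (coverage['last'] - coverage['first'] + 1) - coverage['n_years']
print('Duplicate country-year rows:', has_duplicates)
print(coverage)
```

Running the same two steps on `merged` would show whether one country dominates the pooled statistics simply because it has more overlapping years.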



## Quick stats
I check the overall Pearson correlation and also look at the latest year to see who’s highest on enrollment and emissions.


In [None]:

corr = merged[['female_tertiary_enrollment', 'co2_per_capita']].corr().iloc[0, 1]
print(f'Pearson correlation (all years, all countries): {corr:.2f}')

latest_year = merged['year'].max()
print(f'Latest year in merge: {latest_year}')
print(merged[merged['year'] == latest_year].sort_values('female_tertiary_enrollment', ascending=False).head())
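
The pooled Pearson coefficient mixes two things: differences between countries and changes within each country over time. These can even point in opposite directions. A sketch on synthetic data (not the project data) that computes both the pooled value and one coefficient per country:

```python
import pandas as pd

# Synthetic data: within each country the two series move in opposite directions,
# but the country with higher enrollment also has higher emissions
toy = pd.DataFrame({
    'country': ['A'] * 3 + ['B'] * 3,
    'female_tertiary_enrollment': [10, 12, 14, 60, 62, 64],
    'co2_per_capita': [3.0, 2.5, 2.0, 12.0, 11.5, 11.0],
})

cols = ['female_tertiary_enrollment', 'co2_per_capita']
pooled = toy[cols].corr().iloc[0, 1]
within = {
    name: grp[cols[0]].corr(grp[cols[1]])   # Pearson correlation inside one country
    for name, grp in toy.groupby('country')
}
print(f'Pooled: {pooled:.2f}')
print({name: round(value, 2) for name, value in within.items()})
```

Here the pooled value is strongly positive while both within-country values are negative, which is why I would report per-country correlations for `merged` alongside the pooled one.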



## Visualization choices
- Scatter plot to show both measures; color = country; animate by year (1990+) to see movement.
- Add an OLS trendline for direction; also make a latest-year static scatter that’s easy to drop into slides.


In [None]:

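# Animated scatter (1990+); sort by year so the animation frames play in chronological order
viz_df = merged[merged['year'] >= 1990].sort_values('year').copy()
fig = px.scatter(
    viz_df,
    x='female_tertiary_enrollment',
    y='co2_per_capita',
    color='country',
    animation_frame='year',
    hover_name='country',
    trendline='ols',
    # One pooled fit across all points: with color + animation, each country trace
    # holds a single point per frame, which is too few for a per-trace fit
    trendline_scope='overall',
    labels={
        'female_tertiary_enrollment': 'Female tertiary enrollment (% gross)',
        'co2_per_capita': 'CO₂ per capita (tons/person)'
    },
    title='Female tertiary enrollment vs CO₂ per capita (1990+, animation)'
)
fig.update_layout(height=600)
fig.show()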
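
Plotly animations do not rescale the axes from frame to frame, so points can drift out of view unless the ranges are fixed up front. A minimal sketch of computing padded ranges from the full data; `padded_range` is a hypothetical helper and the toy values are made up:

```python
import pandas as pd

def padded_range(values, pad=0.05):
    """Return [low, high] spanning all values plus a fractional margin (hypothetical helper)."""
    low, high = float(values.min()), float(values.max())
    margin = (high - low) * pad
    return [low - margin, high + margin]

# Toy values standing in for viz_df
toy = pd.DataFrame({
    'female_tertiary_enrollment': [5.0, 25.0, 85.0],
    'co2_per_capita': [1.0, 8.0, 21.0],
})
x_range = padded_range(toy['female_tertiary_enrollment'])
y_range = padded_range(toy['co2_per_capita'])
print(x_range, y_range)
```

In the notebook these would be computed from `viz_df` and passed to `px.scatter` through its `range_x` and `range_y` arguments.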


In [None]:

# Latest-year static snapshot (for slides)
snapshot_year = viz_df['year'].max()
snap = viz_df[viz_df['year'] == snapshot_year]
fig2 = px.scatter(
    snap,
    x='female_tertiary_enrollment',
    y='co2_per_capita',
    color='country',
    text='country',
    labels={
        'female_tertiary_enrollment': 'Female tertiary enrollment (% gross)',
        'co2_per_capita': 'CO₂ per capita (tons/person)'
    },
    title=f'Female tertiary enrollment vs CO₂ per capita ({snapshot_year})'
)
fig2.update_traces(textposition='top center')
fig2.update_layout(height=500)
fig2.show()
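
One caveat with `viz_df['year'].max()`: enrollment data is often reported with a lag, so the latest year may contain only some of the countries. A sketch of picking the latest year in which every country has a row instead, on toy data with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['China', 'India', 'China', 'India', 'China'],
    'year': [2019, 2019, 2020, 2020, 2021],   # only China has a 2021 row
    'female_tertiary_enrollment': [58.0, 29.0, 61.0, 30.0, 64.0],
})

n_countries = toy['country'].nunique()
countries_per_year = toy.groupby('year')['country'].nunique()
complete_years = countries_per_year[countries_per_year == n_countries].index

# Assumes at least one year has full coverage; otherwise fall back to the plain max
snapshot_year = int(complete_years.max()) if len(complete_years) else int(toy['year'].max())
print(snapshot_year)
```

Swapping `toy` for `viz_df` would give a snapshot that always shows all four countries, at the cost of possibly being a year or two older.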



## What I see
- Overall correlation is positive: in this small set, higher female tertiary enrollment often comes with higher CO₂ per capita.
- US/UK stay high on both; China ramps up on both after 1990; India’s enrollment rises but emissions stay relatively low.
- Correlation isn’t causation: energy mix, industry, population, and policy matter a lot, and education depends on fiscal effort and demographics.



## Limitations and how I’d improve
- Only four countries, so the pooled correlation rests on very few independent cases; adding more countries would make any pattern more convincing.
- World Bank data needs internet; I could cache it locally for offline runs.
- I didn’t control for GDP or renewables share; a multivariate model would tease out confounders.
- GitHub may not render the animation interactively; best viewed locally or on an interactive host.
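
For the caching idea above, a minimal sketch of what I have in mind: download once, write the bytes to disk, and read from disk on later runs. `load_cached`, `fetch_bytes`, and the cache file name are hypothetical names of my own, and the demonstration uses a fake fetcher so nothing is actually downloaded:

```python
import os
import tempfile
import urllib.request

def fetch_bytes(url):
    """Download a URL and return the raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def load_cached(url, cache_path, fetch=fetch_bytes):
    """Return bytes from cache_path, calling fetch(url) only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as fh:
            return fh.read()
    data = fetch(url)
    with open(cache_path, 'wb') as fh:
        fh.write(data)
    return data

# Offline demonstration with a fake fetcher that records how often it is called
calls = []
def fake_fetch(url):
    calls.append(url)
    return b'zip-bytes'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'SE.TER.ENRR.FE.zip')
    first = load_cached('https://example.invalid/data.zip', path, fetch=fake_fetch)
    second = load_cached('https://example.invalid/data.zip', path, fetch=fake_fetch)
print('Downloads performed:', len(calls))
```

In the notebook, the returned bytes would replace `resp.read()` and go straight into `zipfile.ZipFile(io.BytesIO(...))`; deleting the cache file forces a fresh download.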



## How to reproduce
1) Stay online (World Bank download).
2) Run all cells in order inside `Project2`.
3) To publish: `jupyter-book build .` then push.
