In [229]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

- [X] Remove non-essential columns (e.g., "Country Code", "Indicator Code").
- [X] Reshape from wide → long format (`Country`, `Year`, `GDP_per_capita`).
- [X] Handle missing values:
  - [X] Drop countries with >30% missing data.
  - [X] Fill missing GDP values with linear interpolation.

In [230]:
df = pd.read_csv("./data/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_122367.csv", skiprows=4)

In [231]:
data = df.copy()

In [232]:
# Remove non-essential columns (e.g., "Country Code", "Indicator Code").
data = data.drop(columns=["Indicator Code", "Country Code"])

In [233]:
data.head()

Unnamed: 0,Country Name,Indicator Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,GDP per capita (current US$),,,,,,,,,...,27441.529662,28440.051964,30082.127645,31096.205074,22855.93232,27200.061079,30559.533535,33984.79062,,
1,Africa Eastern and Southern,GDP per capita (current US$),186.121835,186.941781,197.402402,225.440494,208.999748,226.876513,240.955232,243.817323,...,1329.807285,1520.212231,1538.901679,1493.817938,1344.10321,1522.393346,1628.318944,1568.159891,1673.841139,
2,Afghanistan,GDP per capita (current US$),,,,,,,,,...,522.082216,525.469771,491.337221,496.602504,510.787063,356.496214,357.261153,413.757895,,
3,Africa Western and Central,GDP per capita (current US$),121.939925,127.454189,133.827044,139.008291,148.549379,155.565216,162.110768,144.94348,...,1630.039447,1574.23056,1720.14028,1798.340685,1680.039332,1765.954788,1796.668633,1599.392983,1284.154441,
4,Angola,GDP per capita (current US$),,,,,,,,,...,1807.952941,2437.259712,2538.591391,2189.855714,1449.922867,1925.874661,2929.694455,2309.53413,2122.08369,


In [234]:
# Reshape from wide → long format (`Country`, `Year`, `GDP_per_capita`).
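# Year columns 1960 through 2023 (the end of range() is exclusive).
year_cols = [str(year) for year in range(1960, 2024)]
data_long = data.melt(
    id_vars="Country Name",
    value_vars=year_cols,
    var_name="Year",
    value_name="GDP_per_capita",
).rename(columns={"Country Name": "Country"})
# melt() leaves the years as column-name strings; store them as integers.
data_long["Year"] = data_long["Year"].astype(int)
# Keep each country's rows contiguous and in chronological order.
data_long = data_long.sort_values(["Country", "Year"]).reset_index(drop=True)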

In [235]:
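# Handle missing values:
#   Drop countries with >30% missing data.
#   Fill missing GDP values with linear interpolation.

# Convert Year column to integer and GDP to numeric.
data_long["Year"] = data_long["Year"].astype(int)
data_long["GDP_per_capita"] = pd.to_numeric(data_long["GDP_per_capita"], errors="coerce")
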
missing = data_long.groupby("Country")["GDP_per_capita"].apply(lambda x: x.isna().sum())
size = data_long.groupby("Country").size()
missing_ratio = missing / size

In [236]:
countries_to_drop = missing_ratio[missing_ratio > 0.3].index
# .copy() so later column assignments act on an independent frame, not a slice.
data_long = data_long[~data_long["Country"].isin(countries_to_drop)].copy()

In [239]:
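# Interpolate within each country so values never bleed across country boundaries.
by_country = data_long.groupby("Country")["GDP_per_capita"]
data_long["GDP_per_capita"] = by_country.transform(lambda s: s.interpolate(method="linear"))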

- [ ] Compute the **global mean GDP per capita** per year (line plot).
- [ ] Identify the **richest 10 and poorest 10 countries** in the latest available year.
- [ ] Compute **year-to-year percentage growth** for a few sample countries.
- [ ] Use NumPy to:
  - [ ] Calculate global mean, median, and standard deviation for each year.
  - [ ] Identify years with unusually high volatility (std dev spikes).
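
A minimal sketch of how the items above could be approached. It runs on a small made-up frame, `toy`, that only mimics the shape of `data_long`. The variable names, the choice of 2 countries instead of 10, and the 1.5x threshold for a "spike" are all assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data_long` (values are made up for illustration only).
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 3,
    "GDP_per_capita": [100.0, 110.0, 121.0, 133.1,
                       200.0, 210.0, 190.0, 400.0,
                       50.0, 55.0, 60.0, 30.0],
})

# Global mean, median and standard deviation per year (pandas).
yearly = toy.groupby("Year")["GDP_per_capita"].agg(["mean", "median", "std"])

# The same statistics with NumPy on a Country x Year matrix (NaN-aware).
wide = toy.pivot(index="Country", columns="Year", values="GDP_per_capita")
matrix = wide.to_numpy()
np_mean = np.nanmean(matrix, axis=0)
np_median = np.nanmedian(matrix, axis=0)
np_std = np.nanstd(matrix, axis=0, ddof=1)  # ddof=1 matches the pandas default

# Richest and poorest countries in the latest available year (n=2 here).
latest_year = toy["Year"].max()
latest = toy[toy["Year"] == latest_year].set_index("Country")["GDP_per_capita"]
richest = latest.nlargest(2)
poorest = latest.nsmallest(2)

# Year-to-year percentage growth, computed separately for each country.
toy["growth_pct"] = toy.groupby("Country")["GDP_per_capita"].pct_change() * 100

# Flag years where the cross-country std dev exceeds 1.5x the previous year's.
# The 1.5 threshold is an arbitrary assumption; tune it to the real data.
std_ratio = yearly["std"] / yearly["std"].shift(1)
spike_years = std_ratio[std_ratio > 1.5].index.tolist()
```

On the real data, `toy` would be replaced by `data_long` and `n` by 10. The global mean for the line plot is the `mean` column of `yearly`.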

- [ ] Add a mapping: Country → Continent (manual or external dataset).
- [ ] Group by continent:
  - [ ] Compute average GDP per capita per year.
  - [ ] Plot trend lines for each continent (one line per continent).
- [ ] Compare continents:
  - [ ] Which is converging toward the global mean?
  - [ ] Which is diverging further?
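
One possible way to sketch the continent comparison, again on made-up numbers. The `continent_of` dictionary is a hypothetical manual mapping with three entries. The convergence measure used here (change in the relative gap to the global mean between the first and last year) is just one reasonable choice, not the only definition.

```python
import pandas as pd

# Toy stand-in for `data_long` (values are made up for illustration only).
toy = pd.DataFrame({
    "Country": ["France", "France", "Kenya", "Kenya", "Japan", "Japan"],
    "Year": [2000, 2020] * 3,
    "GDP_per_capita": [100.0, 120.0, 10.0, 40.0, 90.0, 200.0],
})

# Hypothetical manual mapping; a real one would need an entry per country.
continent_of = {"France": "Europe", "Kenya": "Africa", "Japan": "Asia"}
toy["Continent"] = toy["Country"].map(continent_of)

# Average GDP per capita per year, one column per continent.
by_continent = toy.pivot_table(
    index="Year", columns="Continent", values="GDP_per_capita", aggfunc="mean"
)

# Global mean per year, used as the reference line.
global_mean = toy.groupby("Year")["GDP_per_capita"].mean()

# Relative gap of each continent to the global mean, per year.
rel_gap = by_continent.sub(global_mean, axis=0).abs().div(global_mean, axis=0)

# A shrinking gap between the first and last year is read as convergence.
change = rel_gap.iloc[-1] - rel_gap.iloc[0]
converging = change[change < 0].index.tolist()
diverging = change[change > 0].index.tolist()
```

Names missing from the mapping get `NaN` as their continent and drop out of the pivot. That is convenient for the aggregate rows in this file (for example "Africa Eastern and Southern"), which are not countries.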

- [ ] Line plot: GDP trends of top 5 largest economies (US, China, Japan, Germany, India).
- [ ] Histogram: Distribution of GDP per capita for all countries in 2020.
- [ ] Boxplot: GDP per capita grouped by continent in 2020.
- [ ] Scatter plot: GDP per capita vs Year for China and India (compare growth paths).
- [ ] Rolling mean (5-year window) for global GDP per capita trend (smooth line plot).
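
The plots above mostly need the data in the right shape first. The sketch below prepares three of those inputs on a made-up frame: a wide table for multi-line trend plots, a 5-year rolling mean, and histogram bin counts for a single year. Country names, years and the bin count are placeholders, and the plotting calls themselves are left out.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data_long` (values are made up for illustration only).
toy = pd.DataFrame({
    "Country": ["A"] * 7 + ["B"] * 7,
    "Year": list(range(2000, 2007)) * 2,
    "GDP_per_capita": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0,
                       30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
})

# One column per selected country: the shape a multi-line trend plot needs.
selected = ["A", "B"]  # placeholder for the economies of interest
trend = toy[toy["Country"].isin(selected)].pivot(
    index="Year", columns="Country", values="GDP_per_capita"
)

# Global mean per year, smoothed with a trailing 5-year window.
# The first four years are NaN because the window is not yet full.
global_mean = toy.groupby("Year")["GDP_per_capita"].mean()
smoothed = global_mean.rolling(window=5).mean()

# Histogram bin counts for one year (2006 here; bins=2 is arbitrary).
values = toy.loc[toy["Year"] == 2006, "GDP_per_capita"].to_numpy()
counts, edges = np.histogram(values, bins=2)
```

With matplotlib installed, `trend.plot()` and `smoothed.plot()` would draw the line plots directly from these objects.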

- [ ] Standardize GDP per capita (z-score) for a given year.
- [ ] Identify outliers (countries >3 std deviations from mean).
- [ ] Compare outliers across decades (1970s vs 2020s).
- [ ] Compute correlations:
  - [ ] GDP per capita vs Population (merge population dataset).
  - [ ] GDP per capita vs Life Expectancy (merge life expectancy dataset).
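
A rough sketch of the z-score, outlier and correlation steps. All frames below are made up. `life_2020` stands in for an external life expectancy dataset that this notebook has not loaded, so its column names are assumptions. The z-score uses the population standard deviation (`ddof=0`); using `ddof=1` instead would be equally defensible.

```python
import numpy as np
import pandas as pd

# Toy cross-section for a single year (values are made up).
year_df = pd.DataFrame({
    "Country": [f"C{i}" for i in range(12)],
    "GDP_per_capita": [10.0] * 11 + [1000.0],
})

# Standardize: subtract the mean, divide by the standard deviation.
gdp = year_df["GDP_per_capita"]
year_df["z"] = (gdp - gdp.mean()) / gdp.std(ddof=0)

# Outliers: more than 3 standard deviations away from the mean.
outliers = year_df.loc[year_df["z"].abs() > 3, "Country"].tolist()

# Correlation with a second indicator (hypothetical dataset and columns).
gdp_2020 = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Year": [2020] * 4,
    "GDP_per_capita": [1.0, 2.0, 3.0, 4.0],
})
life_2020 = pd.DataFrame({
    "Country": ["A", "B", "C", "E"],
    "Year": [2020] * 4,
    "Life_expectancy": [50.0, 60.0, 70.0, 80.0],
})

# Inner merge keeps only country-year pairs present in both datasets.
merged = gdp_2020.merge(life_2020, on=["Country", "Year"], how="inner")
corr = merged["GDP_per_capita"].corr(merged["Life_expectancy"])  # Pearson
```

For the decade comparison, the same z-score could be computed per year with `groupby("Year")` and `transform`, then the flagged countries compared between the two periods.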