# Project 2

## Overview

- Read in two datasets (GDP per capita and life expectancy)
- Filter both datasets to a single year (2019)
- Merge them by country
- Create a scatter plot to show the relationship between income and life expectancy
- Describe the key takeaway from the visualization

### Dataset & Source

- **Dataset 1 title:** GDP Per Capita
- **Dataset 2 title:** Life Expectancy
- **Primary source:** World Bank
- **Link to source dataset 1:** (https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
- **Link to source dataset 2:** (https://data.worldbank.org/indicator/SP.DYN.LE00.IN?locations=1W)

### Step 1 
**Save and load datasets**

In [136]:
# ensure the visualizations render properly across VSCode, Jupyter Book, etc.
# https://plotly.com/python/renderers/

import plotly.io as pio
pio.renderers.default = "notebook_connected+plotly_mimetype"

import pandas as pd

# skiprows=4 skips the 4 metadata lines that precede the header row in these World Bank files
df_gdp = pd.read_csv("gdp_per_capita.csv", skiprows=4)
df_life = pd.read_csv("life_expectancy.csv", skiprows=4)

In [137]:
df_life.head() #See the top 5 rows

| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | Unnamed: 69 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Life expectancy at birth, total (years) | SP.DYN.LE00.IN | 64.049 | 64.215 | 64.602 | 64.944 | 65.303 | 65.615 | ... | 75.54 | 75.62 | 75.88 | 76.019 | 75.406 | 73.655 | 76.226 | 76.353 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Life expectancy at birth, total (years) | SP.DYN.LE00.IN | 44.169658 | 44.468838 | 44.87789 | 45.160583 | 45.535695 | 45.770723 | ... | 62.167981 | 62.591275 | 63.330691 | 63.857261 | 63.766484 | 62.979999 | 64.48702 | 65.146291 | NaN | NaN |
| 2 | Afghanistan | AFG | Life expectancy at birth, total (years) | SP.DYN.LE00.IN | 32.799 | 33.291 | 33.757 | 34.201 | 34.673 | 35.124 | ... | 62.646 | 62.406 | 62.443 | 62.941 | 61.454 | 60.417 | 65.617 | 66.035 | NaN | NaN |
| 3 | Africa Western and Central | AFW | Life expectancy at birth, total (years) | SP.DYN.LE00.IN | 37.779636 | 38.058956 | 38.681792 | 38.936918 | 39.19458 | 39.479784 | ... | 56.392452 | 56.626439 | 57.036976 | 57.149847 | 57.364425 | 57.362572 | 57.987813 | 58.855722 | NaN | NaN |
| 4 | Angola | AGO | Life expectancy at birth, total (years) | SP.DYN.LE00.IN | 37.933 | 36.902 | 37.168 | 37.419 | 37.704 | 37.968 | ... | 61.619 | 62.122 | 62.622 | 63.051 | 63.116 | 62.958 | 64.246 | 64.617 | NaN | NaN |


In [138]:
df_gdp.head() #See the top 5 rows

| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | Unnamed: 69 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 27441.529662 | 28440.051964 | 30082.127645 | 31096.205074 | 22855.93232 | 27200.061079 | 30559.533535 | 33984.79062 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP per capita (current US$) | NY.GDP.PCAP.CD | 186.121835 | 186.941781 | 197.402402 | 225.440494 | 208.999748 | 226.876513 | ... | 1329.807285 | 1520.212231 | 1538.901679 | 1493.817938 | 1344.10321 | 1522.393346 | 1628.318944 | 1510.742951 | 1567.635839 | NaN |
| 2 | Afghanistan | AFG | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 522.082216 | 525.469771 | 491.337221 | 496.602504 | 510.787063 | 356.496214 | 357.261153 | 413.757895 | NaN | NaN |
| 3 | Africa Western and Central | AFW | GDP per capita (current US$) | NY.GDP.PCAP.CD | 121.939925 | 127.454189 | 133.827044 | 139.008291 | 148.549379 | 155.565216 | ... | 1630.039447 | 1574.23056 | 1720.14028 | 1798.340685 | 1680.039332 | 1765.954788 | 1796.668633 | 1599.392983 | 1284.154441 | NaN |
| 4 | Angola | AGO | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1807.952941 | 2437.259712 | 2538.591391 | 2189.855714 | 1449.922867 | 1925.874661 | 2929.694455 | 2309.53413 | 2122.08369 | NaN |
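Both previews end with an all-empty `Unnamed: 69` column. A likely cause (an assumption, not confirmed by the files themselves) is a trailing comma at the end of every row in the downloaded CSV. The sketch below reproduces that layout with a tiny inline stand-in and drops any column that contains no data; `sample_csv`, `raw`, and `cleaned` are hypothetical names, and the metadata lines are placeholders.

```python
import io

import pandas as pd

# Stand-in for a World Bank download (assumed layout): four metadata lines,
# then a header row, with a trailing comma on every line.
sample_csv = (
    '"Data Source","World Development Indicators",\n'
    "\n"
    '"Last Updated Date","(placeholder)",\n'
    "\n"
    '"Country Name","Country Code","2018","2019",\n'
    '"Aruba","ABW","75.88","76.019",\n'
    '"Angola","AGO","62.622","63.051",\n'
)

# Same call pattern as the notebook: skip the four metadata lines.
raw = pd.read_csv(io.StringIO(sample_csv), skiprows=4)

# The trailing comma produces an extra, entirely empty "Unnamed: N" column.
# Dropping columns where every value is missing removes it.
cleaned = raw.dropna(axis=1, how="all")

print(list(raw.columns))
print(list(cleaned.columns))
```

Applied to the real frames, the same idea would be `df_gdp.dropna(axis=1, how="all")`. Note that this would also drop any year column with no data at all, which may or may not be wanted.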


### Step 2
**Filter both datasets for the year 2019**

In [139]:
# Choose the analysis year
year = 2019
year_col = str(year)  # '2019' column in the World Bank files

# For GDP: keep country info + 2019 column, then rename
gdp_2019 = df_gdp[["Country Name", "Country Code", year_col]].copy()
gdp_2019.rename(columns={year_col: "gdp_per_capita"}, inplace=True)

# Make sure it's numeric
gdp_2019["gdp_per_capita"] = pd.to_numeric(gdp_2019["gdp_per_capita"], errors="coerce")
gdp_2019 = gdp_2019.dropna(subset=["gdp_per_capita"])
gdp_2019.head()

| | Country Name | Country Code | gdp_per_capita |
|---|---|---|---|
| 0 | Aruba | ABW | 31096.205074 |
| 1 | Africa Eastern and Southern | AFE | 1493.817938 |
| 2 | Afghanistan | AFG | 496.602504 |
| 3 | Africa Western and Central | AFW | 1798.340685 |
| 4 | Angola | AGO | 2189.855714 |


In [140]:
# For life expectancy: keep country info + 2019 column, then rename
life_2019 = df_life[["Country Name", "Country Code", year_col]].copy()
life_2019.rename(columns={year_col: "life_expectancy"}, inplace=True)

life_2019["life_expectancy"] = pd.to_numeric(life_2019["life_expectancy"], errors="coerce")
life_2019 = life_2019.dropna(subset=["life_expectancy"])

life_2019.head()

| | Country Name | Country Code | life_expectancy |
|---|---|---|---|
| 0 | Aruba | ABW | 76.019 |
| 1 | Africa Eastern and Southern | AFE | 63.857261 |
| 2 | Afghanistan | AFG | 62.941 |
| 3 | Africa Western and Central | AFW | 57.149847 |
| 4 | Angola | AGO | 63.051 |
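The GDP and life expectancy cells in this step repeat the same four operations: select, rename, convert to numeric, and drop missing values. One way to avoid the duplication is a small helper. This is a minimal sketch; `extract_year` and `toy` are hypothetical names, and the toy frame reuses the 1960 GDP values shown in the preview above (three of the five rows are missing in that year).

```python
import pandas as pd


def extract_year(df, year, value_name):
    """Return country name, country code, and one year's values, without missing rows."""
    year_col = str(year)  # year columns in these files are strings such as "2019"
    out = df[["Country Name", "Country Code", year_col]].copy()
    out = out.rename(columns={year_col: value_name})
    # Coerce anything non-numeric to NaN, then drop those rows.
    out[value_name] = pd.to_numeric(out[value_name], errors="coerce")
    return out.dropna(subset=[value_name]).reset_index(drop=True)


# Toy input built from the 1960 column of the GDP preview.
toy = pd.DataFrame(
    {
        "Country Name": [
            "Aruba",
            "Africa Eastern and Southern",
            "Afghanistan",
            "Africa Western and Central",
            "Angola",
        ],
        "Country Code": ["ABW", "AFE", "AFG", "AFW", "AGO"],
        "1960": [None, 186.121835, None, 121.939925, None],
    }
)

gdp_1960_demo = extract_year(toy, 1960, "gdp_per_capita")
print(gdp_1960_demo)
```

With the notebook's frames, the two cells above would reduce to `extract_year(df_gdp, 2019, "gdp_per_capita")` and `extract_year(df_life, 2019, "life_expectancy")`. Unlike the original cells, the helper resets the index after dropping rows.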


### Step 3
**Merge the datasets on country name and country code**

In [141]:
merged = pd.merge(
    gdp_2019,
    life_2019,
    on=["Country Name", "Country Code"],
    how="inner"
)

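# The World Bank files include regional aggregates (e.g. "Africa Eastern and Southern"),
# so this count covers more than individual countries.
print("Number of rows (countries and aggregates):", merged.shape[0])
merged.head()

Number of rows (countries and aggregates): 258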


| | Country Name | Country Code | gdp_per_capita | life_expectancy |
|---|---|---|---|---|
| 0 | Aruba | ABW | 31096.205074 | 76.019 |
| 1 | Africa Eastern and Southern | AFE | 1493.817938 | 63.857261 |
| 2 | Afghanistan | AFG | 496.602504 | 62.941 |
| 3 | Africa Western and Central | AFW | 1798.340685 | 57.149847 |
| 4 | Angola | AGO | 2189.855714 | 63.051 |
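An inner merge silently drops any entry that appears in only one of the two frames. To see what would be dropped, one option is an outer merge with pandas' `indicator=True`, plus `validate="one_to_one"` to confirm that no key is duplicated. The sketch below uses toy frames built from the values above; Angola is deliberately left out of `life_demo` to produce an unmatched row, and all `*_demo` names are hypothetical.

```python
import pandas as pd

gdp_demo = pd.DataFrame(
    {
        "Country Name": ["Aruba", "Afghanistan", "Angola"],
        "Country Code": ["ABW", "AFG", "AGO"],
        "gdp_per_capita": [31096.205074, 496.602504, 2189.855714],
    }
)

# Angola is omitted here on purpose, to illustrate an unmatched key.
life_demo = pd.DataFrame(
    {
        "Country Name": ["Aruba", "Afghanistan"],
        "Country Code": ["ABW", "AFG"],
        "life_expectancy": [76.019, 62.941],
    }
)

check = pd.merge(
    gdp_demo,
    life_demo,
    on=["Country Name", "Country Code"],
    how="outer",
    indicator=True,          # adds a "_merge" column: both / left_only / right_only
    validate="one_to_one",   # raises MergeError if a key repeats on either side
)

# Rows that an inner merge would have discarded.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country Name", "Country Code", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge loses nothing beyond the rows already dropped for missing 2019 values.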


### Step 4 

**Before plotting, I do a little basic cleaning:**

- Drop rows where GDP per capita or life expectancy is missing  
- Keep only positive values  
- Remove rows with GDP per capita of 100,000 US$ or more, so the rest of the data is easier to see on the plot

In [142]:
# Simple cleaning for plotting

# Start from the merged data
clean = merged.copy()

# Drop rows where either value is missing
clean = clean.dropna(subset=["gdp_per_capita", "life_expectancy"])

# Keep only positive values
clean = clean[(clean["gdp_per_capita"] > 0) & (clean["life_expectancy"] > 0)]

# Remove extremely high GDP values so the plot is easier to read
clean = clean[clean["gdp_per_capita"] < 100_000]

clean.shape

(254, 4)
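The cleaned frame still contains regional aggregates such as "Africa Eastern and Southern" and "Africa Western and Central", which are not countries. A possible extra cleaning step is to keep only codes that have a region assigned in a country metadata table. This sketch assumes such a table exists with a `Region` column that is empty for aggregates (World Bank downloads typically ship a separate country metadata CSV, but its exact name and columns are assumptions here). The `Region` values in `meta_demo` are illustrative only.

```python
import pandas as pd

merged_demo = pd.DataFrame(
    {
        "Country Name": [
            "Aruba",
            "Africa Eastern and Southern",
            "Afghanistan",
            "Africa Western and Central",
            "Angola",
        ],
        "Country Code": ["ABW", "AFE", "AFG", "AFW", "AGO"],
        "gdp_per_capita": [31096.205074, 1493.817938, 496.602504, 1798.340685, 2189.855714],
        "life_expectancy": [76.019, 63.857261, 62.941, 57.149847, 63.051],
    }
)

# Hypothetical metadata: aggregates have no Region. Region labels are illustrative.
meta_demo = pd.DataFrame(
    {
        "Country Code": ["ABW", "AFE", "AFG", "AFW", "AGO"],
        "Region": ["Latin America & Caribbean", None, "South Asia", None, "Sub-Saharan Africa"],
    }
)

# Codes with a region are treated as individual countries.
country_codes = meta_demo.loc[meta_demo["Region"].notna(), "Country Code"]

countries_only = merged_demo[merged_demo["Country Code"].isin(country_codes)]
print(countries_only["Country Name"].tolist())
```

Filtering this way keeps each dot on the scatter plot independent, since aggregates are averages of countries that are already plotted.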

### Step 5
**Create a scatter plot**

In [143]:
import plotly.express as px

fig = px.scatter(
    clean,
    x="gdp_per_capita",
    y="life_expectancy",
    hover_name="Country Name",
    title=f"GDP per capita vs life expectancy ({year})",
    labels={
        "gdp_per_capita": "GDP per capita (current US$)",
        "life_expectancy": "Life expectancy at birth (years)"
    }
)

fig.show()

### Takeaway
**This scatter plot compares GDP per capita and life expectancy for each country and regional aggregate in the 2019 data.**

Overall, there is a clear pattern:

- Countries with **low GDP per capita** usually have **lower life expectancy** (around 50–70 years).
- As **GDP per capita increases**, life expectancy also tends to **increase**, often into the high 70s or 80s.
- At very high income levels, life expectancy **flattens out**: the richest countries cluster around similar life expectancy values, so additional income is associated with little further gain.
- There are a few countries that sit off the main pattern, which suggests that income is important but not the only factor that affects health and longevity.

**This shows that higher income is generally associated with longer lives, but the relationship is not perfectly linear and other factors also matter.**
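The "not perfectly linear" observation can be checked numerically by comparing the Pearson correlation of life expectancy with raw GDP per capita against its correlation with log GDP per capita. If the curve flattens at high incomes, the log version would be expected to fit a straight line better on the full data. The sketch below only shows the mechanics on the five rows previewed earlier, so its numbers say nothing about the full dataset; `demo`, `r_raw`, and `r_log` are hypothetical names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "Country Code": ["ABW", "AFE", "AFG", "AFW", "AGO"],
        "gdp_per_capita": [31096.205074, 1493.817938, 496.602504, 1798.340685, 2189.855714],
        "life_expectancy": [76.019, 63.857261, 62.941, 57.149847, 63.051],
    }
)

# Pearson correlation with GDP on its original scale.
r_raw = demo["gdp_per_capita"].corr(demo["life_expectancy"])

# Pearson correlation after a log10 transform of GDP (values must be positive).
r_log = np.log10(demo["gdp_per_capita"]).corr(demo["life_expectancy"])

print(f"Pearson r, raw GDP: {r_raw:.3f}")
print(f"Pearson r, log GDP: {r_log:.3f}")
```

On the real data, the same two lines would run against `clean` instead of `demo`. Correlation describes association only and does not show that income causes longer lives.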