# Are the world’s largest economies also the most populous countries?

This notebook explores how country rankings by gross domestic product (GDP) compare to rankings by population. The goal is to see whether the countries with the largest economies are also the countries with the largest populations, and to notice where the rankings do not match.

I use two public datasets from the World Bank:

- **GDP ranking**: a table with each economy’s GDP in current US dollars (reported in millions) and a ranking.
- **Population ranking**: a table with each economy’s population (reported in thousands) and a ranking.

Both datasets were downloaded from the World Bank Data Catalog and saved locally as `GDP.csv` and `POP.csv`. All data cleaning, merging, and visualization steps are done in Python using `pandas` and `plotly.express`.

In [2]:
import pandas as pd
import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

## Setup

In the cell above I import the libraries used in this project. `pandas` is used for reading, cleaning, and merging the data tables. `plotly.express` is used to create the final visualization, and `plotly.io` sets the default renderer so the figure displays inside the notebook.

In [3]:
gdp_raw = pd.read_csv("GDP.csv")
gdp_raw.head()

,Unnamed: 0,Gross domestic product 2024,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,,,,,,
1,,,,,(millions of,
2,,Ranking,,Economy,US dollars),
3,,,,,,
4,USA,1,,United States,29184890,


In [4]:
pop_raw = pd.read_csv("POP.csv")
pop_raw.head()

,Unnamed: 0,Population 2024,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,,,,,,,
1,,,,,,,
2,,Ranking,,Economy,(thousands),,
3,,,,,,,
4,IND,1,,India,1450936,,


## Data sources and loading

In the two code cells above I read the GDP and population ranking files from the local CSV files `GDP.csv` and `POP.csv`. These files come from the World Bank’s “GDP ranking” and “Population ranking” tables.

I downloaded the original files from the World Bank Data Catalog and saved them as CSV:

- GDP ranking: https://datacatalog.worldbank.org/search/dataset/0038130/gdp-ranking  
- Population ranking: https://datacatalog.worldbank.org/search/dataset/0038126/population-ranking  

The `head()` output for each table lets me quickly check that the files loaded correctly and shows the structure of the first few rows.

In [5]:
gdp_raw.columns

Index(['Unnamed: 0', 'Gross domestic product 2024', 'Unnamed: 2', 'Unnamed: 3',
       'Unnamed: 4', 'Unnamed: 5'],
      dtype='object')

### Understanding the raw GDP table

The `gdp_raw.columns` output shows that the GDP file contains several unnamed columns and that the first few rows hold titles and units rather than real country data. Because of this layout, I cannot use the table directly. I first need to remove the non-data rows and keep only the rows that contain a ranking and numeric values for each country.
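To make this layout concrete, here is a minimal sketch that parses a small in-memory sample instead of the real file. The sample text is an assumption modeled on the `head()` output above, not a copy of `GDP.csv`, and the names `sample_csv`, `sample_raw`, and `filled_per_row` are only for illustration. Counting the non-missing cells in each row separates the title and unit rows from the rows that hold country data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the top of GDP.csv, modeled on the head() output above.
sample_csv = """,Gross domestic product 2024,,,,
,,,,,
,,,,(millions of,
,Ranking,,Economy,US dollars),
,,,,,
USA,1,,United States,29184890,
CHN,2,,China,18743803,
"""

sample_raw = pd.read_csv(io.StringIO(sample_csv))

# Count non-missing cells per row: title/unit rows have few, country rows have four.
filled_per_row = sample_raw.notna().sum(axis=1)
print(filled_per_row.tolist())
```

Rows with four filled cells (code, ranking, name, value) are the ones worth keeping. The label row has three filled cells, so it looks almost like data and needs separate handling.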

In [6]:
gdp = gdp_raw.dropna(subset=["Gross domestic product 2024"])
pop = pop_raw.dropna(subset=["Population 2024"])

gdp.head()
pop.head()

,Unnamed: 0,Population 2024,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
2,,Ranking,,Economy,(thousands),,
4,IND,1,,India,1450936,,
5,CHN,2,,China,1408975,,
6,USA,3,,United States,340111,,
7,IDN,4,,Indonesia,283488,,


### Keeping only rows with valid rankings

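In this step I drop rows where the column that holds the ranking (`Gross domestic product 2024` in the GDP table, `Population 2024` in the population table) is missing. This removes the empty and unit rows at the top of each file. Only `pop.head()` is displayed, because a notebook cell shows just its last expression. The label row at index 2 (`Ranking`, `Economy`, `(thousands)`) is still present, because its ranking cell is not empty; it is removed later, when the columns are converted to numbers. After that row, the output lists actual entries: India, China, the United States, and Indonesia.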
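The following sketch shows what `dropna(subset=...)` does and does not remove. It uses a few illustrative rows in the same shape as the raw population table (the frame `raw` and the names `kept` and `as_number` are assumptions for this example, not part of the notebook's pipeline).

```python
import pandas as pd

# Illustrative rows shaped like the raw table; values are taken from the output above.
raw = pd.DataFrame({
    "Unnamed: 0": [None, None, "IND", "CHN"],
    "Population 2024": [None, "Ranking", "1", "2"],
    "Unnamed: 3": [None, "Economy", "India", "China"],
})

# Drop only the rows whose ranking cell is missing.
kept = raw.dropna(subset=["Population 2024"])

# The label "Ranking" is text, not a missing value, so that row survives.
print(kept["Population 2024"].tolist())

# Coercing to numeric turns the label into NaN, which makes the row easy to find.
as_number = pd.to_numeric(kept["Population 2024"], errors="coerce")
print(as_number.isna().tolist())
```

This is why a second filtering step is still needed after `dropna`: a row that contains text in the ranking column counts as present until the column is converted to numbers.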

In [7]:
gdp = gdp.rename(columns={
    "Unnamed: 0": "country_code",
    "Gross domestic product 2024": "gdp_rank",
    "Unnamed: 3": "country_name",
    "Unnamed: 4": "gdp_usd_millions"
})

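# .copy() makes an independent table, so later column assignments do not raise SettingWithCopyWarning
gdp = gdp[["country_code", "country_name", "gdp_rank", "gdp_usd_millions"]].copy()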
gdp.head()

,country_code,country_name,gdp_rank,gdp_usd_millions
2,,Economy,Ranking,US dollars)
4,USA,United States,1,29184890
5,CHN,China,2,18743803
6,DEU,Germany,3,4659929
7,JPN,Japan,4,4026211


In [8]:
pop = pop.rename(columns={
    "Unnamed: 0": "country_code",
    "Population 2024": "pop_rank",
    "Unnamed: 3": "country_name",
    "Unnamed: 4": "population_thousands"
})

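# .copy() makes an independent table, so later column assignments do not raise SettingWithCopyWarning
pop = pop[["country_code", "country_name", "pop_rank", "population_thousands"]].copy()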
pop.head()

,country_code,country_name,pop_rank,population_thousands
2,,Economy,Ranking,(thousands)
4,IND,India,1,1450936
5,CHN,China,2,1408975
6,USA,United States,3,340111
7,IDN,Indonesia,4,283488


### Renaming and selecting columns

In the two cells above I give the columns descriptive names (`country_code`, `country_name`, `gdp_rank` or `pop_rank`, and `gdp_usd_millions` or `population_thousands`) and keep only those four columns in each table. The label row at index 2 is still present at this point, which is why the first row of each `head()` output shows `Economy` and `Ranking` instead of a country.

In [9]:
gdp["gdp_rank"] = pd.to_numeric(gdp["gdp_rank"], errors="coerce")
gdp["gdp_usd_millions"] = (
    gdp["gdp_usd_millions"]
    .astype(str)
    .str.replace(",", "", regex=False)  # treat "," as a literal character
)
gdp["gdp_usd_millions"] = pd.to_numeric(gdp["gdp_usd_millions"], errors="coerce")

gdp = gdp.dropna(subset=["gdp_rank", "gdp_usd_millions"])
gdp["gdp_rank"] = gdp["gdp_rank"].astype(int)

gdp.head()

,country_code,country_name,gdp_rank,gdp_usd_millions
4,USA,United States,1,29184890.0
5,CHN,China,2,18743803.0
6,DEU,Germany,3,4659929.0
7,JPN,Japan,4,4026211.0
8,IND,India,5,3912686.0


In [10]:
pop["pop_rank"] = pd.to_numeric(pop["pop_rank"], errors="coerce")
pop["population_thousands"] = (
    pop["population_thousands"]
    .astype(str)
    .str.replace(",", "", regex=False)  # treat "," as a literal character
)
pop["population_thousands"] = pd.to_numeric(pop["population_thousands"], errors="coerce")

pop = pop.dropna(subset=["pop_rank", "population_thousands"])
pop["pop_rank"] = pop["pop_rank"].astype(int)

pop.head()

,country_code,country_name,pop_rank,population_thousands
4,IND,India,1,1450936.0
5,CHN,China,2,1408975.0
6,USA,United States,3,340111.0
7,IDN,Indonesia,4,283488.0
8,PAK,Pakistan,5,251269.0


### Converting strings to numeric values

The GDP and population columns are read in as strings, because the same columns also hold header text such as `US dollars)` and `(thousands)`. Before I can analyze them, I strip any thousands-separator commas and convert the columns to numeric types using `pd.to_numeric`. With `errors="coerce"`, values that cannot be parsed (including the leftover label row) become missing, so I drop those rows and then turn the ranking columns into integers.

After this step, both tables treat GDP and population as numbers rather than text, so sorting, calculations, and plot axes behave numerically.
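As a small illustration of this conversion, the sketch below cleans three hypothetical raw values: one written with thousands separators, one plain number, and one leftover label. The series `raw_values` is made up for the example.

```python
import pandas as pd

# Hypothetical raw values: with separators, without separators, and a text label.
raw_values = pd.Series(["29,184,890", "18743803", "US dollars)"])

cleaned = pd.to_numeric(
    raw_values.str.replace(",", "", regex=False),  # remove literal commas
    errors="coerce",  # unparseable entries become NaN instead of raising an error
)
print(cleaned)
```

Because `NaN` is a floating-point value, the coerced column is stored as `float64`. That is why the GDP and population values in the outputs above print with a trailing `.0`.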

In [11]:
merged = pd.merge(
    gdp[["country_code", "country_name", "gdp_rank", "gdp_usd_millions"]],
    pop[["country_code", "pop_rank", "population_thousands"]],
    on="country_code",
    how="inner"
)

merged.head()

,country_code,country_name,gdp_rank,gdp_usd_millions,pop_rank,population_thousands
0,USA,United States,1,29184890.0,3,340111.0
1,CHN,China,2,18743803.0,2,1408975.0
2,DEU,Germany,3,4659929.0,19,83511.0
3,JPN,Japan,4,4026211.0,12,123975.0
4,IND,India,5,3912686.0,1,1450936.0


## Combining the two datasets

This cell merges the cleaned GDP and population tables into a single DataFrame called `merged`. I join on the shared `country_code` column and keep only countries that appear in both sources.

Each row in `merged` now contains, for the same country, its GDP rank, its population rank, its GDP in millions of US dollars, and its population in thousands. This combined table is the basis for the final visualization.
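An inner join silently drops any code that appears in only one table. One way to check what would be lost is an outer join with `indicator=True`, sketched here on two tiny made-up tables. The codes `XXX` and `YYY` are invented placeholders, and `gdp_demo`, `pop_demo`, `check`, and `unmatched` are names used only in this example.

```python
import pandas as pd

# Tiny illustrative tables; "XXX" and "YYY" are made-up codes present in only one table.
gdp_demo = pd.DataFrame({"country_code": ["USA", "CHN", "XXX"], "gdp_rank": [1, 2, 3]})
pop_demo = pd.DataFrame({"country_code": ["CHN", "USA", "YYY"], "pop_rank": [2, 3, 4]})

# how="outer" keeps every code; indicator=True adds a "_merge" column that
# says whether a row came from the left table, the right table, or both.
check = pd.merge(gdp_demo, pop_demo, on="country_code", how="outer", indicator=True)

# Rows that an inner join would have dropped.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["country_code", "_merge"]])
```

Running the same check on the real `gdp` and `pop` tables would show which economies are excluded from `merged`, if any.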

In [12]:
fig = px.scatter(
    merged,
    x="pop_rank",
    y="gdp_rank",
    hover_name="country_name",
    title="GDP rank vs Population rank",
    labels={
        "pop_rank": "Population rank (1 = most populous)",
        "gdp_rank": "GDP rank (1 = largest GDP)"
    }
)

fig.update_layout(
    xaxis=dict(autorange="reversed"),
    yaxis=dict(autorange="reversed")
)

fig.show()

## Visualization: GDP rank versus population rank

The scatter plot above compares each country’s GDP rank (vertical axis) with its population rank (horizontal axis). Rank 1 represents the largest value, so I reverse both axes; rank 1 then sits at the top of the vertical axis and at the right end of the horizontal axis. Countries in the top right corner of the plot have both very large populations and very large economies.

If every country’s economic size matched its population size exactly, all points would lie on a diagonal line where the GDP rank equals the population rank. Instead, the points are spread around this line.
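The distance from that diagonal can also be expressed as a number. The sketch below computes a rank gap for the five rows shown in `merged.head()` above; the column name `rank_gap` and the frame `demo` are assumptions introduced for this example.

```python
import pandas as pd

# The five rows shown in merged.head() above.
demo = pd.DataFrame({
    "country_name": ["United States", "China", "Germany", "Japan", "India"],
    "gdp_rank": [1, 2, 3, 4, 5],
    "pop_rank": [3, 2, 19, 12, 1],
})

# Positive gap: the economy ranks higher than the population rank would suggest.
# Zero gap: the country sits exactly on the diagonal.
# Negative gap: the population ranks higher than the economy.
demo["rank_gap"] = demo["pop_rank"] - demo["gdp_rank"]

print(demo.sort_values("rank_gap", ascending=False))
```

In this small sample, China sits on the diagonal, Germany and Japan are well above it, and India is below it.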

### Takeaways

There is a clear overall pattern: many of the most populous countries also have high GDP rankings. Large economies such as the United States, China, and India appear near the top ranks on both axes.

However, there are also important exceptions. Some countries rank much higher by GDP than by population: Germany is 3rd by GDP but 19th by population, and Japan is 4th by GDP but 12th by population. These are high-income economies where output per person is high. Other countries have very large populations but lower GDP ranks, which suggests that a large population does not automatically translate into a large economy.

Overall, the plot shows that population size and economic size are related but not in a one-to-one way. Productivity, income per person, and other structural factors also matter.

### Limitations and possible extensions

This project uses only one year of data and focuses on rankings rather than exact GDP or population values. A more complete analysis could look at trends over time, compute GDP per capita, or group countries by income level or region. For the purposes of this assignment, the main goal is to practice loading two different datasets, cleaning them, merging them, and presenting the combined information in a clear visualization.
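As one example of such an extension, GDP per capita can be derived from the two value columns once the units are aligned. This is a minimal sketch using two rows from the merged table above; the column name `gdp_per_capita_usd` is an assumption, and the result is only as precise as the rounded source figures.

```python
import pandas as pd

# Two rows from the merged table above.
demo = pd.DataFrame({
    "country_name": ["United States", "India"],
    "gdp_usd_millions": [29184890.0, 3912686.0],
    "population_thousands": [340111.0, 1450936.0],
})

# Align units first: millions of dollars -> dollars, thousands of people -> people.
gdp_usd = demo["gdp_usd_millions"] * 1_000_000
people = demo["population_thousands"] * 1_000

demo["gdp_per_capita_usd"] = gdp_usd / people

print(demo[["country_name", "gdp_per_capita_usd"]].round(0))
```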