# World population data, by country

In this notebook, I use BeautifulSoup to scrape country-level population data from the World Bank (World Development Indicators table 2.1, "Population dynamics"). I then use pandas to name the columns and split the data into two dataframes, one for individual countries and one for aggregates, before exporting each as a CSV file.

In [72]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import html5lib  # not called directly; pandas needs it installed for read_html(flavor="bs4")

In [73]:
# World Bank WDI table 2.1 (Population dynamics)
url = "http://wdi.worldbank.org/table/2.1"
response = requests.get(url)
print(response.status_code)

200
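
A status code of 200 means the request succeeded. As a hedged sketch (not part of the original notebook), the check can be made automatic so that a failed request stops the run instead of handing an error page to the parser. The helper name `fetch_html` and the 30-second timeout are assumptions made for illustration.

```python
import requests


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and return its text, raising on any request failure.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx replies.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://wdi.worldbank.org/table/2.1")` should return the same text as `response.text` in the cell above, provided the site responds normally.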


# Get columns

In [74]:
soup = bs(response.text, 'html.parser')
headers = soup.find('table', {'id':'fixedTable'})

In [75]:
headers

<table border="0" cellpadding="0" cellspacing="0" class="indicators-table" id="fixedTable"> <caption id="report_caption">Population dynamics </caption> <tr class="level0"> <th class="first"></th> <th class="separator" colspan="2"><div class="spacer"><a data-text="Metadata:Population" href="javascript:void(0)" onclick="loadWDIMetaData('SP.POP.TOTL', 'S', 'Series', 'Population', 'Population@Population ages 0-14 (% of total population)@Population ages 15-64 (% of total population)@Population ages 65 and above (% of total population)@Age dependency ratio, young (% of working-age population)@Age dependency ratio, old (% of working-age population)@Crude death rate@Crude birth rate@', 'SP.POP.TOTL@SP.POP.0014.TO.ZS@SP.POP.1564.TO.ZS@SP.POP.65UP.TO.ZS@SP.POP.DPND.YG@SP.POP.DPND.OL@SP.DYN.CDRT.IN@SP.DYN.CBRT.IN@')">Population</a></div></th> <th class="separator" colspan="1"><div class="spacer">Average annual population growth %</div></th> <th class="separator" colspan="3"><div class="spacer">Po
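
The page keeps its header rows in one table (`fixedTable`) and its data rows in another (`scrollTable`). `soup.find` returns `None` when nothing matches, so a small guard can make a changed page layout fail with a clear message. This is a sketch on a tiny stand-in document, not the real page; `find_table` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page, which has a header table and a body table.
SAMPLE_HTML = """
<table id="fixedTable"><tr><th>Population</th></tr></table>
<table id="scrollTable"><tr><td>Afghanistan</td><td>20.8</td></tr></table>
"""


def find_table(soup, table_id):
    """Return the <table> with the given id, or raise if the page lacks it."""
    table = soup.find("table", {"id": table_id})
    if table is None:
        raise LookupError(f"no <table id={table_id!r}> found; page layout may have changed")
    return table


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_table = find_table(sample_soup, "fixedTable")
body_table = find_table(sample_soup, "scrollTable")
print(header_table.th.get_text(), body_table.td.get_text())
```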

In [76]:
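from io import StringIO

# read_html returns a list of DataFrames; take the first. Wrapping the HTML in
# StringIO avoids the literal-string deprecation warning in pandas 2.1+.
df0 = pd.read_html(StringIO(str(headers)), flavor="bs4")[0]
df0.head(10)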

Unnamed: 0_level_0,Unnamed: 0_level_0,Population,Population,Average annual population growth %,Population age composition,Population age composition,Population age composition,Dependency ratio,Dependency ratio,Crude death rate,Crude birth rate
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Ages 0-14,Ages 15-64,Ages 65+,young,old,Unnamed: 9_level_1,Unnamed: 10_level_1
Unnamed: 0_level_2,Unnamed: 0_level_2,millions,millions,Unnamed: 3_level_2,%,%,%,% of working-age population,% of working-age population,"per 1,000 people","per 1,000 people"
Unnamed: 0_level_3,Unnamed: 0_level_3.1,2000,2020,2000-2020,2020,2020,2020,2020,2020,2019,2019


In [77]:
columns = ['country_name', 'pop_2000', 'pop_2020', 'avg_pop_growth',
           'pop_0to14', 'pop_15to64', 'pop_65plus',
          'young_pop', 'old_pop',
          'death_rate', 'birth_rate']
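
The short names above were typed by hand from the four header rows. As a cross-check, the header levels can be joined into one readable description per column and compared against each short name. This is only a sketch using three header tuples copied from the `df0` output; the `describe` helper is hypothetical.

```python
import pandas as pd

# Three of the header columns, copied from the df0 output above.
header = pd.MultiIndex.from_tuples([
    ("Population", "Unnamed: 1_level_1", "millions", "2000"),
    ("Population age composition", "Ages 0-14", "%", "2020"),
    ("Crude death rate", "Unnamed: 9_level_1", "per 1,000 people", "2019"),
])


def describe(levels):
    """Join the header levels, skipping pandas' 'Unnamed: ...' placeholders."""
    return " | ".join(str(p) for p in levels if not str(p).startswith("Unnamed"))


short_names = ["pop_2000", "pop_0to14", "death_rate"]
for short, levels in zip(short_names, header):
    print(f"{short}: {describe(levels)}")
```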

# Get table data

In [78]:
# Reuse the soup parsed earlier; the data rows live in a separate table, 'scrollTable'
popTable = soup.find('table', {'id':'scrollTable'})

In [79]:
popTable

<table border="0" cellpadding="0" cellspacing="0" class="indicators-table" id="scrollTable"> <tbody> <tr> <td class="country"><div class="spacer"><a class="metaLink" data-text="Metadata:Afghanistan" href="javascript:void(0)" onclick="loadMetaData('AFG', 'C' ,'Country',  'Afghanistan')">Afghanistan</a></div></td> <td class=""><div class="spacer">20.8</div></td> <td class=""><div class="spacer">38.9</div></td> <td class=""><div class="spacer">3.1</div></td> <td class=""><div class="spacer">42</div></td> <td class=""><div class="spacer">56</div></td> <td class=""><div class="spacer">3</div></td> <td class=""><div class="spacer">75</div></td> <td class=""><div class="spacer">5</div></td> <td class=""><div class="spacer">6</div></td> <td class=""><div class="spacer">32</div></td> </tr> <tr> <td class="country"><div class="spacer"><a class="metaLink" data-text="Metadata:Albania" href="javascript:void(0)" onclick="loadMetaData('ALB', 'C' ,'Country',  'Albania')">Albania</a></div></td> <td cla

In [80]:
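from io import StringIO

# read_html returns a list of DataFrames; take the first. Wrapping the HTML in
# StringIO avoids the literal-string deprecation warning in pandas 2.1+.
df = pd.read_html(StringIO(str(popTable)), flavor="bs4")[0]
df.head()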

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Afghanistan,20.8,38.9,3.1,42,56,3,75,5,6,32
1,Albania,3.1,2.8,-0.4,17,68,15,25,22,8,12
2,Algeria,31.0,43.9,1.7,31,62,7,49,11,5,24
3,American Samoa,0.1,0.1,-0.2,..,..,..,..,..,..,..
4,Andorra,0.1,0.1,0.8,..,..,..,..,..,4,7


# Add column names

In [81]:
df.columns = columns
df.head()

Unnamed: 0,country_name,pop_2000,pop_2020,avg_pop_growth,pop_0to14,pop_15to64,pop_65plus,young_pop,old_pop,death_rate,birth_rate
0,Afghanistan,20.8,38.9,3.1,42,56,3,75,5,6,32
1,Albania,3.1,2.8,-0.4,17,68,15,25,22,8,12
2,Algeria,31.0,43.9,1.7,31,62,7,49,11,5,24
3,American Samoa,0.1,0.1,-0.2,..,..,..,..,..,..,..
4,Andorra,0.1,0.1,0.8,..,..,..,..,..,4,7
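
The `..` entries in rows such as American Samoa appear to mark missing values, and any column that contains them is read as text rather than numbers. The notebook exports them unchanged. A minimal sketch of one way to convert them, assuming `..` is the only non-numeric marker, on three rows copied from the output above:

```python
import pandas as pd

# Three rows copied from the output above; columns containing '..' hold strings.
sample = pd.DataFrame({
    "country_name": ["Afghanistan", "American Samoa", "Andorra"],
    "pop_2020": [38.9, 0.1, 0.1],
    "pop_0to14": ["42", "..", ".."],
    "death_rate": ["6", "..", "4"],
})

numeric_cols = [c for c in sample.columns if c != "country_name"]
cleaned = sample.copy()
# errors="coerce" turns anything unparseable (here, '..') into NaN.
# Note that it would also silently hide any other malformed value.
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(cleaned)
```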


## Column information (for reference) 

**country_name:** The name of the country for the row.\
**pop_2000:** The country's population in the year 2000, in millions.\
**pop_2020:** The country's population in the year 2020, in millions.\
**avg_pop_growth:** The average annual population growth from 2000 to 2020, as a percent.\
**pop_0to14:** The portion of the population that is age 0-14 in 2020, as a percent.\
**pop_15to64:** The portion of the population that is age 15-64 in 2020, as a percent.\
**pop_65plus:** The portion of the population that is age 65+ in 2020, as a percent.\
**young_pop:** The young-age dependency ratio in 2020: the population aged 0-14 as a percent of the working-age population (15-64).\
**old_pop:** The old-age dependency ratio in 2020: the population aged 65+ as a percent of the working-age population (15-64).\
**death_rate:** The crude death rate in 2019, per 1,000 people.\
**birth_rate:** The crude birth rate in 2019, per 1,000 people.
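
As a worked check of the two dependency-ratio definitions, the ratios can be recomputed from the age-composition columns and compared with the published values. The inputs are already rounded to whole percents, so small differences are expected. This sketch uses the first three rows of the table and is not part of the original notebook.

```python
import pandas as pd

# First three country rows, copied from the table above.
sample = pd.DataFrame({
    "country_name": ["Afghanistan", "Albania", "Algeria"],
    "pop_0to14": [42, 17, 31],
    "pop_15to64": [56, 68, 62],
    "pop_65plus": [3, 15, 7],
    "young_pop": [75, 25, 49],
    "old_pop": [5, 22, 11],
})

# Dependency ratio = dependent age group / working-age group, as a percent.
sample["young_est"] = 100 * sample["pop_0to14"] / sample["pop_15to64"]
sample["old_est"] = 100 * sample["pop_65plus"] / sample["pop_15to64"]

# Absolute gap between the recomputed and the published ratios.
sample["young_diff"] = (sample["young_est"] - sample["young_pop"]).abs()
sample["old_diff"] = (sample["old_est"] - sample["old_pop"]).abs()

print(sample[["country_name", "young_est", "young_pop", "old_est", "old_pop"]].round(1))
```

For Afghanistan, for example, 42 / 56 gives 75, matching `young_pop`.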

# Split table

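The last several rows of the table hold aggregates (the world and regional groupings) rather than individual countries, so I split the data into two dataframes: df_countries and df_summary. Zimbabwe is the last country row, so I look up its index to find the boundary.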

In [82]:
index = df.index[df.country_name == 'Zimbabwe'].tolist()
index

[210]

In [83]:
# Keep every row up to and including Zimbabwe (index[0] is 210)
df_countries = df[:index[0] + 1]
df_countries.tail()

Unnamed: 0,country_name,pop_2000,pop_2020,avg_pop_growth,pop_0to14,pop_15to64,pop_65plus,young_pop,old_pop,death_rate,birth_rate
206,Virgin Islands (U.S.),0.1,0.1,-0.1,19,60,20,32,34,8,12
207,West Bank and Gaza,2.9,4.8,2.5,38,58,3,66,6,3,29
208,"Yemen, Rep.",17.4,29.8,2.7,39,58,3,67,5,6,30
209,Zambia,10.4,18.4,2.8,44,54,2,82,4,6,36
210,Zimbabwe,11.9,14.9,1.1,42,55,3,76,5,8,30


In [84]:
# Everything after Zimbabwe is aggregate data
df_summary = df[index[0] + 1:].reset_index(drop=True)
df_summary.head()

Unnamed: 0,country_name,pop_2000,pop_2020,avg_pop_growth,pop_0to14,pop_15to64,pop_65plus,young_pop,old_pop,death_rate,birth_rate
0,World,6114.3,7752.8,1.2,25,65,9,39,14,8,18
1,East Asia & Pacific,2047.6,2352.0,0.7,20,69,12,28,17,7,12
2,Europe & Central Asia,861.3,923.5,0.3,18,65,17,28,26,10,11
3,Latin America & Caribbean,520.9,652.3,1.1,24,67,9,36,13,6,16
4,Middle East & North Africa,315.3,464.6,1.9,30,65,5,46,8,5,22
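
If the source table gains or loses rows, the position of the boundary will move. One way to keep the split robust is to wrap it in a function that finds the boundary by position and fails loudly when the marker row is missing or duplicated. This is a sketch on four rows copied from the outputs above; `split_after` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Last two country rows and first two aggregate rows, copied from the outputs above.
sample = pd.DataFrame({
    "country_name": ["Zambia", "Zimbabwe", "World", "East Asia & Pacific"],
    "pop_2020": [18.4, 14.9, 7752.8, 2352.0],
})


def split_after(frame, last_country):
    """Split `frame` into (rows up to and including last_country, remaining rows)."""
    positions = np.flatnonzero((frame["country_name"] == last_country).to_numpy())
    if len(positions) != 1:
        raise ValueError(
            f"expected exactly one {last_country!r} row, found {len(positions)}"
        )
    cut = int(positions[0]) + 1
    countries = frame.iloc[:cut]
    summary = frame.iloc[cut:].reset_index(drop=True)
    return countries, summary


countries, summary = split_after(sample, "Zimbabwe")
print(countries)
print(summary)
```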


In [88]:
df_countries.to_csv('pops_bycountry.csv', index=False, header=True)
df_summary.to_csv('pops_summary.csv', index=False, header=True)
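
To confirm that an export like the one above survives a round trip, a file can be written and read back, then compared with the original. The sketch below uses a temporary directory and three rows from the table, including a country name that contains a comma. The file name `pops_sample.csv` is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three country rows copied from the table; "Yemen, Rep." tests CSV quoting.
sample = pd.DataFrame({
    "country_name": ["Yemen, Rep.", "Zambia", "Zimbabwe"],
    "pop_2000": [17.4, 10.4, 11.9],
    "pop_2020": [29.8, 18.4, 14.9],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pops_sample.csv"  # hypothetical file name
    sample.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Raises AssertionError if the restored frame differs from the original.
pd.testing.assert_frame_equal(sample, restored)
print(restored)
```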