Clean UN population data. Source: https://population.un.org/wup/Download/ (dataset WUP2018-F12-Cities_Over_300K.xls). The spreadsheet's formatting was normalized and the file was converted to CSV in Excel. This notebook performs the rest of the cleaning.

In [42]:
import pandas as pd
import numpy as np

In [43]:
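# import dataset
# Note: a text file's .encoding is the platform default (locale) encoding,
# not one detected from the file contents.
with open('pop_data.csv') as f:
    enc = f.encoding
df = pd.read_csv("pop_data.csv", encoding=enc)
df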

,Index,Country Code,Country or area,City Code,Urban Agglomeration,Note,Latitude,Longitude,1950,1955,...,1990,1995,2000,2005,2010,2015,2020,2025,2030,2035
0,1,4,Afghanistan,20001,Herat,,34.3482,62.1997,82.47,85.75,...,183.47,207.19,233.99,275.68,358.69,466.70,605.58,752.91,897.04,1057.57
1,2,4,Afghanistan,20002,Kabul,,34.5289,69.1725,170.78,220.75,...,1549.32,1928.69,2401.11,2905.18,3289.01,3723.54,4221.53,4877.02,5737.14,6760.50
2,3,4,Afghanistan,20003,Kandahar,,31.6133,65.7101,82.20,89.79,...,233.24,263.40,297.46,336.75,383.50,436.74,498.00,577.13,679.28,800.46
3,4,4,Afghanistan,20004,Mazar-e Sharif,,36.7090,67.1109,30.00,37.14,...,135.15,152.63,172.37,206.40,283.53,389.48,532.69,681.53,816.04,962.26
4,5,8,Albania,20005,Tiranë (Tirana),,41.3275,19.8189,84.51,106.93,...,247.27,287.95,335.34,371.80,408.70,449.30,493.71,535.70,565.30,581.63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1855,1856,894,Zambia,23277,Lusaka,,-15.4134,28.2771,31.17,53.24,...,757.43,901.62,1073.30,1357.14,1722.88,2187.18,2774.13,3470.87,4266.52,5182.67
1856,1857,894,Zambia,23279,Ndola,,-12.9587,28.6366,30.62,47.15,...,333.98,353.34,373.50,408.95,448.79,492.52,542.50,628.60,762.48,925.73
1857,1858,716,Zimbabwe,22510,Bulawayo,,-20.1500,28.5833,91.64,94.50,...,570.05,637.02,664.56,669.88,658.24,646.80,638.19,668.95,753.31,874.48
1858,1859,716,Zimbabwe,22511,Chitungwiza,,-18.0127,31.0756,38.65,48.78,...,248.92,287.99,312.29,332.56,349.41,367.10,386.45,419.61,475.42,552.03
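
Because `open(...).encoding` only reports the platform default, the import above works when the CSV happens to be saved in that same encoding. A more portable option is to try a short list of candidate encodings explicitly. The sketch below is a minimal illustration, not part of the original notebook: `read_csv_with_fallback` is a hypothetical helper, and the candidate list (`utf-8`, then `cp1252`) is an assumption based on the file having been exported from Excel.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding in turn, return the first successful parse."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # wrong guess; try the next candidate
    raise last_error


# Demo on a throwaway file saved in a Windows code page (not the real pop_data.csv).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write("city\nTiranë (Tirana)\n".encode("cp1252"))
    demo = read_csv_with_fallback(demo_path)

print(demo)
```

Trying UTF-8 first is deliberate: it rejects most text saved in a single-byte code page, whereas `cp1252` accepts almost any byte sequence and would silently garble a UTF-8 file.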


In [44]:
# select necessary columns only
df = df[['Country or area',
         'Urban Agglomeration',
         'Latitude',
         'Longitude',
         '2020']]
df

,Country or area,Urban Agglomeration,Latitude,Longitude,2020
0,Afghanistan,Herat,34.3482,62.1997,605.58
1,Afghanistan,Kabul,34.5289,69.1725,4221.53
2,Afghanistan,Kandahar,31.6133,65.7101,498.00
3,Afghanistan,Mazar-e Sharif,36.7090,67.1109,532.69
4,Albania,Tiranë (Tirana),41.3275,19.8189,493.71
...,...,...,...,...,...
1855,Zambia,Lusaka,-15.4134,28.2771,2774.13
1856,Zambia,Ndola,-12.9587,28.6366,542.50
1857,Zimbabwe,Bulawayo,-20.1500,28.5833,638.19
1858,Zimbabwe,Chitungwiza,-18.0127,31.0756,386.45


In [45]:
# rename columns; copy first so the assignments below act on an independent
# DataFrame rather than on a slice of the original (avoids SettingWithCopyWarning)
df = df.copy()
df.columns = ['country',
              'city',
              'lat',
              'lon',
              'pop_2020']
# population is given in thousands; convert to a head count,
# rounding before the integer cast so float error cannot truncate a value
df['pop_2020'] = (df['pop_2020'] * 1000).round().astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1860 entries, 0 to 1859
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   country   1860 non-null   object 
 1   city      1860 non-null   object 
 2   lat       1860 non-null   float64
 3   lon       1860 non-null   float64
 4   pop_2020  1860 non-null   int32  
dtypes: float64(2), int32(1), object(2)
memory usage: 65.5+ KB


The city column gives an alternate name in parentheses for some cities, e.g. "Tiranë (Tirana)". The two names are split into separate columns to give more options when joining with other datasets.

In [46]:
# function for extracting alternate city name (the text inside the parentheses)
def alt_city(city):
    paren = city.find('(')
    if paren == -1:
        return ''
    # drop the closing parenthesis and any surrounding whitespace
    return city[paren+1:].strip().rstrip(')').strip()

# function for extracting original city name (the text before the parentheses)
def og_city(city):
    paren = city.find('(')
    if paren == -1:
        return city
    # strip the space left between the name and the opening parenthesis
    return city[:paren].strip()

# create column for alternate name, modify city column to only hold original name
df['city_alt'] = df.city.apply(alt_city)
df['city'] = df.city.apply(og_city)
df

,country,city,lat,lon,pop_2020,city_alt
0,Afghanistan,Herat,34.3482,62.1997,605580,
1,Afghanistan,Kabul,34.5289,69.1725,4221530,
2,Afghanistan,Kandahar,31.6133,65.7101,498000,
3,Afghanistan,Mazar-e Sharif,36.7090,67.1109,532690,
4,Albania,Tiranë,41.3275,19.8189,493710,Tirana
...,...,...,...,...,...,...
1855,Zambia,Lusaka,-15.4134,28.2771,2774130,
1856,Zambia,Ndola,-12.9587,28.6366,542500,
1857,Zimbabwe,Bulawayo,-20.1500,28.5833,638190,
1858,Zimbabwe,Chitungwiza,-18.0127,31.0756,386450,
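
The same split can also be written with pandas' vectorized string methods instead of two `apply` calls. This is a sketch on a few sample values rather than the notebook's `df`; the regular expression and the `cities`/`parts` names are illustrative assumptions, and the pattern expects at most one trailing parenthesized name per entry.

```python
import pandas as pd

# A few sample values in the same "Name (Alternate)" layout as the city column.
cities = pd.Series(["Herat", "Tiranë (Tirana)", "Mazar-e Sharif"])

# Named groups become column names: "city" is the text before an optional
# " (...)" suffix, "city_alt" is the text inside the parentheses.
pattern = r"^(?P<city>[^(]*?)\s*(?:\((?P<city_alt>[^)]*)\))?\s*$"
parts = cities.str.extract(pattern)

# Rows without parentheses get NaN for city_alt; use '' to match the cell above.
parts["city_alt"] = parts["city_alt"].fillna("")
print(parts)
```

Because the pattern consumes the whitespace before the opening parenthesis, the extracted `city` values carry no trailing space.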


Add ISO 3166-1 alpha-3 (iso3) country codes as match keys. Kosovo is given the code KSV for the purposes of this project, even though it currently has no officially assigned iso3 code.

In [47]:
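# .encoding here is the platform default (locale) encoding, not one detected from the file
with open('iso3.csv') as f:
    enc = f.encoding
iso3 = pd.read_csv('iso3.csv', encoding=enc)
iso3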

,country,iso3
0,Afghanistan,AFG
1,Åland Islands,ALA
2,Albania,ALB
3,Algeria,DZA
4,American Samoa,ASM
...,...,...
261,Wallis and Futuna,WLF
262,Western Sahara,ESH
263,Yemen,YEM
264,Zambia,ZMB


Some country names in the population data do not match the names used in the iso3 file. They are adjusted here.

In [48]:
mismatch = {'Bolivia (Plurinational State of)':'Bolivia, Plurinational State of', 
            'China, Hong Kong SAR':'Hong Kong',
            'China, Macao SAR':'Macao',
            'China, Taiwan Province of China':'Taiwan',
            'Czechia':'Czech Republic',
            "Dem. People's Republic of Korea":"Korea, Democratic People's Republic of",
            'Iran (Islamic Republic of)':'Iran',
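            # NOTE: this gives South Sudan's cities Sudan's code (SDN); map to a
            # South Sudan (SSD) row instead if iso3.csv contains one
            'South Sudan':'Sudan',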
            'State of Palestine':'Palestinian Territory, Occupied',
            'TFYR Macedonia':'Macedonia, the former Yugoslav Republic of',
            'Venezuela (Bolivarian Republic of)':'Venezuela, Bolivarian Republic of'}

df = df.replace({'country':mismatch})
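
One way to find the names that need an entry in a mapping like the one above is a set difference between the two country columns. The frames below are toy stand-ins for the notebook's `df` and `iso3` (the `cities` and `codes` names and their values are illustrative only).

```python
import pandas as pd

# Toy stand-ins for the city data and the code table.
cities = pd.DataFrame({"country": ["Albania", "Czechia", "Czechia", "Zambia"]})
codes = pd.DataFrame({"country": ["Albania", "Czech Republic", "Zambia"],
                      "iso3": ["ALB", "CZE", "ZMB"]})

# Names that appear in the city data but have no counterpart in the code table.
unmatched = sorted(set(cities["country"]) - set(codes["country"]))
print(unmatched)
```

Running the same check again after `replace` should give an empty list once every name has been reconciled.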

Merge the two datasets, then check that every row received an iso3 code.

In [49]:
df = pd.merge(df, iso3, how="left", on="country")
print(df[df['iso3'].isnull()])

Empty DataFrame
Columns: [country, city, lat, lon, pop_2020, city_alt, iso3]
Index: []
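
The null check above can be complemented with two options of `pd.merge`: `validate="many_to_one"` raises an error if a country name occurs more than once in the code table (which would duplicate city rows), and `indicator=True` adds a `_merge` column that marks rows with no match. The sketch uses toy frames, not the notebook's data; "Atlantis" is a made-up name included to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins for the city data and the code table.
cities = pd.DataFrame({"country": ["Albania", "Zambia", "Atlantis"],
                       "city": ["Tiranë", "Lusaka", "Nowhere"]})
codes = pd.DataFrame({"country": ["Albania", "Zambia"],
                      "iso3": ["ALB", "ZMB"]})

# validate guards against duplicate keys on the right side;
# indicator records whether each row found a match.
merged = pd.merge(cities, codes, how="left", on="country",
                  validate="many_to_one", indicator=True)

# Rows present only in the left frame received no iso3 code.
missing = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(missing)
```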


All rows are matched. Reorder the columns next.

In [50]:
df = df[['iso3',
          'country',
          'city',
          'city_alt',
          'lat',
          'lon',
          'pop_2020']]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1860 entries, 0 to 1859
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   iso3      1860 non-null   object 
 1   country   1860 non-null   object 
 2   city      1860 non-null   object 
 3   city_alt  1860 non-null   object 
 4   lat       1860 non-null   float64
 5   lon       1860 non-null   float64
 6   pop_2020  1860 non-null   int32  
dtypes: float64(2), int32(1), object(4)
memory usage: 109.0+ KB


In [51]:
# write final dataset to csv
df.to_csv('pop_clean.csv', index=False)
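
Since some city names contain non-ASCII characters (e.g. "Tiranë"), it can be worth confirming that the written file reads back unchanged. The sketch below does a round trip on a one-row sample in a temporary directory; the sample frame and file name are illustrative, and passing `encoding="utf-8"` explicitly on both sides is a suggested precaution rather than something the original notebook does.

```python
import os
import tempfile

import pandas as pd

# One-row sample with a non-ASCII city name.
sample = pd.DataFrame({"iso3": ["ALB"], "city": ["Tiranë"], "pop_2020": [493710]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sample_clean.csv")
    # Write and read with the same explicit encoding.
    sample.to_csv(out_path, index=False, encoding="utf-8")
    restored = pd.read_csv(out_path, encoding="utf-8")

# Compare cell values (not dtypes, which can differ across platforms).
round_trip_ok = restored.values.tolist() == sample.values.tolist()
print(round_trip_ok)
```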