In [9]:
'''
Population data

The population dataset by Gapminder contains information about the world's population from 1800 until 2100.
The data is compiled from different sources. There are two main sources: a dataset by Angus Maddison and the CLIO Infra Project, and the World Population Prospects (WPP) provided by the UN.
The dataset by Angus Maddison provides the data for the years 1800 - 1950. Population data after 1950 was taken from the WPP dataset.
Additional sources were used to fill missing values for years or regions. A detailed list of sources can be found in the documentation (https://www.gapminder.org/data/documentation/gd003/).
The primary data from the sources originate from censuses, informal censuses, indirect estimates and arbitrary guesses.

Modifications and Estimations:
- Summations of parts
- Larger area minus non-included parts
- Geographical interpolation
- Geographical extrapolation
- Temporal interpolation
- Temporal extrapolation
- Adjustments for under-enumeration
- Recalculated to fit present borders

Since the data for the year 1950 of the two main datasets did not match for every country, small adjustments and smoothing were applied.

Preprocessing:
For our purposes, the sheets "data-for-world-by-year", "data-for-regions-by-year" and "data-for-countries-etc-by-year" are relevant.
We restrict the data to the period from 1900 until 2021 and drop all other rows, because the data after 2021 are based on estimated values.

Apart from Holy See population values starting in 2001, there are no missing values.
Since the regions of the dataset do not match the regions of our other datasets, we compute the regional population according to the regions from the un-country-codes.csv file.

Subdataset:
    population-global.csv
        Columns: year, population

    population-region.csv
        Columns: region_code, region_name, year, population

    population-country.csv
        Columns: country_code, country_name, year, population
'''


In [10]:
import pandas as pd
import os

In [11]:
# create the output directory if it does not exist yet
os.makedirs("data/processed/population/", exist_ok=True)

# Read Excel

In [12]:
pop_dict = pd.read_excel('data/raw/population/gapminder-population-v7.xlsx', sheet_name=['data-for-world-by-year', 'data-for-regions-by-year', 'data-for-countries-etc-by-year'])

## Preprocess world data

In [13]:
pop_global_df = pop_dict.get('data-for-world-by-year')

print('raw data for world population')
print(pop_global_df.head(3))

# keep only the needed columns and give them consistent names
pop_global_df = pop_global_df[['time', 'Population']].rename(columns={'Population': 'population', 'time': 'year'})

print('extract population data from 1900 until now')
# restrict to 1900-2021; .copy() avoids a SettingWithCopyWarning on the assignment below
pop_global_df = pop_global_df[pop_global_df['year'].between(1900, 2021)].copy()
pop_global_df['population'] = pop_global_df['population'].astype(int)
print(pop_global_df)

pop_global_df.to_csv('data/processed/population/population-global.csv', sep=';', index=False, header=True)

raw data for world population
     geo   name  time   Population
0  world  world  1800  985083734.9
1  world  world  1801  988518009.0
2  world  world  1802  991993182.0
extract population data from 1900 until now
     year  population
100  1900  1627123965
101  1901  1639684222
102  1902  1652898195
103  1903  1666760587
104  1904  1680799825
..    ...         ...
217  2017  7599822404
218  2018  7683789828
219  2019  7764951032
220  2020  7840952880
221  2021  7909295151

[122 rows x 2 columns]


# Prepare dataset for countries and regions

In [14]:
pop_country_df = pop_dict.get('data-for-countries-etc-by-year')
un_country_codes = pd.read_csv("data/raw/country-codes/un-country-codes.csv", sep=";")
un_country_codes = un_country_codes[['Region Code', 'Region Name', 'ISO-alpha3 Code']]

# extract population data from 1900 until 2021; .copy() avoids a SettingWithCopyWarning below
pop_country_df = pop_country_df[pop_country_df['time'].between(1900, 2021)].copy()

# upper-case the join column so it matches the ISO-alpha3 codes
pop_country_df['geo'] = pop_country_df['geo'].str.upper()

# merge country-population and un-country-codes
pop_country_region_df = pd.merge(pop_country_df, un_country_codes, how='left', left_on='geo', right_on='ISO-alpha3 Code')

pop_country_region_df.rename(columns={'Region Code': 'region_code', 'Region Name': 'region_name', 'time': 'year', 'Population': 'population', 'geo': 'country_code', 'name': 'country_name'}, inplace=True)

# add the region codes that are still missing after the merge for the Holy See (HOS) and Taiwan (TWN)
pop_country_region_df.loc[pop_country_region_df['country_code'] == 'HOS', 'region_code'] = 150
pop_country_region_df.loc[pop_country_region_df['country_code'] == 'HOS', 'region_name'] = 'Europe'

pop_country_region_df.loc[pop_country_region_df['country_code'] == 'TWN', 'region_code'] = 142
pop_country_region_df.loc[pop_country_region_df['country_code'] == 'TWN', 'region_name'] = 'Asia'

# remove unused column 'ISO-alpha3 Code'
pop_country_region_df.drop('ISO-alpha3 Code', axis=1, inplace=True)

# check for rows that still contain missing values
print(pop_country_region_df[pop_country_region_df.isnull().any(axis=1)])

# The remaining missing values are population figures for the Holy See. Since the missing population is below 1000, we accept the missing values and calculate with the latest documented value.

     country_code country_name  year  population  region_code region_name
8885          HOS     Holy See  2001         NaN        150.0      Europe
8886          HOS     Holy See  2002         NaN        150.0      Europe
8887          HOS     Holy See  2003         NaN        150.0      Europe
8888          HOS     Holy See  2004         NaN        150.0      Europe
8889          HOS     Holy See  2005         NaN        150.0      Europe
8890          HOS     Holy See  2006         NaN        150.0      Europe
8891          HOS     Holy See  2007         NaN        150.0      Europe
8892          HOS     Holy See  2008         NaN        150.0      Europe
8893          HOS     Holy See  2009         NaN        150.0      Europe
8894          HOS     Holy See  2010         NaN        150.0      Europe
8895          HOS     Holy See  2011         NaN        150.0      Europe
8896          HOS     Holy See  2012         NaN        150.0      Europe
8897          HOS     Holy See  2013  
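
The null check above finds country codes without a region only indirectly. A minimal sketch of a more explicit check, assuming toy stand-ins for the two tables (the rows below are illustrative, not taken from the real files): the `indicator=True` option of `pd.merge` adds a `_merge` column that marks rows found only in the left table.

```python
import pandas as pd

# Toy stand-ins for the population sheet and the UN code table (illustrative rows only)
pop = pd.DataFrame({"geo": ["DEU", "TWN", "HOS"], "time": [2021, 2021, 2021]})
codes = pd.DataFrame({
    "ISO-alpha3 Code": ["DEU", "FRA"],
    "Region Name": ["Europe", "Europe"],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = pd.merge(pop, codes, how="left", left_on="geo",
                  right_on="ISO-alpha3 Code", indicator=True)

# country codes from the population data that found no partner in the code table
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "geo"].unique())
print(unmatched)
```

Codes listed in `unmatched` are the ones that need a manually assigned region, as done for HOS and TWN above.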

## Get population regions dataset

In [15]:
columns_regions = ['region_code', 'region_name', 'country_code', 'country_name', 'year', 'population']

# select columns
pop_regions_df = pop_country_region_df[columns_regions]
pop_regions_df = pop_regions_df.groupby(['region_code', 'region_name', 'year'], as_index=False)['population'].sum()
pop_regions_df['population'] = pop_regions_df['population'].astype(int)

# store in CSV file
pop_regions_df.to_csv('data/processed/population/population-region.csv', sep=';', index=False, header=True)
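
It is worth knowing how the grouped sum treats the missing Holy See values: by default pandas skips NaN when summing, so such rows contribute nothing to the regional total. A small sketch with made-up numbers (the regions and values are placeholders) shows the default behaviour and the `min_count` option, which turns groups without any valid value into NaN instead of 0.

```python
import numpy as np
import pandas as pd

# Made-up values: one region with a gap, one region with no valid value at all
toy = pd.DataFrame({
    "region_name": ["Europe", "Europe", "Oceania"],
    "year": [2021, 2021, 2021],
    "population": [100.0, np.nan, np.nan],
})

keys = ["region_name", "year"]

# default: NaN is skipped, an all-NaN group sums to 0.0
default_sum = toy.groupby(keys, as_index=False)["population"].sum()

# min_count=1: a group needs at least one valid value, otherwise the result is NaN
strict_sum = toy.groupby(keys, as_index=False)["population"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With the default, the subsequent `astype(int)` works because no NaN remains in the summed column.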

## Get population countries dataset

In [16]:
columns_country = ['country_code', 'country_name', 'year', 'population']

# select columns
pop_country_df = pop_country_region_df[columns_country]

# store in CSV file
pop_country_df.to_csv('data/processed/population/population-country.csv', sep=';', index=False, header=True)
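
The country file is written with the Holy See gaps still in place. One possible way to carry the latest documented value forward is a per-country forward fill. This is only a sketch with made-up numbers; the column `population_filled` is introduced here for illustration and is not part of the stored dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for two countries; the last HOS year is missing
toy = pd.DataFrame({
    "country_code": ["HOS", "HOS", "HOS", "DEU", "DEU", "DEU"],
    "year": [1999, 2000, 2001, 1999, 2000, 2001],
    "population": [800.0, 790.0, np.nan, 10.0, 11.0, 12.0],
})

# sort by country and year so that "forward" means "later in time"
toy = toy.sort_values(["country_code", "year"])

# fill each gap with the last known value of the same country
toy["population_filled"] = toy.groupby("country_code")["population"].ffill()

print(toy)
```

Grouping by country before filling keeps values from leaking from one country into the next.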

## Compare calculated region population to gapminder dataset

In [17]:
regions_population_gapminder = pop_dict.get('data-for-regions-by-year')
regions_population_2021_gapminder = regions_population_gapminder[regions_population_gapminder['time'] == 2021]
total_population_gapminder = regions_population_2021_gapminder['Population'].sum()

regions_population_2021_calc = pop_regions_df[pop_regions_df['year'] == 2021]
total_population_calc = regions_population_2021_calc['population'].sum()

print('Total population from the year 2021')
print('Gapminder ' + str(total_population_gapminder))
print('Calculated ' + str(total_population_calc))

print('\n')
print('Gapminder region population')
print(regions_population_2021_gapminder[['name', 'Population']])

print('\n')
print('Calculated region population')
print(regions_population_2021_calc[['region_name', 'population']])

Total population from the year 2021
Gapminder 7898737625
Calculated 7898737625


Gapminder region population
          name  Population
221     africa  1391823318
522       asia  4634610444
823     europe   846050284
1124  americas  1026253579


Calculated region population
    region_name  population
121      Africa  1391823318
243     Oceania    43602426
365    Americas  1026253579
487        Asia  4693889556
609      Europe   743168746


As we can see, the total populations for the year 2021 are identical.
At the region level, the calculated populations of the Americas and Africa match the Gapminder figures exactly, whereas the results for Asia and Europe differ. Furthermore, the calculated region Oceania does not exist in the Gapminder regions sheet. Since the documentation of the Gapminder dataset does not include the country-to-region assignment, we cannot trace the differences between our calculation and Gapminder's back to individual countries. However, since the overall totals are equal, we can argue that the differences are due to different country-to-region assignments.
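
The argument can be checked with simple arithmetic on the 2021 figures printed above. The sketch below copies those figures into two Series (region names are capitalised by hand so that both indexes align) and computes the difference per region.

```python
import pandas as pd

# 2021 region totals copied from the printed output above
gapminder = pd.Series({
    "Africa": 1391823318, "Asia": 4634610444,
    "Europe": 846050284, "Americas": 1026253579,
})
calculated = pd.Series({
    "Africa": 1391823318, "Oceania": 43602426, "Americas": 1026253579,
    "Asia": 4693889556, "Europe": 743168746,
})

# calculated minus Gapminder; a region missing on one side counts as 0
diff = calculated.sub(gapminder, fill_value=0).astype("int64")

print(diff)
print("sum of differences:", diff.sum())
```

The differences cancel out: the 102,881,538 people by which the calculated Europe falls short of Gapminder's Europe equal the calculated surplus in Asia (59,279,112) plus the whole of Oceania (43,602,426). This is consistent with populations being assigned to different regions rather than being lost or double counted.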