In [1]:
'''
Population data

The Gapminder population dataset contains information about the world's population from 1800 until 2100.
The data is compiled from different sources. There are two main sources: a dataset by Angus Maddison and the CLIO Infra Project, and the World Population Prospects (WPP) provided by the UN.
The dataset by Angus Maddison provides the data for the years 1800 to 1950. Population data after 1950 was taken from the WPP dataset.
Additional sources were used to fill missing values for years or regions. A detailed list of sources can be found in the documentation (https://www.gapminder.org/data/documentation/gd003/).
The primary data from the sources originate from censuses, informal censuses, indirect estimates and arbitrary guesses.

Modifications and Estimations:
- Summations of parts
- Larger area minus non-included parts
- Geographical interpolation
- Geographical extrapolation
- Temporal interpolation
- Temporal extrapolation
- Adjustments for under-enumeration
- Recalculated to fit present borders

Since the data for the year 1950 of the two main datasets did not match for every country, small adjustments and smoothing were applied.

Preprocessing:
For our purposes, the subdatasets "data-for-world-by-year", "data-for-regions-by-year" and "data-for-countries-etc-by-year" are relevant.
We consider the period from 1900 until 2021 and remove all other rows, because the data after 2021 are based on projected values.

There are no missing values.
Since the regions of the dataset do not match the regions of our other datasets, we compute the regional population according to the regions from the un-country-codes.csv file.

Subdatasets:
    population-global.csv
        Columns: year, population

    population-region.csv
        Columns: region_code, region_name, year, population

    population-country.csv
        Columns: country_code, country_name, year, population
'''


In [2]:
import pandas as pd

# Read Excel

In [3]:
population_dict = pd.read_excel('data/raw/population/gapminder-population-v7.xlsx', sheet_name=['data-for-world-by-year', 'data-for-regions-by-year', 'data-for-countries-etc-by-year'])
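
When `sheet_name` is a list, `pd.read_excel` returns a dict that maps each sheet name to a DataFrame. The following is a minimal sketch of a check that every loaded sheet has the columns used later in this notebook. The helper `missing_columns` and the toy dict are hypothetical, and the assumption that all three sheets share the columns seen in the world sheet is not confirmed by the source.

```python
import pandas as pd

# Columns seen in the raw world sheet; assumed here to be shared by all three sheets.
EXPECTED_COLUMNS = {'geo', 'name', 'time', 'Population'}


def missing_columns(sheets):
    """Return {sheet_name: sorted missing columns} for sheets that lack expected columns."""
    problems = {}
    for sheet_name, frame in sheets.items():
        missing = EXPECTED_COLUMNS - set(frame.columns)
        if missing:
            problems[sheet_name] = sorted(missing)
    return problems


# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=[...]).
toy_sheets = {
    'data-for-world-by-year': pd.DataFrame(
        {'geo': ['world'], 'name': ['world'], 'time': [1800], 'Population': [985083734.9]}
    ),
    # Deliberately incomplete sheet to show what the check reports.
    'data-for-regions-by-year': pd.DataFrame({'geo': ['asia'], 'time': [1800]}),
}

print(missing_columns(toy_sheets))
```

In the notebook, the same helper could be called as `missing_columns(population_dict)`; an empty dict would mean that all sheets have the expected columns.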

## Preprocess world data

In [4]:
world_population_df = population_dict.get('data-for-world-by-year')

print('raw data for world population')
print(world_population_df.head(3))

# keep only the year and population columns; copy so later changes do not act on a slice of the raw sheet
world_population_df = world_population_df[['time', 'Population']].copy()
world_population_df = world_population_df.rename(columns={'time': 'year', 'Population': 'population'})

print('extract population data from 1900 until 2021')
world_population_df = world_population_df[world_population_df['year'].between(1900, 2021)]
print(world_population_df)

world_population_df.to_csv('data/processed/population/population-global.csv', sep=';', index=False, header=True)

raw data for world population
     geo   name  time   Population
0  world  world  1800  985083734.9
1  world  world  1801  988518009.0
2  world  world  1802  991993182.0
extract population data from 1900 until 2021
     year    population
100  1900  1.627124e+09
101  1901  1.639684e+09
102  1902  1.652898e+09
103  1903  1.666761e+09
104  1904  1.680800e+09
..    ...           ...
218  2018  7.683790e+09
219  2019  7.764951e+09
220  2020  7.840953e+09
221  2021  7.909295e+09
222  2022  7.975105e+09

[123 rows x 2 columns]
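
Selecting columns or rows from a DataFrame and then modifying the result in place can trigger pandas' `SettingWithCopyWarning`. The sketch below shows one way to avoid it: take an explicit `.copy()` of the selection and use the non-inplace form of `rename`. The toy frame and its population numbers are placeholders, not real data.

```python
import pandas as pd

# Toy stand-in for the raw world sheet; the population numbers are placeholders.
toy = pd.DataFrame({
    'geo': ['world'] * 4,
    'name': ['world'] * 4,
    'time': [1899, 1900, 2021, 2022],
    'Population': [1.0, 2.0, 3.0, 4.0],
})

# Select rows and columns in one step and copy, so the result owns its data.
subset = toy.loc[toy['time'].between(1900, 2021), ['time', 'Population']].copy()

# rename returns a new frame; the raw frame is left unchanged.
subset = subset.rename(columns={'time': 'year', 'Population': 'population'})

print(subset)
```

`Series.between` includes both endpoints by default, so 1900 and 2021 are kept while 1899 and 2022 are dropped.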


# Prepare dataset for countries and regions

In [5]:
countries_population_df = population_dict.get('data-for-countries-etc-by-year')
un_country_codes = pd.read_csv("data/raw/country-codes/un-country-codes.csv", sep=";")
un_country_codes = un_country_codes[['Region Code', 'Region Name', 'ISO-alpha3 Code']]

# extract population data from 1900 until 2021; copy so the assignment below does not act on a slice
countries_population_df = countries_population_df[countries_population_df['time'].between(1900, 2021)].copy()

# format column that will be joined
countries_population_df['geo'] = countries_population_df['geo'].str.upper()

# merge country-population and un-country-codes
countries_with_regions = pd.merge(countries_population_df, un_country_codes, how='inner', left_on='geo', right_on='ISO-alpha3 Code')

countries_with_regions.rename(columns={'Region Code': 'region_code', 'Region Name': 'region_name', 'time': 'year', 'Population': 'population', 'geo': 'country_code', 'name': 'country_name'}, inplace=True)
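
An inner merge silently drops every country whose code has no match in the UN country codes, which would lower the regional totals without any error. The following sketch shows one possible check: a left merge with `indicator=True` that lists the unmatched codes. The toy frames are hypothetical, `'ZZZ'` is a made-up code, and the population numbers are placeholders.

```python
import pandas as pd

# Toy stand-ins for countries_population_df and un_country_codes.
toy_population = pd.DataFrame({
    'geo': ['DEU', 'FRA', 'ZZZ'],   # 'ZZZ' is a made-up code with no match below
    'time': [2021, 2021, 2021],
    'Population': [1.0, 2.0, 3.0],  # placeholder numbers
})
toy_codes = pd.DataFrame({
    'Region Code': [150, 150],
    'Region Name': ['Europe', 'Europe'],
    'ISO-alpha3 Code': ['DEU', 'FRA'],
})

# A left merge keeps all population rows; the _merge column records where each row matched.
checked = pd.merge(toy_population, toy_codes, how='left',
                   left_on='geo', right_on='ISO-alpha3 Code', indicator=True)

unmatched = sorted(checked.loc[checked['_merge'] == 'left_only', 'geo'].unique())
print(unmatched)
```

An empty list would mean that the inner merge loses no countries.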

## Get population regions dataset

In [6]:
columns_regions = ['region_code', 'region_name', 'country_code', 'country_name', 'year', 'population']

# select columns
population_regions = countries_with_regions[columns_regions]
population_regions = population_regions.groupby(['region_code', 'region_name', 'year'], as_index=False)['population'].sum()

# store in CSV file
population_regions.to_csv('data/processed/population/population-region.csv', sep=';', index=False, header=True)
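
A simple consistency check for the aggregation is that, for every year, the regional totals add up to the same number as the country totals. The sketch below illustrates this on a toy frame with placeholder populations. Note that `groupby` drops rows whose key is missing by default, so countries without a region code would not appear in any regional total.

```python
import pandas as pd

# Toy stand-in for countries_with_regions; populations are placeholders.
toy = pd.DataFrame({
    'region_code': [150, 150, 2],
    'region_name': ['Europe', 'Europe', 'Africa'],
    'country_code': ['DEU', 'FRA', 'NGA'],
    'country_name': ['Germany', 'France', 'Nigeria'],
    'year': [2021, 2021, 2021],
    'population': [10.0, 20.0, 30.0],
})

# Same aggregation as in the notebook.
regions = toy.groupby(['region_code', 'region_name', 'year'], as_index=False)['population'].sum()

# Totals per year, computed once from the countries and once from the regions.
country_totals = toy.groupby('year')['population'].sum()
region_totals = regions.groupby('year')['population'].sum()

print(regions)
print(country_totals.equals(region_totals))
```

With real floating-point data, a tolerance-based comparison such as `numpy.isclose` may be more appropriate than exact equality.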

## Get population countries dataset

In [7]:
columns_country = ['country_code', 'country_name', 'year', 'population']

# select columns
population_countries = countries_with_regions[columns_country]

# store in CSV file
population_countries.to_csv('data/processed/population/population-country.csv', sep=';', index=False, header=True)
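
The files are written with `;` as the separator, so they have to be read back with the same `sep`. The sketch below illustrates a round trip and a column check on a toy frame with placeholder populations. It writes to an in-memory buffer rather than to the real output path.

```python
import io

import pandas as pd

# Toy stand-in for population_countries; populations are placeholders.
toy_countries = pd.DataFrame({
    'country_code': ['DEU', 'FRA'],
    'country_name': ['Germany', 'France'],
    'year': [2021, 2021],
    'population': [10.0, 20.0],
})

# Write with the same options as the notebook, but into memory.
buffer = io.StringIO()
toy_countries.to_csv(buffer, sep=';', index=False, header=True)

# Read back with the matching separator.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=';')

print(list(reloaded.columns))
```

Reading the same text with the default `sep=','` would produce a single column whose name contains the whole header line.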

## Compare calculated region population to gapminder dataset

In [15]:
regions_population_gapminder = population_dict.get('data-for-regions-by-year')
regions_population_2021_gapminder = regions_population_gapminder[regions_population_gapminder['time'] == 2021]
regions_population_2021_calc = population_regions[population_regions['year'] == 2021]
print('Gapminder region population')
print(regions_population_2021_gapminder[['name', 'Population']])

print('\n')
print('Calculated region population')
print(regions_population_2021_calc[['region_name', 'population']])

Gapminder region population
          name  Population
221     africa  1391823318
522       asia  4634610444
823     europe   846050284
1124  americas  1026253579


Calculated region population
    region_name    population
121      Africa  1.391823e+09
244     Oceania  4.360243e+07
367    Americas  1.026254e+09
490        Asia  4.670030e+09
613      Europe  7.431687e+08
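
The two printouts are easier to compare side by side. The sketch below rebuilds both tables from the values printed above (the calculated values are therefore rounded to seven significant digits), joins them on the lower-cased region name and computes the relative difference. In this output, Africa and the Americas agree, Asia and Europe differ, and Gapminder has no separate Oceania row, presumably because the two region definitions assign some countries differently.

```python
import pandas as pd

# Values copied from the printed Gapminder output for 2021.
gapminder_2021 = pd.DataFrame({
    'name': ['africa', 'asia', 'europe', 'americas'],
    'Population': [1391823318, 4634610444, 846050284, 1026253579],
})

# Values copied from the printed calculated output for 2021 (rounded by pandas' display).
calculated_2021 = pd.DataFrame({
    'region_name': ['Africa', 'Oceania', 'Americas', 'Asia', 'Europe'],
    'population': [1.391823e+09, 4.360243e+07, 1.026254e+09, 4.670030e+09, 7.431687e+08],
})

# Align the names, then keep regions that appear in either table.
calculated_2021['name'] = calculated_2021['region_name'].str.lower()
comparison = pd.merge(gapminder_2021, calculated_2021, how='outer', on='name')

# Relative difference of the calculated value with respect to the Gapminder value.
comparison['relative_difference'] = (
    (comparison['population'] - comparison['Population']) / comparison['Population']
)

print(comparison[['name', 'Population', 'population', 'relative_difference']])
```

In the notebook, the same join could be applied directly to `regions_population_2021_gapminder` and `regions_population_2021_calc` to avoid the rounding.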
