In [1]:
import os

import pandas as pd

In [2]:
INPUT_FOLDER = "preproc"
OUTPUT_FOLDER = "enrich"

In [3]:
os.makedirs(f"../data/{OUTPUT_FOLDER}", exist_ok=True)

In [4]:
datasources = {source.replace(".zip",""):source for source in os.listdir(f"../data/{INPUT_FOLDER}") if source.endswith(".zip")}
datasources

{'demographics': 'demographics.zip',
 'epidemiology': 'epidemiology.zip',
 'health': 'health.zip',
 'hospitalizations': 'hospitalizations.zip',
 'index': 'index.zip',
 'vaccinations': 'vaccinations.zip'}

## ENRICH

This is the most optional stage of the whole process. Its goal is to enrich the tables produced by the previous `preproc` stage with variables from other tables that are needed for the aggregation in the next stage.

In this case, we're going to do two things:
 - **Append the `country_name` column to all tables** - We need to build one predictor per country, so every table needs this column for the by-country aggregations in the next step.
 - **Impute missing values in `demographics` using the newly added `country_name` column** - The table has many missing values, so we use the country to apply a smarter imputation strategy.

### Join: Include `country_name` from `index` table in the rest of tables

We skip `index` because it is the table that provides `country_name`, and `demographics` because it gets special treatment below.
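Before running the join on the real tables, it can help to check how many rows would fail to match. The following is a minimal sketch on invented toy data (the `new_confirmed` values and the `XX` key are assumptions for illustration only). It uses a left join with `indicator=True` to list unmatched keys, and `validate="many_to_one"` to confirm that each `location_key` appears only once in `index`.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
index = pd.DataFrame({
    "location_key": ["ES", "ES_MD", "FR"],
    "country_name": ["Spain", "Spain", "France"],
})
epidemiology = pd.DataFrame({
    "location_key": ["ES", "ES_MD", "FR", "XX"],  # "XX" is absent from index
    "new_confirmed": [10, 4, 7, 1],
})

# how="left" keeps every row of the data table, indicator=True adds a `_merge`
# column, and validate="many_to_one" raises if a location_key repeats in index.
merged = epidemiology.merge(
    index, on="location_key", how="left", indicator=True, validate="many_to_one"
)

# Keys that a default inner join would silently drop.
unmatched = merged.loc[merged["_merge"] == "left_only", "location_key"].tolist()
print(unmatched)
```

If the list is empty, the default inner join used in this notebook loses no rows.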

In [5]:
index = pd.read_csv(f"../data/{INPUT_FOLDER}/index.zip")

In [6]:
for key in datasources:
    data = pd.read_csv(f"../data/{INPUT_FOLDER}/{key}.zip")
    if key not in ["index", "demographics"]:
        # Default inner join: rows whose location_key is not in index are dropped
        data = data.merge(index, on="location_key")
        print(f"Table {key} processed!")

    # Every table, merged or not, is written to the output folder
    data.to_csv(f"../data/{OUTPUT_FOLDER}/{key}.zip", index=False)

Table epidemiology processed!
Table health processed!
Table hospitalizations processed!
Table vaccinations processed!


### Missing values: `demographics`

The age-range columns of `demographics` have many missing values, so we use the newly added `country_name` column to impute them from country-level proportions.

The imputation strategy consists of three steps:
 1. For all regions without missing values, *calculate the proportion of the population in each age range, per country*.
 2. For all regions with missing values, *extract the total population and append the proportions from the previous step*.
 3. Impute the missing value for each age range by multiplying the population of each region by the proportion from step 1.
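The three steps can be traced on a tiny, invented table before applying them to the real data. This is a sketch under stated assumptions: the `toy` frame, its values, and the use of `join` instead of `merge` are illustrative choices, not the notebook's exact code. Region `A3` has no age breakdown and receives its country's proportions.

```python
import numpy as np
import pandas as pd

# Invented data: one country "A" with two complete regions and one incomplete.
toy = pd.DataFrame({
    "location_key": ["A1", "A2", "A3"],
    "country_name": ["A", "A", "A"],
    "population": [100.0, 300.0, 200.0],
    "population_age_00_09": [25.0, 75.0, np.nan],
    "population_age_10_19": [50.0, 150.0, np.nan],
}).set_index("location_key")

age_cols = list(toy.filter(regex=r"population_age").columns)

# Step 1: per-country proportion of each age range, from complete regions only.
complete = toy.dropna(subset=age_cols)
totals = complete.groupby("country_name")[["population"] + age_cols].sum()
proportion = totals[age_cols].div(totals["population"], axis=0)

# Step 2: attach the country proportions to the incomplete regions.
incomplete = toy[toy[age_cols].isna().any(axis=1)]
joined = incomplete[["population", "country_name"]].join(proportion, on="country_name")

# Step 3: estimated head count = regional population * country proportion.
estimates = joined[age_cols].mul(joined["population"], axis=0)
imputed = toy.fillna(estimates)
print(imputed.loc["A3", age_cols])
```

Here the country-level proportions are 0.25 and 0.5, so `A3` (population 200) is imputed with 50 and 100.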

In [7]:
data = pd.read_csv(f"../data/{INPUT_FOLDER}/demographics.zip")
data = data.merge(index, on="location_key")
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5097 entries, 0 to 5096
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   location_key                 5097 non-null   object 
 1   population                   5097 non-null   float64
 2   population_age_00_09         3743 non-null   float64
 3   population_age_10_19         3743 non-null   float64
 4   population_age_20_29         3743 non-null   float64
 5   population_age_30_39         3743 non-null   float64
 6   population_age_40_49         3743 non-null   float64
 7   population_age_50_59         3743 non-null   float64
 8   population_age_60_69         3743 non-null   float64
 9   population_age_70_79         3743 non-null   float64
 10  population_age_80_and_older  3743 non-null   float64
 11  country_name                 5097 non-null   object 
dtypes: float64(10), object(2)
memory usage: 517.7+ KB


**1. For complete records: calculate the proportion of the population in each age range, per country**

In [8]:
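# Keep only regions with an age breakdown and sum the numeric columns per country
whole_population = data[data.population_age_00_09.notna()].groupby("country_name").sum(numeric_only=True)
nonmissing_population = whole_population["population"]
nonmissing_population_age = whole_population.filter(regex=r"population_age")
# Share of each age range in the country's total population
proportion = nonmissing_population_age.div(nonmissing_population, axis=0)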

In [None]:
proportion

**2. For records with missing values: take the population and append the per-country proportion of each age range**

In [None]:
missings = data[data.population_age_00_09.isna()]
missings = missings[["location_key", "population", "country_name"]]

In [None]:
missings.head()

In [None]:
missings = missings.merge(proportion.reset_index(), on="country_name")
missings = missings.set_index(["location_key"])

In [None]:
missings.head()

**3. Calculate the estimated population per region from the proportions**

In [None]:
missings_population = missings.population
missings_population_ages = missings.filter(regex=r"population_age")
result = missings_population_ages.mul(missings_population, axis=0)

In [None]:
result.head()

Impute the missing values in the original dataset from the result calculated above. `fillna` aligns on the index, so both tables must be indexed by `location_key`.

In [None]:
data = data.set_index("location_key")

In [None]:
data = data.fillna(result)
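Passing a DataFrame to `fillna` matches values by index and column labels, which is why `location_key` is set as the index first. A small sketch with invented values (the 999.0 is a deliberately wrong number, there only to show that existing values are not overwritten):

```python
import numpy as np
import pandas as pd

# Invented frames; note that the row order differs between the two.
data = pd.DataFrame(
    {"population": [200.0, 100.0], "population_age_00_09": [np.nan, 25.0]},
    index=pd.Index(["A3", "A1"], name="location_key"),
)
result = pd.DataFrame(
    {"population_age_00_09": [999.0, 50.0]},
    index=pd.Index(["A1", "A3"], name="location_key"),
)

# Only NaN cells are filled, each from the cell with the same row and column label.
filled = data.fillna(result)
print(filled)
```

Row `A3` receives 50.0, while row `A1` keeps its original 25.0.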

In [None]:
data.info()
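Some rows may still be missing after this step: a country with no complete region has no proportions, so the merge in step 2 leaves its regions out. The sketch below, on invented data where country `B` is such a case, counts the rows that remain incomplete per country.

```python
import numpy as np
import pandas as pd

# Invented post-imputation table: country "B" has no complete region.
data = pd.DataFrame(
    {
        "country_name": ["A", "A", "B"],
        "population": [100.0, 200.0, 50.0],
        "population_age_00_09": [25.0, 50.0, np.nan],
    },
    index=pd.Index(["A1", "A3", "B1"], name="location_key"),
)

age_cols = data.filter(regex=r"population_age").columns

# Number of rows per country that still lack at least one age-range value.
still_missing = data[age_cols].isna().any(axis=1).groupby(data["country_name"]).sum()
print(still_missing[still_missing > 0])
```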

Save the resulting table

In [None]:
data.reset_index().to_csv(f"../data/{OUTPUT_FOLDER}/demographics.zip", index=False)