# 0_Challenges analysis

**Objective:** Accurately assign demographic data to individual households in Germany.


Here we brainstorm the possible challenges for the mini-project, starting with the data sources we can think of:
1. List of postal codes (PLZ) in Germany.
1. List of municipal codes (regional keys); more details can be found [here](https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Glossar/regionalschluessel.html).
1. Statistics from the Regionaldatenbank (demographic data), whose most granular level is the municipality. The population distribution by age and gender is available [here](https://www.regionalstatistik.de/genesis//online/data?operation=table&code=12411-02-03-5-B&levelindex=1&levelid=1604855547818).
1. Building objects from OpenStreetMap, which can be queried at postal-code or municipal level.

## Challenges

### Classify building objects

### Assign demographics data

There are 3 possible cases when matching postal-code data with regional-key data:
1. 1 PLZ = 1 regional key. This is the easiest case, since we can divide the municipal-level demographic data equally among all residential buildings in the postal code.
1. 1 regional key contains multiple PLZs. For example, Aalen, Stadt in Baden-Württemberg has the regional key 081365001088 but contains 5 PLZs, from 73430 to 73434.
1. 1 PLZ spans multiple regional keys. For example, PLZ 37339 covers 7 municipalities (only the last 3 digits change): 160615001003/015/026/031/094/103/114.
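
The three cases above can be told apart by counting distinct partners on each side of the mapping. The following is a minimal sketch on made-up toy data; the column names `plz` and `rs`, the `classify` helper, and the extra "mixed" label (for rows that fall into cases 2 and 3 at once) are assumptions for illustration, not part of the project code.

```python
import pandas as pd

# Toy PLZ-to-regional-key mapping (values are invented for illustration).
mapping = pd.DataFrame({
    "plz": ["10000", "20001", "20002", "30000", "30000"],
    "rs":  ["A",     "B",     "B",     "C",     "D"],
})

# For every row: how many distinct keys its PLZ touches, and vice versa.
rs_per_plz = mapping.groupby("plz")["rs"].transform("nunique")
plz_per_rs = mapping.groupby("rs")["plz"].transform("nunique")


def classify(n_rs: int, n_plz: int) -> str:
    """Hypothetical helper that labels one mapping row with its case."""
    if n_rs > 1 and n_plz > 1:
        return "mixed"
    if n_rs > 1:
        return "1 PLZ : many RS"
    if n_plz > 1:
        return "1 RS : many PLZ"
    return "1 PLZ : 1 RS"


mapping["case"] = [classify(r, p) for r, p in zip(rs_per_plz, plz_per_rs)]
print(mapping)
```

Labelling each row, rather than each PLZ or each key, keeps the result aligned with the mapping table, so it can be filtered directly per case.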

## Initialization

In [None]:
import pandas as pd
import numpy as np
import sys
import os
import importlib
import logging 

from datetime import datetime
timestamp = datetime.now().strftime("%d%m%y_%H%M")

### Load custom modules

In [None]:
pkg_path = '../src/'

sys.path.append(pkg_path)
import data_acquisition as da

In [None]:
# Reload the module (in case it has been updated)
importlib.reload(da)

## Inputs / Outputs

In [None]:
# Paths
pop_gmd_path = '../data/01_raw/pop_gem_de.csv'
plz_gmd_path = '../data/01_raw/zuordnung_plz_ort_landkreis.csv'

In [None]:
# Postal code list with population data (2011)
pop_plz_csv = '../data/01_raw/plz_einwohner.csv'
pop_plz_de = pd.read_csv(pop_plz_csv, dtype={'plz': str, 'einwohner': int})

In [None]:
pop_plz_de.shape

In [None]:
pop_plz_de.head()

PLZ (postal code) - Gemeinde (municipality) - Landkreis (district) mapping

In [None]:
plz_gmd_lk = pd.read_csv(plz_gmd_path)
plz_gmd_lk.rename(columns = {'ags':'Regionalschluessel'}, inplace = True)

In [None]:
plz_gmd_lk.shape

In [None]:
print(f'Total number of unique PLZ: {len(plz_gmd_lk.plz.drop_duplicates())}')
print(f'Total number of unique Gemeinden (municipalities): {len(plz_gmd_lk.Regionalschluessel.drop_duplicates())}')

In [None]:
# 1 RS = multiple PLZ: rows whose Regionalschluessel (RS) already appeared earlier in the table
plz_gmd_lk[plz_gmd_lk.duplicated(['Regionalschluessel'])]

In [None]:
# 1 PLZ = multiple RS: rows whose PLZ already appeared earlier in the table
plz_gmd_lk[plz_gmd_lk.duplicated(['plz'])]
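
Note that `duplicated()` by default flags only the second and later occurrences, so the first row of each group is missing from the two outputs above. A small sketch on invented toy data shows how `keep=False` returns every row that belongs to a repeated group; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy mapping: PLZ 37339 appears twice, the other PLZs once.
toy = pd.DataFrame({
    "plz": ["73430", "73431", "37339", "37339"],
    "Regionalschluessel": [1, 1, 2, 3],
})

# Default (keep='first'): only the repeat is flagged, not the first occurrence.
default_dupes = toy[toy.duplicated(["plz"])]

# keep=False: every row whose PLZ occurs more than once is flagged.
all_dupes = toy[toy.duplicated(["plz"], keep=False)]

print(default_dupes)
print(all_dupes)
```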

Population distribution per municipality

In [None]:
# Population distribution municipal-level
pop_gmd_df = pd.read_csv(pop_gmd_path, sep=';',
                         encoding='iso-8859-1',
                         header=[0, 1]
                        )

# Merge first 2 rows as column names
pop_gmd_df.columns = pop_gmd_df.columns.map('_'.join)

# Rename first 2 columns
pop_gmd_df.rename(columns={pop_gmd_df.columns[0]:'Regionalschluessel',
                           pop_gmd_df.columns[1]:'Gemeinden'},
                  inplace=True)

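# Keep only the total population and the totals per gender
# (.copy() so later column assignments do not act on a slice of the original frame)
pop_gmd_df = pop_gmd_df[['Regionalschluessel',
                         'Gemeinden',
                         'Insgesamt_Insgesamt',
                         'männlich_Insgesamt',
                         'weiblich_Insgesamt']].copy()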
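
To illustrate what the header merge in the cell above does, here is a minimal sketch that builds a two-level column index by hand and flattens it with the same `'_'.join` mapping. The toy values are invented; only the column labels mirror those selected above.

```python
import pandas as pd

# Two header rows, as produced by read_csv(..., header=[0, 1]).
cols = pd.MultiIndex.from_tuples([
    ("Insgesamt", "Insgesamt"),
    ("männlich", "Insgesamt"),
    ("weiblich", "Insgesamt"),
])
demo = pd.DataFrame([[100, 48, 52]], columns=cols)

# Each column label is a tuple; joining it with '_' yields one flat string.
demo.columns = demo.columns.map("_".join)
print(list(demo.columns))
```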

In [None]:
pop_gmd_df.shape

In [None]:
pop_gmd_df.head()

In [None]:
pop_gmd_df[pop_gmd_df.Regionalschluessel==7232003]

In [None]:
print(f'Total number of unique Gemeinden (municipalities): {len(pop_gmd_df.Regionalschluessel.drop_duplicates())}')

## EDA

In [None]:
# Difference between the two datasets in terms of Gemeinden
a = set(pop_gmd_df.Regionalschluessel.drop_duplicates())

In [None]:
b = set(plz_gmd_lk.Regionalschluessel.drop_duplicates())

In [None]:
# Symmetric difference: keys that appear in only one of the two datasets
c = (a ^ b)
len(c)
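
The symmetric difference does not say which dataset a key is missing from. A short sketch with invented toy keys shows how the one-sided differences separate the two directions; the set names `only_a` and `only_b` are illustrative assumptions.

```python
# Toy key sets standing in for the two datasets (values are invented).
a = {1001, 1002, 1003}
b = {1002, 1003, 1004}

sym = a ^ b      # in exactly one of the two sets
only_a = a - b   # present in a, missing from b
only_b = b - a   # present in b, missing from a

print(sym, only_a, only_b)
```

The two one-sided sets together always make up the symmetric difference, so checking them separately loses no information.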

In [None]:
plz_gmd_lk[plz_gmd_lk.Regionalschluessel.isin(c)]

In [None]:
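# Try to cast the totals to int; with errors='ignore' the column is returned
# unchanged if any value (e.g. a '.' placeholder) cannot be converted
pop_gmd_df.Insgesamt_Insgesamt = pop_gmd_df.Insgesamt_Insgesamt.astype(int, errors='ignore')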

In [None]:
# Unmatched municipalities that still carry a real population value
pop_gmd_df[pop_gmd_df.Regionalschluessel.isin(c) &
           ~pop_gmd_df.Insgesamt_Insgesamt.isin(['-0', '.'])].reset_index(drop=True)
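
As a possible alternative to filtering on placeholder strings, the totals could be converted to numbers once, with placeholders turned into missing values. This is a minimal sketch on invented toy data; it assumes that `'-0'` and `'.'` are the only placeholders, as in the filter above. They are masked explicitly because `'-0'` would otherwise parse as a valid zero.

```python
import pandas as pd

# Toy totals column mixing real counts and placeholder strings.
raw = pd.Series(["1200", "-0", ".", "35"])
placeholders = ["-0", "."]

# Mask placeholders first, then convert whatever remains to numbers.
cleaned = pd.to_numeric(raw.where(~raw.isin(placeholders)), errors="coerce")
has_value = cleaned.notna()

print(cleaned)
```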