# Preprocessing the INKAR dataset
#### This Jupyter notebook preprocesses the dataset given by 'inkar_2021.csv'. In addition, we include the regional keys (Regionalschlüssel, RS) of municipal associations provided by 'ma_keys_inkar.csv'.

Data files required for the execution of this Jupyter notebook:
- inkar_2021.csv (raw data)
- ma_keys_inkar.csv (intermediate data)

Python scripts required for import:
- utils.utils.py: parameter values
- preprocessing.photovoltaics.preprocess_inkar_functions_updated.py: preprocessing functions that were moved out of this notebook

The following levels of spatial entities are relevant:
- counties (c; Landkreise und kreisfreie Städte)
- municipal associations (ma; Gemeindeverbände)

The original INKAR dataset provides data at the c and ma levels, among many others. One data point corresponds to one row of the dataset, i.e., one row gives the value of one indicator for one spatial entity (at a specific regional level) in one year.

The goal is to derive a preprocessed dataset that contains the values of the relevant indicators at the level of municipal associations. We consider indicators that the INKAR dataset provides at the ma or c level. All indicators originally given at the c level are relative or per-capita values and can therefore be assigned unchanged to every ma in the corresponding county.

Not every indicator has a value for every spatial entity. Substituting these missing values (NaNs) is a further preprocessing step.
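
To make the row-per-observation layout concrete, here is a minimal sketch of reshaping such data into one row per spatial entity and one column per indicator. The column names and values are made up for illustration; the notebook itself delegates this step to helper functions whose implementation is not shown here.

```python
import pandas as pd

# Toy data in the long layout: one row per (entity, indicator, year).
# Column names and values are hypothetical.
long_df = pd.DataFrame(
    {
        "entity_id": [1, 1, 2, 2, 1],
        "indicator": ["A", "B", "A", "B", "A"],
        "year": ["2019", "2019", "2019", "2019", "2018"],
        "value": [10.0, 0.5, 20.0, 0.7, 9.0],
    }
)

# Keep a single reference year and spread the indicators into columns.
wide_df = (
    long_df[long_df["year"] == "2019"]
    .pivot(index="entity_id", columns="indicator", values="value")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "indicator" axis label
```

The result has one row per entity, which is the compact format the later steps work with.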

#### Preprocessing steps
- Step 1: Transformation of the raw dataset.
  - 1a. Import INKAR-dataset and obtain relevant rows and columns.
  - 1b. Divide the dataset into two subsets according to the spatial level considered (c and ma) and fix and add columns.
  - 1c. Transform the ma dataset.
  - 1d. Transform the c dataset.
- Step 2: Detect and fix missing values.
  - 2a. Detect and fix NaNs in the ma-dataframe.
  - 2b. Detect and fix NaNs in the c-dataframe.
- Step 3: Merge both datasets.
  - Integrate the data on c-level into the dataset on ma-level to derive the final (compact) INKAR dataframe which serves as a basis for the analysis.

In [4]:
import numpy as np
import pandas as pd
import os

os.chdir("../../..")
from xai_green_tech_adoption.utils.utils import *
from xai_green_tech_adoption.preprocessing.photovoltaics.preprocess_inkar_functions import *

pd.options.mode.chained_assignment = None

In [2]:
# Set parameters given by config_preprocessing

read_col_inkar = [
    col_spatial_inkar,
    col_time_inkar,
    col_value_inkar,
    col_id_general_inkar,
    col_name_inkar,
    col_ind_inkar,
]

# all relevant indicators on ma-level (from 2019 and other time periods)
ind_all_ma = add_ind_ma_2019 + list(add_ind_ma_dict.keys())
# all relevant indicators on c-level
ind_all_c = add_ind_c_2019 + list(add_ind_c_dict.keys())

### Step 1: Transformation of the raw dataset.
#### Step 1a Import INKAR-dataset and obtain relevant rows and columns.

In [3]:
# read INKAR data set
df_inkar_compl = pd.read_csv(
    "data/raw_data/descriptive_features/inkar_2021/inkar_2021.csv",
    sep=";",
    decimal=",",
    usecols=read_col_inkar,
    dtype={col_time_inkar: "str", col_value_inkar: "float"},
)

In [4]:
# only include data of spatial categories on relevant level
df_inkar_m_ma_c = df_inkar_compl[
    df_inkar_compl[col_spatial_inkar].isin(relevant_statial_entities)
]
# some indicators are associated with multiple topics (Bereiche) and, hence, appear multiple times in the dataframe
df_inkar_m_ma_c.drop_duplicates(inplace=True)

In [5]:
# Check dataset for NaN and inf values (isnull() is an alias of isna(), so one check suffices)
if (df_inkar_m_ma_c.isna().sum().sum() == 0) and (
    df_inkar_m_ma_c.isin([np.inf, -np.inf]).sum().sum() == 0
):
    print(
        "There are neither missing nor infinity values in the relevant parts of the raw INKAR dataset."
    )

There are neither missing nor infinity values in the relevant parts of the raw INKAR dataset.
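
The same check can be wrapped in a small reusable function. The sketch below is one possible formulation, not part of the project code; the name `has_missing_or_inf` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def has_missing_or_inf(df: pd.DataFrame) -> bool:
    """Return True if df holds any NaN/None or, in numeric columns, any +/-inf."""
    has_nan = bool(df.isna().to_numpy().any())
    numeric = df.select_dtypes(include="number")
    has_inf = bool(np.isinf(numeric.to_numpy(dtype=float)).any())
    return has_nan or has_inf


# Toy frames for illustration only.
clean = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})
with_inf = pd.DataFrame({"name": ["a", "b"], "value": [1.0, np.inf]})
with_nan = pd.DataFrame({"name": ["a", None], "value": [1.0, 2.0]})
```

Restricting the infinity test to numeric columns avoids comparing strings against `np.inf`.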


#### 1b. Divide the dataset into two subsets according to the spatial level considered (c and ma) and fix and add columns.

In [6]:
# fix id and time column
df_inkar_m_ma_c.loc[:, col_id_general_inkar] = df_inkar_m_ma_c[
    col_id_general_inkar
].astype(int)
# some periods of time are given as 'year1 bis year2', 'year1 - year2' or 'year1/year2/year3/...'
# derive a single value for each year
df_inkar_m_ma_c = fix_time_periods(df_inkar_m_ma_c, t_column=col_time_inkar)
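
`fix_time_periods` is defined in the imported helper module and is not shown in this notebook. As a rough sketch of what normalizing such labels could look like, the hypothetical helper `last_year_of_period` maps each label to the last year it mentions. This convention is an assumption made for illustration; the actual function may treat periods differently.

```python
import re


def last_year_of_period(label: str) -> int:
    """Return the last four-digit year found in a time label (hypothetical helper)."""
    years = re.findall(r"\d{4}", label)
    if not years:
        raise ValueError(f"no year found in {label!r}")
    return int(years[-1])


# The three period notations mentioned in the comment above, plus a plain year.
examples = ["2019", "2016 bis 2018", "2014 - 2017", "2011/2015/2019"]
parsed = [last_year_of_period(label) for label in examples]
```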

In [7]:
# 11007 municipalities, 4618 municipality associations, 401 counties

df_inkar_ma = df_inkar_m_ma_c[df_inkar_m_ma_c[col_spatial_inkar] == "Gemeindeverbände"]
df_inkar_c = df_inkar_m_ma_c[df_inkar_m_ma_c[col_spatial_inkar] == "Kreise"]

n_ma = df_inkar_ma[col_id_general_inkar].nunique()
n_c = df_inkar_c[col_id_general_inkar].nunique()
print(
    f'There are {df_inkar_m_ma_c.loc[(df_inkar_m_ma_c[col_spatial_inkar] == "Gemeinden"),col_id_general_inkar].nunique()} municipalities, {n_ma} municipality associations and {n_c} counties the dataset is based on.'
)

for df in [df_inkar_ma, df_inkar_c]:
    df.drop([col_spatial_inkar], axis=1, inplace=True)

There are 11007 municipalities, 4618 municipality associations and 401 counties the dataset is based on.


In [8]:
# ma dataset

# only use relevant indicators
df_inkar_ma = df_inkar_ma[df_inkar_ma[col_ind_inkar].isin(ind_all_ma)]

# add regional identifiers (Regionalschlüssel; RS) of ma's
df_ma_id = pd.read_csv(
    "data/intermediate_data/ma_keys_inkar.csv",
    sep=";",
    usecols=["Gemeindeverbände Kennziffer", "Gemeindeverbände Regionalschlüssel"],
)
df_ma_id.rename(
    columns={
        "Gemeindeverbände Kennziffer": col_id_general_inkar,
        "Gemeindeverbände Regionalschlüssel": col_id_ma,
    },
    inplace=True,
)
df_inkar_ma = df_inkar_ma.merge(df_ma_id, how="left", on=col_id_general_inkar)

# add column with id of corresponding c
df_inkar_ma[col_id_c] = df_inkar_ma[col_id_ma] // 10000

# add column with corresponding state of ma
df_inkar_ma[col_id_s] = df_inkar_ma[col_id_ma] // 10000000
df_inkar_ma.drop([col_id_general_inkar], axis=1, inplace=True)
df_inkar_ma.rename({col_name_inkar: col_name_ma}, axis=1, inplace=True)

In [9]:
# only include relevant indicators:
# ind_all_c: relevant indicators only available on county-level
# ind_all_ma: relevant indicators also available on ma-level. Values on c-level are used to substitute NaNs
df_inkar_c = df_inkar_c[df_inkar_c[col_ind_inkar].isin(ind_all_c + ind_all_ma)]

df_inkar_c.loc[:, col_id_s] = df_inkar_c[col_id_general_inkar] // 1000

df_inkar_c.rename(
    {col_id_general_inkar: col_id_c, col_name_inkar: col_name_c}, axis=1, inplace=True
)
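
The county and state identifiers above are obtained by integer division of the numeric keys. A small worked example, using keys that appear in the outputs further down in this notebook (the variable names are illustrative only):

```python
import pandas as pd

# RS keys of three municipal associations (Lohheide, Zandt, Föhr-Amrum).
ma_keys = pd.Series([33519501, 93720177, 10545488], name="rs_ma")

# Dropping the last four digits leaves the county code,
# dropping the last seven digits leaves the state code.
county_of_ma = ma_keys // 10_000
state_of_ma = ma_keys // 10_000_000

# County codes (Rostock, Dessau-Roßlau): dropping the last three digits
# leaves the state code.
county_keys = pd.Series([13003, 15001], name="county")
state_of_county = county_keys // 1_000
```

Because the keys are stored as integers, leading zeros are lost: the county key written as 03351 appears as 3351.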

#### 1c. Transform the ma dataset from the original INKAR-structure into the compact format.

In [10]:
# transform ma dataframe and derive a dataframe with all relevant indicators available for 2019
df_ma = transform_inkar_df(
    df_inkar_ma,
    id_col=col_id_ma,
    ind_col=col_ind_inkar,
    ind_list=add_ind_ma_2019,
    value_col=col_value_inkar,
    t_col=col_time_inkar,
    t_period=t_inkar,
)

# add all remaining indicators that are not available for 2019 but only for other periods of time
df_ma = add_ind(
    df_ma,
    df_inkar_ma,
    ind_dict=add_ind_ma_dict,
    id_col=col_id_ma,
    ind_col=col_ind_inkar,
    value_col=col_value_inkar,
    t_col=col_time_inkar,
)

for ind in ind_all_ma:
    if ind not in df_ma.columns:
        print(f"Attention: {ind} is not included in the transformed dataframe.")

#### 1d. Transform the c dataset from the original INKAR-structure into the compact format.

In [11]:
df_c = transform_inkar_df(
    df_inkar_c,
    id_col=col_id_c,
    ind_col=col_ind_inkar,
    ind_list=add_ind_c_2019,
    value_col=col_value_inkar,
    t_col=col_time_inkar,
    t_period=t_inkar,
)
df_c = add_ind(
    df=df_c,
    df_inkar=df_inkar_c,
    ind_dict=add_ind_c_dict,
    id_col=col_id_c,
    ind_col=col_ind_inkar,
    value_col=col_value_inkar,
    t_col=col_time_inkar,
)

for ind in ind_all_c:
    if ind not in df_c.columns:
        print(f"Attention: {ind} is not included in the transformed dataframe.")


### Step 2: Detect and fix missing values.
#### 2a. Detect and fix NaNs in the ma-dataframe.

In [12]:
# detect missing values

print(
    f'There are {df_ma[df_ma["Bevölkerung gesamt"] == 0].shape[0]} municipality associations with a zero population. I exclude those from the dataset and the analysis.\n'
)

df_ma_nonzero_pop = df_ma[df_ma["Bevölkerung gesamt"] != 0]

print("++++++++++ Municipality associations with zero population included ++++++++++\n")
_ = detect_missings(
    df_ma,
    inds=ind_all_ma,
    cat_id=col_id_ma,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_ma,
    comments_dict=comments_dict_missing_ma,
    entire_df="(partial)",
    types="municipality associations",
)

print(
    "++++++++++ Only municipality associations with non-zero population included ++++++++++\n"
)
_ = detect_missings(
    df_ma_nonzero_pop,
    inds=ind_all_ma,
    cat_id=col_id_ma,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_ma,
    comments_dict=comments_dict_missing_ma,
    entire_df="(partial)",
    types="municipality associations",
)

There are 208 municipality associations with a zero population. I exclude those from the dataset and the analysis.

++++++++++ Municipality associations with zero population included ++++++++++

++++++++++ Checking (partial) dataframe of municipality associations +++++++++++

0.7555555555555555 (68 out of 90) of the indicators have missing values.
0.07882200086617583 (364 out of 4618) of the municipality associations have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
0,sozialversicherungspflichtig Beschäftigte am A...,2,0.000433,,"small ma, prob. (close to) 0, no jobs"
1,Baugenehmigungen für Wohnungen,208,0.045041,,
2,Baugenehmigungen für Wohnungen in Ein- und Zwe...,252,0.054569,Baugenehmigungen für Wohnungen,
3,Baugenehmigungen für Wohnungen in Mehrfamilien...,252,0.054569,Baugenehmigungen für Wohnungen,
4,Fertiggestellte Wohnungen im Bestand,208,0.045041,,
...,...,...,...,...,...
63,Nahversorgung Apotheken Anteil der Bev. 1km Ra...,111,0.024036,,
64,Nahversorgung Grundschulen Durchschnittsdistanz,111,0.024036,,
65,Nahversorgung Grundschulen Anteil der Bev. 1km...,111,0.024036,,
66,Nahversorgung Haltestellen des ÖV Durchschnitt...,118,0.025552,,prob. no bus stops


++++++++++ Only municipality associations with non-zero population included ++++++++++

++++++++++ Checking (partial) dataframe of municipality associations +++++++++++

0.13333333333333333 (12 out of 90) of the indicators have missing values.
0.03537414965986395 (156 out of 4410) of the municipality associations have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
0,sozialversicherungspflichtig Beschäftigte am A...,2,0.000454,,"small ma, prob. (close to) 0, no jobs"
1,Baugenehmigungen für Wohnungen in Ein- und Zwe...,44,0.009977,Baugenehmigungen für Wohnungen,
2,Baugenehmigungen für Wohnungen in Mehrfamilien...,44,0.009977,Baugenehmigungen für Wohnungen,
3,Neue Ein- und Zweifamilienhäuser,102,0.023129,Neubauwohnungen je Einwohner,
4,Neubauwohnungen in Ein- und Zweifamilienhäusern,74,0.01678,Neubauwohnungen je Einwohner,
5,Ein- und Zweiraumwohnungen,15,0.003401,,probably close to 0
6,Steuerkraft,6,0.001361,,
7,Einwohner-Arbeitsplatz-Dichte,2,0.000454,,"prob. no jobs, = Einwohnerdichte?"
8,Einpendler,2,0.000454,,prob. no jobs
9,Pendlersaldo,2,0.000454,,prob. no jobs
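
`detect_missings` lives in the helper module and its implementation is not shown here. A minimal sketch of the kind of summary printed above, using a hypothetical function `summarize_missing` and made-up column names, could look like this:

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame, indicators: list[str]) -> pd.DataFrame:
    """Count NaNs per indicator and keep only indicators with at least one NaN."""
    n_missing = df[indicators].isna().sum()
    summary = pd.DataFrame(
        {
            "indicator": n_missing.index,
            "n_missing": n_missing.to_numpy(),
            "share_missing": (n_missing / len(df)).to_numpy(),
        }
    )
    return summary[summary["n_missing"] > 0].reset_index(drop=True)


# Toy data: A has one NaN, B is complete, C has two NaNs.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "A": [1.0, np.nan, 3.0, 4.0],
        "B": [1.0, 2.0, 3.0, 4.0],
        "C": [np.nan, np.nan, 1.0, 2.0],
    }
)
report = summarize_missing(toy, ["A", "B", "C"])

# Number of rows (entities) with at least one missing indicator.
rows_affected = int(toy[["A", "B", "C"]].isna().any(axis=1).sum())
```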


In [13]:
# Manually handle NaNs of particular indicators in the ma dataframe

# 1. Taking care of Einwohner-Arbeitsplatz-Dichte, Einpendler, Pendlersaldo, sozialversicherungspflichtig Beschäftigte am Arbeitsort:
# delete municipality associations with a NaN value for these indicators
print(
    "Municipality associations with missing values for Einwohner-Arbeitsplatz-Dichte, Einpendler, Pendlersaldo, sozialversicherungspflichtig Beschäftigte am Arbeitsort"
)
display(
    df_ma_nonzero_pop.loc[
        df_ma_nonzero_pop["Pendlersaldo"].isna(),
        [
            col_name_ma,
            col_id_ma,
            "Einwohner-Arbeitsplatz-Dichte",
            "Einpendler",
            "Pendlersaldo",
            "sozialversicherungspflichtig Beschäftigte am Arbeitsort",
        ],
    ]
)
df_ma_preprocessed = df_ma_nonzero_pop[df_ma_nonzero_pop["Pendlersaldo"].notna()]

# 2. Taking care of indicators "Nahversorgung Haltestelle des ÖV ..." (V345, V346)
print(
    "Municipality associations with NaN values for distances to public transport (V345, V346)"
)
display(
    df_ma_preprocessed.loc[
        (
            df_ma_preprocessed[
                "Nahversorgung Haltestellen des ÖV Durchschnittsdistanz"
            ].isna()
        )
        | (
            df_ma_preprocessed[
                "Nahversorgung Haltestellen des ÖV Anteil der Bev. 1km Radius"
            ].isna()
        ),
        [
            col_name_ma,
            col_id_ma,
            "Nahversorgung Haltestellen des ÖV Durchschnittsdistanz",
            "Nahversorgung Haltestellen des ÖV Anteil der Bev. 1km Radius",
        ],
    ]
)
# Islands in the North Sea: all except Föhr are completely car-free, so there is no or hardly any public transport on them.
# Hence, distance = inf and share of population within 1 km of a stop = 0.
df_ma_preprocessed.loc[
    df_ma_preprocessed["Nahversorgung Haltestellen des ÖV Durchschnittsdistanz"].isna(),
    "Nahversorgung Haltestellen des ÖV Durchschnittsdistanz",
] = np.inf
df_ma_preprocessed.loc[
    df_ma_preprocessed[
        "Nahversorgung Haltestellen des ÖV Anteil der Bev. 1km Radius"
    ].isna(),
    "Nahversorgung Haltestellen des ÖV Anteil der Bev. 1km Radius",
] = 0

# 3. Taking care of Baugenehmigungen (V46-V48): turn the shares/relative variables (V47, V48) into absolute (per-capita) variables by multiplying by the (per-capita) denominator
df_ma_preprocessed.loc[
    df_ma_preprocessed["Baugenehmigungen für Wohnungen"] == 0,
    "Baugenehmigungen für Wohnungen in Ein- und Zweifamilienhäusern pro 1000 Einwohner",
] = 0
df_ma_preprocessed.loc[
    df_ma_preprocessed["Baugenehmigungen für Wohnungen"] != 0,
    "Baugenehmigungen für Wohnungen in Ein- und Zweifamilienhäusern pro 1000 Einwohner",
] = (
    df_ma_preprocessed["Baugenehmigungen für Wohnungen in Ein- und Zweifamilienhäusern"]
    * df_ma_preprocessed["Baugenehmigungen für Wohnungen"]
    / 100
)

df_ma_preprocessed.loc[
    df_ma_preprocessed["Baugenehmigungen für Wohnungen"] == 0,
    "Baugenehmigungen für Wohnungen in Mehrfamilienhäusern pro 1000 Einwohner",
] = 0
df_ma_preprocessed.loc[
    df_ma_preprocessed["Baugenehmigungen für Wohnungen"] != 0,
    "Baugenehmigungen für Wohnungen in Mehrfamilienhäusern pro 1000 Einwohner",
] = (
    df_ma_preprocessed["Baugenehmigungen für Wohnungen in Mehrfamilienhäusern"]
    * df_ma_preprocessed["Baugenehmigungen für Wohnungen"]
    / 100
)
# drop original relative variables
df_ma_preprocessed.drop(
    [
        "Baugenehmigungen für Wohnungen in Ein- und Zweifamilienhäusern",
        "Baugenehmigungen für Wohnungen in Mehrfamilienhäusern",
    ],
    axis=1,
    inplace=True,
)

# 4. Taking care of variables related to newly built flats (V50-V53)
# 'Neue Ein- und Zweifamilienhäuser' (V50): denominator not available, similar explanatory power as V53
# 'Neubauwohnungen in Ein- und Zweifamilienhäusern' (V51): relative number. The corresponding absolute per-capita number is already included in the dataset (V53)
df_ma_preprocessed.drop(
    [
        "Neue Ein- und Zweifamilienhäuser",
        "Neubauwohnungen in Ein- und Zweifamilienhäusern",
    ],
    axis=1,
    inplace=True,
)

# Updating list of indicators
add_ind_ma_2019_updated = add_ind_ma_2019 + [
    "Baugenehmigungen für Wohnungen in Ein- und Zweifamilienhäusern pro 1000 Einwohner",
    "Baugenehmigungen für Wohnungen in Mehrfamilienhäusern pro 1000 Einwohner",
]
for deleted_ind in [
    "Neue Ein- und Zweifamilienhäuser",
    "Neubauwohnungen in Ein- und Zweifamilienhäusern",
    "Baugenehmigungen für Wohnungen in Ein- und Zweifamilienhäusern",
    "Baugenehmigungen für Wohnungen in Mehrfamilienhäusern",
]:
    add_ind_ma_2019_updated.remove(deleted_ind)

ind_all_ma_updated = add_ind_ma_2019_updated + list(add_ind_ma_dict.keys())
# detect and save the remaining missing values
df_detected_missings_ma = detect_missings(
    df_ma_preprocessed,
    inds=ind_all_ma_updated,
    cat_id=col_id_ma,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_ma,
    comments_dict=comments_dict_missing_ma,
    entire_df="(partial)",
    types="municipality associations",
)
remaining_missing_ind_ma = list(df_detected_missings_ma[col_ind_inkar])

Municipality associations with missing values for Einwohner-Arbeitsplatz-Dichte, Einpendler, Pendlersaldo, sozialversicherungspflichtig Beschäftigte am Arbeitsort


Unnamed: 0,Name of municipality ass.,Code of municipality associations (RS),Einwohner-Arbeitsplatz-Dichte,Einpendler,Pendlersaldo,sozialversicherungspflichtig Beschäftigte am Arbeitsort
361,"Lohheide, gemfr. Bezirk",33519501,,,,
2725,Zandt,93720177,,,,


Municipality associations with NaN values for distances to public transport (V345, V346)


Unnamed: 0,Name of municipality ass.,Code of municipality associations (RS),Nahversorgung Haltestellen des ÖV Durchschnittsdistanz,Nahversorgung Haltestellen des ÖV Anteil der Bev. 1km Radius
36,Föhr-Amrum,10545488,,
65,Helgoland,10560025,,
473,Baltrum,34520002,,
478,"Juist, Inselgemeinde",34520013,,
524,"Wangerooge, Nordseebad",34550021,,
597,Langeoog,34620007,,
598,Spiekeroog,34620014,,


++++++++++ Checking (partial) dataframe of municipality associations +++++++++++

0.022727272727272728 (2 out of 88) of the indicators have missing values.
0.004537205081669692 (20 out of 4408) of the municipality associations have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
0,Ein- und Zweiraumwohnungen,14,0.003176,,probably close to 0
1,Steuerkraft,6,0.001361,,
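
The conversion of the building-permit shares in item 3 of the cell above can be illustrated with toy numbers. The sketch assumes, as the target column names suggest, that the total is expressed per 1000 inhabitants and the share in percent; the column names here are shortened stand-ins.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        # total building permits per 1000 inhabitants
        "permits_per_1000": [4.0, 0.0, 2.5],
        # share of permits for single- and two-family houses in %;
        # undefined (NaN) where the total is zero
        "share_single_family_pct": [75.0, np.nan, 40.0],
    }
)

# share [%] * total [per 1000] / 100 gives the absolute per-1000 value.
# Where the total is zero, the per-1000 value is zero as well.
toy["single_family_per_1000"] = np.where(
    toy["permits_per_1000"] == 0,
    0.0,
    toy["share_single_family_pct"] * toy["permits_per_1000"] / 100,
)
```

This is why the NaNs in the share columns disappear once the total is zero: the undefined share is replaced by a well-defined absolute value of 0.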


In [14]:
# Substitute remaining NaNs with values from the past (using the most recent value available).
df_ma_preprocessed = insert_and_save_substitutes_from_past(
    df_ma_preprocessed,
    df_inkar=df_inkar_ma,
    inds_missing=remaining_missing_ind_ma,
    cat_id=col_id_ma,
    cat_ind=col_ind_inkar,
    cat_time=col_time_inkar,
    time_period=t_inkar,
    cat_value=col_value_inkar,
    spatial_type="ma",
)
_ = detect_missings(
    df_ma_preprocessed,
    inds=ind_all_ma_updated,
    cat_id=col_id_ma,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_ma,
    comments_dict=comments_dict_missing_ma,
    entire_df="(partial)",
    types="municipality associations",
)
# Values from the past are available for all NaNs on ma-level. 


++++++++++ Checking (partial) dataframe of municipality associations +++++++++++

0.0 (0 out of 88) of the indicators have missing values.
0.0 (0 out of 4408) of the municipality associations have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
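
The idea behind substituting from the past can be sketched as follows. This is an illustration under assumed column names, not the code of `insert_and_save_substitutes_from_past`, which is defined in the helper module and may differ in detail.

```python
import numpy as np
import pandas as pd

# Long-format history of one indicator (hypothetical names and values).
history = pd.DataFrame(
    {
        "entity_id": [1, 1, 1, 2, 2],
        "year": [2017, 2018, 2019, 2018, 2019],
        "value": [5.0, 6.0, np.nan, 2.0, 3.0],
    }
)
reference_year = 2019

# Most recent non-missing value before the reference year, per entity.
latest_past = (
    history[history["year"] < reference_year]
    .dropna(subset=["value"])
    .sort_values("year")
    .groupby("entity_id")["value"]
    .last()
)

# Wide frame for the reference year: entity 1 is missing, entity 2 is not.
wide = pd.DataFrame({"entity_id": [1, 2], "value_2019": [np.nan, 3.0]})

# Only NaNs are replaced; existing reference-year values stay untouched.
wide["value_2019"] = wide["value_2019"].fillna(wide["entity_id"].map(latest_past))
```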


#### 2b. Detect and fix NaNs in the c-dataframe.

In [15]:
_ = detect_missings(
    df_c,
    inds=ind_all_c,
    cat_id=col_id_c,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_c,
    comments_dict=comments_dict_missing_c,
    entire_df="(partial)",
    types="counties",
)

++++++++++ Checking (partial) dataframe of counties +++++++++++

0.14893617021276595 (14 out of 94) of the indicators have missing values.
0.14713216957605985 (59 out of 401) of the counties have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
0,Baulandpreise,5,0.012469,,
1,Industriequote,14,0.034913,,similar to other sector related ind.
2,Dienstleistungsquote,2,0.004988,,similar to other sector related ind.
3,Beschäftigte in wissensintensiven Industrien,7,0.017456,,
4,Ausbildungsplätze,1,0.002494,,missing for Berlin
5,Schulabgänger mit allgemeiner Hochschulreife,2,0.004988,,
6,Schlüsselzuweisungen,4,0.009975,,
7,Ausgaben für Sachinvestitionen,6,0.014963,,same NaNs as Zuweisungen für Investitionsförde...
8,Zuweisungen für Investitionsfördermaßnahmen,6,0.014963,,same NaNs as Ausgaben für Sachinvestitionen
9,Bruttowertschöpfung je Erwerbstätigen Primärer...,2,0.004988,,Bruttowertschöpfung im Prim. Sektor approx. 0;...


In [16]:
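# Take care of NaNs in the c dataframe
# 1. Derive a missing 'Anteil Bruttowertschöpfung Primärer Sektor' as the residual: 100 - secondary share - tertiary share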

df_c_preprocessed = df_c.copy()

display(
    df_c_preprocessed.loc[
        df_c_preprocessed["Anteil Bruttowertschöpfung Primärer Sektor"].isna(),
        [
            col_name_c,
            col_id_c,
            "Anteil Bruttowertschöpfung Primärer Sektor",
            "Anteil Bruttowertschöpfung Sekundärer Sektor",
            "Anteil Bruttowertschöpfung Tertiärer Sektor",
        ],
    ]
)
df_c_preprocessed.loc[
    (df_c_preprocessed["Anteil Bruttowertschöpfung Primärer Sektor"].isna())
    & (~df_c_preprocessed["Anteil Bruttowertschöpfung Sekundärer Sektor"].isna())
    & (~df_c_preprocessed["Anteil Bruttowertschöpfung Tertiärer Sektor"].isna()),
    "Anteil Bruttowertschöpfung Primärer Sektor",
] = (
    100
    - df_c_preprocessed["Anteil Bruttowertschöpfung Sekundärer Sektor"]
    - df_c_preprocessed["Anteil Bruttowertschöpfung Tertiärer Sektor"]
)
assert (
    sum(df_c_preprocessed["Anteil Bruttowertschöpfung Primärer Sektor"] < 0) == 0
), "Attention: Anteil Bruttowertschöpfung Primärer Sektor < 0 for some instances."

Unnamed: 0,County name,County code,Anteil Bruttowertschöpfung Primärer Sektor,Anteil Bruttowertschöpfung Sekundärer Sektor,Anteil Bruttowertschöpfung Tertiärer Sektor
343,"Rostock, Stadt",13003.0,,18.38,81.57
344,"Schwerin, Stadt",13004.0,,17.36,82.58
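
Since the three sector shares sum to 100, the missing primary share follows as a residual. A worked example with the two counties displayed above (shortened column names are stand-ins for the German indicator names):

```python
import numpy as np
import pandas as pd

shares = pd.DataFrame(
    {
        "county": ["Rostock, Stadt", "Schwerin, Stadt"],
        "primary": [np.nan, np.nan],
        "secondary": [18.38, 17.36],
        "tertiary": [81.57, 82.58],
    }
)

# primary = 100 - secondary - tertiary, filled in only where primary is missing
residual = 100 - shares["secondary"] - shares["tertiary"]
shares["primary"] = shares["primary"].fillna(residual)
```

The residuals are roughly 0.05 for Rostock and 0.06 for Schwerin, i.e., small but non-negative, which is what the assertion in the cell above checks.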


In [17]:
df_c_preprocessed.loc[
    df_c_preprocessed[
        [
            "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor",
            "Erwerbstätige Primärer Sektor",
        ]
    ]
    .isna()
    .any(axis=1),
    [
        col_name_c,
        col_id_c,
        "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor",
        "Erwerbstätige Primärer Sektor",
    ],
]

Unnamed: 0,County name,County code,Bruttowertschöpfung je Erwerbstätigen Primärer Sektor,Erwerbstätige Primärer Sektor
343,"Rostock, Stadt",13003.0,,0.06
344,"Schwerin, Stadt",13004.0,,0.09


In [18]:
# 2. Approximate 'Bruttowertschöpfung je Erwerbstätigen Primärer Sektor' using the state-wide average of the quotient 'Bruttowertschöpfung je Erwerbstätigen Primärer Sektor'/'Bruttowertschöpfung je Erwerbstätigen'

df_c_preprocessed["temp_quotient_Bruttowertschöpfung_prim"] = (
    df_c_preprocessed["Bruttowertschöpfung je Erwerbstätigen Primärer Sektor"]
    / df_c_preprocessed["Bruttowertschöpfung je Erwerbstätigen"]
)

# get states of NaNs
states_of_nans = list(
    df_c_preprocessed.loc[
        df_c_preprocessed[
            "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor"
        ].isna(),
        col_id_s,
    ].unique()
)

avg_dict = {}
for state in states_of_nans:
    avg_dict.update(
        {
            state: df_c_preprocessed.loc[
                (
                    df_c_preprocessed[
                        "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor"
                    ].notna()
                )
                & (df_c_preprocessed[col_id_s] == state),
                "temp_quotient_Bruttowertschöpfung_prim",
            ].mean()
        }
    )

idx_missing = df_c_preprocessed[
    "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor"
].isna()

df_c_preprocessed.loc[
    idx_missing,
    "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor",
] = df_c_preprocessed["Bruttowertschöpfung je Erwerbstätigen"] * df_c_preprocessed[
    col_id_s
].map(avg_dict)

df_c_preprocessed.drop(["temp_quotient_Bruttowertschöpfung_prim"], axis=1, inplace=True)

display(
    df_c_preprocessed.loc[
        idx_missing,
        [
            col_id_c,
            col_name_c,
            "Bruttowertschöpfung je Erwerbstätigen Primärer Sektor",
            "Bruttowertschöpfung je Erwerbstätigen",
        ],
    ]
)

Unnamed: 0,County code,County name,Bruttowertschöpfung je Erwerbstätigen Primärer Sektor,Bruttowertschöpfung je Erwerbstätigen
343,13003.0,"Rostock, Stadt",57.249096,59.53
344,13004.0,"Schwerin, Stadt",52.344504,54.43
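
The approximation used in the cell above scales a county's overall value by the average primary-to-overall ratio of the other counties in the same state. A compact sketch of the same idea with `groupby(...).transform`, on made-up numbers and shortened column names:

```python
import numpy as np
import pandas as pd

counties = pd.DataFrame(
    {
        "state": [13, 13, 13, 15],
        "gva_total": [50.0, 60.0, 40.0, 55.0],    # overall value per worker
        "gva_primary": [40.0, 54.0, np.nan, 44.0],  # primary-sector value per worker
    }
)

# Ratio primary/overall; NaN where the primary value is missing.
ratio = counties["gva_primary"] / counties["gva_total"]

# State-wide mean of the ratio (NaNs are skipped), broadcast to every row.
state_mean_ratio = ratio.groupby(counties["state"]).transform("mean")

# Estimate = overall value * state-average ratio, used only for missing rows.
estimate = counties["gva_total"] * state_mean_ratio
counties["gva_primary"] = counties["gva_primary"].fillna(estimate)
```

In the toy data the state-13 ratios are 0.8 and 0.9, so the missing county receives 40 * 0.85 = 34.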


In [19]:
inds_missing_c_to_fix = list(
    df_c_preprocessed.columns[df_c_preprocessed.isna().any(axis=0)]
)
print('Variables with missing values.')
display(inds_missing_c_to_fix)
print('NaNs are substituted by values from the past if available.')

df_c_preprocessed = insert_and_save_substitutes_from_past(
    df_c_preprocessed,
    df_inkar=df_inkar_c,
    inds_missing=inds_missing_c_to_fix,
    cat_id=col_id_c,
    cat_ind=col_ind_inkar,
    cat_time=col_time_inkar,
    time_period=t_inkar,
    cat_value=col_value_inkar,
    spatial_type="c",
)
df_ind_missing_c = detect_missings(
    df_c_preprocessed,
    inds=ind_all_c,
    cat_id=col_id_c,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_c,
    comments_dict=comments_dict_missing_c,
    entire_df="(partial)",
    types="counties",
)

inds_missing_c_to_fix = list(df_ind_missing_c[col_ind_inkar])

display(
    df_c_preprocessed.loc[
        df_c_preprocessed["Umsatz im Handwerk"].isna(),
        [col_id_c, col_name_c, "Umsatz im Handwerk"],
    ]
)

print('State average serves as substitute.')
df_c_preprocessed = insert_and_save_subs_higher_level_c(
    df_c_preprocessed,
    inds_missing=inds_missing_c_to_fix,
    cat_state=col_id_s,
    cat_value=col_value_inkar,
)

_ = detect_missings(
    df_c_preprocessed,
    inds=ind_all_c,
    cat_id=col_id_c,
    cat_ind=col_ind_inkar,
    dependencies_dict=dependencies_ind_c,
    comments_dict=comments_dict_missing_c,
    entire_df="(partial)",
    types="counties",
)

Variables with missing values.


['Baulandpreise',
 'Industriequote',
 'Dienstleistungsquote',
 'Beschäftigte in wissensintensiven Industrien',
 'Ausbildungsplätze',
 'Schulabgänger mit allgemeiner Hochschulreife',
 'Schlüsselzuweisungen',
 'Ausgaben für Sachinvestitionen',
 'Zuweisungen für Investitionsfördermaßnahmen',
 'Umsatz im Bergbau u. Verarb. Gewerbe',
 'Umsatz Bauhauptgewerbe',
 'Umsatz im Handwerk']

NaNs are substituted by values from the past if available.
++++++++++ Checking (partial) dataframe of counties +++++++++++

0.010638297872340425 (1 out of 94) of the indicators have missing values.
0.004987531172069825 (2 out of 401) of the counties have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
0,Umsatz im Handwerk,2,0.004988,,


Unnamed: 0,County code,County name,Umsatz im Handwerk
364,15001.0,"Dessau-Roßlau, Stadt",
372,15086.0,Jerichower Land,


State average serves as substitute.
++++++++++ Checking (partial) dataframe of counties +++++++++++

0.0 (0 out of 94) of the indicators have missing values.
0.0 (0 out of 401) of the counties have at least one missing indicator.



Unnamed: 0,Indikator,Abs. number of NaNs,Rel. number of NaNs,Identified dependencies,Further comments
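
Falling back to the state average, as done for 'Umsatz im Handwerk', can be sketched in a few lines. This is an illustration with hypothetical column names and values; `insert_and_save_subs_higher_level_c` itself is defined in the helper module and may work differently.

```python
import numpy as np
import pandas as pd

counties = pd.DataFrame(
    {
        "state": [15, 15, 15, 13],
        "turnover": [100.0, np.nan, 140.0, 80.0],
    }
)

# Mean of the available values per state, aligned with the original rows.
state_mean = counties.groupby("state")["turnover"].transform("mean")

# Replace only the missing entries by their state's mean.
counties["turnover"] = counties["turnover"].fillna(state_mean)
```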


### Step 3: Merge both datasets.
#### Integrate the data on c-level into the dataset on ma-level to derive the final (compact) INKAR dataframe which will serve as a basis for the analysis.
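
Because county-level indicators are relative or per-capita values, every municipal association simply inherits the values of its county. A minimal sketch of such a left merge on toy data (identifiers and column names are made up); the `validate` argument is an optional safeguard that is not used in the notebook's own merge:

```python
import pandas as pd

ma = pd.DataFrame(
    {
        "ma_id": [1, 2, 3],
        "county_id": [10, 10, 20],
        "pop_density": [10.0, 20.0, 30.0],
    }
)
county = pd.DataFrame(
    {
        "county_id": [10, 20],
        "unemployment_rate": [4.5, 6.1],
    }
)

# Left merge keeps every ma row; validate raises an error if a county key
# appears more than once on the right-hand side.
merged = ma.merge(county, on="county_id", how="left", validate="many_to_one")
```

With `validate="many_to_one"`, duplicated county rows would raise an error instead of silently multiplying ma rows.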

In [20]:
df_complete_preprocessed = df_ma_preprocessed.merge(
    df_c_preprocessed[[col_id_c] + ind_all_c], on=col_id_c, how="left"
)

In [21]:
# check completeness of full dataframe
assert (
    df_complete_preprocessed.isna().sum().sum() == 0
), "There are NaNs in the preprocessed INKAR dataset."

In [22]:
df_complete_preprocessed.to_csv(
    "data/intermediate_data/preprocessed_inkar_data.csv", sep=";", index=False
)