#### Fill missing values

This script fills the missing values in the data set. The following columns contain missing values:
- year of construction of the substructure (`Baujahr Unterbau`)
- district (`Kreis`)
- name of the state (`Bundeslandname`)
- X coordinate (`X`)
- Y coordinate (`Y`)



In [53]:
# load libraries 
import pandas as pd

In [73]:
# read data
data = pd.read_csv('../data/reduced_bridge_statistic_germany.csv', sep=';')

# columns with missing values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
print("Columns with missing values:", cols_with_missing)

Columns with missing values: ['Baujahr Unterbau', 'Kreis', 'Bundeslandname', 'X', 'Y']
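
To see how many values are missing in each column, and not only which columns are affected, the check above can be extended with a count. A minimal sketch on a toy frame (the `toy` data is made up for illustration and is not taken from the bridge data set):

```python
import pandas as pd

# hypothetical toy data, not the real bridge statistics
toy = pd.DataFrame({
    'Baujahr Unterbau': [1970, None, 1985],
    'Kreis': ['Köln', None, None],
    'X': [1.0, 2.0, 3.0],
})

# number of missing values per column, keeping only the affected columns
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Applied to `data`, the same two lines would show how much of each column still has to be filled.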


We fill the missing `Baujahr Unterbau` values with the respective `Baujahr Überbau` (year of construction of the superstructure).

In [74]:
# use 'Baujahr Überbau' for 'Baujahr Unterbau' in case it is missing
data.fillna({'Baujahr Unterbau': data['Baujahr Überbau']}, inplace=True)
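
The fill is aligned by row index, so only rows where `Baujahr Unterbau` is missing take over the value of `Baujahr Überbau`; existing values stay untouched. A small sketch with made-up years, written in the column-wise form of `fillna`, which should behave the same way:

```python
import pandas as pd

# hypothetical construction years, for illustration only
toy = pd.DataFrame({
    'Baujahr Unterbau': [1965.0, None, 1990.0],
    'Baujahr Überbau': [1970.0, 1982.0, 1995.0],
})

# only the missing entry in row 1 is replaced
toy['Baujahr Unterbau'] = toy['Baujahr Unterbau'].fillna(toy['Baujahr Überbau'])
print(toy['Baujahr Unterbau'].tolist())
# [1965.0, 1982.0, 1990.0]
```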

The next step is to normalize the names of the states as described below. Furthermore, some bridges have no state name. To fill these, we look for another bridge in the same district (`Kreis`) that has a state name (`Bundeslandname`).

In [75]:
# print unique values in 'Bundeslandname' to verify
#print("Unique values in 'Bundeslandname':", data['Bundeslandname'].unique())

# 'Schleswig - Holstein' into 'Schleswig-Holstein'
data['Bundeslandname'] = data['Bundeslandname'].replace('Schleswig - Holstein', 'Schleswig-Holstein')
# 'Rheinland - Pfalz' into 'Rheinland-Pfalz'
data['Bundeslandname'] = data['Bundeslandname'].replace('Rheinland - Pfalz', 'Rheinland-Pfalz')
# 'Freie u. Hansestadt Hamburg' into 'Hamburg'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freie u. Hansestadt Hamburg', 'Hamburg')
# 'Freie Hansestadt Bremen' into 'Bremen'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freie Hansestadt Bremen', 'Bremen')
# 'Nordrhein-Westfalen (NRW)' into 'Nordrhein-Westfalen'
data['Bundeslandname'] = data['Bundeslandname'].replace('Nordrhein-Westfalen (NRW)', 'Nordrhein-Westfalen')
# 'Freistaat Sachsen' into 'Sachsen'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freistaat Sachsen', 'Sachsen')
# 'Freistaat Bayern' into 'Bayern'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freistaat Bayern', 'Bayern')
# 'Land Baden-Württemberg' into 'Baden-Württemberg'
data['Bundeslandname'] = data['Bundeslandname'].replace('Land Baden-Württemberg', 'Baden-Württemberg')

# print unique values in 'Bundeslandname' to verify
#print("Unique values in 'Bundeslandname':", data['Bundeslandname'].unique())
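
The individual `replace` calls can also be collected in a single mapping. This is only a sketch of an alternative: the name `state_name_map` and the sample series are made up here, and the entries are the variants handled in the cell above.

```python
import pandas as pd

# hypothetical mapping name; keys are the variants found in the data
state_name_map = {
    'Schleswig - Holstein': 'Schleswig-Holstein',
    'Rheinland - Pfalz': 'Rheinland-Pfalz',
    'Freie u. Hansestadt Hamburg': 'Hamburg',
    'Freie Hansestadt Bremen': 'Bremen',
    'Nordrhein-Westfalen (NRW)': 'Nordrhein-Westfalen',
    'Freistaat Sachsen': 'Sachsen',
    'Freistaat Bayern': 'Bayern',
    'Land Baden-Württemberg': 'Baden-Württemberg',
}

# made-up sample values; names without a mapping entry are left as they are
states = pd.Series(['Freistaat Bayern', 'Hamburg', 'Rheinland - Pfalz', None])
normalized = states.replace(state_name_map)
print(normalized.tolist())
```

Keeping the variants in one dictionary makes it easier to spot a misspelled target name and to add further variants later.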

In [76]:
# try to fill 'Bundeslandname' based on another row with same 'Kreis'
for index, row in data.iterrows(): 
    if pd.isnull(row['Bundeslandname']): 
        district = row['Kreis']
        matching_rows = data[data['Kreis'] == district]
        for _, match_row in matching_rows.iterrows():
            if pd.notnull(match_row['Bundeslandname']): 
                data.at[index, 'Bundeslandname'] = match_row['Bundeslandname']
                break
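
The row-by-row loop can become slow on a large table, because it rescans the data for every bridge without a state. A vectorized sketch of the same idea, on made-up rows, builds a district-to-state lookup from the rows that have both values and applies it only where the state is missing (`known` and `lookup` are names introduced here):

```python
import pandas as pd

# hypothetical rows, for illustration only
toy = pd.DataFrame({
    'Kreis': ['Köln', 'Köln', 'München', None],
    'Bundeslandname': ['Nordrhein-Westfalen', None, None, None],
})

# first known state per district
known = toy.dropna(subset=['Kreis', 'Bundeslandname'])
lookup = known.drop_duplicates(subset='Kreis').set_index('Kreis')['Bundeslandname']

# fill only the missing states; districts without a known state stay empty
toy['Bundeslandname'] = toy['Bundeslandname'].fillna(toy['Kreis'].map(lookup))
print(toy)
```

As in the loop, a bridge keeps its missing state if its district is missing too, or if no other bridge in that district has a state name.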

One bridge has neither a state name, a district, nor coordinates. Thus, we decided to remove this bridge.

In [77]:
# remove bridge with missing 'Bundeslandname' and 'Kreis' and coordinates
data = data.dropna(subset=['Bundeslandname'])

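The next step is to look at the districts (`Kreis`). First, some districts are written entirely in upper-case letters, so we convert them to title case. Second, no districts are available for Berlin, so we set `Kreis` to `Berlin` for those bridges.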

In [78]:
# modify 'Kreis' column
# convert to title case: first letter of each word upper case, all other letters lower case
data['Kreis'] = data['Kreis'].str.title()

# for 'Bundeslandname' == 'Berlin' there are no districts available -> set 'Kreis' to 'Berlin'
data.loc[data['Bundeslandname'] == 'Berlin', 'Kreis'] = 'Berlin'
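
Note that `str.title()` capitalizes the first letter of every word, including words that are usually lower case in German place names. A quick check on made-up examples shows the effect, so it may be worth inspecting the resulting district names:

```python
import pandas as pd

# hypothetical district spellings, for illustration only
districts = pd.Series(['MÜNCHEN', 'Frankfurt am Main', 'rhein-neckar-kreis'])
titled = districts.str.title()
print(titled.tolist())
# ['München', 'Frankfurt Am Main', 'Rhein-Neckar-Kreis']
```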


In [79]:
# save filled data
data.to_csv('../data/filled_bridge_statistic_germany.csv', sep=';', index=False)

In [80]:
# columns with missing values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
print("Columns with missing values:", cols_with_missing)

Columns with missing values: ['Kreis', 'X', 'Y']
