#### Fill missing values

This notebook fills missing values in the bridge dataset. The following columns contain missing values:
- year of construction of the substructure (`Baujahr Unterbau`)
- district (`Kreis`)
- name of the state (`Bundeslandname`)
- coordinates (`X`, `Y`)



In [168]:
# load libraries 
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import numpy as np

In [208]:
# read data
data = pd.read_csv('../data/reduced_bridge_statistic_germany.csv', sep=';')

# columns with missing values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
print("Columns with missing values:", cols_with_missing)

# print number of rows
print("Number of rows in the dataset:", len(data))

Columns with missing values: ['Baujahr Unterbau', 'Kreis', 'Bundeslandname', 'X', 'Y']
Number of rows in the dataset: 52559
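
Besides listing the affected columns, it can help to see how many values are missing in each one. The following is a minimal sketch on a made-up toy frame; the column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values; only the column names come from the dataset
toy = pd.DataFrame({
    "Baujahr Unterbau": [1975.0, None, 1990.0],
    "Kreis": ["Bamberg", None, None],
    "X": [10.9, 11.0, None],
})

# Count missing values per column and keep only columns that have any
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]
print(missing_per_column)
```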


We fill the missing `Baujahr Unterbau` values with the respective `Baujahr Überbau` (year of construction of the superstructure). 

In [209]:
# use 'Baujahr Überbau' for 'Baujahr Unterbau' in case it is missing
data.fillna({'Baujahr Unterbau': data['Baujahr Überbau']}, inplace=True)
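
Passing a Series to `fillna` fills each missing entry with the value at the same index label, so every bridge receives its own superstructure year. A small sketch of this behaviour on invented values, written as a plain column assignment instead of `inplace=True`:

```python
import pandas as pd

# Toy frame with invented years; only the column names come from the dataset
toy = pd.DataFrame({
    "Baujahr Überbau": [1965, 1980, 2001],
    "Baujahr Unterbau": [1960.0, None, None],
})

# Missing substructure years are taken from the same row's superstructure year;
# existing values (1960 in row 0) are left untouched
toy["Baujahr Unterbau"] = toy["Baujahr Unterbau"].fillna(toy["Baujahr Überbau"])
print(toy)
```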

Next, we normalize the state names as shown below. Some bridges also lack a state name. To fill it, we look for another bridge in the same district (`Kreis`) that has a state name (`Bundeslandname`) and copy that name. 

In [210]:
# print unique values in 'Bundeslandname' to verify
#print("Unique values in 'Bundeslandname':", data['Bundeslandname'].unique())

# 'Schleswig - Holstein' into 'Schleswig-Holstein'
data['Bundeslandname'] = data['Bundeslandname'].replace('Schleswig - Holstein', 'Schleswig-Holstein')
# 'Rheinland - Pfalz' into 'Rheinland-Pfalz'
data['Bundeslandname'] = data['Bundeslandname'].replace('Rheinland - Pfalz', 'Rheinland-Pfalz')
# 'Freie u. Hansestadt Hamburg' into 'Hamburg'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freie u. Hansestadt Hamburg', 'Hamburg')
# 'Freie Hansestadt Bremen' into 'Bremen'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freie Hansestadt Bremen', 'Bremen')
# 'Nordrhein-Westfalen (NRW)' into 'Nordrhein-Westfalen'
data['Bundeslandname'] = data['Bundeslandname'].replace('Nordrhein-Westfalen (NRW)', 'Nordrhein-Westfalen')
# 'Freistaat Sachsen' into 'Sachsen'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freistaat Sachsen', 'Sachsen')
# 'Freistaat Bayern' into 'Bayern'
data['Bundeslandname'] = data['Bundeslandname'].replace('Freistaat Bayern', 'Bayern')
# 'Land Baden-Württemberg' into 'Baden-Württemberg'
data['Bundeslandname'] = data['Bundeslandname'].replace('Land Baden-Württemberg', 'Baden-Württemberg')

# print unique values in 'Bundeslandname' to verify
#print("Unique values in 'Bundeslandname':", data['Bundeslandname'].unique())
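
The individual `replace` calls can also be collected in a single dictionary, which keeps all spelling rules in one place. This is a sketch of that alternative, not the code used in this notebook; the name `STATE_NAME_FIXES` and the toy values are made up for illustration.

```python
import pandas as pd

# Hypothetical lookup table: spelling found in the data -> normalized state name
STATE_NAME_FIXES = {
    "Schleswig - Holstein": "Schleswig-Holstein",
    "Rheinland - Pfalz": "Rheinland-Pfalz",
    "Freie u. Hansestadt Hamburg": "Hamburg",
    "Freie Hansestadt Bremen": "Bremen",
    "Nordrhein-Westfalen (NRW)": "Nordrhein-Westfalen",
    "Freistaat Sachsen": "Sachsen",
    "Freistaat Bayern": "Bayern",
    "Land Baden-Württemberg": "Baden-Württemberg",
}

# Toy column with invented rows
toy = pd.DataFrame(
    {"Bundeslandname": ["Freistaat Bayern", "Hessen", "Rheinland - Pfalz", None]}
)

# Values that are not keys of the dictionary (here 'Hessen' and the missing entry) stay as they are
toy["Bundeslandname"] = toy["Bundeslandname"].replace(STATE_NAME_FIXES)
print(toy["Bundeslandname"].tolist())
```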

In [211]:
# try to fill 'Bundeslandname' based on another row with same 'Kreis'
for index, row in data.iterrows(): 
    if pd.isnull(row['Bundeslandname']): 
        district = row['Kreis']
        matching_rows = data[data['Kreis'] == district]
        for _, match_row in matching_rows.iterrows():
            if pd.notnull(match_row['Bundeslandname']): 
                data.at[index, 'Bundeslandname'] = match_row['Bundeslandname']
                break
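
The nested loop scans the whole table once for every bridge without a state name. A vectorized variant could build a district-to-state lookup once and apply it to all rows. The sketch below assumes that every district belongs to exactly one state; the variable name `state_by_district` and the toy rows are invented.

```python
import pandas as pd

# Toy frame with invented rows
toy = pd.DataFrame({
    "Kreis": ["Bamberg", "Bamberg", "Kassel", "Kassel", None],
    "Bundeslandname": [None, "Bayern", "Hessen", None, None],
})

# First non-null state name per district (rows without a district are ignored)
state_by_district = toy.groupby("Kreis")["Bundeslandname"].first()

# Fill only the missing state names; rows without a district stay missing
toy["Bundeslandname"] = toy["Bundeslandname"].fillna(
    toy["Kreis"].map(state_by_district)
)
print(toy)
```

As in the loop, a bridge without a district cannot be matched and keeps its missing state name.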

One bridge has neither a state name nor a district nor coordinates. We therefore remove this bridge. 

In [212]:
# remove bridge with missing 'Bundeslandname' and 'Kreis' and coordinates
data = data.dropna(subset=['Bundeslandname'])

The next step is to clean up the districts (`Kreis`). First, some district names are written entirely in upper-case letters; they are converted to title case (lower case with a capital first letter). Second, the missing districts are looked up in a GeoJSON file based on the given coordinates. Third, the district names are normalized. For simplicity, we do not distinguish between `Landkreis` and `Kreisfreie Stadt`. 

The GeoJSON file distinguishes between the following districts, which we do not: 
- Offenbach am Main & Offenbach (we only have Offenbach)
- Bremerhaven & Bremen (we only have Bremen)

In [213]:
# modify 'Kreis' column
# convert names to title case (first letter of each word upper case, the rest lower case)
data['Kreis'] = data['Kreis'].str.title()

# for 'Bundeslandname' == 'Berlin' there are no districts available -> set 'Kreis' to 'Berlin'
data.loc[data['Bundeslandname'] == 'Berlin', 'Kreis'] = 'Berlin'
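
`Series.str.title()` behaves like Python's built-in `str.title()`: every letter that follows a non-letter is capitalized and all other letters are lowered. The short sketch below uses made-up input strings to show two side effects, which explain why an abbreviation appears as `'Nrw'` afterwards and why later matching is better done on lower-cased names.

```python
# Made-up example strings, not taken from the dataset
raw_names = ["NRW", "MECKLENBURG VORPOMMERN", "Landkreis Pfaffenhofen a.d.Ilm"]

titled = [name.title() for name in raw_names]
print(titled)
# ['Nrw', 'Mecklenburg Vorpommern', 'Landkreis Pfaffenhofen A.D.Ilm']
```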


In [214]:
# load Germany districts GeoJSON file
germany_districts = gpd.read_file('../landkreise_simplify200.geojson')

# subset data to rows whose district is missing or wrongly set to a country or state name
# ('Bundesrepublik Deutschland', 'Mecklenburg Vorpommern', 'Niedersachsen', 'Nrw', 'Saarland')
data_missing_districts = data[data['Kreis'].isnull() |
                                (data['Kreis'] == 'Bundesrepublik Deutschland') | 
                                (data['Kreis'] == 'Mecklenburg Vorpommern') |
                                (data['Kreis'] == 'Niedersachsen') |
                                (data['Kreis'] == 'Nrw') |
                                (data['Kreis'] == 'Saarland')].copy()

# create a GeoDataFrame from the data DataFrame
geometry = [Point(xy) for xy in zip(data_missing_districts['X'], data_missing_districts['Y'])]
geo_data_missing = gpd.GeoDataFrame(data_missing_districts, geometry=geometry, crs="EPSG:4326")

# perform spatial join to find the district for each point
joined = gpd.sjoin(geo_data_missing, germany_districts, how="left", predicate="within")

# print(joined[['BEZ', 'GEN']])
# BEZ: Landkreis, Kreisfreie Stadt, Kreis
# GEN: Name of the district

# update the 'Kreis' column in the original data DataFrame
for index, row in joined.iterrows():
    original_index = row.name
    if pd.notnull(row['GEN']):
        data.at[original_index, 'Kreis'] = row['GEN']
    else:
        data.at[original_index, 'Kreis'] = 'Unknown'
        print(f"Could not find district for index {original_index} with coordinates ({row['X']}, {row['Y']})")
        print(f"Point geometry: {row['geometry']}")
        print()

# remove rows where 'Kreis' is still null or 'Unknown'
data_filled = data[~data['Kreis'].isnull() & (data['Kreis'] != 'Unknown')].copy()

# print number of rows
print("Number of rows after filling missing districts:", len(data_filled))

Could not find district for index 11835 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 11836 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 11837 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 11869 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Number of rows after filling missing districts: 52554
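
The write-back loop can also be expressed as a single index-aligned assignment. The sketch below replaces the real spatial-join result with a small invented `joined` frame, so it only illustrates the write-back step, not the spatial join itself. The de-duplication line is an assumption: it guards against a point that matches more than one polygon.

```python
import pandas as pd

# Toy stand-ins with invented values
data = pd.DataFrame({"Kreis": ["Bamberg", None, "Nrw", None]})
# Same index labels as the rows that were looked up; None means no polygon matched
joined = pd.DataFrame({"GEN": ["Köln", None, "Rostock"]}, index=[1, 2, 3])

# Keep one match per bridge in case a point fell into several polygons
joined = joined[~joined.index.duplicated(keep="first")]

# Write the found names back by index label; unmatched rows become 'Unknown'
data.loc[joined.index, "Kreis"] = joined["GEN"].fillna("Unknown")
print(data["Kreis"].tolist())
```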


In [None]:
# normalized district names
districts = sorted(germany_districts['GEN'].unique())
#print("Known districts:", len(districts))

# check whether there are districts that are contained in each other
#for i in range(len(districts)):
#    for j in range(len(districts)):
#        if i != j and districts[i].lower() in districts[j].lower():
#            print(f"District '{districts[i]}' is contained in '{districts[j]}'")

# normalize names using germany_districts
found = False
for index, row in data_filled.iterrows():
    district_name = row['Kreis'].lower()
    found = False

    if 'amberg-sulzbach' in district_name:
        data_filled.at[index, 'Kreis'] = 'Amberg-Sulzbach'
        continue
    elif 'bamberg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Bamberg'
        continue
    elif 'amberg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Amberg'
        continue
    elif 'dieburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Darmstadt-Dieburg'
        continue
    elif 'darmstadt' in district_name:
        data_filled.at[index, 'Kreis'] = 'Darmstadt'
        continue
    elif 'erlangen-höchstadt' in district_name:
        data_filled.at[index, 'Kreis'] = 'Erlangen-Höchstadt'
        continue
    elif 'erlangen' in district_name:
        data_filled.at[index, 'Kreis'] = 'Erlangen'
        continue
    elif 'schleswig-flensburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Schleswig-Flensburg'
        continue
    elif 'flensburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Flensburg'
        continue
    elif 'nordfriesland' in district_name:
        data_filled.at[index, 'Kreis'] = 'Nordfriesland'
        continue
    elif 'friesland' in district_name:
        data_filled.at[index, 'Kreis'] = 'Friesland'
        continue
    elif 'groß - gerau' in district_name:
        data_filled.at[index, 'Kreis'] = 'Groß-Gerau'
        continue
    elif 'gera' in district_name:
        data_filled.at[index, 'Kreis'] = 'Gera'
        continue
    elif 'mansfeld-südharz' in district_name:
        data_filled.at[index, 'Kreis'] = 'Mansfeld-Südharz'
        continue
    elif 'harz' in district_name:
        data_filled.at[index, 'Kreis'] = 'Harz'
        continue
    elif 'pfaffenhofen a.d.ilm' in district_name:
        data_filled.at[index, 'Kreis'] = 'Pfaffenhofen'
        continue
    elif 'hof' in district_name:
        data_filled.at[index, 'Kreis'] = 'Hof'
        continue
    elif 'mayen-koblenz' in district_name:
        data_filled.at[index, 'Kreis'] = 'Mayen-Koblenz'
        continue
    elif 'koblenz' in district_name:
        data_filled.at[index, 'Kreis'] = 'Koblenz'
        continue
    elif 'mainz-bingen' in district_name:
        data_filled.at[index, 'Kreis'] = 'Mainz-Bingen'
        continue
    elif 'mainz' in district_name:
        data_filled.at[index, 'Kreis'] = 'Mainz'
        continue
    elif 'neumünster' in district_name:
        data_filled.at[index, 'Kreis'] = 'Neumünster'
        continue
    elif 'münster' in district_name:
        data_filled.at[index, 'Kreis'] = 'Münster'
        continue
    elif 'nürnberger land' in district_name:
        data_filled.at[index, 'Kreis'] = 'Nürnberger Land'
        continue
    elif 'nürnberg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Nürnberg'
        continue
    elif 'offenbach am main' in district_name:
        data_filled.at[index, 'Kreis'] = 'Offenbach am Main'
        continue
    elif 'offenbach' in district_name:
        data_filled.at[index, 'Kreis'] = 'Offenbach'
        continue
    elif 'oldenburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Oldenburg'
        continue
    elif 'potsdam-mittelmark' in district_name:
        data_filled.at[index, 'Kreis'] = 'Potsdam-Mittelmark'
        continue
    elif 'potsdam' in district_name:
        data_filled.at[index, 'Kreis'] = 'Potsdam'
        continue
    elif 'ostprignitz-ruppin' in district_name:
        data_filled.at[index, 'Kreis'] = 'Ostprignitz-Ruppin'
        continue
    elif 'prignitz' in district_name:
        data_filled.at[index, 'Kreis'] = 'Prignitz'
        continue
    elif 'regensburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Regensburg'
        continue
    elif 'regen' in district_name:
        data_filled.at[index, 'Kreis'] = 'Regen'
        continue
    elif 'straubing-bogen' in district_name:
        data_filled.at[index, 'Kreis'] = 'Straubing-Bogen'
        continue
    elif 'straubing' in district_name:
        data_filled.at[index, 'Kreis'] = 'Straubing'
        continue
    elif 'trier-saarburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Trier-Saarburg'
        continue
    elif 'trier' in district_name:
        data_filled.at[index, 'Kreis'] = 'Trier'
        continue
    elif 'kulmbach' in district_name:
        data_filled.at[index, 'Kreis'] = 'Kulmbach'
        continue
    elif 'neu-ulm' in district_name:
        data_filled.at[index, 'Kreis'] = 'Neu-Ulm'
        continue
    elif 'ulm' in district_name:
        data_filled.at[index, 'Kreis'] = 'Ulm'
        continue
    elif 'weimarer land' in district_name:
        data_filled.at[index, 'Kreis'] = 'Weimarer Land'
        continue
    elif 'weimar' in district_name:
        data_filled.at[index, 'Kreis'] = 'Weimar'
        continue
    elif 'alzey-worms' in district_name:
        data_filled.at[index, 'Kreis'] = 'Alzey-Worms'
        continue
    elif 'worms' in district_name:
        data_filled.at[index, 'Kreis'] = 'Worms'
        continue
    elif 'hannover' in district_name: 
        data_filled.at[index, 'Kreis'] = 'Region Hannover'
        continue
    elif 'dessau-rosslau, stadt' in district_name:
        data_filled.at[index, 'Kreis'] = 'Dessau-Roßlau'
        continue
    elif 'kyffhaeuserkreis' in district_name:
        data_filled.at[index, 'Kreis'] = 'Kyffhäuserkreis'
        continue
    elif 'landkreis soemmerda' in district_name:
        data_filled.at[index, 'Kreis'] = 'Sömmerda'
        continue
    elif 'landkreis altenburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Altenburger Land'
        continue
    elif 'landkreis marburg - biedenkopf' in district_name:
        data_filled.at[index, 'Kreis'] = 'Marburg-Biedenkopf'
        continue
    elif 'altenkirchen' in district_name:
        data_filled.at[index, 'Kreis'] = 'Altenkirchen'
        continue
    elif 'lahn - dill - kreis' in district_name:
        data_filled.at[index, 'Kreis'] = 'Lahn-Dill-Kreis'
        continue
    elif 'freiburg' in district_name:
        data_filled.at[index, 'Kreis'] = 'Freiburg im Breisgau'
        continue
    elif 'landkreis mühldorf a.inn' in district_name:
        data_filled.at[index, 'Kreis'] = 'Mühldorf'
        continue
    elif 'main - kinzig - kreis' in district_name:
        data_filled.at[index, 'Kreis'] = 'Main-Kinzig-Kreis'
        continue
    elif 'rheingau - taunus - kreis' in district_name:
        data_filled.at[index, 'Kreis'] = 'Rheingau-Taunus-Kreis'
        continue
    elif 'main - taunus - kreis' in district_name:
        data_filled.at[index, 'Kreis'] = 'Main-Taunus-Kreis'
        continue
    elif 'ostalb' in district_name:
        data_filled.at[index, 'Kreis'] = 'Ostalbkreis'
        continue
    elif 'kreisfreie stadt frankfurt main' in district_name:
        data_filled.at[index, 'Kreis'] = 'Frankfurt am Main'
        continue
    elif 'landkreis neumarkt i.d.opf.' in district_name:
        data_filled.at[index, 'Kreis'] = 'Neumarkt'
        continue
    elif 'landkreis wunsiedel i.fichtelgebirge' in district_name:
        data_filled.at[index, 'Kreis'] = 'Wunsiedel'
        continue
    elif 'südliche weinstrasse' in district_name:
        data_filled.at[index, 'Kreis'] = 'Südliche Weinstraße'
        continue
    elif 'kfr. stadt landau (pfalz)' in district_name:
        data_filled.at[index, 'Kreis'] = 'Landau'
        continue
    elif 'landkreis neustadt a.d.waldnaab' in district_name:
        data_filled.at[index, 'Kreis'] = 'Neustadt an der Waldnaab'
        continue
    elif 'landkreis bergstrasse' in district_name:
        data_filled.at[index, 'Kreis'] = 'Bergstraße'
        continue
    elif 'regionalverband saarbruecken' in district_name:
        data_filled.at[index, 'Kreis'] = 'Regionalverband Saarbrücken'
        continue
    elif 'landkreis neustadt a.d. aisch - bad windsheim' in district_name:
        data_filled.at[index, 'Kreis'] = 'Neustadt an der Aisch-Bad Windsheim'
        continue
    elif 'kfr. stadt neustadt/weinstraße' in district_name:
        data_filled.at[index, 'Kreis'] = 'Neustadt an der Weinstraße'
        continue
    elif 'stadt weiden i.d.opf.' in district_name:
        data_filled.at[index, 'Kreis'] = 'Weiden'
        continue
    elif 'kfr. stadt ludwigshafen' in district_name:
        data_filled.at[index, 'Kreis'] = 'Ludwigshafen am Rhein'
        continue
    elif 'landkreis dillingen a.d.donau' in district_name:
        data_filled.at[index, 'Kreis'] = 'Dillingen'
        continue
    for district in districts:
        if district.lower() in district_name.lower():
            data_filled.at[index, 'Kreis'] = district
            found = True
            break
    if not found: 
        print(f"No match found for district name: {district_name}")
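
The long `elif` chain depends on its order: a more specific pattern such as `'amberg-sulzbach'` must be tested before a shorter one such as `'amberg'` that it contains. The same idea can be written as an ordered list of rules. The following is a sketch with only a few of the rules above; the names `RULES` and `normalize_district` are hypothetical and not part of the notebook.

```python
# Ordered (substring, canonical name) pairs; more specific patterns come first.
# Only a small subset of the rules from the cell above is shown.
RULES = [
    ("amberg-sulzbach", "Amberg-Sulzbach"),
    ("bamberg", "Bamberg"),
    ("amberg", "Amberg"),
    ("kulmbach", "Kulmbach"),
    ("neu-ulm", "Neu-Ulm"),
    ("ulm", "Ulm"),
]


def normalize_district(raw_name, rules, known_districts):
    """Return the canonical district name, or None if nothing matches."""
    lowered = raw_name.lower()
    # 1) explicit rules, checked in order
    for pattern, canonical in rules:
        if pattern in lowered:
            return canonical
    # 2) fall back to the list of known district names
    for district in known_districts:
        if district.lower() in lowered:
            return district
    return None


known = ["Kassel", "Rostock"]  # invented stand-in for the GeoJSON district list
for name in ["Landkreis Amberg-Sulzbach", "Stadt Bamberg", "Landkreis Kulmbach",
             "Landkreis Kassel", "Somewhere Else"]:
    print(name, "->", normalize_district(name, RULES, known))
```

Unlike the cell above, which keeps the original value and prints a message when nothing matches, this sketch returns `None` and leaves that decision to the caller.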

In [218]:
# save filled data
data_filled.to_csv('../data/filled_bridge_statistic_germany.csv', sep=';', index=False)

In [None]:
# columns with missing values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
print("Columns with missing values:", cols_with_missing)

Columns with missing values: ['X', 'Y']
534
5166
