#### Fill missing values

This notebook fills missing values. Using the notebook `explore_reduce_data.ipynb`, we identified the following columns that contain missing values:
- year of construction of the substructure (`Baujahr Unterbau`)
- width (`Breite (m)`)
- district (`Kreis`)
- name of the state (`Bundeslandname`)
- X-coordinate (`X`)
- Y-coordinate (`Y`)
  
We load the reduced data set `reduced_bridge_statistic_germany.csv`, fill missing values where possible, remove the rows that cannot be filled, and store the preprocessed data set as `filled_bridge_statistic_germany.csv`.



In [40]:
# load libraries 
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import numpy as np

In [41]:
# read data
data = pd.read_csv('../data/reduced_bridge_statistic_germany.csv', sep=';')

We fill a missing year of construction of the substructure (`Baujahr Unterbau`) with the year of construction of the superstructure (`Baujahr Überbau`) of the same bridge.

In [42]:
# copy data
data_filled = data.copy()

# use 'Baujahr Überbau' for 'Baujahr Unterbau' in case it is missing
data_filled.fillna({'Baujahr Unterbau': data_filled['Baujahr Überbau']}, inplace=True)
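
The effect of this fill can be checked on a toy frame. The following is a minimal sketch with invented values; only the two column names are taken from the data set.

```python
import numpy as np
import pandas as pd

# toy data (values are invented for illustration)
toy = pd.DataFrame({
    'Baujahr Überbau': [1965, 1980, 2001],
    'Baujahr Unterbau': [1960, np.nan, np.nan],
})

# keep the substructure year where it is known,
# otherwise fall back to the superstructure year of the same row
toy['Baujahr Unterbau'] = toy['Baujahr Unterbau'].fillna(toy['Baujahr Überbau'])

print(toy['Baujahr Unterbau'].tolist())
```

Rows that already have a substructure year are left unchanged; only the missing entries take the value from the other column.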

Only one bridge has a missing width, so we remove it.

In [43]:
# remove bridge with unknown width
data_filled = data_filled.dropna(subset=['Breite (m)'])

Next, we normalize the state names (for example, `Freistaat Bayern` becomes `Bayern`). Some bridges also have no state. To fill it, we look for another bridge in the same district (`Kreis`) whose state name (`Bundeslandname`) is known and copy that name.

In [44]:
# print unique values in 'Bundeslandname' to verify
print("Unique values in 'Bundeslandname':", data_filled['Bundeslandname'].unique())

# normalize state names: remove spaces around hyphens and drop prefixes and suffixes
state_names = {
    'Schleswig - Holstein': 'Schleswig-Holstein',
    'Rheinland - Pfalz': 'Rheinland-Pfalz',
    'Freie u. Hansestadt Hamburg': 'Hamburg',
    'Freie Hansestadt Bremen': 'Bremen',
    'Nordrhein-Westfalen (NRW)': 'Nordrhein-Westfalen',
    'Freistaat Sachsen': 'Sachsen',
    'Freistaat Bayern': 'Bayern',
    'Land Baden-Württemberg': 'Baden-Württemberg',
}
data_filled['Bundeslandname'] = data_filled['Bundeslandname'].replace(state_names)

# print unique values in 'Bundeslandname' to verify
print("Unique values in 'Bundeslandname':", data_filled['Bundeslandname'].unique())

Unique values in 'Bundeslandname': ['Schleswig - Holstein' 'Mecklenburg-Vorpommern' nan 'Niedersachsen'
 'Freie u. Hansestadt Hamburg' 'Freie Hansestadt Bremen' 'Brandenburg'
 'Sachsen-Anhalt' 'Nordrhein-Westfalen (NRW)' 'Berlin' 'Hessen'
 'Thueringen' 'Freistaat Sachsen' 'Rheinland - Pfalz' 'Freistaat Bayern'
 'Land Baden-Württemberg' 'Saarland']
Unique values in 'Bundeslandname': ['Schleswig-Holstein' 'Mecklenburg-Vorpommern' nan 'Niedersachsen'
 'Hamburg' 'Bremen' 'Brandenburg' 'Sachsen-Anhalt' 'Nordrhein-Westfalen'
 'Berlin' 'Hessen' 'Thueringen' 'Sachsen' 'Rheinland-Pfalz' 'Bayern'
 'Baden-Württemberg' 'Saarland']


In [45]:
# try to fill 'Bundeslandname' based on another row with same 'Kreis'
for index, row in data_filled.iterrows(): 
    if pd.isnull(row['Bundeslandname']): 
        district = row['Kreis']
        matching_rows = data_filled[data_filled['Kreis'] == district]
        for _, match_row in matching_rows.iterrows():
            if pd.notnull(match_row['Bundeslandname']): 
                data_filled.at[index, 'Bundeslandname'] = match_row['Bundeslandname']
                break
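
On larger data sets a row-wise loop like this can be slow. A vectorized formulation, sketched here on invented toy data, takes the first known state per district with `groupby(...).transform('first')` and uses it only where the state is missing. This is an assumed equivalent, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# toy data (invented): each district has one row with a known state
toy = pd.DataFrame({
    'Kreis': ['Kiel', 'Kiel', 'Stade', 'Stade'],
    'Bundeslandname': ['Schleswig-Holstein', np.nan, np.nan, 'Niedersachsen'],
})

# first non-missing state within each district, aligned to the original rows
state_per_district = toy.groupby('Kreis')['Bundeslandname'].transform('first')

# only rows with a missing state take the district-level value
toy['Bundeslandname'] = toy['Bundeslandname'].fillna(state_per_district)

print(toy['Bundeslandname'].tolist())
```

Districts without any known state stay missing, as they do in the loop.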

The next step is to clean the districts (`Kreis`). First, some district names are written in upper case letters; we convert them to title case (first letter of each word capitalized, the rest lower case). Second, we use the GeoJSON file (`data/districts.geojson`) to determine missing districts from the bridge coordinates. Third, we normalize the district names. For simplicity, we do not distinguish between `Landkreis` and `Kreisfreie Stadt`.

The GeoJSON file distinguishes between the following districts, a distinction we do not make in our data set:
- Offenbach am Main & Offenbach (we only have Offenbach)
- Bremerhaven & Bremen (we only have Bremen)

In [46]:
data_districts = data_filled.copy()

# print unique values contained in district column
print(data_districts['Kreis'].unique())

['Kreis Nordfriesland' 'Stadt Flensburg' 'Kreis Schleswig-Flensburg'
 'Kreis Rendsburg-Eckernförde' 'Landkreis Vorpommern-Rügen'
 'Landeshauptstadt Kiel' 'Kreis Plön' 'Kreis Ostholstein'
 'Kreis Dithmarschen' 'Landkreis Rostock' 'Rostock, Hansestadt'
 'Landkreis Vorpommern-Greifswald' 'Kreis Steinburg' 'Stadt Neumünster'
 'Kreis Segeberg' 'Hansestadt Lübeck' 'Landkreis Nordwestmecklenburg'
 'Landkreis Mecklenburgische Seenplatte' 'Cuxhaven' 'Kreis Pinneberg'
 'Kreis Stormarn' 'Stade' 'Kreis Herzogtum Lauenburg'
 'Landkreis Ludwigslust-Parchim' 'Aurich' nan 'Hamburg'
 'Schwerin, Landeshauptstadt' 'Friesland' 'Wilhelmshaven' 'Wesermarsch'
 'Bremen' 'Rotenburg (Wümme)' 'Harburg' 'Landkreis Uckermark' 'Emden'
 'Osterholz' 'Lüneburg' 'Landkreis Prignitz' 'Leer' 'Ammerland'
 'Cloppenburg' 'Oldenburg' 'Heidekreis' 'Uelzen' 'Lüneburg (ehem. RegBez)'
 'Lüchow-Dannenberg' 'Landkreis Ostprignitz-Ruppin' 'Landkreis Oberhavel'
 'Emsland' 'Delmenhorst' 'Diepholz' 'Verden' 'Altmarkkreis Salzwedel'
 '

In [47]:
# convert district names to title case (first letter of each word upper case, the rest lower case)
data_districts['Kreis'] = data_districts['Kreis'].str.title()

# for 'Bundeslandname' == 'Berlin' there are no districts available -> set 'Kreis' to 'Berlin'
data_districts.loc[data_districts['Bundeslandname'] == 'Berlin', 'Kreis'] = 'Berlin'
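
`str.title()` capitalizes the first letter of every word and lowercases the rest, so abbreviations change as well. The short sketch below uses invented example strings; it may also explain why a value such as `Nrw` shows up in the `Kreis` column after this step.

```python
import pandas as pd

# invented example values, including an abbreviation and a missing entry
names = pd.Series(['LANDKREIS ROSTOCK', 'Kreis Plön', 'NRW', None])

# title case per word; missing values stay missing
titled = names.str.title()

print(titled.tolist())
```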


In [48]:
# load the districts of Germany from the GeoJSON file
germany_districts = gpd.read_file('../data/districts.geojson')

# select rows whose district is missing or wrongly contains a country or state name
invalid_districts = ['Bundesrepublik Deutschland', 'Mecklenburg Vorpommern',
                     'Niedersachsen', 'Nrw', 'Saarland']
data_missing_districts = data_districts[data_districts['Kreis'].isnull() |
                                        data_districts['Kreis'].isin(invalid_districts)].copy()

# create a GeoDataFrame from the data DataFrame
geometry = [Point(xy) for xy in zip(data_missing_districts['X'], data_missing_districts['Y'])]
geo_data_missing = gpd.GeoDataFrame(data_missing_districts, geometry=geometry, crs="EPSG:4326")

# perform spatial join to find the district for each point
joined = gpd.sjoin(geo_data_missing, germany_districts, how="left", predicate="within")

# print(joined[['BEZ', 'GEN']])
# BEZ: Landkreis, Kreisfreie Stadt, Kreis
# GEN: Name of the district

# update the 'Kreis' column in the original data DataFrame
for index, row in joined.iterrows():
    original_index = row.name
    if pd.notnull(row['GEN']):
        data_districts.at[original_index, 'Kreis'] = row['GEN']
    else:
        data_districts.at[original_index, 'Kreis'] = np.nan
        print(f"Could not find district for index {original_index} with coordinates ({row['X']}, {row['Y']})")
        print(f"Point geometry: {row['geometry']}\n")

Could not find district for index 11835 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 11836 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 11837 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 11869 with coordinates (nan, nan)
Point geometry: POINT EMPTY

Could not find district for index 48521 with coordinates (nan, nan)
Point geometry: POINT EMPTY



Five bridges still have no known district because their coordinates are missing. We remove them from our data set.

In [49]:
# remove bridges with unknown district
data_districts = data_districts.dropna(subset=['Kreis'])

In [50]:
data_normalized = data_districts.copy()

# normalized district names
districts = sorted(germany_districts['GEN'].unique())
# print("Known districts:", len(districts))

# check whether there are districts that are contained in each other
# for i in range(len(districts)):
#     for j in range(len(districts)):
#         if i != j and districts[i].lower() in districts[j].lower():
#             print(f"District '{districts[i]}' is contained in '{districts[j]}'")

# normalize names: hard-coded special cases first, then substring match against the GeoJSON district names
for index, row in data_normalized.iterrows():
    district_name = row['Kreis'].lower()
    found = False

    if 'amberg-sulzbach' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Amberg-Sulzbach'
        continue
    elif 'bamberg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Bamberg'
        continue
    elif 'amberg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Amberg'
        continue
    elif 'dieburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Darmstadt-Dieburg'
        continue
    elif 'darmstadt' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Darmstadt'
        continue
    elif 'erlangen-höchstadt' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Erlangen-Höchstadt'
        continue
    elif 'erlangen' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Erlangen'
        continue
    elif 'schleswig-flensburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Schleswig-Flensburg'
        continue
    elif 'flensburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Flensburg'
        continue
    elif 'nordfriesland' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Nordfriesland'
        continue
    elif 'friesland' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Friesland'
        continue
    elif 'groß - gerau' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Groß-Gerau'
        continue
    elif 'gera' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Gera'
        continue
    elif 'mansfeld-südharz' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Mansfeld-Südharz'
        continue
    elif 'harz' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Harz'
        continue
    elif 'pfaffenhofen a.d.ilm' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Pfaffenhofen'
        continue
    elif 'hof' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Hof'
        continue
    elif 'mayen-koblenz' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Mayen-Koblenz'
        continue
    elif 'koblenz' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Koblenz'
        continue
    elif 'mainz-bingen' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Mainz-Bingen'
        continue
    elif 'mainz' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Mainz'
        continue
    elif 'neumünster' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Neumünster'
        continue
    elif 'münster' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Münster'
        continue
    elif 'nürnberger land' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Nürnberger Land'
        continue
    elif 'nürnberg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Nürnberg'
        continue
    elif 'offenbach am main' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Offenbach am Main'
        continue
    elif 'offenbach' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Offenbach'
        continue
    elif 'oldenburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Oldenburg'
        continue
    elif 'potsdam-mittelmark' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Potsdam-Mittelmark'
        continue
    elif 'potsdam' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Potsdam'
        continue
    elif 'ostprignitz-ruppin' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Ostprignitz-Ruppin'
        continue
    elif 'prignitz' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Prignitz'
        continue
    elif 'regensburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Regensburg'
        continue
    elif 'regen' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Regen'
        continue
    elif 'straubing-bogen' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Straubing-Bogen'
        continue
    elif 'straubing' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Straubing'
        continue
    elif 'trier-saarburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Trier-Saarburg'
        continue
    elif 'trier' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Trier'
        continue
    elif 'kulmbach' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Kulmbach'
        continue
    elif 'neu-ulm' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Neu-Ulm'
        continue
    elif 'ulm' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Ulm'
        continue
    elif 'weimarer land' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Weimarer Land'
        continue
    elif 'weimar' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Weimar'
        continue
    elif 'alzey-worms' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Alzey-Worms'
        continue
    elif 'worms' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Worms'
        continue
    elif 'hannover' in district_name: 
        data_normalized.at[index, 'Kreis'] = 'Region Hannover'
        continue
    elif 'dessau-rosslau, stadt' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Dessau-Roßlau'
        continue
    elif 'kyffhaeuserkreis' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Kyffhäuserkreis'
        continue
    elif 'landkreis soemmerda' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Sömmerda'
        continue
    elif 'landkreis altenburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Altenburger Land'
        continue
    elif 'landkreis marburg - biedenkopf' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Marburg-Biedenkopf'
        continue
    elif 'altenkirchen' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Altenkirchen'
        continue
    elif 'lahn - dill - kreis' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Lahn-Dill-Kreis'
        continue
    elif 'freiburg' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Freiburg im Breisgau'
        continue
    elif 'landkreis mühldorf a.inn' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Mühldorf'
        continue
    elif 'main - kinzig - kreis' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Main-Kinzig-Kreis'
        continue
    elif 'rheingau - taunus - kreis' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Rheingau-Taunus-Kreis'
        continue
    elif 'main - taunus - kreis' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Main-Taunus-Kreis'
        continue
    elif 'ostalb' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Ostalbkreis'
        continue
    elif 'kreisfreie stadt frankfurt main' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Frankfurt am Main'
        continue
    elif 'landkreis neumarkt i.d.opf.' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Neumarkt'
        continue
    elif 'landkreis wunsiedel i.fichtelgebirge' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Wunsiedel'
        continue
    elif 'südliche weinstrasse' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Südliche Weinstraße'
        continue
    elif 'kfr. stadt landau (pfalz)' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Landau'
        continue
    elif 'landkreis neustadt a.d.waldnaab' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Neustadt an der Waldnaab'
        continue
    elif 'landkreis bergstrasse' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Bergstraße'
        continue
    elif 'regionalverband saarbruecken' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Regionalverband Saarbrücken'
        continue
    elif 'landkreis neustadt a.d. aisch - bad windsheim' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Neustadt an der Aisch-Bad Windsheim'
        continue
    elif 'kfr. stadt neustadt/weinstraße' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Neustadt an der Weinstraße'
        continue
    elif 'stadt weiden i.d.opf.' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Weiden'
        continue
    elif 'kfr. stadt ludwigshafen' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Ludwigshafen am Rhein'
        continue
    elif 'landkreis dillingen a.d.donau' in district_name:
        data_normalized.at[index, 'Kreis'] = 'Dillingen'
        continue
    for district in districts:
        if district.lower() in district_name.lower():
            data_normalized.at[index, 'Kreis'] = district
            found = True
            break
    if not found: 
        print(f"No match found for district name: {district_name}")
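
The order of the checks above matters: a longer name such as `amberg-sulzbach` has to be tested before the shorter `amberg` that it contains, and `bamberg` and `kulmbach` contain `amberg` and `ulm` as well. The same idea can be written as an ordered rule table. The following is a sketch only; `RULES`, `normalize_district` and the shortened rule list are illustrative names and not part of the notebook.

```python
# ordered (substring, canonical name) rules; more specific patterns come first
RULES = [
    ('amberg-sulzbach', 'Amberg-Sulzbach'),
    ('bamberg', 'Bamberg'),
    ('amberg', 'Amberg'),
    ('kulmbach', 'Kulmbach'),
    ('neu-ulm', 'Neu-Ulm'),
    ('ulm', 'Ulm'),
]


def normalize_district(raw_name, rules=RULES, known_districts=()):
    """Return the canonical district name, or None if nothing matches."""
    name = raw_name.lower()
    # special cases: the first rule whose substring occurs in the name wins
    for pattern, canonical in rules:
        if pattern in name:
            return canonical
    # fallback: any known district name contained in the raw name
    for district in known_districts:
        if district.lower() in name:
            return district
    return None


print(normalize_district('Landkreis Amberg-Sulzbach'))
print(normalize_district('Stadt Bamberg'))
print(normalize_district('Kreis Plön', known_districts=['Plön']))
```

Keeping the rules in a list makes the precedence explicit and lets new special cases be added without another `elif` branch.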

We remove all bridges with unknown coordinates (534), because coordinates are essential for our visualisation.

In [51]:
data_final = data_normalized.copy()

# delete bridges with unknown coordinates
data_final = data_final.dropna(subset=["X"])

In [52]:
# summarize the bridges with Traglastindex equal to 0 (undefined)
undefined = data_final[data_final["Traglastindex"] == 0]
print("Statistics for bridges with Traglastindex equal to 0 (undefined)")
print(f"Number: {len(undefined)}")
years = sorted(undefined["Baujahr Überbau"].unique())
print(f"Unique years of construction: {years}")
conditions = sorted(undefined["Zustandsnote"].unique())
print(f"Unique conditions: {conditions}")
# lengths[1] and widths[1] below are the second-smallest values
lengths = sorted(undefined["Länge (m)"].unique())
print(f"Min length: {lengths[1]} & Max length: {lengths[-1]}")
widths = sorted(undefined["Breite (m)"].unique())
print(f"Min width: {widths[1]} & Max width: {widths[-1]}")
autobahn = undefined[undefined["Zugeordneter Sachverhalt vereinfacht"] == "A"]
print(f"Number of Autobahn bridges: {len(autobahn)}")
bundesstrasse = undefined[undefined["Zugeordneter Sachverhalt vereinfacht"] == "B"]
print(f"Number of Bundesstraße bridges: {len(bundesstrasse)}")
state = sorted(undefined["Bundeslandname"].unique())
print(f"Unique state names: {state}")
material = sorted(undefined["Baustoffklasse"].unique())
print(f"Unique materials: {material}")

Statistics for bridges with Traglastindex equal to 0 (undefined)
Number: 5067
Unique years of construction: [1000, 1800, 1845, 1877, 1878, 1880, 1892, 1897, 1898, 1900, 1903, 1904, 1905, 1907, 1910, 1911, 1912, 1920, 1927, 1930, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1947, 1948, 1949, 1950, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
Unique conditions: [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8, 3.9, 4.0]
Min length: 2.07 & Max length: 456.0
Min width: 0.3 & Max width: 138.0
Number of Autobahn bridges: 1452


In the original data set, the `Traglastindex` consists of the following symbols: 
- `I`
- `II`
- `III`
- `IV`
- `V`
- `-`
- `kZN`
- `GR`
- `>GR`
- `*`

`I` to `V` were converted into `1` to `5`, whereas all remaining symbols were replaced by `0`. Because no other feature shows a visible pattern that would explain the undefined `Traglastindex` symbols, we remove those 5067 bridges from our data set.
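
The conversion described above can be sketched as a lookup table. The code below is an illustration on invented values, not the code that built the reduced data set; `ROMAN_TO_INT` and `raw` are hypothetical names.

```python
import pandas as pd

# Roman numerals map to 1..5; every other symbol is treated as undefined (0)
ROMAN_TO_INT = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5}

# invented sample of raw symbols
raw = pd.Series(['I', 'IV', '-', 'kZN', '>GR', 'V', '*'])

# symbols missing from the lookup become NaN and are then replaced by 0
traglastindex = raw.map(ROMAN_TO_INT).fillna(0).astype(int)

print(traglastindex.tolist())
```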

In [53]:
# remove bridges with Traglastindex = 0
data_final = data_final.query('Traglastindex != 0')

len(data_final)

46952

Finally, we save the complete data set with 46952 bridges as `filled_bridge_statistic_germany.csv` in our `data` directory.

In [54]:
# save filled data
data_final.to_csv('../data/filled_bridge_statistic_germany.csv', sep=';', index=False)