#### Explore our data set and remove unnecessary columns

This notebook explores the original data set (`original_bridge_statistic_germany.csv`) column by column, restricts it to the columns used in the subsequent analysis, and identifies which columns contain missing values.
We reduce the data to the following columns: 
- `Bauwerksname`: categorical
- `Baujahr Überbau`: 1000 -> 2024
- `Baujahr Unterbau`: 1000 -> 2024 (61 missing values)
- `Zustandsnote`: 1.0 -> 4.0
- `Baustoffklasse`: categorical (7)
- `Baustoff Überbau`: categorical (45)
- `Länge (m)`: 2.0 -> 1800.0
- `Breite (m)`: 0.0 -> 1056.7 (1 missing value)
- `Zugeordneter Sachverhalt`: categorical (806)
- `Zugeordneter Sachverhalt vereinfacht`: categorical (2)
- `Traglastindex`: categorical (10)
- `Teilbauwerksstadium`: categorical (20)
- `Teilbauwerksart`: categorical (42)
- `Kreis`: categorical (45 missing values)
- `Bundeslandname`: categorical (763 missing values)
- `x2`: 5.87689718 -> 15.00696414 (539 missing values)
- `y2`: 47.38716274 -> 54.90381663 (539 missing values)

Additionally, the columns `x2` and `y2` are renamed to `X` and `Y`, and the `Traglastindex` is mapped to numbers as follows:
- `I` -> 1
- `II` -> 2
- `III` -> 3
- `IV` -> 4
- `V` -> 5
- anything else (including missing values) -> 0

The reduced data set is stored as `reduced_bridge_statistic_germany.csv`.


In [1]:
# load libraries
import pandas as pd
import numpy as np

In [2]:
# read original data
data_original = pd.read_csv('../../data/original_bridge_statistic_germany.csv', sep=';')

First, we print each column's name, its type, the number of missing values, and either the number of unique entries (categorical columns) or the value range (numeric columns). This overview helps us find the interesting and relevant columns.
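For a quick cross-check of the missing-value and unique counts, a similar overview can be sketched with built-in pandas methods. The toy frame and the names `toy` and `overview` are hypothetical and only stand in for the bridge data; this is an illustration, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bridge data.
toy = pd.DataFrame({
    'Baujahr Unterbau': [1975.0, np.nan, 2001.0],
    'Baustoffklasse': ['Beton', 'Stahl', 'Beton'],
})

# One row per column: dtype, number of missing values,
# and number of distinct non-null values.
overview = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'missing': toy.isna().sum(),
    'unique': toy.nunique(),
})
print(overview)
```

Unlike the function below, this sketch does not report value ranges, so it complements rather than replaces it.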

In [3]:
# function to print summarized information about individual columns
# @param data [dataframe]
def column_information(data): 
    # number of rows
    nrws = len(data)
    for col in data.columns: 
        current = data[col]
        non_null = current.dropna()
        col_type = current.dtype
        if pd.api.types.is_numeric_dtype(current): 
            mn = non_null.min() if not non_null.empty else np.nan
            mx = non_null.max() if not non_null.empty else np.nan
            print(f"{col}: numeric ({col_type}) | # of missing values: {nrws-len(non_null)} | range: {mn} -> {mx}")
        elif pd.api.types.is_datetime64_any_dtype(current): 
            # datetime columns: report the earliest and latest timestamp
            mn = non_null.min() if not non_null.empty else pd.NaT
            mx = non_null.max() if not non_null.empty else pd.NaT
            print(f"{col}: datetime ({col_type}) | # of missing values: {nrws-len(non_null)} | range: {mn} -> {mx}")
        else: 
            nunique = non_null.nunique()
            print(f"{col}: categorical ({col_type}) | # of missing values: {nrws-len(non_null)} | unique: {nunique}")

column_information(data_original)

ObjectID: numeric (int64) | # of missing values: 0 | range: 1 -> 52559
Bauwerksname: categorical (object) | # of missing values: 0 | unique: 47302
Bauwerksnummer: numeric (int64) | # of missing values: 0 | range: 1019500 -> 8627501
Teilbauwerksnummer: categorical (object) | # of missing values: 0 | unique: 80
ID Nr: categorical (object) | # of missing values: 0 | unique: 52559
Baujahr Überbau: numeric (int64) | # of missing values: 0 | range: 1000 -> 2024
Baujahr Unterbau: numeric (float64) | # of missing values: 61 | range: 1000.0 -> 2024.0
Altersklasse: categorical (object) | # of missing values: 0 | unique: 20
Zustandsnotenklasse: categorical (object) | # of missing values: 0 | unique: 6
Zustandsnote: categorical (object) | # of missing values: 0 | unique: 31
Baustoffklasse: categorical (object) | # of missing values: 0 | unique: 7
Baustoff Überbau: categorical (object) | # of missing values: 0 | unique: 45
Länge (m): categorical (object) | # of missing values: 0 | unique: 8422
Läng

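Several columns that should be numeric are read as `object` because their values use a decimal comma. We therefore replace `,` with `.` and convert the result to float in the following columns: `Zustandsnote`, `Länge (m)`, `Breite (m)`, `X`, `Y`, `x2` and `y2`. In addition, the two construction-year columns are cast to the nullable integer type `Int64`. Afterwards, we print the column information again.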
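As a possible alternative to converting `,` to `.` after loading, `pd.read_csv` accepts a `decimal` argument that handles the decimal comma at read time. The sketch below uses a hypothetical two-row sample rather than the real file, and it assumes that every value in the affected columns is a well-formed number; columns containing other strings would still be read as `object`.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the original file:
# semicolon-separated fields with decimal commas.
sample = io.StringIO(
    "Bauwerksname;Zustandsnote;Länge (m)\n"
    "Brücke A;2,3;12,5\n"
    "Brücke B;1,9;48,0\n"
)

# decimal=',' tells the parser to treat ',' as the decimal separator,
# so the numeric columns are parsed as floats directly.
df_sample = pd.read_csv(sample, sep=';', decimal=',')
print(df_sample.dtypes)
```

This notebook keeps the explicit string replacement, which makes the affected columns visible in the code.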

In [4]:
data_modified = data_original.copy()

# Baujahr Überbau and Baujahr Unterbau -> Integer, no decimals
data_modified['Baujahr Überbau'] = data_modified['Baujahr Überbau'].astype('Int64')
data_modified['Baujahr Unterbau'] = data_modified['Baujahr Unterbau'].astype('Int64')

# "," to "." and convert to float
decimal_comma_columns = ['Zustandsnote', 'Länge (m)', 'Breite (m)', 'X', 'Y', 'x2', 'y2']
for col in decimal_comma_columns:
    data_modified[col] = data_modified[col].str.replace(',', '.').astype(float)

# print column information
column_information(data_modified)

ObjectID: numeric (int64) | # of missing values: 0 | range: 1 -> 52559
Bauwerksname: categorical (object) | # of missing values: 0 | unique: 47302
Bauwerksnummer: numeric (int64) | # of missing values: 0 | range: 1019500 -> 8627501
Teilbauwerksnummer: categorical (object) | # of missing values: 0 | unique: 80
ID Nr: categorical (object) | # of missing values: 0 | unique: 52559
Baujahr Überbau: numeric (Int64) | # of missing values: 0 | range: 1000 -> 2024
Baujahr Unterbau: numeric (Int64) | # of missing values: 61 | range: 1000 -> 2024
Altersklasse: categorical (object) | # of missing values: 0 | unique: 20
Zustandsnotenklasse: categorical (object) | # of missing values: 0 | unique: 6
Zustandsnote: numeric (float64) | # of missing values: 0 | range: 1.0 -> 4.0
Baustoffklasse: categorical (object) | # of missing values: 0 | unique: 7
Baustoff Überbau: categorical (object) | # of missing values: 0 | unique: 45
Länge (m): numeric (float64) | # of missing values: 0 | range: 2.0 -> 1800.0
L

In [5]:
# select relevant columns; .copy() makes data_reduced an independent frame,
# so later column assignments do not trigger a SettingWithCopyWarning
data_reduced = data_modified[['Bauwerksname', 'Baujahr Überbau', 'Baujahr Unterbau', 'Zustandsnote',  'Baustoffklasse', 'Baustoff Überbau', 
             'Länge (m)', 'Breite (m)', 'Zugeordneter Sachverhalt', 'Zugeordneter Sachverhalt vereinfacht', 'Traglastindex', 'Teilbauwerksstadium',
             'Teilbauwerksart', 'Kreis', 'Bundeslandname', 'x2', 'y2']].copy()

# rename 'x2' and 'y2' columns
data_reduced = data_reduced.rename(columns={'x2': 'X', 'y2': 'Y'})

As mentioned before, the `Traglastindex` is mapped to the numbers 1 to 5. Although the column contains 10 distinct values, we could not determine what the remaining abbreviations (`-`, `kZN`, `GR`, `*`, `>GR`) mean. We only found the following [article](https://www.bast.de/DE/Themen/Infrastruktur/HF_2/Massnahmen/Traglastindex.html?nn=417492), which describes the indices `I` to `V`. All other values are therefore mapped to 0.

In [6]:
# print counts of unique values in Traglastindex
print(data_reduced['Traglastindex'].value_counts())

# Traglastindex: I -> 1, II -> 2, III -> 3, IV -> 4, V -> 5, Rest -> 0
data_reduced['Traglastindex'] = data_reduced['Traglastindex'].map({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5}).fillna(0).astype(int)

# print counts of unique values in Traglastindex
print(data_reduced['Traglastindex'].value_counts())

Traglastindex
II     17546
III    12941
I      10996
IV      3521
V       2384
-       2367
kZN     1533
GR       817
*        357
>GR       95
Name: count, dtype: int64
Traglastindex
2    17546
3    12941
1    10996
0     5171
4     3521
5     2384
Name: count, dtype: int64
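
In the output above, the five unmapped labels sum to 2367 + 1533 + 817 + 357 + 95 = 5169, while the `0` bucket holds 5171 rows. The difference of 2 would be consistent with two missing values in `Traglastindex`, because `value_counts()` omits `NaN` by default while `fillna(0)` assigns them to 0. The sketch below illustrates this behaviour on a hypothetical toy column (`toy` is not part of the data set).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: two mapped labels, one unmapped label, one missing value.
toy = pd.Series(['I', 'II', '-', np.nan], name='Traglastindex')

# value_counts() skips NaN by default; dropna=False lists it as its own entry.
counts_with_nan = toy.value_counts(dropna=False)
n_missing = int(toy.isna().sum())

# After the mapping, both the unmapped label and the missing value end up as 0.
mapped = toy.map({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5}).fillna(0).astype(int)
n_zero = int((mapped == 0).sum())

print(counts_with_nan)
print(n_missing, n_zero)
```

Running `data_reduced['Traglastindex'].isna().sum()` before the mapping would confirm or refute this explanation for the real data.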


Finally, we save the processed data set (`reduced_bridge_statistic_germany.csv`) in our `data` directory.

In [None]:
# save reduced data
data_reduced.to_csv('../../data/reduced_bridge_statistic_germany.csv', sep=';')