# Berlin Transport Network - Data Enrichment

This notebook enriches Berlin public transport data for a single year and one side of the city (East or West). It performs the following steps:

1. **Configuration**: Set up year and side to process
2. **Data Loading**: Load base data and intermediate files
3. **Line Enrichment**: Add profile and capacity information to lines
4. **Administrative Data**: Add district/neighborhood information to stops
5. **Postal Code Data**: Add postal code information to stops
6. **Line-Stop Relationships**: Process relationships between lines and stops
7. **Data Finalization**: Finalize and save processed data
8. **Reference Data**: Update the reference stations dataset

Most of the implementation logic lives in the `src.enricher` and `src.table_creation` modules, which the cells below import.

In [11]:
import sys
import pandas as pd
import geopandas as gpd
import logging
from pathlib import Path

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Make the parent directory importable so that the `src` package resolves
sys.path.append('..')
from src import enricher

## 1. Configuration

Set up the year and side (east/west) to process, and define paths to data files.

In [12]:
# Configuration
YEAR = 1965
SIDE = "west"

# Set up paths
BASE_DIR = Path('../data')
paths = {
    'base_dir': BASE_DIR,
    'raw_dir': BASE_DIR / 'raw',
    'interim_dir': BASE_DIR / 'interim',
    'processed_dir': BASE_DIR / 'processed',
    'geo_data_dir': BASE_DIR / 'data-external',
    'existing_stations': BASE_DIR / 'processed' / 'existing_stations.csv'
}
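Because every later step reads from these directories, it can help to check that they exist before loading anything. The following is a minimal sketch, not part of `enricher`: the helper name `missing_inputs` and the choice of required keys are assumptions made for illustration.

```python
from pathlib import Path


def missing_inputs(paths: dict, required=("raw_dir", "interim_dir", "geo_data_dir")) -> list:
    """Return the keys in `required` that are absent from `paths` or not on disk."""
    return [
        key for key in required
        if key not in paths or not Path(paths[key]).exists()
    ]


# Intended use with the `paths` dict from the configuration cell:
#
#     missing = missing_inputs(paths)
#     if missing:
#         raise FileNotFoundError(f"Missing input directories: {missing}")
```

Failing here gives a clearer message than a file error raised from deep inside the loading code.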

## 2. Data Loading

Load the raw and intermediate data files required for processing.

In [13]:
# Load data
try:
    line_df_initial, final_stops, existing_stations_df = enricher.load_data(paths, YEAR, SIDE)
except Exception as e:
    logger.error(f"Error loading data: {e}")
    raise

2025-03-01 20:43:08,878 - INFO - Loaded base data: 102 lines
2025-03-01 20:43:08,891 - INFO - Loaded verified stops: 1066 stops
2025-03-01 20:43:08,896 - INFO - Loaded existing stations: 1107 stations


## 3. Line Enrichment

Enrich line data with profile and capacity information based on transport type.

In [14]:
# Enrich lines with profile and capacity information
line_df = enricher.enrich_lines(line_df_initial, SIDE)

# Display a sample of the enriched lines
line_df.head()

2025-03-01 20:43:08,917 - INFO - Enriched lines with profile and capacity information


| | line_id | year | line_name | type | start_stop | length (time) | length (km) | east_west | frequency (7:30) | profile | capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19651_west | 1965 | 15 | tram | Marienfelde, Daimlerstrasse<> Schulenburgpark | 36.0 | | west | 10.0 | | 195 |
| 1 | 19652_west | 1965 | 47 | tram | Gradestrasse Ecke Tempelhofer Weg<> Groß-Zieth... | 21.0 | | west | 10.0 | | 195 |
| 2 | 19653_west | 1965 | 47P | tram | Groß-Ziethener-Chaussee Ecke Waltersdorferchau... | 6.0 | | west | 20.0 | | 195 |
| 3 | 19654_west | 1965 | 53 | tram | Richard-Wagner-Platz<> Hakenfelde, Niederneuen... | 40.0 | | west | 20.0 | | 195 |
| 4 | 19655_west | 1965 | 54 | tram | Richard-Wagner-Platz<> Spandau, Johannesstift | 41.0 | | west | 20.0 | | 195 |
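The internals of `enricher.enrich_lines` are not shown in this notebook. As a rough sketch of what a lookup "based on transport type" could look like, the example below maps the `type` column through a dictionary. The function name `add_capacity` and the lookup table are assumptions; only the tram value (195) is taken from the output above.

```python
import pandas as pd

# Assumed lookup table. Only the tram value (195) comes from the notebook
# output; entries for the other transport types would have to be filled in.
CAPACITY_BY_TYPE = {"tram": 195}


def add_capacity(lines: pd.DataFrame, capacity_by_type: dict) -> pd.DataFrame:
    """Return a copy of `lines` with `capacity` looked up from the `type` column."""
    out = lines.copy()
    out["capacity"] = out["type"].map(capacity_by_type)
    # Fail loudly instead of silently writing NaN for an unmapped type.
    unknown = sorted(set(out.loc[out["capacity"].isna(), "type"].astype(str)))
    if unknown:
        raise ValueError(f"No capacity defined for transport types: {unknown}")
    out["capacity"] = out["capacity"].astype(int)
    return out


sample = pd.DataFrame({"line_name": ["15", "47"], "type": ["tram", "tram"]})
enriched = add_capacity(sample, CAPACITY_BY_TYPE)
```

Raising on unmapped types is a design choice for this sketch: a missing capacity then surfaces at enrichment time, not later as a NaN in the network metrics.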


## 4. Administrative Data

Add district and neighborhood information to stops based on their geographic location.

In [None]:
# Load district data
districts_gdf, west_berlin_districts = enricher.load_district_data(paths['geo_data_dir'])

# Add administrative data
if districts_gdf is not None and west_berlin_districts is not None:
    enriched_stops_df = enricher.add_administrative_data(SIDE, final_stops, districts_gdf, west_berlin_districts)

    logger.info("Enriched stops created, not saved")
else:
    logger.warning("Could not load district data, skipping administrative enrichment")
    enriched_stops_df = final_stops

# Display a sample of the enriched stops
enriched_stops_df.head()

2025-03-01 20:43:09,333 - INFO - Loaded district data: 96 districts
2025-03-01 20:43:09,335 - INFO - Loaded 53 West Berlin districts
2025-03-01 20:43:09,355 - INFO - Created GeoDataFrame with 1066 valid geometries from 1066 total stops
2025-03-01 20:43:09,373 - INFO - Added administrative data to 1066 stops
2025-03-01 20:43:09,379 - INFO - Enriched stops saved to interim/stops_enriched directory


| | stop_name | type | line_name | stop_id | location | identifier | neighbourhood | district | east_west |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Marienfelde, Daimlerstrasse | tram | 15 | 19650 | 52.42393712,13.38022295 | | Marienfelde | Tempelhof-Schöneberg | west |
| 1 | Großbeerenstrasse Ecke Daimlerstrasse | tram | 15 | 19651 | 52.42636276,13.37438168 | | Marienfelde | Tempelhof-Schöneberg | west |
| 2 | Körtingstrasse Ecke Großbeerenstrasse | tram | 15 | 19652 | 52.43481353,13.37831564 | | Mariendorf | Tempelhof-Schöneberg | west |
| 3 | Mariendorferdamm Ecke Alt-Mariendorf | tram | 15 | 19653 | 52.44016815,13.38730997 | | Mariendorf | Tempelhof-Schöneberg | west |
| 4 | Imbrosweg Ecke Rixdorferstrasse | tram | 15 | 19654 | 52.44352627,13.39862544 | | Mariendorf | Tempelhof-Schöneberg | west |
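The log lines mention a GeoDataFrame, which suggests that `add_administrative_data` relies on a spatial join (in geopandas this is typically done with `sjoin`), but its implementation is not shown here. The standard-library sketch below only illustrates the underlying idea: parse the `location` string into coordinates and test which polygon contains the point. All function names are hypothetical, and the polygon is a made-up bounding box, not a real district boundary.

```python
def parse_location(location: str) -> tuple:
    """Split a 'lat,lon' string such as '52.42393712,13.38022295' into floats."""
    lat, lon = (float(part) for part in location.split(","))
    return lat, lon


def point_in_polygon(lat: float, lon: float, polygon: list) -> bool:
    """Ray-casting test; `polygon` is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


def assign_area(location: str, areas: dict):
    """Return the name of the first area containing the stop, or None."""
    lat, lon = parse_location(location)
    for name, polygon in areas.items():
        if point_in_polygon(lat, lon, polygon):
            return name
    return None


# Made-up rectangle for illustration only (not a real district boundary).
areas = {"box_a": [(52.42, 13.37), (52.42, 13.39), (52.43, 13.39), (52.43, 13.37)]}
first_stop_area = assign_area("52.42393712,13.38022295", areas)
third_stop_area = assign_area("52.43481353,13.37831564", areas)
```

Treating latitude and longitude as planar coordinates is acceptable for a containment test at city scale; distance or area calculations would need a projected coordinate reference system.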


## 5. Postal Code Data

Add postal code information to stops based on their geographic location.

In [16]:
# Add postal code data
enriched_stops_df = enricher.add_postal_code_data(
    enriched_stops_df,
    geo_data_dir=paths['geo_data_dir']
)

# Display a sample of the enriched stops
enriched_stops_df.head()

2025-03-01 20:43:09,390 - INFO - Loading postal code data from local file ..\data\data-external\berlin_postal_codes.geojson
2025-03-01 20:43:09,586 - INFO - Loaded postal code data: 193 areas
2025-03-01 20:43:09,594 - INFO - Created GeoDataFrame with 1066 valid geometries from 1066 total stops
2025-03-01 20:43:09,601 - INFO - Added postal codes to 1066 out of 1066 stops


| | stop_name | type | line_name | stop_id | location | identifier | neighbourhood | district | east_west | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marienfelde, Daimlerstrasse | tram | 15 | 19650 | 52.42393712,13.38022295 | | Marienfelde | Tempelhof-Schöneberg | west | 12277 |
| 1 | Großbeerenstrasse Ecke Daimlerstrasse | tram | 15 | 19651 | 52.42636276,13.37438168 | | Marienfelde | Tempelhof-Schöneberg | west | 12277 |
| 2 | Körtingstrasse Ecke Großbeerenstrasse | tram | 15 | 19652 | 52.43481353,13.37831564 | | Mariendorf | Tempelhof-Schöneberg | west | 12107 |
| 3 | Mariendorferdamm Ecke Alt-Mariendorf | tram | 15 | 19653 | 52.44016815,13.38730997 | | Mariendorf | Tempelhof-Schöneberg | west | 12107 |
| 4 | Imbrosweg Ecke Rixdorferstrasse | tram | 15 | 19654 | 52.44352627,13.39862544 | | Mariendorf | Tempelhof-Schöneberg | west | 12109 |
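The log reports how many stops received a postal code. A similar check can be run directly on the DataFrame to list the stops that did not match any postal code area. This is a small sketch with a hypothetical helper name; the second row of the sample data is invented to show the unmatched case.

```python
import pandas as pd


def postal_code_coverage(stops: pd.DataFrame, column: str = "postal_code"):
    """Return (matched, total, stop names that have no postal code)."""
    missing = stops[column].isna()
    unmatched = stops.loc[missing, "stop_name"].tolist()
    return int((~missing).sum()), len(stops), unmatched


stops = pd.DataFrame(
    {
        # The second stop is invented for illustration.
        "stop_name": ["Marienfelde, Daimlerstrasse", "Hypothetical stop outside all areas"],
        "postal_code": ["12277", None],
    }
)
matched, total, unmatched = postal_code_coverage(stops)
print(f"Added postal codes to {matched} out of {total} stops")
# prints: Added postal codes to 1 out of 2 stops
```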


## 6. Line-Stop Relationships

Process the relationships between lines and stops: create a line-stops DataFrame from the raw data, then add stop foreign keys that link each entry to a row in the enriched stops.

In [17]:
# Load the raw CSV for the configured year and side
raw_df = pd.read_csv(paths['raw_dir'] / f"{YEAR}_{SIDE}.csv")

In [18]:
from src import table_creation

# Create line-stops DataFrame
line_stops = table_creation.create_line_stops_df(raw_df)

# Add stop foreign keys
line_stops = table_creation.add_stop_foreign_keys(line_stops, enriched_stops_df, YEAR, SIDE)

# Display a sample of the line-stops relationships
line_stops.head()

2025-03-01 20:43:09,710 - INFO - Added stop foreign keys to 1396 line-stop relationships


| | stop_order | stop_id | line_id |
|---|---|---|---|
| 0 | 0 | 19650 | 19651_west |
| 1 | 1 | 19651 | 19651_west |
| 2 | 2 | 19652 | 19651_west |
| 3 | 3 | 19653 | 19651_west |
| 4 | 4 | 19654 | 19651_west |
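How `table_creation.add_stop_foreign_keys` matches entries to stops is not shown in this notebook. One plausible approach, sketched below under the assumption that a stop is identified by its name together with its line name, is a left merge that fails on unmatched or ambiguous keys. The function name `attach_stop_ids` and the input column layout are hypothetical.

```python
import pandas as pd


def attach_stop_ids(line_stops: pd.DataFrame, stops: pd.DataFrame) -> pd.DataFrame:
    """Look up `stop_id` for each line-stop entry by (stop_name, line_name)."""
    keys = ["stop_name", "line_name"]  # assumed to identify a stop uniquely
    merged = line_stops.merge(
        stops[keys + ["stop_id"]],
        on=keys,
        how="left",
        validate="many_to_one",  # raises if `stops` contains duplicate keys
        indicator=True,          # adds a `_merge` column marking unmatched rows
    )
    unmatched = merged.loc[merged["_merge"] == "left_only", "stop_name"].tolist()
    if unmatched:
        raise KeyError(f"No stop_id found for: {unmatched}")
    merged["stop_id"] = merged["stop_id"].astype(int)
    return merged[["stop_order", "stop_id", "line_id"]]


names = ["Marienfelde, Daimlerstrasse", "Großbeerenstrasse Ecke Daimlerstrasse"]
stops = pd.DataFrame({"stop_name": names, "line_name": [15, 15], "stop_id": [19650, 19651]})
line_stops = pd.DataFrame(
    {
        "stop_order": [0, 1],
        "stop_name": names,
        "line_name": [15, 15],
        "line_id": ["19651_west", "19651_west"],
    }
)
linked = attach_stop_ids(line_stops, stops)
```

The `validate` and `indicator` options turn two silent failure modes (duplicated stops and stops without a match) into explicit errors.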


## 7. Data Finalization

Finalize and save the processed data to the output directory.

In [19]:
# Finalize data
final_line_df, final_stops_df, final_line_stops_df = table_creation.finalize_data(
    line_df, enriched_stops_df, line_stops
)

# Save final data
table_creation.save_data(paths, final_line_df, final_stops_df, final_line_stops_df, YEAR, SIDE)

2025-03-01 20:43:09,727 - INFO - Finalized data: 102 lines, 1066 stops, 1396 line-stop relationships
2025-03-01 20:43:09,739 - INFO - Saved processed data to ..\data\processed\1965_west
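Before relying on the saved tables, it can be worth confirming that every line-stop relationship points at a line and a stop that actually exist. The check below is a minimal sketch and is not part of `table_creation`; the function name and the tuple layout `(stop_order, stop_id, line_id)` are assumptions based on the columns shown earlier.

```python
def dangling_references(line_ids, stop_ids, line_stops):
    """Return line-stop rows that point at a missing line or stop.

    `line_stops` is an iterable of (stop_order, stop_id, line_id) tuples.
    """
    known_lines, known_stops = set(line_ids), set(stop_ids)
    return [
        row for row in line_stops
        if row[1] not in known_stops or row[2] not in known_lines
    ]


# Small example using identifiers from the sample output above.
rows = [(0, 19650, "19651_west"), (1, 19651, "19651_west")]
ok = dangling_references(["19651_west"], [19650, 19651], rows)
broken = dangling_references(["19651_west"], [19650], rows)
```

With DataFrames, the rows could be produced by selecting the three columns in that order and calling `itertuples(index=False)`.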


## Summary

Print a summary of the processed data and next steps.

In [20]:
# Print summary
table_creation.print_summary(YEAR, SIDE, final_line_df, final_stops_df, final_line_stops_df, paths)


ENRICHMENT SUMMARY: 1965 WEST

Processed data summary:
  - Lines: 102
  - Stops: 1066
  - Line-stop connections: 1396

Transport type distribution:
  - autobus: 73 lines
  - s-bahn: 10 lines
  - tram: 9 lines
  - u-bahn: 9 lines
  - ferry: 1 lines

Geographic distribution:
  - West: 1066 stops

Data completeness:
  - Stops with location: 1066 (100.0%)

Data saved to: ..\data\processed\1965_west

Next steps:
  1. Analyze the processed data to understand network structure
  2. Run network metrics to compare East and West Berlin
  3. Create visualizations of the transport network

