# Berlin Transport Network - Data Enrichment

This notebook processes and enriches Berlin public transportation data for a specific year and side (East/West). It performs the following steps:

1. **Configuration**: Set up year and side to process
2. **Data Loading**: Load base data and intermediate files
3. **Line Enrichment**: Add profile and capacity information to lines
4. **Administrative Data**: Add district/neighborhood information to stops
5. **Postal Code Data**: Add postal code information to stops
6. **Line-Stop Relationships**: Process relationships between lines and stops
7. **Data Finalization**: Finalize and save processed data
8. **Reference Data**: Update the reference stations dataset

Most of the implementation logic lives in the `src.enricher` and `src.table_creation` modules, which the cells below import.

In [11]:
import sys
import pandas as pd
import logging
from pathlib import Path

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Import modules
sys.path.append('..')
from src import enricher

## 1. Configuration

Set up the year and side (east/west) to process, and define paths to data files.

In [12]:
# Configuration
YEAR = 1965
SIDE = "east"

# Set up paths
BASE_DIR = Path('../data')
paths = {
    'base_dir': BASE_DIR,
    'raw_dir': BASE_DIR / 'raw',
    'interim_dir': BASE_DIR / 'interim',
    'processed_dir': BASE_DIR / 'processed',
    'geo_data_dir': BASE_DIR / 'data-external',
    'existing_stations': BASE_DIR / 'processed' / 'existing_stations.csv'
}
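
Every later cell derives file names from `YEAR` and `SIDE`, so a typo in either value tends to surface only as a file-not-found error further down. The following is a minimal sketch of an early sanity check. The `validate_config` helper and the `VALID_SIDES` set are illustrative assumptions and are not part of `src.enricher`; only the `<YEAR>_<SIDE>.csv` naming pattern is taken from this notebook.

```python
from pathlib import Path

VALID_SIDES = {"east", "west"}  # assumption: the only two sides used in this project


def validate_config(year: int, side: str, base_dir: Path) -> Path:
    """Return the expected raw CSV path after basic sanity checks (hypothetical helper)."""
    if side not in VALID_SIDES:
        raise ValueError(f"SIDE must be one of {sorted(VALID_SIDES)}, got {side!r}")
    if not isinstance(year, int):
        raise TypeError(f"YEAR must be an int, got {type(year).__name__}")
    # File naming follows the pattern used later in this notebook: <YEAR>_<SIDE>.csv
    return base_dir / "raw" / f"{year}_{side}.csv"


raw_path = validate_config(1965, "east", Path("../data"))
print(raw_path.name)
```

Running a check like this directly after the configuration cell would reject values such as `"East"` before any data is loaded.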

## 2. Data Loading

Load the raw and intermediate data files required for processing.

In [13]:
# Load data
try:
    line_df_initial, final_stops, existing_stations_df = enricher.load_data(paths, YEAR, SIDE)
except Exception as e:
    logger.error(f"Error loading data: {e}")
    raise

2025-04-06 14:55:23,074 - INFO - Loaded base data: 74 lines
2025-04-06 14:55:23,100 - INFO - Loaded verified stops: 780 stops
2025-04-06 14:55:23,114 - INFO - Loaded existing stations: 2138 stations


## 3. Line Enrichment

Enrich line data with profile and capacity information based on transport type.

In [14]:
# Enrich lines with profile and capacity information
line_df = enricher.enrich_lines(line_df_initial, SIDE)

# Display a sample of the enriched lines
line_df.head()

2025-04-06 14:55:23,142 - INFO - Enriched lines with profile and capacity information


| Unnamed: 0 | line_id | year | line_name | type | start_stop | length (time) | length (km) | east_west | frequency (7:30) | profile | capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19651_east | 1965 | A | u-bahn | Pankow (Vinetastr.)<> Thälmannplatz | 19.0 | 7.5 | east | 5.0 | Kleinprofil | 750 |
| 1 | 19652_east | 1965 | E | u-bahn | Alexanderplatz<> Friedrichsfelde (Tierpark) | 16.0 | 7.1 | east | 5.0 | Großprofil | 1000 |
| 2 | 19653_east | 1965 | 1 | tram | Adalbertstrasse Ecke Köpenickerstrasse<> Walte... | 27.0 | 7.3 | east | 20.0 |  | 195 |
| 3 | 19654_east | 1965 | A 1 P | autobus | Köpenickerstrasse Ecke Adalbertstrasse<> Alexa... | 6.0 | 2.0 | east | 10.0 |  | 0 |
| 4 | 19655_east | 1965 | 3 | tram | Revalerstrasse (S-Bhf. Warschauer Strasse)<> B... | 43.0 | 11.4 | east | 12.0 |  | 195 |
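
The actual rules are implemented in `enricher.enrich_lines`. As a rough illustration of how a type-based lookup could produce the `profile` and `capacity` columns, here is a minimal sketch. The lookup values are copied from the five sample rows shown in this section (750 and 1000 for the two U-Bahn profiles, 195 for trams, 0 for the autobus row). The function name `enrich_lines_sketch`, the dictionaries, and the fallback to 0 are assumptions, and the real module may use different tables.

```python
import pandas as pd

# Values taken from the sample output in this section; the real tables in
# src.enricher may differ or cover more cases (assumption).
UBAHN_PROFILE = {"A": "Kleinprofil", "E": "Großprofil"}
CAPACITY_BY_PROFILE = {"Kleinprofil": 750, "Großprofil": 1000}
CAPACITY_BY_TYPE = {"tram": 195}


def enrich_lines_sketch(lines: pd.DataFrame) -> pd.DataFrame:
    """Add 'profile' and 'capacity' columns using simple lookups (illustrative only)."""
    out = lines.copy()
    is_ubahn = out["type"].eq("u-bahn")
    # Profile is only defined for U-Bahn lines; other types stay missing.
    out["profile"] = out["line_name"].where(is_ubahn).map(UBAHN_PROFILE)
    # Prefer the profile-based capacity, then the type-based one, then 0.
    by_profile = out["profile"].map(CAPACITY_BY_PROFILE)
    by_type = out["type"].map(CAPACITY_BY_TYPE)
    out["capacity"] = by_profile.fillna(by_type).fillna(0).astype(int)
    return out


sample = pd.DataFrame(
    {
        "line_name": ["A", "E", "1", "A 1 P"],
        "type": ["u-bahn", "u-bahn", "tram", "autobus"],
    }
)
enriched_sample = enrich_lines_sketch(sample)
print(enriched_sample[["line_name", "type", "profile", "capacity"]])
```

Keeping the lookups in plain dictionaries makes it easy to review the assumed capacities in one place.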


## 4. Administrative Data

Add district and neighborhood information to stops based on their geographic location.

In [15]:
# Load district data
districts_gdf, west_berlin_districts = enricher.load_district_data(paths['geo_data_dir'])

# Add administrative data
if districts_gdf is not None and west_berlin_districts is not None:
    enriched_stops_df = enricher.add_administrative_data(SIDE, final_stops, districts_gdf, west_berlin_districts)

    logger.info("Enriched stops created, not saved")
else:
    logger.warning("Could not load district data, skipping administrative enrichment")
    enriched_stops_df = final_stops

# Display a sample of the enriched stops
enriched_stops_df.head()

2025-04-06 14:55:23,841 - INFO - Loaded district data: 96 districts
2025-04-06 14:55:23,844 - INFO - Loaded 53 West Berlin districts
2025-04-06 14:55:23,857 - INFO - Created GeoDataFrame with 745 valid geometries from 780 total stops
2025-04-06 14:55:23,874 - INFO - Added administrative data to 745 stops
2025-04-06 14:55:23,875 - INFO - Enriched stops created, not saved


| Unnamed: 0 | stop_name | type | line_name | stop_id | location | identifier | neighbourhood | district | east_west |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pankow (Vinetastr.) | u-bahn | A | 19650_east | 52.55916667,13.41333333 | Q570906 | Pankow | Pankow | east |
| 1 | Schönhauser Allee | u-bahn | A | 19651_east | 52.54932889,13.41370611 | Q47014936 | Prenzlauer Berg | Pankow | east |
| 2 | Dimitroffstr. | u-bahn | A | 19652_east | 52.54166667,13.41222222 | Q571382 | Prenzlauer Berg | Pankow | east |
| 3 | Senefelderplatz | u-bahn | A | 19653_east | 52.53242000,13.41274300 | Q32661454 | Prenzlauer Berg | Pankow | east |
| 4 | Luxemburgplatz | u-bahn | A | 19654_east | 52.52833333,13.41027778 | Q658790 | Mitte | Mitte | east |
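
The log reports 745 valid geometries out of 780 stops, which suggests that stops without a usable `location` value are dropped before the spatial lookup. The log also mentions a GeoDataFrame, so the real implementation probably relies on geopandas. The dependency-free sketch below only illustrates the two ideas involved: parsing a `"lat,lon"` string defensively, and testing whether a point falls inside a polygon. The helper names and the rectangular toy district are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (lat, lon)


def parse_location(value: object) -> Optional[Point]:
    """Parse a 'lat,lon' string; return None when the value is missing or malformed."""
    if not isinstance(value, str):
        return None
    parts = value.split(",")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: toggle 'inside' for each polygon edge the ray crosses."""
    lat, lon = point
    inside = False
    for i in range(len(polygon)):
        (lat1, lon1), (lat2, lon2) = polygon[i], polygon[(i + 1) % len(polygon)]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


# Hypothetical rectangular "district"; real boundaries come from the district file.
toy_district = [(52.50, 13.35), (52.50, 13.45), (52.60, 13.45), (52.60, 13.35)]
stop = parse_location("52.55916667,13.41333333")  # Pankow (Vinetastr.) from the sample
print(stop is not None and point_in_polygon(stop, toy_district))
```

In practice a spatial join over all stops and district polygons does the same job in one call; the loop here is only meant to make the underlying test visible.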


## 5. Postal Code Data

Add postal code information to stops based on their geographic location.

In [16]:
# Add postal code data
enriched_stops_df = enricher.add_postal_code_data(
    enriched_stops_df, 
    geo_data_dir=paths['geo_data_dir']
)
# Display a sample of the enriched stops
enriched_stops_df.head()

2025-04-06 14:55:23,906 - INFO - Loading postal code data from local file ..\data\data-external\berlin_postal_codes.geojson
2025-04-06 14:55:24,334 - INFO - Loaded postal code data: 193 areas
2025-04-06 14:55:24,343 - INFO - Created GeoDataFrame with 745 valid geometries from 745 total stops
2025-04-06 14:55:24,354 - INFO - Added postal codes to 659 out of 745 stops


| Unnamed: 0 | stop_name | type | line_name | stop_id | location | identifier | neighbourhood | district | east_west | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pankow (Vinetastr.) | u-bahn | A | 19650_east | 52.55916667,13.41333333 | Q570906 | Pankow | Pankow | east | 13189 |
| 1 | Schönhauser Allee | u-bahn | A | 19651_east | 52.54932889,13.41370611 | Q47014936 | Prenzlauer Berg | Pankow | east | 10439 |
| 2 | Dimitroffstr. | u-bahn | A | 19652_east | 52.54166667,13.41222222 | Q571382 | Prenzlauer Berg | Pankow | east | 10437 |
| 3 | Senefelderplatz | u-bahn | A | 19653_east | 52.53242000,13.41274300 | Q32661454 | Prenzlauer Berg | Pankow | east | 10405 |
| 4 | Luxemburgplatz | u-bahn | A | 19654_east | 52.52833333,13.41027778 | Q658790 | Mitte | Mitte | east | 10119 |
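
According to the log, 659 of the 745 stops (about 88%) received a postal code, so some stops remain unmatched. A small coverage check like the one sketched below can list those rows for manual review. The helper name `postal_code_coverage` is an assumption, and the toy data reuses two stop names from the sample plus one invented row with a missing value.

```python
import pandas as pd


def postal_code_coverage(stops: pd.DataFrame) -> tuple[int, int, pd.DataFrame]:
    """Return (matched, total, unmatched rows) for the 'postal_code' column."""
    has_code = stops["postal_code"].notna()
    return int(has_code.sum()), len(stops), stops.loc[~has_code]


# Toy data: the third row is invented to show an unmatched stop.
toy_stops = pd.DataFrame(
    {
        "stop_name": ["Pankow (Vinetastr.)", "Schönhauser Allee", "Example stop"],
        "postal_code": ["13189", "10439", None],
    }
)
matched, total, unmatched = postal_code_coverage(toy_stops)
print(f"Postal codes: {matched}/{total} ({matched / total:.1%})")
print(unmatched["stop_name"].tolist())
```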


## 6. Line-Stop Relationships

Process relationships between lines and stops, including creating a line-stops DataFrame, adding line type information, and adding stop foreign keys.

In [17]:
raw_df = pd.read_csv(paths['raw_dir'] / f"{YEAR}_{SIDE}.csv")

In [18]:
from src import table_creation

# Create line-stops DataFrame
line_stops = table_creation.create_line_stops_df(raw_df)

# Add stop foreign keys
line_stops = table_creation.add_stop_foreign_keys(line_stops, enriched_stops_df, YEAR, SIDE)

# Display a sample of the line-stops relationships
line_stops.head()

2025-04-06 14:55:24,510 - INFO - Added stop foreign keys to 747 line-stop relationships


| Unnamed: 0 | stop_order | stop_id | line_id |
|---|---|---|---|
| 0 | 0 | 19650_east | 19651_east |
| 1 | 1 | 19651_east | 19651_east |
| 2 | 2 | 19652_east | 19651_east |
| 3 | 3 | 19653_east | 19651_east |
| 4 | 4 | 19654_east | 19651_east |
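
How `table_creation.add_stop_foreign_keys` matches rows is not shown in this notebook. One plausible approach, sketched here under that caveat, is a left merge of the line-stop rows against the stops table. The choice of `stop_name`, `type`, and `line_name` as matching keys is an assumption, as are the function name and the invented "Unlisted stop" row.

```python
import pandas as pd


def add_stop_ids_sketch(line_stops: pd.DataFrame, stops: pd.DataFrame) -> pd.DataFrame:
    """Attach stop_id to each line-stop row by matching on name, type and line."""
    keys = ["stop_name", "type", "line_name"]  # assumed matching keys
    return line_stops.merge(
        stops[keys + ["stop_id"]],
        on=keys,
        how="left",              # keep every relationship, even if unmatched
        validate="many_to_one",  # fail loudly if a stop key is duplicated
    )


stops = pd.DataFrame(
    {
        "stop_name": ["Pankow (Vinetastr.)", "Schönhauser Allee"],
        "type": ["u-bahn", "u-bahn"],
        "line_name": ["A", "A"],
        "stop_id": ["19650_east", "19651_east"],
    }
)
line_stops = pd.DataFrame(
    {
        "stop_order": [0, 1, 2],
        "stop_name": ["Pankow (Vinetastr.)", "Schönhauser Allee", "Unlisted stop"],
        "type": ["u-bahn"] * 3,
        "line_name": ["A"] * 3,
        "line_id": ["19651_east"] * 3,
    }
)
result = add_stop_ids_sketch(line_stops, stops)
unmatched = result[result["stop_id"].isna()]
print(result[["stop_order", "stop_id", "line_id"]])
print(f"{len(unmatched)} unmatched row(s)")
```

A left merge keeps unmatched relationships visible as missing `stop_id` values instead of silently dropping them, which makes gaps in the stops table easier to spot.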


## 7. Data Finalization

Finalize and save the processed data to the output directory.

In [19]:
# Finalize data
final_line_df, final_stops_df, final_line_stops_df = table_creation.finalize_data(
    line_df, enriched_stops_df, line_stops
)

# Save final data
table_creation.save_data(paths, final_line_df, final_stops_df, final_line_stops_df, YEAR, SIDE)

2025-04-06 14:55:24,540 - INFO - Finalized data: 74 lines, 745 stops, 747 line-stop relationships
2025-04-06 14:55:24,557 - INFO - Saved processed data to ..\data\processed\1965_east
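
The finalized tables reference each other through `stop_id` and `line_id`, so it can be worth confirming that every line-stop relationship points to rows that exist. The sketch below shows one way to do that. The `find_orphans` helper is hypothetical and not part of `src.table_creation`, and the ID `19659_east` is invented to trigger the check.

```python
import pandas as pd


def find_orphans(
    lines: pd.DataFrame, stops: pd.DataFrame, line_stops: pd.DataFrame
) -> dict:
    """Return IDs used in line_stops that are missing from the lines or stops tables."""
    return {
        "stop_id": sorted(set(line_stops["stop_id"]) - set(stops["stop_id"])),
        "line_id": sorted(set(line_stops["line_id"]) - set(lines["line_id"])),
    }


# Toy tables; "19659_east" is an invented ID with no matching stop.
lines = pd.DataFrame({"line_id": ["19651_east"]})
stops = pd.DataFrame({"stop_id": ["19650_east", "19651_east"]})
line_stops = pd.DataFrame(
    {
        "stop_id": ["19650_east", "19651_east", "19659_east"],
        "line_id": ["19651_east"] * 3,
    }
)
orphans = find_orphans(lines, stops, line_stops)
print(orphans)
```

Two empty lists would indicate that all foreign keys resolve; any listed ID points to a row worth inspecting before the data is used for network analysis.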


## Summary

Print a summary of the processed data and next steps.

In [20]:
# Print summary
table_creation.print_summary(YEAR, SIDE, final_line_df, final_stops_df, final_line_stops_df, paths)


ENRICHMENT SUMMARY: 1965 EAST

Processed data summary:
  - Lines: 74
  - Stops: 745
  - Line-stop connections: 747

Transport type distribution:
  - autobus: 31 lines
  - tram: 28 lines
  - s-bahn: 10 lines
  - omnibus: 3 lines
  - u-bahn: 2 lines

Geographic distribution:
  - East: 745 stops

Data completeness:
  - Stops with location: 745 (100.0%)

Data saved to: ..\data\processed\1965_east

Next steps:
  1. Analyze the processed data to understand network structure
  2. Run network metrics to compare East and West Berlin
  3. Create visualizations of the transport network

