# Berlin Transport Network - Data Enrichment

This notebook processes and enriches Berlin public transportation data for a specific year and side (East/West). It performs the following steps:

1. **Configuration**: Set up year and side to process
2. **Data Loading**: Load base data and intermediate files
3. **Line Enrichment**: Add profile and capacity information to lines
4. **Administrative Data**: Add district/neighborhood information to stops
5. **Postal Code Data**: Add postal code information to stops
6. **Line-Stop Relationships**: Process relationships between lines and stops
7. **Data Finalization**: Finalize and save processed data
8. **Reference Data**: Update the reference stations dataset

Most of the implementation logic lives in the `src.enricher` and `src.table_creation` modules, both of which are imported below.

In [13]:
import sys
import pandas as pd
import logging
from pathlib import Path

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Import modules
sys.path.append('..')
from src import enricher

## 1. Configuration

Set up the year and side (east/west) to process, and define paths to data files.

In [14]:
# Configuration
YEAR = 1964
SIDE = "east"

# Set up paths
BASE_DIR = Path('../data')
paths = {
    'base_dir': BASE_DIR,
    'raw_dir': BASE_DIR / 'raw',
    'interim_dir': BASE_DIR / 'interim',
    'processed_dir': BASE_DIR / 'processed',
    'geo_data_dir': BASE_DIR / 'data-external',
}
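
Before loading anything, it can help to confirm that the configured directories exist. The following is a minimal sketch that assumes the same `paths` layout as above. The helper `find_missing_dirs` is hypothetical and is not part of `src.enricher`.

```python
from pathlib import Path


def find_missing_dirs(paths: dict[str, Path]) -> list[str]:
    """Return the keys of `paths` whose directory does not exist on disk."""
    return [key for key, directory in paths.items() if not Path(directory).is_dir()]


# Same layout as the notebook's `paths` dictionary.
base_dir = Path("../data")
example_paths = {
    "base_dir": base_dir,
    "raw_dir": base_dir / "raw",
    "interim_dir": base_dir / "interim",
    "processed_dir": base_dir / "processed",
    "geo_data_dir": base_dir / "data-external",
}

missing = find_missing_dirs(example_paths)
if missing:
    print(f"Missing directories: {', '.join(missing)}")
```

Returning the keys rather than raising lets the caller decide whether a missing directory is fatal (for example `raw_dir`) or can simply be created (for example `processed_dir`).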

## 2. Data Loading

Load the raw and intermediate data files required for processing.

In [15]:
# Load data
try:
    line_df_initial, final_stops = enricher.load_data(paths, YEAR, SIDE)
except Exception as e:
    logger.error(f"Error loading data: {e}")
    raise

2025-04-28 17:47:06,572 - INFO - Loaded base data: 77 lines
2025-04-28 17:47:06,585 - INFO - Loaded verified stops: 700 stops


## 3. Line Enrichment

Enrich line data with profile and capacity information based on transport type.

In [16]:
# Enrich lines with profile and capacity information
line_df = enricher.enrich_lines(line_df_initial, SIDE)

# Display a sample of the enriched lines
line_df.head()

2025-04-28 17:47:06,652 - INFO - Enriched lines with profile and capacity information


| | line_id | year | line_name | type | start_stop | length (time) | length (km) | east_west | frequency (7:30) | profile | capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19641_east | 1964 | 1 | tram | Ostbahnhof<> Hannoverschestr. | 28.0 | | east | 20.0 | | 195 |
| 1 | 19642_east | 1964 | A1P | autobus | Adalbertstr. Ecke Köpenickerstr.<> Alexanderpl... | 6.0 | | east | 10.0 | | 0 |
| 2 | 19643_east | 1964 | 3 | tram | S-Bhf. Warschauerstr.<> Björnsonstr. | 45.0 | | east | 12.0 | | 195 |
| 3 | 19644_east | 1964 | 4 | tram | S-Bhf. Warschauerstr.<> Eberswalderstr. Ecke O... | 26.0 | | east | 12.0 | | 195 |
| 4 | 19645_east | 1964 | 11 | tram | Heinrich-Heine-Str.<> Am Kupfergraben | 27.0 | | east | 20.0 | | 195 |
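
The implementation of `enricher.enrich_lines` is not shown in this notebook. As a rough illustration of a type-based lookup, the sketch below maps the `type` column to a capacity value. The function name `add_capacity` and the mapping `CAPACITY_BY_TYPE` are assumptions. Only the `tram` and `autobus` values are copied from the sample output above, and all other types are deliberately left missing.

```python
import pandas as pd

# Capacity per transport type. Only these two values appear in the sample
# output above; other types are left out rather than guessed.
CAPACITY_BY_TYPE = {"tram": 195, "autobus": 0}


def add_capacity(lines: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `lines` with a `capacity` column derived from `type`."""
    enriched = lines.copy()
    # Types missing from the mapping become <NA> instead of a made-up number.
    enriched["capacity"] = enriched["type"].map(CAPACITY_BY_TYPE).astype("Int64")
    return enriched


sample = pd.DataFrame(
    {
        "line_id": ["19641_east", "19642_east", "19643_east"],
        "line_name": ["1", "A1P", "3"],
        "type": ["tram", "autobus", "tram"],
    }
)
print(add_capacity(sample))
```

The nullable `Int64` dtype keeps capacities as integers even when some types have no entry in the mapping.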


## 4. Administrative Data

Add district and neighborhood information to stops based on their geographic location.

In [17]:
# Load district data
districts_gdf, west_berlin_districts = enricher.load_district_data(paths['geo_data_dir'])

# Add administrative data
if districts_gdf is not None and west_berlin_districts is not None:
    enriched_stops_df = enricher.add_administrative_data(SIDE, final_stops, districts_gdf, west_berlin_districts)

    logger.info("Enriched stops created, not saved")
else:
    logger.warning("Could not load district data, skipping administrative enrichment")
    enriched_stops_df = final_stops

# Display a sample of the enriched stops
enriched_stops_df.head()

2025-04-28 17:47:07,023 - INFO - Loaded district data: 96 districts
2025-04-28 17:47:07,026 - INFO - Loaded 53 West Berlin districts
2025-04-28 17:47:07,068 - INFO - Created GeoDataFrame with 700 valid geometries from 700 total stops
2025-04-28 17:47:07,089 - INFO - Added administrative data to 700 stops
2025-04-28 17:47:07,091 - INFO - Enriched stops created, not saved


| | stop_name | type | line_name | stop_id | location | identifier | latitude | longitude | match_score | matched_name | matched_stop_id | matched_historical_lines | location_from | neighbourhood | district | east_west |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ostbahnhof | tram | 1 | 19640_east | 52.50910764,13.43550619 | | | | | | | | 1964878.0 | Friedrichshain | Friedrichshain-Kreuzberg | east |
| 1 | U-Bhf. Strausbergerplatz | tram | 1 | 19641_east | 52.51808307,13.43293834 | | 52.518083 | 13.432938 | 100.0 | U-Bhf. Strausbergerplatz | 196524_east | ['1'] | | Friedrichshain | Friedrichshain-Kreuzberg | east |
| 2 | Leninplatz | tram | 1 | 19642_east | 52.52326580,13.43264570 | | 52.523266 | 13.432646 | 100.0 | Leninplatz | 196525_east | ['1'] | | Friedrichshain | Friedrichshain-Kreuzberg | east |
| 3 | Alexanderplatz, Memhardstr. | tram | 1 | 19643_east | 52.52334390,13.41107730 | Q65227406 | | | | | | | 196414.0 | Mitte | Mitte | east |
| 4 | Rosenthalerplatz | tram | 1 | 19644_east | 52.53017690,13.40161210 | Q65093945 | | | | | | | 19641024.0 | Mitte | Mitte | east |
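
The log output mentions a GeoDataFrame, which suggests a geometric lookup, but the implementation of `add_administrative_data` is not shown here. To illustrate the underlying idea (deciding which area contains a stop), the sketch below uses a plain ray-casting point-in-polygon test as a stand-in. The helper names and the rectangular `example_area` are assumptions for illustration. The rectangle is not a real district boundary.

```python
def parse_location(location: str) -> tuple[float, float]:
    """Split a "lat,lon" string into a (latitude, longitude) pair of floats."""
    lat, lon = location.split(",")
    return float(lat), float(lon)


def point_in_polygon(lat: float, lon: float, polygon: list[tuple[float, float]]) -> bool:
    """Ray-casting test: True if (lat, lon) lies inside `polygon`.

    `polygon` is a list of (lat, lon) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


# Hypothetical bounding box, NOT a real district boundary.
example_area = [(52.50, 13.42), (52.50, 13.45), (52.53, 13.45), (52.53, 13.42)]

# Location string of the "Ostbahnhof" stop from the table above.
lat, lon = parse_location("52.50910764,13.43550619")
print(point_in_polygon(lat, lon, example_area))
```

For real district polygons, a spatial join in a GIS library is the more practical choice, since it handles coordinate reference systems and spatial indexing.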


## 5. Postal Code Data

Add postal code information to stops based on their geographic location.

In [18]:
# Add postal code data
enriched_stops_df = enricher.add_postal_code_data(
    enriched_stops_df, 
    geo_data_dir=paths['geo_data_dir']
)
# Display a sample of the enriched stops
enriched_stops_df.head()

2025-04-28 17:47:07,122 - INFO - Loading postal code data from local file ..\data\data-external\berlin_postal_codes.geojson
2025-04-28 17:47:07,326 - INFO - Loaded postal code data: 193 areas
2025-04-28 17:47:07,358 - INFO - Created GeoDataFrame with 700 valid geometries from 700 total stops
2025-04-28 17:47:07,371 - INFO - Added postal codes to 622 out of 700 stops


| | stop_name | type | line_name | stop_id | location | identifier | latitude | longitude | match_score | matched_name | matched_stop_id | matched_historical_lines | location_from | neighbourhood | district | east_west | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ostbahnhof | tram | 1 | 19640_east | 52.50910764,13.43550619 | | | | | | | | 1964878.0 | Friedrichshain | Friedrichshain-Kreuzberg | east | 10243 |
| 1 | U-Bhf. Strausbergerplatz | tram | 1 | 19641_east | 52.51808307,13.43293834 | | 52.518083 | 13.432938 | 100.0 | U-Bhf. Strausbergerplatz | 196524_east | ['1'] | | Friedrichshain | Friedrichshain-Kreuzberg | east | 10243 |
| 2 | Leninplatz | tram | 1 | 19642_east | 52.52326580,13.43264570 | | 52.523266 | 13.432646 | 100.0 | Leninplatz | 196525_east | ['1'] | | Friedrichshain | Friedrichshain-Kreuzberg | east | 10249 |
| 3 | Alexanderplatz, Memhardstr. | tram | 1 | 19643_east | 52.52334390,13.41107730 | Q65227406 | | | | | | | 196414.0 | Mitte | Mitte | east | 10178 |
| 4 | Rosenthalerplatz | tram | 1 | 19644_east | 52.53017690,13.40161210 | Q65093945 | | | | | | | 19641024.0 | Mitte | Mitte | east | 10119 |
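
The log reports postal codes for 622 of 700 stops (about 88.9%), so 78 stops remain without one. A small coverage check makes such gaps easy to inspect. The sketch below assumes that unmatched stops carry a missing value in `postal_code`. The helper `column_coverage` is hypothetical, and the second row of the sample frame is made up to show a missing value.

```python
import pandas as pd


def column_coverage(df: pd.DataFrame, column: str) -> tuple[int, int, float]:
    """Return (filled, total, percent filled) for `column` in `df`."""
    total = len(df)
    filled = int(df[column].notna().sum())
    percent = 100.0 * filled / total if total else 0.0
    return filled, total, percent


# Tiny illustrative frame; the second stop is invented for this example.
stops = pd.DataFrame(
    {
        "stop_name": ["Ostbahnhof", "Example stop without match", "Leninplatz"],
        "postal_code": ["10243", None, "10249"],
    }
)

filled, total, percent = column_coverage(stops, "postal_code")
print(f"Postal codes: {filled} of {total} stops ({percent:.1f}%)")

# Rows that still need attention.
unmatched = stops[stops["postal_code"].isna()]
print(unmatched["stop_name"].tolist())
```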


## 6. Line-Stop Relationships

Process relationships between lines and stops: build a line-stops DataFrame from the raw data, then add stop foreign keys that link each entry to the enriched stops.

In [19]:
# Load the raw CSV for the configured year and side
raw_df = pd.read_csv(paths['raw_dir'] / f"{YEAR}_{SIDE}.csv")

In [20]:
from src import table_creation

# Create line-stops DataFrame
line_stops = table_creation.create_line_stops_df(raw_df)

# Add stop foreign keys
line_stops = table_creation.add_stop_foreign_keys(line_stops, enriched_stops_df, YEAR, SIDE)

# Display a sample of the line-stops relationships
line_stops.head()

2025-04-28 17:47:07,525 - INFO - Added stop foreign keys to 708 line-stop relationships


| | stop_order | stop_id | line_id |
|---|---|---|---|
| 0 | 0 | 19640_east | 19641_east |
| 1 | 1 | 19641_east | 19641_east |
| 2 | 2 | 19642_east | 19641_east |
| 3 | 3 | 19643_east | 19641_east |
| 4 | 4 | 19644_east | 19641_east |
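
Once foreign keys are in place, a quick referential-integrity check can confirm that every `stop_id` in the line-stops table points to an existing stop. This is a sketch only. The helper `find_orphan_stop_ids` is hypothetical and not part of `src.table_creation`, and the last `stop_id` in the sample is made up to trigger the check.

```python
import pandas as pd


def find_orphan_stop_ids(line_stops: pd.DataFrame, stops: pd.DataFrame) -> list[str]:
    """Return stop_ids used in `line_stops` that are absent from `stops`."""
    known = set(stops["stop_id"])
    orphans = line_stops.loc[~line_stops["stop_id"].isin(known), "stop_id"]
    return sorted(orphans.unique())


line_stops_sample = pd.DataFrame(
    {
        "stop_order": [0, 1, 2],
        # The last ID is invented so that the check has something to report.
        "stop_id": ["19640_east", "19641_east", "19649999_east"],
        "line_id": ["19641_east"] * 3,
    }
)
stops_sample = pd.DataFrame({"stop_id": ["19640_east", "19641_east", "19642_east"]})

print(find_orphan_stop_ids(line_stops_sample, stops_sample))
```

An empty result means every line-stop row resolves to a stop in the stops table.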


## 7. Data Finalization

Finalize and save the processed data to the output directory.

In [21]:
# Finalize data
final_line_df, final_stops_df, final_line_stops_df = table_creation.finalize_data(
    line_df, enriched_stops_df, line_stops
)

# Save final data
table_creation.save_data(paths, final_line_df, final_stops_df, final_line_stops_df, YEAR, SIDE)

2025-04-28 17:47:07,549 - INFO - Finalized data: 77 lines, 700 stops, 708 line-stop relationships
2025-04-28 17:47:07,573 - INFO - Saved processed data to ..\data\processed\1964_east


## Summary

Print a summary of the processed data and next steps.

In [22]:
# Print summary
table_creation.print_summary(YEAR, SIDE, final_line_df, final_stops_df, final_line_stops_df, paths)


ENRICHMENT SUMMARY: 1964 EAST

Processed data summary:
  - Lines: 77
  - Stops: 700
  - Line-stop connections: 708

Transport type distribution:
  - autobus: 34 lines
  - tram: 28 lines
  - s-bahn: 10 lines
  - omnibus: 3 lines
  - u-bahn: 2 lines

Geographic distribution:
  - East: 700 stops

Data completeness:
  - Stops with location: 700 (100.0%)

Data saved to: ..\data\processed\1964_east

Next steps:
  1. Analyze the processed data to understand network structure
  2. Run network metrics to compare East and West Berlin
  3. Create visualizations of the transport network
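
As a possible starting point for the network analysis, the line-stops table can be turned into an adjacency structure. The sketch below assumes that consecutive stops on a line (by `stop_order`) are directly connected and that connections are undirected. The helper `build_adjacency` is hypothetical. The rows are copied from the line-stops sample shown earlier.

```python
from collections import defaultdict

# (stop_order, stop_id, line_id) rows from the line-stops sample.
rows = [
    (0, "19640_east", "19641_east"),
    (1, "19641_east", "19641_east"),
    (2, "19642_east", "19641_east"),
    (3, "19643_east", "19641_east"),
    (4, "19644_east", "19641_east"),
]


def build_adjacency(rows):
    """Map each stop_id to the set of stops adjacent to it on any line."""
    by_line = defaultdict(list)
    for stop_order, stop_id, line_id in rows:
        by_line[line_id].append((stop_order, stop_id))

    adjacency = defaultdict(set)
    for stops in by_line.values():
        # Sort by stop_order, then link each stop to its successor.
        ordered = [stop_id for _, stop_id in sorted(stops)]
        for a, b in zip(ordered, ordered[1:]):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


adjacency = build_adjacency(rows)
degree = {stop: len(neighbours) for stop, neighbours in adjacency.items()}
print(degree)
```

In this sample the two terminal stops have degree 1 and the intermediate stops have degree 2. Stops shared by several lines would show higher degrees.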

