In [1]:
# 00_process_data.ipynb

"""
This notebook processes raw Berlin transport data from Fahrplanbücher into structured formats.
"""

import sys
from pathlib import Path
import pandas as pd
import logging

# Add the src directory to the Python path
sys.path.append(str(Path('../src').resolve()))

# Import processing modules
from utils.data_loader import DataLoader, format_line_list
from processor import TransportDataProcessor
from utils.geolocation import StationMatcher

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Berlin Transport Data Processing

This notebook performs the initial extraction and transformation of Berlin's historical public transportation data from raw sources. It represents the first step in our processing pipeline.

## Purpose

1. **Data Extraction**: Load and parse raw data from digitized Fahrplanbücher (timetables)
2. **Initial Structuring**: Convert raw data into structured tables with consistent formats
3. **Station Identification**: Establish unique identifiers for transportation stops
4. **Preliminary Geolocation**: Match stations to known geographic coordinates where possible

## Process Overview

The process follows these key steps:
1. Load raw data from CSV files containing transcribed Fahrplanbuch information
2. Process this data into standardized tables (lines and stops)
3. Match stations with existing station records to obtain geographic coordinates
4. Generate interim data files for subsequent processing stages

## Historical Context

The data represents Berlin's public transportation system during the Cold War era (1945-1989). During this period, Berlin was divided, with separate transportation authorities operating in East and West Berlin. This division is reflected in our data processing approach, where we handle each side separately for each year.

In [2]:
# Configuration
YEAR = 1965
SIDE = "east"  # or "west"
DATA_DIR = Path('../data')

# Initialize loader
loader = DataLoader()

In [3]:
# Load raw transcribed data
raw_data_path = DATA_DIR / 'raw' / f'{YEAR}_{SIDE}.csv'
raw_df = loader.load_raw_data(str(raw_data_path))
logger.info(f"Loaded raw data: {len(raw_df)} lines")

# Display sample of loaded data to verify
print("\nSample of loaded data:")
print(raw_df[['line_name', 'type', 'stops']].head())

2025-04-06 14:54:20,691 - INFO - Loaded raw data: 74 lines



Sample of loaded data:
  line_name     type                                              stops
0         A   u-bahn  Pankow (Vinetastr.) - Schönhauser Allee - Dimi...
1         E   u-bahn  Alexanderplatz - Schillingstr. - Strausberger ...
2         1     tram  Adalbertstrasse Ecke Köpenickerstrasse - Ostba...
3     A 1 P  autobus  Köpenickerstrasse Ecke Adalbertstrasse - S-Bhf...
4         3     tram  Revalerstrasse (S-Bhf. Warschauer Strasse) - K...


## Existing Station Reference Data

To ensure consistency across years and facilitate geolocation, we maintain a reference dataset of known stations. This dataset:

1. Serves as a lookup table for station coordinates
2. Helps standardize station names across different time periods
3. Provides unique identifiers for stations that persist across snapshots
4. Records the lines that serve each station through time

As we process new data, this reference dataset will be expanded with newly identified stations.
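
The `in_lines` column has to survive a round trip through CSV, where a list of line names becomes a plain string. The sketch below shows one way such a cell could be turned back into a list. It assumes the cell holds either a stringified Python list (`"['A', 'E']"`) or a comma-separated string; `parse_line_list` is a hypothetical stand-in, and the project's `format_line_list` may behave differently.

```python
import ast
from typing import List


def parse_line_list(value) -> List[str]:
    """Turn a CSV cell such as "['A', 'E']" or "A, E" into a list of line names.

    Hypothetical helper for illustration; not the project's format_line_list.
    """
    if value is None:
        return []
    text = str(value).strip()
    if not text or text.lower() == "nan":
        # Empty cells are read by pandas as NaN, which stringifies to "nan"
        return []
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Not a Python literal: fall back to a comma-separated reading
        parsed = text.split(",")
    if isinstance(parsed, (list, tuple, set)):
        return [str(item).strip() for item in parsed if str(item).strip()]
    return [str(parsed).strip()]


examples = ["['A', 'E']", "A, E", "1", "A 1 P", float("nan")]
parsed_examples = [parse_line_list(value) for value in examples]
for raw, lines in zip(examples, parsed_examples):
    print(repr(raw), "->", lines)
```

Line names are kept as strings throughout, so numeric tram lines such as `1` and lettered lines such as `A` can share one column.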

In [4]:
# Load existing stations data
existing_stations_path = DATA_DIR / 'processed' / 'existing_stations1.csv'
existing_stations_df = pd.read_csv(existing_stations_path)

# Format line lists in existing stations
existing_stations_df['in_lines'] = existing_stations_df['in_lines'].apply(format_line_list)

logger.info(f"Loaded existing stations: {len(existing_stations_df)} stations")

2025-04-06 14:54:20,750 - INFO - Loaded existing stations: 3224 stations


## Initial Data Processing

The TransportDataProcessor class transforms our raw data into structured tables:

1. **Lines Table**: Contains information about each transportation line
   - Unique identifiers
   - Type (U-Bahn, S-Bahn, tram, bus)
   - Terminal stations
   - Service frequency
   - Journey time and distance

2. **Stops Table**: Contains information about each station
   - Unique identifiers
   - Station names
   - Transportation type
   - Placeholder for geographic coordinates

This structured format facilitates network analysis and visualization in later stages.
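
To make the lines-to-stops transformation concrete, the following sketch expands each raw row into one record per stop. It assumes stops are separated by ` - `, as in the raw data sample above, and that stop IDs follow the `{year}{index}_{side}` pattern with a running index across all lines. `explode_stops` is a hypothetical helper; the actual logic inside `TransportDataProcessor` may differ.

```python
from typing import Dict, Iterable, List


def explode_stops(rows: Iterable[Dict[str, str]], year: int, side: str) -> List[Dict[str, str]]:
    """Expand each line's ' - '-separated stop string into one record per stop.

    Hypothetical helper for illustration; not part of the project's processor module.
    """
    records = []
    counter = 0  # running index across all lines, used to build stop_id
    for row in rows:
        for name in row["stops"].split(" - "):
            name = name.strip()
            if not name:
                continue
            records.append({
                "stop_name": name,
                "type": row["type"],
                "line_name": row["line_name"],
                "stop_id": f"{year}{counter}_{side}",
                "location": "",    # filled in later by station matching
                "identifier": "",  # filled in later by station matching
            })
            counter += 1
    return records


sample = [
    {"line_name": "A", "type": "u-bahn",
     "stops": "Pankow (Vinetastr.) - Schönhauser Allee - Dimitroffstr."},
    {"line_name": "E", "type": "u-bahn",
     "stops": "Alexanderplatz - Schillingstr."},
]
stops = explode_stops(sample, 1965, "east")
for stop in stops:
    print(stop["stop_id"], stop["line_name"], stop["stop_name"])
```

Splitting on the padded separator ` - ` rather than a bare hyphen keeps hyphenated street names intact, but a station name that itself contains ` - ` would still be split incorrectly under this assumption.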

In [5]:
# Process cleaned raw data
processor = TransportDataProcessor(YEAR, SIDE)

try:
    # Pass the DataFrame directly
    results = processor.process_raw_data(raw_df, existing_stations_df)
    logger.info("Initial processing complete")
    
    # Display processing results
    for name, df in results.items():
        print(f"\n{name} table shape: {df.shape}")
        print(f"Sample of {name}:")
        display(df.head(2))  # Using display for better notebook output
        
except Exception as e:
    logger.error(f"Error in initial processing: {e}")
    raise

2025-04-06 14:54:20,760 - INFO - Using provided DataFrame
2025-04-06 14:54:20,780 - INFO - Created tables: lines (74 rows), stops (743 rows), 
2025-04-06 14:54:20,781 - INFO - Initial processing complete



lines table shape: (74, 9)
Sample of lines:


| | line_id | year | line_name | type | start_stop | length (time) | length (km) | east_west | frequency (7:30) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19651 | 1965 | A | u-bahn | Pankow (Vinetastr.)<> Thälmannplatz | 19.0 | 7.5 | east | 5.0 |
| 1 | 19652 | 1965 | E | u-bahn | Alexanderplatz<> Friedrichsfelde (Tierpark) | 16.0 | 7.1 | east | 5.0 |



stops table shape: (743, 6)
Sample of stops:


| | stop_name | type | line_name | stop_id | location | identifier |
|---|---|---|---|---|---|---|
| 0 | Pankow (Vinetastr.) | u-bahn | A | 19650_east | | |
| 1 | Schönhauser Allee | u-bahn | A | 19651_east | | |


## Saving Interim Results

The processed tables are saved as interim files. These will be used in subsequent notebooks for:
1. Geolocation verification and enhancement
2. Data enrichment with administrative and temporal information
3. Network construction and analysis

Splitting the pipeline into discrete steps lets us inspect and correct each stage's output before the next stage runs.

In [6]:
# Save results
for name, df in results.items():
    output_path = DATA_DIR / 'interim' / 'stops_base' / f'{name}_{YEAR}_{SIDE}.csv'
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)
    logger.info(f"Saved {name} table to {output_path}")

2025-04-06 14:54:20,836 - INFO - Saved lines table to ..\data\interim\stops_base\lines_1965_east.csv
2025-04-06 14:54:20,841 - INFO - Saved stops table to ..\data\interim\stops_base\stops_1965_east.csv


## Station Matching

This step attempts to match stations in our current dataset with those in our reference database. This process:

1. Compares station names and types to find potential matches
2. Assigns geographic coordinates from matched stations
3. Identifies stations that require manual geolocation
4. Logs matching statistics for quality control

Stations that cannot be automatically matched will be processed manually using OpenRefine in a subsequent step.
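
The matching idea can be sketched as an exact lookup on a normalized station name plus transport type. This is a minimal illustration under stated assumptions: the reference records are assumed to carry `stop_name`, `type`, `location` and `identifier` fields, and `normalize`, `build_lookup` and `match_stops` are hypothetical names. The project's `StationMatcher` may use different columns or fuzzier matching.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def build_lookup(reference: List[Record]) -> Dict[Tuple[str, str], Record]:
    """Index reference stations by (normalized name, type)."""
    return {(normalize(r["stop_name"]), r["type"]): r for r in reference}


def match_stops(stops: List[Record], reference: List[Record]) -> List[Record]:
    """Copy each stop and attach location/identifier when a reference entry matches."""
    lookup = build_lookup(reference)
    matched = []
    for stop in stops:
        hit = lookup.get((normalize(stop["stop_name"]), stop["type"]))
        updated = dict(stop)
        updated["location"] = hit["location"] if hit else None
        updated["identifier"] = hit["identifier"] if hit else None
        matched.append(updated)
    return matched


reference = [
    {"stop_name": "Pankow (Vinetastr.)", "type": "u-bahn",
     "location": "52.559166666667,13.413333333333", "identifier": "Q570906"},
]
stops = [
    {"stop_name": "pankow  (Vinetastr.)", "type": "u-bahn"},     # differs only in case/spacing
    {"stop_name": "Stadtmitte (Mohrenstr.)", "type": "u-bahn"},  # not in the reference
    {"stop_name": "Pankow (Vinetastr.)", "type": "tram"},        # same name, different type
]
matched = match_stops(stops, reference)
unmatched_names = [s["stop_name"] for s in matched if s["location"] is None]
print(unmatched_names)
```

Keying on the transport type as well as the name prevents a tram stop from inheriting the coordinates of a U-Bahn station that happens to share its name.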

In [7]:
# Station Matching Process
matcher = StationMatcher(existing_stations_df)

# Process stops table with location matching
matched_stops = matcher.add_location_data(results['stops'])

2025-04-06 14:54:20,975 - INFO - No match found for station: Stadtmitte (Mohrenstr.)
2025-04-06 14:54:21,010 - INFO - No match found for station: Frankfurter Allee (Ringbahn)


2025-04-06 14:54:21,032 - INFO - No match found for station: Adalbertstrasse Ecke Köpenickerstrasse
2025-04-06 14:54:21,056 - INFO - No match found for station: Invalidenstrasse Ecke Brunnenstrasse
2025-04-06 14:54:21,060 - INFO - No match found for station: Chausseestrasse Ecke Invalidenstrasse
2025-04-06 14:54:21,078 - INFO - No match found for station: Revalerstrasse (S-Bhf. Warschauer Strasse)
2025-04-06 14:54:21,084 - INFO - No match found for station: Boxhagenerstrasse
2025-04-06 14:54:21,091 - INFO - No match found for station: HohenSchönhauser Strasse
2025-04-06 14:54:21,094 - INFO - No match found for station: Omnibus Bhf. Lichtenberger Strasse
2025-04-06 14:54:21,109 - INFO - No match found for station: Revalerstrasse (S-Bhf. Warschauer Strasse)
2025-04-06 14:54:21,112 - INFO - No match found for station: Grünberger Strasse Ecke Warschauerstrasse
2025-04-06 14:54:21,134 - INFO - No match found for station: Eberswalder Strasse
2025-04-06 14:54:21,136 - INFO - No match found fo

In [8]:
# Analysis of matching results
total_stops = len(matched_stops)
matched = matched_stops['location'].notna().sum()
unmatched = total_stops - matched

print("\nMatching Statistics:")
print(f"Total stations: {total_stops}")
print(f"Matched: {matched} ({matched/total_stops*100:.1f}%)")
print(f"Unmatched: {unmatched} ({unmatched/total_stops*100:.1f}%)")

# Display sample of matched stations
print("\nSample of matched stations:")
display(matched_stops[matched_stops['location'].notna()].head(3))

print("\nSample of unmatched stations:")
display(matched_stops[matched_stops['location'].isna()].head(3))


Matching Statistics:
Total stations: 743
Matched: 706 (95.0%)
Unmatched: 37 (5.0%)

Sample of matched stations:


| | stop_name | type | line_name | stop_id | location | identifier |
|---|---|---|---|---|---|---|
| 0 | Pankow (Vinetastr.) | u-bahn | A | 19650_east | 52.559166666667,13.413333333333 | Q570906 |
| 1 | Schönhauser Allee | u-bahn | A | 19651_east | 52.549328888889,13.413706111111 | Q47014936 |
| 2 | Dimitroffstr. | u-bahn | A | 19652_east | 52.541666666667,13.412222222222 | Q571382 |



Sample of unmatched stations:


| | stop_name | type | line_name | stop_id | location | identifier |
|---|---|---|---|---|---|---|
| 10 | Stadtmitte (Mohrenstr.) | u-bahn | A | 196510_east | | |
| 18 | Frankfurter Allee (Ringbahn) | u-bahn | E | 196518_east | | |
| 22 | Adalbertstrasse Ecke Köpenickerstrasse | tram | 1 | 196522_east | | |


## Validation and Export

As a final step, we validate the matched stations and export the results:

1. The complete dataset is saved for the next processing stage
2. Unmatched stations are exported separately for manual geolocation
3. Matching statistics are logged for quality assurance

The manual geolocation process will be performed using OpenRefine, which provides tools for interactive data cleaning and enrichment.
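
One kind of check a validation step could include is a plausibility test on the coordinates themselves. The sketch below parses the `"lat,lon"` strings stored in the `location` column and tests them against a rough bounding box around Berlin. The box values and the helper names `parse_location` and `is_plausible` are assumptions for illustration; they do not describe what `validate_matches` actually does.

```python
from typing import Optional, Tuple

# Rough bounding box around Berlin; illustrative values, not taken from the project.
LAT_RANGE = (52.3, 52.7)
LON_RANGE = (13.0, 13.8)


def parse_location(value: Optional[str]) -> Optional[Tuple[float, float]]:
    """Parse a 'lat,lon' string; return None for empty or malformed values."""
    if not value:
        return None
    parts = value.split(",")
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None


def is_plausible(value: Optional[str]) -> bool:
    """True when the value parses and falls inside the assumed Berlin bounding box."""
    coords = parse_location(value)
    if coords is None:
        return False
    lat, lon = coords
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


candidates = [
    "52.559166666667,13.413333333333",  # Pankow (Vinetastr.), from the matched sample
    "13.413333333333,52.559166666667",  # latitude and longitude swapped
    "",                                 # unmatched stop
]
results_by_value = {value: is_plausible(value) for value in candidates}
for value, ok in results_by_value.items():
    print(repr(value), "->", ok)
```

A swapped latitude/longitude pair fails the test, which makes this a cheap guard against one of the more common geocoding mistakes.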

In [9]:
# Validate matches
from utils.geolocation import validate_matches
validate_matches(matched_stops)

# Save results
matched_dir = DATA_DIR / 'interim' / 'stops_matched_initial'
matched_dir.mkdir(parents=True, exist_ok=True)

# Save all stops (both matched and unmatched)
matched_path = matched_dir / f'stops_{YEAR}_{SIDE}.csv'
matched_stops.to_csv(matched_path, index=False)

# Save unmatched stops separately for OpenRefine
unmatched_stops = matched_stops[matched_stops['location'].isna()]
openrefine_dir = DATA_DIR / 'interim' / 'stops_for_openrefine'
openrefine_dir.mkdir(parents=True, exist_ok=True)
openrefine_path = openrefine_dir / f'unmatched_stops_{YEAR}_{SIDE}.csv'
unmatched_stops.to_csv(openrefine_path, index=False)

print(f"\nSaved {len(matched_stops)} total stops")
print(f"Exported {len(unmatched_stops)} unmatched stops for manual processing")


Matching Statistics:
Total stations: 743
Matched: 706 (95.0%)
Unmatched: 37 (5.0%)

Sample of unmatched stations:
10                   Stadtmitte (Mohrenstr.)
18              Frankfurter Allee (Ringbahn)
22    Adalbertstrasse Ecke Köpenickerstrasse
28      Invalidenstrasse Ecke Brunnenstrasse
29     Chausseestrasse Ecke Invalidenstrasse
Name: stop_name, dtype: object

Saved 743 total stops
Exported 37 unmatched stops for manual processing


## Next Steps

After this initial processing, the workflow continues with:

1. **Manual Geolocation**: Using OpenRefine to add coordinates to unmatched stations
2. **Geolocation Verification**: Validating coordinates and splitting composite stations
3. **Data Enrichment**: Adding administrative and contextual information
4. **Network Construction**: Building a graph representation of the transportation system
5. **Analysis**: Investigating network properties and evolution over time

The next notebook in the sequence is `01_geolocation_verification_splitting.ipynb`.