In [1]:
# 00_process_data.ipynb

"""
This notebook processes raw Berlin transport data from Fahrplanbücher into structured formats.
"""

import sys
from pathlib import Path
import pandas as pd
import logging
import os

# Add the src directory to the Python path
sys.path.append(str(Path('../src').resolve()))

# Import processing modules
from utils.data_loader import DataLoader, format_line_list
from processor import TransportDataProcessor
# Station matcher that works on DataFrames
from df_station_matcher import DataFrameStationMatcher

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Berlin Transport Data Processing

This notebook performs the initial extraction and transformation of Berlin's historical public transportation data from raw sources. It represents the first step in our processing pipeline.

## Purpose

1. **Data Extraction**: Load and parse raw data from digitized Fahrplanbücher (timetables)
2. **Initial Structuring**: Convert raw data into structured tables with consistent formats
3. **Station Identification**: Establish unique identifiers for transportation stops
4. **Preliminary Geolocation**: Match stations to known geographic coordinates where possible

## Process Overview

The process follows these key steps:
1. Load raw data from CSV files containing transcribed Fahrplanbuch information
2. Process this data into standardized tables (lines and stops)
3. Match stations with existing station records to obtain geographic coordinates
4. Generate interim data files for subsequent processing stages

## Historical Context

The data represents Berlin's public transportation system during the Cold War era (1945-1989). During this period, Berlin was divided, with separate transportation authorities operating in East and West Berlin. This division is reflected in our data processing approach, where we handle each side separately for each year.

In [2]:
# Configuration
YEAR = 1964
SIDE = "east"  # or "west"
DATA_DIR = Path('../data')

# Initialize loader
loader = DataLoader()

In [3]:
# Load raw transcribed data
raw_data_path = DATA_DIR / 'raw' / f'{YEAR}_{SIDE}.csv'
raw_df = loader.load_raw_data(str(raw_data_path))
logger.info(f"Loaded raw data: {len(raw_df)} lines")

# Display sample of loaded data to verify
print("\nSample of loaded data:")
print(raw_df[['line_name', 'type', 'stops']].head())

2025-04-28 00:02:00,758 - INFO - Loaded raw data: 77 lines



Sample of loaded data:
  line_name     type                                              stops
0         1     tram  Ostbahnhof - U-Bhf. Strausbergerplatz - Leninp...
1       A1P  autobus  Adalbertstr. Ecke Köpenickerstr. - S-Bhf. Jann...
2         3     tram  S-Bhf. Warschauerstr. - Kopernikusstr. - Holte...
3         4     tram  S-Bhf. Warschauerstr. - Grünbergerstr. - Frank...
4        11     tram  Heinrich-Heine-Str. - S-Bhf. Jannowitzbrücke -...


## Existing Station Reference Data

To ensure consistency across years and facilitate geolocation, we maintain a reference dataset of known stations. This dataset:

1. Serves as a lookup table for station coordinates
2. Helps standardize station names across different time periods
3. Provides unique identifiers for stations that persist across snapshots
4. Records the lines that serve each station through time

As we process new data, this reference dataset will be expanded with newly identified stations.
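
A lookup against such a reference table might look like the sketch below. It assumes the reference data has `stop_name`, `type`, `stop_id`, and `location` columns and that the first four characters of `stop_id` encode the year, as the legacy CSV used later in this notebook suggests. The function name `closest_year_lookup` and the `196112_east` record are hypothetical and exist only for illustration.

```python
import pandas as pd

def closest_year_lookup(stops, reference, year):
    """Attach the location of the reference record closest in year to each stop."""
    ref = reference.copy()
    # Assumption: the first four characters of stop_id are the year
    ref["year_diff"] = (ref["stop_id"].astype(str).str[:4].astype(int) - year).abs()
    # For each (stop_name, type) keep only the record from the nearest year
    nearest = (
        ref.sort_values("year_diff", kind="mergesort")
           .drop_duplicates(subset=["stop_name", "type"], keep="first")
    )
    nearest = nearest[["stop_name", "type", "location", "stop_id"]].rename(
        columns={"stop_id": "location_from"}
    )
    # A left merge keeps every stop; stops without a reference record get NaN
    return stops.merge(nearest, on=["stop_name", "type"], how="left")

# Toy reference data: the 1961 record is made up for this example
reference = pd.DataFrame({
    "stop_name": ["Leninplatz", "Leninplatz"],
    "type": ["tram", "tram"],
    "stop_id": ["196112_east", "196525_east"],
    "location": ["52.5232,13.4326", "52.5232658,13.4326457"],
})
stops = pd.DataFrame({
    "stop_name": ["Leninplatz", "Ostbahnhof"],
    "type": ["tram", "tram"],
    "stop_id": ["19642_east", "19640_east"],
})
result = closest_year_lookup(stops, reference, year=1964)
print(result)
```

Because the merge is keyed on exact name and type, stations whose spelling changed between years stay unmatched here and would still need fuzzy or manual matching.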

## Initial Data Processing

The `TransportDataProcessor` class transforms our raw data into structured tables:

1. **Lines Table**: Contains information about each transportation line
   - Unique identifiers
   - Type (U-Bahn, S-Bahn, tram, bus)
   - Terminal stations
   - Service frequency
   - Journey time and distance

2. **Stops Table**: Contains information about each station
   - Unique identifiers
   - Station names
   - Transportation type
   - Placeholder for geographic coordinates

This structured format facilitates network analysis and visualization in later stages.
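
To make the stops table concrete, here is a minimal sketch of how one row per line could be expanded into one row per stop. It is not the `TransportDataProcessor` implementation: the helper `explode_stops` is hypothetical, the input is abbreviated from the sample shown earlier, and the `stop_id` scheme (`<year><row index>_<side>`) is inferred from the sample output rather than confirmed.

```python
import pandas as pd

def explode_stops(raw, year, side):
    """Turn one row per line into one row per (line, stop)."""
    # Split on " - " (with spaces) so hyphens inside names like "U-Bhf." survive
    stops = raw.assign(stop_name=raw["stops"].str.split(" - ")).explode("stop_name")
    stops["stop_name"] = stops["stop_name"].str.strip()
    stops = stops[["stop_name", "type", "line_name"]].reset_index(drop=True)
    # Assumed ID scheme, modelled on the sample output: <year><row index>_<side>
    stops["stop_id"] = [f"{year}{i}_{side}" for i in stops.index]
    # Placeholder to be filled in by the matching step
    stops["location"] = ""
    return stops

# Toy input, shortened for illustration
raw = pd.DataFrame({
    "line_name": ["1", "11"],
    "type": ["tram", "tram"],
    "stops": [
        "Ostbahnhof - U-Bhf. Strausbergerplatz - Leninplatz",
        "Heinrich-Heine-Str. - S-Bhf. Jannowitzbrücke",
    ],
})
stops = explode_stops(raw, 1964, "east")
print(stops)
```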

In [4]:
# Process cleaned raw data
processor = TransportDataProcessor(YEAR, SIDE)

try:
    # Pass only the raw DataFrame; the processor does not use existing_stations_df
    results = processor.process_raw_data(raw_df)
    logger.info("Initial processing complete")

    # Display processing results
    for name, df in results.items():
        print(f"\n{name} table shape: {df.shape}")
        print(f"Sample of {name}:")
        print(df.head(2))

except Exception as e:
    logger.error(f"Error in initial processing: {e}")
    raise

2025-04-28 00:02:00,778 - INFO - Using provided DataFrame
2025-04-28 00:02:00,795 - INFO - Created tables: lines (77 rows), stops (701 rows), 
2025-04-28 00:02:00,801 - INFO - Initial processing complete



lines table shape: (77, 9)
Sample of lines:
  line_id  year line_name     type  \
0   19641  1964         1     tram   
1   19642  1964       A1P  autobus   

                                          start_stop  length (time)  \
0                      Ostbahnhof<> Hannoverschestr.           28.0   
1  Adalbertstr. Ecke Köpenickerstr.<> Alexanderpl...            6.0   

   length (km) east_west  frequency (7:30)  
0          NaN      east              20.0  
1          NaN      east              10.0  

stops table shape: (701, 6)
Sample of stops:
                  stop_name  type line_name     stop_id location identifier
0                Ostbahnhof  tram         1  19640_east                    
1  U-Bhf. Strausbergerplatz  tram         1  19641_east                    


## Station Matching

This step attempts to match stations in our current dataset with those in our reference database. This process:

1. Compares station names and types to find potential matches
2. Assigns geographic coordinates from matched stations
3. Identifies stations that require manual geolocation
4. Logs matching statistics for quality control

Stations that cannot be automatically matched will be processed manually using OpenRefine in a subsequent step.
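
The idea of name matching with a score cutoff can be sketched with the standard library alone. This is not how `DataFrameStationMatcher` works internally: it swaps in `difflib.SequenceMatcher` on word-sorted names as a stand-in scorer, skips the line check the real matcher performs, and the helper names `token_sort_score` and `best_match` are hypothetical. Scores from this sketch will therefore differ from the `match_score` values the notebook reports.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Similarity in [0, 100] that ignores word order."""
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return 100 * SequenceMatcher(None, norm(a), norm(b)).ratio()

def best_match(name, stop_type, reference, score_cutoff=85):
    """Return (score, record) for the best same-type candidate, or None."""
    scored = [
        (token_sort_score(name, rec["stop_name"]), rec)
        for rec in reference
        if rec["type"] == stop_type
    ]
    # Discard candidates below the cutoff; those stops go to manual geolocation
    scored = [item for item in scored if item[0] >= score_cutoff]
    return max(scored, key=lambda item: item[0]) if scored else None

reference = [
    {"stop_name": "Köpenickerstr. Ecke Adalbertstr.", "type": "autobus",
     "location": "52.50880201,13.42459206"},
    {"stop_name": "Leninplatz", "type": "tram",
     "location": "52.5232658,13.4326457"},
]

hit = best_match("Adalbertstr. Ecke Köpenickerstr.", "autobus", reference)
miss = best_match("Ostbahnhof", "tram", reference)
print(hit)
print(miss)
```

Sorting the words before comparing is what lets "Adalbertstr. Ecke Köpenickerstr." match its reversed form; a plain character comparison would score that pair much lower.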

In [5]:
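# Initialize the DataFrame-based station matcher.
# Connection settings are read from environment variables so that credentials
# stay out of the notebook. The variable names are a local convention, not
# part of the matcher's API.
df_matcher = DataFrameStationMatcher(
    uri=os.environ.get("NEO4J_URI", "bolt://100.82.176.18:7687"),
    username=os.environ.get("NEO4J_USER", "neo4j"),
    password=os.environ["NEO4J_PASSWORD"]
)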

# Process stops table with location matching using the new matcher
# Pass the current_year and side to fetch the correct historical data
matched_stops = df_matcher.add_location_data(results['stops'], YEAR, SIDE, score_cutoff=85) 

# Close the Neo4j connection when done with the matcher instance
df_matcher.close()

2025-04-28 00:02:01,147 - INFO - Found closest previous year with data for side 'east': 1965
2025-04-28 00:02:01,312 - INFO - Fetched 743 historical stations (with lines) from year 1965 for side 'east'.
2025-04-28 00:02:01,349 - INFO - Partial match (name/type ok, score=100) but line mismatch for: 'Ostbahnhof' (tram, Line: 1). Historical lines: ['82']
2025-04-28 00:02:01,429 - INFO - Partial match (name/type ok, score=100) but line mismatch for: 'Alexanderplatz, Memhardstr.' (tram, Line: 1). Historical lines: ['74']
2025-04-28 00:02:01,450 - INFO - Partial match (name/type ok, score=100) but line mismatch for: 'Rosenthalerplatz' (tram, Line: 1). Historical lines: ['11']
2025-04-28 00:02:01,476 - INFO - Partial match (name/type ok, score=100) but line mismatch for: 'Invalidenstr. Ecke Chausseestr.' (tram, Line: 1). Historical lines: ['11']
2025-04-28 00:02:01,510 - INFO - Partial match (name/type ok, score=100) but line mismatch for: 'Oranienburger Tor' (tram, Line: 1). Historical lines

In [6]:
# Analysis of matching results
total_stops = len(matched_stops)

# Check for matches based on the presence of latitude data,
# which is only added by the matcher on success.
matched = matched_stops['latitude'].notna().sum()
unmatched = total_stops - matched

print("\nMatching Statistics:")
print(f"Total stations: {total_stops}")
print(f"Matched: {matched} ({matched/total_stops*100:.1f}%)")
print(f"Unmatched: {unmatched} ({unmatched/total_stops*100:.1f}%)")

# Display sample of matched stations (using latitude check)
print("\nSample of matched stations:")
# Display only if there are matched stations to avoid errors/empty output
if matched > 0:
    display(matched_stops[matched_stops['latitude'].notna()].head(min(3, matched)))
else:
    print("No stations were matched.")

# Display sample of unmatched stations (using latitude check)
print("\nSample of unmatched stations:")
# Display only if there are unmatched stations
if unmatched > 0:
    display(matched_stops[matched_stops['latitude'].isna()].head(min(3, unmatched)))
else:
    print("All stations were matched.")



Matching Statistics:
Total stations: 701
Matched: 393 (56.1%)
Unmatched: 308 (43.9%)

Sample of matched stations:


Unnamed: 0,stop_name,type,line_name,stop_id,location,identifier,latitude,longitude,match_score,matched_name,matched_stop_id,matched_historical_lines
1,U-Bhf. Strausbergerplatz,tram,1,19641_east,"52.51808307,13.43293834",,52.518083,13.432938,100,U-Bhf. Strausbergerplatz,196524_east,[1]
2,Leninplatz,tram,1,19642_east,"52.5232658,13.4326457",,52.523266,13.432646,100,Leninplatz,196525_east,[1]
8,Adalbertstr. Ecke Köpenickerstr.,autobus,A1P,19648_east,"52.50880201,13.42459206",,52.508802,13.424592,95,Köpenickerstr. Ecke Adalbertstr.,196531_east,[A1P]



Sample of unmatched stations:


Unnamed: 0,stop_name,type,line_name,stop_id,location,identifier,latitude,longitude,match_score,matched_name,matched_stop_id,matched_historical_lines
0,Ostbahnhof,tram,1,19640_east,,,,,,,,
3,"Alexanderplatz, Memhardstr.",tram,1,19643_east,,,,,,,,
4,Rosenthalerplatz,tram,1,19644_east,,,,,,,,


In [7]:
legacy_stops = pd.read_csv('../legacy_data/stations.csv')

In [8]:
# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning
unmatched_stops = matched_stops[matched_stops['latitude'].isna()].copy()

# Fallback matching against the legacy data: exact stop_name and type,
# preferring the legacy record closest in year
for index, row in unmatched_stops.iterrows():
    # Filter legacy_stops for matching stop_name and type
    potential_matches = legacy_stops[
        (legacy_stops['stop_name'] == row['stop_name']) &
        (legacy_stops['type'] == row['type'])
    ]
    
    # Among the candidates, keep the one whose year is closest to YEAR
    closest_match = None
    closest_year_diff = float('inf')
    
    for _, match in potential_matches.iterrows():
        # Extract the year from the stop_id of the match
        match_year = int(str(match['stop_id'])[:4])  # Assuming the year is the first 4 digits of stop_id
        year_diff = abs(YEAR - match_year)
        
        if year_diff < closest_year_diff:
            closest_year_diff = year_diff
            closest_match = match
    
    if closest_match is not None:
        # Copy location and identifier from the legacy record and note its source
        unmatched_stops.at[index, 'location'] = closest_match['location']
        unmatched_stops.at[index, 'location_from'] = closest_match['stop_id']
        unmatched_stops.at[index, 'identifier'] = closest_match['identifier']


In [9]:
# Filter rows with non-null location from unmatched_stops
newly_matched = unmatched_stops[unmatched_stops['location'].notna() & (unmatched_stops['location'] != "")].copy()

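# Append newly matched rows to matched_stops.
# Note: matched_stops still holds the original, location-less copies of these
# rows, so after the concat each newly matched stop appears twice.
matched_stops = pd.concat([matched_stops, newly_matched], ignore_index=True)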

# Update unmatched_stops to only include rows with null location
unmatched_stops = unmatched_stops[unmatched_stops['location'].isna() | (unmatched_stops['location'] == "")].copy()

In [10]:
matched = len(matched_stops)
unmatched = len(unmatched_stops)

print(f"Matched: {matched}")
print(f"Unmatched: {unmatched}")

Matched: 970
Unmatched: 39


In [11]:
# --- Save results ---
matched_dir = Path('../data/interim/stops_matched_initial')
matched_dir.mkdir(parents=True, exist_ok=True)

# Save all stops (both matched and unmatched)
matched_path = matched_dir / f'stops_{YEAR}_{SIDE}.csv'
matched_stops.to_csv(matched_path, index=False)
print(f"\nSaved {len(matched_stops)} total stops to {matched_path}")

# Save the stops that still have no location separately for OpenRefine
openrefine_dir = Path('../data/interim/stops_for_openrefine')
openrefine_dir.mkdir(parents=True, exist_ok=True)
openrefine_path = openrefine_dir / f'unmatched_stops_{YEAR}_{SIDE}.csv'
unmatched_stops.to_csv(openrefine_path, index=False)
print(f"Exported {len(unmatched_stops)} unmatched stops for manual processing to {openrefine_path}")


Saved 970 total stops to ..\data\interim\stops_matched_initial\stops_1964_east.csv
Exported 39 unmatched stops for manual processing to ..\data\interim\stops_for_openrefine\unmatched_stops_1964_east.csv


## Validation and Export

As a final step, we validate the matched stations and export the results:

1. The complete dataset is saved for the next processing stage
2. Unmatched stations are exported separately for manual geolocation
3. Matching statistics are logged for quality assurance

The manual geolocation process will be performed using OpenRefine, which provides tools for interactive data cleaning and enrichment.
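
A few basic checks before export could look like the sketch below. It assumes `location` holds a `"lat,lon"` string as in the sample output. The helper `validate_stops` and the bounding box limits are assumptions chosen for illustration, not part of the project code.

```python
import pandas as pd

def validate_stops(stops):
    """Count common problems in a stops table before it is exported."""
    loc = stops["location"].fillna("").astype(str).str.strip()
    has_loc = loc != ""
    # Parse "lat,lon"; rows that do not fit the pattern become NaN
    parts = loc.str.extract(r"^(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)$").astype(float)
    lat, lon = parts[0], parts[1]
    # Rough bounding box around Berlin (assumed limits, for illustration only)
    in_box = lat.between(52.3, 52.7) & lon.between(13.0, 13.8)
    return {
        "duplicate_stop_ids": int(stops["stop_id"].duplicated().sum()),
        "missing_location": int((~has_loc).sum()),
        "malformed_location": int((has_loc & lat.isna()).sum()),
        "outside_bounding_box": int((has_loc & lat.notna() & ~in_box).sum()),
    }

# Toy data: one duplicated ID without a location, one swapped coordinate pair
stops = pd.DataFrame({
    "stop_id": ["19641_east", "19642_east", "19642_east", "19640_east"],
    "location": ["52.51808307,13.43293834", "52.5232658,13.4326457", "", "13.43,52.51"],
})
report = validate_stops(stops)
print(report)
```

A non-zero `duplicate_stop_ids` count is worth checking in particular, since appending rows to a table that already contains them produces exactly this pattern.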

## Next Steps

After this initial processing, the workflow continues with:

1. **Manual Geolocation**: Using OpenRefine to add coordinates to unmatched stations
2. **Geolocation Verification**: Validating coordinates and splitting composite stations
3. **Data Enrichment**: Adding administrative and contextual information
4. **Network Construction**: Building a graph representation of the transportation system
5. **Analysis**: Investigating network properties and evolution over time

The next notebook in the sequence is `01_geolocation_verification_splitting.ipynb`.