# Berlin Transport Network - Data Enrichment

This notebook processes and enriches Berlin public transportation data for a single year and one side of the city (East or West). It performs the following steps:

1. **Configuration**: Set up year and side to process
2. **Data Loading**: Load base data and intermediate files
3. **Line Enrichment**: Add profile and capacity information to lines
4. **Administrative Data**: Add district/neighborhood information to stops
5. **Postal Code Data**: Add postal code information to stops
6. **Line-Stop Relationships**: Process relationships between lines and stops
7. **Data Finalization**: Finalize and save processed data
8. **Reference Data**: Update the reference stations dataset

Most of the implementation logic lives in the `src.enricher` and `src.table_creation` modules.

In [None]:
import sys
import pandas as pd
import logging
from pathlib import Path

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Import modules
sys.path.append('..')
from src import enricher

## 1. Configuration

Set up the year and side (east/west) to process, and define paths to data files.

In [None]:
# Configuration
YEAR = 1965
SIDE = "west"

# Set up paths
BASE_DIR = Path('../data')
paths = {
    'base_dir': BASE_DIR,
    'raw_dir': BASE_DIR / 'raw',
    'interim_dir': BASE_DIR / 'interim',
    'processed_dir': BASE_DIR / 'processed',
    'geo_data_dir': BASE_DIR / 'data-external',
}
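
Every later cell depends on `YEAR`, `SIDE`, and `paths`, so it can help to fail early on a bad configuration. The following is a minimal sketch and not part of `src.enricher`: the helper `validate_config` and the set of accepted sides are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of valid sides; the project may accept other values.
VALID_SIDES = {"east", "west"}


def validate_config(year, side, paths):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not isinstance(year, int):
        problems.append(f"YEAR must be an int, got {type(year).__name__}")
    if side not in VALID_SIDES:
        problems.append(f"SIDE must be one of {sorted(VALID_SIDES)}, got {side!r}")
    # Every configured directory should exist before any file is read.
    for name, path in paths.items():
        if not Path(path).is_dir():
            problems.append(f"{name} does not exist: {path}")
    return problems
```

A possible use is `for problem in validate_config(YEAR, SIDE, paths): logger.warning(problem)`, which reports all problems at once instead of stopping at the first.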

## 2. Data Loading

Load the raw and intermediate data files required for processing.

In [None]:
# Load data
try:
    line_df_initial, final_stops = enricher.load_data(paths, YEAR, SIDE)
except Exception as e:
    logger.error(f"Error loading data: {e}")
    raise

## 3. Line Enrichment

Enrich line data with profile and capacity information based on transport type.

In [None]:
# Enrich lines with profile and capacity information
line_df = enricher.enrich_lines(line_df_initial, SIDE)

# Display a sample of the enriched lines
line_df.head()
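
The internals of `enricher.enrich_lines` are not shown in this notebook. As a rough sketch of the general idea, attaching a profile and a capacity by transport type can be expressed as a lookup table. The type names, field names, and all values below are placeholders, not the project's data.

```python
# Hypothetical lookup: transport type -> attributes to attach.
# The values are placeholders for illustration only.
TYPE_ATTRIBUTES = {
    "tram": {"profile": "A", "capacity": 100},
    "bus": {"profile": "B", "capacity": 50},
}


def enrich_line(line, attributes=TYPE_ATTRIBUTES):
    """Return a copy of `line` with profile and capacity filled in.

    Unknown transport types get None so they are easy to spot later.
    """
    extra = attributes.get(line["type"], {"profile": None, "capacity": None})
    return {**line, **extra}


lines = [{"line_name": "1", "type": "tram"}, {"line_name": "X", "type": "ferry"}]
enriched = [enrich_line(line) for line in lines]
```

Returning a copy rather than mutating the input keeps the initial line data available for comparison, which mirrors how the notebook keeps `line_df_initial` alongside `line_df`.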

## 4. Administrative Data

Add district and neighborhood information to stops based on their geographic location.

In [None]:
# Load district data
districts_gdf, west_berlin_districts = enricher.load_district_data(paths['geo_data_dir'])

# Add administrative data
if districts_gdf is not None and west_berlin_districts is not None:
    enriched_stops_df = enricher.add_administrative_data(SIDE, final_stops, districts_gdf, west_berlin_districts)

    logger.info("Enriched stops created, not saved")
else:
    logger.warning("Could not load district data, skipping administrative enrichment")
    enriched_stops_df = final_stops

# Display a sample of the enriched stops
enriched_stops_df.head()
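
Assigning a district by geographic location is, at its core, a point-in-polygon test. How `enricher.add_administrative_data` implements this is not shown here; with GeoDataFrames, a spatial join is the usual tool. The dependency-free sketch below uses ray casting on a made-up square to show the principle. The function names and coordinates are illustrative assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_district(x, y, districts):
    """Return the name of the first district containing the point, else None."""
    for name, polygon in districts.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Made-up unit square, not a real district boundary.
districts = {"example_district": [(0, 0), (1, 0), (1, 1), (0, 1)]}
```

Stops for which `find_district` returns `None` lie outside every known polygon and are worth inspecting before the data is finalized.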

## 5. Postal Code Data

Add postal code information to stops based on their geographic location.

In [None]:
# Add postal code data
enriched_stops_df = enricher.add_postal_code_data(
    enriched_stops_df,
    geo_data_dir=paths['geo_data_dir']
)

# Display a sample of the enriched stops
enriched_stops_df.head()

## 6. Line-Stop Relationships

Process relationships between lines and stops, including creating a line-stops DataFrame, adding line type information, and adding stop foreign keys.

In [None]:
# Load the raw file for this year and side from the configured raw directory
raw_df = pd.read_csv(paths['raw_dir'] / f"{YEAR}_{SIDE}.csv")

In [None]:
from src import table_creation

# Create line-stops DataFrame
line_stops = table_creation.create_line_stops_df(raw_df)

# Add stop foreign keys
line_stops = table_creation.add_stop_foreign_keys(line_stops, enriched_stops_df, YEAR, SIDE)

# Display a sample of the line-stops relationships
line_stops.head()
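
The internals of `table_creation` are not shown in this notebook. The sketch below illustrates the general pattern the description refers to: one row per (line, stop) pair with its position, followed by a foreign key resolved by stop name. The `" - "` separator, the field names, and the id values are all assumptions.

```python
def build_line_stops(raw_rows, sep=" - "):
    """Expand each line's stop string into one row per stop, keeping order."""
    rows = []
    for raw in raw_rows:
        for order, stop_name in enumerate(raw["stops"].split(sep), start=1):
            rows.append({
                "line_name": raw["line_name"],
                "stop_name": stop_name.strip(),
                "stop_order": order,
            })
    return rows


def add_stop_ids(line_stops, stop_ids):
    """Attach stop_id via name lookup; unmatched stops get None."""
    return [{**row, "stop_id": stop_ids.get(row["stop_name"])} for row in line_stops]


raw_rows = [{"line_name": "1", "stops": "Stop A - Stop B - Stop C"}]
stop_ids = {"Stop A": 10, "Stop B": 11}  # "Stop C" is deliberately missing
line_stops_sketch = add_stop_ids(build_line_stops(raw_rows), stop_ids)

# Names that could not be matched to a stop record
unmatched = [r["stop_name"] for r in line_stops_sketch if r["stop_id"] is None]
```

Listing the unmatched names makes spelling differences between the raw file and the stops table visible before they turn into missing foreign keys.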

## 7. Data Finalization

Finalize and save the processed data to the output directory.

In [None]:
# Finalize data
final_line_df, final_stops_df, final_line_stops_df = table_creation.finalize_data(
    line_df, enriched_stops_df, line_stops
)

# Save final data
table_creation.save_data(paths, final_line_df, final_stops_df, final_line_stops_df, YEAR, SIDE)
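
A referential-integrity check on the three tables is a cheap safeguard around this step. The sketch below is not part of `table_creation`: the helper `find_orphans` and the column names `line_id` and `stop_id` are hypothetical.

```python
def find_orphans(line_stops, line_ids, stop_ids):
    """Return rows whose line_id or stop_id has no match in the parent tables."""
    line_ids, stop_ids = set(line_ids), set(stop_ids)
    return [
        row for row in line_stops
        if row["line_id"] not in line_ids or row["stop_id"] not in stop_ids
    ]


rows = [
    {"line_id": 1, "stop_id": 10},
    {"line_id": 1, "stop_id": 99},  # stop 99 does not exist
    {"line_id": 7, "stop_id": 10},  # line 7 does not exist
]
orphans = find_orphans(rows, line_ids=[1, 2], stop_ids=[10, 11])
```

An empty result means every relationship row points at an existing line and an existing stop.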

## Summary

Print a summary of the processed data and next steps.

In [None]:
# Print summary
table_creation.print_summary(YEAR, SIDE, final_line_df, final_stops_df, final_line_stops_df, paths)