This notebook analyzes the stations reference file, cleans it, and produces corrected reference files.
AIM:
- Keep all entries, including non-station locations such as signals, because they can appear as incident locations.
- Remove duplicate entries that carry a DfT category, so that the data analyzed and saved into 'processed data' is not duplicated.

-> As of 31 March 2024, there were 2,585 open mainline stations in Great Britain.

In [22]:
import pandas as pd

# Load the two reference files
stations_coords = pd.read_json('../data/reference provided/stations_ref_coordinates.json')
stations_dft = pd.read_json('../data/reference provided/stations_ref_with_dft.json')

print("Columns in stations_ref_coordinates.json:")
print(stations_coords.columns.tolist())
print(f"\nShape: {stations_coords.shape}")
print(f"\nTotal stanox entries: {len(stations_coords)}")

print("\n" + "="*50)
print("\nColumns in stations_ref_with_dft.json:")
print(stations_dft.columns.tolist())
print(f"\nShape: {stations_dft.shape}")
print(f"\nTotal stanox entries: {len(stations_dft)}")

Columns in stations_ref_coordinates.json:
['location_id', 'name', 'description', 'tiploc', 'crs', 'nlc', 'stanox', 'notes', 'longitude', 'latitude', 'isOffNetwork', 'timingPointType']

Shape: (54386, 12)

Total stanox entries: 54386


Columns in stations_ref_with_dft.json:
['location_id', 'name', 'description', 'tiploc', 'crs', 'nlc', 'stanox', 'notes', 'longitude', 'latitude', 'isOffNetwork', 'timingPointType', 'dft_category']

Shape: (4317, 13)

Total stanox entries: 4317
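
Later steps compare and deduplicate on `stanox`, so it may help to pin the identifier columns to strings at load time. The following is a minimal sketch, assuming the JSON stores `stanox` and `nlc` as quoted strings; the inline two-row sample and the names `sample_json`, `id_dtypes`, and `sample_ref` are made up for illustration. `dtype` is a documented `pd.read_json` parameter.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample in the same shape as the reference files.
sample_json = StringIO(
    '[{"tiploc": "TIPA", "stanox": "04303", "nlc": "514800"},'
    ' {"tiploc": "TIPB", "stanox": "87601", "nlc": "542600"}]'
)

# Pin identifier columns to str so numeric inference cannot strip leading zeros.
id_dtypes = {"tiploc": str, "stanox": str, "nlc": str}
sample_ref = pd.read_json(sample_json, dtype=id_dtypes)

print(sample_ref["stanox"].tolist())
```

The same `dtype` mapping could be passed to the two `pd.read_json` calls above if the real files turn out to need it.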


## Extract TIPLOCs from Schedule Data

Now we'll load the schedule data and extract all unique TIPLOCs that are actually used in the schedules. This will help us filter the reference file to keep only relevant entries.

In [5]:
# Load the schedule data
import pickle

schedule_file = '../data/CIF_ALL_FULL_DAILY_toc-full_p4.pkl'
print(f"Loading schedule data from: {schedule_file}")
schedule_data = pd.read_pickle(schedule_file)
print(f"Loaded {len(schedule_data)} schedule entries")
print(f"Type: {type(schedule_data)}")

Loading schedule data from: ../data/CIF_ALL_FULL_DAILY_toc-full_p4.pkl
Loaded 680185 schedule entries
Type: <class 'pandas.core.frame.DataFrame'>


In [6]:
# Extract all unique TIPLOCs from schedule data
all_tiplocs = set()

# If it's a DataFrame, convert to list of dicts
if isinstance(schedule_data, pd.DataFrame):
    schedule_list = schedule_data.to_dict('records')
else:
    schedule_list = schedule_data

print("Extracting TIPLOCs from schedule entries...")
for idx, entry in enumerate(schedule_list):
    if idx % 10000 == 0:
        print(f"  Processed {idx}/{len(schedule_list)} entries, found {len(all_tiplocs)} unique TIPLOCs")
    
    try:
        # Navigate to schedule_segment -> schedule_location
        schedule_locations = entry['schedule_segment']['schedule_location']
        
        # Extract all tiploc_code values from this schedule
        for loc in schedule_locations:
            tiploc = loc.get('tiploc_code')
            if tiploc:
                all_tiplocs.add(tiploc)
    except (KeyError, TypeError):
        continue

print(f"\nTotal unique TIPLOCs found in schedule data: {len(all_tiplocs)}")
print(f"Sample TIPLOCs: {list(all_tiplocs)[:10]}")

Extracting TIPLOCs from schedule entries...
  Processed 0/680185 entries, found 0 unique TIPLOCs
  Processed 10000/680185 entries, found 3712 unique TIPLOCs
  Processed 20000/680185 entries, found 4683 unique TIPLOCs
  Processed 30000/680185 entries, found 5438 unique TIPLOCs
  Processed 40000/680185 entries, found 5662 unique TIPLOCs
  Processed 50000/680185 entries, found 5891 unique TIPLOCs
  Processed 60000/680185 entries, found 5924 unique TIPLOCs
  Processed 70000/680185 entries, found 5940 unique TIPLOCs
  Processed 80000/680185 entries, found 6200 unique TIPLOCs
  P

## Filter Reference File by Schedule TIPLOCs

Now we'll filter the reference file to keep only entries whose TIPLOCs are actually present in the schedule data.

In [7]:
# Filter the reference file to keep only TIPLOCs present in schedule data
print(f"Original reference file entries: {len(stations_dft)}")

# Filter to keep only entries where TIPLOC is in the schedule data
stations_dft_filtered = stations_dft[stations_dft['tiploc'].isin(all_tiplocs)].copy()

print(f"Filtered reference file entries: {len(stations_dft_filtered)}")
print(f"Removed {len(stations_dft) - len(stations_dft_filtered)} entries not found in schedule data")

Original reference file entries: 6104
Filtered reference file entries: 4456
Removed 1648 entries not found in schedule data


In [12]:
# Count how many times each TIPLOC appears in the schedule data
tiploc_counts = {}

print("Counting TIPLOC usage in schedule data...")
for idx, entry in enumerate(schedule_list):
    if idx % 10000 == 0:
        print(f"  Processed {idx}/{len(schedule_list)} entries")
    
    try:
        schedule_locations = entry['schedule_segment']['schedule_location']
        for loc in schedule_locations:
            tiploc = loc.get('tiploc_code')
            if tiploc:
                tiploc_counts[tiploc] = tiploc_counts.get(tiploc, 0) + 1
    except (KeyError, TypeError):
        continue

print(f"\nCalculated usage counts for {len(tiploc_counts)} TIPLOCs")

Counting TIPLOC usage in schedule data...
  Processed 0/680185 entries
  Processed 10000/680185 entries
  Processed 20000/680185 entries
  Processed 30000/680185 entries
  Processed 40000/680185 entries
  Processed 50000/680185 entries
  Processed 60000/680185 entries
  Processed 70000/680185 entries
  Processed 80000/680185 entries
  Processed 90000/680185 entries
  Processed 100000/680185 entries
  Processed 110000/680185 entries
  Processed 120000/680185 entries
  Processed 130000/680185 entries
  Processed 140000/680185 entries
  Processed 150000/680185 entries
  Processed 160000/680185 entries
  Processed 170000/680185 entries
  Processed 180000/680185 entries
  Processed 190000/680185 entries
  Processed 200000/680185 entries
  Processed 210000/680185 entries
  Processed 220000/680185 entries
  Processed 230000/680185 entries
  Processed 240000/680185 entries
  Processed 250000/680185 entries
  Processed 260000/680185 entries
  Processed 270000/680185 entries
  Processed 280000/6
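
The schedule data is walked twice above: once to collect the unique TIPLOCs and once to count them. Both results can come from a single pass, since the keys of a counter are the unique TIPLOCs. This is a sketch only; `count_tiplocs` and the toy records are hypothetical and simply mirror the nested `schedule_segment` / `schedule_location` shape used in the cells above.

```python
from collections import Counter


def count_tiplocs(schedules):
    """Count tiploc_code occurrences across schedule records in one pass."""
    counts = Counter()
    for entry in schedules:
        try:
            locations = entry["schedule_segment"]["schedule_location"] or []
        except (KeyError, TypeError):
            continue  # entry without a usable schedule_location list
        for loc in locations:
            tiploc = loc.get("tiploc_code")
            if tiploc:
                counts[tiploc] += 1
    return counts


# Hypothetical toy records in the same nested shape as schedule_list.
toy_schedules = [
    {"schedule_segment": {"schedule_location": [
        {"tiploc_code": "TIPA"}, {"tiploc_code": "TIPB"}]}},
    {"schedule_segment": {"schedule_location": [{"tiploc_code": "TIPA"}]}},
    {"schedule_segment": None},  # skipped: no schedule_location available
]

toy_counts = count_tiplocs(toy_schedules)
toy_unique = set(toy_counts)  # plays the role of all_tiplocs
print(toy_counts.most_common(1))
```

A `Counter` is a `dict` subclass, so it can be passed to `Series.map` in the same way as `tiploc_counts`.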

## Add Missing Stations with DfT Categories

Add stations that are missing from the original file but should be included with their proper DfT categories.

In [33]:
# Add missing stations with their DfT categories
print("Adding missing stations to the dataset:")
print("=" * 80)

# Define the missing stations with their DfT categories
missing_stations = [
    {
        "location_id": 34639,
        "name": None,
        "description": "London Bridge",
        "tiploc": "LNDNBDG",
        "crs": "LBG",
        "nlc": "514800",
        "stanox": "87601",
        "notes": None,
        "longitude": None,
        "latitude": None,
        "isOffNetwork": "TRUE",
        "timingPointType": "O",
        "dft_category": "A"  # Category A
    },
    {
        "location_id": 36219,
        "name": None,
        "description": "Victoria London",
        "tiploc": "VICTRIA",
        "crs": "VIC",
        "nlc": "542600",
        "stanox": "87201",
        "notes": None,
        "longitude": -0.146727421,
        "latitude": 51.47730525,
        "isOffNetwork": "FALSE",
        "timingPointType": "O",
        "dft_category": "A"  # Category A
    },
    {
        "location_id": 37481,
        "name": None,
        "description": "Waterloo London",
        "tiploc": "WATRLOO",
        "crs": "WAT",
        "nlc": "559800",
        "stanox": "87212",
        "notes": None,
        "longitude": -0.111577769,
        "latitude": 51.5019955,
        "isOffNetwork": "TRUE",
        "timingPointType": "O",
        "dft_category": "A"  # Category A
    },
    {
        "location_id": 36591,
        "name": None,
        "description": "Ashford International",
        "tiploc": "ASHFKI",
        "crs": "ASI",
        "nlc": "546600",
        "stanox": "89428",
        "notes": None,
        "longitude": -0.8750,
        "latitude": 51.1435,
        "isOffNetwork": "FALSE",
        "timingPointType": "O",
        "dft_category": "B"  # Category B
    },
    {
        "location_id": 37218,
        "name": "Ebbsfleet Int Se",
        "description": "Ebbsfleet International",
        "tiploc": "EBSFDOM",
        "crs": "EBD",
        "nlc": "556600",
        "stanox": "89530",
        "notes": None,
        "longitude": 0.319566,
        "latitude": 51.442903,
        "isOffNetwork": "FALSE",
        "timingPointType": "M",
        "dft_category": "B"  # Category B
    },
    {
        "location_id": 17058,
        "name": "St Albans",
        "description": "St Albans",
        "tiploc": "STALBCY",
        "crs": "SAC",
        "nlc": "154800",
        "stanox": "63201",
        "notes": None,
        "longitude": -0.328306197,
        "latitude": 51.74890034,
        "isOffNetwork": "FALSE",
        "timingPointType": "T",
        "dft_category": "B"  # Category B
    },
    {
        "location_id": 34752,
        "name": None,
        "description": "Waterloo (East) London",
        "tiploc": "WLOE",
        "crs": "WAE",
        "nlc": "515800",
        "stanox": "88402",
        "notes": None,
        "longitude": -0.109634505,
        "latitude": 51.50369076,
        "isOffNetwork": "FALSE",
        "timingPointType": "O",
        "dft_category": "B"  # Category B
    }
]

# First, check if Hull with CRS HUL exists and update it
hull_mask = stations_dft_corrected['crs'] == 'HUL'
if hull_mask.any():
    stations_dft_corrected.loc[hull_mask, 'dft_category'] = 'B'
    hull_desc = stations_dft_corrected.loc[hull_mask, 'description'].values[0]
    print(f"✓ Updated Hull (CRS: HUL, Description: '{hull_desc}') to category B")
else:
    print("✗ WARNING: Hull with CRS 'HUL' not found")

print()

# Add the missing stations
added_count = 0
for station in missing_stations:
    # Match an existing row by TIPLOC or STANOX.
    # Note: a STANOX match can hit a row that has a different TIPLOC.
    mask = (stations_dft_corrected['tiploc'] == station['tiploc']) | (stations_dft_corrected['stanox'] == str(station['stanox']))
    
    if mask.any():
        print(f"  Station '{station['description']}' (TIPLOC: {station['tiploc']}) already exists - updating DfT category to {station['dft_category']}")
        # Update the DfT category of every matching row
        stations_dft_corrected.loc[mask, 'dft_category'] = station['dft_category']
    else:
        # Add new entry
        new_row = pd.DataFrame([station])
        stations_dft_corrected = pd.concat([stations_dft_corrected, new_row], ignore_index=True)
        print(f"✓ Added '{station['description']}' with DfT category {station['dft_category']}")
        added_count += 1

print("\n" + "=" * 80)
print(f"Added {added_count} new stations")
print(f"Total stations in corrected dataset: {len(stations_dft_corrected)}")

# Show updated category counts
print("\nUpdated DfT Category Distribution:")
print(stations_dft_corrected['dft_category'].value_counts().sort_index())
print(f"Total with DfT category: {stations_dft_corrected['dft_category'].notna().sum()}")
print(f"Total without DfT category: {stations_dft_corrected['dft_category'].isna().sum()}")

Adding missing stations to the dataset:
✓ Updated Hull (CRS: HUL, Description: 'Hull') to category B

  Station 'London Bridge' (TIPLOC: LNDNBDG) already exists - updating DfT category to A
✓ Added 'Victoria London' with DfT category A
  Station 'Waterloo London' (TIPLOC: WATRLOO) already exists - updating DfT category to A
✓ Added 'Ashford International' with DfT category B
✓ Added 'Ebbsfleet International' with DfT category B
  Station 'St Albans' (TIPLOC: STALBCY) already exists - updating DfT category to B
✓ Added 'Waterloo (East) London' with DfT category B

Added 4 new stations
Total stations in corrected dataset: 4321

Updated DfT Category Distribution:
dft_category
A      21
B      64
C1     95
C2    195
Name: count, dtype: int64
Total with DfT category: 375
Total without DfT category: 3946
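
The add-or-update step treats a station as already present when either its TIPLOC or its STANOX matches. Several TIPLOCs can share one STANOX, so a STANOX match may update a different row and leave the intended TIPLOC out of the dataset. One possible alternative, shown here as a sketch on made-up rows, is to match on TIPLOC only. `upsert_station` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd


def upsert_station(df, station):
    """Update dft_category where the TIPLOC matches, else append the station.

    Returns the new frame and a flag telling whether a row was added.
    """
    mask = df["tiploc"] == station["tiploc"]
    if mask.any():
        df = df.copy()
        df.loc[mask, "dft_category"] = station["dft_category"]
        return df, False
    return pd.concat([df, pd.DataFrame([station])], ignore_index=True), True


# Hypothetical toy frame: the existing row shares a STANOX with the new station.
existing = pd.DataFrame(
    [{"tiploc": "TIPA", "stanox": "11111", "dft_category": "B"}]
)
new_station = {"tiploc": "TIPB", "stanox": "11111", "dft_category": "A"}

updated, was_added = upsert_station(existing, new_station)
print(was_added, updated["tiploc"].tolist())
```

With TIPLOC-only matching the new row is appended and the existing row keeps its own category. Whether two TIPLOCs on one STANOX should both survive is then decided by the deduplication step rather than by this one.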


## Regenerate Cleaned File with Corrections

Now we'll reprocess the corrected data through the filtering and deduplication pipeline.

In [34]:
# Apply the complete processing pipeline to the corrected data
print("Reprocessing corrected data through filtering and deduplication pipeline...")
print("=" * 80)

# Step 1: Filter by schedule TIPLOCs
stations_dft_corrected_filtered = stations_dft_corrected[stations_dft_corrected['tiploc'].isin(all_tiplocs)].copy()
print(f"After filtering by schedule TIPLOCs: {len(stations_dft_corrected_filtered)} entries")

# Step 2: Add usage counts
stations_dft_corrected_filtered['tiploc_usage_count'] = stations_dft_corrected_filtered['tiploc'].map(tiploc_counts).fillna(0).astype(int)

# Step 3: Smart deduplication
stations_dft_corrected_filtered['has_dft_category'] = stations_dft_corrected_filtered['dft_category'].notna().astype(int)
stations_dft_corrected_filtered['priority_score'] = (
    stations_dft_corrected_filtered['has_dft_category'] * 1000000 + 
    stations_dft_corrected_filtered['tiploc_usage_count']
)

stations_dft_corrected_sorted = stations_dft_corrected_filtered.sort_values(
    by=['stanox', 'priority_score', 'tiploc'], 
    ascending=[True, False, True]
)

stations_dft_corrected_deduplicated = stations_dft_corrected_sorted.drop_duplicates(subset='stanox', keep='first')

print(f"After smart deduplication: {len(stations_dft_corrected_deduplicated)} entries")
print(f"Removed {len(stations_dft_corrected_filtered) - len(stations_dft_corrected_deduplicated)} duplicates")

# Step 4: Drop helper columns and prepare final data
columns_to_drop = ['tiploc_usage_count', 'has_dft_category', 'priority_score']
stations_final_corrected = stations_dft_corrected_deduplicated.drop(
    columns=[col for col in columns_to_drop if col in stations_dft_corrected_deduplicated.columns]
).copy()

# Step 5: Fix numeric column formatting
numeric_to_string_cols = ['nlc', 'stanox']
for col in numeric_to_string_cols:
    if col in stations_final_corrected.columns:
        stations_final_corrected[col] = stations_final_corrected[col].fillna('').apply(
            lambda x: str(int(float(x))) if x != '' and pd.notna(x) else ''
        )

print("\nFinal DfT Category Distribution (Corrected):")
print(stations_final_corrected['dft_category'].value_counts().sort_index())
print(f"Total with DfT category: {stations_final_corrected['dft_category'].notna().sum()}")
print(f"Total without DfT category: {stations_final_corrected['dft_category'].isna().sum()}")

Reprocessing corrected data through filtering and deduplication pipeline...
After filtering by schedule TIPLOCs: 4319 entries
After smart deduplication: 4319 entries
Removed 0 duplicates

Final DfT Category Distribution (Corrected):
dft_category
A      20
B      63
C1     95
C2    195
Name: count, dtype: int64
Total with DfT category: 373
Total without DfT category: 3946
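
The `priority_score` above folds two criteria into one number, which only ranks correctly while every usage count stays below 1,000,000. A sketch of an alternative that sorts on the two criteria as separate keys follows. The toy rows and counts are invented for illustration; the column names match the pipeline.

```python
import pandas as pd

# Hypothetical toy rows: two TIPLOCs share STANOX 11111.
toy = pd.DataFrame(
    {
        "stanox": ["11111", "11111", "22222"],
        "tiploc": ["TIPA", "TIPB", "TIPC"],
        "dft_category": [None, "A", "A"],
        "tiploc_usage_count": [5_000_000, 10, 300],
    }
)

toy["has_dft_category"] = toy["dft_category"].notna()

# Sort keys are compared left to right, so having a DfT category always
# outranks usage count, however large the count is.
deduped = (
    toy.sort_values(
        by=["stanox", "has_dft_category", "tiploc_usage_count", "tiploc"],
        ascending=[True, False, False, True],
    )
    .drop_duplicates(subset="stanox", keep="first")
    .drop(columns="has_dft_category")
)

print(deduped[["stanox", "tiploc"]].to_string(index=False))
```

In this toy case the combined score would keep `TIPA` (5,000,000 beats 1,000,010), while the multi-key sort keeps `TIPB`, the row that carries a DfT category.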


In [35]:
# Save the final corrected and cleaned reference file
output_file_corrected = '../data/reference provided/stations_ref_with_dft_cleaned.json'

# Convert to list of dictionaries for JSON format
stations_cleaned_corrected_records = stations_final_corrected.to_dict('records')

import json
with open(output_file_corrected, 'w') as f:
    json.dump(stations_cleaned_corrected_records, f, indent=2)

print(f"Saved corrected cleaned reference file to: {output_file_corrected}")
print(f"Total entries: {len(stations_cleaned_corrected_records)}")
print("\nSummary:")
print(f"  - Original entries: {len(stations_dft)}")
print(f"  - After manual corrections: {len(stations_dft_corrected)}")
print(f"  - After filtering by schedule TIPLOCs: {len(stations_dft_corrected_filtered)}")
print(f"  - After smart deduplication: {len(stations_final_corrected)}")
print(f"  - Total removed: {len(stations_dft_corrected) - len(stations_final_corrected)}")
print("\nProcessing:")
print("  - Applied manual DfT category corrections")
print("  - Added missing stations with DfT categories")
print("  - Filtered by schedule TIPLOCs")
print("  - Smart deduplication (DfT category + usage frequency)")
print("  - Proper JSON formatting (nlc/stanox as strings, lat/lon as numbers)")

Saved corrected cleaned reference file to: ../data/reference provided/stations_ref_with_dft_cleaned.json
Total entries: 4319

Summary:
  - Original entries: 4317
  - After manual corrections: 4321
  - After filtering by schedule TIPLOCs: 4319
  - After smart deduplication: 4319
  - Total removed: 2

Processing:
  - Applied manual DfT category corrections
  - Added missing stations with DfT categories
  - Filtered by schedule TIPLOCs
  - Smart deduplication (DfT category + usage frequency)
  - Proper JSON formatting (nlc/stanox as strings, lat/lon as numbers)
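
By default `json.dump` writes float NaN values as the bare token `NaN`, which is not valid JSON and may be rejected by stricter parsers. If any records still hold NaN (for example a missing longitude), they can be mapped to `None` before writing. This is a sketch under that assumption; `nan_to_none` and the toy records are hypothetical.

```python
import json
import math


def nan_to_none(records):
    """Replace float NaN values with None so json.dump writes null."""
    return [
        {
            key: (None if isinstance(value, float) and math.isnan(value) else value)
            for key, value in record.items()
        }
        for record in records
    ]


# Hypothetical records, shaped like DataFrame.to_dict('records') output.
toy_records = [
    {"tiploc": "TIPA", "longitude": float("nan"), "latitude": 51.5},
    {"tiploc": "TIPB", "longitude": -0.1, "latitude": 51.4},
]

clean = nan_to_none(toy_records)

# allow_nan=False makes json raise instead of silently writing NaN.
text = json.dumps(clean, indent=2, allow_nan=False)
print(json.loads(text)[0]["longitude"])
```

Passing `allow_nan=False` to the real `json.dump` call would also act as a guard, because it raises `ValueError` if a NaN slips through.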


## Check for TIPLOC Duplicates in Final Cleaned File

Verify whether any TIPLOC appears more than once in the final corrected cleaned file.

In [36]:
# Check for TIPLOC duplicates in the final cleaned file
print("Checking for TIPLOC duplicates in final cleaned file:")
print("=" * 80)

# Check for duplicate TIPLOCs
tiploc_duplicates = stations_final_corrected['tiploc'].duplicated().sum()
print(f"Number of duplicate TIPLOC entries: {tiploc_duplicates}")

if tiploc_duplicates > 0:
    # Show the duplicate TIPLOCs
    duplicate_tiplocs_df = stations_final_corrected[stations_final_corrected['tiploc'].duplicated(keep=False)].sort_values('tiploc')
    print(f"\nTotal entries with duplicate TIPLOCs: {len(duplicate_tiplocs_df)}")
    print(f"Unique TIPLOCs that are duplicated: {duplicate_tiplocs_df['tiploc'].nunique()}")
    
    print("\nDuplicate TIPLOC entries:")
    print(duplicate_tiplocs_df[['tiploc', 'stanox', 'description', 'dft_category', 'crs']].to_string())
else:
    print("\n✓ No duplicate TIPLOCs found!")
    print("Each TIPLOC appears only once in the final cleaned file.")

print("\n" + "=" * 80)
print("Summary:")
print(f"  - Total entries: {len(stations_final_corrected)}")
print(f"  - Unique TIPLOCs: {stations_final_corrected['tiploc'].nunique()}")
print(f"  - Unique STANOX: {stations_final_corrected['stanox'].nunique()}")
print(f"  - Duplicate TIPLOCs: {tiploc_duplicates}")

Checking for TIPLOC duplicates in final cleaned file:
Number of duplicate TIPLOC entries: 0

✓ No duplicate TIPLOCs found!
Each TIPLOC appears only once in the final cleaned file.

Summary:
  - Total entries: 4319
  - Unique TIPLOCs: 4319
  - Unique STANOX: 4319
  - Duplicate TIPLOCs: 0


## Final DfT Categories with Station Descriptions

Display all stations by their DfT category in the final cleaned file.

In [37]:
# Display all stations by DfT category in the final cleaned file
print("FINAL DfT CATEGORY DISTRIBUTION IN CLEANED FILE")
print("=" * 100)

categories = ['A', 'B', 'C1', 'C2', 'D', 'E', 'F']

for category in categories:
    stations_in_category = stations_final_corrected[stations_final_corrected['dft_category'] == category]
    print(f"\n{'='*100}")
    print(f"DfT Category {category}: {len(stations_in_category)} stations")
    print(f"{'='*100}")
    
    if len(stations_in_category) > 0:
        # Sort by description for easier reading
        stations_sorted = stations_in_category.sort_values('description')
        
        # Display in a compact format
        for idx, row in stations_sorted.iterrows():
            name = str(row.get('description', 'N/A'))[:50]
            crs = str(row.get('crs', 'N/A'))[:4] if pd.notna(row.get('crs')) else 'N/A'
            stanox = str(row['stanox']) if pd.notna(row['stanox']) else 'N/A'
            tiploc = str(row['tiploc'])[:12] if pd.notna(row['tiploc']) else 'N/A'
            print(f"  {name:50} | CRS: {crs:4} | STANOX: {stanox:6} | TIPLOC: {tiploc:12}")
    else:
        print(f"  No stations found in category {category}")

print(f"\n{'='*100}")
print("\nSUMMARY:")
print(f"  Total entries in cleaned file: {len(stations_final_corrected)}")
print(f"  Stations with DfT category: {stations_final_corrected['dft_category'].notna().sum()}")
print(f"  Stations without DfT category: {stations_final_corrected['dft_category'].isna().sum()}")
print("\nCategory breakdown:")
for category in categories:
    count = len(stations_final_corrected[stations_final_corrected['dft_category'] == category])
    if count > 0:
        print(f"    {category}: {count}")
print(f"{'='*100}")

FINAL DfT CATEGORY DISTRIBUTION IN CLEANED FILE

DfT Category A: 20 stations
  Birmingham New Street                              | CRS: BHM  | STANOX: 65630  | TIPLOC: BHAMNWS     
  Blackfriars London                                 | CRS: BFR  | STANOX: 87245  | TIPLOC: BLFR        
  Bristol Temple Meads                               | CRS: BRI  | STANOX: 81700  | TIPLOC: BRSTLTM     
  Cannon Street London                               | CRS: CST  | STANOX: 88403  | TIPLOC: CANONST     
  Cardiff Central                                    | CRS: CDF  | STANOX: 77301  | TIPLOC: CRDFCEN     
  Charing Cross London                               | CRS: CHX  | STANOX: 88401  | TIPLOC: CHRX        
  Eastern                                            | CRS: N/A  | STANOX: 87601  | TIPLOC: LNDNBDE     
  Euston London                                      | CRS: EUS  | STANOX: 72410  | TIPLOC: EUSTON      
  Fenchurch Street London                            | CRS: FST  | STANOX: 52711  |

## Investigate Missing Category A Stations

Check why Victoria London, London Bridge, and Waterloo London are not in the final file.

In [38]:
# Check if the manually added stations are present and why they might be missing
print("Investigating manually added Category A stations:")
print("=" * 100)

# Check the TIPLOCs we added
check_tiplocs = ['LNDNBDG', 'VICTRIA', 'WATRLOO']
check_names = ['London Bridge', 'Victoria London', 'Waterloo London']

for tiploc, name in zip(check_tiplocs, check_names):
    print(f"\n{name} (TIPLOC: {tiploc}):")
    
    # Check if in corrected dataset (before filtering)
    in_corrected = (stations_dft_corrected['tiploc'] == tiploc).any()
    print(f"  - In stations_dft_corrected: {in_corrected}")
    
    if in_corrected:
        station_info = stations_dft_corrected[stations_dft_corrected['tiploc'] == tiploc].iloc[0]
        print(f"    Category: {station_info['dft_category']}")
        print(f"    Description: {station_info['description']}")
        print(f"    CRS: {station_info.get('crs', 'N/A')}")
    
    # Check if TIPLOC is in schedule data
    in_schedule = tiploc in all_tiplocs
    print(f"  - TIPLOC in schedule data: {in_schedule}")
    
    # Check if in filtered dataset (after schedule TIPLOC filtering)
    in_filtered = (stations_dft_corrected_filtered['tiploc'] == tiploc).any()
    print(f"  - In stations_dft_corrected_filtered: {in_filtered}")
    
    # Check if in final dataset
    in_final = (stations_final_corrected['tiploc'] == tiploc).any()
    print(f"  - In stations_final_corrected: {in_final}")

print("\n" + "=" * 100)
print("\nDiagnosis:")
print("If TIPLOC is not in schedule data, it will be filtered out during the")
print("'Filter by schedule TIPLOCs' step, even if manually added.")

Investigating manually added Category A stations:

London Bridge (TIPLOC: LNDNBDG):
  - In stations_dft_corrected: False
  - TIPLOC in schedule data: False
  - In stations_dft_corrected_filtered: False
  - In stations_final_corrected: False

Victoria London (TIPLOC: VICTRIA):
  - In stations_dft_corrected: True
    Category: A
    Description: Victoria London
    CRS: VIC
  - TIPLOC in schedule data: False
  - In stations_dft_corrected_filtered: False
  - In stations_final_corrected: False

Waterloo London (TIPLOC: WATRLOO):
  - In stations_dft_corrected: False
  - TIPLOC in schedule data: False
  - In stations_dft_corrected_filtered: False
  - In stations_final_corrected: False


Diagnosis:
If TIPLOC is not in schedule data, it will be filtered out during the
'Filter by schedule TIPLOCs' step, even if manually added.
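
If manually added stations are meant to survive the schedule filter, one option is to exempt their TIPLOCs from it. The sketch below assumes that is the desired behaviour, which the notebook does not state. `manual_tiplocs` and the toy frame are hypothetical, and `toy_schedule_tiplocs` stands in for `all_tiplocs`.

```python
import pandas as pd

# Hypothetical toy data; TIPC is a manual addition absent from the schedule.
toy_stations = pd.DataFrame(
    {"tiploc": ["TIPA", "TIPB", "TIPC"], "dft_category": ["A", None, "A"]}
)
toy_schedule_tiplocs = {"TIPA"}
manual_tiplocs = {"TIPC"}  # TIPLOCs added by hand that must survive the filter

in_schedule = toy_stations["tiploc"].isin(toy_schedule_tiplocs)
is_manual = toy_stations["tiploc"].isin(manual_tiplocs)

# Keep a row when it is used in the schedule OR was added manually.
kept = toy_stations[in_schedule | is_manual].copy()
print(kept["tiploc"].tolist())
```

The output above also shows London Bridge and Waterloo London reported as absent from `stations_dft_corrected` even though the add step printed "already exists" for them. That is consistent with a match on STANOX against a row with a different TIPLOC, so an exemption list alone would not bring those two back.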
