This visualisation uses UK station crime data scraped from the British Transport Police (BTP) website. The station list can be found here: https://crimemaps.btp.police.uk/station_list.

To match the BTP data with the locations of TfL stations specifically, I use a csv provided by TfL that gives the exact location of each of its stations. The csv can be found in this freedom of information request response: https://tfl.gov.uk/corporate/transparency/freedom-of-information/foi-request-detail?referenceId=FOI-1451-1819. Note that this csv lists stations for the London Underground, DLR, London Overground, and Tramlink networks. Because it was published in 2018, it does not include stations that opened specifically for the Elizabeth Line or any other stations that have opened since then. The csv also originally included TfL Rail, but I deleted these rows, as TfL Rail was discontinued in 2022.

In [427]:
import requests     #for making web / API calls
from bs4 import BeautifulSoup   #For parsing the HTML into something we can search
import pandas as pd


In [428]:
stations_list="https://crimemaps.btp.police.uk/station_list"
response = requests.get(stations_list) 

In [429]:
response.text

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<!--[if lt IE 7 ]><html lang="en" class="no-js ie6" xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml"><![endif]-->\n<!--[if IE 7 ]><html lang="en" class="no-js ie7"><![endif]-->\n<!--[if IE 8 ]><html lang="en" class="no-js ie8"><![endif]-->\n<!--[if IE 9 ]><html lang="en" class="no-js ie9"><![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" class="no-js"><!--<![endif]-->\n<head>\n\n    <meta name="language" content="en-gb"/>\n    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>\n\n    <meta name="author" content="Rock Kitchen Harris, Leicester - http://www.rkh.co.uk/"/>\n\n    <title>A-Z list of stations</title>\n\n    <meta name="keywords" content="british transport police crime maps local rail network"/>\n    <m

In [456]:
soup = BeautifulSoup(response.text, "html.parser")  #Naming the parser explicitly avoids bs4 having to guess one

In [457]:
len(soup.body.find_all('li')) #Stations are listed as <li> elements, and there are 3003 of them on this website

3003

In [458]:
stations = soup.find_all('li')  

Step 2: Matching the Stations

The BTP list labels each station as (LU Station) for the London Underground, (DLR) for the Docklands Light Railway, (Tram Stop) for all tram stops, including TfL's Tramlink, or (Station) for everything else, including London Overground stations.
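As a minimal sketch of how these labels can be read off a list entry, the helper below splits a name into the station name and its final bracketed label. The function name `split_label` and the pattern are my own illustration and are not used elsewhere in this notebook.

```python
import re

# Matches one trailing bracketed label such as "(DLR)" or "(LU Station)"
SUFFIX_PATTERN = re.compile(r"\s*\(([^()]*)\)\s*$")

def split_label(btp_name):
    """Return (station name, label) for a BTP list entry; label is None if absent."""
    match = SUFFIX_PATTERN.search(btp_name)
    if match is None:
        return btp_name.strip(), None
    return btp_name[:match.start()].strip(), match.group(1).strip()

print(split_label("Acton Town (LU Station)"))            # ('Acton Town', 'LU Station')
print(split_label("Kensington (Olympia) (LU Station)"))  # ('Kensington (Olympia)', 'LU Station')
```

Only the last bracketed group is treated as the label, so a station whose own name contains brackets keeps them.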

Every station name has its own <a href> element which will be important for scraping the data later on. 

In [None]:
# Step 1: Cleaning all station names and track hrefs
station_data = []  # List to store dictionaries with original names, cleaned names, and hrefs

for station in stations:
    original_name = station.get_text(strip=True)
    link_tag = station.find('a')  # Find the <a> tag
    href = link_tag['href'] if link_tag else None  # Extract href if it exists

    # I noticed some inconsistencies within the BTP list which I knew would cause issues in matching with TfL's station csv.
    cleaned_name = ( 
        original_name.replace(" (Gt London)", "")  # Remove "(Gt London)"
                     .replace("Shadwell (LU Station)", "Shadwell (London Overground)")  # Correct Shadwell misclassification
                     .replace("Custom House (Station)", "Custom House (DLR)")  # Correct Custom House classification
                     .replace("Liverpool Lime Street (LU Station)", "Liverpool Lime Street (Station)")  # Correct misclassification. Liverpool Lime Street is in the city of Liverpool and does not belong to the LU network
                     .replace("Staion", "Station")  # Fix typo
                     .replace("  ", " ")  # Fix double spaces
                     .strip()  # Remove leading/trailing spaces
    )

    # I only want the TfL-relevant stations, not all the stations listed, so I condition for the suffixes that BTP uses in its list:
    if '(DLR)' in cleaned_name or '(London Overground)' in cleaned_name or '(LU Station)' in cleaned_name:
        # Appending the relevant station data to the list
        station_data.append({
            "Original Name": original_name,
            "Cleaned Name": cleaned_name,
            "Href": href
        })

# Printing the number of relevant stations. 311 London Underground and DLR stations scraped here.
print(f"Number of TfL stations scraped: {len(station_data)}")


Since the BTP station list classifies London Overground stations and Tramlink stations as "(Station)" and "(Tram Stop)", respectively, which it also does for non-London stations, I use the TfL-sourced station csv to determine the names of the London Overground and Tramlink stations that I need to scrape from the BTP list. 

In [460]:
csv_file = 'Stations_20180921.csv'
tfl_stations_df = pd.read_csv(csv_file)

In [461]:
#Creating a list of the stations marked in the csv as servicing the London Overground and Tramlink
overground_stations_csv = tfl_stations_df[tfl_stations_df['NETWORK'] == 'London Overground']['NAME'].tolist()
tramlink_stations_csv = tfl_stations_df[tfl_stations_df['NETWORK'] == 'Tramlink']['NAME'].tolist()


In [None]:
len(overground_stations_csv) #there are 110 stations in the TfL csv marked as servicing London Overground

In [None]:
len(tramlink_stations_csv) #there are 39 stations in the TfL csv marked as servicing the Tramlink

In [None]:
overground_count = 0
tram_count = 0
processed_stations = set()

# Convert the Overground names from the csv to a lowercase set for comparison
overground_stations_csv_set = set(name.lower().strip() for name in overground_stations_csv)

# Do the same for Tramlink, first removing any network suffix from the csv names
tramlink_stations_csv_set = set(
    name.replace(" (Tramlink)", "").replace(" (Station)", "").replace(" (Tram Stop)", "").strip().lower()
    for name in tramlink_stations_csv)
print(f"Tramlink CSV Set: {tramlink_stations_csv_set}")



for station in stations:
    # Extract the original name
    station_name = station.get_text(strip=True)
    link_tag = station.find('a')  # Find the <a> tag
    href = link_tag['href'] if link_tag else None  # Extract href if it exists

    # Process Overground stations (stations with " (Station)")
    if "(Station)" in station_name:
        stripped_overground_name = station_name.replace(" (Station)", "").strip().lower()
        if stripped_overground_name in overground_stations_csv_set:
            cleaned_name = f"{station_name.replace(' (Station)', '')} (London Overground)"
            station_data.append({
                "Original Name": station_name,
                "Cleaned Name": cleaned_name,
                "Href": href
            })
            overground_count += 1
            processed_stations.add(station_name)
            print(f"Added Overground station: {cleaned_name}")


    # Process Tramlink stations (stations with " (Tram Stop)")
    if "(Tram Stop)" in station_name:
        stripped_tramlink_name = station_name.replace(" (Tram Stop)", "").replace(" (Tramlink)", "").strip().lower()
        if stripped_tramlink_name in tramlink_stations_csv_set:
            cleaned_name = f"{station_name.replace(' (Tram Stop)', '')} (Tramlink)"
            station_data.append({
                "Original Name": station_name,
                "Cleaned Name": cleaned_name,
                "Href": href
            })
            tram_count += 1
            print(f"Added Tramlink station: {cleaned_name}. Original name: {station_name} -> Stripped: {stripped_tramlink_name}")
        else:
            print(f"Mismatch: {station_name} -> Stripped: {stripped_tramlink_name}")




# Summary
print(f"Number of London Overground stations added: {overground_count}")
print(f"Number of Tramlink stations added: {tram_count}")
print(f"Total stations after adding Overground and Tramlink: {len(station_data)}")

# Convert to DataFrame for easier handling
station_df = pd.DataFrame(station_data)

Step 3: Checking for consistency and matches between the two lists (csv and scraped lists)

In [464]:
#To make the naming consistent between the csv and scraped lists...
network_suffix_mapping = {
    "London Underground": "(LU Station)",
    "DLR": "(DLR)",
    "London Overground": "(London Overground)",
    "Tramlink": "(Tramlink)"
}

In [465]:
#...The above mapping applied to the names of stations from the csv file
tfl_stations_df['Updated Name'] = tfl_stations_df.apply(
    lambda row: f"{row['NAME']} {network_suffix_mapping[row['NETWORK']]}"
                if row['NETWORK'] in network_suffix_mapping else row['NAME'],
    axis=1
)

In [None]:
print(tfl_stations_df[['NAME', 'NETWORK', 'Updated Name']])


In [1]:
#To check for matches, I convert both lists to a set and consider them as lowercase (in case there are discrepancies between the two)

# Convert scraped station names to lowercase set
scraped_stations_set = set(station['Cleaned Name'].lower() for station in station_data)

# Convert updated CSV station names to lowercase set
csv_stations_set = set(name.lower() for name in tfl_stations_df['Updated Name'])



In [468]:
#To find matches, what's missing in CSV, and what's missing in scraped:
matches = scraped_stations_set & csv_stations_set
missing_in_csv = scraped_stations_set - csv_stations_set
missing_in_scraped = csv_stations_set - scraped_stations_set

In [2]:
#Matching and missing checks
print(f"Exact matches ({len(matches)}):")
print(matches)
#There are 442 stations that are exact matches between the lists

print(f"\nStations in scraped list but missing in CSV ({len(missing_in_csv)}):")
print(missing_in_csv)
#There were initially 4 stations in the scraped list missing in the csv. One of these was Liverpool Lime Street, which
# was marked erroneously as an LU Station. This was corrected above.

print(f"\nStations in CSV but missing in scraped list ({len(missing_in_scraped)}):")
print(missing_in_scraped)
#There are 20 stations that are listed in the csv but missing on the BTP website


Step 4: Addressing inconsistencies

The inconsistencies that matter to me here are the stations that are in the scraped list but not in the csv, as these have crime data (harder to find) but are missing coordinates (quite easy to find).

The missing ones:

'hammersmith (district and piccadilly lines) (lu station)' and 'hammersmith (hammersmith and city line) (lu station)'
    These two don't matter, because the scraped list already includes a Hammersmith (LU Station) item, so I can ignore them.

'nine elms (lu station)'
    It makes sense that this one is missing, as the station only opened in 2021 and the csv file was provided in 2018. I can fix this with a Google search for its coordinates and then add the station to the list.

Nine Elms:
51.4799° N, 0.1285° W
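The csv stores longitude in x and latitude in y as signed decimal degrees, so a coordinate written with compass letters needs its sign worked out first (west and south become negative). A small sketch of that conversion, where the helper name `to_signed_degrees` is hypothetical and only handles this "number° letter" style:

```python
import re

def to_signed_degrees(text):
    """Convert a string like '51.4799° N, 0.1285° W' to (latitude, longitude)."""
    parts = re.findall(r"([0-9.]+)\s*°\s*([NSEW])", text)
    values = {}
    for number, hemisphere in parts:
        sign = -1.0 if hemisphere in "SW" else 1.0    # south and west are negative
        axis = "lat" if hemisphere in "NS" else "lon"
        values[axis] = sign * float(number)
    return values["lat"], values["lon"]

latitude, longitude = to_signed_degrees("51.4799° N, 0.1285° W")
print(latitude, longitude)  # 51.4799 -0.1285
```

For Nine Elms this gives y = 51.4799 and x = -0.1285.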

The 20 stations that are in the csv but not the scraped list are not concerning. Upon checking these, I noticed that most were stations serving multiple networks but listed under only one network on the BTP website, which is no problem as the data will be captured anyway. Some stations are simply not listed on the BTP website, which is a problem I cannot address, as I cannot visualise anything for stations without crime data.
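Checking unmatched names by eye works for a handful of stations. As a possible aid, the sketch below uses difflib from the standard library to suggest the closest candidate for each unmatched name. This is not what I did in the notebook; the function name `suggest_matches`, the cutoff of 0.8, and the misspelt "shadwel" entry are all made up for illustration.

```python
import difflib

def suggest_matches(unmatched, candidates, cutoff=0.8):
    """Map each unmatched name to its closest candidates (possibly an empty list)."""
    return {
        name: difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)
        for name in sorted(unmatched)
    }

# Toy inputs; in the notebook these would be missing_in_csv and csv_stations_set
unmatched = {"nine elms (lu station)", "shadwel (london overground)"}
candidates = ["shadwell (london overground)", "bank (dlr)", "beckton (dlr)"]

for name, guesses in suggest_matches(unmatched, candidates).items():
    print(name, "->", guesses)
```

A name with a suggestion is probably a spelling or labelling difference, while a name with no suggestion (like Nine Elms here) is more likely to be genuinely absent from the other list.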



In [470]:
tfl_stations_df['Zone'] = tfl_stations_df['Zone'].astype(float)


In [471]:
new_station = { #filling out the relevant columns in the dataframe
    "NAME": "Nine Elms", 
    "x": -0.1285,  # Longitude
    "y": 51.4799,  # Latitude
    "Zone": 1.0, #TfL Zone
    "NETWORK": "London Underground" 
}
tfl_stations_df = pd.concat([tfl_stations_df, pd.DataFrame([new_station])], ignore_index=True)
print(tfl_stations_df.tail())  # Print the last few rows to confirm addition


       FID  OBJECTID                NAME   EASTING  NORTHING LINES  \
419  473.0     137.0             Clapton  534775.0  186528.0   NaN   
420  474.0     381.0      Crystal Palace  534111.0  170555.0   NaN   
421  477.0     363.0     Woodgrange Park  541821.0  185350.0   NaN   
422  478.0     364.0  Willesden Junction  521879.0  182944.0   NaN   
423    NaN       NaN           Nine Elms       NaN       NaN   NaN   

                NETWORK  Zone         x          y  \
419   London Overground   0.0 -0.055485  51.561030   
420   London Overground   0.0 -0.071128  51.417633   
421   London Overground   0.0  0.045631  51.548716   
422   London Overground   0.0 -0.242689  51.531751   
423  London Underground   1.0 -0.128500  51.479900   

                               Updated Name  
419             Clapton (London Overground)  
420      Crystal Palace (London Overground)  
421     Woodgrange Park (London Overground)  
422  Willesden Junction (London Overground)  
423                     

In [473]:
network_suffix_mapping = { #redoing the updated name changes so it can also apply to Nine Elms
    "London Underground": "(LU Station)",
    "DLR": "(DLR)",
    "London Overground": "(London Overground)"
    # "Tramlink" is not included this time, so Tramlink rows keep their plain NAME as 'Updated Name'
}

In [474]:
tfl_stations_df['Updated Name'] = tfl_stations_df.apply(
    lambda row: f"{row['NAME']} {network_suffix_mapping[row['NETWORK']]}"
                if row['NETWORK'] in network_suffix_mapping else row['NAME'],
    axis=1
)

In [475]:
print(tfl_stations_df.tail())  # Printing the last few rows to confirm addition


       FID  OBJECTID                NAME   EASTING  NORTHING LINES  \
419  473.0     137.0             Clapton  534775.0  186528.0   NaN   
420  474.0     381.0      Crystal Palace  534111.0  170555.0   NaN   
421  477.0     363.0     Woodgrange Park  541821.0  185350.0   NaN   
422  478.0     364.0  Willesden Junction  521879.0  182944.0   NaN   
423    NaN       NaN           Nine Elms       NaN       NaN   NaN   

                NETWORK  Zone         x          y  \
419   London Overground   0.0 -0.055485  51.561030   
420   London Overground   0.0 -0.071128  51.417633   
421   London Overground   0.0  0.045631  51.548716   
422   London Overground   0.0 -0.242689  51.531751   
423  London Underground   1.0 -0.128500  51.479900   

                               Updated Name  
419             Clapton (London Overground)  
420      Crystal Palace (London Overground)  
421     Woodgrange Park (London Overground)  
422  Willesden Junction (London Overground)  
423                  Nin

In [479]:
#checking that the addition of nine elms produces no further match errors

csv_stations_set = set(name.lower() for name in tfl_stations_df['Updated Name'])


# Find matches, missing in CSV, and missing in scraped
matches = scraped_stations_set & csv_stations_set
missing_in_csv = scraped_stations_set - csv_stations_set
missing_in_scraped = csv_stations_set - scraped_stations_set

# Print results
print(f"Exact matches ({len(matches)}):")
print(matches)

print(f"\nStations in scraped list but missing in CSV ({len(missing_in_csv)}):")
print(missing_in_csv)

print(f"\nStations in CSV but missing in scraped list ({len(missing_in_scraped)}):")
print(missing_in_scraped)

Exact matches (411):
{'hendon central (lu station)', 'barkingside (lu station)', 'beckton (dlr)', 'dalston junction (london overground)', 'headstone lane (london overground)', 'theydon bois (lu station)', 'west croydon (london overground)', 'wood street (london overground)', 'ickenham (lu station)', 'northwood hills (lu station)', 'west silvertown (dlr)', 'leyton (lu station)', 'finchley central (lu station)', 'ruislip manor (lu station)', 'upney (lu station)', 'kensington (olympia) (lu station)', 'canons park (lu station)', 'temple (lu station)', 'london bridge (lu station)', 'all saints (dlr)', 'hornchurch (lu station)', 'cockfosters (lu station)', 'embankment (lu station)', 'kenton (lu station)', 'harlesden (london overground)', 'chalfont & latimer (lu station)', 'barking (london overground)', 'hyde park corner (lu station)', 'bruce grove (london overground)', 'upton park (lu station)', 'denmark hill (london overground)', 'seven sisters (lu station)', 'edgware (lu station)', 'russel

In [480]:
print(station_df)

                                         Original Name  \
0                                     Abbey Road (DLR)   
1                              Acton Town (LU Station)   
2                                 Aldgate (LU Station)   
3                            Aldgate East (LU Station)   
4                                     All Saints (DLR)   
5                                Alperton (LU Station)   
6                                Amersham (LU Station)   
7                                   Angel (LU Station)   
8                                 Archway (LU Station)   
9                             Arnos Grove (LU Station)   
10                                Arsenal (LU Station)   
11                           Baker Street (LU Station)   
12                                 Balham (LU Station)   
13                                          Bank (DLR)   
14                                   Bank (LU Station)   
15                               Barbican (LU Station)   
16            

Step 5: Fetching the crime data

In [338]:
base_url = "https://crimemaps.btp.police.uk"
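The loop in the next cell sends one request per station, which is several hundred requests in total. A helper along the following lines could retry a failed request after a short pause. It is only a sketch: `fetch_with_retries` is a hypothetical name, the retry count and delay are arbitrary, and the notebook itself simply marks failed stations as "Error".

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay_seconds=1.0):
    """Call fetch(url), retrying after a pause if it raises an exception.

    `fetch` is any callable that returns a result or raises on failure,
    for example: lambda u: requests.get(u, timeout=30)
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # with requests this could be narrowed to requests.RequestException
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)  # wait before trying again
    raise last_error
```

Passing the fetching function in as an argument keeps the helper independent of requests, so it can be tried out with a stand-in function before pointing it at the real website.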

In [481]:
# Iterating through station_data to fetch crime data
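for station in station_data:
    # Skipping entries without a link, as there is no page to request
    if not station["Href"]:
        station["Crime Data"] = "N/A"
        continue

    # Constructing the full URL
    station_url = f"{base_url}{station['Href']}"

    try:
        # Requesting the station page (the timeout stops one slow page from stalling the loop)
        station_response = requests.get(station_url, timeout=30)
        station_response.raise_for_status()  # Raise an exception for HTTP errors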

        # Parsing the station page
        station_soup = BeautifulSoup(station_response.text, "html.parser")

        # Extracting table data from <tbody> under <tr class="thisyear">
        tbody = station_soup.find("tbody")
        if tbody:
            row = tbody.find("tr", class_="thisyear")  # Locate the row for the current year
            if row:
                total_crime = row.find("td", class_="total fit")  # The cell holding the total for that row
                station["Crime Data"] = total_crime.get_text(strip=True) if total_crime else "N/A"
            else:
                station["Crime Data"] = "N/A"  # If no matching row is found
        else:
            station["Crime Data"] = "N/A"  # If <tbody> is missing

    except requests.RequestException as e:
        print(f"Failed to fetch data for {station['Cleaned Name']}: {e}")
        station["Crime Data"] = "Error"  # Mark as an error for failed requests

# Converting to DataFrame
station_df = pd.DataFrame(station_data)

# Displaying the updated DataFrame with crime data
print(station_df.head())


               Original Name               Cleaned Name           Href  \
0           Abbey Road (DLR)           Abbey Road (DLR)  /data/9267080   
1    Acton Town (LU Station)    Acton Town (LU Station)  /data/6033747   
2       Aldgate (LU Station)       Aldgate (LU Station)  /data/9283850   
3  Aldgate East (LU Station)  Aldgate East (LU Station)  /data/9249834   
4           All Saints (DLR)           All Saints (DLR)  /data/9249838   

  Crime Data  
0          5  
1         27  
2         14  
3         42  
4          9  
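The "Crime Data" column holds text, and it can contain the placeholders "N/A" and "Error" as well as counts. Before plotting, the values would need converting to numbers. A minimal sketch of one way to do that, where `parse_crime_count` is a hypothetical helper and the removal of thousands separators is an assumption about how larger counts might be written:

```python
def parse_crime_count(value):
    """Return the count as an int, or None for placeholders such as "N/A" or "Error"."""
    text = str(value).strip().replace(",", "")  # assumption: large counts may contain commas
    return int(text) if text.isdigit() else None

print([parse_crime_count(v) for v in ["5", "27", "N/A", "Error"]])  # [5, 27, None, None]
```

In the notebook this could be applied with station_df["Crime Data"].map(parse_crime_count), keeping the original text column alongside if the placeholders are still useful.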


In [485]:
#Adding name and network columns to the dataframe so they can easily match up with the csv data
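# Only the final bracketed label is the network, so names such as "Kensington (Olympia)" keep their own brackets
station_df["NAME"] = station_df["Cleaned Name"].str.replace(r"\s*\([^()]*\)$", "", regex=True).str.strip()
station_df["NETWORK"] = station_df["Cleaned Name"].str.extract(r"\(([^()]*)\)$", expand=False).str.strip()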

station_df["NETWORK"] = station_df["NETWORK"].replace("LU Station", "London Underground")



In [486]:
print(station_df.head())


               Original Name               Cleaned Name           Href  \
0           Abbey Road (DLR)           Abbey Road (DLR)  /data/9267080   
1    Acton Town (LU Station)    Acton Town (LU Station)  /data/6033747   
2       Aldgate (LU Station)       Aldgate (LU Station)  /data/9283850   
3  Aldgate East (LU Station)  Aldgate East (LU Station)  /data/9249834   
4           All Saints (DLR)           All Saints (DLR)  /data/9249838   

  Crime Data          NAME             NETWORK  
0          5    Abbey Road                 DLR  
1         27    Acton Town  London Underground  
2         14       Aldgate  London Underground  
3         42  Aldgate East  London Underground  
4          9    All Saints                 DLR  


In [487]:
# Saving the DataFrame to a CSV file
output_file = "station_data_with_crime.csv"  # Specify the output file name
station_df.to_csv(output_file, index=False)

print(f"DataFrame successfully saved to {output_file}")

DataFrame successfully saved to station_data_with_crime.csv
