# Generating accessibility indicators using Google Colab
This notebook lets you run the accessibility code in the cloud.

# GPU check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_, and confirm the instance type is _GPU_. Run `!nvidia-smi` to see which GPU you have been allocated. Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
! nvidia-smi

Thu Jul 11 10:42:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
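If you would rather have the notebook check for a GPU itself, here is a minimal sketch. It assumes `nvidia-smi` is on the path whenever a GPU runtime is attached; the helper name `gpu_names` is my own and is not part of RAPIDS or Colab.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none is found."""
    # nvidia-smi is absent on CPU-only runtimes, so check for it first.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
if names:
    print("GPU detected:", ", ".join(names))
else:
    print("No GPU detected: switch the runtime type to GPU before continuing.")
```

Running this before the RAPIDS install saves waiting several minutes only to find out the runtime type was wrong.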
                                                                    

# Setup:
First, we need to install RAPIDS within the Colab environment. The code below does the following:

1. Checks that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip, which are:
   1. cuDF
   1. cuML
   1. cuGraph
   1. cuSpatial
   1. cuxfilter
   1. cuCIM
   1. xgboost

This will complete in ~5 minutes. It needs to be done every time you load up this notebook within Colab.


In [1]:
! git clone https://github.com/rapidsai/rapidsai-csp-utils.git # Fetch the RAPIDS install scripts for Colab
! python rapidsai-csp-utils/colab/pip-install.py # Install RAPIDS


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 490, done.
remote: Counting objects: 100% (221/221), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 490 (delta 149), reused 124 (delta 91), pack-reused 269
Receiving objects: 100% (490/490), 136.70 KiB | 15.19 MiB/s, done.
Resolving deltas: 100% (251/251), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 1.0 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
Installing the rest of the RAPIDS 24.4.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuml-cu12==24.4.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1200.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 GB 1.1 MB/s eta 0:00:00
Collecting cugraph-cu12==24.4.*
  Downloading 

Let's check that this has worked by printing the version number of each library. If RAPIDS was installed correctly, each cell below will print a version number.

In [2]:
import cudf
cudf.__version__

'24.04.01'

In [3]:
import cuml
cuml.__version__

'24.04.00'

In [4]:
import cugraph
cugraph.__version__

'24.04.00'

In [5]:
import cuspatial
cuspatial.__version__

'24.04.00'

In [6]:
import cuxfilter
cuxfilter.__version__

'24.04.01'
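The same check can be done in one cell. This is a sketch rather than anything RAPIDS provides: `report_versions` is a hypothetical helper that tries to import each library and records its `__version__`, or `None` if the import fails.

```python
import importlib


def report_versions(module_names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Imports can fail for reasons other than ImportError (e.g. no GPU attached).
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


rapids_modules = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]
for name, version in report_versions(rapids_modules).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any line reading `NOT INSTALLED` means the install cell should be rerun before going further.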

There is just one last package to install for our code to work, pyogrio (this will need to be reinstalled every time you run this notebook in Colab).

In [7]:
! pip install pyogrio

Collecting pyogrio
  Downloading pyogrio-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.1/23.1 MB 60.0 MB/s eta 0:00:00
Installing collected packages: pyogrio
Successfully installed pyogrio-0.9.0


# Clone the GitHub repo #

First, let's clone the repo from the GitHub page and get everything set up here.

You can either run everything locally within the Colab environment (which requires uploading the data each session) or connect to your Google Drive account and run from there. I provide code for each option, although I have personally found that the first option runs slightly faster (even if it can be a faff to manually upload data to folders). First, let's set up the local option.

In [8]:
## Step 1: Clone the GitHub repo locally
! git clone https://github.com/markagreen/groundswell_indicators.git # Clone the UK Routes GitHub page (do the first time)
# ! git pull # Make sure you have the latest version
! ls # Check it has worked


Cloning into 'groundswell_indicators'...
remote: Enumerating objects: 450, done.
remote: Counting objects: 100% (259/259), done.
remote: Compressing objects: 100% (156/156), done.
remote: Total 450 (delta 125), reused 209 (delta 87), pack-reused 191
Receiving objects: 100% (450/450), 211.70 MiB | 13.53 MiB/s, done.
Resolving deltas: 100% (188/188), done.
Updating files: 100% (74/74), done.
groundswell_indicators	rapidsai-csp-utils  sample_data
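Rerunning `git clone` fails if the folder already exists, which is why the `git pull` line is kept commented out above. A small sketch of choosing between the two automatically follows; `sync_command` and `sync_repo` are hypothetical helpers of mine, and the destination path is the Colab default seen in the output.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/markagreen/groundswell_indicators.git"


def sync_command(repo_url, dest):
    """Return the git command to run: clone if dest is not a repo yet, otherwise pull."""
    dest = Path(dest)
    if (dest / ".git").is_dir():
        # Repository already cloned: update it in place.
        return ["git", "-C", str(dest), "pull"]
    return ["git", "clone", repo_url, str(dest)]


def sync_repo(repo_url, dest):
    """Run the command chosen by sync_command and raise if git fails."""
    subprocess.run(sync_command(repo_url, dest), check=True)


# Show which command would be run (calling sync_repo needs network access):
# sync_repo(REPO_URL, "/content/groundswell_indicators")
print(sync_command(REPO_URL, "/content/groundswell_indicators"))
```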


In [9]:
## Step 2: Navigate to the folder to run the code
# Define the path
path = '/content/groundswell_indicators/accessibility'

# Change directory to path
%cd {path}
! ls

/content/groundswell_indicators/accessibility
access_indicators_colab.ipynb  debug.log  pyproject.toml  requirements-dev.lock  scripts  SPEC.pdf
data			       LICENSE	  README.md	  requirements.lock	 SPEC.md  ukroutes
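Rather than reading the listing by eye, a quick sanity check can confirm we are in the right folder. This sketch assumes the names `ukroutes`, `data` and `scripts` from the listing above are the ones that matter; `missing_entries` is a hypothetical helper.

```python
from pathlib import Path


def missing_entries(directory, expected):
    """Return the expected file or folder names that are absent from directory."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).exists()]


# Names taken from the listing above; adjust if the repository layout changes.
expected = ["ukroutes", "data", "scripts"]
missing = missing_entries(".", expected)
print("Missing:", missing if missing else "nothing, the working directory looks right")
```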


The second approach is to run everything via your Google Drive. Follow the steps below to do this. If you are unable to mount Google Drive when running the mount cell further down, run the code below first. Otherwise you can skip it for now (left here in case of issues).

In [None]:
#from google.colab import drive
#drive.flush_and_unmount()
## Check and clean the mountpoint directory
#import shutil
#import os
#
#mountpoint = '/content/gdrive'
#if os.path.isdir(mountpoint):
#    shutil.rmtree(mountpoint)  # Remove the directory if it exists and contains files

To mount Google Drive, run the following.

In [None]:
#from google.colab import drive
#drive.mount('/content/gdrive', force_remount=True) # Link notebook to your Google Drive (opens a link the first time to set up)
#! ls /content/gdrive # Check it has mounted correctly

Then let's navigate to the folder where all the Colab files are stored (this folder already exists on my Google Drive; the code below creates it if you don't have one).

In [None]:
#import os
#
## Define the path
#path = '/content/gdrive/MyDrive/Colab'
#
## Check if the path exists
#if os.path.exists(path):
#    print(f"Path exists: {path}")
#else:
#    print(f"Path does not exist, creating: {path}")
#    os.makedirs(path)
#
## Change directory to path
#%cd {path}
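The exists-or-create check can be written more compactly with `pathlib`. This is only an alternative sketch (`ensure_dir` is my own name), demonstrated on a temporary folder so it runs without Google Drive mounted.

```python
import tempfile
from pathlib import Path


def ensure_dir(path):
    """Create path (and any missing parents) if needed and return it as a Path."""
    path = Path(path)
    # exist_ok=True makes this a no-op when the folder is already there.
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway location; on Colab you would pass
# '/content/gdrive/MyDrive/Colab' instead.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_dir(Path(tmp) / "Colab" / "nested")
    print(created.is_dir())  # True
```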

Next we clone the GitHub page (do the first time) or update it to the latest version of the code.

In [None]:
# ! git clone https://github.com/markagreen/groundswell_indicators.git # Clone the UK Routes GitHub page (do the first time)
# ! git pull # Make sure you have the latest version

Let's move to the UK routes directory now.

In [None]:
## Define the path
#path = '/content/gdrive/MyDrive/Colab/groundswell_indicators/accessibility'
#
## Change directory to path
#%cd {path}

Finally, list the files to double check that we are in the correct directory.

In [None]:
#! ls # List all files in directory

# Preprocessing data
The first step is to process all of the road network information into the format that we need. We only need to run this once - so once the road network has been calculated and saved we don't need to run this every time. One can skip to the next step for subsequent indicators. Before we begin, make sure to run the R file "process_input_files.R".
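Since the preprocessing only needs to run once, a guard like the following can tell you whether it has already been done. It is a sketch under assumptions: the file names `nodes.parquet` and `edges.parquet` are the ones the preprocessing script writes, while the folder used here is a stand-in for `Paths.OS_GRAPH`, and `graph_is_built` is a hypothetical helper.

```python
from pathlib import Path


def graph_is_built(graph_dir, filenames=("nodes.parquet", "edges.parquet")):
    """Return True when every expected graph file already exists in graph_dir."""
    graph_dir = Path(graph_dir)
    return all((graph_dir / name).is_file() for name in filenames)


# Assumed location: in this notebook the real path comes from Paths.OS_GRAPH.
graph_dir = Path("data/processed/osm")
if graph_is_built(graph_dir):
    print("Road network already processed: skip to the routing step.")
else:
    print("Road network files not found: run the preprocessing cell.")
```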

We first need to run the preprocessing script. I have found that it only works if I paste it in here in full, rather than running it from source. Before running it, you will need to manually upload the raw road network files to `data/raw/oproad` (do this by selecting the folder icon on the left-hand side and uploading the files manually).

In [None]:
import cudf
from scipy.spatial import distance_matrix
import pandas as pd
import cugraph
import geopandas as gpd
import polars as pl
from scipy.spatial import KDTree

from ukroutes.common.logger import logger
from ukroutes.common.utils import Paths# , filter_deadends
from ukroutes.process_routing import add_to_graph

def filter_deadends(nodes, edges):
    G = cugraph.Graph()
    G.from_cudf_edgelist(
        edges, source="start_node", destination="end_node", edge_attr="time_weighted"
    )
    components = cugraph.connected_components(G)
    component_counts = components["labels"].value_counts().reset_index()
    component_counts.columns = ["labels", "count"]

    largest_component_label = component_counts[
        component_counts["count"] == component_counts["count"].max()
    ]["labels"][0]

    largest_component_nodes = components[
        components["labels"] == largest_component_label
    ]["vertex"]
    filtered_edges = edges[
        edges["start_node"].isin(largest_component_nodes)
        & edges["end_node"].isin(largest_component_nodes)
    ]
    filtered_nodes = nodes[nodes["node_id"].isin(largest_component_nodes)]
    return filtered_nodes, filtered_edges

def process_road_edges() -> pl.DataFrame:
    """
    Create time estimates for road edges based on OS documentation

    Time estimates based on speed estimates and edge length. Speed estimates
    taken from OS documentation. This also filters to remove extra cols.

    Reads the ``road_link`` layer from ``Paths.OPROAD``; takes no arguments.

    Returns
    -------
    pl.DataFrame:
        OS highways df with time weighted estimates
    """

    a_roads = ["A Road", "A Road Primary"]
    b_roads = ["B Road", "B Road Primary"]

    road_edges: pl.DataFrame = pl.from_pandas(
        gpd.read_file(
            Paths.OPROAD,
            layer="road_link",
            ignore_geometry=True,
            engine="pyogrio",  # much faster
        )
    )

    road_edges = (
        road_edges.with_columns(
            pl.when(pl.col("road_classification") == "Motorway")
            .then(67)
            .when(
                (
                    pl.col("form_of_way").is_in(
                        ["Dual Carriageway", "Collapsed Dual Carriageway"]
                    )
                )
                & (pl.col("road_classification").is_in(a_roads))
            )
            .then(57)
            .when(
                (
                    pl.col("form_of_way").is_in(
                        ["Dual Carriageway", "Collapsed Dual Carriageway"]
                    )
                )
                & (pl.col("road_classification").is_in(b_roads))
            )
            .then(45)
            .when(
                (pl.col("form_of_way") == "Single Carriageway")
                & (pl.col("road_classification").is_in(a_roads + b_roads))
            )
            .then(25)
            .when(pl.col("road_classification").is_in(["Unclassified"]))
            .then(24)
            .when(pl.col("form_of_way").is_in(["Roundabout"]))
            .then(10)
            .when(pl.col("form_of_way").is_in(["Track", "Layby"]))
            .then(5)
            .otherwise(10)
            .alias("speed_estimate")
        )
        .with_columns(pl.col("speed_estimate") * 1.609344)
        .with_columns(
            (((pl.col("length") / 1000) / pl.col("speed_estimate")) * 60).alias(
                "time_weighted"
            ),
        )
    )
    return road_edges.select(["start_node", "end_node", "time_weighted", "length"])


def process_road_nodes() -> pl.DataFrame:
    road_nodes = gpd.read_file(Paths.OPROAD, layer="road_node", engine="pyogrio")
    road_nodes["easting"], road_nodes["northing"] = (
        road_nodes.geometry.x,
        road_nodes.geometry.y,
    )
    return pl.from_pandas(road_nodes[["id", "easting", "northing"]]).rename(
        {"id": "node_id"}
    )


def ferry_routes(road_nodes: pl.DataFrame) -> tuple[pl.DataFrame, pl.DataFrame]:
    # http://overpass-turbo.eu/?q=LyoKVGhpcyBoYcSGYmVlbiBnxI1lcmF0ZWQgYnkgdGhlIG92xJJwxIlzLXR1cmJvIHdpemFyZC7EgsSdxJ9yaWdpbmFsIHNlxLBjaMSsxIk6CsOiwoDCnHJvdcSVPWbEknJ5xYjCnQoqLwpbxYx0Ompzb25dW3RpbWXFmzoyNV07Ci8vxI_ElMSdciByZXN1bHRzCigKICDFryBxdcSSxJrEo3J0IGZvcjogxYjFisWbZcWPxZHFk8KAxZXGgG5vZGVbIsWLxY1lIj0ixZByxZIiXSh7e2LEqnh9fSnFrcaAd2F5xp_GocSVxqTGpsaWxqrGrMauxrDGssa0xb_FtWVsxJRpxaDGusaTxr3Gp8apxqvGrcavb8axxrPFrceFxoJwxLduxorFtsW4xbrFvMWbxJjGnHnFrT7Frcejc2vHiMaDdDs&c=BH1aTWQmgG

    ferries = gpd.read_file(Paths.RAW / "oproad" / "ferries.geojson")[
        ["id", "geometry"]
    ].to_crs("EPSG:27700")
    ferry_nodes = (
        ferries[ferries["id"].str.startswith("node")].copy().reset_index(drop=True)
    )
    ferry_nodes["easting"], ferry_nodes["northing"] = (
        ferry_nodes.geometry.x,
        ferry_nodes.geometry.y,
    )
    ferry_edges = (
        ferries[ferries["id"].str.startswith("relation")]
        .explode(index_parts=False)
        .copy()
        .reset_index(drop=True)
    )
    road_nodes = road_nodes.to_pandas().copy()

    nodes_tree = KDTree(road_nodes[["easting", "northing"]].values)
    distances, indices = nodes_tree.query(ferry_nodes[["easting", "northing"]].values)
    ferry_nodes["node_id"] = road_nodes.iloc[indices]["node_id"].reset_index(drop=True)

    ferry_edges["length"] = ferry_edges["geometry"].apply(lambda x: x.length)
    # Assumes a 25 mph ferry speed, converted to km/h in the same way as the road speeds
    ferry_edges = ferry_edges.assign(
        time_weighted=(ferry_edges["length"].astype(float) / 1000)
        / (25 * 1.609344)
        * 60
    )

    ferry_edges["start_node"] = ferry_edges["geometry"].apply(lambda x: x.coords[0])
    ferry_edges["easting"], ferry_edges["northing"] = (
        ferry_edges["start_node"].apply(lambda x: x[0]),
        ferry_edges["start_node"].apply(lambda x: x[1]),
    )
    distances, indices = nodes_tree.query(ferry_edges[["easting", "northing"]])
    ferry_edges["start_node"] = road_nodes.iloc[indices]["node_id"].reset_index(
        drop=True
    )

    ferry_edges["end_node"] = ferry_edges["geometry"].apply(lambda x: x.coords[-1])
    ferry_edges["easting"], ferry_edges["northing"] = (
        ferry_edges["end_node"].apply(lambda x: x[0]),
        ferry_edges["end_node"].apply(lambda x: x[1]),
    )
    distances, indices = nodes_tree.query(ferry_edges[["easting", "northing"]])
    ferry_edges["end_node"] = road_nodes.iloc[indices]["node_id"].reset_index(drop=True)
    return (
        pl.from_pandas(ferry_nodes[["node_id", "easting", "northing"]]),
        pl.from_pandas(
            ferry_edges[["start_node", "end_node", "time_weighted", "length"]]
        ),
    )


def combine_subgraphs(nodes, edges):
    graph = cugraph.Graph()
    graph.from_cudf_edgelist(
        cudf.from_pandas(edges), source="start_node", destination="end_node"
    )
    components = cugraph.connected_components(graph)
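    component_counts = components["labels"].value_counts().reset_index()
    # Name the columns explicitly, as in filter_deadends, so the lookups below
    # do not depend on the library version's default column names
    component_counts.columns = ["labels", "count"]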

    largest_component_label = component_counts[
        component_counts["count"] == component_counts["count"].max()
    ]["labels"][0]
    largest_component = components[components["labels"] == largest_component_label]
    largest_cn = nodes[nodes["node_id"].isin(largest_component["vertex"].to_pandas())]
    largest_ce = edges[
        edges["start_node"].isin(largest_component["vertex"].to_pandas())
        | edges["end_node"].isin(largest_component["vertex"].to_pandas())
    ]

    subgraph_component_labels = component_counts[
        component_counts["labels"] != largest_component_label
    ]["labels"]
    subgraph_component = components[
        components["labels"].isin(subgraph_component_labels)
    ]
    sub_cn = nodes[nodes["node_id"].isin(subgraph_component["vertex"].to_pandas())]

    _, nodes, edges = add_to_graph(
        sub_cn,
        cudf.from_pandas(largest_cn),
        cudf.from_pandas(largest_ce),
    )
    return nodes, edges


def process_os():
    logger.info("Starting OS highways processing...")
    edges = process_road_edges()
    nodes = process_road_nodes()

    ferry_nodes, ferry_edges = ferry_routes(nodes)
    nodes = pl.concat([nodes, ferry_nodes]).to_pandas()
    edges = pl.concat([edges, ferry_edges]).to_pandas()

    unique_node_ids = nodes["node_id"].unique()
    node_id_mapping = {
        node_id: new_id for new_id, node_id in enumerate(unique_node_ids)
    }
    nodes["node_id"] = nodes["node_id"].map(node_id_mapping)
    edges["start_node"] = edges["start_node"].map(node_id_mapping)
    edges["end_node"] = edges["end_node"].map(node_id_mapping)

    # nodes, edges = filter_deadends(cudf.from_pandas(nodes), cudf.from_pandas(edges))
    nodes, edges = combine_subgraphs(nodes, edges)

    nodes.to_pandas().to_parquet(Paths.OS_GRAPH / "nodes.parquet", index=False)
    logger.debug(f"Nodes saved to {Paths.OS_GRAPH / 'nodes.parquet'}")
    edges.to_pandas().to_parquet(Paths.OS_GRAPH / "edges.parquet", index=False)
    logger.debug(f"Edges saved to {Paths.OS_GRAPH / 'edges.parquet'}")


if __name__ == "__main__":
    process_os()


INFO:ukroutes.common.logger:Starting OS highways processing...


DEBUG:ukroutes.common.logger:Nodes saved to data/processed/osm/nodes.parquet


DEBUG:ukroutes.common.logger:Edges saved to data/processed/osm/edges.parquet
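To make the connected-components step less of a black box, here is a toy version in plain Python. It is not the cuGraph implementation: it uses a breadth-first search on a small edge list, and it mirrors what `filter_deadends` does (keep only the largest component). `combine_subgraphs` goes a step further and reattaches the smaller components with `add_to_graph`, which is not shown here. The function names and node ids are made up for illustration.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group the nodes of an undirected edge list into connected components."""
    neighbours = defaultdict(set)
    for start, end in edges:
        neighbours[start].add(end)
        neighbours[end].add(start)

    seen = set()
    components = []
    for node in neighbours:
        if node in seen:
            continue
        # Breadth-first search collects every node reachable from `node`.
        queue = deque([node])
        seen.add(node)
        component = {node}
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def keep_largest_component(edges):
    """Return only the edges whose endpoints both lie in the largest component."""
    components = connected_components(edges)
    if not components:
        return []
    largest = max(components, key=len)
    return [(s, e) for s, e in edges if s in largest and e in largest]


# Toy edge list (hypothetical node ids): a 4-node network plus a detached pair.
toy_edges = [(0, 1), (1, 2), (2, 3), (10, 11)]
print(keep_largest_component(toy_edges))  # [(0, 1), (1, 2), (2, 3)]
```

The detached pair `(10, 11)` plays the role of an island or an isolated stretch of road: it cannot be reached from the main network, so routing to it would fail unless it is dropped or reconnected.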


# Process routing

Here we call a script file (once again, I paste it in here in full for ease of running). We run a separate script for each indicator that we want to create. In this instance, we will just calculate the distance to the nearest green space (considering all green spaces). This is a starting point for us. We can revise the specific indicator later.

Things to check:

1. The original code was developed for postcodes rather than UPRNs. We might need to change some of the K means settings in the process_routing script to accommodate UPRNs, which are more numerous and more tightly clustered than postcodes. A solution would be to increase the value above 10.
1. Time (minutes) vs distance (meters). Update the `weights` setting in `Routing()` to either "time_weighted" or "length" depending on preference.
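For reference, this is how I understand the `time_weighted` values relate to `length`. The preprocessing multiplies each speed estimate by 1.609344, which suggests the estimates are in miles per hour. The sketch below repeats that arithmetic for a single edge; `travel_time_minutes` is a hypothetical helper, not part of ukroutes.

```python
KM_PER_MILE = 1.609344


def travel_time_minutes(length_m, speed_mph):
    """Convert an edge length in meters and a speed in mph to minutes of travel."""
    speed_kmh = speed_mph * KM_PER_MILE
    return (length_m / 1000) / speed_kmh * 60


# A 1 km single-carriageway A road (speed estimate of 25 in the preprocessing code).
print(round(travel_time_minutes(1000, 25), 2))  # 1.49
```

So `weights="time_weighted"` gives results in minutes of driving, while `weights="length"` gives network distance in meters.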

In [17]:
import cudf
import geopandas as gpd
import numpy as np
import pandas as pd
import cugraph
import warnings

from ukroutes import Routing
from ukroutes.common.utils import Paths #, filter_deadends
from ukroutes.preprocessing import process_os
from ukroutes.process_routing import add_to_graph, add_topk

# To stop warnings being printed, which will lead to a million of them on Colab
warnings.filterwarnings("ignore", category=FutureWarning, module="cugraph")

# Load destinations
greenspace = pd.read_parquet(Paths.PROCESSED / "osgsl" / "osgsl_all.parquet") # All green spaces of any size / type

# Load road network
nodes: cudf.DataFrame = cudf.from_pandas(
    pd.read_parquet(Paths.OS_GRAPH / "nodes.parquet")
)
edges: cudf.DataFrame = cudf.from_pandas(
    pd.read_parquet(Paths.OS_GRAPH / "edges.parquet")
)

#def filter_deadends(nodes, edges):
#    G = cugraph.Graph()
#    G.from_cudf_edgelist(
#        edges, source="start_node", destination="end_node", edge_attr="time_weighted"
#    )
#    components = cugraph.connected_components(G)
#    component_counts = components["labels"].value_counts().reset_index()
#    component_counts.columns = ["labels", "count"]
#
#    largest_component_label = component_counts[
#        component_counts["count"] == component_counts["count"].max()
#    ]["labels"][0]
#
#    largest_component_nodes = components[
#        components["labels"] == largest_component_label
#    ]["vertex"]
#    filtered_edges = edges[
#        edges["start_node"].isin(largest_component_nodes)
#        & edges["end_node"].isin(largest_component_nodes)
#    ]
#    filtered_nodes = nodes[nodes["node_id"].isin(largest_component_nodes)]
#    return filtered_nodes, filtered_edges

#nodes, edges = filter_deadends(nodes, edges)

greenspace, nodes, edges = add_to_graph(greenspace, nodes, edges, 1)

toids = pd.read_parquet(Paths.PROCESSED / "toids_cm_osgb.parquet")
# toids = toids.sample(100, random_state = 1234) # For testing purposes, subset a smaller dataset
toids, nodes, edges = add_to_graph(toids, nodes, edges, 2)
greenspace = add_topk(greenspace, toids, 3)


routing = Routing(
    name="greenspace",
    edges=edges,
    nodes=nodes,
    outputs=toids,
    inputs=greenspace,
    #weights="time_weighted", # Use to get time (mins)
    weights="length", # Use to get the distance (meters)
    min_buffer=5000,
    max_buffer=500_000,
    #cutoff=60,
)
routing.fit()
#distances = routing.fetch_distances()

#distances = (
#    distances.set_index("vertex")
#    .join(toids.set_index("node_id"), how="right")
#    .reset_index()
#)
routing.distances

distances = (
    routing.distances.set_index("vertex")
    .join(cudf.from_pandas(toids).set_index("node_id"), how="right")
    .reset_index()
)

OUT_FILE = Paths.OUT_DATA / "distances_greenspace_topk3.csv"
distances[["TOID", "distance"]].to_csv(OUT_FILE, index=False)


Output()

DEBUG:ukroutes.common.logger:Routing complete for greenspace in 1.28 minutes.
