# INTRODUCTION


**Research Question**
Can the Estimated Time of Arrival (ETA) on the Dublin road network be accurately predicted using Graph Neural Networks (GNNs)?

**Aims**

* Investigate the potential of GNNs to model and predict ETAs in transportation networks.
* Analyze the strengths and limitations of each approach in capturing the temporal and spatial dynamics of transportation data.


**Objectives**

* Review and understand the foundational concepts and applications of GNNs from existing literature, especially in the context of ETA prediction.
* Acquire and pre-process the dataset to make it suitable for graph-based neural network modelling.
* Develop GNN-based models that incorporate spatial and temporal dynamics to predict ETAs.
* Identify appropriate performance metrics to assess the accuracy and efficiency of the model.
* Compare the performance of the GNN models, highlighting the advantages and drawbacks of each approach.
* Visualize and interpret the results, drawing insights about the factors impacting ETA predictions.


**DATA**

Explanation of definitions.

**Route** A group of two or more Control Sites usually along a common section of roadway. The Route is made up of one or more links.

**Link** Two adjacent junctions with Control Sites and the corresponding section of intermediate route.

**STT** Smoothed Travel Time.

**AccSTT** Accumulated Smoothed Travel Time.

**TCS** Traffic Control Site. This is the SCATS Site ID for the junction and it is unique city-wide.


# Data Import

Loading "Dublin Trips" file as data.

In [None]:
# Importing necessary libraries and reading the CSV file
import pandas as pd
import numpy as np

from google.colab import drive
drive.mount('/content/drive')

!wget -O "/content/drive/My Drive/TRIPS/trips-1-day.csv" "https://data.smartdublin.ie/dataset/d083b9a8-bed7-444c-a387-d58318f31c5d/resource/3bf193dc-6029-42e7-987f-31ea5ae3c32f/download/trips-1-day.csv"

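# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines='skip'` (pandas >= 1.3) skips malformed rows
data = pd.read_csv("/content/drive/My Drive/TRIPS/trips-1-day.csv", on_bad_lines='skip')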

In [None]:
data.head()

In [None]:
data.info()

In [None]:
# Drop rows with any null values
data = data.dropna()

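# Convert 'Timestamp' to datetime format
data['timestamp'] = pd.to_datetime(data['Timestamp'], format='%Y%m%d-%H%M')

# Keep only rows before 19 November 2015, 10:00. Filter on the datetime values:
# comparing 'DD-MM-YYYY' strings would be lexicographic, not chronological.
data = data.loc[data['timestamp'] < pd.Timestamp('2015-11-19 10:00')].copy()

# Reformat 'timestamp' to 'DD-MM-YYYY HH:MM' format for display
data['timestamp'] = data['timestamp'].dt.strftime('%d-%m-%Y %H:%M')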
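
When filtering by time, it matters whether the column holds datetimes or formatted strings. The sketch below uses two made-up timestamps (not taken from the dataset) to show that a `DD-MM-YYYY` string comparison is lexicographic and can keep rows that are chronologically after the cutoff.

```python
import pandas as pd

# Hypothetical timestamps in the raw 'YYYYMMDD-HHMM' layout used above.
raw = pd.Series(["20151119-0930", "20151201-0800"])
parsed = pd.to_datetime(raw, format="%Y%m%d-%H%M")

cutoff = pd.Timestamp("2015-11-19 10:00")

# Comparing datetimes is chronological: only the first row is before the cutoff.
datetime_mask = parsed < cutoff

# Comparing 'DD-MM-YYYY HH:MM' strings is character by character, so
# '01-12-2015 ...' sorts before '19-11-2015 ...' although it is a later date.
as_text = parsed.dt.strftime("%d-%m-%Y %H:%M")
string_mask = as_text < "19-11-2015 10:00"

print(datetime_mask.tolist())
print(string_mask.tolist())
```

Filtering on the parsed datetimes and formatting only for display avoids this mismatch.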

# Data Exploration and Analysis

In [None]:
# Checking for missing values in the dataset
missing_values = data.isnull().sum()

missing_values

Matplotlib is used to display the data graphically. The first histogram shows the distribution of "STT" values, and the second histogram shows the distribution of "AccSTT" values.

In [None]:
import matplotlib.pyplot as plt

# Plotting the distribution of STT and AccSTT values
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# STT distribution
axes[0].hist(data["STT"], bins=50, color='blue', edgecolor='black')
axes[0].set_title('Distribution of STT Values')
axes[0].set_xlabel('STT')
axes[0].set_ylabel('Frequency')

# AccSTT distribution
axes[1].hist(data["AccSTT"], bins=50, color='green', edgecolor='black')
axes[1].set_title('Distribution of AccSTT Values')
axes[1].set_xlabel('AccSTT')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

The number of distinct values in the "Route", "Link", "Direction", "TCS1", and "TCS2" columns is computed. `unique_counts` summarizes how many unique values each of these columns contains.

In [None]:
# Counting unique values for routes, links, and TCS values
unique_routes = data["Route"].nunique()
unique_links = data["Link"].nunique()
unique_direction = data["Direction"].nunique()
unique_tcs1 = data["TCS1"].nunique()
unique_tcs2 = data["TCS2"].nunique()

unique_counts = {
    "Unique Routes": unique_routes,
    "Unique Links": unique_links,
    "unique direction": unique_direction,
    "Unique TCS1": unique_tcs1,
    "Unique TCS2": unique_tcs2
}

unique_counts

There are 50 routes in total, each route has up to 30 links, and each link has 2 directions. These links are the connections between TCS1 and TCS2.

In [None]:
df = data

Data Visualisation (Nodes and Edges)

In [None]:
import pandas as pd
import folium
import xml.etree.ElementTree as ET

# Parse the KML file
!wget -O "/content/drive/My Drive/TRIPS/routes.kml" "https://data.smartdublin.ie/dataset/d083b9a8-bed7-444c-a387-d58318f31c5d/resource/cd3f099c-e00a-49cc-8d02-ed2e10b8fc3e/download/routes.kml"
tree = ET.parse('/content/drive/My Drive/TRIPS/routes.kml')
root = tree.getroot()

# Extracting namespace for KML
namespace = {"kml": root.tag.split('}')[0].strip('{')}

# Extracting route details from placemarks
routes_data = []
for placemark in root.findall(".//kml:Placemark", namespace):
    route_id = None
    extended_data_element = placemark.find("kml:ExtendedData", namespace)
    if extended_data_element is not None:
        # Use a distinct loop name so the `data` DataFrame is not shadowed
        for data_element in extended_data_element:
            for value in data_element:
                route_id = value.text

    line_string_element = placemark.find("kml:LineString", namespace)
    if line_string_element is not None:
        coordinates = line_string_element.find("kml:coordinates", namespace).text.strip().split(" ")
        for coord_pair in coordinates:
            longitude, latitude = coord_pair.split(",")[:2]
            routes_data.append({"RouteID": route_id, "Longitude": float(longitude), "Latitude": float(latitude)})

# Convert to DataFrame
routes_df = pd.DataFrame(routes_data)

# Load the sample dataset
!wget -O "/content/drive/My Drive/TRIPS/trips.csv" "https://opendata.dublincity.ie/TrafficOpenData/CP_TR/trips.csv"
data = pd.read_csv('/content/drive/My Drive/TRIPS/trips.csv')

# Extract unique combinations of TCS1, TCS2, # Route, Link, and Direction
unique_combinations = data[["TCS1", "TCS2", "# Route", "Link", "Direction"]].drop_duplicates()
unique_tcs = set(unique_combinations["TCS1"]).union(set(unique_combinations["TCS2"]))

# Filter coordinates dataframe for unique TCS values
filtered_routes_df = routes_df[routes_df["RouteID"].isin(map(str, unique_tcs))]

# Create a base map centered around Dublin City
m = folium.Map(location=[53.349805, -6.26031], zoom_start=13)

# Add nodes to the map
for _, row in filtered_routes_df.iterrows():
    folium.CircleMarker(
        location=[row["Latitude"], row["Longitude"]],
        radius=2,
        color="blue",
        fill=True,
        fill_color="blue"
    ).add_to(m)

# Add edges to the map based on unique combinations
for _, row in unique_combinations.iterrows():
    start_coords = filtered_routes_df[filtered_routes_df["RouteID"] == str(row["TCS1"])][["Latitude", "Longitude"]].values
    end_coords = filtered_routes_df[filtered_routes_df["RouteID"] == str(row["TCS2"])][["Latitude", "Longitude"]].values
    if len(start_coords) > 0 and len(end_coords) > 0:
        folium.PolyLine([start_coords[0], end_coords[0]], color="black", weight=0.5).add_to(m)

# Display the map
m


The blue dots show the TCSs, and the lines represent the links between them.

# Graph Construction

Here we take a different approach to preprocessing. In a graph, the Traffic Control Sites (TCS) would usually be selected as nodes and the links between them as edges, and the node features at time *T+1* would be predicted from the node features at time *T*.

In our dataset, the STT (Smoothed Travel Time) is the Y value to be predicted. Since the STT is the travel time between two TCSs, it would be an edge feature, so our approach uses a method called **line graph transformation**: each combination of **TCS1 and TCS2** becomes a **NODE**, and the **EDGES** between these nodes are the common TCSs that the nodes share.

In a line graph transformation, nodes represent the edges of the original graph, and two nodes in the line graph are connected if their corresponding edges in the original graph share a node. The concept of a line graph is rooted in graph theory and has been used in various applications to study properties or relationships that are more naturally expressed between edges than between nodes of a graph.
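
As a minimal sketch of the line graph idea, the snippet below treats each link `(TCS1, TCS2)` as a node and connects two links whenever they share a junction. The helper name `line_graph_edges` and the junction IDs are hypothetical and only serve as illustration; this is not the graph construction used later in the notebook.

```python
from itertools import combinations

def line_graph_edges(links):
    """Return index pairs (i, j) of links that share at least one junction.

    `links` is a list of (tcs1, tcs2) tuples; each link becomes one node
    of the line graph.
    """
    edges = []
    for (i, a), (j, b) in combinations(enumerate(links), 2):
        # Two links are adjacent in the line graph if they have a common TCS.
        if set(a) & set(b):
            edges.append((i, j))
    return edges

# Hypothetical junction IDs: three links forming the chain 1-2-3-4.
links = [(1, 2), (2, 3), (3, 4)]
edges = line_graph_edges(links)
print(edges)  # [(0, 1), (1, 2)]
```

Links 0 and 2 are not connected because they have no junction in common, which is the sparsity a line graph is expected to preserve.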

Create a DataFrame that holds, for each unique Timestamp and node, the STT for each direction.

In [None]:
# Converting TCS1 and TCS2 to string and then concatenating

df['node'] = np.where(df['Direction'] == 1,
                      df['TCS1'].astype(str) + '-' + df['TCS2'].astype(str),
                      df['TCS2'].astype(str) + '-' + df['TCS1'].astype(str))


# Pivoting table to get STT values for both directions as separate columns
pivot_df = df.pivot_table(index=['Timestamp', 'node'], columns='Direction', values='STT', aggfunc='first').reset_index()

# Filling missing values with 0 since there might be few links which will have only one direction
pivot_df.fillna(0, inplace=True)

# Displaying the transformed dataframe
pivot_df.head()


In [None]:
# Installing torch-geometric for graph construction

!pip install torch-geometric

Function to create a graph for each timestep with its Y value (the STT at T+1)

In [None]:
import torch
import itertools
from torch_geometric.data import Data

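def create_graph_for_timestamp_with_target(df, ts):
    # Rows (one per node) observed at timestamp `ts`
    timestamp_df = df[df['Timestamp'] == ts]

    nodes = timestamp_df['node'].unique().tolist()
    node_index = {node: i for i, node in enumerate(nodes)}

    # Note: itertools.product pairs every node with every node (self-loops included),
    # so this builds a fully connected graph rather than linking only nodes that share a TCS.
    edge_index = torch.tensor([[node_index[src], node_index[dst]] for src, dst in itertools.product(nodes, nodes)], dtype=torch.long).t().contiguous()

    # Node features: STT for directions 1 and 2 (the pivoted columns)
    x = torch.tensor(timestamp_df.sort_values(by='node')[[1, 2]].values, dtype=torch.float)

    # Find the next timestamp; this relies on `df` being sorted by 'Timestamp'
    next_ts_idx = df['Timestamp'].searchsorted(ts, side='right')
    next_ts = df['Timestamp'].iloc[next_ts_idx] if next_ts_idx < len(df['Timestamp']) else None

    if next_ts is None:
        # Last timestamp: no future values exist, so the target is a zero placeholder
        y = torch.zeros_like(x)
    else:
        # Target: STT of both directions at the next timestamp
        next_ts_df = df[df['Timestamp'] == next_ts].sort_values(by='node')
        y = torch.tensor(next_ts_df[[1, 2]].values, dtype=torch.float)

    data = Data(x=x, edge_index=edge_index, y=y)
    data.timestamp = ts
    return data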
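
The function above pairs features and targets by row position after sorting by node, which assumes that consecutive timestamps contain the same set of nodes. One possible way to make the alignment explicit, sketched here with pandas on made-up data, is to shift each node's own series by one step. The column names `stt` and `stt_next` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up long-format data: node "b" is missing at the last timestamp.
frame = pd.DataFrame({
    "Timestamp": ["t0", "t0", "t1", "t1", "t2"],
    "node": ["a", "b", "a", "b", "a"],
    "stt": [10.0, 20.0, 12.0, 18.0, 11.0],
})

# Sort so that, within each node, rows are in time order.
frame = frame.sort_values(["node", "Timestamp"]).reset_index(drop=True)

# For every node, the target is that same node's STT at its next observed timestamp.
frame["stt_next"] = frame.groupby("node")["stt"].shift(-1)

# Rows without a following observation get NaN and can be dropped or masked.
pairs = frame.dropna(subset=["stt_next"])
print(pairs)
```

With this layout a node that disappears at some timestamp yields a missing target for that row instead of a shape mismatch between `x` and `y`.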

In [None]:
#Creating a graph for each timestamp using the function

timestamps = pivot_df['Timestamp'].unique()
graphs_with_target = [create_graph_for_timestamp_with_target(pivot_df, ts) for ts in timestamps]

In [None]:
next(iter(graphs_with_target))

In [None]:
len(graphs_with_target)

# Data Splitting and Normalization

Splitting data

In [None]:
# Determining the split point
split_idx = int(0.8 * len(timestamps))

# Splitting the data based on the determined index
train_timestamps = timestamps[:split_idx]
test_timestamps = timestamps[split_idx:]

train_dataset = [graph for graph in graphs_with_target if graph.timestamp in train_timestamps]
test_dataset = [graph for graph in graphs_with_target if graph.timestamp in test_timestamps]

Normalization

In [None]:
# Computing mean and standard deviation of target values in the training set
train_targets = [data.y for data in train_dataset]
all_train_targets = torch.cat(train_targets, dim=0)
mean_target = all_train_targets.mean(dim=0)
std_target = all_train_targets.std(dim=0)

# Normalize target values in training and testing datasets
for data in train_dataset:
    data.y = (data.y - mean_target) / std_target

for data in test_dataset:
    data.y = (data.y - mean_target) / std_target

mean_target, std_target
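
Because the targets are standardized, the model's outputs are in normalized units. The NumPy sketch below, on made-up values, shows the z-score transform and its inverse, which would be needed to report predictions in the original STT units. The helper names `normalize` and `denormalize` are hypothetical.

```python
import numpy as np

# Made-up training targets: one row per node, one column per direction.
train_y = np.array([[30.0, 40.0], [50.0, 60.0], [70.0, 80.0]])

# Statistics come from the training set only.
mean = train_y.mean(axis=0)
std = train_y.std(axis=0, ddof=1)  # ddof=1 matches torch.Tensor.std's default (unbiased)

def normalize(y):
    """Map raw values to zero mean and unit standard deviation."""
    return (y - mean) / std

def denormalize(z):
    """Map normalized values back to the original units."""
    return z * std + mean

z = normalize(train_y)
restored = denormalize(z)
print(z)
print(restored)
```

Applying `denormalize` to model outputs before computing MAE or RMSE would express those errors in the same units as STT.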

**Data Loader**

In [None]:
# Creating data loaders (in PyG 2.x, DataLoader lives in torch_geometric.loader)
from torch_geometric.loader import DataLoader
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)


In [None]:
# Get one batch of data from the train_loader
sample_batch = next(iter(train_loader))

# Print the shape of node features and edge indices
print("Node features shape:", sample_batch.x.shape)
print("Edge index shape:", sample_batch.edge_index.shape)

# If dataset contains target values
if hasattr(sample_batch, 'y'):
    print("Target/Label shape:", sample_batch.y.shape)

# Temporal Graph Convolution Recurrent Network

TGC-RN Model

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TGCRN_STTPredictor(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super(TGCRN_STTPredictor, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim1)
        self.recurrent_layer = nn.GRU(hidden_dim1, hidden_dim1, batch_first=True)
        self.fc = nn.Linear(hidden_dim1, output_dim)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # First Graph Convolution Layer
        x = self.conv1(x, edge_index)
        x = F.relu(x)

        # Using the batch vector to determine the start and end of each graph's nodes in the batch
        x_list = [x[batch == i] for i in range(batch.max() + 1)]
        x_packed = torch.nn.utils.rnn.pack_sequence(x_list, enforce_sorted=False)

        # Recurrent Layer
        x, _ = self.recurrent_layer(x_packed)

        # Unpack the output from the GRU to restore the original sequence lengths
        x, _ = torch.nn.utils.rnn.pad_packed_sequence(x, batch_first=True)

        # Concatenate the outputs for all graphs in the batch to match original number of nodes
        x = torch.cat([x[i][:len(x_list[i])] for i in range(x.size(0))], dim=0)

        # Apply ReLU activation to the recurrent output
        x = F.relu(x)

        # Fully Connected Layer to produce the output
        x = self.fc(x)

        return x

# Instantiate the model
input_dim = 2  # Because we have STT values for two directions as input
hidden_dim1 = 64
hidden_dim2 = 32
output_dim = 2  # Predict STT for both directions

TGCRN_model = TGCRN_STTPredictor(input_dim, hidden_dim1, hidden_dim2, output_dim)


TGC-RN Training

In [None]:
learning_rate = 0.001
num_epochs = 100

# Define the optimizer and loss function
optimizer = torch.optim.Adam(TGCRN_model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
TGCRN_model = TGCRN_model.to(device)

TGCRN_loss_values = []

# Training Loop
for epoch in range(num_epochs):
    TGCRN_model.train()  # Set the model to training mode
    total_loss = 0

    for batch in train_loader:
        batch = batch.to(device)  # Move the batch data to the device

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = TGCRN_model(batch)

        # Compute the loss
        loss = criterion(outputs, batch.y)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Print the average loss for this epoch
    average_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}, Loss: {average_loss:.4f}")
    TGCRN_loss_values.append(average_loss)

In [None]:
import matplotlib.pyplot as plt

# Exclude the first loss value
epochs = range(2, len(TGCRN_loss_values) + 1)
losses = TGCRN_loss_values[1:]

plt.figure(figsize=(12, 6))
plt.plot(epochs, losses, label="Training Loss", marker='o')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training Loss Over Epochs for TGC-RN Model")
plt.legend()
plt.grid(True)
plt.show()

TGC-RN Evaluation

In [None]:
TGCRN_model.eval()
test_loss_TGCRN = 0
all_predictions_TGCRN = []
all_true_values_TGCRN = []

for data in test_loader:
    data = data.to(device)
    with torch.no_grad():
        predictions = TGCRN_model(data)


    loss = criterion(predictions, data.y)
    test_loss_TGCRN += loss.item()

    all_predictions_TGCRN.append(predictions.cpu().numpy())
    all_true_values_TGCRN.append(data.y.cpu().numpy())

print(f"Test Loss: {test_loss_TGCRN:.4f}")


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Flatten the lists
flattened_predictions_TGCRN = np.concatenate(all_predictions_TGCRN, axis=0)
flattened_true_values_TGCRN = np.concatenate(all_true_values_TGCRN, axis=0)

# Calculate metrics
mae_TGCRN = mean_absolute_error(flattened_true_values_TGCRN, flattened_predictions_TGCRN)
mse_TGCRN = mean_squared_error(flattened_true_values_TGCRN, flattened_predictions_TGCRN)
rmse_TGCRN = np.sqrt(mse_TGCRN)
r2_TGCRN = r2_score(flattened_true_values_TGCRN, flattened_predictions_TGCRN)

# Create a DataFrame
metrics_df = pd.DataFrame({
    'Metrics': ['Test Loss', 'MAE', 'MSE', 'RMSE', 'R2'],
    'TGC-RN': [test_loss_TGCRN, mae_TGCRN, mse_TGCRN, rmse_TGCRN, r2_TGCRN]
})

metrics_df
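
For reference, the error metrics reported above can be written out by hand. The sketch below uses made-up single-output values; for multi-output targets such as the two directions here, scikit-learn's `r2_score` averages the per-column scores by default.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R^2 compares the residual sum of squares with the variance around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

An R2 of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than the mean.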

In [None]:
import matplotlib.pyplot as plt

# Specify the indices of the timestamps you want to plot
timestamps_to_plot = [0]  # Adjust this list as per needs

figures = []

# Plotting
for timestamp_index in timestamps_to_plot:
    if timestamp_index < len(all_true_values_TGCRN):
        fig1 = plt.figure(figsize=(15, 5))

        # Direction 1
        plt.subplot(1, 2, 1)
        plt.plot(all_true_values_TGCRN[timestamp_index][:50, 0], label="True Values", marker='o')
        plt.plot(all_predictions_TGCRN[timestamp_index][:50, 0], label="Predictions", marker='x')
        plt.legend()
        plt.title("TGC-RN Model Prediction for Direction 1")
        plt.xlabel("Node index (first 50 nodes)")  # each point is one node at this timestamp
        plt.ylabel("Normalized STT")  # targets were standardized above

        # Direction 2
        plt.subplot(1, 2, 2)
        plt.plot(all_true_values_TGCRN[timestamp_index][:50, 1], label="True Values", marker='o')
        plt.plot(all_predictions_TGCRN[timestamp_index][:50, 1], label="Predictions", marker='x')
        plt.legend()
        plt.title("TGC-RN Model Prediction for Direction 2")
        plt.xlabel("Node index (first 50 nodes)")  # each point is one node at this timestamp
        plt.ylabel("Normalized STT")  # targets were standardized above

        plt.tight_layout()
        plt.show()
        figures.append(fig1)


# Temporal Graph Chebyshev Convolution Network

TGCCN Model

In [None]:
import torch.nn.functional as F
from torch_geometric.nn import ChebConv

class TGCCN_STTPredictor(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super(TGCCN_STTPredictor, self).__init__()
        self.tgcn1 = ChebConv(input_dim, hidden_dim1, K=2)
        self.recurrent_layer = torch.nn.LSTM(hidden_dim1, hidden_dim2, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim2, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        # First Graph Convolution Layer
        x = self.tgcn1(x, edge_index)
        x = F.relu(x)

        # Recurrent Layer to capture temporal patterns
        x, _ = self.recurrent_layer(x.unsqueeze(0))
        x = x.squeeze(0)

        # Apply ReLU activation to the LSTM output
        x = F.relu(x)

        # Fully Connected Layer to produce the output
        x = self.fc(x)

        return x

# Instantiate the model
input_dim = 2  # Because we have STT values for two directions as input
hidden_dim1 = 64
hidden_dim2 = 32
output_dim = 2  # Predict STT for both directions

TGCCN_model = TGCCN_STTPredictor(input_dim, hidden_dim1, hidden_dim2, output_dim)
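
For intuition about what a Chebyshev graph convolution aggregates, here is a minimal NumPy sketch of the Chebyshev basis terms, assuming symmetric normalization and a largest eigenvalue bound of 2 (so the scaled Laplacian reduces to the negated normalized adjacency). The function name `chebyshev_basis` and the toy graph are illustrative; the learned weight matrices that `ChebConv` applies to each term are omitted.

```python
import numpy as np

def chebyshev_basis(adj, x, K):
    """Return the first K Chebyshev terms Z_1..Z_K for features x.

    Z_1 = x, Z_2 = L_hat @ x, Z_k = 2 * L_hat @ Z_{k-1} - Z_{k-2},
    with L_hat = -D^{-1/2} A D^{-1/2} (assumes lambda_max = 2).
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    mask = deg > 0
    d_inv_sqrt[mask] = deg[mask] ** -0.5  # isolated nodes keep a zero scale

    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lap_scaled = -norm_adj

    terms = [x]
    if K > 1:
        terms.append(lap_scaled @ x)
    for _ in range(2, K):
        terms.append(2 * lap_scaled @ terms[-1] - terms[-2])
    return terms

# Toy path graph 0 - 1 - 2 with one-hot node features.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
x = np.eye(3)

terms = chebyshev_basis(adj, x, K=2)
print(terms[1])
```

With `K=2`, as in the model above, each node combines its own features with a normalized sum over its direct neighbours; larger `K` would reach further along the graph.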


TGCCN Training

In [None]:
# Hyperparameters
learning_rate = 0.01
epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Use the Mean Squared Error Loss
criterion = torch.nn.MSELoss()

# Use the Adam optimizer
optimizer = torch.optim.Adam(TGCCN_model.parameters(), lr=learning_rate)

# Move model to the appropriate device
TGCCN_model = TGCCN_model.to(device)

TGCCN_loss_values = []

# Training loop
for epoch in range(epochs):
    TGCCN_model.train()
    total_loss = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        out = TGCCN_model(data)
        loss = criterion(out, data.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")
    TGCCN_loss_values.append(total_loss)


In [None]:
import matplotlib.pyplot as plt

# Plot the loss values for all epochs
epochs = range(1, len(TGCCN_loss_values) + 1)
losses = TGCCN_loss_values[0:]

plt.figure(figsize=(12, 6))
plt.plot(epochs, losses, label="Training Loss", marker='o')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("TGCCN Model Training Loss Over Epochs")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Loss from the 85th epoch onward
epochs = range(85, len(TGCCN_loss_values) + 1)
losses = TGCCN_loss_values[84:]

plt.figure(figsize=(12, 6))
plt.plot(epochs, losses, label="Training Loss", marker='o')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("TGCCN Model Training Loss Over Epochs")
plt.legend()
plt.grid(True)
plt.show()

TGCCN Evaluation

In [None]:
TGCCN_model.eval()
test_loss_TGCCN = 0
all_predictions_TGCCN = []
all_true_values_TGCCN = []

for data in test_loader:
    data = data.to(device)
    with torch.no_grad():
        predictions = TGCCN_model(data)

    # Check for shape mismatch and skip if they don't match
    if predictions.shape != data.y.shape:
        continue

    loss = criterion(predictions, data.y)
    test_loss_TGCCN += loss.item()

    all_predictions_TGCCN.append(predictions.cpu().numpy())
    all_true_values_TGCCN.append(data.y.cpu().numpy())

print(f"Test Loss: {test_loss_TGCCN:.4f}")


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Flatten the lists
flattened_predictions_TGCCN = np.concatenate(all_predictions_TGCCN, axis=0)
flattened_true_values_TGCCN = np.concatenate(all_true_values_TGCCN, axis=0)

# Calculate metrics
mae_TGCCN = mean_absolute_error(flattened_true_values_TGCCN, flattened_predictions_TGCCN)
mse_TGCCN = mean_squared_error(flattened_true_values_TGCCN, flattened_predictions_TGCCN)
rmse_TGCCN = np.sqrt(mse_TGCCN)
r2_TGCCN = r2_score(flattened_true_values_TGCCN, flattened_predictions_TGCCN)

metrics_df['TGCCN'] = [test_loss_TGCCN, mae_TGCCN, mse_TGCCN, rmse_TGCCN, r2_TGCCN]

metrics_df[['Metrics','TGCCN']]

In [None]:
import matplotlib.pyplot as plt

# Specify the indices of the timestamps you want to plot
timestamps_to_plot = [0]



# Plotting
for timestamp_index in timestamps_to_plot:
    if timestamp_index < len(all_true_values_TGCCN):
        fig2 = plt.figure(figsize=(15, 5))

        # Direction 1
        plt.subplot(1, 2, 1)
        plt.plot(all_true_values_TGCCN[timestamp_index][:50, 0], label="True Values", marker='o')
        plt.plot(all_predictions_TGCCN[timestamp_index][:50, 0], label="Predictions", marker='x')
        plt.legend()
        plt.title("TGCCN Model Prediction for Direction 1")
        plt.xlabel("Node index (first 50 nodes)")  # each point is one node at this timestamp
        plt.ylabel("Normalized STT")  # targets were standardized above


        # Direction 2
        plt.subplot(1, 2, 2)
        plt.plot(all_true_values_TGCCN[timestamp_index][:50, 1], label="True Values", marker='o')
        plt.plot(all_predictions_TGCCN[timestamp_index][:50, 1], label="Predictions", marker='x')
        plt.legend()
        plt.title("TGCCN Model Prediction for Direction 2")
        plt.xlabel("Node index (first 50 nodes)")  # each point is one node at this timestamp
        plt.ylabel("Normalized STT")  # targets were standardized above


        plt.tight_layout()
        plt.show()
        figures.append(fig2)

# Result

In [None]:
metrics_df

In [None]:
from IPython.display import display

for fig in [figures[0], figures[1]]:
    display(fig)