

This notebook runs faster on a GPU runtime. To enable it, go to Edit > Notebook Settings > Hardware Accelerator > GPU.


## Setup

In [1]:
import torch

print(torch.__version__)


2.5.0+cu121


In [2]:
# Install required packages
import os

os.environ['TORCH'] = torch.__version__
!pip install pyg-lib -f https://data.pyg.org/whl/nightly/torch-${TORCH}.html
!pip install git+https://github.com/pyg-team/pytorch_geometric.git

!pip install sentence_transformers
!pip3 install fuzzywuzzy[speedup]
!pip install captum

Looking in links: https://data.pyg.org/whl/nightly/torch-2.5.0+cu121.html
[31mERROR: Could not find a version that satisfies the requirement pyg-lib (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pyg-lib[0m[31m
[0mCollecting git+https://github.com/pyg-team/pytorch_geometric.git
  Cloning https://github.com/pyg-team/pytorch_geometric.git to /tmp/pip-req-build-ic2g38dg
  Running command git clone --filter=blob:none --quiet https://github.com/pyg-team/pytorch_geometric.git /tmp/pip-req-build-ic2g38dg
  Resolved https://github.com/pyg-team/pytorch_geometric.git to commit facf0c404182b3b08eba0f8954906f7f01ef9eb5
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: torch-geometric
  Building wheel for torch-geometric (pyproject.toml) ... [?25l[?25hdone
  Created wheel for torch-geometric: filenam

## Link Regression on the MovieLens Dataset

This notebook shows how to load a set of `*.csv` files into a `torch_geometric.data.HeteroData` object and how to train a [heterogeneous graph model](https://pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html#hgtutorial).

We are going to use the [Movielens dataset](https://grouplens.org/datasets/movielens/), which is collected by the GroupLens Research group. The toy dataset describes movies, users, and their ratings. We are going to predict the rating of a user for a movie.

## Data Ingestion

In [3]:

import pandas as pd


In [28]:
# Load the entire ratings dataframe into memory:
data = pd.read_csv("/content/ChargingRecords.csv")
data

Unnamed: 0,UserID,ChargerID,ChargerCompany,Location,ChargerType,StartDay,StartTime,EndDay,EndTime,StartDatetime,EndDatetime,Duration,Demand
0,0,1,1,hotel,0,2022-09-15,20:54:02,2022-09-15,23:59:13,2022-09-15 20:54,2022-09-15 23:59,185,20.36
1,0,1,1,hotel,0,2022-09-14,20:01:05,2022-09-14,21:31:04,2022-09-14 20:01,2022-09-14 21:31,90,10.19
2,0,1,1,hotel,0,2022-09-14,18:54:30,2022-09-14,19:54:29,2022-09-14 18:54,2022-09-14 19:54,60,6.78
3,0,1,1,hotel,0,2022-09-29,18:32:51,2022-09-30,0:16:42,2022-09-29 18:32,2022-09-30 0:16,344,37.65
4,0,1,1,hotel,0,2022-09-25,19:30:15,2022-09-26,0:30:14,2022-09-25 19:30,2022-09-26 0:30,300,33.81
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72851,2155,2649,0,public institution,0,2022-01-27,9:54:44,2022-01-27,9:58:57,2022-01-27 9:54,2022-01-27 9:58,4,0.50
72852,2379,2670,0,sightseeing,1,2021-10-31,14:52:11,2021-10-31,15:20:28,2021-10-31 14:52,2021-10-31 15:20,28,16.53
72853,2388,2670,0,sightseeing,1,2021-10-03,13:52:14,2021-10-03,14:32:13,2021-10-03 13:52,2021-10-03 14:32,40,12.20
72854,2388,2671,0,company,1,2021-11-18,11:37:44,2021-11-18,11:45:37,2021-11-18 11:37,2021-11-18 11:45,8,3.80


## Data Preprocessing

We are going to use the genre as well as the title of the movie as node features. For the `title` features, we are going to use a pre-trained [sentence transformer](https://www.sbert.net/) model to encode the title into a vector.
For the `genre` features, we are going to use a one-hot encoding.

In [29]:
import numpy as np
import torch

# One-hot encode the genres:
Locations = data['Location'].str.get_dummies().values
Locations = torch.from_numpy(Locations).to(torch.float)

In [30]:
len(data['Location'].unique())

14

In [31]:
Locations.shape

torch.Size([72856, 14])

In [32]:
ChargerCompany = torch.from_numpy(data['ChargerCompany'].values).to(torch.float)

In [33]:
ChargerType = torch.from_numpy(data['ChargerType'].values).to(torch.float)

In [34]:
ChargerCompany.unsqueeze(-1)

tensor([[1.],
        [1.],
        [1.],
        ...,
        [0.],
        [0.],
        [0.]])

In [35]:
# Concatenate the genres and title features:
station_features = torch.cat([Locations, ChargerCompany.unsqueeze(-1),ChargerType.unsqueeze(-1)], dim=-1)

# We don't have user features, which is why we use an identity matrix
ev_features = torch.eye(len(data['UserID'].unique()))


In [36]:
ev_features.shape

torch.Size([2337, 2337])

The `ratings.csv` file contains the ratings of users for movies. From this
file we are extracting the `userId`. We create a mapping from the `userId`
to a unique consecutive value in the range `[0, num_users]`. This is needed as we want our final data representation to be as compact as possible, *e.g.*, the representation of a user in the first row should be accessible via `x[0]`.
The same we do for the `movieId`.
Afterwards, we obtain the final `edge_index` representation of shape `[2, num_ratings]` from `ratings.csv` by merging mapped user and movie indices with the raw indices given by the original data frame.


In [37]:
# Create a mapping from the userId to a unique consecutive value in the range [0, num_users]:
unique_user_id = data['UserID'].unique()
unique_user_id = pd.DataFrame(data={
    'UserID': unique_user_id,
    'mappedUserId': pd.RangeIndex(len(unique_user_id))
    })
print("Mapping of user IDs to consecutive values:")
print("==========================================")
print(unique_user_id.head())
print()

# Create a mapping from the movieId to a unique consecutive value in the range [0, num_movies]:
unique_movie_id = data['ChargerID'].unique()
unique_movie_id = pd.DataFrame(data={
    'ChargerID': unique_movie_id,
    'mappedMovieId': pd.RangeIndex(len(unique_movie_id))
    })
print("Mapping of movie IDs to consecutive values:")
print("===========================================")
print(unique_movie_id.head())
print()

# Merge the mappings with the original data frame:
data = data.merge(unique_user_id, on='UserID')
data = data.merge(unique_movie_id, on='ChargerID')

# With this, we are ready to create the edge_index representation in COO format
# following the PyTorch Geometric semantics:
edge_index = torch.stack([
    torch.tensor(data['mappedUserId'].values),
    torch.tensor(data['mappedMovieId'].values)]
    , dim=0)

assert edge_index.shape == (2, len(data))

print("Final edge indices pointing from users to movies:")
print("================================================")
print(edge_index[:, :10])

Mapping of user IDs to consecutive values:
   UserID  mappedUserId
0       0             0
1      14             1
2     156             2
3     246             3
4     330             4

Mapping of movie IDs to consecutive values:
   ChargerID  mappedMovieId
0          1              0
1          2              1
2          3              2
3          5              3
4          7              4

Final edge indices pointing from users to movies:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


In [38]:
print(edge_index[:, -10:])

tensor([[2334, 1703, 1304, 2335, 2335, 2336, 1976, 1458, 1458, 1458],
        [2112, 2113, 2114, 2115, 2115, 2116, 2117, 2117, 2118, 2118]])


## Heterogeneous Graph Construction

With this we are ready to initialize our heterogeneous graph data object and pass the
necessary information to it.

We also take care of adding reverse edges to the `HeteroData` object. This allows our GNN
model to use both directions of the edges for the message passing.

In [39]:
data.columns

Index(['UserID', 'ChargerID', 'ChargerCompany', 'Location', 'ChargerType',
       'StartDay', 'StartTime', 'EndDay', 'EndTime', 'StartDatetime',
       'EndDatetime', 'Duration', 'Demand', 'mappedUserId', 'mappedMovieId'],
      dtype='object')

In [43]:
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData

# Create the heterogeneous graph data object:
Hdata = HeteroData()

# Add the user nodes:
Hdata['user'].x = ev_features  # [num_users, num_features_users]

# Add the movie nodes:
Hdata['station'].x = station_features  # [num_movies, num_features_movies]

# Add the rating edges:
Hdata['user', 'demand', 'station'].edge_index = edge_index  # [2, num_ratings]

# Add the rating labels:
Demand = torch.from_numpy(data['Demand'].values).to(torch.float)
Hdata['user', 'demand', 'station'].edge_label = Demand  # [num_ratings]

# We also need to make sure to add the reverse edges from movies to users
# in order to let a GNN be able to pass messages in both directions.
# We can leverage the `T.ToUndirected()` transform for this from PyG:
Hdata = T.ToUndirected()(Hdata)

# With the above transformation we also got reversed labels for the edges.
# We are going to remove them:
del Hdata['station', 'rev_demand', 'user'].edge_label

assert Hdata['user'].num_nodes == len(unique_user_id)
assert Hdata['user', 'demand', 'station'].num_edges == len(data)
assert Hdata['station'].num_features == 16

Hdata

HeteroData(
  user={ x=[2337, 2337] },
  station={ x=[72856, 16] },
  (user, demand, station)={
    edge_index=[2, 72856],
    edge_label=[72856],
  },
  (station, rev_demand, user)={ edge_index=[2, 72856] }
)

## Dataset Splitting

We can now split our data into a training, validation and test set. We are going to use
the `T.RandomLinkSplit` transform from PyG to do this. This transform will randomly
split the links with their label/demand into training, validation and test set.
We are going to use 80% of the edges for training, 10% for validation and 10% for testing.

In [44]:
train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'demand', 'station')],
    rev_edge_types=[('station', 'rev_demand', 'user')],
)(Hdata)
train_data, val_data

(HeteroData(
   user={ x=[2337, 2337] },
   station={ x=[72856, 16] },
   (user, demand, station)={
     edge_index=[2, 58286],
     edge_label=[58286],
     edge_label_index=[2, 58286],
   },
   (station, rev_demand, user)={ edge_index=[2, 58286] }
 ),
 HeteroData(
   user={ x=[2337, 2337] },
   station={ x=[72856, 16] },
   (user, demand, station)={
     edge_index=[2, 58286],
     edge_label=[7285],
     edge_label_index=[2, 7285],
   },
   (station, rev_demand, user)={ edge_index=[2, 58286] }
 ))

## Graph Neural Network

We are now ready to define our GNN model. We are going to use a simple GNN model with
two message passing layers for the encoding of the user and movie nodes.
Additionally, we are going to use a decoder to predict the rating for the encoded
user-station combination.

In [47]:
from torch_geometric.nn import SAGEConv, to_hetero

class GNNEncoder(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x


class EdgeDecoder(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.lin1 = torch.nn.Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = torch.nn.Linear(hidden_channels, 1)

    def forward(self, z_dict, edge_label_index):
        row, col = edge_label_index
        z = torch.cat([z_dict['user'][row], z_dict['station'][col]], dim=-1)

        z = self.lin1(z).relu()
        z = self.lin2(z)
        return z.view(-1)


class Model(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.encoder = GNNEncoder(hidden_channels, hidden_channels)
        self.encoder = to_hetero(self.encoder, Hdata.metadata(), aggr='sum')
        self.decoder = EdgeDecoder(hidden_channels)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        z_dict = self.encoder(x_dict, edge_index_dict)
        return self.decoder(z_dict, edge_label_index)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Model(hidden_channels=32).to(device)

print(model)

Model(
  (encoder): GraphModule(
    (conv1): ModuleDict(
      (user__demand__station): SAGEConv((-1, -1), 32, aggr=mean)
      (station__rev_demand__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
    (conv2): ModuleDict(
      (user__demand__station): SAGEConv((-1, -1), 32, aggr=mean)
      (station__rev_demand__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
  )
  (decoder): EdgeDecoder(
    (lin1): Linear(in_features=64, out_features=32, bias=True)
    (lin2): Linear(in_features=32, out_features=1, bias=True)
  )
)


## Training a Heterogeneous GNN

Training our GNN is then similar to training any PyTorch model.
We move the model to the desired device, and initialize an optimizer that takes care of adjusting model parameters via stochastic gradient descent.

The training loop applies the forward computation of the model, computes the loss from ground-truth labels and obtained predictions, and adjusts model parameters via back-propagation and stochastic gradient descent.


In [48]:
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train():
    model.train()
    optimizer.zero_grad()
    pred = model(train_data.x_dict, train_data.edge_index_dict,
                 train_data['user', 'station'].edge_label_index)
    target = train_data['user', 'station'].edge_label
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def test(data):
    data = data.to(device)
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict,
                 data['user', 'station'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = data['user', 'station'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    return float(rmse)


for epoch in range(1, 301):
    train_data = train_data.to(device)
    loss = train()
    train_rmse = test(train_data)
    val_rmse = test(val_data)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_rmse:.4f}, '
          f'Val: {val_rmse:.4f}')

Epoch: 001, Loss: 489.5347, Train: 22.0008, Val: 22.2866
Epoch: 002, Loss: 484.0378, Train: 21.8183, Val: 22.1042
Epoch: 003, Loss: 476.0367, Train: 21.5400, Val: 21.8258
Epoch: 004, Loss: 463.9698, Train: 21.1076, Val: 21.3935
Epoch: 005, Loss: 445.5321, Train: 20.4557, Val: 20.7415
Epoch: 006, Loss: 418.4375, Train: 19.5260, Val: 19.8106
Epoch: 007, Loss: 381.2643, Train: 18.3539, Val: 18.6341
Epoch: 008, Loss: 333.4708, Train: 18.3303, Val: 18.5934
Epoch: 009, Loss: 277.3129, Train: 18.3303, Val: 18.5890
Epoch: 010, Loss: 220.9105, Train: 18.3303, Val: 18.5885
Epoch: 011, Loss: 183.6581, Train: 18.3303, Val: 18.5884
Epoch: 012, Loss: 198.4483, Train: 18.3303, Val: 18.5884
Epoch: 013, Loss: 244.4870, Train: 18.3303, Val: 18.5884
Epoch: 014, Loss: 244.7488, Train: 18.3303, Val: 18.5884
Epoch: 015, Loss: 214.6815, Train: 18.3303, Val: 18.5884
Epoch: 016, Loss: 186.3483, Train: 18.3303, Val: 18.5884
Epoch: 017, Loss: 173.6774, Train: 18.3303, Val: 18.5885
Epoch: 018, Loss: 174.8454, Tra

In [49]:
with torch.no_grad():
    test_data = test_data.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict,
                 test_data['user', 'station'].edge_label_index)
    # pred = pred.clamp(min=0, max=5)
    target = test_data['user', 'station'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    print(f'Test RMSE: {rmse:.4f}')

userId = test_data['user', 'station'].edge_label_index[0].cpu().numpy()
stationId = test_data['user', 'station'].edge_label_index[1].cpu().numpy()
pred = pred.cpu().numpy()
target = target.cpu().numpy()

print(pd.DataFrame({'userId': userId, 'stationId': stationId, 'demand': pred, 'target': target}))

Test RMSE: 11.1323
      userId  stationId     demand     target
0       1124        331   5.598994   3.890000
1          0        181  14.600853  10.190000
2       1035       1581   8.787170   1.800000
3          0         15  16.132374  30.400000
4       1544       1463   9.652048   3.820000
...      ...        ...        ...        ...
7280       0         41  19.253338  31.940001
7281    1138       1216  41.602848  54.189999
7282     862        141  27.064394  33.240002
7283       0         81  15.051769  12.130000
7284       0         90  19.834482   9.300000

[7285 rows x 4 columns]


In [53]:
userId=1
chargerId=5

In [54]:
new_user_index = unique_user_id[unique_user_id['UserID'] == userId]['mappedUserId'].values[0]
new_station_index = unique_movie_id[unique_movie_id['ChargerID'] == chargerId]['mappedMovieId'].values[0]
new_edge_index = torch.tensor([[new_user_index], [new_station_index]])

In [55]:
with torch.no_grad():  # No need to calculate gradients for prediction
    model.eval()  # Set the model to evaluation mode
    pred = model(
        train_data.x_dict,  # Assuming you want to use the training data's features
        train_data.edge_index_dict,  # and connectivity for context
        new_edge_index.to(device)  # The new edge to predict
    )
    print(f"Predicted demand for User {userId} and Charger {chargerId}: {pred.item()}")

Predicted demand for User 1 and Charger 5: 34.144615173339844
