Source: https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8

In [1]:
import torch
from torch_geometric.data import Data

from torch.nn import Sequential as Seq, Linear, ReLU
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import remove_self_loops, add_self_loops

## Step 1: define the graph (we can also use NetworkX)

In [2]:
# Define the graph using torch_geometric: 4 vertices, 5 directed edges

# Ground-truth labels to be predicted (node-level here; graph-level labels of arbitrary shape are also possible)
nodes_labels = torch.tensor([0, 1, 0, 1], dtype=torch.float)

# Node feature matrix with shape [num_nodes, num_node_features]
nodes_features = torch.tensor([[2, 1], [5, 6], [3, 7], [12, 0]], dtype=torch.float)

# Graph connectivity in COO format with shape [2, num_edges]
# COO format: source nodes in the first list, destination nodes in the second list
edge_index = torch.tensor([[0, 2, 1, 0, 3],
                           [3, 1, 0, 1, 2]], dtype=torch.long)

# Create the graph structure (vertex positions and edge features can also be added)
# Documentation: https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Data.html#torch_geometric.data.Data
data = Data(x=nodes_features, y=nodes_labels, edge_index=edge_index)
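
To make the COO layout concrete, the sketch below walks the same edge list in plain Python, without PyTorch, and lists the incoming neighbors of every node. The helper name `incoming_neighbors` is hypothetical and only used for illustration.

```python
# Plain-Python view of the COO edge list used above (no PyTorch needed).
sources = [0, 2, 1, 0, 3]   # first row of edge_index
targets = [3, 1, 0, 1, 2]   # second row of edge_index


def incoming_neighbors(sources, targets, num_nodes):
    """Return, for each node, the list of nodes that have an edge pointing to it."""
    neighbors = [[] for _ in range(num_nodes)]
    for src, dst in zip(sources, targets):
        neighbors[dst].append(src)
    return neighbors


neighbors = incoming_neighbors(sources, targets, num_nodes=4)
for node, incoming in enumerate(neighbors):
    print(f"node {node} receives edges from {incoming}")
```

Column k of the edge list is one directed edge from `sources[k]` to `targets[k]`, so an undirected graph needs every edge listed in both directions.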

## Step 2: put the graph in a PyTorch Geometric Dataset to be used by the GNN

PyG provides 2 dataset **classes**: InMemoryDataset (for data that fits in RAM) and Dataset (for larger data that does not fit in RAM).

Since these are classes, a custom dataset is implemented as a subclass. The implementation is similar for the 2 types.

The dataset is a class inheriting from InMemoryDataset (or Dataset) which implements the following methods:
1. raw_file_names(): returns the list of raw files that must be present in raw_dir; if they all exist, download() is skipped.
2. processed_file_names(): returns the list of processed files that must be present in processed_dir; if they all exist, process() is skipped.
3. download(): downloads the raw data into raw_dir.
4. process(): the most important method of the dataset. You need to gather your data into a list of Data objects. Then, call self.collate() to compute the slices that will be used by the DataLoader object. The following shows an example of a custom dataset from the official PyG website.

In [3]:
import torch
from torch_geometric.data import InMemoryDataset


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
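
The idea behind `self.collate()` can be sketched in plain Python: the per-graph arrays are concatenated into one long array, and slice boundaries record where each graph starts and ends. This is a simplified illustration of that idea, not PyG's actual implementation; `collate_sketch` and `get_graph` are hypothetical names.

```python
def collate_sketch(graphs):
    """Concatenate per-graph node-feature lists and record the slice boundaries."""
    merged = []
    slices = [0]
    for node_features in graphs:
        merged.extend(node_features)
        slices.append(len(merged))
    return merged, slices


def get_graph(merged, slices, idx):
    """Recover the node features of graph `idx` from the merged storage."""
    return merged[slices[idx]:slices[idx + 1]]


# Three toy graphs with 2, 1 and 3 nodes (2 features per node).
graphs = [[[2, 1], [5, 6]], [[3, 7]], [[12, 0], [1, 1], [4, 4]]]
merged, slices = collate_sketch(graphs)
print(slices)
print(get_graph(merged, slices, 1))
```

Storing one merged object plus the slices is what lets an in-memory dataset keep many small graphs in a single file and still index them individually.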

## Basics of GNN

#### Message passing (or neighborhood aggregation)
For each node *v* of the graph, we can define a *d*-dimensional vector **h**_v that contains the info of the node and its neighbors. The transformation from node info to a *d*-dimensional vector is called **node embedding**.

**h**_v = *f*(**x**_v, **x**_ed[v], **h**_ne[v], **x**_ne[v])

where:

**x**_v: feature vector of the node *v*

**x**_ed[v]: feature vectors of the edges connected with the node *v*

**h**_ne[v]: embedding of the neighboring nodes of *v* (thus the embedding of each node connected to *v*)

**x**_ne[v]: feature vectors of the neighboring nodes of *v* (thus the features of each node connected to *v*)

*f*: transition function


The output of the GNN is computed as:

**o**_v = *g*(**h**_v, **x**_v)

Both *f* and *g* can be interpreted as feed-forward fully-connected Neural Networks. The loss is:

*loss* = sum(**ground_truth**_i - **o**_i) 

which can be optimized via gradient descent (optimization of *f* and *g*)

Source: https://medium.com/towards-data-science/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3


###### In general:

![image](message_passing.png)

**x**_i^k = UPDATE^k(**x**_i^(k-1), AGGREGATE(MESSAGE^k(**x**_i^(k-1), **x**_j^(k-1), **e**_ij)))

where:

**x**_i^k is the embedding (*d*-dimensional vector) of node *i* at layer *k*

**e**_ij are the features of the edge between nodes *i* and *j*

It works as follows:
1. Message: for the i-th node, compute one message per neighbor *j*, using the features of node *i*, the features of the neighbor *j* and the features of the edge connecting them.
2. Aggregation: combine all the messages arriving at the i-th node with a permutation-invariant function, typically a sum, mean or max. The learnable weights usually sit in the message and update functions. (Analogy with a neuron: aggregation of info from connected neurons to generate an input impulse.)
3. Update: compute the new feature vector of node *i* from its previous feature vector and the aggregated message.
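
The three steps above can be sketched in plain Python on the 4-node toy graph from Step 1. The concrete choices here are assumptions made only for illustration: the message is the neighbor's feature vector unchanged, the aggregation is an element-wise max, and the update adds the aggregated message to the node's own features. A real layer would put learnable weights in the message and update functions.

```python
# Toy graph from Step 1: node features and directed edges (source -> target).
features = [[2, 1], [5, 6], [3, 7], [12, 0]]
sources = [0, 2, 1, 0, 3]
targets = [3, 1, 0, 1, 2]


def message(x_j):
    """Step 1: the message sent by neighbor j (identity in this sketch)."""
    return list(x_j)


def aggregate(messages):
    """Step 2: element-wise max over all messages received by one node."""
    return [max(column) for column in zip(*messages)]


def update(x_i, aggregated):
    """Step 3: new features of node i (own features plus aggregated message)."""
    return [own + agg for own, agg in zip(x_i, aggregated)]


def message_passing_round(features, sources, targets):
    inbox = [[] for _ in features]
    for src, dst in zip(sources, targets):
        inbox[dst].append(message(features[src]))
    new_features = []
    for x_i, messages in zip(features, inbox):
        if messages:
            new_features.append(update(x_i, aggregate(messages)))
        else:
            new_features.append(list(x_i))  # nodes without incoming edges keep their features
    return new_features


new_features = message_passing_round(features, sources, targets)
print(new_features)
```

Node 1 is the only node with two incoming edges (from nodes 2 and 0), so it is the only place where the max aggregation actually has to choose between messages.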


Example with SAGEConv:
![image](SageConv.png)



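**In order to create a message passing layer with PyTorch Geometric, we must create a class which inherits from torch_geometric.nn.MessagePassing and relies on the following methods:**
1. message(): builds the message sent along each edge (overridden in the subclass).
2. update(): computes the new embedding of each node from the aggregated messages (overridden in the subclass).
3. propagate(): provided by the base class; it is called from forward() and runs message(), the aggregation and update() in sequence.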

In [4]:
class SAGEConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(SAGEConv, self).__init__(aggr='max') #  "Max" aggregation.
        self.lin = torch.nn.Linear(in_channels, out_channels)
        self.act = torch.nn.ReLU()
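        # The update projects the concatenation [aggregated, x] to out_channels,
        # so the layer returns a tensor of shape [N, out_channels]
        self.update_lin = torch.nn.Linear(in_channels + out_channels, out_channels, bias=False)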
        self.update_act = torch.nn.ReLU()
        
    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]
        
        
        edge_index, _ = remove_self_loops(edge_index)
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        
        
        return self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x)

    def message(self, x_j):
        # x_j has shape [E, in_channels]

        x_j = self.lin(x_j)
        x_j = self.act(x_j)
        
        return x_j

    def update(self, aggr_out, x):
        # aggr_out has shape [N, out_channels]


        new_embedding = torch.cat([aggr_out, x], dim=1)
        
        new_embedding = self.update_lin(new_embedding)
        new_embedding = self.update_act(new_embedding)
        
        return new_embedding
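
The two self-loop calls in `forward()` can be illustrated with a plain-Python stand-in. The sketch assumes the edges are stored as a list of `(source, target)` tuples instead of a `[2, E]` tensor, and the function names with the `_sketch` suffix are hypothetical. Existing self-loops are removed first so that, after adding one loop per node, every node has exactly one.

```python
def remove_self_loops_sketch(edges):
    """Drop every edge whose source and target are the same node."""
    return [(src, dst) for src, dst in edges if src != dst]


def add_self_loops_sketch(edges, num_nodes):
    """Append one (i, i) edge for every node i."""
    return edges + [(i, i) for i in range(num_nodes)]


# Toy edge list with one pre-existing self-loop on node 1.
edges = [(0, 3), (2, 1), (1, 1), (1, 0), (0, 1), (3, 2)]
cleaned = add_self_loops_sketch(remove_self_loops_sketch(edges), num_nodes=4)
print(cleaned)
```

With a self-loop on every node, each node's own features are among the messages it receives, so the max aggregation also considers the node itself.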

### EXAMPLE: RecSys Challenge 2015

The RecSys Challenge 2015 asked data scientists to build a session-based recommender system. Participants in this challenge were asked to solve two tasks:
1. Predict whether a sequence of clicks will be followed by a buy event
2. Predict which item will be bought

The click dataset contains all the clicks (visits) on the items of the online shop. The buy dataset contains all the item purchases. We will combine the two datasets into one by labeling each click session with whether it also appears in the buy dataset.

### Step 1: load the data

In [9]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

# Load the click dataset (all the clicks, i.e. visits, on the items) into a pandas DataFrame
df = pd.read_csv('yoochoose-clicks.dat', header=None)
df.columns=['session_id','timestamp','item_id','category']

# Load the buy dataset (all the item purchases) into a pandas DataFrame
buy_df = pd.read_csv('yoochoose-buys.dat', header=None)
buy_df.columns=['session_id','timestamp','item_id','price','quantity']

# Encode the item ids as consecutive integers between 0 and n_items-1
item_encoder = LabelEncoder()
df['item_id'] = item_encoder.fit_transform(df.item_id)
# df.head()


In [7]:
#randomly sample click sessions (the original dataset is too large)
num_sessions = 1000000
sampled_session_id = np.random.choice(df.session_id.unique(), num_sessions, replace=False)
df = df.loc[df.session_id.isin(sampled_session_id)]
df.nunique()

session_id    1000000
timestamp     3570402
item_id         35720
category          233
dtype: int64

In [8]:
# To determine the ground truth, i.e. whether there is any buy event for a given session,
# we simply check whether a session_id in yoochoose-clicks.dat is also present in yoochoose-buys.dat.
df['label'] = df.session_id.isin(buy_df.session_id)
df.head()

| | session_id | timestamp | item_id | category | label |
|---|---|---|---|---|---|
| 19 | 8 | 2014-04-06T08:49:58.728Z | 45037 | 0 | False |
| 20 | 8 | 2014-04-06T08:52:12.647Z | 45037 | 0 | False |
| 24 | 11 | 2014-04-03T10:44:35.672Z | 39927 | 0 | True |
| 25 | 11 | 2014-04-03T10:45:01.674Z | 39927 | 0 | True |
| 26 | 11 | 2014-04-03T10:45:29.873Z | 39966 | 0 | True |


So far, we have loaded the data into pandas DataFrames, encoded the item ids as consecutive integers, randomly sampled the click sessions and added the ground-truth label (buy / not buy).
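
The encoding and labeling steps can be reproduced in plain Python on a handful of made-up rows. The session ids, item ids and the set of buying sessions below are hypothetical values chosen for illustration; the mapping mirrors how `LabelEncoder` assigns integers to the sorted unique values.

```python
# Hypothetical click rows: (session_id, item_id)
clicks = [(8, 500), (8, 500), (11, 120), (11, 120), (11, 300)]
# Hypothetical set of sessions that appear in the buy dataset
buy_sessions = {11}

# Item encoding: sorted unique item ids are mapped to 0 .. n_items-1
item_to_code = {item: code for code, item in enumerate(sorted({item for _, item in clicks}))}
encoded_items = [item_to_code[item] for _, item in clicks]

# Ground truth: a click row is labeled True if its session contains a purchase
labels = [session in buy_sessions for session, _ in clicks]

print(item_to_code)
print(encoded_items)
print(labels)
```

Every row of a session gets the same label, which is why the dataset class later on only reads the label of the first row of each session.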

### Step 2: store the data in a PyTorch Geometric dataset

In [10]:
import torch
from torch_geometric.data import Data, InMemoryDataset
from tqdm import tqdm

In [13]:
# Create a dataset class inheriting from InMemoryDataset (or Dataset).
# super() gives a subclass access to the methods of the superclass it inherits from.
# Calling super().__init__() inside __init__ lets the base class set up the folders
# and run download() and process() when the raw or processed files are missing.

class YooChooseBinaryDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(YooChooseBinaryDataset, self).__init__(root, transform, pre_transform)
        # load the collated data and slices saved by process()
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []
    @property
    def processed_file_names(self):
        return ['yoochoose_click_binary_1M_sess.dataset']

    def download(self):
        pass
    
    def process(self):
        """
        The most important method of the dataset: here the loaded data is transformed into PyG graphs.
        You need to gather your data into a list of Data objects (one graph per session).
        Then, call self.collate() to compute the slices that will be used by the DataLoader object.
        """
        data_list = []

        # process by session_id: each session becomes one graph
        grouped = df.groupby('session_id')
        for session_id, group in tqdm(grouped):
            # re-encode the item ids within the session so that node indices start at 0
            sess_item_id = LabelEncoder().fit_transform(group.item_id)
            group = group.reset_index(drop=True)
            group['sess_item_id'] = sess_item_id
            # node features: the global item_id of each unique item, ordered by its session-level index
            node_features = group.loc[group.session_id==session_id,['sess_item_id','item_id']].sort_values('sess_item_id').item_id.drop_duplicates().values

            node_features = torch.LongTensor(node_features).unsqueeze(1)
            # edges connect each click to the next click of the session
            target_nodes = group.sess_item_id.values[1:]
            source_nodes = group.sess_item_id.values[:-1]

            # stack into a single ndarray first: building a tensor from a list of ndarrays is slow and triggers a warning
            edge_index = torch.tensor(np.array([source_nodes, target_nodes]), dtype=torch.long)
            x = node_features

            # graph-level label: whether the session contains a buy event
            y = torch.FloatTensor([group.label.values[0]])

            data = Data(x=x, edge_index=edge_index, y=y)
            data_list.append(data)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
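
The per-session logic of `process()` can be followed on a single made-up session in plain Python. The item ids are hypothetical and `session_to_graph` is a name introduced only for this sketch; it mirrors the steps above (session-level encoding, unique items as nodes, consecutive clicks as edges) without pandas or PyTorch.

```python
def session_to_graph(item_ids):
    """Turn the ordered item ids clicked in one session into nodes and edges."""
    unique_items = sorted(set(item_ids))
    # Session-level encoding: sorted unique item ids are mapped to 0 .. n-1
    local_id = {item: idx for idx, item in enumerate(unique_items)}
    sess_item_id = [local_id[item] for item in item_ids]
    # Node i carries the global item id whose session-level index is i
    node_features = unique_items
    # One directed edge from each click to the next one
    source_nodes = sess_item_id[:-1]
    target_nodes = sess_item_id[1:]
    return node_features, source_nodes, target_nodes


# Hypothetical session: item 214 is clicked twice
node_features, source_nodes, target_nodes = session_to_graph([214, 105, 214, 377])
print(node_features)
print(source_nodes)
print(target_nodes)
```

A repeated item maps to the same node, so a session with 4 clicks on 3 distinct items yields a graph with 3 nodes and 3 edges. A session with a single click yields one node and no edges.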

In [17]:
dataset = YooChooseBinaryDataset('ogbn-products')
dataset = dataset.shuffle()
train_dataset = dataset[:800000]
val_dataset = dataset[800000:900000]
test_dataset = dataset[900000:]
len(train_dataset), len(val_dataset), len(test_dataset)

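Processing...
  0%|                                               | 0/9249729 [00:09<?, ?it/s]

AttributeError: 'DataFrame' object has no attribute 'label'

This error comes from running the cells out of order: the loading cell (In [9]) was re-run after the sampling cell (In [7]) and the labeling cell (In [8]), so df is again the full click dataset (9249729 sessions in the progress bar instead of the 1000000 sampled ones) and no longer has the label column. Running the cells from top to bottom, in the order shown here, avoids it.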

In [18]:
from torch_geometric.nn import TopKPooling
from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp
import torch.nn.functional as F

embed_dim = 128  # size of the learned item embedding


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.conv1 = SAGEConv(embed_dim, 128)
        self.pool1 = TopKPooling(128, ratio=0.8)
        self.conv2 = SAGEConv(128, 128)
        self.pool2 = TopKPooling(128, ratio=0.8)
        self.conv3 = SAGEConv(128, 128)
        self.pool3 = TopKPooling(128, ratio=0.8)
        self.item_embedding = torch.nn.Embedding(num_embeddings=df.item_id.max() +1, embedding_dim=embed_dim)
        self.lin1 = torch.nn.Linear(256, 128)
        self.lin2 = torch.nn.Linear(128, 64)
        self.lin3 = torch.nn.Linear(64, 1)
        self.bn1 = torch.nn.BatchNorm1d(128)
        self.bn2 = torch.nn.BatchNorm1d(64)
        self.act1 = torch.nn.ReLU()
        self.act2 = torch.nn.ReLU()        
        
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.item_embedding(x)
        x = x.squeeze(1)        

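        x = F.relu(self.conv1(x, edge_index))

        # Recent PyG releases return (x, edge_index, edge_attr, batch, perm, score) from TopKPooling;
        # the starred target absorbs the trailing values and also accepts a shorter tuple
        x, edge_index, _, batch, *_ = self.pool1(x, edge_index, None, batch)
        # readout: concatenate the max-pooled and mean-pooled node embeddings of each graph
        x1 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv2(x, edge_index))

        x, edge_index, _, batch, *_ = self.pool2(x, edge_index, None, batch)
        x2 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv3(x, edge_index))

        x, edge_index, _, batch, *_ = self.pool3(x, edge_index, None, batch)
        x3 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)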

        x = x1 + x2 + x3

        x = self.lin1(x)
        x = self.act1(x)
        x = self.lin2(x)
        x = self.act2(x)      
        x = F.dropout(x, p=0.5, training=self.training)

        x = torch.sigmoid(self.lin3(x)).squeeze(1)

        return x
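
The readout used after each pooling layer (global max pool concatenated with global mean pool) can be sketched in plain Python. The function name `readout` and the toy values are assumptions for illustration; `batch` plays the same role as `data.batch`, assigning every node to a graph of the mini-batch, and the sketch assumes every graph has at least one node.

```python
def readout(x, batch, num_graphs):
    """Concatenate the element-wise max and mean of the node vectors of each graph."""
    grouped = [[] for _ in range(num_graphs)]
    for row, graph_id in zip(x, batch):
        grouped[graph_id].append(row)
    graph_vectors = []
    for rows in grouped:
        max_pool = [max(column) for column in zip(*rows)]
        mean_pool = [sum(column) / len(rows) for column in zip(*rows)]
        graph_vectors.append(max_pool + mean_pool)
    return graph_vectors


# Three nodes with 2 features each: the first two belong to graph 0, the last to graph 1
x = [[1, 4], [3, 2], [5, 0]]
batch = [0, 0, 1]
graph_vectors = readout(x, batch, num_graphs=2)
print(graph_vectors)
```

The concatenation doubles the feature size, which is why `lin1` in the network above takes 256 inputs for 128-dimensional node embeddings.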

Source: https://medium.com/analytics-vidhya/ohmygraphs-graphsage-in-pyg-598b5ec77e7b