# Graph Neural Networks with PyTorch Geometric

In the last decade, Deep Learning approaches (e.g. Convolutional Neural Networks and Recurrent Neural Networks) have achieved unprecedented performance on a broad range of problems from a variety of fields (e.g. Computer Vision and Speech Recognition). Despite these results, research on Deep Learning techniques has so far focused mainly on data defined on Euclidean domains (i.e. grids). Nonetheless, in many fields, such as Biology, Physics, Network Science, Recommender Systems and Computer Graphics, one may have to deal with data defined on non-Euclidean domains (i.e. graphs and manifolds). The adoption of Deep Learning in these fields lagged behind until very recently, primarily because the non-Euclidean nature of the data makes basic operations (such as convolution) hard to define. Geometric Deep Learning deals with the extension of Deep Learning techniques to graph- and manifold-structured data.
This website represents a collection of materials in the field of Geometric Deep Learning. We collect workshops, tutorials, publications and code that several different researchers have produced in recent years. Our goal is to provide a general picture of this new and emerging field, which is developing rapidly in the scientific community thanks to its broad applicability.
## Resources
- https://arxiv.org/pdf/1706.02216.pdf
- https://github.com/rusty1s/pytorch_geometric
- https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8
- https://towardsdatascience.com/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3
- http://geometricdeeplearning.com/


## Graph
In Computer Science, a graph is a data structure consisting of two components: vertices and edges. A graph $G$ is fully described by the set of vertices $V$ and the set of edges $E$ it contains.
$$ G = (V,E) $$

Edges can be directed or undirected. Vertices are also commonly called nodes.
![directed graph](http://think-like-a-git.net/assets/images2/directed-graph.png) ![undirected](http://think-like-a-git.net/assets/images2/undirected-graph.png)
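
As a minimal sketch of the definition $G = (V, E)$, the snippet below stores a graph as a vertex set plus an edge list and derives an adjacency list from it. The helper name `build_adjacency` and the three-node example are illustrative assumptions, not part of any library.

```python
def build_adjacency(vertices, edges, directed=True):
    """Return a dict mapping each vertex to the sorted list of its neighbours."""
    adjacency = {v: set() for v in vertices}
    for source, target in edges:
        adjacency[source].add(target)
        if not directed:
            # An undirected edge can be traversed in both directions.
            adjacency[target].add(source)
    return {v: sorted(neighbours) for v, neighbours in adjacency.items()}


# Hypothetical example graph: V = {0, 1, 2}, E = {(0, 1), (1, 2)}
V = {0, 1, 2}
E = [(0, 1), (1, 2)]

directed_adj = build_adjacency(V, E, directed=True)
undirected_adj = build_adjacency(V, E, directed=False)

print(directed_adj)    # {0: [1], 1: [2], 2: []}
print(undirected_adj)  # {0: [1], 1: [0, 2], 2: [1]}
```

The same edge list yields different neighbourhoods depending on whether the edges are read as directed or undirected, which matters later when neighbours are aggregated.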

## Graph Neural Network
A Graph Neural Network (GNN) works directly on the graph structure: each node in the graph is associated with a label, and we would like to predict the labels of the nodes. The model was introduced in the [first graph paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1015.7227&rep=rep1&type=pdf).
In the node classification setup, each node $\nu$ is characterised by its features $x_\nu$ and is associated with a ground-truth label $t_\nu$. Given a partially labeled graph $G$, the goal is to use the labeled nodes to predict the labels of the unlabeled ones.

The network learns to represent each node with a $d$-dimensional vector (state) $h_\nu$ which contains the information of its neighbourhood:
$$\mathbf{h}_\nu = f(\mathbf{x}_\nu, \mathbf{x}_{co[\nu]}, \mathbf{h}_{ne[\nu]}, \mathbf{x}_{ne[\nu]})$$
$$\mathbf{o}_\nu = g(\mathbf{h}_\nu, \mathbf{x}_\nu)$$

where $\mathbf{x}_\nu$, $\mathbf{x}_{co[\nu]}$, $\mathbf{h}_{ne[\nu]}$ and $\mathbf{x}_{ne[\nu]}$ are the features of $\nu$, the features of its edges, the states of the nodes in the neighbourhood of $\nu$, and the features of those nodes, respectively, and $\mathbf{o}_\nu$ is the output for node $\nu$. Since we are seeking a unique solution for $\mathbf{h}_\nu$, we can apply the [Banach fixed point theorem](https://en.wikipedia.org/wiki/Fixed-point_theorem): if $f$ is a contraction mapping, there exists a unique solution of $x = f(x)$.

The state equation above can be rewritten as an iterative update process, which is called __message passing__ or __neighbourhood aggregation__:
$$ \mathbf{H}^{t+1} = F(\mathbf{H}^t, \mathbf{X})$$

where $\mathbf{H}$ and $\mathbf{X}$ denote the concatenation of all states $h$ and all features $x$, and $\mathbf{H}^t$ denotes $\mathbf{H}$ at the $t$-th iteration.
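
To make the iteration concrete, here is a minimal sketch with scalar states ($d = 1$). The update rule (own feature plus a damped mean of the neighbour states) is a hand-picked contraction chosen only for illustration; it is an assumption, not the learned $F$ of an actual GNN. All names (`message_passing_step`, `iterate_to_fixed_point`, `damping`) are hypothetical.

```python
def message_passing_step(states, features, neighbours, damping=0.5):
    """Apply F once: new state = own feature + damping * mean of neighbour states."""
    new_states = {}
    for node, feature in features.items():
        neigh = neighbours[node]
        mean_state = sum(states[u] for u in neigh) / len(neigh) if neigh else 0.0
        new_states[node] = feature + damping * mean_state
    return new_states


def iterate_to_fixed_point(features, neighbours, tol=1e-9, max_iters=1000):
    """Repeat H^{t+1} = F(H^t, X) until the largest state change drops below tol."""
    states = {node: 0.0 for node in features}  # H^0: all states start at zero
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new_states = message_passing_step(states, features, neighbours)
        change = max(abs(new_states[n] - states[n]) for n in states)
        states = new_states
        if change < tol:
            break
    return states, iterations


# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node.
features = {0: 1.0, 1: 0.0, 2: 1.0}
neighbours = {0: [1], 1: [0, 2], 2: [1]}

states, iterations = iterate_to_fixed_point(features, neighbours)
print(states)  # states approach h_0 = h_2 = 4/3 and h_1 = 2/3
```

Because `damping` is below 1, each step shrinks the distance between any two candidate states, so the iteration converges to the same fixed point regardless of the starting values.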

$f$ and $g$ can be interpreted as feedforward neural networks. Assuming that each supervised node $\nu$ has a target $t_\nu$, the loss can be written as follows:
$$\text{loss} = \sum_{i=1}^p(\mathbf{t}_i-\mathbf{o}_i)$$
where $p$ is the number of supervised nodes (see [Graph Neural Networks](https://arxiv.org/pdf/1812.08434.pdf)).

## GraphSAGE (Graph SAmple and aggreGatE)
![graph](images/graphsage-d.png)
Unlike embedding approaches that are based on matrix factorization, we leverage node features (e.g., text attributes, node profile information, node degrees) in order to learn an embedding function that generalizes to unseen nodes. By incorporating node features in the learning algorithm, we simultaneously learn the topological structure of each node’s neighborhood as well as the distribution of node features in the neighborhood. While we focus on feature-rich graphs (e.g., citation data with text attributes, biological data with functional/molecular markers), our approach can also make use of structural features that are present in all graphs (e.g., node degrees).

To learn an embedding for each node in an inductive way (see [Inductive Representation Learning on Large Graphs](https://arxiv.org/pdf/1706.02216.pdf)), we represent each node as an aggregation of its neighbourhood:
![graphsage](images/graphsage.png)


$$ \mathbf{h}^k_{\mathcal{N}(\nu)} \leftarrow \text{AGGREGATE}_k (\{\mathbf{h}^{k-1}_u, \forall u \in \mathcal{N}(\nu)\})$$

$$ \mathbf{h}^k_\nu \leftarrow \sigma(\mathbf{W}^k \cdot \text{CONCAT}(\mathbf{h}^{k-1}_\nu, \mathbf{h}^k_{\mathcal{N}(\nu)}))  $$
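
The two update rules can be traced by hand with a small dependency-free sketch. It assumes a mean aggregator for $\text{AGGREGATE}_k$, ReLU for $\sigma$, a fixed hand-written weight matrix instead of a learned one, and that every node has at least one neighbour. Neighbour sampling and any normalisation step are left out. All function names are illustrative.

```python
def mean_aggregate(neighbour_states):
    """AGGREGATE: element-wise mean of a non-empty list of equal-length vectors."""
    count = len(neighbour_states)
    return [sum(values) / count for values in zip(*neighbour_states)]


def relu(vector):
    """sigma: element-wise ReLU non-linearity."""
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def graphsage_layer(states, neighbours, weight):
    """Compute h^k for every node from the previous states h^{k-1}."""
    new_states = {}
    for node, own_state in states.items():
        aggregated = mean_aggregate([states[u] for u in neighbours[node]])
        concatenated = own_state + aggregated  # CONCAT(h_v^{k-1}, h_N(v)^k)
        new_states[node] = relu(matvec(weight, concatenated))
    return new_states


# Hypothetical 3-node graph with 2-dimensional states h^0.
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbours = {0: [1, 2], 1: [0], 2: [0]}
# W^1 maps the 4-dimensional concatenation back to 2 dimensions (made-up values).
W1 = [[1.0, 0.0, 0.0, 1.0],
      [0.0, -1.0, 1.0, 0.0]]

h1 = graphsage_layer(h0, neighbours, W1)
print(h1)  # {0: [2.0, 0.5], 1: [0.0, 0.0], 2: [1.0, 0.0]}
```

For node 0, the neighbour mean is `[0.5, 1.0]`, the concatenation is `[1.0, 0.0, 0.5, 1.0]`, and the weight matrix maps it to `[2.0, 0.5]`. Stacking $K$ such layers lets information from $K$ hops away reach each node.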

In [1]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)

In [2]:
data

Data(edge_index=[2, 4], x=[3, 1])

In [3]:
x = torch.tensor([[2,1], [5,6], [3,7], [12,0]], dtype=torch.float)
y = torch.tensor([0, 1, 0, 1], dtype=torch.float)

![Example Graph](images/example_graph.png)

In [4]:
import torch
from torch_geometric.data import Data


x = torch.tensor([[2,1], [5,6], [3,7], [12,0]], dtype=torch.float)
y = torch.tensor([0, 1, 0, 1], dtype=torch.float)

edge_index = torch.tensor([[0, 2, 1, 0, 3],
                           [3, 1, 0, 1, 2]], dtype=torch.long)
#connect node 0 => 3, 2 => 1, 1 => 0, 0 => 1, 3 => 2


data = Data(x=x, y=y, edge_index=edge_index)

In [5]:
data

Data(edge_index=[2, 5], x=[4, 2], y=[4])
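
The `edge_index` tensors above store the source nodes in the first row and the target nodes in the second. The helper below is a small sketch (the name `to_edge_index` is hypothetical and not part of PyTorch Geometric) that builds this two-row layout from a list of `(source, target)` pairs and, optionally, adds the reverse direction of every edge, as an undirected graph requires.

```python
def to_edge_index(edge_pairs, undirected=False):
    """Convert (source, target) pairs into the layout [[sources], [targets]]."""
    pairs = list(edge_pairs)
    if undirected:
        # An undirected edge is stored once per direction; skip existing reverses.
        seen = set(pairs)
        for source, target in list(pairs):
            if (target, source) not in seen:
                pairs.append((target, source))
                seen.add((target, source))
    sources = [source for source, _ in pairs]
    targets = [target for _, target in pairs]
    return [sources, targets]


# The five directed edges of the 4-node example: 0=>3, 2=>1, 1=>0, 0=>1, 3=>2
directed = to_edge_index([(0, 3), (2, 1), (1, 0), (0, 1), (3, 2)])
print(directed)    # [[0, 2, 1, 0, 3], [3, 1, 0, 1, 2]]

# Two undirected edges 0-1 and 1-2 become four directed entries.
undirected = to_edge_index([(0, 1), (1, 2)], undirected=True)
print(undirected)  # [[0, 1, 1, 2], [1, 2, 0, 1]]
```

The nested list can be passed to `torch.tensor(..., dtype=torch.long)` as in the cells above. The order of the columns does not matter, only which `(source, target)` pairs are present.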

In [10]:
import os.path as osp

import torch
from torch_geometric.data import Data, Dataset


# Template for a dataset that is too large to hold in memory: every graph is
# saved to its own file, so the class derives from `Dataset`, not `InMemoryDataset`.
class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self):
        # Files expected in `self.raw_dir`; `download()` runs if any is missing.
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        # Files expected in `self.processed_dir`; `process()` runs if any is missing.
        return ['data_1.pt', 'data_2.pt', ...]

    def __len__(self):
        # Note: newer PyTorch Geometric releases expect this method to be named `len`.
        return len(self.processed_file_names)

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        i = 0
        for raw_path in self.raw_paths:
            # Read data from `raw_path`.
            data = Data(...)

            if self.pre_filter is not None and not self.pre_filter(data):
                continue

            if self.pre_transform is not None:
                data = self.pre_transform(data)

            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1

    def get(self, idx):
        # Load a single graph from disk on demand.
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data


In [11]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.read_csv('data/yoochoose-clicks.dat', header=None)
df.columns=['session_id','timestamp','item_id','category']

buy_df = pd.read_csv('data/yoochoose-buys.dat', header=None)
buy_df.columns=['session_id','timestamp','item_id','price','quantity']

item_encoder = LabelEncoder()
df['item_id'] = item_encoder.fit_transform(df.item_id)
print(df.head().T)



                                   0                         1  \
session_id                         1                         1   
timestamp   2014-04-07T10:51:09.277Z  2014-04-07T10:54:09.868Z   
item_id                         2053                      2052   
category                           0                         0   

                                   2                         3  \
session_id                         1                         1   
timestamp   2014-04-07T10:54:46.998Z  2014-04-07T10:57:00.306Z   
item_id                         2054                      9876   
category                           0                         0   

                                   4  
session_id                         2  
timestamp   2014-04-07T13:56:37.614Z  
item_id                        19448  
category                           0  


In [12]:
import numpy as np
# randomly sample 1,000,000 session ids (without replacement)
sampled_session_id = np.random.choice(df.session_id.unique(), 1000000, replace=False)
df = df.loc[df.session_id.isin(sampled_session_id)]
df.nunique()

session_id    1000000
timestamp     3570054
item_id         35508
category          238
dtype: int64

In [13]:
df['label'] = df.session_id.isin(buy_df.session_id)
df.head()

    session_id                 timestamp  item_id  category  label
10           3  2014-04-02T13:17:46.940Z    28989         0  False
11           3  2014-04-02T13:26:02.515Z    35310         0  False
12           3  2014-04-02T13:30:12.318Z    43178         0  False
21           9  2014-04-06T11:26:24.127Z     9613         0  False
22           9  2014-04-06T11:28:54.654Z     9613         0  False


In [14]:
import torch
from torch_geometric.data import Data, InMemoryDataset
from tqdm import tqdm

class YooChooseBinaryDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(YooChooseBinaryDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])
        print(self.processed_dir)

    @property
    def raw_file_names(self):
        return []
    @property
    def processed_file_names(self):
        return ['data/yoochoose_click_binary_1M_sess.dataset']

    def download(self):
        pass
    
    def process(self):
        
        data_list = []

        # process by session_id: every session becomes one graph
        grouped = df.groupby('session_id')
        for session_id, group in tqdm(grouped):
            # re-encode item ids so that node indices start at 0 within each session
            sess_item_id = LabelEncoder().fit_transform(group.item_id)
            group = group.reset_index(drop=True)
            group['sess_item_id'] = sess_item_id
            # node features: global item_id of each unique item, ordered by its session-local index
            node_features = group.loc[group.session_id==session_id,['sess_item_id','item_id']].sort_values('sess_item_id').item_id.drop_duplicates().values

            node_features = torch.LongTensor(node_features).unsqueeze(1)
            # edges connect consecutive clicks: the i-th clicked item points to the (i+1)-th
            target_nodes = group.sess_item_id.values[1:]
            source_nodes = group.sess_item_id.values[:-1]

            edge_index = torch.tensor([source_nodes, target_nodes], dtype=torch.long)
            x = node_features

            # graph-level label: whether the session also appears in the buy data
            y = torch.FloatTensor([group.label.values[0]])

            data = Data(x=x, edge_index=edge_index, y=y)
            data_list.append(data)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
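
The core of `process()` is the conversion of one click session into a graph. The sketch below reproduces that logic without pandas or PyTorch so that it can be traced by hand. The function name `session_to_graph` and the click sequence are hypothetical, and `sorted(set(...))` stands in for what `LabelEncoder` does, namely mapping the sorted unique values to `0..n-1`.

```python
def session_to_graph(item_ids):
    """Turn a click sequence of global item ids into node features and edges."""
    # Session-local encoding: sorted unique items get the indices 0..n-1.
    unique_items = sorted(set(item_ids))
    local_id = {item: index for index, item in enumerate(unique_items)}
    sess_item_ids = [local_id[item] for item in item_ids]

    # One node per unique item; its single feature is the global item id.
    node_features = [[item] for item in unique_items]

    # One directed edge per pair of consecutive clicks.
    source_nodes = sess_item_ids[:-1]
    target_nodes = sess_item_ids[1:]
    return node_features, [source_nodes, target_nodes]


# Hypothetical session: item 2052 is clicked twice.
clicks = [2053, 2052, 2054, 2052]
node_features, edge_index = session_to_graph(clicks)

print(node_features)  # [[2052], [2053], [2054]]
print(edge_index)     # [[1, 0, 2], [0, 2, 0]]
```

A repeated click does not create a new node. It only adds another edge to the existing one, which is why the number of nodes in a session graph can be smaller than the number of clicks.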
        

In [18]:
dataset = YooChooseBinaryDataset('')
dataset = dataset.shuffle()
train_dataset = dataset[:800000]
val_dataset = dataset[800000:900000]
test_dataset = dataset[900000:]
len(train_dataset), len(val_dataset), len(test_dataset)

Processing...


100%|██████████| 1000000/1000000 [40:29<00:00, 411.65it/s] 


Done!
./processed


(800000, 100000, 100000)
