# Notebook 1: Introduction to Graphs
This notebook introduces graph mining and graph neural networks. It covers the main concepts and some of the packages you'll need to work with graphs.

## NetworkX
NetworkX is a useful Python package for creating, manipulating, and mining graphs.

In [None]:
import networkx as nx

In [None]:
#Create a graph. This graph will be undirected
G = nx.Graph()
print("Is G directed?: " + str(G.is_directed()))

#Create a directed graph
H = nx.DiGraph()
print("Is H directed?: "+ str(H.is_directed()))

# Add a graph-level attribute to G
G.graph['Name'] = 'Bar'
print(G.graph)

### Nodes

We can easily add nodes and attributes to NetworkX graphs using the add_node() method.

In [None]:
# Add one node with node level attributes:
G.add_node(0, feature=5, label=0)

# Get attributes of the node 0
node_0_attr = G.nodes[0]
print("Node 0 has the attributes {}".format(node_0_attr))

In [None]:
# Add multiple nodes with attributes
G.add_nodes_from([
  (1, {"feature": 1, "label": 1}),
  (2, {"feature": 2, "label": 2})
]) #(node, attrdict)

# Loop through all the nodes
# Setting data=True also returns the node attributes
for node in G.nodes(data=True):
  print(node)

# Get number of nodes
num_nodes = G.number_of_nodes()
print("G has {} nodes".format(num_nodes))

### Edges
Edges as well as their attributes can be easily added to NetworkX graphs. 

In [None]:
# Add one edge with edge weight 0.5
G.add_edge(0, 1, weight=0.5)

# Get attributes of the edge (0, 1)
edge_0_1_attr = G.edges[(0, 1)]
print("Edge (0, 1) has the attributes {}".format(edge_0_1_attr))

In [None]:
# Add multiple edges with edge weights
G.add_edges_from([
  (1, 2, {"weight": 0.3}),
  (2, 0, {"weight": 0.1})
])

# Loop through all the edges
# Here there is no data=True, so only the edge will be returned
for edge in G.edges():
  print(edge)

# Get number of edges
num_edges = G.number_of_edges()
print("G has {} edges".format(num_edges))
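To read the attributes back while iterating, `G.edges` also accepts a `data` argument. The sketch below rebuilds the same three-node graph so that it runs on its own; the names `G_demo`, `edges_with_attrs`, and `total_weight` are illustrative and not part of the notebook.

```python
import networkx as nx

# Rebuild the small triangle graph from the cells above (illustrative copy).
G_demo = nx.Graph()
G_demo.add_edge(0, 1, weight=0.5)
G_demo.add_edges_from([
    (1, 2, {"weight": 0.3}),
    (2, 0, {"weight": 0.1}),
])

# data=True yields (u, v, attribute_dict) triples.
edges_with_attrs = list(G_demo.edges(data=True))
for u, v, attrs in edges_with_attrs:
    print(u, v, attrs)

# data="weight" yields (u, v, weight) triples for a single attribute.
total_weight = sum(w for _, _, w in G_demo.edges(data="weight"))
print("Total edge weight: {:.1f}".format(total_weight))
```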

### Visualization
We can also visualize all of the graphs that we make.

In [None]:
# Draw graph
nx.draw(G, with_labels = True)

### Node Degree and Neighbors

We can also obtain the degree of a node and iterate over its neighbors.

In [None]:
node_id = 1

# Degree of node 1
print("Node {} has degree {}".format(node_id, G.degree[node_id]))

# Get neighbor of node 1
for neighbor in G.neighbors(node_id):
  print("Node {} has neighbor {}".format(node_id, neighbor))

## Intro to PyTorch Geometric

We first need to make sure PyTorch is installed; after that, we install the PyTorch Geometric libraries.

### Set-Up

In [None]:
import torch
print("PyTorch has version {}".format(torch.__version__))

In [None]:
# Install torch geometric
!pip3.9 install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
!pip3.9 install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
!pip3.9 install torch-geometric

In [None]:
%matplotlib inline
import torch
import networkx as nx
import matplotlib.pyplot as plt

# Visualization function for NX graph or PyTorch tensor
def visualize(h, color, epoch=None, loss=None, accuracy=None):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])

    if torch.is_tensor(h):
        h = h.detach().cpu().numpy()
        plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
        if epoch is not None and loss is not None and accuracy is not None and accuracy['train'] is not None and accuracy['val'] is not None:
            plt.xlabel((f'Epoch: {epoch}, Loss: {loss.item():.4f} \n'
                       f'Training Accuracy: {accuracy["train"]*100:.2f}% \n'
                       f' Validation Accuracy: {accuracy["val"]*100:.2f}%'),
                       fontsize=16)
    else:
        nx.draw_networkx(h, pos=nx.spring_layout(h, seed=42), with_labels=False,
                         node_color=color, cmap="Set2")
    plt.show()

In [None]:
def visualize_graph(G, color):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,
                     node_color=color, cmap="Set2")
    plt.show()


def visualize_embedding(h, color, epoch=None, loss=None):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    h = h.detach().cpu().numpy()
    plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
    if epoch is not None and loss is not None:
        plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)
    plt.show()

### Introduction

PyTorch Geometric (PyG) is an extension library for PyTorch designed specifically for deep learning on graph data.

Note: the installation method above can fail when importing `torch_geometric` (for example with `OSError: [WinError 127]` on Windows). If that happens, see https://lightrun.com/answers/pyg-team-pytorch_geometric-oserror-winerror-127-the-specified-procedure-could-not-be-found-when-importing-torch_geometric for related fixes.

In [None]:
from torch_geometric.datasets import KarateClub

dataset = KarateClub()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

Now that we have an overview of the dataset, we can take a closer look at its first graph.

In [None]:
data = dataset[0]  # Get the first graph object.

print(data)
print('==============================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
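# edge_index stores each undirected edge in both directions, so num_edges already counts every edge twice.
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')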
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Contains isolated nodes: {data.has_isolated_nodes()}')
print(f'Contains self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')

In [None]:
data.edge_index.T

### Data
Graphs in PyG are stored as `Data` objects. We can print the data object at any time with `print(data)` to receive some information about its attributes and their shapes.

In [None]:
print(data)

We can see that `data` holds 4 main attributes. The `edge_index` attribute holds the graph connectivity, `x` holds the node features, and `y` holds the node labels. There is also `train_mask`, which marks the nodes whose community assignments we already know.

The `data` object also provides utility functions for inferring basic properties of the graph. For instance, we can check whether the graph contains isolated nodes (i.e., nodes with no edges), whether it contains self-loops, and whether it is undirected.
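To make these checks concrete, the sketch below infers the same three properties from a plain list of (source, destination) pairs in pure Python. It is not PyG's implementation: the helper names are borrowed from the `Data` methods only for readability, the toy edge list is made up, and ignoring self-loops when looking for isolated nodes is a choice made for this sketch.

```python
# Toy graph with 4 nodes; each undirected edge is stored in both directions,
# mirroring how edge_index lists (source, destination) pairs.
num_nodes = 4
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # node 3 has no edges


def has_isolated_nodes(edges, num_nodes):
    """True if some node appears in no edge other than a self-loop."""
    touched = set()
    for src, dst in edges:
        if src != dst:
            touched.update((src, dst))
    return len(touched) < num_nodes


def has_self_loops(edges):
    """True if any edge connects a node to itself."""
    return any(src == dst for src, dst in edges)


def is_undirected(edges):
    """True if every edge (u, v) has a matching reverse edge (v, u)."""
    edge_set = set(edges)
    return all((dst, src) in edge_set for src, dst in edge_set)


print("Contains isolated nodes:", has_isolated_nodes(edges, num_nodes))
print("Contains self-loops:", has_self_loops(edges))
print("Is undirected:", is_undirected(edges))
```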

### Edge Index
We can now print the edge_index of our graph using:

In [None]:
from IPython.display import Javascript, display

# Restrict the height of the output cell (only has an effect in Google Colab).
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))

edge_index = data.edge_index
print(edge_index.t())

Printing `edge_index` helps us understand how PyG represents graphs internally. For each edge, `edge_index` holds a pair of node indices: the first value is the index of the source node and the second value is the index of the destination node. This is known as the coordinate format (COO format), which is commonly used to represent sparse matrices. Instead of holding the adjacency information in a dense representation $\mathbf{A} \in \{ 0, 1 \}^{|\mathcal{V}| \times |\mathcal{V}|}$, PyG represents graphs sparsely, storing only the coordinates of the non-zero entries in $\mathbf{A}$.
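As a small illustration of the two representations, the sketch below converts a dense adjacency matrix into a COO-style pair of index lists and back, using plain Python lists rather than tensors. The functions `dense_to_coo` and `coo_to_dense` are hypothetical helpers written for this example, not PyG functions.

```python
def dense_to_coo(adj):
    """Return [sources, destinations] for every non-zero entry of adj."""
    sources, destinations = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                destinations.append(j)
    return [sources, destinations]


def coo_to_dense(edge_index, num_nodes):
    """Rebuild the dense 0/1 adjacency matrix from the COO index lists."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in zip(edge_index[0], edge_index[1]):
        adj[src][dst] = 1
    return adj


# Undirected path graph 0 - 1 - 2: each edge appears in both directions.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

edge_index_demo = dense_to_coo(A)
print(edge_index_demo)  # two rows: source indices, then destination indices
print(list(zip(*edge_index_demo)))  # same data as (source, destination) pairs
```

Here the dense matrix stores 9 entries while the COO form stores only the 4 non-zero coordinates; for large sparse graphs the difference becomes much larger.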

In [None]:
from torch_geometric.utils import to_networkx

G = to_networkx(data, to_undirected=True)
visualize_graph(G, color=data.y)

### Implementing Graph Neural Networks

Now that we've covered a bit about PyG, we can implement a very simple Graph Neural Network.

For this, we'll use one of the simplest GNN operators, the **GCN layer**, which is defined as

$$
\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \sum_{w \in \mathcal{N}(v) \, \cup \, \{ v \}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(\ell)}
$$

where $\mathbf{W}^{(\ell + 1)}$ denotes a trainable weight matrix of shape `[num_output_features, num_input_features]` and $c_{w,v}$ refers to a fixed normalization coefficient for each edge.
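To make the update rule concrete, here is a dependency-free sketch of one GCN layer on a three-node path graph. It assumes the symmetric normalization $c_{w,v} = \sqrt{\deg(w)} \sqrt{\deg(v)}$, with degrees counted after adding self-loops, as in the original GCN formulation. The text above only says that the coefficient is fixed, so treat that choice, the function name `gcn_layer`, and the toy weights as assumptions. The sketch has no bias term and is not meant to reflect how `GCNConv` is implemented internally.

```python
import math


def gcn_layer(x, neighbors, weight):
    """Apply one GCN update to every node.

    x:         list of input feature vectors, one per node
    neighbors: dict mapping node -> list of neighboring nodes (no self-loops)
    weight:    matrix of shape [num_output_features, num_input_features]
    """
    # Degree including the self-loop added by summing over N(v) and v itself.
    deg = {v: len(neighbors[v]) + 1 for v in neighbors}
    num_in = len(x[0])
    out = []
    for v in sorted(neighbors):
        # Normalized sum over the neighborhood of v, including v itself.
        agg = [0.0] * num_in
        for w in neighbors[v] + [v]:
            c = math.sqrt(deg[w]) * math.sqrt(deg[v])
            for k in range(num_in):
                agg[k] += x[w][k] / c
        # Multiply the aggregated vector by the weight matrix.
        out.append([sum(row[k] * agg[k] for k in range(num_in)) for row in weight])
    return out


# Path graph 0 - 1 - 2 with 2 input features per node.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 1.0]]  # 1 output feature, 2 input features

x_next = gcn_layer(x, neighbors, W)
print(x_next)
```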

In PyG, this layer is available as `GCNConv`, which is executed by passing in the node feature representation `x` and the COO graph connectivity representation `edge_index`.

In [None]:
import torch
from torch.nn import Linear
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(1234)
        self.conv1 = GCNConv(dataset.num_features, 4)
        self.conv2 = GCNConv(4, 4)
        self.conv3 = GCNConv(4, 2)
        self.classifier = Linear(2, dataset.num_classes)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index)
        h = h.tanh()
        h = self.conv2(h, edge_index)
        h = h.tanh()
        h = self.conv3(h, edge_index)
        h = h.tanh()  # Final GNN embedding space.
        
        # Apply a final (linear) classifier.
        out = self.classifier(h)

        return out, h

model = GCN()
print(model)

In the code above, we initialize all of the building blocks in `__init__` and define the computation flow in `forward`. Three `GCNConv` layers, each followed by a tanh non-linearity, reduce the node feature dimensionality from `dataset.num_features` to 4, then 4, then 2. A final `torch.nn.Linear` layer then maps each node's 2-dimensional embedding to one of the 4 classes/communities. The model returns both the output of the final classifier and the final node embeddings produced by the GNN. We then instantiate the model via `GCN()`; printing it produces a summary of all its sub-modules.

### Embedding the Karate Club Network

Let's take a look at node embeddings produced by our GNN. We start off by passing features `x` and the graph connectivity information `edge_index` to the model, and visualize its 2-dimensional embedding.

In [None]:
model = GCN()

_, h = model(data.x, data.edge_index)
print(f'Embedding shape: {list(h.shape)}')

visualize_embedding(h, color=data.y)

### Training on the Karate Club Network
Since everything in the network is differentiable and parameterized, we can add some labels, train the model and observe how the embeddings react. We simply train against one node per class, but are allowed to make use of the complete input graph data.

We use CrossEntropyLoss here and we initialize Adam as the stochastic gradient optimizer. 

While we compute node embeddings for all nodes, we only use the training nodes for computing loss. This is implemented by filtering the output of the classifier `out` and ground-truth labels `data.y` to only contain the nodes in `train_mask`.
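The masking step can be sketched without PyTorch. The example below computes a mean cross-entropy loss over only the masked nodes, using made-up logits, labels, and mask. `masked_cross_entropy` is a hypothetical helper; it is intended to mirror the effect of indexing `out` and `data.y` with `train_mask` before applying cross-entropy with mean reduction.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the nodes where mask is True."""
    losses = []
    for node_logits, label, keep in zip(logits, labels, mask):
        if not keep:
            continue  # Unlabeled nodes do not contribute to the loss.
        # Numerically stable log-sum-exp, then negative log-probability
        # of the true class.
        m = max(node_logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in node_logits))
        losses.append(log_sum_exp - node_logits[label])
    return sum(losses) / len(losses)


# Made-up classifier outputs for 4 nodes and 2 classes.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0], [5.0, -5.0]]
labels = [0, 1, 1, 1]
train_mask = [True, False, True, False]  # only nodes 0 and 2 are labeled

loss_demo = masked_cross_entropy(logits, labels, train_mask)
print(f"Masked loss: {loss_demo:.4f}")
```

Node 3 is badly misclassified in this toy data, but because it is outside the mask it has no influence on the loss.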

In [None]:
import time
from IPython.display import Javascript, display

# Restrict the height of the output cell (only has an effect in Google Colab).
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 430})'''))

model = GCN()
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out, h = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss, h

for epoch in range(401):
    loss, h = train(data)
    if epoch % 10 == 0:
        visualize_embedding(h, color=data.y, epoch=epoch, loss=loss)
        time.sleep(0.3)

We can see that our 3-layer GCN model manages to linearly separate the communities and classify most of the nodes correctly.
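One way to quantify "most of the nodes" is to compare each node's predicted class (the index of its largest classifier output) with its true label. The sketch below does this in plain Python on made-up values, not on results from the model above; with the notebook's tensors the analogous steps would be `out.argmax(dim=1)` followed by a comparison with `data.y`. The helper name `accuracy` is hypothetical.

```python
def accuracy(logits, labels):
    """Fraction of nodes whose highest-scoring class matches the label."""
    correct = 0
    for node_logits, label in zip(logits, labels):
        predicted = max(range(len(node_logits)), key=node_logits.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Made-up outputs for 4 nodes and 3 classes (not results from the model above).
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 2, 1]

acc_demo = accuracy(toy_logits, toy_labels)
print(f"Accuracy: {acc_demo:.2f}")
```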