# SNA - Project in Google Colab

## Colab Preparation

**Keep Alive**

Google Colab tends to disconnect you during long training runs. This might help: https://medium.com/@shivamrawat_756/how-to-prevent-google-colab-from-disconnecting-717b88a128c0

**Get Started**

Run the following cells to mount Google Drive and install the required Python packages. PyTorch comes pre-installed.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/sna/

/content/drive/MyDrive/sna


In [3]:
!ls

archive		     graph.pkl				     GraphSage-LinkPrediction.ipynb
data		     GraphSage_LinkPrediction21.ipynb	     node2vec_colab.ipynb
exploratory	     GraphSage_LinkPrediction22-Copy1.ipynb  node2vec.ipynb
graph_directed2.pkl  GraphSage_LinkPrediction22.ipynb	     README.md
graph_directed.pkl   GraphSage-LinkPrediction2.ipynb	     similarity_based.ipynb


#### Install libraries

In [4]:
!pip install  dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html

Looking in links: https://data.dgl.ai/wheels/torch-2.3/repo.html
Collecting dgl
  Downloading https://data.dgl.ai/wheels/torch-2.3/dgl-2.4.0-cp310-cp310-manylinux1_x86_64.whl (6.8 MB)
Collecting torch<=2.4.0 (from dgl)
  Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch<=2.4.0->dgl)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch<=2.4.0->dgl)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch<=2.4.0->dgl)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<=2.4.0->dgl)
  Downloading nvidia_cudn

In [5]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools
import numpy as np
import scipy.sparse as sp
import pickle

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


### 1. Link prediction with GNN
Assume we are given a graph g with incomplete data, for example, only 50% of the edges are present.

The goal is to predict **whether there is an edge** between any 2 nodes in g.

In [24]:
import pickle

# Load the graph
with open('graph_directed.pkl', 'rb') as f:
    g = pickle.load(f)

# # Convert to DGL graph if necessary
# if not isinstance(g, dgl.DGLGraph):
#     g = dgl.from_networkx(g)
g = dgl.from_networkx(g, edge_attrs=['count'])

# Verify the loaded graph
print(f"Number of nodes: {g.number_of_nodes()}")
print(f"Number of edges: {g.number_of_edges()}")
print("Edge data keys:", g.edata.keys())


Number of nodes: 10964
Number of edges: 1058868
Edge data keys: dict_keys(['count'])


Transforming Edge features into Node Features

In [25]:
import torch

# Initialize node features as zeros
num_nodes = g.num_nodes()
node_features = torch.zeros((num_nodes, 1))  # One-dimensional node features

# Aggregate edge counts into node features (vectorized). Each node receives
# the summed counts of both its outgoing and its incoming edges.
src, dst = g.edges()
counts = g.edata['count'].float().view(-1, 1)
node_features.index_add_(0, src, counts)
node_features.index_add_(0, dst, counts)

# Assign aggregated features to nodes
g.ndata['feat'] = node_features
print("Node features (aggregated edge counts):\n", g.ndata['feat'])

Node features (aggregated edge counts):
 tensor([[ 6.],
        [36.],
        [26.],
        ...,
        [ 2.],
        [ 4.],
        [16.]])
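
The aggregation step above sums, for each node, the counts on every edge that touches it. As a quick sanity check, here is a minimal NumPy-only sketch of the same idea on a hypothetical three-node toy graph. The arrays `src`, `dst`, and `count` are made up for illustration and are not taken from the project data.

```python
import numpy as np

# Hypothetical toy graph: 3 nodes, 3 directed edges with interaction counts.
src = np.array([0, 0, 1])
dst = np.array([1, 2, 2])
count = np.array([2.0, 1.0, 4.0])

num_nodes = 3
totals = np.zeros(num_nodes)

# np.add.at accumulates correctly even when a node index repeats.
np.add.at(totals, src, count)  # counts on outgoing edges
np.add.at(totals, dst, count)  # counts on incoming edges

# One-dimensional node features, shape (num_nodes, 1).
node_features = totals.reshape(-1, 1)
print(node_features.ravel())  # [3. 6. 5.]
```

Node 1, for example, ends up with 6 because it has one incoming edge with count 2 and one outgoing edge with count 4.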


Train/test split and obtaining positive edges

In [26]:
np.random.seed(42)  # For reproducibility
edge_ids = np.arange(g.num_edges())
edge_ids = np.random.permutation(edge_ids)

train_size = int(0.8 * len(edge_ids))
train_mask = edge_ids[:train_size]
test_mask = edge_ids[train_size:]

# Remove test edges to create g_main
g_main = dgl.remove_edges(g, test_mask)

# Extract edges
u, v = g.edges()

# Positive edges for training and testing
train_pos_u, train_pos_v = u[train_mask], v[train_mask]
test_pos_u, test_pos_v = u[test_mask], v[test_mask]
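
To make the split logic concrete, the following small sketch shuffles a hypothetical set of ten edge IDs and checks that the 80/20 partition is disjoint and complete. It uses NumPy's `default_rng` rather than the global seed used above, purely as an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility
num_edges = 10                   # hypothetical toy edge count
edge_ids = rng.permutation(num_edges)

train_size = int(0.8 * num_edges)
train_ids = edge_ids[:train_size]   # indices of edges kept for training
test_ids = edge_ids[train_size:]    # indices of held-out edges

# The two index sets should not overlap and should cover every edge.
is_disjoint = set(train_ids.tolist()).isdisjoint(test_ids.tolist())
covers_all = sorted(np.concatenate([train_ids, test_ids]).tolist()) == list(range(num_edges))
print(len(train_ids), len(test_ids), is_disjoint, covers_all)
```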

Generation of negative edges

In [27]:
# Create adjacency matrix (explicit shape so isolated high-index nodes are kept)
n = g.num_nodes()
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())), shape=(n, n))
adj = adj.todense() + np.eye(n)  # Mark self-loops so they are not sampled as negatives

# Get non-edges
u_neg, v_neg = np.where(adj == 0)

# Sample negative edges
neg_ids = np.random.choice(len(u_neg), len(u))
train_neg_u, train_neg_v = u_neg[neg_ids[:train_size]], v_neg[neg_ids[:train_size]]
test_neg_u, test_neg_v = u_neg[neg_ids[train_size:]], v_neg[neg_ids[train_size:]]
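
The negative sampling above can be illustrated on a tiny graph. This is a minimal sketch under assumed toy data (four nodes, three edges). It follows the same recipe: mark existing edges and self-loops in a dense adjacency matrix, collect the remaining zero entries, and sample from them with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph with 4 nodes and 3 directed edges.
num_nodes = 4
u = np.array([0, 1, 2])
v = np.array([1, 2, 3])

# Dense adjacency with the diagonal filled so (i, i) is never a candidate.
adj = np.zeros((num_nodes, num_nodes))
adj[u, v] = 1
adj += np.eye(num_nodes)

# Every remaining zero entry is a candidate negative edge.
u_neg, v_neg = np.where(adj == 0)

# Draw as many negatives as there are positives (sampling with replacement).
neg_ids = rng.choice(len(u_neg), size=len(u))
neg_u, neg_v = u_neg[neg_ids], v_neg[neg_ids]
print(list(zip(neg_u.tolist(), neg_v.tolist())))
```

Note that a dense matrix needs memory quadratic in the number of nodes, which is acceptable for roughly 11,000 nodes but does not scale to much larger graphs.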

Train/test graph creation

In [28]:
# # Training graphs
# g_train_pos = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.num_nodes())
# g_train_neg = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.num_nodes())

# # Testing graphs
# g_test_pos = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.num_nodes())
# g_test_neg = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.num_nodes())

# Training graphs
g_train_pos = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.num_nodes())
g_train_pos.edata['count'] = g.edata['count'][train_mask]  # Copy 'count' attribute

g_train_neg = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.num_nodes())
g_train_neg.edata['count'] = torch.zeros(train_neg_u.shape[0])  # No count for negative edges

# Testing graphs
g_test_pos = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.num_nodes())
g_test_pos.edata['count'] = g.edata['count'][test_mask]  # Copy 'count' attribute

g_test_neg = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.num_nodes())
g_test_neg.edata['count'] = torch.zeros(test_neg_u.shape[0])  # No count for negative edges


### 2. GNN with SageConv
With the mean aggregator, dgl.nn.SAGEConv(in_dim, out_dim, aggregator_type="mean") updates the node representations as follows:

\begin{align*}
h_i^{(l+1)} &= W \cdot \text{concat}\left(h_i^{(l)}, h_{N(i)}^{(l+1)}\right) + b, \quad \text{with} \\
h_{N(i)}^{(l+1)} &= \text{Mean}\left\{h_j^{(l)},\ j \in N(i)\right\}
\end{align*}
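
The update rule above can be sketched in plain NumPy. This is only an illustration of the formula, not DGL's actual implementation. The feature matrix `h`, the `neighbors` dictionary, and the randomly initialized `W` and `b` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy inputs: 3 nodes with 2-dimensional features.
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
# neighbors[i] lists the nodes j in N(i).
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}

in_dim, out_dim = 2, 4
W = rng.normal(size=(out_dim, 2 * in_dim))  # acts on concat(h_i, h_N(i))
b = np.zeros(out_dim)

# Mean aggregation over each node's neighborhood.
h_neigh = np.stack([h[neighbors[i]].mean(axis=0) for i in range(len(h))])

# Concatenate self and neighborhood representations, then apply the linear map.
h_next = np.concatenate([h, h_neigh], axis=1) @ W.T + b
print(h_neigh)
print(h_next.shape)
```

For node 0, the neighborhood mean is the average of the features of nodes 1 and 2, that is, (1.0, 1.5).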

Here is our **model structure**
<center>
input -> SAGEConv1 -> relu -> SAGEConv2 -> predictor
</center>


In [39]:
# slightly changed architecture
import dgl.function as fn
from dgl.nn import SAGEConv

from sklearn.metrics import roc_auc_score # for computing auc metric

class GraphSage(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim, aggregator_type="mean")
        self.conv2 = SAGEConv(hidden_dim, hidden_dim, aggregator_type="mean")

    def forward(self, g, features):
        h = self.conv1(g, features)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

    def predict(self, g, h):
        with g.local_scope():
            g.ndata['h'] = h
            g.apply_edges(fn.u_dot_v('h', 'h', 'score'))  # Compute edge scores
            return g.edata['score'].squeeze()  # Ensure the result is 1D


    def loss(self, pos_scores, neg_scores):
        # Combine positive and negative scores
        scores = torch.cat([pos_scores, neg_scores])
        # Create labels: 1 for positive edges, 0 for negative edges
        labels = torch.cat([torch.ones(pos_scores.shape[0]), torch.zeros(neg_scores.shape[0])])
        # Compute binary cross-entropy loss
        return F.binary_cross_entropy_with_logits(scores, labels)

    def auc_score(self, pos_scores, neg_scores):
        scores = torch.cat([pos_scores, neg_scores]).detach().numpy()
        labels = torch.cat([torch.ones(pos_scores.shape[0]), torch.zeros(neg_scores.shape[0])]).detach().numpy()
        return roc_auc_score(labels, scores)
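
The `predict` and `loss` methods above score an edge by the dot product of its endpoint embeddings and then apply binary cross-entropy to the raw scores. The following NumPy sketch mirrors that logic on hypothetical embeddings. The helper names `dot_scores` and `bce_with_logits` are illustrative and do not appear in the notebook.

```python
import numpy as np

def dot_scores(h, src, dst):
    """Score each edge (src[k], dst[k]) by the dot product of its endpoint embeddings."""
    return np.sum(h[src] * h[dst], axis=1)

def bce_with_logits(scores, labels):
    """Numerically stable mean binary cross-entropy on raw scores (logits)."""
    return np.mean(np.maximum(scores, 0) - scores * labels
                   + np.log1p(np.exp(-np.abs(scores))))

# Hypothetical embeddings for 3 nodes.
h = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
pos = dot_scores(h, np.array([0]), np.array([1]))  # existing edge 0 -> 1
neg = dot_scores(h, np.array([0]), np.array([2]))  # non-edge 0 -> 2

scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
loss = bce_with_logits(scores, labels)
print(pos, neg, loss)
```

Because dot products are unbounded, large unnormalized node features can produce very large scores and, in turn, very large loss values.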

### 3. Train and Test

In [47]:
from sklearn.metrics import accuracy_score

def train(model, g_main, g_train_pos, g_train_neg, optimizer):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    h = model(g_main, g_main.ndata['feat'])

    # Predict scores for positive and negative training edges
    pos_scores = model.predict(g_train_pos, h)
    neg_scores = model.predict(g_train_neg, h)

    # Calculate loss
    loss = model.loss(pos_scores, neg_scores)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    # Combine scores and labels for AUC and accuracy
    scores = torch.cat([pos_scores, neg_scores]).detach().numpy()
    labels = torch.cat([torch.ones(pos_scores.shape[0]), torch.zeros(neg_scores.shape[0])]).detach().numpy()

    # Compute ROC AUC
    auc = roc_auc_score(labels, scores)

    # Compute accuracy (note: the 0.5 threshold is applied to the raw scores, not to sigmoid probabilities)
    predicted_labels = (scores >= 0.5).astype(int)
    accuracy = accuracy_score(labels, predicted_labels)

    return loss.item(), auc, accuracy


In [50]:
from sklearn.metrics import accuracy_score

@torch.no_grad()
def evaluate(model, g_main, g_test_pos, g_test_neg):
    model.eval()
    h = model(g_main, g_main.ndata['feat'])

    # Predict scores for positive and negative test edges
    pos_scores = model.predict(g_test_pos, h)
    neg_scores = model.predict(g_test_neg, h)

    # Calculate loss
    loss = model.loss(pos_scores, neg_scores)

    # Combine scores and labels for AUC and accuracy
    scores = torch.cat([pos_scores, neg_scores]).detach().numpy()
    labels = torch.cat([torch.ones(pos_scores.shape[0]), torch.zeros(neg_scores.shape[0])]).detach().numpy()

    # Compute ROC AUC
    auc = roc_auc_score(labels, scores)

    # Compute accuracy (note: the 0.5 threshold is applied to the raw scores, not to sigmoid probabilities)
    predicted_labels = (scores >= 0.5).astype(int)
    accuracy = accuracy_score(labels, predicted_labels)

    return loss.item(), auc, accuracy
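
Since the loss treats the scores as logits, it may be worth checking how the accuracy threshold behaves. The sketch below uses made-up scores and labels. It shows that thresholding probabilities at 0.5 equals thresholding logits at 0, whereas thresholding the raw logits at 0.5 is slightly stricter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores (logits) and ground-truth labels.
scores = np.array([2.0, 0.3, -0.2, -1.5])
labels = np.array([1, 1, 0, 0])

# Thresholding probabilities at 0.5 is the same as thresholding logits at 0.
pred_from_probs = (sigmoid(scores) >= 0.5).astype(int)
pred_from_logits = (scores >= 0.0).astype(int)

# Thresholding raw logits at 0.5 corresponds to p >= sigmoid(0.5), about 0.62.
pred_raw_half = (scores >= 0.5).astype(int)

accuracy_probs = (pred_from_probs == labels).mean()
accuracy_raw = (pred_raw_half == labels).mean()
print(accuracy_probs, accuracy_raw)  # 1.0 0.75
```

AUC is unaffected by this choice because it depends only on the ranking of the scores.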

In [51]:
# new hyperparams and initialization of the model
from torch.optim import Adam

# initialize model and optimizer
in_dim = g_main.ndata['feat'].shape[1]
hidden_dim = 16
model = GraphSage(in_dim, hidden_dim)
optimizer = Adam(model.parameters(), lr=0.01)

In [52]:
print("Edge data in g_main:", g_main.edata.keys())
print("Edge data in g_train_pos:", g_train_pos.edata.keys())
print("Edge data in g_train_neg:", g_train_neg.edata.keys())


Edge data in g_main: dict_keys(['count'])
Edge data in g_train_pos: dict_keys(['count'])
Edge data in g_train_neg: dict_keys(['count'])


In [53]:
# new training loop
num_epochs = 50
for epoch in range(num_epochs):
    train_loss, train_auc, train_accuracy = train(model, g_main, g_train_pos, g_train_neg, optimizer)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {train_loss:.4f}, AUC: {train_auc:.4f}, Accuracy: {train_accuracy:.4f}")

Epoch 1/50, Loss: 3974002.7500, AUC: 0.6718, Accuracy: 0.5006
Epoch 2/50, Loss: 3190918.7500, AUC: 0.6815, Accuracy: 0.5013
Epoch 3/50, Loss: 2540643.5000, AUC: 0.6943, Accuracy: 0.5033
Epoch 4/50, Loss: 2007348.5000, AUC: 0.7097, Accuracy: 0.5068
Epoch 5/50, Loss: 1574263.1250, AUC: 0.7261, Accuracy: 0.5087
Epoch 6/50, Loss: 1226983.1250, AUC: 0.7413, Accuracy: 0.5094
Epoch 7/50, Loss: 951711.5625, AUC: 0.7541, Accuracy: 0.5097
Epoch 8/50, Loss: 734233.4375, AUC: 0.7643, Accuracy: 0.5099
Epoch 9/50, Loss: 563588.8750, AUC: 0.7730, Accuracy: 0.5101
Epoch 10/50, Loss: 430882.0938, AUC: 0.7807, Accuracy: 0.5096
Epoch 11/50, Loss: 328680.3438, AUC: 0.7867, Accuracy: 0.5091
Epoch 12/50, Loss: 251015.7188, AUC: 0.7907, Accuracy: 0.5092
Epoch 13/50, Loss: 193044.0781, AUC: 0.7926, Accuracy: 0.5114
Epoch 14/50, Loss: 150632.8438, AUC: 0.7927, Accuracy: 0.5168
Epoch 15/50, Loss: 120352.7266, AUC: 0.7914, Accuracy: 0.5259
Epoch 16/50, Loss: 99347.0469, AUC: 0.7893, Accuracy: 0.5379
Epoch 17/50,

In [54]:
# Evaluate on the test graph
with torch.no_grad():
    test_loss, test_auc, test_accuracy = evaluate(model, g_main, g_test_pos, g_test_neg)
    print(f"Test Loss: {test_loss:.4f} | Test AUC: {test_auc:.4f} | Test Accuracy: {test_accuracy:.4f}")

Test Loss: 34780.7109 | Test AUC: 0.7969 | Test Accuracy: 0.6038


#### Save the model

In [55]:
# Save the model state
torch.save(model.state_dict(), 'graphsage_model.pth')
print("Model saved as 'graphsage_model.pth'")

Model saved as 'graphsage_model.pth'
