<a href="https://colab.research.google.com/github/przemekkubiak/graph-fake-news-detection/blob/main/graph_networks_upfd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [6]:
!pip install torch



In [None]:
import torch

!pip uninstall torch-geometric -y
!pip install git+https://github.com/pyg-team/pytorch_geometric.git

Collecting git+https://github.com/pyg-team/pytorch_geometric.git
  Cloning https://github.com/pyg-team/pytorch_geometric.git to /tmp/pip-req-build-ncp8p6e_
  Running command git clone --filter=blob:none --quiet https://github.com/pyg-team/pytorch_geometric.git /tmp/pip-req-build-ncp8p6e_
  Resolved https://github.com/pyg-team/pytorch_geometric.git to commit 13d819819aa49e5661bee54258b233676bf25634
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: torch_geometric
  Building wheel for torch_geometric (pyproject.toml) ... done
  Created wheel for torch_geometric: filename=torch_geometric-2.6.0-py3-none-any.whl size=1100960 sha256=143dc32144ae8e0fb0929a9d2cf608b727653514e1a80ba8386f4f8c3db446cf
  Stored in directory: /tmp/pip-ephem-wheel-cache-g35arxfz/wheels/d3/78/eb/9e26525b948d19533f1688fb6c209cec8a0ba793d39b49ae

In [None]:
!nvcc --version

In [14]:
import numpy as np
import torch.nn as nn
from torch.nn import Linear
from torch_geometric.datasets import UPFD
from torch_geometric.loader import DataLoader
import torch.nn.functional as F
import argparse
import torch_geometric
from torch_geometric.transforms import ToUndirected, NormalizeFeatures
from torch_geometric.nn import GATConv, GCNConv, SAGEConv, global_max_pool

The following code connects to Google Drive and mounts it at a specific directory. This can be used to load datasets not included in Python libraries and to export data or metadata.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
! ls

gdrive	sample_data


In [None]:
%cd gdrive/MyDrive/project_folder

/content/gdrive/MyDrive/project_folder


# Load Data

In this example, I use the dataset proposed by [Dou et al. (2021)](https://arxiv.org/abs/2104.12259). It is a benchmark dataset built in line with the authors' proposed framework, which accounts for social information such as users' preferences. See the paper for details.

In [9]:
dataset = UPFD(root='data/', name = 'politifact', feature= 'spacy')

# Get the first graph object from the dataset.
data = dataset[0]

# Check the basic statistics of the graph.

print()
print(data)
print('===========================================================================================================')

print()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')


Data(x=[72, 300], edge_index=[2, 71], y=[1])

Dataset: UPFD(62, name=politifact, feature=spacy):
Number of graphs: 62
Number of features: 300
Number of classes: 2
Number of nodes: 72
Number of edges: 71
Average node degree: 0.99


Load the training, validation, and test datasets. I use transform = ToUndirected() to enable the GNN to pass messages in both directions.

In [10]:
train_dataset = UPFD(root='data/', name = 'politifact', feature= 'spacy', split = 'train',transform = ToUndirected())
val_dataset = UPFD(root='data/', name = 'politifact', feature= 'spacy', split = 'val', transform = ToUndirected())
test_dataset = UPFD(root='data/', name = 'politifact', feature= 'spacy', split = 'test', transform = ToUndirected())
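
As a minimal sketch of what making a graph undirected means for the edge list, the snippet below adds the reverse of every edge and removes duplicates. Plain Python lists stand in for tensors, and `to_undirected_edges` is a hypothetical helper written for illustration; it is not PyG's implementation of ToUndirected().

```python
def to_undirected_edges(edge_index):
    """Return an edge list that contains every edge in both directions.

    `edge_index` is a pair of equal-length lists (sources, targets),
    mirroring the [2, num_edges] layout of PyG's `edge_index`.
    """
    sources, targets = edge_index
    pairs = set(zip(sources, targets))       # original directed edges
    pairs |= {(t, s) for s, t in pairs}      # add the reversed edges
    ordered = sorted(pairs)                  # deterministic ordering
    return [[s for s, _ in ordered], [t for _, t in ordered]]


# A toy propagation tree: node 0 (the root) points to nodes 1 and 2.
directed = [[0, 0], [1, 2]]
undirected = to_undirected_edges(directed)
print(undirected)  # [[0, 0, 1, 2], [1, 2, 0, 0]]
```

With both directions present, a message-passing layer can move information from the root to the leaves and back again.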

Utilise the DataLoader class for mini-batching of the graph data. PyG batches graphs by stacking them into one large disconnected graph whose adjacency matrix is block-diagonal and stored sparsely, holding only non-zero entries. This keeps the memory overhead low, so batches of graphs fit into GPU memory and the GPU is better utilised. It also means that operators reliant on message passing need no modification, as messages are never exchanged between nodes belonging to different graphs.

In [12]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

for step, data in enumerate(train_loader):
    print(f'Step {step + 1}:')
    print('=======')
    print(f'Number of graphs in the current batch: {data.num_graphs}')
    print(data)
    print()

Step 1:
Number of graphs in the current batch: 32
DataBatch(x=[3013, 300], edge_index=[2, 5962], y=[32], batch=[3013], ptr=[33])

Step 2:
Number of graphs in the current batch: 30
DataBatch(x=[3059, 300], edge_index=[2, 6058], y=[30], batch=[3059], ptr=[31])
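
The `batch` and `ptr` fields in the output above can be understood with a small sketch of the batching idea. It uses plain Python lists instead of tensors, and `collate_graphs` is a hypothetical helper; PyG's actual collation handles many more cases.

```python
def collate_graphs(graphs):
    """Merge small graphs into one large disconnected graph.

    Each graph is a dict with `num_nodes` (int) and `edge_index`
    (a pair of lists: sources, targets) that uses local node ids.
    """
    sources, targets, batch, ptr = [], [], [], [0]
    for graph_id, g in enumerate(graphs):
        offset = ptr[-1]                         # nodes placed so far
        src, dst = g["edge_index"]
        sources += [s + offset for s in src]     # shift local ids
        targets += [t + offset for t in dst]
        batch += [graph_id] * g["num_nodes"]     # node -> graph id
        ptr.append(offset + g["num_nodes"])      # cumulative node counts
    return {"edge_index": [sources, targets], "batch": batch, "ptr": ptr}


g1 = {"num_nodes": 3, "edge_index": [[0, 0], [1, 2]]}
g2 = {"num_nodes": 2, "edge_index": [[0], [1]]}
merged = collate_graphs([g1, g2])
print(merged["edge_index"])  # [[0, 0, 3], [1, 2, 4]]
print(merged["batch"])       # [0, 0, 0, 1, 1]
print(merged["ptr"])         # [0, 3, 5]
```

Because node ids are shifted per graph, no edge ever connects two different graphs. `ptr` has one more entry than there are graphs, which matches `ptr=[33]` for the batch of 32 graphs above. The edge counts above are also consistent with tree-shaped graphs made undirected: 5962 = 2 × (3013 − 32) and 6058 = 2 × (3059 − 30).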



# Training

First, define a network class for computation. Then, define training and testing functions. Both are adapted, with slight modifications, from PyG's source code (see [Fey and Lenssen 2019](https://arxiv.org/abs/1903.02428) and the [GitHub repository](https://github.com/pyg-team/pytorch_geometric?tab=readme-ov-file)).

In [15]:
class Net(torch.nn.Module):
    def __init__(self, model, in_channels, hidden_channels, out_channels,
                 concat=False):
        super().__init__()
        self.concat = concat

        if model == 'GCN':
            self.conv1 = GCNConv(in_channels, hidden_channels)
        elif model == 'SAGE':
            self.conv1 = SAGEConv(in_channels, hidden_channels)
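        elif model == 'GAT':
            self.conv1 = GATConv(in_channels, hidden_channels)
        else:
            # Fail early instead of hitting a missing self.conv1 in forward().
            raise ValueError(f"Unknown model '{model}'; expected 'GCN', 'SAGE' or 'GAT'.")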

        if self.concat:
            self.lin0 = Linear(in_channels, hidden_channels)
            self.lin1 = Linear(2 * hidden_channels, hidden_channels)

        self.lin2 = Linear(hidden_channels, out_channels)

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = global_max_pool(h, batch)

        if self.concat:
            # Get the root node (tweet) features of each graph:
            root = (batch[1:] - batch[:-1]).nonzero(as_tuple=False).view(-1)
            root = torch.cat([root.new_zeros(1), root + 1], dim=0)
            news = x[root]

            news = self.lin0(news).relu()
            h = self.lin1(torch.cat([news, h], dim=-1)).relu()

        h = self.lin2(h)
        return h.log_softmax(dim=-1)


chosen_model = 'GCN'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net(chosen_model, train_dataset.num_features, 128,
            train_dataset.num_classes, concat=True).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)


def train():
    model.train()

    total_loss = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        out = model(data.x, data.edge_index, data.batch)
        loss = F.nll_loss(out, data.y)
        loss.backward()
        optimizer.step()
        total_loss += float(loss) * data.num_graphs

    return total_loss / len(train_loader.dataset)


@torch.no_grad()
def test(loader):
    model.eval()

    total_correct = total_examples = 0
    for data in loader:
        data = data.to(device)
        pred = model(data.x, data.edge_index, data.batch).argmax(dim=-1)
        total_correct += int((pred == data.y).sum())
        total_examples += data.num_graphs

    return total_correct / total_examples


for epoch in range(1, 101):
    loss = train()
    train_acc = test(train_loader)
    val_acc = test(val_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch:02d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, '
          f'Val: {val_acc:.4f}, Test: {test_acc:.4f}')

Epoch: 01, Loss: 0.6850, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 02, Loss: 0.6827, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 03, Loss: 0.6741, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 04, Loss: 0.6713, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 05, Loss: 0.6678, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 06, Loss: 0.6646, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 07, Loss: 0.6631, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 08, Loss: 0.6581, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 09, Loss: 0.6555, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 10, Loss: 0.6482, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 11, Loss: 0.6416, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 12, Loss: 0.6397, Train: 0.5806, Val: 0.4194, Test: 0.4887
Epoch: 13, Loss: 0.6293, Train: 0.5968, Val: 0.4194, Test: 0.4887
Epoch: 14, Loss: 0.6202, Train: 0.6290, Val: 0.4516, Test: 0.5249
Epoch: 15, Loss: 0.6060, Train: 0.7581, Val: 0.5806, Test: 0.6063
Epoch: 16,
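
Two steps in `forward` are easy to misread: the per-graph max pooling and the lookup of each graph's root node from the `batch` vector. The sketch below reproduces both on a toy example with plain Python lists. `root_indices` and `max_pool_per_graph` are hypothetical helpers, and the root lookup assumes, as the code in `forward` appears to, that the root is the first node of every graph.

```python
def root_indices(batch):
    """Index of the first node of every graph in a sorted `batch` vector."""
    # Positions where the graph id changes between consecutive nodes.
    boundaries = [i for i in range(len(batch) - 1) if batch[i + 1] != batch[i]]
    # The first graph starts at 0; each later graph starts right after a boundary.
    return [0] + [i + 1 for i in boundaries]


def max_pool_per_graph(h, batch):
    """Feature-wise maximum over the nodes that belong to each graph."""
    pooled = {}
    for row, graph_id in zip(h, batch):
        if graph_id in pooled:
            pooled[graph_id] = [max(a, b) for a, b in zip(pooled[graph_id], row)]
        else:
            pooled[graph_id] = list(row)
    return [pooled[g] for g in sorted(pooled)]


batch = [0, 0, 0, 1, 1]                  # three nodes in graph 0, two in graph 1
h = [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0], [7.0, 1.0], [6.0, 8.0]]
print(root_indices(batch))               # [0, 3]
print(max_pool_per_graph(h, batch))      # [[3.0, 5.0], [7.0, 8.0]]
```

Each graph ends up with one pooled vector and one root vector. With `concat=True`, the model joins the two before the final linear layers.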

In [16]:
test(test_loader)

0.8371040723981901
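
Both `train()` and `test()` above weight each batch by its number of graphs before dividing by the dataset size. This matters when the last batch is smaller, since `F.nll_loss` returns the mean over a batch by default. A minimal sketch follows, using the batch sizes 32 and 30 from the loader output and two hypothetical loss values chosen only for illustration.

```python
def weighted_mean(batch_means, batch_sizes):
    """Average per-batch means, weighting each batch by its size."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


batch_sizes = [32, 30]        # from the training loader output
batch_losses = [0.70, 0.60]   # hypothetical per-batch mean losses

epoch_loss = weighted_mean(batch_losses, batch_sizes)
naive_mean = sum(batch_losses) / len(batch_losses)
print(f'Weighted: {epoch_loss:.4f}, unweighted: {naive_mean:.4f}')
```

The weighted value is the mean loss per graph. The unweighted one would let the smaller batch count as much as the larger one.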

In [17]:
print(model.eval())

Net(
  (conv1): GCNConv(300, 128)
  (lin0): Linear(in_features=300, out_features=128, bias=True)
  (lin1): Linear(in_features=256, out_features=128, bias=True)
  (lin2): Linear(in_features=128, out_features=2, bias=True)
)
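
As a quick sanity check on the printed architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes that every layer holds one weight matrix plus one bias vector, which is my understanding of the default `GCNConv` and `Linear` configurations. `dense_params` is a hypothetical helper.

```python
def dense_params(in_features, out_features, bias=True):
    """Parameters of a layer with one weight matrix and an optional bias."""
    return in_features * out_features + (out_features if bias else 0)


# Layer shapes taken from the printed model summary.
layers = {
    'conv1': dense_params(300, 128),
    'lin0': dense_params(300, 128),
    'lin1': dense_params(256, 128),
    'lin2': dense_params(128, 2),
}

for name, count in layers.items():
    print(f'{name}: {count}')
print(f'Total: {sum(layers.values())}')  # Total: 110210
```

In the notebook itself, the same figure can be checked against `sum(p.numel() for p in model.parameters())`.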
