# Graph Neural Network
- **2024/25/1**: Önálló laboratórium 2 (Independent Laboratory 2)
- Sági Benedek
- **Advisor**: Unyi Dániel
- **Project and its goal**: We build one or more explainability algorithms for the decisions of a graph neural network; each algorithm explains why a given model made a particular decision.




In [None]:
!pip uninstall -y torch
!pip install torch==2.4.0

In [1]:
!pip install torch_geometric
!pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cpu.html

Looking in links: https://data.pyg.org/whl/torch-2.4.0+cpu.html


In [2]:
import torch

import torch_geometric

from torch_geometric import datasets

from torch_geometric.nn import GCNConv, SAGEConv, global_mean_pool
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.loader import DataLoader

from torch_geometric.loader import NeighborLoader
from torch_geometric.loader import LinkNeighborLoader
from torch_geometric.loader import LinkLoader

import torch_sparse
import torch_geometric.transforms as T

from torch_geometric.utils import negative_sampling
from torch_geometric.utils import train_test_split_edges

import numpy as np

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Steps:

## Phase 1
---
1. Downloading and preparing the data
PyTorch Geometric provides an API for downloading the selected datasets.
2. Reviewing the data
3. Building models for different tasks
  * Node-level classification
  * Graph-level classification
  * Link prediction

4. Analysing the results
---






### Downloading the data

In [4]:
# Reddit
# dataset_Reddit = datasets.Reddit(root='data/Reddit')
# EllipticBitcoinDataset
# dataset_EllipticBitcoinDataset = datasets.EllipticBitcoinDataset(root='data/EllipticBitcoinDataset')
# FacebookPagePage
dataset_FacebookPagePage = datasets.FacebookPagePage(root='data/FacebookPagePage')

Downloading https://graphmining.ai/datasets/ptg/facebook.npz
Processing...
Done!


### Reviewing the data

In [None]:
def dataset_info(dataset):
  print(f'Dataset: {dataset}:')
  print('======================')
  print(f'Number of graphs: {len(dataset)}')
  print(f'Number of features: {dataset.num_features}')
  print(f'Number of classes: {dataset.num_classes}')
  data = dataset[0]  # Get the first graph object.
  print(data)
  print(f'Number of nodes: {data.num_nodes}')
  print(f'Number of edges: {data.num_edges}')
  print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
  print(f'Has isolated nodes: {data.has_isolated_nodes()}')
  print(f'Has self-loops: {data.has_self_loops()}')
  print(f'Is undirected: {data.is_undirected()}')

In [None]:
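# Note: dataset_Reddit, dataset_UPFD and dataset_Airports must already be loaded; the download cell above only loads FacebookPagePage.
dataset_info(dataset_Reddit)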
dataset_info(dataset_UPFD)
dataset_info(dataset_Airports)

Dataset: Reddit():
Number of graphs: 1
Number of features: 602
Number of classes: 41
Data(x=[232965, 602], edge_index=[2, 114615892], y=[232965], train_mask=[232965], val_mask=[232965], test_mask=[232965])
Number of nodes: 232965
Number of edges: 114615892
Average node degree: 491.99
Has isolated nodes: False
Has self-loops: False
Is undirected: True
Dataset: UPFD(62, name=politifact, feature=profile):
Number of graphs: 62
Number of features: 10
Number of classes: 2
Data(x=[72, 10], edge_index=[2, 71], y=[1])
Number of nodes: 72
Number of edges: 71
Average node degree: 0.99
Has isolated nodes: False
Has self-loops: False
Is undirected: False
Dataset: EuropeAirports():
Number of graphs: 1
Number of features: 399
Number of classes: 4
Data(x=[399, 399], edge_index=[2, 5995], y=[399])
Number of nodes: 399
Number of edges: 5995
Average node degree: 15.03
Has isolated nodes: False
Has self-loops: True
Is undirected: False


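#### Conclusions
*   Reddit:
Nothing unusual in this dataset: it is a single undirected graph with no self-loops and no isolated nodes.
*   UPFD:
This dataset contains multiple graphs (62 in this split). The first graph is directed and, according to the output above, has no isolated nodes.
*   Airports:
This graph, in contrast, contains self-loops and is reported as directed.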
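
To make the three flags in the conclusions concrete, here is a minimal sketch in plain Python that derives them from an edge list. The helper name `graph_properties` and the toy graphs are hypothetical, and the isolated-node check ignores self-loops as an assumption about the intended meaning; the sketch is not claimed to reproduce the library's implementation.

```python
def graph_properties(num_nodes, edges):
    """Return basic structural flags for a directed edge list of (src, dst) pairs."""
    edge_set = set(edges)
    # Nodes that take part in at least one edge that is not a self-loop.
    touched = {n for s, d in edge_set if s != d for n in (s, d)}
    return {
        "has_isolated_nodes": len(touched) < num_nodes,
        "has_self_loops": any(s == d for s, d in edge_set),
        # Undirected here means: every edge is also stored in the reverse direction.
        "is_undirected": all((d, s) in edge_set for s, d in edge_set),
    }


# A 3-node path stored in both directions.
path = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(graph_properties(3, path))

# A directed star with one self-loop and an unused fourth node.
star = [(0, 1), (0, 2), (2, 2)]
print(graph_properties(4, star))
```

A tree stored in one direction only, such as the UPFD graphs with `n - 1` edges for `n` nodes, would come out as directed under this check.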



In [None]:
for graph_n in range(len(dataset_UPFD)):
  data = dataset_UPFD[graph_n]  # Get the graph at index graph_n.
  print(graph_n)
  print(data)
  print(f'Number of nodes: {data.num_nodes}')
  print(f'Number of edges: {data.num_edges}')
  print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
  print(f'Has isolated nodes: {data.has_isolated_nodes()}')
  print(f'Has self-loops: {data.has_self_loops()}')
  print(f'Is undirected: {data.is_undirected()}')

0
Data(x=[72, 10], edge_index=[2, 71], y=[1])
Number of nodes: 72
Number of edges: 71
Average node degree: 0.99
Has isolated nodes: False
Has self-loops: False
Is undirected: False
1
Data(x=[32, 10], edge_index=[2, 31], y=[1])
Number of nodes: 32
Number of edges: 31
Average node degree: 0.97
Has isolated nodes: False
Has self-loops: False
Is undirected: False
2
Data(x=[312, 10], edge_index=[2, 311], y=[1])
Number of nodes: 312
Number of edges: 311
Average node degree: 1.00
Has isolated nodes: False
Has self-loops: False
Is undirected: False
3
Data(x=[271, 10], edge_index=[2, 270], y=[1])
Number of nodes: 271
Number of edges: 270
Average node degree: 1.00
Has isolated nodes: False
Has self-loops: False
Is undirected: False
4
Data(x=[34, 10], edge_index=[2, 33], y=[1])
Number of nodes: 34
Number of edges: 33
Average node degree: 0.97
Has isolated nodes: False
Has self-loops: False
Is undirected: False
5
Data(x=[60, 10], edge_index=[2, 59], y=[1])
Number of nodes: 60
Number of edges: 59
...

### Building the models

#### Node-level classification
- Graph convolutional network
I took the base code from here: https://colab.research.google.com/drive/14OvFnAXggxB8vM4e8vSURUp1TaKnovzX?usp=sharing#scrollTo=fmXWs1dKIzD8
- I extended it based on: https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#convolutional-layers
- I used this to understand how it works: https://tkipf.github.io/graph-convolutional-networks/
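
As a rough illustration of what a single graph convolution computes, the sketch below applies the propagation rule described in the last link, `relu(D^-1/2 (A + I) D^-1/2 X W)`, to a tiny graph using nested lists. The function name `gcn_layer` and the toy inputs are hypothetical; bias terms and sparse operations are left out, so this is only a teaching sketch and not the `GCNConv` implementation.

```python
import math


def gcn_layer(adj, features, weight):
    """One GCN propagation step on dense nested lists: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    in_dim, out_dim = len(weight), len(weight[0])
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation by the square roots of both endpoint degrees.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate the (normalised) neighbour features.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(in_dim)]
           for i in range(n)]
    # Apply the linear map, then the ReLU non-linearity.
    out = [[sum(agg[i][f] * weight[f][o] for f in range(in_dim)) for o in range(out_dim)]
           for i in range(n)]
    return [[max(0.0, v) for v in row] for row in out]


# Path graph 0 - 1 - 2 with one scalar feature per node and an identity weight.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]
weight = [[1.0]]
hidden = gcn_layer(adj, features, weight)
print(hidden)
```

Each output row is a degree-weighted average of a node's own feature and its neighbours' features, which is why stacking two such layers lets information travel two hops.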

In [5]:
class GCN1(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN1, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

In [4]:
class GCN2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN2, self).__init__()
        self.conv1 = SAGEConv(input_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, hidden_dim)
        self.conv4 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = self.conv3(x, edge_index)
        x = F.relu(x)
        x = self.conv4(x, edge_index)
        return F.log_softmax(x, dim=1)

#### Link prediction

In [49]:
class LinkPredictor(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(LinkPredictor, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.fc = torch.nn.Linear(hidden_dim * 2, 1)

    def encode(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

    def decode(self, z, edge_index):
        src, dst = edge_index
        z_src, z_dst = z[src], z[dst]
        return torch.sigmoid(self.fc(torch.cat([z_src, z_dst], dim=1)))

    def forward(self, data):
        z = self.encode(data.x, data.edge_index)
        return self.decode(z, data.edge_index)

#### Graph-level classification

In [None]:
class GCNGraphClassifier(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCNGraphClassifier, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.fc = torch.nn.Linear(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        x = global_mean_pool(x, batch)  # Graph-level pooling
        return F.log_softmax(self.fc(x), dim=1)
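
The graph-level pooling step in the classifier above reduces a variable number of node embeddings to one vector per graph. A minimal plain-Python sketch of mean pooling follows; the name `mean_pool` and the toy values are hypothetical, and the sketch assumes that every graph index in `batch` owns at least one node.

```python
def mean_pool(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is the graph index of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for emb, g in zip(node_embeddings, batch):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += emb[d]
    # Divide each per-graph sum by the number of nodes in that graph.
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


# Three nodes in a mini-batch: the first two belong to graph 0, the last to graph 1.
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
batch = [0, 0, 1]
pooled = mean_pool(embeddings, batch)
print(pooled)
```

Because the result has one row per graph regardless of graph size, a single linear layer can classify graphs with different node counts.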


### Results and analysis

In [8]:
batch_size = 128
num_neighbors = [10,10]
lr = 1e-3
hidden_dim = 64

#### Node-level classification

In [50]:
def node_train_model(train_loader, test_loader, optimizer, model, device):
  def train():
      model.train()
      total_loss = 0
      for batch in train_loader:
          batch = batch.to(device)
          optimizer.zero_grad()
          out = model(batch.x, batch.edge_index)
          # The models already return log-probabilities (log_softmax), so NLL loss is used.
          # Only the first batch.batch_size rows are the seed nodes of this mini-batch.
          loss = F.nll_loss(out[:batch.batch_size], batch.y[:batch.batch_size])
          loss.backward()
          optimizer.step()
          total_loss += loss.item()
      return total_loss / len(train_loader)
  @torch.no_grad()  # No gradients are needed during evaluation.
  def test():
      model.eval()
      correct = 0
      total = 0  # Total number of evaluated test nodes
      for batch in test_loader:
          batch = batch.to(device)
          out = model(batch.x, batch.edge_index)
          # Score only the seed nodes; the sampled neighbours are context, not test samples.
          pred = out[:batch.batch_size].argmax(dim=1)
          correct += pred.eq(batch.y[:batch.batch_size]).sum().item()
          total += batch.batch_size
      return correct / total
  for epoch in range(10):
    loss = train()
    acc = test()
    # Report the metrics of this epoch
    print(f'Epoch {epoch}, Loss: {loss:.4f}, Test Accuracy: {acc:.4f}')
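
With neighbour sampling, each mini-batch holds the seed nodes first, followed by the neighbours that were sampled only to provide context, which is why the training loss above slices with `[:batch.batch_size]`. The sketch below shows, in plain Python, how the accuracy changes when only the seed nodes are scored. The helper name `seed_accuracy` and the numbers are hypothetical.

```python
def seed_accuracy(predictions, labels, batch_size):
    """Accuracy over the first `batch_size` entries only (the seed nodes)."""
    seed_pred = predictions[:batch_size]
    seed_true = labels[:batch_size]
    correct = sum(int(p == t) for p, t in zip(seed_pred, seed_true))
    return correct / batch_size


# Five nodes in the sampled subgraph; only the first three are seed nodes.
predictions = [1, 0, 2, 2, 1]
labels = [1, 0, 1, 0, 0]

print(seed_accuracy(predictions, labels, batch_size=3))  # seed nodes only
print(seed_accuracy(predictions, labels, batch_size=5))  # every node in the subgraph
```

Counting every node of the subgraph would mix in nodes that are not part of the test mask, so the two numbers generally differ.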

With the masks shipped with the dataset:

In [6]:

for dataset in [dataset_Reddit]:
  data = dataset[0]
  train_loader = NeighborLoader(data, num_neighbors=num_neighbors , batch_size=batch_size, input_nodes=data.train_mask)
  test_loader = NeighborLoader(data, num_neighbors=num_neighbors , batch_size=batch_size, input_nodes=data.test_mask)
  model = GCN2(input_dim=dataset.num_features, hidden_dim=hidden_dim, output_dim=dataset.num_classes).to(device)
  optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=5e-4)
  print(f"{dataset}:")
  node_train_model(train_loader, test_loader, optimizer, model, device)

Without predefined masks, creating them manually:

In [None]:
# Creating masks
for dataset in [dataset_Reddit]:
  data = dataset[0]
  num_nodes = data.num_nodes
  train_ratio = 0.8  # The remaining 20% of the nodes form the test set.

  # Random permutation of the node indices, split into a train and a test part.
  perm = torch.randperm(num_nodes)
  train_idx = perm[:int(num_nodes * train_ratio)]
  test_idx = perm[int(num_nodes * train_ratio):]

  # Turn the index lists into boolean masks over all nodes.
  data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
  data.train_mask[train_idx] = True
  data.test_mask = torch.zeros(num_nodes, dtype=torch.bool)
  data.test_mask[test_idx] = True

  train_loader = NeighborLoader(data, num_neighbors=num_neighbors , batch_size=batch_size, input_nodes=data.train_mask)
  test_loader = NeighborLoader(data, num_neighbors=num_neighbors , batch_size=batch_size, input_nodes=data.test_mask)
  model = GCN2(input_dim=dataset.num_features, hidden_dim=hidden_dim, output_dim=dataset.num_classes).to(device)
  optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=5e-4)
  print(f"{dataset}:")
  node_train_model(train_loader, test_loader, optimizer, model, device)
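
The idea behind the manual 80/20 split can also be checked in isolation. The following plain-Python sketch builds two boolean masks from one shuffled index list, so that every node ends up in exactly one of them. The function name `make_masks` and the fixed seed are hypothetical choices for the illustration.

```python
import random


def make_masks(num_nodes, train_ratio, seed=0):
    """Split node indices at random into disjoint boolean train and test masks."""
    perm = list(range(num_nodes))
    random.Random(seed).shuffle(perm)
    cut = int(num_nodes * train_ratio)
    train_mask = [False] * num_nodes
    test_mask = [False] * num_nodes
    for idx in perm[:cut]:
        train_mask[idx] = True
    for idx in perm[cut:]:
        test_mask[idx] = True
    return train_mask, test_mask


train_mask, test_mask = make_masks(num_nodes=10, train_ratio=0.8)
print(sum(train_mask), sum(test_mask))
```

A mask has one entry per node, whereas the permutation slices are index lists; mixing the two up is an easy mistake when building the split by hand.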

#### Link prediction

In [25]:
data = dataset_FacebookPagePage[0]
transform = RandomLinkSplit(is_undirected=True)
train_data, val_data, test_data = transform(data)

In [29]:
train_loader = LinkNeighborLoader(
    train_data,
    num_neighbors=num_neighbors,
    batch_size=batch_size,
    edge_label_index=train_data.edge_index,
    shuffle=True
)

test_loader = LinkNeighborLoader(
    test_data,
    num_neighbors=num_neighbors,
    batch_size=batch_size,
    edge_label_index=test_data.edge_index,
    shuffle=False
)

In [83]:
# First attempt: trains on positive edges only (all targets are 1).
# The next cell redefines this function with negative sampling.
def edge_train_model(train_loader, test_loader, optimizer, model, device):
  def train():
      total_loss = 0
      model.train()
      for batch in train_loader:
          batch = batch.to(device)
          optimizer.zero_grad()
          z = model.encode(batch.x, batch.edge_index)
          pred = model.decode(z, batch.edge_index).squeeze(-1)
          loss = F.binary_cross_entropy(pred, torch.ones_like(pred))
          loss.backward()
          optimizer.step()
          total_loss += loss.item()  # Accumulate; otherwise the reported mean loss is always 0.
      return total_loss / len(train_loader)
  @torch.no_grad()  # No gradients are needed during evaluation.
  def test():
    model.eval()
    correct = 0
    total = 0  # Total number of test edges
    for batch in test_loader:
        batch = batch.to(device)
        z = model.encode(batch.x, batch.edge_index)
        # Every edge in the batch is a positive example, so a score above 0.5 counts as correct.
        pred = model.decode(z, batch.edge_index).squeeze(-1)
        correct += (pred > 0.5).sum().item()
        total += pred.size(0)
    return correct / total
  for epoch in range(10):
    loss = train()
    acc = test()
    # Report the metrics of this epoch
    print(f'Epoch {epoch}, Loss: {loss:.4f}, Test Accuracy: {acc:.4f}')
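
A link predictor trained only on existing edges can lower its loss by answering "edge" for every pair, so node pairs without an edge are usually added as negative examples. Below is a minimal plain-Python sketch of uniform negative sampling by rejection. The name `sample_negative_edges` and the toy graph are hypothetical, and the sketch assumes the graph is sparse enough that the requested number of non-edges exists.

```python
import random


def sample_negative_edges(num_nodes, edges, num_samples, seed=0):
    """Uniformly sample distinct node pairs that are neither existing edges nor self-loops."""
    existing = set(edges)
    rng = random.Random(seed)
    negatives = set()
    # Rejection sampling: draw random pairs and keep those that are valid non-edges.
    # Assumption: at least num_samples such pairs exist, otherwise this loop never ends.
    while len(negatives) < num_samples:
        src, dst = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if src != dst and (src, dst) not in existing:
            negatives.add((src, dst))
    return sorted(negatives)


# Path 0 - 1 - 2 stored in both directions, plus an unconnected node 3.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
negatives = sample_negative_edges(num_nodes=4, edges=edges, num_samples=4)
print(negatives)
```

Sampling as many negatives as there are positives keeps the two classes balanced, which makes a 0.5 decision threshold a reasonable default.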

In [89]:
def edge_train_model(train_loader, test_loader, optimizer, model, device):
    def train():
        total_loss = 0
        model.train()
        for batch in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()

            # Encode the nodes of the sampled subgraph
            z = model.encode(batch.x, batch.edge_index)

            # Decode the positive edges
            pos_pred = model.decode(z, batch.edge_index).squeeze(-1)

            # Generate negative samples, as many as there are positive edges
            neg_edge_index = negative_sampling(
                edge_index=batch.edge_index, num_nodes=batch.x.size(0), num_neg_samples=batch.edge_index.size(1)
            )

            # Decode the negative edges
            neg_pred = model.decode(z, neg_edge_index).squeeze(-1)

            # Loss: positive edges should score 1, negative edges should score 0
            pos_loss = F.binary_cross_entropy(pos_pred, torch.ones_like(pos_pred))
            neg_loss = F.binary_cross_entropy(neg_pred, torch.zeros_like(neg_pred))

            # Total loss
            loss = pos_loss + neg_loss
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        return total_loss / len(train_loader)

    def test():
        model.eval()
        correct = 0
        total = 0  # Total number of evaluated edges (positive and negative)

        with torch.no_grad():
            for batch in test_loader:
                batch = batch.to(device)

                # Encode the nodes of the sampled subgraph
                z = model.encode(batch.x, batch.edge_index)

                # Predict the positive edges
                pos_pred = model.decode(z, batch.edge_index)

                # Sample negative edges
                neg_edge_index = negative_sampling(
                    edge_index=batch.edge_index, num_nodes=batch.x.size(0), num_neg_samples=batch.edge_index.size(1)
                )
                neg_pred = model.decode(z, neg_edge_index)

                # Count correct predictions at a 0.5 threshold
                pos_correct = (pos_pred > 0.5).sum().item()
                neg_correct = (neg_pred < 0.5).sum().item()

                correct += pos_correct + neg_correct
                total += pos_pred.size(0) + neg_pred.size(0)

        return correct / total

    for epoch in range(10):
        loss = train()
        acc = test()
        # Report the metrics of this epoch
        print(f'Epoch {epoch}, Loss: {loss:.4f}, Test Accuracy: {acc:.4f}')
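
To read the reported numbers, it helps to see how the summed loss and the thresholded accuracy behave on a handful of scores. The plain-Python sketch below mirrors the two quantities computed in the function above. The helper names `bce` and `link_metrics` and the example probabilities are hypothetical, and the small `eps` is an assumption added only to avoid `log(0)`.

```python
import math


def bce(probs, target, eps=1e-12):
    """Mean binary cross-entropy of probabilities against one constant target (0 or 1)."""
    if target == 1:
        return -sum(math.log(p + eps) for p in probs) / len(probs)
    return -sum(math.log(1.0 - p + eps) for p in probs) / len(probs)


def link_metrics(pos_probs, neg_probs, threshold=0.5):
    """Return (pos_loss + neg_loss, accuracy) for scores of positive and negative edges."""
    loss = bce(pos_probs, 1) + bce(neg_probs, 0)
    correct = sum(p > threshold for p in pos_probs) + sum(p < threshold for p in neg_probs)
    return loss, correct / (len(pos_probs) + len(neg_probs))


loss, acc = link_metrics([0.9, 0.6, 0.4], [0.2, 0.7, 0.1])
print(f"loss={loss:.4f}, accuracy={acc:.4f}")

# A predictor that always outputs 0.5 gives a summed loss of 2 * ln(2).
chance_loss, _ = link_metrics([0.5], [0.5])
print(f"chance-level loss={chance_loss:.4f}")
```

Because the positive and negative terms are added rather than averaged, the chance-level reference for this loss is about 1.386, not 0.693.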


In [90]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LinkPredictor(dataset_FacebookPagePage.num_node_features, hidden_dim=16).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
edge_train_model(train_loader, test_loader, optimizer, model, device)

Epoch 0, Loss: 0.7680, Test Accuracy: 0.8516
Epoch 1, Loss: 0.7247, Test Accuracy: 0.8499
Epoch 2, Loss: 0.7121, Test Accuracy: 0.8553
Epoch 3, Loss: 0.7067, Test Accuracy: 0.8526
Epoch 4, Loss: 0.7033, Test Accuracy: 0.8542
Epoch 5, Loss: 0.7013, Test Accuracy: 0.8559
Epoch 6, Loss: 0.6982, Test Accuracy: 0.8565
Epoch 7, Loss: 0.6951, Test Accuracy: 0.8555
Epoch 8, Loss: 0.6932, Test Accuracy: 0.8555
Epoch 9, Loss: 0.6915, Test Accuracy: 0.8559
