In [2]:
!pip install torch_geometric

Collecting torch_geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.6.1


In [29]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Load the Cora dataset
dataset = Planetoid(root='data/Cora', name='Cora')

# Prepare data
data = dataset[0]

# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)
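# NOTE: a bare `%time` times an empty statement, so the timing logged below is not the
# training time; `%%time` on the first line of the cell would time the whole cell.
%time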
# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")


CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 6.2 µs
Epoch 0, Loss: 1.9531267881393433
Epoch 10, Loss: 0.6011121273040771
Epoch 20, Loss: 0.09413811564445496
Epoch 30, Loss: 0.01961567811667919
Epoch 40, Loss: 0.007385045289993286
Epoch 50, Loss: 0.004220667760819197
Epoch 60, Loss: 0.0030528465285897255
Epoch 70, Loss: 0.0024841390550136566
Epoch 80, Loss: 0.002137193689122796
Epoch 90, Loss: 0.0018891923828050494
Training complete!


In [24]:
data

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

## Explanation:
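A GCN (graph convolutional network) layer updates each node by aggregating normalized features from the node itself and its neighbors. Stacking such layers lets the network learn representations that depend on both node features and graph structure.
Cora is a citation graph: each of its 2708 nodes is a paper with a 1433-dimensional feature vector, and the task is to classify every node into one of 7 research topics.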
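
As a rough illustration of the aggregation described above, the sketch below applies one propagation step of the form D^{-1/2}(A + I)D^{-1/2} X to a hypothetical 3-node path graph in plain Python. The toy graph, the feature values, and the helper name `gcn_propagate` are made up for illustration, and the learnable weight matrix and bias that `GCNConv` also applies are left out.

```python
import math

def gcn_propagate(adj, feats):
    """One GCN aggregation step: D^{-1/2} (A + I) D^{-1/2} X, without weights."""
    n = len(adj)
    # Add self-loops so every node also keeps part of its own features
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * len(feats[0])
        for j in range(n):
            # Symmetric normalization of the edge (i, j)
            coeff = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
            for k, value in enumerate(feats[j]):
                row[k] += coeff * value
        out.append(row)
    return out

# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0], [0.0], [0.0]]
propagated = gcn_propagate(adj, feats)
print(propagated)
```

After one step, node 1 has received part of node 0's feature, while node 2 (two hops away) still has nothing from node 0; a second step would be needed for that information to reach it.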

## Questions (1 point each):

1. What would happen if we added more GCN layers (e.g., 3 layers instead of 2)? How would this affect over-smoothing?
**Ans:** Each GCN layer mixes a node's representation with those of its neighbors, so a k-layer model draws on the k-hop neighborhood, and a third layer widens every node's receptive field by one hop. As more layers are stacked, the receptive fields of different nodes overlap more and more, and the node representations converge toward indistinguishable vectors (over-smoothing), so the model ends up treating all nodes alike. With only 3 layers this effect is usually still mild; it becomes a real problem for much deeper stacks. The losses logged for the 3-layer cell do come out lower than the baseline's, but they start at about 0.0015 at epoch 0, roughly where the baseline run ended, so they show further training of the already-trained `model` rather than the effect of the extra layer. *(See code below)*
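
The over-smoothing argument can be made concrete with a small experiment in plain Python: repeatedly replace each node's feature with the mean over itself and its neighbors, and track how far apart the node features remain. This is only a simplified, weight-free stand-in for stacking GCN layers (it uses row-normalized averaging rather than `GCNConv`'s symmetric normalization), and the 4-node graph, the feature values, and the helper names are assumptions chosen for illustration.

```python
def mean_aggregate(neighbors, values):
    """Replace each node's value by the mean over itself and its neighbors."""
    return [
        (values[i] + sum(values[j] for j in nbrs)) / (1 + len(nbrs))
        for i, nbrs in enumerate(neighbors)
    ]

def spread(values):
    """Gap between the largest and smallest node value."""
    return max(values) - min(values)

# Hypothetical path graph 0 - 1 - 2 - 3 as an adjacency list
neighbors = [[1], [0, 2], [1, 3], [2]]
values = [1.0, 0.0, 0.0, -1.0]

spreads = [spread(values)]
for _ in range(10):
    values = mean_aggregate(neighbors, values)
    spreads.append(spread(values))

for depth in (0, 2, 3, 10):
    print(f"after {depth} aggregation steps: spread = {spreads[depth]:.4f}")
```

The spread shrinks with every additional aggregation step, which is the mechanism behind over-smoothing: with enough steps, all nodes carry nearly the same value.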


2. What would happen if we used a larger hidden dimension (e.g., 64 instead of 16)? How would this impact the model's capacity?
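**Ans:** A larger hidden dimension gives the model more parameters and therefore more capacity: it can fit the training nodes more easily, but with so few labeled nodes it is also more prone to overfitting and may generalize worse to the test nodes. The logged loss for the 64-unit cell does reach 0.0, but it already starts near 1e-6 at epoch 0, which points to continued training of the original `model`, not to the wider network. A near-zero training loss alone also cannot confirm overfitting; that requires a comparison against validation or test accuracy. If the wider model does overfit, a smaller hidden dimension or regularization such as dropout or weight decay would help. *(See code below)*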
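
To put a number on "capacity", the sketch below counts the learnable parameters of the 2-layer GCN for both hidden sizes. It assumes each `GCNConv` layer holds one weight matrix of shape in × out plus one bias vector of length out (its default configuration); the helper name `gcn_param_count` is hypothetical.

```python
def gcn_param_count(input_dim, hidden_dim, output_dim):
    """Parameters of a 2-layer GCN: (weights + bias) for each layer."""
    layer1 = input_dim * hidden_dim + hidden_dim
    layer2 = hidden_dim * output_dim + output_dim
    return layer1 + layer2

# Cora: 1433 input features, 7 classes
small = gcn_param_count(1433, 16, 7)
large = gcn_param_count(1433, 64, 7)
print(small, large)  # 23063 92231
```

Going from 16 to 64 hidden units roughly quadruples the parameter count, almost all of it in the first layer. In the notebook, the same count could be checked with `sum(p.numel() for p in model.parameters())`.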


3. What would happen if we replaced ReLU activation with a sigmoid function? Would the performance change?
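**Ans:** The logged loss is near 0 from epoch 0 onward. A freshly initialized 7-class model should start near ln 7 ≈ 1.95, as the baseline does, so these values again come from the already-trained `model` and say nothing about sigmoid. Conceptually, sigmoid saturates: its derivative is at most 0.25 and approaches 0 for large-magnitude inputs, so gradients shrink as they are backpropagated (the vanishing gradient problem), which typically slows convergence. ReLU has a gradient of exactly 1 for positive inputs and 0 for negative inputs, so gradients do not shrink as drastically, which makes it the safer default for this use case. With only two layers, the gap between the two activations may be modest. *(See code below)*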
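
The gradient argument can be checked numerically. The following sketch, in plain Python with hypothetical helper names, evaluates the derivatives of sigmoid and ReLU at a few sample inputs. It also prints the cross-entropy of a uniform prediction over 7 classes, a rough reference point for the loss of an untrained model (compare the baseline's first logged loss of about 1.95).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s(z) * (1 - s(z)), never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (-6.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}  sigmoid' = {sigmoid_grad(z):.4f}  relu' = {relu_grad(z):.1f}")

# Expected loss of a classifier that predicts 7 classes uniformly
uniform_loss = math.log(7)
print(f"uniform 7-class cross-entropy: {uniform_loss:.4f}")
```

The sigmoid derivative peaks at 0.25 and is close to zero once the input moves a few units away from 0, whereas the ReLU derivative stays at 1 for every positive input.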


4. What would happen if we trained on only 10% of the nodes and tested on the remaining 90%? How would the performance be affected?
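**Ans:** The effect depends on what the 10% is compared against. Node classification here is transductive: the forward pass always sees every node's features and all edges, and only the labels used in the loss are restricted by `train_mask`. Relative to training on most of the labels, keeping only 10% would likely lower test accuracy, because the model has fewer labeled examples per class to learn from. The drop tends to be smaller than in ordinary supervised learning, though, since message passing still spreads information from the labeled nodes through the whole graph. Note that the default Planetoid split used above is already sparse: it labels 140 of the 2708 nodes (about 5%) for training.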
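
One way to set up the 10%/90% experiment is to draw a random split of node indices, sketched here in plain Python. The helper name `random_split_masks` and the fixed seed are assumptions for illustration; this is not the split that ships with the dataset.

```python
import random

def random_split_masks(num_nodes, train_fraction, seed=0):
    """Return boolean train/test masks covering all nodes without overlap."""
    rng = random.Random(seed)
    num_train = int(train_fraction * num_nodes)
    train_ids = set(rng.sample(range(num_nodes), num_train))
    train_mask = [i in train_ids for i in range(num_nodes)]
    test_mask = [not flag for flag in train_mask]
    return train_mask, test_mask

# Cora has 2708 nodes
train_mask, test_mask = random_split_masks(2708, 0.10)
print(sum(train_mask), sum(test_mask))  # 270 2438
```

In the notebook, the two lists could be converted with `torch.tensor(train_mask)` and `torch.tensor(test_mask)` and used in place of `data.train_mask` and `data.test_mask`; that conversion is not exercised here.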


5. What would happen if we used a different optimizer (e.g., RMSprop) instead of Adam? Would it affect the convergence speed?
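**Ans:** A different optimizer can change the convergence speed, since each optimizer adapts its step size differently: RMSprop scales each gradient by a running average of its squared magnitude, while Adam additionally keeps a momentum-like running average of the gradient itself and applies bias correction. The comparison below is inconclusive, however. The `%time` readings (about 6 µs wall time in both cells) time an empty statement rather than the training loop, and the RMSprop run starts from a loss of about 0.0018, i.e. from the already-trained `model`. Convergence speed is better judged by comparing the loss per epoch of two freshly initialized models.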
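
To show how the two update rules differ, here is a minimal plain-Python sketch of RMSprop and Adam minimizing the toy function f(x) = x². The function, the starting point, and the helper names are assumptions for illustration. The learning rate matches the notebook's 0.01 and the decay constants mirror PyTorch's documented defaults, but this is not PyTorch's implementation and it omits options such as momentum and weight decay.

```python
import math

def run_rmsprop(grad_fn, x, lr=0.01, alpha=0.99, eps=1e-8, steps=100):
    """Plain RMSprop: divide each gradient by a running RMS of past gradients."""
    sq_avg = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        sq_avg = alpha * sq_avg + (1 - alpha) * g * g
        x -= lr * g / (math.sqrt(sq_avg) + eps)
    return x

def run_adam(grad_fn, x, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: RMSprop-style scaling plus a running mean of gradients and bias correction."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

def loss(x):
    return x * x

def grad(x):
    return 2 * x

start = 5.0
x_rms = run_rmsprop(grad, start)
x_adam = run_adam(grad, start)
print(f"start loss {loss(start):.3f}, RMSprop {loss(x_rms):.3f}, Adam {loss(x_adam):.3f}")
```

Both runs start from the same point and use the same number of steps, which is the kind of like-for-like setup needed to compare convergence. Results on this toy problem say nothing definite about which optimizer is faster on Cora.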

Extra credit: 
1. What would happen if we used edge weights (non-binary) in the adjacency matrix? How would it affect message passing?
**Ans:** With non-binary edge weights, each neighbor's message is scaled by the weight of the connecting edge (after degree normalization) instead of every neighbor contributing equally. This lets the model take the strength of each relationship into account, so strongly connected neighbors influence a node's representation more than weakly connected ones. In PyTorch Geometric, `GCNConv` accepts these weights through its optional `edge_weight` argument.
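
The effect of edge weights on message passing can be seen in a tiny example. The sketch below, in plain Python, uses a weighted mean over incoming neighbors as a simplified stand-in for GCN aggregation (the real layer uses symmetric normalization and self-loops); the node names, weights, and the helper `weighted_aggregate` are hypothetical.

```python
def weighted_aggregate(edges, values):
    """Weighted mean of neighbor values; edges maps (target, source) -> weight."""
    totals = {}
    weight_sums = {}
    for (target, source), weight in edges.items():
        totals[target] = totals.get(target, 0.0) + weight * values[source]
        weight_sums[target] = weight_sums.get(target, 0.0) + weight
    return {node: totals[node] / weight_sums[node] for node in totals}

values = {"a": 1.0, "b": 0.0}
# Node "c" hears from "a" and "b"; only the edge weights differ between the two cases
binary = weighted_aggregate({("c", "a"): 1.0, ("c", "b"): 1.0}, values)
weighted = weighted_aggregate({("c", "a"): 3.0, ("c", "b"): 1.0}, values)
print(binary["c"], weighted["c"])  # 0.5 0.75
```

With binary weights both neighbors count equally; with the heavier edge from "a", node "c" moves closer to "a". In the notebook this would correspond to calling a layer as `self.conv1(x, edge_index, edge_weight)` with one weight per column of `edge_index`. The Cora `Data` object shown above carries no edge weights, so such a tensor would have to be constructed.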


2. What would happen if we removed the log-softmax function in the output layer? Would the loss function still work correctly?
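**Ans:** My initial hypothesis was that removing log-softmax would break the loss function because the network output would be in the wrong format. In fact, `nn.CrossEntropyLoss` expects raw, unnormalized logits and applies log-softmax internally, so returning `x` directly is the intended usage. Keeping `torch.log_softmax` in the model is harmless with this loss, because applying log-softmax to values that are already log-probabilities leaves them unchanged; the usual pairing for a log-softmax output is `nn.NLLLoss`. The run below (see code) is consistent with this: the training loss keeps decreasing, though its low starting value shows that this loop too was updating the already-trained `model`.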
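
The reason the loss behaves the same with or without the extra log-softmax can be verified in a few lines. This plain-Python sketch defines cross-entropy as log-softmax followed by negative log-likelihood (the definition `nn.CrossEntropyLoss` follows for class-index targets) and compares the loss on raw logits with the loss on already log-softmaxed scores. The logits and helper names are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

def cross_entropy(logits, target):
    """Cross-entropy on raw scores: log-softmax followed by negative log-likelihood."""
    return -log_softmax(logits)[target]

logits = [2.0, -1.0, 0.5]
log_probs = log_softmax(logits)

loss_from_logits = cross_entropy(logits, target=0)
loss_from_log_probs = cross_entropy(log_probs, target=0)  # log-softmax applied twice
print(loss_from_logits, loss_from_log_probs)
```

Log-probabilities already sum to 1 after exponentiation, so a second log-softmax subtracts log(1) = 0 and the two losses agree up to floating-point error.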

## No points, just for you to think about:
1. What would happen if we applied dropout to the node features during training? How would it affect the model’s generalization?
2. What would happen if we used mean-pooling instead of summing the messages in the GCN layers?
3. What would happen if we pre-trained the node features using a different algorithm, like Node2Vec, before feeding them into the GCN?



#### Question 1 - Adding 3 layers

In [6]:
# Define a 3-layer GCN
class GCN3(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN3, self).__init__()
        self.conv1 = GCNConv(input_dim, 250)
        self.conv2 = GCNConv(250, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        x = torch.relu(x)
        x = self.conv3(x, edge_index)
        return torch.log_softmax(x, dim=1)

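# Initialize model, optimizer, and loss function
model_3 = GCN3(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
# Optimize model_3's own parameters; model.parameters() would keep training
# the earlier 2-layer `model`, which is what the losses logged below reflect
optimizer = optim.Adam(model_3.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model_3.train()
    optimizer.zero_grad()

    # Forward pass through the 3-layer model
    out = model_3(data)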
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")

Epoch 0, Loss: 0.0015451112994924188
Epoch 10, Loss: 4.580696622724645e-05
Epoch 20, Loss: 8.037974112085067e-06
Epoch 30, Loss: 3.624783630584716e-06
Epoch 40, Loss: 2.468470484018326e-06
Epoch 50, Loss: 2.0367670003906824e-06
Epoch 60, Loss: 1.8315585066375206e-06
Epoch 70, Loss: 1.6910628346522572e-06
Epoch 80, Loss: 1.5829236872377805e-06
Epoch 90, Loss: 1.4867054005662794e-06
Training complete!


#### Question 2 - Increasing size of hidden layer

In [7]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

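# Initialize model, optimizer, and loss function
model_hidden = GCN(input_dim=dataset.num_node_features, hidden_dim=64, output_dim=dataset.num_classes)
# Optimize model_hidden's own parameters; model.parameters() would keep training
# the earlier `model`, which is what the losses logged below reflect
optimizer = optim.Adam(model_hidden.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model_hidden.train()
    optimizer.zero_grad()

    # Forward pass through the 64-unit model
    out = model_hidden(data)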
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")

Epoch 0, Loss: 1.4109230050962651e-06
Epoch 10, Loss: 8.685246655204537e-08
Epoch 20, Loss: 1.362391799375473e-08
Epoch 30, Loss: 5.108969247658024e-09
Epoch 40, Loss: 2.554484623829012e-09
Epoch 50, Loss: 1.7029897492193413e-09
Epoch 60, Loss: 8.514948746096707e-10
Epoch 70, Loss: 8.514948746096707e-10
Epoch 80, Loss: 0.0
Epoch 90, Loss: 0.0
Training complete!


#### Question 3 - Replace ReLU with Sigmoid

In [13]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.sigmoid(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

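# Initialize model, optimizer, and loss function
model_sigmoid = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
# Optimize model_sigmoid's own parameters; model.parameters() would keep training
# the earlier ReLU `model`, which is what the losses logged below reflect
optimizer = optim.Adam(model_sigmoid.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model_sigmoid.train()
    optimizer.zero_grad()

    # Forward pass through the sigmoid model
    out = model_sigmoid(data)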
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")

Epoch 0, Loss: 2.3126474388845963e-06
Epoch 10, Loss: 1.4730856889855204e-07
Epoch 20, Loss: 2.7247834211152622e-08
Epoch 30, Loss: 1.1069433369925719e-08
Epoch 40, Loss: 6.811958996877365e-09
Epoch 50, Loss: 5.960464122267695e-09
Epoch 60, Loss: 3.4059794984386826e-09
Epoch 70, Loss: 3.4059794984386826e-09
Epoch 80, Loss: 1.7029897492193413e-09
Epoch 90, Loss: 8.514948746096707e-10
Training complete!


#### Question 5 - Replace Adam with RMSProp

In [23]:
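# NOTE: a bare `%time` times an empty statement, so the timing logged below is not the
# training time; `%%time` on the first line of the cell would time the whole cell.
%time
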
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

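# Initialize model, optimizer, and loss function
model_rmsprop = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
# Optimize model_rmsprop's own parameters; model.parameters() would keep training
# the earlier `model`, which is what the losses logged below reflect
optimizer = optim.RMSprop(model_rmsprop.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model_rmsprop.train()
    optimizer.zero_grad()

    # Forward pass through the RMSprop-trained model
    out = model_rmsprop(data)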
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.2 µs
Epoch 0, Loss: 0.001809474197216332
Epoch 10, Loss: 0.0005646863719448447
Epoch 20, Loss: 0.00028500787448138
Epoch 30, Loss: 0.00019343489839229733
Epoch 40, Loss: 0.00014670485688839108
Epoch 50, Loss: 0.00011805366375483572
Epoch 60, Loss: 9.856133692665026e-05
Epoch 70, Loss: 8.438501390628517e-05
Epoch 80, Loss: 7.36061847419478e-05
Epoch 90, Loss: 6.512028630822897e-05
Training complete!


#### Extra credit 2 - Remove log-softmax

In [30]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return x

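# Initialize model, optimizer, and loss function
model2 = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
# Optimize model2's own parameters; model.parameters() would keep training
# the earlier `model`, which is what the losses logged below reflect
optimizer = optim.Adam(model2.parameters(), lr=0.01)
# nn.CrossEntropyLoss expects raw logits and applies log-softmax internally
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model2.train()
    optimizer.zero_grad()

    # Forward pass: model2 returns raw logits (no log-softmax)
    out = model2(data)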
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")

Epoch 0, Loss: 0.0016936450265347958
Epoch 10, Loss: 5.1432183681754395e-05
Epoch 20, Loss: 9.090367711905856e-06
Epoch 30, Loss: 4.063294454681454e-06
Epoch 40, Loss: 2.7733008209906984e-06
Epoch 50, Loss: 2.2777373942517443e-06
Epoch 60, Loss: 2.0265488274162635e-06
Epoch 70, Loss: 1.8690238903218415e-06
Epoch 80, Loss: 1.7413008208677638e-06
Epoch 90, Loss: 1.6382705325668212e-06
Training complete!
