<a href="https://colab.research.google.com/github/natnj/NJ-projects-pub/blob/main/NJ_Homework_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLCB Homework 2

**Deadline: Mon Oct 16th 11:59pm**

In this homework, you will get some practical experience with modeling small molecules. The goal is to expose you to useful libraries (torch, rdkit, dgl) that are heavily utilized in practice. The questions are designed in a way that you should not need to know these libraries to answer them, but if anything is confusing or if you get stuck please ask for help on Piazza!

The homework has 4 parts:

1. Introduction to neural networks
    1. 1-layer neural network
    1. Multi-layer MLP & Activation functions

1. Modeling small molecules:
    1. Morgan fingerprints & MLP  
    1. SMILES strings & RNN  
    1. Molecular graphs & GNN  

1. Generalization (Required for grad students, optional for undergraduates)

1. Competition (Optional)
    
Answer all questions directly in this notebook and complete the missing code where marked with **COMPLETE HERE**.

## 0. Setup

Run the cell below to import all the relevant libraries.

In [None]:
!pip install matplotlib
!pip install scikit-learn
!pip install torch
!pip install rdkit
!pip install dgllife
!pip install tqdm
!pip install  dgl -f https://data.dgl.ai/wheels/repo.html


import torch
import numpy as np
from sklearn.datasets import make_classification
from sklearn.datasets import make_gaussian_quantiles
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from dgllife.data import Tox21
from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer, CanonicalBondFeaturizer
from rdkit import Chem
from rdkit.Chem import AllChem
from tqdm import tqdm


# Part 1 datasets
X1, Y1 = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_repeated=0, class_sep=5.0, n_classes=2, random_state=1)
Y1 = Y1[..., None]
X2, Y2 = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2, random_state=1)
Y2 = Y2[..., None]

# Part 2 datasets
smiles_to_g = SMILESToBigraph(
    node_featurizer=CanonicalAtomFeaturizer(),
    edge_featurizer=CanonicalBondFeaturizer()
)
Tox21 = Tox21(smiles_to_g)

## 1. Introduction to Neural Networks

In this section, we will briefly review neural networks via a simple Multi-Layer Perceptron (MLP).

### 1.A Single Layer Network (Logistic Regression)

First let's consider the simplest model possible, with a single linear layer:

$$y = xW + b$$

Here, *x* is a feature vector of size *n*, which represents the input to the model:

$$ x = [x_1, x_2, \ldots, x_n] $$

Our goal is to learn the weight matrix *W* and the bias *b* that transform *x* into *y*.
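
To make the shapes concrete, here is a minimal NumPy sketch of a single linear layer applied to a batch of samples. The sizes (4 input features, 2 outputs, 5 samples) and the random values are arbitrary assumptions chosen for illustration; they deliberately differ from the sizes used in the question below.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_in, n_out = 5, 4, 2   # illustrative sizes, not from the homework

x = rng.normal(size=(n_samples, n_in))   # one row per sample
W = rng.normal(size=(n_in, n_out))       # weight matrix
b = rng.normal(size=(n_out,))            # bias, broadcast across samples

y = x @ W + b                            # shape: (n_samples, n_out)

print("x:", x.shape, "W:", W.shape, "b:", b.shape, "y:", y.shape)
```

Each row of `y` is the output for the corresponding row of `x`; the same `W` and `b` are shared by all samples.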

#### 1.A.1 Question (1pt)

Assume that the input data *x* is 10-dimensional (i.e., there are 10 features per sample) and the output *y* is 3-dimensional. What is the total number of parameters learned in the formulation presented above?

**ANSWER**:

#### 1.A.2 Question (2pt)

We are now going to train our model from scratch! First we need to define a loss function. We will use the cross entropy loss:

$$Loss(y_{pred}, y_{true}) = - (y_{true} \cdot log(y_{pred}) + (1 - y_{true}) \cdot log(1 - y_{pred})) $$

Here *y_true* is 1 for a positive example and 0 for a negative one, and *y_pred* is the predicted probability of the positive class. For simplicity, we will assume that our input *x* is 2-dimensional. To obtain a probability (between 0 and 1), we must squash the linear output *xW + b*. We do so using the sigmoid function:

$$ y_{pred} = sigmoid(xW + b) = \frac{e^{xW + b}}{1 + e^{xW + b}}$$

To learn *W* and *b*, we can use gradient descent: we compute the derivative of the loss with respect to the parameters and step in the opposite direction.

$$
\begin{align}
    \frac{dLoss}{dW} &= (y_{pred} - y_{true})x \\
    \frac{dLoss}{db} &= (y_{pred} - y_{true})
\end{align}
$$
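
For reference, these expressions follow from the chain rule. The sketch below introduces $z = xW + b$ as shorthand (a symbol not used elsewhere in this notebook) and relies on the standard identity that the derivative of $sigmoid(z)$ is $sigmoid(z)(1 - sigmoid(z))$:

$$
\begin{align}
    \frac{dLoss}{dy_{pred}} &= -\frac{y_{true}}{y_{pred}} + \frac{1 - y_{true}}{1 - y_{pred}} = \frac{y_{pred} - y_{true}}{y_{pred}(1 - y_{pred})} \\
    \frac{dy_{pred}}{dz} &= y_{pred}(1 - y_{pred}) \\
    \frac{dLoss}{dz} &= \frac{dLoss}{dy_{pred}} \cdot \frac{dy_{pred}}{dz} = y_{pred} - y_{true}
\end{align}
$$

Since $\frac{dz}{dW} = x$ and $\frac{dz}{db} = 1$, multiplying $\frac{dLoss}{dz}$ by each of these gives the two gradients above.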

Implement the functions below! It may be useful to print the shape of the input to make sure you are handling dimensions correctly.

In [None]:
def sigmoid(x):
    """Sigmoid activation of the input x."""
    # COMPLETE HERE (hint: use np.exp)
    pass

def predict(x, W, b):
    """Returns y_pred given the input and learned parameters."""
    # COMPLETE HERE (hint: use the sigmoid function above, and np.dot)
    pass

def loss(y_pred, y_true):
    """Returns the cross-entropy loss given the prediction and target."""
    # COMPLETE HERE (hint: use np.log)
    # Consider adding a small epsilon to the log if you run into errors.
    # Return per sample (no averaging in the first dim)
    pass

def dLossdW(y_pred, y_true, x):
    """Computes the derivative of the loss with respect to W."""
    # COMPLETE HERE
    # Return per sample (no averaging in the first dim)
    pass

def dLossdb(y_pred, y_true):
    """Computes the derivative of the loss with respect to b."""
    # COMPLETE HERE
    # Return per sample (no averaging in the first dim)
    pass


### DON'T MODIFY BELOW

def gradient_descent_solver(x, y_true):
    # Initialize weights
    W = np.array([0.0, 0.0])[:, None]
    b = np.array([0])
    alpha = 1.0
    num_steps = 1000

    # Perform steps of gradient descent
    y_pred = predict(x, W, b)
    L_start = loss(y_pred, y_true).mean()
    accuracy_start = ((y_pred > 0.5) == y_true).mean()

    for _ in range(num_steps):
        y_pred = predict(x, W, b)
        L = loss(y_pred, y_true).mean()
        accuracy = ((y_pred > 0.5) == y_true).mean()

        dW = dLossdW(y_pred, y_true, x)
        db = dLossdb(y_pred, y_true)
        W = W - alpha * dW.mean(axis=0)[:, None]
        b = b - alpha * db.mean(axis=0)

    print("Start loss: ", L_start)
    print("Final loss: ", L)

    print("Start accuracy: ", accuracy_start)
    print("Final accuracy: ", accuracy)
    return W, b


def plot_results(x, y_true, W, b):
    plt.figure()
    plt.scatter(x[:, 0], x[:, 1], c=y_true)

    x1 = np.linspace(-10, 10)
    x2 = 0 * x1 - 0
    plt.plot(x1, x2, c="b", label="Starting boundary")

    x1 = np.linspace(-10, 10)
    x2 = -W[0] / W[1] * x1 - b / W[1]
    plt.plot(x1, x2, c="r", label="Final boundary")

    plt.xlim(-10, 10)
    plt.ylim(-10, 10)
    plt.legend()
    plt.show()

# Run training
W1, b1 = gradient_descent_solver(X1, Y1)
plot_results(X1, Y1, W1, b1)

W2, b2 = gradient_descent_solver(X2, Y2)
plot_results(X2, Y2, W2, b2)

**1.A.3 Question (2pt)**

Comment on the plots above. How did the model perform on the first dataset? How about the second? Why is the model not appropriate for the second dataset?

**ANSWER**:

### 1.B Multi-layer Perceptron (MLP)

As you found in the example above, some datasets require more complex models to classify correctly. Let's see if we can improve the performance on the second dataset. This time we will implement a 2-layer neural network:

$$y = (xW_1 + b_1)W_2 + b_2$$

**1.B.1 Question (1pt)**

Prove that the 2-layer model defined above isn't more powerful than a single layer model. (Hint: can you show the model is still just linear?)

**ANSWER**:

**1.B.2 Question (3pt)**

In order to increase the modeling power of the model, we introduce a "non-linearity" between the two layers. Here we choose the simple ReLU function:

$$ ReLU(x) = max(0, x) $$

We now have the following model:

$$y = ReLU(xW_1 + b_1)W_2 + b_2$$

The derivative is now more complex so we are going to use the PyTorch library which automates differentiation for us. All we need to do now is define our model and loss function. Fill out the code below.
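
As a quick illustration of what automatic differentiation does, the sketch below lets PyTorch compute the gradient of a toy squared-error function and compares it with the hand-derived value. The function and the numbers are assumptions picked for illustration; they are not part of the model in this problem.

```python
import torch

# Toy function: f(w) = (w * x - y)^2, with hand-derived gradient 2 * (w * x - y) * x
x = torch.tensor(3.0)
y = torch.tensor(1.0)
w = torch.tensor(2.0, requires_grad=True)  # ask PyTorch to track operations on w

f = (w * x - y) ** 2
f.backward()                               # populates w.grad with df/dw

manual_grad = 2 * (w.detach() * x - y) * x
print("autograd:", w.grad.item(), "manual:", manual_grad.item())
```

The training loop below relies on the same mechanism: `L.backward()` fills in `.grad` for every parameter, and `optimizer.step()` uses those values to update the weights.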



In [None]:
def relu(x):
    return torch.nn.functional.relu(x)

def sigmoid(x):
    """Sigmoid activation of the input x."""
    # COMPLETE HERE (hint: use torch.exp)
    pass

def predict(x, W1, b1, W2, b2):
    """Returns y_pred given the input and learned parameters."""
    # COMPLETE HERE (hint: use the sigmoid & relu functions above + torch.mm)
    pass

def loss(y_pred, y_true):
    """Returns the cross-entropy loss given the prediction and target."""
    # COMPLETE HERE (hint: use torch.log)
    # Consider adding a small epsilon to the log if you run into errors.
    # Return per sample (no averaging in the first dim)
    pass


### DON'T MODIFY BELOW

def gradient_descent_solver(x, y_true):
    # Initialize weights
    random = np.random.RandomState(1)
    W1 = random.randn(2, 100) * 0.01
    W2 = random.randn(100, 1) * 0.01
    W1 = torch.nn.Parameter(torch.tensor(W1).float())
    b1 = torch.nn.Parameter(torch.zeros((100,)))
    W2 = torch.nn.Parameter(torch.tensor(W2).float())
    b2 = torch.nn.Parameter(torch.zeros((1,)))
    alpha = 0.1
    num_steps = 1000

    # Perform steps of gradient descent
    x = torch.tensor(x).float()
    y_true = torch.tensor(y_true).float()
    optimizer = torch.optim.SGD([W1, b1, W2, b2], alpha)

    y_pred = predict(x, W1, b1, W2, b2)
    L_start = loss(y_pred, y_true).mean()
    accuracy_start = ((y_pred > 0.5) == y_true).float().mean()

    for _ in range(num_steps):
        optimizer.zero_grad()
        y_pred = predict(x, W1, b1, W2, b2)
        L = loss(y_pred, y_true).mean()
        L.backward()
        optimizer.step()
        accuracy = ((y_pred > 0.5) == y_true).float().mean()

    print("Start loss: ", L_start.item())
    print("Final loss: ", L.item())

    print("Start accuracy: ", accuracy_start.item())
    print("Final accuracy: ", accuracy.item())

# Run training
print("Dataset 1")
gradient_descent_solver(X1, Y1)

print("\nDataset 2")
gradient_descent_solver(X2, Y2)

**1.B.3 Question (1pt)**

Comment on the results above. How does the model perform now compared to the 1-layer model in the previous problem?

**ANSWER**:

## 2. Modeling small molecules

In this problem, you will experiment with various approaches to molecular modeling from the simplest approach to increasingly more complex.

For this problem, we consider the Tox21 dataset. Let's visualize some of the molecules in the dataset using RDKit!

In [None]:
print(Tox21[0][0])
Chem.MolFromSmiles(Tox21[0][0])

In [None]:
print(Tox21[1][0])
Chem.MolFromSmiles(Tox21[1][0])

### 2.A MLP over fingerprints

Now we are going to train our first model using a simple MLP over Morgan fingerprints. Let's look at an example of a Morgan fingerprint. You can find more information about them here: https://www.rdkit.org/docs/GettingStartedInPython.html

In [None]:
fpgen = AllChem.GetMorganGenerator(radius=3)
mol = Chem.MolFromSmiles(Tox21[0][0])
ao = AllChem.AdditionalOutput()
ao.CollectBitInfoMap()
fp = fpgen.GetCountFingerprint(mol, additionalOutput=ao)
arr = np.zeros((0,), dtype=np.int8)
Chem.DataStructs.ConvertToNumpyArray(fp, arr)
print("Number of bits: ", len(arr))
print("Total non zero entries:", (arr > 0).sum())
print("Maximum count:", arr.max())

The fingerprint is a count vector of length 2048, where each entry corresponds to a substructure and holds the number of times it appears in the molecule.
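
To build intuition for what a fixed-length count vector is, here is a toy sketch in plain Python. It is not RDKit's algorithm: the substructure labels are made-up placeholders, `zlib.crc32` stands in for the real hashing scheme, and the vector length of 16 is chosen only to keep the output short.

```python
import zlib

FP_SIZE_TOY = 16  # hypothetical small size; the real fingerprint above has 2048 entries

# Made-up substructure labels, one per occurrence found in an imaginary molecule
substructures = ["C-C", "C-C", "C=O", "C-N", "C-C", "ring6"]

counts = [0] * FP_SIZE_TOY
for sub in substructures:
    idx = zlib.crc32(sub.encode()) % FP_SIZE_TOY  # map the hash into a fixed range
    counts[idx] += 1

print(counts)
print("Total count:", sum(counts))
print("Non-zero entries:", sum(c > 0 for c in counts))
```

Because many distinct substructures are mapped into a fixed number of entries, two different substructures can land on the same index; a longer vector makes such collisions less likely.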

**2.A.1 Question (1pt)**

Find the most represented bit (maximum count in the array) and visualize it.  Where in the molecule does this pattern appear?

In [None]:
def most_represented_bit(arr):
    # COMPLETE HERE
    pass

# Change num_item to any value < count for that bit
num_item = 0
bi = ao.GetBitInfoMap()
idx = most_represented_bit(arr)
Chem.Draw.DrawMorganBit(mol, idx, bi, whichExample=num_item)

**ANSWER**:

**2.A.2 Question (2pt)**

We process the dataset so that we have the fingerprints for all molecules, and split the data randomly into training and testing.

Note that here we predict 12 labels for each molecule. Because the positive to negative ratio is imbalanced, we measure performance using the ROC-AUC (instead of accuracy) for each label individually and report the average performance across the 12 labels.
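
The sketch below illustrates why accuracy can mislead on imbalanced labels. It uses made-up scores for 10 samples (1 positive, 9 negatives) and a small pairwise implementation of ROC-AUC written for this example; the evaluation code in this notebook uses `sklearn.metrics.roc_auc_score` instead.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


# Made-up example: 1 positive among 10 samples
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

constant_scores = [0.1] * 10  # same score for everyone, always below the threshold
ranked_scores = [0.4, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.2]  # positive scored highest

for name, scores in [("constant", constant_scores), ("ranked", ranked_scores)]:
    print(name, "accuracy:", accuracy(labels, scores), "AUC:", pairwise_auc(labels, scores))
```

Both sets of scores reach the same accuracy at a 0.5 threshold, yet only the second ranks the positive above every negative, which is what ROC-AUC measures.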

Implement a small MLP and train it. Try different values for the hidden dimension, fingerprint radius and dropout. Compare results and comment. Try at least 2 other combinations of hyperparameters.

In [None]:
# COMPLETE HERE (i.e. play with different values)
RADIUS = 3
HIDDEN_DIM = 128
DROPOUT = 0
NUM_EPOCHS = 10
FP_SIZE = 2048


class MLP(torch.nn.Module):

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.w1 = torch.nn.Linear(in_dim, hidden_dim)
        self.relu = torch.nn.ReLU()
        self.dropout = torch.nn.Dropout(DROPOUT)
        self.w2 = torch.nn.Linear(hidden_dim, out_dim)

    def forward(self, batch):
        data = batch["data"]
        # COMPLETE HERE (hint: each pytorch layer can be called as a function)
        pass


# DON'T MODIFY BELOW
def process_sample(sample):
    smiles = sample[0]
    labels = sample[2]
    mask = sample[3]
    fpgen = AllChem.GetMorganGenerator(radius=RADIUS, fpSize=FP_SIZE)
    mol = Chem.MolFromSmiles(smiles)
    ao = AllChem.AdditionalOutput()
    ao.CollectBitInfoMap()
    fp = fpgen.GetCountFingerprint(mol, additionalOutput=ao)
    arr = np.zeros((0,), dtype=np.int8)
    Chem.DataStructs.ConvertToNumpyArray(fp, arr)
    arr = torch.tensor(arr).float()
    return {"data": arr, "labels": labels, "mask": mask}

def create_dataset():
    dataset = list(map(process_sample, Tox21))
    train, test = train_test_split(dataset, test_size=0.2, random_state=1)
    return train, test

def create_model():
    model = MLP(FP_SIZE, HIDDEN_DIM, 12)
    return model

def evaluate(model, dataloader):
    out_pred, out_labels, out_mask = [], [], []
    for batch in dataloader:
        mask = batch["mask"]
        labels = batch["labels"]
        y_pred = model(batch).sigmoid()
        out_pred.append(y_pred)
        out_labels.append(labels)
        out_mask.append(mask)

    out_pred = torch.cat(out_pred).detach().numpy()
    out_labels = torch.cat(out_labels).detach().numpy()
    out_mask = torch.cat(out_mask).bool().detach().numpy()

    aucs = []
    for i in range(12):
        preds = out_pred[:, i]
        labels = out_labels[:, i]
        mask = out_mask[:, i]
        preds = preds[mask]
        labels = labels[mask]
        aucs.append(roc_auc_score(labels, preds))

    return np.mean(aucs)

def train(model, train_dataloader, test_dataloader, num_epochs):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    train_loss = []
    train_aucs = []
    test_aucs = []
    for _ in tqdm(range(num_epochs), total=num_epochs):
        avg_loss = 0
        model.train()
        for batch in train_dataloader:
            optimizer.zero_grad()
            mask = batch["mask"]
            labels = batch["labels"]
            y_pred = model(batch)
            loss = torch.nn.functional.binary_cross_entropy_with_logits(y_pred, labels, reduction="none")
            loss = (loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
            loss = loss.mean()
            loss.backward()
            optimizer.step()
            avg_loss += loss.item()

        model.eval()
        with torch.no_grad():
            train_auc = evaluate(model, train_dataloader)
            test_auc = evaluate(model, test_dataloader)

        avg_loss /= len(train_dataloader)
        train_loss.append(avg_loss)
        train_aucs.append(train_auc)
        test_aucs.append(test_auc)

    return train_loss, train_aucs, test_aucs

train_set, test_set = create_dataset()
train_dl = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
test_dl = torch.utils.data.DataLoader(test_set, batch_size=32, shuffle=False)
model = create_model()
train_loss, train_aucs, test_aucs = train(model, train_dl, test_dl, num_epochs=NUM_EPOCHS)

fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(train_loss)
ax1.set_title("Training Loss")
ax2.plot(train_aucs)
ax2.set_title("Training ROC-AUC")
ax3.plot(test_aucs)
ax3.set_title("Test ROC-AUC")

print("Final Training ROC-AUC: ", train_aucs[-1])
print("Best Training ROC-AUC: ", max(train_aucs))
print("\nFinal Testing ROC-AUC: ", test_aucs[-1])
print("Best Testing ROC-AUC: ", max(test_aucs))

**2.A.3 Question (1pt)**

How does the radius affect performance? How about the hidden dimension?

**ANSWER**

**2.A.4 Question (1pt)**

Why does the test AUC initially go up but then start to fall? Try increasing the dropout. Does it still happen?

**ANSWER**

### 2.B RNN over Smiles

We now shift our attention to using a recurrent neural network (RNN) over SMILES strings. First, let's briefly review what an RNN is. RNNs are used to encode sequences. At each step, an RNN takes as input its state *h* from the previous step and the input at that step *x*:

$$ h_i = f(x_i, h_{i-1}) $$

Consider the simplest example possible, where we apply a simple linear layer on the concatenated input and state vectors.

$$ h_i = W [x_i, h_{i-1}] + b $$

**2.B.1 Question (1pt)**

This formulation leads to exploding / vanishing gradients. Provide some intuition as to why that might be the case. (Hint: consider what happens to W as you unroll the computation across multiple steps. What happens to the eigenvalues of W? What does it imply when they are greater than 1, or less than 1?)

**ANSWER**:

To address this challenge and improve the capacity of RNNs to model long sequences, LSTMs were introduced. Without going into too much detail, LSTMs control the flow of information using learned input and output gates. This has been shown to be an effective way to model longer sequences. You can read more about them here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

**2.B.2 Question (3pt)**

You will now implement an RNN / LSTM. Fill out the missing code below, and again experiment with some of the values for the hidden dimension and dropout. Comment on your observations. This may take a bit of time to train, so there is no need to try too many things.
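
Before filling in the model, it may help to see what padding and packing do on a toy batch. The sketch below uses two made-up sequences of 1-dimensional feature vectors and skips the embedding and LSTM entirely, so it only illustrates the padding utilities and is not a piece of the solution.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Two made-up sequences with different lengths (3 steps and 1 step, 1 feature each)
seqs = [torch.tensor([[1.0], [2.0], [3.0]]), torch.tensor([[4.0]])]
lengths = torch.tensor([3, 1])

# Pad to a common length with zeros: shape (batch=2, max_len=3, features=1)
padded = pad_sequence(seqs, batch_first=True)

# Packing keeps only the real steps, so a recurrent layer never sees the padding
packed = pack_padded_sequence(padded, lengths=lengths, batch_first=True, enforce_sorted=False)
print("Steps kept after packing:", packed.data.shape[0])

# Unpacking restores the padded layout, with zeros at the padded positions
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True, total_length=3)
print(unpacked.squeeze(-1))
print("Recovered lengths:", out_lengths.tolist())
```

Note that the padded positions come back as zeros, which is why a plain mean over the sequence dimension would be biased for short sequences.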

In [None]:
# COMPLETE HERE (i.e. modify to a different value)
HIDDEN_DIM = 100
DROPOUT = 0
NUM_EPOCHS = 10

class RNN(torch.nn.Module):

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.embedding = torch.nn.Embedding(in_dim, hidden_dim)
        self.lstm = torch.nn.LSTM(
            hidden_dim,
            hidden_dim,
            dropout=DROPOUT,
            batch_first=True,
            bidirectional=True
        )
        self.fc = torch.nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, batch):
        data = batch["data"]
        pad_mask = batch["pad_mask"]
        max_len = data.shape[1]

        # COMPLETE HERE

        # First we embed each input token into a vector
        # Use the self.embedding layer
        emb = ...

        # Compute lengths from padding mask & use the above
        lengths = ...

        # In order to ignore padding we use
        out = torch.nn.utils.rnn.pack_padded_sequence(emb, lengths=lengths, batch_first=True, enforce_sorted=False)

        # Now we pass it to the LSTM (which outputs out, state). Ignore the state.
        out = ...

        # Now we unpack, we use
        out = torch.nn.utils.rnn.pad_packed_sequence(out, batch_first=True, total_length=max_len)[0]

        # Compute the average vector for the sequence
        # Beware of padding! Use the mask! Note that you will need to
        # expand the mask to a 3D Tensor. You can do so with mask.unsqueeze(-1)
        # you will also need to unsqueeze the lengths when dividing by them to take the average
        out = ...

        # Finally apply the fc layer
        out = ...
        return out


# DON'T MODIFY THIS
vocab = {"~": 0}
def process_sample(sample, max_length):
    smiles = sample[0]
    labels = sample[2]
    mask = sample[3]

    tok_ids = []
    for token in smiles:
        if token not in vocab:
            # Assign the next free id to a token seen for the first time
            vocab[token] = len(vocab)
        # Always look the id up so repeated tokens map to the same id
        tok_ids.append(vocab[token])

    arr = torch.tensor(tok_ids).long()
    return {"data": arr, "labels": labels, "mask": mask}

def create_dataset():
    max_length = max(len(x[0]) for x in Tox21)
    dataset = list(map(lambda x: process_sample(x, max_length), Tox21))
    train, test = train_test_split(dataset, test_size=0.2, random_state=1)
    return train, test

def create_model():
    return RNN(len(vocab) + 1, HIDDEN_DIM, 12)

def collate_fn(data):

    tok_ids = [d["data"] for d in data]
    pad_mask = [torch.ones_like(d["data"]) for d in data]
    labels = [d["labels"] for d in data]
    mask = [d["mask"] for d in data]

    tok_ids = torch.nn.utils.rnn.pad_sequence(tok_ids, batch_first=True)
    pad_mask = torch.nn.utils.rnn.pad_sequence(pad_mask, batch_first=True)
    labels = torch.stack(labels)
    mask = torch.stack(mask)

    return {"data": tok_ids, "labels": labels, "mask": mask, "pad_mask": pad_mask}


train_set, test_set = create_dataset()
model = create_model()
train_dl = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_dl = torch.utils.data.DataLoader(test_set, batch_size=32, shuffle=False, collate_fn=collate_fn)
train_loss, train_aucs, test_aucs = train(model, train_dl, test_dl, num_epochs=NUM_EPOCHS)

fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(train_loss)
ax1.set_title("Training Loss")
ax2.plot(train_aucs)
ax2.set_title("Training ROC-AUC")
ax3.plot(test_aucs)
ax3.set_title("Test ROC-AUC")

print("Final Training ROC-AUC: ", train_aucs[-1])
print("Best Training ROC-AUC: ", max(train_aucs))
print("\nFinal Testing ROC-AUC: ", test_aucs[-1])
print("Best Testing ROC-AUC: ", max(test_aucs))

**2.B.3 Question (1pt)**

How does the performance compare to the MLP baseline? Can you further improve it by modifying the hyperparameters? Try at least 2 other combinations of hyperparameters.

**ANSWER**:

### 2.C GNN over graph

Moving on, we will now use the graph representation and a simple message passing neural network (MPNN):

$$
\begin{align*}
m_{ij} &= MLP([h_i, h_j, e_{ij}]) \\
h_i^{neigh} &= \sum_j m_{ij} \\
h_i^{new} &= MLP([h_i, h_i^{neigh}])
\end{align*}
$$
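
As a small worked illustration of the aggregation step, the sketch below runs one round of sum aggregation on a made-up 3-atom chain using NumPy. To keep it short, the message is simply the neighbor's feature vector (no MLP and no edge features), so it shows only the $\sum_j$ step and not the full layer you will implement with DGL.

```python
import numpy as np

# Made-up chain graph 0 - 1 - 2, stored as directed edges (source j -> destination i)
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]

# One 2-dimensional feature vector per node (arbitrary values)
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

h_neigh = np.zeros_like(h)
for src, dst in edges:
    message = h[src]          # simplification: the message is the sender's features
    h_neigh[dst] += message   # sum aggregation over incoming messages

print(h_neigh)
```

The middle node receives the sum of both end nodes, while each end node receives only the middle node's features.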

**2.C.1 Question (1pt)**

For a given molecule, how many message passing steps would you need for every atom to "see" every other atom?

**ANSWER**:

**2.C.2 Question (3pt)**

Again complete the missing code and try a couple of different hyperparameters.

In [None]:
import dgl

# COMPLETE HERE (i.e. modify to a different value)
HIDDEN_DIM = 64
NUM_STEPS = 4
DROPOUT = 0
EPOCHS = 20


class GNNLayer(torch.nn.Module):

    def __init__(self, dim, dropout):
        super().__init__()
        self.message_mlp = torch.nn.Sequential(
            torch.nn.Linear(dim * 3, dim),
            torch.nn.SiLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(dim, dim)
        )
        self.node_mlp = torch.nn.Sequential(
            torch.nn.Linear(dim * 2, dim),
            torch.nn.SiLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(dim, dim)
        )
    def message(self, edges):
        node_src = edges.src['h']
        node_dst = edges.dst['h']
        edge = edges.data['e']

        # COMPLETE HERE (hint: use torch.cat in dim=-1 to concatenate and apply the message_mlp)
        msg = ...
        return {'msg_h': msg}

    def forward(self, graph, nodes, edges):
        with graph.local_scope():
            # node feature
            graph.ndata['h'] = nodes

            # edge feature
            graph.edata['e'] = edges

            # Compute messages
            graph.apply_edges(self.message)
            graph.update_all(dgl.function.copy_e('msg_h', 'm'), dgl.function.sum('m', 'h_neigh'))
            h_neigh = graph.ndata['h_neigh']

            # Compute node updates
            # COMPLETE HERE (hint: use torch.cat in dim=-1 to concatenate and apply the node mlp)
            h_new = ...
            return h_new


# DON'T MODIFY THIS
class GNN(torch.nn.Module):
    def __init__(self, dim, dropout):
        super().__init__()
        self.node_fc = torch.nn.Linear(74, dim)
        self.edge_fc = torch.nn.Linear(12, dim)
        self.gnn = GNNLayer(dim, dropout)
        self.fc = torch.nn.Linear(dim, 12)

    def forward(self, batch):
        g = batch["graph"]
        nodes = self.node_fc(g.ndata["h"])
        edges = self.edge_fc(g.edata["e"])

        for i in range(NUM_STEPS):
            # We add a residual connection, which helps with stability
            nodes_new = self.gnn(g, nodes, edges)
            nodes = nodes + torch.nn.functional.relu(nodes_new)

        g.ndata["h_out"] = nodes
        out = dgl.mean_nodes(g, "h_out")
        out = self.fc(out)
        return out

def process_sample(sample):
    return {"graph": sample[1], "labels": sample[2], "mask": sample[3]}

def create_dataset():
    dataset = list(map(process_sample, Tox21))
    train, test = train_test_split(dataset, test_size=0.2, random_state=1)
    return train, test

def create_model():
    return GNN(HIDDEN_DIM, DROPOUT)

def collate_fn(data):
    graph = dgl.batch([d["graph"] for d in data])
    labels = torch.stack([d["labels"] for d in data])
    mask = torch.stack([d["mask"] for d in data])
    return {"graph": graph, "labels": labels, "mask": mask}


train_set, test_set = create_dataset()
model = create_model()
train_dl = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_dl = torch.utils.data.DataLoader(test_set, batch_size=32, shuffle=False, collate_fn=collate_fn)
train_loss, train_aucs, test_aucs = train(model, train_dl, test_dl, num_epochs=EPOCHS)

fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(train_loss)
ax1.set_title("Training Loss")
ax2.plot(train_aucs)
ax2.set_title("Training ROC-AUC")
ax3.plot(test_aucs)
ax3.set_title("Test ROC-AUC")

print("Final Training ROC-AUC: ", train_aucs[-1])
print("Best Training ROC-AUC: ", max(train_aucs))
print("\nFinal Testing ROC-AUC: ", test_aucs[-1])
print("Best Testing ROC-AUC: ", max(test_aucs))

**2.C.3 Question (1pt)**

How does the performance compare to the MLP and RNN baselines? Can you improve it by playing with hyperparameters (you still get full credit regardless)? Try at least 2 other combinations of hyperparameters.

**ANSWER**:

## 3. Generalization (Required for grad students, optional for undergraduates)

So far, we have assumed random splits of the data. As discussed during lecture, this can be misleading, because performance can vary when we evaluate on molecules that are distant from the training set.

Copy and paste the code from above and modify it to use a different data split. Do this for all 3 models. You only need to modify the *create_dataset* function.

Experiment with splitting based on the ScaffoldSplitter.train_val_test_split method in dgllife:
https://lifesci.dgl.ai/api/utils.splitters.html#dgllife.utils.ScaffoldSplitter.train_val_test_split

Set frac_val=0 and frac_test=0.2
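
The idea behind a scaffold split can be sketched in plain Python: molecules that share a scaffold are kept together, so no scaffold appears in both train and test. The scaffold labels below are made-up placeholders, and the greedy largest-group-first assignment is an assumption made for this illustration rather than a description of how dgllife implements it.

```python
from collections import defaultdict

# (molecule id, hypothetical scaffold label) pairs; labels are placeholders, not real scaffolds
molecules = [(0, "A"), (1, "A"), (2, "B"), (3, "A"), (4, "C"),
             (5, "B"), (6, "A"), (7, "D"), (8, "A"), (9, "B")]
frac_test = 0.2

# Group molecule ids by scaffold
groups = defaultdict(list)
for mol_id, scaffold in molecules:
    groups[scaffold].append(mol_id)

# Fill the training set with the largest groups first; whatever does not fit goes to test
train, test = [], []
train_target = len(molecules) - round(frac_test * len(molecules))
for scaffold, ids in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
    if len(train) + len(ids) <= train_target:
        train.extend(ids)
    else:
        test.extend(ids)

print("train:", sorted(train))
print("test:", sorted(test))
```

Unlike a random split, every test molecule here has a scaffold that the model never saw during training, which is what makes the evaluation harder.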

In [None]:
# Hint: ScaffoldSplitter.train_val_test_split(Tox21, frac_val=0, frac_test=0.2)
from dgllife.utils.splitters import ScaffoldSplitter

**3.1 Question (1pt)**

How does the performance of the MLP change?

**ANSWER**:

**3.2 Question (1pt)**

How does the performance of the LSTM change?

**ANSWER**:

**3.3 Question (1pt)**

How does the performance of the GNN change?

**ANSWER**:

## 4. Competition! (Optional to all)

Please report the best results you are able to get on the test set for the random split and the scaffold split (if you did part 3), along with the hyperparameters you used!

No extra credit, just glory!

**ANSWER**: