# MoleculeNet - HIV

HIV dataset: "Experimentally measured abilities to inhibit HIV replication".
The dataset is part of the MoleculeNet benchmark suite.

Source: MoleculeNet datasets page, https://moleculenet.org/datasets-1

You can find similar projects on GitHub, YouTube, or Google Scholar by searching for "molecular property prediction".

## IMPORTANT NOTES
- This is a binary classification problem; the evaluation metric is ROC-AUC.

- The dataset is highly imbalanced, so explore strategies to handle this (for example, a class-weighted loss or resampling).

- The prescribed train/val/test split is the scaffold split. It groups molecules by their core structure (scaffold) and assigns each group entirely to one split, so the model is trained on some scaffolds and evaluated on different ones. This effectively forces out-of-distribution inference. For an additional challenge, you can implement the split yourself; doing so requires the popular cheminformatics library `rdkit`.

- You should be able to train a model on a GPU using Colab or Kaggle. If your resources are limited, you can also work with a smaller subset of the dataset.
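
As a rough illustration of the scaffold split described in the notes above, the sketch below groups molecule indices by scaffold and fills train, then validation, then test with whole groups, largest first. This is one common deterministic variant and not necessarily the exact procedure behind the prescribed split. The helper names `murcko_scaffold` and `scaffold_split` and the 80/10/10 fractions are assumptions made for this example. `rdkit` is imported lazily, so the grouping logic can be tried on toy scaffold strings without it.

```python
from collections import defaultdict


def murcko_scaffold(smiles):
    """Return the Bemis-Murcko scaffold SMILES of a molecule (requires rdkit)."""
    # Imported here so that scaffold_split below has no hard rdkit dependency.
    from rdkit.Chem.Scaffolds import MurckoScaffold
    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split indices so that no scaffold appears in more than one split.

    `scaffolds` is a list with one scaffold string per molecule.
    Returns three lists of indices: (train, valid, test).
    """
    # Collect the indices of all molecules that share a scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by the first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    train_cutoff = frac_train * n_total
    valid_cutoff = (frac_train + frac_valid) * n_total

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical usage with the PyG dataset loaded later in this notebook:
# scaffolds = [murcko_scaffold(data.smiles) for data in dataset]
# train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
# train_set = dataset[train_idx]
```

Because whole groups are assigned at once, the realized split sizes only approximate the requested fractions.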

In [1]:
import torch
from torch_geometric.datasets import MoleculeNet
from torch_geometric.loader import DataLoader
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool, GATConv, GINConv, GINEConv
from torch.nn import Sequential, Linear, ReLU
from torch.optim.lr_scheduler import ReduceLROnPlateau


In [2]:
# Load the HIV dataset from the MoleculeNet benchmark suite
dataset = MoleculeNet(root='./data', name='HIV')

Downloading https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/HIV.csv
Processing...
Done!


In [3]:
print(f"Example of a graph:\n {dataset[0]}\n")
print(f"Node features:\n {dataset.x}\n")
print(f"Edge features:\n {dataset.edge_attr}\n")
print(f"Graph labels:\n {dataset.y}")

Example of a graph:
 Data(x=[19, 9], edge_index=[2, 40], edge_attr=[40, 3], smiles='CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)=[O+]2', y=[1, 1])

Node features:
 tensor([[6, 0, 4,  ..., 4, 0, 0],
        [6, 0, 4,  ..., 4, 0, 0],
        [6, 0, 3,  ..., 3, 0, 1],
        ...,
        [7, 0, 2,  ..., 3, 1, 1],
        [8, 0, 2,  ..., 3, 1, 1],
        [6, 0, 3,  ..., 3, 1, 1]])

Edge features:
 tensor([[ 1,  0,  0],
        [ 1,  0,  0],
        [ 1,  0,  0],
        ...,
        [12,  0,  1],
        [12,  0,  1],
        [12,  0,  1]])

Graph labels:
 tensor([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])
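
The labels printed above are dominated by zeros, which matches the imbalance mentioned in the notes. One possible way to account for it, sketched below, is to weight the positive class in the loss by the ratio of negatives to positives. The helper name `pos_weight_from_labels` is an assumption made for this example, and class weighting is only one of several strategies (resampling is another).

```python
def pos_weight_from_labels(labels):
    """Return n_negative / n_positive for a sequence of 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive labels: the weight is undefined")
    return n_neg / n_pos


# Hypothetical usage with the objects defined in this notebook
# (ideally computed on the training split only, to avoid leaking test statistics):
# labels = dataset.y.view(-1).tolist()
# weight = pos_weight_from_labels(labels)
# criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([weight]))
```

With this weight, each positive example contributes to the loss as much as `weight` negative examples, so the few active molecules are not drowned out by the inactive ones.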


## Hints
- You may need a dataloader, since the dataset is quite large.
- You are free to use any architecture and any learning setup: supervised, unsupervised, or self-supervised.
- You can find multiple examples of similar tasks or models on GitHub.
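
Whatever architecture is chosen, the model is scored with ROC-AUC. The sketch below computes it in plain Python as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. It is meant only to show what the metric measures; the function name `roc_auc` is an assumption for this example, and in practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC from the rank sum of the positive examples (ties get average ranks)."""
    pairs = sorted(zip(y_score, y_true))

    # Assign 1-based ranks, averaging the ranks of tied scores.
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC is undefined when only one class is present")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Because the metric depends only on how examples are ranked, it is unaffected by the classification threshold, which makes it a reasonable choice for an imbalanced dataset. Note that it is undefined for a batch or split that contains a single class.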