<div>
    <h1 align="center"><font color="blue"> DELIVERABLE 2 </font></h1>
</div>

<div>
    <h4 align="left"><font color="green"> Downloading Libraries </font></h4>
</div>

In [1]:
pip install rdkit-pypi torch_geometric faiss-cpu sacremoses bitsandbytes

Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting torch_geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3,>=2.2->bitsandbytes)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Co

In [2]:
# rdkit-pypi: Helps me work with chemical structures and SMILES strings for molecules.
# torch_geometric: Allows me to build graph neural networks (GNNs) for processing molecular data.
# faiss-cpu: Used for fast similarity searches over embeddings, like finding similar compounds.
# sacremoses: Moses-style text tokenizer/detokenizer; not imported in this notebook, presumably needed for the language model part.
# bitsandbytes: Provides 8-bit and 4-bit quantization for memory-efficient work with large language models.

print("---------- ALL LIBRARIES HAVE BEEN DOWNLOADED ----------")

---------- ALL LIBRARIES HAVE BEEN DOWNLOADED ----------


<div>
    <h4 align="left"><font color="green"> Importing Libraries </font></h4>
</div>

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GINConv, global_add_pool
from torch_geometric.data import Data, Batch
from torch_geometric.loader import DataLoader
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

import faiss

print("---------- ALL LIBRARIES HAVE BEEN IMPORTED ----------")

# torch, torch.nn, and torch.nn.functional: For building and training neural networks, like my GNN model.
# torch_geometric modules (GINConv, global_add_pool, Data, Batch, DataLoader): Help me create and process graph-based data for molecules.
# numpy and pandas: For array operations and for loading the tabular dataset.
# rdkit modules (Chem, AllChem, DataStructs): Let me parse chemical structures and generate fingerprints.
# faiss: For efficient similarity searches using embeddings.

---------- ALL LIBRARIES HAVE BEEN IMPORTED ----------


In [4]:
# Set device for training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


<div>
    <h3 align="left"><font color="red"> STEP 01: Data Loading and Preprocessing </font></h3>
</div>

In [5]:
df = pd.read_csv('/kaggle/input/smiles/SMILES_Big_Data_Set.csv')
print("Dataset columns:", df.columns.tolist())

# Standardizing SMILES strings to ensure consistency and tracking invalid ones.
invalid_smiles_count = 0
def standardize_smiles(smiles):
    """Return the canonical isomeric SMILES, or None if the input cannot be parsed."""
    global invalid_smiles_count
    try:
        mol = Chem.MolFromSmiles(smiles)  # Convert SMILES to RDKit molecule object (None if unparsable).
        if mol is None:
            invalid_smiles_count += 1
            return None
        return Chem.MolToSmiles(mol, isomericSmiles=True)  # Convert back to canonical SMILES, keeping stereochemistry.
    except Exception:  # e.g. a non-string entry in the SMILES column
        invalid_smiles_count += 1  # Increment counter if conversion fails.
        return None

df['standard_smiles'] = df['SMILES'].apply(standardize_smiles)
df = df.dropna(subset=['standard_smiles']).drop_duplicates(subset=['standard_smiles'])  # Drop invalid rows, then duplicate structures.
print(f"Removed {invalid_smiles_count} invalid SMILES strings.")


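# Coercing the numeric columns; values that cannot be parsed become NaN.
df['pIC50'] = pd.to_numeric(df['pIC50'], errors='coerce')
df['num_atoms'] = pd.to_numeric(df['num_atoms'], errors='coerce')
df['logP'] = pd.to_numeric(df['logP'], errors='coerce')
df = df.dropna()  # Drop rows with a missing value in any column.

# Creating a column of RDKit molecule objects for later use, like generating fingerprints.
# This overwrites the 'mol' column that ships with the dataset.
df['mol'] = df['standard_smiles'].apply(Chem.MolFromSmiles)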

Dataset columns: ['SMILES', 'pIC50', 'mol', 'num_atoms', 'logP']
Removed 0 invalid SMILES strings.


<div>
    <h3 align="left"><font color="red"> STEP 02: Generating Fingerprints (Morgan Fingerprints) </font></h3>
</div>

In [6]:
# Creating Morgan fingerprints to represent molecular structures numerically as input to the embedding model.
def generate_morgan_fingerprint(mol, radius=2, n_bits=2048):
    """Return the Morgan fingerprint of `mol` as a float32 array of 0s and 1s, or None on failure."""
    if mol is None:
        return None
    try:
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=n_bits)  # Bit-vector Morgan fingerprint (defaults: radius 2, 2048 bits).
        arr = np.zeros((n_bits,), dtype=np.float32)
        DataStructs.ConvertToNumpyArray(fp, arr)  # Convert fingerprint to NumPy array of 0s and 1s.
        return arr
    except Exception:
        return None

df['morgan_fp'] = df['mol'].apply(generate_morgan_fingerprint)  
df = df[df['morgan_fp'].notnull()]  # Remove rows where fingerprint generation failed.
fp_matrix = np.stack(df['morgan_fp'].values)  # Stack all fingerprints into a single NumPy array for GNN training.
print(f"Fingerprint matrix shape: {fp_matrix.shape}")

Fingerprint matrix shape: (14823, 2048)
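Before embedding the fingerprints, it can help to have a direct similarity measure on the raw bit vectors as a reference point. The sketch below computes Tanimoto similarity with plain NumPy on arrays shaped like the rows of `fp_matrix`. The `tanimoto` helper and the two 8-bit toy fingerprints are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto (Jaccard) similarity between two 0/1 fingerprint arrays."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return float(np.logical_and(a, b).sum() / union)

# Toy 8-bit fingerprints standing in for rows of fp_matrix (hypothetical data).
fp_a = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=np.float32)
fp_b = np.array([1, 0, 0, 0, 1, 1, 0, 0], dtype=np.float32)

# 2 bits set in both, 4 bits set in either.
print(tanimoto(fp_a, fp_b))  # 0.5
```

The same function could be applied to a query fingerprint and the fingerprints of retrieved compounds to sanity-check that embedding-based neighbours are also structurally close.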


<div>
    <h3 align="left"><font color="red"> STEP 03: GNN for Fingerprint Embedding (GIN) </font></h3>
</div>

In [7]:
# Defining a Graph Neural Network (GNN) to create compact embeddings from Morgan fingerprints.
class FingerprintGNN(nn.Module):
    def __init__(self, input_dim=2048, hidden_dim=512, output_dim=256):
        super().__init__()
        self.fp_to_node = nn.Linear(input_dim, hidden_dim)  # Reduce 2048-bit fingerprint to 512 dimensions.
        self.conv1 = GINConv(nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),  # First linear layer for graph convolution.
            nn.ReLU(),  # Activation
            nn.Linear(hidden_dim, hidden_dim)  # Second linear layer for feature transformation.
        ))
        self.conv2 = GINConv(nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),  # Second graph convolution layer.
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        ))
        self.lin = nn.Linear(hidden_dim, output_dim)  # Final layer to output 256-dimensional embedding.

    def forward(self, x, edge_index, batch):
        x = self.fp_to_node(x)  # Transform input fingerprint to hidden dimension.
        x = self.conv1(x, edge_index).relu() 
        x = self.conv2(x, edge_index) 
        pooled = global_add_pool(x, batch)  # Aggregate node features into a single embedding per graph.
        return self.lin(pooled)  

data_list = []
for fp in df['morgan_fp']:
    node_feat = torch.FloatTensor(fp).unsqueeze(0)  # Shape (1, 2048): one node whose feature vector is the whole fingerprint.
    edge_index = torch.tensor([[0], [0]], dtype=torch.long)  # Single self-loop (0 -> 0), so the node aggregates only its own features.
    data = Data(x=node_feat, edge_index=edge_index) 
    data_list.append(data)

batch_size = 128  # Set batch size for efficient training.
loader = DataLoader(data_list, batch_size=batch_size, shuffle=False)  # shuffle=False keeps batches in df row order; the same loader is reused for embedding generation below.

# Training the GNN model using an autoencoder-like loss: its output is matched to a learned linear projection of the input fingerprint.
# Note: because the projection is trained jointly, both outputs can drift toward the same values, so a low loss alone does not show that the embeddings are informative.
gin_model = FingerprintGNN().to(device)
target_projection = nn.Linear(2048, 256).to(device)  # Linear layer to project fingerprints to 256 dimensions for loss calculation.

# Ensuring target_projection parameters are optimized along with the GNN.
combined_params = list(gin_model.parameters()) + list(target_projection.parameters())
optimizer = torch.optim.Adam(combined_params, lr=0.001)  # Single Adam optimizer over both modules.

epochs = 10

print("\nTraining GIN model...")
for epoch in range(epochs):
    gin_model.train()  
    target_projection.train() 
    total_loss = 0
    for batch in loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        out = gin_model(batch.x, batch.edge_index, batch.batch)  # Get GNN embeddings.
        target = target_projection(batch.x) 
        loss = F.mse_loss(out, target)  # Calculate MSE loss between GNN and projected embeddings.
        loss.backward()
        optimizer.step()  # Update model weights.
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(loader)}")

# Generating embeddings for all fingerprints using the trained GNN.
print("\nGenerating GNN embeddings...")
gin_model.eval()  
target_projection.eval() 
embeddings = []
with torch.no_grad():  # Disable gradient tracking to save memory.
    for batch in loader:
        batch = batch.to(device) 
        emb = gin_model(batch.x, batch.edge_index, batch.batch)  # Generate embeddings.
        embeddings.append(emb.cpu().numpy()) 
embedding_matrix = np.vstack(embeddings) 
print(f"Embedding matrix shape: {embedding_matrix.shape}")


Training GIN model...
Epoch 1, Loss: 0.0014479051144384168
Epoch 2, Loss: 0.0004800298243183
Epoch 3, Loss: 0.0003346442300143876
Epoch 4, Loss: 0.0003052238546309848
Epoch 5, Loss: 0.0003707754834161686
Epoch 6, Loss: 0.0004837360282745694
Epoch 7, Loss: 0.0008593910772168752
Epoch 8, Loss: 0.002141221021984479
Epoch 9, Loss: 0.001700232632895771
Epoch 10, Loss: 0.0017588131043433759

Generating GNN embeddings...
Embedding matrix shape: (14823, 256)
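The logged loss reaches its minimum at epoch 4 and rises afterwards, so the weights used for embedding generation come from a later, higher-loss epoch. One common way to handle this is patience-based early stopping. The sketch below is a minimal version in plain Python, replayed on the loss values from the log above (rounded). The `EarlyStopper` class and its `patience` setting are hypothetical and not part of the notebook.

```python
class EarlyStopper:
    """Signals a stop once the loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch: int, loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Losses copied (rounded) from the training log above.
logged = [0.001448, 0.000480, 0.000335, 0.000305, 0.000371,
          0.000484, 0.000859, 0.002141, 0.001700, 0.001759]

stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate(logged, start=1):
    if stopper.step(epoch, loss):
        break

print(stopper.best_epoch, epoch)  # 4 6
```

In a real training loop, the model's `state_dict()` would typically be copied whenever `best_epoch` is updated and restored after the loop ends.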


<div>
    <h4 align="left"><font color="green"> Saving preprocessed data, embeddings, trained model </font></h4>
</div>

In [8]:
# Saving my processed data and trained GNN model for later use.
df['gnn_embedding'] = embedding_matrix.tolist()
df.to_csv('preprocessed_data_with_embeddings.csv', index=False)  # Note: list, array, and RDKit Mol columns are written as text in the CSV.

# Saving the GNN model's weights to a file.
torch.save(gin_model.state_dict(), "gin_model.pth") 

print("Data Saved!")

Data Saved!


<div>
    <h4 align="left"><font color="green"> Checking if required columns exist in df </font></h4>
</div>

In [9]:
# Checking if my DataFrame has the necessary columns for later steps.
if 'gnn_embedding' not in df.columns or 'standard_smiles' not in df.columns:
    raise ValueError("Required columns 'gnn_embedding' or 'standard_smiles' not found in DataFrame.")
else:
    print("Required Columns Exist!")

# Resetting the DataFrame index so row labels run from 0 to N-1, matching row positions in the embedding matrix.
df = df.reset_index(drop=True)

Required Columns Exist!


<div>
    <h3 align="left"><font color="red"> STEP 04: HNSW Index for GNN Embeddings </font></h3>
</div>

In [10]:
# Converting GNN embeddings to a NumPy array for Faiss.
embedding_matrix = np.stack(df['gnn_embedding'].values).astype(np.float32)  
embedding_dim = embedding_matrix.shape[1] 

index = faiss.IndexHNSWFlat(embedding_dim, 32)  # Create HNSW index with M=32 (neighbors per node); the default metric is L2.
index.hnsw.efConstruction = 200  # Larger values build a higher-quality graph at the cost of build time.
index.hnsw.efSearch = 100  # Larger values improve search accuracy at the cost of query time.
faiss.normalize_L2(embedding_matrix)  # Normalize rows in place; on unit vectors, L2 ranking matches cosine-similarity ranking.

index.add(embedding_matrix)  # Index all embeddings for similarity searches.
print(f"Indexed {embedding_matrix.shape[0]} compounds.")

# Saving the index to a file for later use.
faiss.write_index(index, "gnn_hnsw_index.faiss")

Indexed 14823 compounds.
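The index above uses L2 distance on L2-normalized vectors. For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity, so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks that identity in NumPy and adds a brute-force top-k that could serve as an exact baseline for the HNSW results. The random `vectors` array is a stand-in for `embedding_matrix`, and `exact_top_k` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 16)).astype(np.float32)    # stand-in for embedding_matrix
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # row-wise L2 normalization

query = vectors[7:8]  # shape (1, 16), already unit length

cosine = (vectors @ query.T).ravel()
sq_l2 = ((vectors - query) ** 2).sum(axis=1)

# Identity for unit vectors: squared L2 distance = 2 - 2 * cosine similarity.
assert np.allclose(sq_l2, 2.0 - 2.0 * cosine, atol=1e-5)

def exact_top_k(matrix: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Brute-force neighbours, ordered by ascending squared L2 distance."""
    d = ((matrix - q) ** 2).sum(axis=1)
    return np.argsort(d)[:k]

top = exact_top_k(vectors, query, 5)
print(top[0])  # 7: the query row is its own nearest neighbour
```

At roughly 15k compounds an exhaustive scan like this is still cheap, which makes it a practical reference when tuning `efSearch`.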


<div>
    <h3 align="left"><font color="red"> STEP 05: HNSW Search Function </font></h3>
</div>

In [11]:
# Defining a function to find compounds similar to a query fingerprint using the HNSW index.
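def search_similar_compounds(query_fp, gin_model, index, top_k=5, device='cpu'):
    """
    Search for compounds similar to the query fingerprint using the HNSW index.

    Embeds `query_fp` (a 2048-bit Morgan fingerprint array) with `gin_model`,
    L2-normalizes the embedding, and returns the `standard_smiles` of the
    `top_k` nearest rows of the global `df`. Returns [] on error.
    """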
    try:
        # Setting up the GNN model to generate embeddings for the query.
        gin_model.eval() 
        gin_model.to(device) 

        query_fp = np.array(query_fp, dtype=np.float32)  
        node_feat = torch.FloatTensor(query_fp).unsqueeze(0).to(device) 
        edge_index = torch.tensor([[0], [0]], dtype=torch.long).to(device)  # Create self-loop for single-node graph.
        data = Data(x=node_feat, edge_index=edge_index)  # Wrap in Data object.
        batch = torch.zeros(1, dtype=torch.long).to(device)  # Batch tensor for single graph.

        with torch.no_grad(): 
            query_embedding = gin_model(data.x, data.edge_index, batch).cpu().numpy()  # Get 256-dimensional embedding.
        
        query_embedding = query_embedding.astype(np.float32) 
        faiss.normalize_L2(query_embedding)

        # Searching for the top_k most similar compounds.
        _, indices = index.search(query_embedding, top_k)  

        # Retrieving the SMILES strings of similar compounds.
        similar_smiles = df.iloc[indices[0]]['standard_smiles'].values.tolist() 
        return similar_smiles
    
    except Exception as e:
        print(f"Error during similarity search: {e}")
        return []  

print("Similar Compound Search Function made!")

Similar Compound Search Function made!
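HNSW is an approximate index, so the neighbours it returns can differ from those of an exhaustive scan. A simple way to quantify the gap is recall@k: the fraction of the exact top-k neighbours that the approximate search also found. The sketch below is plain Python. The `recall_at_k` helper and the two index lists are hypothetical and not produced by the notebook.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical row indices for one query (illustrative values only).
approx = [12, 7, 93, 41, 5]   # e.g. from index.search(...)
exact = [12, 7, 93, 41, 88]   # e.g. from a brute-force scan

print(recall_at_k(approx, exact))  # 0.8 (4 of the 5 exact neighbours were found)
```

Averaging this value over a sample of queries gives a single number to compare different `efSearch` settings.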


<div>
    <h4 align="left"><font color="green"> Example Search Using HNSW </font></h4>
</div>

In [12]:
print("\nSearching for similar compounds...")

# Testing the similarity search with a sample SMILES string.
query_smiles = "NS(=O)(=O)N1CCC(NC(=O)c2cnn3ccc(N4CCCC4c4cc(F)ccc4F)nc23)CC1"
query_mol = Chem.MolFromSmiles(query_smiles)  # Convert SMILES to RDKit molecule.
if query_mol is None:
    print("Error: Invalid query SMILES string.")
else:
    query_fp = generate_morgan_fingerprint(query_mol)  # Generate Morgan fingerprint for query.
    if query_fp is None:
        print("Error: Failed to generate fingerprint for query molecule.")
    else:
        # Using the search function to find similar compounds.
        similar_compounds = search_similar_compounds(query_fp, gin_model, index, top_k=5, device=device)  # Find top 5 similar compounds.
        print("\nTop 5 Similar Compounds:")
        for i, smiles in enumerate(similar_compounds, 1):
            print(f"{i}. {smiles}")


Searching for similar compounds...

Top 5 Similar Compounds:
1. Cc1ccc2cccnc2c1
2. C=CC12COC(=O)C(=C)C1C1OC(=O)C(=C)C1C(O)C2
3. Nc1ncnc2ncn(C(c3ccccc3)c3ccccc3)c12
4. CC(C)CC(N)C(=O)NC(CC(C)C)C(=O)NC(C)C(=O)NC(Cc1ccccc1)C(=O)O
5. COC(=O)C(C)NP(=O)(OCC1C=CC(n2cc(C)c(=O)[nH]c2=O)O1)Oc1cccc(I)c1
