#### SESSION 2

1. oncoPredict
2. scIDUC

Drug synergy
1. CancerGPT

#### DRUG RESPONSE PREDICTION
1. Individual genetic profiles: gene expression levels (which genes are actively transcribed)
2. Gene expression data: how a patient responds to a drug (measured with microarrays or NGS technologies)
   * Other data types too (genomic mutations, copy-number alterations)
   * Goal: select the most effective medication with the fewest side effects
   * Drug response prediction from gene expression data can improve clinical outcomes

     Challenges:
     1. High dimensionality (too many genes/features relative to samples, which leads to overfitting)
     2. Hence feature selection is needed (e.g. random forest)
     3. Interpretable decisions (hard with deep neural nets)
     4. Data heterogeneity and quality (depends on how the data were generated, etc.)
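
One simple way to reduce the dimensionality problem listed above is to filter genes before modeling. The sketch below keeps only the most variable genes. It is a deliberately simpler stand-in for the random-forest-based selection mentioned in the notes, and the function name `top_variable_genes` and the toy gene names are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expression, k):
    """Return the names of the k genes with the highest variance across samples.

    `expression` maps a gene name to its list of expression values
    (one value per sample). Illustrative filter only; this is not the
    random-forest selection mentioned in the notes.
    """
    ranked = sorted(expression, key=lambda gene: pvariance(expression[gene]), reverse=True)
    return ranked[:k]

# Toy data: three hypothetical genes measured in four samples
expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],    # constant, carries no signal
    "GENE_B": [0.0, 10.0, 0.0, 10.0],  # highly variable
    "GENE_C": [4.0, 5.0, 4.0, 5.0],    # mildly variable
}
selected = top_variable_genes(expression, k=2)
print(selected)
```

A variance filter ignores the drug response entirely, so in practice it would usually be combined with a supervised selection step.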
        

#### R package oncoPredict
https://cran.r-project.org/web/packages/oncoPredict/index.html

#### How drug response is imputed
* Ridge regression fit on expression and drug response data
* CTRP (pan-cancer cell line drug atlas) - https://www.cancer.gov/about-nci/organization/ccct/ctrp
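
To make the ridge regression idea concrete, here is a minimal, dependency-free sketch of closed-form ridge: solve (XᵀX + λI)w = Xᵀy on training data (expression and measured response), then apply w to new expression profiles. This is not oncoPredict's implementation. The names `ridge_fit` and `ridge_predict`, the toy numbers, and the omission of an intercept (data assumed centered) are all assumptions made for illustration; a real analysis would use a numerical library.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    X is a list of samples, each a list of feature (gene) values.
    No intercept is fitted, so the data are assumed to be centered.
    """
    n_samples, n_features = len(X), len(X[0])
    # Build the regularized normal equations A w = b
    A = [[sum(X[i][j] * X[i][k] for i in range(n_samples)) + (lam if j == k else 0.0)
          for k in range(n_features)] for j in range(n_features)]
    b = [sum(X[i][j] * y[i] for i in range(n_samples)) for j in range(n_features)]
    # Forward elimination with partial pivoting
    for col in range(n_features):
        pivot = max(range(col, n_features), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n_features):
            factor = A[row][col] / A[col][col]
            for c in range(col, n_features):
                A[row][c] -= factor * A[col][c]
            b[row] -= factor * b[col]
    # Back substitution
    w = [0.0] * n_features
    for row in range(n_features - 1, -1, -1):
        tail = sum(A[row][c] * w[c] for c in range(row + 1, n_features))
        w[row] = (b[row] - tail) / A[row][row]
    return w

def ridge_predict(X, w):
    """Predict a response for each sample as the dot product with w."""
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) for row in X]

# Toy training set: 3 cell lines x 2 genes, with a measured drug response
cell_line_expr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cell_line_response = [1.0, 2.0, 3.0]
weights = ridge_fit(cell_line_expr, cell_line_response, lam=1.0)

# Impute the response for a new (hypothetical) patient expression profile
patient_expr = [[1.0, 1.0]]
predicted = ridge_predict(patient_expr, weights)
print(weights, predicted)
```

With `lam=0.0` the same toy data are fit exactly; increasing `lam` shrinks the weights toward zero, which is what guards against overfitting when genes outnumber samples.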

#### scIDUC - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634928/
* Transfer-learning-based approach that integrates bulk and single-cell datasets
* Integrating bulk and single-cell data captures the gene expression patterns they share
* train on integrated data then

##### Integration step
1. CCA via SVD (apply SVD to the cross-covariance matrix); the output is a set of metagenes
2. Get low-dimensional embeddings of the single-cell and bulk data matrices
3. PRISM: drug screen dataset, download from DepMap - https://depmap.org/portal/download/custom/
4. DRG score (drug response score)
5. Rho statistics and Cohen's d to measure effect size (higher values are better)
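
For the effect-size measure in the last step, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below is a minimal illustration. The group labels and scores are made up, and which groups are actually compared is an assumption, since the notes do not say.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical predicted response scores for two groups of cells
group_1 = [2.0, 4.0, 6.0]
group_2 = [1.0, 3.0, 5.0]
d = cohens_d(group_1, group_2)
print(f"Cohen's d = {d:.2f}")
```

A larger absolute d means the two groups are separated more clearly relative to their spread.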

##### How to run scIDUC
1. Select the genes to model
2. Preprocess single-cell data: CPM normalization and log transform
3. Bootstrap sampling approach
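
The preprocessing step can be sketched as follows: scale each cell's raw counts to counts per million (CPM), then log-transform. The base-2 logarithm, the pseudocount of 1, and the function name `log_cpm` are assumptions for illustration; the notes only say "log and CPM".

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """Convert one cell's raw gene counts to log2(CPM + pseudocount)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no counts")
    return [math.log2(c / total * 1_000_000 + pseudocount) for c in counts]

# Toy cell with four genes; the last gene is not detected
cell = [500_000, 250_000, 250_000, 0]
normalized = log_cpm(cell)
print(normalized)
```

The pseudocount keeps undetected genes at a finite value (0 here) instead of taking the log of zero.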

#### DRUG SYNERGY
CancerGPT 

1. accelerated discovery

In [2]:
# Part 1: Obtain Data
#______________________________________________________________________________
'''
This setup will allow you to effectively access chemical data from PubChem.
'''

### Step 1: Install `pubchempy`

#pip install pubchempy

### Step 2: Using `pubchempy` to Retrieve SMILES
'''
Here's a basic example of how you can retrieve the SMILES 
string for a specific compound by its name, 
CAS number, or CID (Compound ID).
'''

In [3]:
#### Example 1: Retrieve SMILES by Compound Name

import pubchempy as pcp

def get_smiles_by_name(compound_name):
    try:
        compound = pcp.get_compounds(compound_name, 'name')[0]  # Get the first matching compound
        return compound.isomeric_smiles  # Return the isomeric SMILES string
    except IndexError:
        return "Compound not found"

# Example usage
compound_name = 'Aspirin'
smiles_string = get_smiles_by_name(compound_name)
print(f'SMILES for {compound_name}: {smiles_string}')
# SMILES for Aspirin: CC(=O)OC1=CC=CC=C1C(=O)O

#### Example 2: Retrieve SMILES by CID

def get_smiles_by_cid(cid):
    try:
        compound = pcp.Compound.from_cid(cid)
        return compound.isomeric_smiles
    except Exception as e:
        return str(e)

# Example usage
cid = 2244  # CID for Aspirin
smiles_string = get_smiles_by_cid(cid)
print(f'SMILES for CID {cid}: {smiles_string}')
# SMILES for CID 2244: CC(=O)OC1=CC=CC=C1C(=O)O

### Step 3: Handling Multiple Compounds and Advanced Queries
'''
Note: PubChemPy allows for more complex queries and handling multiple results.
'''
def search_smiles(query):
    compounds = pcp.get_compounds(query, 'name')
    smiles_list = [comp.isomeric_smiles for comp in compounds if comp.isomeric_smiles is not None]
    return smiles_list

# Example usage
query = 'benzene'
smiles_results = search_smiles(query)
print(f'SMILES for {query}: {smiles_results}')
# SMILES for benzene: ['C1=CC=CC=C1']



SMILES for Aspirin: CC(=O)OC1=CC=CC=C1C(=O)O
SMILES for CID 2244: CC(=O)OC1=CC=CC=C1C(=O)O
SMILES for benzene: ['C1=CC=CC=C1']


In [6]:
# Part 2: Train LLM
#______________________________________________________________________________

import torch
from transformers import BertTokenizer, BertForMaskedLM
from torch.utils.data import DataLoader, Dataset
from rdkit import Chem

# Create a custom dataset for SMILES strings
class SmilesDataset(Dataset):
    def __init__(self, smiles, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        # Pad every sequence to max_length so the default DataLoader collate
        # function can stack them; sequences of unequal length would raise an error.
        self.data = [
            tokenizer.encode(smile, max_length=max_length, truncation=True, padding='max_length')
            for smile in smiles
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.tensor(self.data[idx], dtype=torch.long)

# Sample data (usually you'll have much more data)
smiles_data = ["CCO", "O=C(O)c1ccccc1C(=O)O", "CCCC"]

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')




In [8]:
# Prepare dataset and dataloader
dataset = SmilesDataset(smiles_data, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Example training loop (simplified)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)


In [None]:
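# Note: no input tokens are masked here, so this is a simplified stand-in
# for real masked-language-model training.
for epoch in range(2):  # This would be much higher in a real scenario
    for batch in dataloader:
        # Ignore padding, both in attention and in the loss
        attention_mask = (batch != tokenizer.pad_token_id).long()
        labels = batch.clone()
        labels[attention_mask == 0] = -100  # positions labeled -100 are skipped by the loss
        optimizer.zero_grad()
        outputs = model(input_ids=batch, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        print(f"Training loss: {loss.item()}")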

In [10]:
# Generating a new molecule (basic example) aka a SMILES string
model.eval()
sampled_smiles = "CC(C)"
input_ids = tokenizer.encode(sampled_smiles, return_tensors="pt")
with torch.no_grad():
    predictions = model(input_ids)[0]  # logits, shape (batch, seq_len, vocab_size)
    # The last position holds the [SEP] token that encode() appends, and BERT is
    # not an autoregressive model, so this is only a crude "next token" guess.
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    predicted_token = tokenizer.decode([predicted_index])
    new_smiles = sampled_smiles + predicted_token

# Validate new SMILES
new_smiles = new_smiles.strip(",.!? ")  # Remove potentially problematic characters
new_mol = Chem.MolFromSmiles(new_smiles)  # Returns None if the string cannot be parsed
if new_mol is not None:
    print(f"Generated valid SMILES: {new_smiles}")
else:
    print("Generated invalid SMILES")
# In the run recorded below, the appended token was the English word piece
# "topical", giving 'CC(C)topical', which RDKit rejects.
"""
An output like "Generated valid SMILES: ..." means RDKit parsed the
string into a molecule; "Generated invalid SMILES" means parsing failed.
"""




Generated invalid SMILES


[15:47:13] SMILES Parse Error: syntax error while parsing: CC(C)topical
[15:47:13] SMILES Parse Error: Failed parsing SMILES 'CC(C)topical' for input: 'CC(C)topical'



In [12]:
new_smiles

'CC(C)topical'
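
The invalid result above is what one would expect from `bert-base-uncased`: its WordPiece vocabulary was built from English text, so the most likely token is an English word piece rather than a SMILES symbol. One possible remedy, sketched here as an assumption rather than as part of the original session, is to tokenize SMILES into chemically meaningful symbols with a regular expression and build the model vocabulary from those tokens. The pattern and the name `tokenize_smiles` are illustrative and cover only common SMILES syntax.

```python
import re

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# ring-closure digits, and bond/branch symbols (a simplified pattern).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/.:~@*$])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; reject unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

aspirin_tokens = tokenize_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")
print(aspirin_tokens)

try:
    tokenize_smiles("CC(C)topical")
    rejected = False
except ValueError:
    rejected = True  # the English word piece is caught at the token level
print(rejected)
```

Passing this tokenizer does not make a string chemically valid, so a check such as `Chem.MolFromSmiles` is still needed afterwards.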