# GDSC Data Preprocessing for PaccMann

## Overview

This notebook preprocesses the GDSC (Genomics of Drug Sensitivity in Cancer) dataset for use with the PaccMann framework. The preprocessing steps are: splitting the drug response data and saving the splits for each of the 10 cross-validation (CV) runs, preparing the gene expression data, retrieving the SMILES strings, shuffling them across drugs, and saving the results as text files.

## Input Data

- **GDSC Data**: Drug response data with drug and gene features, already split into 10 CV splits for each of the three settings: mask cell, mask drug, and mask combination.

## Output

- **Train, Test, and Validation Data for Model Training and Prediction**: For each CV split in each setting, the data are divided into approximately 80% training, 10% testing, and 10% validation.
- **Gene Expression Data**: The same data as used for the other models, with minor changes for PaccMann.
- **SMILES**: Retrieved from PubChem, then randomly shuffled across drugs to mimic providing no drug information as input.


## How to Use

1. **Load the Data**: Ensure you have the GDSC dataset in the correct format.
2. **Run the Notebook**: Execute each cell in the notebook to preprocess the data.
3. **Output Files**: The preprocessed data are saved for training and testing the PaccMann model.








In [3]:
import pandas as pd
import os
import csv
import numpy as np
import random
from sklearn.model_selection import train_test_split

os.chdir('.../paccmann_predictor/gdsc_old')

## Response data: train, test, and validation

### Mask combination

In [5]:
path = r'.../paccmann_predictor/gdsc_old/mask_comb'

In [None]:
for i in range(10):
    fold = i + 1
    # Load this fold's training and validation data and harmonise the column names.
    maskcomb_train = pd.read_csv(f'mask_comb/train_cv_{fold}.csv')
    maskcomb_train = maskcomb_train.rename(columns={"Drug name": "drug", "Sanger ID": "cell_line"}).drop(['Origin_idx'], axis=1)
    maskcomb_val = pd.read_csv(f'mask_comb/valid_cv_{fold}.csv')
    maskcomb_val = maskcomb_val.rename(columns={"Drug name": "drug", "Sanger ID": "cell_line"}).drop(['Origin_idx'], axis=1)

    # Hold out a random 10% of the training rows as the test set.
    train_set, test_set = train_test_split(maskcomb_train, test_size=0.1, random_state=42)

    #print(f"Fold {fold}: Train: {len(train_set)},Test: {len(test_set)}")

    train_set.to_csv(os.path.join(path, f'maskcomb_train_{fold}.csv'))
    test_set.to_csv(os.path.join(path, f'maskcomb_test_{fold}.csv'))
    maskcomb_val.to_csv(os.path.join(path, f'maskcomb_valid_{fold}.csv'))
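
The proportions produced by this split can be checked on toy data. The sketch below assumes that each `valid_cv` file holds about 10% of all rows (as it would in a 10-fold CV; the notebook does not state this), so holding out 10% of the remaining rows for testing gives roughly 81% / 9% / 10%. The toy frame, its column values, and the `IC50` column name are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 drug/cell-line pairs, not real GDSC rows.
toy = pd.DataFrame({
    "drug": [f"drug_{k % 20}" for k in range(1000)],
    "cell_line": [f"cell_{k % 50}" for k in range(1000)],
    "IC50": [float(k) for k in range(1000)],
})

# Assumption: the validation part is 10% of all rows.
toy_valid = toy.iloc[:100]
toy_train_cv = toy.iloc[100:]

# Same call as in the loop above: hold out 10% of the training rows for testing.
toy_train, toy_test = train_test_split(toy_train_cv, test_size=0.1, random_state=42)

fractions = {
    "train": len(toy_train) / len(toy),
    "test": len(toy_test) / len(toy),
    "valid": len(toy_valid) / len(toy),
}
print(fractions)
```

Under that assumption, the 80% / 10% / 10% figures in the overview are a rounded description of the actual split.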

### Mask cell

In [None]:
path=r'/nas/longleaf/home/qhz/paccmann_predictor/gdsc_old/mask_cell'

In [None]:
for i in range(10):
    fold = i + 1
    # Load this fold's training and validation data and harmonise the column names.
    maskcell_train = pd.read_csv(f'mask_cell/train_cv_{fold}.csv')
    maskcell_train = maskcell_train.rename(columns={"Drug name": "drug", "Sanger ID": "cell_line"}).drop(['Origin_idx'], axis=1)
    maskcell_val = pd.read_csv(f'mask_cell/valid_cv_{fold}.csv')
    maskcell_val = maskcell_val.rename(columns={"Drug name": "drug", "Sanger ID": "cell_line"}).drop(['Origin_idx'], axis=1)

    # Split on unique cell lines so that no cell line appears in both train and test.
    cell_list = pd.unique(maskcell_train['cell_line'])
    train_cell, test_cell = train_test_split(cell_list, test_size=0.1, random_state=42)

    test_set = maskcell_train[maskcell_train['cell_line'].isin(test_cell)]
    train_set = maskcell_train[maskcell_train['cell_line'].isin(train_cell)]

    print(f"Fold {fold}: Train: {len(pd.unique(train_set['cell_line']))},Test: {len(pd.unique(test_set['cell_line']))}")

    train_set.to_csv(os.path.join(path, f'maskcell_train_{fold}.csv'))
    test_set.to_csv(os.path.join(path, f'maskcell_test_{fold}.csv'))
    maskcell_val.to_csv(os.path.join(path, f'maskcell_valid_{fold}.csv'))
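
Because the mask-cell setting is meant to keep test cell lines unseen during training, it can be worth confirming that the two sets share no cell line. The following is a minimal sketch on a made-up response table; the toy values are illustrative, and only the splitting logic mirrors the loop above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy response table: 10 cell lines, 3 drugs each.
toy = pd.DataFrame({
    "drug": ["A", "B", "C"] * 10,
    "cell_line": [f"cell_{k // 3}" for k in range(30)],
})

cells = pd.unique(toy["cell_line"])
train_cells, test_cells = train_test_split(cells, test_size=0.1, random_state=42)

toy_train = toy[toy["cell_line"].isin(train_cells)]
toy_test = toy[toy["cell_line"].isin(test_cells)]

# No cell line may appear on both sides, and no row may be lost.
shared_cells = set(toy_train["cell_line"]) & set(toy_test["cell_line"])
assert not shared_cells, f"cell lines in both sets: {shared_cells}"
assert len(toy_train) + len(toy_test) == len(toy)
```

The same two assertions could be placed inside the loop, applied to `train_set` and `test_set`, to fail early if a split ever leaks.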

### Mask drug

In [2]:
path=r'/nas/longleaf/home/qhz/paccmann_predictor/gdsc_old/mask_drug'

In [None]:
for i in range(10):
    fold = i + 1
    # Load this fold's training and validation data and harmonise the column names.
    maskdrug_train = pd.read_csv(f'mask_drug/train_cv_{fold}.csv').drop(['Origin_idx'], axis=1)
    maskdrug_val = pd.read_csv(f'mask_drug/valid_cv_{fold}.csv').drop(['Origin_idx'], axis=1)

    maskdrug_train = maskdrug_train.rename(columns={"Drug name": "drug", "Sanger ID": "cell_line"})
    maskdrug_val = maskdrug_val.rename(columns={"Drug name": "drug", "Sanger ID": "cell_line"})

    # Split on unique drugs so that no drug appears in both train and test.
    drug_list = pd.unique(maskdrug_train['drug'])
    train_drug, test_drug = train_test_split(drug_list, test_size=0.1, random_state=42)

    train_set = maskdrug_train[maskdrug_train['drug'].isin(train_drug)]
    test_set = maskdrug_train[maskdrug_train['drug'].isin(test_drug)]

    print(f"Fold {fold}: Train: {len(pd.unique(train_set['drug']))},Test: {len(pd.unique(test_set['drug']))}")

    train_set.to_csv(os.path.join(path, f'maskdrug_train_{fold}.csv'))
    test_set.to_csv(os.path.join(path, f'maskdrug_test_{fold}.csv'))
    maskdrug_val.to_csv(os.path.join(path, f'maskdrug_valid_{fold}.csv'))
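
The mask-cell and mask-drug loops differ only in the column they split on, so the shared logic could be factored into one helper. This is a sketch of that idea, not code from the notebook: the function name `split_by_group` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_group(frame, column, test_size=0.1, seed=42):
    """Split rows so that each unique value of `column` lands in one set only."""
    groups = pd.unique(frame[column])
    train_groups, test_groups = train_test_split(
        groups, test_size=test_size, random_state=seed
    )
    train_rows = frame[frame[column].isin(train_groups)]
    test_rows = frame[frame[column].isin(test_groups)]
    return train_rows, test_rows


# Hypothetical toy data: 10 drugs measured on 4 cell lines.
toy = pd.DataFrame({
    "drug": [f"drug_{k % 10}" for k in range(40)],
    "cell_line": [f"cell_{k % 4}" for k in range(40)],
})

by_drug_train, by_drug_test = split_by_group(toy, "drug")
print(by_drug_train["drug"].nunique(), by_drug_test["drug"].nunique())
```

Calling the helper with `"cell_line"` instead of `"drug"` would reproduce the mask-cell variant of the split.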

## Gene expression data

In [None]:
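gene_exp = pd.read_csv('gdsc_exp_common_forqh.csv')
# Drop the helper columns and index the rows by cell line (Sanger model ID).
# set_index(..., inplace=True) returns None, so the result is assigned directly instead.
df = gene_exp.drop(['Unnamed: 0', 'COSMIC_ID'], axis=1).set_index('SANGER_MODEL_ID')
# Write the index as well: with index=False the cell line IDs would be lost.
df.to_csv('gdsc_gene_exp.csv')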
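
A round trip on toy data is a quick way to check what ends up in the gene expression file. The sketch below drops the two helper columns, indexes the rows by `SANGER_MODEL_ID`, writes the index out, and reads the text back. The gene columns, IDs, and values are made up, and keeping the model ID as the first column of the output is an assumption about what the downstream loader expects.

```python
import io

import pandas as pd

# Hypothetical toy expression table; the gene columns and values are made up.
toy_exp = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "COSMIC_ID": [111, 222],
    "SANGER_MODEL_ID": ["SIDM00001", "SIDM00002"],
    "GENE_A": [1.5, 2.5],
    "GENE_B": [0.1, 0.2],
})

cleaned = toy_exp.drop(["Unnamed: 0", "COSMIC_ID"], axis=1).set_index("SANGER_MODEL_ID")

# Round-trip through CSV text to confirm the model IDs survive as the index.
buffer = io.StringIO()
cleaned.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```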

## New GDSC SMILES

In [1]:
import pandas as pd
import numpy as np
import requests
from itertools import product
from rdkit import Chem
from rdkit.Chem import AllChem

In [2]:
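def smiles_from_pubchem_cids(cids):
    """Fetch the canonical SMILES for a list of PubChem CIDs in one PUG REST request."""
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{','.join(map(str, cids))}/property/CanonicalSMILES/JSON"
    r = requests.get(url)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    # The caller relies on the entries coming back in the same order as `cids`.
    return [item["CanonicalSMILES"] for item in r.json()["PropertyTable"]["Properties"]]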
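
The list returned above is only useful if its order matches the input CIDs. Each entry of the PubChem property response also carries its `CID`, so a more defensive variant is to look up the SMILES by identifier. The sketch below works on an already-decoded JSON response. The helper name `smiles_by_cid` is hypothetical, and the sample payload is hand-written from two rows of the drug table in this notebook rather than fetched from PubChem.

```python
def smiles_by_cid(payload, smiles_key="CanonicalSMILES"):
    """Map each CID in a decoded PUG REST property response to its SMILES string."""
    properties = payload["PropertyTable"]["Properties"]
    return {item["CID"]: item[smiles_key] for item in properties}


# Hand-written sample in the same shape as the JSON parsed above.
sample_payload = {
    "PropertyTable": {
        "Properties": [
            {"CID": 6253, "CanonicalSMILES": "C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)O)O"},
            {"CID": 5702198, "CanonicalSMILES": "N.N.Cl[Pt]Cl"},
        ]
    }
}

lookup = smiles_by_cid(sample_payload)

# Look the SMILES up in whatever order the caller needs.
requested_cids = [5702198, 6253]
ordered_smiles = [lookup[cid] for cid in requested_cids]
print(ordered_smiles)
```

With a DataFrame, the same dictionary could be applied via `druginfos['cid'].map(lookup)`, which would leave a missing value for any CID absent from the response instead of silently shifting rows.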


In [3]:
druginfos = pd.read_csv('oldGDSC_druginfo.csv')

In [4]:
druginfos['smiles'] = smiles_from_pubchem_cids(druginfos['cid'])
druginfos

| | Unnamed: 0 | cid | drug_name | smiles |
|---|---|---|---|---|
| 0 | 0 | 24360 | Camptothecin | `CCC1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=C4C3...` |
| 1 | 1 | 13342 | Vinblastine | `CCC1(CC2CC(C3=C(CCN(C2)C1)C4=CC=CC=C4N3)(C5=C(...` |
| 2 | 2 | 5702198 | Cisplatin | `N.N.Cl[Pt]Cl` |
| 3 | 3 | 6253 | Cytarabine | `C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)O)O` |
| 4 | 4 | 148124 | Docetaxel | `CC1=C2C(C(=O)C3(C(CC4C(C3C(C(C2(C)C)(CC1OC(=O)...` |
| ... | ... | ... | ... | ... |
| 139 | 139 | 51000408 | VE821 | `CS(=O)(=O)C1=CC=C(C=C1)C2=CN=C(C(=N2)C(=O)NC3=...` |
| 140 | 140 | 44137675 | AZD6482 | `CC1=CN2C(=O)C=C(N=C2C(=C1)C(C)NC3=CC=CC=C3C(=O...` |
| 141 | 141 | 24905401 | AT13148 | `C1=CC(=CC=C1C2=CNN=C2)C(CN)(C3=CC=C(C=C3)Cl)O` |
| 142 | 142 | 24785538 | BMS-754807 | `CC1(CCCN1C2=NN3C=CC=C3C(=N2)NC4=NNC(=C4)C5CC5)...` |


In [5]:
data = {"canonical_smiles" : druginfos['smiles'], "drug": druginfos['drug_name']}
df = pd.DataFrame(data)
df

| | canonical_smiles | drug |
|---|---|---|
| 0 | `CCC1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=C4C3...` | Camptothecin |
| 1 | `CCC1(CC2CC(C3=C(CCN(C2)C1)C4=CC=CC=C4N3)(C5=C(...` | Vinblastine |
| 2 | `N.N.Cl[Pt]Cl` | Cisplatin |
| 3 | `C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)O)O` | Cytarabine |
| 4 | `CC1=C2C(C(=O)C3(C(CC4C(C3C(C(C2(C)C)(CC1OC(=O)...` | Docetaxel |
| ... | ... | ... |
| 139 | `CS(=O)(=O)C1=CC=C(C=C1)C2=CN=C(C(=N2)C(=O)NC3=...` | VE821 |
| 140 | `CC1=CN2C(=O)C=C(N=C2C(=C1)C(C)NC3=CC=CC=C3C(=O...` | AZD6482 |
| 141 | `C1=CC(=CC=C1C2=CNN=C2)C(CN)(C3=CC=C(C=C3)Cl)O` | AT13148 |
| 142 | `CC1(CCCN1C2=NN3C=CC=C3C(=N2)NC4=NNC(=C4)C5CC5)...` | BMS-754807 |


In [9]:
np.savetxt(r'gdsc_smile.txt', df, fmt='%s', delimiter='\t')
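
The file written here is tab-separated with no header row, with the SMILES string first and the drug name second. A minimal sketch of writing and reading back that layout is shown below. It uses two rows copied from the table above and a temporary file, and it assumes that drug names never contain a tab character.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Two rows copied from the table above; the file path is a temporary one.
toy_smiles = pd.DataFrame({
    "canonical_smiles": ["N.N.Cl[Pt]Cl", "C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)O)O"],
    "drug": ["Cisplatin", "Cytarabine"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    smiles_path = os.path.join(tmp_dir, "toy_smile.txt")
    np.savetxt(smiles_path, toy_smiles.to_numpy(), fmt="%s", delimiter="\t")
    # The file has no header row, so column names are supplied when reading.
    reloaded_smiles = pd.read_csv(
        smiles_path, sep="\t", header=None, names=["canonical_smiles", "drug"]
    )

print(reloaded_smiles)
```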

In [7]:
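import random

# Shuffle a copy of the drug names so that df itself is left untouched.
drug_list = df['drug'].tolist()
# shuffle() works in place and returns None, so its result is not assigned.
# A dedicated generator seeded with 316 keeps the permutation reproducible.
random.Random(316).shuffle(drug_list)
shuff_list = pd.Series(drug_list, name='drug')
shuff_list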

0      Temozolomide
1             VE821
2       Talazoparib
3            VX-11e
4          PD173074
           ...     
139           AZ960
140       Cisplatin
141          LGK974
142      Cytarabine
143      Navitoclax
Name: drug, Length: 144, dtype: object

In [8]:
# Pair each SMILES string with a shuffled drug name (rows are aligned on the default integer index).
shuf_smiles = pd.DataFrame(data={"canonical_smiles": df['canonical_smiles'], "drug": shuff_list})

In [None]:
np.savetxt(r'gdsc_smile_random.txt', shuf_smiles, fmt='%s', delimiter='\t') 
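
A random shuffle can leave some drugs paired with their own SMILES by chance, which matters if the shuffled file is meant to carry no drug information at all. The sketch below counts such unchanged pairs on a made-up table. The drug names are hypothetical, the SMILES strings are simple placeholders, and only the seed 316 is taken from the notebook.

```python
import random

import pandas as pd

# Hypothetical toy table; the SMILES strings are placeholders, not real lookups.
toy = pd.DataFrame({
    "canonical_smiles": ["C", "CC", "CCC", "CCCC", "CCCCC"],
    "drug": ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"],
})

names = toy["drug"].tolist()
random.Random(316).shuffle(names)  # in-place shuffle of the copied list
toy_shuffled = pd.DataFrame({"canonical_smiles": toy["canonical_smiles"], "drug": names})

# Rows where the shuffled name equals the original name still carry the correct SMILES.
unchanged = int((toy_shuffled["drug"] == toy["drug"]).sum())
print(f"{unchanged} of {len(toy)} drugs kept their own SMILES")
```

Running the same comparison on `shuf_smiles` and `df` would show how many of the drugs are affected in the real file.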