# Lab 3: Drug Dose Response

### Steps 

* Select CCLE as the study I will test and validate on. 
* Randomly select ten drugs from the CCLE study to run the models on.
* Join drug response data, which contains the tumor growth (y-variable), with RNA-Seq and SNP data to generate a robust feature set. 
* Given the continuous nature of the y-variable, run a set of regression models: Logistic Regression, Decision Trees, and Naive Bayes. 
* Find the best model and apply it to ten drugs pulled from the other studies. 

In [59]:
import pandas as pd
import random

### Read in dose response, RNA-Seq, and SNP data for 10 random drugs

I'll cut each dose response table down to 10 drugs, then stitch the result with the SNP and RNA-Seq oncogene data.

In [2]:
CCLE_dose_response = pd.read_csv('Data/dose_response/CCLE_dose_response1', sep='\t')
drugs = list(CCLE_dose_response['DRUG_ID'].unique())  # distinct drug IDs, not one entry per row
ten_rando_drugs = random.sample(drugs, 10)
CCLE_10 = CCLE_dose_response[CCLE_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
CCLE_10['DRUG_ID'].value_counts()

CCLE.8     4027
CCLE.7     4024
CCLE.1     4022
CCLE.3     4021
CCLE.24    4020
CCLE.15    4016
CCLE.11    4000
CCLE.20    3959
CCLE.21    3679
CCLE.17    3469
Name: DRUG_ID, dtype: int64
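
Because `DRUG_ID` repeats once per measurement, sampling from the raw column weights drugs by row count and can return the same drug more than once. A minimal sketch of drawing distinct drugs with a fixed seed follows. The helper name `sample_drugs`, its `seed` parameter, and the `GROWTH` column name in the toy frame are assumptions for illustration, not part of the notebook.

```python
import random

import pandas as pd


def sample_drugs(dose_response: pd.DataFrame, n: int = 10, seed: int = 0) -> pd.DataFrame:
    """Return the rows for n distinct, randomly chosen drugs (hypothetical helper)."""
    # Sort the unique IDs so the draw depends only on the seed, not on row order.
    drug_ids = sorted(dose_response['DRUG_ID'].unique())
    rng = random.Random(seed)
    chosen = rng.sample(drug_ids, n)
    return dose_response[dose_response['DRUG_ID'].isin(chosen)]


# Toy frame standing in for a dose response table: 15 drugs, 3 rows each.
toy = pd.DataFrame({
    'DRUG_ID': [f'CCLE.{i}' for i in range(1, 16) for _ in range(3)],
    'GROWTH': [0.0] * 45,  # placeholder values; the real column name may differ
})
subset = sample_drugs(toy, n=10, seed=42)
print(subset['DRUG_ID'].nunique())
```

Fixing the seed makes the drug selection repeatable across notebook runs, which the bare `random.sample` calls are not.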

In [3]:
CTRP_dose_response = pd.read_csv('Data/dose_response/CTRP_dose_response1', sep='\t')
drugs = list(CTRP_dose_response['DRUG_ID'].unique())  # distinct drug IDs, not one entry per row
ten_rando_drugs = random.sample(drugs, 10)
CTRP_10 = CTRP_dose_response[CTRP_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
CTRP_10['DRUG_ID'].value_counts()

CTRP.525    13724
CTRP.366    13480
CTRP.256    13434
CTRP.185    13337
CTRP.323    13293
CTRP.169    12961
CTRP.343    12845
CTRP.534    12352
CTRP.4      12163
CTRP.91      3281
Name: DRUG_ID, dtype: int64

In [4]:
gCSI_dose_response = pd.read_csv('Data/dose_response/gCSI_dose_response1', sep='\t')
drugs = list(gCSI_dose_response['DRUG_ID'].unique())  # distinct drug IDs, not one entry per row
ten_rando_drugs = random.sample(drugs, 10)
gCSI_10 = gCSI_dose_response[gCSI_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
gCSI_10['DRUG_ID'].value_counts()

gCSI.12    3681
gCSI.13    3681
gCSI.6     3645
gCSI.10    3645
gCSI.16    3627
gCSI.2     3627
gCSI.5     3618
gCSI.8     3600
gCSI.7     3599
gCSI.1     3582
Name: DRUG_ID, dtype: int64

In [5]:
GDSC_dose_response = pd.read_csv('Data/dose_response/GDSC_dose_response1', sep='\t')
drugs = list(GDSC_dose_response['DRUG_ID'].unique())  # distinct drug IDs, not one entry per row
ten_rando_drugs = random.sample(drugs, 10)
GDSC_10 = GDSC_dose_response[GDSC_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
GDSC_10['DRUG_ID'].value_counts()

GDSC.1072    17532
GDSC.1032    12850
GDSC.346      8802
GDSC.294      8802
GDSC.258      8775
GDSC.135      8388
GDSC.1052     8010
GDSC.1038     8001
GDSC.9        3852
GDSC.87       3843
Name: DRUG_ID, dtype: int64

In [6]:
NCI_dose_response = pd.read_csv('Data/dose_response/NCI60_dose_response1', sep='\t')
drugs = list(NCI_dose_response['DRUG_ID'].unique())  # distinct drug IDs, not one entry per row
ten_rando_drugs = random.sample(drugs, 10)
NCI_10 = NCI_dose_response[NCI_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
NCI_10['DRUG_ID'].value_counts()

NSC.1507      295
NSC.7782      295
NSC.650772    295
NSC.13480     295
NSC.757794    295
NSC.648585    295
NSC.95503     295
NSC.758617    295
NSC.717458    295
NSC.676449    295
Name: DRUG_ID, dtype: int64

### Stitch it together 

In [12]:
snp = pd.read_csv('Data/snps/combo_snp_linc1000', sep = '\t')
snp.shape

(4085, 2365)

In [13]:
rna_seq = pd.read_csv('Data/rna-seq/combined_rnaseq_data_lincs1000', sep = '\t')
rna_seq.shape

(15196, 943)

CCLE

In [15]:
CCLE_rnaseq = pd.merge(rna_seq, CCLE_10, left_on='Sample', right_on='CELLNAME')
CCLE_rnaseq.shape

(36872, 949)

In [17]:
CCLE_final = pd.merge(CCLE_rnaseq, snp, left_on='Sample', right_on='Sample')
CCLE_final.shape

(30597, 3313)
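
The inner join with the SNP table drops rows (36,872 before, 30,597 after), presumably because some cell lines have no SNP record. A sketch of how one might count the unmatched samples with the `indicator` option of `pd.merge` is shown below, on toy frames whose sample names and feature columns are invented for illustration.

```python
import pandas as pd

# Toy stand-ins; the sample names and feature columns are made up.
rnaseq_toy = pd.DataFrame({'Sample': ['A', 'A', 'B', 'C'],
                           'gene_1': [0.1, 0.1, 0.5, 0.9]})
snp_toy = pd.DataFrame({'Sample': ['A', 'B'], 'snp_1': [0, 1]})

# A left join with indicator=True adds a `_merge` column recording where each row matched.
checked = pd.merge(rnaseq_toy, snp_toy, on='Sample', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Sample'].unique()
print(f'{len(missing)} sample(s) lack SNP data: {list(missing)}')

# Keeping only the matched rows leaves the same rows an inner join would keep.
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```

Listing the unmatched samples before dropping them makes it easier to tell whether the loss is spread evenly or concentrated in a few drugs or cell lines.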

In [49]:
CCLE_use = CCLE_final.drop(columns=['CELLNAME', 'Sample', 'CONCUNIT', 'EXPID'])
CCLE_use.shape

(30597, 3309)

CTRP

In [33]:
CTRP_rnaseq = pd.merge(rna_seq, CTRP_10, left_on='Sample', right_on='CELLNAME')
CTRP_rnaseq.shape

(111143, 949)

In [50]:
CTRP_use = CTRP_rnaseq.drop(columns=['CELLNAME', 'Sample', 'CONCUNIT', 'EXPID'])
CTRP_use.shape

(111143, 945)

gCSI

In [35]:
gCSI_rnaseq = pd.merge(rna_seq, gCSI_10, left_on='Sample', right_on='CELLNAME')
gCSI_rnaseq.shape

(31742, 949)

In [51]:
gCSI_use = gCSI_rnaseq.drop(columns=['CELLNAME', 'Sample', 'CONCUNIT', 'EXPID'])
gCSI_use.shape

(31742, 945)

GDSC

In [37]:
GDSC_rnaseq = pd.merge(rna_seq, GDSC_10, left_on='Sample', right_on='CELLNAME')
GDSC_rnaseq.shape

(55921, 949)

In [38]:
GDSC_final = pd.merge(GDSC_rnaseq, snp, left_on='Sample', right_on='Sample')
GDSC_final.shape

(37061, 3313)

In [52]:
GDSC_use = GDSC_final.drop(columns=['CELLNAME', 'Sample', 'CONCUNIT', 'EXPID'])
GDSC_use.shape

(37061, 3309)

NCI60

In [40]:
NCI_rnaseq = pd.merge(rna_seq, NCI_10, left_on='Sample', right_on='CELLNAME')
NCI_rnaseq.shape

(2950, 949)

In [41]:
NCI_final = pd.merge(NCI_rnaseq, snp, left_on='Sample', right_on='Sample')
NCI_final.shape

(2950, 3313)

In [53]:
NCI_use = NCI_final.drop(columns=['CELLNAME', 'Sample', 'CONCUNIT', 'EXPID'])
NCI_use.shape

(2950, 3309)
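
The per-study cells above repeat the same merge-then-drop pattern. One way to factor it out is sketched below. `stitch` is a hypothetical helper, and it assumes every dose response table carries the `CELLNAME`, `CONCUNIT`, and `EXPID` columns dropped above; the `GROWTH` column in the toy data is likewise an assumed name.

```python
from typing import Optional

import pandas as pd

ID_COLUMNS = ['CELLNAME', 'Sample', 'CONCUNIT', 'EXPID']


def stitch(dose: pd.DataFrame, rna_seq: pd.DataFrame,
           snp: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """Join dose response with RNA-Seq (and optionally SNP) features, then drop ID columns."""
    merged = pd.merge(rna_seq, dose, left_on='Sample', right_on='CELLNAME')
    if snp is not None:
        merged = pd.merge(merged, snp, on='Sample')
    return merged.drop(columns=ID_COLUMNS)


# Toy inputs with invented values, just to exercise the helper.
dose_toy = pd.DataFrame({'CELLNAME': ['A', 'B'], 'DRUG_ID': ['CCLE.1', 'CCLE.1'],
                         'CONCUNIT': ['M', 'M'], 'EXPID': ['e1', 'e2'],
                         'GROWTH': [10.0, -5.0]})
rna_toy = pd.DataFrame({'Sample': ['A', 'B', 'C'], 'gene_1': [0.1, 0.5, 0.9]})
snp_toy = pd.DataFrame({'Sample': ['A'], 'snp_1': [1]})

rna_only = stitch(dose_toy, rna_toy)           # RNA-Seq features only
rna_and_snp = stitch(dose_toy, rna_toy, snp_toy)  # RNA-Seq plus SNP features
print(rna_only.shape, rna_and_snp.shape)
```

With such a helper, `stitch(CCLE_10, rna_seq, snp)` would mirror the CCLE cells and `stitch(CTRP_10, rna_seq)` the CTRP cells.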

### Output to CSV

In [54]:
# `index` expects a bool; False skips the integer row index (DRUG_ID stays as a regular column)
CCLE_use.to_csv('CCLE_stitched.csv', index=False)

In [55]:
NCI_use.to_csv('NCI_stitched.csv', index=False)

In [56]:
GDSC_use.to_csv('GDSC_stitched.csv', index=False)

In [57]:
gCSI_use.to_csv('gCSI_stitched.csv', index=False)

In [58]:
CTRP_use.to_csv('CTRP_stitched.csv', index=False)
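
The five export cells could also be written as one loop. The sketch below uses toy frames and a scratch directory, and passes `index=False` on the assumption that the default integer row index is not needed in the files.

```python
import os
import tempfile

import pandas as pd

# Toy frames standing in for CCLE_use, NCI_use, and so on.
frames = {
    'CCLE': pd.DataFrame({'DRUG_ID': ['CCLE.1'], 'gene_1': [0.1]}),
    'NCI': pd.DataFrame({'DRUG_ID': ['NSC.1507'], 'gene_1': [0.7]}),
}

out_dir = tempfile.mkdtemp()  # scratch directory used only for this sketch

for name, frame in frames.items():
    path = os.path.join(out_dir, f'{name}_stitched.csv')
    frame.to_csv(path, index=False)

# Reading a file back confirms that no extra index column was written.
reloaded = pd.read_csv(os.path.join(out_dir, 'CCLE_stitched.csv'))
print(reloaded.columns.tolist())
```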