# Lab 3: Drug Dose Response

### Steps 

* Select CCLE as the study I will test and validate on. 
* Randomly select ten drugs from the CCLE study to run the models on.
* Join drug response data, which contains the tumor growth (y-variable), with RNA-Seq and SNP data to generate a robust feature set. 
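* Given the continuous nature of the y-variable, run a set of models: Logistic Regression, Decision Trees, and Naive Bayes. Logistic Regression and Naive Bayes are classifiers, so the growth values must be binned into classes before fitting them.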
* Find the best model and apply it to ten drugs pulled from the other studies. 

In [22]:
import pandas as pd
import random

### Read in dose response, RNA-Seq, and SNP data for 10 random drugs

I'll first cut each dose response table down to 10 drugs, then stitch the result with the SNP and RNA-Seq oncogene data.

In [61]:
CCLE_dose_response = pd.read_csv('Data/dose_response/CCLE_dose_response1', sep='\t')
# DRUG_ID repeats once per measurement, so sample from the distinct IDs.
drugs = list(CCLE_dose_response['DRUG_ID'].unique())
ten_rando_drugs = random.sample(drugs, 10)
CCLE_10 = CCLE_dose_response[CCLE_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
CCLE_10['DRUG_ID'].value_counts()

CCLE.23    4031
CCLE.16    4027
CCLE.4     4022
CCLE.1     4022
CCLE.24    4020
CCLE.18    4019
CCLE.19    4016
CCLE.10    3963
CCLE.9     3924
CCLE.13    2536
Name: DRUG_ID, dtype: int64
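
Because `DRUG_ID` holds one entry per measurement, drawing from the raw column can pick the same drug more than once. The sketch below illustrates the difference with a made-up list of IDs; the ID values and the seed are illustrative assumptions, not taken from the CCLE file.

```python
import random

# Hypothetical DRUG_ID column: one entry per measurement, so IDs repeat.
drug_column = ['D.1'] * 50 + ['D.2'] * 50 + ['D.3'] * 50 + ['D.4'] * 50

rng = random.Random(0)  # fixed seed so the draw is repeatable

# Sampling from the raw column picks distinct rows,
# but not necessarily distinct drugs.
raw_draw = rng.sample(drug_column, 3)

# Sampling from the de-duplicated IDs always yields distinct drugs.
unique_ids = sorted(set(drug_column))
unique_draw = rng.sample(unique_ids, 3)

print(len(set(raw_draw)), 'distinct drugs from the raw column')
print(len(set(unique_draw)), 'distinct drugs from the unique IDs')
```

Seeding the generator also makes the selection reproducible across notebook runs.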

In [63]:
CTRP_dose_response = pd.read_csv('Data/dose_response/CTRP_dose_response1', sep='\t')
# DRUG_ID repeats once per measurement, so sample from the distinct IDs.
drugs = list(CTRP_dose_response['DRUG_ID'].unique())
ten_rando_drugs = random.sample(drugs, 10)
CTRP_10 = CTRP_dose_response[CTRP_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
CTRP_10['DRUG_ID'].value_counts()

CTRP.438    13643
CTRP.162    13638
CTRP.27     13632
CTRP.23     13594
CTRP.400    13459
CTRP.490    13447
CTRP.256    13434
CTRP.182    13421
CTRP.57     13315
CTRP.225    12789
Name: DRUG_ID, dtype: int64

In [73]:
gCSI_dose_response = pd.read_csv('Data/dose_response/gCSI_dose_response1', sep='\t')
# DRUG_ID repeats once per measurement, so sample from the distinct IDs.
drugs = list(gCSI_dose_response['DRUG_ID'].unique())
ten_rando_drugs = random.sample(drugs, 10)
gCSI_10 = gCSI_dose_response[gCSI_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
gCSI_10['DRUG_ID'].value_counts()

gCSI.12    3681
gCSI.3     3681
gCSI.10    3645
gCSI.15    3627
gCSI.16    3627
gCSI.11    3618
gCSI.14    3618
gCSI.9     3618
gCSI.5     3618
gCSI.8     3600
Name: DRUG_ID, dtype: int64

In [75]:
GDSC_dose_response = pd.read_csv('Data/dose_response/GDSC_dose_response1', sep='\t')
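# DRUG_ID repeats once per measurement, so sampling the raw column can draw
# the same drug twice; this run returned nine distinct drugs.
drugs = list(GDSC_dose_response['DRUG_ID'].values)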
ten_rando_drugs = random.sample(drugs, 10)
GDSC_10 = GDSC_dose_response[GDSC_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
GDSC_10['DRUG_ID'].value_counts()

GDSC.150     12948
GDSC.1014    12538
GDSC.1058    12529
GDSC.249      8802
GDSC.269      8775
GDSC.182      8388
GDSC.1020     8055
GDSC.1052     8010
GDSC.1039     7893
Name: DRUG_ID, dtype: int64

In [77]:
NCI_dose_response = pd.read_csv('Data/dose_response/NCI60_dose_response1', sep='\t')
# DRUG_ID repeats once per measurement, so sample from the distinct IDs.
drugs = list(NCI_dose_response['DRUG_ID'].unique())
ten_rando_drugs = random.sample(drugs, 10)
NCI_10 = NCI_dose_response[NCI_dose_response['DRUG_ID'].isin(ten_rando_drugs)]
NCI_10['DRUG_ID'].value_counts()

NSC.740475    590
NSC.2805      295
NSC.690605    295
NSC.676927    295
NSC.743059    295
NSC.4432      295
NSC.37202     295
NSC.647591    295
NSC.61815     295
NSC.716684    295
Name: DRUG_ID, dtype: int64
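
The five cells above repeat the same pattern, so it could be wrapped in one helper. This is only a sketch: `sample_drugs` and the toy table are hypothetical names introduced here, and the helper assumes the frame has a `DRUG_ID` column like the dose response files.

```python
import random
import pandas as pd

def sample_drugs(dose_response, n=10, seed=None):
    """Keep the rows for n distinct, randomly chosen drugs."""
    rng = random.Random(seed)
    # Sample from the distinct IDs so no drug is drawn twice.
    unique_ids = sorted(dose_response['DRUG_ID'].unique())
    chosen = rng.sample(unique_ids, n)
    return dose_response[dose_response['DRUG_ID'].isin(chosen)]

# Tiny made-up table standing in for a dose response file.
toy = pd.DataFrame({
    'DRUG_ID': ['X.1', 'X.1', 'X.2', 'X.3', 'X.3', 'X.4'],
    'CELLNAME': ['a', 'b', 'a', 'b', 'c', 'a'],
})
subset = sample_drugs(toy, n=2, seed=1)
print(subset['DRUG_ID'].nunique())
```

With the real data, a call such as `sample_drugs(CCLE_dose_response, n=10, seed=0)` would play the role of the `random.sample` and `isin` lines in each cell.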

### Stitch it together 

In [101]:
snp = pd.read_csv('Data/snps/combo_snp_oncogenes', sep = '\t')
snp.shape

(6039, 6173)

In [80]:
rna_seq = pd.read_csv('Data/rna-seq/combined_rnaseq_data_oncogenes', sep = '\t')
rna_seq.shape

(15196, 1942)

CCLE

In [87]:
CCLE_rnaseq = pd.merge(rna_seq, CCLE_10, left_on='Sample', right_on='CELLNAME')
CCLE_rnaseq.shape

(36289, 1948)

In [88]:
CCLE_final = pd.merge(CCLE_rnaseq, snp, left_on='Sample', right_on='Sample')
CCLE_final.shape

(35841, 8120)
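
The shapes follow from how an inner merge works: the column count is 1948 + 6173 - 1 = 8120 because the shared `Sample` key appears once, and the row count drops from 36289 to 35841 because samples without a SNP entry are discarded. A small sketch with made-up frames (the column names `GROWTH` and `snp_a` are placeholders) shows the same arithmetic and one way to list the samples that were dropped:

```python
import pandas as pd

# Made-up stand-ins for the RNA-Seq/drug merge and the SNP table.
left = pd.DataFrame({'Sample': ['s1', 's2', 's3'], 'GROWTH': [0.9, 0.4, 0.7]})
right = pd.DataFrame({'Sample': ['s1', 's3'], 'snp_a': [0, 1]})

# Inner merge: 2 + 2 - 1 shared key = 3 columns, matching samples only.
inner = pd.merge(left, right, on='Sample')
print(inner.shape)

# A left merge with indicator=True keeps every left row and flags the misses.
checked = pd.merge(left, right, on='Sample', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Sample'].tolist()
print(unmatched)
```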

CTRP

In [91]:
CTRP_rnaseq = pd.merge(rna_seq, CTRP_10, left_on='Sample', right_on='CELLNAME')
CTRP_rnaseq.shape

(123056, 1948)

In [102]:
CTRP_final = pd.merge(CTRP_rnaseq, snp, left_on='Sample', right_on='Sample')
CTRP_final.shape

(0, 8120)
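
A shape of (0, 8120) means that no `Sample` value in the merged frame also appears in `snp`. One way to investigate is to compare the two key sets directly. The sketch below uses invented sample IDs; the study-prefix naming it shows is only a guess at why keys might fail to line up, and `key_overlap` is a hypothetical helper.

```python
def key_overlap(left_keys, right_keys):
    """Return the shared keys and the prefixes seen on each side."""
    left_set, right_set = set(left_keys), set(right_keys)
    shared = left_set & right_set
    # Text before the first '.' is treated as a study prefix (an assumption).
    left_prefixes = {k.split('.')[0] for k in left_set}
    right_prefixes = {k.split('.')[0] for k in right_set}
    return shared, left_prefixes, right_prefixes

# Invented sample IDs; the real naming scheme is an assumption here.
rnaseq_samples = ['CTRP.A549', 'CTRP.HT29']
snp_samples = ['CCLE.A549', 'CCLE.HT29', 'NCI60.A549']

shared, left_p, right_p = key_overlap(rnaseq_samples, snp_samples)
print(len(shared), sorted(left_p), sorted(right_p))
```

With the real frames, the inputs would be `CTRP_rnaseq['Sample']` and `snp['Sample']`.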

gCSI

In [97]:
gCSI_rnaseq = pd.merge(rna_seq, gCSI_10, left_on='Sample', right_on='CELLNAME')
gCSI_rnaseq.shape

(31788, 1948)

In [98]:
gCSI_final = pd.merge(gCSI_rnaseq, snp, left_on='Sample', right_on='Sample')
gCSI_final.shape

(0, 8120)

GDSC

In [103]:
GDSC_rnaseq = pd.merge(rna_seq, GDSC_10, left_on='Sample', right_on='CELLNAME')
GDSC_rnaseq.shape

(55821, 1948)

In [104]:
GDSC_final = pd.merge(GDSC_rnaseq, snp, left_on='Sample', right_on='Sample')
GDSC_final.shape

(47030, 8120)

NCI60

In [106]:
NCI_rnaseq = pd.merge(rna_seq, NCI_10, left_on='Sample', right_on='CELLNAME')
NCI_rnaseq.shape

(3245, 1948)

In [107]:
NCI_final = pd.merge(NCI_rnaseq, snp, left_on='Sample', right_on='Sample')
NCI_final.shape

(3245, 8120)

### Output to CSV

In [89]:
CCLE_final.to_csv('CCLE_stitched.csv')

In [108]:
NCI_final.to_csv('NCI_stitched.csv')

In [109]:
GDSC_final.to_csv('GDSC_stitched.csv')

In [110]:
# gCSI_final is empty (0 rows after the SNP merge), so write the RNA-Seq merge instead.
gCSI_rnaseq.to_csv('gCSI_stitched.csv')

In [111]:
# CTRP_final is empty (0 rows after the SNP merge), so write the RNA-Seq merge instead.
CTRP_rnaseq.to_csv('CTRP_stitched.csv')
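
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below uses a made-up two-row frame and in-memory buffers instead of files to show the difference that `index=False` makes.

```python
import io
import pandas as pd

frame = pd.DataFrame({'Sample': ['s1', 's2'], 'GROWTH': [0.9, 0.4]})

with_index = io.StringIO()
frame.to_csv(with_index)                  # default: row index is written too
without_index = io.StringIO()
frame.to_csv(without_index, index=False)  # data columns only

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))

# Reading the default output back adds an 'Unnamed: 0' column.
with_index.seek(0)
columns_back = list(pd.read_csv(with_index).columns)
print(columns_back)
```

Passing `index=False` in the cells above would keep the stitched files free of that extra column, assuming the row index carries no information worth saving.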