GDSC2

Dataset Description: Genomics of Drug Sensitivity in Cancer (GDSC) is a resource for therapeutic biomarker discovery in cancer cells. It contains wet-lab IC50 measurements for hundreds of drugs across 1,000 cancer cell lines. This dataset uses RMA-normalized gene expression for the cancer cell lines and SMILES strings for the drugs. Y is the log-normalized IC50. This is version 2 of GDSC, which uses improved experimental procedures.

Task Description: Regression. Given the gene expression of a cell line and the SMILES string of a drug, predict the drug sensitivity level.

Dataset Statistics: 92,703 pairs, 805 cancer cell lines, and 137 drugs
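The statistics above can be checked once the pair table is loaded. The following is a minimal sketch on a toy frame, assuming the table has `Drug_ID` and `Cell Line_ID` columns; the helper `pair_stats` and the toy values are hypothetical and only illustrate the counting.

```python
import pandas as pd


def pair_stats(pairs: pd.DataFrame) -> dict:
    """Count rows, distinct cell lines and distinct drugs in a pair table."""
    return {
        "pairs": len(pairs),
        "cell_lines": pairs["Cell Line_ID"].nunique(),
        "drugs": pairs["Drug_ID"].nunique(),
    }


# Toy stand-in for the real table (values are made up for illustration).
toy = pd.DataFrame({
    "Drug_ID": ["Camptothecin", "Camptothecin", "JQ1"],
    "Cell Line_ID": ["HCC1954", "HCC1143", "HCC1954"],
    "Y": [-0.25, 1.34, 3.58],
})

stats = pair_stats(toy)
print(stats)
```

Calling `pair_stats(df)` on the full table should then report the three numbers listed in the dataset statistics.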

In [1]:
import pandas as pd

### Import dataset and gene names

In [2]:
from tdc.multi_pred import DrugRes
data = DrugRes(name = 'GDSC2')
df = data.get_data()

Found local copy...
Loading...
Done!


In [3]:
genes = data.get_gene_symbols()

Found local copy...
Loading...


### Original dataset

In [4]:
df

Unnamed: 0,Drug_ID,Drug,Cell Line_ID,Cell Line,Y
0,Camptothecin,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...,HCC1954,"[8.54820830373167, 2.5996072676336297, 10.3759...",-0.251083
1,Camptothecin,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...,HCC1143,"[7.58193774904993, 2.81430257671695, 10.363326...",1.343315
2,Camptothecin,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...,HCC1187,"[9.013252540641961, 2.9520929896608, 9.3474286...",1.736985
3,Camptothecin,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...,HCC1395,"[7.4351511634642105, 2.8325700611437004, 10.34...",-2.309078
4,Camptothecin,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...,HCC1599,"[8.334239608034789, 2.7477031637484997, 10.314...",-3.106684
...,...,...,...,...,...
92698,JQ1,CC1=C(SC2=C1C(=N[C@H](C3=NN=C(N32)C)CC(=O)OC(C...,EFM-192A,"[7.90969861787306, 3.0665091537456, 11.3513791...",3.576583
92699,JQ1,CC1=C(SC2=C1C(=N[C@H](C3=NN=C(N32)C)CC(=O)OC(C...,HCC1428,"[7.241512102682691, 2.7729214122229098, 9.7214...",1.402466
92700,JQ1,CC1=C(SC2=C1C(=N[C@H](C3=NN=C(N32)C)CC(=O)OC(C...,HDQ-P1,"[8.59362481381391, 2.7654211455101003, 9.91057...",2.762460
92701,JQ1,CC1=C(SC2=C1C(=N[C@H](C3=NN=C(N32)C)CC(=O)OC(C...,JIMT-1,"[8.44162845293353, 2.6392762542455497, 11.4637...",3.442930


check whether any name is missing from the gene list

In [5]:
print(len(genes))
print(len(df['Cell Line'][0]))

17737
17737
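The check above only inspects the first row. A small sketch of extending it to every row follows, assuming each entry of the `Cell Line` column is a list-like expression vector; the function name `vectors_match_gene_list` and the toy data are hypothetical.

```python
import pandas as pd


def vectors_match_gene_list(frame: pd.DataFrame, gene_names, column: str = "Cell Line") -> bool:
    """Return True if every expression vector has exactly one value per gene name."""
    lengths = frame[column].map(len)
    return bool((lengths == len(gene_names)).all())


# Toy data: three gene names and two cell lines with three values each.
toy_genes = ["TSPAN6", "TNMD", "DPM1"]
toy_df = pd.DataFrame({
    "Cell Line_ID": ["A", "B"],
    "Cell Line": [[8.5, 2.6, 10.4], [7.6, 2.8, 10.4]],
})

all_match = vectors_match_gene_list(toy_df, toy_genes)
print(all_match)
```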


count the nan values in the original gene list

In [6]:
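# Missing gene names are stored as float NaN; convert them to the string 'nan'
# so that every entry in the gene list is a string
for x, i in enumerate(genes):
    if isinstance(i, float):
        genes[x] = str(i)

# Count the gene names that are missing
count = 0
for i in genes:
    if i == 'nan':
        count += 1

print(count)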

318
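The same count can also be obtained without a loop. This is a sketch that assumes the missing names are still float NaN (or `None`) and have not yet been converted to the string `'nan'`; `count_missing_names` is a hypothetical helper.

```python
import pandas as pd


def count_missing_names(names) -> int:
    """Count entries that are NaN/None rather than a gene symbol."""
    return int(pd.isna(pd.Series(names, dtype="object")).sum())


toy_names = ["TSPAN6", float("nan"), "DPM1", float("nan")]
n_missing = count_missing_names(toy_names)
print(n_missing)  # 2
```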


build a dataset with the 805 cell lines and their respective gene expression values

this also collapses the duplicated 'nan' gene names into a single column, because dictionary keys must be unique

In [7]:
gene_expression_dict = {} 

for index, row in df.iterrows():
    gene_expression_dict[row['Cell Line_ID']] = dict(zip(genes, row['Cell Line']))

In [8]:
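# Keyed by drug: later rows overwrite earlier ones, so each drug ends up with the
# expression profile of the last cell line listed for it
gene_expression_by_drug_dict = {}

for index, row in df.iterrows():
    gene_expression_by_drug_dict[row['Drug_ID']] = dict(zip(genes, row['Cell Line']))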

In [9]:
gene_expression = pd.DataFrame.from_dict(gene_expression_dict, orient='index')

In [10]:
gene_expression

Unnamed: 0,TSPAN6,TNMD,DPM1,SCYL3,C1orf112,FGR,CFH,FUCA2,GCLC,NFYA,...,LINC00514,OR1D5,ZNF234,MYH4,LINC00526,PPY2,KRT18P55,POLRMTP1,UBL5P2,TBC1D3P5
HCC1954,8.548208,2.599607,10.375991,5.178378,4.267357,3.092322,6.170279,7.553067,9.280913,5.474400,...,3.413112,3.228033,4.666941,2.632448,3.511284,3.013987,3.333420,2.867266,8.781375,3.232597
HCC1143,7.581938,2.814303,10.363326,3.770037,3.394502,3.111186,6.228677,8.440833,8.005206,5.669074,...,3.591381,3.238828,4.059109,2.564325,4.324982,3.209396,3.539049,2.984753,9.361984,3.225168
HCC1187,9.013253,2.952093,9.347429,4.982836,4.122282,3.290773,3.014210,5.551352,5.032812,7.126702,...,7.497050,3.350337,4.604941,2.914548,4.450711,3.301961,3.088815,2.985197,9.628198,3.425641
HCC1395,7.435151,2.832570,10.344827,3.877500,3.555658,3.511154,7.652886,8.245466,5.650228,6.008727,...,3.420246,3.334150,3.933930,2.641512,4.761300,3.326525,2.891492,2.984267,8.766315,3.487581
HCC1599,8.334240,2.747703,10.314551,4.691847,3.906000,3.199376,3.489741,7.464137,6.321866,5.216688,...,3.732887,3.201282,4.057951,2.749716,4.713647,3.118046,3.105409,3.261109,9.617997,3.346011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NCC010,5.987799,2.784995,10.052390,3.668912,3.767270,3.085118,5.332890,8.031837,6.249088,4.950251,...,3.369585,3.532607,4.051723,2.621257,4.634649,3.498556,3.384732,2.977642,9.363887,3.320187
RCC-JW,7.529016,3.195540,10.600703,3.997457,4.544972,3.450394,5.279941,7.988237,4.217507,5.345541,...,3.075105,3.789976,4.708105,2.609469,3.816648,3.152940,3.364600,3.126751,10.156729,3.376311
MM1S,3.210050,2.874365,10.700068,5.069513,4.781263,3.170943,3.145185,6.453126,5.564713,5.589164,...,3.256844,3.427507,3.786833,2.669626,4.439463,3.055257,3.285702,3.419171,10.088247,3.507856
SNU-61,8.077116,2.781325,10.038055,5.205411,3.758069,3.111444,3.714148,8.947314,6.013510,4.406719,...,4.991264,3.312884,4.008429,2.772584,5.787669,3.283547,3.740781,3.471996,8.837456,3.143362
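The dictionary-based construction above has two properties worth seeing on a small case: repeated gene names collapse into one column that keeps the last value, and the same cell line is processed once per drug it was tested with. Below is a sketch on toy data that first reduces the table to one row per cell line; all names and values are made up for illustration, and this is one possible variant rather than the notebook's own method.

```python
import pandas as pd

toy_genes = ["TSPAN6", "nan", "DPM1", "nan"]  # two unnamed genes
toy_df = pd.DataFrame({
    "Drug_ID": ["d1", "d1", "d2"],
    "Cell Line_ID": ["A", "B", "A"],
    "Cell Line": [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [1.0, 2.0, 3.0, 4.0],
    ],
})

# Keep one row per cell line so each expression vector is handled only once.
unique_lines = toy_df.drop_duplicates(subset="Cell Line_ID")

expression = pd.DataFrame.from_dict(
    {
        cell_id: dict(zip(toy_genes, vector))
        for cell_id, vector in zip(unique_lines["Cell Line_ID"], unique_lines["Cell Line"])
    },
    orient="index",
)

print(expression.shape)            # (2, 3): the two 'nan' names became one column
print(expression.loc["A", "nan"])  # 4.0: the value of the last unnamed gene wins
```

Because the surviving `'nan'` column only holds the values of the last unnamed gene, dropping it discards no named gene.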


there is one nan value left

In [11]:
for i, x in enumerate(gene_expression.columns):
    if x == 'nan':
        print(i, x)

82 nan


In [12]:
# drop gene_expression.columns[82]

gene_expression = gene_expression.drop(gene_expression.columns[82], axis=1)

In [13]:
for i, x in enumerate(gene_expression.columns):
    if x == 'nan':
        print(i, x)

In [14]:
gene_expression.to_csv('gene_expression.csv')
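When the saved file is read back, the cell line IDs come back as an unnamed first column unless the index column is specified. A minimal round-trip sketch follows, using an in-memory buffer as a stand-in for `gene_expression.csv` and a toy frame with made-up values.

```python
import io

import pandas as pd

expression = pd.DataFrame(
    {"TSPAN6": [8.5, 7.6], "TNMD": [2.6, 2.8]},
    index=["HCC1954", "HCC1143"],
)

buffer = io.StringIO()  # in-memory stand-in for 'gene_expression.csv'
expression.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the cell line IDs as the index instead of an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```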

create a dataset with only the drugs

In [15]:
drugs = df.drop_duplicates(subset=['Drug_ID', 'Drug']).drop(['Cell Line', 'Cell Line_ID', 'Y'], axis=1)

In [16]:
# Show the drug table; both 'Drug_ID' and 'Drug' are kept
drugs

Unnamed: 0,Drug_ID,Drug
0,Camptothecin,CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...
804,Vinblastine,CC[C@@]1(CC2C[C@@](C3=C(CCN(C2)C1)C4=CC=CC=C4N...
1551,Cisplatin,N.N.[Cl-].[Cl-].[Pt+2]
2316,Cytarabine,C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)C...
3065,Docetaxel,CC1=C2[C@H](C(=O)[C@@]3([C@H](C[C@@H]4[C@]([C@...
...,...,...
90383,LJI308,C1COCCN1C2=CC=C(C=C2)C3=C(C=NC=C3)C4=CC(=C(C(=...
91126,GSK591,C1CC(C1)NC2=NC=CC(=C2)C(=O)NC[C@@H](CN3CCC4=CC...
91868,VE821,CS(=O)(=O)C1=CC=C(C=C1)C2=CN=C(C(=N2)C(=O)NC3=...
92609,AZD6482,CC1=CN2C(=O)C=C(N=C2C(=C1)[C@@H](C)NC3=CC=CC=C...


In [17]:
drugs.to_csv('drugs.csv')

create a dataset with only the drug and cell line combinations

In [18]:
combinations = df.drop(['Drug', 'Cell Line'], axis=1)

In [19]:
combinations

Unnamed: 0,Drug_ID,Cell Line_ID,Y
0,Camptothecin,HCC1954,-0.251083
1,Camptothecin,HCC1143,1.343315
2,Camptothecin,HCC1187,1.736985
3,Camptothecin,HCC1395,-2.309078
4,Camptothecin,HCC1599,-3.106684
...,...,...,...
92698,JQ1,EFM-192A,3.576583
92699,JQ1,HCC1428,1.402466
92700,JQ1,HDQ-P1,2.762460
92701,JQ1,JIMT-1,3.442930


In [20]:
combinations.to_csv('combinations.csv')
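The three tables can be joined back into one modelling table when needed. The sketch below uses toy frames with the same layout as `combinations`, `drugs` and `gene_expression`; the SMILES entries are placeholders and the numbers are made up.

```python
import pandas as pd

combinations = pd.DataFrame({
    "Drug_ID": ["Camptothecin", "JQ1"],
    "Cell Line_ID": ["HCC1954", "HCC1143"],
    "Y": [-0.25, 1.40],
})
drugs = pd.DataFrame({
    "Drug_ID": ["Camptothecin", "JQ1"],
    "Drug": ["<SMILES 1>", "<SMILES 2>"],  # placeholders, not real SMILES
})
gene_expression = pd.DataFrame(
    {"TSPAN6": [8.5, 7.6], "TNMD": [2.6, 2.8]},
    index=["HCC1954", "HCC1143"],
)

merged = (
    combinations
    # Each pair maps to exactly one drug row; validate raises if Drug_ID repeats in `drugs`.
    .merge(drugs, on="Drug_ID", how="left", validate="many_to_one")
    # The expression table is indexed by cell line ID, so join on its index.
    .merge(gene_expression, left_on="Cell Line_ID", right_index=True, how="left")
)
print(merged)
```

Keeping the expression matrix separate and joining on demand avoids storing the same 17,000-plus values once per drug for every cell line.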

In [1]:
import pandas as pd

# Sample DataFrame
data = {'column1': [1, 2, 3],
        'column2': ['a', 'b', 'c'],
        'list_column': [[10, 20], [30, 40, 50], [60, 70]]}

df = pd.DataFrame(data)

# Transforming the list_column into separate columns
df = pd.concat([df.drop(['list_column'], axis=1), df['list_column'].apply(pd.Series)], axis=1)

# Display the modified DataFrame
print(df)

   column1 column2     0     1     2
0        1       a  10.0  20.0   NaN
1        2       b  30.0  40.0  50.0
2        3       c  60.0  70.0   NaN
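Applying `pd.Series` row by row can be slow on large frames. An alternative sketch builds all new columns in one step from the list of lists; it is shown on the same sample data, and the column dtypes may differ from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],
    "list_column": [[10, 20], [30, 40, 50], [60, 70]],
})

# Build the new columns in one call; shorter lists are padded with NaN.
expanded = pd.DataFrame(df["list_column"].tolist(), index=df.index)
result = pd.concat([df.drop(columns="list_column"), expanded], axis=1)
print(result)
```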


