# The NIH LINCS data is very large
You can download all relevant files on the [GEO Website](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138).

The dataset contains 12,328 rows (genes) by 118,050 columns (samples), for a total of 1,455,320,400 entries.

In [2]:
# Assuming an 8 byte float
base_rows = 12328
base_columns = 118050
f"That's {base_rows * base_columns * 8 / 1e9} gigabytes"

"That's 11.6425632 gigabytes"

**This is too large for practical use**, and might be too big to fit in working memory (RAM) for many computers.
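The estimate above depends on the assumed 8-byte float. As a rough sketch, here is the same arithmetic with the dtype made explicit. The helper name `dense_size_gb` is hypothetical, and the dtype actually used inside the data file is not stated here, so the float32 figure is only a what-if.

```python
import numpy as np


def dense_size_gb(n_rows: int, n_cols: int, dtype=np.float64) -> float:
    """Size in gigabytes (1e9 bytes) of a dense n_rows x n_cols array of `dtype`."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9


# Same matrix shape as above, under two assumed dtypes.
full_f64 = dense_size_gb(12328, 118050, np.float64)
full_f32 = dense_size_gb(12328, 118050, np.float32)
print(f"float64: {full_f64:.2f} GB, float32: {full_f32:.2f} GB")
```

Halving the width of each value halves the footprint, but even then the full matrix is uncomfortable to hold in memory, so filtering samples is still worthwhile.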

But we aren't interested in all the samples. Many samples here measure data for experiments we are not interested in. Luckily, the NIH provides a number of metadata files we can use to help us decide which experiments we are interested in.

### Determining data of interest

The reagent (chemical or genetic) being studied in a given experiment is called the perturbagen, and there are a variety of types of perturbagens.

Let's load in the metadata file that contains info on the perturbagens.

In [3]:
import pandas as pd

pert_info = pd.read_csv("data/GSE70138_Broad_LINCS_pert_info_2017-03-06.txt", sep="\t")

Let's see what kinds of perturbagens there are in the dataset:

In [4]:
pert_info["pert_type"].drop_duplicates()

0            trt_cp
513     ctl_vehicle
1797        trt_xpr
2150      ctl_untrt
2151     ctl_vector
Name: pert_type, dtype: object
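`drop_duplicates()` lists the distinct types but not how common each one is. A small sketch of counting rows per type with `value_counts()`, using a made-up stand-in for `pert_info` (the IDs below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for pert_info; these rows are made up, not taken from the real file.
toy_pert_info = pd.DataFrame({
    "pert_id": ["BRD-0001", "BRD-0002", "DMSO", "XPR-0001"],
    "pert_type": ["trt_cp", "trt_cp", "ctl_vehicle", "trt_xpr"],
})

# One row per distinct pert_type, with the number of perturbagens of that type.
type_counts = toy_pert_info["pert_type"].value_counts()
print(type_counts)
```

On the real table, the same call would show how many perturbagens fall under each type.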

Looking at the Connectopedia [entry on perturbagens](https://clue.io/connectopedia/perturbagen_types_and_controls), we can see that the `pert_type` for drugs (compounds) is `trt_cp`. 

Let's see some examples:

In [5]:
# Also want to include controls
compound_perturbagens = pert_info[pert_info["pert_type"].isin(["trt_cp","ctl_vehicle"])]
print(f"Found {compound_perturbagens.shape[0]} different compounds.")
compound_perturbagens[:5]

Found 1797 different compounds.


Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
0,BRD-K70792160,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,GYBXAGDWMCJZJK-UHFFFAOYSA-N,10-DEBC,trt_cp
1,BRD-K68552125,CCCCCCCCCCCCCC(=O)O[C@@H]1[C@@H](C)[C@]2(O)[C@...,PHEDXBVPIONUQT-RGYGYFBISA-N,phorbol-myristate-acetate,trt_cp
2,BRD-K92301463,CCCCC(C)(C)[C@H](O)\C=C\[C@H]1[C@H](O)CC(=O)[C...,QAOBBBBDJSWHMU-WMBBNPMCSA-N,"16,16-dimethylprostaglandin-e2",trt_cp
3,BRD-A29731977,CCCCCC(=O)O[C@@]1(CCC2C3CCC4=CC(=O)CC[C@]4(C)C...,DOMWKUIIPQCAJU-JKPPDDDBSA-N,17-hydroxyprogesterone-caproate,trt_cp
4,BRD-K07954936,OC(=O)CCCC[C@@H]1SC[C@@H]2NC(=N)N[C@H]12,WWVANQJRLPIHNS-ZKWXMUAHSA-N,2-iminobiotin,trt_cp


Note that `pert_iname` in this dataset corresponds with `sm_name` in the Kaggle dataset (`de_train.parquet`). The same holds true for `canonical_smiles` and `SMILES`, respectively.
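If the two datasets are ever lined up side by side, that correspondence can be written down once as a rename mapping. This is only a sketch: the constant name `LINCS_TO_KAGGLE` is hypothetical, and whether the compound names match exactly between the two sources (for example in capitalization) is not checked here.

```python
import pandas as pd

# Column correspondence described above (LINCS name -> Kaggle name).
LINCS_TO_KAGGLE = {"pert_iname": "sm_name", "canonical_smiles": "SMILES"}

# A single row in the LINCS layout (the DMSO vehicle control).
toy = pd.DataFrame({
    "pert_id": ["DMSO"],
    "pert_iname": ["DMSO"],
    "canonical_smiles": ["CS(=O)C"],
})

# rename() returns a new frame; columns not in the mapping are left alone.
toy_kaggle_style = toy.rename(columns=LINCS_TO_KAGGLE)
print(list(toy_kaggle_style.columns))
```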

By the way, the negative control (DMSO) exists in this dataset too, with a special `pert_type` called `ctl_vehicle`.

In [6]:
control_perturbagen = pert_info[pert_info["pert_type"]=="ctl_vehicle"]
control_perturbagen

Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
513,DMSO,CS(=O)C,IAZDPXIOMUYVGZ-UHFFFAOYSA-N,DMSO,ctl_vehicle


So do the positive controls: dabrafenib and belinostat.

In [7]:
positive_perturbagens = pert_info[pert_info["pert_iname"].isin(["dabrafenib","belinostat"])]
positive_perturbagens

Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
216,BRD-K17743125,ONC(=O)\C=C\c1cccc(c1)S(=O)(=O)Nc1ccccc1,NCNRHFGMJRPRSK-MDZDMXLPSA-N,belinostat,trt_cp
441,BRD-K09951645,CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(...,BFSMGDJOXZAERB-UHFFFAOYSA-N,dabrafenib,trt_cp


### Building an index
The work we just did tells us which `pert_id`s we are interested in, but we don't quite have an index into the dataset yet.

Let's load in the sample metadata. Note that there is 1 entry for every sample in the dataset.

In [8]:
sig_info = pd.read_csv("data/GSE70138_Broad_LINCS_sig_info_2017-03-06.txt", sep="\t")
sig_info.shape

(118050, 8)

This sample metadata contains the same `pert_type` column as the perturbagen metadata above, but it doesn't have any structural info (such as SMILES) on the perturbagen:

In [9]:
sig_info.columns

Index(['sig_id', 'pert_id', 'pert_iname', 'pert_type', 'cell_id', 'pert_idose',
       'pert_itime', 'distil_id'],
      dtype='object')

Luckily we can combine all the work we've done so far. 
We want to annotate every sample that uses a compound perturbagen with its SMILES string (the InChI key is left out of the join below).
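It helps to see the join semantics on a toy example first. The sketch below uses made-up samples and compounds to compare a right join with an inner join: the right join keeps a row for every compound, including any compound that has no sample, and those rows come back with a missing `sig_id`. Whether that happens in the real data is not checked here.

```python
import pandas as pd

# Made-up samples and compounds, for illustration only.
toy_sig = pd.DataFrame({
    "sig_id": ["s1", "s2", "s3"],
    "pert_id": ["cpd_A", "cpd_A", "gene_X"],
})
toy_cpd = pd.DataFrame({
    "pert_id": ["cpd_A", "cpd_B"],
    "canonical_smiles": ["CCO", "CS(=O)C"],
})

right = toy_sig.join(toy_cpd.set_index("pert_id"), on="pert_id", how="right")
inner = toy_sig.join(toy_cpd.set_index("pert_id"), on="pert_id", how="inner")

# Rows with a missing sig_id are compounds that had no matching sample.
unmatched = right[right["sig_id"].isna()]
print(len(right), len(inner), len(unmatched))
```

A quick `isna()` check on `sig_id` after a right join is a cheap way to confirm that no such placeholder rows slipped in.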

*Hold on tight!*

In [10]:
compound_perturbagens = compound_perturbagens[['pert_id', 'canonical_smiles']]
key = "pert_id"
comp_sig_info = sig_info.join(compound_perturbagens.set_index(key),on=key,how="right")
comp_sig_info

Unnamed: 0,sig_id,pert_id,pert_iname,pert_type,cell_id,pert_idose,pert_itime,distil_id,canonical_smiles
55789,REP.A007_A375_24H:N13,BRD-K70792160,10-DEBC,trt_cp,A375,10.0 um,24 h,REP.A007_A375_24H_X1_B22:N13|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55790,REP.A007_A375_24H:N14,BRD-K70792160,10-DEBC,trt_cp,A375,3.33 um,24 h,REP.A007_A375_24H_X1_B22:N14|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55791,REP.A007_A375_24H:N15,BRD-K70792160,10-DEBC,trt_cp,A375,1.11 um,24 h,REP.A007_A375_24H_X1_B22:N15|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55792,REP.A007_A375_24H:N16,BRD-K70792160,10-DEBC,trt_cp,A375,0.37 um,24 h,REP.A007_A375_24H_X1_B22:N16|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55793,REP.A007_A375_24H:N17,BRD-K70792160,10-DEBC,trt_cp,A375,0.12 um,24 h,REP.A007_A375_24H_X1_B22:N17|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
...,...,...,...,...,...,...,...,...,...
39334,LPROT003_A549_6H:G04,BRD-K92960067,smer-3,trt_cp,A549,10.0 um,6 h,LPROT003_A549_6H_X1.A2_B22:G04,O=C1c2ccccc2-c2nc3nonc3nc12
39335,LPROT003_A549_6H:G06,BRD-K92960067,smer-3,trt_cp,A549,10.0 um,6 h,LPROT003_A549_6H_X1.A2_B22:G06,O=C1c2ccccc2-c2nc3nonc3nc12
39523,LPROT003_PC3_6H:G01,BRD-K92960067,smer-3,trt_cp,PC3,10.0 um,6 h,LPROT003_PC3_6H_X1.A2_B22:G01,O=C1c2ccccc2-c2nc3nonc3nc12
39524,LPROT003_PC3_6H:G03,BRD-K92960067,smer-3,trt_cp,PC3,10.0 um,6 h,LPROT003_PC3_6H_X1.A2_B22:G03,O=C1c2ccccc2-c2nc3nonc3nc12


Looks great, but I want info about the cells. Let's load in the cell metadata.

In [11]:
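# NOTE: despite the variable name, this file holds cell metadata (gene_info is reassigned to the gene metadata later)
gene_info = pd.read_csv("data/GSE70138_Broad_LINCS_cell_info_2017-04-28.txt", sep="\t")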
gene_info

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
0,A375,cell line,A375,-666,-666,tumor,skin,malignant melanoma,adherent,CRL-1619,ATCC,54,F,-666
1,A375.311,cell line,A375,A375,genetically modified to stably express Cas9 pr...,tumor,skin,malignant melanoma,adherent,CRL-1619,ATCC,54,F,-666
2,A549,cell line,A549,-666,-666,tumor,lung,non small cell lung cancer| carcinoma,adherent,CCL-185,ATCC,58,M,Caucasian
3,A549.311,cell line,A549,A549,genetically modified to stably express Cas9 p...,tumor,lung,non small cell lung cancer| carcinoma,adherent,CCL-185,ATCC,58,M,Caucasian
4,A673,cell line,A673,-666,-666,tumor,bone,ewing's sarcoma,adherent,CRL-1598,ATCC,-666,F,-666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,CD34,primary,CD34,-666,-666,normal,bone,bone marrow,suspension,-666,-666,-666,-666,-666
94,PHH,primary,PHH,-666,-666,primary,liver,normal primary liver,-666,-666,CellzDirect,-666,-666,-666
95,SKB,primary,SKB,-666,-666,normal,muscle,myoblast,-666,CC-2580,Lonza,-666,-666,-666
96,SKL,primary,SKL,-666,-666,primary,muscle,normal primary skeletal muscle cells,adherent,CC-2561,LONZA,-666,-666,-666


Out of curiosity, do we have any Henrietta Lacks (HeLa) cells?


In [13]:
gene_info[gene_info["base_cell_id"]=="HELA"]

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
22,HELA,cell line,HELA,-666,-666,tumor,large intestine,adenocarcinoma,adherent,CCL-2,ATCC,31,F,Black
23,HELA.311,cell line,HELA,HELA,genetically modified to stably express Cas9 pr...,tumor,large intestine,adenocarcinoma,adherent,CCL-2,ATCC,31,F,Black


Fascinating. Regardless, let's see what kinds of cells are available:

In [14]:
gene_info["cell_type"].drop_duplicates()

0          cell line
83    differentiated
89               ESC
90              iPSC
91           primary
Name: cell_type, dtype: object

Let's dig into the primary and differentiated cells:

In [15]:
gene_info[gene_info["cell_type"].isin(["differentiated","primary"])]

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
83,MNEU.E,differentiated,MNEU,-666,differentiated from ESC to be motor neurons,normal,-666,-666,adherent,-666,Harvard University,-666,-666,-666
84,NEU,differentiated,NEU,-666,terminally differentiated to be neurons,normal,-666,-666,adherent,-666,-666,-666,-666,-666
85,NEU.KCL,differentiated,NEU,NEU,NEU exposed to KCl (potassium chloride) soluti...,normal,-666,-666,adherent,-666,-666,-666,-666,-666
86,NPC,differentiated,NPC,-666,"differentiated from iPSC, but not terminally d...",primary,central nervous system,normal stem fibroblast-derived iPScs,adherent,-666,-666,-666,-666,-666
87,NPC.CAS9,differentiated,NPC,NPC,NPC that were genetically modified to stably e...,primary,central nervous system,normal stem fibroblast-derived iPScs,adherent,-666,-666,-666,-666,-666
88,NPC.TAK,differentiated,NPC,NPC,"differentiated from iPSC, but not terminally d...",primary,central nervous system,normal stem fibroblast-derived iPScs,adherent,-666,-666,-666,-666,-666
91,ASC,primary,ASC,-666,-666,primary,adipose,normal primary adipocyte stem cells,adherent,-666,-666,-666,-666,-666
92,ASC.C,primary,ASC,-666,-666,primary,adipose,normal primary adipocyte stem cells,adherent,HPA-v,Sciencell,-666,-666,-666
93,CD34,primary,CD34,-666,-666,normal,bone,bone marrow,suspension,-666,-666,-666,-666,-666
94,PHH,primary,PHH,-666,-666,primary,liver,normal primary liver,-666,-666,CellzDirect,-666,-666,-666


No blood cells. Let's look a little further:

In [16]:
gene_info[gene_info["primary_site"]=="blood"]

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
76,U266,cell line,U266,-666,-666,tumor,blood,"myeloman, haematopoietic,lymphoid",mix,ACC9,DSMZ,-666,-666,-666


Not helpful. Let's just keep the info we gathered about the compounds and move on.
Did we save any space?

In [17]:
f"We still have {12328 * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We still have 11.230413504 gigabytes'

But we can include only the genes we care about.
See this [notebook for more info](https://www.kaggle.com/code/laurasisson/exploring-the-lincs-gene-metadata).

In [18]:
gene_info = pd.read_csv("data/GSE70138_Broad_LINCS_gene_info_2017-03-06.txt", sep="\t")
gene_info

Unnamed: 0,pr_gene_id,pr_gene_symbol,pr_gene_title,pr_is_lm,pr_is_bing
0,780,DDR1,discoidin domain receptor tyrosine kinase 1,1,1
1,7849,PAX8,paired box 8,1,1
2,2978,GUCA1A,guanylate cyclase activator 1A,0,0
3,2049,EPHB3,EPH receptor B3,0,1
4,2101,ESRRA,estrogen related receptor alpha,0,1
...,...,...,...,...,...
12323,4034,LRCH4,leucine-rich repeats and calponin homology (CH...,0,1
12324,399664,MEX3D,mex-3 RNA binding family member D,0,1
12325,54869,EPS8L1,EPS8 like 1,0,1
12326,90379,DCAF15,DDB1 and CUL4 associated factor 15,0,1


In [19]:
train_df = pd.read_parquet("data/de_train.parquet")
train_df

Unnamed: 0,cell_type,sm_name,sm_lincs_id,SMILES,control,A1BG,A1BG-AS1,A2M,A2M-AS1,A2MP1,...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1
0,NK cells,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.104720,-0.077524,-1.625596,-0.144545,0.143555,...,-0.227781,-0.010752,-0.023881,0.674536,-0.453068,0.005164,-0.094959,0.034127,0.221377,0.368755
1,T cells CD4+,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.915953,-0.884380,0.371834,-0.081677,-0.498266,...,-0.494985,-0.303419,0.304955,-0.333905,-0.315516,-0.369626,-0.095079,0.704780,1.096702,-0.869887
2,T cells CD8+,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,-0.387721,-0.305378,0.567777,0.303895,-0.022653,...,-0.119422,-0.033608,-0.153123,0.183597,-0.555678,-1.494789,-0.213550,0.415768,0.078439,-0.259365
3,T regulatory cells,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.232893,0.129029,0.336897,0.486946,0.767661,...,0.451679,0.704643,0.015468,-0.103868,0.865027,0.189114,0.224700,-0.048233,0.216139,-0.085024
4,NK cells,Mometasone Furoate,LSM-3349,C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C...,False,4.290652,-0.063864,-0.017443,-0.541154,0.570982,...,0.758474,0.510762,0.607401,-0.123059,0.214366,0.487838,-0.819775,0.112365,-0.122193,0.676629
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,T regulatory cells,Atorvastatin,LSM-5771,CC(C)c1c(C(=O)Nc2ccccc2)c(-c2ccccc2)c(-c2ccc(F...,False,-0.014372,-0.122464,-0.456366,-0.147894,-0.545382,...,-0.549987,-2.200925,0.359806,1.073983,0.356939,-0.029603,-0.528817,0.105138,0.491015,-0.979951
610,NK cells,Riociguat,LSM-45758,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,False,-0.455549,0.188181,0.595734,-0.100299,0.786192,...,-1.236905,0.003854,-0.197569,-0.175307,0.101391,1.028394,0.034144,-0.231642,1.023994,-0.064760
611,T cells CD4+,Riociguat,LSM-45758,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,False,0.338168,-0.109079,0.270182,-0.436586,-0.069476,...,0.077579,-1.101637,0.457201,0.535184,-0.198404,-0.005004,0.552810,-0.209077,0.389751,-0.337082
612,T cells CD8+,Riociguat,LSM-45758,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,False,0.101138,-0.409724,-0.606292,-0.071300,-0.001789,...,0.005951,-0.893093,-1.003029,-0.080367,-0.076604,0.024849,0.012862,-0.029684,0.005506,-1.733112


In [20]:
# Match gene symbols against the column labels of train_df (made explicit with .columns)
shared_gene_info = gene_info[gene_info["pr_gene_symbol"].isin(train_df.columns)]
shared_gene_info

Unnamed: 0,pr_gene_id,pr_gene_symbol,pr_gene_title,pr_is_lm,pr_is_bing
0,780,DDR1,discoidin domain receptor tyrosine kinase 1,1,1
1,7849,PAX8,paired box 8,1,1
3,2049,EPHB3,EPH receptor B3,0,1
4,2101,ESRRA,estrogen related receptor alpha,0,1
5,8717,TRADD,TNFRSF1A-associated via death domain,0,1
...,...,...,...,...,...
12323,4034,LRCH4,leucine-rich repeats and calponin homology (CH...,0,1
12324,399664,MEX3D,mex-3 RNA binding family member D,0,1
12325,54869,EPS8L1,EPS8 like 1,0,1
12326,90379,DCAF15,DDB1 and CUL4 associated factor 15,0,1
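The filter above relies on matching gene symbols against the column labels of the training frame. A minimal sketch on toy data (three gene symbols from the table above, plus a made-up one-row training frame) shows the idea with `.columns` spelled out:

```python
import pandas as pd

toy_genes = pd.DataFrame({"pr_gene_symbol": ["DDR1", "PAX8", "GUCA1A"]})
# Made-up training frame: one metadata column plus two gene columns.
toy_train = pd.DataFrame({"sm_name": ["x"], "DDR1": [0.1], "PAX8": [-0.2]})

# Keep only the genes whose symbol appears among the training column labels.
mask = toy_genes["pr_gene_symbol"].isin(toy_train.columns)
toy_shared = toy_genes[mask]
print(toy_shared["pr_gene_symbol"].tolist())
```

Iterating over a DataFrame yields its column labels, so passing the frame itself to `isin` ends up testing against the same labels, but naming `.columns` makes the intent obvious.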


Let's see how much space now:

In [21]:
f"We still have {shared_gene_info.shape[0] * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We still have 8.3581314 gigabytes'

Loading this subset did not work (see the commented-out cell below). Let's try just the landmark genes.

In [22]:
# from cmapPy.pandasGEXpress.parse import parse

# gene_ids = shared_gene_info["pr_gene_id"].astype(str)
# sig_ids = comp_sig_info["sig_id"]

# l5_data = parse("data/GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid = sig_ids, rid = gene_ids)
# l5_data.data_df.shape

In [23]:
landmark_info = gene_info[gene_info["pr_is_lm"]==1]
landmark_gene_row_ids = gene_info["pr_gene_id"][gene_info["pr_is_lm"] == 1]
landmark_info

Unnamed: 0,pr_gene_id,pr_gene_symbol,pr_gene_title,pr_is_lm,pr_is_bing
0,780,DDR1,discoidin domain receptor tyrosine kinase 1,1,1
1,7849,PAX8,paired box 8,1,1
25,6193,RPS5,ribosomal protein S5,1,1
43,23,ABCF1,ATP binding cassette subfamily F member 1,1,1
49,9552,SPAG7,sperm associated antigen 7,1,1
...,...,...,...,...,...
12184,5467,PPARD,peroxisome proliferator activated receptor delta,1,1
12223,2767,GNA11,guanine nucleotide binding protein (G protein)...,1,1
12224,23038,WDTC1,WD and tetratricopeptide repeats 1,1,1
12286,57048,PLSCR3,phospholipid scramblase 3,1,1


Did that save space?

In [24]:
f"We have {landmark_info.shape[0] * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We have 0.890926704 gigabytes'
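To compare the three options side by side, here is the same back-of-the-envelope arithmetic in one place. The sample count (113,871) and the landmark count (978) are inferred from the gigabyte figures printed above rather than read from the data, and 8-byte floats are assumed as before.

```python
N_SAMPLES = 113_871   # compound and vehicle-control signatures kept above (inferred)
BYTES_PER_VALUE = 8   # assuming 8-byte floats, as earlier

# Number of genes (rows) under each option.
gene_sets = {"all genes": 12_328, "shared with Kaggle": 9_175, "landmark": 978}

sizes_gb = {
    name: n_genes * N_SAMPLES * BYTES_PER_VALUE / 1e9
    for name, n_genes in gene_sets.items()
}
for name, gb in sizes_gb.items():
    print(f"{name:>20}: {gb:6.2f} GB")
```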

I think using the larger gene dataset may be worth it!

In [26]:
# Copy so that deleting columns later does not touch shared_gene_info
landmark_info = shared_gene_info.copy()

Great! Let's load in the dataset:

In [27]:
from cmapPy.pandasGEXpress.parse import parse
gene_ids = landmark_info["pr_gene_id"].astype(str)
sig_ids = comp_sig_info["sig_id"]
l5_data = parse("data/GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid = sig_ids, rid = gene_ids)
l5_data.data_df.shape

(9175, 113871)

Let's prepare metadata for annotation.

In [28]:
for sk in ["pert_id","pert_type","distil_id"]:
    del comp_sig_info[sk]
comp_sig_info.set_index("sig_id", inplace=True)
comp_sig_info

sig_id,pert_iname,cell_id,pert_idose,pert_itime,canonical_smiles
REP.A007_A375_24H:N13,10-DEBC,A375,10.0 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N14,10-DEBC,A375,3.33 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N15,10-DEBC,A375,1.11 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N16,10-DEBC,A375,0.37 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N17,10-DEBC,A375,0.12 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
...,...,...,...,...,...
LPROT003_A549_6H:G04,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12
LPROT003_A549_6H:G06,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12
LPROT003_PC3_6H:G01,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12
LPROT003_PC3_6H:G03,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12


In [29]:
for gk in ["pr_gene_title","pr_is_bing","pr_is_lm"]:
    del landmark_info[gk]
landmark_info.set_index("pr_gene_id",inplace=True)
landmark_info.index = landmark_info.index.map(str)
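Deleting columns in a loop changes the frame in place. As an alternative sketch, the same row metadata can be built without mutating the source frame by chaining `drop` and `set_index`. The toy frame below reuses two rows from the gene table above; the name `toy_row_meta` is just for illustration.

```python
import pandas as pd

toy_gene_info = pd.DataFrame({
    "pr_gene_id": [780, 7849],
    "pr_gene_symbol": ["DDR1", "PAX8"],
    "pr_gene_title": ["discoidin domain receptor tyrosine kinase 1", "paired box 8"],
    "pr_is_lm": [1, 1],
    "pr_is_bing": [1, 1],
})

# drop() and set_index() return new frames, so toy_gene_info is left untouched.
toy_row_meta = (
    toy_gene_info
    .drop(columns=["pr_gene_title", "pr_is_bing", "pr_is_lm"])
    .set_index("pr_gene_id")
)
# Cast the IDs to strings, as done for the real metadata above.
toy_row_meta.index = toy_row_meta.index.map(str)
print(toy_row_meta)
```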

Time to annotate!

In [30]:
l5_data.col_metadata_df = comp_sig_info
l5_data.row_metadata_df = landmark_info
l5_data.data_df

cid,REP.A001_A375_24H:A03,REP.A001_A375_24H:A04,REP.A001_A375_24H:A05,REP.A001_A375_24H:A06,REP.A001_A375_24H:A07,REP.A001_A375_24H:A08,REP.A001_A375_24H:A09,REP.A001_A375_24H:A10,REP.A001_A375_24H:A11,REP.A001_A375_24H:A12,...,LJP007_SKL_24H:P19,LJP007_SKL_24H:P20,LJP007_SKL_24H:P21,LJP007_SKL_24H:P22,LJP007_SKL_24H:E21,LJP007_SKL_24H:O13,LJP007_SKL_24H:O14,LJP007_SKL_24H:O24,LJP007_SKL_24H:P24,LJP007_SKL_24H:C19
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
780,4.264143,-0.382211,-0.571711,0.584376,0.658348,-0.004232,-0.314762,-0.049558,-0.909517,-0.850654,...,1.091158,0.264409,0.711080,0.768569,4.4460,4.4395,6.1750,8.0582,10.0000,3.0807
7849,0.057249,0.304313,-0.754999,-0.589973,-0.226854,-0.363419,-0.691129,-0.684283,0.521503,-0.640316,...,-0.493212,-0.041785,-0.606896,0.819984,6.6313,10.0000,2.8649,0.4905,9.1524,4.5834
2049,0.308898,-0.335931,-0.502323,-1.775247,-0.666601,0.080279,0.035644,-0.540970,0.503692,-1.418259,...,0.260368,0.906001,1.230669,0.448981,-1.3394,0.3803,1.6567,-0.4138,-2.9559,-2.5385
2101,-0.104070,0.324702,0.495425,-0.107543,-0.091924,0.645074,-0.035445,-0.643081,-0.050036,-0.320833,...,-0.676510,0.153707,-0.923612,0.281000,0.2792,0.2364,2.2745,-0.4215,4.9306,-0.3057
8717,-0.779874,-0.394772,-0.701756,-0.768190,-0.205493,-0.147390,-1.274809,-0.084787,1.060140,-0.350143,...,0.262250,0.000082,-0.109884,-0.422778,4.1891,4.4511,-0.4829,1.9450,5.3442,0.9768
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4034,0.645438,-0.916510,0.678068,-0.466665,-0.777092,-0.048832,-0.803753,-0.490556,0.020871,0.449013,...,1.177493,0.670997,0.735648,0.862986,1.4647,0.3859,0.2262,0.7516,3.4061,-0.3348
399664,-1.011237,-0.350702,-0.548015,0.336222,0.927741,0.585799,-1.301060,-0.224759,-0.328396,0.118584,...,-0.414293,-0.927003,0.052080,-0.498292,1.1236,2.2361,0.9255,-1.4057,0.2197,1.8597
54869,-1.272611,-0.471564,-0.318550,0.585188,-0.029780,-0.728792,0.568284,0.255533,0.758844,0.213864,...,0.930385,0.458312,-0.082488,0.941404,3.4528,5.0802,2.4934,3.2928,-4.5793,2.9312
90379,-0.770175,0.012531,-0.865771,-0.187241,-0.972365,-0.118854,-0.741667,-0.766926,0.142690,1.482587,...,0.235752,-0.256095,0.628243,-0.316403,2.6078,3.6382,2.7465,2.0752,7.7701,2.5543


Before we finish the labels, let's add a control column:

In [31]:
# Add a control column
comp_sig_info["control"] = comp_sig_info["pert_iname"] == "DMSO"
comp_sig_info[comp_sig_info["control"]]

sig_id,pert_iname,cell_id,pert_idose,pert_itime,canonical_smiles,control
LJP005_A375_24H:A03,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:A04,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:A05,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:A06,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:B03,DMSO,A375,-666,24 h,CS(=O)C,True
...,...,...,...,...,...,...
REP.A028_YAPC_24H:J14,DMSO,YAPC,-666,24 h,CS(=O)C,True
REP.A028_YAPC_24H:J15,DMSO,YAPC,-666,24 h,CS(=O)C,True
REP.A028_YAPC_24H:J16,DMSO,YAPC,-666,24 h,CS(=O)C,True
REP.A028_YAPC_24H:J17,DMSO,YAPC,-666,24 h,CS(=O)C,True


These are some heavy-duty operations, so let's add tqdm for progress bars.

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()

Join on the index (experiment ID):

In [32]:
final_data = comp_sig_info.join(l5_data.data_df.T)
final_data = final_data.rename(columns=landmark_info.to_dict()["pr_gene_symbol"])
final_data

sig_id,pert_iname,cell_id,pert_idose,pert_itime,canonical_smiles,control,DDR1,PAX8,EPHB3,ESRRA,...,RHOT2,RABEP2,ZNF783,NPEPL1,ADAP1,LRCH4,MEX3D,EPS8L1,DCAF15,ACTB
REP.A007_A375_24H:N13,10-DEBC,A375,10.0 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.038472,1.849687,0.083768,0.016702,...,0.001606,-0.958471,-0.862591,0.043918,0.243827,0.605550,-0.035159,0.452217,-0.355277,0.102468
REP.A007_A375_24H:N14,10-DEBC,A375,3.33 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.102355,0.727322,-0.661204,0.721278,...,-1.605070,-0.266933,-1.342838,1.173322,0.864832,0.621559,0.699045,0.446316,-0.586099,0.181649
REP.A007_A375_24H:N15,10-DEBC,A375,1.11 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-0.568256,0.511114,-0.097846,0.071177,...,-0.766508,-1.213808,-0.843063,0.761775,0.469623,-1.254542,1.225531,1.476330,0.072201,0.641390
REP.A007_A375_24H:N16,10-DEBC,A375,0.37 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.079726,-0.023865,0.327802,-0.641486,...,0.339242,-0.997942,-0.104656,-1.054154,-0.066694,0.057235,0.297426,0.975332,0.062736,-0.086294
REP.A007_A375_24H:N17,10-DEBC,A375,0.12 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.909248,-0.086209,-0.166124,-0.288303,...,-0.093410,-1.158856,-0.244935,0.383060,-0.019576,-0.798997,-0.463227,0.299951,-0.244413,1.035409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
LPROT003_A549_6H:G04,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,0.225100,0.000000,-0.198400,-2.311300,...,-0.577800,-0.229300,1.321300,0.447600,-1.161400,-1.205300,-0.875000,-0.059900,1.116800,0.314000
LPROT003_A549_6H:G06,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,-0.859600,0.125700,-0.141600,-0.767600,...,-0.649200,1.285900,0.154100,-0.978700,-1.341600,-1.560200,-0.709700,0.543900,0.548600,-1.723500
LPROT003_PC3_6H:G01,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,-0.215200,-0.674500,3.398900,0.231100,...,2.139500,1.696700,-0.437300,0.182200,0.872200,2.838300,-0.754900,1.883600,2.790700,-1.193800
LPROT003_PC3_6H:G03,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,-2.046500,-1.113400,0.135900,-1.299600,...,-1.103900,-1.759000,-0.124200,0.674500,2.116100,-0.507700,0.383500,2.143300,-0.122800,1.611100
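The join above works because the transposed expression matrix and the sample metadata share the same index (the signature ID). A toy sketch of the same three steps (transpose, join on the index, rename gene IDs to symbols), with made-up values and sample IDs:

```python
import pandas as pd

# Toy expression matrix: genes (rows) x samples (columns), values made up.
toy_matrix = pd.DataFrame(
    [[0.5, -1.0], [2.0, 0.0]],
    index=pd.Index(["780", "7849"], name="rid"),
    columns=pd.Index(["s1", "s2"], name="cid"),
)
toy_col_meta = pd.DataFrame(
    {"cell_id": ["A375", "PC3"]},
    index=pd.Index(["s1", "s2"], name="sig_id"),
)
id_to_symbol = {"780": "DDR1", "7849": "PAX8"}

# Transpose so samples become rows, join on the shared index, then rename IDs to symbols.
toy_final = toy_col_meta.join(toy_matrix.T).rename(columns=id_to_symbol)
print(toy_final)
```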


### Build individual datasets

Calculate the common genes:

In [33]:
LINCS_TSM_IDX = 6
lincs_genes = final_data.columns[LINCS_TSM_IDX:]
lincs_genes

Index(['DDR1', 'PAX8', 'EPHB3', 'ESRRA', 'TRADD', 'PRPF8', 'CAPNS1', 'RPL35',
       'RPL28', 'EIF4G2',
       ...
       'RHOT2', 'RABEP2', 'ZNF783', 'NPEPL1', 'ADAP1', 'LRCH4', 'MEX3D',
       'EPS8L1', 'DCAF15', 'ACTB'],
      dtype='object', length=9175)

In [34]:
TRAIN_TSM_IDX = 5
shared_genes = train_df.columns[train_df.columns.isin(lincs_genes)]
shared_genes

Index(['A2M', 'A4GALT', 'AAAS', 'AACS', 'AAGAB', 'AAK1', 'AAMDC', 'AAMP',
       'AAR2', 'AARS',
       ...
       'ZSCAN9', 'ZSWIM1', 'ZSWIM8', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDB', 'ZXDC',
       'ZYX', 'ZZEF1'],
      dtype='object', length=9175)

In [35]:
import tqdm
import torch

In [36]:
lincs_cmpd_df = final_data[~final_data["control"]]
lincs_control_pert = final_data[final_data["control"]][shared_genes.union(["cell_id"])].groupby(by="cell_id").mean()
lincs_control_pert

cell_id,A2M,A4GALT,AAAS,AACS,AAGAB,AAK1,AAMDC,AAMP,AAR2,AARS,...,ZSCAN9,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1
A375,-0.110109,-0.049684,0.100239,-0.0049,-0.024816,-0.053884,-0.011712,-0.104259,0.025247,0.222884,...,-0.152178,-0.017154,-0.216074,-0.035961,0.141513,0.104522,-0.121313,-0.086904,-0.09114,0.039327
A549,-0.047626,-0.236422,-0.061077,-0.105941,-0.042531,-0.034639,-0.169385,-0.137957,0.038209,0.034758,...,-0.141764,-0.135562,-0.32387,0.378432,0.320842,0.15288,-0.203917,-0.122826,-0.173342,-0.12959
ASC,0.166485,0.05554,0.039872,-0.228517,-0.132621,0.082804,0.091114,0.037994,-0.098226,-0.259645,...,0.289782,0.180714,0.045764,0.405885,0.199263,0.255769,0.198614,0.161731,-0.073762,-0.10241
ASC.C,-0.069586,0.187142,-0.026428,0.102633,0.094621,-0.066426,0.120224,-0.002082,0.104645,-0.200364,...,-0.148608,-0.004214,-0.236261,0.431706,0.095209,0.183738,-0.029021,0.079258,-0.108002,-0.050298
BT20,-0.211375,-0.19087,-0.163334,-0.005101,0.056584,-0.15976,-0.145183,-0.141743,0.005526,-0.041777,...,-0.070543,-0.100917,-0.416693,-0.267169,0.249895,0.153524,-0.177629,-0.150482,0.16118,-0.174431
CD34,-0.543433,0.134133,-0.079466,0.157995,0.427995,-0.292266,-0.055078,-0.013483,0.320555,0.216457,...,-0.231414,-0.135247,-0.169558,0.597274,-0.071756,0.353396,-0.272444,0.232876,-0.502586,0.07208
HA1E,-0.106911,-0.0038,0.020565,0.048322,-0.00612,-0.082098,0.068889,-0.080936,0.078857,0.04413,...,-0.113178,-0.01471,-0.19124,-0.241708,0.146336,-0.082126,-0.045833,-0.104291,-0.045675,-0.083254
HCC515,-0.056381,-0.084919,-0.131694,-0.141488,-0.017202,-0.051744,-0.216641,-0.066836,-0.122941,-0.274716,...,-0.164664,-0.097737,-0.294049,0.471611,0.354953,0.308973,-0.138757,-0.109615,-0.100252,-0.071916
HELA,-0.01744,0.09894,0.187499,-0.029475,-0.034436,-0.067476,0.103601,-0.062789,0.042074,0.005951,...,-0.084067,-0.031107,-0.06536,0.199884,0.106882,0.096741,-0.033152,-0.072278,-0.127704,-0.002219
HEPG2,-0.447554,0.026939,-0.030628,-0.086866,0.082266,-0.035435,-0.158732,0.073541,0.006357,-0.129019,...,-0.082992,-0.077702,-0.120189,0.448783,0.157227,0.311718,-0.075132,0.028011,-0.233428,0.019177


In [37]:
# Join each value with the average of the controls, by cell_id
lincs_joined_df = lincs_cmpd_df.join(lincs_control_pert,on="cell_id",lsuffix=" post_treatment",rsuffix=" pre_treatment")
# Split labels to use the join suffix as the top level of a MultiIndex. For non-overlapping columns, use 'label'
lincs_joined_df.columns = pd.MultiIndex.from_tuples([(y if not pd.isnull(y) else "label",x) for (x,y) in lincs_joined_df.columns.str.split(expand=True)])
# Sort each MultiIndex level
lincs_joined_df = lincs_joined_df.sort_index(axis=1,level=[0,1])
# The non-shared genes are kept in the df under "label". Drop them.
lincs_joined_df.drop(lincs_joined_df["label"].select_dtypes('number').columns, axis = 1, level=1,inplace = True)
lincs_joined_df

Unnamed: 0_level_0,label,label,label,label,label,label,post_treatment,post_treatment,post_treatment,post_treatment,...,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment
Unnamed: 0_level_1,canonical_smiles,cell_id,control,pert_idose,pert_iname,pert_itime,A2M,A4GALT,AAAS,AACS,...,ZSCAN9,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1
sig_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
REP.A007_A375_24H:N13,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,A375,False,10.0 um,10-DEBC,24 h,0.096137,0.370747,0.244788,0.588700,...,-0.152178,-0.017154,-0.216074,-0.035961,0.141513,0.104522,-0.121313,-0.086904,-0.091140,0.039327
REP.A007_A375_24H:N14,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,A375,False,3.33 um,10-DEBC,24 h,-0.156295,0.633927,-0.344619,0.030247,...,-0.152178,-0.017154,-0.216074,-0.035961,0.141513,0.104522,-0.121313,-0.086904,-0.091140,0.039327
REP.A007_A375_24H:N15,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,A375,False,1.11 um,10-DEBC,24 h,0.875770,-0.652506,0.637416,-1.078485,...,-0.152178,-0.017154,-0.216074,-0.035961,0.141513,0.104522,-0.121313,-0.086904,-0.091140,0.039327
REP.A007_A375_24H:N16,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,A375,False,0.37 um,10-DEBC,24 h,0.627406,-0.635089,-1.343378,0.801373,...,-0.152178,-0.017154,-0.216074,-0.035961,0.141513,0.104522,-0.121313,-0.086904,-0.091140,0.039327
REP.A007_A375_24H:N17,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,A375,False,0.12 um,10-DEBC,24 h,0.150190,0.120465,-1.188625,-0.999034,...,-0.152178,-0.017154,-0.216074,-0.035961,0.141513,0.104522,-0.121313,-0.086904,-0.091140,0.039327
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
LPROT003_A549_6H:G04,O=C1c2ccccc2-c2nc3nonc3nc12,A549,False,10.0 um,smer-3,6 h,0.417300,-0.196100,0.411400,0.271700,...,-0.141764,-0.135562,-0.323870,0.378432,0.320842,0.152880,-0.203917,-0.122826,-0.173342,-0.129590
LPROT003_A549_6H:G06,O=C1c2ccccc2-c2nc3nonc3nc12,A549,False,10.0 um,smer-3,6 h,-0.779600,-0.568800,-1.673500,0.345500,...,-0.141764,-0.135562,-0.323870,0.378432,0.320842,0.152880,-0.203917,-0.122826,-0.173342,-0.129590
LPROT003_PC3_6H:G01,O=C1c2ccccc2-c2nc3nonc3nc12,PC3,False,10.0 um,smer-3,6 h,0.613100,2.832600,-0.500000,-1.048800,...,-0.052953,-0.027302,-0.153890,-0.295438,0.049582,-0.062164,-0.040412,-0.007378,-0.043023,-0.059122
LPROT003_PC3_6H:G03,O=C1c2ccccc2-c2nc3nonc3nc12,PC3,False,10.0 um,smer-3,6 h,-1.265200,-0.076200,-0.353600,1.552700,...,-0.052953,-0.027302,-0.153890,-0.295438,0.049582,-0.062164,-0.040412,-0.007378,-0.043023,-0.059122
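The suffix trick is easier to follow on a tiny example. The sketch below uses made-up values and writes the label split with plain `str.partition` instead of `str.split(expand=True)`; it is an equivalent formulation for illustration, not the exact code used above, and the helper name `split_label` is hypothetical.

```python
import pandas as pd

treated = pd.DataFrame({
    "cell_id": ["A375", "PC3"],
    "A2M": [1.0, 2.0],
    "ZYX": [0.5, -0.5],
    "ACTB": [9.0, 9.0],  # stands in for a gene missing from the control table
})
control_mean = pd.DataFrame(
    {"A2M": [0.1, 0.2], "ZYX": [0.0, 0.3]},
    index=pd.Index(["A375", "PC3"], name="cell_id"),
)

# Suffixes are only added to column names that appear on both sides.
joined = treated.join(control_mean, on="cell_id",
                      lsuffix=" post_treatment", rsuffix=" pre_treatment")


def split_label(col: str) -> tuple:
    """'A2M pre_treatment' -> ('pre_treatment', 'A2M'); 'cell_id' -> ('label', 'cell_id')."""
    name, _, group = col.partition(" ")
    return (group or "label", name)


joined.columns = pd.MultiIndex.from_tuples([split_label(c) for c in joined.columns])
joined = joined.sort_index(axis=1)

# Numeric columns left under "label" are genes without a control counterpart.
unshared = joined["label"].select_dtypes("number").columns
joined = joined.drop(columns=unshared, level=1)
print(joined)
```

With the columns grouped this way, `joined["post_treatment"]` and `joined["pre_treatment"]` select matching gene blocks with identical column order.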


Time to build the datasets. There may be a vectorized way to do this, but it's OK.

In [38]:
kaggle_cmpd_df = train_df[~train_df["control"]]
kaggle_control_pert = train_df[train_df["control"]][shared_genes.union(["cell_type"])].groupby(by="cell_type").mean()

In [39]:
# Join and label as above
kaggle_joined_df = kaggle_cmpd_df.join(kaggle_control_pert,on="cell_type",lsuffix=" post_treatment",rsuffix=" pre_treatment")
# Split each column name so the join suffix becomes the top level of a MultiIndex. Non-overlapping columns have no suffix, so file them under 'label'
kaggle_joined_df.columns = pd.MultiIndex.from_tuples([(y if not pd.isnull(y) else "label",x) for (x,y) in kaggle_joined_df.columns.str.split(expand=True)])
# Sort the columns on both MultiIndex levels
kaggle_joined_df = kaggle_joined_df.sort_index(axis=1,level=[0,1])
# The non-shared genes are kept in the df under "label". Drop them.
kaggle_joined_df.drop(kaggle_joined_df["label"].select_dtypes('number').columns, axis = 1, level=1,inplace = True)
kaggle_joined_df

Unnamed: 0_level_0,label,label,label,label,label,post_treatment,post_treatment,post_treatment,post_treatment,post_treatment,...,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment
Unnamed: 0_level_1,SMILES,cell_type,control,sm_lincs_id,sm_name,A2M,A4GALT,AAAS,AACS,AAGAB,...,ZSCAN9,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1
0,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,NK cells,False,LSM-5341,Clotrimazole,-1.625596,0.073229,-0.016823,0.101717,-0.005153,...,-0.163274,0.591602,0.419336,-0.276920,-0.699361,-0.197593,0.150026,0.079472,-0.492517,-1.284090
1,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,T cells CD4+,False,LSM-5341,Clotrimazole,0.371834,0.203559,0.604656,0.498592,-0.317184,...,-1.210226,-0.301428,7.382795,1.841392,0.487361,-0.003076,-1.552542,-0.499788,-0.020261,-1.287937
2,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,T cells CD8+,False,LSM-5341,Clotrimazole,0.567777,-0.480681,0.467144,-0.293205,-0.005098,...,-0.781860,-0.900506,0.450539,0.265504,-0.686671,-0.452621,-0.981731,0.252306,-0.062445,0.769099
3,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,T regulatory cells,False,LSM-5341,Clotrimazole,0.336897,0.718590,-0.162145,0.157206,-3.654218,...,7.814024,14.482867,5.040275,8.676611,5.373513,8.877284,12.642849,6.362721,1.339078,0.765783
4,C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C...,NK cells,False,LSM-3349,Mometasone Furoate,-0.017443,2.022829,0.600011,1.231275,0.236739,...,-0.163274,0.591602,0.419336,-0.276920,-0.699361,-0.197593,0.150026,0.079472,-0.492517,-1.284090
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,CC(C)c1c(C(=O)Nc2ccccc2)c(-c2ccccc2)c(-c2ccc(F...,T regulatory cells,False,LSM-5771,Atorvastatin,-0.456366,-0.544709,0.282458,-0.431359,-0.364961,...,7.814024,14.482867,5.040275,8.676611,5.373513,8.877284,12.642849,6.362721,1.339078,0.765783
610,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,NK cells,False,LSM-45758,Riociguat,0.595734,0.090954,0.169523,0.428297,0.106553,...,-0.163274,0.591602,0.419336,-0.276920,-0.699361,-0.197593,0.150026,0.079472,-0.492517,-1.284090
611,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,T cells CD4+,False,LSM-45758,Riociguat,0.270182,-0.061539,0.002818,-0.027167,-0.383696,...,-1.210226,-0.301428,7.382795,1.841392,0.487361,-0.003076,-1.552542,-0.499788,-0.020261,-1.287937
612,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,T cells CD8+,False,LSM-45758,Riociguat,-0.606292,-0.706087,-0.620919,-1.485381,0.059303,...,-0.781860,-0.900506,0.450539,0.265504,-0.686671,-0.452621,-0.981731,0.252306,-0.062445,0.769099
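
The suffix-and-split step above is easier to follow on a tiny frame. The sketch below uses hypothetical toy data (`GENE_A`, `GENE_B`, and the `treated` / `baseline` frames are made up for illustration, not taken from the dataset) and assumes, as in the cell above, that no column name contains a space of its own.

```python
import pandas as pd

# Hypothetical toy data: one label column and two gene columns.
treated = pd.DataFrame({
    "cell_type": ["NK cells", "B cells"],
    "GENE_A": [0.5, -1.0],
    "GENE_B": [2.0, 0.0],
})
# Per-cell-type control means, indexed by cell_type
# (the shape that groupby("cell_type").mean() produces).
baseline = pd.DataFrame(
    {"GENE_A": [0.1, 0.2], "GENE_B": [0.3, 0.4]},
    index=pd.Index(["B cells", "NK cells"], name="cell_type"),
)

# Overlapping gene columns receive a suffix; "cell_type" does not overlap.
joined = treated.join(
    baseline, on="cell_type",
    lsuffix=" post_treatment", rsuffix=" pre_treatment",
)

# "GENE_A post_treatment" -> ("GENE_A", "post_treatment"); "cell_type" -> ("cell_type", NaN)
split = joined.columns.str.split(expand=True)
joined.columns = pd.MultiIndex.from_tuples(
    [(group if not pd.isnull(group) else "label", name) for name, group in split]
)
joined = joined.sort_index(axis=1)
print(joined)
```

After this, `joined["pre_treatment"]` and `joined["post_treatment"]` are plain frames with identical gene columns, which is what the later checks rely on.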


In [40]:
test_df = pd.read_csv("data/id_map.csv")
test_joined_df = test_df.join(kaggle_control_pert,on="cell_type")
# The join has no overlapping columns, so no suffixes were added; we must create the MultiIndex manually
gene_cols = shared_genes + " pre_treatment"
label_cols = test_joined_df.columns[~test_joined_df.columns.isin(shared_genes)] + " label"

test_joined_df.columns = pd.MultiIndex.from_tuples([(y,x) for (x,y) in label_cols.union(gene_cols,sort=False).str.split(expand=True)])
test_joined_df

Unnamed: 0_level_0,label,label,label,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment,pre_treatment
Unnamed: 0_level_1,id,cell_type,sm_name,A2M,A4GALT,AAAS,AACS,AAGAB,AAK1,AAMDC,...,ZSCAN9,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1
0,0,B cells,5-(9-Isopropyl-8-methyl-2-morpholino-9H-purin-...,-0.427363,0.540255,-0.132722,0.099653,4.381261,1.118606,8.004196,...,1.279548,-0.205822,3.684938,1.624666,-0.160136,0.366357,-0.755179,0.353838,-1.537940,0.073333
1,1,B cells,ABT-199 (GDC-0199),-0.427363,0.540255,-0.132722,0.099653,4.381261,1.118606,8.004196,...,1.279548,-0.205822,3.684938,1.624666,-0.160136,0.366357,-0.755179,0.353838,-1.537940,0.073333
2,2,B cells,ABT737,-0.427363,0.540255,-0.132722,0.099653,4.381261,1.118606,8.004196,...,1.279548,-0.205822,3.684938,1.624666,-0.160136,0.366357,-0.755179,0.353838,-1.537940,0.073333
3,3,B cells,AMD-070 (hydrochloride),-0.427363,0.540255,-0.132722,0.099653,4.381261,1.118606,8.004196,...,1.279548,-0.205822,3.684938,1.624666,-0.160136,0.366357,-0.755179,0.353838,-1.537940,0.073333
4,4,B cells,AT 7867,-0.427363,0.540255,-0.132722,0.099653,4.381261,1.118606,8.004196,...,1.279548,-0.205822,3.684938,1.624666,-0.160136,0.366357,-0.755179,0.353838,-1.537940,0.073333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
250,250,Myeloid cells,Vandetanib,-26.620091,5.373893,1.014078,-1.304295,0.892647,5.103295,8.042948,...,1.232328,0.429169,0.094279,0.062222,-10.631423,-0.349937,-0.588051,-4.267371,2.360694,-1.036504
251,251,Myeloid cells,Vanoxerine,-26.620091,5.373893,1.014078,-1.304295,0.892647,5.103295,8.042948,...,1.232328,0.429169,0.094279,0.062222,-10.631423,-0.349937,-0.588051,-4.267371,2.360694,-1.036504
252,252,Myeloid cells,Vardenafil,-26.620091,5.373893,1.014078,-1.304295,0.892647,5.103295,8.042948,...,1.232328,0.429169,0.094279,0.062222,-10.631423,-0.349937,-0.588051,-4.267371,2.360694,-1.036504
253,253,Myeloid cells,Vorinostat,-26.620091,5.373893,1.014078,-1.304295,0.892647,5.103295,8.042948,...,1.232328,0.429169,0.094279,0.062222,-10.631423,-0.349937,-0.588051,-4.267371,2.360694,-1.036504
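
The manual relabelling above depends on the joined columns appearing as labels first and then genes in the same order as `shared_genes`. As a hedged alternative, the sketch below builds the tuples directly from each column name, so it does not depend on position. All names here (`toy_test`, `toy_baseline`, `toy_genes`) are hypothetical stand-ins, not the notebook's variables.

```python
import pandas as pd

# Hypothetical toy stand-ins for id_map.csv and the per-cell-type control means.
toy_test = pd.DataFrame({
    "id": [0, 1],
    "cell_type": ["B cells", "Myeloid cells"],
    "sm_name": ["ABT737", "Vorinostat"],
})
toy_baseline = pd.DataFrame(
    {"GENE_A": [0.1, 0.2], "GENE_B": [0.3, 0.4]},
    index=pd.Index(["B cells", "Myeloid cells"], name="cell_type"),
)
toy_genes = pd.Index(["GENE_A", "GENE_B"])

# No overlapping columns, so join() leaves every name untouched.
toy_joined = toy_test.join(toy_baseline, on="cell_type")

# Assign the top level by membership rather than by position.
toy_joined.columns = pd.MultiIndex.from_tuples(
    [("pre_treatment" if col in toy_genes else "label", col)
     for col in toy_joined.columns]
)
print(toy_joined)
```

Because no string splitting is involved, this variant would also tolerate label columns whose names contain spaces.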


Let's do some quick sanity checks:

In [41]:
# Check column ordering is the same for pre_treatment
assert (kaggle_joined_df["pre_treatment"].columns ==  lincs_joined_df["pre_treatment"].columns).all()
assert (kaggle_joined_df["pre_treatment"].columns ==  test_joined_df["pre_treatment"].columns).all()
# For post_treatment
assert (kaggle_joined_df["post_treatment"].columns ==  lincs_joined_df["post_treatment"].columns).all()
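
One caveat with the elementwise `==` comparison: when two column indexes have different lengths, pandas raises a `ValueError` instead of failing the assertion. A minimal sketch of an order-sensitive check that also reports what differs is shown below. The helper `describe_mismatch` is hypothetical and not part of the notebook; it relies only on `Index.equals` and `Index.difference`.

```python
import pandas as pd

def describe_mismatch(left, right):
    """Return a short message on how two column indexes differ, or None if equal."""
    # Index.equals compares both the elements and their order.
    if left.equals(right):
        return None
    missing = left.difference(right)
    extra = right.difference(left)
    if len(missing) or len(extra):
        return f"missing from right: {list(missing)}; extra in right: {list(extra)}"
    return "same labels, different order"

a = pd.Index(["A2M", "AAAS", "ZYX"])
b = pd.Index(["A2M", "ZYX", "AAAS"])  # same genes, different order
c = pd.Index(["A2M", "AAAS"])         # one gene missing

print(describe_mismatch(a, a.copy()))
print(describe_mismatch(a, b))
print(describe_mismatch(a, c))
```

A check such as `assert describe_mismatch(x, y) is None, describe_mismatch(x, y)` would then produce a readable failure message.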

In [1]:
# Rename some labels in the LINCS data to match the Kaggle column names, so the two are easier to compare.
lincs_joined_df = lincs_joined_df.rename(columns={"canonical_smiles":"SMILES","pert_iname":"sm_name","cell_id":"cell_type"})


In [None]:
lincs_joined_df.to_parquet("data/lincs_pretreatment_xxl.parquet")
kaggle_joined_df.to_parquet("data/kaggle_pretreatment_xxl.parquet")
test_joined_df.to_parquet("data/test_pretreatment_xxl.parquet")