# The NIH LINCS data is very large
You can download all relevant files on the [GEO Website](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138).

The dataset contains 12,328 rows (genes) by 118,050 columns (samples), for a total of 1,455,320,400 entries.

In [1]:
# Assuming an 8 byte float
base_rows = 12328
base_columns = 118050
f"That's {base_rows * base_columns * 8 / 1e9} gigabytes"

"That's 11.6425632 gigabytes"

**This is too large for practical use**, and might be too big to fit in working memory (RAM) for many computers.
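
The same arithmetic can be wrapped in a small helper to compare storage widths. This is only a sketch: `estimate_gigabytes` is a hypothetical name, and whether the values can be stored as 4-byte floats is an assumption to check against the actual data.

```python
def estimate_gigabytes(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> float:
    """Size of a dense n_rows x n_columns matrix in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_columns * bytes_per_value / 1e9


# Full matrix: 12328 genes by 118050 samples.
full_float64 = estimate_gigabytes(12328, 118050)
# Assumption: the same matrix stored as 4-byte floats instead.
full_float32 = estimate_gigabytes(12328, 118050, bytes_per_value=4)

print(f"float64: {full_float64:.2f} GB, float32: {full_float32:.2f} GB")
```

Halving the width helps, but the matrix stays large either way, so the sample count still needs to come down.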

But we aren't interested in all the samples. Many of them measure data for experiments we don't care about. Luckily, the NIH provides a number of metadata files we can use to help us decide which experiments we are interested in.

### Determining data of interest

The reagent (chemical or genetic) being studied in a given experiment is called the perturbagen, and there are a variety of types of perturbagens.

Let's load in the metadata file that contains info on the perturbagens.

In [2]:
import pandas as pd

pert_info = pd.read_csv("data/GSE70138_Broad_LINCS_pert_info_2017-03-06.txt", sep="\t")

Let's see what kinds of perturbagens there are in the dataset:

In [3]:
pert_info["pert_type"].drop_duplicates()

0            trt_cp
513     ctl_vehicle
1797        trt_xpr
2150      ctl_untrt
2151     ctl_vector
Name: pert_type, dtype: object
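
`drop_duplicates` lists the types but not how many perturbagens fall under each. A minimal sketch of counting them with `value_counts`, using a hypothetical four-row stand-in for `pert_info` (the IDs are made up):

```python
import pandas as pd

# Hypothetical miniature of pert_info; the real file has many more rows.
toy_pert_info = pd.DataFrame({
    "pert_id": ["BRD-1", "BRD-2", "DMSO", "XPR-1"],
    "pert_type": ["trt_cp", "trt_cp", "ctl_vehicle", "trt_xpr"],
})

# Number of perturbagens per type, most frequent first.
type_counts = toy_pert_info["pert_type"].value_counts()
print(type_counts)
```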

Looking at the Connectopedia [entry on perturbagens](https://clue.io/connectopedia/perturbagen_types_and_controls), we can see that the `pert_type` for drugs (compounds) is `trt_cp`. 

Let's see some examples:

In [4]:
# Keep the vehicle control (DMSO) alongside the compounds; the count below includes it
compound_perturbagens = pert_info[pert_info["pert_type"].isin(["trt_cp","ctl_vehicle"])]
print(f"Found {compound_perturbagens.shape[0]} different compounds.")
compound_perturbagens[:5]

Found 1797 different compounds.


Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
0,BRD-K70792160,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,GYBXAGDWMCJZJK-UHFFFAOYSA-N,10-DEBC,trt_cp
1,BRD-K68552125,CCCCCCCCCCCCCC(=O)O[C@@H]1[C@@H](C)[C@]2(O)[C@...,PHEDXBVPIONUQT-RGYGYFBISA-N,phorbol-myristate-acetate,trt_cp
2,BRD-K92301463,CCCCC(C)(C)[C@H](O)\C=C\[C@H]1[C@H](O)CC(=O)[C...,QAOBBBBDJSWHMU-WMBBNPMCSA-N,"16,16-dimethylprostaglandin-e2",trt_cp
3,BRD-A29731977,CCCCCC(=O)O[C@@]1(CCC2C3CCC4=CC(=O)CC[C@]4(C)C...,DOMWKUIIPQCAJU-JKPPDDDBSA-N,17-hydroxyprogesterone-caproate,trt_cp
4,BRD-K07954936,OC(=O)CCCC[C@@H]1SC[C@@H]2NC(=N)N[C@H]12,WWVANQJRLPIHNS-ZKWXMUAHSA-N,2-iminobiotin,trt_cp


Note that `pert_iname` in this dataset corresponds to `sm_name` in the Kaggle dataset (`de_train.parquet`). The same holds true for `canonical_smiles` and `SMILES`, respectively.

By the way, the negative control (DMSO) exists in this dataset too, with a special `pert_type` called `ctl_vehicle`.

In [5]:
control_perturbagen = pert_info[pert_info["pert_type"]=="ctl_vehicle"]
control_perturbagen

Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
513,DMSO,CS(=O)C,IAZDPXIOMUYVGZ-UHFFFAOYSA-N,DMSO,ctl_vehicle


And so are the positive controls: dabrafenib and belinostat.

In [6]:
positive_perturbagens = pert_info[pert_info["pert_iname"].isin(["dabrafenib","belinostat"])]
positive_perturbagens

Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
216,BRD-K17743125,ONC(=O)\C=C\c1cccc(c1)S(=O)(=O)Nc1ccccc1,NCNRHFGMJRPRSK-MDZDMXLPSA-N,belinostat,trt_cp
441,BRD-K09951645,CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(...,BFSMGDJOXZAERB-UHFFFAOYSA-N,dabrafenib,trt_cp


### Building an index
The work we just did tells us which `pert_id`s we are interested in, but we don't quite have an index into the dataset yet.

Let's load in the sample metadata. Note that there is 1 entry for every sample in the dataset.

In [7]:
sig_info = pd.read_csv("data/GSE70138_Broad_LINCS_sig_info_2017-03-06.txt", sep="\t")
sig_info.shape

(118050, 8)

This sample metadata contains the same `pert_type` column as the perturbagen metadata above, but apart from the name it has no structural information (such as SMILES) about the perturbagen:

In [8]:
sig_info.columns

Index(['sig_id', 'pert_id', 'pert_iname', 'pert_type', 'cell_id', 'pert_idose',
       'pert_itime', 'distil_id'],
      dtype='object')

Luckily we can combine all the work we've done so far. 
We want to annotate every sample that uses a compound perturbagen with its SMILES string. (The InChI key is available in the `inchi_key` column too, but the cell below keeps only `canonical_smiles`.)

*Hold on tight!*

In [9]:
compound_perturbagens = compound_perturbagens[['pert_id', 'canonical_smiles']]
key = "pert_id"
comp_sig_info = sig_info.join(compound_perturbagens.set_index(key),on=key,how="right")
comp_sig_info

Unnamed: 0,sig_id,pert_id,pert_iname,pert_type,cell_id,pert_idose,pert_itime,distil_id,canonical_smiles
55789,REP.A007_A375_24H:N13,BRD-K70792160,10-DEBC,trt_cp,A375,10.0 um,24 h,REP.A007_A375_24H_X1_B22:N13|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55790,REP.A007_A375_24H:N14,BRD-K70792160,10-DEBC,trt_cp,A375,3.33 um,24 h,REP.A007_A375_24H_X1_B22:N14|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55791,REP.A007_A375_24H:N15,BRD-K70792160,10-DEBC,trt_cp,A375,1.11 um,24 h,REP.A007_A375_24H_X1_B22:N15|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55792,REP.A007_A375_24H:N16,BRD-K70792160,10-DEBC,trt_cp,A375,0.37 um,24 h,REP.A007_A375_24H_X1_B22:N16|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
55793,REP.A007_A375_24H:N17,BRD-K70792160,10-DEBC,trt_cp,A375,0.12 um,24 h,REP.A007_A375_24H_X1_B22:N17|REP.A007_A375_24H...,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
...,...,...,...,...,...,...,...,...,...
39334,LPROT003_A549_6H:G04,BRD-K92960067,smer-3,trt_cp,A549,10.0 um,6 h,LPROT003_A549_6H_X1.A2_B22:G04,O=C1c2ccccc2-c2nc3nonc3nc12
39335,LPROT003_A549_6H:G06,BRD-K92960067,smer-3,trt_cp,A549,10.0 um,6 h,LPROT003_A549_6H_X1.A2_B22:G06,O=C1c2ccccc2-c2nc3nonc3nc12
39523,LPROT003_PC3_6H:G01,BRD-K92960067,smer-3,trt_cp,PC3,10.0 um,6 h,LPROT003_PC3_6H_X1.A2_B22:G01,O=C1c2ccccc2-c2nc3nonc3nc12
39524,LPROT003_PC3_6H:G03,BRD-K92960067,smer-3,trt_cp,PC3,10.0 um,6 h,LPROT003_PC3_6H_X1.A2_B22:G03,O=C1c2ccccc2-c2nc3nonc3nc12
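
To make the join semantics concrete, here is a toy sketch with hypothetical miniature tables (made-up IDs and SMILES). It compares the `how="right"` join used above with an inner merge: the right join keeps every perturbagen, even one with no samples, while the inner merge drops it.

```python
import pandas as pd

# Hypothetical miniature versions of the two metadata tables.
toy_sig_info = pd.DataFrame({
    "sig_id": ["s1", "s2", "s3"],
    "pert_id": ["BRD-A", "BRD-A", "BRD-X"],
})
toy_perts = pd.DataFrame({
    "pert_id": ["BRD-A", "BRD-B"],
    "canonical_smiles": ["CCO", "CCN"],
})

# Right join: one row per matching sample, plus a row with a missing
# sig_id for BRD-B, which has no samples at all.
right = toy_sig_info.join(toy_perts.set_index("pert_id"), on="pert_id", how="right")

# Inner merge: only samples whose perturbagen appears in both tables.
inner = toy_sig_info.merge(toy_perts, on="pert_id", how="inner")

print(len(right), right["sig_id"].isna().sum(), len(inner))
```

In the real data the two appear to agree, since the number of joined rows matches the number of columns loaded later, but checking for missing `sig_id` values after a right join is a cheap safeguard.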


Looks great, but I want info about the cells. Let's load in the cell metadata.

In [10]:
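# Note: despite its name, gene_info holds *cell* metadata here; it is reassigned to the gene metadata later on.
gene_info = pd.read_csv("data/GSE70138_Broad_LINCS_cell_info_2017-04-28.txt", sep="\t")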
gene_info

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
0,A375,cell line,A375,-666,-666,tumor,skin,malignant melanoma,adherent,CRL-1619,ATCC,54,F,-666
1,A375.311,cell line,A375,A375,genetically modified to stably express Cas9 pr...,tumor,skin,malignant melanoma,adherent,CRL-1619,ATCC,54,F,-666
2,A549,cell line,A549,-666,-666,tumor,lung,non small cell lung cancer| carcinoma,adherent,CCL-185,ATCC,58,M,Caucasian
3,A549.311,cell line,A549,A549,genetically modified to stably express Cas9 p...,tumor,lung,non small cell lung cancer| carcinoma,adherent,CCL-185,ATCC,58,M,Caucasian
4,A673,cell line,A673,-666,-666,tumor,bone,ewing's sarcoma,adherent,CRL-1598,ATCC,-666,F,-666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,CD34,primary,CD34,-666,-666,normal,bone,bone marrow,suspension,-666,-666,-666,-666,-666
94,PHH,primary,PHH,-666,-666,primary,liver,normal primary liver,-666,-666,CellzDirect,-666,-666,-666
95,SKB,primary,SKB,-666,-666,normal,muscle,myoblast,-666,CC-2580,Lonza,-666,-666,-666
96,SKL,primary,SKL,-666,-666,primary,muscle,normal primary skeletal muscle cells,adherent,CC-2561,LONZA,-666,-666,-666


Out of curiosity, do we have any Henrietta Lacks (HeLa) cells?


In [12]:
gene_info[gene_info["base_cell_id"]=="HELA"]

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
22,HELA,cell line,HELA,-666,-666,tumor,large intestine,adenocarcinoma,adherent,CCL-2,ATCC,31,F,Black
23,HELA.311,cell line,HELA,HELA,genetically modified to stably express Cas9 pr...,tumor,large intestine,adenocarcinoma,adherent,CCL-2,ATCC,31,F,Black


Fascinating. Regardless, let's see what kinds of cells are available:

In [13]:
gene_info["cell_type"].drop_duplicates()

0          cell line
83    differentiated
89               ESC
90              iPSC
91           primary
Name: cell_type, dtype: object

Let's dig into the primary and differentiated cells:

In [14]:
gene_info[gene_info["cell_type"].isin(["differentiated","primary"])]

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
83,MNEU.E,differentiated,MNEU,-666,differentiated from ESC to be motor neurons,normal,-666,-666,adherent,-666,Harvard University,-666,-666,-666
84,NEU,differentiated,NEU,-666,terminally differentiated to be neurons,normal,-666,-666,adherent,-666,-666,-666,-666,-666
85,NEU.KCL,differentiated,NEU,NEU,NEU exposed to KCl (potassium chloride) soluti...,normal,-666,-666,adherent,-666,-666,-666,-666,-666
86,NPC,differentiated,NPC,-666,"differentiated from iPSC, but not terminally d...",primary,central nervous system,normal stem fibroblast-derived iPScs,adherent,-666,-666,-666,-666,-666
87,NPC.CAS9,differentiated,NPC,NPC,NPC that were genetically modified to stably e...,primary,central nervous system,normal stem fibroblast-derived iPScs,adherent,-666,-666,-666,-666,-666
88,NPC.TAK,differentiated,NPC,NPC,"differentiated from iPSC, but not terminally d...",primary,central nervous system,normal stem fibroblast-derived iPScs,adherent,-666,-666,-666,-666,-666
91,ASC,primary,ASC,-666,-666,primary,adipose,normal primary adipocyte stem cells,adherent,-666,-666,-666,-666,-666
92,ASC.C,primary,ASC,-666,-666,primary,adipose,normal primary adipocyte stem cells,adherent,HPA-v,Sciencell,-666,-666,-666
93,CD34,primary,CD34,-666,-666,normal,bone,bone marrow,suspension,-666,-666,-666,-666,-666
94,PHH,primary,PHH,-666,-666,primary,liver,normal primary liver,-666,-666,CellzDirect,-666,-666,-666


No blood cells. Let's look a little further:

In [15]:
gene_info[gene_info["primary_site"]=="blood"]

Unnamed: 0,cell_id,cell_type,base_cell_id,precursor_cell_id,modification,sample_type,primary_site,subtype,original_growth_pattern,provider_catalog_id,original_source_vendor,donor_age,donor_sex,donor_ethnicity
76,U266,cell line,U266,-666,-666,tumor,blood,"myeloman, haematopoietic,lymphoid",mix,ACC9,DSMZ,-666,-666,-666


Not helpful. Let's just keep the info we gathered about the compounds and move on.
Did we save any space?

In [16]:
f"We still have {base_rows * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We still have 11.230413504 gigabytes'

But we can include only the genes we care about.
See this [notebook for more info](https://www.kaggle.com/code/laurasisson/exploring-the-lincs-gene-metadata).

In [17]:
gene_info = pd.read_csv("data/GSE70138_Broad_LINCS_gene_info_2017-03-06.txt", sep="\t")
gene_info

Unnamed: 0,pr_gene_id,pr_gene_symbol,pr_gene_title,pr_is_lm,pr_is_bing
0,780,DDR1,discoidin domain receptor tyrosine kinase 1,1,1
1,7849,PAX8,paired box 8,1,1
2,2978,GUCA1A,guanylate cyclase activator 1A,0,0
3,2049,EPHB3,EPH receptor B3,0,1
4,2101,ESRRA,estrogen related receptor alpha,0,1
...,...,...,...,...,...
12323,4034,LRCH4,leucine-rich repeats and calponin homology (CH...,0,1
12324,399664,MEX3D,mex-3 RNA binding family member D,0,1
12325,54869,EPS8L1,EPS8 like 1,0,1
12326,90379,DCAF15,DDB1 and CUL4 associated factor 15,0,1


In [18]:
train_df = pd.read_parquet("data/de_train.parquet")
train_df

Unnamed: 0,cell_type,sm_name,sm_lincs_id,SMILES,control,A1BG,A1BG-AS1,A2M,A2M-AS1,A2MP1,...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1
0,NK cells,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.104720,-0.077524,-1.625596,-0.144545,0.143555,...,-0.227781,-0.010752,-0.023881,0.674536,-0.453068,0.005164,-0.094959,0.034127,0.221377,0.368755
1,T cells CD4+,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.915953,-0.884380,0.371834,-0.081677,-0.498266,...,-0.494985,-0.303419,0.304955,-0.333905,-0.315516,-0.369626,-0.095079,0.704780,1.096702,-0.869887
2,T cells CD8+,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,-0.387721,-0.305378,0.567777,0.303895,-0.022653,...,-0.119422,-0.033608,-0.153123,0.183597,-0.555678,-1.494789,-0.213550,0.415768,0.078439,-0.259365
3,T regulatory cells,Clotrimazole,LSM-5341,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,False,0.232893,0.129029,0.336897,0.486946,0.767661,...,0.451679,0.704643,0.015468,-0.103868,0.865027,0.189114,0.224700,-0.048233,0.216139,-0.085024
4,NK cells,Mometasone Furoate,LSM-3349,C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C...,False,4.290652,-0.063864,-0.017443,-0.541154,0.570982,...,0.758474,0.510762,0.607401,-0.123059,0.214366,0.487838,-0.819775,0.112365,-0.122193,0.676629
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,T regulatory cells,Atorvastatin,LSM-5771,CC(C)c1c(C(=O)Nc2ccccc2)c(-c2ccccc2)c(-c2ccc(F...,False,-0.014372,-0.122464,-0.456366,-0.147894,-0.545382,...,-0.549987,-2.200925,0.359806,1.073983,0.356939,-0.029603,-0.528817,0.105138,0.491015,-0.979951
610,NK cells,Riociguat,LSM-45758,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,False,-0.455549,0.188181,0.595734,-0.100299,0.786192,...,-1.236905,0.003854,-0.197569,-0.175307,0.101391,1.028394,0.034144,-0.231642,1.023994,-0.064760
611,T cells CD4+,Riociguat,LSM-45758,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,False,0.338168,-0.109079,0.270182,-0.436586,-0.069476,...,0.077579,-1.101637,0.457201,0.535184,-0.198404,-0.005004,0.552810,-0.209077,0.389751,-0.337082
612,T cells CD8+,Riociguat,LSM-45758,COC(=O)N(C)c1c(N)nc(-c2nn(Cc3ccccc3F)c3ncccc23...,False,0.101138,-0.409724,-0.606292,-0.071300,-0.001789,...,0.005951,-0.893093,-1.003029,-0.080367,-0.076604,0.024849,0.012862,-0.029684,0.005506,-1.733112


In [19]:
# Keep only the genes whose symbol appears as a column in the training data
shared_gene_info = gene_info[gene_info["pr_gene_symbol"].isin(train_df.columns)]
shared_gene_info

Unnamed: 0,pr_gene_id,pr_gene_symbol,pr_gene_title,pr_is_lm,pr_is_bing
0,780,DDR1,discoidin domain receptor tyrosine kinase 1,1,1
1,7849,PAX8,paired box 8,1,1
3,2049,EPHB3,EPH receptor B3,0,1
4,2101,ESRRA,estrogen related receptor alpha,0,1
5,8717,TRADD,TNFRSF1A-associated via death domain,0,1
...,...,...,...,...,...
12323,4034,LRCH4,leucine-rich repeats and calponin homology (CH...,0,1
12324,399664,MEX3D,mex-3 RNA binding family member D,0,1
12325,54869,EPS8L1,EPS8 like 1,0,1
12326,90379,DCAF15,DDB1 and CUL4 associated factor 15,0,1
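
The filter above compares gene symbols against the column labels of the training table. Iterating over a DataFrame yields its column labels, not its values, which is why the comparison is against labels. A toy sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical stand-ins for train_df and gene_info.
toy_train = pd.DataFrame({"cell_type": ["NK cells"], "DDR1": [0.1], "PAX8": [-0.2]})
toy_genes = pd.DataFrame({"pr_gene_symbol": ["DDR1", "GUCA1A", "PAX8"]})

# Iterating a DataFrame walks over its column labels.
column_labels = list(toy_train)

# Being explicit with .columns makes the intent obvious to the reader.
shared = toy_genes[toy_genes["pr_gene_symbol"].isin(toy_train.columns)]

print(column_labels)
print(shared["pr_gene_symbol"].tolist())
```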


Let's see how much space we need now:

In [20]:
f"We still have {shared_gene_info.shape[0] * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We still have 8.3581314 gigabytes'

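Loading this subset (the commented-out cell below) doesn't work, most likely because roughly 8.4 gigabytes is still too much to hold in memory. Let's restrict ourselves to the landmark genes (`pr_is_lm == 1`) instead.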

In [21]:
# from cmapPy.pandasGEXpress.parse import parse

# gene_ids = shared_gene_info["pr_gene_id"].astype(str)
# sig_ids = comp_sig_info["sig_id"]

# l5_data = parse("data/GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid = sig_ids, rid = gene_ids)
# l5_data.data_df.shape

In [22]:
landmark_info = gene_info[gene_info["pr_is_lm"]==1]
landmark_gene_row_ids = gene_info["pr_gene_id"][gene_info["pr_is_lm"] == 1]
landmark_info

Unnamed: 0,pr_gene_id,pr_gene_symbol,pr_gene_title,pr_is_lm,pr_is_bing
0,780,DDR1,discoidin domain receptor tyrosine kinase 1,1,1
1,7849,PAX8,paired box 8,1,1
25,6193,RPS5,ribosomal protein S5,1,1
43,23,ABCF1,ATP binding cassette subfamily F member 1,1,1
49,9552,SPAG7,sperm associated antigen 7,1,1
...,...,...,...,...,...
12184,5467,PPARD,peroxisome proliferator activated receptor delta,1,1
12223,2767,GNA11,guanine nucleotide binding protein (G protein)...,1,1
12224,23038,WDTC1,WD and tetratricopeptide repeats 1,1,1
12286,57048,PLSCR3,phospholipid scramblase 3,1,1


Did that save space?

In [23]:
f"We have {landmark_info.shape[0] * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We have 0.890926704 gigabytes'

Great! Let's load in the dataset:

In [24]:
from cmapPy.pandasGEXpress.parse import parse
gene_ids = landmark_info["pr_gene_id"].astype(str)
sig_ids = comp_sig_info["sig_id"]
l5_data = parse("data/GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid = sig_ids, rid = gene_ids)
l5_data.data_df.shape

(978, 113871)

Let's prepare metadata for annotation.

In [25]:
for sk in ["pert_id","pert_type","distil_id"]:
    del comp_sig_info[sk]
comp_sig_info.set_index("sig_id", inplace=True)
comp_sig_info

Unnamed: 0_level_0,pert_iname,cell_id,pert_idose,pert_itime,canonical_smiles
sig_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
REP.A007_A375_24H:N13,10-DEBC,A375,10.0 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N14,10-DEBC,A375,3.33 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N15,10-DEBC,A375,1.11 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N16,10-DEBC,A375,0.37 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
REP.A007_A375_24H:N17,10-DEBC,A375,0.12 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12
...,...,...,...,...,...
LPROT003_A549_6H:G04,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12
LPROT003_A549_6H:G06,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12
LPROT003_PC3_6H:G01,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12
LPROT003_PC3_6H:G03,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12


In [26]:
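# Keep only the gene symbol, indexed by gene ID
for gk in ["pr_gene_title","pr_is_bing","pr_is_lm"]:
    del landmark_info[gk]
landmark_info.set_index("pr_gene_id",inplace=True)
# The row IDs in l5_data.data_df are strings, so convert the integer gene IDs to match
landmark_info.index = landmark_info.index.map(str)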

Time to annotate!

In [27]:
l5_data.col_metadata_df = comp_sig_info
l5_data.row_metadata_df = landmark_info
l5_data.data_df

cid,REP.A001_A375_24H:A03,REP.A001_A375_24H:A04,REP.A001_A375_24H:A05,REP.A001_A375_24H:A06,REP.A001_A375_24H:A07,REP.A001_A375_24H:A08,REP.A001_A375_24H:A09,REP.A001_A375_24H:A10,REP.A001_A375_24H:A11,REP.A001_A375_24H:A12,...,LJP007_SKL_24H:P19,LJP007_SKL_24H:P20,LJP007_SKL_24H:P21,LJP007_SKL_24H:P22,LJP007_SKL_24H:E21,LJP007_SKL_24H:O13,LJP007_SKL_24H:O14,LJP007_SKL_24H:O24,LJP007_SKL_24H:P24,LJP007_SKL_24H:C19
rid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
780,4.264143,-0.382211,-0.571711,0.584376,0.658348,-0.004232,-0.314762,-0.049558,-0.909517,-0.850654,...,1.091158,0.264409,0.711080,0.768569,4.4460,4.4395,6.1750,8.0582,10.0000,3.0807
7849,0.057249,0.304313,-0.754999,-0.589973,-0.226854,-0.363419,-0.691129,-0.684283,0.521503,-0.640316,...,-0.493212,-0.041785,-0.606896,0.819984,6.6313,10.0000,2.8649,0.4905,9.1524,4.5834
6193,-2.139334,-0.995924,-0.710110,-0.026398,-1.143599,-0.850314,-1.052307,-0.463051,-0.494277,0.067007,...,0.807524,-0.062587,0.632009,1.067584,3.4302,1.6831,1.8397,3.3238,-1.2545,0.3805
23,-0.221784,-0.670834,0.428894,-0.065268,0.342426,0.539448,0.357474,-0.233215,-0.715200,-0.418760,...,0.255770,-0.218802,0.257857,0.170097,2.6837,4.5242,4.3375,3.6885,-4.9315,2.2650
9552,-0.376555,-0.648242,0.272606,0.542223,0.380470,-0.011210,-1.590959,-0.613891,-0.291982,-0.629392,...,0.301324,-0.841693,0.628758,-0.387287,-1.6202,-2.6985,-1.0555,-1.5548,0.9744,-1.9243
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5467,-1.639110,0.093595,-0.114638,0.226151,-0.250033,-0.713617,0.302641,0.411550,-0.287456,-0.958237,...,-0.638983,-0.694421,-0.270898,-0.718795,1.2024,3.5227,4.2233,0.0099,-1.9593,2.1359
2767,0.685100,0.326673,0.304832,0.404963,-0.952898,0.722581,-0.956834,-0.019838,-0.932226,0.954081,...,0.370300,0.332669,-0.282100,-0.498417,1.5862,-3.5622,0.0381,-3.6135,2.9332,-3.2407
23038,-0.419421,1.048097,-0.249467,-4.357310,-0.011989,0.698287,0.122922,0.866827,1.334106,1.836836,...,1.892722,1.880093,-0.596567,1.565006,-2.1189,-2.1291,-2.8634,-0.1975,-4.6106,-2.6478
57048,1.716090,-0.505179,-0.428352,0.076562,-0.092549,0.322897,-1.166843,0.429393,-0.550419,0.263633,...,-2.228678,-0.528023,-0.963753,0.112534,-1.3866,0.5198,1.1650,-6.1485,1.5921,-1.9503


Before we finish the labels, let's add a control column:

In [28]:
# Add a control column
comp_sig_info["control"] = comp_sig_info["pert_iname"] == "DMSO"
comp_sig_info[comp_sig_info["control"]]

Unnamed: 0_level_0,pert_iname,cell_id,pert_idose,pert_itime,canonical_smiles,control
sig_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
LJP005_A375_24H:A03,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:A04,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:A05,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:A06,DMSO,A375,-666,24 h,CS(=O)C,True
LJP005_A375_24H:B03,DMSO,A375,-666,24 h,CS(=O)C,True
...,...,...,...,...,...,...
REP.A028_YAPC_24H:J14,DMSO,YAPC,-666,24 h,CS(=O)C,True
REP.A028_YAPC_24H:J15,DMSO,YAPC,-666,24 h,CS(=O)C,True
REP.A028_YAPC_24H:J16,DMSO,YAPC,-666,24 h,CS(=O)C,True
REP.A028_YAPC_24H:J17,DMSO,YAPC,-666,24 h,CS(=O)C,True


Join on the index (the signature ID, `sig_id`):

In [29]:
final_data = comp_sig_info.join(l5_data.data_df.T)
final_data = final_data.rename(columns=landmark_info.to_dict()["pr_gene_symbol"])
final_data

Unnamed: 0_level_0,pert_iname,cell_id,pert_idose,pert_itime,canonical_smiles,control,DDR1,PAX8,RPS5,ABCF1,...,P4HTM,SLC27A3,TBXA2R,RTN2,TSTA3,PPARD,GNA11,WDTC1,PLSCR3,NPEPL1
sig_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
REP.A007_A375_24H:N13,10-DEBC,A375,10.0 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.038472,1.849687,1.387801,-0.280553,...,1.638619,-1.108579,0.762555,0.667163,-0.644516,0.483852,-0.129235,0.710601,0.790889,0.043918
REP.A007_A375_24H:N14,10-DEBC,A375,3.33 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.102355,0.727322,1.702942,-0.360274,...,0.418435,-0.792270,0.927009,0.809531,-0.697894,0.188830,-0.594291,-0.555795,0.721365,1.173322
REP.A007_A375_24H:N15,10-DEBC,A375,1.11 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-0.568256,0.511114,0.565644,-0.240555,...,0.888272,-0.677597,1.164146,0.968310,-0.591217,-0.203559,-1.332539,-1.671919,-0.602511,0.761775
REP.A007_A375_24H:N16,10-DEBC,A375,0.37 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.079726,-0.023865,0.704639,-0.254031,...,0.122289,-1.524039,1.073033,0.373594,-0.445966,2.532496,-0.548906,-0.325132,1.272979,-1.054154
REP.A007_A375_24H:N17,10-DEBC,A375,0.12 um,24 h,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,False,-1.909248,-0.086209,1.792485,-1.020507,...,0.499317,-1.332392,0.843685,0.068158,0.738767,-0.188630,0.593074,-0.960322,-0.270928,0.383060
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
LPROT003_A549_6H:G04,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,0.225100,0.000000,-0.902900,-0.456600,...,-0.295300,0.420700,-0.226300,0.078000,-0.956200,-0.696600,-0.850100,-1.024800,-0.243300,0.447600
LPROT003_A549_6H:G06,smer-3,A549,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,-0.859600,0.125700,0.315600,-0.260800,...,2.897400,0.459000,-0.629100,0.668200,0.000000,-0.138700,-1.151500,0.000000,0.000000,-0.978700
LPROT003_PC3_6H:G01,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,-0.215200,-0.674500,-2.441700,2.421300,...,2.501800,2.560700,0.600100,2.412100,-0.745300,0.432700,-1.318600,2.133700,0.605800,0.182200
LPROT003_PC3_6H:G03,smer-3,PC3,10.0 um,6 h,O=C1c2ccccc2-c2nc3nonc3nc12,False,-2.046500,-1.113400,0.103400,-2.118000,...,1.556200,-0.567900,-1.015200,0.282700,-0.121700,-0.551700,-2.266200,0.454400,-0.194900,0.674500
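
The join and rename above can be followed on a toy example. The sketch below uses hypothetical two-gene, two-signature frames: the expression matrix is genes by samples, so it is transposed before joining on the signature index, and the gene-ID column labels are then mapped to symbols.

```python
import pandas as pd

# Hypothetical genes x samples matrix with string gene IDs as row labels.
toy_expr = pd.DataFrame(
    [[1.5, -0.5], [0.2, 0.8]],
    index=["780", "7849"],
    columns=["sigA", "sigB"],
)
# Hypothetical sample metadata and gene-ID-to-symbol lookup.
toy_meta = pd.DataFrame({"cell_id": ["A375", "PC3"]}, index=["sigA", "sigB"])
toy_symbols = pd.DataFrame({"pr_gene_symbol": ["DDR1", "PAX8"]}, index=["780", "7849"])

# Transpose to samples x genes, then join on the shared signature index.
toy_final = toy_meta.join(toy_expr.T)
# Replace gene-ID column labels with gene symbols.
toy_final = toy_final.rename(columns=toy_symbols["pr_gene_symbol"].to_dict())

print(toy_final)
```

The rename only works because the lookup's index and the matrix's row labels are both strings; with integer IDs on one side, no column would be renamed.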


In [30]:
final_data.to_parquet("data/lincs.parquet")

In [31]:
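def find_control(cell_id):
    """Return one randomly chosen DMSO control row for the given cell line."""
    # Could either return a random control or the mean over all controls.
    valid_rows = final_data[(final_data["cell_id"]==cell_id) & (final_data["control"])]
    return valid_rows.sample(1).iloc[0]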

find_control("A375")

pert_iname              DMSO
cell_id                 A375
pert_idose              -666
pert_itime              24 h
canonical_smiles     CS(=O)C
                      ...   
PPARD              -0.518744
GNA11               0.495973
WDTC1               1.889292
PLSCR3             -0.421194
NPEPL1              2.334804
Name: REP.A017_A375_24H:B03, Length: 984, dtype: object
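
The comment in `find_control` mentions returning the mean instead of a random control. A minimal sketch of that alternative follows; `find_control_mean` is a hypothetical helper, shown here on a toy frame rather than on `final_data`.

```python
import pandas as pd


def find_control_mean(data: pd.DataFrame, cell_id: str) -> pd.Series:
    """Average the expression columns over all control rows of one cell line."""
    mask = (data["cell_id"] == cell_id) & data["control"]
    # Keep only numeric columns: this skips the string metadata and the
    # boolean control flag, leaving the float expression values.
    return data.loc[mask].select_dtypes(include="number").mean()


# Hypothetical toy data with two A375 controls and one PC3 control.
toy = pd.DataFrame({
    "cell_id": ["A375", "A375", "PC3"],
    "control": [True, True, True],
    "DDR1": [1.0, 3.0, 10.0],
    "PAX8": [0.0, -2.0, 5.0],
})

mean_control = find_control_mean(toy, "A375")
print(mean_control)
```

Averaging gives a deterministic baseline with less noise; sampling preserves the variability of individual control wells. Which one is preferable depends on how the controls are used downstream.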

In [42]:
# Smallest nonzero absolute expression value (the first 6 columns are metadata)
x = abs(final_data.iloc[:,6:].to_numpy().flatten())
x[x.nonzero()].min()

7.275958e-12