# The NIH LINCS data is very large
You can download all relevant files on the [GEO Website](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138).

The dataset contains 12,328 rows (genes) by 118,050 columns (samples), for a total of 1,455,320,400 entries.

In [146]:
# Assuming an 8 byte float
base_rows = 12328
base_columns = 118050
f"That's {base_rows * base_columns * 8 / 1e9} gigabytes"

"That's 11.6425632 gigabytes"

**This is too large for practical use**, and might be too big to fit in working memory (RAM) for many computers.
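The estimate depends on the assumed element size. As a rough sketch (the helper name `dense_size_gb` is made up for illustration, and whether lower precision is acceptable for this data is an assumption), the same arithmetic for a few common float widths looks like this:

```python
def dense_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size of a dense n_rows x n_cols matrix in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9

base_rows = 12328       # genes
base_columns = 118050   # samples

# Compare the footprint at different floating-point precisions
for label, width in [("float64", 8), ("float32", 4), ("float16", 2)]:
    print(f"{label}: {dense_size_gb(base_rows, base_columns, width):.2f} GB")
```

Halving the precision halves the footprint, but even at 4 bytes per value the full matrix is still several gigabytes, so filtering the samples remains worthwhile.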

But we aren't interested in all the samples. Many of them come from experiments that are not relevant to us. Luckily, the NIH provides a number of metadata files we can use to help us decide which experiments we are interested in.

### Determining data of interest

The reagent (chemical or genetic) being studied in a given experiment is called the perturbagen, and there are several types of perturbagens.

Let's load in the metadata file that contains info on the perturbagens.

In [6]:
import pandas as pd

pert_info = pd.read_csv("data/GSE70138_Broad_LINCS_pert_info_2017-03-06.txt", sep="\t")

Let's see what kinds of perturbagens there are in the dataset:

In [35]:
pert_info["pert_type"].drop_duplicates()

0            trt_cp
513     ctl_vehicle
1797        trt_xpr
2150      ctl_untrt
2151     ctl_vector
Name: pert_type, dtype: object
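`drop_duplicates()` lists the types but not how common each one is. A minimal sketch of counting them with `value_counts()` follows; the toy frame and its `pert_id` values are made up for illustration and stand in for the real `pert_info`:

```python
import pandas as pd

# Toy stand-in for pert_info; these rows are hypothetical
toy_pert_info = pd.DataFrame({
    "pert_id": ["BRD-0001", "BRD-0002", "DMSO", "XPR-0001", "BRD-0003"],
    "pert_type": ["trt_cp", "trt_cp", "ctl_vehicle", "trt_xpr", "trt_cp"],
})

# Count how many perturbagens fall under each type (most frequent first)
type_counts = toy_pert_info["pert_type"].value_counts()
print(type_counts)
```

On the real metadata the equivalent call would be `pert_info["pert_type"].value_counts()`.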

Looking at the Connectopedia [entry on perturbagens](https://clue.io/connectopedia/perturbagen_types_and_controls), we can see that the `pert_type` for drugs (compounds) is `trt_cp`. 

Let's see some examples:

In [33]:
compound_perturbagens = pert_info[pert_info["pert_type"]=="trt_cp"]
print(f"Found {compound_perturbagens.shape[0]} different compounds.")
compound_perturbagens[:5]

Found 1796 different compounds.


Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
0,BRD-K70792160,CCN(CC)CCCCN1c2ccccc2Oc2ccc(Cl)cc12,GYBXAGDWMCJZJK-UHFFFAOYSA-N,10-DEBC,trt_cp
1,BRD-K68552125,CCCCCCCCCCCCCC(=O)O[C@@H]1[C@@H](C)[C@]2(O)[C@...,PHEDXBVPIONUQT-RGYGYFBISA-N,phorbol-myristate-acetate,trt_cp
2,BRD-K92301463,CCCCC(C)(C)[C@H](O)\C=C\[C@H]1[C@H](O)CC(=O)[C...,QAOBBBBDJSWHMU-WMBBNPMCSA-N,"16,16-dimethylprostaglandin-e2",trt_cp
3,BRD-A29731977,CCCCCC(=O)O[C@@]1(CCC2C3CCC4=CC(=O)CC[C@]4(C)C...,DOMWKUIIPQCAJU-JKPPDDDBSA-N,17-hydroxyprogesterone-caproate,trt_cp
4,BRD-K07954936,OC(=O)CCCC[C@@H]1SC[C@@H]2NC(=N)N[C@H]12,WWVANQJRLPIHNS-ZKWXMUAHSA-N,2-iminobiotin,trt_cp


Note that `pert_iname` in this dataset corresponds to `sm_name` in the Kaggle dataset (`de_train.parquet`). The same holds true for `canonical_smiles` and `SMILES`, respectively.
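To find the compounds the two datasets share, one option is to merge on a normalized name. The sketch below uses toy frames with placeholder SMILES strings. The `normalize` helper and the `name_key` column are hypothetical, and the idea that the spellings might differ in case or whitespace is an assumption, not something checked here:

```python
import pandas as pd

# Toy stand-ins; the rows are illustrative, not copied from either dataset
lincs = pd.DataFrame({
    "pert_iname": ["belinostat", "dabrafenib", "2-iminobiotin"],
    "canonical_smiles": ["SMILES_A", "SMILES_B", "SMILES_C"],  # placeholders
})
kaggle = pd.DataFrame({
    "sm_name": ["Belinostat", "Dabrafenib ", "Some-Other-Drug"],
    "SMILES": ["SMILES_A", "SMILES_B", "SMILES_D"],  # placeholders
})

def normalize(names: pd.Series) -> pd.Series:
    """Lower-case and strip whitespace so trivially different spellings match."""
    return names.str.strip().str.lower()

lincs["name_key"] = normalize(lincs["pert_iname"])
kaggle["name_key"] = normalize(kaggle["sm_name"])

# An inner merge keeps only the compounds present in both tables
shared = lincs.merge(kaggle, on="name_key", how="inner")
print(shared[["pert_iname", "sm_name"]])
```

Matching on SMILES instead would avoid naming differences, but only if both sources write each structure as the same string, which is not guaranteed.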

By the way, the negative control (DMSO) exists in this dataset too, with a special `pert_type` called `ctl_vehicle`.

In [19]:
control_perturbagen = pert_info[pert_info["pert_type"]=="ctl_vehicle"]
control_perturbagen

Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
513,DMSO,CS(=O)C,IAZDPXIOMUYVGZ-UHFFFAOYSA-N,DMSO,ctl_vehicle


The positive controls, dabrafenib and belinostat, are present as well.

In [24]:
positive_perturbagens = pert_info[pert_info["pert_iname"].isin(["dabrafenib","belinostat"])]
positive_perturbagens

Unnamed: 0,pert_id,canonical_smiles,inchi_key,pert_iname,pert_type
216,BRD-K17743125,ONC(=O)\C=C\c1cccc(c1)S(=O)(=O)Nc1ccccc1,NCNRHFGMJRPRSK-MDZDMXLPSA-N,belinostat,trt_cp
441,BRD-K09951645,CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(...,BFSMGDJOXZAERB-UHFFFAOYSA-N,dabrafenib,trt_cp


### Building an index
The work we just did tells us which `pert_id`s we are interested in, but we don't have an index into the dataset yet.

Let's load the sample metadata. Note that there is one entry for every sample in the dataset.

In [41]:
sig_info = pd.read_csv("data/GSE70138_Broad_LINCS_sig_info_2017-03-06.txt", sep="\t")
sig_info.shape

(118050, 8)

This sample metadata contains the same `pert_type` column as the perturbagen metadata above, along with `pert_id` and `pert_iname`, but it has no structural information about the perturbagen (no SMILES or InChIKey):

In [42]:
sig_info.columns

Index(['sig_id', 'pert_id', 'pert_iname', 'pert_type', 'cell_id', 'pert_idose',
       'pert_itime', 'distil_id'],
      dtype='object')

Luckily, we can combine all the work we've done so far.
We want to annotate every sample that uses a compound perturbagen with its SMILES and its International Chemical Identifier key (InChIKey).

*Hold on tight!*

In [137]:
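# Keep only the samples treated with a compound
# (assumption: this filter was defined in a cell that is not shown here)
comp_sig_info = sig_info[sig_info["pert_type"] == "trt_cp"]
# Annotate each remaining sample with its compound's SMILES and InChIKey
compound_perturbagens = compound_perturbagens[['pert_id', 'canonical_smiles', 'inchi_key']]
key = "pert_id"
comp_sig_info = comp_sig_info.join(compound_perturbagens.set_index(key), on=key)
comp_sig_info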

Unnamed: 0,sig_id,pert_id,pert_iname,pert_type,cell_id,pert_idose,pert_itime,distil_id,canonical_smiles,inchi_key
4,LJP005_A375_24H:A07,BRD-K76908866,CP-724714,trt_cp,A375,10.0 um,24 h,LJP005_A375_24H_X1_B19:A07|LJP005_A375_24H_X2_...,COCC(=O)NC\C=C\c1ccc2ncnc(Nc3ccc(Oc4ccc(C)nc4)...,LLVZBTWPGQVVLW-SNAWJCMRSA-N
5,LJP005_A375_24H:A08,BRD-K76908866,CP-724714,trt_cp,A375,3.33 um,24 h,LJP005_A375_24H_X1_B19:A08|LJP005_A375_24H_X2_...,COCC(=O)NC\C=C\c1ccc2ncnc(Nc3ccc(Oc4ccc(C)nc4)...,LLVZBTWPGQVVLW-SNAWJCMRSA-N
6,LJP005_A375_24H:A09,BRD-K76908866,CP-724714,trt_cp,A375,1.11 um,24 h,LJP005_A375_24H_X1_B19:A09|LJP005_A375_24H_X2_...,COCC(=O)NC\C=C\c1ccc2ncnc(Nc3ccc(Oc4ccc(C)nc4)...,LLVZBTWPGQVVLW-SNAWJCMRSA-N
7,LJP005_A375_24H:A10,BRD-K76908866,CP-724714,trt_cp,A375,0.37 um,24 h,LJP005_A375_24H_X1_B19:A10|LJP005_A375_24H_X2_...,COCC(=O)NC\C=C\c1ccc2ncnc(Nc3ccc(Oc4ccc(C)nc4)...,LLVZBTWPGQVVLW-SNAWJCMRSA-N
8,LJP005_A375_24H:A11,BRD-K76908866,CP-724714,trt_cp,A375,0.12 um,24 h,LJP005_A375_24H_X1_B19:A11|LJP005_A375_24H_X2_...,COCC(=O)NC\C=C\c1ccc2ncnc(Nc3ccc(Oc4ccc(C)nc4)...,LLVZBTWPGQVVLW-SNAWJCMRSA-N
...,...,...,...,...,...,...,...,...,...,...
113866,REP.A028_YAPC_24H:K09,BRD-K60230970,MG-132,trt_cp,YAPC,20.0 um,24 h,REP.A028_YAPC_24H_X2_B25:K09|REP.A028_YAPC_24H...,CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(...,TZYWCYJVHRLUCT-VABKMULXSA-N
113867,REP.A028_YAPC_24H:M18,BRD-K96862998,pirfenidone,trt_cp,YAPC,0.04 um,24 h,REP.A028_YAPC_24H_X2_B25:M18|REP.A028_YAPC_24H...,Cc1ccc(=O)n(c1)-c1ccccc1,ISWRGOKTTBVCFA-UHFFFAOYSA-N
113868,REP.A028_YAPC_24H:O01,BRD-K60230970,MG-132,trt_cp,YAPC,20.0 um,24 h,REP.A028_YAPC_24H_X2_B25:O01|REP.A028_YAPC_24H...,CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(...,TZYWCYJVHRLUCT-VABKMULXSA-N
113869,REP.A028_YAPC_24H:O06,BRD-K60230970,MG-132,trt_cp,YAPC,20.0 um,24 h,REP.A028_YAPC_24H_X2_B25:O06|REP.A028_YAPC_24H...,CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(...,TZYWCYJVHRLUCT-VABKMULXSA-N
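A join like the one above can silently duplicate rows if a `pert_id` appears more than once in the compound table, and it leaves `NaN` where a sample has no match. A hedged sketch of guarding against both with `DataFrame.merge` follows; the toy frames and their values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the sample and compound metadata (values are hypothetical)
toy_sig = pd.DataFrame({
    "sig_id": ["s1", "s2", "s3", "s4"],
    "pert_id": ["BRD-1", "BRD-1", "BRD-2", "BRD-9"],
})
toy_compounds = pd.DataFrame({
    "pert_id": ["BRD-1", "BRD-2"],
    "canonical_smiles": ["CCO", "CCN"],
})

# validate="many_to_one" raises MergeError if a pert_id repeats on the right side;
# indicator=True adds a "_merge" column recording where each row found a match
annotated = toy_sig.merge(
    toy_compounds, on="pert_id", how="left",
    validate="many_to_one", indicator=True,
)

# A many-to-one left join should never change the number of samples
assert len(annotated) == len(toy_sig)

# Samples whose pert_id had no match keep NaN in the annotation columns
unmatched = annotated[annotated["_merge"] == "left_only"]
print(unmatched["sig_id"].tolist())
```

Here only the sample with the unknown `pert_id` ("BRD-9") is reported as unmatched.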


We didn't save much space with all this cleaning:

In [148]:
f"We still have {12328 * comp_sig_info.shape[0] * 8 / 1e9} gigabytes"

'We still have 10.592612096 gigabytes'

But we can shrink the data further by keeping only the landmark genes, or only the cell types we care about.
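To get a feel for how much the landmark-gene restriction would help, here is a back-of-the-envelope sketch. The count of 978 landmark genes comes from the L1000 assay description rather than from this notebook, so treat it as an assumption. The sample count is the one implied by the size printed above.

```python
n_all_genes = 12328
n_landmark_genes = 978       # assumption: L1000 landmark gene count, not stated in this notebook
n_compound_samples = 107404  # implied by the figure above: 10.592612096e9 / (12328 * 8)

bytes_per_value = 8          # same 8-byte float assumption as before

all_genes_gb = n_all_genes * n_compound_samples * bytes_per_value / 1e9
landmark_gb = n_landmark_genes * n_compound_samples * bytes_per_value / 1e9

print(f"All genes:      {all_genes_gb:.2f} GB")
print(f"Landmark genes: {landmark_gb:.2f} GB")
```

Under these assumptions the matrix drops from about 10.59 GB to about 0.84 GB, before any filtering by cell type.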