# Loading profiles from the JUMP Cell Painting Datasets  
This notebook loads precomputed features for a small number of plates, together with their metadata.
## Import libraries

In [63]:
import io
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected" # Set to "svg" or "png" for static plots or "notebook_connected" for interactive plots

## Helper functions

In [64]:
formatter = ('s3://cellpainting-gallery/cpg0016-jump/'
             '{Metadata_Source}/workspace/profiles/'
             '{Metadata_Batch}/{Metadata_Plate}/{Metadata_Plate}.parquet')

def read_parquet_from_s3(s3_path, columns=None):
    dframe = pd.read_parquet(s3_path,
                             storage_options={"anon": True},
                             columns=columns)
    # Force to be string. Needed for merge
    dframe['Metadata_Plate'] = dframe['Metadata_Plate'].astype(str)
    return dframe
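
To make the path template concrete, here is a minimal sketch of how `formatter` expands for a single plate. The template is repeated so the snippet runs on its own, and the `row` dictionary is a hand-written stand-in for one row of the plate metadata (its values are copied from the sampled plates shown later in this notebook).

```python
# Same template as in the helper cell, repeated so this snippet is self-contained
formatter = ('s3://cellpainting-gallery/cpg0016-jump/'
             '{Metadata_Source}/workspace/profiles/'
             '{Metadata_Batch}/{Metadata_Plate}/{Metadata_Plate}.parquet')

# Hand-written stand-in for one row of the plate metadata
row = {
    'Metadata_Source': 'source_3',
    'Metadata_Batch': 'CP_27_all_Phenix1',
    'Metadata_Plate': 'J12436b',
    'Metadata_PlateType': 'COMPOUND',  # extra keys are ignored by str.format
}

s3_path = formatter.format(**row)
print(s3_path)
```

Because `str.format` ignores unused keyword arguments, the whole metadata row can be passed without first selecting the three fields that the template needs.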

## Load metadata

The following files contain the metadata information for the entire dataset.
The schema is [here](https://github.com/jump-cellpainting/datasets/blob/main/metadata/README.md).

In [65]:
plates = pd.read_csv('metadata/plate.csv.gz')
wells = pd.read_csv('metadata/well.csv.gz')
compound = pd.read_csv('metadata/compound.csv.gz')

## Sample plates
As an example, we filter for a particular plate type and sample 2 plates per source.

In [66]:
sample = plates.query('Metadata_PlateType=="TARGET2"').groupby('Metadata_Source').sample(2, random_state=42)
sample

| | Metadata_Source | Metadata_Batch | Metadata_Plate | Metadata_PlateType |
|---|---|---|---|---|
| 1474 | source_10 | 2021_08_17_U2OS_48_hr_run16 | Dest210809-141104 | COMPOUND |
| 1318 | source_10 | 2021_06_01_U2OS_48_hr_run2 | Dest210601-154924 | COMPOUND |
| 108 | source_3 | CP_27_all_Phenix1 | J12436b | COMPOUND |
| 267 | source_3 | CP_35_all_Phenix1 | BAY5870c | COMPOUND |
| 784 | source_5 | JUMPCPE-20210908-Run28_20210909_072022 | AETJUM102 | COMPOUND |
| 794 | source_5 | JUMPCPE-20210909-Run29_20210910_234236 | AETJUM110 | COMPOUND |
| 1072 | source_6 | p211123CPU2OS48hw384exp036JUMP | 110000297115 | COMPOUND |
| 851 | source_6 | p210824CPU2OS48hw384exp022JUMP | 110000293088 | COMPOUND |
| 1115 | source_8 | J1 | A1170411 | COMPOUND |
| 1205 | source_8 | J3 | A1170499 | COMPOUND |
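
The filter-then-sample pattern used for the plates can be tried offline on a toy table. This is only a sketch: the `toy_plates` frame and all of its values are made up for illustration and are not part of the JUMP metadata.

```python
import pandas as pd

# Toy stand-in for the plate metadata; every value here is made up
toy_plates = pd.DataFrame({
    'Metadata_Source': ['source_a'] * 4 + ['source_b'] * 3,
    'Metadata_Plate': [f'plate_{i}' for i in range(7)],
    'Metadata_PlateType': ['COMPOUND', 'COMPOUND', 'COMPOUND', 'TARGET2',
                           'COMPOUND', 'COMPOUND', 'TARGET2'],
})

toy_sample = (
    toy_plates
    .query('Metadata_PlateType == "COMPOUND"')  # keep a single plate type
    .groupby('Metadata_Source')                  # one group per source
    .sample(n=2, random_state=42)                # draw 2 plates from each group
)
print(toy_sample)
```

Fixing `random_state` makes the draw reproducible. Note that `sample(n=2)` raises an error if any group contains fewer than 2 rows, so a source with a single matching plate would need to be handled separately.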


## Loading profiles
Setting `columns = None` will load all of the features.

<div class="alert alert-warning">
WARNING: The files are located in S3. This loop loads only two features for each sampled plate; loading many features and/or many plates can take several minutes.
</div>

In [67]:
dframes = []
columns = ['Metadata_Source', 'Metadata_Plate', 'Metadata_Well', 'Cells_AreaShape_Eccentricity', 'Nuclei_AreaShape_Area']
for _, row in sample.iterrows():
    s3_path = formatter.format(**row.to_dict())
    dframes.append(read_parquet_from_s3(s3_path, columns=columns))
dframes = pd.concat(dframes)
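
One detail of the concatenation step is worth illustrating: `pd.concat` keeps each frame's original row index by default, so the combined frame has repeated index labels. The sketch below uses two made-up per-plate frames to show the difference that `ignore_index=True` makes. Whether a fresh index is needed depends on later steps; the merge in the next section works on columns and is unaffected.

```python
import pandas as pd

# Two toy per-plate frames; values are made up for illustration
plate_a = pd.DataFrame({'Metadata_Plate': ['A', 'A'],
                        'Nuclei_AreaShape_Area': [1.0, 2.0]})
plate_b = pd.DataFrame({'Metadata_Plate': ['B', 'B'],
                        'Nuclei_AreaShape_Area': [3.0, 4.0]})

kept = pd.concat([plate_a, plate_b])                      # index repeats per plate
fresh = pd.concat([plate_a, plate_b], ignore_index=True)  # index renumbered from 0
print(kept.index.tolist(), fresh.index.tolist())
```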

## Join features with metadata
Now we merge profiles with the metadata to get an annotated DataFrame.

In [None]:
metadata = compound.merge(wells, on='Metadata_JCP2022')
ann_dframe = metadata.merge(dframes, on=['Metadata_Source', 'Metadata_Plate', 'Metadata_Well'])
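
Both merges above are inner joins (the pandas default), so rows without a match on either side are silently dropped. The following sketch, built on made-up toy tables rather than the real metadata, shows this behavior and one possible way to inspect unmatched profile rows with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for the compound, well and profile tables; all values are made up
toy_compound = pd.DataFrame({'Metadata_JCP2022': ['JCP_1', 'JCP_2'],
                             'Metadata_InChIKey': ['KEY_1', 'KEY_2']})
toy_wells = pd.DataFrame({'Metadata_Source': ['s1', 's1', 's1'],
                          'Metadata_Plate': ['P1', 'P1', 'P1'],
                          'Metadata_Well': ['A01', 'A02', 'A03'],
                          'Metadata_JCP2022': ['JCP_1', 'JCP_2', 'JCP_9']})
toy_profiles = pd.DataFrame({'Metadata_Source': ['s1', 's1'],
                             'Metadata_Plate': ['P1', 'P1'],
                             'Metadata_Well': ['A01', 'A03'],
                             'Nuclei_AreaShape_Area': [10.0, 30.0]})

keys = ['Metadata_Source', 'Metadata_Plate', 'Metadata_Well']

# Inner join: well A03 is dropped because JCP_9 has no compound entry
toy_metadata = toy_compound.merge(toy_wells, on='Metadata_JCP2022')
# Inner join: only wells present in both tables survive
toy_annotated = toy_metadata.merge(toy_profiles, on=keys)

# Left join with an indicator column flags profile rows that found no metadata
check = toy_profiles.merge(toy_metadata, on=keys, how='left', indicator=True)
print(check[['Metadata_Well', '_merge']])
```

In this toy case only well `A01` ends up annotated, and the `_merge` column marks `A03` as `left_only`, which is a quick way to spot wells that would otherwise disappear from the annotated DataFrame.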

## Plot features
The following scatter plot contains every well in the sampled dataset.

In the interactive plot (see [`imports`](#Import-libraries)), you can hover over the points to see the JCP ID and the InChIKey for a given compound.

Because these are raw, unnormalized features, you will notice discernible clusters corresponding to each source due to batch effects.

Upcoming data releases will include normalized features, where these effects are mitigated to some extent.

In [None]:
# Scatter plot of two raw features, one point per well, colored by source
px.scatter(
    ann_dframe,
    x="Cells_AreaShape_Eccentricity",
    y="Nuclei_AreaShape_Area",
    color="Metadata_Source",
    hover_name="Metadata_JCP2022",
    hover_data=["Metadata_InChIKey"]
)
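
The source-level clusters seen in the scatter plot can also be checked numerically by comparing per-source medians of the plotted features. The sketch below runs on a made-up toy frame standing in for `ann_dframe`; it is a rough diagnostic only, not a normalization method.

```python
import pandas as pd

# Toy stand-in for the annotated DataFrame; values are made up for illustration
toy = pd.DataFrame({
    'Metadata_Source': ['s1', 's1', 's1', 's2', 's2', 's2'],
    'Cells_AreaShape_Eccentricity': [0.60, 0.62, 0.64, 0.80, 0.82, 0.84],
    'Nuclei_AreaShape_Area': [900.0, 1000.0, 1100.0, 1400.0, 1500.0, 1600.0],
})
features = ['Cells_AreaShape_Eccentricity', 'Nuclei_AreaShape_Area']

# Median of each feature within each source
per_source = toy.groupby('Metadata_Source')[features].median()
print(per_source)
```

Large differences between the rows of `per_source` relative to the spread within each source would be consistent with the batch effects described above.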
