## Metadata Analysis

The images come with a large amount of metadata. In this notebook we analyze that metadata to make sense of the whole database.

In [40]:
import pandas as pd
import numpy as np
import sqlite3

## 1. Global Statistics

### 1.1. Chemical Annotation

`chemical_annotations.csv` contains the global information for the chemical compounds.

> Table containing metadata for many of the compounds from Broad Institute’s Chemical Biology Informatics Platform (CBIP), including (where applicable) compound names, simplified molecular-input line-entry system annotations (SMILES), MLSMR sample identifiers, and PubChem compound identifiers (CID) and substance identifiers (SID).

In [27]:
df = pd.read_csv("./data/test/meta_data/chemical_annotations.csv")
print(df.shape)
df.head(7)

(30616, 10)


Unnamed: 0,BROAD_ID,CPD_NAME,CPD_NAME_TYPE,CPD_SAMPLE_ID,DOS_LIBRARY,SOURCE_NAME,CHEMIST_NAME,VENDOR_CATALOG_ID,CPD_SMILES,USERCOMMENT
0,BRD-A56675431-001-04-0,altizide,INN,SA82748,,Prestwick Chemical Inc.,,Prestw-721,NS(=O)(=O)c1cc2c(NC(CSCC=C)NS2(=O)=O)cc1Cl,
1,BRD-A51829654-001-01-4,"BRL-15,572",common,SA82481,,Biomol International Inc.,,AC-536,OC(CN1CCN(CC1)c1cccc(Cl)c1)C(c1ccccc1)c1ccccc1,
2,BRD-K04046242-001-03-6,equilin,primary-common,SA82922,,Prestwick Chemical Inc.,,Prestw-850,C[C@]12CC[C@H]3C(=CCc4cc(O)ccc34)[C@@H]1CCC2=O,
3,BRD-K16508793-001-01-8,diazepam,INN,SA59660,,MicroSource Discovery Systems Inc.,,1900003,CN1c2ccc(Cl)cc2C(=NCC1=O)c1ccccc1,
4,BRD-K09397065-001-01-6,SR 57227A,to-be-curated,SA82504,,Biomol International Inc.,,AC-561,NC1CCN(CC1)c1cccc(Cl)n1,
5,BRD-K11927976-050-01-1,ER-27319,to-be-curated,SA792875,,Tocris Bioscience,,2471,Cc1ccc2c(c1C)n(CCCN)c1ccccc1c2=O,
6,BRD-K14282469-001-01-5,PAPP,primary-common,SA82523,,Biomol International Inc.,,AC-846,Nc1ccc(CCN2CCN(CC2)c2cccc(c2)C(F)(F)F)cc1,


### 1.2. Image Name Encoding

Each plate has a unique 5-digit plate number. The Cell Image Library actually hosts more plates than are covered by the metadata provided in the Giga database.

Each plate has 5 channels, named `ERSyto`, `ERSytoBleed`, `Hoechst`, `Mito` and `Ph_golgi` after the dyes used. For example, the directory names for plate 24278 are:
- `24278-ERSyto`
- `24278-ERSytoBleed`
- `24278-Hoechst`
- `24278-Mito`
- `24278-Ph_golgi`

Each image is stored as a 16-bit TIFF file, with a name like `cdp2bioactives_a01_s1_w2edcec6dc-b1e3-4ffc-80da-9b049a89447b.tif`.

- `a01` is the well index
- `s1` is the site index (field of view within the well)
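- I believe `w2edcec6dc-b1e3-4ffc-80da-9b049a89447b` is just a unique identifier for each image. The part after `w2` has the 8-4-4-4-12 layout of a UUID, so `w2` may be a channel (wavelength) index, but I have not confirmed this.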
- The same well and site in different channels share the same name prefix (e.g. `cdp2bioactives_a01_s1_`) before the long identifier.
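The naming scheme described above can be sketched as a small parser. This is only a sketch under assumptions: the regular expression, the `parse_image_name` helper, the reading of `w2` as a channel index, and the restriction of well rows to `a` through `p` (a 384-well plate) are my own guesses, not something documented by the dataset.

```python
import re

# Assumed layout: <prefix>_<well>_s<site>_w<channel><uuid>.tif
# (reading "w<digit>" as a channel index is a guess, not confirmed by the dataset docs)
FILENAME_RE = re.compile(
    r"^(?P<prefix>[^_]+)_(?P<well>[a-p]\d{2})_s(?P<site>\d+)_w(?P<channel>\d)"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.tif$",
    re.IGNORECASE,
)


def parse_image_name(name):
    """Split an image file name into its fields, or return None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["site"] = int(fields["site"])
    fields["channel"] = int(fields["channel"])
    return fields


parsed = parse_image_name(
    "cdp2bioactives_a01_s1_w2edcec6dc-b1e3-4ffc-80da-9b049a89447b.tif"
)
print(parsed)
```

Returning `None` for names that do not match makes it easy to spot files that break the assumed pattern instead of silently mis-parsing them.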


### 1.3. Image Statistics

This table includes the image statistics for each plate.

- **Plate ID**: 5-digit identifier given by the ImageXpress microscope labeling the plate.
- **Num_CIL_images**: Total number of images for the plate hosted at The Cell Image Library (CIL).
- **Num_CIL_wells**: Total number of wells represented in the plate hosted at CIL which have >1 site (i.e., field of view) included.
- **Num_CIL_complete wells**: Total number of wells which have all sites included.
- **Num_CIL_sites**: Total number of sites which have >1 channel included.
- **Num_CIL_complete_sites**: Total number of sites which have all channels included.
- **Num_QC_stats**: Total number of sites for which quality control data is included.
- **Num_blurry_sites**: Total number of sites labelled as blurry/out-of-focus by the quality control workflow.
- **Num_saturated_sites**: Total number of sites labelled as containing saturation artifacts by the quality control workflow.


In [21]:
df = pd.read_csv("./data/test/meta_data/image_curation_statistics.csv")
df.head()

Unnamed: 0,PlateID,Num_CIL_images,Num_CIL_wells,Num_CIL_complete wells,Num_CIL_sites,Num_CIL_complete_sites,Num_QC_stats,Num_blurry_sites,Num_saturated_sites,Num_well_profiles
0,24277,11520,384,384,2304,2304,2304,0,75,384
1,24278,2445,82,81,489,489,489,1,6,384
2,24279,11520,384,384,2304,2304,2304,0,30,384
3,24280,11520,384,384,2304,2304,2304,0,48,383
4,24293,11520,384,384,2304,2304,2304,0,32,384
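As a rough illustration of how this table can be used, the sketch below flags plates that are only partially hosted at CIL and computes the fraction of saturated sites per plate. The rows are re-typed from the output above so the block runs on its own; the constant `WELLS_PER_PLATE = 384` is an assumption based on the well counts shown, not a documented value.

```python
import pandas as pd

# The five rows shown above, re-typed by hand so the sketch is self-contained
stats = pd.DataFrame(
    {
        "PlateID": [24277, 24278, 24279, 24280, 24293],
        "Num_CIL_wells": [384, 82, 384, 384, 384],
        "Num_CIL_complete wells": [384, 81, 384, 384, 384],
        "Num_CIL_sites": [2304, 489, 2304, 2304, 2304],
        "Num_blurry_sites": [0, 1, 0, 0, 0],
        "Num_saturated_sites": [75, 6, 30, 48, 32],
    }
)

WELLS_PER_PLATE = 384  # assumption: every plate is a full 384-well plate

# Fraction of hosted sites that the QC workflow flagged as saturated
stats["saturated_fraction"] = stats["Num_saturated_sites"] / stats["Num_CIL_sites"]

# Plates for which CIL hosts images from fewer wells than the plate holds
partial = stats[stats["Num_CIL_wells"] < WELLS_PER_PLATE]
partial_plates = partial["PlateID"].tolist()
print(partial_plates)
```

In these five rows only plate 24278 is partial, which is worth keeping in mind since it is the plate used for the demonstrations in Section 2.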


## 2. Local Statistics

Each of the following tables and databases is associated with a single plate. I use plate 24278 for the demonstration.

### 2.1. Quality Control

This table gives the imaging quality statistics at the well and site level, but it has no channel information.

Each field of view is assessed for the presence of 2 artifacts (focal blur and saturated objects), and assigned a label of 1 if present and 0 if not. 

In [19]:
df = pd.read_csv("./data/test/meta_data/quality_control/qc.csv")
df.head(7)

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_Metadata_Site,Image_Metadata_isSaturated,Image_Metadata_isBlurry
0,24278,a20,1,0,0
1,24278,a20,2,0,0
2,24278,a20,3,0,0
3,24278,a20,4,0,0
4,24278,a20,5,0,0
5,24278,a20,6,0,0
6,24278,a21,1,0,0
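A minimal sketch of how the two flags might be combined to keep only clean sites and to count flagged sites per well. The rows below are toy data in the same column layout as `qc.csv`; the flag values are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy rows in the qc.csv layout; the flag values are hypothetical
qc = pd.DataFrame(
    {
        "Image_Metadata_Well": ["a20", "a20", "a21", "a21"],
        "Image_Metadata_Site": [1, 2, 1, 2],
        "Image_Metadata_isSaturated": [0, 0, 1, 0],
        "Image_Metadata_isBlurry": [0, 0, 0, 0],
    }
)

flags = ["Image_Metadata_isSaturated", "Image_Metadata_isBlurry"]

# A site passes QC only if neither artifact flag is set
qc["passes_qc"] = qc[flags].sum(axis=1) == 0
clean_sites = qc[qc["passes_qc"]]

# Number of flagged sites in each well
flagged_per_well = (~qc["passes_qc"]).groupby(qc["Image_Metadata_Well"]).sum()
print(flagged_per_well)
```

Because the table has no channel column, a flagged site would have to be dropped for all five channels at once.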


### 2.2. Mean Well Profiles

Per-well averages of each extracted morphological feature computed across the cells.

For each well, there are about 1,800 columns: a few metadata columns followed by the averaged morphology features.

**The compound information is also recorded in this table**. The column `Metadata_broad_sample` is a foreign key referencing `BROAD_ID` in the Chemical Annotation table.

In [33]:
df = pd.read_csv("./data/test/meta_data/profiles/mean_well_profiles.csv")
print(df.shape)
df.head(10)

(384, 1800)


Unnamed: 0,Metadata_Plate,Metadata_Well,Metadata_Assay_Plate_Barcode,Metadata_Plate_Map_Name,Metadata_well_position,Metadata_ASSAY_WELL_ROLE,Metadata_broad_sample,Metadata_mmoles_per_liter,Metadata_solvent,Metadata_pert_id,...,Nuclei_Texture_Variance_DNA_5_0,Nuclei_Texture_Variance_ER_10_0,Nuclei_Texture_Variance_ER_3_0,Nuclei_Texture_Variance_ER_5_0,Nuclei_Texture_Variance_Mito_10_0,Nuclei_Texture_Variance_Mito_3_0,Nuclei_Texture_Variance_Mito_5_0,Nuclei_Texture_Variance_RNA_10_0,Nuclei_Texture_Variance_RNA_3_0,Nuclei_Texture_Variance_RNA_5_0
0,24278,a01,24278,H-BIOA-007-3,a01,treated,BRD-K78364995-236-03-5,2.09006,DMSO,BRD-K78364995,...,3.065597,1.525995,1.59688,1.557685,1.633298,1.627728,1.618267,2.582515,2.474853,2.528092
1,24278,a02,24278,H-BIOA-007-3,a02,treated,BRD-K78414110-001-02-8,5.0,DMSO,BRD-K78414110,...,2.947674,1.595292,1.681501,1.639656,1.438692,1.497101,1.4734,2.402188,2.44303,2.527604
2,24278,a03,24278,H-BIOA-007-3,a03,treated,BRD-K78485176-001-02-9,5.0,DMSO,BRD-K78485176,...,2.959839,1.573649,1.641233,1.630217,1.470714,1.545823,1.499599,2.400266,2.457235,2.536708
3,24278,a04,24278,H-BIOA-007-3,a04,treated,BRD-K78496197-001-01-3,5.0,DMSO,BRD-K78496197,...,3.323827,1.627224,1.800667,1.699291,1.41989,1.578673,1.528876,2.601877,2.439709,2.506089
4,24278,a05,24278,H-BIOA-007-3,a05,treated,BRD-K78599730-001-02-6,5.0,DMSO,BRD-K78599730,...,3.3446,1.772216,1.913539,1.88199,1.292479,1.424874,1.378594,2.433934,2.356783,2.422295
5,24278,a06,24278,H-BIOA-007-3,a06,treated,BRD-K78612426-001-02-6,5.0,DMSO,BRD-K78612426,...,3.499219,1.746578,1.896714,1.836252,1.604872,1.700492,1.687575,2.49805,2.398161,2.474145
6,24278,a07,24278,H-BIOA-007-3,a07,treated,BRD-K78633253-001-01-2,5.0,DMSO,BRD-K78633253,...,3.715294,1.823429,2.092295,2.010445,1.585145,1.758397,1.675881,2.509877,2.409396,2.488828
7,24278,a08,24278,H-BIOA-007-3,a08,treated,BRD-K78637815-001-01-4,5.0,DMSO,BRD-K78637815,...,3.360001,1.693696,1.795996,1.783097,1.587473,1.714071,1.64936,2.390896,2.383596,2.445147
8,24278,a09,24278,H-BIOA-007-3,a09,treated,BRD-K78643075-001-03-3,2.379167,DMSO,BRD-K78643075,...,3.55922,1.857709,1.973541,1.90917,1.654707,1.760128,1.705003,2.320583,2.282689,2.343038
9,24278,a10,24278,H-BIOA-007-3,a10,treated,BRD-K78692225-001-11-2,5.0,DMSO,BRD-K78692225,...,3.543376,1.810797,1.847711,1.791665,1.531595,1.699469,1.603155,2.381528,2.369899,2.450524


In [39]:
chem_df = pd.read_csv("./data/test/meta_data/chemical_annotations.csv")
# Look up the annotation row whose BROAD_ID matches the compound in the first well
chem_df[chem_df['BROAD_ID'] == df['Metadata_broad_sample'][0]].iloc[0]

BROAD_ID                                        BRD-K78364995-236-03-5
CPD_NAME                                                    cefotaxime
CPD_NAME_TYPE                                                      INN
CPD_SAMPLE_ID                                                  SA83374
DOS_LIBRARY                                                        NaN
SOURCE_NAME                                    Prestwick Chemical Inc.
CHEMIST_NAME                                                       NaN
VENDOR_CATALOG_ID                                           Prestw-139
CPD_SMILES           CO\N=C(/C(=O)N[C@H]1[C@H]2SCC(COC(C)=O)=C(N2C1...
USERCOMMENT                                                        NaN
Name: 650, dtype: object

From the output above, we find that well a01 in plate 24278 was treated with the compound "cefotaxime".
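The same lookup can be done for every well at once with a join on the foreign key. The sketch below uses two profile rows and two annotation rows copied from the outputs in this notebook; a left join is my choice here so that wells without a matching annotation are kept rather than dropped.

```python
import pandas as pd

# Two wells from the profile table above (metadata columns only)
profiles = pd.DataFrame(
    {
        "Metadata_Well": ["a01", "a02"],
        "Metadata_broad_sample": [
            "BRD-K78364995-236-03-5",
            "BRD-K78414110-001-02-8",
        ],
    }
)

# Two annotation rows shown earlier in this notebook
chem = pd.DataFrame(
    {
        "BROAD_ID": ["BRD-K78364995-236-03-5", "BRD-A56675431-001-04-0"],
        "CPD_NAME": ["cefotaxime", "altizide"],
    }
)

# Left join: keep every well, fill CPD_NAME with NaN when no annotation matches
annotated = profiles.merge(
    chem, how="left", left_on="Metadata_broad_sample", right_on="BROAD_ID"
)
print(annotated[["Metadata_Well", "CPD_NAME"]])
```

Well a02 comes out unannotated here only because the toy annotation frame is truncated; with the full `chemical_annotations.csv` the result may differ.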

### 2.3. Extracted Features

Each plate comes with a SQLite database comprising 4 tables: (a) one per-image table of cellular statistics (e.g., cell count), and (b) three per-cell tables measuring size, shape, intensity, textural, and adjacency statistics for the nuclei, cytoplasm, and cell body.

1. Cell Table
2. Cytoplasm Table
3. Image Table
4. Nuclei Table
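Table and column names in a SQLite file can be listed without reading any rows. The sketch below runs against an in-memory stand-in, so the `Image` table it creates is hypothetical and much smaller than the real one; for the real data the connection would point at the plate's `.sqlite` file instead.

```python
import sqlite3

# In-memory stand-in so the sketch runs without the real database file;
# this table definition is hypothetical
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Image (Image_Metadata_Plate INTEGER, Image_Metadata_Well TEXT, "
    "Image_Metadata_Site TEXT, Image_Count_Cells REAL)"
)

# Table names are stored in the sqlite_master catalog
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column; index 1 holds the column name
columns = [row[1] for row in conn.execute("PRAGMA table_info(Image)")]
conn.close()

print(tables)
print(columns)
```

Using `PRAGMA table_info` avoids a `SELECT *` over a table that can have thousands of columns' worth of data per row.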

I am still investigating the structure of the database. I believe the Image table is the entry table, with a composite primary key made up of `Image_Metadata_Plate`, `Image_Metadata_Site`, and `Image_Metadata_Well`.

In [52]:
conn = sqlite3.connect('./data/test/meta_data/extracted_features/24278.sqlite')
c = conn.cursor()

In [80]:
# Display all column names in Image table
#cur = c.execute("SELECT * FROM Image")
#[col[0] for col in cur.description]

In [82]:
# Test primary key
c.execute(
    """
    SELECT Image_Count_Cells, Image_Count_Cytoplasm, Image_Count_Nuclei
    FROM Image
    WHERE Image_Metadata_Plate = 24278 AND Image_Metadata_Site = '1' AND Image_Metadata_Well = 'a01'
    """
)
c.fetchall()

[(58.0, 58.0, 58.0)]

From the output above, we can see that the image of plate 24278, well a01, site 1 contains 58 cells, 58 cytoplasm objects, and 58 nuclei.

**This information is shared across all five channels.**
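The single-row query above is consistent with the composite-key guess but does not prove it. One way to check it more systematically is to group by the three columns and look for any combination that occurs more than once. The sketch below does this on an in-memory toy table: the a01/site 1 row matches the output above, while the other two rows are made up for illustration.

```python
import sqlite3

# In-memory toy table; only the first row is taken from the real output above
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Image (Image_Metadata_Plate INTEGER, Image_Metadata_Well TEXT, "
    "Image_Metadata_Site TEXT, Image_Count_Cells REAL)"
)
conn.executemany(
    "INSERT INTO Image VALUES (?, ?, ?, ?)",
    [
        (24278, "a01", "1", 58.0),
        (24278, "a01", "2", 61.0),  # hypothetical row
        (24278, "a02", "1", 47.0),  # hypothetical row
    ],
)

# Any (plate, well, site) combination appearing more than once would
# contradict the composite-key assumption
duplicates = conn.execute(
    """
    SELECT Image_Metadata_Plate, Image_Metadata_Well, Image_Metadata_Site, COUNT(*)
    FROM Image
    GROUP BY Image_Metadata_Plate, Image_Metadata_Well, Image_Metadata_Site
    HAVING COUNT(*) > 1
    """
).fetchall()
conn.close()

print(duplicates)
```

An empty result on the real database would support treating the three columns as a key, although it would not show that the schema enforces it.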