# Recursion Cellular Image Classification


## Background



In [30]:
# Ref: https://www.kaggle.com/jesucristo/quick-visualization-eda

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sys
%matplotlib inline

In [70]:
# Directory holding the competition CSV files
DATA_DIR = "../data/kaggle/reccell/recursion-cellular-image-classification"
train_metadata = pd.read_csv(f"{DATA_DIR}/train.csv")
test_metadata = pd.read_csv(f"{DATA_DIR}/test.csv")
train_c_metadata = pd.read_csv(f"{DATA_DIR}/train_controls.csv")
test_c_metadata = pd.read_csv(f"{DATA_DIR}/test_controls.csv")

In [61]:
train_metadata.head()

,id_code,experiment,plate,well,sirna
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144


In [62]:
train_metadata.experiment.unique()

array(['HEPG2-01', 'HEPG2-02', 'HEPG2-03', 'HEPG2-04', 'HEPG2-05',
       'HEPG2-06', 'HEPG2-07', 'HUVEC-01', 'HUVEC-02', 'HUVEC-03',
       'HUVEC-04', 'HUVEC-05', 'HUVEC-06', 'HUVEC-07', 'HUVEC-08',
       'HUVEC-09', 'HUVEC-10', 'HUVEC-11', 'HUVEC-12', 'HUVEC-13',
       'HUVEC-14', 'HUVEC-15', 'HUVEC-16', 'RPE-01', 'RPE-02', 'RPE-03',
       'RPE-04', 'RPE-05', 'RPE-06', 'RPE-07', 'U2OS-01', 'U2OS-02',
       'U2OS-03'], dtype=object)

In [63]:
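# Ref https://github.com/recursionpharma/rxrx1-utils/blob/master/rxrx/io.py
def parse_dataset(df):
    """Add cell_type, batch, well_type and site columns; return one row per (id_code, site)."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    # experiment looks like "HEPG2-01", i.e. <cell_type>-<batch>
    parts = df.experiment.str.split("-")
    df['cell_type'] = parts.str[0]
    df['batch'] = parts.str[1].astype(int)
    df['well_type'] = 'treatment'

    # Each well is imaged at two sites, so repeat every row once per site
    dfs = []
    for site in (1, 2):
        site_df = df.copy()
        site_df['site'] = site
        dfs.append(site_df)
    res = pd.concat(dfs).sort_values(
        by=['id_code', 'site']).set_index('id_code')
    return res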

In [64]:
train_metadata = parse_dataset(train_metadata)

In [65]:
train_metadata.head()

id_code,experiment,plate,well,sirna,cell_type,batch,well_type,site
HEPG2-01_1_B03,HEPG2-01,1,B03,513,HEPG2,1,treatment,1
HEPG2-01_1_B03,HEPG2-01,1,B03,513,HEPG2,1,treatment,2
HEPG2-01_1_B04,HEPG2-01,1,B04,840,HEPG2,1,treatment,1
HEPG2-01_1_B04,HEPG2-01,1,B04,840,HEPG2,1,treatment,2
HEPG2-01_1_B05,HEPG2-01,1,B05,1020,HEPG2,1,treatment,1


In [66]:
train_metadata.cell_type.unique()

array(['HEPG2', 'HUVEC', 'RPE', 'U2OS'], dtype=object)

In [67]:
train_metadata.describe()

,plate,sirna,batch,site
count,73030.0,73030.0,73030.0,73030.0
mean,2.499932,553.406874,5.991921,1.5
std,1.118005,319.784566,4.265442,0.500003
min,1.0,0.0,1.0,1.0
25%,1.25,276.25,3.0,1.0
50%,2.0,553.0,5.0,1.5
75%,3.0,830.0,8.0,2.0
max,4.0,1107.0,16.0,2.0


## Metadata

![Plate](https://assets.fishersci.com/TFS-Assets/CCG/product-images/F260015~p.eps-650.jpg)

### Folder structure
The data come in the following folder structure:

```
/<set_type>/<cell_type>-<batch #>/<Plate #>/<well_location>_<site>_<microscope_channel>.png
```

* set_type: Whether the image belongs to the training set or the test set.
    * train
    * test
* cell_type: Four different cell types were used in the experiments.
    * HEPG2
    * HUVEC
    * RPE
    * U2OS
* batch #: Experiment batch, range [1, 16]
* plate #: Plate number, range [1, 4]
* well_location: Location of the well on the plate, in the format `<row><column>` (e.g. `B03` is row B, column 03)
* site: The imaging site within the well, range [1, 2]
* microscope_channel: The microscope channel of the image, range [1, 6]
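
As a rough illustration of how the folder structure maps onto a file path, here is a minimal sketch. The helper name `image_path` and the `data` base directory are hypothetical. The literal `Plate`, `s` and `w` prefixes are an assumption based on the layout used by rxrx1-utils, since the template above does not spell them out; adjust them if your copy of the data differs.

```python
import os


def image_path(base_dir, set_type, experiment, plate, well, site, channel):
    """Build the path to one single-channel image (hypothetical helper).

    Assumption: the 'Plate', 's' and 'w' prefixes follow the rxrx1-utils
    layout; they are not stated in the folder template itself.
    """
    return os.path.join(
        base_dir,
        set_type,                          # "train" or "test"
        experiment,                        # "<cell_type>-<batch #>", e.g. "HEPG2-01"
        f"Plate{plate}",                   # plate number, 1 to 4
        f"{well}_s{site}_w{channel}.png",  # well, site 1 to 2, channel 1 to 6
    )


# All six channel images for site 1 of well B03 on plate 1 of HEPG2-01
paths = [image_path("data", "train", "HEPG2-01", 1, "B03", 1, c)
         for c in range(1, 7)]
```

Each (well, site) pair therefore corresponds to six single-channel PNG files, one per microscope channel.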

---

### CSV
Four important CSV files accompany the dataset:
* train.csv
* test.csv
* train_controls.csv
* test_controls.csv

Each file contains the following columns:
* id_code
* experiment
* plate
* well
* sirna

The `*_controls.csv` files have an additional column describing the type of control (positive or negative):
* well_type
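
Because the control files carry a `well_type` column that the treatment files lack, the two can be stacked once the treatment rows are given a matching value. The sketch below uses tiny in-memory frames in place of the real CSVs. The control rows are made up for illustration, and the `well_type` strings `positive_control` and `negative_control` are assumptions about how the files encode the control type.

```python
import pandas as pd

# Two treatment rows copied from the train.csv preview
treatments = pd.DataFrame({
    "id_code": ["HEPG2-01_1_B03", "HEPG2-01_1_B04"],
    "experiment": ["HEPG2-01", "HEPG2-01"],
    "plate": [1, 1],
    "well": ["B03", "B04"],
    "sirna": [513, 840],
})

# Made-up control rows standing in for train_controls.csv (illustration only)
controls = pd.DataFrame({
    "id_code": ["HEPG2-01_1_B02", "HEPG2-01_1_C03"],
    "experiment": ["HEPG2-01", "HEPG2-01"],
    "plate": [1, 1],
    "well": ["B02", "C03"],
    "sirna": [1138, 1108],
    "well_type": ["negative_control", "positive_control"],  # assumed values
})

# Give treatment rows the same column, then stack both frames
treatments = treatments.assign(well_type="treatment")
combined = pd.concat([treatments, controls], ignore_index=True)
counts = combined["well_type"].value_counts().to_dict()
```

With the column aligned, `combined` can be passed through the same downstream steps as the treatment-only metadata.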

`sirna` is the label for the task. Its value ranges are:

* train/test: [0, 1107]
* train_controls/test_controls: [1108, 1138]

where:

* `0 - 1107` are the siRNAs of interest.
* `1108 - 1137` are the siRNAs used as positive controls.
* `1138` is the siRNA used as the negative control.
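
The label ranges listed here can be turned into a small lookup. This is only a sketch: the function name `sirna_group` and the returned strings are hypothetical, chosen for illustration.

```python
def sirna_group(sirna):
    """Map a sirna label to its group, following the ranges listed above."""
    if not 0 <= sirna <= 1138:
        raise ValueError(f"sirna label out of range: {sirna}")
    if sirna <= 1107:
        return "treatment"         # 0 - 1107: siRNAs of interest
    if sirna <= 1137:
        return "positive_control"  # 1108 - 1137
    return "negative_control"      # 1138


groups = [sirna_group(s) for s in (513, 1108, 1138)]
```

In a DataFrame the same mapping could be applied with `df["sirna"].map(sirna_group)`.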

