# Image Sorting

In this notebook, we sort the cell images into a folder/subfolder structure that mirrors the mode of action (MoA) classification, which makes it easier to train the CNN in PyTorch. We first load the dataframe that records, for each image, which compound was used to treat the cells and the corresponding MoA.

The cell images are available for downloading at the following url: https://bbbc.broadinstitute.org/BBBC021

They are organized by week (week 1: pictures taken during week 1, week 2: pictures taken during week 2, etc.). A CSV metadata file lists the correspondence between image file names, compounds, and MoA.

In [2]:
#let's first load the packages we will need
import pandas as pd
import numpy as np
import shutil
import os


In [3]:
#now we will load the dataframe containing data
df = pd.read_csv('csv_files/BBBC021_final_enhanced_dataset.csv')
df.head()

Unnamed: 0,TableNumber,ImageNumber,Image_FileName_DAPI,Image_PathName_DAPI,Image_FileName_Tubulin,Image_PathName_Tubulin,Image_FileName_Actin,Image_PathName_Actin,Image_Metadata_Plate_DAPI,Image_Metadata_Well_DAPI,Replicate,Image_Metadata_Compound,Image_Metadata_Concentration,Image_Metadata_MoA,Image_Metadata_SMILES
0,4,233,G10_s1_w1BEDC2073-A983-4B98-95E9-84466707A25D.tif,Week4/Week4_27481,G10_s1_w2DCEC82F3-05F7-4F2F-B779-C5DF9698141E.tif,Week4/Week4_27481,G10_s1_w43CD51CBC-2370-471F-BA01-EE250B14B3C8.tif,Week4/Week4_27481,Week4_27481,G10,1,5-fluorouracil,0.003,DNA replication,FC1=CNC(=O)NC1=O
1,4,234,G10_s2_w11C3B9BCC-E48F-4C2F-9D31-8F46D8B5B972.tif,Week4/Week4_27481,G10_s2_w2570437EF-C8DC-4074-8D63-7FA3A7271FEE.tif,Week4/Week4_27481,G10_s2_w400B21F33-BDAB-4363-92C2-F4FB7545F08C.tif,Week4/Week4_27481,Week4_27481,G10,1,5-fluorouracil,0.003,DNA replication,FC1=CNC(=O)NC1=O
2,4,235,G10_s3_w1F4FCE330-C71C-4CA3-9815-EAF9B9876EB5.tif,Week4/Week4_27481,G10_s3_w2194A9AC7-369B-4D84-99C0-DA809B0042B8.tif,Week4/Week4_27481,G10_s3_w4E0452054-9FC1-41AB-8C5B-D0ACD058991F.tif,Week4/Week4_27481,Week4_27481,G10,1,5-fluorouracil,0.003,DNA replication,FC1=CNC(=O)NC1=O
3,4,236,G10_s4_w1747818B4-FFA7-40EE-B0A0-6A5974AF2644.tif,Week4/Week4_27481,G10_s4_w298D4652F-B5BF-49F2-BE51-8149DF83EAFD.tif,Week4/Week4_27481,G10_s4_w42648D36D-6B77-41CD-B520-6E4C533D9ABC.tif,Week4/Week4_27481,Week4_27481,G10,1,5-fluorouracil,0.003,DNA replication,FC1=CNC(=O)NC1=O
4,4,473,G10_s1_w10034568D-CC12-43C3-93A9-DC3782099DD3.tif,Week4/Week4_27521,G10_s1_w2A29ED14B-952C-4BA1-89B9-4F92B6DADEB4.tif,Week4/Week4_27521,G10_s1_w4DAA2E9D1-F6E9-45FA-ADC0-D341B647A680.tif,Week4/Week4_27521,Week4_27521,G10,2,5-fluorouracil,0.003,DNA replication,FC1=CNC(=O)NC1=O


We will now count how many compounds we have for each MoA. To avoid data leakage, we want to make sure that images obtained with a given compound do not end up in both the training and the test set. We will therefore reserve one compound per MoA for the test set. Let's first have a look at how many compounds per MoA we have in the dataset.

In [20]:
#let's see how many distinct compounds we have per MoA


df.groupby('Image_Metadata_MoA').Image_Metadata_Compound.nunique()

Image_Metadata_MoA
Actin disruptors                                 3
Aurora kinase inhibitors                         4
Cholesterol-lowering                             2
DMSO                                             1
DNA damage                                      10
DNA replication                                  8
EGFR inhibitor                                   1
Eg5 inhibitors                                   2
Epithelial                                       3
Estorgen receptor agonist                        1
Estrogen receptor antagonist                     1
Estrogen receptor modulator                      1
Kinase inhibitors                               14
Microtubule destabilizers                        5
Microtubule stabilizers                          3
Muscarinic acetylcholine receptor antagonist     1
Protein degradation                              4
Protein synthesis                                3
Name: Image_Metadata_Compound, dtype: int64

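Most MoAs are represented by more than one compound, but six of them have only a single compound. Since all images of a given compound must stay on the same side of the split, these MoAs cannot appear in both the training and the test set, so we remove them from the dataset. Note that 'Estorgen receptor agonist' is spelled this way in the metadata, so the filter below has to use the same spelling.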

In [38]:
to_remove = ['DMSO', 'EGFR inhibitor', 'Estorgen receptor agonist', 'Estrogen receptor antagonist', 'Estrogen receptor modulator',
             'Muscarinic acetylcholine receptor antagonist']

df = df[~df.Image_Metadata_MoA.isin(to_remove)]
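The list above is typed by hand from the counts. As a minimal sketch, the same list could be derived from the `nunique()` result, which avoids copying names (and their spelling) manually. The toy dataframe and the names `toy`, `n_cpd`, `single_cpd_moa`, and `toy_kept` are assumptions made for illustration; the compound labels `cpd_a`, `cpd_b`, and `cpd_c` are placeholders, not real entries of the dataset.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (placeholder compound names).
toy = pd.DataFrame({
    'Image_Metadata_MoA': ['DNA damage', 'DNA damage', 'DMSO',
                           'EGFR inhibitor', 'DNA damage'],
    'Image_Metadata_Compound': ['cpd_a', 'cpd_b', 'DMSO', 'cpd_c', 'cpd_a'],
})

# Number of distinct compounds per MoA, as in the cell above.
n_cpd = toy.groupby('Image_Metadata_MoA').Image_Metadata_Compound.nunique()

# MoAs with fewer than two compounds cannot be split by compound.
single_cpd_moa = sorted(n_cpd[n_cpd < 2].index)

# Keep only the rows whose MoA has at least two compounds.
toy_kept = toy[~toy.Image_Metadata_MoA.isin(single_cpd_moa)]
print(single_cpd_moa)
```

On the real dataframe, replacing `toy` with `df` should give the same six MoAs as the hand-written `to_remove` list, provided the counts shown above still hold.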

Let's now have a look at how many images we have per compound.

In [39]:
min_img = df.groupby('Image_Metadata_Compound').count().TableNumber.min()
max_img = df.groupby('Image_Metadata_Compound').count().TableNumber.max()

print('there are between {} and {} images per compound'.format(min_img, max_img))

there are between 64 and 1416 images per compound


We have at least 64 images per compound. This means that if we sample one random compound for each MoA, we should end up with a substantial number of images to reserve for the test set. Let's define a test set with this method.

In [40]:
test_cpd = df.groupby('Image_Metadata_MoA').Image_Metadata_Compound.apply(lambda x: x.sample(1)).reset_index(drop=True)
test_cpd

0     cytochalasin B
1          quercetin
2        simvastatin
3          etoposide
4        doxorubicin
5              AZ138
6               AZ-U
7        roscovitine
8         nocodazole
9              taxol
10              ALLN
11           emetine
Name: Image_Metadata_Compound, dtype: object
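The sampling above is unseeded, so rerunning the cell will generally select different compounds. It also draws one row per MoA, which means compounds with more images are more likely to be picked. The sketch below shows one possible variant, not the method used in this notebook: it fixes a seed and draws uniformly over the distinct compounds of each MoA. The function name `pick_test_compounds`, the `seed` parameter, and the toy data are assumptions; `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (placeholder labels).
toy = pd.DataFrame({
    'Image_Metadata_MoA': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Image_Metadata_Compound': ['a1', 'a1', 'a2', 'b1', 'b2', 'b2', 'b2'],
})

def pick_test_compounds(frame, seed=0):
    """Pick one compound per MoA, reproducibly and uniformly over compounds."""
    # One row per (MoA, compound) pair, so image counts do not bias the draw.
    pairs = frame[['Image_Metadata_MoA', 'Image_Metadata_Compound']].drop_duplicates()
    # Fixed random_state makes the selection repeatable across runs.
    picked = pairs.groupby('Image_Metadata_MoA').sample(n=1, random_state=seed)
    return picked.Image_Metadata_Compound.reset_index(drop=True)

first = pick_test_compounds(toy, seed=0).tolist()
second = pick_test_compounds(toy, seed=0).tolist()
print(first == second)
```

Applied to `df`, this would return a Series that can be passed to `isin` in the same way as `test_cpd`.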

These compounds will form the test set, and the remaining ones the training set. Let's split the dataframe accordingly.

In [41]:
df_train = df[~df.Image_Metadata_Compound.isin(test_cpd)]
df_test = df[df.Image_Metadata_Compound.isin(test_cpd)]

print('number of pictures in the train set: {}, number of pictures in the test set: {}'.format(len(df_train), len(df_test)))

number of pictures in the train set: 8832, number of pictures in the test set: 2472


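We now have a train and a test set. The test set contains approximately 22% of the images (2472 out of 11304), which is a reasonable proportion for a train/test split. Because the test compounds are drawn at random, this proportion will change from run to run.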
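As a quick check, the test proportion can be recomputed from the two counts printed above, and the absence of leakage can be verified by confirming that no compound appears on both sides of the split. This is a small sketch: the counts come from the output above, `taxol` and `emetine` come from the sampled list, and the train compound names are placeholders.

```python
# Counts taken from the printed output of the split.
n_train, n_test = 8832, 2472
test_fraction = n_test / (n_train + n_test)
print('{:.1%}'.format(test_fraction))  # prints 21.9%

# Leakage check on toy sets; in the notebook the equivalent would be
# set(df_train.Image_Metadata_Compound).isdisjoint(df_test.Image_Metadata_Compound)
train_compounds = {'cpd_x', 'cpd_y'}   # placeholder names
test_compounds = {'taxol', 'emetine'}  # two of the sampled test compounds
no_leakage = train_compounds.isdisjoint(test_compounds)
print(no_leakage)
```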

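We will now organize the images into a proper folder structure. Note that there are three channels for each image. Instead of the usual RGB channels, our channels are DAPI, Tubulin, and Actin, each corresponding to a different component of the cells that is stained and then visualized by microscopy. We will organize our folders according to the following structure. The tree is schematic: in the code that follows, the top-level folder is `sorted/`, the channel folders are lowercase (`dapi`, `tubulin`, `actin`), and the images keep their original `.tif` file names.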

root/

    dataset/
        train/
            class1/
                DAPI/
                    image1.jpg
                    image2.jpg            
                    ...
                Tubulin/
                    image1.jpg
                    image2.jpg
                    ...
                Actin/
                    image1.jpg
                    image2.jpg
                    ...
            class2/
                DAPI/
                    image1.jpg
                    image2.jpg
                    ...
                Tubulin/
                    image1.jpg
                    image2.jpg
                    ...
                Actin/
                    image1.jpg
                    image2.jpg
                    ...
        test/
            class1/
                DAPI/
                    image1.jpg
                    image2.jpg            
                    ...
                Tubulin/
                    image1.jpg
                    image2.jpg
                    ...
                Actin/
                    image1.jpg
                    image2.jpg
                    ...
            class2/
                DAPI/
                    image1.jpg
                    image2.jpg
                    ...
                Tubulin/
                    image1.jpg
                    image2.jpg
                    ...
                Actin/
                    image1.jpg
                    image2.jpg
                    ...
        ...
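The layout above can also be created up front, before any image is copied. The following is a minimal sketch under that assumption; the helper name `make_skeleton` and its parameters are hypothetical, and the demo runs in a temporary directory rather than in the notebook's working folder.

```python
import os
import tempfile

def make_skeleton(root, classes, splits=('train', 'test'),
                  channels=('dapi', 'tubulin', 'actin')):
    """Create root/<split>/<class>/<channel>/ folders and return their paths."""
    created = []
    for split in splits:
        for cls in classes:
            for channel in channels:
                path = os.path.join(root, split, cls, channel)
                # exist_ok=True makes the call safe to repeat.
                os.makedirs(path, exist_ok=True)
                created.append(path)
    return created

with tempfile.TemporaryDirectory() as tmp:
    paths = make_skeleton(tmp, ['DNA damage', 'Protein synthesis'])
    n_created = len(paths)
    all_exist = all(os.path.isdir(p) for p in paths)

print(n_created, all_exist)
```

With two classes, two splits, and three channels, the call creates twelve leaf folders.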

In [32]:
#first we iterate through each MoA
for moa in df_train.Image_Metadata_MoA.unique():
    df_filtered = df_train[df_train.Image_Metadata_MoA == moa]

    for n in range(len(df_filtered)): #then we iterate through pictures per MoA
        for channel in ['DAPI', 'Tubulin', 'Actin']: #and handle each of the three channels in turn
            #the plate folder is the last component of the path (e.g. 'Week4/Week4_27481' -> 'Week4_27481')
            plate = os.path.basename(df_filtered['Image_PathName_' + channel].iloc[n])
            fname = df_filtered['Image_FileName_' + channel].iloc[n]
            src = 'images/cells/{}/{}'.format(plate, fname)
            dst = 'sorted/train/{}/{}/{}'.format(moa, channel.lower(), fname)

            #create the destination folder if it does not exist yet, then copy the image
            #(a missing source file now raises an error instead of being mistaken for a missing folder)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy(src, dst)
        

        

We now do the same thing for the test set.

In [33]:
#first we iterate through each MoA
for moa in df_test.Image_Metadata_MoA.unique():
    df_filtered = df_test[df_test.Image_Metadata_MoA == moa]

    for n in range(len(df_filtered)): #then we iterate through pictures per MoA
        for channel in ['DAPI', 'Tubulin', 'Actin']: #and handle each of the three channels in turn
            #the plate folder is the last component of the path (e.g. 'Week4/Week4_27481' -> 'Week4_27481')
            plate = os.path.basename(df_filtered['Image_PathName_' + channel].iloc[n])
            fname = df_filtered['Image_FileName_' + channel].iloc[n]
            src = 'images/cells/{}/{}'.format(plate, fname)
            dst = 'sorted/test/{}/{}/{}'.format(moa, channel.lower(), fname)

            #create the destination folder if it does not exist yet, then copy the image
            #(a missing source file now raises an error instead of being mistaken for a missing folder)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy(src, dst)
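Once both loops have run, it can be worth checking that every image actually landed where expected. The sketch below counts files per split, class, and channel folder; the helper name `count_sorted_files` is hypothetical, and the demo uses a temporary directory with two empty placeholder files instead of the real `sorted/` tree.

```python
import os
import tempfile
from collections import Counter

def count_sorted_files(root):
    """Count files under root/<split>/<class>/<channel>/."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        # Only leaf folders of the form split/class/channel are counted.
        if len(parts) == 3:
            counts[tuple(parts)] += len(filenames)
    return counts

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'train', 'DNA damage', 'dapi')
    os.makedirs(folder, exist_ok=True)
    for name in ('a.tif', 'b.tif'):
        # Empty placeholder files standing in for images.
        open(os.path.join(folder, name), 'w').close()
    demo_counts = count_sorted_files(tmp)

print(dict(demo_counts))
```

On the real data, `count_sorted_files('sorted')` could be compared with `df_train.groupby('Image_Metadata_MoA').size()` and its test counterpart, assuming file names are unique within each class so that no copy overwrites another.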