# Mask R-CNN benchmark

Assuming maskrcnn-benchmark is installed with the correct environment and GCC version (see the end of this document), the following explains how to run Mask R-CNN benchmark with annotations from Labelbox.

### 1. How to use your own data from Labelbox

- export your annotation file from Labelbox in JSON WKT format
- split the annotations into training (3/5), validation (1/5) and test (1/5) sets
- download the images and sort them directly into train, val and test folders
- see the script below, which does both the split and the download
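
The 3/5, 1/5, 1/5 split from the list above can be sketched on its own, without the download step. This is a minimal sketch and not the script used below: the helper name `split_items` is hypothetical, and it assumes that the items are split in their original order using integer arithmetic.

```python
def split_items(items):
    """Split a list into train (3/5), val (1/5) and test (rest), keeping the order."""
    n_train = len(items) * 3 // 5
    n_val = len(items) // 5
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# hypothetical example with ten placeholder items
train, val, test = split_items(list(range(10)))
print(len(train), len(val), len(test))
```

When the number of items is not divisible by five, the remainder ends up in the test set with this sketch.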

IMPORTANT: labelbox2coco has to be installed beforehand with `pip install LBExporters`.

In [1]:
import json
import labelbox2coco as lb2co
import urllib.request as down
import os

# ROOT on Monod, folder to save data
DATA = '/data/proj/smFISH/Students/Max_Senftleben/files'
ROOT_DIR = '/home/max/mrcnn_b_work'


In [21]:
def convert_annotation(labeled_data):
    
    coco_output = labeled_data[:-5] + "_coco.json"
    lb2co.from_json(labeled_data = labeled_data, coco_output=coco_output)
    
    # remove the temporary Labelbox file once the COCO file has been written
    if os.path.isfile(coco_output):
        print("remove temp file: ", labeled_data)
        os.remove(labeled_data)
    else:
        print("file was not converted")
        
def handle_annotation(anno_dir, lbx_json, image_dir):
    
    # make split folder for images
    train_dir = image_dir + "/train/"
    if not os.path.exists(train_dir):
        os.makedirs(train_dir)

    val_dir = image_dir + "/val/"
    if not os.path.exists(val_dir):
        os.makedirs(val_dir)
        
    test_dir = image_dir + "/test/"
    if not os.path.exists(test_dir):
        os.makedirs(test_dir)
    
    # split 3/5 train, 1/5 val and 1/5 test
    with open(anno_dir + lbx_json) as f:
        data = json.load(f)
    anno_train = []
    anno_val = []
    anno_test = []
    counter = 0
    for one_img in data:
        
        # training set 3/5 (strict comparison, otherwise this set gets one image too many)
        if counter < len(data) * 0.6:
            
            # download image
            img_id = one_img["External ID"]
            name_ = train_dir + img_id
            down.urlretrieve(one_img["Labeled Data"], name_)
            print("downloading file ", name_)
            
            # append to json list
            anno_train.append(one_img)
            counter += 1
            
        # validation set 1/5 (strict comparison, see training set)
        elif counter < len(data) * 0.8:
            
            # download image
            img_id = one_img["External ID"]
            name_ = val_dir + img_id
            down.urlretrieve(one_img["Labeled Data"], name_)
            print("downloading file ", name_)
            
            # append to json list
            anno_val.append(one_img)
            counter += 1
           
        # test set 1/5
        else:
            
            # download image
            img_id = one_img["External ID"]
            name_ = test_dir + img_id
            down.urlretrieve(one_img["Labeled Data"], name_)
            print("downloading file ", name_)
            
            # append to json list
            anno_test.append(one_img)
            counter += 1
    
    temp_train = anno_dir + "/train.json"
    with open(temp_train, "w") as wr:
        json.dump(anno_train,wr, separators=(',', ':'))
        print("convert ", temp_train)
        
    convert_annotation(temp_train)

    temp_val = anno_dir + "/val.json"
    with open(temp_val, "w") as wr:
        json.dump(anno_val, wr, separators=(',', ':'))
        print("convert ", temp_val)
        
    convert_annotation(temp_val)
    
    temp_test = anno_dir + "/test.json"
    with open(temp_test, "w") as wr:
        json.dump(anno_test, wr, separators=(',', ':'))
        print("convert ", temp_test)
        
    convert_annotation(temp_test)

In [22]:
# specify filenames

# Folder for Annotation files to save
# Examples:

ANNO = "/home/maxsen/git/master_thesis/data/annotations/new_nuclei"
#anno_dir = ROOT_DIR + "/annotation/new_nuclei_mask"


# Raw Annotation file from labelbox
# Examples:

annotation_file = "/nuclei_20190205.json"
#annotation_file = "/nuclei_20190205_with_masks.json"

# Folder for saving the Images
# Examples:

img_dir = "/home/maxsen/DEEPL/data/nuclei_20190205_data"
#img_dir = DATA + "/data/self_label"


The command below downloads the images into the specified folder and creates three subfolders there: /train, /val and /test. In the folder of the annotation file, it creates three annotation files: train_coco.json, val_coco.json and test_coco.json.

In [23]:
handle_annotation(ANNO, annotation_file, img_dir)

downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_144.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_195.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_106.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_196.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_107.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_7.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_199.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_108.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_8.png
downloading file  /home/maxsen/DEEPL/data/nuclei_20190205_data/train/Nuclei_SN_Hyb2_pos_200.png
downloading file  /home/maxsen/DEEPL/data/nu

### 2. Rewrite IDs in the annotation files so that they are fully in COCO style

This step is a bit tricky as it depends on your data files. The COCO annotation files have, in this case, five top-level keys: images, annotations, licenses, categories and info. In COCO["images"], each ["id"] has to be changed to an integer that identifies the image, and ["file_name"] has to be changed to the correct filename, such as "filename_1234.jpg", where 1234 is the integer ID. In COCO["annotations"], each ["image_id"] has to be changed to the integer ID of the image it belongs to, in this case 1234. This is done by the function `rewrite_ids_nuclei` below, which takes the annotation file to be changed. It has to be run once for each of the training, validation and test annotation files. The new annotation file is named, for example, train_coco_id.json.
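
To make the target layout concrete, here is a minimal sketch of a COCO-style dictionary with the five keys named above, together with a small consistency check. The helper `check_coco_ids`, the category name "nucleus" and the example values are hypothetical and only serve as illustration; the files written by labelbox2coco contain more fields.

```python
def check_coco_ids(coco):
    """Return a list of problems found in the image and annotation IDs."""
    problems = []
    image_ids = set()
    for image in coco["images"]:
        if not isinstance(image["id"], int):
            problems.append("image id is not an integer: %r" % (image["id"],))
        image_ids.add(image["id"])
    for annotation in coco["annotations"]:
        if annotation["image_id"] not in image_ids:
            problems.append("annotation %r refers to unknown image %r"
                            % (annotation.get("id"), annotation["image_id"]))
    return problems


# hypothetical, heavily shortened annotation file
example = {
    "info": {},
    "licenses": [],
    "categories": [{"id": 1, "name": "nucleus"}],
    "images": [{"id": 1234, "file_name": "filename_1234.jpg"}],
    "annotations": [{"id": 1, "image_id": 1234, "category_id": 1}],
}
print(check_coco_ids(example))  # an empty list means no problems were found
```

Such a check could be run on each `*_coco_id.json` file after loading it with `json.load`.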

The function `rewrite_ids_nuclei` assumes that the filename is a long URL that contains the original filename, something like "http://url-to-file/Nuclei_SN_Hyb2_pos_106.png8589173798579". It then assigns "Nuclei_SN_Hyb2_pos_106.png" as the file name and 106 as the integer ID. For images with a different naming scheme, the function has to be adapted.
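
The name and ID extraction described above can also be sketched with a regular expression. This is only an illustration under the assumption that every file name looks like "Nuclei..._<number>.png", possibly with a URL in front and extra digits behind; the helper name `parse_nuclei_name` is hypothetical.

```python
import re

# assumed naming scheme: "Nuclei", any word characters, "_<number>", then ".png"
NAME_PATTERN = re.compile(r"(Nuclei\w*?_(\d+))\.png")


def parse_nuclei_name(url):
    """Return (file name, integer ID) extracted from a Labelbox file name or URL."""
    match = NAME_PATTERN.search(url)
    if match is None:
        raise ValueError("no nuclei file name found in %r" % url)
    return match.group(1) + ".png", int(match.group(2))


print(parse_nuclei_name("http://url-to-file/Nuclei_SN_Hyb2_pos_106.png8589173798579"))
```

For other data sets only the pattern would have to be changed.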

In [16]:
def rewrite_ids_nuclei(annotation_file):
    
    # load the COCO annotation file written by labelbox2coco
    with open(annotation_file) as f:
        coco = json.load(f)
    
    # maps each original image ID to its new integer ID
    id_map = {}
    
    for image in coco['images']:
        
        # get name of file without the URL in front and the digits behind '.png'
        index_1 = image['file_name'].find('Nuclei')
        index_2 = image['file_name'].find('.png')
        correct_name_for_id = image['file_name'][index_1:index_2]
        print(correct_name_for_id)
        
        # remove URL in filename
        image['file_name'] = correct_name_for_id + '.png'
        
        # the new ID is the number after the last underscore, with type int
        new_id = int(correct_name_for_id.rsplit('_', 1)[1])
        id_map[image['id']] = new_id
        image['id'] = new_id
    
    # point every annotation to the new integer ID of its image
    for annotation in coco['annotations']:
        if annotation['image_id'] in id_map:
            annotation['image_id'] = id_map[annotation['image_id']]

    with open(annotation_file[:-5] + '_id.json', "w") as wr:
        json.dump(coco, wr, separators=(',', ':'))

In [20]:
# Example:
rewrite_ids_nuclei(ANNO + '/train_coco.json')

Nuclei_SN_Hyb2_pos_144
Nuclei_SN_Hyb2_pos_195
Nuclei_SN_Hyb2_pos_106
Nuclei_SN_Hyb2_pos_196
Nuclei_SN_Hyb2_pos_107
Nuclei_SN_Hyb2_pos_7
Nuclei_SN_Hyb2_pos_199
Nuclei_SN_Hyb2_pos_108
Nuclei_SN_Hyb2_pos_8
Nuclei_SN_Hyb2_pos_200
Nuclei_SN_Hyb2_pos_203
Nuclei_SN_Hyb2_pos_109
Nuclei_SN_Hyb2_pos_110
Nuclei_SN_Hyb2_pos_11
Nuclei_SN_Hyb2_pos_206
Nuclei_SN_Hyb2_pos_116
Nuclei_SN_Hyb2_pos_13
Nuclei_SN_Hyb2_pos_208
Nuclei_SN_Hyb2_pos_118
Nuclei_SN_Hyb2_pos_15
Nuclei_SN_Hyb2_pos_211
Nuclei_SN_Hyb2_pos_120
Nuclei_SN_Hyb2_pos_16
Nuclei_SN_Hyb2_pos_212
Nuclei_SN_Hyb2_pos_124
Nuclei_SN_Hyb2_pos_18
Nuclei_SN_Hyb2_pos_213
Nuclei_SN_Hyb2_pos_125
Nuclei_SN_Hyb2_pos_20
Nuclei_SN_Hyb2_pos_214
Nuclei_SN_Hyb2_pos_126
Nuclei_SN_Hyb2_pos_23
Nuclei_SN_Hyb2_pos_127
Nuclei_SN_Hyb2_pos_215
Nuclei_SN_Hyb2_pos_28
Nuclei_SN_Hyb2_pos_217
Nuclei_SN_Hyb2_pos_129
Nuclei_SN_Hyb2_pos_29
Nuclei_SN_Hyb2_pos_218
Nuclei_SN_Hyb2_pos_131
Nuclei_SN_Hyb2_pos_31
Nuclei_SN_Hyb2_pos_219
Nuclei_SN_Hyb2_pos_133
Nuclei_SN_Hyb2_pos_32
Nuc

### 3. Configure paths_catalog.py

The paths of the image data and the annotation data have to be added to the maskrcnn-benchmark/maskrcnn_benchmark/config/paths_catalog.py file like this:

    "coco_nuclei_train": { 
            "img_dir": "/data/proj/smFISH/Students/Max_Senftleben/files/data/nuclei_20190205_data/train",
            "ann_file":	"/data/proj/smFISH/Students/Max_Senftleben/files/annotation/new_nuclei/train_coco_id.json"
        },
    "coco_nuclei_val": { 
            "img_dir": "/data/proj/smFISH/Students/Max_Senftleben/files/data/nuclei_20190205_data/val",
            "ann_file":	"/data/proj/smFISH/Students/Max_Senftleben/files/annotation/new_nuclei/val_coco_id.json"
        },
    "coco_nuclei_test": { 
            "img_dir": "/data/proj/smFISH/Students/Max_Senftleben/files/data/nuclei_20190205_data/test",
            "ann_file":	"/data/proj/smFISH/Students/Max_Senftleben/files/annotation/new_nuclei/test_coco_id.json"
        }
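
Since a typo in one of these paths only shows up once training starts, it can help to check them beforehand. The following is a minimal sketch with a hypothetical helper `missing_paths`; it assumes a dictionary with the same layout as the entries above and is not part of maskrcnn-benchmark. The example paths are placeholders.

```python
import os


def missing_paths(datasets):
    """Return (dataset name, key, path) for every configured path that does not exist."""
    missing = []
    for name, attrs in datasets.items():
        for key in ("img_dir", "ann_file"):
            if not os.path.exists(attrs[key]):
                missing.append((name, key, attrs[key]))
    return missing


# placeholder entry in the same layout as in paths_catalog.py
example = {
    "coco_nuclei_train": {
        "img_dir": "/path/to/data/train",
        "ann_file": "/path/to/annotation/train_coco_id.json",
    },
}
for name, key, path in missing_paths(example):
    print("missing", key, "for", name, ":", path)
```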


### 4. Configure existing or make own .yaml configuration file 

Existing config files can be found in maskrcnn-benchmark/configs/. The datasets have to be specified under DATASETS, using the keys set in the paths_catalog.py file, like this:

    DATASETS:
      TRAIN: ("coco_nuclei_train", "coco_nuclei_val")
      TEST: ("coco_nuclei_test",)

Here, several training parameters may be specified and adjusted to the number of GPUs being used while training. These parameters are for training on ONE GPU:

    SOLVER:
      BASE_LR: 0.0025
      STEPS: (480000, 640000)
      MAX_ITER: 720000
      IMS_PER_BATCH: 2

The developers recommend multiplying the learning rate and the images per batch by the number of GPUs and dividing the steps and the maximum iterations by the number of GPUs. See the examples in maskrcnn-benchmark/configs/, which were trained with 8 GPUs. One may add the output directory as well with `OUTPUT_DIR: "/path/to/"`.
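
The scaling rule can be written down as a small calculation. This is a sketch only: the function name `scale_solver` is hypothetical, and the default values are the one-GPU parameters listed above.

```python
def scale_solver(num_gpus, base_lr=0.0025, steps=(480000, 640000),
                 max_iter=720000, ims_per_batch=2):
    """Scale the one-GPU solver settings to the given number of GPUs."""
    return {
        # learning rate and batch size grow with the number of GPUs
        "BASE_LR": base_lr * num_gpus,
        "IMS_PER_BATCH": ims_per_batch * num_gpus,
        # schedule length shrinks with the number of GPUs
        "STEPS": tuple(step // num_gpus for step in steps),
        "MAX_ITER": max_iter // num_gpus,
    }


print(scale_solver(8))
```

For 8 GPUs this gives a learning rate of 0.02, 16 images per batch, steps at 60000 and 80000 and 90000 iterations in total.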

I ran into the error `IndexError: index 0 is out of bounds for dimension 0 with size 0` while training on a test data set from Labelbox with relatively small .jpeg images (I did not get the error while training on the nuclei data, though). A quick fix can be to set `DATALOADER.ASPECT_RATIO_GROUPING` to `False` in the config file, but the developers are still checking the code and have not provided another solution.

### 5. Run training

Run:

    python maskrcnn-benchmark/tools/train_net.py --config-file "path/to/config.yaml"

If a segmentation fault occurs, the GCC version may have to be updated (it has to be version 4.9 or higher). More information can be found here: https://paper.dropbox.com/doc/Working-on-Monod-setup-environments-and-run-on-GPUs--AXH64wJuBgEe8XwCtJ09DZCqAg-hX2FfDYdlhY10ksm0BhH6. The GCC version can be checked with `gcc --version`. After updating GCC, maskrcnn-benchmark has to be recompiled by removing the folder `maskrcnn-benchmark/build` and running `python maskrcnn-benchmark/setup.py build develop`.
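
The version check can also be done from Python, for example at the top of a notebook. This is a rough sketch with hypothetical helper names; it assumes that the first version-like number in the first line of the `gcc --version` output is the compiler version, which may not hold for every distribution.

```python
import re
import subprocess


def parse_gcc_version(version_line):
    """Return the first version number in the line as a tuple of three integers."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_line)
    if match is None:
        raise ValueError("no version number found in %r" % version_line)
    # a missing patch level is treated as 0
    return tuple(int(part) for part in match.groups(default="0"))


def gcc_is_recent_enough(minimum=(4, 9, 0)):
    """Run `gcc --version` and compare the result with the minimum version."""
    output = subprocess.check_output(["gcc", "--version"]).decode()
    return parse_gcc_version(output.splitlines()[0]) >= minimum
```

Calling `gcc_is_recent_enough()` returns `True` if the installed GCC is at least version 4.9, provided that `gcc` is on the PATH.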