# Spot Nuclei. Speed Cures.
Imagine speeding up research for almost every disease, from lung cancer and heart disease to rare disorders. The 2018 Data Science Bowl offers our most ambitious mission yet: create an algorithm to automate nucleus detection.

We’ve all seen people suffer from diseases like cancer, heart disease, chronic obstructive pulmonary disease, Alzheimer’s, and diabetes. Many have seen their loved ones pass away. Think how many lives would be transformed if cures came faster.

By automating nucleus detection, you could help unlock cures faster, from rare disorders to the common cold.

# Why nuclei?
Identifying the cells’ nuclei is the starting point for most analyses because most of the human body’s 30 trillion cells contain a nucleus full of DNA, the genetic code that programs each cell. Identifying nuclei allows researchers to identify each individual cell in a sample, and by measuring how cells react to various treatments, the researcher can understand the underlying biological processes at work.

By participating, teams will work to automate the process of identifying nuclei, which will allow for more efficient drug testing, shortening the 10 years it takes for each new drug to come to market.



In [1]:
from fastai.vision import *
from fastai.metrics import error_rate

# Import Libraries here
import os
import json 
import shutil
import zipfile

%reload_ext autoreload
%autoreload 2
%matplotlib inline

# notebook project directories
base_dir = '/hdd/data/nuclei/'

!mkdir -p "{base_dir}"

# set the random seed
np.random.seed(2)

In [None]:
# Download the competition data (the kaggle CLI needs API credentials in ~/.kaggle/kaggle.json)
!pip install kaggle
!kaggle competitions download -c data-science-bowl-2018 -p "{base_dir}"

In [49]:
# Check the download here
path = Path(base_dir)
path.ls()

[PosixPath('/hdd/data/nuclei/stage2_sample_submission_final.csv'),
 PosixPath('/hdd/data/nuclei/stage2_sample_submission_final.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage1_solution.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage1_sample_submission.csv'),
 PosixPath('/hdd/data/nuclei/stage1_solution.csv'),
 PosixPath('/hdd/data/nuclei/stage1_sample_submission.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage1_train_labels.csv'),
 PosixPath('/hdd/data/nuclei/stage1_train_labels.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage2_test_final.zip'),
 PosixPath('/hdd/data/nuclei/stage1_test.zip'),
 PosixPath('/hdd/data/nuclei/stage1_train.zip')]
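
Reading the listing by eye works for a handful of files, but a small programmatic check is easier to rerun. The following is a minimal sketch, assuming the three image archives named in the listing above are the ones the rest of the notebook depends on. The helper name `missing_archives` and the constant `EXPECTED_ARCHIVES` are hypothetical and not part of fastai or the Kaggle CLI.

```python
from pathlib import Path

# Assumed set of required archives, copied from the directory listing above.
EXPECTED_ARCHIVES = ['stage1_train.zip', 'stage1_test.zip', 'stage2_test_final.zip']

def missing_archives(data_dir, expected=EXPECTED_ARCHIVES):
    """Return the expected archive names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]
```

Calling `missing_archives(base_dir)` returns an empty list when every expected archive was downloaded.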

In [47]:
# Remove the stage2_sample_submission_final directory, if present
!rm -rf "{base_dir}stage2_sample_submission_final"

In [53]:
# Unzip every archive in the data directory

path = Path(base_dir)
for zip_path in path.ls():
    name = zip_path.name
    if name.endswith('.csv.zip'):
        # CSV archives: extract the CSV next to the archive
        dest_dir = path
    elif name.endswith('.zip'):
        # Image archives: extract into a folder named after the archive
        # (stage1_train.zip -> stage1_train/)
        dest_dir = zip_path.with_suffix('')
        dest_dir.mkdir(parents=True, exist_ok=True)
    else:
        continue  # skip files and folders that are not archives
    with zipfile.ZipFile(zip_path, 'r') as archive:
        archive.extractall(dest_dir)

# Check the listing to see whether the archives unzipped correctly
path.ls()


[PosixPath('/hdd/data/nuclei/stage2_sample_submission_final.csv'),
 PosixPath('/hdd/data/nuclei/stage2_sample_submission_final.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage1_solution.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage1_sample_submission.csv'),
 PosixPath('/hdd/data/nuclei/stage2_test_final'),
 PosixPath('/hdd/data/nuclei/stage1_solution.csv'),
 PosixPath('/hdd/data/nuclei/stage1_sample_submission.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage1_train'),
 PosixPath('/hdd/data/nuclei/stage1_train_labels.csv'),
 PosixPath('/hdd/data/nuclei/stage1_train_labels.csv.zip'),
 PosixPath('/hdd/data/nuclei/stage2_test_final.zip'),
 PosixPath('/hdd/data/nuclei/stage1_test'),
 PosixPath('/hdd/data/nuclei/stage1_test.zip'),
 PosixPath('/hdd/data/nuclei/stage1_train.zip')]

# Data Block API


In [None]:
"""
    Here are the key folders and CSVs in the data set:

    PosixPath('/hdd/data/nuclei/stage2_test_final'),
    PosixPath('/hdd/data/nuclei/stage1_train'),
    PosixPath('/hdd/data/nuclei/stage1_test'),
    PosixPath('/hdd/data/nuclei/stage1_solution.csv'),
    PosixPath('/hdd/data/nuclei/stage1_train_labels.csv'),
    PosixPath('/hdd/data/nuclei/stage2_sample_submission_final.csv'),
    PosixPath('/hdd/data/nuclei/stage1_sample_submission.csv'),

    In the train folder we have the following structure
    
    stage1_train/
        f534b43bf37ff946a310a0f08315d76c3fb3394681cf523acef7c0682240072a/
            -> images/
            -> masks/
    
"""

path_img_train = base_dir + 'stage1_train' # images and masks; split into train and valid sets below
path_img_test = base_dir + 'stage1_test' # images only, used for testing

# get_y_fn (maps an image path to its mask file) and codes (the list of class
# names) must be defined before this cell is run.
data = (SegmentationItemList.from_folder(path_img_train)
        #Where to find the data? -> in path_img_train and its subfolders
        .filter_by_func(lambda fn: fn.parent.name == 'images')
        #Which files are inputs? -> only those in an images/ folder, not the mask files
        .split_by_rand_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_func(get_y_fn, classes=codes)
        #How to label? -> use the label function on the file name of the data
        .transform(get_transforms(), tfm_y=True, size=128)
        #Data augmentation? -> use tfms with a size of 128, also transform the label images
        .databunch())
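
The cell above relies on `get_y_fn` and `codes`, neither of which is defined in this notebook. The following is a minimal sketch of what they might look like, not the author's definitions. It assumes two classes and assumes that the mask files in each sample's `masks/` folder have already been merged into a single image per sample. The file name `combined_mask.png` and its location in the sample folder are hypothetical, since the competition data does not ship such a file.

```python
from pathlib import Path

# Assumed class names: pixel value 0 is background, pixel value 1 is nucleus.
codes = ['background', 'nucleus']

# Hypothetical name for a merged mask written to each sample folder beforehand.
COMBINED_MASK_NAME = 'combined_mask.png'

def get_y_fn(image_path):
    """Map <sample_id>/images/<name>.png to <sample_id>/combined_mask.png."""
    sample_dir = Path(image_path).parent.parent  # step up from images/ to the sample folder
    return sample_dir / COMBINED_MASK_NAME
```

For this to work with a segmentation label list, the merged mask would need pixel values that index into `codes` (0 and 1 here) rather than 0 and 255.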