# Problem

You are given a sample dataset ‘sample_dataset_for_testing.tar.gz’ . This is an anonymized sample Chest CT scan data. The dataset has 10 folders (subset[0-9]mask). Each subsetmask folder has one or many patient folders. Each patient folder has many tiff files. The tiff files numbering starts from 0.tiff , 1.tiff, 2.tiff ,….. N.tiff. Each tiff file represents a Chest CT slice. All slices together represent a 3D CT Chest scan for a patient.

In each patient folder, you are also given n_mask.tiff files (n is an integer) wherever a mask is present. A mask file is a binary tiff file which represents the location of abnormality. So for example, if a mask file is of size 512 x 512 and out of all 262144 pixels, if 20 pixels have intensity > 0 , then it means that abnormality is present in these 20 pixels. Note that the mask file ‘n_mask.tiff’ is indicating the location of abnormality in ‘n.tiff’ slice. If a patient is normal, then there will not be any mask file present in his/ her folder.

You goal is to design a Convolutional Neural Network to identify abnormalities in 3D Chest CT scans. So, given a test patient folder containing many tiff files, your network should be able to detect the location of abnormalities (similar to how a radiologist checks a patient’s 3D CT scan and marks abnormalities).

Since you have a small dataset, your model will not be robust. The goal of this exercise is not to test model’s robustness but to test your basic Python and deep learning skills.

You can model the problem either as a segmentation or bounding box or pixel classification or any other way. You have the freedom to choose any standard architecture or design a small network from scratch.

We want you to write a Python code for reading data, preprocessing and postprocessing. We want you to use a deep learning framework like PyTorch / Tensorflow / Caffe / Torch / Keras / Theano ( preferably PyTorch ) for training your network.
You can split the dataset into training , validation and test as you wish.

Remember , this is an exercise to test your python programming and basic deep learning modeling skills. You need not train your model for convergence or accuracy.

Best of Luck !!

# Solution

#### The Workflow of the entire project is described:
  
     1) Counting the total number of patients.
     2) Counting the total number of images in each folders.
     3) Now the challange is to divide the images into two discint classes i.e. cancer and safe.
     4) According to the given statement the images are divided into 2 classes and marked as cancer when atleast 20 pixels have their intensity greater than 0.
     5) The Next challange for me was to stop the overwritting of files when they were copied from one directories to the another.
     6) After Successfull tranfering the files in 2 classes, our challange is to build the classifier model. Hence, I am using a laptop without GPU I will train my model in Google Colab, then I will download the model for the predictions in the local machine.
     7) A CNN architecture is created using Keras and tf as backend, hence our goal is not to get the accuracy.
     8) The model is trained and the model is downloaded into the local machine for prediction.
     9) As the size of the .h5 file is very large it is recomended to predict in the Google Colab, my system was hanged when I was trying to load the model.
     

### Here the total number of patients and the total number of files in each folder is calculated

In [1]:
import pandas as pd
import os
import numpy as np


data_dir = 'fullsampledata'
patients = os.listdir(data_dir)
print("-------------------------------------------------------")
print("Total number of Patients =",(len(patients)))

np_xer=np.zeros(len(patients))
df_patients=pd.DataFrame({'patients_records':patients,'Cancer':np_xer})
print("-------------------------------------------------------")
copp=np.zeros(10)
tot_nf=0
for i in range(len(patients)):
    kk='fullsampledata/'+patients[i]
    ppp=os.listdir(kk)
    newd='fullsampledata/'+patients[i]+'/'+ppp[0]
    nnn=os.listdir(newd)
    print('The total number of files in',patients[i],'is',len(nnn))
    tot_nf=tot_nf+len(nnn)
    copp[i]=len(nnn)
print("-------------------------------------------------------") 
print("Total Number of files:",tot_nf) 
print("-------------------------------------------------------")

-------------------------------------------------------
Total number of Patients = 10
-------------------------------------------------------
The total number of files in subset0mask is 302
The total number of files in subset1mask is 305
The total number of files in subset2mask is 219
The total number of files in subset3mask is 341
The total number of files in subset4mask is 310
The total number of files in subset5mask is 358
The total number of files in subset6mask is 332
The total number of files in subset7mask is 343
The total number of files in subset8mask is 343
The total number of files in subset9mask is 345
-------------------------------------------------------
Total Number of files: 3198
-------------------------------------------------------


#### According to the given statement the images are divided into 2 classes and marked as cancer when atleast 20 pixels have their intensity greater than 0.

##### A DataFrame is created which contains the location of the images and the lables i.e. cancer/safe.

In [2]:
np_xer=np.zeros(tot_nf)
df_main=pd.DataFrame({'location':np.zeros(tot_nf),'Status_Cancer':np.zeros(tot_nf)})

m=0
import matplotlib.image as mpimg
for j in range(len(patients)):
    kk='fullsampledata/'+patients[j]
    ppp=os.listdir(kk)
    newd='fullsampledata/'+patients[j]+'/'+ppp[0]
    nnn=os.listdir(newd)
    count=0
    for n in nnn:
        df_main.location[m]=newd+'/'+n
        if '_mask' in n:
            new_name=newd+'/'+n
            img=mpimg.imread(new_name)
            countp=0
            flag=0
            for i in range(512):
                for k in range(512):
                    if(img[i][k][0]>0):
                        countp=countp+1
            if(countp>=20):
                df_main.Status_Cancer[m-1]=1
                df_main.drop(df_main.index[m])
            else:
                df_main.Status_Cancer[m]=0
                
                count=count+1
            
        m=m+1
        print("Please Wait...")
                
df_main=df_main[~df_main.location.str.contains("_mask")]
df_main.to_csv("main.csv")

Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]


Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wai

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Please Wait...
Please Wait...
Please Wait...


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wait...
Please Wai

KeyboardInterrupt: 

### Here two separate dataframes are created from the main dataframe of 2 classes

In [None]:
np_xer=np.zeros(len(df_main.index))

df_cpatients=pd.DataFrame({'location':np_xer})
df_spatients=pd.DataFrame({'location':np_xer})

for pag in range(len(df_main.index)):
    if(df_main.Status_Cancer[pag]==1.0):
        df_cpatients.location[pag]=df_main.location[pag]
    else:
        df_spatients.location[pag]=df_main.location[pag]

df_cpatients.drop(df_cpatients[df_cpatients['location'] == 0 ].index , inplace=True)
df_spatients.drop(df_spatients[df_spatients['location'] == 0 ].index , inplace=True)
df_cpatients.to_csv("report_cancer2.csv")
df_spatients.to_csv("report_safe2.csv")

## The data is split into training and validation data from the respective dataframes

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import shutil 

cancer_data=pd.read_csv("report_cancer2.csv")
safe_data=pd.read_csv("report_safe2.csv")

#suffle the data
safe_data= safe_data.sample(frac=1).reset_index(drop=True)


cancer_train,cancer_test,safe_train,safe_test=train_test_split(cancer_data,safe_data[:52],test_size=0.25,random_state=0)

cancer_train = cancer_train.drop('Unnamed: 0', 1)
cancer_test = cancer_test.drop('Unnamed: 0', 1)
safe_train = safe_train.drop('Unnamed: 0', 1)
safe_test = safe_test.drop('Unnamed: 0', 1)

cancer_train = cancer_train.reset_index(drop=True)
cancer_test = cancer_test.reset_index(drop=True)
safe_train = safe_train.reset_index(drop=True)
safe_test = safe_test.reset_index(drop=True)

import os 


dest='train/cancer/'
count_t=0
al=0
for al in range(cancer_train.shape[0]):
    src=cancer_train.location[al]
    if not os.path.exists(src):
        shutil.copy(src,dest)
    else:
        count_t=count_t+1
        dest='train/cancer/'+str(count_t)+'.tiff'
        shutil.copy(src,dest)
        
dest='validation/cancer/'
count_t=0
al=0
for al in range(cancer_test.shape[0]):
    src=cancer_test.location[al]
    if not os.path.exists(src):
        shutil.copy(src,dest)
    else:
        count_t=count_t+1
        dest='validation/cancer/'+str(count_t)+'.tiff'
        shutil.copy(src,dest)
al=0       
dest='train/safe/'
count_t=0
for al in range(safe_train.shape[0]):
    src=safe_train.location[al]
    if not os.path.exists(src):
        shutil.copy(src,dest)
    else:
        count_t=count_t+1
        dest='train/safe/'+str(count_t)+'.tiff'
        shutil.copy(src,dest)

dest='validation/safe/'
al=0
count_t=0
for al in range(safe_test.shape[0]):
    src=safe_test.location[al]
    if not os.path.exists(src):
        shutil.copy(src,dest)
    else:
        count_t=count_t+1
        dest='validation/safe/'+str(count_t)+'.tiff'
        shutil.copy(src,dest)

# Our Model is trained in the Google Colab and the model is downloaded


# Prediction

In [None]:
import keras
from keras.models import load_model
from keras.utils import CustomObjectScope
from keras.initializers import glorot_uniform

with CustomObjectScope({'GlorotUniform': glorot_uniform()}):
        model = load_model('pranab1.h5')

In [None]:
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img
img_pathh='train/cancer/3.tiff'
img = load_img(img_pathh, target_size=(512, 512))  # this is a TIFF image, we are neglating the transperent channel
x = img_to_array(img)  # Numpy array with shape (512, 512, 3)
x = x.reshape((1,) + x.shape)

In [None]:
result=model.predict(x)

In [None]:
if result[0][0]==1:
    prediction='safe'
else:
    prediction='has cancer'

In [None]:
print("The Given Input image has:",prediction)