# Parker Dunn
(pgdunn@bu.edu | pdunn91@gmail.com)  
Created on July 1st, 2022


__Assignment for COURSERA: Introduction to Deep Learning (via CU Boulder)__

__Assignment:__ Week 3 - CNN Cancer Detection Kaggle Mini-Project


# Information about the Competition/Data
___

The Kaggle competition is called "Histopathologic Cancer Detection"  
LINK: https://www.kaggle.com/c/histopathologic-cancer-detection

### Data Description (from Kaggle)

In this dataset, you are provided with a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. You are predicting the labels for the images in the test folder. A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.

The original PCam dataset contains duplicate images due to its probabilistic sampling, however, the version presented on Kaggle does not contain duplicates. We have otherwise maintained the same data and splits as the PCam benchmark.
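
The labeling rule in the data description can be made concrete with a small sketch. It assumes each patch is a 96x96 RGB array (the shape the code in this notebook uses) and relies on a hypothetical helper, `center_region`, to slice out the central 32x32 window that determines the label.

```python
import numpy as np

def center_region(patch, size=32):
    """Return the central size x size window of an (H, W, C) patch."""
    h, w = patch.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return patch[top:top + size, left:left + size]

# Synthetic stand-in for one image; real patches come from the .tif files
patch = np.zeros((96, 96, 3), dtype=np.uint8)
center = center_region(patch)
print(center.shape)
```

For a 96x96 patch the window spans rows and columns 32 through 63, which is the same region the red rectangle marks in the plots below.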

___
# Imports

In [16]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# I moved imports below

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    print(dirname)
print(os.getcwd())
os.listdir('/kaggle/input')
os.listdir('/kaggle/input/histopathologic-cancer-detection')


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [8]:
from skimage import io
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
import seaborn as sns
#import multiprocessing

%matplotlib inline

## Below are functions used during Step 2

In [17]:
def load_image_info():
    # get image filenames
    
    # train_locs = glob.glob("train/*.tif")
    # test_locs = glob.glob("test/*.tif")
    
    #train_locs = glob.glob("data/train/*.tif")
    #test_locs = glob.glob("data/train/*.tif")
    
    train_locs = glob.glob("/kaggle/input/histopathologic-cancer-detection/train/*.tif")
    test_locs = glob.glob("/kaggle/input/histopathologic-cancer-detection/test/*.tif")
    
    num_train = len(train_locs)
    num_test = len(test_locs)
    
    y_train = pd.read_csv("/kaggle/input/histopathologic-cancer-detection/train_labels.csv", header=0)
    
    return train_locs, test_locs, y_train

def show_training_image(img_info):
    # displaying the image
    #file = "train/" + img_info.loc["id"] + ".tif"
    file = "/kaggle/input/histopathologic-cancer-detection/train/"+img_info.loc["id"]+".tif"
    image = io.imread(file)
    plt.imshow(image)
    plt.title("{}\n Class: {}".format(img_info.loc["id"], img_info.loc["label"]))
    
    # Drawing the center 32x32 region on the picture
    rectangle = plt.Rectangle((32,32), 32, 32, ec="red", linewidth=1.5, fill=False)
    plt.gca().add_patch(rectangle)
    plt.legend(["Classification Region"])
    
def show_training_images(img_info, dim):
    fig, axes = plt.subplots(nrows=dim[0], ncols=dim[1], figsize=(10,8))
    
    ax = axes.flatten()
    plt.subplots_adjust(hspace=0.4)
    
    # Iterate by position so any slice of y_train works, not only rows labeled 0..n-1
    for i, (_, row) in enumerate(img_info.iterrows()):
        #file = "train/" + row["id"] + ".tif"
        file = "/kaggle/input/histopathologic-cancer-detection/train/"+row["id"]+".tif"
        image = io.imread(file)
        ax[i].imshow(image)
        ax[i].set_title("{}\n Class: {}".format(row["id"], row["label"]))
        # Drawing the center 32x32 region on the picture
        rectangle = plt.Rectangle((32,32), 32, 32, ec="red", linewidth=1.5, fill=False)
        ax[i].add_patch(rectangle)
        ax[i].legend(["Classification Region"])
        
def load_image_data(img):
    #file = "train/"+img+".tif"
    #file = "data/train/"+img+".tif"
    file = "/kaggle/input/histopathologic-cancer-detection/train/"+img+".tif"
    image = io.imread(file)
    return image

def calculate_img_avgs(images):
    avg = np.zeros((96,96,3))
    for img in images:
        image_np = load_image_data(img)
        avg = avg + image_np
    avg = avg/len(images)
    return avg

# Step 2 - Exploratory Data Analysis (EDA)

### Inspecting a single image

In [18]:
training_images, testing_images, y_train = load_image_info()

print(y_train.iloc[0,:])
print(type(y_train.iloc[0,:]),"\n\n")

show_training_image(y_train.iloc[0,:])

### Inspecting multiple images

In [19]:
show_training_images(y_train.iloc[0:4,:], (2,2))

### Examining information distribution of the images

Here, I looked at...
* the "average" picture
* comparison of purple color distributions

In [29]:
%%time

# loop to collect averages of 100 images at a time
# the 100 image avgs are saved in a list
avg_img_100 = []
moving_avg = np.zeros((96,96,3))
counter = 0

for i in range(len(y_train.index)):
    if (i % 100 == 0) and (i != 0):
        moving_avg = moving_avg/100
        avg_img_100.append(moving_avg)
        moving_avg = np.zeros((96,96,3))
        counter = 0
    if (i % 10000 == 0):
        print("Progress...")
    
    image = load_image_data(y_train.loc[i,"id"])
    counter += 1
    moving_avg = moving_avg + image

if counter > 0:
    print("End value of counter: ",counter)
    avg_img_100.append(moving_avg/counter)

print("Number of NumPy arrays saved in avg_img_100: ", len(avg_img_100))
# print(avg_img_100[0].shape)

# getting the "average image" from the averages of 100 images
avg_img = np.zeros((96,96,3))
for i in range(len(avg_img_100)):
    avg_img = avg_img + avg_img_100[i]

avg_img = avg_img/len(avg_img_100)
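
One caveat with averaging the chunk averages: if the last chunk holds fewer than 100 images, a plain mean of chunk means gives those images extra weight. The sketch below uses small synthetic arrays (not the competition images) to show that weighting each chunk mean by its image count recovers the exact overall mean. The names `chunk_means` and `chunk_sizes` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 250 tiny "images", so the last chunk of 100 is short (50)
images = rng.integers(0, 256, size=(250, 4, 4, 3)).astype(float)

chunk_means = []
chunk_sizes = []
for start in range(0, len(images), 100):
    chunk = images[start:start + 100]
    chunk_means.append(chunk.mean(axis=0))
    chunk_sizes.append(len(chunk))

# Weight each chunk mean by the number of images it summarizes
weighted_avg = np.average(chunk_means, axis=0, weights=chunk_sizes)
unweighted_avg = np.mean(chunk_means, axis=0)
true_avg = images.mean(axis=0)

print(np.allclose(weighted_avg, true_avg))
print(np.abs(unweighted_avg - true_avg).max())
```

With hundreds of thousands of images and only one short chunk, the difference is likely small, but the weighted version costs nothing extra.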

___
__Alternate approach to loading the avg image data__  
**Never mind for now**

In [None]:
# processes = []
# group_avgs = []
# ii = 0

# while ii < len(y_train.index):
#     if len(processes) < 4: # then spawn a new process
        
#     else:
#         for p in processes:
#             p.join()
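
One possible shape for the parallel version left unfinished above, as a rough sketch only. It swaps the planned processes for a thread pool from `concurrent.futures`, on the assumption that reading image files is mostly I/O-bound. `fake_load`, `sum_chunk`, and `parallel_average` are hypothetical names; `fake_load` stands in for `load_image_data` so the sketch runs without the dataset.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fake_load(img_id):
    # Hypothetical stand-in for load_image_data: a constant 96x96x3 image
    return np.full((96, 96, 3), img_id % 256, dtype=np.uint8)

def sum_chunk(ids, loader):
    # Accumulate in float64 so the uint8 pixel values cannot overflow
    total = np.zeros((96, 96, 3))
    for img_id in ids:
        total += loader(img_id)
    return total, len(ids)

def parallel_average(ids, loader, n_workers=4, chunk_size=100):
    # Split the ids into chunks and sum each chunk on a worker thread
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda c: sum_chunk(c, loader), chunks))
    total = sum(r[0] for r in results)
    count = sum(r[1] for r in results)
    return total / count

avg = parallel_average(list(range(250)), fake_load)
print(avg.shape, avg[0, 0, 0])
```

Returning sums and counts (rather than per-chunk averages) keeps the final result exact even when the last chunk is short.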
    
    

___

In [30]:
savable_avg_img = avg_img.reshape(96*96,3)

np.savetxt("/kaggle/working/avg_training_image.txt", savable_avg_img, delimiter=",")

print("Shape of 'avg_img': ", avg_img.shape)
print("\nSample values...\n", avg_img[0:2,0:2,:])
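
Because `np.savetxt` only writes 1-D or 2-D arrays, the image is flattened to shape (9216, 3) before saving. A minimal sketch of reading such a file back, using a synthetic array and a temporary file rather than the `/kaggle/working` path:

```python
import os
import tempfile
import numpy as np

# Synthetic stand-in for avg_img
demo_img = np.random.default_rng(0).random((96, 96, 3)) * 255

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "avg_training_image.txt")
    np.savetxt(path, demo_img.reshape(96 * 96, 3), delimiter=",")
    # Reload and restore the original (height, width, channel) layout
    restored = np.loadtxt(path, delimiter=",").reshape(96, 96, 3)

print(restored.shape)
print(np.allclose(restored, demo_img))
```

The reshape on load must mirror the reshape on save; `np.save` / `np.load` (used later for the per-class averages) avoids this step because it preserves the array shape.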

In [31]:
# Displaying the avg image
avg_img_int = avg_img.astype('int')
plt.imshow(avg_img_int)
plt.title("Average training image - Both Classes")

I probably should have expected that the average of all images would not be particularly helpful.

I am curious to see if there is a difference between the average positive vs. negative image. I will basically repeat the same process once more; hopefully, there is some more useful information there.

In [32]:
%%time

# loop to collect averages of 1000 images at a time, separately for each class
# the 1000-image avgs are saved in one list per class
avg_img_pos = []
avg_img_neg = []
moving_avg_pos = np.zeros((96,96,3))
moving_avg_neg = np.zeros((96,96,3))
counter_pos = 0
counter_neg = 0

# Simultaneous task -> track and save values for R & B channels of each image
pos_R = []
pos_B = []
neg_R = []
neg_B = []
# End of setup for simultaneous task

for i in range(len(y_train.index)):
    if (counter_pos == 1000):
        moving_avg_pos = moving_avg_pos/counter_pos
        avg_img_pos.append(moving_avg_pos)
        moving_avg_pos = np.zeros((96,96,3))
        counter_pos = 0
    
    if (counter_neg == 1000):
        moving_avg_neg = moving_avg_neg/counter_neg
        avg_img_neg.append(moving_avg_neg)
        moving_avg_neg = np.zeros((96,96,3))
        counter_neg = 0
    
    image = load_image_data(y_train.loc[i,"id"])
    #print(image[:,:,0].shape)
    
    if (y_train.loc[i,"label"] == 1):
        counter_pos += 1
        moving_avg_pos = moving_avg_pos + image
        
        avg_R = image[:,:,0].reshape((1,-1)).mean()
        avg_B = image[:,:,2].reshape((1,-1)).mean()
        pos_R.append(avg_R)
        pos_B.append(avg_B)
    else:
        counter_neg += 1
        moving_avg_neg = moving_avg_neg + image
        
        avg_R = image[:,:,0].reshape((1,-1)).mean()
        avg_B = image[:,:,2].reshape((1,-1)).mean()
        neg_R.append(avg_R)
        neg_B.append(avg_B)

if (counter_pos > 0):
    avg_img_pos.append(moving_avg_pos/counter_pos)
if (counter_neg > 0):
    avg_img_neg.append(moving_avg_neg/counter_neg)

avg_pos = np.zeros((96,96,3))
avg_neg = np.zeros((96,96,3))

# POSITIVE IMAGES
for i in range(len(avg_img_pos)):
    avg_pos = avg_pos + avg_img_pos[i]
avg_pos = avg_pos/len(avg_img_pos)

# NEGATIVE IMAGES
for j in range(len(avg_img_neg)):
    # index with j (the original used i, left over from the positive loop)
    avg_neg = avg_neg + avg_img_neg[j]
avg_neg = avg_neg/len(avg_img_neg)

In [38]:
# Checking on my "side-task" data
print("posR ", len(pos_R), "\n",
      "posB ", len(pos_B), "\n",
      "negR ", len(neg_R), "\n",
      "negB ", len(neg_B), "\n")

# Saving some data

pos_images_channel_vals = pd.DataFrame({'R':pos_R, 'B':pos_B})
neg_images_channel_vals = pd.DataFrame({'R':neg_R, 'B':neg_B})

pos_images_channel_vals.to_csv(path_or_buf="/kaggle/working/positive_images_channel_vals.csv")
neg_images_channel_vals.to_csv(path_or_buf="/kaggle/working/negative_images_channel_vals.csv")

In [45]:
for img, file in zip([avg_pos, avg_neg], ["average_positive_image", "average_negative_image"]):
    print(file, img.shape)
    #save_img_data(img, 96, file) - func wasn't working so I deleted
    np.save(f"/kaggle/working/{file}", img)

In [33]:
fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(12,10), sharey=True)

avg_pos_int = avg_pos.astype('int')
avg_neg_int = avg_neg.astype('int')

axes[0].imshow(avg_pos_int)
axes[0].set_title("Avg Positive Image")
axes[1].imshow(avg_neg_int)
axes[1].set_title("Avg Negative Image")

There isn't much to take away from these images. Since they are averages over many images, they look mostly uniform.

There is a slight difference in the overall color and presentation of cell shapes in the images. The shade of purple in the average negative image appears to be slightly darker and more transparent. The average negative image also appears to have some darker spots that are faintly identifiable.

The unhealthy cancer cells lose their distinct oval/circular shape. The difference between the two images suggests that the distinct shape of the healthy cells is a fundamental feature that can help me distinguish between the two classes of images.

__Clearly, purple is the dominant color of these images (due to the staining process used for visualization). NEXT, I will compare how the component colors of purple (red and blue) compare between the positive and negative images.__

In [51]:
plt.style.use('fivethirtyeight')
light_blue="#ADD8E6"
light_red='#FFCCCB'

fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(12,10), sharey=True)
plt.subplots_adjust(hspace=0.4)

axes[0,0].set_title("Positive Images - Red")
axes[0,0].set_xlabel("Avg Pixel Value (0 - 255)",fontsize="small")
axes[0,0].set_ylabel("Count", fontsize="small")
axes[0,0].set_xlim(left=0,right=255)
axes[0,0].hist(pos_R, 20, range=(0,255), color=light_red)

axes[0,1].set_title("Positive Images - Blue")
axes[0,1].set_xlabel("Avg Pixel Value (0 - 255)",fontsize="small")
axes[0,1].set_ylabel("Count", fontsize="small")
axes[0,1].set_xlim(left=0,right=255)
axes[0,1].hist(pos_B, 20, range=(0,255), color=light_blue)

axes[1,0].set_title("Negative Images - Red")
axes[1,0].set_xlabel("Avg Pixel Value (0 - 255)",fontsize="small")
axes[1,0].set_ylabel("Count", fontsize="small")
axes[1,0].set_xlim(left=0,right=255)
axes[1,0].hist(neg_R, 20, range=(0,255), color=light_red)

axes[1,1].set_title("Negative Images - Blue")
axes[1,1].set_xlabel("Avg Pixel Value (0 - 255)",fontsize="small")
axes[1,1].set_ylabel("Count", fontsize="small")
axes[1,1].set_xlim(left=0,right=255)
axes[1,1].hist(neg_B, 20, range=(0,255), color=light_blue)

The faint shapes of the average negative image above show up here a little bit. The distribution of purple in the negative images appears to be bimodal, but the positive images resemble a normal distribution. When comparing to the average images above, it is important to note that these histograms are averaged across all the pixels of an image. The average images, however, are averaged across all images and show average data for each pixel.

The histograms suggest that there are two different "average" hues of purple among the negative images, while the positive images tend to average out to similar hues of purple. Since the averages of both channels in the negative and positive images appear to be in similar locations, the color distribution does not really provide a way to distinguish between the two classes consistently. The distributions may suggest something about the colors of the features/objects in the images though, which hopefully the CNN model can identify.
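
One hedged way to put a number on "similar locations" is a histogram overlap coefficient: the shared area of two normalized histograms, where 1.0 means identical distributions and 0.0 means no overlap. The sketch below uses synthetic per-image channel means (not the real `pos_R` / `neg_R` values), and `histogram_overlap` is a hypothetical helper.

```python
import numpy as np

def histogram_overlap(a, b, bins=20, value_range=(0, 255)):
    """Shared area of two normalized histograms (0 = disjoint, 1 = identical)."""
    hist_a, _ = np.histogram(a, bins=bins, range=value_range)
    hist_b, _ = np.histogram(b, bins=bins, range=value_range)
    p = hist_a / hist_a.sum()
    q = hist_b / hist_b.sum()
    return np.minimum(p, q).sum()

rng = np.random.default_rng(0)
# Synthetic stand-ins: one unimodal sample and one bimodal sample
unimodal = rng.normal(170, 20, size=5000)
bimodal = np.concatenate([rng.normal(150, 12, size=2500),
                          rng.normal(200, 12, size=2500)])

print(histogram_overlap(unimodal, unimodal))
print(histogram_overlap(unimodal, bimodal))
```

On the real data, the same call could be applied to `pos_R` vs. `neg_R` and `pos_B` vs. `neg_B`; a value close to 1 would support the observation that per-image color averages alone do not separate the classes.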


### One thing that I did not do...

The histograms above are really an amalgamation of lots of pixel information. When generating the data, the R and B channels of each image were averaged across an entire image then saved. It would be interesting to investigate/generate histograms of the R and B channels for each pixel across all images. Or, even better, useful information might be available by looking at the avg. value of the R & B channels across sections of the images. These histograms would reveal more detailed information about distribution of objects in the images, which is essentially removed in the plots above because of averaging across an entire image.
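
The section-level idea could be sketched as follows, assuming 96x96x3 images split into a 3x3 grid of 32x32 blocks. The grid size is an assumption, chosen so that the center block matches the labeled region. `block_channel_means` is a hypothetical helper, and the image here is synthetic.

```python
import numpy as np

def block_channel_means(image, block=32):
    """Mean of each channel within each block x block section of the image."""
    h, w, c = image.shape
    # Split rows and columns into (n_blocks, block) pairs, then average within blocks
    grid = image.reshape(h // block, block, w // block, block, c)
    return grid.mean(axis=(1, 3))

# Synthetic image: bright center block, dark everywhere else
image = np.zeros((96, 96, 3), dtype=np.uint8)
image[32:64, 32:64, :] = 200

means = block_channel_means(image)
red_by_section = means[:, :, 0]
blue_by_section = means[:, :, 2]
print(means.shape)
print(red_by_section)
```

Collecting `red_by_section` and `blue_by_section` for every training image would give one histogram per section and class, which keeps some of the spatial information that whole-image averaging removes.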

By implementing a convolutional neural network, hopefully, the object level information can be extracted better than the plots created above.