# Final Capstone -- Invasive Ductal Carcinoma Image Classification

This project develops classification models for images of invasive ductal carcinoma (IDC) sourced from Kaggle (https://www.kaggle.com/paultimothymooney/breast-histopathology-images#10253_idx5_x1001_y1001_class0.png). The dataset consists of 277,524 color images (50 pixels by 50 pixels) generated from 162 slides of breast cancer cells scanned at 40x. Of these, 198,738 images are negative for IDC and 78,786 are positive.

The images are supplied in one folder per patient. Each patient folder contains two subfolders: one named 0 for IDC-negative images and one named 1 for IDC-positive images. The code below reads each image from file and stores it in a NumPy array, along with a corresponding array of labels. Because of RAM restrictions, the arrays are saved in batches of 100,000 images. Some images in the dataset are not of the advertised shape (50 x 50) and are discarded.
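
To make the RAM restriction concrete, the following is a rough estimate of the memory needed to hold the images. It assumes each image is decoded as a 50 x 50 x 3 array of float32 values, which is what `plt.imread` returns for PNG files; the helper name `batch_size_bytes` is made up for this sketch.

```python
import numpy as np

def batch_size_bytes(n_images, shape=(50, 50, 3), dtype=np.float32):
    """Return the bytes needed to hold n_images arrays of the given shape and dtype."""
    return n_images * int(np.prod(shape)) * np.dtype(dtype).itemsize

full = batch_size_bytes(277524)   # every image in the dataset
batch = batch_size_bytes(100000)  # one batch as used in this notebook
print(f"all images: {full / 1e9:.2f} GB, one batch: {batch / 1e9:.2f} GB")
# prints "all images: 8.33 GB, one batch: 3.00 GB"
```

These figures are lower bounds: building a Python list of arrays and then concatenating it briefly holds two copies of the batch in memory.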

## Perform imports

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from os import listdir

## Load and save images as NumPy arrays

In [0]:
# list to hold image arrays and list to hold labels
img_array_list = []
label_list = []
# local path folder holding all image subfolders
top_path = 'C:\\Users\\hsvjc\\Desktop\\Thinkful\\Final Capstone\\Images\\breast-histopathology-images'
# batch size counter
img_array_list_len = 0
# batch label counter
save_file_count = 1
# local path at which to save NumPy arrays
save_path = 'C:\\Users\\hsvjc\\Desktop\\Thinkful\\Final Capstone\\Images\\Image NumPy\\'
# loop through patient subfolders
for file in listdir(top_path):
    # skip 'IDC_regular_ps50_idx5', which is not a patient folder
    if (file != 'IDC_regular_ps50_idx5'):
        # loop through class subfolders
        new_path = top_path + '\\' + file
        for file2 in listdir(new_path):
            new_path2 = new_path + '\\' + file2
            # save the label for folder of images
            if (file2 == '0'):
                label = 0
            elif (file2 == '1'):
                label = 1
            else:
                # skip unexpected folders so a stale or undefined label is never used
                print('Label problem: ' + new_path2)
                continue
            # loop through images
            for file3 in listdir(new_path2):
                filepath = new_path2 + '\\' + file3
                # read image
                img = plt.imread(filepath)
                # identify whether image shape is 50 x 50 x 3 and add image to list if so
                if (img.shape == (50, 50, 3)):
                    img_array_list.append(np.reshape(img, (1, 50, 50, 3), order='C'))
                    label_list.append(label)
                    img_array_list_len += 1
                # save a batch when number of images reaches 100,000
                if (img_array_list_len == 100000):
                    # array of images
                    X_concat = np.concatenate(img_array_list, axis=0)
                    # array of labels corresponding to image based on index
                    Y = np.array(label_list)
                    np.save(save_path + 'X' + str(save_file_count), X_concat)
                    np.save(save_path + 'Y' + str(save_file_count), Y)
                    save_file_count += 1
                    img_array_list = []
                    label_list = []
                    img_array_list_len = 0
# save last batch (skipped if empty, since np.concatenate fails on an empty list)
if img_array_list:
    X_concat = np.concatenate(img_array_list, axis=0)
    Y = np.array(label_list)
    np.save(save_path + 'X' + str(save_file_count), X_concat)
    np.save(save_path + 'Y' + str(save_file_count), Y)
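
The loop above builds paths with Windows-style separators. As a possible alternative, the same patient/class/image traversal can be sketched with `pathlib`, which works on any operating system. The function name `iter_labeled_images` is hypothetical and not part of the original notebook; it only yields paths and labels and leaves image reading to the caller.

```python
from pathlib import Path

def iter_labeled_images(top_dir):
    """Yield (image_path, label) for every file under <patient>/<0 or 1>/."""
    for patient_dir in sorted(Path(top_dir).iterdir()):
        if not patient_dir.is_dir():
            continue
        for class_dir in sorted(patient_dir.iterdir()):
            # only folders named '0' or '1' carry a label; skip anything else
            if not class_dir.is_dir() or class_dir.name not in ('0', '1'):
                continue
            label = int(class_dir.name)
            for image_path in sorted(class_dir.iterdir()):
                if image_path.is_file():
                    yield image_path, label
```

Because only subfolders named 0 or 1 are accepted, a top-level folder whose children are not class folders contributes nothing, so no folder name needs to be excluded explicitly.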

After the batches of NumPy arrays were saved, the files were uploaded to Google Drive so that they could be loaded in Google Colab, which provided greater processing power.
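
As a minimal sketch of reading the batches back, the function below loads the `X<k>.npy` and `Y<k>.npy` files written by the code above and concatenates them in order. The function name `load_batches` and the directory passed to it are assumptions; in Colab the directory would be wherever the Drive folder is mounted.

```python
import numpy as np
from pathlib import Path

def load_batches(save_dir):
    """Load X<k>.npy / Y<k>.npy pairs from save_dir and concatenate them in order."""
    save_dir = Path(save_dir)
    X_parts, Y_parts = [], []
    k = 1
    # batches are numbered from 1; stop at the first missing pair
    while (save_dir / f"X{k}.npy").exists() and (save_dir / f"Y{k}.npy").exists():
        X_parts.append(np.load(save_dir / f"X{k}.npy"))
        Y_parts.append(np.load(save_dir / f"Y{k}.npy"))
        k += 1
    if not X_parts:
        raise FileNotFoundError(f"no X1.npy/Y1.npy found in {save_dir}")
    X = np.concatenate(X_parts, axis=0)
    Y = np.concatenate(Y_parts, axis=0)
    # every image needs exactly one label
    assert len(X) == len(Y), "image and label counts differ"
    return X, Y
```

Concatenating every batch requires enough memory for the whole dataset at once. If that is not available, the batches can be loaded and processed one at a time instead.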