#### In this notebook we will use the preprocessing tools built for this project to convert the images in our dataset to a format compatible with neural network modeling.

In [1]:
from module4_scripts.dir_constructor import DataDirectoryConstructor
from module4_scripts.reader import ImageReader
from module4_scripts.preprocessor import Preprocessor

Using TensorFlow backend.


In [2]:
LABELS = ["NORMAL", "PNEUMONIA"]

## Data Directory Construction

The dataset is sourced from Kaggle: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia


The first step in this project is to download the data and organize it in the data directory. We want the data divided into train, validation, and test splits; here we allocate 80%, 10%, and 10% of the images respectively. Within each split, we'll have one subdirectory per class in the dataset, "NORMAL" and "PNEUMONIA". The DataDirectoryConstructor class does exactly this.

In [3]:
DataDirectoryConstructor(
    directory="data/",
    subdirs=LABELS,
    train_size=0.8,
    test_size=0.1,
    val_size=0.1
).split_dataset()

Class Directories:
	data/NORMAL/
	data/PNEUMONIA/

Splits: 
	Train size: 0.8 
	Test size: 0.1 
	Validation size: 0.1

Generated Train data directory
Generated Test data directory
Generated Val data directory
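To make the splitting step concrete, here is a minimal sketch of how a list of file names could be partitioned by the same proportions. It is not the DataDirectoryConstructor implementation; the helper `split_names`, its `seed` parameter, and the example file names are hypothetical, and the real class may shuffle, round, or move files differently.

```python
import random


def split_names(names, train_size=0.8, test_size=0.1, seed=0):
    """Shuffle file names and partition them into train/test/val lists.

    Hypothetical helper: whatever is left after the train and test
    slices goes to the validation split.
    """
    names = sorted(names)                 # start from a deterministic order
    random.Random(seed).shuffle(names)    # reproducible shuffle
    n_train = int(len(names) * train_size)
    n_test = int(len(names) * test_size)
    return {
        "train": names[:n_train],
        "test": names[n_train:n_train + n_test],
        "val": names[n_train + n_test:],
    }


# Made-up file names standing in for the images of one class.
files = [f"img_{i:03d}.jpeg" for i in range(20)]
splits = split_names(files)
print({name: len(members) for name, members in splits.items()})
```

Applying a function like this once per class directory keeps the class proportions roughly equal across the three splits.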


## Image Reading

Now that we have our data directory set up, the next step is to read the images and create image generators for later preprocessing. We can use the ImageReader class for this. Note that this class also creates an additional generator that produces augmented training data, so we can expand our training set for modeling. More info on how this class works can be found in its docstrings.
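As a rough illustration of how augmentation can expand a training set, the sketch below appends a mirrored copy of every image to a small batch. The horizontal flip is an assumption chosen for simplicity; the transformations ImageReader actually applies are described in its docstrings and may differ. The array contents are made up.

```python
import numpy as np

# Made-up batch: 5 RGB images of 8x8 pixels, channels last.
train_images = (np.arange(5 * 8 * 8 * 3) % 256).astype(np.uint8).reshape(5, 8, 8, 3)
train_labels = np.array([0, 1, 1, 0, 1])

# Assumed augmentation: mirror each image left to right (axis 2 is width).
flipped = train_images[:, :, ::-1, :]

# Appending the augmented copies doubles the number of training samples;
# each augmented image keeps the label of its original.
augmented_images = np.concatenate([train_images, flipped], axis=0)
augmented_labels = np.concatenate([train_labels, train_labels], axis=0)
print(augmented_images.shape, augmented_labels.shape)
```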

In [4]:
# create an ImageReader instance
image_reader = ImageReader()
# read the images and create the generators
image_reader.read()
# display a summary report of the image reading
image_reader.display_read_summary()

Found 4685 images belonging to 2 classes.
Found 4685 images belonging to 2 classes.
Found 586 images belonging to 2 classes.
Found 585 images belonging to 2 classes.

=== Classes ===
NORMAL: 0
PNEUMONIA: 1

=== Directory Breakdown ===
 ________________________________________
Train
	NORMAL:
		Count: 1265
		Proportion: 0.27
	PNEUMONIA:
		Count: 3420
		Proportion: 0.73
________________________________________
Test
	NORMAL:
		Count: 158
		Proportion: 0.27
	PNEUMONIA:
		Count: 428
		Proportion: 0.73
________________________________________
Validation
	NORMAL:
		Count: 160
		Proportion: 0.27
	PNEUMONIA:
		Count: 425
		Proportion: 0.73
________________________________________


The display_read_summary method prints a summary of the data directory: its structure, plus the count and proportion of images per class in each split. The dataset is clearly imbalanced, with roughly 73% of the images in the PNEUMONIA class. We can address this later, during modeling, by applying class weighting. (Note that the training set will double in size, from 4685 to 9370 images, once we apply training data augmentation.)
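For reference, one common way to derive class weights from the training counts above is the "balanced" heuristic, weight = total / (n_classes * class_count). This is a sketch under the assumption that such a scheme is used; the weighting actually applied during modeling is not shown in this notebook.

```python
# Training-set counts and class indices taken from the summary above.
counts = {"NORMAL": 1265, "PNEUMONIA": 3420}
class_index = {"NORMAL": 0, "PNEUMONIA": 1}

total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic (an assumption): rarer classes get larger weights.
class_weight = {
    class_index[label]: total / (n_classes * count)
    for label, count in counts.items()
}
print(class_weight)
```

With these weights, each class contributes the same total weight to the loss, so the minority NORMAL class is not drowned out by the majority class.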

## Pre-processing

Now that we have read the images, we have to transform them into a data type that the model can interpret: numerical arrays.

A brief overview of how this works:

Computers represent a color image as a 3D tensor: a stack of three 2D matrices, one each for the red, green, and blue channels. The dimensions of each matrix equal the pixel dimensions of the image, and each element holds the intensity of that color at one pixel, a value between 0 and 255. The combination of the red, green, and blue intensities produces the color you see at that pixel (of course, at very high resolutions the human eye cannot pick out an individual pixel). The Preprocessor class converts images to such arrays.
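The tensor layout described above can be seen on a tiny example. This sketch builds a made-up 2x2 RGB image by hand with NumPy, using the same channels-last layout (height, width, channels) that appears in the shapes reported later in this notebook.

```python
import numpy as np

# A made-up 2x2 RGB image; each pixel holds three intensities in 0..255.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]        # pure red
image[0, 1] = [0, 255, 0]        # pure green
image[1, 0] = [0, 0, 255]        # pure blue
image[1, 1] = [255, 255, 255]    # white: all three channels at full intensity

# Slicing the last axis recovers one 2D matrix per color.
red_channel = image[:, :, 0]
print(image.shape)
print(red_channel)
```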

In [5]:
# create a Preprocessor instance, passing the reader's
# instance attributes as keyword arguments
preprocessor = Preprocessor(**vars(image_reader))
# preprocess the data; set augment_data=True to
# increase the size of the training set
preprocessor.preprocess(augment_data=True)

Appending augmented data to Train set

=== Processing Train set ===
Original shapes: 
	Images: (9370, 256, 256, 3) 
	Labels: (9370,)
Reshaped shapes: 
	Images: (9370, 196608) 
	Labels: (9370, 1)

=== Processing Test set ===
Original shapes: 
	Images: (586, 256, 256, 3) 
	Labels: (586,)
Reshaped shapes: 
	Images: (586, 196608) 
	Labels: (586, 1)

=== Processing Validation set ===
Original shapes: 
	Images: (585, 256, 256, 3) 
	Labels: (585,)
Reshaped shapes: 
	Images: (585, 196608) 
	Labels: (585, 1)


Different kinds of models require different input shapes. The densely connected neural networks that we'll work with require 2D input, (samples, features), and the convolutional neural networks require 4D input, (samples, height, width, channels). The 'preprocessor' object now holds all of the data in both 2D and 4D form, ready for model building.
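The relationship between the 4D and 2D forms reported in the output above can be reproduced with a plain NumPy reshape. This is only a sketch on a made-up batch of four blank images; how the Preprocessor performs the conversion internally is not shown here.

```python
import numpy as np

# Made-up batch of 4 blank images with the same per-image shape as the dataset.
images_4d = np.zeros((4, 256, 256, 3), dtype=np.uint8)
labels_1d = np.array([0, 1, 1, 0])

# Flatten each image into one row of 256 * 256 * 3 = 196608 features.
images_2d = images_4d.reshape(images_4d.shape[0], -1)
# Turn the label vector into a column, matching the (n, 1) shape above.
labels_2d = labels_1d.reshape(-1, 1)

print(images_2d.shape, labels_2d.shape)
```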

To end the preprocessing stage, we'll compress all the sets and save them in the local directory "npy/" as a single .npz file. (The compressed file is roughly 3 GB, so it was not uploaded to the repo.)

In [6]:
preprocessor.save_arrays(path="npy/arrays")

Compressing and saving all sets to npy/arrays.npz
Done
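For readers unfamiliar with the .npz format, the sketch below shows a compressed save and reload round trip using NumPy directly. The array names `x_train` and `y_train` and the temporary path are hypothetical; the keys that save_arrays actually writes are not shown in this notebook.

```python
import os
import tempfile

import numpy as np

# Made-up arrays standing in for one preprocessed set.
x = np.arange(12, dtype=np.uint8).reshape(2, 6)
y = np.array([[0], [1]])

# np.savez_compressed appends ".npz" when the path lacks that extension.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "arrays")
np.savez_compressed(path, x_train=x, y_train=y)

# Arrays are retrieved by the keyword names used when saving.
with np.load(path + ".npz") as data:
    x_loaded = data["x_train"]
    y_loaded = data["y_train"]

print(x_loaded.shape, y_loaded.shape)
```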
