In [None]:
% cd ..

# Machine Learning On a 'Real' Problem

In programming, we come across problems that are often referred to as "embarrassingly parallel", meaning the work consists of a large number of independent operations that could run in parallel, yet they are often run sequentially, utilizing only one of a modern machine's many processors. And every once in a while we come across problems that are also embarrassing, but more so in the light of "a computer should be doing this".

I was pulled into a research organization a while ago that had a process where they would run upwards of 1500 models, spit out a handful of statistics and plots, and have volunteers comb through PDF reports showing figure after figure, looking for telltale signs of non-normally distributed error in the model, outliers, and ceiling effects. If the volunteers saw anything, they'd note which plot showed the issue and what the issue was, so that a researcher could review that model later. This struck me as a task that could be turned into a model and accomplished in minutes, not days.

Since this process had been happening for years, they had tens of thousands of plots and data on which ones should be flagged. The hard work of classifying the training data had already been done; the easy part is the automation. I never had the chance to build that model (I was contracted for other work), but I always wondered how that classifier would fare. So I've decided to simulate some data to find out.

In this post I'll just go through the process of turning a bunch of .png files into a training set and training a simple neural network. Future posts will dive into improving the model and using more advanced neural networks (a convolutional NN vs. the multi-layer perceptron used here).

## Processing Data 

I stuck with models that are fairly similar to what that organization produced and will only be looking at one type of plot, residuals vs. fitted values, since all three of the issues are visible in that type of plot. Below are example plots for no issues, non-normal residuals, outliers, and a ceiling effect. Notice that this particular "no issues" plot has some points that could possibly be considered outliers, so it will be interesting to see if the model can learn to discriminate.

<div class="row">
  <div class="column">
    <img width="240" height="200" alt="No issues" 
         src="../data/png/none_0000001.png"
         title="Normal Data">
    <img width="240" height="200" alt="Outliers" 
         src="../data/png/outlier_0000001.png"
         title="Outliers">
  </div>
  <div class="column">
    <img width="240" height="200" alt="Non-normal" 
         src="../data/png/biased_0000001.png"
         title="Non-normal Error">
    <img width="240" height="200" alt="Ceiling" 
         src="../data/png/ceiling_0000001.png"
         title="Ceiling Effect">
    </div>
</div>

### Downsampling Images

You'll notice our images have a single red line in them. Color pictures are larger, and thus add to processing time, but we don't lose information by making that line gray, so we'll want to convert all our images to grayscale. All these plots are also 480x400 pixels. We can likely make these smaller without losing the information we need to correctly classify them. We'll do this with the `scikit-image` Python package.

We'll also rescale our image to 35% of its original width and height.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import functools

import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

import os

from skimage.io import imread, imshow, imsave
from skimage import color
from skimage.transform import rescale
from skimage.measure import block_reduce

In [None]:
img = imread("data/png/none_0000001.png")

img_gray = color.rgb2gray(img)
img_scale = rescale(img_gray, .35, anti_aliasing=True)

#### Original Image:

In [None]:
imshow(img);

#### Grayscale Image:

In [None]:
imshow(img_gray);

#### Rescaled Image:

In [None]:
imshow(img_scale);

You can see our image has lost a fair amount of quality, but it still contains all the same important information as the original and has been reduced in size by over 95%. This will be important, since we have 40k images to train this model on.

In [None]:
print("Original has {n} array values {d}".format(n=functools.reduce(lambda a,b : a*b, img.shape), d=repr(img.shape)))
print("Reduced has {n} array values {d}".format(n=functools.reduce(lambda a,b : a*b, img_scale.shape), d=repr(img_scale.shape)))
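
As a rough check on the size reduction, the arithmetic can be worked out by hand. This is a minimal sketch that assumes the original PNG loads as a 400x480 array with 4 channels (RGBA); the channel count is an assumption, and a 3-channel RGB image would give a slightly smaller reduction. The names `orig_shape`, `new_shape`, and `reduction` are illustrative only.

```python
import math

# Assumed shape of the original image: rows x columns x channels (RGBA).
orig_shape = (400, 480, 4)
scale = 0.35  # same factor passed to rescale() above

# Grayscale conversion drops the channel axis; rescaling shrinks each side.
new_shape = (round(orig_shape[0] * scale), round(orig_shape[1] * scale))

orig_n = math.prod(orig_shape)
new_n = math.prod(new_shape)
reduction = 1 - new_n / orig_n

print(f"{orig_n} values -> {new_n} values ({reduction:.1%} smaller)")
```

Most of the savings come from two places: dropping the color channels and shrinking both image dimensions, which compounds because the pixel count scales with the square of the rescale factor.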

## Creating A Training Set

We've got a lot of data to process, so we'll need to write a function to convert all those images to individual numpy arrays, and also get an array that contains our response classification. Since I wrote the process that simulates all this data to also output a CSV with file paths and classifications, this part is pretty easy.

In [None]:
def process_image(path):
    img = imread(path)
    img_gray = color.rgb2gray(img)
    img_scale = rescale(img_gray, .35, anti_aliasing=True)
    return img_scale

In [None]:
meta_data = pd.read_csv('data/control_file.csv')
meta_data.head()

In [None]:
print(meta_data['type'].value_counts())

We'll need dummy encoded variables for the "type" class. 

In [None]:
dummies = pd.get_dummies(meta_data['type'], prefix='d')
meta_data = pd.concat([meta_data, dummies], axis=1)
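
To make the encoding step concrete, here is a small sketch on a toy frame. The rows are made up, and the class labels (`none`, `outlier`, `biased`, `ceiling`) are assumed from the example file names rather than read from the real control file.

```python
import pandas as pd

# Toy stand-in for the control file; these rows are made up for illustration.
toy = pd.DataFrame({"type": ["none", "outlier", "biased", "ceiling", "none"]})

# One indicator column per class, named with the "d_" prefix.
toy_dummies = pd.get_dummies(toy["type"], prefix="d")
toy = pd.concat([toy, toy_dummies], axis=1)

print(toy)
```

Each row ends up with exactly one indicator set, so a single column such as `d_none` can serve as a binary "no issues" target.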

### Preprocessing all files

Since processing all 60k files will take a while, we'll want to do it once and save the results so we can use them later.

``` python
from multiprocessing import Pool
import os

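def process_save_image(file):
    # Split the full path so the file name can be reused for the output.
    path, name = os.path.split(file)
    # Process the full file path, not just its directory.
    img = process_image(file)
    imsave(os.path.join(path, '../png_redux', name), img)

# Use a small worker pool; the context manager cleans up the workers.
with Pool(3) as p:
    p.map(process_save_image, meta_data['filename'].tolist())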
```
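
The output location is built relative to each source file, so it can help to see what that path resolves to. A minimal sketch, assuming file names shaped like the examples in this post; `redux_path` is a hypothetical helper, not part of the original code.

```python
import os

def redux_path(file, out_dir="png_redux"):
    """Return the path of the reduced copy of `file` in a sibling directory."""
    path, name = os.path.split(file)
    # normpath collapses the "png/../png_redux" segment into "png_redux".
    return os.path.normpath(os.path.join(path, "..", out_dir, name))

target = redux_path("data/png/none_0000001.png")
print(target)
```

Calling `os.makedirs(os.path.dirname(target), exist_ok=True)` before saving would create the output directory if it does not exist yet.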

### Image to Numpy array

We now have our smaller images, but we need those to be numpy arrays and we want that data linked up to our metadata. scikit-image and pandas make that easy.

In [None]:
def img_reader(file):
    path, name = os.path.split(file)
    newpath = os.path.join(path, '../png_redux', name)
    img = imread(newpath)
    return img

In [None]:
## this reads all images into numpy arrays in the column img_array
meta_data['img_array'] = meta_data['filename'].apply(img_reader)

## Training a Model

We'll just train a simple model here and won't dive too far into evaluating how well it performs. Future posts will cover that.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(meta_data, shuffle=True, test_size=0.2)
del meta_data

In [None]:
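# Stack the images and flatten each one into a single row of features,
# since scikit-learn expects a 2D (n_samples, n_features) array.
test_x = np.stack(test['img_array'].tolist()).reshape(len(test), -1)

train_y = train['d_none'].values
test_y = test['d_none'].values

In [None]:
train_x = np.stack(train['img_array'].tolist()).reshape(len(train), -1)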

In [None]:
mod = MLPClassifier()
mod.fit(train_x, train_y)
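
scikit-learn estimators such as `MLPClassifier` expect a 2D feature array of shape `(n_samples, n_features)`, so each image has to be flattened into a single row before fitting. A minimal sketch with made-up data; the random images and the 140x168 size are assumptions for illustration, not the real training set.

```python
import numpy as np

# Five fake 140x168 grayscale images standing in for the img_array column.
rng = np.random.default_rng(0)
images = [rng.random((140, 168)) for _ in range(5)]

stacked = np.stack(images)                # one 3D array: (images, rows, cols)
flat = stacked.reshape(len(stacked), -1)  # one row of pixel values per image

print(stacked.shape, flat.shape)
```

Flattening discards the 2D layout of the pixels, which is one reason a convolutional network may be a better fit for this kind of data than a multi-layer perceptron.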