# Session 07: Predictive Models

We now turn to a discussion of predictive models, a topic that is directly
important for understanding how images are used in AI and algorithmic decision
making. It is also crucial for building a deeper understanding of image
collections through extracted features.

## Setup

We need to load the modules within each notebook. Here, we load the
same set as in the previous session.

In [None]:
%pylab inline

import numpy as np
import scipy as sp
import pandas as pd
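import sklearn
import sklearn.metrics           # used below for accuracy, precision, recall, ROC
import sklearn.model_selection   # used below for train_test_split
from sklearn import linear_model
import urllib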

import os
from os.path import join

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.rcParams["figure.figsize"] = (8,8)

## Cats and dogs

We will now look at a different kind of visual dataset, consisting of images
of cats and dogs. Here is the associated metadata (we really only have one
piece of information for each image):

In [None]:
df = pd.read_csv(join("..", "data", "catdog.csv"))
df

Let's look at a few of these images:

In [None]:
plt.figure(figsize=(14, 14))

idx = np.random.permutation(range(len(df)))[:15]

for ind, i in enumerate(idx):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(5, 3, ind + 1)

    img = imread(join('..', 'images', 'catdog', df.filename[i]))
    plt.imshow(img)
    plt.axis("off")

How easy is it for you to tell whether the image is of a cat or a dog?
How might we go about teaching the computer to learn the difference?

## Features for learning a model

Much like our exploratory work, we need to extract features from images in order to
build predictive models. We will start with two relatively simple numeric features
from each image: the average saturation and the average value. Let's build a matrix
of these features now:

In [None]:
X = np.zeros((len(df), 2))

for i in range(len(df)):
    img = imread(join("..", "images", "catdog", df.filename[i]))
    img_hsv = matplotlib.colors.rgb_to_hsv(img)
    X[i, 0] = np.mean(img_hsv[:, :, 1])
    X[i, 1] = np.mean(img_hsv[:, :, 2])
    if i % 25 == 0:
        print("Done with {0:d} of {1:d}".format(i, len(df)))

We will also build an array that is equal to 0 for cats and 1 for dogs:

In [None]:
y = np.int32(df.animal.values == "dog")
y

## Building and evaluating a model for predictive learning

### Linear regression

We are now going to use the sklearn module to build predictive models from
the dataset. We will start with a relatively simple model: a linear regression.

The sklearn module has a consistent format for producing models. You start 
by creating an empty model:

In [None]:
model = linear_model.LinearRegression()

Next, we use the dataset to *fit* the model to the data. This uses patterns seen
in the data to distinguish between cats and dogs.

In [None]:
model.fit(X, y)

Our regression model makes a prediction according to:

    prediction = a + b * avg_saturation + c * avg_value
    
The model used the data to determine the best values
for the parameters a, b, and c. We can see them here:

In [None]:
model.intercept_

In [None]:
model.coef_
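
To see how the fitted parameters turn into a prediction, here is a minimal sketch. It fits a regression to a small made-up feature matrix (the numbers are illustrative assumptions, not taken from the cat and dog images) and applies the formula above by hand. The names `X_toy`, `y_toy`, and `toy_model` are hypothetical.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data: 6 "images", 2 features each (saturation, value)
X_toy = np.array([[0.2, 0.5], [0.3, 0.4], [0.6, 0.7],
                  [0.7, 0.6], [0.4, 0.9], [0.8, 0.3]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

toy_model = linear_model.LinearRegression()
toy_model.fit(X_toy, y_toy)

a = toy_model.intercept_   # the constant term
b, c = toy_model.coef_     # one coefficient per feature column

# Apply prediction = a + b * avg_saturation + c * avg_value to every row
manual = a + b * X_toy[:, 0] + c * X_toy[:, 1]

# The hand-computed values should agree with the model's own predictions
print(np.allclose(manual, toy_model.predict(X_toy)))
```

The intercept plays the role of `a`, and the two entries of `coef_` play the roles of `b` and `c`, in the same order as the columns of the feature matrix.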

How well does the model do at predicting which images are cats and which
images are dogs? Let's see all of the predictions:

In [None]:
pred = model.predict(X)
pred

The numbers are not exactly zero or one, so to compare them with the labels we
apply a cut-off at 0.5: predictions above 0.5 count as dogs (1) and the rest as cats (0):

In [None]:
yhat = np.int32(pred > 0.5)
yhat

We can evaluate the model using a number of functions from the sklearn
metrics submodule. Here is the accuracy, just the percentage of predictions
that were correct:

In [None]:
sklearn.metrics.accuracy_score(y, yhat)

You'll often also hear about precision and recall. Precision tells
us what percentage of those images classified as a dog were actually
dogs:

In [None]:
sklearn.metrics.precision_score(y, yhat)

Recall shows the percentage of dogs that we correctly determined were dogs:

In [None]:
sklearn.metrics.recall_score(y, yhat)
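
It can help to compute these three scores by hand once. The sketch below uses a small set of made-up labels and predictions (an assumption for illustration only) and counts the true positives, false positives, and false negatives directly.

```python
import numpy as np
import sklearn.metrics

# Hypothetical labels and predictions (1 = dog, 0 = cat)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # dogs predicted as dogs
fp = np.sum((y_pred == 1) & (y_true == 0))   # cats predicted as dogs
fn = np.sum((y_pred == 0) & (y_true == 1))   # dogs predicted as cats

accuracy = np.mean(y_pred == y_true)   # share of all predictions that are right
precision = tp / (tp + fp)             # share of predicted dogs that are dogs
recall = tp / (tp + fn)                # share of actual dogs that were found

print(accuracy, precision, recall)
```

These hand-computed values should match what `accuracy_score`, `precision_score`, and `recall_score` return for the same two arrays.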

Can you think of a way to make the precision really high without doing much
work? How about the recall? 

A popular metric that balances the recall and precision is the F1 score:

In [None]:
sklearn.metrics.f1_score(y, yhat)
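
The F1 score is the harmonic mean of precision and recall, so it is only high when both are high. A quick check on made-up labels (again an illustrative assumption):

```python
import numpy as np
import sklearn.metrics

# Hypothetical labels and predictions (1 = dog, 0 = cat)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0])

p = sklearn.metrics.precision_score(y_true, y_pred)
r = sklearn.metrics.recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)

print(f1_manual, sklearn.metrics.f1_score(y_true, y_pred))
```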

Finally, a ROC curve shows how well we would do if we used a cut-off
score other than 0.5. ROC curves are very common in CS papers, and it helps
to be able to understand them if you want to look at new advances
in computer vision.

In [None]:
fpr, tpr, _ = sklearn.metrics.roc_curve(y, pred)
plt.plot(fpr, tpr, 'b')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
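
Each point on a ROC curve comes from one choice of cut-off. As a rough sketch of what `roc_curve` does for us, the code below computes the false positive rate and true positive rate at a few cut-offs by hand. The labels, scores, and the helper `rates_at` are hypothetical, made up for this illustration.

```python
import numpy as np

# Hypothetical labels (1 = dog) and model scores
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

def rates_at(cutoff):
    """Return (false positive rate, true positive rate) for one cut-off."""
    y_pred = scores > cutoff
    tp_rate = np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1)
    fp_rate = np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0)
    return fp_rate, tp_rate

for cutoff in [0.2, 0.5, 0.8]:
    fp_rate, tp_rate = rates_at(cutoff)
    print("cutoff {0:.1f}: FPR = {1:.2f}, TPR = {2:.2f}".format(cutoff, fp_rate, tp_rate))
```

Lowering the cut-off catches more dogs (higher true positive rate) but also mislabels more cats (higher false positive rate). The curve traces out that trade-off.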

You can summarize the ROC curve with a measurement called the AUC:
area under the curve.

In [None]:
sklearn.metrics.auc(fpr, tpr)
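
The area itself is just a sum of trapezoids under the curve. Here is a small sketch on made-up labels and scores (illustrative assumptions) that computes the area by hand and compares it with sklearn's value.

```python
import numpy as np
import sklearn.metrics

# Hypothetical labels (1 = dog) and model scores
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

fp_rate, tp_rate, _ = sklearn.metrics.roc_curve(y_true, scores)

# Trapezoid rule: width of each step times the average height across it
area_manual = np.sum(np.diff(fp_rate) * (tp_rate[1:] + tp_rate[:-1]) / 2)
area_sklearn = sklearn.metrics.auc(fp_rate, tp_rate)

print(area_manual, area_sklearn)
```

An AUC of 0.5 corresponds to the dashed diagonal line (random guessing), and an AUC of 1 corresponds to a perfect ranking of dogs above cats.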

### Nearest neighbors

Let's try a different model: k-nearest neighbors. Each image is classified
according to the other images that are closest to it. The syntax is almost
exactly the same, but the predict method directly returns class labels, so no
cut-off is needed.
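
To make the idea concrete, here is a minimal sketch of a nearest neighbors vote written directly in numpy. It is not sklearn's implementation. The helper `knn_predict` and the toy data are hypothetical, and the sketch assumes Euclidean distance, binary labels, and an odd k so there are no ties.

```python
import numpy as np

# Hypothetical 2-D features and labels (0 = cat, 1 = dog)
X_toy = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
                  [0.8, 0.9], [0.9, 0.8], [0.85, 0.75]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(point, X_known, y_known, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    dists = np.sqrt(np.sum((X_known - point) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of the k closest
    votes = y_known[nearest]
    return int(np.sum(votes) > k / 2)                        # 1 if most votes are dogs

print(knn_predict(np.array([0.7, 0.7]), X_toy, y_toy))
print(knn_predict(np.array([0.3, 0.2]), X_toy, y_toy))
```

There are no parameters like a, b, and c to learn here. The "model" is simply the stored training data plus the voting rule.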

In [None]:
import sklearn.neighbors

model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
yhat = model.predict(X)
yhat

This seems to do much better:

In [None]:
sklearn.metrics.accuracy_score(y, yhat)

Unfortunately, there's a bit of a problem here. Can you figure out
why this is not a very fair comparison?

## Splitting the data

Let's split the data into two groups: a training set used to fit the model and
a test set used only to evaluate it. This is made fairly easy with the sklearn
module.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
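
The sketch below shows why the split matters. It uses random features with labels that have nothing to do with them (made-up data, purely an assumption for illustration). A 1-nearest-neighbor model still scores perfectly on the data it was trained on, because every point is its own nearest neighbor, while the held-out score tells the real story.

```python
import numpy as np
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors

rng = np.random.RandomState(0)
X_rand = rng.rand(200, 2)               # 200 hypothetical samples, 2 features
y_rand = rng.randint(0, 2, size=200)    # labels unrelated to the features

# By default 25% of the rows are held out; random_state makes the split repeatable
Xtr, Xte, ytr, yte = sklearn.model_selection.train_test_split(
    X_rand, y_rand, random_state=0)

knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(Xtr, ytr)

train_acc = sklearn.metrics.accuracy_score(ytr, knn.predict(Xtr))
test_acc = sklearn.metrics.accuracy_score(yte, knn.predict(Xte))
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Since the labels here are random, no model can do better than chance on unseen rows, so the gap between the two numbers comes entirely from evaluating on data the model has already seen.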

Now, we will train the model with the training data:

In [None]:
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10)
model.fit(X_train, y_train)

But make predictions on the test data:

In [None]:
yhat = model.predict(X_test)
sklearn.metrics.accuracy_score(y_test, yhat)

And now, we see that there is not much improvement compared to the linear
regression.

## Adding features

We won't be able to get a very good classification algorithm using only the
two features we have started with. We'll need something more than just a fancy
model. Let's go back to the histogram features from the last set of notes.

In [None]:
X = np.zeros((len(df), 50))

for i in range(len(df)):
    img = imread(join("..", "images", "catdog", df.filename[i]))
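    img_hsv = matplotlib.colors.rgb_to_hsv(img)
    # Low-saturation (grayish) pixels have no meaningful hue; store value + 1 instead
    gray = img_hsv[:, :, 1] < 0.2
    img_hsv[gray, 0] = img_hsv[gray, 2] + 1
    # 50 bins over (0, 2): hues in the lower half, shifted gray values in the upper half
    X[i, :] = np.histogram(img_hsv[:, :, 0].flatten(), bins=50, range=(0, 2))[0]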
    if i % 25 == 0:
        print("Done with {0:d} of {1:d}".format(i, len(df)))
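
To see what one row of this feature matrix looks like, here is a sketch that runs the same steps on a single random image. The image is a made-up stand-in whose pixel values are assumed to be scaled to [0, 1], and the names `img_toy`, `color_bins`, and `gray_bins` are hypothetical.

```python
import numpy as np
import matplotlib.colors

rng = np.random.RandomState(0)
img_toy = rng.rand(32, 32, 3)        # hypothetical RGB image, values in [0, 1]

hsv = matplotlib.colors.rgb_to_hsv(img_toy)
gray = hsv[:, :, 1] < 0.2            # low-saturation pixels
hsv[gray, 0] = hsv[gray, 2] + 1      # shift them into the upper half of (0, 2)

counts = np.histogram(hsv[:, :, 0].flatten(), bins=50, range=(0, 2))[0]

# First 25 bins count colorful pixels by hue; last 25 count grayish pixels by brightness
color_bins, gray_bins = counts[:25], counts[25:]
print(color_bins.sum(), gray_bins.sum())
```

Every pixel lands in exactly one of the 50 bins, so each image is summarized by how its pixels are spread across hues and gray levels.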

We'll make a training and testing split one more time:

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)

And then, build a model from the data, testing the accuracy:

In [None]:
model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
yhat = np.int32(pred > 0.5)
sklearn.metrics.accuracy_score(y_test, yhat)

You should find that the accuracy is higher than it was before.
Better data makes better models. Notice that the ROC curve also
looks better:

In [None]:
fpr, tpr, _ = sklearn.metrics.roc_curve(y_test, pred)

In [None]:
plt.plot(fpr, tpr, 'b')
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
sklearn.metrics.auc(fpr, tpr)

We can also try this with the nearest neighbors model.

In [None]:
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
sklearn.metrics.accuracy_score(y_test, yhat)

Try to change the number of neighbors to improve the model. You should
be able to get something similar to the linear regression.
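
One way to run this experiment is to loop over several values and record the test accuracy for each. The sketch below does so on made-up data (two overlapping clouds of points, an assumption for illustration). The candidate values of k are arbitrary.

```python
import numpy as np
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors

rng = np.random.RandomState(0)
# Hypothetical data: two overlapping clouds of points, one per class
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                   rng.normal(1.5, 1.0, size=(100, 2))])
y_toy = np.repeat([0, 1], 100)

Xtr, Xte, ytr, yte = sklearn.model_selection.train_test_split(
    X_toy, y_toy, random_state=0)

# Fit one model per candidate k and store its accuracy on the held-out rows
scores_by_k = {}
for k in [1, 3, 5, 10, 20]:
    knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    knn.fit(Xtr, ytr)
    scores_by_k[k] = sklearn.metrics.accuracy_score(yte, knn.predict(Xte))

best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k)
print("best k:", best_k)
```

Because the test set is used to choose k here, the best score is a somewhat optimistic estimate of how the model would do on entirely new images.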