# Step 5-6 - Models Based on Extracted Features

**Author(s):** bfoo@google.com

In this tutorial, we will train models on the features extracted in step 4's image and feature analysis. Two tools will be used in this demo:

* Scikit learn: the widely used, single-machine Python machine learning library
* TensorFlow: Google's home-grown machine learning library that allows distributed machine learning

# Preamble

Load all the necessary libraries. Also load the pickled features and labels from the previous step.

In [0]:
import cv2
import numpy as np
import os
import pickle
import shutil
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from random import random
from scipy import stats
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve

import tensorflow as tf
from tensorflow.contrib.learn import LinearClassifier
from tensorflow.contrib.learn import Experiment
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib.layers import real_valued_column
from tensorflow.contrib.learn import RunConfig

In [0]:
# TODO: replace with your local dataprep directory
DATAPREP_DIR = '../../data/'
OUTPUT_DIR = '../../data/output_linear_small/'  # Directory where we store our logging and models.


## Load stored features and labels

Load from the pkl files saved in step 4 and confirm that the feature length is correct.

In [0]:
# Pickle files must be opened in binary mode ('rb').
training_std = pickle.load(open(os.path.join(DATAPREP_DIR, 'training_std.pkl'), 'rb'))
debugging_std = pickle.load(open(os.path.join(DATAPREP_DIR, 'debugging_std.pkl'), 'rb'))
training_labels = pickle.load(open(os.path.join(DATAPREP_DIR, 'training_labels.pkl'), 'rb'))
debugging_labels = pickle.load(open(os.path.join(DATAPREP_DIR, 'debugging_labels.pkl'), 'rb'))

# Number of features per example (columns of the feature matrix)
FEATURE_LENGTH = training_std.shape[1]
print(FEATURE_LENGTH)

In [0]:
# Examine the feature data we loaded:
print(type(training_std))
print(np.shape(training_std))
training_std[:3]

In [0]:
# Examine the label data we loaded:
print(type(training_labels))
print(np.shape(training_labels))
training_labels[:3]

# Logistic Regression in Scikit Learn

Scikit learn has a very easy interface for training a logistic regression model.

Logistic regression is a generalized linear model that predicts the probability that each picture is a cat. For the mathematicians in the room, the model learns a scalar weight w for each feature x, multiplies each feature by its weight, sums the products, adds an intercept scalar b, and sends the result through a sigmoid function. We can then apply a threshold T to the resulting score, and give each image the label 1 if the score is at least T, and 0 otherwise.
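
As a rough illustration of the scoring rule described above, here is a minimal sketch in plain Python. The weights, intercept, and feature vectors are made up for the example, and the helper names (`sigmoid`, `predict_probability`, `predict_label`) are hypothetical rather than part of any library.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Weighted sum of the features plus the intercept, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + intercept
    return sigmoid(z)

def predict_label(features, weights, intercept, threshold=0.5):
    """Return 1 when the score reaches the threshold, otherwise 0."""
    score = predict_probability(features, weights, intercept)
    return 1 if score >= threshold else 0

# Made-up weights and feature vectors, purely for illustration.
example_weights = [0.8, -0.5, 0.0]
example_intercept = 0.1
cat_like = [2.0, 0.5, 3.0]      # weighted sum plus intercept = 1.45
not_cat_like = [0.0, 3.0, 1.0]  # weighted sum plus intercept = -1.4

print(predict_label(cat_like, example_weights, example_intercept))
print(predict_label(not_cat_like, example_weights, example_intercept))
```

Note that the third weight is zero, so the third feature has no influence on the score. This is the kind of weight that regularization tends to produce.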

## Metric to Optimize

In order to tune models, we must first consider metrics used to evaluate the performance of the model. While many metrics are available, because we performed stratified sampling on our dataset and have equal volumes of positive and negative labels (i.e. a balanced dataset), a good performance metric here is average accuracy.
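
To see why the balance of the dataset matters for this choice, consider a small sketch with toy labels (not the tutorial's data). A classifier that always answers "not cat" looks no better than chance on a balanced set, but looks deceptively strong on an imbalanced one. The `accuracy` helper here is a hypothetical stand-in written for this example.

```python
def accuracy(labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for y, p in zip(labels, predicted_labels) if y == p)
    return float(matches) / len(labels)

# A classifier that ignores its input and always answers "not cat" (0).
always_negative = [0] * 100

# Toy label sets: one balanced, one with only 5% positives.
balanced_labels = [1] * 50 + [0] * 50
imbalanced_labels = [1] * 5 + [0] * 95

print('balanced: %.2f' % accuracy(balanced_labels, always_negative))
print('imbalanced: %.2f' % accuracy(imbalanced_labels, always_negative))
```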

In [0]:
def get_accuracy(labels, predictions, threshold=0.5):
  # Label an example positive when its predicted probability reaches the threshold
  pred_label = predictions >= threshold
  correct = sum(pred_label == labels)
  wrong = sum(pred_label != labels)
  return float(correct) / (correct + wrong)

## Tuning Logistic Regression

As we focus on optimizing our accuracy metric, keep in mind that nearly all machine learning models require tuning during the training phase. Logistic regression is considered one of the simplest models to learn, but it still has parameters that must be configured. These parameters are often called "hyperparameters": you do not modify them during a single training run, but you train the model multiple times with different hyperparameter values to find the values that lead to the best performance.

In logistic regression, one of the hyperparameters is the regularization parameter C. Regularization is a penalty on the complexity of the model itself, such as the magnitude of its weights. In Scikit learn, C is the inverse of the regularization strength, so a smaller C means a stronger penalty. The example below uses "L1" regularization, which has the following behavior: as C decreases, the number of non-zero weights also decreases (complexity decreases).
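
For reference, the L1-penalized objective can be written as follows. This is a sketch that follows the form given in the Scikit learn documentation as best I recall it, and it assumes labels encoded as $y_i \in \{-1, +1\}$, with $x_i$ the feature vector of example $i$, $w$ the weight vector, and $b$ the intercept:

$$
\min_{w,\, b} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(w^\top x_i + b\right)\right)\right)
$$

Because C multiplies the data-fit term rather than the penalty, lowering C makes the penalty $\lVert w \rVert_1$ relatively more important, which pushes more weights to exactly zero.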

A high complexity model (high C) will fit very well to the training data, but will also capture the noise inherent in the training set. This could lead to poor performance when predicting labels on the debugging set.

A low complexity model (low C) does not fit as well with training data, but will generalize better over unseen data. There is a delicate balance in this process, as oversimplifying the model also hurts its performance.

**EXERCISE: Try changing the value of C and re-running this cell. What happens to the number of non-zero weights in the model? What happens to the training and debugging accuracies when C is very large?**


In [0]:
# Plug into scikit learn for logistic regression training.
# The liblinear solver supports the L1 penalty.
model = LogisticRegression(penalty='l1', solver='liblinear', C=0.2)
model.fit(training_std, training_labels)

# Count non-zero coefficients (positive or negative) to check regularization strength
print('non-zero weights: ' + str(np.count_nonzero(model.coef_[0])))

# Get the output predictions of the training and debugging inputs
training_predictions = model.predict_proba(training_std)[:, 1]
debugging_predictions = model.predict_proba(debugging_std)[:, 1]

# Compute our accuracy metric for training and debugging
print('average training accuracy ' + str(get_accuracy(training_labels, training_predictions)))
print('average debugging accuracy ' + str(get_accuracy(debugging_labels, debugging_predictions)))

# Sklearn Support Vector Machine

**WARNING: SVMs are more complex than logistic regression and take longer to train. If you downloaded the entire training set, this cell can take a very long time to run!**

Logistic regression is often considered one of the simplest binary classification models.

A more sophisticated machine learning model is the support vector machine (SVM). SVMs are more commonly used with computer vision features because of their ability to capture nonlinear relationships between features and labels, while also capturing clusters of positive or negative labels in the feature vector space. For instance, Scikit learn's SVM defaults to the radial basis function (RBF) kernel, where whether a feature vector is classified as cat or not cat depends strongly on its proximity to the training feature vectors of the respective classes.

SVMs have more parameters to configure. A larger C increases sensitivity to mislabeled points and can lead to overfitting (great training but poor debugging performance). Smaller gamma causes the model to average training feature vectors around a wider range of the feature space.
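
The role of gamma can be sketched with the RBF kernel itself, which scores the similarity of two feature vectors as exp(-gamma * squared distance). The snippet below is a plain-Python illustration with made-up points. `rbf_kernel` is a hypothetical helper written for this example and is not the function Scikit learn uses internally.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity exp(-gamma * ||x - y||^2) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two made-up feature vectors whose squared distance is 25.
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

for gamma in (0.002, 0.2, 2.0):
    similarity = rbf_kernel(point_a, point_b, gamma)
    print('gamma=%g similarity=%.6f' % (gamma, similarity))
```

With a small gamma, even distant training points still count as similar, so each prediction is influenced by a wide neighborhood. With a large gamma, the similarity drops to nearly zero unless the points are very close, which makes the model far more local.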

**Exercise: play with the different values of C and gamma, especially at the extreme ends. What leads to great training but not so great debugging performance? What leads to blind inference (50% accuracy)? What is the best debugging performance you can achieve?**

In [0]:
# Do the same set of steps for SVMs
svm_model = svm.SVC(probability=True, C=50.0, gamma=0.002)
svm_model.fit(training_std, training_labels)

training_predictions = svm_model.predict_proba(training_std)[:, 1]
debugging_predictions = svm_model.predict_proba(debugging_std)[:, 1]
# Precision-recall curve on the debugging set (computed for reference; not used below)
precision, recall, _ = precision_recall_curve(debugging_labels, debugging_predictions)
# Compute our accuracy metric for training and debugging
print('average training accuracy ' + str(get_accuracy(training_labels, training_predictions)))
print('average debugging accuracy ' + str(get_accuracy(debugging_labels, debugging_predictions)))

# Tensorflow Model

Tensorflow is a Google home-grown tool that lets you define a model and run distributed training on it. In this notebook, we focus on the basic building blocks of a Tensorflow model, and all training runs locally.

### A couple warnings

One abstraction you will notice below is that Tensorflow creates its own objects, such as tensors, that are distinct from ordinary Python objects. In a distributed environment, the data may not be immediately available on disk, so instead of computing on the data directly you define a processing graph. When data becomes available, it runs through the graph, which computes the metrics, objective functions, and everything else that you need.

Secondly, Tensorflow is more complicated than sklearn, as you will soon see below. Tensorflow contains many more parameters to configure, which allows experts to fine tune many aspects of the model and optimize them in distributed environments. However, we will keep Tensorflow simple and digestible in this tutorial and stick with just a few parameters.

## Input functions

Tensorflow requires the user to define input functions: functions that return rows of feature vectors and their corresponding labels. Tensorflow will periodically call these functions to obtain data as model training progresses.

Why not just provide the feature vectors and labels upfront? Again, this comes down to the distributed aspect of Tensorflow, where data can be received from various sources, and not all data can fit on a single machine. For instance, you may have several million rows distributed across a cluster, but any one machine can only provide a few thousand rows. Tensorflow allows you to define the input function to pull data in from a queue rather than a numpy array, and that queue can contain training data that is available at that time.

Another practical reason for supplying limited training data is that sometimes the feature vectors are very long, and only a few rows can fit in memory at a time. Finally, complex ML models (such as deep neural networks) take a long time to train and use a lot of CPU and memory, so limiting the training samples on each machine allows us to train faster and without memory issues.

The features returned by an input function are defined as a dictionary of scalar, categorical, or tensor-valued features. The labels returned by an input function are defined as a single tensor storing the labels. In this notebook, we will simply return the entire set of features and labels with every function call.
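
The idea of an input function that hands back only part of the data on each call can be sketched without Tensorflow. The snippet below is a framework-free illustration using plain Python lists. `make_batch_input_fn` is a hypothetical helper, and it only mimics the `({'features': ...}, labels)` return shape used in this notebook rather than any real Tensorflow queue.

```python
import random

def make_batch_input_fn(features, labels, batch_size, seed=0):
    """Build an input function that returns one random mini-batch per call.

    Hypothetical helper for illustration only; it is not a Tensorflow API.
    """
    rng = random.Random(seed)
    indices = list(range(len(features)))

    def input_fn():
        # Pick batch_size distinct rows and return them with their labels.
        chosen = rng.sample(indices, batch_size)
        batch_features = [features[i] for i in chosen]
        batch_labels = [labels[i] for i in chosen]
        return {'features': batch_features}, batch_labels

    return input_fn

# Toy data: 10 rows with 2 features each, alternating labels.
toy_features = [[float(i), float(i) * 2.0] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

toy_input_fn = make_batch_input_fn(toy_features, toy_labels, batch_size=4)
batch, batch_labels = toy_input_fn()
print('rows in batch: %d' % len(batch['features']))
```

Each call returns a different subset, so a training loop that calls the function repeatedly eventually sees the whole dataset without ever holding more than one batch at a time.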

In [0]:
def train_input_fn():
  training_X_tf = tf.convert_to_tensor(training_std, dtype=tf.float32)
  training_y_tf = tf.convert_to_tensor(training_labels, dtype=tf.float32)
  return {'features': training_X_tf}, training_y_tf

def eval_input_fn():
  debugging_X_tf = tf.convert_to_tensor(debugging_std, dtype=tf.float32)
  debugging_y_tf = tf.convert_to_tensor(debugging_labels, dtype=tf.float32)
  return {'features': debugging_X_tf}, debugging_y_tf

## Tensorflow Logistic Regression

Tensorflow's linear classifiers, such as logistic regression, are structured as estimators. An estimator has the ability to compute the objective function of the ML model, and take a step towards reducing it. Tensorflow has built-in estimators such as "LinearClassifier", which is just a logistic regression trainer. These estimators have additional metrics that are calculated, such as the average accuracy at threshold = 0.5. (Woah, we have exactly the metric we want!)

Additionally, we can configure an optimizer, which determines how the model shifts its weights in order to optimize the objective function. Several different optimizers are available for experts to choose from, and FtrlOptimizer is the default.

The LEARNING_RATE parameter is another option that can be tuned. If this value is very large, the model coefficients may diverge and performance will be very poor as each jump, skip, and hop is a mile away. If the value is too small, the model can take a very long time to train to its final value.
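
The effect of the learning rate can be illustrated on a toy problem. The sketch below uses plain gradient descent on f(w) = w**2, which is deliberately simpler than the FTRL optimizer used in this notebook, so treat it as an analogy rather than a description of FtrlOptimizer. `minimize_quadratic` is a hypothetical helper.

```python
def minimize_quadratic(learning_rate, steps=50, start=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w = w - learning_rate * gradient
    return w

# Too small, reasonable, and too large learning rates.
for rate in (0.001, 0.3, 1.1):
    final_w = minimize_quadratic(rate)
    print('learning_rate=%g final |w|=%.6g' % (rate, abs(final_w)))
```

The minimum is at w = 0. The smallest rate barely moves away from the starting point in 50 steps, the middle rate converges, and the largest rate overshoots further on every step so that |w| grows instead of shrinking.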

In [0]:
# Tweak these hyperparameters to improve debugging accuracy.
REG_L1 = 5.0 # Use the inverse of C in sklearn, i.e. 1/C.
LEARNING_RATE = 2.0 # How aggressively to adjust coefficients during optimization
TRAINING_STEPS = 20000

# The estimator requires an array of features from the dictionary of feature columns to use in the model
feature_columns = [real_valued_column('features', dimension=FEATURE_LENGTH)]

# Use Tensorflow's built-in LinearClassifier estimator, which implements a logistic regression underneath
# You can go to the model_dir below to see what Tensorflow leaves behind during training. Delete the directory
# if you wish to retrain.
estimator = LinearClassifier(feature_columns=feature_columns,
                             optimizer=tf.train.FtrlOptimizer(
                               learning_rate=LEARNING_RATE,
                               l1_regularization_strength=REG_L1),
                             model_dir=os.path.join(OUTPUT_DIR, 'model-reg-' + str(REG_L1))
                            )

## Experiments and Runners

An experiment is a Tensorflow object that stores the estimator, as well as several other parameters. It can also periodically write the model's progress into checkpoints, which can be loaded later if you would like to resume training where it last left off.

Some of the parameters are:

* train_steps: how many times to adjust model weights before stopping
* eval_steps: how many times to call the evaluation input function per evaluation. When a summary is written, the model, in its current state of progress, predicts the debugging data and calculates its accuracy. eval_steps is set to 1 because a single call to the input function already returns the entire evaluation dataset.
* The rest of the parameters just say "do evaluation once".

(If you run the below script multiple times without changing REG_L1 or train_steps, you will notice that the model does not train, as you've already trained the model that many steps for the given configuration).
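
The reason a second run does nothing can be sketched with a toy training loop that resumes from a saved step counter. This is only an analogy for the behavior described above: `train_until` and its `checkpoint.json` file are hypothetical and do not reflect Tensorflow's actual checkpoint format.

```python
import json
import os
import tempfile

def train_until(model_dir, target_steps):
    """Toy training loop that resumes from a saved global step counter.

    Hypothetical stand-in for checkpointing; it only tracks the step count.
    Returns the number of steps actually run in this call.
    """
    checkpoint_path = os.path.join(model_dir, 'checkpoint.json')
    global_step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            global_step = json.load(f)['global_step']

    steps_run = 0
    while global_step < target_steps:
        global_step += 1  # a real trainer would update the weights here
        steps_run += 1

    with open(checkpoint_path, 'w') as f:
        json.dump({'global_step': global_step}, f)
    return steps_run

toy_model_dir = tempfile.mkdtemp()
first_run = train_until(toy_model_dir, target_steps=100)
second_run = train_until(toy_model_dir, target_steps=100)
third_run = train_until(toy_model_dir, target_steps=150)
print('steps run: %d, %d, %d' % (first_run, second_run, third_run))
```

The second call finds that the saved step count already equals the target and runs no steps. Raising the target, or pointing at a fresh directory, is what triggers more training.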

## On Tensorflow Outputs

Tensorflow prints a lot of text. Such text can be useful when debugging a distributed training pipeline, but is pretty noisy when running from a notebook locally. The part to look for is the chunk at the end where "accuracy" is reported. This is the final result of the model.

In [0]:
def experiment_fn(output_dir):
  return Experiment(estimator=estimator,
                    train_input_fn=train_input_fn,
                    eval_input_fn=eval_input_fn,
                    train_steps=TRAINING_STEPS,
                    eval_steps=1,
                    min_eval_frequency=1,
                    local_eval_frequency=1)

learn_runner.run(experiment_fn, os.path.join(OUTPUT_DIR, 'model-reg-' + str(REG_L1)))