[Source kernel](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/lesson2.ipynb)

# Linear models with CNN features

## Introduction

In [1]:
import os, json, sys
from glob import glob
import numpy as np
from numpy.random import random, permutation
from scipy import misc, ndimage
from scipy.ndimage.interpolation import zoom
import scipy

In [2]:
sys.path.insert(0, './../utils')
from importlib import reload
import utils; reload(utils)
from utils import plots, get_batches, plot_confusion_matrix, get_data

Using Theano backend.


In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
np.set_printoptions(precision=4, linewidth=100)
from matplotlib import pyplot as plt
%matplotlib inline

In [4]:
import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.models import Sequential
from keras.layers import Input
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD, RMSprop
from keras.preprocessing import image

## Linear models in keras

It turns out that each of the Dense() layers is just a linear model, followed by a simple activation function. We'll learn about the activation function later - first, let's review how linear models work.
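To make the "linear model followed by an activation" claim concrete, here is a minimal NumPy sketch of the forward computation. The names `dense_forward` and `relu` are illustrative helpers, not Keras functions, and the weights are hand-picked rather than learned.

```python
import numpy as np

def dense_forward(x, W, b, activation=None):
    """Compute activation(x @ W + b), the computation a dense layer performs."""
    z = x @ W + b  # the linear model: weighted sum of inputs plus a bias
    return z if activation is None else activation(z)

def relu(z):
    """A simple activation function: clamp negative values to zero."""
    return np.maximum(z, 0.0)

x = np.array([[1.0, 2.0], [0.5, -1.0]])  # 2 samples, 2 features
W = np.array([[2.0], [3.0]])             # weights for a single output unit
b = np.array([1.0])                      # bias

linear_out = dense_forward(x, W, b)      # no activation: a pure linear model
relu_out = dense_forward(x, W, b, relu)  # same linear model, then ReLU
```

With no activation the layer reduces to the same `np.dot(x, w) + b` form that is used to generate `y` in the cells that follow.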

In [5]:
x = random((30,2))
y = np.dot(x, [2., 3.]) + 1.

In [6]:
x[:5]

array([[ 0.2352,  0.4929],
       [ 0.0987,  0.9733],
       [ 0.9157,  0.4227],
       [ 0.5391,  0.7782],
       [ 0.561 ,  0.7058]])

In [7]:
y[:5]

array([ 2.9492,  4.1172,  4.0997,  4.4128,  4.2395])

We can use Keras to create a simple linear model (a Dense() layer with no activation) and optimize it using SGD to minimize mean squared error (MSE):

In [8]:
lm = Sequential([ Dense(1, input_shape=(2,)) ])
lm.compile(optimizer=SGD(lr=0.1), loss='mse')

At this point the model has only been compiled, not trained, so its weights are still at their random initial values. Evaluating the loss function (MSE) now gives a baseline to compare against after training.

In [9]:
lm.evaluate(x, y, verbose=0)

19.47471809387207

In [10]:
lm.fit(x, y, epochs=5, batch_size=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f73fbb1a240>

In [11]:
lm.evaluate(x, y, verbose=0)

0.022048894315958023
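The drop in loss comes from repeated small gradient steps. As a rough sketch of what SGD does for this problem, the loop below applies per-sample updates (the equivalent of `batch_size=1`) by hand in NumPy. The seed, the zero initialization, and the variable names are assumptions made for illustration, so the numbers will not match the Keras run above.

```python
import numpy as np

rng = np.random.RandomState(0)            # fixed seed, for repeatability only
x = rng.random_sample((30, 2))
y = x @ np.array([2.0, 3.0]) + 1.0        # same target as the notebook

w = np.zeros(2)                           # assumed starting point (Keras starts randomly)
b = 0.0
lr = 0.1

def mse(w, b):
    """Mean squared error of the linear model on the full data set."""
    return float(np.mean((x @ w + b - y) ** 2))

loss_before = mse(w, b)
for epoch in range(5):
    for i in rng.permutation(len(x)):     # visit samples in random order
        err = x[i] @ w + b - y[i]         # prediction error for one sample
        w -= lr * 2 * err * x[i]          # gradient of err**2 with respect to w
        b -= lr * 2 * err                 # gradient of err**2 with respect to b
loss_after = mse(w, b)
```

Comparing `loss_before` with `loss_after` shows the same pattern as the two `lm.evaluate` calls: a large initial error that shrinks after a few passes over the data.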

In [12]:
lm.get_weights()

[array([[ 1.6349],
        [ 2.58  ]], dtype=float32), array([ 1.4228], dtype=float32)]
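The learned weights are close to, but not exactly, the values 2, 3 and 1 used to generate `y`, because five epochs of SGD have not fully converged. As a cross-check, the same linear model can be fitted in closed form with least squares. This is a sketch on freshly generated data with an assumed seed, not the notebook's own `x` and `y`.

```python
import numpy as np

rng = np.random.RandomState(0)            # assumed seed, for repeatability only
x = rng.random_sample((30, 2))
y = x @ np.array([2.0, 3.0]) + 1.0

# Append a column of ones so the bias is fitted as a third coefficient.
X = np.hstack([x, np.ones((len(x), 1))])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
weights, bias = coef[:2], coef[2]
```

Because the data contain no noise, the least-squares solution recovers the generating weights and bias up to floating-point error, which is the point SGD is heading toward.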

## Train linear model on predictions

Using a Dense() layer in this way, we can easily convert the 1,000 predictions given by our model into a probability of dog vs cat: simply train a linear model that takes the 1,000 predictions as input and returns dog or cat as output, learning from the Kaggle data. This should be easier and more accurate than manually creating a map from ImageNet categories to one dog/cat category.

### Training the model

In [13]:
from vgg16 import Vgg16

In [14]:
path = '../dogs_vs_cats/intermediate/'
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)

In [15]:
vgg = Vgg16()

In [16]:
#batch_size=100
batch_size=4

Our overall approach here will be:

1. Get the true labels for every image
2. Get the 1,000 ImageNet category predictions for every image
3. Feed these predictions as input to a simple linear model.
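The three steps above can be sketched end to end on synthetic data. Everything here is a stand-in: the labels are random, the "predictions" are generated rather than produced by VGG16, and the layout in which columns 0-9 respond to cats and columns 10-19 to dogs is hypothetical. The linear model is a plain logistic regression trained by gradient descent in NumPy.

```python
import numpy as np

rng = np.random.RandomState(0)
n_images, n_categories = 200, 1000

# Step 1: true labels (0 = cat, 1 = dog); synthetic here.
labels = rng.randint(0, 2, size=n_images)

# Step 2: stand-in for the 1,000 category predictions per image.
# Hypothetical layout: columns 0-9 respond to cats, columns 10-19 to dogs.
logits = rng.normal(size=(n_images, n_categories))
logits[labels == 0, 0:10] += 4.0
logits[labels == 1, 10:20] += 4.0
preds = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

# Step 3: a linear model (logistic regression) from predictions to dog vs cat.
w = np.zeros(n_categories)
b = 0.0
lr = 5.0
for _ in range(500):
    p_dog = 1.0 / (1.0 + np.exp(-(preds @ w + b)))  # sigmoid of the linear model
    grad = p_dog - labels                           # cross-entropy gradient
    w -= lr * preds.T @ grad / n_images
    b -= lr * grad.mean()

predicted = (preds @ w + b > 0).astype(int)
accuracy = float(np.mean(predicted == labels))
```

The model is never told which columns matter; it learns positive weights for the dog-like columns and negative weights for the cat-like ones, which is the mapping that would otherwise have to be written by hand.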

Let's start by grabbing training and validation batches.

In [17]:
# Use batch size of 1 since we're just doing preprocessing on the CPU
val_batches = get_batches(path+'val0.3', shuffle=False, batch_size=1)
batches = get_batches(path+'train0.3', shuffle=False, batch_size=1)

Found 7500 images belonging to 2 classes.
Found 17500 images belonging to 2 classes.


Loading and resizing the images every time we want to use them is wasteful; instead, we should save the processed arrays once and reload them. By far the fastest way to save and load numpy arrays is using **bcolz**. It also compresses the arrays, so we save disk space. Here are the functions we'll use to save and load using bcolz.

In [18]:
import bcolz
def save_array(fname, arr): 
    c=bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()
    
def load_array(fname): 
    return bcolz.open(fname)[:]
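If bcolz is not installed, a similar save/load pair can be sketched with NumPy's own compressed format. The function names `save_array_npz` and `load_array_npz` are hypothetical, and this is not claimed to match bcolz for speed; it only shows the same round trip.

```python
import os
import tempfile
import numpy as np

def save_array_npz(fname, arr):
    """Save one array with compression, using NumPy only."""
    np.savez_compressed(fname, arr=arr)

def load_array_npz(fname):
    """Load the array written by save_array_npz."""
    with np.load(fname) as data:
        return data["arr"]

# Round trip on a small demo array in a temporary directory.
demo_file = os.path.join(tempfile.mkdtemp(), "demo_data.npz")
original = np.arange(12, dtype=np.float32).reshape(3, 4)
save_array_npz(demo_file, original)
restored = load_array_npz(demo_file)
```

Unlike `load_array` above, which reads from a bcolz directory, this version stores everything in a single `.npz` file.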

If ```get_data``` fails with a type error, see [this forum thread](http://forums.fast.ai/t/type-error-in-lesson-2-get-data-method/1105/13).

The fix is in the ```get_data``` function of ```utils.py```, where ```nb_sample``` is replaced with ```samples```:

- (old) return np.concatenate([batches.next() for i in range(batches.nb_sample)])
- (new) return np.concatenate([batches.next() for i in range(batches.samples)])
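For code that has to run against either attribute name, one option is a small lookup helper. This is a sketch: `n_samples` is a hypothetical function, and the `SimpleNamespace` objects only stand in for the iterator that `get_batches` returns.

```python
from types import SimpleNamespace

def n_samples(batches):
    """Return the image count from an iterator, whichever attribute it exposes."""
    if hasattr(batches, "samples"):   # the (new) attribute name
        return batches.samples
    return batches.nb_sample          # the (old) attribute name

# Stand-in objects for illustration only.
new_style = SimpleNamespace(samples=17500)
old_style = SimpleNamespace(nb_sample=17500)
count_new = n_samples(new_style)
count_old = n_samples(old_style)
```

Inside `get_data`, the loop bound would then be `range(n_samples(batches))` instead of a hard-coded attribute.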

In [19]:
trn_data = get_data(path+'train0.3')

Found 17500 images belonging to 2 classes.


MemoryError: 
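The MemoryError is plausible from arithmetic alone. Assuming each image is stored as a 3 x 224 x 224 float32 array (the usual VGG16 input size, not stated in this section), the training set needs roughly 10.5 GB before counting any temporary copies made by `np.concatenate`. The sketch below does that estimate and shows one possible workaround: writing batches into an on-disk `.npy` file instead of building one large in-memory array. `array_bytes` and `fill_in_chunks` are hypothetical helpers, demonstrated on tiny synthetic data.

```python
import os
import tempfile
import numpy as np

def array_bytes(n_images, image_shape, dtype=np.float32):
    """Bytes needed to hold n_images arrays of image_shape in memory."""
    return n_images * int(np.prod(image_shape)) * np.dtype(dtype).itemsize

# Assumed shape: 3 x 224 x 224 float32 per image.
train_gb = array_bytes(17500, (3, 224, 224)) / 1e9

def fill_in_chunks(fname, batch_iter, n_images, image_shape, dtype=np.float32):
    """Write batches into an on-disk .npy file without one big in-memory array."""
    out = np.lib.format.open_memmap(
        fname, mode="w+", dtype=dtype, shape=(n_images,) + tuple(image_shape))
    start = 0
    for batch in batch_iter:
        stop = start + len(batch)
        out[start:stop] = batch       # only this batch is held in memory
        start = stop
        if start >= n_images:         # stop explicitly: some iterators never end
            break
    out.flush()
    return out

# Tiny synthetic demo: 10 "images" of shape 3 x 4 x 4, delivered 4 at a time.
demo = np.arange(10 * 3 * 4 * 4, dtype=np.float32).reshape(10, 3, 4, 4)
demo_batches = (demo[i:i + 4] for i in range(0, 10, 4))
demo_file = os.path.join(tempfile.mkdtemp(), "demo.npy")
stored = fill_in_chunks(demo_file, demo_batches, 10, (3, 4, 4))
```

The resulting file can later be opened with `np.load(fname, mmap_mode='r')`, so that only the slices actually used are read from disk.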

In [None]:
val_data = get_data(path+'val0.3/')

In [None]:
save_array(model_path+'train_data.bc', trn_data)
save_array(model_path+'valid_data.bc', val_data)

In [None]:
trn_data = load_array(model_path+'train_data.bc')
val_data = load_array(model_path+'valid_data.bc')

In [None]:
trn_data.shape, val_data.shape

Keras returns classes as a single column of integer class indices, so we convert them to one-hot encoding.

In [None]:
def onehot(x): 
    return np.array(OneHotEncoder().fit_transform(x.reshape(-1,1)).todense())
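To see what this conversion produces, here is a NumPy-only sketch of the same idea on a small array. `onehot_np` is a hypothetical helper, and it assumes the classes are integers numbered from 0, as they are for the two cat/dog folders.

```python
import numpy as np

def onehot_np(classes, n_classes=None):
    """One-hot encode integer class indices by picking rows of an identity matrix."""
    classes = np.asarray(classes, dtype=int)
    if n_classes is None:
        n_classes = int(classes.max()) + 1
    return np.eye(n_classes)[classes]

demo_classes = np.array([0, 0, 1, 1, 0])
demo_labels = onehot_np(demo_classes)
```

Each row of `demo_labels` contains a single 1 in the column of its class, so class 0 becomes `[1, 0]` and class 1 becomes `[0, 1]`.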

In [None]:
val_classes = val_batches.classes
train_classes = batches.classes

In [None]:
train_classes.shape

In [None]:
val_labels = onehot(val_classes)
train_labels = onehot(train_classes)