## Nathan's Kaggle Dogs vs Cats Redux

This notebook will:

0. Introduce one way to pull data from Kaggle
1. Create directory structure for utilities, vgg, and data (training, validation, sample, and test)
2. Download and unzip necessary data from Kaggle
3. Set up VGG to analyze training images and build model
4. Finetune the model to categorize just dogs and cats
5. Submit results to Kaggle via Kaggle CLI

# Downloading Kaggle data

While a client for downloading Kaggle data exists, [kaggle-cli](https://github.com/floydwch/kaggle-cli), I had trouble with it.

Instead I logged into Kaggle in Chrome and [copied the curl request](https://coderwall.com/p/-fdgoq/chrome-developer-tools-adds-copy-as-curl), then pasted it into the command line followed by `-o filename.zip` so that the output went to a file rather than STDOUT.

After downloading both `train.zip` and `test.zip`, I unzipped their contents into the `lesson1/data/redux/` directory.
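I did the unzipping by hand, but it can also be done from Python. The following is a minimal sketch using the standard `zipfile` module; the helper name `unzip_into` is my own, and it assumes the archives unpack into `train/` and `test/` as in the tree shown in the next section.

```python
import os
import zipfile


def unzip_into(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir; return the member names."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For example, `unzip_into('data/redux/train.zip', 'data/redux')` would unpack the training archive next to the zip file, assuming that relative path exists.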

## Double Check Directory Structure

For now, we should have a directory structure that looks like this:

```
utils/
    vgg16.py                        // from fast.ai, this is the algorithm for image classification
    utils.py                        // from fast.ai, this is a collection of functions
lesson1/
    nathan_dogs_cats_redux.ipynb    // this file
    lesson1.ipynb                   // from fast.ai, this is a notebook that describes how to use vgg16
    data/
        redux/                      // kaggle data will be put into this dir and unzipped into train/ and test/
            train/
                cat.348.jpg
                dog.3778.jpg
                ...
            test/
                123.jpg
                857.jpg
                ...
```                

In [2]:
# sanity check to make sure we are in the right directory (/home/ubuntu/nbs/lesson1/)
%pwd

u'/home/ubuntu/nbs/lesson1'

In [44]:
# allow plots to show up in this notebook
%matplotlib inline

In [3]:
# create references to directories we will use often
import os, sys
current_dir = os.getcwd()
HOME_DIR_LESSON = current_dir
HOME_DIR_DATA = current_dir+'/data/redux'

In [7]:
# check that the Kaggle zips are present (they are downloaded and unzipped by hand, as described above)
if not os.path.isfile(HOME_DIR_DATA+'/train.zip'):
    print("No Kaggle Training zip found, double check you have training data")
elif not os.path.isfile(HOME_DIR_DATA+'/test.zip'):
    print("No Kaggle Testing zip found, double check you have test data")
else:
    print("Kaggle zips found, assuming they've been unzipped appropriately")

Kaggle zips found, assuming they've been unzipped appropriately


In [8]:
# allow imports from directories above lesson1/
sys.path.insert(1, os.path.join(sys.path[0], '..'))

# Slice up Kaggle Data for training

Now we're going to slice the Kaggle data into further pieces: the training set will be broken up into three subsets (train, valid, and sample). We will create new directories and move or copy the Kaggle data into them.

## 1. Create proper directory structure

Directories marked with `*` are the new ones we will create:

```
utils/
lesson1/
    data/
        redux/                       
            train/
            test/
                *unknown/
            *valid/
            *results/
            *sample/
                *train/
                *test/
                *valid/
                *results/
```

In [11]:
# Create directories if they don't exist
%cd $HOME_DIR_DATA

if os.path.exists(HOME_DIR_DATA+'/valid') == False:
    %mkdir valid
if os.path.exists(HOME_DIR_DATA+'/results') == False:
    %mkdir results
if os.path.exists(HOME_DIR_DATA+'/sample/train') == False:
    %mkdir -p sample/train
if os.path.exists(HOME_DIR_DATA+'/sample/test') == False:
    %mkdir -p sample/test
if os.path.exists(HOME_DIR_DATA+'/sample/valid') == False:
    %mkdir -p sample/valid
if os.path.exists(HOME_DIR_DATA+'/sample/results') == False:
    %mkdir -p sample/results
if os.path.exists(HOME_DIR_DATA+'/test/unknown') == False:
    %mkdir -p test/unknown

/home/ubuntu/nbs/lesson1/data/redux
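The same directories can be created without shell magics. This is only a sketch under my own naming (`SUBDIRS` and `ensure_dirs` are not part of fast.ai's utilities); it uses `os.makedirs`, which also creates intermediate directories such as `sample/`.

```python
import os

# Subdirectories from the tree above, relative to the data directory
SUBDIRS = ['valid', 'results', 'sample/train', 'sample/test',
           'sample/valid', 'sample/results', 'test/unknown']


def ensure_dirs(base_dir, subdirs=SUBDIRS):
    """Create each subdirectory under base_dir if missing; return those created."""
    created = []
    for sub in subdirs:
        full = os.path.join(base_dir, *sub.split('/'))
        if not os.path.isdir(full):
            os.makedirs(full)
            created.append(full)
    return created
```

Calling `ensure_dirs(HOME_DIR_DATA)` a second time returns an empty list, so the cell is safe to re-run.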


## 2. Break up Kaggle data

In [12]:
%cd $HOME_DIR_DATA/train

/home/ubuntu/nbs/lesson1/data/redux/train


In [19]:
# import glob module for finding pathnames (note: results in arbitrary order)
import glob 
import numpy as np

##### Get all training data and shuffle it before breaking it up

In [27]:
# get array of all files in training set
training_pics = glob.glob('*.jpg')
# permute them for randomization
shuf = np.random.permutation(training_pics)
train_set_size = len(shuf)
print('%d : total training files' % train_set_size)

25000 : total training files


### *Separate* Data For Validation Set

##### Separate (i.e. move) 10% of the original training data into the validation set. This shouldn't be done more than once.

In [31]:
valid_set_size = int(train_set_size*0.1)
if len(os.listdir(HOME_DIR_DATA+'/valid/')) == 0:
    for i in range(valid_set_size):
        os.rename(shuf[i], HOME_DIR_DATA+'/valid/' + shuf[i])
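The move above depends on the current working directory being `train/`. A version that takes explicit paths might look like the sketch below. The function name `move_fraction` and the `seed` parameter are my own additions, and it shuffles with the standard `random` module rather than `np.random.permutation`.

```python
import os
import random


def move_fraction(src_dir, dst_dir, fraction, seed=0):
    """Move a random `fraction` of the .jpg files in src_dir into dst_dir.

    Does nothing if dst_dir already has entries, so re-running the cell
    does not move a second batch. Returns the list of moved file names.
    """
    if os.listdir(dst_dir):
        return []
    names = sorted(n for n in os.listdir(src_dir) if n.endswith('.jpg'))
    random.Random(seed).shuffle(names)
    moved = names[:int(len(names) * fraction)]
    for name in moved:
        os.rename(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return moved
```

Sorting before shuffling with a fixed seed makes the split repeatable, since directory listings come back in arbitrary order.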

### *Copy* Data For Sample Set

##### Copy a ~1% subset of the remaining training data for sampling (i.e. training on a very small dataset for speed)

In [32]:
from shutil import copyfile

In [40]:
# get array of remaining files after splitting off validation set
training_pics_sub_val = glob.glob('*.jpg')
shuf = np.random.permutation(training_pics_sub_val)
# number of training images to copy into sample/train
sample_train_size = int(len(shuf)*0.01)
if len(os.listdir(HOME_DIR_DATA+'/sample/train/')) == 0:
    for i in range(sample_train_size):
        copyfile(shuf[i], HOME_DIR_DATA+'/sample/train/'+shuf[i])

##### Copy a ~2% subset of the separated validation data for sampling

In [41]:
%cd $HOME_DIR_DATA/valid

/home/ubuntu/nbs/lesson1/data/redux/valid


In [45]:
if len(os.listdir(HOME_DIR_DATA+'/sample/valid/')) == 0:
    print('/sample/valid is empty')
    validation_set = glob.glob("*.jpg")
    shuf = np.random.permutation(validation_set)
    sample_valid_size = int(len(shuf)*0.02)
    for i in range(sample_valid_size):
        copyfile(shuf[i], HOME_DIR_DATA+'/sample/valid/'+shuf[i])

/sample/valid is empty
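As a check on the arithmetic in the cells above, here is a small sketch that reproduces the subset sizes from the 25000 training files. The function name `split_sizes` is hypothetical; the fractions are the ones used in this notebook.

```python
def split_sizes(total, valid_frac=0.1, sample_train_frac=0.01, sample_valid_frac=0.02):
    """Return the size of each subset, using the same int() truncation as the cells above."""
    valid = int(total * valid_frac)
    train = total - valid  # validation images are moved out of train/
    return {
        'train': train,
        'valid': valid,
        'sample_train': int(train * sample_train_frac),
        'sample_valid': int(valid * sample_valid_frac),
    }


sizes = split_sizes(25000)
# train: 22500, valid: 2500, sample_train: 225, sample_valid: 50
```

The sample sets are copies, so they do not reduce the size of `train/` or `valid/`.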


### *Break up* Data Sets Into Class Directories

In [46]:
%cd $HOME_DIR_DATA/sample/train

/home/ubuntu/nbs/lesson1/data/redux/sample/train


In [48]:
os.path.isdir(HOME_DIR_DATA+'/sample/train/dogs')

False

In [49]:
# Divide sample train cat/dog images into separate directories

%cd $HOME_DIR_DATA/sample/train
if os.path.isdir(HOME_DIR_DATA+'/sample/train/dogs') == False:
    %mkdir dogs
    %mv dog.*.jpg dogs/
if os.path.isdir(HOME_DIR_DATA+'/sample/train/cats') == False:
    %mkdir cats
    %mv cat.*.jpg cats/

/home/ubuntu/nbs/lesson1/data/redux/sample/train


In [50]:
# Divide sample valid cat/dog images into separate directories

%cd $HOME_DIR_DATA/sample/valid
if os.path.isdir(HOME_DIR_DATA+'/sample/valid/dogs') == False:
    %mkdir dogs
    %mv dog.*.jpg dogs/
if os.path.isdir(HOME_DIR_DATA+'/sample/valid/cats') == False:
    %mkdir cats
    %mv cat.*.jpg cats/

/home/ubuntu/nbs/lesson1/data/redux/sample/valid


In [51]:
# Divide training set cat/dog images into separate directories

%cd $HOME_DIR_DATA/train
if os.path.isdir(HOME_DIR_DATA+'/train/dogs') == False:
    %mkdir dogs
    %mv dog.*.jpg dogs/
if os.path.isdir(HOME_DIR_DATA+'/train/cats') == False:
    %mkdir cats
    %mv cat.*.jpg cats/

/home/ubuntu/nbs/lesson1/data/redux/train


In [53]:
# Divide validation set cat/dog images into separate directories

%cd $HOME_DIR_DATA/valid
if os.path.isdir(HOME_DIR_DATA+'/valid/dogs') == False:
    %mkdir dogs
    %mv dog.*.jpg dogs/
if os.path.isdir(HOME_DIR_DATA+'/valid/cats') == False:
    %mkdir cats
    %mv cat.*.jpg cats/

/home/ubuntu/nbs/lesson1/data/redux/valid
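The four cells above repeat the same steps for different directories. One way to fold them into a single helper is sketched below; `split_by_class` is a name I made up, and it assumes the Kaggle naming scheme `cat.N.jpg` / `dog.N.jpg` seen in the training set.

```python
import os
import shutil


def split_by_class(directory, classes=('cat', 'dog')):
    """Move files named '<class>.*.jpg' in `directory` into '<class>s/' subdirectories.

    Returns a dict mapping each class to the number of files moved.
    """
    counts = {}
    for cls in classes:
        target = os.path.join(directory, cls + 's')
        if not os.path.isdir(target):
            os.makedirs(target)
        names = [n for n in os.listdir(directory)
                 if n.startswith(cls + '.') and n.endswith('.jpg')]
        for name in names:
            shutil.move(os.path.join(directory, name), os.path.join(target, name))
        counts[cls] = len(names)
    return counts
```

Unlike the cells above, this sketch moves any remaining images even when `dogs/` or `cats/` already exists, so a partially completed run can be finished by calling it again.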


### Create single 'unknown' class for test set

In [56]:
%cd $HOME_DIR_DATA/test
%mv *.jpg unknown/

/home/ubuntu/nbs/lesson1/data/redux/test


# Train First Generation Model

In [40]:
# import modules
%cd $HOME_DIR_LESSON
from utils import utils
from utils.vgg16 import Vgg16

/home/ubuntu/nbs/lesson1


#### We use VGG16, an existing image classifier that assigns each image to one of 1000 categories from ImageNet.

In [34]:
%cd $HOME_DIR_DATA

# set paths for training set (can take away sample/ once everything's working)
path = HOME_DIR_DATA + '/sample/' #'/' for whole dataset
test_path = path + 'test/'
results_path = path + 'results/'
train_path = path + 'train/'
valid_path = path + 'valid/'

/home/ubuntu/nbs/lesson1/data/redux
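Switching between the sample and the full dataset means editing the `path` string by hand. A small helper could make that a flag instead. This is a sketch only: `build_paths` and `use_sample` are hypothetical names, and it returns a dict rather than the separate variables used in the cell above.

```python
import os


def build_paths(data_dir, use_sample=True):
    """Return train/valid/test/results paths under data_dir or data_dir/sample."""
    root = os.path.join(data_dir, 'sample') if use_sample else data_dir
    # a trailing separator is kept so the paths match the style used in this notebook
    return {name: os.path.join(root, name) + os.sep
            for name in ('train', 'valid', 'test', 'results')}
```

With this, `build_paths(HOME_DIR_DATA)['train']` would point at `sample/train/`, and passing `use_sample=False` would point at the full `train/` directory.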


In [41]:
# set Vgg16 helper class from vgg16 library
vgg = Vgg16()

In [42]:
# set constants: 
# batch size usu. recommended to be no larger than 64, can adjust
# to be smaller if running out of memory
batch_size = 64
# increasing no_of_epochs usually improves accuracy, at the cost of longer training
no_of_epochs = 10

In [43]:
# grab a few images at a time for training and validation.
# a batch is a collection of images and their labels
batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)

Found 225 images belonging to 2 classes.
Found 50 images belonging to 2 classes.
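To get a feel for what these batch sizes mean on the sample set, the number of batches per epoch can be worked out directly. This is a rough sketch that assumes the last, smaller batch is still yielded; `batches_per_epoch` is my own helper, not part of `vgg16.py`.

```python
import math


def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once, counting a partial final batch."""
    return int(math.ceil(num_images / float(batch_size)))


train_batches = batches_per_epoch(225, 64)      # 3 full batches of 64, plus one of 33
valid_batches = batches_per_epoch(50, 64 * 2)   # all 50 images fit in one batch of 128
```

On the full dataset the same calculation applies with 22500 training images and 2500 validation images.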


#### Let's see what we have so far

In [45]:
from matplotlib import pyplot as plt