# Statoil/C-CORE Iceberg Classifier Challenge
## Ship or iceberg, can you decide from space?


[Ship or Iceberg](https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)
    

To start, download the git repo and make sure your directory structure looks like this:
```
utils/
    vgg16.py
    utils.py
attempt1/
    statoil-iceberg-classifier-challenge.ipynb

```
Launch your notebook from inside the attempt1 directory:
```
cd attempt1
jupyter notebook
```


In [None]:
#Verify we are in the attempt1 directory
%pwd

In [None]:
#Create references to important directories we will use over and over
import os, sys
current_dir = os.getcwd()
ATTEMPT_HOME_DIR = current_dir
DATA_HOME_DIR = current_dir+'/data/icebergs'

In [None]:
%mkdir data
%mkdir -p data/icebergs
%cd $DATA_HOME_DIR

In [None]:
#Install Kaggle CLI
! pip install -U kaggle-cli

In [None]:
! kg config -g -u dan.king.001@gmail.com -p insertpasswordhere -c statoil-iceberg-classifier-challenge
! kg download

In [None]:
! sudo apt-get install p7zip-full

In [None]:
! 7z e test.json.7z
! 7z e train.json.7z

In [None]:
%cd $ATTEMPT_HOME_DIR

In [None]:
#Allow imports from the utils/ directory that sits next to attempt1/
sys.path.insert(1, os.path.join(ATTEMPT_HOME_DIR, '..', 'utils'))

#import modules
from utils import *
from vgg16 import Vgg16

#Instantiate plotting tool
#In Jupyter notebooks, you will need to run this command before doing any plotting
%matplotlib inline

# Action Plan
1. Create Validation and Sample sets
2. Rearrange image files into their respective directories 
3. Finetune and Train model
4. Generate predictions
5. Validate predictions
6. Submit predictions to Kaggle

## Create validation set and sample

In [None]:
#Create directories
%cd $DATA_HOME_DIR
%mkdir valid
%mkdir results
%mkdir test
%mkdir train
%mkdir -p train/ships
%mkdir -p train/icebergs
%mkdir -p valid/ships
%mkdir -p valid/icebergs
%mkdir -p sample/train
%mkdir -p sample/train/ships
%mkdir -p sample/train/icebergs
%mkdir -p sample/test
%mkdir -p sample/valid
%mkdir -p sample/valid/ships
%mkdir -p sample/valid/icebergs
%mkdir -p sample/results
%mkdir -p test/unknown

In [None]:
import pandas as pd 
import cv2 
import numpy as np 
import matplotlib.pyplot as plt  

In [None]:
def create_train():
    # Read the training JSON into a pandas dataframe
    df_train = pd.read_json('train.json')
    print("df_train rows: %d" % len(df_train))
    for ix, row in df_train.iterrows():
        # Each band is a flattened 75x75 radar image
        img = np.array(row['band_1']).reshape((75, 75))
        img2 = np.array(row['band_2']).reshape((75, 75))
        # Sum the two bands, then min-max scale the result to 0-255
        img3 = img + img2
        img3 -= img3.min()
        img3 /= img3.max()
        img3 *= 255
        img3 = img3.astype(np.uint8)
        # Write the image into the directory for its class
        if row['is_iceberg']==0:
            cv2.imwrite("train/ships/" + row['id'] + ".png", img3)
        elif row['is_iceberg']==1:
            cv2.imwrite("train/icebergs/" + row['id'] + ".png", img3)

In [None]:
create_train()

In [None]:
def create_test():
    # Read the test JSON into a pandas dataframe
    df_test = pd.read_json('test.json')
    print("df_test rows: %d" % len(df_test))
    for ix, row in df_test.iterrows():
        # Each band is a flattened 75x75 radar image
        img = np.array(row['band_1']).reshape((75, 75))
        img2 = np.array(row['band_2']).reshape((75, 75))
        # Sum the two bands, then min-max scale the result to 0-255
        img3 = img + img2
        img3 -= img3.min()
        img3 /= img3.max()
        img3 *= 255
        img3 = img3.astype(np.uint8)
        # Test images have no label, so they all go into test/unknown/
        cv2.imwrite("test/unknown/" + row['id'] + ".png", img3)

In [None]:
create_test()
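
Both functions above apply the same band-to-image conversion. As a rough sketch, that step can be isolated into a helper and checked on synthetic data. The name `bands_to_uint8` and the guard against a constant image are assumptions added for illustration; they are not part of the notebook's code.

```python
import numpy as np

def bands_to_uint8(band_1, band_2, size=75):
    """Combine two flattened radar bands into one 8-bit grayscale image."""
    img = np.asarray(band_1, dtype=np.float64).reshape((size, size))
    img = img + np.asarray(band_2, dtype=np.float64).reshape((size, size))
    img = img - img.min()          # shift so the darkest pixel is 0
    peak = img.max()
    if peak > 0:                   # hypothetical guard: avoid dividing by zero
        img = img / peak           # scale to [0, 1]
    return (img * 255).astype(np.uint8)

# Synthetic values standing in for one row of train.json
rng = np.random.RandomState(0)
demo = bands_to_uint8(rng.uniform(-40, 0, 75 * 75), rng.uniform(-40, 0, 75 * 75))
print(demo.shape, demo.dtype, demo.min(), demo.max())
```

Note that the scaling is done per image, so absolute intensity differences between images are discarded.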

## Rearrange image files into their respective directories

In [None]:
%cd $DATA_HOME_DIR/train/ships

In [None]:
g = glob('*.png')
shuf = np.random.permutation(g)
for i in range(100): os.rename(shuf[i], DATA_HOME_DIR+'/valid/ships/' + shuf[i])

In [None]:
from shutil import copyfile

In [None]:
g = glob('*.png')
shuf = np.random.permutation(g)
for i in range(100): copyfile(shuf[i], DATA_HOME_DIR+'/sample/train/ships/' + shuf[i])

In [None]:
%cd $DATA_HOME_DIR/train/icebergs

In [None]:
g = glob('*.png')
shuf = np.random.permutation(g)
for i in range(100): os.rename(shuf[i], DATA_HOME_DIR+'/valid/icebergs/' + shuf[i])

In [None]:
g = glob('*.png')
shuf = np.random.permutation(g)
for i in range(100): copyfile(shuf[i], DATA_HOME_DIR+'/sample/train/icebergs/' + shuf[i])

In [None]:
%cd $DATA_HOME_DIR/valid/ships/

In [None]:
g = glob('*.png')
shuf = np.random.permutation(g)
for i in range(50): copyfile(shuf[i], DATA_HOME_DIR+'/sample/valid/ships/' + shuf[i])

In [None]:
%cd $DATA_HOME_DIR/valid/icebergs/

In [None]:
g = glob('*.png')
shuf = np.random.permutation(g)
for i in range(50): copyfile(shuf[i], DATA_HOME_DIR+'/sample/valid/icebergs/' + shuf[i])
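
The cells above draw a fresh random permutation every time they run, so the validation and sample sets differ between runs. A minimal sketch of a repeatable split follows, working on a list of filenames only. The helper name `split_filenames` and the seed value are assumptions for illustration, and no files are moved.

```python
import numpy as np

def split_filenames(filenames, n_valid, seed=42):
    """Return (train, valid) filename lists after a seeded shuffle."""
    rng = np.random.RandomState(seed)
    # Sort first so the result does not depend on directory listing order
    shuffled = rng.permutation(sorted(filenames))
    return list(shuffled[n_valid:]), list(shuffled[:n_valid])

names = ['img%03d.png' % i for i in range(10)]
train, valid = split_filenames(names, n_valid=3)
print(len(train), len(valid))
```

The returned `valid` list could then be passed to `os.rename` in the same way as `shuf[i]` is above.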

In [None]:
%rm $DATA_HOME_DIR/train.json
%rm $DATA_HOME_DIR/train.json.7z
%rm $DATA_HOME_DIR/test.json
%rm $DATA_HOME_DIR/test.json.7z
%rm $DATA_HOME_DIR/sample_submission.csv.7z

## Finetuning and Training

In [None]:
%cd $DATA_HOME_DIR

#Set path to the sample/ directory if desired
path = DATA_HOME_DIR + '/' #DATA_HOME_DIR + '/sample/'
test_path = DATA_HOME_DIR + '/test/' #We use all the test data
results_path = DATA_HOME_DIR + '/results/'
train_path = path + 'train/'
valid_path = path + 'valid/'

In [None]:
#import Vgg16 helper class
vgg = Vgg16()

In [None]:
#Set constants. You can experiment with no_of_epochs to improve the model

#Set batch_size as large as you can without running out of GPU memory.
batch_size=64
no_of_epochs=10

In [None]:
#Finetune the model
batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)
vgg.finetune(batches)

#Not sure if we set this for all fits
vgg.model.optimizer.lr = 0.01

In [None]:
#Notice we are passing in the validation dataset to the fit() method
#For each epoch we test our model against the validation set
latest_weights_filename = None
for epoch in range(no_of_epochs):
    print "Running epoch: %d" % epoch
    vgg.fit(batches, val_batches, nb_epoch=1)
    latest_weights_filename = 'ft%d.h5' % epoch
    vgg.model.save_weights(results_path+latest_weights_filename)
print "Completed %s fit operations" % no_of_epochs

## Generate Predictions

Let's use our new model to make predictions on the test dataset

In [None]:
batches, preds = vgg.test(test_path, batch_size = batch_size*2)

In [None]:
#For every image, vgg.test() generates two probabilities,
#ordered by how the iceberg/ship directories are indexed.
#Column 0 appears to be icebergs and column 1 ships
print preds[:5]

filenames = batches.filenames
print filenames[:5]
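
The column order can also be reasoned about without looking at images. Keras's `flow_from_directory` assigns class indices in alphanumeric order of the class subdirectory names. Assuming `vgg.get_batches` wraps that function, the mapping for the training directories can be sketched as below. The helper `class_index_map` is hypothetical and only mimics the ordering rule.

```python
def class_index_map(class_dirs):
    """Mimic alphanumeric ordering of class subdirectories."""
    return {name: idx for idx, name in enumerate(sorted(class_dirs))}

# The two class directories created under train/ earlier in this notebook
print(class_index_map(['ships', 'icebergs']))
```

On the real training batches, the `class_indices` attribute should report the same mapping.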

In [None]:
#You can verify the column ordering by viewing some images
from PIL import Image
Image.open(test_path + filenames[2])

In [None]:
#Save our test results arrays so we can use them again later
save_array(results_path + 'test_preds.dat', preds)
save_array(results_path + 'filenames.dat', filenames)

## Validate Predictions

Keras' *fit()* function conveniently shows us the value of the loss function, and the accuracy, after every epoch ("*epoch*" refers to one full run through all training examples). The most important metrics for us to look at are for the validation set, since we want to check for over-fitting. 

- **Tip**: with our first model we should try to overfit before we start worrying about how to reduce over-fitting - there's no point even thinking about regularization, data augmentation, etc if you're still under-fitting! (We'll be looking at these techniques shortly).

As well as looking at the overall metrics, it's also a good idea to look at examples of each of:
1. A few correct labels at random
2. A few incorrect labels at random
3. The most correct labels of each class (i.e. those with highest probability that are correct)
4. The most incorrect labels of each class (i.e. those with highest probability that are incorrect)
5. The most uncertain labels (i.e. those with probability closest to 0.5).

Let's see what we can learn from these examples. (In general, this is a particularly useful technique for debugging problems in the model. However, since this model is so simple, there may not be too much to learn at this stage.)
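
As a small illustration of the index arithmetic behind these categories, here is a sketch on synthetic values. The arrays `p_iceberg` and `expected` are made up for the example, and class 0 is assumed to mean iceberg, matching the alphabetical directory order.

```python
import numpy as np

# Synthetic stand-ins: P(iceberg) per image and the true class (0 = iceberg, 1 = ship)
p_iceberg = np.array([0.95, 0.10, 0.55, 0.48, 0.80, 0.02])
expected = np.array([0, 1, 1, 0, 1, 1])

# A high P(iceberg) rounds to class 0, a low one to class 1
predicted = np.round(1 - p_iceberg).astype(int)
correct = np.where(predicted == expected)[0]
incorrect = np.where(predicted != expected)[0]

# Most confident mistake: the wrong prediction furthest from 0.5
worst = incorrect[np.argmax(np.abs(p_iceberg[incorrect] - 0.5))]
# Most uncertain: probabilities closest to 0.5, right or wrong
most_uncertain = np.argsort(np.abs(p_iceberg - 0.5))[:2]

print(correct, incorrect, worst, most_uncertain)
```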

Calculate predictions on validation set, so we can find correct and incorrect examples:

In [None]:
vgg.model.load_weights(results_path+latest_weights_filename)

In [None]:
val_batches, probs = vgg.test(valid_path, batch_size = batch_size)

In [None]:
filenames = val_batches.filenames
expected_labels = val_batches.classes #0 or 1

#Round our predictions to 0/1 to generate labels
our_predictions = probs[:,0]
our_labels = np.round(1-our_predictions)

In [None]:
from keras.preprocessing import image

#Helper function to plot images by index in the validation set 
#Plots is a helper function in utils.py
def plots_idx(idx, titles=None):
    plots([image.load_img(valid_path + filenames[i]) for i in idx], titles=titles)
    
#Number of images to view for each visualization task
n_view = 4

In [None]:
#1. A few correct labels at random
correct = np.where(our_labels==expected_labels)[0]
print "Found %d correct labels" % len(correct)
idx = permutation(correct)[:n_view]
plots_idx(idx, our_predictions[idx])

In [None]:
#2. A few incorrect labels at random
incorrect = np.where(our_labels!=expected_labels)[0]
print "Found %d incorrect labels" % len(incorrect)
idx = permutation(incorrect)[:n_view]
plots_idx(idx, our_predictions[idx])

In [None]:
#3a. The images we were most confident were icebergs, and are actually icebergs
correct_icebergs = np.where((our_labels==0) & (our_labels==expected_labels))[0]
print "Found %d confident correct icebergs labels" % len(correct_icebergs)
most_correct_icebergs = np.argsort(our_predictions[correct_icebergs])[::-1][:n_view]
plots_idx(correct_icebergs[most_correct_icebergs], our_predictions[correct_icebergs][most_correct_icebergs])

In [None]:
#3b. The images we were most confident were ships, and are actually ships
correct_ships = np.where((our_labels==1) & (our_labels==expected_labels))[0]
print "Found %d confident correct ships labels" % len(correct_ships)
most_correct_ships = np.argsort(our_predictions[correct_ships])[:n_view]
plots_idx(correct_ships[most_correct_ships], our_predictions[correct_ships][most_correct_ships])

In [None]:
#4a. The images we were most confident were icebergs, but are actually ships
incorrect_icebergs = np.where((our_labels==0) & (our_labels!=expected_labels))[0]
print "Found %d incorrect icebergs" % len(incorrect_icebergs)
if len(incorrect_icebergs):
    most_incorrect_icebergs = np.argsort(our_predictions[incorrect_icebergs])[::-1][:n_view]
    plots_idx(incorrect_icebergs[most_incorrect_icebergs], our_predictions[incorrect_icebergs][most_incorrect_icebergs])

In [None]:
#4b. The images we were most confident were ships, but are actually icebergs
incorrect_ships = np.where((our_labels==1) & (our_labels!=expected_labels))[0]
print "Found %d incorrect ships" % len(incorrect_ships)
if len(incorrect_ships):
    most_incorrect_ships = np.argsort(our_predictions[incorrect_ships])[:n_view]
    plots_idx(incorrect_ships[most_incorrect_ships], our_predictions[incorrect_ships][most_incorrect_ships])

In [None]:
#5. The most uncertain labels (i.e. those with probability closest to 0.5).
most_uncertain = np.argsort(np.abs(our_predictions-0.5))
plots_idx(most_uncertain[:n_view], our_predictions[most_uncertain[:n_view]])

Perhaps the most common way to analyze the result of a classification model is to use a [confusion matrix](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/). Scikit-learn has a convenient function we can use for this purpose:

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(expected_labels, our_labels)

We can just print out the confusion matrix, or we can show a graphical view (which is mainly useful for dependent variables with a larger number of categories).

In [None]:
plot_confusion_matrix(cm, val_batches.class_indices)
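
To make the contents of the matrix concrete, here is a NumPy-only sketch of the counting it performs, using the same convention as scikit-learn (rows are true classes, columns are predicted classes). The helper `confusion_counts` and the label arrays are illustrative assumptions, not the notebook's data.

```python
import numpy as np

def confusion_counts(expected, predicted, n_classes=2):
    """Count (true class, predicted class) pairs into an n_classes x n_classes grid."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (expected, predicted), 1)   # unbuffered add handles repeated pairs
    return cm

expected = np.array([0, 0, 0, 1, 1, 1])
predicted = np.array([0, 0, 1, 1, 1, 0])
print(confusion_counts(expected, predicted))
```

The diagonal holds the correct predictions; everything off the diagonal is a misclassification.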

## Submit Predictions to Kaggle!

Here's the format Kaggle requires for new submissions:
```
id,is_iceberg
5941774d,0.5
4023181e,0.5
b20200e4,0.5
e7f018bb,0.5
```

Kaggle wants the imageId followed by the probability of the image being an iceberg. Kaggle uses a metric called [Log Loss](http://wiki.fast.ai/index.php/Log_Loss) to evaluate your submission.

In [None]:
#Load our test predictions from file
preds = load_array(results_path + 'test_preds.dat')
filenames = load_array(results_path + 'filenames.dat')

In [None]:
#Grab the iceberg prediction column
#Classes are indexed alphabetically, so icebergs are column 0 and ships column 1
isiceberg = preds[:,0]
print "Raw Predictions: " + str(isiceberg[:5])
print "Mid Predictions: " + str(isiceberg[(isiceberg < .6) & (isiceberg > .4)])
print "Edge Predictions: " + str(isiceberg[(isiceberg == 1) | (isiceberg == 0)])

[Log Loss](http://wiki.fast.ai/index.php/Log_Loss) doesn't support probability values of 0 or 1--they are undefined (and we have many). Fortunately, Kaggle helps us by offsetting our 0s and 1s by a very small value. So if we upload our submission now we will have lots of .99999999 and .000000001 values. This seems good, right?

Not so. There is an additional twist due to how log loss is calculated--log loss rewards predictions that are confident and correct (p=.9999,label=1), but it punishes predictions that are confident and wrong far more (p=.0001,label=1). See visualization below.

In [None]:
#Visualize Log Loss when True value = 1
#y-axis is log loss, x-axis is probability that label = 1
#As you can see, Log Loss increases rapidly as we approach 0
#but falls off only slowly as our predicted probability gets closer to 1
import matplotlib.pyplot as plt
import numpy as np

#For a single example whose true label is 1, log loss is -log(p)
x = np.arange(1, 10000) * .0001
y = -np.log(x)

plt.plot(x, y)
plt.axis([-.05, 1.1, -.8, 10])
plt.title("Log Loss when true label = 1")
plt.xlabel("predicted probability")
plt.ylabel("log loss")

plt.show()

In [None]:
#So to play it safe, we use a sneaky trick to pull in our edge predictions
#Clip every prediction to the range [0.21, 0.79]
isiceberg = isiceberg.clip(min=0.21, max=0.79)
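
The trade-off behind the clipping bounds can be checked with a few numbers. This sketch computes the log loss contribution of a single image as a function of the probability assigned to its true class. The helper `example_log_loss` is a name introduced here for illustration.

```python
import numpy as np

def example_log_loss(p_true_class):
    """Loss contribution of one example: -log(probability given to the true class)."""
    return -np.log(p_true_class)

# Unclipped confident predictions versus the clipped bounds used above
for p in (0.9999, 0.79, 0.21, 0.0001):
    print("p(true class) = %.4f -> loss = %.4f" % (p, example_log_loss(p)))
```

Clipping to [0.21, 0.79] caps the penalty for a confidently wrong image at about 1.56 instead of about 9.2, at the cost of a floor of about 0.24 on every image that would otherwise have scored near zero.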

In [None]:
#Extract imageIds from the filenames in our test/unknown directory 
filenames = batches.filenames
print filenames[:5]
ids = np.array([(f[8:f.find('.')]) for f in filenames])
print ids[:5]
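
The slice `f[8:...]` above relies on the directory prefix `unknown/` being exactly eight characters long. A sketch that does not depend on the prefix length is shown below. The helper name `image_id` is an assumption for illustration.

```python
import os

def image_id(filename):
    """Strip the directory and extension from a path such as 'unknown/5941774d.png'."""
    return os.path.splitext(os.path.basename(filename))[0]

print([image_id(f) for f in ['unknown/5941774d.png', 'unknown/4023181e.png']])
```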

Here we join the two columns into an array of [imageId, isIceberg]:

In [None]:
subm = np.stack([ids,isiceberg], axis=1)
subm[:5]

In [None]:
%cd $DATA_HOME_DIR
submission_file_name = 'submission1.csv'
np.savetxt(submission_file_name, subm, delimiter=',', header='id,is_iceberg',fmt='%s', comments='')
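
Before uploading, it can be worth checking that every row matches the `id,is_iceberg` layout shown earlier and that every probability lies in [0, 1]. The following is a minimal sketch of such a check. The function `submission_lines` and the six-decimal formatting are assumptions for illustration, not a Kaggle requirement.

```python
def submission_lines(ids, probs):
    """Build CSV lines in the id,is_iceberg layout, rejecting invalid probabilities."""
    lines = ['id,is_iceberg']
    for image_id, p in zip(ids, probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError('probability out of range for %s: %r' % (image_id, p))
        lines.append('%s,%.6f' % (image_id, p))
    return lines

print('\n'.join(submission_lines(['5941774d', '4023181e'], [0.79, 0.21])))
```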

In [None]:
from IPython.display import FileLink
%cd $ATTEMPT_HOME_DIR
FileLink('data/icebergs/'+submission_file_name)

You can download this file and submit on the Kaggle website or use the Kaggle command line tool's "submit" method.