# Code for a Support Vector Machine Entry with Scikit-Learn ~ LB 1.6 (Benchmark)

## Why do this?
Convolutional Neural Networks have been shown to be highly effective at solving vision problems. So why use an SVM?

 - Excellent way to learn and understand the problem
 - Better results than you'd expect
 - Great way to learn how to use the SVM library in SKLearn (that's why I did it)
 - Teaches some handy OpenCV tricks along the way, especially for feature extraction

Personally, I think CNNs (when well optimized) will perform better, but there's a lot to learn from this process, particularly about feature extraction. One thing I'm wondering is whether we could gain performance by feeding extracted features to our CNNs instead of the raw image. Something to experiment with and explore.

## Overview of the process
The process for using an SVM on this problem is a little different from using a ConvNet.

Most importantly, we will need to extract features from the imagery ourselves, rather than letting the CNN do it. So we get to pull back the curtain a bit, and see what's happening behind the scenes (if you'll forgive the metaphor).

**The Steps We'll Take**
 - Load Data
 - Extract Features
 - Feed Features into SVM Model
 - Fit the SVM
 - Predict with our SVM

From there, we can iterate and improve the model. Just as with ConvNets, if we send better data into the model, we will get better results. 

## Loading data
The first step in this process is loading data. Here is a script to get our images loaded into memory.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import svm
import cv2
import csv
import os
from os import listdir
from os.path import isfile, join

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
mydirvalues = os.listdir(os.getcwd())  # __file__ is undefined in a notebook, so list the working directory
print(mydirvalues)
onlyfiles = [f for f in listdir("../input/train/") if isfile(join("../input/train/", f))]
print(onlyfiles)

# sort so class numbers are reproducible across machines (listdir order is arbitrary)
dir_names = sorted(d for d in listdir("../input/train/") if not isfile(join("../input/train/", d)))
print(dir_names)

file_paths = {}
class_num = 0
for d in dir_names:
    class_dir = "../input/train/" + d + "/"
    fnames = [f for f in listdir(class_dir) if isfile(join(class_dir, f))]
    print(fnames)
    # key: (class name, class number, directory) -> value: file names in that class
    file_paths[(d, class_num, class_dir)] = fnames
    class_num += 1
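
Because `listdir` makes no promise about ordering, class numbers assigned in iteration order can differ from one machine to the next. Here is a minimal sketch of a loader that sorts the class folders before numbering them. `index_training_files` and the throwaway folder layout are hypothetical, used only for illustration.

```python
import os
import tempfile


def index_training_files(train_dir):
    """Return a list of (file_path, class_number, class_name) tuples.

    Class folders are sorted so the same folder always gets the same number.
    """
    class_names = sorted(
        d for d in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, d))
    )
    samples = []
    for class_number, class_name in enumerate(class_names):
        class_dir = os.path.join(train_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            fpath = os.path.join(class_dir, fname)
            if os.path.isfile(fpath):
                samples.append((fpath, class_number, class_name))
    return samples


# Demo on a throwaway directory tree (a stand-in for ../input/train/).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["LAG", "ALB", "BET"]:
        os.makedirs(os.path.join(tmp, name))
        open(os.path.join(tmp, name, "img_0.jpg"), "w").close()
    demo_samples = index_training_files(tmp)
    demo_labels = [(os.path.basename(p), num, name) for p, num, name in demo_samples]
    print(demo_labels)
```

A flat list of tuples like this is also a little easier to shuffle and split than a dictionary keyed by tuples.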

## Feature Extraction
This is probably the most important, and the most interesting part of the process--it's where we have the most options. 

We need to extract features from the image in some way.

### Here are some good options
 1. **SIFT (Scale Invariant Feature Transform)** - This is what I chose, but it doesn't work on the Kaggle servers, so you'll need to run this code on your own computer. It yields an Nx128 array of descriptors, one 128-dimensional vector per keypoint detected in the image. The catch with SIFT is that it was patented, so commercial applications technically had to pay royalties to its inventors (the patent expired in 2020).
 2. **ORB** - OpenCV's open source alternative to SIFT. I have code for using ORB in the section below as well, but it needs to be restructured from Nx128 to NxM, because ORB returns a different number of dimensions per descriptor (32 bytes by default in OpenCV). More on ORB: http://www.willowgarage.com/sites/default/files/orb_final.pdf
 3. **HoG (Histogram of Oriented Gradients)** - Another good option; it is often used to find both people and animals in photos. More info: https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients

The point here is: once features are extracted, we have essentially taken the "most important" or "distinctive" keypoints from each image and mapped them into a high-dimensional space via our feature extraction method.

We then feed this high-dimensional feature vector into the SVM, along with a label, and train the SVM on these data. The SVM learns the boundaries that separate the classes, so when we extract features from a test image using the same method, the hope is that its descriptor vector lands in a region of the feature space shared by vectors from images of the same class.

Whether you use an SVM for this problem or not, I do believe that some of these feature extraction techniques could be useful if fed into a neural network instead of just the raw image.
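
To make the idea of a fixed-length feature vector concrete, here is a minimal sketch that uses random arrays as stand-ins for real descriptors. `descriptors_to_vector` is a hypothetical helper, and the 100-keypoint sample size is an assumption chosen to mirror the extraction code in this notebook.

```python
import numpy as np


def descriptors_to_vector(descriptors, n_keypoints=100, rng=None):
    """Sample n_keypoints rows at random and flatten them into one row vector."""
    if rng is None:
        rng = np.random.default_rng()
    if len(descriptors) < n_keypoints:
        raise ValueError("not enough keypoints to sample from")
    rows = rng.choice(len(descriptors), size=n_keypoints, replace=False)
    sampled = descriptors[rows].astype(np.float64)
    return sampled.reshape(1, -1)


rng = np.random.default_rng(0)
sift_like = rng.random((350, 128))                               # SIFT: 128 floats per keypoint
orb_like = rng.integers(0, 256, size=(350, 32), dtype=np.uint8)  # ORB: 32 bytes per keypoint

sift_vector = descriptors_to_vector(sift_like, rng=rng)
orb_vector = descriptors_to_vector(orb_like, rng=rng)
print(sift_vector.shape)  # (1, 12800)
print(orb_vector.shape)   # (1, 3200)
```

Every image ends up as one row of the same width, which is what the SVM requires, whatever the number of keypoints the detector found.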

In [None]:
    # General steps:
    # Extract feature from each file as HOG or similar... or SIFT... or Similar...
    # map each to feature space... and train some kind of classifier on that. SVM is a good choice.
    # do the same for each feature in test set...
    training_data = np.array([])
    training_labels = np.array([])

    for key in file_paths:
        category = key[1]
        directory_path = key[2]
        file_list = file_paths[key]

        # shuffle this list, so we get random examples
        np.random.shuffle(file_list)
        
        # Stop early, while testing, so it doesn't take FOR-EV-ER (FOR-EV-ER)
        i = 0

        # read in the file and get its SIFT features
        for fname in file_list:
            fpath = directory_path + fname
            print(fpath)
            print("Category = " + str(category))
            # extract features!
            gray = cv2.imread(fpath,0)
            gray = cv2.resize(gray, (400, 250))  # resize so we're always comparing same-sized images
                                                 # Could also make images larger/smaller
                                                 # to tune for greater accuracy / more speed
                        
            """ My Choice: SIFT (Scale Invariant Feature Transform)"""
            # However, this does not work on the Kaggle server
            # because it's in a separate package in the opencv version used on the Kaggle server.
            # This is a very robust method however, worth trying when it's reasonable to do so. 
            # cv2.SIFT() is the OpenCV 2.4 constructor; newer builds use cv2.SIFT_create() (4.4+)
            # or cv2.xfeatures2d.SIFT_create() (3.x contrib).
            detector = cv2.SIFT()
            kp1, des1 = detector.detectAndCompute(gray, None)
            
            """ Another option that will work on Kaggle server is ORB"""
            # detect keypoints and compute descriptors with ORB
            # (ORB descriptors are Nx32 uint8, so the reshape below becomes (1, 3200))
            #orb = cv2.ORB_create()
            #kp1, des1 = orb.detectAndCompute(gray, None)
            
            """ Histogram of Oriented Gradients - often used for detecting people/animals in photos"""
            # Haven't tried this one in the SVM yet, but here's how to get the HoG using OpenCV
            # hog = cv2.HOGDescriptor()
            # img = cv2.imread(fpath)
            # h = hog.compute(img)

            # This is to make sure we have at least 100 keypoints to analyze
            # could also duplicate a few features if needed to hit a higher value
            if len(kp1) < 100:
                continue
                
            # transform the data to float and shuffle all keypoints
            # so we get a random sampling from each image
            des1 = des1.astype(np.float64)
            np.random.shuffle(des1)
            des1 = des1[0:100,:] # trim so every image contributes exactly 100 descriptors
            vector_data = des1.reshape(1,12800)  # 100 keypoints x 128 dims = 12800 features

            # We need to concatenate onto the full array of features extracted from each image
            if len(training_data) == 0:
                training_data = vector_data
            else:
                training_data   = np.concatenate((training_data, vector_data), axis=0)
                
            training_labels = np.append(training_labels,category)

            # early stop
            i += 1
            if i > 50:
                break
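
Calling `np.concatenate` inside the loop copies the whole training array on every iteration. A common alternative, sketched here with random stand-in vectors instead of real image features, is to collect rows in a Python list and stack them once at the end. The sizes and labels are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_features = 6, 12800

rows = []
labels = []
for image_index in range(n_images):
    vector_data = rng.random((1, n_features))  # stand-in for one image's flattened descriptors
    rows.append(vector_data)
    labels.append(image_index % 2)             # stand-in class label

training_data = np.vstack(rows)                # shape: (n_images, n_features)
training_labels = np.asarray(labels)
print(training_data.shape, training_labels.shape)
```

With a few thousand training images the difference in run time becomes noticeable.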

## Fit the SVM
Now comes the training step. It may take a few minutes to fit the SVM itself.

In [None]:
    # Alright! Now we've got features extracted and labels
    X = training_data
    y = training_labels
    y = y.reshape(y.shape[0],)

    # Create and fit the SVM
    # Fitting should take a few minutes
    clf = svm.SVC(kernel='linear', C = 1.0, probability=True)
    clf.fit(X,y)
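
Before running on real images, the fit/predict cycle can be tried on synthetic data as a sanity check. This sketch assumes three made-up classes drawn from shifted Gaussians. It also adds a `StandardScaler` step, which is not part of the pipeline above but is a common companion to SVMs. `classes_` gives the column order of `predict_proba`.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class, n_features = 30, 20

# Three synthetic classes: Gaussian blobs centered at 0, 3 and 6.
X_demo = np.vstack([rng.normal(loc=3.0 * c, scale=1.0, size=(n_per_class, n_features))
                    for c in range(3)])
y_demo = np.repeat(np.arange(3), n_per_class)

model = make_pipeline(StandardScaler(),
                      SVC(kernel="linear", C=1.0, probability=True, random_state=0))
model.fit(X_demo, y_demo)

probabilities = model.predict_proba(X_demo[:5])
print(model.classes_)       # column order of predict_proba
print(probabilities.shape)  # (5, 3)
```

Each row of `probabilities` sums to 1, with one column per class in the order reported by `classes_`.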

## Make a Prediction
Make a prediction on an example fish, just to test it out. The answer should be LAG (class #3) if the model gets it right. I picked LAG because it is very distinctive to the human eye.

In [None]:
    # Now, extract one of the images and predict it
    gray = cv2.imread('../input/test_stg1/img_00071.jpg', 0)  # Correct is LAG --> Class 3
    gray = cv2.resize(gray, (400, 250))  # same size as the training images
    kp1, des1 = detector.detectAndCompute(gray, None)

    des1 = des1[0:100, :].astype(np.float64)   # trim so all are same size (assumes >= 100 keypoints)
    vector_data = des1.reshape(1, 12800)

    print("Linear SVM Prediction:")
    print(clf.predict(vector_data))        # prints the predicted class only
    print(clf.predict_proba(vector_data))  # shows the probabilities for every class;
                                           #    need this for the competition

## Save the SVM for Later Use
This code can be used to save (and load) your SVM for later, to avoid re-doing expensive computations. I've commented it out because we can't save onto the Kaggle server, but I would encourage using it on your own computer. It's a timesaver.

In [None]:
    # save SVM model (needs: import joblib)
    # joblib.dump(clf, 'filename.pkl')
    # to load SVM model, use:  clf = joblib.load('filename.pkl')
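
For reference, here is a minimal round-trip sketch of saving and reloading a fitted model with `joblib`. It uses a tiny synthetic two-class problem and a temporary directory; the file name `svm_model.pkl` and the `demo_clf` variable are placeholders, not names from this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_small = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
                     rng.normal(4.0, 1.0, size=(20, 5))])
y_small = np.repeat([0, 1], 20)
demo_clf = SVC(kernel="linear", C=1.0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "svm_model.pkl")  # placeholder file name
    joblib.dump(demo_clf, model_path)
    restored = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
same_predictions = bool((restored.predict(X_small) == demo_clf.predict(X_small)).all())
print(same_predictions)
```

Pickled models are generally only safe to reload with the same scikit-learn version that wrote them.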

## Predict the whole Data Set
Make a prediction for all data in the prediction set.

In [None]:
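    # early stoppage...
    # only do 10
    i = 0
    prediction_output_list = []  # one row per image: [file name, then one probability per class]
    test_dir = "test_stg1/"
    # list the test images (fnames previously held the last training folder's files)
    fnames = [f for f in listdir(test_dir) if isfile(join(test_dir, f))]
    for f in fnames:
        file_name = test_dir + f
        print("---Evaluating File at: " + file_name)
        gray = cv2.imread(file_name, 0)
        gray = cv2.resize(gray, (400, 250))  # resize so we're always comparing same-sized images
        kp1, des1 = detector.detectAndCompute(gray, None)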

        # ensure we have at least 100 keypoints to analyze
        if len(kp1) < 100:
            # and duplicate some points if necessary
            current_len = len(kp1)
            vectors_needed = 100 - current_len
            repeated_vectors = des1[0:vectors_needed, :]
            # concatenate repeats onto des1
            while len(des1) < 100:
                des1 = np.concatenate((des1, repeated_vectors), axis=0)
            # duplicate data just so we can run the model.
            des1[current_len:100, :] = des1[0:vectors_needed, :]

        np.random.shuffle(des1)  # shuffle the vector so we get a representative sample
        des1 = des1[0:100, :]   # trim vector so all are same size
        vector_data = des1.reshape(1, 12800)
        print("Linear SVM Prediction:")
        print(clf.predict(vector_data))
        svm_prediction = clf.predict_proba(vector_data)
        print(svm_prediction)
        
        # format list for csv output
        csv_output_list = []
        csv_output_list.append(f)
        for elem in svm_prediction:      
            for value in elem:
                csv_output_list.append(value)

        # append filename to make sure we have right format to write to csv
        print("CSV Output List Formatted:")
        print(csv_output_list)

        # and append this file to the output_list (of lists)
        prediction_output_list.append(csv_output_list)

        # early stop (comment out these lines to process every test image)
        if i > 10:
            break
        i += 1
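
The padding step for images with fewer than 100 keypoints can also be written with modular indexing. This is a small sketch on toy-sized arrays. `pad_or_trim_descriptors` is a hypothetical helper, and it only pads or trims; any shuffling would still happen separately.

```python
import numpy as np


def pad_or_trim_descriptors(descriptors, n_keypoints=100):
    """Return exactly n_keypoints rows, cycling through the rows if there are too few."""
    if descriptors is None or len(descriptors) == 0:
        raise ValueError("no descriptors found for this image")
    row_index = np.arange(n_keypoints) % len(descriptors)
    return descriptors[row_index]


few = np.arange(3 * 4, dtype=np.float64).reshape(3, 4)  # 3 keypoints, 4 dims (toy sizes)
padded = pad_or_trim_descriptors(few, n_keypoints=7)
print(padded.shape)   # (7, 4)
print(padded[:, 0])   # [0. 4. 8. 0. 4. 8. 0.]
```

The explicit check for an empty result matters because a detector can return no descriptors at all for a featureless image.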

## Format CSV for Output
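
The submission needs one header row followed by one row per image. Here is a minimal Python 3 sketch of that layout, written to an in-memory buffer so it runs anywhere. The two rows and their probabilities are made up for illustration.

```python
import csv
import io

headers = ['image', 'ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT']

# Made-up rows in the same shape as prediction_output_list: file name, then 8 probabilities.
example_rows = [
    ["img_00001.jpg", 0.5, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05],
    ["img_00002.jpg", 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
]

# io.StringIO stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(example_rows)

csv_text = buffer.getvalue()
print(csv_text)
```

In Python 3 the csv module expects a text-mode file opened with `newline=""`; the binary `"wb"` mode is a Python 2 idiom.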

In [None]:
    # Write to csv
    print(prediction_output_list[0:5])
    """  Uncomment to write to your CSV. Can't do this on Kaggle server directly.
    # Python 3: open in text mode with newline="" (not "wb"); the with block closes the file.
    # Column order must match clf.classes_ (the class numbers assigned in the loading step).
    with open("sift_and_svm_submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        headers = ['image', 'ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT']
        writer.writerow(headers)
        writer.writerows(prediction_output_list)
    """