This notebook presents several approaches to classifying the data in the Leaf dataset. A brief EDA is also included.

The data provided by Kaggle consists of two types: binary images of the leaf samples, and vectors of pre-extracted features for these same samples.

In particular, the training set consists of N = 990 samples. For each sample, we have a binary image of the leaf, of variable size, and a d-dimensional feature vector, where d = 192.

# 1. Feature vectors
We will first work with the feature vectors of the leaf samples.

## 1.1 Prepare workspace
Let us begin by loading some basic libraries and some of the data that we will be using throughout this lab.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plot

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
# Load feature vectors
df = pd.read_csv("../input/train.csv")
# Display some of the feature vectors
df

## 1.2 Exploratory Data Analysis
First of all, let us have a brief look at the feature vectors. To this end, we will use two simple approaches: PCA and t-SNE. Note that there are 99 classes, which makes the visualisation somewhat cluttered.

### 1.2.1 PCA
First, we will use Principal Component Analysis (PCA) to get an idea of how our data is distributed. PCA finds the orthogonal directions along which the data has maximum variance. To visualise the data easily, we keep the first 2 of these directions. In other words, we perform dimensionality reduction: we go from the original 192 dimensions to just 2.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # Only the first two components are used below

# Obtain list of features
features = list(df)
del(features[:2]) # Delete id and species from features

# Apply PCA to our data
pca_result = pca.fit_transform(df[features].values)

# Define dataframe containing low-dimension representation of original data
pca_df = pd.DataFrame()
pca_df['species'] = df['species']
pca_df['pca-one'] = pca_result[:,0]
pca_df['pca-two'] = pca_result[:,1] 

# How much variance does this new representation retain from the original data?
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

From the previous results, we observe that the first two principal components retain roughly 25% of the total variance of the data. This is a modest share, so the 2-D plot below should be read as a rough overview rather than a faithful picture of the data.
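To make the notion of retained variance concrete, the following sketch computes explained-variance ratios by hand with NumPy's SVD. The synthetic matrix `X` and the helper name `explained_variance_ratio` are assumptions made for illustration only; they are not part of the notebook's pipeline. scikit-learn's `PCA` reports the same quantity in `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Fraction of the total variance captured by each leading principal component."""
    Xc = X - X.mean(axis=0)                  # PCA operates on centred data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, in decreasing order
    var = s ** 2                             # proportional to the variance along each component
    return var[:n_components] / var.sum()

# Synthetic stand-in for the feature matrix (assumption: 200 samples, 5 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

ratios = explained_variance_ratio(X, n_components=2)
print(ratios, ratios.sum())
```

The sum of the returned ratios is the share of variance that survives the projection, which is the figure quoted above for the leaf features.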

In [None]:
from ggplot import *
chart = ggplot( pca_df, aes(x='pca-one', y='pca-two', color='species') ) \
        + geom_point(size=75,alpha=0.8) \
        + ggtitle("First and Second Principal Components colored by species")
chart

### 1.2.2 t-SNE
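After working with PCA, let us dive into t-SNE. The key difference to PCA is that t-SNE is a non-linear, probabilistic method: it models the similarities between pairs of points as probability distributions and looks for a low-dimensional embedding that preserves them, whereas PCA is a linear projection.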
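To give a feel for the probabilistic side of t-SNE, here is a minimal sketch of the Gaussian conditional affinities it builds in the original space. It is a simplification under stated assumptions: a single fixed bandwidth `sigma` is used for all points, whereas t-SNE tunes one bandwidth per point to match the chosen perplexity. The function name `conditional_affinities` and the toy points are illustrative.

```python
import numpy as np

def conditional_affinities(X, sigma=1.0):
    """Row i holds p(j | i): how likely point i would pick point j as its neighbour."""
    # Pairwise squared Euclidean distances
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Toy data (assumption): two nearby points and one far away
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
P = conditional_affinities(X)
print(P.round(3))
```

Nearby points receive almost all of the probability mass, which is why t-SNE tends to preserve local neighbourhoods rather than global distances.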

In [None]:
import time
from sklearn.manifold import TSNE

# Start t-SNE algorithm
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=400)
tsne_results = tsne.fit_transform(df[features].values)

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
# Create dataframe for the new representation of the data
df_tsne = pd.DataFrame()
df_tsne['species'] = df['species']
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]

# Create chart and display it
chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='species') ) \
        + geom_point(size=70,alpha=0.1) \
        + ggtitle("tSNE dimensions colored by species")
chart

A common approach prior to performing t-SNE is to obtain a lower-dimensional representation of the data using PCA (say with 50 principal components) and then apply t-SNE on top of it. To this end, let us apply PCA to our original data.

In [None]:
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(df[features].values)

print('Cumulative explained variation for 50 principal components (PCA): {}'.format(np.sum(pca_50.explained_variance_ratio_)))

We observe that more than 95% of the variance is retained in 50 dimensions! Let's apply t-SNE!

In [None]:
time_start = time.time()

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_pca_results = tsne.fit_transform(pca_result_50)

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

# Create dataframe for the new representation of the data
df_tsne = pd.DataFrame()
df_tsne['species'] = df['species']
df_tsne['x-tsne'] = tsne_pca_results[:,0] # Use the embedding computed on the PCA output
df_tsne['y-tsne'] = tsne_pca_results[:,1]

# Create chart and display it
chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='species') ) \
        + geom_point(size=70,alpha=0.1) \
        + ggtitle("tSNE dimensions colored by species")
chart

## 1.3 Feed Forward Neural Network
Now, let us proceed by applying a fancy Neural Network to this dataset. A Feed Forward Neural Network is built from _layers_. Each layer consists of a set of units (often referred to as _neurons_). A particularity of these networks is that units within the same layer are not connected to each other; information flows only from one layer to the next. The following figure illustrates the architecture of these networks.

![](http://cse22-iiith.vlabs.ac.in/exp4/images/structure.png)

The input layer is where we introduce the training data. In this first attempt, we will use the feature vectors. Thus, the input layer has 192 units. In classification problems, the _output layer_ has as many units as there are classes (in our case, 99). In between, we have the _hidden layers_ (the number of hidden layers may vary from task to task).

In general, each layer computes linear combinations of the output vector of the previous layer and then applies a non-linearity (also known as an activation function). In our case, we will use the ReLU activation function, which has been shown in the literature to mitigate the vanishing gradient problem and often converges faster than other standard activation functions (e.g. sigmoid, tanh).
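As a small worked example of the computation performed by one layer, the sketch below applies a linear combination followed by ReLU using plain NumPy. The weights, bias and input are made-up numbers chosen so the arithmetic is easy to follow by hand; the helper names `relu` and `dense_relu` are illustrative.

```python
import numpy as np

def relu(z):
    """ReLU keeps positive values and sets negative ones to zero."""
    return np.maximum(z, 0.0)

def dense_relu(x, W, b):
    """One fully connected layer: linear combination followed by ReLU."""
    return relu(x @ W + b)

# Toy layer (assumption): 3 input features, 2 units
x = np.array([[1.0, -2.0, 0.5]])   # one sample
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [2.0,  0.0]])
b = np.array([0.0, 1.0])

h = dense_relu(x, W, b)
print(h)  # pre-activations are [1.0, -1.0], so ReLU gives [[1. 0.]]
```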

In the following, we initialise the model parameters and define the network architecture.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf

# Parameters
learning_rate = 0.0001
training_epochs = 200
batch_size = 10
display_step = 10

# Network Parameters
n_features = 192
n_classes = 99
neurons_layers = [n_features, 256, 512, 256, n_classes] # Number of features in input, hidden and output layers

# Load Train data
train_data = {}
df = pd.read_csv("../input/train.csv") # Load training set
tmp = df.values # as_matrix() was removed in pandas 1.0; .values works in old and new versions
train_data['samples'] = tmp[:, 2:] # Obtain all the feature vectors (without ids and species)
train_data['labels'] = tmp[:, 1] # Obtain labels (i.e. species)
N_train = len(train_data['samples']) # Number of training samples

# Obtain list of features
features = list(df)
del(features[:2]) # Delete id and species from features

# Load Test data
test_data = {}
df = pd.read_csv("../input/test.csv")
tmp = df.values # as_matrix() was removed in pandas 1.0
test_data['samples'] = tmp[:, 1:]
test_ids = df.pop('id')

The labels of the samples come as strings. However, we would prefer numerical values. Moreover, we would like to use **one-hot encoding**.

In [None]:
# One hot encoding map
enc = np.eye(n_classes)
sparse2dense = {i: enc[i] for i in range(n_classes)}

# Map class names to one-hot representation
class_names = np.unique(train_data['labels'])
class_encodings = {}
for i in range(n_classes):
    class_encodings[class_names[i]] = sparse2dense[i]
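For illustration, the same mapping can be sketched on a handful of labels, together with the inverse step (going from a one-hot or probability vector back to a class name via `argmax`). The four labels below are a hypothetical sample, not the actual training labels.

```python
import numpy as np

# Hypothetical sample of string labels
labels = np.array(["Acer_Opalus", "Quercus_Rubra", "Acer_Opalus", "Tilia_Tomentosa"])

# Sorted unique class names and, for every sample, the index of its class
class_names, class_idx = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of class i
one_hot = np.eye(len(class_names))[class_idx]
print(one_hot)

# Decoding: the position of the largest entry gives the class index
decoded = class_names[one_hot.argmax(axis=1)]
print(decoded)
```

The decoding step is the same one used later to turn the network's output into a predicted species.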

To evaluate how well our model performs, we would like to have a validation set to run it on. Assessing the model only on the training set hides overfitting. A first approach is to hold out a fraction of the training set for validation and train on the remainder. However, our training data is scarce (on average 10 samples per class), so holding out part of it is costly.

Another option is **k-fold Cross Validation**. This technique splits the training set into _k_ subsets (folds); one fold is used as the validation set and the remaining _k_ - 1 are merged and used to train the model. The procedure is repeated _k_ times, each time with a different fold held out. The following picture illustrates this idea.

![](https://static.oschina.net/uploads/img/201609/26155106_OfXx.png)

For simplicity, the cell below uses the first approach: a single random split with 75% of the samples for training and 25% for validation.

In [None]:
# Create Validation and Training sets
tmp = pd.read_csv("../input/train.csv").values # as_matrix() was removed in pandas 1.0

idx = np.random.choice(N_train, N_train, replace=False)
N_train = int(0.75*N_train)
train_data['samples'] = tmp[idx[:N_train], 2:]
train_data['labels'] = np.array([class_encodings[t] for t in tmp[idx[:N_train], 1]])

val_data = {}
val_data['samples'] = tmp[idx[N_train:], 2:]
val_data['labels'] = np.array([class_encodings[t] for t in tmp[idx[N_train:], 1]])
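The cell above performs a single random split. For comparison, the k-fold procedure described earlier can be sketched as follows. This is a minimal NumPy version under stated assumptions: the helper name `k_fold_indices`, the choice `k=5` and the seed are illustrative, and the folds are not stratified by species (scikit-learn's `StratifiedKFold` would be one way to keep the classes balanced across folds).

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)  # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 990 is the size of the training set; k = 5 is an arbitrary choice
splits = list(k_fold_indices(n_samples=990, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(fold, len(train_idx), len(val_idx))
```

A model would be trained from scratch on each `train_idx` and scored on the matching `val_idx`; averaging the k scores gives a less noisy estimate than a single split, at k times the training cost.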

We now define our first TensorFlow objects. In particular, we define the placeholders for the input and output vectors, with their corresponding sizes.

Furthermore, we also initialise the model parameters, i.e. the weight matrices and bias vectors.

In [None]:
# tf Graph input
x = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.float32, [None, n_classes])

# Model parameters (weights, biases)
weights = [
    tf.Variable(tf.random_normal([neurons_layers[k], neurons_layers[k + 1]])) for k in range(len(neurons_layers)-1)
]

biases = [
    tf.Variable(tf.random_normal([neurons_layers[k+1]])) for k in range(len(neurons_layers)-1)
]

Now, we define a function that builds a fully connected feed forward NN with ReLU activations from the given model parameters.

In [None]:
# Function implementing a fully connected feed forward NN, with relu activations
def multilayer_perceptron(x, weights, biases):
    
    # Compute number of hidden layers
    n_hidden = len(weights)-1
    
    # Input layer to hidden layer
    out = tf.add(tf.matmul(x, weights[0]), biases[0])
    
    # Check that there are hidden layers
    if n_hidden > 0:
    
        # Iterate over all hidden layers
        for k in range(1, n_hidden+1):
            out = tf.nn.relu(out) # Apply activation fct on previous layer output
            out = tf.add(tf.matmul(out, weights[k]), biases[k]) # linear combination
               
    return out

In [None]:
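# Sanity check: range(1, 1) is empty, so the hidden-layer loop above is skipped when n_hidden == 0
for k in range(1, 1):
    print(k)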

The last step prior to training is to define the **forward** and **backward pass** of the network. For the forward pass we will use the function `multilayer_perceptron`, and for the backward pass we will use the softmax cross-entropy loss and the Adam optimiser.

In [None]:
# Construct model
y_pred = multilayer_perceptron(x, weights, biases)
p_pred = tf.nn.softmax(y_pred)

# Define cross entropy loss
cost = tf.nn.softmax_cross_entropy_with_logits(logits=y_pred, labels=y)
cost = tf.reduce_mean(cost)*batch_size

# Define optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# Evaluation of the model on a set
correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
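To make the loss concrete, the sketch below evaluates the softmax cross-entropy by hand in NumPy on two toy samples with three classes. It is meant only to illustrate the quantity that `tf.nn.softmax_cross_entropy_with_logits` computes per sample; the helper name and the numbers are assumptions chosen for readability.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy between one-hot labels and softmax(logits)."""
    # Subtracting the row maximum leaves the softmax unchanged and avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0, 0.0]])   # no preference at all
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

loss = softmax_cross_entropy(logits, labels)
print(loss)
```

The second sample has uniform predictions over three classes, so its loss equals log(3); the first sample is confidently correct and is penalised less.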

In [None]:
# Arrays containing cost for each epoch
cost_train = np.zeros([training_epochs,1])
cost_val = np.zeros([training_epochs,1])

# Initializing the variables
init = tf.global_variables_initializer()


# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(N_train/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            idx = range(i*batch_size, (i+1)*batch_size)
            # Train samples
            batch_x = train_data['samples'][idx]
            # Train labels
            #print(idx)
            batch_y = train_data['labels'][idx]

            ## Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_x,
                                                          y: batch_y})
            # Compute average loss
            avg_cost += c / total_batch
            
        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "cost=", \
                "{:.9f}".format(avg_cost))
            print("Accuracy:", accuracy.eval({x: val_data['samples'], y: val_data['labels']}))
        cost_train[epoch] = avg_cost
        cost_val[epoch] = cost.eval({x: val_data['samples'], y: val_data['labels']})
        
    print("Optimization Finished!")

    
    
    p = sess.run([p_pred], {x: test_data['samples']})
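Note that the loop above visits the batches in the same order at every epoch and, because the number of batches is `int(N_train/batch_size)`, skips any samples left over after the last full batch. One possible refinement, sketched here in NumPy as a suggestion rather than as part of the original training code, is to draw a fresh permutation each epoch and include the final partial batch. The helper name `shuffled_batches` is illustrative.

```python
import numpy as np

def shuffled_batches(n_samples, batch_size, rng):
    """Yield index arrays covering a fresh random permutation, including a final partial batch."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)

# 742 = int(0.75 * 990) training samples, batch size 10 as in the parameters above
batches = list(shuffled_batches(n_samples=742, batch_size=10, rng=rng))
print(len(batches), len(batches[-1]))  # 75 batches, the last one holding 2 samples
```

Each yielded index array could replace `idx` in the inner loop, with the generator called once per epoch.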

In [None]:
val_data['samples'].shape

After training, it is good practice to visualise how the cost evolved as the number of epochs increased.

In [None]:
import matplotlib.pyplot as plt

loss_train_curve, = plt.plot(cost_train, label='Training Set')
loss_val_curve, = plt.plot(cost_val, label='Validation Set')
plt.legend(handles=[loss_train_curve, loss_val_curve])

In [None]:
# prepare csv for submission
submission = pd.DataFrame(p[0], index=test_ids, columns=class_names)
submission.to_csv('submission.csv')

# 2. ConvNets

This part is still to be done.

In [None]:
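from PIL import Image
import glob

# Load every leaf image (stored as .jpg files) into a list
image_list = []
for filename in sorted(glob.glob('../input/images/*.jpg')):
    with Image.open(filename) as im:
        image_list.append(im.copy()) # Copy so the pixel data survives closing the file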