# HMIN339: Advanced Methods in Data Science


## **`Visual Plant Recognition`**

### Project Objective:
Recognition of plant species from photos

### Starting Dataset:
3474 images belonging to 50 different species

### Supervisors:
* **`Konstantin TODOROV`**
* **`Pascal PONCELET`**

### Authors:
* **`BEYA NTUMBA Joel`**
* **`MINKO AMOA Dareine`**
* **`QUENETTE Christophe`**
* **`SHAQURA Tasnim`**

# Linear Classification of the Dataset

In [1]:
import warnings
import numpy as np
from keras.preprocessing import image
import cv2 as cv
from pathlib import Path
from sklearn.model_selection import GridSearchCV, train_test_split
from skimage.io import imread
import pandas as pd

warnings.filterwarnings('ignore')

Using TensorFlow backend.


## 1737 images for Class 0, 1737 for Class 1.

In [2]:
y0 = np.zeros(1737)
y1 = np.ones(1737)

## concatenate y0 and y1 to form y

## Import the preprocessed training images

In [3]:
X = np.zeros((3474, 150, 150, 3))
X.shape

(3474, 150, 150, 3)

In [4]:
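# Restore the preprocessed images saved earlier with IPython's %store magic
%store -r preprocessed_images
X, y = preprocessed_images
X = X.astype('float32')
# Overwrite the stored labels with the binary labels built above (y0 followed by y1)
y = np.concatenate((y0, y1), axis=0)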

In [5]:
validation_size = 0.6  # passed as train_size in both splits below
seed = 42

# First split: 60% for training and 20% held out; the remaining 20% is left unused
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size=validation_size,
                                                    random_state=seed,
                                                    test_size=0.20)

# Second split of the held-out part: 60% for validation and 30% for test
X_val, X_test, y_val, y_test = train_test_split(X_test,
                                                y_test,
                                                train_size=validation_size,
                                                random_state=seed,
                                                test_size=0.3)

print("X_train: " + str(X_train.shape))
print("X_test: " + str(X_test.shape))
print("X_val: " + str(X_val.shape))
print("y_train: " + str(y_train.shape))
print("y_test: " + str(y_test.shape))
print("y_val: " + str(y_val.shape))

X_train: (2084, 150, 150, 3)
X_test: (209, 150, 150, 3)
X_val: (417, 150, 150, 3)
y_train: (2084,)
y_test: (209,)
y_val: (417,)
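
The printed shapes add up to 2084 + 417 + 209 = 2710 of the 3474 images, because the train and test fractions passed to each call sum to less than 1. As a minimal sketch of an alternative, assuming one wanted every image to land in exactly one of the three sets, the indices could be shuffled once and cut into three slices. The helper name `three_way_split` and the 60/20/20 fractions are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def three_way_split(n_samples, train_frac=0.6, val_frac=0.2, seed=42):
    """Hypothetical helper: shuffle indices once, then cut them into three slices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]  # everything left over goes to the test set
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(3474)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index arrays could then be used to slice `X` and `y`, for example `X[train_idx]` and `y[train_idx]`.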


#### Forming X_test, X_train, y_train, y_test ####

In [6]:
num_training = X_train.shape[0]
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]
num_test = X_test.shape[0]
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
num_val = X_val.shape[0]
mask = list(range(num_val))
X_val = X_val[mask]
y_val = y_val[mask]

### Reshape the image data into rows

In [7]:
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
X_val = np.reshape(X_val, (X_val.shape[0], -1))
print(X_train.shape, X_test.shape, X_val.shape)
print(y_train.shape, y_test.shape, y_val.shape)

(2084, 67500) (209, 67500) (417, 67500)
(2084,) (209,) (417,)


### Getting data to zero mean, i.e. centred around zero

In [8]:
mean_image = np.mean(X_train, axis=0)
X_train -= mean_image
X_test -= mean_image
X_val -= mean_image

### Append the bias dimension of ones (i.e. the bias trick) so that our SVM only has to optimize a single weight matrix W.

In [9]:
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
print(X_train.shape, X_test.shape, X_val.shape)
print("Data ready")

(2084, 67501) (209, 67501) (417, 67501)
Data ready


## **Explanation:**
X_train, X_val and X_test are centred around zero by subtracting the per-pixel mean image. Zero-centring is a very common pre-processing step: it keeps the features spread around zero, which helps keep the gradients well behaved during optimization. The mean must be computed on the training data only and then applied unchanged to the validation and test sets, as done above. For the bias trick, a column of ones is appended to each data matrix, so the bias terms become the last row of the weight matrix W and only a single matrix has to be optimized.
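
The bias trick can be checked numerically on a small example. The sketch below uses made-up random arrays (`X_demo`, `W_demo` and `b_demo` are illustrative names, not variables from the notebook) and verifies that appending a column of ones to the data and a row of biases to the weights gives the same scores as the explicit form with a separate bias.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 3))   # 4 samples, 3 features
W_demo = rng.normal(size=(3, 2))   # 3 features, 2 classes
b_demo = rng.normal(size=(2,))     # one bias per class

# Explicit form: scores = X W + b
scores_explicit = X_demo.dot(W_demo) + b_demo

# Bias trick: extra column of ones in X, bias as an extra (last) row of W
X_aug = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
W_aug = np.vstack([W_demo, b_demo])
scores_trick = X_aug.dot(W_aug)

print(np.allclose(scores_explicit, scores_trick))
```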

## Defining Loss Function

In [10]:
def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores = X.dot(W)
    y = [int(x) for x in y]
    correct_class_scores = scores[np.arange(num_train), y].reshape(num_train, 1)
    margin = np.maximum(0, scores - correct_class_scores + 1)
    margin[np.arange(num_train), y] = 0  # do not consider correct class in loss
    loss = margin.sum() / num_train
    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    margin[margin > 0] = 1
    valid_margin = margin.sum(axis=1)
    margin[np.arange(num_train), y] -= valid_margin
    dW = (X.T).dot(margin) / num_train
    # Regularization gradient
    dW = dW + reg * 2 * W
    return loss, dW

## **Explanation:**
The svm_loss_vectorized function takes the weight matrix W (which includes the bias through the bias trick), the input matrix X, the label vector y and the regularization strength reg. The scores are computed as X.dot(W); the separate bias term of XW + b is not needed because the bias is already part of W. The data loss is the multiclass hinge loss: for each sample, every incorrect class j contributes max(0, s_j - s_y + 1), where s_y is the score of the correct class, and these contributions are averaged over the batch. An L2 regularization term reg * sum(W * W) is added to encourage the weights to stay small. The function also returns dW, the gradient of the loss with respect to W.
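
To make the loss concrete, here is a small worked example with an explicit double loop. It is a sketch for illustration only: `hinge_loss_naive` and the tiny arrays are assumptions introduced here, chosen so that the scores equal the rows of `W_demo` and the result can be checked by hand.

```python
import numpy as np

def hinge_loss_naive(W, X, y, reg):
    """Loop version of the multiclass hinge loss with L2 regularization."""
    num_train = X.shape[0]
    num_classes = W.shape[1]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        for j in range(num_classes):
            if j == y[i]:
                continue  # the correct class does not contribute
            loss += max(0.0, scores[j] - scores[y[i]] + 1.0)
    return loss / num_train + reg * np.sum(W * W)

X_demo = np.eye(2)                           # scores are simply the rows of W_demo
W_demo = np.array([[3.0, 1.0], [0.5, 1.0]])
y_demo = np.array([0, 1])

# Sample 0: max(0, 1.0 - 3.0 + 1) = 0.0; sample 1: max(0, 0.5 - 1.0 + 1) = 0.5
data_loss = hinge_loss_naive(W_demo, X_demo, y_demo, reg=0.0)
# The regularization term adds 0.1 * (9 + 1 + 0.25 + 1)
total_loss = hinge_loss_naive(W_demo, X_demo, y_demo, reg=0.1)
print(data_loss, total_loss)
```

The first sample already satisfies the margin of 1 and adds nothing; only the second sample is penalized.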

## Defining Linear Classifier

In [11]:
from __future__ import print_function
from builtins import object

class LinearClassifier(object):

    def __init__(self):
        self.W = None

    def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
              batch_size=200, verbose=False):

        num_train, dim = X.shape
        num_classes = float(np.max(y)) + 1.0 # assume y takes values 0...K-1 where K is number of classes
        if self.W is None:
            self.W = 0.001 * np.random.randn(dim, int(num_classes))
        
        # Run stochastic gradient descent to optimize W
        loss_history = []
        for it in range(num_iters):
            X_batch = None
            y_batch = None

            batch_indices = np.random.choice(num_train, batch_size, replace=False)
            X_batch = X[batch_indices]
            y_batch=y[batch_indices]

            # evaluate loss and gradient
            loss, grad = self.loss(X_batch, y_batch, reg)
            loss_history.append(loss)

            # Update the weights using the gradient and the learning rate.

            self.W -= learning_rate*grad

            if verbose and it % 100 == 0:
                print('iteration %d / %d: loss %f' % (it, num_iters, loss))

        return loss_history

    def predict(self, X):
        """
        Use the trained weights of this linear classifier to predict labels for
        data points.

        Inputs:
        - X: A numpy array of shape (N, D) containing training data; there are N
          training samples each of dimension D.

        Returns:
        - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional
          array of length N, and each element is an integer giving the predicted
          class.
        """
        y_pred = np.zeros(X.shape[0])
        scores = X.dot(self.W)
        y_pred = scores.argmax(axis=1)
        return y_pred

In [12]:
class LinearSVM(LinearClassifier):
    """ A subclass that uses the Multiclass SVM loss function """

    def loss(self, X_batch, y_batch, reg):
        return svm_loss_vectorized(self.W, X_batch, y_batch, reg)

## **Explanation:** ##
During training, the LinearClassifier class samples a minibatch of batch_size examples at each iteration and computes the loss and the gradient dW on it via the svm_loss_vectorized() function. The weight matrix is then updated with W = W - a*dW, where a is the learning rate. This step is repeated for num_iters iterations. Since each iteration sees only one minibatch, an iteration is not a full epoch over the training set.
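
The training loop can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the notebook's experiment: it generates two well-separated 2D clusters, reimplements the hinge loss and gradient in a compact helper (`hinge_loss_and_grad`, a hypothetical name), and runs minibatch gradient descent with arbitrarily chosen hyperparameters.

```python
import numpy as np

def hinge_loss_and_grad(W, X, y, reg):
    """Multiclass hinge loss and its gradient, same scheme as svm_loss_vectorized."""
    n = X.shape[0]
    scores = X.dot(W)
    correct = scores[np.arange(n), y].reshape(n, 1)
    margin = np.maximum(0, scores - correct + 1)
    margin[np.arange(n), y] = 0
    loss = margin.sum() / n + reg * np.sum(W * W)
    mask = (margin > 0).astype(float)
    mask[np.arange(n), y] = -mask.sum(axis=1)
    dW = X.T.dot(mask) / n + 2 * reg * W
    return loss, dW

rng = np.random.default_rng(0)
# Toy data (illustrative): two Gaussian clusters, plus the bias column of ones
X_toy = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
                   rng.normal(2.0, 0.5, size=(50, 2))])
X_toy = np.hstack([X_toy, np.ones((100, 1))])
y_toy = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

W = np.zeros((3, 2))
learning_rate, batch_size = 0.1, 20
loss_history = []
for it in range(200):
    idx = rng.choice(100, batch_size, replace=False)  # sample a minibatch
    loss, grad = hinge_loss_and_grad(W, X_toy[idx], y_toy[idx], reg=1e-3)
    loss_history.append(loss)
    W -= learning_rate * grad                          # W = W - a * dW

accuracy = np.mean(X_toy.dot(W).argmax(axis=1) == y_toy)
print(loss_history[0], loss_history[-1], accuracy)
```

On such an easy problem the loss drops quickly; the plant images are far harder to separate with a linear model on raw pixels.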

## Training the Linear Classifier

In [13]:
svmd = LinearSVM()
loss_hist = svmd.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4, num_iters=1500, verbose=True)

y_train_pred = svmd.predict(X_train)
print('training accuracy: %f' % (np.mean(y_train == y_train_pred),))

y_val_pred = svmd.predict(X_val)
print('validation accuracy: %f' % (np.mean(y_val == y_val_pred),))

iteration 0 / 1500: loss 3384.370059
iteration 100 / 1500: loss 1240.718282
iteration 200 / 1500: loss 455.267525
iteration 300 / 1500: loss 167.225650
iteration 400 / 1500: loss 62.139692
iteration 500 / 1500: loss 23.739515
iteration 600 / 1500: loss 9.461212
iteration 700 / 1500: loss 4.108287
iteration 800 / 1500: loss 2.172159
iteration 900 / 1500: loss 1.054002
iteration 1000 / 1500: loss 1.514256
iteration 1100 / 1500: loss 0.838581
iteration 1200 / 1500: loss 1.421074
iteration 1300 / 1500: loss 0.656138
iteration 1400 / 1500: loss 1.201591
training accuracy: 0.654031
validation accuracy: 0.517986


## **Explanation:**
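An object of class LinearSVM is trained on X_train for 1500 iterations with an initial choice of the two hyperparameters, learning rate and regularization strength. It reaches about 65% accuracy on the training set and about 52% on the validation set, which is close to chance for two balanced classes. The hyperparameters are tuned next to look for better performance.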
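
To relate the number of training steps to passes over the data, the short calculation below uses the values from this notebook: 2084 training samples, the default batch_size of 200 and num_iters=1500. It is only a rough conversion, since each minibatch is drawn independently and a given image may be seen more or less often than the average.

```python
num_train = 2084    # rows in X_train
batch_size = 200    # default batch size of LinearClassifier.train
num_iters = 1500    # iterations used in the training call

samples_seen = num_iters * batch_size
epochs = samples_seen / num_train   # average number of passes over the training set
print(f"{samples_seen} samples seen, about {epochs:.0f} epochs")
```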

## Tuning Hyperparameters to get the Best Fit

#### Use the validation set to tune the hyperparameters (regularization strength and learning rate). You should experiment with different ranges for the learning rates and regularization strengths.

In [14]:
learning_rates = [5e-7, 5e-6, 5e-5, 5e-4, 5e-3]
regularization_strengths = [2.5e4, 5e4, 2.5e3, 2.5e2, 5e2, 1e1]

results = {}
best_val = -1  # The highest validation accuracy that we have seen so far.
best_svm = None  # The LinearSVM object that achieved the highest validation rate.

### Declare blr, brg to store the best LinearSVM object's learning rate and regularization.

In [18]:
blr = None
brg = None

grid_search = [(lr, rg) for lr in learning_rates for rg in regularization_strengths]
for lr, rg in grid_search:
    svmd = LinearSVM()
    train_loss = svmd.train(X_train, y_train, learning_rate=lr, reg=rg,
                            num_iters=2000, verbose=False)
    # Predict values for training set
    y_train_pred = svmd.predict(X_train)
    # Calculate accuracy
    train_accuracy = np.mean(y_train_pred == y_train)
    # Predict values for validation set
    y_val_pred = svmd.predict(X_val)
    # Calculate accuracy
    val_accuracy = np.mean(y_val_pred == y_val)
    # Save results
    results[(lr, rg)] = (train_accuracy, val_accuracy)
    if best_val < val_accuracy:
        blr = lr
        brg = rg
        best_val = val_accuracy
        best_svm = svmd
        
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
        lr, reg, train_accuracy, val_accuracy))

print('best validation accuracy achieved during grid search: %f' % best_val)

y_test_pred = best_svm.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('linear SVM on raw pixels final test set accuracy: %f' % test_accuracy)
svmd = LinearSVM()
loss_hist = svmd.train(X_train, y_train, learning_rate=blr, reg=brg,
                       num_iters=2000, verbose=False)
y_svmd = svmd.predict(X_test)
print("SVMD")
print(y_svmd[0:10])
print("actual")
print(y_test[0:10])
print("best_svm")
print(y_test_pred[0:10])

lr 5.000000e-07 reg 1.000000e+01 train accuracy: 0.996161 val accuracy: 0.494005
lr 5.000000e-07 reg 2.500000e+02 train accuracy: 0.994242 val accuracy: 0.494005
lr 5.000000e-07 reg 5.000000e+02 train accuracy: 0.994722 val accuracy: 0.503597
lr 5.000000e-07 reg 2.500000e+03 train accuracy: 0.621881 val accuracy: 0.515588
lr 5.000000e-07 reg 2.500000e+04 train accuracy: 0.541267 val accuracy: 0.529976
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.513916 val accuracy: 0.455635
lr 5.000000e-06 reg 1.000000e+01 train accuracy: 0.995681 val accuracy: 0.448441
lr 5.000000e-06 reg 2.500000e+02 train accuracy: 0.612764 val accuracy: 0.537170
lr 5.000000e-06 reg 5.000000e+02 train accuracy: 0.657390 val accuracy: 0.532374
lr 5.000000e-06 reg 2.500000e+03 train accuracy: 0.533109 val accuracy: 0.465228
lr 5.000000e-06 reg 2.500000e+04 train accuracy: 0.499040 val accuracy: 0.458034
lr 5.000000e-06 reg 5.000000e+04 train accuracy: 0.493282 val accuracy: 0.453237
lr 5.000000e-05 reg 1.000000
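
The selection rule used in the grid search can be replayed on an excerpt of the printed results. The dictionary below copies four of the rows visible above; since the output is cut off, the best pair over the full grid may differ from the one found here. The name `results_excerpt` is an illustrative assumption.

```python
# (learning rate, regularization) -> (train accuracy, validation accuracy)
results_excerpt = {
    (5e-7, 1e1):   (0.996161, 0.494005),
    (5e-7, 2.5e4): (0.541267, 0.529976),
    (5e-6, 2.5e2): (0.612764, 0.537170),
    (5e-6, 5e2):   (0.657390, 0.532374),
}

# Pick the pair with the highest validation accuracy, as the loop above does
best_key = max(results_excerpt, key=lambda k: results_excerpt[k][1])
best_train, best_val_acc = results_excerpt[best_key]
print("best (lr, reg) in excerpt:", best_key, "val accuracy:", best_val_acc)

# Gap between training and validation accuracy for each pair
for key, (train_acc, val_acc) in sorted(results_excerpt.items()):
    print(key, "gap: %.3f" % (train_acc - val_acc))
```

The pairs with weak regularization fit the training set almost perfectly while staying near 50% on validation, which points to overfitting rather than a useful decision boundary.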

In [17]:
from sklearn import svm, metrics

classifier = svm.LinearSVC(penalty='l2', loss='squared_hinge', max_iter=2000)
classifier.fit(X_train, y_train)
print(classifier.score(X_train, y_train))
print(classifier.score(X_val, y_val))
print(classifier.score(X_test, y_test))

0.996641074856046
0.460431654676259
0.5263157894736842
