# Plant recognition project

## 1 Introduction
 

---

Accurate identification of plant species is essential for a wide range of use cases starting from biodiversity and conservation project, agriculture project and simple nature explorations among others. So far, this task required a specialist knowledge, was time consuming and was often difficult even for professionals.

The image classification algorithms are considered to be as a promising directions in reducing the complexity in plan species classification and assisting the professionals whenever possible. Before the deep learning methods became
available, the task was primarily tackled by identifying the leaf shape patterns, however this required a clear picture of a leave against a white background and had limited accuracy. Deep learning methods allow to use “noisy” photo images and provide increased accuracy.

A range of academic studies has been conducted in the field [1],[2] There exist a number of mobile apps that aim to tackle the problem: noticeable progress in this way was achieved by several projects and apps like LeafSnap
(http://leafsnap.com/), PlantNet (http://identify.plantnet-project.org/) or Folia
(http://liris.univ-lyon2.fr/reves/content/en/index.php).

In addition, the CLEF evaluation forum (http://www.imageclef.org/) hosted a number of challenges over the last few years aiming to increase the accuracy of the identification, which is the main source of both the dataset and an inspiration for the project in general (http://www.imageclef.org/lifeclef/2015/plant)

<img src="example.jpg" alt="Example plant picture" style="width: 400px;"/>

---
### This notebook

In this notebook, I will be documenting my approach to solving the problem and showing the interim results.


### The Road Ahead

I have broke down the notebook into separate steps.  Feel free to use the links below to navigate the notebook.

* [Step 1](#step1): Import Datasets
* [Step 2](#step2): Detect Humans
* [Step 3](#step3): Detect Dogs

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut.  Markdown cells can be edited by double-clicking the cell to enter edit mode.

---
<a id='step1'></a>
## Step 1: Set up and Dataset import

### Import plant Dataset

In the code cell belows, we load the additional libraries for future code, set up global variables and import datasets of plant images. We populate a few variables through the use of the `load_files` function from the scikit-learn library:
- `train_files`, `valid_files`, `test_files` - numpy arrays containing file paths to images
- `train_targets`, `valid_targets`, `test_targets` - numpy arrays containing onehot-encoded classification labels 
- `dog_names` - list of string-valued dog breed names for translating labels

### Set up and globals

In the code cell below, we import all required modules used later on and set up a seed number for reproducibility

In [10]:
# load all required libraries
import numpy as np
import pandas as pd
import random
import cv2 
import os               
import matplotlib.pyplot as plt  
import xml.etree.ElementTree as ET                
from tqdm import tqdm
from glob import glob
from sklearn.datasets import load_files       
from keras.utils import np_utils
from keras.preprocessing import image
from PIL import ImageFile 

%matplotlib inline 

random.seed(999) # this is required to ensure reproducibility

In [2]:
# Globals
ImageFile.LOAD_TRUNCATED_IMAGES = True 

BATCH_SIZE = 32   # tweak to your GPUs capacity
IMG_HEIGHT = 224   # ResNetInceptionv2 & Xception like 299, ResNet50 & VGG like 224
IMG_WIDTH = IMG_HEIGHT
CHANNELS = 3
DIMENSIONS = (IMG_HEIGHT,IMG_WIDTH,CHANNELS)

### Import Datasets

In the code cell below, we import a dataset of images and corresponding XML files containing metadata, where the file paths are stored in the numpy arrays 'picture_files' and 'metadata_files'

This assumes we have the following structure:

├── data
│   ├── test
│   ├── train
│   ├── small_data_test
│   ├── test_tensor_file.npy
│   ├── train_tensor_file.npy
├── keras.best.h5
├── plant_recognition_project.ipynb

In [3]:
# define function to load function
# in both test and train folders we have a mixture or .jpg and .xml files
# we need to treat them separately

def load_dataset(path):
    path_pictures = path + "*.jpg"
    path_metadata = path + "*.xml"
    picture_files = np.array(glob(path_pictures))
    metadata_files = np.array(glob(path_metadata))
    return picture_files, metadata_files

In [4]:
# this is function to create a dictionary out of an individual xml file

def get_xml_metadata(file):
    pic_meta = {}
    pic_meta['file_name'] = os.path.basename(file)
    tree = ET.parse(file)
    root = tree.getroot()
    # create a dictionary with all metadata
    for child in root:
        pic_meta[child.tag] = child.text
    return pic_meta

In [5]:
#lets get all data from both the test and the train folders

train_path = './data/train/'
test_path = './data/test/'

train_images, train_metadata = load_dataset(train_path)
test_images, test_metadata = load_dataset(test_path)

# print statistics about the dataset
print('There are %d total train picture files.' % len(train_images))
print('There are %d total train metadata files. \n' % len(train_metadata))

print('There are %d total test picture files.' % len(test_images))
print('There are %d total test metadata files.' % len(test_metadata))

There are 91758 total train picture files.
There are 91758 total train metadata files. 

There are 21446 total test picture files.
There are 21446 total test metadata files.


In [23]:
# Let's take a look at the xml file structure
file_meta = get_xml_metadata(train_metadata[10])
file_meta

{'Author': 'liliane roubaudi',
 'ClassId': '30052',
 'Content': 'Flower',
 'Date': '2013-8-13',
 'Family': 'Convolvulaceae',
 'Genus': 'Convolvulus',
 'ImageId2014': '11652',
 'Latitude': None,
 'LearnTag': 'Train',
 'Location': 'Nantua',
 'Longitude': None,
 'MediaId': '37007',
 'ObservationId': '18801',
 'ObservationId2014': '16074',
 'Species': 'Convolvulus arvensis L.',
 'Vote': '4',
 'YearInCLEF': 'PlantCLEF2014',
 'file_name': '37007.xml'}

In [6]:
# So, the plant species is captured in the 'Species' field
#lets get all species names for both test and train metadata files

test_species = []
for file in test_metadata:
    metadata_inf = get_xml_metadata(file)
    test_species.append(metadata_inf['Species'])

train_species = []
for file in train_metadata:
    metadata_inf = get_xml_metadata(file)
    train_species.append(metadata_inf['Species'])

In [20]:
# later we will need to have one-hot encoded values of the labels
# we will use sklearn to trasfer non-integer list to an integer array
# we can do an inverse transfer later as well
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(train_species)
train_to_integer = le.transform(train_species)
test_to_integer = le.transform(test_species)

In [21]:
# how we can do one hot encoding 
Y_train = np_utils.to_categorical(train_to_integer, 1000)
Y_test = np_utils.to_categorical(test_to_integer, 1000)

### Data preprocessing

In this project, I will be Keras as a main library to work with CNNs and will be using TensorFlow as backend.
The below info is a reminder for me to get the right data in Keras:

Keras CNNs require a 4D array (tensor) as input, with shape:
$$
(\text{nb_samples}, \text{rows}, \text{columns}, \text{channels}),
$$

where `nb_samples` corresponds to the total number of images (or samples), and `rows`, `columns`, and `channels` correspond to the number of rows, columns, and channels for each image, respectively.  

The `path_to_tensor` function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN.  The function first loads the image and resizes it to a square image that is $ IMG\_HEIGHT \times IMG\_HEIGHT $ pixels. where IMG\_HEIGHT is defined as a global variable and is dictated by the choise of the CNN. Next, the image is converted to an array, which is then resized to a 4D tensor.  In this case, since we are working with color images, each image has three channels.  Likewise, since we are processing a single image (or sample), the returned tensor will always have shape

$$
(1, IMG\_HEIGHT, IMG\_HEIGHT, 3).
$$

The `paths_to_tensor` function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape 

$$
(\text{nb_samples}, IMG\_HEIGHT, IMG\_HEIGHT, 3).
$$

Here, `nb_samples` is the number of samples, or number of images, in the supplied array of image paths.  It is best to think of `nb_samples` as the number of 3D tensors (where each 3D tensor corresponds to a different image) in your dataset!

In [None]:
from keras.preprocessing import image                  
from tqdm import tqdm

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(IMG_HEIGHT, IMG_HEIGHT))
    # convert PIL.Image.Image type to 3D tensor with shape (IMG_HEIGHT, IMG_HEIGHT, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, IMG_HEIGHT, IMG_HEIGHT, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

In [None]:
# pre-process the test data for Keras
#but only if it hasn't been already done
def test_preproc():
    try:
        test_tensor = np.load(test_tensor_file.npy)
    except:
        test_tensors = paths_to_tensor(test_images).astype('float32')/255   
        #it does take a fair amount of time to do the pre-processing, 
        #so I will save the files on the disk to save time should something go wrong
        test_tensor_file = '/data/test_tensor_file'
        np.save(test_tensor_file,test_tensors,allow_pickle=True)
    return test_tensor


In [None]:
# pre-process the train data for Keras
# the file is so large it actually kills my kernel, so I am splitting the files in 5
def train_preproc():
    try:
        train_tensors_1 = np.load('data/train_tensors_1_f.npy')
        print('train_1_loaded')
        train_tensors_2 = np.load('data/train_tensors_2_f.npy')
        print('train_2_loaded')
        train_tensors_3 = np.load('data/train_tensors_3_f.npy')
        print('train_3_loaded')
        train_tensors_4 = np.load('data/train_tensors_4_f.npy')
        print('train_4_loaded')
        train_tensors_5 = np.load('data/train_tensors_5_f.npy')
        print('train_5_loaded')
        
        train_tensor = np.vstack([train_tensors_1,train_tensors_2,train_tensors_3,train_tensors_4,train_tensors_5])
        
        print('concatinated')
        
        del train_tensors_1
        del train_tensors_2
        del train_tensors_3
        del train_tensors_4
        del train_tensors_5
        
    except:
        train_tensors_1 = paths_to_tensor(train_images[:20000]).astype('float32')/255
        train_tensors_2 = paths_to_tensor(train_images[20000:40000]).astype('float32')/255
        train_tensors_3 = paths_to_tensor(train_images[40000:60000]).astype('float32')/255
        train_tensors_4 = paths_to_tensor(train_images[60000:80000]).astype('float32')/255
        train_tensors_5 = paths_to_tensor(train_images[80000:]).astype('float32')/255
        
        train_tensors_1_f = 'data/train_tensors_1_f'
        train_tensors_2_f = 'data/train_tensors_2_f'
        train_tensors_3_f = 'data/train_tensors_3_f'
        train_tensors_4_f = 'data/train_tensors_4_f'
        train_tensors_5_f = 'data/train_tensors_5_f'
        
        np.save(train_tensors_1_f,train_tensors_1,allow_pickle=True)
        np.save(train_tensors_2_f,train_tensors_2,allow_pickle=True)
        np.save(train_tensors_3_f,train_tensors_3,allow_pickle=True)
        np.save(train_tensors_4_f,train_tensors_4,allow_pickle=True)
        np.save(train_tensors_5_f,train_tensors_5,allow_pickle=True)
    
    return train_tensor
        

In [None]:

#it does take a fair amount of time to do the pre-processing, 
#so I will save the files on the disk to save time should something go wrong

train_tensor= train_preproc()


In [None]:
train_tensors_1 = np.load('data/train_tensors_1_f.npy')
print('train_1_loaded')
train_tensors_2 = np.load('data/train_tensors_2_f.npy')
print('train_2_loaded')
train_tensors_3 = np.load('data/train_tensors_3_f.npy')
print('train_3_loaded')
train_tensors_4 = np.load('data/train_tensors_4_f.npy')
print('train_4_loaded')
train_tensors_5 = np.load('data/train_tensors_5_f.npy')
print('train_5_loaded')
        
train_tensor = np.vstack([train_tensors_1,train_tensors_2,train_tensors_3,train_tensors_4,train_tensors_5])
        
print('concatinated')
        
del train_tensors_1
del train_tensors_2
del train_tensors_3
del train_tensors_4
del train_tensors_5

In [None]:
test_tensor_file = 'test_tensor_file'
np.save(test_tensor_file,test_tensors,allow_pickle=True)

---
<a id='step2'></a>
## Step 2: Initial data exploration

In the code cell below, we will take a look at the images we have, take a look at labels and at the metadata

In [None]:
# function to show a number of images

def plot_images(images, cls_true, cls_pred=None, smooth=True):

    assert len(images) == len(cls_true)

    # Create figure with sub-plots.
    fig, axes = plt.subplots(2, 2)

    # Adjust vertical spacing.
    if cls_pred is None:
        hspace = 0.3
    else:
        hspace = 0.6
    fig.subplots_adjust(hspace=hspace, wspace=0.3)

    # Interpolation type.
    if smooth:
        interpolation = 'spline16'
    else:
        interpolation = 'nearest'

    for i, ax in enumerate(axes.flat):
        # There may be less than 4 images, ensure it doesn't crash.
        if i < len(images):
            
            # Load image from file first
            img = cv2.imread(images[i])
            cv_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            
            # Plot image.
            ax.imshow(img, interpolation=interpolation)

            # Name of the true class.
            cls_true_name = cls_true[i]

            # Show true and predicted classes.
            if cls_pred is None:
                xlabel = "True: {0}".format(cls_true_name)
            else:
                # Name of the predicted class.
                cls_pred_name = class_names[cls_pred[i]]

                xlabel = "True: {0}\nPred: {1}".format(cls_true_name, cls_pred_name)

            # Show the classes as the label on the x-axis.
            ax.set_xlabel(xlabel)
        
        # Remove ticks from the plot.
        ax.set_xticks([])
        ax.set_yticks([])
    
    # Ensure the plot is shown correctly with multiple plots
    # in a single Notebook cell.
    plt.show()

In [None]:
#Let's see a couple of images

images = train_images[0:4]
labels = train_species[0:4]
plot_images(images=images, cls_true=labels, smooth=False)

In [None]:
# let's see how many unique labels we have
test_species_unique = np.unique(test_species)
train_species_unique = np.unique(train_species)

print('There are %d unique species in the train files.' % len(train_species_unique))
print('There are %d unique species in the test files.' % len(test_species_unique))

In [None]:
#let's see if there are any species in the test data not presents in the train dataset
np.setdiff1d(test_species_unique,train_species_unique)

Great, all test species are present in the train data

### Let's see how species are distributed

In [None]:
# let's see how species are distributed
import pandas as pd

unique_train, counts_train = np.unique(train_species, return_counts=True)
unique_test, counts_test = np.unique(test_species, return_counts=True)

train_data_species = pd.DataFrame()
train_data_species['names'] = unique_train
train_data_species['counts'] = counts_train

test_data_species = pd.DataFrame()
test_data_species['names'] = unique_test
test_data_species['counts'] = counts_test


In [None]:
print(test_data_species.describe())
print(train_data_species.describe())
#import seaborn as sns
#sns.set(style="darkgrid")
#ax = sns.countplot(y="names", data=train_data_species)

In [None]:
# from the summary table we can see some species are more prevalent compared to others in both datasets
#there is a large difference between the 25% and 75% in both datasets

---
<a id='step2'></a>
## Step 2: Detect plants

In this section, we will use transfer learning based on ResNet50 CNN architecture.

In [None]:
import keras
from keras import  metrics, models, regularizers, optimizers
from keras.applications import ResNet50 #, Xception, InceptionResNetV2
from keras.models import Sequential, Model 
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras.preprocessing.image import ImageDataGenerator


### Model Architecture

The model uses the the pre-trained ResNet50 model as a fixed feature extractor, where the last convolutional output of ResNet50 is fed as input to our model. We only add a global average pooling layer and a fully connected layer, where the latter contains one node for each dog category and is equipped with a softmax.

In [None]:
# define the model
base_model = ResNet50(input_shape=DIMENSIONS, weights='imagenet', include_top=False)

# Freeze the layers which you don't want to train. Here I am freezing all of them
for layer in base_model.layers:
    layer.trainable = False

#Adding custom Layers 
x = base_model.output

x = GlobalAveragePooling2D(name='avg_pool_2')(x)
x = Dense(1000, activation='softmax', name='predictions')(x)

# creating the final model 
model_final = Model(input = base_model.input, output = x)

# compile the model 
model_final.compile(
    loss='categorical_crossentropy',
    optimizer=optimizers.Adam(1e-3),
    metrics=['acc'])

In [None]:
#Train and Test generators with Augmentation
#the data structure is not arranged in the way to use .flow_from_directory - I will need to use .flow


#(x_train, y_train), (x_test, y_test) = cifar10.load_data()
#y_train = np_utils.to_categorical(y_train, num_classes)
#y_test = np_utils.to_categorical(y_test, num_classes)


#We rescale the images by dividing every pixel in every image by 255.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

train_generator = train_datagen.flow(x_train, y_train, batch_size=BATCH_SIZE)


test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = train_datagen.flow(x_test, y_test, batch_size=BATCH_SIZE)


model.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=50,
        validation_data=test_generator,
        validation_steps=800)


In [None]:
# the train dataset it too large to fit into the memory 
# so need to create a generator that will return batches of images instead

def generate_batches_from_train_folder(images_to_read,labels_to_read, batchsize = 100):
    """
    Generator that returns batches of images ('xs') and labels ('ys') from the train folder
    :param string filepath: Full filepath of files to read - this needs to be a list of image files and label_files
    :param int batchsize: Size of the batches that should be generated.
    :return: (ndarray, ndarray) (xs, ys): Yields a tuple which contains a full batch of images and labels.
    """
    dimensions = (BATCH_SIZE, IMG_HEIGHT, IMG_HEIGHT, 3) # pixels, three channels
 

    # needs to be on a infinite loop for the generator to work
    while 1:
        filesize = len(images_to_read)

        # count how many entries we have read
        n_entries = 0
        # as long as we haven't read all entries from the file: keep reading
        while n_entries < (filesize - batchsize):
            
            # start the next batch at index 0
            # create numpy arrays of input data (features) - this is already shaped as a tensor
            xs = paths_to_tensor(images_to_read[n_entries : n_entries + batchsize])

            # and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
            y_values = f['y'][n_entries:n_entries+batchsize]
            ys = np.array(np.zeros((batchsize, 1000))) # data with 2 different classes (e.g. dog or cat)

            # Select the labels that we want to use, e.g. is dog/cat
            for c, y_val in enumerate(y_values):
                ys[c] = encode_targets(y_val, class_type='dog_vs_cat') # returns categorical labels [0,1], [1,0]

            # we have read one more batch from this file
            n_entries += batchsize
            yield (xs, ys)

In [None]:
# Save the model according to the conditions  
checkpoint = ModelCheckpoint("ResNet_1.h5", monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, verbose=1, mode='auto')


In [None]:

# Train the model 
model_out = model_final.fit_generator(
train_generator,
samples_per_epoch = nb_train_samples,
epochs = epochs,
validation_data = validation_generator,
nb_val_samples = nb_validation_samples,
callbacks = [checkpoint, early])


In [None]:
model.load_weights(BEST_MODEL)
model.compile(
    optimizer=optimizers.Adam(lr=1e-4,),
    loss='categorical_crossentropy',
    metrics=['acc'])

model_out = model.fit_generator(
    train_generator,
    steps_per_epoch=n_of_train_samples//BATCH_SIZE,
    epochs=60,
    validation_data=val_generator,
    validation_steps=n_of_val_samples//BATCH_SIZE,
    verbose=0,
    callbacks=callbacks)

### Pre-process the Data

We rescale the images by dividing every pixel in every image by 255.