# Plant Village subset 


### Introduction

In this file, we extract a subset of the PlantVillage data for use in the activation function investigation. The data is stored in the following folder structure:
```
   ..\Data\PlantVillage\Pepper__bell___Bacterial_spot 
                       \Pepper__bell___healthy
                       \Potato___Early_blight
                       \Potato___healthy
                       \Potato___Late_blight
                       \Tomato_Bacterial_spot
                       \Tomato_Early_blight
                       \Tomato_healthy
                       \Tomato_Late_blight
                       \Tomato_Leaf_Mold
                       \Tomato_Septoria_leaf_spot
                       \Tomato_Spider_mites_Two_spotted_spider_mite
                       \Tomato__Target_Spot
                       \Tomato__Tomato_mosaic_virus
                       \Tomato__Tomato_YellowLeaf__Curl_Virus
```
                  
In total there are 15 image classes spread across 3 plant species. The dataset contains approximately 20,600 images, each 256x256 pixels, which is too large to train neural networks quickly when the goal is to compare the effect of changing the activation function. To create a balanced subset of the data, I take the first 150 images from each of the folders above and downsample them to 32x32, the smallest size typically used for neural network investigations.
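
To make the size reduction concrete, the sketch below works through the arithmetic. It assumes every class folder contributes its full 150 images; the variable names are illustrative and are not used elsewhere in the notebook.

```python
# Back-of-the-envelope size comparison between the full dataset and the subset.
# Assumption: each of the 15 class folders holds at least 150 images.
n_class_folders = 15
images_per_class = 150

subset_images = n_class_folders * images_per_class

# Values per image = width * height * colour channels
values_full = 256 * 256 * 3
values_small = 32 * 32 * 3
reduction_per_image = values_full // values_small

print(f"Subset size: {subset_images} images")
print(f"Values per image: {values_full} -> {values_small}")
print(f"Per-image reduction factor: {reduction_per_image}")
```

Under these assumptions the subset holds 2250 images, and each image shrinks by a factor of 64.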


In [1]:
#Required packages

import tensorflow as tf
import sys
import json
import math
import keras
import keras.backend as K
import numpy as np
import pickle
import cv2
import pandas as pd
import os

from os import listdir
from sklearn.preprocessing import LabelBinarizer
from numpy import asarray
from numpy import save


from tensorflow.keras.utils import img_to_array
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

### Data import 

First I define a function to convert the images to arrays. The images keep all three colour channels (depth = 3), as colour is likely to be important for disease classification.

In [2]:
# Set default image size (width, height) used when loading the images
default_image_size = (256, 256)

image_size = 0
directory_root = '../Data'
width = 256
height = 256
depth = 3  # number of colour channels

In [3]:
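# Function to convert an image file to an array
def convert_image_to_array(image_dir):
    """Load the image at path image_dir and return it as a float array.

    Returns an empty array if the file cannot be read and None if an
    exception is raised.
    """
    try:
        # cv2.imread returns None (rather than raising) for unreadable files;
        # colour images are loaded with three channels in BGR order
        image = cv2.imread(image_dir)
        if image is not None:
            image = cv2.resize(image, default_image_size)
            return img_to_array(image)
        else:
            return np.array([])
    except Exception as e:
        print(f"Error : {e}")
        return None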

Next I import the data into two lists: one containing the image arrays and one containing the labels, which are derived from the folder names. Since the image loading takes time to run, the cell below shows its progress by printing which folder is being processed. The '.DS_Store' check is present so that the code runs correctly for group members using macOS.

In [None]:
image_list, label_list = [], []

try:
    print("[INFO] Loading images ...")
    # Build filtered lists instead of calling remove() while iterating,
    # which can skip entries. Besides .DS_Store, plain files in the data
    # root (e.g. previously saved .npy files) are skipped as well.
    root_dir = [d for d in listdir(directory_root)
                if d != ".DS_Store" and os.path.isdir(f"{directory_root}/{d}")]

    for plant_folder in root_dir:
        plant_path = f"{directory_root}/{plant_folder}"
        plant_disease_folder_list = [d for d in listdir(plant_path)
                                     if d != ".DS_Store"]
        print(plant_disease_folder_list)
        for disease_folder in plant_disease_folder_list:
            print(disease_folder)

        for plant_disease_folder in plant_disease_folder_list:
            print(f"[INFO] Processing {plant_disease_folder} ...")
            disease_path = f"{plant_path}/{plant_disease_folder}"
            # Keep JPEG files only (this also drops .DS_Store), then take the
            # first 150 so that other files do not use up part of the quota
            plant_disease_image_list = [f for f in listdir(disease_path)
                                        if f.lower().endswith(".jpg")]

            for image in plant_disease_image_list[:150]:
                image_list.append(convert_image_to_array(f"{disease_path}/{image}"))
                label_list.append(plant_disease_folder)
    print("[INFO] Image loading completed")
except Exception as e:
    print(f"Error : {e}")

[INFO] Loading images ...
['Pepper__bell___Bacterial_spot', 'Pepper__bell___healthy', 'Potato___Early_blight', 'Potato___healthy', 'Potato___Late_blight', 'Tomato_Bacterial_spot', 'Tomato_Early_blight', 'Tomato_healthy', 'Tomato_Late_blight', 'Tomato_Leaf_Mold', 'Tomato_Septoria_leaf_spot', 'Tomato_Spider_mites_Two_spotted_spider_mite', 'Tomato__Target_Spot', 'Tomato__Tomato_mosaic_virus', 'Tomato__Tomato_YellowLeaf__Curl_Virus']
Pepper__bell___Bacterial_spot
Pepper__bell___healthy
Potato___Early_blight
Potato___healthy
Potato___Late_blight
Tomato_Bacterial_spot
Tomato_Early_blight
Tomato_healthy
Tomato_Late_blight
Tomato_Leaf_Mold
Tomato_Septoria_leaf_spot
Tomato_Spider_mites_Two_spotted_spider_mite
Tomato__Target_Spot
Tomato__Tomato_mosaic_virus
Tomato__Tomato_YellowLeaf__Curl_Virus
[INFO] Processing Pepper__bell___Bacterial_spot ...
[INFO] Processing Pepper__bell___healthy ...
[INFO] Processing Potato___Early_blight ...
[INFO] Processing Potato___healthy ...
[INFO] Processing Potato
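
When selecting "the first 150 images" from a folder, the order of filtering and truncating matters: truncating the directory listing first and discarding non-image files afterwards can leave fewer than 150 images. The sketch below illustrates filtering first on a toy listing. `first_n_images` is a hypothetical helper, not part of the notebook, and it works on a plain list of file names rather than a real directory.

```python
def first_n_images(names, n=150, extensions=(".jpg",)):
    """Return the first n file names ending with one of the given extensions.

    Hypothetical helper: filtering happens before truncating, so hidden files
    such as .DS_Store do not take up one of the n slots.
    """
    wanted = [name for name in names if name.lower().endswith(extensions)]
    return wanted[:n]


# Toy directory listing standing in for the result of os.listdir
listing = [".DS_Store", "a.JPG", "b.jpg", "notes.txt", "c.jpg"]
print(first_n_images(listing, n=2))  # ['a.JPG', 'b.jpg']
```

Note that `os.listdir` returns entries in arbitrary order, so "first" is only reproducible across machines if the listing is sorted beforehand.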

In [None]:
# Checking we have the expected number of images and labels:
image_size = len(image_list)
print('Number of images: ' + str(image_size))
print('Number of labels: ' + str(len(label_list)))

In [None]:
#Checking that we have the intended distribution of labels
from collections import Counter
counts = Counter(label_list)
df = pd.DataFrame.from_dict(counts, orient='index')
df.plot(kind='bar')
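
As a complement to the bar chart, the balance can also be checked programmatically. The following is a minimal sketch using toy labels in place of `label_list`; `check_balanced` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter


def check_balanced(labels, expected_per_class):
    """Return a dict of the classes whose count differs from expected_per_class."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n != expected_per_class}


# Toy labels standing in for label_list
toy_labels = ["healthy"] * 3 + ["blight"] * 3 + ["mold"] * 2
print(check_balanced(toy_labels, expected_per_class=3))  # {'mold': 2}
```

For this notebook, an empty result from `check_balanced(label_list, 150)` would indicate that every class contributed exactly 150 images.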

### Downsampling

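We now reduce the dimensions of the arrays from 256x256x3 to 32x32x3 by taking the maximum over non-overlapping 8x8 blocks of pixels in each colour channel (max pooling).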

In [None]:
# Check the shape before downsampling; expected (256, 256, 3)
image_list[0].shape

In [None]:
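# Downsample by max pooling: split each image into non-overlapping
# bin_size x bin_size blocks and keep the maximum of each block per channel

input_size = 256
output_size = 32
bin_size = input_size // output_size  # 8

for i in range(len(image_list)):
    # (256, 256, 3) -> (32, 8, 32, 8, 3), then max over both block axes -> (32, 32, 3)
    image_list[i] = image_list[i].reshape((output_size, bin_size,
                                           output_size, bin_size, 3)).max(3).max(1)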
image_list[0].shape
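
The reshape-and-max pattern used for downsampling is easier to follow on a small array. The sketch below applies the same idea to a 4x4 single-channel image with 2x2 blocks. `max_pool_blocks` is an illustrative function name, and it assumes the height and width are exact multiples of `bin_size`.

```python
import numpy as np


def max_pool_blocks(image, bin_size):
    """Downsample an (H, W, C) array by taking the max over bin_size x bin_size blocks.

    Assumes H and W are exact multiples of bin_size.
    """
    h, w, c = image.shape
    out_h, out_w = h // bin_size, w // bin_size
    # Axes after reshape: (block row, row within block, block col, col within block, channel)
    blocks = image.reshape(out_h, bin_size, out_w, bin_size, c)
    return blocks.max(axis=3).max(axis=1)


# 4x4 image with values 0..15 laid out row by row
toy = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
pooled = max_pool_blocks(toy, bin_size=2)
print(pooled[:, :, 0])  # rows: [5, 7] and [13, 15]
```

Each output value is the largest value in its 2x2 block, so the top-left block (0, 1, 4, 5) becomes 5. With `bin_size=8`, the same function maps a 256x256x3 array to 32x32x3.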



### Preprocessing

Next we convert the class labels to one-hot (binary indicator) vectors, in which each entry states whether or not the image belongs to the corresponding class, for input into the neural networks.
Then we split the data into training and test sets and save them to file.

In [None]:
# Binarize the labels and save the fitted binarizer so predictions can be decoded later
label_binarizer = LabelBinarizer()
image_labels = label_binarizer.fit_transform(label_list)
with open('label_transform.pkl', 'wb') as f:
    pickle.dump(label_binarizer, f)
n_classes = len(label_binarizer.classes_)

# Check number of classes
print(n_classes)

In [None]:
print(label_binarizer.classes_)
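
To illustrate what the binarized labels look like, the sketch below builds the same kind of one-hot matrix with NumPy alone. It is a stand-in for illustration, not how `LabelBinarizer` is implemented, and `one_hot` is a hypothetical helper. Like `LabelBinarizer`, it orders the classes by sorting the label strings.

```python
import numpy as np


def one_hot(labels):
    """Return (classes, matrix) where matrix[i, j] is 1 if labels[i] == classes[j]."""
    # np.unique returns the sorted distinct labels and each label's class index
    classes, indices = np.unique(labels, return_inverse=True)
    matrix = np.eye(len(classes), dtype=int)[indices]
    return classes, matrix


toy_labels = ["Tomato_healthy", "Potato___healthy", "Tomato_Leaf_Mold", "Tomato_healthy"]
classes, encoded = one_hot(toy_labels)
print(classes)
print(encoded)
```

Each row contains a single 1 in the column of its class. One difference to be aware of: with exactly two classes, `LabelBinarizer` returns a single column rather than two, which does not affect this notebook's 15 classes.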

In [None]:
image_list = np.array(image_list, dtype=np.float32)

In [None]:
# Hold out 20% of the images for testing; the fixed seed makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(image_list, image_labels, test_size=0.2, random_state=42)

In [None]:
# Check length of training and test sets
print(len(x_train))
print(len(x_test))
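
The expected set sizes can be worked out in advance. The sketch below assumes that all 15 x 150 images loaded successfully and that a fractional `test_size` is rounded up to a whole number of samples, which is my understanding of scikit-learn's behaviour; `expected_split_sizes` is a hypothetical helper.

```python
import math


def expected_split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set size is rounded up and the training set
    receives the remaining samples.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(expected_split_sizes(15 * 150, 0.2))  # (1800, 450)
```

If the printed lengths differ from 1800 and 450, some images were probably skipped during loading.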

In [None]:
x_train[0]

In [None]:
# save numpy array as npy file
save('../Data/x_train_sample.npy', x_train)
save('../Data/y_train_sample.npy', y_train)
save('../Data/x_test_sample.npy', x_test)
save('../Data/y_test_sample.npy', y_test)
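
A quick way to confirm that the saved files are usable is to load them back and check their shapes. The sketch below shows the round trip on small placeholder arrays written to a temporary directory; the file names and array contents are illustrative only, with shapes chosen to match 32x32 RGB images and 15 one-hot classes.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays with the shapes expected for this subset (assumption)
x_demo = np.zeros((4, 32, 32, 3), dtype=np.float32)
y_demo = np.eye(15, dtype=int)[:4]

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "x_demo.npy")
    y_path = os.path.join(tmp_dir, "y_demo.npy")
    np.save(x_path, x_demo)
    np.save(y_path, y_demo)

    # np.load reads the whole array into memory, so it outlives the temp dir
    x_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(x_loaded.shape, x_loaded.dtype)
print(y_loaded.shape)
```

The same pattern applies to the real files, for example `np.load('../Data/x_train_sample.npy')`.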

### References
1. https://towardsdatascience.com/image-processing-with-python-5b35320a4f3c
2. https://www.kaggle.com/code/vijpatel/cnn-plantvillage
