<a href="https://colab.research.google.com/github/kskaran94/WasteClassification/blob/master/Waste_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Libraries Imported

In [0]:
from google.colab import files
import json
import numpy as np
import pandas as pd
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Activation, \
Dense, Dropout, Input, add
from tensorflow.keras import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, \
 recall_score, confusion_matrix
from tensorflow.keras.models import save_model, load_model
import seaborn as sns
import matplotlib.pyplot as plt
import shutil
import os
import random
import time

## Download Data from Kaggle

In [2]:
!pip install -q kaggle

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

token = {"username":"kskaran94","key":"e845a1f4ce47bb7f34dc6ec9f108f676"}
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)
    
! chmod 600 /root/.kaggle/kaggle.json

! kaggle datasets download -d techsash/waste-classification-data

cp: cannot stat 'kaggle.json': No such file or directory
Downloading waste-classification-data.zip to /content
 93% 209M/225M [00:05<00:00, 35.1MB/s]
100% 225M/225M [00:05<00:00, 40.6MB/s]


In [0]:
!unzip waste-classification-data.zip

## Dataset File Structure

The first step after downloading the data would be to look at the data set file structure. 

In [4]:
!ls DATASET/

TEST  TRAIN


We, see that the dataset has pre-defined train and test splits. We are missing a validation and before any modling or configuration, the task would be to construct a validation set from the existing train test. At this point, test set is untouched.

In [5]:
!ls DATASET/TRAIN/

O  R


## Objective

Build a image classifier to correctly identify Recyclable and Organic waste.  This would help government authorities reduce toxic waste in landfills. Thereby reducing land pollution.

## Util Functions

All the custom functions used in the notebook can be found in this block.

### Copying files

Function copies given file_nams from source to destination using shutil

In [0]:
def copyfiles(file_names, dest, src_path):
    for file in file_names:
        full_file_name = os.path.join(src_path, file)
        if os.path.isfile(full_file_name):
            shutil.copy(full_file_name, dest)

### Train Validation Split

Function takes a path of a directory and percentage of train to split the data into train and validation. Sklearn's train_test_split works only with dataframes /arrays. This function is written for  directory-file structure.

In [0]:
def train_val_test_split(path, perc):
    train_string  = 'train/'
    val_string  = 'val/' 
    dest_path = '/content/'
    try:
        os.mkdir(dest_path + train_string)
        os.mkdir(dest_path + val_string)
    except:
        shutil.rmtree(dest_path + train_string)
        shutil.rmtree(dest_path + val_string)
        os.mkdir(dest_path + train_string)
        os.mkdir(dest_path + val_string)
    
    sub_direc = os.listdir(path=path)
    
    for sub in sub_direc:
      if sub in ['O','R']:
        try:
            shutil.rmtree(dest_path + train_string + sub)
            shutil.rmtree(dest_path + val_string + sub)
        except:
            os.makedirs(dest_path+train_string+sub)
            os.makedirs(dest_path+val_string+sub)
        src_path = path + sub
        filenames = os.listdir(src_path)
        filenames.sort()  
        # make sure that the filenames have a fixed order before shuffling
        random.shuffle(filenames) 
        # shuffles the ordering of filenames (deterministic given the chosen seed)

        split_1 = int(perc * len(filenames))
        train_filenames = filenames[:split_1]
        val_filenames = filenames[split_1:]

        copyfiles(train_filenames, dest_path+train_string+sub, src_path)
        ## train set path for all classes
        copyfiles(val_filenames, dest_path+val_string+sub, src_path)
        ## validation set path for all classes


In [0]:
train_val_test_split('DATASET/TRAIN/', 0.8)

### Predict from generator

In [0]:
def predict_from_generator(generator, model): 
    pred = model.predict_generator(generator)
    predicted_class_indices = np.argmax(pred, axis = -1)
    classes = generator.classes[generator.index_array]
    return predicted_class_indices, classes

## Data Preparation and Configuration

Preparing the data for the model is an important task. In case of images, standard preparation techniques include rescaling, resizing and data augmentation (if needed). Keras provides ImageDataGenerator class for data preparation. We will be defining three different ImageDataGenerators for train, validation and test sets respectively. 

Rescaling of images is defined within the ImageDataGenerator.

In [0]:
batch_size=64

train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

In this case batch size is a paramter which can be tuned and the evaluation metric may also vary with different batch_size.

The Image data generator is used for returning configured images using the flow functions. We will be using the flow_from_directory to configure the images. There are other ways which can found here https://keras.io/preprocessing/image/

In this code block, the paramter that can be tuned is the target size of the image. 

In [0]:
train_generator = train_datagen.flow_from_directory(
            'train/',  # this is the target directory
            target_size=(150, 150),  # all images will be resized to 150x150
            batch_size=batch_size,
            class_mode='categorical',shuffle=False)

val_generator = val_datagen.flow_from_directory(
        'val/',  # this is the target directory
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='categorical',shuffle=False)


test_generator = test_datagen.flow_from_directory(
        'DATASET/TEST/',  # this is the target directory
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='categorical',shuffle=False)

## Model Defintion and Compilation

In [15]:
num_classes = 2
input_shape = (150, 150, 3)

cnn_small_bn = Sequential([
    Conv2D(8, kernel_size = (3,3), input_shape=input_shape, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(8, kernel_size = (3,3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(num_classes, activation='softmax'),
                 ])

cnn_small_bn.summary()

cnn_small_bn.compile("adam", "categorical_crossentropy",
                     metrics=['accuracy'])


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 148, 148, 8)       224       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 74, 74, 8)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 72, 72, 8)         584       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 36, 36, 8)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 10368)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 20738     
Total params: 21,546
Trainable params: 21,546
Non-trainable params: 0
__________________________________________________

## Model Training

In [0]:

history_cnn = cnn_small_bn.fit_generator(
    train_generator,
    steps_per_epoch=20,
    epochs=100,
    verbose=1,
    validation_data=val_generator,
    validation_steps=20)


In [0]:
pd.DataFrame(history_cnn.history).plot()

In [12]:
val_pred, val_classes = predict_from_generator(val_generator, cnn_small_bn)


confusion_matrix(val_classes, val_pred)

array([[1666,  848],
       [ 483, 1516]])

In [0]:
# avg_conf_matrix_val = np.zeros((num_classes, num_classes))

# for i in range(num_fold):
#     avg_conf_matrix_val += confusion_matrix(val_class_arr, val_pred_arr[i])
    
# avg_conf_matrix_val  /= num_fold

# avg_conf_matrix_val = avg_conf_matrix_val.astype('int')

# fig5 = sns.heatmap(avg_conf_matrix_val, annot=True, fmt="d")

# _ = fig5.set_xticklabels(class_names)

# fig1 = sns.barplot(class_names, calc_specificity(avg_conf_matrix_val))

# _ = fig1.set_title("Specificity values over all classes for validation set")

# _ = fig1.set(xlabel = "Class names", ylabel = "Specificity value")

# _ = fig1.set_ylim(0,1.2)

# for index, val in enumerate(calc_specificity(avg_conf_matrix_val)):
#     fig1.text(index,round(val,3) + 0.02 ,round(val,3), color='black', size = 12)