# Multioutput (Multilabel) Fruit Classification using CNN - Keras

## Background: Fruit Classification Use Cases
Beyond its educational aspects, fruit classification can have a significant practical value. With deep learning, we can create and explore multiple compelling use cases for this technology.

**Use Case 1: Sorting Ripe Fruits**

One practical application is sorting ripe fruits from unripe ones. For instance, an automated fruit sorting system powered by Deep Learning can efficiently separate ripe bananas from green ones, streamlining the packaging process.

**Use Case 2: Detecting Spoiled Fruits**

Deep Learning can aid in identifying spoiled fruits, such as detecting fungus presence on their skin. By automating this process, we can minimize waste and enhance the quality control in fruit distribution centers.

**Use Case 3: Inventory Management**

Fruit classification can assist in inventory management for grocery stores and warehouses. With the ability to automatically classify and count different fruits in stock, businesses can optimize their supply chain and ensure adequate stock levels.

**Use Case 4: Fruit Disease Detection**

By analyzing fruit images, Deep Learning models can spot signs of diseases or pests affecting the fruit's health. Early detection enables timely intervention, preventing the spread of diseases and safeguarding crop yields.

With Deep Learning's capabilities, fruit classification can replace traditional manual methods in the agricultural, horticultural, and botanical domains. By utilizing these models, we can bring efficiency and accuracy to various fruit-related processes.

Now that we are familiar with some practical uses of fruit classification with Deep Learning, let's begin building the fruit classification model! 😊

# 1. Introduction

Fruits are an essential part of our daily diets. In various production processes using fruits, sorting plays a crucial role, and AI systems built on accurate deep learning models are increasingly used to automate this task.

### Challenge: 
Identifying and grading fruits is a tough task due to their varying shapes, colors, and textures. The main challenges involve differentiating between different types of fruits and distinguishing among various varieties of the same fruit. Accurate fruit classification is vital for determining their prices in supermarkets.

### About this notebook
In this Kaggle notebook, we'll build a fruit classification model using the Fruits 360 dataset. The Keras-based deep learning model aims to classify eleven different types of fruit, including their variants.

You'll find step-by-step instructions with clear and concise code to create this powerful deep learning model.

Apart from the model code, we'll also perform a comprehensive EDA to explore the dataset, discuss how to set up the ConvNet-based classification model, and see how the same approach can be reused for similar product or item classification tasks.

# 1.1 Importing Required Libraries and Packages

In [None]:
# Libraries for file and directory operations
import os
import shutil
import glob
import random

# avoid warnings
import warnings
warnings.filterwarnings('ignore')

# Library for data processing
import numpy as np
import math
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
from PIL import Image

# Libraries for deep learning model
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop,Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Getting to know the dataset: Fruits 360

The Kaggle dataset used in this notebook is the [Fruits 360 dataset](https://www.kaggle.com/datasets/moltean/fruits) (Version: 2020.05.18.0) which contains images of different fruits and vegetables.

**Dataset License**: CC-BY-SA 4.0 license.

# 1.2 Importing & Loading the Fruits360 Dataset

In [None]:
# dataset path
dataset_path = '/kaggle/input/fruits/fruits-360_dataset/fruits-360'

# Define training and test folders
training_folder_path = "/kaggle/input/fruits/fruits-360_dataset/fruits-360/Training"
test_folder_path = "/kaggle/input/fruits/fruits-360_dataset/fruits-360/Test"

In [None]:
# Counting total labels
def count_labels(folder_path):
    label_count = 0
    for _, dirs, _ in os.walk(folder_path):
        label_count += len(dirs)
        break  # Only count the top-level directories and exit the loop
    return label_count

num_labels = count_labels(training_folder_path)
print(f"Number of labels (folders) in the training dataset: {num_labels}")

This means that there are 131 classes of fruits and vegetables within the dataset (including variants).

### Exploring labels within training folder

In [None]:
# Get a list of all labels (subfolder names) within the training folder
labels = [label for label in os.listdir(training_folder_path) if os.path.isdir(os.path.join(training_folder_path, label))]

# Sort the labels alphabetically
sorted_labels = sorted(labels)

# Print the list of labels
print("Sorted Labels:")
for label in sorted_labels:
    print(label)

From the above printed labels, we can see that a few fruits are available in more than one variety and have been separately labeled as belonging to a different class.
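
To see how many variants each fruit has, the labels can be grouped by their first word. The snippet below is a minimal sketch using a small illustrative sample of label names; `sample_labels` and `group_variants` are hypothetical names, and in the notebook you would pass `sorted_labels` instead.

```python
from collections import defaultdict

# Illustrative sample of label names; the real list comes from sorted_labels.
sample_labels = [
    "Apple Braeburn", "Apple Golden 1", "Apple Golden 2",
    "Banana", "Banana Red", "Grape Blue", "Grapefruit Pink",
]

def group_variants(labels):
    """Group labels by their first word, treated here as the base fruit name."""
    groups = defaultdict(list)
    for label in labels:
        base = label.split()[0]
        groups[base].append(label)
    return dict(groups)

variant_groups = group_variants(sample_labels)
for base, variants in sorted(variant_groups.items()):
    print(f"{base}: {len(variants)} variant(s)")
```

Grouping on the first word is only a heuristic: it keeps "Grape" and "Grapefruit" apart, but it would not handle a fruit whose base name spans two words.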

# 1.3 Sorting and filtering the dataset

The dataset contains both fruit and vegetable images, as seen from the class labels, so it needs to be filtered to keep only fruits. To achieve this, the relevant folders with fruit variants are first copied to the working directory.

In [None]:
# creating a folder for filtered dataset in the working directory

def create_folders(destination_path):
    # Create "filtered_dataset" folder directly
    os.makedirs(destination_path, exist_ok=True)

    # Create "training" and "test" folders within "filtered_dataset"
    training_path = os.path.join(destination_path, "training")
    test_path = os.path.join(destination_path, "test")
    os.makedirs(training_path, exist_ok=True)
    os.makedirs(test_path, exist_ok=True)

if __name__ == "__main__":
    destination_path = "/kaggle/working/filtered_dataset"
    create_folders(destination_path)

    print(f"filtered_dataset folder created successfully in {destination_path}")
    print(f"Training folder created successfully in {destination_path}.")
    print(f"Test folder created successfully in {destination_path}.")

# 1.4 Classification model idea and requirements

A generalized model would be simpler to set up and might be able to detect different fruits, but it would not perform well at identifying fruit variants.
Hence, for the model to distinguish between multiple varieties of the same fruit, we need a deeper and more complex model.

### Creating a subset of the main dataset with fruit labels

For this model, we will copy the following 11 fruit classes, with all their variant folders, to both the training and test datasets.
1. Apple
2. Banana
3. Cherry
4. Guava
5. Grape
6. Lychee
7. Pineapple
8. Rambutan
9. Raspberry
10. Redcurrant
11. Salak

In [None]:
def copy_selected_folders(source_path, destination_path, selected_fruits):
    if not os.path.exists(source_path):
        print("Source path does not exist.")
        return

    source_folders = os.listdir(source_path)
    for fruit_pattern in selected_fruits:
        fruit_pattern = fruit_pattern.lower()  # Make sure the fruit pattern is in lowercase
        fruit_folder_matches = [f for f in source_folders if f.lower().startswith(fruit_pattern)]

        if not fruit_folder_matches:
            print(f"No variants found for '{fruit_pattern}'.")
            continue

        for source_folder in fruit_folder_matches:
            fruit_name = source_folder
            source_folder = os.path.join(source_path, source_folder)
            destination_folder = os.path.join(destination_path, fruit_name)
            try:
                shutil.copytree(source_folder, destination_folder)
                print(f"Fruit '{fruit_name}' copied successfully in {destination_path}.")
            except FileExistsError:
                print(f"Fruit '{fruit_name}' already exists in the destination path.")
                
# copy fruit folders to training folder
if __name__ == "__main__":
    source_path = "/kaggle/input/fruits/fruits-360_dataset/fruits-360/Training"
    destination_path = "/kaggle/working/filtered_dataset/training"
    
    # Selecting the fruit names to copy all variants
    selected_fruits = ["Apple","Banana", "Cherry","Guava","Grape","Lychee","Pineapple","Rambutan","Raspberry","Redcurrant","Salak"] 
  
    copy_selected_folders(source_path, destination_path, selected_fruits)
    
# copy fruit folders to test folder
if __name__ == "__main__":
    source_path = "/kaggle/input/fruits/fruits-360_dataset/fruits-360/Test"
    destination_path = "/kaggle/working/filtered_dataset/test"
    
    # Selecting the fruit names to copy all variants
    selected_fruits = ["Apple","Banana", "Cherry","Guava","Grape","Lychee","Pineapple","Rambutan","Raspberry","Redcurrant","Salak"] 
  
    copy_selected_folders(source_path, destination_path, selected_fruits)
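
Note that matching folders with `startswith` is a loose prefix match, so a pattern such as "Grape" also selects folders whose names begin with "Grapefruit". The sketch below contrasts the loose match with a stricter whole-word variant; `match_prefix`, `match_whole_word`, and the sample labels are hypothetical names used only for illustration.

```python
def match_prefix(labels, pattern):
    """Loose match used above: any label that starts with the pattern."""
    pattern = pattern.lower()
    return [l for l in labels if l.lower().startswith(pattern)]

def match_whole_word(labels, pattern):
    """Stricter match: the pattern must be followed by a space or end the label."""
    pattern = pattern.lower()
    return [
        l for l in labels
        if l.lower() == pattern or l.lower().startswith(pattern + " ")
    ]

# Illustrative sample of folder names
sample_labels = ["Grape Blue", "Grape White", "Grapefruit Pink", "Guava"]

loose = match_prefix(sample_labels, "Grape")
strict = match_whole_word(sample_labels, "Grape")
print("Loose match: ", loose)
print("Strict match:", strict)
```

Whether the extra folders are wanted is a design choice; the stricter match simply makes the selection explicit.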

# 1.5 Exploring the Dataset

In [None]:
training_subset="/kaggle/working/filtered_dataset/training"
test_subset="/kaggle/working/filtered_dataset/test"

# function to count images in each folder
def count_images_per_label(folder_path):
    label_counts = {
        label: len(os.listdir(os.path.join(folder_path, label)))
        for label in os.listdir(folder_path)
        if os.path.isdir(os.path.join(folder_path, label))
    }

    return label_counts

if __name__ == "__main__":
    # Count images in training folders
    training_label_counts = count_images_per_label(training_subset)
    test_label_counts = count_images_per_label(test_subset)
    sorted_training_label_counts = sorted(training_label_counts.items(), key=lambda x: x[1], reverse=True)
    sorted_test_label_counts = sorted(test_label_counts.items(), key=lambda x: x[1], reverse=True)
print("Training Label Counts (sorted by count):")
for label, count in sorted_training_label_counts:
    print(f"{label}: {count}")
print("Test Label Counts (sorted by count):")
for label, count in sorted_test_label_counts:
    print(f"{label}: {count}")
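
The per-label counts can also be summarised numerically. The sketch below computes an imbalance ratio and "balanced" class weights (total / (number of classes × class count)). The counts are hypothetical and the function names are illustrative; in the notebook you would pass `training_label_counts`.

```python
def imbalance_ratio(label_counts):
    """Ratio between the largest and the smallest class."""
    counts = list(label_counts.values())
    return max(counts) / min(counts)

def balanced_class_weights(label_counts):
    """Weight each class by total / (n_classes * count)."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * c) for label, c in label_counts.items()}

# Hypothetical counts for illustration only
example_counts = {"Class A": 600, "Class B": 300, "Class C": 300}

ratio = imbalance_ratio(example_counts)
weights = balanced_class_weights(example_counts)
print(f"Imbalance ratio: {ratio:.1f}")
for label, weight in weights.items():
    print(f"{label}: weight {weight:.2f}")
```

If such weights were passed to Keras through the `class_weight` argument of `fit`, the keys would need to be integer class indices rather than label names.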

In [None]:
#counting number of images
def count_total_images(folder_path):
    total_images = 0
    for _, _, files in os.walk(folder_path):
        total_images += len(files)
    return total_images

total_images_count = count_total_images(dataset_path)
total_train_images_count = count_total_images(training_subset)
total_test_images_count = count_total_images(test_subset)

#Display total number of images in each folder of the dataset
print(f"Total number of images in the main dataset: {total_images_count}")
print(f"Total number of images in the training dataset: {total_train_images_count}")
print(f"Total number of images in the test dataset: {total_test_images_count}")

In [None]:
# Combine the training and test label counts into a single dictionary
combined_label_counts = {
    label: training_label_counts.get(label, 0) + test_label_counts.get(label, 0)
    for label in set(list(training_label_counts.keys()) + list(test_label_counts.keys()))
}

# Create a DataFrame to hold the combined fruit counts
df_fruit_counts = pd.DataFrame({"Fruit Labels": list(combined_label_counts.keys()), "Count": list(combined_label_counts.values())})

# Sort the DataFrame by the counts in descending order
df_fruit_counts = df_fruit_counts.sort_values(by="Count", ascending=False)

# Select the top 15 fruit labels by count
top_15_fruits = df_fruit_counts.head(15)

# Plot the horizontal bar chart using Seaborn
plt.figure(figsize=(10, 8))
sns.barplot(x="Count", y="Fruit Labels", data=top_15_fruits, palette="YlOrRd")
plt.xlabel("Count")
plt.ylabel("Fruit Labels")
plt.title("Top 15 Fruit Labels by Count")
plt.show()

In [None]:
BATCH_SIZE = 32
IMAGE_SIZE = 100
CHANNELS = 3
EPOCHS = 10

In [None]:
# training dataset pipeline
train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    training_subset,
    seed=42,
    shuffle=True,
    image_size=(IMAGE_SIZE,IMAGE_SIZE),
    batch_size=BATCH_SIZE
)

In [None]:
#print training labels
tr_class_names = train_dataset.class_names
tr_class_names

**Data Exploration**

In [None]:
#visualizing sample images from the dataset
plt.figure(figsize=(10, 10))
# one batch of 32 images is enough to fill the 5x5 grid
for image_batch, labels_batch in train_dataset.take(1):
    for i in range(25):
        ax = plt.subplot(5,5, i + 1)
        plt.imshow(image_batch[i].numpy().astype("uint8"))
        plt.title(tr_class_names[labels_batch[i]], fontsize=10)
        plt.axis("off")

plt.tight_layout()
plt.show()


# 1.6 Preparing dataset

Next, let us split the data in the training folder into train and validation sets. The train set is used to fit the model, while the validation set is used to evaluate performance during training and to detect overfitting.

In [None]:
# define a function to split the dataset (ds is a batched dataset, so sizes are counted in batches)
def get_dataset_partitions_tf(ds, train_split=0.8, val_split=0.2, shuffle=True, shuffle_size=10000):
    assert (train_split + val_split) == 1

    ds_size = len(ds)

    if shuffle:
        # avoid reshuffling on every pass, which would move batches
        # between the train and validation splits
        ds = ds.shuffle(shuffle_size, seed=1234, reshuffle_each_iteration=False)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)

    return train_ds, val_ds
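
Because both split sizes are truncated with `int()`, a few batches can end up in neither split. The following pure-Python sketch reproduces only the size arithmetic of the function above; `partition_sizes` is a hypothetical helper and the batch counts are made up for illustration.

```python
def partition_sizes(num_batches, train_split=0.8, val_split=0.2):
    """Return (train, val, unused) batch counts using the same int() truncation."""
    train_size = int(train_split * num_batches)
    val_size = int(val_split * num_batches)
    unused = num_batches - train_size - val_size
    return train_size, val_size, unused

sizes_even = partition_sizes(100)  # splits cleanly
sizes_odd = partition_sizes(104)   # one batch is left out of both splits
print("100 batches ->", sizes_even)
print("104 batches ->", sizes_odd)
```

Dropping a leftover batch is usually harmless, but it explains why the two printed lengths may not add up to the length of the full training dataset.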

In [None]:
train_ds, val_ds = get_dataset_partitions_tf(train_dataset)

In [None]:
#print length of each set
print("Training dataset length",len(train_ds))
print("Validation dataset length",len(val_ds))

In [None]:
# Optimize the input pipelines: cache both sets, shuffle the training batches, and prefetch
train_ds = train_ds.cache().shuffle(100).prefetch(buffer_size=tf.data.AUTOTUNE)
# the validation set is only evaluated, so it does not need shuffling
val_ds = val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

In [None]:
# resize images to IMAGE_SIZE x IMAGE_SIZE and rescale pixel values to the [0, 1] range
resize_and_rescale = tf.keras.Sequential([
  layers.Resizing(IMAGE_SIZE, IMAGE_SIZE),
  layers.Rescaling(1./255),
])

In [None]:
# prefetching the training data to optimize pipeline
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# 2. Building a multi-class classification model in Keras

In [None]:
# Defining the shape of the input data for the CNN
# (the batch dimension is left as None so the model accepts batches of any size)
input_shape = (None, IMAGE_SIZE, IMAGE_SIZE, CHANNELS)

# Number of outputs
n_classes = len(tr_class_names)
n_classes

# 2.1 Defining the CNN

Let us now define our CNN model with the Sequential API in Keras using the input_shape specified in the previous step.
This model consists of five Conv2D and MaxPooling2D layer pairs, followed by a Flatten layer, a Dense layer with dropout regularization, and finally a Dense layer with softmax activation for multi-class classification.
Here, n_classes is the number of output classes.
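
To get a feel for how the feature maps shrink through the network, the spatial sizes can be traced by hand. The sketch below assumes the Keras defaults for these layers (3x3 convolutions with stride 1 and no padding, 2x2 max pooling with stride 2) and a 100x100 input; the helper names are hypothetical.

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution without padding."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool

def trace_feature_maps(image_size=100, filters=(32, 32, 64, 64, 128)):
    """Return (height, width, channels) after each Conv2D + MaxPooling2D pair."""
    shapes = []
    size = image_size
    for n_filters in filters:
        size = max_pool(conv_valid(size))
        shapes.append((size, size, n_filters))
    return shapes

shapes = trace_feature_maps()
flattened_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
for shape in shapes:
    print(shape)
print("Flatten output units:", flattened_units)
```

Under these assumptions the last block reduces each image to a 1x1x128 feature map, so the Flatten layer passes 128 values to the Dense layer. The same numbers should appear in the output of `model.summary()`.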

In [None]:
# CNN model
model = Sequential([
    resize_and_rescale,
    # the input shape is supplied through model.build() below
    layers.Conv2D(32, kernel_size = (3,3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(32, kernel_size = (3,3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    
    layers.Conv2D(64, kernel_size =(3,3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(64, kernel_size =(3,3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(128, kernel_size = (3,3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(n_classes, activation='softmax'),
])

model.build(input_shape=input_shape)

In [None]:
# Review the model summary
model.summary()

### Choice of Optimizer
To build this fruit image classifier, we use RMSprop instead of Adam. RMSprop scales each parameter's learning rate by a running average of recent squared gradients, which can help when gradient magnitudes differ widely across parameters.
Our task is multi-class (each image has exactly one label), the classes vary in visual complexity, and the number of images per class is uneven, so an adaptive learning rate is a reasonable choice.
Adam builds on the same idea and adds momentum, so it is also worth trying; which optimizer converges better on this dataset is best decided by experiment.

In [None]:
# specifying the optimizer and model metrics
model.compile(
    optimizer='rmsprop',
    # the final Dense layer already applies softmax, so the loss receives probabilities, not logits
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=['accuracy']
)
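
The `from_logits` setting of the loss has to agree with the model's last layer: a softmax layer outputs probabilities, not logits. The pure-Python sketch below uses hypothetical values to show what the arithmetic looks like when probabilities are treated as logits and softmax is applied a second time. It illustrates the math only, not Keras internals, and some Keras versions warn about this combination.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_index])

# Hypothetical softmax output for a 3-class problem; the true class is index 0.
probs = [0.90, 0.05, 0.05]

loss_matching = cross_entropy(probs, 0)             # probabilities used as-is
loss_mismatched = cross_entropy(softmax(probs), 0)  # softmax applied a second time

print(f"Loss on probabilities:     {loss_matching:.4f}")
print(f"Loss after second softmax: {loss_mismatched:.4f}")
```

The second softmax flattens the distribution, so even a confident, correct prediction is penalised as if it were uncertain.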

# 2.2 Model Training

In [None]:
# saving the model training history
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=EPOCHS
)

# 2.3 Check for overfitting

If the deep learning CNN model gets too complex, it is likely to suffer from overfitting.
Overfitting means that the model begins to memorize the training data instead of learning general patterns.
A typical indicator is a widening gap between the two curves: training accuracy keeps rising (and training loss keeps falling) while validation accuracy stalls or drops (and validation loss rises). High training and validation accuracy together (for example, both above 90%) is not by itself a sign of overfitting; the gap between them is what points to poor generalization on unseen images.

Let us plot the accuracy and loss curves to visualize the model training process.

In [None]:
#Plotting train & validation loss
plt.figure()
plt.plot(history.history["loss"],label = "Train Loss", color = "black")
plt.plot(history.history["val_loss"],label = "Validation Loss", color = "blue", linestyle="dashed")
plt.title("Model Losses", color = "darkred", size = 15)
plt.legend()
plt.show()

In [None]:
#Plotting train & validation accuracy
plt.figure()
plt.plot(history.history["accuracy"],label = "Train Accuracy", color = "black")
plt.plot(history.history["val_accuracy"],label = "Validation Accuracy", color = "blue", linestyle="dashed")
plt.title("Model Accuracy", color = "darkred", size = 15)
plt.legend()
plt.show()

### Remarks:
* Since the training accuracy and validation accuracy follow a similar trend and both increase over the epochs, the model shows no signs of overfitting.
* Similarly, the training loss and validation loss decrease consistently, indicating a good fit between the model and the data.
* The model can be retrained for more epochs or with a different batch size to check whether the accuracy improves.
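
The comparison of the two curves can also be expressed as a number. The sketch below computes the per-epoch gap between training and validation accuracy from a history dictionary. The two histories are hypothetical, the helper names are illustrative, and the 0.05 tolerance is an arbitrary choice; in the notebook you would pass `history.history`.

```python
def generalization_gap(history_dict):
    """Per-epoch difference between training and validation accuracy."""
    return [
        round(train - val, 4)
        for train, val in zip(history_dict["accuracy"], history_dict["val_accuracy"])
    ]

def looks_overfit(history_dict, tolerance=0.05):
    """Flag a run whose final-epoch gap exceeds a chosen tolerance."""
    return generalization_gap(history_dict)[-1] > tolerance

# Hypothetical histories for illustration only
healthy = {"accuracy": [0.70, 0.85, 0.95], "val_accuracy": [0.68, 0.84, 0.94]}
overfit = {"accuracy": [0.70, 0.90, 0.99], "val_accuracy": [0.68, 0.75, 0.76]}

print("Healthy gaps:", generalization_gap(healthy), "->", looks_overfit(healthy))
print("Overfit gaps:", generalization_gap(overfit), "->", looks_overfit(overfit))
```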

# 3. Predicting unseen images from test dataset

# 3.1 Creating a test data pipeline

In [None]:
# test dataset pipeline
test_dataset = tf.keras.preprocessing.image_dataset_from_directory(
   test_subset,
    seed=42,
    shuffle=True,
    image_size=(IMAGE_SIZE,IMAGE_SIZE),
    batch_size=BATCH_SIZE
)

In [None]:
#print test labels
ts_class_names = test_dataset.class_names
ts_class_names

# 3.2 Predicting a sample image

In [None]:
# Fetching model predictions for sample image in test dataset
plt.figure(figsize=(3, 3))
for images_batch, labels_batch in test_dataset.take(1):

    first_image = images_batch[0].numpy().astype('uint8')
    first_label = labels_batch[0].numpy()
    print("first image to predict")
    plt.imshow(first_image)
    print("actual label:",ts_class_names[first_label])

    batch_prediction = model.predict(images_batch)
    print("predicted label:",tr_class_names[np.argmax(batch_prediction[0])])

# 3.3 Batch prediction on unseen images from test dataset

In [None]:
# Defining prediction function for testing images
def predict(model, img):
    # use the image passed in, rather than relying on the global `images` and `i`
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = tf.expand_dims(img_array, 0)  # add a batch dimension

    predictions = model.predict(img_array)

    # the model's output indices follow the class order of the training dataset
    predicted_class = tr_class_names[np.argmax(predictions[0])]
    confidence = round(100 * (np.max(predictions[0])), 2)
    return predicted_class, confidence

We will run a few sets of predictions and visualize the results to get a better idea of how our model is performing.

### Prediction set 1


In [None]:
plt.figure(figsize=(15, 15))

# Iterate over the batches and then the images to display their predictions
batch_size = 32
for images, labels in test_dataset.take(12):
    for i in range(batch_size):
        if i >= len(images):
            break

        ax = plt.subplot(6, 6, i + 1)
        image = tf.image.resize(images[i], (100, 100))
        plt.imshow(image.numpy().astype("uint8"))
        predicted_class, confidence = predict(model, images[i].numpy())
        actual_class = ts_class_names[labels[i]]
        plt.title(f"Actual: {actual_class},\n Predicted: {predicted_class}.\n Confidence: {confidence}%", fontsize=8)
        plt.axis("off")

    # Stop once a full batch has been plotted
    if i >= batch_size - 1:
        break

# Hide any empty subplots
for i in range(i + 1, batch_size):
    plt.subplot(6,6, i + 1)
    plt.axis("off")

plt.tight_layout()
plt.show()

### Prediction set 2


In [None]:
plt.figure(figsize=(15, 15))

# Iterate over the batches and then the images to display their predictions
batch_size = 32
for images, labels in test_dataset.take(15):
    for i in range(batch_size):
        if i >= len(images):
            break

        ax = plt.subplot(6, 6, i + 1)
        image = tf.image.resize(images[i], (100, 100))
        plt.imshow(image.numpy().astype("uint8"))
        predicted_class, confidence = predict(model, images[i].numpy())
        actual_class = ts_class_names[labels[i]]
        plt.title(f"Actual: {actual_class},\n Predicted: {predicted_class}.\n Confidence: {confidence}%", fontsize=8)
        plt.axis("off")

    # Stop once a full batch has been plotted
    if i >= batch_size - 1:
        break

# Hide any empty subplots
for i in range(i + 1, batch_size):
    plt.subplot(6,6, i + 1)
    plt.axis("off")

plt.tight_layout()
plt.show()
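
Besides inspecting the plots, the predictions can be tallied to see which classes get confused with each other. The sketch below is a minimal pure-Python example with hypothetical label lists. `summarize_predictions` is an illustrative helper, and in the notebook the two lists would be collected from `actual_class` and `predicted_class` inside the loops above.

```python
from collections import Counter

def summarize_predictions(actual, predicted):
    """Return overall accuracy and a Counter of (actual, predicted) mistakes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mistakes = Counter(
        (a, p) for a, p in zip(actual, predicted) if a != p
    )
    correct = len(actual) - sum(mistakes.values())
    accuracy = correct / len(actual) if actual else 0.0
    return accuracy, mistakes

# Hypothetical labels for illustration only
actual = ["Apple Red 1", "Apple Red 1", "Banana", "Cherry 1", "Apple Red 2"]
predicted = ["Apple Red 1", "Apple Red 2", "Banana", "Cherry 1", "Apple Red 2"]

accuracy, mistakes = summarize_predictions(actual, predicted)
print(f"Accuracy: {accuracy:.2f}")
for (a, p), n in mistakes.most_common():
    print(f"{a} predicted as {p}: {n}")
```

Sorting the mistakes by frequency makes it easy to spot which variant pairs the model mixes up most often.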

# 4. Concluding Notes

* We created a fairly accurate fruit classifier in Keras.
* The model accurately predicted most of the considered fruit categories. 
* The model seems to mislabel some of the fruit varieties with high confidence. Since these varieties are similar in color and shape, they can be challenging for the model to classify. This can be improved by:
    * Training the model with additional data using data augmentation in Keras
    * Experimenting with adding Batch normalization to the CNN layers to improve and stabilize the learning process
    * Adding more layers to the neural network
    * Adding L2 regularization
    * Experimenting with the dropout rate, learning rate

Apart from the above, the model does a really good job of classifying different fruits and their variants such as raspberry, pineapple, redcurrant, grapefruit, banana, etc.

* Fruit image classification has numerous practical applications, from sorting ripe fruits to detecting diseases. It offers efficiency, accuracy, and potential for optimizing inventory management in various industries. Exploring these use cases can create awareness of potential AI projects in the agricultural domain.

* Apart from fruit classification, Deep Learning can also be utilized for broader plant species detection. By analyzing various plant attributes, such as leaves, flowers, or fruits, the technology can identify different plant species in their natural habitat, which can significantly benefit ongoing botany research, conservation efforts, and ecological studies. This demonstrates the versatility and potential applications of Deep Learning models in agriculture and horticulture.

Also, if you're interested in getting hands-on experience on a similar exciting Deep Learning project, check out the **[Plant Species classification](https://bit.ly/43W6l3C)** on ProjectPro.

If you liked this notebook please remember to upvote. Thanks and Happy coding! 😊