# 🐕 End-to-End Multi-Class Dog Breed Classification.

This notebook builds a multi-class image classifier using TensorFlow 2.0 and TensorFlow Hub.

## 1. Problem

Identifying the breed of a dog given an image of a dog.

"When I am sitting at a cafe and I see a cute dog, I would love to know its breed."


## 2. Data

https://www.kaggle.com/c/dog-breed-identification/data

The data we are using is from Kaggle's Dog Breed Identification competition.


## 3. Evaluation

The evaluation is based on a file containing prediction probabilities for each dog breed for each test image.

https://www.kaggle.com/c/dog-breed-identification/overview
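
As a rough illustration of how such a file of probabilities could be scored, the sketch below assumes the competition metric is multi-class log loss (check the overview page above to confirm). The predictions are made up, and `multiclass_log_loss` is a hypothetical helper, not part of any competition tooling.

```python
import numpy as np

NUM_BREEDS = 120  # the dataset comprises 120 breeds

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (n_images,)
    y_pred: predicted probabilities, shape (n_images, n_classes)
    """
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    true_class_probs = y_pred[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_probs)))

# Made-up example: 3 test images, every breed given equal probability
uniform_preds = np.full((3, NUM_BREEDS), 1.0 / NUM_BREEDS)
true_breeds = np.array([0, 57, 119])  # arbitrary class indices

loss = multiclass_log_loss(true_breeds, uniform_preds)
print(f"Log loss of a uniform guess: {loss:.4f}")
```

Under that assumption, a model that guesses every breed with equal probability scores ln(120), roughly 4.79, which gives a baseline that any trained model should beat.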

## 4. Features

- We are dealing with images (unstructured data), so it is probably best to use deep learning/transfer learning.

- We have a training set and a test set of images of dogs. Each image has a filename that is its unique id. The dataset comprises 120 breeds of dogs.

- There are 10,000+ images in the training set (these images have labels).
- There are 10,000+ images in the test set (these images have no labels).

- File descriptions
  - train.zip - the training set, you are provided the breed for these dogs
  - test.zip - the test set, you must predict the probability of each breed for each image
  - sample_submission.csv - a sample submission file in the correct format
  - labels.csv - the breeds for the images in the train set

## Getting Workspace Ready

- Import TensorFlow 2.x
- Import TensorFlow Hub
- Make sure we are using GPU

In [None]:
import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

try:
    with tf.device('/device:GPU:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
        c = tf.matmul(a, b)
        print("Matrix multiplication on GPU successful:\n", c.numpy())
except RuntimeError as e:
    print(f"Error running on GPU: {e}")

In [None]:
# Import necessary tools
import tensorflow_hub as hub
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print("Hub version:", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

## Getting our data ready (turning it into Tensors)

As with all machine learning models, our data must be in numerical format, so that is what we will do first: turn our images into Tensors (numerical representations, similar to multi-dimensional NumPy arrays).


In [None]:
# Check out the labels of our data
import pandas as pd
labels_csv = pd.read_csv("/content/drive/MyDrive/Dog Vision/labels.csv")
print(labels_csv.describe())

In [None]:
labels_csv.head()

In [None]:
# how many images are there of each breed
labels_csv["breed"].value_counts()

In [None]:
import matplotlib.pyplot as plt

breed_counts = labels_csv["breed"].value_counts()

# Create the bar plot
plt.figure(figsize=(20, 10))  # Adjust figure size as needed
breed_counts.plot(kind='bar')
plt.title('Number of Images per Dog Breed')
plt.xlabel('Dog Breed')
plt.ylabel('Number of Images')
plt.xticks(rotation=90)  # Rotate x-axis labels for readability
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()


In [None]:
labels_csv["breed"].value_counts().median()

In [None]:
# Viewing the image
from IPython.display import Image
Image("/content/drive/MyDrive/Dog Vision/train/0021f9ceb3235effd7fcde7f7538ed62.jpg")

### Getting images and their labels

Let's get a list of all of our image file pathnames.

In [None]:
labels_csv.head()

In [None]:
# Create pathnames from image IDs
filenames = ["/content/drive/MyDrive/Dog Vision/train/"+fname for fname in labels_csv["id"]+".jpg"]
filenames[:10]

In [None]:
Image(filenames[11])

In [None]:
# Check whether the number of filenames matches the number of actual image files
import os
if len(os.listdir("/content/drive/MyDrive/Dog Vision/train")) == len(filenames):
  print("Filenames match actual amount of files")
else:
  print("Filenames do not match actual amount of files")

Now that we have our training image filepaths in a list, let's prepare our labels.

In [None]:
import numpy as np
labels = labels_csv["breed"]  # labels_csv["breed"].to_numpy() would do the same as the next line
labels = np.array(labels)
labels

In [None]:
len(labels)

In [None]:
# Check whether the number of labels matches the number of filenames, to catch any missing data
if len(labels) == len(filenames):
  print("Number of labels matches number of filenames")
else:
  print("Number of labels does not match number of filenames")

In [None]:
# Find the unique label values
unique_breeds = np.unique(labels)
len(unique_breeds)

In [None]:
# Turn a single label into an array of booleans
print(labels[0])
labels[0] == unique_breeds

In [None]:
# Turn every label into a boolean array
boolean_labels = [label == unique_breeds for label in labels]
boolean_labels[:2]

In [None]:
len(boolean_labels)
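
To make the boolean encoding concrete, here is a minimal sketch on toy data. The breed names and the `toy_*` variables are made up for illustration; the comparison itself is the same one used for `boolean_labels` above.

```python
import numpy as np

# Toy stand-ins for `labels` and `unique_breeds` (hypothetical values)
toy_labels = np.array(["boxer", "beagle", "pug", "beagle"])
toy_unique = np.unique(toy_labels)  # sorted: beagle, boxer, pug

# Same comparison as above: one boolean array per label
toy_boolean = [label == toy_unique for label in toy_labels]

# Each boolean array has exactly one True, at the index of its breed
first = toy_boolean[0]
index = int(np.argmax(first))   # position of the True value
recovered = toy_unique[index]   # map the index back to the breed name
as_ints = first.astype(int)     # the same label as 0s and 1s

print(first, index, recovered, as_ints)
```

Because `np.unique` returns the breeds sorted, the position of the single `True` value doubles as a class index, and `np.argmax` turns a boolean label back into a breed name.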

### Creating our own validation set

Our dataset does not come with a validation set, so we will need to create one.

In [None]:
# Set up X and y variables
X = filenames  # inputs for training and validation: the image file paths
y = boolean_labels  # target labels, stored as boolean arrays
# Each array has a single True value at the position of the image's breed.

We are going to start off experimenting with ~1000 images and then increase as needed.

In [None]:
# Set number of images to use for experimenting
NUM_IMAGES = 1000 # @param {"type":"slider","min":1000,"max":10222,"step":100}

In [None]:
# Let's split our data into train and validation sets
from sklearn.model_selection import train_test_split
# Split the first NUM_IMAGES samples into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES], y[:NUM_IMAGES],
                                                  test_size=0.2, random_state=42)


In [None]:
len(X_train), len(y_train), len(X_val), len(y_val)
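
For intuition about the sizes printed above, the following sketch reproduces an 80/20 split with plain NumPy instead of scikit-learn. It is an illustration only: the shuffling order will not match what `train_test_split` produces, and `VAL_FRACTION` is a name introduced here.

```python
import numpy as np

NUM_IMAGES = 1000   # same experiment size as above
VAL_FRACTION = 0.2  # mirrors test_size=0.2

rng = np.random.default_rng(42)         # seeded so the split is repeatable
shuffled = rng.permutation(NUM_IMAGES)  # shuffled indices 0..999

n_val = int(NUM_IMAGES * VAL_FRACTION)  # 200 validation samples
val_idx = shuffled[:n_val]
train_idx = shuffled[n_val:]            # remaining 800 for training

print(len(train_idx), len(val_idx))
```

With 1000 images and a validation fraction of 0.2, this gives 800 training and 200 validation samples, with no index appearing in both sets.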

## Preprocessing Images (turning images into tensors)

To preprocess our images into Tensors, we are going to write a function that does a few things:

1. Take an image filepath as input.
2. Use TensorFlow to read the file and save it to a variable, `image`.
3. Turn our `image` (a JPEG) into Tensors.
4. Normalize our image (convert color channel values from 0-255 to 0-1).
5. Resize the `image` to a shape of (224, 224). The specific size depends on which model you choose to train; certain models have input size requirements.
6. Return the modified `image`.
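
Step 4 can be sketched in plain NumPy on a tiny made-up image. This is only an illustration of the scaling; in TensorFlow, `tf.image.convert_image_dtype` performs the equivalent rescaling when converting 8-bit integer images to floats.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit channel values (0-255)
fake_image = np.array([[[0, 128, 255], [64, 64, 64]],
                       [[255, 255, 255], [0, 0, 0]]], dtype=np.uint8)

# Step 4: scale the channel values to floats between 0 and 1
normalized = fake_image.astype(np.float32) / 255.0

print(normalized.dtype, normalized.min(), normalized.max())
print(normalized.shape)  # still (height, width, color channels)
```

The shape is unchanged by normalization; only the data type and value range differ.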



In [None]:
# Convert an image to a NumPy array
# .shape gives (height, width, color channels)
# 3 color channels means the image is RGB
from matplotlib.pyplot import imread
image = imread(filenames[42])
image.shape

In [None]:
image.max(), image.min()

In [None]:
image[:2]

In [None]:
# turn image into tensor
tf.constant(image)[:2]

#### Now that we have seen what an image looks like as a tensor, let's build a function to preprocess them.

In [None]:
# Define the image size (height and width)
IMG_SIZE = 224

# Create a function for preprocessing images
def process_image(image_path):
  """
  Takes an image file path and turns the image into a Tensor.
  """
  # Read in the image file
  image = tf.io.read_file(image_path)
  # Turn the JPEG image into a numerical Tensor with 3 color channels (Red, Green, Blue)
  image = tf.image.decode_jpeg(image, channels=3)
  # Convert the color channel values from 0-255 to 0-1 (normalization)
  image = tf.image.convert_image_dtype(image, tf.float32)
  # Resize the image to a shape of (IMG_SIZE, IMG_SIZE), i.e. (224, 224)
  image = tf.image.resize(image, size=[IMG_SIZE, IMG_SIZE])
  # Return the modified image
  return image



## Turning our data into batches

<strong>Our approach will be to use mini-batch training, where each batch contains 32 images.</strong>

Reason:
- If we try to process all 10,000+ images in one go, they may not all fit in memory (even on a GPU).
- Processing everything at once slows down our training process.
- It is inefficient.

So that's why we will train in batches of 32 images at a time (you can adjust the batch size manually).
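
A back-of-the-envelope calculation shows the scale involved. It assumes every image is stored as a 224 x 224 x 3 `float32` tensor and that the training set holds 10,222 images (taken from the upper bound of the `NUM_IMAGES` slider above). It counts only the input tensors, not the model's weights or intermediate activations.

```python
IMG_SIZE = 224            # height and width after resizing
CHANNELS = 3              # RGB
BYTES_PER_FLOAT32 = 4
NUM_TRAIN_IMAGES = 10222  # assumption: upper bound of the NUM_IMAGES slider
BATCH_SIZE = 32

bytes_per_image = IMG_SIZE * IMG_SIZE * CHANNELS * BYTES_PER_FLOAT32
all_images_gb = NUM_TRAIN_IMAGES * bytes_per_image / 1e9
one_batch_mb = BATCH_SIZE * bytes_per_image / 1e6

print(f"One image:  {bytes_per_image:,} bytes")
print(f"All images: {all_images_gb:.2f} GB")
print(f"One batch:  {one_batch_mb:.2f} MB")
```

Under these assumptions the full training set needs about 6.15 GB for inputs alone, while a single batch of 32 images needs about 19.27 MB.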

To use TensorFlow effectively, we need our data in the form of Tensor tuples that look like this:
`(image, label)`

In [None]:
# Create simple function to return a tuple -> (image, label)
def get_image_label(image_path, label):
  '''
  Takes an image file path name and the associated label,
  processes the image and returns a tuple of (image, label).
  '''
  image = process_image(image_path)
  return image, label

In [None]:
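# Demo: process_image takes only a path, so use get_image_label to get the (image, label) tuple
get_image_label(X[42], tf.constant(y[42]))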
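
One common way to group `(image, label)` tuples into batches is `tf.data.Dataset`. The sketch below is an illustration under stated assumptions, not necessarily the approach this notebook takes next: it uses small synthetic arrays (`fake_images`, `fake_labels`) in place of real image files so that it runs on its own.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 32

# Synthetic stand-ins (assumptions): 70 tiny "images" and one-hot boolean labels
fake_images = np.random.rand(70, 8, 8, 3).astype(np.float32)
fake_labels = np.eye(120, dtype=bool)[np.random.randint(0, 120, size=70)]

# Pair each image with its label, then group the pairs into batches
dataset = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels))
dataset = dataset.batch(BATCH_SIZE)

# Collect the shape of every batch of images and labels
batch_shapes = [(tuple(images.shape), tuple(labels.shape))
                for images, labels in dataset]
for image_shape, label_shape in batch_shapes:
    print(image_shape, label_shape)
```

With 70 samples and a batch size of 32, the dataset yields two full batches and a final batch of 6, since the last batch holds whatever remains.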