<a href="https://colab.research.google.com/github/mayursrt/dog-breed-identification/blob/main/dog_breed_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dog Breed Identification
This notebook builds an end-to-end multi-class image classifier using TensorFlow 2.x and TensorFlow Hub.

1. Problem
Identifying the breed of a dog given an image of a dog.

When I'm sitting at the cafe and I take a photo of a dog, I want to know what breed of dog it is.

2. Data
The data we're using is from Kaggle's dog breed identification competition. You can get the data here:

https://www.kaggle.com/c/dog-breed-identification/data

3. Evaluation
Submissions are evaluated from a file containing prediction probabilities for every dog breed for each test image.

https://www.kaggle.com/c/dog-breed-identification/overview/evaluation
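
As far as I know, the evaluation page linked above scores the submitted probabilities with multi-class log loss; treat that as an assumption and confirm it on the page. A minimal NumPy sketch of that metric on invented probabilities (the function name `multiclass_log_loss` is hypothetical):

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    # Renormalise so each row still sums to 1 after clipping
    probs = probs / probs.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return float(-np.log(probs[rows, y_true]).mean())

# Toy example: 3 images, 3 breeds (values are invented for illustration)
y_true = np.array([0, 2, 1])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
])
loss = multiclass_log_loss(y_true, probs)
print(loss)
```

Lower is better: confident predictions for the wrong breed are penalised heavily, so well-calibrated probabilities matter as much as the top guess.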

4. Features
Some information about the data:

* We're dealing with images (unstructured data) so it's probably best we use deep learning/transfer learning.
* There are 120 breeds of dogs (this means there are 120 different classes).
* There are more than 10,000 images in the training set (these images have labels).
* There are more than 10,000 images in the test set (these images have no labels, because we'll want to predict them).


### Getting the workspace ready
First, import the packages needed for the task:
* TensorFlow 2.x
* TensorFlow Hub

Also check if you're using a GPU.


In [None]:
# import packages and check their versions
import tensorflow as tf
import tensorflow_hub as hub
print('TensorFlow version:', tf.__version__)
print('TensorFlow Hub version:', hub.__version__)

In [None]:
# check GPU availability
print("GPU", "available" if tf.config.list_physical_devices("GPU") else "not available")

**NOTE:** Training will be impractically slow without a GPU. If using Google Colab, go to Runtime > Change runtime type > select GPU.

### Getting our data ready (turning into Tensors)
With all machine learning models, our data has to be in numerical format, so the first step is turning our images into Tensors (numerical representations).
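
As a rough illustration of what "numerical representation" means here, the sketch below builds a tiny synthetic RGB image as a NumPy array and scales it to the 0 to 1 range. The array and its 2x2 size are invented for illustration; real photos are decoded from JPEG files and are much larger.

```python
import numpy as np

# A made-up 2x2 RGB image: height x width x colour channels, values 0-255
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# Scale pixel values from 0-255 integers to 0-1 floats, a common model input range
image_scaled = image.astype(np.float32) / 255.0

print(image.shape)          # (height, width, channels)
print(image_scaled.dtype)
```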

Let's start by accessing our data and checking out the labels.

In [None]:
# Check out the labels of the data
import pandas as pd
labels_csv = pd.read_csv("drive/MyDrive/Dog Breed Identification using Tensorflow/data/labels.csv")
print(labels_csv.describe())
print(labels_csv.head())

In [None]:
labels_csv['breed'].value_counts()

In [None]:
labels_csv['breed'].value_counts().plot.bar(figsize=(20, 10));

In [None]:
# Median number of images per breed, as a quick check on how balanced the classes are
labels_csv['breed'].value_counts().median()

In [None]:
# View an image
from IPython.display import Image
Image('drive/MyDrive/Dog Breed Identification using Tensorflow/data/train/0021f9ceb3235effd7fcde7f7538ed62.jpg')



### Getting images and their labels
Get the list of all image file pathnames.

In [None]:
# Create pathnames from image ids
filenames = ['drive/MyDrive/Dog Breed Identification using Tensorflow/data/train/' + fname + '.jpg' for fname in labels_csv['id']]
filenames[:10]

In [None]:
# Check that the number of filenames matches the number of image files
# (a mismatch can be caused by an incomplete upload of the files)
import os
if len(os.listdir('drive/MyDrive/Dog Breed Identification using Tensorflow/data/train/')) == len(filenames):
  print('Number of filenames matches the number of files. You can proceed.')
else:
  print('Number of filenames does not match the number of files. Please try re-uploading the data directory.')
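
Comparing counts can pass even when the wrong files are present, so a stricter check compares the names themselves. The sketch below uses a hypothetical helper, `find_missing_files`, and a temporary directory with made-up ids in place of the real `train/` folder.

```python
import os
import tempfile

def find_missing_files(image_ids, directory, extension=".jpg"):
    """Return the ids whose image file is not present in `directory`."""
    present = set(os.listdir(directory))
    return [image_id for image_id in image_ids
            if image_id + extension not in present]

# Demo on a throwaway directory (ids are invented for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    for image_id in ["aaa", "bbb"]:
        open(os.path.join(tmp_dir, image_id + ".jpg"), "w").close()
    missing = find_missing_files(["aaa", "bbb", "ccc"], tmp_dir)

print(missing)
```

On the real data, the equivalent call would pass `labels_csv['id']` and the training directory, and an empty list would mean every labelled image was uploaded.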

Preparing the labels

In [None]:
import numpy as np
# transforming labels so that they can be used.
labels = np.array(labels_csv['breed'])  ## can also use labels = labels_csv['breed'].to_numpy()
labels

In [None]:
#check length of labels
len(labels)

In [None]:
# Check that the number of labels matches the number of filenames
if len(labels) == len(filenames):
  print('Number of labels matches the number of filenames. You can proceed.')
else:
  print('Number of labels does not match the number of filenames. Check the data directory.')

In [None]:
#find unique label values
unique_breeds = np.unique(labels)
unique_breeds

In [None]:
#len of unique breeds
len(unique_breeds)

In [None]:
#turning a label into a boolean array
print(labels[0])
labels[0] == unique_breeds

In [None]:
# Likewise, turn every label into a boolean array
labels_bool = [label == unique_breeds for label in labels]
labels_bool[:2]

In [None]:
# Turning a boolean array into integers (may not be needed)

print(labels[0])
print(np.where(unique_breeds == labels[0]))
print(labels_bool[0].argmax())
print(labels_bool[0].astype(int))
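
The same boolean labels can be built without a Python loop by letting NumPy broadcast the comparison. A small sketch on made-up breed names (the toy arrays are not from the dataset):

```python
import numpy as np

toy_labels = np.array(["beagle", "pug", "beagle", "collie"])
toy_breeds = np.unique(toy_labels)   # sorted unique names

# Compare a (4, 1) column against a (3,) row: the result has shape (4, 3)
toy_bool = toy_labels[:, None] == toy_breeds

# Integer class index per label, recovered from the boolean rows
toy_index = toy_bool.argmax(axis=1)

print(toy_breeds)
print(toy_bool.astype(int))
print(toy_index)
```

Each row contains exactly one `True`, at the position of that label's breed in the sorted list of unique breeds.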

### Creating a validation set
Since the dataset does not come with a validation set, we need to create one so that we can evaluate the model on data it was not trained on.

We can use `train_test_split` for this job.
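
Conceptually, the split shuffles the samples and holds a fraction back for validation. The sketch below shows that idea with plain NumPy on toy arrays; it illustrates the concept and is not scikit-learn's implementation. The 20% validation fraction and the seed of 42 are assumptions.

```python
import numpy as np

# Toy stand-ins for the filenames (X) and the labels (y)
X = np.array([f"img_{i}.jpg" for i in range(10)])
y = np.arange(10)

rng = np.random.default_rng(42)      # fixed seed for a repeatable shuffle
order = rng.permutation(len(X))      # shuffled indices
n_val = int(0.2 * len(X))            # hold back 20% for validation

val_idx, train_idx = order[:n_val], order[n_val:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(len(X_train), len(X_val))
```

With scikit-learn installed, `train_test_split(X, y, test_size=0.2, random_state=42)` from `sklearn.model_selection` does the same job in one call.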

In [None]:
# split into X and y
