<a href="https://colab.research.google.com/github/jerrvonewing/dog-breed-prediction/blob/main/Dog_Breed_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Setting up Kaggle using the Kaggle API

In [None]:
# Run this cell and select your kaggle.json file
# from your Kaggle account settings page

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"jerrvonewing","key":"<redacted>"}'}

In [None]:
# Install Kaggle API client
!pip install -q kaggle

In [None]:
# The Kaggle API client looks for the token in ~/.kaggle, so create that directory and copy the file into it
!mkdir -p ~/.kaggle/
!cp kaggle.json ~/.kaggle/

# Change permissions to avoid warning on Kaggle startup
!chmod 600 ~/.kaggle/kaggle.json
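
The same three shell steps can be expressed in Python, which may help when the notebook runs somewhere shell magics are unavailable. This is a minimal sketch: the function name `install_kaggle_token` and its `kaggle_dir` parameter are hypothetical and not part of the Kaggle client.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir=None):
    """Copy a Kaggle API token into kaggle_dir and restrict its permissions.

    Hypothetical helper for illustration; kaggle_dir defaults to ~/.kaggle,
    the location the notebook's shell commands use.
    """
    token_path = Path(token_path)
    kaggle_dir = Path(kaggle_dir) if kaggle_dir is not None else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # same effect as `mkdir -p`
    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, target)                 # same effect as `cp`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)       # same effect as `chmod 600`
    return target
```

Calling `install_kaggle_token("kaggle.json")` after the upload cell would place the token in `~/.kaggle/kaggle.json`, readable and writable by the current user only.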

##To store the data, create a new directory and make it the current working directory

In [None]:
# Create directory and change cwd
!mkdir dog_dataset
%cd dog_dataset

/content/dog_dataset


##Search Kaggle for the required dataset using the search option (-s) with the term 'dogbreedidfromcomp'. The Kaggle CLI can also search other resources, such as competitions and kernels (notebooks).

In [5]:
# Searching for the dataset by title
!kaggle datasets list -s dogbreedidfromcomp

ref                                title                    size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------  ----------------------  -----  -------------------  -------------  ---------  ---------------  
catherinehorng/dogbreedidfromcomp  dog-breed-id-from-comp  691MB  2020-06-26 03:09:05           1596          6  0.1764706        


##After finding the dataset, download it into the Colab notebook using the ref shown in the search results.

In [8]:
# Download the dataset, then return to the parent directory
!kaggle datasets download catherinehorng/dogbreedidfromcomp
%cd ..

dogbreedidfromcomp.zip: Skipping, found more recently modified local copy (use --force to force download)
/content


##Unzip the data file and remove any files we won't use

In [None]:
# Unzip the downloaded archive into dog_dataset, then delete the files we won't use
!unzip dog_dataset/dogbreedidfromcomp.zip -d dog_dataset
!rm dog_dataset/dogbreedidfromcomp.zip
!rm dog_dataset/sample_submission.csv  # shipped with the dataset, not needed here
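
For reference, the unzip-and-clean step can also be sketched with the standard-library `zipfile` module. The helper name `extract_and_clean` and its parameters are hypothetical, chosen only for this illustration.

```python
import zipfile
from pathlib import Path


def extract_and_clean(zip_path, dest_dir, unwanted=("sample_submission.csv",)):
    """Extract zip_path into dest_dir, delete the archive, and drop unwanted files.

    Hypothetical helper; returns the sorted names left in dest_dir.
    """
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)                # same effect as `unzip ... -d dest_dir`
    zip_path.unlink()                               # remove the archive itself
    for name in unwanted:
        (dest_dir / name).unlink(missing_ok=True)   # missing_ok requires Python 3.8+
    return sorted(p.name for p in dest_dir.iterdir())
```

With the paths used in this notebook, the call would be `extract_and_clean("dog_dataset/dogbreedidfromcomp.zip", "dog_dataset")`.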

##Import required libraries

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm import tqdm
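from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
# Import every Keras component from tensorflow.keras so the image utilities,
# model, layers, and optimizer all come from the same package
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam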


##Load labels into dataframe

In [14]:
# Read the labels.csv file 
labels_all = pd.read_csv("dog_dataset/labels.csv")
print(labels_all)
labels_all.head()

                                     id                     breed
0      000bec180eb18c7604dcecc8fe0dba07               boston_bull
1      001513dfcb2ffafc82cccf4d8bbaba97                     dingo
2      001cdf01b096e06d78e9e5112d419397                  pekinese
3      00214f311d5d2247d5dfe4fe24b2303d                  bluetick
4      0021f9ceb3235effd7fcde7f7538ed62          golden_retriever
...                                 ...                       ...
10217  ffd25009d635cfd16e793503ac5edef0                    borzoi
10218  ffd3f636f7f379c51ba3648a9ff8254f            dandie_dinmont
10219  ffe2ca6c940cddfee68fa3cc6c63213f                  airedale
10220  ffe5f6d8e2bff356e9482a80a6e29aac        miniature_pinscher
10221  fff43b07992508bc822f33d8ffd902ae  chesapeake_bay_retriever

[10222 rows x 2 columns]


                                 id             breed
0  000bec180eb18c7604dcecc8fe0dba07       boston_bull
1  001513dfcb2ffafc82cccf4d8bbaba97             dingo
2  001cdf01b096e06d78e9e5112d419397          pekinese
3  00214f311d5d2247d5dfe4fe24b2303d          bluetick
4  0021f9ceb3235effd7fcde7f7538ed62  golden_retriever


##Here we get the count per class using the value_counts function

In [16]:
# Count the number of images per breed, most frequent first
breeds_all = labels_all["breed"]
breed_counts = breeds_all.value_counts()
breed_counts.head()

scottish_deerhound      126
maltese_dog             117
afghan_hound            116
entlebucher             115
bernese_mountain_dog    114
Name: breed, dtype: int64
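
Because `value_counts` sorts by frequency in descending order, the most common classes can be read straight off the front of the result. The sketch below uses a small made-up frame (`toy` and its breed values are hypothetical, not taken from `labels.csv`) to show the counts, the top-k class names, and the relative share per class.

```python
import pandas as pd

# Hypothetical toy labels, standing in for labels_all["breed"]
toy = pd.DataFrame(
    {"breed": ["dingo", "borzoi", "dingo", "airedale", "dingo", "borzoi"]}
)

# Absolute count per class, sorted from most to least frequent
counts = toy["breed"].value_counts()

# Names of the two most frequent classes
top_two = counts.head(2).index.tolist()

# Fraction of all rows that belongs to each class
share = toy["breed"].value_counts(normalize=True)

print(counts)
print(top_two)
```

Applied to this notebook, `breed_counts.head(3).index.tolist()` would yield the three most frequent breeds without typing their names by hand.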

In this example, we will work with only the 3 most frequent breeds.

In [17]:
# Select the 3 most frequent breeds (limited by available compute)
CLASS_NAMES = ['scottish_deerhound','maltese_dog','afghan_hound']
labels = labels_all[labels_all['breed'].isin(CLASS_NAMES)]
# Renumber the rows 0..n-1; the original row numbers are kept in an 'index' column
labels = labels.reset_index()
labels.head()

   index                                id               breed
0      9  0042188c895a2f14ef64a918ed9c7b64  scottish_deerhound
1     12  00693b8bc2470375cc744a6391d397ec         maltese_dog
2     79  01e787576c003930f96c966f9c3e1d44  scottish_deerhound
3     90  022b34fd8734b39995a9f38a4f3e7b6b         maltese_dog
4    146  0379145880ad3978f9b80f0dc2c03fba        afghan_hound
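
Resetting the index matters because later code looks rows up by position with `labels['id'][i]`, which only works when the index runs from 0 to n-1. As a sketch on made-up data (the `toy` frame and `keep` list are hypothetical), passing `drop=True` renumbers the rows without keeping the old row numbers as an extra `index` column:

```python
import pandas as pd

# Hypothetical toy labels, standing in for labels_all
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "breed": ["dingo", "borzoi", "airedale", "dingo"],
    }
)
keep = ["dingo", "airedale"]

# Boolean mask from isin() selects the rows; reset_index(drop=True)
# renumbers them 0..n-1 and discards the original row numbers
subset = toy[toy["breed"].isin(keep)].reset_index(drop=True)

print(subset)
```

Whether to keep the old row numbers is a design choice: the notebook's plain `reset_index()` preserves them, which can help trace a sample back to its row in `labels.csv`.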


##Because this is a classification task, we need to one-hot encode the target values. We then read each image, convert it into an array, and normalize the pixel values.

In [19]:
# Pre-allocate the image array: (num_images, height, width, channels)
X_data = np.zeros((len(labels), 224, 224, 3), dtype='float32')
# One-hot encode the breed labels; columns follow the order of CLASS_NAMES
Y_data = label_binarize(labels['breed'], classes=CLASS_NAMES)

# Load each image at 224x224, convert it to an array, and scale pixels to [0, 1]
for i in tqdm(range(len(labels))):
  img = image.load_img('dog_dataset/train/%s.jpg' % labels['id'][i], target_size=(224, 224))
  img = image.img_to_array(img)
  X_data[i] = img / 255.0

# Print the shape and size of the image array and of the one-hot labels
print('\nTrain Images shape: ',X_data.shape, ' size: {:,}'.format(X_data.size))
print('One-hot encoded output shape: ', Y_data.shape, ' size: {:,}'.format(Y_data.size))

100%|██████████| 359/359 [00:01<00:00, 220.56it/s]


Train Images shape:  (359, 224, 224, 3)  size: 54,039,552
One-hot encoded output shape:  (359, 3)  size: 1,077
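
To make the two transformations concrete, here is a minimal NumPy sketch of one-hot encoding and pixel scaling on made-up inputs. It is not how `label_binarize` is implemented; `toy_labels` and `pixels` are hypothetical values chosen for illustration. The last line works out the memory needed for an array with the shape reported above.

```python
import numpy as np

# Class order defines the column order of the one-hot matrix
class_names = ["scottish_deerhound", "maltese_dog", "afghan_hound"]
# Hypothetical labels for four images
toy_labels = ["maltese_dog", "afghan_hound", "scottish_deerhound", "maltese_dog"]

# Map each class name to its column, then pick the matching identity-matrix rows
class_index = {name: k for k, name in enumerate(class_names)}
indices = np.array([class_index[name] for name in toy_labels])
one_hot = np.eye(len(class_names), dtype="int64")[indices]

# Scale 8-bit pixel values into [0, 1], as the loading loop does
pixels = np.array([0, 128, 255], dtype="uint8")
scaled = pixels.astype("float32") / 255.0

# Bytes needed for 359 float32 images of shape 224 x 224 x 3
n_bytes = 359 * 224 * 224 * 3 * np.dtype("float32").itemsize

print(one_hot)
print(scaled)
print(f"{n_bytes:,} bytes")
```

At 4 bytes per value, the 54,039,552 values in `X_data` occupy roughly 216 MB, which illustrates why the notebook restricts itself to 3 breeds.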



