# Frame the problem and look at the big picture 

Chewy wants to develop a dog door that opens only for your pet. As a first step, they would like a system that can recognize a dog's breed: the door should stay closed if the dog is not the same breed as yours. They have provided us with over 20,000 images of 120 different breeds of dogs. They want high precision so that no other dogs (or creatures mistaken for dogs) can get through the door. Their current products require the pet to have a chip or a special collar, and they would like a new system that requires no external devices. This is a proof-of-concept problem: if they can't predict breed, they may not even be able to recognize one individual dog properly.

This will be a supervised, offline classification problem. We will use precision as the main metric because we want to keep the number of false positives low while keeping true positives up. (It is acceptable for the door to fail to open now and then, but it is not acceptable for it to open for the wrong breed.)
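
To make the metric concrete, here is a minimal sketch of how precision is computed from counts. The function name and the counts are hypothetical and are not results from this project.

```python
def precision(true_positives, false_positives):
    """Fraction of door openings that were for the correct breed."""
    opened = true_positives + false_positives
    if opened == 0:
        return 0.0  # the door never opened, so there is nothing to score
    return true_positives / opened


# Hypothetical counts: the door opened 100 times, 90 of them for the right breed
example = precision(true_positives=90, false_positives=10)
```

Missed openings (false negatives) do not appear in the formula, which is why a door that occasionally stays shut does not hurt this metric.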

Since our system is sort of a starting point for their ultimate goal, we will shoot for 85% precision as well as a high accuracy (85%). Since this is more of a proof of concept though, we have some leeway with precision.

The MNIST problems are similar, so we can reuse some code from those notebooks.

There are no experts available to us.

The manual solution is the existing chip/collar system.
A person telling breeds apart by eye would look at body shape, face shape, and color.

We assume the data is labelled correctly and is not missing significant portions.
We will resize all images to the same size, although this can distort images whose aspect ratio differs from the target.
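
One way to avoid that distortion is to scale the longer side to the target and pad the rest ("letterboxing"). This notebook does not do this; the sketch below only computes the geometry, and `letterbox_geometry` is a hypothetical helper.

```python
def letterbox_geometry(width, height, target=227):
    """Return (new_width, new_height, pad_left, pad_top) for an aspect-preserving resize."""
    # Scale so that the longer side exactly fills the target square
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    # Center the resized image; the remaining border would be filled with padding
    pad_left = (target - new_width) // 2
    pad_top = (target - new_height) // 2
    return new_width, new_height, pad_left, pad_top


# A hypothetical 500x375 photo keeps its 4:3 shape inside the 227x227 canvas
geometry = letterbox_geometry(500, 375)
```

The trade-off is that padded borders carry no information, so the dog occupies fewer pixels than with a plain stretch.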



In [None]:
!curl vision.stanford.edu/aditya86/ImageNetDogs/images.tar --output images

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  756M  100  756M    0     0  31.3M      0  0:00:24  0:00:24 --:--:-- 28.7M


In [None]:
!tar -xf images

In [None]:
!hostname

d098b0299a1e


In [1]:
import os
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import gc

from tensorflow import keras

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_val_predict, cross_val_score, GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from sklearn import set_config
set_config(display='diagram')

from matplotlib import style
style.use('dark_background')

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# def load_images_and_labels(categories, fpath):
#     img_lst=[]
#     labels=[]
#     for index, category in enumerate(categories):
#         for image_name in os.listdir(fpath+"/"+category):
#             img = cv2.imread(fpath+"/"+category+"/"+image_name)
#             img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
#             img_array = Image.fromarray(img, 'RGB')
#             resized_img = img_array.resize((227, 227))
#             img_lst.append(np.array(resized_img))
#             labels.append(index)

#     images = np.array(img_lst).astype(np.float32)/255
#     labels = np.array(labels).astype(np.int32)
    
#     return images, labels

def display_rand_images(images, labels, names):
    plt.figure(1 , figsize = (19 , 10))
    for i in range(9):
        r = np.random.randint(0 , images.shape[0] , 1)
        
        plt.subplot(3 , 3 , i+1)
        plt.subplots_adjust(hspace = 0.3 , wspace = 0.3)
        plt.imshow(images[r[0]])
        
        plt.title('Dog breed : {} ({})'.format(labels[r[0]], names[labels[r[0]]]))
        plt.xticks([])
        plt.yticks([])
        
    plt.show()

def display_images(data, gray=False):
    plt.figure(1 , figsize = (19 , 10))
    for i in range(9):
        r = np.random.randint(0 , data.shape[0] , 1)
        plt.subplot(3 , 3 , i+1)
        plt.subplots_adjust(hspace = 0.3 , wspace = 0.3)
        if gray:
            plt.imshow(data[r[0]], cmap='gray')
        else:
            plt.imshow(data[r[0]])
    plt.show()

In [None]:
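def load_images_and_labels(categories, fpath):
    # Count the images in the selected categories so the array can be preallocated
    num_imgs = sum(len(os.listdir(os.path.join(fpath, category))) for category in categories)

    # Channels-last layout (N, 227, 227, 3), as expected by matplotlib and Keras
    imgs = np.empty((num_imgs, 227, 227, 3), dtype=np.float32)
    labels = np.empty(num_imgs, dtype=np.int32)
    i = 0
    for index, category in enumerate(categories):
        for image_name in os.listdir(os.path.join(fpath, category)):
            img = cv2.imread(os.path.join(fpath, category, image_name))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img_array = Image.fromarray(img, 'RGB').resize((227, 227))
            # Write into the preallocated array (np.append returns a copy and
            # would leave imgs untouched) and scale pixel values to [0, 1]
            imgs[i] = np.asarray(img_array, dtype=np.float32) / 255
            labels[i] = index
            i += 1

    return imgs, labels

# Get the data 


In [None]:
# fpath = os.getcwd() + "/archive/images/Images"
# fpath = os.getcwd() + "/Images"
fpath = os.getcwd() + "/images"
# Keep only the first 20 breed folders (memory constraints); this matches the
# 20-unit output layer of the networks below
categories = sorted(os.listdir(fpath))[:20]
# Folder names look like "<id>-<breed>"; split once so hyphenated breed names stay whole
names = [cat.split('-', 1)[-1] for cat in categories]

# This will take 2-3 minutes to run
images, labels = load_images_and_labels(categories, fpath)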

In [None]:
len(images), len(labels), len(names)

In [None]:
type(images), type(labels), type(names)

In [None]:
display_rand_images(images, labels, names)

In [None]:
# Shuffle the data
n = np.arange(images.shape[0])
np.random.seed(42)
np.random.shuffle(n)
images = images[n]
labels = labels[n]

In [None]:
X_train_, X_test_, y_train_, y_test_ = train_test_split(images, labels, test_size=0.2, random_state=42) # THIS IS THE DATA TO PROCESS

In [None]:
X_val, X_test_, y_val, y_test_ = train_test_split(X_test_, y_test_, test_size=0.2, random_state=42)
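
The two `train_test_split` calls above hold out 20% of the data and then keep 20% of that holdout as the final test set. A quick sketch of the resulting proportions (the helper name is hypothetical):

```python
def split_fractions(holdout=0.2, test_share_of_holdout=0.2):
    """Fractions of the full dataset that end up in train, validation and test."""
    train = 1.0 - holdout
    test = holdout * test_share_of_holdout
    val = holdout - test
    return train, val, test


train_frac, val_frac, test_frac = split_fractions()
```

So roughly 80% of the images are used for training, 16% for validation, and only 4% for the final test, which makes the test score fairly noisy.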

### Data for exploration

In [None]:
df = pd.DataFrame(X_train_.reshape(X_train_.shape[0], -1))
df["labels"] = y_train_
df


In [None]:
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
# val_set, test_set = train_test_split(test_set, test_size=0.5, random_state=42)

In [None]:
X_train, y_train = train_set.drop('labels', axis=1), train_set['labels']

In [None]:
train_subset = train_set.sample(n=1000, random_state=42)
X_train_subset, y_train_subset = train_subset.drop('labels', axis=1), train_subset['labels']

# Explore the data


There wasn't much exploration to do because every feature is a pixel intensity on the same fixed scale (0-255 in the raw images).
There are no missing values because the data are images, and we resized all of them to the same shape to make processing easier. 

We did look into the correlations between the color channels and the target. We also looked at the grayscaled versions of the images.

We performed PCA on both grayscaled and colored images and unsurprisingly, many of the values are not necessary to replicate the images. 

However, many images cannot be as accurately recovered because some images:
  * are not of just one dog
  * may not have the dog as the focus of the picture
  * may have many humans in the image
  * have odd angles

These reasons will likely affect our model when we get there.
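
The PCA observation above (most values are not needed to reproduce the images) can be illustrated on synthetic data. This is a NumPy-only sketch, not the sklearn `PCA` used in the notebook; `components_for_variance` and the synthetic matrix are assumptions made for the example.

```python
import numpy as np


def components_for_variance(X, threshold=0.99):
    """Smallest number of principal components whose cumulative explained variance reaches threshold."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured by each principal component
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_components, len(cumulative))


# Synthetic stand-in for image data: 50 "pixels" driven by only 5 underlying factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = factors @ mixing + 0.01 * rng.normal(size=(200, 50))

n_needed = components_for_variance(X_demo)
```

Because neighbouring pixels in real photos are strongly correlated, the same effect shows up in the dog images: far fewer components than pixels are needed to keep 99% of the variance.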

### Need to use the dataframe for some of these

In [None]:
# plot the first row of the train set as an image
plt.imshow(X_train.iloc[0, :].values.reshape(227, 227, 3));
plt.title('Dog breed : {} ({})'.format(y_train.iloc[0], names[y_train.iloc[0]]));

In [None]:
# This is about expected
# X_train_.describe()

In [None]:
X_train.info()

In [None]:
X_train.head()

In [None]:
# Look for the useless features
c = pd.DataFrame({'breed': X_train.corrwith(y_train)})
plt.figure(figsize=(12,50))
sns.heatmap(c);

### Look at each color channel

In [None]:
# get each color channel
r = X_train[X_train.columns[0::3]]
g = X_train[X_train.columns[1::3]]
b = X_train[X_train.columns[2::3]]
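
The stride-3 slicing above works because flattening a channels-last image interleaves the channels as R, G, B, R, G, B, and so on. A tiny sketch with a made-up 2x2 image:

```python
import numpy as np

# A tiny 2x2 RGB image in which every pixel is (R, G, B) = (1, 2, 3)
tiny = np.tile(np.array([1, 2, 3]), (2, 2, 1))  # shape (2, 2, 3)

# Flattening keeps the channel axis fastest-varying: R, G, B, R, G, B, ...
flat = tiny.reshape(-1)

red = flat[0::3]    # every third value starting at 0 is a red value
green = flat[1::3]  # starting at 1: green
blue = flat[2::3]   # starting at 2: blue
```

This only holds for the `(height, width, channels)` layout; with a channels-first array the three channels would instead be three contiguous blocks.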

In [None]:
c = pd.DataFrame({'breed': r.corrwith(y_train)})
plt.figure(figsize=(12,50))
plt.title('Red channel')
sns.heatmap(c);

In [None]:
c = pd.DataFrame({'breed': g.corrwith(y_train)})
plt.figure(figsize=(12,50))
plt.title('Green channel')
sns.heatmap(c);

In [None]:
c = pd.DataFrame({'breed': b.corrwith(y_train)})
plt.figure(figsize=(12,50))
plt.title('Blue channel')
sns.heatmap(c);

## PCA

In [None]:
pca = PCA(n_components=0.99)
pca.fit(X_train, y_train)

plt.figure()
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

In [None]:
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)

In [None]:
# Let's see how good the recovery is
display_images(X_recovered.reshape(X_train.shape[0], 227, 227, 3));

In [None]:
# Find importances of each feature
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
sorted(zip(rf.feature_importances_, X_train.columns), reverse=True)

## Try Grayscale

In [None]:
# plot the first image
plt.imshow(cv2.cvtColor(X_train.iloc[0,:].values.reshape(227, 227, 3)*255, cv2.COLOR_RGB2GRAY), cmap='gray');
plt.title('Dog breed : {} ({})'.format(y_train.iloc[0], names[y_train.iloc[0]]));

In [None]:
ims = X_train.iloc[:,:].values.reshape(X_train.shape[0], 227, 227, 3)*255

In [None]:
X_subset_gray = np.array([cv2.cvtColor(im, cv2.COLOR_RGB2GRAY) for im in ims])
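
For reference, OpenCV's `COLOR_RGB2GRAY` conversion is a weighted sum of the three channels (Y = 0.299 R + 0.587 G + 0.114 B). The NumPy sketch below reproduces that formula for a whole batch at once; `rgb_to_gray_numpy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Luma weights used for RGB -> gray: green contributes most, blue least
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def rgb_to_gray_numpy(images):
    """Weighted sum over the last (channel) axis of an array shaped (..., 3)."""
    return images.astype(np.float32) @ LUMA_WEIGHTS


# A white pixel stays at full intensity, a pure red pixel becomes fairly dark
white_gray = rgb_to_gray_numpy(np.ones((1, 2, 2, 3)))
red_gray = rgb_to_gray_numpy(np.array([[[1.0, 0.0, 0.0]]]))
```

Since the weights sum to 1, the output stays in the same value range as the input.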

In [None]:
display_images(X_subset_gray, gray=True)

In [None]:
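# Reuse X_train's index so that corrwith lines each row up with the right label in y_train
X_subset_gray = pd.DataFrame(X_subset_gray.reshape(X_subset_gray.shape[0], -1), index=X_train.index)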

In [None]:
# Look at the useless features
c = pd.DataFrame({'breed': X_subset_gray.corrwith(y_train)})
plt.figure(figsize=(12,50))
sns.heatmap(c);

## Grayscale PCA

In [None]:
pca_gray = PCA(n_components=0.99)
pca_gray.fit(X_subset_gray, y_train)

plt.figure()
plt.plot(pca_gray.explained_variance_ratio_.cumsum())
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

In [None]:
X_gray_reduced = pca_gray.fit_transform(X_subset_gray)
X_gray_recovered = pca_gray.inverse_transform(X_gray_reduced)

In [None]:
# Let's see how good the recovery is
display_images(X_gray_recovered.reshape(X_train.shape[0], 227, 227), gray=True);

In [None]:
# Find importances of each feature
rf_gray = RandomForestClassifier()
rf_gray.fit(X_subset_gray, y_train)
sorted(zip(rf_gray.feature_importances_, X_subset_gray.columns), reverse=True)

# Prepare the data 

In [None]:
def rgb_to_gray(X, reshape=True):
    """
    Convert a batch of 227x227 RGB images to grayscale.

    X is a DataFrame or ndarray of flattened or (N, 227, 227, 3) images.
    With reshape=True the result has shape (N, 227, 227, 1), ready for a CNN;
    with reshape=False it is flattened to (N, 227*227) for sklearn estimators.
    """
    if isinstance(X, pd.DataFrame):
        print("changing dataframe to grayscale")
        X = X.values
    elif isinstance(X, np.ndarray):
        print("changing ndarray to grayscale")
    else:
        print("No changes")
        return X

    # cv2.cvtColor accepts float32 input but not float64
    ims = X.reshape(X.shape[0], 227, 227, 3).astype(np.float32)
    gray_ims = np.array([cv2.cvtColor(im, cv2.COLOR_RGB2GRAY) for im in ims])
    if reshape:
        return gray_ims.reshape(X.shape[0], 227, 227, 1)
    return gray_ims.reshape(X.shape[0], -1)



In [None]:
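preprocessor_gray = Pipeline([
    # PCA needs 2-D input, so keep the grayscale images flattened (reshape=False)
    ('grayscale', FunctionTransformer(rgb_to_gray, kw_args={'reshape': False})),
    ('pca', PCA(n_components=0.99)), # TODO: Look at IncrementalPCA, may need for full dataset
])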

In [None]:
preprocessor_color = Pipeline([
    ('pca', PCA(n_components=0.99)), # TODO: Look at IncrementalPCA, may need for full dataset
])

# Short-list

In [None]:
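def build_dog_network(input_shape=X_train_[0].shape, n_hidden_layers=5, n_neurons=100, n_classes=len(categories)):
    """
    Build a plain fully connected network: the image is flattened, passed through
    n_hidden_layers ReLU layers of n_neurons each, and classified with softmax.
    """
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=input_shape, name='input'))
    # Dense layers work on vectors, so flatten the image first
    model.add(keras.layers.Flatten(name='flatten'))
    for i in range(n_hidden_layers):
        model.add(keras.layers.Dense(n_neurons, activation="relu", name=f'hidden_{i}'))
    # One output unit per breed; softmax turns the outputs into class probabilities
    model.add(keras.layers.Dense(n_classes, activation="softmax", name='output'))
    model.summary()
    # Labels are integer class indices, so use the sparse variant of the loss
    model.compile(optimizer="Nadam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model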

In [None]:
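def build_alexnet(is_gray=False):
    """
    Build an AlexNet-style CNN for 227x227 inputs (1 channel if is_gray, else 3).
    """
    # Import from tensorflow.keras so the layers match the `keras` imported above
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D,MaxPooling2D,Dense,Flatten,Dropout,BatchNormalization
    if is_gray:
        in_shape = (227,227,1)
    else:
        in_shape = (227,227,3)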

    model=Sequential()

    #1 conv layer
    model.add(Conv2D(filters=96,kernel_size=(11,11),strides=(4,4),padding="valid",activation="relu",input_shape=in_shape))

    #1 max pool layer
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))

    model.add(BatchNormalization())

    #2 conv layer
    model.add(Conv2D(filters=256,kernel_size=(5,5),strides=(1,1),padding="valid",activation="relu"))

    #2 max pool layer
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))

    model.add(BatchNormalization())

    #3 conv layer
    model.add(Conv2D(filters=384,kernel_size=(3,3),strides=(1,1),padding="valid",activation="relu"))

    #4 conv layer
    model.add(Conv2D(filters=384,kernel_size=(3,3),strides=(1,1),padding="valid",activation="relu"))

    #5 conv layer
    model.add(Conv2D(filters=256,kernel_size=(3,3),strides=(1,1),padding="valid",activation="relu"))

    #3 max pool layer
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))

    model.add(BatchNormalization())


    model.add(Flatten())

    #1 dense layer (input_shape is only needed on the first layer of the model)
    model.add(Dense(4096,activation="relu"))

    model.add(Dropout(0.4))

    model.add(BatchNormalization())

    #2 dense layer
    model.add(Dense(4096,activation="relu"))

    model.add(Dropout(0.4))

    model.add(BatchNormalization())

    #3 dense layer
    model.add(Dense(1000,activation="relu"))

    model.add(Dropout(0.4))

    model.add(BatchNormalization())

    #output layer: 20 units, one per breed (we only kept the first 20 breeds)
    model.add(Dense(20,activation="softmax"))

    model.summary()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    return model
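
Because every convolution and pooling layer in `build_alexnet` uses `"valid"` padding, the feature maps shrink quickly. The sketch below traces the spatial size through the layers using the standard formula `(size - kernel) // stride + 1`; the layer names are labels chosen for this example.

```python
def out_size(size, kernel, stride):
    """Spatial output size of a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1


# (name, kernel, stride) for each size-changing layer, in model order
layers = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
    ("pool3", 3, 2),
]

size = 227
sizes = {}
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes[name] = size

# The last conv layer has 256 filters, so Flatten sees size * size * 256 values
flattened = size * size * 256
```

The final feature map is only 2x2, so `Flatten` passes 1,024 values to the first dense layer. Batch normalization and dropout do not change these sizes.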

## Gray-scale

### Random Forest

In [None]:
# TODO: issue with preprocessor
# rf_gray_pipeline = Pipeline([
#     ('preprocessor', preprocessor_gray),
#     ('classifier', RandomForestClassifier())
# ]).fit(X_train, y_train)

In [None]:
# rf_gray_scores = cross_val_score(rf_gray_pipeline, X_train, y_train, scoring="accuracy")
# rf_gray_scores

### Extra Trees

In [None]:
# TODO: Issue with preprocessor
# et_gray_pipeline = Pipeline([
#         ('preprocessor', preprocessor_gray),
#         ('classifier', ExtraTreesClassifier())
# ]).fit(X_train, y_train)

In [None]:
# et_gray_scores = cross_val_score(et_gray_pipeline, X_train, y_train, scoring="accuracy")
# et_gray_scores

### Gray Neural Nets

#### Our Own Neural Net

In [None]:
# nn_grey_clf = keras.wrappers.scikit_learn.KerasClassifier(build_dog_network, input_shape=X_train_.shape[1]//3, n_hidden_layers=5, n_neurons=100)

In [None]:
# nn_grey = Pipeline([
#     ('grayscale', FunctionTransformer(rgb_to_gray)),
#     ('classifier', nn_grey_clf),
# ]).fit(X_train_, y_train_)

In [None]:
# nn_cust_grey_scores = cross_val_score(nn_grey, X_train_, y_train_, scoring="accuracy")
# nn_cust_grey_scores

#### AlexNet for Grey Scale

In [None]:
alexnet_grey = keras.wrappers.scikit_learn.KerasClassifier(build_alexnet, is_gray=True)

In [None]:
alexnet_grey.fit( rgb_to_gray(X_train_), y_train_, epochs=100)

In [None]:
val_pred_grey = alexnet_grey.predict(rgb_to_gray(X_val))
val_pred_grey.shape

In [None]:
plt.figure(1, figsize=(19, 10))

for i in range(9):
    r = np.random.randint(0, X_val.shape[0], 1)
    
    plt.subplot(3, 3, i+1)
    plt.subplots_adjust(hspace = 0.3, wspace = 0.3)
    
    plt.imshow(X_val[r[0]])
    plt.title('Actual = {} ({}), \nPredicted = {} ({})'.format(y_val[r[0]], names[y_val[r[0]]], 
                                                               val_pred_grey[r[0]], names[val_pred_grey[r[0]]]))
    plt.xticks([]) , plt.yticks([])

plt.show()

In [None]:
score_grey = sum(y_val == val_pred_grey) / X_val.shape[0]
score_grey
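
The score above is accuracy, while the stated goal is precision. A minimal NumPy sketch of per-breed precision follows; the helper and the demo labels are hypothetical. In the notebook itself, `precision_score` (already imported from sklearn) with `average='macro'` would be the natural route.

```python
import numpy as np


def per_class_precision(y_true, y_pred, n_classes):
    """Precision for each class; classes that are never predicted get 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions = np.zeros(n_classes)
    for c in range(n_classes):
        predicted_c = y_pred == c
        if predicted_c.any():
            # Of everything predicted as class c, how much really was class c?
            precisions[c] = np.mean(y_true[predicted_c] == c)
    return precisions


# Made-up labels for three breeds
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])
demo_precisions = per_class_precision(y_true_demo, y_pred_demo, n_classes=3)
macro_precision = demo_precisions.mean()
```

For the door, the precision of the owner's breed is the number that matters most: it is the share of openings that were for the right kind of dog.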

## Color

### Random Forest

In [None]:
rf_color_pipeline = Pipeline([
    ('classifier', RandomForestClassifier())
]).fit(X_train, y_train)

In [None]:
rf_color_scores = cross_val_score(rf_color_pipeline, X_train, y_train, scoring="accuracy")
rf_color_scores

### Extra Trees

In [None]:
et_color_pipeline = Pipeline([
        ('classifier', ExtraTreesClassifier())
]).fit(X_train, y_train)

In [None]:
et_color_scores = cross_val_score(et_color_pipeline, X_train, y_train, scoring="accuracy")
et_color_scores

### Color Neural Nets

#### Our Own Neural Net

In [None]:
# color_nn = keras.wrappers.scikit_learn.KerasClassifier(build_dog_network, input_shape=(227,227,3), n_hidden_layers=5, n_neurons=100)

In [None]:
# color_nn.fit(X_train_, y_train_, epochs=10)

In [None]:
# nn_cust_color_scores = cross_val_score(color_nn, X_train_, y_train_, scoring="accuracy")
# nn_cust_color_scores

#### AlexNet for colored

In [None]:
alexnet_color = keras.wrappers.scikit_learn.KerasClassifier(build_alexnet)

In [None]:
alexnet_color.fit(X_train_, y_train_, epochs=100)

In [None]:
val_pred_color = alexnet_color.predict(X_val)
val_pred_color.shape

In [None]:
plt.figure(1, figsize=(19, 10))

for i in range(9):
    r = np.random.randint(0, X_val.shape[0], 1)
    
    plt.subplot(3, 3, i+1)
    plt.subplots_adjust(hspace = 0.3, wspace = 0.3)
    
    plt.imshow(X_val[r[0]])
    plt.title('Actual = {} ({}), \nPredicted = {} ({})'.format(y_val[r[0]], names[y_val[r[0]]], 
                                                               val_pred_color[r[0]], names[val_pred_color[r[0]]]))
    plt.xticks([]) , plt.yticks([])

plt.show()

In [None]:
score_color = sum(y_val == val_pred_color) / X_val.shape[0]
score_color

### Good Models

In [None]:
et_color_pipeline.get_params().keys()

In [None]:
# the AlexNet color is likely the better choice

# Fine-tune your models

In [None]:
# Try fine tuning the extra trees
param_grid = [
    { 
      'classifier__n_estimators': [180, 200, 220],
      'classifier__max_depth': [10,12, 15], 
      'classifier__max_leaf_nodes': [10,12, 15], 
    }
]

search = GridSearchCV(et_color_pipeline, param_grid, cv=3, scoring='accuracy', verbose=1, n_jobs=-1);
search.fit(X_train, y_train);
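
To get a feel for the cost of this search, the sketch below counts the parameter combinations in the grid above and the number of model fits with 3-fold cross-validation. The variable names are chosen for this example.

```python
from itertools import product

# Same values as the grid above
param_grid_demo = {
    'classifier__n_estimators': [180, 200, 220],
    'classifier__max_depth': [10, 12, 15],
    'classifier__max_leaf_nodes': [10, 12, 15],
}

# GridSearchCV tries every combination of the listed values
combinations = list(product(*param_grid_demo.values()))
n_candidates = len(combinations)

# Each candidate is fitted once per fold (cv=3); the final refit is not counted
n_fits = n_candidates * 3
```

That is 27 candidates and 81 fits, each on the full pixel feature set, so the search is expensive relative to the small region of parameter space it covers.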

In [None]:
models = zip(search.cv_results_['mean_test_score'], search.cv_results_['params'])
for mean_score, params in sorted(models, key=lambda x: x[0], reverse=True)[:10]:
    print(mean_score, params)

In [None]:
search.best_params_

Scores not improving much from grid search...

# Present your solution 

In [None]:
final_pred = alexnet_color.predict(X_test_)
final_score = sum(y_test_ == final_pred) / X_test_.shape[0]
final_score

In [None]:
plt.figure(1, figsize=(19, 10))

for i in range(9):
    r = np.random.randint(0, X_test_.shape[0], 1)
    
    plt.subplot(3, 3, i+1)
    plt.subplots_adjust(hspace = 0.3, wspace = 0.3)
    
    plt.imshow(X_test_[r[0]])
    plt.title('Actual = {} ({}), \nPredicted = {} ({})'.format(y_test_[r[0]], names[y_test_[r[0]]], 
                                                               final_pred[r[0]], names[final_pred[r[0]]]))
    plt.xticks([]) , plt.yticks([])

plt.show()



A few attempts at tuning with grid search did not yield better results than the defaults. We were also unable to run the sklearn models on grayscale images (we hit a bug that we didn't have time to track down).

Therefore, our best model was the AlexNet using color which had a <mark>%</mark> precision on the test set and had <mark>% evaluate!!</mark> accuracy. 
This model slightly outperformed the one that used grayscale images.

Our results were not great. Part of the reason is that we did not have enough memory to apply many transformations. 
Ideally, we would augment the images with more transformations (such as flipping them upside down or cropping). Another issue is that a good number of the images are poor to begin with: some barely show a dog and some contain multiple dogs. We also had to cut down the number of breeds we used (mostly due to memory constraints). We didn't check, but the first 20 categories could have unevenly distributed image counts, contain the dogs that are most difficult to distinguish, or simply have the worst images. Any of these could contribute to our low scores. 
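
The transformations mentioned above can be done with plain NumPy slicing on a single `(height, width, channels)` image. This is a sketch of what we would have tried with more memory, not code that was run for this project; the helper names are hypothetical.

```python
import numpy as np


def flip_upside_down(image):
    """Reverse the row order of a (height, width, channels) image."""
    return image[::-1, :, :]


def flip_left_right(image):
    """Reverse the column order (mirror image)."""
    return image[:, ::-1, :]


def random_crop(image, crop_size, rng):
    """Cut a random crop_size x crop_size window out of the image."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]


# Small stand-in image so the example runs instantly
demo_image = np.arange(4 * 6 * 3).reshape(4, 6, 3)
demo_crop = random_crop(demo_image, 3, np.random.default_rng(0))
```

Slicing returns views rather than copies, so generating the variants one batch at a time during training would avoid holding an enlarged dataset in memory.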

[see more here](https://docs.google.com/presentation/d/19PBfOdaeBKNb2NyIfJWTD0xQLSJ59Cuu5T2Bj3hKmw0/edit?usp=sharing)