# Blueberry Muffin vs Chihuahua - Building an Image Classifier
## General Assembly Capstone Project



### Problem Statement: 

**Background - Meme**: In 2016, a meme went viral by asking a question most people had never expected to find challenging: can you tell the difference between these pairs of images that you never before thought looked alike? 

![title](./images/Other-Memes.png)
(source: Elle Magazine, https://www.elle.com/culture/news/a34939/animals-or-food/)


As the owner of a Chihuahua, my interest zeroed in on this pairing: 
![title](./images/Chihuahua.png)



**Background - Image Classification**: A common claim about image classification algorithms is that, although they can quickly distinguish between thousands of images with *pretty good* accuracy, a child can distinguish between the same images with *much better accuracy*. 

The question of Chihuahua versus Blueberry Muffin fascinated me because, at the particular close-up angles selected for the meme, a human cannot easily distinguish between these images either. 



**Problem Statement**: If I build an image classification model that predicts whether an image shows a Chihuahua or a Blueberry Muffin, and train it on zoomed-out, distinctly different photos, can that model accurately classify the challenging zoomed-in photos from the meme? Additional questions I would like to explore include: 
- Based on the performance of the model, what can we determine about what the model is using to distinguish between the two classes? How does that differ from how a human distinguishes? Is it a better or worse system in the case of these memes?
- Looking at the training data set of images, which images were misclassified with the highest predicted probability for the other class? Are these images difficult to distinguish in the same way as the images that went viral? 

### Results
I am still fine-tuning my model to improve accuracy and reduce variance, so the results below come from my best model so far. 

Compared to a baseline accuracy of 50%, the model has a training accuracy of 85% and a validation accuracy of 78%. When challenged with the 16 photos from the meme, it classified 12 of them correctly (75%). Here is what it predicted: 

![](./images/meme_acc.png)


### Process Flow
1. Scrape images from the internet
2. Modify images to fill an array
3. Build CNN to train and test on scraped images
4. Run images from meme through CNN model
5. Look for patterns in the data

#### 1. Scrape Images from the Internet
I explored a few free online libraries of image collections. It was not very hard to find an assortment of Chihuahua photos. However, there really was no great library of blueberry muffin photos. 

This led me to use a wrapper for the Google Images API. It was generally a great tool for pulling images, but it had one issue: the Google Images API only allows you to pull the 100 most recent images, and the wrapper does not have a workaround for this. 

To get around this issue, I started by coming up with different descriptive words for blueberry muffin (jumbo, mini, Starbucks, Pete's, Vegan...). I quickly found that these searches returned a lot of duplicates within their top 100 images. 

The trick that produced a unique new set of images was to translate "blueberry muffin" into other languages and search for the translated term. Translations with accented letters caused problems (I needed to remove the accents), and I could not search for terms written outside the Latin alphabet. I also found that in some cases, either the translation was wrong or blueberry muffins are made differently enough in that country that I didn't feel comfortable including those images. 
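
The accent removal described above can also be done programmatically. The snippet below is a minimal sketch using Python's standard `unicodedata` module; the function name `strip_accents` and the sample terms are illustrative assumptions and are not part of the original notebook.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (e.g. 'á' -> 'a')."""
    # NFKD splits each accented character into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # drop the combining marks and keep every other character
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# hypothetical raw translations, before cleaning
raw_terms = ['muffin de arándanos', 'blåbärsmuffin', 'borůvkový muffin']
search_terms = [strip_accents(term) for term in raw_terms]
print(search_terms)
# ['muffin de arandanos', 'blabarsmuffin', 'boruvkovy muffin']
```

Letters that have no accent-free decomposition (for example `ø` or `đ`) pass through unchanged, so those would still need to be handled by hand.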

I have not yet been able to confirm with a native Vietnamese or Finnish speaker whether I have the correct translation for Blueberry Muffin, or whether these are the local equivalent, but the photos show the following: 
![title](./images/translations.png)

<details>
<summary>Image Scraping Code</summary>
<br>
    This code is also included in a separate Jupyter Notebook
    
``` python
# This code is modeled after stack overflow user Vicky Christina's code

# Importing Google API wrapper
from google_images_download import google_images_download 
import sys
orig_stdout = sys.stdout

# set up scraper
f = open('URLS.txt', 'w')
sys.stdout = f
    
# Image paths
muffin_path = './muffin images/'
chihuahua_path = './chihuahua/'

# specify crop size of images to be used
set_width = 200
set_height = 200
    
# list of words to pull images for
muffin_words = [
    'blueberry muffin close-up', 'blueberries muffin', 'blueberries scone', 
    'blueberry muffin', 'blueberry muffin recipe', 'blueberry muffins', 
    'blueberry mufin', 'blueberry scone', 'bluebery muffin', 'bluebery mufin', 
    'one blueberry muffin', 'single blueberry muffin', 'Starbucks blueberry muffin', 
    "Pete's blueberry muffin", 'mini blueberry muffin', 'blueberry muffin top', 
    'giant blueberry muffin', 'low fat blueberry muffin', 'blueberry cupcake', 
    'jumbo blueberry muffin',  'blueberry muffin side view', 'blueberry muffin zoom', 
    'blueberry muffin bottom', 'blueberry minimuffin', 'blue berry muffin', 
    'blueberrymuffin', 'blue bery muffin', 'blueberymufin', 'blues muffin', 
    'berries muffin', 'blueberyy muffin', 'muffin de arandanos', 'muffin aux myrtilles', 
    'Blaubeermuffin', 'muffin fraochan', 'Bolinho de mirtilo', 'blabarsmuffin', 
    'bosbessenmuffin', 'muffin od borovnice', 'bloubessie muffin', 'borovnica za muffine', 
    'boruvkovy muffin', 'blua mufino', 'mustika muffin', 'mustikkamuffinssi', 
    'blauwe muffin', 'muffin de arandanos', 'mellenu smalkmaizite', 'melyniu keksas', 
    'te kaeka mira', 'Muffin jagodowy', 'briosa cu afine', 'sulu silika', 
    'muffin subh-craoibhe', 'cucoriedkovy muffin', 'borovnicev muffin', 'buluug buluug ah', 
    'muffin buah beri biru', 'yabanmersinli kek', "ko'k piyoz", 'banh nưong xop viet quat', 
    'myffin llus', 'biriki muffin'
    ]
chihuaua_words = [
    'cheagle', 'fat chihuahua', 'JackChis', 'ugly chihuahua', 'wet chihuahua', 'big chihuahua', 
    'chihuahua close-up', 'chihuahua ears', 'chihuahua face', 'chihuahua frown', 
    'chihuahua happy', 'chihuahua mouth', 'chihuahua nose', 'chihuahua puppy', 'chihuahua small', 
    'chihuahua smile', 'chihuahua tongue', 'chihuahua whiskers', 'chihuahua zoom', 
    'Chiwahwah', 'Chiwauwau', 'Chiwawa', 'Chiwawa puppy', 'chiweenie', 'chocolate brown chihuahua',
    'light brown chihuahua', 'old chihuahua'
    ]
    
def download_images(words):
    """Download up to 100 large images for each search term in words."""
    response = google_images_download.googleimagesdownload()
    for word in words:
        arguments = {"keywords"     : word,
                     "limit"        : 100,
                     "print_urls"   : False,
                     "size"         : ">2MP",
                     }
        # saves each word's photo in to a folder named word under a folder named downloads
        response.download(arguments)

download_images(chihuaua_words)
download_images(muffin_words)

# restore stdout and close the log only after every download has finished
sys.stdout = orig_stdout
f.close()

# collecting URLs of images from the log (I did not wind up using URLs)
with open('URLS.txt') as f:
    content = f.readlines()

urls = []
for j in range(len(content)):
    if content[j][:9] == 'Completed':
        urls.append(content[j-1][11:-1])
    
```
    
</details>

Prior to modifying the images, I manually moved the photos into a folder of either Chihuahuas or Muffins, deleting duplicates from the searches in the process. Next, I scanned through the image set to catch any images that were not of the intended objects. I considered running an unsupervised model on the folder to catch the photos that were not of muffins or chihuahuas. However, I felt that I could sort the images more accurately by hand, and I was able to do so fairly quickly by viewing the folder with large icons: 
![title](./images/window.png)
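
I removed duplicates by hand, but byte-identical copies could also be flagged automatically. The following is a sketch of that alternative, not what I actually ran: the function name `find_exact_duplicates` is hypothetical, and it only catches files whose bytes match exactly, so resized or re-encoded copies of the same photo would still need a manual pass.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(folder):
    """Group files in folder whose contents are byte-for-byte identical.

    Returns a list of lists; each inner list holds the paths sharing one hash.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # identical files always produce the same SHA-256 digest
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # keep only the hashes that more than one file maps to
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_exact_duplicates('./chihuahua/')` would return one group per set of identical files, and everything after the first path in each group could then be deleted.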

Some of the off-target images I encountered most frequently were: 
![title](./images/mistakes.png)

#### 2. Modify images to fill an array

In order to take my image files and convert them to numeric values that could be interpreted by a neural network, I took the following steps: 
- Create an empty array of \[x, y, z\] where x and y are my intended pixel dimensions for each image (300 x 300) and z is the number of images in the file set
- Loop through every image to: 
    - Open the image with scikit-image (`skimage`) such that each opened image is an array of height in pixels x width in pixels x 3 colors (R, G, B). The value at each location represents the intensity of that color at that point.  
    - Convert the image to greyscale. This reduces the file size by storing a single brightness value at each point instead of three color values
    - Scale the image such that the smaller dimension is scaled up or down to 300 pixels
    - Crop the larger dimension to be only 300 pixels (I opted to crop an equal amount from each side so the image stays centered)
    - Add the image to the empty array
- The loop is run with try/except because some images that appeared viewable in Preview could not be processed by scikit-image. These images were a small enough percentage that I felt comfortable skipping them to keep the loop running
- Save the array to my computer to be loaded in my next notebook
![title](./images/processing1.png)
![title](./images/processing2.png)

Unfortunately, the dogs become much less cute by the time they're converted to a matrix of greyscale values. But it's the price we pay to effectively analyze our data. 
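
The scale-then-crop arithmetic from the steps above can be sketched on its own, without loading any images. This is an illustrative helper rather than the notebook code: the name `resize_and_crop_plan` is an assumption, and it uses `round` where the notebook uses `int`.

```python
def resize_and_crop_plan(height, width, target):
    """Return (new_height, new_width, top, left) for a centered square crop.

    The smaller side is scaled to target; the larger side is scaled by the
    same factor and then cropped equally at both ends.
    """
    scale = target / min(height, width)
    if height > width:
        new_height, new_width = round(height * scale), target
    else:
        new_height, new_width = target, round(width * scale)
    # offsets of the target x target window inside the resized image
    top = (new_height - target) // 2
    left = (new_width - target) // 2
    return new_height, new_width, top, left

# a hypothetical 600 x 900 (height x width) photo with a 300-pixel target
print(resize_and_crop_plan(600, 900, 300))
# (300, 450, 0, 75)
```

The resized image would then be sliced as `image[top:top + target, left:left + target]` to get the square that goes into the array.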

<details>
<summary> Image Manipulation Code </summary>
<br>
    This code is also included in a separate Jupyter Notebook
    
```python
# Importing libraries to access files; adjust images
import os
from skimage import io
from skimage.color import rgb2gray
from skimage.transform import rescale, resize, downscale_local_mean
import numpy as np
import pickle


# Image paths
muffin_path = './muffin images/'
chihuahua_path = './chihuahua/'

# specify crop size of images to be used
set_width = 200
set_height = 200

# greyscale, crop, resize and add to an array a file of images 

image_list = os.listdir(muffin_path)

# empty array to stack all images in
m_array = np.zeros([set_width, set_height, len(image_list)])

# loop through every image, saving progress every 50 images
for i, image_name in enumerate(image_list):
    print(i)
    try:
        # reading image; making greyscale
        image = rgb2gray(io.imread(muffin_path+image_name))

        # loading height and width to determine: 
        # which dimension is smaller (to make 200)
        # where to crop larger dimension to center image
        image_height = image.shape[0]
        image_width = image.shape[1]

        if image_height > image_width:
            #resizing
            multiplier = set_width/image_width
            new_height = int(image_height*multiplier)
            image_resized = resize(image, (new_height, set_width), anti_aliasing=True)

             #cropping
            crop_cut = int((new_height-set_width)/2)
            cropped = image_resized[crop_cut:crop_cut+set_width, 0:set_width]


        else: 
            #resizing
            multiplier = set_height/image_height
            new_width = int(image_width*multiplier)
            image_resized = resize(image, (set_width, new_width), anti_aliasing=True)

            #cropping
            crop_cut = int((new_width-set_width)/2)
            cropped = image_resized[0:set_width, crop_cut:crop_cut+set_width]

            # add to image stack

        m_array[:,:,i] = cropped
        if (i+1)%50 == 0:
            np.save('./files/m_array.npy', m_array, True)
            print(f'Processed {i+1} out of {len(image_list)} images.')

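    except Exception as e:
        # some files cannot be read by skimage; skip them and keep going
        # (the skipped image's slot in the array stays all zeros)
        print(f'Error at {image_name}: {e}')

# final save so images after the last multiple of 50 are not lost
np.save('./files/m_array.npy', m_array, True)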

# greyscale, crop, resize and add to an array a file of images 

image_list = os.listdir(chihuahua_path)

# empty array to stack all images in
c_array = np.zeros([set_width, set_height, len(image_list)])

# loop through every image, saving progress every 50 images
for i, image_name in enumerate(image_list):
    print(i)
    try:
        # reading image; making greyscale
        image = rgb2gray(io.imread(chihuahua_path+image_name))

        # loading height and width to determine: 
        # which dimension is smaller (to make 200)
        # where to crop larger dimension to center image
        image_height = image.shape[0]
        image_width = image.shape[1]

        if image_height > image_width:
            #resizing
            multiplier = set_width/image_width
            new_height = int(image_height*multiplier)
            image_resized = resize(image, (new_height, set_width), anti_aliasing=True)

             #cropping
            crop_cut = int((new_height-set_width)/2)
            cropped = image_resized[crop_cut:crop_cut+set_width, 0:set_width]


        else: 
            #resizing
            multiplier = set_height/image_height
            new_width = int(image_width*multiplier)
            image_resized = resize(image, (set_width, new_width), anti_aliasing=True)

            #cropping
            crop_cut = int((new_width-set_width)/2)
            cropped = image_resized[0:set_width, crop_cut:crop_cut+set_width]

            # add to image stack

        c_array[:,:,i] = cropped
        if (i+1)%50 == 0:
            np.save('./files/c_array.npy', c_array, True)
            print(f'Processed {i+1} out of {len(image_list)} images.')

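    except Exception as e:
        # some files cannot be read by skimage; skip them and keep going
        # (the skipped image's slot in the array stays all zeros)
        print(f'Error at {image_name}: {e}')

# final save so images after the last multiple of 50 are not lost
np.save('./files/c_array.npy', c_array, True)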

# creating file of 16 meme test images

image_list1 = os.listdir('./test images/chihuahua/')
image_list2 = os.listdir('./test images/muffin/')

# empty array to stack all images in
test_array = np.zeros([set_width, set_height, len(image_list1)+len(image_list2)])


for i, image_name in enumerate(image_list1):
    print(i)
    try:
        # reading image; making greyscale
        image = rgb2gray(io.imread('./test images/chihuahua/'+image_name))

        # loading height and width to determine: 
        # which dimension is smaller (to make 200)
        # where to crop larger dimension to center image
        image_height = image.shape[0]
        image_width = image.shape[1]

        if image_height > image_width:
            #resizing
            multiplier = set_width/image_width
            new_height = int(image_height*multiplier)
            image_resized = resize(image, (new_height, set_width), anti_aliasing=True)

             #cropping
            crop_cut = int((new_height-set_width)/2)
            cropped = image_resized[crop_cut:crop_cut+set_width, 0:set_width]


        else: 
            #resizing
            multiplier = set_height/image_height
            new_width = int(image_width*multiplier)
            image_resized = resize(image, (set_width, new_width), anti_aliasing=True)

            #cropping
            crop_cut = int((new_width-set_width)/2)
            cropped = image_resized[0:set_width, crop_cut:crop_cut+set_width]

            # add to image stack

        test_array[:,:,i] = cropped
        
    except:
        print(f'Error at {image_name}')
        
for i, image_name in enumerate(image_list2):
    # muffin images are stored after all of the chihuahua images
    j = i + len(image_list1)
    print(j)
    try:
        # reading image; making greyscale
        image = rgb2gray(io.imread('./test images/muffin/'+image_name))

        # loading height and width to determine: 
        # which dimension is smaller (to make 200)
        # where to crop larger dimension to center image
        image_height = image.shape[0]
        image_width = image.shape[1]

        if image_height > image_width:
            #resizing
            multiplier = set_width/image_width
            new_height = int(image_height*multiplier)
            image_resized = resize(image, (new_height, set_width), anti_aliasing=True)

             #cropping
            crop_cut = int((new_height-set_width)/2)
            cropped = image_resized[crop_cut:crop_cut+set_width, 0:set_width]


        else: 
            #resizing
            multiplier = set_height/image_height
            new_width = int(image_width*multiplier)
            image_resized = resize(image, (set_width, new_width), anti_aliasing=True)

            #cropping
            crop_cut = int((new_width-set_width)/2)
            cropped = image_resized[0:set_width, crop_cut:crop_cut+set_width]

            # add to image stack

        test_array[:,:,j] = cropped
    
    except:
        print(f'Error at {image_name}')

np.save('./files/test_array.npy', test_array, True)
    
```
</details>

#### 3. Build CNN to train and test on scraped images

I created several versions of models using Keras Convolutional Neural Networks. My goal was to optimize the following metrics. Different metrics were prioritized at different stages of my model building: 
- **Run Time** - Early on, I wanted the models to run quickly to enable several iterations to get a rough idea of what will perform well at a very high level. 
- **Accuracy** - This is the most obvious thing to optimize. Neither type of misclassification is more costly than the other here, so there is no reason to optimize for sensitivity or specificity, and accuracy is a good measure of how well my model performs.
- **Reducing Variance** - I found that the models with the highest training accuracy reached it to the detriment of validation accuracy, a sign of overfitting. I am targeting models which have about the same accuracy on my training set and my validation set. 
- **Test Accuracy** - My capstone problem statement asks if an image classifier can distinguish between the 16 images from the *Blueberry Muffin or Chihuahua Meme*. After I have created a model with high accuracy, I want to check if it can accomplish this task. Whether the model can or cannot, I want to extract information from the images it did and did not classify correctly. 
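
To make the accuracy metric concrete, here is a minimal sketch of how sigmoid outputs turn into class labels and an accuracy score. The helper name `accuracy_at_threshold`, the 0.5 cutoff, and the probabilities below are illustrative assumptions; the numbers are placeholders and not my model's real outputs.

```python
def accuracy_at_threshold(probabilities, labels, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    # a sigmoid output at or above the threshold counts as class 1
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# placeholder values for illustration only
example_probs = [0.91, 0.35, 0.62, 0.08]
example_labels = [1, 1, 0, 0]
print(accuracy_at_threshold(example_probs, example_labels))
# 0.5
```

By the same calculation, 12 correct predictions out of the 16 meme images is an accuracy of 0.75.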

The parameters I modified as I iterated through models included: 
- **Images**
    - **Image Sizes** - Square images of edge lengths 300 pixels, 200 pixels, 100 pixels
    - **Color** - Greyscale images (each image is a numpy array of length, width, 1) or Color images (each image is a numpy array of length, width, 3)
- **Convolutional Layers**
    - **Number of Filters** 
    - **Kernel Size**
    - **Pooling Size**
- **Dense Layers**
    - **Units**
- **All Layers**
    - **Number of Layers**
    - **Dropout Layers and Dropout Percents**
- **Model Fitting**
    - **Batch Size**
    - **Epochs**
    - **Callbacks/Learning Rate**
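
Image size, kernel size, and pooling size interact: together they determine how many values reach the first dense layer. The sketch below works through that arithmetic for one configuration, assuming a 150-pixel input, unpadded convolutions with stride 1, and pooling with stride equal to the pool size (the Keras defaults). The helper names are illustrative.

```python
def conv_output(size, kernel, stride=1):
    """Output edge length of a convolution with no padding."""
    return (size - kernel) // stride + 1

def pool_output(size, pool):
    """Output edge length of max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1

size = 150                     # assumed input edge length in pixels
size = conv_output(size, 10)   # 10 x 10 kernel -> 141
size = pool_output(size, 2)    # 2 x 2 pooling  -> 70
size = conv_output(size, 6)    # 6 x 6 kernel   -> 65
size = pool_output(size, 2)    # 2 x 2 pooling  -> 32
flattened = size * size * 32   # 32 filters in the last convolution
print(size, flattened)
# 32 32768
```

Larger kernels and pooling windows shrink the feature maps faster, which reduces the number of weights in the first dense layer and therefore the run time.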

I was able to get a few models to a validation accuracy of about 80%. Please refer to the GitHub repository to view all iterations of the model I ran through AWS. I have pasted the model I selected as my "best" below. 

<details>
<summary> Image Classification Code </summary>
<br>
   
```python
# Import libraries and modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# For reproducibility
np.random.seed(42)

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.callbacks import ReduceLROnPlateau, Callback

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'


# Loading dataset
chihuahua_file = np.load('./files/c_array.npy')
muffin_file = np.load('./files/m_array.npy')

# Loading test set
# arrays from the image manipulation step are (width, height, n_images);
# move the image index to the front and add a single greyscale channel
test_array = np.load('./files/test_array.npy')
test_array = np.transpose(test_array, (2, 0, 1))[..., np.newaxis]


files = np.append(chihuahua_file, muffin_file, axis = 2)

# Creating file of output values, 
# chihuahua is positive class (1), muffin is negative class (0)
y = np.append(np.ones(chihuahua_file.shape[2]),
              np.zeros(muffin_file.shape[2]))

# Reshaping X such that it can be input in to neural net 
# for greyscale analysis: (n_images, width, height, 1)
X = np.transpose(files, (2, 0, 1))[..., np.newaxis]

# Train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)


X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Setting up callback and learning
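class myCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # the key is 'acc' in older Keras releases and 'accuracy' in newer ones
        acc = logs.get('acc', logs.get('accuracy'))
        if acc is not None and acc > .90:
            print("\nReached 90% accuracy so cancelling training!")
            self.model.stop_training = True            
callbacks=myCallback()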

reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.3, verbose=1,
                              patience=2, min_lr=0.00000001)

# Creating a convolutional neural network

model = Sequential()

model.add(Conv2D(
    filters = 45,
    kernel_size = (10,10),
    activation = 'relu',
    # must match the (width, height, channels) of the images loaded above
    input_shape = (150, 150, 1)
))
model.add(MaxPooling2D(pool_size = (2)))

model.add(Conv2D(32,
                     kernel_size=6,
                     activation='relu'))
model.add(MaxPooling2D(pool_size=2))

model.add(Flatten())

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(X_train, y_train,
                 batch_size=128,
                 epochs=40,
                 verbose=1,
                 validation_data=(X_test, y_test),
                 callbacks=[reduce_lr, callbacks]
                 )




# graphing training and testing loss scores
train_loss = history.history['loss']
test_loss = history.history['val_loss']

# Set figure size.
plt.figure(figsize=(12, 8))

# Generate line plot of training, testing loss over epochs.
plt.plot(train_loss, label='Training Loss', color='#185fad')
plt.plot(test_loss, label='Testing Loss', color='orange')

# Set title
plt.title('Training and Testing Loss by Epoch', fontsize = 25)
plt.xlabel('Epoch', fontsize = 18)
plt.ylabel('Binary Crossentropy', fontsize = 18)

plt.legend(fontsize = 18);
plt.show()

model.save('./files/model_aws13.HDF5')

print(model.predict(test_array))


# graphing training and testing loss scores
train_loss = history.history['loss']
test_loss = history.history['val_loss']

# Set figure size.
plt.figure(figsize=(12, 8))

# Generate line plot of training, testing loss over epochs.
plt.plot(train_loss, label='Training Loss', color='#185fad')
plt.plot(test_loss, label='Validation Loss', color='orange')

# Set title
plt.title('Training and Validation Loss by Epoch', fontsize = 25)
plt.xlabel('Epoch', fontsize = 18)
plt.ylabel('Binary Crossentropy', fontsize = 18)

plt.legend(fontsize = 18)
plt.savefig('./files/loss13.png')
plt.show();

train_acc = history.history['acc']
test_acc = history.history['val_acc']

# Set figure size.
plt.figure(figsize=(12, 8))

# Generate line plot of training, validation accuracy over epochs.
plt.plot(train_acc, label='Training Accuracy', color='#185fad')
plt.plot(test_acc, label='Validation Accuracy', color='orange')

# Set title
plt.title('Training and Validation Accuracy by Epoch', fontsize = 25)
plt.xlabel('Epoch', fontsize = 18)
plt.ylabel('Accuracy', fontsize = 18)

plt.legend(fontsize = 18)
plt.savefig('./files/acc13.png')
plt.show();

# np.save writes NumPy arrays, so use the .npy extension rather than .png
np.save('./files/train_acc13.npy', train_acc)
np.save('./files/test_acc13.npy', test_acc)
np.save('./files/train_loss13.npy', train_loss)
np.save('./files/test_loss13.npy', test_loss)
```
</details>