# Machine Learning Nanodegree

## Capstone

## Project: Write an Algorithm to Identify Whales by Flukes

---

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut.  Markdown cells can be edited by double-clicking the cell to enter edit mode.

---
### Why We're Here 

The goal of this project is to build an algorithm that identifies whales from images of their flukes, using Happy Whale’s database of over 25,000 images gathered from research institutions and public contributors.

### The Road Ahead

We break the notebook into separate steps.  Feel free to use the links below to navigate the notebook.

* [Step 0](#step0): Import Datasets and Explore Whale Categories
* [Step 1](#step1): Detect Whales
* [Step 2](#step2): Create a simple CNN to Classify Whales
* [Step 3](#step3): Use a CNN to Classify Whales (using Transfer Learning)
* [Step 4](#step4): Create a CNN to Classify Whales (using Transfer Learning)
* [Step 5](#step5): Write the Algorithm
* [Step 6](#step6): Test the Algorithm

---
<a id='step0'></a>
## Step 0: Import Datasets

### Import Dataset

In the code cell below, we import a dataset of whale images.  We populate a few variables through the use of the `load_files` function from the scikit-learn library:
- `train_files`, `valid_files`, `test_files` - numpy arrays containing file paths to images
- `train_targets`, `valid_targets`, `test_targets` - numpy arrays containing one-hot-encoded classification labels
- `whale_ID` - list of string-valued whale IDs for translating labels
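
To make the one-hot encoding of the targets concrete, here is a minimal NumPy sketch of the kind of array `np_utils.to_categorical` is expected to produce. The helper name `one_hot` and the toy labels are illustrative assumptions, not part of the project code.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # set the column matching each integer label to 1
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# toy example: three images belonging to classes 0, 2 and 1 out of 4 classes
toy_targets = one_hot([0, 2, 1], num_classes=4)
print(toy_targets.shape)  # (3, 4)
```

In the project itself the number of classes would be the number of whale IDs rather than 4.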

In [111]:
from sklearn.datasets import load_files       
from keras.utils import np_utils
import numpy as np
import os

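# define function to load train, test, and validation datasets
def load_dataset(path):
    # load_files treats each subfolder of `path` as one category
    data = load_files(path)
    files = np.array(data['filenames'])
    # one-hot encode the integer labels over the 4251 whale IDs
    targets = np_utils.to_categorical(np.array(data['target']), 4251)
    return files, targets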

train_path = "/Users/priyankapatel/Desktop/Python/machine-learning-master/projects/capstone/Data/train"
validation_path = "/Users/priyankapatel/Desktop/Python/machine-learning-master/projects/capstone/Data/validation"
test_path = "/Users/priyankapatel/Desktop/Python/machine-learning-master/projects/capstone/Data/test"
    
# load train, test, and validation datasets (validation set created from training set)
train_files, train_targets = load_dataset(train_path)
valid_files, valid_targets = load_dataset(validation_path)
test_files, test_targets = load_dataset(test_path)

# print statistics about the dataset
print('There are %d total whale images.\n' % len(np.hstack([train_files, valid_files, test_files])))
print('There are %d training whale images.' % len(train_files))
print('There are %d validation whale images.' % len(valid_files))
print('There are %d test whale images.' % len(test_files))


There are 0 total whale images.

There are 0 training whale images.
There are 0 validation whale images.
There are 0 test whale images.
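
`load_files` treats each subfolder of the given path as one category, so a count of zero images may point at the path or at the directory layout. The following is a rough sketch for checking that layout. `count_images_per_class` is a hypothetical helper, and the temporary folder with made-up whale IDs only stands in for a real training directory.

```python
import tempfile
from pathlib import Path

def count_images_per_class(root):
    """Map each class subfolder under `root` to the number of files it holds."""
    root = Path(root)
    return {
        folder.name: sum(1 for item in folder.iterdir() if item.is_file())
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }

# demo on a throwaway directory with two made-up whale IDs
with tempfile.TemporaryDirectory() as tmp:
    for whale_id, n_files in [("w_aaa", 2), ("w_bbb", 1)]:
        class_dir = Path(tmp) / whale_id
        class_dir.mkdir()
        for i in range(n_files):
            (class_dir / ("%d.jpg" % i)).touch()
    demo_counts = count_images_per_class(tmp)

print(demo_counts)
```

An empty dictionary for the real training path would indicate that no class subfolders were found there.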


### Exploring Whale Categories

In the code cells below, we explore the whale IDs.



In [69]:
import pandas as pd

# load the image-to-ID mapping; later cells use both names
data_train = pd.read_csv('Data/train.csv')
whale_categories = data_train
print(data_train.head())

          Image         Id
0  00022e1a.jpg  w_e15442c
1  000466c4.jpg  w_1287fbc
2  00087b01.jpg  w_da2efe0
3  001296d5.jpg  w_19e5482
4  0014cfdf.jpg  w_f22f3e3


In [65]:
rand_rows = data_train.sample(n=25)
imgs = list(rand_rows['Image'])
labels = list(rand_rows['Id'])
print(labels)

['w_243e33e', 'w_8b22583', 'w_0bc712b', 'w_fe5e78b', 'w_fbcb6e4', 'w_f792125', 'w_3694c7d', 'w_d47e2e3', 'w_6c803bf', 'w_e38b2c7', 'w_2e27b77', 'w_303518a', 'w_b9e00eb', 'w_6cef37c', 'w_86113c4', 'w_a14ccaa', 'w_b1c44fe', 'w_7d2d70c', 'w_b0e05b1', 'w_a18d0dc', 'w_3411b9f', 'w_18eee6e', 'w_bf38f05', 'w_1a9f141', 'w_37523b2']


In [62]:
num_categories = len(data_train['Id'].unique())
     
print('There are %d total whale IDs.' % num_categories)

There are 4251 total whale IDs.


In [71]:
topRankedCategories = whale_categories.groupby('Id').count().sort_values(by='Image',ascending=False).reset_index().head(10)
topRankedCategories

          Id  Image
0  new_whale    810
1  w_1287fbc     34
2  w_98baff9     27
3  w_7554f44     26
4  w_1eafe46     23
5  w_fd1cb9d     22
6  w_ab4cae2     22
7  w_693c9ee     22
8  w_987a36f     21
9  w_43be268     21
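
The counts above suggest that the IDs are heavily imbalanced, with `new_whale` far ahead of every identified whale. Below is a short sketch of how that imbalance could be summarised. The toy frame is made-up data standing in for `Data/train.csv`, and `summarise_ids` is a hypothetical helper.

```python
import pandas as pd

def summarise_ids(frame):
    """Return images-per-ID counts and the number of IDs seen only once."""
    counts = frame['Id'].value_counts()
    singletons = int((counts == 1).sum())
    return counts, singletons

# made-up rows in the same Image/Id layout as the training CSV
toy = pd.DataFrame({
    'Image': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'],
    'Id': ['new_whale', 'new_whale', 'new_whale', 'w_001', 'w_002'],
})
counts, singletons = summarise_ids(toy)
print(counts.head())
print('IDs with a single image: %d' % singletons)
```

The number of single-image IDs matters when a validation set is split off from the training set, since such an ID cannot appear in both.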


---
<a id='step1'></a>
## Step 1: Detect Whales

In this section, we use a pre-trained [ResNet-50](http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006) model to detect whales in images.  Our first line of code downloads the ResNet-50 model, along with weights that have been trained on [ImageNet](http://www.image-net.org/), a very large, very popular dataset used for image classification and other vision tasks.  ImageNet contains over 10 million URLs, each linking to an image containing an object from one of [1000 categories](https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a).  Given an image, this pre-trained ResNet-50 model returns a prediction (derived from the available categories in ImageNet) for the object that is contained in the image.
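
In the ImageNet category list linked above, the whale classes sit at indices 147 (grey whale) and 148 (killer whale). As a minimal sketch, a detector could check whether the model's top prediction falls on one of those indices. The function name `is_whale_prediction` and the fake probability vector are assumptions for illustration, and whether ResNet-50 actually assigns fluke photos to these classes is not verified here.

```python
import numpy as np

# ImageNet indices assumed to correspond to whale classes (grey whale, killer whale)
WHALE_CLASS_INDICES = (147, 148)

def is_whale_prediction(probabilities):
    """Return True if the most likely ImageNet class is one of the whale classes."""
    top_class = int(np.argmax(probabilities))
    return top_class in WHALE_CLASS_INDICES

# fake 1000-way prediction with most of the mass on class 147
fake_probs = np.full(1000, 0.0005)
fake_probs[147] = 0.5
print(is_whale_prediction(fake_probs))  # True
```

With a real model, `probabilities` would be the 1000-element output for a single preprocessed image.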

In [92]:
import ssl

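# workaround: skip HTTPS certificate verification for the weights download (unsafe outside a throwaway environment)
ssl._create_default_https_context = ssl._create_unverified_context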

from keras.applications.resnet50 import ResNet50

# define ResNet50 model
ResNet50_model = ResNet50(weights='imagenet')

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5


Exception: URL fetch failure on https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5: None -- [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)
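
The `TLSV1_ALERT_PROTOCOL_VERSION` alert generally means the server rejected the TLS version offered by the client, which skipping certificate verification does not change. Here is a small diagnostic sketch, using only the standard `ssl` module, to see which OpenSSL build the interpreter is linked against and whether it exposes TLS 1.2. The variable names are illustrative.

```python
import ssl

# OpenSSL build the interpreter was compiled against
openssl_version = ssl.OPENSSL_VERSION
# older builds may lack the TLS 1.2 protocol constant entirely
has_tls12 = hasattr(ssl, 'PROTOCOL_TLSv1_2')

print('OpenSSL version: %s' % openssl_version)
print('TLS 1.2 constant available: %s' % has_tls12)
```

If TLS 1.2 turns out to be unavailable, upgrading the Python or OpenSSL installation is one likely way forward before retrying the download.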