<a href="https://colab.research.google.com/github/mfaits/Kaggle-Whales/blob/master/Kaggle_Whale_Identification_Implement_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import cv2
import os
from PIL import Image

from matplotlib.pyplot import imshow
from IPython.display import HTML

In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from keras import layers
from keras.preprocessing import image
from keras.applications.imagenet_utils import preprocess_input
from keras.layers import Input, Dense, Activation, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout
from keras.models import Model

import keras.backend as K
from keras.models import Sequential

Using TensorFlow backend.


**Data Import**

In this section of the notebook we pull the competition data directly from Kaggle. This saves us the hassle of downloading all the image files to a local machine and then uploading them to Google Drive. Using the Kaggle API, we can pull the files straight into Colab.

In [0]:
!pip install kaggle



In [0]:
#To keep API keys out of the notebook, run this to upload a config.py file that stores your Kaggle API key in a variable named "token"
from google.colab import files
uploaded = files.upload()

Saving config.py to config.py


In [0]:
import config
!mkdir -p .kaggle

In [0]:
import json
token = config.token
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)

In [0]:
!chmod 600 /content/.kaggle/kaggle.json
!mkdir -p ~/.kaggle
!cp /content/.kaggle/kaggle.json ~/.kaggle/
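
The credential cells above could be collapsed into one helper. This is only a sketch: write_kaggle_credentials is a hypothetical name, and it assumes config.token is a dict holding the "username" and "key" fields that kaggle.json expects.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(token, directory=None):
    """Write `token` to <directory>/kaggle.json, readable only by the owner.

    Hypothetical helper; `token` is assumed to be a dict such as
    {"username": "...", "key": "..."}.
    """
    # Default to ~/.kaggle, which is where the cells above copy the file
    target_dir = Path(directory) if directory is not None else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    target = target_dir / "kaggle.json"
    with open(target, "w") as fh:
        json.dump(token, fh)

    # Same permissions as the `chmod 600` cell
    os.chmod(target, 0o600)
    return target
```

In Colab this would be called as write_kaggle_credentials(config.token), and re-running it does not fail the way a second !mkdir does.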

In [0]:
!kaggle competitions download -c humpback-whale-identification -p /content

#Extract files in separate train and test folders
!mkdir train
!mkdir test
!mv train.zip train
!mv test.zip test
!unzip -q ./train/*.zip  -d ./train
!unzip -q ./test/*.zip  -d ./test

Downloading sample_submission.csv to /content
  0% 0.00/498k [00:00<?, ?B/s]
100% 498k/498k [00:00<00:00, 34.2MB/s]
Downloading train.csv to /content
  0% 0.00/594k [00:00<?, ?B/s]
100% 594k/594k [00:00<00:00, 38.3MB/s]
Downloading test.zip to /content
100% 1.34G/1.35G [00:12<00:00, 116MB/s]
100% 1.35G/1.35G [00:12<00:00, 119MB/s]
Downloading train.zip to /content
100% 4.15G/4.16G [02:54<00:00, 25.4MB/s]
100% 4.16G/4.16G [02:54<00:00, 25.6MB/s]


**File Exploration**

Code in this section adapted from: https://www.kaggle.com/jhonatansilva31415/whales-a-simple-guide

Now that the test and train images are loaded in from Kaggle, let's explore how to open them.


In [0]:
#Create variables for file paths
img_train_path = os.path.abspath('/content/train/')
img_test_path = os.path.abspath('/content/test/')
csv_train_path = os.path.abspath('/content/train.csv')

In [0]:
df = pd.read_csv(csv_train_path)
df.head()

           Image         Id
0  0000e88ab.jpg  w_f48451c
1  0001f9222.jpg  w_c3d896a
2  00029d126.jpg  w_20df2c5
3  00050a15a.jpg  new_whale
4  0005c1ef8.jpg  new_whale


In [0]:
df.shape #how many images are in the training set?

(25361, 2)

The train.csv file has one column for the image name and one column for the labels. We can get the full file path for each image by concatenating the img_train_path with the image name. We can do that all at once and output the results in a new column.

In [0]:
df['Image_path'] = [os.path.join(img_train_path,whale) for whale in df['Image']]
df.head()

           Image         Id                    Image_path
0  0000e88ab.jpg  w_f48451c  /content/train/0000e88ab.jpg
1  0001f9222.jpg  w_c3d896a  /content/train/0001f9222.jpg
2  00029d126.jpg  w_20df2c5  /content/train/00029d126.jpg
3  00050a15a.jpg  new_whale  /content/train/00050a15a.jpg
4  0005c1ef8.jpg  new_whale  /content/train/0005c1ef8.jpg


In [0]:
#Looking at 5 random images
#full_path_random_whales = np.random.choice(df['Image_path'],5)

#%matplotlib inline
#for whale in full_path_random_whales:
#    img = Image.open(whale)
#    plt.imshow(img)
#    plt.show()

The purpose of this competition is to identify individual whales from their flukes. I want to get a sense for how similar one whale's fluke looks across images, so let's see if we can look at a set of photos from just one whale. First let's look at how many images each whale tends to have.

In [0]:
#agg = df.groupby(['Id']).size()
#agg = agg.sort_values(ascending=False)
#agg.head() #print the whales with the most pictures

The new_whale label would skew any summary statistics, so drop it before examining them.

In [0]:
#whale_pic_count = agg.drop('new_whale')
#whale_pic_count.describe()

In [0]:
#whale_pic_count.hist(bins=75)

There are 5004 distinct whale IDs in the training data set. Whales have an average of about 3 photos. Let's look at one that has 3.

In [0]:
#three_pics = whale_pic_count[whale_pic_count == 3]
#three_pics.head()

In [0]:
#w_5d426b6 = df[df.Id == 'w_5d426b6'] #might be better if this wasn't hard-coded

#%matplotlib inline
#for whale in w_5d426b6['Image_path']:
#    img = Image.open(whale)
#    plt.imshow(img)
#    plt.show()

Neat! I can't tell if that's a real whale with a pretty dot pattern on its fluke or if those are photographic aberrations, like some kind of water droplet artifact. What's going on with that whale that has 70+ pictures?

In [0]:
#lotta_pics = whale_pic_count[whale_pic_count >= 70]
#lotta_pics.head()

In [0]:
#w_23a388d = df[df.Id == 'w_23a388d'] 
#random_subset_lotta = np.random.choice(w_23a388d['Image_path'],5)

#I've used this code block three times so I should come back through and make it a function

#%matplotlib inline
#for whale in random_subset_lotta:
#    img = Image.open(whale)
#    plt.imshow(img)
#    plt.show()
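
Since the same display loop has now shown up three times, here's a rough sketch of what that function could look like. show_images and max_images are names I'm making up for illustration; it assumes the paths point at files PIL can open.

```python
import matplotlib.pyplot as plt
from PIL import Image


def show_images(paths, max_images=5):
    """Display up to `max_images` images from an iterable of file paths.

    Hypothetical helper; returns how many images were shown.
    """
    shown = 0
    for path in list(paths)[:max_images]:
        with Image.open(path) as img:
            plt.figure()        # fresh figure so images don't stack on one axis
            plt.imshow(img)
            plt.axis("off")
            plt.show()
            plt.close()         # release the figure once it has been drawn
        shown += 1
    return shown
```

With this in place, each of the exploration cells reduces to one call, for example show_images(np.random.choice(w_23a388d['Image_path'], 5)).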

Just flipping through a few random selections, it seems like those really all could be the same whale, with white wingtips and black dots.

Okay, so here's what we know so far. In the training set, we have 25,361 pictures, representing 5,004 distinctly identified whales (each with an average of about 3 photos) and 9,664 photos categorized as new_whale.
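
As a quick sanity check, the "about 3 photos" figure follows from the three counts above. This is just arithmetic on the numbers already quoted, not a new measurement.

```python
total_images = 25361       # rows in train.csv
new_whale_images = 9664    # photos labelled new_whale
identified_whales = 5004   # distinct whale IDs, excluding new_whale

# Photos that belong to an identified whale
identified_images = total_images - new_whale_images

# Mean number of photos per identified whale
mean_per_whale = identified_images / identified_whales

print(identified_images)          # 15697
print(round(mean_per_whale, 2))   # 3.14
```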

In [0]:
#def get_size(img_path):
#    img = Image.open(img_path)
#    width, height = img.size
#    return [width, height]

#w_23a388d = df[df.Id == 'w_23a388d'] 
#random_subset_lotta = np.random.choice(w_23a388d['Image_path'],5)

#for whale in random_subset_lotta:
#    print(get_size(whale))

In [0]:
#test = df.iloc[:100].copy() #copy so that adding a column doesn't touch df

#test['size'] = test['Image_path'].apply(get_size) #each entry is a [width, height] list

In [0]:
#test.head()

In [0]:
#df['width'], df['height'] = zip(*df['Image_path'].apply(get_size)) #unzip the [width, height] pairs into two columns
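
One thing to watch here: apply(get_size) returns a single Series whose entries are [width, height] lists, so it has to be split before it can fill two columns. Here's a small sketch of one way to do that. The get_size below is a stand-in with made-up sizes so the example runs without any image files.

```python
import pandas as pd


def get_size(img_path):
    # Stand-in for the notebook's get_size: returns [width, height].
    # The sizes are invented purely for this illustration.
    fake_sizes = {"a.jpg": [1050, 700], "b.jpg": [700, 500]}
    return fake_sizes[img_path]


demo = pd.DataFrame({"Image_path": ["a.jpg", "b.jpg"]})

# One Series of [width, height] lists
sizes = demo["Image_path"].apply(get_size)

# Expand the lists into two named columns, aligned on the original index
size_df = pd.DataFrame(sizes.tolist(), columns=["width", "height"], index=demo.index)
demo = demo.join(size_df)

print(demo)
```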

In [0]:
#df.head()

In [0]:
#df['height'].describe()

In [0]:
def prepareImages(data, m):
    """Load the m images listed in data['Image_path'] into an (m, 100, 100, 3) array."""
    print("Preparing images")
    X_train = np.zeros((m, 100, 100, 3))

    for count, fig in enumerate(data['Image_path']):
        #load each image resized to 100x100; target_size is (height, width)
        img = image.load_img(fig, target_size=(100, 100))
        x = image.img_to_array(img)
        x = preprocess_input(x)

        X_train[count] = x
        #print a progress line every 500 images
        if count % 500 == 0:
            print("Processing image: ", count+1, ", ", fig)

    return X_train

In [0]:
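def prepare_labels(y):
    """One-hot encode the whale IDs; also return the fitted LabelEncoder for decoding predictions."""
    values = np.array(y)
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)

    # OneHotEncoder expects a 2-D column of integer codes
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # .toarray() densifies the default sparse result, which avoids the `sparse`
    # keyword that newer scikit-learn releases renamed to `sparse_output`
    onehot_encoder = OneHotEncoder()
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()

    return onehot_encoded, label_encoder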


In [0]:
#X = prepareImages(df, df.shape[0])
#X /= 255

Preparing images
Processing image:  1 ,  /content/train/0000e88ab.jpg
Processing image:  501 ,  /content/train/04c72257b.jpg
Processing image:  1001 ,  /content/train/09cacb84d.jpg
Processing image:  1501 ,  /content/train/0ef961892.jpg
Processing image:  2001 ,  /content/train/141b56a1a.jpg
Processing image:  2501 ,  /content/train/199a417aa.jpg
Processing image:  3001 ,  /content/train/1ec170983.jpg
Processing image:  3501 ,  /content/train/23f084b93.jpg
Processing image:  4001 ,  /content/train/29163ad0b.jpg
Processing image:  4501 ,  /content/train/2e0fab120.jpg
Processing image:  5001 ,  /content/train/3347515d9.jpg
Processing image:  5501 ,  /content/train/3842d71dc.jpg
Processing image:  6001 ,  /content/train/3d7f4c7d5.jpg
Processing image:  6501 ,  /content/train/425f763ca.jpg
Processing image:  7001 ,  /content/train/4714400cd.jpg
Processing image:  7501 ,  /content/train/4c082fbdf.jpg
Processing image:  8001 ,  /content/train/50c683e23.jpg
Processing image:  8501 ,  /content

In [0]:
#y, label_encoder = prepare_labels(df['Id'])

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [0]:
#y.shape #I think we should come back through later and get rid of new_whale

(25361, 5005)
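
That shape is one row per image and one column per class: the 5,004 whale IDs plus new_whale. To make the encoding concrete, here's a tiny NumPy sketch of the same idea on four labels taken from the head of train.csv. It is an illustration of the technique, not the code path prepare_labels uses.

```python
import numpy as np

# Four labels borrowed from the first rows of train.csv
ids = np.array(["w_f48451c", "new_whale", "w_c3d896a", "new_whale"])

# np.unique sorts the distinct labels and returns each label's integer code,
# which is the same ordering LabelEncoder produces
classes, integer_encoded = np.unique(ids, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i
onehot = np.eye(len(classes))[integer_encoded]

print(classes)        # sorted class names
print(onehot.shape)   # (4, 3): four images, three classes
```

Going back from a prediction to a whale ID is then just classes[onehot.argmax(axis=1)], which is what the returned label_encoder is for in the real pipeline.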

In [0]:
#model = Sequential()

#model.add(Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0', input_shape = (100, 100, 3)))

#model.add(BatchNormalization(axis = 3, name = 'bn0'))
#model.add(Activation('relu'))

#model.add(MaxPooling2D((2, 2), name='max_pool'))
#model.add(Conv2D(64, (3, 3), strides = (1,1), name="conv1"))
#model.add(Activation('relu'))
#model.add(AveragePooling2D((3, 3), name='avg_pool'))

#model.add(Flatten())
#model.add(Dense(500, activation="relu", name='rl'))
#model.add(Dropout(0.8))
#model.add(Dense(y.shape[1], activation='softmax', name='sm'))

#model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
#model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv0 (Conv2D)               (None, 94, 94, 32)        4736      
_________________________________________________________________
bn0 (BatchNormalization)     (None, 94, 94, 32)        128       
_________________________________________________________________
activation_1 (Activation)    (None, 94, 94, 32)        0         
_________________________________________________________________
max_pool (MaxPooling2D)      (None, 47, 47, 32)        0         
_________________________________________________________________
conv1 (Conv2D)               (None, 45, 45, 64)        18496     
_________________________________________________________________
activation_2 (Activation)    (None, 45, 45, 64)        0         
_________________________________________________________________
avg_pool (AveragePooling2D)  (None, 15, 15, 64)        0         
__________
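
The output shapes and parameter counts in the summary can be reproduced by hand, which is a handy check that the layers are wired the way I think they are. The sketch below assumes what the model cell specifies: no padding, stride 1 convolutions, and pooling windows that do not overlap. The last three values are derived from the model code, since the summary above is cut off before those layers.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel Conv2D."""
    return kernel * kernel * in_channels * out_channels + out_channels


side = conv_out(100, 7)                 # conv0: 100 -> 94
conv0_params = conv_params(7, 3, 32)    # 4736
bn0_params = 4 * 32                     # gamma, beta, moving mean, moving variance: 128

side = side // 2                        # max_pool (2, 2): 94 -> 47
side = conv_out(side, 3)                # conv1: 47 -> 45
conv1_params = conv_params(3, 32, 64)   # 18496
side = side // 3                        # avg_pool (3, 3): 45 -> 15

# Derived from the model definition, not shown in the truncated summary
flat = side * side * 64                 # Flatten: 14400
rl_params = flat * 500 + 500            # Dense(500): 7200500
sm_params = 500 * 5005 + 5005           # Dense(5005): 2507505

print(side, conv0_params, bn0_params, conv1_params, flat, rl_params, sm_params)
```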

In [0]:
#import gc #needed for gc.collect() below
#history = model.fit(X, y, epochs=100, batch_size=100, verbose=1)
#gc.collect()

Epoch 1/100
Epoch 2/100
  600/25361 [..............................] - ETA: 19:37 - loss: 5.7562 - acc: 0.4000

In [0]:
from google.colab import files
uploaded = files.upload()

Saving model.h5 to model.h5


In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from keras import layers
from keras.preprocessing import image
from keras.applications.imagenet_utils import preprocess_input
from keras.layers import Input, Dense, Activation, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout
from keras.models import Model

import keras.backend as K
from keras.models import Sequential
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import cv2
import os
import json
from PIL import Image

from matplotlib.pyplot import imshow
from IPython.display import HTML
from keras.models import model_from_json

In [0]:
# load the model architecture from JSON
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load the trained weights into the rebuilt model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

OSError: ignored
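
The truncated error doesn't say which file caused it. The only upload shown above is model.h5, while the cell also reads model.json, so one cheap thing to rule out is a file that never made it into the working directory. Here's a minimal sketch of that check; missing_files is a hypothetical helper, and a file that exists but is corrupt or only partially uploaded would still pass it.

```python
import os


def missing_files(paths):
    """Return the paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# The two files the loading cell depends on
print(missing_files(["model.json", "model.h5"]))
```

An empty list means both files are present, in which case the next thing I'd look at is whether model.h5 uploaded completely.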