## High-level feature extraction from images using Dato's GraphLab Create

*Unfortunately, although this worked initially, a software update led to licensing problems. Given the short time frame of the capstone project, I switched to Caffe as an alternative for the deep convolutional neural network model.*

In [1]:
import random
import time
import numpy as np
from numpy.random import normal
import matplotlib.pyplot as plt
import graphlab as gl
import os
import glob

gl.canvas.set_target('ipynb')
#from code import iForest as isof

%matplotlib inline

A newer version of GraphLab Create (v1.6.1) is available! Your current version is v1.5.2.

### Now let's have a look at cat photos.

In [2]:
start = time.time()
WORKING_DIR = '/home/wilber/work/Galvanize/gcp-data/iForest/resize'
data = gl.image_analysis.load_images(WORKING_DIR, \
                                     random_order=True)
time1 = time.time()
print time1 - start
data['image'] = gl.image_analysis.resize(data['image'], 256, 256)
totsecs = time.time() - start
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

[INFO] This non-commercial license of GraphLab Create is assigned to wilber@ssl.berkeley.edu and will expire on September 15, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-3860 - Server binary: /usr/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443230189.log
[INFO] GraphLab Server Version: 1.5.2


2.62452983856
Couldn't import dot_parser, loading of dot files will not be possible.
Elapsed time = 0 hours, 0 minutes, 5.573969841 seconds
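
The hours/minutes/seconds arithmetic in the cell above is repeated in several later cells. A minimal sketch of a helper that could replace it, using `divmod`. The name `format_elapsed` is hypothetical and not part of the original notebook, and seconds are rounded to two decimals here, unlike the original output. It avoids the `print` statement so it runs under Python 2 or 3.

```python
def format_elapsed(total_seconds):
    """Format a duration given in seconds as hours, minutes and seconds."""
    # Split off whole minutes, keeping the fractional seconds.
    minutes, seconds = divmod(total_seconds, 60)
    # Split the whole minutes into hours and remaining minutes.
    hours, minutes = divmod(int(minutes), 60)
    return "Elapsed time = {0} hours, {1} minutes, {2:.2f} seconds".format(
        hours, minutes, seconds)


# Example with a made-up duration of 1 h 2 min 5.5 s.
message = format_elapsed(3725.5)
```

In a timing cell this would be called as `format_elapsed(time.time() - start)`.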


In [3]:
names = data['path']
names = map(lambda x: x.split('/')[-1], names)
names[:10]

['House_Cat_hd_Wallpaper.png',
 '6974484-cute-cat-smile.png',
 'cat-pictures-of-house-cats-122008-spooky.png',
 'maxresdefault.png',
 'dbe5f0727b69487016ffd67a6689e75a.png',
 'Cat3.png',
 '99059361-choose-cat-litter-632x475.png',
 '5705f938c04ca27c38f24fc9e7a2f68b.png',
 'kitten_250.png',
 'willoween1.png']
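
Splitting on `'/'` works for the POSIX paths used here. A slightly more portable sketch uses `os.path.basename` instead. The helper name `file_names` and the example paths are made up for illustration. A list comprehension also keeps a slice such as `names[:10]` working under Python 3, where `map` returns an iterator rather than a list.

```python
import os


def file_names(paths):
    """Return the final path component (the file name) of each path."""
    return [os.path.basename(p) for p in paths]


# Hypothetical paths, standing in for the values in data['path'].
example_paths = ['/data/resize/Cat3.png', '/data/resize/kitten_250.png']
example_names = file_names(example_paths)
```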

#### Fetch Dato's ImageNet-trained deep neural net.

In [3]:
start = time.time()
pretrained_model = gl.load_model('http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45')
time1 = time.time()
print time1 - start

pretrained_model.save('ImageNet45')
test = gl.load_model('ImageNet45')
time2 = time.time()
print time2 - time1, time2 - start

PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45/dir_archive.ini to /var/tmp/graphlab-wilber/3860/000000.ini
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45/objects.bin to /var/tmp/graphlab-wilber/3860/000001.bin


IOError: http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45 is not a valid file name.

#### Save the model so it can be restored later.

In [None]:
import cPickle as pickle
pretrained_model.save('ImageNet45')

test = gl.load_model('ImageNet45')

In [None]:
data['extracted_features'] = pretrained_model.extract_features(data, 21)

In [13]:
data['extracted_features'] = pretrained_model.extract_features(data, 21)
totsecs = time.time() - start
print totsecs
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

NameError: name 'pretrained_model' is not defined

In [None]:
X21 = np.array(data['extracted_features'])
print np.shape(X21)
X21[:5]

m21 = isof.iForest(n_estimators=50, max_depth=100)
start = time.time()
m21.fit(X21)
end = time.time()
print "elapsed time: {0}s.".format(end - start)
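
The `isof.iForest` class used here is the author's own module, and its code is not shown in this notebook. As a rough illustration of what an isolation forest computes, here is a minimal NumPy sketch under stated assumptions: each tree splits on a random feature at a random value, and the score is `2 ** (-mean_path_length / c(psi))`, where `c` is the usual average-path-length normaliser. This is not the `isof` implementation. Every function name and default below is hypothetical, and the input is assumed to have at least two rows.

```python
import numpy as np


def _c(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1.0) + 0.5772156649) - 2.0 * (n - 1.0) / n


def _build(X, depth, max_depth, rng):
    """Grow one isolation tree as nested tuples."""
    n = X.shape[0]
    if depth >= max_depth or n <= 1:
        return ('leaf', n)
    feature = rng.randint(X.shape[1])
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        # Simplification: stop if the chosen feature is constant here.
        return ('leaf', n)
    split = rng.uniform(lo, hi)
    left = X[:, feature] < split
    return ('node', feature, split,
            _build(X[left], depth + 1, max_depth, rng),
            _build(X[~left], depth + 1, max_depth, rng))


def _path_length(x, node):
    """Depth at which x lands, plus the leaf-size correction c(size)."""
    depth = 0
    while node[0] == 'node':
        _, feature, split, left, right = node
        node = left if x[feature] < split else right
        depth += 1
    return depth + _c(node[1])


def anomaly_scores(X, n_estimators=100, max_samples=256, seed=0):
    """Return one score per row; values near 1 are the most anomalous."""
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    psi = min(max_samples, n)  # subsample size per tree
    max_depth = int(np.ceil(np.log2(max(psi, 2))))
    trees = []
    for _ in range(n_estimators):
        idx = rng.choice(n, size=psi, replace=False)
        trees.append(_build(X[idx], 0, max_depth, rng))
    mean_depth = np.array(
        [np.mean([_path_length(x, t) for t in trees]) for x in X])
    return 2.0 ** (-mean_depth / _c(psi))


# Synthetic demo: 60 points in the unit square plus one far-away point.
demo_rng = np.random.RandomState(1)
demo_X = np.vstack([demo_rng.uniform(0, 1, (60, 2)), [[10.0, 10.0]]])
demo_scores = anomaly_scores(demo_X, n_estimators=50)
```

In the demo, the far-away point (row 60) is isolated after very few splits, so it receives the highest score. Higher scores meaning "more anomalous" matches how the ranking cells in this notebook read `anomaly_score_`.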

In [None]:
anom_scores = m21.anomaly_score_
sort_indices = np.argsort(anom_scores)

for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

[INFO] This non-commercial license of GraphLab Create is assigned to wilber@ssl.berkeley.edu and will expire on September 15, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-11450 - Server binary: /usr/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1442961464.log
[INFO] GraphLab Server Version: 1.6


1.59461402893
Couldn't import dot_parser, loading of dot files will not be possible.

### Try iForest using real image data (raw pixels):

In [None]:
%pwd

### High-level features from Dato's ImageNet classifier:

In [None]:
start = time.time()
WORKING_DIR = '/home/wilber/work/Galvanize/gcp-data/iForest/loonie'
data = gl.image_analysis.load_images(WORKING_DIR, \
                                     random_order=True)
time1 = time.time()
print time1 - start
data['image'] = gl.image_analysis.resize(data['image'], 256, 256)
totsecs = time.time() - start
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

In [None]:
names = data['path']
names = map(lambda x: x.split('/')[-1], names)
names[:10]

In [None]:
# Import data from MNIST
#data = gl.SFrame('http://s3.amazonaws.com/dato-datasets/mnist/sframe/train6k')

# Create a DeepFeatureExtractor object.

start = time.time()
extractor = gl.feature_engineering.DeepFeatureExtractor(feature = 'image')
help(extractor)  # help() prints its own output and returns None
# Fit the extractor to the given dataset.
time1 = time.time()
print "\n\n", time1 - start
extractor = extractor.fit(data)
time2 = time.time()
print "\n\n", time2 - time1

# Return the model used for the deep feature extraction.
extracted_model = extractor['model']
time3 = time.time()
print "\n\n", time3 - time2

# Extract features.
features_sf = extractor.transform(data)
print features_sf.head()
time4 = time.time()
print "\n\n", time4 - time3, ", total time: ", time4 - start

In [None]:
#feature = features_sf.head['deep_features_image']
features_sf.num_rows(), features_sf.num_cols()

In [None]:
extractor['model']

In [None]:
opts_dict = extractor.get_current_options()
for key, value in opts_dict.iteritems():
    print key, value

In [None]:
repr(extractor)

In [None]:
features_sf.column_names()

In [None]:
np.shape(features_sf['deep_features_image'])

In [None]:
X = np.array(features_sf['deep_features_image'])

In [None]:
X[:5]

In [None]:
model = isof.iForest(n_estimators=200)
start = time.time()
model.fit(X)
end = time.time()
print end - start

In [None]:
anom_scores = model.anomaly_score_
sort_indices = np.argsort(anom_scores)

for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])
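
The ranking cell above is repeated for every model in this notebook, and its second loop assumes at least 1999 images. A small sketch of the same idea as two reusable helpers that work for any number of images. The helper names and the toy data are hypothetical. The `'fig'` prefix is taken from the loops above; what those files represent is not stated in this notebook.

```python
import numpy as np


def rank_by_score(names, scores):
    """Return (rank, name, score) tuples, highest anomaly score first."""
    order = np.argsort(scores)[::-1]
    return [(rank, names[i], float(scores[i]))
            for rank, i in enumerate(order, start=1)]


def ranks_with_prefix(ranking, prefix):
    """Keep only the ranked entries whose file name starts with prefix."""
    return [entry for entry in ranking if entry[1].startswith(prefix)]


# Toy data, standing in for `names` and `model.anomaly_score_`.
toy_names = ['a.png', 'fig1.png', 'b.png']
toy_scores = np.array([0.4, 0.9, 0.5])
toy_ranking = rank_by_score(toy_names, toy_scores)
toy_figs = ranks_with_prefix(toy_ranking, 'fig')
```

With the real data, `rank_by_score(names, anom_scores)[:15]` would give the same top 15 as the first loop.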

### Features from different layers of the pre-trained model:

#### Layer 22

In [3]:
import graphlab as gl
start = time.time()
pretrained_model = gl.load_model('http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45')
time1 = time.time()
#pretrained_model = gl.load_model('/home/wilber/work/Galvanize/gcp/DatoImageNetIter45.bin')
print "dt1: ", time1 - start
data['extracted_features'] = pretrained_model.extract_features(data, 22)
print time.time() - time1
totsecs = time.time() - start
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45/dir_archive.ini to /var/tmp/graphlab-wilber/29977/862a9888-7e64-422c-b70d-660e1c324f60.ini
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45/objects.bin to /var/tmp/graphlab-wilber/29977/573cd2a3-9ce1-4f25-a5bb-5a270497b1d7.bin


IOError: http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45 is not a valid file name.

In [None]:
#pretrained_model.save('/home/wilber/work/Galvanize/gcp/DatoImageNetIter45.bin')

In [None]:
X22 = np.array(data['extracted_features'])
print np.shape(X22)
X22[:5]

In [None]:
m22 = isof.iForest(n_estimators=200)
start = time.time()
m22.fit(X22)
end = time.time()
print "elapsed time: {0}s.".format(end - start)

In [None]:
anom_scores = m22.anomaly_score_
sort_indices = np.argsort(anom_scores)

for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

#### Layer 21

In [None]:
start = time.time()
#pretrained_model = gl.load_model('http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45')
#time1 = time.time()
#print time1 - start
data['extracted_features'] = pretrained_model.extract_features(data, 21)
totsecs = time.time() - start
print totsecs
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

In [None]:
X21 = np.array(data['extracted_features'])
print np.shape(X21)
X21[:5]

In [None]:
m21 = isof.iForest(n_estimators=200)
start = time.time()
m21.fit(X21)
end = time.time()
print "elapsed time: {0}s.".format(end - start)

In [None]:
anom_scores = m21.anomaly_score_
sort_indices = np.argsort(anom_scores)

for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

#### Layer 20

In [None]:
start = time.time()
# pretrained_model = gl.load_model('http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45')
# time1 = time.time()
# print time1 - start
data['extracted_features'] = pretrained_model.extract_features(data, 20)
totsecs = time.time() - start
print totsecs
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

In [None]:
X20 = np.array(data['extracted_features'])
print np.shape(X20)
X20[:5]

In [None]:
m20 = isof.iForest(n_estimators=200)
start = time.time()
m20.fit(X20)
end = time.time()
print "elapsed time: {0}s.".format(end - start)

In [None]:
anom_scores = m20.anomaly_score_
sort_indices = np.argsort(anom_scores)

for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

#### Layer 19

In [None]:
start = time.time()
# pretrained_model = gl.load_model('http://s3.amazonaws.com/dato-datasets/deeplearning/imagenet_model_iter45')
# time1 = time.time()
# print time1 - start
data['extracted_features'] = pretrained_model.extract_features(data, 19)
totsecs = time.time() - start
print totsecs
hours = int(totsecs/3600)
mins = int((totsecs - 3600.*hours)/60)
secs = totsecs - 3600.*hours - 60.*mins
print "Elapsed time = {0} hours, {1} minutes, {2} seconds".format(hours, mins, secs)

In [None]:
X19 = np.array(data['extracted_features'])
print np.shape(X19)
X19[:5]

In [None]:
m19 = isof.iForest(n_estimators=200)
start = time.time()
m19.fit(X19)
end = time.time()
print "elapsed time: {0}s.".format(end - start)

In [None]:
anom_scores = m19.anomaly_score_
sort_indices = np.argsort(anom_scores)
for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

### With layer 19, we have clearly gone too far!

### Now try with raw pixels:
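
The next cell fits on `Xraw`, which this notebook never defines. A minimal sketch of one plausible construction, assuming the images are already available as equally sized NumPy arrays. How to obtain pixel arrays from the GraphLab image column is not shown here. The name `flatten_images` and the placeholder data are hypothetical.

```python
import numpy as np


def flatten_images(images):
    """Stack equally sized images into an (n_images, n_pixels) float array."""
    arr = np.asarray(images, dtype=float)
    # Keep the first axis (one row per image) and flatten everything else.
    return arr.reshape(arr.shape[0], -1)


# Placeholder for four 256 x 256 RGB images, the size used after resizing.
fake_images = np.zeros((4, 256, 256, 3), dtype=np.uint8)
Xraw_demo = flatten_images(fake_images)
```

Each row then has 256 * 256 * 3 = 196608 raw-pixel features, far more than the deep-feature vectors used in the earlier sections.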

In [None]:
model = isof.iForest(n_estimators=200)
start = time.time()
model.fit(Xraw)
end = time.time()
print end - start

In [None]:
anom_scores = model.anomaly_score_
sort_indices = np.argsort(anom_scores)

for i in range(1,16):
    ind = sort_indices[-i]
    print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])

print "\n"
for i in range(1, 2000):
    ind = sort_indices[-i]
    if names[ind].startswith('fig'):
        print "{0}\t{1}\t{2}".format(i, names[ind], anom_scores[ind])