# Deep Learning for Computer Vision: Final Project

## Computer Science: COMS W 4995 004

### Proposal: March 21, 2017
### Presentations: April 20, 25, and 27, 2017
### Report: April 27, 2017

### Project Overview

The final project is one of the most important and, hopefully, exciting components of the course. You will have the opportunity to develop a deep learning system of your own choosing. 
You are free to select whatever framework (TensorFlow, Theano, Caffe) or wrapper (Keras, TFLearn) you like, but you need to create a report on your project in a Jupyter notebook. You are also free to build on publicly available models and code, but your report must clearly give attribution for the work of others and must clearly delineate your contributions.

### Project Proposal

The project description should include the title of the project, participants, a description of the objectives of the project, and a plan for how the project will be completed. The description of the objectives should include modest predictions of the success of the project. The plan for completion should include a description of the training data and how it will be obtained, a discussion of what deep learning framework will be used and why, and a rough description of the planned network architecture.

You are permitted to work together on a project in groups of two or three, but group size must not exceed three participants.  For group projects there must be a clearly delineated division of labor: you should state in the project description and project report who was responsible for which portion of the project. Each student must hand in a separate report. (Students will not necessarily get the same grade for the same project.)

You should mention whether you are simply re-implementing what others have done before but applying it to new data, or whether you are attempting to do something that is, to the best of your knowledge, new. Creative and original projects will be judged more kindly than those that rehash something in the existing literature. Projects that include a component in which data is acquired and curated into training and validation sets will be viewed more favorably than those that simply download an existing data set such as CIFAR-100.

As this is a computer vision course, it is expected that your data will be visual, but exceptions might be made if the student is enthusiastic and persuasive enough. The most straightforward project would be to build a system that classifies images into categories. A more difficult project might be to build a system that detects and localizes a type of object within an image. A still more complicated project might involve joining a ConvNet with an LSTM for a problem (like image captioning) that requires vision and language. But again, creative and original projects will be judged more kindly.

It is important to scope your project so that you get some working results. Project reports that say "I tried this and this but nothing seemed to work..." are discouraged. 
Above all, you should demonstrate end-to-end fluency in the basics of deep learning. 

I cannot wait to see the results. Good luck!

### Project Presentations


To allow students to present their work in three class periods, each student will have only 3 minutes, not a second more. We will be strict about the timing, so you should practice your presentation. The key here is to get across three things: what you did, how you did it, and how well it worked. Students working in groups of two will get 6 minutes and groups of three will get 9 minutes. 


### Project Reports

The report should be done as a Jupyter Notebook. The report should be a complete description of the objectives of the work, the methods used to solve the problem, experimental evidence of a working system, the code, and clear delineation of what you have done vs. what you are leveraging that others have done. If you have used the work of others YOU MUST INCLUDE ATTRIBUTION by citing this work inline and as part of a "bibliography" at the end. You should describe what worked, what did not, and why. If you are working in a group you need to submit your own report and this report should be clear about what your individual contribution was.  

# Proposal

- Title of the project: **Frictionless product categorization**
- Participant: Jose Vicente Ruiz Cepeda (jr3660)
- Context: I am the cofounder of Relendo, a peer-to-peer rental platform where users upload products that other users can rent (think Airbnb, but with goods like high-end cameras, tools, sports equipment...). When users upload a product, they have to take a picture of it and then select the category (and, sometimes, subcategory) it belongs to. A Neural Network could remove this step entirely if it were able to select the category based only on the image(s) uploaded by the user. Although this alone would create value, the Neural Network could also recognize the object in the images and pass this information to another system that returns information about similar, successful products that have already been uploaded, so that the user has a reference for the wording of the title/description, the price per day, etc.
- Objectives:
  - 1: Build a Neural Network that is able to classify a product image into its correct category 80% of the time.
  - 2: Build a Neural Network that is able to detect the product in the image with 50% accuracy.
  - 3 (probably for the future): Build a Neural Network that is able to learn the features of the different products from their images.
- Plan:
  - 1: First, extract, clean and curate the dataset of the images already uploaded by users. As of today (March 21st) it is composed of almost 12,000 products with more than 18,000 images. Then, design and implement a Neural Network that is able to categorize an image into one of the eight available categories in Relendo: photography and video, sports, electronics, tools, events, musical instruments, caravans, and others (the default one), with an accuracy of at least 80% on the test set.
  - 2: Create another dataset using the titles of the products, split by spaces, as labels for the images. Then, design and implement a Neural Network that is able to infer these labels, which will represent the specific products in the images. This information could later be used to help the user with the uploading process.
  - 3 (probably for the future): Design and implement a new network that will perform Unsupervised Feature Learning over the images, so that the features can later be used to find clusters of products. This would make it easy to detect when a certain type of product is frequent enough to become its own category/subcategory.
- Dataset: the set of images uploaded by the users along with their products information to www.relendo.com.
- Deep Learning framework: I will start with Keras, because it is easy to use, on top of a TensorFlow backend, which has the advantage of being easy to deploy in production environments.
- Deep Learning architecture: I will have to try different architectures, but I am pretty sure that all of them will be sequential and will contain several convolutional layers that connect to dense layers at the end.
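Step 2 of the plan derives image labels from product titles split by spaces. A minimal sketch of that idea follows, assuming titles are plain strings and that lowercasing them is acceptable. The sample titles and the helpers `tokenize`, `build_vocabulary`, and `encode_title` are hypothetical and are not part of the Relendo codebase.

```python
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it on whitespace."""
    return title.lower().split()


def build_vocabulary(titles, min_count=1):
    """Map each token seen at least `min_count` times to a column index."""
    counts = Counter(token for title in titles for token in tokenize(title))
    kept = sorted(token for token, n in counts.items() if n >= min_count)
    return {token: index for index, token in enumerate(kept)}


def encode_title(title, vocabulary):
    """Return a multi-hot vector: 1 where the title contains the token."""
    vector = [0] * len(vocabulary)
    for token in tokenize(title):
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


# Hypothetical titles, for illustration only.
titles = ["Canon EOS 5D", "Canon lens 50mm", "Electric drill"]
vocabulary = build_vocabulary(titles)
labels = [encode_title(title, vocabulary) for title in titles]
print(labels[0])  # [0, 1, 1, 0, 0, 1, 0]
```

Because one image can carry several tokens at once, a network trained on such targets would typically end in one sigmoid unit per token with a binary cross-entropy loss, rather than a single softmax.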

# Code

In [1]:
import keras
from keras import optimizers

from keras.models import Model

from keras.applications import ResNet50
from keras.applications import InceptionV3
from keras.applications import Xception # TensorFlow ONLY
from keras.applications import VGG16
from keras.applications import VGG19

from keras.applications.imagenet_utils import preprocess_input
from keras.applications.imagenet_utils import decode_predictions
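# aliased so it does not shadow the imagenet_utils version imported above
from keras.applications.inception_v3 import preprocess_input as inception_preprocess_input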

from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
from keras.preprocessing.image import ImageDataGenerator

from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import ZeroPadding2D
from keras.layers import Activation
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import Flatten

from keras.callbacks import TensorBoard
from keras.callbacks import ModelCheckpoint

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

%matplotlib inline


import os
import h5py
from keras.models import Sequential
from keras.models import load_model
from keras import backend

Using TensorFlow backend.


In [2]:
RELENDO_DATASET = 0
CATS_DOGS_DATASET = 1

DATASET = RELENDO_DATASET
MANUAL_VGG16 = False

In [3]:
if DATASET == RELENDO_DATASET:
    nb_classes = 8
    class_name = {
        0: 'photo',
        1: 'electronics',
        2: 'events',
        3: 'instruments',
        4: 'tools',
        5: 'sports',
        6: 'caravans',
        7: 'others'
    }
else:
    nb_classes = 2
    class_name = {
        0: 'cats',
        1: 'dogs'
    }

In [7]:
# dimensions of our images.
img_width, img_height = 224, 224

if DATASET == RELENDO_DATASET:
    train_data_dir = './data/train'
    validation_data_dir = './data/validation'
    nb_train_samples = 15478
    nb_validation_samples = 3876
elif DATASET == CATS_DOGS_DATASET:
    train_data_dir = './cats-dogs-data/train'
    validation_data_dir = './cats-dogs-data/validation'
    nb_train_samples = 20000
    nb_validation_samples = 2000
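The Relendo sample counts above amount to roughly an 80/20 split (15478 of 19354 images go to training). A minimal sketch of how such a split could be produced in the one-folder-per-class layout that `flow_from_directory` reads is shown below. The function `split_dataset`, the source folder layout, and the fixed seed are assumptions, not the procedure actually used to build this dataset.

```python
import os
import random
import shutil


def split_dataset(source_dir, train_dir, validation_dir,
                  validation_fraction=0.2, seed=0):
    """Copy source_dir/<label>/<image> into train and validation folders.

    Returns a (n_train, n_validation) tuple with the number of files copied.
    """
    rng = random.Random(seed)
    n_train = n_validation = 0
    for label in sorted(os.listdir(source_dir)):
        label_dir = os.path.join(source_dir, label)
        if not os.path.isdir(label_dir):
            continue
        files = sorted(os.listdir(label_dir))
        rng.shuffle(files)
        # the first n_val shuffled files of each class go to validation
        n_val = int(len(files) * validation_fraction)
        for index, filename in enumerate(files):
            to_validation = index < n_val
            target_root = validation_dir if to_validation else train_dir
            target_dir = os.path.join(target_root, label)
            if not os.path.isdir(target_dir):
                os.makedirs(target_dir)
            shutil.copy(os.path.join(label_dir, filename), target_dir)
            if to_validation:
                n_validation += 1
            else:
                n_train += 1
    return n_train, n_validation
```

Splitting per class keeps the class proportions similar in both folders, which matters when one category (such as `others`) is much larger than the rest.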

In [8]:
# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1./255)

if DATASET == RELENDO_DATASET:
    class_mode = 'categorical'
elif DATASET == CATS_DOGS_DATASET:
    class_mode = 'binary'

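# list the class folders in index order so that label i always maps to class_name[i]
# (dict.values() does not guarantee that order on every Python version)
class_list = [class_name[i] for i in range(nb_classes)]

train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=16,
        classes=class_list,
        class_mode=class_mode)

validation_generator = test_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_height, img_width),
        batch_size=16,
        classes=class_list,
        class_mode=class_mode)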

Found 15478 images belonging to 8 classes.
Found 3876 images belonging to 8 classes.


In [9]:
model = ResNet50(
    weights="imagenet",
    include_top=False,
    input_shape=(img_width, img_height, 3)
)

In [10]:
def build_vgg16(framework='tf'):

    if framework == 'th':
        # build the VGG16 network in Theano weight ordering mode
        backend.set_image_dim_ordering('th')
    else:
        # build the VGG16 network in Tensorflow weight ordering mode
        backend.set_image_dim_ordering('tf')
        
    model = Sequential()
    if framework == 'th':
        model.add(ZeroPadding2D((1, 1), input_shape=(3, img_width, img_height)))
    else:
        model.add(ZeroPadding2D((1, 1), input_shape=(img_width, img_height, 3)))
        
    model.add(Convolution2D(64, 3, 3, activation='relu', name='conv1_1'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(64, 3, 3, activation='relu', name='conv1_2'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu', name='conv2_1'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu', name='conv2_2'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu', name='conv3_1'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu', name='conv3_2'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu', name='conv3_3'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu', name='conv4_1'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu', name='conv4_2'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu', name='conv4_3'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu', name='conv5_1'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu', name='conv5_2'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu', name='conv5_3'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))
    
    return model

if MANUAL_VGG16:
    # path to the model weights files.
    weights_path = '../HW5/vgg16_weights.h5'
    th_model = build_vgg16('th')

    # load the weights of the VGG16 networks
    # (trained on ImageNet, won the ILSVRC competition in 2014)
    # note: when there is a complete match between your model definition
    # and your weight savefile, you can simply call model.load_weights(filename)
    assert os.path.exists(weights_path), 'Model weights not found (see "weights_path" variable in script).'
    f = h5py.File(weights_path, 'r')  # open read-only
    for k in range(f.attrs['nb_layers']):
        if k >= len(th_model.layers):
            # we don't look at the last (fully-connected) layers in the savefile
            break
        g = f['layer_{}'.format(k)]
        weights = [g['param_{}'.format(p)] for p in range(g.attrs['nb_params'])]
        th_model.layers[k].set_weights(weights)
    f.close()
    print('Model loaded.')

    tf_model = build_vgg16('tf')

    # transfer weights from th_model to tf_model
    for th_layer, tf_layer in zip(th_model.layers, tf_model.layers):
        if th_layer.__class__.__name__ == 'Convolution2D':
            kernel, bias = th_layer.get_weights()
            kernel = np.transpose(kernel, (2, 3, 1, 0))
            tf_layer.set_weights([kernel, bias])
        else:
            tf_layer.set_weights(tf_layer.get_weights())
    
    model = tf_model
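The `np.transpose(kernel, (2, 3, 1, 0))` call above reorders a Theano-style kernel, stored as (output channels, input channels, rows, columns), into the TensorFlow layout (rows, columns, input channels, output channels). A small pure-Python sketch of how the axes tuple maps shapes follows. `transposed_shape` is a hypothetical helper that mirrors NumPy's rule that output axis `i` takes its size from input axis `axes[i]`.

```python
def transposed_shape(shape, axes):
    """Shape produced by np.transpose(array, axes) for an array of `shape`."""
    return tuple(shape[axis] for axis in axes)


# First VGG16 convolution: 64 filters over 3 input channels, 3x3 window.
th_kernel_shape = (64, 3, 3, 3)  # (out, in, rows, cols)
tf_kernel_shape = transposed_shape(th_kernel_shape, (2, 3, 1, 0))
print(tf_kernel_shape)  # (3, 3, 3, 64), i.e. (rows, cols, in, out)

# conv2_1 (128 filters over 64 channels) makes the reordering easier to see.
print(transposed_shape((128, 64, 3, 3), (2, 3, 1, 0)))  # (3, 3, 64, 128)
```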

In [11]:
# build a classifier model to put on top of the convolutional model
x = model.output
x = Flatten(input_shape=model.output_shape[1:])(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)

if DATASET == RELENDO_DATASET:
    preds = Dense(nb_classes, activation='softmax')(x)
elif DATASET == CATS_DOGS_DATASET:
    preds = Dense(1, activation='sigmoid')(x)

final_model = Model(model.input, preds)
# print (final_model.summary())

In [12]:
for layer in final_model.layers[:-4]:
    layer.trainable = False
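The slice `[:-4]` leaves only the four layers added in the previous cell (Flatten, Dense, Dropout, Dense) trainable and freezes everything inherited from the pretrained base. A minimal sketch of that slicing logic with stand-in objects is shown below. `FakeLayer` and the layer names are hypothetical and only illustrate the indexing, not Keras itself.

```python
class FakeLayer(object):
    """Stand-in for a Keras layer: just a name and a trainable flag."""

    def __init__(self, name):
        self.name = name
        self.trainable = True


# Hypothetical stack: three base layers followed by the new classifier head.
layers = [FakeLayer(name) for name in
          ["base_conv_a", "base_conv_b", "base_pool",
           "flatten", "dense_256", "dropout", "predictions"]]

# Freeze everything except the last four layers.
for layer in layers[:-4]:
    layer.trainable = False

trainable = [layer.name for layer in layers if layer.trainable]
print(trainable)  # ['flatten', 'dense_256', 'dropout', 'predictions']
```

In Keras the flags need to be set before `compile` for them to take effect, which matches the cell order used here.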

In [None]:
print (final_model.summary())

In [13]:
# compile the model with an Adagrad optimizer
# and a small learning rate.

if DATASET == RELENDO_DATASET:
    loss = 'categorical_crossentropy'
elif DATASET == CATS_DOGS_DATASET:
    loss = 'binary_crossentropy'

final_model.compile(
    loss=loss,
    optimizer=optimizers.Adagrad(lr=0.001),
    metrics=['accuracy']
)
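The cell above compiles the model with `optimizers.Adagrad(lr=0.001)`. Adagrad divides each step by the square root of the accumulated squared gradients, so parameters that keep receiving large gradients take progressively smaller steps. A scalar sketch of one update is shown below. `adagrad_step`, the constant gradient, and the `epsilon` value are assumptions made for illustration.

```python
import math


def adagrad_step(param, grad, accumulator, lr=0.001, epsilon=1e-8):
    """One Adagrad update for a single scalar parameter."""
    accumulator += grad * grad
    param -= lr * grad / (math.sqrt(accumulator) + epsilon)
    return param, accumulator


param, accumulator = 1.0, 0.0
for step in range(3):
    # Constant gradient of 2.0, chosen only to make the shrinking steps visible.
    previous = param
    param, accumulator = adagrad_step(param, 2.0, accumulator)
    print("step %d: moved by %.6f" % (step, previous - param))
```

Even with a constant gradient, each printed step is smaller than the one before it, which is why a fixed `lr` such as 0.001 behaves like a decaying learning rate under Adagrad.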

In [14]:
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

epochs = 20
batch_size = 32

# Keras 2
#final_model.fit_generator(
#    train_generator,
#    steps_per_epoch=1000,
#    epochs=epochs,
#    validation_data=validation_generator,
#    validation_steps=100
#)


# Keras 1: samples_per_epoch and nb_val_samples are counted in samples, not batches
final_model.fit_generator(
    train_generator,
    samples_per_epoch=nb_train_samples,
    nb_epoch=epochs,
    validation_data=validation_generator,
    nb_val_samples=nb_validation_samples)

Epoch 1/2



Epoch 2/2


<keras.callbacks.History at 0x146bc6d90>
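Keras 1's `samples_per_epoch` and `nb_val_samples` are counted in samples, while Keras 2's `steps_per_epoch` and `validation_steps` are counted in batches. A small sketch of the conversion follows, assuming the generator batch size of 16 passed to `flow_from_directory` above. `steps_for` is a hypothetical helper.

```python
def steps_for(n_samples, batch_size):
    """Number of batches needed to see every sample once (ceiling division)."""
    return (n_samples + batch_size - 1) // batch_size


generator_batch_size = 16  # matches flow_from_directory(batch_size=16)

train_steps = steps_for(15478, generator_batch_size)
validation_steps = steps_for(3876, generator_batch_size)
print("%d %d" % (train_steps, validation_steps))  # 968 243
```

With Keras 2 these two values would be passed as `steps_per_epoch` and `validation_steps`. With Keras 1 the raw sample counts are passed instead.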

In [15]:
# Keras 1: samples_per_epoch and nb_val_samples are counted in samples, not batches
final_model.fit_generator(
    train_generator,
    samples_per_epoch=nb_train_samples,
    nb_epoch=20,
    validation_data=validation_generator,
    nb_val_samples=nb_validation_samples)

Epoch 1/20
Epoch 2/20
Epoch 3/20

Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x13adf9cd0>

# Sandbox

In [None]:
img_path = '/Users/Josevi/product_images/photo/9760-1.jpg'
img = load_img(img_path, target_size=(224, 224))
x = img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

In [None]:
imgplot = plt.imshow(mpimg.imread(img_path))
plt.show()

In [None]:
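%%time
# `model` is the headless base network, so this prints feature maps;
# use final_model.predict(x) to get the class probabilities
preds = model.predict(x)
print(preds)
#print('Predicted:', decode_predictions(preds))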

In [None]:
def show_sample(X, y, prediction=-1):
    im = X
    print(y)
    #y = np.flip(y, axis=0)
    y_label = class_name[np.nonzero(y)[0][0]]
    plt.imshow(im)
    if prediction >= 0:
        plt.title("Class = %s, Predict = %s" % (y_label, class_name[prediction]))
    else:
        plt.title("Class = %s" % (y_label))

    plt.axis('on')
    plt.show()
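`show_sample` recovers the label index with `np.nonzero(y)[0][0]`, which returns the first non-zero entry. That suits one-hot labels, but not softmax outputs, where every entry is non-zero. Taking the largest entry handles both cases. A minimal pure-Python sketch follows, assuming the class order from the `class_name` dictionary above. `decode` and the probability values are hypothetical.

```python
CLASS_NAMES = ['photo', 'electronics', 'events', 'instruments',
               'tools', 'sports', 'caravans', 'others']


def decode(vector):
    """Return (class name, score) for the largest entry of `vector`."""
    index = max(range(len(vector)), key=lambda i: vector[i])
    return CLASS_NAMES[index], vector[index]


one_hot = [0, 0, 0, 1, 0, 0, 0, 0]
# Made-up softmax output, for illustration only.
probabilities = [0.05, 0.10, 0.02, 0.03, 0.60, 0.10, 0.05, 0.05]

print(decode(one_hot))        # ('instruments', 1)
print(decode(probabilities))  # ('tools', 0.6)
```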

In [None]:
for X_batch, Y_batch in train_generator:
    for i in range(len(Y_batch)):
        show_sample(X_batch[i, :, :, :], Y_batch[i])
    break