<img src="images/title.png" alt="Drawing" style="width: 1100px;"/>

# Overview
This competition aims to correctly classify millions of products for the e-commerce company Cdiscount.com. The task is to assign each of roughly 9 million products to 1 of 5,270 categories using image classification. Each product has 1-4 images (180x180 resolution) in the dataset.

# Data
* category_names.7z
 * Shows the hierarchy of product classification
 * Each category_id has a level1, level2, and level3 name in French
 * Each product's category_id corresponds to one specific spot in the category tree (a level 1, level 2, and level 3 name)
* train_example.bson
 * First 100 dicts from train.bson
* train.bson
 * List of 7,069,896 dictionaries (one per product) with keys:
  * product id ( **\_id: 42** )
  * category id ( **category_id: 1000021794** )
  * list of 1-4 images in a dictionary ( **imgs: [{'picture':b'...binarystring...'}, {'picture':b'...binarystring...'}]** )
* test.bson
 * List of 1,768,182 products in the same format as train.bson, except that the products have no 'category_id'
* sample_submission.7z 


 | \_id   | category_id   |  
 |:---    |:---           |
 | 10     |	1000010653    |
 | 14     |	1000010653    |
 | 21     |	1000010653    |
 | 24     |	1000010653    |
 | 27     |	1000010653    |
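
To make the record layout concrete, here is a minimal sketch that walks decoded product dictionaries and tallies how many images each product has. The function name `count_images_per_product` and the sample records are hypothetical and only for illustration. In practice the dictionaries would presumably come from something like `bson.decode_file_iter(open('train_example.bson', 'rb'))` in the pymongo `bson` package.

```python
from collections import Counter

def count_images_per_product(products):
    """Tally how many products have 1, 2, 3 or 4 images.

    `products` is any iterable of dicts shaped like the records described above:
    {'_id': int, 'category_id': int, 'imgs': [{'picture': bytes}, ...]}
    """
    counts = Counter()
    for product in products:
        counts[len(product['imgs'])] += 1
    return counts

# Hypothetical stand-in records; the byte strings are placeholders, not real images.
sample_products = [
    {'_id': 0, 'category_id': 1000010653, 'imgs': [{'picture': b'...'}]},
    {'_id': 1, 'category_id': 1000010653, 'imgs': [{'picture': b'...'}, {'picture': b'...'}]},
    {'_id': 2, 'category_id': 1000021794, 'imgs': [{'picture': b'...'}]},
]

image_counts = count_images_per_product(sample_products)
print(dict(image_counts))
```

Because the function only needs an iterable, it can consume a streaming BSON decoder without loading all products into memory.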



In [1]:
import numpy as np
import pandas as pd
import io
import bson
import matplotlib.pyplot as plt
import seaborn as sns
from skimage.data import imread

import os
import math

import json

import cv2
from PIL import Image

from numpy.random import random, permutation
from scipy import misc, ndimage
from scipy.ndimage.interpolation import zoom

import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.models import Sequential, Model
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.layers import Input
from keras.layers.convolutional import Conv2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD, RMSprop
from keras.preprocessing import image

from keras.layers.advanced_activations import ELU

Using TensorFlow backend.


In [2]:
import requests, json

def slack(message):
    webhook_url = 'https://hooks.slack.com/services/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'
    slack_data = {'text': message, "link_names":1}

    response = requests.post(
        webhook_url, data=json.dumps(slack_data),
        headers={'Content-Type': 'application/json'})
    
    return response

# Evaluate the Submission Test Set

In [15]:
#Directory we pull test images from; all images must be in subdirs of this path (even if there is only 1 folder)
testrepo = "C:\\Kaggle\\04_Cdiscount\\"
#Directory holding the weights, category names and submissions
datarepo = "D:\\Kaggle\\04_Cdiscount\\"

#The batch size to use for NN
batch_size = 32

## Build VGG16 Model
Construct a VGG16 model in Keras that accepts the images from this competition as input, resized to 224x224 in channels-first layout (3,224,224).

In [4]:
vgg_mean = np.array([123.68, 116.779, 103.939]).reshape((3,1,1))

def vgg_preprocess(x):
    x = x - vgg_mean   #subtract mean
    return x[:, ::-1]  #RGB -> BGR

In [5]:
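def ConvBlock(layers, model, filters):
    #pass data_format explicitly so every layer agrees on the (3,224,224) channels-first input
    for i in range(layers):
        model.add(ZeroPadding2D((1,1), data_format="channels_first"))
        model.add(Conv2D(filters, (3,3), activation='relu', data_format="channels_first"))
    model.add(MaxPooling2D((2,2), strides=(2,2), data_format="channels_first"))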

In [6]:
def FullyConnectedBlock(model):
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))

In [7]:
def VGG16():
    model = Sequential()
    model.add(Lambda(vgg_preprocess, input_shape=(3,224,224)))
    
    ConvBlock(2, model, 64)
    ConvBlock(2, model, 128)
    ConvBlock(3, model, 256)
    ConvBlock(3, model, 512)
    ConvBlock(3, model, 512)
    
    model.add(Flatten())
    FullyConnectedBlock(model)
    FullyConnectedBlock(model)
    model.add(Dense(1000, activation='softmax'))
    
    return model

## Instantiate Model and Load Weights

In [8]:
model = VGG16()
model.pop()
model.add(Dense(5270, activation='softmax'))
model.load_weights(datarepo + "weights\\finetune_best_weights2.hdf5")
model.compile(optimizer=RMSprop(lr=0.000005), loss="categorical_crossentropy", metrics=['accuracy'])
#model.summary()

## Create a Master List of Images
Create_Image_List builds the image list that feeds our custom generator. Each image can be listed once, or the same number of images can be drawn from every training class regardless of its actual size. In the second case, smaller classes loop back to their first image, which helps with the imbalanced-dataset problem. Within each class, images can be taken in shuffled or sequential order.
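
The per-class looping can be illustrated on a toy example. This is a minimal sketch rather than the function used in this notebook: `balanced_round_robin` and the `toy` dictionary are hypothetical names, and the class-to-files mapping stands in for the directory walk.

```python
def balanced_round_robin(class_to_files, perclass):
    """Take `perclass` files from every class, wrapping around short classes."""
    out = []
    for idx in range(perclass):
        for cls, files in class_to_files.items():
            # the modulo wraps back to the first file once a class runs out
            out.append(cls + "\\" + files[idx % len(files)])
    return out

# Hypothetical classes: "cats" has 3 images, "dogs" has only 1
toy = {"cats": ["a.jpg", "b.jpg", "c.jpg"], "dogs": ["x.jpg"]}
balanced = balanced_round_robin(toy, perclass=3)
print(balanced)
```

Both classes contribute 3 entries, so the single "dogs" image is repeated while the "cats" images each appear once.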

In [16]:
import random

def Create_Image_List(directory, perclass=0, seed=42, shuffle=False):
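    """
    Return a list of image paths, each prefixed with its class subdirectory.
    directory must contain subdirs named after the classes.
    shuffle randomizes the order in which images are taken from each subdir.
    perclass > 0 pulls that many images from every subdir (looping if needed);
    perclass == 0 returns every image exactly once, grouped by class.
    With perclass > 0 the output interleaves classes: 1st image from the 1st class,
    2nd from the 2nd class, etc., looping back to the 1st class.
    """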
    Lfiles = []
    Lclasses = []
    Lmaster = []

    for i,(dirpath, dirname, fname) in enumerate(os.walk(directory)):
        if i == 0:
            Lclasses = dirname
        else:
            Lfiles.append([Lclasses[i-1], fname])
    
    #count total images
    totalimgs = 0
    for item in Lfiles:
        totalimgs += len(item[1])    

    print("Found", str(len(Lfiles)), "classes with a total of", totalimgs, "images" )

    #shuffle each class's image list
    if shuffle:
        random.seed(seed)
        for item in Lfiles:
            random.shuffle(item[1])

    #create an output list with each image appearing once
    if perclass == 0:
        for cls in Lfiles:
            for img in cls[1]:
                Lmaster.append(cls[0] + "\\" + img)
    
    #create the output list of images
    #if perclass is greater than the num of images in a class, loop back to its first img
    #every class will have the same num of images
    if perclass > 0:
        for idx in range(perclass):
            for cls in Lfiles:
                looper = idx % len(cls[1])
                Lmaster.append(cls[0] + "\\" + cls[1][looper])
    
    if perclass == 0:
        print("Returning a list with all images in each class, totaling", str(len(Lmaster)), "images")
    else:
        print("Returning a list with", str(perclass), "images per class, totaling", str(len(Lmaster)), "images")
    
    return Lmaster

In [17]:
Master_Images_Test = Create_Image_List(directory=testrepo, perclass=0, seed=42, shuffle=False)
Master_Filenames = [i.split('\\')[1] for i in Master_Images_Test]

Found 1 classes with a total of 3095080 images
Returning a list with all images in each class, totaling 3095080 images


## Create Master List of Categories

In [18]:
categories = pd.read_csv(r'D:\Kaggle\04_Cdiscount\category_names.csv', index_col='category_id')
Master_Classes = categories.index.tolist()
Master_Classes.sort()

## Create Custom Generator
This generator feeds images to the predict stage in fixed-size batches, making a single sequential pass over the image list. It is more configurable than the standard Keras ImageDataGenerator, which was skipping batches and giving erroneous results on this system. The helper function *Open_Image* ensures the generator yields correctly formatted images: numpy arrays resized to 224x224 in "channels first" layout, i.e. (3,224,224).
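
As a quick illustration of the "channels first" requirement, the sketch below uses a synthetic array in place of a decoded image (the array contents are made up). It shows how `np.transpose` moves the channel axis to the front and how several images stack into one batch.

```python
import numpy as np

# Hypothetical stand-in for a decoded 224x224 RGB image: (height, width, channels)
hwc = np.zeros((224, 224, 3), dtype=np.uint8)
hwc[..., 0] = 255  # fill only the red channel

# Move channels to the front: (H, W, C) -> (C, H, W)
chw = np.transpose(hwc, (2, 0, 1))

# Stacking images gives a batch of shape (N, C, H, W), which the model consumes
batch = np.asarray([chw, chw])
print(hwc.shape, chw.shape, batch.shape)
```

After the transpose, `chw[0]` holds the entire red plane, which is the layout expected by a model declared with `input_shape=(3,224,224)`.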

In [19]:
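def Open_Image(directory, path):
    #load the image file and convert it to a numpy array of shape (height, width, 3)
    im = Image.open(directory + path)
    imarray = np.array(im)
    #resize to the 224x224 input VGG16 expects (misc.imresize requires SciPy < 1.3)
    imresize = misc.imresize(imarray, (224,224))
    #reorder to "channels first": (224,224,3) -> (3,224,224)
    imT = np.transpose(imresize, (2,0,1))
    #img = Image.fromarray(imarray, 'RGB')
    #img.show()
    return imT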

In [20]:
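def Batch_Generator(dataset, batch_size, repo):
    #yield (images, class names) one batch at a time; the last batch may be smaller
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start : start+batch_size]
        images = np.asarray([Open_Image(repo, path) for path in batch])
        labels = np.asarray([path.split('\\')[0] for path in batch])
        yield images, labels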

In [24]:
test_batches = Batch_Generator(dataset=Master_Images_Test, batch_size=batch_size, repo=testrepo)

## Predict Output Classes for Submission Test Set
It may be worth looking at the predictions for every image of a product (up to 4) and combining the results or voting to determine the best classification. Another option is to run the extra images through a different NN and then ensemble the results.
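
One way to combine per-image predictions is to average the class probabilities over all images of a product and take the most probable class. The sketch below is an illustration under assumptions, not something this notebook runs. `average_by_product` is a hypothetical helper, the `<_id>-<n>.jpg` naming scheme is assumed, and the probabilities are invented.

```python
import numpy as np
from collections import defaultdict

def average_by_product(filenames, probs):
    """Average per-image class probabilities over each product.

    filenames: names like '10-1.jpg'; the part before '-' is taken as the product id
    probs:     array of shape (n_images, n_classes)
    Returns {product_id: index of the class with the highest mean probability}.
    """
    rows = defaultdict(list)
    for name, p in zip(filenames, probs):
        rows[name.split('-')[0]].append(p)
    return {pid: int(np.argmax(np.mean(ps, axis=0))) for pid, ps in rows.items()}

names = ['10-1.jpg', '10-2.jpg', '14-1.jpg']
probs = np.array([
    [0.6, 0.4, 0.0],   # image 1 of product 10 leans toward class 0
    [0.1, 0.9, 0.0],   # image 2 of product 10 strongly favors class 1
    [0.2, 0.1, 0.7],   # the only image of product 14
])
winners = average_by_product(names, probs)
print(winners)
```

For product 10 the mean probabilities are (0.35, 0.65, 0.0), so the confident second image outweighs the first and class 1 wins. A majority vote would instead tie between classes 0 and 1.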

The prediction output contains 5,270 values per image, so we must predict in batches and keep only the predicted class for each image along the way. Predicting all the submission test images at once and holding every probability would exhaust memory (3,095,080 images x 5,270 values/image x 4 bytes/value is roughly 65 GB).
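
The memory estimate can be checked with a few lines of arithmetic. The image count is the 3,095,080 images found in the test directory, and 4 bytes per value assumes float32 output.

```python
n_images = 3095080       # images found in the test directory
n_classes = 5270         # width of the softmax output
bytes_per_value = 4      # float32 (assumed)

# holding every probability for every image at once
total_bytes = n_images * n_classes * bytes_per_value
print(round(total_bytes / 1e9, 1), "GB for all predictions")

# holding the probabilities for a single batch
batch_size = 32
batch_bytes = batch_size * n_classes * bytes_per_value
print(round(batch_bytes / 1e6, 2), "MB per batch of predictions")
```

The full array needs about 65.2 GB, while one batch of 32 needs about 0.67 MB. That is why only the argmax of each batch should be kept.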

In [None]:
# Master_Classifications = []

# for i,(imgs,labels) in enumerate(test_batches):
#     if i%100 == 0: print("Finished batch:", str(i), "/96721")
#     preds = model.predict_on_batch(imgs)
#     highest_prob = np.argmax(preds, axis=1)
#     for highest in range(len(highest_prob)):
#         idx = highest_prob[highest]
#         Master_Classifications.append(Master_Classes[idx])

In [25]:
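Master_Classifications = []

#ceil so the final partial batch is also predicted (floor division would skip the last 8 images)
#note: predict_generator returns the full (n_images, 5270) array, so it must fit in memory
preds = model.predict_generator(generator=test_batches, steps=math.ceil(len(Master_Images_Test)/batch_size),
                               max_queue_size=10, workers=1, use_multiprocessing=False, verbose=1)

#map each image's most probable class index back to its category_id
highest_prob = np.argmax(preds, axis=1)
for idx in highest_prob:
    Master_Classifications.append(Master_Classes[idx])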

  293/96721 [..............................] - ETA: 23431s

KeyboardInterrupt: 

In [None]:
slack("FINISHED CLASSIFICATION")

## Format Predictions into Submission Format
- Create a numpy array with 2 columns, then label them **_id** and **category_id**
- Each row should be in the format of **_id,category_id** such as **5,1000016018**
 - Strip the "-#" image suffix and the file extension from each filename to recover the **_id**
 - Use Master_Classes to find the category_id
- Only keep predictions and filenames for images whose name ends in "-1", so each product appears once
- **MAKE SURE FINAL SUBMISSION HAS 1,768,182 ROWS**
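
The steps above can be sketched on a few made-up filenames. `to_submission_rows` is a hypothetical helper, and the `<_id>-<n>.jpg` naming scheme and the category values are assumptions for illustration only.

```python
def to_submission_rows(filenames, category_ids):
    """Keep only each product's '-1' image and return (_id, category_id) rows."""
    rows = []
    for name, cat in zip(filenames, category_ids):
        stem = name.rsplit('.', 1)[0]                 # drop the file extension
        product_id, _, img_no = stem.rpartition('-')  # split '<_id>-<n>'
        if img_no == '1':                             # one row per product
            rows.append((product_id, cat))
    return rows

files = ['10-1.jpg', '10-2.jpg', '14-1.jpg']
cats = [1000010653, 1000010653, 1000016018]
rows = to_submission_rows(files, cats)
print(rows)
```

Matching on the exact image number, rather than searching for "-1" anywhere in the name, avoids accidental matches if the naming scheme ever changes.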

In [None]:
#remove the file extension (everything after the first '.')
parsed_filenames = []
for imgname in Master_Filenames:
    parsed_filenames.append(imgname.split('.')[0])

#combine filenames and classifications into 1 numpy array
a = np.array(parsed_filenames)
b = np.array(Master_Classifications)
submission_array = np.column_stack((a,b))

#turn the numpy array into a Pandas Dataframe
df = pd.DataFrame(data=submission_array)
df.columns = ['_id', 'category_id']
#keep one row per product (the image named "<_id>-1"), then strip the "-1" suffix
df = df[df._id.str.endswith('-1')].copy()
df['_id'] = df['_id'].str[:-2]
df.shape

In [None]:
if df.shape != (1768182, 2):
    print("Error: final submission dataframe shape should be (1768182, 2) but got", df.shape,"instead")
else:
    print("Ready for submission!")

## Create a Zip file for Submission

In [None]:
from zipfile import ZipFile, ZIP_DEFLATED

output_file = "final_submission6"
submission_dir = datarepo + "submissions"
csv_path = os.path.join(submission_dir, output_file + ".csv")

df.to_csv(csv_path, index=False)
#the with-block closes the archive so it is fully written; arcname stores the CSV without its folder path
with ZipFile(os.path.join(submission_dir, output_file + ".zip"), "w", ZIP_DEFLATED) as zf:
    zf.write(csv_path, arcname=output_file + ".csv")

print(csv_path, "ready for submission")

## Submit Results