The output is written to the file "output.xlsx".

This kernel (Jupyter Notebook) builds and trains a convolutional neural network for text classification. We classify products into categories based on their brand and description (title).

Read the training data and create two lists: training_categories (CategoryName) and training_texts (BrandName + Title).

In [1]:
import pandas as pd

TRAINING_FILE = "Training_Data_Assessment.xlsx"
training_sheets = pd.ExcelFile(TRAINING_FILE)
training_data_df = training_sheets.parse("training_data")
training_categories = []
training_texts = []
for index, row in training_data_df.iterrows():
    text = row["BrandName"].lower() + " " + row["Title"].lower()
    category = row["CategoryName"].lower()
    training_texts.append(text)
    training_categories.append(category)

A tokenizer class that tokenizes the training texts using the Keras text-processing functions

In [2]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

class WordTokenizer:
    """Class which tokenizes words
    Attributes:
        max_sequence_length (int): Maximum sequence length for embedding
        tokenizer (Tokenizer): Keras Tokenizer
    """

    def __init__(self, max_sequence_length=200):
        """Create tokenizer
        Args:
            max_sequence_length (int): Maximum sequence length for texts
        """
        self.max_sequence_length = max_sequence_length
        self.tokenizer = None
        
    def train(self, texts, max_nb_words=15000):
        """Takes a list of texts and fits a tokenizer to them.
        Args:
            texts (list(str)): List of texts
            max_nb_words: Maximum number of words indexed
        """
        # Tokenize
        print('Training tokenizer...')
        self.tokenizer = Tokenizer(num_words=max_nb_words,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n\'')
        self.tokenizer.fit_on_texts(texts)
        print('Found %s unique tokens.' % len(self.tokenizer.word_index))

    def tokenize(self, texts):
        """Takes a list of texts and tokenizes them.
        Args:
            texts (list(str)): List of texts
        Returns:
            np.array: 2D numpy array (len(texts), self.max_sequence_length)
        """
        sequences = self.tokenizer.texts_to_sequences(texts)
        data = pad_sequences(sequences, maxlen=self.max_sequence_length)
        return data

Using TensorFlow backend.
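
To make the two steps concrete, here is a small pure-Python sketch of what the class delegates to Keras: building a frequency-ranked word index, turning texts into index sequences, and left-padding them with zeros. The helper names `fit_word_index`, `texts_to_sequences` and `pad_left` are hypothetical, and the sketch is simplified (whitespace splitting only, no punctuation filter, no `num_words` cap), so treat it as an illustration of the idea rather than a replacement for the Keras functions.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank: 1 is the most frequent word, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_left(sequences, maxlen):
    """Left-pad with zeros; longer sequences keep only their last `maxlen` items."""
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]

# Made-up product texts, for illustration only
texts = ["sony wireless headphones", "sony hdmi cable", "hdmi cable 2m black braided"]
word_index = fit_word_index(texts)
padded = pad_left(texts_to_sequences(texts, word_index), maxlen=4)
for row in padded:
    print(row)
```

Every row has the same length afterwards, which is what allows the sequences to be stacked into one 2D array and fed to the embedding layer.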


Train Tokenizer

In [4]:
tokenizer = WordTokenizer()
tokenizer.train(training_texts)
tokenized_data = tokenizer.tokenize(training_texts)

Training tokenizer...
Found 12325 unique tokens.


I tried different models and different hyperparameters and finally chose a 2D CNN.

I also tried word embeddings initialized with pretrained weights and found that the model's performance improves when the pretrained embedding suits the data. I tried GloVe and Word2Vec. GloVe had no impact on performance, mostly because it did not have e-commerce-specific words such as "accessories", whereas Word2Vec had many of them, so I decided to use the Word2Vec vectors as the weight matrix of the embedding layer. I strongly suggest that we should have e-commerce-specific word embeddings, just as we have Word2Vec and GloVe for news-related data; they could really enhance the model's performance.
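
One way to check a claim like this before training is to measure how much of the product vocabulary a pretrained embedding actually covers. The sketch below is a minimal illustration, not the procedure used for this notebook: the function name `vocabulary_coverage` and all word lists are hypothetical, and in practice the vocabulary would come from the keys of the fitted tokenizer's `word_index`.

```python
def vocabulary_coverage(vocabulary, embedding_words):
    """Return the covered fraction and the sorted list of words missing from the embedding."""
    vocab = set(vocabulary)
    missing = sorted(vocab - set(embedding_words))
    covered = (len(vocab) - len(missing)) / len(vocab) if vocab else 0.0
    return covered, missing

# Hypothetical data: a tiny product vocabulary and two made-up embedding vocabularies
product_vocab = ["hdmi", "headphones", "accessories", "cable", "risers"]
embedding_a = {"cable", "headphones"}
embedding_b = {"cable", "headphones", "accessories", "hdmi"}

for name, words in [("embedding_a", embedding_a), ("embedding_b", embedding_b)]:
    covered, missing = vocabulary_coverage(product_vocab, words)
    print(f"{name}: {covered:.0%} covered, missing {missing}")
```

Words reported as missing are the ones that would receive a random vector instead of a pretrained one, so a low coverage figure suggests the pretrained weights will contribute little.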

I tried many models but finally had to choose between two: a 1D CNN with an LSTM layer on top, and a 2D CNN. The 2D CNN performed better. I think the reason is that the 2D CNN can pick up more hidden details and form more hidden connections than the 1D CNN with LSTM can. The 2D CNN model reaches 94% training accuracy and 82% validation accuracy, and I am fairly confident it will work well and give good accuracy on the test data. I think having almost 63K records helped, but with more data the model's performance would increase further.

I also found that a dropout fraction of 0.5, a batch size of 32, and 35 epochs make a good set of hyperparameters. I figured out all the hyperparameters after several runs of the model.
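
To get a feel for what these settings mean in terms of work per run, the following sketch computes the train/validation sizes, the number of batches per epoch, and the total number of weight updates. It assumes the 0.25 validation split that the classifier's `train` method uses by default and takes 6034 samples as input, which is the sum of the sample counts printed in the training log of this notebook; the function name `training_schedule` is hypothetical.

```python
import math

def training_schedule(n_samples, validation_split=0.25, batch_size=32, epochs=35):
    """Return (train size, validation size, batches per epoch, total weight updates)."""
    n_val = int(validation_split * n_samples)
    n_train = n_samples - n_val
    # The last batch of an epoch may be smaller than batch_size, hence the ceiling
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch, steps_per_epoch * epochs

# 6034 = 4526 + 1508, the "Train on ... validate on ..." counts from the training log
print(training_schedule(6034))
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason batch size and epoch count are usually tuned together.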

Let's load the pretrained Word2Vec model from Google: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

Loading might take some time, since the model contains 300-dimensional vectors for 3 million words and phrases.

After downloading, extract the file in the parent directory:

cd ..

gunzip GoogleNews-vectors-negative300.bin.gz

A class for Product Classifier

The commented-out code is an alternative embedding layer and the 1D CNN model with an LSTM on top; I have left it in deliberately so it can be checked if needed.

This class compiles and trains the model.

In [5]:
"""Product classifier class"""

import numpy as np
from keras import regularizers
# Conv1D, MaxPooling1D and LSTM are only needed by the commented-out alternative model
from keras.layers import (Dense, Input, Flatten, Dropout, Embedding, Reshape,
                          Conv1D, MaxPooling1D, LSTM, Conv2D, MaxPooling2D, concatenate)
from keras.models import Model
from keras.utils import to_categorical
from sklearn.metrics import classification_report
from gensim.models.keyedvectors import KeyedVectors

class ProductClassifier(object):
    """Class which classifies products based on various inputs
       Attributes:
           model (keras.model): Keras model
           category_map (dict(str, int)): Map between category names and indices
    """

    def __init__(self):
        """Initialize an empty model and category map
        """
        self.model = None
        self.category_map = {}

    def index_categories(self, categories):
        """Take a list of possibly duplicate categories and create an index list
        Args:
            categories (list(str)): List of categories
        Returns:
            list(int): List of indices
        """
        print('Indexing categories...')
        indices = []
        for category in categories:
            if not (category in self.category_map):
                self.category_map[category] = len(self.category_map)
            indices.append(self.category_map[category])
        print('Found %s unique categories.' % len(self.category_map))
        return indices

    def classify(self, data):
        """Classify products by text
        Args:
            data (np.array): 2D array representing descriptions of the product and/or product title
        Returns:
            list(dict(str, float)): List of dictionaries of product categories with associated confidence
        """
        prediction = self.model.predict(data)
        all_category_probs = []
        for i in range(prediction.shape[0]):
            category_probs = {}
            for category in self.category_map:
                category_probs[category] = prediction[i,self.category_map[category]]
            all_category_probs.append(category_probs)
        return all_category_probs

    def get_labels(self, categories):
        """Create labels from a list of categories
        Args:
            categories (list(str)): A list of product categories
        Returns:
            np.array: 2D one-hot array (len(categories), number of unique categories)
        """
        indexed_categories = self.index_categories(categories)
        labels = to_categorical(np.asarray(indexed_categories))
        return labels

    def compile(self, tokenizer, embedding_dim=300, dropout_fraction=0.5, kernal_size=5, n_filters=128):
        """Compile network model for classifier
        Args:
            embedding_dim (int): Size of embedding vector
            tokenizer (WordTokenizer): Object used to tokenize original texts
            dropout_fraction (float): Fraction of input units randomly set to zero by the dropout layer
            kernal_size (int): Size of sliding window for convolution
            n_filters (int): Number of filters to produce from convolution
        """
        # Load embedding layer
        print('Creating embedding layer....')
        word_vectors = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
        
        embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
        for word, i in tokenizer.tokenizer.word_index.items():
            try:
                embedding_vector = word_vectors[word]
                embedding_matrix[i] = embedding_vector
            except KeyError:
                embedding_matrix[i]=np.random.normal(0,np.sqrt(0.25),embedding_dim)
                
        embedding_layer = Embedding(len(tokenizer.tokenizer.word_index) + 1,
                                    embedding_dim,
                                    weights=[embedding_matrix],
                                    input_length=tokenizer.max_sequence_length,
                                    trainable=False)
        
#         embedding_layer = Embedding(len(tokenizer.tokenizer.word_index) + 1,
#                                     embedding_dim,
#                                     input_length=tokenizer.max_sequence_length,
#                                     trainable=False)

        # Create network
        print('Creating network...')
        sequence_input = Input(shape=(tokenizer.max_sequence_length,), dtype='int32')
        embedded_sequences = embedding_layer(sequence_input)
        
        reshape = Reshape((tokenizer.max_sequence_length,embedding_dim,1))(embedded_sequences)

        conv_0 = Conv2D(n_filters, (3, embedding_dim),activation='relu',kernel_regularizer=regularizers.l2(0.01))(reshape)
        conv_1 = Conv2D(n_filters, (4, embedding_dim),activation='relu',kernel_regularizer=regularizers.l2(0.01))(reshape)
        conv_2 = Conv2D(n_filters, (5, embedding_dim),activation='relu',kernel_regularizer=regularizers.l2(0.01))(reshape)

        maxpool_0 = MaxPooling2D((tokenizer.max_sequence_length - 3 + 1, 1), strides=(1,1))(conv_0)
        maxpool_1 = MaxPooling2D((tokenizer.max_sequence_length - 4 + 1, 1), strides=(1,1))(conv_1)
        maxpool_2 = MaxPooling2D((tokenizer.max_sequence_length - 5 + 1, 1), strides=(1,1))(conv_2)

        merged_tensor = concatenate([maxpool_0, maxpool_1, maxpool_2], axis=1)
        flatten = Flatten()(merged_tensor)
        dropout = Dropout(dropout_fraction)(flatten)
        preds = Dense(len(self.category_map), activation='softmax')(dropout)
        
#         x = Dropout(dropout_fraction)(embedded_sequences)
#         x = Conv1D(n_filters, kernal_size, activation='relu')(x)
#         x = MaxPooling1D(kernal_size)(x)
#         x = LSTM(embedding_dim, dropout=dropout_fraction, recurrent_dropout=dropout_fraction)(x)
#         preds = Dense(len(self.category_map), activation='softmax')(x)

        # Compile model
        print('Compiling network...')
        self.model = Model(sequence_input, preds)
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop',
                           metrics=['acc'])

    def train(self, data, labels, validation_split=0.25, batch_size=32, epochs=35):
        """Train classifier
        Args:
            data (np.array): 2D numpy array of token indices (n_samples, tokenizer.max_sequence_length)
            labels (np.array): 2D numpy array (n_samples, len(self.category_map))
            validation_split (float): Fraction of samples to be used for validation
            batch_size (int): Training batch size
            epochs (int): Number of training epochs
        """
        print('Training...')
        # Split the data into a training set and a validation set
        indices = np.arange(data.shape[0])
        np.random.shuffle(indices)
        data = data[indices]
        labels = labels[indices]
        nb_validation_samples = int(validation_split * data.shape[0])
        # Split by position so that a validation size of 0 does not empty the training set
        split_at = data.shape[0] - nb_validation_samples

        x_train = data[:split_at]
        y_train = labels[:split_at]
        x_val = data[split_at:]
        y_val = labels[split_at:]

        # Train!
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       epochs=epochs, batch_size=batch_size)
        self.evaluate(x_val, y_val, batch_size)

    def evaluate(self, x_test, y_test, batch_size=256):
        """Evaluate classifier
        Args:
            x_test (np.array): 2D numpy array of token indices (n_samples, tokenizer.max_sequence_length)
            y_test (np.array): 2D numpy array (n_samples, len(self.category_map))
            batch_size (int): Prediction batch size
        """
        print('Evaluating...')
        predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
        predicted_classes = np.argmax(predictions_last_epoch, axis=1)
        target_names = ['']*len(self.category_map)
        for category in self.category_map:
            target_names[self.category_map[category]] = category
        y_val = np.argmax(y_test, axis=1)
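        # Pass every label explicitly so the report still works if a category is absent from this split
        print(classification_report(y_val, predicted_classes, labels=np.arange(len(target_names)),
                                    target_names=target_names, digits=6))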
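
The shapes flowing through the three convolution branches of `compile` can be worked out by hand, and the sketch below does that arithmetic. It assumes the defaults used in this notebook (sequence length 200, embedding dimension 300, 128 filters, kernel heights 3, 4 and 5), the default "valid" padding of `Conv2D`, and channels-last ordering; the function name `branch_shapes` is hypothetical.

```python
def branch_shapes(seq_len=200, embedding_dim=300, n_filters=128, kernel_heights=(3, 4, 5)):
    """Return per-branch (conv output, pooled output) shapes and the flattened feature count."""
    shapes = []
    for k in kernel_heights:
        # A (k, embedding_dim) kernel spans the full width of the reshaped input, so the
        # width collapses to 1 and the height shrinks to seq_len - k + 1.
        conv = (seq_len - k + 1, 1, n_filters)
        # Pooling over the whole remaining height leaves one value per filter.
        pooled = (1, 1, n_filters)
        shapes.append((conv, pooled))
    # Concatenating the branches and flattening gives one feature per filter per branch.
    n_features = len(kernel_heights) * n_filters
    return shapes, n_features

shapes, n_features = branch_shapes()
for (conv, pooled), k in zip(shapes, (3, 4, 5)):
    print(f"kernel height {k}: conv {conv} -> pooled {pooled}")
print("features after concatenate + flatten:", n_features)
```

Because each kernel covers the whole embedding width, every filter effectively scans windows of 3, 4 or 5 consecutive words, and the pooling keeps only the strongest response of that filter anywhere in the text.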



Get labels from classifier

In [6]:
classifier = ProductClassifier()
training_labels = classifier.get_labels(training_categories)

Indexing categories...
Found 63 unique categories.
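
As an illustration of what `get_labels` produces, here is a dependency-free sketch of the two steps: assigning an index to each category in order of first appearance, then one-hot encoding those indices. It mirrors the logic of `index_categories` and of `to_categorical` in plain Python; the standalone function names and the sample categories are chosen for the example only.

```python
def index_categories(categories):
    """Assign indices in order of first appearance; return (indices, category_map)."""
    category_map = {}
    indices = []
    for category in categories:
        # setdefault stores the current map size only the first time a category is seen
        indices.append(category_map.setdefault(category, len(category_map)))
    return indices, category_map

def one_hot(indices, n_classes):
    """Turn each index into a row with a single 1.0 at that position."""
    return [[1.0 if j == i else 0.0 for j in range(n_classes)] for i in indices]

indices, category_map = index_categories(["cables", "headphones", "cables", "monitor risers"])
labels = one_hot(indices, len(category_map))
print(indices)
print(category_map)
print(labels[2])
```

The one-hot rows have one column per category, which is why the final `Dense` layer of the network needs `len(self.category_map)` outputs.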


Compile classifier network and train

In [7]:
classifier.compile(tokenizer)
classifier.train(tokenized_data, training_labels)

Creating embedding layer....
Creating network...
Compiling network...
Training...
Train on 4526 samples, validate on 1508 samples
Epoch 1/35
Epoch 2/35
Epoch 3/35
Epoch 4/35
Epoch 5/35
Epoch 6/35
Epoch 7/35
Epoch 8/35
Epoch 9/35
Epoch 10/35
Epoch 11/35
Epoch 12/35
Epoch 13/35
Epoch 14/35
Epoch 15/35
Epoch 16/35
Epoch 17/35
Epoch 18/35
Epoch 19/35
Epoch 20/35
Epoch 21/35
Epoch 22/35
Epoch 23/35
Epoch 24/35
Epoch 25/35
Epoch 26/35
Epoch 27/35
Epoch 28/35
Epoch 29/35
Epoch 30/35
Epoch 31/35
Epoch 32/35
Epoch 33/35
Epoch 34/35
Epoch 35/35
Evaluating...
                             precision    recall  f1-score   support

                 headphones   0.730769  1.000000  0.844444        19
                     cables   0.562500  0.818182  0.666667        22
    security & surveillance   0.720000  0.947368  0.818182        19
            streaming media   0.793103  1.000000  0.884615        23
     television accessories   0.708333  0.629630  0.666667        27
             monitor risers   
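
To read the columns of this report, it helps to recompute one row by hand. The sketch below derives precision, recall and F1 from raw counts; the counts are inferred from the "headphones" row (support 19 with recall 1.0 means 19 true positives and no false negatives, and precision 0.730769 = 19/26 means 7 false positives). The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the "headphones" row of the report above
precision, recall, f1 = precision_recall_f1(19, 7, 0)
print(f"{precision:.6f} {recall:.6f} {f1:.6f}")
```

A recall of 1.0 with a lower precision, as in this row, means every true headphones product was found but some products from other categories were also labelled as headphones.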

Reading test data

In [8]:
import pandas as pd

TESTING_FILE = "Data To Classify_Assessment.xlsx"
testing_sheets = pd.ExcelFile(TESTING_FILE)
testing_data_df = testing_sheets.parse("sample_data")
testing_texts = []
for index, row in testing_data_df.iterrows():
    brand = row["BrandName"]
    title = str(row["Title"])
    # Treat missing (NaN), empty, and "NULL" brands as absent
    if pd.isna(brand) or brand in ("", "NULL"):
        text = title
    else:
        # str() handles brands that pandas parsed as numbers
        text = str(brand) + " " + title
    testing_texts.append(text)
    
print(len(testing_texts))

57030
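
The brand handling above can be isolated into a small helper that is easy to test on its own. This is a sketch under the assumption that a missing brand shows up as `None`, as a float NaN (which is how pandas represents empty cells), as an empty string, or as the literal string "NULL"; the function name `build_text` and the sample values are hypothetical.

```python
import math

def build_text(brand, title):
    """Join brand and title, skipping brands that are missing, empty, or the string 'NULL'."""
    missing = (
        brand is None
        or (isinstance(brand, float) and math.isnan(brand))
        or str(brand).strip() in ("", "NULL")
    )
    return str(title) if missing else f"{brand} {title}"

print(build_text("Sony", "Wireless Headphones"))
print(build_text(float("nan"), "HDMI Cable 2m"))
print(build_text("NULL", "Monitor Riser"))
print(build_text(3, "USB Hub"))
```

Keeping this logic in one function also makes it easier to build the training and test texts the same way, so the tokenizer sees consistent input.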


Predicting category index of test data

In [9]:
predictions_test_epoch = classifier.model.predict(tokenizer.tokenize(testing_texts), batch_size=32, verbose=1)
predicted_test_classes = np.argmax(predictions_test_epoch, axis=1)
print(predicted_test_classes.shape)

(57030,)


Get list of category names based on index

In [10]:
target_names = ['']*len(classifier.category_map)
for category in classifier.category_map:
    target_names[classifier.category_map[category]] = category
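
The same index-to-name lookup can also be used to rank several candidate categories per product instead of keeping only the best one. The sketch below is a plain-Python illustration with made-up probabilities; `names_by_index` and `top_k_categories` are hypothetical helper names, and the four-category map stands in for `classifier.category_map`.

```python
def names_by_index(category_map):
    """Return the category names ordered by their index."""
    return [name for name, _ in sorted(category_map.items(), key=lambda item: item[1])]

def top_k_categories(probabilities, target_names, k=3):
    """Return the k (category, probability) pairs with the highest probability."""
    ranked = sorted(zip(target_names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical category map and softmax output for a single product
category_map = {"headphones": 0, "cables": 1, "streaming media": 2, "monitor risers": 3}
target_names = names_by_index(category_map)
probs = [0.05, 0.62, 0.30, 0.03]
print(top_k_categories(probs, target_names, k=2))
```

Looking at the runner-up categories and their scores can help spot products where the model is unsure and a manual check might be worthwhile.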

Store predicted category in output file

In [11]:
import xlsxwriter
  
# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('output.xlsx')
worksheet = workbook.add_worksheet()

# Add a bold format to use to highlight cells.
bold = workbook.add_format({'bold': 1})

# Write some data headers.
worksheet.write('A1', 'ASIN', bold)
worksheet.write('B1', 'BrandName', bold)
worksheet.write('C1', 'Title', bold)
worksheet.write('D1', 'Category', bold)

row_index = 1
col_index = 0

for index, row in testing_data_df.iterrows():
    brand = row["BrandName"]
    # Write an empty cell when the brand is missing (NaN), empty, or the string "NULL"
    if pd.isna(brand) or brand in ("", "NULL"):
        brand = ""
    worksheet.write(row_index, col_index, row["ASIN"])
    worksheet.write(row_index, col_index + 1, str(brand))
    worksheet.write(row_index, col_index + 2, row["Title"])
    worksheet.write(row_index, col_index + 3, target_names[predicted_test_classes[row_index - 1]])
    row_index += 1
    
workbook.close()