# Convolutional Neural Networks for Text
(adapted from Debajyoti Datta)

**CNNs have become widely popular in text classification systems. They reduce computation by exploiting local correlations in the input data.**



Let's start with inputs:

![png](./assets/images/conv.001.png)

The green boxes represent words or characters, depending on your approach: characters in a character-based convolutional neural network, words in a word-based one. The corresponding blue rows hold the representation of each word or character. For character-based convolutions the alphabet is small (around 70 symbols, including letters, digits, and punctuation), so each character is generally a one-hot vector. For words, the blue rows are generally dense vectors, either pre-trained word embeddings or word vectors learned during training.
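To make the two representations concrete, here is a minimal NumPy sketch. The toy alphabet, the two-word vocabulary, and the 4-dimensional random embeddings are assumptions chosen for illustration; they are not the values used later in this notebook.

```python
import numpy as np

# Hypothetical toy alphabet; a real character CNN would use around 70 symbols.
alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot_chars(text):
    """Return a (len(text), len(alphabet)) matrix with a single 1 per row."""
    indices = [char_to_index[ch] for ch in text]
    return np.eye(len(alphabet), dtype=np.float32)[indices]

char_matrix = one_hot_chars("good food")

# Words use dense vectors instead. Random values stand in for
# learned or pre-trained embeddings here.
rng = np.random.default_rng(0)
word_to_index = {"good": 0, "food": 1}
embedding_table = rng.normal(size=(len(word_to_index), 4))
word_matrix = embedding_table[[word_to_index[w] for w in "good food".split()]]

print(char_matrix.shape)  # (9, 27)
print(word_matrix.shape)  # (2, 4)
```

The same sentence becomes 9 sparse rows at the character level but only 2 dense rows at the word level.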


![png](./assets/images/conv.002.png)

As you can see, filters (also known as kernels) can be of any length, where length means the number of rows in the filter. For images, both the width and the height of the kernel can be any size. For character and word representations, however, the kernel's width is fixed to the full dimension of the word embedding or character representation. So the only dimension that matters for convolutions in NLP tasks is the length (or size) of the filter.

Next, the filters convolve with the input to produce the output. "Convolve" is a fancy term for multiplying each filter weight by the corresponding input cell and summing the products. This part is slightly tricky to understand, and it depends on the stride (how far the filter moves at each step) and the length of the filter. The size of the convolution output depends directly on these two settings. This will become clearer in the following images.


![png](./assets/images/conv.003.png)
![png](./assets/images/conv.004.png)
![png](./assets/images/conv.005.png)
![png](./assets/images/conv.006.png)

This is where the interesting part happens! The convolution multiplies the weights in the filter by the corresponding representations of the words or characters. The products are then summed to produce one output, shown with the arrow. If the filter had moved with a stride of 2, the number of filter outputs would have been different, and with a different filter length the convolved output would differ too. Convince yourself that the filter of length 4, when convolved, produces just 2 outputs.
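The multiply-and-sum operation can be sketched in a few lines of NumPy. The helper `conv1d_text` and the 5-by-3 input are hypothetical and chosen only so the numbers are easy to check by hand; a deep learning library would also add a bias and an activation.

```python
import numpy as np

def conv1d_text(inputs, kernel, stride=1):
    """Slide `kernel` down the rows of `inputs`, returning one number per position.

    inputs: (sequence_length, embedding_dim)
    kernel: (filter_length, embedding_dim), spanning the full embedding width
    """
    filter_length = kernel.shape[0]
    outputs = []
    for start in range(0, inputs.shape[0] - filter_length + 1, stride):
        window = inputs[start:start + filter_length]
        outputs.append(np.sum(window * kernel))  # multiply cell by cell, then sum
    return np.array(outputs)

# Hypothetical input: 5 positions, each with a 3-dimensional representation.
inputs = np.arange(15, dtype=float).reshape(5, 3)

out_len2 = conv1d_text(inputs, np.ones((2, 3)))
out_len4 = conv1d_text(inputs, np.ones((4, 3)))
out_stride2 = conv1d_text(inputs, np.ones((2, 3)), stride=2)

print(len(out_len2), len(out_len4), len(out_stride2))  # 4 2 2
print(out_len2)  # [15. 33. 51. 69.]
```

With 5 input positions, a filter of length 2 yields 4 outputs at stride 1 and 2 outputs at stride 2, while a filter of length 4 yields 2 outputs.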

Just as there can be multiple filter lengths, there can be multiple filters of the same length. So there can be 100 filters of length 2, 100 filters of length 4, and so on.

Each of these will then produce multiple outputs!
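As a rough sketch of what "multiple filters, multiple outputs" means in terms of shapes, the snippet below applies 100 random filters of length 2 and 100 of length 4 to one input. All sizes and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

sequence_length, embedding_dim = 10, 8   # hypothetical sizes
num_filters = 100                        # filters per filter length
inputs = rng.normal(size=(sequence_length, embedding_dim))

feature_maps = {}
for filter_length in (2, 4):
    # One weight tensor holds all filters of this length.
    kernels = rng.normal(size=(num_filters, filter_length, embedding_dim))
    # Stack every window of consecutive rows: (positions, filter_length, embedding_dim).
    windows = np.stack([inputs[i:i + filter_length]
                        for i in range(sequence_length - filter_length + 1)])
    # For each position p and filter f, multiply and sum over the window.
    feature_maps[filter_length] = np.einsum("pld,fld->pf", windows, kernels)

for filter_length, fmap in feature_maps.items():
    print(filter_length, fmap.shape)
# 2 (9, 100)
# 4 (7, 100)
```

Each filter length gives a feature map with one column per filter, and longer filters give fewer rows.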


![png](./assets/images/conv2.006.png)

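The final stage consists of max pooling, then concatenation, and finally a softmax layer (where regularization such as dropout is typically applied) that produces the class probabilities.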


![png](./assets/images/conv2.007.png)
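The final stage can be sketched in NumPy as follows. The feature maps, the dense weights, and the two-class output are random placeholders assumed for illustration; dropout is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from filters of length 2 and 4: (positions, num_filters).
feature_maps = [rng.normal(size=(9, 100)), rng.normal(size=(7, 100))]

# Max pooling over time keeps the largest response of each filter.
pooled = [fmap.max(axis=0) for fmap in feature_maps]   # two vectors of shape (100,)

# Concatenation gives one fixed-size vector, whatever the sentence length was.
features = np.concatenate(pooled)                      # shape (200,)

# A dense layer followed by softmax maps the features to class probabilities.
num_classes = 2
weights = rng.normal(scale=0.1, size=(features.shape[0], num_classes))
bias = np.zeros(num_classes)
logits = features @ weights + bias
probabilities = np.exp(logits - logits.max())   # subtract the max for numerical stability
probabilities /= probabilities.sum()

print(features.shape)   # (200,)
print(probabilities)
```

Because pooling keeps one value per filter, sentences of different lengths all end up as a vector of the same size.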

The entire process was very nicely illustrated for words by Zhang and Wallace in the paper "A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification". The figures above replicate the same process for characters and show how it looks at every step!


![png](./assets/images/Zhang.png)

The task we are trying to accomplish here is text classification. As discussed above, the input to the convolutional network can be words or characters. From the sequence of words or characters in a sentence, we want to predict the category of the sentence, such as positive or negative.

In [1]:
import pandas as pd
import re
import tensorflow as tf
import numpy as np
from keras.utils.np_utils import to_categorical

data = pd.read_csv("./yelp_labelled.txt", delimiter='\t', header=None, names=['review','label'], encoding='utf-8')

data

Using TensorFlow backend.


| | review | label |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| 5 | Now I am getting angry and I want my damn pho. | 0 |
| 6 | Honeslty it didn't taste THAT fresh.) | 0 |
| 7 | The potatoes were like rubber and you could te... | 0 |
| 8 | The fries were great too. | 1 |
| 9 | A great touch. | 1 |


## Character based CNN

In [None]:
docs = []
sentiments = []

# Split each review into a list of lowercased characters.
for review, sentiment in zip(data.review, data.label):
    chars_cleaned = [ch.lower() for ch in review]
    docs.append(chars_cleaned)
    sentiments.append(sentiment)

len(docs), len(sentiments)

In [None]:
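maxlen = 1024                        # characters per review after padding/truncation
nb_filter = 256                      # filters per convolutional layer
dense_outputs = 1024                 # units in each fully connected layer
filter_kernels = [7, 7, 3, 3, 3, 3]  # filter length of each convolutional layer
n_out = 2                            # output classes (negative, positive)
batch_size = 80
nb_epoch = 10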

In [None]:
# Collect every distinct character; sorting keeps the index assignment reproducible.
chars = sorted(set(ch for doc in docs for ch in doc))
vocab_size = len(chars)
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
from keras.preprocessing.sequence import pad_sequences

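def vectorize_sentences(data, char_indices):
    """One-hot encode each character, then pad or truncate every review to maxlen."""
    X = []
    for chars_in_doc in data:
        x = [char_indices[ch] for ch in chars_in_doc]
        x2 = np.eye(len(char_indices))[x]  # one one-hot row per character
        X.append(x2)
    return (pad_sequences(X, maxlen=maxlen))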

train_data = vectorize_sentences(docs,char_indices)
train_data.shape
y_train = to_categorical(sentiments)


from keras.models import Model
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers.convolutional import Convolution1D, MaxPooling1D

inputs = Input(shape=(maxlen, vocab_size), name='input', dtype='float32')

conv = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[0],
                     border_mode='valid', activation='relu',
                     input_shape=(maxlen, vocab_size))(inputs)
conv = MaxPooling1D(pool_length=3)(conv)

conv1 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[1],
                      border_mode='valid', activation='relu')(conv)
conv1 = MaxPooling1D(pool_length=3)(conv1)

conv2 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[2],
                      border_mode='valid', activation='relu')(conv1)

conv3 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[3],
                      border_mode='valid', activation='relu')(conv2)

conv4 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[4],
                      border_mode='valid', activation='relu')(conv3)

conv5 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[5],
                      border_mode='valid', activation='relu')(conv4)
conv5 = MaxPooling1D(pool_length=3)(conv5)
conv5 = Flatten()(conv5)

z = Dropout(0.5)(Dense(dense_outputs, activation='relu')(conv5))
z = Dropout(0.5)(Dense(dense_outputs, activation='relu')(z))

pred = Dense(n_out, activation='softmax', name='output')(z)

model = Model(input=inputs, output=pred)

model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

# model.fit(train_data, y_train, batch_size=32,
#            nb_epoch=120, validation_split=0.2, verbose=False)
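Because every convolution above uses `border_mode='valid'`, each layer shortens the sequence. The following sketch traces the sequence length through the network by hand. It assumes that `MaxPooling1D` uses a stride equal to its pool length (the Keras default) and the helper names are hypothetical.

```python
def conv_valid_length(length, filter_length, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (length - filter_length) // stride + 1

def pool_output_length(length, pool_length):
    """Output length of non-overlapping max pooling (stride equal to pool length)."""
    return (length - pool_length) // pool_length + 1

maxlen = 1024
nb_filter = 256
filter_kernels = [7, 7, 3, 3, 3, 3]
pooled_after = {0, 1, 5}   # convolutions followed by MaxPooling1D(pool_length=3)

length = maxlen
for layer, filter_length in enumerate(filter_kernels):
    length = conv_valid_length(length, filter_length)
    if layer in pooled_after:
        length = pool_output_length(length, 3)
    print("after conv block", layer, "->", length)

flattened_size = length * nb_filter
print("flattened size:", flattened_size)  # flattened size: 8704
```

Under these assumptions the 1024 input positions shrink to 34 before `Flatten`, so the first dense layer receives 34 × 256 = 8704 values.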

In [None]:
from keras.utils.visualize_util import model_to_dot
from IPython.display import Image

Image(model_to_dot(model, show_shapes=True).create(prog='dot', format='png'))

## Word based CNN

Next, we will use convolutions on words.

The input data now needs to have the shape:

Number_of_reviews X maxlen

The Yelp data set is really too small for these models to be very effective. But hopefully, it will be sufficient for you to experiment with your own data sets.

In [None]:
# __future__ imports must be the first statement in the cell.
from __future__ import unicode_literals
from collections import Counter

from spacy.en import English

data = pd.read_csv("./yelp_labelled.txt",
                   delimiter='\t', header=None, names=['review','label'],encoding='utf-8')
# z = list(nlp.pipe(data['review'], n_threads=20, batch_size=20000))

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()

def tokenizeSentences(sent):
    """Split one review into a list of word tokens."""
    doc = nlp(sent)
    tokens = [token.string.strip() for token in doc]
    return tokens

Xs = []
for texts in data['review']:
    Xs.append(tokenizeSentences(texts))

Xs

In [None]:
vocab = sorted(set(word for words in Xs for word in words))
len(vocab)

At this stage you could go ahead with this vocabulary, but ideally we want to get rid of very infrequent words, such as tokens where the tokenizer failed, emoticons, and so on. If someone writes 'amaaazzziiinggg' or 'Wooooooow' in a review, we do not want separate tokens for what are really just the words 'amazing' and 'wow'. You can also define custom rules in spaCy to take such cases into account. A second option, often used in information retrieval, is to get rid of the most frequent words: words like 'the' and 'an' occur very often and add little to the meaning. We will not tune this for the Yelp data set since it is already really small, but the `word_freq` function further down shows how to drop infrequent words, and the most frequent ones can be filtered in the same way if you wish.
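One simple way to handle elongated words, shown here only as a heuristic and not as a spaCy feature or as part of this notebook's pipeline, is to collapse any character repeated three or more times. The helper name `squeeze_repeats` is made up for this sketch.

```python
import re

def squeeze_repeats(word):
    """Collapse any character repeated three or more times into a single character."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

print(squeeze_repeats("amaaazzziiinggg"))  # amazing
print(squeeze_repeats("Wooooooow"))        # Wow
print(squeeze_repeats("good"))             # good
```

The rule is deliberately crude: it leaves ordinary double letters alone, but it would also turn 'coool' into 'col' rather than 'cool', so treat it as a starting point.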

So let's build a function that keeps only the words appearing more than a given number of times in our data (the call below uses a threshold of 2, so a word must occur at least three times). We use a count threshold instead of selecting the 500 most frequent words because a fixed limit would eliminate many words arbitrarily as soon as the 500-word limit is reached. (A lot of words have very similar counts!)


In [None]:
import operator

def word_freq(Xs, num):
    """Return the words occurring more than `num` times and a word -> index map."""
    all_words = [words.lower() for sentences in Xs for words in sentences]
    sorted_vocab = sorted(Counter(all_words).items(), key=operator.itemgetter(1))
    final_vocab = [k for k,v in sorted_vocab if v>num]
    # Indices start at 1 so that 0 stays free for padding.
    word_idx = dict((c, i + 1) for i, c in enumerate(final_vocab))
    return final_vocab, word_idx

final_vocab, word_idx = word_freq(Xs,2)
vocab_len = len(final_vocab) # Finally we have 598 words!
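For completeness, here is a hedged variant that drops both the rarest and the most frequent words. The function `filter_vocab`, its parameters `min_count` and `drop_top`, and the toy reviews are assumptions for illustration and are not used elsewhere in this notebook.

```python
from collections import Counter

def filter_vocab(tokenized_docs, min_count=2, drop_top=1):
    """Keep words seen at least `min_count` times, minus the `drop_top` most frequent ones."""
    counts = Counter(word.lower() for doc in tokenized_docs for word in doc)
    too_common = {word for word, _ in counts.most_common(drop_top)}
    kept = sorted(word for word, count in counts.items()
                  if count >= min_count and word not in too_common)
    # Index 0 is left free for padding, as in word_freq above.
    return kept, {word: i + 1 for i, word in enumerate(kept)}

toy_docs = [["The", "food", "was", "great"],
            ["the", "service", "was", "slow"],
            ["The", "food", "was", "the", "best"]]

vocab, index = filter_vocab(toy_docs)
print(vocab)  # ['food', 'was']
print(index)  # {'food': 1, 'was': 2}
```

In the toy data 'the' is removed as the single most frequent word, and every word that occurs only once is removed as too rare.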

The vectorize function converts each review into a list of word indices. The vocabulary was lowercased, but the tokens in `Xs` were not, so three scenarios can occur for each word:

1. the word itself is in the dictionary
2. its lowercase form is in the dictionary
3. the word is not in the dictionary

In [None]:
def vectorize_sentences(data, word_idx, final_vocab, maxlen=40):
    """Map each word to its index, then pad or truncate every review to maxlen."""
    X = []
    # Index shared by all out-of-vocabulary words; pad_sequences uses 0 for padding.
    unknown_idx = len(final_vocab)+2
    for sentences in data:
        x=[]
        for word in sentences:
            if word in word_idx:
                x.append(word_idx[word])
            elif word.lower() in word_idx:
                x.append(word_idx[word.lower()])
            else:
                x.append(unknown_idx)
        X.append(x)
    return (pad_sequences(X, maxlen=maxlen))

train_data = vectorize_sentences(Xs, word_idx, final_vocab)
train_data.shape
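With its default settings, `pad_sequences` pads with zeros at the front of short sequences and cuts long sequences from the front. The NumPy sketch below mimics that behaviour for lists of integers; `pad_left` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_left(sequences, maxlen, value=0):
    """Left-pad (and, if needed, left-truncate) integer sequences to `maxlen`."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]            # keep the last `maxlen` items
        padded[row, -len(kept):] = kept
    return padded

# Hypothetical word indices for three short reviews.
print(pad_left([[3, 7], [5, 1, 4, 2, 9, 8], [6]], maxlen=4))
# [[0 0 3 7]
#  [4 2 9 8]
#  [0 0 0 6]]
```

This is why index 0 is reserved in the vocabulary: it marks positions that carry no word.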

And here's the model:

In [None]:
from keras.layers.core import Activation, Flatten, Dropout, Dense, Merge
from keras.layers import Convolution1D
from keras.layers.pooling import MaxPooling1D
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam, rmsprop


