# 1D Convnet
This notebook uses a 1D convnet combined with pretrained word embeddings for text classification. Download the [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) (`glove.6B.zip`) and the raw [IMDB](http://mng.bz/0tIo) dataset, then uncompress both.

In [40]:
import os

import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras import layers, Input

In [29]:
# Using the 100 dimension word embeddings
glove_dir = "/Users/marshall.carter/Documents/glove"

imdb_dir = "/Users/marshall.carter/Documents/my_repos/keras/aclImdb"

## Load the data

In [7]:
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The context manager closes the file even if reading fails
            with open(os.path.join(dir_name, fname)) as f:
                texts.append(f.read())
            # Negative reviews are labeled 0, positive reviews 1
            labels.append(0 if label_type == 'neg' else 1)

In [8]:
# Sample the data
for i in range(2):
    print(labels[i], texts[i], '\n')

0 Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form. 

0 Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question "why in Gods name would they create another one of these dumpster dives of a movie?" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think 

In [10]:
# Number of records
len(labels)

25000

## Text feature parameters

In [26]:
# Truncate reviews to this many words
maxlen = 200

# Number of observations for training and for the validation holdout
training_samples = 15000
validation_samples = 10000

# Consider only the top n (max_words) most frequent words in the model
max_words = 20000

## Map the words to an integer index

In [15]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
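
The indexing step above can be illustrated without Keras. The sketch below is a simplified stand-in, not the `Tokenizer` implementation: the helper names (`build_word_index`, `to_sequences`) and the regular expression are assumptions made for illustration, and Keras uses its own punctuation filter. It shows the two ideas that matter here: words are ranked by frequency starting at index 1, and only words with an index below `num_words` survive in the sequences.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")  # simplified tokenization (assumption)

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words ranked num_words or lower."""
    sequences = []
    for text in texts:
        tokens = TOKEN_PATTERN.findall(text.lower())
        sequences.append([word_index[t] for t in tokens
                          if t in word_index and word_index[t] < num_words])
    return sequences

toy_texts = ["the film is good", "the film is bad", "the cast"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.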

In [16]:
print(sequences[0])

[777, 16, 28, 4, 1, 115, 2278, 6887, 11, 19, 1025, 5, 27, 19499, 5, 42, 2425, 1861, 128, 2270, 5, 3, 6985, 308, 7, 7, 3383, 2373, 1, 19, 36, 463, 16115, 3169, 2, 222, 3, 1016, 174, 20, 49, 808]


In [17]:
# Word to integer mapping
for word, index in list(word_index.items())[:5]:
    print(word, index)

the 1
and 2
a 3
of 4
to 5


In [18]:
# Reverse the index so that integers are keys and words are values
reverse_index = {value: key for key, value in word_index.items()}

In [19]:
print([reverse_index[index_val] for index_val in sequences[0]])

['working', 'with', 'one', 'of', 'the', 'best', 'shakespeare', 'sources', 'this', 'film', 'manages', 'to', 'be', 'creditable', 'to', "it's", 'source', 'whilst', 'still', 'appealing', 'to', 'a', 'wider', 'audience', 'br', 'br', 'branagh', 'steals', 'the', 'film', 'from', 'under', "fishburne's", 'nose', 'and', "there's", 'a', 'talented', 'cast', 'on', 'good', 'form']


## Pad the integer sequences so they are the same length

In [23]:
data = pad_sequences(sequences, maxlen=maxlen)

# Create labels array
labels = np.asarray(labels)
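
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence. The pure-Python sketch below mimics that default behavior for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

This is why short reviews show up as a run of leading zeros, and why a review longer than `maxlen` keeps its final `maxlen` words rather than its first ones.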

In [24]:
data[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,

## Shuffle and split the data into training and validation datasets

In [27]:
# Index array with one entry per observation
indices = np.arange(data.shape[0])

# Randomly shuffle observations
np.random.shuffle(indices)

data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]

x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
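
The key point of the shuffle above is that the same permutation is applied to both the data and the labels, so each row keeps its label. The toy sketch below repeats the idea with a seeded generator so the split is reproducible; the seed value, the toy arrays, and the 6/4 split are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for reproducibility

# Toy data: row i is [2i, 2i + 1] and its label is i % 2
toy_data = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10) % 2

# One permutation, applied to both arrays, keeps rows and labels aligned
perm = rng.permutation(len(toy_data))
shuffled_data = toy_data[perm]
shuffled_labels = toy_labels[perm]

n_train = 6
toy_x_train, toy_y_train = shuffled_data[:n_train], shuffled_labels[:n_train]
toy_x_val, toy_y_val = shuffled_data[n_train:], shuffled_labels[n_train:]
print(toy_x_train.shape, toy_x_val.shape)
```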

## Create the GloVe word embedding mapping

In [30]:
embeddings_index = {}

# Read in as a dictionary mapping each word to its vector
with open(os.path.join(glove_dir, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
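
Each line of the GloVe text file holds a word followed by its coefficients, separated by spaces. The sketch below parses two made-up lines from an in-memory buffer to show the resulting structure; the words and numbers are invented for illustration and are not real GloVe values.

```python
import io

import numpy as np

# Two invented lines in the "word coef coef coef" layout (not real GloVe values)
toy_file = io.StringIO("the 0.1 -0.2 0.3\nfilm 0.4 0.5 -0.6\n")

toy_embeddings = {}
for line in toy_file:
    word, *coefs = line.split()
    toy_embeddings[word] = np.asarray(coefs, dtype="float32")

print(toy_embeddings["film"])
```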

In [31]:
len(embeddings_index)

400000

In [34]:
# Create an empty embedding matrix
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))

# Populate the embedding matrix
for word, i in word_index.items():
    if i < max_words:
        # Get the word's GloVe vector
        embedding_vector = embeddings_index.get(word)
        
        # Words with no GloVe vector keep their all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
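
It can be useful to check how much of the vocabulary actually receives a pretrained vector, since every uncovered word stays at all zeros. The sketch below does this on toy inputs; all names prefixed with `toy_` and the three-dimensional vectors are assumptions for illustration.

```python
import numpy as np

toy_word_index = {"the": 1, "film": 2, "zzyzx": 3, "rare": 4}
toy_embeddings = {
    "the": np.ones(3),
    "film": np.full(3, 2.0),
    "rare": np.full(3, 3.0),
}
toy_max_words = 4  # rows 0..3; row 0 is the padding placeholder

toy_matrix = np.zeros((toy_max_words, 3))
missing = []
for word, i in toy_word_index.items():
    if i >= toy_max_words:
        continue  # word is outside the vocabulary kept by the model
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)  # row i stays all zeros
    else:
        toy_matrix[i] = vector

# Row 0 is not a word, so the vocabulary has toy_max_words - 1 entries
coverage = (toy_max_words - 1 - len(missing)) / (toy_max_words - 1)
print(missing, round(coverage, 3))
```

The same loop run against `word_index` and `embeddings_index` would report the coverage for the real vocabulary.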

In [37]:
# All zeros is expected: index 0 is a placeholder reserved for padding and is never assigned to a word
embedding_matrix[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [38]:
embedding_matrix[1]

array([-0.038194  , -0.24487001,  0.72812003, -0.39961001,  0.083172  ,
        0.043953  , -0.39140999,  0.3344    , -0.57545   ,  0.087459  ,
        0.28786999, -0.06731   ,  0.30906001, -0.26383999, -0.13231   ,
       -0.20757   ,  0.33395001, -0.33848   , -0.31742999, -0.48335999,
        0.1464    , -0.37303999,  0.34577   ,  0.052041  ,  0.44946   ,
       -0.46970999,  0.02628   , -0.54154998, -0.15518001, -0.14106999,
       -0.039722  ,  0.28277001,  0.14393   ,  0.23464   , -0.31020999,
        0.086173  ,  0.20397   ,  0.52623999,  0.17163999, -0.082378  ,
       -0.71787   , -0.41531   ,  0.20334999, -0.12763   ,  0.41367   ,
        0.55186999,  0.57907999, -0.33476999, -0.36559001, -0.54856998,
       -0.062892  ,  0.26583999,  0.30204999,  0.99774998, -0.80480999,
       -3.0243001 ,  0.01254   , -0.36941999,  2.21670008,  0.72201002,
       -0.24978   ,  0.92136002,  0.034514  ,  0.46744999,  1.10790002,
       -0.19358   , -0.074575  ,  0.23353   , -0.052062  , -0.22

## Specify the 1D convnet architecture
Use the functional API of Keras to specify convolutional layers with multiple kernel sizes. The architecture is similar to the one shown below, which is described in this [paper](https://arxiv.org/abs/1510.03820).

<img src="https://miro.medium.com/max/1400/0*0efgxnFIaLTZ2qkY">
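
To make the figure concrete, the NumPy sketch below runs one embedded sentence through three branches (kernel sizes 2, 3 and 4), applies ReLU and a global max over time, and concatenates the results. It is a hand-written illustration with random weights and toy sizes, not the Keras implementation; `conv1d_relu_globalmax` is a hypothetical helper.

```python
import numpy as np

def conv1d_relu_globalmax(embedded, kernel, bias):
    """embedded: (steps, embed_dim); kernel: (kernel_size, embed_dim, n_filters)."""
    kernel_size, _, n_filters = kernel.shape
    n_windows = embedded.shape[0] - kernel_size + 1  # no padding ("valid")
    feature_map = np.empty((n_windows, n_filters))
    for start in range(n_windows):
        window = embedded[start:start + kernel_size]
        # Sum over the window positions and embedding dimensions for every filter
        feature_map[start] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    feature_map = np.maximum(feature_map, 0.0)  # ReLU
    return feature_map.max(axis=0)  # global max pooling over time

rng = np.random.default_rng(0)
toy_embedded = rng.normal(size=(10, 4))  # 10 words, 4-dimensional embeddings

pooled = [
    conv1d_relu_globalmax(toy_embedded, rng.normal(size=(k, 4, 3)), np.zeros(3))
    for k in (2, 3, 4)
]
features = np.concatenate(pooled)
print(features.shape)
```

Each branch contributes one value per filter regardless of sentence length, so the concatenated vector has a fixed size of three times the number of filters.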

In [53]:
n_filters = 100

text_input = Input(shape=(maxlen,), dtype='int32', name='review')

embedded_text = layers.Embedding(max_words,
                                100,
                                input_length = maxlen,
                                weights = [embedding_matrix],
                                trainable = False)(text_input)

conv1d_2 = layers.Conv1D(n_filters, 2, activation='relu')(embedded_text)
maxpooling_2 = layers.GlobalMaxPooling1D()(conv1d_2)

conv1d_3 = layers.Conv1D(n_filters, 3, activation='relu')(embedded_text)
maxpooling_3 = layers.GlobalMaxPooling1D()(conv1d_3)

conv1d_4 = layers.Conv1D(n_filters, 4, activation='relu')(embedded_text)
maxpooling_4 = layers.GlobalMaxPooling1D()(conv1d_4)

concat_maxpoolings = layers.concatenate([maxpooling_2, maxpooling_3, maxpooling_4], axis=-1)

prediction = layers.Dense(1, activation='sigmoid')(concat_maxpoolings)

model = Model(text_input, prediction)

model.summary()

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
review (InputLayer)             (None, 200)          0                                            
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 200, 100)     2000000     review[0][0]                     
__________________________________________________________________________________________________
conv1d_16 (Conv1D)              (None, 199, 100)     20100       embedding_6[0][0]                
__________________________________________________________________________________________________
conv1d_17 (Conv1D)              (None, 198, 100)     30100       embedding_6[0][0]                
____________________________________________________________________________________________
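
The shapes and parameter counts in the summary can be checked by hand. For a `Conv1D` layer without padding, the output length is the input length minus the kernel size plus one, and the parameter count is kernel size times input channels times filters, plus one bias per filter. The helper below is a hypothetical function written only to reproduce that arithmetic.

```python
def conv1d_summary(steps, embed_dim, kernel_size, n_filters):
    """Return (output_steps, parameter_count) for an unpadded 1D convolution."""
    out_steps = steps - kernel_size + 1
    params = kernel_size * embed_dim * n_filters + n_filters  # weights + biases
    return out_steps, params

embedding_params = 20000 * 100  # max_words * embedding dimension
kernel_2 = conv1d_summary(steps=200, embed_dim=100, kernel_size=2, n_filters=100)
kernel_3 = conv1d_summary(steps=200, embed_dim=100, kernel_size=3, n_filters=100)
print(embedding_params, kernel_2, kernel_3)
```

These values agree with the `embedding_6`, `conv1d_16` and `conv1d_17` rows shown in the summary.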

## Fit the model

In [55]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=10, 
                    batch_size=100, 
                    validation_data=(x_val, y_val), 
                    verbose=2)

Train on 15000 samples, validate on 10000 samples
Epoch 1/10
 - 31s - loss: 0.1704 - accuracy: 0.9473 - val_loss: 0.3031 - val_accuracy: 0.8705
Epoch 2/10
 - 32s - loss: 0.1400 - accuracy: 0.9629 - val_loss: 0.3030 - val_accuracy: 0.8696
Epoch 3/10
 - 32s - loss: 0.1087 - accuracy: 0.9791 - val_loss: 0.3113 - val_accuracy: 0.8657
Epoch 4/10
 - 32s - loss: 0.0872 - accuracy: 0.9881 - val_loss: 0.3074 - val_accuracy: 0.8716
Epoch 5/10
 - 33s - loss: 0.0687 - accuracy: 0.9934 - val_loss: 0.3364 - val_accuracy: 0.8614
Epoch 6/10
 - 33s - loss: 0.0543 - accuracy: 0.9973 - val_loss: 0.3146 - val_accuracy: 0.8709
Epoch 7/10
 - 32s - loss: 0.0424 - accuracy: 0.9985 - val_loss: 0.3227 - val_accuracy: 0.8701
Epoch 8/10
 - 33s - loss: 0.0322 - accuracy: 0.9997 - val_loss: 0.3312 - val_accuracy: 0.8707
Epoch 9/10
 - 32s - loss: 0.0259 - accuracy: 0.9999 - val_loss: 0.3337 - val_accuracy: 0.8701
Epoch 10/10
 - 32s - loss: 0.0204 - accuracy: 0.9999 - val_loss: 0.3515 - val_accuracy: 0.8668
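
In the log above, training accuracy approaches 1.0 while validation loss is lowest early on and then drifts upward, a pattern typical of overfitting. The sketch below locates the best epoch from the validation losses copied from that log; in a live session the same list would be available as `history.history['val_loss']`.

```python
# Validation losses copied from the training log above, epochs 1 to 10
val_loss = [0.3031, 0.3030, 0.3113, 0.3074, 0.3364,
            0.3146, 0.3227, 0.3312, 0.3337, 0.3515]

# Final-epoch accuracies from the same log
train_acc_last, val_acc_last = 0.9999, 0.8668

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
gap = train_acc_last - val_acc_last
print(best_epoch, round(gap, 4))
```

One possible response, not used in this notebook, is to stop training once validation loss stops improving, for example with the Keras `EarlyStopping` callback.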


## Add dense layers to the model

In [56]:
n_filters = 100

text_input = Input(shape=(maxlen,), dtype='int32', name='review')

embedded_text = layers.Embedding(max_words,
                                100,
                                input_length = maxlen,
                                weights = [embedding_matrix],
                                trainable = False)(text_input)

conv1d_2 = layers.Conv1D(n_filters, 2, activation='relu')(embedded_text)
maxpooling_2 = layers.GlobalMaxPooling1D()(conv1d_2)

conv1d_3 = layers.Conv1D(n_filters, 3, activation='relu')(embedded_text)
maxpooling_3 = layers.GlobalMaxPooling1D()(conv1d_3)

conv1d_4 = layers.Conv1D(n_filters, 4, activation='relu')(embedded_text)
maxpooling_4 = layers.GlobalMaxPooling1D()(conv1d_4)

concat_maxpoolings = layers.concatenate([maxpooling_2, maxpooling_3, maxpooling_4], axis=-1)

dense_1 = layers.Dense(100, activation='relu')(concat_maxpoolings)
dense_1_dropout = layers.Dropout(0.5)(dense_1)

dense_2 = layers.Dense(50, activation='relu')(dense_1_dropout)
dense_2_dropout = layers.Dropout(0.5)(dense_2)

prediction = layers.Dense(1, activation='sigmoid')(dense_2_dropout)

model = Model(text_input, prediction)

model.summary()

Model: "model_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
review (InputLayer)             (None, 200)          0                                            
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 200, 100)     2000000     review[0][0]                     
__________________________________________________________________________________________________
conv1d_19 (Conv1D)              (None, 199, 100)     20100       embedding_7[0][0]                
__________________________________________________________________________________________________
conv1d_20 (Conv1D)              (None, 198, 100)     30100       embedding_7[0][0]                
____________________________________________________________________________________________
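
The `Dropout(0.5)` layers added in this model randomly zero half of the activations during training and scale the survivors so the expected activation is unchanged; at inference time the layer passes its input through untouched. The NumPy sketch below illustrates the training-time behavior only. `inverted_dropout` is a hypothetical helper, and the seed and array size are arbitrary.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero a fraction `rate` of the units and rescale the rest by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
toy_activations = np.ones((1000, 100))
dropped = inverted_dropout(toy_activations, rate=0.5, rng=rng)

# Surviving units are doubled, so the mean stays close to the original value of 1.0
print(np.unique(dropped), round(float(dropped.mean()), 2))
```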

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=10, 
                    batch_size=100, 
                    validation_data=(x_val, y_val), 
                    verbose=2)

Train on 15000 samples, validate on 10000 samples
Epoch 1/10
 - 32s - loss: 0.6907 - accuracy: 0.5416 - val_loss: 0.6142 - val_accuracy: 0.7400
Epoch 2/10
 - 32s - loss: 0.5363 - accuracy: 0.7133 - val_loss: 0.4012 - val_accuracy: 0.8323
Epoch 3/10
 - 32s - loss: 0.4290 - accuracy: 0.7931 - val_loss: 0.3968 - val_accuracy: 0.8453
Epoch 4/10
 - 32s - loss: 0.3766 - accuracy: 0.8219 - val_loss: 0.3672 - val_accuracy: 0.8417
Epoch 5/10
 - 32s - loss: 0.3462 - accuracy: 0.8369 - val_loss: 0.3377 - val_accuracy: 0.8521
Epoch 6/10
 - 34s - loss: 0.3088 - accuracy: 0.8593 - val_loss: 0.3479 - val_accuracy: 0.8446
Epoch 7/10
 - 33s - loss: 0.2932 - accuracy: 0.8491 - val_loss: 0.3689 - val_accuracy: 0.8520
Epoch 8/10
 - 32s - loss: 0.2335 - accuracy: 0.8865 - val_loss: 0.3388 - val_accuracy: 0.8583
Epoch 9/10
 - 32s - loss: 0.1984 - accuracy: 0.8971 - val_loss: 0.3758 - val_accuracy: 0.8582
Epoch 10/10
