### Learn embeddings from scratch

In this example, we will train a **one-dimensional Convolutional Neural Network (CNN) to classify
sentences as either positive or negative.** Words in sentences exhibit linear structure in the same way as images exhibit spatial structure.

Traditional (non-deep learning) NLP approaches to language modeling create word
[n-grams](https://en.wikipedia.org/wiki/N-gram) to exploit this linear structure among words. **One-dimensional
CNNs do something similar**: they learn convolution filters that operate on sentences a few
words at a time, then max pool the results into a vector that represents the most important
ideas in the sentence. Another class of neural network, the **Recurrent Neural Network (RNN)**, is
specially designed **to handle sequence data**, including text, which is a sequence of words.
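
To make the n-gram analogy concrete, the sketch below slides a window of three words over a tokenized sentence. These are the same positions a 1D convolution with a kernel size of 3 would visit. The helper name `word_ngrams` and the example sentence are illustrative and not part of the notebook.

```python
def word_ngrams(tokens, n=3):
    """Return all contiguous n-word windows from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence, split on whitespace for simplicity.
tokens = "i really liked this movie".split()
trigrams = word_ngrams(tokens, n=3)

for gram in trigrams:
    print(gram)
```

A sentence of 5 words yields 3 trigrams, just as a "valid" convolution over 5 positions with kernel size 3 yields 3 outputs per filter.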

**Install NLTK (Natural Language Toolkit)** to parse the text into sentences and words. The statistical models supplied by NLTK are more powerful at parsing than regular expressions.
            
    conda install nltk

The **sequence of word indices is fed into an embedding layer**; each sequence has a fixed length (in our case, the
number of words in the longest sentence). The embedding layer is initialized by default to random
values. **The output of the embedding layer is connected to a 1D convolutional layer** that convolves (in
our example) word trigrams in 256 different ways (essentially, it applies different learned linear
combinations of weights to the word embeddings). A global max pooling layer then reduces these features
to a single vector by keeping the maximum response of each filter across the sentence. **This vector (of length 256) is then input to a dense layer**, which
outputs a vector of length 2. **A softmax activation returns a pair of probabilities**, one corresponding to
positive sentiment and the other to negative sentiment. The network is shown in the
following figure:
<img src="CNN_Text.JPG">
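
The flow of shapes through this network can be traced with a small NumPy sketch. It is not the Keras implementation: the weights are random stand-ins for learned parameters, and the toy sizes (8-dimensional embeddings, 4 filters, a 10-word sentence) are assumptions chosen to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions); the notebook uses 100-d embeddings and 256 filters.
vocab_size, embed_size, num_filters, kernel_size, seq_len = 50, 8, 4, 3, 10

# Random weights stand in for the parameters the network would learn.
embedding = rng.normal(size=(vocab_size, embed_size))
conv_w = rng.normal(size=(kernel_size, embed_size, num_filters))
conv_b = np.zeros(num_filters)
dense_w = rng.normal(size=(num_filters, 2))
dense_b = np.zeros(2)

word_ids = rng.integers(0, vocab_size, size=seq_len)  # one padded sentence

# 1. Embedding lookup: (seq_len,) -> (seq_len, embed_size)
embedded = embedding[word_ids]

# 2. 1D convolution over word trigrams ("valid" padding) followed by ReLU:
#    (seq_len, embed_size) -> (seq_len - kernel_size + 1, num_filters)
windows = np.stack([embedded[i:i + kernel_size]
                    for i in range(seq_len - kernel_size + 1)])
conv_out = np.maximum(0.0, np.einsum("tke,kef->tf", windows, conv_w) + conv_b)

# 3. Global max pooling keeps each filter's strongest response: -> (num_filters,)
pooled = conv_out.max(axis=0)

# 4. Dense layer plus softmax: -> (2,) class probabilities
logits = pooled @ dense_w + dense_b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(embedded.shape, conv_out.shape, pooled.shape, probs.shape)
```

With the notebook's settings, the pooled vector would have length 256 regardless of sentence length, which is why the dense layer can have a fixed input size.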

In [1]:
from keras.layers.core import Dense, SpatialDropout1D, Dropout
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np     
import codecs

from tensorflow.keras.callbacks import TensorBoard
from time import gmtime, strftime
import datetime, os
import tensorflow as tf
import time

Using TensorFlow backend.


To run TensorBoard, which displays graphs for evaluating the neural network model:

- 1: create a folder named *logs* inside your main folder
- 2: run the following command in a terminal opened in your main folder:  *tensorboard --logdir=logs/*
- 3: type the HTTP address shown in the terminal into a new browser tab
- 4: press Enter: the TensorBoard window opens in your browser

Note: Wait until training has completed all 20 epochs, as defined by NUM_EPOCHS = 20.

In [2]:
NAME = "wordtext{}".format(int(time.time()))
tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))

In [3]:
""" 
We want consistent results between runs. Since the weight matrices are
initialized randomly, differences in initialization can lead to
differences in output, so fixing the seed is a way to control that.
"""
np.random.seed(42)  
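
The effect of seeding can be checked in isolation. The sketch below uses a hypothetical helper, `draw`, to show that the same seed reproduces the same NumPy draws. Note that `np.random.seed` only fixes NumPy's generator; whether this alone makes training fully repeatable depends on the backend, which may need its own seed (an assumption to verify for your setup).

```python
import numpy as np

def draw(seed):
    """Seed NumPy's global generator and return three random numbers."""
    np.random.seed(seed)
    return np.random.rand(3)

first = draw(42)
second = draw(42)   # same seed, same numbers
other = draw(7)     # different seed, different numbers

print(np.array_equal(first, second))
print(np.array_equal(first, other))
```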

We will classify sentences from the **UMICH SI650 sentiment classification competition on Kaggle**. The dataset has around 7000 sentences, and is **labeled 1 for positive and 0 for negative**. The format of the file is a sentiment label (0 or 1) followed by a tab, followed by a sentence.

**Download data from:** https://www.kaggle.com/c/si650winter11/data
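
As a quick illustration of the file format described above, the sketch below parses one line into its label and sentence. The sample line is made up and `parse_line` is a hypothetical helper. Splitting on the first tab only is a defensive choice in case a sentence itself contains a tab.

```python
def parse_line(line):
    """Split one 'label<TAB>sentence' line into (int label, sentence)."""
    label, sentence = line.strip().split("\t", 1)
    return int(label), sentence

sample = "1\tI loved this movie.\n"   # made-up line in the described format
label, sentence = parse_line(sample)
print(label, repr(sentence))
```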

In [4]:
INPUT_FILE = "data/umich-sentiment-train.txt"
VOCAB_SIZE = 5000
EMBED_SIZE = 100
NUM_FILTERS = 256
NUM_WORDS = 3
BATCH_SIZE = 64
NUM_EPOCHS = 20

In [5]:
counter = collections.Counter()
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')

In the next block, we first read our input sentences and construct our vocabulary out of the most
frequent words in the corpus. We then use this vocabulary to convert our input sentences into a list of
word indices.

In [7]:
maxlen = 0
for line in fin:
    _, sent = line.strip().split("\t")
    words = [x.lower() for x in nltk.word_tokenize(sent)]   # lowercase each token
    # Track the longest sentence: every sentence is later padded to maxlen
    # (the number of words in the longest sentence in the training set).
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        counter[word] += 1
fin.close()

word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word[0]] = wid + 1
# Add one for index 0, which represents UNK,
# i.e. words that are not in the vocabulary.
vocab_sz = len(word2index) + 1
index2word = {v: k for k, v in word2index.items()}
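
The same vocabulary logic can be tried on a toy corpus. The sentences and the cap `toy_vocab_size` below are made up for illustration. The point to notice is that `defaultdict(int)` returns 0 for any word outside the vocabulary, so index 0 acts as the UNK index.

```python
import collections

# Made-up, already tokenized sentences.
sentences = [["the", "movie", "was", "great"],
             ["the", "book", "was", "awful"],
             ["great", "movie"]]
toy_vocab_size = 4   # hypothetical cap, playing the role of VOCAB_SIZE

counter = collections.Counter(w for s in sentences for w in s)

# Indices start at 1; index 0 is left for unknown words.
word2index = collections.defaultdict(int)
for wid, (word, _count) in enumerate(counter.most_common(toy_vocab_size)):
    word2index[word] = wid + 1

print(dict(word2index))
print(word2index["awful"])   # not among the 4 most frequent words
```

Looking up a missing key in a `defaultdict` also inserts it with the default value, so the dictionary grows as unknown words are queried.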

In [8]:
xs, ys = [], []
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
for line in fin:
    label, sent = line.strip().split("\t")
    ys.append(int(label))
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    wids = [word2index[word] for word in words]
    xs.append(wids)
fin.close()
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)
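
To see what the last two lines produce, here is a NumPy sketch of the same transformations. It assumes the Keras defaults for `pad_sequences` (zeros added, and overlong sequences cut, at the front). `pad_pre` and `one_hot` are hypothetical stand-ins, not the Keras functions.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad (and left-truncate) integer sequences with zeros to maxlen."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]              # keep the last maxlen items
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

def one_hot(labels, num_classes=2):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes, dtype="float32")[labels]

X_toy = pad_pre([[4, 2], [7, 1, 3, 5]], maxlen=4)
Y_toy = one_hot([1, 0])
print(X_toy)
print(Y_toy)
```

With this encoding, column 0 of the label matrix corresponds to negative sentiment (label 0) and column 1 to positive sentiment (label 1).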

We split our data into a 70/30 training and test set. The data is now in a form ready to be
fed into the network.

In [9]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=42)
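
The resulting split sizes can be checked by hand. The sketch assumes that scikit-learn rounds the test count up and assigns the remainder to training, and it takes the total of 7086 samples from the training log shown later in this notebook (4960 + 2126).

```python
import math

n_samples = 7086      # 4960 + 2126, as reported in the training log
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # assumed rounding rule
n_train = n_samples - n_test
print(n_train, n_test)
```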

We define the network that we described earlier in this notebook.

In [10]:
model = Sequential()

model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen))
model.add(SpatialDropout1D(0.2))

model.add(Conv1D(filters=NUM_FILTERS,
                 kernel_size=NUM_WORDS,
                 activation="relu"))

model.add(GlobalMaxPooling1D())

model.add(Dense(2, activation="softmax"))
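
As a rough check on the model size, the trainable parameters of each layer can be counted by hand. The hypothetical helper `cnn_param_counts` uses the notebook's hyperparameters as defaults. The value `vocab_sz=5001` is an assumption: it holds only if the corpus contains at least `VOCAB_SIZE` distinct tokens.

```python
def cnn_param_counts(vocab_sz, embed_size=100, num_filters=256,
                     kernel_size=3, num_classes=2):
    """Count trainable parameters per layer for the network defined above."""
    embedding = vocab_sz * embed_size                             # one vector per word index
    conv = kernel_size * embed_size * num_filters + num_filters  # weights + biases
    dense = num_filters * num_classes + num_classes              # weights + biases
    return {"embedding": embedding, "conv1d": conv, "dense": dense,
            "total": embedding + conv + dense}

counts = cnn_param_counts(vocab_sz=5001)   # assumed vocabulary size
for layer, n in counts.items():
    print(layer, n)
```

The dropout and pooling layers have no trainable parameters, so they do not appear in the count. Under these assumptions most of the parameters sit in the embedding layer.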


As the output below shows, the network gives us 99.3% accuracy on the test set.

In [12]:
model.compile(loss="categorical_crossentropy", 
              optimizer="adam",
              metrics=["accuracy"])

history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    callbacks=[tensorboard],
                    validation_data=(Xtest, Ytest))

# evaluate model
score = model.evaluate(Xtest, Ytest, verbose=1)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score[0], score[1]))

Train on 4960 samples, validate on 2126 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test score: 0.018, accuracy: 0.993
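
The "Test score" is the loss the model was compiled with, categorical cross-entropy, and the accuracy is the fraction of sentences whose most probable class matches the label. The NumPy sketch below computes both on made-up predictions. It illustrates the two formulas and is not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted class matches the true class."""
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

# One-hot labels and made-up softmax outputs for three sentences.
y_true = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.1, 0.9], [0.8, 0.2], [0.6, 0.4]])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print("loss: {:.3f}, accuracy: {:.3f}".format(loss, acc))
```

The third sentence is misclassified and contributes the largest share of the loss, because the loss penalizes low probability on the true class.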


**Validation plots for Loss and Accuracy** 
<img src="tensorboard2.JPG">

**Structure of the Neural Network Model**
<img src="tensorboard3.JPG">