# Sentiment Analysis
Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

In this tutorial, we will use several machine learning techniques to do sentiment analysis on movie reviews. The code examples are adapted from various sources.

References:
https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
https://developers.google.com/machine-learning/crash-course/embeddings/programming-exercise
https://github.com/adeshpande3/Tensorflow-Programs-and-Tutorials/blob/master/Sentiment%20Analysis%20with%20LSTMs.ipynb

# Download the IMDb movie review data

The IMDB movie review set can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset (aclImdb_v1.tar.gz), decompress the files.
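The decompression step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `extract_archive` is an assumption made for this illustration, and the archive is assumed to have been downloaded already.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Unpack a .tar.gz archive into dest_dir and return the top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Collect the first path component of every member, e.g. 'aclImdb'.
        top_level = sorted({m.name.split("/")[0] for m in tar.getmembers()})
        tar.extractall(path=dest)
    return top_level


# Usage, assuming the archive was saved in the current directory:
# extract_archive("aclImdb_v1.tar.gz")
```

After extraction, the `aclImdb` directory should contain the `train` and `test` folders, each with `pos` and `neg` subfolders, which is the layout the loading code in this tutorial relies on.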

# Pre-processing IMDB dataset

The preprocessing step cleans up the data. This includes removing HTML tags (for example with the Python package BeautifulSoup), removing unnecessary punctuation (with Python's regular expression module `re`), converting the text to lower case, and removing stop words if needed (with the Python Natural Language Toolkit, NLTK). To convert a cleaned sequence of words to numerical feature vectors, try the following methods:

• Bag of Words (BOW): Bag of words is probably the simplest way to numerically represent texts. Given a text $T$, we assign a vector $v^T \in \mathbb{N}^d$ to it, such that $v^T_i$ is the number of times the $i$-th word of the vocabulary has appeared in the text $T$. Here $d$ is the size of our vocabulary, which consists of all words in the set of reviews except for very rare words (we use the 5000 most frequent words). After learning the BOW vectors for every review in the labeled training set, we fit a classifier to the data.

• Word2Vec: Another way to numerically represent texts is to transform each word of the text to a vector. This transformation should preserve the semantics of words, that is, if the meanings of two words are close, their vectors should be close as well (in an L2-distance sense). One important aspect of the word2vec task is that it is independent of the main objective (here sentiment analysis), and does not require a labeled dataset. Note that for the sentiment analysis we need a feature vector for each review.

• Words to reviews (averaging): Perhaps the simplest way to assign a feature vector to a set of words (a review) is to average the word vectors of all words.

Reference: https://cs224d.stanford.edu/reports/PouransariHadi.pdf
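The averaging idea from the list above can be sketched in a few lines. The embeddings below are hypothetical two-dimensional vectors invented for this illustration; real word2vec vectors would be learned from a corpus and have many more dimensions.

```python
def average_word_vectors(tokens, embeddings):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no known word, so no feature vector can be built
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings, invented for this illustration only.
toy_embeddings = {
    "good": [1.0, 0.0],
    "great": [0.8, 0.2],
    "bad": [-1.0, 0.0],
}

review = "a good great movie".split()
print(average_word_vectors(review, toy_embeddings))
```

Words without a vector ("a" and "movie" here) are skipped, so the review vector is the mean of the vectors for "good" and "great" only. Skipping unknown words is one possible choice, not the only one.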

First, we will use the bag-of-words approach to convert text to feature vectors.

Reference:
https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch08/ch08.ipynb

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [17]:
import pyprind
import pandas as pd
import os
import time
import numpy as np

In [None]:
# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
# collect [review, sentiment] rows in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

We can save the assembled data as a CSV file for further use.

In [None]:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [2]:
#have a look at the file

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1


In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

print(count.vocabulary_)

print(bag.toarray())
np.set_printoptions(precision=2)


In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

In [None]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

In [None]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf
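The two cells above can be reproduced by hand, which makes the formula explicit. This sketch assumes the smoothed idf used when `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses the three example sentences lowercased and with punctuation removed.

```python
import math

docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining the weather is sweet and one and one is two",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)


def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smooth_idf=True form."""
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


# Raw (unnormalized) tf-idf of the third document: term count times idf.
last = tokenized[-1]
raw = {t: last.count(t) * smoothed_idf(t) for t in vocab}

# L2 normalization: divide every entry by the vector's Euclidean length.
length = math.sqrt(sum(v * v for v in raw.values()))
normalized = {t: v / length for t, v in raw.items()}

for t in vocab:
    print(f"{t:8s} raw={raw[t]:.2f}  l2={normalized[t]:.2f}")
```

For example, "is" occurs three times in the third document and appears in all three documents, so its idf is ln(4/4) + 1 = 1 and its raw tf-idf is 3.0. "and" occurs twice but appears in only one document, giving 2 × (ln(4/2) + 1) ≈ 3.39.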

In [None]:
df.loc[0, 'review'][-50:]

In [4]:
import re
def preprocessor(text):
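    # remove HTML tags (raw strings avoid invalid-escape warnings)
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # lowercase, replace non-word characters by spaces, re-append emoticons
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))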
    return text

In [None]:
preprocessor(df.loc[0, 'review'][-50:])

In [5]:
df['review'] = df['review'].apply(preprocessor)

In [37]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [38]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\javai\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['runner', 'like', 'run', 'run', 'lot']

In [39]:
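# .loc slicing is label-based and includes the end label,
# so stop at 24999 to avoid sharing row 25000 with the test set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values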
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

# Training a classifier

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5,
                             max_df = 0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

#count_vect = CountVectorizer()
#X_train_counts = count_vect.fit_transform(X_train)

#tf_transformer = TfidfTransformer()
#X_train_tfidf = tf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(train_vectors, y_train)

print(train_vectors.shape, test_vectors.shape)

(2501, 7207) (47500, 7207)




In [34]:
from sklearn.metrics import classification_report

predicted = clf.predict(test_vectors)

np.mean(predicted == y_test)

#print(classification_report(y_test, predicted))

0.47368421052631576

# Building a Pipeline

In order to make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

    >>> from sklearn.pipeline import Pipeline
    >>> text_clf = Pipeline([('vect', CountVectorizer()),
    ...                      ('tfidf', TfidfTransformer()),
    ...                      ('clf', MultinomialNB()),
    ... ])

The names vect, tfidf and clf (classifier) are arbitrary. We shall see their use in the section on grid search, below. We can now train the model with a single command.
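As a minimal sketch of that single command, the pipeline can be fitted on a tiny toy corpus. The four documents and their labels are invented for this illustration and are not taken from the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for this illustration; 1 = positive, 0 = negative.
toy_docs = ["good great", "great fun", "bad awful", "awful boring"]
toy_labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# One call fits the vectorizer, the transformer, and the classifier in order.
text_clf.fit(toy_docs, toy_labels)
print(text_clf.predict(["great", "awful"]))

# Step names prefix parameter names as <step>__<parameter>.
text_clf.set_params(clf__alpha=0.5)
print(text_clf.get_params()["clf__alpha"])
```

The `<step>__<parameter>` naming is what a grid search uses to address the parameters of individual steps, so the prefixes in a parameter grid have to match the step names chosen here.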

In [42]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.

# Keys follow the <step name>__<parameter> convention, so the prefixes must
# match the step names used in the Pipeline below ('vect' and 'clf').
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# solver='liblinear' supports both the 'l1' and the 'l2' penalty
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=2,
                           verbose=1,
                           n_jobs=-1)

In [None]:
gs_lr_tfidf.fit(X_train[:400], y_train[:400])

Fitting 2 folds for each of 24 candidates, totalling 48 fits


In [None]:
# predict on review texts (not on labels) and compare with the matching labels
predicted = gs_lr_tfidf.predict(X_test[:10])
np.mean(predicted == y_test[:10])

# Another way is to use Keras
Keras provides built-in access to the IMDB dataset.
The keras.datasets.imdb.load_data() function loads the dataset in a format that is ready for use in neural network and deep learning models.

The dataset contains 25,000 movie reviews from IMDB for training and another 25,000 for testing, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data.

Usage:
from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
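The `start_char`, `oov_char`, and `index_from` arguments determine how the integers in a loaded review map back to words. The sketch below illustrates the default offsets with a hypothetical four-word frequency ranking, invented for this example; with the real data the ranking would come from `imdb.get_word_index()`.

```python
# Hypothetical word ranks (1 = most frequent), invented for this illustration.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # actual words are encoded as rank + 3
RESERVED = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def decode(sequence, word_index):
    """Map an encoded review back to tokens under the default offsets."""
    inverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    inverse.update(RESERVED)
    return [inverse.get(i, "<oov>") for i in sequence]


encoded = [1, 4, 5, 6, 7, 2]
print(decode(encoded, toy_word_index))
# -> ['<start>', 'the', 'movie', 'was', 'great', '<oov>']
```

With these defaults, every encoded review begins with 1 (the start marker), words outside the kept vocabulary become 2, and 0 is left free for padding.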

In [1]:
import numpy
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = numpy.concatenate((X_train, X_test), axis=0)
y = numpy.concatenate((y_train, y_test), axis=0)

Using TensorFlow backend.


In [2]:
print("Training data: ")
print(X.shape)
print(y.shape)

# Summarize number of classes
print("Classes: ")
print(numpy.unique(y))

# Summarize number of words
print("Number of words: ")
print(len(numpy.unique(numpy.hstack(X))))

Training data: 
(50000,)
(50000,)
Classes: 
[0 1]
Number of words: 
88585


In [1]:
# LSTM for the IMDB problem
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU
from keras.preprocessing import sequence

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

Using TensorFlow backend.


For background on why we need an embedding layer and related preprocessing, see: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [12]:
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

print('Build model...')

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 168,353
Trainable params: 168,353
Non-trainable params: 0
_________________________________________________________________
None


In [13]:
print('Train...')

model.fit(X_train, y_train, batch_size=128, epochs=2, validation_data=(X_test, y_test), verbose=2)


Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
 - 339s - loss: 0.5616 - acc: 0.7226 - val_loss: 0.4220 - val_acc: 0.8128
Epoch 2/2
 - 350s - loss: 0.3961 - acc: 0.8279 - val_loss: 0.4394 - val_acc: 0.7989


<keras.callbacks.History at 0x26cab0ddc88>

In [15]:
score, acc = model.evaluate(X_test, y_test, batch_size=128)

print('Test score:', score)
print('Test accuracy:', acc)
print("Accuracy: %.2f%%" % (acc*100))

Test score: 0.439354675655365
Test accuracy: 0.7989199999809266
Accuracy: 79.89%


In [16]:
# let us check how the model predicts
classes = model.predict(X_test[:10], batch_size=128)
for i in range(10):
    if (classes[i] > 0.5 and y_test[i] == 1) or (classes[i] <= 0.5 and y_test[i] == 0):
        print(classes[i], y_test[i], " Right prediction")
    else:
        print(classes[i], y_test[i], " Wrong prediction")

[0.7350859] 0  Wrong prediction
[0.8579811] 1  Right prediction
[0.67867035] 1  Right prediction
[0.32355246] 0  Right prediction
[0.9454058] 1  Right prediction
[0.39817715] 1  Wrong prediction
[0.7007876] 1  Right prediction
[0.19897594] 0  Right prediction
[0.5610889] 0  Wrong prediction
[0.8130962] 1  Right prediction


# Let us use another model in Keras and see how it performs

Gated Recurrent Unit (GRU): although many people use the LSTM, the GRU has been observed to train faster than the LSTM while giving similar performance.

Let us use a GRU and check the performance.

Ref: Deep Learning By Example; By Ahmed Menshawy

https://arxiv.org/pdf/1412.3555v1.pdf

https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-lstm

Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer.
This data preparation step can be performed using the Tokenizer API, also provided with Keras.
Here we use the Tokenizer API to convert text data, including unseen text, to the integer encoding format so that we can test the model.
The Tokenizer class converts text to sequences consistently through its texts_to_sequences() method,
and provides access to the dictionary mapping of words to integers in its word_index attribute.
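To make the integer encoding concrete, here is a simplified pure-Python sketch of the three steps: building a frequency-ranked word index, converting texts to index sequences, and padding. The function names are chosen for this illustration and this is not the Keras implementation; for instance, it splits on whitespace only and does not strip punctuation.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by indices, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences


def pad_left(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen entries."""
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]


train_texts = ["good movie", "bad movie", "good good fun"]
word_index = fit_word_index(train_texts)
encoded = texts_to_sequences(["good fun movie", "unseen words"], word_index)
print(pad_left(encoded, maxlen=4))
```

Words that were never seen while fitting are dropped, so a text made only of unseen words becomes a sequence of zeros after padding. This is worth keeping in mind when testing the model on hand-written samples.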

In [3]:
# import from the public keras package, as in the earlier cells,
# rather than from the private tensorflow.python namespace
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pandas as pd

# load the dataset but only keep the top n words, zero the rest
top_words = 5000
max_words = 500

#load the csv file saved

df = pd.read_csv('movie_data.csv', encoding='utf-8')

X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

tokenizer_obj = Tokenizer(num_words=top_words)
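# concatenate the two sets of reviews; `+` on NumPy arrays of strings
# would join them element by element instead
total_reviews = list(X_train) + list(X_test)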
tokenizer_obj.fit_on_texts(total_reviews) 

X_train_tokens =  tokenizer_obj.texts_to_sequences(X_train)
X_test_tokens = tokenizer_obj.texts_to_sequences(X_test)


X_train_pad = pad_sequences(X_train_tokens, maxlen=max_words)
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_words)


In [4]:
print('Build model...')

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(GRU(units=32,  dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Summary of the built model...')
print(model.summary())

Build model...
Summary of the built model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
gru_1 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 166,273
Trainable params: 166,273
Non-trainable params: 0
_________________________________________________________________
None
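The parameter counts in the two model summaries (8,320 for the LSTM layer and 6,240 for the GRU layer) can be checked by hand, which also shows why the GRU is the lighter layer: it has three gate-like blocks where the LSTM has four. The sketch below assumes one input kernel, one recurrent kernel, and one bias vector per block, which matches the numbers printed above; other Keras versions may report a slightly different GRU count if they use an extra bias.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def recurrent_params(units, input_dim, gates):
    """Each gate has an input kernel, a recurrent kernel, and a bias."""
    return gates * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    """A weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs


print(embedding_params(5000, 32))         # 160000
print(recurrent_params(32, 32, gates=4))  # LSTM: 8320
print(recurrent_params(32, 32, gates=3))  # GRU: 6240
print(dense_params(32, 1))                # 33
```

Adding the three layers gives 168,353 parameters for the LSTM model and 166,273 for the GRU model, the totals shown in the summaries.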


In [None]:
print('Train...')

model.fit(X_train_pad, y_train, batch_size=128, epochs=25, validation_data=(X_test_pad, y_test), verbose=2)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/25
 - 269s - loss: 0.7271 - acc: 0.5000 - val_loss: 0.7031 - val_acc: 0.4998
Epoch 2/25
 - 255s - loss: 0.6931 - acc: 0.5218 - val_loss: 0.6988 - val_acc: 0.4986
Epoch 3/25


We also try out the model save and load API from Keras. This is not needed for the task itself; it is included only as a demonstration.

In [None]:
from keras.models import model_from_json

# serialize model to JSON
model_json = model.to_json()
with open("movies_sa_model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("movies_sa_model.h5")
print("Saved model to disk")

# load json and create model
with open('movies_sa_model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("movies_sa_model.h5")
# a model rebuilt from JSON has to be compiled before evaluate() is called
loaded_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Loaded model from disk")

In [7]:
print('Testing...')
score, acc = loaded_model.evaluate(X_test_pad, y_test, batch_size=128)

print('Test score:', score)
print('Test accuracy:', acc)

print("Accuracy: {0:.2%}".format(acc))

Testing...
Test score: 0.3937822948884964
Test accuracy: 0.82836
Accuracy: 82.84%


In [8]:
predicted = loaded_model.predict(x=X_test_pad[0:1000])
predicted = predicted.T[0]

In [10]:
import numpy as np
class_predicted = np.array([1.0 if prob>0.5 else 0.0 for prob in predicted])
class_actual = np.array(y_test[0:1000])
incorrect_samples = np.where(class_predicted != class_actual)
incorrect_samples = incorrect_samples[0]
len(incorrect_samples)


218

Let us try some samples to test our model. We will use the Tokenizer to convert the text.

In [11]:
#Let us test some  samples
test_sample_1 = "This movie is fantastic! I really like it because it is so good!"
test_sample_2 = "Good movie!"
test_sample_3 = "Maybe I like this movie."
test_sample_4 = "Meh ..."
test_sample_5 = "If I were a drunk teenager then this movie might be good."
test_sample_6 = "Bad movie!"
test_sample_7 = "Not a good movie!"
test_sample_8 = "This movie really sucks! Can I get my money back please?"
test_samples = [test_sample_1, test_sample_2, test_sample_3, test_sample_4, test_sample_5, test_sample_6, test_sample_7, test_sample_8]

test_samples_tokens = tokenizer_obj.texts_to_sequences(test_samples)
test_samples_tokens_pad = pad_sequences(test_samples_tokens, maxlen=max_words)

#predict
loaded_model.predict(x=test_samples_tokens_pad)


array([[0.97209734],
       [0.9428219 ],
       [0.8753096 ],
       [0.9427748 ],
       [0.23053733],
       [0.58713585],
       [0.9108747 ],
       [0.1033233 ]], dtype=float32)
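The sigmoid outputs above are probabilities of the positive class. A small sketch of turning them into labels with a 0.5 threshold, using the eight samples and the probabilities copied from the output above:

```python
samples = [
    "This movie is fantastic! I really like it because it is so good!",
    "Good movie!",
    "Maybe I like this movie.",
    "Meh ...",
    "If I were a drunk teenager then this movie might be good.",
    "Bad movie!",
    "Not a good movie!",
    "This movie really sucks! Can I get my money back please?",
]
# Probabilities copied from the prediction output above.
probabilities = [0.97209734, 0.9428219, 0.8753096, 0.9427748,
                 0.23053733, 0.58713585, 0.9108747, 0.1033233]

# Anything above 0.5 is read as a positive review.
labels = ["positive" if p > 0.5 else "negative" for p in probabilities]

for text, p, label in zip(samples, probabilities, labels):
    print(f"{p:.2f}  {label:8s}  {text}")
```

Read this way, the model labels "Bad movie!" and "Not a good movie!" as positive, which suggests that very short texts and negation remain difficult for it.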

We see that training the GRU-based model is faster than training the LSTM-based one. Each epoch takes roughly 240 s for the GRU and around 340 s for the LSTM on an NVIDIA GeForce GTX 970.

# Bag of Tricks for Efficient Text Classification
https://arxiv.org/pdf/1607.01759.pdf

Abstract: This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

The paper relies on two key techniques:
> 1) A bag of n-grams as additional features to capture some partial information about the local word order.

> 2) A fast and memory-efficient mapping of the n-grams, maintained by using the hashing trick.
    https://en.wikipedia.org/wiki/Feature_hashing
    http://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick
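The hashing trick can be illustrated with a short sketch: instead of storing a dictionary from every n-gram to an index, each n-gram is hashed straight into one of a fixed number of buckets. The use of CRC32 and the function name `hashed_ngram_ids` are assumptions made for this illustration; this is not the hash function used by fastText.

```python
import zlib


def hashed_ngram_ids(token_ids, n=2, num_buckets=1000, offset=0):
    """Map each n-gram of integer tokens to a bucket index via a hash."""
    ids = []
    for i in range(len(token_ids) - n + 1):
        ngram = token_ids[i:i + n]
        key = " ".join(map(str, ngram)).encode("utf-8")
        # offset keeps n-gram ids apart from the ids of single words
        ids.append(offset + zlib.crc32(key) % num_buckets)
    return ids


sequence = [1, 3, 7, 9, 2]
print(hashed_ngram_ids(sequence, n=2, num_buckets=1000, offset=20000))
```

Memory use is bounded by `num_buckets` no matter how many distinct n-grams occur. The price is that two different n-grams can collide in the same bucket and then share a feature.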

Refer to code here: https://github.com/keras/blob/master/examples/imdb_fasttext.py

In [None]:

'''This example demonstrates the use of fasttext for text classification
Based on Joulin et al's paper:
Bags of Tricks for Efficient Text Classification
https://arxiv.org/abs/1607.01759

Results on IMDB datasets with uni and bi-gram embeddings:
    Uni-gram: 0.8813 test accuracy after 5 epochs. 8s/epoch on i7 cpu.
    Bi-gram : 0.9056 test accuracy after 5 epochs. 2s/epoch on GTx 980M gpu.
'''


from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.datasets import imdb

def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2)
    {(4, 9), (4, 1), (1, 4), (9, 4)}
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=3)
    [(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)]
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))

def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of list (sequences) by appending n-grams values.
    Example: adding bi-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017}
    >>> add_ngram(sequences, token_indice, ngram_range=2)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]]

    Example: adding tri-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017, (7, 9, 2): 2018}
    >>> add_ngram(sequences, token_indice, ngram_range=3)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42, 2018]]
    """

    new_sequences = []

    for input_list in sequences:
        new_list = input_list[:]

        for ngram_value in range(2, ngram_range + 1):
            for i in range(len(new_list) - ngram_value + 1):
                ngram = tuple(new_list[i:i + ngram_value])

                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)

    return new_sequences


# Set parameters:
# ngram_range = 2 will add bi-grams features
ngram_range = 1
max_features = 20000
maxlen = 400
batch_size = 32
embedding_dims = 50
epochs = 5

print('Loading data...')

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Average train sequence length: {}'.format(np.mean(list(map(len, x_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, x_test)), dtype=int)))

if ngram_range > 1:
    print('Adding {}-gram features'.format(ngram_range))

    # Create set of unique n-gram from the training set.
    ngram_set = set()

    for input_list in x_train:
        for i in range(2, ngram_range + 1):

            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)

    # Dictionary mapping n-gram token to a unique integer.
    # Integer values are greater than max_features in order
    # to avoid collision with existing features.
    start_index = max_features + 1
    token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
    indice_token = {token_indice[k]: k for k in token_indice}

    # max_features is the highest integer that could be found in the dataset.
    max_features = np.max(list(indice_token.keys())) + 1
    # Augmenting x_train and x_test with n-grams features
    x_train = add_ngram(x_train, token_indice, ngram_range)
    x_test = add_ngram(x_test, token_indice, ngram_range)
    print('Average train sequence length: {}'.format(np.mean(list(map(len, x_train)), dtype=int)))
    print('Average test sequence length: {}'.format(np.mean(list(map(len, x_test)), dtype=int)))

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))

# we add a GlobalAveragePooling1D, which will average the embeddings
# of all words in the document
model.add(GlobalAveragePooling1D())

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

When you run the above cell, you will notice that the model trains much faster than the LSTM-based model above.

In [None]:
score, acc = model.evaluate(x_test, y_test,batch_size=128)

print('Test score:', score)
print('Test accuracy:', acc)
print("Accuracy: %.2f%%" % (acc*100))

The test accuracy is around 89%, which is comparable to or better than the LSTM- and GRU-based models above (about 80% and 83% in the runs shown), with the bonus of much faster training.

In [None]:
# let us check how the model predicts
classes = model.predict(x_test[:10], batch_size=128)
for i in range(10):
    if (classes[i] > 0.5 and y_test[i] == 1) or (classes[i] <= 0.5 and y_test[i] == 0):
        print(classes[i], y_test[i], " Right prediction")
    else:
        print(classes[i], y_test[i], " Wrong prediction")
        

# Using TensorFlow

From the Google Machine Learning Crash Course:
https://developers.google.com/machine-learning/crash-course/embeddings/programming-exercise

The exercise covers the following steps:

• Convert movie-review string data to a sparse feature vector

• Implement a sentiment-analysis linear model using a sparse feature vector

• Implement a sentiment-analysis DNNClassifier model using an embedding that projects data into two dimensions

• Visualize the embedding to see what the model has learned about the relationships between words

In this exercise, we'll explore sparse data and work with embeddings using text data from movie reviews (from the ACL 2011 IMDB dataset). This data has already been processed into tf.Example format.
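The first step listed above, turning the terms of a review into a sparse feature vector, can be sketched without TensorFlow. The sketch keeps only the vocabulary index of each in-vocabulary term and assumes that out-of-vocabulary terms are simply ignored. The five-word vocabulary and the review are hypothetical, invented for this illustration.

```python
def to_sparse_indices(terms, vocabulary):
    """Return the vocabulary index of every in-vocabulary term, in order."""
    index = {term: i for i, term in enumerate(vocabulary)}
    return [index[t] for t in terms if t in index]


# A shortened, hypothetical vocabulary used only for this illustration.
vocabulary = ("bad", "great", "boring", "movie", "!")
review_terms = ["what", "a", "great", "movie", "!", "great", "cast"]

print(to_sparse_indices(review_terms, vocabulary))  # -> [1, 3, 4, 1]
```

Only the positions of the vocabulary words that occur are stored, not a full vector with one entry per vocabulary word. That is what makes the representation sparse, and an embedding then maps each stored index to a small dense vector.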

In [17]:
import collections
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
#from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)
train_url = 'https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/train.tfrecord'
train_path = tf.keras.utils.get_file(train_url.split('/')[-1], train_url)
test_url = 'https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/test.tfrecord'
test_path = tf.keras.utils.get_file(test_url.split('/')[-1], test_url)

In [18]:
def _parse_function(record):
  """Extracts features and labels.
  
  Args:
    record: A serialized tf.Example record read from a TFRecord file
  Returns:
    A `tuple` `(features, labels)`:
      features: A dict of tensors representing the features
      labels: A tensor with the corresponding labels.
  """
  features = {
    "terms": tf.VarLenFeature(dtype=tf.string), # terms are strings of varying lengths
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32) # labels are 0 or 1
  }
  
  parsed_features = tf.parse_single_example(record, features)
  
  terms = parsed_features['terms'].values
  labels = parsed_features['labels']

  return  {'terms':terms}, labels

In [19]:
# Create the Dataset object
ds = tf.data.TFRecordDataset(train_path)
# Map features and labels with the parse function
ds = ds.map(_parse_function)

ds

<MapDataset shapes: ({terms: (?,)}, (1,)), types: ({terms: tf.string}, tf.float32)>

In [20]:
# Create an input_fn that parses the tf.Examples from the given files,
# and split them into features and targets.
def _input_fn(input_filenames, num_epochs=None, shuffle=True):
  
  # Same code as above; create a dataset and map features and labels
  ds = tf.data.TFRecordDataset(input_filenames)
  ds = ds.map(_parse_function)

  if shuffle:
    ds = ds.shuffle(10000)

  # Our feature data is variable-length, so we pad and batch
  # each field of the dataset structure to whatever size is necessary     
  ds = ds.padded_batch(25, ds.output_shapes)
  
  ds = ds.repeat(num_epochs)

  
  # Return the next batch of data
  features, labels = ds.make_one_shot_iterator().get_next()
  return features, labels

In [21]:
# 54 informative terms that compose our model vocabulary 
informative_terms = ("bad", "great", "best", "worst", "fun", "beautiful",
                     "excellent", "poor", "boring", "awful", "terrible",
                     "definitely", "perfect", "liked", "worse", "waste",
                     "entertaining", "loved", "unfortunately", "amazing",
                     "enjoyed", "favorite", "horrible", "brilliant", "highly",
                     "simple", "annoying", "today", "hilarious", "enjoyable",
                     "dull", "fantastic", "poorly", "fails", "disappointing",
                     "disappointment", "not", "him", "her", "good", "time",
                     "?", ".", "!", "movie", "film", "action", "comedy",
                     "drama", "family", "man", "woman", "boy", "girl")

terms_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(key="terms", vocabulary_list=informative_terms)

In [22]:

terms_embedding_column = tf.feature_column.embedding_column(terms_feature_column, dimension=2)
feature_columns = [ terms_embedding_column ]

my_optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

classifier = tf.estimator.DNNClassifier(
  feature_columns=feature_columns,
  hidden_units=[10,10],
  optimizer=my_optimizer
)


classifier.train(
  input_fn=lambda: _input_fn([train_path]),
  steps=1000)

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn([train_path]),
  steps=1000)
print ("Training set metrics:")
for m in evaluation_metrics:
  print (m, evaluation_metrics[m])
print ("---")

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn([test_path]),
  steps=1000)

print ("Test set metrics:")
for m in evaluation_metrics:
  print (m, evaluation_metrics[m])
print ("---")

Training set metrics:
accuracy 0.78408
accuracy_baseline 0.5
auc 0.86556464
auc_precision_recall 0.8542738
average_loss 0.45815992
label/mean 0.5
loss 11.453998
prediction/mean 0.49827233
global_step 1000
---
Test set metrics:
accuracy 0.77908
accuracy_baseline 0.5
auc 0.8636594
auc_precision_recall 0.85153013
average_loss 0.46100447
label/mean 0.5
loss 11.525111
prediction/mean 0.49753433
global_step 1000
---


The DNNClassifier-based model above reaches around 78% accuracy and trains fast.
LSTM/RNN-based deep learning models are slow to train, and we have seen that for simple text classification or sentiment analysis problems, other ML approaches can give similar results with much shorter training times.

# Hope You liked it. Thanks for reading till the end. :)