In [None]:
#!pip3 install tqdm gensim keras nltk numpy

## Sentiment Analysis on Twitter Data using Word2Vec (gensim) in Keras

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.[[Source: Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis)]

I attempt here to perform sentiment analysis using **Word2Vec** Text Embedding from [**gensim**](https://github.com/RaRe-Technologies/gensim).

The analysis and training is performed on 400,000 Tweets which are either **Positive** or **Negative**

With training on 400,000 Tweets, using word2vec, I was able to achieve an accuracy of approximately **69%**

### Preprocessing Tweets

Dataset is read from .txt file and then shuffled for mainting random distribution.

Labels are then generated from each tweet.

Finally all of the tweets are tokenized (`RegexpTokenizer()`) and then Lemmatized (`WordNetLemmatizer()`) for only storing the root words. 

All the variables or lists are deleted to save memory!

In [1]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import random

random.seed(1000)

lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer('[a-zA-Z0-9]\w+')

pos_tweets = []
neg_tweets = []

with open('pos_1.2M.txt', 'r', buffering=1000) as f:
    pos_tweets = f.readlines()

with open('neg_1.2M.txt', 'r', buffering=1000) as f:
    neg_tweets = f.readlines()

pos_tweets = pos_tweets[:200000]
neg_tweets = neg_tweets[:200000]
  
print('Shuffling ..')
tweets_unclean = list(pos_tweets) + list(neg_tweets)
random.shuffle(tweets_unclean)

print('Generating Labels ..')
labels = []

with tqdm(total=len(tweets_unclean)) as pbar:
    for tweet in tweets_unclean:
        if tweet in pos_tweets:
              labels.append(1)
        else:
              labels.append(0)

        pbar.update(1)
    
del pos_tweets
del neg_tweets

print('Tokenizing ..')
tweets = [tokenizer.tokenize(tweet.lower()) for tweet in tweets_unclean]

print('Done.')

tweets = []

print('Lemmatizing ..')

with tqdm(total=len(tweets_unclean)) as pbar:
    for tweet in tweets_unclean:
        lemmatized = [lemmatizer.lemmatize(word) for word in tweet]
        tweets.append(lemmatized)
        pbar.update(1)

del tweets_unclean

Shuffling ..


  0%|          | 30/400000 [00:00<22:15, 299.55it/s]

Generating Labels ..


100%|██████████| 400000/400000 [22:22<00:00, 297.87it/s]


Tokenizing ..


  0%|          | 0/400000 [00:00<?, ?it/s]

Done.
Lemmatizing ..


100%|██████████| 400000/400000 [02:19<00:00, 2871.33it/s]


### Generating Word2Vec and storing the Model

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. [[Source: Wikipedia](https://en.wikipedia.org/wiki/Word2Vec)]

Docs in Gensim: [models.word2vec](https://radimrehurek.com/gensim/models/word2vec.html)

Word2Vec has 2 important models inside: Skip-Grams and Continous Bag-of-Words(CBOW)

### Skip-Grams:
In Skip-Gram model, we take a centre word and a window of context words  or neighbors within the context window and we try to predict context words for each centre word. The model generates a probability distribution i.e., probability of a word appearing in context given centre word and the task here is to choose the vector representation to maximize the probability.

![Skip-Gram Model](skip-gram-model.png)


![Example](skip-gram-example.png)


### Continous Bag-of-Words (CBOW): 
CBOW is opposite of Skip-Grams. We attempt to predict the centre word from the given context i.e., we try to predict the centre word by summing vectors of surrounding words.

![Continous Bag-of-Words](CBOW-model.png)


In [2]:
vector_size = 256
window = 5

In [3]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

import time

word2vec_model = 'word2vec.model'

print('Generating Word2Vec Vectors ..')

start = time.time()

model = Word2Vec(sentences=tweets, size=vector_size, window=window, negative=20, iter=50, workers=4)

print('Word2Vec Created in {} seconds.'.format(time.time() - start))

model.save(word2vec_model)
print('Word2Vec Model saved at {}'.format(word2vec_model))

# Got to clear the memory!
del model

Generating Word2Vec Vectors ..
Word2Vec Created in 1120.455406665802 seconds.
Word2Vec Model saved at word2vec.model


In [4]:
# Load the saved model!
model = Word2Vec.load(word2vec_model)

In [5]:
x_vectors = model.wv
del model

In [6]:
len(labels), len(tweets)

(400000, 400000)

### Dataset Partition

Spliting the tweets and labels in `(x_train, y_train)` and `(x_test, y_test)` with 90% for training and 10% for testing from all the tweets.

Maximum number of tokens allowed for each tweet is set to be 15.

In [7]:
import numpy as np
import keras.backend as K

train_size = int(0.9*(len(tweets)))
test_size = int(0.1*(len(tweets)))

max_no_tokens = 15

indexes = set(np.random.choice(len(tweets), train_size + test_size, replace=False))

x_train = np.zeros((train_size, max_no_tokens, vector_size), dtype=K.floatx())
y_train = np.zeros((train_size, 2), dtype=np.int32)

x_test = np.zeros((test_size, max_no_tokens, vector_size), dtype=K.floatx())
y_test = np.zeros((test_size, 2), dtype=np.int32)

Using TensorFlow backend.


In [8]:
for i, index in enumerate(indexes):
    for t, token in enumerate(tweets[index]):
        if t >= max_no_tokens:
            break
      
        if token not in x_vectors:
            continue
    
        if i < train_size:
            x_train[i, t, :] = x_vectors[token]
        else:
            x_test[i - train_size, t, :] = x_vectors[token]

  
    if i < train_size:
        y_train[i, :] = [1.0, 0.0] if labels[index] == 0 else [0.0, 1.0]
    else:
        y_test[i - train_size, :] = [1.0, 0.0] if labels[index] == 0 else [0.0, 1.0]
    
del tweets
del labels

In [9]:
x_train.shape, y_test.shape

((360000, 15, 256), (40000, 2))

### Building the Neural Model

For training a combination of Convolution Neural Network and Bidirectional Long Short Term Memory Network is used (CNN-LSTM).

Batch Size is 100.


To prevent overfitting or over training of the network, `EarlyStopping()` is used in `callbacks` thus if the network does not improve or starts overfitting, the training comes to an end.

**Acrhitecture of Network:**

===============================================================================

Conv1D -> Conv1D -> Conv1D -> Max Pooling1D -> Bidirectional LSTM -> Dense -> Dropout -> Dense -> Dropout -> Dense -> Dropout -> Output

===============================================================================

Total params: 3,314,274

Trainable params: 3,314,274

Non-trainable params: 0

In [10]:
batch_size = 500
no_epochs = 100

In [11]:
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, Dense, Flatten, LSTM, MaxPooling1D, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, TensorBoard


model = Sequential()

model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same',
                 input_shape=(max_no_tokens, vector_size)))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=3))

model.add(Bidirectional(LSTM(512, dropout=0.2, recurrent_dropout=0.3)))

model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))

model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

tensorboard = TensorBoard(log_dir='logs/', histogram_freq=0, write_graph=True, write_images=True)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 15, 32)            24608     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 15, 32)            3104      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 15, 32)            3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 32)             0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 1024)              2232320   
_________________________________________________________________
dense_1 (Dense)              (None, 512)               524800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
__________

### Training

In [12]:
model.fit(x_train, y_train, batch_size=batch_size, shuffle=True, epochs=no_epochs,
         validation_data=(x_test, y_test), callbacks=[tensorboard, EarlyStopping(min_delta=0.0001, patience=3)])

Train on 360000 samples, validate on 40000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100




<keras.callbacks.History at 0x7fe0a4ca9860>

### Evaluating the Model

In [13]:
model.metrics_names

['loss', 'acc']

In [14]:
model.evaluate(x=x_test, y=y_test, batch_size=32, verbose=1)



[0.5668408149003983, 0.694675]

### Saving the Model

In [15]:
model.save('twitter-sentiment-word2vec-400k.model')