In [None]:
#!pip3 install tqdm gensim keras nltk numpy

## Sentiment Analysis on Twitter Data using FastText (gensim) in Keras

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.[[Source: Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis)]

I attempt here to perform sentiment analysis using **FastText** Text Embedding from [**gensim**](https://github.com/RaRe-Technologies/gensim).

The analysis and training are performed on 400,000 tweets, each labeled either **Positive** or **Negative**.

Trained on these 400,000 tweets with fastText embeddings, the model reaches an accuracy of approximately **69%** on the held-out test split.

### Preprocessing Tweets

The dataset is read from `.txt` files and then shuffled to maintain a random distribution.

A label is then generated for each tweet (1 for positive, 0 for negative).

Finally, all of the tweets are tokenized (`RegexpTokenizer()`) and then lemmatized (`WordNetLemmatizer()`) so that only the root words are stored.

Intermediate lists are deleted as soon as they are no longer needed, to save memory.
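
To make the tokenization step concrete, here is a minimal sketch that applies the same pattern passed to `RegexpTokenizer()` using Python's standard `re` module. It assumes `re.findall` is a fair stand-in for the tokenizer's default matching mode. The sample tweet is made up for illustration, and lemmatization is left out because it needs the WordNet corpus.

```python
import re

# Same pattern the notebook passes to RegexpTokenizer: one alphanumeric
# character followed by one or more word characters (so at least 2 chars).
TOKEN_PATTERN = re.compile(r'[a-zA-Z0-9]\w+')


def tokenize(tweet):
    """Lower-case a tweet and return the substrings matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(tweet.lower())


# Hypothetical sample tweet, not taken from the dataset.
sample = "I LOVE this!! @some_user #happy :)"
tokens = tokenize(sample)
print(tokens)  # ['love', 'this', 'some_user', 'happy']
```

Note that single-character tokens such as "I", punctuation, and emoticons are dropped by this pattern, which may or may not be desirable for sentiment analysis.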

In [0]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import random

random.seed(1000)

lemmatizer = WordNetLemmatizer()
# Raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]\w+')

pos_tweets = []
neg_tweets = []

with open('pos_1.2M.txt', 'r', buffering=1000) as f:
    pos_tweets = f.readlines()

with open('neg_1.2M.txt', 'r', buffering=1000) as f:
    neg_tweets = f.readlines()

pos_tweets = pos_tweets[:200000]
neg_tweets = neg_tweets[:200000]
  
print('Shuffling ..')
tweets_unclean = list(pos_tweets) + list(neg_tweets)
random.shuffle(tweets_unclean)

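print('Generating Labels ..')
labels = []

# Set membership is O(1); testing against the list scans up to 200,000 items per tweet
pos_set = set(pos_tweets)

with tqdm(total=len(tweets_unclean)) as pbar:
    for tweet in tweets_unclean:
        if tweet in pos_set:
            labels.append(1)
        else:
            labels.append(0)

        pbar.update(1)

del pos_tweets
del neg_tweets
del pos_set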

print('Tokenizing ..')
tweets = [tokenizer.tokenize(tweet.lower()) for tweet in tweets_unclean]

print('Done.')

tweets = []

print('Lemmatizing ..')

with tqdm(total=len(tweets_unclean)) as pbar:
    for tweet in tweets_unclean:
        lemmatized = [lemmatizer.lemmatize(word) for word in tweet]
        tweets.append(lemmatized)
        pbar.update(1)

del tweets_unclean

Shuffling ..


Generating Labels ..


100%|██████████| 400000/400000 [13:56<00:00, 478.22it/s]


Tokenizing ..


Done.
Lemmatizing ..


100%|██████████| 400000/400000 [02:19<00:00, 2876.50it/s]


### Generating FastText and storing the Model
fastText is a library for learning word embeddings and for text classification, created by Facebook's AI Research (FAIR) lab. The model is an unsupervised learning algorithm for obtaining vector representations of words. Facebook makes pretrained models available for 294 languages. fastText uses a neural network to learn the word embeddings. [[Source: Wikipedia](https://en.wikipedia.org/wiki/FastText)]

Docs on Gensim: [models.fasttext](https://radimrehurek.com/gensim/models/fasttext.html)

FastText is an extension of Word2Vec proposed by Facebook in 2016. Instead of feeding individual words into the neural network, FastText breaks words into several character n-grams (sub-words). For instance, the tri-grams of the word *apple* are *app*, *ppl*, and *ple* (ignoring the boundary markers at the start and end of the word). The word embedding vector for *apple* is the sum of the vectors of all these n-grams. After training the neural network, we have embeddings for all the n-grams that occur in the training dataset. Rare words can then be represented reasonably well, since it is highly likely that some of their n-grams also appear in other words. The following section shows how to use FastText with Gensim.
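
The n-gram idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how gensim implements it: the helper names `char_ngrams` and `word_vector` and the toy 2-dimensional vectors are invented here. As far as I know, the real library also uses a range of n-gram lengths and hashes the n-grams into buckets.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of `word`.

    With boundaries=True the word is first wrapped in '<' and '>',
    the markers fastText uses for the start and end of a word.
    """
    if boundaries:
        word = '<' + word + '>'
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_vector(word, ngram_vectors, dim, n=3):
    """Sum the vectors of those n-grams of `word` present in `ngram_vectors`."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


print(char_ngrams('apple'))                   # ['app', 'ppl', 'ple']
print(char_ngrams('apple', boundaries=True))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# Toy 2-dimensional n-gram vectors, invented for illustration only.
toy_vectors = {'app': [1.0, 0.0], 'ppl': [0.0, 2.0], 'ple': [0.5, 0.5]}
print(word_vector('apple', toy_vectors, dim=2))  # [1.5, 2.5]

# 'apply' was never "seen", but it shares 'app' and 'ppl' with 'apple'.
print(word_vector('apply', toy_vectors, dim=2))  # [1.0, 2.0]
```

The last call shows why rare or unseen words still get a usable vector: they borrow the n-grams they share with known words.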

![FastText Example](fasttext-example.png)


In [1]:
vector_size = 256
window = 5

In [2]:
from gensim.models import FastText

import time

fasttext_model = 'fasttext.model'

print('Generating FastText Vectors ..')

start = time.time()

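# window, min_count and workers are set on the constructor; passing them to train() has no effect.
# Note: gensim 4.x renames `size` to `vector_size`.
model = FastText(size=vector_size, window=window, min_count=1, workers=4)
model.build_vocab(tweets)
model.train(tweets, total_examples=model.corpus_count, epochs=model.epochs)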

print('FastText Created in {} seconds.'.format(time.time() - start))

model.save(fasttext_model)
print('FastText Model saved at {}'.format(fasttext_model))

del model

Generating FastText Vectors ..
FastText Created in 138.93589448928833 seconds.
FastText Model saved at fasttext.model


In [3]:
model = FastText.load(fasttext_model)

In [4]:
x_vectors = model.wv
del model

In [5]:
len(labels), len(tweets)

(400000, 400000)

### Dataset Partition

The tweets and labels are split into `(x_train, y_train)` and `(x_test, y_test)`, with 90% of all tweets used for training and 10% for testing.

The maximum number of tokens per tweet is set to 15; longer tweets are truncated and shorter ones are zero-padded.
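
As a rough sketch of what the split, padding, and label encoding amount to, the snippet below uses plain Python lists in place of the NumPy arrays used in this notebook. The helper names (`split_sizes`, `pad_tweet`, `one_hot`) and the tiny dimensions are made up for illustration.

```python
def split_sizes(n_samples, train_fraction=0.9):
    """Return (train_size, test_size) for a simple holdout split."""
    train = int(train_fraction * n_samples)
    return train, n_samples - train


def pad_tweet(token_vectors, max_no_tokens, vector_size):
    """Truncate or zero-pad a list of token vectors to exactly max_no_tokens rows."""
    padded = [[0.0] * vector_size for _ in range(max_no_tokens)]
    for t, vec in enumerate(token_vectors[:max_no_tokens]):
        padded[t] = list(vec)
    return padded


def one_hot(label):
    """Encode label 0 (negative) as [1, 0] and label 1 (positive) as [0, 1]."""
    return [1.0, 0.0] if label == 0 else [0.0, 1.0]


print(split_sizes(400000))  # (360000, 40000)

# Tiny dimensions for readability; the notebook uses 15 tokens x 256 features.
max_no_tokens, vector_size = 4, 2
short = [[1.0, 1.0], [2.0, 2.0]]
long = [[float(i), float(i)] for i in range(6)]

padded_short = pad_tweet(short, max_no_tokens, vector_size)
padded_long = pad_tweet(long, max_no_tokens, vector_size)
print(padded_short)  # two real rows followed by two zero rows
print(padded_long)   # only the first four rows are kept
print(one_hot(0), one_hot(1))
```

Every tweet ends up with the same shape, which is what allows all of them to be stacked into a single `(samples, tokens, features)` array.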

In [6]:
import numpy as np
import keras.backend as K

train_size = int(0.9*(len(tweets)))
test_size = int(0.1*(len(tweets)))

max_no_tokens = 15

# Random permutation of the tweet indices (kept as an array so the random order is preserved)
indexes = np.random.choice(len(tweets), train_size + test_size, replace=False)

x_train = np.zeros((train_size, max_no_tokens, vector_size), dtype=K.floatx())
y_train = np.zeros((train_size, 2), dtype=np.int32)

x_test = np.zeros((test_size, max_no_tokens, vector_size), dtype=K.floatx())
y_test = np.zeros((test_size, 2), dtype=np.int32)

Using TensorFlow backend.


In [7]:
for i, index in enumerate(indexes):
    for t, token in enumerate(tweets[index]):
        if t >= max_no_tokens:
            break
      
        if token not in x_vectors:
            continue
    
        if i < train_size:
            x_train[i, t, :] = x_vectors[token]
        else:
            x_test[i - train_size, t, :] = x_vectors[token]

  
    if i < train_size:
        y_train[i, :] = [1.0, 0.0] if labels[index] == 0 else [0.0, 1.0]
    else:
        y_test[i - train_size, :] = [1.0, 0.0] if labels[index] == 0 else [0.0, 1.0]
    
del tweets
del labels

In [8]:
x_train.shape, y_test.shape

((360000, 15, 256), (40000, 2))

### Building the Neural Model

For training, a combination of a Convolutional Neural Network and a Bidirectional Long Short-Term Memory network (CNN-LSTM) is used.

The batch size is 500.


To prevent overfitting, `EarlyStopping()` is passed in `callbacks`, so training ends once the validation loss stops improving.
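
The stopping rule can be sketched in plain Python. This is a simplified model of what I understand `EarlyStopping` to do when it monitors validation loss with `min_delta` and `patience`, not the Keras implementation itself. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=0.0001, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops below the
    best loss so far by more than min_delta. Training stops after
    `patience` consecutive epochs without such an improvement.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, invented for illustration.
losses = [0.70, 0.65, 0.62, 0.62, 0.63, 0.62, 0.61]
stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)  # 6: epochs 4, 5 and 6 fail to beat the best loss of 0.62
```

With `patience=3`, a single noisy epoch does not end training; it takes three non-improving epochs in a row.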

**Architecture of Network:**

===============================================================================

Conv1D -> Conv1D -> Conv1D -> Max Pooling1D -> Bidirectional LSTM -> Dense -> Dropout -> Dense -> Dropout -> Dense -> Dropout -> Output

===============================================================================

Total params: 3,314,274

Trainable params: 3,314,274

Non-trainable params: 0
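
The parameter total can be checked by hand. The sketch below recomputes it from the layer sizes used in this notebook, applying the usual parameter-count formulas for Conv1D, LSTM, and Dense layers. The helper functions and the dictionary keys are my own labels for illustration.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features


vector_size = 256  # embedding size fed into the first convolution

layer_params = {
    'conv1d_1': conv1d_params(3, vector_size, 32),
    'conv1d_2': conv1d_params(3, 32, 32),
    'conv1d_3': conv1d_params(3, 32, 32),
    # Bidirectional wraps two LSTMs (forward and backward), hence the factor 2.
    'bidirectional_lstm': 2 * lstm_params(32, 512),
    # The two directions are concatenated: 2 * 512 = 1024 inputs.
    'dense_1': dense_params(1024, 512),
    'dense_2': dense_params(512, 512),
    'dense_3': dense_params(512, 512),
    'output': dense_params(512, 2),
}

for name, count in layer_params.items():
    print('{:<20}{:>12,}'.format(name, count))

total = sum(layer_params.values())
print('Total params: {:,}'.format(total))  # Total params: 3,314,274
```

Pooling and dropout layers have no trainable parameters, so they are omitted. About two thirds of the weights sit in the bidirectional LSTM.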

In [9]:
batch_size = 500
no_epochs = 100

In [10]:
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, Dense, Flatten, LSTM, MaxPooling1D, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, TensorBoard


model = Sequential()

model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same',
                 input_shape=(max_no_tokens, vector_size)))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=3))

model.add(Bidirectional(LSTM(512, dropout=0.2, recurrent_dropout=0.3)))

model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))

model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

tensorboard = TensorBoard(log_dir='logs/', histogram_freq=0, write_graph=True, write_images=True)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 15, 32)            24608     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 15, 32)            3104      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 15, 32)            3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 32)             0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 1024)              2232320   
_________________________________________________________________
dense_1 (Dense)              (None, 512)               524800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
__________

### Training

In [11]:
model.fit(x_train, y_train, batch_size=batch_size, shuffle=True, epochs=no_epochs,
         validation_data=(x_test, y_test), callbacks=[tensorboard, EarlyStopping(min_delta=0.0001, patience=3)])

Train on 360000 samples, validate on 40000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100


<keras.callbacks.History at 0x7f2529da1710>

### Evaluating the Model

In [12]:
model.metrics_names

['loss', 'acc']

In [13]:
model.evaluate(x=x_test, y=y_test, batch_size=32, verbose=1)



[0.5668948031425476, 0.69585]

### Saving the Model

In [14]:
model.save('twitter-sentiment-fasttext.model')