In [0]:
#!pip3 install tensorflow-hub tensorflow numpy pickle tqdm keras

## Sentiment Analysis on Twitter Data using Universal Sentence Encoder (TensorFlow) in Keras

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.[[Source: Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis)]

I attempt here to perform sentiment analysis using **Universal Sentence Encoder** Text Embedding from [**TensorFlow**](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1).

The analysis and training is performed on 400,000 Tweets which are either **Positive** or **Negative**

With training on 400,000 Tweets, using Universal Sentence Encoder, I was able to achieve an accuracy of approximately **77%**

### Preprocessing Tweets

Dataset is read from .txt file and then shuffled for mainting random distribution.

Labels are then generated from each tweet.


All the variables or lists are deleted to save memory!

In [1]:
from tqdm import tqdm
import random

pos_tweets = []
neg_tweets = []

with open('pos_1.2M.txt', 'r', buffering=1000) as f:
  ptweets = f.readlines()

with open('neg_1.2M.txt', 'r', buffering=1000) as f:
  ntweets = f.readlines()

pos_tweets = ptweets[:200000]
neg_tweets = ntweets[:200000]

print('\nShuffling ..')
tweets_unclean = list(pos_tweets) + list(neg_tweets)
random.shuffle(tweets_unclean)

print('\nGenerating Labels ..')
with tqdm(total=len(tweets_unclean)) as pbar:
  labels = []
  for tweet in tweets_unclean:
    if tweet in pos_tweets:
      labels.append(1)
    else:
      labels.append(0)
    
    pbar.update(1)
  
del pos_tweets
del neg_tweets


Shuffling ..


  0%|          | 25/400000 [00:00<28:45, 231.80it/s]


Generating Labels ..


100%|██████████| 400000/400000 [25:19<00:00, 263.30it/s]


In [2]:
tweets = tweets_unclean

### Generating Embedding

Released in 2018, The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a **512 dimensional vector**. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

Source: [TensorFlow/Hub/Universal-Sentence-Encoder](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/2)

Embedding are created from the collected 400,000 Tweets using the pretrained model and reshaped to make them ready to train.

In [3]:
embed = hub.Module(module_name)

tf.logging.set_verbosity(tf.logging.ERROR)

with tf.Session() as sess:
  sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
  tweet_embeddings = sess.run(embed(tweets))

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/2'.
INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/2'.
INFO:tensorflow:Initialize variable module/Embeddings_en/sharded_0:0 from checkpoint b'/tmp/tfhub_modules/1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47/variables/variables' with Embeddings_en/sharded_0
INFO:tensorflow:Initialize variable module/Embeddings_en/sharded_1:0 from checkpoint b'/tmp/tfhub_modules/1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47/variables/variables' with Embeddings_en/sharded_1
INFO:tensorflow:Initialize variable module/Embeddings_en/sharded_10:0 from checkpoint b'/tmp/tfhub_modules/1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47/variables/variables' with Embeddings_en/sharded_10
INFO:tensorflow:Initialize variable module/Embeddings_en/sharded_11:0 from checkpoint b'/tmp/tfhub_modules/1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47/var

In [4]:
tweet_embeddings = np.array(tweet_embeddings)
tweet_embeddings.shape

(400000, 512)

In [5]:
tweet_embeddings = np.array([np.reshape(embed, (len(embed), 1)) for embed in tweet_embeddings])

In [6]:
tweet_embeddings.shape

(400000, 512, 1)

### One hot Labels

One-hot labels for the tweet dataset are generated.

In [7]:
from tqdm import tqdm

labels_one_hot = []

with tqdm(total=len(labels)) as pbar:
  for label in labels:
    if label == 0:
      labels_one_hot.append([1., 0.])
    else:
      labels_one_hot.append([0., 1.])
      
    pbar.update(1)

100%|██████████| 400000/400000 [00:01<00:00, 345964.21it/s]


In [8]:
labels_one_hot = np.array(labels_one_hot)

### Pickling all the data

The Training data and Labels are pickled for reusability.

In [9]:
import pickle

embeddings_file = "embeddings-{}.pickle".format(len(tweet_embeddings))
labels_file = "labels-{}.pickle".format(len(labels))

pickle.dump(tweet_embeddings, open(embeddings_file, 'wb'))
pickle.dump(labels_one_hot, open(labels_file, 'wb'))

In [10]:
labels_one_hot.shape

(400000, 2)

### Loading the Data

In [11]:
import pickle

tweet_embeddings = pickle.load(open('tweets_embeddings.pickle', 'rb'))
labels = pickle.load(open('labels-one_hot.pickle', 'rb'))

### Dataset Partition

Spliting the tweets and labels in `(x_train, y_train)` and `(x_test, y_test)` with 90% for training and 10% for testing from all the tweets.

Maximum number of tokens allowed for each tweet is set to be 15.

In [12]:
import numpy as np

In [13]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(tweet_embeddings, labels, test_size=.1)

In [14]:
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

x_train.shape, y_train.shape, x_test.shape, y_test.shape

((360000, 512, 1), (360000, 2), (40000, 512, 1), (40000, 2))

In [15]:
vector_size = 512
batch_size = 500
no_epochs = 10

### Retraining the model if already trained

In [16]:
'''from keras.models import load_model

model = load_model('universal-sentence-encoder-400k.model')'''

Using TensorFlow backend.




### Building the Neural Model

For training a combination of Convolution Neural Network and Bidirectional Long Short Term Memory Network is used (CNN-LSTM).

Batch Size is 100.


To prevent overfitting or over training of the network, `EarlyStopping()` is used in `callbacks` thus if the network does not improve or starts overfitting, the training comes to an end.

**Acrhitecture of Network:**

===============================================================================

Conv1D -> Conv1D -> Conv1D -> Max Pooling1D -> Bidirectional LSTM -> Dense -> Dropout -> Dense -> Dropout -> Dense -> Dropout -> Output

===============================================================================

Total params: 3,289,794

Trainable params: 3,289,794

Non-trainable params: 0

In [17]:
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, Dense, Flatten, LSTM, MaxPooling1D, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, TensorBoard

model = Sequential()

model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same',
                 input_shape=(vector_size, 1)))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=3))

model.add(Bidirectional(LSTM(512, dropout=0.2, recurrent_dropout=0.3)))

model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))

model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001, decay=1e-6), metrics=['accuracy'])

tensorboard = TensorBoard(log_dir='logs/', histogram_freq=0, write_graph=True, write_images=True)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_7 (Conv1D)            (None, 512, 32)           128       
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 512, 32)           3104      
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 512, 32)           3104      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 170, 32)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 1024)              2232320   
_________________________________________________________________
dense_9 (Dense)              (None, 512)               524800    
_________________________________________________________________
dropout_7 (Dropout)          (None, 512)               0         
__________

### Training

In [18]:
model.fit(np.array(x_train), np.array(y_train), batch_size=batch_size, epochs=no_epochs,
         validation_data=(np.array(x_test), np.array(y_test)), callbacks=[tensorboard, EarlyStopping(min_delta=0.0001, patience=3)])

Train on 360000 samples, validate on 40000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f193166e9b0>

### Evaluating the Accuracy

In [19]:
model.metrics_names

['loss', 'acc']

In [20]:
model.evaluate(x=x_test, y=y_test, batch_size=500, verbose=1)



[0.48201919458806514, 0.7677500016987324]

### Saving the trained Model

In [21]:
model.save('universal-sentence-encoder-400k.model')