# Sentiment Analysis on COVID19 Tweets

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("./covid19_tweets.csv")

## Exploring Tweet Data

* Sentiment Analysis on Covid19 Tweets
    * Exploring tweet data
    * Encoding tweets
    * Encoding sentiments
    * Detecting outlier reviews
    * Training, testing and validating
    * Dataloaders and batching
    * Sentiment network with PyTorch
    * Instantiate the netork
    * Calculating model's accuracy
    * Testing model on a random covid19 tweet.

In [35]:
sentiment_df = pd.read_csv('/kaggle/input/twitterdata/finalSentimentdata2.csv')

In [36]:
sentiment_df.head()

Unnamed: 0.1,Unnamed: 0,sentiment,text
0,3204,sad,agree the poor in india are treated badly thei...
1,1431,joy,if only i could have spent the with this cutie...
2,654,joy,will nature conservation remain a priority in ...
3,2530,sad,coronavirus disappearing in italy show this to...
4,2296,sad,uk records lowest daily virus death toll since...


In [37]:
sentiment_df.columns

Index(['Unnamed: 0', 'sentiment', 'text'], dtype='object')

In [38]:
sentiment_df['sentiment'].nunique

<bound method IndexOpsMixin.nunique of 0         sad
1         joy
2         joy
3         sad
4         sad
        ...  
3085      sad
3086    anger
3087      joy
3088      sad
3089      sad
Name: sentiment, Length: 3090, dtype: object>

In [39]:
sentiment_df.loc[:, 'text'] = sentiment_df['text'].apply(punctuation_stopwords_removal)

In [40]:
reviews_split = []
for i, j in sentiment_df.iterrows():
    reviews_split.append(j['text'])


In [41]:
words = []
for review in reviews_split:
    for word in review:
        words.append(word)


In [42]:
print(words[:20])

['agree', 'poor', 'india', 'treated', 'badly', 'poors', 'seek', 'living', 'singapore', 'treated', 'like', 'citizens', 'given', 'free', 'medical', 'treatment', 'given', 'food', 'daily', 'sim']


## Encoding Tweets
Create an array that contains integer encoded version of words in reviews. The word appearing the most should have least integer value. Example if the appeared the most in reviews, then assign 'the' : 1

In [43]:
from collections import Counter

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word:ii for ii, word in enumerate(vocab, 1)}

In [44]:
encoded_reviews = []
for review in reviews_split:
    encoded_reviews.append([vocab_to_int[word] for word in review])


In [45]:
print(len(vocab_to_int))
print(encoded_reviews[:10])

10662
[[853, 186, 20, 1079, 1457, 4429, 2201, 407, 1240, 1079, 15, 218, 337, 167, 253, 462, 337, 122, 168, 4430, 4431, 140, 23, 264, 58, 765, 3, 5, 195, 1079, 2966, 274], [80, 1755, 4432, 2967, 4433, 86, 854, 1080, 2968, 4434, 4435, 7], [543, 4436, 946, 1458, 265, 2, 1241, 168, 2202], [1, 4437, 169, 266, 1459, 129, 1242, 47, 7], [304, 1756, 4438, 168, 8, 39, 219, 93, 355, 4, 21], [2203, 2204, 1, 1757, 1243, 2969, 1460, 1081, 1461, 4439, 98, 4440], [947, 37, 1758, 285, 948, 4441, 1462, 2205, 13, 3, 5, 4442, 285, 1244, 4443, 1463, 4444, 4445, 286, 4446, 15, 4447, 47, 228, 2970, 338, 40, 312, 1463, 179, 1759], [41, 377, 149, 1245, 4448, 34], [2206, 649, 180, 1760, 2207, 91, 650, 378, 463, 1246, 595, 1464, 2208, 2, 8, 2209, 651, 40, 379, 2210, 21, 228, 703, 1246, 1761, 408], [2206, 649, 180, 1760, 2207, 91, 650, 378, 463, 1246, 595, 1464, 2208, 2, 8, 2209, 651, 40, 379, 2210, 21, 228, 703, 1246, 2971]]


## Encoding Sentiments

For simplicity purposes, I am encoding positive sentiment such as joy as 1 and rest (anger, sad) as 0

In [46]:
labels_to_int = []
for i, j in sentiment_df.iterrows():
    if j['sentiment']=='joy':
        labels_to_int.append(1)
    else:
        labels_to_int.append(0)
    

## Detecting any outlier reviews

This step involves -<br>
1. Getting rid of extremely long/short reviews
2. Padding/truncating reaining data to maintain constant review length.

In [47]:
reviews_len = Counter([len(x) for x in encoded_reviews])
print(max(reviews_len))

48


In [48]:
print(len(encoded_reviews))

3090


In [49]:
non_zero_idx = [ii for ii, review in enumerate(encoded_reviews) if len(encoded_reviews)!=0]
encoded_reviews = [encoded_reviews[ii] for ii in non_zero_idx]
encoded_labels = np.array([labels_to_int[ii] for ii in non_zero_idx])

In [50]:
print(len(encoded_reviews))
print(len(encoded_labels))

3090
3090


In [51]:
def pad_features(reviews_int, seq_length):
    features = np.zeros((len(reviews_int), seq_length), dtype=int)
    for i, row in enumerate(reviews_int):
        if len(row)!=0:
            features[i, -len(row):] = np.array(row)[:seq_length]
    return features

In [52]:
seq_length = 50
padded_features= pad_features(encoded_reviews, seq_length)
print(padded_features[:2])


[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0  853  186   20 1079 1457 4429 2201  407 1240 1079
    15  218  337  167  253  462  337  122  168 4430 4431  140   23  264
    58  765    3    5  195 1079 2966  274]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0   80 1755 4432 2967
  4433   86  854 1080 2968 4434 4435    7]]


## Training, Testing and Validating

In [53]:
split_frac = 0.8
split_idx = int(len(padded_features)*split_frac)

training_x, remaining_x = padded_features[:split_idx], padded_features[split_idx:]
training_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]


## Dataloaders and Batching

A neat way to create data-loaders and batch our training, validation and test Tensor datasets is as follows -<br>
```python
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```
This is an alternative to creating a generator function for batching our data into full batches.

In [54]:
import torch
from torch.utils.data import TensorDataset, DataLoader

In [55]:
# torch.from_numpy creates a tensor data from n-d array
train_data = TensorDataset(torch.from_numpy(training_x), torch.from_numpy(training_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))

batch_size = 1

train_loader = DataLoader(train_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)
valid_loader = DataLoader(valid_data, batch_size=batch_size)

In [56]:
gpu_available = torch.cuda.is_available

if gpu_available:
    print('Training on GPU')
else:
    print('GPU not available')

Training on GPU


## Sentiment Network with PyTorch
Below are the various layers of our RNN that would perform sentiment analysis -<br>
1. An *embedding layer* that converts our word tokens (integers) into embeddings of a specific size.
2. A *LSTM layer* defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return only the last sigmoid output as the output of this network."

In [57]:
import torch.nn as nn

class CovidTweetSentimentAnalysis(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.2):
        super(CovidTweetSentimentAnalysis, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
    
    def forward(self, x, hidden):
        # x : batch_size * seq_length * features
        batch_size = x.size(0)
        x = x.long()
        embeds = self.embedding_layer(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(lstm_out)
        out = self.fc(out)
        sig_out = self.sig(out)
        
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]
        
        return sig_out, hidden
    
    def init_hidden(self, batch_size):
        # initialize weights for lstm layer
        weights = next(self.parameters()).data
        
        if gpu_available:
            hidden = (weights.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                     weights.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weights.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                     weights.new(self.n_layers, batch_size, self.hidden_dim).zero())
        return hidden

## Instantiate the network
Here, I will define the model hyper-parameters -<br>

1. `vocab_size` : Size of our vocabulary or the range of values for our input, word tokens.
2. `output_size` : Size of our desired output; the number of class scores we want to output (pos/neg).
3. `embedding_dim` : Number of columns in the embedding lookup table; size of our embeddings.
4. `hidden_dim` : Number of units in the hidden layers of our LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
5. `n_layers`: Number of LSTM layers in the network. Typically between 1-3

In [58]:
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1 # either happy or sad
embedding_dim = 400
hidden_dim = 256
n_layers = 2

In [59]:
net = CovidTweetSentimentAnalysis(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)

CovidTweetSentimentAnalysis(
  (embedding_layer): Embedding(10663, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.2)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


In [60]:
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [61]:
epochs = 4
count = 0
print_every = 100
clip = 5 
if gpu_available:
    net.cuda()

net.train()
for e in range(epochs):
    # initialize lstm's hidden layer 
    h = net.init_hidden(batch_size)
    for inputs, labels in train_loader:
        count += 1
        if gpu_available:
            inputs, labels = inputs.cuda(), labels.cuda()
        h = tuple([each.data for each in h])
        
        # training process
        net.zero_grad()
        outputs, h = net(inputs, h)
        loss = criterion(outputs.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm(net.parameters(), clip)
        optimizer.step()
        
        # print average training losses
        if count % print_every == 0:
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                val_h = tuple([each.data for each in val_h])
                if gpu_available:
                    inputs, labels = inputs.cuda(), labels.cuda()
            outputs, val_h = net(inputs, val_h)
            val_loss = criterion(outputs.squeeze(), labels.float())
            val_losses.append(val_loss.item())
        
            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(count),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

  return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)


Epoch: 1/4... Step: 100... Loss: 0.136036... Val Loss: 0.139115
Epoch: 1/4... Step: 200... Loss: 0.057192... Val Loss: 0.103990
Epoch: 1/4... Step: 300... Loss: 0.192796... Val Loss: 0.082102
Epoch: 1/4... Step: 400... Loss: 0.068804... Val Loss: 0.805758
Epoch: 1/4... Step: 500... Loss: 0.662398... Val Loss: 0.666439
Epoch: 1/4... Step: 600... Loss: 0.039760... Val Loss: 0.278505
Epoch: 1/4... Step: 700... Loss: 1.115070... Val Loss: 0.351145
Epoch: 1/4... Step: 800... Loss: 0.792689... Val Loss: 0.476311
Epoch: 1/4... Step: 900... Loss: 0.744873... Val Loss: 0.747945
Epoch: 1/4... Step: 1000... Loss: 0.023593... Val Loss: 0.638791
Epoch: 1/4... Step: 1100... Loss: 0.093999... Val Loss: 0.229394
Epoch: 1/4... Step: 1200... Loss: 1.009879... Val Loss: 0.599059
Epoch: 1/4... Step: 1300... Loss: 0.033126... Val Loss: 0.420905
Epoch: 1/4... Step: 1400... Loss: 0.052237... Val Loss: 0.542419
Epoch: 1/4... Step: 1500... Loss: 1.461284... Val Loss: 0.670432
Epoch: 1/4... Step: 1600... Loss: 

## Calculating model's accuracy

The `CovidTweetSentimentAnalysis` model achieved accuracy of 85.4 %

In [62]:
test_losses = []
num_correct = 0

h = net.init_hidden(batch_size)
net.eval()

for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    if gpu_available:
        inputs, labels = inputs.cuda(), labels.cuda()
    
    outputs, h = net(inputs, h)
    test_loss = criterion(outputs.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    pred = torch.round(outputs.squeeze())
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not gpu_available else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

# printing average statistics
print("Test loss: {:.3f}".format(np.mean(test_losses)))
    
# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.594
Test accuracy: 0.890


## Testing model on random tweet

Since for performing sentiment analysis on covid 19 tweets, I on-boarded a completely different dataset in this notebook. Now that the our model is trained,we can use this model to perform sentiment analysis on tweets related to covid19 on this notebook.

In [63]:
from string import punctuation

def tokenize_covid_tweet(tweet):
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in tweet])
    return test_ints

In [64]:
def predict_covid_sentiment(net, test_tweet, seq_length=50):
    print('Original Sentence :')
    print(test_tweet)
    
    print('\nAfter removing punctuations and stop-words :')
    test_tweet = punctuation_stopwords_removal(test_tweet)
    print(test_tweet)
    
    print('\nAfter converting pre-processed tweet to tokens :')
    tokenized_tweet = tokenize_covid_tweet(test_tweet)
    print(tokenized_tweet)
    
    print('\nAfter padding the tokens into fixed sequence lengths :')
    padded_tweet = pad_features(tokenized_tweet, 50)
    print(padded_tweet)
    
    feature_tensor = torch.from_numpy(padded_tweet)
    batch_size = feature_tensor.size(0)
    
    if gpu_available:
        feature_tensor = feature_tensor.cuda()
    
    h = net.init_hidden(batch_size)
    output, h = net(feature_tensor, h)
    
    predicted_sentiment = torch.round(output.squeeze())
    print('\n==========Predicted Sentiment==========\n')
    if predicted_sentiment == 1:
        print('Happy')
    else:
        print('Sad')
    print('\n==========Predicted Sentiment==========\n')


In [65]:
test_sad_tweet = 'It is very sad to see the corona pandemic increasing at such an alarming rate'
predict_covid_sentiment(net, test_sad_tweet)

Original Sentence :
It is very sad to see the corona pandemic increasing at such an alarming rate

After removing punctuations and stop-words :
['sad', 'see', 'corona', 'pandemic', 'increasing', 'alarming', 'rate']

After converting pre-processed tweet to tokens :
[[328, 63, 2, 28, 1964, 6137, 267]]

After padding the tokens into fixed sequence lengths :
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0  328   63    2   28 1964 6137  267]]


Sad




In [66]:
test_happy_tweet = 'It is amazing to see that New Zealand reaches 100 days without Covid transmission!'
predict_covid_sentiment(net, test_happy_tweet)

Original Sentence :
It is amazing to see that New Zealand reaches 100 days without Covid transmission!

After removing punctuations and stop-words :
['amazing', 'see', 'new', 'zealand', 'reaches', '100', 'days', 'without', 'covid', 'transmission']

After converting pre-processed tweet to tokens :
[[642, 63, 30, 9453, 8081, 225, 35, 125, 3, 1326]]

After padding the tokens into fixed sequence lengths :
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0  642   63
    30 9453 8081  225   35  125    3 1326]]


Happy


