## Embeddings

We intend to experiment with sentiment analysis (sentence classification) using different word
embeddings on the Sentiment140 dataset of labeled tweets. More specifically,
we intend to train our own word embeddings using word2vec and then use them in a deep
learning sentiment analysis model (TBD: LSTM/GRU/Transformer) that we will build in
PyTorch. The main experiments we will perform are:
- Use PyTorch’s trainable embeddings with random initialization
- Use PyTorch’s trainable embeddings initialized from our trained word2vec vectors
- Use only our trained word2vec embeddings as inputs

We will then compare the accuracy of these setups on the sentiment analysis task.
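The three setups listed above could be wired up in PyTorch roughly as follows. This is a minimal sketch, not the project's final code: `w2v_weights` is a hypothetical stand-in (random numbers here) for a word2vec matrix already aligned to the vocabulary, the sizes are illustrative, and reading the third setup as a frozen embedding layer is an assumption.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBEDDING_DIM = 1000, 100  # illustrative sizes only

# Hypothetical stand-in for trained word2vec vectors, one row per vocabulary index.
w2v_weights = torch.randn(VOCAB_SIZE, EMBEDDING_DIM)

# 1) Trainable embeddings with random initialization.
emb_random = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

# 2) Trainable embeddings initialized from word2vec (fine-tuned during training).
emb_finetuned = nn.Embedding.from_pretrained(w2v_weights.clone(), freeze=False)

# 3) word2vec vectors used as fixed inputs (assumed to mean: no gradient updates).
emb_frozen = nn.Embedding.from_pretrained(w2v_weights.clone(), freeze=True)

for name, emb in [("random", emb_random),
                  ("fine-tuned", emb_finetuned),
                  ("frozen", emb_frozen)]:
    print(f"{name:>10}: trainable={emb.weight.requires_grad}")
```

Any of the three layers can replace `self.embedding` in a model, which keeps the rest of the architecture identical across experiments.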

I would suggest training several sets of your own embeddings, experimenting with the parameters to see how they influence the final vectors. Then compare the sets of embeddings outside of your system (analogies, odd-one-out, ...) so you can form expectations about which embeddings might yield the best result for your task. Finally, look at how the vectors perform in your system and analyze whether you expected that result and why.
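One of the intrinsic checks mentioned above, odd-one-out, can be sketched without any framework: pick the word whose vector is least similar to the mean of the group. The toy 3-dimensional vectors below are made up for illustration; in practice they would come from a trained word2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    dim = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

# Made-up toy vectors, purely illustrative.
toy_vectors = {
    "happi":  [0.9, 0.1, 0.0],
    "glad":   [0.8, 0.2, 0.1],
    "love":   [0.7, 0.3, 0.0],
    "monday": [0.0, 0.1, 0.9],
}

print(odd_one_out(toy_vectors, ["happi", "glad", "love", "monday"]))
```

Running the same word groups against each set of trained embeddings gives a quick, task-independent comparison before any classifier is trained.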

- https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb

- https://github.com/hietalajulius/deep-learning-aalto/blob/master/Classifier.ipynb

- https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis

In [1]:
from collections import Counter
import csv
import math
import numpy as np
import pandas as pd
import spacy

import torch
import torchtext
import torchtext.vocab
from torchtext import datasets

In [2]:
df_train = pd.read_csv('data/processed_train.csv')
print(df_train.shape)
df_train.head()

(1024000, 2)


| | target | text |
|---|---|---|
| 0 | 1 | oop wrong url thing work mind brain foggi life |
| 1 | 1 | yes fantast |
| 2 | 1 | pretttyyi pleaseee tweet mileycsupport reallll... |
| 3 | 0 | yep heard everi sad song twitter safe say far ... |
| 4 | 1 | oh got blush like littl girl |


## Pre-trained word embeddings
- word2vec
- GloVe

In [3]:
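def get_vector(embeddings, word):
    # Look up the row of the embedding matrix that belongs to `word`
    return embeddings.vectors[embeddings.stoi[word]]


def closest(embeddings, vector, n=6):
    # Euclidean distance from `vector` to every word in the vocabulary
    distances = []
    for neighbor in embeddings.itos:
        distance = torch.dist(vector, get_vector(embeddings, neighbor)).item()
        distances.append((neighbor, distance))

    # Smallest distance first
    return sorted(distances, key=lambda x: x[1])[:n]


def analogy(embeddings, w1, w2, w3, n=6):
    # Solve "w1 is to w2 as w3 is to ?" with the vector w2 - w1 + w3
    closest_words = closest(embeddings,
                            get_vector(embeddings, w2)
                            - get_vector(embeddings, w1)
                            + get_vector(embeddings, w3),
                            n + 3)
    # Drop the query words themselves from the candidates
    closest_words = [x for x in closest_words if x[0] not in [w1, w2, w3]][:n]

    return closest_words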

"""
glove = torchtext.vocab.GloVe(name='6B', dim=100)

closest(glove, get_vector(glove, 'paper'))

analogy(glove, 'moon', 'night', 'sun')
"""

"\nglove = torchtext.vocab.GloVe(name='6B', dim=100)\n\nclosest(glove, get_vector(glove, 'paper'))\n\nanalogy(glove, 'moon', 'night', 'sun')\n"

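The per-word loop in `closest` computes one distance at a time, which gets slow for large vocabularies. A vectorized variant could look like the sketch below. `ToyEmbeddings` is a hypothetical stand-in that only mimics the `itos`, `stoi` and `vectors` attributes used above (it is not a torchtext class), and `closest_vectorized` is an illustrative name.

```python
import torch

class ToyEmbeddings:
    """Hypothetical stand-in exposing itos, stoi and vectors."""
    def __init__(self, words, vectors):
        self.itos = list(words)
        self.stoi = {w: i for i, w in enumerate(self.itos)}
        self.vectors = vectors

def closest_vectorized(embeddings, vector, n=6):
    # Euclidean distance from `vector` to every row of the matrix in one call.
    distances = torch.norm(embeddings.vectors - vector, dim=1)
    n = min(n, len(embeddings.itos))
    # largest=False returns the n smallest distances in ascending order.
    values, indices = torch.topk(distances, k=n, largest=False)
    return [(embeddings.itos[i], d) for i, d in zip(indices.tolist(), values.tolist())]

toy = ToyEmbeddings(
    ["paper", "pen", "sun"],
    torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
)
print(closest_vectorized(toy, toy.vectors[toy.stoi["paper"]], n=2))
```

The result has the same `(word, distance)` structure as `closest`, so it could be dropped into `analogy` unchanged.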
## Build vocab

In [4]:
"""
TEXT = torchtext.data.Field(tokenize= 'spacy',
                            init_token='<sos>',
                            eos_token='<eos>',
                            unk_token='<unk>',
                            tokenizer_language='en_core_web_sm',
                            lower=True)
"""
TEXT = torchtext.data.Field(tokenize= 'spacy',
                            tokenizer_language='en_core_web_sm',
                            lower=True)
LABEL = torchtext.data.LabelField(dtype=torch.float)

datafields = [('Sentiment', LABEL), ('SentimentText', TEXT)]

train, val, test = torchtext.data.TabularDataset.splits(path='data/',
                                                  train='processed_train.csv',
                                                  validation='processed_val.csv',
                                                  test='processed_test.csv',
                                                  format='csv',
                                                  skip_header=True,
                                                  fields=datafields)



In [5]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train)

print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

print(TEXT.vocab.freqs.most_common(20))

print(TEXT.vocab.itos[:10])

print(LABEL.vocab.stoi)

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
[('go', 88779), ('get', 70792), ('day', 67839), ('good', 59384), ('work', 56051), ('like', 53493), ('love', 52734), ('quot', 47034), ('got', 45559), ('today', 43640), ('time', 42450), ('nt', 39689), ('lol', 38152), ('thank', 38138), ('back', 36807), ('want', 36771), ('one', 36690), ('i', 36474), ('miss', 36311), ('u', 35543)]
['<unk>', '<pad>', 'go', 'get', 'day', 'good', 'work', 'like', 'love', 'quot']
defaultdict(None, {'0': 0, '1': 1})
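The reported size of 25002 is consistent with `MAX_VOCAB_SIZE` plus the two special tokens (`<unk>` and `<pad>`) visible at the front of `itos`. The idea can be sketched in plain Python with a `Counter`. This illustrates the principle only and is not torchtext's actual implementation; tie-breaking between equally frequent tokens may differ.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Map the `max_size` most frequent tokens to indices, after the specials."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

texts = [["go", "work", "today"], ["go", "home"], ["love", "go", "work"]]
itos, stoi = build_vocab(texts, max_size=2)

print(itos)
# Tokens outside the vocabulary fall back to the <unk> index.
print([stoi.get(tok, stoi["<unk>"]) for tok in ["go", "home"]])
```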


In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
# Bucket sentences of similar length together to minimise padding in each batch
train_iterator, val_iterator, test_iterator = torchtext.data.BucketIterator.splits(
                                (train, val, test),
                                batch_size=64,
                                sort_key=lambda x: len(x.SentimentText),
                                sort_within_batch=False,
                                device=device)

cuda
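To see why grouping sentences of similar length reduces padding, the sketch below counts pad tokens for consecutive batches with and without sorting by length. It is a simplified illustration: `BucketIterator` itself also shuffles, and the lengths below are invented.

```python
def padding_needed(lengths, batch_size):
    """Total pad tokens when consecutive chunks are padded to their longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 20, 4, 18, 5, 19, 2, 21]  # invented sentence lengths

unsorted_pad = padding_needed(lengths, batch_size=4)
sorted_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, sorted_pad)
```

Fewer pad tokens means less wasted computation per batch, which matters when a single epoch takes many minutes.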


## RNN model

In [7]:
import torch.nn as nn

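class GRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim,
                 output_dim, n_layers, bidirectional, dropout):
        super().__init__()

        self.bidirectional = bidirectional
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=dropout)
        # A bidirectional GRU yields two final hidden states (forward and backward)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [seq_len, batch_size]
        embedded_text = self.dropout(self.embedding(text))
        # hidden: [n_layers * n_directions, batch_size, hidden_dim]
        output, hidden = self.gru(embedded_text)
        if self.bidirectional:
            # Concatenate the top layer's final forward and backward states
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]
        # Result: [batch_size, output_dim]
        return self.fc(self.dropout(hidden))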
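The indexing `hidden[-2]` and `hidden[-1]` relies on how `nn.GRU` lays out its final hidden state: the shape is `(num_layers * num_directions, batch, hidden_dim)`, and for a bidirectional GRU the last two entries are the forward and backward states of the top layer. The shape check below uses small made-up dimensions.

```python
import torch
import torch.nn as nn

seq_len, batch_size, emb_dim, hidden_dim, n_layers = 7, 4, 10, 16, 2

gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, emb_dim)  # (seq_len, batch, features) by default

output, hidden = gru(x)
print(output.shape)  # per-step outputs of the top layer, both directions concatenated
print(hidden.shape)  # final state of every layer and direction

# Final forward and backward states of the top layer, concatenated per example.
sentence_repr = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(sentence_repr.shape)
```

The concatenated representation has `2 * hidden_dim` features per example, which is why the linear layer takes `hidden_dim*2` inputs.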

In [8]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = GRU(vocab_size=INPUT_DIM, 
            embedding_dim=EMBEDDING_DIM, 
            hidden_dim=HIDDEN_DIM, 
            output_dim=OUTPUT_DIM, 
            n_layers=2,
            bidirectional=True,
            dropout=0.1)
print(model)

GRU(
  (embedding): Embedding(25002, 100)
  (gru): GRU(100, 256, num_layers=2, dropout=0.1, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)


In [9]:
# Use pretrained embeddings
"""
#pretrained_embeddings = TEXT.vocab.vectors
#model.embedding.weight.data.copy_(pretrained_embeddings)

unk_idx = TEXT.vocab.stoi[TEXT.unk_token]
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

model.embedding.weight.data[unk_idx] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[pad_idx] = torch.zeros(EMBEDDING_DIM)
"""

'\n#pretrained_embeddings = TEXT.vocab.vectors\n#model.embedding.weight.data.copy_(pretrained_embeddings)\n\nunk_idx = TEXT.vocab.stoi[TEXT.unk_token]\npad_idx = TEXT.vocab.stoi[TEXT.pad_token]\n\nmodel.embedding.weight.data[unk_idx] = torch.zeros(EMBEDDING_DIM)\nmodel.embedding.weight.data[pad_idx] = torch.zeros(EMBEDDING_DIM)\n'
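Copying pretrained vectors into `model.embedding` only works if row `i` of the weight matrix holds the vector for `itos[i]`. One way to build such a matrix from a word-to-vector mapping is sketched below. `word_vectors` is a hypothetical dictionary standing in for a trained word2vec model, the tiny vocabulary is invented, and zero-initializing `<unk>`, `<pad>` and words without a vector is just one possible choice.

```python
import torch
import torch.nn as nn

EMBEDDING_DIM = 4
itos = ["<unk>", "<pad>", "go", "get", "day"]

# Hypothetical word -> vector lookup standing in for trained word2vec vectors.
word_vectors = {
    "go": torch.ones(EMBEDDING_DIM),
    "day": torch.full((EMBEDDING_DIM,), 2.0),
}

# Rows default to zero, so specials and words without a vector stay zero-initialized.
weights = torch.zeros(len(itos), EMBEDDING_DIM)
for idx, token in enumerate(itos):
    if token in word_vectors:
        weights[idx] = word_vectors[token]

embedding = nn.Embedding(len(itos), EMBEDDING_DIM)
with torch.no_grad():
    embedding.weight.copy_(weights)

print(embedding(torch.tensor([2, 3, 4])))
```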

In [10]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)
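`BCEWithLogitsLoss` takes raw logits, which is why the model has no sigmoid on its output. For a single logit `x` and label `y` it evaluates binary cross-entropy on `sigmoid(x)` in a numerically stable form. The plain-Python sketch below compares the naive and the stable formula; it illustrates the math and is not PyTorch's source code.

```python
import math

def bce_naive(x, y):
    """Binary cross-entropy computed directly from sigmoid(x)."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(x, y):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

for x, y in [(0.0, 1.0), (2.0, 1.0), (-3.0, 0.0), (1.5, 0.0)]:
    print(f"x={x:+.1f} y={y:.0f} "
          f"naive={bce_naive(x, y):.4f} stable={bce_with_logits(x, y):.4f}")
```

A logit of 0 (an undecided model) gives ln 2 ≈ 0.693, a useful reference point when reading the loss values in the training log.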


In [11]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    # Map logits to probabilities with a sigmoid, then round to a 0/1 prediction
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  # convert to float for division
    acc = correct.sum() / len(correct)
    return acc
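Because `sigmoid(x) > 0.5` exactly when `x > 0`, the same accuracy can be obtained by thresholding the logits at zero. The sketch below checks this on a few invented values; `binary_accuracy_from_logits` is an illustrative name, not part of the notebook.

```python
import torch

def binary_accuracy_from_logits(logits, y):
    # sigmoid(x) > 0.5 exactly when x > 0, so the sigmoid can be skipped.
    predicted = (logits > 0).float()
    return (predicted == y).float().mean()

logits = torch.tensor([2.0, -1.0, 0.5, -0.3])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

acc = binary_accuracy_from_logits(logits, labels)
reference = (torch.round(torch.sigmoid(logits)) == labels).float().mean()
print(acc.item(), reference.item())
```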

In [16]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.SentimentText).squeeze(1)
        
        loss = criterion(predictions, batch.Sentiment)
        
        acc = binary_accuracy(predictions, batch.Sentiment)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return model, epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            # print(batch.SentimentText)
            if batch.SentimentText.nelement() > 0:
                predictions = model(batch.SentimentText).squeeze(1)

                loss = criterion(predictions, batch.Sentiment)

                acc = binary_accuracy(predictions, batch.Sentiment)

                epoch_loss += loss.item()
                epoch_acc += acc.item()
            # else:
              #  print(f"Found an empty tensor: {batch.SentimentText}")
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
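Note that `evaluate` skips empty batches but still divides by `len(iterator)`, and that averaging per-batch accuracies weights a small final batch as heavily as a full one. If exact dataset-level figures matter, one option is to weight each batch by its size, as in this sketch. The batch sizes and accuracies are invented for illustration, and `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(batch_values, batch_sizes):
    """Average per-batch metrics, weighting each batch by its number of examples."""
    total = sum(batch_sizes)
    if total == 0:
        raise ValueError("no examples were evaluated")
    return sum(v * n for v, n in zip(batch_values, batch_sizes)) / total

# Invented example: two full batches of 64 and a final batch of 8.
batch_acc = [0.75, 0.50, 1.00]
batch_sizes = [64, 64, 8]

print(sum(batch_acc) / len(batch_acc))        # unweighted mean of batch accuracies
print(weighted_mean(batch_acc, batch_sizes))  # accuracy over all 136 examples
```

With large, mostly full batches the two numbers are close, so the difference mainly matters for small evaluation sets.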

In [13]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [17]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    model, train_loss, train_acc  = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, val_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'gru_simple_.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 18m 9s
	Train Loss: 0.667 | Train Acc: 59.14%
	 Val. Loss: 0.656 |  Val. Acc: 59.88%
Epoch: 02 | Epoch Time: 18m 16s
	Train Loss: 0.657 | Train Acc: 60.61%
	 Val. Loss: 0.649 |  Val. Acc: 61.08%
Epoch: 03 | Epoch Time: 18m 16s
	Train Loss: 0.649 | Train Acc: 61.75%
	 Val. Loss: 0.641 |  Val. Acc: 62.29%
Epoch: 04 | Epoch Time: 18m 16s
	Train Loss: 0.638 | Train Acc: 63.20%
	 Val. Loss: 0.627 |  Val. Acc: 64.07%
Epoch: 05 | Epoch Time: 18m 22s
	Train Loss: 0.620 | Train Acc: 65.24%
	 Val. Loss: 0.606 |  Val. Acc: 66.33%
Epoch: 06 | Epoch Time: 18m 17s
	Train Loss: 0.596 | Train Acc: 67.56%
	 Val. Loss: 0.580 |  Val. Acc: 69.06%
Epoch: 07 | Epoch Time: 18m 35s
	Train Loss: 0.577 | Train Acc: 69.24%
	 Val. Loss: 0.566 |  Val. Acc: 70.27%
Epoch: 08 | Epoch Time: 18m 16s
	Train Loss: 0.565 | Train Acc: 70.34%
	 Val. Loss: 0.557 |  Val. Acc: 71.17%
Epoch: 09 | Epoch Time: 18m 18s
	Train Loss: 0.556 | Train Acc: 71.11%
	 Val. Loss: 0.549 |  Val. Acc: 71.92%
Epoch: 10 |

In [18]:
model.load_state_dict(torch.load('gru_simple_.pt'))
print(model)

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

GRU(
  (embedding): Embedding(25002, 100)
  (gru): GRU(100, 256, num_layers=2, dropout=0.1, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
Test Loss: 0.547 | Test Acc: 72.14%
