## Embeddings

We intend to experiment with sentiment analysis (sentence classification) using different word
embeddings and the Sentiment140 dataset of labeled tweets. More specifically,
we intend to train our own word embeddings with word2vec and then use them in a deep
learning sentiment analysis model (TBD: LSTM/GRU/Transformer) that we will build in
PyTorch. The main experiments we will perform are:
- Use PyTorch’s trainable embeddings with random initialization
- Use PyTorch’s trainable embeddings initialized with our trained word2vec vectors
- Use only our trained word2vec embeddings as inputs, without training them further

We will then compare the accuracy of the three setups on the sentiment analysis task.
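
The three experiments differ only in how the embedding layer is initialized and whether it is updated during training. A minimal sketch of that difference, assuming toy sizes and a hypothetical `w2v_weights` tensor that stands in for trained word2vec vectors (random values here):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBEDDING_DIM = 10, 4  # toy sizes, for illustration only

# Hypothetical stand-in for a matrix of trained word2vec vectors,
# one row per vocabulary index.
w2v_weights = torch.randn(VOCAB_SIZE, EMBEDDING_DIM)

# 1) Trainable embeddings with random initialization.
emb_random = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

# 2) Trainable embeddings initialized from the word2vec matrix.
emb_finetuned = nn.Embedding.from_pretrained(w2v_weights.clone(), freeze=False)

# 3) Fixed word2vec embeddings that are only used as inputs.
emb_frozen = nn.Embedding.from_pretrained(w2v_weights.clone(), freeze=True)

for name, emb in [("random", emb_random),
                  ("fine-tuned", emb_finetuned),
                  ("frozen", emb_frozen)]:
    print(name, "trainable:", emb.weight.requires_grad)
```

Only the third layer has `requires_grad` set to `False`, so an optimizer would leave its weights untouched.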

I would suggest training several sets of your own embeddings, experimenting with the parameters to see how they influence the final vectors. Then compare the sets of embeddings outside of your system (analogies, odd-one-out, ...) so you can form expectations about which embeddings might yield the best result for your task. Finally, look at how the vectors perform in your system and analyze whether you expected that result and why.
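
One way to run an odd-one-out check outside the model is to pick the word with the lowest mean cosine similarity to the rest of the group. The sketch below is only an illustration: the helper names are hypothetical and the 2-D vectors are made up rather than taken from trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(vectors, words):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_sim(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)
    return min(words, key=mean_sim)

# Toy 2-D vectors, made up for illustration (not real embeddings).
toy_vectors = {
    "happy":  [0.90, 0.10],
    "glad":   [0.80, 0.20],
    "joyful": [0.85, 0.15],
    "exam":   [0.10, 0.90],
}

print(odd_one_out(toy_vectors, ["happy", "glad", "joyful", "exam"]))  # exam
```

Running the same check on each trained embedding set gives a quick, model-independent way to compare them.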

- https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb

- https://github.com/hietalajulius/deep-learning-aalto/blob/master/Classifier.ipynb

- https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis

In [13]:
from collections import Counter
import math
import numpy as np
import pandas as pd
import spacy

ModuleNotFoundError: No module named 'spacy'

In [2]:
df_train = pd.read_csv('data/processed_train.csv')
df_train.head()

Unnamed: 0,target,text
0,0,yeah hmmmm lay low guess u ever wait till last...
1,1,excit new everyday sunday cd relea today serio...
2,1,good morn tweep tworld malibu string bikini fa...
3,0,feel sooo sick stress exam tomorrow
4,0,know feel get cowork sick oh well think got co...


## Build your own vocabulary

In [None]:
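train_vocab = Counter()
# Drop rows whose text is missing (NaN) before tokenizing.
df_train = df_train[df_train['text'].notnull()]
for tweet in df_train.text.values:
    # split() returns a list of words; passing a bare string to
    # Counter.update would count individual characters instead.
    train_vocab.update(tweet.split())

# Counter.most_common() already returns entries sorted by frequency,
# so no explicit sort is needed.
print("Vocabulary size:", len(train_vocab))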

In [None]:
print(train_vocab.most_common(10))

In [None]:
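import torchtext

# Vocab (not Vectors) builds a vocabulary from a Counter in torchtext 0.5.0;
# Vectors instead loads embeddings from a file.
vocab = torchtext.vocab.Vocab(train_vocab,
                              max_size=50000,
                              min_freq=1)
# text_field.build_vocab(trn_ds, vectors=...)  # needs a Field, a dataset and a Vectors object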
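
To make the roles of `max_size` and `min_freq` concrete, here is a small library-free sketch of turning word counts into index mappings. The `build_vocab` helper and the toy tweets are hypothetical, and the special tokens are an assumption. Note that `Counter.update` must receive a list of tokens, because a bare string is counted character by character.

```python
from collections import Counter

tweets = ["feel sooo sick", "feel good today", "good morn"]  # toy data

counts = Counter()
for tweet in tweets:
    # Pass a list of tokens; a bare string would be counted per character.
    counts.update(tweet.split())

def build_vocab(counts, max_size=50000, min_freq=1, specials=("<unk>", "<pad>")):
    """Map the most frequent words to integer ids, reserving ids for special tokens."""
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [word for word, freq in ordered if freq >= min_freq][:max_size]
    itos = list(specials) + words
    stoi = {word: idx for idx, word in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab(counts, min_freq=2)
print(itos)  # ['<unk>', '<pad>', 'feel', 'good']
```

Words below `min_freq` get no id of their own and would be mapped to `<unk>` at lookup time.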

## Pre-trained word embeddings
- word2vec
- GloVe

In [4]:
!pip install torch



You are using pip version 18.1, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [5]:
!pip install torchtext

Collecting torchtext
  Using cached https://files.pythonhosted.org/packages/79/ef/54b8da26f37787f5c670ae2199329e7dccf195c060b25628d99e587dac51/torchtext-0.5.0-py3-none-any.whl
Collecting sentencepiece (from torchtext)
  Using cached https://files.pythonhosted.org/packages/19/df/055557e0b5e05c13bbfcc648c10181627949a9313c55a6390558eac10cf1/sentencepiece-0.1.85-cp36-cp36m-win_amd64.whl
Collecting tqdm (from torchtext)
  Using cached https://files.pythonhosted.org/packages/47/55/fd9170ba08a1a64a18a7f8a18f088037316f2a41be04d2fe6ece5a653e8f/tqdm-4.43.0-py2.py3-none-any.whl
Installing collected packages: sentencepiece, tqdm, torchtext
Successfully installed sentencepiece-0.1.85 torchtext-0.5.0 tqdm-4.43.0




In [6]:
import torch
import torchtext
import torchtext.vocab
from torchtext import datasets

In [7]:
# The class is GloVe (capital V); torchtext.vocab.Glove raises AttributeError.
glove = torchtext.vocab.GloVe(name='6B', dim=100)

In [None]:
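def get_vector(embeddings, word):
    """Look up the embedding vector of `word`."""
    return embeddings.vectors[embeddings.stoi[word]]


def closest(embeddings, vector, n=6):
    """Return the n words whose vectors are nearest to `vector` (Euclidean distance)."""
    distances = []
    for neighbor in embeddings.itos:
        distance = torch.dist(vector, get_vector(embeddings, neighbor)).item()
        distances.append((neighbor, distance))

    return sorted(distances, key=lambda x: x[1])[:n]


def analogy(embeddings, w1, w2, w3, n=6):
    """Solve "w1 is to w2 as w3 is to ?" by vector arithmetic."""
    # Fetch three extra candidates because the query words themselves
    # are filtered out below.
    closest_words = closest(embeddings,
                            get_vector(embeddings, w2)
                            - get_vector(embeddings, w1)
                            + get_vector(embeddings, w3),
                            n + 3)
    closest_words = [x for x in closest_words if x[0] not in [w1, w2, w3]][:n]

    return closest_words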

In [None]:
closest(glove, get_vector(glove, 'paper'))

In [None]:
analogy(glove, 'moon', 'night', 'sun')
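
The analogy query ranks words by their distance to the point `w2 - w1 + w3`. The arithmetic can be checked by hand on toy vectors. The sketch below uses made-up 2-D values and a hypothetical `toy_analogy` helper, so it illustrates the idea rather than the behaviour of real GloVe vectors.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy 2-D vectors, made up for illustration (not real GloVe values).
toy = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 5.0],
}

def toy_analogy(vectors, w1, w2, w3, n=1):
    """Solve "w1 is to w2 as w3 is to ?" by ranking words near w2 - w1 + w3."""
    target = [b - a + c for a, b, c in zip(vectors[w1], vectors[w2], vectors[w3])]
    candidates = [(word, euclidean(vec, target))
                  for word, vec in vectors.items()
                  if word not in (w1, w2, w3)]
    return sorted(candidates, key=lambda pair: pair[1])[:n]

# woman - man + king = [3.0, 1.0], which coincides with the toy "queen" vector.
print(toy_analogy(toy, "man", "woman", "king"))  # [('queen', 0.0)]
```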

## GRU model

In [12]:
!pip install spacy

Collecting spacy
  Using cached https://files.pythonhosted.org/packages/0a/19/2b2c0e1340131a8e23ce4a9804cdccdd62d4d23d3d86c1754857b3de7a14/spacy-2.2.4-cp36-cp36m-win_amd64.whl
Collecting thinc==7.4.0 (from spacy)
  Using cached https://files.pythonhosted.org/packages/dd/f5/4c76b84a9ae0ea6c659285cf8fed8cce76d5db5b8353e1e08b8d3f56058e/thinc-7.4.0-cp36-cp36m-win_amd64.whl
Collecting srsly<1.1.0,>=1.0.2 (from spacy)
  Using cached https://files.pythonhosted.org/packages/3a/d6/939d46c05289b185226a576421079468123b1719ffe16e181e0005d45ef9/srsly-1.0.2-cp36-cp36m-win_amd64.whl
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Using cached https://files.pythonhosted.org/packages/b0/71/a58322c3489bf0f5a71aa69a66b42164cbc4f0d5ac5e1042c11233766b3f/preshed-3.0.2-cp36-cp36m-win_amd64.whl
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Using cached https://files.pythonhosted.org/packages/dd/ec/904b4741879a2a280a40d5bf0b61449a20d1f75281e14ebee06566f7765b/cymem-2.0.3-cp36-cp36m-win_amd64.whl
Collecting cata

Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\programdata\\anaconda3\\lib\\site-packages\\numpy\\add_newdocs.py'
Consider using the `--user` option or check the permissions.



In [None]:
import spacy

In [10]:
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

def tokenizer(s):
    return [w.text.lower() for w in nlp(s)]

TEXT = torchtext.data.Field(tokenize=tokenizer,
                            init_token='<sos>',
                            eos_token='<eos>',
                            unk_token='<unk>',
                            tokenizer_language='en',
                            lower=True)
LABEL = torchtext.data.LabelField(dtype=torch.float)

# Fields are matched to CSV columns by position; the first column is the
# unnamed pandas index, so it is skipped with (None, None).
datafields = [(None, None), ('Sentiment', LABEL), ('SentimentText', TEXT)]

train, test = torchtext.data.TabularDataset.splits(path='data/',
                                                   train='processed_train.csv',
                                                   test='processed_test.csv',
                                                   format='csv',
                                                   skip_header=True,
                                                   fields=datafields)

NameError: name 'spacy' is not defined

In [None]:
TEXT.build_vocab(train, max_size=20000,
                 vectors='glove.6B.100d',  # pretrained GloVe vectors, 100 dimensions
                 unk_init=torch.Tensor.normal_)  # random init for words missing from GloVe

LABEL.build_vocab(train)

In [None]:
print(TEXT.vocab.freqs.most_common(10))
print(TEXT.vocab.itos[:10])

In [None]:
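device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Hold out a validation set (the 80/20 ratio is an assumption).
train_data, valid_data = train.split(split_ratio=0.8)

# Minimise padding by batching sentences of similar length.
train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits(
                                (train_data, valid_data, test),
                                batch_size=64,
                                sort_key=lambda x: len(x.SentimentText),
                                sort_within_batch=False,
                                device=device)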
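
Why grouping by length reduces padding can be seen with a small library-free experiment. The sketch below is a simplification: it uses a strict sort, which is not exactly how `BucketIterator` forms its batches, and the sequence lengths are randomly generated toy values. It compares the pad tokens needed when batches are formed in arbitrary order against batches formed after sorting by length.

```python
import random

def padding_needed(lengths, batch_size):
    """Total pad tokens if each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

random.seed(0)
lengths = [random.randint(1, 30) for _ in range(640)]  # toy tweet lengths in tokens

unsorted_pad = padding_needed(lengths, batch_size=64)
bucketed_pad = padding_needed(sorted(lengths), batch_size=64)
print(f"pad tokens, arbitrary order: {unsorted_pad}; sorted by length: {bucketed_pad}")
```

Fewer pad tokens means less wasted computation in the recurrent layer for every batch.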

In [None]:
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim,
                 output_dim, n_layers, bidirectional, dropout):
        super().__init__()

        self.bidirectional = bidirectional
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=dropout)
        # A bidirectional GRU yields a forward and a backward final state.
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [seq_len, batch_size]
        embedded_text = self.dropout(self.embedding(text))
        output, hidden = self.gru(embedded_text)
        # hidden: [n_layers * num_directions, batch_size, hidden_dim]
        if self.bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]
        return self.fc(self.dropout(hidden))
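
The indexing `hidden[-2]` and `hidden[-1]` relies on how `nn.GRU` lays out its final hidden states. A quick shape check on random input makes this visible. The sizes below are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

SEQ_LEN, BATCH, EMB_DIM, HID_DIM, N_LAYERS = 7, 3, 5, 4, 2  # toy sizes

gru = nn.GRU(EMB_DIM, HID_DIM, num_layers=N_LAYERS, bidirectional=True)
x = torch.randn(SEQ_LEN, BATCH, EMB_DIM)  # [seq_len, batch, emb_dim] (batch_first=False)

output, hidden = gru(x)
print(output.shape)  # per-step outputs: [seq_len, batch, 2 * hid_dim]
print(hidden.shape)  # final states: [n_layers * 2, batch, hid_dim]

# The last two entries of `hidden` are the top layer's forward and backward states.
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)   # [batch, 2 * hid_dim]
```

The concatenated tensor has `2 * hid_dim` features, which is why the linear layer after a bidirectional GRU takes `hidden_dim * 2` inputs.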

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
# Assumed values for the remaining constructor arguments (not tuned).
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
            N_LAYERS, BIDIRECTIONAL, DROPOUT)

# Use pretrained embeddings
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

unk_idx = TEXT.vocab.stoi[TEXT.unk_token]
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

# Zero the vectors of the special tokens so they carry no pretrained signal.
model.embedding.weight.data[unk_idx] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[pad_idx] = torch.zeros(EMBEDDING_DIM)

In [None]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)


In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
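
A worked example of the same accuracy computation in plain Python, using made-up logits and labels, shows what the function returns for one batch. The `sigmoid` helper is written out by hand here. The strict `> 0.5` threshold is chosen to mirror `torch.round`, which rounds a probability of exactly 0.5 down to 0.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, -1.0, 0.3, -0.2, 1.5]  # toy model outputs (before the sigmoid)
labels = [1.0, 0.0, 0.0, 0.0, 1.0]    # toy ground-truth labels

probs = [sigmoid(z) for z in logits]
# A probability above 0.5 (equivalently, a positive logit) predicts class 1.
preds = [1.0 if p > 0.5 else 0.0 for p in probs]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.8
```

Only the third example is misclassified, so four of five predictions are correct.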

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.SentimentText).squeeze(1)
        
        loss = criterion(predictions, batch.Sentiment)
        
        acc = binary_accuracy(predictions, batch.Sentiment)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.SentimentText).squeeze(1)
            
            loss = criterion(predictions, batch.Sentiment)
            
            acc = binary_accuracy(predictions, batch.Sentiment)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
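import time

def epoch_time(start_time, end_time):
    """Split an elapsed time in seconds into whole minutes and seconds."""
    elapsed = end_time - start_time
    return int(elapsed // 60), int(elapsed % 60)

N_EPOCHS = 5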

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

In [None]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')