[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensioai/dl/blob/master/nlp/sentiment_analysis_bidirectional.ipynb)

## Sentiment Analysis

In [2]:
import torch
import torchtext

TEXT = torchtext.data.Field(tokenize = 'spacy')
LABEL = torchtext.data.LabelField(dtype = torch.float)

train_data, test_data = torchtext.datasets.IMDB.splits(TEXT, LABEL)

In [3]:
len(train_data), len(test_data)

(25000, 25000)

In [4]:
print(vars(train_data.examples[0]))

{'text': ['Cinderella', 'In', 'my', 'opinion', 'greatest', 'love', 'story', 'ever', 'told', 'i', 'loved', 'it', 'as', 'a', 'kid', 'and', 'i', 'love', 'it', 'now', 'a', 'wonderful', 'Disney', 'masterpiece', 'this', 'is', '1', 'of', 'my', 'favorite', 'movies', 'i', 'love', 'Disney', '.', 'i', 'could', 'rave', 'on', 'and', 'on', 'about', 'Cinderella', 'and', 'Disney', 'all', 'day', 'but', 'i', 'wo', 'nt', 'i', 'll', 'give', 'you', 'a', 'brief', 'outline', 'of', 'the', 'story', '.', 'When', 'a', 'young', 'girl', "'s", 'father', 'dies', 'she', 'has', 'to', 'live', 'with', 'her', 'evil', 'step', 'mother', 'and', 'her', 'equally', 'ugly', 'and', 'nasty', 'step', 'sisters', 'Drusilla', 'and', 'Anastasia', '.', 'Made', 'to', 'do', 'remedial', 'house', 'chores', 'all', 'day', 'poor', 'Cinderella', 'has', 'only', 'the', 'little', 'mice', 'who', 'scurry', 'around', 'the', 'house', 'and', 'her', 'dog', 'Bruno', 'as', 'friends', '.', 'When', 'one', 'day', 'a', 'letter', 'is', 'sent', 'to', 'her', 'h

In [5]:
train_data, valid_data = train_data.split()
len(train_data), len(valid_data)

(17500, 7500)

In [6]:
MAX_VOCAB_SIZE = 10000
TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [7]:
len(TEXT.vocab), len(LABEL.vocab)

(10002, 2)

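The vocabulary has 10,002 entries rather than 10,000 because torchtext adds two special tokens: `<unk>`, which stands in for any word outside the vocabulary, and `<pad>`, which is appended to shorter sentences so that every sentence in a batch has the same length.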
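As a rough illustration of what the two special tokens do, the sketch below builds a toy vocabulary by hand. The names `toy_itos`, `numericalize` and `pad_batch` are hypothetical helpers and not part of torchtext. Out-of-vocabulary words map to `<unk>`, and shorter sentences are filled with `<pad>` so the batch becomes rectangular.

```python
# Toy vocabulary; indices 0 and 1 are reserved for the special tokens.
toy_itos = ['<unk>', '<pad>', 'the', 'film', 'is', 'great']
toy_stoi = {tok: i for i, tok in enumerate(toy_itos)}
UNK_IDX, PAD_IDX = toy_stoi['<unk>'], toy_stoi['<pad>']

def numericalize(tokens):
    # Words missing from the vocabulary fall back to the <unk> index.
    return [toy_stoi.get(tok, UNK_IDX) for tok in tokens]

def pad_batch(batch):
    # Right-pad every sequence with <pad> up to the longest one in the batch.
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_IDX] * (max_len - len(seq)) for seq in batch]

batch = [numericalize(['the', 'film', 'is', 'great']),
         numericalize(['great', 'movie'])]  # 'movie' is not in the toy vocabulary
padded = pad_batch(batch)
print(padded)
```

Here the second sentence has one unknown word and two padding positions.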

In [8]:
TEXT.vocab.freqs.most_common(10)

[('the', 205359),
 (',', 194246),
 ('.', 166557),
 ('and', 110362),
 ('a', 110266),
 ('of', 102210),
 ('to', 94693),
 ('is', 76797),
 ('in', 61951),
 ('I', 54422)]

In [9]:
TEXT.vocab.itos[:10]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']

In [10]:
LABEL.vocab.stoi

defaultdict(<function torchtext.vocab._default_unk_index>,
            {'neg': 1, 'pos': 0})

In [11]:
device = "cuda" if torch.cuda.is_available() else "cpu"

dataloader = {
    'train': torchtext.data.BucketIterator(train_data, batch_size=64, sort_within_batch=True, device=device),
    'val': torchtext.data.BucketIterator(valid_data, batch_size=64, device=device),
    'test': torchtext.data.BucketIterator(test_data, batch_size=64, device=device)
}

### Training

In [12]:
class Metric():
  def __init__(self):
    self.name = "acc"
  
  def call(self, outputs, targets):
    rounded_preds = torch.round(torch.sigmoid(outputs))
    correct = (rounded_preds == targets).float() 
    acc = correct.sum().item() / len(correct)
    return acc
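
The metric applies a sigmoid first because the network outputs raw logits rather than probabilities. A small worked example, using made-up logits and targets, may make the steps easier to follow:

```python
import torch

# Hypothetical raw model outputs (logits) and float targets for a batch of 4.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 0.0, 0.0])

probs = torch.sigmoid(logits)   # map logits to the (0, 1) range
preds = torch.round(probs)      # threshold at 0.5
acc = (preds == targets).float().mean().item()
print(preds, acc)
```

Three of the four predictions match the targets, so the accuracy for this batch is 0.75.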

In [13]:
class RNN(torch.nn.Module):
    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=1, dropout=0.2):
        super().__init__()
        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        self.rnn = torch.nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=2, dropout=dropout)
        self.fc = torch.nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        #text = [sent len, batch size]        
        embedded = self.embedding(text)        
        #embedded = [sent len, batch size, emb dim]        
        output, hidden = self.rnn(embedded)        
        #output = [sent len, batch size, hid dim]
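        # output[-1] = [batch size, hid dim]; indexing already drops the time dimension,
        # so squeeze(0) is not needed (it would wrongly drop a batch of size 1)
        y = self.fc(output[-1]).squeeze(1)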
        return y

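Padding carries no information, so we can tell the embedding layer not to learn a representation for the pad token: with `padding_idx` set, its embedding is a zero vector that receives no gradient updates. This notebook refers to this as *masking*.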

In [14]:
class MaskedRNN(RNN):
    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=1, dropout=0.2, pad_idx=0):
        super().__init__(input_dim, embedding_dim, hidden_dim, output_dim, dropout)
        self.embedding = torch.nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
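
To see what `padding_idx` does in isolation, the following minimal sketch uses a tiny embedding layer. The sizes and the value of `PAD_IDX` are assumptions chosen for illustration.

```python
import torch

PAD_IDX = 1  # assumption: same position as '<pad>' in TEXT.vocab.itos
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=PAD_IDX)

# The row for the padding index is initialised to zeros.
print(emb.weight[PAD_IDX])

# Backpropagate through a batch that contains the pad token.
tokens = torch.tensor([[2, 3, PAD_IDX, PAD_IDX]])
emb(tokens).sum().backward()

# The gradient for the pad row stays at zero, so the row is never updated.
pad_grad = emb.weight.grad[PAD_IDX]
word_grad = emb.weight.grad[2]
print(pad_grad, word_grad)
```

Note that this only affects the embedding layer: the GRU still steps over the padded positions. Skipping them entirely would require something like `torch.nn.utils.rnn.pack_padded_sequence`, which this notebook does not use.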

### Bidirectional RNNs

At each time step, a regular recurrent layer only looks at past and present inputs before generating an output. This makes sense for time series forecasting, but for some NLP tasks it is preferable to look ahead at the next words before encoding a given word. We can achieve this using bidirectional recurrent layers.


![](https://miro.medium.com/max/764/1*6QnPUSv_t9BY9Fv8_aLb-Q.png)

In [17]:
from src import WordModel

class MaskedBidirectionalRNN(MaskedRNN):
    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=1, dropout=0.2, pad_idx=0, bidirectional=True):
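        # forward pad_idx so the embedding layer masks the actual pad token
        super().__init__(input_dim, embedding_dim, hidden_dim, output_dim, dropout, pad_idx)
        self.rnn = torch.nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=2, dropout=dropout, bidirectional=bidirectional)
        if bidirectional:
            # forward and backward states are concatenated, doubling the feature size
            self.fc = torch.nn.Linear(2*hidden_dim, output_dim)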
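
The shapes produced by a bidirectional GRU explain why the linear layer needs `2*hidden_dim` inputs. The sketch below uses random inputs and arbitrary sizes. It also shows one common alternative to reading `output[-1]`: concatenating the final hidden states of both directions. This is offered as an option to consider, not as what the class above does.

```python
import torch

seq_len, batch_size, emb_dim, hid_dim = 7, 4, 100, 128
gru = torch.nn.GRU(input_size=emb_dim, hidden_size=hid_dim,
                   num_layers=2, bidirectional=True)

x = torch.randn(seq_len, batch_size, emb_dim)
output, hidden = gru(x)

print(output.shape)  # [seq_len, batch_size, 2 * hid_dim]
print(hidden.shape)  # [num_layers * 2, batch_size, hid_dim]

# Final states of the top layer: the forward direction has read the sequence
# left to right, the backward direction right to left.
forward_final, backward_final = hidden[-2], hidden[-1]
summary = torch.cat([forward_final, backward_final], dim=1)
print(summary.shape)  # [batch_size, 2 * hid_dim]

# The forward final state sits in the first half of output at the last step;
# the backward final state sits in the second half of output at the first step.
same_fwd = torch.allclose(output[-1, :, :hid_dim], forward_final, atol=1e-6)
same_bwd = torch.allclose(output[0, :, hid_dim:], backward_final, atol=1e-6)
print(same_fwd, same_bwd)
```

One consequence is that the backward half of `output[-1]` has only seen the last token, whereas `summary` combines two states that have each read the whole sequence.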

In [18]:
net = MaskedBidirectionalRNN(len(TEXT.vocab), embedding_dim=100, pad_idx=TEXT.vocab.stoi[TEXT.pad_token])

In [19]:
model = WordModel(net)

model.compile(optimizer = torch.optim.Adam(model.net.parameters()),
              loss = torch.nn.BCEWithLogitsLoss(),
              metrics=[Metric()])

hist = model.fit(dataloader['train'], dataloader['val'], epochs=5)

In [20]:
model.evaluate(dataloader['test'])

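We can now use our model to predict whether a movie review is positive or negative. Because the label vocabulary maps `neg` to 1 and `pos` to 0, outputs close to 1 indicate a negative review.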

In [21]:
import spacy
nlp = spacy.load('en')

In [22]:
sentences = ["this film is terrible", "this film is great", "this film is good", "a waste of time"]
tokenized = [[tok.text for tok in nlp.tokenizer(sentence)] for sentence in sentences]
indexed = [[TEXT.vocab.stoi[_t] for _t in t] for t in tokenized]
tensor = torch.tensor(indexed).to(device).permute(1,0)
model.net.eval()
prediction = torch.sigmoid(model.net(tensor))
prediction

tensor([0.9891, 0.0520, 0.0571, 0.9869], device='cuda:0',
       grad_fn=<SigmoidBackward>)
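
To turn these probabilities into readable labels, one option is to threshold at 0.5 and look up the label name. The sketch below hard-codes the values printed above and assumes the same label mapping as `LABEL.vocab.stoi` (`pos` is 0, `neg` is 1). The `itos` list is written out by hand for illustration.

```python
import torch

# Assumption: index 0 is 'pos' and index 1 is 'neg', as in LABEL.vocab.stoi.
itos = ['pos', 'neg']

# Probabilities copied from the output above.
probs = torch.tensor([0.9891, 0.0520, 0.0571, 0.9869])

# A probability of at least 0.5 selects index 1 ('neg'), otherwise index 0 ('pos').
labels = [itos[int(p >= 0.5)] for p in probs.tolist()]
print(labels)
```

This gives `neg` for "this film is terrible" and "a waste of time", and `pos` for the other two sentences.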