
In [1]:
!pip install torchtext
!pip install spacy
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [2]:
import torch
from torchtext import data
from torchtext import datasets
import random
import torch.nn as nn
import torch.optim as optim
import time # calculate time for an epoch
import spacy

## Prepare Datasets

In [3]:
SEED = 1234
torch.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy', include_lengths=True) # include_lengths=True makes batch.text a (tensor, lengths) tuple
LABEL = data.LabelField(dtype=torch.float)

In [4]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

print('number of training examples: {:,}'.format(len(train_data)))
print('number of test examples: {:,}'.format(len(test_data)))

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 10.8MB/s]


number of training examples: 25,000
number of test examples: 25,000


In [5]:
print(vars(train_data.examples[0]))

{'text': ['Otto', 'Preminger', "'s", 'Dana', 'Andrews', 'cycle', 'of', 'films', 'noirs', 'are', 'among', 'the', '(', 'largely', ')', 'unsung', 'jewels', 'of', 'the', 'genre', '.', 'Because', 'they', 'lack', 'paranoia', ',', 'misogyny', 'or', 'hysteria', ',', 'they', 'may', 'have', 'seemed', 'out', 'of', 'place', 'at', 'the', 'time', ',', 'but', 'the', 'clear', '-', 'eyed', 'imagery', ',', 'the', 'complex', 'play', 'with', 'identity', ',', 'masculinity', 'and', 'representation', ',', 'the', 'subversion', 'of', 'traditional', 'psychological', 'tenets', ',', 'the', 'austere', ',', 'geometrical', 'style', 'all', 'seem', 'startlingly', 'modern', 'today', ',', 'and', 'very', 'similar', 'to', 'Melville', '.', 'The', 'lucid', 'ironies', 'of', 'this', 'film', 'are', 'so', 'loaded', ',', 'brutal', 'and', 'ironic', 'that', 'the', "'", 'happy', "'", 'ending', 'is', 'one', 'of', 'the', 'cruellest', 'in', 'Hollywood', 'history', '.', 'Brilliant', 'on', 'the', 'level', 'of', 'entertaining', 'thriller

In [6]:
# random.seed(SEED) returns None; it seeds Python's global RNG as a side effect
train_data, valid_data = train_data.split(random_state=random.seed(SEED)) # default split_ratio is 0.7

print('number of training examples: {:,}'.format(len(train_data)))
print('number of validation examples: {:,}'.format(len(valid_data)))
print('number of test examples: {:,}'.format(len(test_data)))

number of training examples: 17,500
number of validation examples: 7,500
number of test examples: 25,000
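
The 17,500 / 7,500 counts follow directly from the 0.7 split ratio. The following is a minimal sketch of a seeded 70/30 split using only the standard library; it illustrates the idea and is not torchtext's implementation, and `examples` is a stand-in for the real dataset.

```python
import random

SEED = 1234
examples = list(range(25_000))  # stand-in for the 25,000 training examples

# a dedicated, seeded RNG makes the shuffle reproducible
rng = random.Random(SEED)
shuffled = examples[:]
rng.shuffle(shuffled)

n_train = int(round(0.7 * len(shuffled)))
train_part, valid_part = shuffled[:n_train], shuffled[n_train:]
print(len(train_part), len(valid_part))
```

Re-running the cell with the same seed reproduces the same partition, which is what makes validation results comparable between runs.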


In [19]:
max_vocab_size = 25000

TEXT.build_vocab(train_data, max_size=max_vocab_size, vectors='glove.6B.100d', unk_init=torch.Tensor.normal_) # glove 100 dimension word embedding
LABEL.build_vocab(train_data)

In [20]:
TEXT.vocab.itos[:10]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']

In [23]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    datasets=(train_data, valid_data, test_data), batch_size=BATCH_SIZE,
    sort_within_batch=True, device=device)

In [22]:
print('number of train iter: {:,}'.format(len(train_iter)))
print('number of valid iter: {:,}'.format(len(valid_iter)))
print('number of test iter: {:,}'.format(len(test_iter)))

number of train iter: 274
number of valid iter: 118
number of test iter: 391


In [12]:
int(len(train_data) / BATCH_SIZE) # one less than len(train_iter), which also counts the final partial batch

273
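
The iterator lengths printed above can be reproduced with ceiling division, because a smaller final batch still counts as one iteration. A quick check, assuming only the dataset sizes and BATCH_SIZE shown in this notebook:

```python
import math

BATCH_SIZE = 64
# dataset sizes printed earlier in this notebook
sizes = {'train': 17_500, 'valid': 7_500, 'test': 25_000}

# a partial final batch still counts as one iteration, hence the ceiling
n_batches = {name: math.ceil(n / BATCH_SIZE) for name, n in sizes.items()}
print(n_batches)
```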

## Build the Model

- input dimension: length of the one-hot vectors, i.e. the vocabulary size
- embedding dimension: size of the dense word vectors, usually around 50-250 dimensions
- hidden dimension: size of the hidden states
- output dimension: size of the model output; here 1 (a single logit), because there are only two classes, 0 or 1

In [16]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim) # hidden_dim*2 because the forward and backward final states are concatenated
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        embedded = self.dropout(self.embedding(text)) # embed the words, then apply the first dropout

        # pack sequence (recent PyTorch versions require the lengths to be a CPU tensor)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        # concatenate the final forward and backward hidden states of the top layer, each of length hidden_dim
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))

        return self.fc(hidden) # [batch, 2*hidden_dim] -> [batch, output_dim]
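
To see why `hidden[-2]` and `hidden[-1]` are the two states being concatenated, the sketch below runs a small bidirectional LSTM on random input. The sizes are hypothetical toy values, not the ones used in this notebook, and no packing is applied.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=4, num_layers=2, bidirectional=True)
x = torch.randn(5, 3, 8)  # [seq_len, batch, input_size]
output, (hidden, cell) = lstm(x)

# hidden is [num_layers * num_directions, batch, hidden_size]
print(hidden.shape)

# last two entries: top layer forward state and top layer backward state
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)

# the forward state is the output at the last time step (first half of features),
# the backward state is the output at the first time step (second half)
print(torch.allclose(output[-1, :, :4], hidden[-2]))
print(torch.allclose(output[0, :, 4:], hidden[-1]))
```

The concatenated tensor has `2 * hidden_size` features per example, which is why the linear layer takes `hidden_dim*2` inputs.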

Like before, we'll create an instance of our RNN class, with the new parameters and arguments for the number of layers, bidirectionality and dropout probability.

To ensure the pre-trained vectors can be loaded into the model, the EMBEDDING_DIM must be equal to that of the pre-trained GloVe vectors loaded earlier.

We get our pad token index from the vocabulary, using the string that represents the pad token, which is stored in the field's pad_token attribute and is '<pad>' by default.

In [24]:
input_dim = len(TEXT.vocab)
embedding_dim = 100 # glove.100d
hidden_dim = 256
output_dim = 1
n_layers = 2
bidirectional = True # independent of n_layers: each of the 2 stacked layers runs in both directions
dropout = 0.5
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(vocab_size=input_dim,
            embedding_dim=embedding_dim,
            hidden_dim=hidden_dim,
            output_dim=output_dim,
            n_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout,
            pad_idx=pad_idx)

In [25]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print('The model has {:,} trainable parameters'.format(count_parameters(model)))

The model has 4,810,857 trainable parameters
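
The reported total can be checked by hand. This sketch assumes the hyperparameters defined above and a vocabulary of 25,002 entries (25,000 words plus the unknown and padding tokens); `lstm_direction_params` is a helper introduced here for illustration.

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 25_002, 100, 256, 1

def lstm_direction_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

embedding = vocab_size * embedding_dim
# layer 1 reads the embeddings; layer 2 reads both directions of layer 1
layer1 = 2 * lstm_direction_params(embedding_dim, hidden_dim)
layer2 = 2 * lstm_direction_params(2 * hidden_dim, hidden_dim)
fc = 2 * hidden_dim * output_dim + output_dim

total = embedding + layer1 + layer2 + fc
print('{:,}'.format(total))  # 4,810,857
```

More than half of the parameters (2,500,200) sit in the embedding layer.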


The final addition is copying the pre-trained word embeddings we loaded earlier into the embedding layer of our model.

We retrieve the embeddings from the field's vocab and check that they are the correct size, [vocab size, embedding dim].

In [26]:
pretrained_embeddings = TEXT.vocab.vectors
print(pretrained_embeddings.shape)

torch.Size([25002, 100])


Then, we replace the initial weights of the embedding layer with the pre-trained embeddings.

NOTE: this should always be done on the weight.data and not the weight!

In [29]:
len(pretrained_embeddings[1]) # each word is represented by a vector of length 100

100

In [30]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.9597,  0.8905, -0.7076,  ...,  0.3940, -1.2075, -0.9683],
        [-0.3404,  0.2269,  0.0731,  ..., -0.4427,  0.6267,  0.2811],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4765,  0.2254,  0.3035,  ..., -0.2082,  0.1948,  0.8972],
        [-0.1983, -0.2634,  0.4227,  ...,  0.1574, -0.3458,  0.4537],
        [ 0.3203, -0.8293,  1.2617,  ..., -0.8267,  1.5076, -0.0893]])

As our '<unk>' and '<pad>' tokens aren't in the pre-trained vocabulary, they were initialized with unk_init (an N(0,1) distribution) when we built our vocab. It is preferable to initialize them both to all zeros to explicitly tell our model that, initially, they are irrelevant for determining sentiment.

We do this by manually setting their rows in the embedding weight matrix to zeros. We get each row by finding the index of the token, which we have already done for the padding index.

NOTE: like initializing the embeddings, this should be done on the weight.data and not the weight!

In [35]:
unk_idx = TEXT.vocab.stoi[TEXT.unk_token] # pad_idx was assigned above

model.embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)
model.embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)

print(unk_idx, pad_idx)
print(model.embedding.weight.data.shape)
print(model.embedding.weight.data)

0 1
torch.Size([25002, 100])
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4765,  0.2254,  0.3035,  ..., -0.2082,  0.1948,  0.8972],
        [-0.1983, -0.2634,  0.4227,  ...,  0.1574, -0.3458,  0.4537],
        [ 0.3203, -0.8293,  1.2617,  ..., -0.8267,  1.5076, -0.0893]])
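
The copy-then-zero procedure can be tried on a tiny embedding layer. In this sketch the sizes and the random `pretrained` matrix are hypothetical stand-ins for TEXT.vocab.vectors; it also shows that the row at padding_idx receives no gradient, while the unknown-token row does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
unk_idx, pad_idx = 0, 1
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4, padding_idx=pad_idx)
pretrained = torch.randn(6, 4)  # stand-in for TEXT.vocab.vectors

# overwrite the weights in place, then zero the two special rows
embedding.weight.data.copy_(pretrained)
embedding.weight.data[unk_idx] = torch.zeros(4)
embedding.weight.data[pad_idx] = torch.zeros(4)

tokens = torch.LongTensor([unk_idx, pad_idx, 2])
vectors = embedding(tokens)
print(vectors)

# the padding row gets a zero gradient, so it stays at zero during training
vectors.sum().backward()
print(embedding.weight.grad[pad_idx], embedding.weight.grad[unk_idx])
```

So the padding vector stays at zero throughout training, whereas the unknown-token vector starts at zero and is learned.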


## Train the Model

    argument 1: the parameters the optimizer will update
    argument 2: the learning rate (omitted below, so Adam's default of 1e-3 is used)

    > the optimization method is changed from SGD to Adam

In [36]:
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device) # the model and the loss go to the device; the iterators already place each batch there

#### Accuracy Function

In [38]:
def binary_accuracy(preds, y):
    '''
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8 NOT 8
    '''

    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() # convert to float for division
    acc = correct.sum() / len(correct)
    return acc
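
The same computation can be traced by hand on a few values. This is a plain-Python sketch with hypothetical logits and labels, mirroring the sigmoid, round and compare steps of binary_accuracy.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, -1.0, 0.5, -3.0]   # hypothetical raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]     # hypothetical targets

# a positive logit gives a probability above 0.5 and therefore rounds to 1
rounded = [float(round(sigmoid(z))) for z in logits]
accuracy = sum(p == y for p, y in zip(rounded, labels)) / len(labels)
print(rounded, accuracy)  # [1.0, 0.0, 1.0, 0.0] 0.75
```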

#### Definitions for training and evaluating

As we have set include_lengths = True, our batch.text is now a tuple with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. We separate these into their own variables, text and text_lengths, before passing them to the model.
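
The lengths are what allow the model to pack the padded batch so the LSTM skips the padding. A minimal sketch with a hypothetical two-sequence batch, sorted longest first as pack_padded_sequence expects by default:

```python
import torch
import torch.nn as nn

pad_idx = 1
# two padded sequences laid out as [seq_len=3, batch=2]
text = torch.LongTensor([[4, 7],
                         [5, pad_idx],
                         [6, pad_idx]])
text_lengths = torch.LongTensor([3, 1])  # true lengths, longest first

embedding = nn.Embedding(10, 2, padding_idx=pad_idx)
packed = nn.utils.rnn.pack_padded_sequence(embedding(text), text_lengths)

# number of sequences still active at each time step
print(packed.batch_sizes)
# only the 4 real tokens are kept, each with 2 embedding features
print(packed.data.shape)
```

This is also why the iterators were built with sort_within_batch=True.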

NOTE: as we are now using dropout, we must remember to use model.train() to ensure the dropout is 'turned on' while training.

In [39]:
def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_acc = 0
    
    model.train()

    for batch in iterator:
        
        optimizer.zero_grad()
        text, text_lengths = batch.text

        predictions = model(text, text_lengths).squeeze(1)

        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item() # .item() extracts the Python number from a one-element tensor

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

As we are now using dropout, we must remember to use model.eval() to ensure the dropout is 'turned off' while evaluating.
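
The effect of the two modes can be seen on a standalone dropout layer. This is a small illustration on a vector of ones, separate from the model in this notebook:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # elements are zeroed at random, survivors are scaled by 1/(1-p) = 2

dropout.eval()
eval_out = dropout(x)   # acts as the identity

print(train_out.unique())
print(torch.equal(eval_out, x))
```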

In [40]:
# evaluation needs neither an optimizer step nor backpropagation
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text

            predictions = model(text, text_lengths).squeeze(1)

            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Create a function that tells us how long an epoch takes, so we can compare training times between models.

In [41]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins*60))
    return elapsed_mins, elapsed_secs
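
A quick usage check of the function, repeated here so the snippet runs on its own. The timestamps are hypothetical values 100.4 seconds apart:

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

result = epoch_time(1000.0, 1100.4)
print(result)  # (1, 40)

# the same split into minutes and seconds via divmod on whole seconds
print(divmod(int(1100.4 - 1000.0), 60))  # (1, 40)
```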

#### Train the model multiple epochs

At each epoch, if the validation loss is the best we have seen so far, we save the parameters of the model. After training has finished, we use that saved model on the test set.

In [42]:
N_EPOCHS = 5
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iter, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')

    print('Epoch: {:02} | Epoch Time: {}m {}s'.format(epoch+1, epoch_mins, epoch_secs))
    print(f'\tTrain Loss: {train_loss:0.03f}, | Train_acc: {(train_acc*100):0.02f}')
    print(f'\tValid Loss: {valid_loss:0.03f}, | Valid_acc: {(valid_acc*100):0.02f}') 

Epoch: 01 | Epoch Time: 1m 40s
	Train Loss: 0.645, | Train_acc: 61.49
	Valid Loss: 0.540, | Valid_acc: 74.33
Epoch: 02 | Epoch Time: 1m 39s
	Train Loss: 0.541, | Train_acc: 72.68
	Valid Loss: 0.464, | Valid_acc: 78.41
Epoch: 03 | Epoch Time: 1m 39s
	Train Loss: 0.442, | Train_acc: 79.96
	Valid Loss: 0.389, | Valid_acc: 82.91
Epoch: 04 | Epoch Time: 1m 39s
	Train Loss: 0.362, | Train_acc: 84.57
	Valid Loss: 0.348, | Valid_acc: 85.27
Epoch: 05 | Epoch Time: 1m 39s
	Train Loss: 0.321, | Train_acc: 86.64
	Valid Loss: 0.329, | Valid_acc: 86.02
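
Replaying the checkpoint rule on the validation losses logged above shows which epochs triggered a save. This sketch only mirrors the comparison logic; `saved_epochs` stands in for the calls to torch.save.

```python
valid_losses = [0.540, 0.464, 0.389, 0.348, 0.329]  # from the log above

best_valid_loss = float('inf')
saved_epochs = []
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)  # stands in for torch.save(...)

print(saved_epochs, best_valid_loss)  # [1, 2, 3, 4, 5] 0.329
```

Because the validation loss improved at every epoch, the saved file holds the parameters from epoch 5.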


## Evaluate on the Test Set

In [43]:
model.load_state_dict(torch.load('tut2-model.pt'))
test_loss, test_acc = evaluate(model, test_iter, criterion)
print('Test Loss: {:0.03f} | Test Acc: {:0.02f}%'.format(test_loss, test_acc*100))

Test Loss: 0.340 | Test Acc: 85.40%


## Test with User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

When using a model for inference it should always be in evaluation mode. If this tutorial is followed step-by-step then it should already be in evaluation mode (from doing evaluate on the test set); however, we explicitly set it to avoid any risk.

Our predict_sentiment function does a few things:

1. sets the model to evaluation mode
2. tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
3. indexes the tokens by converting them into their integer representation from our vocabulary
4. gets the length of our sequence
5. converts the indexes, which are a Python list, into a PyTorch tensor
6. adds a batch dimension by unsqueezing
7. converts the length into a tensor
8. squashes the output prediction from a real number to a value between 0 and 1 with the sigmoid function
9. converts the tensor holding a single value into a Python float with the item() method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.
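
The preprocessing steps (2 to 6) can be traced without a model. The sketch below uses a hypothetical toy vocabulary and whitespace splitting in place of the spaCy tokenizer, and assumes that unknown words resolve to the '<unk>' index, as with the vocabulary built in this notebook.

```python
from collections import defaultdict

# hypothetical toy vocabulary; unseen tokens fall back to index 0 ('<unk>')
stoi = defaultdict(int, {'<unk>': 0, '<pad>': 1, 'the': 2, 'film': 3, 'was': 4, 'boring': 5})

sentence = 'the film was dreadfully boring'
tokenized = sentence.split()            # stand-in for the spaCy tokenizer
indexed = [stoi[t] for t in tokenized]  # 'dreadfully' is unknown, so it maps to 0
length = [len(indexed)]

# a nested list mimics tensor.unsqueeze(1): shape [seq_len, 1], i.e. a batch of one
column = [[i] for i in indexed]
print(indexed, length, column)
```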

In [44]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))

    return prediction.item()

In [45]:
predict_sentiment(model, "It was really boring, and I felt that I would like to stop this travel and quit the theater.")

0.03796844184398651