# Welsh Tweets Sentiment Analysis
### Anything Goes Implementation

For this implementation I used torchtext, PyTorch's NLP library. Torchtext isn't very straightforward at first, but it felt easy to use once I understood its main classes. To reach this understanding, I based much of the following code on the notebooks in the [pytorch-sentiment-analysis](https://github.com/bentrevett/pytorch-sentiment-analysis) repository.

Imports and file paths. **To use an actual test set, place the .tsv file in the `data` directory and set the `test_file` variable to its filename.**

In [1]:
import io
import os
from functools import partial  # needed by FilteredTabularDataset for dict-style fields
import torch
from torch import nn
from torchtext import data
from torchtext.data import Dataset, Example
from torchtext.utils import unicode_csv_reader

# make the experiment reproducible
SEED = 42
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

train_file = 'train.tsv'
test_file = 'train.tsv'

Since the supplied training data had invalid rows, and torchtext's `TabularDataset` class could not filter these, I declare an adapted version of that class here. The class checks each row's length and ensures it matches the number of fields provided before adding it to the dataset. 

In [2]:
class FilteredTabularDataset(Dataset):
    def __init__(self, path, format, fields, skip_header=False,
                 csv_reader_params={}, **kwargs):
        format = format.lower()
        make_example = {
            'json': Example.fromJSON, 'dict': Example.fromdict,
            'tsv': Example.fromCSV, 'csv': Example.fromCSV}[format]

        with io.open(os.path.expanduser(path), encoding="utf8") as f:
            if format == 'csv':
                reader = unicode_csv_reader(f, **csv_reader_params)
            elif format == 'tsv':
                reader = unicode_csv_reader(f, delimiter='\t', **csv_reader_params)
            else:
                reader = f

            if format in ['csv', 'tsv'] and isinstance(fields, dict):
                if skip_header:
                    raise ValueError('When using a dict to specify fields with a {} file, '
                                     'skip_header must be False and '
                                     'the file must have a header.'.format(format))
                header = next(reader)
                field_to_index = {f: header.index(f) for f in fields.keys()}
                make_example = partial(make_example, field_to_index=field_to_index)

            if skip_header:
                next(reader)
            
            # only include valid rows
            examples = [make_example(line, fields) for line in reader if len(line) == len(fields)]

        if isinstance(fields, dict):
            fields, field_dict = [], fields
            for field in field_dict.values():
                if isinstance(field, list):
                    fields.extend(field)
                else:
                    fields.append(field)

        super(FilteredTabularDataset, self).__init__(examples, fields, **kwargs)
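
The row-length check is the only change from the stock class, and it can be seen in isolation with the standard `csv` module. This is a minimal sketch: the sample rows are made up for illustration, and `n_fields` stands in for `len(fields)`.

```python
import csv
import io

# Hypothetical sample: two valid rows and two malformed ones.
raw = "1\tdiolch yn fawr\n0\n1\tbore da\textra\n0\tdim diolch\n"

n_fields = 2  # label and text, matching a two-entry fields list

reader = csv.reader(io.StringIO(raw), delimiter="\t")

# Keep only rows whose column count matches the number of fields.
valid_rows = [row for row in reader if len(row) == n_fields]

print(valid_rows)
```

Rows with too few or too many columns are dropped silently, so it may be worth counting how many rows are discarded when working with a new file.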
        
        

Here I declare the `TEXT` and `LABEL` Fields, which handle much of the tokenization work within torchtext. spaCy has no Welsh model, so my intent was to use its multi-language model; note that the cell below does not pass a tokenizer language, so torchtext's default applies. I'm including lengths so I can use packed padded sequences later.

In [3]:
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)



Each row of the dataset has two values: the first is the label and the second is the text document.

In [4]:
fields = [('label', LABEL), ('text', TEXT)]

Here I load the supplied files using my modified `TabularDataset`. I pass the fields list and the dataset will assign the values appropriately.

In [5]:
train_data, test_data = FilteredTabularDataset.splits(
                                        path = 'data',
                                        train = train_file,
                                        test = test_file,
                                        format = 'tsv',
                                        fields = fields,
)



Split the loaded training data into training and validation sets (80/20), using the random seed for reproducibility.

In [6]:
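import random

# seed Python's RNG, then pass its state so the 80/20 split is reproducible
# (random.seed() returns None, so its return value cannot be used as the state)
random.seed(SEED)
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())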
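
The idea behind a seeded split can be sketched without torchtext. This is an illustration only, not torchtext's implementation: `split_examples` is a hypothetical helper and the list of integers stands in for dataset examples.

```python
import random

SEED = 42
examples = list(range(10))  # stand-in for dataset examples


def split_examples(items, ratio, seed):
    """Shuffle a copy of items with a seeded RNG and cut it at ratio."""
    rng = random.Random(seed)  # private RNG, independent of global state
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train_a, valid_a = split_examples(examples, 0.8, SEED)
train_b, valid_b = split_examples(examples, 0.8, SEED)

# The same seed yields the same partition on every call.
print(train_a == train_b, len(train_a), len(valid_a))
```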

Here I attempt to load pretrained Welsh word embeddings, which I found through [the fastText site](https://fasttext.cc/docs/en/crawl-vectors.html).

In [7]:
import torchtext.vocab as vocab

custom_embeddings = vocab.FastText(language='cy', unk_init = torch.Tensor.normal_)

print(custom_embeddings.vectors)

tensor([[-0.1013,  0.0229,  0.0199,  ...,  0.0170,  0.0220, -0.0518],
        [-0.0333, -0.1484, -0.1136,  ..., -0.4232,  0.0796, -0.0488],
        [-0.1061,  0.0483,  0.1382,  ..., -0.0362,  0.2384, -0.0137],
        ...,
        [ 0.3927, -0.0973,  0.1411,  ..., -0.1604,  0.2125, -0.4878],
        [-0.0190, -0.1905,  0.1613,  ...,  0.2480,  0.2578, -0.1305],
        [ 0.1491, -0.1887, -0.0913,  ..., -0.1866, -0.0556, -0.0979]])


Here I declare the maximum number of words to include in my vocabulary and attach the Welsh word embeddings. I set `unk_init` to `torch.Tensor.normal_` when loading the vectors above, so words not found in the pretrained embeddings are initialized from a Gaussian distribution.

In [8]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data,
                 vectors = custom_embeddings,
                 max_size = MAX_VOCAB_SIZE)
print(TEXT.vocab.vectors)

LABEL.build_vocab(train_data)

tensor([[ 1.9269,  1.4873,  0.9007,  ..., -2.1268, -0.1341, -1.0408],
        [ 0.7694,  2.5574,  0.5716,  ..., -0.9120,  0.3682,  0.7050],
        [-1.2299,  2.0929,  0.8758,  ..., -0.9745,  0.9817,  0.5837],
        ...,
        [ 0.9878,  1.3393, -0.2436,  ...,  0.5269,  0.1220, -1.0843],
        [ 1.4220, -0.0306, -0.0236,  ..., -1.5861, -0.2060, -2.1624],
        [-0.1431, -0.0774,  0.1979,  ...,  0.3322,  0.3148,  0.2115]])
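
How a size cap shapes the vocabulary can be sketched in plain Python. This is a simplified illustration, not torchtext's code: `build_vocab` is a hypothetical helper, the documents are made up, and it assumes (as I understand torchtext's behaviour) that the special tokens come first and do not count towards the cap.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_size, specials=("<unk>", "<pad>")):
    """Map the max_size most frequent tokens to indices after the specials."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Most frequent first; ties broken alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


docs = [["bore", "da"], ["nos", "da"], ["bore", "da", "iawn"]]
itos, stoi = build_vocab(docs, max_size=2)

print(itos)
# Tokens cut off by the cap fall back to the <unk> index.
print(stoi.get("nos", stoi["<unk>"]))
```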


Set the batch size and assign the GPU if one is available.

I then create an iterator for each dataset to act as the dataloader. Since I'm using packed padded sequences, the documents within each batch have to be sorted by length, hence the `sort_key` and `sort_within_batch` arguments.

In [9]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
    sort_within_batch = True,
    device = device)
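
What sorting within a batch produces can be sketched in plain Python. This is an illustration under assumptions, not the iterator's implementation: `make_batch` is a hypothetical helper and the token lists are made up.

```python
PAD = "<pad>"


def make_batch(docs):
    """Sort docs longest-first and pad them to a common length."""
    ordered = sorted(docs, key=len, reverse=True)
    lengths = [len(d) for d in ordered]
    max_len = lengths[0]
    padded = [d + [PAD] * (max_len - len(d)) for d in ordered]
    return padded, lengths


batch = [["da"], ["bore", "da", "iawn"], ["nos", "da"]]
padded, lengths = make_batch(batch)

print(padded)
print(lengths)
```

The `lengths` list is what later lets the packing step skip the padded positions.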



After the embedding layer, I pack the sequences and feed them to a multi-layer bidirectional LSTM. This part took me a while to wrap my head around, but it is responsible for a significant accuracy boost. I also included a dropout layer to reduce overfitting, though I had to tune the dropout rate for a while before it became effective.

In [10]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        self.rnn = nn.LSTM(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
    
    def forward(self, text, text_lengths):
        embedded = self.dropout(self.embedding(text))
        #pack sequence (recent PyTorch versions require the lengths to be a CPU tensor)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))

        return self.fc(hidden)
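
The shapes involved in packing and in the final concatenation are easier to follow on a tiny example. The sketch below uses made-up dimensions and random inputs, and it assumes the default `batch_first=False` layout with lengths already sorted longest-first.

```python
import torch
from torch import nn

torch.manual_seed(0)

seq_len, batch_size, emb_dim, hidden_dim, n_layers = 5, 3, 8, 4, 2

# [seq_len, batch, emb_dim]; positions past each length count as padding
embedded = torch.randn(seq_len, batch_size, emb_dim)
lengths = torch.tensor([5, 3, 2])  # sorted longest-first, kept on the CPU

lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
packed_output, (hidden, cell) = lstm(packed)
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

# hidden is [n_layers * 2, batch, hidden_dim]; the last two slices are the
# top layer's forward and backward final states.
final = torch.cat((hidden[-2], hidden[-1]), dim=1)

print(tuple(hidden.shape), tuple(output.shape), tuple(final.shape))
```

The concatenated state has `hidden_dim * 2` features per example, which is why the linear layer's input size is doubled when the LSTM is bidirectional.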


Declare hyperparameters and dimensions. **`EMBEDDING_DIM` must reflect the pretrained embeddings' dimension.**

In [11]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 4
BIDIRECTIONAL = True
DROPOUT = 0.7
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM,
           EMBEDDING_DIM,
           HIDDEN_DIM,
           OUTPUT_DIM,
           N_LAYERS,
           BIDIRECTIONAL,
           DROPOUT,
           PAD_IDX)

Assign the model's embedding layer weights to those of the pretrained embeddings.

In [12]:
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ..., -2.1268, -0.1341, -1.0408],
        [ 0.7694,  2.5574,  0.5716,  ..., -0.9120,  0.3682,  0.7050],
        [-1.2299,  2.0929,  0.8758,  ..., -0.9745,  0.9817,  0.5837],
        ...,
        [ 0.9878,  1.3393, -0.2436,  ...,  0.5269,  0.1220, -1.0843],
        [ 1.4220, -0.0306, -0.0236,  ..., -1.5861, -0.2060, -2.1624],
        [-0.1431, -0.0774,  0.1979,  ...,  0.3322,  0.3148,  0.2115]])

Zero the embedding rows at `UNK_IDX` and `PAD_IDX` so the initial vectors for unknown and padding tokens carry no signal. Print the weights to confirm that the first two rows are all zeros.

In [13]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-1.2299,  2.0929,  0.8758,  ..., -0.9745,  0.9817,  0.5837],
        ...,
        [ 0.9878,  1.3393, -0.2436,  ...,  0.5269,  0.1220, -1.0843],
        [ 1.4220, -0.0306, -0.0236,  ..., -1.5861, -0.2060, -2.1624],
        [-0.1431, -0.0774,  0.1979,  ...,  0.3322,  0.3148,  0.2115]])


I use an Adam optimizer with its default learning rate (1e-3) and binary cross-entropy loss with logits.

In [14]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
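
What "binary cross-entropy with logits" computes can be sketched with the standard library. This is a naive version for illustration only: `bce_with_logits` is a hypothetical helper, and PyTorch's own loss uses a numerically more stable formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw, pre-sigmoid scores."""
    losses = []
    for x, y in zip(logits, targets):
        p = sigmoid(x)  # predicted probability of the positive class
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


# A logit of 0 means "50/50", so the loss is ln 2 whatever the label is.
loss = bce_with_logits([0.0, 0.0], [1.0, 0.0])
print(round(loss, 4))  # 0.6931
```

This gives a useful reference point: a model that outputs a logit of 0 for every example has a loss of about 0.693, so losses below that value indicate the model has learned something.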

Send the model and the loss to the GPU if one is being used.

In [15]:
model = model.to(device)
criterion = criterion.to(device)

Accuracy function: apply a sigmoid to the logits, round to 0 or 1, and compare with the labels.

In [16]:
def accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
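
A small worked example shows what the function returns. The logits and labels below are made up; the last two lines check that rounding the sigmoid gives the same predictions as thresholding the raw logits at zero.

```python
import torch


def accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)


logits = torch.tensor([2.0, -1.0, 0.3, -0.2])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

acc = accuracy(logits, labels)
print(acc.item())  # 0.75: the third example is predicted positive but labelled 0

# Equivalent view: sigmoid(x) rounds to 1 exactly when x > 0.
thresholded = (logits > 0).float()
same = torch.equal(thresholded, torch.round(torch.sigmoid(logits)))
```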

Standard `train` and `evaluate` functions.

In [17]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        acc = accuracy(predictions, batch.label)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [18]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            acc = accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Run for 10 epochs, saving the best model's weights for use with the test data.

In [19]:
N_EPOCHS = 10

# initialize best loss as infinity
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'anything_goes.pt')
    
    print(f'Epoch: {epoch+1}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
    
    



Epoch: 1
	Train Loss: 0.615 | Train Acc: 65.52%
	 Val. Loss: 0.540 |  Val. Acc: 72.09%
Epoch: 2
	Train Loss: 0.543 | Train Acc: 72.21%
	 Val. Loss: 0.523 |  Val. Acc: 72.36%
Epoch: 3
	Train Loss: 0.505 | Train Acc: 75.00%
	 Val. Loss: 0.498 |  Val. Acc: 75.70%
Epoch: 4
	Train Loss: 0.480 | Train Acc: 76.75%
	 Val. Loss: 0.495 |  Val. Acc: 75.91%
Epoch: 5
	Train Loss: 0.460 | Train Acc: 78.06%
	 Val. Loss: 0.499 |  Val. Acc: 76.18%
Epoch: 6
	Train Loss: 0.445 | Train Acc: 79.09%
	 Val. Loss: 0.506 |  Val. Acc: 76.26%
Epoch: 7
	Train Loss: 0.431 | Train Acc: 80.01%
	 Val. Loss: 0.504 |  Val. Acc: 75.85%
Epoch: 8
	Train Loss: 0.417 | Train Acc: 80.85%
	 Val. Loss: 0.523 |  Val. Acc: 76.17%
Epoch: 9
	Train Loss: 0.405 | Train Acc: 81.34%
	 Val. Loss: 0.562 |  Val. Acc: 75.92%
Epoch: 10
	Train Loss: 0.393 | Train Acc: 82.15%
	 Val. Loss: 0.523 |  Val. Acc: 76.04%
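
Replaying the checkpoint rule on the validation losses printed above shows which epoch's weights end up in `anything_goes.pt`. The sketch only re-runs the comparison on the logged values, without any training; `saved_epochs` is a name introduced for illustration.

```python
# Validation losses copied from the log above, epochs 1 to 10.
valid_losses = [0.540, 0.523, 0.498, 0.495, 0.499,
                0.506, 0.504, 0.523, 0.562, 0.523]

best_valid_loss = float("inf")
best_epoch = None
saved_epochs = []  # epochs at which the checkpoint would be overwritten

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch
        saved_epochs.append(epoch)

print(best_epoch, best_valid_loss)  # 4 0.495
print(saved_epochs)                 # [1, 2, 3, 4]
```

Validation loss stops improving after epoch 4 while training loss keeps falling, which suggests the later epochs mostly overfit.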


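Test on the best model. As configured above, `test_file` points at `train.tsv`, so unless a separate test file is supplied this score is measured on the training file.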

In [20]:
model.load_state_dict(torch.load('anything_goes.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.424 | Test Acc: 80.19%


With a 4-layer bidirectional LSTM and a 0.7 dropout rate, my peak validation accuracy was 76.58% (80/20 train/val split). I think this could be higher if I can get the Welsh pretrained embeddings working.