# Problem 1


As you can see, `gensim` provides an object that can be keyed with a given word to return the word vector. We loaded the `glove-wiki-gigaword-100` word vectors, which were trained on the combination of the Gigaword and Wikipedia datasets using the GloVe algorithm, and are 100-dimensional word vectors.

For this problem, you should
1. Implement a function that computes the average of the word vectors for a given sentence.
2. Get the average word vectors for every sentence in the training and test sets.
3. Train a logistic regression model to predict the sentiment label (0 or 1) using the average word vectors as input.
4. Evaluate its performance on the test set.

You are welcome to use whatever approach/framework you want to build and train the logistic regression model. The textbook has an example implementation that you can use [here](http://d2l.ai/chapter_linear-networks/softmax-regression-concise.html).
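
As a rough illustration of what the logistic regression step involves, here is a minimal sketch in plain NumPy, trained by full-batch gradient descent on binary cross-entropy. The function names, the learning rate, the epoch count, and the synthetic data are all assumptions made for this example; in the assignment, `X` would hold the sentence vectors and `y` the 0/1 sentiment labels, and a framework model such as a single `nn.Linear` layer would work equally well.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by full-batch gradient descent on binary cross-entropy."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        grad = p - y                             # gradient of the loss w.r.t. each logit
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """Predict label 1 when the logit is positive, i.e. P(label = 1) > 0.5."""
    return (X @ w + b > 0).astype(int)

# Synthetic, linearly separable stand-in data (hypothetical, not SST-2).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = (X @ true_w > 0).astype(float)

# Train on the first 150 rows, evaluate on the remaining 50.
w, b = train_logistic_regression(X[:150], y[:150])
accuracy = (predict(X[150:], w, b) == y[150:]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Evaluating on rows the model never saw during training mirrors step 4 of the problem, where accuracy is reported on the test set only.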

Note that for the first step, you will have to handle out-of-vocabulary words in some way, since the word vector collection does not include every word in the SST-2 dataset. My simple recommendation is just to ignore out-of-vocabulary words completely when taking the average across word vectors for a given sentence.
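
The recommendation above can be sketched as follows. This is only an illustration: `average_word_vector` and `toy_vectors` are hypothetical names, and the small dictionary stands in for the `gensim` word-vector object, which supports the same `in` and `[...]` operations. Returning a zero vector for a sentence with no known words is an assumed fallback, not a requirement.

```python
import numpy as np

def average_word_vector(sentence, vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zeros if no token is known."""
    tokens = sentence.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]  # skip out-of-vocabulary tokens
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)  # average across words, keeping the vector dimension

# Toy 3-dimensional "embeddings" standing in for the gensim word vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "movie": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

avg = average_word_vector("Good movie zzzunknown", toy_vectors, dim=3)
print(avg)  # [2. 1. 1.]
```

The unknown token contributes nothing to the average, so the result is the mean of the two known vectors only.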

After training a reasonable model on `glove-wiki-gigaword-100`, try a different set of pre-trained word vectors. You can find a list of available pre-trained word vectors [here](https://github.com/RaRe-Technologies/gensim-data#models). Report the difference in accuracy between the different pre-trained word vectors and make a guess as to why one works better or worse than the other.

In [None]:
import numpy as np
import random
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import unicodedata as uni
from gensim.utils import simple_preprocess
import gensim.downloader as api

import warnings
warnings.filterwarnings('ignore')

seedy = 666
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
print("Device available for running: %s" % device)

In [None]:
def load_sst2_data(filename):
    with open(filename) as f:
        data = [(l[2:].strip(), int(l[0])) for l in f.readlines()]
    return tuple(zip(*data))

sentences_train, labels_train = load_sst2_data("stsa.binary.train.txt")
sentences_test, labels_test = load_sst2_data("stsa.binary.test.txt")
# 6920 training samples
# 1821 test samples

In [None]:
word_vectors = api.load("glove-wiki-gigaword-100")

In [None]:
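def get_avg_vec(sentence):
    # Strip accents/non-ASCII characters so tokens match the GloVe vocabulary.
    clean_sentence = uni.normalize("NFKD", sentence).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    words = [x.lower() for x in clean_sentence.split()]
    # Ignore out-of-vocabulary words instead of failing on the whole sentence.
    vecs = [word_vectors.get_vector(x) for x in words if x in word_vectors]
    if not vecs:
        # No known words: fall back to the zero vector.
        return np.zeros(word_vectors.vector_size, dtype=np.float32)
    # Average over words (axis=0) so the result is a vector, not a scalar.
    return np.mean(vecs, axis=0)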

In [None]:
s = time.time()
avg_vecs_train = [get_avg_vec(x) for x in sentences_train]
print("%5.3f mins" % ((time.time()-s)/60))

# Problem 2

In this problem, instead of representing each sentence as an average of the word vectors for the words in the sentence, you will get sentence vectors from a pre-trained BERT model. `gluonnlp` provides a handy interface for loading a pre-trained BERT model; check it out [here](https://nlp.gluon.ai/model_zoo/bert/index.html). A popular choice for a `pytorch` implementation is [Hugging Face Transformers](https://huggingface.co/transformers/). I would recommend using the DistilBERT model (called `distil_book_corpus_wiki_en_uncased` in `gluonnlp` and `distilbert-base-uncased` in Hugging Face `transformers`). DistilBERT is a smaller (and more computationally efficient) version of BERT that gets reasonable performance. In this problem, you will use BERT in two ways: either to get fixed sentence representations for each sentence, or via fine-tuning the full model (as is most common in transfer learning).

1. Replace the average-word-vector representation you used from the first problem with the CLS token representation for each sentence from DistilBERT. Then, train a small logistic regressor on top of these new vector representations and report the performance.
2. Fine-tune all of BERT's parameters on the SST-2 dataset. [Here](https://nlp.gluon.ai/examples/sentence_embedding/bert.html) is a tutorial for `gluonnlp`, [here](https://huggingface.co/transformers/training.html) is one for `transformers`. Note that you may need to modify the tutorial code somewhat (for example, the `gluonnlp` example focuses on sentence-pair classification rather than sentence classification).
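
For the first option, one possible way to pull out the CLS representation with Hugging Face `transformers` is sketched below. The helper name `cls_embeddings` and the tiny randomly initialised configuration are assumptions made so the example runs without downloading weights; for the assignment the encoder would presumably be loaded with `DistilBertModel.from_pretrained("distilbert-base-uncased")` and fed real tokenizer output.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

def cls_embeddings(model, input_ids, attention_mask):
    """Return the hidden state at position 0 ([CLS]) for each sequence in the batch."""
    model.eval()
    with torch.no_grad():  # the encoder stays frozen, so no gradients are needed
        hidden = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    return hidden[:, 0, :]  # shape: (batch_size, hidden_dim)

# Tiny randomly initialised encoder (hypothetical sizes) so the sketch needs no download.
config = DistilBertConfig(vocab_size=100, dim=32, n_layers=1, n_heads=2,
                          hidden_dim=64, max_position_embeddings=16)
tiny_model = DistilBertModel(config)

input_ids = torch.randint(1, 100, (4, 10))   # 4 fake sentences, 10 token ids each
attention_mask = torch.ones_like(input_ids)  # no padding in this toy batch
features = cls_embeddings(tiny_model, input_ids, attention_mask)
print(features.shape)  # torch.Size([4, 32])
```

With the real `distilbert-base-uncased` weights the feature dimension would be 768 rather than 32, and the resulting matrix can be passed to the same kind of logistic regression used in Problem 1.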

Which worked better? Note that [state-of-the-art performance](https://gluebenchmark.com/leaderboard) on SST-2 is about 98%, and BERT's reported performance is about 95%. How close are you?
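
To compare against those numbers, accuracy has to be computed on the test set. A minimal sketch of a metric function is shown below; it assumes the `(logits, labels)` pair that the `transformers` `Trainer` hands to its `compute_metrics` callback, and the toy logits are made-up values used only to exercise the function.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # predicted class = largest logit
    return {"accuracy": float((predictions == labels).mean())}

# Hand-made logits for four examples (hypothetical values).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.0, 0.2]])
toy_labels = np.array([0, 1, 0, 0])
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should make `trainer.evaluate()` report this value alongside the evaluation loss.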

In [None]:
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

import torch.utils.data as data_utils


In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
LIL_BERT = 'distilbert-base-uncased'

model = DistilBertForSequenceClassification.from_pretrained(LIL_BERT).to(device)

tokenizer = DistilBertTokenizerFast.from_pretrained(LIL_BERT, cls_token='[CLS]')

encodings_train = tokenizer(list(sentences_train), truncation=True, padding=True, return_tensors='pt')
encodings_test = tokenizer(list(sentences_test), truncation=True, padding=True, return_tensors='pt')

dataset_train = MyDataset(encodings_train, labels_train)
dataset_test = MyDataset(encodings_test, labels_test)

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
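# AdamW comes from torch.optim (imported above as `optim`). Trainer builds its own
# optimizer by default, so this one is only needed for a manual training loop.
optimizer = optim.AdamW(model.parameters(), lr=1e-5)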

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_test             # evaluation dataset
)

trainer.train()

In [None]:
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# encodings_train = tokenizer(list(sentences_train), truncation=True, padding=True, return_tensors='pt')
# encodings_test = tokenizer(list(sentences_test), truncation=True, padding=True, return_tensors='pt')

In [None]:
type(encodings_train)

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_test            # evaluation dataset
)

In [None]:
LIL_BERT = 'distilbert-base-uncased'

model = DistilBertForSequenceClassification.from_pretrained(LIL_BERT).to(device)
# model.train()

optimizer = optim.AdamW(model.parameters(), lr=1e-5)  # AdamW from torch.optim

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained(LIL_BERT, cls_token='[CLS]')

encodings_train = tokenizer(list(sentences_train), truncation=True, padding=True, return_tensors='pt')
encodings_test = tokenizer(list(sentences_test), truncation=True, padding=True, return_tensors='pt')

# attention_mask = encodings_train['attention_mask'].to(device)

In [None]:
# Trainer expects dict-style items (input_ids, attention_mask, labels),
# so wrap the full encodings in MyDataset instead of a TensorDataset of input_ids.
y_train = torch.tensor(labels_train)
dataset_train = MyDataset(encodings_train, y_train)

y_test = torch.tensor(labels_test)
dataset_test = MyDataset(encodings_test, y_test)

In [None]:
# from transformers import BertTokenizer, glue_convert_examples_to_features
# data = tfds.load('glue/mrpc')

# train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')

In [None]:
train_dataset

In [None]:
encodings_train

In [None]:
len(encodings_train)

In [None]:
encodings_train

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
DataCollatorWithPadding

In [None]:
type(dataset_test)

In [None]:
training_args = TrainingArguments(
    output_dir='./hw10_results/training_results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,            # strength of weight decay
    logging_dir='./hw10_results/logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_test            # evaluation dataset
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
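# One manual optimization step on a small batch of the training encodings.
input_ids = encodings_train['input_ids'][:16].to(model.device)
attention_mask = encodings_train['attention_mask'][:16].to(model.device)
optimizer.zero_grad()
outputs = model(input_ids, attention_mask=attention_mask, labels=y_train[:16].to(model.device))
loss = outputs.loss
loss.backward()
optimizer.step()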

In [None]:
LIL_BERT = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizerFast.from_pretrained(LIL_BERT, do_lower_case=True, cls_token='[CLS]')

encodings_train = tokenizer(list(sentences_train), truncation=True, padding=True, return_tensors='pt')
encodings_test = tokenizer(list(sentences_test), truncation=True, padding=True, return_tensors='pt')

In [None]:
encodings_train[1]

In [None]:
len(encodings_train)

In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Create a function to tokenize a set of texts
def preprocessing_for_bert(data, max_len=64):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @param    max_len (int): Length to truncate/pad to (64 is an assumed default).
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=sent,                      # Raw sentence; no separate cleaning helper is defined here
            add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            max_length=max_len,             # Max length to truncate/pad
            padding='max_length',           # Pad sentence to max length
            truncation=True,                # Truncate sentence to max length
            return_attention_mask=True      # Return attention mask
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

In [None]:
# Logistic-regression head: 768-d DistilBERT [CLS] vector in, 2 sentiment classes out.
net = nn.Sequential(nn.Flatten(), nn.Linear(768, 2))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights);