# Problem 1


As you can see, `gensim` provides an object that can be keyed with a given word to return the word vector. We loaded the `glove-wiki-gigaword-100` word vectors, which were trained on the combination of the Gigaword and Wikipedia datasets using the GloVe algorithm, and are 100-dimensional word vectors.

For this problem, you should:
1. Implement a function that computes the average of the word vectors for a given sentence.
2. Get the average word vectors for every sentence in the training and test sets.
3. Train a logistic regression model to predict the sentiment label (0 or 1) using the average word vectors as input.
4. Evaluate its performance on the test set.

You are welcome to use whatever approach/framework you want to build and train the logistic regression model. The textbook has an example implementation that you can use [here](http://d2l.ai/chapter_linear-networks/softmax-regression-concise.html).

Note that for the first step, you will have to handle out-of-vocabulary words in some way, since the word vector collection does not include every word in the SST-2 dataset. My simple recommendation is just to ignore out-of-vocabulary words completely when taking the average across word vectors for a given sentence.
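
Written out, and assuming out-of-vocabulary words are simply dropped, the sentence representation and the classifier look roughly as follows. The notation is introduced here for illustration only: $w_1, \dots, w_n$ are the tokens of sentence $s$ that appear in the vocabulary, $\mathbf{v}_{w}$ is the pre-trained vector for word $w$, and $\boldsymbol{\theta}$ and $b$ are the learned weights and bias.

$$
\bar{\mathbf{x}}(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i},
\qquad
P(y = 1 \mid s) = \sigma\!\left(\boldsymbol{\theta}^\top \bar{\mathbf{x}}(s) + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

The average is undefined when $n = 0$ (no token of the sentence is in the vocabulary), so an implementation needs some fallback for that case, for example a zero vector.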

After training a reasonable model on `glove-wiki-gigaword-100`, try a different set of word vectors than the `glove-wiki-gigaword-100` model. You can find a list of available pre-trained word vectors [here](https://github.com/RaRe-Technologies/gensim-data#models). Report the difference in accuracy between the different pre-trained word vectors and make a guess as to why one works better or worse than the other.
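
One plausible factor when comparing embedding sets is vocabulary coverage: the more tokens fall outside the vocabulary, the fewer vectors contribute to each sentence average. The sketch below measures the out-of-vocabulary rate for any collection of known words. The helper name `oov_rate` and the toy vocabularies are hypothetical, and lowercased whitespace tokenization is an assumption.

```python
def oov_rate(sentences, vocab):
    """Return the fraction of lowercased whitespace tokens not found in `vocab`."""
    total = 0
    missing = 0
    for sentence in sentences:
        for token in sentence.lower().split():
            total += 1
            if token not in vocab:
                missing += 1
    # Avoid division by zero for an empty corpus.
    return missing / total if total else 0.0


# Toy stand-ins for two embedding vocabularies (assumption: a real one would be
# something like `word_vectors.key_to_index` from gensim).
small_vocab = {"a", "good", "movie"}
large_vocab = {"a", "good", "movie", "truly", "bad"}
toy_sentences = ["A truly good movie", "a bad movie"]

print(oov_rate(toy_sentences, small_vocab))
print(oov_rate(toy_sentences, large_vocab))
```

With the notebook's objects this could be called as `oov_rate(sentences_train, word_vectors.key_to_index)`. Coverage is only one candidate explanation; vector dimensionality and the training corpus are other differences worth considering.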

In [20]:
import numpy as np
import random
import time
from d2l import torch as d2l

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score


import torch.utils.data as data_utils

import unicodedata as uni
import gensim.downloader as api

import warnings
warnings.filterwarnings('ignore')

seedy = 666
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Device available for running: %s" % device)

Device available for running: cuda:0


In [21]:
def load_sst2_data(filename):
    with open(filename) as f:
        data = [(l[2:].strip(), int(l[0])) for l in f.readlines()]
    return tuple(zip(*data))

sentences_train, labels_train = load_sst2_data("stsa.binary.train.txt")
sentences_test, labels_test = load_sst2_data("stsa.binary.test.txt")

# 6920 training samples
# 1821 test samples

word_vectors = api.load("glove-wiki-gigaword-100")
print("Vocab size:",len(word_vectors.key_to_index.keys()))

Vocab size: 400000


## <font color=magenta> 1. Implement a function that computes the average of the word vectors for a given sentence.

In [28]:
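def get_avg_vec(sentence):
    # Strip accents and other non-ASCII characters before tokenizing.
    clean_sentence = uni.normalize("NFKD", sentence).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Lowercase, split on whitespace, and ignore out-of-vocabulary words.
    words = [w for w in clean_sentence.lower().split() if w in word_vectors.key_to_index]

    # If no word is in the vocabulary, fall back to a zero vector
    # instead of failing in np.vstack on an empty list.
    if not words:
        return np.zeros(word_vectors.vector_size, dtype=np.float32)

    vec_array = np.vstack([word_vectors.get_vector(w) for w in words])
    avg_vec = np.mean(vec_array, axis=0)
    return avg_vec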

## <font color=magenta> 2. Get the average word vectors for every sentence in the training and test sets.


In [29]:
X_train = np.array([get_avg_vec(x) for x in sentences_train])
X_test = np.array([get_avg_vec(x) for x in sentences_test])

y_train = np.array(list(labels_train))
y_test = np.array(list(labels_test))

print("Shapes:",X_train.shape, y_train.shape)

Shapes: (6920, 100) (6920,)


## <font color=magenta> 3. Train a logistic regression model to predict the sentiment label (0 or 1) using the average word vectors as input.


In [None]:
logistic_reg = LogisticRegression(solver="liblinear", random_state=seedy, max_iter=10000)  # n_jobs has no effect with liblinear
logistic_reg.fit(X_train,y_train)

## <font color=magenta> 4. Evaluate its performance on the test set. 

In [None]:
y_pred = logistic_reg.predict(X_test)

print("Accuracy: ",round(accuracy_score(y_test,y_pred),3))
print("F1: ",round(f1_score(y_test, y_pred),3))

# Problem 2

In this problem, instead of representing each sentence as an average of the word vectors for the words in the sentence, you will get sentence vectors from a pre-trained BERT model. `gluonnlp` provides a handy interface for loading a pre-trained BERT model; check it out [here](https://nlp.gluon.ai/model_zoo/bert/index.html). A popular choice for a `pytorch` implementation is [Hugging Face Transformers](https://huggingface.co/transformers/). I would recommend using the DistilBERT model (called `distil_book_corpus_wiki_en_uncased` in `gluonnlp` and `distilbert-base-uncased` in Hugging Face `transformers`). DistilBERT is a smaller (and more computationally efficient) version of BERT that gets reasonable performance. In this problem, you will use BERT in two ways: either to get fixed sentence representations for each sentence, or via fine-tuning the full model (as is most common in transfer learning).

1. Replace the average-word-vector representation you used from the first problem with the CLS token representation for each sentence from DistilBERT. Then, train a small logistic regressor on top of these new vector representations and report the performance.
2. Fine-tune all of BERT's parameters on the SST-2 dataset. [Here](https://nlp.gluon.ai/examples/sentence_embedding/bert.html) is a tutorial for `gluonnlp`, [here](https://huggingface.co/transformers/training.html) is one for `transformers`. Note that you may need to modify the tutorial code somewhat (for example, the `gluonnlp` example focuses on sentence-pair classification rather than sentence classification).

Which worked better? Note that [state-of-the-art performance](https://gluebenchmark.com/leaderboard) on SST-2 is about 98%, and BERT's reported performance is about 95%. How close are you?

In [22]:
from transformers import DistilBertTokenizerFast
from transformers import AdamW
from transformers import DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

import torch.utils.data as data_utils

## <font color=magenta> 1. Replace the average-word-vector representation you used from the first problem with the CLS token representation for each sentence from DistilBERT. Then, train a small logistic regressor on top of these new vector representations and report the performance.

In [23]:
LIL_BERT = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizerFast.from_pretrained(LIL_BERT, cls_token='[CLS]')

encodings_train = tokenizer(list(sentences_train), truncation=True, padding=True, return_tensors='pt')
encodings_test = tokenizer(list(sentences_test), truncation=True, padding=True, return_tensors='pt')

In [24]:
model = DistilBertForSequenceClassification.from_pretrained(LIL_BERT).to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [41]:
from tqdm.notebook import tqdm

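input_train = encodings_train["input_ids"]
print(len(input_train))

cls_encodings_train = []
with torch.no_grad():
    for x in tqdm(input_train, total=len(input_train)):
        inputs = x.unsqueeze(0).to(device)  # add a batch dimension: (1, seq_len)
        out = model(inputs)
        # NOTE: `model` is DistilBertForSequenceClassification, so out[0] is the
        # (1, 2) logits tensor from the untrained classification head, not the
        # 768-dimensional [CLS] hidden state. The attention mask is also not
        # passed, so padding tokens are attended to.
        cls_embeddings = out[0][0]
        cls_encodings_train.append(cls_embeddings.cpu().detach().numpy())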

6920


In [42]:
input_test = encodings_test["input_ids"]
print(len(input_test))

cls_encodings_test = []
with torch.no_grad():
    for x in tqdm(input_test, total=len(input_test)):
        inputs = x.unsqueeze(0).to(device)  # add a batch dimension: (1, seq_len)
        out = model(inputs)
        # NOTE: out[0] is the (1, 2) logits tensor of the classification head,
        # not the [CLS] hidden state.
        cls_embeddings = out[0][0]
        cls_encodings_test.append(cls_embeddings.cpu().detach().numpy())

1821


In [47]:
X_train = np.vstack(cls_encodings_train)
X_test = np.vstack(cls_encodings_test)

array([[-0.15392572, -0.02882835],
       [-0.13375531, -0.04288049],
       [-0.15778652, -0.03006012],
       [-0.14594616, -0.03129875],
       [-0.1469923 , -0.04183166],
       [-0.11734855, -0.04717905],
       [-0.14698863, -0.03894245],
       [-0.13506207, -0.03972663],
       [-0.14023195, -0.02553444],
       [-0.14346951, -0.04786361]], dtype=float32)

In [53]:
logistic_reg = LogisticRegression(random_state = seedy, max_iter=10000, n_jobs=-1)
logistic_reg.fit(X_train,y_train)

LogisticRegression(max_iter=10000, n_jobs=-1, random_state=666)

In [54]:
y_pred = logistic_reg.predict(X_test)

print("Accuracy: ",round(accuracy_score(y_test,y_pred),3))
print("F1: ",round(f1_score(y_test, y_pred),3))

Accuracy:  0.526
F1:  0.653
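
An accuracy close to 0.5 on two-dimensional features is what one would expect if the inputs were the logits of a freshly initialized classification head rather than the `[CLS]` hidden state. A minimal sketch of extracting the hidden state instead is below, assuming a Hugging Face encoder whose output exposes `last_hidden_state` (as `DistilBertModel` does). The helper name `cls_embeddings` and the tiny randomly initialized configuration are hypothetical and are there only so the example runs without downloading weights; nothing here says what accuracy the real features would reach.

```python
import numpy as np
import torch
from transformers import DistilBertConfig, DistilBertModel


def cls_embeddings(encoder, input_ids, attention_mask, batch_size=32):
    """Return the last-layer hidden state at position 0 ([CLS]) for each row."""
    device = next(encoder.parameters()).device
    encoder.eval()  # disable dropout so repeated calls give the same vectors
    chunks = []
    with torch.no_grad():
        for start in range(0, input_ids.shape[0], batch_size):
            ids = input_ids[start:start + batch_size].to(device)
            mask = attention_mask[start:start + batch_size].to(device)
            out = encoder(input_ids=ids, attention_mask=mask)
            # last_hidden_state has shape (batch, seq_len, hidden); index 0 is [CLS].
            chunks.append(out.last_hidden_state[:, 0].cpu().numpy())
    return np.concatenate(chunks, axis=0)


# Tiny random encoder (assumption: stands in for
# DistilBertModel.from_pretrained("distilbert-base-uncased")).
config = DistilBertConfig(vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64)
toy_encoder = DistilBertModel(config)

toy_ids = torch.randint(0, 100, (5, 7))   # 5 fake sentences, 7 token ids each
toy_mask = torch.ones_like(toy_ids)       # no padding in this toy batch
features = cls_embeddings(toy_encoder, toy_ids, toy_mask, batch_size=2)
print(features.shape)
```

With the notebook's tensors the call would be roughly `cls_embeddings(DistilBertModel.from_pretrained(LIL_BERT).to(device), encodings_train["input_ids"], encodings_train["attention_mask"])`, and the resulting 768-dimensional rows can be passed to `LogisticRegression` as before.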


In [None]:
encodings_train["input_ids"].to(device)

In [None]:
token_ids = encodings_train['input_ids'][:8].to(device)        # a small batch keeps memory use low
attn_mask = encodings_train['attention_mask'][:8].to(device)
with torch.no_grad():
    outputs = model(token_ids, attention_mask=attn_mask)

In [None]:
encodings_train.keys()

In [None]:
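# DistilBERT takes no token_type_ids, and the classification model returns only logits,
# so call the underlying encoder (model.distilbert) to get the hidden states.
encoder_out = model.distilbert(token_ids.to(device), attention_mask=attn_mask.to(device))
hidden_reps = encoder_out.last_hidden_state   # hidden state of each token in the input sequence
cls_head = hidden_reps[:, 0]                  # hidden state of the [CLS] token
print(type(hidden_reps))
print(hidden_reps.shape)
print(cls_head.shape)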

output:
hidden_reps size 
torch.Size([1, 15, 768])

cls_head size
torch.Size([1, 768])

In [None]:
len(encodings_train)
encodings_train.keys()

In [None]:
encodings_train['input_ids'][:10]

In [None]:
outputs[0][0]

In [None]:
tokenizer.cls_token_id

In [None]:
X_train = encodings_train["input_ids"]
_X_test = encodings_test["input_ids"]

## Need to add more padding to X_test so it has as many columns as X_train
pad_len = X_train.shape[1] - _X_test.shape[1]   # number of extra [PAD] columns needed
X_test = F.pad(_X_test, (0, pad_len), value=tokenizer.pad_token_id)
    
y_train = np.array(list(labels_train))
y_test = np.array(list(labels_test))

print("Shapes:",X_train.shape, y_train.shape)
print("Shapes:",X_test.shape, y_test.shape)

In [None]:
logistic_reg = LogisticRegression(solver="liblinear", random_state=seedy, max_iter=10000)  # n_jobs has no effect with liblinear
logistic_reg.fit(X_train,y_train)

In [None]:
y_pred = logistic_reg.predict(X_test)

print("Accuracy: ",round(accuracy_score(y_test,y_pred),3))
print("F1: ",round(f1_score(y_test, y_pred),3))

In [None]:
# transformers.AdamW is deprecated; use the PyTorch implementation instead.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so copy rather than re-wrap them.
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
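
Before passing a dataset like this to `Trainer`, it can help to confirm that batches come out with the expected keys and shapes. The sketch below does so with a standard `DataLoader`. The class is repeated under the hypothetical name `ToyEncodedDataset` so the block is self-contained, and the toy encodings are made-up tensors standing in for real tokenizer output.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyEncodedDataset(Dataset):
    """Same layout as MyDataset above: a dict of tensors plus one label per row."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Made-up encodings: 4 sentences, 6 token ids each (assumption, not real data).
toy_encodings = {
    "input_ids": torch.randint(0, 100, (4, 6)),
    "attention_mask": torch.ones(4, 6, dtype=torch.long),
}
toy_labels = (0, 1, 1, 0)

# The default collate function stacks each dict entry along a new batch dimension.
loader = DataLoader(ToyEncodedDataset(toy_encodings, toy_labels), batch_size=2)
batch = next(iter(loader))
print({key: tuple(val.shape) for key, val in batch.items()})
```

Each batch is a dict with `input_ids`, `attention_mask`, and `labels`, which matches the keyword arguments that `DistilBertForSequenceClassification` accepts.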

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained(LIL_BERT, cls_token='[CLS]')

encodings_train = tokenizer(list(sentences_train), truncation=True, padding=True, return_tensors='pt')
encodings_test = tokenizer(list(sentences_test), truncation=True, padding=True, return_tensors='pt')

dataset_train = MyDataset(encodings_train, labels_train)
dataset_test = MyDataset(encodings_test, labels_test)

In [None]:
dataset_train

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained(LIL_BERT).to(device)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_test             # evaluation dataset
)

trainer.train()

In [None]:
res = trainer.evaluate()

In [None]:
res

In [None]:
pred = trainer.predict(dataset_test)

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

print(compute_metrics(pred))

In [None]:
X_train = encodings_train['input_ids'].to(device)
y_train = torch.tensor(labels_train)
dataset_train = data_utils.TensorDataset(X_train, y_train)

X_test = encodings_test['input_ids'].to(device)
y_test = torch.tensor(labels_test)
dataset_test = data_utils.TensorDataset(X_test, y_test)

In [None]:
# from transformers import BertTokenizer, glue_convert_examples_to_features
# data = tfds.load('glue/mrpc')

# train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')

In [None]:
dataset_train