In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

import torch
from transformers import AutoTokenizer, AutoModel

### CONFIG

In [None]:
# warnings.filterwarnings('ignore')
MODEL_NAME = 'distilbert-base-uncased'
DATA_URL = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'
N_ROWS = 1000 # number of rows to read from input file.


### Question 1: Load data
- Use `pd.read_csv` to load data from the `DATA_URL` specified above<br>
(**NOTE**: Specify parameters `delimiter='\t'` and `header=None`, since this data file is a `.tsv` file without a header row. Pass `nrows=N_ROWS` to read only the first `N_ROWS` rows.)

- Rename the columns (`df.rename`). The text column should be renamed to 'text' and the label column to 'label'
- Print the top three rows
- Print the value counts of the label column
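
The loading steps above could look roughly like the sketch below. To keep it runnable offline, it reads a small in-memory tab-separated sample instead of the real file. The sample rows and the column mapping `{0: "text", 1: "label"}` are assumptions made for illustration; with the real data you would pass `DATA_URL` as the first argument instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the .tsv file: text <TAB> label, no header row.
sample_tsv = (
    "a truly wonderful and moving film\t1\n"
    "a dull and lifeless story\t0\n"
    "funny , warm and well acted\t1\n"
    "boring from start to finish\t0\n"
)

# header=None tells pandas the first line is data, so columns are named 0 and 1.
# nrows limits how many rows are read (here 3 of the 4 sample rows).
df = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", header=None, nrows=3)

# Give the integer-named columns descriptive names.
df = df.rename(columns={0: "text", 1: "label"})

print(df.head(3))
print(df["label"].value_counts())
```

Checking `value_counts()` early is a cheap way to see whether the classes are roughly balanced before training anything.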

### Question 2: Load model weights
The HuggingFace (transformers) ecosystem allows us to download model weights for pre-trained transformer neural networks. By passing in the name (`MODEL_NAME`) of the model we want to use, we can load the weights into a model object automatically.
The same goes for tokenizers. Most models use different tokenization schemas, so we want to load the tokenizer that matches the particular model we specified.

- Run the cell below to load the model weights and tokenization schema into the `model` and `tokenizer` objects.

- Print the `model` object. Notice how it is made up of layers just like the neural networks we trained for image classification in a prior lab, albeit a bit more complex.

In [None]:
model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

### Question 3: Tokenize the data
With our tokenizer loaded, we can now preprocess our data into the format that the BERT model expects.

- Tokenize the data by using the `__call__` method of the tokenizer object<br><br>
    - e.g., `tokenized=tokenizer(df["text"].to_list())`
    - Pass the following arguments, along with the texts
        - `add_special_tokens=True` (Adds the [CLS] and [SEP] tokens that the BERT model expects)
        - `padding='longest'` (The texts have uneven lengths. Padding inserts dummy tokens so that every sequence is as long as the longest one)
        - `return_attention_mask=True` (Return the attention mask expected by the BERT model)
        - `return_tensors='pt'` (Return the input ID tensors expected by the BERT model as PyTorch tensors)
        - `verbose=True` (Tell us what's going on)
<br><br>
- Print the `tokenized` object (It will be a dictionary-like object containing `input_ids` and `attention_mask`, both as tensors)
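
To make the `padding='longest'` and `attention_mask` ideas concrete, here is a toy sketch in plain Python. It is not the HuggingFace implementation: `pad_to_longest` is a hypothetical helper, and the token ids and the padding id `0` are assumptions chosen for illustration.

```python
# Toy illustration of padding to the longest sequence in a batch.
PAD_ID = 0  # assumed padding id for this sketch


def pad_to_longest(batch, pad_id=PAD_ID):
    """Pad every list of token ids to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_to_longest(batch)

print(padded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The real tokenizer returns the same two fields, but as PyTorch tensors when `return_tensors='pt'` is set, which is why all rows must share one length.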




### Question 4: Calculate embedding features
With our data tokenized such that the BERT model can understand it, we are now ready to calculate the embeddings.

- Use the function below to extract embeddings

- Print the shape of the embeddings. It should be (N_ROWS, 768), since 768 is the hidden size of the model



In [None]:
def get_bert_embeddings(model, tokenized):

    """ Calculate BERT embeddings for a batch of sentences.
    NOTE: Calculating BERT embeddings is a very expensive operation.
    Particularly on CPU, it can take a long time to calculate embeddings for
    a large batch of sentences (Max 10-20 minutes for 6K sentences).

    Args:
        model (transformers BERT model): BERT model.
        tokenized (dict): Dictionary of tokenized sentences (input_ids and attention_mask)

    Returns:
        n-d NumPy array: BERT embeddings for the sentences in the batch.
    """

    print("Getting model encodings...")
    # The following is a context-manager that disables gradient calculation.
    # Disabling gradient calculation is useful for inference, when you are 
    # sure that you will not call Tensor.backward(). It will reduce memory 
    # consumption for computations that would otherwise have requires_grad=True.
    # TLDR: calculating gradients is expensive. We don't need them for inference.
    with torch.no_grad():
        outputs = model(**tokenized)

    # outputs[0] is the last hidden state: one vector per token, with
    # shape (batch_size, sequence_length, hidden_size). Slicing [:, 0, :]
    # keeps only the vector of the first token ([CLS]) of each sentence,
    # giving shape (batch_size, hidden_size). Note that this is the raw
    # hidden state, not a pooled output passed through an extra Linear
    # layer and Tanh; DistilBERT's base model has no such pooler.
    print("Returning embeddings...")
    return outputs[0][:, 0, :].numpy()

embeddings = get_bert_embeddings(model, tokenized)
embeddings.shape
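
The `[:, 0, :]` slice in the function above can be checked in isolation. The sketch below uses a random NumPy array as a stand-in for the model output, so the values are meaningless; only the shapes matter. The batch size and sequence length are arbitrary assumptions.

```python
import numpy as np

# Arbitrary batch size and sequence length; 768 is the model's hidden size.
batch_size, seq_len, hidden_size = 4, 6, 768

# Stand-in for the last hidden state: one vector per token per sentence.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every sentence, only token position 0 ([CLS]), and every hidden dimension.
cls_embeddings = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (4, 6, 768)
print(cls_embeddings.shape)     # (4, 768)
```

Each sentence is thereby reduced to a single fixed-length vector, which is what lets a classical model such as logistic regression consume it as an ordinary feature row.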

### Question 5: Train a model
Now that we have our embeddings, it is time to use them in a machine learning model. (You can use Keras too, if you feel adventurous.)

- Split the embeddings and the labels into (X_train, y_train) and (X_test, y_test) using sklearn's `train_test_split` function

- Fit a Logistic Regression model on (X_train, y_train)

- Evaluate the results on both the training data and the test data
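
One possible shape for these steps is sketched below. It uses synthetic features and labels in place of the real embeddings so that it runs on its own; the array sizes, `test_size`, `random_state`, and `max_iter` values are assumptions, not settings prescribed by this lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the embeddings (X) and labels (y).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 20% for testing; stratify keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A higher max_iter gives the solver room to converge on wide feature matrices.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Report on both splits; a large gap between them suggests overfitting.
print(classification_report(y_train, clf.predict(X_train)))
print(classification_report(y_test, clf.predict(X_test)))
```

With the real data, `X` would be the `embeddings` array and `y` the `label` column of the dataframe.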

****

# Bonus exercises
As discussed, the remaining tasks are bonus tasks. You are not expected to complete them before handing in the lab. They are just for your own understanding.

### Question 6: Compare with CountVectorizer

- Split the original data using train_test_split
- Fit a CountVectorizer to (X_train)
- Transform X_train and X_test
- Fit a logistic regression model to the countvectorized X_train and y_train
- Evaluate the results on both (X_train, y_train) and (X_test, y_test)
- Compare with the results of using BERT
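
A minimal sketch of the bag-of-words baseline follows, using a tiny made-up corpus so that it runs without downloading anything. The texts, labels, and split settings are all hypothetical; accuracy on eight sentences says nothing about the real comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for df["text"] and df["label"].
texts = [
    "a great and moving film",
    "great acting and a great story",
    "a wonderful , funny movie",
    "truly wonderful and moving",
    "a dull and boring film",
    "boring story and terrible acting",
    "a terrible , dull movie",
    "truly awful and boring",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, so the vocabulary is learned from training data only.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)  # learn vocabulary + count words
X_test = vectorizer.transform(X_test_txt)        # reuse the same vocabulary

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```

Fitting the vectorizer on the training split only avoids leaking test-set vocabulary into the model, which keeps the comparison with the BERT features fair.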