# Exercise 03 Notebook - Text Classification with BERT
In this notebook we will create a neural network for text classification. As a basis for our network we will use a pretrained BERT model and simply add a classification layer on top. We will train and evaluate our model on the IMDB dataset we have already seen in the previous exercises.

## 1. Download the data, model, and tokenizer
As a first step we will install `transformers`, a library by Hugging Face that implements many transformer models and provides pretrained models in many languages that we can download and use for free.

In [None]:
# install transformer library via pip
!pip install transformers

Before we download a pretrained model, we download the data we will train our classifier on. Read the data from `https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv` into a pandas DataFrame.
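
As a rough sketch of the reading and shuffling steps, the snippet below parses a tiny inline CSV instead of the real file so that it runs without network access. The column names `review` and `sentiment` are assumptions made for illustration and may not match the actual file; `pd.read_csv` also accepts a URL string in place of the file-like object.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real CSV file.
# The column names are assumptions, not taken from the actual dataset.
csv_text = """review,sentiment
"A wonderful film.",1
"Dull and far too long.",0
"I would watch it again.",1
"Not my cup of tea.",0
"""

# pd.read_csv also accepts a URL string instead of a file-like object.
data_df = pd.read_csv(io.StringIO(csv_text))

# Shuffle all rows; random_state makes the shuffle reproducible.
# reset_index(drop=True) replaces the shuffled index with 0..n-1.
data_df = data_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(data_df.shape)
```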

In [None]:
# import pandas
import pandas as pd

In [None]:
# Read the csv file and save it in a variable called data_df


In [None]:
# After reading in the data, shuffle the rows of the DataFrame


If you have forgotten what the dataset looks like, do a little bit of data exploration.

In [None]:
# Data Exploration


A lot of errors can happen when you build your dataset and model for the first time. A neat little trick is to not use all of your data from the beginning. The IMDB dataset we use has 50,000 examples. To see if everything works we don't need all of them. A small subset is enough, so it is useful to take just 1000 examples at first and only use the full data at the very end, when we know that everything works as intended.
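
One possible way to take a subset and split it 80:20 is sketched below. The DataFrame here is a hypothetical stand-in with 50 rows and assumed column names; in the notebook the subset size would be 1000. Slicing by position only gives a random split because the rows are assumed to be shuffled already.

```python
import pandas as pd

# Hypothetical stand-in for the already shuffled DataFrame.
data_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(50)],
    "sentiment": [i % 2 for i in range(50)],
})

# Keep only the first rows (this would be 1000 in the notebook).
n_examples = 10
small_df = data_df.head(n_examples)

# 80:20 split by position; valid as a random split only for shuffled rows.
split = int(0.8 * len(small_df))
train_df = small_df.iloc[:split]
val_df = small_df.iloc[split:]

print(len(train_df), len(val_df))
```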

In [None]:
# get the first 1000 rows of the dataframe


In [None]:
# Make an 80:20 split for training and validation data and save
# them as train_df and val_df, respectively.


Great! Now that we have our data at hand, we will download the pretrained BERT model. To be more precise, we will use a distilled version of BERT named DistilBERT, which is a lot smaller than the normal BERT model but retains about 95% of its performance.



> The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased, runs 60% faster while preserving over 95% of BERT’s performances as measured on the GLUE language understanding benchmark.

A smaller model trains much faster, so it makes sense to try out ideas with it first: we can iterate far more quickly.

In [None]:
# import model and tokenizer classes
from transformers import DistilBertModel, DistilBertTokenizer

The `transformers` library makes it very easy for us to download pretrained versions of a model and the corresponding tokenizer. You can find a pretrained model [here](https://huggingface.co/distilbert-base-uncased). You could also try out other models.

As already mentioned, we want to use DistilBERT for our classifier. Try to find a suitable model that we can use for our dataset and save its name as a string in a variable:

In [None]:
# insert model name
model_name = None

In [None]:
bert_model = DistilBertModel.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

## 2. Tokenizer
After successfully downloading the pretrained model and tokenizer, we will use the tokenizer to tokenize our text data.

### 2.1 Getting familiar with the tokenizer

As a little warm-up, let's do a few exercises on the example text given below. If you don't know what to do, have a look at the [tokenizer documentation](https://huggingface.co/transformers/main_classes/tokenizer.html).
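
To illustrate the three basic tokenizer methods without downloading anything, the sketch below builds a `BertTokenizer` from a hypothetical twelve-word vocabulary. The vocabulary, the name `toy_tokenizer`, and the shortened text are assumptions for illustration, so the resulting IDs differ from the ones the pretrained tokenizer produces. The calls `tokenize`, `encode`, and `decode` are used the same way on the downloaded `tokenizer`.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical miniature vocabulary; the line number of a token is its ID.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "star", "wars", "made", "epic", "fantasy", "real", "."]

# Write the vocabulary to a temporary file so that no download is needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    toy_tokenizer = BertTokenizer(vocab_path, do_lower_case=True)

text = "Star wars made epic fantasy real."

# tokenize: split the text into (lower-cased) tokens, no special tokens.
tokens = toy_tokenizer.tokenize(text)

# encode: map tokens to IDs and add [CLS] at the start and [SEP] at the end.
encoded_text = toy_tokenizer.encode(text)

# decode: map the IDs back to a string, including the special tokens.
decoded_text = toy_tokenizer.decode(encoded_text)

print(tokens)
print(encoded_text)
print(decoded_text)
```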

In [None]:
example_text = """Star wars made epic fantasy real. For a generation of people it has defined what the cinema experience is meant to be."""

In [None]:
# tokenize the example_text
# you should receive a list ['Star', 'wars', 'made', ..., '.']


In [None]:
# encode the example text.
# you should receive a list of integers [101, 2537, 8755, ..., 102]
# save the list into a variable encoded_text


In [None]:
# use the tokenizer to decode the encoded_text
# do you notice something?


In [None]:
# use the tokenizer object as a function and use the
# example_text as the input. What type of object do you receive?
# What does each value mean?


### 2.2 Tokenizing the data
We are now ready to tokenize our training and validation datasets. Again, use the tokenizer to create `input_ids` and `attention_mask` for the training and validation sets. The tokenized text should have the following properties:
* [CLS] and [SEP] token added
* max length should be 128
* texts with more than 128 tokens should be truncated
* the tokenizer should return torch tensors.
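
The properties listed above map onto keyword arguments of the tokenizer call. The sketch below shows them on a tokenizer built from a hypothetical miniature vocabulary, with `max_length=6` instead of 128 so the effect of padding and truncation is easy to see. The vocabulary, the texts, and the name `toy_tokenizer` are assumptions for illustration; `padding="max_length"` is one possible choice to give every row the same length.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical miniature vocabulary; the line number of a token is its ID.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "a", "great", "bad", "movie", "that", "was", "far", "too", "long"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    toy_tokenizer = BertTokenizer(vocab_path, do_lower_case=True)

texts = ["a great movie", "a bad movie that was far too long"]

batch = toy_tokenizer(
    texts,
    add_special_tokens=True,   # add [CLS] and [SEP]
    padding="max_length",      # pad shorter texts up to max_length
    truncation=True,           # cut longer texts down to max_length
    max_length=6,              # would be 128 in the notebook
    return_tensors="pt",       # return torch tensors
)

# Both tensors have shape (number of texts, max_length).
print(batch["input_ids"])
# The attention mask is 1 for real tokens and 0 for padding.
print(batch["attention_mask"])
```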


In [None]:
# Apply the tokenizer to the training text data and save the resulting dict
# in a variable called tokenized_train_data



In [None]:
# Apply the tokenizer to the validation text data and save the resulting dict
# in a variable called tokenized_val_data


In [None]:
# have a look at the input_ids of the tokenized_train_data


In [None]:
# have a look at the attention_mask of the tokenized_train_data


## 3. Creating the training and validation Dataset and DataLoader
Our training and validation data are now in vector format, and we can use the tokenized text to create our Dataset and DataLoader objects.

In [None]:
import torch
import torch.nn as nn
# import Dataset and DataLoader class
from torch.utils.data import Dataset, DataLoader

Now create a Dataset class called `TextDataset`. As always, we need to implement three methods:
* `__init__`
* `__len__`
* `__getitem__`

The `__init__` function should take the tokenized data and the labels as arguments
and store them in the instance attributes `X` and `Y`.

The `__len__` function should return the length of the dataset

The `__getitem__` should take index as input and return a tuple of data that looks like this `(input_ids, attention_mask, labels)`
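
A minimal sketch of such a class is shown below, assuming that `X` is a dict with the keys `input_ids` and `attention_mask` (as returned by the tokenizer) and that `Y` is a tensor of labels. The random tensors are stand-ins for real tokenized data, and the batch size of 4 is chosen only to keep the example small.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TextDataset(Dataset):

    def __init__(self, X, Y):
        # X: dict with "input_ids" and "attention_mask", Y: label tensor
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.Y)

    def __getitem__(self, index):
        return (
            self.X["input_ids"][index],
            self.X["attention_mask"][index],
            self.Y[index],
        )


# Random stand-in for tokenized data: 10 texts with 8 tokens each.
X = {
    "input_ids": torch.randint(0, 100, (10, 8)),
    "attention_mask": torch.ones(10, 8, dtype=torch.long),
}
Y = torch.tensor([0, 1] * 5)

dataset = TextDataset(X, Y)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# The default collate function stacks the tuple elements batch-wise.
input_ids, attention_mask, labels = next(iter(loader))
print(input_ids.shape, attention_mask.shape, labels.shape)
```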


In [None]:
class TextDataset(Dataset):

  def __init__(self, X, Y):
    pass


  def __len__(self,):
    pass

  def __getitem__(self, index):
    pass

Create a `train_dataset` and `val_dataset`:

In [None]:
train_dataset = None  # TODO: create the training dataset
val_dataset = None  # TODO: create the validation dataset

Create the training DataLoader `train_dl` and the validation DataLoader `val_dl` with a batch size of 32:

In [None]:
train_dl = None  # TODO: create the training DataLoader
val_dl = None  # TODO: create the validation DataLoader

## 4. Creating the Model
In this part we will create our model using the pretrained DistilBERT model that we downloaded at the beginning. Before we add our classifier to the network, we first need to understand what exactly the DistilBERT model's output looks like.

### 4.1 Understanding BERT's output

In [None]:
# get the first batch from our train_dl
first_batch = next(iter(train_dl))

Have a look at the `first_batch`.

In [None]:
# first batch
first_batch

Save the input IDs in a variable called `input_ids` and the attention mask in a variable called `attention_mask`. You can ignore the labels for now.

In the first section we downloaded the pretrained DistilBERT model and saved it as `bert_model`. For the forward pass, `bert_model` expects input IDs and attention masks as input. [This blog post](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) offers a very nice visualization that helps in understanding what comes out of the BERT model.

![Last hidden state](https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png)

In [None]:
last_hidden_state = None

Check the dimensions of the model's output and make sure you understand what each dimension represents. Slice the output so that you get all values for each `[CLS]` token in the batch.
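
The shapes involved can be illustrated with a tiny, randomly initialised DistilBERT that needs no download. The configuration values below (hidden size 32, one layer) and the name `toy_bert` are assumptions chosen to keep the example fast; the pretrained `distilbert-base-uncased` model has a hidden size of 768.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

# Tiny random model, only meant to show the output shapes.
config = DistilBertConfig(
    vocab_size=100,
    dim=32,                      # hidden size (768 for the pretrained model)
    n_layers=1,
    n_heads=2,
    hidden_dim=64,
    max_position_embeddings=16,
)
toy_bert = DistilBertModel(config)
toy_bert.eval()

# Fake batch: 4 texts with 8 token IDs each, no padding.
input_ids = torch.randint(0, 100, (4, 8))
attention_mask = torch.ones(4, 8, dtype=torch.long)

with torch.no_grad():
    output = toy_bert(input_ids=input_ids, attention_mask=attention_mask)

# Shape: (batch size, sequence length, hidden size)
last_hidden_state = output.last_hidden_state
print(last_hidden_state.shape)

# [CLS] is the first token of every sequence, so take position 0.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)
```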

In [None]:
# shape of last hidden state


In [None]:
# Select [CLS] Token


In [None]:
# check the shape of the [CLS] tokens


### 4.2 Defining the Model
Now create a neural network called `BertClassifier`. The constructor should receive the pretrained bert model and the number of classes.
In the constructor, save the BERT model in a variable `bert`. Create a linear layer and think about which input and output dimensions are needed.

The `forward()` method should receive `input_ids` and `attention_mask` as input and propagate them through the layers.
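
One possible shape of such a classifier is sketched below, assuming that the `[CLS]` embedding is fed into a single linear layer and that the hidden size can be read from `bert_model.config.dim`. The tiny random DistilBERT (`toy_bert`) stands in for the pretrained model so that the example runs without a download.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel


class BertClassifier(nn.Module):

    def __init__(self, bert_model, n_classes):
        super().__init__()
        self.bert = bert_model
        # Input size is the hidden size of BERT, output size the class count.
        self.classifier = nn.Linear(bert_model.config.dim, n_classes)

    def forward(self, input_ids, attention_mask):
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the [CLS] token (position 0) as text vector.
        cls_embeddings = output.last_hidden_state[:, 0, :]
        return self.classifier(cls_embeddings)


# Tiny random stand-in for the pretrained model (assumed configuration).
config = DistilBertConfig(vocab_size=100, dim=32, n_layers=1, n_heads=2,
                          hidden_dim=64, max_position_embeddings=16)
toy_bert = DistilBertModel(config)
toy_model = BertClassifier(toy_bert, n_classes=2)

input_ids = torch.randint(0, 100, (4, 8))
attention_mask = torch.ones(4, 8, dtype=torch.long)

# One row of raw class scores (logits) per text in the batch.
logits = toy_model(input_ids, attention_mask)
print(logits.shape)
```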

In [None]:
# implement BertClassifier
class BertClassifier(nn.Module):

  def __init__(self, bert_model, n_classes):
    pass

  def forward(self, input_ids, attention_mask):
    pass

In [None]:
# instantiate the model
model = None  # TODO: create a BertClassifier

## 5. Model Training
After defining the model we now have everything we need to train our model.

### 5.1 Moving to the GPU

In previous exercises our models were quite small, with only a couple of thousand parameters. Our BERT classifier is several orders of magnitude larger (about 66M parameters). With that many parameters it becomes necessary to train on the GPU.

We can easily move our model to the GPU with the following command:
  `model.to('cuda')`. But there is one problem: if no GPU is available, the code will crash. There is an easy way to check whether a GPU is available:
  

In [None]:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

We can now move our model to the GPU safely. If there is no GPU available, the model just stays on the CPU.
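
The sketch below shows the pattern on a small linear layer that stands in for the classifier (the names `toy_model` and `batch` are illustrative). Note that the input tensors have to be moved to the same device as the model, otherwise the forward pass raises an error on a GPU machine.

```python
import torch
import torch.nn as nn

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# Small stand-in model; .to() moves its parameters to DEVICE.
toy_model = nn.Linear(4, 2).to(DEVICE)

# Inputs must live on the same device as the model.
batch = torch.randn(3, 4).to(DEVICE)
out = toy_model(batch)

print(out.device)
```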

In [None]:
# pass the model to the gpu


### 5.2 Setting up the training

In [None]:
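epochs = None  # TODO: number of training epochs
lr = None  # TODO: learning rate
optimizer = None  # TODO: choose an optimizer
loss_func = None  # TODO: choose a loss function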

In [None]:
# To get a better idea of how well your model performs
# you should implement an accuracy function that is
# called after each epoch of your training loop
def accuracy(out, yb):
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()

In [None]:
# Freezing Parameters of bert
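
Freezing means switching off gradient computation for the pretrained weights so that only the new classification layer is updated. The sketch below shows the idea on two small linear layers standing in for the `bert` and `classifier` parts of the model; the choice of Adam, the learning rate, and cross-entropy loss are assumptions, not settings prescribed by this exercise.

```python
import torch
import torch.nn as nn

# Stand-in for a model with a pretrained part and a new classifier head.
toy_model = nn.ModuleDict({
    "bert": nn.Linear(8, 8),
    "classifier": nn.Linear(8, 2),
})

# Freeze the pretrained part: no gradients are computed for these weights.
for param in toy_model["bert"].parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
loss_func = nn.CrossEntropyLoss()

# Only the classifier weights (8 * 2) and biases (2) remain trainable.
n_trainable = sum(p.numel() for p in trainable)
print(n_trainable)
```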


### 5.3 Train the model
The training loop is almost the same as in the first exercise of the course. Spot and understand the differences.
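
The body of `train` is not shown in this notebook, so the following is only a sketch of what a function with the signature `train(model, train_dl, val_dl, epochs, optimizer, loss_func)` could look like. The returned `history` list, the `ToyClassifier` stand-in model, and the random data are assumptions for illustration. The main difference to a plain loop is that every batch consists of three tensors and that the model takes two of them as input.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


def accuracy(out, yb):
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()


def train(model, train_dl, val_dl, epochs, optimizer, loss_func):
    history = []
    for epoch in range(epochs):
        model.train()
        for input_ids, attention_mask, labels in train_dl:
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
            labels = labels.to(DEVICE)
            optimizer.zero_grad()
            out = model(input_ids, attention_mask)
            loss = loss_func(out, labels)
            loss.backward()
            optimizer.step()

        # Validation: no gradient tracking, metrics weighted by batch size.
        model.eval()
        val_loss, val_acc, n = 0.0, 0.0, 0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_dl:
                input_ids = input_ids.to(DEVICE)
                attention_mask = attention_mask.to(DEVICE)
                labels = labels.to(DEVICE)
                out = model(input_ids, attention_mask)
                batch_size = labels.size(0)
                val_loss += loss_func(out, labels).item() * batch_size
                val_acc += accuracy(out, labels).item() * batch_size
                n += batch_size
        history.append((val_loss / n, val_acc / n))
        print(f"epoch {epoch + 1}: val_loss={val_loss / n:.4f}, "
              f"val_acc={val_acc / n:.4f}")
    return history


class ToyClassifier(nn.Module):
    """Small stand-in with the same forward signature as BertClassifier."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(50, 16)
        self.classifier = nn.Linear(16, 2)

    def forward(self, input_ids, attention_mask):
        # Average the embeddings of all non-padded tokens.
        mask = attention_mask.unsqueeze(-1).float()
        summed = (self.embedding(input_ids) * mask).sum(dim=1)
        pooled = summed / mask.sum(dim=1).clamp(min=1)
        return self.classifier(pooled)


torch.manual_seed(0)
toy_ids = torch.randint(0, 50, (40, 8))
toy_mask = torch.ones(40, 8, dtype=torch.long)
toy_labels = torch.randint(0, 2, (40,))
toy_train_dl = DataLoader(TensorDataset(toy_ids[:32], toy_mask[:32], toy_labels[:32]), batch_size=8)
toy_val_dl = DataLoader(TensorDataset(toy_ids[32:], toy_mask[32:], toy_labels[32:]), batch_size=8)

toy_model = ToyClassifier().to(DEVICE)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
history = train(toy_model, toy_train_dl, toy_val_dl, 2, toy_optimizer, nn.CrossEntropyLoss())
```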

In [None]:
# Execute the train function and train the model.
train(model, train_dl, val_dl, epochs, optimizer, loss_func)