# Exercise: Create a BERT sentiment classifier

In this exercise, you will create a BERT sentiment classifier (specifically DistilBERT, a smaller distilled variant of BERT) using the [Hugging Face Transformers](https://huggingface.co/transformers/) library.

You will use the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) to train and evaluate your model. The IMDB dataset contains movie reviews that are labeled as either positive or negative. 

In [1]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
! pip install -q "datasets==2.15.0"

In [2]:
# Import load_dataset from the datasets library
from datasets import load_dataset 

In [3]:
# Load the train and test splits of the IMDB dataset
splits = ['train', 'test']
ds = {split: dataset for split, dataset in zip(splits, load_dataset('imdb', split=splits))}


In [5]:
ds['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
# Thin out the dataset so that this example runs quickly
for split in splits:
    ds[split] = ds[split].select(range(1000))
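
One caveat with taking the first 1000 rows: if a split happens to be stored sorted by label, a contiguous slice may contain only one class. The plain-Python sketch below illustrates the effect with made-up data (`labels` is a hypothetical toy list, not the real IMDB split) and shows that selecting from shuffled indices gives a reproducible random subset instead. With the `datasets` library, the analogous step would be calling `shuffle(seed=...)` on the split before `select(...)`.

```python
import random

# Hypothetical toy labels, sorted by class: 50 negatives (0), then 50 positives (1).
labels = [0] * 50 + [1] * 50

# Taking the first 20 rows of a label-sorted split yields a single class.
first_rows = labels[:20]

# Shuffling the indices with a fixed seed before selecting gives a
# reproducible random subset instead of a contiguous slice.
rng = random.Random(42)
indices = list(range(len(labels)))
rng.shuffle(indices)
sampled_rows = [labels[i] for i in indices[:20]]

print("classes in first 20 rows:", sorted(set(first_rows)))
print("classes in shuffled 20 rows:", sorted(set(sampled_rows)))
```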

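In [7]:
# Equivalent, step-by-step way to load both splits
# (note: running this reloads the full 25,000-row splits)
splits = ['train', 'test']
loaded_splits = load_dataset('imdb', split=splits)

In [9]:
# Use distinct names so the loop variable does not overwrite the list being iterated
ds = {}
for split, dataset in zip(splits, loaded_splits):
    ds[split] = dataset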

## Pre-process datasets

Now we are going to process our datasets by converting all the text into tokens for our models. You may ask, why isn't the text converted already? Well, different models may use different tokenizers, so by converting at train time we retain more flexibility.
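
To make the point about differing tokenizers concrete, here is a minimal sketch in plain Python. The two vocabularies and the `encode` helper are hypothetical toys invented for illustration; real tokenizers such as DistilBERT's use subword vocabularies with tens of thousands of entries. The idea it shows is simply that the same text maps to different IDs under different vocabularies, so token IDs prepared for one model cannot be reused for another.

```python
# Two hypothetical toy vocabularies standing in for two different models' tokenizers.
vocab_a = {"[UNK]": 0, "a": 1, "great": 2, "movie": 3}
vocab_b = {"[UNK]": 0, "movie": 1, "a": 2, "great": 3}


def encode(text, vocab):
    """Map lowercased, whitespace-separated words to IDs; unknown words map to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


text = "A great movie"
ids_a = encode(text, vocab_a)
ids_b = encode(text, vocab_b)
print(ids_a)  # [1, 2, 3]
print(ids_b)  # [2, 3, 1]
```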

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


In [None]:
def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples"""
    return tokenizer(examples['text'], padding='max_length', truncation=True)
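
The `padding='max_length'` and `truncation=True` arguments make every tokenized review the same length. The sketch below imitates that behaviour in plain Python so the effect is easy to see. It is a simplified illustration, not the library's implementation: `pad_or_truncate`, the token IDs, and `pad_id=0` are assumptions made for the example, and unlike a real tokenizer this toy version does not preserve a trailing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]        # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # padding: number of filler tokens needed
    input_ids = kept + [pad_id] * n_pad
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 7, 8, 9, 10, 11] [1, 1, 1, 1, 1, 1]
```

Fixed-length sequences are convenient because they can be stacked into a single tensor per batch; the attention mask lets the model ignore the padded positions.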


    