# Lesson 4 Notebook: BERT Endeavors

**Description:** After some setup for our standard IMDB movie review classification task, we will explore BERT (obtained from the [Huggingface Transformer library](https://huggingface.co/docs/transformers/index)) and apply it to text classification.

<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Data Acquisition](#dataAcquisition)  
  * 3. [BERT Basics](#bertBasics)
    * 3.1 [Tokenization](#tokenization)
    * 3.2 [Model Structure & Output](#modelOutput)
    * 3.3 [Context Based Embeddings with BERT](#contextualEmbeddings)
  * 4. [Text Classification with BERT (using the Pooler Output)](#BERTClassification)
    * 4.1 [Classification Model Setup](#modelSetup)
    * 4.2 [Model Training](#modelTraining)
    * 4.3 [Class Exercise](#classExercise)

  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_4_BERT.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup

This notebook requires the Huggingface transformers package and the IMDB dataset, which we download through the Huggingface datasets library and which is then cached locally.

In [1]:
!pip install -q transformers
!pip install -q torchinfo
!pip install -U -q datasets fsspec huggingface_hub  # Hugging Face's datasets library
!pip install -q evaluate


Ready to do the imports.

In [2]:
import numpy as np

import transformers
import evaluate

from datasets import load_dataset
from torchinfo import summary

For the Transformer library we need to import the **tokenizer**, the pre-trained **model**, and a **trainer** class plus **trainer arguments** to do the fine-tuning:

In [3]:
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from transformers import TrainingArguments, Trainer

A small function calculating the cosine similarity may also come in handy:

In [4]:
def cosine_similarities(vecs):
    for v_1 in vecs:
        similarities = ''
        for v_2 in vecs:
            similarities += ('\t' + str(np.dot(v_1, v_2)/np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2)))[:4])
        print(similarities)

[Return to Top](#returnToTop)  
<a id = 'dataAcquisition'></a>

## 2. Data Acquisition

We will use the IMDB dataset delivered as part of the Huggingface datasets library, and split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [5]:
imdb_dataset = load_dataset("imdb")


In [6]:
# Look at what's in the dataset
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [7]:
# Look at the first few examples
for i in range(4):
  print(imdb_dataset['train']['text'][i])
  print(imdb_dataset['train']['label'][i])
  print()

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [8]:
# Look at the label names
imdb_dataset['train'].features['label'].names

['neg', 'pos']

[Return to Top](#returnToTop)  
<a id = 'bertBasics'></a>
## 3. BERT Basics

We now need to settle on the pre-trained BERT model we want to use. We will leverage **'bert-base-cased'**.

We need to load the model and corresponding tokenizer. Let's start with the tokenizer.

In [9]:
checkpoint = 'bert-base-cased'
bert_tokenizer = BertTokenizer.from_pretrained(checkpoint)


[Return to Top](#returnToTop)  
<a id = 'tokenization'></a>

### 3.1 Tokenization

Tokenization with BERT is interesting. To minimize the number of unknown words, BERT (like most pre-trained transformer models) uses a **subword** model for tokenization. We will see what that means in a second.

Let's start with something simple:

In [10]:
bert_tokenizer.tokenize('This is great!')

['This', 'is', 'great', '!']

Ok, that is as expected. What about:

In [11]:
bert_tokenizer.tokenize('This tree is 1253 years old.')

['This', 'tree', 'is', '125', '##3', 'years', 'old', '.']

or

In [12]:
bert_tokenizer.tokenize('Pneumonia can be very serious.')

['P', '##ne', '##um', '##onia', 'can', 'be', 'very', 'serious', '.']

Ouch! Many more complex terms are not in BERT's vocabulary and are split up.

**Question:** in what type of NLP problems can this lead to complications?
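To see how the subword pieces relate to the original words, here is a minimal sketch that glues `##` continuation pieces back together and records which piece positions belong to each word. The helper name `merge_wordpieces` is hypothetical (it is not part of the Huggingface API), and it assumes that every piece starting with `##` continues the previous piece, as in the outputs above.

```python
def merge_wordpieces(tokens):
    """Merge WordPiece tokens into words and record the piece indices of each word."""
    words, groups = [], []
    for idx, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            # Continuation piece: append it (without the "##" marker) to the previous word.
            words[-1] += tok[2:]
            groups[-1].append(idx)
        else:
            # A piece without the "##" prefix starts a new word.
            words.append(tok)
            groups.append([idx])
    return words, groups


# Tokens copied from the tokenizer output shown above.
tokens = ['P', '##ne', '##um', '##onia', 'can', 'be', 'very', 'serious', '.']
words, groups = merge_wordpieces(tokens)
print(words)   # ['Pneumonia', 'can', 'be', 'very', 'serious', '.']
print(groups)  # [[0, 1, 2, 3], [4], [5], [6], [7], [8]]
```

A single word can therefore occupy several token positions, which matters whenever labels or embeddings are needed per word rather than per token.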

Next, how do we generate the BERT input with its tokenizer? Fortunately, by now Huggingface's tokenizer implementation makes this rather straightforward:

In [13]:
bert_tokenizer(['This is great!'])

{'input_ids': [[101, 1188, 1110, 1632, 106, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}

To make sure we do this correctly though, we may want to specify that we want the inputs as PyTorch tensors (vs. TensorFlow), and we may want to do some padding:

In [14]:
bert_input = bert_tokenizer.batch_encode_plus(
    ['This is great!', 'This is terrible!'],
    max_length=10,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)

bert_input

{'input_ids': tensor([[ 101, 1188, 1110, 1632,  106,  102,    0,    0,    0,    0],
        [ 101, 1188, 1110, 6434,  106,  102,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

What do we notice? Look at shapes and values. Does everything make sense?

**Question**: What is the input_id of the CLS token? What is the input_id of the SEP token?
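The relationship between padding and the attention mask can be sketched in plain Python. This is only an illustration of the pattern visible in the output above, not the tokenizer's actual implementation: the helper `pad_and_mask` is hypothetical, it assumes the padding id is 0 (as in the output), and it does not handle truncation.

```python
def pad_and_mask(ids, max_length, pad_id=0):
    """Right-pad one sequence of token ids and build the matching attention mask."""
    if len(ids) > max_length:
        # Truncation is deliberately left out of this sketch.
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


# Ids copied from the unpadded tokenizer output for 'This is great!'.
ids = [101, 1188, 1110, 1632, 106, 102]
input_ids, attention_mask = pad_and_mask(ids, max_length=10)
print(input_ids)       # [101, 1188, 1110, 1632, 106, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```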

We'll use the tokenizer to preprocess our input text. We can define a function that takes each movie review and turns it into the inputs needed for the BERT model. Then we'll map that function onto our training and validation datasets.

We'll restrict the number of examples we use during the live session, so that the training goes more quickly, but you can try using the full dataset on your own.

In [15]:
max_length = 128

def preprocess_imdb(data):
    review_text = data['text']

    encoded = bert_tokenizer.batch_encode_plus(
            review_text,
            max_length=max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="pt"
        )

    return encoded

In [16]:
num_train = 5000  # There are 25,000 available; we'll use fewer in the live session
num_dev = 500  # There are another 25,000 in the "test" split, which we'll use for validation, since the remaining split doesn't have labels

imdb_train_dataset = imdb_dataset['train'].shuffle().select(range(num_train)).map(preprocess_imdb, batched=True)
imdb_dev_dataset = imdb_dataset['test'].shuffle().select(range(num_dev)).map(preprocess_imdb, batched=True)


In [17]:
imdb_train_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 5000
})

[Return to Top](#returnToTop)  
<a id = 'modelOutput'></a>

### 3.2 Model Structure & Output

Now that we have familiarized ourselves with the tokenization, we can turn to the model and its output.

Let's start by using the basic BertModel class. This class loads only the pre-trained part of the model that we'll use, up until the last hidden layer. We don't want the output layers used in pre-training, since we won't be doing those tasks (i.e. masked token prediction or next sentence prediction).

We can look at the model's internal weight matrices, to see what's in the pre-trained layers we've loaded. These are named, and they should be familiar, based on our knowledge of the BERT model architecture.

In [18]:
bert_model = BertModel.from_pretrained(checkpoint)


In [19]:
for name, param in bert_model.named_parameters():
    print(name)

embeddings.word_embeddings.weight
embeddings.position_embeddings.weight
embeddings.token_type_embeddings.weight
embeddings.LayerNorm.weight
embeddings.LayerNorm.bias
encoder.layer.0.attention.self.query.weight
encoder.layer.0.attention.self.query.bias
encoder.layer.0.attention.self.key.weight
encoder.layer.0.attention.self.key.bias
encoder.layer.0.attention.self.value.weight
encoder.layer.0.attention.self.value.bias
encoder.layer.0.attention.output.dense.weight
encoder.layer.0.attention.output.dense.bias
encoder.layer.0.attention.output.LayerNorm.weight
encoder.layer.0.attention.output.LayerNorm.bias
encoder.layer.0.intermediate.dense.weight
encoder.layer.0.intermediate.dense.bias
encoder.layer.0.output.dense.weight
encoder.layer.0.output.dense.bias
encoder.layer.0.output.LayerNorm.weight
encoder.layer.0.output.LayerNorm.bias
encoder.layer.1.attention.self.query.weight
encoder.layer.1.attention.self.query.bias
encoder.layer.1.attention.self.key.weight
encoder.layer.1.attention.self.key

Now let's turn to BERT's output.

In [20]:
bert_output = bert_model(**bert_input)
bert_output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3179,  0.3525,  0.1506,  ..., -0.1706,  0.3517,  0.0125],
         [ 0.2948, -0.0997,  0.6395,  ..., -0.0371,  0.2188,  0.3518],
         [ 0.1575,  0.6034,  0.7143,  ...,  0.0823,  0.3011,  0.6374],
         ...,
         [-0.0841,  0.2192,  0.3046,  ..., -0.0895,  0.3235,  0.2655],
         [ 0.2485, -0.0316,  0.3258,  ..., -0.1088,  0.4849,  0.0566],
         [ 0.2249, -0.0716,  0.2036,  ...,  0.0480,  0.4485,  0.1516]],

        [[ 0.3317,  0.3834,  0.0829,  ..., -0.2209,  0.2429, -0.1369],
         [ 0.3239, -0.0473,  0.6687,  ..., -0.0043,  0.3423,  0.3567],
         [ 0.2500,  0.7461,  0.3597,  ..., -0.1030,  0.2468,  0.7532],
         ...,
         [ 0.0742,  0.1487,  0.1221,  ..., -0.1052,  0.1900,  0.2008],
         [ 0.2734, -0.1222,  0.1121,  ..., -0.0486,  0.4010,  0.1512],
         [ 0.2248, -0.0299, -0.0100,  ...,  0.1512,  0.4156,  0.1596]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

Let's analyze this a bit:

In [21]:
print('Shape of first BERT output: ', bert_output[0].shape)
print('Shape of second BERT output: ', bert_output[1].shape)

Shape of first BERT output:  torch.Size([2, 10, 768])
Shape of second BERT output:  torch.Size([2, 768])


What does that mean? Are the dimensions correct? Why are there 2 outputs? Let's discuss in class. You can (and should!) also go to https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel and read the documentation. Reading it is **REALLY(!)** critical.

[Return to Top](#returnToTop)  
<a id = 'contextualEmbeddings'></a>

### 3.3 Context-based Embeddings with BERT

We can use this version of the model if we want to access the contextualized embeddings that come from that last hidden layer of the model. The first output of the model provides the full sequence of token embeddings. The second output is basically the CLS token embedding (after one more dense layer transformation).
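The relationship between the two outputs can be sketched with toy numbers. This is a simplified illustration, not the library's code: the function `toy_pooler` is hypothetical, the hidden size is 2 instead of 768, and the weights are made up. It only shows the pattern described above: take the vector at position 0 (the CLS position) of each sequence, then apply a dense layer followed by tanh.

```python
import math


def toy_pooler(last_hidden_state, weight, bias):
    """Apply dense + tanh to the position-0 vector of every sequence in the batch."""
    pooled = []
    for sequence in last_hidden_state:
        cls_vector = sequence[0]  # embedding at the first (CLS) position
        row = []
        for w_row, b in zip(weight, bias):
            pre_activation = sum(w * x for w, x in zip(w_row, cls_vector)) + b
            row.append(math.tanh(pre_activation))
        pooled.append(row)
    return pooled


# Toy "first output": batch of 2 sequences, 3 tokens each, hidden size 2.
last_hidden_state = [
    [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]],
    [[0.0, 0.0], [0.9, 0.8], [0.7, 0.6]],
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # made-up dense weights (identity matrix)
bias = [0.0, 0.0]

pooled = toy_pooler(last_hidden_state, weight, bias)
print(len(pooled), len(pooled[0]))  # 2 2 -> one vector per sequence
```

The sequence dimension disappears in the second output because only one position per sequence is kept.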

Let's look at the BERT contextualized embeddings for the word "bank" when it appears in a few contexts:

In [22]:
bert_bank_inputs = bert_tokenizer(["I need to bring my money to the bank today",
                                  "I will need to bring my money to the bank tomorrow",
                                  "I had to bank into a turn",
                                  "The bank teller was very nice" ],
                                padding=True,
                                return_tensors='pt')

bert_bank_inputs

{'input_ids': tensor([[ 101,  146, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 2052,  102,
            0],
        [ 101,  146, 1209, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 4911,
          102],
        [ 101,  146, 1125, 1106, 3085, 1154,  170, 1885,  102,    0,    0,    0,
            0],
        [ 101, 1109, 3085, 1587, 1200, 1108, 1304, 3505,  102,    0,    0,    0,
            0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

Next, we will get the outputs and extract the word vectors for bank in each of these sentences:

In [23]:
bert_bank_outputs = bert_model(**bert_bank_inputs)

bank_1 = bert_bank_outputs[0][0, 9]
bank_2 = bert_bank_outputs[0][1, 10]
bank_3 = bert_bank_outputs[0][2, 4]
bank_4 = bert_bank_outputs[0][3, 2]

banks = [
    bank_1.detach().numpy(),
    bank_2.detach().numpy(),
    bank_3.detach().numpy(),
    bank_4.detach().numpy()
]

Where are those numbers coming from?
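One way to check those indices is to look up where the token id for "bank" sits in each row of `input_ids`. The sketch below copies the ids from the output above. The value `bank_id = 3085` is an assumption inferred from that output (it is the only id shared by all four sentences at the positions used in the code); with the live tokenizer it could be checked with `bert_tokenizer.convert_tokens_to_ids('bank')`.

```python
# input_ids copied from the tokenizer output shown above.
input_ids = [
    [101, 146, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 2052, 102, 0],
    [101, 146, 1209, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 4911, 102],
    [101, 146, 1125, 1106, 3085, 1154, 170, 1885, 102, 0, 0, 0, 0],
    [101, 1109, 3085, 1587, 1200, 1108, 1304, 3505, 102, 0, 0, 0, 0],
]

bank_id = 3085  # assumption: inferred from the output above, not looked up in the vocabulary

# Position of the first occurrence of bank_id in each sentence.
positions = [row.index(bank_id) for row in input_ids]
print(positions)  # [9, 10, 4, 2]
```

Note that the positions count the leading special token at index 0, which is why "bank" in "The bank teller was very nice" is at index 2 rather than 1.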

Finally, we obtain the cosine similarities between the 4 vectors (from left to right and top to bottom we iterate through our vectors and report the cosine similarity):

In [24]:
cosine_similarities(banks)

	1.0	0.99	0.59	0.86
	0.99	1.0	0.59	0.87
	0.59	0.59	1.0	0.62
	0.86	0.87	0.62	1.0


Does this look right?

[Return to Top](#returnToTop)  
<a id = 'BERTClassification'></a>

## 4. Text Classification with BERT

Now we're ready to load and train our classification model.

[Return to Top](#returnToTop)  
<a id = 'modelSetup'></a>

### 4.1 Classification Model Setup

In order to use the BERT model on a new classification task, we'll need a new classification layer on top of the pre-trained transformer layers. Huggingface provides a model class for that purpose, called BertForSequenceClassification.

This class will load a model with the full architecture for a new text classification task. It starts with the pre-trained BERT model up until the last hidden layer, then it takes the "pooler" output from the BERT model and passes that into a new classification layer of the size we need for our task. (The "pooler" output is just the CLS token output, passed through another dense layer, before going to the output layer.)

The new classification layer has not been pre-trained, so we'll need to train it for our task. By default, Huggingface will give us an output layer with two classes (which fits our binary classification task), though we can specify a different number of classes if we have a multiclass classification task. We will most likely also want to continue updating the weights of at least some of the pre-trained layers, which we can explore later.

In [25]:
bert_classification_model = BertForSequenceClassification.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This first time through, we'll freeze all of the pre-trained BERT layers to make the fine tuning go much faster. Then later we'll try unfreezing some or all layers and see what works better.

We need to keep the final classification layer unfrozen no matter what, because that's a new layer that hasn't been trained at all yet, and needs to be trained for our task.

Let's use the named_parameters method to access the names of the weight matrices in the model, and only freeze the pre-trained BERT ones.

In [26]:
for name, param in bert_classification_model.named_parameters():
    print(name)

bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self.query.weight
bert.enc

In [27]:
for name, param in bert_classification_model.named_parameters():
    if name.split(".")[0] == "bert":
        param.requires_grad = False

In [28]:
# confirm all pre-trained layers are frozen
summary(bert_classification_model)

Layer (type:depth-idx)                                       Param #
BertForSequenceClassification                                --
├─BertModel: 1-1                                             --
│    └─BertEmbeddings: 2-1                                   --
│    │    └─Embedding: 3-1                                   (22,268,928)
│    │    └─Embedding: 3-2                                   (393,216)
│    │    └─Embedding: 3-3                                   (1,536)
│    │    └─LayerNorm: 3-4                                   (1,536)
│    │    └─Dropout: 3-5                                     --
│    └─BertEncoder: 2-2                                      --
│    │    └─ModuleList: 3-6                                  (85,054,464)
│    └─BertPooler: 2-3                                       --
│    │    └─Linear: 3-7                                      (590,592)
│    │    └─Tanh: 3-8                                        --
├─Dropout: 1-2                                         
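The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual bert-base configuration values, which are not printed in this notebook: hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types, and a vocabulary of 28,996 entries for the cased model. The counts in parentheses in the summary appear to be torchinfo's way of marking parameters that are not trainable.

```python
# Assumed configuration values (not printed in this notebook).
hidden = 768
vocab_size = 28996
max_positions = 512
type_vocab = 2
intermediate = 3072
num_layers = 12
num_labels = 2


def linear(n_in, n_out):
    """Number of parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

word_emb = vocab_size * hidden
pos_emb = max_positions * hidden
type_emb = type_vocab * hidden

per_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm                      # LayerNorm after attention
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm                      # LayerNorm after feed-forward
)
encoder = num_layers * per_layer
pooler_params = linear(hidden, hidden)
classifier_params = linear(hidden, num_labels)

print(f"{word_emb:,} {pos_emb:,} {type_emb:,} {layer_norm:,}")  # 22,268,928 393,216 1,536 1,536
print(f"{encoder:,}")            # 85,054,464
print(f"{pooler_params:,}")      # 590,592
print(f"{classifier_params:,}")  # 1,538
```

Under these assumptions, the only parameters left trainable after freezing are the 1,538 in the new classification layer.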

[Return to Top](#returnToTop)  
<a id = 'modelTraining'></a>

### 4.2 Model Training

To train a Huggingface model, we'll use a Trainer class, and a TrainingArguments class that goes with it.

Let's start with the TrainingArguments. This is just a simple config where we specify things like the batch size and number of epochs.

We also choose a filepath where we want to save model checkpoints after training. For now, we'll just define a local directory name, which will save the trained model in the Colab notebook's temporary storage.

For your assignments and project, you'll probably want to mount your Google Drive and specify a filepath to a directory there, so that the saved model checkpoints persist after the notebook is shut down.

In [29]:
batch_size = 16
num_epochs = 1

training_args = TrainingArguments(
    output_dir="bert_fine_tuned_imdb",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to='none'
)

In addition to model loss, we'll also want to keep track of a simple but more interpretable metric like validation accuracy, so that we can see how well the model is generalizing.

The trainer takes a "compute_metrics" argument, which needs to be a function that takes a set of predictions and labels and returns a metric. We can use the accuracy metric from the Huggingface evaluate package, and wrap it in the necessary function like this:

In [30]:
metric = evaluate.load('accuracy')

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
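
To make the metric concrete, here is a dependency-free sketch of the same computation: take the argmax over each row of logits and compare it with the label. The function names and the toy logits are made up for illustration; in the notebook the actual work is done by `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for 4 examples and 2 classes (0 = 'neg', 1 = 'pos').
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 1.1]]
labels = [0, 1, 0, 0]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```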


Now we make our Trainer, passing it the model to use, the training arguments, the training and validation data, and our compute_metrics function.

In [31]:
trainer = Trainer(
    model=bert_classification_model,
    args=training_args,
    train_dataset=imdb_train_dataset,
    eval_dataset=imdb_dev_dataset,
    compute_metrics=compute_metrics
)

... and train it!  (This takes a few minutes; we might only be able to train for one epoch in the live session.)

In [32]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.689605,0.518


TrainOutput(global_step=313, training_loss=0.7032082164630341, metrics={'train_runtime': 40.2722, 'train_samples_per_second': 124.155, 'train_steps_per_second': 7.772, 'total_flos': 328888819200000.0, 'train_loss': 0.7032082164630341, 'epoch': 1.0})

How well does it work? Can we do better? In the code above, we trained only the new classification layer that sits on top of BERT. We froze the pre-trained BERT layers, including the pooler (by setting requires_grad = False on their parameters), so they stay as-is and only the classification layer is learning.
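
Two quick sanity checks help put the training output in context. This is a small arithmetic sketch that assumes a single device and no gradient accumulation. It reproduces the reported number of optimizer steps and computes the cross-entropy loss of a binary classifier that always predicts 50/50, which is the chance-level reference point.

```python
import math

num_train = 5000
batch_size = 16
num_epochs = 1

# The last, partially filled batch still counts as one step.
steps = math.ceil(num_train / batch_size) * num_epochs
print(steps)  # 313, matching global_step in the TrainOutput

# Cross-entropy when each of the 2 classes always gets probability 0.5.
chance_loss = math.log(2)
print(round(chance_loss, 4))  # 0.6931
```

A validation loss of about 0.69 together with an accuracy of about 0.52 therefore means the frozen model is performing close to chance.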

[Return to Top](#returnToTop)  
<a id = 'classExercise'></a>

### 4.3 Class Exercise

Why didn't that work very well? Most likely it's because we froze all of the pre-trained layers. The way we use BERT for classification tasks, we're relying on the CLS token (and the pooler dense layer after it). Those were pre-trained using the next sentence prediction task (they don't really get trained in the masked language model task, because there's no real token there to mask).

People have generally found the next sentence prediction task to not be very useful pretraining for downstream tasks. The architecture is good, but we almost always need to fine-tune at least some of the transformer layers, to teach the CLS token how to capture useful context from the rest of the text (the real tokens) for a classification task.

Let's try unfreezing only the topmost transformer block and pooler layer (as well as the classification layer, always), or leaving all of the layers in the entire model unfrozen.
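
The selective unfreezing relies on substring matching against parameter names. Here is a small sketch of that logic on a handful of names. The helper `is_trainable` is hypothetical, and the `layer.11` and pooler names are assumed to follow the same naming pattern as the parameter listing earlier in the notebook.

```python
def is_trainable(name, keep=("layer.11", "bert.pooler", "classifier")):
    """A parameter stays trainable if any of the keep substrings occurs in its name."""
    return any(key in name for key in keep)


names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.1.attention.self.query.weight",
    "bert.encoder.layer.11.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
]

trainable = [n for n in names if is_trainable(n)]
print(trainable)

# Pitfall: a shorter substring such as "layer.1" also matches "layer.11"
# (and would match "layer.10"). Including the trailing dot avoids this.
loose_matches = [n for n in names if "layer.1" in n]
strict_matches = [n for n in names if "layer.1." in n]
print(len(loose_matches), len(strict_matches))  # 2 1
```

With 12 transformer blocks numbered 0 to 11, `"layer.11"` is unambiguous, but the same pattern needs care when unfreezing lower-numbered blocks.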

In [33]:
#let's get a fresh instance of the bert_model -- good practice
bert_classification_model = BertForSequenceClassification.from_pretrained(checkpoint)

#freeze all layers except the last transformer block ("layer.11"), the pooler layer and the classification layer
for name, param in bert_classification_model.named_parameters():
    if not any(x in name for x in ["layer.11", "bert.pooler", "classifier"]):
        param.requires_grad = False

trainer = Trainer(
    model=bert_classification_model,
    args=training_args,
    train_dataset=imdb_train_dataset,
    eval_dataset=imdb_dev_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.386369,0.808


TrainOutput(global_step=313, training_loss=0.5156440612987969, metrics={'train_runtime': 50.324, 'train_samples_per_second': 99.356, 'train_steps_per_second': 6.22, 'total_flos': 328888819200000.0, 'train_loss': 0.5156440612987969, 'epoch': 1.0})

Now let's allow all of the layers to be modified as part of fitting the model.

In [34]:
#let's get a fresh instance of the bert_model -- good practice
bert_classification_model = BertForSequenceClassification.from_pretrained(checkpoint)

trainer = Trainer(
    model=bert_classification_model,
    args=training_args,
    train_dataset=imdb_train_dataset,
    eval_dataset=imdb_dev_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.346721,0.858


TrainOutput(global_step=313, training_loss=0.44326504350851137, metrics={'train_runtime': 129.385, 'train_samples_per_second': 38.644, 'train_steps_per_second': 2.419, 'total_flos': 328888819200000.0, 'train_loss': 0.44326504350851137, 'epoch': 1.0})

What do you think? You'll explore these options more on your own in Assignment 2.