# Setup

In [1]:
! pip install -q datasets==2.20.0 \
                 accelerate==0.33.0

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradient 2.0.6 requires attrs<=19, but you have attrs 23.1.0 which is incompatible.[0m[31m
[0m

In [3]:
import numpy as np
import pandas as pd

import tensorflow as tf

from transformers import create_optimizer

from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification

from sklearn.model_selection import train_test_split
from datasets import Dataset, load_dataset

# Method

Just as with transfer learning with images, pretrained BERT models can be fine-tuned by:

- Importing a pretrained model from HuggingFace and attaching a classifier head.
- Then, we freeze the base BERT model and finetune the dense layer.
- Finally, we unfreeze the base BERT model and finetune the entire model.

The finetuned model is now ready for inference.

# Data

In [5]:
ds = load_dataset("ccdv/patent-classification", "abstract")

Downloading data:   0%|          | 0.00/8.61M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.74M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.72M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [6]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [7]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")



tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [8]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [9]:
tokenized_dataset = ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [10]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})

In [11]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [25]:
id2label = {
    0: "Human Necessities", 
    1: "Performing Operations; Transporting",
    2: "Chemistry; Metallurgy",
    3: "Textiles; Paper",
    4: "Fixed Constructions",
    5: "Mechanical Engineering; Lightning; Heating; Weapons; Blasting",
    6: "Physics",
    7: "Electricity",
    8: "General tagging of new or cross-sectional technology"
}

label2id = { v: k for k, v in id2label.items()}

In [26]:
batch_size = 24
num_epochs = 50

In [28]:
batches_per_epoch = len(tokenized_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    min_lr_ratio=0.001,
    num_warmup_steps=0,
    num_train_steps=total_train_steps
)

2024-08-27 04:52:30.097522: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-08-27 04:52:30.136796: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-08-27 04:52:30.137049: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-

# Build Model

In [30]:
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=9,
    id2label=id2label,
    label2id=label2id
)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [31]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  6921      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66960393 (255.43 MB)
Trainable params: 66960393 (255.43 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


# Fine-Tune

Freeze the base BERT model.

In [32]:
model.layers[0].trainable = False

In [33]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  6921      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66960393 (255.43 MB)
Trainable params: 597513 (2.28 MB)
Non-trainable params: 66362880 (253.15 MB)
_________________________________________________________________


In [34]:
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

We use two callbacks - model checkpointing when best accuracy is observed and early stopping if validation accuracy does not improve for 4 epochs.

In [35]:
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,
    save_weights_only=True
)

earlystopping = tf.keras.callbacks.EarlyStopping(
    patience=4,
    monitor="val_accuracy",
    restore_best_weights=True
)

In [36]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_dataset['train'],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
    tokenizer=tokenizer
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [37]:
tf_val_set = model.prepare_tf_dataset(
    tokenized_dataset['validation'],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
    tokenizer=tokenizer
)

In [38]:
model.fit(
    tf_train_set,
    validation_data=tf_val_set,
    epochs=50,
    callbacks=[checkpoint, earlystopping]
)

Epoch 1/50


2024-08-27 04:53:55.478404: I external/local_xla/xla/service/service.cc:168] XLA service 0x33496900 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-08-27 04:53:55.478468: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Quadro P5000, Compute Capability 6.1
2024-08-27 04:53:55.485104: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-08-27 04:53:56.325699: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8902
I0000 00:00:1724734436.442683     228 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50


<keras.src.callbacks.History at 0x7f2cd1ae2810>

Unfreeze the base BERT model and continue training.

In [39]:
model.layers[0].trainable = True

In [40]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  6921      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66960393 (255.43 MB)
Trainable params: 66960393 (255.43 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [41]:
model.fit(
    tf_train_set,
    validation_data=tf_val_set,
    epochs=50,
    callbacks=[checkpoint, earlystopping]
)


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50


<keras.src.callbacks.History at 0x7f2c443b6c10>

# Inference

In [44]:
tf_test_set = model.prepare_tf_dataset(
    tokenized_dataset['test'],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
    tokenizer=tokenizer
)

In [46]:
model.evaluate(tf_test_set)



[1.1365915536880493, 0.5917468070983887]