# Pretrained transformers

Transformers can be pretrained on large amounts of unlabeled text by trying to predict the next word (as in GPT) or trying to predict missing words within a passage (as in BERT). This creates embeddings that are similar to word2vec, but incorporate the surrounding context, making them extremely powerful for all kinds of downstream Natural Language Processing tasks. In this tutorial, we will use a pretrained BERT model, with a classification head, to classify sentences as coming from different sections of structured abstracts.
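To make the two pretraining objectives concrete, here is a toy sketch of how training pairs could be built from an already tokenized sentence. The function names and the fixed mask position are illustrative assumptions. Real pretraining pipelines pick masked positions at random and work on subword IDs rather than whole words.

```python
def next_word_examples(tokens):
    """GPT-style: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_word_example(tokens, position, mask_token="[MASK]"):
    """BERT-style: hide one token and ask the model to recover it from both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


sentence = ["drug", "sequence", "had", "no", "effect"]
causal_pairs = next_word_examples(sentence)
masked_input, masked_target = masked_word_example(sentence, position=2)

print(causal_pairs[0])  # (['drug'], 'sequence')
print(masked_input, "->", masked_target)
```

Neither objective needs human labels, because the targets come from the text itself. That is why pretraining can use such large unlabeled corpora.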

## Abstract section prediction with pretrained transformers
Scientific abstracts are often structured with labeled sections, like "background" and "methods" ([example](https://pubmed.ncbi.nlm.nih.gov/1429477/)). In this lab, we will try to predict which section a sentence came from. By leveraging embeddings from a pretrained model that already understands a lot about language, we will be able to solve this task with relatively few training examples and little training time.

For more information see https://huggingface.co/docs/transformers/tasks/sequence_classification.


In [18]:
!pip install datasets evaluate



In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, AutoConfig # added AutoConfig -JU 4/18/2025
from transformers import create_optimizer
from transformers import DataCollatorWithPadding
from transformers.keras_callbacks import KerasMetricCallback
import evaluate
import warnings
warnings.filterwarnings('ignore')

This dataset has four possible labels:

- 0: BACKGROUND
- 1: METHODS
- 2: RESULTS
- 3: CONCLUSIONS

Load the data, split into training and validation, and create HuggingFace `Dataset` objects to help with tokenization and batching:

In [20]:
!gdown 1kwdvVkYmsAFYdnppZPlFBDvVLQQtYgYC

Downloading...
From: https://drive.google.com/uc?id=1kwdvVkYmsAFYdnppZPlFBDvVLQQtYgYC
To: /content/data.tsv
100% 16.9M/16.9M [00:00<00:00, 69.5MB/s]


In [21]:
subset = 10000

id2label = {0: "BACKGROUND", 1: "METHODS", 2: "RESULTS", 3: "CONCLUSIONS"}
label2id = {"BACKGROUND": 0, "METHODS": 1, "RESULTS": 2, "CONCLUSIONS": 3}
num_labels = 4
df = pd.read_csv('data.tsv', sep='\t')
df = df[:subset]
df['label'] = [label2id[x] for x in df['label']]
df_train, df_val = train_test_split(df, test_size=0.1, random_state=0)
train_ds = Dataset.from_pandas(df_train, split="train")
val_ds = Dataset.from_pandas(df_val, split="val")
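The label encoding and hold-out split performed above can be pictured with the standard library alone. This is a simplified sketch, not how `train_test_split` is implemented: the `holdout_split` helper and the generated rows are hypothetical stand-ins for the records in `data.tsv`.

```python
import random

id2label = {0: "BACKGROUND", 1: "METHODS", 2: "RESULTS", 3: "CONCLUSIONS"}
# Deriving the inverse mapping keeps the two dictionaries in sync.
label2id = {name: idx for idx, name in id2label.items()}


def holdout_split(rows, test_size=0.1, seed=0):
    """Shuffle a copy of rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * test_size))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical rows standing in for the (text, label) records in data.tsv.
label_names = list(label2id)
rows = [(f"sentence {i}", label_names[i % 4]) for i in range(10000)]
encoded = [(text, label2id[label]) for text, label in rows]
train_rows, val_rows = holdout_split(encoded)
print(len(train_rows), len(val_rows))  # 9000 1000
```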

Datasets have columns like DataFrames. Inspect the first few rows:

In [22]:
train_ds[0:5]

{'label': [2, 2, 3, 3, 1],
 'text': ['Drug sequence had no effect on toxicities. ',
  'Four adult patients had transient hypotension. ',
  'Rimexolone has a low IOP-elevating potential, comparable to that of fluorometholone and less than that of dexamethasone sodium phosphate and prednisolone acetate.',
  'We believe that this is only the second reported case of acute cholestatic jaundice resulting from ciprofloxacin therapy. ',
  'In a Phase II trial, the authors evaluated the influence of paclitaxel, carboplatin, and an antimotility factor (acellular pertussis vaccine [APV]) in 18 patients with cisplatin- and methotrexate-resistant metastatic bladder carcinoma. '],
 '__index_level_0__': [1554, 2087, 5470, 2363, 7570]}

Load the pretrained transformer model from [HuggingFace](https://huggingface.co), an extremely popular library and repository for loading and manipulating language models. In this case we will be using [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), a smaller (and faster) version of the seminal [BERT](https://arxiv.org/pdf/1810.04805.pdf) pretrained transformer model. Also note that this is a "...ForSequenceClassification" model, which means it already has a randomly initialized feedforward head on top of the transformer embedding model, ready for us to train.

**1. Initialize an [`AutoTokenizer`](https://huggingface.co/docs/transformers/v4.35.2/en/model_doc/auto) named `tokenizer` using `distilbert-base-uncased` as the pretrained model path.** Truncation is requested later, when the tokenizer is called in `preprocess_function`.

In [23]:
# YOUR CODE HERE (1)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

**2. Initialize a [`TFAutoModelForSequenceClassification`](https://huggingface.co/docs/transformers/v4.35.2/en/model_doc/auto#transformers.TFAutoModelForSequenceClassification) named `model`, again using `distilbert-base-uncased` as the pretrained model path.**
- Hint: You will need to specify `num_labels` and the dictionaries `id2label` and `label2id`. See the definitions of these above.

In [24]:
# YOUR CODE HERE (2)
config = AutoConfig.from_pretrained("distilbert-base-uncased", num_labels=num_labels, id2label=id2label, label2id=label2id)
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", config=config)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

The warning tells us that the weights of the pretrained model don't all match the classification model. That makes sense, because we are taking off the old top (which predicted missing words) and putting on a new one (which predicts a class). The warning thus advises us to train this new model, and that's exactly what we will do!
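As a rough picture of what the new top looks like, the sketch below applies a small randomly initialized feedforward head to a stand-in sentence embedding. The layer names mirror the `pre_classifier` and `classifier` weights listed in the warning. The hidden size of 768, the ReLU activation, the initialization scale, and the random embedding are assumptions made for illustration, not values read from the loaded model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 4  # assumed sizes for illustration

# Stand-in for the transformer's output vector for one sentence.
sentence_embedding = rng.normal(size=(1, hidden_size))

# Freshly initialized head weights: these are the parameters that still need training.
pre_classifier_w = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
pre_classifier_b = np.zeros(hidden_size)
classifier_w = rng.normal(scale=0.02, size=(hidden_size, num_labels))
classifier_b = np.zeros(num_labels)

# Dense layer with ReLU, then a linear layer producing one score per class.
hidden = np.maximum(0.0, sentence_embedding @ pre_classifier_w + pre_classifier_b)
logits = hidden @ classifier_w + classifier_b

print(logits.shape)  # (1, 4)
```

Because these weights start out random, the logits carry no information about the section labels until the head has been trained.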



### Tokenization

Text data needs to be "tokenized," which is the process of converting raw strings into sequences of numbers that identify 'tokens' in our vocabulary. Tokens can be whole words, but some are punctuation or pieces of words. Having pieces of words in the vocabulary lets us handle words that were not in the pretraining data.
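The idea of splitting an unseen word into known pieces can be sketched with a greedy longest-match tokenizer in the style of WordPiece. This is a simplified illustration, not the tokenizer that HuggingFace ships: the function `wordpiece_tokenize` and the tiny vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real DistilBERT vocabulary.
toy_vocab = {"toxic", "##ities", "##ity", "drug", "effect"}

print(wordpiece_tokenize("toxicities", toy_vocab))  # ['toxic', '##ities']
print(wordpiece_tokenize("drug", toy_vocab))        # ['drug']
```

Even though "toxicities" is not in the toy vocabulary as a whole word, it can still be represented by two pieces that are.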

In [25]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = train_ds.map(preprocess_function, batched=True)
tokenized_val = val_ds.map(preprocess_function, batched=True)

tf_train_set = model.prepare_tf_dataset(
    tokenized_train,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_val,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

After tokenizing, our dataset has the additional fields `input_ids` (the list of token IDs for each sentence) and `attention_mask` (which marks real tokens with 1, so that padding added to bring sentences to the same length can be ignored):

In [26]:
tokenized_train[0]

{'label': 2,
 'text': 'Drug sequence had no effect on toxicities. ',
 '__index_level_0__': 1554,
 'input_ids': [101,
  4319,
  5537,
  2018,
  2053,
  3466,
  2006,
  11704,
  6447,
  1012,
  102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
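The mask in this single example is all ones because nothing has been padded yet. Padding happens when sentences of different lengths are batched together, which is the job of `DataCollatorWithPadding`. The sketch below imitates that step in plain Python. The helper `pad_batch`, the padding ID of 0, and the second, shorter sequence are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The first sequence is the example above; the second is a made-up shorter one.
batch = pad_batch([
    [101, 4319, 5537, 2018, 2053, 3466, 2006, 11704, 6447, 1012, 102],
    [101, 4319, 1012, 102],
])
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

The zeros tell the model which positions are padding, so they do not influence the sentence representation.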

Set up the training process, including compiling the model, computing steps, and setting up a callback to evaluate accuracy on the validation set after each epoch.

In [27]:
batch_size = 16
num_epochs = 1
batches_per_epoch = len(tokenized_train) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
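The step count and the accuracy metric in the cell above are simple to reproduce by hand. The sketch below assumes the 9,000 training rows and batch size of 16 used in this notebook. The helper `accuracy_from_logits` and the toy logits are made up for illustration and are not part of the `evaluate` library.

```python
import numpy as np

# Steps per epoch with the sizes used in this notebook.
num_train_examples, batch_size, num_epochs = 9000, 16, 1
total_train_steps = (num_train_examples // batch_size) * num_epochs
print(total_train_steps)  # 562


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == np.asarray(labels)))


# Made-up logits for three sentences over the four classes.
toy_logits = np.array([
    [0.1, 2.0, 0.3, -1.0],  # highest score at class 1
    [1.5, 0.2, 0.1, 0.0],   # highest score at class 0
    [0.0, 0.1, 0.2, 3.0],   # highest score at class 3
])
toy_labels = [1, 2, 3]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

Two of the three toy predictions match their labels, so the accuracy here is two thirds.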

**3. Compile the model, using the optimizer defined by `create_optimizer` above.**
- Hint: HuggingFace will take care of loss, so it should not be specified.

In [28]:
# YOUR CODE HERE (3)
model.compile(optimizer=optimizer)

Now we just need to run fit (this will take a few minutes).

**4. Fit the model on the training set for 1 epoch, validating on the validation set and using `metric_callback` as the only callback.**

In [29]:
# YOUR CODE HERE (4)
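# Reuse num_epochs so the number of epochs matches the learning-rate schedule built above.
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=[metric_callback])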



<tf_keras.src.callbacks.History at 0x7b56759bc150>

The validation accuracy reported at the end of the epoch is not bad for a single pass over 9,000 training examples on a 4-class classification problem! See how high you can get the accuracy with more epochs and more of the available data (see `subset` above).

To use our model, we need to define an inference function:

In [30]:
def classify(text):
    inputs = tokenizer(text, return_tensors="tf")
    logits = model(**inputs).logits
    predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
    return model.config.id2label[predicted_class_id]
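The `classify` function returns only the top label. If you also want a rough confidence score, the logits can be passed through a softmax first. The sketch below shows that step in plain Python. The helper names and the example logits are hypothetical and are not produced by the trained model.

```python
import math

id2label = {0: "BACKGROUND", 1: "METHODS", 2: "RESULTS", 3: "CONCLUSIONS"}


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the most likely label together with its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for a single sentence.
label, confidence = label_with_confidence([0.2, 0.1, 2.5, -0.4])
print(label, round(confidence, 3))
```

Softmax probabilities from a fine-tuned classifier are not guaranteed to be well calibrated, so treat the score as a relative indication rather than a true probability.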

Let's see what the model says about a few sentences (try your own!):

In [31]:
classify("""Little work has been done in humans to evaluate
the potential benefit of potassium supplementation.""")

'BACKGROUND'

In [32]:
classify("""The mean MDS-UPDRS total score at baseline was 34.3 in
the deferiprone group and 33.2 in the placebo group and increased
(worsened) by 15.6 points and 6.3 points, respectively (difference,
9.3 points; 95% confidence interval, 6.3 to 12.2; P<0.001).""")

'RESULTS'

In [33]:
classify("""We conducted a multicenter, phase 2, randomized, double-blind
trial involving participants with newly diagnosed Parkinson's disease who
had never received levodopa.""")

'METHODS'