# Finetune a BERT model for politeness classification
BERT models are often used for text classification. BERT produces an output embedding for every input token; the output embedding of the special `[CLS]` token, which is prepended to every input, is the one usually used for classification. Because of self-attention, this embedding draws on information from all the tokens in the sentence, so it can serve as a representation of the whole sentence.
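
To make the self-attention point concrete, here is a toy sketch in plain Python. It is not BERT's actual implementation: it uses a single attention head, skips the learned query/key/value projections, and the 2-dimensional token vectors are made up for illustration. It only shows that the updated `[CLS]` vector is a weighted average over every token in the input.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Hypothetical 2-dimensional vectors for the tokens of "[CLS] thank you"
tokens = ["[CLS]", "thank", "you"]
vectors = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.8]]

# The [CLS] position attends to every token, itself included
cls_weights, cls_output = attend(vectors[0], vectors, vectors)
for token, weight in zip(tokens, cls_weights):
    print(f"{token:>6}: {weight:.3f}")
print("new [CLS] vector:", [round(x, 3) for x in cls_output])
```

Every token receives a nonzero weight, which is why the `[CLS]` output can act as a summary of the full sentence.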

In this notebook, you will finetune DistilBERT, a smaller distilled version of BERT, for politeness classification: determining whether a sentence is polite or not.

This notebook is best run on a GPU.

Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification

In [1]:
! pip install --user datasets evaluate # transformers[torch] is assumed to be already installed

Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-19.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm>=4.66.3 (from datasets)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.

Now restart your kernel with **Kernel > Restart Kernel**. Test the installation by running:

In [1]:
import transformers
import datasets
import evaluate

2025-03-22 10:20:25.533573: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742653226.133483   17504 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742653226.298369   17504 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742653227.535001   17504 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742653227.535044   17504 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742653227.535047   17504 computation_placer.cc:177] computation placer alr

# Load politeness data

In [2]:
import pandas as pd

data = pd.read_csv('data/politeness.csv')
# Rename `polite` column to `label`
data = data.rename(columns={'polite': 'label'})
data['text'] = data['text'].str.lower() # lowercase
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4914 entries, 0 to 4913
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   utterance_id     4914 non-null   int64 
 1   conversation_id  4914 non-null   int64 
 2   text             4914 non-null   object
 3   label            4914 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 153.7+ KB


| | utterance_id | conversation_id | text | label |
|---|---|---|---|---|
| 0 | 3208 | 2099 | is it for a particular cpu (e.g. a z80 or the ... | 1 |
| 1 | 4458 | 4622 | what is an eastern style pot? is this a cup-l... | 1 |
| 2 | 4888 | 5599 | i didn't know that you had to use an email add... | 0 |
| 3 | 3316 | 2295 | i don't understand. that call stack is all ab... | 1 |
| 4 | 4005 | 3674 | the last sentence makes me think that there is... | 0 |


# Finetune DistilBERT for politeness classification
Using Hugging Face 🤗 tools

## Get dataset into the Hugging Face input format

In [4]:
from datasets import Dataset

dataset = Dataset.from_pandas(data).train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['utterance_id', 'conversation_id', 'text', 'label'],
        num_rows: 4422
    })
    test: Dataset({
        features: ['utterance_id', 'conversation_id', 'text', 'label'],
        num_rows: 492
    })
})
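
Note that `train_test_split` shuffles the rows before splitting, so rerunning the cell gives a different split unless a seed is fixed (`Dataset.train_test_split` accepts a `seed` argument). The sketch below illustrates the idea in plain Python on row indices. The helper `split_indices` and the seed value 42 are hypothetical, and the rounding rule is an assumption chosen to match the split sizes shown above.

```python
import math
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumed rounding: test set rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(4914)
print(len(train_idx), len(test_idx))  # 4422 492

# Same seed, same split, so accuracy numbers are comparable across reruns
assert split_indices(4914) == (train_idx, test_idx)
```

Fixing the seed matters when comparing hyperparameters or models, since otherwise part of any accuracy difference comes from the split itself.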

## Set up tokenization and initialize the model
This loads a subword tokenizer whose vocabulary was learned during pretraining and applies it to our data, splitting each text into subword units that the model recognizes.

In [5]:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True) # truncates to DistilBERT's maximum input length

tokenized_dataset = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) # pads each batch to the length of its longest example

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/4422 [00:00<?, ? examples/s]

Map:   0%|          | 0/492 [00:00<?, ? examples/s]
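
The tokenizer truncates each text individually, while the collator pads at batch time. Below is a plain-Python sketch of that padding step, assuming padding to the longest sequence in the batch and a pad token id of 0 (the `[PAD]` id in the `distilbert-base-uncased` vocabulary). `pad_batch` is a hypothetical helper, and the token ids between 101 (`[CLS]`) and 102 (`[SEP]`) are made up.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized examples of different lengths
batch = [
    [101, 4067, 2017, 102],
    [101, 2054, 2003, 2023, 1029, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps batches of short texts small, which saves computation.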

Now we'll initialize a pretrained DistilBERT model with a sequence classification head (Hugging Face's term for text classification). We will then set up the hyperparameters for training (finetuning) the model with the `Trainer` class.

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

id2label = {0: "NOT POLITE", 1: "POLITE"}
label2id = {"NOT POLITE": 0, "POLITE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
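
The `id2label` mapping is what turns the model's output index back into a readable label at prediction time. Here is a sketch in plain Python, assuming the intended mapping of 0 to `NOT POLITE` and 1 to `POLITE`; the logits are made up and `predict_label` is a hypothetical helper, not part of the `transformers` API.

```python
import math

id2label = {0: "NOT POLITE", 1: "POLITE"}

def predict_label(logits):
    """Convert two-class logits to probabilities and return the most likely label."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

label, prob = predict_label([-0.4, 1.2])  # hypothetical logits for one sentence
print(label, round(prob, 3))
```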


In [None]:
# Set up evaluation
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
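
As a worked example of what `compute_metrics` calculates, the sketch below repeats the same two steps (argmax over the logits, then the fraction of matches) in plain Python. The logits and labels are made up for illustration.

```python
# Hypothetical logits for four sentences: one row per sentence, one column per class
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.8]]
labels = [0, 1, 0, 0]

# Argmax per row: the index of the larger logit is the predicted class
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Accuracy: fraction of predictions that match the reference labels
accuracy_value = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)     # [0, 1, 1, 0]
print(accuracy_value)  # 0.75
```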

In [None]:
# Set up training hyperparameters and initialize the Trainer
training_args = TrainingArguments(
    output_dir="politeness_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

## Finetune (train) the model

In [9]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.495912 | 0.764228 |
| 2 | 0.484600 | 0.494869 | 0.770325 |
| 3 | 0.484600 | 0.543407 | 0.764228 |


TrainOutput(global_step=831, training_loss=0.4212143550280629, metrics={'train_runtime': 134.7046, 'train_samples_per_second': 98.482, 'train_steps_per_second': 6.169, 'total_flos': 252139791973848.0, 'train_loss': 0.4212143550280629, 'epoch': 3.0})
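
The step count and the "No log" entry can be checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps (`logging_steps` is not set in the arguments above).

```python
import math

train_examples = 4422  # size of the training split
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
logging_steps = 500    # assumed: the TrainingArguments default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which the training loss gets logged
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
first_logged_epoch = math.ceil(logged_at[0] / steps_per_epoch)

print(steps_per_epoch)     # 277
print(total_steps)         # 831
print(logged_at)           # [500]
print(first_logged_epoch)  # 2
```

Under these assumptions the training loss is logged only once, at step 500 during epoch 2. That is consistent with "No log" for epoch 1 and the same value repeated for epochs 2 and 3. Setting a smaller `logging_steps` would give a training loss for every epoch.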

## Evaluate performance on the test set

In [11]:
results = trainer.evaluate(tokenized_dataset['test'])
pd.DataFrame(results, index=['Fine-tuned DistilBERT'])

| | eval_loss | eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|
| Fine-tuned DistilBERT | 0.494869 | 0.770325 | 1.1401 | 431.547 | 27.191 | 3.0 |


This is a hard task, and at about 77% accuracy our DistilBERT model has room for improvement.

Feel free to play around with the training hyperparameters, retrain, and see if you can get better accuracy. You can also try other pretrained models such as `distilroberta-base`, `bert-base-uncased`, or `roberta-base`. Just substitute the checkpoint name in both the tokenizer and the model `from_pretrained` calls.
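
One way to make that substitution less error-prone is to keep the checkpoint name in a single variable and load both pieces from it. This is a minimal sketch: `load_checkpoint` and `CANDIDATE_CHECKPOINTS` are hypothetical names, and the `transformers` import sits inside the function only so that the definition itself has no dependencies.

```python
# Checkpoints mentioned in this notebook
CANDIDATE_CHECKPOINTS = [
    "distilbert-base-uncased",
    "distilroberta-base",
    "bert-base-uncased",
    "roberta-base",
]

def load_checkpoint(checkpoint, id2label, label2id):
    """Load a matching tokenizer and classification model from one checkpoint name."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model

# Example usage (downloads weights on first call):
# tokenizer, model = load_checkpoint(
#     "distilroberta-base",
#     {0: "NOT POLITE", 1: "POLITE"},
#     {"NOT POLITE": 0, "POLITE": 1},
# )
```

After switching checkpoints, rerun the tokenization step and recreate the data collator, since each tokenizer has its own vocabulary. The RoBERTa checkpoints are also case-sensitive, so lowercasing the text is not required for them.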