# Finetune a BERT model for politeness classification
BERT models are often used for text classification. BERT produces an output embedding for every input token, and the output embedding of the special `[CLS]` token, which is prepended to every input, is the one usually used for classification. Because of self-attention, this embedding draws on information from every token in the sentence, so it can serve as a representation of the whole sentence.
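
To make the self-attention point concrete, here is a minimal NumPy sketch of a single attention head. It is a toy under stated assumptions, not DistilBERT itself: the weights and token vectors are random, there is only one head and one layer, and row 0 merely stands in for `[CLS]`. It shows that changing only the last token shifts the output at the `[CLS]` position.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v

rng = np.random.default_rng(0)
dim = 8
# Random toy projection matrices (assumption: a real model learns these).
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

# Row 0 stands in for [CLS]; rows 1-3 stand in for word tokens.
tokens = rng.normal(size=(4, dim))
cls_before = self_attention(tokens, w_q, w_k, w_v)[0]

# Change only the last token, then recompute the output at the [CLS] position.
changed = tokens.copy()
changed[3] = rng.normal(size=dim)
cls_after = self_attention(changed, w_q, w_k, w_v)[0]

# Largest absolute change in the [CLS] output caused by editing a different token.
max_shift = float(np.abs(cls_after - cls_before).max())
print(max_shift)
```

Because every attention weight produced by the softmax is positive, the `[CLS]` output mixes in the value vector of every token, which is why it can summarize the sentence.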

In this notebook, you will finetune DistilBERT, a smaller, distilled version of BERT, for the task of politeness classification: deciding whether a sentence is polite or not.

This notebook is best run on a GPU.

Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification

In [None]:
import transformers
import datasets
import evaluate

# Load politeness data

In [None]:
import pandas as pd

data = pd.read_csv('data/politeness.csv')
# Rename `polite` column to `label`
data = data.rename(columns={'polite': 'label'})
data['text'] = data['text'].str.lower() # lowercase
data.info()
data.head()

# Finetune DistilBERT for politeness classification
Using Hugging Face 🤗 tools

## Get dataset into the Hugging Face input format

In [None]:
from datasets import Dataset

# Convert the DataFrame to a Hugging Face Dataset and hold out 10% of the rows as a test split
dataset = Dataset.from_pandas(data).train_test_split(test_size=0.1)
dataset

## Set up tokenization and initialize the model
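The cell below loads DistilBERT's pretrained subword tokenizer, whose vocabulary of subword units was learned from a large text corpus, and applies it to our data. Words that are not in the vocabulary are split into smaller pieces that are.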
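
As a rough illustration of how a WordPiece-style tokenizer (the kind DistilBERT uses) splits a word, the sketch below does a greedy longest-match-first lookup. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical; the real tokenizer also normalizes text and has a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"polite", "##ness", "##ly", "im", "##polite", "thank", "##s"}

print(wordpiece_tokenize("politeness", toy_vocab))  # ['polite', '##ness']
print(wordpiece_tokenize("impolitely", toy_vocab))  # ['im', '##polite', '##ly']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```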

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True) # truncates to DistilBERT's maximum input length

tokenized_dataset = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) # pads each batch to the length of its longest sequence
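
Conceptually, dynamic padding works roughly like the sketch below. `pad_batch` is a hypothetical helper written for illustration, not the library's implementation, and the token ids are toy values apart from the usual `[CLS]` (101), `[SEP]` (102), and `[PAD]` (0) ids of the uncased BERT vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (ids are illustrative).
batch = pad_batch([[101, 4067, 2017, 102], [101, 3531, 102]])
print(batch["input_ids"])       # [[101, 4067, 2017, 102], [101, 3531, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps batches of short sentences small, which saves computation.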

Now we'll initialize a pretrained DistilBERT Hugging Face model for 'sequence classification', which is text classification. We will set up necessary hyperparameters for training (finetuning) the model with the `Trainer` class.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

id2label = {0: "NOT POLITE", 1: "POLITE"}
label2id = {"NOT POLITE": 0, "POLITE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)
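
Since the two mappings must be exact inverses of each other, one option is to derive both from a single list of label names. This is a small sketch that assumes label `1` in the data means polite; the variable `label_names` is introduced here for illustration.

```python
# Position in the list is the integer label (assumption: 0 = not polite, 1 = polite).
label_names = ["NOT POLITE", "POLITE"]

id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)  # {0: 'NOT POLITE', 1: 'POLITE'}
print(label2id)  # {'NOT POLITE': 0, 'POLITE': 1}
```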

In [None]:
# Set up evaluation
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
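
To see what `compute_metrics` does with the model's outputs, here is the same argmax-then-compare logic on made-up logits, using plain NumPy instead of the `evaluate` metric. The numbers are illustrative only.

```python
import numpy as np

# Toy logits for four sentences: one row per sentence, columns = [NOT POLITE, POLITE].
logits = np.array([[2.0, -1.0], [0.3, 0.8], [-0.5, 1.5], [1.2, 0.1]])
labels = np.array([0, 1, 0, 0])

# The predicted class is the column with the highest score in each row.
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the reference labels.
accuracy_value = float((predictions == labels).mean())

print(predictions.tolist())  # [0, 1, 1, 0]
print(accuracy_value)        # 0.75
```

The `evaluate` accuracy metric returns the same quantity wrapped in a dictionary with the key `accuracy`, which is the format `Trainer` expects.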

In [None]:
# Set up training hyperparameters and initialize the Trainer
training_args = TrainingArguments(
    output_dir="politeness_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
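
For a sense of how long training runs, the batch size and epoch count translate into a number of optimizer steps. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the training set size of 1,800 examples is hypothetical, since the real size depends on the CSV.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of parameter updates for a full training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts
    return steps_per_epoch * num_epochs

# Hypothetical training set size, with the batch size and epochs used above.
steps = total_optimizer_steps(num_examples=1800, batch_size=16, num_epochs=3)
print(steps)  # 339
```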

## Finetune (train) the model

In [None]:
trainer.train()

## Evaluate performance on the test set

In [None]:
results = trainer.evaluate(tokenized_dataset['test'])
pd.DataFrame(results, index=['Fine-tuned DistilBERT'])

This is a hard task and our DistilBERT model has room for improvement. 

Feel free to play around with the training hyperparameters, retrain, and see if you can get better accuracy. You can also try other pretrained models such as `distilroberta-base`, `bert-base-uncased`, or `roberta-base`. Just substitute the names of the pretrained tokenizers and models.
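
If you want to try several settings systematically, one simple approach is to enumerate a small grid of hyperparameter combinations. The sketch below only builds the list of configurations; the candidate values and the helper name `expand_grid` are assumptions chosen for illustration, not recommendations.

```python
from itertools import product

# Hypothetical candidate values for two of the hyperparameters used above.
search_space = {
    "learning_rate": [2e-5, 5e-5],
    "num_train_epochs": [3, 5],
}

def expand_grid(space):
    """Return one dict per combination of the values in `space`."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

configs = expand_grid(search_space)
for config in configs:
    print(config)
```

Each dictionary can then be unpacked into `TrainingArguments` alongside the other settings. Reload the pretrained model before each run so that every configuration starts from the same weights.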