This notebook walks through training and evaluating a transformer classification model from beginning to end. We use data in which each text has been labelled as either relevant or irrelevant to a particular client project. The data has already been cleaned and labelled, so data wrangling and cleaning techniques are not covered in this tutorial.

We are going to take a pretrained **encoder model**, use **domain adaptation** to update the initial weights in the model body, and then **fine-tune** the model head on a binary classification problem. Instead of training a language model from scratch (for which we have neither enough data nor the resources), we continue training a pretrained model on data from our domain. This step uses the classic language-modelling objective of predicting masked words, so it needs no labelled data. Afterwards we load the adapted model as a classifier and fine-tune it, thereby leveraging the unlabelled data.

The first important decision is which model architecture to adapt and fine-tune. Here is a [webpage](https://huggingface.co/docs/transformers/model_doc/albert) listing the pretrained models available on Hugging Face. Different models are trained with different tasks in mind, so select one based on your use case. We are going to use [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta).

In this notebook we are going to use the Hugging Face transformers library to train our model and Weights & Biases (wandb) to track training progress and experiments.

In [None]:
# run in terminal so we can use quantization
# pip install git+https://github.com/huggingface/transformers.git
# pip install git+https://github.com/huggingface/accelerate.git

In [2]:
# first let's load in our packages
import pandas as pd
import numpy as np
import os
import datasets
import evaluate

# import wandb
import accelerate
import transformers
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EvalPrediction,
    DataCollatorForLanguageModeling,
    AutoModelForMaskedLM,
    set_seed,
    QuantoConfig,
)

## What Model Should We Use?

Here is a [webpage](https://huggingface.co/docs/transformers/model_doc/albert) listing the pretrained models available on Hugging Face. We are going to use [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta).

## Prepare Our Data

We have two datasets, one unlabelled and one labelled. We will load both from S3 and then convert them into a Hugging Face dataset.

In [4]:
# first let's load in our unlabelled training data
unlabelled_data = pd.read_parquet(
    "s3a://fti-adhoc-prod/MLGuide/Example_DomainAdaption_Data"
)
unlabelled_data = unlabelled_data.rename(columns={"sentence": "text"})

# add in a blank labels column to our unlabelled data
unlabelled_data["label"] = ""

# take a random sample of unlabelled data since the dataset is so large
unlabelled_data = unlabelled_data.sample(5000).reset_index(drop=True)

# next, let's load our labelled data
labelled_data = pd.read_parquet(
    "s3://fti-adhoc-prod/MLGuide/Example_Classification_Training_Data"
)
labelled_data = labelled_data[["text", "relevance_label"]]
labelled_data = labelled_data.rename(columns={"relevance_label": "label"})

# randomly downsample the "Discard" class so both classes have the same number of examples
num_relevant = len(labelled_data[labelled_data["label"] == "Relevant"])
keep = (
    labelled_data[labelled_data["label"] == "Discard"]
    .sample(num_relevant)
    .index.tolist()
)
keep.extend(labelled_data[labelled_data["label"] == "Relevant"].index.tolist())

# `keep` holds index labels, so select with .loc (label-based) rather than .iloc (position-based)
labelled_data = labelled_data.loc[keep]

Next we use scikit-learn to create balanced training, validation, and test sets. If you are training a multi-label model, refer to the [scikit-multilearn library](http://scikit.ml/) instead.

Here we split the labelled data into 75% training and 25% held-out data, and then split the held-out data into 40% testing and 60% validation. The final split is 75% training, 15% validation, and 10% testing.

In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    labelled_data,
    test_size=0.25,
    random_state=42,
    shuffle=True,
    stratify=labelled_data["label"],
)

test, val = train_test_split(
    test, test_size=0.6, random_state=42, shuffle=True, stratify=test["label"]
)
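
As a quick check on the split arithmetic, the sketch below compounds the two `test_size` values and computes the resulting row counts. It assumes that the share passed as `test_size` is rounded up with a ceiling, which is consistent with the row counts printed further down in this notebook. The helper name `split_counts` is ours, and 5138 is simply the sum of the three split sizes reported below.

```python
import math


def split_counts(n_samples, first_test_size=0.25, second_test_size=0.6):
    """Return (train, test, val) row counts for the two-stage split."""
    # Stage 1: hold out `first_test_size` of the rows (rounded up); train on the rest.
    n_holdout = math.ceil(first_test_size * n_samples)
    n_train = n_samples - n_holdout
    # Stage 2: `second_test_size` of the held-out rows (rounded up) becomes validation.
    n_val = math.ceil(second_test_size * n_holdout)
    n_test = n_holdout - n_val
    return n_train, n_test, n_val


# Overall fractions: 25% * 40% = 10% test and 25% * 60% = 15% validation.
test_fraction = 0.25 * (1 - 0.6)
val_fraction = 0.25 * 0.6
print(round(test_fraction, 2), round(val_fraction, 2))

# 5138 = total number of balanced labelled examples in this notebook (assumption: sum of the splits).
print(split_counts(5138))
```

With 5138 examples this gives 3853 training, 514 test, and 771 validation rows.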

Let's look at class distribution between our training, test, and validation sets.

In [34]:
class_dist = pd.DataFrame()

total_examples = len(labelled_data)

temp = pd.DataFrame(train["label"].value_counts()).transpose()
temp["data set"] = "train"
class_dist = pd.concat([class_dist, temp], axis=0)

temp = pd.DataFrame(test["label"].value_counts()).transpose()
temp["data set"] = "test"
class_dist = pd.concat([class_dist, temp], axis=0)

temp = pd.DataFrame(val["label"].value_counts()).transpose()
temp["data set"] = "val"
class_dist = pd.concat([class_dist, temp], axis=0)

class_dist["proportion"] = (
    class_dist["Discard"] + class_dist["Relevant"]
) / total_examples

print(class_dist)

label  Relevant  Discard data set  proportion
count      1927     1926    train    0.749903
count       257      257     test    0.100039
count       385      386      val    0.150058


Now let's convert our data splits into a Hugging Face `DatasetDict`.

In [7]:
ds = datasets.DatasetDict(
    {
        "train": datasets.Dataset.from_pandas(train.reset_index(drop=True)),
        "test": datasets.Dataset.from_pandas(test.reset_index(drop=True)),
        "val": datasets.Dataset.from_pandas(val.reset_index(drop=True)),
        "unsup": datasets.Dataset.from_pandas(unlabelled_data.reset_index(drop=True)),
    }
)

print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3853
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 514
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 771
    })
    unsup: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})


In [8]:
del (unlabelled_data, labelled_data, test, train, val)

## Domain Adaptation

In this step we leverage our large volume of unlabelled data to continue training a pretrained model with a masked language modelling objective, so the model "learns" the kind of text found in our domain.

First, we call the pre-trained tokenizer with `return_special_tokens_mask=True`, so that each example also carries a mask marking special tokens such as `[CLS]` and `[SEP]`. The data collator uses this mask to avoid replacing those tokens with `[MASK]`. Make sure to remove all other columns (including your classification labels) so the only columns are `['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask']`. The collator builds its own `labels` from the tokens it masks, so the raw text and the classification labels should not be passed along during this step.
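
To make the role of the special-tokens mask concrete, here is a minimal plain-Python sketch of what the mask contains. It is not the tokenizer's implementation: the helper `build_special_tokens_mask` is hypothetical, and the IDs 1 and 2 for `[CLS]` and `[SEP]` are taken from the collator example shown later in this notebook.

```python
def build_special_tokens_mask(input_ids, special_ids):
    """Return 1 for special tokens and 0 for ordinary tokens."""
    return [1 if token_id in special_ids else 0 for token_id in input_ids]


# Token IDs for "Transformers are awesome!" as shown later in this notebook.
input_ids = [1, 32629, 281, 2253, 300, 2]
SPECIAL_IDS = {1, 2}  # assumed IDs for [CLS] and [SEP]

mask = build_special_tokens_mask(input_ids, SPECIAL_IDS)

# Only positions where the mask is 0 are candidates for [MASK] replacement.
candidates = [i for i, is_special in enumerate(mask) if not is_special]
print(mask)
print(candidates)
```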

In [21]:
# set model checkpoint
model_ckpt = "microsoft/deberta-v3-small"

# download pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# see max_position_embeddings in model's config.json file to set max_length (will usually be 128 or 512)
def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, max_length=512, return_special_tokens_mask=True
    )


ds_mlm = ds.map(tokenize, batched=True)

# remove all columns (including your classification labels) so the only columns are ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask']
ds_mlm = ds_mlm.remove_columns(["text", "label"])

tokenizer_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 88.6kB/s]
config.json: 100%|██████████| 578/578 [00:00<00:00, 4.59MB/s]
spm.model: 100%|██████████| 2.46M/2.46M [00:00<00:00, 8.54MB/s]
Map: 100%|██████████| 3853/3853 [00:00<00:00, 10066.89 examples/s]
Map: 100%|██████████| 514/514 [00:00<00:00, 12935.98 examples/s]
Map: 100%|██████████| 771/771 [00:00<00:00, 11101.34 examples/s]
Map: 100%|██████████| 5000/5000 [00:00<00:00, 28263.92 examples/s]


Next, we use a data collator to mask random tokens within our text. We set `mlm_probability=0.15` to mask 15% of tokens, which follows the original procedure in the BERT paper.

In [22]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)
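
For intuition, the following is a simplified plain-Python sketch of the BERT-style masking rule: each non-special token is selected with probability `mlm_probability`, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. This is an illustration under stated assumptions rather than the library's code. The function `mask_tokens` is hypothetical, the mask token ID 128000 is taken from the output shown later in this notebook, and the vocabulary size is an assumed value that should be checked against the model's `config.json`.

```python
import random

IGNORE_INDEX = -100  # label for positions that were not selected for masking


def mask_tokens(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.15, seed=None):
    """Return (masked_input_ids, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, (token_id, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Never mask special tokens; skip tokens that are not selected.
        if is_special or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id             # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


input_ids = [1, 32629, 281, 2253, 300, 2]
special_mask = [1, 0, 0, 0, 0, 1]
masked, labels = mask_tokens(
    input_ids, special_mask,
    mask_token_id=128000,  # [MASK] ID as shown later in this notebook
    vocab_size=128100,     # assumed vocabulary size
    seed=3,
)
print(masked)
print(labels)
```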

Let's take a look at how it works. To do this we temporarily set the data collator to return NumPy arrays; be sure to switch it back to PyTorch tensors afterwards.

In [23]:
set_seed(3)
data_collator.return_tensors = "np"
inputs = tokenizer("Transformers are awesome!", return_tensors="np")
outputs = data_collator([{"input_ids": inputs["input_ids"][0]}])

data_collator.return_tensors = "pt"

pd.DataFrame(
    {
        "Original tokens": tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        "Masked tokens": tokenizer.convert_ids_to_tokens(outputs["input_ids"][0]),
        "Original input_ids": inputs["input_ids"][0],
        "Masked input_ids": outputs["input_ids"][0],
        "Labels": outputs["labels"][0],
    }
).T

                        0              1     2         3       4      5
Original tokens     [CLS]  ▁Transformers  ▁are  ▁awesome       !  [SEP]
Masked tokens       [CLS]  ▁Transformers  ▁are  ▁awesome  [MASK]  [SEP]
Original input_ids      1          32629   281      2253     300      2
Masked input_ids        1          32629   281      2253  128000      2
Labels               -100           -100  -100      -100     300   -100
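
The `-100` entries in the `Labels` row mark positions that do not contribute to the loss; only the masked position (label 300) is scored. The sketch below illustrates the idea with a hand-rolled cross-entropy over a toy four-token vocabulary. It is not the model's actual loss code: `masked_lm_loss` and the toy logits are made up for this example. PyTorch's cross-entropy loss uses `ignore_index=-100` by default, which is why this value is used.

```python
import math

IGNORE_INDEX = -100


def masked_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not `ignore_index`."""
    per_token = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # unmasked position: not scored
        log_normaliser = math.log(sum(math.exp(s) for s in scores))
        per_token.append(log_normaliser - scores[label])
    return sum(per_token) / len(per_token) if per_token else 0.0


# Toy example: 3 positions, vocabulary of 4 tokens; only the middle token was masked.
logits = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # uniform guess at the masked position
    [0.0, 9.0, 0.0, 0.0],
]
labels = [IGNORE_INDEX, 2, IGNORE_INDEX]

loss = masked_lm_loss(logits, labels)
print(round(loss, 4))
```

Because the only scored position has uniform logits over four tokens, the loss equals log(4); the confident predictions at the two ignored positions have no effect.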


Next, load the pre-trained model.

In [24]:
# from transformers import QuantoConfig
# import torch
# from torch.quantization import quantize_dynamic
model_ckpt = "microsoft/deberta-v3-small"
model = AutoModelForMaskedLM.from_pretrained(model_ckpt)
# quantization_config = QuantoConfig(weights="int8")
# model = AutoModelForMaskedLM.from_pretrained(model_ckpt, quantization_config=quantization_config)

# model = (AutoModelForMaskedLM.from_pretrained(model_ckpt).to("cpu"))
# model_quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

pytorch_model.bin: 100%|██████████| 286M/286M [00:05<00:00, 52.0MB/s]
Some weights of DebertaV2ForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we are ready to fine-tune the masked language model and log the training run to wandb.

In [None]:
# wandb.login(host="https://ftisc.wandb.io/", key=os.environ["WANDB_API_KEY"])  # keep the API key out of the notebook

In [25]:
# project_name = "llm-finetuning-example"
# entity = "marycatherine-sullivan"
# os.environ["WANDB_LOG_MODEL"] = "checkpoint"

# If you don't want your script to sync to the cloud
os.environ["WANDB_MODE"] = "offline"

# wandb.init(project=project_name,
# entity=entity,
# job_type="mlm training")

training_args = TrainingArguments(
    output_dir="outputs",
    # report_to="wandb",
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    logging_steps=500,
    warmup_steps=5,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=ds_mlm["unsup"],
    eval_dataset=ds_mlm["train"],
    data_collator=data_collator,
)

trainer.train()
# wandb.finish()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  3%|▎         | 8/234 [00:29<13:12,  3.51s/it]

RuntimeError: MPS backend out of memory (MPS allocated: 13.10 GB, other allocations: 22.71 GB, max allowed: 36.27 GB). Tried to allocate 1.95 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
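
The total of 234 steps in the progress bar can be read off the training arguments, and the same arithmetic suggests one way to respond to the out-of-memory error. The sketch below is back-of-the-envelope arithmetic, not the `Trainer`'s own code. It assumes a single device, the default of 3 training epochs (no `num_train_epochs` is set above), and that a trailing partial accumulation window is dropped, which is consistent with the 234 shown. The function names are ours.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                          num_epochs=3, num_devices=1):
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: a partial accumulation window at the end of an epoch is dropped.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


# Settings used above: 5000 unlabelled examples, batch size 8, accumulation 8.
print(effective_batch_size(8, 8), total_optimizer_steps(5000, 8, 8))

# A lower-memory alternative with the same effective batch size.
print(effective_batch_size(2, 32), total_optimizer_steps(5000, 2, 32))
```

Both settings give an effective batch size of 64 and the same number of updates, so lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` is a reasonable first thing to try when memory runs out. Shortening `max_length` in `tokenize` is another option. Neither is guaranteed to fit on a given machine.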

In [12]:
# wandb.finish()  # only needed if `import wandb` and `wandb.init(...)` above are uncommented

