# Lightweight Fine-Tuning Project

* PEFT technique: LoRA
* Model: gpt2
* Evaluation approach: accuracy
* Fine-tuning dataset: ccdv/patent-classification

In this project, we fine-tune the pre-trained GPT-2 model with LoRA to classify patent texts into nine broad technology categories.

## Loading and Evaluating a Foundation Model

We first install package versions that allow us to use this combination of techniques and datasets within the provided workspace.

In [1]:
%%capture
# within the workspace, this needs to be run at the start of every session
# then the kernel needs to be restarted before moving on
!pip install pandas==2.2.1
!pip install scikit-learn==1.4.0
!pip install transformers==4.38.1
!pip install datasets==2.17.1
!pip install peft==0.8.2
!pip install accelerate==0.27.2
!pip install fsspec==2023.9.2
!pip install torch==1.13.1

In [1]:
import pandas as pd
from datasets import load_dataset, DatasetDict

# https://huggingface.co/datasets/ccdv/patent-classification
# According to the dataset documentation, the labels represent the following categories.

num_labels = 9

label_dict = {
    0: "Human Necessities",
    1: "Performing Operations; Transporting",
    2: "Chemistry; Metallurgy",
    3: "Textiles; Paper",
    4: "Fixed Constructions",
    5: "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    6: "Physics",
    7: "Electricity",
    8: "New or cross-sectional technology",
}

# Load subsets of the pre-split data (train and validation here; test is loaded later).
# NOTE: only a subset is loaded due to training time constraints.
# Using more data would likely give better results.

dataset = DatasetDict()

dataset["train"] = load_dataset(
    "ccdv/patent-classification",
    split="train[:1800]",
    trust_remote_code=True,
)

dataset["validation"] = load_dataset(
    "ccdv/patent-classification",
    split="validation[:900]",
    trust_remote_code=True,
)

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1800
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 900
    })
})

In [2]:
# take a look at a sample
pd.DataFrame(dataset["train"])

| | text | label |
|---|---|---|
| 0 | turning now to the drawings , there is shown i... | 6 |
| 1 | deployment mechanisms that are configured for ... | 0 |
| 2 | now , first and second embodiments of the pres... | 7 |
| 3 | as used herein , “ administration ” of a compo... | 0 |
| 4 | in accordance with the figures , the mixing de... | 8 |
| ... | ... | ... |
| 1795 | and now the invention will be described in det... | 1 |
| 1796 | reference will now be made in detail to the pr... | 3 |
| 1797 | preferred features of exemplary embodiments of... | 7 |
| 1798 | referring to fig1 one embodiment of the invent... | 7 |
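
Because only the first 1,800 training rows are loaded, it is worth checking how the labels are distributed before training. The following is a minimal sketch that counts only the nine labels visible in the preview above, not the full subset; `label_counts` is a hypothetical helper, and in the notebook one would pass `dataset["train"]["label"]` instead.

```python
from collections import Counter

# Labels copied from the nine rows visible in the preview above;
# in the notebook, use dataset["train"]["label"] instead.
visible_labels = [6, 0, 7, 0, 8, 1, 3, 7, 7]


def label_counts(labels):
    """Return (label, count) pairs, most frequent first."""
    return Counter(labels).most_common()


counts = label_counts(visible_labels)
for label, count in counts:
    print(f"label {label}: {count} ({count / len(visible_labels):.0%})")
```

A strongly skewed distribution would mean that plain accuracy can look reasonable even when the model mostly predicts the most frequent class.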


In [3]:
# preprocess the text

from transformers import AutoTokenizer

model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def preprocess(raw_data):
    preprocessed_data = tokenizer(
        raw_data["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
    preprocessed_data["label"] = raw_data["label"]
    return preprocessed_data

data_train = dataset["train"].map(preprocess, batched=True)
data_validation = dataset["validation"].map(preprocess, batched=True)
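
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a small sketch on plain lists of token ids. `pad_or_truncate` is a hypothetical helper, not part of `transformers`; it assumes right-side padding and truncation (the GPT-2 tokenizer defaults) and uses 50256, GPT-2's end-of-text id, as the pad id.

```python
EOS_ID = 50256  # GPT-2's end-of-text token id, reused here as the pad token


def pad_or_truncate(token_ids, max_length, pad_id=EOS_ID):
    """Hypothetical helper mirroring truncation=True with padding="max_length"."""
    kept = token_ids[:max_length]          # drop tokens beyond max_length
    n_pad = max_length - len(kept)         # how many pad tokens to append
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }


short = pad_or_truncate([10, 11, 12], max_length=5)
long = pad_or_truncate(list(range(8)), max_length=5)
print(short)
print(long)
```

Since every example is padded to 512 tokens at this stage, a padding collator used later has nothing left to pad; dropping `padding="max_length"` here and letting the collator pad each batch would likely reduce computation on short texts.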

In [4]:
# Load the pre-trained model as a classifier with one output per dataset label.
# We do not expect it to perform well before fine-tuning.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
model.config.pad_token_id = model.config.eos_token_id

model

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid

In [5]:
# Load a sample of the test data and take a look

data_test = pd.DataFrame(
    load_dataset(
        "ccdv/patent-classification",
        split="test[:100]",
        trust_remote_code=True,
    )
)

data_test

| | text | label |
|---|---|---|
| 0 | as used herein , the term &# 34 ; sensitizer m... | 8 |
| 1 | fig1 describes the five step / stage process o... | 1 |
| 2 | an alarm network for a building 8 , or the lik... | 6 |
| 3 | the disposable razor and emollient dispensing ... | 1 |
| 4 | in accordance with the present invention it ha... | 2 |
| ... | ... | ... |
| 95 | one or more specific embodiments of the presen... | 0 |
| 96 | with reference to the above fig1 - 3 , the ret... | 0 |
| 97 | the following description is of the best - con... | 5 |
| 98 | the following agents which destroy extracellul... | 0 |


In [6]:
# Prepare the prediction function

import torch

def predict(model: torch.nn.Module, text: str) -> int:
    """Return the predicted label id for a single text."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # disable dropout so that predictions are deterministic
    input_tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)

    with torch.no_grad():
        outputs = model(**input_tokens)
        logits = outputs.logits

    # logits has shape (1, num_labels); pick the most probable class
    probs = torch.nn.functional.softmax(logits, dim=1)
    return probs.argmax().item()
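
A side note on the prediction step: softmax preserves the ordering of the logits, so the argmax of the probabilities is the same class as the argmax of the raw logits. The sketch below illustrates this in plain Python with made-up logits; `softmax` and `argmax` here are small stand-ins written for this example, not the `torch` functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [0.3, -1.2, 2.5, 0.0]  # made-up logits for illustration
probs = softmax(logits)

# Same winning class either way
assert argmax(probs) == argmax(logits)
print(argmax(logits), [round(p, 3) for p in probs])
```

The softmax is therefore only needed when the probabilities themselves are of interest, for example to report a confidence score.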

In [7]:
test_result = data_test.assign(pred = lambda df: df["text"].apply(lambda text: predict(model, text)))
print("Accuracy: ", (test_result["pred"] == test_result["label"]).mean().round(3))
test_result

Accuracy:  0.17


| | text | label | pred |
|---|---|---|---|
| 0 | as used herein , the term &# 34 ; sensitizer m... | 8 | 7 |
| 1 | fig1 describes the five step / stage process o... | 1 | 7 |
| 2 | an alarm network for a building 8 , or the lik... | 6 | 7 |
| 3 | the disposable razor and emollient dispensing ... | 1 | 7 |
| 4 | in accordance with the present invention it ha... | 2 | 7 |
| ... | ... | ... | ... |
| 95 | one or more specific embodiments of the presen... | 0 | 7 |
| 96 | with reference to the above fig1 - 3 , the ret... | 0 | 7 |
| 97 | the following description is of the best - con... | 5 | 8 |
| 98 | the following agents which destroy extracellul... | 0 | 7 |


## Performing Parameter-Efficient Fine-Tuning

We create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.
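
For reference, the LoRA update can be written as follows. This is a sketch in the usual LoRA notation; the symbols $W_0$, $A$, $B$, $x$, and $h$ do not appear in the notebook itself. The pre-trained weight $W_0$ stays frozen and only the two small matrices $A$ and $B$ are trained:

$$
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},
\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With the values used in this project ($r = 4$, $\alpha = 16$), the low-rank term is scaled by $\alpha / r = 4$.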

In [8]:
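# Prepare the PEFT model with the LoRA config

from peft import LoraConfig, TaskType, PeftModelForSequenceClassification

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    inference_mode=False,        # we are going to train the adapter
    r=4,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout=0.1,
    fan_in_fan_out=True,         # GPT-2's Conv1D layers store weights as (fan_in, fan_out)
)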

peft_model = PeftModelForSequenceClassification(model, config)
peft_model.print_trainable_parameters()

trainable params: 154,368 || all params: 124,601,088 || trainable%: 0.12388976892400812
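
The trainable parameter count printed above can be reproduced by hand. The sketch below rests on two assumptions that are not stated in the notebook: that PEFT's default LoRA target for GPT-2 is the `c_attn` projection in each of the 12 blocks of the base `gpt2` checkpoint, and that the bias-free classification head `score` is trained in full.

```python
HIDDEN = 768      # GPT-2 hidden size, see Embedding(50257, 768) in the model printout
N_LAYERS = 12     # number of transformer blocks in the base "gpt2" checkpoint
RANK = 4          # r in the LoraConfig
NUM_LABELS = 9

# Assumption: only c_attn (768 -> 3 * 768) is adapted in each block.
c_attn_in, c_attn_out = HIDDEN, 3 * HIDDEN
lora_per_layer = RANK * (c_attn_in + c_attn_out)  # A is r x in, B is out x r
lora_total = N_LAYERS * lora_per_layer

# Assumption: the classification head (score, no bias) is trained in full.
score_head = HIDDEN * NUM_LABELS

trainable = lora_total + score_head
print(f"{lora_total:,} + {score_head:,} = {trainable:,}")
```

Under these assumptions the sum matches the 154,368 trainable parameters reported by `print_trainable_parameters()`.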


In [27]:
# fine tune the model

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer, TrainingArguments, EvalPrediction
from transformers import DataCollatorWithPadding


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


def compute_metrics(eval_prediction: EvalPrediction):
    predictions = np.argmax(eval_prediction.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        eval_prediction.label_ids,
        predictions,
        average='weighted',
    )
    return {
        "accuracy": accuracy_score(eval_prediction.label_ids, predictions),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

training_args = TrainingArguments(
    output_dir="./results/peft_model",
    evaluation_strategy="epoch",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir='./logs/peft_model',
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=100,
    warmup_ratio=0.1,
    dataloader_num_workers=1,
    dataloader_prefetch_factor=1,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=data_train,
    eval_dataset=data_validation,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 1.4824 | 1.42232 | 0.473333 | 0.455897 | 0.448151 | 0.473333 |
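
Note that the Recall column equals the Accuracy column. This is expected with `average='weighted'`: weighting each class's recall by its share of the true labels gives the overall fraction of correct predictions. A minimal pure-Python sketch with made-up labels (the helper functions are written for this illustration and are not the scikit-learn implementations):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class frequencies (support) as weights."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        total += (n / len(y_true)) * (hits / n)
    return total


# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 7]
y_pred = [0, 0, 1, 1, 7, 2, 2, 0, 2, 7]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

Weighted precision and F1 do not share this property, which is why those columns differ from the accuracy.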


Checkpoint destination directory ./results/peft_model/checkpoint-225 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=225, training_loss=1.7929861111111112, metrics={'train_runtime': 189.8299, 'train_samples_per_second': 9.482, 'train_steps_per_second': 1.185, 'total_flos': 471217471488000.0, 'train_loss': 1.7929861111111112, 'epoch': 1.0})
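
The numbers in `TrainOutput` follow from the training arguments. The sketch below redoes the arithmetic, assuming a single device, no gradient accumulation, and that the warmup step count is the rounded-up product of total steps and `warmup_ratio`.

```python
import math

n_train = 1800          # rows in the training subset
batch_size = 8          # per_device_train_batch_size
epochs = 1              # num_train_epochs
warmup_ratio = 0.1
runtime_s = 189.8299    # train_runtime reported above

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

samples_per_second = round(n_train * epochs / runtime_s, 3)
steps_per_second = round(total_steps / runtime_s, 3)

print(total_steps, warmup_steps)
print(samples_per_second, steps_per_second)
```

The step count and both throughput figures agree with `global_step=225`, `train_samples_per_second`, and `train_steps_per_second` in the output above.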

In [28]:
# save model

peft_model.save_pretrained(f"model/{model_name}-lora")

## Performing Inference with a PEFT Model

We load the saved PEFT model weights and evaluate the fine-tuned model on the same test sample, then compare the result with the accuracy measured before fine-tuning.

In [8]:
# Load the saved LoRA adapter on top of the base model

from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# num_labels is passed on to the base model so that its classification head has 9 outputs
lora_model = AutoPeftModelForSequenceClassification.from_pretrained(f"model/{model_name}-lora", num_labels=num_labels)
lora_model.config.pad_token_id = lora_model.config.eos_token_id


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
test_result = (
    data_test
    .assign(pred = lambda df: df["text"].apply(lambda text: predict(lora_model, text)))
)
print("Accuracy: ", (test_result["pred"] == test_result["label"]).mean().round(3))
test_result

Accuracy:  0.48


| | text | label | pred |
|---|---|---|---|
| 0 | as used herein , the term &# 34 ; sensitizer m... | 8 | 2 |
| 1 | fig1 describes the five step / stage process o... | 1 | 8 |
| 2 | an alarm network for a building 8 , or the lik... | 6 | 6 |
| 3 | the disposable razor and emollient dispensing ... | 1 | 1 |
| 4 | in accordance with the present invention it ha... | 2 | 2 |
| ... | ... | ... | ... |
| 95 | one or more specific embodiments of the presen... | 0 | 0 |
| 96 | with reference to the above fig1 - 3 , the ret... | 0 | 0 |
| 97 | the following description is of the best - con... | 5 | 8 |
| 98 | the following agents which destroy extracellul... | 0 | 2 |


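On the 100-example test sample, accuracy improved from 17% before fine-tuning to 48% after one epoch of LoRA fine-tuning on 1,800 training examples.

Using more training data would probably improve this result. The aim of this project was only to demonstrate the method of fine-tuning a pre-trained model, not to maximize accuracy.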
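
Because the comparison rests on only 100 test examples, it helps to attach a rough uncertainty to both accuracy figures. The sketch below uses the normal-approximation (Wald) interval for a proportion; this is a swapped-in textbook method, not something the notebook computes, and `wald_interval` is a hypothetical helper.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


n_test = 100
chance = 1 / 9  # uniform guessing over the 9 classes

for name, acc in [("before fine-tuning", 0.17), ("after fine-tuning", 0.48)]:
    low, high = wald_interval(acc, n_test)
    print(f"{name}: {acc:.2f} (roughly {low:.2f} to {high:.2f}), chance = {chance:.3f}")
```

The two intervals do not overlap, so the improvement is unlikely to be an artifact of the small test sample, although the exact accuracy after fine-tuning remains uncertain by about ten percentage points either way.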