# Packages

We use tools provided by Hugging Face to fine-tune a pretrained language model (distilbert-base-uncased) using Low-Rank Adaptation (LoRA). We tune the model on a financial sentiment dataset.

We use the following libraries: datasets, transformers and peft

'transformers' contains a wide range of pretrained models for NLP tasks

'peft' stands for Parameter-Efficient Fine-Tuning and provides the necessary tuning tools

'datasets' allows us to easily load, preprocess and manipulate the data

In [7]:
from datasets import load_dataset, DatasetDict, Dataset
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)
# peft; Parameter Efficient Fine Tuning
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Dataset

The tuning data contains statements and sentiment labels (positive, neutral and negative)

It contains 5,842 data points

We use an 80:20 train-test split, giving 4,673 training and 1,169 validation examples

In [9]:
df = pd.read_csv('data.csv')

# Mapping the sentiments to numerical values
sentiment_mapping = {
    'positive': 2,
    'neutral': 1,
    'negative': 0
}

df['Sentiment'] = df['Sentiment'].map(sentiment_mapping)
train_df, test_df = train_test_split(df, test_size=0.2)

x_train = train_df['Sentence'].tolist()
y_train = train_df['Sentiment'].tolist()

x_test = test_df['Sentence'].tolist()
y_test = test_df['Sentiment'].tolist()
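
The call to `train_test_split` above is unseeded, so the rows that land in each split change from run to run. As a rough illustration of what a seeded 80:20 shuffle split does, here is a minimal pure-Python sketch. The helper `split_indices` and the seed value are hypothetical, and this is not scikit-learn's implementation; the total of 5,842 rows is taken from the split sizes reported later in this notebook.

```python
import math
import random

def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test share.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # same seed -> same order
    n_test = math.ceil(test_fraction * n_samples)   # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5842)
print(len(train_idx), len(test_idx))  # 4673 1169
```

In the notebook itself, passing `random_state=42` (and optionally `stratify=df['Sentiment']`) to `train_test_split` should give a comparable reproducible split.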

In [None]:
# create new dataset
dataset = DatasetDict({'train':Dataset.from_dict({'label':y_train,'text':x_train}),
                             'validation':Dataset.from_dict({'label':y_test,'text':x_test})})

# Convert the labels to a numpy array
labels = np.array(dataset['train']['label'])

# Calculate the counts of each unique label
unique_labels, counts = np.unique(labels, return_counts=True)

# Calculate the total number of samples in the training set
total_samples = len(labels)

# Calculate and print the proportions
for label, count in zip(unique_labels, counts):
    proportion = count / total_samples
    print(f"Proportion of label {label}: {proportion:.2f}")

# Base Model

We tune distilbert-base-uncased. This model contains only about 67 million parameters and is available on Hugging Face.

We pass the label maps and the model checkpoint to the AutoModelForSequenceClassification class from the transformers package

This puts a classification head for our three sentiment classes on top of the base model. The head is newly initialised, so it has to be trained before its predictions are useful

In [9]:
model_checkpoint = 'distilbert-base-uncased'
# model_checkpoint = 'roberta-base' # you can alternatively use roberta-base but this model is bigger thus training will take longer

# define label maps
id2label = {0: "Negative", 1: "Neutral", 2:"Positive"}
label2id = {"Negative":0, "Neutral":1, "Positive":2}

# generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=3, id2label=id2label, label2id=label2id)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Tokenisation

Tokenisation converts the text into a numeric form that the model can understand

Note that we load the tokeniser that matches the model checkpoint

We also add a pad token if the tokeniser does not already have one


In [10]:
# create tokenizer FOR particular base model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer)) # updating the model to handle the additional token



We define a function that tokenises and truncates the text, then apply it to every example in the dataset

In [11]:
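# keep the end of over-long texts by truncating from the left (set once, not per batch)
tokenizer.truncation_side = "left"

# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text; padding is left to the data collator, so we
    # do not request tensors here (sequences in a batch have different lengths)
    tokenized_inputs = tokenizer(
        text,
        truncation=True,
        max_length=512
    )

    return tokenized_inputs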

# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 4673
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1169
    })
})
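
With `truncation_side = "left"`, over-long inputs lose tokens from the start rather than the end. The sketch below illustrates the idea on a plain list of token ids. `truncate` is a hypothetical helper and, unlike the real tokeniser, it ignores special tokens such as [CLS] and [SEP].

```python
def truncate(token_ids, max_length, side="right"):
    """Cut a token id list down to max_length from the chosen side.

    Simplified illustration: the real tokeniser also handles special tokens.
    """
    if len(token_ids) <= max_length:
        return list(token_ids)             # nothing to cut
    if side == "left":
        return list(token_ids[-max_length:])   # drop the earliest tokens
    return list(token_ids[:max_length])        # drop the latest tokens

ids = [11, 12, 13, 14, 15, 16]
print(truncate(ids, 4, side="left"))   # [13, 14, 15, 16]
print(truncate(ids, 4, side="right"))  # [11, 12, 13, 14]
```

Left truncation is a reasonable choice when the end of a statement is expected to carry more of the sentiment than its beginning.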

In [12]:
# create data collator (dynamic batch padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
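
Dynamic padding means each batch is padded only to the length of its own longest sequence, not to a global maximum. A minimal sketch of the idea on plain lists follows. `pad_batch` and the default pad id of 0 are illustrative assumptions, not the collator's actual code.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one in this batch.

    Hypothetical helper; returns padded ids plus an attention mask that is
    1 for real tokens and 0 for padding.
    """
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch keeps short batches short, which saves computation compared with padding everything to `max_length`.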

In [13]:
# import accuracy evaluation metric from evaluate package
accuracy = evaluate.load("accuracy")

In [14]:
# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1) # index of the largest logit is the predicted class
    # accuracy.compute already returns {"accuracy": <float>}; wrapping it in another
    # dict makes the Trainer log a dict instead of a scalar
    return accuracy.compute(predictions=predictions, references=labels)
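
To make the argmax step concrete, here is a small pure-Python sketch that turns toy logits into class predictions and a scalar accuracy. The helper names and the toy values are made up for illustration; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Share of rows whose largest logit sits at the true label index."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 0.2, 1.5],    # predicts class 2
    [-0.5, 0.9, 0.4],   # predicts class 1
    [1.2, 0.8, 0.1],    # predicts class 0
]
toy_labels = [0, 2, 1, 2]   # last example is misclassified
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```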

# Performance of the Untrained Model

We pass some example sentences to see how the untrained model performs. As expected with a newly initialised classification head, it does not do well

In [15]:
text_list = [
    "The company's earnings exceeded expectations this quarter.",
    "Sales figures are disappointing; the stock price is likely to fall.",
    "Investors are optimistic about the new product launch.",
    "The recent acquisition will strengthen our market position.",
    "Revenue has declined year-over-year, which raises concerns.",
    "Analysts recommend buying shares based on the positive outlook.",
    "The management's decision to cut costs is seen as a negative sign.",
    "Despite recent challenges, the outlook for next year is promising."
]

print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt") # pt stands for pytorch here
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(text + " - " + id2label[predictions.tolist()])

Untrained model predictions:
----------------------------
The company's earnings exceeded expectations this quarter. - Positive
Sales figures are disappointing; the stock price is likely to fall. - Positive
Investors are optimistic about the new product launch. - Negative
The recent acquisition will strengthen our market position. - Negative
Revenue has declined year-over-year, which raises concerns. - Negative
Analysts recommend buying shares based on the positive outlook. - Negative
The management's decision to cut costs is seen as a negative sign. - Positive
Despite recent challenges, the outlook for next year is promising. - Negative


# Tuning the model

In [16]:
# LoRA configuration parameters
peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=4, # rank of the low-rank update matrices
                        lora_alpha=32, # scaling factor: the low-rank update is multiplied by lora_alpha / r
                        lora_dropout=0.01, # randomly zero out inputs to the LoRA layers in training, prevents co-adaptation
                        target_modules = ['q_lin']) # applying lora to the query layers
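
As a rough picture of what these settings control, the sketch below implements a LoRA-style linear layer in plain Python: a frozen weight `W` plus a trainable low-rank update `B @ A` scaled by `lora_alpha / r`. The class `LoRALinear`, the tiny dimensions and the initialisation scale are assumptions for illustration; bias and dropout are left out, and this is not the peft implementation.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, r=4, lora_alpha=32, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                                   # frozen, d_out x d_in
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)]  # trainable, r x d_in
                  for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]             # trainable, d_out x r, starts at zero
        self.scaling = lora_alpha / r

    def forward(self, x):
        base = matvec(self.weight, x)                  # W x
        update = matvec(self.B, matvec(self.A, x))     # B (A x)
        return [b + self.scaling * u for b, u in zip(base, update)]

    def num_trainable(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# 8x8 identity as a stand-in for a pretrained weight matrix
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(identity, r=4, lora_alpha=32)
x = [float(i) for i in range(8)]
print(layer.scaling)          # 8.0
print(layer.forward(x) == x)  # True: with B at zero the layer starts out unchanged
print(layer.num_trainable())  # 64, i.e. r * (d_in + d_out)
```

Starting `B` at zero means tuning begins from the pretrained behaviour, and the number of trainable values grows with `r` rather than with the full size of `W`.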

Note below that only a small fraction of the model's parameters are trainable

In [17]:
# training less than 1% of the total model parameters
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 629,763 || all params: 67,585,542 || trainable%: 0.9318
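
The trainable count can be reproduced with a little arithmetic. The breakdown below is one decomposition consistent with the printed figure, assuming a hidden size of 768 and six transformer layers for distilbert-base-uncased, and assuming that the two head layers (`pre_classifier` and `classifier`) stay trainable alongside the LoRA matrices.

```python
hidden_size = 768   # assumed hidden size of distilbert-base-uncased
num_layers = 6      # assumed number of transformer layers
r = 4               # LoRA rank from the config above
num_labels = 3

# each adapted q_lin gets A (r x hidden) and B (hidden x r)
lora_params = num_layers * (r * hidden_size + hidden_size * r)

# classification head: weights plus biases
pre_classifier = hidden_size * hidden_size + hidden_size
classifier = hidden_size * num_labels + num_labels

trainable = lora_params + pre_classifier + classifier
all_params = 67_585_542   # total reported by print_trainable_parameters()

print(lora_params)   # 36864
print(trainable)     # 629763
print(f"{100 * trainable / all_params:.4f}%")
```

Under these assumptions the LoRA matrices account for only 36,864 of the trainable parameters; most of the rest belongs to the classification head.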


In [18]:
# hyperparameters
lr = 1e-3
batch_size = 4
num_epochs = 10

# define training arguments
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01, # L-2 Penalisation
    evaluation_strategy="epoch", # evaluate after each epoch (newer transformers releases call this eval_strategy)
    save_strategy="epoch", # tells model to save after each epoch
    load_best_model_at_end=True,
)



# Training

Note that validation loss rises from about 0.60 after the first epoch to about 1.11 after the tenth while training loss keeps falling, which indicates overfitting. This example is meant to illustrate how to implement LoRA

Before this, we could try transfer learning to get the model closer to something that does sentiment analysis well. We could then use LoRA to tune further.

In [19]:

# create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics,
)

# train model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6174 | 0.602385 | 0.7639007698887939 |
| 2 | 0.515 | 0.696433 | 0.7433704020530368 |
| 3 | 0.5228 | 0.653898 | 0.76732249786142 |
| 4 | 0.4571 | 0.849367 | 0.7553464499572284 |
| 5 | 0.3871 | 0.821161 | 0.7844311377245509 |
| 6 | 0.3658 | 0.893301 | 0.7707442258340462 |
| 7 | 0.3228 | 0.963472 | 0.7681779298545766 |
| 8 | 0.3059 | 1.038007 | 0.7510692899914457 |
| 9 | 0.2735 | 1.09539 | 0.7485029940119761 |
| 10 | 0.2272 | 1.109352 | 0.7493584260051326 |
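
Because `load_best_model_at_end=True` is set without a `metric_for_best_model`, the Trainer should, to our understanding of its defaults, compare checkpoints on validation loss. The sketch below picks the best epoch from the logged values under that assumption and, for comparison, by accuracy. The `history` list is simply the results above typed in by hand.

```python
# (epoch, validation loss, accuracy) copied from the training log
history = [
    (1, 0.602385, 0.7639007698887939),
    (2, 0.696433, 0.7433704020530368),
    (3, 0.653898, 0.76732249786142),
    (4, 0.849367, 0.7553464499572284),
    (5, 0.821161, 0.7844311377245509),
    (6, 0.893301, 0.7707442258340462),
    (7, 0.963472, 0.7681779298545766),
    (8, 1.038007, 0.7510692899914457),
    (9, 1.09539, 0.7485029940119761),
    (10, 1.109352, 0.7493584260051326),
]

best_by_loss = min(history, key=lambda row: row[1])       # lowest validation loss
best_by_accuracy = max(history, key=lambda row: row[2])   # highest accuracy

print(best_by_loss[0], best_by_accuracy[0])  # 1 5
```

The two criteria disagree here, so which checkpoint counts as "best" depends on the metric chosen; setting `metric_for_best_model="accuracy"` in `TrainingArguments` would be one way to select on accuracy instead.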


TrainOutput(global_step=11690, training_loss=0.3977033597164627, metrics={'train_runtime': 286.9179, 'train_samples_per_second': 162.869, 'train_steps_per_second': 40.743, 'total_flos': 541124686468800.0, 'train_loss': 0.3977033597164627, 'epoch': 10.0})
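
The `global_step` value above can be checked by hand. The sketch assumes a single device and no gradient accumulation, so that one optimisation step corresponds to one batch of four examples.

```python
import math

train_rows = 4673       # size of the training split
batch_size = 4
num_epochs = 10
train_runtime = 286.9179   # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / batch_size)   # last batch is partial
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)   # 1169
print(total_steps)       # 11690
print(round(total_steps / train_runtime, 3))   # steps per second
```

The totals agree with the reported `global_step`, and dividing by the runtime gives a rate in line with `train_steps_per_second`.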

The tuned model does much better on the example sentences

In [23]:
model.to('cuda') # move the model to the GPU (use 'mps' on a Mac, or 'cpu')

print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to("cuda") # inputs must be on the same device as the model

    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])

Trained model predictions:
--------------------------
The company's earnings exceeded expectations this quarter. - Positive
Sales figures are disappointing; the stock price is likely to fall. - Negative
Investors are optimistic about the new product launch. - Positive
The recent acquisition will strengthen our market position. - Positive
Revenue has declined year-over-year, which raises concerns. - Negative
Analysts recommend buying shares based on the positive outlook. - Neutral
The management's decision to cut costs is seen as a negative sign. - Neutral
Despite recent challenges, the outlook for next year is promising. - Positive
