## RegNLP - Fine-tuned sentence transformer model

This notebook details the training process of the text embedding model [raul-delarosa99/bge-small-en-v1.5-RIRAG_ObliQA](https://huggingface.co/raul-delarosa99/bge-small-en-v1.5-RIRAG_ObliQA). The model is obtained by fine-tuning [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) on a dataset of regulatory passages whose compliance requirements take specialized knowledge to interpret accurately. It aims to support academic research in the field of Regulatory Natural Language Processing (RegNLP).

This notebook depends on the `data_processing.ipynb` notebook. Please ensure that notebook is run before training the model so the raw data is preprocessed as expected.

## Environment

First, let us check the hardware that will be used to train this model with the NVIDIA command-line tool `nvidia-smi`.

In [2]:
!nvidia-smi

Thu Oct 31 13:02:26 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A40-24Q                 On  |   00000000:00:10.0 Off |                  N/A |
| N/A   N/A    P0             N/A /  N/A  |      24MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                     

In [14]:
# Import libraries

# Sentence-transformer related utils - See https://www.sbert.net/
from sentence_transformers import (
    SentenceTransformer,
    models, 
    SentenceTransformerTrainingArguments,
    SentenceTransformerTrainer
)
from sentence_transformers.losses import (
    MultipleNegativesRankingLoss,
    MultipleNegativesSymmetricRankingLoss
)
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.util import dot_score
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Other libs for data handling
from datasets import (
    Dataset,
    load_from_disk
)
from pandas import read_pickle
import numpy as np

## Model architecture

We configure a custom sentence transformer model from three components: a word embedding model, a pooling layer and a normalization layer. The word embedding model is initialized from the pre-trained transformer [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), with a maximum sequence length of 512 tokens and lowercasing enabled. The pooling layer uses the [CLS] token embedding as the sentence representation, and the normalization layer scales each embedding to unit L2 norm.
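To make the pooling and normalization steps concrete, here is a minimal pure-Python sketch of what they do to one input. The helper names `cls_pool` and `l2_normalize` and the toy numbers are illustrative assumptions; they are not part of the sentence-transformers API, and the real layers operate on batched tensors.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's embedding ([CLS]) as the sentence vector."""
    return token_embeddings[0]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy token embeddings for a 3-token input with 4 dimensions (made-up numbers).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS]
    [1.0, 2.0, 0.5, 0.1],  # a word token
    [0.2, 0.1, 0.9, 0.3],  # [SEP]
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
squared_norm = sum(x * x for x in sentence_embedding)
print(sentence_embedding, squared_norm)
```

Because every sentence embedding ends up with unit length, the dot product between two embeddings equals their cosine similarity, which is why `dot_score` is a reasonable similarity function for this model.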

In [4]:
# Define the base model
model_name = "BAAI/bge-small-en-v1.5"

base_model = SentenceTransformer(
    model_name,                  # Name of the base model based on BERT
    device="cuda",               # Use GPU for training
    model_kwargs={"torch_dtype": "float16"},  # Use FP16 precision to reduce memory consumption
)

# Display the base model architecture
base_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [5]:
# First, we will set up the word embedding model using the base model
word_embedding_model = models.Transformer(
    model_name_or_path=model_name,  # Use BAAI/bge-small-en-v1.5
    max_seq_length=512,
    do_lower_case=True,  # Convert all text to lowercase
)

# Define the pooling layer
pooling_model = models.Pooling(
    word_embedding_dimension=512,  # Declared embedding size; note the base model's Pooling config printed above reports 384
    pooling_mode_cls_token=True,  # Use [CLS] token
    pooling_mode_mean_tokens=False,  # Do not use mean tokens for pooling
    pooling_mode_max_tokens=False,  # Do not use max token for pooling
    pooling_mode_mean_sqrt_len_tokens=False,  # Do not use the sqrt-length-scaled mean for pooling
    pooling_mode_weightedmean_tokens=False,  # Do not use the weighted mean token for pooling
    pooling_mode_lasttoken=False,  # Do not use last token for pooling
    include_prompt=True  # Include prompt during pooling
)

# Define the normalization layer
normalize = models.Normalize()

# Define our custom model which consists of
# - word_embedding_model: The word embedding layer using BAAI/bge-small-en-v1.5
# - pooling_model: The pooling layer
# - normalize: The normalization layer
custom_domain_model = SentenceTransformer(
    modules=[word_embedding_model, pooling_model, normalize],  # our model layers
    device="cuda"  # Use GPU for training
)

# Display the custom model architecture
custom_domain_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

## Dataset

Load the training, evaluation and test datasets for the question answering task from the `data` directory. These files are produced by the `data_processing` notebook.

In [33]:
qa_train = load_from_disk('./data/train_dataset')
qa_eval = load_from_disk('./data/eval_dataset')
qa_test = load_from_disk('./data/test_dataset')

In [7]:
print("Training length: ", len(qa_train))
print("Validation length: ", len(qa_eval))
print("Test length: ", len(qa_test))

Training length:  29547
Validation length:  3677
Test length:  3666


In [27]:
# Creates a function to process the dataset so we can use it with the sentence transformer lib
corpus = read_pickle('./data/corpus.pkl') # Our corpus (cid => document)

def create_data_for_evaluator(dataset:Dataset) -> dict:
    """
    Creates a data structure for the evaluator from the given dataset.

    Args:
        dataset (Dataset): The dataset containing 'anchor_id', 'anchor', and 'positive_id' fields.

    Returns:
        dict: A dictionary containing the corpus, queries, and relevant documents.
              - 'corpus': The corpus of documents
              - 'queries': A dictionary mapping query IDs to queries.
              - 'relevant_docs': A dictionary mapping query IDs to lists of relevant document IDs.
    """

    queries = dict(
        zip(dataset['anchor_id'], 
            dataset['anchor'])
    )  # Our queries (qid => question)

    # Map each query ID to the list of its relevant document IDs (one or more per query)
    relevant_docs = {qid:[] for qid in dataset['anchor_id']}  # Query ID to relevant documents (qid => [relevant_cids])
    for qid, cid  in zip(dataset['anchor_id'], dataset['positive_id']):
        relevant_docs[qid].append(cid)
        
    return dict(corpus=corpus, queries=queries, relevant_docs=relevant_docs)
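
As an illustration of the structures this function returns, the sketch below builds `queries` and `relevant_docs` from a few made-up rows. The row values are hypothetical, and the use of `defaultdict(set)` is a variation on the list-based code above (sets avoid counting a repeated pair twice); it is not the notebook's own implementation.

```python
from collections import defaultdict

# Hypothetical rows mimicking the 'anchor_id', 'anchor' and 'positive_id' columns.
rows = {
    "anchor_id": ["q1", "q1", "q2"],
    "anchor": ["What is Rule A?", "What is Rule A?", "Who must report?"],
    "positive_id": ["d1", "d7", "d3"],
}

# qid => question text (a repeated qid simply overwrites the same entry)
queries = dict(zip(rows["anchor_id"], rows["anchor"]))

# qid => set of relevant document IDs
relevant_docs = defaultdict(set)
for qid, cid in zip(rows["anchor_id"], rows["positive_id"]):
    relevant_docs[qid].add(cid)

print(queries)
print(dict(relevant_docs))
```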

Next, we configure the evaluation with the `InformationRetrievalEvaluator` and the `dot_score` similarity function. The base model's scores serve as the baseline against which we compare the custom model after training.

In [28]:
# evaluator for the validation dataset
eval_dataset_evaluator = create_data_for_evaluator(qa_eval)
# evaluator for the testing dataset
test_dataset_evaluator = create_data_for_evaluator(qa_test)

In [30]:
# Uses the information retrieval evaluator which given a set of queries and a corpus, retrieves for each query the top k
# most similar docs. We use k=10 and the dot score function (Dot product)
dev_evaluator = InformationRetrievalEvaluator(
        queries=eval_dataset_evaluator['queries'],
        corpus=eval_dataset_evaluator['corpus'],
        relevant_docs=eval_dataset_evaluator['relevant_docs'],
        name='qa_eval', 
        map_at_k=[10],
        accuracy_at_k = [10],
        precision_recall_at_k = [10],
        score_functions={'dot_score':dot_score}
    )

test_evaluator = InformationRetrievalEvaluator(
        queries=test_dataset_evaluator['queries'],
        corpus=test_dataset_evaluator['corpus'],
        relevant_docs=test_dataset_evaluator['relevant_docs'],
        name='qa_test', 
        map_at_k=[10],
        accuracy_at_k = [10],
        precision_recall_at_k = [10],
        score_functions={'dot_score':dot_score}
    )

In [31]:
## Base model evaluation
results = dev_evaluator(base_model)
results

{'qa_eval_dot_score_accuracy@10': 0.7908895265423243,
 'qa_eval_dot_score_precision@10': 0.08418220946915352,
 'qa_eval_dot_score_recall@10': 0.7134684361549498,
 'qa_eval_dot_score_ndcg@10': 0.6009030811596184,
 'qa_eval_dot_score_mrr@10': 0.6015105839083594,
 'qa_eval_dot_score_map@10': 0.5462275469130742}
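
To help read these numbers, the following sketch computes accuracy, precision, recall and reciprocal rank at k for a single query, using the usual information retrieval definitions. The function `metrics_at_k` and the toy ranking are illustrative assumptions, not the evaluator's actual code; the evaluator averages such per-query values over all queries.

```python
def metrics_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Per-query retrieval metrics over the top-k ranked documents."""
    top_k = ranked_doc_ids[:k]
    hits = [doc_id in relevant_ids for doc_id in top_k]
    num_hits = sum(hits)

    # Reciprocal rank of the first relevant document, 0 if none is in the top k.
    reciprocal_rank = 0.0
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "accuracy": 1.0 if num_hits > 0 else 0.0,  # at least one hit in the top k
        "precision": num_hits / k,                 # share of the top k that is relevant
        "recall": num_hits / len(relevant_ids),    # share of relevant docs retrieved
        "mrr": reciprocal_rank,
    }

# Toy example: two relevant documents, only one of them in the top 3.
example = metrics_at_k(["d4", "d1", "d9", "d7"], {"d1", "d7"}, k=3)
print(example)
```

Two things follow from these definitions. Precision@10 stays low when queries have only one or two relevant passages, since at most that many of the ten retrieved documents can be hits. Recall@10 falls below accuracy@10 whenever some queries have more than one relevant passage.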

In [34]:
# We use the loss function MultipleNegativesRankingLoss - 
# See https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss
# 
# This loss is designed for information retrieval use cases where we have pairs of questions/answers and 
# we want to maximize the similarity between those.
# 'similarity_fct=dot_score' means we use the dot product as the similarity function
loss = MultipleNegativesRankingLoss(custom_domain_model,
                                    similarity_fct=dot_score)

# Configure training parameters
args = SentenceTransformerTrainingArguments(
    output_dir="./results/custom_domain_model",  # Directory to store the results/checkpoints of the model
    num_train_epochs=10,  # Number of epochs
    per_device_train_batch_size=64,  # Batch size for training
    gradient_accumulation_steps=4, # Accumulation steps
    per_device_eval_batch_size=512,  # Batch size for evaluation
    learning_rate=2e-5,  # Learning rate
    warmup_ratio=0.1,  # Proportion of warmup steps
    bf16=True,  # Use bfloat16 to reduce memory usage
    gradient_checkpointing=False, 
    optim="adamw_torch_fused",  # Use the fused PyTorch implementation of the AdamW optimizer
    lr_scheduler_type="cosine",  # Learning rate scheduler with cosine decay
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Uses batch sampling that ensures no duplicates
    eval_strategy="epoch",  # Evaluates at the end of each epoch
    save_strategy="epoch",  # Saves a checkpoint at the end of each epoch
    save_total_limit=1,  # Keeps a limit of 1 checkpoint, deleting the oldest
    logging_steps=1,  # Logs training results every step
    metric_for_best_model="eval_qa_eval_dot_score_recall@10",  # Key metric used to determine the best model (recall@10 in this case)
    greater_is_better=True,  # Indicates that a higher value of the metric is better (used to select the best model)
    load_best_model_at_end=True,  # Automatically loads the best model at the end of training
)

# Create the trainer
trainer = SentenceTransformerTrainer(
    model=custom_domain_model,  # Custom model to train
    args=args,  # training args
    train_dataset=qa_train.select_columns(["anchor", "positive"]),  # training dataset using only query and positive samples
    loss=loss,  # loss function
    evaluator=dev_evaluator,  # evaluator to validate model after each epoch
)

# Train the model
trainer.train()

# Save the model to disk 
trainer.save_model()


| Epoch | Training Loss | Validation Loss | Qa Eval Dot Score Accuracy@10 | Qa Eval Dot Score Precision@10 | Qa Eval Dot Score Recall@10 | Qa Eval Dot Score Ndcg@10 | Qa Eval Dot Score Mrr@10 | Qa Eval Dot Score Map@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.6732 | No log | 0.875897 | 0.094548 | 0.79343 | 0.669875 | 0.674248 | 0.608332 |
| 2 | 0.8467 | No log | 0.887016 | 0.096521 | 0.806576 | 0.683009 | 0.687008 | 0.62132 |
| 3 | 0.8747 | No log | 0.888092 | 0.097095 | 0.809356 | 0.686894 | 0.689791 | 0.625929 |
| 4 | 0.191 | No log | 0.888092 | 0.097525 | 0.81039 | 0.689621 | 0.692547 | 0.629218 |
| 5 | 0.4173 | No log | 0.892037 | 0.09792 | 0.812267 | 0.691733 | 0.69524 | 0.631234 |
| 6 | 0.5546 | No log | 0.892396 | 0.09792 | 0.812321 | 0.691182 | 0.694321 | 0.630474 |
| 7 | 0.6649 | No log | 0.892755 | 0.098135 | 0.813887 | 0.691763 | 0.694289 | 0.630956 |
| 8 | 0.1241 | No log | 0.893831 | 0.098278 | 0.814981 | 0.692382 | 0.694677 | 0.631357 |
| 9 | 0.4929 | No log | 0.894907 | 0.098458 | 0.816248 | 0.692973 | 0.695109 | 0.631785 |
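
For intuition about the training loss, the sketch below reimplements the idea behind an in-batch negatives ranking loss in pure Python: for each anchor, the paired positive should score higher than every other positive in the batch, which is expressed as a cross-entropy over scaled similarity scores. The function name `in_batch_ranking_loss`, the toy vectors and `scale=20.0` are assumptions for illustration; this is not the library's implementation.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy unit-length embeddings (made-up numbers).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
misaligned = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

The loss is close to zero when each anchor is most similar to its own positive and grows when another passage in the batch scores higher. Since the negatives come from the batch itself, a larger `per_device_train_batch_size` gives each anchor more negatives to be contrasted against.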


                                                                                                                                                                                                                                                             

## Evaluating the base model & the fine-tuned model

In [35]:
# Load the best model from disk
custom_domain_model = SentenceTransformer('./results/custom_domain_model',
                                          device="cuda",
                                          model_kwargs={"torch_dtype": "float16"},
                                          )

In [36]:
## Custom model evaluation
results = dev_evaluator(custom_domain_model)
results

{'qa_eval_dot_score_accuracy@10': 0.8945480631276901,
 'qa_eval_dot_score_precision@10': 0.0983500717360115,
 'qa_eval_dot_score_recall@10': 0.8158177905308465,
 'qa_eval_dot_score_ndcg@10': 0.6926942973725922,
 'qa_eval_dot_score_mrr@10': 0.6950025335337378,
 'qa_eval_dot_score_map@10': 0.6315333809675632}

Evaluate both the base and custom domain models on the test set using the evaluator, and print the full set of metrics at $k=10$ (including Mean Average Precision, MAP) for comparison.

In [37]:
eva_base_model = test_evaluator(base_model, output_path='results/base_model/')
print("Base model: ", eva_base_model)
print("\n-----------\n")
eva_custom_model = test_evaluator(custom_domain_model, output_path='results/custom_model/')
print("Custom model: ", eva_custom_model)

Base model:  {'qa_test_dot_score_accuracy@10': 0.7763819095477387, 'qa_test_dot_score_precision@10': 0.08248384781048097, 'qa_test_dot_score_recall@10': 0.7017109356305337, 'qa_test_dot_score_ndcg@10': 0.5896317890507138, 'qa_test_dot_score_mrr@10': 0.5893142013924492, 'qa_test_dot_score_map@10': 0.5357515444759702}

-----------

Custom model:  {'qa_test_dot_score_accuracy@10': 0.8872936109117013, 'qa_test_dot_score_precision@10': 0.09727207465900933, 'qa_test_dot_score_recall@10': 0.811180904522613, 'qa_test_dot_score_ndcg@10': 0.6866557656041946, 'qa_test_dot_score_mrr@10': 0.6884060324297208, 'qa_test_dot_score_map@10': 0.6261181450525491}
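
The two result dictionaries are easier to compare side by side. The sketch below copies the test-set values printed above and computes the absolute and relative gain per metric; the variable names are illustrative.

```python
# Test-set metrics copied from the output above.
base_test = {
    "accuracy@10": 0.7763819095477387,
    "precision@10": 0.08248384781048097,
    "recall@10": 0.7017109356305337,
    "ndcg@10": 0.5896317890507138,
    "mrr@10": 0.5893142013924492,
    "map@10": 0.5357515444759702,
}
custom_test = {
    "accuracy@10": 0.8872936109117013,
    "precision@10": 0.09727207465900933,
    "recall@10": 0.811180904522613,
    "ndcg@10": 0.6866557656041946,
    "mrr@10": 0.6884060324297208,
    "map@10": 0.6261181450525491,
}

gains = {}
for metric, base_value in base_test.items():
    absolute = custom_test[metric] - base_value
    gains[metric] = {"absolute": absolute, "relative": absolute / base_value}
    print(f"{metric:13s} base={base_value:.4f} custom={custom_test[metric]:.4f} "
          f"gain={absolute:+.4f} ({absolute / base_value:+.1%})")
```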


### Comparing QA

In [38]:
# Assuming the embeddings are normalized
question1 = "In the context of selling Direct Long-Term Insurance to Retail Clients, can you identify the rule that mandates insurers and insurance intermediaries to ensure that the insurance products are suitable for their clients?"
answer1 =  "An Insurer or an Insurance Intermediary must comply with the suitability requirement set out in Rule ‎3.4 when conducting any Insurance or Insurance Intermediation Business with or for a Retail Client in respect of Direct Long-Term Insurance."

question2 = 'Under what circumstances, as outlined in Rule ‎12.3.2, is a Fund Manager of a Domestic Fund not mandated to engage the services of an Eligible Custodian?'
answer2 =  'A Fund Manager of a Domestic Fund is not required to appoint an Eligible Custodian for the Fund pursuant to Rule ‎12.3.2 where it meets the requirements in either (2) and (3), or (4).'

print("------ Base Model ------")

emb_q1 = base_model.encode(question1)  # the embedding is normalized
emb_q2 = base_model.encode(question2)  # the embedding is normalized
ans_1 = base_model.encode(answer1)
ans_2 = base_model.encode(answer2)


print("q1", ans_1 @ emb_q1,"(answer1) --", ans_2 @ emb_q1, "(answer2)")
print("q2", ans_1 @ emb_q2, "(answer1) --", ans_2 @ emb_q2, "(answer2)")

print("------ Custom model -----")

emb_q1 = custom_domain_model.encode(question1)  # the embedding is normalized
emb_q2 = custom_domain_model.encode(question2)  # the embedding is normalized
ans_1 = custom_domain_model.encode(answer1)
ans_2 = custom_domain_model.encode(answer2)


print("q1", ans_1 @ emb_q1,"(answer1) --", ans_2 @ emb_q1, "(answer2)")
print("q2", ans_1 @ emb_q2, "(answer1) --", ans_2 @ emb_q2, "(answer2)")


------ Base Model ------
q1 0.8325 (answer1) -- 0.56 (answer2)
q2 0.581 (answer1) -- 0.9155 (answer2)
------ Custom model -----
q1 0.7383 (answer1) -- 0.01709 (answer2)
q2 0.1287 (answer1) -- 0.8804 (answer2)
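
One way to read these scores is to look at the margin between each question's matching answer and the other answer. The sketch below computes that margin from the values printed above; the dictionary layout and names are illustrative assumptions.

```python
# Scores copied from the output above, as (matching answer, other answer) per question.
scores = {
    "base": {"q1": (0.8325, 0.56), "q2": (0.9155, 0.581)},
    "custom": {"q1": (0.7383, 0.01709), "q2": (0.8804, 0.1287)},
}

# Margin = similarity to the matching answer minus similarity to the other answer.
margins = {
    model: {q: round(match - other, 4) for q, (match, other) in per_question.items()}
    for model, per_question in scores.items()
}

for model, per_question in margins.items():
    print(model, per_question)
```

In this small example the fine-tuned model gives slightly lower scores to the matching answers but far lower scores to the non-matching ones, so the margin between them is wider for both questions.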
