Import libraries

In [16]:
#sentence-transformers==2.7.0

from sentence_transformers import SentenceTransformer, losses, InputExample, models
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader
import torch

import pandas as pd
import numpy as np

from custom_adapter_module.AdapterModule import AdapterModule

A custom SentenceTransformer model is assembled from several components: a word embedding model, a pooling model, a normalization layer, and an adapter module. The word embedding model is initialized from the pre-trained `sentence-transformers/all-MiniLM-L6-v2` transformer, with a maximum sequence length of 256 tokens and lowercasing disabled. The pooling model averages the token embeddings (mean pooling), with all other pooling modes disabled. The normalization layer scales each embedding to unit length.
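To make the pooling and normalization steps concrete, here is a minimal pure-Python sketch of what they compute. It is an illustration, not the library's implementation: the helper names `mean_pool` and `l2_normalize` and the two-dimensional toy vectors are assumptions for this example (the real model works with 384-dimensional vectors).

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, so padding is ignored."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Toy input: two real tokens and one padding token (mask 0)
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)

print("pooled:", pooled)
print("normalized:", sentence_embedding)
```

Because the final vectors have unit length, a dot product between two of them can be read as a cosine similarity.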

In [17]:
# Load the word embedding model
word_embedding_model = models.Transformer(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",  # Base Sentence Transformers model
    max_seq_length=256,  # Maximum sequence length
    do_lower_case=False  # Do not lowercase the input
)

# Define the pooling model parameters
pooling_model = models.Pooling(
    word_embedding_dimension=384,  # Dimension of the word embeddings
    pooling_mode_cls_token=False,  # Do not pool with the CLS token
    pooling_mode_mean_tokens=True,  # Pool by averaging the token embeddings
    pooling_mode_max_tokens=False,  # Do not pool with the token-wise maximum
    pooling_mode_mean_sqrt_len_tokens=False,  # Do not use square-root-length mean pooling
    pooling_mode_weightedmean_tokens=False,  # Do not use the weighted mean of the tokens
    pooling_mode_lasttoken=False,  # Do not pool with the last token
    include_prompt=True  # Include the prompt in the pooling
)

# Define the normalization layer
normalize = models.Normalize()

# Freeze the word embedding model's weights so they are not trained
for param in word_embedding_model.parameters():
    param.requires_grad = False

# Use the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the adapter module with its input and output dimensions
adapter = AdapterModule(384, 384).to(device)

# Define the base Sentence Transformer model with the embedding, pooling, and normalization layers
base_model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize], device=device)

# Define the custom Sentence Transformer model, which also includes the adapter
custom_domain_model = SentenceTransformer(
    modules=[word_embedding_model, pooling_model, adapter, normalize], device=device
)

custom_domain_model  # Display the architecture of the custom model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): AdapterModule(
    (dense1): Linear(in_features=384, out_features=1024, bias=True)
    (dense2): Linear(in_features=1024, out_features=512, bias=True)
    (output): Linear(in_features=512, out_features=384, bias=True)
    (activation): ReLU()
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (3): Normalize()
)

To prevent the weights of the word embedding model from being updated during training, all its parameters are frozen. Device configuration ensures that the model uses a GPU if available; otherwise it will default to CPU usage. An AdapterModule instance with defined input and output dimensions is created and moved to the specified device.

Two SentenceTransformer models are instantiated: the base model, which includes the word embedding, pooling, and normalization layers; and the custom domain model, which additionally incorporates the adapter module. This configuration allows flexible adaptation of embeddings tailored to specific tasks or domains. The final architecture of the custom domain model is shown for your review.
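Since the transformer is frozen, the adapter's layers are the only part of the custom model left to train. As a rough check of how much that is, the sketch below counts the adapter's parameters from the layer shapes in the printed architecture. The helper `linear_param_count` is a name introduced here for illustration; the shapes are copied from the printout, and ReLU and Dropout contribute no parameters.

```python
# (in_features, out_features) of each Linear layer, copied from the printed AdapterModule
adapter_layers = {
    "dense1": (384, 1024),
    "dense2": (1024, 512),
    "output": (512, 384),
}


def linear_param_count(in_features, out_features):
    """Weights plus one bias per output unit (the printout shows bias=True)."""
    return in_features * out_features + out_features


per_layer = {name: linear_param_count(*shape) for name, shape in adapter_layers.items()}
trainable_params = sum(per_layer.values())

print(per_layer)
print("Trainable adapter parameters:", trainable_params)
```

The total comes to 1,116,032 parameters, all of them in the adapter.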

Load the training and evaluation datasets for question answering tasks from the respective pickled files stored in the 'data' directory.

In [18]:
from random import choices

In [19]:
qa_train = choices(pd.read_pickle('data/qa_training.pkl'), k=1000)
qa_eval = pd.read_pickle('data/qa_evaluation.pkl')
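Note that `random.choices` draws with replacement, so the 1,000 sampled training pairs may contain repeats. With a loss that uses the other pairs in a batch as negatives, a repeated pair in the same batch would be scored as its own negative. The sketch below contrasts the two standard-library samplers; `records` is a hypothetical stand-in for the loaded training data, and using `random.sample` assumes the pickle holds a sequence such as a list.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

records = list(range(5000))  # hypothetical stand-in for the loaded training records

with_replacement = random.choices(records, k=1000)    # the same record may be drawn twice
without_replacement = random.sample(records, k=1000)  # each record is drawn at most once

print("unique records (choices):", len(set(with_replacement)))
print("unique records (sample): ", len(set(without_replacement)))
```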

Print the number of sampled training pairs and the number of evaluation queries. Each training record is a question-answer pair, accessed later as `qa['query']` (the question) and `qa['passage']` (its corresponding answer).

In [20]:
print("Training length: ", len(qa_train))
print("Validation length: ", len(qa_eval['queries']))

Training length:  1000
Validation length:  2786


The next cell prepares the training and evaluation setup for the custom SentenceTransformer model. First, a training dataset is built as a list of `InputExample` instances, each holding a pair of texts (question and answer). This dataset is then wrapped in a `DataLoader`, which shuffles the data at each epoch and uses a batch size of 128.

The training loss is defined using "MultipleNegativesSymmetricRankingLoss", which is suitable for information retrieval tasks involving positive text pairs. An evaluator is configured using "InformationRetrievalEvaluator", which evaluates the performance of the model on a set of queries and corpora, with the main scoring function specified as "dot_score".
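The idea behind this loss, as I understand it, is that each question's paired answer is the positive and every other answer in the batch serves as a negative; the same is then done in the other direction (answers against questions) and the two terms are averaged. The pure-Python sketch below illustrates that idea on a small similarity matrix. It is not the library's code: the function names are made up for this example, and the scale factor of 20 is an assumed value.

```python
import math


def cross_entropy_diagonal(scores):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_ranking_loss(similarities, scale=20.0):
    """Average of the question-to-answer and answer-to-question losses."""
    scores = [[scale * s for s in row] for row in similarities]
    transposed = [list(col) for col in zip(*scores)]
    return 0.5 * (cross_entropy_diagonal(scores) + cross_entropy_diagonal(transposed))


# Rows are questions, columns are answers; entry [i][j] is their similarity.
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
indistinguishable = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

loss_good = symmetric_ranking_loss(well_separated)
loss_bad = symmetric_ranking_loss(indistinguishable)

print("loss when positives stand out:", loss_good)
print("loss when all pairs look alike:", loss_bad)
```

When every pair looks alike, the loss equals the log of the batch size; it shrinks as the diagonal (true pairs) pulls ahead of the off-diagonal entries.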

In [21]:
# Create the training dataset
# Build a list of InputExample objects, each holding a pair of texts (question and answer)
train_dataset = [
    InputExample(texts=[qa['query'], qa['passage']])
    for qa in qa_train
]

# Create the DataLoader for the training dataset
# shuffle=True reshuffles the data at every epoch
loader = DataLoader(train_dataset, shuffle=True, batch_size=128)

# Define the loss function
# MultipleNegativesSymmetricRankingLoss suits information retrieval tasks with positive pairs
train_loss = losses.MultipleNegativesSymmetricRankingLoss(custom_domain_model)

# Define the evaluator
# InformationRetrievalEvaluator scores the model on a set of queries and a corpus
evaluator = InformationRetrievalEvaluator(
    qa_eval['queries'],
    qa_eval['corpus'],
    qa_eval['relevant_docs'],
    name='qa_eval',
    map_at_k=[10],
    accuracy_at_k=[10],
    precision_recall_at_k=[10],
    main_score_function='dot_score'
)

# Define the number of epochs and the warm-up steps
epochs = 50
warmup_steps = int(len(loader) * epochs * 0.1)

The number of training epochs is set to 50, and the warm-up steps are calculated as 10% of the total training steps. With 1,000 training pairs and a batch size of 128, the DataLoader yields 8 batches per epoch, so training runs for 400 steps, 40 of which are warm-up.
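The warm-up arithmetic can be reproduced without the DataLoader, using the values from the code cell (`epochs = 50`, batch size 128, 1,000 examples). The `linear_warmup_factor` helper is an assumption added for illustration: it sketches a linear warm-up followed by linear decay, which is the kind of schedule `fit()` is expected to use by default.

```python
import math

num_examples = 1000    # size of the sampled training set
batch_size = 128       # DataLoader batch size
epochs = 50            # value assigned in the code cell
warmup_fraction = 0.1  # 10% of all training steps

steps_per_epoch = math.ceil(num_examples / batch_size)  # equals len(loader); last batch is partial
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_fraction)


def linear_warmup_factor(step):
    """Learning-rate multiplier: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("warm-up steps:", warmup_steps)
print("multiplier halfway through warm-up:", linear_warmup_factor(warmup_steps // 2))
```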

In [22]:
# Train the model
# fit() trains the model with the given training objectives and evaluator
custom_domain_model.fit(
    train_objectives=[(loader, train_loss)],  # Training objectives: DataLoader and loss function
    epochs=epochs,  # Number of epochs
    warmup_steps=warmup_steps,  # Number of warm-up steps
    output_path='results/domain_adaptation_model',  # Output path for saving the trained model
    show_progress_bar=True,  # Show a progress bar during training
    save_best_model=True,  # Save the best model according to the evaluator
    use_amp=True,  # Enable mixed precision
    evaluator=evaluator,  # Evaluator used to score the model during training
    evaluation_steps=10,  # Evaluate the model every 10 steps
)


  scaler = torch.cuda.amp.GradScaler()
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.80s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.80s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.78s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.76s/it]
Iteration: 100%|██████████| 8/8 [00:21<00:00,  2.73s/it]
Iteration: 100%|██████████| 8/8 [00:21<00:00,  2.74s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.76s/it]
Iteration: 100%|██████████| 8/8 [00:21<00:00,  2.73s/it]
Iteration: 100%|██████████| 8/8 [00:21<00:00,  2.74s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.76s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.75s/it]
Iteration: 100%|██████████| 8/8 [00:21<00:00,  2.73s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.76s/it]
Iteration: 100%|██████████| 8/8 [00:21<00:00,  2.74s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.75s/it]
Iteration: 100%|██████████| 8/8 [00:22<00:00,  2.76s/it]

## Evaluating the base model & the custom model

In [23]:
custom_domain_model = SentenceTransformer('./results/domain_adaptation_model')

  config = torch.load(os.path.join(input_path, 'config.pt'))
  model.load_state_dict(torch.load(os.path.join(input_path, 'adapter_module.pt')))


Evaluate the Mean Average Precision (MAP) at k=10 for both the base and custom domain models using the evaluator, and print the results for comparison.

In [24]:
eva_base_model = evaluator(base_model, output_path='results/base_model/')
print("Base model: ", eva_base_model)

eva_custom_model = evaluator(custom_domain_model, output_path='results/custom_model/')
print("Custom model: ", eva_custom_model)

Base model:  0.050316562973598135
Custom model:  0.04098571086726148
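For reference, the sketch below shows one common way to compute MAP@k for ranked retrieval results, which is what the two numbers above are meant to summarize. It is a simplified illustration rather than the evaluator's actual code: the function names, the toy rankings, and the choice to divide by `min(k, number of relevant documents)` are assumptions of this example.

```python
def average_precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Sum the precision at each rank where a relevant document appears, then normalize."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_doc_ids))


def mean_average_precision_at_k(rankings, relevant_docs, k=10):
    """Average the per-query AP@k over all queries."""
    scores = [average_precision_at_k(rankings[q], relevant_docs[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Toy data: ranked corpus ids per query and the set of relevant ids per query
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d5", "d4"]}
relevant_docs = {"q1": {"d1", "d9"}, "q2": {"d2"}}

map_at_10 = mean_average_precision_at_k(rankings, relevant_docs, k=10)
print("MAP@10:", map_at_10)
```

Here `q1` finds one of its two relevant documents at rank 2 (AP 0.25) and `q2` finds its only relevant document at rank 1 (AP 1.0), so the mean is 0.625.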


Load evaluation results from CSV files for both the base and custom domain models, add a column to indicate the model type, and concatenate the results into a single DataFrame for comparison.

In [30]:
base_model_eval = pd.read_csv('results/base_model/Information-Retrieval_evaluation_qa_eval_results.csv')
base_model_eval['tipo'] = 'base_model'
custom_model_eval = pd.read_csv('results/custom_model/Information-Retrieval_evaluation_qa_eval_results.csv')
custom_model_eval['tipo'] = 'custom_model'

pd.concat([base_model_eval, custom_model_eval]).to_csv('results/eval_comparation.csv', index=False)

pd.concat([base_model_eval, custom_model_eval])


Unnamed: 0,epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,...,dot-Precision@3,dot-Recall@3,dot-Precision@5,dot-Recall@5,dot-Precision@10,dot-Recall@10,dot-MRR@10,dot-NDCG@10,dot-MAP@100,tipo
0,-1,-1,0.33988,0.476541,0.52946,0.596017,0.33988,0.009441,0.158847,0.013237,...,0.158847,0.013237,0.105892,0.014707,0.059602,0.016556,0.421933,0.102059,0.01193,base_model
1,-1,-1,0.33988,0.476541,0.52946,0.596017,0.33988,0.009441,0.158847,0.013237,...,0.158847,0.013237,0.105892,0.014707,0.059602,0.016556,0.421933,0.102059,0.01193,base_model
2,-1,-1,0.33988,0.476541,0.52946,0.596017,0.33988,0.009441,0.158847,0.013237,...,0.158847,0.013237,0.105892,0.014707,0.059602,0.016556,0.421933,0.102059,0.01193,base_model
3,-1,-1,0.33988,0.476541,0.52946,0.596017,0.33988,0.009441,0.158847,0.013237,...,0.158847,0.013237,0.105892,0.014707,0.059602,0.016556,0.421933,0.102059,0.01193,base_model
4,-1,-1,0.340698,0.476541,0.52946,0.596017,0.340698,0.009464,0.158847,0.013237,...,0.158847,0.013237,0.105892,0.014707,0.059602,0.016556,0.422303,0.102119,0.04223,base_model
5,-1,-1,0.596017,0.059602,0.016556,0.422303,0.102119,0.04223,0.596017,0.059602,...,,,,,,,,,,base_model
6,-1,-1,0.607474,0.060747,0.016874,0.428965,0.103823,0.042897,0.607474,0.060747,...,,,,,,,,,,base_model
7,-1,-1,0.597109,0.059711,0.016586,0.428617,0.10323,0.042862,0.597109,0.059711,...,,,,,,,,,,base_model
8,-1,-1,0.596017,0.059602,0.016556,0.422303,0.102119,0.04223,0.596017,0.059602,...,,,,,,,,,,base_model
9,-1,-1,0.691673,0.069167,0.019213,0.503166,0.120688,0.050317,0.691673,0.069167,...,,,,,,,,,,base_model


### Comparing QA

In [26]:
# Assuming the embeddings are normalized
question1 = "How does the FSRA define and evaluate 'principal risks and uncertainties' for a Petroleum Reporting Entity, particularly for the remaining six months of the financial year?"
answer1 =  "A Reporting Entity must: (a) prepare such report: (i) for the first six months of each financial year or period, and if there is a change to the accounting reference date, prepare such report in respect of the period up to the old accounting reference date; and (ii) in accordance with the applicable IFRS standards or other standards acceptable to the Regulator; (b) ensure the financial statements have either been audited or reviewed by auditors, and the audit or review by the auditor is included within the report; and (c) ensure that the report includes: (i) except in the case of a Mining Exploration Reporting Entity or a Petroleum Exploration Reporting Entity, an indication of important events that have occurred during the first six months of the financial year, and their impact on the financial statements; (ii) except in the case of a Mining Exploration Reporting Entity or a Petroleum Exploration Reporting Entity, a description of the principal risks and uncertainties for the remaining six months of the financial year; and (iii) a condensed set of financial statements, an interim management report and associated responsibility statements."

question2 = 'Under Rules 7.3.2 and 7.3.3, what are the two specific conditions related to the maturity of a financial instrument that would trigger a disclosure requirement?'
# No trailing comma here: a trailing comma would turn answer2 into a one-element tuple
answer2 = 'Events that trigger a disclosure. For the purposes of Rules 7.3.2 and 7.3.3, a Person is taken to hold Financial Instruments in or relating to a Reporting Entity, if the Person holds a Financial Instrument that on its maturity will confer on him: (1) an unconditional right to acquire the Financial Instrument; or (2) the discretion as to his right to acquire the Financial Instrument.'


emb_q1 = custom_domain_model.encode(question1)  # the embedding is normalized
emb_q2 = custom_domain_model.encode(question2)  # the embedding is normalized
ans_1 = custom_domain_model.encode(answer1)
ans_2 = custom_domain_model.encode(answer2)


print("q1", ans_1 @ emb_q1, "(answer1) --", ans_2 @ emb_q1, "(answer2)")
print("q2", ans_1 @ emb_q2, "(answer1) --", ans_2 @ emb_q2, "(answer2)")


print("------ Base Model ------")

emb_q1 = base_model.encode(question1)  # the embedding is normalized
emb_q2 = base_model.encode(question2)  # the embedding is normalized
ans_1 = base_model.encode(answer1)
ans_2 = base_model.encode(answer2)


print("q1", ans_1 @ emb_q1, "(answer1) --", ans_2 @ emb_q1, "(answer2)")
print("q2", ans_1 @ emb_q2, "(answer1) --", ans_2 @ emb_q2, "(answer2)")


q1 0.93123436 (answer1) -- 0.9080961 (answer2)
q2 0.9262854 (answer1) -- 0.94469243 (answer2)
------ Base Model ------
q1 0.6166147 (answer1) -- 0.4293761 (answer2)
q2 0.55304617 (answer1) -- 0.7028685 (answer2)
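The comparison above relies on the embeddings being normalized, in which case the `@` dot product is the same number as the cosine similarity. A small pure-Python check of that identity follows; the two-dimensional vectors are toy values, not real embeddings, and the helper names are introduced only for this illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


question_vec = [3.0, 4.0]  # toy vectors, not real embeddings
answer_vec = [4.0, 3.0]

dot_of_normalized = dot(l2_normalize(question_vec), l2_normalize(answer_vec))
cosine = cosine_similarity(question_vec, answer_vec)

print("dot product of normalized vectors:", dot_of_normalized)
print("cosine similarity of raw vectors: ", cosine)
```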


### The custom model maintains original capabilities

The next cells encode sample text inputs, namely the title of a paper and several general concepts, with both the custom domain model and the base model. The dot product between the encoded vectors measures the similarity between pairs of concepts and between the paper and a concept. The similarity scores for each comparison are printed so the two models can be compared.

In [27]:
paper = "Composable Lightweight Processors"

concept1 = "shark"
concept2 = "ocean"
concept3 = "strawberry"

In [28]:
custom_paper = custom_domain_model.encode(paper)

custom_concept1 = custom_domain_model.encode(concept1)
custom_concept2 = custom_domain_model.encode(concept2)
custom_concept3 = custom_domain_model.encode(concept3)

# Print the similarity scores
print(f"Dot product between two concepts (shark and ocean): {np.dot(custom_concept1, custom_concept2)}")
print(f"Dot product between two concepts (shark and strawberry): {np.dot(custom_concept1, custom_concept3)}")
print(f"Dot product between the paper and a concept (ocean): {np.dot(custom_paper, custom_concept2)}")

Dot product between two concepts (shark and ocean): 0.7894057035446167
Dot product between two concepts (shark and strawberry): 0.6854922771453857
Dot product between the paper and a concept (ocean): 0.5999240875244141


In [29]:
base_paper = base_model.encode(paper)

base_concept1 = base_model.encode(concept1)
base_concept2 = base_model.encode(concept2)
base_concept3 = base_model.encode(concept3)  

# Print the similarity scores
print(f"Dot product between two concepts (shark and ocean): {np.dot(base_concept1, base_concept2)}")
print(f"Dot product between two concepts (shark and strawberry): {np.dot(base_concept1, base_concept3)}")
print(f"Dot product between the paper and a concept (ocean): {np.dot(base_paper, base_concept2)}")

Dot product between two concepts (shark and ocean): 0.5527569055557251
Dot product between two concepts (shark and strawberry): 0.27426061034202576
Dot product between the paper and a concept (ocean): -0.05138666182756424
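One simple way to read the two sets of scores is to check whether both models rank the three pairs in the same order. The sketch below does that with the printed values rounded to four decimals; the dictionary keys and the `ranking` helper are labels chosen for this illustration.

```python
# Scores copied from the outputs above, rounded to four decimals
custom_scores = {"shark/ocean": 0.7894, "shark/strawberry": 0.6855, "paper/ocean": 0.5999}
base_scores = {"shark/ocean": 0.5528, "shark/strawberry": 0.2743, "paper/ocean": -0.0514}


def ranking(scores):
    """Pair names sorted from most to least similar."""
    return sorted(scores, key=scores.get, reverse=True)


same_order = ranking(custom_scores) == ranking(base_scores)

print("custom model ranking:", ranking(custom_scores))
print("base model ranking:  ", ranking(base_scores))
print("same ordering:", same_order)
```

Both models place shark/ocean first and paper/ocean last, although the absolute scores of the custom model are higher across the board.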
