# Custom Evaluation with LlamaIndex

In this notebook, we evaluate 3 embedding models:

1. proprietary OpenAI embedding
2. open source `BAAI/bge-small-en`
3. our finetuned embedding model

We consider 2 evaluation approaches:

1. a simple custom **hit rate** metric
2. the `InformationRetrievalEvaluator` from sentence_transformers

In [24]:
import json
from tqdm.notebook import tqdm
import pandas as pd

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.schema import TextNode
from llama_index.embeddings import OpenAIEmbedding

### Load data

First, we load the dataset that was automatically generated from our corpus (without having access to any labellers).

In [25]:
TRAIN_DATASET_FPATH = './data/train_dataset.json'
VAL_DATASET_FPATH = './data/val_dataset.json'

In [26]:
# Open read-only: the datasets are only loaded here, never written back.
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r') as f:
    val_dataset = json.load(f)
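
The evaluation functions in this notebook index each dataset by three keys: `corpus`, `queries`, and `relevant_docs`. As a rough sketch of that layout (the IDs and texts below are invented for illustration, and `check_dataset` is a hypothetical helper, not part of LlamaIndex), a quick consistency check could look like this:

```python
# Toy dataset in the shape the evaluation code expects.
# All IDs and texts are made up for illustration.
toy_dataset = {
    "corpus": {
        "node-1": "Embeddings map text to vectors.",
        "node-2": "A retriever returns the top-k most similar nodes.",
    },
    "queries": {
        "q-1": "What do embeddings do?",
        "q-2": "What does a retriever return?",
    },
    # Each query ID maps to a list of relevant corpus IDs.
    "relevant_docs": {
        "q-1": ["node-1"],
        "q-2": ["node-2"],
    },
}


def check_dataset(dataset):
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    for query_id in dataset["queries"]:
        doc_ids = dataset["relevant_docs"].get(query_id)
        if not doc_ids:
            problems.append(f"{query_id}: no relevant docs listed")
            continue
        for doc_id in doc_ids:
            if doc_id not in dataset["corpus"]:
                problems.append(f"{query_id}: unknown corpus id {doc_id}")
    return problems


print(check_dataset(toy_dataset))
```

Running the same check on `train_dataset` and `val_dataset` before evaluation can catch a query that points at a missing corpus entry.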

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve the top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to the proprietary OpenAI embedding model as well as to our open-source and fine-tuned embedding models.

In [27]:
def evaluate(
    dataset,
    embed_model,
    top_k = 5,
    verbose=False,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    service_context = ServiceContext.from_defaults(embed_model = embed_model)
    nodes = [TextNode(id_ = id_, text = text) for id_, text in corpus.items()] 
    index = VectorStoreIndex(
        nodes, 
        service_context = service_context, 
        show_progress = True
    )
    retriever = index.as_retriever(similarity_top_k = top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc
        
        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_ids,
            'expected': expected_id,
            'query': query_id,
        }
        eval_results.append(eval_result)
    return eval_results
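
The list returned by `evaluate` can be summarized without pandas. The following is a minimal sketch, assuming result dicts with the same keys as above; `hit_rate` and `missed_queries` are hypothetical helper names and the toy results are invented:

```python
def hit_rate(eval_results):
    """Fraction of queries whose expected document was retrieved."""
    if not eval_results:
        return 0.0
    hits = sum(1 for result in eval_results if result["is_hit"])
    return hits / len(eval_results)


def missed_queries(eval_results):
    """IDs of the queries whose expected document was not retrieved."""
    return [result["query"] for result in eval_results if not result["is_hit"]]


# Invented results in the same shape that `evaluate` returns.
toy_results = [
    {"is_hit": True, "retrieved": ["a", "b"], "expected": "a", "query": "q-1"},
    {"is_hit": True, "retrieved": ["c", "b"], "expected": "b", "query": "q-2"},
    {"is_hit": False, "retrieved": ["a", "b"], "expected": "d", "query": "q-3"},
    {"is_hit": True, "retrieved": ["d", "a"], "expected": "d", "query": "q-4"},
]

print(hit_rate(toy_results))
print(missed_queries(toy_results))
```

Listing the missed queries is often more informative than the aggregate number, since it shows which kinds of questions the embedding model fails on.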

**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against models compatible with sentence_transformers (the open-source model and our fine-tuned model, **not** the OpenAI embedding model).

In [28]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    return evaluator(model, output_path='results/')
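
To give a feel for what the rank-based metrics measure, here is a simplified sketch of Accuracy@k and MRR@k in plain Python. It is an illustration under the assumption that each query comes with a ranked list of retrieved IDs, not the sentence_transformers implementation; the function names and toy rankings are made up:

```python
def accuracy_at_k(ranked, relevant, k):
    """Share of queries with at least one relevant doc in the top k."""
    hits = 0
    for query_id, doc_ids in ranked.items():
        wanted = set(relevant[query_id])
        if any(doc_id in wanted for doc_id in doc_ids[:k]):
            hits += 1
    return hits / len(ranked)


def mrr_at_k(ranked, relevant, k):
    """Mean of 1/rank of the first relevant doc in the top k (0 if none)."""
    total = 0.0
    for query_id, doc_ids in ranked.items():
        wanted = set(relevant[query_id])
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if doc_id in wanted:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Invented rankings: q-1 finds its doc at rank 1, q-2 at rank 2, q-3 not at all.
toy_ranked = {
    "q-1": ["a", "b", "c"],
    "q-2": ["b", "a", "c"],
    "q-3": ["b", "c", "d"],
}
toy_relevant = {"q-1": ["a"], "q-2": ["a"], "q-3": ["e"]}

print(accuracy_at_k(toy_ranked, toy_relevant, k=1))
print(accuracy_at_k(toy_ranked, toy_relevant, k=3))
print(mrr_at_k(toy_ranked, toy_relevant, k=3))
```

With a single relevant document per query, Accuracy@k is the same idea as the hit rate from option 1, while MRR also rewards placing the relevant document closer to the top.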

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and the queries. It also consumes some OpenAI API credits.

In [29]:
import os
import openai

# Never hardcode an API key in a notebook. Set OPENAI_API_KEY in your
# shell before starting Jupyter and read it from the environment here.
openai.api_key = os.environ["OPENAI_API_KEY"]

In [30]:
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)

Generating embeddings:   0%|          | 0/99 [00:00<?, ?it/s]

  0%|          | 0/198 [00:00<?, ?it/s]

In [31]:
df_ada = pd.DataFrame(ada_val_results)

In [32]:
hit_rate_ada = df_ada['is_hit'].mean()
hit_rate_ada

0.9292929292929293

#### BAAI/bge-small-en

In [33]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)

Generating embeddings:   0%|          | 0/99 [00:00<?, ?it/s]

  0%|          | 0/198 [00:00<?, ?it/s]

In [34]:
df_bge = pd.DataFrame(bge_val_results)

In [35]:
hit_rate_bge = df_bge['is_hit'].mean()
hit_rate_bge

0.6161616161616161

In [36]:
evaluate_st(val_dataset, "BAAI/bge-small-en", name='bge')

0.4626562105085016

#### Fine-tuned model

In [37]:
finetuned = "local:exp_finetune"
val_results_finetuned = evaluate(val_dataset, finetuned)

Generating embeddings:   0%|          | 0/99 [00:00<?, ?it/s]

  0%|          | 0/198 [00:00<?, ?it/s]

In [38]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [39]:
hit_rate_finetuned = df_finetuned['is_hit'].mean()
hit_rate_finetuned

0.6717171717171717

In [40]:
evaluate_st(val_dataset, "exp_finetune", name='finetuned')

0.5324246590927099

### Results

#### Hit rate (option 1)

In [41]:
df_ada['model'] = 'ada'
df_bge['model'] = 'bge'
df_finetuned['model'] = 'fine_tuned'

We can see that fine-tuning our small open-source embedding model improves its retrieval quality: the hit rate rises from 0.616 to 0.672, although it remains well below that of the OpenAI embedding model (0.929).

In [42]:
df_all = pd.concat([df_ada, df_bge, df_finetuned])
# Select the column explicitly; the other columns are not numeric.
df_all.groupby('model')[['is_hit']].mean()

| model | is_hit |
|---|---|
| ada | 0.929293 |
| bge | 0.616162 |
| fine_tuned | 0.671717 |


#### InformationRetrievalEvaluator (option 2)

In [43]:
df_st_bge = pd.read_csv('results/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('results/Information-Retrieval_evaluation_finetuned_results.csv')

We can see that fine-tuning the embedding model improves the results consistently across the whole suite of evaluation metrics.

In [44]:
df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index('model')
df_st_all

| model | epoch | steps | cos_sim-Accuracy@1 | cos_sim-Accuracy@3 | cos_sim-Accuracy@5 | cos_sim-Accuracy@10 | cos_sim-Precision@1 | cos_sim-Recall@1 | cos_sim-Precision@3 | cos_sim-Recall@3 | ... | dot_score-Recall@1 | dot_score-Precision@3 | dot_score-Recall@3 | dot_score-Precision@5 | dot_score-Recall@5 | dot_score-Precision@10 | dot_score-Recall@10 | dot_score-MRR@10 | dot_score-NDCG@10 | dot_score-MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bge | -1 | -1 | 0.358586 | 0.5 | 0.555556 | 0.651515 | 0.358586 | 0.358586 | 0.166667 | 0.5 | ... | 0.358586 | 0.166667 | 0.5 | 0.111111 | 0.555556 | 0.065152 | 0.651515 | 0.446601 | 0.495208 | 0.462656 |
| fine_tuned | -1 | -1 | 0.414141 | 0.590909 | 0.671717 | 0.752525 | 0.414141 | 0.414141 | 0.19697 | 0.590909 | ... | 0.414141 | 0.19697 | 0.590909 | 0.134343 | 0.671717 | 0.075253 | 0.752525 | 0.521216 | 0.577016 | 0.532397 |
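
To put the size of the improvement in perspective, the relative gain per metric can be computed directly from the table. The sketch below copies a handful of values from the two rows above; the dictionary and function names are chosen for illustration only:

```python
# Values copied from the results table above (bge vs. fine_tuned).
bge_scores = {
    "cos_sim-Accuracy@1": 0.358586,
    "cos_sim-Accuracy@10": 0.651515,
    "dot_score-MRR@10": 0.446601,
    "dot_score-NDCG@10": 0.495208,
    "dot_score-MAP@100": 0.462656,
}
finetuned_scores = {
    "cos_sim-Accuracy@1": 0.414141,
    "cos_sim-Accuracy@10": 0.752525,
    "dot_score-MRR@10": 0.521216,
    "dot_score-NDCG@10": 0.577016,
    "dot_score-MAP@100": 0.532397,
}


def relative_gain(before, after):
    """Relative change from `before` to `after`, e.g. 0.10 for +10%."""
    return (after - before) / before


gains = {
    metric: relative_gain(bge_scores[metric], finetuned_scores[metric])
    for metric in bge_scores
}

for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1%}")
```

Reporting relative gains next to the absolute scores makes it easier to compare metrics that live on different scales, such as Accuracy@1 and Accuracy@10.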
