# Collect metrics
For each dataset:
1. Get similarity scores between chunks and question. We collect BERT scores, Jaccard similarities and BM25 scores.
2. Apply the re-rankers to the chunks and record the ranks of the chunks. 

## Environment setup

Change the current working directory to the repo root.

In [1]:
import os
os.chdir("../.")
os.getcwd() # check that we are in the right directory

'/Users/lovhag/Documents/Projects/rerankers-and-lexical-similarities'

In [None]:
from datasets import load_dataset
import pandas as pd
from tqdm import tqdm


## Load the base data

In [4]:
nq_data = load_dataset("Lo/rerankers-and-lexical-similarities", "NQ", split="standard").to_pandas()
litqao_data = load_dataset("Lo/rerankers-and-lexical-similarities", "LitQA2-o", split="standard").to_pandas()
druid_data = load_dataset("Lo/rerankers-and-lexical-similarities", "DRUID", split="standard").to_pandas()
druidq_data = load_dataset("Lo/rerankers-and-lexical-similarities", "DRUID", split="prompt").to_pandas()

# chunks with prepended titles
nq_t_data = load_dataset("Lo/rerankers-and-lexical-similarities", "NQ", split="title").to_pandas()
litqao_t_data = load_dataset("Lo/rerankers-and-lexical-similarities", "LitQA2-o", split="title").to_pandas()
druid_t_data = load_dataset("Lo/rerankers-and-lexical-similarities", "DRUID", split="title").to_pandas()

# chunks with prepended contexts
nq_c_data = load_dataset("Lo/rerankers-and-lexical-similarities", "NQ", split="context").to_pandas()
litqao_c_data = load_dataset("Lo/rerankers-and-lexical-similarities", "LitQA2-o", split="context").to_pandas()

## 1. Get similarity scores

Measure the similarity between each chunk and the question using BERT score, Jaccard similarity and BM25.

#### Get BERT scores using a GPU

To get the BERT scores, run the following script:

```bash
python -m src.collect_metrics.get_bert_scores \
        --data_file <data-file> 
```

Replace `<data-file>` with the path to the dataset containing the passages/chunks for which you want BERT scores, e.g. `data/DRUID/chunks.jsonl`. Running the script on a GPU is recommended. We applied it to all of the data files loaded above.
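
For intuition about what a BERT score measures, the sketch below applies BERTScore-style greedy matching to toy token embeddings. It is not the repo's `get_bert_scores` script, which is not shown here. The 2-d vectors, the `cosine` and `bertscore_f1` helpers, and the choice of the question as reference are all assumptions made for illustration; real scores would use contextual embeddings from a BERT model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(ref_embs, cand_embs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_embs) for r in ref_embs) / len(ref_embs)
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_embs) for c in cand_embs) / len(cand_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-d "embeddings" (hypothetical; one vector per token).
question_embs = [[1.0, 0.0], [0.0, 1.0]]
chunk_embs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
precision, recall, f1 = bertscore_f1(question_embs, chunk_embs)
```

Here every question token has an exact match in the chunk, so recall is 1, while the extra chunk token `[1.0, 1.0]` lowers precision.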

Load BERT scores data

In [None]:
nq_data = nq_data.merge(pd.read_json("data/NQ/chunks_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
litqao_data = litqao_data.merge(pd.read_json("data/LitQA2-options/chunks_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
druid_data = druid_data.merge(pd.read_json("data/DRUID/chunks_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
druidq_data = druidq_data.merge(pd.read_json("data/DRUID-q/chunks_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)

nq_t_data = nq_t_data.merge(pd.read_json("data/NQ/chunks_w_titles_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
litqao_t_data = litqao_t_data.merge(pd.read_json("data/LitQA2-options/chunks_w_titles_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
druid_t_data = druid_t_data.merge(pd.read_json("data/DRUID/chunks_w_titles_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)

nq_c_data = nq_c_data.merge(pd.read_json("data/NQ/chunks_w_contexts_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
litqao_c_data = litqao_c_data.merge(pd.read_json("data/LitQA2-options/chunks_w_contexts_bert_scores.jsonl", lines=True)['bert_score'], left_index=True, right_index=True)
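
The merges above join on the row index, and `DataFrame.merge` performs an inner join by default, so a score file with fewer records than the data frame would silently drop rows. Below is a minimal sketch of a pre-merge check using only the standard library. `count_jsonl_records` and `check_alignment` are hypothetical helpers, and the demo file (a list of scores per record) is an assumed shape rather than the repo's documented format.

```python
import json
import os
import tempfile


def count_jsonl_records(path):
    """Count non-empty lines (one JSON record each) in a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_alignment(n_rows, scores_path):
    """Return True if the score file has exactly one record per data row."""
    return count_jsonl_records(scores_path) == n_rows


# Demo on a temporary file standing in for a *_bert_scores.jsonl file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_scores.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for scores in ([0.81, 0.42], [0.77]):
            f.write(json.dumps({"bert_score": scores}) + "\n")
    aligned = check_alignment(2, demo_path)      # two rows, two records
    misaligned = check_alignment(3, demo_path)   # three rows, two records
```

In the notebook, the check could be called as `check_alignment(len(nq_data), "data/NQ/chunks_bert_scores.jsonl")` before the corresponding merge.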

#### Get Jaccard similarities

In [None]:
from string import punctuation
from nltk.metrics.distance import jaccard_distance
from nltk.tokenize import word_tokenize

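# Tokens excluded from the word sets: ASCII punctuation, typographic quotes and dashes, and tokenizer quote markers
SKIPLIST = set(list(punctuation) + ["”", "“", "—", "’", "``", "''"])

def get_jaccard_sim(row):
    """Return the Jaccard similarity between the question and each chunk of the row."""
    def get_jaccard_index(s1, s2):
        # Lowercased word sets without punctuation tokens
        words_1 = {w.lower() for w in word_tokenize(s1) if w.lower() not in SKIPLIST}
        words_2 = {w.lower() for w in word_tokenize(s2) if w.lower() not in SKIPLIST}
        # Jaccard similarity = 1 - Jaccard distance
        return 1 - jaccard_distance(words_1, words_2)

    # Chunks are either plain strings or dicts holding the text under "chunk"
    if isinstance(row.chunks[0], dict):
        chunk_texts = [val["chunk"] for val in row.chunks]
    else:
        chunk_texts = row.chunks
    return [get_jaccard_index(row.question, chunk) for chunk in chunk_texts]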

nq_data["jaccard_sim"] = nq_data.apply(get_jaccard_sim, axis=1)
litqao_data["jaccard_sim"] = litqao_data.apply(get_jaccard_sim, axis=1)
druid_data["jaccard_sim"] = druid_data.apply(get_jaccard_sim, axis=1)
druidq_data["jaccard_sim"] = druidq_data.apply(get_jaccard_sim, axis=1)

nq_t_data["jaccard_sim"] = nq_t_data.apply(get_jaccard_sim, axis=1)
litqao_t_data["jaccard_sim"] = litqao_t_data.apply(get_jaccard_sim, axis=1)
druid_t_data["jaccard_sim"] = druid_t_data.apply(get_jaccard_sim, axis=1)

nq_c_data["jaccard_sim"] = nq_c_data.apply(get_jaccard_sim, axis=1)
litqao_c_data["jaccard_sim"] = litqao_c_data.apply(get_jaccard_sim, axis=1)
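
As a worked example of the measure computed above, the sketch below evaluates the Jaccard similarity for one question and chunk. To stay dependency-free it replaces `word_tokenize` with whitespace splitting plus punctuation stripping, which is a simplification and may tokenize differently from NLTK. The `jaccard_index` helper and the example strings are hypothetical.

```python
from string import punctuation


def jaccard_index(s1, s2):
    """Size of the word-set intersection divided by the size of the union."""
    # Simplification: whitespace split and punctuation stripping instead of word_tokenize.
    words_1 = {w.strip(punctuation).lower() for w in s1.split()} - {""}
    words_2 = {w.strip(punctuation).lower() for w in s2.split()} - {""}
    return len(words_1 & words_2) / len(words_1 | words_2)


question = "Who wrote Hamlet?"
chunk = "Hamlet was written by William Shakespeare."
# Shared words: {"hamlet"}; union has 3 + 6 - 1 = 8 words, so the result is 1/8.
similarity = jaccard_index(question, chunk)
```

Note that "wrote" and "written" count as different words, since the measure is purely lexical.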

#### Get BM25 scores

The following commands save the BM25 scores for each data file to a file with the suffix `_bm25_scores.jsonl`, e.g. `chunks_bm25_scores.jsonl` for `chunks.jsonl`.

**NQ**

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/NQ/chunks.jsonl" 
```

**LitQA2-options**

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/LitQA2-options/chunks.jsonl" 
```

**DRUID**

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/DRUID/chunks.jsonl" 
```

**DRUID-q**

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/DRUID-q/chunks.jsonl" 
```

**With titles**

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/NQ/chunks_w_titles.jsonl" 
```

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/LitQA2-options/chunks_w_titles.jsonl" 
```

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/DRUID/chunks_w_titles.jsonl" 
```

**With contexts**

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/NQ/chunks_w_contexts.jsonl" 
```

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/LitQA2-options/chunks_w_contexts.jsonl" 
```

```bash
python -m src.collect_metrics.get_bm25_scores \
        --data_file "data/DRUID/chunks_w_contexts.jsonl" 
```
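
For reference, the sketch below shows how an Okapi BM25 score can be computed for a question against a set of chunks. The implementation in `src.collect_metrics.get_bm25_scores` is not shown here and may differ. The whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the use of one question's chunks as the corpus are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalisation.
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "hamlet is a tragedy by william shakespeare",
    "the capital of france is paris",
    "shakespeare wrote many plays",
]
scores = bm25_scores("who wrote hamlet", chunks)
```

The second chunk shares no term with the question and therefore scores zero, while the other two each match one query term.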

Load the scores

In [None]:
nq_data = nq_data.merge(pd.read_json("data/NQ/chunks_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)
litqao_data = litqao_data.merge(pd.read_json("data/LitQA2-options/chunks_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)
druid_data = druid_data.merge(pd.read_json("data/DRUID/chunks_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)
druidq_data = druidq_data.merge(pd.read_json("data/DRUID-q/chunks_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)

nq_t_data = nq_t_data.merge(pd.read_json("data/NQ/chunks_w_titles_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)
litqao_t_data = litqao_t_data.merge(pd.read_json("data/LitQA2-options/chunks_w_titles_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)
druid_t_data = druid_t_data.merge(pd.read_json("data/DRUID/chunks_w_titles_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)

nq_c_data = nq_c_data.merge(pd.read_json("data/NQ/chunks_w_contexts_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)
litqao_c_data = litqao_c_data.merge(pd.read_json("data/LitQA2-options/chunks_w_contexts_bm25_scores.jsonl", lines=True)['bm25_score'], left_index=True, right_index=True)

## 2. Apply the rerankers to get reranker scores

### Cohere reranker

For this, you need to set up a Cohere API key and put it under `API-keys/cohere-api-key.txt`.

In [None]:
import cohere
import time

# Load Cohere API key
with open("API-keys/cohere-api-key.txt", "r") as f_co:
    api_key = f_co.readline().strip()
co = cohere.Client(api_key=api_key)

RERANK_MAX_NBR_DOCUMENTS = 10000 # fixed by Cohere

def process_sentences_for_rerank(sentences):
    # keep at most RERANK_MAX_NBR_DOCUMENTS documents per request
    return sentences[:RERANK_MAX_NBR_DOCUMENTS]

In [None]:
def get_cohere_reranker_scores(hs_data):
    top_ixs = []
    top_scores = []
    for ix, row in tqdm(hs_data.iterrows(), total=len(hs_data)):
        if isinstance(row.chunks[0], dict):
            docs = [val["chunk"] for val in row.chunks]
        else:
            docs = row.chunks
        # cap the number of documents before sending the request
        docs = process_sentences_for_rerank(docs)
        response = co.rerank(
                model="rerank-english-v3.0",
                query=row.question,
                documents=docs
            )
        tmp_top_ixs = [res.index for res in response.results]
        top_ixs.append(tmp_top_ixs)
        
        tmp_top_scores = [res.relevance_score for res in response.results]
        top_scores.append(tmp_top_scores)
        time.sleep(0.1)

    hs_data["reranker_top_ixs"] = top_ixs
    hs_data["reranker_top_scores"] = top_scores
    return hs_data

nq_data = get_cohere_reranker_scores(nq_data)
litqao_data = get_cohere_reranker_scores(litqao_data)
druid_data = get_cohere_reranker_scores(druid_data)
druidq_data = get_cohere_reranker_scores(druidq_data)

nq_t_data = get_cohere_reranker_scores(nq_t_data)
litqao_t_data = get_cohere_reranker_scores(litqao_t_data)
druid_t_data = get_cohere_reranker_scores(druid_t_data)

nq_c_data = get_cohere_reranker_scores(nq_c_data)
litqao_c_data = get_cohere_reranker_scores(litqao_c_data)

100%|██████████| 3759/3759 [23:19<00:00,  2.69it/s]  
100%|██████████| 124/124 [01:07<00:00,  1.84it/s]
100%|██████████| 875/875 [04:42<00:00,  3.10it/s]
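
The `reranker_top_ixs` column lists chunk indices in order of decreasing relevance. To obtain the rank of each chunk in its original position, the list can be inverted as in the sketch below. `ranks_from_top_ixs` is a hypothetical helper, and it assumes the reranker returned every chunk exactly once.

```python
def ranks_from_top_ixs(top_ixs):
    """Map ranked chunk indices to a 1-based rank per original chunk position."""
    ranks = [None] * len(top_ixs)
    for rank, chunk_ix in enumerate(top_ixs, start=1):
        ranks[chunk_ix] = rank
    return ranks


# The reranker placed chunk 2 first, then chunk 0, then chunk 1.
example_ranks = ranks_from_top_ixs([2, 0, 1])
```

Applied to a data frame, this could look like `nq_data["reranker_top_ixs"].apply(ranks_from_top_ixs)`.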


Save the data

In [None]:
nq_data.to_json("data/NQ/chunks_bert_reranker_scores.jsonl", orient='records', lines=True)
litqao_data.to_json("data/LitQA2-options/chunks_bert_reranker_scores.jsonl", orient='records', lines=True)
druid_data.to_json("data/DRUID/chunks_bert_reranker_scores.jsonl", orient='records', lines=True)
druidq_data.to_json("data/DRUID-q/chunks_bert_reranker_scores.jsonl", orient='records', lines=True)

nq_t_data.to_json("data/NQ/chunks_w_titles_bert_reranker_scores.jsonl", orient='records', lines=True)
litqao_t_data.to_json("data/LitQA2-options/chunks_w_titles_bert_reranker_scores.jsonl", orient='records', lines=True)
druid_t_data.to_json("data/DRUID/chunks_w_titles_bert_reranker_scores.jsonl", orient='records', lines=True)

nq_c_data.to_json("data/NQ/chunks_w_contexts_bert_reranker_scores.jsonl", orient='records', lines=True)
litqao_c_data.to_json("data/LitQA2-options/chunks_w_contexts_bert_reranker_scores.jsonl", orient='records', lines=True)

### BAAI/bge-reranker-v2-gemma

Run the following script to get the re-ranker scores:

```bash
python -m src.collect_metrics.get_bge_reranker_v2_gemma_ranks \
        --data_file <data-path> 
```

Replace `<data-path>` with each of the data files covered above in turn, e.g. `data/DRUID/chunks.jsonl`, to collect the re-ranker scores for all datasets. Running the script on a GPU is recommended.

### jinaai/jina-reranker-v1-turbo-en

Run the following script to get the re-ranker scores:

```bash
python -m src.collect_metrics.get_jina_reranker_v1_turbo_en_ranks \
        --data_file <data-path> 
```

Replace `<data-path>` with each of the data files covered above in turn, e.g. `data/DRUID/chunks.jsonl`, to collect the re-ranker scores for all datasets. Running the script on a GPU is recommended.

### jinaai/jina-reranker-v2-base-multilingual

Run the following script to get the re-ranker scores:

```bash
python -m src.collect_metrics.get_jina_reranker_v2_base_multilingual_ranks \
        --data_file <data-path> 
```

Replace `<data-path>` with each of the data files covered above in turn, e.g. `data/DRUID/chunks.jsonl`, to collect the re-ranker scores for all datasets. Running the script on a GPU is recommended.
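
Since the same three scripts are run over several data files, the commands can be generated rather than typed out. The sketch below only builds the argument lists and does not execute anything. `build_commands` is a hypothetical helper, and the file list covers just the four base files, so extend it with the title and context variants as needed.

```python
import sys

# Module names taken from the commands above.
MODULES = [
    "src.collect_metrics.get_bge_reranker_v2_gemma_ranks",
    "src.collect_metrics.get_jina_reranker_v1_turbo_en_ranks",
    "src.collect_metrics.get_jina_reranker_v2_base_multilingual_ranks",
]

# Base data files used earlier in this notebook.
DATA_FILES = [
    "data/NQ/chunks.jsonl",
    "data/LitQA2-options/chunks.jsonl",
    "data/DRUID/chunks.jsonl",
    "data/DRUID-q/chunks.jsonl",
]


def build_commands(modules, data_files):
    """Return one argument list per (module, data file) pair."""
    return [
        [sys.executable, "-m", module, "--data_file", path]
        for module in modules
        for path in data_files
    ]


commands = build_commands(MODULES, DATA_FILES)
```

Each entry can then be passed to `subprocess.run(cmd, check=True)` from the repo root.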