# [Golden Datasets] Results Replication

This notebook walks through a simplified version of how to replicate our results.

Using the English subset of the [multilingual Wikipedia dataset](https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries), Anthropic's `claude-3-5-sonnet`, and OpenAI's `text-embedding-3-small`, we will demonstrate the following:
- Models have memorized public benchmarks
- We can generate unseen queries that are representative of the ground truth dataset

## 1. Setup

### 1.1 Install & Import

Install the necessary packages.

In [None]:
!pip install -r requirements.txt

Import modules.

In [None]:
import chromadb
import pandas as pd
import numpy as np
from tqdm import tqdm
import json
import matplotlib.pyplot as plt
import datasets
from sentence_transformers import SentenceTransformer
from voyageai import Client as VoyageClient
from openai import OpenAI as OpenAIClient
from anthropic import Anthropic as AnthropicClient
from llm_calls import *
from utils import *
from embedding_funcs import *
from chroma_funcs import *
from evaluation_funcs import *

### 1.2 Load API Keys

To use Chroma Cloud, sign up for an account [here](https://www.trychroma.com/) and create a new database. To use local Chroma instead, leave the Chroma Cloud values as they are and fill in only `OPENAI_API_KEY` and `CLAUDE_API_KEY`.

In [None]:
# Chroma Cloud
CHROMA_TENANT = "YOUR CHROMA TENANT ID"
X_CHROMA_TOKEN = "YOUR CHROMA API KEY"
DATABASE_NAME = "YOUR CHROMA DATABASE NAME"

# Embedding Model
OPENAI_API_KEY = "YOUR OPENAI API KEY"

# LLM
CLAUDE_API_KEY = "YOUR CLAUDE API KEY"
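Hardcoded keys are easy to leak when a notebook is shared. As an alternative, here is a minimal sketch that reads the keys from environment variables. It assumes variables named `OPENAI_API_KEY` and `CLAUDE_API_KEY` are set in your shell; the helper `read_key` is hypothetical and not part of this repository.

```python
import os

def read_key(name: str, placeholder: str = "") -> str:
    """Return the environment variable `name`, or `placeholder` if it is unset."""
    return os.environ.get(name, placeholder)

# Falls back to the placeholder strings used above when a variable is not set.
OPENAI_API_KEY = read_key("OPENAI_API_KEY", "YOUR OPENAI API KEY")
CLAUDE_API_KEY = read_key("CLAUDE_API_KEY", "YOUR CLAUDE API KEY")
```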

### 1.3 Set Clients

We'll use our API keys to initialize the clients.

In [None]:
chroma_client = chromadb.HttpClient(
  ssl=True,
  host='api.trychroma.com',
  tenant=CHROMA_TENANT,
  database=DATABASE_NAME,
  headers={
    'x-chroma-token': X_CHROMA_TOKEN
  }
)

# To use local Chroma instead, skip the HttpClient call above and uncomment the following line:
# chroma_client = chromadb.Client()

openai_client = OpenAIClient(api_key=OPENAI_API_KEY)
claude_client = AnthropicClient(api_key=CLAUDE_API_KEY)

### 1.4 Load Data

For this simplified version, we'll use the English subset of the [multilingual Wikipedia dataset](https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries).

We'll use the `test` split for this demonstration, which contains:
- 1500 queries
- 1500 query-corpus relevance judgments
- 13500 corpus documents

First, we'll load the queries, corpus, and query-corpus relevance judgments.

In [21]:
wiki_queries = datasets.load_dataset("ellamind/wikipedia-2023-11-retrieval-multilingual-queries", "en")["test"].to_pandas()
wiki_corpus = datasets.load_dataset("ellamind/wikipedia-2023-11-retrieval-multilingual-corpus", "en")["test"].to_pandas()
wiki_qrels = datasets.load_dataset("ellamind/wikipedia-2023-11-retrieval-multilingual-qrels", "en")["test"].to_pandas()

For this dataset, the query-corpus relevance judgments mark distractors with a `score` of `0.5` and target matches with a `score` of `1.0`.

We'll filter the query-corpus relevance judgments to only include target matches. Then, we'll combine the queries, corpus, and query-corpus relevance judgments into a single dataframe for convenience.

In [22]:
wiki_qrels = wiki_qrels[wiki_qrels["score"] == 1.0]

wiki_qrels = combined_datasets_dataframes(wiki_queries, wiki_corpus, wiki_qrels)

## 2. Embed Corpus & Store in Chroma

First, we create a Chroma collection to store our corpus embeddings.

In [None]:
wiki_collection = chroma_client.get_or_create_collection(
    name="wiki-text-embedding-3-small",
    metadata={"hnsw:space": "cosine"}
)

Embed the corpus using `text-embedding-3-small` and add to `wiki_collection` (we use batching and threading to speed up the process).

We'll also create a lookup dictionary to store the corpus embeddings for later use.

In [None]:
wiki_corpus_ids = wiki_corpus["_id"].tolist()
wiki_corpus_texts = wiki_corpus["text"].tolist()

wiki_corpus_embeddings = openai_embed_in_batches(openai_client, wiki_corpus_texts, "text-embedding-3-small")

collection_add_in_batches(wiki_collection, wiki_corpus_ids, wiki_corpus_texts, wiki_corpus_embeddings)

wiki_corpus_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(wiki_corpus_ids, wiki_corpus_texts, wiki_corpus_embeddings)
}
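The batching idea behind the embedding step can be sketched as follows. This is not the implementation of `openai_embed_in_batches` (which lives in `embedding_funcs.py` and also uses threading); it is a sequential illustration under stated assumptions. The names `embed_in_batches`, `embed_batch`, and `fake_embed` are hypothetical, and the embedder is passed in as a function so the sketch runs without an API key.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` by calling `embed_batch` on consecutive slices, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in embedder for illustration only: maps each text to a
# one-dimensional vector holding its length.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]

# Three texts with batch_size=2 are processed as two batches: 2 texts, then 1.
vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

With the OpenAI client, `embed_batch` could wrap `openai_client.embeddings.create(model="text-embedding-3-small", input=batch)` and return the `embedding` field of each item in the response's `data` list. Keeping the output in the same order as the input is what makes the later `zip` over ids, texts, and embeddings safe.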

## 3. Simple Query Generation

We will demonstrate that models have memorized public benchmarks with a naive query generation approach.

Generate 1500 queries, giving the model only the corpus document as context. The prompt can be found in `llm_calls.py` under `generate_query`.

In [None]:
wiki_simple_generated_queries = []

for _, row in tqdm(wiki_qrels.iterrows(), total=len(wiki_qrels), desc="Generating simple queries..."):
    corpus = row['corpus-text']
    generated_query = generate_query(claude_client, corpus)
    wiki_simple_generated_queries.append(generated_query)

wiki_qrels["simple-generated-query-text"] = wiki_simple_generated_queries

Here, we embed the original queries and generated queries. We store the embeddings in a lookup dictionary as well for later use.

In [None]:
wiki_original_queries = wiki_qrels["query-text"].tolist()
wiki_query_ids = wiki_qrels["query-id"].tolist()

wiki_original_query_embeddings = openai_embed_in_batches(openai_client, wiki_original_queries, "text-embedding-3-small")
wiki_simple_generated_query_embeddings = openai_embed_in_batches(openai_client, wiki_simple_generated_queries, "text-embedding-3-small")

wiki_original_query_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(wiki_query_ids, wiki_original_queries, wiki_original_query_embeddings)
}

wiki_simple_generated_query_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(wiki_query_ids, wiki_simple_generated_queries, wiki_simple_generated_query_embeddings)
}

Once our embeddings are computed, we can compare the cosine similarity between the original queries and generated queries.

In [None]:
wiki_query_query_scores = score_query_query(wiki_qrels, wiki_original_query_lookup, wiki_simple_generated_query_lookup)

plt.figure(figsize=(8, 5))

plt.hist(wiki_query_query_scores["query-query-score"], bins=30, alpha=0.5,  edgecolor='black', label="Score", range=(0, 1), density=True)

plt.xlabel("Cosine Similarity")
plt.ylabel("Normalized Frequency")
plt.title("text-embedding-3-small")
plt.legend()
plt.grid(True)
plt.show()
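For reference, the cosine similarity plotted above is the dot product of two vectors divided by the product of their norms. A minimal sketch using only the standard library follows; `cosine_similarity` is a hypothetical helper written for illustration, not the code inside `score_query_query`.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score close to 1;
# perpendicular vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A score near 1 for an original/generated pair means the two queries are nearly identical in embedding space. Note that a Chroma collection configured with the `cosine` space reports cosine distance, which is 1 minus this similarity.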

The generated queries are very similar to the original queries. To investigate further, we print the ten most similar pairs:

In [None]:
wiki_query_query_scores.sort_values(by="query-query-score", ascending=False, inplace=True)

for i, row in wiki_query_query_scores.head(10).iterrows():
    print(f"Score: {row['query-query-score']:.4f}")
    print(f"Original Query: {row['query-text']}")
    print(f"Generated Query: {row['simple-generated-query-text']}")
    print("-" * 80)

## 4. Distinct Query Generation

Since models have memorized these public benchmarks, we will generate unseen queries by explicitly prompting the model to generate a distinct query.

Then, we will demonstrate that these newly generated distinct queries are also representative of the ground truth dataset.

We generate 1500 queries, now including both the original query and the corpus as context. The prompt can be found in `llm_calls.py` under `generate_query_with_example`.

In [None]:
wiki_distinct_generated_queries = []

for _, row in tqdm(wiki_qrels.iterrows(), total=len(wiki_qrels), desc="Generating distinct queries..."):
    query = row['query-text']
    corpus = row['corpus-text']
    generated_query = generate_query_with_example(claude_client, query, corpus)
    wiki_distinct_generated_queries.append(generated_query)

wiki_qrels["distinct-generated-query-text"] = wiki_distinct_generated_queries

We embed the newly generated queries and store them in a lookup dictionary.

In [None]:
wiki_distinct_generated_query_embeddings = openai_embed_in_batches(openai_client, wiki_distinct_generated_queries, "text-embedding-3-small")

wiki_distinct_generated_query_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(wiki_query_ids, wiki_distinct_generated_queries, wiki_distinct_generated_query_embeddings)
}

We run the retrieval task across both the generated and original queries:

In [None]:
k_values = [1, 3, 5, 10]

wiki_distinct_gen_results = get_results(wiki_collection, wiki_distinct_generated_queries, wiki_corpus_ids, wiki_distinct_generated_query_embeddings)
wiki_distinct_gen_metrics = evaluate(k_values, wiki_qrels, wiki_distinct_gen_results)

wiki_original_results = get_results(wiki_collection, wiki_original_queries, wiki_corpus_ids, wiki_original_query_embeddings)
wiki_original_metrics = evaluate(k_values, wiki_qrels, wiki_original_results)

wiki_metrics = [wiki_distinct_gen_metrics, wiki_original_metrics]
labels = ["Generated", "Original"]

comparison_df = create_comparison_dataframe(wiki_metrics, labels)

comparison_df
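One metric commonly reported for a retrieval task like this is Recall@k: the fraction of queries whose target document appears among the top k results. The sketch below is an illustration only. Which metrics `evaluate` actually computes is defined in `evaluation_funcs.py` and is not shown here; `recall_at_k` and the toy data are hypothetical, and the sketch assumes exactly one target document per query.

```python
from typing import Dict, List

def recall_at_k(
    relevant: Dict[str, str],
    retrieved: Dict[str, List[str]],
    k: int,
) -> float:
    """Fraction of queries whose single target document is in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant must contain at least one query")
    hits = sum(
        1
        for query_id, doc_id in relevant.items()
        if doc_id in retrieved.get(query_id, [])[:k]
    )
    return hits / len(relevant)

# Toy data: each query maps to its target document and to a ranked result list.
toy_relevant = {"q1": "d1", "q2": "d4"}
toy_retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2", "d4"]}

recall_by_k = {k: recall_at_k(toy_relevant, toy_retrieved, k) for k in (1, 3)}
```

In the toy data, `q1` finds its target at rank 1 while `q2` finds it only at rank 3, so recall rises from 0.5 at k=1 to 1.0 at k=3. If the generated queries are representative of the originals, the two rows of `comparison_df` should show similar values at each k.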

We compare the query-to-document cosine similarity distributions of the generated and original queries here:

In [None]:
wiki_distinct_gen_scores = score_query_corpus(wiki_qrels, wiki_distinct_generated_query_lookup, wiki_corpus_lookup)
wiki_original_scores = score_query_corpus(wiki_qrels, wiki_original_query_lookup, wiki_corpus_lookup)

plt.figure(figsize=(8, 5))

# Labels must match the data: distinct generated queries vs. original queries.
plt.hist(wiki_distinct_gen_scores["query-corpus-score"], bins=30, alpha=0.5, edgecolor='black', label="Generated", range=(0, 1), density=True)
plt.hist(wiki_original_scores["query-corpus-score"], bins=30, alpha=0.5, edgecolor='black', label="Original", range=(0, 1), density=True)

plt.xlabel("Cosine Similarity")
plt.ylabel("Normalized Frequency")

plt.title("text-embedding-3-small (1500 Queries, 13500 Documents)")
plt.legend()
plt.grid(True)
plt.show()