# Needle In a Haystack pressure test

**(Databricks compatible notebook)**

## Pre-requisites:
First, clone the repository from within your Databricks workspace and navigate to this notebook.

1. For using `OpenAI` (as the `provider` to test and/or as the `evaluator`): create a [Databricks secret](https://docs.databricks.com/en/security/secrets/index.html#secret-management) holding your OpenAI API key _(optional)_
2. For testing `databricks` as a `provider` from outside of a Databricks environment: create a [Databricks Personal Access Token](https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-token-authentication) and store its value as a Databricks secret
    - Pick the `model` to test from this [list of supported models](https://docs.databricks.com/en/machine-learning/foundation-models/index.html#pay-per-token-foundation-model-apis)
    - **Get a Hugging Face personal access token (PAT)** from your profile and store it as a Databricks secret as well
    - **If you choose a model other than `databricks-dbrx-instruct`, you may need to provide the correct `tokenizer` _(see lines 55-60 in [databricks.py]($./providers/databricks.py))_**
3. For using `langsmith` as an `evaluator` _(recommended for multi-needle tests)_: create a LangSmith API key and store it as a Databricks secret
    - Sign up for [LangSmith](https://smith.langchain.com/)
    - Create an API key (e.g. `LANGSMITH_API_KEY`) and store it as a secret, as specified in the [setup](https://docs.smith.langchain.com/evaluation/quickstart)
    - In the Datasets + Testing tab, use + Dataset to create a new dataset (chat type) and call it `multi-needle-eval-*`
    - Populate the dataset with a test question (use "Add Example" on the dataset page), for example:
        - For a 5-needle question on San Francisco: human/input: `What are the 5 best things to do in San Francisco?` and ai/output: `The 5 best things to do in San Francisco are: 1) Go to Dolores Park. 2) Eat at Tony's Pizza Napoletana. 3) Visit Alcatraz. 4) Hike up Twin Peaks. 5) Bike across the Golden Gate Bridge` (call the dataset `multi-needle-eval-sf-5`)
        - For a 3-needle question on pizza: human/input: `What are the secret ingredients needed to build the perfect pizza?` and ai/output: `The secret ingredients needed to build the perfect pizza are figs, prosciutto and goat cheese.` (call the dataset `multi-needle-eval-pizza-3`, the default `eval_set` used below)
4. For using `Anthropic` as a `provider` to test: create a Databricks secret for your Anthropic API key
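
Which secrets are actually needed depends on the `provider` and `evaluator` you pick. The sketch below condenses the list above into a small lookup. The function name and the returned labels are hypothetical placeholders for illustration, not names used by the repository, and the notebook itself may read more keys than this minimal set.

```python
def required_secrets(provider: str, evaluator: str) -> set[str]:
    """Return illustrative labels for the secrets a given setup needs."""
    provider = provider.lower()
    evaluator = evaluator.lower()
    needed: set[str] = set()
    if provider == "openai" or evaluator == "openai":
        needed.add("openai_api_key")
    if provider == "databricks":
        # The list above asks for a PAT (when calling from outside Databricks)
        # and a Hugging Face token alongside it.
        needed.update({"databricks_pat", "huggingface_pat"})
    if provider == "anthropic":
        needed.add("anthropic_api_key")
    if evaluator == "langsmith":
        needed.add("langsmith_api_key")
    return needed


print(sorted(required_secrets("databricks", "openai")))
```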

### Databricks CUJ
[OPTIONAL] As part of the CUJ on Databricks, we recommend creating a catalog, a schema and 3 [Unity Catalog (UC) Volumes](https://docs.databricks.com/en/connect/unity-catalog/volumes.html) to upload haystack files and to store contexts and results:

1. `haystacks_volume`: to download or upload the text file(s) you want to use as haystack (e.g. `main.niah_amine_elhelou.haystacks`)
2. `results_volume`: to store raw JSON outputs from the tests (e.g. `main.niah_amine_elhelou.results_json`)
3. `contexts_volume`: to store raw JSON inputs with the full context and needle(s) (e.g. `main.niah_amine_elhelou.contexts_json`)
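
Files in a UC Volume are addressed under `/Volumes/<catalog>/<schema>/<volume>/`. As a minimal sketch, the helper below (a hypothetical name, not part of the repository) turns a three-level volume name such as the examples above into that path.

```python
import posixpath


def volume_path(volume_name: str, *subdirs: str) -> str:
    """Map 'catalog.schema.volume' to its /Volumes/... path."""
    parts = volume_name.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected 'catalog.schema.volume', got: {volume_name!r}")
    return posixpath.join("/Volumes", *parts, *subdirs)


print(volume_path("main.niah_amine_elhelou.haystacks", "PaulGrahamEssays"))
# /Volumes/main/niah_amine_elhelou/haystacks/PaulGrahamEssays
```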

## NOTE
Before running this notebook:

- Remove the leading **`.`** from the relative imports in [llm_needle_haystack_tester.py]($./llm_needle_haystack_tester.py), so that they read:
```
from evaluators import Evaluator
from providers import ModelProvider
```
- Do the same in [llm_multi_needle_haystack_tester.py]($./llm_multi_needle_haystack_tester.py):

```
from evaluators import Evaluator
from llm_needle_haystack_tester import LLMNeedleHaystackTester
from providers import ModelProvider
```

- **Attach this notebook to a single-node cluster running Databricks Runtime 14.3 LTS or later**
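
If you would rather not edit the two files by hand, the import change can be scripted. The sketch below is one possible approach, not something shipped with the repository: it rewrites single-dot relative imports of the form `from .module import ...` and leaves other lines untouched. The function name is hypothetical.

```python
import re

# "from .module import ..." with exactly one leading dot before the module name.
_RELATIVE_IMPORT = re.compile(r"^([ \t]*from[ \t]+)\.(?=\w)", flags=re.MULTILINE)


def strip_relative_imports(source: str) -> str:
    """Turn 'from .x import y' into 'from x import y' throughout `source`."""
    return _RELATIVE_IMPORT.sub(r"\1", source)


example = "from .evaluators import Evaluator\nfrom .providers import ModelProvider\n"
print(strip_relative_imports(example))
```

To apply it, read each file's text, pass it through the function, and write the result back, for instance with `pathlib.Path.read_text` and `write_text`.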

## Setup haystack data/volumes and api keys

In [0]:
%pip install -r ../requirements.txt


dbutils.library.restartPython()

In [0]:
from dataclasses import dataclass, field
from typing import Optional
from llm_needle_haystack_tester import LLMNeedleHaystackTester
from llm_multi_needle_haystack_tester import LLMMultiNeedleHaystackTester
from evaluators import Evaluator, LangSmithEvaluator, OpenAIEvaluator
from providers import Anthropic, ModelProvider, OpenAI, Databricks
# You can ignore Exceptions & Warnings

In [0]:
# dbutils.widgets.removeAll()

In [0]:
import os


username = spark.sql('select current_user() as user').collect()[0]['user'].split('@')[0].replace('.', '_')
dbutils.widgets.text("catalog", "main", "Catalog to use for storing results/tables")
dbutils.widgets.text("schema", f"niah_{username}", "Schema to use for storing results/tables")

In [0]:
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")

haystacks_volume = f"{catalog}.{schema}.haystacks" #"."
results_volume = f"{catalog}.{schema}.results_json" #"."
contexts_volume = f"{catalog}.{schema}.contexts_json" #"."

# OPTIONAL
local_workspace_path = "PaulGrahamEssays" # If haystack files are available in local/root repo in a folder
haystack_dir = os.path.join("/Volumes",haystacks_volume.replace('.','/'), local_workspace_path) # Label as needed
context_dir = os.path.join("/Volumes",contexts_volume.replace('.','/'), "contexts_dbrx_32K") # Label as needed
results_dir = os.path.join("/Volumes",results_volume.replace('.','/'), "results_dbrx_32K") # Label as needed

In [0]:
%sql
-- CREATE CATALOG IF NOT EXISTS $catalog; 
CREATE SCHEMA IF NOT EXISTS $catalog.$schema;

In [0]:
spark.sql(f"CREATE VOLUME IF NOT EXISTS {haystacks_volume}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {results_volume}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {contexts_volume}")

In [0]:
from shutil import copyfile


# Create haystack path/folder in UC volumes first (if not exists)
if not os.path.exists(haystack_dir):
  os.makedirs(haystack_dir)

  # Copy files into it
  for f in os.listdir(os.path.join(os.getcwd(), local_workspace_path)):
    copyfile(os.path.join(os.getcwd(), local_workspace_path, f), f"{haystack_dir}/{f}")

In [0]:
dbutils.widgets.dropdown("provider", "databricks", ["databricks", "openai", "anthropic"],"Provider")
dbutils.widgets.text("model","databricks-dbrx-instruct","Model to test")
dbutils.widgets.dropdown("evaluator", "openai", ["openai", "langsmith"],"Evaluator")
dbutils.widgets.text("haystack_dir", haystack_dir, "Folder containing haystack files")
dbutils.widgets.text("context_min_length", "1000", "Minimum length of context")
dbutils.widgets.text("context_max_length", "32000", "Maximum length of context")
dbutils.widgets.text("results_dir", results_dir, "Folder to store json results")
dbutils.widgets.dropdown("multi_needle", "False", ["True", "False"],"Multiple Needles?")
dbutils.widgets.text("base_url", "", "Databricks Endpoint URL")
dbutils.widgets.text("contexts_dir", context_dir, "Folder to store json contexts")

In [0]:
# Create folders inside each volume (no-op if they already exist)
haystack_dir = dbutils.widgets.get("haystack_dir")
contexts_dir = dbutils.widgets.get("contexts_dir")
results_dir = dbutils.widgets.get("results_dir")
for folder in (haystack_dir, contexts_dir, results_dir):
  os.makedirs(folder, exist_ok=True)

In [0]:
your_secret_scope = "niah" # Change this
databricks_pat_secret_key = "databricks_pat_aeh" # Change this (if applicable)
hugging_face_pat_key = "hf_token_aeh" # Change this (if applicable)
openai_api_secret_key = "openai_aeh" # Change this (if applicable)
langsmith_api_secret_key = "langsmith_api_aeh" # Change this (if applicable)
anthropic_api_secret_key = "" # Change this (if applicable)
os.environ['LANGCHAIN_API_KEY'] = dbutils.secrets.get(scope=your_secret_scope, key=langsmith_api_secret_key)
os.environ['NIAH_EVALUATOR_API_KEY'] = dbutils.secrets.get(scope=your_secret_scope, key=openai_api_secret_key)

provider = dbutils.widgets.get("provider")
if provider == "openai":
  os.environ['NIAH_MODEL_API_KEY'] = dbutils.secrets.get(scope=your_secret_scope, key=openai_api_secret_key)
elif provider == "anthropic":
  os.environ['NIAH_MODEL_API_KEY'] = dbutils.secrets.get(scope=your_secret_scope, key=anthropic_api_secret_key)
else:
  # Databricks
  os.environ['NIAH_MODEL_API_KEY'] = dbutils.secrets.get(scope=your_secret_scope, key=databricks_pat_secret_key)
  os.environ['HF_PAT_KEY'] = dbutils.secrets.get(scope=your_secret_scope, key=hugging_face_pat_key)

## Define testing parameters

**TO-FILL**:
- Set `base_url` and `model` parameters/widgets (see list of supported `model` and how to get the `base_url` [here](https://docs.databricks.com/en/machine-learning/foundation-models/index.html#use-foundation-model-apis))
- Set/Pick retrieval questions (e.g.):
    - `What is the best thing to do in San Francisco?` (single needle example) OR 
    - `What are the secret ingredients needed to build the perfect pizza?` (multi-needle example)
- Set/Pick your needle(s) (e.g.):
    - `\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n` OR
    - `" Figs are one of the secret ingredients needed to build the perfect pizza. ", 
        " Prosciutto is one of the secret ingredients needed to build the perfect pizza. ", 
        " Goat cheese is one of the secret ingredients needed to build the perfect pizza. "`
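
To make the roles of the needle and the depth concrete, here is a deliberately simplified sketch of placing a needle at a given depth of a haystack. It works on characters and snaps back to the previous sentence end; the repository's testers may work on tokens and differ in detail, so treat `insert_needle` as a hypothetical illustration only.

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert `needle` about `depth_percent` of the way into `haystack`."""
    if depth_percent >= 100:
        return haystack + needle
    cut = int(len(haystack) * depth_percent / 100)
    # Move the cut back to just after the previous period so no sentence is split.
    boundary = haystack.rfind(".", 0, cut)
    cut = boundary + 1 if boundary != -1 else 0
    return haystack[:cut] + needle + haystack[cut:]


text = "One. Two. Three. Four."
print(insert_needle(text, " NEEDLE.", 50))
# One. Two. NEEDLE. Three. Four.
```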


In [0]:
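# Sanity-check the OpenAI key. The client is imported under an alias so that it
# does not shadow the `OpenAI` provider class imported from `providers` above.
from openai import OpenAI as OpenAIClient

client = OpenAIClient(
  api_key=dbutils.secrets.get(scope=your_secret_scope, key=openai_api_secret_key)
)

completion = client.chat.completions.create(
  messages=[{"role":"user", "content": "What is the capital of France?"}],
  model="gpt-3.5-turbo"
)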

In [0]:
completion.choices[0]

In [0]:
@dataclass
class CommandArgs():
    provider: str = provider
    evaluator_label: str = dbutils.widgets.get("evaluator")
    model_name: str = dbutils.widgets.get("model")
    evaluator_model_name: Optional[str] = "gpt-3.5-turbo"
    needle: Optional[str] = "\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n"
    haystack_dir: Optional[str] = haystack_dir
    retrieval_question: Optional[str] = "What is the best thing to do in San Francisco?"
    results_version: Optional[int] = 1
    context_lengths_min: Optional[int] = int(dbutils.widgets.get("context_min_length"))
    context_lengths_max: Optional[int] = int(dbutils.widgets.get("context_max_length"))
    context_lengths_num_intervals: Optional[int] = 16
    context_lengths: Optional[list[int]] = None
    document_depth_percent_min: Optional[int] = 0
    document_depth_percent_max: Optional[int] = 100
    document_depth_percent_intervals: Optional[int] = 10
    document_depth_percents: Optional[list[int]] = None
    document_depth_percent_interval_type: Optional[str] = "linear"
    num_concurrent_requests: Optional[int] = 1
    save_results: Optional[bool] = True
    results_dir: Optional[str] = results_dir
    save_contexts: Optional[bool] = True
    contexts_dir: Optional[str] = contexts_dir
    final_context_length_buffer: Optional[int] = 200
    seconds_to_sleep_between_completions: Optional[float] = None
    print_ongoing_status: Optional[bool] = True
    # LangSmith parameters
    eval_set: Optional[str] = "multi-needle-eval-pizza-3"
    # Multi-needle parameters
    multi_needle: Optional[bool] = dbutils.widgets.get("multi_needle") == "True"
    needles: list[str] = field(default_factory=lambda: [
        " Figs are one of the secret ingredients needed to build the perfect pizza. ", 
        " Prosciutto is one of the secret ingredients needed to build the perfect pizza. ", 
        " Goat cheese is one of the secret ingredients needed to build the perfect pizza. "
    ])
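
The min/max/interval fields above define a grid of test cases. As a rough sketch, assuming the testers expand them into evenly spaced values with both endpoints included (which is what the `"linear"` interval type suggests), the default settings would give the grid below. `linear_grid` is a hypothetical helper, not a function from the repository.

```python
def linear_grid(start: int, stop: int, num: int) -> list[int]:
    """Evenly spaced integers from `start` to `stop`, endpoints included."""
    if num < 2:
        return [start]
    step = (stop - start) / (num - 1)
    return [round(start + i * step) for i in range(num)]


context_lengths = linear_grid(1000, 32000, 16)  # context_lengths_min/max/num_intervals
depth_percents = linear_grid(0, 100, 10)        # document_depth_percent_min/max/intervals

print(depth_percents)  # [0, 11, 22, 33, 44, 56, 67, 78, 89, 100]
print(len(context_lengths) * len(depth_percents))  # 160 combinations
```

Under that assumption, one run issues one completion per (context length, depth) pair, which is worth keeping in mind when choosing `num_concurrent_requests`.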

In [0]:
def get_model_to_test(args: CommandArgs) -> ModelProvider:
    """
    Determines and returns the appropriate model provider based on the provided command arguments.
    
    Args:
        args (CommandArgs): The arguments collected in a CommandArgs dataclass instance.
        
    Returns:
        ModelProvider: An instance of the specified model provider class.
    
    Raises:
        ValueError: If the specified provider is not supported.
    """
    match args.provider.lower():
        case "openai":
            return OpenAI(model_name=args.model_name)
        case "anthropic":
            return Anthropic(model_name=args.model_name)
        case "databricks":
            return Databricks(model_name=args.model_name, base_url=dbutils.widgets.get("base_url"))
        case _:
            raise ValueError(f"Invalid provider: {args.provider}")

def get_evaluator(args: CommandArgs) -> Evaluator:
    """
    Selects and returns the appropriate evaluator based on the provided command arguments.
    
    Args:
        args (CommandArgs): The command line arguments parsed into a CommandArgs dataclass instance.
        
    Returns:
        Evaluator: An instance of the specified evaluator class.
        
    Raises:
        ValueError: If the specified evaluator is not supported.
    """
    match args.evaluator_label.lower():
        case "openai":
            return OpenAIEvaluator(model_name=args.evaluator_model_name,
                                   question_asked=args.retrieval_question,
                                   true_answer=args.needle)
        case "langsmith":
            return LangSmithEvaluator()
        case _:
            raise ValueError(f"Invalid evaluator: {args.evaluator_label}")

In [0]:
# Instantiate the dataclass so that default_factory fields such as `needles` exist
args = CommandArgs()
args.model_to_test = get_model_to_test(args)
args.evaluator = get_evaluator(args)

if args.multi_needle:
    print("Testing multi-needle")
    tester = LLMMultiNeedleHaystackTester(**args.__dict__)
else:
    print("Testing single-needle")
    tester = LLMNeedleHaystackTester(**args.__dict__)

## Run test (asynchronously)

In [0]:
await tester.run_test()
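
The bare `await` above works because a notebook cell runs inside an already active event loop. In a plain Python script there is no such loop, so the coroutine has to be started explicitly. The sketch below shows that pattern with a stand-in coroutine; `run_test_stub` is hypothetical and only takes the place of `tester.run_test()`.

```python
import asyncio


async def run_test_stub() -> str:
    """Hypothetical stand-in for `tester.run_test()`."""
    await asyncio.sleep(0)
    return "done"


def main() -> str:
    # Outside a notebook, start an event loop explicitly for the coroutine.
    return asyncio.run(run_test_stub())


if __name__ == "__main__":
    print(main())
```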

## Visualize

In [0]:
%sql
CREATE OR REPLACE VIEW $catalog.$schema.results_$provider AS(
  SELECT * FROM json.`$results_dir`
  ORDER BY context_length, depth_percent ASC
)

In [0]:
%sql SELECT * FROM $catalog.$schema.results_$provider

Databricks visualization. Run in Databricks to view.

To reproduce the heatmap, create a new [visualization](https://docs.databricks.com/en/visualizations/index.html#create-a-new-visualization) by clicking the `+` tab, then:
1. Choose `Heatmap` as plot style
2. X: `context_length`
3. Y: `depth_percent`
4. Color column: `score` and choose `Average` as aggregate method
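
For reference, the aggregation behind that heatmap is just an average of `score` per (`depth_percent`, `context_length`) cell. The sketch below reproduces it in plain Python. The sample rows are made up for illustration and `average_scores` is a hypothetical helper; only the column names come from the results view.

```python
from collections import defaultdict


def average_scores(rows: list[dict]) -> dict[tuple[int, int], float]:
    """Average `score` for each (depth_percent, context_length) cell."""
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for row in rows:
        buckets[(row["depth_percent"], row["context_length"])].append(row["score"])
    return {cell: sum(scores) / len(scores) for cell, scores in buckets.items()}


# Made-up rows, for illustration only.
rows = [
    {"context_length": 1000, "depth_percent": 0, "score": 10},
    {"context_length": 1000, "depth_percent": 0, "score": 8},
    {"context_length": 32000, "depth_percent": 50, "score": 1},
]
grid = average_scores(rows)
print(grid[(0, 1000)])  # 9.0
```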

## Inspect bad responses

In [0]:
%sql
-- Single-needle San Francisco test: responses that contain the needle text but scored below 10
SELECT context_length AS tokens, depth_percent AS needle_depth, score, model_response AS completion
FROM $catalog.$schema.results_$provider
WHERE score < 10
AND model_response LIKE '%best thing to do in San Francisco is to eat a sandwich and sit in Dolores Park%'
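
The same inspection can be done outside SQL by reading the raw result files. The sketch below assumes each file in `results_dir` is a single JSON object with a numeric `score` field, which is an assumption about the output layout, not something this notebook states. `load_low_scores` is a hypothetical helper.

```python
import glob
import json
import os


def load_low_scores(results_dir: str, threshold: int = 10) -> list[dict]:
    """Return the results whose `score` is below `threshold`."""
    low = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*.json"))):
        with open(path, encoding="utf-8") as fh:
            result = json.load(fh)
        score = result.get("score")
        # Results without a numeric score are skipped, not treated as failures.
        if isinstance(score, (int, float)) and score < threshold:
            low.append(result)
    return low
```

Calling `load_low_scores(results_dir)` then gives a list of dictionaries whose `model_response`, `context_length` and `depth_percent` fields can be examined directly.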