# Evaluation of Spotlight

This notebook looks at how to evaluate Spotlight.

The first part of Spotlight summarises a document. The accuracy of this summary can be tested by using another LLM to compare it against a ground-truth summary.

This notebook will:
* build an evaluation pipeline that takes a summary and a ground-truth summary, then scores how "correct" the summary is.
* extract the summarisation pipeline from the app codebase and test its output with the evaluation pipeline.

**This example relies on the "``Guidance to civil servants on use of generative AI - GOV.UK.pdf``" file having been loaded into the app in "dev mode".**

### imports

In [None]:
import json
import os
import textwrap
from datetime import datetime

import dotenv
from langchain.chat_models import ChatAnthropic
from langchain.evaluation import load_evaluator
from langchain.evaluation.qa.eval_chain import QAEvalChain
from langchain.prompts import PromptTemplate
from langchain.schema import HumanMessage, SystemMessage

import redbox.llm.llm_base as llm_base
import redbox.llm.spotlight as spotlight
from redbox.llm.prompts.spotlight import SPOTLIGHT_SUMMARY_TASK_PROMPT
from redbox.models.file import File

In [None]:
os.chdir("..")
os.getcwd()

In [None]:
dotenv.load_dotenv(".env")
# Grab it as a dictionary too for convenience
ENV = dotenv.dotenv_values(".env")
model_params = {"max_tokens": 4096, "temperature": 0.2}
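The model setup below reads `ENV["ANTHROPIC_API_KEY"]`, which raises a `KeyError` if the key is absent from `.env`. A minimal sketch of an early check follows; `missing_keys` is a hypothetical helper, not part of the app codebase, and the dictionaries are stand-ins for `ENV`.

```python
def missing_keys(env, required=("ANTHROPIC_API_KEY",)):
    """Return the required keys that are absent or empty in the env mapping."""
    return [key for key in required if not env.get(key)]


# Stand-in dictionaries for illustration; in the notebook you would pass ENV.
print(missing_keys({"ANTHROPIC_API_KEY": "sk-example"}))
print(missing_keys({}))
```

An empty list means every required key is present.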

### model setup

In [None]:
llm = ChatAnthropic(
    anthropic_api_key=ENV["ANTHROPIC_API_KEY"],
    max_tokens=model_params["max_tokens"],
    temperature=model_params["temperature"],
    streaming=True,
)

llm_handler = llm_base.LLMHandler(llm=llm, user_uuid="dev")

## Spotlight summary
We need to get three things:
- the prompt used by the Spotlight summary
- a generated summary of a document
- a reference (ground-truth) summary of the same document

First we will make a function to generate the prompt, as it appears in the spotlight summary:

In [None]:
user_info = {
    "name": "",
    "email": "",
    "department": "Cabinet Office",
    "role": "Civil Servant",
    "preffered_language": "British English",
}

In [None]:
def generate_spotlight_summary_prompt(file, prompt_template, user_info=user_info):
    """Generates a spotlight summary prompt for the supplied file and template"""
    payload = f"<Doc{file.uuid}>Title: {file.name}\n\n{file.text}</Doc{file.uuid}>\n\n"

    messages_to_send = [
        SystemMessage(
            content=prompt_template.format(
                current_datetime=datetime.now().isoformat(),
                user_info=user_info,
            )
        ),
        HumanMessage(content=payload),
    ]
    return messages_to_send
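To see the shape of the document payload without loading a real file, here is a small sketch. `example_file` is a hypothetical stand-in that carries only the three attributes the payload uses, and `build_payload` simply repeats the f-string from the function above.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a redbox File: only the attributes the payload uses.
example_file = SimpleNamespace(uuid="1234", name="example.pdf", text="Body text.")


def build_payload(file):
    """Wrap a file's title and text in the <Doc{uuid}> tags used above."""
    return f"<Doc{file.uuid}>Title: {file.name}\n\n{file.text}</Doc{file.uuid}>\n\n"


payload = build_payload(example_file)
print(payload)
```

The UUID appears in both the opening and closing tag, so each document's boundaries stay unambiguous in the prompt.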

Next, we will get one of the documents and prepare it for the spotlight:

In [None]:
guidance_to_civil_servants_json_path = (
    "data/dev/file/Guidance to civil servants on use of generative AI - GOV.UK.pdf.json"
)

with open(guidance_to_civil_servants_json_path, "r", encoding="utf-8") as f:
    guidance_to_civil_servants = File(**json.load(f))

In [None]:
def spotlight_summary(file, llm=llm, user_info=user_info):
    """Takes the supplied file and generates a spotlight summary"""
    spotlight_model = spotlight.Spotlight(files=[file])

    summary_task = spotlight.SpotlightTask(
        id="summary",
        title="Summary",
        prompt_template=SPOTLIGHT_SUMMARY_TASK_PROMPT,
    )

    task_result = spotlight_model.run_task(
        task=summary_task,
        llm=llm,
        user_info=user_info,
    )
    return task_result

Now we can assemble the actual prompt, the generated summary and a reference (ground-truth) summary written by a human:

In [None]:
guidance_to_civil_servants_summary_prompt = generate_spotlight_summary_prompt(
    guidance_to_civil_servants, SPOTLIGHT_SUMMARY_TASK_PROMPT
)
guidance_to_civil_servants_summary = spotlight_summary(
    guidance_to_civil_servants
).content
guidance_to_civil_servants_reference_summary = """AI can be helpful and assist your work, and you are encouraged to explore this technology and consider how it could be used in your organisation. However, you must never put sensitive data into these tools and always be aware of how these systems can store, learn from and repeat what is given to them. The outputs from generative AI are susceptible to bias and misinformation and so should always be checked and cited."""

## Evaluation pipeline
This uses the "labelled criteria evaluator" from LangChain. It is reasonably simple to use, but we could always swap it for another evaluator or make our own.

This takes in:
- **input**: this is the question that was passed to the LLM being tested
- **submission**: this is the answer the LLM being tested gave
- **criteria**: we will use "correctness", but there are others you can choose, such as "conciseness"
- **reference**: this is the ground-truth answer to the "input" question, which the evaluator will compare the submission against
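As a rough illustration of how these four fields end up in a single evaluation prompt, the sketch below fills a shortened template with plain `str.format`. The template text and the criteria description are illustrative assumptions, not the exact strings LangChain uses. Note that the submission fills the `{output}` placeholder.

```python
# Shortened, illustrative template using the same four placeholders.
example_template = (
    "[Input]: {input}\n"
    "[Submission]: {output}\n"
    "[Criteria]: {criteria}\n"
    "[Reference]: {reference}"
)

filled = example_template.format(
    input="What is the Royal Society's motto?",
    output="Nullius in verba",
    # Assumed wording for the criterion, for illustration only.
    criteria="correctness: is the submission correct, accurate and factual?",
    reference="The Royal Society's motto is 'Nullius in verba'.",
)
print(filled)
```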

In [None]:
template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion \
to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only \
the single number from 0 to 9 (without quotes or punctuation) on its own line corresponding to score \
of whether the submission meets all criteria. At the end, repeat just the score again by itself on a new line."""

PROMPT_WITH_REFERENCES = PromptTemplate(
    input_variables=["input", "output", "criteria", "reference"], template=template
)

evaluator = load_evaluator(
    "labeled_criteria", criteria="correctness", llm=llm, prompt=PROMPT_WITH_REFERENCES
)

Here is a simple example of the evaluator being tested:

In [None]:
eval_result = evaluator.evaluate_strings(
    input=guidance_to_civil_servants_summary_prompt,
    prediction=guidance_to_civil_servants_summary,
    reference=guidance_to_civil_servants_reference_summary,
)
print(eval_result["reasoning"])
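The custom prompt asks for a number from 0 to 9 on its own line. Depending on the LangChain version, the criteria evaluator's built-in parser may expect a Y/N verdict instead, in which case `eval_result["score"]` might not hold that number. A minimal sketch for reading the score from the reasoning text directly follows. `extract_score` is a hypothetical helper, it assumes the model followed the instruction to end with the bare score, and the example string is made up.

```python
import re


def extract_score(reasoning_text):
    """Return the 0-9 score from the last non-empty line, or None if absent."""
    lines = [line.strip() for line in reasoning_text.splitlines() if line.strip()]
    if lines and re.fullmatch(r"[0-9]", lines[-1]):
        return int(lines[-1])
    return None


# Made-up evaluator output: reasoning, then the score, then the score repeated.
example_reasoning = "The summary covers the key points.\n7\n\n7\n"
print(extract_score(example_reasoning))  # prints 7
```

Returning `None` rather than raising keeps a malformed response from stopping a longer evaluation run.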

The cell below prints the prompt, the model output and the reference for comparison:

In [None]:
def wrap_print(text, width=150):
    """Print text wrapped to the given width, one line at a time"""
    for line in textwrap.wrap(str(text), width=width):
        print(line)


print("PROMPT: ", end="")
wrap_print(guidance_to_civil_servants_summary_prompt)
print("----------------------------------------")
print("MODEL_OUTPUT: ", end="")
wrap_print(guidance_to_civil_servants_summary)
print("----------------------------------------")
print("REFERENCE: ", end="")
wrap_print(guidance_to_civil_servants_reference_summary)

## Simpler examples

In [None]:
eval_result = evaluator.evaluate_strings(
    input="What is the Royal Society's motto?",
    prediction="Nullius in verba",
    reference="The Royal Society's motto is 'Nullius in verba': take nobody's word for it.",
)
print(f'With ground truth: {eval_result["score"]}')
print(eval_result["reasoning"])

In [None]:
eval_result = evaluator.evaluate_strings(
    input="What are the main benefits of Reproducible Analytical Pipelines (RAP)?",
    prediction="Reproducible Analytical Pipelines allow for the automation of data processing and analysis, and through encouraging sharing and reuse of code, makes development more efficient.",
    reference="RAP makes development more efficient.",
)
print(f'With ground truth: {eval_result["score"]}')
print(eval_result["reasoning"])