# Learning QA/Evaluation with Giskard

Giskard is an open-source framework for testing all ML models, from LLMs to tabular models.
In this tutorial we will use Giskard's LLM Scan to automatically detect issues in a Retrieval-Augmented Generation (RAG) task. We will test a model that answers questions about climate change, based on the [2023 Climate Change Synthesis Report](https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf) by the IPCC.

## Install dependencies

Make sure to install the `giskard[llm]` flavor of Giskard, which includes support for LLM models.

In [None]:
%pip install "giskard[llm]" --upgrade


We also install the project-specific dependencies for this tutorial.

In [None]:
%pip install langchain langchain-community langchain-openai pypdf faiss-cpu openai tiktoken


## Set up OpenAI

LLM scan requires an OpenAI API key. We set it here:

In [None]:
import os

# Set the OpenAI API Key environment variable.
os.environ["OPENAI_API_KEY"] = ""
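
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch (the helper name `ensure_openai_key` is hypothetical and not part of Giskard or the OpenAI client), the key can be read from the environment and requested interactively only when it is missing:

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_openai_key(
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the OpenAI key from `env`, prompting only if it is missing or empty."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in the cell output
        key = prompt("OpenAI API key: ").strip()
        if not key:
            raise ValueError("An OpenAI API key is required to run the LLM scan.")
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_key()` once at the top of the notebook would then take the place of the assignment above.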

## Model building

### Create a model with LangChain

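Now we create our model with LangChain. We first load the report, split it into chunks, and index the chunks in a FAISS vector store; the `RetrievalQA` chain itself is assembled further below: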

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the IPCC report and split it into overlapping chunks (done once and reused below)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
documents = loader.load_and_split(text_splitter)

# Prepare the vector store (FAISS) with the IPCC report chunks
db = FAISS.from_documents(documents, OpenAIEmbeddings())
# print(documents)
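
To make the `chunk_size=1000` and `chunk_overlap=100` settings more concrete, here is a simplified sketch of overlapping chunking. It is a plain fixed-size sliding window, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to cut on separators such as paragraph and line breaks and therefore produces chunks of at most `chunk_size` characters. The function name `split_with_overlap` is made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters, each starting
    `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A 2500-character dummy text stands in for the report
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
```

The overlap means that a sentence falling on a chunk boundary still appears in full in at least one chunk, at the cost of indexing some text twice.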





In [None]:
import pandas as pd

df = pd.DataFrame([d.page_content for d in documents], columns=["text"])
df.head(10)

|   | text |
|---|------|
| 0 | 35\nClimate Change 2023\nSynthesis Report\nIPC... |
| 1 | 37\nSection 1\nIntroduction |
| 2 | 38\nSection 1 \nSection 1\nThis Synthesis Repo... |
| 3 | interdependence of climate, ecosystems and bi... |
| 4 | change, and its impacts. It assesses the curr... |
| 5 | global greenhouse gas emission pathways, in th... |
| 6 | as monsoon regions, coastlines, mountain range... |
| 7 | italics: for example, very likely. This is con... |
| 8 | Policymakers (hereafter SPM), Technical Summar... |
| 9 | 39\nIntroduction \nSection 1\nFigure 1.1: The ... |


We can now create a Knowledge Base using the DataFrame we created before.

In [None]:
from giskard.rag import KnowledgeBase

knowledge_base = KnowledgeBase(df)

INFO:giskard.llm.embeddings:No embedding model set though giskard.llm.set_embedding_model. Defaulting to openai/text-embedding-3-small since OPENAI_API_KEY is set.


### Generate a test set

In [None]:
from giskard.rag import generate_testset

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    agent_description="A chatbot answering questions about climate change, based on the 2023 Climate Change Synthesis Report by the IPCC",
)

INFO:giskard.rag:Finding topics in the knowledge base.
INFO:giskard.rag:Found 20 topics in the knowledge base.



In [None]:
test_set_df = testset.to_pandas()

# Print the first three generated questions with their reference answer and context
for number, (_, row) in enumerate(test_set_df.head(3).iterrows(), start=1):
    print(f"Question {number}: {row['question']}")
    print(f"Reference answer: {row['reference_answer']}")
    print("Reference context:")
    print(row["reference_context"])
    print("******************", end="\n\n")

Question 1: How many modelled emissions pathways were assessed in WGIII for projected global warming over the 21st century?
Reference answer: 1202 pathways were categorised based on their projected global warming over the 21st century.
Reference context:
Document 140: impacts and risks. {WGI Box SPM.1} (Cross-Section Box.2 Figure 1)
In WGIII, a large number of global modelled emissions pathways were assessed, of which 1202 pathw ays were categorised based on their 
projected global warming over the 21st century, with categories ranging from pathways that limit w a r m i n g  t o  1 . 5 ° C  w i t h  m o r e  than 50% 
likelihood108 w i t h  n o  o r  l i m i t e d  overshoot (C1) to pathways that exceed 4°C (C8). Methods to project global warming associated with the 
modelled pathways were updated to ensure consistency with the AR6 WGI assessment of the climate system response109. {WGIII Box SPM.1,WGIII 
Table 3.1} (Table 3.1, Cross-Section Box.2 Figure 1)
102 In the literature, the te

Save the test set to `test-set.jsonl`:

In [None]:
testset.save("test-set.jsonl")
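
The saved file can be inspected without Giskard. The sketch below is a generic JSON Lines reader using only the standard library; it assumes the file holds one JSON object per line, as the `.jsonl` extension suggests, and makes no assumption about the field names stored in each record. Giskard's documentation also describes a `QATestset.load` method for reloading a saved test set, which is worth checking against the installed version.

```python
import json
from pathlib import Path


def read_jsonl(path: str) -> list[dict]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, `len(read_jsonl("test-set.jsonl"))` would give the number of saved questions.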

### Build a RAG pipeline
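
Conceptually, a retrieval QA chain does two things: it fetches the chunks most relevant to the question, then pastes them into a prompt template that is sent to the LLM. The sketch below illustrates that flow with a deliberately naive retriever that ranks chunks by shared words. The real pipeline uses OpenAI embeddings and FAISS similarity search instead, and the helper names and sample chunks here are invented for illustration.

```python
PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n\nYour answer:\n"


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by how many lowercase words they share with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Fill the template with the top-k chunks; this text is what the LLM would receive."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return PROMPT.format(context=context, question=question)


# Made-up sample chunks standing in for the indexed report
sample_chunks = [
    "Sea level rise is unavoidable for centuries to millennia.",
    "Section 1 Introduction",
    "Higher emissions lead to larger and faster sea level rise.",
]
print(build_prompt("Is sea level rise avoidable?", sample_chunks))
```

Because the model only sees the retrieved context, a poor retriever limits answer quality no matter how capable the LLM is.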

In [None]:
# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)

# Test that everything works
climate_qa_chain.invoke({"query": "Is sea level rise avoidable? When will it stop?"})

{'query': 'Is sea level rise avoidable? When will it stop?',
 'result': 'Sea level rise is unavoidable and will continue for millennia. The rate and amount of sea level rise depends on future emissions. It is projected that sea level will continue to rise by 2100 and beyond, with higher emissions leading to larger and faster sea level rise. However, the exact timing and amount of sea level rise cannot be predicted with certainty.'}

It’s working! The answer is consistent with what is stated in the report:

> Sea level rise is unavoidable for centuries to millennia due to continuing deep ocean warming and ice sheet melt, and sea levels will remain elevated for thousands of years
>
> (_2023 Climate Change Synthesis Report_, page 77)

## Detect vulnerabilities in your model

### Wrap model and dataset with Giskard

Before running the automatic LLM scan, we need to wrap our model into Giskard's `Model` object. We can also optionally create a small dataset of queries to test that the model wrapping worked.

In [None]:
def answer_fn(question, history=None):
    # `invoke` returns a dict with "query" and "result" keys; keep only the answer text
    return climate_qa_chain.invoke({"query": question})["result"]

In [None]:
import giskard
import pandas as pd


def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.invoke({"query": question})["result"] for question in df["question"]]


# Don’t forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)

INFO:giskard.models.automodel:Your 'prediction_function' is successfully wrapped by Giskard's 'PredictionFunctionModel' wrapper class.
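
The prediction function is expected to return one output per input row. Since the chain's `invoke` call returns a dictionary with `query` and `result` keys (as in the output shown earlier), a text-generation wrapper would normally hand back only the `result` string. The sketch below shows that contract with a stub in place of the real chain, so it runs without any API key. `StubChain` and `predict_texts` are hypothetical names.

```python
class StubChain:
    """Stand-in for the LangChain chain: echoes the query and returns a canned answer."""

    def invoke(self, inputs: dict) -> dict:
        return {"query": inputs["query"], "result": f"Answer to: {inputs['query']}"}


def predict_texts(chain, questions: list[str]) -> list[str]:
    """Return one plain-text answer per question, in the same order as the input."""
    return [chain.invoke({"query": question})["result"] for question in questions]


answers = predict_texts(StubChain(), ["What is RAG?", "Is sea level rising?"])
print(answers)
```

Testing the wrapper against a stub like this is a cheap way to catch shape errors before spending API calls on a scan.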


Let’s wrap the questions from the generated test set in a Giskard `Dataset`. To check that the model is correctly wrapped, uncomment the `predict` line below:

In [None]:
# Alternatively, a small hand-written set of examples can be used:
# examples = [
#     "According to the IPCC report, what are key risks in Europe?",
#     "Is sea level rise avoidable? When will it stop?",
# ]
giskard_dataset = giskard.Dataset(pd.DataFrame({"question": test_set_df["question"]}), target=None)

# Optional: test that the wrapped model works
# print(giskard_model.predict(giskard_dataset).prediction)
# print(giskard_dataset)

INFO:giskard.datasets.base:Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.


<giskard.datasets.base.Dataset object at 0x794aed951c90>


### Scan your model for vulnerabilities with Giskard

We can now run Giskard's `scan` to generate an automatic report about the model vulnerabilities. This will thoroughly test different classes of model vulnerabilities, such as harmfulness, hallucination, prompt injection, etc.

The scan uses a mixture of tests based on a predefined set of examples, heuristics, and LLM-based generation and evaluation.

Since running the whole scan can take a bit of time, let’s start by limiting the analysis to the hallucination category:

In [None]:
report = giskard.scan(giskard_model, giskard_dataset, only="hallucination")

INFO:giskard.scanner.logger:Running detectors: ['LLMImplausibleOutputDetector', 'LLMBasicSycophancyDetector']
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o; provider = openai


🔎 Running scan…
Estimated calls to your model: ~30
Estimated LLM calls for evaluation: 22

Running detector LLMImplausibleOutputDetector…



LLMImplausibleOutputDetector: 0 issue detected. (Took 0:00:26.324717)
Running detector LLMBasicSycophancyDetector…



LLMBasicSycophancyDetector: 0 issue detected. (Took 0:00:53.206910)
Scan completed: no issues found. (Took 0:01:19.543025)
LLM-assisted detectors have used the following resources:
OpenAI LLM calls for evaluation: 22 (12962 prompt tokens and 1126 sampled tokens)



In [None]:
display(report)

### Running the whole scan

We will now run the full scan, testing for all issue categories. Note: this can take up to 30 minutes, depending on the speed of the API.

Scan results are not deterministic: LLMs may give different answers to the same or similar questions, and some of the tests themselves are not deterministic.
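
One rough way to gauge this variability for a single question is to ask it several times and measure how often the most common answer comes back. The helper below is an illustrative sketch, not part of Giskard. It compares answers by exact string match, which is crude: two differently worded but equivalent answers count as a disagreement.

```python
from collections import Counter
from typing import Callable


def answer_agreement(ask: Callable[[str], str], question: str, runs: int = 5) -> float:
    """Ask the same question `runs` times and return the share of runs
    that produced the most common answer (1.0 means fully consistent)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    answers = Counter(ask(question).strip() for _ in range(runs))
    return answers.most_common(1)[0][1] / runs
```

With the chain from this tutorial, `ask` could be `lambda q: climate_qa_chain.invoke({"query": q})["result"]`. Keep in mind that every run is a billed API call.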

In [None]:
full_report = giskard.scan(giskard_model, giskard_dataset)

If you are running in a notebook, you can display the scan report directly using `display(...)`. Otherwise, you can export the report to an HTML file. Check the [API Reference](https://docs.giskard.ai/en/stable/reference/scan/report.html#giskard.scanner.report.ScanReport) for more details on the export methods available on the `ScanReport` class.

In [None]:
display(full_report)

# Save it to a file
full_report.to_html("scan_report.html")
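
When the same code has to run both in a notebook and as a script, the choice between inline display and HTML export can be automated. The sketch below is one possible approach under stated assumptions: `running_in_notebook` and `show_or_export` are hypothetical helpers, and the notebook check is a heuristic based on the IPython shell class name, which some front ends (for example hosted notebooks) may not match. Only the `to_html` method used above is called on the report.

```python
from typing import Optional


def running_in_notebook() -> bool:
    """Best-effort check for a Jupyter kernel; False if IPython is unavailable."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False
    shell = get_ipython()
    # Heuristic: classic Jupyter kernels use a shell class with this name
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


def show_or_export(report, path: str = "scan_report.html", in_notebook: Optional[bool] = None):
    """Return the report for inline display in a notebook; otherwise write HTML and return the path."""
    if in_notebook is None:
        in_notebook = running_in_notebook()
    if in_notebook:
        return report  # as the last expression of a cell, this renders inline
    report.to_html(path)
    return path
```

Passing `in_notebook` explicitly overrides the heuristic when it guesses wrong.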

## Generate comprehensive test suites automatically for your model

### Generate test suites from the scan

The objects produced by the scan can be used as fixtures to generate a test suite that integrates all detected vulnerabilities. Test suites allow you to evaluate and validate your model's performance, ensuring that it behaves as expected on a set of predefined test cases, and to identify any regressions or issues that might arise during development or updates.

In [None]:
test_suite = full_report.generate_test_suite(name="Test suite generated by scan")
test_suite.run()
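
To catch regressions automatically, the suite result can be turned into a process exit code, for instance in a CI job. The sketch below assumes that the object returned by `test_suite.run()` exposes a boolean `passed` attribute. That matches my understanding of Giskard's suite results, but it should be verified against the installed version. A `SimpleNamespace` stands in for the real result so the example runs on its own.

```python
from types import SimpleNamespace


def exit_code_for(result) -> int:
    """Map a suite result to a process exit code: 0 if it passed, 1 otherwise.

    A result without a `passed` attribute is treated as a failure."""
    return 0 if getattr(result, "passed", False) else 1


# Stand-in for the object returned by `test_suite.run()` (assumed shape)
fake_result = SimpleNamespace(passed=False)
print(exit_code_for(fake_result))
```

In a script, `sys.exit(exit_code_for(test_suite.run()))` would make the job fail whenever a test in the suite fails.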