# Catbot - Miao

This notebook:
* Creates a database using ChromaDB
* Creates a QA RAG to query the database
* Generates a synthetic dataset
* Evaluates the responses obtained from the RAG using Ragas metrics

In [None]:
# Database
from catbot.database.downloader import download_and_chunk_wikipedia_articles
from catbot.database.embedding import create_data_base

# RAG
from chromadb import PersistentClient
from catbot.database.embedding import embedding_function
from catbot.rag.basic_rag import BasicRAG

# Synthetic data
from ragas.testset import TestsetGenerator
from catbot.utils import get_evaluation_models

# Evaluation
from ragas.dataset_schema import SingleTurnSample 
from ragas.metrics import (
    AnswerCorrectness,
    AnswerSimilarity,
    Faithfulness,
    LLMContextRecall,
    ResponseRelevancy,
    SemanticSimilarity,
)

generator_llm, generator_embeddings = get_evaluation_models()


ImportError: cannot import name 'get_evaluation_models' from 'ragas.utils' (/Users/mariabader/Documents/catbot/.catbot/lib/python3.13/site-packages/ragas/utils.py)

## Create the database
Download the articles under 'Cat' and chunk them by section. Then embed the chunks and save them as a Chroma collection.

This only needs to run once.

In [None]:
articles = download_and_chunk_wikipedia_articles(['Cat'])
create_data_base(articles)
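
The chunking step can be pictured with a minimal sketch. The real `download_and_chunk_wikipedia_articles` is not shown in this notebook, so everything below is an assumption: `chunk_by_sections` is a hypothetical helper, and it assumes the article text marks sections with `== Heading ==` lines. It only illustrates the idea of one chunk per section, titled in the same "Article - Section" form as the context titles printed further down.

```python
import re


def chunk_by_sections(title: str, text: str) -> list[dict]:
    """Split article text into one chunk per '== Heading ==' section (hypothetical helper)."""
    chunks: list[dict] = []
    heading = "Introduction"  # text before the first heading
    body: list[str] = []

    def flush() -> None:
        # Store the section collected so far, skipping empty sections.
        content = "\n".join(body).strip()
        if content:
            chunks.append({"title": f"{title} - {heading}", "document": content})

    for line in text.splitlines():
        match = re.fullmatch(r"==+\s*(.+?)\s*==+", line.strip())
        if match:
            flush()
            heading = match.group(1)
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "Cats are small.\n== History ==\nDomesticated long ago.\n== Behaviour ==\nThey purr."
for chunk in chunk_by_sections("Cat", sample):
    print(chunk["title"])
```

Chunking by section keeps each embedded passage on a single topic, which tends to make the retrieved contexts easier to interpret during evaluation.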

## Set up the QA RAG
Load the locally stored database into a Chroma collection and pass it to the basic RAG. You can then query the RAG.

In [8]:
client = PersistentClient(path='catbot/database/chroma')
collection = client.get_collection("its_all_about_cats", embedding_function=embedding_function(model_name="text-embedding-3-small"))

rag = BasicRAG(collection)

question = "What is the history of cats?"
response = rag.respond(question)

print("Question: ", question)
print("Response: ", response['response'])

context_titles = [context['metadata']['title'] for context in response['sources']]
print("Contexts:")
for title in context_titles:
    print("->", title)

Question:  What is the history of cats?
Response:  Cats have been domesticated for nearly 10,000 years. The oldest evidence of cats kept as pets is from the Mediterranean island of Cyprus, around 7500 BC. In the past, mostly in Egypt, people kept cats because they hunted and ate mice and rats. Ancient Egyptians worshipped cats as gods and often mummified them so they could be with their owners "for all of eternity". Cats started becoming pets during the time of the ancient Egyptians. Today, many people keep cats as pets, and some domestic cats live without human care as feral or stray cats.
Contexts:
-> Cat - History
-> Cat - Introduction
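
For orientation, here is a minimal sketch of the retrieve-then-generate flow and of the response shape used throughout this notebook (`response['response']` and `response['sources']`, where each source has a `document` and `metadata`). `ToyRAG`, `retrieve` and `generate` are hypothetical stand-ins; the actual `BasicRAG` implementation is not shown here and may differ.

```python
from typing import Callable


class ToyRAG:
    """Hypothetical stand-in for a basic RAG: retrieve contexts, then generate an answer."""

    def __init__(self, retrieve: Callable[[str], list[dict]], generate: Callable[[str], str]):
        self.retrieve = retrieve  # question -> list of {"document": ..., "metadata": {...}}
        self.generate = generate  # prompt -> answer text

    def respond(self, question: str) -> dict:
        sources = self.retrieve(question)
        context = "\n\n".join(source["document"] for source in sources)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return {"response": self.generate(prompt), "sources": sources}


# Stub retriever and generator so the sketch runs without a database or an LLM.
def stub_retrieve(question: str) -> list[dict]:
    return [{"document": "Cats were kept in ancient Egypt.", "metadata": {"title": "Cat - History"}}]


def stub_generate(prompt: str) -> str:
    return "Cats were kept in ancient Egypt."


toy_response = ToyRAG(stub_retrieve, stub_generate).respond("What is the history of cats?")
print(toy_response["response"])
```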


## Generate Synthetic data

### Using a single call prompt

In [5]:
prompt = """
Given the following documentation page, generate one clear, specific question that can 
be answered using information found only on this page.

Guidelines:
- Keep the question focused and concise (no more than 25 words).
- Make it practical—ask about concepts, actions, or procedures as a real user would.
- Avoid asking for lists of steps or comprehensive summaries.
- Ask as if you needed a precise answer for a specific task, not a generic explanation.
- Avoid overly broad, complex, or multi-part questions.
- Do not ask for details not explicitly stated in the document.
- Don't use the question you find in the title, come up with something new
Example of a good question:
- How do I reconcile Mollie transactions with payments and invoices?

Documentation Page:
{context}

Generated question: 
"""
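
The prompt above is defined but not yet applied. One way it could be used, sketched under assumptions: fill `{context}` with each chunk and send the result to an LLM, one call per chunk. `generate_questions` and the `llm` callable are hypothetical, and `QUESTION_PROMPT` is a shortened stand-in for the full prompt so that the block is self-contained.

```python
from typing import Callable

# Shortened stand-in for the full prompt; only the {context} placeholder matters here.
QUESTION_PROMPT = "Documentation Page:\n{context}\n\nGenerated question: "


def generate_questions(
    chunks: list[str],
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns the completion text
    template: str = QUESTION_PROMPT,
) -> list[dict]:
    """Make one LLM call per chunk and keep the chunk as the reference context."""
    rows = []
    for chunk in chunks:
        question = llm(template.format(context=chunk)).strip()
        rows.append({"user_input": question, "reference_contexts": [chunk]})
    return rows


# Stub LLM so the sketch runs without an API key.
def stub_llm(prompt: str) -> str:
    return " When were cats first kept as pets? "


rows = generate_questions(["Cats have been domesticated for nearly 10,000 years."], stub_llm)
print(rows[0]["user_input"])
```

Keeping the source chunk next to each generated question makes it possible to check later whether the RAG retrieved the passage the question was written from.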

### Using Ragas

In [None]:
from ragas.testset.synthesizers import (
    # SingleHopSpecificQuerySynthesizer,
    # MultiHopAbstractQuerySynthesizer,
    # MultiHopSpecificQuerySynthesizer,
    default_query_distribution,
)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
query_distribution = default_query_distribution(llm=generator_llm)

# NOTE: `sections` is not defined in this notebook; it is assumed to be a list of LangChain documents.
# Only the third synthesizer of the default distribution is used here.
synthetic_ragas = generator.generate_with_langchain_docs(
    sections,
    testset_size=5,
    query_distribution=[query_distribution[2]],
)

In [None]:
# synthetic_ragas = generator.generate_with_langchain_docs(docs_ragas, testset_size=len(docs_ragas))
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
    default_query_distribution
)
query_distribution = default_query_distribution(llm=generator_llm)


# synthetic_ragas = generator.generate_with_langchain_docs(docs_ragas, testset_size=3, with_debugging_logs = True, query_distribution=[multihop_query_synthesizer],)
# synthetic_ragas = generator.generate_with_langchain_docs(sections, testset_size=2, with_debugging_logs = True, query_distribution=[query_distribution[0]],)
synthetic_ragas = generator.generate_with_langchain_docs(sections, testset_size=5, query_distribution=[query_distribution[2]],)

## Evaluation with Ragas metrics

Define the LLM and embedding model used for evaluation. Both are returned by `get_evaluation_models()` in the import cell at the top of the notebook.

  generator_embeddings = LangchainEmbeddingsWrapper(embeddings)


## Retrieval: Context Recall

In [36]:
sample = SingleTurnSample(
    user_input=question,
    reference = 'some text', #Todo
    retrieved_contexts=[context['document'] for context in response['sources']],
)

context_recall = LLMContextRecall(llm=generator_llm)
await context_recall.single_turn_ascore(sample)

0.0

## Generation: Faithfulness and Correctness

### Faithfulness
Evaluates whether the response is supported by the retrieved context. Requires the question, the response, and the retrieved contexts.

In [27]:
sample = SingleTurnSample(
    user_input=question,
    response=response['response'],
    retrieved_contexts=[context['document'] for context in response['sources']],
)
scorer = Faithfulness(llm=generator_llm)

print("Question: ", sample.user_input)
print("Response: ", sample.response)
print("Contexts: ", len(sample.retrieved_contexts), " contexts")
print("Faithfulness score: ", await scorer.single_turn_ascore(sample))


Question:  What is the history of cats?
Response:  Cats have been domesticated for nearly 10,000 years. The oldest evidence of cats kept as pets is from the Mediterranean island of Cyprus, around 7500 BC. In ancient times, especially in Egypt, people kept cats because they hunted and ate mice and rats. Ancient Egyptians worshipped cats as gods and often mummified them so they could be with their owners "for all of eternity." Cats started becoming pets during the time of the ancient Egyptians. Today, people often keep cats as pets, and some domestic cats live without care from people and are known as "feral cats" or "stray cats."
Contexts:  2  contexts
Faithfulness score:  1.0
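
As a rough mental model, faithfulness can be read as the share of claims in the response that the retrieved context supports. In Ragas the claims are extracted and judged by the LLM; the sketch below skips that part and only shows the final ratio. `faithfulness_score` and the hard-coded verdicts are hypothetical.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of response claims judged as supported by the retrieved context."""
    if not verdicts:
        return float("nan")  # no claims were extracted, so the score is undefined
    return sum(verdicts) / len(verdicts)


# Four claims, three of them supported by the context.
print(faithfulness_score([True, True, False, True]))
```

A score of 1.0, as in the output above, would mean that every extracted claim was judged as supported.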


### Answer Correctness
Evaluates the response against the ground truth. Requires the question, the response, the retrieved contexts, and the ground truth (reference).

In [None]:
# TODO
reference = 'What is the plural form of "felis" in Latin?'


sample = SingleTurnSample(
    user_input=question,
    response=response['response'],
    reference=reference,
    retrieved_contexts=[context['document'] for context in response['sources']], #todo - can this be commented out
)

answer_similarity = AnswerSimilarity(embeddings=generator_embeddings)
# weights = [factuality, semantic similarity]; [0, 1] scores on semantic similarity only
scorer = AnswerCorrectness(llm=generator_llm, answer_similarity=answer_similarity, weights=[0, 1])
await scorer.single_turn_ascore(sample)

0.2724298216509958
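
To see what the `weights` argument does, here is a simplified sketch of how answer correctness combines a factuality F1 score with semantic similarity as a weighted average. The helper names are hypothetical, the default weights of 0.75 and 0.25 are an assumption about the library's defaults, and the claim counts would come from an LLM judge in practice.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 over claims: true positives, false positives, false negatives."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(f1: float, similarity: float, weights=(0.75, 0.25)) -> float:
    """Weighted average of factuality (F1) and semantic similarity."""
    total = sum(weights)
    if total == 0:
        raise ValueError("At least one weight must be non-zero.")
    return (weights[0] * f1 + weights[1] * similarity) / total


# With weights (0, 1) the factuality part is ignored and the score equals the similarity.
print(answer_correctness(f1=0.9, similarity=0.27, weights=(0, 1)))
```

With `weights = [0,1]`, as in the cell above, the result therefore depends only on the embedding similarity between response and reference.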

Answer Correctness relies on the semantic similarity metric, which can also be called explicitly, as below.

In [33]:
sample = SingleTurnSample(
    response='what is the latin name for cat?',
    reference  = 'Translate "Felis catus"'
)

scorer = SemanticSimilarity(embeddings=generator_embeddings)
print("Semantic Similarity: ", await scorer.single_turn_ascore(sample))

Semantic Similarity:  0.5873162152223527


## End-to-End: Response Relevancy
Generates a set of questions from the response, embeds them, computes each one's cosine similarity to the user question, and averages the results.

In [None]:
sample = SingleTurnSample(
    user_input=question,
    response = response['response'],
    )

scorer = ResponseRelevancy(llm=generator_llm, embeddings=generator_embeddings)

print("Question: ", sample.user_input)
print("Response: ", sample.response)
print("Response Relevancy: ", await scorer.single_turn_ascore(sample))

np.float64(0.7658875076572474)
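
The averaging step of response relevancy can be sketched in plain Python. In Ragas the questions are generated by the LLM and embedded by the embedding model; here the vectors are tiny made-up examples, and `cosine` and `response_relevancy` are hypothetical helpers that show only the arithmetic.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def response_relevancy(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the user question and the generated questions."""
    return sum(cosine(question_vec, vec) for vec in generated_vecs) / len(generated_vecs)


# One generated question matches the user question exactly, the other is orthogonal to it.
print(response_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

A response that stays on topic tends to yield generated questions close to the original one, which pushes the average towards 1.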