# Lab-03: Tweaking the Knowledge Base chunking strategy and reviewing RAGAS results

### Context

In Lab-01 and Lab-02 you created an Amazon Bedrock Knowledge Base to power your RAG application. You then evaluated it with the RAGAS framework against different RAGAS metrics, powering your RAG with different large language models (LLMs).

In this notebook, we will work on the Amazon Bedrock Knowledge Base to observe how changing the document chunking strategy affects the RAGAS metrics.

### Chunking Introduction

Chunking is a critical step in building an effective knowledge base for RAG applications. The choice of chunking strategy can significantly improve or lower your RAG implementation's performance. Here's why:

1. Retrieval accuracy: Different chunking methods can lead to varying levels of precision in retrieving relevant information. An optimal chunking strategy ensures that the most pertinent information is captured in each chunk, improving the chances of retrieving the right context for a given query.

2. Context quality: The size and content of chunks directly affect the quality of context provided to the language model. Chunks that are too large may include irrelevant information, while chunks that are too small might miss important context. Finding the right balance is crucial for generating accurate and relevant responses.

3. Computational efficiency: Chunking strategies impact the number and size of vectors in your knowledge base. This, in turn, affects the computational resources required for embedding generation and similarity search. An efficient chunking strategy can lead to faster retrieval times and lower resource consumption.

4. Adaptability to content: Different types of documents (e.g., technical reports, narratives, or structured data) may benefit from different chunking approaches. The ability to tailor your chunking strategy to your specific content can significantly enhance your RAG system's performance.

<!-- ![retrieveapi.png](./images/retrieveAPI.png) -->
<img src="./assets/retrieveAPI.png" width=50% height=20% />

In the RAG workflow, chunking impacts the "Retrieve API" stage, as shown in the provided image. The process involves:

1. Generating query embeddings from the user input.
2. Retrieving similar documents (chunks) from the knowledge base.
3. Using these retrieved chunks as context for prompt augmentation.
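The three steps above can be sketched with a toy in-memory index. This is only an illustration of the flow, not how Knowledge Bases for Amazon Bedrock is implemented: the two-dimensional embeddings, the chunk texts, and the helper names (`cosine_similarity`, `retrieve`, `augment_prompt`) are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, indexed_chunks, top_k=2):
    """Return the top_k chunk texts ranked by similarity to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]

def augment_prompt(question, chunks):
    """Place the retrieved chunks ahead of the question as context."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical 2-dimensional embeddings, for illustration only.
indexed_chunks = [
    {"text": "Liquidity risk discussion", "embedding": [0.9, 0.1]},
    {"text": "Board member biographies", "embedding": [0.1, 0.9]},
    {"text": "Credit risk discussion", "embedding": [0.8, 0.3]},
]
query_embedding = [1.0, 0.0]  # stands in for the embedded user question

top_chunks = retrieve(query_embedding, indexed_chunks, top_k=2)
prompt = augment_prompt("What risks does the company face?", top_chunks)
print(prompt)
```

Because the chunk is the unit being embedded and ranked, changing how documents are split changes both what can be retrieved and what ends up in the prompt.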

It's important to note that there is no one-size-fits-all solution for chunking. The optimal strategy often depends on the nature of your documents, the specifics of your use case, and the characteristics of your target queries. This is why testing different chunking strategies and evaluating their impact on your RAG system's performance is crucial.

### Amazon Bedrock Knowledge Bases Chunking Strategies

Amazon Bedrock offers several chunking strategies to optimize your knowledge base for different types of content and use cases:

1. Standard Chunking:
   - Fixed-size chunking: Allows you to specify the number of tokens per chunk and an overlap percentage.
   - Default chunking: Splits content into approximately 300-token chunks, preserving sentence boundaries.
   - Pros:
       - Simple and straightforward to implement
       - Works well for uniform, well-structured documents
       - Overlap feature helps maintain context across chunk boundaries
   - Cons:
       - May split semantic units or important context
       - Less effective for documents with varying content density or structure
       - Fixed-size approach might not adapt well to diverse document types

2. Hierarchical Chunking:
   - Creates a two-level structure with parent and child chunks.
   - You can set maximum token sizes for both parent and child chunks, as well as overlap tokens.
   - Balances precision (small child chunks) with comprehensive context (larger parent chunks).
   - Pros:
       - Preserves both local and broader context
       - Allows for more nuanced retrieval (e.g., returning child chunks with parent context)
       - Can improve performance for documents with clear hierarchical structure
   - Cons:
       - More complex to set up and fine-tune
       - May introduce overhead in storage and retrieval processes
       - Might not be beneficial for flat or unstructured documents

3. Semantic Chunking:
   - Uses natural language processing to create meaningful chunks based on semantic content.
   - Configurable parameters include maximum tokens, buffer size, and breakpoint percentile threshold.
   - Aims to improve retrieval accuracy by focusing on semantic rather than just syntactic structure.
   - Pros:
       - Creates more meaningful and coherent chunks based on content
       - Can significantly improve retrieval relevance for complex documents
       - Adapts to varying content density within documents
   - Cons:
       - More computationally intensive during ingestion
       - May require more fine-tuning to achieve optimal results
       - Performance can vary depending on the effectiveness of the underlying NLP model

4. Advanced Parsing Options:
   - Utilizes foundation models (like Claude 3 Sonnet or Claude 3 Haiku) for parsing complex data such as tables and charts.
   - Allows customization of parsing prompts for specific use cases.
   - Pros:
       - Excellent for handling complex, structured data like tables and charts
       - Allows for customization to specific document types or domains
       - Can significantly improve accuracy for specialized content
   - Cons:
       - Requires more setup and potentially ongoing maintenance
       - May be overkill for simpler document types
       - Dependent on the capabilities of the chosen foundation model

5. Custom Transformation:
   - Enables the use of a Lambda function to implement custom chunking logic.
   - Useful for specific chunking requirements not natively supported by Amazon Bedrock.
   - Pros:
       - Offers maximum flexibility for unique document structures or use cases
       - Allows integration of domain-specific knowledge into the chunking process
       - Can be optimized for specific performance requirements
   - Cons:
       - Requires custom development and maintenance
       - May introduce complexity and potential points of failure
       - Can be challenging to scale or adapt to changing requirements

It's crucial to emphasize that the effectiveness of these chunking strategies can vary significantly depending on your specific use case, document types, and query patterns. Testing different strategies and carefully evaluating their impact on your RAGAS metrics is essential for optimizing your RAG implementation. This notebook will guide you through this process, helping you understand how different chunking approaches affect the performance of your knowledge base for the Octank financial reports dataset.
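To make the fixed-size option concrete, here is a minimal sketch of splitting a token list into windows with a percentage overlap. It assumes whitespace-separated words stand in for tokens and does not reproduce the tokenizer or sentence-boundary handling that Amazon Bedrock applies; the function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(tokens, max_tokens=300, overlap_percentage=20):
    """Split a token list into windows of max_tokens with a percentage overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # always >= 1 because overlap < max_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Whitespace words used as stand-in tokens (an assumption for illustration).
words = "Octank reported higher operating cash flow driven by net income growth".split()
for chunk in fixed_size_chunks(words, max_tokens=5, overlap_percentage=40):
    print(" ".join(chunk))
```

With `max_tokens=5` and a 40% overlap, each window repeats the last two words of the previous one, which is how overlap helps preserve context across chunk boundaries.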


### Notebook Walkthrough

In this notebook we will create several Knowledge Bases with different chunking strategies, all based on the same dataset as the previous labs.
Then, we will test each Knowledge Base with RAGAS, using `anthropic.claude-3-sonnet-20240229-v1:0` as the evaluator.

In the end, we will be able to observe how different chunking strategies impact the RAGAS metrics.

#### Evaluation
1. Utilize RAGAS for evaluation on:
    1. **Faithfulness:** This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.
    2. **Answer Relevancy:** This metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information; higher scores indicate better relevancy. The metric is computed using the question, the context and the answer. Note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.
    3. **Context Precision:** Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This metric is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
    4. **Context Recall:** Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
    5. **Context Entity Recall:** This metric measures the recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone. Simply put, it is the fraction of entities recalled from ground_truths. It is useful in fact-based use cases such as a tourism help desk or historical QA: where entities matter, we need contexts that cover them, and this metric evaluates the retrieval mechanism on exactly that.
    6. **Answer Semantic Similarity:** The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
    7. **Answer Correctness:** The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a 'threshold' value to round the resulting score to binary, if desired.
    8. **Aspect Critique:** This is designed to assess submissions based on predefined aspects such as harmlessness and correctness. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not. This evaluation is performed using the 'answer' as input.
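Two of the retrieval metrics above reduce to simple arithmetic once the evaluator has done its judging. The sketch below is a simplified reading of the descriptions above, not the RAGAS implementation: in RAGAS the entities and per-chunk relevance verdicts come from the evaluator LLM, whereas here they are passed in by hand, and the function names are hypothetical.

```python
def context_entity_recall(ground_truth_entities, context_entities):
    """Fraction of ground-truth entities that also appear in the retrieved context."""
    ground_truth_entities = set(ground_truth_entities)
    if not ground_truth_entities:
        return 0.0
    shared = ground_truth_entities & set(context_entities)
    return len(shared) / len(ground_truth_entities)

def context_precision(relevance_flags):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance_flags[k-1] is 1 if the chunk at rank k is relevant, else 0.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hand-written inputs, for illustration only.
print(context_entity_recall({"Octank", "2021", "$480 million"}, {"Octank", "2021", "2019"}))
print(context_precision([1, 1, 0]))  # relevant chunks ranked first
print(context_precision([0, 1, 1]))  # same chunks, ranked lower
```

The last two calls show why ranking matters for context precision: the same two relevant chunks score lower when an irrelevant chunk is ranked ahead of them.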
    

### USE CASE:

#### Dataset

In this example, you will use Octank's financial 10-K reports (a synthetically generated dataset) as a text corpus to perform Q&A on. This data will be ingested into the knowledge base.

### Python 3.10

⚠  For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠

### Setup

To run this notebook you need to install the dependencies: LangChain, RAGAS, and the updated boto3 and botocore wheels.


<!-- TODO: add screenshot and more detailed steps for the KB creation -->
## Creating Knowledge Bases with Different Chunking Strategies

In this section, we will create three different knowledge bases using the AWS console in Amazon Bedrock. Each knowledge base will use a different chunking strategy to process the same dataset.

### Prerequisites

- Access to the AWS console
- Octank financial reports dataset (synthetically generated). ADD DOWNLOAD LINK

### Steps

1. Set up S3 Bucket
   - Navigate to the AWS Console and select the Oregon (us-west-2) region
   - Go to the S3 service
   - Create a new S3 bucket to store our dataset
   - Upload the Octank financial reports to this bucket

2. Create Knowledge Base 1: Standard Chunking
   - Navigate to Amazon Bedrock in the AWS Console
   - Click on "Create knowledge base"
   - Select the S3 bucket created in step 1 as the data source
   - Choose "Standard Chunking" as the chunking strategy
   - Select Cohere as the embedding model
   - Complete the creation process and note down the Knowledge Base ID
   - Sync the data source to ingest the documents

3. Create Knowledge Base 2: Custom Transformation
   - Repeat the process in step 2, but select "Custom Transformation" as the chunking strategy
   - Note down the new Knowledge Base ID

4. Create Knowledge Base 3: Semantic Chunking
   - Repeat the process once more, this time selecting "Semantic Chunking" as the chunking strategy
   - Note down the third Knowledge Base ID
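If you prefer to script the setup rather than use the console, the chunking choice is expressed as a configuration dictionary. The sketch below assumes the request shape of the boto3 `bedrock-agent` `create_data_source` call (`vectorIngestionConfiguration`); check the current API reference before relying on the field names. The numeric values and the helper `vector_ingestion_configuration` are illustrative, not recommendations from this lab.

```python
# Illustrative values only; tune them for your own documents.
fixed_size_config = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 20},
}
hierarchical_config = {
    "chunkingStrategy": "HIERARCHICAL",
    "hierarchicalChunkingConfiguration": {
        # Parent level first, then child level.
        "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
        "overlapTokens": 60,
    },
}
semantic_config = {
    "chunkingStrategy": "SEMANTIC",
    "semanticChunkingConfiguration": {
        "maxTokens": 300,
        "bufferSize": 0,
        "breakpointPercentileThreshold": 95,
    },
}

def vector_ingestion_configuration(chunking_config):
    """Wrap a chunking configuration in the structure passed at data source creation."""
    return {"chunkingConfiguration": chunking_config}

print(vector_ingestion_configuration(semantic_config))
```

The resulting dictionary would be passed as the `vectorIngestionConfiguration` argument when creating the data source; the chunking strategy cannot be changed afterwards, which is why this lab creates one knowledge base per strategy.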

### Important Notes

- Ensure you're using the same embedding model (Cohere) for all three knowledge bases to maintain consistency in our comparison.
- The Knowledge Base IDs will be crucial for our subsequent notebook operations, so make sure to record them accurately.
- The syncing process may take some time depending on the size of your dataset. You can proceed with setting up the rest of the notebook while waiting for the sync to complete.

In the next section, we'll use these Knowledge Base IDs in our Python notebook to query and evaluate the performance of each chunking strategy using RAGAS metrics.


In [3]:
%pip install --upgrade pip
%pip install boto3 --force-reinstall --quiet
%pip install botocore --force-reinstall --quiet
%pip install "langchain>0.1" --force-reinstall --quiet
%pip install -Uq langchain-aws --force-reinstall
%pip install ragas==0.1.9 --force-reinstall --quiet

Note: you may need to restart the kernel to use updated packages.


#### Restart the kernel with the updated packages that are installed through the dependencies above

In [4]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

### Follow the steps below to set up necessary packages

1. Import the necessary libraries for creating the `bedrock-runtime` client for invoking foundation models and the `bedrock-agent-runtime` client for using the Retrieve API provided by Knowledge Bases for Amazon Bedrock.
2. Import LangChain for:
   1. Initializing the Bedrock model `anthropic.claude-3-haiku-20240307-v1:0` as our large language model to perform query completions using the RAG pattern.
   2. Initializing the Bedrock model `anthropic.claude-3-sonnet-20240229-v1:0` as our large language model to perform RAG evaluation.
   3. Initializing the LangChain retriever integrated with knowledge bases.
   4. Later in the notebook we will wrap the LLM and retriever with `RetrievalQA` for building our Q&A application.

In [5]:
# TODO: use langchain-aws
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_community.chat_models.bedrock import BedrockChat
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA

pp = pprint.PrettyPrinter(indent=2)

kb_ids = ["kb_id_1", "kb_id_2", "kb_id_3"] # Replace with your knowledge base ids here.

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

llm_for_text_generation = BedrockChat(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=bedrock_client)

llm_for_evaluation = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_client)

bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",client=bedrock_client)



### Retrieve API: Process flow 

Create an `AmazonKnowledgeBasesRetriever` object from LangChain. It calls the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.

In [None]:
retrivers = []
for kb_id in kb_ids:
    retrivers.append({
        "kb_id": kb_id,
        "retriever": AmazonKnowledgeBasesRetriever(knowledge_base_id=kb_id,
                                                   retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 5}})
    })


`score`: Each returned text chunk has an associated score that indicates how closely it matches the query.
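To inspect these fields directly, you can flatten a Retrieve API response into rows of score, source URI, and a text preview. The sketch below assumes the response shape of the `bedrock-agent-runtime` `retrieve` call with an S3 data source; the helper `summarize_retrieval` is hypothetical and the sample response is made up so the block runs without AWS access.

```python
def summarize_retrieval(response):
    """Flatten a Retrieve API response into (score, source URI, text preview) rows."""
    rows = []
    for result in response.get("retrievalResults", []):
        location = result.get("location", {})
        uri = location.get("s3Location", {}).get("uri")
        text = result.get("content", {}).get("text", "")
        rows.append((result.get("score"), uri, text[:80]))
    return rows

# Made-up response in the assumed shape; a real one would come from something like:
# bedrock_agent_client.retrieve(
#     knowledgeBaseId=kb_id,
#     retrievalQuery={"text": query},
#     retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
# )
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Example chunk text about liquidity risk."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/report.pdf"}},
            "score": 0.62,
        }
    ]
}

for score, uri, preview in summarize_retrieval(sample_response):
    print(score, uri, preview)
```

Comparing these rows across knowledge bases for the same query is a quick way to see how each chunking strategy changes the size and content of what is retrieved.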

### Model Invocation and Response Generation using RetrievalQA chain 

Invoke the model and visualize the response

Question = `Provide a list of few risks for Octank financial in numbered list without description.`

Ground truth answer = 
```
1. Commodity Prices
2. Foreign Exchange Rates 
3. Equity Prices
4. Credit Risk
5. Liquidity Risk
...
...
```

In [None]:
query = "Provide a list of few risks for Octank financial in numbered list without description."

for retriever in retrivers:
    qa_chain = RetrievalQA.from_chain_type(
    llm=llm_for_text_generation, retriever=retriever["retriever"], return_source_documents=True
    )
    response = qa_chain.invoke(query)
    print("This is the result of the following KB: " + retriever["kb_id"])
    print(response["result"])

## Preparing the Evaluation Data

As RAGAS aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You will need to prepare `question` and `ground_truths` pairs from which you can prepare the remaining information through inference as shown below. If you are not interested in the `context_recall` metric, you don’t need to provide the `ground_truths` information. In this case, all you need to prepare are the `questions`.

In [None]:
from datasets import Dataset

questions = [
    "What was the primary reason for the increase in net cash provided by operating activities for Octank Financial in 2021?",
    "In which year did Octank Financial have the highest net cash used in investing activities, and what was the primary reason for this?",
    "What was the primary source of cash inflows from financing activities for Octank Financial in 2021?",
    "Calculate the year-over-year percentage change in cash and cash equivalents for Octank Financial from 2020 to 2021.",
    "Based on the information provided, what can you infer about Octank Financial's overall financial health and growth prospects?"
]
ground_truth = [
    "The increase in net cash provided by operating activities was primarily due to an increase in net income and favorable changes in operating assets and liabilities.",
    "Octank Financial had the highest net cash used in investing activities in 2021, at $360 million, compared to $290 million in 2020 and $240 million in 2019. The primary reason for this was an increase in purchases of property, plant, and equipment and marketable securities.",
    "The primary source of cash inflows from financing activities for Octank Financial in 2021 was an increase in proceeds from the issuance of common stock and long-term debt.",
    "To calculate the year-over-year percentage change in cash and cash equivalents from 2020 to 2021: \
    2020 cash and cash equivalents: $350 million \
    2021 cash and cash equivalents: $480 million \
    Percentage change = (2021 value - 2020 value) / 2020 value * 100 \
    = ($480 million - $350 million) / $350 million * 100 \
    = 37.14% increase",
    "Based on the information provided, Octank Financial appears to be in a healthy financial position and has good growth prospects. The company has consistently increased its net cash provided by operating activities, indicating strong profitability and efficient management of working capital. Additionally, Octank Financial has been investing in long-term assets, such as property, plant, and equipment, and marketable securities, which suggests plans for future growth and expansion. The company has also been able to finance its growth through the issuance of common stock and long-term debt, indicating confidence from investors and lenders. Overall, Octank Financial's steady increase in cash and cash equivalents over the past three years provides a strong foundation for future growth and investment opportunities."
]

data = {}
for retriever in retrivers:
    answers = []
    contexts = []
    kb_id = retriever["kb_id"]
    qa_chain = RetrievalQA.from_chain_type(
    llm=llm_for_text_generation, retriever=retriever["retriever"], return_source_documents=True
    )
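    for query in questions:
        answers.append(qa_chain.invoke(query)["result"])
        # `retriever` is a dict here; the LangChain retriever object is stored under "retriever"
        contexts.append([doc.page_content for doc in retriever["retriever"].invoke(query)])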
    # To dict
    data[kb_id] = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truth
    }


datasets = {}
for kb_id in kb_ids:
    datasets[kb_id] = Dataset.from_dict(data[kb_id])

## Evaluating the RAG application
First, import all the metrics you want to use from `ragas.metrics`. Then, you can use the `evaluate()` function and simply pass in the relevant metrics and the prepared dataset.

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)

from ragas.metrics.critique import (
harmfulness, 
maliciousness, 
coherence, 
correctness, 
conciseness
)

#specify the metrics here
metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_entity_recall,
        answer_similarity,
        answer_correctness,
        harmfulness, 
        maliciousness, 
        coherence, 
        correctness, 
        conciseness
    ]

dfs = {}
for kb_id in kb_ids:
    result = evaluate(
        dataset = datasets[kb_id], 
        metrics=metrics,
        llm=llm_for_evaluation,
        embeddings=bedrock_embeddings,
    )
    dfs[kb_id] = result.to_pandas()

Below, you can see the resulting RAGAS scores for the examples:

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 800
for kb_id in kb_ids:
    print(dfs[kb_id])

> Note: The scores above give a relative idea of the performance of your RAG application; use them with caution and not as standalone scores. Also note that we used only 5 question/answer pairs for evaluation. As a best practice, use enough data to cover different aspects of your documents when evaluating the model.

Based on the scores, you can review other components of your RAG workflow to further optimize them. A few recommended options are to review your chunking strategy, refine your prompt instructions, and increase `numberOfResults` to provide additional context.
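Per-question tables are hard to compare across knowledge bases, so one option is to average each metric per knowledge base. The sketch below is a plain-Python helper under the assumption that each result has been converted to a list of row dictionaries; `mean_scores` is a hypothetical name and the scores in the example are placeholders, not results from this lab.

```python
from statistics import mean

def mean_scores(records, metric_names):
    """Average each metric over the evaluated questions, skipping missing values."""
    summary = {}
    for name in metric_names:
        values = [
            row[name]
            for row in records
            if row.get(name) is not None and row[name] == row[name]  # drops None and NaN
        ]
        summary[name] = mean(values) if values else None
    return summary

# Placeholder rows standing in for one knowledge base's evaluation output.
example_records = [
    {"question": "q1", "faithfulness": 0.5, "context_recall": 1.0},
    {"question": "q2", "faithfulness": 1.0, "context_recall": float("nan")},
]
print(mean_scores(example_records, ["faithfulness", "context_recall"]))
```

In this notebook the rows could be obtained with `dfs[kb_id].to_dict("records")`, calling `mean_scores` once per knowledge base and comparing the resulting dictionaries side by side.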

<div class="alert alert-block alert-warning">
<b>Note:</b> Remember to delete KB, OSS index and related IAM roles and policies to avoid incurring any charges.
</div>