# Lab-03: Tweaking the Knowledge Base chunking strategy and reviewing results with RAGAS

### Context

In Lab-01 and Lab-02 you created an Amazon Bedrock Knowledge Base to power your RAG application. Then you evaluated it with the RAGAS framework, comparing RAGAS metrics while powering your RAG with different Large Language Models (LLMs).

In this notebook, we will work on the Amazon Bedrock Knowledge Base to observe how to improve the RAGAS metrics by changing the document chunking strategies.

### Chunking Introduction

Chunking is a critical step in building an effective knowledge base for RAG applications. The choice of chunking strategy can significantly improve or lower your RAG implementation's performance. Here's why:

1. Retrieval accuracy: Different chunking methods can lead to varying levels of precision in retrieving relevant information. An optimal chunking strategy ensures that the most pertinent information is captured in each chunk, improving the chances of retrieving the right context for a given query.

2. Context quality: The size and content of chunks directly affect the quality of context provided to the language model. Chunks that are too large may include irrelevant information, while chunks that are too small might miss important context. Finding the right balance is crucial for generating accurate and relevant responses.

3. Computational efficiency: Chunking strategies impact the number and size of vectors in your knowledge base. This, in turn, affects the computational resources required for embedding generation and similarity search. An efficient chunking strategy can lead to faster retrieval times and lower resource consumption.

4. Adaptability to content: Different types of documents (e.g., technical reports, narratives, or structured data) may benefit from different chunking approaches. The ability to tailor your chunking strategy to your specific content can significantly enhance your RAG system's performance.
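The trade-off between chunk size and overlap described above can be made concrete with a toy example. The sketch below is only an illustration, not how Amazon Bedrock chunks documents: the `chunk_tokens` helper is hypothetical, and whitespace-separated words stand in for model tokens.

```python
def chunk_tokens(tokens, max_tokens, overlap_tokens):
    """Split a token list into fixed-size windows that share `overlap_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace-separated words stand in for model tokens in this toy example.
words = "net cash provided by operating activities increased due to higher net income".split()
small = chunk_tokens(words, max_tokens=4, overlap_tokens=1)
large = chunk_tokens(words, max_tokens=8, overlap_tokens=2)
print(len(small), "small chunks;", len(large), "large chunks")
```

Smaller windows produce more vectors to embed and search, while larger windows carry more surrounding text per retrieved chunk. The overlap repeats the tail of one chunk at the head of the next so that a sentence cut at a boundary is still seen in context.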

<!-- ![retrieveapi.png](./images/retrieveAPI.png) -->
<img src="./assets/retrieveAPI.png" width=50% height=20% />

In the RAG workflow, chunking impacts the "Retrieve API" stage, as shown in the provided image. The process involves:

1. Generating query embeddings from the user input.
2. Retrieving similar documents (chunks) from the knowledge base.
3. Using these retrieved chunks as context for prompt augmentation.
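The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the bag-of-words `embed` function is a stand-in for a real embedding model, the example chunks are made-up sentences, and `retrieve` and `build_prompt` are hypothetical helpers rather than part of any Bedrock or LangChain API.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Steps 1 and 2: embed the query, then keep the `top_k` most similar chunks."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(query, context_chunks):
    """Step 3: place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}"


# Made-up chunks for illustration only.
chunks = [
    "net cash provided by operating activities increased",
    "the company issued common stock and long-term debt",
    "purchases of property plant and equipment increased",
]
query = "what happened to operating activities cash"
top = retrieve(query, chunks, top_k=1)
print(build_prompt(query, top))
```

Because only the retrieved chunks reach the prompt, how the document was split directly determines what the language model gets to see.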

It's important to note that there is no one-size-fits-all solution for chunking. The optimal strategy often depends on the nature of your documents, the specifics of your use case, and the characteristics of your target queries. This is why testing different chunking strategies and evaluating their impact on your RAG system's performance is crucial.

### Amazon Bedrock Knowledge Bases Chunking Strategies

Amazon Bedrock offers several chunking strategies to optimize your knowledge base for different types of content and use cases:

1. Standard Chunking:
   - Fixed-size chunking: Allows you to specify the number of tokens per chunk and an overlap percentage.
   - Default chunking: Splits content into approximately 300-token chunks, preserving sentence boundaries.
   - Pros:
       - Simple and straightforward to implement
       - Works well for uniform, well-structured documents
       - Overlap feature helps maintain context across chunk boundaries
   - Cons:
       - May split semantic units or important context
       - Less effective for documents with varying content density or structure
       - Fixed-size approach might not adapt well to diverse document types

2. Hierarchical Chunking:
   - Creates a two-level structure with parent and child chunks.
   - You can set maximum token sizes for both parent and child chunks, as well as overlap tokens.
   - Balances precision (small child chunks) with comprehensive context (larger parent chunks).
   - Pros:
       - Preserves both local and broader context
       - Allows for more nuanced retrieval (e.g., returning child chunks with parent context)
       - Can improve performance for documents with clear hierarchical structure
   - Cons:
       - More complex to set up and fine-tune
       - May introduce overhead in storage and retrieval processes
       - Might not be beneficial for flat or unstructured documents

3. Semantic Chunking:
   - Uses natural language processing to create meaningful chunks based on semantic content.
   - Configurable parameters include maximum tokens, buffer size, and breakpoint percentile threshold.
   - Aims to improve retrieval accuracy by focusing on semantic rather than just syntactic structure.
   - Pros:
       - Creates more meaningful and coherent chunks based on content
       - Can significantly improve retrieval relevance for complex documents
       - Adapts to varying content density within documents
   - Cons:
       - More computationally intensive during ingestion
       - May require more fine-tuning to achieve optimal results
       - Performance can vary depending on the effectiveness of the underlying NLP model

4. Advanced Parsing Options:
   - Utilizes foundation models (like Claude 3 Sonnet or Claude 3 Haiku) for parsing complex data such as tables and charts.
   - Allows customization of parsing prompts for specific use cases.
   - Pros:
       - Excellent for handling complex, structured data like tables and charts
       - Allows for customization to specific document types or domains
       - Can significantly improve accuracy for specialized content
   - Cons:
       - Requires more setup and potentially ongoing maintenance
       - May be overkill for simpler document types
       - Dependent on the capabilities of the chosen foundation model

5. Custom Transformation:
   - Enables the use of a Lambda function to implement custom chunking logic.
   - Useful for specific chunking requirements not natively supported by Amazon Bedrock.
   - Pros:
       - Offers maximum flexibility for unique document structures or use cases
       - Allows integration of domain-specific knowledge into the chunking process
       - Can be optimized for specific performance requirements
   - Cons:
       - Requires custom development and maintenance
       - May introduce complexity and potential points of failure
       - Can be challenging to scale or adapt to changing requirements

It's crucial to emphasize that the effectiveness of these chunking strategies can vary significantly depending on your specific use case, document types, and query patterns. Testing different strategies and carefully evaluating their impact on your RAGAS metrics is essential for optimizing your RAG implementation. This notebook will guide you through this process, helping you understand how different chunking approaches affect the performance of your knowledge base for the Octank financial reports dataset.
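To build intuition for the semantic chunking parameters mentioned above (buffer size and breakpoint percentile threshold), here is a rough sketch of one way such a splitter can work. It is not Amazon Bedrock's implementation: the bag-of-words `embed` function, the nearest-rank `percentile` helper, and the example sentences are all assumptions made for illustration, and the maximum-token limit is left out for brevity.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def semantic_chunks(sentences, buffer_size=1, breakpoint_percentile=55):
    """Start a new chunk wherever adjacent sentences are unusually dissimilar."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # Each sentence is embedded together with `buffer_size` neighbours on each side.
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(w) for w in windows]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # dissimilar enough: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Made-up sentences covering two topics.
sentences = [
    "cash from operations rose",
    "cash from operations was strong",
    "the board approved a dividend",
    "the board approved a buyback",
]
for chunk in semantic_chunks(sentences, buffer_size=0, breakpoint_percentile=55):
    print(chunk)
```

In this sketch, raising the percentile threshold makes breakpoints rarer and chunks longer, while a larger buffer smooths the distances by mixing neighbouring sentences into each embedding.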


### Notebook Walkthrough

In this notebook we will create several Knowledge Bases with different chunking strategies from the same dataset used in the previous labs.
Then, we will test each Knowledge Base with RAGAS, using `anthropic.claude-3-sonnet-20240229-v1:0` as the evaluator.

In the end, we will be able to observe that different chunking strategies impact the RAGAS metrics.

### Python 3.10

⚠  This lab must be run on a Python 3.10 runtime. ⚠

### Setup

To run this notebook, you need to install the dependencies: LangChain, RAGAS, and up-to-date boto3 and botocore packages.

In [None]:
%pip install --upgrade pip
%pip install boto3 --quiet
%pip install botocore --quiet
%pip install "langchain>0.1" --quiet
%pip install ragas==0.1.9 --quiet
%pip install opensearch-py --quiet
%pip install retrying==1.3.4 --quiet
%pip install langchain-aws --quiet

#### Restart the kernel with the updated packages that are installed through the dependencies above

In [4]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

## Creating Knowledge Bases with Different Chunking Strategies

In this section, we will create two different knowledge bases and reuse one of the knowledge bases created in Lab 1. Each knowledge base will use a different chunking strategy to process the same dataset and we will compare their performance using Ragas. 


### Steps

1. Create Knowledge Base 1: Hierarchical Transformation
   - Set maximum token size of parent to 1024 (max size of cohere embedding model)
   - Set maximum token size of child to 300
   - Set overlap tokens to 60

2. Create Knowledge Base 2: Semantic Transformation
   - Set breakpoint percentile threshold to 55
   - Set buffer size to 1
   - Set maximum tokens to 300

3. Reuse Knowledge Base from Lab 1: Fixed Chunking
   - Rerun the knowledge base creation function to reuse the already created knowledge base

### Important Notes

- Ensure you're using the same embedding model (Cohere) for all three knowledge bases to maintain consistency in our comparison.
- The Knowledge Base IDs will be crucial for our subsequent notebook operations, so make sure to record them accurately.
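For comparison with the hierarchical and semantic settings listed in the steps, a fixed-size chunking configuration uses the same request shape. The sketch below is an assumption-laden example: the `maxTokens` and `overlapPercentage` values are illustrative and are not necessarily the ones used for the Lab 1 knowledge base.

```python
# Fixed-size chunking configuration in the same shape as the other strategies.
# Both numeric values are illustrative assumptions, not taken from Lab 1.
fixedChunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 300,         # upper bound on tokens per chunk
        "overlapPercentage": 20,  # share of each chunk repeated in the next one
    },
}

print(fixedChunkingStrategyConfiguration["chunkingStrategy"])
```

Unlike the hierarchical configuration, which expresses overlap as a token count, the fixed-size configuration expresses it as a percentage of the chunk size.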

### First, let's double-check that our data is uploaded to S3

In [None]:
import sagemaker
import boto3
import pprint
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)
# Specify your bucket to be the default sagemaker bucket
sess = sagemaker.Session()
bucket = sess.default_bucket() #sagemaker-abcdef
filename = 'octank_financial_10K.pdf'

# Create an S3 client
s3 = boto3.client('s3')

# Upload the file
s3.upload_file(filename, bucket, filename)
pp.pprint(f"Upload Successful: {filename} uploaded to {bucket}/{filename}")

### Create hierarchical and semantic knowledge bases

In [None]:
import json
import boto3
from utility import interactive_sleep, create_knowledge_base, create_ds


boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

body_json = {
       "settings": {
          "index.knn": "true",
           "number_of_shards": 1,
           "knn.algo_param.ef_search": 512,
           "number_of_replicas": 0,
       },
       "mappings": {
          "properties": {
             "vector": {
                "type": "knn_vector",
                "dimension": 1024,
                 "method": {
                     "name": "hnsw",
                     "engine": "faiss",
                     "space_type": "l2"
                 },
             },
             "text": {
                "type": "text"
             },
             "text-metadata": {
                "type": "text"
             }
          }
       }
    }

## create new hierarchical knowledge base
cohere_embed_hierarchical_knowledge_base = create_knowledge_base(
    index_name="cohere-embed-english-v3", 
    body_json=body_json, 
    collection_name="cohere-embed-english-v3", 
    knowledge_base_name="cohere-embed-hierarchical-english-v3",
    vector_store_name="cohere-embed-english-v3",
    access_policy_name="cohere-embed-access-policy",
    embedding_model_arn=f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-english-v3"
    )

## create new semantic knowledge base
cohere_embed_semantic_knowledge_base = create_knowledge_base(
    index_name="cohere-embed-english-v3", 
    body_json=body_json, 
    collection_name="cohere-embed-english-v3", 
    knowledge_base_name="cohere-embed-semantic-english-v3",
    vector_store_name="cohere-embed-english-v3",
    access_policy_name="cohere-embed-access-policy",
    embedding_model_arn=f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-english-v3"
    )

## reuse existing fixed embedding knowledge base. 
## Don't worry, this will not create a new one and will just return the old knowledge base created in Lab 1
cohere_embed_fixed_knowledge_base = create_knowledge_base(
    index_name="cohere-embed-english-v3", 
    body_json=body_json, 
    collection_name="cohere-embed-english-v3", 
    knowledge_base_name="cohere-embed-english-v3",
    vector_store_name="cohere-embed-english-v3",
    access_policy_name="cohere-embed-access-policy",
    embedding_model_arn=f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-english-v3"
    )

pp.pprint(f"Cohere Embed Hierarchical Knowledge Base Id: {cohere_embed_hierarchical_knowledge_base['knowledgeBaseId']}")
pp.pprint(f"Cohere Embed Semantic Knowledge Base Id: {cohere_embed_semantic_knowledge_base['knowledgeBaseId']}")
pp.pprint(f"Cohere Embed Fixed Knowledge Base Id: {cohere_embed_fixed_knowledge_base['knowledgeBaseId']}")

### Create Data Sources and Sync

Steps:

* Define the chunking strategy. The knowledge base uses it to split the documents into chunks no larger than the sizes set in the chunking strategy configuration.
* Initialize the S3 configuration used to create the data source.

In [None]:
hierarchicalChunkingStrategyConfiguration = {
    "chunkingStrategy": "HIERARCHICAL", 
    "hierarchicalChunkingConfiguration": {
        "levelConfigurations": [{"maxTokens": 1024}, {"maxTokens": 300}],
        "overlapTokens": 60
    }
}

semanticChunkingStrategyConfiguration = {
    "chunkingStrategy": "SEMANTIC", 
    "semanticChunkingConfiguration": {
        "breakpointPercentileThreshold": 55,
        "bufferSize": 1,
        "maxTokens": 300
    }
}

s3DataSourceConfiguration = {
    "type": "S3",
    "s3Configuration": {
        "bucketArn": "",
        "inclusionPrefixes":["octank_financial_10K.pdf"] 
    }
}

data_sources=[
                {"type": "S3", "bucket_name": bucket} 
            ]
cohere_embed_hierarchical_data_source = create_ds(data_sources, hierarchicalChunkingStrategyConfiguration, s3DataSourceConfiguration, cohere_embed_hierarchical_knowledge_base['knowledgeBaseId'])
cohere_embed_semantic_data_source = create_ds(data_sources, semanticChunkingStrategyConfiguration, s3DataSourceConfiguration, cohere_embed_semantic_knowledge_base['knowledgeBaseId'])

### Follow the steps below to set up necessary packages

1. Import the libraries needed to create a `bedrock-runtime` client for invoking foundation models and a `bedrock-agent-runtime` client for calling the Retrieve API provided by Knowledge Bases for Amazon Bedrock.
2. Import LangChain to:
   1. Initialize the Bedrock model `anthropic.claude-3-haiku-20240307-v1:0` as the large language model that performs query completions using the RAG pattern.
   2. Initialize the Bedrock model `anthropic.claude-3-sonnet-20240229-v1:0` as the large language model that performs RAG evaluation.
   3. Initialize the LangChain retriever integrated with knowledge bases.
   4. Wrap the LLM and retriever with `RetrievalQA` later in the notebook to build our Q&A application.

In [5]:
from langchain.chains import RetrievalQA
# The retriever, embeddings, and chat model all come from the langchain-aws package installed above
from langchain_aws import AmazonKnowledgeBasesRetriever, BedrockEmbeddings, ChatBedrock


bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_runtime_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

kbs = [cohere_embed_hierarchical_knowledge_base, cohere_embed_semantic_knowledge_base, cohere_embed_fixed_knowledge_base] 

llm_for_text_generation = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=bedrock_runtime_client)

llm_for_evaluation = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_runtime_client)

bedrock_embeddings = BedrockEmbeddings(model_id="cohere.embed-english-v3",client=bedrock_runtime_client)


### Retrieve API: Process flow 

Create an `AmazonKnowledgeBasesRetriever` object from LangChain for each knowledge base. It calls the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, and the relevance `scores` of the retrievals.
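To make the shape of the `Retrieve API` output easier to picture, the sketch below flattens a response into one row per retrieved chunk. The `summarize_retrieval_results` helper is hypothetical, and `sample_response` is a hand-written dictionary with placeholder values in the response shape, not real output from a knowledge base.

```python
def summarize_retrieval_results(response):
    """Flatten a Retrieve API response into one small dict per retrieved chunk."""
    rows = []
    for item in response.get("retrievalResults", []):
        location = item.get("location", {})
        rows.append({
            "score": item.get("score"),
            "type": location.get("type"),
            "uri": location.get("s3Location", {}).get("uri"),
            "preview": item.get("content", {}).get("text", "")[:80],
        })
    return rows


# Hand-written placeholder response; a real one would come from a call such as:
# bedrock_agent_runtime_client.retrieve(
#     knowledgeBaseId=kb_id,
#     retrievalQuery={"text": query},
#     retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 15}},
# )
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "example chunk text"},
            "location": {
                "type": "S3",
                "s3Location": {"uri": "s3://example-bucket/octank_financial_10K.pdf"},
            },
            "score": 0.5,
        }
    ]
}

for row in summarize_retrieval_results(sample_response):
    print(row)
```

Inspecting the scores and previews this way for each knowledge base is a quick sanity check that the chunking strategy produced chunks of the expected size before running the full evaluation.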

In [None]:
# Create three retrievers with the knowledge base IDs
retrievers = [
    AmazonKnowledgeBasesRetriever(
        knowledge_base_id=kb['knowledgeBaseId'],
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 15}},
        # endpoint_url=endpoint_url,
        # region_name="us-east-1",
        # credentials_profile_name="<profile_name>",
    )
    for kb in kbs
]

## Preparing the Evaluation Data

RAGAS aims to be a reference-free evaluation framework, so preparing the evaluation dataset takes little effort. You need `question` and `ground_truth` pairs; the remaining fields (`answer` and `contexts`) are produced through inference as shown below. Metrics that compare against a reference, such as `context_recall` and `answer_correctness`, need `ground_truth`; if you use only reference-free metrics such as `faithfulness` and `answer_relevancy`, the questions alone are enough.
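Before building the dataset, it can help to check that each record has the expected columns and shapes. The sketch below is a hypothetical helper, not part of RAGAS; it assumes the four column names used in this notebook (`question`, `answer`, `contexts`, `ground_truth`), and the example record contains placeholder strings only.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_eval_data(record):
    """Return a list of problems found in one evaluation record (empty if none)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in record]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    lengths = {c: len(record[c]) for c in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")
    # Each question needs its own list of retrieved chunk texts.
    for i, ctx in enumerate(record["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"contexts[{i}] must be a list of strings")
    return problems


# Placeholder record for illustration only.
example_record = {
    "question": ["example question"],
    "answer": ["example answer"],
    "contexts": [["example chunk 1", "example chunk 2"]],
    "ground_truth": ["example reference answer"],
}
print(validate_eval_data(example_record))
```

A check like this catches the most common mistake, passing a flat list of strings for `contexts` instead of one list of chunks per question, before any model calls are made.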

In [None]:
from datasets import Dataset

questions = [
    "What was the primary reason for the increase in net cash provided by operating activities for Octank Financial in 2021?",
    "In which year did Octank Financial have the highest net cash used in investing activities, and what was the primary reason for this?",
    "What was the primary source of cash inflows from financing activities for Octank Financial in 2021?",
    "Calculate the year-over-year percentage change in cash and cash equivalents for Octank Financial from 2020 to 2021.",
    "Based on the information provided, what can you infer about Octank Financial's overall financial health and growth prospects?"
]
ground_truth = [
    "The increase in net cash provided by operating activities was primarily due to an increase in net income and favorable changes in operating assets and liabilities.",
    "Octank Financial had the highest net cash used in investing activities in 2021, at $360 million, compared to $290 million in 2020 and $240 million in 2019. The primary reason for this was an increase in purchases of property, plant, and equipment and marketable securities.",
    "The primary source of cash inflows from financing activities for Octank Financial in 2021 was an increase in proceeds from the issuance of common stock and long-term debt.",
    "To calculate the year-over-year percentage change in cash and cash equivalents from 2020 to 2021: \
    2020 cash and cash equivalents: $350 million \
    2021 cash and cash equivalents: $480 million \
    Percentage change = (2021 value - 2020 value) / 2020 value * 100 \
    = ($480 million - $350 million) / $350 million * 100 \
    = 37.14% increase",
    "Based on the information provided, Octank Financial appears to be in a healthy financial position and has good growth prospects. The company has consistently increased its net cash provided by operating activities, indicating strong profitability and efficient management of working capital. Additionally, Octank Financial has been investing in long-term assets, such as property, plant, and equipment, and marketable securities, which suggests plans for future growth and expansion. The company has also been able to finance its growth through the issuance of common stock and long-term debt, indicating confidence from investors and lenders. Overall, Octank Financial's steady increase in cash and cash equivalents over the past three years provides a strong foundation for future growth and investment opportunities."
]

data = {}
for retriever in retrievers:
    answers = []
    contexts = []
    kb_id = retriever.knowledge_base_id
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm_for_text_generation, retriever=retriever, return_source_documents=True
    )
    for query in questions:
        # One chain call returns both the answer and the chunks it was grounded on
        response = qa_chain.invoke(query)
        answers.append(response["result"])
        contexts.append([doc.page_content for doc in response["source_documents"]])
    # Collect the columns RAGAS expects for this knowledge base
    data[kb_id] = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truth
    }


datasets = {}
for kb in kbs:
    datasets[kb['knowledgeBaseId']] = Dataset.from_dict(data[kb['knowledgeBaseId']])

print(datasets)

## Evaluating the RAG application
First, import all the metrics you want to use from `ragas.metrics`. Then, you can use the `evaluate()` function and simply pass in the relevant metrics and the prepared dataset.

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)

from ragas.metrics.critique import (
    harmfulness,
    maliciousness,
    coherence,
    correctness,
    conciseness
)

#specify the metrics here
metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_entity_recall,
        answer_similarity,
        answer_correctness,
        harmfulness, 
        maliciousness, 
        coherence, 
        correctness, 
        conciseness
    ]

dfs = {}
for kb in kbs:
    result = evaluate(
        dataset=datasets[kb['knowledgeBaseId']],
        metrics=metrics,
        llm=llm_for_evaluation,
        embeddings=bedrock_embeddings, 
    )
    dfs[kb['knowledgeBaseId']] = result.to_pandas()
    
# disregard any "failed to parse output" errors. Those originate from the Ragas library    

Below we will plot the results:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

numeric_cols = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall',
                'context_entity_recall', 'answer_similarity', 'answer_correctness', 'harmfulness',
                'maliciousness', 'coherence', 'correctness', 'conciseness']

# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Calculate the mean of numeric columns for each DataFrame
means = []
kb_names = []
for key, df in dfs.items():
    mean_values = df[numeric_cols].mean()
    means.append(mean_values)
    
    # Find the corresponding KB name from the 'kbs' list
    kb_name = next((kb['name'] for kb in kbs if kb['knowledgeBaseId'] == key), 'Unknown')
    kb_names.append(kb_name)

# Combine the means into a single DataFrame
means_df = pd.DataFrame(means, columns=numeric_cols)
means_df = means_df.set_index(pd.Series(kb_names))

# Reshape the DataFrame for plotting
means_df = means_df.reset_index().melt(id_vars='index', value_vars=numeric_cols, var_name='Metric', value_name='Mean')

# Plot the means as a bar chart
bar_plot = means_df.set_index(['Metric', 'index'])['Mean'].unstack().plot(kind='bar', ax=ax, figsize=(20, 6), width=0.8, legend=True)

# Set the chart title and axis labels
ax.set_title('Ragas Performance', fontsize=14)
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Average', fontsize=12)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

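# Label the legend with the KB names that pandas assigned to each bar series
# (unstack() sorts the columns, so kb_names may be in a different order)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc='upper right', ncol=1)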

# Display the chart
plt.show()

> Note: The scores above give a relative indication of your RAG application's performance; use them with caution and not as standalone scores. Also note that we used only 5 question/answer pairs for evaluation. As a best practice, use enough data to cover the different aspects of your documents when evaluating a model.

If you made it this far, congrats! You have completed the workshop! If you have extra time, feel free to tweak the chunking strategies further and rerun the analysis to view your results.