# Retrieval Augmented Generation with Amazon Bedrock - Enhancing Chat Applications with RAG

> *PLEASE NOTE: This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

---

## Chat with LLMs Overview

Conversational interfaces such as chatbots and virtual assistants can be used to enhance the user experience for your customers. Chatbots can be used in a variety of applications, such as customer service, sales, and e-commerce, to provide quick and efficient responses to users.

The key technical ingredient a chat feature needs is conversational memory. With it, customers can ask follow-up questions and the LLM can take into account what has already been said. The image below shows how this is orchestrated at a high level.

![Amazon Bedrock - Conversational Interface](./images/chatbot_bedrock.png)

## Extending Chat with RAG

In this workshop, however, we want customers to be able to ask follow-up questions about documentation that we supply through RAG. This means we need to build a system that has both conversational memory and contextual retrieval built into text generation.

![4](./images/context-aware-chatbot.png)

Let's get started!

---

## Setup `boto3` Connection

#### Install the required libraries

In [17]:
%pip install  \
    "langchain>=0.1.11" \
    "transformers>=4.24,<5" \
    "faiss-cpu>=1.7.4,<2" \
    "pypdf>=3.8,<4" \
    pinecone-client==2.2.4 \
    "apache-beam==2.52.*" \
    tiktoken==0.5.2 \
    "ipywidgets>=7,<8" \
    matplotlib==3.8.2 \
    anthropic==0.9.0 \
    llama-index==0.9.0

Note: you may need to restart the kernel to use updated packages.


In [18]:
import boto3
import os
from IPython.display import Markdown, display

region = os.environ.get("AWS_REGION")
boto3_bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name=region,
)

### Use the following chain-of-thought (CoT) prompt to test

In [19]:
prompt = "Human: You are a supply chain inspector. Your job is to assess the risk of a supplier. You measure the supplier risk by evaluating the supplier across three dimensions: country, size, and reputation.\nCountry - north America and west Europe countries are considered low risk. The rest of the world is considered medium risk.\nSize - supplier with over 1000 employees is low risk. Supplier with 50 to 999 employees is medium risk, and a supplier with under 50 employees is high risk.\nReputation - reputation scores are between 1 to 10 where a score of 1 to 3 is low risk, a score of 4 to 7 is medium risk, and a reputation score of 8 to 10 is high risk.\nThe risk formula is to take the maximum risk across the three dimensions.\n##\nExample:\nSupplier: A\nCountry: Chad\nSize: 30\nReputation: 8\nLet's think step by step:\nChad is not in North America or West Europe therefore country risk is medium.\nA size of 30 is below 50 and therefore considered high risk.\nA reputation score of 8 is between 8 to 10 and therefore considered high risk.\nFinal Answer taking the maximum risk among all: Supplier A is at High risk.\n##\nSupplier: B\nCountry: USA\nSize: 40\nReputation: 2\nLet's think step by step: \nAssistant: "


model_output = " Okay, let's evaluate Supplier B:\n\n- Country: USA is in North America, so the country risk is low.\n\n- Size: 40 employees is below 50, so the size risk is high. \n\n- Reputation: A score of 2 is between 1-3, so the reputation risk is low.\n\nTo determine the overall risk, we take the maximum risk across the three dimensions. \n\nThe maximum risk for Supplier B is high due to its small size.\n\nTherefore, my assessment is that Supplier B is at high risk overall."
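
The risk formula in the prompt is simple enough to check by hand, which is a useful way to confirm that the model's step-by-step answer is right. The following is a minimal sketch that encodes the thresholds exactly as the prompt states them. The helper names and the short list of low-risk countries are assumptions made for illustration; they are not part of the workshop code.

```python
# Sketch only: helper names and the country list are hypothetical.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Assumption: a small illustrative subset of North America and West Europe.
LOW_RISK_COUNTRIES = {"USA", "Canada", "Mexico", "France", "Germany", "UK"}


def country_risk(country: str) -> str:
    # Low risk for listed countries, medium for the rest of the world.
    return "low" if country in LOW_RISK_COUNTRIES else "medium"


def size_risk(employees: int) -> str:
    # The prompt leaves exactly 1000 employees unspecified; treated as low here.
    if employees >= 1000:
        return "low"
    if employees >= 50:
        return "medium"
    return "high"


def reputation_risk(score: int) -> str:
    # Thresholds copied literally from the prompt: 1-3 low, 4-7 medium, 8-10 high.
    if score <= 3:
        return "low"
    if score <= 7:
        return "medium"
    return "high"


def supplier_risk(country: str, employees: int, reputation: int) -> str:
    # Overall risk is the maximum risk across the three dimensions.
    risks = [country_risk(country), size_risk(employees), reputation_risk(reputation)]
    return max(risks, key=RISK_ORDER.get)


print(supplier_risk("Chad", 30, 8))  # Supplier A from the prompt's example
print(supplier_risk("USA", 40, 2))   # Supplier B, the case the model is asked about
```

Both calls print `high`, which agrees with the worked example in the prompt and with the model output recorded above.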

---
## Using LangChain for Conversation Memory

LangChain's `ConversationBufferMemory` class provides an easy way to capture conversational memory for LLM chat applications. Let's check out an example below of Claude retrieving context through conversational memory.

Similar to the last workshop, we will use both a prompt template and a LangChain LLM for this example. Note that this time our prompt template includes a `{history}` variable, where our chat history will be inserted into the prompt.

In [20]:
from langchain import PromptTemplate

CHAT_PROMPT_TEMPLATE = '''You are a helpful conversational assistant.
{history}

Human: {human_input}

Assistant:
'''
PROMPT = PromptTemplate.from_template(CHAT_PROMPT_TEMPLATE)

In [21]:
from langchain.llms import Bedrock

llm = Bedrock(
    client=boto3_bedrock,
    model_id="anthropic.claude-instant-v1",
    model_kwargs={
        "max_tokens_to_sample": 500,
        "temperature": 0.9,
    },
)

The `ConversationBufferMemory` class is instantiated here, and you will notice that we use Claude-specific human and assistant prefixes. When we initialize the memory, the history is blank.
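
To make the idea of a buffer memory concrete, here is a minimal plain-Python sketch of what such a class does conceptually: it stores every message and joins them into one history string using the configured prefixes. `SimpleBufferMemory` is a hypothetical stand-in written for illustration, not LangChain's implementation.

```python
class SimpleBufferMemory:
    """Hypothetical stand-in for a conversation buffer (illustration only)."""

    def __init__(self, human_prefix: str = "Human", ai_prefix: str = "AI") -> None:
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.messages = []  # list of (role, text) tuples in chronological order

    def add_user_message(self, text: str) -> None:
        self.messages.append(("human", text))

    def add_ai_message(self, text: str) -> None:
        self.messages.append(("ai", text))

    def history(self) -> str:
        # One "<prefix>: <text>" line per message, joined with newlines.
        lines = []
        for role, text in self.messages:
            prefix = self.human_prefix if role == "human" else self.ai_prefix
            lines.append(f"{prefix}: {text}")
        return "\n".join(lines)


sketch_memory = SimpleBufferMemory(human_prefix="\nHuman", ai_prefix="\nAssistant")
print(repr(sketch_memory.history()))  # empty string: nothing has been said yet

sketch_memory.add_user_message("Hi")
sketch_memory.add_ai_message("Hello! How can I help?")
print(sketch_memory.history())
```

The leading `\n` in each prefix leaves a blank line between turns, which matches the `Human:` / `Assistant:` turn layout used in the prompt template.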

In [22]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(human_prefix="\nHuman", ai_prefix="\nAssistant")
history = memory.load_memory_variables({})['history']
print(history)




We now ask Claude a simple question: "How can I check for imbalances in my model?". The LLM responds to the question, and we can use the `add_user_message` and `add_ai_message` methods to save the input and output into memory. We can then retrieve the entire conversation history and print the response. For now, the model still answers from the data it was trained on. Later, we will examine how to get a curated answer using our own FAQs.

In [23]:
human_input = 'How can I check for imbalances in my model?'

prompt_data = PROMPT.format(human_input=human_input, history=history)
ai_output = llm(prompt_data)

memory.chat_memory.add_user_message(human_input)
memory.chat_memory.add_ai_message(ai_output.strip())
history = memory.load_memory_variables({})['history']
display(Markdown(f'{history}'))


Human: How can I check for imbalances in my model?

Assistant: Here are some things you can do to check for and address potential imbalances in your machine learning model:

- Evaluate model performance on different demographic groups. Check error rates, accuracy,etc. for groups defined by attributes like gender, age, location, etc. Large differences could indicate bias.

- Review training data for uneven or unrepresentative coverage of different groups. Your model is only as fair as the data it learns from. Check for biases in data collection or labeling.

- Perform bias-mitigation techniques like reweighting or oversampling underrepresented groups in training. This can help balance the model's experience. 

- Adjust decision thresholds for different groups to account for natural variance and achieve fair outcomes rather than just equal error rates. 

- Conduct demographic parity tests to ensure model outputs are not overly correlated with sensitive attributes like those that could lead to direct or indirect discrimination.

- Get feedback from a diverse set of users and experts on how the model may treat or perceive different demographic profiles. Look for unintended harms.

- Regularly monitor model performance over time as data, user profiles or environments change to catch new imbalances that emerge. Fairness is an ongoing process.

The key is having transparency into a model's behavior for different subpopulations and adjusting accordingly based on empirical evidence of unfair impacts or biases. Continuous evaluation helps catch issues early.

Now we will ask a follow-up question about what kind of imbalances it detects, and save the input and output again. Notice that when the human says "it", the model can work out what the user is asking about because it has access to the chat history.

In [24]:
human_input = 'What kind does it detect?'

prompt_data = PROMPT.format(human_input=human_input, history=history)
ai_output = llm(prompt_data)

memory.chat_memory.add_user_message(human_input)
memory.chat_memory.add_ai_message(ai_output.strip())
#display(Markdown(f'{history}'))
display(Markdown(f'{ai_output}'))

 Here are some common types of imbalances that can be detected when checking a machine learning model:

- Statistical parity/demographic disparity - When the model's outcomes are unevenly distributed across demographic groups like gender or race. For example, approval rates differ significantly.

- Equalized odds/predictive parity - When the model has different true/false positive rates or precision/recall between groups for the same target variable. For example, higher error rates for one group. 

- Calibration - When the model's confidence in its predictions is not well-aligned between demographic groups. It may be over- or under-confident for some populations.

- Treatment equality - When similar individuals from different demographic groups are assigned meaningfully different model outputs or treatments without justification. 

- Disparate mistreatment - When the model systematically disadvantages certain groups even when they have the same characteristics as others. 

- Redlining - When geographic boundary or regional biases emerge, such as better performance in urban vs rural areas.

- Feedback loops - When the model reinforces pre-existing biases over time due to data or decisions made on its previous outcomes.

- Proxy variables - When non-protected attributes serve as a proxy for protected ones, still leading to potential discrimination.

Checking for imbalances in these types of fairness metrics is important to ensure a model is equitable in how it perceives and treats different segments of the population.

---
## Creating a class to help facilitate conversation

To help create some structure around these conversations, we create a custom `Conversation` class below. This class will hold a stateful conversational memory and be the base for conversational RAG later.

In [25]:
class Conversation:
    def __init__(self, client, model_id: str="anthropic.claude-instant-v1") -> None:
        """instantiates a new RAG-based conversation

        Args:
            client: boto3 `bedrock-runtime` client used to call the model.
            model_id (str, optional): which bedrock model to use for the conversational agent. Defaults to "anthropic.claude-instant-v1".
        """

        # instantiate memory
        self.memory = ConversationBufferMemory(human_prefix="\nHuman", ai_prefix="\nAssistant")

        # instantiate LLM connection
        self.llm = Bedrock(
            client=client,
            model_id=model_id,
            model_kwargs={
                "max_tokens_to_sample": 500,
                "temperature": 0.9,
            },
        )

    def ai_respond(self, user_input: str=None):
        """responds to the user input in the conversation with context used

        Args:
            user_input (str, optional): user input. Defaults to None.

        Returns:
            ai_output (str): response from AI chatbot
        """

        # format the prompt with chat history and user input
        history = self.memory.load_memory_variables({})['history']
        llm_input = PROMPT.format(history=history, human_input=user_input)

        # respond to the user with the LLM
        ai_output = self.llm(llm_input).strip()

        # store the input and output
        self.memory.chat_memory.add_user_message(user_input)
        self.memory.chat_memory.add_ai_message(ai_output.strip())

        return ai_output

Let's see the class in action with two contextual questions. Again, notice the model is able to correctly interpret the context because it has memory of the conversation.

In [26]:
chat = Conversation(client=boto3_bedrock)

In [27]:
output = chat.ai_respond('How can I check for imbalances in my model?')
display(Markdown(f'{output}'))

Here are some things you can do to check for and address potential imbalances in your machine learning model:

- Evaluate model performance on different demographic groups. Check accuracy, precision, recall etc. broken down by things like gender, age, race, income level etc. Significant differences could indicate unfairness.

- Use bias metrics like statistical parity difference, equal opportunity difference, disparate impact ratio to directly measure bias between groups. Tools like Tensorflow Model Analysis can help with this.

- Oversample or undersample data to balance groups and re-evaluate model. See if performance changes significantly. 

- Try debiasing techniques like adversarial debiasing or prejudice remover to reduce proxy variables encoding protected attributes from the model. Re-evaluate.

- Use a balanced training objective like equalized odds or equal opportunity to explicitly optimize for fairness between groups during training.

- Check feature importance - if a few features related to protected attributes have outsized importance, it could indicate reliance on unfair proxies. 

- Get a model provenance report listing the data, parameters and training process. More transparency helps identify potential sources of bias.

- Consult literature and guidelines on mitigating bias from groups like ML fairness community, FATML etc for recommended validation strategies.

The key is evaluating on multiple relevant metrics and demographic splits to test for unintended discrimination or unfairness in your model. Continual monitoring is also important.

In [28]:
output = chat.ai_respond('What kind does it detect?')
display(Markdown(f'{output}'))

Here are some common types of imbalances that model checking aims to detect:

- Demographic imbalances - Where the model performs differently based on attributes like gender, age, race, income level etc. This includes differences in accuracy, false positive/negative rates between groups.

- Class imbalance - When the distribution of target classes is uneven in the training data. For example, if one class only makes up 5% of records but is critical. This can skew models. 

- Feature covariate imbalance - When predictive features are correlated with protected attributes. Models may rely on these proxy variables rather than direct targets.

- Missing data imbalance - If data is incomplete in a biased way, like certain groups having missing values more often. Imputation can propagate this bias.

- Selection bias - When the training data doesn't accurately represent the population of interest, due to issues in how the data was originally collected or labeled. 

- Multicollinearity - High correlation between predictors can cause overfitting to specific redundant features instead of general patterns. 

- Catastrophic forgetting - Models struggle to maintain performance on older/minority data points as they learn from new/majority data.

- Dataset shift - When the joint distribution of features and labels changes between training and inference environments.

So in summary, model checking tries to identify unfairness, inefficiencies or fragilities in a model that stem from skew, imbalance or bias in the underlying data or model assumptions.

---
## Combining RAG with Conversation

Now that we have a conversational system built, let's incorporate the RAG system we built in notebook 02 into the chat paradigm.

First, we will create the same vector store with LangChain and FAISS from the last notebook.

Our goal is to create a curated response from the model that uses only the FAQs we have provided.

In [29]:
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
import os
from pathlib import Path

# instantiate the embedding model
embedding_model = BedrockEmbeddings(
    client=boto3_bedrock,
    model_id="amazon.titan-embed-text-v1"
)

# load the vector store from the saved FAISS index
vs = FAISS.load_local('./faiss-index/langchain/', embedding_model, allow_dangerous_deserialization=True)

### Visualize Semantic Search 

⚠️ ⚠️ ⚠️ This section is for Advanced Practitioners. Please feel free to run through these cells and come back later to re-examine the concepts ⚠️ ⚠️ ⚠️ 

Let's see how the semantic search works:
1. First we calculate the embeddings vector for the query, and
2. then we use this vector to do a similarity search on the store


##### Citation
We will also be able to get the `citation`, that is, the underlying documents that our vector store matched to our query. This is useful for debugging and for measuring the quality of the vector store. Let us look at how the underlying vector store calculates the matches.

##### Vector DB Indexes
One of the key capabilities of a vector DB is retrieving the documents that match a query both accurately and quickly. There are multiple algorithms for this; some examples can be [read here](https://thedataquarry.com/posts/vector-db-3/).
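
The simplest index is a flat (exhaustive) one: compare the query against every stored vector and keep the closest `k`. It is exact but scales linearly with the number of vectors, which is why approximate indexes exist for larger collections. The sketch below is a plain-Python illustration of the flat approach on toy 2-D vectors; the function names and data are made up for the example and this is not how FAISS is implemented internally.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(query, vectors, k=2):
    """Exhaustive search: return the k (index, distance) pairs nearest to query."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


# Toy data (hypothetical): index 1 matches the query exactly, index 2 is close.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
print(flat_search([1.0, 1.0], toy_vectors, k=2))
```

The result lists index 1 first (distance 0) and index 2 second, nearest first, which is the same ordering convention as a distance-ranked similarity search.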

In [30]:
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')
#- helpful function to display in tabular format

def display_table(data):
    html = "<table>"
    for row in data:
        html += "<tr>"
        for field in row:
            html += "<td>%s</td>"%(field)
        html += "</tr>"
    html += "</table>"
    display(HTML(html))

In [31]:

v = embedding_model.embed_query("How can I check for imbalances in my model?")
print(v[0:10])
results = vs.similarity_search_by_vector(v, k=2)
display(Markdown('Let us look at the documents which had the relevant information pertaining to our query'))
for r in results:
    display(Markdown(f'{r.page_content}'))
    display(Markdown(f'------------------------------------'))

[-0.14746094, 0.77734375, 0.26953125, -0.55859375, 0.047851562, -0.43554688, -0.057617188, -0.00030326843, -0.5703125, -0.33789062]


Let us look at the documents which had the relevant information pertaining to our query

What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example)."

------------------------------------

How do I build an ML model to generate accurate predictions in SageMaker Canvas?," Once you have connected sources, selected a dataset, and prepared your data, you can select the target column that you want to predict to initiate a model creation job. SageMaker Canvas will automatically identify the problem type, generate new relevant features, test a comprehensive set of prediction models using ML techniques such as linear regression, logistic regression, deep learning, time-series forecasting, and gradient boosting, and build the model that makes accurate predictions based on your dataset."

------------------------------------

#### Similarity Search

##### Distance scoring in Vector Databases
[Distance scores](https://weaviate.io/blog/distance-metrics-in-vector-search) are key to vector search. LangChain's FAISS wrapper offers several methods for working with them. One of them is `similarity_search_with_score`, which returns not only the documents but also the distance between the query and each document. The returned distance is the squared L2 (Euclidean) distance, so a lower score is better. FAISS results can be ranked in two ways: `similarity_search_with_score` ranks by distance (low to high), while `similarity_search_with_relevance_scores` ranks by relevance (high to low); both rely on the same distance strategy. `similarity_search_with_relevance_scores` derives its relevance score from the distance, roughly as `1 - score` after a normalization that assumes unit-length embeddings. For more details on the various distance metrics, [read here](https://milvus.io/docs/metric.md).
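
To build some intuition for why "lower is better" with L2 distance, the sketch below compares squared L2 distance with cosine similarity on toy vectors. For unit-length vectors the two are tied together by the identity `squared_l2 = 2 - 2 * cosine`, so ranking by either gives the same order. The vectors and helper names are assumptions chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy vectors (hypothetical), normalized so that dot product equals cosine similarity.
query = normalize([1.0, 2.0, 3.0])
docs = {
    "close": normalize([1.0, 2.0, 2.5]),
    "far": normalize([-3.0, 0.5, 1.0]),
}

for name, vec in docs.items():
    dist = squared_l2(query, vec)
    cos = dot(query, vec)
    # The last two columns agree up to floating-point rounding.
    print(name, round(dist, 4), round(2 - 2 * cos, 4))
```

This identity only holds for unit-length vectors, where the squared distance can never exceed 4. The distances of 130 and above reported further down indicate that the stored embeddings here are not normalized, so the scores should be read as raw distances rather than as a bounded similarity.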


In [32]:
display(Markdown(f"##### Let us look at the documents based on {vs.distance_strategy.name} which will be used to answer our question 'What kind of bias does Clarify detect ?'"))

context = vs.similarity_search('What kind of bias does Clarify detect ?', k=2)
#-  langchain.schema.document.Document
display(Markdown(f'------------------------------------'))
list_context = [[doc.page_content, doc.metadata] for doc in context]
list_context.insert(0, ['Documents', 'Meta-data'])
display_table(list_context)

##### Let us look at the documents based on EUCLIDEAN_DISTANCE which will be used to answer our question 'What kind of bias does Clarify detect ?'

------------------------------------

Documents,Meta-data
"What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""",{}
"How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.",{}


The table above shows the page content and the metadata associated with the matched documents. Next, let us look at the L2 scores based on the distance scoring explained above. A lower score is better.

In [33]:
#- distance of each document to the query (squared L2: a lower score means more relevant)
results = vs.similarity_search_with_score("What kind of bias does Clarify detect ?", k=2, fetch_k=3)
display(Markdown(f'##### Similarity Search Table with relevancy score.'))
display(Markdown(f'------------------------------------'))   
results.insert(0,['Documents', 'Relevancy Score'])
display_table(results)

##### Similarity Search Table with relevancy score.

------------------------------------

Documents,Relevancy Score
"page_content='What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""' _lc_kwargs={'page_content': 'What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. 
You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""', 'metadata': {}}",130.9253
"page_content='How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.' _lc_kwargs={'page_content': 'How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.', 'metadata': {}}",188.70462


#### Maximal Marginal Relevance (MMR) score

Maximal Marginal Relevance (MMR) was introduced in the paper [The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf). MMR tries to reduce redundancy among the results while maintaining their relevance to the query: each new result is chosen to be relevant to the query but dissimilar to the results already selected. Because our data set is very limited, MMR may not make a visible difference in the results below, but on larger data sets it should return a more diverse set of documents while still preserving their overall relevance.
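
The greedy selection behind MMR can be sketched in a few lines of plain Python. At each step the candidate with the highest value of `lambda_mult * relevance - (1 - lambda_mult) * redundancy` is picked, where redundancy is its highest similarity to anything already selected. The sketch uses the dot product as the similarity measure and made-up 2-D vectors; both are assumptions for illustration, and the function is not LangChain's implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of k candidates, in selection order."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = dot(query, candidates[i])
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (dot(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): candidates 0 and 1 are near-duplicates, 2 is different.
query = [1.0, 0.0]
candidates = [[0.99, 0.14], [0.98, 0.20], [0.70, -0.71]]

print(mmr(query, candidates, k=2, lambda_mult=0.5))  # → [0, 2]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # → [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is a plain relevance ranking; lowering it trades some relevance for diversity, which is why the near-duplicate candidate 1 is skipped in the first call.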

In [34]:
#- MMR search: picks documents that are relevant to the query yet different from each other
#- (the score returned with each document is its L2 distance to the query)
display(Markdown('##### Let us look at MMR scores'))
results = vs.max_marginal_relevance_search_with_score_by_vector(embedding_model.embed_query("What kind of bias does Clarify detect ?"), k=3)
results.insert(0, ["Document", "MMR Score"])
display_table(results)


##### Let us look at MMR scores

Document,MMR Score
"page_content='What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""' _lc_kwargs={'page_content': 'What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. 
You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""', 'metadata': {}}",130.9253
"page_content='How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.' _lc_kwargs={'page_content': 'How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.', 'metadata': {}}",188.70462
"page_content='What is the underlying tuning algorithm for Automatic Model Tuning?,"" Currently, the algorithm for tuning hyperparameters is a customized implementation of Bayesian Optimization. It aims to optimize a customer-specified objective metric throughout the tuning process. Specifically, it checks the object metric of completed training jobs, and uses the knowledge to infer the hyperparameter combination for the next training job.""\nDoes Automatic Model Tuning recommend specific hyperparameters for tuning?,"" No. How certain hyperparameters impact the model performance depends on various factors, and it is hard to definitively say one hyperparameter is more important than the others and thus needs to be tuned. For built-in algorithms within SageMaker, we do call out whether or not a hyperparameter is tunable.""' _lc_kwargs={'page_content': 'What is the underlying tuning algorithm for Automatic Model Tuning?,"" Currently, the algorithm for tuning hyperparameters is a customized implementation of Bayesian Optimization. It aims to optimize a customer-specified objective metric throughout the tuning process. Specifically, it checks the object metric of completed training jobs, and uses the knowledge to infer the hyperparameter combination for the next training job.""\nDoes Automatic Model Tuning recommend specific hyperparameters for tuning?,"" No. How certain hyperparameters impact the model performance depends on various factors, and it is hard to definitively say one hyperparameter is more important than the others and thus needs to be tuned. For built-in algorithms within SageMaker, we do call out whether or not a hyperparameter is tunable.""', 'metadata': {}}",281.98914


#### Update embeddings of the Vector Databases

Documents are updated all the time, so we often have multiple versions of the same document. This means we also need to consider how to update the embeddings in our vector database. Fortunately, we can leverage the metadata to do this.

The key steps are:
1. Load the new documents, embed them, and add metadata marking them as version 2
2. Merge them into the existing vector database
3. Run the query with a filter so that only the new version is searched and the latest documents are returned for the same query


In [36]:
# create vector store
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.schema import Document
loader = CSVLoader(
    file_path="./data/sagemaker/sm_faq_v2.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["Question", "Answer"],
    },
)

#docs_split = loader.load()
docs_split = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator=",").split_documents(loader.load())
list_of_documents = [Document(page_content=doc.page_content, metadata=dict(page='v2')) for doc in docs_split]
print(f"Number of split docs={len(docs_split)}")
db = FAISS.from_documents(list_of_documents, embedding_model)

Number of split docs=6


#### Run a query against version 2 of the documents
Let us run the query against our existing vector database. Without a filter, the query returns the existing version 1 documents. With the version 2 filter, no documents match because version 2 has not yet been added to the vector database, so we get an empty list back.


In [37]:
# Run the query requesting data from version 2, which does not exist yet
vs = FAISS.load_local('./faiss-index/langchain/', embedding_model, allow_dangerous_deserialization=True)
search_query = "How can I check for imbalances in my model?"
#print(f"Running with v1 of the documents we get response of {vs.similarity_search_with_score(query=search_query, k=1, fetch_k=4)}")
print("------\n")
print(f"Running the query with V2 of the document we get {vs.similarity_search_with_score(query=search_query, filter=dict(page='v2'), k=1)}:")


------

Running the query with V2 of the document we get []:


#### Add a new version of the document
We will now create version 2 of the documents and use metadata to add them to our original index. Once that is done, we apply a filter in our query, which returns only the newly added documents. Run the query again after adding version 2 of the documents.

We will also look at the `fetch_k` parameter, which affects both the speed of a filtered search and how the search is narrowed when calling `similarity_search` with filters. You usually want `fetch_k` to be larger than `k`, because `fetch_k` is the number of documents fetched before filtering. If you set `fetch_k` too low, there may not be enough documents left after the filter is applied.
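To make the interaction between `k`, `fetch_k`, and the filter concrete, here is a minimal plain-Python sketch of the "fetch first, then filter" behavior described above. It is not the FAISS implementation: `ToyDoc`, `filtered_search`, and the distances are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    text: str
    distance: float  # smaller means closer to the query
    metadata: dict = field(default_factory=dict)


def filtered_search(docs, k, fetch_k, metadata_filter):
    """Take the fetch_k nearest docs, drop those failing the filter, keep at most k."""
    nearest = sorted(docs, key=lambda d: d.distance)[:fetch_k]
    matching = [
        d for d in nearest
        if all(d.metadata.get(key) == value for key, value in metadata_filter.items())
    ]
    return matching[:k]


# Three untagged (v1) documents are closer to the query than the two v2 documents.
docs = [
    ToyDoc("v1 answer A", 10.0),
    ToyDoc("v1 answer B", 20.0),
    ToyDoc("v1 answer C", 30.0),
    ToyDoc("v2 answer A", 40.0, {"page": "v2"}),
    ToyDoc("v2 answer B", 50.0, {"page": "v2"}),
]

# fetch_k=3 only fetches v1 documents, so nothing survives the v2 filter.
too_small = filtered_search(docs, k=2, fetch_k=3, metadata_filter={"page": "v2"})
# fetch_k=5 fetches every document, so both v2 documents are returned.
large_enough = filtered_search(docs, k=2, fetch_k=5, metadata_filter={"page": "v2"})
print(len(too_small), len(large_enough))
```

In this toy setup the filtered search comes back empty when `fetch_k` is too small, even though matching documents exist in the store. This is the situation the paragraph above warns about.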

In [38]:
# - now let us add version 2 of the data set and run query from that

vs.merge_from(db)

#### Query complete merged data base with no filters
Run the query against the fully merged database without any metadata filter. It returns the top results from both the new v2 data and the v1 data. Essentially, it matches and returns whatever is closest to the query, regardless of version.

In [39]:
# - run the query again
search_query_v2 = "How can I check for imbalances in my model?"
results_with_scores = vs.similarity_search_with_score(search_query_v2, k=2, fetch_k=3)
results_with_scores = [[doc.page_content, doc.metadata, score] for doc, score in results_with_scores]
results_with_scores.insert(0, ['Document', 'Meta-Data', 'Score'])
display_table(results_with_scores)

0,1,2
Document,Meta-Data,Score
"Question: How can I check for imbalances in my model? Answer: Amazon SageMaker Clarify Version 2 will helps improve model transparency. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time",{'page': 'v2'},154.83807
"What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""",{},229.07722


#### Query with Filter
Now we will search only version 2 of the data by applying a metadata filter.

In [40]:
# - run the query again
search_query_v2 = "How can I check for imbalances in my model?"
results_with_scores = vs.similarity_search_with_score(search_query_v2, filter=dict(page='v2'), k=2, fetch_k=3)
results_with_scores = [[doc.page_content, doc.metadata, score] for doc, score in results_with_scores]
results_with_scores.insert(0, ['Document', 'Meta-Data', 'Score'])
display_table(results_with_scores)

0,1,2
Document,Meta-Data,Score
"Question: How can I check for imbalances in my model? Answer: Amazon SageMaker Clarify Version 2 will helps improve model transparency. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time",{'page': 'v2'},154.83807
Question: What kind of bias does SageMaker Clarify detect? Answer: Measuring bias in ML models is a first step to mitigating bias.,{'page': 'v2'},268.9629


#### Query for new data
Now let us ask a question which exists only in version 2 of the document

In [41]:
# - now let us ask a question which ONLY exists in version 2 of the document
search_query_v2 = "Can i use Quantum computing?"
results_with_scores = vs.similarity_search_with_score(query=search_query_v2, filter=dict(page='v2'), k=1, fetch_k=3)
results_with_scores = [[doc.page_content, doc.metadata, score] for doc, score in results_with_scores]
results_with_scores.insert(0, ['Document', 'Meta-Data', 'Score'])
display_table(results_with_scores)

0,1,2
Document,Meta-Data,Score
Question: Can i use Quantum computing? Answer: Yes SageMaker version sometime in future will let you run quantum computing,{'page': 'v2'},97.10315


### Let us continue to build our chatbot

The prompt template is now altered to take the retrieved context and the chat history as inputs, along with the human input. Notice how the prompt also instructs Claude not to answer questions for which it does not have the context. This helps reduce hallucinations, which is extremely important when creating end-user-facing applications that need to be factual.

In [None]:
# re-create vector store and continue
vs = FAISS.load_local('./faiss-index/langchain/', embedding_model, allow_dangerous_deserialization=True)

In [42]:
RAG_TEMPLATE = """You are a helpful conversational assistant.

If you are unsure about the answer OR the answer does not exist in the context, respond with
"Sorry but I do not understand your request. I am still learning so I appreciate your patience! 😊"
NEVER make up the answer.

If the human greets you, simply introduce yourself.

The context will be placed in <context></context> XML tags. 

<context>{context}</context>

Do not include any xml tags in your response.

{history}

Human: {input}

Assistant:
"""
PROMPT = PromptTemplate.from_template(RAG_TEMPLATE)

The new `ConversationWithRetrieval` class now includes a `get_context` function which searches our vector database based on the human input and combines it into the base prompt.

In [43]:
class ConversationWithRetrieval:
    def __init__(self, client, vector_store: FAISS=None, model_id: str="anthropic.claude-instant-v1") -> None:
        """instantiates a new rag based conversation

        Args:
            client: boto3 `bedrock-runtime` client used to invoke the model.
            vector_store (FAISS, optional): pre-populated vector store for searching context. Defaults to None.
            model_id (str, optional): which bedrock model to use for the conversational agent. Defaults to "anthropic.claude-instant-v1".
        """

        # store vector store
        self.vector_store = vector_store
        
        # instantiate memory
        self.memory = ConversationBufferMemory(human_prefix="Human", ai_prefix="Assistant")

        # instantiate LLM connection
        self.llm = Bedrock(
            client=client,
            model_id=model_id,
            model_kwargs={
                "max_tokens_to_sample": 500,
                "temperature": 0.0,
            },
        )

    def ai_respond(self, user_input: str=None):
        """responds to the user input in the conversation with context used

        Args:
            user_input (str, optional): user input. Defaults to None.

        Returns:
            ai_output (str): response from AI chatbot
            search_results (list): context used in the completion
        """

        # format the prompt with chat history and user input
        context_string, search_results = self.get_context(user_input)
        history = self.memory.load_memory_variables({})['history']
        llm_input = PROMPT.format(history=history, input=user_input, context=context_string)

        # respond to the user with the LLM
        ai_output = self.llm(llm_input).strip()

        # store the input and output
        self.memory.chat_memory.add_user_message(user_input)
        self.memory.chat_memory.add_ai_message(ai_output.strip())

        return ai_output, search_results

    def get_context(self, user_input, k=5):
        """returns context used in the completion

        Args:
            user_input (str): user input as a string
            k (int, optional): number of results to return. Defaults to 5.

        Returns:
            context_string (str): context used in the completion as a string
            search_results (list): context used in the completion as a list of Document objects
        """
        search_results = self.vector_store.similarity_search(
            user_input, k=k
        )
        context_string = '\n\n'.join([f'Document {ind+1}: ' + i.page_content for ind, i in enumerate(search_results)])
        return context_string, search_results

Now the model can answer some specific domain questions based on our document database!

In [44]:
chat = ConversationWithRetrieval(boto3_bedrock, vs)

In [45]:
output, context = chat.ai_respond('How can I check for imbalances in my model?')
display(Markdown(f'{output}'))

Amazon SageMaker Clarify Version 2 will helps improve model transparency. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time. Measuring bias in ML models is a first step to mitigating bias. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).

In [46]:
output, context = chat.ai_respond('What kind does it detect?')
display(Markdown(f'** Ai Assistant Answer: ** \n{output}'))
display(Markdown(f'\n\n** Relevant Documentation: ** \n{context}'))

** Ai Assistant Answer: ** 
SageMaker Clarify can detect different types of bias. Specifically, it can detect representational harms, predictive equality harms, and historical harms.

- Representational harms refer to biases or imbalances in the training data that could influence the model, such as underrepresentation of certain groups. SageMaker Clarify can detect this type of bias by checking if one group is underrepresented compared to others in the training data.

- Predictive equality harms refer to differences in how accurately or reliably a model performs for different groups. SageMaker Clarify can detect this type of bias by measuring metrics like error rates, precision, and recall and comparing the performance across different groups after model training or during deployment. 

- Historical harms refer to biases from past data or decisions that influence future outcomes in a unfair way. SageMaker Clarify can help detect this type of bias over time by monitoring model performance and metrics continuously during deployment to identify new biases or drifts in model behavior.

So in summary, SageMaker Clarify can detect various types of biases like representational biases, predictive biases, and historical biases through different bias metrics computed during data preparation, after model training, and during model deployment and monitoring. This helps identify any unfair harms or inequalities in how the model treats different groups.



** Relevant Documentation: ** 
[Document(page_content='What is Amazon SageMaker Autopilot?," SageMaker Autopilot is the industry’s first automated machine learning capability that gives you complete control and visibility into your ML models. SageMaker Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance, all with just a few clicks. The result is the best-performing model that you can deploy at a fraction of the time normally required to train the model. You get full visibility into how the model was created and what’s in it, and SageMaker Autopilot integrates with SageMaker Studio. You can explore up to 50 different models generated by SageMaker Autopilot inside SageMaker Studio so it’s easy to pick the best model for your use case. SageMaker Autopilot can be used by people without ML experience to easily produce a model, or it can be used by experienced developers to quickly develop a baseline model on which teams can further iterate."', _lc_kwargs={'page_content': 'What is Amazon SageMaker Autopilot?," SageMaker Autopilot is the industry’s first automated machine learning capability that gives you complete control and visibility into your ML models. SageMaker Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance, all with just a few clicks. The result is the best-performing model that you can deploy at a fraction of the time normally required to train the model. You get full visibility into how the model was created and what’s in it, and SageMaker Autopilot integrates with SageMaker Studio. You can explore up to 50 different models generated by SageMaker Autopilot inside SageMaker Studio so it’s easy to pick the best model for your use case. 
SageMaker Autopilot can be used by people without ML experience to easily produce a model, or it can be used by experienced developers to quickly develop a baseline model on which teams can further iterate."', 'metadata': {}}), Document(page_content='What is Amazon SageMaker Studio Lab?," SageMaker Studio Lab is a free ML development environment that provides the compute, storage (up to 15 GB), and security—all at no cost—for anyone to learn and experiment with ML. All you need to get started is a valid email ID; you don’t need to configure infrastructure or manage identity and access or even sign up for an AWS account. SageMaker Studio Lab accelerates model building through GitHub integration, and it comes preconfigured with the most popular ML tools, frameworks, and libraries to get you started immediately. SageMaker Studio Lab automatically saves your work so you don’t need to restart between sessions. It’s as easy as closing your laptop and coming back later."', _lc_kwargs={'page_content': 'What is Amazon SageMaker Studio Lab?," SageMaker Studio Lab is a free ML development environment that provides the compute, storage (up to 15 GB), and security—all at no cost—for anyone to learn and experiment with ML. All you need to get started is a valid email ID; you don’t need to configure infrastructure or manage identity and access or even sign up for an AWS account. SageMaker Studio Lab accelerates model building through GitHub integration, and it comes preconfigured with the most popular ML tools, frameworks, and libraries to get you started immediately. SageMaker Studio Lab automatically saves your work so you don’t need to restart between sessions. It’s as easy as closing your laptop and coming back later."', 'metadata': {}}), Document(page_content='What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. 
Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example)."', _lc_kwargs={'page_content': 'What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. 
SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example)."', 'metadata': {}}), Document(page_content='How can I reproduce a feature from a given moment in time?, SageMaker Feature Store maintains time stamps for all features at every instance of time. This helps you retrieve features at any period of time for business or compliance requirements. You can easily explain model features and their values from when they were first created to the present time by reproducing the model from a given moment in time.\nWhat are offline features?," Offline features are used for training because you need access to very large volumes over a long period of time. These features are served from a high-throughput, high-bandwidth repository."\nWhat are online features?, Online features are used in applications required to make real-time predictions. 
Online features are served from a high-throughput repository with single-digit millisecond latency for fast predictions.', _lc_kwargs={'page_content': 'How can I reproduce a feature from a given moment in time?, SageMaker Feature Store maintains time stamps for all features at every instance of time. This helps you retrieve features at any period of time for business or compliance requirements. You can easily explain model features and their values from when they were first created to the present time by reproducing the model from a given moment in time.\nWhat are offline features?," Offline features are used for training because you need access to very large volumes over a long period of time. These features are served from a high-throughput, high-bandwidth repository."\nWhat are online features?, Online features are used in applications required to make real-time predictions. Online features are served from a high-throughput repository with single-digit millisecond latency for fast predictions.', 'metadata': {}}), Document(page_content='Why should I use SageMaker for shadow testing?," SageMaker simplifies the process of setting up and monitoring shadow variants so you can evaluate the performance of the new ML model on live production traffic. SageMaker eliminates the need for you to orchestrate infrastructure for shadow testing. It lets you control testing parameters such as the percentage of traffic mirrored to the shadow variant and the duration of the test. As a result, you can start small and increase the inference requests to the new model after you gain confidence in model performance. 
SageMaker creates a live dashboard displaying performance differences across key metrics, so you can easily compare model performance to evaluate how the new model differs from the production model."', _lc_kwargs={'page_content': 'Why should I use SageMaker for shadow testing?," SageMaker simplifies the process of setting up and monitoring shadow variants so you can evaluate the performance of the new ML model on live production traffic. SageMaker eliminates the need for you to orchestrate infrastructure for shadow testing. It lets you control testing parameters such as the percentage of traffic mirrored to the shadow variant and the duration of the test. As a result, you can start small and increase the inference requests to the new model after you gain confidence in model performance. SageMaker creates a live dashboard displaying performance differences across key metrics, so you can easily compare model performance to evaluate how the new model differs from the production model."', 'metadata': {}})]

--- 
## Using LangChain for Orchestration of RAG

Beyond the primitive classes for prompt handling and conversational memory management, LangChain also provides a framework for [orchestrating RAG flows](https://python.langchain.com/docs/expression_language/cookbook/retrieval) with purpose-built "chains". In this section, we will see how to build a retrieval chain with LangChain which is more comprehensive and robust than the original retrieval system we built above.

The workflow we used above follows this process:

1. User input is received.
2. User input is queried against the vector database to retrieve relevant documents.
3. Relevant documents and chat memory are inserted into a new prompt to respond to the user input.
4. Return to step 1.

However, more complex methods of handling the user input can generate more accurate results in RAG architectures. One popular mechanism for increasing the accuracy of these retrieval systems is to use more than one call to an LLM, reformatting the user input so that it searches the vector database more effectively. A better workflow than the one we already built is described below:

1. User input is received.
2. An LLM is used to reword the user input to be a better search query for the vector database based on the chat history and other instructions. This could include things like condensing, rewording, addition of chat context, or stylistic changes.
3. Reformatted user input is queried against the vector database to retrieve relevant documents.
4. The reformatted user input and relevant documents are inserted into a new prompt in order to answer the user question.
5. Return to step 1.
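The improved workflow can be sketched in plain Python before bringing in LangChain. This is only an illustration under stated assumptions: `rewrite_query`, `retrieve`, `answer`, and `chat_turn` are hypothetical helpers, the "LLM" calls are simple string stand-ins, and the retriever is a toy keyword-overlap ranker rather than a vector database.

```python
def rewrite_query(history, user_input):
    """Stand-in for the LLM rewrite (step 2): add the previous user turn as context."""
    if not history:
        return user_input
    previous_question = history[-1][0]
    return f"{user_input} (in the context of: {previous_question})"


def retrieve(query, corpus, k=1):
    """Toy retriever (step 3): rank documents by the number of shared lowercase words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(query, documents):
    """Stand-in for the final LLM call (step 4)."""
    return f"Answer to '{query}' using {len(documents)} document(s)."


def chat_turn(history, user_input, corpus):
    standalone = rewrite_query(history, user_input)  # step 2
    documents = retrieve(standalone, corpus)         # step 3
    reply = answer(standalone, documents)            # step 4
    history.append((user_input, reply))              # remember the turn, back to step 1
    return standalone, documents, reply


corpus = [
    "SageMaker Autopilot trains and tunes models",
    "SageMaker Clarify checks for imbalances in your model",
]
history = []
chat_turn(history, "How can I check for imbalances in my model?", corpus)
standalone, documents, reply = chat_turn(history, "What kind does it detect?", corpus)
print(standalone)
print(documents)
```

In this toy example the raw follow-up "What kind does it detect?" shares no words with either document, so retrieval on it alone is arbitrary. The rewritten query carries the earlier question along and retrieves the Clarify document.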

Let's now build out this second workflow using LangChain below.

First we need to make a prompt which will reformat the user input to be more suitable for searching the vector database. We do this by providing the chat history as well as some basic instructions to Claude and asking it to condense the input into a single standalone question.

In [47]:
condense_prompt = PromptTemplate.from_template("""\
<chat-history>
{chat_history}
</chat-history>

<follow-up-message>
{question}
</follow-up-message>

Human: Given the conversation above (between Human and Assistant) and the follow up message from Human, \
rewrite the follow up message to be a standalone question that captures all relevant context \
from the conversation. Answer only with the new question and nothing else.

Assistant: Standalone Question:""")

The next prompt we need is the prompt which will answer the user's question based on the retrieved information. In this case, we provide specific instructions about how to answer the question as well as provide the context retrieved from the vector database.

In [48]:
respond_prompt = PromptTemplate.from_template("""\
<context>
{context}
</context>

Human: Given the context above, answer the question inside the <q></q> XML tags.

<q>{question}</q>

If the answer is not in the context say "Sorry, I don't know as the answer was not found in the context". Do not use any XML tags in the answer.

Assistant:""")

Now that we have our prompts set up, let's set up the conversational memory buffer just like we did earlier in the notebook. Notice how we inject an example human and assistant message in order to help guide our AI assistant on what its job is.
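To see what the seeded messages contribute, here is a minimal plain-Python sketch of how a chat history and a follow-up message might be rendered into a condensing prompt. The shortened `CONDENSE_TEMPLATE`, the `render_history` helper, and the `Human:`/`Assistant:` line format are assumptions for illustration; LangChain applies its own formatting internally.

```python
# Shortened, hypothetical version of a condensing prompt
CONDENSE_TEMPLATE = (
    "<chat-history>\n{chat_history}\n</chat-history>\n\n"
    "<follow-up-message>\n{question}\n</follow-up-message>\n\n"
    "Human: Rewrite the follow up message as a standalone question.\n\n"
    "Assistant: Standalone Question:"
)


def render_history(messages):
    """Join (role, text) pairs into one 'Role: text' line per message."""
    return "\n".join(f"{role}: {text}" for role, text in messages)


# The same example exchange that is injected into the memory buffer
seeded = [
    ("Human", "Hello, what are you able to do?"),
    ("Assistant", "Hi! I am a help chat assistant which can answer questions about Amazon SageMaker."),
]

rendered = CONDENSE_TEMPLATE.format(
    chat_history=render_history(seeded),
    question="What kind does it detect?",
)
print(rendered)
```

Even before the user has asked anything, the rendered prompt already tells the model that the conversation is about Amazon SageMaker, which is the guidance the seeded messages are meant to provide.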

In [49]:
llm = Bedrock(
    client=boto3_bedrock,
    model_id="anthropic.claude-instant-v1",
    model_kwargs={"max_tokens_to_sample": 500, "temperature": 0.9}
)
memory_chain = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    human_prefix="Human",
    ai_prefix="Assistant"
)
memory_chain.chat_memory.add_user_message(
    'Hello, what are you able to do?'
)
memory_chain.chat_memory.add_ai_message(
    'Hi! I am a help chat assistant which can answer questions about Amazon SageMaker.'
)

Lastly, we will use the `ConversationalRetrievalChain` from LangChain to orchestrate this whole system. If you would like to see more logs about what is happening in the orchestration and not just the final output, change the `verbose` argument to `True`.

In [50]:
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(
    llm=llm, # this is our claude model
    retriever=vs.as_retriever(), # this is our FAISS vector database
    memory=memory_chain, # this is the conversational memory storage class
    condense_question_prompt=condense_prompt, # this is the prompt for condensing user inputs
    verbose=False, # change this to True in order to see the logs working in the background
)
qa.combine_docs_chain.llm_chain.prompt = respond_prompt # this is the prompt in order to respond to condensed questions

Let's go ahead and generate some responses from our RAG solution!

In [51]:
display(Markdown(f"{qa.run({'question': 'How can I check for imbalances in my model?'})}"))

 With Amazon SageMaker, you can check for imbalances in your model in a few ways:

You can use Amazon SageMaker Clarify to check for imbalances during data preparation, after training, and ongoing over time. SageMaker Clarify will analyze your model and data to detect any potential biases, like disproportionate mispredictions for certain groups. 

You can also use SageMaker Studio to explore up to 50 different models generated by SageMaker Autopilot. This allows you to compare multiple models and evaluate things like if certain groups are predicted accurately compared to others. SageMaker Studio gives full visibility into how each model was created.

Additionally, SageMaker Data Wrangler supports running bias analysis using SageMaker Clarify directly during the data preparation process. This lets you detect potential biases in the data before training models.

So in summary, SageMaker Clarify, SageMaker Studio, SageMaker Autopilot and SageMaker Data Wrangler all provide capabilities to analyze models and data for imbalances or unintended biases towards specific groups.

In [52]:
display(Markdown(f"{qa.run({'question': 'What kind does it detect?' })}"))

 SageMaker Clarify, SageMaker Studio, SageMaker Autopilot and SageMaker Data Wrangler can detect the following types of imbalances in models and data:

SageMaker Clarify can check for imbalances in the training data before and after model training, as well as for an inference model. It can detect differences in label distributions and underrepresentation across different groups in the training data. For trained models and deployed inferences, it can measure whether performance like error rates, precision and recall differ across groups. 

SageMaker Data Wrangler allows running bias analysis supported by SageMaker Clarify directly during data preparation to detect potential biases. 

SageMaker Autopilot and SageMaker Studio integrate with SageMaker Clarify, so they can explore up to 50 different models generated by Autopilot to help identify any performance imbalances across groups.

In [53]:
display(Markdown(f"{qa.run({'question': 'How does this improve model explainability?' })}"))

 SageMaker Clarify, SageMaker Studio, SageMaker Autopilot and SageMaker Data Wrangler help improve the explainability of Machine Learning models developed with Amazon SageMaker in the following ways:

SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time. This helps identify any unfairness or biases in the training data or model. SageMaker Studio provides a single interface where all ML development steps can be performed, including using SageMaker Clarify to debug and explain models. SageMaker Autopilot automatically trains and tunes multiple models, then ranks them based on performance and transparency metrics from SageMaker Clarify. This helps produce the most fair and explainable model. SageMaker Data Wrangler allows preparing data for ML within SageMaker Studio, and can detect potential bias during this stage using SageMaker Clarify. Together, these capabilities give visibility into how data impacts the model and how to improve any issues, leading to more transparent and trustworthy models.

## Let us use an LLM to validate whether the response was factual

#### We first create a sanity prompt, which uses the vector DB results and asks the LLM to validate whether the response given was accurate

In [54]:
# create sanity check prompt
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

def create_sanity_prompt(instruction_start: str = None, instruction_end: str = None) -> PromptTemplate:
    """
    Create a prompt template for LLM sanity check

    Parameters
    ----------
    instruction_start : str, optional
        Instruction at the beginning of the prompt, by default None
    instruction_end : str, optional
        Instruction at the end of the prompt, by default None

    Returns
    -------
    PromptTemplate
        Prompt template in the LangChain format
    """

    # first instruction (treat a missing instruction as an empty string)
    prompt_template_build = (instruction_start or "") + "\n"

    # add context
    prompt_template_build += "Context: {context}" + "\n"

    # add statement
    prompt_template_build += "Statement: {statement}" + "\n"

    # add instruction
    prompt_template_build += "Question: " + (instruction_end or "") + "\n"

    # add answer placeholder
    prompt_template_build += "Answer:"

    print(prompt_template_build)
    # build the template
    llm_prompt = PromptTemplate(
        template=str(prompt_template_build),
        input_variables=["context", "statement"],
    )
    return llm_prompt

sanity_prompt = create_sanity_prompt(
    instruction_start="""The following is a conversation between a highly knowledgeable and intelligent AI assistant, called Falcon, and a human user asking Questions. In the following interactions, Falcon will converse in natural language, and Falcon will answer the questions based only on the provided Context. Falcon will provide accurate, short and direct answers to the questions.""",
    instruction_end="Is the above statement based directly on the provided context? Answer with yes or no.",
)

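# similarity_search_with_score returns (document, score) pairs; with the default
# FAISS index the score is an L2 distance, so lower means more similar
docs = vs.similarity_search_with_score('How can I check for imbalances in my model?')
contexts = []
source = []
for doc, score in docs:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
    if score <= 0.9:
        contexts.append(doc)
        # not every document carries a 'source' key, so avoid a KeyError
        source.append(doc.metadata.get('source'))
        print(f"\n INPUT CONTEXT:{contexts}")

# `output` is expected to hold the response text being validated
sanity_chain = load_qa_chain(llm=llm, prompt=sanity_prompt)
sanity_check = sanity_chain(
    {"input_documents": contexts, "statement": output},
    return_only_outputs=True,
)['output_text']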

sanity_check

The following is a conversation between a highly knowledgeable and intelligent AI assistant, called Falcon, and a human user asking Questions. In the following interactions, Falcon will converse in natural language, and Falcon will answer the questions based only on the provided Context. Falcon will provide accurate, short and direct answers to the questions.
Context: {context}
Statement: {statement}
Question: Is the above statement based directly on the provided context? Answer with yes or no.
Answer:
Content: Question: How can I check for imbalances in my model?
Answer: Amazon SageMaker Clarify Version 2 will helps improve model transparency. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time, Metadata: {'page': 'v2'}, Score: 154.83807373046875
Content: What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as

' Yes'
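
The scores printed above are worth a second look. Assuming the FAISS store uses its default L2 distance, a lower score means a closer match, so a document scored at roughly 154.8 does not pass a `score <= 0.9` cutoff and no "INPUT CONTEXT" line is printed. The sketch below isolates that filtering step in plain Python; `filter_by_distance` is a hypothetical helper, the second score is made up, and the `160.0` cutoff is purely illustrative rather than a recommended value.

```python
from typing import Any, List, Tuple

def filter_by_distance(
    scored_docs: List[Tuple[Any, float]], max_distance: float
) -> List[Any]:
    """Keep documents whose distance score is at or below max_distance."""
    return [doc for doc, score in scored_docs if score <= max_distance]

# The first score is the one printed above; the second is invented for illustration.
scored = [
    ("doc about SageMaker Clarify", 154.83807373046875),
    ("another doc", 201.5),
]

print(filter_by_distance(scored, max_distance=0.9))    # nothing passes this cutoff
print(filter_by_distance(scored, max_distance=160.0))  # only the closer document passes
```

A suitable cutoff depends on the embedding model and distance metric, so it is worth printing a few scores for known-good matches before fixing a threshold.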

#### The LLM answered "Yes", meaning it judged the response to be based on the retrieved context

In [55]:
sanity_check

' Yes'
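
Since the model returns free text such as `' Yes'` (note the leading space), downstream code usually needs a normalized verdict. The following is a minimal sketch of one way to do that; `parse_verdict` is a hypothetical helper, and returning `None` for unclear replies is a design assumption rather than anything the notebook prescribes.

```python
from typing import Optional

def parse_verdict(raw: str) -> Optional[bool]:
    """Map a free-text yes/no reply to True, False, or None if unclear."""
    words = raw.strip().lower().split()
    if not words:
        return None
    # Look only at the first word, ignoring surrounding punctuation.
    first_word = words[0].strip(".,!:;\"'")
    if first_word == "yes":
        return True
    if first_word == "no":
        return False
    return None

print(parse_verdict(" Yes"))
print(parse_verdict("No, the statement is not supported."))
print(parse_verdict("I am not sure."))
```

Treating anything other than a clear "yes" or "no" as `None` lets the caller decide whether to retry, fall back, or flag the response for review.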