**Summary: This notebook demonstrates how to compare a fine-tuned model with the original pre-trained model using RAG, by plotting their evaluation results in a radar plot and adding a cost analysis.**

In [1]:
!pip install -qU pinecone-client==2.2.1 ipywidgets==7.0.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autovizwidget 0.21.0 requires pandas<2.0.0,>=0.20.1, but you have pandas 2.1.1 which is incompatible.
hdijupyterutils 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.1 which is incompatible.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.1 which is incompatible.

In [2]:
!pip install fmeval --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mkl-fft 1.3.8 requires mkl, which is not installed.
autovizwidget 0.21.0 requires pandas<2.0.0,>=0.20.1, but you have pandas 2.1.4 which is incompatible.
awscli 1.32.3 requires botocore==1.34.3, but you have botocore 1.34.155 which is incompatible.
awscli 1.32.3 requires s3transfer<0.10.0,>=0.9.0, but you have s3transfer 0.10.2 which is incompatible.
hdijupyterutils 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.4 which is incompatible.
numba 0.57.1 requires numpy<1.25,>=1.21, but you have numpy 1.26.4 which is incompatible.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.4 which is incompatible.
sphinx 7.2.6 requires docutils<0.21,>=0.18.1, but you have docutils 0.16 which is incompatible.

In [3]:
# install packages needed for plotting
! pip install -U kaleido --quiet
! pip install plotly --quiet

In [4]:
import json
import os

import boto3
import pandas as pd
import plotly.express as px
import plotly.io as pio
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.session import Session

pio.renderers.default = 'notebook'




sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Get the metrics from the lab1 and lab2 notebooks here

## 1. Visualize results pre-trained vs fine-tuned

We load the results and visualize them as radar plots.

In [None]:
# Load the evaluation results for each model.
# Note: for Llama 2 7B chat, the generated JSON file has no 'accuracy' key,
# so we iterate over the top-level list directly.
def load_results(models):
    accuracy_results = []
    for model in models:
        file = f'example_results/{model}.json'
        with open(file, 'r') as f:
            res = json.load(f)
        for accuracy_eval in res:
            for accuracy_scores in accuracy_eval["dataset_scores"]:
                accuracy_results.append(
                    {'model': model, 'evaluation': 'accuracy', 'dataset': accuracy_eval["dataset_name"],
                     'metric': accuracy_scores["name"], 'value': accuracy_scores["value"]})

    return pd.DataFrame(accuracy_results)

In [None]:
# code for plotting the results
def visualize_radar(results_df, dataset):
    # aggregate 3 datasets into 1 by taking the mean across datasets
    if dataset == 'all':
        results_df = results_df.groupby(['model', 'metric'], as_index=False)['value'].mean()
    # plot a single dataset
    else:
        results_df = results_df[results_df['dataset'] == dataset]

    fig = px.line_polar(results_df, r='value', theta='metric', color='model', line_close=True)
    radial_max = 1  # upper limit of the radial axis
    fig.update_layout(
        polar=dict(radialaxis=dict(visible=True, range=[0, radial_max])),
        margin=dict(l=150, r=0, t=100, b=80),
    )

    title = 'Average Performance over 3 QA Datasets' if dataset == 'all' else dataset
    fig.update_layout(title=dict(text=title, font=dict(size=20), yref='container'))

    directory = "example_results"
    fig.show()
    # writing a static image relies on the kaleido package installed above
    fig.write_image(f"{directory}/radarplot.pdf")

In [None]:
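# model_id_base and model_id_instruct are not defined in the cells above; set them before running this cell
models = [model_id_base + "_base", model_id_instruct + "_instruct"]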
results_df = load_results(models)
visualize_radar(results_df, dataset='all')

## 2. Run Retrieval-Augmented Generation model

Ref. notebook (https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/llama-2-chat-completion.ipynb)

In this notebook we will demonstrate how to use [**Llama-2-7b**](https://ai.meta.com/llama/) to answer questions using a library of documents as a reference, by using document embeddings and retrieval. Unlike other RAG solutions, embeddings will be generated and combined with the embedding model to identify the nearest neighbors, all from a single endpoint in this solution.


To perform inference on the [Llama models](https://ai.meta.com/llama/), you need to pass custom_attributes='accept_eula=true' as part of the header. This means you have read and accepted the end-user license agreement (EULA) of the model. The EULA can be found in the model card description or on this [webpage](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

Note: Custom_attributes used to pass EULA are key/value pairs. The key and value are separated by '=' and pairs are separated by ';'. If the user passes the same key more than once, the last value is kept and passed to the script handler (i.e., in this case, used for conditional logic). For example, if 'accept_eula=false; accept_eula=true' is passed to the server, then 'accept_eula=true' is kept and passed to the script handler.
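
The "last value wins" rule described in the note can be illustrated with a small parser. This is only a sketch of the stated semantics, not the code the endpoint actually runs; the function name `parse_custom_attributes` is hypothetical.

```python
def parse_custom_attributes(raw: str) -> dict:
    """Parse 'k1=v1; k2=v2' into a dict (illustrative only)."""
    attributes = {}
    for pair in raw.split(";"):          # pairs are separated by ';'
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")  # key and value are separated by '='
        # A repeated key overwrites the earlier entry, so the last value is kept.
        attributes[key.strip()] = value.strip()
    return attributes


parsed = parse_custom_attributes("accept_eula=false; accept_eula=true")
print(parsed)
```

With the example string from the note, only `accept_eula=true` survives, which matches the behaviour described above.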


# Use the same code as in the RAG notebook below

In [None]:
!pip install PyMuPDF

In [None]:
import fitz

In [None]:
# Function to extract text using PyMuPDF
def extract_text_from_pdf_mupdf(pdf_path):
    text = ""
    document = fitz.open(pdf_path)
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        text += page.get_text()
    return text

In [None]:
# Extract the text from the PDF using PyMuPDF
pdf_text_mupdf = extract_text_from_pdf_mupdf("../lab-data/BONTONSTORESINC_04_20_2018-EX-99.3-AGENCY AGREEMENT.PDF")
pdf_text_mupdf[:2000]  # Displaying the first 2000 characters to get an overview of the content

In [None]:
def chunk_text(text, chunk_size, overlap):
    """
    Chunk text into smaller segments with a specified chunk size and overlap.

    Parameters:
    - text (str): The text to be chunked.
    - chunk_size (int): The size of each chunk.
    - overlap (int): The number of characters that overlap between chunks.

    Returns:
    - List[str]: A list of text chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("Chunk size must be greater than overlap")

    chunks = []
    start = 0
    end = chunk_size

    while start < len(text):
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap
        end = start + chunk_size

    return chunks
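
To see how `chunk_size` and `overlap` interact, the sketch below applies the same sliding window to a synthetic string (an assumption made purely for illustration). Each window starts `chunk_size - overlap` characters after the previous one, so the number of chunks is the text length divided by that stride, rounded up.

```python
import math

# Synthetic 2000-character string used only for illustration.
text = "".join(chr(ord("a") + i % 26) for i in range(2000))
chunk_size, overlap = 1024, 256
stride = chunk_size - overlap  # distance between consecutive chunk starts

# Same windowing as chunk_text, written as a list comprehension.
toy_chunks = [text[start:start + chunk_size] for start in range(0, len(text), stride)]

expected_count = math.ceil(len(text) / stride)
# The tail of one chunk is repeated at the head of the next one.
shared = toy_chunks[0][-overlap:] == toy_chunks[1][:overlap]

print(len(toy_chunks), expected_count, shared)
```

Note that the final chunk can be shorter than `chunk_size`, because the window runs past the end of the text.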

In [None]:
chunks = chunk_text(pdf_text_mupdf, 1024, 256)
print(len(chunks))
chunks[:2]

In [None]:
import boto3
import json

In [None]:
# Initialize the Bedrock client
client = boto3.client('bedrock-runtime', region_name='us-west-2')

In [None]:
# Function to embed text
def embed_text(input_text):
    # Create the request payload
    payload = {
        "inputText": input_text,
        "dimensions": 512,  # Specify the desired dimension size
        "normalize": True  # Whether to normalize the output embeddings
    }

    # Invoke the model
    response = client.invoke_model(
        body=json.dumps(payload),
        modelId='amazon.titan-embed-text-v2:0',  # Specify the Titan embedding model
        accept='application/json',
        contentType='application/json'
    )

    # Get the embedding result
    response_body = json.loads(response['body'].read())
    embedding = response_body.get('embedding')
    return embedding

# Print the embedding
print(embed_text(chunks[0]))

In [None]:
!pip install pyepsilla

In [None]:
!sh ../setup.sh

In [None]:
from pyepsilla import vectordb
## connect to vectordb
db = vectordb.Client(
  host='localhost',
  port='8888'
)

In [None]:
db.unload_db("kdd_lab1_rag")
db.load_db(db_name="kdd_lab1_rag", db_path="/tmp/kdd_lab1_rag")

In [None]:
db.use_db(db_name="kdd_lab1_rag")
db.create_table(
  table_name="NaiveRAG",
  table_fields=[
    {"name": "ID", "dataType": "INT", "primaryKey": True},
    {"name": "Doc", "dataType": "STRING"},
    {"name": "Embedding", "dataType": "VECTOR_FLOAT", "dimensions": 512}
  ]
)

In [None]:
records = [
    {
        "ID": index,
        "Doc": text,
        "Embedding": embed_text(text)
    }
    for index, text in enumerate(chunks)
]
records[:2]

In [None]:
db.insert("NaiveRAG", records)

In [None]:
def generate(prompt):
    # Create the request payload
    payload = {
        "prompt": prompt,
        "temperature": 0,  # Adjust the randomness of the output
        "max_gen_len": 128
    }

    # Initialize the Bedrock runtime client
    client = boto3.client('bedrock-runtime', region_name='us-west-2')

    # Invoke the model
    response = client.invoke_model(
        modelId='meta.llama3-8b-instruct-v1:0',
        contentType='application/json',
        accept='application/json',
        body=json.dumps(payload)
    )
    
    byte_response = response['body'].read()
    json_string = byte_response.decode('utf-8')

    # Get the chat response
    response_body = json.loads(json_string)
    chat_response = response_body.get('generation')

    return chat_response

# Example usage
input_text = "How are you?"
prompt = f"""
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

{input_text}[/INST]
"""

print(generate(prompt))

In [None]:
def basic_retriever(question, top_k):
    code, resp = db.query(
        table_name="NaiveRAG",
        query_field="Embedding",
        query_vector=embed_text(question),
        limit=top_k
    )
    return resp["result"]
basic_retriever("What's the agreement date?", 5)

In [None]:
def naive_rag(question):
    docs = basic_retriever(question, 5)
    docs_str = "------------------------\n"
    for doc in docs:
        docs_str += doc["Doc"] + "------------------------\n"
    prompt = f"""
<s>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

Your answer should be grounded by the information provided in the documents below.
Don't make up answers.
Don't explain your thought process.
Directly answer the question in concise way.

<documents>
{docs_str}
</documents>

{question}
"""
    return generate(prompt)

naive_rag("What's the agreement date?")

# rag notebook until here

# run cost experiment here

## 3. Run prompt experiment
After defining the performance baseline for the fine-tuned model, we now run an experiment to measure how much smaller the input is, in tokens, when using the fine-tuned model compared with the pre-trained model using RAG.
The goal is to measure the input-token savings obtained by not adding retrieved context to the prompt, as RAG solutions do.
#### Question to answer
When is it worth moving to RAG instead of fine-tuning the pre-trained model? We consider input token costs only, as this is the main cost difference between the same solution built with RAG or with fine-tuning.
RAG has an extra cost related to the context added to the prompt.
We need to measure this difference on average and use it to support a cost-based decision between RAG and fine-tuning.
In break-even terms: what fine-tuning frequency would cost the same as RAG, while keeping at least the same performance as the fine-tuned model?
We need to perform the following steps:
1. Define a batch of at least 30 prompts and compute the mean input size.
2. Run the pre-trained model + RAG with these prompts.
    2.1 Get the retrieved context and measure its mean size. This is the average number of tokens added relative to the fine-tuned model (with the same prompts).
3. Simulate the costs of a fine-tuned model, considering a basic AWS architecture (OpenSearch, SageMaker, S3, etc.), and calculate the total cost of each training run.
4. Create a break-even plot based on these two costs.
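
Steps 1 and 2.1 boil down to averaging token counts over a batch. The sketch below shows the bookkeeping on two toy (prompt, context) pairs. The whitespace tokenizer and the helper name `count_tokens` are assumptions made for illustration; a real tokenizer would give different counts.

```python
from statistics import mean


def count_tokens(text: str) -> int:
    # Whitespace splitting is a crude stand-in for a real tokenizer (assumption).
    return len(text.split())


# Toy (prompt, retrieved context) pairs; a real run would use at least 30 prompts.
samples = [
    ("What is Amazon SageMaker?",
     "Amazon SageMaker is a fully managed service to prepare data and build, "
     "train, and deploy machine learning models."),
    ("What is the service availability of Amazon SageMaker?",
     "Amazon SageMaker is designed for high availability. There are no "
     "maintenance windows or scheduled downtimes."),
]

mean_prompt_tokens = mean(count_tokens(prompt) for prompt, _ in samples)
mean_context_tokens = mean(count_tokens(context) for _, context in samples)

# The mean context size is the extra input RAG sends per query.
print(f"mean prompt tokens:  {mean_prompt_tokens:.1f}")
print(f"mean context tokens: {mean_context_tokens:.1f}")
```

The second number is the quantity of interest: it is the average input overhead that a fine-tuned model without retrieval would not pay.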


Let's place all of this logic into a single RAG query function:

In [52]:
# CASS: using the embedding model below: "huggingface-sentencesimilarity-gte-small"
# predictor_nn, predictor_base, construct_context, create_payload and format_messages
# are not defined in this notebook and must be available before this cell runs.
from typing import Tuple

import spacy

def rag_query(question: str) -> Tuple[int, str]:
    # Get the nearest neighbours (this payload works for the embedding model)
    payload_nearest_neighbour = {
        "queries": question,
        "top_k": 5,
        "mode": "nn_train_data",
        "return_text": True,
    }

    response = predictor_nn.predict(payload_nearest_neighbour)[0]
    # get contexts
    contexts = [ans["text"] for ans in response]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # count the number of tokens in the context text
    nlp = spacy.blank("en")
    context_tokens = nlp(context_str)
    # create our retrieval augmented prompt
    payload = create_payload(question, context_str)
    payload['inputs'] = format_messages(payload['inputs'][0])
    # make prediction
    out = predictor_base.predict(payload, custom_attributes="accept_eula=true")
    # return the context size in tokens together with the generated answer
    return len(context_tokens), out[0]['generated_text']

In [53]:
# Quick check of how spaCy counts tokens
import spacy

# Create a blank English language object, then
# tokenize the words of the sentence
nlp = spacy.blank("en")

doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")
print(len(doc))
for token in doc:
    print(token)


10
GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.


## CASS: TODO create a loop to run 30 questions from a list and calculate the average number of context tokens

#### Example of payload from here (https://studio-d-m1tidsvb7ha6.studio.us-east-1.sagemaker.aws/JumpStart/SageMakerPublicHub/Model/huggingface-sentencesimilarity-gte-small)

We can now ask the question:

In [60]:
num_tokens, out_prompt = rag_query("Does SageMaker support spot instances?")

With maximum sequence length 1000, selected top 2 document sections: 
Amazon SageMaker is designed for high availability. There are no maintenance windows or scheduled downtimes. SageMaker APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage.
Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.


In [61]:
num_tokens

97

In [66]:
!pwd

/home/sagemaker-user/KDD2024


In [73]:
import csv
with open("example_results/sagemaker_questions.csv", "r") as file:
    list_of_questions = list(csv.reader(file, delimiter=","))
list_of_questions

[['\ufeffWhat is Amazon SageMaker?'],
 ['In which Regions is Amazon SageMaker available?'],
 ['What is the service availability of Amazon SageMaker?'],
 ['How does Amazon SageMaker secure my code?'],
 ['What security measures does Amazon SageMaker have?'],
 ['Does Amazon SageMaker use or share models',
  ' training data',
  ' or algorithms?'],
 ['How am I charged for Amazon SageMaker?'],
 ['How can I optimize my Amazon SageMaker costs',
  ' such as detecting and stopping idle resources in order to avoid unnecessary charges?'],
 ['What if I have my own notebook', ' training', ' or hosting environment?'],
 ['Is R supported with Amazon SageMaker?'],
 ['How can I check for imbalances in my model?'],
 ['What kind of bias does Amazon SageMaker Clarify detect?'],
 ['How does Amazon SageMaker Clarify improve model explainability?'],
 ['What is Amazon SageMaker Studio?'],
 ['What is RStudio on Amazon SageMaker?'],
 ['How does Amazon SageMaker Studio pricing work?'],
 ['In which Regions is Amazo

In [78]:
l_tokens = []
for q in list_of_questions:
    # csv.reader splits questions that contain commas into several fields; rejoin them
    num_tokens, out_prompt = rag_query(",".join(q))
    l_tokens.append(num_tokens)
    print(num_tokens)

With maximum sequence length 1000, selected top 2 document sections: 
Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
Amazon SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest. Requests to the SageMaker API and console are made over a secure (SSL) connection. You pass AWS Identity and Access Management roles to SageMaker to provide permissions to access resources on your behalf for training and deployment. You can use encrypted Amazon Simple Storage Service (Amazon S3) buckets for model artifacts and data, as well as pass an AWS Key Management Service (KMS) key to SageMaker notebooks, training jobs, and endpoints, to encrypt the attached ML storage volume. Amazon SageMaker also supports Amazon Virtual Private Cloud (VPC) and AWS PrivateLink support.
166
With maximum sequence length 1000, se

In [106]:
# average number of context tokens over all questions
print(f"Average # context tokens over {len(l_tokens):2d} questions: {sum(l_tokens) / len(l_tokens):.2f}")

Average # context tokens over 31 questions: 155.39


### Cost analysis
#### Cost considerations
+ Costs common to fine-tuning (FT) and RAG will not be considered: Amazon API Gateway (WebSocket), Amazon CloudFront, Amazon CloudWatch, __Amazon DynamoDB (probably consider for RAG too?)__, AWS Lambda
+ Only input token costs for fine-tuning and RAG
+ Knowledge base costs for RAG
#### Costs considered: queries per day, number of RAG documents, number of tokens?
(https://docs.aws.amazon.com/solutions/latest/generative-ai-application-builder-on-aws/cost.html)
or these simplified external costs for the model, ref. (https://aws.amazon.com/bedrock/pricing/): Amazon Titan Embeddings, N/A
##### Ref. Sample costs for a text-based proof of concept
Ex. Amazon Bedrock (Titan Text Express) -> Assumptions for 100 interactions per day:
+ Monthly cost for 150K input tokens per day = $3.60
+ Monthly cost for 16K output tokens per day = $0.768

Ex. Amazon Kendra for RAG:
 + 0-8,000 queries a day and up to 100,000 documents with Amazon Kendra Enterprise Edition with 0-50 data sources = $1,008.00
#### simulate FT costs:
#### simulate RAG costs
#### Breakeven analysis
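
A minimal break-even sketch, under loudly stated assumptions: the RAG side pays for the extra context tokens plus a fixed knowledge base fee, and the fine-tuning side pays a fixed amount per training run. The 155.39 average context tokens, the $1,008.00 Kendra figure and the input price implied by the $3.60 sample come from this notebook. The $500 cost per fine-tuning run and both function names are hypothetical placeholders.

```python
DAYS_PER_MONTH = 30  # assumption


def rag_extra_monthly_cost(queries_per_day: float, context_tokens: float,
                           price_per_1k_input: float,
                           knowledge_base_monthly: float) -> float:
    """Extra monthly cost of RAG: context tokens plus the knowledge base fee."""
    token_cost = (queries_per_day * DAYS_PER_MONTH
                  * context_tokens / 1000 * price_per_1k_input)
    return token_cost + knowledge_base_monthly


def break_even_runs_per_month(rag_monthly: float, cost_per_fine_tune_run: float) -> float:
    """Fine-tuning runs per month at which fine-tuning costs as much as RAG."""
    return rag_monthly / cost_per_fine_tune_run


rag_monthly = rag_extra_monthly_cost(
    queries_per_day=8000,          # upper bound of the Kendra tier quoted above
    context_tokens=155.39,         # average measured earlier in this notebook
    price_per_1k_input=0.0008,     # implied by $3.60 for 150K input tokens per day
    knowledge_base_monthly=1008.00,
)
runs = break_even_runs_per_month(rag_monthly, cost_per_fine_tune_run=500.0)  # hypothetical cost

print(f"RAG extra cost per month: ${rag_monthly:.2f}")
print(f"Break-even fine-tuning frequency: {runs:.2f} runs per month")
```

Under these placeholder numbers, fine-tuning is cheaper as long as the model is retrained less often than the break-even frequency. Sweeping `queries_per_day` and plotting both costs would give the break-even plot described in the steps.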
#### Putting it all together: performance and cost tables and plots to support the cost comparison analysis

We can also ask questions about things that are out of context (not contained within our dataset). From this we expect the model to *not* hallucinate and honestly tell us that it does not know the answer:

###### Accessing the FT pre-trained endpoint and making predictions with a batch of 30 prompts