# Retrieval Augmented Generation

We will use semantic search to find the best-matching book for a given author, genre, and plot summary. [Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401) (RAG) combines a retrieval model with a generative model: relevant information is retrieved first and then incorporated into the generation process. In this notebook, we'll walk through enhancing an OpenSearch cluster search with generative AI to output conversational book recommendations based on a desired author, genre, and plot summary.

### 1. Install OpenSearch ML Python library

For this notebook we require the use of a few key libraries. We'll use the Python clients for OpenSearch and SageMaker, and Python frameworks for text embeddings.

Note: Running the code cell below does not produce any output.

In [None]:
%%capture
## Code Cell 1 ##
!pip install opensearch-py-ml accelerate tqdm --quiet
!pip install sagemaker --upgrade --quiet

### 2. Check PyTorch Version


Let's import PyTorch and confirm which version is installed. The version should already be 2.0.0 or higher.

In [None]:
## Code Cell 2 ##

import torch
print(torch.__version__)
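
If you would rather check the version programmatically than read it off the output, a comparison along the following lines is one option. This is a minimal sketch: the helper name `version_tuple` and the sample version string are assumptions for illustration, and in the notebook you would pass `torch.__version__` instead.

```python
import re

def version_tuple(version: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())

# Assumed example value; local build suffixes such as "+cu118" are ignored.
sample_version = "2.0.1+cu118"
print(version_tuple(sample_version) >= (2, 0, 0))
```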

### 3. Import libraries
The cell below imports all the relevant libraries and modules used in this notebook.

In [None]:
## Code Cell 3 ##

import boto3
import os
import time
import json
import pandas as pd
from tqdm import tqdm
import sagemaker
from opensearchpy import OpenSearch, RequestsHttpConnection
from sagemaker import get_execution_role


### 4. Prepare data

This lab combines semantic search with a generative model to present the retrieved data to the user. Below is a dataset of book information and plot summaries. We'll sample this dataset to recommend books that resemble the user-provided description.

### Note
The original data set is described here: 
https://www.cs.cmu.edu/~dbamman/booksummaries.html

Run the following cells to inspect the dataset, transform it into a pandas DataFrame, clean-up missing values, and sample a subset of the data.

In [None]:
## Code Cell 4 ##

# Read in the tab-delimited data file and print the shape of the resulting data frame.
# The file has no header row, so header=None keeps the first record as data.
pd.options.mode.chained_assignment = None
df = pd.read_csv('booksummaries.txt', sep='\t', header=None)
print(df.shape)

# Add column headers to the data frame so the columns can be referenced by name.
df.columns = ['WikiPediaId','FreeBaseId','title','author','pub_date','genres','plot_summary']

# Display entries with null data in any column (NaN).
df[df.isnull().any(axis=1)]


In [None]:
## Code Cell 5 ##

# Drop any rows that contain null values in any column. In a real production application you would want to replace NaN values with correct data.
df_1 = df.dropna().copy()

# Clean up Freebase markup and other unwanted characters in the genres column.
# Raw strings avoid invalid-escape warnings; assigning the result back avoids relying on chained in-place edits.
df_1['genres'] = df_1['genres'].replace(to_replace=r'"/m/.{4,7}":', value='', regex=True)   # remove Freebase markup
df_1['genres'] = df_1['genres'].replace(to_replace=r'\\u00e0\sclef', value='', regex=True)   # remove the escaped "\u00e0 clef" sequence
df_1['genres'] = df_1['genres'].replace(to_replace=r'\{\s"|"\}', value='', regex=True)       # remove {" or "}
df_1['genres'] = df_1['genres'].replace(to_replace='"', value='', regex=True)                # remove double quotes

# Only use the first value as the genre
df_2 = df_1['genres'].str.split(',', expand=True, n=1)
df_2.rename( columns={0:'genre'}, inplace=True )
df_2.drop(columns=[1],inplace=True)


df_3 = pd.concat([df_1, df_2], axis=1)

# Trim the size of the plot summary to 500 characters.
df_3['plot_summary'] = df_3['plot_summary'].apply(lambda x: ' '.join(x[:500].split(' ')[:-1]) if len(x) > 500 else x)
df_3['book_summary'] = df_3[['title','author','genre','plot_summary']].agg(' '.join, axis=1)

# Sort by the author column, and to keep the lab within a reasonable time-frame drop the last 5000 rows of data.
# Drop other columns not needed.
df_3.sort_values(by=['author'],inplace = True)
df_3.drop(df_3.tail(5000).index,inplace = True)
df_3.drop(columns=['genres','WikiPediaId','FreeBaseId','plot_summary'],inplace=True)

# Create a list of dictionaries (one per row) from the remaining data to be used for further processing.
wm_list = df_3.to_dict('records')

# Let's look at the data now that it has been cleansed and trimmed.
df_3
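
To make the cleanup steps concrete, the sketch below applies the same regular expressions and the same truncation rule to single strings, using only the standard library. The sample genre string and the helper names `first_genre` and `truncate_summary` are illustrative assumptions and are not part of the lab code.

```python
import re

def first_genre(raw: str) -> str:
    """Strip Freebase markup from a raw genres string and keep the first genre."""
    cleaned = re.sub(r'"/m/.{4,7}":', '', raw)    # remove Freebase IDs such as "/m/014dfn":
    cleaned = re.sub(r'\{\s"|"\}', '', cleaned)   # remove the leading { " and the trailing "}
    cleaned = cleaned.replace('"', '')            # remove any remaining double quotes
    return cleaned.split(',', 1)[0]               # keep only the first genre

def truncate_summary(text: str, limit: int = 500) -> str:
    """Cut text to at most `limit` characters, dropping the last (possibly partial) word."""
    if len(text) <= limit:
        return text
    return ' '.join(text[:limit].split(' ')[:-1])

# Assumed sample value in the same shape as the raw genres column.
sample = '{"/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"}'
print(first_genre(sample))
print(truncate_summary("the quick brown fox", limit=12))
```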

### 5. Create an OpenSearch cluster connection.
Next, we'll use the Python client to set up a connection to the OpenSearch cluster.

Note: If you're using a region other than us-east-1, update the region in the code below. Also, be sure to replace the \<StackName\> placeholder with the value you copied in an earlier step.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
## Code Cell 6 ##

## Update the below <StackName> placeholder with the value from your Lab you copied in an earlier step. 
cloudformation_stack_name = '<StackName>'

region = 'us-east-1' 

# Pass the region explicitly so that changing it above takes effect.
cfn = boto3.client('cloudformation', region_name=region)

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OSDomainEndpoint']

outputs

In [None]:
## Code Cell 7 ##

## To authenticate to the OpenSearch domain we need to retrieve username and password stored in Secrets Manager.
secrets = boto3.client('secretsmanager')
os_domain_secret = secrets.list_secrets(
    Filters=[
        {
            'Key':'name',
            'Values': ['DomainMasterUser']
        }
    ]
)['SecretList'][0]['Name']

aos_credentials = json.loads(secrets.get_secret_value(SecretId=os_domain_secret)['SecretString'])

auth = (aos_credentials['username'], aos_credentials['password'])
print("Retrieved credentials for user:", aos_credentials['username'])  # avoid printing the password

## The client below will be used in later steps.
aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### 6. Get SageMaker endpoint for embedding

---
This SageMaker endpoint hosts the Bloom 7B1 Embedding FP16 model, which converts text into a vector.
This transformer-based model from Hugging Face has no text generation head; it takes a text string as input and produces an embedding vector with 4096 dimensions. 

Note: Make sure to replace the \<SageMaker_Embedding_Endpoint_Name\> placeholder with the value you copied in an earlier step.

Define a function that converts text into a vector using the SageMaker embedding endpoint created in an earlier step. Running the code cell below does not produce any output.

In [None]:
## Code Cell 8 ##

embedding_endpoint_name = '<SageMaker_Embedding_Endpoint_Name>'

def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
   
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )

    response_json = json.loads(response['Body'].read().decode("utf-8"))
    embeddings = response_json["embedding"]
    if len(embeddings) == 1:
        return [embeddings[0]]
    return embeddings


### 7. Test the embeddings endpoint with a sample phrase
Given any text phrase, the endpoint converts the text into a vector of size 4096. We're also creating a function `embed_phrase` so that we can call it later. Running the code cell below does not produce any output.

In [None]:
## Code Cell 9 ##

def embed_phrase(input_data):
    input_str = json.dumps({"text_inputs": input_data})
    encoded_input_str = input_str.encode("utf-8")
    features = query_endpoint_with_json_payload(encoded_input_str,embedding_endpoint_name)
    return features
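
The request and response shapes used by `embed_phrase` can be checked offline before calling the endpoint. The sketch below assumes the same `text_inputs` request key and `embedding` response key used in the cells above. The helper names and the `fake_response_body` value are made up for illustration, and the fake vector is far shorter than a real 4096-dimension embedding.

```python
import json

def build_embedding_payload(text: str) -> bytes:
    """Encode the request body in the shape the embedding endpoint is assumed to expect."""
    return json.dumps({"text_inputs": text}).encode("utf-8")

def parse_embedding_response(body: bytes) -> list:
    """Decode a response body and return its list of embedding vectors."""
    return json.loads(body.decode("utf-8"))["embedding"]

payload = build_embedding_payload("A book about Tarzan")

# Hand-written stand-in for response['Body'].read(); not real endpoint output.
fake_response_body = json.dumps({"embedding": [[0.1, -0.2, 0.3]]}).encode("utf-8")
vectors = parse_embedding_response(fake_response_body)

print(payload)
print(len(vectors), len(vectors[0]))
```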

Ask a general question about a book.

In [None]:
## Code Cell 10 ##

question_on_book="A book by the author edgar rice burroughs that is genre of science fiction and involves Tarzan the ape man"
result = embed_phrase(question_on_book)

print(len(result[0]))
result[0][:10]


### 8. Create an index in Amazon OpenSearch Service 
Whereas we previously created an index with 2-3 fields, this time we'll define the index with multiple fields: a vector representation of the `book_summary` field, plus several other fields present within the dataset.

To create the index, we first define the index in JSON, then use the aos_client connection we defined earlier to create the index in OpenSearch.

In [None]:
## Code Cell 11 ##

knn_index = {
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "booksummary_vector": {
                "type": "knn_vector",
                "dimension": 4096,
                "store": True
            },
            "book_summary": {
                "type": "text",
                "store": True
            },
            "author": {
                "type": "text",
                "store": True
            },
            "title": {
                "type": "text",
                "store": True
            },
            "pub_date": {
                "type": "text",
                "store": True
            },
            "genre": {
                "type": "text",
                "store": True
            },
        }
    }
}
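
The `cosinesimil` space type compares vectors by the angle between them rather than by their magnitude. As a rough illustration only (OpenSearch computes similarity internally, and this is not its implementation), cosine similarity can be sketched in plain Python. The function name and the sample vectors are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```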


Using the above index definition, we now need to create the index in Amazon OpenSearch. Running this cell will recreate the index if you have already executed this notebook.

In [None]:
## Code Cell 12 ##

index_name = "book_knowledge_base"

# Delete the index if it already exists so the notebook can be re-run, then create it.
# Checking for existence explicitly avoids a bare except that would hide unrelated errors.
if aos_client.indices.exists(index=index_name):
    print("Recreating index '" + index_name + "' on cluster.")
    aos_client.indices.delete(index=index_name)
else:
    print("Index '" + index_name + "' not found. Creating index on cluster.")
aos_client.indices.create(index=index_name,body=knn_index,ignore=400)


Let's verify the created index information.

In [None]:
## Code Cell 13 ##

aos_client.indices.get(index=index_name)

### 9. Load the raw data into the Index
Next, let's load the book summary data into the index we've just created. During ingestion, the `os_import` function also converts the `book_summary` field to a vector (embedding) by calling the previously created endpoint through the `embed_phrase` function.

In [None]:
## Code Cell 14 ##

def os_import(record, aos_client, index_name):
    book_summary = record["book_summary"]
    search_vector = embed_phrase(book_summary)
    aos_client.index(index=index_name,
             body={"booksummary_vector": search_vector[0], 
                   "book_summary": record["book_summary"],
                   "author":record["author"],
                   "genre":record["genre"],
                   "pub_date":record["pub_date"],
                   "title":record["title"]
                  }
            )

print("Loading records...")
for record in tqdm(wm_list): 
    os_import(record, aos_client, index_name)
print("Records loaded.")
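
Indexing one document per request is simple but can be slow for thousands of records. One possible alternative, sketched below, is to batch the documents into bulk actions that a helper such as `opensearchpy.helpers.bulk` could consume. The `build_bulk_actions` helper, the `fake_embed` stand-in, and the sample records are hypothetical and exist only so the sketch runs without a cluster or an endpoint.

```python
def build_bulk_actions(records, index_name, embed_fn):
    """Yield one bulk action per record, embedding the book summary on the way."""
    for record in records:
        vector = embed_fn(record["book_summary"])[0]
        yield {
            "_index": index_name,
            "_source": {
                "booksummary_vector": vector,
                "book_summary": record["book_summary"],
                "author": record["author"],
                "genre": record["genre"],
                "pub_date": record["pub_date"],
                "title": record["title"],
            },
        }

def fake_embed(text):
    """Stand-in for embed_phrase: returns a tiny made-up vector instead of calling SageMaker."""
    return [[float(len(text)), 0.0]]

# Made-up records in the same shape as the entries of wm_list.
sample_records = [
    {"title": "Book A", "author": "Author A", "genre": "Fiction",
     "pub_date": "1999", "book_summary": "Book A Author A Fiction A short plot."},
    {"title": "Book B", "author": "Author B", "genre": "Fantasy",
     "pub_date": "2005-06", "book_summary": "Book B Author B Fantasy Another plot."},
]

actions = list(build_bulk_actions(sample_records, "book_knowledge_base", fake_embed))
print(len(actions), actions[0]["_source"]["title"])

# With a live client, the actions could then be sent in batches, for example:
#   from opensearchpy import helpers
#   helpers.bulk(aos_client, actions)
```

Note that the embedding call, not the indexing call, may still dominate the load time, so any gain from batching depends on the endpoint.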

To validate the load, we'll query the number of documents in the index. We should have approximately 4200 hits in the index, or however many records remained after the sampling step earlier.

In [None]:
## Code Cell 15 ##

res = aos_client.search(index=index_name, body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

### 10. Search vector with "Semantic Search" 

Now we can define a helper function that executes the search query to find the book whose summary most closely matches the requested description. `retrieve_opensearch_with_semantic_search` embeds the search phrase, searches the index for the closest matching vectors, and returns the top result.


In [None]:
## Code Cell 16 ##

def retrieve_opensearch_with_semantic_search(phrase, n=2):
    
    search_vector = embed_phrase(phrase)[0]

    osquery={
        "_source": {
            "exclude": [ "booksummary_vector" ]
        },
        
      "size": n,
      "query": {
        "knn": {
          "booksummary_vector": {
            "vector":search_vector,
            "k":n
          }
        }
      }
    }

    res = aos_client.search(index=index_name, 
                           body=osquery,
                           stored_fields=["title","author","pub_date", "genre", "book_summary"],
                           explain = True)
    top_result = res['hits']['hits'][0]  # hits are ordered by score, so index 0 is the closest match
    
    result = {
        "title":top_result['_source']['title'],
        "author":top_result['_source']['author'],
        "pub_date":top_result['_source']['pub_date'],
        "genre":top_result['_source']['genre'],
        "book_summary":top_result['_source']['book_summary'],
    }
    
    return result
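
The helper above returns a single hit. If several candidates are wanted, for example to compare the nearest neighbors side by side, the response can be unpacked into a list instead. The sketch below uses a hand-written `mock_response` that mimics the `hits.hits[*]._source` layout of a search response; the helper name `extract_hits` and all values in the mock are assumptions.

```python
FIELDS = ("title", "author", "pub_date", "genre", "book_summary")

def extract_hits(response, fields=FIELDS):
    """Return one dict per hit, in ranked order, containing only the requested fields."""
    return [
        {field: hit["_source"].get(field) for field in fields}
        for hit in response["hits"]["hits"]
    ]

# Made-up response for illustration; a real one comes from aos_client.search(...).
mock_response = {
    "hits": {
        "hits": [
            {"_score": 0.91, "_source": {"title": "Book A", "author": "Author A",
                                         "pub_date": "1999", "genre": "Fiction",
                                         "book_summary": "Summary A"}},
            {"_score": 0.87, "_source": {"title": "Book B", "author": "Author B",
                                         "pub_date": "2005", "genre": "Fantasy",
                                         "book_summary": "Summary B"}},
        ]
    }
}

for rank, book in enumerate(extract_hits(mock_response), start=1):
    print(rank, book["title"], book["author"])
```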


Use the semantic search to get similar records for the sample question.

In [None]:
## Code Cell 17 ##

example_request = retrieve_opensearch_with_semantic_search(question_on_book)
print(question_on_book)
print(example_request)

### 11. Get SageMaker endpoint for content generation

We are using the Falcon 7B LLM in this lab. Please refer to the Hugging Face documentation for more information: https://huggingface.co/tiiuae/falcon-7b

In [None]:
## Code Cell 18 ##

llm_endpoint_name=outputs['LLMEndpointName']
print(llm_endpoint_name)


Define a function that uses the LLM to generate content. Because the LLM was trained on static data that may be outdated, and it lacks business domain knowledge, the generated content may not be factual (hallucination).

In [None]:
## Code Cell 19 ##

def query_llm_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    model_predictions = json.loads(response["Body"].read())
    return [gen["generated_text"] for gen in model_predictions]

def query_llm_with_hallucination(question):
    payload = {
        "inputs": question,
        "parameters":{
            "max_new_tokens": 1024,
            "num_return_sequences": 1,
            "top_k": 100,
            "top_p": 0.95,
            "do_sample": False,
            "return_full_text": True,
            "temperature": 0.9
        }
    }
    query_response = query_llm_endpoint_with_json_payload(json.dumps(payload).encode("utf-8"), endpoint_name=llm_endpoint_name)
    return query_response


Check the generated result from the LLM by invoking the endpoint directly.

In [None]:
## Code Cell 20 ##

generated_texts = query_llm_with_hallucination(question_on_book)

print(f"The recommended book from LLM without vector embedding: \n\n{generated_texts[0]}\n")

### Retrieval Augmented Generation
---
To mitigate the LLM hallucination issue, we can retrieve relevant context and include it in a guiding prompt, so that the LLM can use that information to generate more relevant results. 


### 12. Create a prompt for the LLM using the search results from OpenSearch

We will be using the Falcon-7B model for one-shot generation, using a canned recommendation and response to guide the output. 

Before querying the model, the function `generate_prompt_to_llm` below builds a prompt for one-shot generation. The function takes an input string, searches the OpenSearch cluster for a matching book, then composes the prompt for the LLM. The prompt has roughly the following format:

```
A librarian uses their vast knowledge of books to make great recommendations people will enjoy. As a librarian, you must include the author, the title of the book, and a summary of the plot relating to the following phrase: {original_question_on_book}.

Data:{'book_summary': ' Tarzan tracks down a man who has been mistaken for him. The man is under the delusion that he is Tarzan, and he is living in a lost city inhabited by people descended from early Portuguese explorers. The plot devices of a lost city and a Tarzan "double" or impostor had been used by Burroughs in some previous Tarzan novels.', 'author': 'Edgar Rice Burroughs', 'title': 'Tarzan and the Madman', 'pub_date': 1964, 'genres': 'Science Fiction'}

Recommendation:I have a wonderful book recommendation for you. It's a story about Tarzan and how a madman has been impersonating him. The author is Edgar Rice Burroughs and it's a science fiction book with adventure and fun. It was published in the year 1964. 

Data: {retrieved_documents}

Recommendation:
```



In [None]:
## Code Cell 21 ##

def generate_prompt_to_llm(original_question_on_book):
    retrieved_documents = retrieve_opensearch_with_semantic_search(original_question_on_book)
    print("retrieved relevant book per your query is : \n" + str(retrieved_documents))
    print("------------")
    one_shot_description_example = "{'book_summary': 'Tarzan tracks down a man who has been mistaken for him. The man is under the delusion that he is Tarzan, and he is living in a lost city inhabited by people descended from early Portuguese explorers. The plot devices of a lost city and a Tarzan double or impostor had been used by Burroughs in some previous Tarzan novels.', 'author': 'Edgar Rice Burroughs', 'title': 'Tarzan and the Madman', 'genre': 'Science fiction', 'pub_date': '1964'}"
    one_shot_response_example = "It's a real page turning story about Tarzan and how a madman has been impersonating him. The author is Edgar Rice Burroughs and it's a science fiction book with adventure and lots of fun. It was published in the year 1964."
    prompt = (
        f" Make a book recommendation that is similar to the {original_question_on_book} The recommendation must include the title of the book, the author and genre: \n"
        f"Data: {one_shot_description_example} \n Recommendation: {one_shot_response_example} \n"
        f"Data: {retrieved_documents} \n Recommendation:"
    )
    return prompt
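
One design option worth considering is to separate prompt construction from retrieval, so the prompt text can be inspected or tested without a cluster. The sketch below is a hypothetical variant, not the lab's function: `build_one_shot_prompt` and the sample documents are assumptions, and the wording mirrors the prompt used above.

```python
def build_one_shot_prompt(question, example_data, example_response, retrieved_document):
    """Assemble a one-shot prompt: instruction, one worked example, then the new data."""
    return (
        f"Make a book recommendation that is similar to the {question} "
        "The recommendation must include the title of the book, the author and genre: \n"
        f"Data: {example_data} \n Recommendation: {example_response} \n"
        f"Data: {retrieved_document} \n Recommendation:"
    )

# Made-up inputs for illustration only.
example_data = {"title": "Example Title", "author": "Example Author", "genre": "Fiction"}
example_response = "I recommend Example Title by Example Author, a fiction book."
retrieved = {"title": "Retrieved Title", "author": "Retrieved Author", "genre": "Fantasy"}

prompt = build_one_shot_prompt("a fantasy adventure", example_data, example_response, retrieved)
print(prompt)
```

The prompt ends with an open `Recommendation:` so that the model continues the pattern set by the worked example.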

### 13. Format LLM prompt and query using the generated prompt
We also need a few more helper functions to query the LLM. `generate_llm_input` wraps the generated prompt and the generation parameters in the input format the endpoint expects. The response is parsed by `query_llm_endpoint_with_json_payload`, defined earlier.

`query_llm_with_rag` combines everything we've done in this module. It does all of the following:
- generates a vector for the input
- searches the OpenSearch index with semantic search for the most relevant book
- generates an LLM prompt from the search results
- queries the LLM with the RAG prompt for a response

In [None]:
## Code Cell 22 ##

def generate_llm_input(data, **kwargs):
    default_kwargs = {
        "num_beams": 5,
        "no_repeat_ngram_size": 3,
        "do_sample": True,
        "max_new_tokens": 100,
        "temperature": 0.9,
        "watermark": True,
        "top_k": 200,
        "max_length": 200,
        "early_stopping": True
    }
    
    default_kwargs = {**default_kwargs, **kwargs}
    
    input_data = {
        "inputs": data,
        "parameters": default_kwargs
    }
    
    return input_data

def query_llm_with_rag(description, **kwargs):
    prompt = generate_prompt_to_llm(description)
    query_payload = generate_llm_input(prompt, **kwargs)
    response = query_llm_endpoint_with_json_payload(json.dumps(query_payload).encode("utf-8"), endpoint_name=llm_endpoint_name)
    return response
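
Because `generate_llm_input` merges `**kwargs` over its defaults, individual generation parameters can be overridden per call without editing the function. The sketch below isolates that merge; `DEFAULT_PARAMS` is a trimmed, assumed copy of the defaults and `merge_generation_params` is a hypothetical helper name.

```python
# Trimmed copy of the defaults, for illustration only.
DEFAULT_PARAMS = {"do_sample": True, "max_new_tokens": 100, "temperature": 0.9}

def merge_generation_params(**overrides):
    """Return a new dict in which caller overrides replace the defaults."""
    return {**DEFAULT_PARAMS, **overrides}

merged = merge_generation_params(temperature=0.2, max_new_tokens=50)
print(merged)
print(DEFAULT_PARAMS)  # the defaults themselves are left untouched
```

In the notebook, the equivalent call would look like `query_llm_with_rag(question_on_book, temperature=0.2)`.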

#### And finally, let's call the function and get a book recommendation.

In [None]:
## Code Cell 23 ##

recommendation = query_llm_with_rag(question_on_book)
print(question_on_book)
print(recommendation)

### 14. DIY Section

To complete the solution you will need to make minor modifications to the code cells below to show your understanding of the concepts taught in this lab.

- Clean up the existing DataFrame by replacing all pub_date values that are in YYYY-MM or YYYY format with a full date in YYYY-MM-DD format
- Review the results that are obtained from the LLM by performing prompt engineering
- Upload the resulting files to an S3 bucket for validation

In [None]:
# DIY Code Cell 1 ##

# DIY: Modify the two lines below and replace the <column> placeholder with the column that stores the book's published date.
df_3.<column>.replace(r'^(\d{4})$', r'\1-01-01', regex=True,inplace=True)
df_3.<column>.replace(r'^(\d{4})-(\d{2})$', r'\1-\2-01', regex=True,inplace=True)

df_3
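
To see what the two regular expressions do before running them on the DataFrame, they can be tried on plain strings. This is a minimal sketch using the standard `re` module; the helper name `normalize_pub_date` and the sample values are assumptions.

```python
import re

def normalize_pub_date(value: str) -> str:
    """Pad YYYY or YYYY-MM dates to YYYY-MM-DD; leave other values unchanged."""
    value = re.sub(r'^(\d{4})$', r'\1-01-01', value)          # YYYY    -> YYYY-01-01
    value = re.sub(r'^(\d{4})-(\d{2})$', r'\1-\2-01', value)  # YYYY-MM -> YYYY-MM-01
    return value

for sample_date in ["1964", "1957-11", "2001-05-17"]:
    print(sample_date, "->", normalize_pub_date(sample_date))
```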


In [None]:
# DIY Code Cell 2 ##

question_on_book_diy = 'I read the following book ' + question_on_book + ' and hated it, can you recommend something different. Maybe a love story?'

def generate_prompt_to_llm(original_question_on_book):
    retrieved_documents = retrieve_opensearch_with_semantic_search(original_question_on_book)
    print("retrieved relevant book per your query is : \n" + str(retrieved_documents))
    print("------------")
    one_shot_description_example = "{'book_summary': 'The novel opens during a Russian Orthodox funeral liturgy, or panikhida, for Yuri's mother, Marya Nikolaevna Zhivago. Having long ago been abandoned by his father, Yuri is taken in by his maternal uncle, a former Orthodox priest and philosopher. Formerly a wealthy member of Moscow's merchant gentry, Yuri's father, Andrei Zhivago, has squandered the family's millions through debauchery and carousing, and has been progressively bled dry by the corrupt lawyer Viktor', 'author': 'Boris Pasternak', 'title': 'Doctor Zhivago', 'genre': 'Novel', 'pub_date': '1957'}"
    one_shot_response_example = "Here is a book that is very different from the one you read recently. It is not science fiction. Doctor Zhivago is a novel set in revolutionary Russia during the 1900s. It is about Yuri Zhivago as he deals with revolution and social upheaval in his native country of Russia."
    prompt = (
        f" Make a book recommendation that is different than the author or genre as described here: {original_question_on_book} The output must include the title, the author and genre: \n"
        f"Data: {one_shot_description_example} \n Recommendation: {one_shot_response_example} \n"
        f"Data: {retrieved_documents} \n Recommendation:"
    )
    return prompt


recommendation = query_llm_with_rag(question_on_book_diy)
print(question_on_book_diy)
print(recommendation)



In [None]:
# DIY Code Cell 3 ##

## NO NEED TO CHANGE ANY OF THIS CODE ##
## RUN THIS CELL AS-IS TO COMPLETE THE DIY ##
FILENAMES=['diy_out1.csv','diy_out2.txt']
BUCKET='model-training'

# Output the updated dataframe with correct pub_date to a csv file.
df_3.to_csv(FILENAMES[0])

# Output the new book recommendation to a text file.
diy_file2 = open(FILENAMES[1], 'w')
diy_file2.writelines(question_on_book_diy)
diy_file2.writelines(recommendation)
# Closing file
diy_file2.close()

s3 = boto3.client('s3')
response = s3.list_buckets()
buckets = response['Buckets']

# Find the correct bucket to upload the diy files to.
bucket = [x['Name'] for x in buckets if BUCKET in x['Name']][0]

for files in FILENAMES:
    print(f"Uploading file {files} to S3 bucket {bucket}")
    with open(files, "rb") as objects:
        s3.upload_fileobj(objects, bucket, files)
