# Ragged - Retrieval-Augmented Generation - Q&A with FAQs

This notebook is inspired by Hugging Face's [RAG with source highlighting using Structured generation](https://huggingface.co/learn/cookbook/en/structured_generation) and [Advanced RAG on Hugging Face documentation using LangChain](https://huggingface.co/learn/cookbook/en/advanced_rag) demo notebooks.

## Table of Contents

1. [Preface](#preface)
1. [Setup](#setup)
1. [Semantic Search BETWEEN Documents](#between)
1. [LLM Q&A](#qa)
1. [RAG Q&A](#rag)

## Preface <a name="preface"></a>

This is a Jupyter notebook. It is an interactive document with Markdown and code cells. The code cells can be run to output results directly in the notebook, as exemplified below.

In [5]:
# I am a comment in a code cell
my_variable = 'uncut gems'
print(my_variable)

uncut gems


Additionally, notebook cells can be run **in any order**. Re-running cells, or running them out of sequence, can produce results that are misleading or cannot be reproduced. Below, we will showcase running cells out of order and how that affects results.

In [2]:
my_variable = 'labyrinth'

In [3]:
my_string = 'I expect my string to mention the movie labyrinth'
# should return True
my_string == f'I expect my string to mention the movie {my_variable}'  # this is an f string, they're great

True

In [4]:
print(my_string)
print(f'I expect my string to mention the movie {my_variable}')

I expect my string to mention the movie labyrinth
I expect my string to mention the movie labyrinth


Notice how the topmost code cell says `[5]` instead of `[1]`. We re-ran that code cell, updating `my_variable`'s assignment to `uncut gems`. Now, our results will not be consistent and will no longer read consecutively.

In [6]:
# evaluates to False now; nothing is displayed because a cell only echoes its last expression
my_string == f'I expect my string to mention the movie {my_variable}'
print(my_string)
print(f'I expect my string to mention the movie {my_variable}')

I expect my string to mention the movie labyrinth
I expect my string to mention the movie uncut gems


A notebook runs a Python environment via **a kernel**. We will usually see the name of the kernel in the top right corner of the notebook toolbar. The circle next to the kernel name is filled in while a cell is running, and it also indicates when there is an error connecting to the kernel.

The kernel is just the connection between the notebook and the environment. The kernel is a great tool for configuring your notebook. The environment underlies the kernel; any package updates or installations are made directly to that environment.

Our kernel is `.venv`. Below, using `sys.executable`, we can see which Python interpreter, and therefore which environment, our kernel is using.

In [7]:
import sys
print(sys.executable)

/Volumes/G-DRIVE ArmorATD/hadsterface/.venv/bin/python
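
As a small complement to the cell above, the sketch below checks whether the interpreter is running inside a virtual environment. It assumes the environment was created with `venv` (or a compatible tool), where `sys.prefix` and `sys.base_prefix` differ; the variable name `in_virtual_env` is ours, for illustration only.

```python
import sys

# sys.executable is the path of the Python interpreter the kernel is running
print(sys.executable)

# sys.prefix points at the active environment; inside a virtual environment
# it differs from sys.base_prefix (the installation the environment was built from)
in_virtual_env = sys.prefix != sys.base_prefix
print(f'Running inside a virtual environment: {in_virtual_env}')
```

If this prints `False` while you expected `.venv`, the notebook is probably connected to a different kernel than intended.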


## Setup <a name="setup"></a>

First, we will start by importing all relevant modules and methods.

In [8]:
from huggingface_hub import InferenceClient  # LLM access and use
import datasets  # access and load data
from sentence_transformers import SentenceTransformer, util  # cosine similarity for semantic search
import torch  # torch for conversion of data to tensors
from getpass import getpass  # to safely input HF access token without sharing

Next, we will load our data. Note that the following cell will **DOWNLOAD** the dataset if it isn't already available in the project.

In [9]:

doc_ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")
type(doc_ds)

datasets.arrow_dataset.Dataset

We can see that our dataset is an Arrow `Dataset`. It has two `features` (data inputs), `text` and `source`, and 2647 rows (data points).

In [10]:
doc_ds

Dataset({
    features: ['text', 'source'],
    num_rows: 2647
})

We will convert the data to a pandas dataframe, which will convert each `feature` into a column. Each row will correspond to a data point.

In [11]:
doc_ds.set_format('pandas')
doc_df = doc_ds[:]
doc_df.head()

Unnamed: 0,text,source
0,"Create an Endpoint\n\nAfter your first login,...",huggingface/hf-endpoints-documentation/blob/ma...
1,Choosing a metric for your task\n\n**So you'v...,huggingface/evaluate/blob/main/docs/source/cho...
2,主要特点\n\n让我们来介绍一下 Gradio 最受欢迎的一些功能！这里是 Gradio ...,gradio-app/gradio/blob/main/guides/cn/01_getti...
3,!--Copyright 2023 The HuggingFace Team. All ri...,huggingface/transformers/blob/main/docs/source...
4,Gradio Demo: blocks_random_slider\n\n\n```\n!...,gradio-app/gradio/blob/main/demo/blocks_random...


We can access the value of the `text` column for the first row, or data point, using pandas' subsetting features.

In [12]:
doc_df['text'].iloc[0]

' Create an Endpoint\n\nAfter your first login, you will be directed to the [Endpoint creation page](https://ui.endpoints.huggingface.co/new). As an example, this guide will go through the steps to deploy [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) for text classification. \n\n## 1. Enter the Hugging Face Repository ID and your desired endpoint name:\n\n<img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_repository.png" alt="select repository" />\n\n## 2. Select your Cloud Provider and region. Initially, only AWS will be available as a Cloud Provider with the `us-east-1` and `eu-west-1` regions. We will add Azure soon, and if you need to test Endpoints with other Cloud Providers or regions, please let us know.\n\n<img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_region.png" alt="select region" />\n\n## 3. Define the [S

## Semantic Search BETWEEN Documents <a name="between"></a>

**Semantic Search** is the process of identifying similar results based on MEANING rather than keyword matches.

A common method of semantic search is:
1. Converting query text (your text, such as your search phrase) to its embedding (the mathematical representation of that text)
2. Converting corpus text (the text you are comparing the query to) to its embeddings (this should use the same model as in step 1)
3. Using a similarity function such as [cosine similarity](https://www.geeksforgeeks.org/cosine-similarity/) to return semantically similar results

More on embeddings and cosine similarity can be found in these handy articles: [1](https://jalammar.github.io/illustrated-word2vec/) and [2](https://huggingface.co/tasks/sentence-similarity).
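
To make the three steps concrete, here is a minimal sketch of step 3 in plain Python. The vectors are hand-made, hypothetical 3-dimensional "embeddings" (a real model such as the one used below produces much longer vectors), and `cosine_similarity` is our own helper rather than a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy, hand-made "embeddings" (hypothetical values, not model output)
toy_query = [1.0, 0.0, 1.0]
toy_corpus = {
    'doc_a': [1.0, 0.0, 1.0],  # same direction as the query
    'doc_b': [1.0, 1.0, 0.0],  # partially overlapping with the query
    'doc_c': [0.0, 1.0, 0.0],  # orthogonal to the query
}

# score every corpus vector against the query, then sort highest first
ranked = sorted(
    ((name, cosine_similarity(toy_query, vec)) for name, vec in toy_corpus.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(name, round(score, 3))
```

The document pointing in the same direction as the query ranks first, and the orthogonal one ranks last with a score of 0.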

Below, we will demonstrate semantic search on our dataset.

First, let's load a model to create embeddings. We will use [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2); its model card on Hugging Face is worth reviewing. Model cards contain useful information such as how the model was trained and any biases the model may have.

In [13]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



Next, we will generate embeddings for every row in the `text` column of our dataset. We will do this using our model and its `encode` method.

Then, we will convert each generated embedding from a numpy array to a tensor using `torch.from_numpy`. This is because our semantic search function requires tensor datatypes.

Finally, we will use `util.semantic_search` to find the 10 most similar `text` values for each `text` value in our dataset (10 is the function's default `top_k`). Because we are comparing our dataset with itself, this is considered PAIRWISE semantic search. Consequently, we can expect the most similar result for each text to be that text itself. We will demonstrate this below.

**The following cell will take about 1 minute to run.**

In [14]:
# generate embeddings for each text data point
doc_df['embedding'] = [model.encode(text) for text in doc_df['text'].to_list()]
# convert embeddings from numpy arrays to tensors
doc_df['embedding'] = [torch.from_numpy(embedding) for embedding in doc_df['embedding'].tolist()]
query = doc_df['embedding'].to_list()
corpus = doc_df['embedding'].to_list()
# return top 10 most similar text values from dataset using cosine similarity
doc_df['semantic_search'] = util.semantic_search(query, corpus)  # because our QUERY and our CORPUS are the same, this is considered pairwise semantic search
doc_df.head()

Unnamed: 0,text,source,embedding,semantic_search
0,"Create an Endpoint\n\nAfter your first login,...",huggingface/hf-endpoints-documentation/blob/ma...,"[tensor(0.0096), tensor(-0.0266), tensor(-0.00...","[{'corpus_id': 0, 'score': 0.9999998807907104}..."
1,Choosing a metric for your task\n\n**So you'v...,huggingface/evaluate/blob/main/docs/source/cho...,"[tensor(0.0092), tensor(-0.0732), tensor(-0.05...","[{'corpus_id': 1, 'score': 1.0000003576278687}..."
2,主要特点\n\n让我们来介绍一下 Gradio 最受欢迎的一些功能！这里是 Gradio ...,gradio-app/gradio/blob/main/guides/cn/01_getti...,"[tensor(-0.0789), tensor(0.0448), tensor(-0.04...","[{'corpus_id': 2, 'score': 1.000000238418579},..."
3,!--Copyright 2023 The HuggingFace Team. All ri...,huggingface/transformers/blob/main/docs/source...,"[tensor(-0.0926), tensor(0.0092), tensor(0.017...","[{'corpus_id': 3, 'score': 0.9999998807907104}..."
4,Gradio Demo: blocks_random_slider\n\n\n```\n!...,gradio-app/gradio/blob/main/demo/blocks_random...,"[tensor(-0.0699), tensor(-0.0296), tensor(-0.0...","[{'corpus_id': 4, 'score': 1.0000003576278687}..."


If we look at the semantic search results for row 1 (the second row, since positions start at 0), we can see a list of ten dictionaries. The list is sorted by cosine similarity score, from highest to lowest. The closer the score is to 1, the higher the similarity; the closer to 0, the lower the similarity. Scores marginally above 1, such as 1.0000003576278687, are floating-point rounding artifacts.

In [15]:
doc_df['semantic_search'].iloc[1]

[{'corpus_id': 1, 'score': 1.0000003576278687},
 {'corpus_id': 496, 'score': 0.678149938583374},
 {'corpus_id': 235, 'score': 0.6676350831985474},
 {'corpus_id': 833, 'score': 0.6462449431419373},
 {'corpus_id': 1585, 'score': 0.5992898941040039},
 {'corpus_id': 2542, 'score': 0.5690680742263794},
 {'corpus_id': 1665, 'score': 0.5668306946754456},
 {'corpus_id': 654, 'score': 0.5641615986824036},
 {'corpus_id': 1996, 'score': 0.549310028553009},
 {'corpus_id': 1024, 'score': 0.5416027903556824}]

Because we compared the same dataset to itself (pairwise semantic search), we expect the first result for a given row, the one with the highest similarity, to be the row itself. This is an assumption rather than a guarantee: exact duplicate texts would tie for first place.

In [16]:
# we will assume the first item in pairwise semantic search will always be self since an item is expected to have highest similarity with itself
print(doc_df['semantic_search'].iloc[1][0])  # should return corpus_id = 1

{'corpus_id': 1, 'score': 1.0000003576278687}


Above, we returned the highest-similarity result for row 1. The row with the highest similarity to row 1 is row 1 itself. If we want the highest-similarity result that is not the original row, we need the SECOND result from semantic search.

In [17]:
# second item in list will be doc with highest similarity that isn't self
print(doc_df['semantic_search'].iloc[1][1])  # should return any other corpus id but 1

{'corpus_id': 496, 'score': 0.678149938583374}
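
Taking the second item relies on the row itself always being in first position. A slightly more defensive sketch, shown here as an alternative and not what this notebook does, filters by `corpus_id` instead. The helper name `most_similar_non_self` is hypothetical, and the sample results are the first three entries of row 1's output above.

```python
def most_similar_non_self(results, self_id):
    """Return the highest-scoring result whose corpus_id is not self_id.

    `results` is a list of {'corpus_id': int, 'score': float} dicts sorted
    from highest to lowest score. Returns None if no other row is present.
    """
    for result in results:
        if result['corpus_id'] != self_id:
            return result
    return None

# first three results for row 1, copied from the output shown earlier
row_1_results = [
    {'corpus_id': 1, 'score': 1.0000003576278687},
    {'corpus_id': 496, 'score': 0.678149938583374},
    {'corpus_id': 235, 'score': 0.6676350831985474},
]
print(most_similar_non_self(row_1_results, self_id=1))
```

This returns the same row (496) as indexing with `[1]`, but it would still behave sensibly if a duplicate text happened to be listed ahead of the row itself.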


Let's compare row 1's `text` with row 496's `text` to see how similar they actually are.

In [18]:
print('Row 1 text')
print(doc_df['text'].iloc[1][:100])  # truncating for preview
print('\n')
print('Row 496 text')
print(doc_df['text'].iloc[496][:100])  # truncating for preview

Row 1 text
 Choosing a metric for your task

**So you've trained your model and want to see how well it’s doing


Row 496 text
 A quick tour

🤗 Evaluate provides access to a wide range of evaluation tools. It covers a range of 


Now, we will return the most similar text that is not self for each row in our dataset.

In [19]:
# return the second-most similar (the non-self result) corpus id from semantic search for each row
doc_df['most_similar_index'] = [result[1]['corpus_id'] for result in doc_df['semantic_search'].to_list()]
# return the text value of each second-most similar result for each row
doc_df['most_similar_text'] = [doc_df['text'].iloc[i] for i in doc_df['most_similar_index'].to_list()]
doc_df.head()

Unnamed: 0,text,source,embedding,semantic_search,most_similar_index,most_similar_text
0,"Create an Endpoint\n\nAfter your first login,...",huggingface/hf-endpoints-documentation/blob/ma...,"[tensor(0.0096), tensor(-0.0266), tensor(-0.00...","[{'corpus_id': 0, 'score': 0.9999998807907104}...",983,Create a Private Endpoint with AWS PrivateLin...
1,Choosing a metric for your task\n\n**So you'v...,huggingface/evaluate/blob/main/docs/source/cho...,"[tensor(0.0092), tensor(-0.0732), tensor(-0.05...","[{'corpus_id': 1, 'score': 1.0000003576278687}...",496,A quick tour\n\n🤗 Evaluate provides access to...
2,主要特点\n\n让我们来介绍一下 Gradio 最受欢迎的一些功能！这里是 Gradio ...,gradio-app/gradio/blob/main/guides/cn/01_getti...,"[tensor(-0.0789), tensor(0.0448), tensor(-0.04...","[{'corpus_id': 2, 'score': 1.000000238418579},...",2566,快速开始\n\n**先决条件**：Gradio 需要 Python 3.8 或更高版本，就...
3,!--Copyright 2023 The HuggingFace Team. All ri...,huggingface/transformers/blob/main/docs/source...,"[tensor(-0.0926), tensor(0.0092), tensor(0.017...","[{'corpus_id': 3, 'score': 0.9999998807907104}...",973,!--Copyright 2022 The HuggingFace Team. All ri...
4,Gradio Demo: blocks_random_slider\n\n\n```\n!...,gradio-app/gradio/blob/main/demo/blocks_random...,"[tensor(-0.0699), tensor(-0.0296), tensor(-0.0...","[{'corpus_id': 4, 'score': 1.0000003576278687}...",1917,Gradio Demo: interface_random_slider\n\n\n```...


Looking at our results, we can see that in our dataset the text that is most similar to row 0 is row 983. Both of these texts discuss creating endpoints.

In [20]:
doc_df.iloc[0]['text']

' Create an Endpoint\n\nAfter your first login, you will be directed to the [Endpoint creation page](https://ui.endpoints.huggingface.co/new). As an example, this guide will go through the steps to deploy [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) for text classification. \n\n## 1. Enter the Hugging Face Repository ID and your desired endpoint name:\n\n<img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_repository.png" alt="select repository" />\n\n## 2. Select your Cloud Provider and region. Initially, only AWS will be available as a Cloud Provider with the `us-east-1` and `eu-west-1` regions. We will add Azure soon, and if you need to test Endpoints with other Cloud Providers or regions, please let us know.\n\n<img src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_region.png" alt="select region" />\n\n## 3. Define the [S

In [21]:
doc_df.iloc[0]['most_similar_text']

' Create a Private Endpoint with AWS PrivateLink\n\nSecurity and secure inference are key principles of Inference Endpoints. We currently offer three different levels of security: [Public, Protected and Private](/docs/inference-endpoints/security).\n\nPublic and Protected Endpoints do not require any additional configuration. But in order to create a Private Endpoint for a secure intra-region connection, you need to provide the AWS Account ID of the account which should also have access to Inference Endpoints.\n\n<img\n  src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/6_private_type.png"\n  alt="select private link"\n/>\n\nAfter you provide your AWS Account ID and click **Create Endpoint**, the Inference Service creates a VPC Endpoint and you should see the VPC Service Name in your overview.\n\n<img\n  src="https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/6_service_name.png"\n  alt="service link"\n/>\n\nThe V

In [22]:
doc_df.iloc[0]['semantic_search'][1]['score']  # the cosine similarity score

0.5928554534912109

Later, we will use this same semantic search approach as part of RAG Q&A, Retrieval-Augmented Generation Question & Answering. This will allow us to find the text that is most likely to contain the answer to our question. The model will then generate an answer based on that text.

First, let's look at how we perform Q&A with an LLM.

## LLM Q&A <a name="qa"></a>

There are many different ways to interact with LLMs. Because LLMs are large, we're going to use a method that doesn't download the model to our local machine. We will use a [Hugging Face access token](https://huggingface.co/docs/hub/security-tokens) to interact with the model. We will create our token as READ-ONLY.

We will use Meta Llama 3 per the [inspiration tutorial here](https://huggingface.co/learn/cookbook/en/structured_generation). We can determine which prompt formats a model supports by reviewing its [model card on Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Looking at the model card for Meta Llama 3, we can identify it as a Text Generation model that supports English and can be used for QA, translation, and many other tasks.

**NOTE THE WARNING** on Meta Llama 3's model card regarding data privacy and information sharing.

In [23]:
repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
my_token = getpass('HF access token:')

llm_client = InferenceClient(model=repo_id, timeout=120, token=my_token)

# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)

" I hope you're having a great day! I just wanted to check in and see how things are"

If we increase the `temperature` parameter, sampling becomes more random: the output is more "creative" but often less coherent or accurate.

In [24]:
# play with temperature
llm_client.text_generation(prompt='How are you today?', max_new_tokens=20, temperature=1.3)

' Wishing you a double-tittled excellent day!\n*- chord and*openedTam.’ thiOnly'
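
The sketch below illustrates the usual way temperature is applied when sampling: token scores are divided by the temperature before being turned into probabilities. The scores in `toy_logits` are hypothetical, and we are assuming, not confirming, that the hosted model applies temperature in this standard way.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.1]

for temperature in (0.5, 1.0, 1.3):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the probabilities move closer together, so unlikely tokens are sampled more often. That is consistent with the garbled output above.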

Using the prompt from the [inspiration notebook](https://huggingface.co/learn/cookbook/en/structured_generation), we'll look at the different components of RAG Q&A.

First, we have our **prompt**. This is the entire text sent to the LLM. It contains our instructions to the model on how we want our question answered. It also *can* contain our **context**, or the data we expect to find our answer within. Finally, it contains our **query**, or the question we want answered.

Prompts can be formatted in many different ways. Prompt engineering is important to get the most out of our LLM.

In the prompt below, we are instructing the model to use our `context` to answer the `user_query`. The values for `context` and `user_query` will be replaced with actual text using our code in later cells.

Our prompt is also instructing the model to answer in JSON format. This is an example of using the generative capabilities of LLMs to not only return the answer to our question but to return it in a new format.

In [25]:
# from https://huggingface.co/learn/cookbook/en/structured_generation
# define our prompt
RAG_PROMPT_TEMPLATE_JSON = """
Answer the user query based on the source documents.

Here are the source documents: {context}


You should provide your answer as a JSON blob, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.

Your answer should be built as follows, it must contain the "Answer:" and "End of answer." sequences.

Answer:
{{
  "answer": your_answer,
  "confidence_score": your_confidence_score,
  "source_snippets": ["snippet_1", "snippet_2", ...]
}}
End of answer.

Now begin!
Here is the user question: {user_query}.
Answer:
"""

In [26]:
# define our question, similar to before
USER_QUERY = 'How are you?'

In [27]:
# define our data/text/corpus where an answer may be found
RELEVANT_CONTEXT = f'''
Document:

I'm angry

Document:

{doc_df.iloc[0]['text'][:19]}
'''  # in this f string, I am using a portion of one of our dataset's text values to provide another document as context

print(RELEVANT_CONTEXT)


Document:

I'm angry

Document:

 Create an Endpoint



In [28]:
# insert our user-defined values into the prompt
prompt = RAG_PROMPT_TEMPLATE_JSON.format(context=RELEVANT_CONTEXT, user_query=USER_QUERY)  # another string method, see f string method above
print(prompt)


Answer the user query based on the source documents.

Here are the source documents: 
Document:

I'm angry

Document:

 Create an Endpoint



You should provide your answer as a JSON blob, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.

Your answer should be built as follows, it must contain the "Answer:" and "End of answer." sequences.

Answer:
{
  "answer": your_answer,
  "confidence_score": your_confidence_score,
  "source_snippets": ["snippet_1", "snippet_2", ...]
}
End of answer.

Now begin!
Here is the user question: How are you?.
Answer:
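
Notice that the doubled braces (`{{` and `}}`) in the template were printed as single braces. In `str.format`, doubling a brace escapes it, so only `{context}` and `{user_query}` are treated as placeholders. A miniature, hypothetical template (not the notebook's full prompt) shows the same behavior:

```python
# hypothetical miniature template, not the notebook's full prompt
toy_template = 'Documents: {context}\nReply as: {{"answer": your_answer}}\nQuestion: {user_query}'

# named placeholders are filled in; doubled braces become literal single braces
filled = toy_template.format(context='I am a document', user_query='How are you?')
print(filled)
```

Without the doubling, `str.format` would try to interpret the JSON skeleton as a placeholder and raise an error.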



In [29]:
# use our prompt and llm to generate an answer
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000
)

answer

'{\n  "answer": "I\'m angry",\n  "confidence_score": 0.8,\n  "source_snippets": ["I\'m angry"]\n}\nEnd of answer. \n\n\n\n\n\nPlease note that the confidence score is subjective and may vary based on the complexity of the query and the relevance of the source documents. In this case, the confidence score is 0.8 because the user query "How are you?" is somewhat related to the source document "I\'m angry", but not very strongly. The confidence score can be adjusted based on the specific requirements of your application.'

In [30]:
# split the model output so we only return the json object
answer = answer.split("End of answer.")[0]
print(answer)

{
  "answer": "I'm angry",
  "confidence_score": 0.8,
  "source_snippets": ["I'm angry"]
}
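
At this point the answer is still a string. One possible next step, sketched here as an optional addition, is to parse it with the standard library's `json` module so the fields can be used in code. The `raw_answer` value is the model output from above with the trailing commentary shortened, and the `try`/`except` reflects that the model is not guaranteed to return valid JSON.

```python
import json

# model output copied from the cell above (text after "End of answer." shortened)
raw_answer = (
    '{\n  "answer": "I\'m angry",\n  "confidence_score": 0.8,\n'
    '  "source_snippets": ["I\'m angry"]\n}\nEnd of answer. \n\nPlease note ...'
)

# keep only the part before the end marker, as in the cell above
json_text = raw_answer.split('End of answer.')[0]

try:
    parsed = json.loads(json_text)
except json.JSONDecodeError:
    # the model may produce malformed JSON; handle that case explicitly
    parsed = None

if parsed is not None:
    print(parsed['answer'])
    print(parsed['confidence_score'])
    print(parsed['source_snippets'])
```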



## RAG Q&A <a name="rag"></a>

Before, we manually defined our `context`. We did that by defining `RELEVANT_CONTEXT` with a static string value.

Now, let's use our dataset to create the `context` using the text rows most similar to our `query`, or question. We will do this by performing semantic search between our `query` and our dataset. The 10 dataset rows most similar to our `query` will be used to create a new `context`.

We will place that `context` in our prompt and generate an answer.

In [31]:
# a new question, one we know our docs can answer
USER_QUERY = 'how do create endpoint'

In [32]:
# generate an embedding using the same model as we used on our dataset
# then use cosine similarity to return the top 10 most similar data points from our dataset
similar_docs_results = util.semantic_search(model.encode(USER_QUERY, convert_to_tensor=True), doc_df['embedding'].to_list())
similar_docs_results

[[{'corpus_id': 1162, 'score': 0.5051618814468384},
  {'corpus_id': 740, 'score': 0.44793304800987244},
  {'corpus_id': 983, 'score': 0.441649466753006},
  {'corpus_id': 1273, 'score': 0.4327011704444885},
  {'corpus_id': 324, 'score': 0.4042762219905853},
  {'corpus_id': 0, 'score': 0.3923899531364441},
  {'corpus_id': 1147, 'score': 0.37026676535606384},
  {'corpus_id': 1852, 'score': 0.36074841022491455},
  {'corpus_id': 770, 'score': 0.3479536771774292},
  {'corpus_id': 1278, 'score': 0.3412274420261383}]]

In [33]:
# use the corpus_ids from our semantic search results to get the most similar texts
similar_docs = [doc_df['text'].iloc[result['corpus_id']] for result in similar_docs_results[0]]
similar_docs

[' Send Requests to Endpoints\n\nYou can send requests to Inference Endpoints using the UI leveraging the Inference Widget or programmatically, e.g. with cURL, `@huggingface/inference`, `huggingface_hub` or any REST client. The Endpoint overview not only provides a interactive widget for you to test the Endpoint, but also generates code for `python`, `javascript` and `curl`. You can use this code to quickly get started with your Endpoint in your favorite programming language.\n\nBelow are also examples on how to use the `@huggingface/inference` library to call an inference endpoint.\n\n## Use the UI to send requests\n\nThe Endpoint overview provides access to the Inference Widget which can be used to send requests (see step 6 of [Create an Endpoint](/docs/inference-endpoints/guides/create_endpoint)). This allows you to quickly test your Endpoint with different inputs and share it with team members.\n\n## Use cURL to send requests\n\nThe cURL command for the request above should look li

In the above results, we can see that the texts most similar to our query, `'how do create endpoint'`, all contain the word "endpoint".

The most similar data point, based on cosine similarity over the embeddings generated by our model, was the one containing the text 'Send Requests to Endpoints'.

We can see that the text with the most appropriate answer, 'Create an Endpoint', is the 6th most similar result based on cosine similarity. Which model we used, which similarity function, the cleanliness of our data (see how our text still contains markdown elements?), whether or not we chunked our data, and more can affect our semantic search results. For now, we will use these results, but we can always refine our process based on these factors.
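
Chunking, one of the factors mentioned above, means splitting each long document into smaller pieces before embedding so that each embedding covers a narrower topic. The notebook does not chunk its data; the following is only a minimal sketch of character-based chunking with overlap. The function `chunk_text`, its parameters, and the sample text are all hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError('need chunk_size > 0 and 0 <= overlap < chunk_size')
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# hypothetical stand-in for one long documentation page
toy_doc = 'Create an Endpoint. ' * 30  # 600 characters

chunks = chunk_text(toy_doc, chunk_size=200, overlap=50)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk would then be embedded and searched on its own. The overlap keeps a sentence that straddles a boundary from being split without context, at the cost of some duplicated text.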

Now, we will create the `context` for our prompt using the results from semantic search.

In [34]:
# truncate each text to its first 100 characters to keep the prompt within the model's input limits
# prepend 'Document:\n' to each text in our semantic search results
# then join all texts into a single string
RELEVANT_CONTEXT = '\n'.join(['Document:\n' + text[:100] for text in similar_docs])

print(RELEVANT_CONTEXT)

Document:
 Send Requests to Endpoints

You can send requests to Inference Endpoints using the UI leveraging th
Document:
 Inference Endpoints

Inference Endpoints provides a secure production solution to easily deploy mod
Document:
 Create a Private Endpoint with AWS PrivateLink

Security and secure inference are key principles of
Document:
 Change Organization or Account

Inference Endpoints uses your [Hugging Face](https://huggingface.co
Document:
 Access and view Metrics

Hugging Face Endpoints provides access to the metrics and analytics of you
Document:
 Create an Endpoint

After your first login, you will be directed to the [Endpoint creation page](ht
Document:
 Security & Compliance

🤗 Inference Endpoints is built with security and secure inference at its cor
Document:
 Inference Endpoints

Inference Endpoints provides a secure production solution to easily deploy any
Document:
 Pause and Resume your Endpoint

You can `pause` & `resume` endpoints to save cost and configuration
D

In [35]:
# insert our semantic search based context and user query into the prompt
prompt = RAG_PROMPT_TEMPLATE_JSON.format(context=RELEVANT_CONTEXT, user_query=USER_QUERY)  # another string method, see f string method above
print(prompt)


Answer the user query based on the source documents.

Here are the source documents: Document:
 Send Requests to Endpoints

You can send requests to Inference Endpoints using the UI leveraging th
Document:
 Inference Endpoints

Inference Endpoints provides a secure production solution to easily deploy mod
Document:
 Create a Private Endpoint with AWS PrivateLink

Security and secure inference are key principles of
Document:
 Change Organization or Account

Inference Endpoints uses your [Hugging Face](https://huggingface.co
Document:
 Access and view Metrics

Hugging Face Endpoints provides access to the metrics and analytics of you
Document:
 Create an Endpoint

After your first login, you will be directed to the [Endpoint creation page](ht
Document:
 Security & Compliance

🤗 Inference Endpoints is built with security and secure inference at its cor
Document:
 Inference Endpoints

Inference Endpoints provides a secure production solution to easily deploy any
Document:
 Pause and Resum

In [36]:
# generate an answer
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000
)

answer

'{\n  "answer": "You can create an endpoint after your first login, you will be directed to the endpoint creation page.",\n  "confidence_score": 0.8,\n  "source_snippets": ["Create an Endpoint", "After your first login, you will be directed to the endpoint creation page"]\n}\nEnd of answer.'

In [37]:
answer = answer.split("End of answer.")[0]
print(answer)

{
  "answer": "You can create an endpoint after your first login, you will be directed to the endpoint creation page.",
  "confidence_score": 0.8,
  "source_snippets": ["Create an Endpoint", "After your first login, you will be directed to the endpoint creation page"]
}



We can utilize different retrieval methods and different LLMs for this process.

It is important to understand our LLM's configuration so we can adjust our prompts (format, length, etc.) to get the best results from the LLM.

<sub>made with ❤️ by [hadleyrose](https://github.com/hadleyrose)</sub>