Added docs for PebbloRetrievalQA chain
Raj725 committed May 9, 2024
1 parent 669d852 commit 9444bfb
Showing 1 changed file with 356 additions and 0 deletions.
356 changes: 356 additions & 0 deletions docs/docs/how_to/pebblo_retrieval_qa.ipynb
@@ -0,0 +1,356 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3ce451e9-f8f1-4f27-8c6b-4a93a406d504",
"metadata": {},
"source": [
"# Identity-enabled RAG using PebbloRetrievalQA\n",
"\n",
"> PebbloRetrievalQA is a Retrieval chain with Identity & Semantic Enforcement for question-answering\n",
"against a vector database.\n",
"\n",
"This notebook covers how to retrieve documents with Identity & Semantic Enforcement.\n",
"\n",
"To start, we will load documents with authorization metadata into an in-memory Qdrant vector database we want to use and then use it as a retriever in PebbloRetrievalQA. Next, we will define an \"ask\" function that loads the PebbloRetrievalQA chain using the retriever and provided *auth_context*. Finally, we will ask it questions with authorization context for authorized users and unauthorized users."
]
},
{
"cell_type": "markdown",
"id": "4ee16b6b-5dac-4b5c-bb69-3ec87398a33c",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"### Dependencies\n",
"\n",
"We'll use an OpenAI LLM, OpenAI embeddings and a Qdrant vector store in this walkthrough.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e68494fa-f387-4481-9a6c-58294865d7b7",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain langchain-community langchain-openai qdrant_client"
]
},
{
"cell_type": "markdown",
"id": "61498d51-0c38-40e2-adcd-19dfdf4d37ef",
"metadata": {},
"source": [
"### Identity-aware Data Ingestion\n",
"\n",
"Here we are using Qdrant as a vector database; however, you can use any of the supported vector databases.\n",
"\n",
"**PebbloRetrievalQA chain supports the following vector databases:**\n",
"- Qdrant\n",
"- Pinecone\n",
"\n",
"\n",
"**Load vector database with authorization information in metadata:**\n",
"\n",
"In this step, we capture the authorization information of the source document into the `authorized_identities` field within the metadata of the VectorDB entry for each chunk. \n",
"\n",
"Example:\n",
"```\n",
"{\n",
" \"page_content\": \"Employee leave-of-absence policy ...\",\n",
" \"metadata\": {\n",
" \"authorized_identities\": [\"hr-support\", \"hr-leadership\"],\n",
" ...\n",
" }\n",
" ...\n",
"},\n",
"...\n",
"\n",
"```\n",
"\n",
"\n",
"*NOTE: To use the PebbloRetrievalQA chain, you always need to place authorization metadata in the `authorized_identities` field, which must be a list of strings.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae4fcbc1-bdc3-40d2-b2df-8c82cad1f89c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores.qdrant import Qdrant\n",
"from langchain_core.documents import Document\n",
"from langchain_openai.embeddings import OpenAIEmbeddings\n",
"from langchain_openai.llms import OpenAI\n",
"\n",
"llm = OpenAI()\n",
"embeddings = OpenAIEmbeddings()\n",
"collection_name = \"pebblo-identity-rag\"\n",
"\n",
"page_content = \"\"\"\n",
"Performance Report: John Smith\n",
"Employee Information:\n",
" •Name: John Smith\n",
" •Employee ID: JS12345\n",
" •Department: Sales\n",
" •Position: Sales Representative\n",
" •Review Period: January 1, 2023 - December 31, 2023\n",
"\n",
"Performance Summary: \n",
"John Smith has demonstrated commendable performance as a Sales Representative during the review period. \n",
"He consistently met and often exceeded sales targets, contributing significantly to the department's success. \n",
"His dedication, professionalism, and collaborative approach have been instrumental in fostering positive \n",
"relationships with clients and colleagues alike.\n",
"\n",
"Key Achievements:\n",
"•Exceeded sales targets by 20% for the fiscal year, demonstrating exceptional sales acumen and strategic planning skills.\n",
"•Successfully negotiated several high-value contracts, resulting in increased revenue and client satisfaction.\n",
"•Proactively identified opportunities for process improvement within the sales team, \n",
" leading to streamlined workflows and enhanced efficiency.\n",
"•Received positive feedback from clients and colleagues for excellent communication skills, responsiveness, and customer service.\n",
" Areas for Development: While John's performance has been exemplary overall, \n",
"there are opportunities for further development in certain areas:\n",
"•Continued focus on expanding product knowledge to better address client needs and provide tailored solutions.\n",
"•Enhancing time management skills to prioritize tasks effectively and maximize productivity during busy periods.\n",
"•Further development of leadership abilities to support and mentor junior team members within the sales department.\n",
"\n",
"Conclusion: In conclusion, John Smith has delivered outstanding results as a Sales Representative at ACME Corp. \n",
"His dedication, performance, and commitment to excellence reflect positively on the organization.\" \n",
"\"\"\"\n",
"\n",
"documents = [\n",
" Document(\n",
" **{\n",
" \"page_content\": page_content,\n",
" \"metadata\": {\n",
" \"authorized_identities\": [\"hr-support\", \"hr-leadership\"],\n",
" \"page\": 0,\n",
" \"source\": \"https://drive.google.com/file/d/xxxxxxxxxxxxx/view\",\n",
" \"title\": \"Performance Report- John Smith.pdf\",\n",
" },\n",
" }\n",
" )\n",
"]\n",
"\n",
"print(\"Loading vectordb...\")\n",
"\n",
"vectordb = Qdrant.from_documents(\n",
" documents,\n",
" embeddings,\n",
" location=\":memory:\",\n",
" collection_name=collection_name,\n",
")\n",
"\n",
"print(\"Vectordb loaded.\")"
]
},
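{
"cell_type": "markdown",
"id": "sketch-validate-identities-md",
"metadata": {},
"source": [
"The note in the data ingestion section requires `authorized_identities` to be a list of strings. As a minimal sketch, assuming you want to check that shape yourself before ingestion, the hypothetical helper `has_valid_authorized_identities` below (not part of Pebblo or LangChain) validates a metadata dictionary in plain Python. A check like this could be run over the metadata of each document before calling `Qdrant.from_documents`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-validate-identities-code",
"metadata": {},
"outputs": [],
"source": [
"def has_valid_authorized_identities(metadata: dict) -> bool:\n",
"    \"\"\"Return True if the authorized_identities field is a list of strings.\"\"\"\n",
"    identities = metadata.get(\"authorized_identities\")\n",
"    return isinstance(identities, list) and all(\n",
"        isinstance(identity, str) for identity in identities\n",
"    )\n",
"\n",
"\n",
"# Illustrative metadata only; these dictionaries are not read from the vector database\n",
"valid_metadata = {\"authorized_identities\": [\"hr-support\", \"hr-leadership\"]}\n",
"invalid_metadata = {\"authorized_identities\": \"hr-support\"}  # a bare string, not a list\n",
"\n",
"print(has_valid_authorized_identities(valid_metadata))  # True\n",
"print(has_valid_authorized_identities(invalid_metadata))  # False"
]
},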
{
"cell_type": "markdown",
"id": "f630bb8b-67ba-41f9-8715-76d006207e75",
"metadata": {},
"source": [
"## Retrieval with Identity & Semantic Enforcement\n",
"\n",
"PebbloRetrievalQA chain uses a SafeRetrieval to enforce that the snippets used for in-context are retrieved only from the documents authorized for the user. \n",
"To achieve this, the Gen-AI application needs to provide an authorization context for this retrieval chain. \n",
"This *auth_context* should be filled with the identity and authorization groups of the user accessing the Gen-AI app.\n",
"\n",
"\n",
"Here is the sample code for the `PebbloRetrievalQA` with `authorized_identities` from the user accessing the RAG application, passed in `auth_context`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e978bee6-3a8c-459f-ab82-d380d7499b36",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.chains import PebbloRetrievalQA\n",
"from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput\n",
"\n",
"# Initialize PebbloRetrievalQA chain\n",
"qa_chain = PebbloRetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" app_name=\"pebblo-identity-and-semantic-retriever-app\",\n",
" owner=\"Joe Smith\",\n",
" description=\"Identity and Semantic filtering using PebbloSafeLoader, and PebbloRetrievalQA\",\n",
" chain_type=\"stuff\",\n",
" retriever=vectordb.as_retriever(),\n",
" verbose=True,\n",
")\n",
"\n",
"\n",
"def ask(question: str, auth_context: dict):\n",
" \"\"\"\n",
" Ask a question to the PebbloRetrievalQA chain\n",
" \"\"\"\n",
" auth_context_obj = AuthContext(**auth_context) if auth_context else None\n",
" chain_input_obj = ChainInput(query=question, auth_context=auth_context_obj)\n",
" return qa_chain.invoke(chain_input_obj.dict())"
]
},
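{
"cell_type": "markdown",
"id": "sketch-identity-overlap-md",
"metadata": {},
"source": [
"Conceptually, identity enforcement means a chunk is usable only when its `authorized_identities` overlap with the identities in the user's `auth_context`. The plain-Python sketch below illustrates that idea under the assumption that a single shared identity is enough. It is not how the chain is implemented, and `is_authorized` is a hypothetical helper introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-identity-overlap-code",
"metadata": {},
"outputs": [],
"source": [
"def is_authorized(doc_identities: list, user_identities: list) -> bool:\n",
"    \"\"\"Return True if the user shares at least one identity with the document.\"\"\"\n",
"    return bool(set(doc_identities) & set(user_identities))\n",
"\n",
"\n",
"# Identities attached to the ingested document in this notebook\n",
"doc_identities = [\"hr-support\", \"hr-leadership\"]\n",
"\n",
"print(is_authorized(doc_identities, [\"hr-support\"]))  # True\n",
"print(is_authorized(doc_identities, [\"eng-support\"]))  # False"
]
},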
{
"cell_type": "markdown",
"id": "7a267e96-70cb-468f-b830-83b65e9b7f6f",
"metadata": {},
"source": [
"### Questions by Authorized User\n",
"\n",
"We ingested data for authorized identities [\"hr-support\", \"hr-leadership\"], so a user with the authorized identity/group \"hr-support\" should receive the correct answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2688fc18-1eac-45a5-be55-aabbe6b25af5",
"metadata": {},
"outputs": [],
"source": [
"auth = {\n",
" \"user_id\": \"hr-user@acme.org\",\n",
" \"authorized_identities\": [\n",
" \"hr-support\",\n",
" ]\n",
"}\n",
"\n",
"question = \"Please share the performance report for John Smith?\"\n",
"resp = ask(question, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\\n\")"
]
},
{
"cell_type": "markdown",
"id": "b4db6566-6562-4a49-b19c-6d99299b374e",
"metadata": {},
"source": [
"### Questions by Unauthorized User\n",
"\n",
"Since the user's authorized identity/group \"eng-support\" is not included in the authorized identities [\"hr-support\", \"hr-leadership\"], we should not receive an answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d736ce3-6e05-48d3-a5e1-fb4e7cccc1ee",
"metadata": {},
"outputs": [],
"source": [
"auth = {\n",
" \"user_id\": \"eng-user@acme.org\",\n",
" \"authorized_identities\": [\n",
" \"eng-support\",\n",
" ]\n",
"}\n",
"\n",
"question = \"Please share the performance report for John Smith?\"\n",
"resp = ask(question, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\\n\")"
]
},
{
"cell_type": "markdown",
"id": "33a8afe1-3071-4118-9714-a17cba809ee4",
"metadata": {},
"source": [
"### Using PromptTemplate to provide additional instructions\n",
"You can use PromptTemplate to provide additional instructions to the LLM for generating a custom response."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59c055ba-fdd1-48c6-9bc9-2793eb47438d",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import PromptTemplate\n",
"\n",
"prompt_template = PromptTemplate.from_template(\n",
" \"\"\"\n",
"Answer the question using the provided context. \n",
"If no context is provided, just say \"I'm sorry, but that information is unavailable, or Access to it is restricted.\".\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
")\n",
"\n",
"question = \"Please share the performance summary for John Smith?\"\n",
"prompt = prompt_template.format(question=question)"
]
},
{
"cell_type": "markdown",
"id": "c4d27c00-73d9-4ce8-bc70-29535deaf0e2",
"metadata": {},
"source": [
"#### Questions by Authorized User"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e68a13a4-b735-421d-9655-2a9a087ba9e5",
"metadata": {},
"outputs": [],
"source": [
"auth = {\n",
" \"user_id\": \"hr-user@acme.org\",\n",
" \"authorized_identities\": [\n",
" \"hr-support\",\n",
" ]\n",
"}\n",
"resp = ask(prompt, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\\n\")"
]
},
{
"cell_type": "markdown",
"id": "7b97a9ca-bdc6-400a-923d-65a8536658be",
"metadata": {},
"source": [
"#### Questions by Unauthorized Users"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "438e48c6-96a2-4d5e-81db-47f8c8f37739",
"metadata": {},
"outputs": [],
"source": [
"auth = {\n",
" \"user_id\": \"eng-user@acme.org\",\n",
" \"authorized_identities\": [\n",
" \"eng-support\",\n",
" ]\n",
"}\n",
"resp = ask(prompt, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\\n\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
