# Question Answering with LangChain, OpenAI, and MultiQuery Retriever

This interactive workbook demonstrates how to use LangChain's [MultiQuery Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) to generate similar queries for a given user input and apply all of them to retrieve a larger set of relevant documents from a vector store.

Before we begin, we split the fictional workplace documents into passages with `langchain`, use OpenAI to transform these passages into embeddings, and store the embeddings in Elasticsearch.

We will then ask a question, generate similar questions using langchain and OpenAI, retrieve relevant passages from the vector store, and use langchain and OpenAI again to generate an answer from those passages.
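The core idea can be sketched in plain Python: run each generated query against the store and keep the unique union of the hits. This is a minimal illustration, not LangChain's implementation. The names `multi_query_retrieve`, `fake_variants`, `fake_search`, and `FAKE_INDEX` are hypothetical stand-ins for the LLM and the vector store.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run every query variant and return the de-duplicated union of hits."""
    seen = set()
    merged: List[str] = []
    for query in generate_variants(question):
        for passage in search(query):
            if passage not in seen:  # keep only the first occurrence
                seen.add(passage)
                merged.append(passage)
    return merged


# Hypothetical stand-ins: a real setup would call an LLM and a vector store.
FAKE_INDEX: Dict[str, List[str]] = {
    "remote work policy": ["WFH policy", "Equipment guide"],
    "working from home rules": ["WFH policy", "April WFH update"],
}


def fake_variants(question: str) -> List[str]:
    # Pretend the LLM rephrased the question into these two queries.
    return list(FAKE_INDEX)


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


hits = multi_query_retrieve(
    "What is the remote work policy?", fake_variants, fake_search
)
print(hits)
```

The second query contributes one passage that the first query missed, which is the benefit of issuing several phrasings of the same question.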

## Install packages and import modules

In [46]:
!python3 -m pip install -qU jq lark langchain langchain-elasticsearch langchain_openai tiktoken

from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai.llms import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from getpass import getpass

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment. This makes it easy to create the index and index data. The documents themselves are loaded, split, and indexed in the sections that follow.

In [47]:
# Prompt for Elastic Cloud ID and API Key
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")
OPENAI_API_KEY = getpass("OpenAI API key: ")

# Create embeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Connect to Elasticsearch
vectorstore = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="index_test",  # Replace with a meaningful index name
    embedding=embeddings,
)


In [48]:
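# Check the structure of the raw documents.
# Note: `data` is loaded in the "Indexing Data into Elasticsearch" section below; run that cell first.
print(data[0])  # Print the first document to see its structure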


{'content': "Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\nEligibility\n\nEmployees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.\nEquipment and Resources\n\nThe necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are respon

## Indexing Data into Elasticsearch
Let's download the sample dataset and deserialize the document.

In [49]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)
data = json.load(response)

with open("temp.json", "w") as json_file:
    json.dump(data, json_file)

### Split Documents into Passages

We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

Here we chunk documents into 800-token passages with an overlap of 400 tokens.

We use a simple splitter here, but LangChain offers more advanced splitters that reduce the chance of losing context.
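To see what an 800-token chunk size with a 400-token overlap means, here is a simplified sliding-window sketch over a list of tokens. It is only an illustration: `RecursiveCharacterTextSplitter` splits on separators and merges the pieces, so its real boundaries will differ. The helper name `chunk_with_overlap` and the synthetic tokens are assumptions made for this example.

```python
from typing import List


def chunk_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into fixed-size windows that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic tokens stand in for a tokenized document.
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_with_overlap(tokens, chunk_size=800, chunk_overlap=400)
print(len(chunks), [len(c) for c in chunks])
```

With these settings each window advances by 400 tokens, so every token (apart from those at the edges) appears in two passages. That costs extra storage but makes it less likely that a sentence is cut off from its context.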

In [50]:
from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Function to populate metadata
def metadata_func(record: dict, metadata: dict) -> dict:
    # Populate the metadata dictionary with keys
    metadata['name'] = record.get('name', 'Unknown Name')  # Default to 'Unknown Name' if missing
    metadata['summary'] = record.get('summary', '')
    metadata['url'] = record.get('url', '')
    metadata['category'] = record.get('category', '')
    metadata['created_on'] = record.get('created_on', '')
    metadata['updated_at'] = record.get('updated_at', None) if record.get('updated_at') != 'No update date' else None
    return metadata



loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)
# Split documents into 800-token passages with a 400-token overlap.
# The splitter must be defined before it is passed to the loader.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800,  # Maximum number of tokens per passage
    chunk_overlap=400,  # Tokens shared between consecutive passages
)

# Load the JSON records and split them into passages
docs = loader.load_and_split(text_splitter=text_splitter)


In [51]:
print(data[0].keys())  # Print keys of the first record to inspect the structure


dict_keys(['content', 'summary', 'name', 'url', 'created_on', 'updated_at', 'category', '_run_ml_inference', 'rolePermissions'])


In [52]:
# Check the structure of the first document
print(docs[0])


page_content='Effective: March 2020
Purpose

The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.
Scope

This policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.
Eligibility

Employees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.
Equipment and Resources

The necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are responsible for m

In [53]:
# Check the count of documents and sample metadata
print(f"Total documents: {len(docs)}")
for i in range(3):  # Print the metadata for the first 3 documents
    print(docs[i].metadata)


Total documents: 15
{'source': '/Users/marinacastilloariza/Desktop/AI_Work/7_Week/lab-chatbot-with-multi-query-retriever/temp.json', 'seq_num': 1, 'name': 'Work From Home Policy', 'summary': 'This policy outlines the guidelines for full-time remote work, including eligibility, equipment and resources, workspace requirements, communication expectations, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, and policy reviews and updates. Employees are encouraged to direct any questions or concerns', 'url': './sharepoint/Work from home policy.txt', 'category': 'teams', 'created_on': '2020-03-01', 'updated_at': '2020-03-01'}
{'source': '/Users/marinacastilloariza/Desktop/AI_Work/7_Week/lab-chatbot-with-multi-query-retriever/temp.json', 'seq_num': 2, 'name': 'April Work From Home Update', 'summary': 'Starting May 2022, employees will need to work two days a week in the office. Coordinate with your supervisor and HR department for th

In [54]:
try:
    documents = vectorstore.from_documents(
        docs,
        embeddings,
        index_name="index_test",  # Ensure this is correctly set
        es_cloud_id=ELASTIC_CLOUD_ID,
        es_api_key=ELASTIC_API_KEY,
    )
except Exception as e:  # Catch any exception
    print(f"Error indexing documents: {e}")


In [55]:
for doc in docs:
    print(doc.metadata)


{'source': '/Users/marinacastilloariza/Desktop/AI_Work/7_Week/lab-chatbot-with-multi-query-retriever/temp.json', 'seq_num': 1, 'name': 'Work From Home Policy', 'summary': 'This policy outlines the guidelines for full-time remote work, including eligibility, equipment and resources, workspace requirements, communication expectations, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, and policy reviews and updates. Employees are encouraged to direct any questions or concerns', 'url': './sharepoint/Work from home policy.txt', 'category': 'teams', 'created_on': '2020-03-01', 'updated_at': '2020-03-01'}
{'source': '/Users/marinacastilloariza/Desktop/AI_Work/7_Week/lab-chatbot-with-multi-query-retriever/temp.json', 'seq_num': 2, 'name': 'April Work From Home Update', 'summary': 'Starting May 2022, employees will need to work two days a week in the office. Coordinate with your supervisor and HR department for these days while follo

### Bulk Import Passages

Now that we have split each document into passages with a chunk size of 800 tokens, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

We will use the Cloud ID, API key, and index name set in the `Connect to Elasticsearch` step.

In [56]:
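# Note: the earlier indexing cell already called from_documents on the same index;
# running both cells may index the passages twice.
documents = vectorstore.from_documents(
    docs,
    embeddings,
    index_name="index_test",  # Same index name as in the connection step
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
)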

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

retriever = MultiQueryRetriever.from_llm(vectorstore.as_retriever(), llm)


## Question Answering with MultiQuery Retriever

Now that the passages are stored in Elasticsearch, we can ask a question, retrieve the relevant passages, and pass them to the LLM as context.
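Before the passages reach the LLM, the cell below renders each one with a small document template and joins the results into a single context string. A minimal sketch of that step in plain Python follows, using `str.format` instead of LangChain's prompt classes. The helper `combine_passages` is hypothetical, and the passage bodies are placeholders rather than real retrieved text.

```python
from typing import Dict, List

# Mirrors the layout of the document prompt: a source line, then the passage text.
DOCUMENT_TEMPLATE = "\n---\nSOURCE: {name}\n{page_content}\n---\n"


def combine_passages(passages: List[Dict[str, str]], separator: str = "\n\n") -> str:
    """Render each passage with the template and join them into one context string."""
    return separator.join(DOCUMENT_TEMPLATE.format(**p) for p in passages)


# Placeholder passages; real ones would come from the retriever.
passages = [
    {"name": "Work From Home Policy", "page_content": "(passage text 1)"},
    {"name": "April Work From Home Update", "page_content": "(passage text 2)"},
]

context = combine_passages(passages)
print(context)
```

Labelling each passage with its source name lets the model tell the documents apart when several of them land in the same prompt.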

In [59]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible. 
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("How do different teams collaborate within the company?")
print("---- Answer ----")
print(ans)

ans = chain.invoke("What are the key company policies regarding remote work?")
print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the various ways in which teams work together in the company?', '2. Can you provide insights on the collaboration methods used by different teams in the company?', '3. How does the company foster collaboration among its teams?']


---- Answer ----

Different teams within the company, such as the engineering and sales teams, collaborate through effective communication, mutual respect and support, and continuous improvement. This includes attending meetings and calls, providing regular updates and training, working together on projects, seeking feedback, and recognizing each other's efforts. By understanding each other's roles and working together, these teams can contribute to the overall success of the company.


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the main company guidelines for remote work?', "2. Can you provide information on the company's policies for working remotely?", '3. How does the company handle remote work and what are the key policies in place?']


---- Answer ----

The key company policies regarding remote work include eligibility, equipment and resources, workspace, communication, work hours and availability, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, policy review and updates, and questions and concerns. These policies are designed to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond. Eligible employees must have approval from their direct supervisor and the HR department, and will be provided with necessary equipment and resources such as a company-issued laptop and access to secure communication tools. Employees are responsible for creating a comfortable and safe workspace, maintaining regular communication with their supervisors and colleagues, and adhering to confidentiality and data security policies. They are also expected t

**Generate at least two new iterations of the previous cells. Be creative.** Did you master MultiQuery Retriever concepts through this lab?