# Question Answering with LangChain, OpenAI, and MultiQuery Retriever

This interactive workbook demonstrates how to use LangChain's [MultiQuery Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) to generate similar queries for a given user input and apply all of them to retrieve a larger set of relevant documents from a vector store.

Before we begin, we split the fictional workplace documents into passages with `langchain`, use OpenAI to transform these passages into embeddings, and store them in Elasticsearch.

We will then ask a question, generate similar questions using langchain and OpenAI, retrieve relevant passages from the vector store, and use langchain and OpenAI again to provide a summary for the questions.
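As a rough illustration of the flow described above, the sketch below runs several query variants against a search function and merges the hits into one de-duplicated list. It is a minimal sketch under stated assumptions: `multi_query_retrieve`, `fake_variants`, `fake_search`, and the tiny in-memory index are hypothetical stand-ins for the LLM and the vector store, not LangChain APIs.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run every query variant and return the de-duplicated union of hits."""
    seen = set()
    merged = []
    for query in generate_variants(question):
        for passage in search(query):
            if passage not in seen:  # keep only the first occurrence
                seen.add(passage)
                merged.append(passage)
    return merged


# Hypothetical stand-in for the LLM that rewrites the question.
def fake_variants(question: str) -> List[str]:
    return ["remote work rules", "work from home policy", "home office expectations"]


# Hypothetical stand-in for the vector store: query -> matching passages.
FAKE_INDEX: Dict[str, List[str]] = {
    "remote work rules": ["passage A", "passage B"],
    "work from home policy": ["passage B", "passage C"],
    "home office expectations": ["passage A"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


passages = multi_query_retrieve("wfh guidelines", fake_variants, fake_search)
print(passages)
```

Each variant alone finds at most two passages here, while the merged result covers all three, which is the effect the retriever is meant to achieve.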

## Install packages and import modules

In [2]:
!pip install -qU jq lark langchain langchain-elasticsearch langchain_openai tiktoken

In [3]:
#!python3 -m pip install -qU jq lark langchain langchain-elasticsearch langchain_openai tiktoken

from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai.llms import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from getpass import getpass

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment. It makes it easy to create the index and index data. The documents themselves are prepared and indexed in the sections that follow.

In [8]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# https://platform.openai.com/api-keys
OPENAI_API_KEY = getpass("OpenAI API key: ")

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vectorstore = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="workplace_docs",  # Index name for storing workplace document embeddings
    embedding=embeddings,
)

## Indexing Data into Elasticsearch
Let's download the sample dataset and deserialize the document.

In [9]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)
data = json.load(response)

with open("temp.json", "w") as json_file:
    json.dump(data, json_file)

In [10]:
data

[{'content': "Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\nEligibility\n\nEmployees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.\nEquipment and Resources\n\nThe necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are respo

### Split Documents into Passages

We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

Here we are chunking documents into 800-token passages with an overlap of 400 tokens.

We are using a simple splitter, but LangChain offers more advanced splitters that reduce the chance of context being lost.
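To make the chunk size and overlap concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also respects separators such as paragraph breaks); the `chunk_tokens` helper and the synthetic token list are assumptions made purely for illustration.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic 2000-token document, standing in for real tokenizer output.
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens, chunk_size=800, overlap=400)
print(len(chunks), [len(c) for c in chunks])
```

With an overlap of half the chunk size, every token outside the first and last 400 appears in two passages, so a sentence that falls on a chunk boundary is still seen whole in the neighboring passage.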

In [11]:
from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def metadata_func(record: dict, metadata: dict) -> dict:
    # Populate the metadata dictionary with keys from the record
    metadata["name"] = record.get("name")
    metadata["summary"] = record.get("summary") 
    metadata["url"] = record.get("url")
    metadata["category"] = record.get("category")
    metadata["created_on"] = record.get("created_on")
    return metadata


# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800, chunk_overlap=400  # Set chunk size to 800 tokens with 400 token overlap
)
docs = loader.load_and_split(text_splitter=text_splitter)

In [None]:
docs

Splitting documents into passages is important for document organization and memory efficiency. Breaking documents into smaller chunks helps with:

- Better organization and retrieval of relevant content
- Reduced memory consumption by working with smaller text segments
- More focused context for question answering

### Bulk Import Passages

Now that we have split each document into 800-token passages, we will index them into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

We will use the Cloud ID, API key, and index name values set in the `Connect to Elasticsearch` step.

In [14]:
documents = vectorstore.from_documents(
    docs,
    embeddings,
    index_name="workplace_docs",  # Added index name which is required
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
)

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

retriever = MultiQueryRetriever.from_llm(vectorstore.as_retriever(), llm)

## Question Answering with MultiQuery Retriever

Now that the passages are stored in Elasticsearch, we can ask a question and retrieve the relevant passages.

In [15]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible. 
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("what is the nasa sales team?")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the sales team at NASA?', '2. How does the sales team operate within NASA?', '3. What are the responsibilities of the NASA sales team?']


---- Answer ----
The NASA sales team is a part of the Americas region in the sales organization of the company. It is led by two Area Vice-Presidents, Laura Martinez for North America and Gary Johnson for South America. The team is responsible for promoting and selling the company's products and services in the North and South American markets. They work closely with other departments, such as marketing, product development, and customer support, to ensure the company's success in these regions.


**Generate at least two new iterations of the previous cells - Be creative.** Did you master Multi-Query Retriever concepts through this lab?

In [16]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks focused on company culture and values. Use the following pieces of retrieved context to provide detailed insights about workplace culture, employee experience, and organizational values. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible.
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

# Query about company culture and values
ans = chain.invoke("What are our company's core values and culture? How do they shape the employee experience?")

print("---- Answer about Company Culture and Values ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ["1. What are the fundamental principles and beliefs that drive our company's operations and define our culture?", "2. How do our company's core values and culture impact the overall employee experience?", '3. Can you provide insights into the core values and culture of our company and how they influence the employee experience?']


---- Answer about Company Culture and Values ----

Our company's core values are integrity, teamwork, excellence, innovation, and respect. These values are the foundation of our company culture and guide our actions and decisions. We believe that by upholding these values, we can create a diverse, inclusive, and supportive work environment for our employees.

Our culture is one of collaboration, innovation, and continuous learning. We value our employees as our most valuable asset and strive to foster a culture of teamwork and mutual support. We encourage creativity and embrace change to stay ahead in the market, and we believe in treating each other with dignity and respect, valuing the unique perspectives of all our colleagues.

These values and culture shape the employee experience in several ways. Firstly, they set the tone for how employees interact with each other and with the company. By promoting integrity and respect, we create a positive and inclusive work environment where e

In [18]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks focused on company drug policies. Use the following pieces of retrieved context to provide detailed insights about drug testing requirements, substance abuse policies, and workplace safety guidelines. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible.
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

# Query about drug policies
ans = chain.invoke("What are our company's drug policies? What are the guidelines around drug testing and substance abuse?")

print("---- Answer about Drug Policies ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the protocol for drug testing and substance abuse within our company?', "2. Can you provide information on our company's policies regarding drugs and substance abuse?", '3. How does our company handle drug-related issues and testing?']


---- Answer about Drug Policies ----

Our company has a strict code of conduct that outlines guidelines for professional and ethical behavior in the workplace. This includes adhering to our core values of integrity, respect, accountability, collaboration, and excellence. As part of this code of conduct, employees are expected to comply with all applicable laws, regulations, and organizational policies, including those related to drug use and substance abuse.

In terms of drug testing, our company may conduct drug tests as part of the hiring process or randomly throughout an employee's tenure. This is to ensure a safe and productive work environment for all employees. Employees who test positive for illegal substances may face disciplinary action, up to and including termination of employment.

Our company also has policies in place to address substance abuse. We understand that addiction is a serious issue and we are committed to providing support and resources for employees who may be

In [17]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks focused on remote work policies. Use the following pieces of retrieved context to provide detailed insights about work-from-home guidelines, expectations, and best practices. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible.
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

# Query about remote work policies
ans = chain.invoke("What are our company's work-from-home policies and guidelines? What are the expectations for remote employees?")

print("---- Answer about Remote Work Policies ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the specific policies and guidelines for working from home at our company?', '2. Can you provide information on the work-from-home expectations for employees at our company?', '3. How does our company handle remote work and what are the guidelines for employees?']


---- Answer about Remote Work Policies ----

Our company has a full-time work-from-home policy in place, effective since March 2020. This policy is designed to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond. The policy applies to all eligible employees, as determined by their role and responsibilities, and allows for full-time remote work while maintaining the same level of performance and collaboration as in-office work.

To be eligible for remote work, employees must have received approval from their direct supervisor and the HR department. Necessary equipment and resources, such as a company-issued laptop, software licenses, and access to secure communication tools, will be provided to employees for remote work. However, employees are responsible for maintaining and protecting the company's equipment and data.

When working from home, employees are ex

In [19]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks focused on conflict of interest policies. Use the following pieces of retrieved context to provide detailed insights about disclosure requirements, prohibited activities, and reporting procedures. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible.
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

# Query about conflict of interest policies
ans = chain.invoke("What are our company's conflict of interest policies? What activities are prohibited and what are the disclosure requirements?")

print("---- Answer about Conflict of Interest Policies ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the conflict of interest policies in place at our company?', 'What actions are not allowed and what are the rules for disclosing potential conflicts?', "2. Can you provide information on our company's policies regarding conflicts of interest?", 'What activities are forbidden and what are the requirements for disclosing any potential conflicts?', '3. How does our company handle conflicts of interest?', 'What actions are restricted and what are the guidelines for disclosing any potential conflicts?']


---- Answer about Conflict of Interest Policies ----

Our company has a strict code of conduct that outlines guidelines for professional and ethical behavior in the workplace. This code applies to all employees, contractors, and volunteers within the organization, regardless of their role or seniority.

One of the core values that employees are expected to adhere to is integrity, which includes acting honestly, ethically, and in the best interests of the organization at all times. This value extends to avoiding situations where personal interests may conflict with or influence professional judgment.

If a potential conflict of interest arises, employees are required to disclose it to their supervisor or the appropriate authority within the organization. This includes any personal relationships, financial interests, or outside activities that may impact their work.

Prohibited activities include using company resources for personal gain, accepting gifts or favors that may influence deci

The cells above try some creative variations of our multi-query retriever:

1. Exploring company culture and values:
   - We can ask about workplace culture, values, and employee experience
   - The multi-query retriever will generate different perspectives on these topics

2. Understanding drug and substance policies:
   - We can query about drug testing procedures and substance abuse guidelines
   - The retriever will surface different aspects of drug-related policies

3. Understanding remote work policies:
   - We can explore work-from-home guidelines and expectations
   - Multiple queries will help surface different aspects of remote work

4. Examining conflict of interest policies:
   - We can investigate guidelines around potential conflicts
   - The retriever can surface nuanced aspects of disclosure requirements
   - Multiple queries help understand both prohibited activities and reporting procedures
 
Through this lab, I've learned that the Multi-Query Retriever:

- Automatically generates multiple variations of a query
- Improves retrieval by considering different angles/perspectives
- Helps surface more comprehensive and relevant information
- Is especially useful for complex or nuanced questions
- Can effectively parse policy documents to find relevant guidelines
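
One detail worth noting from the logs: the conflict-of-interest run lists six generated queries rather than three. That is consistent with the generated text being split on line breaks, so a two-sentence question that wraps onto a second line becomes two separate queries. The sketch below illustrates that behavior under this assumption; `split_into_queries` and the sample text are hypothetical and are not the library's own parser.

```python
from typing import List


def split_into_queries(llm_text: str) -> List[str]:
    """Treat every non-empty line of the generated text as one query."""
    return [line.strip() for line in llm_text.strip().split("\n") if line.strip()]


# Hypothetical LLM output in which each numbered item spans two lines.
generated = (
    "1. What are the conflict of interest policies?\n"
    "What actions are not allowed?\n"
    "2. How are conflicts of interest handled?\n"
    "What must be disclosed?"
)

queries = split_into_queries(generated)
print(len(queries))
for query in queries:
    print(query)
```

Asking a single-sentence question, or one question per invocation, is a simple way to keep the generated variants aligned with the original intent.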
