# Question Answering with LangChain, OpenAI, and MultiQuery Retriever

This interactive workbook demonstrates how to use LangChain's [MultiQuery Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) to generate similar queries for a given user input and apply all of them to retrieve a larger set of relevant documents from a vector store.

Before we begin, we split the fictional workplace documents into passages with `langchain`, use OpenAI to transform these passages into embeddings, and store them in Elasticsearch.

We will then ask a question, generate similar questions using LangChain and OpenAI, retrieve relevant passages from the vector store, and use LangChain and OpenAI again to produce a summarized answer to the question.
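
The multi-query idea described above can be sketched without any external services. The snippet below is a minimal illustration, not LangChain's implementation: the toy corpus, the `keyword_retriever` stand-in for vector search, and the hand-written query variants (which an LLM would normally generate) are all hypothetical.

```python
from typing import Callable, Dict, List

# Toy corpus standing in for the vector store (hypothetical data).
CORPUS: Dict[str, str] = {
    "doc-1": "work from home policy for remote employees",
    "doc-2": "sales team structure for the americas region",
    "doc-3": "vacation and paid time off policy",
}


def keyword_retriever(query: str, k: int = 2) -> List[str]:
    """Rank documents by words shared with the query (a stand-in for vector search)."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    # Keep only documents with at least one shared word, best score first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


def multi_query_retrieve(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query variant and return the de-duplicated union of the hits."""
    unique_hits: List[str] = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in unique_hits:
                unique_hits.append(doc_id)
    return unique_hits


# Hand-written variants of one question; an LLM would normally produce these.
variants = ["remote work policy", "working from home rules", "time off policy"]
result = multi_query_retrieve(variants, keyword_retriever)
print(result)
```

No single variant needs to be a perfect match: each one surfaces its own hits, and the union is what gets passed on as context.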

## Install packages and import modules

In [13]:
# !python3 -m pip install -qU jq lark langchain langchain-elasticsearch langchain_openai tiktoken
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai.llms import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment. This makes it easy to create an index and add data to it. The documents themselves are prepared and indexed in the following sections.

In [15]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = os.environ.get('ELASTIC_CLOUD_ID')

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = os.environ.get('ELASTIC_API_KEY')

# https://platform.openai.com/api-keys
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vectorstore = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="elastic_ironhack", #give it a meaningful name
    embedding=embeddings,
)
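
Since `os.environ.get` silently returns `None` for a missing variable, a connection problem can be hard to trace back to a missing setting. The following is a small optional sketch of a fail-fast check; the helper name `require_settings` and the demo values are hypothetical and not part of the original notebook.

```python
import os
from typing import Dict, Mapping, Optional, Sequence


def require_settings(names: Sequence[str], env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the requested settings, raising early if any is missing or empty."""
    source = os.environ if env is None else env
    missing = [name for name in names if not source.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: source[name] for name in names}


# Demonstration with a hypothetical in-memory environment instead of real secrets.
fake_env = {"ELASTIC_CLOUD_ID": "demo-id", "ELASTIC_API_KEY": "demo-key"}
settings = require_settings(["ELASTIC_CLOUD_ID", "ELASTIC_API_KEY"], env=fake_env)
print(sorted(settings))
```

In the notebook, calling the helper without the `env` argument would read from the real environment populated by `load_dotenv`.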

## Indexing Data into Elasticsearch
Let's download the sample dataset and deserialize the document.

In [16]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

# Open the URL and load the JSON data
response = urlopen(url)
data = json.load(response)

# Save the data to a temporary JSON file so the document loader can read it
with open("temp.json", "w") as json_file:
    json.dump(data, json_file)

### Split Documents into Passages

We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

Here we are chunking documents into 800-token passages with an overlap of 400 tokens.

We are using a simple splitter here, but LangChain offers more advanced splitters that reduce the chance of context being lost.
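
To see what chunk size and overlap mean in practice, here is a simplified sliding-window sketch. It is an illustration under stated assumptions, not how `RecursiveCharacterTextSplitter` works internally: the real splitter breaks text on separators and measures length with a tokenizer, whereas this hypothetical `chunk_with_overlap` helper just slices a pre-tokenized list at fixed positions.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slice tokens into windows of chunk_size that share chunk_overlap tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Small numbers keep the output readable; the notebook uses 800 and 400.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With an overlap of half the chunk size, as in this notebook, every token away from the edges appears in two passages, which trades extra storage for a lower chance of splitting a relevant sentence away from its context.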

In [42]:
from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def metadata_func(record: dict, metadata: dict) -> dict:
    # Copy every field except 'content' into the metadata dictionary
    # (name, summary, url, category, and updated_at).
    
    for k, v in record.items():
        if k != 'content':
            metadata[k] = v
    return metadata


# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",          # Path to the JSON file
    jq_schema=".[]",                # jq expression selecting each record
    content_key="content",          # Key for the content
    metadata_func=metadata_func,    # Metadata function
)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800, chunk_overlap=400 #define chunk size and chunk overlap
)
docs = loader.load_and_split(text_splitter=text_splitter)
print(len(docs))

# print 1 document content
print(docs[0])

15
page_content='Effective: March 2020
Purpose

The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.
Scope

This policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.
Eligibility

Employees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.
Equipment and Resources

The necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are responsible fo
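
As a side note on the loader configuration above, the sketch below shows what `metadata_func` does to a single record: the `content` field is left out (it becomes the passage text), and every other field is copied into the metadata. The sample record and the starting metadata values are hypothetical and only mimic the shape of the dataset.

```python
from typing import Any, Dict


def metadata_func(record: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy every field except 'content' from the record into the metadata."""
    for key, value in record.items():
        if key != "content":
            metadata[key] = value
    return metadata


# Hypothetical record with the same keys the notebook expects.
record = {
    "name": "Example Policy",
    "summary": "A made-up summary for illustration.",
    "url": "./example/policy.txt",
    "category": "example",
    "updated_at": "2020-01-01",
    "content": "Full text of the document goes here.",
}

# Assumed starting metadata, similar to what a loader might pass in.
metadata = metadata_func(record, {"source": "temp.json", "seq_num": 1})
print(sorted(metadata))
```

Keeping fields such as `name` in the metadata matters later, because the document prompt refers to `{name}` when it labels each passage with its source.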

### Bulk Import Passages

Now that we have split each document into passages of up to 800 tokens, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

We will use the Cloud ID, API key, and index name values set in the `Connect to Elasticsearch` step.

In [59]:
# initialize the Elastic vectorstore with the documents
documents = vectorstore.from_documents(
    docs,                           # list of documents
    embeddings,                     # embeddings
    index_name='elastic_ironhack',  # index name
    es_cloud_id=ELASTIC_CLOUD_ID,   # cloud id
    es_api_key=ELASTIC_API_KEY,     # api key
)

# define the language model
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, max_tokens=500)

# initialize the retriever
retriever = MultiQueryRetriever.from_llm(vectorstore.as_retriever(), llm)

## Question Answering with MultiQuery Retriever

Now that the passages are stored in Elasticsearch, we can ask a question and retrieve the relevant passages.

In [53]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging
# Set the logging level to INFO for the retriever
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Define the prompt templates
LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved
    context to answer the question. If you don't know the answer, just say that you don't know. 
    Be as verbose and educational in your response as possible. 
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

# Define the document prompt template
LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
    ---
    SOURCE: {name}
    {page_content}
    ---
    """
)


def _combine_documents(docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"):
    """Combine multiple documents into a single string with a separator between each document.

    Args:
        docs: List of documents to combine.
        document_prompt: prompt template to use for each document.
        document_separator: separator to use between each document.

    Returns:
        str: combined documents.
    """
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("what is the nasa sales team?")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the sales team at NASA?', '2. How does the sales team operate within NASA?', '3. What are the responsibilities of the NASA sales team?']


---- Answer ----
The NASA sales team is a part of the Americas region within the sales organization. It is led by two Area Vice-Presidents, Laura Martinez and Gary Johnson, who are responsible for managing the sales team in North America and South America respectively. The team is made up of dedicated account managers, sales representatives, and support staff who work together to identify and pursue new business opportunities, nurture existing client relationships, and ensure customer satisfaction. They also collaborate closely with other departments, such as marketing, product development, and customer support, to deliver high-quality products and services to clients in the Americas region.
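
To make the data flow in the chain above easier to follow, here is a plain-Python sketch of the two formatting steps it performs: rendering each retrieved passage with a document template, then placing the joined passages and the question into the answer prompt. The `Passage` class, the shortened templates, and the sample passages are hypothetical stand-ins; the notebook itself relies on LangChain's `format_document` and prompt classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Passage:
    """Minimal stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Shortened versions of the notebook's templates (assumed for illustration).
DOCUMENT_TEMPLATE = "---\nSOURCE: {name}\n{page_content}\n---"
ANSWER_TEMPLATE = 'context: {context}\nQuestion: "{question}"\nAnswer:'


def combine_passages(passages: List[Passage], separator: str = "\n\n") -> str:
    """Render each passage with the document template and join the results."""
    rendered = [
        DOCUMENT_TEMPLATE.format(name=p.metadata["name"], page_content=p.page_content)
        for p in passages
    ]
    return separator.join(rendered)


def build_prompt(question: str, passages: List[Passage]) -> str:
    """Fill the answer template with the combined context and the question."""
    return ANSWER_TEMPLATE.format(context=combine_passages(passages), question=question)


# Hypothetical passages; in the notebook these come from the retriever.
passages = [
    Passage("The team is organized by region.", {"name": "Example Overview"}),
    Passage("Each region has an area lead.", {"name": "Example Org Chart"}),
]
prompt = build_prompt("Who leads the sales team?", passages)
print(prompt)
```

The final string is what the language model receives, which is why a missing `name` in a passage's metadata would cause the formatting step to fail before the model is ever called.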


**Generate at least two new iterations of the previous cells - Be creative.** Did you master Multi-Query Retriever concepts through this lab?

In [56]:
# Define the prompt templates
LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for summarizing information into bullet points. Use the following pieces
    of retrieved context to generate a summarized answer to the question, in 3 clear bullet-points. 
    If you don't know the answer, just say that you don't know. 
    Bullet points should be brief and only contain the most relevant keywords.
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

# Define the document prompt template
LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
    ---
    SOURCE: {name}
    {page_content}
    ---
    """
)


def _combine_documents(docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"):
    """Combine multiple documents into a single string with a separator between each document.

    Args:
        docs: List of documents to combine.
        document_prompt: prompt template to use for each document.
        document_separator: separator to use between each document.

    Returns:
        str: combined documents.
    """
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("What are our core values?")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the fundamental principles that guide our organization?', '2. Can you tell me about the key beliefs that define our company?', "3. What are the central ideals that shape our company's culture?"]


---- Answer ----
- Our core values include integrity, teamwork, excellence, innovation, and respect.
    - We strive to create a diverse, inclusive, and supportive work environment.
    - We encourage creativity and embrace change to stay ahead in the market.


In [60]:
# Define the prompt templates
LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for generating emails about a given topic. Use the following pieces
    of retrieved context to generate the content of the email. Add the usual email intro and ending.
    The tone should be formal, it's for a professional email. The email should be brief and straight to the point.
    If you don't know the answer, just say that you don't know.
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

# Define the document prompt template
LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
    ---
    SOURCE: {name}
    {page_content}
    ---
    """
)


def _combine_documents(docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"):
    """Combine multiple documents into a single string with a separator between each document.

    Args:
        docs: List of documents to combine.
        document_prompt: prompt template to use for each document.
        document_separator: separator to use between each document.

    Returns:
        str: combined documents.
    """
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("Explain to our new client Mr. John Doe where to find and how to use the TD1 form.")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide a detailed explanation to our new client Mr. John Doe on the location and proper usage of the TD1 form?', '2. How can Mr. John Doe access and effectively utilize the TD1 form? Please provide a step-by-step guide.', '3. In what ways can you assist our new client Mr. John Doe in locating and utilizing the TD1 form? Please provide clear instructions.']


---- Answer ----

Dear Mr. John Doe,

I hope this email finds you well. As a new employee in Canada, it is important for you to understand how to update your tax elections forms to ensure accurate tax deductions from your pay. This guide will provide you with the necessary information on how to access and complete the TD1 Personal Tax Credits Return form.

Firstly, the TD1 form can be found on the Canada Revenue Agency (CRA) website. Your employer may also provide you with a paper copy or a link to the online form. To access the form directly, please use the following link: https://www.canada.ca/en/revenue-agency/services/forms-publications/td1-personal-tax-credits-returns.html

Once you have accessed the form, please make sure to select the correct version based on your province or territory of residence. It is important to fill out both the federal TD1 form and, if applicable, the provincial or territorial TD1 form.

For the best experience, we recommend downloading and opening the f