# Question Answering with LangChain, OpenAI, and MultiQuery Retriever

This interactive workbook demonstrates how to use LangChain's [MultiQuery Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) to generate similar queries for a given user input and apply all of them to retrieve a larger set of relevant documents from a vector store.

Before we begin, we split the fictional workplace documents into passages with `langchain`, use OpenAI to transform these passages into embeddings, and store them in Elasticsearch.

We will then ask a question, generate similar questions using LangChain and OpenAI, retrieve relevant passages from the vector store, and use LangChain and OpenAI again to generate an answer from those passages.
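The multi-query idea described above can be sketched without any external services. The snippet below is a minimal illustration, not LangChain's implementation: `toy_retrieve`, `CORPUS`, `STOPWORDS`, and the hand-written query variants are all hypothetical stand-ins for the vector store and for the LLM-generated queries. It shows the core pattern of running every variant and keeping the de-duplicated union of the results.

```python
import re
from typing import Callable, List

# Hypothetical passages standing in for the contents of the vector store.
CORPUS = [
    "The NASA sales region covers North America and South America.",
    "Laura Martinez is Area Vice-President of North America.",
    "Employees must work from the office three days a week.",
]

# Small, hand-picked stopword list so common words do not count as matches.
STOPWORDS = {"the", "a", "is", "what", "of", "and", "in", "who", "which", "does", "to", "from"}


def content_words(text: str) -> set:
    """Lowercase the text, split it into words, and drop stopwords."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS


def toy_retrieve(query: str, k: int = 2) -> List[str]:
    """Toy retriever: rank passages by the number of words shared with the query."""
    scored = [(len(content_words(query) & content_words(p)), p) for p in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [passage for _, passage in ranked[:k]]


def multi_query_retrieve(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query variant and return the unique union, in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for passage in retrieve(query):
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged


# In the notebook these variants come from the LLM; here they are written by hand.
variants = [
    "what is the nasa sales team?",
    "who leads sales in north america?",
    "which region does nasa refer to?",
]
results = multi_query_retrieve(variants, toy_retrieve)
print(results)
```

In this toy setup the original question alone matches only the first passage, while the rephrased variant also surfaces the passage about the Area Vice-President, which is the effect the retriever is meant to achieve.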

## Install packages and import modules

In [1]:
!python3 -m pip install -qU jq lark langchain langchain-elasticsearch langchain_openai tiktoken




In [2]:
!pip install langchain_elasticsearch

Collecting langchain_elasticsearch
  Using cached langchain_elasticsearch-0.3.2-py3-none-any.whl.metadata (8.3 kB)
Collecting elasticsearch<9.0.0,>=8.13.1 (from elasticsearch[vectorstore-mmr]<9.0.0,>=8.13.1->langchain_elasticsearch)
  Using cached elasticsearch-8.17.1-py3-none-any.whl.metadata (8.8 kB)
Collecting elastic-transport<9,>=8.15.1 (from elasticsearch<9.0.0,>=8.13.1->elasticsearch[vectorstore-mmr]<9.0.0,>=8.13.1->langchain_elasticsearch)
  Using cached elastic_transport-8.17.0-py3-none-any.whl.metadata (3.6 kB)
Collecting simsimd>=3 (from elasticsearch[vectorstore-mmr]<9.0.0,>=8.13.1->langchain_elasticsearch)
  Downloading simsimd-6.2.1-cp310-cp310-win_amd64.whl.metadata (67 kB)
Using cached langchain_elasticsearch-0.3.2-py3-none-any.whl (45 kB)
Using cached elasticsearch-8.17.1-py3-none-any.whl (653 kB)
Using cached elastic_transport-8.17.0-py3-none-any.whl (64 kB)
Downloading simsimd-6.2.1-cp310-cp310-win_amd64.whl (86 kB)
Installing collected packages: simsimd, elastic-tra

In [2]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai.llms import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from getpass import getpass

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

We'll use the **Cloud ID** to identify our deployment, because we are using an Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment. It makes creating the index and indexing data straightforward. The documents themselves are loaded, split, and indexed in the sections that follow.

In [6]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# https://platform.openai.com/api-keys
OPENAI_API_KEY = getpass("OpenAI API key: ")

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vectorstore = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="multi_query_index", #give it a meaningful name
    embedding=embeddings,
)

## Indexing Data into Elasticsearch
Let's download the sample dataset and deserialize the document.

In [7]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)
data = json.load(response)

with open("temp.json", "w") as json_file:
    json.dump(data, json_file)

In [11]:
# To confirm the expected data content
with open("temp.json", "r") as f:
    data = json.load(f)
print(data[:5])  # Print the first 5 records

[{'content': "Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\nEligibility\n\nEmployees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.\nEquipment and Resources\n\nThe necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are respo

In [29]:
# Show each record's position, its number of fields, and its content
for count, record in enumerate(data):
    display(count, len(record), record)

0

9

{'content': "Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\nEligibility\n\nEmployees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.\nEquipment and Resources\n\nThe necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are respon

1

9

{'content': 'Starting May 2022, the company will be implementing a two-day in-office work requirement per week for all eligible employees. Please coordinate with your supervisor and HR department to schedule your in-office workdays while continuing to follow all safety protocols.\n',
 'summary': 'Starting May 2022, employees will need to work two days a week in the office. Coordinate with your supervisor and HR department for these days while following safety protocols.',
 'name': 'April Work From Home Update',
 'url': './sharepoint/April work from home update.txt',
 'created_on': '2022-04-29',
 'updated_at': '2022-04-29',
 'category': 'teams',
 '_run_ml_inference': True,
 'rolePermissions': ['demo', 'manager']}

2

9

{'content': 'As we continue to prioritize the well-being of our employees, we are making a slight adjustment to our hybrid work policy. Starting May 1, 2023, employees will be required to work from the office three days a week, with two days designated for remote work. Please communicate with your supervisor and HR department to establish your updated in-office workdays.\n',
 'summary': 'Starting May 1, 2023, our hybrid work policy will require employees to work from the office three days a week and two days remotely.',
 'name': 'Wfh Policy Update May 2023',
 'url': './sharepoint/WFH policy update May 2023.txt',
 'created_on': '2023-05-01',
 'updated_at': '2023-05-01',
 'category': 'teams',
 '_run_ml_inference': True,
 'rolePermissions': ['demo', 'manager']}

3

9

{'content': "Executive Summary:\nThis sales strategy document outlines the key objectives, focus areas, and action plans for our tech company's sales operations in fiscal year 2024. Our primary goal is to increase revenue, expand market share, and strengthen customer relationships in our target markets.\n\nI. Objectives for Fiscal Year 2024\n\nIncrease revenue by 20% compared to fiscal year 2023.\nExpand market share in key segments by 15%.\nRetain 95% of existing customers and increase customer satisfaction ratings.\nLaunch at least two new products or services in high-demand market segments.\n\nII. Focus Areas\nA. Target Markets:\nContinue to serve existing markets with a focus on high-growth industries.\nIdentify and penetrate new markets with high potential for our products and services.\n\nB. Customer Segmentation:\nStrengthen relationships with key accounts and strategic partners.\nPursue new customers in underserved market segments.\nDevelop tailored offerings for different cust

4

9

{'content': "Purpose\n\nThe purpose of this vacation policy is to outline the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. This policy aims to promote a healthy work-life balance and encourage employees to take time to rest and recharge.\nScope\n\nThis policy applies to all full-time and part-time employees who have completed their probationary period.\nVacation Accrual\n\nFull-time employees accrue vacation time at a rate of [X hours] per month, equivalent to [Y days] per year. Part-time employees accrue vacation time on a pro-rata basis, calculated according to their scheduled work hours.\n\nVacation time will begin to accrue from the first day of employment, but employees are eligible to take vacation time only after completing their probationary period. Unused vacation time will be carried over to the next year, up to a maximum of [Z days]. Any additional unused vacation time will be forfeited.\nVacation Scheduling\n\nEmp

5

8

{'content': 'This career leveling matrix provides a framework for understanding the various roles and responsibilities of Software Engineers, as well as the skills and experience required for each level. This matrix is intended to support employee development, facilitate performance evaluations, and provide a clear career progression path.\nJunior Software Engineer\n\nResponsibilities:\nCollaborate with team members to design, develop, and maintain software applications and components.\nWrite clean, well-structured, and efficient code following established coding standards.\nParticipate in code reviews, providing and receiving constructive feedback.\nTroubleshoot and resolve software defects and issues.\nAssist with the creation of technical documentation.\nContinuously learn and stay up-to-date with new technologies and best practices.\n\nSkills & Experience:\nBachelor’s degree in Computer Science or a related field, or equivalent work experience.\nBasic understanding of software deve

6

8

{'content': "Title: Working with the Sales Team as an Engineer in a Tech Company\n\nIntroduction:\nAs an engineer in a tech company, collaboration with the sales team is essential to ensure the success of the company's products and services. This guidance document aims to provide an overview of how engineers can effectively work with the sales team, fostering a positive and productive working environment.\nUnderstanding the Sales Team's Role:\nThe sales team is responsible for promoting and selling the company's products and services to potential clients. Their role involves establishing relationships with customers, understanding their needs, and ensuring that the offered solutions align with their requirements.\n\nAs an engineer, it is important to understand the sales team's goals and objectives, as this will help you to provide them with the necessary information, tools, and support to successfully sell your company's products and services.\nCommunication:\nEffective communication 

7

8

{'content': "Purpose\nThe purpose of this Intellectual Property Policy is to establish guidelines and procedures for the ownership, protection, and utilization of intellectual property generated by employees during their employment. This policy aims to encourage creativity and innovation while ensuring that the interests of both the company and its employees are protected.\n\nScope\nThis policy applies to all employees, including full-time, part-time, temporary, and contract employees.\n\nDefinitions\na. Intellectual Property (IP): Refers to creations of the mind, such as inventions, literary and artistic works, designs, symbols, and images, that are protected by copyright, trademark, patent, or other forms of legal protection.\nb. Company Time: Refers to the time during which an employee is actively engaged in performing their job duties.\nc. Outside Company Time: Refers to the time during which an employee is not engaged in performing their job duties.\n\nOwnership of Intellectual Pr

8

8

{'content': "Code of Conduct\nPurpose\n\nThe purpose of this code of conduct is to establish guidelines for professional and ethical behavior in the workplace. It outlines the principles and values that all employees are expected to uphold in their interactions with colleagues, customers, partners, and other stakeholders.\nScope\n\nThis code of conduct applies to all employees, contractors, and volunteers within the organization, regardless of their role or seniority.\nCore Values\n\nEmployees are expected to adhere to the following core values:\n\na. Integrity: Act honestly, ethically, and in the best interests of the organization at all times.\nb. Respect: Treat all individuals with dignity, courtesy, and fairness, regardless of their background, beliefs, or position.\nc. Accountability: Take responsibility for one's actions and decisions, and be willing to learn from mistakes.\nd. Collaboration: Work cooperatively with colleagues and partners to achieve shared goals and promote a po

9

8

{'content': "Content:\nThe purpose of this office pet policy is to outline the guidelines and procedures for bringing pets into the workplace. This policy aims to create a positive and inclusive work environment while ensuring the comfort, safety, and well-being of all employees, visitors, and pets.\nScope\n\nThis policy applies to all employees who wish to bring their pets to the office. Pets covered under this policy include dogs, cats, and other small, non-exotic animals, subject to approval by the HR department.\nPet Approval Process\n\nEmployees must obtain prior approval from their supervisor and the HR department before bringing their pets to the office. The approval process includes:\n\na. Submitting a written request, including a description of the pet, its breed, age, and temperament.\nb. Providing proof of up-to-date vaccinations and any required licenses or permits.\nc. Obtaining written consent from all employees who share the workspace with the pet owner.\n\nThe HR depart

10

8

{'content': 'Performance Management Policy\nPurpose and Scope\nThe purpose of this Performance Management Policy is to establish a consistent and transparent process for evaluating, recognizing, and rewarding employee performance. This policy applies to all employees and aims to foster a culture of continuous improvement, professional growth, and open communication between employees and management.\nPerformance Planning and Goal Setting\nAt the beginning of each performance cycle, employees and their supervisors will collaborate to set clear, achievable, and measurable performance goals. These goals should align with the company’s strategic objectives and take into account the employee’s job responsibilities, professional development, and career aspirations.\nOngoing Feedback and Communication\nThroughout the performance cycle, employees and supervisors are encouraged to engage in regular, constructive feedback and open communication. This includes discussing progress towards goals, ad

11

8

{'content': 'Our sales organization is structured to effectively serve our customers and achieve our business objectives across multiple regions. The organization is divided into the following main regions:\n\nThe Americas: This region includes the United States, Canada, Mexico, as well as Central and South America. The North America South America region (NASA) has two Area Vice-Presidents: Laura Martinez is the Area Vice-President of North America, and Gary Johnson is the Area Vice-President of South America.\n\nEurope: Our European sales team covers the entire continent, including the United Kingdom, Germany, France, Spain, Italy, and other countries. The team is responsible for understanding the unique market dynamics and cultural nuances, enabling them to effectively target and engage with customers across the region. The Area Vice-President for Europe is Rajesh Patel.\nAsia-Pacific: This region encompasses countries such as China, Japan, South Korea, India, Australia, and New Zeal

12

9

{'content': "Introduction:\nThis document outlines the compensation bands strategy for the various teams within our IT company. The goal is to establish a fair and competitive compensation structure that aligns with industry standards, rewards performance, and attracts top talent. By implementing this strategy, we aim to foster employee satisfaction and retention while ensuring the company's overall success.\n\nPurpose:\nThe purpose of this compensation bands strategy is to:\na. Define clear guidelines for salary ranges based on job levels and market benchmarks.\nb. Support equitable compensation practices across different teams.\nc. Encourage employee growth and performance.\nd. Enable effective budgeting and resource allocation.\n\nJob Levels:\nTo establish a comprehensive compensation structure, we have defined distinct job levels within each team. These levels reflect varying degrees of skills, experience, and responsibilities. The levels include:\na. Entry-Level: Employees with li

13

8

{'content': "As an employee in Canada, it's essential to understand how to update your tax elections forms to ensure accurate tax deductions from your pay. This guide will help you navigate the process of updating your TD1 Personal Tax Credits Return form.\n\nStep 1: Access the TD1 form\nThe TD1 form is available on the Canada Revenue Agency (CRA) website. Your employer might provide you with a paper copy or a link to the online form. You can access the form directly through the following link: https://www.canada.ca/en/revenue-agency/services/forms-publications/td1-personal-tax-credits-returns.html\n\nStep 2: Choose the correct form version\nYou'll need to fill out the federal TD1 form and, if applicable, the provincial or territorial TD1 form. Select the appropriate version based on your province or territory of residence.\n\nStep 3: Download and open the form\nFor the best experience, download and open the TD1 form in Adobe Reader. If you have visual impairments, consider using the l

14

8

{'content': "Welcome to our team! We are excited to have you on board and look forward to your valuable contributions. This onboarding guide is designed to help you get started by providing essential information about our policies, procedures, and resources. Please read through this guide carefully and reach out to the HR department if you have any questions.\nIntroduction to Our Company Culture and Values\nOur company is committed to creating a diverse, inclusive, and supportive work environment. We believe that our employees are our most valuable asset and strive to foster a culture of collaboration, innovation, and continuous learning. Our core values include:\nIntegrity: We act ethically and honestly in all our interactions.\nTeamwork: We work together to achieve common goals and support each other's growth.\nExcellence: We strive for the highest quality in our products, services, and relationships.\nInnovation: We encourage creativity and embrace change to stay ahead in the market
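The output above shows that some records have 9 fields and others 8. A quick way to see which fields are absent is to compare each record's keys against the union of all keys. The sketch below uses two hypothetical records with a reduced set of fields; the helper name `missing_keys_report` is an assumption for illustration. To run it on the real dataset, pass `data` instead of `records`.

```python
from typing import Dict, List

# Hypothetical records standing in for entries of `data`.
records = [
    {"content": "a", "summary": "s", "name": "n", "url": "u", "category": "teams"},
    {"content": "b", "summary": "s", "name": "n", "url": "u"},
]


def missing_keys_report(records: List[dict]) -> Dict[int, List[str]]:
    """Map each record index to the keys it lacks relative to the union of all keys."""
    all_keys = set().union(*(set(r) for r in records))
    report = {}
    for index, record in enumerate(records):
        missing = sorted(all_keys - set(record))
        if missing:
            report[index] = missing
    return report


report = missing_keys_report(records)
print(report)
```

Knowing which fields can be missing matters later, because `metadata_func` copies fields with `record.get(...)` and stores `None` for any that are absent.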

In [31]:
# Function to check the length of page_content for all documents
def check_document_lengths(docs):
    for i, doc in enumerate(docs):
        content_length = len(doc.page_content) if doc.page_content else 0
        print(f"Document {i+1} - Length: {content_length} characters")

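# Call the function to inspect lengths.
# Note: `docs` is created by the loader and splitter cell in "Split Documents into Passages" below; run that cell first.
check_document_lengths(docs)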

Document 1 - Length: 3266 characters
Document 2 - Length: 267 characters
Document 3 - Length: 360 characters
Document 4 - Length: 3017 characters
Document 5 - Length: 2539 characters
Document 6 - Length: 3837 characters
Document 7 - Length: 3295 characters
Document 8 - Length: 3172 characters
Document 9 - Length: 3863 characters
Document 10 - Length: 3034 characters
Document 11 - Length: 3477 characters
Document 12 - Length: 2132 characters
Document 13 - Length: 3422 characters
Document 14 - Length: 2483 characters
Document 15 - Length: 3992 characters


### Split Documents into Passages

We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

Here we are chunking documents into 800 token passages with an overlap of 400 tokens.

Here we are using a simple splitter, but LangChain offers more advanced splitters to reduce the chance of context being lost.
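To make the relationship between chunk size and overlap concrete, here is a simplified sliding-window sketch over a list of tokens. It is an illustration only: `sliding_window_chunks` is a hypothetical helper, and the splitter used in this notebook works differently, since it first splits on separators and then merges pieces up to the size limit. The arithmetic of the overlap is the same idea, though: each new chunk starts `chunk_size - chunk_overlap` tokens after the previous one.

```python
from typing import List, Sequence


def sliding_window_chunks(tokens: Sequence, chunk_size: int, chunk_overlap: int) -> List[list]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small numbers keep the example readable; the notebook uses 800 and 400.
tokens = list(range(10))
chunks = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With an overlap of half the chunk size, as in this notebook, every token away from the edges appears in two chunks, which roughly doubles the number of passages to embed in exchange for less context lost at chunk boundaries.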

In [13]:
!pip install jq

Collecting jq
  Downloading jq-1.8.0-cp310-cp310-win_amd64.whl.metadata (7.2 kB)
Downloading jq-1.8.0-cp310-cp310-win_amd64.whl (417 kB)
Installing collected packages: jq
Successfully installed jq-1.8.0


In [33]:
from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def metadata_func(record: dict, metadata: dict) -> dict:
    """
    Populate the metadata dictionary with relevant fields from the record.
    This metadata will be added to each document chunk.
    """
    # Copy selected fields from the record; a missing field is stored as None
    metadata["name"] = record.get("name")  # Document name
    metadata["summary"] = record.get("summary")  # Document summary
    metadata["url"] = record.get("url")  # Source URL of the document
    metadata["category"] = record.get("category")  # Category of the document
    metadata["updated_at"] = record.get("updated_at")  # Last update timestamp

    # Return the updated metadata dictionary
    return metadata


# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800, chunk_overlap=400 #define chunk size and chunk overlap
)
docs = loader.load_and_split(text_splitter=text_splitter)

In [15]:
#Test the metadata_func by printing metadata for a sample record

sample_record = {
    "name": "Document 1",
    "summary": "A brief summary.",
    "url": "http://example.com",
    "category": "Tutorial",
    "updated_at": "2025-01-30"
}
metadata = {}
print(metadata_func(sample_record, metadata))

{'name': 'Document 1', 'summary': 'A brief summary.', 'url': 'http://example.com', 'category': 'Tutorial', 'updated_at': '2025-01-30'}


In [16]:
print(docs[:2])  # Check the first two chunks of the documents

[Document(metadata={'source': 'C:\\Users\\larry\\OneDrive\\Documents\\GitHub\\lab-chatbot-with-multi-query-retriever\\temp.json', 'seq_num': 1, 'name': 'Work From Home Policy', 'summary': 'This policy outlines the guidelines for full-time remote work, including eligibility, equipment and resources, workspace requirements, communication expectations, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, and policy reviews and updates. Employees are encouraged to direct any questions or concerns', 'url': './sharepoint/Work from home policy.txt', 'category': 'teams', 'updated_at': '2020-03-01'}, page_content="Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are

### Bulk Import Passages

Now that we have split each document into passages of up to 800 tokens, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

We will use the Cloud ID, API key, and index name values set in the `Connect to Elasticsearch` step.

In [18]:
# Step 1: Print a Sample of the Documents
# Check the first few documents to identify inconsistencies
def debug_documents(docs):
    for i, doc in enumerate(docs[:5]):
        print(f"Document {i + 1}:\n", doc, "\n")

# Call the function to print the first few documents
debug_documents(docs)

Document 1:
 page_content='Effective: March 2020
Purpose

The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.
Scope

This policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.
Eligibility

Employees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.
Equipment and Resources

The necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are resp

In [34]:
# Inspect the specific document causing the issue (document 6)
print(f"Inspecting problematic document:\n{docs[5]}")

Inspecting problematic document:
page_content='This career leveling matrix provides a framework for understanding the various roles and responsibilities of Software Engineers, as well as the skills and experience required for each level. This matrix is intended to support employee development, facilitate performance evaluations, and provide a clear career progression path.
Junior Software Engineer

Responsibilities:
Collaborate with team members to design, develop, and maintain software applications and components.
Write clean, well-structured, and efficient code following established coding standards.
Participate in code reviews, providing and receiving constructive feedback.
Troubleshoot and resolve software defects and issues.
Assist with the creation of technical documentation.
Continuously learn and stay up-to-date with new technologies and best practices.

Skills & Experience:
Bachelor’s degree in Computer Science or a related field, or equivalent work experience.
Basic understan

In [30]:
# Validate the metadata and page content of the document
problematic_doc = docs[5]
print("Page Content:", problematic_doc.page_content)
print("Metadata:", problematic_doc.metadata)

Page Content: This career leveling matrix provides a framework for understanding the various roles and responsibilities of Software Engineers, as well as the skills and experience required for each level. This matrix is intended to support employee development, facilitate performance evaluations, and provide a clear career progression path.
Junior Software Engineer

Responsibilities:
Collaborate with team members to design, develop, and maintain software applications and components.
Write clean, well-structured, and efficient code following established coding standards.
Participate in code reviews, providing and receiving constructive feedback.
Troubleshoot and resolve software defects and issues.
Assist with the creation of technical documentation.
Continuously learn and stay up-to-date with new technologies and best practices.

Skills & Experience:
Bachelor’s degree in Computer Science or a related field, or equivalent work experience.
Basic understanding of software development prin

In [21]:
# Step 2: Validate the Metadata for Each Document
# Report any document whose metadata or page_content is empty
def validate_metadata(docs):
    for i, doc in enumerate(docs):
        if not doc.metadata:
            print(f"Document {i} is missing metadata.")
        if not doc.page_content:
            print(f"Document {i} is missing page_content.")

# Call the function to validate metadata
validate_metadata(docs)

In [35]:
# Index the split documents into Elasticsearch
try:
    # from_documents is a classmethod: it embeds and indexes the chunks and returns a new store
    documents = vectorstore.from_documents(
        docs,  # List of document chunks
        embeddings,  # Embedding model
        index_name="multi_query_index",  # Ensure this matches the created index name
        es_cloud_id=ELASTIC_CLOUD_ID,  # Cloud ID for Elasticsearch
        es_api_key=ELASTIC_API_KEY,  # API Key for Elasticsearch
    )
except Exception as e:
    print(f"Error during bulk indexing: {e}")

# Initialize the OpenAI language model (LLM)
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY) # temperature=>Controls randomness (0 = deterministic)

# Set up the MultiQueryRetriever using the LLM and vectorstore
retriever = MultiQueryRetriever.from_llm(vectorstore.as_retriever(), llm) # Elasticsearch retriever with LLM

# Question Answering with MultiQuery Retriever

Now that we have the passages stored in Elasticsearch, we can ask a question to get the relevant passages.

In [36]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible. 
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("what is the nasa sales team?")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the sales team at NASA?', '2. How does the sales team operate within NASA?', '3. What are the responsibilities of the NASA sales team?']


---- Answer ----
The NASA sales team is a part of the Americas region in the sales organization of the company. It is led by two Area Vice-Presidents, Laura Martinez for North America and Gary Johnson for South America. The team is responsible for promoting and selling the company's products and services in the North and South American markets. They work closely with other departments, such as marketing, product development, and customer support, to ensure the company's success in these regions.
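To see what the `context` value passed to the prompt looks like, the combining step can be approximated with plain string formatting. This is a rough sketch under stated assumptions: `Passage` is a hypothetical stand-in for LangChain's document class, `combine_passages` mimics `_combine_documents` without calling LangChain, and the document name "Sales Organization Overview" is made up for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Same layout as LLM_DOCUMENT_PROMPT in the cell above.
DOCUMENT_TEMPLATE = "\n---\nSOURCE: {name}\n{page_content}\n---\n"


@dataclass
class Passage:
    """Minimal stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def combine_passages(passages: List[Passage], separator: str = "\n\n") -> str:
    """Render each passage with the template and join them into one context string."""
    rendered = [
        DOCUMENT_TEMPLATE.format(page_content=p.page_content, **p.metadata)
        for p in passages
    ]
    return separator.join(rendered)


context = combine_passages(
    [
        Passage("NASA covers the Americas.", {"name": "Sales Organization Overview"}),
        Passage("Three days in the office.", {"name": "Wfh Policy Update May 2023"}),
    ]
)
print(context)
```

Each passage is labeled with its `name` metadata, so the model can tell which source a statement came from. A passage without a `name` key would raise a `KeyError` in this sketch, which is one reason to check the metadata before indexing.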


**Generate at least two new iterations of the previous cells. Be creative.** Did you master Multi-Query Retriever concepts through this lab?

# **Iteration No. 01: Modified Context Prompt**

In [37]:
# Custom context prompt for more structured and formal responses
CUSTOM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant specialized in delivering structured and precise answers. Use the following retrieved context to answer the question. If you don't know the answer, clearly state so. Provide bullet points for key aspects and explain thoroughly.

    context: {context}
    Question: "{question}"
    Answer (structured response):
    """
)

# Use the existing _context pipeline
custom_chain = _context | CUSTOM_CONTEXT_PROMPT | llm

# Test with a new question
ans01 = custom_chain.invoke("What are the key responsibilities of a Senior Software Engineer?")

print("---- Answer ----")
print(ans01)


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the main duties and obligations of a Senior Software Engineer?', '2. Can you list the primary tasks and roles of a Senior Software Engineer?', '3. What are the core responsibilities that come with being a Senior Software Engineer?']


---- Answer ----
- Design, develop, and maintain complex software applications and components.
    - Lead and mentor junior team members in software development best practices and techniques.
    - Conduct code reviews and ensure adherence to coding standards and best practices.
    - Collaborate with cross-functional teams to define, design, and deliver software solutions.
    - Identify, troubleshoot, and resolve complex software defects and issues.
    - Contribute to the creation and maintenance of technical documentation.
    - Evaluate and recommend new technologies, tools, and practices to improve software quality and efficiency.


# **Iteration No. 02: Modified Document Prompt**

In [38]:

# Custom document prompt for retrieved documents
CUSTOM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
==== Document Metadata ====
SOURCE: {name}
CATEGORY: {category}
UPDATED: {updated_at}

CONTENT:
{page_content}
===========================
"""
)

# Update the document formatting in the chain
def _combine_custom_documents(
    docs, document_prompt=CUSTOM_DOCUMENT_PROMPT, document_separator="\n\n=== NEXT DOCUMENT ===\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

# Update the context pipeline with custom document formatting
custom_context = RunnableParallel(
    context=retriever | _combine_custom_documents,
    question=RunnablePassthrough(),
)

# Create the new chain with the custom context
custom_chain_documents = custom_context | LLM_CONTEXT_PROMPT | llm

# Test with the same or new question
ans02 = custom_chain_documents.invoke("What is the NASA sales team?")

print("---- Custom Document Answer ----")
print(ans02)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the sales team at NASA?', '2. How does the sales team operate within NASA?', '3. What are the responsibilities of the NASA sales team?']


---- Custom Document Answer ----
The NASA sales team is a part of the Americas region in the sales organization of the company. It is responsible for serving customers in North and South America, including the United States, Canada, Mexico, Central and South America. The team is led by two Area Vice-Presidents, Laura Martinez for North America and Gary Johnson for South America. Their main responsibilities include identifying and pursuing new business opportunities, nurturing existing client relationships, and ensuring customer satisfaction. They also collaborate closely with other departments, such as marketing, product development, and customer support, to deliver high-quality products and services to clients.
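
The custom document prompt above references `{category}` and `{updated_at}`, so every retrieved document needs those metadata keys. The sketch below imitates that kind of template filling with the standard library only, to show what goes wrong when a key is absent. `format_doc` and `TEMPLATE` are hypothetical names, not LangChain APIs. As far as we know `format_document` also raises an error for missing metadata, but verify that against the installed version.

```python
from string import Formatter
from typing import Dict

# Shortened version of the custom document template (illustrative only).
TEMPLATE = "SOURCE: {name}\nCATEGORY: {category}\n{page_content}"


def format_doc(template: str, page_content: str, metadata: Dict[str, str]) -> str:
    """Fill a template from page_content plus metadata, failing loudly on missing keys."""
    # Collect every placeholder name that appears in the template.
    required = {name for _, name, _, _ in Formatter().parse(template) if name}
    values = {"page_content": page_content, **metadata}
    missing = sorted(required - values.keys())
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return template.format(**values)


print(format_doc(TEMPLATE, "Body text.", {"name": "Sales Overview", "category": "sharepoint"}))
```

If some indexed documents lack a field, one option is to fill in a default value before formatting rather than letting the whole chain fail.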


# **Iteration No. 03: Dynamic Filters Into the Retriever**

In [39]:
from langchain.schema.runnable import RunnableMap

# Define a filter function for the retriever
def filter_documents_by_metadata(docs, category=None, updated_after=None):
    """
    Filters documents based on metadata conditions.

    Args:
    - docs: List of retrieved documents.
    - category: Filter by document category (e.g., 'sharepoint').
    - updated_after: Filter by update date (e.g., '2025-01-01').

    Returns:
    - Filtered list of documents.
    """
    filtered_docs = []
    for doc in docs:
        doc_category = doc.metadata.get("category", None)
        doc_updated_at = doc.metadata.get("updated_at", None)

        # Apply category filter
        if category and doc_category != category:
            continue

        # Apply updated_after filter. Dates are assumed to be 'YYYY-MM-DD' strings,
        # which sort chronologically, so a plain string comparison is sufficient.
        # Documents without an updated_at value are kept.
        if updated_after and doc_updated_at and doc_updated_at < updated_after:
            continue

        filtered_docs.append(doc)
    return filtered_docs

# Modify the context pipeline to include filtering
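filtered_context = RunnableMap(
    {
        # Filter the retrieved documents, then format them into a single context
        # string so the prompt receives text rather than a list of Document objects
        "context": retriever
        | (lambda docs: filter_documents_by_metadata(docs, category="sharepoint", updated_after="2025-01-01"))
        | _combine_documents,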
        "question": RunnablePassthrough(),
    }
)

# Create a new chain with the filtered context
filtered_chain = filtered_context | LLM_CONTEXT_PROMPT | llm

# Test with a filtered query
ans03 = filtered_chain.invoke("What documents are relevant for NASA projects?")
print("---- Filtered Answer ----")
print(ans03)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which documents pertain to projects at NASA?', '2. Can you suggest any relevant documents for NASA projects?', '3. What are some documents that would be useful for NASA projects?']


---- Filtered Answer ----

The documents that are relevant for NASA projects are the Sales Organization Overview and the Swe Career Matrix. These documents provide information on the structure and responsibilities of the NASA region, as well as the roles and skills required for software engineers working on NASA projects. Additionally, the Intellectual Property Policy may also be relevant for NASA projects, as it outlines guidelines for ownership and protection of intellectual property, which may be important for projects involving innovative technology.
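
The filter above compares date strings directly and keeps any document that has no `updated_at` value. That may explain why a cutoff appears to have little effect when the metadata is missing. A stricter variant is sketched below, assuming ISO-style dates. `Doc` is a hypothetical stand-in for a LangChain `Document`, and `filter_docs` is an illustrative name. Unlike the cell above, this version drops documents whose date is missing or cannot be parsed.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_docs(
    docs: List[Doc],
    category: Optional[str] = None,
    updated_after: Optional[str] = None,
) -> List[Doc]:
    """Keep documents matching the category and updated on or after the cutoff."""
    cutoff = date.fromisoformat(updated_after) if updated_after else None
    kept = []
    for doc in docs:
        if category and doc.metadata.get("category") != category:
            continue
        if cutoff:
            raw = doc.metadata.get("updated_at")
            try:
                # Only the leading YYYY-MM-DD part is parsed, so full
                # timestamps such as 2025-03-01T10:00:00 are accepted too.
                if raw is None or date.fromisoformat(raw[:10]) < cutoff:
                    continue
            except ValueError:
                continue  # unparseable date: treat as not matching
        kept.append(doc)
    return kept


docs = [
    Doc("a", {"category": "sharepoint", "updated_at": "2025-03-01T10:00:00"}),
    Doc("b", {"category": "sharepoint", "updated_at": "2020-05-01"}),
    Doc("c", {"category": "github", "updated_at": "2025-06-01"}),
    Doc("d", {"category": "sharepoint"}),
    Doc("e", {"category": "sharepoint", "updated_at": "not a date"}),
]
print([d.page_content for d in filter_docs(docs, "sharepoint", "2025-01-01")])
```

Whether to keep or drop undated documents is a design choice. Dropping them makes the cutoff strict, while keeping them (as the cell above does) avoids silently losing content.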


# **Iteration No. 04: Combining Dynamic Filters Into the Retriever and Summarization**

In [43]:
from langchain.schema.runnable import RunnableLambda, RunnableMap
from langchain.prompts import ChatPromptTemplate

# Define the filtering function
def filter_documents_by_metadata(docs, category=None, updated_after=None):
    """
    Filters documents based on metadata conditions.

    Args:
    - docs: List of retrieved documents.
    - category: Filter by document category (e.g., 'sharepoint').
    - updated_after: Filter by update date (e.g., '2025-01-01').

    Returns:
    - Filtered list of documents.
    """
    filtered_docs = []
    for doc in docs:
        doc_category = doc.metadata.get("category", None)
        doc_updated_at = doc.metadata.get("updated_at", None)

        # Apply category filter
        if category and doc_category != category:
            continue

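        # Apply updated_after filter. Dates are assumed to be 'YYYY-MM-DD' strings,
        # which sort chronologically, so a plain string comparison is sufficient.
        # Documents without an updated_at value are kept.
        if updated_after and doc_updated_at and doc_updated_at < updated_after:
            continue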

        filtered_docs.append(doc)
    return filtered_docs

# Define the summarization function
def summarize_document(doc):
    """
    Summarizes the content of a single document using the LLM.
    """
    summary_prompt = f"""
    Here is the content of a document:
    ---
    {doc.page_content}
    ---
    Please summarize the main idea of this document in 1-2 sentences.
    """
    # Use invoke(); calling the LLM object directly is deprecated in recent LangChain releases
    return llm.invoke(summary_prompt)

# Combine filtering and summarization
def filter_and_summarize(docs, category=None, updated_after=None):
    """
    Filters documents by metadata and generates summaries for each document.
    """
    filtered_docs = filter_documents_by_metadata(docs, category, updated_after)
    summarized_docs = []
    for doc in filtered_docs:
        summary = summarize_document(doc)
        summarized_docs.append({"metadata": doc.metadata, "summary": summary})
    return summarized_docs

# Test the pipeline with filtered and summarized output
query = "What documents are relevant for NASA projects?"
# invoke() replaces the deprecated get_relevant_documents()
retrieved_docs = retriever.invoke(query)

# Filter and summarize the documents
filtered_docs = filter_and_summarize(retrieved_docs, category="sharepoint", updated_after="2025-01-01")

# Print filtered and summarized documents
print("---- Filtered and Summarized Documents ----")
if filtered_docs:
    for doc in filtered_docs:
        print(f"Source: {doc['metadata']['name']}")
        print(f"Summary: {doc['summary']}\n")
else:
    print("No relevant documents found.")

# Generate the final response using the LLM
if filtered_docs:
    context = "\n".join([f"Source: {doc['metadata']['name']}\nSummary: {doc['summary']}" for doc in filtered_docs])
    final_prompt = f"""
    Here is the context from relevant documents:
    {context}
    Question: {query}
    Answer:
    """
    ans04 = llm.invoke(final_prompt)
else:
    ans04 = "No relevant documents found."

# Print the final answer
print("---- Final Answer ----")
print(ans04)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which documents pertain to projects at NASA?', '2. Can you suggest any relevant documents for NASA projects?', '3. What are some documents that would be useful for NASA projects?']


---- Filtered and Summarized Documents ----
Source: Sales Organization Overview
Summary: 
The document outlines the structure and responsibilities of the sales organization, which is divided into four main regions (Americas, Europe, Asia-Pacific, and Middle East & Africa) with dedicated teams led by Area Vice-Presidents. These teams work together to identify and pursue new business opportunities, maintain client relationships, and ensure customer satisfaction.

Source: Intellectual Property Policy
Summary: 
This document outlines the guidelines and procedures for ownership, protection, and utilization of intellectual property created by employees during their employment with the company, with the goal of encouraging creativity and innovation while safeguarding the interests of both the company and its employees.

Source: Swe Career Matrix
Summary: 
This document outlines a career leveling matrix for Software Engineers, providing a framework for understanding roles, responsibilities, sk