## **HyDE**

**Hy**pothetical **D**ocument **E**mbeddings (HyDE) is an interesting approach to retrieval in RAG systems.

The key idea is that a query is, by definition, a question, whereas the documents we want to retrieve contain a (possible) answer, and questions and answers may occupy different regions of the embedding space.

HyDE seeks to compensate for this by getting an LLM to produce sample answers. These answers are then embedded in place of the query, in the hope that they will sit closer to the relevant documents than the query would on its own.
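
To make the intuition concrete, here is a minimal toy sketch. It assumes a bag-of-words "embedding" in place of a real model, and the hypothetical answer is hard-coded rather than produced by an LLM; `embed`, `cosine` and the example strings are all illustrative names, not part of any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A document we would like to retrieve.
document = "Machine learning builds algorithms that learn patterns from data to make predictions."

# The raw user query (a question).
query = "What is machine learning?"

# Hard-coded stand-in for what an LLM might generate (an assumption for this sketch).
hypothetical_answer = "Machine learning is a field where algorithms learn from data and make predictions."

query_similarity = cosine(embed(query), embed(document))
hyde_similarity = cosine(embed(hypothetical_answer), embed(document))

print(f"query vs document:               {query_similarity:.3f}")
print(f"hypothetical answer vs document: {hyde_similarity:.3f}")
```

In this toy setup the hypothetical answer shares far more vocabulary with the document than the question does, so it scores higher. Real embedding models capture meaning rather than word overlap, but the same effect is what HyDE relies on.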

The key limitation is that the LLM needs *some* knowledge of the topic at hand; otherwise it is likely to hallucinate, which could lead to embeddings that are even less relevant than the original query's.

In [54]:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.chat_models import ChatOpenAI
import langchain
import openai
import os
from dotenv import load_dotenv, find_dotenv
from src.langchainHelpers import PdfLoad

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

langchain.debug = True

## **Data & Constants**

Here I will use a simple PDF, but this could be any data source or directory.

In [78]:
LLM = OpenAI(n=3, best_of=3)
EMBEDDINGS = OpenAIEmbeddings()

PROMPT_TEMPLATE = """Please answer the question as best you can.
Make sure you mention the key ideas & concepts where relevant.
Answer in a variety of voices, ranging from professional to highly technical.
Your answer should be concise.
Question: {question}
Answer:"""

PROMPT = PromptTemplate(input_variables=["question"], template=PROMPT_TEMPLATE)

LLM_CHAIN = LLMChain(llm=LLM, prompt=PROMPT)

PDF_FILEPATH = "big-book-of-machine-learning-use-cases-2nd-edition.pdf"

texts = PdfLoad(PDF_FILEPATH).characterSplitter(chunk_size=1000, chunk_overlap=200)

## **The Hypothetical Embedder**

Here I define the embedder.

In [79]:
embedder = HypotheticalDocumentEmbedder(llm_chain=LLM_CHAIN, base_embeddings=EMBEDDINGS)
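
Because the LLM above is configured with `n=3`, each query yields several hypothetical answers. My understanding is that the embedder embeds each generation separately and then averages the vectors into a single query embedding; treat that as an assumption about the library's internals rather than a guarantee. The helper below is a standalone sketch of that averaging step, not the library's own code, and the vectors are made up.

```python
from typing import List


def combine_embeddings(embeddings: List[List[float]]) -> List[float]:
    """Return the element-wise mean of equally sized vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(vector) != dim for vector in embeddings):
        raise ValueError("all embeddings must have the same length")
    # zip(*embeddings) iterates over one dimension at a time.
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


# Three made-up vectors, one per hypothetical answer.
generation_vectors = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
    [2.0, 2.0, 2.0],
]

combined = combine_embeddings(generation_vectors)
print(combined)
```

Averaging smooths out the quirks of any single generation, which is one reason to ask for several answers in varied voices.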

Now if I call the embedder with a test query (with `langchain.debug = True`) we can see what is going on.

Note how I use a vague query which might yield poor results in a traditional vanilla-RAG application.

In [80]:
result = embedder.embed_query("What is is the biggest stadium?")

[llm/start] [1:llm:OpenAI] Entering LLM run with input:
{
  "prompts": [
    "Please answer the question as best you can.\nMake sure you mention the key ideas & concepts where relevant.\nAnswer in a variety of voices, ranging from professional to highly technical.\nYour answer should be concise.\nQuestion: What is is the biggest stadium?\nAnswer:"
  ]
}
[llm/end] [1:llm:OpenAI] [1.63s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": " The biggest stadium in the world is the Rungrado 1st of May Stadium in Pyongyang, North Korea. It has a seating capacity of 114,000 and is used for both sporting and musical events. The stadium was built in 1989 and is the home of the North Korean national football team.",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        },
        "type": "Generation"
      },
      {
        "text": " The largest stadium in the world is the

## **Create the Vectorstore**

I'll use Chroma. We could persist this too.

In [81]:
docsearch = Chroma.from_documents(texts, embedder)
retriever = docsearch.as_retriever(search_kwargs={"k": 3})  # k is passed via search_kwargs
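
Conceptually, a retriever asked for three documents scores every stored vector against the query embedding and keeps the best three. The sketch below shows that ranking step in plain Python. Cosine similarity is an assumption made for illustration (the metric a real store uses depends on its configuration), and `top_k` and the vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up two-dimensional document vectors.
doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_vector = [1.0, 0.0]

hits = top_k(query_vector, doc_vectors, k=3)
for index, score in hits:
    print(f"doc {index}: {score:.3f}")
```

With HyDE, `query_vector` would be the averaged embedding of the hypothetical answers rather than the embedding of the question itself.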

## **Retrieve Relevant Docs**

In [82]:
query = "What is Machine Learning?"
docs = retriever.get_relevant_documents(query)

[llm/start] [1:llm:OpenAI] Entering LLM run with input:
{
  "prompts": [
    "Please answer the question as best you can.\nMake sure you mention the key ideas & concepts where relevant.\nAnswer in a variety of voices, ranging from professional to highly technical.\nYour answer should be concise.\nQuestion: What is Machine Learning?\nAnswer:"
  ]
}
[llm/end] [1:llm:OpenAI] [1.90s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": " Machine Learning is a subfield of Artificial Intelligence that builds algorithms to learn from data, identify patterns, and make decisions without being explicitly programmed. It uses techniques such as supervised learning, unsupervised learning, and reinforcement learning to process large amounts of data and provide predictions and recommendations. Machine Learning is used in a wide range of applications, from self-driving cars to voice recognition systems.",
        "generation_

In [84]:
def show_documents(docs: list) -> None:
    print(
        f"\n{'-' * 50}\n".join(
            [f"Document {i + 1}: \n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

show_documents(docs)

Document 1: 

Organizations across many industries are using machine learning to power 
new customer experiences, optimize business processes and improve 
employee productivity. From detecting financial fraud to improving the 
play-by-play decision-making for professional sports teams, this book 
brings together a multitude of practical use cases to get you started on 
your machine learning journey. The collection also serves as a guide — 
including code samples and notebooks — so you can roll up your sleeves 
and dive into machine learning on the Databricks Lakehouse.CHAPTER 1:
Introduction
3
EBOOK: BIG BOOK OF MACHINE LEARNING USE CASES — 2ND EDITION
--------------------------------------------------
Document 2: 

Organizations across many industries are using machine learning to power 
new customer experiences, optimize business processes and improve 
employee productivity. From detecting financial fraud to improving the 
play-by-play decision-making for professional sports teams, t

## **Utilise Docs in a Chain**

Now that we have retrieved our docs, let's use them in an LLM chain to help answer the query.

In [85]:
CHAT_LLM = ChatOpenAI(temperature=0)

qa_chain = RetrievalQA.from_chain_type(llm=CHAT_LLM, chain_type="stuff", retriever=retriever)
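
The `"stuff"` chain type simply places all retrieved documents into a single prompt alongside the question. The sketch below illustrates the idea; `STUFF_TEMPLATE` and `build_stuff_prompt` are hypothetical names, and the wording of the template is my own, not the prompt the library actually sends.

```python
from typing import List

# Hypothetical template for illustration only.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate every retrieved document into one context block."""
    context = "\n\n".join(documents)
    return STUFF_TEMPLATE.format(context=context, question=question)


retrieved = [
    "Organizations across many industries are using machine learning.",
    "This book brings together a multitude of practical use cases.",
]

prompt = build_stuff_prompt("What is Machine Learning?", retrieved)
print(prompt)
```

This is the simplest way to combine documents, but the prompt grows with every retrieved chunk, so the number and size of chunks must fit within the model's context window.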

In [86]:
qa_chain.run(query)

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What is Machine Learning?"
}
[llm/start] [1:llm:OpenAI] Entering LLM run with input:
{
  "prompts": [
    "Please answer the question as best you can.\nMake sure you mention the key ideas & concepts where relevant.\nAnswer in a variety of voices, ranging from professional to highly technical.\nYour answer should be concise.\nQuestion: What is Machine Learning?\nAnswer:"
  ]
}
[llm/end] [1:llm:OpenAI] [3.14s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": " Machine Learning is a field of Artificial Intelligence (AI) that focuses on enabling machines to learn from data and improve their performance over time without explicit programming. It uses algorithms to identify patterns in data and make predictions based on that data. Machine Learning is used in many areas, such as healthcare, finance, marketing

'Machine learning is a field of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn and make predictions or decisions without being explicitly programmed. It involves training a computer system on a large amount of data, allowing it to recognize patterns, make predictions, and improve its performance over time. Machine learning is used in various applications, such as image and speech recognition, natural language processing, recommendation systems, and predictive analytics.'