# **HyDE**

**Hypothetical Document Embeddings (HyDE)** is an approach introduced in the paper *Precise Zero-Shot Dense Retrieval without Relevance Labels*. Its core hypothesis is simple: when searching a document collection, embedding a hypothetical answer to the question can retrieve better results than embedding the question itself.

For more technical details, see our blog post: \
https://medium.com/@aksdesai1998/exploring-hyde-a-new-approach-to-relevance-based-retrieval-0946c54dfdcb
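
To build intuition for that hypothesis, here is a toy sketch that is not part of the original notebook. It assumes a crude bag-of-words "embedding" (`bow_embed`, a hypothetical helper) in place of a real embedding model, and the three example strings are made up for illustration. It compares how close a question and a hypothetical answer each land to a stored passage.

```python
import math
from collections import Counter


def bow_embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up strings for illustration only
document = "fast food meals are high in fat and sodium and low in fibre"
question = "what are the nutritional limitations of fast food"
hypothetical_answer = "fast food meals are high in fat and sodium and low in vitamins"

sim_question = cosine(bow_embed(question), bow_embed(document))
sim_hypothetical = cosine(bow_embed(hypothetical_answer), bow_embed(document))

print(f"question vs document:            {sim_question:.3f}")
print(f"hypothetical answer vs document: {sim_hypothetical:.3f}")
```

The hypothetical answer is phrased like the passage, so it scores higher than the question does, even though one of its details ("vitamins") is wrong. Real embedding models capture meaning rather than word overlap, but the same effect is what HyDE relies on.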

## Installing Libraries

In [10]:
%%capture
!pip install langchain openai tiktoken langchain-community
!pip install lancedb
!pip install pypdf
!pip install -q langchain-openai

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

Enter Your OpenAI API Key:··········


### Download the data (you can change the pdf as you like)

In [3]:
!wget https://ncert.nic.in/textbook/pdf/kehe103.pdf

--2024-09-10 10:36:00--  https://ncert.nic.in/textbook/pdf/kehe103.pdf
Resolving ncert.nic.in (ncert.nic.in)... 164.100.166.133
Connecting to ncert.nic.in (ncert.nic.in)|164.100.166.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1705618 (1.6M) [application/pdf]
Saving to: ‘kehe103.pdf’


2024-09-10 10:36:04 (871 KB/s) - ‘kehe103.pdf’ saved [1705618/1705618]



### Importing the necessary libraries

In [11]:
from langchain_openai import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

### Instantiate llm and embeddings

In [32]:
# instantiate llm
llm = OpenAI()
base_embeddings = OpenAIEmbeddings()

In [33]:
embeddings = HypotheticalDocumentEmbedder.from_llm(llm, base_embeddings, "web_search")

# Now we can use it as any embedding class
result = embeddings.embed_query(
    "which factors appear to be the major nutritional limitations of fast-food meals."
)

### Generated Embeddings using HyDE

In [62]:
result

[-0.02528518708247117,
 0.00831749535867708,
 0.005965764392187786,
 -0.01159230572621031,
 0.018957366397482973,
 0.0017434122730279422,
 0.0008382731126336937,
 -0.005045947413990321,
 -0.0017417814108506582,
 -0.003525965703971974,
 0.03152167669156384,
 -0.03068666594356896,
 -0.011487929615541584,
 -0.03838442855260409,
 -0.039558662358764235,
 -0.003100305669842161,
 0.04649969374166905,
 0.005714608590087773,
 0.011807581861918166,
 -0.049317857111627485,
 -0.011957623510534657,
 -0.00332862916862956,
 -0.0056787293893568866,
 0.011403123385339,
 -0.020353400253721378,
 0.013236234358686061,
 0.008376207794043116,
 -0.02193209439910576,
 -0.0010135928921674534,
 -0.01484754528195486,
 -0.005655897086044274,
 -0.004511018100583344,
 -0.01492582759778704,
 -0.013816828278718265,
 0.00905465391037364,
 -0.013340610469354787,
 -0.01672631993060461,
 -0.010757294046892891,
 0.017991884812660282,
 -0.014469180600212241,
 0.023236799042093713,
 0.0013047052250645386,
 -0.00480783850930

### Multiple generations

We can also **generate multiple documents** and combine their embeddings, by default by averaging them. To do this, configure the LLM that generates the documents to return several completions per call.
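
The generate, embed, and average flow can be sketched in plain Python. This is an illustration only, not LangChain's actual code: `average_embeddings`, `hyde_embed_query`, `fake_generate`, and `fake_embed` are hypothetical names, and the two `fake_*` functions stand in for the LLM and the embedding model so the sketch runs without an API key.

```python
from typing import Callable, List

Vector = List[float]


def average_embeddings(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hyde_embed_query(
    query: str,
    generate: Callable[[str, int], List[str]],
    embed: Callable[[str], Vector],
    n: int = 4,
) -> Vector:
    """Generate n hypothetical documents, embed each one, and average the vectors."""
    hypothetical_docs = generate(query, n)
    return average_embeddings([embed(doc) for doc in hypothetical_docs])


# Stand-ins for the LLM and the embedding model (assumptions for this sketch)
def fake_generate(query: str, n: int) -> List[str]:
    return [f"hypothetical answer {i} to: {query}" for i in range(n)]


def fake_embed(text: str) -> Vector:
    return [float(len(text)), float(text.count("a"))]


vector = hyde_embed_query("what limits fast-food nutrition?", fake_generate, fake_embed)
print(vector)
```

Averaging smooths out the quirks of any single generation, at the cost of one extra completion per hypothetical document.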

In [35]:
multi_llm = OpenAI(n=4, best_of=4)

In [60]:
embeddings = HypotheticalDocumentEmbedder.from_llm(
    multi_llm, base_embeddings, "web_search"
)

In [45]:
result = embeddings.embed_query(
    "which factors appear to be the major nutritional limitations of fast-food meals."
)

For each query, the `HypotheticalDocumentEmbedder` first asks the LLM to write hypothetical documents that answer it (four of them here, because `n=4`). It then embeds each generated text with `base_embeddings` and combines the resulting vectors, by default by averaging them.
`embed_query` returns only this combined vector; the generated text is an intermediate step and is not returned.
Documents added to the vector store are embedded directly with `base_embeddings`, without any generation step.

### Making Your Own Prompts
You can also supply your own prompt for generating the hypothetical documents by wrapping it in an `LLMChain`. This helps when you know the domain of your queries, because a tailored prompt produces text that is closer to the documents you want to retrieve.

Let's try this out with a general research-assistant prompt, which we'll use in the examples below.

In [56]:
prompt_template = """
As a knowledgeable and helpful research assistant, your task is to provide informative answers based on the given context. Use your extensive knowledge base to offer clear, concise, and accurate responses to the user's inquiries.
If the question is not related to the documents, simply say you don't know.
Question: {question}

Answer:
"""

prompt = PromptTemplate(input_variables=["question"], template=prompt_template)
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [63]:
llm_chain_embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain, base_embeddings=base_embeddings
)

### Loading data from pdf

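The cell below loads the `kehe103.pdf` file downloaded earlier. To try a different document, such as the book at https://www.gita-society.com/bhagavad-gita-in-english-source-file.pdf, download it and change the path.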

In [39]:
# Load the PDF and split it into overlapping chunks
pdf_folder_path = "/content/kehe103.pdf"

loader = PyPDFLoader(pdf_folder_path)
docs = loader.load_and_split()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
documents = text_splitter.split_documents(docs)
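
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper that only cuts fixed-width character windows.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval.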

### Initialize the Vectorstore

In [64]:
# Quick check that the HyDE embedder works on an arbitrary query (the result is not used below)
result = embeddings.embed_query(
    "What did the president say about Ketanji Brown Jackson"
)

In [66]:
from langchain.vectorstores import LanceDB

vector_store = LanceDB.from_documents(documents, embeddings)

In [67]:
# Pass the query string to retrieve the most similar chunks
query = (
    "which factors appear to be the major nutritional limitations of fast-food meals"
)

vector_store.similarity_search(query)

[Document(metadata={'page': 11, 'source': '/content/kehe103.pdf'}, page_content='Calcium, riboflavin, vitamin A: These essential nutrients are low unless milk or a milkshake is ordered.Folic acid, fibre:  There are few fast food sources of these key factors.\nFat: The percentage of energy from fat is high in many meal combinations.Sodium: The sodium content of fast food meals is high, which is not desirable.Energy: Common meal combinations contain excessive energy when compared with the amounts of other nutrients provided.'),
 Document(metadata={'page': 11, 'source': '/content/kehe103.pdf'}, page_content='39population. \tIf\tthis\tis\tnot\tmain tained,\t80\tper\tcent\tof\tthem\twill\tstay\t \noverweight \tas\tadul ts.\tThis\tcan\tput\tthem\tat\trisk\tfor\tmany\tmedi cal\tprobl ems,\t\nincluding \tdia betes,\thigh\tblo od\tpre ssure,\thigh\tcho lesterol\tand\tsle ep\tapn ea\t\n(a\tsleep\tdisorder) .\nTable 2: Nutritional Limitations of Fast Foods\nThe following factors appear to be the 

In [68]:
llm_chain.run(
    "which factors appear to be the major nutritional limitations of fast-food meals"
)


"The major nutritional limitations of fast-food meals can vary depending on the specific meal and location, but some common factors include high levels of saturated fat, sodium, and added sugars. These ingredients can contribute to an increased risk of heart disease, obesity, and other health issues. Additionally, fast-food meals often lack important nutrients such as fiber, vitamins, and minerals. Limited options for fresh fruits and vegetables and the use of processed meats are also common limitations. However, it's important to note that not all fast-food meals are nutritionally limited, and some options may offer healthier choices."

In [69]:
llm_chain.run("what kind of  Main Nutrients provided by Pulses and Legumes?")

'\nPulses and legumes are excellent sources of plant-based protein and dietary fiber. They also contain a variety of essential vitamins and minerals, including iron, potassium, magnesium, and folate. Additionally, they are low in fat and cholesterol, making them a healthy choice for a balanced diet.'

In [70]:
llm_chain.run("serving come from which groups & how much")

'It is difficult to determine which specific groups serve come from without more context or information. However, it is likely that serve come from a variety of food groups, including grains, vegetables, fruits, dairy, and protein sources. The amount of serve come from each group may vary depending on the specific recipe or dish being served. It is best to consult a nutritionist or refer to a dietary guideline for more specific information on serving sizes and food groups. '

Thanks