# **HyDE**

**Hypothetical Document Embeddings (HyDE)** is an approach introduced in the paper *Precise Zero-Shot Dense Retrieval without Relevance Labels*. Its core hypothesis is simple: when searching a document collection, embedding a hypothetical answer to the question can retrieve better results than embedding the question itself.

For more technical details, see our blog post: \
https://medium.com/@aksdesai1998/exploring-hyde-a-new-approach-to-relevance-based-retrieval-0946c54dfdcb
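
To make the idea concrete before bringing in any libraries, here is a minimal, self-contained sketch of the HyDE flow. It is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical placeholder that returns a canned answer instead of calling an LLM, and the two-document corpus is made up for the demonstration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding: word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(question):
    """Hypothetical placeholder for an LLM call: returns a canned, plausible answer."""
    return "fast food meals are high in fat and sodium and low in fibre and calcium"


def hyde_search(question, corpus, generate=fake_llm):
    """Embed a generated answer instead of the question, then pick the closest document."""
    hypothetical_doc = generate(question)
    query_vec = embed(hypothetical_doc)
    return max(corpus, key=lambda doc: cosine(query_vec, embed(doc)))


# Made-up corpus: the first document answers the question but shares no words with it.
corpus = [
    "sodium and fat content is high while fibre and calcium are low",
    "which questions appear in the major exam on nutritional science",
]
question = "which factors appear to be the major nutritional limitations of fast-food meals"

# Plain retrieval embeds the question itself; HyDE embeds the hypothetical answer.
plain_hit = max(corpus, key=lambda doc: cosine(embed(question), embed(doc)))
hyde_hit = hyde_search(question, corpus)
print("plain:", plain_hit)
print("hyde: ", hyde_hit)
```

In this toy setup the plain query is drawn to the document that merely shares wording with the question, while the hypothetical answer lands on the document that actually contains the answer. Real embedding models are far less sensitive to exact wording, so the gap is usually smaller in practice.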

## Installing Libraries

In [10]:
%%capture
!pip install langchain openai tiktoken
!pip install lancedb
!pip install pypdf

In [11]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

Enter Your OpenAI API Key:··········


### Download the data (you can change the pdf as you like)

In [3]:
!wget https://ncert.nic.in/textbook/pdf/kehe103.pdf

--2023-11-27 16:16:32--  https://ncert.nic.in/textbook/pdf/kehe103.pdf
Resolving ncert.nic.in (ncert.nic.in)... 164.100.166.133
Connecting to ncert.nic.in (ncert.nic.in)|164.100.166.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1801441 (1.7M) [application/pdf]
Saving to: ‘kehe103.pdf’


2023-11-27 16:16:39 (343 KB/s) - ‘kehe103.pdf’ saved [1801441/1801441]



### Importing the necessary libraries

In [4]:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

### Instantiate llm and embeddings

In [5]:
# Instantiate the LLM and the base embedding model
llm = OpenAI()
base_embeddings = OpenAIEmbeddings()

In [12]:
# Wrap the base embeddings in HyDE, using the built-in "web_search" prompt
embeddings = HypotheticalDocumentEmbedder.from_llm(llm, base_embeddings, "web_search")

# Now we can use it as any embedding class
result = embeddings.embed_query("which factors appear to be the major nutritional limitations of fast-food meals.")

### Result

In [13]:
result

[0.007334397356573164,
 -0.004199635807417721,
 -0.00806531702641324,
 -0.02089421130455058,
 -0.01657170470982839,
 -0.009067179860357249,
 -0.005604765051475951,
 -0.039545253901332754,
 -0.004719470279156168,
 -0.0008994718305639656,
 0.023465030478077045,
 0.020024668922649064,
 -0.00894746113438485,
 -0.009180599120278498,
 -0.010333687323676639,
 0.005003016566314093,
 0.05118955798920321,
 -0.01955839295086177,
 0.019823035717591954,
 -0.019911250594050415,
 -0.0025534919165954104,
 0.016823746681811077,
 -0.014051294303227498,
 0.0048454907994862125,
 0.006779906973988708,
 0.0145049676176221,
 -0.006348286447386124,
 -0.024800848831765854,
 -0.002161252831699796,
 -0.010169860228152413,
 0.023956511764649725,
 -0.026892789376112336,
 -0.006817713083521591,
 -0.004413870738544927,
 0.002488906324256302,
 0.02785054663447231,
 -0.005812699119568109,
 -0.021436099495403642,
 0.021902375467190933,
 -0.02172594757691921,
 0.00834256151921352,
 -0.0004717895179812731,
 -0.0059040637

## Multiple generations

We can also **generate multiple documents** and then combine their embeddings. By default, the embeddings are combined by taking their average. To do this, configure the LLM that generates the documents to return several completions per prompt.

In [14]:
multi_llm = OpenAI(n=3, best_of=4)

In [15]:
# Pass a plain embedding model as the base, not the HyDE embedder created earlier
embeddings = HypotheticalDocumentEmbedder.from_llm(
    multi_llm, OpenAIEmbeddings(), "web_search"
)

The `HypotheticalDocumentEmbedder` does not store the hypothetical documents anywhere. For each query, it asks the LLM to write one or more hypothetical documents that answer the query, embeds that generated text with the base embedding model, and (when there are several generations) averages the resulting vectors into a single query embedding.
Only this query embedding is used for the similarity search; the generated text is discarded afterwards.
The real documents in the vector store are embedded directly with the base embedding model, without any LLM generation.
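
The default combination step mentioned in this section, taking the average, can be sketched in plain Python as follows. This is an illustrative sketch rather than the library's code: `combine_embeddings` and the sample vectors are made up here, with three short vectors standing in for the embeddings of three generated documents.

```python
def combine_embeddings(vectors):
    """Return the element-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up 3-dimensional embeddings for three generated documents
generated = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 4.0],
    [2.0, 3.0, 0.0],
]
combined = combine_embeddings(generated)
print(combined)
```

Averaging keeps the query embedding the same size as a single document embedding, so it can be compared against the indexed vectors without any other change.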

### Making Your Own Prompts
You can also write your own prompt for generating the hypothetical documents and pass it in through an `LLMChain`. This is helpful if you know what topic you're asking about: with a custom prompt, the generated text fits your domain better.

Let's try this out. We'll make a general research-assistant prompt, which we'll use in the next example.

In [16]:
prompt_template = """
As a knowledgeable and helpful research assistant, your task is to provide informative answers based on the given context. Use your extensive knowledge base to offer clear, concise, and accurate responses to the user's inquiries.
If the question is not related to the documents, simply say you don't know.
Question: {question}

Answer:
"""

prompt = PromptTemplate(input_variables=["question"], template=prompt_template)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [17]:
# Use the custom chain for generation and a plain embedding model as the base
embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain,
    base_embeddings=OpenAIEmbeddings()
)

### Loading data from pdf

We load the `kehe103.pdf` file downloaded earlier. The path below assumes a Colab runtime; adjust it if you run the notebook elsewhere or use a different PDF.

In [18]:
# Load the PDF and split it into page-level documents
pdf_path = '/content/kehe103.pdf'

loader = PyPDFLoader(pdf_path)
docs = loader.load_and_split()

# Split further into chunks of about 500 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
documents = text_splitter.split_documents(docs)
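
To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is an assumption-laden illustration, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `split_with_overlap` below cuts purely by character position.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into chunks of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.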


### Initialize the Vectorstore

In [19]:
from langchain.vectorstores import LanceDB
import lancedb

# lancedb as vectorstore
db = lancedb.connect('/tmp/lancedb')
table = db.create_table("documentsai", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")
vector_store = LanceDB.from_documents(documents, embeddings, connection=table)


In [20]:
# Pass in the query string to retrieve the most similar chunks
query = "which factors appear to be the major nutritional limitations of fast-food meals"

vector_store.similarity_search(query)

[Document(page_content='Calcium, riboflavin, vitamin A: These essential nutrients are low unless milk or a milkshake is ordered.Folic acid, fibre:  There are few fast food sources of these key factors.\nFat: The percentage of energy from fat is high in many meal combinations.Sodium: The sodium content of fast food meals is high, which is not desirable.Energy: Common meal combinations contain excessive energy when compared with the amounts of other nutrients provided.', metadata={'vector': array([ 0.00966273, -0.00786132, -0.01703396, ..., -0.01882212,
        -0.00976207, -0.01739159], dtype=float32), 'id': 'bb6030a0-3667-4ca5-932f-a5a4e4ff2feb', '_distance': 0.19605225324630737}),
 Document(page_content='39population. \tIf\tthis\tis\tnot\tmaint ained,\t80\tper\tcent\tof\tthem\twill\tstay\t \noverweight \tas\tadult s.\tThis\tcan\tput\tthem\tat\trisk\tfor\tmany\tmedic al\tproble ms,\t\nincluding \tdiab etes,\thigh\tbloo d\tpres sure,\thigh\tchol esterol\tand\tslee p\tapne a\t\n(a\tsleep
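
Each result above carries a `_distance` value, and smaller values mean a closer match. As a rough sketch of what a similarity search does, the brute-force version below scores every stored vector against the query and keeps the closest ones. It assumes a squared Euclidean distance; the metric and index that LanceDB actually uses depend on how the table is configured, and `top_k` and the sample rows are made up for illustration.

```python
def top_k(query_vec, rows, k=2):
    """Return the k rows closest to query_vec, each annotated with its distance."""
    scored = []
    for row in rows:
        # Squared Euclidean distance between the query and the stored vector
        dist = sum((q - x) ** 2 for q, x in zip(query_vec, row["vector"]))
        scored.append({**row, "_distance": dist})
    return sorted(scored, key=lambda r: r["_distance"])[:k]


# Made-up 2-dimensional vectors standing in for stored chunk embeddings
rows = [
    {"id": "a", "vector": [0.0, 0.0]},
    {"id": "b", "vector": [1.0, 1.0]},
    {"id": "c", "vector": [3.0, 4.0]},
]
hits = top_k([1.0, 2.0], rows, k=2)
for hit in hits:
    print(hit["id"], hit["_distance"])
```

With HyDE, the query vector in this step is the embedding of the generated hypothetical answer rather than of the question text.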

In [21]:
llm_chain.run("which factors appear to be the major nutritional limitations of fast-food meals")


'The major nutritional limitations of fast-food meals include excessive calories, sodium, saturated fat, and sugar. Fast-food meals are often high in calories, but low in essential nutrients like vitamins, minerals, and fiber. Additionally, they may contain unhealthy levels of sodium, saturated fat, and added sugars, which can lead to various health issues such as obesity, heart disease, and diabetes.'

In [22]:
llm_chain.run("what kind of  Main Nutrients provided by Pulses and Legumes?")


'Pulses and legumes are an excellent source of essential nutrients, including protein, fiber, vitamins, and minerals. They are particularly high in zinc, folate, magnesium, phosphorus, iron, and potassium.'

In [23]:
llm_chain.run("serving come from which groups & how much")


"Serving typically refers to a portion or amount of food, usually for one person. Its size may vary depending on the type of food and the group it's being served to. For example, a serving size for a family dinner may be larger than a single serving of a snack."

Thanks