**HyDE**

Exploring HyDE: A New Approach to Relevance-Based Retrieval

For more technical details, see our blog post:
https://medium.com/@aksdesai1998/exploring-hyde-a-new-approach-to-relevance-based-retrieval-0946c54dfdcb

In [2]:
%%capture
!pip install langchain openai tiktoken
!pip install lancedb
!pip install pypdf

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

Enter Your OpenAI API Key:··········


In [4]:
# Download the book so the results below can be reproduced
!wget https://www.gita-society.com/bhagavad-gita-in-english-source-file.pdf

--2023-11-25 07:14:09--  https://www.gita-society.com/bhagavad-gita-in-english-source-file.pdf
Resolving www.gita-society.com (www.gita-society.com)... 192.124.249.102
Connecting to www.gita-society.com (www.gita-society.com)|192.124.249.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 540128 (527K) [application/pdf]
Saving to: ‘bhagavad-gita-in-english-source-file.pdf’


2023-11-25 07:14:10 (3.97 MB/s) - ‘bhagavad-gita-in-english-source-file.pdf’ saved [540128/540128]



In [5]:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.prompts import PromptTemplate

In [6]:
# instantiate llm
llm = OpenAI()
emebeddings = OpenAIEmbeddings()

In [7]:
embeddings = HypotheticalDocumentEmbedder.from_llm(llm, emebeddings, "web_search")

# Now we can use it like any other embedding class
result = embeddings.embed_query("What does the Bhagavad Gita tell us?")


# Multiple generations

We can also generate multiple hypothetical documents and combine their embeddings. By default, the embeddings are combined by averaging. To do this, we configure the LLM that generates the documents to return several completions per prompt.
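The averaging step can be sketched in plain Python. This is only an illustration of the idea, not LangChain's own code: the helper name `average_embeddings` and the toy 4-dimensional vectors are assumptions made for the example.

```python
from typing import List


def average_embeddings(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy vectors standing in for the embeddings of three generated documents
# (real OpenAI embeddings have far more dimensions).
generated = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
combined = average_embeddings(generated)
print(combined)  # [2.0, 1.0, 2.0, 2.0]
```

The single combined vector is what gets compared against the stored document vectors, so the number of generations does not change the cost of the search itself.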


In [8]:
multi_llm = OpenAI(n=3, best_of=3)

In [9]:
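# Pass the base OpenAI embeddings (`emebeddings`), not the HyDE wrapper
# created above, so the embedder is not nested inside another embedder
embeddings = HypotheticalDocumentEmbedder.from_llm(
    multi_llm, emebeddings, "web_search"
)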

In [10]:
result = embeddings.embed_query("What does the Bhagavad Gita tell us?")

The `HypotheticalDocumentEmbedder` does create hypothetical documents, but only at query time.

When `embed_query` is called, the LLM chain writes one or more passages that could plausibly answer the query.

The base embedding model then embeds that generated text instead of the raw query.

If several passages were generated, their embeddings are combined (by default, averaged) into a single query vector.

The generated text is never stored in the vector store. Only the real documents are indexed, and `embed_documents` passes them straight to the base embedding model.

The idea is that a hypothetical answer, even an imperfect one, usually lies closer to the relevant passages in embedding space than the short question does.
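The HyDE query flow (generate a hypothetical answer, then embed the answer rather than the question) can be sketched without any external service. Everything below is a stand-in invented for illustration: `fake_llm` returns a canned passage instead of calling a model, and `toy_embed` is a bag-of-words count instead of a real embedding model. Neither is part of LangChain.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(question: str) -> str:
    """Hypothetical stand-in for the LLM call: returns a canned passage."""
    return ("Karma yoga is selfless action performed without attachment "
            "to the results of work.")


def hyde_embed_query(question: str,
                     generate: Callable[[str], str],
                     embed: Callable[[str], Counter]) -> Counter:
    hypothetical_doc = generate(question)  # step 1: write a hypothetical answer
    return embed(hypothetical_doc)         # step 2: embed the answer, not the question


passage = ("The one who does work as a service, free from attachment "
           "to results, is a Karma yogi.")
question = "What is karma yoga?"

plain_score = cosine(toy_embed(question), toy_embed(passage))
hyde_score = cosine(hyde_embed_query(question, fake_llm, toy_embed),
                    toy_embed(passage))
print(f"plain query: {plain_score:.3f}, HyDE query: {hyde_score:.3f}")
```

In this toy setup the hypothetical answer shares more vocabulary with the stored passage than the bare question does, so it scores higher. Real embedding models capture meaning rather than word overlap, but the motivation is the same.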



## Making Your Own Prompts
You can also supply your own prompt to the `LLMChain` that generates the hypothetical documents. This helps when you know the domain you are querying: a custom prompt steers the generated text toward that domain, so it resembles the documents you want to retrieve.

Let's try this out. We'll write a general research-assistant prompt, which we'll use in the next example.

In [11]:
prompt_template = """
As a knowledgeable and helpful research assistant, your task is to provide informative answers based on the given context. Use your extensive knowledge base to offer clear, concise, and accurate responses to the user's inquiries.

Question: {question}

Answer:
"""


prompt = PromptTemplate(input_variables=["question"], template=prompt_template)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [12]:
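# Use the base OpenAI embeddings (`emebeddings`) here; passing the earlier
# HyDE wrapper would nest one embedder inside another
embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain,
    base_embeddings=emebeddings
)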
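If you want the generated text to sound like the book being indexed, the prompt can name the topic explicitly. The template below is a hypothetical example written for this notebook's subject (its wording is an assumption, not a recommended prompt), and `fill` is a small helper that shows the exact text the LLM would receive.

```python
# Hypothetical topic-specific template; the wording is an illustration only.
gita_template = (
    "You are a scholar of the Bhagavad Gita.\n"
    "Write a short passage, in the style of an English translation with "
    "commentary, that answers the question.\n\n"
    "Question: {question}\n\n"
    "Passage:"
)


def fill(template: str, question: str) -> str:
    """Return the prompt text after substituting the question."""
    return template.format(question=question)


filled = fill(gita_template, "What is karma-yoga?")
print(filled)
```

A template like this can be wrapped in `PromptTemplate(input_variables=["question"], template=gita_template)` in the same way as `prompt_template` above.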

Loading data from the PDF

The book is available at https://www.gita-society.com/bhagavad-gita-in-english-source-file.pdf (the same file fetched with `wget` above).

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

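# Load the PDF (PyPDFLoader returns one Document per page)
pdf_path = '/content/bhagavad-gita-in-english-source-file.pdf'

loader = PyPDFLoader(pdf_path)
# load() is enough here; load_and_split() would split the pages once
# before the explicit splitter below splits them again
docs = loader.load()

# Split the pages into chunks of up to 500 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
documents = text_splitter.split_documents(docs)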
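To see what `chunk_size=500` and `chunk_overlap=50` mean, here is a simplified fixed-window splitter. It is a sketch only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `sliding_chunks` helper below cuts purely by character position.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500,
                   chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1200 characters of dummy text
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbours share 50 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.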


In [17]:
from langchain.vectorstores import LanceDB
import lancedb

# Use LanceDB as the vector store; the table is seeded with one row
# so that LanceDB can infer the schema from it
db = lancedb.connect('/tmp/lancedb')
table = db.create_table("documentsai", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")
vector_store = LanceDB.from_documents(documents, embeddings, connection=table)


In [18]:
# Pass a plain string query to retrieve the most similar chunks
query = "What is karma -yoga  ?"

vector_store.similarity_search(query)

[Document(page_content='gaged in work, such a Karma -yogi is not bound by Karma. (4.22) \nThe one who is free f rom attachment, whose mind is fixed in Self -\nknowledge, who does work as a service (Sev a) to the Lord, all K ar-\nmic bonds of such a philanthropic person ( Karma -yogi) dissolve \naway. (4.23) God shall be realized by the one who considers eve-\nrything as a manifest ation or an act of God. (Also see 9.16) (4.24)  \nDifferent types of spiritual practices', metadata={'vector': array([-0.00890432, -0.01419295,  0.00024622, ..., -0.0255662 ,
         0.01837529, -0.0352935 ], dtype=float32), 'id': '849b3475-6bf5-4a6a-955c-aa9c1426cdbb', '_distance': 0.2407873421907425}),
 Document(page_content='renunciation (Samny asa) is also known as Karma -yoga . No one \nbecomes a Karma -yogi who has not renounced the selfish motive \nbehind an action. (6.02)  \nA definition of yoga and yogi  \nFor the wise who seeks to attain yoga of meditation or calm-\nness of mind, Karma -yoga  is sa
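Each result in the output above carries a `_distance` value, and smaller values mean a closer match. A brute-force version of this ranking can be sketched as follows. The two-dimensional vectors and passage labels are made up for the example, and Euclidean distance is an assumption: the metric a real vector store uses depends on how it is configured.

```python
import math
from typing import List, Tuple


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: List[float],
          rows: List[Tuple[str, List[float]]],
          k: int = 2) -> List[Tuple[float, str]]:
    """Score every stored vector against the query and keep the k closest."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in rows]
    return sorted(scored)[:k]


# Toy "table": (text, vector) pairs standing in for embedded chunks
rows = [
    ("passage about karma-yoga", [0.9, 0.1]),
    ("passage about meditation", [0.2, 0.8]),
    ("passage about renunciation", [0.7, 0.3]),
]

results = top_k([1.0, 0.0], rows, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

With HyDE, the query vector fed into this ranking is the embedding of the generated hypothetical passage rather than of the question string.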

In [None]:
llm_chain.run("tell me 10 key points from this book bhagavad gita")

In [None]:
llm_chain.run("explain the karma & its impact ")

In [None]:
llm_chain.run("who is yogi ?")

In [None]:
llm_chain.run("what are the 3 models of nature?")

In [None]:
llm_chain.run("what is god ")