In [1]:
%%capture
!pip install langchain openai tiktoken
!pip install lancedb
!pip install pypdf

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

Enter Your OpenAI API Key:··········


In [None]:
# Download the book into the folder read by the PDF loader below, to reproduce the same results
!mkdir -p /content/book
!wget -P /content/book https://www.gita-society.com/bhagavad-gita-in-english-source-file.pdf

**Introducing Hypothetical Document Embeddings (HyDE): A Game-Changer in Zero-Shot Dense Retrieval**

Hypothetical Document Embeddings (HyDE) is a retrieval technique that addresses a longstanding challenge: effective zero-shot dense retrieval when no relevance labels are available. This notebook explains how HyDE works and then applies it to a PDF of the Bhagavad Gita using LangChain and LanceDB.

**The Challenge in Traditional Dense Retrieval**

Traditionally, dense retrieval methods have depended heavily on relevance labels: pairs of queries and relevant documents used to train the encoder to match semantic similarity between queries and documents. This dependency becomes a significant hurdle when no large labeled dataset is available for training. Zero-shot dense retrieval, which must work without such labels, has therefore remained a considerable challenge.

**What is HyDE?**

HyDE stands out as an innovative solution to this problem. It's an embedding technique that fundamentally changes the retrieval process. Given a query, HyDE uses a language model to generate a hypothetical document. This document, while not real and possibly containing inaccuracies, captures essential relevance patterns to the query.

**The Process of HyDE**

The magic of HyDE begins with feeding a query into a generative model. The instruction is simple yet powerful: "write a document that answers this question." The result is a hypothetical document that, despite not being real, encapsulates the essence of the query's relevance.

**The Novelty of Hypothetical Document Embedding**

HyDE does not search with the query's own embedding. Instead, it embeds the generated hypothetical document and uses that vector to search against the corpus embeddings. The hypothetical text is only an intermediate step: it is not stored in the vector store index and is not returned to the user. The real documents most similar to this embedding are retrieved, and because the encoder compresses the text into a dense vector, inaccurate details in the generated document tend to matter less than its overall topic and phrasing.

**Semantic Similarity: The Core Idea**
The core idea behind HyDE is that a hypothetical answer to a question is more semantically similar to the actual answer than the question itself. In practical terms, this means your search would use a generative model like GPT to create a hypothetical answer, embed it, and then use this embedding for the search.
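
The idea can be illustrated with a toy example. The sketch below is not a dense model: it uses a bag-of-words count vector as a stand-in embedding, a hand-written string in place of an LLM-generated answer, and a passage paraphrased from the book for illustration. The names `embed` and `cosine` are hypothetical helpers. It only shows the comparison HyDE relies on: how close the query and the hypothetical answer each are to a real passage.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A passage standing in for a real corpus chunk (paraphrased for illustration).
passage = "The calmness of mind in both success and failure is called Karma-yoga."

query = "What is karma-yoga?"
# Hand-written stand-in for what an LLM might generate; it may be imperfect.
hypothetical_answer = (
    "Karma-yoga is doing work without attachment to results, "
    "keeping the mind calm in success and failure."
)

sim_query = cosine(embed(query), embed(passage))
sim_hyde = cosine(embed(hypothetical_answer), embed(passage))
print(f"query vs passage:        {sim_query:.3f}")
print(f"hypothetical vs passage: {sim_hyde:.3f}")
```

In this toy setup the hypothetical answer shares more vocabulary with the passage than the short question does, so it scores higher. A real embedding model captures meaning rather than word overlap, but the comparison has the same shape.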


**Implementing HyDE in LangChain**

To use HyDE in LangChain, you provide a base embedding model and an LLM (or an `LLMChain`) for generating documents. The `HypotheticalDocumentEmbedder` class comes with default prompts, selected by a key such as `"web_search"`, and it also accepts a custom prompt.

In [None]:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.prompts import PromptTemplate

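In [None]:
# instantiate the LLM and the base embedding model
llm = OpenAI()
base_embeddings = OpenAIEmbeddings()

In [None]:
# wrap the base model: each query is first expanded into a hypothetical document by the LLM
embeddings = HypotheticalDocumentEmbedder.from_llm(llm, base_embeddings, "web_search")

# Now we can use it like any other embedding class
result = embeddings.embed_query("What does the Bhagavad Gita tell us?")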


## Multiple generations

We can also generate several hypothetical documents and combine their embeddings. By default, the embeddings are combined by taking their average. To do this, configure the LLM that generates the documents to return multiple completions per prompt.
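
As a minimal sketch of the averaging step, the block below takes the element-wise mean of several vectors. The function name `combine_by_mean` and the three 4-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and this is not LangChain's own code.

```python
from typing import List


def combine_by_mean(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) walks the vectors column by column
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Three made-up embeddings, one per generated hypothetical document.
generated = [
    [0.25, 0.0, 1.0, -0.5],
    [0.75, 0.5, 1.0, 0.5],
    [0.5, 1.0, 1.0, 0.0],
]
combined = combine_by_mean(generated)
print(combined)
```

Averaging smooths out the quirks of any single generation, at the cost of one extra completion per additional document.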


In [None]:
multi_llm = OpenAI(n=3, best_of=3)

In [None]:
# pass the plain base embedding model here, not the HyDE wrapper created earlier
embeddings = HypotheticalDocumentEmbedder.from_llm(
    multi_llm, OpenAIEmbeddings(), "web_search"
)

In [None]:
result = embeddings.embed_query("What does the Bhagavad Gita tell us?")

The `HypotheticalDocumentEmbedder` does generate hypothetical documents: `embed_query` sends the query through the LLM chain and collects one or more generated texts.

Those texts are then embedded with the base embedding model. When there are several, their vectors are combined (by averaging, by default) into a single query vector.

The generated text is only an intermediate step. It is not inserted into the vector store and is not returned to the caller; only the resulting vector is used for the similarity search.

Corpus documents are treated differently: `embed_documents` passes them straight to the base embedding model, so the index holds embeddings of the real text.



## Making Your Own Prompts
You can also write your own prompt for the `LLMChain` that generates the hypothetical documents. This is helpful when you know the domain of the queries: a custom prompt yields generated text that resembles your corpus more closely.

Let's try this out. We'll write a general question-answering prompt and use it to build the embedder for the Bhagavad Gita example that follows.

In [None]:
prompt_template = """
As a knowledgeable and helpful research assistant, your task is to provide informative answers based on the given context. Use your extensive knowledge base to offer clear, concise, and accurate responses to the user's inquiries.

Question: {question}

Answer:
"""


prompt = PromptTemplate(input_variables=["question"], template=prompt_template)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [None]:
# build the HyDE embedder from the custom chain; base_embeddings must be the plain model
embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain,
    base_embeddings=OpenAIEmbeddings(),
)

## Loading data from PDF

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

# Load every PDF found in the folder
pdf_folder_path = '/content/book'
loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()

# Split the pages into chunks of up to 500 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
documents = text_splitter.split_documents(docs)
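
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts fixed-size character windows. It is an assumption-laden stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs, lines, and spaces. The helper name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows; neighbors share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole, or nearly whole, in at least one chunk, which helps retrieval find it.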


In [None]:
from langchain.vectorstores import LanceDB
import lancedb

# LanceDB as the vector store; the single seed row gives the table its columns
db = lancedb.connect('/tmp/lancedb')
table = db.create_table("documentsai", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")
# embed every chunk and add it to the table
vector_store = LanceDB.from_documents(documents, embeddings, connection=table)


In [None]:
# pass the query string; the HyDE embedder expands it before the similarity search
query = "What is karma -yoga  ?"

vector_store.similarity_search(query)

[Document(page_content='gaged in work, such a Karma -yogi is not bound by Karma. (4.22) \nThe one who is free f rom attachment, whose mind is fixed in Self -\nknowledge, who does work as a service (Sev a) to the Lord, all K ar-\nmic bonds of such a philanthropic person ( Karma -yogi) dissolve \naway. (4.23) God shall be realized by the one who considers eve-\nrything as a manifest ation or an act of God. (Also see 9.16) (4.24)  \nDifferent types of spiritual practices', metadata={'vector': array([-0.00885663, -0.01420569,  0.00011372, ..., -0.02555135,
         0.01844176, -0.03537256], dtype=float32), 'id': '073ebe84-0eac-444e-9547-1fdc17f568ed', '_distance': 0.22069776058197021}),
 Document(page_content='the best of your ability, O Arjuna, with your mind attached to the \nLord, aba ndoning worry and attachment  to the results, and remain-\ning calm in both success and failure. The calmness of mind  is \ncalled Karma -yoga . (2.48) Work done with selfish motives is infe-\nrior by far 
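
The `_distance` values in the output above indicate how far each stored chunk vector is from the query vector. As a rough sketch of what a similarity search does conceptually, the block below ranks a few made-up vectors by Euclidean distance. The vectors, labels, and the `brute_force_search` helper are assumptions for illustration; the metric and indexing that LanceDB actually uses depend on its configuration.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(
    query_vec: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k rows closest to the query as (distance, text) pairs."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]


# Made-up 3-dimensional vectors standing in for embedded chunks.
rows = [
    ("chunk about karma-yoga", [0.9, 0.1, 0.0]),
    ("chunk about meditation", [0.1, 0.9, 0.0]),
    ("chunk about devotion", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stand-in for the HyDE query embedding

hits = brute_force_search(query_vec, rows, k=2)
for distance, text in hits:
    print(f"{distance:.3f}  {text}")
```

A vector database avoids comparing against every row at scale by using an index, but the ranking it approximates is the same.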

In [None]:
llm_chain.run("tell me 10 key points from this book bhagavad gita")

'1. We are all connected to the divine and should strive to realize this connection.\n2. Our actions in life should be guided by dharma, or right conduct.\n3. We should strive to be free from attachment and selfish desires.\n4. We should strive to control our senses and mind.\n5. We should strive to remain dedicated to our spiritual path.\n6. We should focus on the present and not be swayed by the past or future.\n7. We should strive to live a balanced life of moderation.\n8. We should always strive to do our utmost duty.\n9. We should strive to develop our inner spiritual power.\n10. We should strive to surrender to the divine will.'

In [None]:
llm_chain.run("explain the karma & its impact ")

'Karma is a spiritual concept in Hinduism, Buddhism, Jainism, Sikhism, and Taoism which suggests that the actions of an individual (as well as the intentions behind those actions) have an effect on the future of that individual. It is believed that good deeds result in favourable karma, while bad deeds result in bad karma. The impact of karma can be seen in many aspects of life, including relationships, health, success, and overall wellbeing.'

In [None]:
llm_chain.run("who is yogi ?")

'Yogi is a bear character created by Hanna-Barbera who first appeared in 1958. He is a brown bear who likes to talk in rhyme and is known for his catchphrase "Hey, Boo Boo!". He is best friends with his sidekick, Boo Boo.'

In [None]:
llm_chain.run("what are the 3 models of nature?")

'The three models of nature are the mechanistic model, the organismic model, and the ecosystemic model. The mechanistic model views nature as an inanimate system of parts. The organismic model views nature as a living organism. The ecosystemic model views nature as an interconnected web of living and nonliving components.'

In [None]:
llm_chain.run("what is god ")

'God is traditionally understood as a supernatural being who is all-knowing, all-powerful, and the creator of the universe. Different religions and cultures have different beliefs about the nature of God.'