### Query Transformation using HyDE: Pre-retrieval optimization technique for Advanced RAG

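In traditional RAG, the user query is embedded and compared against the document embeddings to find the chunks that are semantically closest to it. The problem with this approach is that a query and the passages that answer it are often worded very differently: a query is usually a short question, while the documents are longer, declarative text. As a result, the query embedding is not always well aligned with the embeddings of the relevant documents. HyDE (Hypothetical Document Embeddings) addresses this by searching with the embedding of an LLM-generated hypothetical answer instead of the raw query.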

##### How it Works:
* Hypothetical document generation: the user query is passed to an LLM, which generates one or more hypothetical documents (answers).
* Encoding: an embedding model encodes these hypothetical documents into embeddings.
* Retrieval: similarity search finds the document chunks closest to the hypothetical document embeddings.
* Generation: the retrieved document chunks are used to generate the final answer.
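
The retrieval part of these steps can be sketched without any external service. The example below is a toy illustration only: `fake_llm`, `toy_embed`, and the three sample chunks are hypothetical stand-ins for a real LLM, a real embedding model, and a real document store, and a bag-of-words vector is used in place of a learned embedding.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(query):
    """Hypothetical stand-in for an LLM call; returns a canned hypothetical answer."""
    return ("Multicollinearity exists when independent variables are "
            "highly correlated with each other.")


def hyde_retrieve(query, chunks, k=1):
    hypothetical_doc = fake_llm(query)        # step 1: generate a hypothetical answer
    query_vec = toy_embed(hypothetical_doc)   # step 2: embed it instead of the raw query
    ranked = sorted(                          # step 3: rank chunks by similarity
        chunks,
        key=lambda chunk: cosine(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]                         # step 4 would pass these chunks to the LLM


chunks = [
    "Multicollinearity: the independent variables are moderately or highly correlated.",
    "A positive relationship means the output increases as the input increases.",
    "Linear regression is used in finance, marketing, and healthcare.",
]
top = hyde_retrieve(
    "What are the types of relationships between independent variables?", chunks
)
print(top[0])
```

Note that the query itself never reaches the similarity search; only the hypothetical answer does, which is why a wrong hypothetical answer leads to irrelevant chunks.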

HyDE can improve retrieval accuracy and, by grounding the answer in more relevant chunks, help reduce hallucinations. Its main limitation is that it works only when the LLM has some knowledge of the topic being asked about. If the LLM produces a wrong hypothetical answer to the user query, the search is steered by that wrong answer and the retrieved document chunks are unlikely to be relevant.

## Implementation using LangChain

In [1]:
import os
# Set your OpenAI API key here (left blank on purpose)
os.environ['OPENAI_API_KEY'] = ""

In [2]:
# ChatOpenAI is imported from langchain_openai, the same package as the embeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import WebBaseLoader

In [22]:
# Load data from website
url = "https://utsavdesai26.medium.com/linear-regression-made-simple-a-step-by-step-tutorial-fb8e737ea2d9"
loader = WebBaseLoader(url)
docs = loader.load()

In [23]:
# Split data into chunks of up to 1000 characters, with 100 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(docs)
len(texts)

21

In [24]:
# Embed every chunk and index the vectors in an in-memory FAISS store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(texts, embeddings)

In [25]:
# Baseline retriever: embeds the raw query and runs a similarity search
retriever = vectorstore.as_retriever()

In [35]:
retrieved_docs = retriever.invoke("What are the types of relationships between independent variables ?")

In [36]:
len(retrieved_docs)

4

In [37]:
retrieved_docs

[Document(page_content='may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).Little or No Multicollinearity between features: Multicollinearity exists when the independent variables are found to be moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of predictors with the target variable. In other words, it becomes difficult to find out which variable is actually contributing to predict the response variable.Little or No Autocorrelation in residuals: The presence of correlation in error terms drastically reduces the model’s accuracy. This usually occurs in time series models where the next instant is dependent on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error.No Heteroscedasticity: The presence of non-constant variance in the error terms results in heteroscedasticit

In [30]:
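from langchain.chains import HypotheticalDocumentEmbedder

llm = ChatOpenAI()
# Wrap the base embeddings: each query is first expanded by the LLM into a
# hypothetical document (built-in 'web_search' prompt), and that text is embedded
hyde = HypotheticalDocumentEmbedder.from_llm(llm=llm, base_embeddings=embeddings, prompt_key='web_search')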
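
When a query is embedded through `hyde`, the LLM writes one or more hypothetical documents, each one is embedded with `base_embeddings`, and the resulting vectors are combined into a single query vector. The sketch below shows only that combining step, assuming an element-wise mean (believed to match LangChain's default behaviour, but worth checking against the installed version). The function name and the toy vectors are illustrative, not the library's code.

```python
def combine_embeddings(vectors):
    """Element-wise mean of several equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Two toy embeddings, one per hypothetical document (made-up numbers)
hypothetical_doc_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
query_vector = combine_embeddings(hypothetical_doc_vectors)
print(query_vector)  # [2.0, 1.0, 1.0]
```

With a single hypothetical document per query, the mean is just that document's embedding.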

In [31]:
# Build a second index that uses the HyDE embedder; the LLM step applies at query time
hyde_db = FAISS.from_documents(texts, hyde)

In [38]:
hyde_retriever = hyde_db.as_retriever()
hyde_retriever.invoke("What are the types of relationships between independent variables ?")

[Document(page_content='between the predicted values and the actual values. Linear regression is widely used in many real-world applications, such as finance, marketing, and healthcare, for predicting outcomes such as stock prices, customer behavior, and patient outcomes.Linear Regression LineIn machine learning, a regression line can show two types of relationships between the input variables (also known as predictors or features) and the output variable (also known as the response variable) in a linear regression model.Positive Relationship: A positive relationship exists between the input variables and the output variable when the slope of the regression line is positive. In other words, as the values of the input variables increase, the value of the output variable also increases. This can be seen as an upward slope on a scatter plot of the data.Negative Relationship: A negative relationship exists between the input variables and the output variable when the slope of the regression
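
The last step from "How it Works", generating the final answer from the retrieved chunks, is not shown in the cells above. A minimal sketch of one way to do it is to stuff the chunks into a single prompt. Here `Chunk`, `build_prompt`, the prompt wording, and the sample texts are all hypothetical; `Chunk` only mimics the `page_content` attribute of the retrieved documents.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document (only page_content is used)."""
    page_content: str


def build_prompt(question, docs):
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(doc.page_content for doc in docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy chunks, paraphrased for illustration
docs = [
    Chunk("Positive Relationship: the output increases as the inputs increase."),
    Chunk("Multicollinearity exists when the independent variables are correlated."),
]
question = "What are the types of relationships between independent variables?"
prompt = build_prompt(question, docs)
print(prompt)
```

In the notebook, `docs` would be the list returned by `hyde_retriever.invoke(...)`, and the prompt could then be passed to `llm.invoke(prompt)`, whose reply text is available as `.content`.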