# HyDE - Hypothetical Document Embeddings

- [HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels](https://github.com/texttron/hyde)
- HyDE has the LLM write a hypothetical answer to the query, then embeds that answer (instead of the raw query) and uses the resulting vector for search

![](https://github.com/texttron/hyde/raw/main/approach.png)
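The flow in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's or LangChain's implementation: `toy_embed` (bag-of-words counts), `fake_llm` (a canned answer), and `hyde_search` are hypothetical stand-ins for a real embedding model, a real LLM, and a real vector store.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(question):
    """Hypothetical stand-in for the LLM: returns a canned hypothetical document."""
    return "mcdonalds sells hamburgers fries and milkshakes"


def hyde_search(question, corpus):
    """Embed the hypothetical answer (not the question) and return the closest document."""
    query_vec = toy_embed(fake_llm(question))
    return max(corpus, key=lambda doc: cosine(query_vec, toy_embed(doc)))


corpus = [
    "the menu includes hamburgers fries and milkshakes",
    "the company reports quarterly earnings to investors",
]
question = "What items does McDonalds make?"

best = hyde_search(question, corpus)
# The raw question shares no words with the best document in this toy setup.
direct_score = cosine(toy_embed(question), toy_embed(best))
print(best)
print(direct_score)
```

In this toy setup the raw question has zero word overlap with the relevant document, while the hypothetical answer overlaps heavily, which is the intuition behind embedding the generated text.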

# 0. Setup

In [54]:
# chromadb is needed by the Chroma vector store used in section 3
!pip -q install -U boto3 awscli langchain pypdf chromadb

In [59]:
import boto3
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

from langchain.document_loaders import TextLoader
import langchain

In [18]:
profile_name = None
region = 'us-east-1'

In [24]:
# modelId = 'anthropic.claude-instant-v1'
modelId = 'anthropic.claude-v2'

In [25]:
session = boto3.Session(
    profile_name=profile_name,
    region_name=region,
)
bedrock = session.client(service_name='bedrock-runtime')

## Embeddings

- Bedrock embedding model: `amazon.titan-embed-text-v1`
- In general, BAAI's [bge](https://python.langchain.com/docs/integrations/text_embedding/bge_huggingface) models, which are SOTA, are used instead.

In [27]:
llm = Bedrock(
    model_id=modelId,
    client=bedrock,
    model_kwargs={
        'max_tokens_to_sample': 1024
    },
)
bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1",
    client=bedrock,
)

In [28]:
# Load with `web_search` prompt
embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm,
    bedrock_embeddings,
    prompt_key="web_search",
)

In [29]:
embeddings.llm_chain.prompt

PromptTemplate(input_variables=['QUESTION'], template='Please write a passage to answer the question \nQuestion: {QUESTION}\nPassage:')
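To see what the LLM actually receives, the template above can be filled in by hand. The sketch below uses plain `str.format` rather than LangChain's `PromptTemplate`, assuming the default f-string style substitution of `{QUESTION}`; the variable names are illustrative only.

```python
# Template text copied from the `web_search` prompt shown above.
template = "Please write a passage to answer the question \nQuestion: {QUESTION}\nPassage:"

# Substitute the user's question for the {QUESTION} placeholder.
prompt_text = template.format(QUESTION="What items does McDonalds make?")
print(prompt_text)
```

The resulting string is the text that gets sent to the model; its completion is what HyDE embeds.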

In [30]:
langchain.debug = True

# 1. Run query

In [31]:
# Now we can use it as any embedding class!
result = embeddings.embed_query("What items does McDonalds make?")

[32;1m[1;3m[llm/start][0m [1m[1:llm:Bedrock] Entering LLM run with input:
[0m{
  "prompts": [
    "Please write a passage to answer the question \nQuestion: What items does McDonalds make?\nPassage:"
  ]
}
[36;1m[1;3m[llm/end][0m [1m[1:llm:Bedrock] [8.83s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " Here is a passage answering the question \"What items does McDonalds make?\":\n\nMcDonald's is known for its wide variety of fast food menu items. Some of the most popular products McDonald's makes include hamburgers, cheeseburgers, Big Macs, Quarter Pounders, Chicken McNuggets, Filet-O-Fish sandwiches, french fries, milkshakes, sodas, McCafé coffee drinks, salads, wraps, and desserts like apple pies, cookies, and ice cream sundaes. McDonald's breakfast menu features options like Egg McMuffins, hotcakes, and hash browns. McDonald's frequently adds limited-time and seasonal items to its menus as well, such as the McRib sandwich. Overall, McD

# 2. Custom prompt

In [49]:
prompt_template = """
Please answer the user's question as a single food item.
Question: {QUESTION}
Answer: """.strip()
prompt = PromptTemplate(
    input_variables=["QUESTION"],
    template=prompt_template,
)

In [50]:
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [51]:
embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain,
    base_embeddings=bedrock_embeddings,
)

In [52]:
query = "What is is McDonalds best selling item?"
result = embeddings.embed_query(query)

[32;1m[1;3m[llm/start][0m [1m[1:llm:Bedrock] Entering LLM run with input:
[0m{
  "prompts": [
    "Please answer the user's question as a single food item.\nQuestion: What is is McDonalds best selling item?\nAnswer:"
  ]
}
[36;1m[1;3m[llm/end][0m [1m[1:llm:Bedrock] [763ms] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " Big Mac",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}


# 3. Advanced usage

- AWS IoT provisioning whitepaper: [download](https://docs.aws.amazon.com/pdfs/whitepapers/latest/device-manufacturing-provisioning/device-manufacturing-provisioning.pdf#device-manufacturing-provisioning) (the code below expects it saved locally as `iot.pdf`)
- Use HyDE with an indirect question

In [93]:
loader = PyPDFLoader("iot.pdf")
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = loader.load_and_split(text_splitter=text_splitter)

In [94]:
len(texts)

36
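The splitting step above can be approximated in a few lines. This is a simplified sketch of separator-based chunking, assuming a `"\n\n"` separator and no overlap (matching `chunk_overlap=0`); `split_by_separator` is a hypothetical helper, not LangChain's `CharacterTextSplitter`, which handles more cases.

```python
def split_by_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size chars.

    A single piece longer than chunk_size is kept whole rather than cut mid-text.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate       # still fits (or is the first piece of a chunk)
        else:
            chunks.append(current)    # close the current chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
chunks = split_by_separator(sample, chunk_size=10)
print(chunks)
```

Because pieces are only merged at separator boundaries, chunk lengths vary, and a paragraph longer than `chunk_size` ends up as one oversized chunk in this sketch.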

In [95]:
texts[0]

Document(page_content='Device Manufacturing and \nProvisioning with X.509 \nCertiﬁcates in AWS IoT Core\nAWS Whitepaper', metadata={'source': 'iot.pdf', 'page': 0})

In [97]:
prompt_template = """
Please answer the user's question as related to Internet of things provisioning.
Question: {QUESTION}
Answer: """.strip()
prompt = PromptTemplate(
    input_variables=["QUESTION"],
    template=prompt_template,
)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [98]:
embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain,
    base_embeddings=bedrock_embeddings,
)

In [99]:
%%time

docsearch = Chroma.from_documents(texts, embeddings)

CPU times: user 409 ms, sys: 53.1 ms, total: 462 ms
Wall time: 11.6 s


- The word "nosql" never appears in the document, yet the result mentions NoSQL databases and reports that MongoDB uses the AGPL.

In [100]:
query = "What is the best way to provision device when I can not put certificates into my devices?"
docs = docsearch.similarity_search(query)

[32;1m[1;3m[llm/start][0m [1m[1:llm:Bedrock] Entering LLM run with input:
[0m{
  "prompts": [
    "Please answer the user's question as related to Internet of things provisioning.\nQuestion: What is the best way to provision device when I can not put certificates into my devices?\nAnswer:"
  ]
}
[36;1m[1;3m[llm/end][0m [1m[1:llm:Bedrock] [26.43s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " Here are a few suggestions for provisioning internet of things (IoT) devices without being able to put certificates on them:\n\n- Use symmetric key encryption - Generate a symmetric key and install it on both the device and IoT hub/platform during manufacturing. The device can use this key to authenticate to the hub.\n\n- Use a Trust On First Use (TOFU) model - The first time a device connects to the hub/platform, its credentials are stored and trusted going forward. Not as secure as asymmetric cryptography but can work for some basic scenarios.\n\n-

In [101]:
print(docs[0].page_content)

Device Manufacturing and Provisioning with X.509 
Certiﬁcates in AWS IoT Core AWS Whitepaper
Network File System (NFS), or over a serial connection, and store those credentials to a secure place on 
the device. Security credentials and PKI may be handled and exposed to the contract manufacturer, so 
it’s important that the provisioning process is performed in a secure environment by trusted individuals.
Inject credentials at manufacturing time
Introducing customization for each device image at manufacturing time can add valuable time to 
produce each device and additional logistical overhead, because the manufacturer must track that 
customization for each device produced. This can lead to increased cost per unit to the device maker 
charged by the contract manufacturer, due to the additional time using the manufacturer’s production 
line.
To isolate and protect device keys from the ﬁrmware, device makers may choose to use a hardware 
security module (HSM), such as a secure element or 
