<a href="https://colab.research.google.com/github/malinphy/q_17/blob/main/HYDE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!wget  https://www.dropbox.com/s/9nulx9nxn4chujw/faiss.msmarco-v1-passage.contriever.pq-m192.tar.gz
!tar -xvf faiss.msmarco-v1-passage.contriever.pq-m192.tar.gz

--2023-03-02 13:25:04--  https://www.dropbox.com/s/9nulx9nxn4chujw/faiss.msmarco-v1-passage.contriever.pq-m192.tar.gz
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/9nulx9nxn4chujw/faiss.msmarco-v1-passage.contriever.pq-m192.tar.gz [following]
--2023-03-02 13:25:05--  https://www.dropbox.com/s/raw/9nulx9nxn4chujw/faiss.msmarco-v1-passage.contriever.pq-m192.tar.gz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc269330f338390cadcb1895a08f.dl.dropboxusercontent.com/cd/0/inline/B3c2HUl-w1fldOcdb8zKgzNFsvury-VTehej58e4NpT0Nw1nUTPqDg46yesiGssh3qT1BWzJICM4qGsRsU9KAzA6EtzKNqbTQzLFYpb_n8Jcecq2wd0m8fmu4dFmHV-G5JheevfoTqM8zHjCx-ZvDXhl-EnBsMlj4tjJlm4v19pJCw/file# [following]
--2023-03-02 13:25:05--  https://uc269330f338390cadcb1895a08f.dl.d

In [2]:
!tar -xvf faiss.msmarco-v1-passage.contriever.pq-m192.tar.gz.1

faiss.msmarco-v1-passage.contriever.pq-m192/
faiss.msmarco-v1-passage.contriever.pq-m192/docid
faiss.msmarco-v1-passage.contriever.pq-m192/index


In [3]:
!pip install pyserini -q

In [4]:
!pip install openai -q
!pip install cohere -q
!pip install faiss-cpu -q

In [5]:
import time
import openai
import cohere

class Generator:
    def __init__(self, model_name, api_key):
        self.model_name = model_name
        self.api_key = api_key
    
    def generate(self, prompt):
        # Base-class placeholder; subclasses override this to call a real API.
        return ""


class OpenAIGenerator(Generator):
    def __init__(self, model_name, api_key, n=8, max_tokens=512, temperature=0.7, top_p=1, frequency_penalty=0.0, presence_penalty=0.0, stop=['\n\n\n'], wait_till_success=False):
        super().__init__(model_name, api_key)
        self.n = n
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.top_p = top_p
        self.frequency_penalty = frequency_penalty
        self.presence_penalty = presence_penalty
        self.stop = stop
        self.wait_till_success = wait_till_success
    
    @staticmethod
    def parse_response(response):
        # Rank the n completions by summed token log-probability, most likely first.
        to_return = []
        for g in response['choices']:
            text = g['text']
            logprob = sum(g['logprobs']['token_logprobs'])
            to_return.append((text, logprob))
        texts = [r[0] for r in sorted(to_return, key=lambda tup: tup[1], reverse=True)]
        return texts

    def generate(self, prompt):
        get_results = False
        while not get_results:
            try:
                result = openai.Completion.create(
                    engine=self.model_name,
                    prompt=prompt,
                    api_key=self.api_key,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    frequency_penalty=self.frequency_penalty,
                    presence_penalty=self.presence_penalty,
                    top_p=self.top_p,
                    n=self.n,
                    stop=self.stop,
                    logprobs=1
                )
                get_results = True
            except Exception:
                if self.wait_till_success:
                    # Retry indefinitely, pausing one second between attempts.
                    time.sleep(1)
                else:
                    # Bare raise preserves the original traceback.
                    raise
        return self.parse_response(result)


class CohereGenerator(Generator):
    def __init__(self, model_name, api_key, n=8, max_tokens=512, temperature=0.7, p=1, frequency_penalty=0.0, presence_penalty=0.0, stop=['\n\n\n'], wait_till_success=False):
        super().__init__(model_name, api_key)
        self.cohere = cohere.Cohere(self.api_key)
        self.n = n
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.p = p
        self.frequency_penalty = frequency_penalty
        self.presence_penalty = presence_penalty
        self.stop = stop
        self.wait_till_success = wait_till_success

    
    @staticmethod
    def parse_response(response):
        text = response.generations[0].text
        return text
    
    def generate(self, prompt):
        texts = []
        for _ in range(self.n):
            get_result = False
            while not get_result:
                try:
                    result = self.cohere.generate(
                        prompt=prompt,
                        model=self.model_name,
                        max_tokens=self.max_tokens,
                        temperature=self.temperature,
                        frequency_penalty=self.frequency_penalty,
                        presence_penalty=self.presence_penalty,
                        p=self.p,
                        k=0,
                        stop=self.stop,
                    )
                    get_result = True
                except Exception:
                    if self.wait_till_success:
                        # Retry indefinitely, pausing one second between attempts.
                        time.sleep(1)
                    else:
                        # Bare raise preserves the original traceback.
                        raise
            text = self.parse_response(result)
            texts.append(text)
        return texts
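
Both generators above retry forever with a fixed one-second pause when `wait_till_success` is set. The following is a minimal sketch of a bounded alternative with exponential backoff. It is not part of the notebook: `retry_with_backoff`, `max_attempts`, `base_delay`, and the injectable `sleep` argument are hypothetical names chosen for illustration.

```python
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts calls have failed."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: re-raise the last error with its traceback.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
            sleep(base_delay * 2 ** attempt)
```

One possible use is to construct a generator with `wait_till_success=False` and wrap the call, for example `retry_with_backoff(lambda: generator.generate(prompt))`, so that a persistent failure such as an invalid API key eventually surfaces instead of looping.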

In [6]:
import numpy as np


class HyDE:
    def __init__(self, promptor, generator, encoder, searcher):
        self.promptor = promptor
        self.generator = generator
        self.encoder = encoder
        self.searcher = searcher
    
    def prompt(self, query):
        return self.promptor.build_prompt(query)

    def generate(self, query):
        prompt = self.promptor.build_prompt(query)
        hypothesis_documents = self.generator.generate(prompt)
        return hypothesis_documents
    
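    def encode(self, query, hypothesis_documents):
        # Embed the query and each hypothetical document, then average the
        # embeddings into a single vector of shape (1, dim) for the searcher.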
        all_emb_c = []
        for c in [query] + hypothesis_documents:
            c_emb = self.encoder.encode(c)
            all_emb_c.append(np.array(c_emb))
        all_emb_c = np.array(all_emb_c)
        avg_emb_c = np.mean(all_emb_c, axis=0)
        hyde_vector = avg_emb_c.reshape((1, len(avg_emb_c)))
        return hyde_vector
    
    def search(self, hyde_vector, k=10):
        hits = self.searcher.search(hyde_vector, k=k)
        return hits
    

    def e2e_search(self, query, k=10):
        prompt = self.promptor.build_prompt(query)
        hypothesis_documents = self.generator.generate(prompt)
        hyde_vector = self.encode(query, hypothesis_documents)
        hits = self.searcher.search(hyde_vector, k=k)
        return hits
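
The averaging step in `HyDE.encode` can be checked in isolation with toy vectors. This is a small sketch, assuming NumPy is installed; the helper name `average_embeddings` and the three-dimensional vectors are made up for illustration and stand in for real Contriever embeddings.

```python
import numpy as np


def average_embeddings(query_emb, doc_embs):
    """Average the query embedding with the hypothetical-document embeddings."""
    # Stack into shape (1 + n_docs, dim), one row per embedding.
    stacked = np.vstack([query_emb] + list(doc_embs))
    # Mean over rows; keepdims gives shape (1, dim), as FAISS expects a 2-D query.
    return stacked.mean(axis=0, keepdims=True)


query_emb = np.array([1.0, 0.0, 0.0])
doc_embs = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
vec = average_embeddings(query_emb, doc_embs)
print(vec.shape)
```

The query counts as one vector among the generated documents, so with `n` hypothetical documents it carries a weight of `1 / (n + 1)` in the final search vector.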

In [7]:
WEB_SEARCH = """Please write a passage to answer the question.
Question: {}
Passage:"""


SCIFACT = """Please write a scientific paper passage to support/refute the claim.
Claim: {}
Passage:"""


ARGUANA = """Please write a counter argument for the passage.
Passage: {}
Counter Argument:"""


TREC_COVID = """Please write a scientific paper passage to answer the question.
Question: {}
Passage:"""


FIQA = """Please write a financial article passage to answer the question.
Question: {}
Passage:"""


DBPEDIA_ENTITY = """Please write a passage to answer the question.
Question: {}
Passage:"""


TREC_NEWS = """Please write a news passage about the topic.
Topic: {}
Passage:"""


MR_TYDI = """Please write a passage in {} to answer the question in detail.
Question: {}
Passage:"""


class Promptor:
    def __init__(self, task: str, language: str = 'en'):
        self.task = task
        self.language = language
    
    def build_prompt(self, query: str):
        if self.task == 'web search':
            return WEB_SEARCH.format(query)
        elif self.task == 'scifact':
            return SCIFACT.format(query)
        elif self.task == 'arguana':
            return ARGUANA.format(query)
        elif self.task == 'trec-covid':
            return TREC_COVID.format(query)
        elif self.task == 'fiqa':
            return FIQA.format(query)
        elif self.task == 'dbpedia-entity':
            return DBPEDIA_ENTITY.format(query)
        elif self.task == 'trec-news':
            return TREC_NEWS.format(query)
        elif self.task == 'mr-tydi':
            return MR_TYDI.format(self.language, query)
        else:
            raise ValueError('Task not supported')
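
The `if`/`elif` chain in `Promptor.build_prompt` could also be written as a dictionary lookup. The sketch below is one possible alternative, not the notebook's implementation: `TEMPLATES` and the free function `build_prompt` are illustrative names, and only two of the templates above are copied in to keep it short.

```python
# Two templates copied from the cell above; the others would be added the same way.
TEMPLATES = {
    'web search': "Please write a passage to answer the question.\nQuestion: {}\nPassage:",
    'mr-tydi': "Please write a passage in {} to answer the question in detail.\nQuestion: {}\nPassage:",
}


def build_prompt(task, query, language='en'):
    """Fill the template registered for `task` with the query."""
    try:
        template = TEMPLATES[task]
    except KeyError:
        raise ValueError(f'Task not supported: {task}') from None
    # Only the Mr. TyDi template has a second slot, for the language.
    if task == 'mr-tydi':
        return template.format(language, query)
    return template.format(query)


print(build_prompt('web search', 'how long does it take to remove wisdom tooth'))
```

Adding a task then means adding one dictionary entry rather than another branch.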

In [9]:
import json
from pyserini.search import FaissSearcher, LuceneSearcher
from pyserini.search.faiss import AutoQueryEncoder

# from hyde import Promptor, OpenAIGenerator, CohereGenerator, HyDE

In [10]:
KEY = 'KEY'  # Replace with your API key (OpenAI or Cohere, matching the generator used below).
promptor = Promptor('web search')
generator = OpenAIGenerator('text-davinci-003', KEY)
encoder = AutoQueryEncoder(encoder_dir='facebook/contriever', pooling='mean')
# searcher = FaissSearcher('contriever_msmarco_index/', encoder)
searcher = FaissSearcher('faiss.msmarco-v1-passage.contriever.pq-m192/', encoder)
corpus = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')

Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.msmarco-v1-passage.20221004.252b5e.tar.gz...


lucene-index.msmarco-v1-passage.20221004.252b5e.tar.gz: 2.02GB [00:30, 70.4MB/s]                            


In [11]:
hyde = HyDE(promptor, generator, encoder, searcher)

In [12]:
query = 'how long does it take to remove wisdom tooth'

In [13]:
prompt = hyde.prompt(query)
print(prompt)

Please write a passage to answer the question.
Question: how long does it take to remove wisdom tooth
Passage:


In [14]:
hypothesis_documents = hyde.generate(query)
for i, doc in enumerate(hypothesis_documents):
    print(f'HyDE Generated Document: {i}')
    print(doc.strip())

HyDE Generated Document: 0
The length of time it takes to remove a wisdom tooth will vary depending on the individual. Generally, the procedure can take anywhere from 30 minutes to several hours. The complexity of the extraction, the location of the tooth, and the number of teeth being removed all affect the length of time it takes to complete the process. In most cases, the entire procedure can be completed in one visit to the dentist. However, if multiple teeth are being extracted or the tooth is impacted, the process may take longer and require multiple visits.
HyDE Generated Document: 1
Removing a wisdom tooth typically takes anywhere from 20 minutes to an hour, depending on the complexity of the procedure. If the tooth is impacted, or stuck in the jawbone, the procedure may take longer. Removal of a wisdom tooth usually requires a local anesthetic to numb the area and a sedative, such as nitrous oxide, to keep the patient comfortable. In some cases, general anesthesia may be requi

In [15]:
hyde_vector = hyde.encode(query, hypothesis_documents)
print(hyde_vector.shape)

(1, 768)


In [16]:
hits = hyde.search(hyde_vector, k=10)
for i, hit in enumerate(hits):
    print(f'HyDE Retrieved Document: {i}')
    print(hit.docid)
    print(json.loads(corpus.doc(hit.docid).raw())['contents'])

HyDE Retrieved Document: 0
91493
The time it takes to remove the tooth will vary. Some procedures only take a few minutes, whereas others can take 20 minutes or longer. After your wisdom teeth have been removed, you may experience swelling and discomfort, both on the inside and outside of your mouth. This is usually worse for the first three days, but it can last for up to two weeks. Read more about how a wisdom tooth is removed and recovering from wisdom tooth removal.
HyDE Retrieved Document: 1
4174313
The time it takes to remove the tooth will vary. Some procedures only take a few minutes, whereas others can take 20 minutes or longer. After your wisdom teeth have been removed, you may experience swelling and discomfort, both on the inside and outside of your mouth.This is usually worse for the first three days, but it can last for up to two weeks. Read more about how a wisdom tooth is removed and recovering from wisdom tooth removal.he time it takes to remove the tooth will vary. So

In [17]:
hits = hyde.e2e_search(query, k=10)
for i, hit in enumerate(hits):
    print(f'HyDE Retrieved Document: {i}')
    print(hit.docid)
    print(json.loads(corpus.doc(hit.docid).raw())['contents'])

HyDE Retrieved Document: 0
91493
The time it takes to remove the tooth will vary. Some procedures only take a few minutes, whereas others can take 20 minutes or longer. After your wisdom teeth have been removed, you may experience swelling and discomfort, both on the inside and outside of your mouth. This is usually worse for the first three days, but it can last for up to two weeks. Read more about how a wisdom tooth is removed and recovering from wisdom tooth removal.
HyDE Retrieved Document: 1
4174313
The time it takes to remove the tooth will vary. Some procedures only take a few minutes, whereas others can take 20 minutes or longer. After your wisdom teeth have been removed, you may experience swelling and discomfort, both on the inside and outside of your mouth.This is usually worse for the first three days, but it can last for up to two weeks. Read more about how a wisdom tooth is removed and recovering from wisdom tooth removal.he time it takes to remove the tooth will vary. So