# Research Chain

This is an experimental research chain that attempts to answer a research question using information from the web.

A research chain is composed of the following components:

1. A searcher that searches for documents using a search engine.
    - The searcher is responsible for returning a list of URLs of documents that
      may be relevant to answering the question.
2. A downloader that downloads the documents.
3. An HTML-to-markdown parser (hard-coded) that converts the HTML to markdown.
    * Conversion to markdown is lossy.
    * However, it can significantly reduce the token count of the document.
    * Markdown helps to preserve some styling information
      (e.g., bold, italics, links, headers) which is expected to help the reader
      to answer certain kinds of questions correctly.
4. A reader that reads the documents and produces an answer.
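
The four components above can be pictured as a simple single-hop pipeline. The sketch below is only an illustration of the data flow, not the chain's actual implementation: `run_single_hop` and the four callables passed to it are hypothetical names, and the stubs stand in for a real search engine, downloader, parser, and LLM reader.

```python
from typing import Callable, Dict, List


def run_single_hop(
    question: str,
    search: Callable[[str], List[str]],
    download: Callable[[str], str],
    to_markdown: Callable[[str], str],
    read: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Question -> URLs -> HTML -> markdown -> one answer per document."""
    answers = []
    for url in search(question):            # 1. searcher proposes URLs
        html = download(url)                # 2. downloader fetches the page
        markdown = to_markdown(html)        # 3. lossy HTML-to-markdown step
        answer = read(question, markdown)   # 4. reader answers from the text
        answers.append({"source": url, "answer": answer})
    return answers


# Hypothetical stubs, used only to show the call shape.
demo = run_single_hop(
    "Who was Albert Einstein?",
    search=lambda q: ["https://example.com/einstein"],
    download=lambda url: "<p>Physicist.</p>",
    to_markdown=lambda html: html.replace("<p>", "").replace("</p>", ""),
    read=lambda q, text: text,
)
print(demo)
```

Each document is read independently, which is why the collected answers may later need a separate summarization step.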

## Limitations

* The chain can be long-running (use the initialization parameters to control how many options are explored).
* This research chain only implements a single hop at the moment; i.e.,
  it goes from the question to a list of URLs to documents to compiling answers.
  Because the crawl does not continue, websites that require pagination will not be explored fully.
* The reader chain must match the type of question. For example, the QA refine chain
  isn't good at extracting a list of entries from a long document.
* The content downloader may get blocked (since it looks like a bot).
  
## Extending

* Continue crawling documents to discover more relevant pages that were not surfaced by the search engine.
* Adapt the reading strategy to the nature of the question.
* Analyze the query to determine whether it is a multi-hop query, and change the search/crawling strategy accordingly.
* Provide smaller pieces to an agent. :)
* Add cheaper strategies for selecting which links should be explored further (e.g., based on tf-idf similarity instead of GPT-4).
* Add a summarization chain on top of the individually collected answers.
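
As a rough illustration of the tf-idf idea in the list above, links could be ranked by the cosine similarity between the question and each link's anchor text. This is a minimal standard-library sketch, not part of the chain: `rank_links`, the `{"url", "text"}` link format, and the smoothed idf formula are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Build one sparse tf-idf vector per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: in how many texts does each term appear?
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf, so terms present in every text keep a nonzero weight.
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_links(question: str, links: List[dict], top_k: int = 2) -> List[dict]:
    """Return the top_k links whose text is most similar to the question."""
    vectors = tfidf_vectors([question] + [link["text"] for link in links])
    query_vec, link_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(links, link_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [link for link, _ in scored[:top_k]]
```

A scorer like this needs no LLM call per link, at the cost of missing links whose text shares no vocabulary with the question.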

In [1]:
from langchain.chains.research.api import Research
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.research.download import PlaywrightDownloadHandler

In [2]:
question = """\
Compile information about Albert Einstein.
Ignore if it's a different Albert Einstein. 
Only include information you're certain about.

Include:
* education history
* major contributions
* names of spouse 
* date of birth
* place of birth
* a 3 sentence short biography

Format your answer in a bullet point format for each sub-question.

If the context text claims that a security verification is required or a different web-browser is 
required to download the page, then please output: "Download May Have Failed."
""".strip()

Instantiate the LLMs.

In [3]:
llm = OpenAI(temperature=0, model='text-davinci-003') # Used for the readers and the query generator
selector_llm = ChatOpenAI(temperature=0, model='gpt-4') # Used for selecting which links to explore

Create a chain that can be used to extract the answer to the question above from a given document.

This chain must be tailored to the task.

In [4]:
qa_chain = load_qa_chain(llm, chain_type='refine')

In [7]:
d = PlaywrightDownloadHandler()

In [9]:
s = await d.adownload(['https://www.britannica.com/biography/Albert-Einstein'])

ValueError: Could not create a blob for content at https://www.britannica.com/biography/Albert-Einstein. Content type is <class 'playwright._impl._api_types.TimeoutError'>

In [5]:
research = Research.from_llms(
    query_generation_llm=llm, 
    link_selection_llm=selector_llm, 
    underlying_reader_chain=qa_chain, 
    top_k_per_search=2, 
    max_num_pages_per_doc=3, 
    download_handler=PlaywrightDownloadHandler()
)

In [6]:
results = await research.acall(inputs={'question': question})

ValueError: Could not create a blob for content at https://www.britannica.com/biography/Albert-Einstein. Content type is <class 'playwright._impl._api_types.TimeoutError'>
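
The traceback above suggests that a single page timing out is enough to fail the whole call. One possible mitigation, sketched here with plain `asyncio` rather than the chain's own downloader, is to fetch each URL independently and keep only the pages that succeeded. `download_all` and the `fetch` callable are hypothetical names; nothing here is an existing option of `PlaywrightDownloadHandler`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def download_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    timeout_s: float = 30.0,
) -> Dict[str, str]:
    """Fetch every URL concurrently and return only the successful pages."""

    async def fetch_one(url: str) -> str:
        # Bound each download separately so one slow page cannot stall the rest.
        return await asyncio.wait_for(fetch(url), timeout=timeout_s)

    # return_exceptions=True turns failures into values instead of raising.
    results = await asyncio.gather(
        *(fetch_one(url) for url in urls), return_exceptions=True
    )
    return {
        url: result
        for url, result in zip(urls, results)
        if not isinstance(result, BaseException)
    }
```

Failed or blocked URLs are silently dropped in this sketch; a real version would probably want to log them so the reader knows which sources are missing.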

In [None]:
for doc in results['docs']:
    print('--'*80)
    print(doc.metadata['source'])
    print(doc.page_content)

If useful, we can produce another summary!

In [None]:
qa_chain = load_qa_chain(llm, chain_type='stuff')

In [None]:
summary = await qa_chain.acall(inputs={'input_documents': results['docs'], 'question': question})

In [None]:
print(summary['output_text'])

## Under the hood

The searcher is invoked first to find URLs that are worth exploring.

In [None]:
results = await research.searcher.acall(inputs={'question': question})
results

The web pages are then downloaded.

In [None]:
blobs = await research.downloader.adownload(results['urls'])

In [None]:
blobs[0].as_string()[:50]

In [None]:
blobs[0].source

The blobs are parsed with an HTML parser and read by the reader chain (not shown); see the underlying code for details.
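
To give a feel for the parsing step, here is a tiny, lossy HTML-to-markdown converter built on the standard library's `html.parser`. It is not the parser the chain uses; `MiniMarkdown` and `html_to_markdown` are hypothetical names, and only headers, bold, italics, links, and paragraphs are handled.

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Keeps a little styling as markdown and drops everything else."""

    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}
    HEADERS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.HEADERS:
            self.out.append("\n\n" + self.HEADERS[tag])
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a":
            self.out.append("](%s)" % (self.href or ""))
            self.href = None

    def handle_data(self, data):
        # Text inside <script>/<style> is dropped; everything else is kept.
        if self.skip_depth == 0:
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

Even this crude conversion drops tags, attributes, and scripts, which is where the token savings come from, while headers, emphasis, and link targets survive for the reader chain.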