# Research Chain

This is an experimental research chain that tries to answer "researchy" questions using information on the web.


For example, it can handle a prompt like this:

```
Compile information about Albert Einstein.
Ignore if it's a different Albert Einstein. 
Only include information you're certain about.

Include:
* education history
* major contributions
* names of spouse 
* date of birth
* place of birth
* a 3 sentence short biography

Format your answer in a bullet point format for each sub-question.
```

Or replace `Albert Einstein` with another person of interest (e.g., John Smith of Boston).


The chain is composed of the following components:

1. A searcher that searches for documents using a search engine.
    - The searcher is responsible for returning a list of URLs of documents that
      may be relevant to answering the question.
2. A downloader that downloads the documents.
3. An HTML to markdown parser (hard coded) that converts the HTML to markdown.
    * Conversion to markdown is lossy.
    * However, it can significantly reduce the token count of the document.
    * Markdown preserves some styling information
      (e.g., bold, italics, links, headers), which is expected to help the reader
      answer certain kinds of questions correctly.
4. A reader that reads the documents and produces an answer.
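
The four stages can be pictured as a simple function pipeline. The sketch below is illustrative only: `search`, `download`, `to_markdown`, `read`, and `research_pipeline` are hypothetical stand-ins (not the chain's actual API), and the "web" is a hard-coded dictionary so the example runs offline.

```python
from typing import Dict, List

# Hypothetical offline "web"; the real chain uses a search engine and a downloader.
FAKE_WEB: Dict[str, str] = {
    "https://example.com/a": "<h1>Ada Lovelace</h1><p>Born in <b>1815</b>.</p>",
    "https://example.com/b": "<h1>Unrelated</h1><p>Nothing to see here.</p>",
}


def search(question: str) -> List[str]:
    """Searcher stand-in: return candidate URLs (here, every known URL)."""
    return sorted(FAKE_WEB)


def download(urls: List[str]) -> List[str]:
    """Downloader stand-in: return the raw HTML for each URL."""
    return [FAKE_WEB[url] for url in urls]


def to_markdown(html: str) -> str:
    """Parser stand-in: a deliberately crude, lossy HTML-to-markdown conversion."""
    replacements = [
        ("<h1>", "# "), ("</h1>", "\n"),
        ("<b>", "**"), ("</b>", "**"),
        ("<p>", ""), ("</p>", "\n"),
    ]
    for old, new in replacements:
        html = html.replace(old, new)
    return html.strip()


def read(question: str, document: str) -> str:
    """Reader stand-in for an LLM: keep lines that mention a longer question word."""
    terms = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    hits = [
        line for line in document.splitlines()
        if any(term in line.lower() for term in terms)
    ]
    return "\n".join(hits)


def research_pipeline(question: str) -> List[Dict[str, str]]:
    """Single hop: question -> URLs -> HTML -> markdown -> per-document answers."""
    urls = search(question)
    pages = download(urls)
    documents = [to_markdown(page) for page in pages]
    return [
        {"source": url, "answer": read(question, doc)}
        for url, doc in zip(urls, documents)
    ]


results = research_pipeline("When was Ada Lovelace born?")
for result in results:
    print(result["source"], repr(result["answer"]))
```

Each stage only depends on the output of the previous one, which is why individual components (for example the downloader or the reader) can be swapped without touching the rest.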

## Limitations

* Quality of results depends on the LLM used, and can be improved by providing more specialized parsers (e.g., one that parses only the body of an article).
* If asking about people, provide enough information to disambiguate the person.
* The content downloader may get blocked (e.g., when attempting to download from LinkedIn). You may need to read the site's terms of service and set the user agent appropriately.
* The chain can be long-running. Use the initialization parameters to control how many options are explored, and prefer the async implementation, which uses more concurrency.
* This research chain only implements a single hop at the moment; i.e.,
  it goes from the question to a list of URLs to documents to compiling answers.
  Because the crawl does not continue, websites that require pagination will not be explored fully.
* The reader chain must match the type of question. For example, the QA refine chain
  isn't good at extracting a list of entries from a long document.
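
To illustrate why the async path helps with long-running chains, here is a minimal sketch of bounded concurrent downloading with `asyncio`. The `fetch` coroutine is a hypothetical placeholder that sleeps instead of performing network I/O, and `max_concurrency` is an assumed knob, not one of the chain's initialization parameters.

```python
import asyncio
from typing import List


async def fetch(url: str, delay: float = 0.05) -> str:
    """Hypothetical downloader: sleeps instead of doing real network I/O."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls: List[str], max_concurrency: int = 2) -> List[str]:
    """Fetch all URLs concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


urls = [f"https://example.com/page{i}" for i in range(4)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))
```

Inside a notebook, where an event loop is already running, `await fetch_all(urls)` would replace the `asyncio.run(...)` call. Capping the number of in-flight requests is also a simple way to be gentler on sites that might otherwise block the downloader.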
  
## Extending

* Continue crawling documents to discover more relevant pages that were not surfaced by the search engine.
* Adapt the reading strategy to the nature of the question.
* Analyze the query to determine whether it is a multi-hop query, and change the search/crawling strategy accordingly.
* Break components into tools that can be exposed to an agent. :)
* Add cheaper strategies for selecting which links should be explored further (e.g., based on tf-idf similarity instead of GPT-4).
* Add a summarization chain on top of the individually collected answers.
* Improve strategy to ignore irrelevant information.
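
As one possible shape for the cheaper link-selection idea, the sketch below ranks candidate links by tf-idf cosine similarity between the question and each link's anchor text, using only the standard library. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_links`) and the example links are hypothetical, and the smoothed IDF formula is an assumption rather than anything prescribed by the chain.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Build sparse tf-idf vectors using a smoothed IDF: log((1 + N) / (1 + df)) + 1."""
    tokenized = [tokenize(text) for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_links(question: str, links: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Rank (url, anchor_text) pairs by similarity to the question, best first."""
    vectors = tfidf_vectors([question] + [text for _, text in links])
    query_vec, link_vecs = vectors[0], vectors[1:]
    scored = [(url, cosine(query_vec, vec)) for (url, _), vec in zip(links, link_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


links = [
    ("https://example.com/cooking", "Ten easy pasta recipes"),
    ("https://example.com/relativity", "Einstein and the theory of relativity"),
    ("https://example.com/einstein-bio", "Albert Einstein biography and education"),
]
ranked = rank_links("Albert Einstein education history", links)
for url, score in ranked:
    print(f"{score:.3f}  {url}")
```

A lexical ranker like this needs no LLM calls, so it could serve as a first-pass filter, with a stronger model reserved for the few links that survive.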

## Requirements

Please install: 

* `playwright` for fetching content from the web (or use the `RequestsDownloadHandler`)
* `lxml` and `markdownify` for parsing HTML

In [1]:
from langchain.chains.research.api import Research
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.research.download import PlaywrightDownloadHandler
# If you don't have playwright installed, you can experiment with requests instead.
# Be aware that some web pages won't download properly because JavaScript won't be executed.
from langchain.chains.research.download import RequestsDownloadHandler 

In [3]:
question = """\
Compile information about Albert Einstein.
Ignore if it's a different Albert Einstein. 
Only include information you're certain about.

Include:
* education history
* major contributions
* names of spouse 
* date of birth
* place of birth
* a 3 sentence short biography

Format your answer in a bullet point format for each sub-question.
""".strip()

Instantiate the LLMs.

In [4]:
llm = OpenAI(
    temperature=0, model="text-davinci-003"
)  # Used for the readers and the query generator
selector_llm = ChatOpenAI(
    temperature=0, model="gpt-4"
)  # Used for selecting which links to explore

Create a chain that can be used to extract the answer to the question above from a given document.

This chain must be tailored to the task.

In [5]:
qa_chain = load_qa_chain(llm, chain_type="refine")
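
For intuition about why the chain type matters: a `refine` strategy feeds documents to the model one at a time and revises a running answer, whereas a `stuff` strategy places all documents into a single prompt. The sketch below mimics both control flows with a toy `fake_llm` function (hypothetical; it only records the prompts it receives). It is not LangChain's implementation, and the prompt wording is made up.

```python
from typing import Callable, List


def stuff(llm: Callable[[str], str], question: str, docs: List[str]) -> str:
    """One call: all documents are concatenated into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine(llm: Callable[[str], str], question: str, docs: List[str]) -> str:
    """One call per document: each call revises the previous answer."""
    answer = ""
    for doc in docs:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {doc}\nQuestion: {question}"
        )
    return answer


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Toy model: records the prompt and returns a numbered placeholder answer."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


docs = ["doc one", "doc two", "doc three"]
stuff_answer = stuff(fake_llm, "q", docs)    # a single model call
refine_answer = refine(fake_llm, "q", docs)  # one model call per document
print(len(calls), stuff_answer, refine_answer)
```

Because `refine` only ever sees one document plus the previous answer, details from earlier documents survive only if each revision carries them forward.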

In [6]:
research = Research.from_llms(
    query_generation_llm=llm,
    link_selection_llm=selector_llm,
    underlying_reader_chain=qa_chain,
    top_k_per_search=1,
    max_num_pages_per_doc=3,
    download_handler=PlaywrightDownloadHandler(),
)

In [7]:
results = await research.acall(inputs={"question": question})

In [8]:
for doc in results["docs"]:
    print("--" * 80)
    print(doc.metadata["source"])
    print(doc.page_content)

----------------------------------------------------------------------------------------------------------------------------------------------------------------
https://en.wikipedia.org/wiki/Albert_Einstein


Albert Einstein:
* Education history: Attended elementary school in Munich, Germany, and later attended the Swiss Federal Polytechnic School in Zurich, Switzerland.
* Major contributions: Developed the theory of relativity, made major contributions to quantum theory, and won the Nobel Prize in Physics in 1921. He also published more than 300 scientific papers and 150 non-scientific works. He was also the first to propose the existence of black holes and gravitational waves. He was also a polyglot, speaking over 15 languages, including Afrikaans, Alemannisch, Amharic, Anarâškielâ, Angika, Old English, Abkhazian, Arabic, Aragonese, Western Armenian, Aromanian, Arpitan, Assamese, Asturian, Guarani, Aymara, Azerbaijani, South Azerbaijani, Balinese, Bambara, Bangla, Min Nan Chinese, Ba

If useful, we can produce another summary on top of the collected per-document answers, this time with a `stuff` chain:

In [9]:
qa_chain = load_qa_chain(llm, chain_type="stuff")

In [10]:
summary = await qa_chain.acall(
    inputs={"input_documents": results["docs"], "question": question}
)

In [11]:
print(summary["output_text"])



Education History:
* Attended Aargau Cantonal School in Aarau, Switzerland from 1895-1896
* Attended ETH Zurich (Swiss Federal Institute of Technology) from 1896-1900
* Obtained his doctorate degree from Swiss Federal Polytechnic School in Zurich in 1901

Major Contributions:
* Developed the theory of relativity
* Developed the mass-energy equivalence formula (E=mc2)
* Developed the law of the photoelectric effect
* Postulated that the correct interpretation of the special theory of relativity must also furnish a theory of gravitation
* Contributed to the problems of the theory of radiation and statistical mechanics
* Investigated the thermal properties of light with a low radiation density and his observations laid the foundation of the photon theory of light
* Contributed to statistical mechanics by his development of the quantum theory of a monatomic gas
* Worked towards the unification of the basic concepts of physics, taking the opposite approach, geometrisation, to the majority

## Under the hood

A searcher is invoked first to find URLs that are worth exploring:

In [12]:
suggestions = await research.searcher.acall(inputs={"question": question})
suggestions

{'question': "Compile information about Albert Einstein.\nIgnore if it's a different Albert Einstein. \nOnly include information you're certain about.\n\nInclude:\n* education history\n* major contributions\n* names of spouse \n* date of birth\n* place of birth\n* a 3 sentence short biography\n\nFormat your answer in a bullet point format for each sub-question.",
 'urls': ['https://en.wikipedia.org/wiki/Albert_Einstein',
  'https://www.britannica.com/biography/Albert-Einstein',
  'https://www.advergize.com/edu/7-albert-einstein-inventions-contributions/',
  'https://www.nobelprize.org/prizes/physics/1921/einstein/biographical/']}

The web pages are then downloaded:

In [13]:
blobs = await research.downloader.adownload(suggestions["urls"])

The blobs are then parsed with an HTML parser and read by the reader chain (not shown); see the underlying code for details.
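
To make the parsing step more concrete, here is a minimal sketch of a lossy HTML-to-markdown conversion built on Python's standard-library `html.parser`. It is a stand-in for illustration, not the chain's parser (which, per the requirements, relies on `lxml` and `markdownify`). `TinyMarkdownParser` and `html_to_markdown` are hypothetical names, and only headers, paragraphs, bold, italics, and links are handled.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class TinyMarkdownParser(HTMLParser):
    """Converts a small subset of HTML to markdown; everything else is dropped."""

    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}
    HEADERS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style"}  # content of these tags is discarded

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self.skip_depth = 0
        self.href_stack: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag in self.HEADERS:
            self.parts.append("\n\n" + self.HEADERS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag: str) -> None:
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a" and self.href_stack:
            self.parts.append(f"]({self.href_stack.pop()})")

    def handle_data(self, data: str) -> None:
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_markdown(html: str) -> str:
    """Run the parser over the HTML and join the collected markdown fragments."""
    parser = TinyMarkdownParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Albert Einstein</h1>"
    '<p>A <b>physicist</b> born in <a href="https://example.com/ulm">Ulm</a>.</p>'
    "<script>track()</script></body></html>"
)
markdown = html_to_markdown(html)
print(markdown)
print(len(html), "->", len(markdown), "characters")
```

The output keeps the header, emphasis, and link while discarding styling and scripts, which is the trade-off described in the component list: a lossy conversion that is considerably shorter than the original HTML.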