# Simple Perplexity AI clone


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mklarqvist/ai-playground/blob/main/perplexity-clone/perplexity_clone.ipynb)

Perplexity AI ([perplexity.ai](https://perplexity.ai)) is a chat tool that combines foundation language models, such as GPT-4 from OpenAI, with current information from the internet. It provides not only answers but also references to the sources that contributed to those answers. This simple yet powerful approach addresses the limitation of potentially outdated training data. Because the sources used to produce an answer are returned, you can verify its accuracy, which helps combat the issue of language models generating incorrect answers.

This may sound like a major project and a serious undertaking, but modern tools have made it surprisingly easy.

The workflow can be described as follows:

1. The user poses a question.
2. A Google search is performed using the question.
3. The top-k search results, or the most relevant webpages, are downloaded.
4. Raw HTML data is transformed into a usable format by LangChain.
5. All documents are split into 1,000 character chunks.
6. Embeddings are computed for each document chunk and stored in a vector store (chromadb).
7. A prompt is built with LangChain from the user's question in step 1 and the most relevant chunks of scraped web data.
8. An OpenAI model is queried to generate an answer.
9. The documents that contributed to the answer are identified and returned as references.

## Querying Google and scraping websites

First we need to install the required dependencies.

In [None]:
!pip3 install -U readabilipy langchain openai bs4 requests chromadb tiktoken session-info

In [1]:
import requests # Required to make HTTP requests
from bs4 import BeautifulSoup # Required to parse HTML
import numpy as np # Required to dedupe sites
from urllib.parse import unquote # Required to unquote URLs

In [2]:
query = 'history of the human genome project' # The query to search Google for and ask the AI about

In [3]:
response = requests.get("https://www.google.com/search", params={"q": query}) # Make the request; requests URL-encodes the query
soup = BeautifulSoup(response.text, "html.parser") # Parse the HTML
links = soup.find_all("a") # Find all the links in the HTML

In [4]:
# loop over `links` and keep only the ones whose href starts with "/url?q="
urls = []
for l in [link for link in links if link.get("href", "").startswith("/url?q=")]: # .get avoids a KeyError on anchors without an href
    # get the url
    url = l["href"]
    # remove the "/url?q=" part
    url = url.replace("/url?q=", "")
    # remove the part after the "&sa=..."
    url = unquote(url.split("&sa=")[0])
    # special case for google scholar
    if url.startswith("https://scholar.google.com/scholar_url?url=http"):
        url = url.replace("https://scholar.google.com/scholar_url?url=", "").split("&")[0]
    elif 'google.com/' in url: # skip google links
        continue
    if url.endswith('.pdf'): # skip pdf links
        continue
    if '#' in url: # remove anchors (e.g. wikipedia.com/bob#history and wikipedia.com/bob#genetics are the same page)
        url = url.split('#')[0]
    # keep the url
    urls.append(url)

# Use numpy to dedupe the list of urls after removing anchors (np.unique also sorts them)
urls = list(np.unique(urls))
urls

['https://ellipse.prbb.org/the-human-genome-20-years-of-history/',
 'https://en.wikipedia.org/wiki/Human_Genome_Project',
 'https://europepmc.org/article/med/8489331',
 'https://genomemedicine.biomedcentral.com/articles/10.1186/gm483',
 'https://plato.stanford.edu/entries/human-genome/',
 'https://pubmed.ncbi.nlm.nih.gov/1991595/',
 'https://web.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml',
 'https://www.britannica.com/event/Human-Genome-Project',
 'https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project',
 'https://www.genome.gov/about-nhgri/Brief-History-Timeline',
 'https://www.genome.gov/human-genome-project',
 'https://www.genome.gov/human-genome-project/20th-anniversary-of-landmark-human-genome-project-publications',
 'https://www.genome.gov/leadership-initiatives/History-of-Genomics-Program',
 'https://www.jstor.org/stable/3083473',
 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1592577/']
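
The URL clean-up above can also be packaged as a small, testable function. The following is a minimal sketch, not the notebook's own code: `clean_google_href` is a hypothetical helper name, it uses `urllib.parse` instead of string replacement, and it leaves out the Google Scholar special case for brevity.

```python
from typing import Optional
from urllib.parse import parse_qs, urldefrag, urlsplit


def clean_google_href(href: str) -> Optional[str]:
    """Return the target URL wrapped in a Google '/url?q=...' href, or None to skip it."""
    if not href.startswith("/url?"):
        return None
    # parse_qs decodes percent-escapes and separates 'q' from '&sa=...' and similar parameters
    targets = parse_qs(urlsplit(href).query).get("q")
    if not targets:
        return None
    url = targets[0]
    if "google.com/" in url:  # skip links that point back to Google
        return None
    url, _fragment = urldefrag(url)  # drop any '#anchor' so the same page is not listed twice
    if url.lower().endswith(".pdf"):  # skip PDF links
        return None
    return url


example = "/url?q=https://en.wikipedia.org/wiki/Human_Genome_Project%23History&sa=U&ved=abc"
print(clean_google_href(example))
# https://en.wikipedia.org/wiki/Human_Genome_Project
```

Assuming `links` from the cell above, it could be applied as `[u for u in (clean_google_href(a.get("href", "")) for a in links) if u]`.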

In [5]:
from readabilipy import simple_json_from_html_string # Required to parse HTML to pure text
from langchain.schema import Document # Required to create a Document object

In [6]:
def scrape_and_parse(url: str) -> Document:
    """Scrape a webpage and parse it into a Document object"""
    req = requests.get(url)
    article = simple_json_from_html_string(req.text, use_readability=True)
    # The following line seems to work with the package versions on my local machine, but not on Google Colab
    # return Document(page_content=article['plain_text'][0]['text'], metadata={'source': url, 'page_title': article['title']})
    return Document(page_content='\n\n'.join([a['text'] for a in article['plain_text']]), metadata={'source': url, 'page_title': article['title']})

In [7]:
# It's possible to optimize this by using asyncio, but it's not necessary for this example
documents = [scrape_and_parse(f) for f in urls] # Scrape and parse all the urls
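
If sequential scraping becomes a bottleneck, the downloads can run concurrently. The sketch below uses a thread pool from the standard library rather than asyncio, since `requests` is a blocking client. `scrape_all` and `fake_scrape` are hypothetical names; `fake_scrape` only stands in for `scrape_and_parse` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def scrape_all(urls: List[str], scrape: Callable[[str], T], max_workers: int = 8) -> List[T]:
    """Apply `scrape` to every URL using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `urls`, not in completion order
        return list(pool.map(scrape, urls))


# Stand-in for scrape_and_parse so the sketch runs offline (assumption for illustration)
def fake_scrape(url: str) -> str:
    return f"contents of {url}"


pages = scrape_all(["https://a.example", "https://b.example"], fake_scrape)
print(pages)
```

In the notebook this would presumably be called as `documents = scrape_all(urls, scrape_and_parse)`. Note that an exception raised for any single URL propagates to the caller, so a production version would likely want per-URL error handling.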

In [8]:
documents

[Document(page_content='For humans, the history of DNA began just over 150 years ago when Friedrich Miescher accidentally discovered nuclein, a phosphate-rich molecule that did not behave like a protein and that, over the years, would end up called DNA. But in this short story there are two other great milestones. The first, when almost seventy years ago Rosalind Franklin, and later James Watson and Francis Crick, discovered through X-ray images, that DNA was structured as a double helix. The second milestone came just 20 years ago, when two teams of scientists from different countries published the first sequence of the human genome. The project to sequence the human genome was extremely long and expensive. But the nearly $ 3 billion invested were justified by the possibility to better understand the diseases of genetic origin, which at that time could be counted with the fingers of one hand. In the words of Roderic Guigó, head of the computational biology group of RNA processing, coo

## Splitting documents into chunks

As most webpages (documents) are too long to feed directly to our language model, we must split the data into smaller pieces, or chunks. Here we divide the text into blocks of roughly 1,000 characters each, with a 200-character overlap between consecutive chunks. For instance, chunk 1 spans characters 0 to 1000, chunk 2 spans 800 to 1800, and so on.
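
To make the overlap arithmetic concrete, here is a simplified sketch that computes the chunk boundaries for a text of a given length. `chunk_spans` is a hypothetical helper, not part of LangChain; the real `CharacterTextSplitter` splits on the separator and merges pieces, so its boundaries only approximate these offsets.

```python
from typing import List, Tuple


def chunk_spans(text_length: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for fixed-size overlapping chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # each chunk starts this far after the previous one
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:  # the last chunk reached the end of the text
            break
        start += stride
    return spans


print(chunk_spans(2500))
# [(0, 1000), (800, 1800), (1600, 2500)]
```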

In [9]:
from langchain.text_splitter import CharacterTextSplitter

In [10]:
text_splitter = CharacterTextSplitter(separator=' ', chunk_size=1000, chunk_overlap=200)

In [11]:
texts = text_splitter.split_documents(documents)

In [12]:
len(texts)

400

## Computing embeddings of chunks and storage in a vector store

We can use a language model to compute embeddings for our input text (chunks). If you're unfamiliar with the concept of embeddings, imagine a pile of colored M&Ms in the middle of your table. Large foundational models have learned the general properties of M&Ms and will move them to different locations on the table based on their color. So you will end up with green M&Ms in one corner, blue in another, and so on. Now imagine we are given a cyan M&M. We've never seen that color before. But we know that cyan is a mix of green and blue, so it can be placed somewhere close to those colors on the table. We can then measure how far away the new cyan M&M is from each other M&M on the table. The closest one will be the most similar and the furthest one will be the most dissimilar.

Instead of M&Ms, we have chunks of documents. And instead of 2 dimensions, like our flat table surface, we have 1,000+ dimensions.
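
The table analogy can be written down in a few lines. The sketch below uses made-up 2-D coordinates for each color (pure assumptions for illustration, not real embeddings) and ranks the known colors by their Euclidean distance to a new cyan point.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query: Sequence[float], items: Dict[str, Sequence[float]]) -> List[str]:
    """Return item names ordered from closest to furthest from `query`."""
    return sorted(items, key=lambda name: euclidean(query, items[name]))


# Hypothetical positions on the table; real embeddings have 1,000+ dimensions
table = {"green": (0.0, 1.0), "blue": (1.0, 0.0), "red": (5.0, 5.0)}
cyan = (0.6, 0.5)

print(rank_by_distance(cyan, table))
# ['blue', 'green', 'red']
```

A vector store performs the same kind of ranking, only over many more points and dimensions, and typically with an index so that it does not have to compare against every stored vector.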

In [13]:
from langchain.embeddings.openai import OpenAIEmbeddings

In [14]:
# Get your own API key at https://platform.openai.com/account/api-keys
OPENAI_API_KEY = 'REPLACE_WITH_YOUR_KEY' # Replace with your own OpenAI API key

In [15]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [16]:
from langchain.vectorstores import Chroma

In [17]:
docsearch = Chroma.from_documents(texts, embeddings)

Using embedded DuckDB without persistence: data will be transient


## Configuring which model to use and asking questions

We can now pick which model to use and start asking questions!

In [18]:
from langchain.llms import OpenAIChat # Required to create a Language Model

In [19]:
# Pick an OpenAI model
llm = OpenAIChat(model_name='gpt-3.5-turbo', openai_api_key=OPENAI_API_KEY)



In [20]:
from langchain import VectorDBQA # Required to create a Question-Answer object using a vector store

In [21]:
import pprint # Required to pretty print the results

In [22]:
# Stuff all the information into a single prompt (see https://docs.langchain.com/docs/components/chains/index_related_chains#stuffing)
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "Who were the main players in the race to complete the human genome? And what were their approaches? Give as much detail as possible."
result = qa({"query": query})
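
Conceptually, the "stuff" chain type places the retrieved chunks into a single prompt ahead of the question. The sketch below illustrates that idea only; `build_stuff_prompt` is a hypothetical function and the template wording is an assumption, not LangChain's actual prompt.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    # Template wording is illustrative; the library's own template may differ
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Who led Celera?",
    ["Celera was a privately funded company.", "It was led by J. Craig Venter."],
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason the documents are split into small chunks first.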



In [23]:
pprint.pprint(result)

{'query': 'Who were the main players in the race to complete the human genome? '
          'And what were their approaches? Give as much detail as possible.',
 'result': 'The main players in the race to complete the human genome were the '
           'publicly funded Human Genome Project (HGP) and the privately '
           'funded Celera Corporation, led by J. Craig Venter. Their '
           'approaches differed in that the HGP was a large, collaborative '
           'international effort, while Celera focused on creating a '
           'proprietary database using advanced sequencing technology. The '
           'competition arose from the prospect of gaining control over '
           'potential patents on the genome sequence, which was considered '
           'valuable. However, the rivalry ended when Celera and the HGP '
           'joined forces, thus speeding completion of the rough draft '
           'sequence of the human genome. Collaborative efforts continued for '
          

In [24]:
[a.metadata['source'] for a in result['source_documents']] # Print the source documents

['https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project',
 'https://www.britannica.com/event/Human-Genome-Project',
 'https://web.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml',
 'https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project']
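
Several retrieved chunks can come from the same page, so the same URL may appear more than once, as in the output above. A minimal sketch for listing each reference once while keeping the retrieval order (`unique_in_order` is a hypothetical helper name):

```python
from typing import Iterable, List


def unique_in_order(sources: Iterable[str]) -> List[str]:
    """Drop repeated sources while keeping the first-seen order."""
    # dict keys are unique and preserve insertion order in Python 3.7+
    return list(dict.fromkeys(sources))


sources = [
    "https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project",
    "https://www.britannica.com/event/Human-Genome-Project",
    "https://web.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml",
    "https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project",
]
print(unique_in_order(sources))
```

With the notebook's variables this would be `unique_in_order(a.metadata['source'] for a in result['source_documents'])`.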

In [25]:
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "How were the donor participants recruited for the human genome project? Summarize in three sentences."
result = qa({"query": query})

In [26]:
pprint.pprint(result)

{'query': 'How were the donor participants recruited for the human genome '
          'project? Summarize in three sentences.',
 'result': 'The International Human Genome Sequencing Consortium collected '
           'blood or sperm samples from many donors, with their identities '
           'protected to maintain anonymity. Only a few samples were used for '
           'DNA resources, and most of the sequence generated by the public '
           'HGP came from a single anonymous male donor from Buffalo, New '
           'York. Volunteers were recruited through a process of informed '
           'consent, with a 1997 newspaper advertisement from Buffalo seeking '
           'participants.',
 'source_documents': [Document(page_content='of the joint publications, press releases announced that the project had been completed by both groups. Improved drafts were announced in 2003 and 2005, filling in to approximately 92% of the sequence currently. Genome donors[edit] In the International Hu

In [27]:
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "What happened to Craig Venter's company following the completion of the human genome project? Give as much detail as possible."
result = qa({"query": query})

In [28]:
pprint.pprint(result)

{'query': "What happened to Craig Venter's company following the completion of "
          'the human genome project? Give as much detail as possible.',
 'result': 'Following the completion of the human genome project, Craig '
           "Venter's company, Celera Genomics, faced a decision on what type "
           'of company it would become. It added sequences from three '
           'different mouse strains to its database and briefly ventured into '
           'proteomics. However, Venter resigned as CEO in January 2002, and '
           'the company decided to focus on drug discovery rather than '
           'information. Despite being timed to coincide with the celebrations '
           'of the 50th anniversary of the Watson-Crick discovery of the '
           'double-helical structure of DNA, there was less fanfare '
           'surrounding the official date of completion of the HGP in April '
           '2003. Celera remained a threat, as the validity of the WGS '
           's

In [29]:
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "How come the project finished earlier then expected? Give as much detail as possible."
result = qa({"query": query})

In [30]:
pprint.pprint(result)

{'query': 'How come the project finished earlier then expected? Give as much '
          'detail as possible.',
 'result': 'The project finished earlier than expected due to a deliberate '
           'focus on technology development, improved sequencing technologies, '
           'and a change in approach to the finishing process. The original '
           "completion date was set for 2005, but the project's goals and "
           'related strategic plans were updated periodically throughout the '
           'project. The final completion date was moved forward to 2003 with '
           'a plan for a "working draft" of the human genome sequence by '
           'December 2001. The project ended up costing less than expected, at '
           "around $2.7 billion. Many of the project's achievements were "
           'beyond what scientists thought possible in 1988.',
 'source_documents': [Document(page_content='original goals for the Human Genome Project in 1988, which included sequencing

In [31]:
import session_info
session_info.show()