## Introduction

Let's explore a more advanced application of artificial intelligence - creating a question-answering (QA) chatbot that works over a set of documents and cites the sources of its answers. Our QA chatbot uses a chain (specifically, RetrievalQAWithSourcesChain) to sift through the documents and extract the information relevant to each query.

The chain sends a structured prompt to the underlying language model to generate a response. These prompts are designed to guide the model's generation, improving the quality and relevance of its responses. In addition, the retrieval chain keeps track of the sources of the information it fetches, so it can back up its answers with references. As we go, we will learn how to:

1. Scrape online articles and store the text content and URL of each article.
2. Split the article text into smaller chunks, keeping track of the source of each chunk.
3. Use an embedding model to compute embeddings for these chunks and store them in Deep Lake, a vector database.
4. Use RetrievalQAWithSourcesChain to create a chatbot that retrieves answers and tracks their sources.
5. Generate a response to a query using the chain and display the response along with its sources.

This knowledge lets you create chatbots that answer questions with sourced information, which makes their answers easier to verify and therefore more reliable and useful.

## Import Libs & Setup

Remember to install the required packages with the following command: pip install langchain==0.0.208 deeplake openai tiktoken. Additionally, install the newspaper3k package with version 0.2.8.

Then, you need to add your OpenAI and Deep Lake API keys to the environment variables. The LangChain library will read the tokens and use them in the integrations.

In [None]:
#| include: false
!pip install -q langchain==0.0.208 deeplake openai tiktoken python-dotenv newspaper3k

In [None]:
from dotenv import load_dotenv

!echo "OPENAI_API_KEY='<OPENAI_API_KEY>'" > .env
!echo "ACTIVELOOP_TOKEN='<ACTIVELOOP_TOKEN>'" >> .env

load_dotenv()

True
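
Before moving on, it can help to confirm that both keys are actually available to the process. The snippet below is a minimal sketch: the helper name missing_keys is hypothetical (it is not part of LangChain or Deep Lake), and it assumes the two variable names written to the .env file above. It also treats the angle-bracket placeholders as "not set".

```python
import os

# Variable names taken from the .env file written above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN")

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are absent, empty, or still placeholders.

    Hypothetical helper for illustration only.
    """
    missing = []
    for name in required:
        value = env.get(name, "")
        # A value such as "<OPENAI_API_KEY>" means the placeholder was never replaced.
        if not value or value.startswith("<"):
            missing.append(name)
    return missing

# os.environ behaves like a dictionary, so it can be passed in directly.
missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required keys are set.")
```

Running this right after load_dotenv() gives an early, readable warning instead of an authentication error later in the notebook.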

## Scraping the News

Now, let's begin by fetching some articles related to AI news. We're particularly interested in the text content of each article and the URL where it was published.

In the code, you’ll see the following:

- **Imports:** We begin by importing the necessary Python libraries. requests is used to send HTTP requests, newspaper is a tool for extracting and curating articles from a webpage, and time will help us introduce pauses during our web scraping task.
- **Headers:** Some websites block requests without a proper User-Agent header because they treat them as bot traffic. Here we define a User-Agent string to mimic a real browser's request.
- **Article URLs:** We have a list of URLs for online articles related to artificial intelligence news that we wish to scrape.
- **Web Scraping:** We create an HTTP session using requests.Session(), which allows us to make multiple requests within the same session. We also define an empty list, pages_content, to store our scraped articles.

In [None]:
import requests
from newspaper import Article # https://github.com/codelucas/newspaper
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_urls = [
    "https://www.artificialintelligence-news.com/2023/05/16/openai-ceo-ai-regulation-is-essential/",
    "https://www.artificialintelligence-news.com/2023/05/15/jay-migliaccio-ibm-watson-on-leveraging-ai-to-improve-productivity/",
    "https://www.artificialintelligence-news.com/2023/05/15/iurii-milovanov-softserve-how-ai-ml-is-helping-boost-innovation-and-personalisation/",
    "https://www.artificialintelligence-news.com/2023/05/11/ai-and-big-data-expo-north-america-begins-in-less-than-one-week/",
    "https://www.artificialintelligence-news.com/2023/05/02/ai-godfather-warns-dangers-and-quits-google/",
    "https://www.artificialintelligence-news.com/2023/04/28/palantir-demos-how-ai-can-used-military/"
]

session = requests.Session()
pages_content = [] # where we save the scraped articles

for url in article_urls:
    try:
        time.sleep(2) # sleep two seconds for gentle scraping
        response = session.get(url, headers=headers, timeout=10)

        if response.status_code == 200:
            article = Article(url)
            article.download() # download HTML of webpage
            article.parse() # parse HTML to extract the article text
            pages_content.append({ "url": url, "text": article.text })
        else:
            print(f"Failed to fetch article at {url}")
    except Exception as e:
        print(f"Error occurred while fetching article at {url}: {e}")

#If an error occurs while fetching an article, we catch the exception and print
#an error message. This ensures that even if one article fails to download,
#the rest of the articles can still be processed.
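
The error-handling pattern described in the comment above can be tried without touching the network. The sketch below is illustrative only: fetch_all and fake_fetch are hypothetical names, the example.com URLs are placeholders, and the stub simply simulates one failing download so you can see that the remaining URLs are still processed.

```python
def fetch_all(urls, fetch):
    """Call fetch(url) for every URL, collecting successes and failures separately."""
    pages, failures = [], []
    for url in urls:
        try:
            pages.append({"url": url, "text": fetch(url)})
        except Exception as e:
            # Record the problem and continue with the next URL.
            failures.append((url, str(e)))
    return pages, failures

def fake_fetch(url):
    # Stand-in for the real download: fails for one URL on purpose.
    if "broken" in url:
        raise ValueError("simulated download error")
    return f"text of {url}"

pages, failures = fetch_all(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/c"],
    fake_fetch,
)
print(len(pages), "fetched,", len(failures), "failed")  # 2 fetched, 1 failed
```

Keeping the failures in a list, rather than only printing them, makes it easy to retry or report them afterwards.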

Next, we will compute the embeddings of our documents using an embedding model and store them in Deep Lake, a multimodal vector database. OpenAIEmbeddings will be used to create vector representations of our documents. These embeddings are high-dimensional vectors that capture the semantic content of the documents. When we create an instance of the DeepLake class, we provide a path starting with hub://..., which specifies the name of the dataset that will be stored in the cloud.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_org_id = "<Your_Organization_Id>" # TODO: use your organization id here
my_activeloop_dataset_name = "langchain_course_qabot_with_source"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/ala/langchain_course_qabot_with_source

hub://ala/langchain_course_qabot_with_source loaded successfully.

This is an important part of the setup process as it prepares the system to store and retrieve documents based on their semantic content. This functionality is essential for the next steps where we will find the most relevant documents to answer the user's question.
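
To build some intuition for retrieval "based on semantic content", here is a toy sketch that ranks documents by cosine similarity to a query vector. Everything in it is made up for illustration: the three-dimensional vectors stand in for real embeddings (the ones stored above have 1536 dimensions), and cosine similarity is only one common choice of measure. The metric Deep Lake actually applies depends on how the store is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" with invented values.
documents = {
    "AI regulation hearing": [0.9, 0.1, 0.0],
    "Military AI demo": [0.1, 0.9, 0.1],
    "AI expo announcement": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a question about regulation

# Rank document names from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)
print(ranked[0])  # AI regulation hearing
```

A vector store does the same kind of ranking, only over many more vectors and with indexing to keep the search fast.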

Next, we'll split these articles into smaller sections, and for each section, we'll save its corresponding URL as the source. This division makes data processing efficient, makes retrieval tasks more manageable, and focuses on the most relevant passages of text when answering questions.

A RecursiveCharacterTextSplitter is created with a chunk size of 1000 characters and an overlap of 100 characters between chunks. The chunk_size parameter specifies the maximum length of each chunk of text, while chunk_overlap determines the number of characters shared by consecutive chunks. For each document in pages_content, the text is split into chunks using the .split_text() method.
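
To make the two parameters concrete, here is a deliberately simplified sliding-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that class tries to break on separators such as paragraph breaks and spaces before falling back to raw characters). The function name sliding_window_chunks is hypothetical and only shows what a size and an overlap mean.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size pieces where consecutive pieces share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for piece in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each piece begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk's neighbourhood.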

In [None]:
# We split the article texts into small chunks. While doing so, we keep track of each
# chunk metadata (i.e. the URL where it comes from). Each metadata is a dictionary and
# we need to use the "source" key for the document source so that we can then use the
# RetrievalQAWithSourcesChain class which will automatically retrieve the "source" item
# from the metadata dictionary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

all_texts, all_metadatas = [], []
for d in pages_content:
    chunks = text_splitter.split_text(d["text"])
    for chunk in chunks:
        all_texts.append(chunk)
        all_metadatas.append({ "source": d["url"] })

The "source" key is used in the metadata dictionary to match the expectations of the RetrievalQAWithSourcesChain class, which will automatically retrieve this "source" item from the metadata. We then add these chunks to the Deep Lake database along with their respective metadata.

In [None]:
# we add all the chunks to the deep lake, along with their metadata
db.add_texts(all_texts, all_metadatas)

Evaluating ingest: 100%|██████████| 1/1 [00:21<00:00

Dataset(path='hub://ala/langchain_course_qabot_with_source', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (49, 1536)  float32   None   
    ids      text     (49, 1)      str     None   
 metadata    json     (49, 1)      str     None   
   text      text     (49, 1)      str     None   


 

['a9ac1d0c-ffe6-11ed-8434-0242ac1c000c',
 'a9ac1ece-ffe6-11ed-8434-0242ac1c000c',
 'a9ac1faa-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2054-ffe6-11ed-8434-0242ac1c000c',
 'a9ac20f4-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2180-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2216-ffe6-11ed-8434-0242ac1c000c',
 'a9ac22a2-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2338-ffe6-11ed-8434-0242ac1c000c',
 'a9ac23ce-ffe6-11ed-8434-0242ac1c000c',
 'a9ac245a-ffe6-11ed-8434-0242ac1c000c',
 'a9ac24e6-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2572-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2608-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2694-ffe6-11ed-8434-0242ac1c000c',
 'a9ac272a-ffe6-11ed-8434-0242ac1c000c',
 'a9ac27c0-ffe6-11ed-8434-0242ac1c000c',
 'a9ac284c-ffe6-11ed-8434-0242ac1c000c',
 'a9ac28ce-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2964-ffe6-11ed-8434-0242ac1c000c',
 'a9ac29e6-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2a72-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2b08-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2b94-ffe6-11ed-8434-0242ac1c000c',
 'a9ac2c20-ffe6-

Now comes the fun part - building the QA Chatbot. We'll create a RetrievalQAWithSourcesChain chain that not only retrieves relevant document snippets to answer the questions but also keeps track of the sources of these documents.

## Setting up the Chain 

We then create an instance of RetrievalQAWithSourcesChain using the from_chain_type method. This method takes the following parameters:

- **llm:** This argument expects an instance of a language model (OpenAI's text-davinci-003, in this case), here created with a temperature of 0. The temperature controls the randomness of the model's outputs - a higher temperature results in more randomness, while a lower temperature makes the outputs more deterministic.
- **chain_type="stuff":** This defines how the retrieved documents are passed to the model. With "stuff", all retrieved chunks are inserted ("stuffed") into a single prompt, which is simple and works well as long as the chunks fit within the model's context window.
- **retriever=db.as_retriever():** This sets up the retriever that will fetch the relevant documents from the Deep Lake database. Here, the Deep Lake database instance db is converted into a retriever using its as_retriever method.
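
The effect of temperature can be illustrated with a small numerical sketch. The scores below are invented, and the function is a generic temperature-scaled softmax rather than OpenAI's actual sampling code. Note that the formula is undefined at exactly 0; a temperature of 0 is commonly treated as "always pick the highest-scoring token".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by a temperature greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature nearly all of the probability mass sits on the top-scoring token, while at the higher temperatures it spreads across the alternatives, which is why low values give more repeatable answers.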

In [None]:
# we create a RetrievalQAWithSourcesChain chain, which is very similar to a
# standard retrieval QA chain but it also keeps track of the sources of the
# retrieved documents

from langchain.chains import RetrievalQAWithSourcesChain
from langchain import OpenAI

llm = OpenAI(model_name="text-davinci-003", temperature=0)

chain = RetrievalQAWithSourcesChain.from_chain_type(llm=llm,
                                                    chain_type="stuff",
                                                    retriever=db.as_retriever())
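
Conceptually, the "stuff" strategy places every retrieved chunk into one prompt together with its source. The sketch below shows that idea only; build_stuff_prompt is a hypothetical helper, the instruction wording is invented, and the real template that LangChain sends to the model is different.

```python
def build_stuff_prompt(question, docs):
    """Concatenate every retrieved chunk, tagged with its source, into one prompt string."""
    parts = []
    for doc in docs:
        parts.append(f"Content: {doc['text']}\nSource: {doc['source']}")
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the extracts below and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for what the retriever would return.
retrieved = [
    {"text": "First retrieved chunk.", "source": "https://example.com/article-1"},
    {"text": "Second retrieved chunk.", "source": "https://example.com/article-2"},
]
print(build_stuff_prompt("What is discussed?", retrieved))
```

Because every chunk ends up in a single prompt, this approach needs only one model call, but the combined text has to fit within the model's context window.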

Lastly, we'll generate a response to a question using the chain. The response includes the answer and its corresponding sources.

In [None]:
# We generate a response to a query using the chain. The response object is a dictionary containing
# an "answer" field with the textual answer to the query, and a "sources" field containing a string made
# of the concatenation of the metadata["source"] strings of the retrieved documents.
d_response = chain({"question": "What does Geoffrey Hinton think about recent trends in AI?"})

print("Response:")
print(d_response["answer"])
print("Sources:")
for source in d_response["sources"].split(", "):
    print("- " + source)

Response:
 Geoffrey Hinton has expressed concerns about the potential dangers of AI, such as false text, images, and videos created by AI, and the impact of AI on the job market. He believes that AI has the potential to replace humans as the dominant species on Earth.

Sources:
- https://www.artificialintelligence-news.com/2023/05/02/ai-godfather-warns-dangers-and-quits-google/
- https://www.artificialintelligence-news.com/2023/05/15/iurii-milovanov-softserve-how-ai-ml-is-helping-boost-innovation-and-personalisation/
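
Since the "sources" field is a single string, it may be worth normalizing it before display. The helper below is a hypothetical sketch (parse_sources is not a LangChain function) that tolerates missing spaces after commas and drops repeated URLs. It assumes that the URLs themselves contain no commas.

```python
import re

def parse_sources(sources):
    """Split a comma-separated sources string into a de-duplicated list, preserving order."""
    seen, result = set(), []
    for item in re.split(r",\s*", sources.strip()):
        if item and item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Placeholder string with inconsistent spacing and one repeated URL.
raw = "https://example.com/a, https://example.com/b,https://example.com/a"
print(parse_sources(raw))  # ['https://example.com/a', 'https://example.com/b']
```

In the cell above, this could replace the direct call to split(", ") when iterating over d_response["sources"].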


That's it! You have now built a Q&A chatbot that can answer questions from a collection of documents and indicate where its information comes from.

## Conclusion

The chatbot was able to answer the question, giving a brief insight into Geoffrey Hinton's views on recent AI trends. The sources returned with the answer point back to the original articles, so a reader can check where each claim comes from. This adds an extra layer of reliability and traceability to the chatbot's responses. The presence of multiple sources also shows that the chatbot can draw on several documents when composing an answer, which illustrates how RetrievalQAWithSourcesChain combines retrieval with source tracking.

Further Reading:

[https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa)

[https://python.langchain.com/docs/integrations/vectorstores/activeloop_deeplake](https://python.langchain.com/docs/integrations/vectorstores/activeloop_deeplake)

[https://docs.activeloop.ai/quickstart](https://docs.activeloop.ai/quickstart)

## Acknowledgements

I'd like to express my thanks to the wonderful [LangChain & Vector Databases in Production Course](https://learn.activeloop.ai/courses/langchain) by Activeloop - which I completed, and acknowledge the use of some images and other materials from the course in this article.