# QA Chatbot over Documents with Sources

* [1. Setup](#setup)
* [2. Scraping for the News](#scrapping)
* [3. Saving Embeddings](#saving)
* [4. Setting up the Chain (RetrievalQAWithSourcesChain)](#chain)
* [5. Run QA](#run)
* [6. Additional Resources](#resources)

<hr>
<a class="anchor" id="setup">
    
## 1. Setup
    
</a>

In [1]:
!pip install -q newspaper3k==0.2.8 python-dotenv

In [2]:
import os
from keys import OPENAI_API_KEY, ACTIVELOOP_TOKEN

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN
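
The cell above copies both credentials from a local `keys` module into the environment. As an optional safeguard, a small check can fail early when a credential is missing instead of failing later inside an API call. The helper name `require_env` below is hypothetical and is not part of the original notebook; this is a minimal sketch using only the standard library.

```python
import os


def require_env(*names):
    """Return the values of the named environment variables.

    Raises RuntimeError if any of them is unset or empty, so a missing
    credential is reported before any network call is made.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]
```

Calling `require_env("OPENAI_API_KEY", "ACTIVELOOP_TOKEN")` right after the assignments would confirm that both values are present and non-empty.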

<hr>
<a class="anchor" id="scrapping">
    
## 2. Scraping for the News
    
</a>

In [3]:
# Imports
import requests # to send HTTP requests
from newspaper import Article # https://github.com/codelucas/newspaper
import time # to introduce pauses during the web scraping 


# To avoid blocking (if any) of requests without a proper User-Agent header 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_urls = [
    "https://www.artificialintelligence-news.com/2023/05/16/openai-ceo-ai-regulation-is-essential/",
    "https://www.artificialintelligence-news.com/2023/05/15/jay-migliaccio-ibm-watson-on-leveraging-ai-to-improve-productivity/",
    "https://www.artificialintelligence-news.com/2023/05/15/iurii-milovanov-softserve-how-ai-ml-is-helping-boost-innovation-and-personalisation/",
    "https://www.artificialintelligence-news.com/2023/05/11/ai-and-big-data-expo-north-america-begins-in-less-than-one-week/",
    "https://www.artificialintelligence-news.com/2023/05/02/ai-godfather-warns-dangers-and-quits-google/",
    "https://www.artificialintelligence-news.com/2023/04/28/palantir-demos-how-ai-can-used-military/"
]

In [4]:
session = requests.Session() # to make multiple requests within the same session
pages_content = [] # to store the scraped articles

for url in article_urls:
    try:
        time.sleep(2) # sleep two seconds for gentle scraping
        response = session.get(url, headers=headers, timeout=10)

        if response.status_code == 200:
            article = Article(url)
            article.download(input_html=response.text) # reuse the HTML already fetched instead of downloading it again
            article.parse() # parse HTML to extract the article text
            pages_content.append({ "url": url, "text": article.text })
        else:
            print(f"Failed to fetch article at {url} (status code {response.status_code})")
    except Exception as e:
        print(f"Error occurred while fetching article at {url}: {e}")

# If an error occurs while fetching an article, the exception is caught and an
# error message is printed, so one failed download does not stop the remaining
# articles from being processed.
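
The loop above reports a failed download and moves on. If transient network errors are a concern, one possible extension is to retry a few times with a growing pause before giving up. The sketch below is an assumption, not part of the original notebook: `fetch_with_retries` is a hypothetical helper, and `fetch` stands for any callable that takes a URL and either returns a result or raises.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times.

    After each failure except the last, pause for base_delay * 2**attempt
    seconds (1s, 2s, 4s, ... with the defaults). If every attempt fails,
    re-raise the last exception.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose, like the scraping loop
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists only so the pause can be swapped out, for example in tests. In the scraping loop, `fetch` could be a small function that wraps `session.get` and raises on a non-200 status code.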

<hr>
<a class="anchor" id="saving">
    
## 3. Saving Embeddings
    
</a>

In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "qa_with_source"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding=embeddings)


Your Deep Lake dataset has been successfully created!




In [6]:
# Split the article texts into small chunks, keeping track of each chunk's
# metadata (i.e. the URL it comes from).

# Each metadata entry is a dictionary. The document source must be stored under
# the "source" key, because the RetrievalQAWithSourcesChain class automatically
# reads that key from the metadata dictionary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

all_texts, all_metadatas = [], []
for d in pages_content:
    chunks = text_splitter.split_text(d["text"])
    for chunk in chunks:
        all_texts.append(chunk)
        all_metadatas.append({ "source": d["url"] })

**Note:** The metadata dictionary uses the `source` key because `RetrievalQAWithSourcesChain` expects it and automatically reads that item from each chunk's metadata.
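
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts text into fixed-width character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph breaks, newlines, and spaces, so its chunks will differ. The function names `split_with_overlap` and `chunk_pages` are hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_pages(pages, **splitter_kwargs):
    """Pair every chunk with a {"source": url} metadata dictionary."""
    texts, metadatas = [], []
    for page in pages:
        for chunk in split_with_overlap(page["text"], **splitter_kwargs):
            texts.append(chunk)
            metadatas.append({"source": page["url"]})
    return texts, metadatas
```

The two returned lists stay index-aligned, which is the shape that the `all_texts` and `all_metadatas` lists in the cell above also have.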

In [7]:
# Add the chunks to the Deep Lake dataset, along with their metadata
db.add_texts(all_texts, all_metadatas)

Dataset(path='hub://iryna/qa_with_source', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
 embedding  embedding  (49, 1536)  float32   None   
    id        text      (49, 1)      str     None   
 metadata     json      (49, 1)      str     None   
   text       text      (49, 1)      str     None   


 

['fb7645a4-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb7646c6-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb76472a-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764784-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb7647ca-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764810-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764860-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb7648a6-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb7648ec-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb76493c-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764982-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb7649c8-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764a0e-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764a5e-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764aa4-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764aea-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764b30-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764b76-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764bc6-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764c0c-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764c52-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764c98-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764cde-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764d2e-3fc9-11ee-aa24-12ee7aa5dbdc',
 'fb764d74-3fc9-

<hr>
<a class="anchor" id="chain">
    
## 4. Setting up the Chain (RetrievalQAWithSourcesChain)
    
</a>

In [8]:
# Create a "RetrievalQAWithSourcesChain" chain, which is very similar to a
# standard retrieval QA chain but it also keeps track of the sources of the retrieved documents

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003", temperature=0)

chain = RetrievalQAWithSourcesChain.from_chain_type(llm=llm,
                                                    chain_type="stuff",
                                                    retriever=db.as_retriever())
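
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt that is sent to the LLM in one call. The sketch below illustrates that idea in plain Python. The prompt wording and the function name `build_stuff_prompt` are made up for illustration; LangChain's actual prompt template is different.

```python
def build_stuff_prompt(question, docs):
    """Concatenate every retrieved chunk, tagged with its source, into one prompt.

    docs is a list of dictionaries with "text" and "source" keys.
    """
    parts = []
    for doc in docs:
        parts.append(f"Content: {doc['text']}\nSource: {doc['source']}")
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the extracts below "
        "and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the chunks are kept small.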

<hr>
<a class="anchor" id="run">
    
## 5. Run QA
    
</a>

In [9]:
# Generate a response to a query using the chain. 
response_dict = chain({"question": "What does Geoffrey Hinton think about recent trends in AI?"})
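
Before the LLM is called, the retriever looks up the stored chunks whose embeddings are closest to the embedding of the question. The toy sketch below shows that lookup on tiny hand-written vectors, assuming cosine similarity as the measure. The metric and index that Deep Lake actually uses may differ, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (score, metadata) pairs most similar to query_vector.

    stored is a list of (vector, metadata) pairs; k=4 is an illustrative default.
    """
    scored = [(cosine_similarity(query_vector, vector), metadata)
              for vector, metadata in stored]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest score first
    return scored[:k]
```

In the real pipeline the vectors have 1536 dimensions, as the dataset summary above shows, and the query vector comes from embedding the question with the same model used for the chunks.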

In [10]:
# The response object is a dictionary containing:
# - an "answer" field with the textual answer to the query
# - a "sources" field with a single string that concatenates the metadata["source"] values
print("Response:")
print(response_dict["answer"])

print("Sources:")
for source in response_dict["sources"].split(", "):
    print("- " + source)

Response:
 Geoffrey Hinton believes that the rapid development of generative AI products is "racing towards danger" and that false text, images, and videos created by AI could lead to a situation where average people "would not be able to know what is true anymore." He also expressed concerns about the impact of AI on the job market, as machines could eventually replace roles such as paralegals, personal assistants, and translators.

Sources:
- https://www.artificialintelligence-news.com/2023/05/02/ai-godfather-warns-dangers-and-quits-google/
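
Splitting on `", "` works for the output above, but an empty `sources` string would produce one empty entry, and irregular spacing or repeated URLs would pass through unchanged. A slightly more defensive parser could look like the sketch below. The function name `parse_sources` is hypothetical, and the sketch assumes that the URLs themselves contain no commas.

```python
def parse_sources(sources):
    """Split a comma-separated sources string into a list.

    Surrounding whitespace is stripped, empty entries are dropped, and
    duplicates are removed while keeping first-seen order.
    """
    seen = set()
    result = []
    for item in sources.split(","):
        item = item.strip()
        if item and item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

With this helper, the printing loop would iterate over `parse_sources(response_dict["sources"])` instead of the raw split.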


<hr>
<a class="anchor" id="resources">
    
## 6. Additional Resources
    
</a>

- [QA using a LangChain Retriever](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa)
- [Activeloop's Deep Lake](https://python.langchain.com/docs/integrations/vectorstores/activeloop_deeplake)
- [Vector Store Quickstart](https://docs.activeloop.ai/quickstart)