# Build an Article ChatBot

You can use any LLM.

Build a bot that will take a query and answer from blogs on the internet.
1. Take a query as input.
2. Get relevant articles from the internet.
3. Find the correct answer from the article contents.
4. Reply with the answer and link of the article.

Please use Google Colab to complete this assignment. \
If you face any problems with Colab, you can write a Python script and share that.

# Solution




The solution uses the **Google Search API**, **LangChain** and **RAG** (Retrieval-Augmented Generation).

- Relevant articles are gathered from the internet using the Google Search API. The user's query is sent with " inurl:blog" appended to the end of the string, which restricts the search results to pages whose URL contains "blog".

- Beautiful Soup is used to scrape the content from the web pages at each of these links.

(Please note: GPT-3.5 is used because it has a knowledge cutoff of September 2021 ([Source](https://help.openai.com/en/articles/8555514-gpt-3-5-turbo-updates)). Asking questions whose answers were not known until after this date ensures that the answers come entirely from the internet blogs and that the model does not "cheat".)

<center><div>
<img src="https://www.researchgate.net/publication/371582328/figure/fig4/AS:11431281168418756@1686961053226/Google-Cloud-Platform-logo.ppm" width="250" height="auto">
<img src="https://datascientest.com/en/wp-content/uploads/sites/9/2024/01/beautiful-soup.png" width="250" height="auto">
<img src="https://deepsense.ai/wp-content/uploads/2023/10/LangChain-announces-partnership-with-deepsense.jpeg" width="250" height="auto" style="margin-right: 20px;">
<img src="https://static.vecteezy.com/system/resources/previews/021/059/825/original/chatgpt-logo-chat-gpt-icon-on-green-background-free-vector.jpg" width="100" height="auto">
</div></center>



# Imports

In [1]:
%pip install openai
%pip install unstructured
%pip install chromadb
%pip install tiktoken
%pip install langchain

In [2]:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from bs4 import BeautifulSoup
from difflib import SequenceMatcher

import numpy as np
import pandas as pd
import requests
import os
import openai
import shutil

# Get Webpage Content
In this section, the relevant blog article links are collected using the Google Search API, and the content of each page is scraped.

In [3]:
# Example search_query
search_query = "Which team won the Premier League in 2023?"

In [4]:
# Google Search API
API_KEY = "XXX"
SEARCH_ENGINE_ID = "XXX"

In [5]:
def gatherLinks(search_query,numLinks=5):

  # Limit the search results to blogs
  search_query += " inurl:blog"

  url = "https://www.googleapis.com/customsearch/v1"
  params = {
      "q": search_query,
      "key": API_KEY,
      "cx": SEARCH_ENGINE_ID,
      "num": numLinks,
  }

  response = requests.get(url, params=params)
  results = response.json()

  urlList = []

  if "items" in results:
      for i in range(min(numLinks, len(results["items"]))):
          urlList.append(results["items"][i]["link"])
  else:
      print("No search results found.")

  return urlList
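The Custom Search JSON API returns at most 10 results per request through `num`, so asking `gatherLinks` for more than 10 links would need several requests that advance the `start` parameter. A minimal sketch of how the request parameters could be paged is shown here; `build_search_pages` and its arguments are hypothetical names, not part of the original notebook, and no request is actually sent.

```python
def build_search_pages(query, num_links, api_key, engine_id, page_size=10):
    """Build one params dict per request needed to collect num_links results."""
    pages = []
    start = 1  # the API's start index is 1-based
    remaining = num_links
    while remaining > 0:
        count = min(page_size, remaining)  # never ask for more than page_size
        pages.append({
            "q": query,
            "key": api_key,
            "cx": engine_id,
            "num": count,
            "start": start,
        })
        start += count
        remaining -= count
    return pages

# Placeholder credentials, as in the notebook
pages = build_search_pages("example query inurl:blog", 25, "XXX", "XXX")
print([(p["start"], p["num"]) for p in pages])
```

Each dict could then be passed as `params` to `requests.get`, with the `items` of every response appended to one list.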

In [6]:
def gatherContent(urlList):
  urlContentList = []

  for url in urlList:
    try:
        # Send a GET request to the URL (time out rather than hang on a slow site)
        response = requests.get(url, timeout=10)

        # Create a BeautifulSoup object to parse the HTML
        soup = BeautifulSoup(response.text, "html.parser")

        # Find the <body> element
        body_content = soup.find("body")

        # Check if the <body> element was found
        if body_content:
            # Extract the text from the <body> element
            text = body_content.get_text()
            urlContentList.append(text[:10000]) # Limit number of characters
        else:
            print(f"No <body> element found for URL: {url}")
            urlContentList.append("")

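    except requests.exceptions.RequestException as e:
        print(f"Error occurred while retrieving URL: {url}")
        print(f"Error message: {str(e)}")
        # Keep urlContentList aligned with urlList, which is indexed in parallel later
        urlContentList.append("")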
  return urlContentList
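Text returned by `get_text()` often contains long runs of blank lines and indentation, all of which count against the 10,000-character limit. One possible refinement is to collapse whitespace before truncating. The helper `clean_text` below is a hypothetical addition for illustration, not part of the original notebook.

```python
def clean_text(text, max_chars=10000):
    """Collapse whitespace runs to single spaces, then truncate to max_chars."""
    # split() with no arguments breaks on any run of whitespace, so joining
    # with single spaces removes blank lines, tabs and indentation.
    collapsed = " ".join(text.split())
    return collapsed[:max_chars]

sample = "  Title\n\n\n   First   paragraph.\n\tSecond paragraph.  "
print(clean_text(sample))
```

Under this assumption, `gatherContent` could call `clean_text(body_content.get_text())` in place of slicing the raw text.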

In [7]:
urlList = gatherLinks(search_query)
urlContentList = gatherContent(urlList)

In [8]:
# Save the content of each of the webpages into individual markdown files

# Create the output directory if it doesn't exist
output_dir = "data"
os.makedirs(output_dir, exist_ok=True)

# Save each URL content to a separate markdown file
for i, url_content in enumerate(urlContentList):
    file_name = f"{urlList[i].replace('https://', '').replace('/', '_')}.md"
    file_path = os.path.join(output_dir, file_name)
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(url_content)
    print(f"Saved content for {urlList[i]} to {file_path}")

Saved content for https://www.premierleague.com/matchweek/12284/blog to data/www.premierleague.com_matchweek_12284_blog.md
Saved content for https://weaintgotnohistory.sbnation.com/2023/11/25/23975423/newcastle-united-chelsea-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights to data/weaintgotnohistory.sbnation.com_2023_11_25_23975423_newcastle-united-chelsea-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights.md
Saved content for https://www.premierleague.com/matchweek/12296/blog to data/www.premierleague.com_matchweek_12296_blog.md
Saved content for https://weaintgotnohistory.sbnation.com/2023/10/28/23935994/chelsea-brentford-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights to data/weaintgotnohistory.sbnation.com_2023_10_28_23935994_chelsea-brentford-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights.md
Saved content for https://www.venasolutions.com/blog/richest-premier-league-clubs to data/www.
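Replacing `/` with `_` is not reversible, because a URL may itself contain underscores. If the original link needs to be recovered from the file name later, one option is to percent-encode the whole URL. The sketch below assumes two hypothetical helpers, `url_to_filename` and `filename_to_url`, which are not part of the original notebook.

```python
from urllib.parse import quote, unquote

def url_to_filename(url):
    # Percent-encode every reserved character (including "/" and ":") so the
    # result is a single path component that decodes back to the same URL.
    return quote(url, safe="") + ".md"

def filename_to_url(file_name):
    # Strip the ".md" suffix added above, then reverse the percent-encoding.
    return unquote(file_name[: -len(".md")])

url = "https://www.venasolutions.com/blog/richest-premier-league-clubs"
name = url_to_filename(url)
print(name)
print(filename_to_url(name))
```

Assuming the document loader records the file path in each document's `source` metadata, the link could then be recovered with `filename_to_url(os.path.basename(source))`. Very long URLs may exceed file-name length limits once encoded, so this is a sketch rather than a complete solution.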

# Answer Question

To answer the question, the ChatGPT API (OpenAI chat completions) is used.

## Approach 1

In this approach, a for loop passes the content of one web page to the model per iteration and stops as soon as the chatbot can answer the query.

In [9]:
os.environ["OPENAI_API_KEY"] = "XXX"

In [10]:
# The model is told to return "zzz" when it cannot answer; this sentinel will never show up in a correct answer

for i in range(0, len(urlContentList)):
  completion = openai.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": f"Answer the question: {search_query} \n---\n using the content passed on. If you can't answer, simply return 'zzz' only."},
          {"role": "user", "content": urlContentList[i]},
      ],
  )
  result = completion.choices[0].message.content

  if "zzz" not in result.lower():
    print(result)
    print(f"Source: {urlList[i]}")
    break

Manchester City won the Premier League in 2023.
Source: https://www.premierleague.com/matchweek/12284/blog
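The loop above can also be expressed as a function that takes the model call as a parameter, which makes the sentinel logic testable without an API key. This is a sketch under assumptions: `first_answer`, `ask` and `fake_ask` are hypothetical names, and `fake_ask` is only a stand-in for the real chat completion request.

```python
def first_answer(question, contents, urls, ask, sentinel="zzz"):
    """Return (answer, url) for the first page that yields an answer, else None."""
    for content, url in zip(contents, urls):
        if not content:
            continue  # skip pages that could not be scraped
        answer = ask(question, content)
        if sentinel not in answer.lower():
            return answer, url
    return None

# Stand-in for the model call, used only to make the sketch runnable offline
def fake_ask(question, content):
    return "found: " + content if "champions" in content else "zzz"

pages = ["", "live blog text", "City are champions"]
links = [
    "https://a.example/blog",
    "https://b.example/blog",
    "https://c.example/blog",
]
print(first_answer("Who won?", pages, links, fake_ask))
```

Returning `None` when no page answers the question lets the caller report that explicitly, instead of printing nothing.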


## Approach 2

Shoutout to this video for helping:

[RAG + Langchain Python Project: Easy AI/Chat For Your Docs](https://www.youtube.com/watch?v=tcqEUSNCn8I)

In this approach, RAG is used. The documents are broken into chunks, and the chunks most relevant to the query are retrieved and used to answer it.
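The retrieval step can be illustrated without any external service. The sketch below swaps the OpenAI embeddings for a toy word-count vector (an assumption made purely for illustration) and ranks chunks by cosine similarity to the query. The names `embed`, `cosine` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": lowercase word counts, standing in for a real model
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    # Score every chunk against the query and keep the k best
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

chunks = [
    "The league title was decided on the final day.",
    "Ticket prices rose again this season.",
    "The club announced a new kit sponsor.",
]
for score, chunk in top_k("Who won the league title?", chunks, k=1):
    print(f"{score:.2f} {chunk}")
```

A real embedding model captures meaning rather than exact word overlap, but the ranking idea is the same: embed the query, compare it with every chunk, and pass the best matches to the LLM as context.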

In [11]:
# Load the markdown files from the data folder
loader = DirectoryLoader("data", glob="*.md")
documents = loader.load()

In [12]:
# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=500,
    length_function=len,
    add_start_index=True
)

chunks = text_splitter.split_documents(documents)

print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

Split 58 documents into 814 chunks.
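With `chunk_size=1000` and `chunk_overlap=500`, each chunk shares up to half of its characters with its neighbour, which roughly doubles the number of chunks. The sketch below shows that relationship with a plain sliding window. It is a simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=500):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

pieces = split_with_overlap("x" * 2500)
print([len(p) for p in pieces])
```

A large overlap makes it less likely that an answer is cut in half at a chunk boundary, at the cost of storing and embedding more text.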


In [13]:
# Chroma: This is a vector database library that is used for efficient storage and retrieval of high-dimensional data, such as text embeddings.
CHROMA_PATH = "chroma"

In [14]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [15]:
# Delete if already exists
if os.path.exists(CHROMA_PATH):
  shutil.rmtree(CHROMA_PATH)

In [16]:
db = Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH)
db.persist()
print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")



Saved 814 chunks to chroma.


In [17]:
# Return the 5 results which are most relevant to the search_query
results = db.similarity_search_with_relevance_scores(search_query, k=5)

if len(results) == 0 or results[0][1] < 0.7:
  print("Unable to find matching results.")

In [18]:
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=search_query)
model = ChatOpenAI()
response_text = model.predict(prompt)

# Map each chunk's source file name back to the closest-matching URL in urlList
def find_most_similar_link(source, html_list):
    max_similarity = 0
    most_similar_link = None

    for link in html_list:
        similarity = SequenceMatcher(None, source, link).ratio()
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar_link = link

    return most_similar_link

sources = []
for doc, score in results:
    source = doc.metadata.get("source", None)
    if source:
        most_similar_link = find_most_similar_link(source, urlList)
        if most_similar_link:
            sources.append(f"- {most_similar_link} (Score: {score:.2f})")
        else:
            sources.append(f"- {source} (Score: {score:.2f})")

sources_text = "\n".join(sources)
formatted_response = f"Response: {response_text}\n\nSources:\n{sources_text}"
print(formatted_response)



Response: Manchester City won the Premier League in 2023.

Sources:
- https://www.venasolutions.com/blog/richest-premier-league-clubs (Score: 0.77)
- https://www.venasolutions.com/blog/richest-premier-league-clubs (Score: 0.76)
- https://weaintgotnohistory.sbnation.com/2023/11/25/23975423/newcastle-united-chelsea-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights (Score: 0.75)
- https://weaintgotnohistory.sbnation.com/2023/10/28/23935994/chelsea-brentford-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights (Score: 0.75)
- https://weaintgotnohistory.sbnation.com/2023/10/28/23935994/chelsea-brentford-premier-league-live-stream-time-tv-how-watch-online-live-blog-highlights (Score: 0.75)
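Because several chunks can come from the same page, the same link may appear more than once in the list of sources. A minimal sketch of collapsing duplicates, keeping the highest score per link, is shown below; `dedupe_sources` is a hypothetical helper and the example links are placeholders.

```python
def dedupe_sources(scored_links):
    """Keep each link once with its highest score, sorted by score descending."""
    best = {}
    for link, score in scored_links:
        if link not in best or score > best[link]:
            best[link] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

scored = [
    ("https://a.example/blog/post", 0.77),
    ("https://a.example/blog/post", 0.76),
    ("https://b.example/blog/post", 0.75),
    ("https://b.example/blog/post", 0.75),
]
for link, score in dedupe_sources(scored):
    print(f"- {link} (Score: {score:.2f})")
```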
