# Task 2: Chat with Website Using RAG Pipeline<br>
**Overview**<br>
The goal is to implement a Retrieval-Augmented Generation (RAG) pipeline that allows users to
interact with structured and unstructured data extracted from websites. The system will crawl,
scrape, and store website content, convert it into embeddings, and store it in a vector database.
Users can query the system for information and receive accurate, context-rich responses
generated by a selected LLM.<br><br>
**Functional Requirements**<br>
**1. Data Ingestion**
• Input: URLs or list of websites to crawl/scrape.
• Process:
o Crawl and scrape content from target websites.
o Extract key data fields, metadata, and textual content.
o Segment content into chunks for better granularity.
o Convert chunks into vector embeddings using a pre-trained embedding model.
o Store embeddings in a vector database with associated metadata for efficient
retrieval.<br>
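
The segmentation step above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the `Chunk` dataclass, the `chunk_text` helper, and the `chunk_size` and `overlap` defaults are all assumptions chosen for the example, and splitting on whitespace is a simplification (a tokenizer-aware splitter may suit a real embedding model better). Each chunk keeps its source URL and position so that metadata can be stored next to the embedding.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One retrievable unit of page text plus the metadata needed to cite it."""
    url: str
    index: int
    text: str


def chunk_text(text, url, chunk_size=50, overlap=10):
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `overlap` words. Both values are
    illustrative assumptions, not tuned settings.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(Chunk(url=url, index=len(chunks), text=" ".join(window)))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlapping windows are one common way to avoid losing context at a chunk boundary. The resulting `Chunk.text` values would then be passed to the embedding model and written to the vector database together with `url` and `index`.
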
**2. Query Handling**
• Input: User's natural language question.
• Process:
o Convert the user's query into vector embeddings using the same embedding
model.
o Perform a similarity search in the vector database to retrieve the most relevant
chunks.
o Pass the retrieved chunks to the LLM along with a prompt or agentic context to
generate a detailed response. <br>
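
The query-handling steps above can be sketched as follows. To keep the example runnable without downloading a model, it swaps in a bag-of-words count vector for the pre-trained embedding model and a linear scan for the vector database, so `toy_embed`, `cosine`, and `top_k` are hypothetical stand-ins rather than the pipeline's actual components. The structure is the part that carries over: the query is embedded with the same function as the chunks, and the chunks are ranked by cosine similarity.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return up to k (score, chunk) pairs, most similar to the query first."""
    query_vec = toy_embed(query)  # same embedding function as the chunks
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "admissions deadlines for undergraduate applicants",
    "campus dining hall opening hours",
    "research funding for graduate students",
]
results = top_k("undergraduate admissions deadlines", chunks, k=2)
```

With a real embedding model, the chunk vectors would be computed once at ingestion time and stored, and only the query would be embedded per request.
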
**3. Response Generation**
• Input: Relevant information retrieved from the vector database and the user query.
• Process:
o Use the LLM with retrieval-augmented prompts to produce responses with exact
values and context.
o Ensure factuality by incorporating retrieved data directly into the response.
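
One way to incorporate retrieved data directly into the response is to place the chunks, tagged with their sources, ahead of the question in the prompt. The sketch below only builds the prompt string and makes no LLM call. The `build_rag_prompt` helper, the instruction wording, the `max_chars` budget, and the `example.edu` URLs are all assumptions made for illustration.

```python
def build_rag_prompt(question, retrieved, max_chars=2000):
    """Assemble a prompt that lists numbered, source-tagged chunks before the question.

    `retrieved` is a list of (url, text) pairs, most relevant first.
    `max_chars` is a rough, assumed budget for the context section.
    """
    context_parts = []
    used = 0
    for i, (url, text) in enumerate(retrieved, start=1):
        entry = f"[{i}] Source: {url}\n{text}"
        if used + len(entry) > max_chars:
            break  # stop adding chunks once the budget is exhausted
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts) if context_parts else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "Quote exact values from the context and cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_rag_prompt(
    "When is the deadline?",
    [
        ("https://example.edu/a", "The deadline is January 5."),
        ("https://example.edu/b", "Tuition is listed online."),
    ],
)
```

Numbering the chunks lets the model cite a source for each claim. Asking it to say so when the context lacks the answer is a common guard against unsupported output, though it does not guarantee factuality.
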

In [2]:
import requests
from bs4 import BeautifulSoup

def fetch_website_content(url):
  """Fetches and returns the text content of a website."""
  try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Join text nodes with a space so words from adjacent elements do not run together.
    return soup.get_text(separator=' ', strip=True)
  except requests.exceptions.SSLError as ssl_err:
    print(f"SSL error occurred: {ssl_err}")
  except requests.exceptions.RequestException as req_err:
    print(f"Request error occurred: {req_err}")
  except Exception as e:
    print(f"An error occurred: {e}")

def search_in_content(query, content_dict):
  """Searches for a query in the scraped content and returns matching results."""
  matches = []
  for url, content in content_dict.items():
    if query.lower() in content.lower():
      matches.append((url, content))
  return matches

websites = [
    "https://www.uchicago.edu/",
    "https://www.washington.edu/",
    "https://www.stanford.edu/",
    "https://und.edu/"
    ]

scraped_content = {}
for website in websites:
  content = fetch_website_content(website)
  if content:
    print(f"Successfully scraped content from {website}")
    scraped_content[website] = content

query_input = input("Enter your query: ")
search_results = search_in_content(query_input, scraped_content)

if search_results:
  print("\nResults found:")
  for url, content in search_results:
    print(f"\nFrom {url}:\n{content[:200]}...")
else:
  print("No results found for your query.")

Request error occurred: 403 Client Error: Forbidden for url: https://www.uchicago.edu/
Successfully scraped content from (website)
Successfully scraped content from (website)
Successfully scraped content from (website)
Enter your query: distictly uchicago
No results found for your query.


In [1]:
pip install requests beautifulsoup4 sentence-transformers faiss-cpu openai


Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1
