In [None]:
Task 2: Chat with Website Using RAG Pipeline
Overview
The goal is to implement a Retrieval-Augmented Generation (RAG) pipeline that allows users to
interact with structured and unstructured data extracted from websites. The system will crawl,
scrape, and store website content, convert it into embeddings, and store it in a vector database.
Users can query the system for information and receive accurate, context-rich responses
generated by a selected LLM.
Functional Requirements
1. Data Ingestion
• Input: URLs or list of websites to crawl/scrape.
• Process:
o Crawl and scrape content from target websites.
o Extract key data fields, metadata, and textual content.
o Segment content into chunks for better granularity.
o Convert chunks into vector embeddings using a pre-trained embedding model.
o Store embeddings in a vector database with associated metadata for efficient
retrieval.
2. Query Handling
• Input: User's natural language question.
• Process:
o Convert the user's query into vector embeddings using the same embedding
model.
o Perform a similarity search in the vector database to retrieve the most relevant
chunks.
o Pass the retrieved chunks to the LLM along with a prompt or agentic context to
generate a detailed response.

3. Response Generation
• Input: Relevant information retrieved from the vector database and the user query.
• Process:
o Use the LLM with retrieval-augmented prompts to produce responses with exact
values and context.
o Ensure factuality by incorporating retrieved data directly into the response.
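
The retrieval-augmented prompt described in the steps above could be assembled roughly as follows. This is a minimal sketch, not the required implementation: the function name `build_prompt`, the `max_chars` budget, and the instruction wording are all assumptions made for illustration, and the resulting string would still need to be sent to whichever LLM is selected.

```python
def build_prompt(query, retrieved_chunks, max_chars=4000):
    """Assemble a retrieval-augmented prompt (hypothetical helper).

    Chunks are numbered so the answer can refer back to them, and the
    context is cut off once the assumed character budget is reached.
    """
    context_parts = []
    used = 0
    for n, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{n}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("What does the university invest in?",
                       ["We invest in health and education.",
                        "Faculty pursue original research."]))
```

Placing the retrieved text directly in the prompt, with an instruction to stay within it, is one way to address the factuality requirement; it does not guarantee it.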

Example website links:
https://www.uchicago.edu/
https://www.washington.edu/
https://www.stanford.edu/
https://und.edu/
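
The ingestion requirement asks for chunks to be stored with associated metadata. One possible way to keep the source URL attached to each chunk is sketched below. The function name `chunk_with_metadata`, the record keys, and the `overlap` parameter are assumptions for illustration; word-based windows are used only because they need no extra dependencies.

```python
def chunk_with_metadata(pages, chunk_size=100, overlap=20):
    """Split each page into overlapping word windows (hypothetical helper).

    `pages` maps a URL to its scraped text. Each returned record keeps the
    source URL and a per-page chunk number alongside the chunk text.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    records = []
    for url, text in pages.items():
        words = text.split()
        for chunk_id, start in enumerate(range(0, len(words), step)):
            window = words[start:start + chunk_size]
            records.append({
                "source": url,
                "chunk_id": chunk_id,
                "text": " ".join(window),
            })
            if start + chunk_size >= len(words):
                break  # the last window already reached the end of the page
    return records


if __name__ == "__main__":
    sample = {"https://und.edu/": "one two three four five six seven"}
    for record in chunk_with_metadata(sample, chunk_size=4, overlap=1):
        print(record)
```

Keeping the records in the same order as the vectors added to the index lets a search result's position be mapped back to its source URL.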

In [10]:
pip install sentence-transformers


Note: you may need to restart the kernel to use updated packages.


In [12]:
pip install numpy PyPDF2 faiss-cpu transformers


Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp312-cp312-win_amd64.whl.metadata (4.5 kB)
Downloading faiss_cpu-1.9.0.post1-cp312-cp312-win_amd64.whl (13.8 MB)

In [31]:
!pip install faiss-cpu
from sentence_transformers import SentenceTransformer
import faiss
import requests
from bs4 import BeautifulSoup

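def crawl_and_scrape(urls):
    """Fetch each URL and return the text of all <p> elements as one string."""
    # Browser-like User-Agent, since some sites may reject the default one.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    content = []
    for url in urls:
        try:
            # A timeout keeps one unresponsive site from blocking the whole crawl.
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            text = ' '.join(p.get_text() for p in soup.find_all('p'))
            content.append(text)
        except requests.RequestException as e:
            # Report the failure and continue with the remaining URLs.
            print(f"Error scraping {url}: {e}")
    return ' '.join(content)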


def chunk_text(text, chunk_size=100):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

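def store_embeddings(chunks, model):
    """Embed the chunks and add them to an in-memory FAISS index."""
    if not chunks:
        raise ValueError("No text was scraped, so there is nothing to index.")
    embeddings = model.encode(chunks)
    dimension = embeddings.shape[1]
    # IndexFlatL2 performs exact (brute-force) search using L2 distance.
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    return index, chunks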

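def query_vector_database(query, index, model, chunks, top_k=3):
    """Embed the query with the same model and return the nearest chunks."""
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, top_k)
    # FAISS pads the result with -1 when the index holds fewer than top_k vectors.
    return [chunks[i] for i in indices[0] if i != -1]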

def generate_response(results):
    """Join the retrieved chunks into one string; no LLM is called here."""
    return "\n".join(results)

def main():
    # Step 1: Define URLs
    urls = [
        "https://www.uchicago.edu/",
        "https://www.washington.edu/",
        "https://www.stanford.edu/",
        "https://und.edu/",
    ]

    # Step 2: Initialize embedding model
    print("Loading embedding model...")
    model = SentenceTransformer("all-MiniLM-L6-v2")  # Pre-trained model

    # Step 3: Crawl and scrape websites
    print("Crawling and scraping websites...")
    website_data = crawl_and_scrape(urls)

    # Step 4: Chunk and embed content
    print("Chunking and embedding content...")
    chunks = chunk_text(website_data)
    index, stored_chunks = store_embeddings(chunks, model)

    # Step 5: Handle user query
    query = input("Enter your query: ")
    print("Searching content...")
    answers = query_vector_database(query, index, model, stored_chunks)
    print("Generating response...")
    response = generate_response(answers)
    print(response)

if __name__ == "__main__":
    main()

Loading embedding model...
Crawling and scraping websites...
Chunking and embedding content...


Enter your query:  what is chicago university for


Searching content...
Generating response...
A diversity of people and ideas, coupled with free and open discourse, lays the foundation for students and scholars to bring forth original ideas that define fields and enrich human life. UChicago students develop the habits of mind and intellectual skills needed to confront complex challenges. UChicago researchers have contributed to some of the world’s greatest discoveries, advancements, and bodies of knowledge. Faculty have a free and challenging environment in which to pursue the most original research. As a community partner, we invest in Chicago’s South Side across such areas as health, education, economic growth, and the arts. We are
1912, totaling 335 medals from 196 medalists Medals Stanford student-athletes have achieved local, national, and global impact through community involvement and advocacy Athlete Stories Offering extraordinary freedom to explore, to collaborate, and to challenge yourself We look for distinctive students wh
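
In the output above, the second retrieved chunk comes from a different site than the one the query asks about, because the top-k search returns the nearest chunks regardless of how far away they are. A possible refinement is to drop results whose distance is too large. The sketch below assumes the search step is changed to also return the distances it currently discards; the name `filter_by_distance` and the cutoff `max_distance=1.0` are hypothetical and would need tuning on real queries.

```python
def filter_by_distance(chunks, distances, max_distance=1.0):
    """Keep only results within max_distance (hypothetical helper).

    `chunks` and `distances` are parallel sequences, such as the retrieved
    chunk texts and the matching row of distances from the index search.
    Returns (chunk, distance) pairs sorted from nearest to farthest.
    """
    pairs = [
        (chunk, float(dist))
        for chunk, dist in zip(chunks, distances)
        if dist <= max_distance  # assumed cutoff, not a recommended value
    ]
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    retrieved = ["about the university", "athletics medals", "campus research"]
    dists = [0.9, 1.4, 0.3]
    for chunk, dist in filter_by_distance(retrieved, dists):
        print(f"{dist:.2f}  {chunk}")
```

With `IndexFlatL2`, the reported values are squared L2 distances, so any cutoff should be chosen on that scale.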