# RAG Webscraping Q&A System

## 🎯 Project Overview

This notebook demonstrates building a **Retrieval-Augmented Generation (RAG)** system that:
- **Scrapes content** from any website
- **Creates embeddings** from the scraped text
- **Stores vectors** in a database for fast retrieval
- **Uses a local LLM** (Ollama) to answer questions based on the scraped content
- **Provides a web interface** (Streamlit) for user interaction

---

## 💻 Hardware Specifications & Technology Choices

**System Configuration:**
- **GPU**: RTX 4070 (8GB VRAM) - Enough to run small (1B to 3B parameter) LLMs locally
- **CPU**: Intel i7-13620H - Handles embedding generation efficiently
- **RAM**: 32GB - Allows smooth operation of multiple components

**Technology Choices:**

### 🧠 LLM: Ollama (llama3.2:1b)
- **Reasoning**: With 8GB VRAM, 7B+ models fit only when quantized and run noticeably slower, so a small model keeps responses fast
- **Benefits**: 1B parameter model runs fast, uses ~2GB VRAM, good quality answers
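
The ~2GB figure matches a back-of-the-envelope estimate of weight memory: parameter count times bytes per parameter. The sketch below assumes 16-bit weights for the ~2GB case and ignores the KV cache and runtime overhead, so the numbers are rough lower bounds. The helper name `weight_memory_gb` is illustrative and not part of the project.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare a 1B and a 7B model at common weight precisions.
for label, params in [("1B", 1e9), ("7B", 7e9)]:
    for bits in (16, 8, 4):
        print(f"{label} @ {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

By this estimate a 7B model exceeds 8GB at 16-bit and fits only when quantized, which leaves less room for context than a 1B model does.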

### 🌐 Frontend: Streamlit
- **Reasoning**: Initially tried Voila but had JSON corruption issues
- **Benefits**: Fast development, great for ML/AI projects, easy deployment

### 📊 Vector Database: ChromaDB
- **Reasoning**: Lightweight, perfect for local development
- **Benefits**: No server setup required, persistent storage, good performance

### 🔢 Embeddings: SentenceTransformer (all-MiniLM-L6-v2)
- **Reasoning**: Balanced between speed and quality
- **Benefits**: Fast inference on CPU, good semantic understanding

---

## 📋 Step 1: Environment Setup & Dependencies

Install all required packages for our RAG system:

In [None]:
# Install required packages
import subprocess
import sys

packages = [
    'requests',           # For web scraping
    'beautifulsoup4',     # For HTML parsing
    'sentence-transformers', # For embeddings
    'chromadb',           # Vector database
    'langchain',          # RAG framework
    'langchain-community', # Community integrations
    'streamlit',          # Web interface
    'ollama'              # Local LLM
]

for package in packages:
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
        print(f"✅ {package} installed")
    except subprocess.CalledProcessError:
        print(f"❌ {package} failed")
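
Re-running the install cell calls pip for every package each time. As an optional variation, the sketch below checks which modules are already importable before installing anything. The mapping from pip names to module names and the helper `missing_packages` are assumptions added for illustration.

```python
import importlib.util

# pip distribution name -> importable module name (they differ for some packages)
MODULE_NAMES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "sentence-transformers": "sentence_transformers",
    "chromadb": "chromadb",
    "langchain": "langchain",
    "langchain-community": "langchain_community",
    "streamlit": "streamlit",
    "ollama": "ollama",
}


def missing_packages(module_names: dict) -> list:
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in module_names.items()
        if importlib.util.find_spec(module) is None
    ]


print("Missing:", missing_packages(MODULE_NAMES) or "none")
```

Only the names returned by `missing_packages` would then need to be passed to pip.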

## 📚 Step 2: Import Libraries

Import all necessary libraries for our RAG pipeline:

In [None]:
# Core libraries
import os
import time
import uuid
from typing import List

# Web scraping
import requests
from bs4 import BeautifulSoup

# ML & Embeddings
from sentence_transformers import SentenceTransformer
import chromadb

# LangChain RAG
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.schema import Document
# Integrations live in langchain-community; the langchain.* paths for them are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain_community.embeddings import HuggingFaceEmbeddings

print("✅ All libraries imported successfully!")

## 🌐 Step 3: Web Scraping Function

Create a robust web scraper with enhanced headers to avoid bot detection:

In [None]:
def scrape_website(url: str, max_length: int = 10000) -> str:
    """
    Scrape content from website with bot detection avoidance.
    """
    # Enhanced headers for Reuters, news sites, etc.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',  # no 'br': requests decodes Brotli only if the optional brotli package is installed
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Cache-Control': 'max-age=0'
    }
    
    try:
        print(f"🌐 Scraping: {url}")
        time.sleep(1)  # Be respectful
        
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        
        print(f"✅ Success (Status: {response.status_code})")
        
        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Remove scripts and styles
        for script in soup(["script", "style"]):
            script.decompose()
        
        # Extract text from paragraphs
        paragraphs = soup.find_all('p')
        content = ' '.join([p.get_text().strip() for p in paragraphs if p.get_text().strip()])
        
        if not content:
            content = soup.get_text()
        
        # Clean and limit content
        content = ' '.join(content.split())
        if len(content) > max_length:
            content = content[:max_length] + "..."
        
        print(f"📝 Extracted {len(content)} characters")
        return content
        
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return f"Error scraping {url}: {str(e)}"

# Test scraping
test_content = scrape_website("https://www.bbc.com/news/technology", 1000)
print(f"\n📋 Sample: {test_content[:300]}...")
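
`scrape_website` reports failure by returning a string that begins with `Error scraping`, so callers have to inspect the text to tell success from failure. One possible alternative, sketched below, is to return a small result object with an explicit error field. `ScrapeResult` and `clean_text` are hypothetical names for illustration and are not used elsewhere in this notebook; the example runs offline on inline strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeResult:
    """Outcome of one scrape: either text or an error message."""
    url: str
    text: str = ""
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def clean_text(raw: str, max_length: int = 10000) -> str:
    """Collapse whitespace and truncate, mirroring the cleanup in scrape_website."""
    text = " ".join(raw.split())
    return text[:max_length] + "..." if len(text) > max_length else text


# A page whose text legitimately contains the word "Error" is still a success.
good = ScrapeResult(url="https://example.com", text=clean_text("Error  rates fell\n this year."))
bad = ScrapeResult(url="https://example.com", error="HTTP 403")

print(good.ok, repr(good.text))
print(bad.ok, bad.error)
```

With this shape, a caller checks `result.ok` instead of searching the content for an error marker.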

## 🔢 Step 4: Text Processing & Embeddings

Process scraped text into chunks and create embeddings:

In [None]:
def process_text_for_embeddings(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """
    Split text into chunks for better LLM processing.
    """
    print(f"📝 Processing {len(text)} characters")
    
    # Smart text splitting
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", "!", "?", ",", " "]
    )
    
    chunks = text_splitter.split_text(text)
    print(f"✂️ Split into {len(chunks)} chunks")
    
    # Create Document objects
    documents = [
        Document(
            page_content=chunk,
            metadata={"chunk_id": i, "total_chunks": len(chunks)}
        )
        for i, chunk in enumerate(chunks)
    ]
    
    return documents

# Test processing
if test_content and not test_content.startswith("Error scraping"):  # failure strings start with this prefix
    docs = process_text_for_embeddings(test_content)
    print(f"\n📄 First chunk: {docs[0].page_content[:200]}...")
else:
    print("⚠️ Skipping - no valid content")
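
To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified, dependency-free chunker that slides a fixed-size character window over the text. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on the listed separators; the function name `chunk_with_overlap` is illustrative.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Fixed-size character windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both neighbours.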

## 🗄️ Step 5: Vector Database Setup

Create a ChromaDB vector store for fast similarity search:

In [None]:
def setup_vector_database(documents: List[Document], collection_name: str = None):
    """
    Create vector database from documents.
    """
    if not collection_name:
        collection_name = f"rag_collection_{uuid.uuid4().hex[:8]}"
    
    print(f"🗄️ Setting up vector DB: {collection_name}")
    
    # Initialize embeddings
    embeddings = HuggingFaceEmbeddings(
        model_name='all-MiniLM-L6-v2',
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True}
    )
    
    # Create vector store
    vectorstore = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        collection_name=collection_name,
        persist_directory="./chroma_db"
    )
    
    print(f"✅ Vector DB created with {len(documents)} documents")
    return vectorstore

# Test vector database
if 'docs' in locals() and docs:
    vectorstore = setup_vector_database(docs, "test_collection")
    
    # Test similarity search
    results = vectorstore.similarity_search("technology news", k=2)
    print(f"\n🔍 Search results: {len(results)} documents found")
    for i, doc in enumerate(results):
        print(f"Result {i+1}: {doc.page_content[:100]}...")
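
The `normalize_embeddings` setting scales every embedding to unit length. For unit vectors the dot product equals the cosine similarity, and squared Euclidean distance equals 2 minus twice that cosine, so both measures rank neighbours identically. The toy example below shows this with hand-written 2-D vectors in plain Python. It does not reproduce ChromaDB's index, and the names `normalize` and `top_k` are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, vectors, k=2):
    """Return (score, index) pairs for the k vectors most similar to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), i)
        for i, v in enumerate(vectors)
    ]
    return sorted(scored, reverse=True)[:k]


# Toy "embeddings": index 0 points along x, index 1 along y, index 2 in between.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, index in top_k([2.0, 0.1], vectors, k=2):
    print(f"vector {index}: cosine similarity {score:.3f}")
```

Scaling the query by any positive factor leaves the ranking unchanged, since only direction matters after normalization.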

## 🤖 Step 6: Ollama LLM Setup

Configure the local Ollama LLM, optimized for the RTX 4070:

In [None]:
def setup_ollama_llm(model_name: str = "llama3.2:1b"):
    """
    Initialize Ollama LLM with optimized settings.
    """
    print(f"🤖 Setting up Ollama: {model_name}")
    
    llm = Ollama(
        model=model_name,
        base_url="http://localhost:11434",
        temperature=0.1,    # Focused answers
        top_p=0.9,         # Nucleus sampling
        num_predict=512,   # Response length limit
        repeat_penalty=1.1 # Reduce repetition
    )
    
    print("✅ LLM ready (optimized for RTX 4070)")
    return llm

# Test Ollama
try:
    llm = setup_ollama_llm()
    test_response = llm.invoke("What is AI?")  # calling llm(...) directly is deprecated
    print(f"\n🧪 Test response: {test_response[:150]}...")
except Exception as e:
    print(f"⚠️ Ollama error: {e}")
    print("💡 Make sure Ollama is running:")
    print("   - ollama serve")
    print("   - ollama pull llama3.2:1b")
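
Before constructing the LLM it can help to confirm that the Ollama server is reachable and that the model has been pulled. The sketch below uses only the standard library and assumes the server exposes its model list as JSON at `/api/tags` with a `models` array of objects carrying a `name` field; check the Ollama API documentation for your version. The helper name `list_ollama_models` is illustrative.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the model names reported by the server, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with: ollama serve")
elif "llama3.2:1b" not in models:
    print("Model missing; install it with: ollama pull llama3.2:1b")
else:
    print("Ollama is running and llama3.2:1b is available")
```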

## 🔗 Step 7: Complete RAG Pipeline

Combine all components into an end-to-end system:

In [None]:
def create_rag_chain(vectorstore, llm):
    """
    Create complete RAG chain.
    """
    print("🔗 Creating RAG chain...")
    
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}  # Top 4 similar chunks
    )
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    
    print("✅ RAG chain ready")
    return qa_chain

def webscrape_rag_qa(url: str, question: str, fallback_urls: List[str] = None):
    """
    Complete pipeline: scrape → embed → answer.
    """
    print(f"🚀 RAG Pipeline: {url}")
    print(f"❓ Question: {question}")
    
    # Scrape content
    content = scrape_website(url)
    
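    # Try fallbacks if needed. scrape_website() signals failure with a string
    # that starts with "Error scraping", so test for that prefix rather than
    # for the word "Error", which can appear in legitimate page text.
    failed = content.startswith("Error scraping")
    if failed and fallback_urls:
        for fallback in fallback_urls:
            content = scrape_website(fallback)
            failed = content.startswith("Error scraping")
            if not failed:
                url = fallback
                break
    
    if failed or len(content.strip()) < 100: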
        return {
            "answer": "❌ Unable to scrape content",
            "source_url": url,
            "error": content
        }
    
    try:
        # Process text
        documents = process_text_for_embeddings(content)
        
        # Create vector DB
        vectorstore = setup_vector_database(documents)
        
        # Setup LLM
        llm = setup_ollama_llm()
        
        # Create RAG chain
        qa_chain = create_rag_chain(vectorstore, llm)
        
        # Generate answer
        print("🤔 Generating answer...")
        result = qa_chain.invoke({"query": question})  # calling the chain directly is deprecated
        
        print("✅ Answer ready!")
        
        return {
            "answer": result["result"],
            "source_url": url,
            "source_documents": result["source_documents"],
            "num_chunks": len(documents)
        }
        
    except Exception as e:
        return {
            "answer": f"❌ Pipeline error: {str(e)}",
            "source_url": url,
            "error": str(e)
        }

print("🎯 RAG pipeline ready!")
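
With `chain_type="stuff"`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. The sketch below shows the idea in plain Python. The template wording is an assumption for illustration, not LangChain's built-in prompt, and `build_stuffed_prompt` is a hypothetical helper.

```python
def build_stuffed_prompt(question: str, chunks: list) -> str:
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    "Chunk A: the site reports a new chip launch.",
    "Chunk B: regulators opened an inquiry into app stores.",
]
prompt = build_stuffed_prompt("What are the main stories?", retrieved)
print(prompt)
```

With `k=4` and `chunk_size=500`, the context block stays around 2,000 characters, which keeps the prompt small for a 1B model.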

## 🧪 Step 8: Test the Complete System

Try our RAG system with a real example:

In [None]:
# Test complete RAG system
test_url = "https://www.bbc.com/news/technology"
test_question = "What are the main technology stories mentioned?"

fallback_urls = [
    "https://techcrunch.com",
    "https://www.theverge.com"
]

print("🧪 Testing RAG system...")
print("="*50)

result = webscrape_rag_qa(
    url=test_url,
    question=test_question,
    fallback_urls=fallback_urls
)

print("\n📋 RESULTS:")
print("="*50)
print(f"🌐 Source: {result.get('source_url', 'N/A')}")
print(f"📄 Chunks: {result.get('num_chunks', 'N/A')}")
print("\n💬 ANSWER:")
print("-"*30)
print(result['answer'])

if 'source_documents' in result:
    print("\n📚 SOURCE CHUNKS:")
    for i, doc in enumerate(result['source_documents']):
        print(f"\nChunk {i+1}: {doc.page_content[:200]}...")

## 🌟 Step 9: Streamlit Web Interface

The complete production application is available in `streamlit_rag_app.py`:

### 🚀 To Run the Web Interface:
1. **Start Ollama**: `ollama serve`
2. **Install model**: `ollama pull llama3.2:1b`
3. **Run Streamlit**: `streamlit run streamlit_rag_app.py`
4. **Open browser**: http://localhost:8501

### ✅ Features:
- User-friendly web interface
- Real-time processing with progress bars
- Website compatibility guidance
- Error handling with helpful messages
- Source document display for transparency

---

## 🎉 Project Complete!

### 📋 What We Built:
1. **Web Scraping** - Robust with bot detection avoidance
2. **Text Processing** - Smart chunking with context preservation
3. **Vector Database** - Fast similarity search with ChromaDB
4. **Local LLM** - Ollama optimized for RTX 4070
5. **RAG Pipeline** - Complete retrieval-augmented generation
6. **Web Interface** - Production-ready Streamlit app

### 🌐 GitHub Repository:
**https://github.com/prakharrshukla/RAG-Webscraping-QA-System**

### 💡 Key Benefits:
- **Hardware Optimized** for RTX 4070 (8GB VRAM)
- **No API Costs** - Everything runs locally
- **Privacy Focused** - Scraped text and questions stay on the local machine; only the page fetches go over the network
- **Educational** - Complete step-by-step guide
- **Production Ready** - Professional web interface

**Happy RAG building! 🎯**