# Retrieval Augmented Generation Playground - Hugging Face Online Models

> This notebook is a playground for vectorizing documents and using a RAG architecture to improve the responses and capabilities of an AI (LLM) for a specific purpose. This version uses **Hugging Face online models**: they are downloaded from the Hugging Face Hub on first use and then run on your own CPU or GPU, so no manual model installation is required.
>
> **Key Features:**
> - Uses Hugging Face transformers pipeline for easy model access
> - Works with free Hugging Face models (no account required for many models)
> - Automatically handles model downloading and caching
> - Supports both CPU and GPU execution
> - Uses sentence-transformers for embeddings
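
The retrieve-then-generate flow that the rest of the notebook builds can be sketched in a few lines. This is a toy illustration only: `embed` is a bag-of-words stand-in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical helper names, not part of any library used below.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Place the retrieved chunks in front of the question."""
    context = "\n\n".join(context_chunks)
    return f"Context from documents:\n{context}\n\nQuestion: {query}\n\nAnswer:"

chunks = [
    "CUI Specified has specific handling controls.",
    "The cafeteria opens at eight.",
    "CUI Basic uses the default handling controls.",
]
question = "What handling controls apply to CUI Specified?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt would then be passed to a language model; the notebook replaces each toy piece with a real component (sentence-transformers for embeddings, FAISS for retrieval, a transformers pipeline for generation).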

## Install Required Dependencies

> First, let's make sure we have all the required packages installed.

In [1]:
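# Install required packages if not already installed
import importlib.metadata
import re
import subprocess
import sys

def install_package(package):
    # Strip any version specifier, e.g. "transformers>=4.30.0" -> "transformers"
    dist_name = re.split(r"[<>=!~]", package, maxsplit=1)[0].strip()
    try:
        # Look up the installed distribution by its pip name; importing by name
        # fails for packages such as faiss-cpu, whose module is called "faiss"
        importlib.metadata.version(dist_name)
        print(f"✓ {package} already installed")
    except importlib.metadata.PackageNotFoundError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])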

# Essential packages for this notebook
packages = [
    "transformers>=4.30.0",
    "torch",
    "sentence-transformers",
    "faiss-cpu",  # Use faiss-gpu if you have GPU support
    "langchain",
    "langchain-community",
    "gradio",
    "pypdf",
    "datasets",
    "pandas"
]

for package in packages:
    install_package(package)

print("\n✅ All packages installed successfully!")

Installing transformers>=4.30.0...
✓ torch already installed
Installing sentence-transformers...
Installing faiss-cpu...
✓ langchain already installed
Installing langchain-community...




✓ gradio already installed
✓ pypdf already installed
✓ datasets already installed
✓ pandas already installed

✅ All packages installed successfully!


## Import Libraries and Setup

> Import all necessary libraries and set up our environment for RAG operations.

In [2]:
import os
import torch
import pandas as pd
import gradio as gr
from typing import List, Tuple, Optional
from pathlib import Path

# Transformers and model loading
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    pipeline,
    TextStreamer
)

# Sentence transformers for embeddings
from sentence_transformers import SentenceTransformer

# LangChain for document processing and vector stores
# (integrations live in langchain-community; the old langchain.* paths are deprecated)
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

# Other utilities
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🚀 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")



✅ All libraries imported successfully!
🔥 PyTorch version: 2.2.1+cu121
🚀 CUDA available: True
🎮 GPU: NVIDIA L40S


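## Models That Do Not Require a Token

> Candidate open models, with their trade-offs, listed here as loadable without a Hugging Face token.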

In [3]:
open_llms_no_token_required = [
    {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "description": "Instruction-tuned Mistral 7B model with strong reasoning and generation capabilities.",
        "pros": ["Excellent general-purpose LLM", "Performs well in RAG", "Fast inference"],
        "cons": ["May require GPU with ~12GB+ VRAM for best performance"]
    },
    {
        "model": "NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
        "description": "DPO fine-tuned Mistral variant for assistant-like dialog and reasoning.",
        "pros": ["Great for conversation and QA", "Performs well in long-context tasks", "No token needed"],
        "cons": ["Heavier than Phi models"]
    },
    {
        "model": "teknium/OpenHermes-2.5-Mistral-7B",
        "description": "Mistral-based chat model tuned for helpful assistant-style responses.",
        "pros": ["Efficient", "Instruction following", "Good with injected context (RAG)"],
        "cons": ["May hallucinate if not grounded"]
    },
    {
        "model": "microsoft/phi-2",
        "description": "Compact transformer trained with curriculum learning, ideal for reasoning and basic chat.",
        "pros": ["Lightweight", "Good accuracy per parameter", "No token required"],
        "cons": ["Limited context window (2k)"]
    },
    {
        "model": "microsoft/phi-3-mini-4k-instruct",
        "description": "Latest 4k context Phi-3 model, small but powerful for structured assistant tasks.",
        "pros": ["Efficient and very fast", "Strong coding and reasoning", "Great for edge devices"],
        "cons": ["Limited generative depth compared to larger models"]
    },
    {
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "description": "Very small LLaMA-style model trained for chat and instruction following.",
        "pros": ["Extremely lightweight", "Can run on CPU", "Good for constrained environments"],
        "cons": ["Not as strong in generalization or long reasoning"]
    },
    {
        "model": "OpenAccessAI/MythoMax-L2-13b",
        "description": "A powerful LLaMA-2-based model fine-tuned for rich, open-ended conversation.",
        "pros": ["Powerful and expressive", "Fine-tuned on a diverse instruction dataset", "No token required"],
        "cons": ["Large (13B) - needs ~24GB+ VRAM or GGUF format"]
    },
    {
        "model": "openchat/openchat-3.5-0106",
        "description": "Chat-focused Mistral-based model for helpful assistant behaviors.",
        "pros": ["Chat-optimized", "Can be used in RAG", "Open access"],
        "cons": ["Some versions require more VRAM"]
    }
]


model_names = [d["model"] for d in open_llms_no_token_required]

for m in model_names:
    print(m)

mistralai/Mistral-7B-Instruct-v0.2
NousResearch/Nous-Hermes-2-Mistral-7B-DPO
teknium/OpenHermes-2.5-Mistral-7B
microsoft/phi-2
microsoft/phi-3-mini-4k-instruct
TinyLlama/TinyLlama-1.1B-Chat-v1.0
OpenAccessAI/MythoMax-L2-13b
openchat/openchat-3.5-0106
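
The VRAM notes in the list above can be sanity-checked with a back-of-the-envelope estimate: the weights alone take roughly (number of parameters) × (bytes per parameter). The sketch below assumes 2 bytes per parameter for float16 and 4 for float32; `estimate_weight_memory_gb` is a hypothetical helper, the parameter counts are rounded values read off the model names, and activations and the KV cache add more on top.

```python
def estimate_weight_memory_gb(num_params_billion, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params_billion * bytes_per_param

# Rounded, illustrative parameter counts taken from the model names
approx_sizes_billion = {
    "TinyLlama-1.1B": 1.1,
    "Mistral-7B": 7.0,
    "MythoMax-L2-13b": 13.0,
}

for name, size in approx_sizes_billion.items():
    fp16 = estimate_weight_memory_gb(size, bytes_per_param=2)
    fp32 = estimate_weight_memory_gb(size, bytes_per_param=4)
    print(f"{name}: ~{fp16:.1f} GB in float16, ~{fp32:.1f} GB in float32")
```

This is why the class below loads the model in `torch.float16` on GPU: it halves the weight memory compared with `torch.float32`.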


## Configure Hugging Face Online RAG Assistant

> This class mimics the functionality of the original AMAS_RAG_Assistant but uses Hugging Face online models.

In [4]:
class HuggingFaceRAGAssistant:
    """
    A RAG Assistant that uses Hugging Face online models instead of local ones.
    This makes it accessible to anyone without requiring specific local model installations.
    """
    
    def __init__(
        self,
        model_name: str = "microsoft/DialoGPT-medium",  # Default conversational model
        embedding_model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
        device: str = "auto",  # "auto", "cpu", or "cuda"
        max_new_tokens: int = 512,
        temperature: float = 0.7,
        top_k: int = 50,
        top_p: float = 0.9,
        do_sample: bool = True,
        use_auth_token: Optional[str] = None,  # Optional HF token for gated models
        verbose: bool = True
    ):
        self.model_name = model_name
        self.embedding_model_name = embedding_model_name
        self.device = self._setup_device(device)
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature
        self.top_k = top_k
        self.top_p = top_p
        self.do_sample = do_sample
        self.use_auth_token = use_auth_token
        self.verbose = verbose
        
        # Initialize components
        self.model = None
        self.tokenizer = None
        self.pipeline = None
        self.embedding_model = None
        self.vector_store = None
        self.documents = []
        
        # RAG settings
        self.rag_mode = False
        self.k = 3  # Number of documents to retrieve
        self.min_score = 0.0  # Score threshold for retrieval (0.0 disables filtering)
        
        # Load models
        self._load_language_model()
        self._load_embedding_model()
        
        if self.verbose:
            print(f"✅ HuggingFaceRAGAssistant initialized successfully!")
            print(f"📝 Language Model: {self.model_name}")
            print(f"🔍 Embedding Model: {self.embedding_model_name}")
            print(f"🖥️  Device: {self.device}")
    
    def _setup_device(self, device: str) -> str:
        """Setup the appropriate device for computation."""
        if device == "auto":
            if torch.cuda.is_available():
                return "cuda"
            else:
                return "cpu"
        return device
    
    def _load_language_model(self):
        """Load the language model and tokenizer from Hugging Face."""
        try:
            if self.verbose:
                print(f"🔄 Loading language model: {self.model_name}")
            
            # Create a text generation pipeline
            self.pipeline = pipeline(
                "text-generation",
                model=self.model_name,
                tokenizer=self.model_name,
                device=0 if self.device == "cuda" and torch.cuda.is_available() else -1,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                use_auth_token=self.use_auth_token,
                trust_remote_code=True
            )
            
            if self.verbose:
                print(f"✅ Language model loaded successfully!")
                
        except Exception as e:
            print(f"❌ Error loading language model: {e}")
            print("🔄 Falling back to a smaller model...")
            # Fallback to a smaller, more reliable model
            self.model_name = "gpt2"
            self.pipeline = pipeline(
                "text-generation",
                model=self.model_name,
                device=0 if self.device == "cuda" and torch.cuda.is_available() else -1
            )
    
    def _load_embedding_model(self):
        """Load the embedding model for vector search."""
        try:
            if self.verbose:
                print(f"🔄 Loading embedding model: {self.embedding_model_name}")
            
            # Use HuggingFaceEmbeddings for LangChain compatibility
            self.embedding_model = HuggingFaceEmbeddings(
                model_name=self.embedding_model_name,
                model_kwargs={'device': self.device},
                encode_kwargs={'normalize_embeddings': True}
            )
            
            if self.verbose:
                print(f"✅ Embedding model loaded successfully!")
                
        except Exception as e:
            print(f"❌ Error loading embedding model: {e}")
            # Fallback to a smaller embedding model
            self.embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
            self.embedding_model = HuggingFaceEmbeddings(
                model_name=self.embedding_model_name,
                model_kwargs={'device': self.device}
            )
    
    def process_pdf_to_vector_store(
        self, 
        pdf_path: str, 
        chunk_size: int = 1000, 
        chunk_overlap: int = 200
    ) -> Tuple[List[Document], Optional[FAISS]]:
        """
        Process a PDF file and create a vector store.
        
        Args:
            pdf_path: Path to the PDF file
            chunk_size: Size of text chunks
            chunk_overlap: Overlap between chunks
            
        Returns:
            Tuple of (documents, vector_store)
        """
        try:
            if self.verbose:
                print(f"🔄 Processing PDF: {pdf_path}")
            
            # Load PDF
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()
            
            if self.verbose:
                print(f"📄 Loaded {len(documents)} pages from PDF")
            
            # Split documents into chunks
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                length_function=len,
            )
            
            split_documents = text_splitter.split_documents(documents)
            self.documents = split_documents
            
            if self.verbose:
                print(f"✂️  Split into {len(split_documents)} chunks")
            
            # Create vector store
            self.vector_store = FAISS.from_documents(
                split_documents, 
                self.embedding_model
            )
            
            if self.verbose:
                print(f"✅ Vector store created successfully!")
            
            return split_documents, self.vector_store
            
        except Exception as e:
            print(f"❌ Error processing PDF: {e}")
            return [], None
    
    def query_vector_store(
        self, 
        query: str, 
        k: Optional[int] = None, 
        min_score: Optional[float] = None
    ) -> Tuple[List[Document], List[float]]:
        """
        Query the vector store for similar documents.
        
        Args:
            query: Search query
            k: Number of documents to retrieve
            min_score: Score cutoff. FAISS scores are L2 distances by default
                (lower means closer), so results scoring above this value are
                dropped; 0 disables filtering
            
        Returns:
            Tuple of (documents, scores)
        """
        if self.vector_store is None:
            print("❌ No vector store available. Please process a document first.")
            return [], []
        
        k = k or self.k
        min_score = min_score or self.min_score
        
        try:
            # Search for similar documents (scores are L2 distances: lower is closer)
            docs_and_scores = self.vector_store.similarity_search_with_score(
                query, k=k
            )
            
            # Apply the cutoff if specified, keeping only the closest matches
            if min_score > 0:
                docs_and_scores = [
                    (doc, score) for doc, score in docs_and_scores 
                    if score <= min_score
                ]
            
            docs = [doc for doc, score in docs_and_scores]
            scores = [score for doc, score in docs_and_scores]
            
            if self.verbose:
                print(f"🔍 Found {len(docs)} relevant documents")
                for i, score in enumerate(scores):
                    print(f"   Document {i+1}: Similarity score = {score:.4f}")
            
            return docs, scores
            
        except Exception as e:
            print(f"❌ Error querying vector store: {e}")
            return [], []
    
    def generate_response(self, user_input: str, use_rag: Optional[bool] = None) -> str:
        """
        Generate a response to user input, optionally using RAG.
        
        Args:
            user_input: User's question or input
            use_rag: Whether to use RAG (if None, uses self.rag_mode)
            
        Returns:
            Generated response
        """
        use_rag = use_rag if use_rag is not None else self.rag_mode
        
        # Prepare the prompt
        if use_rag and self.vector_store is not None:
            # RAG mode: retrieve relevant documents
            docs, scores = self.query_vector_store(user_input)
            
            if docs:
                # Create context from retrieved documents
                context = "\n\n".join([doc.page_content for doc in docs])
                prompt = f"""Context from documents:
{context}

Question: {user_input}

Please answer the question based on the provided context. If the context doesn't contain relevant information, you may use your general knowledge but please indicate when you're doing so.

Answer:"""
            else:
                prompt = user_input
        else:
            # Non-RAG mode: use input directly
            prompt = user_input
        
        try:
            # Generate response
            response = self.pipeline(
                prompt,
                max_new_tokens=self.max_new_tokens,
                temperature=self.temperature,
                top_k=self.top_k,
                top_p=self.top_p,
                do_sample=self.do_sample,
                pad_token_id=self.pipeline.tokenizer.eos_token_id,
                return_full_text=False
            )
            
            # Extract the generated text
            generated_text = response[0]['generated_text']
            
            return generated_text.strip()
            
        except Exception as e:
            print(f"❌ Error generating response: {e}")
            return "Sorry, I encountered an error while generating a response."
    
    def toggle_rag_mode(self):
        """Toggle RAG mode on/off."""
        self.rag_mode = not self.rag_mode
        mode = "enabled" if self.rag_mode else "disabled"
        print(f"🔄 RAG mode {mode}")
        return self.rag_mode

# Initialize the assistant. Mistral 7B is best run on a GPU; pick a smaller model from the list above for CPU-only machines
print("🚀 Initializing Hugging Face RAG Assistant...")
print("📝 Using a lightweight model suitable for online use...")


# You can change these models based on your needs and computational resources
rag_assistant = HuggingFaceRAGAssistant(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",  # Good balance of quality and speed
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",  # Fast and efficient
    device="auto",
    max_new_tokens=512,
    temperature=0.7,
    verbose=True
)

🚀 Initializing Hugging Face RAG Assistant...
📝 Using a lightweight model suitable for online use...
🔄 Loading language model: mistralai/Mistral-7B-Instruct-v0.2


Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  2.55it/s]
Device set to use cuda:0


✅ Language model loaded successfully!
🔄 Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
✅ Embedding model loaded successfully!
✅ HuggingFaceRAGAssistant initialized successfully!
📝 Language Model: mistralai/Mistral-7B-Instruct-v0.2
🔍 Embedding Model: sentence-transformers/all-MiniLM-L6-v2
🖥️  Device: cuda
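
The assistant is configured with `temperature`, `top_k`, and `top_p`. As a rough illustration of how these three settings narrow the next-token distribution, here is a toy sketch in plain Python. It is not the transformers implementation; `sample_distribution` is a hypothetical helper that returns the renormalized probabilities of the tokens that survive filtering.

```python
import math

def sample_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Apply temperature, then top-k, then top-p; return {token_index: probability}."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most likely tokens
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_distribution(logits, temperature=1.0, top_k=3, top_p=0.9))
print(sample_distribution(logits, temperature=1.0, top_k=3, top_p=0.5))
```

With `top_p=0.5` only the most likely token survives, which is why low `top_p` values make responses more deterministic.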


## Alternative Model Options

> Here are some alternative models you can try. Simply change the model names in the cell above:

### Language Models (Text Generation):
- **Lightweight & Fast**: `"gpt2"`, `"microsoft/DialoGPT-medium"`
- **Better Quality**: `"microsoft/DialoGPT-large"`, `"TinyLlama/TinyLlama-1.1B-Chat-v1.0"`
- **Advanced (requires more resources)**: `"microsoft/phi-2"`, `"mistralai/Mistral-7B-Instruct-v0.2"`

### Embedding Models:
- **Fast**: `"sentence-transformers/all-MiniLM-L6-v2"`
- **Better Quality**: `"sentence-transformers/all-mpnet-base-v2"`
- **Specialized**: `"sentence-transformers/multi-qa-mpnet-base-dot-v1"` (for Q&A)

### Using Hugging Face Account (Optional):
If you have a Hugging Face account and token, you can access more models:
```python
# Get your token from https://huggingface.co/settings/tokens
rag_assistant = HuggingFaceRAGAssistant(
    model_name="meta-llama/Llama-2-7b-chat-hf",  # Requires HF token
    use_auth_token="your_hf_token_here"
)
```
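
Rather than pasting the token into the notebook, it may be safer to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; `get_hf_token` is a hypothetical helper, not part of the Hugging Face libraries.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None

hf_token = get_hf_token()
print("Token found" if hf_token else "No token set; only public models will load")

# The result can be passed straight to the assistant defined above:
# rag_assistant = HuggingFaceRAGAssistant(
#     model_name="meta-llama/Llama-2-7b-chat-hf",
#     use_auth_token=hf_token,
# )
```

Because the class treats `use_auth_token=None` as "no token", the same cell works whether or not the variable is set.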

## Load and Process Documents

> Now let's load a PDF document and create a vector store for RAG operations.

In [5]:
# Configure document processing
pdf_file_path = "../data/CUI_SPEC.pdf"  # Adjust path as needed

# Check if file exists
if os.path.exists(pdf_file_path):
    print(f"📁 Found PDF file: {pdf_file_path}")
    
    # Process the PDF and create vector store
    documents, vector_store = rag_assistant.process_pdf_to_vector_store(
        pdf_path=pdf_file_path,
        chunk_size=1000,  # Adjust based on your needs
        chunk_overlap=200
    )
    
    print(f"\n📊 Processing Results:")
    print(f"   📄 Total documents/chunks: {len(documents)}")
    print(f"   🔍 Vector store created: {'✅ Yes' if vector_store else '❌ No'}")
    
else:
    print(f"❌ PDF file not found: {pdf_file_path}")
    print("📝 Please ensure the file exists or update the path.")
    print("🔄 You can also upload your own PDF file to the data folder.")

📁 Found PDF file: ../data/CUI_SPEC.pdf
🔄 Processing PDF: ../data/CUI_SPEC.pdf
📄 Loaded 29 pages from PDF
✂️  Split into 86 chunks
✅ Vector store created successfully!

📊 Processing Results:
   📄 Total documents/chunks: 86
   🔍 Vector store created: ✅ Yes
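
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window splitter. It is only an approximation for illustration: `RecursiveCharacterTextSplitter` also tries to break on paragraph and sentence separators, which this hypothetical `split_with_overlap` helper does not.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of more chunks to embed and store.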


## Examine the Processed Documents

> Let's take a look at what documents were created from the PDF processing.

In [6]:
if documents:
    print(f"📚 Examining processed documents:")
    print(f"   Total chunks: {len(documents)}")
    
    # Show first few document chunks
    num_to_show = min(3, len(documents))
    
    for i, doc in enumerate(documents[:num_to_show]):
        print(f"\n📄 Document Chunk {i + 1}:")
        print(f"   Content length: {len(doc.page_content)} characters")
        print(f"   Preview: {doc.page_content[:200]}...")
        if hasattr(doc, 'metadata') and doc.metadata:
            print(f"   Metadata: {doc.metadata}")
        print("-" * 80)
else:
    print("❌ No documents to examine. Please process a PDF first.")

📚 Examining processed documents:
   Total chunks: 86

📄 Document Chunk 1:
   Content length: 956 characters
   Preview: AVAILABLE ONLINE AT: INITIATED BY: 
www.directives.doe.gov Office of the Chief Information Officer 
U.S. Department of Energy ORDE R 
Washington, DC 
 Approved: 2-3-2022 
SUBJECT: CONTROLLED UNCLASSIF...
   Metadata: {'source': '../data/CUI_SPEC.pdf', 'page': 0}
--------------------------------------------------------------------------------

📄 Document Chunk 2:
   Content length: 941 characters
   Preview: the Atomic Energy Act of 1954 (42 U.S.C. 2011, et seq.), as amended. This Directive
implements the requirements in EO 13556, Controlled Unclassified Information, and 32
CFR part 2002, Controlled Uncla...
   Metadata: {'source': '../data/CUI_SPEC.pdf', 'page': 0}
--------------------------------------------------------------------------------

📄 Document Chunk 3:
   Content length: 952 characters
   Preview: commitment is modified to either eliminate requirements th

## Test Vector Store Queries

> Let's test how well our vector store can find relevant documents for specific queries.

In [7]:
if rag_assistant.vector_store is not None:
    # Test queries related to the CUI_SPEC.pdf document
    test_queries = [
        "What is CUI Specified?",
        "Tell me about sending CUI via email to accounts outside of Federal IT",
        "What are the handling requirements for CUI?",
        "What is controlled unclassified information?"
    ]
    
    print("🔍 Testing vector store queries:")
    print("=" * 60)
    
    for i, query in enumerate(test_queries, 1):
        print(f"\n🔍 Query {i}: {query}")
        print("-" * 40)
        
        # Search for relevant documents
        docs, scores = rag_assistant.query_vector_store(
            query=query,
            k=2,  # Get top 2 most relevant documents
            min_score=0.0
        )
        
        if docs:
            for j, (doc, score) in enumerate(zip(docs, scores)):
                print(f"📄 Result {j + 1} (Score: {score:.4f}):")
                # Show first 300 characters of the document
                preview = doc.page_content[:300].replace('\n', ' ')
                print(f"   {preview}...")
                print()
        else:
            print("   ❌ No relevant documents found")
        
        print("-" * 60)
else:
    print("❌ Vector store not available. Please process a document first.")

🔍 Testing vector store queries:

🔍 Query 1: What is CUI Specified?
----------------------------------------
🔍 Found 2 relevant documents
   Document 1: Similarity score = 0.6325
   Document 2: Similarity score = 0.6962
📄 Result 1 (Score: 0.6325):
   Attachment 2  Page 2-2  DOE O 471.7  2-3-2022  9. CUI Specified. This is the subset of CUI in which the authorizing law, regulation, or Government-wide policy contains specific handling controls that it requires or permits agencies to use that differ from those for CUI Basic. 10. CUI Marking Handboo...

📄 Result 2 (Score: 0.6962):
   3 DOE O 471.7  2-3-2022  requirements. However, authorized holders may only handle CUI when  furthering a LGP. At DOE, an authorized holder’s LGP may be defined by  DOE policy, position descriptions, or contractual requirements.   (4) For information to be identified as CUI, it must be designated as...

------------------------------------------------------------

🔍 Query 2: Tell me about sending CUI via email 
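
Note that the scores above increase from the first result to the second even though results are ranked best-first. This is consistent with LangChain's default FAISS index, which reports an L2 distance (lower is closer) rather than a similarity. Assuming the index returns squared L2 distance and the embeddings are unit-normalized (the class sets `normalize_embeddings=True`), a distance can be converted to cosine similarity with the identity ||a - b||² = 2 - 2·cos(a, b). The helper names below are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(dist_sq):
    """For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d^2 / 2."""
    return 1.0 - dist_sq / 2.0

a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 1.0, 3.0])

direct_cosine = sum(x * y for x, y in zip(a, b))
recovered = cosine_from_squared_l2(squared_l2(a, b))
print(f"direct: {direct_cosine:.6f}, recovered from distance: {recovered:.6f}")
```

Under those assumptions, the score of 0.6325 reported for the first result would correspond to a cosine similarity of about 0.68.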

## Create Interactive Gradio Interface

> Now let's create a user-friendly Gradio interface to interact with our RAG assistant.

In [8]:
def create_rag_interface():
    """Create a Gradio interface for the RAG assistant."""
    
    def chat_with_assistant(message, use_rag, temperature, top_k, top_p, max_tokens, k_docs):
        """Handle chat interactions."""
        # Update assistant parameters
        rag_assistant.temperature = temperature
        rag_assistant.top_k = int(top_k)
        rag_assistant.top_p = top_p
        rag_assistant.max_new_tokens = int(max_tokens)
        rag_assistant.k = int(k_docs)
        
        # Generate response
        response = rag_assistant.generate_response(message, use_rag=use_rag)
        
        # Add mode indicator
        mode = "🔍 RAG Mode" if use_rag else "🤖 Standard Mode"
        return f"{mode}\n\n{response}"
    
    def process_new_pdf(pdf_file, chunk_size, chunk_overlap):
        """Process a new PDF file."""
        if pdf_file is None:
            return "❌ Please upload a PDF file."
        
        try:
            # gr.File hands the upload over as a path to a temporary file
            # (some Gradio versions wrap it in an object whose .name is that path)
            pdf_path = pdf_file if isinstance(pdf_file, str) else pdf_file.name
            
            # Process the PDF straight from that path
            documents, vector_store = rag_assistant.process_pdf_to_vector_store(
                pdf_path=pdf_path,
                chunk_size=int(chunk_size),
                chunk_overlap=int(chunk_overlap)
            )
            
            if documents:
                return f"✅ Successfully processed PDF!\n📄 Created {len(documents)} document chunks.\n🔍 Vector store ready for RAG queries."
            else:
                return "❌ Failed to process PDF. Please try again."
                
        except Exception as e:
            return f"❌ Error processing PDF: {str(e)}"
    
    # Create the Gradio interface
    with gr.Blocks(title="RAG Assistant - Hugging Face Models", theme=gr.themes.Soft()) as interface:
        gr.Markdown("# 🤖 RAG Assistant with Hugging Face Models")
        gr.Markdown("This interface allows you to chat with an AI assistant that can use Retrieval Augmented Generation (RAG) to answer questions based on your documents.")
        
        with gr.Row():
            # Left column - Controls
            with gr.Column(scale=1):
                gr.Markdown("## ⚙️ Settings")
                
                # RAG toggle
                rag_enabled = gr.Checkbox(
                    label="🔍 Enable RAG Mode",
                    value=False,
                    info="Use document knowledge for responses"
                )
                
                # Generation parameters
                gr.Markdown("### 🎛️ Generation Parameters")
                temperature = gr.Slider(0.1, 2.0, value=0.7, label="Temperature", info="Creativity level")
                top_k = gr.Slider(1, 100, value=50, label="Top K", info="Token selection diversity")
                top_p = gr.Slider(0.1, 1.0, value=0.9, label="Top P", info="Nucleus sampling")
                max_tokens = gr.Slider(50, 1000, value=512, label="Max Tokens", info="Response length")
                
                # RAG parameters
                gr.Markdown("### 🔍 RAG Parameters")
                k_docs = gr.Slider(1, 10, value=3, label="K Documents", info="Number of docs to retrieve")
                
                # Document upload
                gr.Markdown("### 📄 Document Management")
                pdf_upload = gr.File(label="Upload PDF", file_types=[".pdf"])
                chunk_size = gr.Number(value=1000, label="Chunk Size")
                chunk_overlap = gr.Number(value=200, label="Chunk Overlap")
                process_btn = gr.Button("📤 Process PDF")
                process_status = gr.Textbox(label="Processing Status", interactive=False)
            
            # Right column - Chat
            with gr.Column(scale=2):
                gr.Markdown("## 💬 Chat Interface")
                
                # Chat interface
                chatbot = gr.Chatbot(height=400, label="Conversation", type='messages')
                user_input = gr.Textbox(
                    label="Your Message",
                    placeholder="Ask a question or chat with the assistant...",
                    lines=2
                )
                
                with gr.Row():
                    send_btn = gr.Button("📤 Send", variant="primary")
                    clear_btn = gr.Button("🗑️ Clear Chat")
                
                # Sample questions
                gr.Markdown("### 💡 Sample Questions (for CUI_SPEC.pdf)")
                sample_questions = [
                    "What is CUI?",
                    "How should CUI be handled when sending emails?",
                    "What are the marking requirements for CUI?",
                    "Explain the safeguarding requirements for CUI."
                ]
                
                for question in sample_questions:
                    gr.Button(question, size="sm").click(
                        lambda q=question: q, outputs=user_input
                    )
        
        # Event handlers
        def respond(message, history, use_rag, temp, top_k_val, top_p_val, max_tok, k_val):
            if not message:
                return history, ""
            
            # Get response from assistant
            response = chat_with_assistant(message, use_rag, temp, top_k_val, top_p_val, max_tok, k_val)
            
            # Update chat history
            history.append({"role": "user", "content": message})
            history.append({"role": "assistant", "content": response})
            return history, ""
        
        # Connect events
        send_btn.click(
            respond,
            inputs=[user_input, chatbot, rag_enabled, temperature, top_k, top_p, max_tokens, k_docs],
            outputs=[chatbot, user_input]
        )
        
        user_input.submit(
            respond,
            inputs=[user_input, chatbot, rag_enabled, temperature, top_k, top_p, max_tokens, k_docs],
            outputs=[chatbot, user_input]
        )
        
        clear_btn.click(lambda: ([], ""), outputs=[chatbot, user_input])
        
        process_btn.click(
            process_new_pdf,
            inputs=[pdf_upload, chunk_size, chunk_overlap],
            outputs=process_status
        )
    
    return interface

# Create and display the interface
print("🎨 Creating Gradio interface...")
rag_interface = create_rag_interface()
print("✅ Interface created successfully!")

🎨 Creating Gradio interface...
✅ Interface created successfully!


## Launch the Interactive Application

> **Note**: The app launches on port 7860, as set by `server_port` in the cell below. You can use it to:
> 1. Toggle between RAG mode and standard mode
> 2. Adjust generation parameters (temperature, top-k, top-p)
> 3. Upload and process new PDF documents
> 4. Ask questions and compare responses with and without RAG

### How to use the interface:

1. **Standard Mode**: Ask general questions using the model's built-in knowledge
2. **RAG Mode**: Enable RAG to use document knowledge for responses
3. **Upload PDFs**: Use the upload section to process your own documents
4. **Adjust Parameters**: Fine-tune the model's behavior using the sliders

### Tips for best results:
- Start with RAG disabled to see baseline responses
- Enable RAG and ask the same questions to see the difference
- Try the sample questions provided for the CUI document
- Experiment with different parameter settings
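
The launch cell below pins the server to port 7860. If that port is still held by an earlier run, the launch may fail. A minimal sketch of one workaround, using only the standard library, is shown here. `find_free_port` and its default range are hypothetical helpers for illustration and are not part of this notebook or of Gradio.

```python
import socket

def find_free_port(start=7860, end=7960, host="127.0.0.1"):
    """Return the first port in [start, end) that can be bound on host."""
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                # bind() raises OSError if the port is already in use
                sock.bind((host, port))
            except OSError:
                continue
            return port
    raise RuntimeError(f"No free port found in range {start}-{end - 1}")

port = find_free_port()
print(f"Free port found: {port}")
```

The result could then be passed as `server_port=port` in the launch call. There is a small window between the check and the actual launch in which another process could take the port, so treat this as a convenience rather than a guarantee.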

In [11]:
# Launch the Gradio interface
if rag_interface is not None:
    print("🚀 Launching RAG Assistant interface...")
    print("📱 The interface will open in a new tab/window")
    print("🔗 You can also access it through the provided local URL")
    
    # Launch with sharing enabled for broader access
    rag_interface.launch(
        share=True,  # Creates a public link for 72 hours
        server_port=7860,  # Default Gradio port
        debug=False,
        show_error=True,
        quiet=False
    )
else:
    print("❌ Failed to create interface")

🚀 Launching RAG Assistant interface...
📱 The interface will open in a new tab/window
🔗 You can also access it through the provided local URL
* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://3c4fff20c84badceda.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


🔍 Found 3 relevant documents
   Document 1: Similarity score = 0.8217
   Document 2: Similarity score = 0.9151
   Document 3: Similarity score = 0.9331


In [10]:
# If you need to stop the interface, run this cell
# (skips cleanly if the interface was never created or is None)
if globals().get('rag_interface') is not None:
    rag_interface.close()
    print("🛑 Interface stopped")

Closing server running on port: 7860
🛑 Interface stopped


## Test RAG vs Non-RAG Responses

> Let's directly compare responses with and without RAG to see the difference.

In [None]:
if rag_assistant.vector_store is not None:
    # Test questions
    test_questions = [
        "What is CUI?",
        "How should I handle CUI when sending emails outside the organization?",
        "What are the marking requirements for controlled unclassified information?"
    ]
    
    print("🔬 Comparing RAG vs Non-RAG responses:")
    print("=" * 80)
    
    for i, question in enumerate(test_questions, 1):
        print(f"\n❓ Question {i}: {question}")
        print("-" * 60)
        
        # Get response without RAG
        print("🤖 Standard Response (No RAG):")
        standard_response = rag_assistant.generate_response(question, use_rag=False)
        print(f"   {standard_response}")
        
        print("\n🔍 RAG-Enhanced Response:")
        rag_response = rag_assistant.generate_response(question, use_rag=True)
        print(f"   {rag_response}")
        
        print("\n" + "=" * 80)
else:
    print("❌ Vector store not available. Please process a document first to compare RAG vs non-RAG responses.")

## System Information and Troubleshooting

> Check system resources and get troubleshooting information.

In [None]:
import psutil
import platform

def get_system_info():
    """Get system information for troubleshooting."""
    print("🖥️  System Information:")
    print(f"   Platform: {platform.platform()}")
    print(f"   Python version: {platform.python_version()}")
    print(f"   CPU cores: {psutil.cpu_count()}")
    print(f"   RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")
    print(f"   Available RAM: {psutil.virtual_memory().available / (1024**3):.1f} GB")
    
    if torch.cuda.is_available():
        print(f"   GPU: {torch.cuda.get_device_name(0)}")
        print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.1f} GB")
    else:
        print("   GPU: Not available (using CPU)")
    
    print("\n📦 Key Package Versions:")
    print(f"   torch: {torch.__version__}")
    # Catch only ImportError so unrelated failures are not silently hidden
    try:
        import transformers
        print(f"   transformers: {transformers.__version__}")
    except ImportError:
        print("   transformers: Not installed")
    
    try:
        import sentence_transformers
        print(f"   sentence-transformers: {sentence_transformers.__version__}")
    except ImportError:
        print("   sentence-transformers: Not installed")
    
    try:
        import langchain
        print(f"   langchain: {langchain.__version__}")
    except ImportError:
        print("   langchain: Not installed")

# Display system information
get_system_info()

# Show current model status
print(f"\n🤖 Current Model Status:")
print(f"   Language Model: {rag_assistant.model_name}")
print(f"   Embedding Model: {rag_assistant.embedding_model_name}")
print(f"   Device: {rag_assistant.device}")
print(f"   Vector Store: {'✅ Loaded' if rag_assistant.vector_store else '❌ Not loaded'}")
print(f"   Documents: {len(rag_assistant.documents)} chunks")
print(f"   RAG Mode: {'✅ Enabled' if rag_assistant.rag_mode else '❌ Disabled'}")

## Troubleshooting Guide

### Common Issues and Solutions:

1. **Out of Memory Errors**:
   - Use smaller models (e.g., "gpt2" instead of larger models)
   - Reduce `max_new_tokens` parameter
   - Use CPU instead of GPU if GPU memory is limited

2. **Slow Response Times**:
   - Use smaller, faster models
   - Reduce chunk sizes when processing documents
   - Lower the number of retrieved documents (k parameter)

3. **Model Not Found Errors**:
   - Check your internet connection
   - Verify model names are correct
   - Some models may require Hugging Face authentication

4. **Poor RAG Performance**:
   - Try different embedding models
   - Adjust chunk sizes (smaller chunks for precise, fact-style questions; larger chunks when answers need more surrounding context)
   - Experiment with different similarity thresholds

5. **Authentication Issues**:
   - Get a free Hugging Face account and token
   - Set the `use_auth_token` parameter when initializing the assistant
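
To make the chunk-size advice in item 4 more concrete, here is a simplified sketch of a character-window splitter with overlap. It is an illustration only and not necessarily the splitter this notebook uses. `split_with_overlap` and the sample text are made up for the example.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Placeholder text (2000 characters) standing in for a processed document
sample_text = "word " * 400

for size, overlap in [(200, 20), (500, 50), (1000, 100)]:
    chunks = split_with_overlap(sample_text, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(chunks)} chunks")
```

Smaller windows produce more, narrower chunks, which tends to help retrieval for specific questions, while larger windows keep more surrounding context in each retrieved chunk at the cost of a longer prompt.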

### Performance Tips:
- Start with lightweight models and upgrade as needed
- Use GPU acceleration when available
- Process documents in smaller batches for large files
- Cache models locally for faster subsequent loads
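
The batching tip above can be sketched as a small generator. This is a minimal illustration under assumptions: `iter_batches` is a hypothetical helper, and the list of strings stands in for whatever chunk objects your document processing step produces.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the chunks of a large document
all_chunks = [f"chunk {i}" for i in range(10)]

for batch_number, batch in enumerate(iter_batches(all_chunks, batch_size=4), 1):
    # In practice, this is where each batch would be embedded and indexed
    print(f"Batch {batch_number}: {len(batch)} chunks")
```

Processing one batch at a time keeps peak memory bounded by the batch size rather than by the size of the whole file.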

In [19]:
# Clean up resources (optional)
def cleanup_resources():
    """Clean up memory and resources."""
    import gc
    
    # Clear CUDA cache if using GPU
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Force garbage collection
    gc.collect()
    
    print("🧹 Resources cleaned up")

# Run the cleanup (comment out the line below if you want to skip it)
cleanup_resources()

🧹 Resources cleaned up
