# LLM Exercise with RAG

This notebook demonstrates how to use a language model with a retrieval-augmented generation (RAG) approach. The goal is to retrieve relevant documents from a knowledge base and use them to generate a response to a user query.

## Import Libraries

In [17]:
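!pip install accelerate bitsandbytes transformers langchain langchain-community langchain-text-splitters sentence-transformers faiss-cpu ipywidgets tqdm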



In [18]:
# Import libraries to pull a small LLM from Hugging Face and set it up with RAG
import glob
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# For quantization to reduce memory footprint
# (imported here so that a missing install fails early)
import bitsandbytes as bnb
import accelerate

# For RAG components. Newer LangChain releases ship these in separate packages,
# so fall back to the legacy import paths if those packages are not installed.
try:
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.text_splitter import RecursiveCharacterTextSplitter

# For progress bars
from tqdm.notebook import tqdm

## Small LLMs for Local Inference

For running inference locally with reasonable performance, there are several excellent small LLMs to consider:

1. **Phi-2** (2.7B parameters) - Microsoft's model with excellent reasoning for its size
2. **TinyLlama** (1.1B parameters) - Extremely compact with decent performance
3. **Mistral-7B-Instruct** (7B parameters) - Great performance/size trade-off
4. **Llama-2-7B** (7B parameters) - Good general-purpose model that can also be quantized

These models can be loaded with quantization (4-bit or 8-bit) to reduce memory requirements while retaining most of the output quality.

## Load a Small LLM for Local Inference

Below are examples of how to load different small LLMs with quantization for efficient inference.

In [20]:
# Load the model and tokenizer
from transformers import BitsAndBytesConfig

def load_phi2_model():
    """Load Microsoft's Phi-2 (2.7B parameters) - excellent small model"""
    model_id = "microsoft/phi-2"

    # Load with 4-bit quantization to reduce memory usage.
    # A BitsAndBytesConfig replaces the deprecated load_in_4bit=True argument.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        trust_remote_code=True
    )
    return model, tokenizer

def load_tinyllama_model():
    """Load TinyLlama (1.1B parameters) - extremely compact"""
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        # Use 8-bit quantization
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
    return model, tokenizer

def load_mistral_model():
    """Load Mistral-7B-Instruct - very efficient for its size"""
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        # Use 4-bit quantization for larger models
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )
    return model, tokenizer


In [21]:
# Choose and load your preferred model
# Uncomment the model you want to use

# Option 1: Phi-2 (2.7B) - Best performance/size ratio
model, tokenizer = load_phi2_model()
print("Phi-2 model loaded successfully!")

# Option 2: TinyLlama (1.1B) - Smallest size
# model, tokenizer = load_tinyllama_model()
# print("TinyLlama model loaded successfully!")

# Option 3: Mistral (7B) - Best overall performance
# model, tokenizer = load_mistral_model()
# print("Mistral-7B model loaded successfully!")

# Create a text generation pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # required for temperature and top_p to take effect
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

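If this cell fails with the error below, `accelerate` is not visible to the running kernel. Install it (see the first cell), then restart the kernel and rerun the notebook from the top.

ValueError: Using a `device_map` or `tp_plan` requires `accelerate`. You can install it with `pip install accelerate`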

## Model Comparison and Requirements

| Model | Parameters | Quantized Size | Min RAM | VRAM (GPU) | Performance | Best For |
|-------|------------|----------------|---------|------------|------------|----------|
| Phi-2 | 2.7B | ~1.5-2GB | 4GB | 2GB | Very Good | Balanced use cases |
| TinyLlama | 1.1B | ~700MB | 2GB | 1GB | Moderate | Resource-constrained devices |
| Mistral-7B | 7B | ~3.5-4GB | 8GB | 4GB | Excellent | When performance matters more |

Notes:
- 4-bit quantization reduces weight memory by ~75% relative to 16-bit weights, with minor quality loss
- 8-bit quantization reduces weight memory by ~50% relative to 16-bit weights, with minimal quality loss
- CPU inference is possible but much slower; a GPU is recommended
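
The percentages in these notes follow from simple arithmetic on bits per weight. The sketch below is a rough estimate only: the helper name `estimate_weight_memory_gb` is hypothetical, the parameter counts are the ones from the table, and the result covers the weights alone (activations, the KV cache, and quantization metadata add to it, which is one reason the table's figures run somewhat higher).

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9

# Parameter counts taken from the comparison table (illustrative values).
param_counts = {"TinyLlama": 1.1e9, "Phi-2": 2.7e9, "Mistral-7B": 7.0e9}

for name, params in param_counts.items():
    fp16 = estimate_weight_memory_gb(params, 16)
    int8 = estimate_weight_memory_gb(params, 8)
    int4 = estimate_weight_memory_gb(params, 4)
    print(f"{name}: 16-bit ~{fp16:.2f} GB, 8-bit ~{int8:.2f} GB, 4-bit ~{int4:.2f} GB")
```

Going from 16 to 4 bits per weight divides the weight memory by four, which is where the ~75% reduction comes from; 8 bits halves it.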

## Set Up RAG with Local Documents

Now we'll set up a Retrieval-Augmented Generation (RAG) system over your local text files (for example, text extracted from PDFs), stored as `.txt` files in the `data` directory.
This lets the model ground its answers in your document collection.

In [None]:
# Load text files from the data directory
def load_documents(directory="data"):
    """Load all text files from the specified directory"""
    txt_files = glob.glob(os.path.join(directory, "*.txt"))
    print(f"Found {len(txt_files)} text files in {directory}")
    
    documents = []
    for file_path in tqdm(txt_files, desc="Loading documents"):
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()
            documents.append({"content": content, "source": os.path.basename(file_path)})
    
    return documents

# Load and preprocess documents
documents = load_documents()
print(f"Loaded {len(documents)} documents")

In [None]:
# Split documents into chunks for processing
def split_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into manageable chunks"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    chunks = []
    for doc in tqdm(documents, desc="Splitting documents"):
        doc_chunks = text_splitter.split_text(doc["content"])
        for chunk in doc_chunks:
            chunks.append({"content": chunk, "source": doc["source"]})
    
    return chunks

# Split documents into manageable chunks
chunks = split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
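
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and sentences first); the function `sliding_window_chunks` is a hypothetical stand-in that only illustrates the two parameters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in demo_chunks:
    print(piece)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.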

In [None]:
# Load a small embedding model
def setup_embeddings_and_vectorstore(chunks):
    """Set up embeddings and vector storage for document retrieval"""
    # Use a small, efficient embedding model
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    
    # Extract chunk texts and their metadata
    texts = [chunk["content"] for chunk in chunks]
    metadatas = [{"source": chunk["source"]} for chunk in chunks]
    
    # Create FAISS vector store
    vectorstore = FAISS.from_texts(
        texts=texts,
        embedding=embeddings,
        metadatas=metadatas
    )
    
    return vectorstore

# Set up vector store for retrieval
vectorstore = setup_embeddings_and_vectorstore(chunks)
print("Vector store created successfully!")
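
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The toy sketch below uses hand-written 2-D vectors and Euclidean (L2) distance; both are assumptions for illustration, since real embeddings have hundreds of dimensions and the actual metric depends on how the index is configured. The names `l2_distance`, `top_k`, and `toy_index` are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k=3):
    """Return the k entries of indexed (vector, text, metadata) closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[0]))
    return ranked[:k]

# Made-up embeddings: the first axis loosely means "firewall", the second "printer".
toy_index = [
    ([1.0, 0.0], "firewall setup steps", {"source": "a.txt"}),
    ([0.0, 1.0], "printer troubleshooting", {"source": "b.txt"}),
    ([0.9, 0.1], "firewall interface configuration", {"source": "c.txt"}),
]

hits = top_k([1.0, 0.05], toy_index, k=2)
for vector, text, metadata in hits:
    print(metadata["source"], "-", text)
```

A brute-force sort like this is fine for a handful of vectors; an index library such as FAISS exists to make the same lookup fast over large collections.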

## RAG Inference with Local LLM

Now we can use our local LLM with the RAG setup to answer questions based on the document collection.

In [None]:
def rag_query(query, num_chunks=3, generator=generator):
    """Query the RAG system with a user question"""
    # Retrieve relevant chunks
    retrieved_docs = vectorstore.similarity_search(query, k=num_chunks)
    retrieved_text = "\n\n".join([doc.page_content for doc in retrieved_docs])
    # Deduplicate sources while keeping retrieval order
    sources = list(dict.fromkeys(doc.metadata["source"] for doc in retrieved_docs))
    
    # Create context-enriched prompt
    prompt = f"""Context information is below.
---------------------
{retrieved_text}
---------------------
Given the context information and no prior knowledge, answer the following question: {query}
"""
    
    # Generate response using the local LLM.
    # return_full_text=False returns only the completion, which is more robust
    # than stripping the prompt from the output with str.replace.
    response = generator(
        prompt,
        max_new_tokens=512,
        return_full_text=False
    )[0]["generated_text"]
    
    return {
        "response": response.strip(),
        "sources": sources
    }
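
Small models have limited context windows, so raising `num_chunks` can push the prompt past what the model accepts. One option is to build the prompt in a separate helper that caps the context length and labels each chunk with its source. The sketch below is one possible variant, not part of the pipeline above: `build_prompt` and `max_context_chars` are hypothetical names, and a character budget is only a crude proxy for a token budget.

```python
def build_prompt(retrieved, query, max_context_chars=3000):
    """Assemble a RAG prompt from (text, source) pairs, stopping at a character budget."""
    parts = []
    used = 0
    for text, source in retrieved:
        block = f"[{source}]\n{text}"
        # Always keep the first chunk; drop later ones that would exceed the budget.
        if parts and used + len(block) > max_context_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and no prior knowledge, "
        f"answer the following question: {query}\n"
    )

example_prompt = build_prompt(
    [("Step 1: connect to the console port.", "asa_guide.txt"),
     ("Printers need a driver.", "printers.txt")],
    "How do I start configuring the firewall?",
)
print(example_prompt)
```

Because the helper is a pure function, it can be checked without loading a model, for example by asserting that a low budget keeps only the highest-ranked chunk.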

In [None]:
# Example query
query = "How do I set up a Cisco ASA 5505 firewall?"
result = rag_query(query)

print("Question:", query)
print("\nAnswer:")
print(result["response"])
print("\nSources:")
for source in result["sources"]:
    print(f"- {source}")

## Interactive RAG Query Interface

Use the cell below to query your document collection interactively.

In [None]:
from ipywidgets import widgets
from IPython.display import display

# Create input widget
query_input = widgets.Text(
    value='',
    placeholder='Type your question here...',
    description='Query:',
    layout=widgets.Layout(width='80%')
)

# Create output widget
output = widgets.Output()

# Define submit button callback
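def on_submit_clicked(b):
    with output:
        output.clear_output()
        question = query_input.value.strip()
        # Skip retrieval and generation when nothing was entered
        if not question:
            print("Please enter a question.")
            return
        print("Processing query...")
        result = rag_query(question)
        print("\n\nAnswer:")
        print(result["response"])
        print("\nSources:")
        for source in result["sources"]:
            print(f"- {source}")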

# Create and configure submit button
submit_button = widgets.Button(
    description='Submit',
    button_style='primary'
)
submit_button.on_click(on_submit_clicked)

# Display the interface
display(query_input, submit_button, output)