# 01 – Naive RAG from Scratch

**Learning Goals:**
- Build a minimal RAG pipeline without frameworks
- Understand TF-IDF vectorization and cosine similarity
- See the core retrieve-then-generate pattern

**What we'll build:**
1. Load text documents from `./data/raw_text/`
2. Chunk them (fixed length with overlap)
3. Vectorize using scikit-learn TF-IDF
4. Retrieve top-k chunks by cosine similarity
5. Generate answers using an LLM with retrieved context

**Persistence:**
- Vectorizer, matrix, and chunk metadata → `./artifacts/naive_tfidf/`


In [None]:
#  Global Config & Services (using centralized modules)

import os
import sys
from pathlib import Path

from dotenv import load_dotenv

# Get the notebook's current directory and find project root
notebook_dir = Path.cwd()
if notebook_dir.name == "notebooks":
    project_root = notebook_dir.parent
else:
    project_root = notebook_dir

# Change to project root and add to path
os.chdir(project_root)
sys.path.insert(0, str(project_root))

print(f" Working directory: {os.getcwd()}")

from src.services.llm_services import (
    load_config,
    get_llm,
    validate_api_keys,
    print_config_summary
)

# Load environment variables
load_dotenv()

# Load configuration from config.yaml (now we're in project root)
config = load_config("src/config/config.yaml")

# Validate API keys
validate_api_keys(config, verbose=True)

# Print summary
print_config_summary(config)


📂 Working directory: /Users/machinelearningzuu/Dropbox/Zuu Crew/Courses/🚧 AI Engineer Essentials/Live Classes/Week 03
✅ Config loaded:
  LLM: openrouter (openai/gpt-4o-mini)
  Embeddings: sbert / sentence-transformers/all-MiniLM-L6-v2
  Temperature: 0.2
  Artifacts: ./artifacts




In [None]:
# Initialize LLM using factory from llm_services
llm = get_llm(config)
print(f" LLM initialized: {config['llm_provider']} / {config['llm_model']}")

# Verify API key with test completion
print("\n Testing API connection...")
try:
    test_response = llm.invoke("Say 'API working!' if you can read this.")
    test_msg = test_response.content if hasattr(test_response, 'content') else str(test_response)
    print(f" API key verified: {test_msg[:50]}")
except Exception as e:
    print(f" API key test failed: {e}")
    print("  Please check your .env file and API key configuration.")


✅ LLM initialized: openrouter / gpt-4o-mini

🔍 Testing API connection...
✅ API key verified: API working!


---

## Step 1: Load Text Documents

We'll load all `.txt` files from `./data/raw_text/`. If the folder is empty, we'll create sample documents.


In [None]:
text_dir = Path(config["data_root"]) / "raw_text"
text_dir.mkdir(parents=True, exist_ok=True)

# Load documents
doc_files = list(text_dir.glob("*.txt"))

if len(doc_files) == 0:
    print("  No .txt files found. Creating sample documents...")
    
    samples = [
        ("doc1.txt", """Eczema, also called atopic dermatitis, is a chronic inflammatory skin condition marked by itchy, dry, and red skin. 
It affects 10-20% of children and often has a genetic component involving skin barrier dysfunction. Common triggers include allergens, 
irritants, stress, and environmental factors. Treatment focuses on moisturizing, avoiding triggers, and using topical anti-inflammatory 
medications when needed."""),
        
        ("doc2.txt", """Psoriasis is an autoimmune skin condition causing thick, silvery scales and red plaques due to rapid skin cell turnover. 
It can affect the scalp, elbows, knees, and other areas. Treatment options include topical corticosteroids, calcipotriol ointment, 
phototherapy, and systemic medications for moderate to severe cases. Triggers may include stress, infections, and certain medications."""),
        
        ("doc3.txt", """Fungal skin infections like ringworm (tinea) cause circular, red, scaly patches on the skin. They are treated with 
topical antifungal creams such as terbinafine, or oral antifungals for severe cases. It's important to keep affected areas dry and clean. 
Athlete's foot and jock itch are common forms of tinea infection affecting feet and groin respectively."""),
    ]
    
    for fname, content in samples:
        # Write with the same encoding that is used when reading the files back
        (text_dir / fname).write_text(content, encoding="utf-8")
    
    doc_files = list(text_dir.glob("*.txt"))

documents = []
for fpath in doc_files:
    documents.append({
        "source": fpath.name,
        "content": fpath.read_text(encoding="utf-8")
    })

print(f" Loaded {len(documents)} documents")
for doc in documents:
    print(f"  - {doc['source']}: {len(doc['content'])} chars")


✅ Loaded 2 documents
  - Understanding Skin Diseases.txt: 6019 chars
  - skin-care habits.txt: 6585 chars


---

## Step 2: Chunk Documents

We'll split documents into fixed-size chunks (~800 chars) with overlap (~150 chars) for context continuity.


In [None]:
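def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """
    Split text into overlapping chunks.

    Args:
        text: Input text to chunk
        chunk_size: Maximum size of each chunk in characters
        overlap: Number of characters to overlap between chunks

    Returns:
        List of text chunks

    Raises:
        ValueError: If overlap is negative or not smaller than chunk_size
    """
    if not 0 <= overlap < chunk_size:
        # A zero or negative stride would make the loop below run forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    stride = chunk_size - overlap  # how far the window advances each step

    while start < len(text):
        chunk = text[start:start + chunk_size]

        # Skip windows that contain only whitespace
        if chunk.strip():
            chunks.append(chunk)
        start += stride
    return chunks
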
# Create chunks with metadata
all_chunks = []
for doc in documents:
    text_chunks = chunk_text(doc["content"])
    for i, chunk in enumerate(text_chunks):
        all_chunks.append({
            "chunk_id": len(all_chunks),
            "source": doc["source"],
            "chunk_idx": i,
            "text": chunk.strip()
        })

print(f" Created {len(all_chunks)} chunks")
print(f"\nExample chunk:")
print(f"  ID: {all_chunks[0]['chunk_id']}")
print(f"  Source: {all_chunks[0]['source']}")
print(f"  Text: {all_chunks[0]['text'][:100]}...")


✅ Created 21 chunks

Example chunk:
  ID: 0
  Source: Understanding Skin Diseases.txt
  Text: Sure - here's a detailed and comprehensive overview of skin diseases, written in an informative, med...
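
The chunk count above can be checked by hand. The sketch below is not part of the notebook's pipeline; `chunk_starts` is a hypothetical helper that lists the start offsets a fixed-size sliding window would produce, using the character counts reported in Step 1 and the default `chunk_size=800` and `overlap=150`. It assumes no window is whitespace-only (those are skipped by `chunk_text`).

```python
def chunk_starts(text_len: int, chunk_size: int = 800, overlap: int = 150) -> list[int]:
    """Start offsets of a fixed-size sliding window (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # 650 characters with the defaults
    return list(range(0, text_len, stride))


# Character counts taken from the Step 1 output
docs = [("Understanding Skin Diseases.txt", 6019), ("skin-care habits.txt", 6585)]

total = 0
for name, n_chars in docs:
    starts = chunk_starts(n_chars)
    total += len(starts)
    print(f"{name}: {len(starts)} chunks, last chunk starts at {starts[-1]}")

print(f"Total: {total} chunks")
```

With a stride of 650 characters this gives 10 and 11 windows, which is consistent with the 21 chunks reported. Note that the final window of the second document starts at offset 6500, and the preceding window (5850 to 6650) already covers the end of the text, so that last chunk only repeats content.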


---

## Step 3: TF-IDF Vectorization

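We'll use scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` so the vocabulary includes both single terms (unigrams) and two-word phrases (bigrams). Because the vectorizer is fitted on chunks, each chunk counts as a "document" for `max_df` and `min_df`.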


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import save_npz, load_npz
import pandas as pd
import pickle

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # Unigrams and bigrams
    max_df=0.85,          # Ignore terms in >85% of documents
    min_df=1,             # Must appear in at least 1 document
    lowercase=True,
    stop_words='english',
)

# Fit and transform
chunk_texts = [c["text"] for c in all_chunks]
tfidf_matrix = vectorizer.fit_transform(chunk_texts)

print(f" TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"  Chunks: {tfidf_matrix.shape[0]}")
print(f"  Vocabulary size: {tfidf_matrix.shape[1]}")


✅ TF-IDF matrix shape: (21, 1924)
  Chunks: 21
  Vocabulary size: 1924
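
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It is an illustration, not the vectorizer's implementation: it assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how I understand scikit-learn's defaults, and it leaves out tokenization rules, stop words, bigrams, and `max_df`. The function name `tfidf_vectors` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed to match sklearn's default)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


toy_docs = [
    "eczema causes itchy dry skin".split(),
    "psoriasis causes scaly skin plaques".split(),
    "ringworm is a fungal skin infection".split(),
]
vecs = tfidf_vectors(toy_docs)

# "skin" occurs in every toy document, so it is weighted lower than "eczema"
print(round(vecs[0]["skin"], 3), round(vecs[0]["eczema"], 3))
```

Terms that appear in many chunks carry less weight than terms concentrated in a few, which is what lets a rare word such as "eczema" dominate the match.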


### Persist Artifacts


In [None]:
naive_dir = Path(config["artifacts_root"]) / "naive_tfidf"
naive_dir.mkdir(parents=True, exist_ok=True)

# Save vectorizer
with open(naive_dir / "vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Save TF-IDF matrix
save_npz(naive_dir / "matrix.npz", tfidf_matrix)

# Save chunk metadata
chunks_df = pd.DataFrame(all_chunks)
chunks_df.to_parquet(naive_dir / "chunks.parquet", index=False)

print(f" Artifacts saved to {naive_dir}")
print(f"  - vectorizer.pkl")
print(f"  - matrix.npz")
print(f"  - chunks.parquet")


✅ Artifacts saved to artifacts/naive_tfidf
  - vectorizer.pkl
  - matrix.npz
  - chunks.parquet


---

## Step 4: Retrieval (Cosine Similarity)

Given a query, we:
1. Transform it using the same vectorizer
2. Compute cosine similarity with all chunks
3. Return top-k most similar chunks
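
The three steps can be sketched without any libraries. In this illustration the vectors are small hand-written `{term: weight}` dicts rather than real TF-IDF output, and `cosine` and `top_k` are hypothetical names that are not used elsewhere in the notebook.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: dict[str, float], chunk_vecs: list[dict[str, float]], k: int = 2):
    """Return (chunk_index, score) pairs for the k highest-scoring chunks."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Made-up weights for three chunks and one query
chunk_vecs = [
    {"eczema": 0.8, "itchy": 0.6},
    {"psoriasis": 0.7, "scaly": 0.7},
    {"eczema": 0.3, "treatment": 0.9},
]
query_vec = {"eczema": 1.0}

print(top_k(query_vec, chunk_vecs, k=2))
```

Chunk 0 ranks first because most of its weight sits on the query term, chunk 2 comes second, and chunk 1 shares no terms with the query and scores zero.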


In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def retrieve_chunks(query: str, top_k: int = 3):
    """
    Retrieve top-k most relevant chunks for a query using TF-IDF and cosine similarity.
    
    Args:
        query: Search query string
        top_k: Number of top results to return
        
    Returns:
        List of dictionaries containing chunk data and similarity scores
    """
    query_vec = vectorizer.transform([query])
    similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Attach the similarity score to a copy of each selected chunk's metadata
    results = []
    for idx in top_indices:
        results.append({
            **all_chunks[idx],
            "score": float(similarities[idx]),
        })
    return results

# Test retrieval
test_query = "What is eczema and how is it treated?"
retrieved = retrieve_chunks(test_query, top_k=2)

print(f"Query: {test_query}\n")
for i, chunk in enumerate(retrieved, 1):
    print(f"Result {i} (score: {chunk['score']:.3f}):")
    print(f"  Source: {chunk['source']}")
    print(f"  Text: {chunk['text'][:120]}...\n")


Query: What is eczema and how is it treated?

Result 1 (score: 0.068):
  Source: skin-care habits.txt
  Text: ctive dressings and avoiding skin-to-skin spread; wart options include salicylic acid preparations or cryotherapy admini...

Result 2 (score: 0.068):
  Source: Understanding Skin Diseases.txt
  Text: PV).
	•	Fungal infections
	•	Ringworm (Tinea): Circular, red, scaly patches.
	•	Candidiasis: Yeast infection affecting m...



---

## Step 5: RAG Pipeline (Retrieve + Generate)

Now we'll combine retrieval with LLM generation:
1. Retrieve top-k chunks
2. Build a prompt with the retrieved context
3. Generate an answer using the LLM


In [None]:
def build_rag_prompt(query: str, retrieved_chunks: list) -> str:
    """
    Build a RAG prompt with retrieved context.
    
    Args:
        query: User's question
        retrieved_chunks: List of retrieved chunk dictionaries
        
    Returns:
        Complete prompt string with context and question
    """
    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(f"[{i}] {chunk['text']}")

    context = "\n\n".join(context_parts)

    prompt = f"""You are a concise healthcare assistant. Use only the provided context to answer the question.
    Keep answers under 5 sentences. Adapt your answer to the following conditions.

    1. If the question is about skin diseases and the information is insufficient, say "I do not have info, please reach out to our hospital".
    2. If the question is irrelevant to skin diseases, say "I can't provide info".
    
    Context:
    {context}

    Question: {query}
    
    Answer:
    """

    return prompt

def naive_rag(query: str, top_k: int = 3) -> dict:
    """Complete RAG pipeline: retrieve + generate."""
    # Retrieve
    retrieved = retrieve_chunks(query, top_k=top_k)
    
    # Build prompt
    prompt = build_rag_prompt(query, retrieved)

    # Generate
    response = llm.invoke(prompt)
    answer = response.content if hasattr(response, 'content') else str(response)
    
    return {
        "query": query,
        "answer": answer,
        "retrieved_chunks": retrieved,
        "num_chunks": len(retrieved)
    }

print(" RAG pipeline ready")


✅ RAG pipeline ready
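
One detail of `build_rag_prompt`: because the triple-quoted f-string is indented with the function body, every prompt line after the first reaches the LLM with four leading spaces. That is usually harmless, but if a clean prompt is preferred, one option is to dedent a template once and fill it in afterwards. The sketch below is an assumption-laden alternative, not the notebook's method: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the instruction text is shortened for illustration.

```python
from textwrap import dedent

# Hypothetical template; the wording is shortened for illustration
PROMPT_TEMPLATE = dedent(
    """\
    Use only the provided context to answer the question.

    Context:
    {context}

    Question: {query}

    Answer:"""
)


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Number the retrieved chunks and fill them into the dedented template."""
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, 1))
    return PROMPT_TEMPLATE.format(context=context, query=query)


demo = build_prompt("What is eczema?", [{"text": "Eczema is a chronic skin condition."}])
print(demo)
```

Dedenting the template before calling `format` matters: the multi-line context is inserted afterwards, so its unindented lines do not interfere with the indentation that `dedent` strips.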


---

## Interactive RAG Demo

Ask questions and see the retrieved chunks + generated answers.


In [None]:
# Example queries about dermatology
queries = [
    "What is eczema and how is it treated?",
    "How do you treat fungal skin infections?",
    "What are the symptoms of psoriasis?",
    "Who won the Ashes in 2025?",  # Deliberately off-topic to exercise the refusal rule
]

for query in queries:
    print("=" * 80)
    print(f" Query: {query}\n")
    
    result = naive_rag(query, top_k=2)
    
    print(" Retrieved chunks:")
    for i, chunk in enumerate(result["retrieved_chunks"], 1):
        print(f"  [{i}] {chunk['source']} (score: {chunk['score']:.3f})")
        print(f"      {chunk['text'][:100]}...\n")
    
    print(f" Answer:\n{result['answer']}\n")


❓ Query: What is eczema and how is it treated?

📚 Retrieved chunks:
  [1] skin-care habits.txt (score: 0.068)
      ctive dressings and avoiding skin-to-skin spread; wart options include salicylic acid preparations o...

  [2] Understanding Skin Diseases.txt (score: 0.068)
      PV).
	•	Fungal infections
	•	Ringworm (Tinea): Circular, red, scaly patches.
	•	Candidiasis: Yeast i...

💡 Answer:
Eczema, or atopic dermatitis, is a chronic inflammatory disease characterized by itchy, dry, red skin. Treatment includes daily emollients, short lukewarm baths, fragrance-free products, and avoiding triggers like wool and harsh detergents. During flare-ups, clinicians may prescribe topical anti-inflammatories, wet-wraps, and selected phototherapy for moderate to severe cases. Additionally, dilute bleach baths may be recommended by a dermatologist to reduce Staph burden and itch.

❓ Query: How do you treat fungal skin infections?

📚 Retrieved chunks:
  [1] Understanding Skin Dise

---

## Summary

**What we learned:**
- ✅ Basic RAG workflow: chunk → vectorize → retrieve → generate
- ✅ TF-IDF captures term importance within and across documents
- ✅ Cosine similarity between TF-IDF vectors ranks chunks by term overlap with the query
- ✅ LLM uses retrieved context to ground its answers

**Limitations of this approach:**
- TF-IDF is sparse and keyword-based (misses semantic similarity)
- No understanding of context or synonyms
- Fixed chunk size doesn't respect document structure
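
The synonym limitation is easy to demonstrate. The sketch below uses plain word counts instead of TF-IDF weights to keep it short; the effect is the same, since both representations only score terms that appear in both texts. `bow_cosine` and the example phrases are made up for this illustration.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[term] for term, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Same meaning, no shared words: a keyword-based score sees no relation
print(bow_cosine("itchy rash", "pruritic eruption"))

# Shared words: the score is clearly above zero
print(bow_cosine("itchy rash", "itchy rash on arms"))
```

The first pair scores zero even though the phrases describe the same symptom, while the second pair scores well because the words literally match.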

**Artifacts persisted:**
- `./artifacts/naive_tfidf/vectorizer.pkl`
- `./artifacts/naive_tfidf/matrix.npz`
- `./artifacts/naive_tfidf/chunks.parquet`
