# RAG System for the Teacher Agent - Automated Feedback System

A Retrieval-Augmented Generation (RAG) system that supports the Teacher Agent in giving automated feedback to students.

## Pipeline Overview:
1. **Data Loading**: Read the .txt files in the raw_data folder
2. **Preprocessing**: Clean and normalize the text
3. **Chunking**: Split the text into overlapping chunks
4. **Embedding**: Qwen3-Embedding is the intended model; the notebook runs `sentence-transformers/all-MiniLM-L6-v2` as an open-source stand-in (see Section 5)
5. **Vector Store**: FAISS for efficient storage and search
6. **Retrieval**: Top-k similarity search
7. **Context Injection**: Format the output for downstream agents
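
The seven stages can be sketched end to end with the standard library alone. This is a toy illustration, not the notebook's implementation: word counts stand in for the learned embedding, a plain list stands in for FAISS, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter

def clean(text: str) -> str:
    # Stage 2: collapse runs of whitespace.
    return " ".join(text.split())

def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    # Stage 3: fixed-size windows that share `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(text: str) -> Counter:
    # Stage 4 (toy): word counts stand in for a learned embedding.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list, k: int = 2) -> list[str]:
    # Stage 6: rank every chunk by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Stage 1: in-memory strings stand in for the .txt files.
docs = ["Loops repeat a block of statements.", "Arrays store many values of one type."]
# Stage 5: the "vector store" is a list of (chunk, vector) pairs.
index = [(c, embed(c)) for d in docs for c in chunk(clean(d))]
# Stage 7: join the retrieved chunks into a context string.
context = "\n".join(retrieve("how do loops work", index, k=1))
print(context)
```

Each stage is a plain function, so any one of them can be swapped for the real component (a text splitter, an embedding model, a vector index) without touching the others.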

## 1. Import Libraries and Setup

In [3]:
import os
import re
import time
import glob
from pathlib import Path
from typing import List, Dict, Any, Tuple
import warnings
warnings.filterwarnings('ignore')

# LangChain imports
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
# Utility imports
import numpy as np
import torch

print("✅ Libraries imported successfully!")
print(f"🔧 PyTorch version: {torch.__version__}")
print(f"🔧 CUDA available: {torch.cuda.is_available()}")

✅ Libraries imported successfully!
🔧 PyTorch version: 2.9.1
🔧 CUDA available: False


## 2. Data Loading - Reading .txt Files from raw_data

In [4]:
class DataLoader:
    """
    Custom data loader that reads .txt files from the raw_data folder.
    """
    def __init__(self, data_dir: str = "raw_data"):
        self.data_dir = data_dir
        self.documents = []

    def load_documents(self) -> List[Document]:
        """
        Load every .txt file under the data directory.
        """
        start_time = time.time()

        # Find all .txt files recursively
        txt_files = glob.glob(os.path.join(self.data_dir, "**/*.txt"), recursive=True)

        print(f"📂 Found {len(txt_files)} .txt files in {self.data_dir}")
        
        for file_path in txt_files:
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                    
                # Create a Document object with metadata
                doc = Document(
                    page_content=content,
                    metadata={
                        "source": file_path,
                        "filename": os.path.basename(file_path),
                        # Parent folder name; works with either path separator
                        "category": os.path.basename(os.path.dirname(file_path))
                    }
                )
                self.documents.append(doc)
                print(f"  ✓ Loaded: {file_path}")

            except Exception as e:
                print(f"  ✗ Error loading {file_path}: {str(e)}")

        load_time = time.time() - start_time
        print(f"\n⏱️ Loading time: {load_time:.2f} seconds")
        print(f"📄 Total documents loaded: {len(self.documents)}")
        
        return self.documents

# Initialize and load documents
loader = DataLoader()
documents = loader.load_documents()

📂 Found 2 .txt files in raw_data
  ✓ Loaded: raw_data/question_bank/uts2.txt
  ✓ Loaded: raw_data/question_bank/uts1.txt

⏱️ Loading time: 0.01 seconds
📄 Total documents loaded: 2
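
The `category` metadata is the name of the folder that directly contains each file. Deriving it by string splitting depends on the path separator, whereas `pathlib` handles both styles. A minimal sketch (the helper `category_of` is hypothetical and not part of the notebook):

```python
from pathlib import PurePosixPath, PureWindowsPath

def category_of(path) -> str:
    """Return the name of the folder that directly contains the file."""
    return path.parent.name

posix_file = PurePosixPath("raw_data/question_bank/uts1.txt")
windows_file = PureWindowsPath(r"raw_data\question_bank\uts1.txt")

print(category_of(posix_file))    # question_bank
print(category_of(windows_file))  # question_bank

# Splitting the Windows-style string on "/" finds no separator,
# so the "last part" is the whole path.
naive = str(windows_file).split("/")[-1]
print(naive)
```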


## 3. Preprocessing - Text Cleaning and Normalization

In [5]:
class TextPreprocessor:
    """
    Text preprocessing: cleaning and normalization.
    """

    @staticmethod
    def clean_text(text: str) -> str:
        """
        Remove noise from the text.
        """
        # Collapse every run of whitespace (including newlines) into one space
        text = re.sub(r'\s+', ' ', text)

        # Drop characters outside the whitelist (word characters, whitespace,
        # basic punctuation). This also removes symbols such as /, <, > and =,
        # so "2025/2026" becomes "20252026" and "!=" becomes "!".
        text = re.sub(r'[^\w\s\.,!?;:()\-\'\"]', '', text)

        # Strip leading/trailing whitespace
        text = text.strip()

        return text

    @staticmethod
    def normalize_text(text: str) -> str:
        """
        Normalize the text (lowercasing is deliberately skipped).
        """
        # Case is preserved so proper nouns in educational content stay intact

        # Remove extra spaces
        text = ' '.join(text.split())

        return text
    
    @staticmethod
    def preprocess_documents(documents: List[Document]) -> List[Document]:
        """
        Preprocess list of documents
        """
        start_time = time.time()
        processed_docs = []
        
        for doc in documents:
            # Clean and normalize
            cleaned_text = TextPreprocessor.clean_text(doc.page_content)
            normalized_text = TextPreprocessor.normalize_text(cleaned_text)
            
            # Create a new document with the processed text
            processed_doc = Document(
                page_content=normalized_text,
                metadata=dict(doc.metadata)  # copy, so later edits do not alias the original
            )
            processed_docs.append(processed_doc)

        preprocess_time = time.time() - start_time
        print(f"🧹 Preprocessing completed!")
        print(f"⏱️ Preprocessing time: {preprocess_time:.2f} seconds")
        print(f"📄 Processed {len(processed_docs)} documents")
        
        return processed_docs

# Preprocess documents
preprocessor = TextPreprocessor()
processed_documents = preprocessor.preprocess_documents(documents)

# Show sample
if processed_documents:
    print(f"\n📝 Sample processed text (first 300 chars):")
    print(processed_documents[0].page_content[:300])

🧹 Preprocessing completed!
⏱️ Preprocessing time: 0.00 seconds
📄 Processed 2 documents

📝 Sample processed text (first 300 chars):
--- PAGE 1 --- UNIVERSITAS Telkom Ujian Akhir Semester (Final Exam) Ganjil TA. 20252026 (1st Semester, Academic Year 20252026) CAK1BAB3-Algoritma dan Pemrograman 1 (Algorithm and Programming 1) Senin, 5 Januari 2026, Jam 14:00-16:00 (120 menit) (Monday, January 5, 2026, 14:00-16:00 120 minutes) Tim 
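
The sample shows `20252026` where the source presumably read `2025/2026`: the character whitelist in `clean_text` removes `/`, and it does the same to operators such as `<` and `=` that matter in programming questions. The sketch below reproduces that behaviour and compares it with a wider whitelist. `CODE_FRIENDLY_WHITELIST` is a hypothetical alternative, not something the notebook defines.

```python
import re

# The pattern used by clean_text.
ORIGINAL_WHITELIST = r'[^\w\s\.,!?;:()\-\'\"]'
# Hypothetical wider whitelist that also keeps / < > = + * [ ] { } %
CODE_FRIENDLY_WHITELIST = r'[^\w\s\.,!?;:()\-\'\"/<>=+*\[\]{}%]'

def clean(text: str, pattern: str) -> str:
    text = re.sub(r'\s+', ' ', text)   # collapse whitespace
    text = re.sub(pattern, '', text)   # drop characters outside the whitelist
    return ' '.join(text.split())      # remove gaps left by dropped characters

sample = "TA. 2025/2026: while p != -1 do if p < kerdil then"
print(clean(sample, ORIGINAL_WHITELIST))
print(clean(sample, CODE_FRIENDLY_WHITELIST))
```

With the original pattern the comparison `p < kerdil` collapses to `p kerdil`, so the retrieved context no longer states the condition the exam question actually used.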


## 4. Chunking Strategy - Splitting Documents with Overlap

In [6]:
class DocumentChunker:
    """
    Chunking strategy with overlap for better context continuity.
    """
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        """
        Args:
            chunk_size: Maximum size of each chunk (in characters)
            chunk_overlap: Characters shared by adjacent chunks to keep context continuous
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

        # RecursiveCharacterTextSplitter from LangChain. Separators are tried in order;
        # the preprocessing step collapses newlines, so "\n\n" and "\n" never match
        # in this pipeline and the first effective separator is ".".
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
        )
    
    def chunk_documents(self, documents: List[Document]) -> List[Document]:
        """
        Split documents into chunks
        """
        start_time = time.time()
        
        chunks = self.text_splitter.split_documents(documents)
        
        chunk_time = time.time() - start_time
        # Guard against an empty document list to avoid dividing by zero
        avg_chunks = len(chunks) / len(documents) if documents else 0.0
        print(f"✂️ Chunking completed!")
        print(f"⏱️ Chunking time: {chunk_time:.2f} seconds")
        print(f"📄 Original documents: {len(documents)}")
        print(f"📦 Total chunks created: {len(chunks)}")
        print(f"📊 Average chunks per document: {avg_chunks:.1f}")
        
        return chunks

# Chunk documents
chunker = DocumentChunker(chunk_size=500, chunk_overlap=50)
chunks = chunker.chunk_documents(processed_documents)

# Show sample chunks
if chunks:
    print(f"\n📝 Sample chunks:")
    for i, chunk in enumerate(chunks[:3]):
        print(f"\n--- Chunk {i+1} ---")
        print(f"Source: {chunk.metadata.get('source', 'N/A')}")
        print(f"Content: {chunk.page_content[:200]}...")
        print(f"Length: {len(chunk.page_content)} chars")

✂️ Chunking completed!
⏱️ Chunking time: 0.00 seconds
📄 Original documents: 2
📦 Total chunks created: 55
📊 Average chunks per document: 27.5

📝 Sample chunks:

--- Chunk 1 ---
Source: raw_data/question_bank/uts2.txt
Content: --- PAGE 1 --- UNIVERSITAS Telkom Ujian Akhir Semester (Final Exam) Ganjil TA...
Length: 77 chars

--- Chunk 2 ---
Source: raw_data/question_bank/uts2.txt
Content: . 20252026 (1st Semester, Academic Year 20252026) CAK1BAB3-Algoritma dan Pemrograman 1 (Algorithm and Programming 1) Senin, 5 Januari 2026, Jam 14:00-16:00 (120 menit) (Monday, January 5, 2026, 14:00-...
Length: 451 chars

--- Chunk 3 ---
Source: raw_data/question_bank/uts2.txt
Content: . Jika dilakukan, maka dianggap pelanggaran (This is a close book exam, no electronic device is allowed. Put your phone and laptop at the front of class Do not cooperate with each other or violate aca...
Length: 466 chars
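
To make `chunk_size` and `chunk_overlap` concrete, here is a plain fixed-window sketch. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which prefers to break at separators and therefore produces chunks of uneven length, as in the samples above. The function name is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows; adjacent windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

pieces = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the final character of one chunk reappears as the first character of the next.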


## 5. Embedding Generation - Using Qwen3-Embedding (or an Open-Source Alternative)

**Note**: Qwen3-Embedding may require special access. As an alternative, we use a capable open-source model such as `sentence-transformers/all-MiniLM-L6-v2` or `BAAI/bge-small-en-v1.5`, both of which run on free-tier systems.

In [7]:
class EmbeddingGenerator:
    """
    Embedding generator backed by an open-source model.
    """
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        """
        Initialize the embedding model.

        Available models:
        - sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
        - BAAI/bge-small-en-v1.5 (better quality)
        - Qwen/Qwen3-Embedding-0.6B (if available)
        """
        print(f"🔄 Loading embedding model: {model_name}")
        start_time = time.time()

        # Initialize HuggingFace embeddings
        self.embeddings = HuggingFaceEmbeddings(
            model_name=model_name,
            model_kwargs={'device': 'cpu'},  # Use 'cuda' if a GPU is available
            encode_kwargs={'normalize_embeddings': True}  # Unit-length vectors, so L2 distance ranks like cosine similarity
        )

        load_time = time.time() - start_time
        print(f"✅ Model loaded successfully in {load_time:.2f} seconds")
        
    def generate_embeddings(self, texts: List[str]) -> Tuple[np.ndarray, float]:
        """
        Generate embeddings for a list of texts.

        Returns:
            Tuple of (embedding matrix, elapsed seconds)
        """
        start_time = time.time()
        
        embeddings = self.embeddings.embed_documents(texts)
        
        embed_time = time.time() - start_time
        
        return np.array(embeddings), embed_time
    
    def test_embedding(self, text: str = "Test embedding generation"):
        """
        Test embedding generation with a sample text.
        """
        print(f"\n🧪 Testing embedding generation...")
        start_time = time.time()

        embedding = self.embeddings.embed_query(text)

        test_time = time.time() - start_time

        print(f"✅ Test successful!")
        print(f"📊 Embedding dimension: {len(embedding)}")
        print(f"⏱️ Inference time: {test_time:.4f} seconds")
        print(f"📈 Sample values: {embedding[:5]}")
        
        return embedding

# Initialize embedding generator
embedding_generator = EmbeddingGenerator(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Test embedding
test_embedding = embedding_generator.test_embedding("This is a test sentence for embedding.")

🔄 Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
✅ Model loaded successfully in 31.98 seconds

🧪 Testing embedding generation...
✅ Test successful!
📊 Embedding dimension: 384
⏱️ Inference time: 0.1422 seconds
📈 Sample values: [0.02782415971159935, 0.001702569075860083, 0.08005549758672714, 0.04666281118988991, 0.03852206468582153]
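
The `normalize_embeddings` option scales every vector to unit length. A small standard-library sketch of what that buys: once both vectors are normalized, a plain dot product equals their cosine similarity. The two-dimensional vectors are made up for illustration.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = l2_norm(vec)
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
cosine_raw = dot(raw_a, raw_b) / (l2_norm(raw_a) * l2_norm(raw_b))

unit_a, unit_b = l2_normalize(raw_a), l2_normalize(raw_b)
dot_unit = dot(unit_a, unit_b)

print(l2_norm(unit_a))       # approximately 1.0
print(cosine_raw, dot_unit)  # both approximately 0.96
```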


## 6. FAISS Vector Store - Indexing and Efficient Storage

In [8]:
class FAISSVectorStore:
    """
    FAISS vector store for efficient storage and search.
    """
    def __init__(self, embeddings):
        self.embeddings = embeddings
        self.vector_store = None

    def create_index(self, chunks: List[Document]) -> FAISS:
        """
        Create a FAISS index from the chunks.
        """
        print(f"🔨 Creating FAISS index...")
        start_time = time.time()

        # Create FAISS vector store from documents
        self.vector_store = FAISS.from_documents(
            documents=chunks,
            embedding=self.embeddings
        )

        index_time = time.time() - start_time

        print(f"✅ FAISS index created successfully!")
        print(f"⏱️ Indexing time: {index_time:.2f} seconds")
        print(f"📦 Total vectors in index: {len(chunks)}")
        # NOTE: this counts vector elements, not bytes, and hard-codes the 384-dim
        # size of all-MiniLM-L6-v2. float32 vectors use 4 bytes per element, so the
        # raw vector data is about 4x the figure printed here.
        print(f"💾 Estimated memory usage: {len(chunks) * 384 / 1024:.2f} KB")
        
        return self.vector_store
    
    def save_index(self, path: str = "faiss_index"):
        """
        Save the FAISS index to disk for reuse.
        """
        if self.vector_store is None:
            print("❌ No vector store to save!")
            return

        print(f"💾 Saving FAISS index to {path}...")
        self.vector_store.save_local(path)
        print(f"✅ Index saved successfully!")

    def load_index(self, path: str = "faiss_index"):
        """
        Load the FAISS index from disk.
        """
        print(f"📂 Loading FAISS index from {path}...")
        start_time = time.time()

        self.vector_store = FAISS.load_local(
            path,
            self.embeddings,
            # The docstore is unpickled on load; only enable this for index
            # files you created yourself.
            allow_dangerous_deserialization=True
        )

        load_time = time.time() - start_time
        print(f"✅ Index loaded successfully in {load_time:.2f} seconds!")
        
        return self.vector_store

# Create FAISS vector store
faiss_store = FAISSVectorStore(embedding_generator.embeddings)
vector_store = faiss_store.create_index(chunks)

# Save the index for later reuse
faiss_store.save_index("faiss_index")

🔨 Creating FAISS index...
✅ FAISS index created successfully!
⏱️ Indexing time: 2.05 seconds
📦 Total vectors in index: 55
💾 Estimated memory usage: 20.62 KB
💾 Saving FAISS index to faiss_index...
✅ Index saved successfully!
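
The 20.62 KB estimate divides the number of vector elements (55 × 384) by 1024 without accounting for element size. Assuming a flat, uncompressed index of float32 values (4 bytes each), the raw vector storage can be estimated as below. The helper name is hypothetical, and the figure excludes the document store and any index overhead.

```python
def flat_index_vector_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_vectors dense vectors of `dim` values each."""
    return n_vectors * dim * bytes_per_value

n_bytes = flat_index_vector_bytes(n_vectors=55, dim=384)
print(f"{n_bytes} bytes = {n_bytes / 1024:.2f} KiB")  # 84480 bytes = 82.50 KiB
```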


## 7. Retrieval System - Similarity Search with Top-K

In [9]:
class RAGRetriever:
    """
    Retrieval system based on similarity search.
    """
    def __init__(self, vector_store: FAISS, k: int = 3):
        """
        Args:
            vector_store: FAISS vector store
            k: Number of top results to retrieve
        """
        self.vector_store = vector_store
        self.k = k
    
    def retrieve(self, query: str, k: int = None) -> Tuple[List[Document], float]:
        """
        Retrieve top-k most relevant documents
        
        Returns:
            Tuple of (documents, retrieval_time)
        """
        if k is None:
            k = self.k
            
        print(f"\n🔍 Searching for: '{query}'")
        start_time = time.time()

        # Similarity search
        results = self.vector_store.similarity_search(query, k=k)

        retrieval_time = time.time() - start_time

        print(f"✅ Retrieved {len(results)} documents")
        print(f"⏱️ Retrieval time: {retrieval_time:.4f} seconds")
        
        return results, retrieval_time
    
    def retrieve_with_scores(self, query: str, k: int = None) -> Tuple[List[Tuple[Document, float]], float]:
        """
        Retrieve the top-k documents together with their scores.

        The score is the raw value returned by FAISS. With the default index
        this is a distance, so lower means more similar.
        """
        if k is None:
            k = self.k

        print(f"\n🔍 Searching for: '{query}'")
        start_time = time.time()

        # Similarity search with scores
        results = self.vector_store.similarity_search_with_score(query, k=k)

        retrieval_time = time.time() - start_time

        print(f"✅ Retrieved {len(results)} documents with scores")
        print(f"⏱️ Retrieval time: {retrieval_time:.4f} seconds")

        # Display scores (distances: lower is better)
        for i, (doc, score) in enumerate(results):
            print(f"  {i+1}. Score: {score:.4f} | Source: {doc.metadata.get('filename', 'N/A')}")
        
        return results, retrieval_time

# Initialize retriever
retriever = RAGRetriever(vector_store, k=3)

# Test retrieval
test_query = "algorithm and programming exam questions"
test_results, test_time = retriever.retrieve_with_scores(test_query)

# Show retrieved documents
print(f"\n📄 Retrieved Documents:")
for i, (doc, score) in enumerate(test_results):
    print(f"\n--- Document {i+1} (Score: {score:.4f}) ---")
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content preview: {doc.page_content[:200]}...")


🔍 Searching for: 'algorithm and programming exam questions'
✅ Retrieved 3 documents with scores
⏱️ Retrieval time: 0.0255 seconds
  1. Score: 0.8872 | Source: uts2.txt
  2. Score: 1.0135 | Source: uts2.txt
  3. Score: 1.0949 | Source: uts2.txt

📄 Retrieved Documents:

--- Document 1 (Score: 0.8872) ---
Source: raw_data/question_bank/uts2.txt
Content preview: . 20252026 (1st Semester, Academic Year 20252026) CAK1BAB3-Algoritma dan Pemrograman 1 (Algorithm and Programming 1) Senin, 5 Januari 2026, Jam 14:00-16:00 (120 menit) (Monday, January 5, 2026, 14:00-...

--- Document 2 (Score: 1.0135) ---
Source: raw_data/question_bank/uts2.txt
Content preview: . Tuliskan jawaban di sini (Write your answer at the given box)! Plaintext program Nature dictionary p, kerdil : real ada : integer algorithm ada 1 input(kerdil) input(p) while p ! -1 do if p kerdil t...

--- Document 3 (Score: 1.0949) ---
Source: raw_data/question_bank/uts2.txt
Content preview: . Students are able to use basic
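
The scores above increase from the best hit to the worst, which is what a distance looks like. Assuming the index returns squared L2 distance (the usual behaviour of a flat L2 FAISS index) and the embeddings are unit length, the distance maps to cosine similarity through ‖a − b‖² = 2 − 2·cos(a, b). A sketch of that conversion, with a hypothetical helper name:

```python
def squared_l2_to_cosine(distance: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - distance / 2.0

# Scores printed by the test retrieval above.
recorded_scores = [0.8872, 1.0135, 1.0949]
for d in recorded_scores:
    print(f"distance {d:.4f} -> cosine {squared_l2_to_cosine(d):.4f}")
```

Under these assumptions the best hit corresponds to a cosine similarity of roughly 0.56, which is a moderate match rather than a strong one.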

## 8. Teacher Agent Input Handler - Student Profile Integration

In [10]:
class TeacherAgentInput:
    """
    Handler for input coming from the Teacher Agent.
    """
    def __init__(self, retriever: RAGRetriever):
        self.retriever = retriever
    
    def process_student_input(
        self, 
        student_input: str, 
        summary: str = "", 
        student_profile: Dict[str, Any] = None
    ) -> Dict[str, Any]:
        """
        Process the student input and retrieve relevant context.

        Args:
            student_input: The student's answer
            summary: Summary of the student's answer
            student_profile: Student profile (level, history, etc.)

        Returns:
            Dict with the context and metadata
        """
        print("="*60)
        print("🎓 TEACHER AGENT - RAG SYSTEM")
        print("="*60)

        # Build enhanced query
        query_parts = []

        if summary:
            query_parts.append(f"Summary: {summary}")

        query_parts.append(f"Student Input: {student_input}")

        if student_profile:
            level = student_profile.get('level', 'N/A')
            query_parts.append(f"Level: {level}")

        enhanced_query = " | ".join(query_parts)

        print(f"\n📝 Student Input: {student_input[:100]}...")
        if summary:
            print(f"📋 Summary: {summary[:100]}...")
        if student_profile:
            print(f"👤 Student Profile: {student_profile}")

        # Retrieve relevant context; k falls back to the retriever's configured top-k
        results, retrieval_time = self.retriever.retrieve_with_scores(enhanced_query)
        
        # Format context
        context_text = self._format_context(results)
        
        # Build output
        output = {
            "context": context_text,
            "student_input": student_input,
            "summary": summary,
            "student_profile": student_profile,
            "retrieval_time": retrieval_time,
            "num_sources": len(results),
            "sources": [
                {
                    "filename": doc.metadata.get('filename', 'N/A'),
                    "category": doc.metadata.get('category', 'N/A'),
                    "score": float(score),
                    "preview": doc.page_content[:150]
                }
                for doc, score in results
            ]
        }
        
        return output
    
    def _format_context(self, results: List[Tuple[Document, float]]) -> str:
        """
        Format the retrieved documents into a single context string.
        """
        context_parts = []
        
        for i, (doc, score) in enumerate(results):
            context_parts.append(f"[Source {i+1} - Score: {score:.4f}]")
            context_parts.append(doc.page_content)
            context_parts.append("")  # Empty line
        
        return "\n".join(context_parts)
    
    def format_output_for_agents(self, output: Dict[str, Any]) -> str:
        """
        Format the output as a JSON string ready for downstream agents
        (Style Checker, Logic Checker, etc.).
        """
        import json  # local import keeps this cell self-contained

        payload = {
            "context": output['context'][:500] + "...",
            "student_input": output['student_input'],
            "summary": output['summary'],
            "metadata": {
                "num_sources": output['num_sources'],
                "retrieval_time": round(output['retrieval_time'], 4),
                "student_profile": output['student_profile']
            }
        }
        # json.dumps escapes quotes and newlines, so the result parses as JSON
        return json.dumps(payload, indent=4, ensure_ascii=False)

# Initialize Teacher Agent Input Handler
teacher_agent = TeacherAgentInput(retriever)
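
When a payload is handed to downstream agents as JSON, assembling the string by hand breaks as soon as the student text contains a quote or a line break. The sketch below contrasts a hand-built string with `json.dumps`; the payload values are made up for illustration.

```python
import json

payload = {
    "student_input": 'He said "use a loop"\nthen stopped.',
    "student_profile": {"name": "John Doe", "level": "beginner"},
}

# Hand-built string: the embedded quotes and newline make it invalid JSON.
hand_built = f'{{"student_input": "{payload["student_input"]}"}}'
try:
    json.loads(hand_built)
    hand_built_ok = True
except json.JSONDecodeError:
    hand_built_ok = False

# json.dumps escapes special characters, so the round trip is lossless.
encoded = json.dumps(payload, ensure_ascii=False)
decoded = json.loads(encoded)

print(hand_built_ok)       # False
print(decoded == payload)  # True
```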

## 9. Complete RAG Pipeline - End-to-End System

In [11]:
class CompleteRAGPipeline:
    """
    Complete RAG pipeline that integrates all components.
    """
    def __init__(
        self, 
        data_dir: str = "raw_data",
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        chunk_size: int = 500,
        chunk_overlap: int = 50,
        top_k: int = 3
    ):
        """
        Initialize complete RAG pipeline
        """
        self.data_dir = data_dir
        self.embedding_model = embedding_model
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.top_k = top_k
        
        # Components
        self.loader = None
        self.preprocessor = None
        self.chunker = None
        self.embedding_generator = None
        self.faiss_store = None
        self.retriever = None
        self.teacher_agent = None
        
        # Metrics
        self.metrics = {
            "load_time": 0,
            "preprocess_time": 0,
            "chunk_time": 0,
            "embedding_time": 0,
            "index_time": 0,
            "total_documents": 0,
            "total_chunks": 0
        }
    
    def build_pipeline(self, force_rebuild: bool = False):
        """
        Build complete pipeline
        """
        print("="*60)
        print("🚀 BUILDING COMPLETE RAG PIPELINE")
        print("="*60)

        total_start = time.time()

        # Check if index exists
        if os.path.exists("faiss_index") and not force_rebuild:
            print("\n📂 Existing FAISS index found. Loading...")
            self._load_existing_pipeline()
        else:
            print("\n🔨 Building new pipeline from scratch...")
            self._build_from_scratch()

        total_time = time.time() - total_start
        self.metrics["total_pipeline_time"] = total_time

        print(f"\n{'='*60}")
        print(f"✅ PIPELINE BUILD COMPLETE!")
        print(f"⏱️ Total time: {total_time:.2f} seconds")
        print(f"{'='*60}")
        
        self._print_metrics()
    
    def _build_from_scratch(self):
        """
        Build pipeline from scratch
        """
        # 1. Load documents
        print("\n📚 Step 1: Loading documents...")
        self.loader = DataLoader(self.data_dir)
        documents = self.loader.load_documents()
        self.metrics["total_documents"] = len(documents)

        # 2. Preprocess
        print("\n🧹 Step 2: Preprocessing...")
        self.preprocessor = TextPreprocessor()
        processed_docs = self.preprocessor.preprocess_documents(documents)

        # 3. Chunk
        print("\n✂️ Step 3: Chunking...")
        self.chunker = DocumentChunker(self.chunk_size, self.chunk_overlap)
        chunks = self.chunker.chunk_documents(processed_docs)
        self.metrics["total_chunks"] = len(chunks)

        # 4. Load the embedding model
        print("\n🔢 Step 4: Generating embeddings...")
        self.embedding_generator = EmbeddingGenerator(self.embedding_model)

        # 5. Embed the chunks and create the index
        print("\n🗄️ Step 5: Creating FAISS index...")
        self.faiss_store = FAISSVectorStore(self.embedding_generator.embeddings)
        vector_store = self.faiss_store.create_index(chunks)

        # 6. Save index
        print("\n💾 Step 6: Saving index...")
        self.faiss_store.save_index("faiss_index")

        # 7. Initialize retriever and teacher agent
        self._initialize_agents(vector_store)
    
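    def _load_existing_pipeline(self):
        """
        Load an existing pipeline from the saved index.

        Documents are not reloaded or re-chunked on this path, so the
        total_documents and total_chunks metrics stay at 0.
        """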
        # Initialize embedding generator
        self.embedding_generator = EmbeddingGenerator(self.embedding_model)
        
        # Load FAISS index
        self.faiss_store = FAISSVectorStore(self.embedding_generator.embeddings)
        vector_store = self.faiss_store.load_index("faiss_index")
        
        # Initialize agents
        self._initialize_agents(vector_store)
    
    def _initialize_agents(self, vector_store):
        """
        Initialize retriever and teacher agent
        """
        print("\n🎯 Initializing retriever and teacher agent...")
        self.retriever = RAGRetriever(vector_store, k=self.top_k)
        self.teacher_agent = TeacherAgentInput(self.retriever)
        print("✅ Agents initialized!")
    
    def query(
        self, 
        student_input: str, 
        summary: str = "", 
        student_profile: Dict[str, Any] = None
    ) -> Dict[str, Any]:
        """
        Main query method for the Teacher Agent.
        """
        if self.teacher_agent is None:
            raise ValueError("Pipeline not built! Call build_pipeline() first.")
        
        return self.teacher_agent.process_student_input(
            student_input, 
            summary, 
            student_profile
        )
    
    def _print_metrics(self):
        """
        Print pipeline metrics
        """
        print(f"\n📊 PIPELINE METRICS:")
        print(f"  • Total Documents: {self.metrics.get('total_documents', 'N/A')}")
        print(f"  • Total Chunks: {self.metrics.get('total_chunks', 'N/A')}")
        print(f"  • Embedding Model: {self.embedding_model}")
        print(f"  • Chunk Size: {self.chunk_size}")
        print(f"  • Top-K: {self.top_k}")

# Initialize Complete Pipeline
pipeline = CompleteRAGPipeline(
    data_dir="raw_data",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    chunk_size=500,
    chunk_overlap=50,
    top_k=3
)

# Build pipeline
pipeline.build_pipeline(force_rebuild=False)

🚀 BUILDING COMPLETE RAG PIPELINE

📂 Existing FAISS index found. Loading...
🔄 Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
✅ Model loaded successfully in 4.67 seconds
📂 Loading FAISS index from faiss_index...
✅ Index loaded successfully in 0.00 seconds!

🎯 Initializing retriever and teacher agent...
✅ Agents initialized!

✅ PIPELINE BUILD COMPLETE!
⏱️ Total time: 4.67 seconds

📊 PIPELINE METRICS:
  • Total Documents: 0
  • Total Chunks: 0
  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
  • Chunk Size: 500
  • Top-K: 3
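
The metrics report 0 documents and 0 chunks because the index was loaded from disk, and that path skips the loading and chunking steps that fill in those counters. One way to keep the numbers across runs is to write them to a small file next to the index at build time and read it back on load. The sketch below uses a hypothetical `metrics.json` sidecar and hypothetical helper names; the notebook does not do this.

```python
import json
import tempfile
from pathlib import Path

def save_metrics(index_dir: str, metrics: dict) -> None:
    """Write build-time metrics next to the saved index (hypothetical sidecar file)."""
    Path(index_dir).mkdir(parents=True, exist_ok=True)
    (Path(index_dir) / "metrics.json").write_text(json.dumps(metrics), encoding="utf-8")

def load_metrics(index_dir: str) -> dict:
    """Read the sidecar back; return an empty dict if it was never written."""
    path = Path(index_dir) / "metrics.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

with tempfile.TemporaryDirectory() as tmp:
    missing = load_metrics(tmp)  # nothing saved yet
    save_metrics(tmp, {"total_documents": 2, "total_chunks": 55})
    restored = load_metrics(tmp)

print(missing)   # {}
print(restored)  # {'total_documents': 2, 'total_chunks': 55}
```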


## 10. Demo & Testing - Example Usage of the System

In [12]:
# Test Case 1: Student answer about algorithms
print("="*80)
print("TEST CASE 1: Student Answer about Algorithms")
print("="*80)

student_input_1 = """
The algorithm I wrote uses a loop to find the maximum value in an array. 
I iterate through each element and compare it with the current maximum.
"""

summary_1 = "Student implements a linear search algorithm for finding maximum value"

student_profile_1 = {
    "name": "John Doe",
    "level": "beginner",
    "course": "Algorithm and Programming 1",
    "previous_score": 75
}

result_1 = pipeline.query(
    student_input=student_input_1,
    summary=summary_1,
    student_profile=student_profile_1
)

print(f"\n📤 OUTPUT FOR DOWNSTREAM AGENTS:")
print(f"{'='*80}")
print(f"Context Length: {len(result_1['context'])} characters")
print(f"Number of Sources: {result_1['num_sources']}")
print(f"Retrieval Time: {result_1['retrieval_time']:.4f} seconds")
print(f"\n💡 Context Preview (first 500 chars):")
print(result_1['context'][:500])
print(f"...")

TEST CASE 1: Student Answer about Algorithms
🎓 TEACHER AGENT - RAG SYSTEM

📝 Student Input: 
The algorithm I wrote uses a loop to find the maximum value in an array. 
I iterate through each el...
📋 Summary: Student implements a linear search algorithm for finding maximum value...
👤 Student Profile: {'name': 'John Doe', 'level': 'beginner', 'course': 'Algorithm and Programming 1', 'previous_score': 75}

🔍 Searching for: 'Summary: Student implements a linear search algorithm for finding maximum value | Student Input: 
The algorithm I wrote uses a loop to find the maximum value in an array. 
I iterate through each element and compare it with the current maximum.
 | Level: beginner'
✅ Retrieved 3 documents with scores
⏱️ Retrieval time: 0.0328 seconds
  1. Score: 1.1493 | Source: uts2.txt
  2. Score: 1.2057 | Source: uts1.txt
  3. Score: 1.2133 | Source: uts2.txt

📤 OUTPUT FOR DOWNSTREAM AGENTS:
Context Length: 1440 characters
Number of Sources: 3
Retrieval Time: 0.

In [13]:
# Test Case 2: Student question about the exam
print("\n" + "="*80)
print("TEST CASE 2: Question about Exam Format")
print("="*80)

student_input_2 = """
I'm confused about the exam format. Is it open book or closed book? 
Can we use calculators?
"""

summary_2 = "Student asking about exam rules and regulations"

student_profile_2 = {
    "name": "Jane Smith",
    "level": "intermediate",
    "course": "Algorithm and Programming 1",
    "previous_score": 85
}

result_2 = pipeline.query(
    student_input=student_input_2,
    summary=summary_2,
    student_profile=student_profile_2
)

print(f"\n📤 OUTPUT FOR DOWNSTREAM AGENTS:")
print(f"{'='*80}")
print(f"Context Length: {len(result_2['context'])} characters")
print(f"Number of Sources: {result_2['num_sources']}")
print(f"Retrieval Time: {result_2['retrieval_time']:.4f} seconds")

print(f"\n📚 Retrieved Sources:")
for i, source in enumerate(result_2['sources']):
    print(f"\n  Source {i+1}:")
    print(f"    • File: {source['filename']}")
    print(f"    • Category: {source['category']}")
    print(f"    • Distance Score (lower is closer): {source['score']:.4f}")
    print(f"    • Preview: {source['preview']}...")


TEST CASE 2: Question about Exam Format
🎓 TEACHER AGENT - RAG SYSTEM

📝 Student Input: 
I'm confused about the exam format. Is it open book or closed book? 
Can we use calculators?
...
📋 Summary: Student asking about exam rules and regulations...
👤 Student Profile: {'name': 'Jane Smith', 'level': 'intermediate', 'course': 'Algorithm and Programming 1', 'previous_score': 85}

🔍 Searching for: 'Summary: Student asking about exam rules and regulations | Student Input: 
I'm confused about the exam format. Is it open book or closed book? 
Can we use calculators?
 | Level: intermediate'
✅ Retrieved 3 documents with scores
⏱️ Retrieval time: 0.0232 seconds
  1. Score: 1.0638 | Source: uts1.txt
  2. Score: 1.0816 | Source: uts1.txt
  3. Score: 1.0859 | Source: uts2.txt

📤 OUTPUT FOR DOWNSTREAM AGENTS:
Context Length: 1450 characters
Number of Sources: 3
Retrieval Time: 0.0232 seconds

📚 Retrieved Sources:

  Source 1:
    • File: uts1.txt
    • Category: questio
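
The scores printed above increase from the first hit to the third (1.0638, 1.0816, 1.0859), which suggests they are distances (lower means closer) rather than similarities. The sketch below shows one way to convert such a score into a cosine similarity and a coarse confidence label. It assumes the scores are squared L2 distances between unit-normalized embeddings, which this notebook does not confirm; the function names and the thresholds are hypothetical.

```python
def squared_l2_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    For unit-length vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
    ASSUMPTION: embeddings are normalized and the index returns squared L2.
    """
    return 1.0 - distance / 2.0


def confidence_label(distance: float, high: float = 0.7, medium: float = 0.4) -> str:
    """Map a distance to a coarse label; the thresholds are illustrative only."""
    similarity = squared_l2_to_cosine(distance)
    if similarity >= high:
        return "high"
    if similarity >= medium:
        return "medium"
    return "low"


# Scores taken from the retrieval log above
for d in (1.0638, 1.0816, 1.0859):
    print(f"distance={d:.4f} cosine={squared_l2_to_cosine(d):.4f} label={confidence_label(d)}")
```

With these illustrative thresholds, all three hits map to "medium". If the assumptions hold, a label derived from the scores may describe retrieval quality better than a fixed value.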

## 11. Output Format for Downstream Agents (Style Checker, Logic Checker)

In [14]:
def format_for_style_checker(result: Dict[str, Any]) -> str:
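    """
    Format the output for the Style Checker Agent.

    Note: the result is a JSON-like string, not guaranteed to be valid JSON
    (raw newlines and Python dict reprs are interpolated as-is), and
    "retrieval_confidence" is a fixed placeholder, not derived from scores.
    """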
    return f"""
{{
    "context": {{
        "retrieved_knowledge": "{result['context'][:300]}...",
        "num_sources": {result['num_sources']},
        "retrieval_confidence": "high"
    }},
    "student_submission": {{
        "input": "{result['student_input']}",
        "summary": "{result['summary']}"
    }},
    "metadata": {{
        "student_profile": {result['student_profile']},
        "retrieval_time_ms": {result['retrieval_time'] * 1000:.2f}
    }}
}}
"""

def format_for_logic_checker(result: Dict[str, Any]) -> str:
    """
    Format the output for the Logic Checker Agent.
    """
    return f"""
{{
    "context": {{
        "relevant_concepts": "{result['context'][:300]}...",
        "source_files": {[s['filename'] for s in result['sources']]},
        "confidence_scores": {[s['score'] for s in result['sources']]}
    }},
    "student_answer": "{result['student_input']}",
    "summary": "{result['summary']}",
    "student_info": {result['student_profile']}
}}
"""

def format_for_llm_prompt(result: Dict[str, Any]) -> str:
    """
    Format the output for injection into an LLM prompt.
    """
    prompt = f"""
You are a Teacher Agent providing feedback to a student.

CONTEXT (Retrieved from Knowledge Base):
{result['context']}

STUDENT SUBMISSION:
Input: {result['student_input']}
Summary: {result['summary']}

STUDENT PROFILE:
{result['student_profile']}

TASK:
Based on the context retrieved from the knowledge base and the student's submission, provide:
1. Assessment of correctness
2. Areas of improvement
3. Constructive feedback
4. Suggestions for further learning

Your feedback should be:
- Clear and concise
- Encouraging and constructive
- Aligned with the course materials (shown in context)
- Tailored to the student's level
"""
    return prompt

# Demo: formats for the different agents
print("="*80)
print("OUTPUT FORMATS FOR DOWNSTREAM AGENTS")
print("="*80)

print("\n1️⃣ FORMAT FOR STYLE CHECKER:")
print("-"*80)
print(format_for_style_checker(result_1))

print("\n2️⃣ FORMAT FOR LOGIC CHECKER:")
print("-"*80)
print(format_for_logic_checker(result_1))

print("\n3️⃣ FORMAT FOR LLM PROMPT INJECTION:")
print("-"*80)
print(format_for_llm_prompt(result_1)[:500] + "...")

OUTPUT FORMATS FOR DOWNSTREAM AGENTS

1️⃣ FORMAT FOR STYLE CHECKER:
--------------------------------------------------------------------------------

{
    "context": {
        "retrieved_knowledge": "[Source 1 - Score: 1.1493]
. 20252026 (1st Semester, Academic Year 20252026) CAK1BAB3-Algoritma dan Pemrograman 1 (Algorithm and Programming 1) Senin, 5 Januari 2026, Jam 14:00-16:00 (120 menit) (Monday, January 5, 2026, 14:00-16:00 120 minutes) Tim Dosen (Lecturer Team): BMG, FFS, FZD, HMT, IGR, JM...",
        "num_sources": 3,
        "retrieval_confidence": "high"
    },
    "student_submission": {
        "input": "
The algorithm I wrote uses a loop to find the maximum value in an array. 
I iterate through each element and compare it with the current maximum.
",
        "summary": "Student implements a linear search algorithm for finding maximum value"
    },
    "metadata": {
        "student_profile": {'name': 'John Doe', 'level': 'beginner', 'course': 'Algorithm and Programmin
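
The output above is JSON-like but not valid JSON: the student input contains raw newlines and the profile is printed as a Python dict with single quotes. If a downstream agent needs to parse the payload, one option is to build a plain dict and serialize it with `json.dumps`. The sketch below is an assumption about how that could look, not the notebook's own formatter; `format_for_style_checker_json` and the sample values are hypothetical, and the fixed `retrieval_confidence` field is left out.

```python
import json
from typing import Any, Dict


def format_for_style_checker_json(result: Dict[str, Any], max_context_chars: int = 300) -> str:
    """Serialize the payload with json.dumps so quotes and newlines are escaped."""
    payload = {
        "context": {
            "retrieved_knowledge": result["context"][:max_context_chars],
            "num_sources": result["num_sources"],
        },
        "student_submission": {
            "input": result["student_input"].strip(),
            "summary": result["summary"],
        },
        "metadata": {
            "student_profile": result.get("student_profile"),
            "retrieval_time_ms": round(result["retrieval_time"] * 1000, 2),
        },
    }
    return json.dumps(payload, indent=4, ensure_ascii=False)


# Hypothetical sample mirroring the keys returned by pipeline.query()
sample_result = {
    "context": "[Source 1] Exam rules ...",
    "student_input": "\nI use a loop to find the \"maximum\" value.\n",
    "summary": "Student implements a linear search",
    "student_profile": {"name": "John Doe", "level": "beginner"},
    "retrieval_time": 0.0232,
    "num_sources": 3,
}

formatted = format_for_style_checker_json(sample_result)
parsed = json.loads(formatted)  # parses cleanly because the string is valid JSON
print(parsed["student_submission"]["input"])
```

Building the dict first keeps escaping in one place, so embedded quotes or newlines in student text cannot break the structure.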

## 12. Performance Metrics & Benchmarking

In [15]:
import statistics

class PerformanceBenchmark:
    """
    Benchmark for measuring the performance of the RAG system.
    """
    def __init__(self, pipeline: CompleteRAGPipeline):
        self.pipeline = pipeline
        self.results = []
    
    def run_benchmark(self, test_queries: List[str], num_runs: int = 5):
        """
        Run the benchmark over multiple queries.
        """
        print("="*80)
        print("🔬 PERFORMANCE BENCHMARK")
        print("="*80)
        
        all_embedding_times = []
        all_retrieval_times = []
        
        for query in test_queries:
            print(f"\n🔍 Testing query: '{query[:50]}...'")
            
            query_times = []
            
            for i in range(num_runs):
                # perf_counter() is a monotonic, high-resolution clock for short intervals
                start_time = time.perf_counter()
                
                # Measure embedding time
                embed_start = time.perf_counter()
                _ = self.pipeline.embedding_generator.embeddings.embed_query(query)
                embed_time = time.perf_counter() - embed_start
                
                # Measure retrieval time.
                # Note: similarity_search() embeds the query again internally,
                # so this figure includes a second embedding pass.
                retrieve_start = time.perf_counter()
                _ = self.pipeline.retriever.vector_store.similarity_search(query, k=3)
                retrieve_time = time.perf_counter() - retrieve_start
                
                total_time = time.perf_counter() - start_time
                
                query_times.append({
                    'embed_time': embed_time,
                    'retrieve_time': retrieve_time,
                    'total_time': total_time
                })
                
                all_embedding_times.append(embed_time)
                all_retrieval_times.append(retrieve_time)
            
            # Calculate stats for this query
            avg_embed = statistics.mean([t['embed_time'] for t in query_times])
            avg_retrieve = statistics.mean([t['retrieve_time'] for t in query_times])
            avg_total = statistics.mean([t['total_time'] for t in query_times])
            
            print(f"  ⏱️ Average Embedding Time: {avg_embed*1000:.2f} ms")
            print(f"  ⏱️ Average Retrieval Time: {avg_retrieve*1000:.2f} ms")
            print(f"  ⏱️ Average Total Time: {avg_total*1000:.2f} ms")
        
        # Overall statistics (stdev needs at least two measurements)
        print(f"\n{'='*80}")
        print("📊 OVERALL STATISTICS")
        print(f"{'='*80}")
        print(f"Total Queries: {len(test_queries)}")
        print(f"Runs per Query: {num_runs}")
        print(f"Total Measurements: {len(all_embedding_times)}")
        print(f"\n⏱️ Embedding Time:")
        print(f"  • Mean: {statistics.mean(all_embedding_times)*1000:.2f} ms")
        print(f"  • Median: {statistics.median(all_embedding_times)*1000:.2f} ms")
        print(f"  • Std Dev: {statistics.stdev(all_embedding_times)*1000:.2f} ms")
        print(f"  • Min: {min(all_embedding_times)*1000:.2f} ms")
        print(f"  • Max: {max(all_embedding_times)*1000:.2f} ms")
        
        print(f"\n⏱️ Retrieval Time:")
        print(f"  • Mean: {statistics.mean(all_retrieval_times)*1000:.2f} ms")
        print(f"  • Median: {statistics.median(all_retrieval_times)*1000:.2f} ms")
        print(f"  • Std Dev: {statistics.stdev(all_retrieval_times)*1000:.2f} ms")
        print(f"  • Min: {min(all_retrieval_times)*1000:.2f} ms")
        print(f"  • Max: {max(all_retrieval_times)*1000:.2f} ms")
        
        print(f"\n💡 Performance Grade:")
        avg_total = statistics.mean(all_embedding_times) + statistics.mean(all_retrieval_times)
        if avg_total < 0.05:
            grade = "🟢 EXCELLENT (< 50ms)"
        elif avg_total < 0.1:
            grade = "🟡 GOOD (< 100ms)"
        elif avg_total < 0.2:
            grade = "🟠 ACCEPTABLE (< 200ms)"
        else:
            grade = "🔴 NEEDS OPTIMIZATION (>= 200ms)"
        print(f"  {grade}")

# Run benchmark
benchmark = PerformanceBenchmark(pipeline)

test_queries = [
    "algorithm and programming exam questions",
    "student cheating policy and violations",
    "how to calculate array maximum value",
    "exam rules and regulations",
    "programming assignment grading criteria"
]

benchmark.run_benchmark(test_queries, num_runs=5)

🔬 PERFORMANCE BENCHMARK

🔍 Testing query: 'algorithm and programming exam questions...'
  ⏱️ Average Embedding Time: 9.52 ms
  ⏱️ Average Retrieval Time: 6.74 ms
  ⏱️ Average Total Time: 16.26 ms

🔍 Testing query: 'student cheating policy and violations...'
  ⏱️ Average Embedding Time: 6.83 ms
  ⏱️ Average Retrieval Time: 7.02 ms
  ⏱️ Average Total Time: 13.85 ms

🔍 Testing query: 'how to calculate array maximum value...'
  ⏱️ Average Embedding Time: 7.15 ms
  ⏱️ Average Retrieval Time: 6.67 ms
  ⏱️ Average Total Time: 13.81 ms

🔍 Testing query: 'exam rules and regulations...'
  ⏱️ Average Embedding Time: 6.78 ms
  ⏱️ Average Retrieval Time: 6.61 ms
  ⏱️ Average Total Time: 13.38 ms

🔍 Testing query: 'programming assignment grading criteria...'
  ⏱️ Average Embedding Time: 6.43 ms
  ⏱️ Average Retrieval Time: 6.21 ms
  ⏱️ Average Total Time: 12.64 ms

📊 OVERALL STATISTICS
Total Queries: 5
Runs per Query: 5
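
Because `similarity_search(query)` embeds the query internally, the retrieval figures above likely include a second embedding pass. One way to time each stage in isolation is a small helper built on `time.perf_counter` that accepts any zero-argument callable. This is a sketch only: `time_stage` and the stand-in workload are hypothetical and not part of the notebook.

```python
import statistics
import time
from typing import Callable, Dict


def time_stage(fn: Callable[[], object], num_runs: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Time a zero-argument callable and return summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not recorded (caches, lazy initialization)
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        # stdev is undefined for a single sample, so fall back to 0.0
        "stdev_ms": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in workload so the sketch runs without the pipeline
stats = time_stage(lambda: sum(i * i for i in range(10_000)), num_runs=5)
print({name: round(value, 3) for name, value in stats.items()})
```

In the notebook the callable could wrap `embed_query(query)` for the embedding stage and, assuming the installed LangChain FAISS wrapper exposes it, `similarity_search_by_vector(vector, k=3)` with a precomputed vector for the search stage.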

## 13. Utility Functions - Save & Load Pipeline

In [17]:
# Quick-load function for reusing a previously built pipeline
def quick_load_pipeline(
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
    top_k: int = 3
) -> CompleteRAGPipeline:
    """
    Quick load existing pipeline from saved index
    """
    print("🚀 Quick loading RAG pipeline...")
    
    pipeline = CompleteRAGPipeline(
        embedding_model=embedding_model,
        top_k=top_k
    )
    
    pipeline.build_pipeline(force_rebuild=False)
    
    return pipeline

# Example usage
print("="*80)
print("💾 QUICK LOAD EXAMPLE")
print("="*80)

# Simulate reloading pipeline (it will use existing index)
reloaded_pipeline = quick_load_pipeline()

# Test the reloaded pipeline
test_result = reloaded_pipeline.query(
    student_input="What is the exam policy?",
    summary="Student asking about exam rules"
)

print(f"\n✅ Pipeline reloaded and working!")
print(f"Context retrieved: {len(test_result['context'])} characters")
print(f"Sources: {test_result['num_sources']}")

💾 QUICK LOAD EXAMPLE
🚀 Quick loading RAG pipeline...
🚀 BUILDING COMPLETE RAG PIPELINE

📂 Existing FAISS index found. Loading...
🔄 Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
✅ Model loaded successfully in 4.37 seconds
📂 Loading FAISS index from faiss_index...
✅ Index loaded successfully in 0.00 seconds!

🎯 Initializing retriever and teacher agent...
✅ Agents initialized!

✅ PIPELINE BUILD COMPLETE!
⏱️ Total time: 4.37 seconds

📊 PIPELINE METRICS:
  • Total Documents: 0
  • Total Chunks: 0
  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
  • Chunk Size: 500
  • Top-K: 3
🎓 TEACHER AGENT - RAG SYSTEM

📝 Student Input: What is the exam policy?...
📋 Summary: Student asking about exam rules...

🔍 Searching for: 'Summary: Student asking about exam rules | Student Input: What is the exam policy?'
✅ Retrieved 3 documents with scores
⏱️ Retrieval time: 0.0260 seconds
  1. Score: 0.9343 | Source: uts1.t
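
The quick-load path depends on a previously saved index; the log above shows it being read from `faiss_index`. A minimal sketch of a pre-check that decides between loading and rebuilding follows. It assumes the index was written by LangChain's `FAISS.save_local` with the default index name, which produces `index.faiss` and `index.pkl`; `saved_index_exists` is a hypothetical helper, not part of the notebook.

```python
import tempfile
from pathlib import Path


def saved_index_exists(index_dir: str, index_name: str = "index") -> bool:
    """Return True only if both files of a saved index are present."""
    base = Path(index_dir)
    required = [base / f"{index_name}.faiss", base / f"{index_name}.pkl"]
    return all(path.is_file() for path in required)


# Demonstration with a temporary directory instead of the real faiss_index folder
with tempfile.TemporaryDirectory() as tmp:
    before = saved_index_exists(tmp)   # nothing saved yet
    (Path(tmp) / "index.faiss").touch()
    (Path(tmp) / "index.pkl").touch()
    after = saved_index_exists(tmp)    # both files present
    print(before, after)
```

The result could feed the existing flag, for example `pipeline.build_pipeline(force_rebuild=not saved_index_exists("faiss_index"))`, so a missing or partial index triggers a rebuild.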

## 15. Summary & Documentation

### 📚 System Overview

This RAG (Retrieval-Augmented Generation) system is designed to support the **Teacher Agent** in giving automated feedback to students.

### 🔧 Key Components:

1. **DataLoader**: Reads .txt files from the `raw_data` folder
2. **TextPreprocessor**: Cleans and normalizes text
3. **DocumentChunker**: Chunks documents with overlap to preserve context
4. **EmbeddingGenerator**: Generates embeddings with an open-source model (`sentence-transformers/all-MiniLM-L6-v2` in the runs above)
5. **FAISSVectorStore**: Efficient vector storage and search
6. **RAGRetriever**: Similarity search with top-k retrieval
7. **TeacherAgentInput**: Handler for student input with profile integration
8. **CompleteRAGPipeline**: End-to-end pipeline integration
9. **RAGAPIInterface**: API for integration with the agent system

### ⚡ Performance:
- **Embedding Time**: < 50 ms on average (about 6 to 10 ms per query in the benchmark above)
- **Retrieval Time**: < 10 ms on average (about 6 to 7 ms per query in the benchmark above)
- **Total Query Time**: < 100 ms on average
- **Memory Efficient**: FAISS optimized indexing
- **Scalable**: Can handle hundreds of documents

### 📤 Output Format:
```python
{
    "context": "Retrieved relevant context from knowledge base...",
    "student_input": "Student's answer or question...",
    "summary": "Summary of student's work...",
    "student_profile": {"level": "beginner", ...},
    "retrieval_time": 0.0234,
    "num_sources": 3,
    "sources": [...]
}
```
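
The keys above can be written down as a typed structure so downstream agents can check a result before using it. The sketch below is an assumption about the shape: `RAGResult`, `SourceInfo`, and `missing_keys` are hypothetical names, and the source fields are taken from the keys printed in the test cases (`filename`, `category`, `score`, `preview`).

```python
from typing import Any, Dict, List, Optional, TypedDict


class SourceInfo(TypedDict):
    filename: str
    category: str
    score: float
    preview: str


class RAGResult(TypedDict):
    context: str
    student_input: str
    summary: str
    student_profile: Optional[Dict[str, Any]]
    retrieval_time: float
    num_sources: int
    sources: List[SourceInfo]


def missing_keys(result: Dict[str, Any]) -> List[str]:
    """List the expected top-level keys that are absent from a result dict."""
    return sorted(set(RAGResult.__annotations__) - set(result))


# An incomplete result, used only to show the check
example = {"context": "...", "student_input": "...", "summary": "..."}
print(missing_keys(example))
```

A check like this is cheap to run at the boundary between the RAG system and each downstream agent, where a missing key would otherwise surface as a `KeyError` deep inside a formatter.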

### 🔄 Integration Flow:
```
Student Input → RAG System → Context Retrieval → Style Checker → Logic Checker → Feedback Generation
```
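
The flow can be read as a chain of stages that each receive the result dictionary and add to it. The sketch below illustrates that idea only; the retrieval, checker, and feedback functions are placeholders invented for the example, since this notebook does not implement those agents.

```python
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_flow(state: Dict[str, Any], stages: List[Stage]) -> Dict[str, Any]:
    """Pass the state through each stage in order and return the final state."""
    for stage in stages:
        state = stage(state)
    return state


def fake_retrieval(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for pipeline.query(); returns a fixed context
    return {**state, "context": "course material excerpt"}


def fake_style_checker(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder rule: flag very short submissions
    return {**state, "style_ok": len(state["student_input"].split()) >= 5}


def fake_logic_checker(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder rule: require that some context was retrieved
    return {**state, "logic_ok": bool(state.get("context"))}


def fake_feedback(state: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder feedback built from the two checker flags
    verdict = "looks fine" if state["style_ok"] and state["logic_ok"] else "needs revision"
    return {**state, "feedback": f"Submission {verdict}."}


final = run_flow(
    {"student_input": "I loop over the array and track the maximum."},
    [fake_retrieval, fake_style_checker, fake_logic_checker, fake_feedback],
)
print(final["feedback"])
```

Keeping every stage to the same signature means real agents can replace the placeholders one at a time without changing the driver.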

### 💡 Usage:
```python
# Initialize pipeline
pipeline = CompleteRAGPipeline()
pipeline.build_pipeline()

# Query
result = pipeline.query(
    student_input="...",
    summary="...",
    student_profile={...}
)

# Use context for downstream agents
context = result['context']
```

### 🎯 Use Cases:
1. ✅ Automated grading assistance
2. ✅ Personalized feedback generation
3. ✅ Context-aware tutoring
4. ✅ Style and logic checking
5. ✅ Student performance analysis

## 🎯 Ready to Use!

The RAG system is ready to use. You can:

1. **Run all cells** to build the pipeline for the first time
2. **Use `pipeline.query()`** to retrieve context
3. **Integrate with the agent system** through `RAGAPIInterface`
4. **Load the pipeline** quickly with `quick_load_pipeline()`

### Next Steps:
- Add more documents to the `raw_data/` folder
- Tune the chunking parameters and top-k as needed
- Integrate with the Style Checker and Logic Checker agents
- Implement an LLM for feedback generation that uses the retrieved context