# TextReadingRAG System Testing Notebook

This notebook demonstrates how to:
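1. Create and ingest sample documents (plain-text files here; the system also accepts PDF and DOCX)
2. Query the system in both English and Traditional Chinese
3. Compare retrieval modes (vector-only, BM25-only, and hybrid)
4. Tune retrieval parameters and inspect retrieval statistics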

## Setup

In [None]:
import asyncio
import os
import sys
from pathlib import Path
from pprint import pprint

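# Add the project root to sys.path so the `src` package can be imported.
# This assumes the notebook is started from the project root directory.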
project_root = Path(os.getcwd())
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")
print(f"Python version: {sys.version}")
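
The cell above treats the current working directory as the project root, which only holds when the notebook is launched from that directory. As a hedged alternative, the sketch below walks up the directory tree until it finds a folder containing `src`. The function name `find_project_root` and the choice of `src` as the marker are assumptions made for illustration, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: `marker` is assumed to be a directory that only
    exists at the project root.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    # No marker found: fall back to the starting directory,
    # which matches the behavior of the setup cell.
    return start
```

With this helper, `project_root = find_project_root(Path.cwd())` would also work when the notebook lives in a subdirectory such as `notebooks/`.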

In [None]:
# Import required modules
from src.core.config import Settings
from src.rag.ingestion import DocumentIngestionService, DocumentLoader
from src.rag.retrieval import HybridRetrievalService, RetrievalMode, FusionMethod
from src.rag.vector_store import ChromaVectorStore
from src.rag.language_utils import detect_language

print("✓ All modules imported successfully")

In [None]:
# Initialize settings
settings = Settings()

print("Configuration:")
print(f"  App Name: {settings.app.app_name}")
print(f"  OpenAI Model: {settings.llm.openai_model}")
print(f"  Embedding Model: {settings.llm.openai_embedding_model}")
print(f"  Chroma Collection: {settings.rag.chroma_collection_name}")
print(f"  Supported Languages: {settings.rag.supported_languages}")
print(f"  Language Detection: {settings.rag.enable_language_detection}")
print(f"  Chunk Size (EN): {settings.rag.chunk_size}")
print(f"  Chunk Size (ZH): {settings.rag.chinese_chunk_size}")
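
The two chunk-size settings printed above imply that documents are split differently depending on their language. The sketch below shows one simple way such a split could work: fixed-size character windows with optional overlap. It is an illustration only. The function names and the sizes in `CHUNK_SIZES` are assumptions, and the project's ingestion service may well use a token-based or sentence-aware splitter instead.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split `text` into character windows of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Illustrative sizes only; the real values come from settings.rag.
CHUNK_SIZES = {"en": 512, "zh": 256}


def chunk_for_language(text: str, language: str, overlap: int = 0) -> list[str]:
    """Pick a chunk size by language code, defaulting to the English size."""
    size = CHUNK_SIZES.get(language, CHUNK_SIZES["en"])
    return chunk_text(text, size, overlap)
```

Chinese text carries more information per character than English, which is the usual motivation for giving it a smaller character budget per chunk.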

## Initialize Services

In [None]:
# Initialize vector store
vector_store = ChromaVectorStore(
    host=settings.rag.chroma_host,
    port=settings.rag.chroma_port,
    persist_directory=settings.rag.chroma_persist_directory,
    settings=settings,
)

print("✓ Vector store initialized")

In [None]:
# Initialize ingestion service
ingestion_service = DocumentIngestionService(
    settings=settings,
    vector_store=vector_store,
)

print("✓ Ingestion service initialized")

In [None]:
# Initialize retrieval service
retrieval_service = HybridRetrievalService(
    vector_store=vector_store,
    settings=settings,
)

print("✓ Retrieval service initialized")

## 1. Document Upload and Ingestion

### Create Sample Documents

In [None]:
# Create test documents directory
test_docs_dir = project_root / "test_documents"
test_docs_dir.mkdir(exist_ok=True)

print(f"Test documents directory: {test_docs_dir}")

In [None]:
# Create sample English document
english_doc = test_docs_dir / "ai_introduction_en.txt"
english_doc.write_text("""
Introduction to Artificial Intelligence

Artificial Intelligence (AI) is a branch of computer science that aims to create 
intelligent machines that can perform tasks that typically require human intelligence. 
These tasks include visual perception, speech recognition, decision-making, and 
language translation.

Machine Learning

Machine learning is a subset of AI that enables computers to learn from data without 
being explicitly programmed. It uses algorithms to identify patterns in data and make 
predictions or decisions based on those patterns.

Deep Learning

Deep learning is a specialized form of machine learning that uses artificial neural 
networks with multiple layers. These deep neural networks can process complex data 
such as images, audio, and text with remarkable accuracy.

Natural Language Processing

Natural Language Processing (NLP) is a field of AI that focuses on enabling computers 
to understand, interpret, and generate human language. NLP powers applications like 
chatbots, translation services, and sentiment analysis.

Computer Vision

Computer vision is another important area of AI that enables machines to interpret 
and understand visual information from the world. It's used in applications like 
facial recognition, autonomous vehicles, and medical image analysis.
""", encoding='utf-8')

print(f"✓ Created: {english_doc.name}")

In [None]:
# Create sample Chinese document (Traditional)
chinese_doc = test_docs_dir / "ai_introduction_zh.txt"
chinese_doc.write_text("""
人工智能簡介

人工智能（AI）是計算機科學的一個分支，旨在創建能夠執行通常需要人類智能的任務的智能機器。
這些任務包括視覺感知、語音識別、決策制定和語言翻譯。

機器學習

機器學習是人工智能的一個子集，使計算機能夠從數據中學習而無需明確編程。它使用算法來識別
數據中的模式，並根據這些模式進行預測或決策。

深度學習

深度學習是機器學習的一種特殊形式，使用具有多個層次的人工神經網絡。這些深度神經網絡可以
以驚人的準確性處理複雜的數據，如圖像、音頻和文本。

自然語言處理

自然語言處理（NLP）是人工智能的一個領域，專注於使計算機能夠理解、解釋和生成人類語言。
NLP 為聊天機器人、翻譯服務和情感分析等應用提供動力。

計算機視覺

計算機視覺是人工智能的另一個重要領域，使機器能夠解釋和理解來自世界的視覺信息。它用於
面部識別、自動駕駛汽車和醫療影像分析等應用。
""", encoding='utf-8')

print(f"✓ Created: {chinese_doc.name}")

In [None]:
# Create a mixed language document
mixed_doc = test_docs_dir / "rag_systems.txt"
mixed_doc.write_text("""
Retrieval-Augmented Generation (RAG) Systems

RAG systems combine the power of large language models with external knowledge bases 
to provide more accurate and contextually relevant responses. The system retrieves 
relevant documents from a vector database and uses them to augment the generation process.

檢索增強生成系統

RAG 系統結合了大型語言模型的強大功能和外部知識庫，以提供更準確和上下文相關的響應。
該系統從向量數據庫中檢索相關文檔，並使用它們來增強生成過程。

Key Components:
1. Document Ingestion - Processing and chunking documents
2. Vector Embeddings - Converting text to numerical vectors
3. Retrieval - Finding relevant documents using similarity search
4. Generation - Creating responses using retrieved context

主要組成部分：
1. 文檔攝取 - 處理和分塊文檔
2. 向量嵌入 - 將文本轉換為數值向量
3. 檢索 - 使用相似性搜索找到相關文檔
4. 生成 - 使用檢索到的上下文創建響應
""", encoding='utf-8')

print(f"✓ Created: {mixed_doc.name}")

### Ingest Documents

In [None]:
# Ingest English document
print("Ingesting English document...")
result_en = await ingestion_service.ingest_file(
    file_path=str(english_doc),
    collection_name=settings.rag.chroma_collection_name,
)

print("\nEnglish Document Ingestion Results:")
print(f"  File: {result_en['filename']}")
print(f"  Nodes created: {result_en['nodes_created']}")
print(f"  Processing time: {result_en['processing_time_seconds']:.2f}s")
print(f"  Collection: {result_en['collection_name']}")

In [None]:
# Ingest Chinese document
print("Ingesting Chinese document...")
result_zh = await ingestion_service.ingest_file(
    file_path=str(chinese_doc),
    collection_name=settings.rag.chroma_collection_name,
)

print("\nChinese Document Ingestion Results:")
print(f"  File: {result_zh['filename']}")
print(f"  Nodes created: {result_zh['nodes_created']}")
print(f"  Processing time: {result_zh['processing_time_seconds']:.2f}s")
print(f"  Collection: {result_zh['collection_name']}")

In [None]:
# Ingest mixed document
print("Ingesting mixed language document...")
result_mixed = await ingestion_service.ingest_file(
    file_path=str(mixed_doc),
    collection_name=settings.rag.chroma_collection_name,
)

print("\nMixed Document Ingestion Results:")
print(f"  File: {result_mixed['filename']}")
print(f"  Nodes created: {result_mixed['nodes_created']}")
print(f"  Processing time: {result_mixed['processing_time_seconds']:.2f}s")
print(f"  Collection: {result_mixed['collection_name']}")

In [None]:
# Get ingestion statistics
stats = ingestion_service.get_ingestion_stats()

print("\nCollection Statistics:")
pprint(stats)
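
The three ingestion cells each return a dictionary with `filename`, `nodes_created`, and `processing_time_seconds`. A small aggregate makes it easier to compare runs. The helper below is a sketch: `summarize_ingestion` is a hypothetical name, and it assumes only the dictionary keys used in the cells above.

```python
def summarize_ingestion(results: list[dict]) -> dict:
    """Aggregate per-file ingestion results into a single summary."""
    total_nodes = sum(r["nodes_created"] for r in results)
    total_seconds = sum(r["processing_time_seconds"] for r in results)
    return {
        "files": [r["filename"] for r in results],
        "total_nodes": total_nodes,
        "total_seconds": total_seconds,
        # Throughput is undefined when no time was recorded.
        "nodes_per_second": total_nodes / total_seconds if total_seconds > 0 else None,
    }
```

In this notebook it could be called as `summarize_ingestion([result_en, result_zh, result_mixed])`.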

## 2. Query Testing

### English Queries

In [None]:
# Test English query - Machine Learning
query_en_1 = "What is machine learning?"

print(f"Query: {query_en_1}")
print(f"Detected language: {detect_language(query_en_1)}")
print("\nRetrieving...")

results_en_1 = await retrieval_service.retrieve(
    query=query_en_1,
    mode=RetrievalMode.HYBRID,
    top_k=3,
    collection_name=settings.rag.chroma_collection_name,
)

print(f"\nRetrieved {len(results_en_1)} results:\n")
for i, result in enumerate(results_en_1, 1):
    print(f"Result {i} (Score: {result.score:.4f}):")
    print(f"  Text: {result.node.text[:200]}...")
    print(f"  Language: {result.node.metadata.get('language', 'N/A')}")
    print(f"  Method: {result.node.metadata.get('retrieval_method', 'N/A')}")
    print()
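
The result-printing loop above recurs in most of the query cells that follow. One way to factor it out is a small formatting helper. The sketch below assumes only what the cells already rely on: each result has a `.score` and a `.node` with `.text` and `.metadata`. `StubNode` and `StubResult` are hypothetical stand-ins so the example runs outside the project; they are not the project's classes.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    """Stand-in for a retrieved node (hypothetical)."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class StubResult:
    """Stand-in for a scored retrieval result (hypothetical)."""
    node: StubNode
    score: float


def format_results(results, max_chars: int = 200) -> str:
    """Render retrieval results as a printable block of text."""
    lines = [f"Retrieved {len(results)} results:"]
    for i, result in enumerate(results, 1):
        text = result.node.text
        # Only add an ellipsis when the text was actually truncated.
        preview = text[:max_chars] + ("..." if len(text) > max_chars else "")
        lines.append(f"Result {i} (Score: {result.score:.4f}):")
        lines.append(f"  Text: {preview}")
        lines.append(f"  Language: {result.node.metadata.get('language', 'N/A')}")
    return "\n".join(lines)
```

A query cell could then end with `print(format_results(results_en_1))`.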

In [None]:
# Test English query - Deep Learning
query_en_2 = "Explain deep learning and neural networks"

print(f"Query: {query_en_2}")
print(f"Detected language: {detect_language(query_en_2)}")
print("\nRetrieving...")

results_en_2 = await retrieval_service.retrieve(
    query=query_en_2,
    mode=RetrievalMode.HYBRID,
    top_k=3,
    collection_name=settings.rag.chroma_collection_name,
)

print(f"\nRetrieved {len(results_en_2)} results:\n")
for i, result in enumerate(results_en_2, 1):
    print(f"Result {i} (Score: {result.score:.4f}):")
    print(f"  Text: {result.node.text[:200]}...")
    print(f"  Language: {result.node.metadata.get('language', 'N/A')}")
    print()

### Chinese Queries (Traditional Chinese)

In [None]:
# Test Chinese query - Machine Learning
query_zh_1 = "什麼是機器學習？"

print(f"Query: {query_zh_1}")
print(f"Detected language: {detect_language(query_zh_1)}")
print("\nRetrieving...")

results_zh_1 = await retrieval_service.retrieve(
    query=query_zh_1,
    mode=RetrievalMode.HYBRID,
    top_k=3,
    collection_name=settings.rag.chroma_collection_name,
)

print(f"\nRetrieved {len(results_zh_1)} results:\n")
for i, result in enumerate(results_zh_1, 1):
    print(f"Result {i} (Score: {result.score:.4f}):")
    print(f"  Text: {result.node.text[:200]}...")
    print(f"  Language: {result.node.metadata.get('language', 'N/A')}")
    print(f"  Method: {result.node.metadata.get('retrieval_method', 'N/A')}")
    print()
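
The cells in this section call the project's `detect_language` on each query. For intuition, the sketch below shows a very simple heuristic that separates Chinese from English by the share of CJK ideographs among the letters. It is not the project's implementation: `guess_language`, the 0.3 threshold, and the two-label output (`"en"`, `"zh"`) are all assumptions for illustration.

```python
def guess_language(text: str, threshold: float = 0.3) -> str:
    """Return "zh" if CJK ideographs make up at least `threshold` of the letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"
    # U+4E00..U+9FFF is the CJK Unified Ideographs block.
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "zh" if cjk / len(letters) >= threshold else "en"
```

A ratio rather than a simple "contains any Chinese character" check lets mixed queries such as "RAG 系統的主要組成部分是什麼？" be classified by their dominant script.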

In [None]:
# Test Chinese query - Natural Language Processing
query_zh_2 = "自然語言處理有什麼應用？"

print(f"Query: {query_zh_2}")
print(f"Detected language: {detect_language(query_zh_2)}")
print("\nRetrieving...")

results_zh_2 = await retrieval_service.retrieve(
    query=query_zh_2,
    mode=RetrievalMode.HYBRID,
    top_k=3,
    collection_name=settings.rag.chroma_collection_name,
)

print(f"\nRetrieved {len(results_zh_2)} results:\n")
for i, result in enumerate(results_zh_2, 1):
    print(f"Result {i} (Score: {result.score:.4f}):")
    print(f"  Text: {result.node.text[:200]}...")
    print(f"  Language: {result.node.metadata.get('language', 'N/A')}")
    print()

In [None]:
# Test Chinese query - RAG systems
query_zh_3 = "RAG 系統的主要組成部分是什麼？"

print(f"Query: {query_zh_3}")
print(f"Detected language: {detect_language(query_zh_3)}")
print("\nRetrieving...")

results_zh_3 = await retrieval_service.retrieve(
    query=query_zh_3,
    mode=RetrievalMode.HYBRID,
    top_k=3,
    collection_name=settings.rag.chroma_collection_name,
)

print(f"\nRetrieved {len(results_zh_3)} results:\n")
for i, result in enumerate(results_zh_3, 1):
    print(f"Result {i} (Score: {result.score:.4f}):")
    print(f"  Text: {result.node.text[:200]}...")
    print(f"  Language: {result.node.metadata.get('language', 'N/A')}")
    print()

## 3. Compare Retrieval Modes

In [None]:
# Test different retrieval modes with the same query
test_query = "深度學習如何處理複雜數據？"

print(f"Query: {test_query}\n")

modes = [
    RetrievalMode.VECTOR_ONLY,
    RetrievalMode.BM25_ONLY,
    RetrievalMode.HYBRID,
]

for mode in modes:
    print(f"\n{'='*60}")
    print(f"Retrieval Mode: {mode.value.upper()}")
    print('='*60)
    
    results = await retrieval_service.retrieve(
        query=test_query,
        mode=mode,
        top_k=2,
        collection_name=settings.rag.chroma_collection_name,
    )
    
    for i, result in enumerate(results, 1):
        print(f"\nResult {i} (Score: {result.score:.4f}):")
        print(f"  {result.node.text[:150]}...")
        print(f"  Retrieval time: {result.node.metadata.get('retrieval_time_ms', 'N/A')}ms")
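
Hybrid mode has to merge a vector ranking and a BM25 ranking into one list. A common technique for this is reciprocal rank fusion (RRF), sketched below. Whether `HybridRetrievalService` uses RRF, and with which constant, is not shown in this notebook, so treat this as an illustration of the general idea. The function name and the conventional default `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because RRF uses only ranks, it does not need the vector and BM25 scores to be on comparable scales.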

## 4. Advanced: Custom Retrieval Parameters

In [None]:
# Test with different alpha values (dense vs sparse weight)
test_query = "computer vision applications"

print(f"Query: {test_query}\n")

# alpha weights dense (vector) retrieval against sparse (BM25) retrieval
alpha_labels = {0.0: "Sparse only", 0.5: "Balanced", 1.0: "Dense only"}

for alpha, label in alpha_labels.items():
    print(f"\nAlpha = {alpha} ({label}):")
    
    results = await retrieval_service.retrieve(
        query=test_query,
        mode=RetrievalMode.HYBRID,
        alpha=alpha,
        top_k=2,
        collection_name=settings.rag.chroma_collection_name,
    )
    
    for i, result in enumerate(results, 1):
        print(f"  {i}. Score: {result.score:.4f} | {result.node.text[:100]}...")
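
The cell above describes alpha as the weight between dense and sparse retrieval, with 0.0 meaning sparse only and 1.0 meaning dense only. One plausible way to apply such a weight is a convex combination of normalized scores, sketched below. This is an assumption about how alpha might be used, not a description of the project's code. `min_max_normalize` and `weighted_fusion` are hypothetical names.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; all-equal scores map to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def weighted_fusion(
    dense: dict[str, float], sparse: dict[str, float], alpha: float
) -> list[tuple[str, float]]:
    """Combine scores as alpha * dense + (1 - alpha) * sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    fused = {
        # A document missing from one retriever contributes 0 from that side.
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1.0 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Normalizing first matters because cosine similarities and BM25 scores live on different scales; without it, alpha would not behave as a meaningful weight.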

## 5. Batch Testing with Multiple Queries

In [None]:
# Define test queries in both languages
test_queries = [
    ("What is artificial intelligence?", "en"),
    ("什麼是人工智能？", "zh"),
    ("How does machine learning work?", "en"),
    ("機器學習如何運作？", "zh"),
    ("What are the applications of NLP?", "en"),
    ("自然語言處理有哪些應用？", "zh"),
]

print("Batch Query Testing\n")
print("="*70)

for query, expected_lang in test_queries:
    detected_lang = detect_language(query)
    
    print(f"\nQuery: {query}")
    print(f"Expected: {expected_lang} | Detected: {detected_lang}")
    
    results = await retrieval_service.retrieve(
        query=query,
        mode=RetrievalMode.HYBRID,
        top_k=1,
        collection_name=settings.rag.chroma_collection_name,
    )
    
    if results:
        top_result = results[0]
        print(f"Top result (Score: {top_result.score:.4f}):")
        print(f"  {top_result.node.text[:120]}...")
    else:
        print("  No results found")
    
    print("-" * 70)
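
The batch loop above prints the expected and detected language side by side but leaves the comparison to the reader. The sketch below tallies the matches instead. `detection_accuracy` is a hypothetical helper, and `ascii_detector` is a deliberately naive stand-in so the example runs on its own. In the notebook, the project's `detect_language` would be passed in, assuming it returns the same short codes (`"en"`, `"zh"`) used in `test_queries`.

```python
from typing import Callable


def detection_accuracy(
    cases: list[tuple[str, str]], detect: Callable[[str], str]
) -> tuple[float, list[tuple[str, str, str]]]:
    """Return (accuracy, mismatches) for (query, expected_language) pairs."""
    mismatches = []
    for query, expected in cases:
        detected = detect(query)
        if detected != expected:
            mismatches.append((query, expected, detected))
    accuracy = 1.0 - len(mismatches) / len(cases) if cases else 0.0
    return accuracy, mismatches


def ascii_detector(text: str) -> str:
    """Naive stand-in detector: pure-ASCII text is "en", anything else "zh"."""
    return "en" if text.isascii() else "zh"
```

Calling `detection_accuracy(test_queries, detect_language)` would then give a single number to track across configuration changes, plus the list of queries that were misclassified.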

## 6. Retrieval Statistics and Performance

In [None]:
# Get retrieval statistics
retrieval_stats = retrieval_service.get_retrieval_stats(
    collection_name=settings.rag.chroma_collection_name
)

print("Retrieval Service Statistics:\n")
pprint(retrieval_stats)
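
The statistics above come from the service itself. To measure end-to-end latency from the notebook's side, repeated calls can be timed directly. The sketch below is one way to do that with the standard library. `time_async_calls` is a hypothetical helper, and `fake_retrieve` is a stand-in coroutine used only to make the example self-contained.

```python
import asyncio
import statistics
import time


async def time_async_calls(make_call, runs: int = 5) -> dict:
    """Await `make_call()` `runs` times and summarize wall-clock latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        await make_call()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


async def fake_retrieve():
    """Stand-in for a retrieval call; sleeps briefly and returns no results."""
    await asyncio.sleep(0.01)
    return []
```

In a notebook cell, where top-level `await` is available, this could be used as `await time_async_calls(lambda: retrieval_service.retrieve(query=test_query, mode=RetrievalMode.HYBRID, top_k=2, collection_name=settings.rag.chroma_collection_name))`. The first run often includes warm-up costs, so the median is usually a steadier figure than the mean.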

## 7. Cleanup (Optional)

In [None]:
# Uncomment to delete the test collection
# vector_store.delete_collection(settings.rag.chroma_collection_name)
# print(f"✓ Deleted collection: {settings.rag.chroma_collection_name}")

In [None]:
# Uncomment to delete test documents
# import shutil
# if test_docs_dir.exists():
#     shutil.rmtree(test_docs_dir)
#     print(f"✓ Deleted test documents directory: {test_docs_dir}")

## Summary

This notebook demonstrated:

✅ Document ingestion with language detection  
✅ English and Chinese query support  
✅ Multiple retrieval modes (vector, BM25, hybrid)  
✅ Retrieval parameter tuning (alpha, top_k)  
✅ Batch query testing  
✅ Performance statistics  

### Next Steps

1. Upload your own documents (PDF, DOCX, TXT)
2. Experiment with different chunk sizes for different languages
3. Try query expansion for better retrieval results
4. Implement reranking for improved accuracy
5. Test with larger document collections

### Tips

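- For Chinese documents, use smaller chunk sizes (for example, 256 characters versus 512 for English); the values in effect are printed in the Setup section as `Chunk Size (EN)` and `Chunk Size (ZH)`
- Hybrid retrieval often provides the best results because it combines semantic (vector) and keyword (BM25) matching; section 3 lets you check this on your own queries
- Adjust alpha to your use case: values near 1.0 favor dense (semantic) matching, values near 0.0 favor sparse (keyword) matching
- Monitor retrieval times and lower top_k if latency matters more than recall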