# RAG Query Type Comparison: Sentence vs Keywords

This notebook evaluates whether **full sentence queries** or **keyword queries** work better for semantic search with BGE-M3 embeddings.

## Key Questions

1. Do sentence queries retrieve more relevant results?
2. Do keyword queries work better for technical terms?
3. Should we use multi-query (hybrid) or single-query mode?

## Test Methodology

We'll test the same question using:
- **Sentence query**: Natural language with full context
- **Keyword query**: Only key terms and technical vocabulary
- **Multi-query mode**: Current system (generates both)

Then compare:
- Retrieval scores (cosine similarity)
- Result relevance
- Answer quality
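
As a reminder of what the retrieval score measures, here is a minimal sketch of cosine similarity on toy vectors. The function name `cosine_similarity` and the 3-dimensional vectors are illustrative assumptions; real BGE-M3 embeddings are much higher-dimensional, and the server computes the scores itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and a chunk embedding
query_vec = [1.0, 0.0, 1.0]
chunk_vec = [1.0, 1.0, 0.0]
print(round(cosine_similarity(query_vec, chunk_vec), 3))  # prints 0.5
```

A higher value means the query and chunk embeddings point in more similar directions, which is why the score is only comparable between queries run against the same model and collection.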

## Setup

In [None]:
import httpx
import json
from IPython.display import display, Markdown
import pandas as pd

# Configuration
API_BASE_URL = "http://localhost:10007"
TOOLS_BASE_URL = "http://localhost:10006"
USERNAME = "admin"
PASSWORD = "administrator"
COLLECTION_NAME = "default"  # Change to your collection

# Login
response = httpx.post(
    f"{API_BASE_URL}/api/auth/login",
    json={"username": USERNAME, "password": PASSWORD},
    timeout=10.0
)
response.raise_for_status()
token = response.json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

print(f"✓ Logged in as {USERNAME}")
print(f"✓ Using collection: {COLLECTION_NAME}")

## Helper Functions

In [None]:
def query_rag_direct(query: str, max_results: int = 5):
    """Send a single query to the RAG tool endpoint and return the parsed JSON.

    This function does not switch multi-query on or off. Whether the server
    expands the query depends on its RAG_USE_MULTI_QUERY setting, so disable
    that setting on the server to compare raw sentence and keyword queries.
    """
    response = httpx.post(
        f"{TOOLS_BASE_URL}/api/tools/rag/query",
        headers=headers,
        json={
            "query": query,
            "collection_name": COLLECTION_NAME,
            "max_results": max_results
        },
        timeout=60.0
    )
    response.raise_for_status()
    return response.json()

def display_results(query_type: str, query: str, result: dict):
    """Display RAG results nicely"""
    print("\n" + "="*80)
    print(f"Query Type: {query_type}")
    print("="*80)
    print(f"Query: {query}")
    print()
    
    if result.get("success"):
        data = result.get("data", {})
        docs = data.get("documents", [])
        
        print(f"Results: {len(docs)}")
        print(f"Optimized Query: {data.get('optimized_query', 'N/A')}")
        print(f"Execution Time: {result.get('metadata', {}).get('execution_time', 0):.2f}s")
        print()
        
        # Show top results with scores
        for i, doc in enumerate(docs[:3], 1):
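            # Prefer the rerank score when present (a legitimate 0.0 must not trigger the fallback)
            score = doc.get('rerank_score')
            if score is None:
                score = doc.get('score', 0)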
            print(f"[{i}] {doc['document']} chunk {doc['chunk_index']} (score: {score:.3f})")
            print(f"    Preview: {doc['chunk'][:150]}...")
            print()
        
        # Show answer (fall back to a placeholder if the response has no answer field)
        display(Markdown(f"**Answer:**\n\n{result.get('answer', 'N/A')}"))
        
        return docs
    else:
        print(f"ERROR: {result.get('error')}")
        return []

def extract_scores(docs):
    """Extract one score per document, preferring rerank_score over score."""
    return [
        doc['rerank_score'] if doc.get('rerank_score') is not None else doc.get('score', 0)
        for doc in docs
    ]

print("✓ Helper functions defined")

## Test Case 1: Technical Specification Query

Testing with a technical query that appeared in your RAG demo:
- **Sentence**: Full natural language question with context
- **Keywords**: Just technical terms

In [None]:
# Test case
test_topic = "C-PHY operating at 3.9Gsps insertion loss specification"

# Define query variants
# Korean for: "What is the Insertion Loss spec when C-PHY operates at 3.9 Gsps?"
sentence_query = "C-PHY가 3.9Gsps로 동작할 때 Insertion Loss 스펙이 무엇인가요?"
keyword_query = "C-PHY 3.9Gsps Insertion Loss spec"

print("Test Case 1: Technical Specification Lookup")
print("="*80)
print(f"Topic: {test_topic}")
print(f"Sentence Query: {sentence_query}")
print(f"Keyword Query: {keyword_query}")

### Test 1A: Sentence Query (Natural Language)

In [None]:
result_sentence = query_rag_direct(sentence_query, max_results=5)
docs_sentence = display_results("SENTENCE (Natural Language)", sentence_query, result_sentence)
scores_sentence = extract_scores(docs_sentence)

### Test 1B: Keyword Query

In [None]:
result_keyword = query_rag_direct(keyword_query, max_results=5)
docs_keyword = display_results("KEYWORD (Terms Only)", keyword_query, result_keyword)
scores_keyword = extract_scores(docs_keyword)

### Test 1C: Score Comparison

In [None]:
# Compare scores rank by rank, padding the shorter list with None
n_rows = max(len(scores_sentence), len(scores_keyword))
comparison_df = pd.DataFrame({
    'Rank': range(1, n_rows + 1),
    'Sentence Score': scores_sentence + [None] * (n_rows - len(scores_sentence)),
    'Keyword Score': scores_keyword + [None] * (n_rows - len(scores_keyword))
})

print("\n" + "="*80)
print("SCORE COMPARISON - Test Case 1")
print("="*80)
print(comparison_df.to_string(index=False))
print()

if scores_sentence and scores_keyword:
    avg_sentence = sum(scores_sentence) / len(scores_sentence)
    avg_keyword = sum(scores_keyword) / len(scores_keyword)
    print(f"Average Score (Sentence): {avg_sentence:.3f}")
    print(f"Average Score (Keyword):  {avg_keyword:.3f}")
    print()
    
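    # Relative difference of the mean scores; skip the percentage if the baseline is not positive
    if avg_sentence > avg_keyword:
        better, worse, label = avg_sentence, avg_keyword, "Sentence"
    else:
        better, worse, label = avg_keyword, avg_sentence, "Keyword"
    if worse > 0:
        print(f"✓ {label} queries scored {(better - worse) / worse * 100:.1f}% higher on average")
    else:
        print(f"✓ {label} queries scored higher on average (no percentage: baseline is not positive)")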
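
Average scores alone do not show whether the two query styles retrieve the same chunks. A minimal sketch of an overlap check follows, assuming each result carries the `document` and `chunk_index` fields used in `display_results`. The helper names `chunk_keys` and `jaccard_overlap` and the example data are hypothetical.

```python
def chunk_keys(docs):
    """Identify each retrieved chunk by its (document, chunk_index) pair."""
    return {(d["document"], d["chunk_index"]) for d in docs}

def jaccard_overlap(docs_a, docs_b):
    """Share of distinct chunks that appear in both result lists (0.0 to 1.0)."""
    keys_a, keys_b = chunk_keys(docs_a), chunk_keys(docs_b)
    union = keys_a | keys_b
    if not union:
        return 0.0  # both lists empty
    return len(keys_a & keys_b) / len(union)

# Made-up results for illustration only
example_sentence = [
    {"document": "spec.pdf", "chunk_index": 3},
    {"document": "spec.pdf", "chunk_index": 7},
]
example_keyword = [
    {"document": "spec.pdf", "chunk_index": 3},
    {"document": "guide.pdf", "chunk_index": 1},
]
print(f"Overlap: {jaccard_overlap(example_sentence, example_keyword):.2f}")
```

In the notebook this could be called as `jaccard_overlap(docs_sentence, docs_keyword)`. A low overlap with similar average scores suggests the two styles find different but comparably scored chunks.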

## Test Case 2: Conceptual Query

Testing with a more conceptual query that requires understanding relationships:

In [None]:
# Conceptual query (modify to match your documents)
sentence_query_2 = "What are the main differences between USB 3.2 and USB 2.0 in terms of performance?"
keyword_query_2 = "USB 3.2 2.0 differences performance"

print("Test Case 2: Conceptual Comparison")
print("="*80)
print(f"Sentence Query: {sentence_query_2}")
print(f"Keyword Query: {keyword_query_2}")

In [None]:
# Test sentence query
result_sentence_2 = query_rag_direct(sentence_query_2, max_results=5)
docs_sentence_2 = display_results("SENTENCE", sentence_query_2, result_sentence_2)
scores_sentence_2 = extract_scores(docs_sentence_2)

In [None]:
# Test keyword query
result_keyword_2 = query_rag_direct(keyword_query_2, max_results=5)
docs_keyword_2 = display_results("KEYWORD", keyword_query_2, result_keyword_2)
scores_keyword_2 = extract_scores(docs_keyword_2)

In [None]:
# Compare scores rank by rank, padding the shorter list with None
n_rows_2 = max(len(scores_sentence_2), len(scores_keyword_2))
comparison_df_2 = pd.DataFrame({
    'Rank': range(1, n_rows_2 + 1),
    'Sentence Score': scores_sentence_2 + [None] * (n_rows_2 - len(scores_sentence_2)),
    'Keyword Score': scores_keyword_2 + [None] * (n_rows_2 - len(scores_keyword_2))
})

print("\n" + "="*80)
print("SCORE COMPARISON - Test Case 2")
print("="*80)
print(comparison_df_2.to_string(index=False))

## Test Case 3: Multi-Query Mode (Current System)

Now let's see how the current multi-query system (which generates both semantic and keyword variants) compares:

In [None]:
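# The server applies multi-query automatically when RAG_USE_MULTI_QUERY=True in its config.
# This is the same call as Test 1A, so the results differ only if that setting changed in between.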
result_multi = query_rag_direct(sentence_query, max_results=5)

print("\n" + "="*80)
print("MULTI-QUERY MODE (Current System)")
print("="*80)
print(f"Original Query: {sentence_query}")
print()

if result_multi.get("success"):
    data = result_multi.get("data", {})
    print(f"Multi-Query Enabled: {result_multi.get('metadata', {}).get('multi_query', 'N/A')}")
    print(f"Query Variants Generated: {data.get('optimized_query', 'N/A')}")
    print()
    
    docs_multi = data.get("documents", [])
    scores_multi = extract_scores(docs_multi)
    
    print(f"Results Retrieved: {len(docs_multi)}")
    print(f"Average Score: {sum(scores_multi)/len(scores_multi) if scores_multi else 0:.3f}")
    print()
    
    display(Markdown(f"**Answer:**\n\n{result_multi.get('answer', 'N/A')}"))
else:
    print(f"ERROR: {result_multi.get('error')}")

## Overall Analysis and Recommendations

In [None]:
print("="*80)
print("SUMMARY: SENTENCE vs KEYWORD QUERIES")
print("="*80)
print()
print("General guidance (static text, not computed from the runs above):")
print()
print("1. SENTENCE QUERIES (Natural Language)")
print("   Advantages:")
print("   - Provide semantic context for embedding model")
print("   - Better for conceptual and relationship-based questions")
print("   - More robust to terminology variations")
print("   - Closer to the natural-language text that BGE-M3 was trained on")
print()
print("   Best for:")
print("   - 'What is the relationship between X and Y?'")
print("   - 'How does X work when Y happens?'")
print("   - 'Why is X important for Y?'")
print()
print("2. KEYWORD QUERIES (Terms Only)")
print("   Advantages:")
print("   - Precise for specific technical term lookups")
print("   - Efficient for known terminology")
print("   - Useful for model numbers, acronyms, specifications")
print()
print("   Best for:")
print("   - 'USB-IF specifications'")
print("   - 'C-PHY MIPI'")
print("   - '3.9Gsps insertion loss'")
print()
print("3. MULTI-QUERY MODE (Hybrid - Current System)")
print("   Advantages:")
print("   - Combines both approaches automatically")
print("   - Generates semantic + keyword + aspect variants")
print("   - Merges results via Reciprocal Rank Fusion (RRF)")
print("   - Covers edge cases where one approach fails")
print()
print("="*80)
print("RECOMMENDATION")
print("="*80)
print()
print("For BGE-M3 embedding model:")
print()
print("✓ KEEP multi-query mode enabled (RAG_USE_MULTI_QUERY = True)")
print("  - Generates both semantic AND keyword queries automatically")
print("  - Best of both worlds with minimal extra cost")
print("  - Only ~6 LLM calls per query for query generation")
print()
print("✓ USE sentence queries in your notebooks/applications")
print("  - More natural for users")
print("  - System will automatically generate keyword variants")
print()
print("✗ AVOID pure keyword queries unless:")
print("  - You're searching for exact model numbers/part numbers")
print("  - You need the lowest latency (skip multi-query)")
print("  - You're doing batch/programmatic lookups")
print()
print("="*80)

## Configuration Recommendation

Based on this analysis, here are the recommended `config.py` settings:

```python
# RAG Query Strategy
RAG_USE_MULTI_QUERY = True  # Enable hybrid semantic + keyword approach
RAG_MULTI_QUERY_COUNT = 6   # Generate 6 variants (3 types × 2 languages)
RAG_QUERY_PREFIX = ""       # BGE-M3 does not need an instruction prefix on queries

# Embedding Model
RAG_EMBEDDING_MODEL = "BAAI/bge-m3"  # Multilingual embedding model used for semantic search
```
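
The count of 6 follows from 3 variant types times 2 languages. A small sketch of that arithmetic is shown below, using the variant types named in the summary (semantic, keyword, aspect) and assuming the two languages are Korean and English. The names `VARIANT_TYPES` and `LANGUAGES` are hypothetical, and how the system actually enumerates variants is not shown in this notebook.

```python
from itertools import product

VARIANT_TYPES = ["semantic", "keyword", "aspect"]  # assumed from the summary text
LANGUAGES = ["ko", "en"]                           # assumed: Korean and English

# One variant per (type, language) combination
variants = list(product(VARIANT_TYPES, LANGUAGES))
for variant_type, language in variants:
    print(f"{variant_type:<8} {language}")

print(f"Total variants: {len(variants)}")  # prints Total variants: 6
```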

### Why This Works

1. **Sentence queries** give the embedding model more context to encode, which tends to help conceptual questions
2. **Multi-query mode** generates keyword variants automatically, so you don't have to choose
3. **RRF merging** combines results from all query types, improving recall
4. **Bilingual generation** helps with Korean/English mixed documents
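
To make the RRF step concrete, here is a minimal sketch of Reciprocal Rank Fusion over ranked lists of chunk IDs. The function name `reciprocal_rank_fusion` is hypothetical, and `k=60` is the constant commonly used in the literature. The value and implementation used by this system are not shown in the notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Made-up chunk IDs for illustration only
sentence_ranking = ["A", "B", "C"]
keyword_ranking = ["B", "D", "A"]

fused = reciprocal_rank_fusion([sentence_ranking, keyword_ranking])
print([item for item, _ in fused])  # prints ['B', 'A', 'D', 'C']
```

Chunks that appear in both lists (A and B) rise to the top even though neither list agrees on the first result, which is how fusion improves recall without trusting any single query variant.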

### When to Disable Multi-Query

Set `RAG_USE_MULTI_QUERY = False` if:
- Latency is critical (saves ~2-3 seconds per query)
- You're doing batch processing (reduced LLM costs)
- Your queries are already highly optimized
- Documents are in a single language only
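
Before trading recall for latency, it helps to measure the difference on your own setup. The sketch below times any callable with `time.perf_counter` and reports the median. The helper name `time_call` is hypothetical, and toggling `RAG_USE_MULTI_QUERY` between runs is assumed to happen on the server.

```python
import statistics
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn repeatedly and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload so the sketch runs without a server
elapsed = time_call(time.sleep, 0.01, repeats=3)
print(f"Median latency: {elapsed:.3f}s")
```

In the notebook, `time_call(query_rag_direct, sentence_query)` could be run once with multi-query enabled and once with it disabled to check the latency difference directly.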

## Next Steps

1. Run this notebook with your actual document collection
2. Compare answer quality (not just scores)
3. Test with different query types from your use case
4. Adjust `RAG_MULTI_QUERY_COUNT` if needed (default: 6)
5. Monitor `data/logs/prompts.log` to see the generated variants