# 07: Retrieval-Based Chatbot

**Duration:** 3-4 hours | **Difficulty:** Intermediate

## Learning Objectives
- Information retrieval fundamentals
- Embedding spaces and similarity metrics
- Response ranking systems
- FAQ chatbot implementation

## Table of Contents
1. [Retrieval-Based Systems](#1-introduction)
2. [Text Embeddings](#2-embeddings)
3. [Similarity Computation](#3-similarity)
4. [Response Ranking](#4-ranking)
5. [FAQ Chatbot](#5-chatbot)
6. [Practical Exercise](#6-exercise)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
import re

# Import utilities
import sys
sys.path.append('../')
from utils.text_utils import TextPreprocessor
from utils.model_helpers import get_device

device = get_device("auto")
print(f"Using device: {device}")

## 1. Retrieval-Based Systems {#1-introduction}

**Retrieval-based chatbots** select responses from a predefined set rather than generating new text. Their main strengths are:
- Consistent, reliable responses
- Domain-specific knowledge (FAQ, customer service)
- Lower computational requirements than generative models

### Architecture:
1. **Knowledge Base**: Collection of question-answer pairs
2. **Retriever**: Finds relevant responses
3. **Ranker**: Scores and ranks candidates
4. **Selector**: Chooses final response
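
The four stages above can be sketched end to end in plain Python. This is a toy illustration, not the implementation built later in this notebook: the two-entry `toy_kb`, the word-overlap scoring, and the `min_score` threshold are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical two-entry knowledge base, for illustration only
toy_kb: List[Dict[str, str]] = [
    {"question": "What is machine learning?", "answer": "ML learns patterns from data."},
    {"question": "What is a chatbot?", "answer": "A program that converses with users."},
]

def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, kb: List[Dict[str, str]]) -> List[int]:
    """Retriever: keep entries sharing at least one word with the query."""
    q = tokenize(query)
    return [i for i, e in enumerate(kb) if q & tokenize(e["question"])]

def rank(query: str, kb: List[Dict[str, str]], candidates: List[int]) -> List[Tuple[int, float]]:
    """Ranker: score candidates by Jaccard word overlap, best first."""
    q = tokenize(query)
    scored = []
    for i in candidates:
        words = tokenize(kb[i]["question"])
        scored.append((i, len(q & words) / len(q | words)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def select(kb: List[Dict[str, str]], ranked: List[Tuple[int, float]],
           min_score: float = 0.3) -> Optional[str]:
    """Selector: return the top answer only if it clears the threshold."""
    if ranked and ranked[0][1] >= min_score:
        return kb[ranked[0][0]]["answer"]
    return None

def answer(query: str, kb: List[Dict[str, str]] = toy_kb) -> Optional[str]:
    """Run the full pipeline: retrieve, rank, select."""
    return select(kb, rank(query, kb, retrieve(query, kb)))

print(answer("What is machine learning?"))
print(answer("pizza recipes"))
```

Keeping the stages as separate functions makes it easy to swap one out, for example replacing the word-overlap retriever with an embedding-based one, without touching the rest.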

In [None]:
# Load FAQ knowledge base
with open('../data/conversations/faq_knowledge.json', 'r') as f:
    faq_data = json.load(f)

print(f"Loaded {len(faq_data)} FAQ entries")
print("Sample entry:")
print(f"Q: {faq_data[0]['question']}")
print(f"A: {faq_data[0]['answer'][:100]}...")
print(f"Category: {faq_data[0]['category']}")
print(f"Keywords: {faq_data[0]['keywords'][:3]}")

## 2. Text Embeddings {#2-embeddings}

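Convert text to numerical vectors. This section uses TF-IDF, which captures term overlap between texts rather than deeper semantic meaning; semantic embeddings (Word2Vec, BERT) are left as an exercise in section 6.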
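
To make the vectors concrete, the sketch below computes TF-IDF by hand for a three-document toy corpus. It assumes the smoothed IDF formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`; the corpus and function name are made up for the example, and stop-word removal and bigrams are omitted.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy corpus (hypothetical)
docs = [
    "machine learning basics",
    "deep learning networks",
    "python programming basics",
]
vecs = tfidf_vectors(docs)

# "machine" appears in one document and "learning" in two,
# so "machine" gets the larger weight within the first document.
print(vecs[0])
```

Terms that occur in fewer documents receive higher weights, which is why rare, topic-specific words dominate the similarity scores computed in the next section.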

In [None]:
class TextEmbedder:
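    """Text embedder backed by TF-IDF (the only method implemented here)."""
    
    def __init__(self, method='tfidf'):
        self.method = method
        self.preprocessor = TextPreprocessor()
        
        if method == 'tfidf':
            # Unigrams and bigrams, English stop words removed, vocabulary capped at 5000
            self.vectorizer = TfidfVectorizer(
                max_features=5000,
                stop_words='english',
                ngram_range=(1, 2),
                lowercase=True
            )
        else:
            # Fail fast: otherwise embed() would silently return None
            raise ValueError(f"Unsupported embedding method: {method!r}")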
    
    def fit(self, texts: List[str]):
        """Fit embedder on text corpus."""
        if self.method == 'tfidf':
            processed_texts = [self.preprocessor.preprocess(text) for text in texts]
            self.vectorizer.fit(processed_texts)
        
        print(f"Fitted {self.method} embedder on {len(texts)} texts")
    
    def embed(self, texts: List[str]) -> np.ndarray:
        """Convert texts to embeddings."""
        if self.method == 'tfidf':
            processed_texts = [self.preprocessor.preprocess(text) for text in texts]
            return self.vectorizer.transform(processed_texts).toarray()
    
    def embed_single(self, text: str) -> np.ndarray:
        """Embed single text."""
        return self.embed([text])[0]

# Create embedder and fit on FAQ data
embedder = TextEmbedder('tfidf')

# Prepare training texts (questions + answers)
training_texts = []
for entry in faq_data:
    training_texts.append(entry['question'])
    training_texts.append(entry['answer'])

embedder.fit(training_texts)

# Embed all questions
questions = [entry['question'] for entry in faq_data]
question_embeddings = embedder.embed(questions)

print(f"Question embeddings shape: {question_embeddings.shape}")
print(f"Sample embedding (first 10 dims): {question_embeddings[0][:10]}")

## 3. Similarity Computation {#3-similarity}

Score each knowledge base entry against the query using cosine similarity between their embedding vectors, then keep the top-k matches.
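
Cosine similarity is the dot product of two vectors divided by the product of their lengths: `cos(a, b) = dot(a, b) / (|a| * |b|)`. The dependency-free sketch below shows the computation and a top-k lookup on hypothetical 3-dimensional vectors. Returning 0.0 for an all-zero vector is a convention assumed here, chosen so that a query containing no known terms matches nothing.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query: Sequence[float], kb: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k knowledge-base vectors most similar to the query."""
    scores = [cosine(query, row) for row in kb]
    order = sorted(range(len(kb)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical vectors, for illustration only
kb_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vector = [2.0, 0.0, 0.0]

print(top_k(query_vector, kb_vectors, k=2))
```

The query vector is twice as long as the first knowledge-base vector but still scores 1.0 against it: cosine similarity depends only on direction, so long and short texts with the same term proportions compare as equal.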

In [None]:
class SimilarityMatcher:
    """Compute similarity between queries and knowledge base."""
    
    def __init__(self, embedder: TextEmbedder, kb_embeddings: np.ndarray):
        self.embedder = embedder
        self.kb_embeddings = kb_embeddings
    
    def find_similar(self, query: str, top_k: int = 5) -> List[Tuple[int, float]]:
        """Find most similar entries to query."""
        # Embed query
        query_embedding = self.embedder.embed_single(query).reshape(1, -1)
        
        # Compute similarities
        similarities = cosine_similarity(query_embedding, self.kb_embeddings)[0]
        
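        # Get top-k indices and scores, highest first
        # (reverse before slicing so that top_k=0 returns an empty list)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        top_scores = similarities[top_indices]
        
        # Convert NumPy scalars to plain Python types to match the annotation
        return [(int(i), float(s)) for i, s in zip(top_indices, top_scores)]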
    
    def compute_similarity_matrix(self, queries: List[str]) -> np.ndarray:
        """Compute similarity matrix for multiple queries."""
        query_embeddings = self.embedder.embed(queries)
        return cosine_similarity(query_embeddings, self.kb_embeddings)

# Create similarity matcher
matcher = SimilarityMatcher(embedder, question_embeddings)

# Test similarity matching
test_queries = [
    "What is machine learning?",
    "How do neural networks work?",
    "What is deep learning?",
    "How to start programming?"
]

print("Similarity matching results:")
for query in test_queries:
    similar_entries = matcher.find_similar(query, top_k=3)
    print(f"\nQuery: '{query}'")
    
    for idx, score in similar_entries:
        question = faq_data[idx]['question']
        print(f"  Score: {score:.3f} - {question}")

## 4. Response Ranking {#4-ranking}

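Re-rank the retrieved candidates with a weighted sum of four features: embedding similarity, keyword overlap, the entry's stored confidence, and category relevance.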
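
As a worked example of weighted scoring, the sketch below combines four feature scores into one number. The weights (0.5, 0.2, 0.2, 0.1) and the 0.5 default confidence are the values used by `ResponseRanker` in this section; the feature values and word sets are invented for illustration.

```python
from typing import Dict

# Weights used by the ranker in this section; they sum to 1.0
WEIGHTS: Dict[str, float] = {
    "similarity": 0.5,
    "keyword": 0.2,
    "confidence": 0.2,
    "category": 0.1,
}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_score(features: Dict[str, float]) -> float:
    """Weighted sum of feature scores, each expected to lie in [0, 1]."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

# Invented feature values for one candidate
query_words = {"what", "is", "machine", "learning"}
entry_keywords = {"machine", "learning", "ml"}
features = {
    "similarity": 0.8,
    "keyword": jaccard(query_words, entry_keywords),  # 2 shared / 5 total = 0.4
    "confidence": 0.5,
    "category": 0.0,
}
# 0.5 * 0.8 + 0.2 * 0.4 + 0.2 * 0.5 + 0.1 * 0.0
print(f"{combined_score(features):.2f}")

# A candidate that matches nothing still scores 0.2 * 0.5
# from the default confidence alone.
floor = combined_score(
    {"similarity": 0.0, "keyword": 0.0, "confidence": 0.5, "category": 0.0}
)
print(f"{floor:.2f}")
```

Because the confidence term does not depend on the query, it acts as a constant offset per entry. Any acceptance threshold applied to the combined score should be chosen with that floor in mind.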

In [None]:
class ResponseRanker:
    """Rank responses using multiple features."""
    
    def __init__(self, faq_data: List[Dict], matcher: SimilarityMatcher):
        self.faq_data = faq_data
        self.matcher = matcher
    
    def keyword_match_score(self, query: str, entry: Dict) -> float:
        """Score based on keyword matching."""
        # Tokenize on word characters so trailing punctuation ("learning?") still matches
        query_words = set(re.findall(r"\w+", query.lower()))
        entry_keywords = {kw.lower() for kw in entry['keywords']}
        
        # Jaccard similarity
        intersection = len(query_words & entry_keywords)
        union = len(query_words | entry_keywords)
        
        return intersection / union if union > 0 else 0.0
    
    def confidence_score(self, entry: Dict) -> float:
        """Use entry's confidence score."""
        return entry.get('confidence', 0.5)
    
    def category_relevance(self, query: str, entry: Dict) -> float:
        """Score based on category relevance."""
        # Simple category matching
        category_keywords = {
            'AI/ML Basics': ['ai', 'ml', 'machine', 'learning', 'artificial', 'intelligence'],
            'Deep Learning': ['deep', 'neural', 'network', 'layers'],
            'Programming': ['code', 'programming', 'python', 'language'],
            'NLP': ['nlp', 'language', 'text', 'processing'],
            'Chatbots': ['chatbot', 'conversation', 'dialogue']
        }
        
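        # Match whole words: a substring test would count 'ai' inside "explain".
        # Trade-off: inflected forms such as "networks" no longer match 'network'.
        query_words = set(re.findall(r"\w+", query.lower()))
        category = entry['category']
        
        if category in category_keywords:
            keywords = category_keywords[category]
            matches = sum(1 for kw in keywords if kw in query_words)
            return matches / len(keywords)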
        
        return 0.0
    
    def rank_responses(self, query: str, top_k: int = 5) -> List[Tuple[Dict, float]]:
        """Rank responses using combined scoring."""
        # Get similar entries
        similar_entries = self.matcher.find_similar(query, top_k=min(top_k * 2, len(self.faq_data)))
        
        ranked_responses = []
        
        for idx, similarity_score in similar_entries:
            entry = self.faq_data[idx]
            
            # Compute individual scores
            keyword_score = self.keyword_match_score(query, entry)
            confidence = self.confidence_score(entry)
            category_score = self.category_relevance(query, entry)
            
            # Combined score (weighted)
            final_score = (
                0.5 * similarity_score +
                0.2 * keyword_score +
                0.2 * confidence +
                0.1 * category_score
            )
            
            ranked_responses.append((entry, final_score))
        
        # Sort by final score
        ranked_responses.sort(key=lambda x: x[1], reverse=True)
        
        return ranked_responses[:top_k]

# Create ranker
ranker = ResponseRanker(faq_data, matcher)

# Test ranking
print("Response ranking results:")
test_query = "How do I learn machine learning?"
ranked_responses = ranker.rank_responses(test_query, top_k=3)

print(f"\nQuery: '{test_query}'")
for i, (entry, score) in enumerate(ranked_responses, 1):
    print(f"\n{i}. Score: {score:.3f}")
    print(f"   Q: {entry['question']}")
    print(f"   A: {entry['answer'][:80]}...")
    print(f"   Category: {entry['category']}")

## 5. FAQ Chatbot Implementation {#5-chatbot}

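The class below wires the embedder, matcher, and ranker into a single FAQ chatbot that returns the top-ranked answer and flags matches that fall below a confidence threshold.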

In [None]:
class RetrievalChatbot:
    """Complete retrieval-based FAQ chatbot."""
    
    def __init__(self, faq_data: List[Dict], confidence_threshold: float = 0.3):
        self.faq_data = faq_data
        self.confidence_threshold = confidence_threshold
        
        # Initialize components
        self.embedder = TextEmbedder('tfidf')
        self._build_knowledge_base()
        
        self.matcher = SimilarityMatcher(self.embedder, self.question_embeddings)
        self.ranker = ResponseRanker(faq_data, self.matcher)
        
        # Conversation context
        self.conversation_history = []
        
    def _build_knowledge_base(self):
        """Build and index knowledge base."""
        # Prepare training texts
        training_texts = []
        for entry in self.faq_data:
            training_texts.append(entry['question'])
            training_texts.append(entry['answer'])
        
        # Fit embedder
        self.embedder.fit(training_texts)
        
        # Embed questions
        questions = [entry['question'] for entry in self.faq_data]
        self.question_embeddings = self.embedder.embed(questions)
        
        print(f"Knowledge base built with {len(self.faq_data)} entries")
    
    def respond(self, user_input: str) -> Dict:
        """Generate response to user input."""
        # Add to conversation history
        self.conversation_history.append({'role': 'user', 'content': user_input})
        
        # Rank responses
        ranked_responses = self.ranker.rank_responses(user_input, top_k=1)
        
        if not ranked_responses:
            response = {
                'answer': "I'm sorry, I don't have information about that topic.",
                'confidence': 0.0,
                'source': None
            }
        else:
            best_entry, score = ranked_responses[0]
            
            if score >= self.confidence_threshold:
                response = {
                    'answer': best_entry['answer'],
                    'confidence': score,
                    'source': best_entry,
                    'category': best_entry['category']
                }
            else:
                response = {
                    'answer': f"I found some information, but I'm not very confident. {best_entry['answer']}",
                    'confidence': score,
                    'source': best_entry,
                    'category': best_entry['category']
                }
        
        # Add to conversation history
        self.conversation_history.append({'role': 'assistant', 'content': response['answer']})
        
        return response
    
    def get_conversation_history(self) -> List[Dict]:
        """Get conversation history."""
        return self.conversation_history
    
    def reset_conversation(self):
        """Reset conversation history."""
        self.conversation_history = []

# Create chatbot
chatbot = RetrievalChatbot(faq_data, confidence_threshold=0.25)

# Scripted demo: runs a fixed list of queries, no user input is read
print("=== FAQ Chatbot Demo ===")
print("Sample questions about AI, ML, programming, etc.\n")

# Demo queries
demo_queries = [
    "What is machine learning?",
    "How do neural networks work?",
    "What programming language should I learn?",
    "How do I start with deep learning?",
    "What is the difference between AI and ML?"
]

for query in demo_queries:
    print(f"User: {query}")
    response = chatbot.respond(query)
    print(f"Bot: {response['answer']}")
    print(f"Confidence: {response['confidence']:.3f}")
    if response.get('category'):
        print(f"Category: {response['category']}")
    print("-" * 50)

print("\nConversation completed!")

## 6. Practical Exercise {#6-exercise}

**Exercise**: Enhance the retrieval-based chatbot

### Tasks:
1. Add semantic embeddings (Word2Vec, BERT)
2. Implement context-aware retrieval
3. Add response filtering and safety checks
4. Create evaluation metrics (precision@k, MRR)

### Extensions:
- Multi-turn conversation support
- Hybrid retrieval + generation
- Domain adaptation techniques
- Real-time learning from user feedback
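
For task 4, precision@k and mean reciprocal rank (MRR) can be computed from ranked result lists as in the sketch below. The function names and the toy IDs are assumptions for illustration. In this notebook the ranked IDs would have to be derived from the output of `ranker.rank_responses`, and the relevance labels would need to be annotated by hand.

```python
from typing import List, Sequence, Set

def precision_at_k(ranked_ids: Sequence[int], relevant_ids: Set[int], k: int) -> float:
    """Fraction of the top-k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids: Sequence[int], relevant_ids: Set[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_ranked: List[Sequence[int]],
                         all_relevant: List[Set[int]]) -> float:
    """Average reciprocal rank over a set of queries."""
    rr = [reciprocal_rank(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(rr) / len(rr) if rr else 0.0

# Toy data: ranked entry IDs for three queries, with hand-made relevance labels
ranked = [[3, 1, 7], [2, 5, 9], [4, 6, 8]]
relevant = [{1}, {2, 9}, {0}]

# Reciprocal ranks are 1/2, 1, and 0, so the mean is 0.5
print(mean_reciprocal_rank(ranked, relevant))
print(precision_at_k(ranked[1], relevant[1], k=3))
```

Precision@k rewards every relevant result in the top k, while MRR only looks at the position of the first one. For a chatbot that returns a single answer, MRR is usually the more informative of the two.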

In [None]:
# Exercise: Evaluation metrics
def evaluate_retrieval(chatbot, test_queries, expected_categories):
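    """Evaluate retrieval performance.

    Note: each call to chatbot.respond() also appends to the conversation history.
    """
    # Guard against division by zero and against zip() silently truncating
    if not test_queries or len(test_queries) != len(expected_categories):
        raise ValueError("test_queries and expected_categories must be non-empty and equal in length")
    
    correct_category = 0
    high_confidence = 0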
    
    for query, expected_cat in zip(test_queries, expected_categories):
        response = chatbot.respond(query)
        
        if response.get('category') == expected_cat:
            correct_category += 1
        
        if response['confidence'] >= 0.5:
            high_confidence += 1
    
    accuracy = correct_category / len(test_queries)
    confidence_rate = high_confidence / len(test_queries)
    
    return {
        'category_accuracy': accuracy,
        'high_confidence_rate': confidence_rate,
        'total_queries': len(test_queries)
    }

# Test evaluation
test_queries = [
    "What is artificial intelligence?",
    "How do I code in Python?",
    "What are transformers?"
]
expected_cats = ['AI/ML Basics', 'Programming', 'Deep Learning']

chatbot.reset_conversation()
results = evaluate_retrieval(chatbot, test_queries, expected_cats)
print(f"\nEvaluation Results:")
print(f"Category Accuracy: {results['category_accuracy']:.2f}")
print(f"High Confidence Rate: {results['high_confidence_rate']:.2f}")

print("\n=== Retrieval-Based Chatbot Complete ===")
print("Key Concepts Learned:")
print("• Text embeddings and similarity matching")
print("• Multi-feature response ranking")
print("• Knowledge base indexing")
print("• Confidence-based response selection")
print("• Retrieval system evaluation")