
πŸƒ AA Assistant β€” LLM Zoomcamp Final Project

A Retrieval-Augmented Generation (RAG) chatbot for Alcoholics Anonymous information
Created by Marcelo Nieva for the DataTalksClub LLM Zoomcamp


🧠 Project Overview

AA Assistant is an intelligent conversational agent designed to provide accurate, empathetic, and trustworthy information about Alcoholics Anonymous (AA) to individuals seeking help or guidance with alcohol-related concerns.

The chatbot leverages a Retrieval-Augmented Generation (RAG) pipeline built entirely from official AA sources, ensuring that all responses are grounded in verified documentation. This project demonstrates a complete end-to-end LLM applicationβ€”from data collection and indexing to semantic retrieval, generation, and comprehensive evaluation.

🌐 Multilingual Support: The assistant operates in both Spanish and English, making AA information accessible to a broader audience across different regions and language preferences.

[Screenshot: AA Assistant chatbot interface]

Live demo (cloud deployment on Koyeb):
https://pale-sisely-arac-347f0d7a.koyeb.app/


🎯 Project Goals & LLM Zoomcamp Evaluation Criteria

This project addresses all key evaluation criteria outlined in the LLM Zoomcamp:

Criterion | Implementation | Location in Code
Problem Definition | Clear use case: providing reliable AA information to people struggling with alcohol | This README, src/RAG/prompts.py
Data Collection | Web scraping from official AA websites (global & Argentina) | data/final/*.json
Data Indexing | Vector embeddings using Jina AI v2 (Spanish/English) + sparse BM25 indexing | src/ingest.py, src/config/db.py
Retrieval Strategy | Hybrid search (semantic + lexical) with DBSF and RRF reranking | src/RAG/main.py, src/eval/retrival/
LLM Integration | NVIDIA NIM endpoints with three models evaluated | src/LLM/main.py
Application Interface | FastAPI serving custom HTML/CSS/JS interface | src/server/app.py, src/server/templates/
Evaluation | Comprehensive evaluation framework: retrieval metrics (Hit Rate, MRR) and generation quality (cosine similarity, LLM-as-judge) across 580 test cases | src/eval/retrival/, src/eval/llm/
Monitoring | OpenTelemetry instrumentation with Arize Phoenix for real-time observability, trace tracking, and user feedback collection | src/monitoring/tracing.py, src/server/app.py
Documentation | Detailed README with setup instructions, architecture, evaluation results, and monitoring guidelines | This file

πŸ“š Data Sources

All knowledge in the chatbot comes from official Alcoholics Anonymous sources:

🌍 Global Resources

  • AA Official Website (Spanish): https://www.aa.org/es
  • Core AA literature: The Twelve Steps, The Big Book, FAQs

πŸ‡¦πŸ‡· Local Resources

  • AA Argentina Official Website: https://aa.org.ar/
  • Regional meeting information, local resources, and Argentina-specific guidance

πŸ“ Data Structure

data/final/
β”œβ”€β”€ FAQS.json                      # Frequently Asked Questions
β”œβ”€β”€ FAQS_IDX.json                  # Indexed FAQs with IDs
β”œβ”€β”€ Ground_Truth.json              # Evaluation ground truth dataset
β”œβ”€β”€ Ground_Truth_IDX.json          # Indexed ground truth
β”œβ”€β”€ answers_gpt_20b.json           # Generated answers from GPT-20B
β”œβ”€β”€ answers_kimi_k2.json           # Generated answers from Kimi K2
└── answers_llama4_scout.json     # Generated answers from Llama 4 Scout

πŸ—οΈ Technical Architecture

System Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      User Interface                     β”‚
β”‚              (FastAPI + HTML/CSS/JS)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   RAG Pipeline                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚   Semantic   β”‚  β”‚   Lexical    β”‚  β”‚   Hybrid     β”‚   β”‚
β”‚  β”‚   Search     β”‚  β”‚   Search     β”‚  β”‚   (DBSF/RRF) β”‚   β”‚
β”‚  β”‚  (Jina v2)   β”‚  β”‚   (BM25)     β”‚  β”‚  Reranking   β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Vector Database                         β”‚
β”‚                    (Qdrant)                             β”‚
β”‚  β€’ Dense vectors: Jina Embeddings v2 (768 dims)         β”‚
β”‚  β€’ Sparse vectors: BM25 tokenization                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               LLM Generation Layer                      β”‚
β”‚                  (NVIDIA NIM)                           β”‚
β”‚  β€’ openai/gpt-oss-20b                                   β”‚
β”‚  β€’ moonshotai/kimi-k2-instruct                          β”‚
β”‚  β€’ meta/llama-4-scout-17b-16e-instruct                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Project Structure

src/
β”œβ”€β”€ RAG/                          # Retrieval-Augmented Generation
β”‚   β”œβ”€β”€ main.py                   # Core RAG orchestration with tracing
β”‚   └── prompts.py                # System and user prompts
β”œβ”€β”€ LLM/                          # Language Model interface
β”‚   └── main.py                   # NVIDIA NIM client wrapper with token tracking
β”œβ”€β”€ eval/                         # Evaluation
β”‚   β”œβ”€β”€ retrival/                 # Retrieval evaluation
β”‚   β”‚   β”œβ”€β”€ evaluate.py           # Hit Rate and MRR metrics
β”‚   β”‚   └── metrics.py            # Metric calculation utilities
β”‚   β”œβ”€β”€ llm/                      # Generation evaluation
β”‚   β”‚   β”œβ”€β”€ generate_answers.py  # Answer generation for evaluation
β”‚   β”‚   β”œβ”€β”€ cosine_similarity/   # Semantic similarity evaluation
β”‚   β”‚   β”‚   └── evaluate.py
β”‚   β”‚   └── llm_judge/           # LLM-as-judge evaluation
β”‚   β”‚       └── evaluate.py
β”‚   β”œβ”€β”€ generate_ground_truth.py # Ground truth dataset creation
β”‚   └── prompts.py               # Evaluation prompts
β”œβ”€β”€ monitoring/                   # Observability & Tracing
β”‚   └── tracing.py               # OpenTelemetry + Phoenix setup
β”œβ”€β”€ server/                       # Web application
β”‚   β”œβ”€β”€ app.py                    # FastAPI server with traced endpoints
β”‚   └── templates/
β”‚       └── index.html            # Frontend interface
β”œβ”€β”€ config/                       # Configuration management
β”‚   β”œβ”€β”€ db.py                     # Qdrant database setup
β”‚   β”œβ”€β”€ envs.py                   # Environment variables
β”‚   β”œβ”€β”€ paths.py                  # Path constants
β”‚   └── utils.py                  # Utility functions
β”œβ”€β”€ ingest.py                     # Data ingestion pipeline
└── main.py                       # Entry point

βš™οΈ Setup Instructions

Prerequisites

  • Python 3.12+
  • uv package manager
  • Docker
  • NVIDIA NIM API key

πŸš€ Quick Start

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Verify installation:

uv --version

2. Clone Repository

git clone https://github.com/marcelonieva7/AA_Bot.git
cd AA_Bot

3. Create Virtual Environment

uv venv

4. Install Dependencies

uv sync

This installs all dependencies from pyproject.toml using locked versions from uv.lock.

5. Configure Environment Variables

Create a .env file in the project root by copying the provided template:

cp .env.template .env

Then edit .env with your configuration:

# Qdrant Vector Database
QDRANT_URL=http://localhost:6333

# NVIDIA NIM API Configuration
NVIDIA_API_KEY=
NVIDIA_URL=

Configuration Details:

Variable | Description | Default/Example | Required
QDRANT_URL | Qdrant vector database endpoint | http://localhost:6333 | βœ… Yes
NVIDIA_API_KEY | Your NVIDIA NIM API key from build.nvidia.com | nvapi-xxx... | βœ… Yes
NVIDIA_URL | NVIDIA NIM API base URL | https://integrate.api.nvidia.com/v1 | βœ… Yes
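
A minimal sketch of how these variables might be loaded in code, assuming python-dotenv; the actual src/config/envs.py may differ:

# Hypothetical sketch of reading the .env configuration; not the project's
# actual src/config/envs.py module.
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # read .env from the project root

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]  # required, no default
NVIDIA_URL = os.environ.get("NVIDIA_URL", "https://integrate.api.nvidia.com/v1")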

Getting Your NVIDIA API Key:

  1. Visit https://build.nvidia.com/
  2. Sign in or create a free account
  3. Navigate to your API keys section
  4. Generate a new API key
  5. Copy the key to your .env file

Note for Docker Users:

If deploying with Docker Compose, you can also configure .env.docker with the same variables. The Docker setup uses:

QDRANT_URL=http://qdrant:6333  # Note: uses service name instead of localhost

6. Initialize Vector Database

Step 1: Start Qdrant Database

First, launch the Qdrant vector database using Docker:

docker run --rm -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/docker_volumes/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant

This command:

  • Exposes port 6333 for the REST API
  • Exposes port 6334 for the gRPC API
  • Persists data in ./docker_volumes/qdrant_storage/
  • Runs in the foreground (use Ctrl+C to stop)

Verify Qdrant is running:

curl http://localhost:6333/

You should see a JSON response with version information.


Step 2: Ingest and Index Documents

Once Qdrant is running, populate the vector database:

uv run -m src.ingest

This ingestion pipeline will:

  • πŸ“‚ Load documents from data/final/FAQS.json and related files
  • 🧠 Generate dense embeddings using Jina Embeddings v2 (768 dimensions)
  • πŸ”€ Create sparse BM25 vectors for lexical search
  • πŸ’Ύ Index all vectors in Qdrant with hybrid search capabilities
  • βœ… Create collection with optimized indexing parameters
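
As a rough illustration of what such a hybrid collection involves, here is a minimal sketch using qdrant-client and fastembed; the collection name, payload fields, and sample text are illustrative, not the actual src/ingest.py code:

# Minimal sketch of hybrid (dense + sparse) indexing with qdrant-client and
# fastembed; names and sample data are illustrative.
from fastembed import SparseTextEmbedding, TextEmbedding
from qdrant_client import QdrantClient, models

dense_model = TextEmbedding("jinaai/jina-embeddings-v2-base-es")   # 768-dim dense vectors
sparse_model = SparseTextEmbedding("Qdrant/bm25")                  # BM25 sparse vectors

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="aa_faqs",
    vectors_config={"jina": models.VectorParams(size=768, distance=models.Distance.COSINE)},
    sparse_vectors_config={"bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)},
)

text = "Los 12 pasos son principios guΓ­a para la recuperaciΓ³n."
dense_vec = next(iter(dense_model.embed([text])))
sparse_vec = next(iter(sparse_model.embed([text])))

client.upsert(
    collection_name="aa_faqs",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "jina": dense_vec.tolist(),
                "bm25": models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist(),
                ),
            },
            payload={"question": "ΒΏQuΓ© son los 12 pasos?", "answer": text},
        )
    ],
)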

7. Run the Application

uv run uvicorn src.server.app:app --host 0.0.0.0 --port 8000 --reload

Open your browser at http://localhost:8000 to access the chatbot interface.


🐳 Docker Deployment

Environment Configuration

Use .env.docker for Docker-specific settings:

QDRANT_URL=http://qdrant:6333
NVIDIA_API_KEY=
NVIDIA_URL=https://integrate.api.nvidia.com/v1

Using Docker Compose

docker compose up --build

This starts:

  • FastAPI server on port 8000
  • Qdrant vector database on port 6333
  • Phoenix Observability and monitoring on port 6006

πŸ” Retrieval Strategy

The chatbot implements a sophisticated hybrid retrieval system combining semantic and lexical search:

Search Types Evaluated

search_types = [
    'semantic',              # Dense vector similarity (Jina v2)
    'lexical',               # Sparse BM25 keyword matching
    ['hybrid', 'DBSF'],      # Hybrid with Distribution-Based Score Fusion
    ['hybrid', 'RRF']        # Hybrid with Reciprocal Rank Fusion
]

Embedding Models

Model Type | Model Name | Purpose | Dimensions
Dense | jinaai/jina-embeddings-v2-base-es | Semantic similarity in Spanish and English | 768
Sparse | Qdrant/bm25 | Lexical keyword matching | Variable

Why Hybrid Search?

  • Semantic search captures conceptual similarity and handles paraphrasing
  • Lexical search ensures exact term matches (important for AA-specific terminology)
  • Reranking algorithms (DBSF/RRF) combine both strengths for optimal retrieval
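
Putting the two together, a hybrid RRF query against Qdrant's Query API might look like the following sketch; names follow the ingestion example above and this is not the project's actual src/RAG/main.py:

# Illustrative hybrid query: prefetch dense and sparse candidates, then fuse
# with RRF (or DBSF).
from fastembed import SparseTextEmbedding, TextEmbedding
from qdrant_client import QdrantClient, models

dense_model = TextEmbedding("jinaai/jina-embeddings-v2-base-es")
sparse_model = SparseTextEmbedding("Qdrant/bm25")
client = QdrantClient(url="http://localhost:6333")

query = "ΒΏCΓ³mo funcionan los 12 pasos?"
dense_q = next(iter(dense_model.embed([query]))).tolist()
sparse_q = next(iter(sparse_model.query_embed([query])))

hits = client.query_points(
    collection_name="aa_faqs",
    prefetch=[
        models.Prefetch(query=dense_q, using="jina", limit=20),
        models.Prefetch(
            query=models.SparseVector(
                indices=sparse_q.indices.tolist(), values=sparse_q.values.tolist()
            ),
            using="bm25",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # or models.Fusion.DBSF
    limit=5,
)
for point in hits.points:
    print(point.score, point.payload.get("question"))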

πŸ€– LLM Models Evaluated

The project compares three state-of-the-art language models via NVIDIA NIM endpoints:

Model | Identifier | Strengths | Use Case
GPT-20B | openai/gpt-oss-20b | Balanced performance, good Spanish support | General-purpose answering
Kimi K2 | moonshotai/kimi-k2-instruct | Long context, instruction following | Complex multi-step reasoning
Llama 4 Scout | meta/llama-4-scout-17b-16e-instruct | Efficient, fast inference | Quick responses

All models are accessed through the NVIDIA NIM API, enabling consistent evaluation and deployment.
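
NVIDIA NIM exposes an OpenAI-compatible API, so a single request can be sketched with the standard openai client; the system prompt and parameters here are illustrative, not the project's src/LLM/main.py wrapper:

# Sketch of a chat completion against an NVIDIA NIM endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("NVIDIA_URL", "https://integrate.api.nvidia.com/v1"),
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant for AA information."},
        {"role": "user", "content": "ΒΏQuΓ© son los 12 pasos de AA?"},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
print(response.usage.total_tokens)  # token usage, as tracked by the monitoring layer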


πŸ“Š Evaluation Framework

1. Retrieval Evaluation

Located in src/eval/retrival/, this module measures retrieval quality using two key metrics:

Evaluation Metrics

Hit Rate (Recall@k)

  • Measures the proportion of queries where at least one relevant document appears in the top-k results
  • Formula: (Number of queries with relevant docs in top-k) / (Total queries)
  • Range: 0.0 to 1.0 (higher is better)

Mean Reciprocal Rank (MRR)

  • Evaluates how high the first relevant document ranks in the results
  • Formula: Average(1 / rank of first relevant document)
  • Range: 0.0 to 1.0 (higher is better)
  • Emphasizes ranking qualityβ€”finding relevant docs early is rewarded
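
Both metrics reduce to a few lines of Python. A minimal sketch, where the relevance matrix shape is an assumption rather than the actual src/eval/retrival/metrics.py:

# Hit Rate and MRR over ranked results; `relevance` holds one boolean row per
# query, marking whether each retrieved doc id matches the ground truth.
def hit_rate(relevance: list[list[bool]]) -> float:
    return sum(any(row) for row in relevance) / len(relevance)

def mrr(relevance: list[list[bool]]) -> float:
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance)

# e.g. 3 queries, top-3 results each
relevance = [[False, True, False], [True, False, False], [False, False, False]]
print(hit_rate(relevance))  # 0.666...
print(mrr(relevance))       # (1/2 + 1/1 + 0) / 3 = 0.5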

Running the Evaluation

uv run -m src.eval.retrival.evaluate

This script:

  1. Loads ground truth questions with known relevant document IDs
  2. Executes each search type (semantic, lexical, hybrid)
  3. Compares retrieved document IDs against ground truth
  4. Calculates Hit Rate and MRR for each configuration
  5. Saves results to src/eval/retrival/Eval_results.csv

Results

Based on evaluation across 580 test queries from the ground truth dataset:

Search Type | Hit Rate | MRR | Interpretation
Semantic | 0.9276 | 0.6741 | Finds relevant docs 92.8% of the time; moderate ranking quality
Lexical | 0.7845 | 0.5505 | Baseline BM25 performance; finds docs 78.5% of the time
Hybrid (DBSF) | 0.9224 | 0.7090 | Best ranking quality; relevant docs rank highest on average
Hybrid (RRF) | 0.9379 | 0.6985 | Best hit rate; finds relevant docs 93.8% of the time

Key Findings

βœ… Winner: Hybrid RRF for production deployment

  • Highest hit rate (93.79%) ensures users almost always get relevant information
  • Strong MRR (0.6985) means relevant docs typically appear in top 2-3 positions
  • Combines strengths of both semantic understanding and exact term matching

πŸ“Š Best Ranking: Hybrid DBSF

  • Highest MRR (0.7090) means relevant documents rank slightly higher on average
  • Nearly as good hit rate (92.24%)
  • Excellent choice if ranking position is critical

🎯 Semantic-Only Performance

  • Strong hit rate (92.76%) shows Jina embeddings work well for Spanish AA content
  • Lower MRR suggests relevant docs sometimes appear lower in results
  • Good fallback if computational resources are limited

⚠️ Lexical-Only Limitations

  • Lowest performance (78.45% hit rate) confirms pure keyword matching isn't sufficient
  • Struggles with paraphrasing and conceptual questions
  • Important as a component but insufficient alone

Visualization

Hit Rate Comparison (Higher is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Hybrid (RRF)    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 93.79%
Semantic        β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 92.76%
Hybrid (DBSF)   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 92.24%
Lexical         β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ      78.45%

MRR Comparison (Higher is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Hybrid (DBSF)   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 0.7090
Hybrid (RRF)    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  0.6985
Semantic        β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   0.6741
Lexical         β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ      0.5505

2. Generation Evaluation

Located in src/eval/llm/, this module includes:

a) Cosine Similarity Analysis

Measures semantic overlap between generated answers and ground truth using vector embeddings. This quantitative metric evaluates how closely each model's responses align with reference answers.

Running the Evaluation:

uv run -m src.eval.llm.cosine_similarity.evaluate

Methodology:

  • Embeds both generated answers and ground truth using the same embedding model
  • Computes cosine similarity scores (range: -1 to 1, where 1 = identical meaning)
  • Analyzes distribution across 580 test question-answer pairs
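
A minimal sketch of that computation; the example sentences are illustrative, and reusing the Jina v2 retrieval model for evaluation is an assumption:

# Cosine similarity between a generated answer and the ground-truth answer.
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding("jinaai/jina-embeddings-v2-base-es")

def embed(text: str) -> np.ndarray:
    return next(iter(model.embed([text])))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine_similarity(
    embed("Los 12 pasos son principios espirituales para la recuperaciΓ³n."),
    embed("Los Doce Pasos son un conjunto de principios para dejar de beber."),
)
print(round(score, 4))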

Results Summary:

Model | Mean Similarity | Median | Std Dev | Min | Max | 25th %ile | 75th %ile
Llama 4 Scout | 0.7756 | 0.7786 | 0.1370 | -0.0223 | 1.0000 | 0.7024 | 0.8626
Kimi K2 | 0.7425 | 0.7617 | 0.1509 | -0.0435 | 1.0000 | 0.6675 | 0.8411
GPT-20B | 0.6801 | 0.7162 | 0.2075 | -0.1596 | 1.0000 | 0.6182 | 0.8002

Detailed Statistics:

GPT-20B (openai/gpt-oss-20b)
count  580.000000
mean     0.680088
std      0.207454
min     -0.159631
25%      0.618194
50%      0.716159
75%      0.800209
max      1.000000

Analysis:

  • Lowest mean similarity (0.68) suggests more creative/varied responses
  • Highest standard deviation (0.21) indicates inconsistent alignment with ground truth
  • Some negative scores show occasional semantic drift
  • 75% of responses still achieve >0.62 similarity

Llama 4 Scout (meta/llama-4-scout-17b-16e-instruct)
count  580.000000
mean     0.775588
std      0.136974
min     -0.022276
25%      0.702364
50%      0.778617
75%      0.862613
max      1.000000

Analysis:

  • Highest mean similarity (0.78) β€” best alignment with reference answers
  • Lowest standard deviation (0.14) shows most consistent performance
  • Minimum score near 0 (vs. negative for others) indicates fewer outliers
  • 75% of responses achieve >0.70 similarity (excellent)
  • Winner for semantic accuracy

Kimi K2 (moonshotai/kimi-k2-instruct)
count  580.000000
mean     0.742493
std      0.150890
min     -0.043473
25%      0.667533
50%      0.761724
75%      0.841071
max      1.000000

Analysis:

  • Strong performance (0.74 mean) β€” second place overall
  • Moderate standard deviation (0.15) shows good consistency
  • Balanced distribution with median close to mean
  • 75% of responses achieve >0.67 similarity
  • Good middle ground between accuracy and creativity

Key Insights:

πŸ₯‡ Llama 4 Scout emerges as the winner for semantic accuracy:

  • Highest average similarity (77.6%) to ground truth
  • Most consistent performance across all queries
  • Fewest outliers and semantic drift cases

πŸ₯ˆ Kimi K2 provides strong balanced performance:

  • Second-best similarity (74.2%)
  • Good for complex queries requiring nuanced understanding

⚠️ GPT-20B shows higher variance:

  • More creative/diverse responses (can be positive or negative depending on use case)
  • Less predictable alignment with reference answers
  • May provide alternative valid perspectives not captured in ground truth

Distribution Visualization:

Similarity Score Distribution
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Llama 4 Scout:  [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 77.6% avg
                     ↑ Most consistent

Kimi K2:        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  ] 74.2% avg
                     ↑ Balanced performance

GPT-20B:        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     ] 68.0% avg
                     ↑ More variance

Recommendation: For production deployment, Llama 4 Scout is recommended when semantic accuracy to established AA content is prioritized. That said, all three models reach a mean similarity of at least 0.68, indicating that each generates semantically relevant responses.

b) LLM-as-a-Judge Evaluation

Uses an advanced language model to qualitatively assess answer quality through structured evaluation prompts. This approach provides nuanced assessment beyond pure numerical metrics.

Running the Evaluation:

uv run -m src.eval.llm.llm_judge.evaluate

Methodology:

Two complementary evaluation perspectives are used to assess each model's responses across 580 test cases:

Prompt 1: Answer-to-Answer Comparison

  • Compares generated answer against a reference ground truth answer
  • Evaluates semantic preservation and information completeness
  • Stricter evaluationβ€”checks if the model maintains factual accuracy
"Compare the generated answer to the original reference answer 
and classify relevance as: NON_RELEVANT | PARTLY_RELEVANT | RELEVANT"

Prompt 2: Question-to-Answer Alignment

  • Evaluates how well the generated answer addresses the original question
  • Focuses on user satisfaction and practical utility
  • More lenientβ€”allows for valid alternative phrasings and approaches
"Evaluate how well the generated answer responds to the question
and classify relevance as: NON_RELEVANT | PARTLY_RELEVANT | RELEVANT"
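
A single judge call can be sketched as follows; the prompt wording, judge model, and label parsing are illustrative, not the project's src/eval/prompts.py or src/eval/llm/llm_judge/evaluate.py:

# Sketch of one LLM-as-a-judge call (question-to-answer alignment).
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("NVIDIA_URL", "https://integrate.api.nvidia.com/v1"),
    api_key=os.environ["NVIDIA_API_KEY"],
)

JUDGE_PROMPT = """Evaluate how well the generated answer responds to the question.
Question: {question}
Generated answer: {answer}
Reply with exactly one label: NON_RELEVANT, PARTLY_RELEVANT or RELEVANT."""

def judge(question: str, answer: str,
          judge_model: str = "meta/llama-4-scout-17b-16e-instruct") -> str:
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,
    )
    label = response.choices[0].message.content.strip().upper()
    # Fall back to the middle class if the judge returns anything unexpected
    return label if label in {"NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"} else "PARTLY_RELEVANT"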

Results Summary:

Model | Prompt Type | Relevant | Partly Relevant | Non-Relevant | Success Rate*
Llama 4 Scout | Answer Comparison (P1) | 274 (47.2%) | 305 (52.6%) | 1 (0.2%) | 99.8%
Llama 4 Scout | Question Alignment (P2) | 501 (86.4%) | 77 (13.3%) | 2 (0.3%) | 99.7%
Kimi K2 | Answer Comparison (P1) | 259 (44.7%) | 313 (54.0%) | 8 (1.4%) | 98.6%
Kimi K2 | Question Alignment (P2) | 488 (84.1%) | 86 (14.8%) | 6 (1.0%) | 99.0%
GPT-20B | Answer Comparison (P1) | 242 (41.7%) | 308 (53.1%) | 30 (5.2%) | 94.8%
GPT-20B | Question Alignment (P2) | 488 (84.1%) | 63 (10.9%) | 29 (5.0%) | 95.0%

*Success Rate = (Relevant + Partly Relevant) / Total


Detailed Results:

GPT-20B (openai/gpt-oss-20b)

Prompt 1 - Answer Comparison:

PARTLY_RELEVANT    308 (53.1%)
RELEVANT           242 (41.7%)
NON_RELEVANT        30 (5.2%)

Prompt 2 - Question Alignment:

RELEVANT           488 (84.1%)
PARTLY_RELEVANT     63 (10.9%)
NON_RELEVANT        29 (5.0%)

Analysis:

  • Shows largest gap between strict (P1) and lenient (P2) evaluation
  • 5% non-relevant rate highest among all modelsβ€”indicates more factual drift
  • Strong question-answering capability (84% fully relevant)
  • When aligned, provides comprehensive answers
  • ⚠️ Higher risk of deviating from reference material

Llama 4 Scout (meta/llama-4-scout-17b-16e-instruct)

Prompt 1 - Answer Comparison:

PARTLY_RELEVANT    305 (52.6%)
RELEVANT           274 (47.2%)
NON_RELEVANT         1 (0.2%)  ← Best

Prompt 2 - Question Alignment:

RELEVANT           501 (86.4%)  ← Best
PARTLY_RELEVANT     77 (13.3%)
NON_RELEVANT         2 (0.3%)

Analysis:

  • πŸ† Winner: Best overall performance
  • Virtually no non-relevant answers (0.2-0.3%)
  • Highest "Relevant" score on question alignment (86.4%)
  • Excellent balance between accuracy and utility
  • Most reliable for production deployment

Kimi K2 (moonshotai/kimi-k2-instruct)

Prompt 1 - Answer Comparison:

PARTLY_RELEVANT    313 (54.0%)
RELEVANT           259 (44.7%)
NON_RELEVANT         8 (1.4%)

Prompt 2 - Question Alignment:

RELEVANT           488 (84.1%)
PARTLY_RELEVANT     86 (14.8%)
NON_RELEVANT         6 (1.0%)

Analysis:

  • Strong performance, second place overall
  • Low non-relevant rate (1.0-1.4%)
  • Tied with GPT-20B on question alignment (84.1%)
  • Slightly more conservative than Llama (more "partly relevant" classifications)
  • Good choice for complex, nuanced queries

Key Insights:

πŸ“Š Evaluation Method Comparison:

The two prompts reveal different aspects of model performance:

Metric | Prompt 1 (Strict) | Prompt 2 (Lenient) | Insight
Average "Relevant" | 44.5% | 84.9% | Models are better at answering questions than matching reference style
Average "Non-Relevant" | 2.3% | 2.1% | Consistent failure rate across evaluation methods

🎯 Performance Ranking:

By Accuracy (Prompt 1 - Answer Fidelity):

  1. πŸ₯‡ Llama 4 Scout: 99.8% success rate, 0.2% failures
  2. πŸ₯ˆ Kimi K2: 98.6% success rate, 1.4% failures
  3. πŸ₯‰ GPT-20B: 94.8% success rate, 5.2% failures

By Utility (Prompt 2 - Question Answering):

  1. πŸ₯‡ Llama 4 Scout: 86.4% fully relevant, 0.3% failures
  2. πŸ₯ˆ Kimi K2 / GPT-20B: 84.1% fully relevant (tied)

πŸ’‘ Practical Implications:

For Production Use:

  • Llama 4 Scout recommended: highest reliability + best question-answering
  • Maintains factual accuracy while providing helpful responses
  • Lowest risk of hallucination or irrelevant content

For Specialized Cases:

  • Kimi K2: Excellent for long-context queries requiring deep understanding
  • GPT-20B: Consider when creative rephrasing is valued over strict accuracy

Visual Comparison:

Success Rate (Relevant + Partly Relevant)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Answer Comparison (Strict):
Llama 4 Scout  [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 99.8%
Kimi K2        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 98.6%
GPT-20B        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ ] 94.8%

Question Alignment (Lenient):
Llama 4 Scout  [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 99.7%
Kimi K2        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 99.0%
GPT-20B        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ ] 95.0%

Recommendation:

Based on combined evaluation (cosine similarity + LLM-as-a-judge), Llama 4 Scout emerges as the optimal model for the AA Assistant chatbot:

βœ… Highest semantic similarity (77.6%)
βœ… Best LLM-judge scores (86.4% fully relevant)
βœ… Lowest failure rate (0.2-0.3% non-relevant)
βœ… Consistent performance across evaluation methods
βœ… Balanced accuracy and user satisfaction

This model is currently deployed in production.


πŸ“Š Monitoring & Observability

The AA Assistant implements comprehensive observability using OpenTelemetry and Arize Phoenix to track performance, debug issues, and ensure system reliability in production.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   User Request                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              FastAPI Endpoint (/chat)                   β”‚
β”‚           [Traced: api.chat span]                       β”‚
β”‚  β€’ Captures: client IP, model, query length             β”‚
β”‚  β€’ Returns: response + trace_id                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                RAG Pipeline                             β”‚
β”‚           [Traced: rag.pipeline span]                   β”‚
β”‚  β€’ Query preprocessing and validation                   β”‚
β”‚  β€’ Coordinates retrieval + generation                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β–Ό                              β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   Database Search     β”‚    β”‚   LLM Generation      β”‚
         β”‚  [rag_search span]    β”‚    β”‚   [llm.chat span]     β”‚
         β”‚  β€’ Search type        β”‚    β”‚   β€’ Model name        β”‚
         β”‚  β€’ Fusion algorithm   β”‚    β”‚   β€’ Token usage       β”‚
         β”‚  β€’ Documents count    β”‚    β”‚   β€’ Response preview  β”‚
         β”‚  β€’ Results preview    β”‚    β”‚   β€’ Finish reason     β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Feedback Collection  β”‚
         β”‚ [api.feedback span]   β”‚
         β”‚  β€’ User satisfaction  β”‚
         β”‚  β€’ Links to trace_id  β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ” What We Monitor

The system captures detailed telemetry across five key areas:

1. API Layer Monitoring (api.chat span)

Tracks all incoming requests to the chatbot:

Attribute | Description | Example
request.client_ip | User's IP address | 192.168.1.100
chat.model | Selected LLM model | meta/llama-4-scout-17b-16e-instruct
chat.query_preview | First 200 chars of query | "ΒΏQuΓ© son los 12 pasos de AA?"
chat.query_length | Total query length | 45
chat.response_preview | First 200 chars of response | "Los 12 pasos son..."
trace_id | Unique identifier for request | 7f3b2a1c...
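
A hypothetical shape of such a traced endpoint, returning the trace_id so the frontend can attach feedback later; this is a sketch, not the actual src/server/app.py:

# Sketch of a traced /chat endpoint that records the attributes listed above.
from fastapi import FastAPI, Request
from opentelemetry import trace
from pydantic import BaseModel

app = FastAPI()
tracer = trace.get_tracer("aa-assistant")

class ChatRequest(BaseModel):
    query: str
    model: str

@app.post("/chat")
async def chat(req: ChatRequest, request: Request):
    with tracer.start_as_current_span("api.chat") as span:
        span.set_attribute("request.client_ip", request.client.host if request.client else "")
        span.set_attribute("chat.model", req.model)
        span.set_attribute("chat.query_length", len(req.query))
        span.set_attribute("chat.query_preview", req.query[:200])
        answer = "..."  # the RAG pipeline call would go here
        span.set_attribute("chat.response_preview", answer[:200])
        trace_id = format(span.get_span_context().trace_id, "032x")
        return {"answer": answer, "trace_id": trace_id}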

2. RAG Pipeline Monitoring (rag.pipeline span)

Oversees the entire RAG process:

Attribute | Description | Example
rag.query_length | Query character count | 150
rag.query_preview | First 500 chars of query | Full query text
span.status | Operation success/failure | OK / ERROR

3. Retrieval Monitoring (rag_search span)

Captures vector database search operations:

Attribute | Description | Example
search_limit | Max documents to retrieve | 10
search_type | Search algorithm used | hybrid
search_fusion_alg | Reranking method | RRF
retrieved_documents_count | Actual docs found | 8
documents | Detailed results preview | JSON array with doc metadata

Document Preview Structure:

[
  {
    "ranking": 1,
    "answer": "Los 12 pasos son principios guΓ­a...",
    "question": "ΒΏQuΓ© son los 12 pasos?",
    "id": "faq_123",
    "topic": "Programa de AA",
    "source": "aa.org/es/12-pasos"
  }
]

4. LLM Generation Monitoring (llm.chat span)

Tracks language model invocations:

Attribute | Description | Example
llm.model | Model identifier | meta/llama-4-scout-17b-16e-instruct
input.user_preview | First 200 chars of prompt | User query
llm.prompt_tokens | Tokens in input | 450
llm.completion_tokens | Tokens in output | 280
llm.total_tokens | Total token usage | 730
llm.finish_reason | Completion status | stop / length
llm.output_preview | First 200 chars of response | Generated answer

5. User Feedback Monitoring (api.feedback span)

Collects user satisfaction data:

Attribute | Description | Example
feedback.value | User rating | positive / negative
feedback.question_preview | Original query | First 200 chars
feedback.answer_preview | Bot response | First 200 chars
related_trace_id | Links to original chat trace | 7f3b2a1c...
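
For reference, a minimal sketch of the tracer setup plus a feedback span of this shape, assuming Phoenix's OTLP/HTTP collector listens at http://localhost:6006/v1/traces; the actual src/monitoring/tracing.py may differ:

# Sketch: export spans to the local Phoenix container and emit a feedback span
# linked to the originating chat trace via related_trace_id.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "aa-assistant"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("aa-assistant")

def record_feedback(value: str, question: str, answer: str, related_trace_id: str) -> None:
    with tracer.start_as_current_span("api.feedback") as span:
        span.set_attribute("feedback.value", value)                     # positive / negative
        span.set_attribute("feedback.question_preview", question[:200])
        span.set_attribute("feedback.answer_preview", answer[:200])
        span.set_attribute("related_trace_id", related_trace_id)        # links to api.chat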

πŸš€ Setting Up Monitoring

Start Phoenix Server

Phoenix runs as a Docker container alongside the application. It's already configured in docker-compose.yml:

Option A: Start Everything with Docker Compose (Recommended)

docker compose up -d

This starts all services:

  • FastAPI server on port 8000
  • Qdrant vector database on port 6333
  • Phoenix monitoring on port 6006

Option B: Run Phoenix Separately (Development)

If you're running the FastAPI app locally but want Phoenix in Docker:

# Start only Phoenix
docker compose up phoenix -d

# Then run your app locally
uv run uvicorn src.server.app:app --host 0.0.0.0 --port 8000 --reload

Verify Phoenix is Running:

# Check container status
docker compose ps phoenix

# Or visit the dashboard
http://localhost:6006

Phoenix dashboard will be available at http://localhost:6006

Stopping Phoenix:

# Stop all services
docker compose down

# Stop only Phoenix
docker compose stop phoenix

πŸ“ˆ Monitoring Dashboard

The Arize Phoenix dashboard provides:

Traces View

[Screenshot: Phoenix monitoring dashboard, traces view]

  • Real-time trace visualization
  • Waterfall charts showing span hierarchies
  • Performance bottleneck identification
  • Error tracking and debugging

Trace Details View

[Screenshots: Phoenix monitoring dashboard, trace details]

Click any trace to see:

  • Timeline: Visual span waterfall
  • Attributes: All captured metadata
  • Events: Exceptions and log messages
  • Relationships: Parent-child span connections

πŸ” Privacy Considerations

The monitoring system respects user privacy:

  • No full message storage: Only previews (first 200-500 chars) are logged
  • No PII collection: No names, emails, or personal identifiers
  • Feedback linkage: Trace IDs connect feedback without storing personal data

πŸ“ Monitoring Code Structure

Traced files:

src/
β”œβ”€β”€ server/app.py           # API endpoints with tracing
β”œβ”€β”€ RAG/main.py            # RAG pipeline with nested spans
└── LLM/main.py            # LLM generation with token tracking

🎯 Monitoring Objectives Checklist

Based on LLM Zoomcamp evaluation criteria:

Objective | Implementation | Location
Trace API requests | βœ… OpenTelemetry spans on /chat and /feedback | src/server/app.py
Monitor RAG pipeline | βœ… Nested spans for retrieval + generation | src/RAG/main.py
Track LLM usage | βœ… Token counts, model selection, latency | src/LLM/main.py
Capture user feedback | βœ… Feedback endpoint linked to trace_id | src/server/app.py
Visualize traces | βœ… Arize Phoenix dashboard | http://localhost:6006
Debug issues | βœ… Searchable traces with full context | Phoenix UI
Performance optimization | βœ… Latency tracking per component | All traced spans

πŸ“Έ Screenshots

Main Interface

The chatbot provides a clean, accessible interface for users seeking AA information:

[Screenshot: chatbot web interface]

Sample Conversation

Example interaction showing empathetic and informative responses:

[Screenshot: chatbot web interface, sample conversation]


πŸŽ₯ Demo Video

https://www.loom.com/share/6d4af45916234437940b104dc607d170


πŸ“ˆ Key Results

Retrieval Performance

Evaluated across 580 ground truth queries using Hit Rate and Mean Reciprocal Rank (MRR):

Search Type | Hit Rate | MRR | Performance Summary
Semantic | 0.9276 | 0.6741 | Strong recall, moderate ranking
Lexical | 0.7845 | 0.5505 | Baseline BM25 performance
Hybrid (DBSF) | 0.9224 | 0.7090 | Best ranking quality
Hybrid (RRF) | 0.9379 | 0.6985 | Best overall; highest hit rate

Winner: Hybrid search with Reciprocal Rank Fusion (RRF) achieves the best retrieval performance with 93.79% hit rate, ensuring users almost always receive relevant AA information.


LLM Comparison

Comprehensive evaluation across 580 test cases using multiple metrics:

Semantic Similarity (Cosine)

Model | Mean Similarity | Median | Consistency (Std Dev) | Ranking
Llama 4 Scout | 0.7756 | 0.7786 | 0.1370 (best) | πŸ₯‡
Kimi K2 | 0.7425 | 0.7617 | 0.1509 | πŸ₯ˆ
GPT-20B | 0.6801 | 0.7162 | 0.2075 | πŸ₯‰

LLM-as-a-Judge Quality Assessment

Answer Fidelity (vs. Ground Truth):

Model | Relevant | Partly Relevant | Non-Relevant | Success Rate
Llama 4 Scout | 47.2% | 52.6% | 0.2% | ✨ 99.8%
Kimi K2 | 44.7% | 54.0% | 1.4% | 98.6%
GPT-20B | 41.7% | 53.1% | 5.2% | 94.8%

Question-Answering Quality:

Model | Relevant | Partly Relevant | Non-Relevant | User Satisfaction
Llama 4 Scout | 86.4% ✨ | 13.3% | 0.3% | 99.7%
Kimi K2 | 84.1% | 14.8% | 1.0% | 99.0%
GPT-20B | 84.1% | 10.9% | 5.0% | 95.0%

Overall Winner: Llama 4 Scout πŸ†

meta/llama-4-scout-17b-16e-instruct is deployed in production based on:

βœ… Highest semantic accuracy (77.6% cosine similarity)
βœ… Best consistency (lowest standard deviation)
βœ… Exceptional reliability (99.8% success rate)
βœ… Top question-answering (86.4% fully relevant responses)
βœ… Lowest failure rate (only 0.2-0.3% non-relevant answers)

Model Characteristics:

Model | Strengths | Best Use Case | Production Status
Llama 4 Scout | Accuracy, consistency, reliability | General AA information queries | βœ… Deployed
Kimi K2 | Long context, nuanced understanding | Complex multi-part questions | Available
GPT-20B | Creative rephrasing, diversity | Alternative perspectives | Available

Configuration:

The production deployment uses:

  • Retrieval: Hybrid RRF (93.79% hit rate)
  • Generation: Llama 4 Scout (86.4% relevance)
  • Combined: Provides accurate, empathetic, and trustworthy AA information

πŸ’‘ Future Improvements

  • Conversation Memory: Implement persistent chat history
  • Regional Meeting Finder: Integrate location-based AA meeting search
  • Human Feedback Loop: Collect user ratings to improve responses
  • Fine-tuned Model: Train a specialized model on AA literature

🀝 Contributing

While this is a personal project for the LLM Zoomcamp, feedback and suggestions are welcome! Please open an issue or reach out directly.


πŸ‘€ Author

Marcelo Nieva
Final Project for DataTalksClub LLM Zoomcamp


πŸ“„ License

This project is released under the MIT License


⚠️ Important Disclaimer

This chatbot is not affiliated with or endorsed by Alcoholics Anonymous.

This is an educational project built to demonstrate RAG systems and improve access to publicly available AA information. It should never replace professional medical advice, therapy, or in-person AA meetings.

If you or someone you know is struggling with alcohol use:

  • 🌐 Visit the official AA website: aa.org
  • πŸ“ž Contact a local AA chapter
  • πŸ₯ Seek professional medical help

In case of emergency, call your local emergency services immediately.


πŸ™ Acknowledgments

  • DataTalksClub for the excellent LLM Zoomcamp course
  • Alcoholics Anonymous for their invaluable resources and decades of helping people
  • NVIDIA for providing accessible LLM inference via NIM
  • Qdrant for their powerful vector search engine
  • The open-source community for tools like FastAPI, FastEmbed, and uv

Built with ❀️ to help people find information about recovery
