A Retrieval-Augmented Generation (RAG) chatbot for Alcoholics Anonymous information
Created by Marcelo Nieva for the DataTalksClub LLM Zoomcamp
AA Assistant is an intelligent conversational agent designed to provide accurate, empathetic, and trustworthy information about Alcoholics Anonymous (AA) to individuals seeking help or guidance with alcohol-related concerns.
The chatbot leverages a Retrieval-Augmented Generation (RAG) pipeline built entirely from official AA sources, ensuring that all responses are grounded in verified documentation. This project demonstrates a complete end-to-end LLM application: from data collection and indexing to semantic retrieval, generation, and comprehensive evaluation.
Multilingual Support: The assistant operates in both Spanish and English, making AA information accessible to a broader audience across different regions and language preferences.
AA Assistant
Cloud deployment on Koyeb: https://pale-sisely-arac-347f0d7a.koyeb.app/
This project addresses all key evaluation criteria outlined in the LLM Zoomcamp:
| Criterion | Implementation | Location in Code | 
|---|---|---|
| Problem Definition | Clear use case: providing reliable AA information to people struggling with alcohol | This README, src/RAG/prompts.py | 
| Data Collection | Web scraping from official AA websites (global & Argentina) | data/final/*.json | 
| Data Indexing | Vector embeddings using Jina AI v2 (Spanish/English) + sparse BM25 indexing | src/ingest.py, src/config/db.py | 
| Retrieval Strategy | Hybrid search (semantic + lexical) with DBSF and RRF reranking | src/RAG/main.py, src/eval/retrival/ | 
| LLM Integration | NVIDIA NIM endpoints with three models evaluated | src/LLM/main.py | 
| Application Interface | FastAPI serving custom HTML/CSS/JS interface | src/server/app.py, src/server/templates/ | 
| Evaluation | Comprehensive evaluation framework: retrieval metrics (Hit Rate, MRR) and generation quality (cosine similarity, LLM-as-judge) across 580 test cases | src/eval/retrival/, src/eval/llm/ | 
| Monitoring | OpenTelemetry instrumentation with Arize Phoenix for real-time observability, trace tracking, and user feedback collection | src/monitoring/tracing.py, src/server/app.py | 
| Documentation | Detailed README with setup instructions, architecture, evaluation results, and monitoring guidelines | This file | 
All knowledge in the chatbot comes from official Alcoholics Anonymous sources:
- AA Official Website (Spanish): https://www.aa.org/es
  - Core AA literature: The Twelve Steps, The Big Book, FAQs
- AA Argentina Official Website: https://aa.org.ar/
  - Regional meeting information, local resources, and Argentina-specific guidance
    data/final/
    ├── FAQS.json                    # Frequently Asked Questions
    ├── FAQS_IDX.json                # Indexed FAQs with IDs
    ├── Ground_Truth.json            # Evaluation ground truth dataset
    ├── Ground_Truth_IDX.json        # Indexed ground truth
    ├── answers_gpt_20b.json         # Generated answers from GPT-20B
    ├── answers_kimi_k2.json         # Generated answers from Kimi K2
    └── answers_llama4_scout.json    # Generated answers from Llama 4 Scout
    User Interface
    (FastAPI + HTML/CSS/JS)
         │
         ▼
    RAG Pipeline
      ├── Semantic Search (Jina v2)
      ├── Lexical Search (BM25)
      └── Hybrid Reranking (DBSF/RRF)
         │
         ▼
    Vector Database (Qdrant)
      • Dense vectors: Jina Embeddings v2 (768 dims)
      • Sparse vectors: BM25 tokenization
         │
         ▼
    LLM Generation Layer (NVIDIA NIM)
      • openai/gpt-oss-20b
      • moonshotai/kimi-k2-instruct
      • meta/llama-4-scout-17b-16e-instruct
    src/
    ├── RAG/                          # Retrieval-Augmented Generation
    │   ├── main.py                   # Core RAG orchestration with tracing
    │   └── prompts.py                # System and user prompts
    ├── LLM/                          # Language Model interface
    │   └── main.py                   # NVIDIA NIM client wrapper with token tracking
    ├── eval/                         # Evaluation
    │   ├── retrival/                 # Retrieval evaluation
    │   │   ├── evaluate.py           # Hit Rate and MRR metrics
    │   │   └── metrics.py            # Metric calculation utilities
    │   ├── llm/                      # Generation evaluation
    │   │   ├── generate_answers.py   # Answer generation for evaluation
    │   │   ├── cosine_similarity/    # Semantic similarity evaluation
    │   │   │   └── evaluate.py
    │   │   └── llm_judge/            # LLM-as-judge evaluation
    │   │       └── evaluate.py
    │   ├── generate_ground_truth.py  # Ground truth dataset creation
    │   └── prompts.py                # Evaluation prompts
    ├── monitoring/                   # Observability & Tracing
    │   └── tracing.py                # OpenTelemetry + Phoenix setup
    ├── server/                       # Web application
    │   ├── app.py                    # FastAPI server with traced endpoints
    │   └── templates/
    │       └── index.html            # Frontend interface
    ├── config/                       # Configuration management
    │   ├── db.py                     # Qdrant database setup
    │   ├── envs.py                   # Environment variables
    │   ├── paths.py                  # Path constants
    │   └── utils.py                  # Utility functions
    ├── ingest.py                     # Data ingestion pipeline
    └── main.py                       # Entry point
- Python 3.12+
- uv package manager
- Docker
- NVIDIA NIM API key
    curl -LsSf https://astral.sh/uv/install.sh | sh

Verify installation:

    uv --version

Clone the repository:

    git clone https://github.com/marcelonieva7/AA_Bot.git
    cd AA_Bot

Create the virtual environment and install dependencies:

    uv venv
    uv sync

This installs all dependencies from pyproject.toml using locked versions from uv.lock.
Create a .env file in the project root by copying the provided template:
    cp .env.template .env

Then edit .env with your configuration:

    # Qdrant Vector Database
    QDRANT_URL=http://localhost:6333

    # NVIDIA NIM API Configuration
    NVIDIA_API_KEY=
    NVIDIA_URL=

Configuration Details:
| Variable | Description | Default/Example | Required |
|---|---|---|---|
| QDRANT_URL | Qdrant vector database endpoint | http://localhost:6333 | ✅ Yes |
| NVIDIA_API_KEY | Your NVIDIA NIM API key from build.nvidia.com | nvapi-xxx... | ✅ Yes |
| NVIDIA_URL | NVIDIA NIM API base URL | https://integrate.api.nvidia.com/v1 | ✅ Yes |
Getting Your NVIDIA API Key:
- Visit https://build.nvidia.com/
- Sign in or create a free account
- Navigate to your API keys section
- Generate a new API key
- Copy the key to your .env file
Note for Docker Users:
If deploying with Docker Compose, you can also configure .env.docker with the same variables. The Docker setup uses:
    QDRANT_URL=http://qdrant:6333  # Note: uses the service name instead of localhost

Step 1: Start Qdrant Database
First, launch the Qdrant vector database using Docker:
    docker run --rm -p 6333:6333 -p 6334:6334 \
        -v "$(pwd)/docker_volumes/qdrant_storage:/qdrant/storage:z" \
        qdrant/qdrant

This command:
- Exposes port 6333 for the REST API
- Exposes port 6334 for the gRPC API
- Persists data in ./docker_volumes/qdrant_storage/
- Runs in the foreground (use Ctrl+C to stop)
Verify Qdrant is running:
    curl http://localhost:6333/

You should see a JSON response with version information.
Step 2: Ingest and Index Documents
Once Qdrant is running, populate the vector database:
    uv run -m src.ingest

This ingestion pipeline will:
- Load documents from data/final/FAQS.json and related files
- Generate dense embeddings using Jina Embeddings v2 (768 dimensions)
- Create sparse BM25 vectors for lexical search
- Index all vectors in Qdrant with hybrid search capabilities
- Create the collection with optimized indexing parameters (see the sketch below)
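A minimal sketch of what the dense + sparse indexing step can look like with qdrant-client and FastEmbed. The collection name, vector names, and document schema here are illustrative assumptions; the project's actual logic lives in src/ingest.py and src/config/db.py:

```python
import json
from fastembed import TextEmbedding, SparseTextEmbedding
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "aa_faqs"  # hypothetical name; the real one is defined in src/config/db.py

# One named dense vector (Jina v2, 768 dims) plus one named sparse vector (BM25)
client.create_collection(
    collection_name=COLLECTION,
    vectors_config={"dense": models.VectorParams(size=768, distance=models.Distance.COSINE)},
    sparse_vectors_config={"bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)},
)

dense_model = TextEmbedding("jinaai/jina-embeddings-v2-base-es")
sparse_model = SparseTextEmbedding("Qdrant/bm25")

docs = json.load(open("data/final/FAQS_IDX.json", encoding="utf-8"))
texts = [f'{d["question"]} {d["answer"]}' for d in docs]  # assumed document fields

points = [
    models.PointStruct(
        id=i,
        vector={
            "dense": dense.tolist(),
            "bm25": models.SparseVector(
                indices=sparse.indices.tolist(), values=sparse.values.tolist()
            ),
        },
        payload=doc,
    )
    for i, (doc, dense, sparse) in enumerate(
        zip(docs, dense_model.embed(texts), sparse_model.embed(texts))
    )
]
client.upsert(collection_name=COLLECTION, points=points)
```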
    uv run uvicorn src.server.app:app --host 0.0.0.0 --port 8000 --reload

Open your browser at http://localhost:8000 to access the chatbot interface.
Use .env.docker for Docker-specific settings:
    QDRANT_URL=http://qdrant:6333
    NVIDIA_API_KEY=
    NVIDIA_URL=https://integrate.api.nvidia.com/v1

Then build and start all services:

    docker compose up --build

This starts:
- FastAPI server on port 8000
 - Qdrant vector database on port 6333
 - Phoenix Observability and monitoring on port 6006
 
The chatbot implements a sophisticated hybrid retrieval system combining semantic and lexical search:
    search_types = [
        'semantic',              # Dense vector similarity (Jina v2)
        'lexical',               # Sparse BM25 keyword matching
        ['hybrid', 'DBSF'],      # Hybrid with Distribution-Based Score Fusion
        ['hybrid', 'RRF']        # Hybrid with Reciprocal Rank Fusion
    ]

| Model Type | Model Name | Purpose | Dimensions |
|---|---|---|---|
| Dense | jinaai/jina-embeddings-v2-base-es | Semantic similarity in Spanish and English | 768 |
| Sparse | Qdrant/bm25 | Lexical keyword matching | Variable |
- Semantic search captures conceptual similarity and handles paraphrasing
- Lexical search ensures exact term matches (important for AA-specific terminology)
- Reranking algorithms (DBSF/RRF) combine both strengths for optimal retrieval (a hybrid query sketch follows)
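As a rough illustration, a hybrid query with fusion reranking can be expressed with qdrant-client along these lines. Collection and vector names mirror the hypothetical ingestion sketch above; the project's actual search code lives in src/RAG/main.py:

```python
from fastembed import TextEmbedding, SparseTextEmbedding
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
dense_model = TextEmbedding("jinaai/jina-embeddings-v2-base-es")
sparse_model = SparseTextEmbedding("Qdrant/bm25")

def hybrid_search(query: str, fusion: models.Fusion = models.Fusion.RRF, limit: int = 5):
    dense = next(iter(dense_model.embed([query]))).tolist()
    sparse = next(iter(sparse_model.embed([query])))
    return client.query_points(
        collection_name="aa_faqs",  # hypothetical collection name
        prefetch=[
            # Each prefetch runs one search branch; the fusion step reranks the union
            models.Prefetch(query=dense, using="dense", limit=20),
            models.Prefetch(
                query=models.SparseVector(
                    indices=sparse.indices.tolist(), values=sparse.values.tolist()
                ),
                using="bm25",
                limit=20,
            ),
        ],
        query=models.FusionQuery(fusion=fusion),  # models.Fusion.RRF or models.Fusion.DBSF
        limit=limit,
    ).points

hits = hybrid_search("¿Qué son los 12 pasos de AA?")
```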
The project compares three state-of-the-art language models via NVIDIA NIM endpoints:
| Model | Identifier | Strengths | Use Case |
|---|---|---|---|
| GPT-20B | openai/gpt-oss-20b | Balanced performance, good Spanish support | General purpose answering |
| Kimi K2 | moonshotai/kimi-k2-instruct | Long context, instruction following | Complex multi-step reasoning |
| Llama 4 Scout | meta/llama-4-scout-17b-16e-instruct | Efficient, fast inference | Quick responses |
All models are accessed through the NVIDIA NIM API, enabling consistent evaluation and deployment.
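Since NVIDIA NIM exposes an OpenAI-compatible API, a call can be sketched with the standard openai client as below; this is only an illustration, as the project's actual wrapper (including token tracking) lives in src/LLM/main.py:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["NVIDIA_URL"],   # e.g. https://integrate.api.nvidia.com/v1
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful, empathetic assistant for AA information."},
        {"role": "user", "content": "¿Qué son los 12 pasos de AA?"},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
print(response.usage.total_tokens)  # token usage, as tracked by the LLM wrapper
```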
Located in src/eval/retrival/, this module measures retrieval quality using two key metrics:
Hit Rate (Recall@k)
- Measures the proportion of queries where at least one relevant document appears in the top-k results
- Formula: (Number of queries with relevant docs in top-k) / (Total queries)
- Range: 0.0 to 1.0 (higher is better)
Mean Reciprocal Rank (MRR)
- Evaluates how high the first relevant document ranks in the results
- Formula: Average(1 / rank of first relevant document)
- Range: 0.0 to 1.0 (higher is better)
- Emphasizes ranking quality: finding relevant docs early is rewarded (see the sketch below)
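Both metrics reduce to a few lines of Python. A minimal sketch, independent of the project's actual src/eval/retrival/metrics.py implementation:

```python
def hit_rate(retrieved_ids: list[str], relevant_id: str) -> float:
    """1.0 if the relevant document appears anywhere in the top-k results."""
    return 1.0 if relevant_id in retrieved_ids else 0.0

def reciprocal_rank(retrieved_ids: list[str], relevant_id: str) -> float:
    """1 / rank of the first relevant document, 0.0 if it is missing."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy example: two queries with their top-3 retrieved IDs and the known relevant ID
runs = [(["faq_2", "faq_7", "faq_1"], "faq_7"), (["faq_9", "faq_3", "faq_5"], "faq_4")]
hr  = sum(hit_rate(r, t) for r, t in runs) / len(runs)          # 0.5
mrr = sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)   # 0.25
```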
    uv run -m src.eval.retrival.evaluate

This script:
- Loads ground truth questions with known relevant document IDs
- Executes each search type (semantic, lexical, hybrid)
- Compares retrieved document IDs against ground truth
- Calculates Hit Rate and MRR for each configuration
- Saves results to src/eval/retrival/Eval_results.csv
Based on evaluation across 580 test queries from the ground truth dataset:
| Search Type | Hit Rate | MRR | Interpretation | 
|---|---|---|---|
| Semantic | 0.9276 | 0.6741 | Finds relevant docs 92.8% of the time, typically near the top of the results |
| Lexical | 0.7845 | 0.5505 | Baseline BM25 performance; finds docs 78.5% of the time |
| Hybrid (DBSF) | 0.9224 | 0.7090 | Best ranking quality; relevant docs appear higher on average |
| Hybrid (RRF) | 0.9379 | 0.6985 | Best hit rate; finds relevant docs 93.8% of the time |
Winner: Hybrid RRF for production deployment
- Highest hit rate (93.79%) ensures users almost always get relevant information
 - Strong MRR (0.6985) means relevant docs typically appear in top 2-3 positions
 - Combines strengths of both semantic understanding and exact term matching
 
Best Ranking: Hybrid DBSF
- Highest MRR (0.7090) means relevant documents rank slightly higher on average
 - Nearly as good hit rate (92.24%)
 - Excellent choice if ranking position is critical
 
Semantic-Only Performance
- Strong hit rate (92.76%) shows Jina embeddings work well for Spanish AA content
 - Lower MRR suggests relevant docs sometimes appear lower in results
 - Good fallback if computational resources are limited
 
Lexical-Only Performance
- Lowest performance (78.45% hit rate) confirms pure keyword matching isn't sufficient
 - Struggles with paraphrasing and conceptual questions
 - Important as a component but insufficient alone
 
    Hit Rate Comparison (Higher is Better)
    ─────────────────────────────────────────
    Hybrid (RRF)    ████████████████████ 93.79%
    Semantic        ████████████████████ 92.76%
    Hybrid (DBSF)   ████████████████████ 92.24%
    Lexical         ███████████████      78.45%

    MRR Comparison (Higher is Better)
    ─────────────────────────────────────────
    Hybrid (DBSF)   ████████████████████ 0.7090
    Hybrid (RRF)    ███████████████████  0.6985
    Semantic        ██████████████████   0.6741
    Lexical         ███████████████      0.5505
Located in src/eval/llm/, this module includes:
Measures semantic overlap between generated answers and ground truth using vector embeddings. This quantitative metric evaluates how closely each model's responses align with reference answers.
Running the Evaluation:
    uv run -m src.eval.llm.cosine_similarity.evaluate

Methodology:
- Embeds both generated answers and ground truth using the same embedding model
- Computes cosine similarity scores (range: -1 to 1, where 1 = identical meaning)
- Analyzes the distribution across 580 test question-answer pairs (a minimal sketch follows the list)
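A minimal sketch of this comparison, assuming the same FastEmbed model used for indexing; the project's actual script is src/eval/llm/cosine_similarity/evaluate.py and the example strings are illustrative:

```python
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding("jinaai/jina-embeddings-v2-base-es")

def cosine_similarity(a: str, b: str) -> float:
    # Embed both texts with the same model and compare directions
    u, v = (np.asarray(e) for e in model.embed([a, b]))
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

score = cosine_similarity(
    "Los 12 pasos son principios guía para la recuperación.",             # generated answer
    "Los Doce Pasos son el núcleo del programa de recuperación de AA.",   # ground truth
)
print(round(score, 4))
```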
Results Summary:
| Model | Mean Similarity | Median | Std Dev | Min | Max | 25th %ile | 75th %ile | 
|---|---|---|---|---|---|---|---|
| Llama 4 Scout | 0.7756 | 0.7786 | 0.1370 | -0.0223 | 1.0000 | 0.7024 | 0.8626 | 
| Kimi K2 | 0.7425 | 0.7617 | 0.1509 | -0.0435 | 1.0000 | 0.6675 | 0.8411 | 
| GPT-20B | 0.6801 | 0.7162 | 0.2075 | -0.1596 | 1.0000 | 0.6182 | 0.8002 | 
Detailed Statistics:
GPT-20B (openai/gpt-oss-20b)
count  580.000000
mean     0.680088
std      0.207454
min     -0.159631
25%      0.618194
50%      0.716159
75%      0.800209
max      1.000000
Analysis:
- Lowest mean similarity (0.68) suggests more creative/varied responses
 - Highest standard deviation (0.21) indicates inconsistent alignment with ground truth
 - Some negative scores show occasional semantic drift
 - 75% of responses still achieve >0.62 similarity
 
Llama 4 Scout (meta/llama-4-scout-17b-16e-instruct)
count  580.000000
mean     0.775588
std      0.136974
min     -0.022276
25%      0.702364
50%      0.778617
75%      0.862613
max      1.000000
Analysis:
- Highest mean similarity (0.78): best alignment with reference answers
 - Lowest standard deviation (0.14) shows most consistent performance
 - Minimum score near 0 (vs. negative for others) indicates fewer outliers
 - 75% of responses achieve >0.86 similarity (excellent)
 - Winner for semantic accuracy
 
Kimi K2 (moonshotai/kimi-k2-instruct)
count  580.000000
mean     0.742493
std      0.150890
min     -0.043473
25%      0.667533
50%      0.761724
75%      0.841071
max      1.000000
Analysis:
- Strong performance (0.74 mean): second place overall
 - Moderate standard deviation (0.15) shows good consistency
 - Balanced distribution with median close to mean
 - 75% of responses achieve >0.84 similarity
 - Good middle ground between accuracy and creativity
 
Key Insights:
Llama 4 Scout emerges as the winner for semantic accuracy:
- Highest average similarity (77.6%) to ground truth
 - Most consistent performance across all queries
 - Fewest outliers and semantic drift cases
 
Kimi K2 provides strong balanced performance:
- Second-best similarity (74.2%)
 - Good for complex queries requiring nuanced understanding
 
GPT-20B trades consistency for variety:
- More creative/diverse responses (can be positive or negative depending on use case)
 - Less predictable alignment with reference answers
 - May provide alternative valid perspectives not captured in ground truth
 
Distribution Visualization:
    Similarity Score Distribution
    ───────────────────────────────────────
    Llama 4 Scout:  [████████████████████████] 77.6% avg
                      ↑ Most consistent
    Kimi K2:        [██████████████████████  ] 74.2% avg
                      ↑ Balanced performance
    GPT-20B:        [███████████████████     ] 68.0% avg
                      ↑ More variance
Recommendation: For production deployment, Llama 4 Scout is recommended when semantic accuracy to established AA content is prioritized. However, all three models perform above the 0.68 threshold, indicating they all generate semantically relevant responses.
Uses an advanced language model to qualitatively assess answer quality through structured evaluation prompts. This approach provides nuanced assessment beyond pure numerical metrics.
Running the Evaluation:
    uv run -m src.eval.llm.llm_judge.evaluate

Methodology:
Two complementary evaluation perspectives are used to assess each model's responses across 580 test cases:
Prompt 1: Answer-to-Answer Comparison
- Compares generated answer against a reference ground truth answer
 - Evaluates semantic preservation and information completeness
- Stricter evaluation: checks if the model maintains factual accuracy
"Compare the generated answer to the original reference answer 
and classify relevance as: NON_RELEVANT | PARTLY_RELEVANT | RELEVANT"Prompt 2: Question-to-Answer Alignment
- Evaluates how well the generated answer addresses the original question
 - Focuses on user satisfaction and practical utility
- More lenient: allows for valid alternative phrasings and approaches
"Evaluate how well the generated answer responds to the question
and classify relevance as: NON_RELEVANT | PARTLY_RELEVANT | RELEVANT"Results Summary:
| Model | Prompt Type | Relevant | Partly Relevant | Non-Relevant | Success Rate* | 
|---|---|---|---|---|---|
| Llama 4 Scout | Answer Comparison (P1) | 274 (47.2%) | 305 (52.6%) | 1 (0.2%) | 99.8% | 
| Llama 4 Scout | Question Alignment (P2) | 501 (86.4%) | 77 (13.3%) | 2 (0.3%) | 99.7% | 
| Kimi K2 | Answer Comparison (P1) | 259 (44.7%) | 313 (54.0%) | 8 (1.4%) | 98.6% | 
| Kimi K2 | Question Alignment (P2) | 488 (84.1%) | 86 (14.8%) | 6 (1.0%) | 99.0% | 
| GPT-20B | Answer Comparison (P1) | 242 (41.7%) | 308 (53.1%) | 30 (5.2%) | 94.8% | 
| GPT-20B | Question Alignment (P2) | 488 (84.1%) | 63 (10.9%) | 29 (5.0%) | 95.0% | 
*Success Rate = (Relevant + Partly Relevant) / Total
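For reference, a judge call of this kind can be sketched with the same OpenAI-compatible NIM client. The prompt below paraphrases the two prompts above and the judge model choice is an assumption; the project's actual prompts live in src/eval/prompts.py:

```python
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["NVIDIA_URL"], api_key=os.environ["NVIDIA_API_KEY"])

JUDGE_PROMPT = """Evaluate how well the generated answer responds to the question
and classify relevance as: NON_RELEVANT | PARTLY_RELEVANT | RELEVANT

Question: {question}
Generated answer: {answer}

Reply with exactly one label."""

def judge(question: str, answer: str,
          judge_model: str = "meta/llama-4-scout-17b-16e-instruct") -> str:
    # Deterministic single-label classification of one question/answer pair
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```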
Detailed Results:
GPT-20B (openai/gpt-oss-20b)
Prompt 1 - Answer Comparison:
PARTLY_RELEVANT    308 (53.1%)
RELEVANT           242 (41.7%)
NON_RELEVANT        30 (5.2%)
Prompt 2 - Question Alignment:
RELEVANT           488 (84.1%)
PARTLY_RELEVANT     63 (10.9%)
NON_RELEVANT        29 (5.0%)
Analysis:
- Shows the largest gap between strict (P1) and lenient (P2) evaluation
- 5% non-relevant rate is the highest among all models, indicating more factual drift
- Strong question-answering capability (84% fully relevant)
- When aligned, provides comprehensive answers
- ⚠️ Higher risk of deviating from reference material
Llama 4 Scout (meta/llama-4-scout-17b-16e-instruct)
Prompt 1 - Answer Comparison:
PARTLY_RELEVANT    305 (52.6%)
RELEVANT           274 (47.2%)
NON_RELEVANT         1 (0.2%)  ← Best
Prompt 2 - Question Alignment:
RELEVANT           501 (86.4%)  ← Best
PARTLY_RELEVANT     77 (13.3%)
NON_RELEVANT         2 (0.3%)
Analysis:
- Winner: Best overall performance
 - Virtually no non-relevant answers (0.2-0.3%)
 - Highest "Relevant" score on question alignment (86.4%)
 - Excellent balance between accuracy and utility
 - Most reliable for production deployment
 
Kimi K2 (moonshotai/kimi-k2-instruct)
Prompt 1 - Answer Comparison:
PARTLY_RELEVANT    313 (54.0%)
RELEVANT           259 (44.7%)
NON_RELEVANT         8 (1.4%)
Prompt 2 - Question Alignment:
RELEVANT           488 (84.1%)
PARTLY_RELEVANT     86 (14.8%)
NON_RELEVANT         6 (1.0%)
Analysis:
- Strong performance, second place overall
 - Low non-relevant rate (1.0-1.4%)
 - Tied with GPT-20B on question alignment (84.1%)
 - Slightly more conservative than Llama (more "partly relevant" classifications)
 - Good choice for complex, nuanced queries
 
Key Insights:
Evaluation Method Comparison:
The two prompts reveal different aspects of model performance:
| Metric | Prompt 1 (Strict) | Prompt 2 (Lenient) | Insight | 
|---|---|---|---|
| Average "Relevant" | 45.5% | 85.5% | Models better at answering questions than matching reference style | 
| Average "Non-Relevant" | 2.3% | 2.1% | Consistent failure rate across evaluation methods | 
Performance Ranking:
By Accuracy (Prompt 1 - Answer Fidelity):
1. Llama 4 Scout: 99.8% success rate, 0.2% failures
2. Kimi K2: 98.6% success rate, 1.4% failures
3. GPT-20B: 94.8% success rate, 5.2% failures
By Utility (Prompt 2 - Question Answering):
1. Llama 4 Scout: 86.4% fully relevant, 0.3% failures
2. Kimi K2 / GPT-20B: 84.1% fully relevant (tied)
Practical Implications:
For Production Use:
- Llama 4 Scout recommended: highest reliability + best question-answering
 - Maintains factual accuracy while providing helpful responses
 - Lowest risk of hallucination or irrelevant content
 
For Specialized Cases:
- Kimi K2: Excellent for long-context queries requiring deep understanding
 - GPT-20B: Consider when creative rephrasing is valued over strict accuracy
 
Visual Comparison:
    Success Rate (Relevant + Partly Relevant)
    ───────────────────────────────────────
    Answer Comparison (Strict):
    Llama 4 Scout  [████████████████████████] 99.8%
    Kimi K2        [████████████████████████] 98.6%
    GPT-20B        [███████████████████████ ] 94.8%

    Question Alignment (Lenient):
    Llama 4 Scout  [████████████████████████] 99.7%
    Kimi K2        [████████████████████████] 99.0%
    GPT-20B        [███████████████████████ ] 95.0%
Recommendation:
Based on combined evaluation (cosine similarity + LLM-as-a-judge), Llama 4 Scout emerges as the optimal model for the AA Assistant chatbot:
- ✅ Highest semantic similarity (77.6%)
- ✅ Best LLM-judge scores (86.4% fully relevant)
- ✅ Lowest failure rate (0.2-0.3% non-relevant)
- ✅ Consistent performance across evaluation methods
- ✅ Balanced accuracy and user satisfaction
This model is currently deployed in production.
The AA Assistant implements comprehensive observability using OpenTelemetry and Arize Phoenix to track performance, debug issues, and ensure system reliability in production.
    User Request
         │
         ▼
    FastAPI Endpoint (/chat)          [Traced: api.chat span]
      • Captures: client IP, model, query length
      • Returns: response + trace_id
         │
         ▼
    RAG Pipeline                      [Traced: rag.pipeline span]
      • Query preprocessing and validation
      • Coordinates retrieval + generation
         │
         ├──────────────────────────────┐
         ▼                              ▼
    Database Search                 LLM Generation
    [rag_search span]               [llm.chat span]
      • Search type                   • Model name
      • Fusion algorithm              • Token usage
      • Documents count               • Response preview
      • Results preview               • Finish reason
         │
         ▼
    Feedback Collection
    [api.feedback span]
      • User satisfaction
      • Links to trace_id
The system captures detailed telemetry across four key areas:
Tracks all incoming requests to the chatbot:
| Attribute | Description | Example |
|---|---|---|
| request.client_ip | User's IP address | 192.168.1.100 |
| chat.model | Selected LLM model | meta/llama-4-scout-17b-16e-instruct |
| chat.query_preview | First 200 chars of query | "¿Qué son los 12 pasos de AA?" |
| chat.query_length | Total query length | 45 |
| chat.response_preview | First 200 chars of response | "Los 12 pasos son..." |
| trace_id | Unique identifier for request | 7f3b2a1c... |
Oversees the entire RAG process:
| Attribute | Description | Example |
|---|---|---|
| rag.query_length | Query character count | 150 |
| rag.query_preview | First 500 chars of query | Full query text |
| span.status | Operation success/failure | OK / ERROR |
Captures vector database search operations:
| Attribute | Description | Example |
|---|---|---|
| search_limit | Max documents to retrieve | 10 |
| search_type | Search algorithm used | hybrid |
| search_fusion_alg | Reranking method | RRF |
| retrieved_documents_count | Actual docs found | 8 |
| documents | Detailed results preview | JSON array with doc metadata |
Document Preview Structure:
    [
      {
        "ranking": 1,
        "answer": "Los 12 pasos son principios guía...",
        "question": "¿Qué son los 12 pasos?",
        "id": "faq_123",
        "topic": "Programa de AA",
        "source": "aa.org/es/12-pasos"
      }
    ]

Tracks language model invocations:
| Attribute | Description | Example |
|---|---|---|
| llm.model | Model identifier | meta/llama-4-scout-17b-16e-instruct |
| input.user_preview | First 200 chars of prompt | User query |
| llm.prompt_tokens | Tokens in input | 450 |
| llm.completion_tokens | Tokens in output | 280 |
| llm.total_tokens | Total token usage | 730 |
| llm.finish_reason | Completion status | stop / length |
| llm.output_preview | First 200 chars of response | Generated answer |
Collects user satisfaction data:
| Attribute | Description | Example |
|---|---|---|
| feedback.value | User rating | positive / negative |
| feedback.question_preview | Original query | First 200 chars |
| feedback.answer_preview | Bot response | First 200 chars |
| related_trace_id | Links to original chat trace | 7f3b2a1c... |
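A minimal sketch of how a feedback endpoint can attach these attributes and link back to the originating trace. Route and field names are illustrative; the real endpoint lives in src/server/app.py:

```python
from fastapi import FastAPI
from opentelemetry import trace
from pydantic import BaseModel

app = FastAPI()
tracer = trace.get_tracer("aa-assistant")

class Feedback(BaseModel):
    value: str        # "positive" | "negative"
    question: str
    answer: str
    trace_id: str     # returned to the client by /chat

@app.post("/feedback")
def collect_feedback(fb: Feedback) -> dict:
    with tracer.start_as_current_span("api.feedback") as span:
        span.set_attribute("feedback.value", fb.value)
        span.set_attribute("feedback.question_preview", fb.question[:200])
        span.set_attribute("feedback.answer_preview", fb.answer[:200])
        span.set_attribute("related_trace_id", fb.trace_id)  # links feedback to the chat trace
    return {"status": "recorded"}
```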
Phoenix runs as a Docker container alongside the application. It's already configured in docker-compose.yml:
Option A: Start Everything with Docker Compose (Recommended)
    docker compose up -d

This starts all services:
- FastAPI server on port 8000
 - Qdrant vector database on port 6333
 - Phoenix monitoring on port 6006
 
Option B: Run Phoenix Separately (Development)
If you're running the FastAPI app locally but want Phoenix in Docker:
    # Start only Phoenix
    docker compose up phoenix -d

    # Then run your app locally
    uv run uvicorn src.server.app:app --host 0.0.0.0 --port 8000 --reload

Verify Phoenix is Running:
    # Check container status
    docker compose ps phoenix

    # Or visit the dashboard
    http://localhost:6006

The Phoenix dashboard will be available at http://localhost:6006
Stopping Phoenix:
    # Stop all services
    docker compose down

    # Stop only Phoenix
    docker compose stop phoenix

The Arize Phoenix dashboard provides:
- Real-time trace visualization
 - Waterfall charts showing span hierarchies
 - Performance bottleneck identification
 - Error tracking and debugging
 
Click any trace to see:
- Timeline: Visual span waterfall
 - Attributes: All captured metadata
 - Events: Exceptions and log messages
 - Relationships: Parent-child span connections
 
The monitoring system respects user privacy:
- No full message storage: Only previews (first 200-500 chars) are logged
 - No PII collection: No names, emails, or personal identifiers
 - Feedback linkage: Trace IDs connect feedback without storing personal data
 
Traced files:
    src/
    ├── server/app.py    # API endpoints with tracing
    ├── RAG/main.py      # RAG pipeline with nested spans
    └── LLM/main.py      # LLM generation with token tracking
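A stripped-down sketch of what such a setup can look like with OpenTelemetry exporting to a local Phoenix container. Span names mirror the ones listed above, while the exporter endpoint and service name are assumptions; the project's real configuration is in src/monitoring/tracing.py:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans over OTLP/HTTP to the Phoenix collector started by docker compose
provider = TracerProvider(resource=Resource.create({"service.name": "aa-assistant"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("aa-assistant")

# Example nested spans mirroring the pipeline described above
with tracer.start_as_current_span("rag.pipeline") as pipeline_span:
    query = "¿Qué son los 12 pasos de AA?"
    pipeline_span.set_attribute("rag.query_length", len(query))
    with tracer.start_as_current_span("rag_search") as search_span:
        search_span.set_attribute("search_type", "hybrid")
        search_span.set_attribute("search_fusion_alg", "RRF")
```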
Based on LLM Zoomcamp evaluation criteria:
| Objective | Implementation | Location |
|---|---|---|
| Trace API requests | ✅ OpenTelemetry spans on /chat and /feedback | src/server/app.py |
| Monitor RAG pipeline | ✅ Nested spans for retrieval + generation | src/RAG/main.py |
| Track LLM usage | ✅ Token counts, model selection, latency | src/LLM/main.py |
| Capture user feedback | ✅ Feedback endpoint linked to trace_id | src/server/app.py |
| Visualize traces | ✅ Arize Phoenix dashboard | http://localhost:6006 |
| Debug issues | ✅ Searchable traces with full context | Phoenix UI |
| Performance optimization | ✅ Latency tracking per component | All traced spans |
The chatbot provides a clean, accessible interface for users seeking AA information:

Example interaction showing empathetic and informative responses:

https://www.loom.com/share/6d4af45916234437940b104dc607d170
Evaluated across 580 ground truth queries using Hit Rate and Mean Reciprocal Rank (MRR):
| Search Type | Hit Rate | MRR | Performance Summary | 
|---|---|---|---|
| Semantic | 0.9276 | 0.6741 | Strong recall, moderate ranking | 
| Lexical | 0.7845 | 0.5505 | Baseline BM25 performance | 
| Hybrid (DBSF) | 0.9224 | 0.7090 | Best ranking quality | 
| Hybrid (RRF) | 0.9379 | 0.6985 | Best overall - highest hit rate | 
Winner: Hybrid search with Reciprocal Rank Fusion (RRF) achieves the best retrieval performance with 93.79% hit rate, ensuring users almost always receive relevant AA information.
Comprehensive evaluation across 580 test cases using multiple metrics:
| Model | Mean Similarity | Median | Consistency (Std Dev) | Ranking | 
|---|---|---|---|---|
| Llama 4 Scout | 0.7756 | 0.7786 | 0.1370 (Best) | 1st |
| Kimi K2 | 0.7425 | 0.7617 | 0.1509 | 2nd |
| GPT-20B | 0.6801 | 0.7162 | 0.2075 | 3rd |
Answer Fidelity (vs. Ground Truth):
| Model | Relevant | Partly Relevant | Non-Relevant | Success Rate | 
|---|---|---|---|---|
| Llama 4 Scout | 47.2% | 52.6% | 0.2% (best) | 99.8% |
| Kimi K2 | 44.7% | 54.0% | 1.4% | 98.6% | 
| GPT-20B | 41.7% | 53.1% | 5.2% | 94.8% | 
Question-Answering Quality:
| Model | Relevant | Partly Relevant | Non-Relevant | User Satisfaction | 
|---|---|---|---|---|
| Llama 4 Scout | 86.4% (best) | 13.3% | 0.3% | 99.7% |
| Kimi K2 | 84.1% | 14.8% | 1.0% | 99.0% | 
| GPT-20B | 84.1% | 10.9% | 5.0% | 95.0% | 
meta/llama-4-scout-17b-16e-instruct is deployed in production based on:
- ✅ Highest semantic accuracy (77.6% cosine similarity)
- ✅ Best consistency (lowest standard deviation)
- ✅ Exceptional reliability (99.8% success rate)
- ✅ Top question-answering (86.4% fully relevant responses)
- ✅ Lowest failure rate (only 0.2-0.3% non-relevant answers)
Model Characteristics:
| Model | Strengths | Best Use Case | Production Status | 
|---|---|---|---|
| Llama 4 Scout | Accuracy, consistency, reliability | General AA information queries | ✅ Deployed |
| Kimi K2 | Long context, nuanced understanding | Complex multi-part questions | Available | 
| GPT-20B | Creative rephrasing, diversity | Alternative perspectives | Available | 
Configuration:
The production deployment uses:
- Retrieval: Hybrid RRF (93.79% hit rate)
 - Generation: Llama 4 Scout (86.4% relevance)
 - Combined: Provides accurate, empathetic, and trustworthy AA information
 
- Conversation Memory: Implement persistent chat history
 - Regional Meeting Finder: Integrate location-based AA meeting search
 - Human Feedback Loop: Collect user ratings to improve responses
 - Fine-tuned Model: Train a specialized model on AA literature
 
While this is a personal project for the LLM Zoomcamp, feedback and suggestions are welcome! Please open an issue or reach out directly.
Marcelo Nieva
Final Project for DataTalksClub LLM Zoomcamp
- Email: marcelonieva7@gmail.com
- LinkedIn: linkedin.com/in/marcelo-nieva
This project is released under the MIT License
This chatbot is not affiliated with or endorsed by Alcoholics Anonymous.
This is an educational project built to demonstrate RAG systems and improve access to publicly available AA information. It should never replace professional medical advice, therapy, or in-person AA meetings.
If you or someone you know is struggling with alcohol use:
- Visit the official AA website: aa.org
- Contact a local AA chapter
- Seek professional medical help
In case of emergency, call your local emergency services immediately.
- DataTalksClub for the excellent LLM Zoomcamp course
 - Alcoholics Anonymous for their invaluable resources and decades of helping people
 - NVIDIA for providing accessible LLM inference via NIM
 - Qdrant for their powerful vector search engine
 - The open-source community for tools like FastAPI, FastEmbed, and uv
 
Built with ❤️ to help people find information about recovery
