# Tutorial: Simplifying Agent Development with MLflow & CrewAI

**Goal:** This tutorial demonstrates how Managed MLflow transforms the development of multi-agent systems (using CrewAI as an example) from a potentially opaque and difficult-to-debug process into a structured, observable, and iterative engineering workflow.

**Scenario:** We will use the "AI Research" agentic systen. This system researches new AI/ML tools. While useful, developing such agents presents common challenges:

*   **Black Box Execution:** What *exactly* did the agent do step-by-step? Why did it choose *that* tool?
*   **Comparing Changes:** How do we reliably compare results if we tweak prompts or agent configurations?
*   **Debugging Failures:** Why did the agent fail? What was it trying to do right before the error?
*   **Performance:** Is the agent getting slower? Where are the bottlenecks?
*   **Quality Evaluation:** Is changing the LLM or prompts actually improving the *quality* of the output?
*   **Cost/Speed Optimization:** How can we track and reduce token usage or latency?

We will tackle some problems by incrementally integrating MLflow tracking features into our CrewAI development process.

## Getting Started

1. Launch your instance of the Managed Service for MLflow with [the MLflow quickstart](https://docs.nebius.com/mlflow/quickstart).
2. Set up your API key to connect to Nebius AI Studio with [the AI Studio quickstart](https://docs.nebius.com/studio/inference/quickstart).

> **Note:** Launching an MLflow cluster for the first time may take 15-30 minutes to be fully provisioned and ready to use.

In [1]:
!pip install mlflow==2.21.2 python-dotenv openai crewai==0.114.0 crewai-tools duckduckgo-search



### Secrets and Environment Variables

Set the following environment variables where this notebook is running, so that the code in the following cells can connect to both Nebius Managed Service for MLflow and Nebius AI Studio. 

MLflow:<br>
`MLFLOW_TRACKING_SERVER_CERT_PATH`<br>
`MLFLOW_TRACKING_URI`<br>
`MLFLOW_TRACKING_USERNAME`<br>
`MLFLOW_TRACKING_PASSWORD`<br>

AI Studio:<br>
`NEBIUS_API_KEY`

### Environment Setup

To set the environment variables, run the following cell. You may choose to set them interactively or by loading from a `.env` file.

In [2]:
%reload_ext autoreload
%autoreload 2

import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add the parent directory to Python path
sys.path.append(str(Path.cwd().parent))

from env_setup import setup_env_from_file, setup_env_interactive, verify_env_setup

# Option 1: Interactive setup
# setup_env_interactive()

# Option 2: Load from .env file
setup_env_from_file('../.env')

# Verify the setup
verify_env_setup()

Load ENV variables from file: ../.env
All required environment variables are set.


True

### Check connection to MLflow


In [3]:
import mlflow 

# List experiments in MLflow
mlflow.search_experiments()

[<Experiment: artifact_location='mlflow-artifacts:/9', creation_time=1744833056702, experiment_id='9', last_update_time=1744833056702, lifecycle_stage='active', name='Step 4 - Evaluate Agent', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/8', creation_time=1744828328349, experiment_id='8', last_update_time=1744828328349, lifecycle_stage='active', name='Step 4 - Update prompts', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/7', creation_time=1744747169730, experiment_id='7', last_update_time=1744747169730, lifecycle_stage='active', name='Step 3 - Metrics', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/6', creation_time=1744745999110, experiment_id='6', last_update_time=1744745999110, lifecycle_stage='active', name='Step 2 - Autolog', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1744712739337, experiment_id='1', last_update_time=1744712739337, lifecycle_stage='active', name='CrewAI Agent Debugging Demo', tags=

# Step 1: Quick implementation of an AI Agent with CrewAI

In [5]:
import os
import io
import time
import json
import traceback
import contextlib
from textwrap import dedent
from crewai import Agent, Crew, Task, Process
from crewai import __version__ as crewai_version
from crewai_tools import WebsiteSearchTool
from langchain_openai import ChatOpenAI
import mlflow

print(f"MLflow Version: {mlflow.__version__}")
print(f"CrewAI Version: {crewai_version}")

/Users/mikhailrozhkov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pydantic/_internal/_config.py:291: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.8/migration/
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
  warn(
/Users/mikhailrozhkov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/crewai_tools/tools/scrapegraph_scrape_tool/scrapegraph_scrape_tool.py:34: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.8/migration/
  @validator("website_url")
/Users/mikhailrozhkov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/crewai_tool

MLflow Version: 2.21.2
CrewAI Version: 0.114.0


### Define Tools

In [6]:
# To search across any discovered websites

search_tool = WebsiteSearchTool()

### Define an Agent

In [11]:
class ResearchAgents:
    def researcher_agent(self):
        """Agent responsible for discovering relevant AI tools for specific tasks."""
        return Agent(
            role="Senior Data Researcher",
            goal="Discover and evaluate relevant AI tools (libraries, frameworks) for specific tasks",
            backstory="An expert in AI development and MLOps",
            tools=[search_tool],
            verbose=True,
            max_iter=3
        )

    def analyst_agent(self):
        """Agent responsible for analyzing tools and creating detailed reports."""
        return Agent(
            role="Reporting Analyst",
            goal="Create detailed reports based on data analysis and research findings",
            backstory="""You are a technical expert specializing in AI technologies evaluation. 
    You have a deep understanding of AI libraries and frameworks, and can 
    quickly assess their technical merits""",
            tools=[search_tool],
            verbose=True,
            max_iter=3 # Maximum iterations before the agent must provide its best answer. Default is 20.
        )

### Define Tasks

In [12]:
class AIOpsResearchTasks:
    def search_tools_task(self, agent, task, ai_stack):
        """Task to discover relevant tools for a specific task considering the existing AI stack."""
        return Task(
            description=dedent(f"""
                You are a research agent tasked with finding AI tools for: {{task}}.
                Consider compatibility with: {{ai_stack}}.
                
                For each tool, identify:
                - Name and URL
                - Primary use case
                - Brief description (2-3 sentences)
                
                Format your response as a JSON list of objects.
            """),
            agent=agent,
            expected_output="A JSON list containing details of 3-5 relevant AI tools with their names, URLs, use cases, and descriptions.",
            output_file="output/tool_candidates.json"
        )

    def analyze_tools_task(self, agent, task, ai_stack):
        """Task to perform in-depth analysis of discovered tools."""
        return Task(
            description=dedent(f"""
                Read the tool list from the previous task's result and perform a detailed analysis of each tool.
                
                For each tool:
                1. Research its capabilities, limitations, community adoption, documentation quality
                2. Evaluate how well it addresses the specified task: '{task}'
                3. Consider its compatibility with '{ai_stack}'
                4. Identify pros and cons
                
                Your output should be a JSON list of these detailed analysis objects.
            """),
            agent=agent,
            expected_output="A JSON list containing detailed analysis of each tool with comprehensive information about features, pros, cons, and recommendation scores.",
            output_file="output/tool_analysis.json",
        )

    def create_report_task(self, agent, task, ai_stack):
        """Task to create a comprehensive report with recommendations."""
        return Task(
            description=dedent(f"""
                Read the analysis from the previous task's result and create a comprehensive Markdown report.
                
                The report should include:
                
                1. An introduction explaining the task ('{task}') and existing stack ('{ai_stack}')
                2. For each tool, create a section with:
                   - Tool name and URL as a heading
                   - Description
                   - Features (as bullet points)
                   - Pros (as bullet points)
                   - Cons (as bullet points)
                   - Integration complexity
                   - Recommendation score with justification
                3. A summary/conclusion comparing the tools and providing final recommendations
                
                Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
                Sort tools by recommendation score (descending).
                
                Your output should be a complete, well-formatted Markdown document.
            """),
            agent=agent,
            expected_output="A comprehensive Markdown report analyzing each tool with recommendations, properly formatted with headings, bullet points, and clear sections.",
            output_file="output/tool_recommendation_report.md",
        )

### Design a Crew 

A crew in crewAI is a collaborative group of agents working together to complete tasks. Crews define:

- Task execution strategy
- Agent collaboration methods
- Overall workflow coordination
- Communication patterns between agents
- Task delegation and sequencing

In [21]:
class AIOpsResearchCrew:
    def __init__(self, task, ai_stack):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
        """
        self.task = task
        self.ai_stack = ai_stack
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)

    def run(self):
        """Execute the research, analysis, and reporting process."""
        # Initialize agents
        agents = AIOpsResearchAgents()
        researcher = agents.researcher_agent()
        analyst = agents.analyst_agent()

        # Initialize tasks
        tasks = AIOpsResearchTasks()
        search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
        analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
        report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
        
        # Create the crew
        crew = Crew(
            agents=[researcher, analyst],
            tasks=[search_task, analyze_task, report_task],
            verbose=True,
            process=Process.sequential,
            memory=True
        )
        
        result = crew.kickoff()
        return result


### Test Run (Optional)

In [22]:
# Let's run it once without MLflow to see the typical verbose console output.

task_description = "Develop a conversational RAG system that can answer questions based on a large PDF document collection"
existing_stack = "LangChain, PogreSQL, FastAPI"

ai_dev_crew = AIOpsResearchCrew(task_description, existing_stack)
result = ai_dev_crew.run()

[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Task:[00m [92m
You are a research agent tasked with finding AI tools for: {task}.
Consider compatibility with: {ai_stack}.

For each tool, identify:
- Name and URL
- Primary use case
- Brief description (2-3 sentences)

Format your response as a JSON list of objects.
[00m




[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Thought:[00m [92mThought: I need to find relevant AI tools for developing a conversational RAG system that is compatible with LangChain, PostgreSQL, and FastAPI. I will search for AI tools that fit these criteria and provide detailed information including their names, URLs, use cases, and descriptions.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"AI tools for developing conversational Retrieval-Augmented Generation system compatible with LangChain, PostgreSQL, and FastAPI\", \"website\": \"https://www.tensorflow.org/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
Visionary in Gartner Magic Quadrant: Data Science and Machine Learning 2024 Listed in Gartner Magic Quadrant (Cloud AI Developer Services) in 2020, 2021, 2022, 2023, and 2024 Listed in Gartner Magic Quadrant (Data Science and Machine Learning) in 2018, 2019, 2020, 2021



[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Final Answer:[00m [92m
[
    {
        "name": "H2O.ai",
        "url": "https://www.h2o.ai/",
        "primary_use_case": "Automated Machine Learning and AI Model Deployment",
        "description": "H2O.ai converges Predictive AI with Generative AI, providing tools for various use cases such as fraud detection and recommendation engines. It simplifies the building and deployment of scalable, efficient AI applications with a strong focus on natural language use cases, making it ideal for conversational systems."
    },
    {
        "name": "LangChain",
        "url": "https://langchain.com/",
        "primary_use_case": "Framework for LLM applications",
        "description": "LangChain is a composable framework designed to build applications powered by language models (LLMs). It facilitates the creation of context-aware applications and integrates seamlessly with various data sources and APIs, making it a perfec

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the tool list from the previous task's result and perform a detailed analysis of each tool.

For each tool:
1. Research its capabilities, limitations, community adoption, documentation quality
2. Evaluate how well it addresses the specified task: 'Develop a conversational RAG system that can answer questions based on a large PDF document collection'
3. Consider its compatibility with 'LangChain, PogreSQL, FastAPI'
4. Identify pros and cons

Your output should be a JSON list of these detailed analysis objects.
[00m




[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Thought:[00m [92mThought: I need to gather detailed information on each AI tool mentioned to analyze their suitability for developing a conversational RAG system.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"capabilities limitations community adoption documentation quality\", \"website\": \"https://www.h2o.ai/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
images. Intelligent document comparison with ProcurementGPT ProcurementGPT streamlines document analysis by automatically parsing documents into sections or clauses, then identifying the most similar counterparts across documents—working symmetrically in both directions. It further breaks down clauses into key requirements, determining where each one is fulfilled and highlighting met and unmet requirements for easy review. The system leverages the top 10 most similar requirements 



[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Final Answer:[00m [92m
```json
[
    {
        "name": "H2O.ai",
        "capabilities": "Automated Machine Learning and AI Model Deployment, designed for efficiency with a focus on rapid model deployment across various environments.",
        "limitations": "Less control over model internals and limited customization options.",
        "community_adoption": "Gaining traction in business implementations, especially for quick deployment scenarios.",
        "documentation_quality": "Well-documented, providing clear guidelines for users but limited in depth for advanced customization.",
        "evaluation_for_task": "H2O.ai can assist in the deployment of conversational RAG systems; however, its flexibility may be less than desirable for complex query handling involved in language models.",
        "compatibility": {
            "LangChain": "Moderate: Possible but requires additional integration efforts.",
            "

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task ('Develop a conversational RAG system that can answer questions based on a large PDF document collection') and existing stack ('LangChain, PogreSQL, FastAPI')
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
[00m


Inserting batches in chromadb: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.93it/s]




[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"Develop a conversational RAG system that can answer questions based on a large PDF document collection LangChain, PogreSQL, FastAPI\", \"website\": \"https://langchain.com/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
LangChain Sign up to be the first to access recordings from Interrupt, The AI Agent Conference!Learn MoreFrom startups to global enterprises, ambitious builders choose LangChain products.BuildLangChain is a composable framework to build with LLMs. LangGraph is the orchestration framework for controllable agentic workflows.RunDeploy your LLM applications at scale with LangGraph Platform, our infrastructure purpose-built for agents.ManageLangSmith is a unified agent observability and evals platform to optimize the performance of your AI agents - whether they're built with a LangChain frame



[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Final Answer:[00m [92m
# Comprehensive Markdown Report: Analysis of Tools for Developing a Conversational RAG System

## Introduction
The aim of this task is to **develop a conversational Retrieval-Augmented Generation (RAG) system** that can efficiently answer questions based on a large collection of PDF documents. The existing technology stack includes **LangChain**, **PostgreSQL**, and **FastAPI**. This report evaluates various AI libraries and frameworks, focusing on their suitability for this specific use case.

## Tool Analysis

### 1. [LangChain](https://langchain.com/)
**Description:** LangChain is a composable framework designed to build applications powered by language models (LLMs). It facilitates integration with various data sources and APIs, making it a perfect fit for building conversational RAG systems.

- **Features:**
  - Modular architecture
  - Strong focus on LLM integration
  - Extensive API suppor

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


In [23]:
print("\n=== Final Result (Base Run) ===")
print(result)


=== Final Result (Base Run) ===
# Comprehensive Markdown Report: Analysis of Tools for Developing a Conversational RAG System

## Introduction
The aim of this task is to **develop a conversational Retrieval-Augmented Generation (RAG) system** that can efficiently answer questions based on a large collection of PDF documents. The existing technology stack includes **LangChain**, **PostgreSQL**, and **FastAPI**. This report evaluates various AI libraries and frameworks, focusing on their suitability for this specific use case.

## Tool Analysis

### 1. [LangChain](https://langchain.com/)
**Description:** LangChain is a composable framework designed to build applications powered by language models (LLMs). It facilitates integration with various data sources and APIs, making it a perfect fit for building conversational RAG systems.

- **Features:**
  - Modular architecture
  - Strong focus on LLM integration
  - Extensive API support

- **Pros:**
  - Highly modular and composable
  - Stro

### Observe 

- Logs are great! But it's a lot of them! 
- Difficult to understand the process: How was the report generated? What internal steps were taken?
- No cost tracking: How many LLM tokens were used? What was the cost?
- Debugging Difficulty: Errors are hard to trace without visibility.
- Subjective Quality: Is the output report actually good? Accurate? Relevant? No way to know objectively.

# Step 2: Aautolog Agent execution and LLM calls with MLflow

**Problem:** 
- The console output scrolls by. How can we reliably capture *exactly* what the agent did, including its reasoning steps and tool usage, for later inspection or debugging?

**MLflow Solution:** 
- Turn on auto tracing by calling mlflow.crewai.autolog()


In [24]:
# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 2 - Autolog")

ai_dev_crew = AIOpsResearchCrew(task_description, existing_stack)
result = ai_dev_crew.run()

[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Task:[00m [92m
You are a research agent tasked with finding AI tools for: {task}.
Consider compatibility with: {ai_stack}.

For each tool, identify:
- Name and URL
- Primary use case
- Brief description (2-3 sentences)

Format your response as a JSON list of objects.
[00m




[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Thought:[00m [92mI need to search for relevant AI tools that can be utilized for specific tasks compatible with the given AI stack. This requires identifying suitable tools, their primary use cases, and providing brief descriptions.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"AI tools for developing conversational RAG systems\", \"website\": \"https://www.h2o.ai/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
H2O.ai | Convergence of The World's Best Predictive & Generative AI The World's Best Deep Research Chat Now Chat on Mobile Learn More Companies powered by H2O.ai H2O’s flagship SLMs H2OVL Mississippi SVLM Series Our newest economical multimodal OCR model developed for Document AI H2OVL Mississippi-2B, based on H2O Danube2, is trained on 17.3M conversation pairs for high-res image handling. The .8B model, built on Danube3,



[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Final Answer:[00m [92m
```json
[
    {
        "name": "H2O.ai",
        "url": "https://www.h2o.ai/",
        "primary_use_case": "Automated Machine Learning and Generative AI",
        "description": "H2O.ai is a platform that combines predictive and generative AI capabilities. It provides tools for various use cases including fraud detection, recommendation engines, and document AI capabilities, making it beneficial for developing conversational systems."
    },
    {
        "name": "LangChain",
        "url": "https://langchain.com/",
        "primary_use_case": "Framework for LLM applications",
        "description": "LangChain is a modular framework designed for building applications powered by language models (LLMs). It enables seamless integration with various data sources and APIs, optimizing the creation of context-aware conversational systems."
    },
    {
        "name": "TensorFlow",
        "url": "

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the tool list from the previous task's result and perform a detailed analysis of each tool.

For each tool:
1. Research its capabilities, limitations, community adoption, documentation quality
2. Evaluate how well it addresses the specified task: 'Develop a conversational RAG system that can answer questions based on a large PDF document collection'
3. Consider its compatibility with 'LangChain, PogreSQL, FastAPI'
4. Identify pros and cons

Your output should be a JSON list of these detailed analysis objects.
[00m




[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Thought:[00m [92mThought: I need to research the capabilities, limitations, community adoption, and documentation quality of each tool indicated in the context provided to analyze their suitability for building a conversational RAG system.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"H2O.ai capabilities, limitations, community adoption, documentation quality\", \"website\": \"https://www.h2o.ai/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
H2O.ai | Convergence of The World's Best Predictive & Generative AI The World's Best Deep Research Chat Now Chat on Mobile Learn More Companies powered by H2O.ai H2O’s flagship SLMs H2OVL Mississippi SVLM Series Our newest economical multimodal OCR model developed for Document AI H2OVL Mississippi-2B, based on H2O Danube2, is trained on 17.3M conversation pairs for high-res image handling. The .



[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Thought:[00m [92mThought: I need to continue analyzing the remaining tools: LangChain, TensorFlow, PyTorch, and Keras for their capabilities, limitations, community adoption, and documentation quality.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"LangChain capabilities, limitations, community adoption, documentation quality\", \"website\": \"https://langchain.com/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
development and shipping experience. We couldn’t have achieved the product experience delivered to our customers without LangChain, and we couldn’t have done it at the same pace without LangSmith.”James SpiteriDirector of Security Products“As soon as we heard about LangSmith, we moved our entire development stack onto it. We could have built evaluation, testing and monitoring tools in house, but with LangSmith it took us 10x l



[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Thought:[00m [92mThought: I need to gather similar information for TensorFlow, PyTorch, and Keras to complete the detailed analysis required for each tool.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"TensorFlow capabilities, limitations, community adoption, documentation quality\", \"website\": \"https://www.tensorflow.org/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
announcements from the TensorFlow team and community. Explore the ecosystem Discover production-tested tools to accelerate modeling, deployment, and other workflows. Library TensorFlow.js Train and run models directly in the browser using JavaScript or Node.js. Library LiteRT Deploy ML on mobile and edge devices such as Android, iOS, Raspberry Pi, and Edge TPU. API tf.data Preprocess data and create input pipelines for ML models. Library TFX Create production ML pip



[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Final Answer:[00m [92m
[
    {
        "name": "H2O.ai",
        "capabilities": "Combines predictive and generative AI, strong in document AI and automation for machine learning tasks.",
        "limitations": "Limited control and flexibility; higher costs associated with extensive use in complex deployments.",
        "community_adoption": "Growing user base, recognized for robust applications in enterprise AI.",
        "documentation_quality": "Comprehensive and user-friendly documentation available.",
        "suitability_for_RAG": "Moderate; can assist with document querying but struggles with complex conversational contexts.",
        "compatibility": {
            "LangChain": "Moderate compatibility.",
            "PostgreSQL": "Low compatibility.",
            "FastAPI": "Low compatibility."
        },
        "pros": [
            "User-friendly interface.",
            "Fast model deployment capabilities.",


  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task ('Develop a conversational RAG system that can answer questions based on a large PDF document collection') and existing stack ('LangChain, PogreSQL, FastAPI')
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
[00m




[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"Develop a conversational RAG system that can answer questions based on a large PDF document collection\", \"website\": \"https://langchain.com/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
H2O.ai | Convergence of The World's Best Predictive & Generative AI The World's Best Deep Research Chat Now Chat on Mobile Learn More Companies powered by H2O.ai H2O’s flagship SLMs H2OVL Mississippi SVLM Series Our newest economical multimodal OCR model developed for Document AI H2OVL Mississippi-2B, based on H2O Danube2, is trained on 17.3M conversation pairs for high-res image handling. The .8B model, built on Danube3, leads OCR benchmarks with 19M pairs, outperforming all SLMs in text recognition. For Developers  1 Text Recognition H2O Danube SLM Series Our most economical small model for fast, lightweight tasks



[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Final Answer:[00m [92m
# Conversational RAG System Analysis Report

## Introduction
The task focuses on developing a **Conversational Retrieval-Augmented Generation (RAG)** system capable of answering questions based on a large collection of PDF documents. The existing technology stack includes **LangChain**, **PostgreSQL**, and **FastAPI**. This report evaluates each of the tools in the context of this application.

## Tools Analysis

### 1. LangChain
**[LangChain](https://langchain.com/)**
- **Description**: A modular framework designed for building applications powered by language models (LLMs). LangChain enables seamless integration with various data sources, optimizing the creation of context-aware conversational systems.
- **Features**:
  - Modular architecture promoting flexibility.
  - Easy integration with different APIs and data sources.
  - Rich set of tools for LLMs.
- **Pros**:
  - Highly modular and flexib

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


### Observe 

- Enable auto-tracing for CrewAI in 1 line of code
- Optionally create an MLflow experiment to write traces to

# Step 3: Tracking metrics and artifacts

### Modify Crew Class for MLflow Run & Log Capture

In [33]:
import os
import json
import time
import logging
import io
import sys
import contextlib


class AIOpsResearchCrew:
    def __init__(self, task, ai_stack):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
        """
        self.task = task
        self.ai_stack = ai_stack
        self.run_id = None
        self.crew = None
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)

    def run(self):
        """Execute the research, analysis, and reporting process with MLflow tracking."""
        
        # --- UPDATE: Start MLflow run ---
        with mlflow.start_run(run_name=f"Tool_Research_{int(time.time())}") as mlflow_run:
            self.run_id = mlflow_run.info.run_id
            

            logger.info("Initializing agents...")
            agents = AIOpsResearchAgents()
            researcher = agents.researcher_agent()
            analyst = agents.analyst_agent()
            
            # Initialize tasks
            tasks = AIOpsResearchTasks()
            search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
            analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
            report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
            
            # Create the crew
            self.crew = Crew(
                agents=[researcher, analyst],
                tasks=[search_task, analyze_task, report_task],
                verbose=False,
                process=Process.sequential,
                memory=True
            )
            
            # Start the crew
            result = self.crew.kickoff()

            
            # --- UPDATE: Log parameters ---
            mlflow.log_param("task", self.task)
            mlflow.log_param("ai_stack", self.ai_stack)
            
            # --- UPDATE: Log metrics ---
            mlflow.log_metrics(json.loads(self.crew.usage_metrics.json()))

            # --- UPDATE: Log artifacts ---
            artifact_files = [
                "output/tool_candidates.json",
                "output/tool_analysis.json",
                "output/tool_recommendation_report.md",
            ]
            for file_path in artifact_files:
                if os.path.exists("output/tool_candidates.json"):
                    mlflow.log_artifact(file_path)

            # --- UPDATE: Set success tag ---
            if os.path.exists("output/tool_recommendation_report.md"):
                mlflow.set_tag("status", "SUCCESS")
            else:
                mlflow.set_tag("status", "FAILED")
            
            return result
       

### Run and Observe in MLflow UI

In [34]:
# Let's run it once without MLflow to see the typical verbose console output.

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 3 - Metrics")

ai_dev_crew = AIOpsResearchCrew(task_description, existing_stack)
result = ai_dev_crew.run()

[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Task:[00m [92m
You are a research agent tasked with finding AI tools for: {task}.
Consider compatibility with: {ai_stack}.

For each tool, identify:
- Name and URL
- Primary use case
- Brief description (2-3 sentences)

Format your response as a JSON list of objects.
[00m


Inserting batches in chromadb: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.83it/s]




[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Thought:[00m [92mI need to gather information about relevant AI tools for developing a conversational RAG (Retrieval-Augmented Generation) system while considering specific criteria. I will search for tools on the internet that match the use case and provide concise descriptions.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"tools for conversational RAG system\", \"website\": \"https://www.h2o.ai\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
LangChain Sign up to be the first to access recordings from Interrupt, The AI Agent Conference!Learn MoreFrom startups to global enterprises, ambitious builders choose LangChain products.BuildLangChain is a composable framework to build with LLMs. LangGraph is the orchestration framework for controllable agentic workflows.RunDeploy your LLM applications at scale with LangGraph Platform, our

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the tool list from the previous task's result and perform a detailed analysis of each tool.

For each tool:
1. Research its capabilities, limitations, community adoption, documentation quality
2. Evaluate how well it addresses the specified task: 'Develop a conversational RAG system that can answer questions based on a large PDF document collection'
3. Consider its compatibility with 'LangChain, PogreSQL, FastAPI'
4. Identify pros and cons

Your output should be a JSON list of these detailed analysis objects.
[00m


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"H2O.ai capabilities limitations community adoption documentation quality\", \"website\": \"https://www.h2o.ai/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
H2O.ai | Convergence of The World's Best Predictiv

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task ('Develop a conversational RAG system that can answer questions based on a large PDF document collection') and existing stack ('LangChain, PogreSQL, FastAPI')
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
[00m


[1m[95m# Agent:[00m [1m[92mReporting

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:
/Users/mikhailrozhkov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pydantic/main.py:1111: PydanticDeprecatedSince20: The `json` method is deprecated; use `model_dump_json` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.8/migration/


🏃 View run Tool_Research_1744870392 at: https://public-tracking.mlflow-e00rfd6s4w3cdtk9fx.backbone-e00rt16p71ew33dg9e.msp.eu-north1.nebius.cloud/#/experiments/7/runs/6221c1fa187e4f3a96325428f9aaf91a
🧪 View experiment at: https://public-tracking.mlflow-e00rfd6s4w3cdtk9fx.backbone-e00rt16p71ew33dg9e.msp.eu-north1.nebius.cloud/#/experiments/7


### Crew Usage Metrics

In [35]:
# Show the crew usage metrics 

ai_dev_crew.crew.usage_metrics

UsageMetrics(total_tokens=46689, prompt_tokens=42767, cached_prompt_tokens=23424, completion_tokens=3922, successful_requests=9)

**Observation:** 

Now, every run leaves a persistent, detailed trace artifact in MLflow. No more "What just happened?" guesswork.

# Step 4: Evaluating Agent Output Quality

**Problem:** How to automatically evaluate the quality of the final result without manually comparing LLMs or prompts.

**MLflow Solution:**
- Log agent configurations and outputs with MLflow parameters
- Define specific quality criteria for assessment reports
- Use LLM-as-Judge evaluation by having a powerful model score outputs against criteria
- Log evaluation metrics (`accuracy_score`, `completeness_score`, etc.) to MLflow
- Compare runs in MLflow UI to identify highest quality configurations

### Set up Nebius AI Studio client to "evaluator" LLM 

In [41]:
import openai

API_KEY = os.environ.get("NEBIUS_API_KEY")

# Instantiate the client instance
nebius_client = openai.OpenAI(
    api_key=API_KEY,
    base_url="https://api.studio.nebius.ai/v1/",
)


### Simplify tasks 

In [64]:
from crewai_tools import FileReadTool

class AIOpsResearchTasks:
    def search_tools_task(self, agent, task, ai_stack):
        """Task to discover relevant tools for a specific task considering the existing AI stack."""
        return Task(
            description=f"""
                Find some AI tools for {task}.
                Format however you think is best.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Info about AI tools",
            output_file="output/tool_candidates.json",
            cache=False
        )

    def analyze_tools_task(self, agent, task, ai_stack):
        """Task to perform in-depth analysis of discovered tools."""
        return Task(
            description=f"""
                Check out the tools from before.
                
                Tell me what you think about each one.
                Is it good? Does it work with {ai_stack}?
                
                Make it JSON I guess.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Analysis of tools",
            output_file="output/tool_analysis.json",
            cache=False,
        )

    def create_report_task(self, agent, task, ai_stack):
        """Task to create a comprehensive report with recommendations."""
        return Task(
            description=f"""
                Make a report about the tools.
                Use markdown formatting.
            """,
            agent=agent,
            tools=[FileReadTool(file_path='output/tool_analysis.json')],
            expected_output="A report",
            output_file="output/tool_recommendation_report.md",
            cache=False,
        )

### Define Evaluation function

In [85]:
import json
import logging
import os
from textwrap import dedent

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def evaluate_report(
    report_path: str, 
    nebius_client,
    model: str = "meta-llama/Llama-3.3-70B-Instruct"
) -> dict:
    """
    Evaluate a technical report using Nebius AI.
    
    Args:
        report_path: Path to the markdown report file
        nebius_client: Initialized Nebius API client
        model: Model name to use for evaluation
        
    Returns:
        Dictionary containing evaluation scores and reasoning
    """
    # Read the report content
    try:
        with open(report_path, 'r') as f:
            report_content = f.read()
    except Exception as e:
        logger.error(f"Error reading report file: {str(e)}")
        return {
            "completeness_score": 0,
            "relevance_score": 0,
            "overall_quality_score": 0,
            "reasoning": f"Error reading report: {str(e)}"
        }
    
    # Create evaluation prompt
    evaluation_prompt = dedent(f"""
        Evaluate the quality of the following technical report.
        
        Your job is to score the report on the following criteria (scale 1-10):
        
        - COMPLETENESS: Does the report include all required sections for each tool?
          1-3: Missing multiple required sections
          4-6: Has basic information but lacks detail
          7-8: Contains most required sections with good detail
          9-10: Complete with introduction, tool sections (name, URL, description, features, pros, cons, integration complexity, recommendation score), and conclusion
        
        - RELEVANCE: How directly applicable are the recommended tools to the specific task and tech stack?
          1-3: General tools with no specific RAG functionality (like basic logging libraries)
          4-6: Tools with potential use cases (like numpy for data processing)
          7-8: Related tools with partial functionality (like fastai - useful but no specific RAG features)
          9-10: Directly applicable tools (like LangGraph or CrewAI - frameworks specifically for RAG systems)
        
        - OVERALL QUALITY: The overall quality of the report considering all factors.
        
        Here is the report to evaluate:
        
        ---BEGIN REPORT---
        {report_content}
        ---END REPORT---
        
        Provide your evaluation as a JSON object with the following format:
            "completeness_score": X,
            "relevance_score": X,
            "overall_quality_score": X,
            "reasoning": "Detailed explanation of your evaluation and scores..."
        }}
        
        Return ONLY the JSON object with no additional text.
    """)
    
    # Call Nebius AI Studio for evaluation
    try:
        response = nebius_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an expert evaluator of technical reports."},
                {"role": "user", "content": evaluation_prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
        
        # Parse response
        result = json.loads(response.choices[0].message.content)
        
        # Ensure all scores are present
        required_scores = ["completeness_score", "relevance_score", "overall_quality_score"]
        for score in required_scores:
            if score not in result:
                result[score] = 0
        
        # Ensure reasoning is present
        if "reasoning" not in result:
            result["reasoning"] = "No reasoning provided."
            
        return result
        
    except Exception as e:
        logger.error(f"Error during evaluation: {str(e)}")
        return {
            "completeness_score": 0,
            "relevance_score": 0,
            "overall_quality_score": 0,
            "reasoning": f"Evaluation error: {str(e)}"
        }

In [86]:
evaluation_results = evaluate_report(
                    report_path='output/tool_recommendation_report.md',
                    nebius_client=nebius_client,
                    model="meta-llama/Llama-3.3-70B-Instruct"
                )

evaluation_results

{'completeness_score': 10,
 'relevance_score': 9,
 'overall_quality_score': 9,
 'reasoning': 'The report is complete with an introduction, tool sections including name, URL, description, features, pros, cons, integration complexity, and recommendation score, and a conclusion. The tools recommended are directly applicable to the specific task of developing a conversational RAG system, with LangChain and H2O.ai being particularly well-suited. The report provides a detailed analysis of each tool, including their strengths and weaknesses, making it highly relevant. The overall quality of the report is excellent, with clear and concise writing, and a well-structured format. The only area for improvement could be in providing more specific examples or case studies to illustrate the effectiveness of each tool, but overall, the report is comprehensive and well-written.'}

### Modify AIOpsResearchCrew

In [87]:
class AIOpsResearchCrewEvaluated:
    def __init__(self, task, ai_stack, nebius_client=None, evaluation_model="meta-llama/Llama-3.3-70B-Instruct"):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
            nebius_client: Initialized Nebius API client for evaluation
            evaluation_model (str): Model to use for evaluation
        """
        self.task = task
        self.ai_stack = ai_stack
        self.run_id = None
        self.execution_log = io.StringIO()
        self.crew = None
        self.nebius_client = nebius_client
        self.evaluation_model = evaluation_model
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)
        
        # Initialize Nebius client if not provided
        if self.nebius_client is None and os.environ.get("NEBIUS_API_KEY"):
            self.nebius_client = openai.OpenAI(
                api_key=os.environ.get("NEBIUS_API_KEY"),
                base_url="https://api.studio.nebius.ai/v1/"
            )

    def _clean_output_files(self):
        """Remove all previously generated output files if they exist."""
        files_to_remove = [
            "output/tool_candidates.json",
            "output/tool_analysis.json",
            "output/tool_recommendation_report.md",
            "output/report_evaluation.json"
        ]
        
        for file_path in files_to_remove:
            if os.path.exists(file_path):
                try:
                    os.remove(file_path)
                    logger.info(f"Removed existing file: {file_path}")
                except Exception as e:
                    logger.warning(f"Failed to remove file {file_path}: {str(e)}")

    def run(self, evaluate=True):
        """
        Execute the research, analysis, and reporting process with MLflow tracking.
        
        Args:
            evaluate (bool): Whether to evaluate the report after generation
        
        Returns:
            The result from the crew execution
        """
        # Clean any existing output files
        self._clean_output_files()
        
        # --- Start MLflow run ---
        with mlflow.start_run(run_name=f"Tool_Research_{int(time.time())}") as mlflow_run:
            self.run_id = mlflow_run.info.run_id

            # Initialize agents
            agents = AIOpsResearchAgents()
            researcher = agents.researcher_agent()
            analyst = agents.analyst_agent()
            
            # Initialize tasks
            tasks = AIOpsResearchTasks()
            search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
            analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
            report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
            
            # Create the crew
            self.crew = Crew(
                agents=[researcher, analyst],
                tasks=[search_task, analyze_task, report_task],
                verbose=False,
                process=Process.sequential,
                memory=True,
                cache=False
            )
            
            # Start the crew
            logger.info("Starting crew execution...")
            result = self.crew.kickoff()
                        
            # --- Log parameters ---
            mlflow.log_param("task", self.task)
            mlflow.log_param("ai_stack", self.ai_stack)
            mlflow.log_param("evaluation_model", self.evaluation_model)

            # --- Log metrics ---
            mlflow.log_metrics(json.loads(self.crew.usage_metrics.json()))

            # --- Log artifacts ---
            artifact_files = [
                "output/tool_candidates.json",
                "output/tool_analysis.json",
                "output/tool_recommendation_report.md",
            ]
            for file_path in artifact_files:
                if os.path.exists(file_path):
                    mlflow.log_artifact(file_path)

            # --- UPDATE: Set success tag ---
            if os.path.exists("output/tool_recommendation_report.md"):
                mlflow.set_tag("status", "SUCCESS")
            else:
                mlflow.set_tag("status", "FAILED")
            
            # --- Set success tag ---
            report_path = "output/tool_recommendation_report.md"
            if os.path.exists(report_path):
                mlflow.set_tag("status", "SUCCESS")
                
            # --- Evaluate the report if requested and client available ---
            if evaluate and self.nebius_client:
                evaluation_results = evaluate_report(
                    report_path=report_path,
                    nebius_client=self.nebius_client,
                    model=self.evaluation_model
                )
                
                # Log evaluation results to MLflow
                with open("output/report_evaluation.json", "w") as f:
                    json.dump(evaluation_results, f, indent=2)
                mlflow.log_artifact("output/report_evaluation.json")
                
                # Extract and log only numeric scores
                numeric_scores = {}
                for key, value in evaluation_results.items():
                    if key.endswith('_score') and isinstance(value, (int, float)):
                        numeric_scores[key] = value
                
                if numeric_scores:
                    mlflow.log_metrics(numeric_scores)
                
                return result


### Run with Agent evaluation

In [88]:
# Let's run it once without MLflow to see the typical verbose console output.

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 4 - Evaluate Agent")

ai_dev_crew = AIOpsResearchCrewEvaluated(
    task_description, 
    existing_stack,  
    nebius_client=nebius_client, 
    evaluation_model="meta-llama/Llama-3.3-70B-Instruct")

result = ai_dev_crew.run()

[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Task:[00m [92m
                Find some AI tools for Develop a conversational RAG system that can answer questions based on a large PDF document collection.
                Format however you think is best.
            [00m
[91m 

I encountered an error while trying to use the tool. This was the error: 403 Client Error: Forbidden for url: https://towardsdatascience.com/.
 Tool Search in a specific website accepts these inputs: Tool Name: Search in a specific website
Tool Arguments: {'search_query': {'description': 'Mandatory search query you want to use to search a specific website', 'type': 'str'}, 'website': {'description': 'Mandatory valid website URL you want to search on', 'type': 'str'}}
Tool Description: A tool that can be used to semantic search a query from a specific URL content.
[00m


[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Thought:[00m [92mI need to find AI tools that ca

Inserting batches in chromadb: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.72it/s]




[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Thought:[00m [92mThought: I need to find relevant AI tools for developing a conversational RAG system based on a large PDF document collection. I will search for tools focusing on this specific application.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"AI tools for conversational RAG systems\", \"website\": \"https://www.tensorflow.org\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
A RAG pipeline typically achieves this following these steps: Receive an input query. Use the retrieval system to search for relevant information based on the query. Incorporate the retrieved information into the prompt sent to the LLM. Generate a response that leverages the retrieved context. As an example, here's a simple RAG workflow that passes information from a retriever to a chat model : from langchain_openai import ChatOpenAI from langchain_co

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
                Check out the tools from before.
                
                Tell me what you think about each one.
                Is it good? Does it work with LangChain, PogreSQL, FastAPI?
                
                Make it JSON I guess.
            [00m


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Thought:[00m [92mAction: Search in a specific website[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"LangChain compatibility with PostgreSQL and FastAPI\", \"website\": \"https://langchain.com/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
As of the 0.3 release, LangChain uses Pydantic 2 internally. Users should install Pydantic 2 and are advised to avoid using the pydantic.v1 namespace of Pydantic 2 with
LangChain APIs. If you're working with prior versions of LangChain, please see the fol

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task (Develop a conversational RAG system that can answer questions based on a large PDF document collection) and existing stack (LangChain, PogreSQL, FastAPI)
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
[00m


[1m[95m# Agent:[00m [1m[92mReporting Ana

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:
/Users/mikhailrozhkov/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pydantic/main.py:1111: PydanticDeprecatedSince20: The `json` method is deprecated; use `model_dump_json` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.8/migration/


🏃 View run Tool_Research_1744876116 at: https://public-tracking.mlflow-e00rfd6s4w3cdtk9fx.backbone-e00rt16p71ew33dg9e.msp.eu-north1.nebius.cloud/#/experiments/9/runs/df4cb9457ff24268bddea2df4924e056
🧪 View experiment at: https://public-tracking.mlflow-e00rfd6s4w3cdtk9fx.backbone-e00rt16p71ew33dg9e.msp.eu-north1.nebius.cloud/#/experiments/9


# Step 5: Comparing Prompts with MLflow Prompt Registry

**Problem:** 
- How do you systematically track which prompt was used for which run and easily compare the resulting outputs?

**MLflow Solution: MLflow Prompt Registry**
- Version Control: Track prompt evolution with commit-based versioning and diff highlighting
- Aliasing: Create aliases (e.g., "production," "experimental") to isolate prompt versions
- Reusability: Store prompts centrally for use across multiple agents and applications
- Lineage: Connect prompts to specific model runs for comprehensive traceability
- Collaboration: Share prompts across your team with a centralized registry

### Register a propmt

In [89]:
old_create_report_task_prompt = """
Make a report about the tools.
Use markdown formatting.
"""

updated_prompt = mlflow.register_prompt(
    name="create-report-prompt",
    template=old_create_report_task_prompt,
    commit_message="Initial prompt",
    version_metadata={"author": "ai_platform@nebius.demo"}
)

2025/04/17 09:51:04 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: create-report-prompt, version 4


### Update the prompt

In [90]:
#  Improved instructions to get better structured output

improved_prompt = """
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task ({{task}}) and existing stack ({{ai_stack}})
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
"""

updated_prompt = mlflow.register_prompt(
    name="create-report-prompt",
    template=improved_prompt,
    commit_message="Improved formatting instructions and evaluation criteria",
    version_metadata={"author": "ai_platform@nebius.demo"}
)

2025/04/17 09:51:05 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: create-report-prompt, version 5


### Load prompt (last version)

In [91]:
# Example of loading and using the prompt
prompt = mlflow.load_prompt("create-report-prompt")

new_create_report_prompt = prompt.format(task=task_description, ai_stack=existing_stack)
print(new_create_report_prompt)


Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task (Develop a conversational RAG system that can answer questions based on a large PDF document collection) and existing stack (LangChain, PogreSQL, FastAPI)
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.



In [92]:
from crewai_tools import FileReadTool

class AIOpsResearchTasks:
    def search_tools_task(self, agent, task, ai_stack):
        """Task to discover relevant tools for a specific task considering the existing AI stack."""
        return Task(
            description=f"""
                Find some AI tools for {task}.
                Format however you think is best.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Info about AI tools",
            output_file="output/tool_candidates.json",
            cache=False
        )

    def analyze_tools_task(self, agent, task, ai_stack):
        """Task to perform in-depth analysis of discovered tools."""
        return Task(
            description=f"""
                Check out the tools from before.
                
                Tell me what you think about each one.
                Is it good? Does it work with {ai_stack}?
                
                Make it JSON I guess.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Analysis of tools",
            output_file="output/tool_analysis.json",
            cache=False,
        )

    def create_report_task(self, agent, task, ai_stack):
        """Task to create a comprehensive report with recommendations."""
        return Task(
            description=new_create_report_prompt,
            agent=agent,
            tools=[FileReadTool(file_path='output/tool_analysis.json')],
            expected_output="A report",
            output_file="output/tool_recommendation_report.md",
            cache=False,
        )

In [93]:
class AIOpsResearchCrewUpdatedPrompts:
    def __init__(self, task, ai_stack, nebius_client=None, evaluation_model="meta-llama/Llama-3.3-70B-Instruct"):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
            nebius_client: Initialized Nebius API client for evaluation
            evaluation_model (str): Model to use for evaluation
        """
        self.task = task
        self.ai_stack = ai_stack
        self.run_id = None
        self.execution_log = io.StringIO()
        self.crew = None
        self.nebius_client = nebius_client
        self.evaluation_model = evaluation_model
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)
        
        # Initialize Nebius client if not provided
        if self.nebius_client is None and os.environ.get("NEBIUS_API_KEY"):
            self.nebius_client = openai.OpenAI(
                api_key=os.environ.get("NEBIUS_API_KEY"),
                base_url="https://api.studio.nebius.ai/v1/"
            )

    def _clean_output_files(self):
        """Remove all previously generated output files if they exist."""
        files_to_remove = [
            "output/tool_candidates.json",
            "output/tool_analysis.json",
            "output/tool_recommendation_report.md",
            "output/report_evaluation.json"
        ]
        
        for file_path in files_to_remove:
            if os.path.exists(file_path):
                try:
                    os.remove(file_path)
                    logger.info(f"Removed existing file: {file_path}")
                except Exception as e:
                    logger.warning(f"Failed to remove file {file_path}: {str(e)}")

    def run(self, evaluate=True):
        """
        Execute the research, analysis, and reporting process with MLflow tracking.
        
        Args:
            evaluate (bool): Whether to evaluate the report after generation
        
        Returns:
            The result from the crew execution
        """
        # Clean any existing output files
        self._clean_output_files()
        
        # --- Start MLflow run ---
        with mlflow.start_run(run_name=f"Tool_Research_{int(time.time())}") as mlflow_run:
            self.run_id = mlflow_run.info.run_id

            # Initialize agents
            agents = AIOpsResearchAgents()
            researcher = agents.researcher_agent()
            analyst = agents.analyst_agent()
            
            # Initialize tasks
            tasks = AIOpsResearchTasks()
            search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
            analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
            report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
            
            # Create the crew
            self.crew = Crew(
                agents=[researcher, analyst],
                tasks=[search_task, analyze_task, report_task],
                verbose=False,
                process=Process.sequential,
                memory=True,
                cache=False
            )
            
            # Start the crew
            logger.info("Starting crew execution...")
            result = self.crew.kickoff()
                        
            # --- Log parameters ---
            mlflow.log_param("task", self.task)
            mlflow.log_param("ai_stack", self.ai_stack)
            mlflow.log_param("evaluation_model", self.evaluation_model)

            # --- Log metrics ---
            mlflow.log_metrics(json.loads(self.crew.usage_metrics.json()))

            # --- Log artifacts ---
            artifact_files = [
                "output/tool_candidates.json",
                "output/tool_analysis.json",
                "output/tool_recommendation_report.md",
            ]
            for file_path in artifact_files:
                if os.path.exists(file_path):
                    mlflow.log_artifact(file_path)

            # --- UPDATE: Set success tag ---
            if os.path.exists("output/tool_recommendation_report.md"):
                mlflow.set_tag("status", "SUCCESS")
            else:
                mlflow.set_tag("status", "FAILED")
            
            # --- Set success tag ---
            report_path = "output/tool_recommendation_report.md"
            if os.path.exists(report_path):
                mlflow.set_tag("status", "SUCCESS")
                
            # --- Evaluate the report if requested and client available ---
            if evaluate and self.nebius_client:
                evaluation_results = evaluate_report(
                    report_path=report_path,
                    nebius_client=self.nebius_client,
                    model=self.evaluation_model
                )
                
                # Log evaluation results to MLflow
                with open("output/report_evaluation.json", "w") as f:
                    json.dump(evaluation_results, f, indent=2)
                mlflow.log_artifact("output/report_evaluation.json")
                
                # Extract and log only numeric scores
                numeric_scores = {}
                for key, value in evaluation_results.items():
                    if key.endswith('_score') and isinstance(value, (int, float)):
                        numeric_scores[key] = value
                
                if numeric_scores:
                    mlflow.log_metrics(numeric_scores)
                
                return result


### Run with Different Prompts and Compare in MLflow UI


In [None]:
# Let's run it once without MLflow to see the typical verbose console output.

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 5 - Comparing prompts")

ai_dev_crew = AIOpsResearchCrewUpdatedPrompts(
    task_description, 
    existing_stack,
    nebius_client=nebius_client, 
    evaluation_model="meta-llama/Llama-3.3-70B-Instruct")

result = ai_dev_crew.run()

[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Task:[00m [92m
                Find some AI tools for Develop a conversational RAG system that can answer questions based on a large PDF document collection.
                Format however you think is best.
            [00m


[1m[95m# Agent:[00m [1m[92mSenior Data Researcher[00m
[95m## Thought:[00m [92mI need to search for AI tools suitable for developing a Conversational RAG system that can effectively utilize large PDF document collections.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"AI tools for developing Conversational Retrieval-Augmented Generation systems for large PDF collections\", \"website\": \"https://www.langchain.com/\"}"[00m
[95m## Tool Output:[00m [92m
Relevant Content:
On this page One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots. These are applications 

  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
                Check out the tools from before.
                
                Tell me what you think about each one.
                Is it good? Does it work with LangChain, PogreSQL, FastAPI?
                
                Make it JSON I guess.
            [00m


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Thought:[00m [92mI need to conduct a detailed analysis of the mentioned tools to evaluate their suitability for integration with LangChain, PostgreSQL, and FastAPI. Since I lack direct access to information concerning certain tools, such as their compatibility with PostgreSQL and FastAPI, I will first search for relevant information to enhance my analysis.[00m
[95m## Using tool:[00m [92mSearch in a specific website[00m
[95m## Tool Input:[00m [92m
"{\"search_query\": \"LangChain compatibility with PostgreSQL and FastAPI\", \"website\": \"https://langchain.com/\"}"[00m


  if mode == Mode.FUNCTIONS:
  if mode == Mode.FUNCTIONS:


[1m[95m# Agent:[00m [1m[92mReporting Analyst[00m
[95m## Task:[00m [92m
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task (Develop a conversational RAG system that can answer questions based on a large PDF document collection) and existing stack (LangChain, PogreSQL, FastAPI)
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
[00m


[1m[95m# Agent:[00m [1m[92mReporting Ana

**Observation:** MLflow makes A/B testing prompts or configurations straightforward. You have a clear record of what changed and can easily compare the outcomes.

# 8. Conclusion

Developing robust and reliable LLM agents requires moving beyond simple script execution. By integrating **Managed MLflow**, we gain crucial capabilities:

*   **Traceability:** Understand exactly what happened during an agent run via logged artifacts (`execution_log.txt`, `error.log`).
*   **Reproducibility:** Capture configurations as parameters, ensuring you know precisely what setup produced a given result.
*   **Comparison:** Systematically evaluate the impact of changes (prompts, LLMs, agent logic) using the MLflow UI's compare feature for parameters, metrics, and artifacts.
*   **Debugging:** Quickly identify and analyze failed runs using status tags and dedicated error logs.
*   **Performance Monitoring:** Track execution time and other metrics to identify bottlenecks and regressions.
*   **Quality & Cost Evaluation:** Provide a central repository to store and analyze quality scores (human or LLM-judged) and resource usage (tokens, cost) for holistic optimization.

MLflow transforms agent development into a more transparent, measurable, and efficient engineering process, enabling faster iteration and more reliable outcomes.