# Tutorial: Simplifying Agent Development with MLflow & CrewAI

**Goal:** This tutorial demonstrates how Managed MLflow transforms the development of multi-agent systems (using CrewAI as an example) from a potentially opaque and difficult-to-debug process into a structured, observable, and iterative engineering workflow.

**Scenario:** We will use the "AI Research" agentic system. This system researches new AI/ML tools. While useful, developing such agents presents common challenges:

*   **Black Box Execution:** What *exactly* did the agent do step-by-step? Why did it choose *that* tool?
*   **Comparing Changes:** How do we reliably compare results if we tweak prompts or agent configurations?
*   **Debugging Failures:** Why did the agent fail? What was it trying to do right before the error?
*   **Performance:** Is the agent getting slower? Where are the bottlenecks?
*   **Quality Evaluation:** Is changing the LLM or prompts actually improving the *quality* of the output?
*   **Cost/Speed Optimization:** How can we track and reduce token usage or latency?

We will tackle several of these problems by incrementally integrating MLflow tracking features into our CrewAI development process.

## Getting Started

1. Launch your instance of the Managed Service for MLflow with [the MLflow quickstart](https://docs.nebius.com/mlflow/quickstart).
2. Set up your API key to connect to Nebius AI Studio with [the AI Studio quickstart](https://docs.nebius.com/studio/inference/quickstart).

> **Note:** Launching an MLflow cluster for the first time may take 15-30 minutes to be fully provisioned and ready to use.

In [None]:
!pip install mlflow==2.21.2 python-dotenv openai crewai==0.114.0 crewai-tools duckduckgo-search

### Secrets and Environment Variables

Set the following environment variables where this notebook is running, so that the code in the following cells can connect to both Nebius Managed Service for MLflow and Nebius AI Studio. 

LLM:<br>
`OPENAI_API_KEY` - for Agent LLM calls

MLflow (for tracking agent execution and LLM calls):<br>
`MLFLOW_TRACKING_SERVER_CERT_PATH`<br>
`MLFLOW_TRACKING_URI`<br>
`MLFLOW_TRACKING_USERNAME`<br>
`MLFLOW_TRACKING_PASSWORD`<br>

AI Studio:<br>
`NEBIUS_API_KEY` - for LLM-as-Judge evaluation

To set the environment variables, run the following cell. You may choose to set them interactively or by loading from a `.env` file.

In [None]:
%reload_ext autoreload
%autoreload 2

import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add the parent directory to Python path
sys.path.append(str(Path.cwd().parent))

from env_setup import setup_env_from_file, setup_env_interactive, verify_env_setup

# Option 1: Interactive setup
# setup_env_interactive()

# Option 2: Load from .env file
setup_env_from_file('../.env')

# Verify the setup
verify_env_setup()
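
The `env_setup` helpers are not shown in this tutorial. As a rough sketch of what a verification step could look like, the function below reports which of the variables listed above are unset. `missing_env_vars` is a hypothetical helper, not part of `env_setup`, and treating all six variables as required is an assumption.

```python
import os

# Variable names taken from the list above; treating all as required is an assumption.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "MLFLOW_TRACKING_SERVER_CERT_PATH",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
    "NEBIUS_API_KEY",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
print("Missing variables:", missing or "none")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.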

### Check connection to MLflow


In [None]:
import mlflow 

# List experiments in MLflow
mlflow.search_experiments()

# Step 1: Quick implementation of an AI Agent with CrewAI

Problem: How do we track what happens during agent execution when logs scroll by too quickly?

Limitations of the baseline:

- CrewAI provides basic logging but lacks structured traceability
- Console output is ephemeral and hard to analyze
- No built-in token usage or cost tracking
- Difficult to debug errors without visibility

In [None]:
import os
import io
import time
import json
import traceback
import contextlib
from textwrap import dedent
from crewai import Agent, Crew, Task, Process
from crewai import __version__ as crewai_version
from crewai_tools import WebsiteSearchTool
import mlflow

print(f"MLflow Version: {mlflow.__version__}")
print(f"CrewAI Version: {crewai_version}")

### Define Tools

In [None]:
# To search across any discovered websites

search_tool = WebsiteSearchTool()

### Define an Agent

In [None]:
class AIOpsResearchAgents:
    def researcher_agent(self):
        """Agent responsible for discovering relevant AI tools for specific tasks."""
        return Agent(
            role="Senior Data Researcher",
            goal="Discover and evaluate relevant AI tools (libraries, frameworks) for specific tasks",
            backstory="An expert in AI development and MLOps",
            tools=[search_tool],
            verbose=True,
            max_iter=3
        )

    def analyst_agent(self):
        """Agent responsible for analyzing tools and creating detailed reports."""
        return Agent(
            role="Reporting Analyst",
            goal="Create detailed reports based on data analysis and research findings",
            backstory="""You are a technical expert specializing in AI technologies evaluation. 
    You have a deep understanding of AI libraries and frameworks, and can 
    quickly assess their technical merits""",
            tools=[search_tool],
            verbose=True,
            max_iter=3 # Maximum iterations before the agent must provide its best answer. Default is 20.
        )

### Define Tasks

In [None]:
class AIOpsResearchTasks:
    def search_tools_task(self, agent, task, ai_stack):
        """Task to discover relevant tools for a specific task considering the existing AI stack."""
        return Task(
            description=dedent(f"""
                You are a research agent tasked with finding AI tools for: {task}.
                Consider compatibility with: {ai_stack}.
                
                For each tool, identify:
                - Name and URL
                - Primary use case
                - Brief description (2-3 sentences)
                
                Format your response as a JSON list of objects.
            """),
            agent=agent,
            expected_output="A JSON list containing details of 3-5 relevant AI tools with their names, URLs, use cases, and descriptions.",
            output_file="output/tool_candidates.json"
        )

    def analyze_tools_task(self, agent, task, ai_stack):
        """Task to perform in-depth analysis of discovered tools."""
        return Task(
            description=dedent(f"""
                Read the tool list from the previous task's result and perform a detailed analysis of each tool.
                
                For each tool:
                1. Research its capabilities, limitations, community adoption, documentation quality
                2. Evaluate how well it addresses the specified task: '{task}'
                3. Consider its compatibility with '{ai_stack}'
                4. Identify pros and cons
                
                Your output should be a JSON list of these detailed analysis objects.
            """),
            agent=agent,
            expected_output="A JSON list containing detailed analysis of each tool with comprehensive information about features, pros, cons, and recommendation scores.",
            output_file="output/tool_analysis.json",
        )

    def create_report_task(self, agent, task, ai_stack):
        """Task to create a comprehensive report with recommendations."""
        return Task(
            description=dedent(f"""
                Read the analysis from the previous task's result and create a comprehensive Markdown report.
                
                The report should include:
                
                1. An introduction explaining the task ('{task}') and existing stack ('{ai_stack}')
                2. For each tool, create a section with:
                   - Tool name and URL as a heading
                   - Description
                   - Features (as bullet points)
                   - Pros (as bullet points)
                   - Cons (as bullet points)
                   - Integration complexity
                   - Recommendation score with justification
                3. A summary/conclusion comparing the tools and providing final recommendations
                
                Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
                Sort tools by recommendation score (descending).
                
                Your output should be a complete, well-formatted Markdown document.
            """),
            agent=agent,
            expected_output="A comprehensive Markdown report analyzing each tool with recommendations, properly formatted with headings, bullet points, and clear sections.",
            output_file="output/tool_recommendation_report.md",
        )
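
The task descriptions above are built with f-strings, where brace handling is easy to get wrong: single braces interpolate a Python variable immediately, while doubled braces produce a literal `{name}` placeholder that stays in the text. The short check below illustrates the difference; the value of `task` is only an example.

```python
task = "build a RAG system"

# Single braces: the variable is substituted when the string is created.
interpolated = f"Find AI tools for: {task}."

# Doubled braces: the f-string emits a literal {task} placeholder.
literal = f"Find AI tools for: {{task}}."

print(interpolated)  # Find AI tools for: build a RAG system.
print(literal)       # Find AI tools for: {task}.
```

A literal placeholder is only useful if something fills it in later. CrewAI can do this through the `inputs` argument of `crew.kickoff`, which this tutorial does not use.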

### Design a Crew 

A crew in CrewAI is a collaborative group of agents working together to complete tasks. Crews define:

- Task execution strategy
- Agent collaboration methods
- Overall workflow coordination
- Communication patterns between agents
- Task delegation and sequencing

In [None]:
class AIOpsResearchCrew:
    def __init__(self, task, ai_stack):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
        """
        self.task = task
        self.ai_stack = ai_stack
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)

    def run(self):
        """Execute the research, analysis, and reporting process."""
        # Initialize agents
        agents = AIOpsResearchAgents()
        researcher = agents.researcher_agent()
        analyst = agents.analyst_agent()

        # Initialize tasks
        tasks = AIOpsResearchTasks()
        search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
        analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
        report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
        
        # Create the crew
        crew = Crew(
            agents=[researcher, analyst],
            tasks=[search_task, analyze_task, report_task],
            verbose=True,
            process=Process.sequential,
            memory=True
        )
        
        result = crew.kickoff()
        return result
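
With `Process.sequential`, tasks run in the order listed and each task can build on the output of the one before it. The toy pipeline below illustrates that idea in plain Python. It is a conceptual sketch, not CrewAI's implementation, and every name in it (`run_sequential` and the three step functions) is made up for illustration.

```python
from typing import Callable, List


def run_sequential(steps: List[Callable[[str], str]], initial_input: str) -> List[str]:
    """Run steps in order, feeding each step the previous step's output."""
    outputs = []
    context = initial_input
    for step in steps:
        context = step(context)
        outputs.append(context)
    return outputs


# Stand-ins for the search, analyze, and report tasks.
def search(task: str) -> str:
    return f"candidates for [{task}]"


def analyze(candidates: str) -> str:
    return f"analysis of ({candidates})"


def report(analysis: str) -> str:
    return f"report on <{analysis}>"


outputs = run_sequential([search, analyze, report], "conversational RAG")
print(outputs[-1])
# report on <analysis of (candidates for [conversational RAG])>
```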


### Test Run (Optional)

In [None]:
# Let's run it once without MLflow to see the typical verbose console output.

task_description = "Develop a conversational RAG system that can answer questions based on a large PDF document collection"
existing_stack = "LangChain, PostgreSQL, FastAPI"

ai_dev_crew = AIOpsResearchCrew(task_description, existing_stack)
result = ai_dev_crew.run()

In [None]:
print("\n=== Final Result (Base Run) ===")
print(result)

# Step 2: Autolog Agent execution and LLM calls with MLflow

Problem: How can we reliably capture exactly what the agent did, including reasoning steps and tool usage, for later inspection?


Solution:

- Enable structured logging with `mlflow....autolog()`
- Record agent steps, decisions, and tools used


In [None]:
# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 2 - Autolog")

ai_dev_crew = AIOpsResearchCrew(task_description, existing_stack)
result = ai_dev_crew.run()
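
Conceptually, a trace is a list of spans, each recording what was called, with which inputs, what it returned, and how long it took. The decorator below is a toy illustration of that idea, not how MLflow implements tracing; `traced`, `SPANS`, and `lookup_tool` are hypothetical names.

```python
import functools
import time

SPANS = []  # collected span records, in completion order


def traced(func):
    """Record name, inputs, output, and duration of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        SPANS.append({
            "name": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper


@traced
def lookup_tool(query):
    # Stand-in for a tool call made by an agent.
    return f"results for {query}"


lookup_tool("vector databases")
print(SPANS[0]["name"], "->", SPANS[0]["output"])
```

Autologging plays a similar role for CrewAI: the records are created for you, so agent and task code does not need to be decorated by hand.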

# Step 3: Tracking metrics and artifacts

Problem: How do we measure and compare agent performance across multiple runs?

Solution:

- Track execution time, token counts, and costs with `mlflow.log_metric()`
- Store output artifacts and conversation transcripts with `mlflow.log_artifact()`
- Add searchable metadata with `mlflow.set_tag()`

### Modify Crew Class for MLflow Run & Log Capture

In [None]:
import os
import json
import time
import logging
import io
import sys
import contextlib

# Logger used by the crew class below
logger = logging.getLogger(__name__)


class AIOpsResearchCrew:
    def __init__(self, task, ai_stack):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
        """
        self.task = task
        self.ai_stack = ai_stack
        self.run_id = None
        self.crew = None
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)

    def run(self):
        """Execute the research, analysis, and reporting process with MLflow tracking."""
        
        # --- UPDATE: Start MLflow run ---
        with mlflow.start_run(run_name=f"Tool_Research_{int(time.time())}") as mlflow_run:
            self.run_id = mlflow_run.info.run_id
            

            logger.info("Initializing agents...")
            agents = AIOpsResearchAgents()
            researcher = agents.researcher_agent()
            analyst = agents.analyst_agent()
            
            # Initialize tasks
            tasks = AIOpsResearchTasks()
            search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
            analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
            report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
            
            # Create the crew
            self.crew = Crew(
                agents=[researcher, analyst],
                tasks=[search_task, analyze_task, report_task],
                verbose=False,
                process=Process.sequential,
                memory=True
            )
            
            # Start the crew
            result = self.crew.kickoff()

            
            # --- UPDATE: Log parameters ---
            mlflow.log_param("task", self.task)
            mlflow.log_param("ai_stack", self.ai_stack)
            
            # --- UPDATE: Log metrics ---
            mlflow.log_metrics(json.loads(self.crew.usage_metrics.json()))

            # --- UPDATE: Log artifacts ---
            artifact_files = [
                "output/tool_candidates.json",
                "output/tool_analysis.json",
                "output/tool_recommendation_report.md",
            ]
            for file_path in artifact_files:
                if os.path.exists(file_path):
                    mlflow.log_artifact(file_path)

            # --- UPDATE: Set success tag ---
            if os.path.exists("output/tool_recommendation_report.md"):
                mlflow.set_tag("status", "SUCCESS")
            else:
                mlflow.set_tag("status", "FAILED")
            
            return result
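
The bullet list for this step mentions execution time, which the class above does not record. One way to capture it is a small timing context manager, sketched below with the standard library only. `Timer` is a hypothetical helper, and the metric name `execution_time_s` in the final comment is an assumed choice.

```python
import time


class Timer:
    """Context manager that measures wall-clock time in seconds."""

    def __enter__(self):
        self._start = time.perf_counter()
        self.elapsed = None
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not suppress exceptions


with Timer() as timer:
    time.sleep(0.01)  # stand-in for crew.kickoff()

print(f"elapsed: {timer.elapsed:.3f}s")
# Inside an active MLflow run, this value could then be logged:
# mlflow.log_metric("execution_time_s", timer.elapsed)
```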
       

### Run and Observe in MLflow UI

In [None]:
# Run the crew inside an MLflow run to log parameters, metrics, and artifacts.

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 3 - Metrics")

ai_dev_crew = AIOpsResearchCrew(task_description, existing_stack)
result = ai_dev_crew.run()

### Crew Usage Metrics

In [None]:
# Show the crew usage metrics 

ai_dev_crew.crew.usage_metrics
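
Token counts become more actionable when converted to an approximate cost. The sketch below assumes the usage metrics expose `prompt_tokens` and `completion_tokens` counts and uses placeholder per-million-token prices. Both the token counts and the prices are hypothetical; replace them with your own usage numbers and your provider's actual rates.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate cost from token counts and per-million-token prices."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_m
    output_cost = completion_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical token counts and prices, for illustration only.
cost = estimate_cost(
    prompt_tokens=200_000,
    completion_tokens=50_000,
    input_price_per_m=1.00,
    output_price_per_m=2.00,
)
print(f"estimated cost: ${cost:.2f}")  # estimated cost: $0.30
```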

# Step 4: Evaluating Agent Output Quality

**Problem:** How do we automatically evaluate output quality without manual comparison?

**MLflow Solution:**
- Define specific quality criteria for assessment
- Use LLM-as-Judge evaluation for scoring
- Log evaluation metrics to MLflow
- Compare configurations in MLflow UI

### Set up the Nebius AI Studio client for the "evaluator" LLM

In [None]:
import openai

API_KEY = os.environ.get("NEBIUS_API_KEY")

# Instantiate the client instance
nebius_client = openai.OpenAI(
    api_key=API_KEY,
    base_url="https://api.studio.nebius.ai/v1/",
)


### Simplify tasks 

In [None]:
from crewai_tools import FileReadTool

class AIOpsResearchTasks:
    def search_tools_task(self, agent, task, ai_stack):
        """Task to discover relevant tools for a specific task considering the existing AI stack."""
        return Task(
            description=f"""
                Find some AI tools for {task}.
                Format however you think is best.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Info about AI tools",
            output_file="output/tool_candidates.json",
            cache=False
        )

    def analyze_tools_task(self, agent, task, ai_stack):
        """Task to perform in-depth analysis of discovered tools."""
        return Task(
            description=f"""
                Check out the tools from before.
                
                Tell me what you think about each one.
                Is it good? Does it work with {ai_stack}?
                
                Make it JSON I guess.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Analysis of tools",
            output_file="output/tool_analysis.json",
            cache=False,
        )

    def create_report_task(self, agent, task, ai_stack):
        """Task to create a comprehensive report with recommendations."""
        return Task(
            description=f"""
                Make a report about the tools.
                Use markdown formatting.
            """,
            agent=agent,
            tools=[FileReadTool(file_path='output/tool_analysis.json')],
            expected_output="A report",
            output_file="output/tool_recommendation_report.md",
            cache=False,
        )

### Define Evaluation function

In [None]:
import json
import logging
import os
from textwrap import dedent

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def evaluate_report(
    report_path: str, 
    nebius_client,
    model: str = "meta-llama/Llama-3.3-70B-Instruct"
) -> dict:
    """
    Evaluate a technical report using Nebius AI.
    
    Args:
        report_path: Path to the markdown report file
        nebius_client: Initialized Nebius API client
        model: Model name to use for evaluation
        
    Returns:
        Dictionary containing evaluation scores and reasoning
    """
    # Read the report content
    try:
        with open(report_path, 'r') as f:
            report_content = f.read()
    except Exception as e:
        logger.error(f"Error reading report file: {str(e)}")
        return {
            "completeness_score": 0,
            "relevance_score": 0,
            "overall_quality_score": 0,
            "reasoning": f"Error reading report: {str(e)}"
        }
    
    # Create evaluation prompt
    evaluation_prompt = dedent(f"""
        Evaluate the quality of the following technical report.
        
        Your job is to score the report on the following criteria (scale 1-10):
        
        - COMPLETENESS: Does the report include all required sections for each tool?
          1-3: Missing multiple required sections
          4-6: Has basic information but lacks detail
          7-8: Contains most required sections with good detail
          9-10: Complete with introduction, tool sections (name, URL, description, features, pros, cons, integration complexity, recommendation score), and conclusion
        
        - RELEVANCE: How directly applicable are the recommended tools to the specific task and tech stack?
          1-3: General tools with no specific RAG functionality (like basic logging libraries)
          4-6: Tools with potential use cases (like numpy for data processing)
          7-8: Related tools with partial functionality (like fastai - useful but no specific RAG features)
          9-10: Directly applicable tools (like LangGraph or CrewAI - frameworks specifically for RAG systems)
        
        - OVERALL QUALITY: The overall quality of the report considering all factors.
        
        Here is the report to evaluate:
        
        ---BEGIN REPORT---
        {report_content}
        ---END REPORT---
        
        Provide your evaluation as a JSON object with the following format:
        {{
            "completeness_score": X,
            "relevance_score": X,
            "overall_quality_score": X,
            "reasoning": "Detailed explanation of your evaluation and scores..."
        }}
        
        Return ONLY the JSON object with no additional text.
    """)
    
    # Call Nebius AI Studio for evaluation
    try:
        response = nebius_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an expert evaluator of technical reports."},
                {"role": "user", "content": evaluation_prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
        
        # Parse response
        result = json.loads(response.choices[0].message.content)
        
        # Ensure all scores are present
        required_scores = ["completeness_score", "relevance_score", "overall_quality_score"]
        for score in required_scores:
            if score not in result:
                result[score] = 0
        
        # Ensure reasoning is present
        if "reasoning" not in result:
            result["reasoning"] = "No reasoning provided."
            
        return result
        
    except Exception as e:
        logger.error(f"Error during evaluation: {str(e)}")
        return {
            "completeness_score": 0,
            "relevance_score": 0,
            "overall_quality_score": 0,
            "reasoning": f"Evaluation error: {str(e)}"
        }
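
Even with `response_format` set to JSON, some models may wrap their answer in Markdown code fences or add surrounding text, which makes `json.loads` fail. The helper below is one defensive way to pull a JSON object out of such a reply. `extract_json_object` is a hypothetical name, and the approach (take everything between the first `{` and the last `}`) is a simple heuristic rather than a full parser.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in `text`."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    return json.loads(text[start:end + 1])


fence = "`" * 3  # builds a Markdown code fence without writing one literally
reply = f'{fence}json\n{{"completeness_score": 7, "reasoning": "ok"}}\n{fence}'
print(extract_json_object(reply))
# {'completeness_score': 7, 'reasoning': 'ok'}
```

Inside `evaluate_report`, a helper like this could replace the direct `json.loads` call; a `ValueError` would then be caught by the existing `except` branch.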

In [None]:
evaluation_results = evaluate_report(
                    report_path='output/tool_recommendation_report.md',
                    nebius_client=nebius_client,
                    model="meta-llama/Llama-3.3-70B-Instruct"
                )

evaluation_results

### Modify AIOpsResearchCrew

In [None]:
class AIOpsResearchCrewEvaluated:
    def __init__(self, task, ai_stack, nebius_client=None, evaluation_model="meta-llama/Llama-3.3-70B-Instruct"):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
            nebius_client: Initialized Nebius API client for evaluation
            evaluation_model (str): Model to use for evaluation
        """
        self.task = task
        self.ai_stack = ai_stack
        self.run_id = None
        self.execution_log = io.StringIO()
        self.crew = None
        self.nebius_client = nebius_client
        self.evaluation_model = evaluation_model
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)
        
        # Initialize Nebius client if not provided
        if self.nebius_client is None and os.environ.get("NEBIUS_API_KEY"):
            self.nebius_client = openai.OpenAI(
                api_key=os.environ.get("NEBIUS_API_KEY"),
                base_url="https://api.studio.nebius.ai/v1/"
            )

    def _clean_output_files(self):
        """Remove all previously generated output files if they exist."""
        files_to_remove = [
            "output/tool_candidates.json",
            "output/tool_analysis.json",
            "output/tool_recommendation_report.md",
            "output/report_evaluation.json"
        ]
        
        for file_path in files_to_remove:
            if os.path.exists(file_path):
                try:
                    os.remove(file_path)
                    logger.info(f"Removed existing file: {file_path}")
                except Exception as e:
                    logger.warning(f"Failed to remove file {file_path}: {str(e)}")

    def run(self, evaluate=True):
        """
        Execute the research, analysis, and reporting process with MLflow tracking.
        
        Args:
            evaluate (bool): Whether to evaluate the report after generation
        
        Returns:
            The result from the crew execution
        """
        # Clean any existing output files
        self._clean_output_files()
        
        # --- Start MLflow run ---
        with mlflow.start_run(run_name=f"Tool_Research_{int(time.time())}") as mlflow_run:
            self.run_id = mlflow_run.info.run_id

            # Initialize agents
            agents = AIOpsResearchAgents()
            researcher = agents.researcher_agent()
            analyst = agents.analyst_agent()
            
            # Initialize tasks
            tasks = AIOpsResearchTasks()
            search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
            analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
            report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
            
            # Create the crew
            self.crew = Crew(
                agents=[researcher, analyst],
                tasks=[search_task, analyze_task, report_task],
                verbose=False,
                process=Process.sequential,
                memory=True,
                cache=False
            )
            
            # Start the crew
            logger.info("Starting crew execution...")
            result = self.crew.kickoff()
                        
            # --- Log parameters ---
            mlflow.log_param("task", self.task)
            mlflow.log_param("ai_stack", self.ai_stack)
            mlflow.log_param("evaluation_model", self.evaluation_model)

            # --- Log metrics ---
            mlflow.log_metrics(json.loads(self.crew.usage_metrics.json()))

            # --- Log artifacts ---
            artifact_files = [
                "output/tool_candidates.json",
                "output/tool_analysis.json",
                "output/tool_recommendation_report.md",
            ]
            for file_path in artifact_files:
                if os.path.exists(file_path):
                    mlflow.log_artifact(file_path)

            # --- Set status tag ---
            report_path = "output/tool_recommendation_report.md"
            if os.path.exists(report_path):
                mlflow.set_tag("status", "SUCCESS")
            else:
                mlflow.set_tag("status", "FAILED")
                
            # --- Evaluate the report if requested and client available ---
            if evaluate and self.nebius_client:
                evaluation_results = evaluate_report(
                    report_path=report_path,
                    nebius_client=self.nebius_client,
                    model=self.evaluation_model
                )
                
                # Log evaluation results to MLflow
                with open("output/report_evaluation.json", "w") as f:
                    json.dump(evaluation_results, f, indent=2)
                mlflow.log_artifact("output/report_evaluation.json")
                
                # Extract and log only numeric scores
                numeric_scores = {}
                for key, value in evaluation_results.items():
                    if key.endswith('_score') and isinstance(value, (int, float)):
                        numeric_scores[key] = value
                
                if numeric_scores:
                    mlflow.log_metrics(numeric_scores)
                
            # Return the crew result whether or not evaluation ran
            return result


### Run with Agent evaluation

In [None]:
# Run the crew with tracing, MLflow logging, and LLM-as-Judge evaluation.

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 4 - Evaluate Agent")

ai_dev_crew = AIOpsResearchCrewEvaluated(
    task_description, 
    existing_stack,  
    nebius_client=nebius_client, 
    evaluation_model="meta-llama/Llama-3.3-70B-Instruct")

result = ai_dev_crew.run()

# Step 5: Comparing Prompts with MLflow Prompt Registry

**Problem:** How do you track which prompt was used for which run and compare outputs?

**MLflow Solution: MLflow Prompt Registry**
- Version Control: Track prompt evolution with commit-based versioning and diff highlighting
- Aliasing: Create aliases (e.g., "production," "experimental") to isolate prompt versions
- Reusability: Store prompts centrally for use across multiple agents and applications
- Lineage: Connect prompts to specific model runs for comprehensive traceability
- Collaboration: Share prompts across your team with a centralized registry

### Register a prompt

In [None]:
old_create_report_task_prompt = """
Make a report about the tools.
Use markdown formatting.
"""

initial_prompt = mlflow.register_prompt(
    name="create-report-prompt",
    template=old_create_report_task_prompt,
    commit_message="Initial prompt",
    version_metadata={"author": "ai_platform@nebius.demo"}
)

### Update the prompt

In [None]:
#  Improved instructions to get better structured output

improved_prompt = """
Read the analysis from the previous task's result and create a comprehensive Markdown report.

The report should include:

1. An introduction explaining the task ({{task}}) and existing stack ({{ai_stack}})
2. For each tool, create a section with:
   - Tool name and URL as a heading
   - Description
   - Features (as bullet points)
   - Pros (as bullet points)
   - Cons (as bullet points)
   - Integration complexity
   - Recommendation score with justification
3. A summary/conclusion comparing the tools and providing final recommendations

Use proper Markdown formatting with headings, bullet points, and emphasis where appropriate.
Sort tools by recommendation score (descending).

Your output should be a complete, well-formatted Markdown document.
"""

updated_prompt = mlflow.register_prompt(
    name="create-report-prompt",
    template=improved_prompt,
    commit_message="Improved formatting instructions and evaluation criteria",
    version_metadata={"author": "ai_platform@nebius.demo"}
)

### Load prompt (latest version)

In [None]:
# Example of loading and using the prompt
prompt = mlflow.load_prompt("create-report-prompt")

new_create_report_prompt = prompt.format(task=task_description, ai_stack=existing_stack)
print(new_create_report_prompt)
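
Prompt templates in the registry mark variables with double curly braces, such as `{{task}}`, and `format` fills them in. The function below is a toy stand-in that shows the substitution idea with a regular expression. It is not MLflow's implementation, and `fill_template` is a hypothetical name.

```python
import re


def fill_template(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword value."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing value for template variable: {name}")
        return values[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)


template = "Explain the task ({{task}}) and existing stack ({{ai_stack}})."
print(fill_template(template, task="RAG over PDFs", ai_stack="LangChain, FastAPI"))
# Explain the task (RAG over PDFs) and existing stack (LangChain, FastAPI).
```

Note that these templates are ordinary strings, not f-strings, so the doubled braces reach the registry unchanged.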

In [None]:
from crewai_tools import FileReadTool

class AIOpsResearchTasks:
    def search_tools_task(self, agent, task, ai_stack):
        """Task to discover relevant tools for a specific task considering the existing AI stack."""
        return Task(
            description=f"""
                Find some AI tools for {task}.
                Format however you think is best.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Info about AI tools",
            output_file="output/tool_candidates.json",
            cache=False
        )

    def analyze_tools_task(self, agent, task, ai_stack):
        """Task to perform in-depth analysis of discovered tools."""
        return Task(
            description=f"""
                Check out the tools from before.
                
                Tell me what you think about each one.
                Is it good? Does it work with {ai_stack}?
                
                Make it JSON I guess.
            """,
            agent=agent,
            tools=[search_tool],
            expected_output="Analysis of tools",
            output_file="output/tool_analysis.json",
            cache=False,
        )

    def create_report_task(self, agent, task, ai_stack):
        """Task to create a comprehensive report with recommendations."""
        return Task(
            description=new_create_report_prompt,
            agent=agent,
            tools=[FileReadTool(file_path='output/tool_analysis.json')],
            expected_output="A report",
            output_file="output/tool_recommendation_report.md",
            cache=False,
        )

In [None]:
class AIOpsResearchCrewUpdatedPrompts:
    def __init__(self, task, ai_stack, nebius_client=None, evaluation_model="meta-llama/Llama-3.3-70B-Instruct"):
        """
        Initialize the crew with the task description and existing AI stack.
        
        Args:
            task (str): Description of the task requiring AI tools
            ai_stack (str): Comma-separated list of existing tools/frameworks used
            nebius_client: Initialized Nebius API client for evaluation
            evaluation_model (str): Model to use for evaluation
        """
        self.task = task
        self.ai_stack = ai_stack
        self.run_id = None
        self.execution_log = io.StringIO()
        self.crew = None
        self.nebius_client = nebius_client
        self.evaluation_model = evaluation_model
        
        # Ensure output directory exists
        os.makedirs("output", exist_ok=True)
        
        # Initialize Nebius client if not provided
        if self.nebius_client is None and os.environ.get("NEBIUS_API_KEY"):
            self.nebius_client = openai.OpenAI(
                api_key=os.environ.get("NEBIUS_API_KEY"),
                base_url="https://api.studio.nebius.ai/v1/"
            )

    def _clean_output_files(self):
        """Remove all previously generated output files if they exist."""
        files_to_remove = [
            "output/tool_candidates.json",
            "output/tool_analysis.json",
            "output/tool_recommendation_report.md",
            "output/report_evaluation.json"
        ]
        
        for file_path in files_to_remove:
            if os.path.exists(file_path):
                try:
                    os.remove(file_path)
                    logger.info(f"Removed existing file: {file_path}")
                except Exception as e:
                    logger.warning(f"Failed to remove file {file_path}: {str(e)}")

    def run(self, evaluate=True):
        """
        Execute the research, analysis, and reporting process with MLflow tracking.
        
        Args:
            evaluate (bool): Whether to evaluate the report after generation
        
        Returns:
            The result from the crew execution
        """
        # Clean any existing output files
        self._clean_output_files()
        
        # --- Start MLflow run ---
        with mlflow.start_run(run_name=f"Tool_Research_{int(time.time())}") as mlflow_run:
            self.run_id = mlflow_run.info.run_id

            # Initialize agents
            agents = AIOpsResearchAgents()
            researcher = agents.researcher_agent()
            analyst = agents.analyst_agent()
            
            # Initialize tasks
            tasks = AIOpsResearchTasks()
            search_task = tasks.search_tools_task(researcher, self.task, self.ai_stack)
            analyze_task = tasks.analyze_tools_task(analyst, self.task, self.ai_stack)
            report_task = tasks.create_report_task(analyst, self.task, self.ai_stack)
            
            # Create the crew
            self.crew = Crew(
                agents=[researcher, analyst],
                tasks=[search_task, analyze_task, report_task],
                verbose=False,
                process=Process.sequential,
                memory=True,
                cache=False
            )
            
            # Start the crew
            logger.info("Starting crew execution...")
            result = self.crew.kickoff()
                        
            # --- Log parameters ---
            mlflow.log_param("task", self.task)
            mlflow.log_param("ai_stack", self.ai_stack)
            mlflow.log_param("evaluation_model", self.evaluation_model)

            # --- Log metrics ---
            mlflow.log_metrics(json.loads(self.crew.usage_metrics.json()))

            # --- Log artifacts ---
            artifact_files = [
                "output/tool_candidates.json",
                "output/tool_analysis.json",
                "output/tool_recommendation_report.md",
            ]
            for file_path in artifact_files:
                if os.path.exists(file_path):
                    mlflow.log_artifact(file_path)

            # --- Set status tag based on whether the final report was produced ---
            report_path = "output/tool_recommendation_report.md"
            if os.path.exists(report_path):
                mlflow.set_tag("status", "SUCCESS")
            else:
                mlflow.set_tag("status", "FAILED")
                
            # --- Evaluate the report if requested and client available ---
            if evaluate and self.nebius_client:
                evaluation_results = evaluate_report(
                    report_path=report_path,
                    nebius_client=self.nebius_client,
                    model=self.evaluation_model
                )
                
                # Log evaluation results to MLflow
                with open("output/report_evaluation.json", "w") as f:
                    json.dump(evaluation_results, f, indent=2)
                mlflow.log_artifact("output/report_evaluation.json")
                
                # Extract and log only numeric scores
                numeric_scores = {}
                for key, value in evaluation_results.items():
                    if key.endswith('_score') and isinstance(value, (int, float)):
                        numeric_scores[key] = value
                
                if numeric_scores:
                    mlflow.log_metrics(numeric_scores)
                
            # Return the crew result whether or not evaluation ran
            return result
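
The score-filtering step in `run` matters because MLflow metrics must be numeric, while an LLM judge may return strings or omit fields. A standalone sketch of that filter follows. The function name and sample data are hypothetical, and unlike the loop in `run` it also skips booleans, since `bool` is a subclass of `int` in Python.

```python
from numbers import Real


def extract_numeric_scores(results: dict) -> dict:
    """Keep entries whose key ends in '_score' and whose value is a real number."""
    return {
        key: float(value)
        for key, value in results.items()
        if key.endswith("_score")
        and isinstance(value, Real)
        and not isinstance(value, bool)  # exclude True/False, which count as ints
    }


# Hypothetical evaluation output, for illustration only
sample = {
    "clarity_score": 4,
    "accuracy_score": 3.5,
    "overall_score": "N/A",       # non-numeric value: skipped
    "passed_score": True,         # boolean: skipped
    "feedback": "Needs sources",  # key is not a score: skipped
}

print(extract_numeric_scores(sample))
```

The resulting dictionary can be passed directly to `mlflow.log_metrics`.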


### Run with Different Prompts and Compare in MLflow UI


In [None]:
# Run the crew with the updated prompts and track the run in MLflow.

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()
mlflow.set_experiment("Step 5 - Comparing prompts")

ai_dev_crew = AIOpsResearchCrewUpdatedPrompts(
    task_description, 
    existing_stack,
    nebius_client=nebius_client, 
    evaluation_model="meta-llama/Llama-3.3-70B-Instruct")

result = ai_dev_crew.run()
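
After both prompt variants have been run, the MLflow UI's compare view shows their metrics side by side. The same comparison can be sketched in plain Python once the metric values are in hand. The metric values below are made-up placeholders, not results from this tutorial, and `metric_deltas` is a hypothetical helper.

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


# Placeholder values for illustration only
baseline_run = {"total_tokens": 12000, "overall_score": 8.0}
updated_prompt_run = {"total_tokens": 9000, "overall_score": 5.5, "successful_requests": 7}

# Metrics missing from either run (here, successful_requests) are ignored
for name, delta in metric_deltas(baseline_run, updated_prompt_run).items():
    print(f"{name}: {delta:+}")
```

A negative delta on a quality score alongside a negative delta on token usage would suggest that the cheaper prompt also produced a weaker report, which is the kind of trade-off the compare view is meant to surface.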

# 8. Conclusion

Developing robust and reliable LLM agents requires moving beyond simple script execution. By integrating **Managed MLflow**, we gain crucial capabilities:

*   **Traceability:** Understand exactly what happened during an agent run via traces captured by `mlflow.crewai.autolog()` and logged artifacts (`execution_log.txt`, `error.log`).
*   **Reproducibility:** Capture configurations as parameters, ensuring you know precisely what setup produced a given result.
*   **Comparison:** Systematically evaluate the impact of changes (prompts, LLMs, agent logic) using the MLflow UI's compare feature for parameters, metrics, and artifacts.
*   **Debugging:** Quickly identify and analyze failed runs using status tags and dedicated error logs.
*   **Performance Monitoring:** Track execution time and other metrics to identify bottlenecks and regressions.
*   **Quality & Cost Evaluation:** Provide a central repository to store and analyze quality scores (human or LLM-judged) and resource usage (tokens, cost) for holistic optimization.

MLflow transforms agent development into a more transparent, measurable, and efficient engineering process, enabling faster iteration and more reliable outcomes.