
# MLflow Prompt Optimization

https://docs.databricks.com/aws/ja/mlflow3/genai/prompt-version-mgmt/prompt-registry/automatically-optimize-prompts

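> # Prompt Optimization (Experimental)
> MLflow lets you plug your prompts into advanced prompt optimization techniques through a unified interface, the `mlflow.genai.optimize_prompt()` API. This feature uses evaluation metrics and labeled data to improve prompts automatically. The API currently supports DSPy's MIPROv2 algorithm.
> 
> ## Key Benefits
> - **Unified interface:** Access state-of-the-art prompt optimization algorithms through a framework-neutral interface.
> - **Prompt management:** Integrates with the MLflow Prompt Registry for reusability, version control, and lineage.
> - **Evaluation:** Evaluate prompt performance comprehensively with MLflow's evaluation features.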
> 
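
Conceptually, optimizing a prompt against labeled data means scoring candidate prompts on the examples and keeping the best one. The sketch below illustrates only that selection loop in plain Python. `fake_llm`, `score_prompt`, and `pick_best_prompt` are hypothetical names introduced for illustration, and the real MIPROv2 algorithm also generates the candidate instructions and few-shot examples itself.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: adds the two numbers in 'a+b'."""
    left, right = prompt.rsplit(":", 1)[-1].strip().split("+")
    total = int(left) + int(right)
    # Only a prompt that asks for a bare number gets a bare number back.
    return str(total) if "number only" in prompt else f"The answer is {total}."


def score_prompt(template: str, data: list[dict], llm: Callable[[str], str]) -> float:
    """Fraction of labeled records whose expected answer the prompt reproduces."""
    hits = 0
    for record in data:
        prompt = template
        for name, value in record["inputs"].items():
            prompt = prompt.replace("{{" + name + "}}", str(value))
        hits += llm(prompt) == record["expectations"]["answer"]
    return hits / len(data)


def pick_best_prompt(candidates: list[str], data: list[dict], llm: Callable[[str], str]) -> str:
    """Return the candidate template with the highest score on the labeled data."""
    return max(candidates, key=lambda template: score_prompt(template, data, llm))


labeled = [
    {"inputs": {"question": "2+2"}, "expectations": {"answer": "4"}},
    {"inputs": {"question": "1+2"}, "expectations": {"answer": "3"}},
]
candidates = [
    "Answer the following question: {{question}}",
    "Reply with the number only. Question: {{question}}",
]
best = pick_best_prompt(candidates, labeled, fake_llm)
print(best)
```

The records use the same `inputs` / `expectations` shape as the tutorial data, so the scoring loop mirrors what a scorer sees for each example.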

## Quick Tutorial

In [0]:
%pip install -U "mlflow[databricks]>=3.1.0" databricks-langchain langgraph dspy databricks-agents

%restart_python


In [0]:
import mlflow
from mlflow.entities import Prompt
mlflow.set_registry_uri("databricks-uc")

CATALOG = "workspace"
SCHEMA = "default"

In [0]:
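# Register the initial question-answering prompt.
# The template is Japanese for: "Please answer the following question in Japanese: {{question}}"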
qa_prompt = mlflow.genai.register_prompt(
    name=f"{CATALOG}.{SCHEMA}.qa_prompt",
    template="次の質問に対して日本語で回答してください:{{question}}",
)

qa_prompt

PromptVersion(name=workspace.default.qa_prompt, version=3, template="\u6b21\u306e\u8cea\u554f\u306...)
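
The template marks its variable with double braces, `{{question}}`. As a rough mental model of what formatting such a template does, here is a minimal sketch. `render_template` is a hypothetical helper written for illustration, not MLflow's implementation.

```python
import re


def render_template(template: str, **variables: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rendered = render_template(
    "Answer the following question: {{question}}",
    question="What is Databricks?",
)
print(rendered)
```

Raising on a missing variable is a design choice of this sketch: it surfaces a mismatch between the template and the data early, rather than sending a prompt with a literal placeholder to the model.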

In [0]:
import pandas as pd

# Define question/answer pairs as lists of records (the Japanese strings are the sample data)
train_data = [
    {
        "inputs": {"question": "Databricksとは何ですか？"},
        "expectations": {
            "answer": "Databricksは、データエンジニアリング、データサイエンス、機械学習のための統合データ分析プラットフォームです。"
        },
    },
    {
        "inputs": {"question": "Databricksの主な機能は何ですか？"},
        "expectations": {
            "answer": "Databricksの主な機能には、データの統合、分析、機械学習モデルのトレーニングとデプロイがあります。"
        },
    },
    {
        "inputs": {"question": "Databricksで使用できるプログラミング言語は何ですか？"},
        "expectations": {
            "answer": "Databricksでは、Python、SQL、R、Scalaなどのプログラミング言語を使用できます。"
        },
    },
    {
        "inputs": {"question": "Databricksのノートブックとは何ですか？"},
        "expectations": {
            "answer": "Databricksのノートブックは、データ分析や機械学習のコードを記述、実行、共有するためのインタラクティブな環境です。"
        },
    },
    {
        "inputs": {"question": "Databricksのクラスターとは何ですか？"},
        "expectations": {
            "answer": "Databricksのクラスターは、データ処理や分析のために使用されるコンピューティングリソースの集合です。"
        },
    },
]
eval_data = [
    {
        "inputs": {"question": "DatabricksのDelta Lakeとは何ですか？"},
        "expectations": {
            "answer": "Delta Lakeは、Databricks上で提供される信頼性の高いデータレイクソリューションで、ACIDトランザクションやスキーマエンフォースメントをサポートします。"
        },
    },
    {
        "inputs": {"question": "DatabricksのMLflowとは何ですか？"},
        "expectations": {"answer": "MLflowは、機械学習モデルのライフサイクル管理を支援するオープンソースプラットフォームです。"},
    },
    {
        "inputs": {"question": "Databricksのジョブとは何ですか？"},
        "expectations": {
            "answer": "Databricksのジョブは、スケジュールされたデータ処理タスクやワークフローを自動化するための機能です。"
        },
    },
    {
        "inputs": {"question": "Databricksのワークスペースとは何ですか？"},
        "expectations": {
            "answer": "Databricksのワークスペースは、データ分析や機械学習プロジェクトを管理するためのコラボレーション環境です。"
        },
    },
    {
        "inputs": {"question": "DatabricksのUnity Catalogとは何ですか？"},
        "expectations": {
            "answer": "Unity Catalogは、Databricks上でデータガバナンスとセキュリティを提供するための統合データカタログです。"
        },
    },
]

# Convert the training data to a pandas DataFrame
pdf = pd.DataFrame(train_data)

# Display the DataFrame
display(pdf)

inputs,expectations
List(Databricksとは何ですか？),List(Databricksは、データエンジニアリング、データサイエンス、機械学習のための統合データ分析プラットフォームです。)
List(Databricksの主な機能は何ですか？),List(Databricksの主な機能には、データの統合、分析、機械学習モデルのトレーニングとデプロイがあります。)
List(Databricksで使用できるプログラミング言語は何ですか？),List(Databricksでは、Python、SQL、R、Scalaなどのプログラミング言語を使用できます。)
List(Databricksのノートブックとは何ですか？),List(Databricksのノートブックは、データ分析や機械学習のコードを記述、実行、共有するためのインタラクティブな環境です。)
List(Databricksのクラスターとは何ですか？),List(Databricksのクラスターは、データ処理や分析のために使用されるコンピューティングリソースの集合です。)
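
Each record above pairs an `inputs` dict with an `expectations` dict. A quick structural check before starting a multi-minute optimization run can catch malformed records early. The sketch below is one way to do that; `validate_records` is a hypothetical helper, not part of MLflow.

```python
def validate_records(records: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every record looks well-formed."""
    problems = []
    for index, record in enumerate(records):
        for key in ("inputs", "expectations"):
            value = record.get(key)
            if not isinstance(value, dict) or not value:
                problems.append(f"record {index}: '{key}' must be a non-empty dict")
    return problems


sample = [
    {"inputs": {"question": "What is Databricks?"}, "expectations": {"answer": "A data platform."}},
    {"inputs": {"question": "What is a cluster?"}, "expectations": {}},
    {"inputs": "What is a notebook?"},
]
for problem in validate_records(sample):
    print(problem)
```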


In [0]:
from typing import Any
from mlflow.genai.scorers import Correctness
from mlflow.genai.optimize import OptimizerConfig, LLMParams
from mlflow.genai.scorers import scorer
import os

# Store the current Databricks credential in OPENAI_API_KEY so the OpenAI client can use it
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
os.environ["OPENAI_API_KEY"] = mlflow_creds.token

# Create the built-in scorer object that computes a Correctness score
_correctness = Correctness()

# Scoring function for prompt optimization (correctness check)
@scorer
def correctness(inputs, outputs, expectations):
    expectations = {"expected_response": expectations.get("answer")}
    return (
        _correctness(inputs=inputs, outputs=outputs, expectations=expectations).value
        == "yes"
    )

# Prompt to optimize
prompt = mlflow.genai.load_prompt(f"prompts:/{CATALOG}.{SCHEMA}.qa_prompt/1")

# Optimize the prompt
result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(
        model_name="openai/databricks-llama-4-maverick",
        base_uri=f"{mlflow_creds.host}/serving-endpoints",
    ),
    prompt=prompt,
    train_data=train_data,
    eval_data=eval_data,
    scorers=[correctness],
    optimizer_config=OptimizerConfig(
        num_instruction_candidates=8,
        max_few_show_examples=2,
        # verbose=True,
        autolog=True,
    ),
)

# Print the prompt registry URI of the optimized prompt
print(result.prompt.uri)


2025/06/22 09:16:23 INFO mlflow.genai.optimize.base: Run `9c9d4d9743df4673a55f3bad9e06a308` is created for autologging prompt optimization. Watch the run to track the optimization progress.
2025/06/22 09:16:24 INFO mlflow.genai.optimize.optimizers.dspy_mipro_optimizer: Started optimizing prompt prompts:/workspace.default.qa_prompt/1. Please wait as this process typically takes several minutes, but can take longer with large datasets...
2025/06/22 09:17:36 INFO mlflow.genai.optimize.optimizers.dspy_mipro_optimizer: Prompt optimization completed. Evaluation score did not change. Score 80.0


prompts:/workspace.default.qa_prompt/15


[Trace(trace_id=tr-724e4c59534f92e0a1cc2b81348647f1), Trace(trace_id=tr-a93d85e284f9d130ca11b258cf8f3e0a)]
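
The custom scorer turns the judge's `"yes"` / `"no"` verdict into a boolean, and the log reports a single number (`Score 80.0`) for five evaluation records. The sketch below shows how such a number could arise, assuming the score is the percentage of records for which the scorer returned `True`. That aggregation rule is an assumption, not something the log confirms, and `verdict_to_bool` and `percent_passing` are hypothetical helpers.

```python
def verdict_to_bool(verdict: str) -> bool:
    """Mirror the scorer above: only an exact 'yes' counts as correct."""
    return verdict == "yes"


def percent_passing(results: list[bool]) -> float:
    """Percentage of records that passed (assumed aggregation, see lead-in)."""
    if not results:
        raise ValueError("no results to aggregate")
    return 100.0 * sum(results) / len(results)


verdicts = ["yes", "yes", "no", "yes", "yes"]
score = percent_passing([verdict_to_bool(v) for v in verdicts])
print(score)
```

Under this assumption, four correct answers out of five evaluation records would be reported as 80.0.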

In [0]:
from pprint import pprint
prompt = mlflow.genai.load_prompt(result.prompt.uri)

print(prompt.template)

"<system>\nYour input fields are:\n1. `question` (str):\nYour output fields are:\n1. `answer` (str):\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  \"answer\": \"{answer}\"\n}\nIn adhering to this structure, your objective is: \n        {{question}}\n</system>\n\n<user>\n[[ ## question ## ]]\n{{question}}\n\nRespond with a JSON object in the following order of fields: `answer`.\n</user>"


In [0]:
import mlflow
from openai import OpenAI
import codecs

# Enable MLflow autologging so that the application emits traces
mlflow.openai.autolog()

# Connect to the Databricks LLM through the OpenAI client, using the same credentials as the running notebook
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token, base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# Load the prompts from the registry
first_prompt = mlflow.genai.load_prompt(f"prompts:/{CATALOG}.{SCHEMA}.qa_prompt/1")
optimized_prompt = mlflow.genai.load_prompt(result.prompt.uri)
endpoint = "databricks-llama-4-maverick"

def predict(prompt):
    formatted_prompt = prompt.format(question="Databricksとは何ですか?")
    
    # Call the LLM
    response = client.chat.completions.create(
        model=endpoint,
        messages=[
            {
                "role": "user",
                "content": formatted_prompt,
            },
        ],
    )
    return response.choices[0].message.content


for p in [first_prompt, optimized_prompt]:
    print(predict(p))

Databricksは、Apache Sparkを利用したビッグデータ処理と分析のためのプラットフォームおよびサービスを提供する企業、およびそのプラットフォームの名称です。Databricksは、Sparkの開発に携わった人々によって設立されました。Sparkは、大規模なデータセットを高速に処理できるオープンソースの分散処理フレームワークです。

Databricksが提供するプラットフォームは、データエンジニアリング、データサイエンス、およびビジネスユーザーを対象に設計されており、Apache Sparkを用いたデータの取り込み、変換、分析を容易に行えるようにします。主な特徴としては、以下のようなものがあります：

1. **インタラクティブなワークスペース**: データの探索、変換、分析をインタラクティブに行うための環境を提供します。ノートブック形式での作業が可能で、Python、R、Scala、SQLなどの言語に対応しています。

2. **データの取り込みと統合**: 多様なデータソースからのデータの取り込みをサポートし、データレイクやデータウェアハウスへのデータ統合を容易にします。

3. **高度な分析**: Apache Sparkの力を利用して、大規模なデータセットに対する高度な分析や機械学習を高速に実行できます。

4. **コラボレーション機能**: データサイエンティスト、エンジニア、ビジネスユーザー間のコラボレーションを促進するための機能が備わっています。

5. **セキュリティとガバナンス**: 企業がデータを安全に扱えるように、アクセス制御や監査ログなどのセキュリティ機能を提供しています。

Databricksは、クラウド（AWS、Azure、GCP）上でサービスとして提供されるほか、オンプレミス環境やプライベートクラウドへのデプロイメントもサポートしています。これにより、企業は自社のニーズや既存のITインフラに合わせた形で、ビッグデータと分析技術を活用することができます。
{
  "answer": "Databricksは、Apache Sparkの創始者たちが開発した、ビッグデータ処理および分析プラットフォームです。データエンジニアリング、データサイエンス、ビジネスアナリティクスのための統合環境を提供し、データの処理、分析、可視

[Trace(trace_id=tr-db0ff0b3fcc5cf7dae695eaba7000084), Trace(trace_id=tr-ab8011b8960e57eba06d88a5539b8eed)]
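
The optimized template asks the model for a JSON object with an `answer` field, so the second response above is JSON rather than plain text. A caller that wants plain text from either prompt could unwrap it as sketched below. `extract_answer` is a hypothetical helper, and falling back to the raw text is a choice made for this example.

```python
import json


def extract_answer(raw: str) -> str:
    """Return the 'answer' field of a JSON response, or the raw text otherwise."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Plain-text (or truncated) responses are returned unchanged.
        return raw.strip()
    if isinstance(parsed, dict) and "answer" in parsed:
        return str(parsed["answer"])
    return raw.strip()


print(extract_answer('{"answer": "Databricks is a data platform."}'))
print(extract_answer("Databricks is a data platform. "))
```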

## Official Notebook

https://docs.databricks.com/aws/ja/notebooks/source/mlflow/prompt-optimization.html

In [0]:
%pip install -U "mlflow[databricks]>=3.1.0" langchain-community langchain-openai beautifulsoup4 langgraph dspy databricks-agents

%restart_python


In [0]:
# TODO: If necessary, change the catalog and schema name here
CATALOG = "workspace"
SCHEMA = "default"

In [0]:
import mlflow
from mlflow.entities import Prompt
mlflow.set_registry_uri("databricks-uc")

In [0]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

split_docs = text_splitter.split_documents(docs)
print(f"Generated {len(split_docs)} documents.")

Created a chunk of size 1003, which is longer than the specified 1000


Generated 13 documents.


In [0]:
from langchain.chat_models import init_chat_model

mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
llm = init_chat_model(
    # "databricks-llama-4-maverick",
    "databricks-meta-llama-3-1-405b-instruct",
    model_provider="openai",
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints",
)


In [0]:
# First prompt for summarization.
summary_prompt = mlflow.genai.register_prompt(
    name=f"{CATALOG}.{SCHEMA}.summary_prompt",
    template="Write a concise summary of the following:{{content}}",
)

In [0]:
summary_prompt

PromptVersion(name=workspace.default.summary_prompt, version=1, template="Write a concise summary of th...)

In [0]:
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.messages import HumanMessage, SystemMessage

summary_chain = llm | StrOutputParser()

@mlflow.trace()
def call_summary_chain(content):
  return summary_chain.invoke([HumanMessage(summary_prompt.format(content=content))])

In [0]:
# Second prompt for topic extraction.
topic_prompt = mlflow.genai.register_prompt(name=f"{CATALOG}.{SCHEMA}.topic_prompt",
                       template="""
The following is the summary:
{{summary}}
Extract the main topic in a few words.
Return the response in JSON format: {"topic": "..."}
""")

topic_chain = llm | JsonOutputParser()

@mlflow.trace()
def call_topic_chain(summary):
  return topic_chain.invoke([HumanMessage(topic_prompt.format(summary=summary))])

In [0]:
from langchain_core.messages import HumanMessage, SystemMessage

@mlflow.trace
def agent(content):
  summary = call_summary_chain(content=content)
  return call_topic_chain(summary=summary)["topic"]

In [0]:
# Enable Autologging
mlflow.langchain.autolog()

In [0]:
# Run the agent on every chunk; print the error and continue when a call fails
for doc in split_docs:
  try:
    print(agent(doc.page_content))
  except Exception as e:
    print(e)

LLM-powered Autonomous Agents
AI Planning and Self-Reflection
missing < at position 482 (line 1, column 483)
bad escape \l at position 4430 (line 28, column 180)
'topic'
'topic'
'topic'
'topic'
Super Mario Game
'topic'
'topic'
'topic'
LLM-powered Autonomous Agents


[Trace(trace_id=tr-0eb1475a3bdf2dd5a95794f3219ec1eb), Trace(trace_id=tr-5f1c57bea6d5197e1c15ae406495e767), Trace(trace_id=tr-462dd004173bd3b9f17434c5f0ca1ad7), Trace(trace_id=tr-16e855eabcef7a8279abbbe3f9863302), Trace(trace_id=tr-cc07480d4b172f4c4db60b15756e0999), Trace(trace_id=tr-b4ebe25baa4ca90ccea3486c38fa8bdd), Trace(trace_id=tr-d0d2a03072127ab9b1530fb4defee860), Trace(trace_id=tr-09399bf2adb47ec7ca7450763dc554b6), Trace(trace_id=tr-10b2e71d21977e093ffec7373d2e1685), Trace(trace_id=tr-71a3842b32fbcc5399619dc9123caf6d)]
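
Several runs above print `'topic'`, which is the message of a `KeyError` raised when the parsed JSON has no `topic` key. A more defensive lookup could return `None` instead of raising. The sketch below shows one such lookup; `safe_topic` is a hypothetical helper, and it handles only the missing-key case, not the JSON parsing errors also shown above.

```python
from typing import Any, Optional


def safe_topic(parsed: Any) -> Optional[str]:
    """Return a non-empty 'topic' string from parsed JSON, or None if absent."""
    if isinstance(parsed, dict):
        topic = parsed.get("topic")
        if isinstance(topic, str) and topic.strip():
            return topic.strip()
    return None


print(safe_topic({"topic": "Super Mario Game"}))
print(safe_topic({"summary": "no topic key here"}))
```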

## Dataset Creation

In [0]:
import mlflow

# Extract the inputs and outputs of the second LLM call
traces = mlflow.search_traces(extract_fields=[
  "call_topic_chain.inputs",
  "call_topic_chain.outputs",
])

In [0]:
traces.head(10)

Unnamed: 0,trace_id,trace,client_request_id,state,request_time,execution_duration,request,response,trace_metadata,tags,spans,assessments,call_topic_chain.inputs,call_topic_chain.outputs
0,tr-4047c21100faf7a16a91eb95c690e727,Trace(trace_id=tr-4047c21100faf7a16a91eb95c690...,tr-4047c21100faf7a16a91eb95c690e727,TraceState.OK,1750501933630,6076,"{'content': 'Or @article{weng2023agent,  titl...",LLM-powered Autonomous Agents,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'QEfCEQD696FqkeuVxpDnJw==', 'spa...",[],{'summary': 'The article discusses the concept...,{'topic': 'LLM-powered Autonomous Agents'}
1,tr-d96123695e54adc72b80036c20a8d10b,Trace(trace_id=tr-d96123695e54adc72b80036c20a8...,tr-d96123695e54adc72b80036c20a8d10b,TraceState.ERROR,1750501929342,3990,{'content': 'Finite context length: The restri...,,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': '2WEjaV5UrccrgANsIKjRCw==', 'spa...",[],{'summary': 'Here is a concise summary: LLM-p...,{}
2,tr-124d7f6c158ef804b22206dfa219374c,Trace(trace_id=tr-124d7f6c158ef804b22206dfa219...,tr-124d7f6c158ef804b22206dfa219374c,TraceState.ERROR,1750501922995,6021,"{'content': 'Conversatin samples: [  {  ""r...",,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'Ek1/bBWO+ASyIgbfohk3TA==', 'spa...",[],{'summary': 'Here is a concise summary of the ...,{}
3,tr-71a3842b32fbcc5399619dc9123caf6d,Trace(trace_id=tr-71a3842b32fbcc5399619dc9123c...,tr-71a3842b32fbcc5399619dc9123caf6d,TraceState.ERROR,1750501915653,7070,{'content': 'You will get instructions for cod...,,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'caOEKzL7zFOZYZ3JEjyvbQ==', 'spa...",[],{'summary': 'Here is a concise summary: **Tas...,{}
4,tr-10b2e71d21977e093ffec7373d2e1685,Trace(trace_id=tr-10b2e71d21977e093ffec7373d2e...,tr-10b2e71d21977e093ffec7373d2e1685,TraceState.OK,1750501908439,6940,{'content': 'You should only respond in JSON f...,Super Mario Game,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'ELLnHSGXfgk//sc3PS4WhQ==', 'spa...",[],{'summary': 'Here is a summary of the conversa...,{'topic': 'Super Mario Game'}
5,tr-09399bf2adb47ec7ca7450763dc554b6,Trace(trace_id=tr-09399bf2adb47ec7ca7450763dc5...,tr-09399bf2adb47ec7ca7450763dc554b6,TraceState.ERROR,1750501902566,5614,"{'content': 'Commands: 1. Google Search: ""goog...",,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'CTmb8q20fsfKdFB2PcVUtg==', 'spa...",[],{'summary': 'The text outlines a set of 20 com...,{}
6,tr-d0d2a03072127ab9b1530fb4defee860,Trace(trace_id=tr-d0d2a03072127ab9b1530fb4defe...,tr-d0d2a03072127ab9b1530fb4defee860,TraceState.ERROR,1750501895202,7057,{'content': 'inquired about current trends in ...,,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': '0NKgMHISermxUw+03v7oYA==', 'spa...",[],{'summary': 'Here is a concise summary of the ...,{}
7,tr-b4ebe25baa4ca90ccea3486c38fa8bdd,Trace(trace_id=tr-b4ebe25baa4ca90ccea3486c38fa...,tr-b4ebe25baa4ca90ccea3486c38fa8bdd,TraceState.ERROR,1750501886627,8253,{'content': '(3) Task execution: Expert models...,,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'tOviW6pMqQzOo0hsOPqL3Q==', 'spa...",[],{'summary': 'Here is a concise summary of the ...,{}
8,tr-cc07480d4b172f4c4db60b15756e0999,Trace(trace_id=tr-cc07480d4b172f4c4db60b15756e...,tr-cc07480d4b172f4c4db60b15756e0999,TraceState.ERROR,1750501879556,6831,"{'content': 'Comparison of MIPS algorithms, me...",,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'zAdIDUsXL0xNtgsVdW4JmQ==', 'spa...",[],{'summary': 'Here is a concise summary of the ...,{}
9,tr-16e855eabcef7a8279abbbe3f9863302,Trace(trace_id=tr-16e855eabcef7a8279abbbe3f986...,tr-16e855eabcef7a8279abbbe3f9863302,TraceState.ERROR,1750501879276,5,{'content': 'Sensory Memory: This is the earli...,,{'mlflow.databricks.workspaceID': '17655129088...,{'mlflow.artifactLocation': 'dbfs:/databricks/...,"[{'trace_id': 'FuhV6rzveoJ5q7vj+YYzAg==', 'spa...",[],,


In [0]:
from mlflow.genai import datasets

EVAL_DATASET_NAME=f"{CATALOG}.{SCHEMA}.data"
dataset = datasets.create_dataset(EVAL_DATASET_NAME)

In [0]:
dataset

<mlflow.genai.datasets.evaluation_dataset.EvaluationDataset at 0xffff2c06ba10>

In [0]:
# Create a dataset by treating the agent outputs as the default expectations.
traces = traces.rename(
    columns={
      "call_topic_chain.inputs": "inputs",
      "call_topic_chain.outputs": "expectations",
    }
)[["inputs", "expectations"]]
traces = traces.dropna()
dataset.merge_records(traces)

<mlflow.genai.datasets.evaluation_dataset.EvaluationDataset at 0xffff2b57b410>
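
Note that `dropna()` removes rows with missing values, but an empty dict `{}` is not a missing value, so failed traces whose outputs are `{}` can still end up in the dataset (one such record is visible in the labeling output further down). The sketch below filters them out on plain Python records. `has_usable_expectations` is a hypothetical helper, and whether to drop or hand-label such records is a judgment call.

```python
def has_usable_expectations(record: dict) -> bool:
    """True when the record carries a non-empty expectations dict."""
    expectations = record.get("expectations")
    return isinstance(expectations, dict) and len(expectations) > 0


records = [
    {"inputs": {"summary": "An article about agents."}, "expectations": {"topic": "Autonomous Agents"}},
    {"inputs": {"summary": "A failed call."}, "expectations": {}},
    {"inputs": {"summary": "A trace without outputs."}, "expectations": None},
]
clean_records = [record for record in records if has_usable_expectations(record)]
print(len(clean_records))
```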


## Labeling

In [0]:
dataset = datasets.get_dataset(EVAL_DATASET_NAME)
dataset.merge_records([])

<mlflow.genai.datasets.evaluation_dataset.EvaluationDataset at 0xffff2ae59f90>

In [0]:
dataset = dataset.to_df()
dataset.head()

Unnamed: 0,dataset_record_id,inputs,expectations,source,tags,create_time,last_update_time,created_by,last_updated_by
0,0db4e3cc-5799-4290-b93f-e688c3003e8f,{'summary': 'Here is a concise summary of the ...,{'topic': 'Autonomous Agents'},,,2025-06-21T10:34:46.638Z,2025-06-21T10:34:46.638Z,isanakamishiro@gmail.com,isanakamishiro@gmail.com
1,19b313e3-c2b7-44f6-96fc-ad402f5f2f97,{'summary': 'The article discusses the concept...,{'topic': 'LLM-powered Autonomous Agents'},,,2025-06-21T10:34:46.638Z,2025-06-21T10:34:46.638Z,isanakamishiro@gmail.com,isanakamishiro@gmail.com
2,29980eb3-82d4-4c98-a403-a5eb5b74140d,{'summary': 'Here is a concise summary of the ...,{'topic': 'AI models and applications'},,,2025-06-21T10:34:46.638Z,2025-06-21T10:34:46.638Z,isanakamishiro@gmail.com,isanakamishiro@gmail.com
3,2c2cf9ed-b87c-458b-a6df-89dacbe2fa69,{'summary': 'Here is a concise summary of the ...,{},,,2025-06-21T10:34:46.638Z,2025-06-21T10:34:46.638Z,isanakamishiro@gmail.com,isanakamishiro@gmail.com
4,2f51fcb7-2cb2-4d31-a2d2-0a3e4af780c3,{'summary': 'Here's a concise summary: The te...,{'topic': 'LLMs Tool Integration'},,,2025-06-21T10:34:46.638Z,2025-06-21T10:34:46.638Z,isanakamishiro@gmail.com,isanakamishiro@gmail.com


## Optimize

In [0]:
import os
import mlflow
from typing import Any
from mlflow.genai.scorers import scorer
from mlflow.genai.optimize import OptimizerConfig, LLMParams

mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
os.environ["OPENAI_API_KEY"] = mlflow_creds.token


@scorer
def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    return expectations == outputs

prompt = mlflow.genai.register_prompt(
    name=f"{CATALOG}.{SCHEMA}.qa",
    template="Answer the following question: {{question}}",
)

result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(
        model_name="openai/databricks-meta-llama-3-1-405b-instruct",
        base_uri=f"{mlflow_creds.host}/serving-endpoints",
    ),
    train_data=[
        {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}}
        for i in range(100)
    ],
    scorers=[exact_match],
    prompt=prompt.uri,
    optimizer_config=OptimizerConfig(num_instruction_candidates=5),
)

print(result.prompt.template)

  prompt: PromptVersion = load_prompt(prompt)
2025/06/21 10:41:26 INFO mlflow.genai.optimize.optimizers.dspy_mipro_optimizer: Started optimizing prompt prompts:/workspace.default.qa/2. Please wait as this process typically takes several minutes, but can take longer with large datasets...
2025/06/21 10:45:47 INFO mlflow.genai.optimize.optimizers.dspy_mipro_optimizer: Prompt optimization completed. Evaluation score did not change. Score 0.0
  optimized_prompt = register_prompt(


"<system>\nYour input fields are:\n1. `question` (str):\nYour output fields are:\n1. `answer` (str):\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  \"answer\": \"{answer}\"\n}\nIn adhering to this structure, your objective is: \n        Answer the question that will be provided.\n</system>\n\n<user>\n[[ ## question ## ]]\n{{question}}\n\nRespond with a JSON object in the following order of fields: `answer`.\n</user>"


In [0]:
result.prompt

PromptVersion(name=workspace.default.qa, version=3, template="<system>\nYour input fields a...)
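
To make the synthetic data and the scorer in this cell concrete, the sketch below rebuilds a few records and calls the comparison directly, without the `@scorer` decorator. It only illustrates how strict dictionary equality behaves; it does not diagnose the `Score 0.0` reported in the log above.

```python
from typing import Any


def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    """Same comparison as the scorer above, shown as a plain function."""
    return expectations == outputs


# Same generator expression as in the cell, limited to three records.
train_data = [
    {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}}
    for i in range(3)
]
print(train_data[0])

# Equality is strict: the value type and any extra keys both matter.
print(exact_match({"answer": "1"}, {"answer": "1"}))
print(exact_match({"answer": "1"}, {"answer": 1}))
print(exact_match({"answer": "1"}, {"answer": "1", "reasoning": "0+1=1"}))
```

Because the expected answers are strings, an output that holds the integer `1`, or that carries additional fields, does not count as a match under this scorer.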

In [0]:
import os
from typing import Any
import mlflow
from mlflow.genai.scorers import Correctness
from mlflow.genai.optimize import OptimizerConfig, LLMParams
from mlflow.genai.scorers import scorer

_correctness = Correctness()


@scorer
def correctness(inputs, outputs, expectations):
    expectations = {"expected_response": expectations.get("topic")}
    return (
        _correctness(inputs=inputs, outputs=outputs, expectations=expectations).value
        == "yes"
    )


# Optimize the prompt
result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(
        model_name="openai/databricks-meta-llama-3-3-70b-instruct",
        base_uri=f"{mlflow_creds.host}/serving-endpoints",
    ),
    prompt=topic_prompt,
    train_data=dataset,
    scorers=[correctness],
    optimizer_config=OptimizerConfig(
        num_instruction_candidates=8,
        max_few_show_examples=2,
        verbose=True,
    ),
)

# The optimized prompt is automatically registered as a new version
# Open the prompt registry web site to check the new prompt
print(f"The new prompt URI: {result.prompt.uri}")

2025/06/21 10:50:14 INFO mlflow.genai.optimize.optimizers.dspy_mipro_optimizer: Started optimizing prompt prompts:/workspace.default.topic_prompt/2. Please wait as this process typically takes several minutes, but can take longer with large datasets...
2025/06/21 10:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/06/21 10:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/06/21 10:50:14 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=8 sets of demonstrations...


Bootstrapping set 1/8
Bootstrapping set 2/8
Bootstrapping set 3/8


 40%|████      | 2/5 [00:05<00:08,  2.81s/it] 40%|████      | 2/5 [00:05<00:08,  2.78s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/8


 20%|██        | 1/5 [00:01<00:07,  2.00s/it] 20%|██        | 1/5 [00:01<00:07,  2.00s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/8


 20%|██        | 1/5 [00:02<00:08,  2.11s/it] 20%|██        | 1/5 [00:02<00:08,  2.12s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/8


2025/06/21 10:50:24 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'summary': "Here is a concise summary of the provided text:\n\n**Task Execution and Response Generation**\n\n* Expert models execute tasks and log results, which are then used to generate a response for the user.\n* The response includes a straightforward answer to the user's request, followed by a description of the task process, analysis, and model inference results.\n\n**Challenges and Benchmarks**\n\n* To apply HuggingGPT in real-world scenarios, challenges such as efficiency improvement, stability, and long context windows need to be addressed.\n* API-Bank is a benchmark that evaluates the performance of tool-augmented LLMs using 53 APIs and 264 annotated dialogues.\n\n**API-Bank Workflow**\n\n* LLMs make decisions on whether to call an API, identify the right API, and respond based on API results.\n* The benchmark evaluates tool use capabilities at three levels: calling an API, retr

Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 7/8


 20%|██        | 1/5 [00:02<00:09,  2.34s/it] 20%|██        | 1/5 [00:02<00:09,  2.34s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/8


 40%|████      | 2/5 [00:04<00:07,  2.37s/it] 40%|████      | 2/5 [00:04<00:07,  2.41s/it]
2025/06/21 10:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/06/21 10:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


2025/06/21 10:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=8 instructions...



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <command-8398490772614388>, line 21
     14     return (
     15         _correctness(inputs=inputs, outputs=outputs, expectations=expectations).value
     16         == "yes"
     17     )
     20 # Optimize the prompt
---> 21 result = mlflow.genai.optimize_prompt(
     22     target_llm_params=LLMParams(
     23         model_name="openai/databricks-meta-llama-3-3-70b-instruct",
     24         base_uri