In [1]:
%reload_ext autoreload
%autoreload 2

import getpass
import os
import sys
from pathlib import Path

import mlflow
import openai
import pandas as pd

In [2]:
os.environ["MLFLOW_TRACKING_URI"] = "http://0.0.0.0:5001"
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API key:")

Enter OpenAI API key: ········


In [3]:
mlflow.set_experiment("movie-recommendation-experiment")

2025/06/19 13:53:11 INFO mlflow.tracking.fluent: Experiment with name 'movie-recommendation-experiment' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/973507716245173164', creation_time=1750315991921, experiment_id='973507716245173164', last_update_time=1750315991921, lifecycle_stage='active', name='movie-recommendation-experiment', tags={}>

In [4]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "I want a movie like The Matrix, but with a deeper philosophical ending.",
            "Something emotional, like The Pursuit of Happyness, but without too much sadness.",
            "A crime movie like Breaking Bad, but focused on the psychological transformation.",
            "I want a romantic movie that’s not cheesy and has a realistic ending.",
            "A slow-paced sci-fi movie like Arrival, with a strong emotional core.",
        ],
        "ground_truth": [
            "A science fiction movie set in a dystopian future where reality is an elaborate simulation controlled by intelligent machines. The protagonist, a disillusioned but curious individual, uncovers the truth and joins a rebellion to free humanity. The narrative explores themes of free will, perception vs. reality, and existential purpose. The film culminates in a thought-provoking ending that questions the very nature of consciousness and what it means to be human.",
            "An inspiring drama following a determined protagonist who faces numerous life challenges but remains hopeful and resilient. The story focuses on themes of perseverance, parenthood, and personal growth. Despite setbacks, the film maintains a positive and uplifting tone, with a strong emotional core and a satisfying, heartwarming conclusion that celebrates human spirit and triumph over adversity.",
            "A gritty crime drama centered around a morally ambiguous protagonist who slowly descends into a criminal lifestyle. The narrative deeply examines the psychological evolution of the main character, portraying how desperation, power, and ego can alter one’s identity. The film is dark, tense, and introspective, with strong character development and a focus on inner conflict rather than action-driven plot.",
            "A grounded romantic drama portraying the evolving relationship between two complex individuals. The story avoids clichés and idealized portrayals, instead focusing on authentic emotional connection, communication struggles, and the challenges of building intimacy. The ending is nuanced and emotionally resonant, reflecting real-life complexities—love that feels earned, even if not perfect or everlasting.",
            "A contemplative science fiction film where the central conflict revolves around a non-violent first contact with an alien species. The pacing is deliberate, allowing time for introspection, linguistic puzzles, and emotional storytelling. Themes include memory, loss, communication, and the nonlinear nature of time. The protagonist’s emotional journey is central, with a subtle but powerful payoff that lingers in the viewer’s mind.",
        ],
    }
)

In [21]:
SYSTEM_PROMPT="""
You are helping a user reformulate their vague movie request into a detailed, clear expression of the kind of movie they want to watch.

Your task is to rewrite the user’s request as a single, rich paragraph that describes the desired movie in detail. Do not mention any specific movies, characters, actors, or directors. Do not invent fake titles or fictional plots.

Describe the kind of movie the user is looking for by covering:

- Genre and tone
- Narrative focus or themes
- Emotional experience they want
- Pacing and atmosphere
- Type of ending they expect

Write from the user's perspective using first-person language (e.g., “I’m looking for…”). Respond only with the paragraph, no extra text, titles, or headings.
"""
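
When the model is logged in the next cell, the user message is a template with a single placeholder (`{description}`), and each evaluation request is substituted into it before the chat completion call. The sketch below illustrates that substitution with plain `str.format`. It is a conceptual illustration, not MLflow's internal code, and `SYSTEM`, `MESSAGES_TEMPLATE`, and `fill_messages` are hypothetical names introduced for this example.

```python
# Shortened stand-in for SYSTEM_PROMPT (assumption: the real prompt contains no braces).
SYSTEM = "Rewrite the user's request as a single detailed paragraph."

MESSAGES_TEMPLATE = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "{description}"},
]


def fill_messages(template, **variables):
    """Return a new message list with every placeholder replaced by its value."""
    return [
        {"role": message["role"], "content": message["content"].format(**variables)}
        for message in template
    ]


request = "A slow-paced sci-fi movie like Arrival, with a strong emotional core."
messages = fill_messages(MESSAGES_TEMPLATE, description=request)
print(messages[1]["content"])
```

Because `str.format` treats braces as placeholders, a prompt that contains literal `{` or `}` would need them doubled in a scheme like this one.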

In [22]:
with mlflow.start_run() as run:
    system_prompt = SYSTEM_PROMPT
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        name="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{description}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics

2025/06/19 14:19:04 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-0f015ef569384a8d94d3947551dea793
2025/06/19 14:19:04 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.
2025/06/19 14:19:04 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/06/19 14:19:09 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


🏃 View run worried-smelt-482 at: http://0.0.0.0:5001/#/experiments/973507716245173164/runs/d1b2cadf35a14faab8aafc63bfa924e1
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/973507716245173164


{'exact_match/v1': 0.0}
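
An `exact_match/v1` score of 0.0 is expected for this task: the metric rewards outputs that are identical to the target string, and a free-form paragraph will almost never reproduce the reference text character for character. The sketch below shows the idea with shortened stand-in strings rather than the real outputs. The `exact_match` helper is illustrative and is not MLflow's implementation.

```python
def exact_match(predictions, targets):
    """Fraction of predictions that are identical to their target string."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


# Stand-in strings: same meaning, different wording.
targets = ["A gritty crime drama centered around a morally ambiguous protagonist."]
predictions = ["I'm looking for a gripping crime drama about a morally ambiguous lead."]

print(exact_match(predictions, targets))  # 0.0
print(exact_match(targets, targets))  # 1.0
```

This is why the later cells switch to LLM-judged metrics, which score meaning rather than surface form.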

In [23]:
results.tables["eval_results_table"]



index,inputs,ground_truth,outputs,token_count
0,"I want a movie like The Matrix, but with a dee...",A science fiction movie set in a dystopian fut...,I’m looking for a thought-provoking science fi...,172
1,"Something emotional, like The Pursuit of Happy...",An inspiring drama following a determined prot...,I’m looking for an emotional drama that captur...,125
2,"A crime movie like Breaking Bad, but focused o...",A gritty crime drama centered around a morally...,I’m looking for a gripping crime drama that di...,194
3,I want a romantic movie that’s not cheesy and ...,A grounded romantic drama portraying the evolv...,I’m looking for a romantic movie that blends h...,155
4,"A slow-paced sci-fi movie like Arrival, with a...",A contemplative science fiction film where the...,I’m looking for a slow-paced sci-fi movie that...,146
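
The system prompt forbids naming specific movies, and that rule can also be spot-checked without an LLM judge. The sketch below is a minimal heuristic under stated assumptions: `KNOWN_TITLES` is a hand-written list of the titles that appear in the evaluation inputs, `leaked_titles` is a hypothetical helper, and the two sample strings are stand-ins rather than real model outputs.

```python
import re

# Titles mentioned in the evaluation inputs (hand-copied, assumption for this sketch).
KNOWN_TITLES = ["The Matrix", "The Pursuit of Happyness", "Breaking Bad", "Arrival"]


def leaked_titles(text, titles=KNOWN_TITLES):
    """Return the titles that occur in text as whole, case-sensitive phrases."""
    found = []
    for title in titles:
        pattern = r"\b" + re.escape(title) + r"\b"
        if re.search(pattern, text):
            found.append(title)
    return found


clean = "I'm looking for a slow-paced science fiction film with a strong emotional core."
leaky = "I'm looking for something like Arrival, but slower."

print(leaked_titles(clean))  # []
print(leaked_titles(leaky))  # ['Arrival']
```

A check like this only catches titles it already knows about, and a title that is also an ordinary word can produce false positives at the start of a sentence, so it complements a judged metric rather than replacing it.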


In [24]:
from mlflow.metrics.genai import EvaluationExample, answer_similarity

example = EvaluationExample(
    input="Something emotional, like The Pursuit of Happyness, but without too much sadness.",
    output=(
        "I’m looking for an emotional drama that captures the resilience of the human spirit, "
        "focusing on themes of hope and perseverance in the face of adversity. I want it to tell "
        "a heartfelt story about an individual's journey, showcasing their struggles but emphasizing "
        "uplifting moments and personal triumphs that bring warmth and inspiration. The pacing should "
        "be steady, allowing me to connect deeply with the characters and their experiences, while "
        "maintaining an atmosphere that feels both relatable and encouraging. I expect an ending that "
        "leaves me with a sense of fulfillment and optimism, where the protagonist not only overcomes "
        "their challenges but also experiences growth and newfound joy in their life."
    ),
    score=5,
    justification="The reformulated version captures the user's emotional intent, desired themes, pacing, and tone in a rich and specific way. It avoids clichés and does not reference real movies, making it ideal for embedding.",
    grading_context={
        "targets": (
            "I'm looking for an emotionally uplifting movie that follows a character overcoming challenges in life, "
            "but without being overwhelmingly sad. It should focus on themes like perseverance, hope, and personal growth. "
            "I want the story to be inspiring and heartwarming, with meaningful relationships and a positive tone. "
            "The ending should leave me feeling hopeful and motivated, not depressed or drained."
        )
    },
)

# Create the answer similarity metric
answer_similarity_metric = answer_similarity(
    model="openai:/gpt-4",
    examples=[example]
)

print(answer_similarity_metric)

EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_similarity based on the rubric
justification: Your reasoning about the model's answer_similarity score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_similarity based on the input and output.
A definition of answer_similarity and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them be

In [25]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[answer_similarity_metric],  # use the answer similarity metric created above
    )
results.metrics

2025/06/19 14:29:16 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-0f015ef569384a8d94d3947551dea793
2025/06/19 14:29:16 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.
2025/06/19 14:29:17 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/06/19 14:29:22 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...

🏃 View run sneaky-steed-684 at: http://0.0.0.0:5001/#/experiments/973507716245173164/runs/5dd06ce0920849e1abd8083848beaa47
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/973507716245173164





{'exact_match/v1': 0.0,
 'answer_similarity/v1/mean': np.float64(5.0),
 'answer_similarity/v1/variance': np.float64(0.0),
 'answer_similarity/v1/p90': np.float64(5.0)}

In [26]:
results.tables["eval_results_table"]



index,inputs,ground_truth,outputs,token_count,answer_similarity/v1/score,answer_similarity/v1/justification
0,"I want a movie like The Matrix, but with a dee...",A science fiction movie set in a dystopian fut...,I'm looking for a thought-provoking science fi...,150,5,The output closely aligns with the provided ta...
1,"Something emotional, like The Pursuit of Happy...",An inspiring drama following a determined prot...,I'm looking for an emotionally uplifting drama...,153,5,The output closely aligns with the provided ta...
2,"A crime movie like Breaking Bad, but focused o...",A gritty crime drama centered around a morally...,I'm looking for a gritty crime drama that delv...,183,5,The model's output aligns closely with the pro...
3,I want a romantic movie that’s not cheesy and ...,A grounded romantic drama portraying the evolv...,I’m looking for a romantic drama that strikes ...,148,5,The output closely aligns with the provided ta...
4,"A slow-paced sci-fi movie like Arrival, with a...",A contemplative science fiction film where the...,I’m looking for a slow-paced science fiction m...,145,5,The output closely aligns with the provided ta...


In [27]:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

embedding_friendly_metric = make_genai_metric(
    name="embedding_friendly_rewriting",
    definition=(
        "This metric evaluates how well a user’s vague or informal movie request is reformulated into a clear, detailed, and embedding-friendly description. "
        "An ideal reformulation avoids specific movie titles, character names, or fictional details, and instead emphasizes genre, tone, themes, pacing, emotional goals, and the type of ending the user is looking for."
    ),
    grading_prompt=(
        "Embedding-friendly rewriting: The goal is to transform a vague user request into a richly detailed, general description of the desired movie, "
        "without referencing specific titles or inventing fake storylines. Use the following scale: "
        "- Score 1: Response is vague, off-topic, or mostly repeats the original input without useful elaboration. "
        "- Score 2: Response adds some detail but includes specific movies, names, or plot points. "
        "- Score 3: Response is mostly general and meaningful but lacks depth or has slight specific references. "
        "- Score 4: Response is clear, general, and detailed, capturing the user’s intent well without referencing specific titles. "
        "- Score 5: Response is richly detailed, entirely general, and fully expresses the user's intent with precision. Ideal for semantic search embeddings."
    ),
    examples=[
        EvaluationExample(
            input="Something emotional, like The Pursuit of Happyness, but without too much sadness.",
            output=(
                "I’m looking for an emotional drama that captures the resilience of the human spirit, focusing on themes of hope and perseverance in the face of adversity. "
                "I want it to tell a heartfelt story about an individual's journey, showcasing their struggles but emphasizing uplifting moments and personal triumphs that bring warmth and inspiration. "
                "The pacing should be steady, allowing me to connect deeply with the characters and their experiences, while maintaining an atmosphere that feels both relatable and encouraging. "
                "I expect an ending that leaves me with a sense of fulfillment and optimism, where the protagonist not only overcomes their challenges but also experiences growth and newfound joy in their life."
            ),
            score=5,
            justification=(
                "The response is highly detailed, avoids specific movie names or characters, and effectively captures the user's emotional and thematic intent. "
                "It reads like a structured expression of what the user wants and is ideal for embedding-based matching."
            ),
        )
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(embedding_friendly_metric)


EvaluationMetric(name=embedding_friendly_rewriting, greater_is_better=True, long_name=embedding_friendly_rewriting, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's embedding_friendly_rewriting based on the rubric
justification: Your reasoning about the model's embedding_friendly_rewriting score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called embedding_friendly_rewriting based on the input and output.
A definition of embedding_friendly_rewriting and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for refere

In [29]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[embedding_friendly_metric],  # use the embedding-friendly rewriting metric created above
    )
print(results.metrics)

2025/06/19 14:31:43 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-0f015ef569384a8d94d3947551dea793
2025/06/19 14:31:43 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.
2025/06/19 14:31:43 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/06/19 14:31:47 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...

🏃 View run gentle-squid-204 at: http://0.0.0.0:5001/#/experiments/973507716245173164/runs/d93fdbec448848afb54bb7967b6ee208
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/973507716245173164
{'embedding_friendly_rewriting/v1/mean': np.float64(4.8), 'embedding_friendly_rewriting/v1/variance': np.float64(0.15999999999999998), 'embedding_friendly_rewriting/v1/p90': np.float64(5.0)}





In [30]:
results.tables["eval_results_table"]



index,inputs,ground_truth,outputs,token_count,embedding_friendly_rewriting/v1/score,embedding_friendly_rewriting/v1/justification
0,"I want a movie like The Matrix, but with a dee...",A science fiction movie set in a dystopian fut...,I’m looking for a science fiction movie that d...,161,5,"The model's response is highly detailed, avoid..."
1,"Something emotional, like The Pursuit of Happy...",An inspiring drama following a determined prot...,I'm looking for an emotional drama that emphas...,161,4,"The response is clear, general, and detailed, ..."
2,"A crime movie like Breaking Bad, but focused o...",A gritty crime drama centered around a morally...,I'm looking for a gripping crime drama that de...,171,5,"The model's response is highly detailed, avoid..."
3,I want a romantic movie that’s not cheesy and ...,A grounded romantic drama portraying the evolv...,I’m looking for a romantic film that strikes a...,159,5,"The model's response is highly detailed, avoid..."
4,"A slow-paced sci-fi movie like Arrival, with a...",A contemplative science fiction film where the...,I’m looking for a slow-paced sci-fi movie that...,154,5,"The model's response is highly detailed, avoid..."
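
The aggregate values printed by the previous cell can be reproduced by hand from the per-row scores in this table. The sketch below is a worked check, not MLflow's code: it assumes the variance is a population variance and that `p90` uses linear interpolation between ranks, both of which are consistent with the reported numbers. The `percentile` helper is a hypothetical name.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


scores = [5, 4, 5, 5, 5]  # embedding_friendly_rewriting/v1/score, rows 0 to 4

mean = sum(scores) / len(scores)
# Population variance: divide by n, not n - 1.
variance = sum((s - mean) ** 2 for s in scores) / len(scores)
p90 = percentile(scores, 90)

print(mean, variance, p90)
```

Dividing by n - 1 instead would give 0.2, so the reported value of roughly 0.16 points to the population form. With only five rows, a single score of 4 accounts for the entire variance.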


In [32]:
with mlflow.start_run() as run:
    system_prompt = SYSTEM_PROMPT
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        name="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[embedding_friendly_metric],
    )
print(results.metrics)

2025/06/19 14:35:01 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-a5640aaae9c64d4189342aab64aedde1
2025/06/19 14:35:01 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.
2025/06/19 14:35:02 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/06/19 14:35:07 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...

🏃 View run languid-panda-912 at: http://0.0.0.0:5001/#/experiments/973507716245173164/runs/e837d81a55234cea9a67a61c7d11d3bd
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/973507716245173164
{'embedding_friendly_rewriting/v1/mean': np.float64(5.0), 'embedding_friendly_rewriting/v1/variance': np.float64(0.0), 'embedding_friendly_rewriting/v1/p90': np.float64(5.0)}


In [33]:
results.tables["eval_results_table"]



index,inputs,ground_truth,outputs,token_count,embedding_friendly_rewriting/v1/score,embedding_friendly_rewriting/v1/justification
0,"I want a movie like The Matrix, but with a dee...",A science fiction movie set in a dystopian fut...,I’m looking for a thought-provoking science fi...,184,5,"The model's response is highly detailed, avoid..."
1,"Something emotional, like The Pursuit of Happy...",An inspiring drama following a determined prot...,I’m looking for a heartfelt drama that strikes...,177,5,"The model's response is highly detailed, avoid..."
2,"A crime movie like Breaking Bad, but focused o...",A gritty crime drama centered around a morally...,I’m looking for a gripping crime drama that de...,180,5,"The model's response is highly detailed, avoid..."
3,I want a romantic movie that’s not cheesy and ...,A grounded romantic drama portraying the evolv...,I’m looking for a romantic movie that strikes ...,147,5,"The model's response is highly detailed, avoid..."
4,"A slow-paced sci-fi movie like Arrival, with a...",A contemplative science fiction film where the...,I’m looking for a slow-paced sci-fi film that ...,138,5,"The model's response is highly detailed, avoid..."
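
The two logged models share the same system prompt and the same `gpt-4o-mini` model; only the placeholder name in the user message differs (`{description}` versus `{question}`). The move from a mean of 4.8 to 5.0 is therefore more plausibly run-to-run variation in generation or judging than an effect of the prompt. The sketch below compares the per-row scores copied from the two result tables; the variable names are illustrative.

```python
# Per-row scores copied from the two eval_results_table outputs.
first_run = [5, 4, 5, 5, 5]   # In [30]
second_run = [5, 5, 5, 5, 5]  # In [33]

# Positive delta means the second run scored higher on that row.
deltas = [b - a for a, b in zip(first_run, second_run)]
changed_rows = [i for i, d in enumerate(deltas) if d != 0]

print(deltas)        # [0, 1, 0, 0, 0]
print(changed_rows)  # [1]
```

Only row 1 changed, by a single point. With five examples, a difference of that size is too small to rank one configuration above the other, and repeating each evaluation several times would give a better sense of the noise.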
