## MLflow with GenAI

## Importing the required libraries

1. DagsHub provides a hosted MLflow tracking server.
2. Use dotenv to load environment variables.
3. Import mlflow.

In [21]:
from dotenv import load_dotenv
import os
import openai
import pandas as pd
import mlflow
import dagshub
from mlflow.metrics.genai import EvaluationExample, answer_similarity, make_genai_metric


In [41]:
load_dotenv()  # read variables from a local .env file
# Expose the key under the name the OpenAI client expects
os.environ["OPENAI_API_KEY"] = os.getenv("API_KEY_NAME")
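
`os.getenv` returns `None` when the variable is missing, and assigning `None` into `os.environ` raises a `TypeError`. A minimal sketch of a guard, assuming the key is stored under `API_KEY_NAME` as in the cell above. The helper name `require_env` is hypothetical and not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Usage (after load_dotenv() has run):
# os.environ["OPENAI_API_KEY"] = require_env("API_KEY_NAME")
```

Failing early here gives a clearer error than one raised later by the OpenAI client.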

## DagsHub
1. Place the DagsHub token file in your user directory so that the client can authenticate.
2. Set the MLflow tracking URI based on the repository name.

In [9]:
dagshub.init(repo_owner="mohannadrateb", repo_name="MLflow", mlflow=True)
mlflow.set_tracking_uri("https://dagshub.com/mohannadrateb/MLflow.mlflow")
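
The tracking URI in the cell above follows the pattern `https://dagshub.com/<owner>/<repo>.mlflow`. As a small sketch, it can be derived from the same two values passed to `dagshub.init`. The pattern is inferred from the cell above, and the helper `dagshub_tracking_uri` is a hypothetical name:

```python
def dagshub_tracking_uri(repo_owner: str, repo_name: str) -> str:
    """Build a tracking URI following the pattern used in the cell above."""
    return f"https://dagshub.com/{repo_owner}/{repo_name}.mlflow"


print(dagshub_tracking_uri("mohannadrateb", "MLflow"))
# https://dagshub.com/mohannadrateb/MLflow.mlflow
```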

## Creating a dataframe for evaluation
1. The dataframe has two columns: `inputs`, the questions given to the model, and `ground_truth`, the reference answers that the model outputs are compared against.

In [42]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
            "What does the 'finally' block in Python do?",
            "What is the difference between multiprocessing and multithreading?",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)

## MLflow guidelines
1. `mlflow.set_experiment(<experiment_name>)` registers a new experiment (or selects an existing one), which can later be viewed in the MLflow UI.
2. `mlflow.start_run()` starts a run within the experiment.
3. Define the model and its inputs.
4. `mlflow.openai.log_model()` wraps the OpenAI model as an MLflow model and logs it, so it can be tracked and compared later.
5. `mlflow.evaluate()` computes the evaluation metrics.


In [14]:
mlflow.set_experiment("LLM Evaluation")
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 14.30it/s]
2024/07/12 15:16:57 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/07/12 15:17:04 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


{'exact_match/v1': 0.0}
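
An `exact_match` of 0.0 is expected for free-form answers, because the metric only rewards outputs that are identical to the ground truth. The sketch below illustrates the idea on toy data. It is not MLflow's implementation, which may normalise text differently, and the function name and sample strings are made up for illustration:

```python
def exact_match(predictions, targets):
    """Fraction of predictions that are character-for-character equal to their target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(predictions)


targets = ["Paris", "4"]
print(exact_match(["Paris", "4"], targets))                  # 1.0
print(exact_match(["The capital is Paris.", "4"], targets))  # 0.5
```

A correct but differently worded answer still counts as a miss, which is why a similarity metric judged by an LLM is more informative for this task.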

## The results from mlflow.evaluate()
1. The result object has a `tables` attribute; its `"eval_results_table"` entry contains the model outputs alongside the ground-truth column for comparison.

In [15]:
results.tables["eval_results_table"]

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  2.60it/s]


| | inputs | ground_truth | outputs | token_count |
|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your com... | useEffect() is a hook provided by React that a... | 42 |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather th... | The static keyword in a function means that th... | 48 |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is used to defin... | 47 |
| 3 | What is the difference between multiprocessing... | Multithreading refers to the ability of a proc... | Multiprocessing involves utilizing multiple pr... | 36 |


## Creating a custom evaluation example
1. Use `EvaluationExample` from `mlflow.metrics.genai` to provide an input, an output, a score, and a justification for that score.

In [17]:
# Create an example that illustrates what answer_similarity means for this problem.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "targets": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)

### Use this example in the answer_similarity metric

In [18]:
# Construct the metric using OpenAI GPT-4 as the judge
answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[example])
print(answer_similarity_metric)

EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_similarity based on the rubric
justification: Your reasoning about the model's answer_similarity score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_similarity based on the input and output.
A definition of answer_similarity and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them be
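
The judge prompt above asks for a two-line reply with a `score:` line and a `justification:` line. A minimal sketch of parsing such a reply follows. The function `parse_judge_reply` and the sample reply are hypothetical, and MLflow's own parsing may differ:

```python
import re


def parse_judge_reply(reply: str):
    """Extract (score, justification) from a 'score: ...' / 'justification: ...' reply."""
    flags = re.MULTILINE | re.IGNORECASE
    score_match = re.search(r"^score:\s*(\d+)", reply, flags=flags)
    just_match = re.search(r"^justification:\s*(.*)", reply, flags=flags | re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    justification = just_match.group(1).strip() if just_match else None
    return score, justification


reply = "score: 4\njustification: Covers the main idea but omits the developer."
print(parse_judge_reply(reply))
# (4, 'Covers the main idea but omits the developer.')
```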

We can pass extra metrics to `mlflow.evaluate()` through the `extra_metrics` argument; they are added to the results.

In [28]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[answer_similarity_metric, mlflow.metrics.toxicity(), mlflow.metrics.latency()],  # include the answer similarity metric created above
    )
results.metrics

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 15.15it/s]
2024/07/12 15:43:49 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/07/12 15:43:56 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:04<00:00,  4.06s/it]
100%|██████████| 4/4 [00:04<00:00,  1.08s/it]


{'latency/mean': 1.663176715373993,
 'latency/variance': 0.01992190785308523,
 'latency/p90': 1.811255145072937,
 'exact_match/v1': 0.0,
 'answer_similarity/v1/mean': 3.75,
 'answer_similarity/v1/variance': 1.1875,
 'answer_similarity/v1/p90': 4.7}

In [29]:
results.tables["eval_results_table"]

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  3.38it/s]


| | inputs | ground_truth | outputs | latency | token_count | answer_similarity/v1/score | answer_similarity/v1/justification |
|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your com... | useEffect() is a hook in React that allows you... | 1.46817 | 45 | 4 | The output effectively explains what useEffect... |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather th... | The static keyword in a function means that th... | 1.601177 | 47 | 2 | The output provides some information about the... |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is used to execu... | 1.742742 | 49 | 5 | The model's output closely aligns with the pro... |
| 3 | What is the difference between multiprocessing... | Multithreading refers to the ability of a proc... | Multiprocessing involves executing multiple pr... | 1.840618 | 52 | 4 | The output accurately describes the concepts o... |
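
The aggregate values in `results.metrics` can be reproduced from the per-row scores in this table. The sketch below assumes population variance and a linearly interpolated 90th percentile. Those choices match the reported numbers, but they are not confirmed here to be how MLflow computes them:

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


scores = [4, 2, 5, 4]  # answer_similarity/v1/score column from the table above

print(statistics.mean(scores))           # 3.75
print(statistics.pvariance(scores))      # 1.1875
print(round(percentile(scores, 90), 2))  # 4.7
```

With only four rows, a single low score (the 2 for the `static` question) moves the mean and variance noticeably, so these aggregates are best read together with the per-row justifications.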


## Creating a custom GenAI metric
1. The grading guidelines go in the `grading_prompt` argument.

In [30]:

professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language"
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts. "
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. "
        "- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(professionalism_metric)

EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's professionalism based on the rubric
justification: Your reasoning about the model's professionalism score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called professionalism based on the input and output.
A definition of professionalism and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before complet
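
The `grading_prompt` above is assembled from adjacent string literals, which Python joins with no separator, so every rubric line needs its own trailing space or newline. One alternative, sketched below, is to build the rubric from a mapping and join the lines explicitly. The helper `build_rubric` and the shortened level descriptions are illustrative only and not part of MLflow:

```python
def build_rubric(title: str, levels: dict) -> str:
    """Render a grading rubric with one '- Score N: ...' line per level."""
    lines = [title]
    for score in sorted(levels):
        lines.append(f"- Score {score}: {levels[score]}")
    return "\n".join(lines)


# Adjacent literals are concatenated as-is: note the missing space.
glued = ("contexts." "- Score 2")
print(glued)  # contexts.- Score 2

rubric = build_rubric(
    "Professionalism: details for different scores:",
    {1: "Extremely casual language.", 2: "Casual but respectful language."},
)
print(rubric)
```

The resulting string could be passed as `grading_prompt`, keeping each score level on its own line for the judge model.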

In [36]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric, mlflow.metrics.toxicity(), mlflow.metrics.latency()],  # include the professionalism metric created above
    )
print(results.metrics)

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 14.71it/s]
2024/07/12 15:50:32 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/07/12 15:50:38 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:01<00:00,  1.95s/it]
100%|██████████| 4/4 [00:39<00:00,  9.88s/it]


{'latency/mean': 1.5089607238769531, 'latency/variance': 0.07772480408246452, 'latency/p90': 1.827316975593567, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}


In [37]:
results.tables["eval_results_table"]

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  2.40it/s]


| | inputs | ground_truth | outputs | latency | token_count | professionalism/v1/score | professionalism/v1/justification |
|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your com... | useEffect() is a hook in React that allows you... | 1.227306 | 46 | 4 | The language used in the response is formal an... |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather th... | The static keyword in a function means that th... | 1.313091 | 46 | 4 | The language used in the response is formal an... |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is used to defin... | 1.548738 | 47 | 4 | The language used in the response is formal an... |
| 3 | What is the difference between multiprocessing... | Multithreading refers to the ability of a proc... | Multiprocessing involves running multiple proc... | 1.946708 | 68 | 4 | The language used in the response is formal an... |


In [39]:
with mlflow.start_run() as run:
    system_prompt = "Answer the following question using extreme formality."
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric, mlflow.metrics.toxicity(), mlflow.metrics.latency()],
    )
print(results.metrics)

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 13.86it/s]
2024/07/12 15:53:52 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/07/12 15:54:01 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:02<00:00,  2.61s/it]
100%|██████████| 4/4 [00:03<00:00,  1.20it/s]


{'latency/mean': 2.177800953388214, 'latency/variance': 0.3162884970014481, 'latency/p90': 2.7973836898803714, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}


In [40]:
results.tables["eval_results_table"]

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  2.61it/s]


| | inputs | ground_truth | outputs | latency | token_count | professionalism/v1/score | professionalism/v1/justification |
|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your com... | The useEffect() hook is a function provided by... | 3.069404 | 159 | 4 | The language used in the response is formal an... |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather th... | The static keyword within a function declarati... | 1.94677 | 101 | 4 | The language used in the response is formal an... |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is a crucial con... | 1.53236 | 66 | 4 | The language used in the response is formal an... |
| 3 | What is the difference between multiprocessing... | Multithreading refers to the ability of a proc... | The distinction between multiprocessing and mu... | 2.16267 | 102 | 4 | The language used in the response is formal an... |
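
To compare the two prompts, the metric dictionaries printed by the last two evaluation runs can be diffed key by key. The sketch below copies a subset of the values from those outputs; the helper `metric_deltas` is a hypothetical name:

```python
# Values copied from the two results.metrics outputs above.
baseline = {
    "latency/mean": 1.5089607238769531,
    "professionalism/v1/mean": 4.0,
    "professionalism/v1/variance": 0.0,
}
formal = {
    "latency/mean": 2.177800953388214,
    "professionalism/v1/mean": 4.0,
    "professionalism/v1/variance": 0.0,
}


def metric_deltas(before: dict, after: dict) -> dict:
    """Difference (after - before) for every metric present in both runs."""
    return {key: after[key] - before[key] for key in before if key in after}


for name, delta in metric_deltas(baseline, formal).items():
    print(f"{name}: {delta:+.3f}")
# latency/mean: +0.669
# professionalism/v1/mean: +0.000
# professionalism/v1/variance: +0.000
```

In these runs the "extreme formality" system prompt produced longer, slower answers, while the judged professionalism score stayed at 4 for every row.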
