# Development: MLflow

Notebook for developing the project's MLflow code.

In [1]:
# IMPORTS --------------------------------------------------------------------------------------------------------------

# Use the below lines if any dependencies are missing.
# ! python -m pip install uv
# ! python -m uv pip install langchain_openai mlflow python-dotenv langchain pandas langchain_community

import os
import sys

# Add the parent directory to the import path so that `ml_flow` can be imported (works on any OS).
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

import pandas as pd
from dotenv import load_dotenv
from ml_flow import (mlflow_server, create_example_llm, evaluate_llm, create_agent, evaluate_agent, get_info_on_runs,
                     delete_all_runs)

_ = load_dotenv()

import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

To begin, we start the MLflow server:

In [2]:
server_process = mlflow_server()
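
`mlflow_server` is defined in the project's `ml_flow` module and its implementation is not shown here. As a minimal sketch of what such a helper might do, assuming it launches the standard `mlflow server` CLI as a background process; the function names, host, and port below are illustrative assumptions, not the project's values:

```python
import subprocess


def build_server_command(host="127.0.0.1", port=5000):
    """Build the command line for a local MLflow tracking server.

    Hypothetical helper: the host and port defaults are assumptions.
    """
    return ["mlflow", "server", "--host", host, "--port", str(port)]


def start_server(host="127.0.0.1", port=5000):
    """Start the server in the background and return the process handle."""
    return subprocess.Popen(build_server_command(host, port))


def stop_server(process, timeout=10):
    """Ask the server to exit, and kill it if it does not exit in time."""
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
```

Keeping the returned handle, as `server_process` does above, makes it possible to shut the server down cleanly at the end of the session.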

## LLM and Standard MLflow

Let's generate an example LLM:

In [3]:
example_model = create_example_llm()

We'll then read in an evaluation set:

In [4]:
# Build the path with os.path so it does not depend on Windows-style separators.
data_folder_path = os.path.join(os.path.dirname(os.getcwd()), 'data')
file_name = 'Evaluation Dataset - LLM.csv'

file_path = os.path.join(data_folder_path, file_name)
eval_set = pd.read_csv(file_path)

eval_set = eval_set.rename(columns={'question': 'inputs', 'target': 'targets'})

display(eval_set)

Unnamed: 0,inputs,context,targets
0,How much money does client 1 have in shares?,Client 1 has 20 shares. 70% of their shares ar...,"Client 1 has £14,000 worth of NVDA shares (70%..."
1,How much money does client 2 have in shares?,Client 2 has 10 shares. 30% of their shares ar...,"Client 2 has £3,000 worth of NVDA shares (30% ..."
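
The same "parent directory, then `data`" path is needed again for the agent dataset later on, so it can be convenient to wrap it in a small helper. A minimal sketch using `pathlib`, assuming the layout implied by the cell above (the working directory and the `data` folder share a parent); `data_file` and the example directory are hypothetical names used for illustration:

```python
from pathlib import Path


def data_file(file_name, notebook_dir=None):
    """Return the path to a file in the sibling `data` folder.

    Assumes the notebook's working directory and the `data` folder
    sit next to each other under the same parent directory.
    """
    base = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    return base.parent / "data" / file_name


# Example with a made-up working directory (illustration only).
example = data_file("Evaluation Dataset - LLM.csv", "/project/notebooks")
print(example.as_posix())
```

The resulting `Path` can be passed directly to `pd.read_csv`.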


Let's demonstrate that the model works:

In [5]:
question = eval_set['inputs'][0]
context = eval_set['context'][0]

print(f"Question: {question}")
print('')
print('Answer: ' + example_model.invoke({'inputs': question, 'context': context}))

Question: How much money does client 1 have in shares?

Answer: Client 1 has £14,000 worth of NVDA shares (70% of 20 shares x £1000 per share) and £5,700 worth of AAPL shares (30% of 20 shares x £190 per share). Therefore, in total, Client 1 has £19,700 in shares.


We then evaluate the model and log the results to MLflow:

In [6]:
results = evaluate_llm(example_model, eval_set, "openai:/gpt-3.5-turbo", "mlflow_llm_development")

2024/05/30 15:56:45 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/05/30 15:56:47 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:01<00:00,  1.47s/it]
100%|██████████| 1/1 [00:01<00:00,  1.32s/it]
100%|██████████| 1/1 [00:01<00:00,  1.17s/it]
100%|██████████| 1/1 [00:01<00:00,  1.43s/it]
100%|██████████| 2/2 [00:01<00:00,  1.01it/s]
100%|██████████| 2/2 [00:01<00:00,  1.22it/s]
100%|██████████| 2/2 [00:01<00:00,  1.17it/s]
100%|██████████| 2/2 [00:04<00:00,  2.03s/it]


We can then take a look at the results:

In [7]:
output_df = pd.DataFrame(results.tables['eval_results_table'])
display(output_df)

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 390.75it/s]


Unnamed: 0,inputs,context,targets,outputs,token_count,flesch_kincaid_grade_level/v1/score,ari_grade_level/v1/score,faithfulness/v1/score,faithfulness/v1/justification,answer_similarity/v1/score,answer_similarity/v1/justification,answer_correctness/v1/score,answer_correctness/v1/justification,answer_relevance/v1/score,answer_relevance/v1/justification
0,How much money does client 1 have in shares?,Client 1 has 20 shares. 70% of their shares ar...,"Client 1 has £14,000 worth of NVDA shares (70%...","Client 1 has £14,000 worth of NVDA shares (70%...",62,4.2,7.0,5,The output correctly calculates the total valu...,3,The output has moderate semantic similarity to...,5,The output is correct and demonstrates a high ...,5,"The output directly mirrors the input, providi..."
1,How much money does client 2 have in shares?,Client 2 has 10 shares. 30% of their shares ar...,"Client 2 has £3,000 worth of NVDA shares (30% ...","Client 2 has £7,300 in shares. This is calcula...",76,7.0,10.2,5,The output correctly calculates the amount of ...,4,The output aligns with the provided targets in...,5,The output provided by the model is correct an...,5,The output directly addresses all aspects of t...


In [8]:
output_df['outputs'][1]

'Client 2 has £7,300 in shares. This is calculated by taking 30% of their shares in NVDA (3 shares x £1000 = £3000) and 70% of their shares in AAPL (7 shares x £190 = £1330), then adding these two amounts together (£3000 + £1330 = £4330).'
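
Because the judge metrics are themselves produced by an LLM, it can help to re-check the arithmetic in an answer independently. A minimal sketch, assuming the per-share prices and allocations quoted in the model outputs above (the `context` column is truncated in the display, so these figures are assumptions rather than values read from the dataset); `holdings_value` is a hypothetical helper:

```python
# Assumed per-share prices in GBP, taken from the model outputs shown above.
PRICES = {"NVDA": 1000, "AAPL": 190}


def holdings_value(total_shares, allocation_pct):
    """Return the value per ticker and the total, using integer arithmetic.

    allocation_pct maps ticker -> percentage of total_shares held.
    """
    per_ticker = {
        ticker: (total_shares * pct // 100) * PRICES[ticker]
        for ticker, pct in allocation_pct.items()
    }
    return per_ticker, sum(per_ticker.values())


# Client 2: 10 shares, 30% NVDA and 70% AAPL (as stated in the output above).
client_2, client_2_total = holdings_value(10, {"NVDA": 30, "AAPL": 70})
print(client_2, client_2_total)  # {'NVDA': 3000, 'AAPL': 1330} 4330

# Client 1: 20 shares, 70% NVDA and 30% AAPL.
client_1, client_1_total = holdings_value(20, {"NVDA": 70, "AAPL": 30})
print(client_1, client_1_total)  # {'NVDA': 14000, 'AAPL': 1140} 15140
```

Under these assumptions, the worked steps in the second output sum to £4,330 rather than the £7,300 stated in its first sentence, and 30% of 20 shares at £190 comes to £1,140 rather than the £5,700 quoted in the first answer. LLM-judged scores are therefore worth spot-checking against a calculation like this one.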

In [9]:
print(get_info_on_runs('mlflow_llm_development'))

------------------------------------------------------------------------------------------------------------------------
Run ID: 6e92e3ddea28432ba4ff42c729ae4aff
Parameters: {'model': 'first=PromptTemplate(input_variables=[\'context\', \'inputs\'], template="You\'re a investment manager. Using the context provided, reply to the question below to the best of your ability:\\nQuestion:\\n{inputs}\\nContext:\\n{context}") middle=[ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x000001DFB005B3D0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x000001DFB00CB950>, model_name=\'gpt-3.5-turbo-0125\', temperature=0.0, openai_api_key=SecretStr(\'**********\'), openai_proxy=\'\')] last=RunnableLambda(_get_content)'}
Metrics: {'answer_correctness/v1/mean': 5.0, 'answer_correctness/v1/p90': 5.0, 'answer_correctness/v1/variance': 0.0, 'answer_relevance/v1/mean': 5.0, 'answer_relevance/v1/p90': 5.0, 'answer_relevance/v1/variance': 0.0, 'answer_
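
The metric names above combine a per-row judge score with an aggregation (`mean`, `variance`, `p90`). As a rough sketch of how such aggregates relate to the per-row scores, assuming population variance and a linearly interpolated 90th percentile (the NumPy defaults; this is an assumption about the aggregation, not something the output confirms), with `percentile` and `aggregate` as hypothetical helpers:

```python
import math
import statistics


def percentile(values, q):
    """Linearly interpolated percentile of a non-empty list (q in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def aggregate(metric_name, scores):
    """Aggregate per-row scores into mean, population variance and p90."""
    scores = [float(score) for score in scores]
    return {
        f"{metric_name}/mean": statistics.fmean(scores),
        f"{metric_name}/variance": statistics.pvariance(scores),
        f"{metric_name}/p90": percentile(scores, 90),
    }


# Per-row answer_correctness scores from the results table above.
print(aggregate("answer_correctness/v1", [5, 5]))
# {'answer_correctness/v1/mean': 5.0, 'answer_correctness/v1/variance': 0.0, 'answer_correctness/v1/p90': 5.0}
```

For the two `answer_correctness` scores of 5, this reproduces the logged mean of 5.0, p90 of 5.0, and variance of 0.0.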

In [11]:
delete_all_runs('mlflow_llm_development')

## Agent Model Evaluation

We can instantiate a simple agent to answer our queries:

In [12]:
example_agent = create_agent()

Let's then evaluate that agent against a set of evaluation questions:

In [13]:
eval_set = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()), 'data', 'Evaluation Dataset - Agent.csv'))
display(eval_set)

evaluate_agent(example_agent, eval_set['question'], 'mlflow_agent_development')

Unnamed: 0,question
0,Return the values for client_id 1 in the sql d...
1,Tell me the latest AAPL stock price.
2,Return me the stock allocation for client 5.
3,Return me the stock allocation for client 8.
4,Return me the stock allocation for every client.
5,Give me all the stock allocations from all cli...
6,Give me a sentence from the apple 10-k report.
7,What is the net sales of iPhones in 2021.
8,Add a new client to the database with random s...
9,Give me all the stock allocations from all cli...


Evaluating agent on questions...: 100%|██████████| 12/12 [00:58<00:00,  4.86s/it]
2024/05/30 15:58:54 INFO mlflow.tracking.fluent: Experiment with name 'mlflow_agent_development' does not exist. Creating a new experiment.
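
`evaluate_agent` is defined in the project's `ml_flow` module and its implementation is not shown here. The sketch below illustrates one way such a loop could produce per-question response times and a success flag. Everything in it is hypothetical: the `run_questions` and `summarise` names, the stand-in agent, and in particular the definition of success as "the call did not raise", which may differ from the project's own criterion:

```python
import statistics
import time


def run_questions(agent, questions):
    """Call the agent on each question, recording timing and success."""
    records = []
    for question in questions:
        start = time.perf_counter()
        try:
            answer = agent(question)
            success = 1.0
        except Exception:  # a failed call counts as unsuccessful
            answer = None
            success = 0.0
        records.append({
            "question": question,
            "answer": answer,
            "response_time": time.perf_counter() - start,
            "success": success,
        })
    return records


def summarise(records):
    """Aggregate per-question records into mean/variance metrics."""
    metrics = {}
    for key in ("response_time", "success"):
        values = [record[key] for record in records]
        name = "success_rate" if key == "success" else key
        metrics[f"{name}_mean"] = statistics.fmean(values)
        metrics[f"{name}_variance"] = statistics.pvariance(values)
    return metrics


def stand_in_agent(question):
    """Placeholder for a real agent; fails on one question on purpose."""
    if "fail" in question:
        raise RuntimeError("simulated tool error")
    return f"Echo: {question}"


records = run_questions(stand_in_agent, ["first question", "please fail"])
print(summarise(records))
```

With MLflow installed, a dictionary like the one returned by `summarise` could be logged inside a run with `mlflow.log_metrics`.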


We can then inspect the agent's performance in MLflow:

In [14]:
print(get_info_on_runs('mlflow_agent_development'))

------------------------------------------------------------------------------------------------------------------------
Run ID: f1a81fa31b9b4315a82d9032a756129b
Metrics: {'ari_score_mean': 6.26, 'ari_score_variance': 4.77, 'response_time_mean': 4.41, 'response_time_variance': 0.3, 'success_rate_mean': 0.0, 'success_rate_variance': 0.0}
Tags: {'mlflow.runName': 'respected-yak-807', 'mlflow.source.name': 'c:\\Code\\GenAIGroupProject\\.venv\\Lib\\site-packages\\ipykernel_launcher.py', 'mlflow.source.type': 'LOCAL', 'mlflow.user': 'MichaelBerney'}
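
The `ari_score` metrics presumably refer to the Automated Readability Index, the same measure reported as `ari_grade_level` in the LLM evaluation earlier. As a minimal sketch of the standard ARI formula, with simplified counting rules that are assumptions here and may not match the library the project uses:

```python
import re


def automated_readability_index(text):
    """Standard ARI: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43.

    Counting rules are simplifying assumptions: characters are letters
    and digits, words are whitespace-separated tokens, and sentences are
    runs of text ended by '.', '!' or '?'.
    """
    words = text.split()
    characters = sum(ch.isalnum() for ch in text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


score = automated_readability_index("Tell me the latest AAPL stock price.")
print(round(score, 2))
```

Lower scores indicate text that is easier to read; the value approximates a US school grade level.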


In [15]:
delete_all_runs('mlflow_agent_development')

## Using Real Agent

Here, our real agent class is used and tracked with MLflow.