# Use FLAML to Tune Large Language Models

`flaml.autogen` offers a cost-effective hyperparameter optimization technique, [EcoOptiGen](https://arxiv.org/abs/2303.04673), for tuning Large Language Models. The study finds that tuning hyperparameters can significantly improve the utility of LLMs.

In this notebook, we tune OpenAI ChatGPT models for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought), which measures mathematical problem solving on competition math problems with chain-of-thought style reasoning.


## Requirements

FLAML requires `Python>=3.8`. To run this notebook example, please install the following packages:

In [None]:
%pip install "openai==0.28.1" "datasets==2.14.6" "diskcache"

Note: you may need to restart the kernel to use updated packages.



FLAML provides an API for hyperparameter optimization of OpenAI ChatGPT models, `autogen.ChatCompletion.tune`, and an API for making a request with the tuned config, `autogen.ChatCompletion.create`.
First, we import autogen from flaml:

In [None]:
from flaml import autogen



### Set your API Endpoint

Note that the built-in [Azure OpenAI service](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview) is not supported on trial SKUs; only paid SKUs (F64 or higher, or P1 or higher) are supported. Update the `config_list` below with your own LLM settings.


In [None]:
config_list = [
    {
        "model": "gpt-35-turbo-0125",
    },
]

assert len(config_list) > 0
print("models to use: ", [config["model"] for config in config_list])


models to use:  ['gpt-35-turbo-0125']
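
If your endpoint needs credentials, one way to fill in an entry is to read the settings from environment variables. The sketch below is only an illustration: the key names `api_key` and `api_base` follow the OpenAI 0.28-style settings and are an assumption here, and the environment variable names (`LLM_MODEL`, `LLM_API_KEY`, `LLM_API_BASE`) and the helper `build_config_list` are hypothetical.

```python
import os


def build_config_list(env=os.environ):
    """Build a one-entry config list from environment variables (hypothetical names)."""
    entry = {"model": env.get("LLM_MODEL", "gpt-35-turbo-0125")}
    # Only add credential/endpoint fields that are actually provided.
    for key, var in [("api_key", "LLM_API_KEY"), ("api_base", "LLM_API_BASE")]:
        if env.get(var):
            entry[key] = env[var]
    return [entry]


# A plain dict stands in for os.environ so the example runs anywhere.
config_list_example = build_config_list(
    {"LLM_MODEL": "my-deployment", "LLM_API_KEY": "sk-placeholder"}
)
print(config_list_example)
```

Keeping keys out of the notebook source this way also avoids committing secrets by accident.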


Check if the configs are valid:

In [None]:
for config in config_list:
    print(f"Current model: {config['model']}")
    # The agent backed by the model under test replies once.
    agent = autogen.agentchat.ConversableAgent(
        name=config["model"],
        llm_config={"config_list": [config]},
        max_consecutive_auto_reply=1,
        human_input_mode="NEVER",
    )
    # The user proxy has no LLM and only sends the initial message.
    userproxy = autogen.agentchat.ConversableAgent(
        name="user",
        max_consecutive_auto_reply=0,
        llm_config=False,
        default_auto_reply="TERMINATE",
        human_input_mode="NEVER",
    )
    userproxy.initiate_chat(recipient=agent, message="Tell me a quick joke.")


Current model: gpt-35-turbo-0125
user (to gpt-35-turbo-0125):

Tell me a quick joke.

--------------------------------------------------------------------------------
gpt-35-turbo-0125 (to user):

Why couldn't the bicycle stand up by itself?

Because it was two-tired!

--------------------------------------------------------------------------------


## Load dataset

We load the competition_math dataset and keep only the "Level 2" Algebra problems. We use 20 examples sampled at random from the training split for tuning the generation hyperparameters, and all 201 matching examples from the test split for evaluation.

In [None]:
import datasets

seed = 41
data = datasets.load_dataset("competition_math")
train_data = data["train"].shuffle(seed=seed)
test_data = data["test"].shuffle(seed=seed)
n_tune_data = 20
tune_data = [
    {
        "problem": train_data[x]["problem"],
        "solution": train_data[x]["solution"],
    }
    for x in range(len(train_data))
    if train_data[x]["level"] == "Level 2" and train_data[x]["type"] == "Algebra"
][:n_tune_data]
test_data = [
    {
        "problem": test_data[x]["problem"],
        "solution": test_data[x]["solution"],
    }
    for x in range(len(test_data))
    if test_data[x]["level"] == "Level 2" and test_data[x]["type"] == "Algebra"
]
print(len(tune_data), len(test_data))


20 201
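
The selection logic above (shuffle with a fixed seed, keep the rows of one level and type, optionally truncate) can be sketched on plain Python dictionaries. This is a stand-in, not the notebook's code: `select_problems` and the toy rows are hypothetical, and `random.Random(seed).shuffle` does not produce the same order as the `datasets` shuffle.

```python
import random


def select_problems(rows, level="Level 2", problem_type="Algebra", limit=None, seed=41):
    """Shuffle rows reproducibly, keep the matching ones, and optionally truncate."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    selected = [
        {"problem": row["problem"], "solution": row["solution"]}
        for row in rows
        if row["level"] == level and row["type"] == problem_type
    ]
    return selected if limit is None else selected[:limit]


# Toy rows: even indices are "Level 2"; indices not divisible by 3 are "Algebra".
toy_rows = [
    {
        "problem": f"p{i}",
        "solution": f"s{i}",
        "level": "Level 2" if i % 2 == 0 else "Level 5",
        "type": "Algebra" if i % 3 else "Geometry",
    }
    for i in range(12)
]
toy_tune = select_problems(toy_rows, limit=2)
toy_test = select_problems(toy_rows)
print(len(toy_tune), len(toy_test))
```

Iterating over rows directly, rather than indexing the dataset once per field, touches each record only once.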


Check a tuning example:

In [None]:
print(tune_data[1]["problem"])


If $3+a=4-b$ and $4+b=7+a$, what is $3-a$?


Here is one example of the canonical solution:

In [None]:
print(tune_data[1]["solution"])


First we begin by solving the system of equations \begin{align*}
3+a&=4-b, \\
4+b&=7+a.
\end{align*}Adding the two equations, we get $3+a+4+b=4-b+7+a$, which simplifies to $7+a+b=11+a-b$. Cancelling $a$ from both sides, we get $7+b=11-b$. Solving for $b$, we find that $b=2$. Plugging this into the first equation above, we obtain $3+a=4-2$. Hence $a=-1$ and $3-a=\boxed{4}$.


## Define Success Metric

Before we start tuning, we must define the success metric we want to optimize. For each math task, we use voting to select the response with the most common answer among all the generated responses. We consider the task successfully solved if the voted answer is equivalent to the answer in the canonical solution. We can then optimize the mean success rate over a collection of tasks.

In [None]:
from flaml.autogen.math_utils import eval_math_responses
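
To make the voting idea concrete, here is a minimal sketch. It is not the implementation of `eval_math_responses`: the helpers `extract_boxed` and `vote_success` are hypothetical, the answer is taken from the last non-nested `\boxed{...}`, and answers are compared by exact string match rather than by mathematical equivalence.

```python
import re
from collections import Counter


def extract_boxed(response):
    """Return the content of the last non-nested \\boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def vote_success(responses, solution):
    """True if the most common extracted answer matches the canonical one."""
    answers = [a for a in map(extract_boxed, responses) if a is not None]
    if not answers:
        return False
    voted_answer, _ = Counter(answers).most_common(1)[0]
    return voted_answer == extract_boxed(solution)


responses = ["... so $3-a=\\boxed{4}$.", "Hence \\boxed{4}", "I get \\boxed{2}"]
solved = vote_success(responses, "Hence $a=-1$ and $3-a=\\boxed{4}$.")
print(solved)
```

Averaging this boolean over all tasks gives a mean success rate of the kind the `success_vote` metric reports.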


## Use the tuning data to find a good configuration


For (local) reproducibility and cost efficiency, we cache responses from OpenAI with a controllable seed.

In [None]:
autogen.ChatCompletion.set_cache(seed)


This will create a disk cache in ".cache/{seed}". You can change `cache_path_root` from ".cache" to a different path in `set_cache()`. The caches for different seeds are stored separately.

### Perform tuning

Tuning takes a while to finish, depending on the optimization budget, and runs under the budgets described below.

* `inference_budget` is the benchmark's target average inference budget per instance. For example, 0.004 means the target inference budget is 0.004 dollars, which translates to 2000 tokens (input + output combined) if the gpt-3.5-turbo model is used.
* `optimization_budget` is the total budget allowed for tuning. For example, 1 means 1 dollar is allowed in total, which translates to 500K tokens for the gpt-3.5-turbo model.
* `num_samples` is the number of different hyperparameter configurations allowed to be tried. The tuning will stop after either `num_samples` trials are completed or `optimization_budget` dollars are spent, whichever happens first. -1 means no hard restriction on the number of trials; the actual number is decided by `optimization_budget`.
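
The dollar-to-token conversions in the bullets above can be checked with a few lines of arithmetic. The flat price of $0.002 per 1K tokens is an assumption inferred from the numbers quoted in the bullets, and `tokens_for_budget` is a hypothetical helper.

```python
def tokens_for_budget(budget_dollars, price_per_1k_tokens=0.002):
    """Number of tokens a dollar budget buys at a flat per-1K-token price."""
    return round(budget_dollars / price_per_1k_tokens * 1000)


print(tokens_for_budget(0.004))  # inference_budget example -> 2000
print(tokens_for_budget(1))      # optimization_budget example -> 500000
```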

Users can specify the tuning data, optimization metric, optimization mode, evaluation function, search spaces, etc. The default search space is:

```python
default_search_space = {
    "model": tune.choice([
        "gpt-3.5-turbo",
        "gpt-4",
    ]),
    "temperature_or_top_p": tune.choice(
        [
            {"temperature": tune.uniform(0, 2)},
            {"top_p": tune.uniform(0, 1)},
        ]
    ),
    "max_tokens": tune.lograndint(50, 1000),
    "n": tune.randint(1, 100),
    "prompt": "{prompt}",
}
```

The default search space can be overridden by users' input.
For example, the following code specifies a fixed prompt template. The default search space will be used for hyperparameters that don't appear in users' input.

In [None]:
from flaml import tune

models = tune.choice(["gpt-35-turbo-0125"])  # to update models to search
prompts = [
    "{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."
]
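
The doubled braces in the prompt template matter because the template is meant to be filled with Python format-string substitution (hence `allow_format_str_template=True` in the tuning call). A small sketch of that substitution, assuming a plain `str.format` call with a data instance as the context:

```python
template = (
    "{problem} Solve the problem carefully. Simplify your answer as much as possible. "
    "Put the final answer in \\boxed{{}}."
)
# A data instance acts as the context; keys not used in the template are ignored.
context = {"problem": "If $3+a=4-b$ and $4+b=7+a$, what is $3-a$?", "solution": "(omitted)"}

rendered = template.format(**context)
print(rendered)
```

`{problem}` is replaced by the problem text, while `{{}}` collapses to a literal `{}`, so the model sees `\boxed{}`.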


Because we use models that are not predefined in FLAML, we register their price information manually. Check [here](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/) for more price details.

In [None]:
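# Register the models as chat models and set their prices in dollars per 1K tokens,
# given as (input price, output price).
autogen.ChatCompletion.chat_models.update({"gpt-35-turbo-0125", "gpt-4o", "gpt-4-32k"})
autogen.ChatCompletion.price1K.update(
    {
        "gpt-35-turbo-0125": (0.0005, 0.0015),
        "gpt-4o": (0.005, 0.015),
        "gpt-4-32k": (0.06, 0.12),
    }
)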
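
To see how such a price entry turns into a dollar cost, here is a rough sketch. It assumes each tuple holds the (input, output) price per 1K tokens and that a single number means one flat price; `request_cost` is a hypothetical helper, not a FLAML function.

```python
def request_cost(prompt_tokens, completion_tokens, price_per_1k):
    """Dollar cost of one request given per-1K-token prices."""
    if isinstance(price_per_1k, tuple):
        input_price, output_price = price_per_1k
    else:
        # A single number is treated as a flat price for all tokens.
        input_price = output_price = price_per_1k
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1000


# 100 prompt tokens and 200 completion tokens at the gpt-35-turbo-0125 prices.
cost = request_cost(100, 200, (0.0005, 0.0015))
print(f"{cost:.5f}")
```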


In [None]:
import logging

config, analysis = autogen.ChatCompletion.tune(
    data=tune_data,  # the data for tuning
    metric="success_vote",  # the metric to optimize
    mode="max",  # the optimization mode
    eval_func=eval_math_responses,  # the evaluation function to return the success metrics
    # log_file_name="logs/math.log",  # the log file name
    inference_budget=0.02,  # the inference budget (dollar per instance)
    optimization_budget=1,  # the optimization budget (dollar in total)
    # num_samples can further limit the number of trials for different hyperparameter configurations;
    # -1 means decided by the optimization budget only
    num_samples=20,  # number of configurations to try
    prompt=prompts,  # the prompt templates to choose from
    # stop="###",  # the stop sequence
    config_list=config_list,  # the endpoint list
    allow_format_str_template=True,  # whether to allow format string template
    logging_level=logging.INFO,  # the logging level
    model=models,  # models to choose from
)


[I 2024-09-03 04:36:38,702] A new study created in memory with name: optuna


No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune
[flaml.tune.tune: 09-03 04:36:39] {905} INFO - trial 1 config: {'model': 'gpt-35-turbo-0125', 'temperature_or_top_p': {'top_p': 0.36280922847807595}, 'max_tokens': 347, 'n': 10, 'prompt': 0, 'allow_format_str_template': True}
[flaml.tune.tune: 09-03 04:37:38] {909} DEBUG - result in tune: <flaml.tune.trial_runner.SimpleTrial object at 0x7305ab2ab290>, {'expected_success': 0.9150833227400001, 'success': 0.95, 'success_vote': 0.85, 'voted_answer': 'We can see that each term in the sequence is obtained by dividing the previous term by 3. \n\nSo, the sequence can be written as $6075, \\frac{6075}{3}, \\frac{6075}{3^2}, \\frac{6075}{3^3}, \\ldots$\n\nThe $n$th term in the sequence is given by $a_n = \\frac{6075}{3^{n-1}

### Output tuning results

After the tuning, we can print out the config and the result found by FLAML:

In [None]:
print("optimized config", config)
print("best result on tuning data", analysis.best_result)


optimized config {'model': 'gpt-35-turbo-0125', 'max_tokens': 470, 'n': 50, 'prompt': '{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}.', 'allow_format_str_template': True, 'temperature': 1.2672964698525508}
best result on tuning data {'expected_success': 0.9817914610756876, 'success': 1.0, 'success_vote': 0.9, 'voted_answer': "First, let's see if we can find a pattern in the exponents of 3 as we go from one number to the next.\n\nStarting with 6075, which is $3^5 \\cdot 5$, we divide by 3 to get 2025, which is $3^4 \\cdot 5$. From 2025 to 675, we are dividing by 3 again, and we get $3^3 \\cdot 5$. \n\nSo, we can see that each time we are dividing by 3, the exponent of 3 is decreasing by 1.\n\nContinuing this pattern, the next number in the sequence would be $3^2 \\cdot 5$, then $3^1 \\cdot 5$, and finally $3^0 \\cdot 5$. \n\nSo, there are a total of 6 integers in this sequence - 6075, 2025, 675, 225, 75, and 25. \n\n

### Make a request with the tuned config

We can apply the tuned config to a request for an example task:

In [None]:
response = autogen.ChatCompletion.create(context=tune_data[1], config_list=config_list, **config)
metric_results = eval_math_responses(autogen.ChatCompletion.extract_text(response), **tune_data[1])
print("response on an example data instance:", response)
print("metric_results on the example data instance:", metric_results)


response on an example data instance: {
  "id": "chatcmpl-A3FaYDpLh5cHf8tL5FvDGm2Kp2NKb",
  "choices": [
    {
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      },
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Given:\n$3+a=4-b$ ...(1)\n$4+b=7+a$ ...(2)\n\nCombining equations (1) and (2) to isolate $a$ and $b$:\n\n$3+a = 4-b \\implies a+b = 1$ ...(3)\n$4+b = 7+a \\implies b-a = 3$ ...(4)\n\nAdding equations (3) and (4) gives:\n$(a+b) + (b-a) = 1+3$\n$2b = 4$\n$b = 2$\n\nSubstitute $b=2$ into equation (3) gives:\n$a+2 = 1 \\implies a = -1$\n\nNow that we have found the values of $a$ and $b$, we can find $

### Evaluate the success rate on the test data

You can use `autogen.ChatCompletion.test` to evaluate the performance on an entire dataset with the tuned config. The following code takes a while (~15 minutes) to evaluate all the test data instances and costs roughly $3.

In [None]:
result = autogen.ChatCompletion.test(test_data, logging_level=logging.INFO, config_list=config_list, **config)
print("performance on test data with the tuned config:", result)


[flaml.autogen.oai.completion: 09-03 04:42:11] {930} INFO - evaluating data instance 0
evaluating data instance 0
[flaml.autogen.oai.completion: 09-03 04:42:18] {930} INFO - evaluating data instance 1
evaluating data instance 1
[flaml.autogen.oai.completion: 09-03 04:42:23] {930} INFO - evaluating data instance 2
evaluating data instance 2
[flaml.autogen.oai.completion: 09-03 04:42:26] {930} INFO - evaluating data instance 3
evaluating data instance 3
[flaml.autogen.oai.completion: 09-03 04:42:30] {930} INFO - evaluating data instance 4
evaluating data instance 4
[flaml.autogen.oai.completion: 09-03 04:42:34] {930} INFO - evaluating data instance 5
evaluating data instance 5
[flaml.autogen.oai.completion: 09-03 04:42:40] {930} INFO - evaluating data instance 6
evaluating data instance 6
[flaml.autogen.oai.completion: 09-03 04:42:44] {930} INFO - evaluating data instance 7
evaluating data instance 7
[flaml.autogen.oai.completion: 09-03 04:42:48] {930} INFO - evaluating data instance 8
e

What about the default, untuned gpt-35-turbo-0125 config (with the same prompt as the tuned config)? We can evaluate it and compare:

In [None]:
# the following code will cost roughly $0.1 if run.

default_config = {"model": 'gpt-35-turbo-0125', "prompt": prompts[0], "allow_format_str_template": True}
default_result = autogen.ChatCompletion.test(test_data, config_list=config_list, **default_config)
print("performance on test data from gpt-35-turbo-0125 with a default config:", default_result)


performance on test data from gpt-35-turbo-0125 with a default config: {'expected_success': 0.7761194029850746, 'success': 0.7761194029850746, 'success_vote': 0.7761194029850746, 'votes': 1.0, 'cost': 0.09040799999999997, 'inference_cost': 0.0004497910447761193}


In [None]:
print("tuned config succeeds in {:.1f}% test cases".format(result["success_vote"] * 100))
print("untuned config succeeds in {:.1f}% test cases".format(default_result["success_vote"] * 100))


tuned config succeeds in 93.0% test cases
untuned config succeeds in 77.6% test cases
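
The gap between the two configs can be put in numbers with a short calculation. The sketch below only reuses figures printed in this notebook: the two success rates above, the rough $3 estimate for the tuned run, and the `cost` reported for the default run.

```python
tuned_pct, untuned_pct = 93.0, 77.6      # success rates printed above
tuned_cost, untuned_cost = 3.0, 0.0904   # dollars: rough estimate vs. reported cost

gain_points = round(tuned_pct - untuned_pct, 1)
relative_gain = round(gain_points / untuned_pct * 100, 1)
cost_ratio = round(tuned_cost / untuned_cost, 1)

print(f"gain: {gain_points} percentage points ({relative_gain}% relative)")
print(f"cost ratio (tuned / untuned): about {cost_ratio}x")
```

The accuracy gain comes with a higher evaluation cost, largely because the tuned config samples `n=50` responses per problem for voting.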
