Copyright (c) Microsoft Corporation. All rights reserved. 

Licensed under the MIT License.

# Use FLAML to Tune ChatGPT

FLAML offers a cost-effective hyperparameter optimization technique [EcoOptiGen](https://arxiv.org/abs/2303.04673) for tuning Large Language Models. Our study finds that tuning hyperparameters can significantly improve the utility of LLMs.

In this notebook, we tune OpenAI ChatGPT (both GPT-3.5 and GPT-4) models for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought) to measure mathematical problem solving on competition math problems with chain-of-thought-style reasoning.

## Requirements

FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:
```bash
pip install "flaml[openai]>=1.2.0"
```

In [1]:
%pip install flaml[openai,synapse]==1.2.1 xgboost==1.6.1 pandas==1.5.1 numpy==1.23.4 datasets --force-reinstall

Set your OpenAI key:

In [2]:
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

Uncomment the following to use Azure OpenAI:

In [3]:
# import openai
# openai.api_type = "azure"
# openai.api_base = "YOUR_API_ENDPOINT"
# openai.api_version = "2022-12-01"
# openai.api_key = os.getenv("OPENAI_API_KEY")

## Load dataset

First, we load the competition_math dataset and keep only the "Level 2" Algebra problems. We use a random sample of 20 such examples from the training split for tuning the generation hyperparameters, and the 201 examples from the test split for evaluation.

In [4]:
import datasets

seed = 41
data = datasets.load_dataset("competition_math")
train_data = data["train"].shuffle(seed=seed)
test_data = data["test"].shuffle(seed=seed)
n_tune_data = 20
tune_data = [
    {
        "problem": train_data[x]["problem"],
        "solution": train_data[x]["solution"],
    }
    for x in range(len(train_data)) if train_data[x]["level"] == "Level 2" and train_data[x]["type"] == "Algebra"
][:n_tune_data]
test_data = [
    {
        "problem": test_data[x]["problem"],
        "solution": test_data[x]["solution"],
    }
    for x in range(len(test_data)) if test_data[x]["level"] == "Level 2" and test_data[x]["type"] == "Algebra"
]
print(len(tune_data), len(test_data))


20 201


Check a tuning example:

In [5]:
print(tune_data[1]["problem"])

The sum of Jim's weight and Bob's weight is 180 pounds. If you subtract Jim's weight from Bob's weight, you get half of Bob's weight. How many pounds does Bob weigh?


Here is one example of the canonical solution:

In [6]:
print(tune_data[1]["solution"])

Call Jim's weight $j$ and Bob's weight $b$. We can use the following system of equations to represent the information given: \begin{align*}
j + b &= 180 \\
b - j &= \frac{b}{2} \\
\end{align*} Adding the two equations together gives $2b = 180 + \frac{b}{2}$. Solving for $b$ gives $3b = 360$, or $b = 120$. Thus, Bob weighs $\boxed{120}$ pounds.


## Define Success Metric

Before we start tuning, we need to define the success metric we want to optimize. For each math task, we use voting to select the response with the most common answer among all the generated responses. If its answer is equivalent to that of the canonical solution, we consider the task successfully solved. We can then optimize the mean success rate over a collection of tasks.
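
The voting idea can be sketched in a few lines, separately from the notebook's own implementation. This is a simplified illustration: the helper names `majority_vote` and `vote_succeeds` are hypothetical, and answers are compared by exact string match, whereas the metric defined in this section first normalizes LaTeX formatting before comparing.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer, or None if there is none."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    # most_common(1) returns [(answer, count)]; ties go to the first-seen answer.
    return counts.most_common(1)[0][0]


def vote_succeeds(answers: List[Optional[str]], reference: str) -> bool:
    """A task counts as solved if the voted answer matches the reference."""
    return majority_vote(answers) == reference


# Five sampled answers for one task; one response had no boxed answer.
sampled = ["120", "120", "90", None, "120"]
print(majority_vote(sampled))
print(vote_succeeds(sampled, "120"))
```

Averaging the boolean result over all tasks gives the mean success rate that the tuner maximizes.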

In [7]:
from typing import Optional

def remove_boxed(string: str) -> Optional[str]:
    """Source: https://github.com/hendrycks/math
    Extract the text within a \\boxed{...} environment.
    Example:
    >>> remove_boxed("\\boxed{\\frac{2}{3}}")
    \\frac{2}{3}
    """
    left = "\\boxed{"
    try:
        assert string[: len(left)] == left
        assert string[-1] == "}"
        return string[len(left) : -1]
    except Exception:
        return None


def last_boxed_only_string(string: str) -> Optional[str]:
    """Source: https://github.com/hendrycks/math
    Extract the last \\boxed{...} or \\fbox{...} element from a string.
    """
    idx = string.rfind("\\boxed")
    if idx < 0:
        idx = string.rfind("\\fbox")
        if idx < 0:
            return None

    i = idx
    right_brace_idx = None
    num_left_braces_open = 0
    while i < len(string):
        if string[i] == "{":
            num_left_braces_open += 1
        if string[i] == "}":
            num_left_braces_open -= 1
            if num_left_braces_open == 0:
                right_brace_idx = i
                break
        i += 1

    if right_brace_idx is None:
        retval = None
    else:
        retval = string[idx : right_brace_idx + 1]

    return retval


def _fix_fracs(string: str) -> str:
    """Source: https://github.com/hendrycks/math
    Reformat fractions.
    Examples:
    >>> _fix_fracs("\\frac1b")
    \\frac{1}{b}
    >>> _fix_fracs("\\frac12")
    \\frac{1}{2}
    >>> _fix_fracs("\\frac1{72}")
    \\frac{1}{72}
    """
    substrs = string.split("\\frac")
    new_str = substrs[0]
    if len(substrs) > 1:
        substrs = substrs[1:]
        for substr in substrs:
            new_str += "\\frac"
            if substr[0] == "{":
                new_str += substr
            else:
                try:
                    assert len(substr) >= 2
                except Exception:
                    return string
                a = substr[0]
                b = substr[1]
                if b != "{":
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}{" + b + "}" + post_substr
                    else:
                        new_str += "{" + a + "}{" + b + "}"
                else:
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}" + b + post_substr
                    else:
                        new_str += "{" + a + "}" + b
    string = new_str
    return string


def _fix_a_slash_b(string: str) -> str:
    """Source: https://github.com/hendrycks/math
    Reformat fractions formatted as a/b to \\frac{a}{b}.
    Example:
    >>> _fix_a_slash_b("2/3")
    \\frac{2}{3}
    """
    if len(string.split("/")) != 2:
        return string
    a_str = string.split("/")[0]
    b_str = string.split("/")[1]
    try:
        a = int(a_str)
        b = int(b_str)
        assert string == "{}/{}".format(a, b)
        new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
        return new_string
    except Exception:
        return string


def _remove_right_units(string: str) -> str:
    """Source: https://github.com/hendrycks/math
    Remove units (on the right).
    "\\text{ " only ever occurs (at least in the val set) when describing units.
    """
    if "\\text{ " in string:
        splits = string.split("\\text{ ")
        assert len(splits) == 2
        return splits[0]
    else:
        return string


def _fix_sqrt(string: str) -> str:
    """Source: https://github.com/hendrycks/math
    Reformat square roots.
    Example:
    >>> _fix_sqrt("\\sqrt3")
    \\sqrt{3}
    """
    if "\\sqrt" not in string:
        return string
    splits = string.split("\\sqrt")
    new_string = splits[0]
    for split in splits[1:]:
        if split[0] != "{":
            a = split[0]
            new_substr = "\\sqrt{" + a + "}" + split[1:]
        else:
            new_substr = "\\sqrt" + split
        new_string += new_substr
    return new_string


def _strip_string(string: str) -> str:
    """Source: https://github.com/hendrycks/math
    Apply the reformatting helper functions above.
    """
    # linebreaks
    string = string.replace("\n", "")
    # print(string)

    # remove inverse spaces
    string = string.replace("\\!", "")
    # print(string)

    # replace \\ with \
    string = string.replace("\\\\", "\\")
    # print(string)

    # replace tfrac and dfrac with frac
    string = string.replace("tfrac", "frac")
    string = string.replace("dfrac", "frac")
    # print(string)

    # remove \left and \right
    string = string.replace("\\left", "")
    string = string.replace("\\right", "")
    # print(string)

    # Remove circ (degrees)
    string = string.replace("^{\\circ}", "")
    string = string.replace("^\\circ", "")

    # remove dollar signs
    string = string.replace("\\$", "")

    # remove units (on the right)
    string = _remove_right_units(string)

    # remove percentage
    string = string.replace("\\%", "")
    string = string.replace(r"\%", "")

    # " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
    string = string.replace(" .", " 0.")
    string = string.replace("{.", "{0.")
    # if empty, return empty string
    if len(string) == 0:
        return string
    if string[0] == ".":
        string = "0" + string

    # to consider: get rid of e.g. "k = " or "q = " at beginning
    if len(string.split("=")) == 2:
        if len(string.split("=")[0]) <= 2:
            string = string.split("=")[1]

    # fix sqrt3 --> sqrt{3}
    string = _fix_sqrt(string)

    # remove spaces
    string = string.replace(" ", "")

    # \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc.
    # Even works with \frac1{72} (but not \frac{72}1).
    # Also does a/b --> \\frac{a}{b}
    string = _fix_fracs(string)

    # manually change 0.5 --> \frac{1}{2}
    if string == "0.5":
        string = "\\frac{1}{2}"

    # NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
    string = _fix_a_slash_b(string)

    return string


def get_answer(solution: Optional[str]) -> Optional[str]:
    if solution is None:
        return None
    last_boxed = last_boxed_only_string(solution)
    if last_boxed is None:
        return None
    answer = remove_boxed(last_boxed)
    if answer is None:
        return None
    return answer


def is_equiv(str1: Optional[str], str2: Optional[str]) -> float:
    """Returns (as a float) whether two strings containing math are equivalent up to differences of formatting in
    - units
    - fractions
    - square roots
    - superfluous LaTeX.
    Source: https://github.com/hendrycks/math
    """
    if str1 is None and str2 is None:
        print("WARNING: Both None")
        return 1.0
    if str1 is None or str2 is None:
        return 0.0

    try:
        ss1 = _strip_string(str1)
        ss2 = _strip_string(str2)
        return float(ss1 == ss2)
    except Exception:
        return float(str1 == str2)


def is_equiv_chain_of_thought(str1: str, str2: str) -> float:
    """Strips the solution first before calling `is_equiv`."""
    ans1 = get_answer(str1)
    ans2 = get_answer(str2)

    return is_equiv(ans1, ans2)


def success_metrics(responses, solution, **args):
    """Check if each response is correct.
    
    Args:
        responses (list): The list of responses.
        solution (str): The canonical solution.
    
    Returns:
        dict: The success metrics.
    """
    success_list = []
    n = len(responses)
    for i in range(n):
        response = responses[i]
        succeed = is_equiv_chain_of_thought(response, solution)
        success_list.append(succeed)
    # voting
    answers = {}
    for i in range(n):
        equiv = i
        if get_answer(responses[i]) is None:
            # ignore None answers
            continue
        for j in answers:
            if is_equiv_chain_of_thought(responses[i], responses[j]):
                equiv = j
                break
        if equiv in answers:
            answers[equiv] += 1
        else:
            answers[equiv] = 1
    # find the answer with highest votes in answers
    answer = max(answers.items(), key=lambda x: x[1], default=(0, 0))[0]
    # check if the answer is correct
    success_vote = is_equiv_chain_of_thought(responses[answer], solution)
    return {
        "expected_success": 1 - pow(1 - sum(success_list) / n, n),
        "success": any(s for s in success_list),
        "success_vote": success_vote,
        "voted_answer": responses[answer],
    }
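
The `expected_success` entry returned above is computed as `1 - (1 - k/n)^n`, where `k` of the `n` responses are correct. One way to read it: if each response were an independent draw with success probability `k/n`, this is the probability that at least one of `n` draws is correct. A small worked sketch (the function name `expected_success` is introduced here only for illustration):

```python
def expected_success(num_correct: int, n: int) -> float:
    """Probability that at least one of n independent draws is correct,
    using the empirical per-response success rate num_correct / n."""
    p = num_correct / n
    return 1 - (1 - p) ** n


# With n = 10 responses, see how the value grows with the number of correct ones.
for k in (0, 1, 3, 10):
    print(k, round(expected_success(k, 10), 4))
```

Even a single correct response out of ten already yields a value above 0.65, which is why `success_vote` is the stricter metric to optimize.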


## Use the tuning data to find a good configuration

### Import the oai subpackage from flaml

FLAML provides an API for hyperparameter optimization of OpenAI ChatGPT models, `oai.ChatCompletion.tune`, and an API for making a request with the tuned config, `oai.ChatCompletion.create`. First, we import oai from flaml:

In [11]:
from flaml import oai

For (local) reproducibility and cost efficiency, we cache responses from OpenAI.

In [12]:
oai.ChatCompletion.set_cache(seed)

This creates a disk cache in ".cache/{seed}". You can change `cache_path` in `set_cache()`. Caches for different seeds are stored separately.
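
To make the per-seed layout concrete, here is a minimal standard-library sketch of a response cache that keeps one directory per seed. It is not FLAML's implementation: the helpers `cache_dir_for` and `cached_call`, the JSON file layout, and the stand-in `fake_api` function are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile


def cache_dir_for(seed: int, cache_root: str = ".cache") -> str:
    """Hypothetical helper: one cache directory per seed."""
    return os.path.join(cache_root, str(seed))


def cached_call(request: dict, seed: int, call_fn, cache_root: str = ".cache"):
    """Return the cached response for `request` if present, else call and store it."""
    directory = cache_dir_for(seed, cache_root)
    os.makedirs(directory, exist_ok=True)
    # Key each entry by a stable hash of the request parameters.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    path = os.path.join(directory, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    response = call_fn(request)
    with open(path, "w") as f:
        json.dump(response, f)
    return response


# Usage with a stand-in for the real API call.
calls = []


def fake_api(request):
    calls.append(request)
    return {"content": "stub response"}


root = tempfile.mkdtemp()
req = {"model": "gpt-3.5-turbo", "prompt": "1+1=?"}
cached_call(req, seed=41, call_fn=fake_api, cache_root=root)
cached_call(req, seed=41, call_fn=fake_api, cache_root=root)  # served from disk
cached_call(req, seed=42, call_fn=fake_api, cache_root=root)  # different seed, new call
print(len(calls))
```

Repeating a request under the same seed costs nothing, while switching the seed starts from an empty cache and triggers fresh calls.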

### Perform tuning

Tuning can take a while to finish, depending on the optimization budget. It is performed under the following budget settings:

* `inference_budget` is the target average inference budget per instance in the benchmark. For example, 0.004 means the target inference budget is 0.004 dollars, which translates to 2000 tokens (input + output combined) if the gpt-3.5-turbo model is used.
* `optimization_budget` is the total budget allowed for tuning. For example, 1 means 1 dollar is allowed in total, which translates to 500K tokens for the gpt-3.5-turbo model.
* `num_samples` is the number of different hyperparameter configurations allowed to be tried. Tuning stops after either `num_samples` trials or `optimization_budget` dollars have been spent, whichever happens first. -1 means no hard restriction on the number of trials; the actual number is then decided by `optimization_budget`.
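
The dollar-to-token figures in the list above can be reproduced with simple arithmetic. The sketch below assumes a price of 0.002 dollars per 1000 tokens for gpt-3.5-turbo, which is the rate implied by those figures; the constant and the helper `budget_to_tokens` are illustrative names, not part of FLAML.

```python
# Assumed price, implied by the figures above (0.004 dollars <-> 2000 tokens).
PRICE_PER_1K_TOKENS = 0.002  # dollars per 1000 tokens for gpt-3.5-turbo


def budget_to_tokens(budget_dollars: float, price_per_1k: float = PRICE_PER_1K_TOKENS) -> int:
    """Convert a dollar budget into the number of tokens it buys."""
    return round(budget_dollars / price_per_1k * 1000)


print(budget_to_tokens(0.004))  # inference budget per instance
print(budget_to_tokens(1))      # total optimization budget
```

For a different model, substitute its per-token price to see how far the same budget goes.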

Users can specify the tuning data, optimization metric, optimization mode, evaluation function, search space, etc. The default search space is:

```python
default_search_space = {
    "model": tune.choice([
        "gpt-3.5-turbo",
        "gpt-4",
    ]),
    "temperature_or_top_p": tune.choice(
        [
            {"temperature": tune.uniform(0, 1)},
            {"top_p": tune.uniform(0, 1)},
        ]
    ),
    "max_tokens": tune.lograndint(50, 1000),
    "n": tune.randint(1, 100),
    "prompt": "{prompt}",
}
```

The default search space can be overridden by users' input.
For example, the following code specifies a fixed prompt template. For hyperparameters which don't appear in users' input, the default search space will be used.
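
The override behaviour described here can be pictured as a dictionary merge in which user-supplied keys win. The sketch below uses plain Python values as stand-ins for the tunable domains and is only an illustration of the idea, not FLAML's internal logic.

```python
# Plain-Python stand-ins for the tunable domains in the default search space.
default_search_space = {
    "model": ["gpt-3.5-turbo", "gpt-4"],
    "max_tokens": ("lograndint", 50, 1000),
    "n": ("randint", 1, 100),
    "prompt": "{prompt}",
}

# The user fixes the model and supplies one prompt template.
user_input = {
    "model": "gpt-3.5-turbo",
    "prompt": ["{problem} Solve the problem carefully."],
}

# Keys given by the user replace the defaults; the rest are kept.
search_space = {**default_search_space, **user_input}
for name, domain in search_space.items():
    print(name, "->", domain)
```

Here `max_tokens` and `n` remain searchable, while `model` and `prompt` are pinned to the user's choices.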

In [29]:
import logging

prompts = ["{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."]
config, analysis = oai.ChatCompletion.tune(
    data=tune_data,  # the data for tuning
    metric="success_vote",  # the metric to optimize
    mode="max",  # the optimization mode
    eval_func=success_metrics,  # the evaluation function to return the success metrics
    # log_file_name="logs/math.log",  # the log file name
    inference_budget=0.03,  # the inference budget (dollar)
    optimization_budget=1,  # the optimization budget (dollar)
    # num_samples can further limit the number of trials for different hyperparameter configurations;
    # -1 means decided by the optimization budget only
    num_samples=-1,
    model="gpt-35-turbo",
    # model="chatgpt-35-turbo-0301",  # uncomment if using Azure OpenAI
    # model="gpt-3.5-turbo",  # uncomment if you don't have access to gpt-4
    prompt=prompts,  # the prompt templates to choose from
    # stop="###",  # the stop sequence
    logging_level=logging.INFO,  # the logging level
)


[I 2023-04-11 02:07:27,277] A new study created in memory with name: optuna


No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune
[flaml.tune.tune: 04-11 02:07:27] {777} INFO - trial 1 config: {'model': 'gpt-35-turbo', 'temperature_or_top_p': {'top_p': 0.36280922847807595}, 'max_tokens': 347, 'n': 10, 'prompt': 0}
[flaml.tune.tune: 04-11 02:08:44] {197} INFO - result: {'expected_success': 0.799999999995, 'success': 0.8, 'success_vote': 0.8, 'voted_answer': '\n\nStarting with $4+2.3y = 1.7y - 20$, we can simplify by subtracting $1.7y$ from both sides: $$4+0.6y = -20.$$ Next, we can subtract 4 from both sides: $$0.6y = -24.$$ Finally, we can divide both sides by 0.6 to get $$y = \\boxed{-40}.$$', 'total_cost': 0.060924, 'cost': 0.060924, 'inference_cost': 0.0026643, 'training_iteration': 0, 'config': {'model': 'gpt-35-turbo', 'temperature_or_to

### Output tuning results

After the tuning, we can print out the config and the result found by FLAML:

In [30]:
print("optimized config", config)
print("best result on tuning data", analysis.best_result)

optimized config {'max_tokens': 470, 'n': 50, 'prompt': '{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}.', 'model': 'gpt-35-turbo', 'stop': None, 'temperature': 0.6336482349262754}
best result on tuning data {'expected_success': 0.9935049966526093, 'success': 1.0, 'success_vote': 0.95, 'voted_answer': '\n\nCombining like terms, we have $4 + 2.3y = 1.7y - 20$. Adding $20$ to both sides gives $24+2.3y = 1.7y$. Subtracting $1.7y$ from both sides gives $0.6y = -24$. Dividing both sides by $0.6$ gives $y = \\boxed{-40}$.', 'total_cost': 0.37151399999999996, 'cost': 0.31059, 'inference_cost': 0.015078500000000002, 'training_iteration': 0, 'config': {'max_tokens': 470, 'n': 50, 'prompt': '{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}.', 'model': 'gpt-35-turbo', 'stop': None, 'temperature': 0.6336482349262754}, 'config/temperature_or_top_p': {'temperature

### Make a request with the tuned config

We can apply the tuned config to a request for an example task:

In [31]:
responses = oai.ChatCompletion.create(context=tune_data[1], **config)
metric_results = success_metrics([response["message"]["content"].rstrip() for response in responses["choices"]], **tune_data[1])
print("response on an example data instance:", responses)
print("metric_results on the example data instance:", metric_results)


response on an example data instance: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "\n\nLet's call Jim's weight \"J\" and Bob's weight \"B\". \n\nFrom the first sentence, we know that: \n\nJ + B = 180 \n\nFrom the second sentence, we know that: \n\nB - J = 0.5B \n\nWe can simplify the second equation by adding J to both sides: \n\nB = J + 0.5B \n\nNow we can substitute this expression for B into the first equation: \n\nJ + (J + 0.5B) = 180 \n\nSimplifying: \n\n2J + 0.5B = 180 \n\nSubtracting 0.5B from both sides: \n\n2J = 180 - 0.5B \n\nDividing by 2: \n\nJ = 90 - 0.25B \n\nNow we can substitute this expression for J into the equation B - J = 0.5B: \n\nB - (90 - 0.25B) = 0.5B \n\nSimplifying: \n\n1.25B - 90 = 0.5B \n\nSubtracting 0.5B from both sides: \n\n0.75B - 90 = 0 \n\nAdding 90 to both sides: \n\n0.75B = 90 \n\nDividing by 0.75: \n\nB = 120 \n\nTherefore, Bob weighs \\boxed{120} pounds.",
        "role": "assistant"
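
The `context` passed in the request supplies the fields that fill the prompt template. The sketch below assumes the template is filled with Python's `str.format`-style substitution, which is what the doubled braces in `\\boxed{{}}` suggest; treat that as an assumption about FLAML rather than a documented guarantee. The example problem is made up for illustration.

```python
template = (
    "{problem} Solve the problem carefully. "
    "Simplify your answer as much as possible. "
    "Put the final answer in \\boxed{{}}."
)
context = {
    "problem": "What is 1 + 1?",
    "solution": "The answer is $\\boxed{2}$.",
}

# str.format fills {problem}, ignores unused keys such as "solution",
# and turns the escaped "{{}}" into a literal "{}".
prompt = template.format(**context)
print(prompt)
```

Without the doubled braces, `format` would treat `{}` as a positional placeholder and raise an error because no positional argument is given.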

### Evaluate the success rate on the test data

You can use flaml's `oai.ChatCompletion.test` to evaluate the performance on an entire dataset with the tuned config. The following code, if uncommented and run, takes a while (30 minutes to 1 hour) to evaluate all the test data instances and costs roughly $3.

In [32]:
# result = oai.ChatCompletion.test(test_data, config, success_metrics)
# print("performance on test data with the tuned config:", result)

What about the default, untuned gpt-4 config (with the same prompt as the tuned config)? We can evaluate it and compare:

In [33]:
# assuming you have access to gpt-4; otherwise use gpt-3.5-turbo
# the following code will cost roughly $2 if uncommented and run.

# default_config = {"model": 'gpt-4', "prompt": prompts[0]}
# default_result = oai.ChatCompletion.test(test_data, default_config, success_metrics)
# print("performance on test data from gpt-4 with a default config:", default_result)

In [34]:
# print("tuned config succeeds in {:.1f}% test cases".format(result["success_vote"] * 100))
# print("untuned config succeeds in {:.1f}% test cases".format(default_result["success_vote"] * 100))

Note that the untuned config has a lower inference cost. What if we heuristically increase the number of responses n to 5?

In [35]:
# config_larger = {"model": 'gpt-4', "prompt": prompts[0], "n": 5}
# default_result = oai.ChatCompletion.test(test_data, config_larger, success_metrics)
# print("performance on test data from gpt-4 with a default config and n=5:", default_result)

We find that the 'success_vote' metric increases, at the cost of exceeding the inference budget. The tuned configuration, however, achieves both a higher 'success_vote' (92% vs. 87%) and a lower average inference cost ($0.016 vs. $0.04 per instance).
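
To put the per-instance costs in perspective, the following back-of-the-envelope sketch scales them to the 201 test instances used in this notebook. It only reuses the numbers quoted above; the variable names are illustrative.

```python
n_test = 201                        # test instances, from the dataset section
tuned_cost_per_instance = 0.016     # dollars, tuned configuration
compared_cost_per_instance = 0.04   # dollars, the compared gpt-4 configuration

tuned_total = n_test * tuned_cost_per_instance
compared_total = n_test * compared_cost_per_instance

print(f"tuned total: ${tuned_total:.2f}")
print(f"compared total: ${compared_total:.2f}")
print(f"cost ratio (tuned / compared): {tuned_total / compared_total:.2f}")
```

The tuned total is consistent with the rough $3 estimate given earlier for evaluating the tuned config on the full test set.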

A developer can use flaml to tune the configuration so that it satisfies the target inference budget while getting the most value out of it.