Copyright (c) Microsoft Corporation. All rights reserved. 

Licensed under the MIT License.

# Use FLAML to Optimize Code Generation Performance

In this notebook, we optimize OpenAI models for code generation. We use [the HumanEval benchmark](https://huggingface.co/datasets/openai_humaneval) released by OpenAI for synthesizing programs from docstrings.

Related link: [Blogpost](https://microsoft.github.io/FLAML/blog/2023/05/18/GPT-adaptive-humaneval) based on this experiment.

## Requirements

FLAML requires `Python>=3.7`. To run this notebook example, please install FLAML with the `[autogen]` option:
```bash
pip install flaml[autogen]==1.2.2
```

In [1]:
# %pip install flaml[autogen]==1.2.2 datasets

Set your OpenAI key:

In [2]:
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "<your OpenAI API key here>"

If you use Azure OpenAI, uncomment the following:

In [3]:
# import openai
# openai.api_type = "azure"
# openai.api_base = "https://<your_endpoint>.openai.azure.com/"
# openai.api_version = "2023-03-15-preview"  # change if necessary

## Load dataset

First, we load the HumanEval dataset, which contains 164 examples. In each example, "prompt" is the prompt string for eliciting the code generation (renamed to "definition" below), "test" is the Python unit-test code for the example, and "entry_point" is the name of the function to be tested.
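To make the three fields concrete, here is a minimal sketch of how one record could be checked by hand. The record is a hand-written toy in the HumanEval layout, not an actual benchmark entry, and `run_example` is a hypothetical helper that is not part of FLAML or `datasets`. The sketch assumes, as in HumanEval, that the "test" string defines a `check(candidate)` function.

```python
# A toy record in the HumanEval layout (hand-written, not from the benchmark).
example = {
    "definition": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}


def run_example(example, completion):
    """Hypothetical helper: return True if `completion` passes the record's unit test."""
    namespace = {}
    try:
        # The definition (signature + docstring) followed by the completion forms the full function.
        exec(example["definition"] + completion, namespace)
        # The test code defines check(candidate), which asserts on the candidate function.
        exec(example["test"], namespace)
        namespace["check"](namespace[example["entry_point"]])
    except Exception:
        return False
    return True


print(run_example(example, "    return a + b\n"))  # a correct function body
print(run_example(example, "    return a - b\n"))  # an incorrect function body
```

Running model-generated code with `exec` is unsafe outside a sandbox; this sketch only illustrates how the fields fit together.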

In [4]:
import datasets

seed = 41
data = datasets.load_dataset("openai_humaneval")["test"].shuffle(seed=seed)
data = data.select(range(len(data))).rename_column("prompt", "definition").remove_columns(["task_id", "canonical_solution"])

In [5]:
from flaml.autogen.code_utils import eval_function_completions, implement
from flaml import oai

The `implement` function first generates assertion statements for a problem, then uses those assertions to select among the generated responses. It returns the selected response, its cost, and the index of the configuration that produced it.
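The selection idea can be sketched without calling any model. The code below is a simplified illustration, not FLAML's implementation: `passes`, `select_with_assertions`, and the hard-coded candidate lists are hypothetical stand-ins for generated assertions and model responses. It assumes the configurations are tried in order and that the first candidate passing the assertions is returned, with the last candidate seen as a fallback.

```python
def passes(candidate_code, assertions):
    """Return True if the candidate code satisfies every assertion statement."""
    namespace = {}
    try:
        exec(candidate_code, namespace)
        exec(assertions, namespace)
    except Exception:
        return False
    return True


def select_with_assertions(candidates_per_config, assertions):
    """Try each config's candidates in order; return (code, config_index) for the first that passes.

    Falls back to the last candidate seen if none passes (an assumption of this sketch).
    """
    last = None
    for j, candidates in enumerate(candidates_per_config):
        for code in candidates:
            last = (code, j)
            if passes(code, assertions):
                return code, j
    return last


# Stand-ins for generated assertions and for the responses of two configs.
assertions = "assert is_even(2)\nassert not is_even(3)\n"
candidates_per_config = [
    ["def is_even(n):\n    return n % 2 == 1\n"],  # config 0: wrong parity check
    ["def is_even(n):\n    return n % 2 == 0\n"],  # config 1: correct
]

code, j = select_with_assertions(candidates_per_config, assertions)
print(f"Selected config {j}")
```

Ordering the configurations from cheapest to most expensive means a costlier configuration is only consulted when the cheaper ones fail the assertions, which is one plausible reading of why the notebook reports a config index per example.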

In [6]:
prompt = "# Python 3{definition}"
stops = [["\nclass", "\ndef", "\nif", "\nprint"], None]
configs = [
    {"model": "gpt-3.5-turbo", "prompt": prompt, "stop": stops[1], "temperature": 0, "seed": 0},
    {"model": "gpt-3.5-turbo", "prompt": prompt, "stop": stops[0], "n": 7, "seed": 0},
    {"model": "gpt-4", "prompt": prompt, "stop": stops[1], "temperature": 0, "seed": 1},
    {"model": "gpt-4", "prompt": prompt, "stop": stops[0], "n": 2, "seed": 2},
    {"model": "gpt-4", "prompt": prompt, "stop": stops[0], "n": 1, "seed": 2},
]
# baseline_gpt4_configs = [{"model": 'gpt-4', "prompt": prompt, "seed": 1}]
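oai.Completion.set_cache(0)  # use cache seed 0 so repeated requests can be served from the local cache
oai.Completion.retry_timeout = 600  # keep retrying failed requests for up to 600 seconds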
cost = 0
success = 0
for i, d in enumerate(data):
    response, cost_i, j = implement(d["definition"], configs)
    metrics = eval_function_completions(responses=[response], use_docker=False, **d)
    success += metrics["success"]
    cost += cost_i
    print(f"Example {i}, config {j}, success {success}")
print(f"Success rate: {success / len(data):.3f}")
print(f"Average cost: {cost / len(data):.5f}")

Example 0, config 1, success 1
Example 1, config 0, success 2
Example 2, config 0, success 3
Example 3, config 2, success 4
Example 4, config 2, success 5
Example 5, config 4, success 6
Example 6, config 4, success 6
Example 7, config 2, success 7
Example 8, config 2, success 8
Example 9, config 0, success 9
Example 10, config 1, success 10
Example 11, config 0, success 10
Example 12, config 2, success 11
Example 13, config 2, success 12
Example 14, config 0, success 13
Example 15, config 2, success 14
Example 16, config 0, success 15
Example 17, config 1, success 15
Example 18, config 1, success 16
Example 19, config 3, success 17
Example 20, config 2, success 18
Example 21, config 2, success 19
Example 22, config 2, success 19
Example 23, config 2, success 20
Example 24, config 0, success 21
Example 25, config 0, success 22
Example 26, config 4, success 23
Example 27, config 2, success 24
Example 28, config 4, success 24
Example 29, config 2, success 25
Example 30, config 2, success 