# Model Distillation for Function Calling

This notebook is part of a series demonstrating advanced model distillation techniques for creating specialized, function-calling-aware models. The goal is to distill the knowledge from a large language model (Amazon Nova Premier) into a smaller, more efficient model while maintaining high-quality function calling capabilities.

## Learning Objectives
- Prepare training data for function calling model distillation
- Design structured output formats for consistent function parameter generation
- Implement function selection and parameter extraction
- Create evaluation datasets for measuring function calling accuracy

## Dataset: Berkeley Function Calling Leaderboard (BFCL) V2 Live
We use the Berkeley Function Calling Leaderboard (BFCL) V2 Live dataset as our base dataset. This dataset is particularly suitable for function-calling model training because:

1. Contains 2,251 question-function-answer pairs total
2. Provides diverse function calling scenarios:
   - 258 simple calls
   - 1,053 multiple parameter calls
   - 16 parallel function calls
   - 24 parallel multiple parameter calls
   - 882 irrelevance detection cases
   - 18 relevance detection cases
3. Offers complexity with an average of 3 function choices per entry (maximum 37)
4. Includes parameter diversity with an average of 4 parameters per function (maximum 28)

The dataset is processed and stored in optimized formats for efficient model training and evaluation.

In [None]:
# upgrade pip and boto3, then install the BFCL evaluation package
%pip install --upgrade pip --quiet
%pip install boto3 --upgrade --quiet
%pip install bfcl-eval --quiet

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

We need to set the project root so results are put in the correct location. 

In [None]:
import os
os.environ['BFCL_PROJECT_ROOT'] = os.getcwd()

If you're running this on your own machine, enter your AWS access keys in a `.env` file.
Uncomment the cell below to copy an example `.env` file into the project root.

In [None]:
# # set up environment file
# %cp $(python -c "import bfcl_eval; print(bfcl_eval.__path__[0])")/.env.example $BFCL_PROJECT_ROOT/.env
# # Fill in necessary values in `.env`

In [None]:
# download sample data
# %cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'test_case_ids_to_generate.json.example')") $BFCL_PROJECT_ROOT/test_case_ids_to_generate.json

For this example, we'll train and evaluate with a combination of v3_simple, v3_multiple, v3_live_relevance, and v3_irrelevance. For more information on these categories and their intents, see the [official BFCL documentation](https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard).

Let's move these to our local directory so we can begin preparing the data for distillation.

In [None]:
%mkdir questions
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_simple.json')") ./questions/BFCL_v3_simple.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_multiple.json')") ./questions/BFCL_v3_multiple.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_irrelevance.json')") ./questions/BFCL_v3_irrelevance.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'BFCL_v3_live_relevance.json')") ./questions/BFCL_v3_live_relevance.json

Now we'll grab the corresponding answers. For the simple and multiple datasets, possible answers are provided, and we'll use these for our mix-in labels.

Per the BFCL documentation, the correct answer for any question in the `BFCL_v3_irrelevance` dataset is an empty list of functions, as these questions are designed specifically to test the model's ability to recognize that none of the provided functions are relevant.

The correct answer for any question in the `BFCL_v3_live_relevance` dataset is "at least one" function call returned.

Here's an excerpt from [their documentation](https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard):

> **Irrelevance Detection (875):** The scenario where none of the function choices provided are relevant to the user query and none should be invoked. We expect the model to not output a function call; the model can either output a message explaining why the function provided are not relevant or simply output a non-function call response (e.g., an empty list).

> **Relevance Detection (41):** The opposite of irrelevance detection. The scenario where at least one of the function choices provided are relevant to the user query and should be invoked, but the way the user prompt or the function doc is stated means that there could be infinitely many correct function calls and impossible to use a pre-defined possible answer set to evaluate. We expect the model to output some function call (one or multiple) that is relevant to the user query; we don't check for the correctness of the function call in this category (eg, correct parameter value).
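
The two rules quoted above can be expressed as a small check. This is a minimal sketch, not BFCL's own scoring code: the function name `is_detection_correct` and the representation of a prediction as a list of call dictionaries are assumptions made for illustration.

```python
from typing import Any, Dict, List


def is_detection_correct(category: str, predicted_calls: List[Dict[str, Any]]) -> bool:
    """Score a prediction for the two detection categories (illustrative only)."""
    if category == "irrelevance":
        # Correct only when the model made no function call at all
        return len(predicted_calls) == 0
    if category == "live_relevance":
        # Correct when at least one call was made; argument values are not checked
        return len(predicted_calls) >= 1
    raise ValueError(f"unsupported category: {category}")


print(is_detection_correct("irrelevance", []))
print(is_detection_correct("live_relevance", [{"get_weather": {"city": "Paris"}}]))
```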

In [None]:
%mkdir answers
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'possible_answer' / 'BFCL_v3_simple.json')") ./answers/BFCL_v3_simple.json
%cp $(python -c "import bfcl_eval, pathlib; print(pathlib.Path(bfcl_eval.__path__[0]) / 'data' / 'possible_answer' / 'BFCL_v3_multiple.json')") ./answers/BFCL_v3_multiple.json


To make things a bit cleaner, we'll manually create answer files for the relevance and irrelevance datasets as well. This will make it easier to combine for our training dataset as our mix-in will require a few ground truth examples.
We'll emulate the ground truth response structure the BFCL team is using, and return this for irrelevance answers:
```json
{"id": "irrelevance_13", "ground_truth": []}
```

and this for relevance answers, built by picking a random function from the list. Note that they "don't check for the correctness of the function call in this category (eg, correct parameter value)", so we'll use placeholders for the parameter values. Remember, we're only doing this to produce labeled mix-in training data that hints the teacher model during distillation:
```json
{"id": "live_relevance_5-5-0", "ground_truth": [{"get_copyright_info": {"copyright_content": ["The specific content that is claimed to be copyrighted."], "copyright_holder": ["The name of the individual or organization that holds the copyright."], "confidence_score": [0.8]}}]}
```

Let's process the irrelevance answers

In [None]:
import json
irrelevance_answers = []
with open("questions/BFCL_v3_irrelevance.json", "r") as samples:
    for sample in samples.readlines():
        id = json.loads(sample)['id']
        answer = {
            "id": id,
            "ground_truth": []
        }
        irrelevance_answers.append(answer)

In [None]:
with open("answers/BFCL_v3_irrelevance.json", "w") as output_file:
    for answer in irrelevance_answers:
        json.dump(answer, output_file)
        output_file.write('\n')

Now, let's process the relevance answers. We don't have an answer file for this already, but remember, BFCL counts any response with at least one function call as correct. We'll use a random function from the list of functions provided in each example.

To create answers with the right data types, we'll build a helper function that accounts for the different answer scenarios in the relevance dataset.

In [None]:
import random

def generate_answer(function: dict) -> dict:
    """Build a placeholder ground-truth entry for one function definition.

    Parameters with a default use that default. Required string parameters get
    a random enum member when an enum is defined, otherwise a fixed placeholder.
    Required non-string parameters without a default are left out.
    """
    parameters = function.get('parameters', {})
    function_params = parameters.get('properties', {})
    required_params = parameters.get('required', [])
    param_values = {}

    for name, spec in function_params.items():
        # Use the declared default when one is available
        if 'default' in spec:
            param_values[name] = [spec['default']]

        # Required string parameters always get a concrete value
        if name in required_params and spec.get('type') == 'string':
            if 'enum' in spec:
                param_values[name] = [random.choice(spec['enum'])]
            else:
                param_values[name] = ['test parameter value']

    return {function['name']: param_values}

In [None]:
import json
import random
relevance_answers = []
with open("questions/BFCL_v3_live_relevance.json", "r") as samples:
    for sample in samples.readlines():
        s = json.loads(sample)
        id = s['id']

        chosen_function = s['function'][random.randint(0,len(s['function'])-1)]
        relevance_answers.append({
            "id": id,
            "ground_truth": [generate_answer(chosen_function)]
            })

In [None]:
with open("answers/BFCL_v3_live_relevance.json", "w") as output_file:
    for answer in relevance_answers:
        json.dump(answer, output_file)
        output_file.write('\n')

## Data Preparation Steps

1. **Data Splitting**
   - Split the BFCL question datasets randomly:
     - 50% into `training` directory
     - 50% into `eval` directory
   
2. **Mix-in Data Creation**
   - From the training data:
     - Create `mix_in` subdirectory
     - Move 10% of records into mix_in
     - Keep 90% in training
   
3. **Ground Truth Integration**
   - For mix-in data:
     - Look up corresponding answers in answer dataset
     - Add as ground truth assistant responses
   
4. **Prompt Engineering**
   - Build Bedrock invoke API prompts with tool calling functionality
   
5. **Data Consolidation**
   - Combine all training data (including mix-in)
   - Format as JSONL for Bedrock distillation service
   - Save in training directory
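
The splitting in steps 1 and 2 can be sketched on toy data before running it on the real files. This sketch assumes pandas is installed and uses made-up ids; it only illustrates the 50/50 split, the 10% mix-in carve-out, and a check that the three subsets do not overlap.

```python
import pandas as pd

# Toy stand-in for the combined BFCL questions (hypothetical ids)
toy = pd.DataFrame({"id": [f"simple_{i}" for i in range(100)]})

# Step 1: 50/50 split into training and eval
train = toy.sample(frac=0.5, random_state=42)
eval_set = toy.drop(train.index)

# Step 2: carve 10% of the training rows out as mix-in
mix_in = train.sample(frac=0.1, random_state=42)
train_remaining = train.drop(mix_in.index)

# The three subsets should be pairwise disjoint and cover every record
train_ids, mix_ids, eval_ids = (set(df["id"]) for df in (train_remaining, mix_in, eval_set))
assert train_ids.isdisjoint(mix_ids)
assert train_ids.isdisjoint(eval_ids)
assert mix_ids.isdisjoint(eval_ids)
assert len(train_ids | mix_ids | eval_ids) == len(toy)

print(len(train_remaining), len(mix_in), len(eval_set))
```

Dropping by index relies on the DataFrame having a unique index, which is the case for a freshly constructed frame.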

In [None]:
import json
import os
import pandas as pd
import numpy as np

# Create directories if they don't exist
os.makedirs('training', exist_ok=True)
os.makedirs('eval', exist_ok=True)

# Load and combine all question datasets
question_files = [
    'questions/BFCL_v3_simple.json',
    'questions/BFCL_v3_multiple.json',
    'questions/BFCL_v3_irrelevance.json',
    'questions/BFCL_v3_live_relevance.json'
]

all_questions = []
for file in question_files:
    print("reading... ", file)
    with open(file, 'r') as f:
        # Each BFCL file holds one JSON object per line
        for line in f:
            if line.strip():
                all_questions.append(json.loads(line))

# Convert to DataFrame for easier manipulation
df_questions = pd.DataFrame(all_questions)

# Randomly split into training (50%) and eval (50%)
df_train = df_questions.sample(frac=0.5, random_state=42)
df_eval = df_questions.drop(df_train.index)

# Save splits to respective directories
df_train.to_json('training/questions.json', orient='records', indent=2)
df_eval.to_json('eval/questions.json', orient='records', indent=2)

print(f"Training set size: {len(df_train)}")
print(f"Evaluation set size: {len(df_eval)}")

In [None]:
# Create mix_in directory
os.makedirs('training/mix_in', exist_ok=True)

# Select 10% of training data for mix-in
df_mix_in = df_train.sample(frac=0.1, random_state=42)
df_train_remaining = df_train.drop(df_mix_in.index)

# Save mix-in and remaining training data
df_mix_in.to_json('training/mix_in/questions.json', orient='records', indent=2)
df_train_remaining.to_json('training/questions.json', orient='records', indent=2)

print(f"Mix-in set size: {len(df_mix_in)}")
print(f"Remaining training set size: {len(df_train_remaining)}")

In [None]:
# Load answer datasets
answer_files = {
    'simple': 'answers/BFCL_v3_simple.json',
    'multiple': 'answers/BFCL_v3_multiple.json',
    'relevance': 'answers/BFCL_v3_live_relevance.json',
    'irrelevance': 'answers/BFCL_v3_irrelevance.json'
}

all_answers = []
for dataset_type, file in answer_files.items():
    with open(file, 'r') as f:
        for answer in f.readlines():
            all_answers.append(json.loads(answer))
        # all_answers.update({a['id']: a['ground_truth'] for a in answers})

# Convert to DataFrame for easier manipulation
df_all_answers = pd.DataFrame(all_answers)

In [None]:
# Add ground truth answers to mix-in data
# Merge the dataframes on the 'id' column; the answers already use the 'ground_truth' column name
df_mix_in = df_mix_in.merge(
    df_all_answers[['id', 'ground_truth']], 
    on='id', 
    how='left'
)

df_mix_in.to_json('training/mix_in/questions_with_answers.json', orient='records', indent=2)

print(f"Added ground truth answers to {len(df_mix_in)} mix-in records")

By now we should have our mix-in records with answers. We'll combine these with the remaining training records, which have no answers, to form the final distillation training dataset. However, we still have to format the dataset to work with Bedrock, along with our Nova prompt.
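
Because the answers are attached with a left join, a mix-in question whose id is missing from the answer files would end up with an empty `ground_truth` without any error. The sketch below shows one way to detect that on toy data, using the `indicator` option of `DataFrame.merge`; the ids and answers here are made up for illustration.

```python
import pandas as pd

questions = pd.DataFrame({"id": ["simple_1", "multiple_2", "irrelevance_3"]})
answers = pd.DataFrame({
    "id": ["simple_1", "irrelevance_3"],
    "ground_truth": [[{"f": {"x": [1]}}], []],
})

# indicator=True adds a `_merge` column saying where each row was found
merged = questions.merge(answers, on="id", how="left", indicator=True)

# Rows marked "left_only" had no matching answer
missing = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(missing)
```

Using the indicator column rather than testing `ground_truth` directly avoids confusing a legitimately empty answer list (as in the irrelevance category) with a missing one.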

We'll begin the final formatting now.

First we'll start with our system prompt, as this will contain the tools available for the agent to call. We're following the best practices laid out here for agent calling: https://docs.aws.amazon.com/bedrock/latest/userguide/distillation-prepare-datasets.html


## Prepare distillation training data with Prompt-Only Function Calling
This is a prompting-only approach to tool calling that relies entirely on the system prompt to provide the available tools to the model to pick from. 

In [None]:
# System prompt for function calling
import json

def create_sys_prompt(tools) -> str:
    """Return the BFCL system prompt followed by the available tools serialized as JSON."""

    SYSTEM_PROMPT = """You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
You should only return the function calls in your response.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

At each turn, you should try your best to complete the tasks requested by the user within the current turn. Continue to output functions to call until you have fulfilled the user's request to the best of your ability. Once you have no more functions to call, the system will consider the current turn complete and proceed to the next turn or task.
"""

    # Append the tool definitions as JSON so the model can see what it may call
    return (
        SYSTEM_PROMPT
        + f"\nHere is a list of functions in JSON format that you can invoke.\n{json.dumps(tools)}\n"
    )

In [None]:
# helper method for transforming list of functions from BFCL data set to bedrock tool spec
from typing import List

def transform_to_toolspec(input_data: List):
    """
    Transform function calling format to toolSpec format.
    
    Args:
        input_data (list): List of function definitions in the input format
        
    Returns:
        list: List of function definitions in the toolSpec format
    """
    result = []
    
    for func in input_data:
        # Extract the parameters object
        parameters = func.get("parameters", {})
        
        # Create the toolSpec structure
        toolspec_item = {
            "toolSpec": {
                "name": func["name"],
                "description": func["description"],
                "inputSchema": {
                    "json": {
                        "type": "object",  # Convert "dict" to "object"
                        "properties": parameters.get("properties", {}),
                        "required": parameters.get("required", [])
                    }
                }
            }
        }
        
        result.append(toolspec_item)
    
    return result

Here we fine-tune with the system prompt used for BFCL. If your evaluation framework uses a specific system prompt that represents your business, include that prompt in your fine-tuning data so the model is tuned to your use case.
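
The BFCL system prompt asks for calls in the form `[func_name1(params_name1=params_value1, ...)]`, while BFCL possible answers map each parameter to a list of acceptable values. If you want a label written in the bracketed form, one option is a small conversion like the sketch below. The helper `render_calls` is hypothetical and not part of BFCL or Bedrock, and picking the first acceptable value for each parameter is an assumption made for illustration.

```python
from typing import Any, Dict, List


def render_calls(ground_truth: List[Dict[str, Dict[str, List[Any]]]]) -> str:
    """Render BFCL-style ground truth as a bracketed list of calls (illustrative only)."""
    calls = []
    for call in ground_truth:
        for name, params in call.items():
            # Use the first acceptable value of each parameter; skip empty lists
            args = ", ".join(
                f"{key}={values[0]!r}" for key, values in params.items() if values
            )
            calls.append(f"{name}({args})")
    return "[" + ", ".join(calls) + "]"


print(render_calls([{"get_weather": {"city": ["Paris"], "days": [3]}}]))
# [get_weather(city='Paris', days=3)]
print(render_calls([]))
# []
```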

In [None]:
def create_jsonl_record(row, use_tool_config=False,batch_inf_format=False):
    """
    creates a jsonl record for bedrock distillation or batch inference formats
    """
    
    conversation = {}
    
    if use_tool_config:
        conversation = {
            "schemaVersion": "bedrock-conversation-2024",
            "system": [
                {
                    "text": """You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
You should only return the function calls in your response.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

At each turn, you should try your best to complete the tasks requested by the user within the current turn. Continue to output functions to call until you have fulfilled the user's request to the best of your ability. Once you have no more functions to call, the system will consider the current turn complete and proceed to the next turn or task."""
                }
            ],
            "messages": [{
                            "role": "user",
                            "content": [
                            {
                                "text": row['question'][0][0]['content']
                            }
                            ]
                        }],
            "toolConfig": {"tools": transform_to_toolspec(row['function'])}
        }
    else:
        conversation = {
            "schemaVersion": "bedrock-conversation-2024",
            "system": [
                {
                    "text": create_sys_prompt(tools=row['function'])
                }
            ],
            "messages": [{
                            "role": "user",
                            "content": [
                            {
                                "text": row['question'][0][0]['content']
                            }
                            ]
                        }],
        }

    if 'ground_truth' in row.keys():
        conversation['messages'].append({
                            "role": "assistant",
                            "content": [
                            {
                                "text": f"{json.dumps(row['ground_truth'])}"
                            }
                            ]
                        })
    
    if batch_inf_format:
        return {
            "recordId": row['id'],
            "modelInput": conversation
        }
    else:
        return conversation



In [None]:
# Process regular training data without labels
records = []
with open('training/questions.json', 'r') as f:
    train_data = json.load(f)
    for item in train_data:
        # print(item)
        record = create_jsonl_record(
            row=item
        )
        records.append(record)

In [None]:
# Process mix-in data with ground truth answers
with open('training/mix_in/questions_with_answers.json', 'r') as f:
    train_data = json.load(f)
    for item in train_data:
        # print(item)
        record = create_jsonl_record(
            row=item
        )
        records.append(record)

In [None]:
# Save combined training data as JSONL
with open('training/bedrock_training_data_prompt_only.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

## Prepare distillation training data with Tool Config
This method is an alternative to the prompt-only approach: the list of available tools goes into a separate `toolConfig` field of each record instead of the system prompt.

In [None]:
# Process regular training data without labels
records_tools_use = []
with open('training/questions.json', 'r') as f:
    train_data = json.load(f)
    for item in train_data:
        # print(item)
        record = create_jsonl_record(
            row=item,
            use_tool_config=True
        )
        records_tools_use.append(record)

# Process mix-in data with ground truth answers
with open('training/mix_in/questions_with_answers.json', 'r') as f:
    train_data = json.load(f)
    for item in train_data:
        # print(item)
        record = create_jsonl_record(
            row=item,
            use_tool_config=True
        )
        records_tools_use.append(record)

# Save combined training data as JSONL
with open('training/bedrock_training_data_tool_config.jsonl', 'w') as f:
    for record in records_tools_use:
        f.write(json.dumps(record) + '\n')

## Prepare Evaluation Data
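Now we'll prepare the evaluation dataset that we set aside at the beginning. The next two cells are commented out; uncomment them to write the evaluation questions in batch inference format for the prompt-only and tool config approaches.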

In [None]:
# # Prepare eval data for prompt only formatting
# records_prompt_only = []
# with open('eval/questions.json', 'r') as f:
#     eval_data = json.load(f)
#     for item in eval_data:
#         # print(item)
#         record = create_jsonl_record(
#             row=item,
#             batch_inf_format=True
#         )
#         records_prompt_only.append(record)

# # Save prompt-only eval data as JSONL
# with open('eval/bedrock_eval_prompt_only.jsonl', 'w') as f:
#     for record in records_prompt_only:
#         f.write(json.dumps(record) + '\n')

In [None]:
# # Prepare eval data for tool config formatting
# records_tool_config = []
# with open('eval/questions.json', 'r') as f:
#     eval_data = json.load(f)
#     for item in eval_data:
#         # print(item)
#         record = create_jsonl_record(
#             row=item,
#             use_tool_config=True,
#             batch_inf_format=True
#         )
#         records_tool_config.append(record)

# # Save tool config eval data as JSONL
# with open('eval/bedrock_eval_tool_config.jsonl', 'w') as f:
#     for record in records_tool_config:
#         f.write(json.dumps(record) + '\n')

Let's also generate the answers for these eval questions so we can compare the model's responses against the BFCL ground truth.

In [None]:
# Answer files used below to collect the ids of the evaluation test cases
answer_files = [
    'answers/BFCL_v3_simple.json',
    'answers/BFCL_v3_multiple.json',
    'answers/BFCL_v3_irrelevance.json',
    'answers/BFCL_v3_live_relevance.json'
]

# all_answers = []
# for file in answer_files:
#     print("reading... ", file)
#     with open(file, 'r') as f:
#         for answer in f.readlines():
#             # print(question)
#             all_answers.append(json.loads(answer))
#         # questions = json.load(contents)
#         # all_questions.extend(questions)

# # Convert to DataFrame for easier manipulation
# df_answers = pd.DataFrame(all_answers)
# df_answers.to_json('eval/answers.json', orient='records', indent=2)

Here we'll write our list of eval questions in the format BFCL expects when running specific test cases: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/README.md#selecting-specific-test-cases-with---run-ids

We'll use this file in the final evaluation of our customized model.
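
Evaluation ids must not overlap with training ids, and how membership is tested matters. Searching for an id as a substring of a file's text can give false positives, because `simple_1` is contained in `simple_10`. The sketch below contrasts that with an exact lookup in a set of parsed ids; the toy records and the helper name `load_ids` are assumptions for illustration.

```python
import json
from typing import Set


def load_ids(json_text: str) -> Set[str]:
    """Collect the `id` field from a JSON array of question records."""
    return {record["id"] for record in json.loads(json_text)}


# Toy training split that contains simple_10 but not simple_1
training_text = json.dumps([{"id": "simple_10"}, {"id": "multiple_3"}])
training_ids = load_ids(training_text)

print("simple_1" in training_text)  # substring search: True (a false positive)
print("simple_1" in training_ids)   # exact lookup: False
```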

In [None]:
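from pathlib import Path

def in_training_data_set(id, training_dataset_filepath) -> bool:
    """Checks if id is in training data so as to exclude from evaluation"""
    if not id:
        return False
    # Match the id as a complete quoted JSON string so that, for example,
    # "simple_1" does not match inside "simple_10"
    return f'"{id}"' in Path(training_dataset_filepath).read_text()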

In [None]:
# Dictionary to store categorized IDs
from collections import defaultdict
categorized_ids = defaultdict(list)

# Process each answer file
for file in answer_files:
    print(f"Reading... {file}")
    if os.path.exists(file):
        with open(file, 'r') as f:
            for line in f:
                line = line.strip()
                if line:
                    try:
                        answer_data = json.loads(line)
                        record_id = answer_data.get('id')
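                        # Look up ids in the training question files; the Bedrock JSONL records do not carry ids
                        if any(in_training_data_set(record_id, path) for path in ('training/questions.json', 'training/mix_in/questions.json')):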
                            print(f"Excluding record {record_id} from evaluation.")
                        else:
                            if record_id:
                                # Determine category based on ID prefix
                                if record_id.startswith('simple_'):
                                    categorized_ids['simple'].append(record_id)
                                elif record_id.startswith('multiple_'):
                                    categorized_ids['multiple'].append(record_id)
                                elif record_id.startswith('live_relevance'):
                                    categorized_ids['live_relevance'].append(record_id)
                                elif record_id.startswith('irrelevance'):
                                    categorized_ids['irrelevance'].append(record_id)
                                else:
                                    # Handle any other categories
                                    prefix = record_id.split('_')[0]
                                    categorized_ids[prefix].append(record_id)
                                
                    except json.JSONDecodeError as e:
                        print(f"Error parsing JSON in {file}: {e}")
    else:
        print(f"Warning: File {file} not found")

# Convert defaultdict to regular dict and sort IDs within each category
result = {}
for category, ids in categorized_ids.items():
    result[category] = sorted(ids)

# Write to JSON file
output_file = 'test_case_ids_to_generate.json' # This will be used with BFCL

with open(output_file, 'w') as f:
    json.dump(result, f, indent=2, sort_keys=True)

print(f"\nCategorized IDs written to: {output_file}")
print(f"Categories found: {list(result.keys())}")

for category, ids in result.items():
    print(f"  {category}: {len(ids)} IDs")

In [None]:
# answers = []
# with open('eval/questions.json', 'r') as f:
#     eval_data = json.load(f)
#     for row in eval_data:
#         question_id = row['id']
#         gt_answer = df_answers[df_answers['id'] == question_id]['ground_truth'].values
#         answer_record = {'id': question_id, 'answer': gt_answer[0]}
#         answers.append(answer_record)

In [None]:
# with open("eval/answers.json", "w") as output_file:
#     for answer in answers:
#         json.dump(answer, output_file)
#         output_file.write('\n')

## Wrap-up
We've now prepared two datasets for use in Bedrock Model Distillation: one using the prompt-only approach to tool calling, the other using the `toolConfig` parameter.
You should also now have the held-out evaluation questions in the `eval` directory and a `test_case_ids_to_generate.json` file listing the test cases to run against Bedrock for evaluation.
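
Before submitting a training file, a quick structural check can catch malformed lines early. The sketch below only tests the fields that this notebook itself writes (`schemaVersion`, `system`, `messages`); it is not an official schema validator, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems in one record (empty means none found)."""
    problems = []
    if record.get("schemaVersion") != "bedrock-conversation-2024":
        problems.append("unexpected schemaVersion")
    if not record.get("system"):
        problems.append("missing system prompt")
    messages = record.get("messages") or []
    if not messages or messages[0].get("role") != "user":
        problems.append("first message is not from the user")
    # Roles are expected to alternate between user and assistant
    for previous, current in zip(messages, messages[1:]):
        if previous.get("role") == current.get("role"):
            problems.append("consecutive messages share a role")
    return problems


def validate_jsonl_lines(lines: List[str]) -> Dict[int, List[str]]:
    """Map 1-based line numbers to their problems, skipping clean lines."""
    report = {}
    for number, line in enumerate(lines, start=1):
        try:
            problems = validate_record(json.loads(line))
        except json.JSONDecodeError as error:
            problems = [f"invalid JSON: {error.msg}"]
        if problems:
            report[number] = problems
    return report


good = json.dumps({
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{"text": "You are an expert in composing functions."}],
    "messages": [{"role": "user", "content": [{"text": "What is 2 + 2?"}]}],
})
bad = json.dumps({"schemaVersion": "bedrock-conversation-2024", "messages": []})
print(validate_jsonl_lines([good, bad, "{not json"]))
```

To check a file on disk, pass its lines, for example `validate_jsonl_lines(open(path).read().splitlines())`.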

## Next Steps

Now that we have prepared our training data for the Bedrock distillation service, you can proceed to:

1. **Model Training**: Use the generated `training/bedrock_training_data_prompt_only.jsonl` or `training/bedrock_training_data_tool_config.jsonl` file to train your distilled model using the [Bedrock Model Distillation service](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html)

2. **Evaluation**: Use the data in the `eval` directory to assess your model's performance on:
   - Function selection accuracy
   - Parameter extraction quality
   - Handling of irrelevant queries
   - Response format consistency

3. **Fine-tuning**: Based on evaluation results, you may want to:
   - Adjust the mix-in percentage (currently 10%)
   - Modify the system prompt
   - Enhance the training data with additional examples

For more information on model distillation best practices, refer to the [Amazon Bedrock documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models-distillation.html).