# Generating training data for fine tuning using o3-mini

In this notebook i show my suggestion for making a fine-tuning dataset for 4o-mini, using the power of o3-mini.

The first step is getting the data.

We take the clustering made by Manoj with o3-mini, and then ask o3-mini with reasoning set to high to go through its own results for validation, make corrections if needed, and then explain why this is wrong as a user message.

This message will be used to simulate a conversation with 4o-mini to teach it clustering patterns that we want to highlight. We are essentially trying to gaslight GPT into making better clusters. Making it doubt itself.  

With fine tuing we therefore aim to make the predictions of 4o-mini be more similar to o3-mini, but at a lower cost. 


I think the bonus with this approach, is that we don't need annotate ourselves, just pick the ones we agree with the validation, which is much faster.


## Generate dataset

In [105]:
#This is a script that loads a pandas dataframe from csv, converts the string formatted as a list into a list of lists. The list of lists is then flattened into a single list.

import pandas as pd
import random
import json
df = pd.read_csv('o3mini_maxclust_80_subset_random_1000_nocutoff.csv', index_col=None)

#rename the column "Entity Name" to "Entity" and "Group Items" to "output"
df = df.rename(columns={"Group Items": "Output"})

# Convert the string formatted as a list into a list of lists
df['Output'] = df['Output'].apply(lambda x: eval(x))

# Flatten the list of lists into a single list
df['Input'] = df['Output'].apply(lambda x: [item.strip("*") for sublist in x for item in sublist])

#randomize the order of the single list
df['Input'] = df['Input'].apply(lambda x: random.sample(x, len(x)))
df = df[df['Input'].apply(len) < 30]
# make a new column, where internal lists of length 1 are removed 
df['Output>1'] = df['Output'].apply(lambda x: [item for item in x if len(item) > 1])


#zip the Group_Items_no_singles with the Entity Name 
df["Output"] = df.apply(lambda row: {row["Entity Name"]: row['Output']}, axis=1)
df["Input"] = df.apply(lambda row: {row["Entity Name"]: row['Input']}, axis=1)
df['Output>1'] = df.apply(lambda row: {row["Entity Name"]: row['Output>1']}, axis=1)



#filter for single_list length < 30



#convert the columns to json strings
df["Output"] = df["Output"].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
df["Input"] = df["Input"].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
df["Output>1"] = df["Output>1"].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)

#Change order of columns to have input, then output, then output>1, then entity name
df = df[['Entity Name', 'Input', 'Output', 'Output>1']]

#filter by Output>1 length of the list > 1
#use column Output, but use only the ones where it clusteres multiple items together

In [106]:
df
#note i don't really use Outputs>1 in this workflow.

Unnamed: 0,Entity Name,Input,Output,Output>1
0,58963,"{""58963"": [""Gardea-Torresdey et al. [researche...","{""58963"": [[""Gardea-Torresdey et al. [research...","{""58963"": []}"
2,22996,"{""22996"": [""number of open stomata and amount ...","{""22996"": [[""**transpiration and stomatal cond...","{""22996"": [[""**transpiration and stomatal cond..."
3,100116,"{""100116"": [""modern clade [gene]"", ""Clade SL [...","{""100116"": [[""Clade M [clade]""], [""Clade SL [c...","{""100116"": []}"
4,133255,"{""133255"": [""LPP\u10b2 [gene]"", ""SUPPRESSOR OF...","{""133255"": [[""**PLSP2A [gene]**"", ""PLSP2A gene...","{""133255"": [[""**PLSP2A [gene]**"", ""PLSP2A gene..."
5,2276,"{""2276"": [""fc2 phenotypes [phenotype]"", ""100% ...","{""2276"": [[""F1 trichome phenotype [phenotype]""...","{""2276"": []}"
...,...,...,...,...
995,31615,"{""31615"": [""auxin-regulated endocytosis [proce...","{""31615"": [[""auxin-mediated modulation of the ...","{""31615"": [[""auxin-dependent inhibition of end..."
996,157227,"{""157227"": [""ZmSAUR1 [gene]"", ""ZmSAUR1 (Zea ma...","{""157227"": [[""GmSAUR [gene]""], [""StSAUR [gene]...","{""157227"": [[""ZmSAUR1 (Zea mays SAURs) [gene]""..."
997,131559,"{""131559"": [""cytochrome b(5) [gene]"", ""BtuN ge...","{""131559"": [[""BtuB gene(s) [gene]""], [""BtuC ge...","{""131559"": []}"
998,99091,"{""99091"": [""D. oleifera genome [organism]"", ""O...","{""99091"": [[""Genome sequencing of O. fragrans ...","{""99091"": [[""**O. fragrans 'Liuyejingui' (OFL)..."


## Prompts for making fine-tuning dataset

- constant_system_message is unchanged
- constant_user_message is only slightly modified from Manoj's promts, in the beginning and the end.
- variable_user_message is "Output" for each entry in the dataframe



In [63]:
constant_system_message = """
You are senior scientist in plant biology. You NEED to meticulously VALIDATE given clustering output, and provide a corrected clustering if needed. Explain which groupings are wrong and provide final output. You MUST write response like a senior human plant scientist.

The following are the entity clustering guidelines:
	1.	Exact Phrase Matching Matters: Always consider the full phrase, including key biological terms, bracketed text (ignoring minor differences such as spacing, punctuation, correct abbreviations, plurality).
	2.	Strict (100%) Key Term Separation: Entities with different biological terms MUST be placed in separate clusters.
3. Sub-identifier separation: Separate Entities with numeric differences, sub-identifiers, or qualifiers into different groups.
	4.	Avoid False Similarity: Do NOT cluster two items together in same group just because they share a common word or term.
	5.	Strict Synonym/Near-Synonym Grouping: Only group entities that refer to the same biological structure, process, meaning or concept.
	6.	Maintain 100% Precision: When in even small doubt, MUST place entities in separate clusters.
	7.	Preserve Original Data: No new items should be introduced, no duplicates should be introduced, and no entities should be omitted.
8. YOU MUST pickup most appropriate and easy-to-understand cluster representative and enclose it with '**', if there is more than one entity in that particular cluster. For example, pick the full term instead of an abbreviation.
	9.	Output Format: Always return results in valid JSON format. MUST USE GIVEN KEY.

Your main task is to read the clustered entities and output the validation in the below mentioned format.

The goal:
These results came from a smaller LLM, and i want you to take a look at it.
We  your expertise in plant biology, to generate prompts for fine tuning the smaller LLM and mimick your patterns for this specific task.
I want you to validate the clustering of this smaller model, and provide a corrected clustering if needed.
Then you should imagine that you are a user that want to explain to the smaller LLM what it is doing wrong, to give it the final output. You should never mention the smaller LLM, but instead just pretend you are a user. 
An example of this would be:
'I think you are wrong, cluster X is actually different from cluster Y because of reasons. Cluster X and Y should be clustered seperately'.
If the smaller model is correct however, you should write something like:
'Great job, that looks correct. Now output the correct format, but only output clusters with more than one member.'
The expected input format:

{
  "submission_id_1": {
    "clusters": [
      [
        "**cluster_item1**",
        "cluster_item2"
      ],
      [
        "cluster_item3"
      ]
    ]
  }
}

Return format:

If the clustering is correct:
{
  "submission_id_1": {
    "is_correct": "True",
    "user_prompt": "This looks correct to me, output the clusters that have more than 1 member in the correct json format.",
    "clusters": [
      [
        "cluster_item1",
        "cluster_item2"
      ]
    ]
  }
}

If the clustering is incorrect:

{
  "submission_id_1": {
    "is_correct": "False",
    "user_prompt": "This looks incorrect to me, <insert reason here>. Output the clusters that have more than 1 member in the correct json format.",
    "clusters": [
      [
        "cluster_item2",
        "cluster_item3"
      ]
    ]
  }
}

Warnings:

Be completely sure to not forget any nodes from the input list.
Remember not to base corrections on higher domain level knowledge, that the smaller model might not have, but rather the semantic meaning of the terms.
is_correct should be returned as a string "True" or "False", and only one of these two.
If there is any doubt if two groups should cluster together, cluster them separately.
Remember to select group representative by enclosing entity in "**"

"""

In [64]:
constant_user_message = """  
Here are 2 examples of validation behavior:

Example 1, correct behavior:

input:
{120406: ['difficulty in detecting mutations [phenotype]', 'hereditary breast cancers [phenotype]', 'mutational status of parental tumor cells [phenotype]', 'certain mutation types [phenotype]', 'Well-characterized mutations [phenotype]', 'alternative classifications [phenotype]', 'unique molecular signature [phenotype]', 'oncological diseases [phenotype]', 'independent mutation data [phenotype]', 'inherited disease mutation databases [database]', 'mutational signatures [mutation]', 'mutation types [phenotype]', 'More than half of human cancers [phenotype]', 'somatic and germline mutations [phenotype]', 'Identified mutations [phenotype]', 'lung cancer patients [phenotype]', 'germline transmissible [phenotype]', 'mutations in the germline [phenotype]', 'sporadic breast cancers [phenotype]', 'segregating mutations [phenotype]', 'breast/ovarian cancers [phenotype]', 'non-small-cell lung carcinoma [phenotype]', 'inherited breast cancer cases [phenotype]', 'tumour-specific aberrations [phenotype]', 'list of likely candidate mutations [phenotype]', 'germline-transmitted mutations [phenotype]', 'systematic asymmetries in heritable mutations [genetic feature]', 'targeted heritable mutations [phenotype]', 'fixed mutations in ancestral line [phenotype]', 'various types of mutation [phenotype]', 'several types of mutation [phenotype]', 'screening of these exons [phenotype]', 'many human cancers [phenotype]', 'difficulties to differentiate mutations [phenotype]', 'base substitutions, deletions, insertions, and translocations [phenotype]', 'specific mutational classes [phenotype]', 'germline mutations [phenotype]', 'other types of mutation [phenotype]', 'unmasking of deleterious mutations [phenotype]']}

Explanation of correct clustering:

["**germline mutations [phenotype]**", "germline-transmitted mutations [phenotype]", "mutations in the germline [phenotype]", "germline transmissible [phenotype]"] all describe inheritable genetic alterations passed from one generation to the next.
["More than half of human cancers [phenotype]", "**oncological diseases [phenotype]**", "many human cancers [phenotype]"] reference broad groups of malignant diseases occurring in humans.
["**mutation types [phenotype]**", "base substitutions, deletions, insertions, and translocations [phenotype]", "various types of mutation [phenotype]", "several types of mutation [phenotype]", "other types of mutation [phenotype]", "certain mutation types [phenotype]", "specific mutational classes [phenotype]"] collectively describe or list categories of genetic changes.
["difficulties to differentiate mutations [phenotype]", "**difficulty in detecting mutations [phenotype]**"] denote challenges in identifying or distinguishing specific genetic variants.
["inherited breast cancer cases [phenotype]", "**hereditary breast cancers [phenotype]**"] explicitly point to breast cancer instances where genetic risk is inherited.
Notes on Separation Certain single-element entries might share partial similarity, such as "somatic and germline mutations [phenotype]" or "targeted heritable mutations [phenotype]," but remain separate because of additional qualifiers or exact term differences

Example 2, incorrect clustering:
input:
{"165162": [["**internalized cargo [metabolite]**", "endocytosed material [metabolite]", "Internalized cargoes [metabolite]"], ["**Internalization of endocytic tracer FM4-64 [metabolite]**", "internalisation of an endocytic tracer dye [metabolite]"], ["endocytosis of transferrin [metabolite]"], ["cellular uptake of 14 chemicals [metabolite]"]]}

Explanation of incorrect behavior:

["**internalized cargo [metabolite]**", "endocytosed material [metabolite]", "Internalized cargoes [metabolite]"] all describe general endocytic cargo without specifying particular molecules or numeric differences.
["**Internalization of endocytic tracer FM4-64 [metabolite]**", "internalisation of an endocytic tracer dye [metabolite]"] are clustered wrong, one specifically mentions FM4-64, while the other refers generally to any tracer dye, and they should therefore be separated following strict separation rules.

Now read the cluster, validate and output the desired format:
"""

In [65]:
#make user_message_2 which should be Input 2 from the df
variable_user_message = df.iloc[1]['Output']
print(variable_user_message)

{"22996": ["Conductance measurements [phenotype]", "stomatal conductance, transpiration rate, hydraulic conductivity and photosynthetic efficiency [phenotype]", "E and g s [phenotype]", "regulation of stomatal conductance through FER [phenotype]", "engineering stomatal conductance [phenotype]", "stomatal conductance, transpiration rate, relative-water content of plant(s) leaf [phenotype]", "stomatal conductance measurements [phenotype]", "measuring stomatal conductance [phenotype]", "K(leaf) values [phenotype]", "direct measurements of g s [phenotype]", "AtPIP2;5 regulating gm [phenotype]", "conductance levels [phenotype]", "number of open stomata and amount of water loss through transpiration [phenotype]", "methods to quickly assess the stomatal conductance (g\u209b) [phenotype]", "time-resolved measurements of stomatal conductance [phenotype]", "time-resolved stomatal conductance analyses [phenotype]", "stomatal conductance analyses [phenotype]", "transpiration and stomatal conductan

In [1]:
from openai import OpenAI


run = False

if run:
    client = OpenAI()

    response = client.chat.completions.create(
      model="o3-mini",
      messages=[
        {
          "role": "developer",
          "content": [
            {
              "type": "text",
              "text": constant_system_message
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": constant_user_message
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": variable_user_message
            }
          ]
        }
      ],
      response_format={
        "type": "json_object"
      },
      reasoning_effort="high"
    )

# Redo: Cost estimate for o3-mini

In [104]:
import tiktoken

# with open("data/api_key.txt", "r") as f:
#     api_key = f.read()

def prepare_embedding_data(df, system_message, user_message, assistant_response, model_name, input_cost_per_token, output_cost_per_token):
    """Estimate the cost of embedding the data."""
    print(" ")
    print(model_name)
    encoding = tiktoken.get_encoding("o200k_base")
    embedding_data = []
    total_input_tokens = 0
    total_output_tokens = 0
    print("individual requests: ", len(df))
    for sub_list in zip(df["Input"], df["Output"]):
        
        for item in sub_list:
            n_tokens = len(encoding.encode(item))
            total_input_tokens += n_tokens
        total_input_tokens += len(encoding.encode(system_message))
        total_input_tokens += len(encoding.encode(user_message))
        total_input_tokens += len(encoding.encode(assistant_response))
        total_output_tokens += len(encoding.encode(sub_list[1])) # the output format of my prompt is pretty short, so i multiply by 3 to overestimate
        
    print(f"Total input tokens: {total_input_tokens}")
    print(f"Total output tokens: {total_output_tokens}")

    # Calculate costs
    input_cost = (total_input_tokens / 1_000_000) * input_cost_per_token  
    output_cost = ((total_output_tokens*3) / 1_000_000) * output_cost_per_token 
    total_cost = input_cost + output_cost

    print(f"Estimated input cost for {model_name}: ${input_cost:.2f}")
    print(f"Estimated output cost for {model_name}: ${output_cost:.2f}")
    print(f"Total estimated cost for {model_name}: ${total_cost:.2f}")
    print(" ")
    print(encoding.encode(constant_system_message))
    return None
# Example usage

constant_fine_tuning_format_promt = ""
o3_mini_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o3-mini", 1.10, 4.40)
o3_mini_batch_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o3-mini-batch", 0.55, 2.20)
o1_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o1", 15.10, 60)
o1_batch_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o1-batch", 7.55, 30)

 
o3-mini
individual requests:  802
Total input tokens: 3642512
Total output tokens: 116118
Estimated input cost for o3-mini: $4.01
Estimated output cost for o3-mini: $1.53
Total estimated cost for o3-mini: $5.54
 
[3575, 553, 261, 1238, 57204, 70116, 306, 97243, 6804, 36598, 26291, 13, 4886, 5296, 382, 316, 19723, 6771, 26291, 2049, 38971, 137858, 316, 290, 3992, 20959, 734, 197, 16, 13, 197, 50196, 133290, 114005, 93032, 25, 30141, 3331, 290, 3149, 27179, 11, 3463, 2140, 36598, 5941, 11, 59422, 295, 2201, 350, 733, 7443, 15054, 19504, 2238, 472, 47236, 11, 107300, 11, 6145, 111046, 929, 11, 103951, 6294, 197, 17, 13, 197, 64810, 350, 1353, 18929, 7926, 12167, 172201, 25, 115198, 483, 2647, 36598, 5941, 52178, 413, 12989, 306, 13574, 51310, 558, 18, 13, 5934, 12, 32535, 39612, 25, 121693, 115198, 483, 52077, 19504, 11, 1543, 165192, 19873, 11, 503, 192156, 1511, 2647, 8896, 558, 197, 19, 13, 49448, 1876, 7983, 50901, 536, 25, 3756, 7116, 19723, 1920, 4732, 4717, 306, 2684, 3566, 1327,