# Generating training data for fine tuning using o3-mini

In this notebook i show my suggestion for making a fine-tuning dataset for 4o-mini, using the power of o3-mini.

The first step is getting the data.

We take the clustering made by Manoj with o3-mini, and then ask o3-mini with reasoning set to high to go through its own results for validation, make corrections if needed, and then explain why this is wrong as a user message.

This message will be used to simulate a conversation with 4o-mini to teach it clustering patterns that we want to highlight. We are essentially trying to gaslight GPT into making better clusters. Making it doubt itself.  

With fine tuing we therefore aim to make the predictions of 4o-mini be more similar to o3-mini, but at a lower cost. 


I think the bonus with this approach, is that we don't need annotate ourselves, just pick the ones we agree with the validation, which is much faster.


## Generate dataset

In [105]:
#This is a script that loads a pandas dataframe from csv, converts the string formatted as a list into a list of lists. The list of lists is then flattened into a single list.

import pandas as pd
import random
import json
df = pd.read_csv('o3mini_maxclust_80_subset_random_1000_nocutoff.csv', index_col=None)

#rename the column "Entity Name" to "Entity" and "Group Items" to "output"
df = df.rename(columns={"Group Items": "Output"})

# Convert the string formatted as a list into a list of lists
df['Output'] = df['Output'].apply(lambda x: eval(x))

# Flatten the list of lists into a single list
df['Input'] = df['Output'].apply(lambda x: [item.strip("*") for sublist in x for item in sublist])

#randomize the order of the single list
df['Input'] = df['Input'].apply(lambda x: random.sample(x, len(x)))
df = df[df['Input'].apply(len) < 30]
# make a new column, where internal lists of length 1 are removed 
df['Output>1'] = df['Output'].apply(lambda x: [item for item in x if len(item) > 1])


#zip the Group_Items_no_singles with the Entity Name 
df["Output"] = df.apply(lambda row: {row["Entity Name"]: row['Output']}, axis=1)
df["Input"] = df.apply(lambda row: {row["Entity Name"]: row['Input']}, axis=1)
df['Output>1'] = df.apply(lambda row: {row["Entity Name"]: row['Output>1']}, axis=1)



#filter for single_list length < 30



#convert the columns to json strings
df["Output"] = df["Output"].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
df["Input"] = df["Input"].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
df["Output>1"] = df["Output>1"].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)

#Change order of columns to have input, then output, then output>1, then entity name
df = df[['Entity Name', 'Input', 'Output', 'Output>1']]

#filter by Output>1 length of the list > 1
#use column Output, but use only the ones where it clusteres multiple items together

In [106]:
df
#note i don't really use Outputs>1 in this workflow.

Unnamed: 0,Entity Name,Input,Output,Output>1
0,58963,"{""58963"": [""Gardea-Torresdey et al. [researche...","{""58963"": [[""Gardea-Torresdey et al. [research...","{""58963"": []}"
2,22996,"{""22996"": [""number of open stomata and amount ...","{""22996"": [[""**transpiration and stomatal cond...","{""22996"": [[""**transpiration and stomatal cond..."
3,100116,"{""100116"": [""modern clade [gene]"", ""Clade SL [...","{""100116"": [[""Clade M [clade]""], [""Clade SL [c...","{""100116"": []}"
4,133255,"{""133255"": [""LPP\u10b2 [gene]"", ""SUPPRESSOR OF...","{""133255"": [[""**PLSP2A [gene]**"", ""PLSP2A gene...","{""133255"": [[""**PLSP2A [gene]**"", ""PLSP2A gene..."
5,2276,"{""2276"": [""fc2 phenotypes [phenotype]"", ""100% ...","{""2276"": [[""F1 trichome phenotype [phenotype]""...","{""2276"": []}"
...,...,...,...,...
995,31615,"{""31615"": [""auxin-regulated endocytosis [proce...","{""31615"": [[""auxin-mediated modulation of the ...","{""31615"": [[""auxin-dependent inhibition of end..."
996,157227,"{""157227"": [""ZmSAUR1 [gene]"", ""ZmSAUR1 (Zea ma...","{""157227"": [[""GmSAUR [gene]""], [""StSAUR [gene]...","{""157227"": [[""ZmSAUR1 (Zea mays SAURs) [gene]""..."
997,131559,"{""131559"": [""cytochrome b(5) [gene]"", ""BtuN ge...","{""131559"": [[""BtuB gene(s) [gene]""], [""BtuC ge...","{""131559"": []}"
998,99091,"{""99091"": [""D. oleifera genome [organism]"", ""O...","{""99091"": [[""Genome sequencing of O. fragrans ...","{""99091"": [[""**O. fragrans 'Liuyejingui' (OFL)..."


## Prompts for making fine-tuning dataset

- constant_system_message is unchanged
- constant_user_message is only slightly modified from Manoj's promts, in the beginning and the end.
- variable_user_message is "Output" for each entry in the dataframe



In [63]:
constant_system_message = """
You are senior scientist in plant biology. You NEED to meticulously VALIDATE given clustering output, and provide a corrected clustering if needed. Explain which groupings are wrong and provide final output. You MUST write response like a senior human plant scientist.

The following are the entity clustering guidelines:
	1.	Exact Phrase Matching Matters: Always consider the full phrase, including key biological terms, bracketed text (ignoring minor differences such as spacing, punctuation, correct abbreviations, plurality).
	2.	Strict (100%) Key Term Separation: Entities with different biological terms MUST be placed in separate clusters.
3. Sub-identifier separation: Separate Entities with numeric differences, sub-identifiers, or qualifiers into different groups.
	4.	Avoid False Similarity: Do NOT cluster two items together in same group just because they share a common word or term.
	5.	Strict Synonym/Near-Synonym Grouping: Only group entities that refer to the same biological structure, process, meaning or concept.
	6.	Maintain 100% Precision: When in even small doubt, MUST place entities in separate clusters.
	7.	Preserve Original Data: No new items should be introduced, no duplicates should be introduced, and no entities should be omitted.
8. YOU MUST pickup most appropriate and easy-to-understand cluster representative and enclose it with '**', if there is more than one entity in that particular cluster. For example, pick the full term instead of an abbreviation.
	9.	Output Format: Always return results in valid JSON format. MUST USE GIVEN KEY.

Your main task is to read the clustered entities and output the validation in the below mentioned format.

The goal:
These results came from a smaller LLM, and i want you to take a look at it.
We  your expertise in plant biology, to generate prompts for fine tuning the smaller LLM and mimick your patterns for this specific task.
I want you to validate the clustering of this smaller model, and provide a corrected clustering if needed.
Then you should imagine that you are a user that want to explain to the smaller LLM what it is doing wrong, to give it the final output. You should never mention the smaller LLM, but instead just pretend you are a user. 
An example of this would be:
'I think you are wrong, cluster X is actually different from cluster Y because of reasons. Cluster X and Y should be clustered seperately'.
If the smaller model is correct however, you should write something like:
'Great job, that looks correct. Now output the correct format, but only output clusters with more than one member.'
The expected input format:

{
  "submission_id_1": {
    "clusters": [
      [
        "**cluster_item1**",
        "cluster_item2"
      ],
      [
        "cluster_item3"
      ]
    ]
  }
}

Return format:

If the clustering is correct:
{
  "submission_id_1": {
    "is_correct": "True",
    "user_prompt": "This looks correct to me, output the clusters that have more than 1 member in the correct json format.",
    "clusters": [
      [
        "cluster_item1",
        "cluster_item2"
      ]
    ]
  }
}

If the clustering is incorrect:

{
  "submission_id_1": {
    "is_correct": "False",
    "user_prompt": "This looks incorrect to me, <insert reason here>. Output the clusters that have more than 1 member in the correct json format.",
    "clusters": [
      [
        "cluster_item2",
        "cluster_item3"
      ]
    ]
  }
}

Warnings:

Be completely sure to not forget any nodes from the input list.
Remember not to base corrections on higher domain level knowledge, that the smaller model might not have, but rather the semantic meaning of the terms.
is_correct should be returned as a string "True" or "False", and only one of these two.
If there is any doubt if two groups should cluster together, cluster them separately.
Remember to select group representative by enclosing entity in "**"

"""

In [64]:
constant_user_message = """  
Here are 2 examples of validation behavior:

Example 1, correct behavior:

input:
{120406: ['difficulty in detecting mutations [phenotype]', 'hereditary breast cancers [phenotype]', 'mutational status of parental tumor cells [phenotype]', 'certain mutation types [phenotype]', 'Well-characterized mutations [phenotype]', 'alternative classifications [phenotype]', 'unique molecular signature [phenotype]', 'oncological diseases [phenotype]', 'independent mutation data [phenotype]', 'inherited disease mutation databases [database]', 'mutational signatures [mutation]', 'mutation types [phenotype]', 'More than half of human cancers [phenotype]', 'somatic and germline mutations [phenotype]', 'Identified mutations [phenotype]', 'lung cancer patients [phenotype]', 'germline transmissible [phenotype]', 'mutations in the germline [phenotype]', 'sporadic breast cancers [phenotype]', 'segregating mutations [phenotype]', 'breast/ovarian cancers [phenotype]', 'non-small-cell lung carcinoma [phenotype]', 'inherited breast cancer cases [phenotype]', 'tumour-specific aberrations [phenotype]', 'list of likely candidate mutations [phenotype]', 'germline-transmitted mutations [phenotype]', 'systematic asymmetries in heritable mutations [genetic feature]', 'targeted heritable mutations [phenotype]', 'fixed mutations in ancestral line [phenotype]', 'various types of mutation [phenotype]', 'several types of mutation [phenotype]', 'screening of these exons [phenotype]', 'many human cancers [phenotype]', 'difficulties to differentiate mutations [phenotype]', 'base substitutions, deletions, insertions, and translocations [phenotype]', 'specific mutational classes [phenotype]', 'germline mutations [phenotype]', 'other types of mutation [phenotype]', 'unmasking of deleterious mutations [phenotype]']}

Explanation of correct clustering:

["**germline mutations [phenotype]**", "germline-transmitted mutations [phenotype]", "mutations in the germline [phenotype]", "germline transmissible [phenotype]"] all describe inheritable genetic alterations passed from one generation to the next.
["More than half of human cancers [phenotype]", "**oncological diseases [phenotype]**", "many human cancers [phenotype]"] reference broad groups of malignant diseases occurring in humans.
["**mutation types [phenotype]**", "base substitutions, deletions, insertions, and translocations [phenotype]", "various types of mutation [phenotype]", "several types of mutation [phenotype]", "other types of mutation [phenotype]", "certain mutation types [phenotype]", "specific mutational classes [phenotype]"] collectively describe or list categories of genetic changes.
["difficulties to differentiate mutations [phenotype]", "**difficulty in detecting mutations [phenotype]**"] denote challenges in identifying or distinguishing specific genetic variants.
["inherited breast cancer cases [phenotype]", "**hereditary breast cancers [phenotype]**"] explicitly point to breast cancer instances where genetic risk is inherited.
Notes on Separation Certain single-element entries might share partial similarity, such as "somatic and germline mutations [phenotype]" or "targeted heritable mutations [phenotype]," but remain separate because of additional qualifiers or exact term differences

Example 2, incorrect clustering:
input:
{"165162": [["**internalized cargo [metabolite]**", "endocytosed material [metabolite]", "Internalized cargoes [metabolite]"], ["**Internalization of endocytic tracer FM4-64 [metabolite]**", "internalisation of an endocytic tracer dye [metabolite]"], ["endocytosis of transferrin [metabolite]"], ["cellular uptake of 14 chemicals [metabolite]"]]}

Explanation of incorrect behavior:

["**internalized cargo [metabolite]**", "endocytosed material [metabolite]", "Internalized cargoes [metabolite]"] all describe general endocytic cargo without specifying particular molecules or numeric differences.
["**Internalization of endocytic tracer FM4-64 [metabolite]**", "internalisation of an endocytic tracer dye [metabolite]"] are clustered wrong, one specifically mentions FM4-64, while the other refers generally to any tracer dye, and they should therefore be separated following strict separation rules.

Now read the cluster, validate and output the desired format:
"""

In [65]:
#make user_message_2 which should be Input 2 from the df
variable_user_message = df.iloc[1]['Output']
print(variable_user_message)

{"22996": ["Conductance measurements [phenotype]", "stomatal conductance, transpiration rate, hydraulic conductivity and photosynthetic efficiency [phenotype]", "E and g s [phenotype]", "regulation of stomatal conductance through FER [phenotype]", "engineering stomatal conductance [phenotype]", "stomatal conductance, transpiration rate, relative-water content of plant(s) leaf [phenotype]", "stomatal conductance measurements [phenotype]", "measuring stomatal conductance [phenotype]", "K(leaf) values [phenotype]", "direct measurements of g s [phenotype]", "AtPIP2;5 regulating gm [phenotype]", "conductance levels [phenotype]", "number of open stomata and amount of water loss through transpiration [phenotype]", "methods to quickly assess the stomatal conductance (g\u209b) [phenotype]", "time-resolved measurements of stomatal conductance [phenotype]", "time-resolved stomatal conductance analyses [phenotype]", "stomatal conductance analyses [phenotype]", "transpiration and stomatal conductan

In [1]:
from openai import OpenAI


run = False

if run:
    client = OpenAI()

    response = client.chat.completions.create(
      model="o3-mini",
      messages=[
        {
          "role": "developer",
          "content": [
            {
              "type": "text",
              "text": constant_system_message
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": constant_user_message
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": variable_user_message
            }
          ]
        }
      ],
      response_format={
        "type": "json_object"
      },
      reasoning_effort="high"
    )

# Redo: Cost estimate for o3-mini

In [104]:
import tiktoken

# with open("data/api_key.txt", "r") as f:
#     api_key = f.read()

def prepare_embedding_data(df, system_message, user_message, assistant_response, model_name, input_cost_per_token, output_cost_per_token):
    """Estimate the cost of embedding the data."""
    print(" ")
    print(model_name)
    encoding = tiktoken.get_encoding("o200k_base")
    embedding_data = []
    total_input_tokens = 0
    total_output_tokens = 0
    print("individual requests: ", len(df))
    for sub_list in zip(df["Input"], df["Output"]):
        
        for item in sub_list:
            n_tokens = len(encoding.encode(item))
            total_input_tokens += n_tokens
        total_input_tokens += len(encoding.encode(system_message))
        total_input_tokens += len(encoding.encode(user_message))
        total_input_tokens += len(encoding.encode(assistant_response))
        total_output_tokens += len(encoding.encode(sub_list[1])) # the output format of my prompt is pretty short, so i multiply by 3 to overestimate
        
    print(f"Total input tokens: {total_input_tokens}")
    print(f"Total output tokens: {total_output_tokens}")

    # Calculate costs
    input_cost = (total_input_tokens / 1_000_000) * input_cost_per_token  
    output_cost = ((total_output_tokens*3) / 1_000_000) * output_cost_per_token 
    total_cost = input_cost + output_cost

    print(f"Estimated input cost for {model_name}: ${input_cost:.2f}")
    print(f"Estimated output cost for {model_name}: ${output_cost:.2f}")
    print(f"Total estimated cost for {model_name}: ${total_cost:.2f}")
    print(" ")
    print(encoding.encode(constant_system_message))
    return None
# Example usage

constant_fine_tuning_format_promt = ""
o3_mini_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o3-mini", 1.10, 4.40)
o3_mini_batch_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o3-mini-batch", 0.55, 2.20)
o1_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o1", 15.10, 60)
o1_batch_cost = prepare_embedding_data(df, constant_system_message, constant_user_message, constant_fine_tuning_format_promt, "o1-batch", 7.55, 30)

 
o3-mini
individual requests:  802
Total input tokens: 3642512
Total output tokens: 116118
Estimated input cost for o3-mini: $4.01
Estimated output cost for o3-mini: $1.53
Total estimated cost for o3-mini: $5.54
 
[3575, 553, 261, 1238, 57204, 70116, 306, 97243, 6804, 36598, 26291, 13, 4886, 5296, 382, 316, 19723, 6771, 26291, 2049, 38971, 137858, 316, 290, 3992, 20959, 734, 197, 16, 13, 197, 50196, 133290, 114005, 93032, 25, 30141, 3331, 290, 3149, 27179, 11, 3463, 2140, 36598, 5941, 11, 59422, 295, 2201, 350, 733, 7443, 15054, 19504, 2238, 472, 47236, 11, 107300, 11, 6145, 111046, 929, 11, 103951, 6294, 197, 17, 13, 197, 64810, 350, 1353, 18929, 7926, 12167, 172201, 25, 115198, 483, 2647, 36598, 5941, 52178, 413, 12989, 306, 13574, 51310, 558, 18, 13, 5934, 12, 32535, 39612, 25, 121693, 115198, 483, 52077, 19504, 11, 1543, 165192, 19873, 11, 503, 192156, 1511, 2647, 8896, 558, 197, 19, 13, 49448, 1876, 7983, 50901, 536, 25, 3756, 7116, 19723, 1920, 4732, 4717, 306, 2684, 3566, 1327,

# Distillation/Fine tuning

Now that we have generated the a dataset of 802 simulated conversations, we can set up the fine tuning.

This is the order of prompts:

- system_prompt: Same as original
- user_prompt_1: Same as original, but only include sometimes (maybe 25% of the time), to prevent overfitting to the examples and to make the fine-tuning data look more like the data that we will produce.
- user_prompt_2: Input from df
- assistant_output_1 : precomputed output from df
- **user_prompt_3**: the o3-mini generated user_prompt from validation explaining whether or not it was correct.
- **assistant_output_2**: the o3-mini generated corrections, excluding clusters with only 1 member.

When we then make predictions using the fine-tuned model, we slightly modify the system_prompt to say that it needs to exclude clusters with only 1 member, and bam wham bamalam, we hopefully have a model. 

In [None]:
detailed_description = """Input-1
{
  "0": '['Meiotic block [phenotype]', 'early arrest [phenotype]', 'arrest of meiotic progression in anaphase II [phenotype]', 'arrested zygotic divisions [phenotype]', 'meiotic arrest [phenotype]', 'delayed/arrested meiosis [phenotype]', 'pachytene arrest [phenotype]', 'block in meiosis prophase I [phenotype]', 'male meiosis until the end of the first division [phenotype]', 'meiotic prophase arrest [phenotype]', 'arrest in late prophase I [phenotype]', 'arrest at the end of meiosis I [phenotype]', 'leptotene arrest [phenotype]', 'meiosis I arrest [phenotype]', 'arresting the first mitosis during gametogenesis [phenotype]', 'absence of meiotic arrest [phenotype]', 'meiotic arrest phenotype [phenotype]', 'meiotic arrest at telophase I [phenotype]', 'termination of meiosis after anaphase I [phenotype]', 'premature termination of meiosis after anaphase I [phenotype]', 'mitotic arrest during female gametogenesis [phenotype]', 'arrest after meiosis I [phenotype]', 'meiotic arrest at pachytene [phenotype]', 'arrested endosperm nuclear divisions [phenotype]', 'meiotic arrest in anaphase II [phenotype]', 'meiotic division stop [phenotype]', 'arrest of the first mitotic division in gametogenesis [phenotype]', 'FNM half-stop [phenotype]', 'males arresting in the middle of prophase I [phenotype]', 'arresting prior to the first mitotic division [phenotype]']'
}
Input-1 → Output-1 [REASONING]

1) Generic Meiotic Arrest

Cluster:
[ "Meiotic block [phenotype]", "**meiotic arrest [phenotype]**", "meiotic arrest phenotype [phenotype]", "meiotic division stop [phenotype]" ]
• Incorrect: Grouping these separately as if they refer to different stages would obscure their identical meaning.
• Correct: Recognize they all broadly indicate a total halt in meiosis at an unspecified stage.
• Important: These terms are functionally synonymous, capturing any generic failure of meiotic progression.
• Cluster representative: I selected 'meiotic arrest [phenotype]' as the representative because it is the most concise and commonly used term to describe a complete halt in the meiotic process. Unlike the other phrases which include additional descriptors or less formal wording, 'meiotic arrest' clearly and unambiguously captures the essential biological event, making it the best exemplar for the cluster.

2) Delayed / Arrested Meiosis

Cluster:
[ "delayed/arrested meiosis [phenotype]" ]
• Incorrect: Merging it with generic meiotic arrest overlooks the “delayed” aspect, which implies partial progression.
• Correct: Keep it separate because it specifies a slow or partial block before a final stall.
• Important: “Delayed” suggests some chromosomes or cells proceed further than in an outright immediate block.

3) Absence of Meiotic Arrest

Cluster:
[ "absence of meiotic arrest [phenotype]" ]
• Incorrect: Combining with any arrest group would contradict its meaning.
• Correct: Maintain it as the negative counterpart, meaning no block occurs.
• Important: This phenotype is crucial for comparisons, showing normal meiotic completion instead of a stoppage.

4) Prophase I Arrest (Broad)

Cluster:
[ "block in meiosis prophase I [phenotype]", "meiotic prophase arrest [phenotype]" ]
• Incorrect: Splitting them into sub-sub-stages if the label doesn’t specify.
• Correct: They both emphasize an arrest somewhere in prophase I, without detailing the exact sub-stage (leptotene, pachytene, etc.).
• Important: This captures a prophase I blockade in general, distinct from specific sub-stage arrests.
• Cluster representative: I selected 'meiotic prophase arrest [phenotype]' as the representative because it succinctly captures the core biological event of an arrest occurring during prophase, without the additional wording found in the alternative. This clarity makes it the most direct and precise exemplar for the cluster.

5) Male Arrest in Mid–Prophase I

Cluster:
[ "males arresting in the middle of prophase I [phenotype]" ]
• Incorrect: Mixing with general prophase I arrest loses the male-specific nature and midpoint detail.
• Correct: Keep it unique because it adds a sex specification (male) and timing (mid-prophase I).
• Important: This addresses sex-specific contexts where the XY body or other male meiotic events fail around zygotene/pachytene.

6) Early Arrest (Undefined Sub-Stage)

Cluster:
[ "early arrest [phenotype]" ]
• Incorrect: Equating “early” to a named stage like leptotene or zygotene.
• Correct: It must stand alone since it lacks a formal sub-stage but implies an initial block.
• Important: The label indicates an arrest that happens before mid- or late-stage phenomena but is otherwise unspecified.

7) Leptotene Arrest

Cluster:
[ "leptotene arrest [phenotype]" ]
• Incorrect: Merging it with “early arrest” would lose the sub-stage clarity.
• Correct: “Leptotene” is a well-defined earliest sub-stage of prophase I, deserving its own node.
• Important: This precisely pinpoints where chromosomes start to condense yet fail to progress.

8) Pachytene Arrest (Synonyms)

Cluster:
[ "**pachytene arrest [phenotype]**", "meiotic arrest at pachytene [phenotype]" ]
• Incorrect: Splitting these two would ignore that they describe the exact same block.
• Correct: They both name the pachytene sub-stage, so they cluster together.
• Important: Pachytene is when homologs are fully synapsed, so arrest here is distinct from earlier or later prophase I phases.
• Cluster representative: I selected 'pachytene arrest [phenotype]' as the representative because it is more concise and directly highlights the specific stage of meiotic arrest without additional qualifiers. This succinct phrasing clearly captures the biological event at the pachytene stage, making it the best exemplar for the cluster.

9) Late Prophase I Arrest

Cluster:
[ "arrest in late prophase I [phenotype]" ]
• Incorrect: Grouping with general prophase I arrests would lose the “late” distinction.
• Correct: Keep it separate because it suggests diplotene or diakinesis sub-stages.
• Important: Identifies that the block occurs after chromosome synapsis (pachytene) but before metaphase I.

10) Meiosis I Arrest (Broad)

Cluster:
[ "meiosis I arrest [phenotype]" ]
• Incorrect: Conflating with prophase I or end-of-meiosis-I arrests.
• Correct: This label is intentionally broad for a block anywhere in the entire first meiotic division.
• Important: Distinct from narrower arrests at anaphase I or telophase I.

11) End-of-Meiosis-I Arrest

Cluster:
[ "arrest at the end of meiosis I [phenotype]", "arrest after meiosis I [phenotype]", "**meiotic arrest at telophase I [phenotype]**" ]
• Incorrect: Mixing with generic “meiosis I arrest” might obscure that these specifically reach telophase I or just beyond.
• Correct: All describe an arrest that specifically coincides with or follows telophase I.
• Important: They finish prophase–anaphase I but fail to transition into or complete meiosis II.
• Cluster representative: I selected 'meiotic arrest at telophase I [phenotype]' as the representative because it explicitly specifies the stage of arrest, leaving no ambiguity about the timing within meiosis. By clearly indicating telophase I, it provides a precise and biologically accurate descriptor compared to the more ambiguous alternatives present in the cluster.

12) Male Meiosis Until End of First Division

Cluster:
[ "male meiosis until the end of the first division [phenotype]" ]
• Incorrect: Merging with “arrest at the end of meiosis I” would ignore the explicit mention of male gametogenesis.
• Correct: It parallels an end-of-meiosis-I block but is sex-specific.
• Important: Reflects male-specific phenotypes where meiosis I completes in a partial sense but doesn’t proceed to meiosis II.

13) Post–Anaphase I Termination

Cluster:
[ "**termination of meiosis after anaphase I [phenotype]**", "premature termination of meiosis after anaphase I [phenotype]" ]
• Incorrect: Combining with end-of-meiosis-I arrests (telophase I) might overlook the specific time point (right after anaphase I).
• Correct: These highlight that meiosis halts immediately following homolog separation in anaphase I.
• Important: “Premature termination” still implies the same staging (post-anaphase I), so they cluster together.
• Cluster representative: I selected 'termination of meiosis after anaphase I [phenotype]' as the representative because it provides a clear and concise description of the cessation of meiosis immediately following anaphase I. The absence of the qualifier 'premature' avoids additional nuance regarding timing, making it a more universally applicable and straightforward term to represent the cluster.

14) Anaphase II Arrest (Synonyms)

Cluster:
[ "arrest of meiotic progression in anaphase II [phenotype]", "**meiotic arrest in anaphase II [phenotype]**" ]
• Incorrect: Grouping with anaphase I or telophase I arrests would misrepresent the division stage.
• Correct: Both pinpoint the second meiotic anaphase, so they are true synonyms.
• Important: This arrest means meiosis I completed successfully, but the cell fails during separation of sister chromatids.
• Cluster representative: I selected 'meiotic arrest in anaphase II [phenotype]' as the representative because it succinctly and directly identifies the specific phase (anaphase II) at which the arrest occurs. Its concise phrasing avoids unnecessary complexity, making it the clearest descriptor of the biological event in this cluster.

15) Arrested Zygotic Divisions

Cluster:
[ "arrested zygotic divisions [phenotype]" ]
• Incorrect: Folding into meiotic blocks misses that zygotic divisions are post-fertilization mitoses.
• Correct: Keep separate, as this block is in the earliest embryo after fertilization.
• Important: Distinguishing embryonic arrests from gametogenic or meiotic ones is crucial in developmental contexts.

16) Arrested Endosperm Nuclear Divisions

Cluster:
[ "arrested endosperm nuclear divisions [phenotype]" ]
• Incorrect: Combining with zygotic divisions lumps distinct post-fertilization tissues (embryo vs. endosperm).
• Correct: Endosperm is a separate tissue formed post-fertilization (often triploid), so it merits its own category.
• Important: In many plants, endosperm divides separately, so an arrest here is unique from zygotic embryonic arrest.

17) First Mitotic Division in Gametogenesis (Synonyms)

Cluster:
[ "arresting the first mitosis during gametogenesis [phenotype]", "**arrest of the first mitotic division in gametogenesis [phenotype]**", "FNM half-stop [phenotype]" ]
• Incorrect: Splitting these fails to see that all reference halting the very first post-meiotic mitosis.
• Correct: They describe the same stage (first mitosis in gametogenesis), so they are synonyms.
• Important: “FNM half-stop” is shorthand for the same phenomenon, not a different event.
• Cluster representative: I selected 'arrest of the first mitotic division in gametogenesis [phenotype]' as the representative because it is the most precise and descriptive term. It clearly specifies the process (mitotic division) and context (gametogenesis), avoiding the informal shorthand of 'FNM half-stop' and the less formal phrasing of 'arresting the first mitosis during gametogenesis.' This precision makes it the best exemplar for the cluster.

18) Arrest Prior to First Mitotic Division

Cluster:
[ "arresting prior to the first mitotic division [phenotype]" ]
• Incorrect: Assuming it is the same as “arresting the first mitosis” would confuse the actual onset of that mitosis.
• Correct: This indicates cells never even enter mitosis.
• Important: Distinguishing “before it starts” from “during the division” can be crucial for understanding gametogenesis defects.

19) Mitotic Arrest During Female Gametogenesis

Cluster:
[ "mitotic arrest during female gametogenesis [phenotype]" ]
• Incorrect: Merging with the “first mitosis” cluster might ignore that multiple mitotic divisions can occur in female lines.
• Correct: A female-specific block in some mitotic division (not necessarily the first).
• Important: Sex specificity and indefinite mitotic stage set it apart from a clearly labeled “first mitosis” arrest.
Output-1
{
"0": [
[
"Meiotic block [phenotype]",
"**meiotic arrest [phenotype]**",
"meiotic arrest phenotype [phenotype]",
"meiotic division stop [phenotype]"
],
[
"delayed/arrested meiosis [phenotype]"
],
[
"absence of meiotic arrest [phenotype]"
],
[
"block in meiosis prophase I [phenotype]",
"**meiotic prophase arrest [phenotype]**"
],
[
"males arresting in the middle of prophase I [phenotype]"
],
[
"early arrest [phenotype]"
],
[
"leptotene arrest [phenotype]"
],
[
"**pachytene arrest [phenotype]**",
"meiotic arrest at pachytene [phenotype]"
],
[
"arrest in late prophase I [phenotype]"
],
[
"meiosis I arrest [phenotype]"
],
[
"arrest at the end of meiosis I [phenotype]",
"arrest after meiosis I [phenotype]",
"**meiotic arrest at telophase I [phenotype]**"
],
[
"male meiosis until the end of the first division [phenotype]"
],
[
"**termination of meiosis after anaphase I [phenotype]**",
"premature termination of meiosis after anaphase I [phenotype]"
],
[
"arrest of meiotic progression in anaphase II [phenotype]",
"**meiotic arrest in anaphase II [phenotype]**"
],
[
"arrested zygotic divisions [phenotype]"
],
[
"arrested endosperm nuclear divisions [phenotype]"
],
[
"arresting the first mitosis during gametogenesis [phenotype]",
"**arrest of the first mitotic division in gametogenesis [phenotype]**",
"FNM half-stop [phenotype]"
],
[
"arresting prior to the first mitotic division [phenotype]"
],
[
"mitotic arrest during female gametogenesis [phenotype]"
]
]
}
Input-2
{
  "1": '[
    "Salt-stress severity [treatment]",
    "high NaCl stress [treatment]",
    "potassium deprivation stress [treatment]",
    "salt-stress response [treatment]",
    "salt stress tolerance [treatment]",
    "heat and salt stress conditions [treatment]",
    "prolonged levels of salt stress [treatment]",
    "recovery from salt stress [treatment]",
    "salt stress assay [treatment]",
    "salt stress signaling pathways [treatment]",
    "gradual salt stress treatments [treatment]",
    "salt and low temperature stresses [treatment]",
    "salt and silicon stresses [treatment]"
  ]'
}
Input-2 → Output-2 [REASONING]

1) Salt-Stress Core

Clustered Terms
• “Salt-stress severity [treatment]”
• “**high NaCl stress [treatment]**”
• “salt-stress response [treatment]”
• “salt stress tolerance [treatment]”
• “prolonged levels of salt stress [treatment]”
• “recovery from salt stress [treatment]”
• “salt stress assay [treatment]”
• “salt stress signaling pathways [treatment]”
• “gradual salt stress treatments [treatment]”

Incorrect: Splitting “NaCl” from “salt” would be biologically misleading since NaCl is the chemical basis of most salt stress.
Correct: Recognize all are purely salt-based conditions; “NaCl” is the explicit form, but it is still “salt.”
Important: These labels measure or manipulate salt-stress conditions alone (no other stress factor).
Cluster representative: I selected 'high NaCl stress [treatment]' as the representative because it directly encapsulates the core concept of salt stress by explicitly naming the chemical agent (NaCl) responsible for inducing the stress condition. This term is both succinct and unambiguous, avoiding additional qualifiers (like severity, response, or tolerance) that could shift the focus away from the primary salt stress condition.

2) Unique Stress: Potassium Deprivation

Clustered Term
• “potassium deprivation stress [treatment]”

Incorrect: Combining this with salt-based treatments implies overlapping ionic stress without specificity.
Correct: Keep it separate because it focuses on K+ deficiency rather than NaCl excess.
Important: Potassium starvation is a distinct abiotic stress requiring separate interpretation and management from salt stress.

3) Heat + Salt Stress

Clustered Term
• “heat and salt stress conditions [treatment]”

Incorrect: Merging with the salt core group would lose the additional heat component.
Correct: Keep it in its own cluster because it involves two distinct stressors (heat + salt).
Important: Many experiments examine combined stresses differently than single-stress treatments.

4) Salt + Low Temperature

Clustered Term
• “salt and low temperature stresses [treatment]”

Incorrect: Folding it into a single “salt” cluster disregards the cold factor.
Correct: Identify that it specifically tests tolerance or response to dual stress: salt and cold.
Important: Understanding multi-stress interactions is crucial for breeding or experimental design.

5) Salt + Silicon

Clustered Term
• “salt and silicon stresses [treatment]”

Incorrect: Grouping with plain salt stress lumps unique “silicon” involvement into generic salt.
Correct: Keep separate because it’s salt + another factor (silicon) that could mitigate or alter salt stress.
Important: Silicon is sometimes used to ameliorate salt stress, so it forms a distinct combined treatment.

output-2
{
  "1": [
    [
      "Salt-stress severity [treatment]",
      "**high NaCl stress [treatment]**",
      "salt-stress response [treatment]",
      "salt stress tolerance [treatment]",
      "prolonged levels of salt stress [treatment]",
      "recovery from salt stress [treatment]",
      "salt stress assay [treatment]",
      "salt stress signaling pathways [treatment]",
      "gradual salt stress treatments [treatment]"
    ],
    ["potassium deprivation stress [treatment]"],
    ["heat and salt stress conditions [treatment]"],
    ["salt and low temperature stresses [treatment]"],
    ["salt and silicon stresses [treatment]"]
  ]
}
"""

In [None]:
system_prompt = """You are a data scientist specializing in grouping plant biological entities. Your task is to cluster similar entities while strictly adhering to the following guidelines:

	1.	Exact Phrase Matching Matters: Always consider the full phrase, including key biological terms, bracketed text (ignoring minor differences such as spacing, punctuation, correct abbreviations, plurality).
	2.	Strict (100%) Key Term Separation: Entities with different biological terms MUST be placed in separate clusters.
    3. Sub-identifier separation: Separate Entities with numeric differences, sub-identifiers, or qualifiers into different groups.
	4.	Avoid False Similarity: Do NOT cluster two items together in same group just because they share a common word or term.
	5.	Strict Synonym/Near-Synonym Grouping: Only group entities that refer to the same biological structure, process, meaning or concept.
	6.	Maintain 100% Precision: When in even small doubt, MUST place entities in separate clusters.
	7.	Preserve Original Data: No new items should be introduced, no duplicates should be introduced, and no entities should be omitted.
    8. YOU MUST pickup most appropriate and easy-to-understand cluster representative and enclose it with '**', if there is more than one entity in that particular cluster. For example, pick the full term instead of an abbreviation.
	9.	Output Format: Always return results in valid JSON format. MUST USE GIVEN KEY.

Read the input list, and return a clustered output list.
"""




input_list = "" #the Input list, taken from original K-meansdataframe

initial_output = "" # the first round of o3-mini predictions from the dataframe

o3_user_prompt = "" # the userprompt that o3-mini came up with from the second round o3-mini output

corrected_output = "" # the corrected and shortened list.

#for previous example
fine_tune_dict = {
    "messages": [
        {"role": "system","content": system_prompt}, 
        {"role": "user", "content": input_list}, 
        {"role": "assistant", "content": initial_output},
        {"role": "user", "content": o3_user_prompt},
        {"role": "assistant", "content": corrected_output},
        ]}

# draw a float between 0 and 1
include_detailed_description = random.random() < 0.2
if include_detailed_description:
    # assign user_prompt_1 at index 1
    fine_tune_dict["messages"].insert(1, {"role": "user", "content": detailed_description})
    
fine_tune_dict


{'messages': [{'role': 'system',
   'content': "You are a data scientist specializing in grouping plant biological entities. Your task is to cluster similar entities while strictly adhering to the following guidelines:\n\t1.\tExact Phrase Matching Matters: Always consider the full phrase, including key biological terms, bracketed text (ignoring minor differences such as spacing, punctuation, correct abbreviations, plurality).\n\t2.\tStrict (100%) Key Term Separation: Entities with different biological terms MUST be placed in separate clusters.\n3. Sub-identifier separation: Separate Entities with numeric differences, sub-identifiers, or qualifiers into different groups.\n\t4.\tAvoid False Similarity: Do NOT cluster two items together in same group just because they share a common word or term.\n\t5.\tStrict Synonym/Near-Synonym Grouping: Only group entities that refer to the same biological structure, process, meaning or concept.\n\t6.\tMaintain 100% Precision: When in even small doubt

In [1]:
import pandas as pd
df = pd.read_csv ("o3mini_rand_50.csv",index_col=None)

In [10]:
df["user_len"] = df["user_prompt"].str.len()
df["user_len"].max()
df[df["user_len"] == df["user_len"].max()]["user_prompt"].values[0]

"This looks incorrect to me. The problem lies with the fourth cluster. It currently groups together four items: 'a biological control agent [treatment]', 'biological control agent (BCA) [treatment]', 'potent biological control agent [treatment]', and 'potential biological control agent [treatment]'. According to our strict grouping rules, items with qualifiers that change the meaning (i.e. 'potent' and 'potential') must be separated from the generic form. In this case, 'a biological control agent [treatment]' and 'biological control agent (BCA) [treatment]' clearly refer to the general concept and should be clustered together (with the latter as the representative), while the other two should be in separate clusters. Since we are only outputting clusters with more than one member, only the generic biological control agent group should be included. Also, the first cluster (biopesticides) is correctly grouped. Please output only the clusters with more than one entity in the correct JSON 

In [None]:
#