Wrap the previous steps in a single loop and repeat them until the final result is reached.
Make sure each step's result is passed correctly into the next iteration.
The main idea of the loop is:
Assume that an initial API call has already produced a preliminary classification of the dataset.
Then use the parsing function to store this initial classification in a dictionary, with state types as keys and the list of species in each state as values.
Process each key-value pair in turn: send the species of each key to the API as the next input, store each response in a results dictionary, and collect the returned results so that they can be retrieved separately.
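
The loop described above can be sketched as a work queue. This is only a minimal illustration, not the notebook's implementation: `build_key`, `split_fn`, and `halve` are hypothetical names, and `halve` is a stand-in for the API call so that the control flow can be run offline.

```python
from collections import deque


def build_key(initial_groups, split_fn):
    """Split groups repeatedly until every group holds a single species.

    initial_groups: {state: [species, ...]} from the initial classification.
    split_fn: callable taking a species list and returning {state: [species, ...]}.
    """
    queue = deque(initial_groups.items())
    singles = {}  # species -> path of states that isolates it
    splits = {}   # path -> subgroups returned by split_fn
    while queue:
        path, species = queue.popleft()
        if len(species) == 1:
            singles[species[0]] = path
            continue
        subgroups = split_fn(species)
        splits[path] = subgroups
        # Re-queue every subgroup so it is split again if it is still too large
        for state, members in subgroups.items():
            queue.append((f"{path}/{state}", members))
    return singles, splits


def halve(species):
    """Stand-in for the API call (assumption): cut the list into two halves."""
    mid = len(species) // 2
    return {"1": species[:mid], "2": species[mid:]}


singles, splits = build_key({"1": ["A", "B", "C"], "2": ["D"]}, halve)
print(singles)
```

Keeping the path alongside each group means the final dictionary records which sequence of states leads to every species, which is what the later merging step needs.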

In [2]:
# First, import all the necessary packages
import json
from openai import OpenAI
import os
import re

Load the required file contents: the morphological matrix data and the initial parsed data.
The initial parsed data is the species classification obtained from the initial API call, which selects the initial character and its states; after parsing, it is stored as a dictionary.
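
Because later cells look up each species with `matrix_data[species]`, a name that appears in the parsed classification but not in the matrix would raise a `KeyError` in the middle of the loop. A small check before the loop starts can catch this early. The sketch below is only an illustration; `missing_species` and the sample data are hypothetical.

```python
def missing_species(matrix, classification):
    """Return the species named in `classification` that have no entry in `matrix`."""
    known = set(matrix)
    return sorted(
        {sp for group in classification.values() for sp in group if sp not in known}
    )


# Made-up sample data, only to show the call
sample_matrix = {"Chilo": {"Character1": "1"}, "Crambus": {"Character1": "2"}}
sample_classification = {"1": ["Chilo", "Agriphila"], "2": ["Crambus"]}
print(missing_species(sample_matrix, sample_classification))
```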


In [7]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
with open("D:/桌面/TEST-KG/nexus fix/matrix_knowledge_graph_22.json", "r", encoding="utf-8") as file:
    matrix_data = json.load(file)

# Initial classification, obtained by parsing the API response for the initial character with the parsing function.
initial_classification = {'1': ['Agriphila', 'Chilo', 'Euchromius', 'Haimbachia'], '2': ['Ancylolomia', 'Calamotropha', 'Catoptria', 'Chrysocrambus', 'Chrysoteuchia', 'Crambus', 'Donacaula', 'Pediasia', 'Platytes', 'Schoenobius', 'Thisanotia']}

# List that collects the species list of every state.
groups = []

# This lays the foundation for the subsequent step-by-step loops: the species list of each key is stored in a list so that it can be iterated over repeatedly and the results collected.
for state, species_list in initial_classification.items():
    groups.append(species_list)

# Print the results stored in the list.
for i, group in enumerate(groups, 1):
    print(f"Group {i} species:", group)
print(groups)
group_1_species = groups[0]
group_2_species = groups[1]

print(f"State 1: {group_1_species}",f"State 2: {group_2_species}")

Group 1 species: ['Agriphila', 'Chilo', 'Euchromius', 'Haimbachia']
Group 2 species: ['Ancylolomia', 'Calamotropha', 'Catoptria', 'Chrysocrambus', 'Chrysoteuchia', 'Crambus', 'Donacaula', 'Pediasia', 'Platytes', 'Schoenobius', 'Thisanotia']
[['Agriphila', 'Chilo', 'Euchromius', 'Haimbachia'], ['Ancylolomia', 'Calamotropha', 'Catoptria', 'Chrysocrambus', 'Chrysoteuchia', 'Crambus', 'Donacaula', 'Pediasia', 'Platytes', 'Schoenobius', 'Thisanotia']]
State 1: ['Agriphila', 'Chilo', 'Euchromius', 'Haimbachia'] State 2: ['Ancylolomia', 'Calamotropha', 'Catoptria', 'Chrysocrambus', 'Chrysoteuchia', 'Crambus', 'Donacaula', 'Pediasia', 'Platytes', 'Schoenobius', 'Thisanotia']


The loop construction needs two functions:
1. API call: create a generic API call that applies to every key, so that results stay consistent across all keys in groups.
2. Loop construction: the loop function should pass only the matrix information of the species in each key's subgroup, which reduces the load on each API call. Because classification requires repeated API calls, passing the correct input each time is crucial.
In addition, pay attention to how each API response is saved, stored, and parsed. Ideally the loop simply outputs the final result.
One potential issue is the number of taxa that can be classified in a single call. The current test dataset includes 12 taxa; after the initial classification they fall into groups of roughly 2, 4, and 6. Errors tend to occur in the group of 6. Therefore, add a condition to the loop: if the number of species in a key exceeds a threshold, make a separate API call that first splits the group, then classify the resulting subgroups directly in the next step. This amounts to dynamic selection of the classification method; the intermediate results must be parsed before they are passed back to the API, so an additional parsing function is needed after this conditional call.
The current priority is to construct the loop.

In [5]:
# API call function for continued grouping for each subgroup
def classify_group(group_species):
    group_matrix = {species: matrix_data[species] for species in group_species}
    group_matrix_str = json.dumps(group_matrix, ensure_ascii=False)
    messages3 = [
            {"role": "system",
             "content":
                 """
                 You are a helpful taxonomist assistant.\n
                 You are skilled at calculating the correct information gain to choose the character that best divides species into even groups based on their states.\n
                 Based on the selected character, classify the species into different groups according to their states.\n
                 For each group with more than two species, continue selecting characters to further classify this group until each group only has one species.\n
                 After multiple classifications, determine the final classification levels and record each classifying character and its state.\n
                 Finally, generate a taxonomic key.
                 """},
            {"role": "system","content":
                """
                Generate the nested taxonomic key based on the provided morphological matrix. \n
                The process involves selecting a character to classify the species into groups. Repeat this classification within each subgroup until each group contains only one species.
                Information gain measures how much the uncertainty in the dataset is reduced after using a character for classification. It helps in selecting characters that minimize the entropy of the subset after classification, leading to better classification results.
                Please select the classification character for these group's species based on the morphological matrix and information gain methods.
                In the morphological matrix, 'Missing' and 'Not applicable' are invalid states. If a character has invalid states for the group being classified, it should be ignored.
                States are represented by numbers. For example, '1 and 2' means multiple states should be treated as a single state type and this multi-state characterization should not be confused with the single states within it (the state of '3' and '2 and 3' is different state, when you choose the character to based on the state to distinguish need to careful handle).The initial character should have no more than three state types.
                You need to calculate the information gain for each character and choose the highest information gain result. The higher the information gain result, the greater the contribution of the feature to the classification.
                After selecting the initial classification character and categorizing the species based on its state, repeat the process within each subgroup. For each subgroup, select the character with the highest information gain to further classify the species. Continue this process recursively until each group contains only one species.
                Now I will show you the morphological matrix. Please provide the classification character and the categorization of species based on its state. Then, continue to classify each subgroup recursively, showing the chosen character and categorization for each subgroup. Please present the result in a structured format, with each step clearly labeled.
                please don't show how you analysis and calculate, please show me the final result           
            """},
            {"role": "assistant",
             "content": """
                Understood. I will generate the nested taxonomic key based on the provided morphological matrix. Here is a summary of the steps I will follow:\n
                1. The matrix includes all species and their different states for each character.\n
                2. I will select a character to classify the species into groups and repeat this classification within each subgroup until each group contains only one species.\n
                3. I will use information gain to measure how much the uncertainty in the dataset is reduced after using a feature for classification. This helps in selecting features that minimize the entropy of the subset after classification, leading to better classification results.\n
                4. I will select the classification character for the group's species based on the morphological matrix and information gain methods.\n
                5. In the morphological matrix, 'Missing' and 'Not applicable' are considered invalid states. If a character has invalid states for the group being classified, it will be ignored.\n
                6. States are represented by numbers. For example, '2 and 3' means multiple states should be treated as a single state type, and This multi-state characterization should not be confused with the individual states(like '2', '3') within it (such as '3' and '2 and 3'  is the different state, these are two separate states, when i choose character to based on different state to distinguish the species). The classification character should have no more than three state types.\n
                7. I will use information gain to calculate all character and choose the highest information gain result, The higher the Information Gain result, the greater the contribution of the feature to the classification. \n
                8. The final result will provide only the initial classification character and the categorization of species based on its state. \n
                9. Don't need to show how the process about choose, only need to show the final result as nested structure, and i will store result in #character classify result# block
                Please provide the group morphological matrix data so that I can proceed with the classification.
             """},
            {"role": "user", "content": f"Here is the group information need to be classify and include the morphological matrix{group_matrix_str}"}
        ]
    response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages3,
            stop=None,
            temperature=0,
            max_tokens=1000,
            n=1)
    result = response.choices[0].message.content
    print(f"API response for group {group_species}: {result}")
    return result
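
The prompt asks the model to pick the character with the highest information gain, so a local calculation can serve as a sanity check on the model's choice. The sketch below rests on several assumptions: each species maps to a dictionary of character name to state string (the actual layout of `matrix_data` is not shown here), every species is treated as its own class, and multi-states such as '2 and 3' are kept as distinct states. `information_gain`, `best_character`, and `toy_matrix` are illustrative names.

```python
import math
from collections import Counter

INVALID_STATES = {"Missing", "Not applicable"}


def information_gain(matrix, character, max_states=3):
    """Information gain of splitting the species in `matrix` on `character`.

    Returns None when the character must be ignored: a species has an invalid
    state, or the character has more than `max_states` state types.
    """
    # A character that is absent for a species is treated as 'Missing' (assumption)
    states = [chars.get(character, "Missing") for chars in matrix.values()]
    if any(state in INVALID_STATES for state in states):
        return None
    counts = Counter(states)  # '2' and '2 and 3' are counted as different states
    if len(counts) > max_states:
        return None
    n = len(states)
    # Each species is its own class, so H(before) = log2(n) and
    # H(after) = sum over subgroups of (n_i / n) * log2(n_i)
    before = math.log2(n)
    after = sum(c / n * math.log2(c) for c in counts.values())
    return before - after


def best_character(matrix):
    """Return the valid character with the highest information gain, or None."""
    characters = sorted({c for chars in matrix.values() for c in chars})
    scored = [(information_gain(matrix, c), c) for c in characters]
    valid = [(gain, c) for gain, c in scored if gain is not None]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[0])[1]


# Made-up toy matrix, only to exercise the functions
toy_matrix = {
    "sp_a": {"Character1": "1", "Character2": "1", "Character3": "1"},
    "sp_b": {"Character1": "1", "Character2": "2", "Character3": "1"},
    "sp_c": {"Character1": "2", "Character2": "2 and 3", "Character3": "1"},
    "sp_d": {"Character1": "2", "Character2": "Missing", "Character3": "2"},
}
print(best_character(toy_matrix))
```

In the toy matrix, `Character1` splits the four species into two even groups, `Character3` splits them three to one, and `Character2` is ignored because one species has a 'Missing' state.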

With the main API call function written, all that remains is a loop that processes the groups one by one and stores the results.

In [6]:
def recursive_classification(groups, final_classification, classification_results):
    while groups:  # Continue looping while the groups list is not empty
        try:
            state, current_group = groups.pop(0)  # Extract the first tuple from groups, containing state identifier and current group of species
            print(f"Processing group with state: {state}, species: {current_group}")  # Print the current group's state and species for debugging

            if len(current_group) == 1:  # If the current group has only one species
                final_classification[current_group[0]] = current_group  # Store this single-species group directly in final_classification dictionary
            else:
                classification_result = classify_group(current_group)  # Call the API function to classify the current group
                classification_results[state] = classification_result  # Store the API classification result in classification_results dictionary using state as the key

        except Exception as e:  # Catch and handle any exceptions that may occur
            print(f"Error processing group with state: {state}, species: {current_group}")  # Print the error message and current group's state and species for debugging
            print(f"Exception: {e}")  # Print the details of the exception
            raise e  # Re-raise the exception so that it can be handled by the caller

    return final_classification  # Return the final classification result


This is a new recursive approach that includes an additional conditional statement to help reduce errors when the API processes a large number of species, thereby improving the accuracy of the final result.

In [ ]:
# This approach aims to improve accuracy by reducing the complexity of each API call.
# classify_first_group is a separate API call for groups with too many species: it first splits a large group into smaller ones, which are then classified further by the other functions (nested classification).
def classify_first_group(group_species):
    group_matrix = {species: matrix_data[species] for species in group_species}
    group_matrix_str = json.dumps(group_matrix, ensure_ascii=False)
    # The chat completions API expects a list of role/content messages, not a bare string,
    # and the group's matrix has to be included in the message for the model to see it.
    messages = [
        {"role": "user",
         "content": f"Here is the group that needs to be classified, with its morphological matrix: {group_matrix_str}"}
    ]
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        stop=None,
        temperature=0,
        max_tokens=1000,
        n=1)
    result = response.choices[0].message.content
    print(f"API response for group {group_species}: {result}")
    return result

# This is a new recursive approach that includes an additional conditional statement to help reduce errors when the API processes a large number of species, thereby improving the accuracy of the final result.
def recursive_classification_2(groups, final_classification, classification_results):
    while groups:
        # Bind both names before the try block so the except handler can always print them
        state, current_group = None, None
        try:
            state, current_group = groups.pop(0)
            print(f"Processing group with state: {state}, species: {current_group}")  # Debug information

            if len(current_group) == 1:
                final_classification[current_group[0]] = current_group
            elif len(current_group) > 6:
                # Why 6: testing so far suggests that problems start at around 6 species, but this is
                # not conclusive. In those tests the entire matrix was sent with every call, so the API
                # had to pick the relevant species out of the full dataset, which may have added
                # overhead and contributed to the errors. The errors are not gross; they involve
                # misreading '2' and '2 and 3' as a single multi-state feature. The API may therefore
                # do better when it analyzes one subgroup at a time; its limit is probably around 10 species.
                # Use classify_first_group to subdivide datasets with more than 6 species
                first_classification_result = classify_first_group(current_group)
                classification_results[state] = first_classification_result  # Store initial classification result

                # NOTE: the API functions return raw text. The result must first be parsed into a
                # {state: [species, ...]} dictionary, otherwise .items() fails.
                # Re-queue every subgroup: one that still has more than 6 species is subdivided again
                # on a later iteration, and a smaller one is classified directly.
                for new_state, new_group in first_classification_result.items():
                    groups.append((new_state, new_group))
            else:
                # Use classify_group to classify datasets with 6 or fewer species
                classification_result = classify_group(current_group)
                classification_results[state] = classification_result  # Store API call result

                # Add the new group results to groups for further classification (same parsing requirement as above)
                for new_state, new_group in classification_result.items():
                    groups.append((new_state, new_group))

        except Exception as e:
            print(f"Error processing group with state: {state}, species: {current_group}")  # Error debug information
            print(f"Exception: {e}")
            raise  # Re-raise with the original traceback
    return final_classification
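
Both classification functions return the raw response text, while the loop iterates over `.items()`, so a parsing step is needed in between. One possible approach is sketched below. It assumes the prompt is changed to request a JSON object that maps each state to a list of species, which the current prompts do not ask for. `parse_state_groups` is a hypothetical helper.

```python
import json
import re


def parse_state_groups(response_text):
    """Extract a {state: [species, ...]} dictionary from an API response.

    Assumes the response contains exactly one JSON object, possibly surrounded
    by explanatory text.
    """
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the response")
    data = json.loads(match.group(0))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping states to species lists")
    groups = {}
    for state, members in data.items():
        if not isinstance(members, list) or not all(isinstance(m, str) for m in members):
            raise ValueError(f"state {state!r} does not map to a list of species names")
        groups[str(state)] = members
    return groups


# Made-up response text, only to show the call
sample_response = 'Grouping: {"1": ["Chilo", "Agriphila"], "2 and 3": ["Crambus"]} Done.'
print(parse_state_groups(sample_response))
```

Validating the shape here keeps malformed responses from entering the queue, where they would otherwise fail several iterations later and be harder to trace.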


We need to set up empty containers to store these results and pass them into the call. Two are used: final_classification and classification_results. final_classification covers the case where the initial character directly separates out a single species; it may also be needed again when a group of more than 6 species is subdivided.

In [9]:
# Assume the variables have been initialized
# Dictionary to store the final classification where each species is classified individually
final_classification = {}

# Dictionary to store the API classification results for each state
classification_results = {}

# Print the initial state of groups and dictionaries for debugging purposes
print("Initial groups:", groups)
print("Initial final_classification:", final_classification)
print("Initial classification_results:", classification_results)

# Call the recursive_classification function to process the groups and store the results
final_classification = recursive_classification(groups, final_classification, classification_results)

# Print the final classification results
print("Final Classification:")
print(json.dumps(final_classification, indent=2, ensure_ascii=False))

# Print the classification results from the API calls
print("\nClassification Results:")
print(json.dumps(classification_results, indent=2, ensure_ascii=False))


Initial groups: [['Ancylolomia', 'Calamotropha', 'Catoptria', 'Chrysocrambus', 'Chrysoteuchia', 'Crambus', 'Donacaula', 'Pediasia', 'Platytes', 'Schoenobius', 'Thisanotia']]
Initial final_classification: {}
Initial classification_results: {}


UnboundLocalError: cannot access local variable 'state' where it is not associated with a value
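
The error above is consistent with `groups` holding bare species lists (as the printed `Initial groups` shows) rather than `(state, species)` pairs. Unpacking the popped item fails inside the `try` block, and the `except` handler then reads `state` before it has been assigned, which hides the original error. A minimal sketch of one way to avoid this follows, assuming the groups are built from `initial_classification.items()`; `take_next_group` is a hypothetical helper.

```python
def take_next_group(groups):
    """Pop the next work item and check that it is a (state, species_list) pair."""
    item = groups.pop(0)
    if not (isinstance(item, tuple) and len(item) == 2):
        raise ValueError(f"expected a (state, species_list) pair, got {item!r}")
    return item


# Shortened example data
initial_classification = {
    "1": ["Agriphila", "Chilo", "Euchromius", "Haimbachia"],
    "2": ["Ancylolomia", "Calamotropha"],
}

# Keep the state together with its species so the loop can unpack both
groups = list(initial_classification.items())
state, current_group = take_next_group(groups)
print(state, current_group)
```

Calling the helper outside the `try` block, or binding `state` and `current_group` to `None` before it, keeps the error handler from raising a second, unrelated exception.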

Once the results have been stored, they need to be parsed and merged into a single final dictionary. To do this, first extract the key content from the result of each API call, parse each extract separately, and then combine the parsed results into one structure.

In [ ]:
# The previous loop produces, for each state of the initial character, the classification result returned by the follow-up API call.
# print(result1)
# print(result2)
# print(result3)

# From these three results we need to extract the most important part: the classification path of each species within its subgroup.
# For this to work, each API call must be instructed to place the subgroup's classification in a block labeled "Final Taxonomic Key".
# An extraction function can then pull out the final classification information.

# Define a function to extract the required part from each result
def extract_final_taxonomic_key(result):
    match = re.search(r'Final Taxonomic Key(.*)', result, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return None

# Extract the required part from each result.
# The raw API responses are stored in classification_results (keyed by state);
# final_classification only holds single-species groups, so it cannot be indexed with 0, 1, 2.
final_taxonomic_keys = {
    state: extract_final_taxonomic_key(result)
    for state, result in classification_results.items()
}

# Print the extracted results
for state, key_text in final_taxonomic_keys.items():
    print(f"Final Taxonomic Key {state}:\n", key_text)

A simple example of how the results of parsing may end up being stored:

In [ ]:
result1_1 = """
1. Character18
   - 1 and 2: Equisetum_palustre
   - 1: Equisetum_litorale
"""
result2_1 = """
1. Character8
   - 1: 
     - Character7
       - 1: Equisetum_ramosissimum
       - 2 and 3: 
         - Character26
           - 1: Equisetum_variegatum
           - 2: Equisetum_trachyodon
       - 3: Equisetum_hyemale
   - 2: 
     - Character9
       - 1: Equisetum_moorei
       - 2: Equisetum_pratense
"""
result3_1 ="""
1. Character2
   - 1: 
     - Character10
       - 1: Equisetum_telmateia
       - 2: Equisetum_arvense
   - 2: 
     - Character20
       - 2 and 3: Equisetum_pratense
       - 3: Equisetum_sylvaticum
   - 3: Equisetum_fluviatile
"""

# Define a function to parse each result string
def parse_result(result):
    lines = result.strip().split('\n')
    result_dict = {}
    stack = [(-1, result_dict)]  # (indentation level, dictionary) pairs; the root sits at level -1 so it is never popped
    for line in lines:
        indent_level = len(line) - len(line.lstrip())
        # Pop back to the parent of this line BEFORE choosing the dictionary to write into;
        # otherwise a line that returns to a shallower level is stored in the deeper dictionary
        while stack[-1][0] >= indent_level:
            stack.pop()
        current_dict = stack[-1][1]
        
        if ':' in line:
            key, value = line.split(':', 1)
            key = key.strip()
            value = value.strip()
            
            if value:
                current_dict[key] = value
            else:
                current_dict[key] = {}
                stack.append((indent_level, current_dict[key]))
        else:
            key = line.strip()
            current_dict[key] = {}
            stack.append((indent_level, current_dict[key]))
    
    return result_dict

# Parse each result string
parsed_result1 = parse_result(result1_1)
parsed_result2 = parse_result(result2_1)
parsed_result3 = parse_result(result3_1)

# Check if the parsing results are correct
parsed_results = [parsed_result1, parsed_result2, parsed_result3]

# Check the number of keys in the initial classification results
initial_keys = list(initial_classification.keys())

# Check if the numbers match
if len(parsed_results) != len(initial_keys):
    print("Error: Parsed results and initial keys do not match in length.")
else:
    # Build the main dictionary
    nested_structure = {}
    for key, parsed_result in zip(initial_keys, parsed_results):
        nested_structure[key] = parsed_result
    
    # Print the nested structure
    print(json.dumps(nested_structure, indent=2))
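
To check the merged structure, it can help to flatten it into one path per species and look for species that are missing or that appear more than once. The following is a minimal sketch, assuming leaves are species names stored as strings and every other value is a nested dictionary. `species_paths` and the sample data are illustrative, not taken from a real run.

```python
from collections import Counter


def species_paths(node, path=()):
    """Return (species, path) pairs for every leaf of a nested key dictionary."""
    pairs = []
    for key, value in node.items():
        if isinstance(value, dict):
            pairs.extend(species_paths(value, path + (key,)))
        else:
            pairs.append((value, path + (key,)))
    return pairs


# Made-up sample in which one species is deliberately listed twice
sample = {
    "1": {"1. Character18": {"- 1 and 2": "Equisetum_palustre", "- 1": "Equisetum_litorale"}},
    "2": {"1. Character9": {"- 1": "Equisetum_moorei", "- 2": "Equisetum_palustre"}},
}
pairs = species_paths(sample)
counts = Counter(species for species, _ in pairs)
duplicates = sorted(species for species, n in counts.items() if n > 1)
print(duplicates)
```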
