# BioLLM x Plants - Idea Mining, Divergent-Convergent Script

Rachel K. Luu, Ming Dao, Subra Suresh, Markus J. Buehler (2025) ENHANCING SCIENTIFIC INNOVATION IN LLMS: A FRAMEWORK APPLIED TO PLANT MECHANICS RESEARCH [full reference to be updated to be included here]

## Load Model Functions

The divergent generation phase uses BioinspiredLLM quantized to 8-bit. The convergent evaluation phase uses Llama-3.1-8b-instruct quantized to 8-bit. Depending on your system, you may need to load the two models separately rather than at the same time.

In [None]:
from llama_index.core import PromptTemplate
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from typing import List, Optional, Sequence
import pandas as pd


#### LLAMA 3.1 TEMPLATE #### 

# Transform a string into input using Llama-specific chat format
def completion_to_prompt(completion):
    return "<|start_header_id|>system<|end_header_id|>\n<eot_id>\n<|start_header_id|>user<|end_header_id|>\n" + \
           f"{completion}<eot_id>\n<|start_header_id|>assistant<|end_header_id|>\n"



# Transform a list of chat messages into Llama-specific input
def messages_to_prompt(messages):
    prompt = "<|start_header_id|>system<|end_header_id|>\n<eot_id>\n"  # Start with a system message placeholder
    for message in messages:
        if message.role == "system":
            prompt += f"system message<eot_id>\n"
        elif message.role == "user":
            prompt += f"<|start_header_id|>user<|end_header_id|>\n{message.content}<eot_id>\n"
        elif message.role == "assistant":
            prompt += f"<|start_header_id|>assistant<|end_header_id|>\n{message.content}<eot_id>\n"

    # Add a final assistant prompt for generation
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n"
    
    return prompt
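
For comparison, the chat format documented for Llama 3 and 3.1 models writes the end-of-turn token as `<|eot_id|>` (with pipes), opens the prompt with `<|begin_of_text|>`, and leaves a blank line after each header. The sketch below builds a prompt in that layout. The function name `llama3_chat_prompt`, the `(role, content)` tuple input, and the `add_bos` flag are assumptions made for illustration; this is not the template defined above, and swapping it in may change model outputs. Some runtimes add the begin-of-text token themselves, which is why it is optional here.

```python
def llama3_chat_prompt(turns, add_bos=True):
    """Format (role, content) pairs in the Llama 3 / 3.1 chat layout (illustrative helper)."""
    parts = ["<|begin_of_text|>"] if add_bos else []
    for role, content in turns:
        # Each turn: header naming the role, blank line, content, end-of-turn token
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    # Open an assistant turn so the model generates the reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = llama3_chat_prompt(
    [
        ("system", "You are a helpful assistant."),
        ("user", "Name one plant fiber."),
    ]
)
print(example)
```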


## Load BioLLM Q8

In [None]:
model_url = "https://huggingface.co/rachelkluu/Llama3.1-8b-Instruct-CPT-SFT-DPO-09022024-Q8_0-GGUF/resolve/main/llama3.1-8b-instruct-cpt-sft-dpo-09022024-q8_0.gguf"
bioinspiredllm_q8 = LlamaCPP(
    model_url=model_url,
    model_path=None,
    temperature=1,
    max_new_tokens=2048,
    context_window=16000,
    model_kwargs={"n_gpu_layers": -1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=False,
     
)

## Load Llama3.1 Q8

In [None]:
model_url = "https://huggingface.co/rachelkluu/Meta-Llama-3.1-8B-Instruct-Q8_0-GGUF/resolve/main/meta-llama-3.1-8b-instruct-q8_0.gguf"
llama31_q8 = LlamaCPP(
    model_url=model_url,
    model_path=None,
    temperature=.1,
    max_new_tokens=5000,
    context_window=16000,
    model_kwargs={"n_gpu_layers": -1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
     
)

## Divergent Generation
For the divergent generation phase, BioinspiredLLM quantized to 8-bit is used.

In [2]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_response

Settings.llm = bioinspiredllm_q8
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

documents = SimpleDirectoryReader(
    "./PlantPapers/"   # FOLDER TO PAPERS OF INTEREST
).load_data()

Settings.chunk_size = 128
Settings.chunk_overlap = 50

vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine(response_mode="compact", similarity_top_k=10) 
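
With `chunk_size = 128` and `chunk_overlap = 50`, consecutive chunks share up to 50 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk. The sketch below illustrates the idea with a plain sliding window over a token list. It is a simplification: LlamaIndex's splitter uses a real tokenizer and respects sentence boundaries, so actual chunk edges will differ. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size=128, chunk_overlap=50):
    """Split a token list into fixed-size windows that overlap by chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(300)]  # stand-in for a tokenized paper
chunks = sliding_chunks(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Small chunks with a large `similarity_top_k` (10 above) retrieve many short, focused passages rather than a few long ones.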

In [3]:
import re

# Matches a leading bullet symbol ("- ", "• ", "* "), a numbered marker of any
# length ("1. ", "12. "), or a bracketed number ("[3]") at the start of a line
BULLET_PATTERN = re.compile(r'^(?:[-•*]\s+|\d+\.\s+|\[\d+\]\s*)')


def extract_bullet_points(response_text):
    # Set to store unique points with their bullet or number stripped
    bullet_points = set()

    # Loop through lines to find bullet points or numbered points
    for line in response_text.split('\n'):
        # Strip leading/trailing whitespace
        line = line.strip()

        match = BULLET_PATTERN.match(line)
        if match:
            # Remove the marker and keep the remaining text, if any
            point = line[match.end():].strip()
            if point:
                bullet_points.add(point)

    # Convert set back to list for further processing
    return list(bullet_points)



def sample_bullets(num_generations, num_ideas, prompt):
    # Initialize an empty set to store all unique bullet points across multiple generations
    all_bullet_points = set()
    
    # Rows of (prompt, idea) pairs used to build the output DataFrame
    data_for_df = []


    # Loop to run the generation multiple times
    for gen_num in range(num_generations):
        # Create a custom prompt for each generation
        #txt = f"{prompt} Be creative and very specific. Concisely brainstorm {num_ideas} different ideas into a bullet point list. No explanations."
        txt = f"{prompt} Be creative. Concisely brainstorm {num_ideas} different ideas into a bullet point list."

        # Run the query using the query engine
        response = query_engine.query(txt)

        # Extract bullet points from the response object
        bullet_points = extract_bullet_points(response.response)

        # Add the extracted bullet points to the set of all bullet points
        all_bullet_points.update(bullet_points)

        # Track the prompt and the corresponding bullet points for this generation
        for bullet_point in bullet_points:
            data_for_df.append({"Prompt": prompt, "Idea": bullet_point})
        
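
    # Count the unique bullet points collected across all generations
    # (computed after the loop so it is defined even when num_generations is 0)
    bullcount = len(all_bullet_points)

    # Convert the list to a pandas DataFrame for column display
    df = pd.DataFrame(data_for_df, columns=["Prompt", "Idea"])

    return df, all_bullet_points, bullcount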

from difflib import SequenceMatcher

# Function to check similarity between two strings
def are_similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a, b).ratio() > threshold


def filter_ideas(new_ideas, unique_ideas, similarity_threshold=0.8):
    filtered_ideas = []
    
    # Compare each new idea with the unique ideas (from previous generations)
    for idea in new_ideas:
        is_unique = all(not are_similar(idea, existing_idea, similarity_threshold) for existing_idea in unique_ideas)
        
        # Check also within the new batch for duplicates
        if is_unique and all(not are_similar(idea, filtered_idea, similarity_threshold) for filtered_idea in filtered_ideas):
            filtered_ideas.append(idea)
    
    return filtered_ideas

# Concatenate the new DataFrame but filter the ideas
def filter_unique_ideas(df_existing, df_new, similarity_threshold=0.8):
    # Extract the existing ideas from the existing DataFrame
    unique_ideas = df_existing['Idea'].tolist()
    
    # Extract the new ideas from the new DataFrame
    new_ideas = df_new['Idea'].tolist()
    
    # Filter new ideas to only keep unique ones
    filtered_ideas = filter_ideas(new_ideas, unique_ideas, similarity_threshold)
    
    # Filter the new DataFrame to only keep rows with the filtered unique ideas
    df_filtered = df_new[df_new['Idea'].isin(filtered_ideas)]
    
    # Concatenate the existing and filtered new DataFrames
    return pd.concat([df_existing, df_filtered], ignore_index=True)
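
To get a feel for what the similarity threshold does, it can help to print the `SequenceMatcher` ratio for a few idea pairs before choosing a value. The sketch below is a standalone illustration using only `difflib`; the helper names `similarity` and `dedupe` and the sample ideas are made up for this example. Because the ratio is character-based, a paraphrase can score well below a near-verbatim repeat, so the filter mainly removes ideas that are worded almost identically.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def dedupe(ideas, threshold=0.7):
    """Keep an idea only if its ratio to every kept idea is at or below threshold."""
    kept = []
    for idea in ideas:
        if all(similarity(idea, existing) <= threshold for existing in kept):
            kept.append(idea)
    return kept


ideas = [
    "Wound dressing for burns",
    "Wound dressing for burn",  # near-identical spelling
    "Burn wound dressings",     # same meaning, different wording
    "Oil spill cleanup",
]

# Print each idea's ratio against the first one
for idea in ideas[1:]:
    print(f"{similarity(ideas[0], idea):.2f}  {idea}")

print(dedupe(ideas))
```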



### Divergent Inference

In [5]:
prompt = "Beyond hemorrhage control, where else could the high absorption properties of pollen-based cryogels be effectively applied?"
num_per_gen = ""  # optionally, how many ideas to request per generation (left blank here)
sim_thres = 0.7  # new ideas whose similarity ratio to an existing idea exceeds this value are dropped as duplicates
num_ideas = 100  # minimum number of unique ideas to collect in total

##############################

finaldf = pd.DataFrame({
    "Prompt":[],
    "Idea":[],
})

while len(finaldf) < num_ideas:
    gendf, bullets, bullcount = sample_bullets(1, num_per_gen, prompt) 
    print(bullets)
    print(f"{bullcount} Ideas were generated")
    
    finaldf = filter_unique_ideas(finaldf, gendf, similarity_threshold= sim_thres)
    print(finaldf)
    print(f"Unique rows were added. Current number of rows: {len(finaldf)}")

print("DataFrame is now ready")
finaldf.to_csv(f'{prompt}.csv', index=False)


{'Drug metabolism: Pollen cryogels could be used to create scaffolds for studying drug metabolism and transport due to their ability to mimic the extracellular matrix and support cell growth. This could help in the development of new drugs and therapies.', 'Wound healing: The high absorption properties of pollen cryogels could be used to aid in wound healing by promoting the growth of new tissue, reducing inflammation, and preventing infections. This could be achieved by incorporating bioactive compounds or drugs into the cryogels.', 'Medical imaging: Pollen cryogels could be used to create contrast agents for medical imaging due to their biocompatibility and unique microstructure. This could be achieved by incorporating contrast agents or other imaging agents into the cryogels.', 'Biocompatible coatings: Pollen cryogels could be used as a coating on various surfaces, such as implants, to promote biocompatibility and prevent infections. This could be achieved by incorporating antibacte
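
The `while len(finaldf) < num_ideas` loop ends only once enough unique ideas have accumulated, so it can run indefinitely if the model keeps repeating itself. One possible guard is to cap the number of generations and stop after several consecutive batches that add nothing new. The sketch below shows the control flow with a stub generator in place of the query engine; `collect_unique`, `max_generations`, and `patience` are hypothetical names, and exact-match deduplication stands in for the similarity filter.

```python
from itertools import cycle


def collect_unique(generate_batch, target, max_generations=50, patience=5):
    """Collect unique ideas until target is met, the cap is hit, or progress stalls."""
    unique = []
    stale = 0  # consecutive generations that added nothing new
    for _ in range(max_generations):
        before = len(unique)
        for idea in generate_batch():
            if idea not in unique:
                unique.append(idea)
        stale = stale + 1 if len(unique) == before else 0
        if len(unique) >= target or stale >= patience:
            break
    return unique


# Stub standing in for one query-engine call: it only ever produces three ideas
batches = cycle([["a", "b"], ["b", "c"], ["a", "c"]])
ideas = collect_unique(lambda: next(batches), target=10)
print(len(ideas))
```

Here the stub can never reach the target of 10, and the loop exits after five generations without a new idea instead of spinning forever.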

## Convergent Evaluation
For the convergent evaluation phase, Llama-3.1-8b-instruct quantized to 8-bit is used.

In [2]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_response

Settings.llm = llama31_q8
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

documents = SimpleDirectoryReader(
    "./PlantPapers/"   # FOLDER TO PAPERS OF INTEREST
).load_data()

Settings.chunk_size = 128
Settings.chunk_overlap = 50

vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine(response_mode="compact", similarity_top_k=10) 

In [5]:
import random
import json
import pandas as pd

# remove "\n" from generated outputs to be read into json 
def clean_json_string(json_string):
    # Remove all newline characters
    cleaned_string = json_string.replace("\n", " ")
    
    # Remove multiple spaces
    #cleaned_string = re.sub(r'\s+', ' ', cleaned_string)
    
   # # Remove trailing commas before closing curly braces
    #cleaned_string = re.sub(r',\s*}', '}', cleaned_string)
    
    return cleaned_string

def load_ideas_from_files(filenames):
    # Initialize an empty list to store all ideas
    all_ideas = []
    
    # Iterate through each file
    for filename in filenames:
        # Load the CSV file
        df = pd.read_csv(filename)
        
        # Extract the ideas (assuming ideas are in the second column)
        ideas = df.iloc[:, 1].tolist()
        
        # Append each idea to the list
        all_ideas.extend(ideas)
    
    return all_ideas


def pairwise_elim_and_rate(filenames, prompt, round_limit, outputfilename, ratings_outputfile):
    # Load ideas from one or more CSV files
    all_ideas = load_ideas_from_files(filenames)
    
    # Randomize the list of ideas initially
    random.shuffle(all_ideas)

    round_number = 1
    all_rounds_data = []  # List to collect data for each round
    ratings_data = []  # List to store ratings data for each idea

    while len(all_ideas) > 1 and round_number <= round_limit:
        print(f"--- Round {round_number} ---")
        winners = []
        
        # Compare ideas in pairs
        for i in range(0, len(all_ideas), 2):
            if i + 1 < len(all_ideas):  # Make sure we have a pair
                idea_1 = all_ideas[i]
                idea_2 = all_ideas[i + 1]

                # Only rate the ideas in the first round
                if round_number == 1:
                    # Prepare the rating prompt for these two ideas
                    rating_prompt = f"""You must critically rate each idea out of 10 for categories: novelty and effectiveness.
                    The response must only contain the ratings in the following strict JSON format:
                    {{
                    "Idea 1": {{"novelty": X, "effectiveness": Y}},
                    "Idea 2": {{"novelty": X, "effectiveness": Y}}
                    }}"""
                    
                    ideas_str = f"Idea 1: {idea_1}\nIdea 2: {idea_2}"
                    rating_txt = f"{rating_prompt}\n\n{ideas_str}"

                    # Query the LLM to rate the two ideas
                    rating_response = query_engine.query(rating_txt)

                    # Strip newlines so the response can be parsed as JSON
                    cleaned_rating_response = clean_json_string(rating_response.response)
                    try:
                        ratings = json.loads(cleaned_rating_response)

                        # Append the ratings to the ratings_data list for storage
                        ratings_data.append({
                            'Idea': idea_1,
                            'Novelty': ratings.get('Idea 1', {}).get('novelty', None),
                            'Effectiveness': ratings.get('Idea 1', {}).get('effectiveness', None)
                        })
                        ratings_data.append({
                            'Idea': idea_2,
                            'Novelty': ratings.get('Idea 2', {}).get('novelty', None),
                            'Effectiveness': ratings.get('Idea 2', {}).get('effectiveness', None)
                        })
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON: {e}")
                        continue  # Skip the comparison too, so neither idea in this pair advances

                # Prepare the pairwise comparison prompt
                compare_txt = f"""To answer this question: {prompt} Which idea is better based on novelty and effectiveness?
                Idea 1: {idea_1} 
                Idea 2: {idea_2} 
                Respond with the winner as 'Idea 1' or 'Idea 2' in the following JSON format:
                {{ "winner": "Idea 1" or "Idea 2" }}
                """

                # Query the model for the pairwise comparison
                comparison_response = query_engine.query(compare_txt)

                # Strip newlines, then read the winner from the JSON response
                cleaned_comparison_response = clean_json_string(comparison_response.response)
                try:
                    result = json.loads(cleaned_comparison_response)
                    winner = idea_1 if result["winner"] == "Idea 1" else idea_2
                    winners.append(winner)
                except (json.JSONDecodeError, KeyError, TypeError) as e:
                    # Neither idea advances if the response cannot be parsed
                    print(f"Error parsing comparison response: {e}")
            
            else:
                # If there's an odd number of ideas, move the last one directly to the next round
                winners.append(all_ideas[i])

        # Store the remaining ideas after the round
        round_data = [{'Round': round_number, 'Idea': idea} for idea in winners]
        all_rounds_data.extend(round_data)

        # Move the winners to the next round
        all_ideas = winners
        random.shuffle(all_ideas)  # Randomize for the next round
        round_number += 1

    # Final list of winners
    print(f"Final round winners (Top {len(all_ideas)} ideas): {[idea for idea in all_ideas]}")
    
    # Convert the final list of winners into a DataFrame with the same columns as the per-round data
    df_winners = pd.DataFrame({'Idea': all_ideas})
    df_winners['Round'] = round_number - 1  # Mark the last round for the final winners

    # Save the final round and all rounds data to the same CSV file
    output_file = outputfilename
    
    # Convert the list of all rounds' data into a DataFrame
    df_all_rounds = pd.DataFrame(all_rounds_data)

    # Append the final round data to the DataFrame
    final_output_df = pd.concat([df_all_rounds, df_winners])

    # Save everything to a CSV
    final_output_df.to_csv(output_file, index=False)

    print(f"Results (including rounds) exported to {output_file}.")

    # Save the ratings from the first round into a separate CSV file
    df_ratings = pd.DataFrame(ratings_data)
    df_ratings.to_csv(ratings_outputfile, index=False)
    
    print(f"Ratings from the first round exported to {ratings_outputfile}.")
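
`json.loads` fails whenever the model wraps its JSON in extra prose or a code fence, and in the function above a failed parse means neither idea in that pair advances. A more tolerant approach is to scan the reply for the first decodable JSON object. The sketch below does this with `json.JSONDecoder.raw_decode` from the standard library; the helper name `extract_first_json` is hypothetical and is not part of the pipeline above.

```python
import json


def extract_first_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    for pos, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at pos and ignores what follows
            obj, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; keep scanning
        if isinstance(obj, dict):
            return obj
    return None


reply = 'Sure, here is my answer:\n{ "winner": "Idea 2" }\nLet me know if you need more.'
print(extract_first_json(reply))
```

A `None` result can then be handled explicitly, for example by re-querying the model, rather than silently losing both ideas.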


### Convergent Inference

In [6]:
prompt = "Beyond hemorrhage control, where else could the high absorption properties of pollen-based cryogels be effectively applied?"
filename=[f"{prompt}.csv",]
filetagline = "PollenAbsorption"



# Maximum number of elimination rounds to perform
round_limit = 6

# Filename to store the elimination results and final winners
outputfilename = f'{filetagline}_elim.csv'

# Filename to store the ratings collected in the first round
ratings_outputfile = f'{filetagline}_rate.csv'

# Run the pairwise elimination and rating function
pairwise_elim_and_rate(filename, prompt, round_limit, outputfilename, ratings_outputfile)


--- Round 1 ---
--- Round 2 ---
--- Round 3 ---
--- Round 4 ---
--- Round 5 ---
--- Round 6 ---

Final round winners (Top 2 ideas): ['Tissue engineering: The biocompatibility and high absorption properties of pollen-based cryogels can be used to develop novel scaffolds for tissue engineering applications. These cryogels can be used to support the growth and differentiation of various cell types, making them promising materials for regenerative medicine.', 'Wound dressing for burns, as the high absorption properties of pollen cryogels can effectively absorb and retain wound exudates, promoting a moist environment for faster healing.']
Results (including rounds) exported to PollenAbsorption_elim.csv.
Ratings from the first round exported to PollenAbsorption_rate.csv.
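
The "Top 2" in this output is consistent with the bracket arithmetic: each round keeps one idea per pair plus any unpaired leftover, so a pool of n ideas shrinks to ceil(n / 2). The sketch below works this out under two assumptions that are not confirmed by the log: that the pool started at 100 ideas (the `num_ideas` target from the divergent step; the actual pool may be slightly larger) and that every comparison parsed successfully, since a failed parse drops both ideas. The helper name `bracket_sizes` is hypothetical.

```python
import math


def bracket_sizes(n_ideas, round_limit):
    """Pool size before round 1 and after each elimination round."""
    sizes = [n_ideas]
    rounds = 0
    while sizes[-1] > 1 and rounds < round_limit:
        # One winner per pair, plus the unpaired idea when the pool is odd
        sizes.append(math.ceil(sizes[-1] / 2))
        rounds += 1
    return sizes


print(bracket_sizes(100, 6))  # [100, 50, 25, 13, 7, 4, 2]
```

Raising `round_limit` to 7 would, under the same assumptions, reduce the pool to a single winner.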
