This notebook generates a ground truth dataset of user questions for an **Indoor Plant Knowledge Assistant**.  
Using plant records as input, it asks an LLM (`gpt-4o-mini`) for five relevant, self-contained questions per plant, drawing only on fields that contain data, then flattens the results into a pandas DataFrame of `(id, question)` pairs for training or evaluation.

In [4]:
from dotenv import load_dotenv
import json
import openai
from openai import OpenAI
import os
import pandas as pd
from tqdm.auto import tqdm

## Load OpenAI API key and dataset

In [5]:
# Load the environment variables from the .env file
load_dotenv()

# Retrieve the OpenAI API key
api_key = os.getenv('OPENAI_API_KEY')

# Pass the key to the client explicitly
client = OpenAI(api_key=api_key)
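
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file only surfaces later as an authentication error. A minimal sketch of an earlier check follows; `require_env` is a hypothetical helper, not part of the notebook or of any library.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

In the cell above it could replace the `os.getenv` call, for example `api_key = require_env('OPENAI_API_KEY')`, placed after `load_dotenv()`.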


In [6]:
df = pd.read_csv('../data/plants_data.csv')
documents = df.to_dict(orient='records')

In [15]:
documents[:5]

[{'id': 0,
  'name': 'Adelonema wallisii',
  'summary': 'Adelonema wallisii (synonym Homalomena wallisii) is a species of aroid plant (family Araceae) native to Venezuela, Colombia, and Panama.\n\n',
  'cultivation': 'No data available',
  'toxicity': 'No data available'},
 {'id': 1,
  'name': 'Adenium obesum',
  'summary': 'Adenium obesum, more commonly known as a desert rose, is a poisonous species of flowering plant belonging to the tribe Nerieae of the subfamily Apocynoideae of the dogbane family, Apocynaceae. It is native to the Sahel regions south of the Sahara (from Mauritania and Senegal to Sudan), tropical and subtropical eastern and southern Africa, as well as the Arabian Peninsula. Other names for the flower include Sabi star, kudu, mock azalea, and impala lily. Adenium obesum is a popular houseplant and bonsai in temperate regions.\n\n',
  'cultivation': "Adenium obesum is a popular houseplant and bonsai in temperate regions. It requires a sunny location and a minimum indoo
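
Some records, such as `id` 0 above, carry the placeholder `No data available` in one or more fields. The sketch below shows one way to see which fields of a record actually contain text before sending it to the model. `fields_with_data` and the two constants are hypothetical names introduced here; the notebook itself leaves this filtering to the prompt.

```python
NO_DATA = "No data available"                       # placeholder used in the dataset
CONTENT_FIELDS = ("summary", "cultivation", "toxicity")


def fields_with_data(doc):
    """Return the content fields of a plant record that hold real text."""
    return [
        field
        for field in CONTENT_FIELDS
        if str(doc.get(field, "")).strip() not in ("", NO_DATA)
    ]


# Record 0 from the output above (summary whitespace trimmed)
sample = {
    "id": 0,
    "name": "Adelonema wallisii",
    "summary": "Adelonema wallisii (synonym Homalomena wallisii) is a species "
               "of aroid plant (family Araceae) native to Venezuela, Colombia, "
               "and Panama.",
    "cultivation": "No data available",
    "toxicity": "No data available",
}

print(fields_with_data(sample))  # ['summary']
```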

## Functions for generating the ground truth dataset

In [8]:
prompt_template = """
You are simulating a user interacting with our Indoor Plant Knowledge Assistant.
Based on the given plant record, create 5 complete, specific questions the user might ask about the plant. Only create questions for categories that contain information different than "No data available".
The questions must:

* Be relevant to the details in the record.
* Be clear and self-contained (not too short).
* Use as few exact words from the record as possible while keeping the meaning.

Plant record format:

plant name: {name}
summary: {summary}
cultivation: {cultivation}
toxicity: {toxicity}

Provide the output in parsable JSON without using code blocks:

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

In [9]:
prompt = prompt_template.format(**documents[7])
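
The template's last line uses doubled braces because `str.format` treats single braces as placeholders; `{{` and `}}` come out as literal `{` and `}`. A small stand-alone illustration, using a shortened stand-in for the template rather than the real one:

```python
# Shortened stand-in for prompt_template, for illustration only
template = 'plant name: {name}\n{{"questions": ["question1", "question2"]}}'

rendered = template.format(name="Aechmea fasciata")
print(rendered)
# plant name: Aechmea fasciata
# {"questions": ["question1", "question2"]}

# Substituted values are inserted verbatim and are not parsed again,
# so braces inside a plant record do not need escaping.
print(template.format(name="{x}").splitlines()[0])  # plant name: {x}
```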

In [10]:
print(prompt)

You are simulating a user interacting with our Indoor Plant Knowledge Assistant.
Based on the given plant record, create 5 complete, specific questions the user might ask about the plant. Only create questions for categories that contain information different than "No data available".
The questions must:

* Be relevant to the details in the record.
* Be clear and self-contained (not too short).
* Use as few exact words from the record as possible while keeping the meaning.

Plant record format:

plant name: Aechmea fasciata
summary: Aechmea fasciata is a species of flowering plant in the Bromeliaceae family. It is commonly called the silver vase or urn plant and is native to Brazil. This plant is probably the best known species in this genus, and it is often grown as a houseplant in temperate areas.


cultivation: Aechmea fasciata requires partial shade to bright indirect light, and can handle brief periods of early morning sunlight, but should be shielded from the sun during the hotte

In [12]:
def generate_questions(doc):
    """
    Generate user-like questions for a given plant record.

    This function formats the provided plant data into a predefined prompt template,
    sends it to a language model, and returns the generated questions as a JSON string.

    Args:
        doc (dict): A dictionary containing plant details with the keys:
            - name (str): Plant's name.
            - summary (str): Overview or description of the plant.
            - cultivation (str): Cultivation and care instructions.
            - toxicity (str): Information about plant toxicity.

    Returns:
        str: A JSON-formatted string containing the generated questions.
    """
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response
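
`generate_questions` returns the raw model text, and the prompt only asks for JSON without code blocks; nothing enforces it. The following is a sketch of a more defensive parser, assuming the model may occasionally wrap its answer in a Markdown code fence. `parse_questions` is a hypothetical helper, not part of the notebook.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_questions(raw):
    """Parse the model output into a list of question strings.

    Hypothetical helper: strips an optional surrounding code fence,
    then validates the expected {"questions": [...]} shape.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)

    data = json.loads(text)
    questions = data["questions"]
    if not isinstance(questions, list) or not all(
        isinstance(q, str) for q in questions
    ):
        raise ValueError("'questions' must be a list of strings")
    return questions
```

With this helper, `parse_questions(generate_questions(doc))` would return the list directly instead of a JSON string.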

In [13]:
questions = generate_questions(documents[7])

In [14]:
questions

'{"questions": ["What kind of lighting conditions are ideal for growing Aechmea fasciata indoors?", "How can I ensure that my Aechmea fasciata maintains its vibrant coloration?", "What precautions should I take when handling the leaves of Aechmea fasciata to avoid skin irritation?", "What kind of environment is best for keeping Aechmea fasciata healthy outdoors?", "How does Aechmea fasciata manage to form a colony over time?"]}'

## Generating questions for the whole dataset

In [12]:
results = {}

In [None]:
# Process each plant record
for doc in tqdm(documents):

    doc_id = doc['id']
    if doc_id in results:  # Skip if already processed
        continue

    # Generate and parse questions
    questions_raw = generate_questions(doc)
    questions = json.loads(questions_raw)

    # Store questions by plant ID
    results[doc_id] = questions['questions']

  0%|          | 0/197 [00:00<?, ?it/s]
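
In the loop above, a single malformed response makes `json.loads` raise and stops the run, although the `results` check lets a rerun resume where it left off. A sketch of a variant that retries a few times and records the IDs that still fail is shown below. `collect_questions`, its `generate` parameter, and `max_attempts` are hypothetical names; `generate` stands in for `generate_questions`.

```python
import json


def collect_questions(documents, generate, results=None, max_attempts=3):
    """Fill `results` with questions per document ID; return (results, failed_ids).

    `generate` is any callable that takes a record and returns a JSON string.
    """
    results = {} if results is None else results
    failed = []

    for doc in documents:
        doc_id = doc["id"]
        if doc_id in results:      # skip if already processed
            continue

        for _ in range(max_attempts):
            try:
                results[doc_id] = json.loads(generate(doc))["questions"]
                break              # success, move on to the next record
            except (json.JSONDecodeError, KeyError, TypeError):
                continue           # malformed output, try again
        else:
            failed.append(doc_id)  # every attempt failed

    return results, failed
```

A possible call would be `results, failed = collect_questions(tqdm(documents), generate_questions, results)`, after which `failed` lists the records to inspect by hand.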

In [None]:
# Flatten the dict of question lists into (id, question) rows
final_results = []

for doc_id, questions in results.items():
    for q in questions:
        final_results.append((doc_id, q))

df_results = pd.DataFrame(final_results, columns=['id', 'question'])
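
The prompt asks for five questions per plant, but the model is free to return more or fewer. Before saving, a quick count over the flattened `(id, question)` pairs can flag such records. This is a sketch using only the standard library; `question_count_report` is a hypothetical helper and the sample pairs are made up for the demonstration.

```python
from collections import Counter


def question_count_report(pairs, expected=5):
    """Return {doc_id: count} for IDs whose question count differs from `expected`."""
    counts = Counter(doc_id for doc_id, _ in pairs)
    return {doc_id: n for doc_id, n in counts.items() if n != expected}


# Made-up pairs, with expected=2 to keep the example short
pairs = [(0, "q1"), (0, "q2"), (1, "q1")]
print(question_count_report(pairs, expected=2))  # {1: 1}
```

In the notebook it could be called as `question_count_report(final_results)`; an empty dict means every plant has exactly five questions.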

## Save the ground truth dataset to CSV

In [None]:
# Save to CSV
df_results.to_csv('../data/ground-truth-retrieval.csv', index=False)