## Initialize the Libraries and Set Up the OpenAI Environment

In [11]:
# !pip install pandas
# !pip install openai
# !pip install python-dotenv

In [12]:
import os
import re
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from the .env file in the project root

api_key = os.getenv("OPENAI_API_KEY")  # OPENAI_API_KEY must be defined in .env
client = OpenAI(api_key=api_key)

import warnings
warnings.filterwarnings('ignore')
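
`os.getenv` returns `None` when a variable is missing, so a missing key only surfaces later, when the client is built or the first request is sent. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is unset or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file or export it in the shell."
        )
    return value
```

With this in place, `api_key = require_env("OPENAI_API_KEY")` would stop the notebook at the setup cell rather than partway through the request loop.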

## Select 10 Captioned Charts and Compare Responses to Different Prompts

In [13]:
imageids = [82, 184, 196, 236, 290, 324, 332, 380, 447, 547]

df = pd.read_csv('../data/200charts.csv')
df = df[df['imageid'].isin(imageids)].reset_index(drop=True)
df.head()

Unnamed: 0,imageid,full_caption,image_base64,domain,chart_type,views
0,82,private and publicsector investment in rd clas...,iVBORw0KGgoAAAANSUhEUgAAA5EAAAEbCAIAAADMBJd/AA...,Healthcare,Bar Graph,single view
1,184,Probability ratio (PR) of exceeding (heavy pr...,iVBORw0KGgoAAAANSUhEUgAAAqcAAAEDCAIAAACH4jo2AA...,Climate Science,Line Chart,composite views
2,196,Decomposition of the change in total annual c...,iVBORw0KGgoAAAANSUhEUgAABCsAAAGhCAIAAADHuqkfAA...,Climate Science,Bar Graph,single view
3,236,Projections and uncertainties for global mean ...,iVBORw0KGgoAAAANSUhEUgAABCMAAAG7CAIAAAB2IMgWAA...,Energy,Bar Graph,single view
4,290,The value of improved technology. \nNote: Mode...,iVBORw0KGgoAAAANSUhEUgAABEAAAAHrCAIAAABAfn+SAA...,Energy,Bar Graph,composite views
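
The `image_base64` values shown above begin with `iVBORw0KGgo`, which is the base64 encoding of the PNG file signature. Since the request later labels every image as `image/png`, it may be worth confirming that assumption for each row. A small sketch, assuming a hypothetical helper named `is_png_base64`:

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # First 8 bytes of every PNG file


def is_png_base64(encoded: str) -> bool:
    """Return True if the base64 string decodes to data starting with the PNG signature.

    Hypothetical helper for illustration; only the first 12 characters are decoded.
    """
    try:
        # 12 base64 characters decode to 9 bytes, enough to cover the 8-byte signature.
        head = base64.b64decode(encoded[:12], validate=True)
    except (ValueError, TypeError):
        # Invalid characters, bad padding, or a non-string value (e.g. NaN).
        return False
    return head.startswith(PNG_SIGNATURE)
```

One possible use is `df['image_base64'].map(is_png_base64).all()`, which would flag any chart that needs a different MIME type before requests are sent.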


In [14]:
prompt1 = '''
    I have a chart **along with its caption**, and I need a list of inference-based questions generated from it. Your task is to create **questions that require interpretation, trend analysis, and reasoning**, rather than just retrieving values.

    ### **Guidelines for Question Generation:**  
    1. **Encourage trend identification and pattern recognition.**  
    - Instead of simply asking for numerical values, prompt reasoning about **how** or **why** trends occur.  
    - Example: ✅ *"How does the trend in renewable energy consumption compare to fossil fuel consumption over the past decade?"*  
    - ❌ Avoid: *"What was the renewable energy consumption in 2020?"*  

    2. **Emphasize comparisons, correlations, and cause-effect relationships (when suggested by the chart).**  
    - Look for **patterns between different variables** and form questions that explore their relationship.  
    - Example: ✅ *"Based on the trend shown, how might an increase in electric vehicle sales impact oil consumption?"*  

    3. **Encourage reasoning-based and predictive questions.**  
    - Questions should encourage logical inference rather than direct retrieval of numbers.  
    - Example: ✅ *"Given the declining trend in traditional media consumption, what can we infer about digital media’s dominance in the coming years?"*  

    4. **Ensure all questions are fully answerable using the given chart.**  
    - The chart should contain enough information to support a logical answer.  
    - Example: ✅ *"What does the chart suggest about the relationship between inflation and consumer spending?"*  
    - ❌ Avoid: *"What are the main reasons behind the rise in inflation?"* (Requires external knowledge)  

    ### **Format for Output:**  
    Generate a numbered list of refined inference-based questions:  

    1. (Generated question 1)  
    2. (Generated question 2)  
    3. (Generated question 3)  
    4. (Generated question 4)  

    Now, generate the list of inference-based questions based on the attached chart and caption.

'''

prompt2 = '''
    I have a chart **along with its caption**, and I need a list of **analytical and reasoning-based questions** generated from it. 
    Your task is to create **questions that require interpretation, pattern recognition, causal reasoning, 
    and forecasting—rather than simply retrieving values.**  

    ### **Guidelines for Question Generation:**  
    1. **Adapt to the specific chart**  
    - Select the most relevant question types **based on the data presented** rather than forcing a fixed distribution of question types.  

    2. **Encourage trend identification and pattern recognition**  
    - ✅ Example: *"How does the trend in renewable energy consumption compare to fossil fuel consumption over the past decade?"*  
    - ❌ Avoid: *"What was the renewable energy consumption in 2020?"* (Pure retrieval)  

    3. **Use inference-based reasoning to connect data points**  
    - ✅ Example: *"What does the trend in inflation suggest about changes in consumer spending patterns?"*  

    4. **Incorporate explanatory (cause-effect) questions when relationships are implied in the data**  
    - ✅ Example: *"What pattern in the chart might explain the sharp drop in production costs in 2018?"*  

    5. **Use counterfactual questions only when meaningful scenarios exist in the chart**  
    - ✅ Example: *"If tax rates had remained unchanged in 2019, how might economic growth have differed?"*  

    6. **Include predictive questions only if the trend is clear and projectable**  
    - ✅ Example: *"If the current trend continues, what would be the projected GDP in 2030?"*  

    7. **Prioritize evaluative, anomaly detection, mechanistic, analogical, and conceptual questions only if applicable to the data**  
    - ✅ Example: *"Which investment sector demonstrated the most stable returns over the last decade?"*  
    - ✅ Example: *"Which year deviates the most from the expected trend in GDP growth?"*  

    ### **Constraints:**  
    ✅ **Do not force one question per category—choose questions dynamically.**  
    ✅ **All questions must be fully answerable using only the given chart.**  
    ✅ **Avoid pure retrieval-based questions.**  

    ### **Format for Output:**  
    Generate a numbered list of refined questions:  

    1. (Generated question 1)  
    2. (Generated question 2)  
    3. (Generated question 3)  
    4. (Generated question 4)  

    Now, generate the list of refined questions based on the attached chart and caption.

'''

prompt3 = '''
    I have a chart **along with its caption**, and I need a list of questions generated from it. Your task is to create **open-ended and thought-provoking questions** that encourage deeper exploration, reasoning, and interpretation while ensuring they are fully answerable using only the given chart.  

    ### **Guidelines for Question Generation:**  

    1. **Encourage reflection on underlying themes and insights.**  
    - Instead of focusing on direct numerical retrieval, prompt discussion on **what the data implies** or **what patterns reveal**.  
    - Example: ✅ *"What underlying factors might explain the changes in the trend observed in the chart?"*  

    2. **Focus on implications and broader interpretations.**  
    - Questions should encourage reasoning about **how different aspects of the data relate to each other** rather than just reporting values.  
    - Example: ✅ *"What potential consequences might arise if the trend shown in the chart continues?"*  

    3. **Encourage critical thinking and alternative perspectives.**  
    - Ask about **possible explanations** for patterns or **alternative ways** the data could have been presented.  
    - Example: ✅ *"How would a different visualization (e.g., line chart vs. bar chart) change the way we interpret this data?"*  

    4. **Highlight areas of uncertainty or data limitations.**  
    - Prompt awareness of **what the chart does not show** and encourage reasoning within those constraints.  
    - Example: ✅ *"What are some key insights missing from this chart that would help provide a more complete picture?"*  

    5. **Explore hypothetical and counterfactual reasoning.**  
    - Create **"What if?"** scenarios that are grounded in the data but push the reader to think beyond its immediate representation.  
    - Example: ✅ *"If one category had been excluded from this chart, how would that impact our understanding of the trend?"*  

    ### **Format for Output:**  
    Generate a numbered list of refined open-ended questions:  

    1. (Generated question 1)  
    2. (Generated question 2)  
    3. (Generated question 3)  
    4. (Generated question 4)  

    Now, generate the list of exploratory and conceptual questions based on the attached chart and caption.

'''

prompts = [prompt1, prompt2, prompt3]
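
Each request in the loop that follows combines one of these prompts with a chart's caption and its base64 image. As a sketch only, the message payload could be assembled in a small function so that it can be inspected without calling the API. The name `build_messages` and its `max_questions` parameter are assumptions, not part of the original notebook:

```python
def build_messages(prompt: str, caption: str, chart_b64: str, max_questions: int = 10) -> list:
    """Assemble the chat messages for one chart (hypothetical helper).

    The structure mirrors the request used in this notebook: a system message
    that caps the number of questions, then a user message carrying the prompt,
    the caption, and the chart as a base64 data URL (assumed to be a PNG).
    """
    return [
        {"role": "system", "content": f"Give me a maximum of {max_questions} questions."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "text", "text": "Caption: " + caption},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{chart_b64}"},
                },
            ],
        },
    ]
```

The returned list could then be passed as `messages=build_messages(prompt, caption, chart)` to `client.chat.completions.create`.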

In [15]:
max_epochs = 10

for i, prompt in enumerate(prompts, start=1):  # Iterate over prompts with index starting from 1
    current_epoch = 0  
    results = []

    for idx, row in df.iterrows():
        if current_epoch >= max_epochs:  # Ensure we don't exceed max_epochs
            break  

        chart = row["image_base64"] 
        caption = row['full_caption']

        try:
            response = client.chat.completions.create(
                model="chatgpt-4o-latest",
                messages=[
                    {
                        "role": "system",
                        "content": "Give me a maximum of 10 questions."
                    },
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": prompt,  
                            },
                            {
                                "type": "text",
                                "text": "Caption: " + caption,
                            },
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/png;base64,{chart}"
                                },
                            },
                        ],
                    }
                ],
            )

            # Extract questions from response
            questions = response.choices[0].message.content
            questions = re.findall(r"\d+\.\s(.+?)(?=\n|$)", questions)
            questions = [q.rstrip() for q in questions]

            # Store result with imageid and extracted questions
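            result_entry = {'imageid': row['imageid']}
            for q_num in range(1, 11):
                # The model may return fewer than 10 questions; pad missing slots with None
                result_entry[f'Q{q_num}'] = questions[q_num - 1] if q_num <= len(questions) else None
            results.append(result_entry)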

            print(f"Processed row {idx}")
            current_epoch += 1  

        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            current_epoch += 1  

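    # Convert results to a DataFrame and save one CSV per prompt
    q_df = pd.DataFrame(results)
    output_path = f"../data/prompt_testing/q_prompt{i}.csv"
    os.makedirs(os.path.dirname(output_path), exist_ok=True)  # Create the output folder if needed
    q_df.to_csv(output_path, index=False)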

    print(f"Saved {output_path}")

Processed row 0
Processed row 1
Processed row 2
Processed row 3
Processed row 4
Processed row 5
Processed row 6
Processed row 7
Processed row 8
Processed row 9
Saved ../data/prompt_testing/q_prompt1.csv
Processed row 0
Processed row 1
Processed row 2
Processed row 3
Processed row 4
Processed row 5
Processed row 6
Processed row 7
Processed row 8
Processed row 9
Saved ../data/prompt_testing/q_prompt2.csv
Processed row 0
Processed row 1
Processed row 2
Processed row 3
Processed row 4
Processed row 5
Processed row 6
Processed row 7
Processed row 8
Processed row 9
Saved ../data/prompt_testing/q_prompt3.csv
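
The extraction step in the loop depends on the model returning a numbered list, and the system message only sets a maximum of 10 questions, so a response may contain fewer. A sketch of a parser that handles a short list, assuming a hypothetical function named `parse_numbered_questions`. Unlike the pattern used in the loop, this one anchors on line starts, which is a deliberate variation rather than the notebook's own method:

```python
import re

# Matches lines such as "1. How does A compare to B?" and captures the question text.
NUMBERED_LINE = re.compile(r"^[ \t]*\d+\.[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def parse_numbered_questions(text: str, max_questions: int = 10) -> dict:
    """Extract numbered questions and pad the result to a fixed number of columns.

    Hypothetical helper: returns {"Q1": ..., ..., "Q<max_questions>": ...},
    using None for slots the model did not fill.
    """
    questions = NUMBERED_LINE.findall(text)[:max_questions]
    padded = questions + [None] * (max_questions - len(questions))
    return {f"Q{n}": q for n, q in enumerate(padded, start=1)}
```

For example, `{'imageid': row['imageid'], **parse_numbered_questions(content)}` would produce one row with a stable set of `Q1` to `Q10` columns regardless of how many questions came back.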
