# Make Sample Questions

This notebook starts with a reference to a directory full of documents and an abstract description of the content of those documents.  It then does the following:

1. It uses the abstract description of the documents to generate a large set of questions by calling a question generator model, which is currently set to gpt-4o.  You want a very capable model for that purpose because generating a large volume of questions from an abstract description is a challenging task.
2. It builds a vector database from the content of those documents using Docling to analyze them.
3. It uses RAG and a reference answer generator model (also gpt-4o currently) to generate reference answers.  You need a very capable model as the reference answer generator because you will be treating these reference answers as ground truth for the smaller and presumably less capable models that you actually want to evaluate in the next notebook.
4. It goes through each of the reference answers and asks the reference answer generator model to assess whether the answer really answers the question or just says that it doesn't know.  This is important because you often want separate analyses: how well each model works on questions that have reference answers, versus how well it works on questions where the reference behavior is to decline to answer because the content doesn't say.
5. It stores all of this information in a file for use in the next notebook, [evaluate-using-sample-questions.ipynb](./evaluate-using-sample-questions.ipynb).

If you have time, you should also get a human to vet and improve the reference answers, but that is expensive to do at scale, so in practice it often won't happen.

## Import dependencies

In [37]:
import evaluation_utilities

import os
import re
import requests
import importlib

from pathlib import Path

from IPython.display import clear_output

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

In [38]:
# Rerun this cell whenever you change evaluation_utilities
importlib.reload(evaluation_utilities)

<module 'evaluation_utilities' from '/Users/bmurdock/lls-comparisons/evaluation_utilities.py'>

## Configure and initialize models

The main configuration options for this notebook are in the following cell, so you may want to edit some values there before running.
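As a rough guide to cost, the number of question-generation calls is the product of the topic count, the number of question prompts, and the iterations per topic. A minimal sketch, assuming the three question prompts defined later in this notebook and the default values in the next cell:

```python
# Values mirror the defaults in the configuration cell; adjust to match your edits.
num_topics = 50
num_iterations_per_topic = 10
num_question_prompts = 3  # single-fact, summary, and reasoning prompts

# Each (topic, prompt, iteration) combination is one call to the question generator.
question_calls = num_topics * num_question_prompts * num_iterations_per_topic
# One more call produces the topic list itself.
total_generator_calls = question_calls + 1

print(f"{question_calls} questions requested, {total_generator_calls} generator calls")
```

Each generated question also leads to reference answer generation later on, so it can be worth trying small values first.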

In [9]:
QUESTION_GENERATOR_MODEL_INFO={"model": "gpt-4o", "timeout": 7200}
REFERENCE_ANSWER_GENERATOR_MODEL={"model": "gpt-4o"}
EMBED_MODEL_ID="ibm-granite/granite-embedding-125m-english"

NUM_TOPICS=50
NUM_ITERATIONS_PER_TOPIC=10

CONTENT_URLS=["https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173"]
CONTENT_LOCATION="./docs/"
CONTENT_DESCRIPTION="IBM 2024 Annual Report"

EXPERIMENT_SHORT_LABEL = f"ibm-{NUM_TOPICS}-{NUM_ITERATIONS_PER_TOPIC}"

In [3]:
EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)
question_generator_model = OpenAI(**QUESTION_GENERATOR_MODEL_INFO)
reference_answer_generator_model = OpenAI(**REFERENCE_ANSWER_GENERATOR_MODEL)

In [4]:
messages = [
    ChatMessage(role="user", content="Say hello to the world"),
]
question_generator_model.chat(messages)

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='Hello, world!')]), raw=ChatCompletion(id='chatcmpl-BgtZ0p3DOv3lXFKhaKNS5k9Nw1RZD', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello, world!', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1749563062, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_07871e2ad8', usage=CompletionUsage(completion_tokens=4, prompt_tokens=12, total_tokens=16, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), delta=None, logprobs=None, additional_kwargs={'prompt_tokens': 12, 'completion_tokens': 4, 'total_tokens': 16})

## Download content

This downloads the content at the URLs specified in CONTENT_URLS and stores them in CONTENT_LOCATION.

In [36]:

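def download_file(url: str, output_dir: str|Path):
    # Fail fast on HTTP errors instead of saving an error page as the document.
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    if "filename=" in response.headers.get("content-disposition", ""):
        content_disposition = response.headers["content-disposition"]
        # Take the filename parameter and drop any parameters that follow it.
        filename = content_disposition.split("filename=")[1].split(";")[0].strip()
        if filename.startswith('"') and filename.endswith('"'):
            filename = filename[1:-1]
    else:
        # No usable header, so fall back to the last segment of the URL.
        filename = url.split("/")[-1]
    target = Path(output_dir, filename)
    with open(target, mode="wb") as file:
        file.write(response.content)
    return target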

os.makedirs(CONTENT_LOCATION, exist_ok=True)
for url in CONTENT_URLS:
    target = download_file(url=url, output_dir=CONTENT_LOCATION)
    print(target)

docs/ibm-annual-report-2024.pdf


## Question type prompts

These prompts are adapted from https://github.com/docling-project/docling-sdg/blob/main/docling_sdg/qa/prompts/generation_prompts.py but rewritten to use abstract descriptions of the content and not the content itself.

In [None]:
DEFAULT_FACT_SINGLE_QUESTION_PROMPT = (
    'A "single-fact" question is a question with the following properties:\n'
    "- It is a natural language question.\n"
    "- It is answered with a single piece of factual information.\n"
    "\n"
    "I will provide you with an abstract description of a document and a topic and a list of existing questions.\n"
    "Think of a single-fact question that could plausibly be answered using only information contained in the given "
    "context and that is distinct from the existing questions.\n"
    "\n"
    "## Abstract Description of Document\n\n{document_description_str}\n\n"
    "## Topic\n\n{topic_str}\n\n"
    "## Existing Questions\n\n{existing_questions_str}\n\n"
    "\n"
    "What question did you think about? Do not say anything other than the question."
)

DEFAULT_SUMMARY_QUESTION_PROMPT = (
    'A "summary" question is a question with the following properties:\n'
    "- It is a natural language question.\n"
    "- It is answered with a summary of multiple pieces of information.\n"
    "- It cannot be answered with a single piece of factual information.\n"
    "\n"
    "I will provide you with an abstract description of a document and a topic and a list of existing questions.\n"
    'Think of a "summary" question that must be answered using only information '
    "contained in the given context and that is distinct from the existing questions.\n"
    "\n"
    "## Abstract Description of Document\n\n{document_description_str}\n\n"
    "## Topic\n\n{topic_str}\n\n"
    "## Existing Questions\n\n{existing_questions_str}\n\n"
    "\n"
    "What question did you think about? Do not say anything other than the question."
)

DEFAULT_REASONING_QUESTION_PROMPT = (
    'A "reasoning" question is a question with the following properties:\n'
    "- It is a natural language question.\n"
    "- It requires the reader to think critically and make an inference or draw a "
    "conclusion based on the information provided.\n"
    "\n"
    "I will provide you with an abstract description of a document and a topic and a list of existing questions.\n"
    'Think of a "reasoning" question that must be answered using only information '
    "contained in the given context and that is distinct from the existing questions.\n"
    "\n"
    "## Abstract Description of Document\n\n{document_description_str}\n\n"
    "## Topic\n\n{topic_str}\n\n"
    "## Existing Questions\n\n{existing_questions_str}\n\n"
    "\n"
    "What question did you think about? Do not say anything other than the question."
)

DEFAULT_TOPIC_GENERATION_PROMPT = (
    "I will provide you with an abstract description of a document and ask you to generate a list of topics that might be in that document.\n\n"
    "## Abstract Description of Document\n\n{document_description_str}\n\n"
    "Please generate a list of {num_topics} topics that this document could plausibly address.  Generate one topic per line.  "
    'For each line, put a number and then a "." and then a short description of the topic.'
)

In [6]:
QUESTION_PROMPTS = [DEFAULT_FACT_SINGLE_QUESTION_PROMPT, DEFAULT_SUMMARY_QUESTION_PROMPT, DEFAULT_REASONING_QUESTION_PROMPT]

In [7]:
print(DEFAULT_SUMMARY_QUESTION_PROMPT.format(document_description_str="Stuff about document", topic_str="A topic", existing_questions_str="Who?\nWhen?\nWhere?"))

A "summary" question is a question with the following properties:
- It is a natural language question.
- It is answered with a summary of multiple pieces of information.
- It cannot be answered with a single piece of factual information.

I will provide you with an abstract description of a document and a topic and a list of existing questions.
Think of a "summary" question that must be answered using only information contained in the given context and that is distinct from the existing questions.

## Abstract Description of Document

Stuff about document

## Topic

A topic

## Existing Questions

Who?
When?
Where?


What question did you think about? Do not say anything other than the question.


## Generate topics

In [8]:
# Run with just 3 topics to show what the outputs can look like before continuing on with the full generation using the NUM_TOPICS constant.
message = DEFAULT_TOPIC_GENERATION_PROMPT.format(document_description_str=CONTENT_DESCRIPTION, num_topics=3)
messages = [ChatMessage(role="user", content=message)]
resp = question_generator_model.chat(messages)

In [9]:
print(resp.message.blocks[0].text)

1. Financial Performance Overview: Analysis of IBM's financial results for the year 2024, including revenue, profit margins, and key financial metrics.

2. Strategic Initiatives and Innovations: Discussion of IBM's strategic priorities, technological advancements, and new product developments introduced in 2024.

3. Sustainability and Corporate Responsibility: Overview of IBM's efforts and achievements in sustainability, corporate social responsibility, and environmental impact reduction during the year.


In [10]:
# Assisted by Google Gemini 
def extract_list_items(text_block: str) -> list[str]:
    """
    Extracts items from a multi-line string.

    It handles:
    - Optional blank lines between items.
    - Optional numbering (e.g., "1.", "1 ", "2.") at the start of lines.
    It returns a list of strings, with numbers and blank lines removed,
    and each item stripped of leading/trailing whitespace.

    Args:
        text_block: The multi-line string to process.

    Returns:
        A list of extracted string values.
    """
    items = []
    # Regex to identify leading numbers followed by an optional period and optional whitespace.
    # Example: "1.", "1 ", "  1. ", "2 "
    # This pattern is applied to lines that have already had their outer whitespace stripped.
    # ^      Matches the beginning of the string (the stripped line).
    # \d+    Matches one or more digits (the number).
    # \.?    Matches an optional literal period.
    # \s*    Matches zero or more whitespace characters following the number/period.
    number_prefix_pattern = re.compile(r"^\d+\.?\s*")

    for line in text_block.splitlines():
        # 1. Remove leading/trailing whitespace from the current line.
        stripped_line = line.strip()

        # 2. If the line is blank after stripping, skip it.
        if not stripped_line:
            continue

        # 3. Remove the number prefix, if present.
        #    The sub() method replaces the matched pattern with an empty string.
        item_text = number_prefix_pattern.sub("", stripped_line)

        # 4. Strip any leading/trailing whitespace that might remain on the item_text.
        #    This is important if the original item had spaces after the number,
        #    or if the item itself had leading/trailing spaces (which strip() in step 1
        #    would have handled if no number was present, but this ensures cleanliness
        #    after potential prefix removal).
        final_item_text = item_text.strip()

        # 5. Add the cleaned item to the list, only if it's not empty.
        #    (e.g., a line like "1." would become "" after processing).
        if final_item_text:
            items.append(final_item_text)

    return items
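The number-prefix pattern used in `extract_list_items` can be exercised on its own to see what it strips. The sample lines below are made up for illustration. The last one shows a caveat: an unnumbered line that happens to start with digits (such as a year) loses them.

```python
import re

# Same pattern as in extract_list_items.
number_prefix_pattern = re.compile(r"^\d+\.?\s*")

examples = [
    "1. Revenue Breakdown by Business Segment",
    "12 Software and Services Revenue Analysis",
    "3. 2024 Financial Highlights",
    "2024 Financial Highlights",  # unnumbered line that starts with a year
]
cleaned = [number_prefix_pattern.sub("", line) for line in examples]
for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

If that caveat matters for your content, one option is to require the period (`r"^\d+\.\s*"`), at the cost of no longer handling the `"1 "` numbering style.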

In [11]:
def generate_topics(prompt, content_description, question_generator_model, num_topics):
    message = prompt.format(document_description_str=content_description, num_topics=num_topics)
    messages = [ChatMessage(role="user", content=message)]
    resp = question_generator_model.chat(messages)
    response_text = resp.message.blocks[0].text
    topics = extract_list_items(response_text)
    return response_text, topics

In [12]:
response_text, topics = generate_topics(DEFAULT_TOPIC_GENERATION_PROMPT, CONTENT_DESCRIPTION, question_generator_model, NUM_TOPICS)
topics

["Overview of IBM's 2024 Financial Performance",
 'Revenue Breakdown by Business Segment',
 'Analysis of Global Market Trends',
 'Key Innovations and Technological Advancements',
 'Strategic Initiatives and Partnerships',
 'Sustainability and Environmental Impact Efforts',
 'Corporate Governance and Leadership Changes',
 'Risk Management Strategies',
 "IBM's Cloud Computing Growth",
 'Artificial Intelligence and Machine Learning Developments',
 'Quantum Computing Progress and Investments',
 'Software and Services Revenue Analysis',
 'Hardware and Infrastructure Business Performance',
 "IBM's Role in Digital Transformation",
 'Cybersecurity Measures and Enhancements',
 'Employee Engagement and Workforce Development',
 'Diversity, Equity, and Inclusion Initiatives',
 'Research and Development Investments',
 'Customer Success Stories and Case Studies',
 'Competitive Landscape and Market Position',
 'Financial Highlights and Key Metrics',
 'Shareholder Returns and Dividend Policy',
 'Capit

In [13]:
len(topics)

50

In [14]:
# If the list winds up being messed up for some reason, look at the response text to see what went wrong.
response_text

"1. Overview of IBM's 2024 Financial Performance\n2. Revenue Breakdown by Business Segment\n3. Analysis of Global Market Trends\n4. Key Innovations and Technological Advancements\n5. Strategic Initiatives and Partnerships\n6. Sustainability and Environmental Impact Efforts\n7. Corporate Governance and Leadership Changes\n8. Risk Management Strategies\n9. IBM's Cloud Computing Growth\n10. Artificial Intelligence and Machine Learning Developments\n11. Quantum Computing Progress and Investments\n12. Software and Services Revenue Analysis\n13. Hardware and Infrastructure Business Performance\n14. IBM's Role in Digital Transformation\n15. Cybersecurity Measures and Enhancements\n16. Employee Engagement and Workforce Development\n17. Diversity, Equity, and Inclusion Initiatives\n18. Research and Development Investments\n19. Customer Success Stories and Case Studies\n20. Competitive Landscape and Market Position\n21. Financial Highlights and Key Metrics\n22. Shareholder Returns and Dividend P

## Generate questions for each topic

In [15]:
existing_questions = []

question_prompt = DEFAULT_SUMMARY_QUESTION_PROMPT
topic = topics[0]
existing_questions_str="\n".join(existing_questions) if existing_questions else "NONE"
message = question_prompt.format(document_description_str=CONTENT_DESCRIPTION, topic_str=topic, existing_questions_str=existing_questions_str)
print(message)
messages = [ChatMessage(role="user", content=message)]
resp = question_generator_model.chat(messages)

A "summary" question is a question with the following properties:
- It is a natural language question.
- It is answered with a summary of multiple pieces of information.
- It cannot be answered with a single piece of factual information.

I will provide you with an abstract description of a document and a topic and a list of existing questions.
Think of a "summary" question that must be answered using only information contained in the given context and that is distinct from the existing questions.

## Abstract Description of Document

IBM 2024 Annual Report

## Topic

Overview of IBM's 2024 Financial Performance

## Existing Questions

NONE


What question did you think about? Do not say anything other than the question.


In [16]:
existing_questions.append(resp.message.blocks[0].text)
print(resp.message.blocks[0].text)

What were the key factors that influenced IBM's financial performance in 2024, and how did they impact the company's overall results?


In [17]:
question_prompt = DEFAULT_SUMMARY_QUESTION_PROMPT
topic = topics[0]
# Fall back to "NONE" when there are no existing questions yet.
existing_questions_str="\n".join(existing_questions) if existing_questions else "NONE"

message = question_prompt.format(document_description_str=CONTENT_DESCRIPTION, topic_str=topic, existing_questions_str=existing_questions_str)
print(message)
messages = [ChatMessage(role="user", content=message)]
resp = question_generator_model.chat(messages)

A "summary" question is a question with the following properties:
- It is a natural language question.
- It is answered with a summary of multiple pieces of information.
- It cannot be answered with a single piece of factual information.

I will provide you with an abstract description of a document and a topic and a list of existing questions.
Think of a "summary" question that must be answered using only information contained in the given context and that is distinct from the existing questions.

## Abstract Description of Document

IBM 2024 Annual Report

## Topic

Overview of IBM's 2024 Financial Performance

## Existing Questions

What were the key factors that influenced IBM's financial performance in 2024, and how did they impact the company's overall results?


What question did you think about? Do not say anything other than the question.


In [18]:
existing_questions.append(resp.message.blocks[0].text)
print(resp.message.blocks[0].text)

How did IBM's revenue and profit figures for 2024 compare to previous years, and what trends can be observed from this comparison?


In [19]:
# Now we put all the pieces above into a function that iterates through all the topics, question prompts, and repeats a given number of times.

def generate_questions(topics, content_description, question_prompts, num_iterations_per_topic, question_generator_model):
    all_questions_with_prompts_and_topics = []
    num_questions_expected = len(topics) * len(question_prompts) * num_iterations_per_topic
    i = 1
    for topic in topics:
        for question_prompt in question_prompts:
            existing_questions_for_topic_and_prompt = []
            for _ in range(num_iterations_per_topic):
                existing_questions_str="\n".join(existing_questions_for_topic_and_prompt) if existing_questions_for_topic_and_prompt else "NONE"
                message = question_prompt.format(document_description_str=content_description, topic_str=topic, existing_questions_str=existing_questions_str)
                messages = [ChatMessage(role="user", content=message)]
                resp = question_generator_model.chat(messages)
                question_text = resp.message.blocks[0].text
                existing_questions_for_topic_and_prompt.append(question_text)
                # Note that what we're storing here is the question/prompt/topic tuple.  What we really want as an output is just the question, but the prompt and topic might be useful for understanding where the question came from.
                all_questions_with_prompts_and_topics.append((question_text, question_prompt, topic))
                clear_output(wait=True)
                print(f"{i} / {num_questions_expected}")
                i += 1
    return all_questions_with_prompts_and_topics

In [20]:
all_questions_with_prompts_and_topics = generate_questions(topics, CONTENT_DESCRIPTION, QUESTION_PROMPTS, NUM_ITERATIONS_PER_TOPIC, question_generator_model)

1500 / 1500


In [21]:
all_questions_with_prompts_and_topics

[("What was IBM's total revenue for the year 2024?",
  'A "single-fact" question is a question with the following properties:\n- It is a natural language question.\n- It is answered with a single piece of factual information.\n\nI will provide you with an abstract description of a document and a topic and a list of existing questions.\nThink of a single-fact question that could plausibly be answered using only information contained in the given context and that is distinct from the existing questions.\n\n## Abstract Description of Document\n\n{document_description_str}\n\n## Topic\n\n{topic_str}\n\n## Existing Questions\n\n{existing_questions_str}\n\n\nWhat question did you think about? Do not say anything other than the question.',
  "Overview of IBM's 2024 Financial Performance"),
 ("What was IBM's net income for the year 2024?",
  'A "single-fact" question is a question with the following properties:\n- It is a natural language question.\n- It is answered with a single piece of fa

In [22]:
questions = [t[0] for t in all_questions_with_prompts_and_topics]
print(len(questions))
questions[0:5]

1500


["What was IBM's total revenue for the year 2024?",
 "What was IBM's net income for the year 2024?",
 "What was IBM's operating income for the year 2024?",
 "What was IBM's earnings per share (EPS) for the year 2024?",
 "What was IBM's total assets value at the end of 2024?"]

In [23]:
# Note that the use of set here ensures that exact duplicate questions are removed.
sorted_unique_questions = sorted(set(questions))
print(len(sorted_unique_questions))

questions = sorted_unique_questions
questions[0:5]

1495


["Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
 'Based on the trends and projections outlined in the IBM 2024 Annual Report, what strategic initiatives is IBM likely to prioritize to enhance its competitive position in the technology sector over the next five years?',
 "How could IBM's approach to sustainability and environmental responsibility affect its market perception and customer base in different regions worldwide?",
 "How could IBM's investment in research and development influence its ability to anticipate and respond to global market trends, and what impact might this have on its innovation pipeline?",
 "How could IBM's understanding of regional regulatory changes and compliance requirements impact its strategic planning and ability to navigate global market trends effectively?"]

In [24]:
evaluation_utilities.write_json([{"user_input" : q} for q in questions], f"./questions-{EXPERIMENT_SHORT_LABEL}.json")
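Using `set` removes only exact string matches. If near-identical questions (differing only in case or spacing) turn out to be a problem, a normalizing pass is one option. A minimal sketch; `dedupe_questions` is a hypothetical helper, not part of `evaluation_utilities`, and it keeps the first occurrence in its original position rather than sorting.

```python
def dedupe_questions(questions: list[str]) -> list[str]:
    """Drop questions that match an earlier one after normalizing case and whitespace."""
    seen = set()
    unique = []
    for question in questions:
        # Collapse runs of whitespace and ignore case when comparing.
        key = " ".join(question.split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(question.strip())
    return unique


sample = [
    "What was IBM's total revenue for the year 2024?",
    "what was IBM's  total revenue for the year 2024? ",
    "What was IBM's net income for the year 2024?",
]
print(dedupe_questions(sample))
```

This still treats paraphrases as distinct questions; catching those would need something like embedding similarity, which is outside the scope of this sketch.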

## Make the vector database

This next block lists all the files in the specified directory and then ingests them all into a vector database.

It uses the LlamaIndex DoclingReader, which is a simple and naive way to use Docling.  It converts everything to Markdown and then uses built-in primitives in LlamaIndex to do the chunking.

You can see a much more sophisticated use of Docling in IBM's [Granite_Multimodal_RAG.ipynb](https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Multimodal_RAG.ipynb),
where they use the Docling hierarchical chunker, skip over chunks with tables in them, and then iterate through the tables separately and convert them to
Markdown one at a time.  That example also uses a model to generate descriptions of pictures and includes those descriptions in the index too.  At some point, we'd like
to do some or all of that here to get a better foundation for building the vector index that we use both for generating the reference answers in this notebook and
for doing the actual RAG evaluations in [evaluate-using-sample-questions.ipynb](./evaluate-using-sample-questions.ipynb).
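To make the "convert to Markdown, then chunk" idea concrete, here is a simplified, dependency-free illustration of heading-based chunking. It is not the LlamaIndex or Docling implementation, and `split_markdown_by_heading` is a name made up for this sketch; real parsers also deal with code blocks, heading levels, and chunk size limits.

```python
def split_markdown_by_heading(markdown_text: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at every heading line."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk]


sample = "## Overview\n\nRevenue grew.\n\n## Segments\n\nSoftware and consulting."
for chunk in split_markdown_by_heading(sample):
    print(repr(chunk))
```

Each chunk would then be embedded and stored in the vector index; tables and pictures are where this naive approach tends to lose information.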

In [25]:
file_paths = evaluation_utilities.list_files(CONTENT_LOCATION)
file_paths

[PosixPath('docs/ibm-annual-report-2024.pdf')]

In [26]:
index = evaluation_utilities.make_simple_index(file_paths, EMBED_MODEL)
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x4c2be4510>

In [27]:
q = "Tell me about Z mainframe sales"
result = index.as_query_engine(llm=question_generator_model).query(q)
#print(f"Q: {q}\nA: {result.response.strip()}\n\nSources:")
#display([(n.text, n.metadata) for n in result.source_nodes])

[n.text for n in result.source_nodes]

["## 14 Management Discussion\n\nInternational Business Machines Corporation and Subsidiary Companies\n\nIBM Z: the premier transaction processing platform with leading security, resilience and scale, highly optimized for mission-critical, high-volume transaction workloads and enabled for enterprise AI and hybrid cloud. It includes IBM Z and LinuxONE, with a range of high-performance systems designed to address enterprise computing capacity, security and performance needs, z/OS, a securityrich, high-performance enterprise operating system, as well as Linux and other operating systems.\n\nDistributed  Infrastructure: includes  Power,  Storage  and  IBM  Cloud  Infrastructure-as-a-Service  (IaaS).  Power  consists  of  highperformance servers, designed and engineered for data intensive and AI-enabled workloads and optimized for hybrid cloud and Linux. The Storage portfolio consists of a broad range of storage hardware and software-defined offerings, including Z-attach and distributed fla

## Use RAG to generate reference answers


Here we run RAG on all of the questions and collect the answers.  We label these "reference answers" because they are generated by the model that we have designated as our reference answer generator, i.e., the model that we trust to be close enough to perfect that it can act as our "ground truth" for evaluating the other models that we intend to evaluate.  We discard the retrieved contexts instead of recording them as reference contexts because there is no particular reason to believe that they are especially good.  If we had an ultra-high-power search capability (e.g., something that retrieved a long list of results and then had a powerful model rate each result), we would want to use it here to get reference contexts and then use those reference contexts to generate the answer instead of the actual retrieved contexts.

Note that we are calling "run_reference_rag", our reference answer generator RAG.  That RAG is very slow because it calls the LLM on each search result separately to assess its quality, then takes the search results rated highest by the LLM and uses them to generate the answer (recording them as the reference contexts).  It would be impractical to do this in a deployed application, but it is useful for evaluation purposes because we expect it to generate something closer to a "ground truth" than what we would get with a simpler/faster RAG.
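The rate-then-select step described above can be sketched as follows. This illustrates the general approach only and is not the code inside `run_reference_rag`; the names `select_reference_contexts` and `keyword_overlap` are hypothetical, and the keyword rater is a stand-in for the per-result LLM call.

```python
from typing import Callable


def select_reference_contexts(
    question: str,
    search_results: list[str],
    rate_result: Callable[[str, str], float],
    number_to_keep: int = 5,
) -> list[str]:
    """Rate every search result for the question and keep the highest-rated ones."""
    # One rating call per search result is what makes this approach slow.
    rated = [(rate_result(question, text), text) for text in search_results]
    # Sort by rating only; the sort is stable, so ties keep retrieval order.
    rated.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in rated[:number_to_keep]]


def keyword_overlap(question: str, text: str) -> float:
    """Stand-in rater for the demo; the real pipeline would ask the LLM instead."""
    question_words = set(question.lower().split())
    return float(len(question_words & set(text.lower().split())))


results = ["IBM Z mainframe revenue grew", "Office locations", "Mainframe sales by region"]
top = select_reference_contexts("mainframe sales", results, keyword_overlap, number_to_keep=2)
print(top)
```

With an LLM as the rater, the cost grows with the number of questions times the number of search results per question, which is why this is reserved for building reference data.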

In [28]:
question_data = []
for q in questions:
    question_data.append({"user_input": q})

data = evaluation_utilities.run_reference_rag(question_data, reference_answer_generator_model, index, number_of_search_results=5)

data[0:5]

[{'user_input': "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
  'reference': "Based on IBM's 2024 Annual Report, the company's financial performance and strategic initiatives appear to support its ability to sustain or potentially increase dividend payouts to shareholders. Here are some key points from the report that influence this assessment:\n\n1. **Revenue Growth and Cash Flow**: IBM reported $62.8 billion in revenue, with a 3% increase at constant currency, and generated $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year. Strong revenue growth and cash flow generation provide a solid foundation for sustaining dividend payouts.\n\n2. **Investment in Growth Areas**: IBM has made significant investments in AI and hybrid cloud, which are expected to drive future growth. The company allocated over $7 billion to research 

## Store the generated data

In [29]:
evaluation_utilities.write_json(data, f"./questions_and_reference_answers-{EXPERIMENT_SHORT_LABEL}-{len(data)}.json")
data[0:5]

[{'user_input': "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
  'reference': "Based on IBM's 2024 Annual Report, the company's financial performance and strategic initiatives appear to support its ability to sustain or potentially increase dividend payouts to shareholders. Here are some key points from the report that influence this assessment:\n\n1. **Revenue Growth and Cash Flow**: IBM reported $62.8 billion in revenue, with a 3% increase at constant currency, and generated $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year. Strong revenue growth and cash flow generation provide a solid foundation for sustaining dividend payouts.\n\n2. **Investment in Growth Areas**: IBM has made significant investments in AI and hybrid cloud, which are expected to drive future growth. The company allocated over $7 billion to research 