# Reasoning use cases: Model comparison, o1 vs. gpt-4o

In [57]:
import copy
import json
import os
import textwrap
import warnings

from dotenv import load_dotenv
from IPython.display import display, Image, Markdown
from openai import AzureOpenAI
from pydantic import BaseModel, Field

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2024-12-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

# Azure OpenAI deployment names (these must match the deployments in your resource)
GPT_MODEL = 'gpt-40-globalStandard'
O1_MODEL = 'o1'

In [58]:
def get_chat_completion(model, prompt):
    """
    Send a single-turn user prompt to an Azure OpenAI chat deployment.

    :param model: Name of the Azure OpenAI deployment to call.
    :param prompt: The prompt text, sent as one user message.
    :return: The text content of the first choice in the response.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
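
Since this notebook compares two models, it can help to record latency and token counts alongside the text. The following is a minimal sketch, not part of the original notebook: `get_chat_completion_with_usage` is a hypothetical helper name, and it assumes the response carries a `usage` object with `prompt_tokens` and `completion_tokens`, as non-streaming chat completions normally do. The stand-in client exists only so the sketch runs offline; in the notebook you would pass the real `client`.

```python
import time
from types import SimpleNamespace


def get_chat_completion_with_usage(client, model, prompt):
    """Return reply text, wall-clock latency, and token counts for one call.

    `client` is any object exposing `chat.completions.create`, such as the
    AzureOpenAI client created earlier (hypothetical helper, for illustration).
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    usage = response.usage
    return {
        "text": response.choices[0].message.content,
        "seconds": elapsed,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    }


# Offline stand-in that mimics the shape of a chat completion response.
def _fake_create(model, messages):
    return SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content="ok"))],
        usage=SimpleNamespace(prompt_tokens=3, completion_tokens=1),
    )


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
stats = get_chat_completion_with_usage(fake_client, "demo-model", "hello")
print(stats["text"], stats["prompt_tokens"], stats["completion_tokens"])
```

Passing the client as an argument keeps the helper testable without network access; calling it with `client` and `O1_MODEL` or `GPT_MODEL` would give comparable numbers for both deployments.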

In [59]:
demo_prompt = """
<title>
Comprehensive Insurance Coverage Analysis for Sarah Barnes' Bakery/Café
</title>

<datasets>

### Dataset 1: Client Profile and Business Summary
*This table (visualized as a detailed infographic and pie chart of revenue distribution) contains the core business details.

| Field                     | Details                           |
|---------------------------|-----------------------------------|
| Client Name               | Sarah Barnes                      |
| Business Type             | Bakery/Café                       |
| Location                  | Tampa, Florida                    |
| Number of Employees       | 12                                |
| Annual Revenue            | $850,000                          |
| Years in Operation        | 8                                 |
| Key Concerns              | Hurricane damage, slip-and-fall liability, workers’ compensation, cyber threats from digital payment systems, cost management |
| Business Premises Details  | 2500 sq. ft. storefront with an adjacent kitchen area |

### Dataset 2: Risk Management Audit and Safety Reports
*This dataset is presented as a series of bar charts and tables showing risk scores and incident frequencies.

| Risk Category               | Risk Score (1-10) | Recent Incidents (Past 2 Years) |
|-----------------------------|-------------------|---------------------------------|
| Hurricane/Weather Damage    | 9                 | 2 (minor roof leaks, window damage) |
| Slip-and-Fall Liability     | 7                 | 4 (employee and customer incidents) |
| Workers’ Compensation       | 6                 | 3 (non-severe injuries reported) |
| Cybersecurity               | 5                 | 1 (attempted phishing attack)   |

### Dataset 3: Florida State Insurance Regulations
*This dataset (shown as a text extract with callouts and highlighted clauses) includes excerpts on mandatory coverages and exclusions for businesses in hurricane-prone areas.

**Key Excerpts:**
- Commercial Flood Insurance is not automatically included in standard policies in Florida.
- Policies must detail hurricane deductibles and may require additional endorsements for windstorm damage.
- Certain coverage limits for workers’ compensation are mandated by state law.

### Dataset 4: Premium Cost Benchmarks and Underwriting Quotes
*Presented as a table and line graph showing average premium ranges and policy limits for similar small businesses (core coverage quotes vs. add-ons).

| Coverage Type                 | Recommended Policy Limit       | Estimated Annual Premium Range |
|-------------------------------|--------------------------------|--------------------------------|
| Commercial Property           | $1.5 million - $2 million      | $4,000 - $7,000                |
| General Liability             | $1 million per occurrence      | $2,500 - $5,000                |
| Workers’ Compensation         | Statutory limits               | $3,500 - $6,000                |
| Cyber Liability               | $500,000                       | $1,000 - $2,500                |
| Commercial Flood (Add-on)     | $1 million                     | $1,500 - $3,000                |
| Business Interruption (Add-on)| 3-6 months of revenue coverage| $2,000 - $4,000                |

### Dataset 5: Inventory of Bakery Ingredients *DS*
*This dataset (displayed as a colorful bar chart) lists monthly ingredient usage which is not directly relevant to the insurance recommendation.

| Ingredient       | Monthly Usage (lbs) |
|------------------|---------------------|
| Flour            | 800                 |
| Sugar            | 300                 |
| Eggs             | 500                 |
| Milk             | 400                 |

### Dataset 6: Local Market Sales Figures *DS*
*This dataset (visualized as a line graph) shows competitor sales metrics in the local area.

| Month      | Competitor A Sales ($) | Competitor B Sales ($) |
|------------|------------------------|------------------------|
| Jan        | 750,000                | 800,000                |
| Feb        | 780,000                | 820,000                |
| Mar        | 770,000                | 810,000                |

</datasets>

<question>
You are an advanced insurance consultant. Your task is to generate a comprehensive and clause-specific insurance coverage recommendation for Sarah Barnes' bakery/café in Tampa, Florida.

The analysis must be highly interpretable, transparent in reasoning, and reference dataset elements directly. Specifically:

1. Identify necessary core coverages (e.g., Commercial Property, General Liability, Workers’ Compensation) and appropriate limits using dataset references.
2. Recommend optional add-on policies relevant to the client's hurricane-exposed environment and cyber risk profile, citing state regulations and risk scores.
3. Estimate realistic premium ranges using underwriting quotes, considering the specific business profile and past incidents.
4. Propose concrete cost-reduction strategies tied to incident history and Florida-specific compliance requirements.
5. Summarize final recommendations with explicit references to the most critical dataset insights.

</question>

<instruction>
- Start with an executive summary that outlines the logic of your approach and key recommendations.
- Trace your conclusions back to specific dataset entries, showing analytical reasoning.
- Do not generalize—emphasize precision and regulatory detail.
- Use tables and bullet points for clarity, but the focus should be on defensibility and fidelity to the data.
- Format the answer using markdown.
- Your answer should go beyond surface-level coverage listing and reflect real-world risk consulting rigor.

</instruction>


"""
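
Datasets 5 and 6 in the prompt are tagged `*DS*`, and Dataset 5 is described as not directly relevant, so they appear to act as distractors (that reading is an assumption). A rough way to check whether a response leans on them is to count its `Dataset N` mentions. The sketch below uses hypothetical helper names and a made-up sample string; it only catches the literal pattern "Dataset" followed by a number, so a phrase like "Dataset 1 & 2" counts Dataset 1 only.

```python
import re
from collections import Counter


def count_dataset_references(text):
    """Count literal mentions such as 'Dataset 2' in a model response."""
    return Counter(int(n) for n in re.findall(r"Dataset\s+(\d+)", text))


def cites_distractors(text, distractor_ids=(5, 6)):
    """Return True if the response mentions any presumed distractor dataset."""
    counts = count_dataset_references(text)
    return any(counts[i] > 0 for i in distractor_ids)


# Made-up sample text, not taken from either model's output.
sample = "Hurricane risk scores 9 (Dataset 2). Limits follow Dataset 4 and Dataset 2."
print(count_dataset_references(sample))
print(cites_distractors(sample))
```

Once the responses are generated, the same functions can be applied to `gpt_output` and `o1_output`.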

In [60]:
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)

In [61]:
print(gpt_output)

Below is the comprehensive analysis for Sarah Barnes’ Bakery/Café, drawing directly from the provided datasets and highlighting key regulatory requirements for Florida‐based businesses. The recommendations have been tailored for the bakery’s operational size and identified risks.

---

## Executive Summary

Our analysis integrates key business details (Dataset 1), risk audits (Dataset 2), and Florida insurance regulations (Dataset 3) with premium benchmarks (Dataset 4) to develop a targeted insurance coverage solution. We identified core coverages—Commercial Property, General Liability, and Workers’ Compensation—as essential, with limits based on both revenue specifics and industry norms. Given the high hurricane risk (risk score of 9) and exposure to cybersecurity threats (risk score of 5), optional endorsements (Commercial Flood and Cyber Liability) are strongly advised. We also propose specific cost-reduction strategies that leverage incident history and state mandates to optimize p

In [62]:
o1_output = get_chat_completion(O1_MODEL, demo_prompt)
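
The o1 response printed in the next cell starts with a `markdown` code fence, which makes `display(Markdown(...))` show it as literal code rather than rendered headings and tables. A minimal sketch for unwrapping such a response follows; `strip_outer_fence` is a hypothetical helper, and it only removes a fence tagged `markdown` or `md` on the first line, so responses that merely begin with an ordinary code block are left alone.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def strip_outer_fence(text):
    """Remove a markdown/md fence that wraps the whole response, if present."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() not in (FENCE + "markdown", FENCE + "md"):
        return text  # nothing to unwrap
    body = lines[1:]
    # Drop the closing fence when it exists; a truncated reply may lack one.
    if body and body[-1].strip() == FENCE:
        body = body[:-1]
    return "\n".join(body)


wrapped = FENCE + "markdown\n# Executive Summary\nSome text\n" + FENCE
print(strip_outer_fence(wrapped))
```

In the notebook this could be used as `display(Markdown(strip_outer_fence(o1_output)))`.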

In [63]:
print(o1_output)

```markdown
# Executive Summary

This report provides a clause-specific insurance coverage recommendation for Sarah Barnes' Bakery/Café in Tampa, Florida. Our approach integrates core business data, quantified risk scores, regulatory requirements specific to Florida’s hurricane-prone environment, and premium benchmarks from underwriting quotes. By referencing explicit dataset details, we have tailored coverage limits, add-on endorsements, and cost management strategies to address both immediate risks and long-term compliance. Key recommendations encompass robust core coverages with adequate limits, targeted endorsements for hurricane weather, flood, and cyber threats, and precise premium estimations based on historical incident frequencies.

# 1. Core Coverage Recommendations

Based on Sarah Barnes' business profile (Dataset 1) and relevant risk factors (Dataset 2), the following core coverages with specific policy limits are advised:

- **Commercial Property Insurance**  
  - **Purpos

In [64]:
result = get_chat_completion(
    O1_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (O1):**  
{o1_output}  

**Answer 2 (GPT-4o):**  
{gpt_output}  
"""
)
display(Markdown(result))


Below is our detailed evaluation comparing Answer 1 (O1) and Answer 2 (GPT-4o) across the eight requested dimensions.

──────────────────────────────
1. Clarity

• Answer 1 is very well organized. It begins with an Executive Summary that outlines the approach, then breaks the content into clearly labeled sections (Core Coverage Recommendations, Optional Add-On Policies, Premium Range Estimation, Cost-Reduction Strategies, and a Final Recommendations Summary). The use of markdown headers, bullet points, and tables makes the answer easy to follow.

• Answer 2 also uses a clear markdown structure with headings and table presentations. However, its sections are slightly less segmented in terms of explicit subheadings (for example, its “Detailed Insurance Coverage Recommendation” section contains subsections, but they are less visually separated than in Answer 1).

Strengths:  
– O1’s highly segmented structure, clear numbering, and inclusion of a summary table help the reader quickly identify key insights.  
– GPT-4o is clear and concise, but the structure is a bit more compact overall.

──────────────────────────────
2. Accuracy & Correctness

• Both answers accurately reference dataset elements:
 – They correctly point to Sarah Barnes’ size, location, and risk scores from Dataset 1 & 2.
 – Recommended limits match the underwriting quotes (Dataset 4) and Florida regulations (Dataset 3).
 – Both answers note a hurricane risk score of 9, slip-and-fall incidents (score 7, 4 incidents), and cybersecurity risk along with the low number of cyber incidents.
  
• Answer 1 goes a step further by providing detailed justification behind each recommended limit and frequently cites the exact datasets (e.g., “Dataset 2” for specific incidents). This detailed cross-reference makes its reasoning especially robust.

Strengths:  
– Both answers are factually correct, but O1’s constant linking back to the datasets further validates its conclusions.

──────────────────────────────
3. Completeness

• Both responses cover all parts of the prompt:  
 1. They identify the core coverages (Commercial Property, General Liability, Workers’ Compensation) with reference to appropriate limits and historical data.
 2. They recommend optional add-ons specifically addressing hurricane exposure, flood risk, cyber threats, and even business interruption.
 3. They estimate premium ranges using the underwriting quotes.
 4. They provide cost-reduction strategies tied to past incidents and compliance considerations.
 5. Their final summaries clearly tie together insights from multiple datasets.

• Answer 1 includes a more detailed “Cost-Reduction Strategies” section with actionable bullet points and extra regulatory detail; Answer 2 is slightly more concise but still addresses all required topics.

Strengths:  
– Both are complete; O1’s additional emphasis on individual bullet points in cost reduction adds extra depth.

──────────────────────────────
4. Relevance & Adherence

• Each answer adheres well to the instructions:
 – They begin with an executive summary.
 – Both use tables and bullet points.
 – They directly reference specific dataset elements and quotations from state regulations.
 – They maintain a focus on real-world risk consulting rigor and regulatory detail.
  
• Answer 1 almost “overachieves” by providing a very detailed rationale with explicit dataset references and clear cost-management proposals, demonstrating strict adherence to the prompt.

Strengths:  
– Both are relevant, but O1 demonstrates slightly higher adherence in mapping each recommendation directly back to specific data points.

──────────────────────────────
5. Analytical Depth

• Answer 1 displays strong analytical depth. Its multi-step reasoning explains why each limit is recommended based on risk scores and past incidents. The answer includes a detailed premium table and an extensive set of cost-reduction strategies that justify the recommendations.
  
• Answer 2 also shows deep reasoning by integrating multiple risk factors, but it is a bit less verbose than Answer 1 regarding the underlying logic of each strategy.

Strengths:  
– O1’s explicit line‐by‐line justifications (e.g., referencing “minor roof leaks, window damage” and “non-severe injuries”) plus its detailed premium breakdown offer more analytical insight.

──────────────────────────────
6. Multi-Dataset Synthesis

• Both answers integrate insights from:
 – Client Details (Dataset 1),
 – Risk Scores and Incident Frequencies (Dataset 2),
 – Regulatory Requirements (Dataset 3), and
 – Underwriting Premium Data (Dataset 4).
  
• Answer 1 clearly ties everything together by showing how each dataset impacts a particular coverage or recommendation. The premium table and subsequent cost-reduction strategies reference the interplay between datasets directly.

Strengths:  
– Both answers are strong here, but O1’s explicit mentions like “Reference: Dataset 2” and multiple in-text clarifications help the reader see the synthesis more transparently.

──────────────────────────────
7. Robustness to Ambiguity

• Both answers manage the inherent ambiguity in legal/regulatory texts well by pinpointing specific Florida requirements (e.g., hurricane deductibles and non-automatic inclusion of flood insurance). 

• Answer 1 also adds depth by noting how the incident history (minor damage and employee injuries) necessitates specific upgrades and trainings as part of compliance, which shows it handled potential ambiguities in the client’s risk profile better.

Strengths:  
– O1’s approach to clarifying ambiguous details (e.g., explicitly tying cost-reduction strategies to the dataset’s identified risks) makes it slightly more robust.

──────────────────────────────
8. Format & Usability

• Both responses are formatted in markdown using headers, bullet lists, and tables; they are well-suited for a legal, compliance, or due diligence team.

• Answer 1’s layout—with its well-delineated sections (executive summary, detailed tables, step-by-step strategies) and explicit dataset citations—is slightly more practically useful in a compliance review context because every recommendation is clearly anchored to specific data points.

Strengths:  
– O1’s detailed breakdown and table summarization aid quick referencing by professionals who may need to cross-check regulatory details and underwriting quotes.

──────────────────────────────
Summary Statement

Between the two answers, Answer 1 is better. It stands out for its crystal-clear organization, comprehensive cross-referencing of datasets, and detailed justifications behind each recommendation. The extra granularity—especially in the cost-reduction strategies and extensive use of bullet points and tables—offers greater transparency and usability for legal, risk management, and compliance teams. Answer 2 is also strong and accurate, but Answer 1’s enhanced structure and depth of analysis make it the superior response.
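
The evaluation prompt above labels each answer with its model name and always lists the o1 answer first, with o1 also acting as judge. One common variation, sketched here under the assumption that a judge may be sensitive to labels and ordering, is to shuffle the answers and give them neutral labels, keeping a private key to map the verdict back. `build_blinded_block` is a hypothetical helper and the sample texts are placeholders.

```python
import random


def build_blinded_block(outputs, rng=None):
    """Shuffle model outputs and label them 'Answer A', 'Answer B', ...

    Returns (block, key): `block` is the text to paste into the judge prompt,
    `key` maps each neutral label back to the model name.
    """
    rng = rng or random.Random()
    names = list(outputs)
    rng.shuffle(names)
    labels = [f"Answer {chr(ord('A') + i)}" for i in range(len(names))]
    key = dict(zip(labels, names))
    block = "\n\n".join(
        f"**{label}:**\n{outputs[name]}" for label, name in key.items()
    )
    return block, key


# Placeholder texts; in the notebook these would be o1_output and gpt_output.
outputs = {"o1": "first reply", "gpt-4o": "second reply"}
block, key = build_blinded_block(outputs, random.Random(0))
print(block)
print(key)
```

Running the judge a second time with the order reversed, and checking that the verdict is stable, is a natural extension of the same idea.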