# Reasoning use cases: Model comparison, o1 vs. gpt-4o

In [27]:
# Standard library
import copy
import json
import os
import textwrap
import warnings

# Third-party
from dotenv import load_dotenv
from IPython.display import display, Image, Markdown
from openai import AzureOpenAI
from pydantic import BaseModel, Field

warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2024-12-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

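# Azure OpenAI deployment names (not model names); they must match the
# deployments configured in your own Azure resource.
GPT_MODEL = 'gpt-40-globalStandard'
O1_MODEL = 'o1'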

In [28]:
def get_chat_completion(model, prompt):
    """
    Calls the Azure OpenAI chat completions API with a single user message.

    :param model: The Azure deployment name to use for the completion.
    :param prompt: The prompt to send to the model.
    :return: The text content of the first choice in the response.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
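
When comparing a reasoning model with a non-reasoning one, latency and token usage can be as informative as the text itself. The variant below is a minimal sketch, not part of the original notebook: `get_chat_completion_with_stats` is a hypothetical helper name, and it assumes the response carries the usual `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`. The client is passed in as a parameter so the function can be exercised with a stub instead of a live endpoint.

```python
import time


def get_chat_completion_with_stats(client, model, prompt):
    """Hypothetical variant of get_chat_completion that also returns
    wall-clock latency and token usage for the call."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start

    # `usage` may be absent on some responses, so read it defensively.
    usage = getattr(response, "usage", None)
    stats = {
        "latency_s": elapsed,
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "total_tokens": getattr(usage, "total_tokens", None),
    }
    return response.choices[0].message.content, stats
```

With the notebook's own objects, a call would look like `text, stats = get_chat_completion_with_stats(client, O1_MODEL, demo_prompt)`.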

In [42]:
demo_prompt = """<title>  
Commercial Underwriting Analysis for TechEdge Manufacturing Co.: Balancing Risk and Compliance under State Regulations  
</title>  

<datasets>  
**Dataset 1: Business Profile**  
This dataset provides details about TechEdge Manufacturing Co., a small business seeking commercial insurance coverage. It includes:  
- Business Name: TechEdge Manufacturing Co.  
- Industry: Electronics Manufacturing  
- Years in Operation: 12  
- Annual Revenue ($): 8,500,000  
- Employee Count: 120  
- Coverage Requested: Commercial Property & Liability  
- Location: Texas  
- Risk Management Measures: Fire alarms, sprinkler system installed  

---  

**Dataset 2: Claims History**  
This table outlines historical insurance claims filed by TechEdge Manufacturing Co. over the past 5 years, with an accompanying line graph showing claim amounts over time.  

| Date       | Claim Type           | Claim Amount ($) | Cause of Loss           | Resolution Status |  
|------------|----------------------|------------------|-------------------------|-------------------|  
| 2018-05-20 | Fire Damage          | 125,000          | Electrical Fault        | Settled           |  
| 2019-08-15 | Equipment Breakdown  | 75,000           | Machinery Failure       | Settled           |  
| 2020-11-10 | Property Damage      | 200,000          | Structural Issues       | Open              |  
| 2021-03-05 | Theft                | 50,000           | Burglary                | Settled           |  
| 2022-07-22 | Fire Damage          | 150,000          | Overheating Equipment   | Open              |  

---  

**Dataset 3: Risk Assessment Survey**  
This dataset provides on-site risk factor scores (with a radar chart visualization):  

| Risk Factor                   | Score (out of 10) |  
|-------------------------------|-------------------|  
| Fire Risk                     | 8                 |  
| Flood Risk                    | 4                 |  
| Equipment Malfunction Risk    | 7                 |  
| Theft Risk                    | 6                 |  
| Employee Safety               | 5                 |  

---  

**Dataset 4: Commercial Property Inspection Report**  
An inspection report with key findings on property safety:  

| Inspection Area        | Finding                                  | Rating (1-10) |  
|------------------------|------------------------------------------|---------------|  
| Structural Integrity   | Minor cracks; overall acceptable         | 7             |  
| Fire Safety Systems    | Sprinkler system partially outdated      | 5             |  
| Electrical Systems     | Inconsistent maintenance practices       | 6             |  
| Building Maintenance   | Signs of wear; upgrades recommended      | 5             |  

---  

**Dataset 5: State Regulation Compliance Checklist**  
Evaluates the property’s compliance with state safety and insurance regulations:  

| Regulation Category    | Requirement                             | Compliance Status         |  
|------------------------|-----------------------------------------|---------------------------|  
| Fire Code              | Fully compliant with sprinkler requirements | Non-Compliant             |  
| Electrical Safety      | Updated wiring and periodic inspections | Compliant                 |  
| Structural Safety      | Seismic retrofitting as per code         | Conditionally Compliant   |  
| Health & Safety        | Regular safety drills mandated           | Non-Compliant             |  

---  

**Dataset 6: Underwriting Guidelines Document** *DS*  
A document outlining standard underwriting procedures (in bullet points):  
- Base premium is determined by annual revenue and inherent risk rating.  
- Adjustments are applied for historical claims frequency and severity.  
- Additional loadings are imposed for non-compliance with state regulations.  
- Discounts are applied for proactive risk management measures.  

---  

**Dataset 7: Financial Performance and Revenue Trends** *DS*  
Monthly revenue data over the past 12 months, depicted as a line graph.  

| Month | Revenue ($) |  
|-------|-------------|  
| Jan   | 700,000     |  
| Feb   | 680,000     |  
| Mar   | 720,000     |  
| Apr   | 710,000     |  
| May   | 690,000     |  
| Jun   | 705,000     |  
| Jul   | 715,000     |  
| Aug   | 700,000     |  
| Sep   | 690,000     |  
| Oct   | 710,000     |  
| Nov   | 720,000     |  
| Dec   | 730,000     |  

---  

**Dataset 8: Email Transcript** *DS*  
An email between the business owner and an insurance broker discussing premium quotes and potential risk improvements.  

'Subject: Insurance Premium Inquiry  
Hi,  
I am seeking a detailed quote for our property insurance. While we have installed fire alarms and a sprinkler system, we know some systems still need updating. Please advise on how these factors might affect our premium.  
Regards,  
Owner – TechEdge Manufacturing Co.'  

---  

**Dataset 9: Employee Satisfaction Survey Results** *DS*  
Employee satisfaction data, visualized as a bar chart.  

| Department      | Satisfaction Score (out of 10) |  
|-----------------|-------------------------------|  
| Operations      | 7                             |  
| Maintenance     | 6                             |  
| Administration  | 8                             |  

---  

**Dataset 10: Social Media Marketing Performance** *DS*  
Metrics on the company’s online engagement, depicted via pie charts and trend lines.  

| Metric                      | Value    |  
|-----------------------------|----------|  
| Monthly Engagement Rate (%) | 12       |  
| Social Media Reach          | 50,000   |  
| Ad Spend ($)                | 15,000   |  

</datasets>  

<question>  
As an insurance consultant, perform a comprehensive underwriting analysis for TechEdge Manufacturing Co. Specifically:  

1. **Precision Risk Computation** – Utilize Dataset 2 to calculate the risk adjustment factor, with explicit, step-by-step derivations of claim frequency, severity weighting, and open-claim risk projections.  
2. **Multi-Dataset Risk Score Synthesis** – Incorporate Dataset 3 (risk assessment) and Dataset 4 (inspection findings) into a structured risk weighting matrix, ensuring transparency in each factor’s influence.  
3. **State Compliance Penalties** – Translate Dataset 5 non-compliance factors into concrete underwriting adjustments, with justifications based on industry best practices.  
4. **Holistic Data Utilization** – Integrate insights from the remaining datasets (Datasets 1, 6, 7, 8, 9, and 10) for a contextual underwriting perspective that accounts for financial health, business communications, and operational stability.  
5. **Premium Adjustment and Risk Mitigation** – Provide a final premium loading percentage based on a structured underwriting formula and suggest actionable mitigation strategies.  

Your response should include:  
- A **detailed executive summary** with quantitative insights  
- **Step-by-step numerical breakdowns** with markdown headers and structured tables  
- **Conceptual visualizations** (e.g., risk heat maps, premium adjustment trends)  
- A final underwriting decision that **rationalizes every component of the premium adjustment**  
</question>  


"""
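
Before reading the model outputs, it helps to have a few reference figures computed directly from the prompt data, so that quoted numbers can be checked. The sketch below copies the values from Dataset 2 and Dataset 7 by hand. The variable names are illustrative, and the five-year observation window is taken from the dataset description ("over the past 5 years").

```python
# Dataset 2: (date, claim amount in $, resolution status), copied from the prompt.
claims = [
    ("2018-05-20", 125_000, "Settled"),
    ("2019-08-15", 75_000, "Settled"),
    ("2020-11-10", 200_000, "Open"),
    ("2021-03-05", 50_000, "Settled"),
    ("2022-07-22", 150_000, "Open"),
]

# Dataset 7: monthly revenue in $, January through December.
monthly_revenue = [
    700_000, 680_000, 720_000, 710_000, 690_000, 705_000,
    715_000, 700_000, 690_000, 710_000, 720_000, 730_000,
]

YEARS_OBSERVED = 5  # assumption: the window stated in the dataset description

total_claims = sum(amount for _, amount, _ in claims)
open_total = sum(amount for _, amount, status in claims if status == "Open")
claim_frequency = len(claims) / YEARS_OBSERVED      # claims per year
average_claim = total_claims / len(claims)          # $ per claim
revenue_12_months = sum(monthly_revenue)
average_monthly_revenue = revenue_12_months / len(monthly_revenue)

print(f"Total claimed:           ${total_claims:,}")
print(f"Open claims total:       ${open_total:,}")
print(f"Claim frequency:         {claim_frequency:.2f} per year")
print(f"Average claim:           ${average_claim:,.0f}")
print(f"Revenue, 12 months:      ${revenue_12_months:,}")
print(f"Average monthly revenue: ${average_monthly_revenue:,.0f}")
```

These inputs give five claims totalling $600,000, of which $350,000 is still open, a frequency of one claim per year, and an average claim of $120,000. The twelve monthly figures sum to $8,470,000 (about $705,833 per month), slightly below the $8,500,000 annual revenue stated in Dataset 1.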

In [43]:
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)

In [44]:
print(gpt_output)

Below is a comprehensive underwriting analysis for TechEdge Manufacturing Co. that integrates all available data sources. The analysis is organized into five parts that address the specific questions, along with an executive summary, detailed numerical breakdowns in tables, conceptual visualizations, and a final premium adjustment decision with recommended risk‐mitigation strategies.

────────────────────────────
EXECUTIVE SUMMARY
────────────────────────────
TechEdge Manufacturing Co. operates in electronics manufacturing with steady revenue (~$8.5M) and 12 years of business history. However, its historical claims record (5 claims over 5 years, including 2 open claims) and several non‐compliance findings (fire code and health & safety) raise its overall risk profile. Our step‐by‐step computations suggest:  

• A CLAIM-BASED RISK ADJUSTMENT FACTOR of approximately 1.75  
• A STATE COMPLIANCE PENALTY that adds roughly a 25% load to the premium  
• An overall premium loading multiplier –

In [45]:
o1_output = get_chat_completion(O1_MODEL, demo_prompt)

In [46]:
print(o1_output)

Below is a comprehensive step‐by‐step underwriting analysis for TechEdge Manufacturing Co. that integrates all available data. This analysis is organized into five sections addressing the specific areas of inquiry.

──────────────────────────────
1. Executive Summary
──────────────────────────────
TechEdge Manufacturing Co. is an established electronics manufacturer with a 12‐year track record, robust monthly revenue (~$700K on average) and a moderate risk profile. However, the historical claims profile (5 claims in 5 years, with two “open” claims totaling $350K in potential liabilities), suboptimal fire safety system ratings, and non‐compliance on key state regulatory items (fire code and health & safety) drive additional premium loadings. Our multi‐dataset synthesis yields a final premium loading of approximately +22% above the base premium. Actionable risk mitigation strategies include upgrading fire safety systems and addressing the outlined non-compliance issues to reduce loadings
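
The evaluation cell that follows labels each answer with its model name and uses o1 to judge its own output, which may bias the comparison. One common mitigation is to hide the model names and shuffle the answer order before building the judging prompt. The sketch below shows one way to do that. It is an assumption about how the notebook could be extended, not something the notebook does, and `build_blinded_answers` is a hypothetical helper.

```python
import random


def build_blinded_answers(outputs, seed=None):
    """Shuffle model outputs and relabel them 'Answer 1', 'Answer 2', ...

    `outputs` maps a model name to its response text. Returns a tuple
    (prompt_block, key): `prompt_block` contains only neutral labels, and
    `key` maps each neutral label back to the model name for later unblinding.
    """
    rng = random.Random(seed)
    items = list(outputs.items())
    rng.shuffle(items)

    key = {}
    sections = []
    for index, (model_name, text) in enumerate(items, start=1):
        label = f"Answer {index}"
        key[label] = model_name
        sections.append(f"**{label}:**\n{text}")
    return "\n\n".join(sections), key


# Placeholder strings stand in for o1_output and gpt_output.
block, key = build_blinded_answers(
    {"o1": "first model text", "gpt-4o": "second model text"}, seed=0
)
print(block)
```

The returned `block` can replace the two labelled answer sections in the judging prompt, and `key` is consulted only after the verdict is in. Running the judge twice with different seeds gives a rough check on whether answer order affects the outcome.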

In [48]:
result = get_chat_completion(
    O1_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?

For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (O1):**  
{o1_output}  

**Answer 2 (GPT-4o):**  
{gpt_output}  
"""
)
display(Markdown(result))


Below is our detailed evaluation of the two responses according to the five dimensions.

────────────────────────────
1. Clarity
────────────────────────────
• Answer 1 is extremely well structured. It clearly breaks the analysis into numbered sections (Executive Summary, Step‐by‐Step Numerical Breakdown, Integrated Risk Matrix, Compliance penalties, and Final Decision) and even includes visual “conceptual visualization” descriptions and tables. For example, it uses clearly labeled markdown headers and “Table 1. Risk Weighting Matrix” in its risk synthesis.  
• Answer 2 is also organized into sections with similar headings (Executive Summary, Precision Risk Computation, Risk Matrix, Compliance, etc.), and it includes tables and conceptual visualizations. However, its presentation sometimes feels a bit denser—for instance, its part labeled “Overall Claims Risk Adjustment Factor Calculation” is mathematically detailed but slightly less “digestible” due to a greater reliance on assumed benchmarks.  

Overall, both responses are clear, but Answer 1 edges ahead thanks to its user‐friendly stepwise format and descriptive visualization cues.

────────────────────────────
2. Accuracy & Correctness
────────────────────────────
• In Answer 1 the analyst computes claim frequency (1 claim/year), averages claim amounts separately for settled and open claims, applies an “open claim risk factor” of 1.5, and then sums percentages from different components before subtracting a mitigation discount. The arithmetic is sound in the context of its assumptions and all data components are addressed.  
• Answer 2 also derives claim frequency (1 claim/year) and computes a Frequency Factor (1.33) and Severity Factor (1.20) then multiplies by an open‐claim penalty (1.10) to get an overall risk factor of 1.75. For compliance, it assumes penalties of +10% per non-compliant area (fire & health) and a +5% for conditional compliance, multiplying these to get a 1.25 factor and finally “compounding” to 2.19. While the individual steps are clear, Answer 2’s use of external benchmark assumptions (such as an industry benchmark of 0.75 claim/year or an assumed “benchmark claim” of $100K) could be questioned if one expects the analysis to rely only on provided data.  
 
Thus, both answers work correctly within their frameworks. Answer 1’s additive approach (leading to a +22% loading) may be easier to follow than Answer 2’s multiplicative strategy (a premium multiplier of about 2.19), which is more aggressive and rests on added assumptions.

────────────────────────────
3. Completeness
────────────────────────────
• Answer 1 methodically addresses all five areas in the prompt—it computes a precise risk factor from Dataset 2, integrates risk scores from Datasets 3 and 4 in a clear table, translates state non-compliance from Dataset 5 into concrete loadings, and then pulls in insights from Datasets 1, 6, 7, 8, 9, and 10 for a truly holistic view. Its final section on mitigation strategies is well detailed.  
• Answer 2 likewise covers every required point and offers several tables and charts. It computes step‐by‐step risk factors, builds an integrated matrix, and proposes detailed risk‐mitigation actions. However, the final premium loading (approximately a 120% increase) seems to diverge from the overall integrated narrative and might be seen as overcompensating relative to the other datasets’ signals.  

Both answers are complete; however, Answer 1’s conclusion (with a final +22% adjustment) appears more in line with the full‐spectrum data analysis provided.

────────────────────────────
4. Relevance & Adherence
────────────────────────────
• Answer 1 adheres closely to the task instructions. It explicitly identifies each dataset, uses markdown headers for step‐by‐step derivations, and even “imagines” visualizations like heat maps and trend charts. Its discussion of each dataset (including business profile, claims history, inspection, state compliance, and even employee/marketing data) is well integrated into the final decision.  
• Answer 2 also stays on target and fully responds to the prompt. It explicitly references each dataset and uses tables and visual aids. Its approach is logically consistent, though its heavy reliance on some assumed benchmarks means that a reader must accept a few extra premises.  

Both responses are very relevant; Answer 1, however, makes fewer extraneous assumptions and more directly ties each part of the prompt to its final decision.

────────────────────────────
5. Analytical Depth
────────────────────────────
• Answer 1 demonstrates strong analytical depth. It not only computes numerical risk factors but also explains the rationale behind each step (e.g., why an “open claim risk factor” is applied, the reasoning behind compliance penalties, and how mitigation discounts work). The inclusion of a risk heat map and premium adjustment trend chart shows extra effort in providing actionable insights.  
• Answer 2 also presents a deep analysis with a detailed risk matrix, clear tables, and thorough breakdowns of the risk components. Its multi-factor product approach is rigorous. Nevertheless, its use of assumed industry benchmarks (for frequency and average claim) slightly undermines the transparency compared to Answer 1’s more directly derived numbers.

Both answers show meaningful insight, but Answer 1’s reasoning is more self-contained, making it slightly more accessible and easier to audit for someone cross-referencing the data.

────────────────────────────
Concise Summary
────────────────────────────
Both answers address every part of the prompt and present detailed, multi‐dataset analyses. However, Answer 1 stands out as the better response because it is more clearly organized, uses fewer external assumptions, and offers a transparent, step‐by‐step breakdown that neatly ties all datasets into a final premium adjustment of +22%. Its clear tables, conceptual visualizations, and detailed mitigation recommendations make it more actionable and easier to follow.

Overall, Answer 1 is preferred because its clarity, self-contained computations, and consistent integration of all data elements result in a more robust and transparent underwriting analysis.
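
The factors that the evaluation attributes to Answer 2 can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the evaluation text. Because the GPT-4o output is truncated earlier in the notebook, the way these factors combine is an assumption based on the judge's description, and the variable names are illustrative.

```python
import math

# Factors as quoted in the evaluation for Answer 2 (GPT-4o).
frequency_factor = 1.33    # consistent with 1 claim/year vs. a 0.75 benchmark
severity_factor = 1.20     # consistent with a $120K average vs. a $100K benchmark
open_claim_penalty = 1.10

claims_factor = frequency_factor * severity_factor * open_claim_penalty

# Compliance loadings quoted as +10%, +10%, and +5%.
compliance_loads = [0.10, 0.10, 0.05]
compliance_additive = 1 + sum(compliance_loads)
compliance_multiplicative = math.prod(1 + load for load in compliance_loads)

# Overall multiplier, using the 1.25 compliance factor quoted in the evaluation.
overall_multiplier = claims_factor * compliance_additive
increase_pct = (overall_multiplier - 1) * 100

print(f"Claims factor:               {claims_factor:.4f}")
print(f"Compliance (additive):       {compliance_additive:.4f}")
print(f"Compliance (multiplicative): {compliance_multiplicative:.4f}")
print(f"Overall multiplier:          {overall_multiplier:.4f}")
print(f"Premium increase:            {increase_pct:.1f}%")
```

Under these assumptions the three claim factors multiply to about 1.76 (quoted as roughly 1.75), and the overall multiplier comes to about 2.19, or an increase of roughly 119%, which matches the "approximately a 120% increase" in the evaluation. The quoted 1.25 compliance factor matches adding the three loadings, whereas multiplying them, as the evaluation describes, gives about 1.27. Small discrepancies like this are worth checking by hand before accepting a judge model's statement that the arithmetic is sound.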