# Reasoning use cases: Model comparison, o1 vs. GPT-4o

### Step 1
This cell imports the necessary libraries, loads environment variables, and initializes the Azure OpenAI client. Assign your GPT-4o and o1 deployment names to the variables `GPT_MODEL` and `O1_MODEL`. Make sure `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` are set in your `.env` file.

In [1]:
import copy
import json
import os
import textwrap
import warnings

from dotenv import load_dotenv
from IPython.display import display, Image, Markdown
from openai import AzureOpenAI
from pydantic import BaseModel, Field

warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2024-12-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

GPT_MODEL = 'gpt-4o'  # your GPT-4o deployment name
O1_MODEL = 'o1'  # your o1 deployment name
print("GPT-4o deployment name:", GPT_MODEL)
print("o1 deployment name:", O1_MODEL)


GPT-4o deployment name: gpt-4o
o1 deployment name: o1
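
If either environment variable is missing, `os.getenv` returns `None`, and the problem may only surface later as a less obvious client or request error. The following is a minimal sketch of an early check; the helper name `missing_env_vars` is hypothetical and not part of the notebook.

```python
import os

# Names taken from Step 1; the helper itself is an illustrative assumption.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` would point at a missing or misnamed `.env` entry before any request is sent.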


### Step 2
This cell defines the `get_chat_completion` function used to request model responses.

In [2]:
def get_chat_completion(model, prompt):
    """
    Calls the Azure OpenAI chat completions API with a single user message.

    :param model: The deployment name of the model to use.
    :param prompt: The prompt to send as the user message.
    :return: The text content of the first completion choice.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
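
Long prompts sent to hosted models can fail transiently, for example when a rate limit is hit. One possible way to make the calls more robust is a small retry wrapper with exponential backoff. This is a sketch, not part of the notebook: `call_with_retry` is a hypothetical helper, and it retries on any exception for simplicity, whereas real code would probably narrow this to specific errors such as the `openai` package's rate-limit exception.

```python
import time


def call_with_retry(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying failed calls with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The exception from the last attempt is re-raised.
    """
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Possible usage with the function defined in this step:
# gpt_output = call_with_retry(get_chat_completion, GPT_MODEL, demo_prompt)
```

The `sleep` parameter is injectable only so the wrapper can be exercised without real waiting.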

### Step 3
This cell defines the multi-line `demo_prompt`. Provide your own prompt here, or use one of the prompts from the use cases in this repo.

In [20]:
demo_prompt = """


# Cross-Asset Volatility and Sentiment Forecasting: A Multi-Modal Analysis

## Datasets
Below are multiple micro datasets provided in markdown format. They include structured tables, time-series data, and unstructured text meant to help you assess market sentiment, volatility drivers, and macroeconomic conditions. Note that some datasets (marked with *DS*) are included as potential distractions.

**Dataset 1: Volatility Indices Data** (Visual: Line Graph)

| Date       | VIX (S&P 500) | MOVE (10Y Treasuries) | Implied Volatility - Brent (%) | Implied Volatility - Gold (%) |
|------------|---------------|-----------------------|--------------------------------|-------------------------------|
| 2023-09-01 | 18.2          | 9.5                   | 22.0                           | 12.5                          |
| 2023-09-08 | 19.0          | 10.0                  | 22.5                           | 12.8                          |
| 2023-09-15 | 20.1          | 10.2                  | 23.0                           | 13.0                          |
| 2023-09-22 | 21.0          | 10.5                  | 23.8                           | 13.5                          |
| 2023-09-29 | 20.5          | 10.3                  | 23.2                           | 13.2                          |
| 2023-10-06 | 22.0          | 10.7                  | 24.0                           | 13.8                          |
| 2023-10-13 | 22.5          | 11.0                  | 24.5                           | 14.0                          |

**Dataset 2: Market Sentiment Scores** (Visual: Bar Chart)

| Date       | Central Bank Sentiment (1-10) | Financial News Tone (1-10) | Reddit/Twitter Sentiment (1-10) |
|------------|-------------------------------|----------------------------|---------------------------------|
| 2023-09-01 | 7                             | 6                          | 5                               |
| 2023-09-08 | 6                             | 5                          | 4                               |
| 2023-09-15 | 5                             | 4                          | 3                               |
| 2023-09-22 | 4                             | 3                          | 2                               |
| 2023-09-29 | 5                             | 4                          | 3                               |
| 2023-10-06 | 6                             | 5                          | 4                               |
| 2023-10-13 | 7                             | 6                          | 5                               |

**Dataset 3: Macroeconomic Indicators** (Visual: Clustered Bar Chart)

| Date       | CPI YoY (%) | PMI | NFP (in thousands) | Unemployment Rate (%) |
|------------|-------------|-----|--------------------|-----------------------|
| 2023-09-01 | 2.1         | 55  | 150                | 4.0                   |
| 2023-09-15 | 2.2         | 54  | 145                | 4.1                   |
| 2023-09-29 | 2.3         | 53  | 140                | 4.2                   |
| 2023-10-13 | 2.4         | 52  | 135                | 4.3                   |

**Dataset 4: Option Term Structure Data** (Visual: Line Graph)

| Date       | S&P 500 30d IV (%) | 10Y T-Note 30d IV (%) | Brent 30d IV (%) | Gold 30d IV (%) |
|------------|--------------------|-----------------------|------------------|-----------------|
| 2023-09-01 | 18.0               | 10.0                  | 22.0             | 12.5            |
| 2023-10-01 | 19.5               | 10.5                  | 23.5             | 13.0            |

**Dataset 5: News Headlines and Central Bank Statements** (Unstructured Data)

- "Central Bank signals gradual tightening amid global uncertainty." (Date: 2023-10-05)
- "Financial news: Equity markets rally despite mixed earnings reports." (Date: 2023-10-06)
- "Breaking: New policy measures expected to cushion market headwinds." (Date: 2023-10-07)
- "Central Bank statement: 'We maintain cautious optimism despite rising inflation.'" (Date: 2023-10-08)

**Dataset 6: ETF Flows and Institutional Positioning *DS*** (Visual: Bar Chart)

| Date       | ETF Flows (USD Millions) | Institutional Net Long (%) |
|------------|--------------------------|----------------------------|
| 2023-09-01 | 500                      | 55                         |
| 2023-09-15 | 450                      | 53                         |
| 2023-09-29 | 520                      | 56                         |
| 2023-10-13 | 480                      | 54                         |

**Dataset 7: Social Media Sentiment Aggregates** (Visual: Scatter Plot)

| Date       | Twitter Sentiment Score | Reddit Sentiment Score |
|------------|-------------------------|------------------------|
| 2023-09-01 | 5.0                     | 4.8                    |
| 2023-09-15 | 4.2                     | 4.0                    |
| 2023-09-29 | 3.5                     | 3.4                    |
| 2023-10-13 | 4.0                     | 3.9                    |

**Dataset 8: Commodity Price Indexes** (Visual: Line Graph)

| Date       | Brent Crude Spot Price (USD/barrel) | Gold Spot Price (USD/oz) |
|------------|-------------------------------------|--------------------------|
| 2023-09-01 | 85                                  | 1800                     |
| 2023-09-15 | 88                                  | 1780                     |
| 2023-09-29 | 90                                  | 1750                     |
| 2023-10-13 | 92                                  | 1765                     |

**Dataset 9: Historical Asset Returns *DS*** (Visual: Line Charts)

| Date Range                 | Asset Class  | Average Daily Return (%) | Volatility (%) |
|----------------------------|--------------|--------------------------|----------------|
| 2023-07-01 to 2023-09-30   | Equities     | 0.05                     | 1.2            |
| 2023-07-01 to 2023-09-30   | Fixed Income | 0.02                     | 0.8            |

## Question
As a senior strategist at a global macro hedge fund, your objective is to deliver a high-conviction forecast for cross-asset volatility and market sentiment over the next 30 days. Based on the provided datasets, please:

1. Classify the current market sentiment regime (e.g., risk-on, risk-off, or transitional) by integrating the sentiment scores from central bank communications, financial news tone, and social media data.
2. Forecast the 30-day forward implied and realized volatility for the following asset classes:
   - Equities (S&P 500 via VIX dynamics)
   - Fixed Income (10Y Treasury via MOVE index)
   - Commodities (Brent Crude and Gold)
3. Identify and quantify the key drivers (e.g., macroeconomic indicators, option term structures, central bank statements) influencing the forecast and highlight any potential volatility clusters or contagion risks across asset classes.
4. Recommend tactical hedging strategies (e.g., VIX calls, long gamma positions, or correlation dispersion trades) based on your forecast.

## Instruction
Please include all intermediate calculations, working steps, and references to the appropriate datasets as evidence in your analysis. Begin your final answer with a brief executive summary that highlights the key findings and recommendations of your forecast. Be sure to illustrate your thought process and justify your conclusions using both quantitative data and qualitative insights drawn from the datasets.


"""

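The prompt asks the models to show intermediate calculations, so it can help to compute a few reference figures independently and compare them with what each model reports. The sketch below parses a markdown table and computes the percentage change in the VIX between the first and last dates of Dataset 1. The helper `parse_markdown_table` is hypothetical, and the table is a two-row, two-column excerpt of Dataset 1 rather than the full dataset.

```python
def parse_markdown_table(text):
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Excerpt of Dataset 1 (first and last rows, VIX column only).
vix_table = """
| Date       | VIX (S&P 500) |
|------------|---------------|
| 2023-09-01 | 18.2          |
| 2023-10-13 | 22.5          |
"""

rows = parse_markdown_table(vix_table)
first = float(rows[0]["VIX (S&P 500)"])
last = float(rows[-1]["VIX (S&P 500)"])
vix_change_pct = (last - first) / first * 100
print(f"VIX change: {vix_change_pct:.1f}%")  # prints "VIX change: 23.6%"
```

A model response that quotes a materially different figure for the same period would be worth a closer look when judging accuracy.
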
### Step 4
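This cell optionally loads a prompt from one of the use-case template files, then uses the GPT-4o model to generate a response. Run Step 2 first: if `get_chat_completion` has not been defined in the current kernel session, the call fails with the `NameError` shown below the cell.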

In [1]:
# Read the content of the template file and assign it to demo_prompt
#template_path = './credit-risk-assessment-and-management/prompt.md'
#template_path = './customer-relationship-management/prompt.md'
#template_path = './data-analysis/prompt.md'
#template_path = './fraud-detection-and-prevention/prompt.md'
#template_path = './insurance-claims-processing/prompt.md'
#template_path = './insurance-plan/prompt.md'
#template_path = './loan-agreement/prompt.md'
#template_path = './market-sentiment-and-volatility-forecasting/prompt.md'
#template_path = './portfolio-optimization/prompt.md'
#template_path = './risk-assessment-for-underwriting-health-auto-life-property/prompt.md'
template_path = './underwriting-analysis/prompt.md'

try:
    with open(template_path, 'r', encoding='utf-8') as file:
        demo_prompt = file.read()
except FileNotFoundError:
    # Keep the demo_prompt defined in Step 3 as the fallback
    print(f"Error: the file {template_path} was not found; using the default demo prompt.")

# Generate GPT output using the prompt
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)


NameError: name 'get_chat_completion' is not defined
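
The template paths in this cell all follow the pattern `./<use-case>/prompt.md`, so the load-with-fallback logic could be wrapped in a small function instead of commenting lines in and out. This is a sketch under that assumption; `load_prompt` is a hypothetical helper, not something the repo provides.

```python
from pathlib import Path


def load_prompt(use_case, default, root="."):
    """Return the text of <root>/<use_case>/prompt.md, or `default` if it is missing."""
    path = Path(root) / use_case / "prompt.md"
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"{path} not found; using the default prompt.")
        return default


# Possible usage with the names from this notebook:
# demo_prompt = load_prompt("underwriting-analysis", demo_prompt)
```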

### Step 5
This cell prints the output from the GPT model.

In [26]:
print(gpt_output)

### Executive Summary

After evaluating John Miller's application for comprehensive auto insurance, it is concluded that he qualifies for standard premium rates with potential for usage-based monitoring. Key factors for this decision include a moderate driving record, manageable accident history, excellent credit, and regular vehicle maintenance. However, continued improvement in driving behavior, as evident from telematics data, could further mitigate risks.

### Key Findings and Metrics

1. **Driving Record History**
   - **Infractions and Points Accumulation**: 
     - 2017 (Speeding - Minor): 2 points
     - 2019 (Red Light Violation): 3 points
     - 2021 (Speeding - Minor): 2 points
     - Total: 7 points over a 6-year period
   
   Risk Implication: Despite prior infractions, the completion of a defensive driving course demonstrates a corrective action toward safer driving.

2. **Accident & Claim History**
   - **Accidents**: 
     - 2018 (Rear-end collision): $1,200 (Minor)
   

### Step 6
This cell uses the o1 model to generate a response using the same prompt.

In [27]:
o1_output = get_chat_completion(O1_MODEL, demo_prompt)
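
Response time may differ noticeably between the two models, and it is one more dimension on which they can be compared. A minimal timing sketch follows; `timed` is a hypothetical helper that wraps any callable and is not part of the notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Possible usage with the names from this notebook:
# gpt_output, gpt_seconds = timed(get_chat_completion, GPT_MODEL, demo_prompt)
# o1_output, o1_seconds = timed(get_chat_completion, O1_MODEL, demo_prompt)
# print(f"GPT-4o: {gpt_seconds:.1f}s, o1: {o1_seconds:.1f}s")
```

A single run is only a rough indication, since latency also depends on service load and response length.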

### Step 7
This cell prints the output from the o1 model.

In [28]:
print(o1_output)

# Executive Summary

John Miller's auto insurance application has been evaluated using comprehensive data sets, assessing his driving history, accident claims, telematics insights, creditworthiness, and vehicle maintenance records. Upon reviewing these metrics, it is evident that while John maintains a decent credit score and satisfactory vehicle maintenance, his driving history and telematics data suggest a moderate risk profile.

## Key Findings

- **Driving History:** John has a history of infractions, but they are dispersed over time and appear to have decreased recently after completing a defensive driving course.
- **Accident and Claim Records:** He has had one moderate accident in 2020 and two minor collisions, which could indicate some risk but not excessively high.
- **Telematics Driving Behavior:** Monthly telematics data generally shows stability, with occasional spikes in harsh braking, raising a slight concern.
- **Credit History:** A strong credit score of 720 with an exc

### Step 8
This cell asks the o1 model to evaluate and compare both model outputs using a dedicated evaluation prompt.

In [29]:
result = get_chat_completion(
    O1_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (O1):**  
{o1_output}  

**Answer 2 (GPT-4o):**  
{gpt_output}  
"""
)
display(Markdown(result))


### Evaluation of Model Outputs

#### **Answer 1 (O1):**

**Clarity:**  
- **Strengths:** Well-structured and organized, clearly separating each section (Executive Summary, Key Findings, Detailed Analysis, Decision & Recommendations).
- **Weaknesses:** Some parts, such as the list of infractions and accidents, could benefit from further synthesis rather than just listing facts.

**Accuracy & Correctness:**  
- **Strengths:** Correctly interprets data points (e.g., credit score, driving infractions, accident frequency).
- **Weaknesses:** Describes John's telematics data as having "occasional spikes in harsh braking" without delving into specific consequences or comparing it to baseline figures.

**Completeness:**  
- **Strengths:** Addresses all parts of the prompt, including driving, credit, and vehicle maintenance, while making relevant recommendations.
- **Weaknesses:** Briefly touches on dataset interdependencies; could more thoroughly integrate dataset insights, especially the telematics data's potential bearing on future behavior.

**Relevance & Adherence:**  
- **Strengths:** Sticks to the task with a compliance focus, ensuring all required data points are examined.
- **Weaknesses:** Unclear references to certain datasets (like "monthly telematics data generally shows stability") may imply room for tighter cross-referencing.

**Analytical Depth:**  
- **Strengths:** Provides actionable recommendations in the form of UBI, highlighting strengths such as credit responsibility.
- **Weaknesses:** Could enhance depth by linking the defensive driving course completion more directly to potential for reduced risk.

**Multi-Dataset Synthesis:**  
- **Strengths:** Uses multiple datasets, particularly shining in parts like credit and vehicle maintenance.
- **Weaknesses:** Lacks in pulling insights from the telematics data, needed for risk mitigation assessments.

**Robustness to Ambiguity:**  
- **Strengths:** Interprets qualitative insights (e.g., call transcript discussion) adequately.
- **Weaknesses:** Slightly superficial treatment of behavioral change over time, missing deeper analysis of interpretive ambiguity.

**Format & Usability:**  
- **Strengths:** The decision and recommendations are clearly formatted and practical for underwriting teams.
- **Weaknesses:** Provides limited personalization or context details that might assist a compliance team in validating decisions.

#### **Answer 2 (GPT-4o):**

**Clarity:**  
- **Strengths:** Concise executive summary with clear justification for decisions. Logical flow with distinct sections.
- **Weaknesses:** Could benefit from clearer delineation of findings within sections to aid readability.

**Accuracy & Correctness:**  
- **Strengths:** Precisely interprets data, such as accident frequency and credit score metrics.
- **Weaknesses:** Overly broad in some areas (e.g., just mentions "improved behaviors"), slightly under-analyzed dataset interactions.

**Completeness:**  
- **Strengths:** Thoroughly addresses all required datasets, producing comprehensive insights.
- **Weaknesses:** Less direct inclusion of nuanced elements from datasets, like specific month-to-month telematics fluctuations.

**Relevance & Adherence:**  
- **Strengths:** Maintains strict adherence to prompt requirements, focusing on relevant compliance and underwriting needs.
- **Weaknesses:** Somewhat general recommendations may lack specificity for immediate action.

**Analytical Depth:**  
- **Strengths:** Insightful recommendations, recognizing financial reliability and proposing UBI as a proactive measure.
- **Weaknesses:** Further depth regarding relationships between datasets (how improved credit supports responsible driving) could solidify approach.

**Multi-Dataset Synthesis:**  
- **Strengths:** Integrates data insights (e.g., driving and credit) effectively.
- **Weaknesses:** Marginally misses integrating telematics deeply enough to inform ongoing risk analysis.

**Robustness to Ambiguity:**  
- **Strengths:** Adept at handling and integrating qualitative feedback (e.g., call transcript) into an actionable plan.
- **Weaknesses:** Leaves possible improvements in examining the detailed aspects of behavioral evidence within ambiguous telematics data.

**Format & Usability:**  
- **Strengths:** Provides a well-structured and practical output suited for immediate use in a compliance context.
- **Weaknesses:** Minimal issues, though some sections could become more outlined for faster operational uptake.

### **Concise Summary**

Between the two responses, **Answer 2 (GPT-4o)** stands out as the better response. It provides a more coherent analysis with structured insights and actionable recommendations suitable for an underwriting team. While both responses have strengths in clear explanations and using data to inform decisions, Answer 2 excels in its synthesis of datasets and initial clarity of presentation, better sidelining non-impactful data and focusing on relevant variables to drive compliance-focused underwriting outcomes.
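
The evaluation prompt in Step 8 labels each answer with the name of the model that produced it and always presents them in the same order. One possible variation, sketched here as an assumption rather than as the notebook's method, is to shuffle the answers and give them neutral labels so that the judging model cannot be influenced by model names or position. The helper `blind_answers` is hypothetical.

```python
import random


def blind_answers(answers, rng=None):
    """Shuffle model answers and relabel them 'Answer A', 'Answer B', ...

    `answers` maps model name -> response text.
    Returns (prompt_section, key), where `key` maps each neutral label
    back to the model name so the verdict can be decoded afterwards.
    """
    if rng is None:
        rng = random.Random()
    items = list(answers.items())
    rng.shuffle(items)
    key, parts = {}, []
    for i, (model_name, text) in enumerate(items):
        label = f"Answer {chr(ord('A') + i)}"
        key[label] = model_name
        parts.append(f"**{label}:**\n{text}")
    return "\n\n".join(parts), key


# Possible usage with the names from this notebook:
# section, key = blind_answers({"o1": o1_output, "gpt-4o": gpt_output})
# The `section` text would replace the two labeled answers in the Step 8 prompt,
# and `key` would be consulted only after the judge has returned its verdict.
```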