# Reasoning use cases: Model comparison, o1 vs GPT-4o

### Step 1
This cell imports the necessary libraries, loads environment variables, and initializes the Azure OpenAI client. Assign your GPT-4o and o1 deployment names to the variables `GPT_MODEL` and `O1_MODEL`. Make sure to set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` in your `.env` file.

In [1]:
from pydantic import BaseModel, Field
from openai import AzureOpenAI
import os
from dotenv import load_dotenv
import json
import copy
import textwrap
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, Image, Markdown
# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2024-12-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

# Deployment names are specific to your Azure OpenAI resource; replace these with your own.
GPT_MODEL = 'gpt-40-globalStandard'
O1_MODEL = 'o1'
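
A missing endpoint or key usually surfaces only when the first request fails, so it can help to check the environment right after `load_dotenv`. The following is a minimal sketch, not part of the original notebook; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced here for illustration.

```python
import os

# The two variables this notebook reads from the .env file.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    # Report rather than raise, so the check can be run in any environment.
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.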

### Step 2
This cell defines the `get_chat_completion` function used to request model responses.

In [2]:
def get_chat_completion(model, prompt):
    """
    Calls the Azure OpenAI chat completions API with a single user message.

    :param model: The model to use for the completion.
    :param prompt: The prompt to send to the model.
    :return: The completion response from the model.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
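
When comparing a reasoning model with GPT-4o, response time is often worth recording next to the text. The sketch below wraps any callable and returns its result together with the elapsed wall-clock time. `timed_call` is a hypothetical helper, and `fake_completion` is only a stand-in so the example runs without an API call; in the notebook you would pass `get_chat_completion` instead.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_completion(model, prompt):
    """Stand-in for get_chat_completion so the sketch runs offline."""
    return f"[{model}] {prompt.upper()}"


text, seconds = timed_call(fake_completion, "demo-model", "hello")
print(f"{seconds:.4f}s -> {text}")
```

With the real function, the call would look like `gpt_output, gpt_seconds = timed_call(get_chat_completion, GPT_MODEL, demo_prompt)`.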

### Step 3
This cell holds the multi-line `demo_prompt`. Provide your own prompt here, or use one of the prompts from the use cases in this repo.

In [3]:
demo_prompt = """


# Cross-Asset Volatility and Sentiment Forecasting: A Multi-Modal Analysis

## Datasets
Below are multiple micro datasets provided in markdown format. They include structured tables, time-series data, and unstructured text meant to help you assess market sentiment, volatility drivers, and macroeconomic conditions. Note that some datasets (marked with *DS*) are included as potential distractions.

**Dataset 1: Volatility Indices Data** (Visual: Line Graph)

| Date       | VIX (S&P 500) | MOVE (10Y Treasuries) | Implied Volatility - Brent (%) | Implied Volatility - Gold (%) |
|------------|---------------|-----------------------|--------------------------------|-------------------------------|
| 2023-09-01 | 18.2          | 9.5                   | 22.0                           | 12.5                          |
| 2023-09-08 | 19.0          | 10.0                  | 22.5                           | 12.8                          |
| 2023-09-15 | 20.1          | 10.2                  | 23.0                           | 13.0                          |
| 2023-09-22 | 21.0          | 10.5                  | 23.8                           | 13.5                          |
| 2023-09-29 | 20.5          | 10.3                  | 23.2                           | 13.2                          |
| 2023-10-06 | 22.0          | 10.7                  | 24.0                           | 13.8                          |
| 2023-10-13 | 22.5          | 11.0                  | 24.5                           | 14.0                          |

**Dataset 2: Market Sentiment Scores** (Visual: Bar Chart)

| Date       | Central Bank Sentiment (1-10) | Financial News Tone (1-10) | Reddit/Twitter Sentiment (1-10) |
|------------|-------------------------------|----------------------------|---------------------------------|
| 2023-09-01 | 7                             | 6                          | 5                               |
| 2023-09-08 | 6                             | 5                          | 4                               |
| 2023-09-15 | 5                             | 4                          | 3                               |
| 2023-09-22 | 4                             | 3                          | 2                               |
| 2023-09-29 | 5                             | 4                          | 3                               |
| 2023-10-06 | 6                             | 5                          | 4                               |
| 2023-10-13 | 7                             | 6                          | 5                               |

**Dataset 3: Macroeconomic Indicators** (Visual: Clustered Bar Chart)

| Date       | CPI YoY (%) | PMI | NFP (in thousands) | Unemployment Rate (%) |
|------------|-------------|-----|--------------------|-----------------------|
| 2023-09-01 | 2.1         | 55  | 150                | 4.0                   |
| 2023-09-15 | 2.2         | 54  | 145                | 4.1                   |
| 2023-09-29 | 2.3         | 53  | 140                | 4.2                   |
| 2023-10-13 | 2.4         | 52  | 135                | 4.3                   |

**Dataset 4: Option Term Structure Data** (Visual: Line Graph)

| Date       | S&P 500 30d IV (%) | 10Y T-Note 30d IV (%) | Brent 30d IV (%) | Gold 30d IV (%) |
|------------|--------------------|-----------------------|------------------|-----------------|
| 2023-09-01 | 18.0               | 10.0                  | 22.0             | 12.5            |
| 2023-10-01 | 19.5               | 10.5                  | 23.5             | 13.0            |

**Dataset 5: News Headlines and Central Bank Statements** (Unstructured Data)

- "Central Bank signals gradual tightening amid global uncertainty." (Date: 2023-10-05)
- "Financial news: Equity markets rally despite mixed earnings reports." (Date: 2023-10-06)
- "Breaking: New policy measures expected to cushion market headwinds." (Date: 2023-10-07)
- "Central Bank statement: 'We maintain cautious optimism despite rising inflation.'" (Date: 2023-10-08)

**Dataset 6: ETF Flows and Institutional Positioning *DS*** (Visual: Bar Chart)

| Date       | ETF Flows (USD Millions) | Institutional Net Long (%) |
|------------|--------------------------|----------------------------|
| 2023-09-01 | 500                      | 55                         |
| 2023-09-15 | 450                      | 53                         |
| 2023-09-29 | 520                      | 56                         |
| 2023-10-13 | 480                      | 54                         |

**Dataset 7: Social Media Sentiment Aggregates** (Visual: Scatter Plot)

| Date       | Twitter Sentiment Score | Reddit Sentiment Score |
|------------|-------------------------|------------------------|
| 2023-09-01 | 5.0                     | 4.8                    |
| 2023-09-15 | 4.2                     | 4.0                    |
| 2023-09-29 | 3.5                     | 3.4                    |
| 2023-10-13 | 4.0                     | 3.9                    |

**Dataset 8: Commodity Price Indexes** (Visual: Line Graph)

| Date       | Brent Crude Spot Price (USD/barrel) | Gold Spot Price (USD/oz) |
|------------|-------------------------------------|--------------------------|
| 2023-09-01 | 85                                  | 1800                     |
| 2023-09-15 | 88                                  | 1780                     |
| 2023-09-29 | 90                                  | 1750                     |
| 2023-10-13 | 92                                  | 1765                     |

**Dataset 9: Historical Asset Returns *DS*** (Visual: Line Charts)

| Date Range                 | Asset Class  | Average Daily Return (%) | Volatility (%) |
|----------------------------|--------------|--------------------------|----------------|
| 2023-07-01 to 2023-09-30   | Equities     | 0.05                     | 1.2            |
| 2023-07-01 to 2023-09-30   | Fixed Income | 0.02                     | 0.8            |

## Question
As a senior strategist at a global macro hedge fund, your objective is to deliver a high-conviction forecast for cross-asset volatility and market sentiment over the next 30 days. Based on the provided datasets, please:

1. Classify the current market sentiment regime (e.g., risk-on, risk-off, or transitional) by integrating the sentiment scores from central bank communications, financial news tone, and social media data.
2. Forecast the 30-day forward implied and realized volatility for the following asset classes:
   - Equities (S&P 500 via VIX dynamics)
   - Fixed Income (10Y Treasury via MOVE index)
   - Commodities (Brent Crude and Gold)
3. Identify and quantify the key drivers (e.g., macroeconomic indicators, option term structures, central bank statements) influencing the forecast and highlight any potential volatility clusters or contagion risks across asset classes.
4. Recommend tactical hedging strategies (e.g., VIX calls, long gamma positions, or correlation dispersion trades) based on your forecast.

## Instruction
Please include all intermediate calculations, working steps, and references to the appropriate datasets as evidence in your analysis. Begin your final answer with a brief executive summary that highlights the key findings and recommendations of your forecast. Be sure to illustrate your thought process and justify your conclusions using both quantitative data and qualitative insights drawn from the datasets.


"""
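
The first question in the prompt asks the model to integrate the three sentiment scores in Dataset 2. As a reference point when reading the model outputs, an equal-weight average per date can be computed directly from the table. The equal weighting and the name `composite_sentiment` are assumptions made for this sketch; the prompt does not prescribe a weighting.

```python
# Dataset 2 from the prompt: (central bank, financial news, Reddit/Twitter),
# each scored on a 1-10 scale.
SENTIMENT = {
    "2023-09-01": (7, 6, 5),
    "2023-09-08": (6, 5, 4),
    "2023-09-15": (5, 4, 3),
    "2023-09-22": (4, 3, 2),
    "2023-09-29": (5, 4, 3),
    "2023-10-06": (6, 5, 4),
    "2023-10-13": (7, 6, 5),
}


def composite_sentiment(scores):
    """Equal-weight average of the sentiment channels for each date."""
    return {date: sum(values) / len(values) for date, values in scores.items()}


composite = composite_sentiment(SENTIMENT)
trough_date = min(composite, key=composite.get)  # date with the lowest average

for date, score in composite.items():
    print(f"{date}: {score:.2f}")
print("Lowest composite reading:", trough_date)
```

On this data the composite bottoms out on 2023-09-22 and is back at its 2023-09-01 level by 2023-10-13, which is the pattern a "transitional" regime label would have to account for.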

### Step 4
This cell sends `demo_prompt` to the GPT-4o deployment and stores the response.

In [4]:
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)

### Step 5
This cell prints the output from the GPT model.

In [5]:
print(gpt_output)

Executive Summary:
Based on a multi-modal review of volatility indices, sentiment scores, macroeconomic indicators, option term structures, news reports, and commodity prices, our analysis indicates that the current market sentiment is transitioning from a pronounced risk‐off stance to a more neutral, albeit cautious, regime. In the next 30 days, we forecast modestly higher implied—and by extension, likely elevated realized—volatility across major asset classes. For equities, we expect the S&P 500’s forward volatility (as captured by the VIX) to edge into the 23%–24% range, while fixed income volatility (proxied by the MOVE index) should trend toward 11%–11.5%. In the commodities arena, Brent’s volatility is likely to approach 25%, and gold’s near 14.5%. Meanwhile, macroeconomic headwinds (rising inflation, declining PMI, and tepid employment growth) combined with cautious central bank communications are expected to continue feeding cross‐asset contagion risks, thereby supporting a clu

### Step 6
This cell uses the O1 model to generate a response using the same prompt.

In [6]:
o1_output = get_chat_completion(O1_MODEL, demo_prompt)

### Step 7
This cell prints the output from the O1 model.

In [7]:
print(o1_output)

Executive Summary:
Our analysis indicates that the market is in a transitional sentiment regime—initial risk‐off pressures in mid-September are now partially offset by a modest recovery in central bank and media sentiment. Quantitative data from the volatility indices, option term structures, and macroeconomic signals point to a moderate yet persistent upward drift in implied and realized volatilities across asset classes over the next 30 days. We forecast that equities (as measured via the S&P 500/VIX) will see a further elevation toward roughly 24–25, fixed income (10Y Treasury via the MOVE index) will trend toward levels near 11.5–12, and commodities will exhibit higher volatilities, with Brent crude and gold anticipated to reach around 25% and 14.5%, respectively. Key drivers include tightening signals from central banks, rising inflationary pressures and deteriorating PMI/NFP indicators, and an evolving option term structure that reinforces a spillover of uncertainty. In response,
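
Both executive summaries quote 30-day ranges for the VIX, MOVE, Brent, and gold volatility. A rough way to sanity-check those ranges is a straight-line extrapolation of Dataset 1. This is only a reference calculation under a strong assumption (the average weekly change over the sample simply continues for four more weeks); it is not the method either model used, and the helper names are hypothetical.

```python
# Dataset 1 from the prompt: weekly observations, 2023-09-01 to 2023-10-13.
VOL_SERIES = {
    "VIX": [18.2, 19.0, 20.1, 21.0, 20.5, 22.0, 22.5],
    "MOVE": [9.5, 10.0, 10.2, 10.5, 10.3, 10.7, 11.0],
    "Brent IV": [22.0, 22.5, 23.0, 23.8, 23.2, 24.0, 24.5],
    "Gold IV": [12.5, 12.8, 13.0, 13.5, 13.2, 13.8, 14.0],
}


def average_weekly_change(series):
    """Mean week-over-week change: (last - first) / number of intervals."""
    return (series[-1] - series[0]) / (len(series) - 1)


def naive_projection(series, weeks_ahead=4):
    """Extend the last observation by the average weekly change."""
    return series[-1] + weeks_ahead * average_weekly_change(series)


for name, series in VOL_SERIES.items():
    change = average_weekly_change(series)
    projected = naive_projection(series)
    print(f"{name}: {change:+.2f} per week, {projected:.1f} after 4 weeks")
```

The straight-line projections land at or above the upper end of the ranges in both summaries, partly because the calculation ignores the late-September dip. Treat it as a plausibility check rather than a forecast.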

### Step 8
This cell asks the o1 model to evaluate and compare both outputs against an eight-dimension rubric, then renders the evaluation as Markdown.

In [8]:
result = get_chat_completion(
    O1_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (O1):**  
{o1_output}  

**Answer 2 (GPT-4o):**  
{gpt_output}  
"""
)
display(Markdown(result))


Below is our comparative evaluation of the two answers across the requested dimensions, followed by a concise overall summary.

─────────────────────────────  
1. Clarity

• Answer 1’s strengths:
 – It opens with a clear executive summary that neatly outlines the regime classification, quantitative forecasts, key drivers, and tactical recommendations.
 – The discussion is organized into numbered sections with clear subheadings (e.g., “1. Market Sentiment Regime Classification,” “2. 30-Day Forward Volatility Forecasts”), making it easy to follow.
 – Each dataset reference is clearly indicated to support the analysis.

• Answer 1’s weaknesses:
 – Some parts, while detailed, might benefit from even more step-by-step signposting (although it remains generally coherent).

• Answer 2’s strengths:
 – Also begins with an executive summary summarizing key numbers and recommendations.
 – Uses bullet points and section headers effectively and even includes an “Intermediate Calculation” step to demonstrate integration of sentiment scores.
 – Its explicit “Intermediate Trend” notes add clarity on the reasoning process.

• Answer 2’s weaknesses:
 – Its verbosity sometimes means the text is denser, which can slightly affect quick readability.
 – The greater detail (e.g., step-by-step numeric reasoning) might make the overall answer longer than necessary, potentially reducing immediate clarity for users looking for a quick snapshot.

─────────────────────────────  
2. Accuracy & Correctness

• Answer 1’s strengths:
 – Provides correct data-driven trends taken from the given tables (e.g., forecasting equities’ VIX to 24–25, MOVE trending near 11.5–12).
 – Uses correct interpretation of datasets (volatility indices, market sentiment, option term structures) to support the forecasts.

• Answer 1’s weaknesses:
 – Minor differences in the forecast ranges (for example, equities’ forecast is slightly on the higher end), but these fall within plausible ranges given the data.

• Answer 2’s strengths:
 – Offers clear numerical reasoning; for instance, it calculates an “average” sentiment score explicitly.
 – The summaries of volatility trends are perfectly aligned with the dataset numbers and the implied forward trends.
 – All the forecast numbers (23%–24% for equities, 11%–11.5% for fixed income, etc.) are consistent with the data provided.

• Answer 2’s weaknesses:
 – Again, minor slight differences appear compared to Answer 1’s ranges; however, these differences are marginal and can be attributed to different extrapolation assumptions.

─────────────────────────────  
3. Completeness

• Answer 1’s strengths:
 – Fully addresses every part of the prompt: sentiment classification, forecasts for equities, fixed income, and commodities, identification of key drivers, risk/contagion analysis, and tactical hedging recommendations.
 – Provides intermediate steps and dataset references throughout.

• Answer 2’s strengths:
 – Also delivers on all required components and even offers additional nuances (e.g., explicit “risk management” recommendations along with hedging strategies).
 – Its commentary on potential contagion and dispersion trades is comprehensive.
    
• Both answers are complete; the only difference is that Answer 2’s explanation might seem a bit more elaborate in parts.

─────────────────────────────  
4. Relevance & Adherence

• Answer 1’s strengths:
 – It sticks closely to the instruction, clearly referencing each dataset, and follows the step-by-step instructions (including intermediate calculations and supporting evidence).
 – The language and structure are clearly oriented toward a legal, compliance, or due diligence audience.

• Answer 2’s strengths:
 – Also adheres closely to the prompt requirements and integrates both quantitative and qualitative data.
 – Its integration of text from dataset comments (e.g., “cautious optimism” and the calculation from sentiment scores) helps ensure relevance.
    
• No significant deviations are noted in either answer; both fully follow the task’s requirements.

─────────────────────────────  
5. Analytical Depth

• Answer 1’s strengths:
 – Demonstrates mature reasoning by cross-referencing multiple datasets.
 – Offers actionable insights such as “VIX calls,” “long gamma positions,” and “dispersion trades” along with clear intermediate steps.
 – The connection between macro drivers (e.g., rising CPI/declining PMI) and volatility forecasts is well argued.

• Answer 1’s weaknesses:
 – While thorough, there are moments when additional explanation of the quantitative extrapolation (beyond referencing trend direction and ranges) could enhance the depth further.

• Answer 2’s strengths:
 – Provides a slightly deeper dive by including an intermediate calculation (e.g., averaging sentiment scores) and more discussion around “potential clusters & contagion.”
 – It explicitly quantifies trends (e.g., “roughly a 0.6–0.7 point weekly rise”) which enhances the analytical detail.
    
• Answer 2’s weaknesses:
 – The extra detail can sometimes feel verbose, which may dilute the concise analytical punch.

─────────────────────────────  
6. Multi-Dataset Synthesis

• Answer 1’s strengths:
 – Integrates insights across datasets (e.g., volatility indices, option term structures, sentiment scores, and macroeconomic indicators) into a cohesive forecast.
 – Each section clearly cites relevant datasets to back up the reasoning.

• Answer 1’s weaknesses:
 – In a couple of instances, while dataset references are noted, the synthesis between discrete signals (for example, linking macroeconomic indicators to derived hedging strategies) could be even more highlighted.

• Answer 2’s strengths:
 – Synthesizes multiple datasets effectively and shows more explicit calculation exercises (like averaging sentiment scores).
 – The discussion on central bank messaging (drawing on Dataset 5) alongside quantitative trends is well tied together.

• Answer 2’s weaknesses:
 – The response’s length might make it a bit more challenging to quickly isolate the most crucial multi-dataset insights, though none of the individual elements are missed.

─────────────────────────────  
7. Robustness to Ambiguity

• Answer 1’s strengths:
 – Handles the ambiguous tone of central bank communications by categorizing the regime as “transitional,” which is well justified given the mix of qualitative and quantitative signals.
 – Provides clear reasoning despite the inherent uncertainties in the datasets.

• Answer 2’s strengths:
 – Also deals well with ambiguous signals (e.g., “cautious optimism” vs. “risk-off biases”) and quantifies them through average sentiment metrics.
 – The detailed discussion on potential contagion reflects comfort with ambiguous risk factors.

• Both answers show robust handling of ambiguous language and incomplete signals, with Answer 2 offering a bit more numeric detail to tame ambiguity.

─────────────────────────────  
8. Format & Usability

• Answer 1’s strengths:
 – The structured format with sections and bullet points makes the output highly usable for legal, compliance, or due diligence teams.
 – The clear action items in the “Tactical Hedging Recommendations” section translate directly into practical next steps.

• Answer 2’s strengths:
 – Also formatted in a clear, organized manner with headings and explicit tactical suggestions.
 – Its additional risk management commentary (e.g., “tighten stop losses”) offers extra practical value.

• Answer 1’s weaknesses:
 – Though very clear, it could be considered slightly less detailed in risk management nuances compared to Answer 2.
    
• Answer 2’s weaknesses:
 – Its length and denser presentation may require additional time for a quick executive review, though it is comprehensive.

─────────────────────────────  
Concise Summary

Both answers are comprehensive and accurate, and they clearly integrate multiple datasets with actionable recommendations. However, Answer 1 stands out for its enhanced clarity and succinct organization—its clear executive summary and systematic breakdown into easily digestible sections provide a very user-friendly structure for busy legal, compliance, or due diligence teams. Although Answer 2 offers added analytical nuance and extra risk management commentary, its verbosity and density may reduce immediate practical usability. 

Overall, Answer 1 is the better response because it delivers a clear, well-organized, and actionable analysis that fully addresses the prompt while effectively balancing quantitative reasoning with qualitative insights.
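
In Step 8 the judge prompt labels each answer with the model that wrote it, and the judge is itself the o1 deployment. One possible variation, sketched here, hides the model names behind neutral labels and randomizes the order, keeping a separate key to decode the verdict afterwards. `build_blind_eval_prompt` and its parameters are hypothetical names for this sketch; the `rubric` argument would be the instruction text from Step 8.

```python
import random


def build_blind_eval_prompt(question, answers, rubric, seed=None):
    """Build a judge prompt with anonymized, shuffled answers.

    answers: dict mapping model name -> answer text.
    Returns (prompt, key), where key maps "Answer N" -> model name.
    """
    rng = random.Random(seed)
    names = list(answers)
    rng.shuffle(names)  # randomize which model appears first

    key = {}
    sections = []
    for index, name in enumerate(names, start=1):
        label = f"Answer {index}"
        key[label] = name  # kept outside the prompt for decoding later
        sections.append(f"**{label}:**\n{answers[name]}")

    prompt = (
        f"{rubric}\n\n---\n\n"
        f"**Question Prompt:**\n{question}\n\n"
        + "\n\n".join(sections)
    )
    return prompt, key


prompt, key = build_blind_eval_prompt(
    question="Q",
    answers={"o1": "AAA", "gpt-4o": "BBB"},
    rubric="R",
    seed=0,
)
print(key)
```

In the notebook the call would take `demo_prompt`, `{"o1": o1_output, "gpt-4o": gpt_output}`, and the rubric, and the returned `key` would be consulted only after the judge has responded.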