# Reasoning use cases: Model comparison, o1 vs. GPT-4o

In [82]:
import copy
import json
import os
import textwrap
import warnings

from dotenv import load_dotenv
from IPython.display import display, Image, Markdown
from openai import AzureOpenAI
from pydantic import BaseModel, Field

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

# Load AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY from the local .env file.
load_dotenv("./.env")

client = AzureOpenAI(
    api_version="2024-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

# Azure deployment names; these must match the deployments in your Azure OpenAI resource.
GPT_MODEL = 'gpt-40-globalStandard'
O1_MODEL = 'o1'

In [83]:
def get_chat_completion(model, prompt):
    """
    Sends a single-turn user prompt to an Azure OpenAI deployment.

    :param model: The deployment name to use for the completion.
    :param prompt: The prompt to send to the model.
    :return: The text content of the first choice in the response.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

In [84]:
demo_prompt = """
# Credit Risk Assessment Prompt

## Purpose
Credit Risk Assessment

## Instructions
1. A bank is considering whether to approve a $200,000 mortgage loan for a customer who wants to buy a house. The bank has the following information about the customer.
2. Review the customer's credit score, debt-to-income ratio, loan-to-value ratio, employment status, income, savings, other debts, and the house information.
3. Assign a rating for each criterion based on the customer and the house information.
4. Combine the ratings for each criterion to form a composite rating.
5. Look up the maximum loan amount and the minimum interest rate for the composite rating in the policy table.
6. Compare the requested loan amount and the offered interest rate with the policy limits.
7. Decide whether to approve or reject the loan application based on the comparison and other factors.
8. If you need to make assumptions, approve or decline the loan and list the conditions that must first be validated before this decision can be made.

## Customer Information
- **Credit score:** 720 (out of 850)
- **Debt-to-income ratio:** 35%
- **Loan-to-value ratio:** 80%
- **Employment status:** Full-time, stable
- **Income:** $60,000 per year
- **Savings:** $10,000
- **Other debts:** $15,000 (car loan and credit cards)

## House Information
- **Appraised value:** $250,000
- **Location:** Suburban, low crime rate, good school district
- **Market trend:** Stable, moderate demand, low inventory
- **Interest rate:** 4% fixed for 30 years

## Credit Risk Assessment Model
The bank uses a credit risk assessment model that assigns a rating from A to D based on the following criteria:

- **Credit score:** A (700 or above), B (650–699), C (600–649), D (below 600)
- **Debt-to-income ratio:** A (below 30%), B (30–39%), C (40–49%), D (50% or above)
- **Loan-to-value ratio:** A (below 70%), B (70–79%), C (80–89%), D (90% or above)
- **Employment status:** A (full-time, stable), B (full-time, variable), C (part-time, stable), D (part-time, variable or unemployed)
- **Market trend:** A (rising, high demand, low inventory), B (stable, moderate demand, low inventory), C (stable, moderate demand, high inventory), D (declining, low demand, high inventory)

The bank also uses a credit risk assessment policy that defines the maximum loan amount and the minimum interest rate for each rating combination, as shown in the table below:

| Rating | Maximum loan amount | Minimum interest rate |
|--------|---------------------|-----------------------|
| AAAAA  | $500,000            | 3.5%                  |
| AAAAB  | $450,000            | 3.75%                 |
| AAAAC  | $400,000            | 4%                    |
| AAAAD  | $350,000            | 4.25%                 |
| AAABA  | $400,000            | 3.75%                 |
| AAABB  | $350,000            | 4%                    |
| AAABC  | $300,000            | 4.25%                 |
| AAABD  | $250,000            | 4.5%                  |
| AAACA  | $350,000            | 4%                    |
| AAACB  | $300,000            | 4.25%                 |
| AAACC  | $250,000            | 4.5%                  |
| AAACD  | $200,000            | 4.75%                 |
| AAADA  | $300,000            | 4.25%                 |
| AAADB  | $250,000            | 4.5%                  |
| AAADC  | $200,000            | 4.75%                 |
| AAADD  | $150,000            | 5%                    |
| ABAAA  | $400,000            | 4%                    |
| ...    | ...                 | ...                   |
| DDDDD  | $50,000             | 6.5%                  |

## Credit Risk Assessment Process
1. **Assign Ratings:**  
   Assign a rating for each criterion based on the customer and the house information.  
   *Example:* The customer’s credit score of 720 corresponds to a rating of A.

2. **Composite Rating:**  
   Combine the ratings for each criterion to form a composite rating.  
   *Example:* Ratings for credit score, debt-to-income ratio, loan-to-value ratio, and employment status are A, B, C, and A respectively; and the house has a rating of B for market trend. Thus, the composite rating is **ABACA**.

3. **Policy Lookup:**  
   Look up the maximum loan amount and the minimum interest rate for the composite rating in the policy table.  
   *Example:* For composite rating ABACA, maximum loan amount is $300,000 and minimum interest rate is 4.25%.

4. **Comparison:**  
   Compare the requested loan amount and the offered interest rate with the policy limits.  
   *Example:* The customer requests $200,000 and the offered rate is 4%, which are both within the policy limits.

5. **Decision:**  
   Decide whether to approve or reject the loan application based on the comparison and other factors (e.g., credit history, savings, and other debts).  
   *Example:* The bank may approve the loan due to good credit history, sufficient savings, manageable debts, and acceptable loan terms. Alternatively, if factors like a high debt-to-income ratio or risky market trends are present, the bank might reject the request or approve it with conditions.

"""
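
The rubric in the prompt is deterministic, so the expected per-criterion ratings can be computed directly and used as a reference when reading the model outputs. The following is a minimal sketch, not part of the original notebook: the helper names (`band`, `credit_score_rating`, `composite_rating`) are hypothetical, and it assumes the composite is formed by concatenating the ratings in the order the criteria are listed (credit score, DTI, LTV, employment, market trend). Under that assumption the customer's composite is ABCAB, which is the value the evaluation further down reports, whereas the prompt's own worked example writes ABACA.

```python
# Reference check for the rubric in demo_prompt (helper names are illustrative).

def band(value, cutoffs):
    """Return 'A'..'D' for a lower-is-better value, given the exclusive
    upper bounds of the A, B, and C bands."""
    for letter, upper in zip("ABC", cutoffs):
        if value < upper:
            return letter
    return "D"

def credit_score_rating(score):
    # Higher is better: A (700+), B (650-699), C (600-649), D (below 600).
    if score >= 700:
        return "A"
    if score >= 650:
        return "B"
    if score >= 600:
        return "C"
    return "D"

EMPLOYMENT_RATINGS = {
    "full-time, stable": "A",
    "full-time, variable": "B",
    "part-time, stable": "C",
    "part-time, variable": "D",
    "unemployed": "D",
}

MARKET_RATINGS = {
    "rising, high demand, low inventory": "A",
    "stable, moderate demand, low inventory": "B",
    "stable, moderate demand, high inventory": "C",
    "declining, low demand, high inventory": "D",
}

def composite_rating(credit_score, dti_pct, ltv_pct, employment, market):
    # Assumed order: credit score, DTI, LTV, employment status, market trend.
    return "".join([
        credit_score_rating(credit_score),
        band(dti_pct, (30, 40, 50)),   # A <30, B 30-39, C 40-49, D 50+
        band(ltv_pct, (70, 80, 90)),   # A <70, B 70-79, C 80-89, D 90+
        EMPLOYMENT_RATINGS[employment.lower()],
        MARKET_RATINGS[market.lower()],
    ])

rating = composite_rating(
    720, 35, 80, "Full-time, stable", "Stable, moderate demand, low inventory"
)
print(rating)  # ABCAB
```

The lookup tables are keyed on the exact phrases used in the rubric, so free-text variations would raise a `KeyError` rather than silently map to a rating.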

In [85]:
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)

In [86]:
print(gpt_output)

Below is one example of how to work through the bank’s Credit Risk Assessment Process using the information provided.

──────────────────────────────
1. Step 1. Assign Individual Ratings

Using the bank’s guidelines, we assign a rating on each criterion as follows:

• Credit score (720):  
  • “A” is for 700 or above.  
  → Rating: A

• Debt‐to‐income ratio (35%):  
  • “A” is below 30% and “B” is 30–39%.  
  → Rating: B

• Loan-to-value ratio (80%):  
  • “A” is below 70%, “B” is 70–79%, “C” is 80–89% (and “D” is 90% or above).  
  → Rating: C

• Employment status (full‑time, stable):  
  • This clearly meets the “A” criterion.  
  → Rating: A

• House’s Market Trend & Qualities:  
  • The house is in a suburban area with low crime, good schools, and the market is described as “Stable, moderate demand, low inventory.”  
  • According to the guidelines, “B” is assigned for stable market conditions with moderate demand and low supply.  
  → Rating: B

──────────────────────────────
2. S

In [87]:
o1_output = get_chat_completion(O1_MODEL, demo_prompt)

In [88]:
print(o1_output)

Below is one way to work through the credit risk assessment given the information and process provided.

─────────────────────────────  
Step 1. Assign Ratings for Each Criterion

• Credit Score:  
 • Customer’s score is 720. Since 700 or above qualifies for an “A” rating, assign Credit Score = A.

• Debt-to-Income (DTI) Ratio:  
 • Customer’s DTI is 35%. The guidelines designate 30–39% as a “B” rating. Thus, assign DTI = B.

• Loan-to-Value (LTV) Ratio:  
 • The customer’s LTV is 80%. According to the model, an LTV of 80–89% earns a “C” rating. Thus, assign LTV = C.

• Employment Status:  
 • The customer is full‐time and stable. This meets the “A” rating criteria. Thus, assign Employment = A.

• House Market Trend:  
 • The house is in a suburban area with stable market conditions (moderate demand and low inventory). The model rates “stable, moderate demand, low inventory” as a “B”. Thus, assign Market Trend = B.

─────────────────────────────  
Step 2. Form the Composite Rating

Com

In [89]:
result = get_chat_completion(
    O1_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (O1):**  
{o1_output}  

**Answer 2 (GPT-4o):**  
{gpt_output}  
"""
)
display(Markdown(result))


Below is our detailed evaluation and comparison of the two responses across the eight dimensions:

─────────────────────────────  
1. Clarity  
• Answer 1 is very well‐organized: it clearly labels each “Step” (assign ratings, composite rating, policy lookup, comparison, other considerations, and decision) and uses bullet points and headings throughout. This makes it very easy to follow.  
• Answer 2 is also clear and well‐structured, using a similar step‐by‐step approach and headings. However, Answer 1’s extra emphasis on mapping ratings and listing assumptions (e.g., “AAACB”) makes its logic flow slightly richer.  

─────────────────────────────  
2. Accuracy & Correctness  
• Both answers correctly assign ratings (A for credit score, B for DTI, C for LTV, A for employment, B for market trend) and combine them into a composite rating (“ABCAB”).  
• They both note that the composite is not explicitly in the policy table and then assume a similar rating range yielding a maximum loan amount of $300,000 and minimum rate ~4.25%.  
• Answer 1 goes further by discussing how the “strong credit” factors might be offset by risk areas (and even mentions down payment issues) while Answer 2 maintains a similar analysis but with slightly fewer nuances.  

─────────────────────────────  
3. Completeness  
• Both responses cover all the steps outlined in the prompt: rating assignment, composite rating formation, policy lookup, comparison of loan terms, and conditional reasoning leading to a decision.  
• Answer 1 additionally discusses the customer’s insufficient savings for a required down payment, adding another layer of risk assessment not as explicitly detailed in Answer 2.  

─────────────────────────────  
4. Relevance & Adherence  
• Both answers follow the instructions and address each part of the prompt.  
• Answer 1 directly notes discrepancies between the offered interest rate (4%) and what the policy requires, and further ties in the customer’s down payment shortfall as a risk factor.  
• Answer 2 adheres closely to the instructions and presents a clear decision process but does not integrate the down payment context as thoroughly.  

─────────────────────────────  
5. Analytical Depth  
• Answer 1 shows strong reasoning through detailed discussion of how the composite rating maps to policy—mentioning that despite good credit and employment, the DTI, LTV, and down payment issues are concerns. It also explains the judgment behind mapping “ABCAB” to an internal rating (like “AAACB”).  
• Answer 2 provides a solid analysis and good supporting evidence by clearly listing each calculation and assumption. It offers condition-based recommendations but with slightly less depth regarding the borrower's liquidity (down payment) issues.  

─────────────────────────────  
6. Multi-Dataset Synthesis  
• Both answers effectively pull together customer credit, ratios, employment, and house market data.  
• Answer 1, however, provides a more robust synthesis by tying in the saving levels as an additional dataset and discussing its impact on the overall credit risk.  
• Answer 2 focuses largely on the ratings and the interest rate condition without revisiting additional numerical factors beyond the basic ratings.  

─────────────────────────────  
7. Robustness to Ambiguity  
• Both models address the ambiguity presented by a composite rating that isn’t explicitly mapped in the policy table.  
• Answer 1 details a method to map “ABCAB” to a comparable rating (“AAACB”) and clearly explains the rationale behind that mapping, which demonstrates comfort with ambiguity.  
• Answer 2 similarly makes an assumption but offers a more straightforward mapping with less explanation, meaning it is slightly less robust in unpacking the nuances.  

─────────────────────────────  
8. Format & Usability  
• Both responses are formatted in a usefully structured manner (steps, lists, and clear conclusions) that would aid a legal, compliance, or due diligence team in following the assessment.  
• Answer 1 is more practically useful because it not only lists the conditions under which the loan could be conditionally approved but also emphasizes additional risk factors (like the shortfall in down payment funds), which is crucial for a deeper due diligence process.  
• Answer 2 is equally clear but does not elaborate on the liquidity/asset check as much, which may be a gap for teams needing additional context.  

─────────────────────────────  
Summary of Evaluation  
Answer 1 stands out as the stronger response overall. It is very clear, methodically addresses each instruction, and goes beyond by integrating additional risk factors (such as down payment insufficiency) that further contextualize the credit assessment. Its method of transparently handling the ambiguity in composite rating mapping adds additional depth and makes its decision recommendations more robust. While Answer 2 is also correct, thorough, and well-organized, its analysis is slightly less nuanced regarding additional financial stressors. 

Final Conclusion: Answer 1 is better because it combines clear organization, comprehensive accuracy, deeper analytical reasoning, and richer integration of all available data. This makes its output more actionable and useful for a legal, compliance, or due diligence team.
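
Several points in the evaluation (the composite rating missing from the policy table, the 4% offered rate versus the policy minimum, and the down payment shortfall) can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions rather than part of the original notebook: `POLICY` contains only the rows visible in the prompt's table excerpt, `check_against_policy` is a hypothetical helper, AAACB is used only because the evaluation attributes that proxy mapping to Answer 1, and the down payment figure assumes the purchase price equals the appraised value.

```python
# Rows copied from the policy table excerpt in demo_prompt.
# The elided rows ("...") are deliberately left out, not guessed.
POLICY = {
    "AAAAA": (500_000, 3.50), "AAAAB": (450_000, 3.75),
    "AAAAC": (400_000, 4.00), "AAAAD": (350_000, 4.25),
    "AAABA": (400_000, 3.75), "AAABB": (350_000, 4.00),
    "AAABC": (300_000, 4.25), "AAABD": (250_000, 4.50),
    "AAACA": (350_000, 4.00), "AAACB": (300_000, 4.25),
    "AAACC": (250_000, 4.50), "AAACD": (200_000, 4.75),
    "AAADA": (300_000, 4.25), "AAADB": (250_000, 4.50),
    "AAADC": (200_000, 4.75), "AAADD": (150_000, 5.00),
    "ABAAA": (400_000, 4.00), "DDDDD": (50_000, 6.50),
}

def check_against_policy(composite, requested_loan, offered_rate_pct):
    """Return None if the composite is not in the excerpt, else the limits
    and whether the requested loan and offered rate satisfy them."""
    limits = POLICY.get(composite)
    if limits is None:
        return None
    max_loan, min_rate = limits
    return {
        "max_loan": max_loan,
        "min_rate_pct": min_rate,
        "loan_within_limit": requested_loan <= max_loan,
        "rate_meets_minimum": offered_rate_pct >= min_rate,
    }

# The customer's composite rating is not among the listed rows.
print(check_against_policy("ABCAB", 200_000, 4.0))  # None

# Proxy rating mentioned in the evaluation: loan fits, rate is below minimum.
proxy = check_against_policy("AAACB", 200_000, 4.0)
print(proxy["loan_within_limit"], proxy["rate_meets_minimum"])  # True False

# Down payment implied by the stated figures
# (assumes purchase price == appraised value).
appraised_value = 250_000
requested_loan = 200_000
savings = 10_000
ltv_pct = 100 * requested_loan / appraised_value
down_payment = appraised_value - requested_loan
shortfall = max(0, down_payment - savings)
print(ltv_pct, down_payment, shortfall)  # 80.0 50000 40000
```

Returning `None` for an unlisted composite keeps the gap in the table visible instead of substituting a nearby row, which is the judgment call both models had to make explicitly.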