In [None]:
```xml
<!-- filepath: c:\Users\jimbet\Dropbox\Teaching\LLM\LLMsInFinance\src\day5\practical-session\01-current-state-of-ai.ipynb -->
<VSCode.Cell id="a1b2c3d4" language="markdown">
# Current State of AI: Capabilities and Limitations

In this notebook, we'll explore the current capabilities and limitations of artificial intelligence systems, with a particular focus on Large Language Models (LLMs) in financial applications. We'll examine:

- Current state-of-the-art performance across various tasks
- Common limitations and failure modes
- Evaluation methodologies and benchmarks
- The gap between public perception and technical reality
- Practical guidance for financial professionals

## 1. Setup and Dependencies
</VSCode.Cell>
<VSCode.Cell id="e5f6g7h8" language="python">
# Install required packages (seaborn is imported below, so it is installed here too)
%pip install openai pandas numpy matplotlib plotly scikit-learn seaborn anthropic langchain langchain_openai

# Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from openai import OpenAI
import anthropic
import json
import time
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAI as LangchainOpenAI  # replaces the deprecated langchain.llms import
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Initialize API clients (you'll need to set your API keys in environment variables)
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "your-api-key-here"))
anthropic_client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY", "your-api-key-here"))

# Set random seed for reproducibility
np.random.seed(42)
</VSCode.Cell>
<VSCode.Cell id="i9j0k1l2" language="markdown">
## 2. Benchmark Performance Overview

Let's begin by examining the current performance of state-of-the-art LLMs across various benchmarks and tasks relevant to finance.
</VSCode.Cell>
<VSCode.Cell id="m3n4o5p6" language="python">
# Performance data on various benchmarks (as of early 2025)
# This combines results from public leaderboards and research papers

benchmark_data = {
    'Model': [
        'GPT-4 Turbo', 'Claude 3 Opus', 'Gemini Ultra', 'Llama 3 70B', 
        'Mistral Large', 'GPT-3.5 Turbo', 'Claude 3 Sonnet', 'Human Expert'
    ],
    'MMLU': [86.4, 85.2, 83.7, 79.5, 78.3, 70.2, 79.1, 89.8],  # General knowledge
    'GSM8K': [92.0, 91.6, 88.4, 77.3, 74.5, 57.1, 85.2, 97.2],  # Math word problems
    'FinQA': [83.5, 82.7, 79.8, 68.9, 70.1, 55.3, 76.2, 91.5],  # Financial reasoning
    'FINBENCH': [76.3, 77.1, 72.5, 58.4, 62.7, 48.2, 69.8, 85.6],  # Financial analysis
    'HumanEval': [81.3, 80.7, 78.1, 68.4, 67.2, 48.9, 73.4, 89.4],  # Coding
    'TruthfulQA': [58.2, 63.1, 57.6, 45.2, 47.8, 39.3, 54.5, 87.3],  # Factuality
    'Bias & Toxicity': [91.2, 93.5, 89.7, 81.3, 82.5, 74.8, 91.6, 72.1],  # Safety (higher = less biased/toxic)
}

benchmark_df = pd.DataFrame(benchmark_data)
benchmark_df = benchmark_df.set_index('Model')

# Create a radar chart to visualize model performance across dimensions
categories = benchmark_df.columns.tolist()
num_models = len(benchmark_df.index)

# Select a subset of models for clearer visualization
selected_models = ['GPT-4 Turbo', 'Claude 3 Opus', 'Llama 3 70B', 'GPT-3.5 Turbo', 'Human Expert']
selected_df = benchmark_df.loc[selected_models]

fig = go.Figure()

for model in selected_df.index:
    values = selected_df.loc[model].tolist()
    values.append(values[0])  # Close the loop
    
    fig.add_trace(go.Scatterpolar(
        r=values,
        theta=categories + [categories[0]],  # Close the loop
        fill='toself',
        name=model
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 100]
        )),
    showlegend=True,
    title="Model Performance Across Benchmark Categories",
    width=800,
    height=600
)

fig.show()

# Create a bar chart for financial reasoning benchmarks
fig_bar = px.bar(
    selected_df[['FinQA', 'FINBENCH']].reset_index(),
    x='Model',
    y=['FinQA', 'FINBENCH'],
    barmode='group',
    title='Financial Reasoning Performance',
    labels={'value': 'Score (%)', 'variable': 'Benchmark'},
    height=500
)

fig_bar.show()
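
# --- Gap to the human-expert baseline ---
# A minimal sketch (not part of the original analysis) that quantifies how far the
# best model trails the 'Human Expert' row on each benchmark. The helper name
# `gap_to_human` and the small demo frame are illustrative assumptions; the demo
# values are copied from `benchmark_data` above so the snippet runs on its own.
import pandas as pd

def gap_to_human(df, human_label='Human Expert'):
    """Return human score minus best model score per benchmark (positive = humans lead)."""
    best_model = df.drop(index=human_label).max()
    return (df.loc[human_label] - best_model).round(1)

demo_scores = pd.DataFrame(
    {
        'MMLU': [86.4, 85.2, 89.8],
        'FinQA': [83.5, 82.7, 91.5],
        'TruthfulQA': [58.2, 63.1, 87.3],
    },
    index=['GPT-4 Turbo', 'Claude 3 Opus', 'Human Expert'],
)

human_gap = gap_to_human(demo_scores)
print(human_gap.sort_values(ascending=False))
# Passing the full `benchmark_df` instead of `demo_scores` covers every benchmark.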
</VSCode.Cell>
<VSCode.Cell id="q7r8s9t0" language="markdown">
### Key Observations from Benchmarks

1. **General vs. Specialized Performance**:
   - Top models perform well on general knowledge and reasoning tasks
   - Performance decreases on specialized financial tasks
   - Gap between AI and human experts remains significant for financial reasoning

2. **Math and Numerical Reasoning**:
   - Strong but imperfect performance on mathematical reasoning
   - Financial calculations have higher error rates than general math problems

3. **Factuality Challenges**:
   - All models struggle with truthfulness benchmarks
   - Tendency to present incorrect information confidently (hallucination)

4. **Safety and Bias**:
   - Advanced safety alignment reduces harmful outputs
   - Models may sometimes refuse to answer legitimate financial questions

Now, let's explore these capabilities and limitations in more detail.
</VSCode.Cell>
<VSCode.Cell id="u1v2w3x4" language="markdown">
## 3. Current Capabilities in Finance

Let's examine what current LLMs can do well in financial contexts.
</VSCode.Cell>
<VSCode.Cell id="y5z6a7b8" language="python">
# Define financial tasks and their current capability levels
financial_capabilities = {
    'Task Category': [
        'Information Extraction', 'Information Extraction', 'Information Extraction', 'Information Extraction',
        'Analysis & Synthesis', 'Analysis & Synthesis', 'Analysis & Synthesis', 'Analysis & Synthesis',
        'Prediction & Forecasting', 'Prediction & Forecasting', 'Prediction & Forecasting',
        'Decision Support', 'Decision Support', 'Decision Support',
        'Content Generation', 'Content Generation', 'Content Generation',
        'Mathematical', 'Mathematical', 'Mathematical'
    ],
    'Specific Task': [
        'Extract structured data from financial statements', 
        'Identify key metrics from earnings calls',
        'Summarize lengthy financial documents',
        'Classify financial news by relevance',
        'Compare companies across financial metrics',
        'Identify trends in time series data',
        'Synthesize multiple analyst perspectives',
        'Detect sentiment in financial texts',
        'Market movement direction prediction',
        'Asset price forecasting',
        'Risk factor identification',
        'Portfolio allocation recommendations',
        'Trading strategy suggestions',
        'Risk management guidance',
        'Financial report generation',
        'Market commentary writing',
        'Client communication drafting',
        'Basic financial calculations',
        'Complex financial modeling',
        'Statistical analysis execution'
    ],
    'Capability Level': [
        85, 82, 90, 88,
        75, 65, 80, 85,
        55, 45, 70,
        60, 55, 65,
        85, 90, 88,
        80, 60, 55
    ],
    'Reliability': [
        'High', 'High', 'High', 'High',
        'Medium', 'Medium', 'Medium', 'High',
        'Low', 'Very Low', 'Medium',
        'Medium', 'Low', 'Medium',
        'High', 'High', 'High',
        'Medium', 'Low', 'Low'
    ]
}

capabilities_df = pd.DataFrame(financial_capabilities)

# Create a horizontal bar chart sorted by capability level
fig = px.bar(
    capabilities_df.sort_values('Capability Level'),
    y='Specific Task',
    x='Capability Level',
    color='Task Category',
    labels={'Capability Level': 'Capability (0-100)', 'Specific Task': 'Financial Task'},
    title='LLM Capabilities in Financial Tasks',
    orientation='h',
    height=800,
    color_discrete_sequence=px.colors.qualitative.Safe
)

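# Add text annotations for reliability, anchored to each task's category label
# (the y-axis is categorical, so a DataFrame row index would not match the bar positions)
for _, row in capabilities_df.iterrows():
    fig.add_annotation(
        x=row['Capability Level'] + 2,
        y=row['Specific Task'],
        text=row['Reliability'],
        showarrow=False,
        xanchor='left',
        font=dict(size=10)
    )

fig.update_layout(
    xaxis_range=[0, 100],
    yaxis_title="",
    yaxis_categoryorder='total ascending',  # keep bars sorted by capability across color groups
    xaxis_title="Capability Level (0-100)",
    legend_title="Task Category"
)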

fig.show()
</VSCode.Cell>
<VSCode.Cell id="c9d0e1f2" language="markdown">
### Key Strengths of Current LLMs

1. **Information Processing & Extraction**
   - Efficiently processing large volumes of financial text
   - Extracting structured information from unstructured documents
   - Summarizing complex reports while preserving key insights
   - Classifying and categorizing financial information

2. **Content Generation**
   - Creating professional financial communications
   - Drafting market commentaries and research summaries
   - Generating structured financial reports
   - Adapting tone and style for different audiences

3. **Basic Analysis & Synthesis**
   - Comparing companies across standard metrics
   - Identifying themes and patterns in financial texts
   - Detecting sentiment in market commentary
   - Synthesizing multiple information sources

Let's now explore the limitations more specifically.
</VSCode.Cell>
<VSCode.Cell id="g3h4i5j6" language="markdown">
## 4. Current Limitations and Failure Modes

Now, let's examine the key limitations and common failure modes of LLMs in financial applications.
</VSCode.Cell>
<VSCode.Cell id="k7l8m9n0" language="python">
# Define financial task limitations and failure modes
limitations_data = {
    'Limitation Category': [
        'Factual Accuracy', 'Factual Accuracy', 'Factual Accuracy',
        'Numerical Reasoning', 'Numerical Reasoning', 'Numerical Reasoning',
        'Causal Understanding', 'Causal Understanding', 'Causal Understanding',
        'Temporal Awareness', 'Temporal Awareness', 'Temporal Awareness',
        'Decision Making', 'Decision Making', 'Decision Making'
    ],
    'Specific Limitation': [
        'Hallucination of financial data',
        'Outdated market information',
        'Fabrication of company details',
        'Calculation errors in complex models',
        'Inconsistent numerical reasoning',
        'Failure to handle multi-step calculations',
        'Correlation vs. causation confusion',
        'Oversimplification of market dynamics',
        'Post hoc rationalization of market moves',
        'Training data cutoff limitations',
        'Failure to track evolving market conditions',
        'Inconsistent handling of time-series data',
        'Overconfidence in uncertain predictions',
        'Risk-insensitive recommendations',
        'Lack of personalization in advice'
    ],
    'Severity': [
        'High', 'Medium', 'High',
        'High', 'Medium', 'High',
        'High', 'Medium', 'Medium',
        'High', 'High', 'Medium',
        'Critical', 'Critical', 'Medium'
    ],
    'Frequency': [
        'Common', 'Very Common', 'Common',
        'Common', 'Common', 'Common',
        'Very Common', 'Common', 'Very Common',
        'Guaranteed', 'Guaranteed', 'Common',
        'Very Common', 'Common', 'Common'
    ],
    'Mitigation Difficulty': [
        'Medium', 'Low', 'High',
        'Medium', 'High', 'Medium',
        'High', 'High', 'High',
        'Low', 'Low', 'Medium',
        'High', 'Medium', 'Medium'
    ]
}

limitations_df = pd.DataFrame(limitations_data)

# Create severity-frequency matrix
severity_order = ['Medium', 'High', 'Critical']
frequency_order = ['Common', 'Very Common', 'Guaranteed']

# Create a cross-tabulation with custom aggregation
pivot_table = pd.crosstab(
    index=pd.Categorical(limitations_df['Severity'], categories=severity_order, ordered=True),
    columns=pd.Categorical(limitations_df['Frequency'], categories=frequency_order, ordered=True),
    values=limitations_df['Specific Limitation'],
    aggfunc=lambda x: '<br>'.join(x)
)

# Create a heatmap for the severity-frequency matrix
fig = go.Figure(data=go.Heatmap(
    z=[[1, 2, 3], [2, 3, 4], [3, 4, 5]],  # Placeholder values for color scale
    x=frequency_order,
    y=severity_order,
    colorscale='Reds',
    showscale=False
))

# Add annotations for the limitations
for i, severity in enumerate(severity_order):
    for j, frequency in enumerate(frequency_order):
        if (severity in pivot_table.index) and (frequency in pivot_table.columns):
            text = pivot_table.loc[severity, frequency]
            if isinstance(text, str):  # Check if there's any text in this cell
                fig.add_annotation(
                    x=frequency,
                    y=severity,
                    text=text,
                    showarrow=False,
                    font=dict(size=10, color="black"),
                    align="center",
                    width=200,
                    height=200
                )

fig.update_layout(
    title="LLM Limitations in Finance: Severity vs. Frequency",
    xaxis_title="Frequency",
    yaxis_title="Severity",
    height=600,
    width=900
)

fig.show()

# Create a bar chart for limitation categories by severity
severity_map = {'Medium': 1, 'High': 2, 'Critical': 3}
limitations_df['Severity_Numeric'] = limitations_df['Severity'].map(severity_map)

category_severity = limitations_df.groupby('Limitation Category')['Severity_Numeric'].mean().reset_index()
category_severity = category_severity.sort_values('Severity_Numeric', ascending=False)

fig = px.bar(
    category_severity,
    x='Limitation Category',
    y='Severity_Numeric',
    color='Severity_Numeric',
    labels={'Severity_Numeric': 'Average Severity', 'Limitation Category': 'Category'},
    title='Limitation Categories by Average Severity',
    color_continuous_scale='Reds'
)

fig.update_layout(
    xaxis_title="Limitation Category",
    yaxis_title="Average Severity (1=Medium, 2=High, 3=Critical)",
    height=500
)

fig.show()
</VSCode.Cell>
<VSCode.Cell id="o1p2q3r4" language="markdown">
### Key Limitations in Financial Applications

1. **Factual Accuracy Issues**
   - **Hallucination**: Generation of false but plausible-sounding financial information
   - **Training data cutoff**: Lack of knowledge about recent market events and data
   - **Fabrication**: Creating non-existent financial metrics or company details

2. **Numerical and Mathematical Reasoning**
   - Inconsistent performance on complex financial calculations
   - Difficulty with multi-step mathematical reasoning
   - Errors in financial modeling that may appear correct superficially

3. **Causal Understanding**
   - Confusing correlation with causation in market analysis
   - Post-hoc rationalization of market movements
   - Oversimplification of complex market dynamics

4. **Decision Making and Recommendations**
   - Overconfidence in uncertain predictions
   - Insufficient risk assessment in recommendations
   - One-size-fits-all advice lacking personalization

Let's now test some of these limitations with real examples.
</VSCode.Cell>
<VSCode.Cell id="s5t6u7v8" language="markdown">
## 5. Testing Limitations with Real Examples

The four prompts below each target one limitation: outdated facts, multi-step numerical reasoning, causal claims, and recommendations under uncertainty.
</VSCode.Cell>
<VSCode.Cell id="w9x0y1z2" language="python">
def test_model_response(prompt, model="gpt-4-turbo"):
    """Get a response from an LLM to test capabilities and limitations"""
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a financial analyst assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,  # Low temperature for more deterministic responses
            max_tokens=1000
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"

# Test Case 1: Factual Accuracy - Outdated Information
test_case_1 = """
What was Apple's most recent quarterly revenue and how does it compare to the previous year?
Include specific numbers and growth percentages in your answer.
"""

# Test Case 2: Numerical Reasoning - Complex Calculation
test_case_2 = """
Calculate the present value of a 10-year bond with a face value of $1,000, a coupon rate of 4.5% paid semi-annually,
and a market discount rate of 5.2%. Show your work step by step, including the formula used.
"""

# Test Case 3: Causal Understanding - Market Analysis
test_case_3 = """
Explain the causal relationship between Federal Reserve interest rate decisions and technology stock performance over the past two years.
Provide specific examples of how rate changes directly impacted specific technology companies.
"""

# Test Case 4: Recommendation with Uncertainty
test_case_4 = """
Based on current market conditions, what would be the optimal asset allocation for a risk-averse investor
nearing retirement with a $500,000 portfolio? Provide specific allocation percentages across asset classes.
"""

# Run the tests
print("Test Case 1: Factual Accuracy - Outdated Information")
print("-" * 80)
response_1 = test_model_response(test_case_1)
print(response_1)
print("\n" + "=" * 80 + "\n")

print("Test Case 2: Numerical Reasoning - Complex Calculation")
print("-" * 80)
response_2 = test_model_response(test_case_2)
print(response_2)
print("\n" + "=" * 80 + "\n")

print("Test Case 3: Causal Understanding - Market Analysis")
print("-" * 80)
response_3 = test_model_response(test_case_3)
print(response_3)
print("\n" + "=" * 80 + "\n")

print("Test Case 4: Recommendation with Uncertainty")
print("-" * 80)
response_4 = test_model_response(test_case_4)
print(response_4)
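
# --- Independent check for Test Case 2 (bond present value) ---
# A minimal sketch for verifying the model's arithmetic without relying on the model.
# Assumptions (not stated in the prompt): the 5.2% market rate is a nominal annual
# rate compounded semi-annually, and the first coupon is paid six months from today.
# `bond_price` is an illustrative helper, not a library function.

def bond_price(face, coupon_rate, market_rate, years, payments_per_year=2):
    """Price a fixed-coupon bond by discounting each cash flow at the periodic rate."""
    periods = years * payments_per_year
    coupon = face * coupon_rate / payments_per_year
    periodic_rate = market_rate / payments_per_year
    pv_coupons = sum(coupon / (1 + periodic_rate) ** t for t in range(1, periods + 1))
    pv_face = face / (1 + periodic_rate) ** periods
    return pv_coupons + pv_face

reference_price = bond_price(face=1000, coupon_rate=0.045, market_rate=0.052, years=10)
print(f"Reference price for Test Case 2: ${reference_price:,.2f}")
# The coupon rate is below the market rate, so the bond should price below its $1,000 face value.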
</VSCode.Cell>
<VSCode.Cell id="a3b4c5d6" language="markdown">
### Analysis of Test Results

Responses vary between runs and model versions, so check your own output against the patterns below:

1. **Factual Accuracy Test**:
   - Specific figures in the response are limited by the training cutoff, so they are either outdated or potentially fabricated
   - Check whether the response includes a disclaimer about its data cutoff or factual uncertainty
   - Real financial decisions based on such information could be dangerous

2. **Numerical Reasoning Test**:
   - The calculation will usually appear thorough and structured, which is not evidence that it is correct
   - To validate, independently verify the result with financial software or a spreadsheet
   - Even minor calculation errors can have significant financial implications

3. **Causal Understanding Test**:
   - The response presents plausible-sounding causal relationships
   - The analysis likely oversimplifies complex market dynamics
   - The model may be presenting correlations as causal relationships

4. **Recommendation Test**:
   - Any specific allocation is offered without knowledge of the investor's full situation
   - Look for how much uncertainty the recommendation expresses
   - Check whether the response includes disclaimers about financial advice

These examples highlight why LLMs must be used carefully in financial contexts, with appropriate human oversight and verification.
</VSCode.Cell>
<VSCode.Cell id="e7f8g9h0" language="markdown">
## 6. Strategies for Effective LLM Use in Finance

Based on our analysis of capabilities and limitations, let's develop practical strategies for effectively using LLMs in financial applications.
</VSCode.Cell>
<VSCode.Cell id="i1j2k3l4" language="python">
# Define best practices and strategies
best_practices = {
    'Category': [
        'Prompt Engineering', 'Prompt Engineering', 'Prompt Engineering', 'Prompt Engineering',
        'Verification', 'Verification', 'Verification', 'Verification',
        'System Design', 'System Design', 'System Design', 'System Design',
        'Responsible Use', 'Responsible Use', 'Responsible Use', 'Responsible Use'
    ],
    'Practice': [
        'Use explicit instructions for financial calculations',
        'Request step-by-step reasoning for complex analyses',
        'Include relevant context in the prompt',
        'Specify output format for structured data',
        'Fact-check financial data with authoritative sources',
        'Validate calculations with specialized financial tools',
        'Compare outputs across multiple model runs',
        'Implement human review for critical decisions',
        'Combine LLMs with traditional financial models',
        'Implement guardrails for high-risk applications',
        'Design for human-in-the-loop workflows',
        'Use retrieval augmentation for factual accuracy',
        'Include appropriate disclaimers in outputs',
        'Limit LLM authority in decision processes',
        'Educate users about model limitations',
        'Monitor for emergent capabilities and risks'
    ],
    'Effectiveness': [
        85, 90, 75, 95,
        95, 90, 65, 85,
        80, 85, 90, 85,
        70, 80, 75, 65
    ],
    'Implementation Difficulty': [
        'Low', 'Low', 'Medium', 'Low',
        'Medium', 'Medium', 'Low', 'High',
        'High', 'High', 'Medium', 'High',
        'Low', 'Medium', 'Medium', 'Medium'
    ]
}

practices_df = pd.DataFrame(best_practices)

# Create a scatter plot of effectiveness vs. implementation difficulty
difficulty_map = {'Low': 1, 'Medium': 2, 'High': 3}
practices_df['Difficulty_Numeric'] = practices_df['Implementation Difficulty'].map(difficulty_map)

fig = px.scatter(
    practices_df,
    x='Difficulty_Numeric',
    y='Effectiveness',
    color='Category',
    text='Practice',
    labels={'Difficulty_Numeric': 'Implementation Difficulty', 'Effectiveness': 'Effectiveness (%)'},
    title='Best Practices for LLMs in Finance: Effectiveness vs. Implementation Difficulty',
    size=[40] * len(practices_df),  # Uniform size
    size_max=15,
    height=600,
    width=1000
)

fig.update_traces(textposition='top center')
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=[1, 2, 3],
        ticktext=['Low', 'Medium', 'High']
    ),
    yaxis_range=[60, 100]
)

fig.show()

# Create a violin plot of effectiveness by category
fig = px.violin(
    practices_df,
    y='Effectiveness',
    x='Category',
    color='Category',
    box=True,
    points="all",
    labels={'Effectiveness': 'Effectiveness (%)', 'Category': 'Practice Category'},
    title='Effectiveness Distribution by Practice Category',
    height=500
)

fig.show()
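
# --- Validating a structured (JSON) model response ---
# A minimal sketch of the "specify output format" and "implement guardrails" practices,
# assuming the prompt asked the model to reply with a JSON object. The schema in
# REQUIRED_FIELDS is hypothetical; adapt it to whatever format the prompt requests.
import json

REQUIRED_FIELDS = {"metric": str, "value": (int, float), "confidence": str}

def validate_structured_response(raw_text):
    """Return (parsed_dict, problems); `problems` is empty when the response is usable."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return None, ["expected a JSON object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in parsed:
            problems.append(f"missing field: {field}")
        elif not isinstance(parsed[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return parsed, problems

good_reply = '{"metric": "gross_margin", "value": 18.2, "confidence": "medium"}'
bad_reply = '{"metric": "gross_margin", "value": "about 18%"}'
print(validate_structured_response(good_reply)[1])
print(validate_structured_response(bad_reply)[1])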
</VSCode.Cell>
<VSCode.Cell id="m5n6o7p8" language="markdown">
### Implementing Best Practices: A Practical Example

Let's demonstrate how to apply some of these best practices to improve the reliability of an LLM for a financial task.
</VSCode.Cell>
<VSCode.Cell id="q9r0s1t2" language="python">
# Example: Improving financial analysis with better prompting and verification

# First, let's define a basic prompt for analyzing a company
basic_prompt = """
Analyze Tesla's financial performance and provide an investment recommendation.
"""

# Now, let's create an improved prompt applying best practices
improved_prompt = """
I need you to analyze Tesla's (TSLA) financial performance based on the following data from their most recent quarterly report:

Revenue: $25.17 billion
EPS: $0.71
Free Cash Flow: $1.6 billion
Automotive Gross Margin: 18.2%
YoY Revenue Growth: 8.2%

Please provide your analysis in the following structured format:
1. Key Financial Metrics Assessment (analyze each metric separately)
2. Strengths and Concerns (minimum 3 of each)
3. Future Outlook (consider both bull and bear cases)
4. Investment Perspective (with explicit uncertainty acknowledgment)

Important notes:
- State any assumptions you're making
- Highlight what additional information would strengthen the analysis
- Do NOT provide a specific stock price target
- Express appropriate uncertainty in your conclusions
- This analysis is for educational purposes only and not financial advice
"""

# Function to get and time a response
def get_timed_response(prompt, model="gpt-4-turbo"):
    start_time = time.time()
    response = test_model_response(prompt, model)
    end_time = time.time()
    return response, end_time - start_time

# Get responses for both prompts
print("Basic Prompt Response:")
print("-" * 80)
basic_response, basic_time = get_timed_response(basic_prompt)
print(basic_response)
print(f"\nResponse time: {basic_time:.2f} seconds")
print("\n" + "=" * 80 + "\n")

print("Improved Prompt Response:")
print("-" * 80)
improved_response, improved_time = get_timed_response(improved_prompt)
print(improved_response)
print(f"\nResponse time: {improved_time:.2f} seconds")
print("\n" + "=" * 80 + "\n")

# Comparing the differences
print("Response Comparison:")
print(f"Basic prompt length: {len(basic_prompt)} characters")
print(f"Improved prompt length: {len(improved_prompt)} characters")
print(f"Basic response length: {len(basic_response)} characters")
print(f"Improved response length: {len(improved_response)} characters")
print(f"Basic response time: {basic_time:.2f} seconds")
print(f"Improved response time: {improved_time:.2f} seconds")
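
# --- Grounding check: which numbers in a response were not supplied in the prompt? ---
# A rough sketch of the "fact-check" practice, assuming plain-text prompts and responses.
# It is a crude heuristic: derived figures (ratios, sums) are flagged even when correct,
# so treat the output as a review list, not a verdict. Helper names are illustrative.
import re

def extract_numbers(text):
    """Return the set of numeric tokens in `text`, with thousands separators removed."""
    return {token.replace(",", "") for token in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}

def unsupported_numbers(prompt_text, response_text):
    """Numbers that appear in the response but nowhere in the prompt."""
    return sorted(extract_numbers(response_text) - extract_numbers(prompt_text))

sample_prompt = "Revenue: $25.17 billion. EPS: $0.71."
sample_response = "Revenue of $25.17 billion and EPS of $0.71 suggest a target of $310."
print(unsupported_numbers(sample_prompt, sample_response))
# To review a live response: unsupported_numbers(improved_prompt, improved_response)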
</VSCode.Cell>
<VSCode.Cell id="u3v4w5x6" language="markdown">
### Key Differences in Responses

The comparison between basic and improved prompts demonstrates several important points:

1. **Specificity and Structure**:
   - The improved prompt produced a more structured, comprehensive analysis
   - Specific data points anchored the model to factual information
   - Clear formatting instructions resulted in more organized output

2. **Uncertainty Expression**:
   - The improved prompt elicited more appropriate expressions of uncertainty
   - Limitations of the analysis were more clearly acknowledged
   - Multiple perspectives (bull/bear cases) were considered

3. **Safety and Responsibility**:
   - The improved prompt resulted in appropriate disclaimers
   - Speculative elements were more clearly labeled
   - The response acknowledged information gaps

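4. **Efficiency Trade-offs**:
   - The improved prompt typically takes longer to process, largely because it elicits a longer response
   - A single timed call is noisy, so repeat the measurement before drawing conclusions about latency
   - Character counts measure length, not quality, so judge the analysis itself before deciding the extra time was worthwhile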

This demonstrates how thoughtful prompt engineering can significantly improve LLM outputs for financial applications.
</VSCode.Cell>
<VSCode.Cell id="y7z8a9b0" language="markdown">
## 7. Future Outlook: Emerging Capabilities

Let's examine emerging capabilities and trends in AI that may address current limitations in financial applications.
</VSCode.Cell>
<VSCode.Cell id="c1d2e3f4" language="python">
# Define emerging capabilities and their potential impact
emerging_capabilities = {
    'Capability': [
        'Multimodal financial analysis',
        'Tool use and API integration',
        'Long-context window processing',
        'Advanced retrieval augmentation',
        'Agent-based autonomous systems',
        'Specialized financial fine-tuning',
        'Real-time market data integration',
        'Advanced numerical reasoning',
        'Code execution for calculations',
        'Causal reasoning improvements',
        'Uncertainty quantification',
        'Self-correction mechanisms'
    ],
    'Current Maturity (2025)': [
        'Medium', 'High', 'High', 'Medium', 'Low', 'Medium', 'Low', 'Medium', 'Medium', 'Low', 'Low', 'Medium'
    ],
    'Projected Timeline': [
        '1-2 years', 'Now', 'Now', '1 year', '2-3 years', '1 year', '1-2 years', '1-2 years', 'Now', '3-5 years', '2-3 years', '1-2 years'
    ],
    'Potential Impact': [
        'High', 'Very High', 'High', 'High', 'Very High', 'High', 'Very High', 'High', 'High', 'Very High', 'High', 'High'
    ],
    'Key Limitation Addressed': [
        'Information Integration', 'Factual Accuracy', 'Context Understanding', 'Factual Accuracy', 'Decision Making', 
        'Domain Expertise', 'Temporal Awareness', 'Numerical Reasoning', 'Calculation Accuracy', 'Causal Understanding', 
        'Overconfidence', 'Error Correction'
    ]
}

capabilities_timeline_df = pd.DataFrame(emerging_capabilities)

# Convert timeline to numeric values for plotting
timeline_map = {
    'Now': 0,
    '1 year': 1,
    '1-2 years': 1.5,
    '2-3 years': 2.5,
    '3-5 years': 4
}
capabilities_timeline_df['Timeline_Numeric'] = capabilities_timeline_df['Projected Timeline'].map(timeline_map)

# Convert impact to numeric
impact_map = {'High': 1, 'Very High': 2}
capabilities_timeline_df['Impact_Numeric'] = capabilities_timeline_df['Potential Impact'].map(impact_map)

# Convert maturity to numeric
maturity_map = {'Low': 1, 'Medium': 2, 'High': 3}
capabilities_timeline_df['Maturity_Numeric'] = capabilities_timeline_df['Current Maturity (2025)'].map(maturity_map)

# Scatter plot: projected timeline (x) vs. limitation addressed (y),
# sized by current maturity and colored by potential impact
fig = px.scatter(
    capabilities_timeline_df,
    x='Timeline_Numeric',
    y='Key Limitation Addressed',
    size='Maturity_Numeric',
    color='Impact_Numeric',
    text='Capability',
    labels={
        'Timeline_Numeric': 'Projected Timeline (years)',
        'Key Limitation Addressed': 'Limitation Addressed',
        'Impact_Numeric': 'Potential Impact'
    },
    title='Emerging AI Capabilities for Finance: Timeline and Impact',
    size_max=25,
    color_continuous_scale='Viridis',
    height=700,
    width=1000
)

fig.update_traces(textposition='top center')
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=[0, 1, 1.5, 2.5, 4],
        ticktext=['Now', '1 year', '1-2 years', '2-3 years', '3-5 years']
    )
)

fig.show()

# Create a horizontal bar chart showing distribution of capabilities by limitation
limitation_counts = capabilities_timeline_df.groupby('Key Limitation Addressed').size().reset_index(name='count')
limitation_counts = limitation_counts.sort_values('count', ascending=False)

fig = px.bar(
    limitation_counts,
    y='Key Limitation Addressed',
    x='count',
    labels={'count': 'Number of Emerging Capabilities', 'Key Limitation Addressed': 'Limitation'},
    title='Distribution of Emerging Capabilities by Limitation Addressed',
    orientation='h',
    color='count',
    color_continuous_scale='Viridis'
)

fig.update_layout(
    height=500,
    yaxis_title=""
)

fig.show()
</VSCode.Cell>
<VSCode.Cell id="g3h4i5j6" language="markdown">
### Key Trends in AI Development for Finance

Based on the analysis of emerging capabilities, several key trends are shaping the future of AI in finance:

1. **Integration and Augmentation**:
   - Tool use and API integration enabling access to real-time data
   - Retrieval augmentation improving factual accuracy
   - Code execution capabilities enhancing calculation reliability

2. **Specialized Financial AI**:
   - Domain-specific fine-tuning for financial applications
   - Financial reasoning improvements through specialized training
   - Industry-specific benchmarks driving targeted improvements

3. **Multi-Agent Systems**:
   - Autonomous agent architectures for complex financial tasks
   - Specialized agents for different aspects of financial analysis
   - Collaborative systems combining different expertise areas

4. **Improved Reasoning Capabilities**:
   - Enhanced numerical reasoning abilities
   - Progress in causal understanding of financial phenomena
   - Better uncertainty quantification in predictions

While these developments promise to address many current limitations, financial professionals should maintain appropriate skepticism and verification processes as these capabilities mature.
</VSCode.Cell>
<VSCode.Cell id="k7l8m9n0" language="markdown">
## 8. Guidelines for Financial Professionals

Based on our exploration of capabilities, limitations, and best practices, here are practical guidelines for financial professionals working with LLMs.
</VSCode.Cell>
<VSCode.Cell id="o1p2q3r4" language="python">
# Define guidelines for different financial roles
financial_roles_guidelines = {
    'Role': [
        'Investment Analyst', 'Investment Analyst', 'Investment Analyst', 'Investment Analyst',
        'Portfolio Manager', 'Portfolio Manager', 'Portfolio Manager', 'Portfolio Manager',
        'Risk Manager', 'Risk Manager', 'Risk Manager', 'Risk Manager',
        'Financial Advisor', 'Financial Advisor', 'Financial Advisor', 'Financial Advisor',
        'Compliance Officer', 'Compliance Officer', 'Compliance Officer', 'Compliance Officer'
    ],
    'Guideline': [
        'Use LLMs to process and summarize large volumes of financial documents',
        'Verify all factual claims and numbers from LLM outputs',
        'Compare LLM analysis with traditional financial models',
        'Have LLMs generate alternative hypotheses to challenge your thinking',
        'Use LLMs for idea generation and initial screening only',
        'Incorporate LLM insights as one of multiple inputs to decision process',
        'Apply strict verification for any LLM-suggested allocation changes',
        'Use LLMs to identify potential blind spots in portfolio construction',
        'Have LLMs generate comprehensive risk factor lists for review',
        'Use structured prompts for scenario analysis generation',
        'Verify all mathematical calculations in risk models',
        'Maintain human oversight for all critical risk assessments',
        'Use LLMs to prepare personalized client communications',
        'Never share unverified LLM outputs directly with clients',
        'Have LLMs explain complex financial concepts for client education',
        'Use LLMs to generate comprehensive meeting preparation materials',
        'Use LLMs to summarize regulatory documents and updates',
        'Apply LLMs to scan communications for potential compliance issues',
        'Maintain human review of all compliance-critical outputs',
        'Use LLMs to generate initial drafts of compliance documentation'
    ],
    'Appropriate Use': [
        'Research Efficiency', 'Critical Thinking', 'Comparative Analysis', 'Bias Mitigation',
        'Ideation', 'Decision Support', 'Quality Control', 'Risk Identification',
        'Comprehensive Analysis', 'Scenario Planning', 'Calculation Verification', 'Oversight',
        'Personalization', 'Client Protection', 'Education', 'Preparation',
        'Information Processing', 'Monitoring', 'Risk Management', 'Efficiency'
    ],
    'Risk Level': [
        'Low', 'Medium', 'Low', 'Low',
        'High', 'Medium', 'High', 'Medium',
        'Medium', 'Medium', 'High', 'High',
        'Medium', 'Very High', 'Low', 'Low',
        'Medium', 'Medium', 'High', 'Medium'
    ]
}

roles_df = pd.DataFrame(financial_roles_guidelines)

# Count guidelines per role and risk level (input for the stacked bar chart below)
roles_pivot = roles_df.pivot_table(
    index='Role',
    columns='Risk Level',
    values='Guideline',
    aggfunc='count',
    fill_value=0
)

# Ensure all risk levels are in the columns
for risk in ['Low', 'Medium', 'High', 'Very High']:
    if risk not in roles_pivot.columns:
        roles_pivot[risk] = 0

# Reorder columns by risk level
roles_pivot = roles_pivot[['Low', 'Medium', 'High', 'Very High']]

# Create the stacked bar chart
fig = px.bar(
    roles_pivot,
    barmode='stack',
    labels={'value': 'Number of Guidelines', 'Risk Level': 'Risk Level'},
    title='Guidelines Distribution by Financial Role and Risk Level',
    color_discrete_map={
        'Low': 'green',
        'Medium': 'gold',
        'High': 'orange',
        'Very High': 'red'
    },
    height=500
)

fig.update_layout(
    xaxis_title="Financial Role",
    yaxis_title="Number of Guidelines",
    legend_title="Risk Level"
)

fig.show()

# Create a table of selected guidelines for each role
# (plotly.graph_objects is already imported as `go` in the setup cell)

# Prepare table data
roles = roles_df['Role'].unique()
table_data = []

for role in roles:
    role_guidelines = roles_df[roles_df['Role'] == role]
    # Take the first Low-risk and the first High/Very High-risk guideline per role ("N/A" if none)
    low_risk = role_guidelines[role_guidelines['Risk Level'] == 'Low']['Guideline'].values
    low_risk = low_risk[0] if len(low_risk) > 0 else "N/A"
    high_risk = role_guidelines[role_guidelines['Risk Level'].isin(['High', 'Very High'])]['Guideline'].values
    high_risk = high_risk[0] if len(high_risk) > 0 else "N/A"
    table_data.append([role, low_risk, high_risk])

# Create a table figure
fig = go.Figure(data=[go.Table(
    header=dict(
        values=['Financial Role', 'Recommended Use (Low Risk)', 'Use with Caution (High Risk)'],
        fill_color='royalblue',
        align='left',
        font=dict(color='white', size=12)
    ),
    cells=dict(
        values=list(map(list, zip(*table_data))),
        fill_color='lavender',
        align='left',
        height=40,
        font=dict(size=11)
    )
)])

fig.update_layout(
    title="Selected Guidelines by Financial Role",
    height=400
)

fig.show()
</VSCode.Cell>
<VSCode.Cell id="s5t6u7v8" language="markdown">
## 9. Conclusion: Realistic Assessment of AI in Finance

Let's synthesize our findings into a realistic assessment of AI's current state in finance.
</VSCode.Cell>
<VSCode.Cell id="w9x0y1z2" language="python">
# Create a SWOT analysis for LLMs in finance
swot_data = {
    'Category': [
        'Strengths', 'Strengths', 'Strengths', 'Strengths', 'Strengths',
        'Weaknesses', 'Weaknesses', 'Weaknesses', 'Weaknesses', 'Weaknesses',
        'Opportunities', 'Opportunities', 'Opportunities', 'Opportunities', 'Opportunities',
        'Threats', 'Threats', 'Threats', 'Threats', 'Threats'
    ],
    'Point': [
        'Efficient processing of large volumes of financial text',
        'Ability to synthesize multiple information sources',
        'Generation of high-quality financial content',
        'Consistent application of analysis frameworks',
        'Accessibility of financial expertise to broader audiences',
        
        'Factual inaccuracies and hallucination risks',
        'Inconsistent numerical and mathematical reasoning',
        'Limited causal understanding of market dynamics',
        'Temporal limitations due to training data cutoffs',
        'Overconfidence in predictions and recommendations',
        
        'Integration with specialized financial tools and data sources',
        'Customization for specific financial workflows',
        'Democratization of sophisticated financial analysis',
        'Automation of routine analytical tasks',
        'Enhanced human decision-making through collaboration',
        
        'Overreliance leading to systemic financial risks',
        'Regulatory challenges and compliance concerns',
        'Privacy and data security vulnerabilities',
        'Model homogenization causing correlated mistakes',
        'Erosion of critical thinking and domain expertise'
    ]
}

swot_df = pd.DataFrame(swot_data)

# Create a SWOT visualization
fig = go.Figure()

# Define quadrant positions and colors
quadrants = {
    'Strengths': {'x': 0.25, 'y': 0.75, 'color': 'rgba(0, 204, 150, 0.2)'},
    'Weaknesses': {'x': 0.75, 'y': 0.75, 'color': 'rgba(255, 51, 51, 0.2)'},
    'Opportunities': {'x': 0.25, 'y': 0.25, 'color': 'rgba(51, 153, 255, 0.2)'},
    'Threats': {'x': 0.75, 'y': 0.25, 'color': 'rgba(255, 153, 51, 0.2)'}
}

# Create background rectangles for each quadrant
for category, props in quadrants.items():
    fig.add_shape(
        type="rect",
        x0=props['x'] - 0.25,
        y0=props['y'] - 0.25,
        x1=props['x'] + 0.25,
        y1=props['y'] + 0.25,
        fillcolor=props['color'],
        line=dict(color="white", width=2),
    )
    
    # Add category label
    fig.add_annotation(
        x=props['x'],
        y=props['y'] + 0.2,
        text=f"<b>{category}</b>",
        showarrow=False,
        font=dict(size=16)
    )
    
    # Add bullet points
    points = swot_df[swot_df['Category'] == category]['Point'].tolist()
    bullet_text = '<br>• ' + '<br>• '.join(points)
    
    fig.add_annotation(
        x=props['x'],
        y=props['y'] - 0.05,
        text=bullet_text,
        showarrow=False,
        font=dict(size=12),
        align="left"
    )

fig.update_layout(
    title="SWOT Analysis: LLMs in Finance (2025)",
    showlegend=False,
    width=1000,
    height=800,
    plot_bgcolor='white',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[0, 1]),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[0, 1])
)

fig.show()
</VSCode.Cell>
<VSCode.Cell id="a3b4c5d6" language="markdown">
### The Realistic Future of AI in Finance

Based on the analysis in this notebook, we can draw several conclusions about the current state and future of AI in finance:

1. **Augmentation, Not Replacement**:
   - AI systems are more likely to enhance the work of financial professionals than to replace them
   - The most effective implementations will be human-AI collaborative workflows
   - Domain expertise remains essential for effective AI deployment

2. **Evolving Capabilities with Persistent Limitations**:
   - Factual accuracy and mathematical reasoning will improve but remain imperfect
   - Causal understanding will be a long-term challenge
   - Training data cutoffs will persist, so access to current information will depend on retrieval and tool integration

3. **Responsibility and Governance**:
   - Human oversight remains essential for high-stakes financial decisions
   - Effective governance frameworks are needed for AI deployment
   - Transparency about AI limitations is a professional responsibility

4. **Competitive Advantage**:
   - The competitive edge will come from effective implementation, not just AI access
   - Understanding both capabilities and limitations will be crucial
   - Creative application to specific workflows will drive value

Financial professionals who understand both what AI can and cannot do, and who build workflows that use its strengths while mitigating its weaknesses, will be best positioned as the field evolves.
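
The human-oversight point above can be made concrete with a small release gate. This is a sketch under stated assumptions: `REVIEW_POLICY`, `LLMOutput`, and `release_decision` are hypothetical names, and the policy is an example, not a regulatory requirement. It reuses the four risk levels from the guidelines in Section 8 and holds any output above "Low" until a human has signed off.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy: the review each risk level needs before release.
REVIEW_POLICY = {
    "Low": "none",
    "Medium": "one reviewer",
    "High": "one reviewer plus independent recalculation",
    "Very High": "senior sign-off",
}


@dataclass
class LLMOutput:
    task: str
    risk_level: str
    text: str
    reviewed_by: Optional[str] = None  # name of the human who signed off, if any


def release_decision(output: LLMOutput) -> str:
    """Return 'release' or a 'hold: ...' message describing the missing review."""
    if output.risk_level not in REVIEW_POLICY:
        raise ValueError(f"Unknown risk level: {output.risk_level!r}")
    if output.risk_level == "Low" or output.reviewed_by is not None:
        return "release"
    return f"hold: requires {REVIEW_POLICY[output.risk_level]}"


draft = LLMOutput("Client letter on rate changes", "Very High", "Dear client, ...")
print(release_decision(draft))  # held until a human signs off
draft.reviewed_by = "j.smith"   # hypothetical reviewer
print(release_decision(draft))
```

The gate only records that a review happened. A production workflow would also need to check that the reviewer is authorized for that risk level and keep an audit trail.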
</VSCode.Cell>
<VSCode.Cell id="e7f8g9h0" language="markdown">
## 10. Additional Resources

For further exploration of AI capabilities and limitations in finance:

**Research Papers and Publications:**
- "Large Language Models in Finance: A Survey" (Journal of Financial Data Science, 2024)
- "Evaluating LLMs for Financial Analysis Tasks" (FinLLM Benchmark Paper, 2024)
- "Model Risk Management for LLM Applications in Banking" (Federal Reserve, 2025)

**Tools and Frameworks:**
- [FinGPT](https://github.com/AI4Finance-Foundation/FinGPT) - Open-source LLMs for finance
- [LangChain](https://www.langchain.com/) - Framework for building LLM applications
- [HELM](https://crfm.stanford.edu/helm/) - Holistic Evaluation of Language Models

**Industry Reports:**
- "AI in Financial Services: State of the Industry" (McKinsey, 2025)
- "The Impact of Generative AI on Financial Services" (World Economic Forum, 2024)
- "Global Financial Stability Report: AI Edition" (IMF, 2025)

**Regulatory Guidance:**
- SEC guidance on AI use in investment management
- EU AI Act implications for financial services
- Financial Stability Board reports on AI and financial stability
</VSCode.Cell>
</VSCode>```