

## Dataset Creation for the Financial Prompt Optimization Interface

#### How the Dataset Was Created:

The dataset was generated with a Python script that uses the standard-library `random`, `csv`, and `datetime` modules. The script defines several categories of financial queries and, for each category, fills a fixed set of query templates with randomly chosen companies, timeframes, instruments, or sectors.

Here's a breakdown of the creation process:

1.  **Defining Core Entities and Categories:**
    * **Companies and Tickers:** A list of major US-based public companies and their stock tickers (e.g., Apple, AAPL) was established to ensure real-world relevance.
    * **Timeframes:** Common time references used in financial reporting (e.g., "last quarter," "2023," "this year") were defined.
    * **Financial Terms:** A vocabulary of financial analysis terms (e.g., "performance," "outlook," "valuation") was curated to add variation to queries.
    * **Fixed Income Instruments:** Various types of fixed income securities (e.g., "corporate bonds," "Treasury notes") were included for a specialized category.
    * **Market Sectors:** Key market sectors (e.g., "technology," "healthcare," "financial") were listed to facilitate sector-specific queries.

2.  **Intent-Specific Query Generation Functions:**
    * **`generate_stock_queries()`:** Focuses on queries about individual company stocks (e.g., "Analyze Apple stock performance," "AAPL stock forecast and price targets"). Each query fills one of ten templates with a randomly chosen company name or ticker; one template also inserts a financial term. The stock templates do not use timeframes.
    * **`generate_quarterly_queries()`:** Generates queries specifically asking for information about company quarterly or annual financial reports (e.g., "Summary of Microsoft's Q1 report," "Key metrics from GOOGL's latest earnings").
    * **`generate_fixed_income_queries()`:** Creates queries related to fixed income instruments, incorporating companies and various bond types (e.g., "Yield analysis for Amazon corporate bonds," "Credit risk assessment of TSLA bonds").
    * **`generate_sector_queries()`:** Produces queries focused on market sector analysis (e.g., "Technology sector performance last year," "Growth projections for healthcare sector").
    * **`generate_unclassified_queries()`:** Returns up to 25 hard-coded out-of-scope queries rather than generating them from templates. Most are non-financial (e.g., "What's the weather forecast tomorrow?", "Recipe for chocolate chip cookies"); a few are finance-adjacent but outside the four financial intents (e.g., "How to improve my credit score"). These examples let a model learn to recognize when a user's intent *does not* fall into any predefined financial category, which helps reduce misclassifications.

3.  **Dataset Assembly and Shuffling:**
    * The `generate_dataset()` function calls each of the specific generation functions to create a collection of queries for each intended category.
    * The generated queries are then combined into a single list.
    * `random.shuffle()` then randomizes the order of the combined list in place, so the rows in the final dataset are not grouped by intent. This avoids ordering bias if the rows are consumed sequentially, for example when training a model. No seed is set, so each run produces a different dataset.

4.  **CSV Output:**
    * The `save_to_csv()` function writes the generated dataset into a CSV file. Each row in the CSV contains two columns:
        * `query`: The generated user query string.
        * `expected_intent`: The predefined category, or intent, for that query (e.g., `stock_analysis`, `quarterly_report_summary`, `unclassified`).
    * A timestamp is incorporated into the filename (e.g., `financial_queries_20250711_042810.csv`) to ensure unique filenames for each run.
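
Step 3 shuffles with the module-level generator and no seed, so two runs never produce the same file. If repeatable output is wanted, one option is a private, seeded generator. The following is a minimal sketch under that assumption; `build_shuffled` and its `seed` parameter are hypothetical and not part of the original script.

```python
import random

def build_shuffled(rows, seed=42):
    """Return a shuffled copy of rows using a private, seeded generator.

    Hypothetical helper: the original script calls random.shuffle() directly.
    """
    rng = random.Random(seed)   # independent of the global random state
    shuffled = list(rows)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled

rows = [
    ("Analyze Apple stock performance", "stock_analysis"),
    ("Current rates for TIPS", "fixed_income_analysis"),
    ("Recipe for chocolate chip cookies", "unclassified"),
    ("Growth projections for energy sector", "sector_analysis"),
]

first = build_shuffled(rows, seed=7)
second = build_shuffled(rows, seed=7)
print(first == second)  # the same seed gives the same order
```

Using a dedicated `random.Random` instance keeps the shuffle repeatable even when other code also draws from the global generator.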

#### What Was Created:

The output is a CSV file named `financial_queries_YYYYMMDD_HHMMSS.csv`. It contains 160 queries, each labeled with its expected intent. Because the template-based generators sample with replacement, some query strings may appear more than once.

Specifically, the dataset includes:

* **40 Stock Analysis Queries:** Examples: "Analyze Microsoft stock performance," "AAPL stock forecast and price targets."
* **35 Quarterly Report Summary Queries:** Examples: "Summary of Amazon's last quarter report," "Key metrics from NVDA's latest earnings."
* **30 Fixed Income Analysis Queries:** Examples: "Yield analysis for JPMorgan corporate bonds," "Credit risk assessment of MSFT bonds."
* **30 Sector Analysis Queries:** Examples: "Technology sector performance this year," "Growth projections for healthcare sector."
* **25 Unclassified Queries:** Examples: "Should I buy or rent a home?", "Recipe for chocolate chip cookies."

This labeled dataset serves as a starting resource for developing and evaluating NLU models for the financial domain, where the goal is to map each user request to the correct intent so that the system can return relevant information.
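
One way to check these per-intent counts against a generated file is to read it back and tally the labels. Because the template-based generators sample with replacement, the same query string can occur more than once, so the sketch below also counts repeats. `summarize` is a hypothetical helper, and the in-memory sample stands in for a real file.

```python
import csv
import io
from collections import Counter

def summarize(csv_file):
    """Return (rows per intent, number of repeated query strings)."""
    reader = csv.DictReader(csv_file)
    intents = Counter()
    queries = Counter()
    for row in reader:
        intents[row["expected_intent"]] += 1
        queries[row["query"]] += 1
    # Every occurrence beyond the first counts as one duplicate.
    duplicates = sum(count - 1 for count in queries.values())
    return intents, duplicates

# Small in-memory stand-in with the same two-column header as the real file.
sample = io.StringIO(
    "query,expected_intent\n"
    "Technical analysis for AAPL,stock_analysis\n"
    "Technical analysis for AAPL,stock_analysis\n"
    "Current rates for TIPS,fixed_income_analysis\n"
)
intents, duplicates = summarize(sample)
print(dict(intents), duplicates)
```

For a file on disk, the same function can be called on a handle opened with `open(filename, newline="", encoding="utf-8")`.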

In [13]:
import csv
import random
from datetime import datetime

# US-based public companies and tickers
COMPANIES = [
    ("Apple", "AAPL"), ("Microsoft", "MSFT"), ("Amazon", "AMZN"),
    ("Google", "GOOGL"), ("Tesla", "TSLA"), ("Meta", "META"),
    ("NVIDIA", "NVDA"), ("JPMorgan", "JPM"), ("Bank of America", "BAC"),
    ("Walmart", "WMT"), ("Johnson & Johnson", "JNJ"), ("Exxon", "XOM"),
    ("Pfizer", "PFE"), ("Netflix", "NFLX"), ("Disney", "DIS"),
    ("Coca-Cola", "KO"), ("Intel", "INTC"), ("AMD", "AMD"),
    ("Salesforce", "CRM"), ("Oracle", "ORCL")
]

# Timeframes for reports
TIME_FRAMES = [
    "last quarter", "Q1", "Q2", "Q3", "Q4", "2023", "2024",
    "this year", "last year", "recent", "previous", "last 6 months"
]

# Financial terms for variation
FINANCIAL_TERMS = [
    "performance", "outlook", "trends", "metrics", "valuation",
    "growth", "projections", "indicators", "ratios", "fundamentals"
]

# Fixed income instruments
FIXED_INCOME_TYPES = [
    "corporate bonds", "Treasury notes", "municipal bonds", "TIPS",
    "commercial paper", "high-yield bonds", "investment grade debt",
    "agency bonds", "zero-coupon bonds", "floating rate notes"
]

# Market sectors
SECTORS = [
    "technology", "healthcare", "financial", "consumer cyclical",
    "energy", "industrials", "utilities", "real estate",
    "materials", "communication services"
]

# Generate stock analysis queries
def generate_stock_queries(n=40):
    queries = []
    for _ in range(n):
        company, ticker = random.choice(COMPANIES)
        term = random.choice(FINANCIAL_TERMS)

        patterns = [
            f"Analyze {company} stock {term}",
            f"Technical analysis for {ticker}",
            f"How is {company} stock performing?",
            f"{ticker} stock forecast and price targets",
            f"Valuation metrics for {company} shares",
            f"Show me the fundamentals for {ticker}",
            f"Investment potential of {company} stock",
            f"Recent price action for {ticker}",
            f"Comparative analysis of {company} vs competitors",
            f"Long-term outlook for {ticker} shares"
        ]
        queries.append((random.choice(patterns), "stock_analysis"))
    return queries

# Generate quarterly report queries
def generate_quarterly_queries(n=35):
    queries = []
    for _ in range(n):
        company, ticker = random.choice(COMPANIES)
        time_frame = random.choice(TIME_FRAMES)
        
        patterns = [
            f"Summary of {company}'s {time_frame} report",
            f"Key metrics from {ticker}'s latest earnings",
            f"{company} {time_frame} financial highlights",
            f"Revenue and profit for {ticker} {time_frame}",
            f"{time_frame} earnings call summary for {company}",
            f"Did {ticker} beat estimates last quarter?",
            f"Cash flow statement for {company} {time_frame}",
            f"{ticker} earnings per share {time_frame}",
            f"Balance sheet summary for {company}",
            f"Operating margins for {ticker} {time_frame}"
        ]
        queries.append((random.choice(patterns), "quarterly_report_summary"))
    return queries

# Generate fixed income queries
def generate_fixed_income_queries(n=30):
    queries = []
    for _ in range(n):
        company, ticker = random.choice(COMPANIES)
        inst_type = random.choice(FIXED_INCOME_TYPES)
        
        patterns = [
            f"Yield analysis for {company} {inst_type}",
            f"Credit risk assessment of {ticker} bonds",
            f"Current rates for {inst_type}",
            f"{company} debt instrument analysis",
            f"Duration and convexity for {ticker} bonds",
            f"Spreads for {inst_type} in current market",
            f"Default probabilities for {company} debt",
            f"Analysis of {ticker} corporate bonds",
            f"Liquidity in {inst_type} market",
            f"Yield curve for {company} fixed income"
        ]
        queries.append((random.choice(patterns), "fixed_income_analysis"))
    return queries

# Generate sector analysis queries
def generate_sector_queries(n=30):
    queries = []
    for _ in range(n):
        sector = random.choice(SECTORS)
        time_frame = random.choice(TIME_FRAMES + [""])
        
        patterns = [
            f"{sector.capitalize()} sector performance {time_frame}",
            f"Industry analysis for {sector} companies",
            f"Growth projections for {sector} sector",
            f"Competitive landscape in {sector} industry",
            f"Top performers in {sector} sector {time_frame}",
            f"Market share analysis for {sector}",
            f"Regulatory impact on {sector} industry",
            f"Supply chain dynamics in {sector} sector",
            f"Emerging trends in {sector} market",
            f"Profitability metrics across {sector} companies"
        ]
        # An empty time_frame leaves a trailing space in some patterns, so strip it.
        queries.append((random.choice(patterns).strip(), "sector_analysis"))
    return queries

# Generate out-of-scope queries
def generate_unclassified_queries(n=25):
    queries = [
        ("Should I buy or rent a home?", "unclassified"),
        ("Best way to save for retirement", "unclassified"),
        ("How to pay off student loans faster", "unclassified"),
        ("What's the weather forecast tomorrow?", "unclassified"),
        ("Recipe for chocolate chip cookies", "unclassified"),
        ("Career advice for finance professionals", "unclassified"),
        ("How to improve my credit score", "unclassified"),
        ("Where should I go for vacation?", "unclassified"),
        ("Compare car insurance quotes", "unclassified"),
        ("Meditation techniques for stress relief", "unclassified"),
        ("What is the meaning of life?", "unclassified"),
        ("Best books for personal development", "unclassified"),
        ("How to start a small business", "unclassified"),
        ("Tips for job interviews", "unclassified"),
        ("Healthy meal prep ideas", "unclassified"),
        ("Python programming tutorials", "unclassified"),
        ("How to change my password", "unclassified"),
        ("Local restaurants with vegan options", "unclassified"),
        ("Upcoming concerts in my area", "unclassified"),
        ("How to fix a leaky faucet", "unclassified"),
        ("Benefits of yoga for beginners", "unclassified"),
        ("Best time to visit Japan", "unclassified"),
        ("How to train for a marathon", "unclassified"),
        ("DIY home renovation tips", "unclassified"),
        ("Signs of a toxic workplace", "unclassified")
    ]
    return queries[:n]

# Generate dataset
def generate_dataset():
    dataset = []
    dataset += generate_stock_queries(40)
    dataset += generate_quarterly_queries(35)
    dataset += generate_fixed_income_queries(30)
    dataset += generate_sector_queries(30)
    dataset += generate_unclassified_queries(25)
    random.shuffle(dataset)
    return dataset

# Save to CSV
def save_to_csv(filename, dataset):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['query', 'expected_intent'])
        writer.writerows(dataset)

# Main execution
if __name__ == "__main__":
    dataset = generate_dataset()
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"financial_queries_{timestamp}.csv"
    save_to_csv(filename, dataset)
    print(f"Generated {len(dataset)} queries in {filename}")

Generated 160 queries in financial_queries_20250707_151113.csv
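
The generators above pick a template and an entity independently on every iteration, so repeated queries are possible. If unique queries are required, one alternative is to enumerate every template and entity combination once and then sample without replacement. This is a sketch of that alternative, not the method used above; `unique_stock_queries`, `TEMPLATES`, and the shortened company list are hypothetical.

```python
import itertools
import random

# Shortened, illustrative lists; the real script uses 20 companies and 10 patterns.
COMPANIES = [("Apple", "AAPL"), ("Microsoft", "MSFT"), ("Amazon", "AMZN")]
TEMPLATES = [
    "Technical analysis for {ticker}",
    "How is {company} stock performing?",
    "Long-term outlook for {ticker} shares",
]

def unique_stock_queries(n, seed=None):
    """Return n distinct (query, intent) pairs, or raise if too few exist."""
    rng = random.Random(seed)
    # Build every possible query once; the set removes accidental repeats,
    # and sorting gives a stable order so a seed yields repeatable samples.
    candidates = sorted({
        template.format(company=company, ticker=ticker)
        for template, (company, ticker) in itertools.product(TEMPLATES, COMPANIES)
    })
    if n > len(candidates):
        raise ValueError(f"only {len(candidates)} unique queries are available")
    return [(query, "stock_analysis") for query in rng.sample(candidates, n)]

result = unique_stock_queries(5, seed=1)
for query, intent in result:
    print(query, "->", intent)
```

The trade-off is that the number of unique queries is capped by the size of the template and entity lists, so requesting more than that raises an error instead of silently repeating rows.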
