# Financial Graph Query Samples

This notebook contains sample Cypher queries for analyzing the retail banking and investment portfolio graph using the **Neo4j Spark Connector**.

## Prerequisites

1. Run `import_financial_data_to_neo4j.ipynb` first to load the data
2. Databricks Secrets configured with a `neo4j-creds` scope containing the keys `url`, `username`, and `password`
3. Neo4j Spark Connector installed on the cluster

## Best Practices Used

- **Spark DataSource V2 API** for all Neo4j operations
- **Pushdown optimizations** enabled by default (filters, columns, aggregates, limits) for `labels` and `relationship` reads
- **Partitioning** for parallel reads on large result sets
- **Query mode** for complex Cypher patterns

---

## Setup Connection

In [None]:
# =============================================================================
# SPARK CONNECTOR CONFIGURATION
# =============================================================================

print("Loading Neo4j credentials from Databricks Secrets...")

NEO4J_URL = dbutils.secrets.get(scope="neo4j-creds", key="url")
NEO4J_USER = dbutils.secrets.get(scope="neo4j-creds", key="username")
NEO4J_PASS = dbutils.secrets.get(scope="neo4j-creds", key="password")
NEO4J_DATABASE = "neo4j"

# Configure Spark session for Neo4j Connector
# These settings apply to all subsequent reads/writes
spark.conf.set("neo4j.url", NEO4J_URL)
spark.conf.set("neo4j.authentication.type", "basic")
spark.conf.set("neo4j.authentication.basic.username", NEO4J_USER)
spark.conf.set("neo4j.authentication.basic.password", NEO4J_PASS)
spark.conf.set("neo4j.database", NEO4J_DATABASE)

print(f"Neo4j URL: {NEO4J_URL}")
print(f"Database: {NEO4J_DATABASE}")
print("Spark session configured for Neo4j Connector.")
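
A malformed connection URL tends to surface only when the first read fails. As a minimal sketch, the check below validates the scheme and host before the value is passed to `spark.conf.set`. The helper name `check_neo4j_url` is hypothetical and not part of the connector; the scheme list is the standard set of Neo4j driver URI schemes.

```python
from urllib.parse import urlparse

# URI schemes accepted by Neo4j drivers: plain, TLS, and TLS with self-signed certificates.
NEO4J_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def check_neo4j_url(url: str) -> str:
    """Return the URL unchanged if it looks like a Neo4j connection URI, else raise."""
    parsed = urlparse(url)
    if parsed.scheme not in NEO4J_SCHEMES:
        raise ValueError(
            f"Unexpected scheme {parsed.scheme!r}; expected one of {sorted(NEO4J_SCHEMES)}"
        )
    if not parsed.hostname:
        raise ValueError("Connection URL has no host")
    return url
```

In this notebook it could wrap the secret lookup, for example `NEO4J_URL = check_neo4j_url(dbutils.secrets.get(scope="neo4j-creds", key="url"))`.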

In [None]:
# =============================================================================
# HELPER FUNCTION FOR CYPHER QUERIES
# =============================================================================

def run_query(query: str, partitions: int = 1):
    """
    Execute a Cypher query using the Neo4j Spark Connector.
    
    Args:
        query: Cypher query string
        partitions: Number of partitions for parallel reads (default: 1)
                   Use higher values for large result sets
    
    Returns:
        Spark DataFrame with query results
    
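    Note: The connector's pushdown options are enabled by default:
        - pushdown.filters.enabled = true
        - pushdown.columns.enabled = true
        - pushdown.aggregate.enabled = true
        - pushdown.limit.enabled = true (disabled when partitions > 1)
    They optimize `labels` and `relationship` reads. In query mode the
    Cypher is sent as written, so filter, aggregate, and limit inside
    the query itself.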
    """
    return (
        spark.read
        .format("org.neo4j.spark.DataSource")
        .option("query", query)
        .option("partitions", str(partitions))
        .load()
    )

# Test connection
test_df = run_query("RETURN 'Connected!' AS status")
print(f"Connection: {test_df.collect()[0]['status']}")
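
The connector documentation states that, in query mode, a `SKIP` or `LIMIT` after the final `RETURN` is rejected, because the connector appends its own `SKIP`/`LIMIT` to paginate partitioned reads. The sketch below is a small pre-flight check for that case. The function name `ends_with_skip_or_limit` is hypothetical, and the regular expression is a simplification that does not parse Cypher.

```python
import re

# Matches a SKIP or LIMIT clause that closes the query,
# ignoring trailing whitespace and an optional semicolon.
_TRAILING_SKIP_LIMIT = re.compile(r"\b(?:SKIP|LIMIT)\s+\S+\s*;?\s*\Z", re.IGNORECASE)


def ends_with_skip_or_limit(query: str) -> bool:
    """Return True if the query text ends with a SKIP or LIMIT clause."""
    return bool(_TRAILING_SKIP_LIMIT.search(query))


# A limit placed in a WITH clause before RETURN is not flagged.
accepted = "MATCH (c:Customer) WITH c ORDER BY c.customer_id LIMIT 10 RETURN c.customer_id AS id"
flagged = "MATCH (c:Customer) RETURN c.customer_id AS id ORDER BY id LIMIT 10"
print(ends_with_skip_or_limit(accepted), ends_with_skip_or_limit(flagged))
```

Calling this at the top of `run_query` and raising a descriptive error would point to the offending query before Spark reports the failure.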

---

## Portfolio Analysis

Analyze customer investment portfolios across the graph.

In [None]:
# =============================================================================
# Total Portfolio Value by Customer
# =============================================================================
print("Top 10 Customers by Total Portfolio Value")
print("=" * 50)

# Best Practice: Use explicit WITH clause for aggregation grouping.
# Sort and limit in WITH: query mode rejects SKIP/LIMIT after the final RETURN.
query = """
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)-[:HAS_POSITION]->(p:Position)
WITH c, round(SUM(p.current_value), 2) AS total_portfolio_value
ORDER BY total_portfolio_value DESC
LIMIT 10
RETURN
    c.customer_id AS customer_id,
    c.first_name + ' ' + c.last_name AS customer_name,
    total_portfolio_value
"""

display(run_query(query))

In [None]:
# =============================================================================
# Accounts with Multiple Positions (Diversified Portfolios)
# =============================================================================
print("Accounts with Multiple Holdings")
print("=" * 50)

# Sort and limit in WITH: query mode rejects SKIP/LIMIT after the final RETURN.
query = """
MATCH (a:Account)-[:HAS_POSITION]->(p:Position)
WITH a, COUNT(p) AS position_count, round(SUM(p.current_value), 2) AS total_value
WHERE position_count > 1
WITH a, position_count, total_value
ORDER BY position_count DESC
LIMIT 10
RETURN
    a.account_id AS account,
    a.account_type AS type,
    position_count AS num_positions,
    total_value AS portfolio_value
"""

display(run_query(query))

In [None]:
# =============================================================================
# Sector Allocation Analysis
# =============================================================================
print("Investment Allocation by Sector")
print("=" * 50)

# Best Practice: Filter NULL values before aggregating.
# Cypher has no window functions, so collect the per-sector rows,
# compute the grand total, then UNWIND to derive each sector's share.
query = """
MATCH (a:Account)-[:HAS_POSITION]->(p:Position)-[:OF_SECURITY]->(s:Stock)-[:OF_COMPANY]->(c:Company)
WHERE c.sector IS NOT NULL
WITH c.sector AS sector, SUM(p.current_value) AS sector_value
WITH COLLECT({sector: sector, value: sector_value}) AS rows, SUM(sector_value) AS total_value
UNWIND rows AS row
RETURN
    row.sector AS sector,
    round(row.value, 2) AS sector_value,
    round(row.value * 100.0 / total_value, 2) AS pct_of_total
ORDER BY sector_value DESC
"""

display(run_query(query))
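
To make the `pct_of_total` column concrete, here is the same share-of-total arithmetic in plain Python. It is a sketch on made-up figures: the sector names and values are hypothetical, not taken from the graph, and `sector_shares` is not part of the notebook.

```python
def sector_shares(sector_values: dict) -> list:
    """Return (sector, value, pct_of_total) rows sorted by value, largest first."""
    total = sum(sector_values.values())
    if total == 0:
        raise ValueError("Total value is zero; shares are undefined")
    rows = [
        (sector, round(value, 2), round(value * 100.0 / total, 2))
        for sector, value in sector_values.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical per-sector totals, standing in for SUM(p.current_value) per sector.
sample = {"Healthcare": 300.0, "Technology": 600.0, "Energy": 100.0}
for sector, value, pct in sector_shares(sample):
    print(f"{sector:<12} {value:>10.2f} {pct:>6.2f}%")
```

Each sector's value is divided by the grand total across all sectors, so the percentages sum to 100 up to rounding.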

In [None]:
# =============================================================================
# Sector Diversification by Account
# =============================================================================
print("Sector Diversification Analysis")
print("=" * 50)

# Sort and limit in WITH: query mode rejects SKIP/LIMIT after the final RETURN.
query = """
MATCH (a:Account)-[:HAS_POSITION]->(p:Position)-[:OF_SECURITY]->(s:Stock)-[:OF_COMPANY]->(c:Company)
WITH a,
     COUNT(DISTINCT c.sector) AS num_sectors,
     COLLECT(DISTINCT c.sector) AS sectors,
     round(SUM(p.current_value), 2) AS total_portfolio_value
ORDER BY num_sectors DESC, total_portfolio_value DESC
LIMIT 10
RETURN
    a.account_id AS account,
    num_sectors,
    sectors,
    total_portfolio_value
"""

display(run_query(query))

---

## Transaction Network Analysis

Analyze money flow patterns across the transaction network.

In [None]:
# =============================================================================
# Accounts with Most Outbound Transactions
# =============================================================================
print("Most Active Sending Accounts")
print("=" * 50)

query = """
MATCH (a:Account)-[:PERFORMS]->(t:Transaction)
WITH a, COUNT(t) AS tx_count, round(SUM(t.amount), 2) AS total_sent
ORDER BY tx_count DESC
LIMIT 10
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a)
RETURN
    c.first_name + ' ' + c.last_name AS customer_name,
    a.account_id AS account,
    a.account_type AS account_type,
    tx_count AS num_transactions,
    total_sent
"""

display(run_query(query))

In [None]:
# =============================================================================
# Accounts That Both Send and Receive Transactions
# =============================================================================
print("Accounts with Bidirectional Transaction Flow")
print("=" * 50)

# Sort and limit in WITH: query mode rejects SKIP/LIMIT after the final RETURN.
query = """
MATCH (a:Account)-[:PERFORMS]->(:Transaction)
WITH a, COUNT(*) AS sent_count
MATCH (a)<-[:BENEFITS_TO]-(:Transaction)
WITH a, sent_count, COUNT(*) AS received_count
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a)
WITH c, a, sent_count, received_count
ORDER BY sent_count + received_count DESC
LIMIT 10
RETURN
    c.first_name + ' ' + c.last_name AS customer_name,
    a.account_id AS account,
    a.account_type AS type,
    a.balance AS balance,
    sent_count,
    received_count
"""

display(run_query(query))

In [None]:
# =============================================================================
# High-Value Transactions
# =============================================================================
print("High-Value Transactions (> $1,000)")
print("=" * 50)

# Sort and limit in WITH: query mode rejects SKIP/LIMIT after the final RETURN.
query = """
MATCH (from:Account)-[:PERFORMS]->(t:Transaction)-[:BENEFITS_TO]->(to:Account)
WHERE t.amount > 1000
WITH from, t, to
ORDER BY t.amount DESC
LIMIT 10
RETURN
    from.account_id AS sender,
    to.account_id AS recipient,
    t.amount AS amount,
    t.transaction_date AS date,
    t.description AS description
"""

display(run_query(query))

In [None]:
# =============================================================================
# Transaction Status Summary
# =============================================================================
print("Transaction Status Distribution")
print("=" * 50)

query = """
MATCH (t:Transaction)
RETURN t.status AS status, COUNT(t) AS count, round(SUM(t.amount), 2) AS total_amount
ORDER BY count DESC
"""

display(run_query(query))

---

## Risk Profile Analysis

Segment customers and analyze behavior by risk profile.

In [None]:
# =============================================================================
# Risk Profile Segmentation
# =============================================================================
print("Portfolio Characteristics by Risk Profile")
print("=" * 50)

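# Best Practice: Filter NULL values before grouping, and aggregate in stages.
# OPTIONAL MATCH yields one row per position, so averaging directly over those
# rows would weight each customer and account by its number of positions.
query = """
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
WHERE c.risk_profile IS NOT NULL
OPTIONAL MATCH (a)-[:HAS_POSITION]->(p:Position)
WITH c, a, SUM(p.current_value) AS account_investments
WITH c,
     SUM(a.balance) AS customer_balance,
     COUNT(a.balance) AS customer_accounts,
     SUM(account_investments) AS customer_investments
WITH c.risk_profile AS risk_profile,
     COUNT(c) AS num_customers,
     AVG(c.annual_income) AS avg_income,
     AVG(c.credit_score) AS avg_credit_score,
     SUM(customer_balance) AS total_balance,
     SUM(customer_accounts) AS total_accounts,
     SUM(customer_investments) AS total_investment_value
RETURN
    risk_profile,
    num_customers,
    round(avg_income, 0) AS avg_income,
    round(avg_credit_score, 0) AS avg_credit_score,
    round(toFloat(total_balance) / total_accounts, 2) AS avg_account_balance,
    round(total_investment_value, 2) AS total_investment_value
ORDER BY risk_profile
"""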

display(run_query(query))
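
One thing to watch when averaging over a pattern that joins accounts to positions: each account row is repeated once per position, which skews any average taken over the joined rows. The plain-Python sketch below illustrates the effect on made-up data. The account IDs, balances, and position values are hypothetical.

```python
from statistics import mean

# Hypothetical data: two accounts; the first holds three positions, the second none.
balances = {"A1": 1000.0, "A2": 5000.0}
positions = {"A1": [200.0, 300.0, 500.0], "A2": []}

# Emulate MATCH ... OPTIONAL MATCH: one row per (account, position),
# or a single row with None when the account has no positions.
joined_rows = [
    (account, balance, value)
    for account, balance in balances.items()
    for value in (positions[account] or [None])
]

# Averaging over joined rows counts A1's balance three times.
naive_avg = mean(balance for _, balance, _ in joined_rows)

# Averaging once per account gives the intended figure.
per_account_avg = mean(balances.values())

print(f"rows after join:  {len(joined_rows)}")
print(f"naive average:    {naive_avg}")
print(f"per-account avg:  {per_account_avg}")
```

In Cypher the equivalent remedy is to reduce to one row per account (and then one row per customer) in intermediate `WITH` clauses before computing group-level averages.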

---

## Reading Nodes and Relationships Directly

The Spark Connector can also read nodes and relationships directly without custom Cypher queries. This enables automatic pushdown optimizations.

In [None]:
# =============================================================================
# Read Nodes by Label with Pushdown
# =============================================================================
print("Reading Customer Nodes (with automatic pushdown)")
print("=" * 50)

# Read all Customer nodes - Spark handles pushdown automatically
customers_df = (
    spark.read
    .format("org.neo4j.spark.DataSource")
    .option("labels", "Customer")
    .load()
)

# Filter and select - these operations are pushed down to Neo4j
high_income_customers = (
    customers_df
    .filter("annual_income > 100000")
    .filter("credit_score > 700")
    .select("customer_id", "first_name", "last_name", "annual_income", "credit_score", "risk_profile")
    .orderBy("annual_income", ascending=False)
    .limit(10)
)

display(high_income_customers)

In [None]:
# =============================================================================
# Read Relationships Directly
# =============================================================================
print("Reading HAS_ACCOUNT Relationships")
print("=" * 50)

# Read relationships with source and target node properties
has_account_df = (
    spark.read
    .format("org.neo4j.spark.DataSource")
    .option("relationship", "HAS_ACCOUNT")
    .option("relationship.source.labels", "Customer")
    .option("relationship.target.labels", "Account")
    .load()
)

print(f"Schema: {has_account_df.columns}")
display(has_account_df.limit(5))

---

## Exporting Query Results to Delta Lake

A common pattern is to run graph queries and store results in Delta tables for downstream analytics.

In [None]:
# =============================================================================
# Export Graph Analytics to Delta Table (Example)
# =============================================================================
print("Exporting Portfolio Summary to Delta (example pattern)")
print("=" * 50)

# Query aggregated portfolio data from Neo4j
# Best Practice: Use explicit WITH for aggregation. Each position row repeats its
# account, so collect distinct accounts and add each balance exactly once.
portfolio_query = """
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
OPTIONAL MATCH (a)-[:HAS_POSITION]->(p:Position)-[:OF_SECURITY]->(s:Stock)-[:OF_COMPANY]->(co:Company)
WITH c,
     COLLECT(DISTINCT a) AS accounts,
     COUNT(DISTINCT p) AS num_positions,
     SUM(p.current_value) AS total_investments,
     COLLECT(DISTINCT co.sector) AS sectors
RETURN
    c.customer_id AS customer_id,
    c.first_name + ' ' + c.last_name AS customer_name,
    c.risk_profile AS risk_profile,
    SIZE(accounts) AS num_accounts,
    round(reduce(total = 0.0, acct IN accounts | total + coalesce(acct.balance, 0.0)), 2) AS total_balance,
    num_positions,
    round(coalesce(total_investments, 0), 2) AS total_investments,
    SIZE(sectors) AS num_sectors
ORDER BY total_investments DESC
"""

portfolio_df = run_query(portfolio_query)
display(portfolio_df.limit(10))

# Uncomment to write to Delta:
# portfolio_df.write \
#     .format("delta") \
#     .mode("overwrite") \
#     .saveAsTable("gold.customer_portfolio_summary")

---

## Graph Visualization Queries

Queries designed for visual exploration in Neo4j Browser or Bloom.

In [None]:
# =============================================================================
# Customer Financial Network (for Neo4j Browser)
# =============================================================================
print("Customer C0001's Complete Financial Network")
print("=" * 50)
print("")
print("Run this query in Neo4j Browser for visual exploration:")
print("-" * 50)

viz_query = """
MATCH path = (c:Customer {customer_id: 'C0001'})-[*1..3]-(connected)
RETURN path
LIMIT 50
"""

print(viz_query)
print("-" * 50)
print("")

# Show node counts for context
count_query = """
MATCH (c:Customer {customer_id: 'C0001'})-[*1..3]-(connected)
RETURN labels(connected)[0] AS node_type, COUNT(DISTINCT connected) AS count
ORDER BY count DESC
"""

print("Connected nodes:")
display(run_query(count_query))
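
The customer ID above is hard-coded into the Cypher text. As a sketch of how the count query might be reused for other customers, the hypothetical helper below builds the query string and validates the ID first, since the value is interpolated into the query. The `C` plus four digits format is an assumption inferred from `C0001`, and the hop cap of 5 is an arbitrary guard against very expensive traversals.

```python
import re

# Assumed ID format, inferred from the example 'C0001'.
_CUSTOMER_ID = re.compile(r"C\d{4}")


def customer_network_count_query(customer_id: str, max_hops: int = 3) -> str:
    """Build the connected-node count query for a single customer."""
    if not _CUSTOMER_ID.fullmatch(customer_id):
        raise ValueError(f"Unexpected customer ID format: {customer_id!r}")
    if not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be between 1 and 5")
    return (
        f"MATCH (c:Customer {{customer_id: '{customer_id}'}})-[*1..{max_hops}]-(connected)\n"
        "RETURN labels(connected)[0] AS node_type, COUNT(DISTINCT connected) AS count\n"
        "ORDER BY count DESC"
    )


print(customer_network_count_query("C0042"))
```

The result can be passed to the notebook's helper, for example `display(run_query(customer_network_count_query("C0042")))`.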

In [None]:
# =============================================================================
# Transaction Flow Visualization (for Neo4j Browser)
# =============================================================================
print("Transaction Flow Network")
print("=" * 50)
print("")
print("Run this query in Neo4j Browser for visual exploration:")
print("-" * 50)

viz_query = """
MATCH path = (from:Account)-[:PERFORMS]->(t:Transaction)-[:BENEFITS_TO]->(to:Account)
RETURN path
LIMIT 25
"""

print(viz_query)

---

## Query Summary

This notebook demonstrated using the **Neo4j Spark Connector** for:

**Portfolio Analysis**
- Total portfolio value by customer
- Accounts with diversified holdings
- Sector allocation breakdown
- Sector diversification per account

**Transaction Network**
- Most active sending accounts
- Bidirectional transaction flow
- High-value transactions
- Transaction status distribution

**Risk Analysis**
- Customer segmentation by risk profile

**Connector Patterns**
- Reading nodes by label with automatic pushdown
- Reading relationships with source/target properties
- Exporting to Delta Lake

**Visualization**
- Customer financial network paths
- Transaction flow patterns

### Best Practices Applied

1. **Use `query` option** for complex Cypher with multiple MATCH clauses
2. **Use `labels` option** for simple node reads (enables automatic pushdown)
3. **Use `relationship` option** for relationship reads with source/target nodes
4. **Enable partitions** for large result sets (disables limit pushdown)
5. **Leverage pushdown** - filters, columns, aggregates, limits are pushed to Neo4j
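
The three read modes listed above differ only in the options passed to the reader. The sketch below gathers them into one hypothetical helper, `read_options`, that returns the option dictionary for a single mode. The option keys are the ones used in this notebook; the helper itself and its validation rules are assumptions, not part of the connector.

```python
from typing import Optional


def read_options(
    *,
    query: Optional[str] = None,
    labels: Optional[str] = None,
    relationship: Optional[str] = None,
    source_labels: Optional[str] = None,
    target_labels: Optional[str] = None,
    partitions: int = 1,
) -> dict:
    """Return connector read options for exactly one of the three read modes."""
    modes = {"query": query, "labels": labels, "relationship": relationship}
    chosen = {name: value for name, value in modes.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError("Set exactly one of query, labels, or relationship")
    if partitions < 1:
        raise ValueError("partitions must be at least 1")

    options = dict(chosen)
    if relationship is not None:
        # Relationship reads in this notebook also name the node labels on both ends.
        if not (source_labels and target_labels):
            raise ValueError("relationship reads need source_labels and target_labels")
        options["relationship.source.labels"] = source_labels
        options["relationship.target.labels"] = target_labels
    options["partitions"] = str(partitions)
    return options


print(read_options(labels="Customer"))
print(read_options(relationship="HAS_ACCOUNT", source_labels="Customer", target_labels="Account"))
```

On a cluster, the dictionary could be applied with `spark.read.format("org.neo4j.spark.DataSource").options(**opts).load()`.
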

For more queries, see `DATA_IMPORT.md` in the project root.