# Module 1: Graph Basics
**Modular GenAI Workshops 2025**

This notebook accompanies Module 1 of the workshop series. You'll learn graph database fundamentals and explore financial data using Neo4j and Cypher.

## Learning Objectives
- Understand graph concepts and Neo4j fundamentals
- Write basic Cypher queries
- Explore financial data patterns
- Connect graph concepts to AI/ML use cases

## Environment Setup

## Install Required Packages

Before we can work with Neo4j and analyze graph data, we need to install several Python packages:

- **neo4j**: The official Neo4j Python driver for connecting to and querying the database
- **pandas**: For data manipulation and analysis 
- **python-dotenv**: For loading environment variables from .env files
- **matplotlib & seaborn**: For creating data visualizations

These packages form the foundation of our graph data analysis toolkit.
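
In a notebook, these packages are usually installed with a cell such as `%pip install neo4j pandas python-dotenv matplotlib seaborn`. As a quick sanity check afterwards, a sketch like the one below reports any package that still cannot be found. The helper name `missing_packages` is made up for this example, and note that `python-dotenv` is imported as `dotenv`.

```python
from importlib.util import find_spec

# Import names (not pip names): python-dotenv is imported as "dotenv".
REQUIRED = ["neo4j", "pandas", "dotenv", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the import names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```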

## Import Libraries and Load Environment Variables

Here we import all the necessary libraries for our graph analysis session:

- **os**: For accessing operating system environment variables
- **pandas**: We'll use this to convert Neo4j query results into DataFrames for analysis
- **GraphDatabase from neo4j**: The main class for connecting to Neo4j
- **load_dotenv**: Loads environment variables from a .env file (keeps credentials out of the notebook)
- **matplotlib.pyplot & seaborn**: For creating charts and visualizations

The `load_dotenv()` function reads your .env file and adds its entries to the process environment. That file should contain your Neo4j connection details.

## Create a Neo4j Connection Helper Class

This `Neo4jConnection` class encapsulates all the logic needed to connect to and query Neo4j:

**Key Components:**
- **`__init__`**: Gets credentials from environment variables and creates the driver
- **`query`**: Executes Cypher queries and returns results as Python dictionaries
- **`close`**: Properly closes the database connection when done

**Security Best Practice**: Notice how we store credentials in environment variables rather than hardcoding them in the notebook. This keeps passwords out of shared notebooks and version control.

**Error Handling**: The class provides helpful error messages if credentials are missing, making debugging easier during setup.

Let's create a helper class to manage our Neo4j connection. This will handle authentication and provide a simple query method.
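
A minimal sketch of such a helper is shown below. It is not necessarily the workshop's own implementation: the environment variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are assumptions, so match them to whatever your .env file actually defines. The driver is imported inside `__init__` so that the credential check can run first.

```python
import os


class Neo4jConnection:
    """Small wrapper around the Neo4j driver (illustrative sketch)."""

    def __init__(self, uri=None, user=None, password=None):
        # Assumed variable names; adjust them to match your .env file.
        settings = {
            "NEO4J_URI": uri or os.getenv("NEO4J_URI"),
            "NEO4J_USERNAME": user or os.getenv("NEO4J_USERNAME"),
            "NEO4J_PASSWORD": password or os.getenv("NEO4J_PASSWORD"),
        }
        missing = [name for name, value in settings.items() if not value]
        if missing:
            raise ValueError(
                "Missing Neo4j settings: " + ", ".join(missing)
                + ". Check your .env file."
            )

        # Imported here so the check above runs before the driver is needed.
        from neo4j import GraphDatabase

        self.driver = GraphDatabase.driver(
            settings["NEO4J_URI"],
            auth=(settings["NEO4J_USERNAME"], settings["NEO4J_PASSWORD"]),
        )

    def query(self, cypher, parameters=None):
        """Run a Cypher query and return the rows as a list of dictionaries."""
        with self.driver.session() as session:
            result = session.run(cypher, parameters or {})
            return [record.data() for record in result]

    def close(self):
        """Release the driver's connections."""
        self.driver.close()
```

The later cells in this notebook call `neo4j.query(...)`, so under this sketch the instance would be created as `neo4j = Neo4jConnection()` after `load_dotenv()` has run.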

## Test Database Connection

Before proceeding with data analysis, it's crucial to verify our connection works properly.

**What this code does:**
- Executes a simple `RETURN` statement that doesn't require any data
- If successful, displays a confirmation message
- If it fails, provides troubleshooting guidance

**Why this matters:**
- Catches connection issues early before we try complex queries
- Provides clear feedback about what might be wrong
- Saves time by identifying configuration problems immediately

This is a best practice whenever working with external databases or APIs.

Let's test our Neo4j connection to make sure everything is working properly.
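
One way to write this check is sketched below. The function name `check_connection` is made up for illustration, and it only assumes that the connection object has a `query()` method that returns a list of dictionaries.

```python
def check_connection(conn):
    """Return True if a trivial query round-trips through `conn` (hypothetical helper)."""
    try:
        # RETURN 1 needs no data in the database, so it only tests connectivity.
        rows = conn.query("RETURN 1 AS ok")
    except Exception as exc:
        print(f"Connection failed: {exc}")
        print("Check the URI, username, and password in your .env file, "
              "and confirm the database is running.")
        return False

    if rows and rows[0].get("ok") == 1:
        print("Connection successful")
        return True

    print("Connected, but the test query returned an unexpected result")
    return False
```

With the helper instance used in the rest of this notebook, the call would be `check_connection(neo4j)`.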

## Explore Node Types in the Database

**Understanding the Cypher Query:**
```cypher
MATCH (n) 
RETURN labels(n) AS nodeType, count(n) AS count 
ORDER BY count DESC
```

**Breaking it down:**
- `MATCH (n)`: Finds all nodes in the database (n is a variable representing any node)
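- `labels(n)`: Returns the list of labels on each node (like Customer, Account, etc.)
- `count(n)`: Counts the nodes in each group; nodes are grouped by their full label list, so a node with two labels is counted under that combination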
- `ORDER BY count DESC`: Shows the most common node types first

**Why this matters:**
This gives us a bird's-eye view of our data structure and helps us understand what entities are most prevalent in our financial dataset.

## Examine Relationship Types

**Understanding the Cypher Query:**
```cypher
MATCH ()-[r]->() 
RETURN type(r) AS relationshipType, count(r) AS count 
ORDER BY count DESC
```

**Breaking it down:**
- `()-[r]->()`: Matches every relationship (r) between any two nodes; because the pattern is directed, each relationship is matched exactly once
- `type(r)`: Gets the relationship type (like HAS_ACCOUNT, TRANSACTION, etc.)
- `count(r)`: Counts how many relationships of each type exist

**Key Insight:**
Relationships in graphs represent the connections and interactions between entities. In financial data, these might include:
- Customer HAS_ACCOUNT relationships
- TRANSACTION relationships between accounts
- OWNS relationships for ownership structures

Understanding relationship patterns helps identify the most important connections in your data.

Next, let's examine the relationship types that connect our nodes together.

## Find High-Value Customer Accounts

**Understanding the Graph Pattern:**
```cypher
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
```

This pattern demonstrates a fundamental graph concept: **traversing relationships**.

**Breaking it down:**
- `(c:Customer)`: Find nodes labeled as Customer
- `-[:HAS_ACCOUNT]->`: Follow the HAS_ACCOUNT relationship
- `(a:Account)`: To nodes labeled as Account

**Business Value:**
- Identifies customers with the highest account balances
- Shows the relationship between customer identity and financial assets
- Provides data for customer segmentation and targeting

**Graph Advantage:** In a relational database, this would typically require joining a customer table to an account table, often through a foreign key or a link table. In a graph, it's a single pattern traversal.

## Visualize Account Type Distribution

**Data Visualization in Graph Analytics:**

Here we demonstrate how to combine graph queries with data visualization:

1. **Query Neo4j** to get account type counts
2. **Convert to DataFrame** using pandas for easier manipulation
3. **Create visualization** using seaborn/matplotlib

**Why visualization matters:**
- Makes patterns immediately visible
- Helps identify data quality issues
- Communicates insights effectively to stakeholders
- Supports exploratory data analysis

**Learning Note:** This pattern of query → DataFrame → visualization is fundamental in graph analytics workflows.

Now let's create a visualization to see the distribution of different account types.

## Analyze Large Transaction Patterns

**Understanding Relationship Properties:**
```cypher
MATCH (from:Account)-[t:TRANSACTION]->(to:Account)
```

This single-hop pattern introduces a key graph concept: **relationships with properties**.

**Key Learning Points:**
- `[t:TRANSACTION]`: The relationship itself has properties (amount, date, type)
- `WHERE t.amount > 5000`: We can filter on relationship properties
- Graph relationships aren't just connections; they carry meaningful data

**Financial Analytics Application:**
- Large transactions might indicate business transfers
- Could be useful for fraud detection
- Helps understand money flow patterns
- Supports regulatory compliance monitoring

## Statistical Analysis of Transaction Amounts

**Combining Graph Data with Statistical Analysis:**

This cell demonstrates how to extract data from Neo4j and apply statistical analysis:

**Statistical Concepts:**
- **Distribution analysis**: Understanding how transaction amounts are spread
- **Box plots**: Identify outliers and quartiles
- **Descriptive statistics**: The mean and median summarize typical behavior

**Graph Analytics Application:**
- Unusual distributions might indicate fraud or data quality issues
- Understanding normal patterns helps identify anomalies
- Statistical baselines are crucial for machine learning models

**Integration Pattern:** Graph → Extract → Analyze → Visualize is a common workflow in graph analytics.

Let's analyze the distribution of transaction amounts to understand spending patterns.
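
To make the box plot idea concrete, the sketch below applies the common 1.5 × IQR rule, which is also the default whisker rule for matplotlib box plots, to a handful of made-up amounts. The numbers are illustrative only and are not workshop data, and `iqr_outliers` is a hypothetical helper name.

```python
import statistics


def iqr_outliers(amounts):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _median, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in amounts if x < lower or x > upper]
    return lower, upper, outliers


# Made-up amounts for illustration only.
sample = [120, 80, 95, 110, 105, 90, 100, 5000]
lower, upper, outliers = iqr_outliers(sample)
print(f"Fences: {lower} to {upper}")   # Fences: 65.625 to 140.625
print(f"Outliers: {outliers}")         # Outliers: [5000]
```

A single very large amount falls well outside the upper fence, which is the kind of point a box plot draws as an individual marker.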

## Identify Multi-Account Customers

**Advanced Graph Pattern: Aggregation**
```cypher
WITH c, count(a) AS accountCount, collect(a.type) AS accountTypes
```

**Key Learning Concepts:**
- **`WITH` clause**: Passes intermediate results to the next part of the query. Because it contains aggregates here, the remaining variable `c` acts as the grouping key (like GROUP BY in SQL)
- **`count(a)`**: Aggregates the number of accounts per customer
- **`collect(a.type)`**: Gathers all account types into a list
- **`WHERE accountCount > 1`**: Filters on the aggregated result (like HAVING in SQL)

**Business Intelligence Value:**
- Multi-account customers are often more valuable
- Understanding account type combinations helps with product recommendations
- Enables customer segmentation strategies

Let's identify customers who have multiple accounts, as these may be our most valuable customers.
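
For readers more familiar with Python than Cypher, the grouping that `WITH c, count(a), collect(a.type)` performs can be mimicked in plain Python. The sketch below uses made-up rows in place of the `MATCH` results, and `group_accounts` is a hypothetical name.

```python
from collections import defaultdict

# Made-up (customer, accountType) rows standing in for the MATCH results.
rows = [
    ("Ana", "Checking"),
    ("Ana", "Savings"),
    ("Ben", "Checking"),
    ("Ana", "Credit"),
    ("Cy", "Savings"),
    ("Cy", "Checking"),
]


def group_accounts(rows, min_accounts=2):
    """Group account types per customer and keep customers with enough accounts."""
    grouped = defaultdict(list)
    for customer, account_type in rows:
        grouped[customer].append(account_type)       # collect(a.type)

    result = [
        {"customer": c, "accountCount": len(types), "accountTypes": types}
        for c, types in grouped.items()              # one row per grouping key c
        if len(types) >= min_accounts                # WHERE accountCount > 1
    ]
    # ORDER BY accountCount DESC
    return sorted(result, key=lambda r: r["accountCount"], reverse=True)


for row in group_accounts(rows):
    print(f"{row['customer']}: {row['accountCount']} accounts ({', '.join(row['accountTypes'])})")
```

Ben has a single account, so he is dropped by the filter, just as the `WHERE` after `WITH` drops customers in the Cypher version.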

```python
# Get node counts by type
node_counts = neo4j.query("""
    MATCH (n)
    RETURN labels(n) AS nodeType, count(n) AS count
    ORDER BY count DESC
""")

print("=== Database Overview ===")
for record in node_counts:
    node_type = record['nodeType'][0] if record['nodeType'] else 'No Label'
    print(f"{node_type}: {record['count']:,} nodes")
```

```python
# Get relationship counts by type
rel_counts = neo4j.query("""
    MATCH ()-[r]->()
    RETURN type(r) AS relationshipType, count(r) AS count
    ORDER BY count DESC
""")

print("\n=== Relationship Types ===")
for record in rel_counts:
    print(f"{record['relationshipType']}: {record['count']:,} relationships")
```

## Exercise 2: Basic Graph Patterns
Now let's explore some basic patterns in our financial data.

```python
# Find sample customers and their accounts
customers = neo4j.query("""
    MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
    RETURN c.name AS customer,
           a.type AS accountType,
           a.balance AS balance,
           a.number AS accountNumber
    ORDER BY a.balance DESC
    LIMIT 10
""")

df_customers = pd.DataFrame(customers)
print("=== Top Customers by Account Balance ===")
print(df_customers.to_string(index=False))
```

```python
# Visualize account types distribution
account_types = neo4j.query("""
    MATCH (a:Account)
    RETURN a.type AS accountType, count(a) AS count
    ORDER BY count DESC
""")

df_types = pd.DataFrame(account_types)

plt.figure(figsize=(10, 6))
sns.barplot(data=df_types, x='accountType', y='count')
plt.title('Account Types Distribution')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

## Exercise 3: Transaction Patterns

```python
# Find large transactions
large_transactions = neo4j.query("""
    MATCH (from:Account)-[t:TRANSACTION]->(to:Account)
    WHERE t.amount > 5000
    RETURN from.number AS fromAccount,
           to.number AS toAccount,
           t.amount AS amount,
           t.date AS date,
           t.type AS transactionType
    ORDER BY t.amount DESC
    LIMIT 10
""")

df_transactions = pd.DataFrame(large_transactions)
print("=== Largest Transactions ===")
print(df_transactions.to_string(index=False))
```

```python
# Analyze transaction amounts distribution
# Use a directed pattern so each transaction is matched once; an undirected
# pattern would return every relationship twice and double the counts.
transaction_stats = neo4j.query("""
    MATCH ()-[t:TRANSACTION]->()
    RETURN t.amount AS amount
    ORDER BY t.amount
""")

amounts = [record['amount'] for record in transaction_stats]

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(amounts, bins=50, alpha=0.7)
plt.title('Transaction Amount Distribution')
plt.xlabel('Amount')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.boxplot(amounts)
plt.title('Transaction Amount Box Plot')
plt.ylabel('Amount')

plt.tight_layout()
plt.show()

print(f"Total transactions: {len(amounts):,}")
print(f"Average amount: ${pd.Series(amounts).mean():.2f}")
print(f"Median amount: ${pd.Series(amounts).median():.2f}")
```

## Exercise 4: Customer Analysis

```python
# Find customers with multiple accounts
multi_account_customers = neo4j.query("""
    MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
    WITH c, count(a) AS accountCount, collect(a.type) AS accountTypes
    WHERE accountCount > 1
    RETURN c.name AS customer,
           accountCount,
           accountTypes
    ORDER BY accountCount DESC
    LIMIT 10
""")

print("=== Customers with Multiple Accounts ===")
for record in multi_account_customers:
    print(f"{record['customer']}: {record['accountCount']} accounts ({', '.join(record['accountTypes'])})")
```

```python
# Find most active customers by transaction volume
# The undirected TRANSACTION pattern is intentional here: it counts both
# incoming and outgoing transactions for each of the customer's accounts.
active_customers = neo4j.query("""
    MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)-[t:TRANSACTION]-()
    WITH c, count(t) AS transactionCount, sum(abs(t.amount)) AS totalVolume
    WHERE transactionCount > 5
    RETURN c.name AS customer,
           transactionCount,
           totalVolume
    ORDER BY totalVolume DESC
    LIMIT 10
""")

df_active = pd.DataFrame(active_customers)
print("=== Most Active Customers by Transaction Volume ===")
print(df_active.to_string(index=False))
```

## Exercise 5: Graph Patterns for AI Applications
Let's explore some patterns that would be useful for AI applications like fraud detection and recommendations.

```python
# Find potential fraud patterns - accounts with unusual transaction patterns
unusual_patterns = neo4j.query("""
    MATCH (a:Account)-[t:TRANSACTION]-()
    WITH a,
         count(t) AS transactionCount,
         avg(t.amount) AS avgAmount,
         stdev(t.amount) AS amountStdev
    WHERE transactionCount > 10
      AND amountStdev > avgAmount * 1.5
    RETURN a.number AS account,
           a.type AS accountType,
           transactionCount,
           avgAmount,
           amountStdev
    ORDER BY amountStdev DESC
    LIMIT 10
""")

print("=== Accounts with Unusual Transaction Patterns ===")
print("(High variance in transaction amounts - potential fraud indicators)")
df_unusual = pd.DataFrame(unusual_patterns)
print(df_unusual.to_string(index=False))
```
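
The filter `amountStdev > avgAmount * 1.5` flags accounts whose sample standard deviation exceeds 1.5 times their mean amount. The sketch below applies the same test in plain Python to two made-up lists of amounts. It leaves out the query's `transactionCount > 10` condition to keep the example short, and `is_high_variance` is a hypothetical name.

```python
import statistics


def is_high_variance(amounts, ratio=1.5):
    """Return True if the sample standard deviation exceeds ratio * mean."""
    if len(amounts) < 2:
        return False  # a sample standard deviation needs at least two values
    return statistics.stdev(amounts) > statistics.mean(amounts) * ratio


# Made-up amounts for illustration only.
steady = [100, 110, 90, 105, 95]   # small spread around the mean
spiky = [10, 12, 9, 11, 2000]      # one amount dwarfs the rest

print(is_high_variance(steady))    # False
print(is_high_variance(spiky))     # True
```

Like Cypher's `stdev()`, `statistics.stdev` computes the sample (N - 1) standard deviation, so the two tests are comparable.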

```python
# Find customer similarity patterns (for recommendations)
# count(DISTINCT m) counts each shared counterparty account once, no matter
# how many transactions reach it. Each pair can appear twice: (A, B) and (B, A).
customer_similarity = neo4j.query("""
    MATCH (c1:Customer)-[:HAS_ACCOUNT]->()-[:TRANSACTION]->(m:Account)<-[:TRANSACTION]-()<-[:HAS_ACCOUNT]-(c2:Customer)
    WHERE c1 <> c2
    WITH c1, c2, count(DISTINCT m) AS sharedConnections
    WHERE sharedConnections >= 2
    RETURN c1.name AS customer1,
           c2.name AS customer2,
           sharedConnections
    ORDER BY sharedConnections DESC
    LIMIT 10
""")

print("\n=== Customer Similarity (Shared Transaction Patterns) ===")
print("(Customers who transact with similar accounts - useful for recommendations)")
df_similarity = pd.DataFrame(customer_similarity)
print(df_similarity.to_string(index=False))
```

## Challenge Exercise: Write Your Own Queries
Try writing Cypher queries to answer these business questions:

```python
# Challenge 1: Find customers who have accounts but no transactions
# Your query here:
query1 = """
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
WHERE NOT (a)-[:TRANSACTION]-()
RETURN c.name AS customer, a.number AS account, a.type AS accountType
LIMIT 10
"""

result1 = neo4j.query(query1)
print("=== Customers with Accounts but No Transactions ===")
for record in result1:
    print(f"{record['customer']}: {record['account']} ({record['accountType']})")
```

```python
# Challenge 2: Find the highest total balance by customer
# Your query here:
query2 = """
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)
WITH c, sum(a.balance) AS totalBalance
RETURN c.name AS customer, totalBalance
ORDER BY totalBalance DESC
LIMIT 10
"""

result2 = neo4j.query(query2)
print("\n=== Customers by Total Balance ===")
for record in result2:
    print(f"{record['customer']}: ${record['totalBalance']:,.2f}")
```

```python
# Challenge 3: Find the most common transaction types
# Your query here:
# The pattern is directed so that each transaction is counted once;
# an undirected match would double every frequency.
query3 = """
MATCH ()-[t:TRANSACTION]->()
RETURN t.type AS transactionType, count(t) AS frequency
ORDER BY frequency DESC
"""

result3 = neo4j.query(query3)
print("\n=== Most Common Transaction Types ===")
for record in result3:
    print(f"{record['transactionType']}: {record['frequency']:,} transactions")
```

## Summary and Reflection

In this module, you've learned:

1. **Graph Database Fundamentals**: How graphs store data as nodes and relationships
2. **Cypher Query Language**: Basic patterns for finding and analyzing data
3. **Financial Data Patterns**: Real-world examples of customer, account, and transaction data
4. **AI Application Potential**: How graph patterns can support fraud detection and recommendations

### Key Insights from the Data:
- Graphs naturally represent financial relationships and transaction flows
- Pattern-based queries can identify unusual behavior and customer similarities
- Graph traversals reveal insights that would be complex in relational databases

### Next Steps:
In the next module, we'll learn how to import and model structured data to build graphs like this one.

## Cleanup

```python
# Close the Neo4j connection
neo4j.close()
print("✅ Connection closed")
```