# Module 2: Structured Data Import
**Modular GenAI Workshops 2025**

This notebook guides you through importing structured data into Neo4j, covering data modeling, CSV import, and optimization techniques.

## Learning Objectives
- Design effective graph data models
- Import structured data using LOAD CSV
- Transform and validate data during import
- Optimize import performance

In [None]:
# Install required packages
!pip install neo4j pandas python-dotenv requests faker

## Install Required Packages

Before we begin working with structured data import, we need to install essential Python packages:

- **neo4j**: The official Neo4j Python driver for database connectivity
- **pandas**: For data manipulation and transformation during import
- **python-dotenv**: For secure environment variable management
- **requests**: For downloading data from external sources

These packages provide the foundation for building robust data import pipelines.

## Import Libraries and Initialize Environment

Here we import all necessary libraries for our data import workflow:

**Key Libraries:**
- **datetime**: For handling date/time data transformations
- **time**: For measuring import performance
- **random & Faker**: For generating realistic sample data (imported later, in the data generation cell)
- **pandas**: For data manipulation and validation

**Environment Setup:**
The `load_dotenv()` function loads our Neo4j credentials from the .env file, keeping sensitive information secure and separate from code.

In [None]:
import os
import pandas as pd
import requests
from neo4j import GraphDatabase
from dotenv import load_dotenv
import time
from datetime import datetime, date

load_dotenv()
print("✅ Packages imported successfully")

## Create Enhanced Neo4j Connection Class

This `Neo4jImporter` class extends our basic connection with import-specific functionality:

**Key Features:**
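- **`execute()`**: Runs a query in an auto-commit transaction and returns the records as dictionaries; this notebook uses it for reads, schema operations, and the row-by-row imports
- **`execute_write()`**: Runs a query inside a managed write transaction, which the driver can retry on transient errors
- **Database selection**: Supports connecting to specific Neo4j databases
- **Resource handling**: Each call opens its session in a `with` block, so sessions are closed even if a query fails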

**Why This Matters:**
Import operations require different patterns than simple queries - they need transaction management, better error handling, and performance monitoring.

## Neo4j Connection

## Clear Database and Create Constraints

**Database Preparation:**
Before importing new data, we clear any existing data to start fresh. In production, you would be more selective about this!

**Constraints for Data Integrity:**
We create UNIQUE constraints on key identifiers:
- **customer_id**: Ensures no duplicate customers
- **account_number**: Prevents duplicate accounts
- **merchant_id**: Maintains merchant uniqueness
- **location_id**: Ensures location integrity

**Why Constraints Matter:**
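- Reject duplicate identifiers during import
- Speed up lookups by key, because each uniqueness constraint is backed by an index
- Guarantee one node per identifier (uniqueness only; constraints do not check that relationships exist)
- Support re-runnable import scripts when the import uses `MERGE` on the constrained key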
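
The four constraint statements in this notebook share one shape, so they can be generated from a small mapping. The following is a minimal sketch; the helper name `unique_constraint` and the `KEYS` mapping are illustrative and not part of the workshop code.

```python
def unique_constraint(name: str, label: str, prop: str) -> str:
    """Build an idempotent uniqueness-constraint statement (Neo4j 5 syntax)."""
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )

# Hypothetical mapping: constraint name -> (node label, key property)
KEYS = {
    "customer_id": ("Customer", "id"),
    "account_number": ("Account", "number"),
    "merchant_id": ("Merchant", "id"),
    "location_id": ("Location", "id"),
}

statements = [unique_constraint(name, *target) for name, target in KEYS.items()]
for statement in statements:
    print(statement)
```

Because every statement carries `IF NOT EXISTS`, running the list a second time leaves the schema unchanged.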

## Create Performance Indexes

**Index Strategy for Fast Queries:**

We create indexes on frequently-queried properties:
- **customer_email & customer_name**: For customer lookups
- **account_type**: For account categorization queries
- **merchant_category**: For merchant analysis
- **transaction_date & transaction_amount**: For transaction analysis

**Index Performance Impact:**
- Dramatically speeds up WHERE clause filtering
- Enables efficient sorting and range queries
- Critical for real-time analytics queries
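- Constraints and indexes on lookup keys should exist BEFORE the import so that `MATCH` on IDs stays fast; secondary indexes add write overhead and can also be built after a bulk load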

In [None]:
class Neo4jImporter:
    def __init__(self):
        self.uri = os.getenv('NEO4J_URI')
        self.username = os.getenv('NEO4J_USERNAME')
        self.password = os.getenv('NEO4J_PASSWORD')
        self.database = os.getenv('NEO4J_DATABASE', 'neo4j')
        self.driver = GraphDatabase.driver(self.uri, auth=(self.username, self.password))
    
    def execute(self, query, parameters=None):
        with self.driver.session(database=self.database) as session:
            result = session.run(query, parameters or {})
            return [record.data() for record in result]
    
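    def execute_write(self, query, parameters=None):
        # Read the records inside the transaction function; a Result cannot be consumed after the transaction ends
        def work(tx):
            return [record.data() for record in tx.run(query, parameters or {})]
        with self.driver.session(database=self.database) as session:
            return session.execute_write(work)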
    
    def close(self):
        self.driver.close()

neo4j = Neo4jImporter()
print("✅ Neo4j connection established")

## Generate Realistic Location Data

**Data Generation Strategy:**

Since we're demonstrating import concepts, we generate realistic sample data using:
- **Real US cities** with actual coordinates
- **Faker library** for realistic addresses and postal codes
- **Geographic distribution** across major metropolitan areas

**Key Learning Points:**
- Real-world imports often require data transformation
- Geographic data includes multiple related attributes
- Data should be realistic for meaningful analysis
- Small datasets help us understand import patterns before scaling

## Generate Customer Demographic Data

**Customer Data Modeling:**

We create realistic customer profiles including:
- **Personal information**: Names, emails, phone numbers
- **Financial attributes**: Income bracket, credit score
- **Temporal data**: Registration dates
- **Geographic references**: Location IDs for relationships

**Data Realism Features:**
- Age ranges from 18-80 (realistic banking customers)
- Credit scores span the standard 300-850 range (sampled uniformly, not from a real-world distribution)
- Registration dates span 5 years of history
- Income brackets for financial segmentation

## Generate Merchant Business Data

**Merchant Entity Design:**

Creating diverse merchant data including:
- **Business information**: Names and establishment years
- **Categorization**: Primary and subcategories for analytics
- **Geographic distribution**: Connected to locations
- **Realistic business types**: Restaurants, retail, gas stations, etc.

**Category Strategy:**
The merchant categories are designed to support:
- Spending pattern analysis
- Customer behavior insights
- Fraud detection (unusual category patterns)
- Recommendation engines

## Generate Account Portfolio Data

**Account Relationship Modeling:**

This demonstrates a key graph concept: **one-to-many relationships**
- Each customer can have 1-3 accounts
- Account types: checking, savings, credit, investment
- Realistic balance distributions by account type
- Primary account designation

**Financial Data Realism:**
- **Checking**: $100-$10,000 (daily spending money)
- **Savings**: $500-$50,000 (accumulated savings)
- **Credit**: -$5,000 to $0 (negative balances represent debt)
- **Investment**: $1,000-$100,000 (long-term wealth)

**Graph Design Decision:** Accounts are separate entities (not just properties) because they have their own lifecycle and relationships.

## Step 1: Clear Previous Data and Set Up Schema

## Import Location Entities

**Basic Node Creation Pattern:**

This demonstrates the fundamental import pattern:
1. **Iterate through data** using pandas DataFrame
2. **Execute CREATE statements** for each record
3. **Monitor performance** with timing
4. **Provide feedback** on import progress

**Learning Points:**
- Individual CREATE statements are simple but not fastest for large datasets
- Each row becomes a node with properties
- Performance timing helps identify bottlenecks
- Good for understanding the import process before optimization
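
For larger datasets, a common alternative to one `CREATE` per row is to send many rows in a single query and expand them with `UNWIND`. The sketch below shows the idea under a few assumptions: `run` is any callable that accepts `(query, parameters)`, such as this notebook's `neo4j.execute`, and the names `chunked`, `import_in_batches`, and `BATCH_CREATE_LOCATIONS` are made up for illustration.

```python
from typing import Callable, Iterator

# Cypher that creates one Location node per element of the $rows list
BATCH_CREATE_LOCATIONS = """
UNWIND $rows AS row
CREATE (l:Location {
    id: row.location_id,
    city: row.city,
    state: row.state,
    country: row.country,
    zip_code: row.zip_code,
    latitude: row.latitude,
    longitude: row.longitude
})
"""

def chunked(rows: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def import_in_batches(run: Callable[[str, dict], object], rows: list,
                      batch_size: int = 1000) -> int:
    """Send rows in batches through `run`; return the number of batches sent."""
    batches = 0
    for batch in chunked(rows, batch_size):
        run(BATCH_CREATE_LOCATIONS, {"rows": batch})
        batches += 1
    return batches
```

With the objects defined in this notebook, a call might look like `import_in_batches(neo4j.execute, df_locations.to_dict('records'))`. Each batch costs one round trip instead of one per row.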

## Import Merchants with Relationships

**Node + Relationship Creation Pattern:**

This shows a more complex import pattern:
1. **Create the merchant node** with business properties
2. **Create relationship to location** using MATCH + CREATE
3. **Handle foreign key references** through location_id

**Key Graph Concepts:**
- **Two-step process**: Create node, then connect to existing nodes
- **MATCH before CREATE**: Find existing location to connect to
- **Foreign key mapping**: location_id becomes a graph relationship
- **Relationship semantics**: LOCATED_IN expresses business-location connection
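
The two steps can also be folded into one statement, which halves the number of round trips per merchant. This is a sketch rather than the workshop's own code: `run` stands for any callable taking `(query, parameters)`, and `import_merchant` and `CREATE_MERCHANT_AT_LOCATION` are illustrative names.

```python
from typing import Callable

# One statement: find the location, then create the merchant and its relationship.
# If no Location matches $location_id, the query creates nothing at all.
CREATE_MERCHANT_AT_LOCATION = """
MATCH (l:Location {id: $location_id})
CREATE (m:Merchant {
    id: $merchant_id,
    name: $merchant_name,
    category: $category,
    subcategory: $subcategory,
    established_year: $established_year
})
CREATE (m)-[:LOCATED_IN]->(l)
"""

def import_merchant(run: Callable[[str, dict], object], row: dict) -> None:
    """Create a merchant and link it to its location in a single query."""
    run(CREATE_MERCHANT_AT_LOCATION, row)
```

One design consequence: with the two-step version a merchant whose `location_id` is unknown is still created, just without a `LOCATED_IN` relationship, whereas the combined query skips that merchant entirely.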

## Import Customers with Data Validation

**Data Validation During Import:**

This demonstrates production-quality import practices:
1. **Validate data quality** before creating nodes
2. **Skip invalid records** with helpful counters
3. **Add computed properties** (risk_category based on credit_score)
4. **Create relationships** to geographic data

**Validation Rules:**
- Email must contain '@' symbol
- Age must be realistic (18-120)
- Skip bad data rather than failing entire import

**Computed Properties:** 
Risk categories are calculated during import, creating analytics-ready data from the start.
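
The validation rules and the risk computation are easier to test when they live in small pure functions, separate from the database calls. A minimal sketch follows; the thresholds (750 and 650) match the import cell in this notebook, while the function names are illustrative.

```python
def is_valid_customer(row: dict) -> bool:
    """Apply the two validation rules: email contains '@', age within 18-120."""
    return "@" in row["email"] and 18 <= row["age"] <= 120

def risk_category(credit_score: int) -> str:
    """Map a credit score to a risk label: >= 750 low, >= 650 medium, else high."""
    if credit_score >= 750:
        return "low"
    if credit_score >= 650:
        return "medium"
    return "high"

def prepare_customers(rows: list) -> tuple:
    """Return (importable rows with risk_category added, number of skipped rows)."""
    valid, skipped = [], 0
    for row in rows:
        if not is_valid_customer(row):
            skipped += 1
            continue
        valid.append({**row, "risk_category": risk_category(row["credit_score"])})
    return valid, skipped
```

The prepared rows can then be passed to the import query unchanged, and the skip count reported exactly as the import cell does.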

## Import Account Portfolio with Business Logic

**Advanced Import with Business Rules:**

This pattern shows sophisticated import logic:
1. **Business validation**: Skip accounts with negative balances (except credit)
2. **Computed categorization**: Balance categories for analytics
3. **Relationship properties**: HAS_ACCOUNT has opening date and primary flag
4. **Single transaction**: Create account and relationship together

**Graph Design Excellence:**
- **Rich relationships**: The HAS_ACCOUNT relationship carries meaningful data
- **Analytics preparation**: Balance categories enable fast segmentation
- **Data consistency**: Business rules prevent invalid data states

In [None]:
# Clear existing data (be careful with this in production!)
print("🧹 Clearing existing data...")
neo4j.execute("MATCH (n) DETACH DELETE n")

# Create constraints for data integrity
print("🔧 Creating constraints...")
constraints = [
    "CREATE CONSTRAINT customer_id IF NOT EXISTS FOR (c:Customer) REQUIRE c.id IS UNIQUE",
    "CREATE CONSTRAINT account_number IF NOT EXISTS FOR (a:Account) REQUIRE a.number IS UNIQUE",
    "CREATE CONSTRAINT merchant_id IF NOT EXISTS FOR (m:Merchant) REQUIRE m.id IS UNIQUE",
    "CREATE CONSTRAINT location_id IF NOT EXISTS FOR (l:Location) REQUIRE l.id IS UNIQUE"
]

for constraint in constraints:
    try:
        neo4j.execute(constraint)
        print(f"✅ {constraint.split()[2]}")
    except Exception as e:
        print(f"⚠️  {constraint.split()[2]}: {str(e)}")

## Generate Transaction Activity Data

**Transaction Data Complexity:**

Financial transactions represent the most complex data in our graph:
- **Account-to-account relationships**: Money flow between entities
- **Temporal patterns**: Date/time stamps for activity analysis
- **Amount variations**: Realistic distributions by transaction type
- **Channel tracking**: How transactions occur (online, mobile, ATM)
- **Merchant connections**: Purchase transactions link to merchants

**Business Logic in Data Generation:**
- Different amount ranges by transaction type
- Temporal patterns (recent transactions)
- Channel diversity for analysis
- Merchant integration for purchase tracking

## Import Complex Transaction Relationships

**Multi-Relationship Import Pattern:**

This demonstrates the most sophisticated import pattern:
1. **Account-to-account TRANSACTION** relationships with rich properties
2. **Conditional merchant relationships** for purchase transactions
3. **Multiple computed properties** for analytics
4. **Complex validation** (positive amounts, different accounts)

**Analytics-Ready Properties:**
- **amount_category**: Enables fast transaction size analysis
- **is_weekend**: Supports fraud detection patterns
- **hour_of_day**: Time-based behavior analysis
- **PURCHASED_FROM**: A relationship (not a property) that links the paying account directly to the merchant

**Performance Note:** This creates the most relationships and is typically the slowest import step.
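
The computed properties can be derived in one place before the row is sent to the database. The sketch below assumes each transaction is a plain dictionary with the field names used in this notebook; `enrich_transaction` is an illustrative name. It also normalizes a missing `merchant_id`, since pandas stores a missing value in a numeric column as `NaN` rather than `None`.

```python
import math
from datetime import datetime

def amount_category(amount: float) -> str:
    """Bucket an amount: < 50 micro, < 500 small, < 5000 medium, else large."""
    if amount < 50:
        return "micro"
    if amount < 500:
        return "small"
    if amount < 5000:
        return "medium"
    return "large"

def enrich_transaction(tx: dict) -> dict:
    """Return a copy of `tx` with analytics properties and a clean merchant_id."""
    timestamp = tx["transaction_timestamp"]
    merchant_id = tx.get("merchant_id")
    # Normalize NaN (pandas' marker for a missing number) to None
    if isinstance(merchant_id, float) and math.isnan(merchant_id):
        merchant_id = None
    elif merchant_id is not None:
        merchant_id = int(merchant_id)
    return {
        **tx,
        "merchant_id": merchant_id,
        "amount_category": amount_category(tx["amount"]),
        "is_weekend": timestamp.weekday() >= 5,  # Monday is 0, Saturday is 5
        "hour_of_day": timestamp.hour,
    }

example = enrich_transaction({
    "amount": 750.0,
    "transaction_timestamp": datetime(2025, 1, 4, 14, 30),  # a Saturday
    "merchant_id": float("nan"),
})
print(example["amount_category"], example["is_weekend"], example["merchant_id"])
```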

In [None]:
# Create indexes for performance
print("📊 Creating indexes...")
indexes = [
    "CREATE INDEX customer_email IF NOT EXISTS FOR (c:Customer) ON (c.email)",
    "CREATE INDEX customer_name IF NOT EXISTS FOR (c:Customer) ON (c.name)",
    "CREATE INDEX account_type IF NOT EXISTS FOR (a:Account) ON (a.type)",
    "CREATE INDEX merchant_category IF NOT EXISTS FOR (m:Merchant) ON (m.category)",
    "CREATE INDEX transaction_date IF NOT EXISTS FOR ()-[t:TRANSACTION]-() ON (t.date)",
    "CREATE INDEX transaction_amount IF NOT EXISTS FOR ()-[t:TRANSACTION]-() ON (t.amount)"
]

for index in indexes:
    try:
        neo4j.execute(index)
        print(f"✅ {index.split()[2]}")
    except Exception as e:
        print(f"⚠️  {index.split()[2]}: {str(e)}")

## Comprehensive Data Quality Validation

**Post-Import Data Verification:**

After importing data, always validate the results:
1. **Completeness checks**: Count nodes and relationships
2. **Orphaned data detection**: Find nodes without expected relationships
3. **Data distribution analysis**: Verify realistic patterns
4. **Referential integrity**: Ensure all relationships are valid

**Why Validation Matters:**
- Catches import errors early
- Verifies data distribution is realistic
- Identifies performance issues
- Provides confidence in data quality for analytics

## Step 2: Generate Sample Data
Since we're working with sample data, let's create some realistic financial data.

## Query Performance Analysis

**Testing Import Success with Performance Metrics:**

After import, test that your indexes and data structure support efficient queries:
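1. **Index effectiveness**: Email lookups by equality or prefix should use the `customer_email` index; substring filters such as `CONTAINS` are better served by a text index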
2. **Relationship traversals**: Complex queries across multiple hops
3. **Aggregate operations**: Count/sum operations on large datasets

**Performance Benchmarking:**
- On a dataset this small, nearly every query returns in well under a second, so timing alone proves little
- Use `EXPLAIN` or `PROFILE` to confirm that a query plan actually uses an index
- Compare performance before/after index creation
- Monitor query patterns that will be used in production
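
A single timing is noisy, because the first run of a query typically pays for planning and cold caches. One way to get steadier numbers is to repeat the query and keep the best time. This is a sketch under assumptions: `run` is any callable taking `(query, parameters)`, for example this notebook's `neo4j.execute`, and `time_query` is an illustrative helper name.

```python
import time
from typing import Callable, Optional

def time_query(run: Callable[[str, dict], object], query: str,
               parameters: Optional[dict] = None, repeats: int = 3) -> tuple:
    """Run `query` `repeats` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = run(query, parameters or {})
        best = min(best, time.perf_counter() - start)
    return best, result
```

A call could look like `seconds, rows = time_query(neo4j.execute, "MATCH (c:Customer) RETURN count(c) AS n")`.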

In [None]:
import random

# Install faker if it is not available, then import it
try:
    from faker import Faker
except ImportError:
    !pip install faker
    from faker import Faker

fake = Faker()

# Generate sample locations data
locations_data = []
cities = [
    ('New York', 'NY', 'USA', 40.7128, -74.0060),
    ('Los Angeles', 'CA', 'USA', 34.0522, -118.2437),
    ('Chicago', 'IL', 'USA', 41.8781, -87.6298),
    ('Houston', 'TX', 'USA', 29.7604, -95.3698),
    ('Phoenix', 'AZ', 'USA', 33.4484, -112.0740),
    ('Philadelphia', 'PA', 'USA', 39.9526, -75.1652),
    ('San Antonio', 'TX', 'USA', 29.4241, -98.4936),
    ('San Diego', 'CA', 'USA', 32.7157, -117.1611),
    ('Dallas', 'TX', 'USA', 32.7767, -96.7970),
    ('San Jose', 'CA', 'USA', 37.3382, -121.8863)
]

for i, (city, state, country, lat, lon) in enumerate(cities, 1):
    locations_data.append({
        'location_id': i,
        'city': city,
        'state': state,
        'country': country,
        'zip_code': fake.zipcode(),
        'latitude': lat,
        'longitude': lon
    })

df_locations = pd.DataFrame(locations_data)
print(f"Generated {len(df_locations)} locations")
df_locations.head()

## Customer Similarity Analysis for AI Applications

**Graph-Based Recommendation Engine Pattern:**

This query demonstrates how graph structures enable advanced analytics:
1. **Multi-hop traversal**: Customer → Account → Purchase → Merchant ← Purchase ← Account ← Customer
2. **Pattern recognition**: Find customers with shared purchasing behavior
3. **Similarity scoring**: Count shared merchants as similarity metric
4. **AI application**: Foundation for collaborative filtering recommendations

**Business Value:**
This pattern enables personalized recommendations: "Customers like you also shop at..."
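
To make the similarity metric concrete, here is the same computation in plain Python on a toy dataset: for every pair of customers, count the merchants both have purchased from. The data and the function name `shared_merchant_pairs` are invented for illustration; in the notebook the graph traversal does this work inside the database.

```python
from itertools import combinations

def shared_merchant_pairs(purchases: dict, min_shared: int = 2) -> list:
    """Return (customer1, customer2, shared_count) for pairs sharing enough merchants.

    `purchases` maps a customer name to the set of merchant ids they bought from.
    Each unordered pair is reported once, most similar pairs first.
    """
    pairs = []
    for c1, c2 in combinations(sorted(purchases), 2):
        shared = len(purchases[c1] & purchases[c2])  # set intersection
        if shared >= min_shared:
            pairs.append((c1, c2, shared))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

# Toy data: customer -> merchant ids
purchases = {
    "Ana": {1, 2, 3},
    "Ben": {2, 3, 4},
    "Cy": {3},
    "Dee": {1, 2, 3, 9},
}
similar = shared_merchant_pairs(purchases)
print(similar)
```

Merchants that a similar customer uses but the target customer does not are the natural candidates for a "customers like you also shop at" recommendation.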

## Fraud Detection Pattern Analysis

**Anomaly Detection Using Graph Patterns:**

This query shows how to identify potentially fraudulent activity:
1. **Behavioral patterns**: Large transactions on weekends (unusual timing)
2. **Risk correlation**: Connect transaction patterns to customer risk profiles
3. **Context enrichment**: Include transaction details for investigation
4. **Prioritization**: Order by amount to focus on highest-risk activities

**Fraud Detection Insights:**
- Weekend large transactions may indicate compromised accounts
- Risk category correlation helps prioritize investigations
- Graph queries naturally combine multiple risk factors

## Geographic Wealth Distribution Analysis

**Spatial Analytics with Graph Data:**

This query demonstrates geographic analysis capabilities:
1. **Geographic aggregation**: Group customers by location
2. **Wealth concentration**: Sum account balances by region
3. **Customer density**: Count customers per location
4. **Market insights**: Identify high-value geographic markets

**Business Applications:**
- Branch location planning
- Marketing campaign targeting
- Risk assessment by geography
- Market penetration analysis

**Graph Advantage:** Easily traverse customer → location relationships for spatial analytics.
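
One detail matters when aggregating this traversal: it yields one row per customer-account pair, so a customer with three accounts appears three times. Customer counts therefore need to be taken over distinct customers (in Cypher, `count(DISTINCT c)`), while balances are summed over all rows. The plain-Python sketch below illustrates the difference on made-up rows; `wealth_by_city` is an illustrative name.

```python
from collections import defaultdict

def wealth_by_city(rows: list) -> dict:
    """Aggregate (city, customer_id, balance) rows, one row per customer-account pair.

    Returns {city: (distinct customer count, total balance)}.
    """
    customers = defaultdict(set)
    totals = defaultdict(float)
    for city, customer_id, balance in rows:
        customers[city].add(customer_id)  # a set de-duplicates repeated customers
        totals[city] += balance
    return {city: (len(customers[city]), round(totals[city], 2)) for city in totals}

rows = [
    ("Chicago", 1, 100.0),   # customer 1, first account
    ("Chicago", 1, 250.5),   # customer 1, second account
    ("Chicago", 2, 50.0),
    ("Dallas", 3, 10.0),
]
summary = wealth_by_city(rows)
print(summary)
```

Counting rows instead of distinct customers would report three customers for Chicago here, although only two exist.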

In [None]:
# Generate sample customers data
customers_data = []
for i in range(1, 101):  # 100 customers
    customers_data.append({
        'customer_id': i,
        'full_name': fake.name(),
        'email': fake.email(),
        'phone': fake.phone_number(),
        'age': random.randint(18, 80),
        'income_bracket': random.choice(['Low', 'Medium', 'High', 'Very High']),
        'credit_score': random.randint(300, 850),
        'registration_date': fake.date_between(start_date='-5y', end_date='today'),
        'location_id': random.randint(1, len(locations_data))
    })

df_customers = pd.DataFrame(customers_data)
print(f"Generated {len(df_customers)} customers")
df_customers.head()

In [None]:
# Generate sample merchants data
merchant_categories = ['Restaurant', 'Retail', 'Gas Station', 'Grocery', 'Online', 'ATM', 'Pharmacy', 'Entertainment']
merchants_data = []

for i in range(1, 51):  # 50 merchants
    category = random.choice(merchant_categories)
    merchants_data.append({
        'merchant_id': i,
        'merchant_name': fake.company(),
        'category': category,
        'subcategory': f"{category} - {fake.word().title()}",
        'established_year': random.randint(1980, 2020),
        'location_id': random.randint(1, len(locations_data))
    })

df_merchants = pd.DataFrame(merchants_data)
print(f"Generated {len(df_merchants)} merchants")
df_merchants.head()

## Graceful Resource Cleanup

**Proper Connection Management:**

Always close database connections properly:
- Releases connection pool resources
- Prevents connection leaks
- Ensures clean shutdown
- Good practice for production applications

This is especially important in Jupyter notebooks where connections can persist between cell executions.
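
In scripts, as opposed to notebooks, the standard library's `contextlib.closing` guarantees that `close()` is called even when the body raises. The sketch uses a stand-in class, `FakeImporter`, which is hypothetical and only mimics the `execute()` and `close()` methods of this notebook's `Neo4jImporter`.

```python
from contextlib import closing

class FakeImporter:
    """Stand-in for Neo4jImporter; only the close() behavior matters here."""
    def __init__(self):
        self.closed = False

    def execute(self, query, parameters=None):
        return []

    def close(self):
        self.closed = True

importer = FakeImporter()
with closing(importer) as db:
    db.execute("RETURN 1")
# close() has run at this point, whether or not the body raised

print(importer.closed)
```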

In [None]:
# Generate sample accounts data
account_types = ['checking', 'savings', 'credit', 'investment']
accounts_data = []
account_counter = 1

for customer_id in range(1, 101):
    # Each customer has 1-3 accounts
    num_accounts = random.randint(1, 3)
    
    for account_index in range(num_accounts):
        account_type = random.choice(account_types)
        
        # Balance varies by account type
        if account_type == 'checking':
            balance = random.uniform(100, 10000)
        elif account_type == 'savings':
            balance = random.uniform(500, 50000)
        elif account_type == 'credit':
            balance = -random.uniform(0, 5000)  # Credit accounts have negative balances
        else:  # investment
            balance = random.uniform(1000, 100000)
        
        accounts_data.append({
            'customer_id': customer_id,
            'account_number': f"ACC{account_counter:06d}",
            'account_type': account_type,
            'current_balance': round(balance, 2),
            'opened_date': fake.date_between(start_date='-3y', end_date='today'),
            'status': random.choice(['active', 'active', 'active', 'closed']),  # 75% active
            'is_primary': 'true' if account_index == 0 else 'false'  # First account is primary
        })
        account_counter += 1

df_accounts = pd.DataFrame(accounts_data)
print(f"Generated {len(df_accounts)} accounts")
df_accounts.head()

## Step 3: Import Data Using LOAD CSV Pattern
Instead of running `LOAD CSV` against files, we simulate the process: we iterate over the generated DataFrames and send each row to Neo4j as query parameters.

In [None]:
# Import Locations
print("📍 Importing locations...")
start_time = time.time()

for _, row in df_locations.iterrows():
    neo4j.execute("""
        CREATE (l:Location {
            id: $location_id,
            city: $city,
            state: $state,
            country: $country,
            zip_code: $zip_code,
            latitude: $latitude,
            longitude: $longitude
        })
    """, row.to_dict())

print(f"✅ Imported {len(df_locations)} locations in {time.time() - start_time:.2f} seconds")

In [None]:
# Import Merchants
print("🏪 Importing merchants...")
start_time = time.time()

for _, row in df_merchants.iterrows():
    # Create merchant
    neo4j.execute("""
        CREATE (m:Merchant {
            id: $merchant_id,
            name: $merchant_name,
            category: $category,
            subcategory: $subcategory,
            established_year: $established_year
        })
    """, row.to_dict())
    
    # Connect to location
    neo4j.execute("""
        MATCH (m:Merchant {id: $merchant_id})
        MATCH (l:Location {id: $location_id})
        CREATE (m)-[:LOCATED_IN]->(l)
    """, row.to_dict())

print(f"✅ Imported {len(df_merchants)} merchants in {time.time() - start_time:.2f} seconds")

In [None]:
# Import Customers with data validation
print("👥 Importing customers...")
start_time = time.time()
imported_count = 0
skipped_count = 0

for _, row in df_customers.iterrows():
    # Data validation
    if '@' not in row['email'] or row['age'] < 18 or row['age'] > 120:
        skipped_count += 1
        continue
    
    # Create customer with computed risk category
    risk_category = 'low' if row['credit_score'] >= 750 else 'medium' if row['credit_score'] >= 650 else 'high'
    
    customer_data = row.to_dict()
    customer_data['risk_category'] = risk_category
    
    neo4j.execute("""
        CREATE (c:Customer {
            id: $customer_id,
            name: $full_name,
            email: $email,
            phone: $phone,
            age: $age,
            income_bracket: $income_bracket,
            credit_score: $credit_score,
            registration_date: date($registration_date),
            risk_category: $risk_category
        })
    """, customer_data)
    
    # Connect to location
    neo4j.execute("""
        MATCH (c:Customer {id: $customer_id})
        MATCH (l:Location {id: $location_id})
        CREATE (c)-[:LIVES_IN]->(l)
    """, customer_data)
    
    imported_count += 1

print(f"✅ Imported {imported_count} customers, skipped {skipped_count} in {time.time() - start_time:.2f} seconds")

In [None]:
# Import Accounts
print("🏦 Importing accounts...")
start_time = time.time()
imported_count = 0

for _, row in df_accounts.iterrows():
    # Skip non-credit accounts with a negative balance; the MATCH below also creates nothing if the customer was not imported
    if row['account_type'] != 'credit' and row['current_balance'] < 0:
        continue
    
    # Create account with computed properties
    balance = row['current_balance']
    if balance < 1000:
        balance_category = 'low'
    elif balance < 10000:
        balance_category = 'medium'
    elif balance < 100000:
        balance_category = 'high'
    else:
        balance_category = 'very_high'
    
    account_data = row.to_dict()
    account_data['balance_category'] = balance_category
    
    neo4j.execute("""
        MATCH (c:Customer {id: $customer_id})
        CREATE (a:Account {
            number: $account_number,
            type: $account_type,
            balance: $current_balance,
            opened_date: date($opened_date),
            status: $status,
            balance_category: $balance_category
        })
        CREATE (c)-[:HAS_ACCOUNT {
            opened: date($opened_date),
            primary_account: $is_primary = 'true'
        }]->(a)
    """, account_data)
    
    imported_count += 1

print(f"✅ Imported {imported_count} accounts in {time.time() - start_time:.2f} seconds")

## Step 4: Generate and Import Transaction Data

In [None]:
# Generate sample transactions
print("💳 Generating transaction data...")

# Get all account numbers
accounts = neo4j.execute("""
    MATCH (a:Account) 
    RETURN a.number AS number, a.type AS type
""")

account_numbers = [acc['number'] for acc in accounts]
transaction_types = ['transfer', 'payment', 'deposit', 'withdrawal', 'purchase']
channels = ['online', 'mobile', 'atm', 'branch', 'phone']

transactions_data = []
for i in range(1, 1001):  # Generate 1000 transactions
    from_account = random.choice(account_numbers)
    to_account = random.choice([acc for acc in account_numbers if acc != from_account])
    
    transaction_type = random.choice(transaction_types)
    
    # Amount varies by transaction type
    if transaction_type == 'purchase':
        amount = round(random.uniform(5, 500), 2)
    elif transaction_type == 'transfer':
        amount = round(random.uniform(50, 5000), 2)
    else:
        amount = round(random.uniform(20, 2000), 2)
    
    transactions_data.append({
        'transaction_id': f"TXN{i:06d}",
        'from_account': from_account,
        'to_account': to_account,
        'amount': amount,
        'transaction_timestamp': fake.date_time_between(start_date='-1y', end_date='now'),
        'transaction_type': transaction_type,
        'description': f"{transaction_type.title()} - {fake.sentence(nb_words=3)}",
        'channel': random.choice(channels),
        'merchant_id': random.choice(range(1, 26)) if transaction_type == 'purchase' else None
    })

df_transactions = pd.DataFrame(transactions_data)
print(f"Generated {len(df_transactions)} transactions")
df_transactions.head()

In [None]:
# Import Transactions
print("💸 Importing transactions...")
start_time = time.time()
imported_count = 0

for _, row in df_transactions.iterrows():
    # Validate transaction data
    if row['amount'] <= 0 or row['from_account'] == row['to_account']:
        continue
    
    # Compute additional properties
    amount = row['amount']
    if amount < 50:
        amount_category = 'micro'
    elif amount < 500:
        amount_category = 'small'
    elif amount < 5000:
        amount_category = 'medium'
    else:
        amount_category = 'large'
    
    transaction_data = row.to_dict()
    transaction_data['amount_category'] = amount_category
    transaction_data['is_weekend'] = row['transaction_timestamp'].weekday() >= 5
    transaction_data['hour_of_day'] = row['transaction_timestamp'].hour
    
    # Create transaction relationship
    neo4j.execute("""
        MATCH (from:Account {number: $from_account})
        MATCH (to:Account {number: $to_account})
        CREATE (from)-[:TRANSACTION {
            id: $transaction_id,
            amount: $amount,
            date: datetime($transaction_timestamp),
            type: $transaction_type,
            description: $description,
            channel: $channel,
            amount_category: $amount_category,
            is_weekend: $is_weekend,
            hour_of_day: $hour_of_day
        }]->(to)
    """, transaction_data)
    
    # Connect to merchant if applicable (pandas stores a missing merchant_id as NaN, not None)
    if pd.notna(row['merchant_id']):
        transaction_data['merchant_id'] = int(row['merchant_id'])
        neo4j.execute("""
            MATCH (from:Account {number: $from_account})-[t:TRANSACTION {id: $transaction_id}]->(to:Account)
            MATCH (m:Merchant {id: $merchant_id})
            CREATE (from)-[:PURCHASED_FROM {transaction_id: $transaction_id, amount: $amount}]->(m)
        """, transaction_data)
    
    imported_count += 1

print(f"✅ Imported {imported_count} transactions in {time.time() - start_time:.2f} seconds")

## Step 5: Data Quality Validation

In [None]:
# Comprehensive data validation
print("🔍 Validating imported data...")

# Check data completeness
data_stats = neo4j.execute("""
    MATCH (c:Customer)
    OPTIONAL MATCH (c)-[:HAS_ACCOUNT]->(a:Account)
    OPTIONAL MATCH (a)-[t:TRANSACTION]-()
    RETURN 'Data Coverage' AS metric,
           count(DISTINCT c) AS customers,
           count(DISTINCT a) AS accounts,
           count(DISTINCT t) AS transactions
""")

for stat in data_stats:
    print(f"📊 {stat['metric']}: {stat['customers']} customers, {stat['accounts']} accounts, {stat['transactions']} transactions")

# Check for orphaned data
orphaned_accounts = neo4j.execute("""
    MATCH (a:Account)
    WHERE NOT (a)<-[:HAS_ACCOUNT]-()
    RETURN count(a) AS orphaned_accounts
""")

print(f"🔗 Orphaned accounts: {orphaned_accounts[0]['orphaned_accounts']}")

# Check data distribution
risk_distribution = neo4j.execute("""
    MATCH (c:Customer)
    RETURN c.risk_category AS risk_level,
           count(c) AS customer_count,
           avg(c.credit_score) AS avg_credit_score
    ORDER BY risk_level
""")

print("\n📈 Customer Risk Distribution:")
for dist in risk_distribution:
    print(f"  {dist['risk_level']}: {dist['customer_count']} customers (avg credit: {dist['avg_credit_score']:.0f})")

## Step 6: Performance Analysis

In [None]:
# Test query performance
print("⚡ Testing query performance...")

# Test 1: Customer lookup by email (a range index helps equality and prefix matches; CONTAINS is better served by a text index)
start_time = time.time()
result = neo4j.execute("""
    MATCH (c:Customer) 
    WHERE c.email CONTAINS '@gmail.com'
    RETURN count(c) AS gmail_customers
""")
print(f"📧 Email search: {time.time() - start_time:.3f}s - Found {result[0]['gmail_customers']} Gmail users")

# Test 2: Transaction pattern analysis
start_time = time.time()
result = neo4j.execute("""
    MATCH ()-[t:TRANSACTION]->()
    WHERE t.amount > 1000
    RETURN count(t) AS large_transactions
""")
print(f"💰 Large transaction search: {time.time() - start_time:.3f}s - Found {result[0]['large_transactions']} large transactions")

# Test 3: Complex relationship query
start_time = time.time()
result = neo4j.execute("""
    MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account)-[:TRANSACTION]->()
    WITH c, count(*) AS transaction_count
    WHERE transaction_count > 5
    RETURN count(c) AS active_customers
""")
print(f"🔄 Complex relationship query: {time.time() - start_time:.3f}s - Found {result[0]['active_customers']} active customers")

## Step 7: AI-Ready Data Analysis
Let's test some queries that would be useful for AI applications.

In [None]:
# Customer similarity analysis for recommendations
print("🤖 AI Application Testing...")

customer_similarity = neo4j.execute("""
    MATCH (c1:Customer)-[:HAS_ACCOUNT]->()-[:PURCHASED_FROM]->(m:Merchant)
          <-[:PURCHASED_FROM]-()<-[:HAS_ACCOUNT]-(c2:Customer)
    WHERE c1.id < c2.id
    WITH c1, c2, count(DISTINCT m) AS shared_merchants
    WHERE shared_merchants >= 2
    RETURN c1.name AS customer1, c2.name AS customer2, shared_merchants,
           c1.risk_category AS risk1, c2.risk_category AS risk2
    ORDER BY shared_merchants DESC
    LIMIT 10
""")

print("\n🎯 Customer Similarity (for recommendations):")
for sim in customer_similarity:
    print(f"  {sim['customer1']} ↔ {sim['customer2']}: {sim['shared_merchants']} shared merchants")

In [None]:
# Fraud detection patterns
fraud_patterns = neo4j.execute("""
    MATCH (a:Account)-[t:TRANSACTION]->()
    WHERE t.amount > 2000 AND t.is_weekend = true
    MATCH (a)<-[:HAS_ACCOUNT]-(c:Customer)
    RETURN c.name AS customer, c.risk_category AS risk_category,
           t.amount AS amount, t.date AS date, t.description AS description
    ORDER BY t.amount DESC
    LIMIT 10
""")

print("\n🚨 Potential Fraud Patterns (Large weekend transactions):")
for pattern in fraud_patterns:
    print(f"  {pattern['customer']} ({pattern['risk_category']}): ${pattern['amount']:.2f} on {pattern['date']}")

In [None]:
# Geographic analysis
geographic_analysis = neo4j.execute("""
    MATCH (c:Customer)-[:LIVES_IN]->(l:Location)
    MATCH (c)-[:HAS_ACCOUNT]->(a:Account)
    WITH l, count(DISTINCT c) AS customer_count, sum(a.balance) AS total_deposits
    WHERE customer_count > 2
    RETURN l.city AS city, l.state AS state, customer_count, total_deposits
    ORDER BY total_deposits DESC
    LIMIT 10
""")

print("\n🗺️ Geographic Analysis (Customer concentration and wealth):")
for geo in geographic_analysis:
    print(f"  {geo['city']}, {geo['state']}: {geo['customer_count']} customers, ${geo['total_deposits']:,.2f} total deposits")

## Summary and Key Learnings

In this module, you successfully:

### ✅ **Data Modeling**
- Designed a graph model from relational concepts
- Made strategic decisions about nodes vs. properties
- Considered performance implications in the design

### ✅ **Import Process**
- Set up proper constraints and indexes before import
- Implemented data validation during import
- Added computed properties for analytics
- Handled relationships and foreign key mappings

### ✅ **Performance Optimization**
- Created appropriate indexes for query patterns
- Used constraints for data integrity
- Validated import performance

### ✅ **AI Readiness**
- Created graph structures suitable for ML feature engineering
- Implemented patterns for recommendation systems
- Designed fraud detection query patterns
- Enabled geographic and demographic analysis

### 🎯 **Best Practices Learned**
1. **Always validate data** during import
2. **Create constraints and key indexes before** large imports
3. **Design for your query patterns**, not just data structure
4. **Add computed properties** to support analytics
5. **Test performance** with realistic data volumes

### 🚀 **Next Steps**
In the next module, we'll learn how to work with unstructured data and create knowledge graphs from text documents.

## Cleanup

In [None]:
# Close Neo4j connection
neo4j.close()
print("✅ Neo4j connection closed")