# 🗽 NYC DOB Fraud Detection - Data Exploration

This notebook provides an interactive environment for exploring the 25 GB NYC Department of Buildings (DOB) dataset and developing fraud detection algorithms.

## What's Available
- **94 Datasets**: Complete NYC DOB data (violations, permits, complaints, etc.)
- **25 GB of Data**: Raw CSV files, one directory per dataset under `/app/data/raw`
- **Neo4j Database**: Graph analysis capabilities
- **ML Libraries**: Scikit-learn, NetworkX, Pandas, Polars

## Quick Start
1. Run the setup cell below
2. Load a dataset for exploration
3. Develop fraud detection patterns
4. Test community detection algorithms

In [None]:
# Setup: Import required libraries
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("🗽 NYC DOB Fraud Detection Environment Ready!")
print(f"📊 Libraries loaded: pandas {pd.__version__}, polars {pl.__version__}")

## 📁 Dataset Discovery

Let's explore what datasets are available and their sizes.

In [None]:
# Discover available datasets
data_dir = Path("/app/data/raw")
datasets = {}

for dataset_dir in data_dir.iterdir():
    if dataset_dir.is_dir():
        csv_files = list(dataset_dir.glob("*.csv"))
        if csv_files:
            latest_csv = max(csv_files, key=lambda x: x.stat().st_mtime)
            size_mb = latest_csv.stat().st_size / (1024 * 1024)
            datasets[dataset_dir.name] = {
                'path': latest_csv,
                'size_mb': round(size_mb, 2),
                'modified': datetime.fromtimestamp(latest_csv.stat().st_mtime)
            }

# Display datasets sorted by size
datasets_df = pd.DataFrame([
    {'Dataset': name, 'Size (MB)': info['size_mb'], 'Modified': info['modified']}
    for name, info in datasets.items()
]).sort_values('Size (MB)', ascending=False)

print(f"📊 Found {len(datasets)} datasets totaling {datasets_df['Size (MB)'].sum():.1f} MB")
print("\n🔝 Top 10 Largest Datasets:")
display(datasets_df.head(10))

## 🔍 Load and Explore a Dataset

Let's start with one of the key fraud detection datasets.

In [None]:
# Load a key dataset (you can change this)
DATASET_NAME = "dob_violations"  # Change this to explore different datasets
SAMPLE_SIZE = 10000  # Adjust for performance

if DATASET_NAME in datasets:
    dataset_path = datasets[DATASET_NAME]['path']
    
    print(f"📂 Loading {DATASET_NAME} ({datasets[DATASET_NAME]['size_mb']} MB)")
    print(f"🔄 Using the first {SAMPLE_SIZE:,} rows for quick exploration")
    
    # Read only the first SAMPLE_SIZE rows (the head of the file, not a random sample)
    df = pd.read_csv(dataset_path, nrows=SAMPLE_SIZE, low_memory=False)
    
    print(f"✅ Loaded {len(df):,} rows × {len(df.columns)} columns")
    print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    # Basic info
    display(df.head())
    
else:
    print(f"❌ Dataset '{DATASET_NAME}' not found")
    print(f"Available datasets: {list(datasets.keys())[:10]}...")
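
Because `nrows` reads from the top of the file, the sample can over-represent whatever the export happens to be sorted by. One possible alternative, sketched below, draws rows at random while reading by passing a callable to `skiprows`. The helper name `sample_csv` and its parameters are hypothetical, not part of this notebook's environment.

```python
import random

import pandas as pd


def sample_csv(path, fraction, seed=42):
    """Read roughly `fraction` of the data rows of a CSV, chosen at random."""
    rng = random.Random(seed)

    def skip(row_index):
        # Row 0 is the header and is always kept; every other row is
        # skipped with probability (1 - fraction)
        return row_index > 0 and rng.random() > fraction

    return pd.read_csv(path, skiprows=skip, low_memory=False)
```

For example, `sample_csv(dataset_path, 0.01)` would keep about 1% of the rows. The whole file is still scanned, so this trades read time for a less biased sample.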

## 📊 Quick Data Analysis

In [None]:
# Dataset overview
print("📋 Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Data types: {df.dtypes.value_counts().to_dict()}")
print(f"Missing values: {df.isnull().sum().sum()} total")

# Show data info
df.info()

In [None]:
# Visualize missing data patterns
missing_data = df.isnull().sum().sort_values(ascending=False)
missing_data = missing_data[missing_data > 0]

if len(missing_data) > 0:
    plt.figure(figsize=(12, 6))
    missing_data.plot(kind='bar')
    plt.title('Missing Data by Column')
    plt.xlabel('Columns')
    plt.ylabel('Missing Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing data found!")

## 🔍 Fraud Detection Analysis

Look for patterns that might indicate fraudulent activity.

In [None]:
# Identify key columns for fraud detection
fraud_columns = []
for col in df.columns:
    col_lower = col.lower()
    if any(keyword in col_lower for keyword in 
           ['bin', 'contractor', 'owner', 'violation', 'permit', 'license', 'address']):
        fraud_columns.append(col)

print(f"🔍 Key columns for fraud detection: {fraud_columns}")

# Show sample of fraud-relevant data
if fraud_columns:
    display(df[fraud_columns].head(10))

In [None]:
# Look for potential patterns
# Example: Frequency analysis of key entities

if 'bin' in [col.lower() for col in df.columns]:
    bin_col = [col for col in df.columns if col.lower() == 'bin'][0]
    
    # Properties with multiple violations (potential red flag)
    bin_counts = df[bin_col].value_counts()
    high_violation_properties = bin_counts[bin_counts > 5]  # Properties with >5 violations
    
    print(f"🚨 Properties with >5 violations: {len(high_violation_properties)}")
    print("🏆 Top violators:")
    display(high_violation_properties.head(10))
    
    # Visualize violation distribution
    plt.figure(figsize=(12, 6))
    bin_counts.head(20).plot(kind='bar')
    plt.title('Top 20 Properties by Violation Count')
    plt.xlabel('Building ID (BIN)')
    plt.ylabel('Violation Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
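
The counts above come from the loaded rows only. To rank properties across an entire file without holding it in memory, one option is to aggregate chunk by chunk. This is a minimal sketch; `count_values_in_chunks` is a hypothetical helper, and the column name passed to it must match the file's actual BIN column.

```python
from collections import Counter

import pandas as pd


def count_values_in_chunks(path, column, chunksize=100_000):
    """Count occurrences of each value in `column` without loading the whole file."""
    totals = Counter()
    # usecols limits parsing to the single column being counted;
    # dtype=str keeps identifiers such as BINs from being read as floats
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize, dtype=str):
        totals.update(chunk[column].dropna())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)
```

A call such as `count_values_in_chunks(dataset_path, bin_col).head(20)` would then give the top 20 properties for the full dataset.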

## 🕸️ Network Analysis for Fraud Detection

Create graphs to identify suspicious relationships between entities.

In [None]:
# Import network analysis libraries
import networkx as nx
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

print("🕸️ Network analysis libraries loaded")
print("📈 Ready for community detection algorithms:")
print("   • Louvain community detection")
print("   • Label propagation")
print("   • DBSCAN clustering")
print("   • Graph-based fraud pattern discovery")
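
`DBSCAN` and `StandardScaler` are imported above but not yet used. As a rough illustration of how they might be applied here, the sketch below clusters per-property feature rows and treats DBSCAN's noise label (`-1`) as "unusual". The feature names, the toy numbers, and the `eps` and `min_samples` values are all assumptions chosen for the example, not tuned settings.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def flag_outlier_properties(features, eps=0.5, min_samples=3):
    """Label each row with a DBSCAN cluster id; -1 marks noise points."""
    # Standardize so that no single feature dominates the distance
    scaled = StandardScaler().fit_transform(features)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    return features.assign(cluster=labels)


# Toy per-property features (made-up numbers, indexed by a fake BIN)
features = pd.DataFrame(
    {"violation_count": [1, 2, 1, 2, 1, 50], "unique_types": [1, 1, 2, 1, 2, 10]},
    index=["B1", "B2", "B3", "B4", "B5", "B6"],
)
result = flag_outlier_properties(features)
print(result[result["cluster"] == -1])
```

A noise label only says a property sits far from any dense group in feature space. It is a prompt for manual review, not evidence of fraud.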

In [None]:
# Example: Create a simple network from violation data
# This connects properties (BIN) to violation types

if 'bin' in [col.lower() for col in df.columns] and len(df) > 0:
    # Find BIN and violation type columns
    bin_col = [col for col in df.columns if col.lower() == 'bin'][0]
    
    # Look for violation type column
    violation_cols = [col for col in df.columns if 'violation' in col.lower() or 'class' in col.lower()]
    
    if violation_cols:
        violation_col = violation_cols[0]
        
        # Create a bipartite graph: Properties <-> Violation Types
        G = nx.Graph()
        
        # Add edges between properties and violation types
        for _, row in df.dropna(subset=[bin_col, violation_col]).head(1000).iterrows():  # Limit for performance
            property_id = f"PROP_{row[bin_col]}"
            violation_type = f"VIOL_{row[violation_col]}"
            G.add_edge(property_id, violation_type)
        
        print(f"🕸️ Created network with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
        print(f"📊 Network density: {nx.density(G):.4f}")
        
        # Find properties with most violation types (potential fraud indicators)
        property_nodes = [n for n in G.nodes() if n.startswith('PROP_')]
        property_degrees = [(node, G.degree(node)) for node in property_nodes]
        property_degrees.sort(key=lambda x: x[1], reverse=True)
        
        print("\n🚨 Properties with most violation types (potential fraud indicators):")
        for prop, degree in property_degrees[:10]:
            print(f"   {prop}: {degree} different violation types")
    else:
        print("❌ No violation type column found for network analysis")
else:
    print("❌ BIN column not found for network analysis")
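
A bipartite graph like the one above can also be projected onto its property nodes, so that two properties are linked when they share a violation type. The sketch below uses NetworkX's `bipartite.weighted_projected_graph`. The helper `project_onto_properties` is hypothetical and relies on the `PROP_` prefix convention from the cell above.

```python
import networkx as nx
from networkx.algorithms import bipartite


def project_onto_properties(graph, prefix="PROP_"):
    """Link properties that share at least one violation type.

    The edge attribute `weight` is the number of shared violation types.
    """
    property_nodes = [n for n in graph.nodes() if str(n).startswith(prefix)]
    return bipartite.weighted_projected_graph(graph, property_nodes)


# Toy graph using the same node naming as the cell above
B = nx.Graph()
B.add_edges_from([
    ("PROP_1", "VIOL_A"), ("PROP_1", "VIOL_B"),
    ("PROP_2", "VIOL_A"), ("PROP_2", "VIOL_B"),
    ("PROP_3", "VIOL_B"),
])
P = project_onto_properties(B)
print(sorted(P.edges(data="weight")))
```

Heavily weighted edges in the projection point to pairs of properties with very similar violation profiles, which may be a more useful starting point for community detection than the raw bipartite graph.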

## 🔗 Connect to Neo4j Database

For advanced graph analysis and persistent storage of relationships.

In [None]:
# Connect to Neo4j database
try:
    from neo4j import GraphDatabase
    
    # Connection details (using Docker service names)
    NEO4J_URI = "bolt://neo4j:7687"
    NEO4J_USER = "neo4j"
    NEO4J_PASSWORD = "password"  # Change this to your actual password
    
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    
    # Test connection
    with driver.session() as session:
        result = session.run("RETURN 'Connected to Neo4j!' as message")
        print(result.single()["message"])
        
        # Show database stats
        result = session.run("MATCH (n) RETURN count(n) as node_count")
        node_count = result.single()["node_count"]
        
        result = session.run("MATCH ()-[r]->() RETURN count(r) as rel_count")
        rel_count = result.single()["rel_count"]
        
        print(f"📊 Neo4j Database: {node_count:,} nodes, {rel_count:,} relationships")
    
    print("✅ Neo4j connection successful!")
    
except Exception as e:
    print(f"❌ Neo4j connection failed: {e}")
    print("💡 Make sure Neo4j is running and credentials are correct")
    print("   You can access Neo4j browser at: http://localhost:37474")
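
Once the connection works, property and violation-type pairs could be written to Neo4j in batches. The sketch below is one way to do that with `UNWIND` and `MERGE`. The labels `Property` and `ViolationType`, the relationship type `HAS_VIOLATION`, and the helper names are assumptions made for illustration; the actual graph schema in this environment may differ.

```python
MERGE_EDGES_QUERY = """
UNWIND $rows AS row
MERGE (p:Property {bin: row.bin})
MERGE (v:ViolationType {name: row.violation_type})
MERGE (p)-[:HAS_VIOLATION]->(v)
"""


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_edges(session, rows, batch_size=1000):
    """Send rows to Neo4j in batches; `session` is a neo4j driver session.

    Each row is a dict with the keys `bin` and `violation_type`.
    """
    sent = 0
    for batch in batched(rows, batch_size):
        # Values travel as query parameters, never formatted into the query text
        session.run(MERGE_EDGES_QUERY, rows=batch).consume()
        sent += len(batch)
    return sent
```

Usage would look like `with driver.session() as session: write_edges(session, rows)`, where `rows` is a list of dicts built from the DataFrame. `MERGE` keeps the load idempotent, so rerunning the cell does not duplicate nodes or relationships.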

## 🧠 Community Detection Example

Detect communities in the violation network to find suspicious clusters.

In [None]:
# Run community detection on the network we created
if 'G' in locals() and G.number_of_nodes() > 0:
    
    # Use NetworkX's built-in community detection
    try:
        from networkx.algorithms import community
        
        # Louvain community detection
        communities = community.louvain_communities(G, seed=42)
        
        print(f"🏘️ Found {len(communities)} communities")
        
        # Analyze communities
        community_info = []
        for i, comm in enumerate(communities):
            properties = [n for n in comm if n.startswith('PROP_')]
            violations = [n for n in comm if n.startswith('VIOL_')]
            
            community_info.append({
                'Community': i,
                'Size': len(comm),
                'Properties': len(properties),
                'Violation_Types': len(violations)
            })
        
        comm_df = pd.DataFrame(community_info).sort_values('Size', ascending=False)
        
        print("\n📊 Community Analysis:")
        display(comm_df.head(10))
        
        # Look for suspicious communities (many properties, few violation types)
        suspicious = comm_df[
            (comm_df['Properties'] > 2) & 
            (comm_df['Violation_Types'] <= 2)
        ]
        
        if len(suspicious) > 0:
            print("\n🚨 Potentially suspicious communities (many properties, few violation types):")
            display(suspicious)
        
    except ImportError:
        print("‚ùå Community detection requires networkx >= 2.8")
        print("   Using simple connected components instead")
        
        components = list(nx.connected_components(G))
        print(f"üîó Found {len(components)} connected components")
        
        # Show largest components
        components.sort(key=len, reverse=True)
        for i, comp in enumerate(components[:5]):
            print(f"   Component {i}: {len(comp)} nodes")

else:
    print("❌ No network available for community detection")
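
Modularity gives a single number for judging a partition, which makes it possible to compare different community detection methods on the same graph. The sketch below runs label propagation, one of the algorithms listed earlier in this notebook, and scores the result. `partition_quality` is a hypothetical helper and the graph is a toy example.

```python
import networkx as nx
from networkx.algorithms import community


def partition_quality(graph):
    """Run label propagation and return (communities, modularity)."""
    communities = [set(c) for c in community.label_propagation_communities(graph)]
    return communities, community.modularity(graph, communities)


# Toy graph: two triangles with no edges between them
toy = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)])
parts, score = partition_quality(toy)
print(len(parts), round(score, 3))
```

Calling `community.modularity(G, communities)` on the Louvain result from the cell above would give a directly comparable figure for that partition.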

## 📈 Fraud Scoring Example

Create a simple fraud risk score for properties based on violation patterns.

In [None]:
# Create fraud risk scores for properties
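if 'bin' in [col.lower() for col in df.columns]:
    bin_col = [col for col in df.columns if col.lower() == 'bin'][0]
    
    # Pick a violation-type column by name (an assumption about the schema);
    # fall back to the last column, which is only a rough stand-in
    type_cols = [col for col in df.columns if 'type' in col.lower() or 'class' in col.lower()]
    type_col = type_cols[0] if type_cols else df.columns[-1]
    
    # Calculate fraud indicators
    fraud_scores = []
    
    for bin_id in df[bin_col].value_counts().head(50).index:  # Top 50 for performance
        property_data = df[df[bin_col] == bin_id]
        
        # Fraud indicators
        violation_count = len(property_data)
        unique_violation_types = property_data[type_col].nunique()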
        
        # Simple fraud score (you can make this more sophisticated)
        fraud_score = (
            violation_count * 0.3 +  # Number of violations
            unique_violation_types * 0.7  # Variety of violation types
        )
        
        fraud_scores.append({
            'BIN': bin_id,
            'Violation_Count': violation_count,
            'Unique_Violation_Types': unique_violation_types,
            'Fraud_Score': round(fraud_score, 2)
        })
    
    fraud_df = pd.DataFrame(fraud_scores).sort_values('Fraud_Score', ascending=False)
    
    print("🎯 Property Fraud Risk Scores (Top 20):")
    display(fraud_df.head(20))
    
    # Visualize fraud scores
    plt.figure(figsize=(12, 6))
    plt.scatter(fraud_df['Violation_Count'], fraud_df['Unique_Violation_Types'], 
               c=fraud_df['Fraud_Score'], cmap='Reds', alpha=0.7)
    plt.colorbar(label='Fraud Score')
    plt.xlabel('Number of Violations')
    plt.ylabel('Unique Violation Types')
    plt.title('Property Fraud Risk Analysis')
    plt.show()
    
else:
    print("❌ BIN column not found for fraud scoring")
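
The score above adds a raw count to a raw number of types, so whichever indicator has the larger range dominates regardless of the 0.3 and 0.7 weights. One possible refinement, sketched here, computes both indicators with a single `groupby` and min-max scales them before weighting. `score_properties` and its column arguments are hypothetical, and the weights are simply carried over from the cell above.

```python
import pandas as pd


def score_properties(frame, bin_col, type_col, w_count=0.3, w_types=0.7):
    """Weighted score from min-max scaled violation count and type variety."""
    stats = frame.groupby(bin_col)[type_col].agg(
        Violation_Count="size", Unique_Violation_Types="nunique"
    )
    # Scale each indicator to [0, 1]; a constant column scales to 0
    spread = (stats.max() - stats.min()).replace(0, 1)
    scaled = (stats - stats.min()) / spread
    stats["Fraud_Score"] = (
        w_count * scaled["Violation_Count"]
        + w_types * scaled["Unique_Violation_Types"]
    )
    return stats.sort_values("Fraud_Score", ascending=False)
```

This version scores every property rather than only the top 50 by count, and it avoids filtering the DataFrame once per BIN.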

## 🛠️ Run Existing Fraud Detection Scripts

Execute the pre-built community detection and analysis scripts.

In [None]:
# List available fraud detection scripts
scripts_dir = Path("/app/scripts/fraud_detection")

if scripts_dir.exists():
    script_files = list(scripts_dir.glob("*.py"))
    print("🛠️ Available fraud detection scripts:")
    for script in script_files:
        print(f"   • {script.name}")
    
    print("\n💡 To run a script, use:")
    print("   !python /app/scripts/fraud_detection/SCRIPT_NAME.py")
else:
    print("❌ Scripts directory not found")

In [None]:
# Example: Run the community detection algorithms
# Uncomment the line below to execute
# !python /app/scripts/fraud_detection/community_detection_algorithms.py

## 💡 Next Steps

Now you can:

### 🔄 Explore Different Datasets
Change `DATASET_NAME` to explore:
- `"ecb_violations"` - Environmental violations
- `"maintenance_code_violations"` - Housing maintenance issues
- `"housing_litigations"` - Legal cases
- `"job_application_filings"` - Permit applications
- `"complaints_received"` - Public complaints

### 📊 Advanced Analysis
- **Load full datasets**: Remove `nrows=SAMPLE_SIZE`, or read in chunks when a file is too large for memory
- **Cross-reference datasets**: Join multiple datasets on BIN/address
- **Time series analysis**: Look for temporal fraud patterns
- **Geographic analysis**: Map fraud hotspots
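
Cross-referencing on BIN tends to fail quietly when one file stores the key as an integer and another as a string or float. The sketch below normalizes the key before a left join and uses pandas' `indicator=True` to show which rows found a match. The helper names and the default column name `bin` are assumptions; real column names vary between datasets.

```python
import pandas as pd


def normalize_bin(series):
    """Return BINs as stripped strings so keys match across files."""
    as_text = series.astype("string").str.strip()
    # Floats such as 1001.0 become "1001.0"; drop the trailing ".0"
    return as_text.str.replace(r"\.0$", "", regex=True)


def join_on_bin(left, right, left_bin="bin", right_bin="bin"):
    """Left-join two datasets on BIN and report which rows found a match."""
    left = left.assign(bin_key=normalize_bin(left[left_bin]))
    right = right.assign(bin_key=normalize_bin(right[right_bin]))
    return left.merge(
        right, on="bin_key", how="left",
        suffixes=("_left", "_right"), indicator=True,
    )
```

The `_merge` column in the result takes the values `both` or `left_only`, so `merged["_merge"].value_counts()` gives a quick check on how well two datasets line up.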

### 🕸️ Network Analysis
- **Multi-layer networks**: Connect contractors, properties, inspectors
- **Temporal networks**: Track relationships over time
- **Anomaly detection**: Find unusual network patterns

### 🤖 Machine Learning
- **Classification**: Predict fraudulent vs. legitimate activity
- **Clustering**: Group similar fraud patterns
- **Feature engineering**: Create sophisticated fraud indicators

### 🔗 Integration
- **Store in Neo4j**: Persist networks for complex queries
- **Export results**: Save findings for reporting
- **Automate detection**: Schedule regular fraud scans

In [None]:
# Your custom fraud detection experiments start here! 🚀
# 
# Ideas to try:
# 1. Load multiple datasets and join them
# 2. Create contractor-property networks
# 3. Implement temporal analysis
# 4. Build ML models for fraud prediction
# 5. Integrate with Neo4j for persistent storage