# Databricks Collaboration & Adoption Monitor

## Overview

This notebook provides a **comprehensive analysis of platform adoption and user collaboration** by tracking user activity, feature usage, AI/agent adoption, and collaboration patterns. The output includes detailed reports on active users, feature adoption rates, AI assistant usage, inactive accounts, and training needs.

**✨ Enterprise-grade adoption monitoring with user activity tracking, AI/agent usage analysis, feature adoption metrics, and collaboration insights.**

---

## Features

### User Activity Tracking
* **Active Users**: Daily (DAU), Weekly (WAU), Monthly (MAU) active users
* **Login Patterns**: Last login times, login frequency
* **Inactive Users**: Users with no activity in 30/60/90 days
* **User Segmentation**: Power users, regular users, occasional users, inactive
* **Activity Trends**: Week-over-week, month-over-month growth
* **Service Principal Filtering**: Excludes automated accounts from metrics
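
As a rough illustration of the active-user metrics above, the sketch below counts distinct users over 1-, 7-, and 30-day windows. The record shape, the sample users, and the inclusive window definition are assumptions made for this example; the notebook itself derives activity from `system.access.audit`.

```python
from datetime import date, timedelta

def active_users(records, as_of, window_days):
    """Count distinct users active in the window_days ending on as_of (inclusive)."""
    cutoff = as_of - timedelta(days=window_days - 1)
    return len({user for user, day in records if cutoff <= day <= as_of})

# Hypothetical (user, activity_date) pairs, shaped like a per-day activity rollup
records = [
    ("ana@example.com", date(2026, 2, 18)),
    ("ana@example.com", date(2026, 2, 17)),
    ("ben@example.com", date(2026, 2, 14)),
    ("cy@example.com", date(2026, 1, 25)),
]

as_of = date(2026, 2, 18)
dau = active_users(records, as_of, 1)   # daily window
wau = active_users(records, as_of, 7)   # weekly window
mau = active_users(records, as_of, 30)  # monthly window
print(dau, wau, mau)
```

Counting through a set keeps a user who is active on several days from being counted more than once per window.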

### Feature Usage Analysis
* **Feature Categorization**: Groups features into logical buckets (Notebooks, SQL, Genie/AI, Jobs, Git, MLflow, Dashboards)
* **Category-Level Adoption**: Overall adoption rates by feature category
* **Top Features per Category**: Most used features in each bucket
* **Adoption Rates**: Percentage of users using each category and feature
* **Category Ranking**: Which feature buckets are most popular

### AI & Agent Adoption
* **AI Assistant Usage**: Users leveraging Databricks Assistant
* **AI Adoption Funnel**: Tried once → Light → Regular → Power users
* **AI Feature Diversity**: Number of AI features explored per user
* **AI Engagement Patterns**: Consistent vs. one-time users
* **AI Retention**: Active vs. dormant AI users
* **Agent Interactions**: Frequency of agent usage per user
* **Power AI Users**: Top users by agent interaction count
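
The funnel stages can be sketched as a simple bucketing function. The boundaries (1, 2-5, 6-20, more than 20 interactions) follow the notebook's AI feature breakdown; the function name and sample counts are hypothetical.

```python
from collections import Counter

def ai_funnel_stage(interactions: int) -> str:
    """Map a user's total AI interaction count to a funnel stage."""
    if interactions <= 0:
        return "none"
    if interactions == 1:
        return "tried_once"
    if interactions <= 5:
        return "light"
    if interactions <= 20:
        return "regular"
    return "power"

# Hypothetical per-user interaction totals
interaction_counts = {"ana": 1, "ben": 4, "cy": 20, "dee": 21, "eli": 0}
funnel = Counter(ai_funnel_stage(n) for n in interaction_counts.values())
print(dict(funnel))
```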

### User Journey Mapping
* **Entry Points**: Where users first discover the platform (Notebooks/SQL/AI)
* **Feature Discovery Patterns**: Natural onboarding paths
* **Adoption Timeline**: Days to adopt key features
* **Feature Breadth**: Single vs. multi-feature users
* **Onboarding Success**: Users who explore multiple features

### Collaboration Metrics
* **Collaboration Score**: 0-100 score measuring team effectiveness
* **Shared Notebooks**: Notebook editing and collaboration
* **Dashboard Sharing**: Dashboard creation for team visibility
* **Git Integration**: Version control usage
* **Collaboration Recommendations**: Areas for improvement
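
One way a 0-100 score could be assembled from the three components is sketched below. The weights (notebooks 40, dashboards 30, Git 30) come from the version history; scaling each weight by the share of users who touched that feature is an assumption for illustration, not necessarily the notebook's formula.

```python
# Component weights as listed in the version history (sum to 100)
WEIGHTS = {"notebooks": 40, "dashboards": 30, "git": 30}

def collaboration_score(users_by_feature, total_users):
    """Return (score, breakdown), assuming each component scales with user share."""
    if total_users <= 0:
        return 0.0, {}
    breakdown = {}
    for feature, weight in WEIGHTS.items():
        # Share of users who used the feature, capped at 100%
        share = min(users_by_feature.get(feature, 0) / total_users, 1.0)
        breakdown[feature] = weight * share
    return sum(breakdown.values()), breakdown

# Hypothetical counts: 40 active users, 30 edited notebooks, 10 built dashboards, 10 used Git
score, breakdown = collaboration_score(
    {"notebooks": 30, "dashboards": 10, "git": 10}, total_users=40
)
print(score, breakdown)
```

Comparing each entry in `breakdown` with its weight shows which area has the most room to improve.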

### Adoption Insights
* **Feature Adoption Rates**: % of users using each feature
* **Adoption Trends**: Feature usage over time
* **Training Needs**: Users with low activity
* **Onboarding Success**: New user activity patterns
* **Department Analysis**: Usage by team/department

### Recommendations
* **Inactive User Cleanup**: Accounts to deactivate
* **Training Opportunities**: Users who could benefit from training
* **Feature Promotion**: Underutilized features to promote
* **AI Adoption**: Users who should try AI assistant
* **Collaboration Improvements**: Git and dashboard adoption

---

## Version Control

| Version | Date | Author | Changes |
|---------|------|--------|---------|  
| 1.0.0 | 2026-02-16 | Assistant | Comprehensive collaboration and adoption monitoring system with complete user activity tracking. Features include: user inventory with activity status, login pattern analysis, active user metrics (DAU/WAU/MAU), feature usage tracking across notebooks, SQL queries, dashboards, jobs, repos, MLflow, and DLT pipelines, AI/agent usage analysis (Databricks Assistant adoption, interaction frequency, feature breakdown), inactive user identification (30/60/90 day thresholds), user segmentation (power users, regular, occasional, inactive), collaboration metrics (shared notebooks, cross-team activity, group membership), adoption rate calculations per feature, adoption trend analysis, training needs identification, recommendations for inactive user cleanup and feature promotion, multiple export formats (Delta table with historical tracking, Excel multi-sheet workbook), interactive visualizations (adoption trends, feature usage heatmap, AI adoption charts), system.access.audit log queries for activity data, job mode support with automatic configuration, serverless compute optimization, parallel processing, retry logic, progress tracking, and comprehensive error handling. |
| 1.1.0 | 2026-02-18 | Assistant | Added service principal filtering and feature categorization. New features include: service principal filtering to exclude automated accounts from adoption metrics (filters explicit list of service accounts, UUID-pattern accounts, and group accounts starting with 'Developers-'), configurable filtering options (FILTER_SERVICE_PRINCIPALS, FILTER_UUID_ACCOUNTS, FILTER_GROUP_ACCOUNTS), before/after filtering statistics showing accounts and records removed, feature categorization into logical buckets (Notebooks, SQL, Genie/AI, Dashboards, Jobs, Git/Repos, MLflow), category-level adoption metrics with overall usage and unique user counts per category, top features displayed within each category, adoption rate calculations at both category and feature level, category adoption ranking showing most to least adopted feature areas, enhanced reporting with category summaries and feature breakdowns, improved accuracy of DAU/WAU/MAU metrics by excluding service principals, better identification of real power users vs automation, more actionable insights for training and feature promotion focused on actual human users. |
| 1.2.0 | 2026-02-18 | Assistant | Added advanced analytics and expanded visualizations. New features include: AI Feature Breakdown with detailed adoption funnel analysis (tried once, light users 2-5, regular users 6-20, power users 20+), AI feature diversity tracking (number of AI features explored per user), AI engagement patterns (consistent vs one-time users), AI retention metrics (active vs dormant users), multi-service AI usage tracking. User Journey Mapping with entry point analysis (where users start: Notebooks/SQL/AI), feature discovery patterns, adoption timeline (days to adopt key features), feature breadth analysis (single vs multi-feature users), onboarding success metrics. Collaboration Score calculation (0-100 scale) with component breakdown (notebooks 40pts, dashboards 30pts, git 30pts), collaboration recommendations, team effectiveness assessment. Expanded visualizations from 6 to 9 charts in 3x3 grid including: AI Adoption Funnel chart showing user progression through AI adoption stages, User Journey Entry Points chart showing where users discover the platform, Collaboration Score Breakdown chart with actual vs potential scores. Enhanced insights for AI adoption strategies, onboarding optimization, and collaboration improvement initiatives. |

---

## Configuration

### Analysis Period:
* `LOOKBACK_DAYS = 30` - Days of activity to analyze (default: 30)
* `INACTIVE_THRESHOLD_DAYS = 90` - Days to consider user inactive
* `POWER_USER_THRESHOLD = 50` - Activity count to classify as power user
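
A minimal sketch of how activity thresholds might map users to segments. `POWER_USER_THRESHOLD` and `MIN_ACTIVITY_FOR_ACTIVE` mirror the configuration values in this notebook; `REGULAR_USER_THRESHOLD` and the function itself are hypothetical, since the cut-off between regular and occasional users is not documented here.

```python
POWER_USER_THRESHOLD = 50      # from the notebook configuration
MIN_ACTIVITY_FOR_ACTIVE = 1    # from the notebook configuration
REGULAR_USER_THRESHOLD = 10    # hypothetical value, chosen only for this example

def segment_user(activity_count: int) -> str:
    """Classify a user by activity count within the lookback period."""
    if activity_count < MIN_ACTIVITY_FOR_ACTIVE:
        return "inactive"
    if activity_count >= POWER_USER_THRESHOLD:
        return "power"
    if activity_count >= REGULAR_USER_THRESHOLD:
        return "regular"
    return "occasional"

# Hypothetical activity totals per user
segments = {user: segment_user(n) for user, n in {"ana": 120, "ben": 12, "cy": 3, "dee": 0}.items()}
print(segments)
```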

### Service Principal Filtering:
* `FILTER_SERVICE_PRINCIPALS = True` - Enable/disable filtering
* `FILTER_UUID_ACCOUNTS = True` - Filter UUID-pattern accounts
* `FILTER_GROUP_ACCOUNTS = True` - Filter 'Developers-' accounts
* `SERVICE_PRINCIPALS = [...]` - Explicit list of service accounts to exclude

### Activity Thresholds:
* `MIN_ACTIVITY_FOR_ACTIVE = 1` - Minimum actions to count as active
* `DAU_DAYS = 1` - Daily active users lookback
* `WAU_DAYS = 7` - Weekly active users lookback
* `MAU_DAYS = 30` - Monthly active users lookback

### Export Settings:
* `EXPORT_PATH = '/dbfs/tmp/adoption_export'` - Export directory
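* `ENABLE_EXCEL_EXPORT` - Excel workbook generation (`True` in job mode, `False` in interactive mode)
* `ENABLE_DELTA_EXPORT` - Delta table for historical tracking (`True` in job mode, `False` in interactive mode)
* `ENABLE_JSON_EXPORT` - JSON summary export (`True` in job mode, `False` in interactive mode)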
* `ENABLE_VISUALIZATIONS = True` - Generate charts (interactive mode)
* `DELTA_TABLE_NAME = 'main.default.adoption_history'` - Delta table name

### Performance Settings:
* `MAX_USERS = 999` - Maximum users to analyze (999 = all)
* `MAX_WORKERS = 10` - Parallel threads for API calls
* `MAX_RETRIES = 3` - Retries for failed operations
* `RETRY_DELAY = 2` - Base delay in seconds between retries (doubles after each failed attempt)

---

## Usage

### Interactive Mode
1. Run all cells to analyze adoption metrics
2. Review user activity and feature usage by category
3. Analyze AI adoption funnel and user journey patterns
4. Review collaboration score and recommendations
5. View 9 comprehensive visualizations
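6. Download the Excel report from the export path (exports are disabled by default in interactive mode; set `ENABLE_EXCEL_EXPORT = True` first)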

### Job Mode
1. Schedule as a Databricks job (weekly recommended)
2. Automatically runs comprehensive analysis
3. Exports to Delta table for trend tracking
4. Returns JSON summary for orchestration
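
The JSON summary returned in job mode could be assembled along these lines. The field names and the `status` rule are assumptions for illustration; the `stats` dictionary mirrors the notebook's `execution_stats`.

```python
import json
import time

def build_job_summary(stats, metrics):
    """Serialize run statistics and headline metrics into a JSON string."""
    summary = {
        # Hypothetical rule: any API failure downgrades the run to "partial"
        "status": "success" if stats.get("api_failures", 0) == 0 else "partial",
        "duration_seconds": round(time.time() - stats["start_time"], 2),
        "users_processed": stats.get("users_processed", 0),
        "audit_records_processed": stats.get("audit_records_processed", 0),
        "metrics": metrics,
    }
    return json.dumps(summary)

# Hypothetical values standing in for a finished run
stats = {
    "start_time": time.time(),
    "api_failures": 0,
    "users_processed": 42,
    "audit_records_processed": 1000,
}
payload = build_job_summary(stats, {"mau": 30})
print(payload)
```

In a Databricks job, a string like `payload` can be handed back to the orchestrator with `dbutils.notebook.exit(payload)`.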

---

## Data Sources

| Data Source | Purpose |
|-------------|----------|
| `system.access.audit` | User activity logs (notebook runs, queries, dashboard views, agent usage) |
| Databricks SDK - Users API | User inventory, login times, active status |
| Databricks SDK - Groups API | Group membership, collaboration patterns |
| Workspace API | Shared notebooks, collaboration metrics |

---

## Key Features

✓ **User Activity Tracking**: DAU/WAU/MAU metrics (real users only)  
✓ **Service Principal Filtering**: Excludes automation from metrics  
✓ **Feature Categorization**: Groups features into logical buckets  
✓ **Category Adoption Metrics**: Adoption rates by feature area  
✓ **AI Feature Breakdown**: Adoption funnel, diversity, retention  
✓ **User Journey Mapping**: Entry points, discovery, onboarding  
✓ **Collaboration Score**: Team effectiveness measurement (0-100)  
✓ **9 Visualizations**: Comprehensive charts including AI funnel, journey, collaboration  
✓ **Inactive User Detection**: 30/60/90 day inactivity thresholds  
✓ **Power User Identification**: Top users by activity  
✓ **Training Needs**: Identify users needing support  
✓ **System Table Queries**: Leverages system.access.audit  
✓ **Multiple Export Formats**: Excel and Delta table  
✓ **Job Mode Support**: Automated scheduled execution  
✓ **Serverless Optimized**: Compute-aware optimizations  
✓ **Comprehensive Error Handling**: Graceful degradation  
✓ **Historical Tracking**: Delta table with append mode

In [0]:
%pip install openpyxl --quiet

In [0]:
# ============================================================================
# IMPORTS
# ============================================================================

# Standard library
import time
import os
from datetime import datetime, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed

# Third-party
import pandas as pd
import pytz

# Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound, PermissionDenied

# PySpark
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, TimestampType, DoubleType

# ============================================================================
# JOB MODE DETECTION (MUST BE FIRST)
# ============================================================================

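try:
    # currentRunId() returns an Option; it is defined only when the notebook runs as a job,
    # so use its value instead of treating any successful call as job mode
    is_job_mode = bool(
        dbutils.notebook.entry_point.getDbutils().notebook().getContext().currentRunId().isDefined()
    )
except Exception:
    is_job_mode = False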

# ============================================================================
# SERVERLESS DETECTION
# ============================================================================

try:
    test_df = spark.range(1)
    test_df.cache()
    test_df.count()
    test_df.unpersist()
    is_serverless = False
except Exception as e:
    # Serverless compute rejects cache/persist; only that failure signals serverless
    is_serverless = 'PERSIST' in str(e).upper() or 'CACHE' in str(e).upper()

# ============================================================================
# TIMEZONE CONFIGURATION
# ============================================================================

TIMEZONE = 'America/New_York'
eastern = pytz.timezone(TIMEZONE)

# ============================================================================
# LOGGING FUNCTION
# ============================================================================

def log(message):
    """Print a message (printed unconditionally in both interactive and job mode)"""
    print(message)

# ============================================================================
# CONFIGURATION PARAMETERS
# ============================================================================

# Analysis period
LOOKBACK_DAYS = 30  # Days of activity to analyze

# Activity thresholds
INACTIVE_THRESHOLD_DAYS = 90  # Days to consider user inactive
POWER_USER_THRESHOLD = 50  # Activity count to classify as power user
MIN_ACTIVITY_FOR_ACTIVE = 1  # Minimum actions to count as active

# Active user metrics
DAU_DAYS = 1  # Daily active users
WAU_DAYS = 7  # Weekly active users
MAU_DAYS = 30  # Monthly active users

# User limits
MAX_USERS = 999  # Maximum users to analyze (999 = all)

# Service principals to exclude from adoption analysis
# These are automated accounts that skew adoption metrics
SERVICE_PRINCIPALS = [
    'System-User',
    'System user',
    'unknown',
    ''   # Empty string for null emails
]

# Filter service principals by pattern
FILTER_SERVICE_PRINCIPALS = True  # Set to False to include all accounts
FILTER_UUID_ACCOUNTS = True       # Filter accounts that look like UUIDs
FILTER_GROUP_ACCOUNTS = True      # Filter accounts starting with 'Developers-'

# Performance settings
MAX_WORKERS = 10
MAX_RETRIES = 3
RETRY_DELAY = 2

# Export settings (disabled in interactive mode, enabled in job mode)
EXPORT_PATH = '/dbfs/tmp/adoption_export'
if is_job_mode:
    ENABLE_EXCEL_EXPORT = True
    ENABLE_DELTA_EXPORT = True
    ENABLE_JSON_EXPORT = True
    log("🤖 Job mode: Exports ENABLED")
else:
    ENABLE_EXCEL_EXPORT = False
    ENABLE_DELTA_EXPORT = False
    ENABLE_JSON_EXPORT = False
    log("💻 Interactive mode: Exports DISABLED")

ENABLE_VISUALIZATIONS = True

# Delta table configuration
DELTA_TABLE_NAME = 'main.default.adoption_history'

# ============================================================================
# EXECUTION STATISTICS
# ============================================================================

execution_stats = {
    'start_time': time.time(),
    'api_calls': 0,
    'api_failures': 0,
    'users_processed': 0,
    'audit_records_processed': 0
}

# ============================================================================
# INITIALIZE SDK CLIENT
# ============================================================================

wc = WorkspaceClient()

log("\n" + "="*60)
log("COLLABORATION & ADOPTION MONITOR")
log("="*60)
log(f"Execution mode: {'JOB' if is_job_mode else 'INTERACTIVE'}")
log(f"Compute type: {'SERVERLESS' if is_serverless else 'TRADITIONAL'}")
log(f"Timezone: {TIMEZONE}")
log(f"Lookback period: {LOOKBACK_DAYS} days")
log(f"Inactive threshold: {INACTIVE_THRESHOLD_DAYS} days")
log(f"Service principal filtering: {'ENABLED' if FILTER_SERVICE_PRINCIPALS else 'DISABLED'}")
log(f"Excel export: {'ENABLED' if ENABLE_EXCEL_EXPORT else 'DISABLED'}")
log(f"Delta export: {'ENABLED' if ENABLE_DELTA_EXPORT else 'DISABLED'}")
log(f"JSON export: {'ENABLED' if ENABLE_JSON_EXPORT else 'DISABLED'}")
log("="*60)

In [0]:
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def log_execution_time(cell_name, start_time):
    """Log execution time for a cell"""
    elapsed = time.time() - start_time
    log(f"⏱️  {cell_name} completed in {elapsed:.2f} seconds")

def validate_dataframe_exists(df_name, df):
    """Validate that a DataFrame exists and has data"""
    if df is None:
        log(f"⚠️  Warning: {df_name} is None")
        return False
    try:
        count = df.count()
        if count == 0:
            log(f"⚠️  Warning: {df_name} is empty (0 rows)")
            return False
        return True
    except Exception as e:
        log(f"❌ Error validating {df_name}: {str(e)}")
        return False

def is_service_principal(user_name):
    """Determine if a user is a service principal"""
    if not user_name or pd.isna(user_name):
        return True  # Treat null/empty as service principal
    
    user_str = str(user_name).strip()
    
    # Check explicit list
    if user_str in SERVICE_PRINCIPALS:
        return True
    
    # Check UUID pattern (8-4-4-4-12 hex digits)
    if FILTER_UUID_ACCOUNTS:
        import re
        uuid_pattern = r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
        if re.match(uuid_pattern, user_str.lower()):
            return True
    
    # Check group accounts
    if FILTER_GROUP_ACCOUNTS:
        # Case-insensitive match, consistent with filter_service_principals_spark
        if user_str.lower().startswith('developers-'):
            return True
    
    return False

def filter_service_principals_spark(df, user_column='user_name'):
    """Filter out service principals from a Spark DataFrame"""
    if not FILTER_SERVICE_PRINCIPALS:
        return df
    
    # Filter explicit list
    df_filtered = df.filter(~F.col(user_column).isin(SERVICE_PRINCIPALS))
    
    # Filter UUIDs
    if FILTER_UUID_ACCOUNTS:
        # Lowercase first so uppercase UUIDs are also caught, matching is_service_principal
        df_filtered = df_filtered.filter(
            ~F.lower(F.col(user_column)).rlike('^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$')
        )
    
    # Filter group accounts
    if FILTER_GROUP_ACCOUNTS:
        df_filtered = df_filtered.filter(
            ~F.lower(F.col(user_column)).startswith('developers-')
        )
    
    return df_filtered

def retry_with_backoff(func, *args, **kwargs):
    """Retry a function with exponential backoff"""
    for attempt in range(MAX_RETRIES):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            if attempt == MAX_RETRIES - 1:
                raise
            wait_time = RETRY_DELAY * (2 ** attempt)
            log(f"⚠️  Attempt {attempt + 1} failed: {str(e)}. Retrying in {wait_time}s...")
            time.sleep(wait_time)

log("✓ Helper functions loaded")

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("FETCHING USERS AND GROUPS")
log("="*60)

try:
    # Fetch all users
    log("Fetching users...")
    users = list(wc.users.list())
    
    if MAX_USERS < 999:
        users = users[:MAX_USERS]
    
    users_data = []
    for user in users:
        users_data.append({
            'user_name': user.user_name,
            'display_name': user.display_name,
            'active': user.active,
            'user_id': user.id
        })
    
    users_df = spark.createDataFrame(users_data)
    
    log(f"✓ Fetched {len(users_data)} users")
    log(f"  Active: {users_df.filter(F.col('active') == True).count()}")
    log(f"  Inactive: {users_df.filter(F.col('active') == False).count()}")
    
    # Fetch groups
    log("Fetching groups...")
    groups = list(wc.groups.list())
    
    log(f"✓ Fetched {len(groups)} groups")
    
    execution_stats['users_processed'] = len(users_data)
    
except Exception as e:
    log(f"✗ Error fetching users: {str(e)}")
    users_df = None

log_execution_time("Fetch Users", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log(f"QUERYING USER ACTIVITY (LAST {LOOKBACK_DAYS} DAYS)")
log("="*60)

if users_df is not None:
    try:
        # Calculate date range
        start_date = (datetime.now(eastern) - timedelta(days=LOOKBACK_DAYS)).strftime('%Y-%m-%d')
        
        log(f"Querying system.access.audit since {start_date}...")
        
        # Query audit logs for various activities
        activity_query = f"""
        SELECT 
            user_identity.email as user_name,
            action_name,
            service_name,
            DATE(event_date) as activity_date,
            COUNT(*) as activity_count
        FROM system.access.audit
        WHERE event_date >= '{start_date}'
            AND user_identity.email IS NOT NULL
            AND (
                action_name IN (
                    'runCommand',  -- Notebook execution
                    'submitCommand',  -- Notebook command submission
                    'modifyNotebook',  -- Notebook editing
                    'createNotebook',
                    'executeQuery',  -- SQL query execution
                    'createQuery',
                    'createDashboard',
                    'createJob',
                    'runJob',
                    'gitSync',  -- Repo activity
                    'mlflowRunCreated'  -- MLflow usage
                )
                OR service_name IN (
                    'agents',  -- AI/BI Genie agents
                    'knowledge_assistant',  -- Knowledge assistant
                    'supervisor_agent',  -- Supervisor agent
                    'agentFramework',  -- Agent framework
                    'agentEvaluation',  -- Agent evaluation
                    'aibiGenie'  -- AI/BI Genie
                )
            )
        GROUP BY user_identity.email, action_name, service_name, DATE(event_date)
        ORDER BY activity_date DESC, user_name
        """
        
        activity_df_raw = spark.sql(activity_query)
        
        # Get counts before filtering
        total_records = activity_df_raw.count()
        total_users = activity_df_raw.select('user_name').distinct().count()
        
        log(f"✓ Fetched {total_records:,} activity records from {total_users} accounts")
        
        # Filter out service principals
        if FILTER_SERVICE_PRINCIPALS:
            log("Filtering out service principals...")
            activity_df = filter_service_principals_spark(activity_df_raw, 'user_name')
            
            # Get counts after filtering
            filtered_records = activity_df.count()
            filtered_users = activity_df.select('user_name').distinct().count()
            
            records_removed = total_records - filtered_records
            users_removed = total_users - filtered_users
            # Guard against division by zero when the audit query returns no rows
            pct_removed = (records_removed / total_records * 100) if total_records > 0 else 0.0
            
            log("✓ After filtering:")
            log(f"  Activity records: {filtered_records:,} (removed {records_removed:,}, {pct_removed:.1f}%)")
            log(f"  Unique users: {filtered_users} (removed {users_removed} service principals)")
        else:
            activity_df = activity_df_raw
            log("ℹ️  Service principal filtering disabled")
        
        execution_stats['audit_records_processed'] = activity_df.count()
        
    except Exception as e:
        log(f"✗ Error querying audit logs: {str(e)}")
        log("  Note: Requires system.access.audit access")
        activity_df = None
else:
    log("⚠️  Skipping activity query (no users)")
    activity_df = None

log_execution_time("Query Activity", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("ANALYZING AI/AGENT USAGE")
log("="*60)

if activity_df is not None:
    try:
        # Filter for AI/Agent related activities by service name
        agent_activities = activity_df.filter(
            F.col('service_name').isin([
                'agents',
                'knowledge_assistant',
                'supervisor_agent',
                'agentFramework',
                'agentEvaluation',
                'aibiGenie'
            ])
        )
        
        if agent_activities.count() > 0:
            # Calculate agent usage per user
            agent_usage = agent_activities.groupBy('user_name').agg(
                F.sum('activity_count').alias('agent_interactions'),
                F.countDistinct('activity_date').alias('days_used_agent'),
                F.max('activity_date').alias('last_agent_use'),
                F.countDistinct('action_name').alias('unique_agent_actions')
            )
            
            # Calculate agent adoption metrics
            total_users = users_df.filter(F.col('active') == True).count()
            agent_users_count = agent_usage.count()
            agent_adoption_rate = (agent_users_count / total_users * 100) if total_users > 0 else 0
            
            total_interactions = agent_activities.agg(F.sum('activity_count')).first()[0]
            
            log(f"✓ AI/Agent Usage Analysis:")
            log(f"  Users using agents: {agent_users_count} ({agent_adoption_rate:.1f}% of active users)")
            log(f"  Total agent interactions: {total_interactions:,}")
            log(f"  Average interactions per user: {total_interactions / agent_users_count:.1f}")
            
            # Top agent users
            log(f"\n  Top 5 AI/Agent power users:")
            top_agent_users = agent_usage.orderBy(F.desc('agent_interactions')).limit(5)
            for row in top_agent_users.collect():
                log(f"    - {row.user_name}: {row.agent_interactions:,} interactions over {row.days_used_agent} days")
            
            # Agent service breakdown
            log(f"\n  Agent service usage:")
            service_breakdown = agent_activities.groupBy('service_name').agg(
                F.sum('activity_count').alias('total_count')
            ).orderBy(F.desc('total_count'))
            for row in service_breakdown.collect():
                log(f"    - {row.service_name}: {row.total_count:,} interactions")
            
            # Agent action breakdown
            log(f"\n  Top agent actions:")
            action_breakdown = agent_activities.groupBy('action_name').agg(
                F.sum('activity_count').alias('total_count')
            ).orderBy(F.desc('total_count')).limit(10)
            for row in action_breakdown.collect():
                log(f"    - {row.action_name}: {row.total_count:,} uses")
            
        else:
            log("ℹ️  No AI/Agent usage detected in audit logs")
            log("  Note: Agent usage tracked via service_name (agents, knowledge_assistant, etc.)")
            agent_usage = None
            agent_adoption_rate = 0
            
    except Exception as e:
        log(f"✗ Error analyzing agent usage: {str(e)}")
        agent_usage = None
        agent_adoption_rate = 0
else:
    log("⚠️  Skipping agent analysis (no activity data)")
    agent_usage = None
    agent_adoption_rate = 0

log_execution_time("Agent Analysis", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("AI FEATURE BREAKDOWN (DETAILED)")
log("="*60)

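# Require agent_usage as well: it is None when the previous cell failed or found no AI activity
if 'agent_activities' in dir() and agent_activities is not None and globals().get('agent_usage') is not None and agent_activities.count() > 0: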
    try:
        # Detailed AI feature analysis
        log("Analyzing AI feature usage patterns...")
        
        # 1. AI Feature Adoption Funnel
        log("\n📊 AI ADOPTION FUNNEL:")
        
        # Users who tried AI once
        tried_once = agent_usage.filter(F.col('agent_interactions') == 1).count()
        # Users who used AI 2-5 times
        light_users = agent_usage.filter((F.col('agent_interactions') >= 2) & (F.col('agent_interactions') <= 5)).count()
        # Users who used AI 6-20 times
        regular_users = agent_usage.filter((F.col('agent_interactions') >= 6) & (F.col('agent_interactions') <= 20)).count()
        # Users who used AI more than 20 times (power users)
        power_users = agent_usage.filter(F.col('agent_interactions') > 20).count()
        
        total_ai_users = agent_usage.count()
        
        log(f"  Tried once: {tried_once} ({tried_once/total_ai_users*100:.1f}%)")
        log(f"  Light users (2-5 uses): {light_users} ({light_users/total_ai_users*100:.1f}%)")
        log(f"  Regular users (6-20 uses): {regular_users} ({regular_users/total_ai_users*100:.1f}%)")
        log(f"  Power users (21+ uses): {power_users} ({power_users/total_ai_users*100:.1f}%)")
        
        # 2. AI Feature Diversity
        log("\n🎯 AI FEATURE DIVERSITY:")
        
        # Users by number of different AI features used
        feature_diversity = agent_usage.groupBy('unique_agent_actions').count().orderBy('unique_agent_actions')
        
        log("  Users by AI features explored:")
        for row in feature_diversity.collect():
            log(f"    {row['unique_agent_actions']} features: {row['count']} users")
        
        # 3. AI Engagement Patterns
        log("\n📅 AI ENGAGEMENT PATTERNS:")
        
        # Average number of distinct days with AI activity per user
        avg_days_used = agent_usage.agg(F.avg('days_used_agent')).first()[0]
        log(f"  Average days with AI activity: {avg_days_used:.1f}")
        
        # Users with consistent AI usage (used on 5+ different days)
        consistent_users = agent_usage.filter(F.col('days_used_agent') >= 5).count()
        log(f"  Consistent AI users (5+ days): {consistent_users} ({consistent_users/total_ai_users*100:.1f}%)")
        
        # 4. AI Feature Combinations
        log("\n🔗 POPULAR AI FEATURE COMBINATIONS:")
        
        # Users using multiple AI services
        user_service_combos = agent_activities.groupBy('user_name').agg(
            F.collect_set('service_name').alias('services_used')
        )
        
        multi_service_users = user_service_combos.filter(F.size('services_used') > 1).count()
        log(f"  Users using multiple AI services: {multi_service_users} ({multi_service_users/total_ai_users*100:.1f}%)")
        
        # 5. AI Retention
        log("\n📌 AI USER RETENTION:")
        
        # Users who used AI in last 7 days
        recent_cutoff = (datetime.now(eastern) - timedelta(days=7)).date()
        recent_ai_users = agent_usage.filter(F.col('last_agent_use') >= F.lit(recent_cutoff)).count()
        log(f"  Active in last 7 days: {recent_ai_users} ({recent_ai_users/total_ai_users*100:.1f}%)")
        
        # Users who haven't used AI in 14+ days
        dormant_cutoff = (datetime.now(eastern) - timedelta(days=14)).date()
        dormant_ai_users = agent_usage.filter(F.col('last_agent_use') < F.lit(dormant_cutoff)).count()
        log(f"  Dormant (14+ days): {dormant_ai_users} ({dormant_ai_users/total_ai_users*100:.1f}%)")
        
        # Create AI metrics summary
        ai_metrics_summary = {
            'total_ai_users': total_ai_users,
            'power_users': power_users,
            'consistent_users': consistent_users,
            'recent_active': recent_ai_users,
            'adoption_rate': agent_adoption_rate
        }
        
    except Exception as e:
        log(f"❌ Error in AI feature breakdown: {str(e)}")
        ai_metrics_summary = None
else:
    log("ℹ️  No AI usage data available for detailed analysis")
    ai_metrics_summary = None

log_execution_time("AI Feature Breakdown", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("USER JOURNEY MAPPING")
log("="*60)

if activity_df is not None:
    try:
        log("Analyzing user journey patterns...")
        
        # 1. First Action Analysis
        log("\n🎯 ENTRY POINTS (First Actions):")
        
        # Get first action for each user
        user_first_action = activity_df.groupBy('user_name').agg(
            F.min('activity_date').alias('first_activity_date')
        )
        
        # Join to get the action on first day - use aliases to avoid ambiguity
        activity_aliased = activity_df.alias('act')
        first_action_aliased = user_first_action.alias('first')
        
        first_actions = activity_aliased.join(
            first_action_aliased,
            (F.col('act.user_name') == F.col('first.user_name')) & 
            (F.col('act.activity_date') == F.col('first.first_activity_date')),
            'inner'
        ).select(
            F.col('act.user_name'),
            F.col('act.action_name').alias('first_action'),
            F.col('act.service_name').alias('first_service')
        ).distinct()
        
        # Count first actions
        first_action_counts = first_actions.groupBy('first_action').count().orderBy(F.desc('count'))
        
        log("  Most common first actions:")
        for row in first_action_counts.limit(10).collect():
            log(f"    {row['first_action']}: {row['count']} users")
        
        # 2. Feature Discovery Sequence
        log("\n🔍 FEATURE DISCOVERY PATTERNS:")
        
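        # first_actions holds one row per (user, action, service) seen on the user's
        # first active day, so count distinct users rather than rows. The three
        # entry points are not mutually exclusive; percentages can sum to over 100%.
        
        # Users who started with notebooks
        notebook_starters = first_actions.filter(
            F.col('first_action').isin(['runCommand', 'modifyNotebook', 'createNotebook'])
        ).select('user_name').distinct().count()
        
        # Users who started with SQL
        sql_starters = first_actions.filter(
            F.col('first_action').isin(['executeQuery', 'createQuery'])
        ).select('user_name').distinct().count()
        
        # Users who started with AI
        ai_starters = first_actions.filter(
            F.col('first_service').isin(['agents', 'aibiGenie', 'knowledge_assistant'])
        ).select('user_name').distinct().count()
        
        total_users_with_activity = first_actions.select('user_name').distinct().count()
        
        # Guard against division by zero when the activity table is empty
        def pct_of_starters(n):
            return (n / total_users_with_activity * 100) if total_users_with_activity > 0 else 0.0
        
        log(f"  Started with Notebooks: {notebook_starters} ({pct_of_starters(notebook_starters):.1f}%)")
        log(f"  Started with SQL: {sql_starters} ({pct_of_starters(sql_starters):.1f}%)")
        log(f"  Started with AI: {ai_starters} ({pct_of_starters(ai_starters):.1f}%)")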
        
        # 3. Feature Adoption Timeline
        log("\n📅 FEATURE ADOPTION TIMELINE:")
        
        # Calculate days to adopt different features
        user_feature_timeline = activity_df.groupBy('user_name', 'action_name').agg(
            F.min('activity_date').alias('first_use_date')
        )
        
        # Join with user's first activity to calculate days to adoption
        feature_adoption_timeline = user_feature_timeline.join(
            user_first_action,
            'user_name'
        ).withColumn(
            'days_to_adoption',
            F.datediff(F.col('first_use_date'), F.col('first_activity_date'))
        )
        
        # Average days to adopt key features
        key_features = ['createNotebook', 'executeQuery', 'createDashboard']
        
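        log("  Average days to adopt key features (excluding first-day usage):")
        for feature in key_features:
            # days_to_adoption > 0 drops users whose first active day already included this feature
            avg_days = feature_adoption_timeline.filter(
                (F.col('action_name') == feature) & (F.col('days_to_adoption') > 0)
            ).agg(F.avg('days_to_adoption')).first()[0]
            
            # avg() returns None when no rows match
            if avg_days is not None:
                log(f"    {feature}: {avg_days:.1f} days")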
        
        # 4. Multi-Feature Users
        log("\n🌐 FEATURE BREADTH:")
        
        # Users by number of unique features used
        user_feature_breadth = activity_df.groupBy('user_name').agg(
            F.countDistinct('action_name').alias('unique_features')
        )
        
        # Categorize users by feature breadth
        single_feature = user_feature_breadth.filter(F.col('unique_features') == 1).count()
        few_features = user_feature_breadth.filter((F.col('unique_features') >= 2) & (F.col('unique_features') <= 5)).count()
        many_features = user_feature_breadth.filter(F.col('unique_features') > 5).count()
        
        log(f"  Single feature users: {single_feature}")
        log(f"  Few features (2-5): {few_features}")
        log(f"  Many features (6+): {many_features}")
        
        # 5. Onboarding Success
        log("\n🎓 ONBOARDING SUCCESS:")
        
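        # Users who used at least 3 distinct actions over the analysis window.
        # This is a breadth proxy; it does not check how quickly they got there.
        quick_adopters = user_feature_breadth.filter(F.col('unique_features') >= 3)
        quick_adopter_count = quick_adopters.count()
        
        quick_adopter_pct = (quick_adopter_count / total_users_with_activity * 100) if total_users_with_activity > 0 else 0.0
        log(f"  Users who explored 3+ features: {quick_adopter_count} ({quick_adopter_pct:.1f}%)")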
        
        # Store journey metrics
        journey_metrics = {
            'notebook_starters': notebook_starters,
            'sql_starters': sql_starters,
            'ai_starters': ai_starters,
            'multi_feature_users': many_features
        }
        
    except Exception as e:
        log(f"❌ Error in user journey mapping: {str(e)}")
        import traceback
        log(traceback.format_exc())
        journey_metrics = None
else:
    log("ℹ️  No activity data available for journey mapping")
    journey_metrics = None

log_execution_time("User Journey Mapping", cell_start_time)
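
The entry-point logic in the cell above joins each user's first activity date back to the activity table, so a user with several actions on that day produces several rows. The following is a minimal pure-Python sketch of the same idea that counts each user at most once per entry point. The function name `entry_point_counts` and the sample rows (including their service and action values) are hypothetical; only the notebook, SQL, and AI identifiers in the three sets are taken from the cell above.

```python
from collections import defaultdict
from datetime import date

# Identifiers copied from the journey-mapping cell
NOTEBOOK_ACTIONS = {'runCommand', 'modifyNotebook', 'createNotebook'}
SQL_ACTIONS = {'executeQuery', 'createQuery'}
AI_SERVICES = {'agents', 'aibiGenie', 'knowledge_assistant'}


def entry_point_counts(rows):
    """rows: iterable of (user_name, activity_date, action_name, service_name)."""
    rows = list(rows)

    # Earliest activity date per user
    first_day = {}
    for user, day, _, _ in rows:
        if user not in first_day or day < first_day[user]:
            first_day[user] = day

    # Sets make sure a user is counted once per entry point
    starters = defaultdict(set)
    for user, day, action, service in rows:
        if day != first_day[user]:
            continue
        if action in NOTEBOOK_ACTIONS:
            starters['notebooks'].add(user)
        if action in SQL_ACTIONS:
            starters['sql'].add(user)
        if service in AI_SERVICES:
            starters['ai'].add(user)

    return {
        'total_users': len(first_day),
        'notebooks': len(starters['notebooks']),
        'sql': len(starters['sql']),
        'ai': len(starters['ai']),
    }


# Illustrative data only; user names, dates, and service values are made up
sample_rows = [
    ('ana', date(2024, 1, 1), 'runCommand', 'notebook'),
    ('ana', date(2024, 1, 1), 'createNotebook', 'notebook'),
    ('ana', date(2024, 1, 5), 'executeQuery', 'databrickssql'),
    ('ben', date(2024, 1, 2), 'executeQuery', 'databrickssql'),
    ('cy', date(2024, 1, 3), 'exampleGenieAction', 'aibiGenie'),
]
journey_counts = entry_point_counts(sample_rows)
```

Here `ana` has two notebook actions on her first day but contributes a single notebook starter, and her later SQL query does not count as an entry point.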

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("COLLABORATION SCORE ANALYSIS")
log("="*60)

if activity_df is not None and users_df is not None:
    try:
        log("Calculating collaboration metrics...")
        
        # 1. Shared Workspace Activity
        log("\n🤝 SHARED WORKSPACE ACTIVITY:")
        
        # Count users who edited notebooks (modifyNotebook action). This is a proxy:
        # the action alone does not tell us whether the notebook is shared.
        shared_notebook_users = activity_df.filter(
            F.col('action_name') == 'modifyNotebook'
        ).select('user_name').distinct().count()
        
        total_active = users_df.filter(F.col('active') == True).count()
        
        notebook_pct = (shared_notebook_users / total_active * 100) if total_active > 0 else 0.0
        log(f"  Users editing notebooks: {shared_notebook_users} ({notebook_pct:.1f}%)")
        
        # 2. Group Membership Analysis
        log("\n👥 GROUP COLLABORATION:")
        
        if 'groups_df' in dir() and groups_df is not None:
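            # Ratio of groups to users. This is not the average number of memberships
            # per user, which would need group membership data.
            total_groups = groups_df.count()
            total_users = users_df.count()
            
            log(f"  Total groups: {total_groups}")
            groups_per_user = (total_groups / total_users) if total_users > 0 else 0.0
            log(f"  Groups per user (ratio): {groups_per_user:.2f}")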
            
            # Users in multiple groups (proxy for cross-team collaboration)
            # Note: This would require group membership data from SDK
            log("  ℹ️  Detailed group membership requires additional API calls")
        else:
            log("  ℹ️  Group data not available")
        
        # 3. Collaborative Features Usage
        log("\n📝 COLLABORATIVE FEATURES:")
        
        # Dashboard creation (often shared)
        dashboard_creators = activity_df.filter(
            F.col('action_name') == 'createDashboard'
        ).select('user_name').distinct().count()
        
        dashboard_pct = (dashboard_creators / total_active * 100) if total_active > 0 else 0.0
        log(f"  Dashboard creators: {dashboard_creators} ({dashboard_pct:.1f}%)")
        
        # Git sync (repo collaboration)
        git_users = activity_df.filter(
            F.col('action_name') == 'gitSync'
        ).select('user_name').distinct().count()
        
        if git_users > 0:
            git_pct = (git_users / total_active * 100) if total_active > 0 else 0.0
            log(f"  Git/Repo users: {git_users} ({git_pct:.1f}%)")
        else:
            log("  Git/Repo users: 0 (no Git activity detected)")
        
        # 4. Calculate Collaboration Score
        log("\n🏆 COLLABORATION SCORE:")
        
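        # Score components (0-100 scale). Each component hits its cap at 40% adoption.
        # - Notebook editing (40 points): 1 point per percentage point of active users
        # - Dashboard creation (30 points): 0.75 points per percentage point
        # - Git usage (30 points): 0.75 points per percentage point
        
        if total_active > 0:
            notebook_score = min((shared_notebook_users / total_active) * 100, 40)
            dashboard_score = min((dashboard_creators / total_active) * 100 * 0.75, 30)
            git_score = min((git_users / total_active) * 100 * 0.75, 30)
        else:
            # No active users: avoid division by zero and score everything as 0
            notebook_score = dashboard_score = git_score = 0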
        
        collaboration_score = notebook_score + dashboard_score + git_score
        
        log(f"\n  Overall Collaboration Score: {collaboration_score:.1f}/100")
        log(f"    Notebook collaboration: {notebook_score:.1f}/40")
        log(f"    Dashboard sharing: {dashboard_score:.1f}/30")
        log(f"    Git collaboration: {git_score:.1f}/30")
        
        # Interpretation
        if collaboration_score >= 70:
            interpretation = "EXCELLENT - High collaboration culture"
        elif collaboration_score >= 50:
            interpretation = "GOOD - Moderate collaboration"
        elif collaboration_score >= 30:
            interpretation = "FAIR - Some collaboration"
        else:
            interpretation = "LOW - Limited collaboration"
        
        log(f"\n  Assessment: {interpretation}")
        
        # 5. Collaboration Recommendations
        log("\n💡 COLLABORATION RECOMMENDATIONS:")
        
        recommendations = []
        
        if notebook_score < 20:
            recommendations.append("Promote shared notebook usage and /Shared folder")
        
        if dashboard_score < 15:
            recommendations.append("Encourage dashboard creation for team visibility")
        
        if git_score < 10:
            recommendations.append("Introduce Git integration for version control")
        
        if len(recommendations) > 0:
            for i, rec in enumerate(recommendations, 1):
                log(f"  {i}. {rec}")
        else:
            log("  ✓ Collaboration practices look strong!")
        
        # Store collaboration metrics
        collaboration_metrics = {
            'collaboration_score': collaboration_score,
            'shared_notebook_users': shared_notebook_users,
            'dashboard_creators': dashboard_creators,
            'git_users': git_users,
            'interpretation': interpretation
        }
        
    except Exception as e:
        log(f"❌ Error calculating collaboration score: {str(e)}")
        import traceback
        log(traceback.format_exc())
        collaboration_metrics = None
else:
    log("ℹ️  Insufficient data for collaboration analysis")
    collaboration_metrics = None

log_execution_time("Collaboration Score", cell_start_time)
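
The scoring rule in the cell above can be isolated as a small pure function, which makes the caps and the zero-user case easy to check without a cluster. This is a sketch under the same weights (40/30/30) and thresholds (70/50/30) used in the cell; the function names `collaboration_score` and `interpret_score` and the sample numbers are hypothetical.

```python
def collaboration_score(total_active, notebook_users, dashboard_users, git_users):
    """Return the three component scores and their sum on a 0-100 scale."""
    if total_active <= 0:
        # No active users: nothing to score, and no division by zero
        return {'notebook': 0.0, 'dashboard': 0.0, 'git': 0.0, 'total': 0.0}

    def pct(n):
        return n / total_active * 100

    notebook = min(pct(notebook_users), 40)          # capped at 40 points
    dashboard = min(pct(dashboard_users) * 0.75, 30)  # capped at 30 points
    git = min(pct(git_users) * 0.75, 30)              # capped at 30 points
    return {
        'notebook': notebook,
        'dashboard': dashboard,
        'git': git,
        'total': notebook + dashboard + git,
    }


def interpret_score(score):
    """Map a total score to the assessment labels used in the cell above."""
    if score >= 70:
        return "EXCELLENT - High collaboration culture"
    if score >= 50:
        return "GOOD - Moderate collaboration"
    if score >= 30:
        return "FAIR - Some collaboration"
    return "LOW - Limited collaboration"


# Illustrative numbers only: 100 active users, 50 notebook editors,
# 20 dashboard creators, 10 Git users
example = collaboration_score(100, 50, 20, 10)
example_label = interpret_score(example['total'])
```

With these inputs the notebook component is capped at 40, dashboards contribute 15, Git contributes 7.5, and the total of 62.5 falls in the "GOOD" band.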

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("CALCULATING ADOPTION METRICS")
log("="*60)

if activity_df is not None and users_df is not None:
    try:
        # Calculate active users for different time periods
        current_date = datetime.now(eastern).date()
        
        # Daily Active Users (DAU)
        dau_date = current_date - timedelta(days=DAU_DAYS)
        dau = activity_df.filter(F.col('activity_date') >= F.lit(dau_date)).select('user_name').distinct().count()
        
        # Weekly Active Users (WAU)
        wau_date = current_date - timedelta(days=WAU_DAYS)
        wau = activity_df.filter(F.col('activity_date') >= F.lit(wau_date)).select('user_name').distinct().count()
        
        # Monthly Active Users (MAU)
        mau_date = current_date - timedelta(days=MAU_DAYS)
        mau = activity_df.filter(F.col('activity_date') >= F.lit(mau_date)).select('user_name').distinct().count()
        
        # Total active users
        total_active_users = users_df.filter(F.col('active') == True).count()
        
        log(f"\n📈 Active User Metrics:")
        log(f"  DAU (last {DAU_DAYS} day(s)): {dau} ({(dau/total_active_users*100) if total_active_users > 0 else 0:.1f}% of active users)")
        log(f"  WAU (last {WAU_DAYS} days): {wau} ({(wau/total_active_users*100) if total_active_users > 0 else 0:.1f}% of active users)")
        log(f"  MAU (last {MAU_DAYS} days): {mau} ({(mau/total_active_users*100) if total_active_users > 0 else 0:.1f}% of active users)")
        
        # Calculate user activity levels
        user_activity = activity_df.groupBy('user_name').agg(
            F.sum('activity_count').alias('total_activities'),
            F.countDistinct('activity_date').alias('active_days'),
            F.countDistinct('action_name').alias('unique_actions'),
            F.max('activity_date').alias('last_activity_date')
        )
        
        # Join with users
        user_metrics = users_df.join(user_activity, 'user_name', 'left')
        
        # Fill nulls for users with no activity
        user_metrics = user_metrics.fillna({
            'total_activities': 0,
            'active_days': 0,
            'unique_actions': 0
        })
        
        # Classify users
        user_metrics = user_metrics.withColumn(
            'user_segment',
            F.when(F.col('total_activities') >= POWER_USER_THRESHOLD, 'Power User')
             .when(F.col('total_activities') >= 10, 'Regular User')
             .when(F.col('total_activities') >= 1, 'Occasional User')
             .otherwise('Inactive')
        )
        
        # Calculate days since last activity
        user_metrics = user_metrics.withColumn(
            'days_since_activity',
            F.datediff(F.lit(current_date), F.col('last_activity_date'))
        )
        
        log(f"\n👥 User Segmentation:")
        segments = user_metrics.groupBy('user_segment').count().orderBy(F.desc('count'))
        for row in segments.collect():
            log(f"  {row.user_segment}: {row['count']} users")
        
        # Inactive users
        inactive_users = user_metrics.filter(
            (F.col('days_since_activity') > INACTIVE_THRESHOLD_DAYS) | 
            (F.col('last_activity_date').isNull())
        )
        inactive_count = inactive_users.count()
        
        log(f"\n⚠️  Inactive users (>{INACTIVE_THRESHOLD_DAYS} days): {inactive_count}")
        
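    except Exception as e:
        log(f"✗ Error calculating metrics: {str(e)}")
        # Log the full traceback, as the other analysis cells do
        import traceback
        log(traceback.format_exc())
        user_metrics = None
        dau = wau = mau = 0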
else:
    log("⚠️  Skipping metrics calculation (no data)")
    user_metrics = None
    dau = wau = mau = 0

log_execution_time("Calculate Metrics", cell_start_time)
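
The windowed active-user counts and the segment rules from the cell above can be expressed without Spark, which is handy for unit-testing the thresholds. This is a minimal sketch: the helper names `active_users` and `user_segment`, the sample data, and the default of 100 for the power-user threshold are assumptions (the notebook reads `POWER_USER_THRESHOLD` from its configuration, whose value is not shown here).

```python
from datetime import date, timedelta


def active_users(activity, today, window_days):
    """Count distinct users with activity on or after today - window_days.

    activity: iterable of (user_name, activity_date) pairs.
    """
    cutoff = today - timedelta(days=window_days)
    return len({user for user, day in activity if day >= cutoff})


def user_segment(total_activities, power_user_threshold=100):
    """Mirror the F.when chain; the default threshold of 100 is an assumed value."""
    if total_activities >= power_user_threshold:
        return 'Power User'
    if total_activities >= 10:
        return 'Regular User'
    if total_activities >= 1:
        return 'Occasional User'
    return 'Inactive'


# Illustrative data only
today = date(2024, 6, 30)
activity = [
    ('ana', date(2024, 6, 30)),
    ('ana', date(2024, 6, 29)),
    ('ben', date(2024, 6, 25)),
    ('cy', date(2024, 6, 5)),
    ('dee', date(2024, 4, 1)),
]
dau_example = active_users(activity, today, 1)
wau_example = active_users(activity, today, 7)
mau_example = active_users(activity, today, 30)
```

Because the cutoff comparison is inclusive, a window of `N` days covers today plus the `N` preceding calendar days, matching the `>=` filter in the cell.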

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("FEATURE USAGE ANALYSIS")
log("="*60)

if activity_df is not None:
    try:
        # Feature usage by action type
        feature_usage = activity_df.groupBy('action_name', 'service_name').agg(
            F.sum('activity_count').alias('total_uses'),
            F.countDistinct('user_name').alias('unique_users')
        ).orderBy(F.desc('total_uses'))
        
        # Map actions to feature categories and friendly names
        feature_categories = {
            # Notebooks
            'runCommand': ('Notebooks', 'Notebook Execution'),
            'submitCommand': ('Notebooks', 'Command Submission'),
            'modifyNotebook': ('Notebooks', 'Notebook Editing'),
            'createNotebook': ('Notebooks', 'Notebook Creation'),
            'runNotebook': ('Notebooks', 'Notebook Runs'),
            
            # SQL
            'executeQuery': ('SQL', 'Query Execution'),
            'createQuery': ('SQL', 'Query Creation'),
            'commandSubmit': ('SQL', 'SQL Command Submit'),
            'commandFinish': ('SQL', 'SQL Command Finish'),
            
            # Genie/AI (by service name)
            'agents': ('Genie/AI', 'AI Agents'),
            'knowledge_assistant': ('Genie/AI', 'Knowledge Assistant'),
            'supervisor_agent': ('Genie/AI', 'Supervisor Agent'),
            'agentFramework': ('Genie/AI', 'Agent Framework'),
            'agentEvaluation': ('Genie/AI', 'Agent Evaluation'),
            'aibiGenie': ('Genie/AI', 'AI/BI Genie'),
            
            # Dashboards
            'createDashboard': ('Dashboards', 'Dashboard Creation'),
            'viewDashboard': ('Dashboards', 'Dashboard Views'),
            'modifyDashboard': ('Dashboards', 'Dashboard Editing'),
            
            # Jobs
            'createJob': ('Jobs', 'Job Creation'),
            'runJob': ('Jobs', 'Job Execution'),
            'submitRun': ('Jobs', 'Job Submission'),
            
            # Git/Repos
            'gitSync': ('Git/Repos', 'Git Sync'),
            'gitCommit': ('Git/Repos', 'Git Commit'),
            'gitPush': ('Git/Repos', 'Git Push'),
            
            # MLflow
            'mlflowRunCreated': ('MLflow', 'Experiment Runs'),
            'mlflowModelRegistered': ('MLflow', 'Model Registration'),
        }
        
        # Collect feature usage data
        feature_data = []
        for row in feature_usage.collect():
            # Check if it's a Genie/AI service
            if row.service_name in ['agents', 'knowledge_assistant', 'supervisor_agent', 
                                     'agentFramework', 'agentEvaluation', 'aibiGenie']:
                category, feature_name = feature_categories.get(row.service_name, ('Other', row.service_name))
            else:
                category, feature_name = feature_categories.get(row.action_name, ('Other', row.action_name))
            
            feature_data.append({
                'category': category,
                'feature_name': feature_name,
                'action_name': row.action_name,
                'service_name': row.service_name,
                'total_uses': row.total_uses,
                'unique_users': row.unique_users
            })
        
        # Convert to pandas for easier grouping
        import pandas as pd
        features_pd = pd.DataFrame(feature_data)
        
        # Calculate total active users
        total_active = users_df.filter(F.col('active') == True).count()
        
        # Group by category
        log(f"\n📊 FEATURE ADOPTION BY CATEGORY:")
        log("="*60)
        
        category_summary = features_pd.groupby('category').agg({
            'total_uses': 'sum',
            'unique_users': 'max'  # Lower bound: the largest single-feature user count. Summing would double-count users, and the true distinct count per category may be higher.
        }).sort_values('total_uses', ascending=False)
        
        for category, row in category_summary.iterrows():
            adoption_rate = (row['unique_users'] / total_active * 100) if total_active > 0 else 0
            log(f"\n{category}:")
            log(f"  Total uses: {int(row['total_uses']):,}")
            log(f"  Unique users: {int(row['unique_users'])} ({adoption_rate:.1f}% adoption)")
            
            # Show top features in this category
            category_features = features_pd[features_pd['category'] == category].sort_values('total_uses', ascending=False).head(5)
            log(f"  Top features:")
            for _, feat in category_features.iterrows():
                feat_adoption = (feat['unique_users'] / total_active * 100) if total_active > 0 else 0
                log(f"    • {feat['feature_name']}: {int(feat['total_uses']):,} uses, {int(feat['unique_users'])} users ({feat_adoption:.1f}%)")
        
        # Overall summary
        log(f"\n" + "="*60)
        log(f"📈 OVERALL ADOPTION SUMMARY:")
        log(f"  Total active users: {total_active}")
        log(f"  Total feature uses: {int(features_pd['total_uses'].sum()):,}")
        log(f"  Average uses per user: {int(features_pd['total_uses'].sum() / total_active) if total_active > 0 else 0:,}")
        
        # Category adoption ranking
        log(f"\n🏆 Category Adoption Ranking:")
        for i, (category, row) in enumerate(category_summary.iterrows(), 1):
            adoption_rate = (row['unique_users'] / total_active * 100) if total_active > 0 else 0
            log(f"  {i}. {category}: {adoption_rate:.1f}% ({int(row['unique_users'])}/{total_active} users)")
        
    except Exception as e:
        log(f"✗ Error analyzing feature usage: {str(e)}")
        import traceback
        log(traceback.format_exc())
        feature_usage = None
else:
    log("⚠️  Skipping feature analysis (no activity data)")
    feature_usage = None

log_execution_time("Feature Analysis", cell_start_time)
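
The category summary above takes the maximum per-feature user count as the category's unique users, which can undercount when different users use different features in the same category. A hedged sketch of an exact alternative is to union the user sets per category. The function names and sample events below are hypothetical, and the mapping is a small excerpt in the spirit of `feature_categories`.

```python
from collections import defaultdict

# Small excerpt of an action-to-category mapping (illustrative)
ACTION_TO_CATEGORY = {
    'runCommand': 'Notebooks',
    'createNotebook': 'Notebooks',
    'executeQuery': 'SQL',
}


def category_unique_users(events, action_to_category):
    """Exact distinct users per category. events: iterable of (user_name, action_name)."""
    users_by_category = defaultdict(set)
    for user, action in events:
        category = action_to_category.get(action, 'Other')
        users_by_category[category].add(user)
    return {category: len(users) for category, users in users_by_category.items()}


def category_max_feature_users(events, action_to_category):
    """The 'max of per-feature counts' approximation, for comparison."""
    users_by_action = defaultdict(set)
    for user, action in events:
        users_by_action[action].add(user)
    result = {}
    for action, users in users_by_action.items():
        category = action_to_category.get(action, 'Other')
        result[category] = max(result.get(category, 0), len(users))
    return result


# Illustrative data: ana and ben use different notebook features
events = [
    ('ana', 'runCommand'),
    ('ben', 'createNotebook'),
    ('cy', 'executeQuery'),
]
exact = category_unique_users(events, ACTION_TO_CATEGORY)
approx = category_max_feature_users(events, ACTION_TO_CATEGORY)
```

In this sample the exact count for Notebooks is 2 while the max-based approximation reports 1, so adoption rates derived from the approximation should be read as minimums.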

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("GENERATING RECOMMENDATIONS")
log("="*60)

recommendations = []

if user_metrics is not None:
    # Inactive users: count on the cluster instead of collecting every row to the driver
    inactive_user_count = inactive_users.count() if 'inactive_users' in dir() else 0
    
    if inactive_user_count > 0:
        recommendations.append({
            'priority': 'MEDIUM',
            'category': 'User Management',
            'issue': f'{inactive_user_count} inactive users (>{INACTIVE_THRESHOLD_DAYS} days)',
            'impact': 'Unused licenses, security risk from stale accounts',
            'recommendation': f'Review and deactivate {inactive_user_count} inactive user accounts',
            'affected_count': inactive_user_count
        })
    
    # Low AI adoption
    if 'agent_adoption_rate' in dir() and agent_adoption_rate < 20:
        recommendations.append({
            'priority': 'LOW',
            'category': 'AI Adoption',
            'issue': f'Low AI assistant adoption: {agent_adoption_rate:.1f}%',
            'impact': 'Users not leveraging productivity features',
            'recommendation': 'Promote AI assistant through training sessions and demos',
            'affected_count': int(total_active_users * (100 - agent_adoption_rate) / 100) if 'total_active_users' in dir() else 0
        })
    
    # Users with low activity
    low_activity_users = user_metrics.filter(
        (F.col('total_activities') > 0) & 
        (F.col('total_activities') < 5) &
        (F.col('active') == True)
    )
    low_activity_count = low_activity_users.count()
    
    if low_activity_count > 0:
        recommendations.append({
            'priority': 'LOW',
            'category': 'Training',
            'issue': f'{low_activity_count} users with minimal activity (<5 actions)',
            'impact': 'Low platform utilization, potential training gap',
            'recommendation': 'Provide onboarding training and resources',
            'affected_count': low_activity_count
        })
    
    # Create recommendations DataFrame
    if recommendations:
        recommendations_df = spark.createDataFrame(recommendations)
        
        log(f"\n💡 Generated {len(recommendations)} recommendations:")
        for rec in recommendations:
            log(f"  {rec['priority']}: {rec['issue']}")
    else:
        recommendations_df = None
        log("\n✅ No recommendations - adoption looks healthy!")
else:
    log("⚠️  Skipping recommendations (no data)")
    recommendations_df = None

log_execution_time("Generate Recommendations", cell_start_time)

In [0]:
cell_start_time = time.time()

if not is_job_mode and ENABLE_VISUALIZATIONS and user_metrics is not None:
    log("\n" + "="*60)
    log("VISUALIZATIONS")
    log("="*60)
    
    import matplotlib.pyplot as plt
    import numpy as np
    
    # Create figure with subplots (3 rows, 3 columns for 9 charts)
    fig, axes = plt.subplots(3, 3, figsize=(20, 18))
    
    # Chart 1: User Segmentation
    ax1 = axes[0, 0]
    segments = user_metrics.groupBy('user_segment').count().toPandas()
    colors = {'Power User': 'green', 'Regular User': 'steelblue', 'Occasional User': 'orange', 'Inactive': 'red'}
    bar_colors = [colors.get(seg, 'gray') for seg in segments['user_segment']]
    bars = ax1.bar(segments['user_segment'], segments['count'], color=bar_colors)
    ax1.set_title('User Segmentation', fontsize=12, fontweight='bold')
    ax1.set_xlabel('User Segment')
    ax1.set_ylabel('Number of Users')
    plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')
    for i, (bar, count) in enumerate(zip(bars, segments['count'])):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 1,
                str(int(count)), ha='center', fontweight='bold')
    
    # Chart 2: Active Users Trend (DAU/WAU/MAU)
    ax2 = axes[0, 1]
    active_metrics = ['DAU', 'WAU', 'MAU']
    active_counts = [dau, wau, mau]
    ax2.bar(active_metrics, active_counts, color=['lightgreen', 'steelblue', 'darkblue'])
    ax2.set_title('Active Users Metrics', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Number of Users')
    for i, v in enumerate(active_counts):
        ax2.text(i, v + 1, str(v), ha='center', fontweight='bold')
    
    # Chart 3: AI Adoption Funnel (NEW)
    ax3 = axes[0, 2]
    if 'ai_metrics_summary' in dir() and ai_metrics_summary is not None:
        # Get AI funnel data from the AI breakdown analysis
        funnel_labels = ['Tried Once', 'Light\n(2-5)', 'Regular\n(6-20)', 'Power\n(21+)']
        # Calculate from agent_usage if available
        if 'agent_usage' in dir() and agent_usage is not None:
            tried_once = agent_usage.filter(F.col('agent_interactions') == 1).count()
            light_users = agent_usage.filter((F.col('agent_interactions') >= 2) & (F.col('agent_interactions') <= 5)).count()
            regular_users = agent_usage.filter((F.col('agent_interactions') >= 6) & (F.col('agent_interactions') <= 20)).count()
            power_users = agent_usage.filter(F.col('agent_interactions') > 20).count()
            funnel_values = [tried_once, light_users, regular_users, power_users]
            funnel_colors = ['lightcoral', 'orange', 'steelblue', 'green']
            bars = ax3.bar(range(len(funnel_labels)), funnel_values, color=funnel_colors)
            ax3.set_xticks(range(len(funnel_labels)))
            ax3.set_xticklabels(funnel_labels, fontsize=9)
            ax3.set_title('AI Adoption Funnel', fontsize=12, fontweight='bold')
            ax3.set_ylabel('Number of Users')
            for bar, val in zip(bars, funnel_values):
                height = bar.get_height()
                ax3.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                        str(int(val)), ha='center', fontweight='bold', fontsize=9)
        else:
            ax3.text(0.5, 0.5, 'AI data not available', ha='center', va='center', transform=ax3.transAxes)
            ax3.axis('off')
    else:
        ax3.text(0.5, 0.5, 'AI data not available', ha='center', va='center', transform=ax3.transAxes)
        ax3.axis('off')
    
    # Chart 4: Feature Usage Distribution (with logarithmic scale)
    ax4 = axes[1, 0]
    if feature_usage is not None and feature_usage.count() > 0:
        top_features = feature_usage.limit(10).toPandas()
        top_features['total_uses'] = top_features['total_uses'].apply(lambda x: max(x, 0.1))
        ax4.barh(range(len(top_features)), top_features['total_uses'], color='steelblue')
        ax4.set_yticks(range(len(top_features)))
        ax4.set_yticklabels(top_features['action_name'], fontsize=8)
        ax4.set_title('Top Features by Usage (Log Scale)', fontsize=12, fontweight='bold')
        ax4.set_xlabel('Relative Usage (log scale)')
        ax4.set_xscale('log')
        ax4.set_xticklabels([])
        ax4.invert_yaxis()
    
    # Chart 5: AI/Agent Adoption
    ax5 = axes[1, 1]
    if 'agent_adoption_rate' in dir() and 'agent_usage' in dir() and 'total_active_users' in dir():
        adoption_data = ['Using AI', 'Not Using AI']
        ai_user_count = agent_usage.count() if agent_usage is not None else 0
        # Clamp at zero: pie() rejects negative wedge sizes
        adoption_counts = [ai_user_count, max(total_active_users - ai_user_count, 0)]
        colors_pie = ['green', 'lightgray']
        if sum(adoption_counts) > 0:
            ax5.pie(adoption_counts, labels=adoption_data, autopct='%1.1f%%', colors=colors_pie, startangle=90)
        ax5.set_title(f'AI/Agent Adoption Rate: {agent_adoption_rate:.1f}%', fontsize=12, fontweight='bold')
    
    # Chart 6: User Journey Entry Points (NEW)
    ax6 = axes[1, 2]
    if 'journey_metrics' in dir() and journey_metrics is not None:
        entry_labels = ['Notebooks', 'SQL', 'AI']
        entry_values = [
            journey_metrics.get('notebook_starters', 0),
            journey_metrics.get('sql_starters', 0),
            journey_metrics.get('ai_starters', 0)
        ]
        entry_colors = ['steelblue', 'orange', 'green']
        bars = ax6.bar(entry_labels, entry_values, color=entry_colors)
        ax6.set_title('User Journey Entry Points', fontsize=12, fontweight='bold')
        ax6.set_ylabel('Number of Users')
        for bar, val in zip(bars, entry_values):
            height = bar.get_height()
            ax6.text(bar.get_x() + bar.get_width()/2., height + 1,
                    str(int(val)), ha='center', fontweight='bold')
    else:
        ax6.text(0.5, 0.5, 'Journey data not available', ha='center', va='center', transform=ax6.transAxes)
        ax6.axis('off')
    
    # Chart 7: Top Users by Classification (with logarithmic scale)
    ax7 = axes[2, 0]
    top_users_by_segment = []
    segment_colors_map = {'Power User': 'green', 'Regular User': 'steelblue', 'Occasional User': 'orange', 'Inactive': 'red'}
    
    for segment in ['Power User', 'Regular User', 'Occasional User', 'Inactive']:
        segment_users = user_metrics.filter(F.col('user_segment') == segment) \
            .orderBy(F.desc('total_activities')) \
            .limit(5) \
            .select('user_name', 'total_activities', 'user_segment') \
            .collect()
        
        for user in segment_users:
            top_users_by_segment.append({
                'user': user.user_name.split('@')[0] if '@' in user.user_name else user.user_name,
                'activities': user.total_activities if user.total_activities > 0 else 0.1,
                'segment': user.user_segment,
                'color': segment_colors_map.get(segment, 'gray')
            })
    
    if top_users_by_segment:
        y_pos = 0
        y_labels = []
        y_positions = []
        
        for segment in ['Power User', 'Regular User', 'Occasional User', 'Inactive']:
            segment_data = [u for u in top_users_by_segment if u['segment'] == segment]
            
            if segment_data:
                y_labels.append(f"--- {segment} ---")
                y_positions.append(y_pos)
                y_pos += 1
                
                for user_data in segment_data:
                    ax7.barh(y_pos, user_data['activities'], color=user_data['color'], alpha=0.7)
                    y_labels.append(user_data['user'][:20])
                    y_positions.append(y_pos)
                    y_pos += 1
                
                y_pos += 0.5
        
        ax7.set_yticks(y_positions)
        ax7.set_yticklabels(y_labels, fontsize=7)
        ax7.set_title('Top 5 Users by Classification', fontsize=12, fontweight='bold')
        ax7.set_xlabel('Activity (log scale)')
        ax7.set_xscale('log')
        ax7.set_xticklabels([])
        ax7.invert_yaxis()
    
    # Chart 8: Feature Adoption by Category
    ax8 = axes[2, 1]
    if 'features_pd' in dir() and features_pd is not None and len(features_pd) > 0:
        category_summary = features_pd.groupby('category').agg({
            'unique_users': 'max'
        }).sort_values('unique_users', ascending=True)
        
        total_active = users_df.filter(F.col('active') == True).count()
        category_summary['adoption_rate'] = (category_summary['unique_users'] / total_active * 100) if total_active > 0 else 0
        
        categories = category_summary.index.tolist()
        adoption_rates = category_summary['adoption_rate'].tolist()
        
        bar_colors_adoption = ['green' if rate >= 50 else 'steelblue' if rate >= 25 else 'orange' for rate in adoption_rates]
        
        bars = ax8.barh(categories, adoption_rates, color=bar_colors_adoption)
        ax8.set_title('Feature Adoption by Category', fontsize=12, fontweight='bold')
        ax8.set_xlabel('Adoption Rate (%)')
        ax8.set_xlim(0, 100)
        
        for i, (bar, rate, users) in enumerate(zip(bars, adoption_rates, category_summary['unique_users'])):
            width = bar.get_width()
            ax8.text(width + 2, bar.get_y() + bar.get_height()/2.,
                    f'{rate:.1f}% ({int(users)})',
                    ha='left', va='center', fontsize=8, fontweight='bold')
    else:
        ax8.text(0.5, 0.5, 'Category data not available', ha='center', va='center', transform=ax8.transAxes)
        ax8.axis('off')
    
    # Chart 9: Collaboration Score Breakdown (NEW)
    ax9 = axes[2, 2]
    if 'collaboration_metrics' in dir() and collaboration_metrics is not None:
        # Create stacked bar showing collaboration score components
        components = ['Notebooks\n(40 pts)', 'Dashboards\n(30 pts)', 'Git\n(30 pts)']
        
        # Calculate actual scores from collaboration_metrics
        total_active = users_df.filter(F.col('active') == True).count()
        shared_notebook_users = collaboration_metrics.get('shared_notebook_users', 0)
        dashboard_creators = collaboration_metrics.get('dashboard_creators', 0)
        git_users = collaboration_metrics.get('git_users', 0)
        
        notebook_score = min((shared_notebook_users / total_active) * 100, 40) if total_active > 0 else 0
        dashboard_score = min((dashboard_creators / total_active) * 100 * 0.75, 30) if total_active > 0 else 0
        git_score = min((git_users / total_active) * 100 * 0.75, 30) if total_active > 0 and git_users > 0 else 0
        
        scores = [notebook_score, dashboard_score, git_score]
        max_scores = [40, 30, 30]
        
        # Create bars
        x_pos = np.arange(len(components))
        bars1 = ax9.bar(x_pos, scores, color=['green', 'steelblue', 'orange'], alpha=0.8, label='Actual')
        bars2 = ax9.bar(x_pos, [m - s for m, s in zip(max_scores, scores)], 
                       bottom=scores, color='lightgray', alpha=0.3, label='Potential')
        
        ax9.set_xticks(x_pos)
        ax9.set_xticklabels(components, fontsize=9)
        ax9.set_ylabel('Score')
        ax9.set_title(f'Collaboration Score: {collaboration_metrics.get("collaboration_score", 0):.1f}/100', 
                     fontsize=12, fontweight='bold')
        ax9.legend(loc='upper right', fontsize=8)
        
        # Add score labels
        for bar, score in zip(bars1, scores):
            height = bar.get_height()
            ax9.text(bar.get_x() + bar.get_width()/2., height/2,
                    f'{score:.1f}', ha='center', va='center', fontweight='bold', fontsize=9)
    else:
        ax9.text(0.5, 0.5, 'Collaboration data not available', ha='center', va='center', transform=ax9.transAxes)
        ax9.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    log("✓ Visualizations generated (9 charts)")
else:
    if is_job_mode:
        log("ℹ️  Visualizations skipped (job mode)")
    else:
        log("ℹ️  Visualizations skipped (no data or disabled)")

log_execution_time("Visualizations", cell_start_time)
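
The Chart 9 breakdown above computes three capped components inline. As a minimal sketch, the same arithmetic can be pulled into a pure function so it can be checked without Spark or matplotlib. The function name `collaboration_components` and its parameter names are hypothetical and not part of the notebook; the weights and caps are copied from the chart code.

```python
def collaboration_components(shared_notebook_users, dashboard_creators, git_users, total_active):
    """Return (notebook, dashboard, git) scores, capped at 40, 30 and 30 points.

    Hypothetical helper mirroring the inline Chart 9 arithmetic.
    """
    if total_active <= 0:
        # No active users: every component scores zero, as in the chart code.
        return (0.0, 0.0, 0.0)

    # Notebooks: 1 point per percentage point of active users, capped at 40.
    notebook = min(shared_notebook_users / total_active * 100, 40)
    # Dashboards and Git: 0.75 points per percentage point, capped at 30 each.
    dashboard = min(dashboard_creators / total_active * 100 * 0.75, 30)
    git = min(git_users / total_active * 100 * 0.75, 30)
    return (notebook, dashboard, git)


# Example with illustrative numbers (not taken from any real workspace).
scores = collaboration_components(
    shared_notebook_users=10, dashboard_creators=20, git_users=5, total_active=100
)
print(scores)
```

With these weights, each component reaches its cap once 40% of active users participate, so the three bars in the chart are directly comparable.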

In [0]:
cell_start_time = time.time()

if ENABLE_EXCEL_EXPORT and user_metrics is not None:
    log("\n" + "="*60)
    log("EXPORTING TO EXCEL")
    log("="*60)
    
    try:
        # Create export directory
        if is_serverless:
            import tempfile
            temp_dir = tempfile.mkdtemp()
            export_path = temp_dir
        else:
            export_path = EXPORT_PATH
            os.makedirs(export_path, exist_ok=True)
        
        timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
        excel_path = f"{export_path}/adoption_report_{timestamp}.xlsx"
        
        log(f"Creating Excel workbook: {excel_path}")
        
        with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
            # Sheet 1: User Metrics
            user_metrics.orderBy(F.desc('total_activities')).toPandas().to_excel(writer, sheet_name='User Metrics', index=False)
            
            # Sheet 2: AI/Agent Users
            if agent_usage is not None:
                agent_usage.orderBy(F.desc('agent_interactions')).toPandas().to_excel(writer, sheet_name='AI Agent Users', index=False)
            
            # Sheet 3: Feature Usage
            if feature_usage is not None:
                feature_usage.toPandas().to_excel(writer, sheet_name='Feature Usage', index=False)
            
            # Sheet 4: Inactive Users
            if 'inactive_users' in dir() and inactive_users is not None:
                inactive_users.toPandas().to_excel(writer, sheet_name='Inactive Users', index=False)
            
            # Sheet 5: Recommendations
            if recommendations_df is not None:
                recommendations_df.toPandas().to_excel(writer, sheet_name='Recommendations', index=False)
            
            # Sheet 6: Summary
            summary_data = {
                'Metric': [
                    'Total Users',
                    'Active Users',
                    'DAU',
                    'WAU',
                    'MAU',
                    'AI/Agent Adoption Rate (%)',
                    'Power Users',
                    'Inactive Users',
                    'Analysis Period (days)',
                    'Analysis Date'
                ],
                'Value': [
                    users_df.count(),
                    users_df.filter(F.col('active') == True).count(),
                    dau,
                    wau,
                    mau,
                    f"{agent_adoption_rate:.1f}" if 'agent_adoption_rate' in dir() else '0',
                    user_metrics.filter(F.col('user_segment') == 'Power User').count(),
                    inactive_count if 'inactive_count' in dir() else 0,
                    LOOKBACK_DAYS,
                    datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S')
                ]
            }
            pd.DataFrame(summary_data).to_excel(writer, sheet_name='Summary', index=False)
        
        # Apply formatting
        from openpyxl import load_workbook
        from openpyxl.styles import Font, PatternFill, Alignment
        
        wb = load_workbook(excel_path)
        for sheet_name in wb.sheetnames:
            ws = wb[sheet_name]
            
            # Format header row
            for cell in ws[1]:
                cell.font = Font(bold=True, color='FFFFFF')
                cell.fill = PatternFill(start_color='366092', end_color='366092', fill_type='solid')
                cell.alignment = Alignment(horizontal='center')
            
            # Auto-adjust column widths
            for column in ws.columns:
                max_length = 0
                column_letter = column[0].column_letter
                for cell in column:
                    # Compare against None so numeric zeros still count toward the width
                    if cell.value is not None:
                        max_length = max(max_length, len(str(cell.value)))
                ws.column_dimensions[column_letter].width = min(max_length + 2, 50)
        
        wb.save(excel_path)
        
        log(f"✓ Excel workbook created: {excel_path}")
        log(f"  Sheets: {len(wb.sheetnames)}")
        
    except Exception as e:
        log(f"✗ Excel export failed: {str(e)}")
else:
    log("ℹ️  Excel export skipped")

log_execution_time("Excel Export", cell_start_time)
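
The export cells guard optional results with checks such as `'agent_adoption_rate' in dir()`. One possible alternative, sketched here under the assumption that those variables live in the notebook's global namespace, is a small lookup helper that also treats `None` as missing. The name `optional_global` is hypothetical and does not appear elsewhere in the notebook.

```python
def optional_global(name, default=None, namespace=None):
    """Look up an optional variable by name, falling back to `default`.

    Hypothetical helper: `namespace` defaults to the globals of the module
    (or notebook) where this function is defined.
    """
    ns = globals() if namespace is None else namespace
    value = ns.get(name, default)
    # Treat an explicit None the same way as a missing variable.
    return default if value is None else value


# Example using an explicit namespace so the sketch runs on its own.
example_namespace = {'agent_adoption_rate': 12.5, 'inactive_count': None}
rate = optional_global('agent_adoption_rate', 0.0, example_namespace)
inactive = optional_global('inactive_count', 0, example_namespace)
print(rate, inactive)
```

Passing the namespace explicitly keeps the helper testable; inside the notebook the default `globals()` lookup would normally be enough.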

In [0]:
cell_start_time = time.time()

if ENABLE_DELTA_EXPORT and user_metrics is not None:
    log("\n" + "="*60)
    log("EXPORTING TO DELTA TABLE")
    log("="*60)
    
    try:
        # Add audit metadata
        user_metrics_export = user_metrics.withColumn('audit_timestamp', F.current_timestamp())
        user_metrics_export = user_metrics_export.withColumn('lookback_days', F.lit(LOOKBACK_DAYS))
        user_metrics_export = user_metrics_export.withColumn('dau', F.lit(dau))
        user_metrics_export = user_metrics_export.withColumn('wau', F.lit(wau))
        user_metrics_export = user_metrics_export.withColumn('mau', F.lit(mau))
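        # Float fallback (assuming agent_adoption_rate is a float) keeps the column type consistent across appends
        user_metrics_export = user_metrics_export.withColumn('agent_adoption_rate', F.lit(agent_adoption_rate if 'agent_adoption_rate' in dir() else 0.0))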
        
        # Write to Delta table (append mode)
        user_metrics_export.write \
            .format('delta') \
            .mode('append') \
            .option('mergeSchema', 'true') \
            .saveAsTable(DELTA_TABLE_NAME)
        
        log(f"✓ Delta table updated: {DELTA_TABLE_NAME}")
        log("  Mode: append (historical retention)")
        log(f"  Rows added: {user_metrics.count()}")
        
    except Exception as e:
        log(f"✗ Delta export failed: {str(e)}")
else:
    log("ℹ️  Delta export skipped")

log_execution_time("Delta Export", cell_start_time)

In [0]:
# ============================================================================
# EXECUTION SUMMARY
# ============================================================================

execution_time = time.time() - execution_stats['start_time']

log("\n" + "="*60)
log("EXECUTION SUMMARY")
log("="*60)

log(f"\n⏱️  Total execution time: {execution_time:.2f} seconds")

log("\n📊 Statistics:")
log(f"  Users analyzed: {execution_stats['users_processed']}")
log(f"  Audit records processed: {execution_stats['audit_records_processed']:,}")
log(f"  API calls: {execution_stats['api_calls']}")
log(f"  API failures: {execution_stats['api_failures']}")

if user_metrics is not None:
    log("\n👥 Adoption Summary:")
    log(f"  DAU: {dau}")
    log(f"  WAU: {wau}")
    log(f"  MAU: {mau}")
    if 'agent_adoption_rate' in dir():
        log(f"  AI/Agent adoption: {agent_adoption_rate:.1f}%")
    log(f"  Power users: {user_metrics.filter(F.col('user_segment') == 'Power User').count()}")
    log(f"  Inactive users: {inactive_count if 'inactive_count' in dir() else 0}")

if recommendations_df is not None:
    log(f"\n💡 Recommendations: {recommendations_df.count()}")

log("\n" + "="*60)
log("✓ COLLABORATION & ADOPTION ANALYSIS COMPLETE")
log("="*60)

# Return JSON summary for job mode
if is_job_mode:
    import json
    summary = {
        'status': 'success',
        'execution_time_seconds': execution_time,
        'users_analyzed': execution_stats['users_processed'],
        'dau': dau,
        'wau': wau,
        'mau': mau,
        'agent_adoption_rate': agent_adoption_rate if 'agent_adoption_rate' in dir() else 0,
        'inactive_users': inactive_count if 'inactive_count' in dir() else 0,
        'recommendations': recommendations_df.count() if recommendations_df is not None else 0,
        'timestamp': datetime.now(eastern).isoformat()
    }
    dbutils.notebook.exit(json.dumps(summary))
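
The job-mode summary above is serialized with `json.dumps`, which raises `TypeError` on values that are not native JSON types. Whether that happens here depends on how the metrics were computed upstream. The sketch below assumes a metric could arrive as a `Decimal` or as an object exposing `.item()` (as NumPy scalars do) and shows one way to handle it with the `default` hook. The helper name `to_json_safe` and the sample values are illustrative only.

```python
import json
from decimal import Decimal


def to_json_safe(value):
    """Fallback for json.dumps: convert common numeric wrappers to Python scalars."""
    if isinstance(value, Decimal):
        return float(value)
    # NumPy scalars expose .item(), which returns the equivalent Python scalar.
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Illustrative summary; the keys mirror the job-mode summary, the values are made up.
sample_summary = {
    'status': 'success',
    'dau': 12,
    'agent_adoption_rate': Decimal('42.5'),
}
payload = json.dumps(sample_summary, default=to_json_safe)
print(payload)
```

The `default` callable is only invoked for objects the encoder cannot handle on its own, so native ints, floats and strings pass through unchanged.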