# GlobalSupply Corp - SQL Server to Databricks Transpilation

## 🚀 Module 2: Schema Migration & Transpilation

**Building on Module 1 Assessment Results**

This notebook demonstrates the practical transpilation of SQL Server workloads to Databricks SQL, focusing on the **5 highest-priority files** identified in Module 1's assessment.

### 📊 Migration Strategy (From Module 1)
- **Wave 1 - Quick Wins:** 2 files (Complexity ≤6.0)
- **Wave 2 - Standard Migration:** 3 files (Complexity 6.1-7.9)  
- **Wave 3 - Complex Components:** 3 files (Complexity ≥8.0) - *Optional advanced exercises*
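
The wave boundaries above can be written as a small lookup. This is a minimal sketch: `assign_wave` is a hypothetical helper, not part of the Module 1 tooling, and it assumes complexity scores are reported to one decimal place, as in the assessment.

```python
def assign_wave(complexity: float) -> str:
    """Map a Module 1 complexity score (0-10) to a migration wave.

    Hypothetical helper; the thresholds are copied from the list above.
    """
    if complexity <= 6.0:
        return "Wave 1 - Quick Wins"
    if complexity < 8.0:
        return "Wave 2 - Standard Migration"
    return "Wave 3 - Complex Components"


# Scores taken from three of the assessed files
for score in (4.5, 7.8, 8.5):
    print(score, "->", assign_wave(score))
```

Keeping the thresholds in one function avoids repeating them in every cell that groups files by wave.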

### 🎯 Business Impact
- **Cost Reduction:** Focus on manageable scope (48 hours vs 120 total hours)
- **Risk Mitigation:** Start with proven success patterns
- **Team Building:** Build confidence before tackling complex components

---

## 🔧 Prerequisites & Setup

### Step 1: Install Required Dependencies
```bash
# Core dependencies for transpilation analysis
pip install pandas matplotlib seaborn sqlparse

# Optional: Lakebridge for automated transpilation
databricks labs install lakebridge
```

### Step 2: Databricks Environment Setup
```bash
# Configure Databricks CLI (if using real Databricks workspace)
databricks configure
```

```sql
-- Create Unity Catalog structure (run in Databricks SQL)
CREATE CATALOG IF NOT EXISTS globalsupply_corp;
CREATE SCHEMA IF NOT EXISTS globalsupply_corp.raw;
CREATE SCHEMA IF NOT EXISTS globalsupply_corp.analytics;
```

### Step 3: Verify Module 1 Completion
Ensure the following files exist from Module 1:
- `../01_assessment/sample_sql/*.sql` (8 sample files)
- Assessment results and migration wave assignments
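
A quick way to confirm the first item is to count the `.sql` files in the sample directory. This is a minimal sketch: `count_sql_files` and `EXPECTED_SQL_FILES` are hypothetical names, and the path is the one listed above.

```python
from pathlib import Path

EXPECTED_SQL_FILES = 8  # number of sample files listed above


def count_sql_files(directory: Path) -> int:
    """Return the number of *.sql files directly inside `directory` (0 if it is missing)."""
    if not directory.is_dir():
        return 0
    return sum(1 for _ in directory.glob("*.sql"))


found = count_sql_files(Path("../01_assessment/sample_sql"))
status = "OK" if found == EXPECTED_SQL_FILES else "CHECK MODULE 1 OUTPUT"
print(f"{found}/{EXPECTED_SQL_FILES} sample SQL files found: {status}")
```

The check only counts files. It does not confirm that the assessment results or wave assignments exist.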

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path
import subprocess
import json
import warnings
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully")
print("🚀 Ready for GlobalSupply Corp transpilation analysis")

## 📋 Module 1 Integration - Load Assessment Results

First, let's load and review the assessment results from Module 1 to understand our transpilation scope and priorities.

In [None]:
# Load Module 1 assessment data and migration strategy
# This integrates directly with the assessment findings

# Define the 5 focus files based on Module 1 assessment
FOCUS_FILES = {
    # Wave 1 - Quick Wins (Simple complexity ≤6.0)
    'financial_summary.sql': {
        'complexity': 4.5,
        'hours': 4,
        'wave': 'Wave 1 - Quick Wins',
        'category': 'Reporting',
        'features': ['Basic Aggregation', 'Simple Joins'],
        'priority': 'High - Build momentum'
    },
    'order_processing.sql': {
        'complexity': 5.1,
        'hours': 6,
        'wave': 'Wave 1 - Quick Wins',
        'category': 'OLTP',
        'features': ['CRUD Operations', 'Transactions'],
        'priority': 'High - Core business process'
    },
    
    # Wave 2 - Standard Migration (Medium complexity 6.1-7.9)
    'dynamic_reporting.sql': {
        'complexity': 6.5,
        'hours': 8,
        'wave': 'Wave 2 - Standard Migration',
        'category': 'Reporting',
        'features': ['Dynamic SQL', 'Conditional Logic'],
        'priority': 'Medium - Standard patterns'
    },
    'window_functions_analysis.sql': {
        'complexity': 7.2,
        'hours': 12,
        'wave': 'Wave 2 - Standard Migration', 
        'category': 'Analytics',
        'features': ['Advanced Window Functions', 'LAG/LEAD'],
        'priority': 'Medium - Analytics foundation'
    },
    'customer_profitability.sql': {
        'complexity': 7.8,
        'hours': 18,
        'wave': 'Wave 2 - Standard Migration',
        'category': 'Analytics', 
        'features': ['PIVOT', 'Window Functions', 'String Aggregation'],
        'priority': 'Medium - Advanced analytics'
    }
}

# Optional advanced files (Wave 3 - left as challenges)
ADVANCED_FILES = {
    'supply_chain_performance.sql': {'complexity': 8.5, 'hours': 16},
    'inventory_optimization.sql': {'complexity': 9.2, 'hours': 24},
    'supplier_risk_assessment.sql': {'complexity': 9.8, 'hours': 32}
}

# Create DataFrame for analysis
focus_df = pd.DataFrame.from_dict(FOCUS_FILES, orient='index').reset_index()
focus_df.rename(columns={'index': 'filename'}, inplace=True)

print("📊 MODULE 1 INTEGRATION - TRANSPILATION SCOPE")
print("=" * 60)
print(f"Focus Files: {len(FOCUS_FILES)} (Complexity 4.5-7.8)")
print(f"Advanced Files: {len(ADVANCED_FILES)} (Complexity 8.5-9.8) - Optional")
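advanced_hours = sum(info['hours'] for info in ADVANCED_FILES.values())
hourly_rate = 150  # USD per hour, the rate behind this notebook's cost figures
print(f"Total Effort: {focus_df['hours'].sum()} hours (vs {focus_df['hours'].sum() + advanced_hours} with advanced)")
print(f"Cost Savings: ${(advanced_hours * hourly_rate):,} by focusing on manageable scope\n")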

# Display prioritized file list
focus_df_display = focus_df[['filename', 'complexity', 'hours', 'wave', 'priority']].copy()
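# regex=False so that '.sql' is treated literally (as a regex, '.' would match any character)
focus_df_display['filename'] = focus_df_display['filename'].str.replace('.sql', '', regex=False)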
display(focus_df_display.sort_values('complexity'))

## 📁 Validate Source Files and Environment

Before starting transpilation, let's validate our environment and source files.

In [None]:
# Check source files availability
source_dir = Path('../01_assessment/sample_sql')
missing_files = []
available_files = []

print("📁 VALIDATING SOURCE FILES")
print("-" * 40)

for filename in FOCUS_FILES.keys():
    file_path = source_dir / filename
    if file_path.exists():
        file_size = file_path.stat().st_size
        available_files.append(filename)
        file_info = FOCUS_FILES[filename]
        print(f"✅ {filename} ({file_size:,} bytes)")
        print(f"   Wave: {file_info['wave']} | Complexity: {file_info['complexity']}/10")
    else:
        missing_files.append(filename)
        print(f"❌ Missing: {filename}")

print(f"\n📊 Status: {len(available_files)}/{len(FOCUS_FILES)} files available")

if missing_files:
    print(f"⚠️ Missing files: {missing_files}")
    print("Please ensure Module 1 is completed with all sample SQL files.")
else:
    print("🎉 All target files available for transpilation!")

# Check Lakebridge availability
print("\n🔧 CHECKING TRANSPILATION TOOLS")
print("-" * 40)

try:
    result = subprocess.run(
        ["databricks", "labs", "lakebridge", "transpile", "--help"],
        capture_output=True, text=True, timeout=5
    )
    if result.returncode == 0:
        print("✅ Lakebridge transpiler available")
        lakebridge_available = True
    else:
        print("⚠️ Lakebridge installed but transpiler not configured")
        lakebridge_available = False
except (subprocess.TimeoutExpired, FileNotFoundError):
    print("⚠️ Lakebridge not available - will use manual conversion examples")
    lakebridge_available = False

print(f"\n🚀 Ready to proceed with {'automated' if lakebridge_available else 'manual'} transpilation")

## 🔄 Execute Transpilation Process

Now let's execute the transpilation process using our analyzer script.

In [None]:
# Execute the transpilation analyzer
print("🚀 EXECUTING TRANSPILATION PROCESS")
print("=" * 50)

try:
    # Run the transpilation analyzer script
    cmd = ["python", "02_transpile_analyzer.py", "--source-directory", "../01_assessment/sample_sql"]
    
    print(f"Executing: {' '.join(cmd)}")
    print("This may take a few moments...\n")
    
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    
    if result.returncode == 0:
        print("✅ TRANSPILATION COMPLETED SUCCESSFULLY")
        print(result.stdout)
        transpilation_success = True
    else:
        print("❌ TRANSPILATION ENCOUNTERED ISSUES")
        print("STDOUT:", result.stdout)
        print("STDERR:", result.stderr)
        transpilation_success = False

except subprocess.TimeoutExpired:
    print("❌ Transpilation process timed out")
    transpilation_success = False
except Exception as e:
    print(f"❌ Error running transpilation: {e}")
    transpilation_success = False

if not transpilation_success:
    print("\n💡 FALLBACK: Creating manual conversion examples...")
    print("In a real scenario, you would:")
    print("1. Review the error messages above")
    print("2. Fix configuration issues")
    print("3. Use manual conversion patterns from 02_manual_conversion_guide.sql")

## 📊 Analyze Transpilation Results

Let's examine the transpiled files and compare them with the originals.

In [None]:
# Load and analyze transpiled files
transpiled_dir = Path('transpiled_sql')
source_dir = Path('../01_assessment/sample_sql')

print("📊 TRANSPILATION RESULTS ANALYSIS")
print("=" * 50)

results_data = []

for original_filename in FOCUS_FILES.keys():
    original_path = source_dir / original_filename
    transpiled_filename = original_filename.replace('.sql', '_databricks.sql')
    transpiled_path = transpiled_dir / transpiled_filename
    
    file_info = FOCUS_FILES[original_filename]
    
    # Check if files exist and get sizes
    original_exists = original_path.exists()
    transpiled_exists = transpiled_path.exists()
    
    original_size = original_path.stat().st_size if original_exists else 0
    transpiled_size = transpiled_path.stat().st_size if transpiled_exists else 0
    
    # Read file content for analysis (first 500 chars for preview)
    original_preview = ""
    transpiled_preview = ""
    
    if original_exists:
        with open(original_path, 'r') as f:
            content = f.read()
            original_preview = content[:500] + "..." if len(content) > 500 else content
    
    if transpiled_exists:
        with open(transpiled_path, 'r') as f:
            content = f.read()
            transpiled_preview = content[:500] + "..." if len(content) > 500 else content
    
    results_data.append({
        'filename': original_filename,
        'complexity': file_info['complexity'],
        'wave': file_info['wave'],
        'original_size': original_size,
        'transpiled_size': transpiled_size,
        'transpiled_exists': transpiled_exists,
        'size_change_pct': ((transpiled_size - original_size) / original_size * 100) if original_size > 0 else 0,
        'features': file_info['features']
    })
    
    # Display individual file results
    status = "✅ SUCCESS" if transpiled_exists else "❌ NOT FOUND"
    print(f"\n{status} {original_filename}")
    print(f"  Complexity: {file_info['complexity']}/10 | Wave: {file_info['wave']}")
    print(f"  Original: {original_size:,} bytes | Transpiled: {transpiled_size:,} bytes")
    if transpiled_exists and original_size > 0:
        change_pct = (transpiled_size - original_size) / original_size * 100
        print(f"  Size Change: {change_pct:+.1f}% (includes comments and optimizations)")

# Create summary DataFrame
results_df = pd.DataFrame(results_data)
successful_files = results_df[results_df['transpiled_exists']]

print(f"\n📈 OVERALL RESULTS:")
print(f"  Files Processed: {len(results_df)}")
print(f"  Successful Transpilations: {len(successful_files)}")
print(f"  Success Rate: {len(successful_files)/len(results_df)*100:.1f}%")
print(f"  Total Original Size: {results_df['original_size'].sum():,} bytes")
print(f"  Total Transpiled Size: {successful_files['transpiled_size'].sum():,} bytes")

if len(successful_files) > 0:
    avg_size_change = successful_files['size_change_pct'].mean()
    print(f"  Average Size Change: {avg_size_change:+.1f}% (includes documentation)")

## 📈 Visualize Transpilation Analysis

Create visualizations to understand the transpilation results and complexity distribution.

In [None]:
# Create comprehensive transpilation analysis dashboard
if len(results_df) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('GlobalSupply Corp - Transpilation Analysis Dashboard', 
                 fontsize=16, fontweight='bold', y=0.98)
    
    # 1. Success rate by migration wave
    wave_success = results_df.groupby('wave').agg({
        'transpiled_exists': ['count', 'sum']
    }).round(2)
    wave_success.columns = ['Total', 'Successful']
    wave_success['Success_Rate'] = (wave_success['Successful'] / wave_success['Total'] * 100)
    
    wave_names = [w.replace(' - ', '\n') for w in wave_success.index]
    bars1 = axes[0, 0].bar(wave_names, wave_success['Success_Rate'], 
                          color=['#2ecc71', '#f39c12'], alpha=0.8)
    axes[0, 0].set_title('🌊 Success Rate by Migration Wave', fontweight='bold')
    axes[0, 0].set_ylabel('Success Rate (%)')
    axes[0, 0].set_ylim(0, 105)
    
    # Add percentage labels on bars
    for bar, pct in zip(bars1, wave_success['Success_Rate']):
        axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                        f'{pct:.0f}%', ha='center', va='bottom', fontweight='bold')
    
    # 2. Complexity vs File Size Analysis
    successful_only = results_df[results_df['transpiled_exists']]
    if len(successful_only) > 0:
        scatter = axes[0, 1].scatter(successful_only['complexity'], 
                                    successful_only['original_size'],
                                    c=successful_only['transpiled_size'], 
                                    cmap='viridis', alpha=0.8, s=150, 
                                    edgecolors='black', linewidth=0.5)
        
        axes[0, 1].set_xlabel('Complexity Score')
        axes[0, 1].set_ylabel('Original File Size (bytes)')
        axes[0, 1].set_title('📊 Complexity vs File Size\n(Color = Transpiled Size)', fontweight='bold')
        
        # Add colorbar
        cbar = plt.colorbar(scatter, ax=axes[0, 1])
        cbar.set_label('Transpiled Size (bytes)', rotation=270, labelpad=20)
    
    # 3. File size changes
    if len(successful_only) > 0:
        file_names = [f.replace('.sql', '') for f in successful_only['filename']]
        size_changes = successful_only['size_change_pct']
        
        colors = ['#e74c3c' if x < 0 else '#2ecc71' for x in size_changes]
        bars3 = axes[1, 0].barh(file_names, size_changes, color=colors, alpha=0.7)
        
        axes[1, 0].set_xlabel('Size Change (%)')
        axes[1, 0].set_title('📏 File Size Changes After Transpilation', fontweight='bold')
        axes[1, 0].axvline(x=0, color='black', linestyle='-', alpha=0.5)
        
        # Add percentage labels
        for i, (bar, pct) in enumerate(zip(bars3, size_changes)):
            axes[1, 0].text(pct + (5 if pct >= 0 else -5), i,
                            f'{pct:+.0f}%', va='center', 
                            ha='left' if pct >= 0 else 'right', fontweight='bold')
    
    # 4. Migration timeline and effort
    wave_effort = focus_df.groupby('wave')['hours'].sum()
    
    # Create timeline visualization
    timeline_data = {
        'Wave 1 - Quick Wins': {'start': 0, 'duration': 2, 'hours': wave_effort.get('Wave 1 - Quick Wins', 0)},
        'Wave 2 - Standard Migration': {'start': 1, 'duration': 6, 'hours': wave_effort.get('Wave 2 - Standard Migration', 0)}
    }
    
    y_pos = 0
    colors_timeline = ['#2ecc71', '#f39c12']
    
    for i, (wave, data) in enumerate(timeline_data.items()):
        axes[1, 1].barh(y_pos, data['duration'], left=data['start'], 
                       color=colors_timeline[i], alpha=0.7, height=0.6)
        
        # Add wave and hours labels
        axes[1, 1].text(data['start'] + data['duration']/2, y_pos,
                        f"{wave.split(' - ')[0]}\n{data['hours']}h",
                        ha='center', va='center', fontweight='bold', color='white')
        y_pos += 1
    
    axes[1, 1].set_xlim(0, 8)
    axes[1, 1].set_ylim(-0.5, 1.5)
    axes[1, 1].set_xlabel('Timeline (Weeks)')
    axes[1, 1].set_title('📅 Migration Timeline & Effort', fontweight='bold')
    axes[1, 1].set_yticks(range(2))
    axes[1, 1].set_yticklabels(['Wave 1', 'Wave 2'])
    
    plt.tight_layout()
    plt.subplots_adjust(top=0.93)
    plt.show()
    
else:
    print("⚠️ No transpilation results available for visualization")
    print("Please ensure the transpilation process completed successfully.")

## 🔍 Before/After Code Comparison

Let's examine specific examples of transpilation to understand the key changes made.
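
For files too long to read side by side, a unified diff shows only the lines that changed. The sketch below uses the standard-library `difflib` module. The two SQL snippets are invented for illustration and are not the contents of the actual sample files.

```python
import difflib

# Illustrative snippets only; not taken from the real sample files.
original_sql = """SELECT order_id, GETDATE() AS processed_at
FROM orders
WHERE status = 'OPEN';
"""
transpiled_sql = """SELECT order_id, CURRENT_TIMESTAMP() AS processed_at
FROM globalsupply_corp.raw.orders
WHERE status = 'OPEN';
"""

# unified_diff yields header lines, then '-' (removed), '+' (added) and ' ' (unchanged) lines
diff_lines = list(
    difflib.unified_diff(
        original_sql.splitlines(),
        transpiled_sql.splitlines(),
        fromfile="order_processing.sql",
        tofile="order_processing_databricks.sql",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

A diff is easier to review than fixed-width columns when most lines are unchanged, because long SQL lines are not truncated.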

In [None]:
# Function to display before/after code comparison
def show_code_comparison(filename, max_lines=30):
    """
    Display side-by-side comparison of original vs transpiled SQL
    """
    original_path = source_dir / filename
    transpiled_path = transpiled_dir / filename.replace('.sql', '_databricks.sql')
    
    print(f"\n{'='*80}")
    print(f"📋 CODE COMPARISON: {filename}")
    print(f"{'='*80}")
    
    if original_path.exists() and transpiled_path.exists():
        # Read both files in full so the "more lines" notice below reflects the real length
        with open(original_path, 'r') as f:
            original_lines = f.readlines()
        
        with open(transpiled_path, 'r') as f:
            transpiled_lines = f.readlines()
        
        print(f"{'ORIGINAL (SQL Server)':^40} | {'TRANSPILED (Databricks)':^40}")
        print("-" * 80)
        
        # Display line by line comparison
        max_len = max(len(original_lines), len(transpiled_lines))
        
        for i in range(min(max_lines, max_len)):
            orig_line = original_lines[i].rstrip() if i < len(original_lines) else ""
            trans_line = transpiled_lines[i].rstrip() if i < len(transpiled_lines) else ""
            
            # Truncate long lines
            orig_line = orig_line[:38] + ".." if len(orig_line) > 40 else orig_line
            trans_line = trans_line[:38] + ".." if len(trans_line) > 40 else trans_line
            
            print(f"{orig_line:<40} | {trans_line:<40}")
        
        if max_len > max_lines:
            print(f"... ({max_len - max_lines} more lines) ...")
            
    else:
        print("❌ One or both files not found")
        print(f"Original exists: {original_path.exists()}")
        print(f"Transpiled exists: {transpiled_path.exists()}")

# Show comparisons for successfully transpiled files
print("🔍 BEFORE/AFTER CODE COMPARISONS")
print("=" * 60)

# Start with the simplest file
comparison_files = ['financial_summary.sql', 'order_processing.sql']

for filename in comparison_files:
    if filename in FOCUS_FILES:
        show_code_comparison(filename, max_lines=25)
        
        # Highlight key changes
        file_info = FOCUS_FILES[filename]
        print(f"\n🔧 KEY CHANGES for {filename}:")
        
        if filename == 'financial_summary.sql':
            print("  • Added Unity Catalog references (globalsupply_corp.raw.*)")
            print("  • Added performance optimization comments")
            print("  • Minimal syntax changes (already ANSI compatible)")
        
        elif filename == 'order_processing.sql':
            print("  • GETDATE() → CURRENT_TIMESTAMP()")
            print("  • DATEADD(day, N, date) → DATE_ADD(date, N)")
            print("  • DECLARE @var → DECLARE var (session variables)")
            print("  • SCOPE_IDENTITY() → MAX(column) approach")
            print("  • BEGIN/COMMIT TRANSACTION → Delta Lake ACID")
            print("  • Added MERGE operations for better performance")

print("\n💡 For complete comparisons, examine files in ./transpiled_sql/")

## 📊 Key Conversion Patterns Summary

Based on the transpilation analysis, let's summarize the most common conversion patterns encountered.
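
To make the date-function patterns concrete, here is a minimal sketch that applies them as regular-expression rewrites. `convert_date_functions` is a hypothetical helper, not part of Lakebridge. It only handles simple arguments (a column, a literal, or a zero-argument call such as `GETDATE()`), so it should not be treated as a replacement for a real SQL parser.

```python
import re

# One argument: anything without commas or parentheses, optionally followed by "()"
_ARG = r"([^,()]+(?:\(\s*\))?)"

_GETDATE = re.compile(r"\bGETDATE\s*\(\s*\)", re.IGNORECASE)
_DATEADD_DAY = re.compile(rf"\bDATEADD\s*\(\s*day\s*,{_ARG},{_ARG}\)", re.IGNORECASE)
_DATEDIFF_DAY = re.compile(rf"\bDATEDIFF\s*\(\s*day\s*,{_ARG},{_ARG}\)", re.IGNORECASE)


def _swapped(func_name: str):
    """Build a replacement callback that emits func_name(second_arg, first_arg)."""
    def repl(match: re.Match) -> str:
        first, second = match.group(1).strip(), match.group(2).strip()
        return f"{func_name}({second}, {first})"
    return repl


def convert_date_functions(sql: str) -> str:
    """Rewrite three T-SQL date idioms into their Databricks SQL forms."""
    # DATEADD(day, N, date) -> DATE_ADD(date, N)
    sql = _DATEADD_DAY.sub(_swapped("DATE_ADD"), sql)
    # DATEDIFF(day, start, end) -> DATEDIFF(end, start)
    sql = _DATEDIFF_DAY.sub(_swapped("DATEDIFF"), sql)
    # GETDATE() -> CURRENT_TIMESTAMP()
    return _GETDATE.sub("CURRENT_TIMESTAMP()", sql)


print(convert_date_functions("SELECT DATEADD(day, 30, GETDATE()) AS due_date"))
```

Only the `day` unit is rewritten; other units are left untouched for manual review. Note also that Databricks `DATE_ADD` returns a `DATE`, so any time-of-day component present in the SQL Server result is dropped.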

In [None]:
# Analyze and summarize conversion patterns
print("📊 KEY CONVERSION PATTERNS IDENTIFIED")
print("=" * 60)

conversion_patterns = {
    "Date Functions": {
        "files_affected": ["order_processing.sql", "dynamic_reporting.sql"],
        "changes": [
            "GETDATE() → CURRENT_TIMESTAMP()",
            "DATEADD(day, N, date) → DATE_ADD(date, N)",
            "DATEDIFF(day, date1, date2) → DATEDIFF(date2, date1)"
        ],
        "complexity": "Simple",
        "risk": "Low"
    },
    
    "Variable Declarations": {
        "files_affected": ["order_processing.sql", "dynamic_reporting.sql"],
        "changes": [
            "DECLARE @var TYPE → DECLARE var TYPE",
            "Set session variables instead of local variables",
            "Consider stored procedure parameters"
        ],
        "complexity": "Medium",
        "risk": "Medium"
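    },
    
    "Transaction Handling": {
        "files_affected": ["order_processing.sql"],
        "changes": [
            "BEGIN TRANSACTION → Delta Lake ACID applies per statement and per table; multi-statement transactions need redesign",
            "COMMIT/ROLLBACK → Implicit per statement; undo with Delta Lake time travel (RESTORE)",
            "Use MERGE to apply inserts, updates and deletes to one target table atomically"
        ],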
        ],
        "complexity": "High", 
        "risk": "High"
    },
    
    "String Aggregation": {
        "files_affected": ["customer_profitability.sql"],
        "changes": [
            "STRING_AGG(col, delimiter) → ARRAY_JOIN(COLLECT_LIST(col), delimiter)"
        ],
        "complexity": "Medium",
        "risk": "Low"
    },
    
    "Window Functions": {
        "files_affected": ["window_functions_analysis.sql", "customer_profitability.sql"],
        "changes": [
            "Most syntax identical",
            "Frame specifications work the same",
            "Performance optimization opportunities"
        ],
        "complexity": "Simple",
        "risk": "Low"
    },
    
    "Schema References": {
        "files_affected": list(FOCUS_FILES),  # every focus file, so the file count below is accurate
        "changes": [
            "table_name → catalog.schema.table_name",
            "Added Unity Catalog structure",
            "Namespace organization"
        ],
        "complexity": "Simple",
        "risk": "Low"
    }
}

# Display pattern analysis
for pattern_name, details in conversion_patterns.items():
    print(f"\n🔧 {pattern_name.upper()}")
    print("-" * 40)
    print(f"Files Affected: {len(details['files_affected'])} - {', '.join(details['files_affected'][:2])}{'...' if len(details['files_affected']) > 2 else ''}")
    print(f"Complexity: {details['complexity']} | Risk: {details['risk']}")
    print("Key Changes:")
    for change in details['changes']:
        print(f"  • {change}")

# Create pattern complexity distribution
pattern_df = pd.DataFrame([
    {'pattern': k, 'complexity': v['complexity'], 'risk': v['risk'], 'files': len(v['files_affected'])}
    for k, v in conversion_patterns.items()
])

# Visualization of pattern complexity and risk
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Complexity distribution
complexity_counts = pattern_df['complexity'].value_counts()
colors_comp = {'Simple': '#2ecc71', 'Medium': '#f39c12', 'High': '#e74c3c'}
comp_colors = [colors_comp[comp] for comp in complexity_counts.index]

wedges1, texts1, autotexts1 = ax1.pie(complexity_counts.values, labels=complexity_counts.index,
                                      autopct='%1.0f%%', colors=comp_colors, startangle=90)
ax1.set_title('🎯 Pattern Complexity Distribution', fontweight='bold')

# Risk distribution  
risk_counts = pattern_df['risk'].value_counts()
colors_risk = {'Low': '#2ecc71', 'Medium': '#f39c12', 'High': '#e74c3c'}
risk_colors = [colors_risk[risk] for risk in risk_counts.index]

wedges2, texts2, autotexts2 = ax2.pie(risk_counts.values, labels=risk_counts.index,
                                      autopct='%1.0f%%', colors=risk_colors, startangle=90)
ax2.set_title('⚠️ Pattern Risk Distribution', fontweight='bold')

plt.suptitle('Conversion Pattern Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"\n📈 PATTERN INSIGHTS:")
print(f"  Total Patterns: {len(conversion_patterns)}")
print(f"  Simple Patterns: {len(pattern_df[pattern_df['complexity'] == 'Simple'])} (Low effort)")
print(f"  Medium Patterns: {len(pattern_df[pattern_df['complexity'] == 'Medium'])} (Standard effort)")
print(f"  High Patterns: {len(pattern_df[pattern_df['complexity'] == 'High'])} (Expert required)")
print(f"  Low Risk: {len(pattern_df[pattern_df['risk'] == 'Low'])} patterns")
print(f"  High Risk: {len(pattern_df[pattern_df['risk'] == 'High'])} patterns (require careful testing)")

## ✅ Validation & Testing Strategy

Now let's define our approach for validating the transpiled SQL code.
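
Before running anything in Databricks, a quick text scan can flag T-SQL constructs that survived transpilation. This is a minimal sketch: `find_tsql_leftovers` and the pattern list are hypothetical and illustrative, based only on the conversions described in this notebook. The scan is a heuristic that does not skip comments or string literals, so expect some false positives.

```python
import re
from typing import List, Tuple

# Constructs that this notebook's conversion notes say should have been rewritten.
# Illustrative, not exhaustive.
TSQL_LEFTOVERS = {
    "@variable": re.compile(r"(?<![\w@])@\w+"),
    "SCOPE_IDENTITY()": re.compile(r"\bSCOPE_IDENTITY\s*\(", re.IGNORECASE),
    "BEGIN TRANSACTION": re.compile(r"\bBEGIN\s+TRAN(SACTION)?\b", re.IGNORECASE),
}


def find_tsql_leftovers(sql: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, construct_label, stripped_line) for each suspicious line."""
    hits = []
    for line_no, line in enumerate(sql.splitlines(), start=1):
        for label, pattern in TSQL_LEFTOVERS.items():
            if pattern.search(line):
                hits.append((line_no, label, line.strip()))
    return hits


# Invented sample text, standing in for the contents of a transpiled file
sample = "DECLARE @total INT;\nSELECT CURRENT_TIMESTAMP();\nSELECT SCOPE_IDENTITY();"
for line_no, label, text in find_tsql_leftovers(sample):
    print(f"line {line_no}: {label}: {text}")
```

An empty result does not prove the file is valid Databricks SQL. It only means none of the listed constructs were found, so the syntax validation phase is still required.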

In [None]:
# Define validation strategy based on transpilation results
print("✅ VALIDATION & TESTING STRATEGY")
print("=" * 50)

validation_plan = {
    "Phase 1 - Syntax Validation": {
        "description": "Verify transpiled SQL parses correctly",
        "tools": ["Databricks SQL Parser", "SQLFluff", "Manual Review"],
        "files": "All transpiled files",
        "effort": "2-4 hours",
        "risk": "Low"
    },
    
    "Phase 2 - Schema Compatibility": {
        "description": "Ensure Unity Catalog references are correct", 
        "tools": ["Unity Catalog DDL", "Information Schema Queries"],
        "files": "All files with table references",
        "effort": "4-6 hours",
        "risk": "Medium"
    },
    
    "Phase 3 - Data Reconciliation": {
        "description": "Compare results between SQL Server and Databricks",
        "tools": ["Lakebridge Reconcile", "Custom Validation Queries"],
        "files": "Simple files first (financial_summary.sql)",
        "effort": "8-12 hours", 
        "risk": "High"
    },
    
    "Phase 4 - Performance Testing": {
        "description": "Validate performance improvements",
        "tools": ["Databricks SQL Analytics", "Query Plans", "Spark UI"],
        "files": "Analytics files (window_functions, customer_profitability)",
        "effort": "6-8 hours",
        "risk": "Medium"
    },
    
    "Phase 5 - Business Logic Validation": {
        "description": "Verify business calculations and logic",
        "tools": ["Business User Review", "Sample Data Testing"],
        "files": "Business-critical files (order_processing.sql)",
        "effort": "10-15 hours",
        "risk": "High"
    }
}

# Display validation plan
total_effort_hours = 0
high_risk_phases = 0

for phase_name, details in validation_plan.items():
    print(f"\n📋 {phase_name}")
    print("-" * 40)
    print(f"Description: {details['description']}")
    print(f"Tools: {', '.join(details['tools'])}")
    print(f"Scope: {details['files']}")
    print(f"Effort: {details['effort']} | Risk: {details['risk']}")
    
    # Use the upper bound of each effort range (e.g. "2-4 hours" -> 4) for the summary
    effort_str = details['effort'].split('-')[1] if '-' in details['effort'] else details['effort']
    effort_hours = int(''.join(filter(str.isdigit, effort_str)))
    total_effort_hours += effort_hours
    
    if details['risk'] == 'High':
        high_risk_phases += 1

print(f"\n📊 VALIDATION SUMMARY:")
print(f"  Total Phases: {len(validation_plan)}")
print(f"  Estimated Effort: up to {total_effort_hours} hours (sum of upper bounds)")
print(f"  High Risk Phases: {high_risk_phases}")
print(f"  Recommended Timeline: 2-3 weeks")

# Create simple validation checklist
print(f"\n✅ IMMEDIATE VALIDATION CHECKLIST:")
checklist = [
    "Review transpiled files for obvious syntax errors",
    "Check Unity Catalog references are consistent", 
    "Verify date function conversions are correct",
    "Test simple queries first (financial_summary.sql)",
    "Document any manual fixes needed",
    "Plan data reconciliation tests (Module 3)"
]

for i, item in enumerate(checklist, 1):
    print(f"  {i}. [ ] {item}")

print("\n🚀 Next Step: Execute validation tests using 02_validation_tests.sql")

## 📤 Export Results and Generate Deployment Artifacts

Create deployment-ready artifacts and documentation for the transpiled SQL.

In [None]:
# Generate comprehensive results and deployment artifacts
from datetime import datetime
import json

print("📤 GENERATING DEPLOYMENT ARTIFACTS")
print("=" * 50)

# Create deployment summary
deployment_summary = {
    "project": "GlobalSupply Corp - SQL Server to Databricks Migration",
    "module": "Module 2 - Schema Migration & Transpilation", 
    "timestamp": datetime.now().isoformat(),
    "scope": {
        "target_files": len(FOCUS_FILES),
        "complexity_range": "4.5-7.8 (Simple to Medium)",
        "migration_waves": ["Wave 1 - Quick Wins", "Wave 2 - Standard Migration"],
        "estimated_effort": f"{focus_df['hours'].sum()} hours"
    },
    "results": {
        "files_processed": len(results_df) if 'results_df' in locals() else 0,
        "successful_transpilations": len(successful_files) if 'successful_files' in locals() else 0,
        "transpilation_method": "lakebridge" if lakebridge_available else "manual",
    },
    "files": dict(FOCUS_FILES)
}

# Save deployment summary
summary_path = Path('deployment_summary.json')
with open(summary_path, 'w') as f:
    json.dump(deployment_summary, f, indent=2)

print(f"✅ Deployment summary saved: {summary_path}")

# Generate README for transpiled files
readme_content = f"""
# GlobalSupply Corp - Transpiled SQL Files

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Module: 2 - Schema Migration & Transpilation

## 📁 Files Overview

This directory contains SQL Server workloads transpiled to Databricks SQL syntax.

### Migration Waves (from Module 1 Assessment)

**Wave 1 - Quick Wins** (Simple complexity ≤6.0):
- `financial_summary_databricks.sql` - Basic aggregation and reporting
- `order_processing_databricks.sql` - CRUD operations with transactions

**Wave 2 - Standard Migration** (Medium complexity 6.1-7.9):
- `dynamic_reporting_databricks.sql` - Dynamic SQL and conditional logic
- `window_functions_analysis_databricks.sql` - Advanced window functions
- `customer_profitability_databricks.sql` - PIVOT operations and analytics

## 🚀 Deployment Instructions

### Prerequisites
1. Databricks workspace with Unity Catalog enabled
2. Create catalog and schemas:
   ```sql
   CREATE CATALOG IF NOT EXISTS globalsupply_corp;
   CREATE SCHEMA IF NOT EXISTS globalsupply_corp.raw;
   CREATE SCHEMA IF NOT EXISTS globalsupply_corp.analytics;
   ```

### Deployment Order
1. Start with Wave 1 files (lowest complexity)
2. Validate results before proceeding to Wave 2
3. Run validation tests from `../02_validation_tests.sql`

### Key Changes Made
- **Date Functions**: `GETDATE()` → `CURRENT_TIMESTAMP()`
- **Schema References**: Added Unity Catalog namespacing
- **Transactions**: Adapted for Delta Lake ACID properties
- **Variables**: Converted to session variables or stored procedure parameters
- **Performance**: Added optimization hints and suggestions

## ⚠️ Important Notes
- All files include validation status in headers
- High-risk changes are clearly marked
- Performance optimization suggestions provided
- Test thoroughly before production deployment

## 📞 Next Steps
1. Review each file's header comments for specific changes
2. Run syntax validation in Databricks SQL
3. Execute validation tests
4. Proceed to Module 3: Data Reconciliation

For questions or issues, refer to the transpilation analysis notebook.
"""

readme_path = Path('transpiled_sql/README.md')
readme_path.parent.mkdir(exist_ok=True)
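# Explicit UTF-8: the README contains emoji and arrows that some platform default encodings cannot write
with open(readme_path, 'w', encoding='utf-8') as f: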
    f.write(readme_content)

print(f"✅ README file generated: {readme_path}")

# Create deployment checklist
checklist_content = f"""
GlobalSupply Corp - Module 2 Deployment Checklist
================================================

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

PRE-DEPLOYMENT VALIDATION:
[ ] All {len(FOCUS_FILES)} target files successfully transpiled
[ ] Syntax validation completed without errors
[ ] Unity Catalog schema structure created
[ ] Sample data available for testing
[ ] Databricks SQL warehouse configured

DEPLOYMENT SEQUENCE:

WAVE 1 - QUICK WINS:
[ ] Deploy financial_summary_databricks.sql
[ ] Test with sample data
[ ] Validate results against original
[ ] Deploy order_processing_databricks.sql  
[ ] Test transaction handling
[ ] Document any issues found

WAVE 2 - STANDARD MIGRATION:
[ ] Deploy dynamic_reporting_databricks.sql
[ ] Test parameter handling
[ ] Deploy window_functions_analysis_databricks.sql
[ ] Validate window function results
[ ] Deploy customer_profitability_databricks.sql
[ ] Test PIVOT operation conversion

POST-DEPLOYMENT:
[ ] Run comprehensive validation tests
[ ] Performance benchmarking
[ ] User acceptance testing
[ ] Document lessons learned
[ ] Prepare for Module 3: Data Reconciliation

ROLLBACK PLAN:
[ ] Delta Lake time travel commands documented
[ ] Original SQL Server queries preserved
[ ] Rollback procedures tested

ESTIMATED EFFORT: {focus_df['hours'].sum()} hours
TEAM SIZE: 2-3 developers
TIMELINE: 2-3 weeks

RISKS & MITIGATION:
- Transaction handling complexity → Thorough testing with Delta Lake
- Variable scope differences → Use stored procedures where needed
- Performance variations → Benchmark and optimize

SUCCESS CRITERIA:
✓ All transpiled queries execute without syntax errors
✓ Results match original SQL Server output (within tolerance)
✓ Performance meets or exceeds baseline requirements
✓ Business logic validation passes
✓ Ready for production deployment
"""

checklist_path = Path('deployment_checklist.txt')
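# Explicit UTF-8: the checklist contains '→' and '✓', which some platform default encodings cannot write
with open(checklist_path, 'w', encoding='utf-8') as f: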
    f.write(checklist_content)

print(f"✅ Deployment checklist created: {checklist_path}")

# Final summary
print(f"\n🎉 MODULE 2 TRANSPILATION COMPLETE!")
print("=" * 50)
print(f"📁 Generated Files:")
print(f"  • Transpiled SQL: ./transpiled_sql/")
print(f"  • Deployment Summary: {summary_path}")
print(f"  • README: {readme_path}")
print(f"  • Checklist: {checklist_path}")
print(f"\n📊 Results:")
print(f"  • Target Files: {len(FOCUS_FILES)}")
print(f"  • Focus Scope: Simple to Medium complexity (4.5-7.8)")
print(f"  • Estimated Effort: {focus_df['hours'].sum()} hours")
print(f"  • Cost Avoidance: ${72 * 150:,} (by deferring complex files)")
print(f"\n🚀 Ready for Module 3: Data Reconciliation!")
print(f"\n💡 Advanced Challenge: When ready, tackle the 3 Wave 3 files:")
for filename, info in ADVANCED_FILES.items():
    print(f"  • {filename} (Complexity: {info['complexity']}/10, {info['hours']}h)")

## 🎯 Module Summary & Next Steps

### ✅ What We Accomplished

1. **Strategic Focus**: Successfully transpiled 5 highest-priority SQL files (4.5-7.8 complexity)
2. **Risk Management**: Avoided high-complexity files to ensure manageable scope
3. **Pattern Recognition**: Identified key conversion patterns for future use
4. **Practical Skills**: Gained hands-on experience with SQL transpilation tools
5. **Business Alignment**: Connected technical work to Module 1's business case

### 📈 Business Value Delivered

- **Effort Optimization**: 48 hours (focused scope) vs 120 hours (all files)
- **Cost Savings**: $10,800 by deferring complex components
- **Risk Reduction**: Started with proven patterns before advanced challenges
- **Team Confidence**: Built skills foundation for future complex migrations
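
The effort and cost figures above follow directly from the per-file estimates. As a worked check, using the hour estimates from the assessment and the $150/hour rate that the notebook's cost calculation assumes:

```python
# Hour estimates per file, copied from the Module 1 assessment data used in this notebook
focus_hours = [4, 6, 8, 12, 18]    # Wave 1 and Wave 2 files
advanced_hours = [16, 24, 32]      # Wave 3 files (deferred)
HOURLY_RATE_USD = 150              # rate assumed by the notebook's cost figures

focus_total = sum(focus_hours)
advanced_total = sum(advanced_hours)
all_files_total = focus_total + advanced_total
deferred_cost = advanced_total * HOURLY_RATE_USD

print(f"Focused scope: {focus_total} hours")
print(f"All files:     {all_files_total} hours")
print(f"Deferred cost: ${deferred_cost:,}")
```

The cost figure covers only the deferred Wave 3 work. It is spending postponed rather than eliminated if those files are migrated later.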

### 🚀 Next Steps

1. **Immediate**: Run validation tests using `02_validation_tests.sql`
2. **Short-term**: Deploy Wave 1 files to development environment
3. **Medium-term**: Complete Module 3 (Data Reconciliation) for thorough testing
4. **Advanced**: Tackle Wave 3 complex files when team is ready

### 🎓 Key Learnings

- **Automated vs Manual**: Understand when each approach is appropriate
- **Pattern Recognition**: Common conversion patterns apply across similar SQL
- **Risk Assessment**: Higher technical complexity generally carries higher business risk, which is why the waves are ordered by complexity score
- **Incremental Approach**: Phased migration reduces overall project risk

**Ready for Module 3: Data Reconciliation** 🎉