# Data Scientist Core Workflow with MLflow

This notebook helps Data Scientists explore and prepare drone imagery data for YOLOv11 model training, with MLflow experiment tracking integrated throughout.

## Workflow Overview

1. **Data Exploration**: Analyze and visualize the drone imagery dataset
2. **Data Preparation**: Prepare data for YOLOv11 training
3. **Ground Truth Labeling**: Create labeling jobs for annotation
4. **Experiment Tracking**: Track data exploration experiments with MLflow

## Prerequisites

- AWS account with appropriate permissions
- AWS CLI configured with a named profile called `ab`
- SageMaker Studio access with Data Scientist role
- Access to the drone imagery dataset in the S3 bucket `lucaskle-ab3-project-pv`
- SageMaker managed MLflow tracking server
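
Before running the setup cell, it can help to confirm that the `ab` profile is actually defined. The following is a minimal sketch, assuming the standard AWS CLI file layout in which `~/.aws/config` uses `[profile ab]` sections and `~/.aws/credentials` uses bare `[ab]` sections. The helpers `profile_defined` and `profile_available` are hypothetical; they only parse text and do not validate the credentials themselves.

```python
import configparser
from pathlib import Path


def profile_defined(ini_text: str, profile: str) -> bool:
    """Return True if an AWS-style INI text defines the given named profile.

    Hypothetical helper: accepts both the `[profile NAME]` form used in
    ~/.aws/config and the bare `[NAME]` form used in ~/.aws/credentials.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.has_section(f"profile {profile}") or parser.has_section(profile)


def profile_available(profile: str = "ab", aws_dir: Path = Path.home() / ".aws") -> bool:
    """Check the default config and credentials files, skipping missing ones."""
    for name in ("config", "credentials"):
        path = aws_dir / name
        if path.exists() and profile_defined(path.read_text(), profile):
            return True
    return False
```

Calling `profile_available("ab")` returns `False` when neither file defines the profile, which would explain a failure in `boto3.Session(profile_name='ab')` later on.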

Let's start by installing and importing the necessary libraries and setting up our environment.

In [None]:
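# Install required packages.
# Version specifiers are quoted so the shell does not treat ">" as an output redirect;
# tqdm is listed because the next cell imports it.
!pip install --quiet "mlflow>=3.0.0" "requests-auth-aws-sigv4>=0.7" "boto3>=1.28.0" "sagemaker>=2.190.0" "pandas>=2.0.0" "matplotlib>=3.7.0" "seaborn>=0.12.0" "numpy>=1.24.0" "PyYAML>=6.0" "Pillow>=9.0.0" tqdm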

print("✅ Required packages installed successfully!")

In [None]:
import os
import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from IPython.display import display, HTML
import io
import json
from PIL import Image
import mlflow
import mlflow.sklearn
import sagemaker
import random
import re
import yaml
from collections import defaultdict

# Reliable progress bars for SageMaker Studio
from tqdm import tqdm

print("✅ All packages imported successfully with SageMaker Studio compatibility")

# Set up AWS session with "ab" profile
session = boto3.Session(profile_name='ab')
s3_client = session.client('s3')
sagemaker_client = session.client('sagemaker')
sagemaker_session = sagemaker.Session(boto_session=session)
region = session.region_name
account_id = session.client('sts').get_caller_identity()['Account']

# Set up MLFlow tracking with SageMaker managed server
try:
    # Tracking server ARN for SageMaker managed MLflow.
    # Note: set_tracking_uri only records the URI; the server is first contacted
    # when an experiment or run is created, so connection errors may surface later.
    tracking_server_arn = "arn:aws:sagemaker:us-east-1:192771711075:mlflow-tracking-server/sagemaker-core-setup-mlflow-server"
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow_tracking_uri = tracking_server_arn
    
    print("✅ MLflow tracking URI set to SageMaker managed server")
    print(f"Tracking Server ARN: {tracking_server_arn}")
    
except Exception as e:
    print(f"⚠️  Could not connect to SageMaker managed MLflow: {e}")
    print("Using basic MLflow setup as fallback")
    mlflow_tracking_uri = "file:///tmp/mlruns"
    mlflow.set_tracking_uri(mlflow_tracking_uri)

# Set up visualization
plt.rcParams["figure.figsize"] = (12, 6)

# Define bucket name
BUCKET_NAME = 'lucaskle-ab3-project-pv'

print(f"Data Bucket: {BUCKET_NAME}")
print(f"Region: {region}")
print(f"Account ID: {account_id}")
print(f"MLFlow Tracking URI: {mlflow_tracking_uri}")

# Helper functions for MLflow logging
def log_params(params_dict):
    """Log parameters using MLflow"""
    mlflow.log_params(params_dict)

def log_metrics(metrics_dict, step=None):
    """Log metrics using MLflow"""
    for key, value in metrics_dict.items():
        mlflow.log_metric(key, value, step=step)

def log_artifact(local_path, artifact_path=None):
    """Log artifact using MLflow"""
    mlflow.log_artifact(local_path, artifact_path)

print("✅ MLflow helper functions loaded")
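
The `log_metrics` helper above forwards every value as-is. Averages over an empty sample come back as NaN from `np.mean`, and such values are rarely useful as tracked metrics. One possible guard, shown here as a sketch and not as part of the original notebook, is to keep only finite real numbers before logging. The name `finite_metrics` is hypothetical.

```python
import math
from numbers import Real


def finite_metrics(metrics: dict) -> dict:
    """Return only the entries whose values are finite real numbers."""
    cleaned = {}
    for name, value in metrics.items():
        # bool is a Real subclass; exclude it so flags are not logged as 0/1 by accident.
        if isinstance(value, Real) and not isinstance(value, bool) and math.isfinite(value):
            cleaned[name] = float(value)
    return cleaned


# Illustrative input only: one NaN, one string, one flag, one infinity.
raw = {"avg_width": 251, "avg_height": float("nan"), "formats": "JPEG",
       "avg_file_size_mb": 2.5, "dataset_valid": True, "ratio": float("inf")}
cleaned = finite_metrics(raw)
print(cleaned)
```

The cleaned dictionary could then be passed to the `log_metrics` helper defined in the setup cell.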

## 1. Data Exploration with MLflow Tracking

Let's start by exploring the drone imagery dataset stored in S3 and track our exploration with MLflow.
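
The exploration cell lists objects under the raw image prefix and keeps only image files. The filtering step can be tried offline on a hand-written listing. The sketch below assumes entries shaped like the `Contents` items of a `list_objects_v2` response; `filter_image_keys` and the sample keys are made up for illustration.

```python
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tiff", ".tif")


def filter_image_keys(objects, extensions=IMAGE_EXTENSIONS):
    """Keep S3 listing entries whose key ends with an image extension.

    `objects` mirrors the dicts found under "Contents" in a
    list_objects_v2 response: each has at least a "Key" entry.
    """
    # str.endswith accepts a tuple, so one call covers every extension.
    return [obj for obj in objects if obj["Key"].lower().endswith(extensions)]


# Made-up listing for illustration only (not real bucket contents).
listing = [
    {"Key": "raw-images/"},                      # folder placeholder
    {"Key": "raw-images/class_a/IMG_0001.JPG"},  # upper-case extension
    {"Key": "raw-images/class_a/notes.txt"},
    {"Key": "raw-images/class_b/tile.tif"},
]
image_keys = [obj["Key"] for obj in filter_image_keys(listing)]
print(image_keys)
```

Lower-casing the key before the comparison is what lets `IMG_0001.JPG` through, and the folder placeholder is dropped because it has no extension at all.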

In [None]:
# Start an MLflow run for data exploration
experiment_name = "drone-imagery-data-exploration"

# Select the experiment, then start a timestamped run that stays active until section 1.3 ends it
mlflow.set_experiment(experiment_name)
mlflow_run = mlflow.start_run(run_name=f"data-exploration-{datetime.now().strftime('%Y%m%d-%H%M%S')}")

# Function to list objects in an S3 bucket
def list_s3_objects(bucket, prefix=""):
    """List all objects in an S3 bucket with the given prefix"""
    all_objects = []
    paginator = s3_client.get_paginator('list_objects_v2')
    
    # Create a PageIterator from the Paginator
    page_iterator = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix
    )
    
    # Iterate through each page
    for page in page_iterator:
        if 'Contents' in page:
            all_objects.extend(page['Contents'])
    
    return all_objects

# Function to filter image files
def filter_image_files(objects):
    """Filter image files from S3 objects list"""
    image_extensions = [".jpg", ".jpeg", ".png", ".tiff", ".tif"]
    return [obj for obj in objects 
            if any(obj['Key'].lower().endswith(ext) for ext in image_extensions)]

# List raw images in the bucket
RAW_PREFIX = "raw-images/"
raw_objects = list_s3_objects(BUCKET_NAME, prefix=RAW_PREFIX)
raw_images = filter_image_files(raw_objects)

print(f"Found {len(raw_images)} raw images in the bucket")

# Log dataset statistics to MLflow (log the same prefix that was actually listed)
mlflow.log_param("bucket_name", BUCKET_NAME)
mlflow.log_param("data_prefix", RAW_PREFIX)
mlflow.log_metric("total_images", len(raw_images))

# Display the first few image keys
if raw_images:
    print("\nSample image keys:")
    for i, img in enumerate(raw_images[:5]):
        print(f"  {i+1}. {img['Key']}")

### 1.1 Display Sample Images

Let's display some sample images from the dataset to get a visual understanding.
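
The display cell picks images from different classes, where the class is taken to be the parent directory in the S3 key. A minimal sketch of that grouping with a seeded sampler, so the same images are drawn on every run, might look as follows. The seeding is an addition for reproducibility and is not something the notebook itself does; `class_from_key` and `sample_per_class` are hypothetical names.

```python
import random
from collections import defaultdict


def class_from_key(key: str) -> str:
    """Return the parent directory name, or 'unknown' for shallow keys."""
    parts = key.split("/")
    return parts[-2] if len(parts) > 2 else "unknown"


def sample_per_class(keys, per_class=1, seed=0):
    """Pick up to `per_class` keys from each class, reproducibly."""
    groups = defaultdict(list)
    for key in keys:
        groups[class_from_key(key)].append(key)

    rng = random.Random(seed)  # local generator: leaves the global random state alone
    picked = {}
    for name in sorted(groups):  # sorted so the draw order is deterministic
        members = groups[name]
        picked[name] = rng.sample(members, min(per_class, len(members)))
    return picked


# Made-up keys for illustration only.
keys = ["raw-images/a/1.jpg", "raw-images/a/2.jpg", "raw-images/b/3.jpg", "top.jpg"]
picked = sample_per_class(keys, per_class=1, seed=42)
print(picked)
```

Logging the seed as an MLflow parameter would make a given set of sample images recoverable later.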

In [None]:
# Function to validate image quality (needed for display function)
def validate_image_quality(img, min_size=50, max_aspect_ratio=10):
    """Validate image quality and characteristics"""
    width, height = img.size
    aspect_ratio = max(width, height) / min(width, height)
    
    issues = []
    if min(width, height) < min_size:
        issues.append(f"Too small: {width}x{height}")
    if aspect_ratio > max_aspect_ratio:
        issues.append(f"Extreme aspect ratio: {aspect_ratio:.2f}")
    
    return len(issues) == 0, issues

# Function to get random samples from all classes
def get_random_samples_from_classes(image_objects, samples_per_class=1, max_total_samples=4):
    """Get random samples from different classes/directories"""
    
    # Group images by class/directory
    class_groups = {}
    for img_obj in image_objects:
        # Extract class from path (e.g., 'raw-images/Apple___Apple_scab/image.jpg' -> 'Apple___Apple_scab')
        key_parts = img_obj['Key'].split('/')
        if len(key_parts) > 2:
            class_name = key_parts[-2]  # Second to last part is class
        else:
            class_name = 'unknown'
        
        if class_name not in class_groups:
            class_groups[class_name] = []
        class_groups[class_name].append(img_obj)
    
    print(f"📊 Found {len(class_groups)} classes in dataset")
    
    # Sample from each class
    selected_samples = []
    classes_sampled = []
    
    # If we have more classes than max samples, randomly select classes
    available_classes = list(class_groups.keys())
    if len(available_classes) > max_total_samples:
        selected_classes = random.sample(available_classes, max_total_samples)
        samples_per_class = 1
    else:
        selected_classes = available_classes
        samples_per_class = min(samples_per_class, max_total_samples // len(selected_classes))
    
    for class_name in selected_classes:
        class_images = class_groups[class_name]
        sample_size = min(samples_per_class, len(class_images))
        
        # Randomly sample from this class
        sampled = random.sample(class_images, sample_size)
        selected_samples.extend(sampled)
        classes_sampled.append(f"{class_name} ({sample_size})")
        
        if len(selected_samples) >= max_total_samples:
            break
    
    # Limit to max_total_samples
    selected_samples = selected_samples[:max_total_samples]
    
    print(f"🎯 Sampled from classes: {', '.join(classes_sampled)}")
    
    return selected_samples, class_groups

# Enhanced function to download and display images with validation and progress tracking
def display_sample_images_enhanced(bucket, image_objects, num_samples=4):
    """Download and display sample images from S3 with validation, progress tracking, and class diversity"""
    
    # Get random samples from different classes
    samples, class_info = get_random_samples_from_classes(
        image_objects, 
        samples_per_class=1, 
        max_total_samples=num_samples
    )
    
    print(f"Loading {len(samples)} sample images from different classes...")
    
    # Create a figure with subplots
    fig, axes = plt.subplots(1, len(samples), figsize=(16, 4))
    
    # If only one sample, axes is not an array
    if len(samples) == 1:
        axes = [axes]
    
    # Track image loading with progress
    loaded_images = []
    failed_images = []
    
    # Download and display each image, reporting progress with plain print statements
    # (simple prints avoid notebook widget issues in SageMaker Studio)
    print("📥 Downloading and processing images...")
    for i, img_obj in enumerate(samples):
        try:
            print(f"  Processing image {i+1}/{len(samples)}: {os.path.basename(img_obj['Key'])}")
            
            # Download image from S3
            response = s3_client.get_object(Bucket=bucket, Key=img_obj['Key'])
            img_data = response['Body'].read()
            
            # Open image with PIL
            img = Image.open(io.BytesIO(img_data))
            
            # Validate image quality
            is_valid, issues = validate_image_quality(img)
            
            # Extract class name for display
            key_parts = img_obj['Key'].split('/')
            class_name = key_parts[-2] if len(key_parts) > 2 else 'unknown'
            
            # Display image
            axes[i].imshow(img)
            title = f"{class_name}\n{os.path.basename(img_obj['Key'])}"
            if not is_valid:
                title += f" ⚠️"  # Add warning for quality issues
            axes[i].set_title(title, fontsize=9)
            axes[i].axis('off')
            
            # Add image info as text
            width, height = img.size
            file_size = len(img_data) / 1024  # KB
            info_text = f"{width}×{height}\n{file_size:.1f}KB"
            if not is_valid:
                info_text += f"\nIssues: {len(issues)}"
            axes[i].text(0.02, 0.98, info_text, transform=axes[i].transAxes, 
                       fontsize=8, verticalalignment='top', 
                       bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
            
            loaded_images.append({
                'key': img_obj['Key'],
                'class': class_name,
                'size': (width, height),
                'file_size_kb': file_size,
                'valid': is_valid,
                'issues': issues
            })
            
        except Exception as e:
            print(f"  ❌ Error displaying image {img_obj['Key']}: {str(e)}")
            axes[i].text(0.5, 0.5, f"Error loading image\n{str(e)[:50]}...", 
                       ha='center', va='center', fontsize=8, wrap=True)
            axes[i].axis('off')
            failed_images.append({'key': img_obj['Key'], 'error': str(e)})
    
    plt.tight_layout()
    plt.show()
    
    # Display validation summary
    print(f"\n📊 Image Loading Summary:")
    print(f"✅ Successfully loaded: {len(loaded_images)}")
    print(f"❌ Failed to load: {len(failed_images)}")
    
    if loaded_images:
        valid_count = sum(1 for img in loaded_images if img['valid'])
        print(f"✅ Quality validation passed: {valid_count}/{len(loaded_images)}")
        
        # Show class distribution
        class_counts = {}
        for img in loaded_images:
            class_name = img['class']
            class_counts[class_name] = class_counts.get(class_name, 0) + 1
        
        print(f"🏷️  Classes represented: {', '.join([f'{cls}({cnt})' for cls, cnt in class_counts.items()])}")
        
        # Show quality issues if any
        quality_issues = [img for img in loaded_images if not img['valid']]
        if quality_issues:
            print(f"⚠️  Images with quality issues:")
            for img in quality_issues[:3]:  # Show first 3
                print(f"  • {os.path.basename(img['key'])}: {', '.join(img['issues'])}")
    
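    # Save the plot as an MLflow artifact. Use fig.savefig rather than plt.savefig:
    # after plt.show() the current pyplot figure may be a new, empty one.
    fig.savefig('sample_images.png', dpi=150, bbox_inches='tight')
    mlflow.log_artifact('sample_images.png')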
    
    # Log image loading statistics
    mlflow.log_metric("sample_images_loaded", len(loaded_images))
    mlflow.log_metric("sample_images_failed", len(failed_images))
    mlflow.log_metric("sample_classes_represented", len(set(img['class'] for img in loaded_images)))
    
    if loaded_images:
        mlflow.log_metric("sample_images_valid", sum(1 for img in loaded_images if img['valid']))
        avg_width = np.mean([img['size'][0] for img in loaded_images])
        avg_height = np.mean([img['size'][1] for img in loaded_images])
        avg_file_size = np.mean([img['file_size_kb'] for img in loaded_images])
        mlflow.log_metric("sample_avg_width", avg_width)
        mlflow.log_metric("sample_avg_height", avg_height)
        mlflow.log_metric("sample_avg_file_size_kb", avg_file_size)
        
        # Log class distribution
        class_counts = {}
        for img in loaded_images:
            class_name = img['class']
            class_counts[class_name] = class_counts.get(class_name, 0) + 1
        
        for class_name, count in class_counts.items():
            # Sanitize class name for MLflow
            sanitized_name = re.sub(r'[^a-zA-Z0-9_\-\.\s:/]', '_', str(class_name))
            mlflow.log_metric(f"sample_class_{sanitized_name}", count)
    
    return loaded_images, failed_images, class_info

# Display sample images with enhanced validation and class diversity
if raw_images:
    print("🖼️  Displaying sample images from different classes...")
    loaded_imgs, failed_imgs, class_distribution = display_sample_images_enhanced(
        BUCKET_NAME, raw_images, num_samples=4
    )
    
    if loaded_imgs:
        print(f"\n✅ Successfully displayed {len(loaded_imgs)} images from {len(set(img['class'] for img in loaded_imgs))} different classes")
    else:
        print("❌ No images could be displayed")
        
else:
    print("No images found to display")

### 1.2 Basic Image Analysis with MLflow Tracking

Let's analyze some basic characteristics of the images in our dataset and track the results.
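
The analysis cell also selects a bounding-box strategy expressed as a normalized center and size. Assuming these boxes are eventually written as YOLO-format label lines (`class_id x_center y_center width height`, all normalized to the image size), the conversion can be sketched as below. The function names and the 250×250 image size are illustrative assumptions, not part of the notebook.

```python
def yolo_label_line(class_id: int, box: dict) -> str:
    """Format one normalized box as a YOLO label line."""
    values = (box["x_center"], box["y_center"], box["width"], box["height"])
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError("YOLO box values must be normalized to [0, 1]")
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


def to_pixel_corners(box: dict, image_width: int, image_height: int):
    """Convert a normalized center box to (x_min, y_min, x_max, y_max) pixels."""
    half_w = box["width"] * image_width / 2
    half_h = box["height"] * image_height / 2
    cx = box["x_center"] * image_width
    cy = box["y_center"] * image_height
    return (round(cx - half_w), round(cy - half_h),
            round(cx + half_w), round(cy + half_h))


# Same values as the notebook's 'centered_80' strategy.
centered_80 = {"x_center": 0.5, "y_center": 0.5, "width": 0.8, "height": 0.8}
label = yolo_label_line(0, centered_80)
corners = to_pixel_corners(centered_80, 250, 250)
print(label)    # 0 0.500000 0.500000 0.800000 0.800000
print(corners)  # (25, 25, 225, 225)
```

Checking the pixel corners on a few images is a quick way to see whether a fixed centered box is a plausible stand-in for real annotations.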

In [None]:
# Continue with the current active MLFlow run

# Function to sanitize names so MLflow accepts them as metric and parameter keys
def sanitize_mlflow_name(name):
    """Sanitize names for MLflow compatibility - only alphanumerics, underscores, dashes, periods, spaces, colons, slashes"""
    # Replace problematic characters with underscores
    sanitized = re.sub(r'[^a-zA-Z0-9_\-\.\s:/]', '_', str(name))
    # Remove multiple consecutive underscores
    sanitized = re.sub(r'_+', '_', sanitized)
    # Remove leading/trailing underscores
    sanitized = sanitized.strip('_')
    return sanitized

# Validation functions (validate_image_quality is repeated from section 1.1 so this cell can run on its own)
def validate_image_quality(img, min_size=50, max_aspect_ratio=10):
    """Validate image quality and characteristics"""
    width, height = img.size
    aspect_ratio = max(width, height) / min(width, height)
    
    issues = []
    if min(width, height) < min_size:
        issues.append(f"Too small: {width}x{height}")
    if aspect_ratio > max_aspect_ratio:
        issues.append(f"Extreme aspect ratio: {aspect_ratio:.2f}")
    
    return len(issues) == 0, issues

def validate_dataset_integrity(class_distribution, min_samples_per_class=5):
    """Validate dataset has sufficient samples per class"""
    issues = []
    for class_name, count in class_distribution.items():
        if count < min_samples_per_class:
            issues.append(f"Class '{class_name}' has only {count} samples (minimum: {min_samples_per_class})")
    
    return len(issues) == 0, issues

# Function to get stratified sample by class/directory
def get_stratified_sample(image_objects, samples_per_class=10):
    """Get stratified sample from images organized by directory/class"""
    # Group images by directory (assuming directory represents class)
    class_groups = {}
    for img_obj in image_objects:
        # Extract directory/class from key (e.g., 'raw-images/class1/image.jpg' -> 'class1')
        key_parts = img_obj['Key'].split('/')
        if len(key_parts) > 2:  # Has subdirectory structure
            class_name = key_parts[-2]  # Second to last part is class
        else:
            class_name = 'unknown'  # Flat structure
        
        if class_name not in class_groups:
            class_groups[class_name] = []
        class_groups[class_name].append(img_obj)
    
    # Sample from each class
    stratified_sample = []
    print(f"Found {len(class_groups)} classes/directories:")
    for class_name, class_images in class_groups.items():
        sample_size = min(samples_per_class, len(class_images))
        # Use random sampling instead of just taking first N
        sampled = random.sample(class_images, sample_size)
        stratified_sample.extend(sampled)
        print(f"  '{class_name}': {len(class_images):,} images → sampling {sample_size}")
    
    return stratified_sample, class_groups

# Bounding box strategies
def get_bounding_box_strategy(strategy='centered_80', image_width=None, image_height=None):
    """Generate different bounding box strategies for object detection"""
    strategies = {
        'centered_80': {
            'x_center': 0.5, 'y_center': 0.5, 'width': 0.8, 'height': 0.8,
            'description': '80% centered box (default)'
        },
        'full_image': {
            'x_center': 0.5, 'y_center': 0.5, 'width': 0.95, 'height': 0.95,
            'description': '95% full image coverage'
        },
        'conservative': {
            'x_center': 0.5, 'y_center': 0.5, 'width': 0.6, 'height': 0.6,
            'description': '60% conservative box'
        },
        'adaptive': {
            'x_center': 0.5, 'y_center': 0.5, 
            'width': 0.9 if image_width and image_width < 300 else 0.8,
            'height': 0.9 if image_height and image_height < 300 else 0.8,
            'description': 'Adaptive based on image size'
        }
    }
    
    return strategies.get(strategy, strategies['centered_80'])

# Function to analyze image characteristics (optimized with progress bars and validation)
def analyze_images_optimized(bucket, image_objects, max_total_samples=200, bbox_strategy='centered_80'):
    """Analyze basic characteristics of images with optimized sampling, validation, and progress tracking"""
    
    # First, get a quick peek at class structure
    sample_for_classes = image_objects[:min(100, len(image_objects))]
    unique_classes = set()
    for obj in sample_for_classes:
        key_parts = obj['Key'].split('/')
        class_name = key_parts[-2] if len(key_parts) > 2 else 'unknown'
        unique_classes.add(class_name)
    
    # Calculate samples per class
    samples_per_class = max(5, max_total_samples // len(unique_classes)) if unique_classes else max_total_samples
    
    # Get stratified sample
    samples, class_info = get_stratified_sample(image_objects, samples_per_class=samples_per_class)
    
    # Initialize lists to store image characteristics
    widths = []
    heights = []
    aspect_ratios = []
    file_sizes = []
    formats = []
    class_distribution = {}
    quality_issues = []
    
    print(f"\nAnalyzing {len(samples)} strategically sampled images with validation...")
    
    # Progress bar for image analysis
    with tqdm(total=len(samples), desc="Processing images", unit="img") as pbar:
        for i, img_obj in enumerate(samples):
            try:
                # Get class from path
                key_parts = img_obj['Key'].split('/')
                class_name = key_parts[-2] if len(key_parts) > 2 else 'unknown'
                class_distribution[class_name] = class_distribution.get(class_name, 0) + 1
                
                # Download image from S3
                response = s3_client.get_object(Bucket=bucket, Key=img_obj['Key'])
                img_data = response['Body'].read()
                
                # Get file size
                file_size = len(img_data) / (1024 * 1024)  # Convert to MB
                file_sizes.append(file_size)
                
                # Open image with PIL
                img = Image.open(io.BytesIO(img_data))
                
                # Get image dimensions
                width, height = img.size
                widths.append(width)
                heights.append(height)
                
                # Calculate aspect ratio
                aspect_ratio = width / height
                aspect_ratios.append(aspect_ratio)
                
                # Get image format
                formats.append(img.format)
                
                # Validate image quality
                is_valid, issues = validate_image_quality(img)
                if not is_valid:
                    quality_issues.extend([f"{img_obj['Key']}: {issue}" for issue in issues])
                
                # Update progress bar
                pbar.set_postfix({
                    'Class': class_name[:10] + '...' if len(class_name) > 10 else class_name,
                    'Size': f"{width}x{height}",
                    'Issues': len(quality_issues)
                })
                pbar.update(1)
                
            except Exception as e:
                quality_issues.append(f"Error processing {img_obj['Key']}: {str(e)}")
                pbar.update(1)
    
    # Validate dataset integrity
    dataset_valid, dataset_issues = validate_dataset_integrity(class_distribution)
    
    # Get bounding box strategy info (pass None when no image was read, avoiding NaN means)
    bbox_info = get_bounding_box_strategy(
        bbox_strategy,
        np.mean(widths) if widths else None,
        np.mean(heights) if heights else None
    )
    
    # Calculate statistics
    stats = {
        'count': len(widths),
        'total_dataset_size': len(image_objects),
        'classes_found': len(class_info),
        'class_distribution': class_distribution,
        'avg_width': np.mean(widths) if widths else 0,
        'avg_height': np.mean(heights) if heights else 0,
        'min_width': min(widths) if widths else 0,
        'max_width': max(widths) if widths else 0,
        'min_height': min(heights) if heights else 0,
        'max_height': max(heights) if heights else 0,
        'avg_aspect_ratio': np.mean(aspect_ratios) if aspect_ratios else 0,
        'avg_file_size': np.mean(file_sizes) if file_sizes else 0,
        'formats': list(set(formats)) if formats else [],
        'quality_issues': quality_issues,
        'dataset_valid': dataset_valid,
        'dataset_issues': dataset_issues,
        'bbox_strategy': bbox_info
    }
    
    return {
        'stats': stats,
        'widths': widths,
        'heights': heights,
        'aspect_ratios': aspect_ratios,
        'file_sizes': file_sizes,
        'formats': formats,
        'class_info': class_info
    }

# Analyze sample images with optimized approach
if raw_images:
    print(f"Found {len(raw_images):,} total images in dataset")
    
    # Choose bounding box strategy based on image characteristics.
    # The images in this dataset are small (about 251×249 pixels on average), so use the adaptive strategy.
    bbox_strategy = 'adaptive'
    
    analysis_results = analyze_images_optimized(
        BUCKET_NAME, 
        raw_images, 
        max_total_samples=200,
        bbox_strategy=bbox_strategy
    )
    
    # Display statistics
    stats = analysis_results['stats']
    print(f"\n{'='*60}")
    print("DATASET ANALYSIS SUMMARY")
    print(f"{'='*60}")
    print(f"Total dataset size: {stats['total_dataset_size']:,} images")
    print(f"Sample analyzed: {stats['count']} images ({stats['count']/stats['total_dataset_size']*100:.1f}%)")
    print(f"Classes/directories found: {stats['classes_found']}")
    
    # Validation results
    print(f"\n📋 VALIDATION RESULTS:")
    print(f"Dataset integrity: {'✅ VALID' if stats['dataset_valid'] else '⚠️  ISSUES FOUND'}")
    if stats['dataset_issues']:
        for issue in stats['dataset_issues']:
            print(f"  • {issue}")
    
    if stats['quality_issues']:
        print(f"Image quality issues: {len(stats['quality_issues'])} found")
        for issue in stats['quality_issues'][:5]:  # Show first 5
            print(f"  • {issue}")
        if len(stats['quality_issues']) > 5:
            print(f"  ... and {len(stats['quality_issues']) - 5} more")
    else:
        print("Image quality: ✅ All samples passed validation")
    
    print(f"\nClass Distribution in Sample:")
    for class_name, count in sorted(stats['class_distribution'].items()):
        total_in_class = len(analysis_results['class_info'][class_name])
        percentage = (count / stats['count']) * 100
        print(f"  {class_name}: {count} samples ({percentage:.1f}%) from {total_in_class:,} total images")
    
    print(f"\nImage Characteristics:")
    print(f"  Average dimensions: {stats['avg_width']:.0f}×{stats['avg_height']:.0f} pixels")
    print(f"  Dimension range: {stats['min_width']}×{stats['min_height']} to {stats['max_width']}×{stats['max_height']} pixels")
    print(f"  Average aspect ratio: {stats['avg_aspect_ratio']:.2f}")
    print(f"  Average file size: {stats['avg_file_size']:.2f} MB")
    print(f"  Image formats: {', '.join(stats['formats'])}")
    
    # Bounding box strategy info
    bbox_info = stats['bbox_strategy']
    print(f"\n🎯 BOUNDING BOX STRATEGY: {bbox_strategy}")
    print(f"  Strategy: {bbox_info['description']}")
    print(f"  Box dimensions: {bbox_info['width']*100:.0f}% × {bbox_info['height']*100:.0f}%")
    print(f"  Center position: ({bbox_info['x_center']}, {bbox_info['y_center']})")
    
    # Log comprehensive statistics to MLflow (class names are sanitized first)
    mlflow.log_param("total_dataset_size", stats['total_dataset_size'])
    mlflow.log_param("sample_size", stats['count'])
    mlflow.log_param("sampling_percentage", f"{stats['count']/stats['total_dataset_size']*100:.1f}%")
    mlflow.log_param("classes_found", stats['classes_found'])
    mlflow.log_param("class_names", [sanitize_mlflow_name(name) for name in stats['class_distribution'].keys()])
    mlflow.log_param("dataset_valid", stats['dataset_valid'])
    mlflow.log_param("quality_issues_count", len(stats['quality_issues']))
    mlflow.log_param("bbox_strategy", bbox_strategy)
    mlflow.log_param("bbox_description", bbox_info['description'])
    
    # Log image characteristics
    mlflow.log_metric("avg_width", stats['avg_width'])
    mlflow.log_metric("avg_height", stats['avg_height'])
    mlflow.log_metric("min_width", stats['min_width'])
    mlflow.log_metric("max_width", stats['max_width'])
    mlflow.log_metric("min_height", stats['min_height'])
    mlflow.log_metric("max_height", stats['max_height'])
    mlflow.log_metric("avg_aspect_ratio", stats['avg_aspect_ratio'])
    mlflow.log_metric("avg_file_size_mb", stats['avg_file_size'])
    mlflow.log_param("image_formats", ", ".join(stats['formats']))
    
    # Log class distribution and dataset composition (metric names must be sanitized for MLflow)
    for class_name, count in stats['class_distribution'].items():
        sanitized_name = sanitize_mlflow_name(class_name)
        mlflow.log_metric(f"sample_count_{sanitized_name}", count)
        total_in_class = len(analysis_results['class_info'][class_name])
        mlflow.log_metric(f"total_count_{sanitized_name}", total_in_class)
    
    # Log bounding box strategy parameters
    mlflow.log_metric("bbox_width_ratio", bbox_info['width'])
    mlflow.log_metric("bbox_height_ratio", bbox_info['height'])
    mlflow.log_metric("bbox_x_center", bbox_info['x_center'])
    mlflow.log_metric("bbox_y_center", bbox_info['y_center'])
    
    print(f"\n✅ Analysis complete and logged to MLFlow experiment")
    print(f"📊 Analyzed {stats['count']} representative samples from {stats['total_dataset_size']:,} total images")
    print(f"🎯 Using {bbox_strategy} bounding box strategy optimized for your image dimensions")
    
else:
    print("No images found to analyze")

### 1.3 Visualize Image Characteristics

Let's create some visualizations to better understand our dataset and save them as MLflow artifacts.
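
Before plotting, the same distributions can be summarized numerically. The sketch below uses only the standard library and made-up values. In the notebook the inputs would come from `analysis_results['formats']` and `analysis_results['aspect_ratios']`.

```python
from collections import Counter
from statistics import mean, median

# Illustrative values only; not taken from the real dataset.
formats = ["JPEG", "JPEG", "PNG", "JPEG"]
aspect_ratios = [1.0, 1.33, 1.0, 0.75]

# Counter replaces the manual "if key in dict" counting loop.
format_counts = Counter(formats)
print(format_counts.most_common())

summary = {
    "mean": mean(aspect_ratios),
    "median": median(aspect_ratios),
    "min": min(aspect_ratios),
    "max": max(aspect_ratios),
}
print(summary)
```

A large gap between the mean and the median is a hint that a histogram with more than ten bins may be worth a look.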

In [None]:
# Visualize image characteristics
if 'analysis_results' in locals() and analysis_results['stats']['count'] > 0:
    # Create a figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Plot image dimensions
    axes[0, 0].scatter(analysis_results['widths'], analysis_results['heights'])
    axes[0, 0].set_xlabel('Width (pixels)')
    axes[0, 0].set_ylabel('Height (pixels)')
    axes[0, 0].set_title('Image Dimensions')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Plot aspect ratio distribution
    axes[0, 1].hist(analysis_results['aspect_ratios'], bins=10)
    axes[0, 1].set_xlabel('Aspect Ratio (width/height)')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].set_title('Aspect Ratio Distribution')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Plot file size distribution
    axes[1, 0].hist(analysis_results['file_sizes'], bins=10)
    axes[1, 0].set_xlabel('File Size (MB)')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].set_title('File Size Distribution')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Plot format distribution
    format_counts = {}
    for fmt in analysis_results['formats']:
        if fmt in format_counts:
            format_counts[fmt] += 1
        else:
            format_counts[fmt] = 1
    
    formats = list(format_counts.keys())
    counts = list(format_counts.values())
    
    axes[1, 1].bar(formats, counts)
    axes[1, 1].set_xlabel('Image Format')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].set_title('Image Format Distribution')
    
    plt.tight_layout()
    plt.show()
    
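    # Save the visualization as an MLflow artifact. Use fig.savefig rather than plt.savefig:
    # after plt.show() the current pyplot figure may be a new, empty one.
    fig.savefig('image_analysis.png', dpi=150, bbox_inches='tight')
    mlflow.log_artifact('image_analysis.png')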
    
else:
    print("No analysis results available for visualization")
    
mlflow.end_run()

## 2. Data Preparation for YOLOv11 Training

Now let's prepare our data for YOLOv11 training and track the preparation process.
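
The preparation cell groups labeled-data objects by labeling job and by file type. A compact sketch of that bookkeeping is shown below, assuming keys shaped like `labeled-data/<job>/.../<file>`. The job names and keys are invented for illustration, and `file_category` and `summarize_jobs` are hypothetical helpers.

```python
from collections import Counter, defaultdict


def file_category(key: str) -> str:
    """Map an object key to 'json', 'txt', or 'other' by its extension."""
    name = key.rsplit("/", 1)[-1]
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if ext in ("json", "jsonl"):
        return "json"
    return "txt" if ext == "txt" else "other"


def summarize_jobs(keys):
    """Count file categories per job for keys shaped like prefix/job/.../file."""
    jobs = defaultdict(Counter)
    for key in keys:
        parts = key.split("/")
        if len(parts) < 3 or not parts[-1]:
            continue  # too shallow, or a folder placeholder ending in '/'
        jobs[parts[1]][file_category(key)] += 1
    return {job: dict(counts) for job, counts in jobs.items()}


# Made-up keys for illustration only.
keys = [
    "labeled-data/job-1/manifests/output.manifest",
    "labeled-data/job-1/annotations/a.json",
    "labeled-data/job-1/annotations/b.jsonl",
    "labeled-data/job-2/labels/img.txt",
    "labeled-data/readme.txt",
    "labeled-data/job-2/",
]
summary = summarize_jobs(keys)
print(summary)
```

Taking the extension from the file name instead of the whole key avoids misreading a dot that appears in a directory name.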

In [None]:
# Start a new MLflow run for data preparation (the exploration run was ended in section 1.3)
mlflow_run = mlflow.start_run(run_name=f"data-preparation-{datetime.now().strftime('%Y%m%d-%H%M%S')}")

# Enhanced function to check if labeled data exists with validation and progress tracking
def check_labeled_data_enhanced(bucket, prefix="labeled-data/"):
    """Check if labeled data exists in the bucket with comprehensive validation"""
    
    print(f"🔍 Scanning for labeled data in s3://{bucket}/{prefix}...")
    
    # Get all objects with progress tracking
    objects = list_s3_objects(bucket, prefix=prefix)
    
    if not objects:
        print("❌ No labeled data found")
        mlflow.log_metric("labeled_jobs_count", 0)
        mlflow.log_metric("labeled_files_count", 0)
        mlflow.log_param("labeled_data_status", "not_found")
        return {}
    
    print(f"📁 Found {len(objects)} objects in labeled data directory")
    
    # Enhanced validation and grouping with progress tracking
    jobs = {}
    valid_files = []
    invalid_files = []
    file_types = {'json': 0, 'txt': 0, 'other': 0}
    
    print("📊 Analyzing labeled data structure...")
    with tqdm(total=len(objects), desc="Analyzing files", unit="file") as pbar:
        for obj in objects:
            try:
                key = obj['Key']
                parts = key.split('/')
                
                # Validate file structure
                if len(parts) < 3:
                    invalid_files.append({'key': key, 'issue': 'Invalid directory structure'})
                    pbar.update(1)
                    continue
                
                # Extract job name (second path component; len(parts) >= 3 is guaranteed above)
                job_name = parts[1]
                
                # Classify the file by the extension of its basename, so dots in
                # directory names are ignored and .jsonl counts as JSON
                file_ext = os.path.splitext(parts[-1])[1].lower().lstrip('.')
                if file_ext in ('json', 'jsonl'):
                    file_category = 'json'
                elif file_ext == 'txt':
                    file_category = 'txt'
                else:
                    file_category = 'other'
                file_types[file_category] += 1
                
                # Group by job
                if job_name not in jobs:
                    jobs[job_name] = {
                        'files': [],
                        'json_files': 0,
                        'txt_files': 0,
                        'other_files': 0,
                        'total_size_mb': 0
                    }
                
                jobs[job_name]['files'].append(key)
                # Increment the per-job counter that matches the global category
                jobs[job_name][f"{file_category}_files"] += 1
                jobs[job_name]['total_size_mb'] += obj.get('Size', 0) / (1024 * 1024)
                
                valid_files.append(key)
                
                pbar.set_postfix({
                    'Jobs': len(jobs),
                    'Valid': len(valid_files),
                    'Invalid': len(invalid_files)
                })
                pbar.update(1)
                
            except Exception as e:
                invalid_files.append({'key': obj.get('Key', 'unknown'), 'issue': str(e)})
                pbar.update(1)
    
    # Display comprehensive analysis
    print(f"\n📋 Labeled Data Analysis Summary:")
    print(f"✅ Valid files: {len(valid_files)}")
    print(f"❌ Invalid files: {len(invalid_files)}")
    print(f"📊 File types: JSON: {file_types['json']}, TXT: {file_types['txt']}, Other: {file_types['other']}")
    
    if jobs:
        print(f"\n🏷️  Found {len(jobs)} labeling jobs:")
        for job_name, job_info in jobs.items():
            total_size_mb = job_info['total_size_mb']
            print(f"  📁 {job_name}:")
            print(f"    • Files: {len(job_info['files'])}")
            print(f"    • JSON: {job_info.get('json_files', 0)}, TXT: {job_info.get('txt_files', 0)}")
            print(f"    • Size: {total_size_mb:.2f} MB")
        
        # Enhanced MLFlow logging
        mlflow.log_metric("labeled_jobs_count", len(jobs))
        mlflow.log_metric("labeled_files_count", len(valid_files))
        mlflow.log_metric("labeled_invalid_files", len(invalid_files))
        mlflow.log_metric("labeled_json_files", file_types['json'])
        mlflow.log_metric("labeled_txt_files", file_types['txt'])
        mlflow.log_param("labeled_data_status", "found")
        
        # Log job details
        for job_name, job_info in jobs.items():
            sanitized_job_name = sanitize_mlflow_name(job_name)
            mlflow.log_metric(f"job_{sanitized_job_name}_files", len(job_info['files']))
            mlflow.log_metric(f"job_{sanitized_job_name}_size_mb", job_info['total_size_mb'])
    
    # Show validation issues if any
    if invalid_files:
        print(f"\n⚠️  Invalid files found:")
        for invalid in invalid_files[:5]:  # Show first 5
            print(f"  • {invalid['key']}: {invalid['issue']}")
        if len(invalid_files) > 5:
            print(f"  ... and {len(invalid_files) - 5} more")
    
    return {
        'jobs': jobs,
        'valid_files': valid_files,
        'invalid_files': invalid_files,
        'file_types': file_types,
        'total_size_mb': sum(job['total_size_mb'] for job in jobs.values())
    }

# Check for labeled data with enhanced validation
print("🔍 Checking for existing labeled data...")
labeled_data_analysis = check_labeled_data_enhanced(BUCKET_NAME)

# Store results for later use
if labeled_data_analysis and labeled_data_analysis.get('jobs'):
    labeled_jobs = labeled_data_analysis['jobs']
    print(f"✅ Found labeled data from {len(labeled_jobs)} jobs")
else:
    labeled_jobs = {}
    print("ℹ️  No existing labeled data found - will proceed with data preparation")

### 2.1 Prepare Data Structure for YOLOv11

YOLOv11 requires a specific data structure. Let's prepare our data accordingly.
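
Once the `images/` and `labels/` folders exist, each image still has to be assigned to a split. The sketch below is one possible way to do that deterministically, by hashing the S3 key. The helper names `assign_split` and `yolo_target_keys`, the 80/20 split, and the example key are illustrative assumptions and are not part of this notebook.

```python
import hashlib
import posixpath


def assign_split(image_key, val_fraction=0.2):
    """Deterministically map an image key to 'train' or 'val' (hypothetical helper)."""
    digest = hashlib.sha256(image_key.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket_value = int(digest[:8], 16) / 0x100000000
    return "val" if bucket_value < val_fraction else "train"


def yolo_target_keys(image_key, base_prefix, val_fraction=0.2):
    """Return (image_dest, label_dest) keys inside the train/val layout."""
    split = assign_split(image_key, val_fraction)
    filename = posixpath.basename(image_key)
    stem, _ = posixpath.splitext(filename)
    # The label file shares the image's stem and lives in the sibling labels/ folder.
    image_dest = f"{base_prefix}{split}/images/{filename}"
    label_dest = f"{base_prefix}{split}/labels/{stem}.txt"
    return image_dest, label_dest


image_dest, label_dest = yolo_target_keys(
    "raw-images/flight_001.jpg", "datasets/yolov11_dataset_demo/"
)
print(image_dest)
print(label_dest)
```

Because the split depends only on the key, re-running the notebook keeps every image in the same split, which helps avoid leakage between training and validation sets.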

In [None]:
# Enhanced function to create YOLO dataset structure with validation and multiple strategies
def prepare_yolo_structure_enhanced(bucket, job_name=None, structure_strategy='standard'):
    """Prepare YOLO dataset structure in S3 with enhanced validation and multiple strategies"""
    
    print(f"🏗️  Preparing YOLO dataset structure...")
    print(f"📦 Strategy: {structure_strategy}")
    
    # Define dataset structure with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dataset_name = f"yolov11_dataset_{timestamp}"
    
    # Multiple structure strategies
    structure_strategies = {
        'standard': {
            'description': 'Standard YOLOv11 structure with train/val split',
            'splits': ['train', 'val'],
            'subdirs': ['images', 'labels']
        },
        'extended': {
            'description': 'Extended structure with test split',
            'splits': ['train', 'val', 'test'],
            'subdirs': ['images', 'labels']
        },
        'hierarchical': {
            'description': 'Hierarchical structure with class subdirectories',
            'splits': ['train', 'val'],
            'subdirs': ['images', 'labels'],
            'class_subdirs': True
        }
    }
    
    strategy_config = structure_strategies.get(structure_strategy, structure_strategies['standard'])
    print(f"📋 Using strategy: {strategy_config['description']}")
    
    # Define base structure
    base_prefix = f"datasets/{dataset_name}/"
    
    # Validate S3 bucket accessibility
    print("🔍 Validating S3 bucket accessibility...")
    try:
        s3_client.head_bucket(Bucket=bucket)
        print(f"✅ S3 bucket '{bucket}' is accessible")
    except Exception as e:
        print(f"❌ S3 bucket validation failed: {e}")
        mlflow.log_param("structure_creation_error", str(e))
        return None
    
    # Create directory structure with progress tracking
    directories_created = []
    creation_errors = []
    
    total_dirs = len(strategy_config['splits']) * len(strategy_config['subdirs'])
    print(f"📁 Creating {total_dirs} directories...")
    
    with tqdm(total=total_dirs, desc="Creating directories", unit="dir") as pbar:
        for split in strategy_config['splits']:
            split_prefix = f"{base_prefix}{split}/"
            
            for subdir in strategy_config['subdirs']:
                try:
                    full_prefix = f"{split_prefix}{subdir}/"
                    
                    # Create directory marker object
                    s3_client.put_object(
                        Bucket=bucket, 
                        Key=full_prefix,
                        Body='',
                        ContentType='application/x-directory'
                    )
                    
                    directories_created.append(full_prefix)
                    
                    pbar.set_postfix({
                        'Created': len(directories_created),
                        'Errors': len(creation_errors)
                    })
                    pbar.update(1)
                    
                except Exception as e:
                    error_msg = f"Failed to create {full_prefix}: {str(e)}"
                    creation_errors.append(error_msg)
                    print(f"❌ {error_msg}")
                    pbar.update(1)
    
    # Validation of created structure
    print(f"\n📊 Structure Creation Summary:")
    print(f"✅ Directories created: {len(directories_created)}")
    print(f"❌ Creation errors: {len(creation_errors)}")
    
    if creation_errors:
        print(f"\n⚠️  Creation errors:")
        for error in creation_errors:
            print(f"  • {error}")
    
    # Display the created structure
    print(f"\n📁 Created YOLO dataset structure at s3://{bucket}/{base_prefix}")
    print(f"\n🌳 Directory structure:")
    print(f"s3://{bucket}/{base_prefix}")
    
    for split in strategy_config['splits']:
        print(f"├── {split}/")
        for i, subdir in enumerate(strategy_config['subdirs']):
            connector = "└──" if i == len(strategy_config['subdirs']) - 1 else "├──"
            print(f"│   {connector} {subdir}/")
    
    # Create dataset configuration metadata
    dataset_config = {
        'dataset_name': dataset_name,
        'base_prefix': base_prefix,
        'structure_strategy': structure_strategy,
        'splits': strategy_config['splits'],
        'subdirs': strategy_config['subdirs'],
        'created_at': datetime.now().isoformat(),
        'directories_created': len(directories_created),
        'creation_errors': len(creation_errors),
        'bucket': bucket
    }
    
    # Create split-specific prefixes for easy access
    split_prefixes = {}
    for split in strategy_config['splits']:
        split_prefixes[f'{split}_prefix'] = f"{base_prefix}{split}/"
    
    # Enhanced MLFlow logging
    mlflow.log_param("dataset_name", dataset_name)
    mlflow.log_param("dataset_s3_path", f"s3://{bucket}/{base_prefix}")
    mlflow.log_param("structure_strategy", structure_strategy)
    mlflow.log_param("structure_description", strategy_config['description'])
    mlflow.log_metric("directories_created", len(directories_created))
    mlflow.log_metric("creation_errors", len(creation_errors))
    mlflow.log_param("splits", ', '.join(strategy_config['splits']))
    mlflow.log_param("subdirs", ', '.join(strategy_config['subdirs']))
    
    # Log split prefixes
    for split in strategy_config['splits']:
        mlflow.log_param(f"{split}_prefix", f"{base_prefix}{split}/")
    
    # Save dataset configuration as artifact
    config_filename = 'dataset_structure_config.json'
    with open(config_filename, 'w') as f:
        json.dump(dataset_config, f, indent=2)
    mlflow.log_artifact(config_filename)
    
    # Validate structure integrity
    print(f"\n🔍 Validating created structure...")
    validation_result = validate_yolo_structure(bucket, base_prefix, strategy_config)
    
    if validation_result['valid']:
        print(f"✅ Structure validation passed")
        mlflow.log_param("structure_validation", "passed")
    else:
        print(f"⚠️  Structure validation issues found:")
        for issue in validation_result['issues']:
            print(f"  • {issue}")
        mlflow.log_param("structure_validation", "issues_found")
        mlflow.log_param("validation_issues", ', '.join(validation_result['issues']))
    
    result = {
        'dataset_name': dataset_name,
        'base_prefix': base_prefix,
        'structure_strategy': structure_strategy,
        'directories_created': len(directories_created),
        'creation_errors': creation_errors,
        'validation': validation_result,
        **split_prefixes
    }
    
    return result

def validate_yolo_structure(bucket, base_prefix, strategy_config):
    """Validate the created YOLO structure"""
    issues = []
    
    try:
        # Check if all expected directories exist
        for split in strategy_config['splits']:
            for subdir in strategy_config['subdirs']:
                expected_prefix = f"{base_prefix}{split}/{subdir}/"
                
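                try:
                    # S3 has no real directories: list_objects_v2 succeeds even when
                    # nothing matches, so confirm the marker object was actually returned
                    response = s3_client.list_objects_v2(
                        Bucket=bucket,
                        Prefix=expected_prefix,
                        MaxKeys=1
                    )
                    if response.get('KeyCount', 0) == 0:
                        issues.append(f"Directory {expected_prefix} not found")
                except Exception as e:
                    issues.append(f"Directory {expected_prefix} not accessible: {str(e)}")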
        
        return {
            'valid': len(issues) == 0,
            'issues': issues
        }
        
    except Exception as e:
        return {
            'valid': False,
            'issues': [f"Structure validation failed: {str(e)}"]
        }

# Create enhanced YOLO dataset structure
print("🏗️  Creating YOLO dataset structure...")

# Choose structure strategy based on requirements
# For drone detection with potential class hierarchies, use 'standard' for simplicity
structure_strategy = 'standard'

yolo_structure = prepare_yolo_structure_enhanced(
    BUCKET_NAME, 
    structure_strategy=structure_strategy
)

if yolo_structure:
    print(f"\n🎉 YOLO dataset structure created successfully!")
    print(f"📦 Dataset name: {yolo_structure['dataset_name']}")
    print(f"📁 Base path: s3://{BUCKET_NAME}/{yolo_structure['base_prefix']}")
    print(f"🏗️  Strategy: {yolo_structure['structure_strategy']}")
    print(f"📊 Directories created: {yolo_structure['directories_created']}")
    
    if yolo_structure['validation']['valid']:
        print(f"✅ Structure validation: PASSED")
    else:
        print(f"⚠️  Structure validation: ISSUES FOUND")
        print("💡 Check the validation issues above")
    
    # Store for use in subsequent cells
    dataset_name = yolo_structure['dataset_name']
    base_prefix = yolo_structure['base_prefix']
    
else:
    print("❌ Failed to create YOLO dataset structure")
    print("💡 Check the error messages above for troubleshooting")

mlflow.end_run()

## 3. Ground Truth Labeling Job Creation

Now let's create a SageMaker Ground Truth labeling job to annotate our drone imagery for object detection.

### 3.1 Configure Labeling Job Parameters

Let's configure the parameters for our Ground Truth labeling job.
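
MLFlow parameters are stored as strings and cannot be overwritten with a different value within the same run, so it can help to flatten a configuration dictionary into uniquely prefixed string parameters before logging. The sketch below shows one way to do this. The helper name `flatten_params` and the `labeling_config_` prefix are illustrative assumptions.

```python
def flatten_params(config, prefix="labeling_config_"):
    """Convert a config dict into string-valued params (hypothetical helper)."""
    params = {}
    for key, value in config.items():
        if isinstance(value, (list, tuple)):
            # Lists become a single comma-separated string.
            params[f"{prefix}{key}"] = ", ".join(str(item) for item in value)
        elif isinstance(value, (str, int, float, bool)):
            params[f"{prefix}{key}"] = str(value)
        # Other types (nested dicts, None) are skipped in this sketch.
    return params


example_params = flatten_params(
    {"task_type": "BoundingBox", "labels": ["drone", "vehicle"], "max_budget_usd": 50.0}
)
for name, value in example_params.items():
    print(f"{name} = {value}")
```

The resulting dictionary could then be passed to `mlflow.log_params` in a single call.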

In [None]:
# Start a new MLFlow run for labeling job creation (the data preparation run was ended above)
mlflow_run = mlflow.start_run(run_name=f"ground-truth-labeling-{datetime.now().strftime('%Y%m%d-%H%M%S')}")

# Configure labeling job parameters
labeling_job_config = {
    'job_name': f"drone-detection-labeling-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    'input_s3_path': f"s3://{BUCKET_NAME}/raw-images/",
    'output_s3_path': f"s3://{BUCKET_NAME}/labeled-data/",
    'task_type': 'BoundingBox',
    'labels': ['drone', 'vehicle', 'person', 'building'],
    'instructions': 'Please draw bounding boxes around all drones and other objects visible in the image.',
    'max_budget_usd': 50.00,
    'workforce_type': 'private'  # or 'public' for Mechanical Turk
}

# Display configuration
print("Labeling Job Configuration:")
for key, value in labeling_job_config.items():
    print(f"  {key}: {value}")

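# Log configuration to MLFlow. The "labeling_config_" prefix keeps these keys distinct from
# the "labeling_job_name" param logged when the job is created later in this run; MLFlow
# params cannot be overwritten with a different value within a run.
for key, value in labeling_job_config.items():
    if isinstance(value, (str, int, float)):
        mlflow.log_param(f"labeling_config_{key}", value)
    elif isinstance(value, list):
        mlflow.log_param(f"labeling_config_{key}", ", ".join(value))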

### 3.2 Create Input Manifest for Ground Truth

Ground Truth requires an input manifest file that lists all images to be labeled.
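
The manifest is a JSON Lines file: one JSON object per line, each with a `source-ref` key pointing at an image in S3. A minimal sketch of that format, using a placeholder bucket name and a hypothetical helper `build_manifest_lines`:

```python
import json


def build_manifest_lines(bucket, image_keys):
    """Return JSON Lines manifest text with one 'source-ref' object per image."""
    lines = [
        json.dumps({"source-ref": f"s3://{bucket}/{key}"})
        for key in image_keys
    ]
    # One object per line; the file as a whole is not a JSON array.
    return "\n".join(lines)


manifest_text = build_manifest_lines(
    "example-bucket", ["raw-images/a.jpg", "raw-images/b.png"]
)
print(manifest_text)
```

Each line must parse on its own, which is why the entries are joined with newlines instead of being serialized as a single list.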

In [None]:
# Enhanced function to create input manifest for Ground Truth with validation and progress tracking
def create_input_manifest_enhanced(bucket, image_objects, output_key="input-manifest.json", max_images=50):
    """Create input manifest file for Ground Truth labeling job with validation and progress tracking"""
    
    # Limit images for demo (slicing already handles lists shorter than max_images)
    selected_images = image_objects[:max_images]
    
    print(f"Creating input manifest for {len(selected_images)} images...")
    
    # Create manifest entries with validation
    manifest_entries = []
    valid_images = []
    invalid_images = []
    
    with tqdm(total=len(selected_images), desc="Validating images", unit="img") as pbar:
        for img_obj in selected_images:
            try:
                s3_uri = f"s3://{bucket}/{img_obj['Key']}"
                
                # Basic validation - check if image exists and is accessible
                try:
                    # Quick head request to validate accessibility
                    s3_client.head_object(Bucket=bucket, Key=img_obj['Key'])
                    
                    # Validate file extension
                    valid_extensions = ['.jpg', '.jpeg', '.png', '.tiff', '.tif']
                    if not any(img_obj['Key'].lower().endswith(ext) for ext in valid_extensions):
                        raise ValueError(f"Invalid file extension: {img_obj['Key']}")
                    
                    # Create manifest entry
                    manifest_entry = {
                        "source-ref": s3_uri
                    }
                    manifest_entries.append(manifest_entry)
                    valid_images.append(img_obj['Key'])
                    
                except Exception as validation_error:
                    invalid_images.append({
                        'key': img_obj['Key'], 
                        'error': str(validation_error)
                    })
                
                pbar.set_postfix({
                    'Valid': len(valid_images), 
                    'Invalid': len(invalid_images)
                })
                pbar.update(1)
                
            except Exception as e:
                invalid_images.append({
                    'key': img_obj['Key'], 
                    'error': str(e)
                })
                pbar.update(1)
    
    # Validation summary
    print(f"\n📋 Manifest Validation Summary:")
    print(f"✅ Valid images: {len(valid_images)}")
    print(f"❌ Invalid images: {len(invalid_images)}")
    
    if invalid_images:
        print(f"\n⚠️  Invalid images found:")
        for invalid in invalid_images[:5]:  # Show first 5
            print(f"  • {os.path.basename(invalid['key'])}: {invalid['error']}")
        if len(invalid_images) > 5:
            print(f"  ... and {len(invalid_images) - 5} more")
    
    if not manifest_entries:
        print("❌ No valid images found for manifest creation!")
        return None
    
    # Create manifest file content
    manifest_content = "\n".join([json.dumps(entry) for entry in manifest_entries])
    
    # Upload manifest to S3 with progress tracking
    print(f"\n📤 Uploading manifest to S3...")
    try:
        s3_client.put_object(
            Bucket=bucket,
            Key=output_key,
            Body=manifest_content,
            ContentType='application/json'
        )
        
        manifest_uri = f"s3://{bucket}/{output_key}"
        print(f"✅ Manifest uploaded successfully: {manifest_uri}")
        
        # Log manifest creation to MLFlow
        mlflow.log_param("manifest_s3_uri", manifest_uri)
        mlflow.log_param("manifest_total_images", len(manifest_entries))
        mlflow.log_metric("manifest_valid_images", len(valid_images))
        mlflow.log_metric("manifest_invalid_images", len(invalid_images))
        mlflow.log_param("manifest_validation_success_rate", f"{len(valid_images)/len(selected_images)*100:.1f}%")
        
        # Save manifest locally and log as artifact
        with open('input_manifest.json', 'w') as f:
            f.write(manifest_content)
        mlflow.log_artifact('input_manifest.json')
        
        return {
            'manifest_uri': manifest_uri,
            'total_images': len(manifest_entries),
            'valid_images': valid_images,
            'invalid_images': invalid_images,
            'success_rate': len(valid_images)/len(selected_images)*100
        }
        
    except Exception as e:
        print(f"❌ Error uploading manifest: {str(e)}")
        mlflow.log_param("manifest_creation_error", str(e))
        return None

# Create input manifest with enhanced validation
if raw_images:
    print("Creating input manifest for Ground Truth labeling job...")
    manifest_result = create_input_manifest_enhanced(
        BUCKET_NAME, 
        raw_images, 
        output_key="ground-truth/input-manifest.json",
        max_images=20  # Limit for demo
    )
    
    if manifest_result:
        print(f"\n🎯 Manifest created successfully!")
        print(f"📊 Success rate: {manifest_result['success_rate']:.1f}%")
        print(f"📁 Location: {manifest_result['manifest_uri']}")
        
        # Store for later use
        input_manifest_uri = manifest_result['manifest_uri']
    else:
        print("❌ Failed to create manifest")
else:
    print("No images available for manifest creation")

### 3.3 Create Ground Truth Labeling Job

Now let's create the actual Ground Truth labeling job.
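
Job creation involves several S3 URIs (manifest, output path, label categories) that have to be split into bucket and key or extended with a file name. The sketch below shows small helpers for both tasks. The names `parse_s3_uri` and `join_s3_uri` are illustrative assumptions, not functions from boto3 or the SageMaker SDK.

```python
def parse_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into (bucket, key); raise on malformed input."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {s3_uri!r}")
    return bucket, key


def join_s3_uri(base_uri, *parts):
    """Append path parts to an S3 URI without producing double slashes."""
    cleaned = [base_uri.rstrip("/")] + [part.strip("/") for part in parts]
    return "/".join(cleaned)


print(parse_s3_uri("s3://example-bucket/ground-truth/input-manifest.json"))
print(join_s3_uri("s3://example-bucket/ground-truth-output/", "label-categories.json"))
```

Stripping slashes before joining matters because a base URI that already ends in `/` would otherwise produce a key with an empty path segment.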

In [None]:
# Enhanced function to create Ground Truth labeling job with comprehensive validation
def create_ground_truth_job_enhanced(config, manifest_uri):
    """Create SageMaker Ground Truth labeling job with enhanced validation and monitoring"""
    
    print("🚀 Creating SageMaker Ground Truth labeling job...")
    
    try:
        # Enhanced validation of prerequisites
        print("📋 Validating prerequisites...")
        
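        # 1. Resolve the IAM execution role; get_execution_role raises if the
        #    caller identity is an IAM user rather than a role
        try:
            role_arn = sagemaker.get_execution_role(sagemaker_session=sagemaker_session)
            print(f"✅ IAM role resolved: {role_arn}")
        except Exception as e:
            raise ValueError(f"IAM role validation failed: {e}")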
        
        # 2. Validate manifest accessibility
        try:
            manifest_bucket = manifest_uri.split('/')[2]
            manifest_key = '/'.join(manifest_uri.split('/')[3:])
            s3_client.head_object(Bucket=manifest_bucket, Key=manifest_key)
            print(f"✅ Input manifest validated: {manifest_uri}")
        except Exception as e:
            raise ValueError(f"Manifest validation failed: {e}")
        
        # 3. Validate S3 output path
        output_bucket = config['output_s3_uri'].split('/')[2]
        try:
            s3_client.head_bucket(Bucket=output_bucket)
            print(f"✅ Output S3 bucket validated: {output_bucket}")
        except Exception as e:
            raise ValueError(f"Output S3 bucket validation failed: {e}")
        
        # Enhanced job configuration with validation
        job_config = {
            'LabelingJobName': config['job_name'],
            'LabelAttributeName': 'bounding-box',
            'InputConfig': {
                'DataSource': {
                    'S3DataSource': {
                        'ManifestS3Uri': manifest_uri
                    }
                }
            },
            'OutputConfig': {
                'S3OutputPath': config['output_s3_uri']
            },
            'RoleArn': role_arn,
            'HumanTaskConfig': {
                'WorkteamArn': f"arn:aws:sagemaker:{region}:{account_id}:workteam/private-crowd/default",
                'UiConfig': {
                    'UiTemplateS3Uri': 's3://sagemaker-example-files-prod-us-east-1/ground-truth/object-detection/template.liquid'
                },
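                # Note: the AWS-owned account ID in the built-in PRE-/ACS- Lambda ARNs is
                # region-specific; confirm the ID for your region in the Ground Truth docs
                'PreHumanTaskLambdaArn': f'arn:aws:lambda:{region}:432418664414:function:PRE-BoundingBox',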
                'TaskTitle': config.get('task_title', 'Drone Detection Labeling'),
                'TaskDescription': config.get('task_description', 'Draw bounding boxes around drones and other objects in aerial imagery'),
                'NumberOfHumanWorkersPerDataObject': config.get('workers_per_object', 1),
                'TaskTimeLimitInSeconds': config.get('time_limit_seconds', 3600),
                'TaskAvailabilityLifetimeInSeconds': config.get('availability_seconds', 86400),
                'MaxConcurrentTaskCount': config.get('max_concurrent_tasks', 10),
                'AnnotationConsolidationConfig': {
                    'AnnotationConsolidationLambdaArn': f'arn:aws:lambda:{region}:432418664414:function:ACS-BoundingBox'
                }
            },
            'Tags': [
                {'Key': 'Project', 'Value': 'drone-detection'},
                {'Key': 'Environment', 'Value': 'development'},
                {'Key': 'CreatedBy', 'Value': 'data-scientist-notebook'}
            ]
        }
        
        # Add label categories if provided
        if 'labels' in config and config['labels']:
            job_config['LabelCategoryConfigS3Uri'] = create_label_categories_file(
                config['labels'], 
                # Strip the trailing slash so the key does not contain "//"
                f"{config['output_s3_uri'].rstrip('/')}/label-categories.json"
            )
            print(f"✅ Label categories configured: {config['labels']}")
        
        print("🔄 Submitting labeling job...")
        
        # Create the labeling job
        response = sagemaker_client.create_labeling_job(**job_config)
        
        job_arn = response['LabelingJobArn']
        print(f"✅ Labeling job created successfully!")
        print(f"📋 Job Name: {config['job_name']}")
        print(f"🔗 Job ARN: {job_arn}")
        
        # Log job creation to MLFlow
        mlflow.log_param("labeling_job_name", config['job_name'])
        mlflow.log_param("labeling_job_arn", job_arn)
        mlflow.log_param("labeling_job_status", "CREATED")
        mlflow.log_param("labeling_manifest_uri", manifest_uri)
        mlflow.log_param("labeling_output_uri", config['output_s3_uri'])
        mlflow.log_param("labeling_workers_per_object", config.get('workers_per_object', 1))
        mlflow.log_param("labeling_time_limit", config.get('time_limit_seconds', 3600))
        
        if 'labels' in config:
            mlflow.log_param("labeling_categories", ', '.join(config['labels']))
            mlflow.log_metric("labeling_categories_count", len(config['labels']))
        
        # Enhanced monitoring setup
        print("\n📊 Setting up job monitoring...")
        monitoring_result = setup_enhanced_monitoring(config['job_name'])
        
        return {
            'job_name': config['job_name'],
            'job_arn': job_arn,
            'status': 'CREATED',
            'monitoring': monitoring_result
        }
        
    except Exception as e:
        error_msg = str(e)
        print(f"❌ Error creating labeling job: {error_msg}")
        
        # Enhanced error diagnosis
        print("\n🔍 Error Diagnosis:")
        if "workteam" in error_msg.lower():
            print("• Issue: Private workforce not configured")
            print("• Solution: Create a private workforce in SageMaker Console")
            print("• Steps:")
            print("  1. Go to SageMaker Console > Ground Truth > Labeling workforces")
            print("  2. Create a private workforce")
            print("  3. Add team members to the workforce")
        elif "role" in error_msg.lower():
            print("• Issue: IAM role permissions insufficient")
            print("• Solution: Ensure SageMaker execution role has Ground Truth permissions")
        elif "manifest" in error_msg.lower():
            print("• Issue: Input manifest file problem")
            print("• Solution: Check manifest file format and S3 accessibility")
        elif "s3" in error_msg.lower():
            print("• Issue: S3 access problem")
            print("• Solution: Verify S3 bucket permissions and paths")
        
        # Log error to MLFlow
        mlflow.log_param("labeling_job_error", error_msg)
        mlflow.log_param("labeling_job_status", "FAILED")
        
        return None

def create_label_categories_file(labels, s3_uri):
    """Create label categories file for Ground Truth"""
    categories = {
        "document-version": "2018-11-28",
        "labels": [{"label": label} for label in labels]
    }
    
    # Upload to S3
    bucket = s3_uri.split('/')[2]
    key = '/'.join(s3_uri.split('/')[3:])
    
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(categories),
        ContentType='application/json'
    )
    
    return s3_uri

def setup_enhanced_monitoring(job_name):
    """Set up enhanced monitoring for the labeling job"""
    try:
        # Create monitoring configuration
        monitoring_config = {
            'job_name': job_name,
            'check_interval_seconds': 300,  # Check every 5 minutes
            'notifications_enabled': True
        }
        
        print(f"✅ Monitoring configured for job: {job_name}")
        return monitoring_config
        
    except Exception as e:
        print(f"⚠️  Monitoring setup failed: {e}")
        return None

# Enhanced Ground Truth job creation
if 'input_manifest_uri' in locals() and input_manifest_uri:
    # Enhanced labeling job configuration with validation
    labeling_config = {
        'job_name': f"drone-detection-labeling-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
        'output_s3_uri': f"s3://{BUCKET_NAME}/ground-truth-output/",
        'labels': ['drone', 'vehicle', 'person', 'building'],
        'task_title': 'Drone Detection in Aerial Imagery',
        'task_description': 'Draw bounding boxes around drones, vehicles, people, and buildings in aerial imagery. Be precise with bounding box placement.',
        'workers_per_object': 1,
        'time_limit_seconds': 1800,  # 30 minutes per image
        'max_concurrent_tasks': 5,
        'availability_seconds': 86400  # 24 hours
    }
    
    # Create the labeling job with enhanced validation
    job_result = create_ground_truth_job_enhanced(labeling_config, input_manifest_uri)
    
    if job_result:
        print(f"\n🎉 Ground Truth labeling job setup complete!")
        print(f"📋 Job Name: {job_result['job_name']}")
        print(f"📊 Monitor progress in SageMaker Console > Ground Truth")
        
        # Store job info for monitoring
        current_labeling_job = job_result['job_name']
    else:
        print("❌ Failed to create Ground Truth labeling job")
        print("💡 Check the error diagnosis above for troubleshooting steps")
else:
    print("⚠️  No input manifest available. Please create the manifest first.")
    
mlflow.end_run()

### 3.4 Monitor Labeling Job Progress

Let's create a function to monitor the labeling job progress.
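
A single status check is often wrapped in a polling loop that stops once the job reaches a terminal state. The sketch below illustrates that pattern with an injectable `describe_fn`, so it can be tried without AWS access. The function name `wait_for_labeling_job`, the default interval, and the fake client are illustrative assumptions.

```python
import time

# Statuses after which a labeling job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(describe_fn, job_name, poll_seconds=300,
                          max_polls=12, sleep_fn=time.sleep):
    """Poll describe_fn until a terminal status is seen or polls run out."""
    status = None
    for attempt in range(max_polls):
        response = describe_fn(LabelingJobName=job_name)
        status = response["LabelingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        # Skip the sleep after the final attempt.
        if attempt < max_polls - 1:
            sleep_fn(poll_seconds)
    return status


# Stand-in for sagemaker_client.describe_labeling_job, for demonstration only.
_fake_statuses = iter(["Initializing", "InProgress", "Completed"])


def fake_describe(LabelingJobName):
    return {"LabelingJobStatus": next(_fake_statuses)}


final_status = wait_for_labeling_job(
    fake_describe, "demo-job", poll_seconds=0, sleep_fn=lambda seconds: None
)
print(final_status)
```

In the notebook, `sagemaker_client.describe_labeling_job` could be passed as `describe_fn`. Human labeling can take hours or days, so a bounded `max_polls` keeps the cell from blocking indefinitely.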

In [None]:
# Function to monitor labeling job
def monitor_labeling_job(job_name):
    """Monitor Ground Truth labeling job progress"""
    
    try:
        response = sagemaker_client.describe_labeling_job(
            LabelingJobName=job_name
        )
        
        status = response['LabelingJobStatus']
        creation_time = response['CreationTime']
        
        print(f"Job Name: {job_name}")
        print(f"Status: {status}")
        print(f"Created: {creation_time.strftime('%Y-%m-%d %H:%M:%S')}")
        
        if 'LabelCounters' in response:
            counters = response['LabelCounters']
            print(f"Total objects: {counters.get('TotalLabeled', 0) + counters.get('Unlabeled', 0)}")
            print(f"Labeled: {counters.get('TotalLabeled', 0)}")
            print(f"Remaining: {counters.get('Unlabeled', 0)}")
        
        if status == 'Completed':
            output_location = response['LabelingJobOutput']['OutputDatasetS3Uri']
            print(f"✅ Job completed! Output: {output_location}")
            
            # Log completion to MLFlow
            mlflow.log_param("labeling_job_output", output_location)
            mlflow.log_param("labeling_job_status", "completed")
            
        elif status == 'Failed':
            failure_reason = response.get('FailureReason', 'Unknown')
            print(f"❌ Job failed: {failure_reason}")
            
            # Log failure to MLFlow
            mlflow.log_param("labeling_job_failure", failure_reason)
            mlflow.log_param("labeling_job_status", "failed")
        
        return response
        
    except Exception as e:
        print(f"Error monitoring labeling job: {str(e)}")
        return None

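# Monitor the job if it was created (current_labeling_job is set in section 3.3;
# job_arn only exists inside create_ground_truth_job_enhanced)
if 'current_labeling_job' in globals() and current_labeling_job:
    job_status = monitor_labeling_job(current_labeling_job)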

## 4. View MLFlow Experiments

Let's view our MLFlow experiments and runs.

In [None]:
# Function to list MLFlow experiments
def list_mlflow_experiments():
    """List all MLFlow experiments"""
    experiments = mlflow.search_experiments()
    
    if experiments:
        print("MLFlow Experiments:")
        for exp in experiments:
            print(f"  - {exp.name} (ID: {exp.experiment_id})")
            
            # Get runs for this experiment
            runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
            print(f"    Runs: {len(runs)}")
            
            if len(runs) > 0:
                print("    Recent runs:")
                for _, run in runs.head(3).iterrows():
                    run_name = run.get('tags.mlflow.runName', 'Unnamed')
                    status = run.get('status', 'Unknown')
                    print(f"      - {run_name} ({status})")
            print()
    else:
        print("No MLFlow experiments found")

# List experiments
list_mlflow_experiments()

## 5. Summary and Next Steps

In this notebook, we've explored the drone imagery dataset, prepared the structure for YOLOv11 training, created a Ground Truth labeling job, and tracked our work with MLFlow. Here's a summary of what we've accomplished:

1. **Data Exploration with MLFlow**:
   - Listed and displayed sample images from the S3 bucket
   - Analyzed image characteristics (dimensions, aspect ratios, file sizes)
   - Visualized image statistics
   - Tracked all metrics and artifacts in MLFlow

2. **Data Preparation**:
   - Checked for existing labeled data
   - Created YOLO dataset structure in S3
   - Logged dataset information to MLFlow

3. **Ground Truth Labeling**:
   - Configured labeling job parameters
   - Created input manifest for Ground Truth
   - Set up Ground Truth labeling job for object detection
   - Implemented job monitoring functionality

4. **Experiment Tracking**:
   - Used MLFlow to track all data exploration and labeling activities
   - Saved visualizations as artifacts
   - Logged parameters and metrics for reproducibility
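
The job monitoring step reads the `LabelCounters` block returned by `describe_labeling_job`. As a minimal sketch, those counters can be turned into a completion percentage. The helper name `labeling_progress` is hypothetical (it is not part of this notebook or of boto3), and counting non-retryable failures toward the total is an assumption.

```python
def labeling_progress(counters):
    """Summarize a Ground Truth LabelCounters dict (hypothetical helper)."""
    labeled = counters.get("TotalLabeled", 0)
    unlabeled = counters.get("Unlabeled", 0)
    failed = counters.get("FailedNonRetryableError", 0)

    # Assumption: failed objects count toward the total workload
    total = labeled + unlabeled + failed
    percent = 100.0 * labeled / total if total else 0.0

    return {
        "total": total,
        "labeled": labeled,
        "remaining": unlabeled,
        "failed": failed,
        "percent_labeled": round(percent, 1),
    }


example = labeling_progress({"TotalLabeled": 30, "Unlabeled": 70})
print(example["percent_labeled"])  # 30.0
```

The result could be passed to `mlflow.log_metric` to record labeling progress over time.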

### Next Steps

1. **Monitor Labeling Job**: Check the progress of your Ground Truth labeling job
2. **Complete Labeling**: Ensure all images are properly labeled
3. **Convert Labels**: Convert Ground Truth output to YOLOv11 format
4. **Organize Training Data**: Place labeled data in the YOLO structure we created
5. **Proceed to Training**: Use the ML Engineer notebook for model training
6. **Review MLFlow**: Check all experiments in the SageMaker Studio MLFlow UI
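
The label conversion step above can be sketched as follows. This is a minimal sketch, assuming the output manifest follows the Ground Truth bounding-box layout: one JSON object per line, with `image_size` and `annotations` under the label attribute, and `left`/`top`/`width`/`height` in pixels. The function name `manifest_line_to_yolo`, the label attribute name `drone-labels`, and the bucket in the sample are placeholders.

```python
import json


def manifest_line_to_yolo(line, label_attribute):
    """Convert one Ground Truth bounding-box manifest line to YOLO label lines."""
    record = json.loads(line)
    labels = record[label_attribute]
    size = labels["image_size"][0]
    img_w, img_h = size["width"], size["height"]

    yolo_lines = []
    for box in labels["annotations"]:
        # Ground Truth stores the top-left corner; YOLO expects the normalized center
        x_center = (box["left"] + box["width"] / 2) / img_w
        y_center = (box["top"] + box["height"] / 2) / img_h
        width = box["width"] / img_w
        height = box["height"] / img_h
        yolo_lines.append(
            f"{box['class_id']} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return record["source-ref"], yolo_lines


# Placeholder manifest line for illustration only
sample = json.dumps({
    "source-ref": "s3://example-bucket/raw-images/img_001.jpg",
    "drone-labels": {
        "image_size": [{"width": 640, "height": 480, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 100, "top": 50, "width": 200, "height": 100}
        ],
    },
})
source, lines = manifest_line_to_yolo(sample, "drone-labels")
print(lines[0])  # 0 0.312500 0.208333 0.312500 0.208333
```

Each returned list can be written to a `.txt` file named after the image, matching the label layout used elsewhere in this notebook.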

### Ground Truth Integration Benefits

- **Quality Control**: Professional annotation with quality checks
- **Scalability**: Handle large datasets efficiently
- **Cost Management**: Budget controls and cost estimation
- **Workforce Management**: Private or public workforce options
- **Integration**: Seamless integration with SageMaker training pipelines

### MLFlow Integration Benefits

- **Complete Tracking**: All data exploration and labeling activities are tracked
- **Reproducibility**: Parameters and configurations are logged
- **Collaboration**: Team members can view and compare experiments
- **Artifact Management**: Visualizations and data summaries are stored
- **Lineage**: Track data from exploration to training

### Accessing Your Work

- **MLFlow UI**: Go to "Experiments and trials" > "MLflow" in SageMaker Studio
- **Ground Truth Console**: Monitor labeling jobs in the SageMaker console
- **S3 Data**: All data and artifacts are stored in your S3 bucket

For more detailed functionality, refer to the comprehensive notebooks in the `notebooks/data-labeling/` directory.

### 2.2 Transform Classification Images to YOLOv11 Format with MLFlow Tracking

Let's transform the images from the classification structure (raw-images/class_name/) to YOLOv11 format with proper train/val splits and label files, while tracking everything in MLFlow.
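
Before running the transformation, it can help to see how a key moves from one layout to the other. The sketch below is illustrative only: `yolo_target_keys` is a hypothetical helper and the dataset name is a placeholder. Note that files with the same name in different class folders would map to the same target key.

```python
import posixpath


def yolo_target_keys(source_key, dataset_name, split):
    """Map a raw-images/<class>/<file> key to its YOLO image and label keys."""
    filename = posixpath.basename(source_key)
    stem, _ = posixpath.splitext(filename)
    base = f"datasets/{dataset_name}/{split}"
    return f"{base}/images/{filename}", f"{base}/labels/{stem}.txt"


image_key, label_key = yolo_target_keys(
    "raw-images/vehicle/img_001.jpg", "yolov11_dataset_example", "train"
)
print(image_key)  # datasets/yolov11_dataset_example/train/images/img_001.jpg
print(label_key)  # datasets/yolov11_dataset_example/train/labels/img_001.txt
```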

In [None]:


# Log class discovery to the MLFlow run that is already active

# Function to discover classes from raw-images structure
def discover_classes_from_s3(bucket, prefix="raw-images/"):
    """Discover class names from S3 directory structure"""
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/'
    )
    
    classes = []
    if 'CommonPrefixes' in response:
        for obj in response['CommonPrefixes']:
            class_prefix = obj['Prefix']
            class_name = class_prefix.replace(prefix, '').rstrip('/')
            if class_name:  # Skip empty class names
                classes.append(class_name)
    
    return sorted(classes)

# Function to get images for each class
def get_images_by_class(bucket, prefix="raw-images/"):
    """Get all images organized by class"""
    classes = discover_classes_from_s3(bucket, prefix)
    images_by_class = {}
    
    print(f"Found {len(classes)} classes: {classes}")
    
    for class_name in classes:
        class_prefix = f"{prefix}{class_name}/"
        objects = list_s3_objects(bucket, prefix=class_prefix)
        images = filter_image_files(objects)
        images_by_class[class_name] = images
        print(f"  {class_name}: {len(images)} images")
    
    return images_by_class

# Discover classes and images
images_by_class = get_images_by_class(BUCKET_NAME)
class_names = list(images_by_class.keys())
total_images = sum(len(images) for images in images_by_class.values())

print(f"\nTotal images found: {total_images}")
print(f"Classes: {class_names}")

# Log class discovery to MLFlow
mlflow.log_param("num_classes", len(class_names))
mlflow.log_param("class_names", ", ".join(class_names))
mlflow.log_metric("total_raw_images", total_images)

# Log per-class image counts
for class_name, images in images_by_class.items():
    mlflow.log_metric(f"images_{class_name}", len(images))

In [None]:
# Continue logging the train/val split to the currently active MLFlow run

# Function to create YOLO label file content
def create_yolo_label(class_id, image_width, image_height, 
                     bbox_x=None, bbox_y=None, bbox_w=None, bbox_h=None):
    """Create YOLO format label content
    
    For classification images we don't have specific object locations, so the
    default is a centered box spanning 80% of the image width and height.
    
    Args:
        class_id: Class ID (0-based)
        image_width: Image width in pixels
        image_height: Image height in pixels
        bbox_x, bbox_y, bbox_w, bbox_h: Optional box in pixels (left, top, width, height)
    
    Returns:
        YOLO format label string
    """
    if bbox_x is None or bbox_y is None or bbox_w is None or bbox_h is None:
        # Create a bounding box covering most of the image (80% centered)
        # This assumes the object of interest is roughly centered in the image
        x_center = 0.5  # Center of image (normalized)
        y_center = 0.5  # Center of image (normalized)
        width = 0.8     # 80% of image width (normalized)
        height = 0.8    # 80% of image height (normalized)
    else:
        # Normalize the provided bounding box coordinates
        x_center = (bbox_x + bbox_w/2) / image_width
        y_center = (bbox_y + bbox_h/2) / image_height
        width = bbox_w / image_width
        height = bbox_h / image_height
    
    # Ensure values are within [0, 1] range
    x_center = max(0, min(1, x_center))
    y_center = max(0, min(1, y_center))
    width = max(0, min(1, width))
    height = max(0, min(1, height))
    
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Function to split data into train/val sets
def split_train_val(images_by_class, train_ratio=0.8, random_seed=42):
    """Split images into train and validation sets"""
    random.seed(random_seed)
    
    train_data = {}
    val_data = {}
    
    for class_name, images in images_by_class.items():
        # Shuffle images
        shuffled_images = images.copy()
        random.shuffle(shuffled_images)
        
        # Split into train/val
        split_idx = int(len(shuffled_images) * train_ratio)
        train_data[class_name] = shuffled_images[:split_idx]
        val_data[class_name] = shuffled_images[split_idx:]
        
        print(f"{class_name}: {len(train_data[class_name])} train, {len(val_data[class_name])} val")
    
    return train_data, val_data

# Split data into train/val
if images_by_class:
    train_data, val_data = split_train_val(images_by_class)
    
    total_train = sum(len(images) for images in train_data.values())
    total_val = sum(len(images) for images in val_data.values())
    
    print(f"\nTotal train images: {total_train}")
    print(f"Total validation images: {total_val}")
    
    # Log split information to MLFlow
    mlflow.log_param("train_ratio", 0.8)
    mlflow.log_param("random_seed", 42)
    mlflow.log_metric("train_images_total", total_train)
    mlflow.log_metric("val_images_total", total_val)
    
    # Log per-class split information
    for class_name in class_names:
        mlflow.log_metric(f"train_images_{class_name}", len(train_data[class_name]))
        mlflow.log_metric(f"val_images_{class_name}", len(val_data[class_name]))
    
else:
    print("No images found to split")

In [None]:
# Continue logging the transformation to the currently active MLFlow run

# Function to transform and upload images to YOLO format
def transform_to_yolo_format(bucket, train_data, val_data, class_names, 
                           dataset_name, progress_callback=None):
    """Transform classification images to YOLO format and upload to S3"""
    
    # Create class ID mapping
    class_to_id = {class_name: idx for idx, class_name in enumerate(class_names)}
    
    # Define S3 prefixes
    base_prefix = f"datasets/{dataset_name}/"
    train_images_prefix = f"{base_prefix}train/images/"
    train_labels_prefix = f"{base_prefix}train/labels/"
    val_images_prefix = f"{base_prefix}val/images/"
    val_labels_prefix = f"{base_prefix}val/labels/"
    
    def process_split(data, images_prefix, labels_prefix, split_name):
        """Process a data split (train or val)"""
        processed_count = 0
        total_count = sum(len(images) for images in data.values())
        
        print(f"\nProcessing {split_name} split ({total_count} images)...")
        
        for class_name, images in data.items():
            class_id = class_to_id[class_name]
            
            for img_obj in images:
                try:
                    # Get original image key and filename
                    original_key = img_obj['Key']
                    filename = os.path.basename(original_key)
                    name_without_ext = os.path.splitext(filename)[0]
                    
                    # Download image to get dimensions
                    response = s3_client.get_object(Bucket=bucket, Key=original_key)
                    img_data = response['Body'].read()
                    img = Image.open(io.BytesIO(img_data))
                    width, height = img.size
                    
                    # Copy image to new location
                    new_image_key = f"{images_prefix}{filename}"
                    s3_client.copy_object(
                        Bucket=bucket,
                        CopySource={'Bucket': bucket, 'Key': original_key},
                        Key=new_image_key
                    )
                    
                    # Create YOLO label
                    label_content = create_yolo_label(class_id, width, height)
                    
                    # Upload label file
                    label_key = f"{labels_prefix}{name_without_ext}.txt"
                    s3_client.put_object(
                        Bucket=bucket,
                        Key=label_key,
                        Body=label_content.encode('utf-8')
                    )
                    
                    processed_count += 1
                    
                    # Progress callback
                    if progress_callback and processed_count % 10 == 0:
                        progress_callback(processed_count, total_count, split_name)
                        
                except Exception as e:
                    # Use .get() so a missing 'Key' cannot raise a second error here
                    print(f"Error processing {img_obj.get('Key', '<unknown key>')}: {str(e)}")
        
        print(f"Completed {split_name} split: {processed_count}/{total_count} images processed")
        return processed_count
    
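    # Default progress reporter, used when the caller does not pass a callback
    def show_progress(current, total, split_name):
        percentage = (current / total) * 100
        print(f"  {split_name}: {current}/{total} ({percentage:.1f}%) processed")
    
    if progress_callback is None:
        progress_callback = show_progress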
    
    # Process train and validation splits
    train_processed = process_split(train_data, train_images_prefix, train_labels_prefix, "Train")
    val_processed = process_split(val_data, val_images_prefix, val_labels_prefix, "Validation")
    
    return {
        'train_processed': train_processed,
        'val_processed': val_processed,
        'class_to_id': class_to_id,
        'base_prefix': base_prefix
    }

# Transform images to YOLO format
if 'train_data' in locals() and 'val_data' in locals() and class_names:
    print("Starting transformation to YOLO format...")
    
    # Create a new dataset with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dataset_name = f"yolov11_dataset_{timestamp}"
    
    # Log dataset creation to MLFlow
    mlflow.log_param("dataset_name", dataset_name)
    mlflow.log_param("transformation_timestamp", timestamp)
    mlflow.log_param("bbox_strategy", "80% centered")
    
    transformation_result = transform_to_yolo_format(
        BUCKET_NAME, train_data, val_data, class_names, dataset_name
    )
    
    print(f"\nTransformation completed!")
    print(f"Dataset created at: s3://{BUCKET_NAME}/datasets/{dataset_name}/")
    print(f"Train images processed: {transformation_result['train_processed']}")
    print(f"Validation images processed: {transformation_result['val_processed']}")
    
    # Log transformation results to MLFlow
    mlflow.log_param("dataset_s3_path", f"s3://{BUCKET_NAME}/datasets/{dataset_name}/")
    mlflow.log_metric("train_processed", transformation_result['train_processed'])
    mlflow.log_metric("val_processed", transformation_result['val_processed'])
    mlflow.log_metric("total_processed", transformation_result['train_processed'] + transformation_result['val_processed'])
    
    # Log class mapping
    for class_name, class_id in transformation_result['class_to_id'].items():
        mlflow.log_param(f"class_id_{class_name}", class_id)
    
else:
    print("No data available for transformation")

In [None]:
# Create data.yaml and log it to the currently active MLFlow run

# Function to create data.yaml configuration file
def create_data_yaml(bucket, dataset_name, class_names, class_to_id):
    """Create YOLO data.yaml configuration file"""
    
    # Create the YAML configuration
    data_config = {
        'path': f's3://{bucket}/datasets/{dataset_name}',
        'train': 'train/images',
        'val': 'val/images',
        'nc': len(class_names),
        'names': {class_to_id[name]: name for name in class_names}
    }
    
    # Convert to YAML string
    yaml_content = yaml.dump(data_config, default_flow_style=False, sort_keys=False)
    
    # Upload to S3
    yaml_key = f"datasets/{dataset_name}/data.yaml"
    s3_client.put_object(
        Bucket=bucket,
        Key=yaml_key,
        Body=yaml_content.encode('utf-8')
    )
    
    print(f"Created data.yaml at: s3://{bucket}/{yaml_key}")
    print("\nYAML content:")
    print(yaml_content)
    
    return yaml_key

# Create data.yaml file
if 'transformation_result' in locals():
    yaml_key = create_data_yaml(
        BUCKET_NAME, 
        dataset_name, 
        class_names, 
        transformation_result['class_to_id']
    )
    
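    # Log YAML creation to MLFlow
    mlflow.log_param("data_yaml_path", f"s3://{BUCKET_NAME}/{yaml_key}")
    # MLFlow has no helper for logging straight from S3, so read the file back and log its text
    yaml_text = s3_client.get_object(Bucket=BUCKET_NAME, Key=yaml_key)['Body'].read().decode('utf-8')
    mlflow.log_text(yaml_text, "data.yaml")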
    
else:
    print("No transformation result available to create data.yaml")

In [None]:
# Final dataset summary, logged to the currently active MLFlow run

# Display dataset summary
if 'dataset_name' in locals() and 'transformation_result' in locals():
    print("🎉 Dataset Transformation Complete!")
    print("=" * 50)
    print(f"Dataset Name: {dataset_name}")
    print(f"S3 Location: s3://{BUCKET_NAME}/datasets/{dataset_name}/")
    print(f"Classes: {len(class_names)} ({', '.join(class_names)})")
    print(f"Train Images: {transformation_result['train_processed']}")
    print(f"Validation Images: {transformation_result['val_processed']}")
    print(f"Total Images: {transformation_result['train_processed'] + transformation_result['val_processed']}")
    
    print("\n📁 Dataset Structure:")
    print(f"s3://{BUCKET_NAME}/datasets/{dataset_name}/")
    print("├── train/")
    print("│   ├── images/     # Training images")
    print("│   └── labels/     # Training labels (.txt files)")
    print("├── val/")
    print("│   ├── images/     # Validation images")
    print("│   └── labels/     # Validation labels (.txt files)")
    print("└── data.yaml       # Dataset configuration")
    
    print("\n🏷️  Class Mapping:")
    for class_name, class_id in transformation_result['class_to_id'].items():
        print(f"  {class_id}: {class_name}")
    
    print("\n📝 Label Format:")
    print("Each .txt file contains one line per object:")
    print("<class_id> <x_center> <y_center> <width> <height>")
    print("All coordinates are normalized (0.0 to 1.0)")
    
    print("\n🚀 Next Steps:")
    print("1. Use this dataset in the ML Engineer notebook for training")
    print("2. The dataset is ready for YOLOv11 training")
    print(f"3. Reference the dataset using: s3://{BUCKET_NAME}/datasets/{dataset_name}/data.yaml")
    
    # Save dataset info for later use
    dataset_info = {
        'dataset_name': dataset_name,
        'bucket': BUCKET_NAME,
        'base_path': f"s3://{BUCKET_NAME}/datasets/{dataset_name}/",
        'config_path': f"s3://{BUCKET_NAME}/datasets/{dataset_name}/data.yaml",
        'classes': class_names,
        'class_to_id': transformation_result['class_to_id'],
        'train_count': transformation_result['train_processed'],
        'val_count': transformation_result['val_processed'],
        'created_at': datetime.now().isoformat(),
        'transformation_method': 'classification_to_yolo',
        'bbox_strategy': '80% centered',
        'mlflow_experiment': experiment_name
    }
    
    # Save dataset info to S3 for reference
    info_key = f"datasets/{dataset_name}/dataset_info.json"
    s3_client.put_object(
        Bucket=BUCKET_NAME,
        Key=info_key,
        Body=json.dumps(dataset_info, indent=2).encode('utf-8')
    )
    
    print(f"\n💾 Dataset info saved to: s3://{BUCKET_NAME}/{info_key}")
    
    # Log final summary to MLFlow
    mlflow.log_param("final_dataset_name", dataset_name)
    mlflow.log_param("final_dataset_path", f"s3://{BUCKET_NAME}/datasets/{dataset_name}/")
    mlflow.log_param("dataset_info_path", f"s3://{BUCKET_NAME}/{info_key}")
    mlflow.log_metric("final_train_count", transformation_result['train_processed'])
    mlflow.log_metric("final_val_count", transformation_result['val_processed'])
    mlflow.log_metric("final_total_count", transformation_result['train_processed'] + transformation_result['val_processed'])
    
    # Log dataset info as artifact
    with open('dataset_info.json', 'w') as f:
        json.dump(dataset_info, f, indent=2)
    mlflow.log_artifact('dataset_info.json')
    
    # Set tags for easy filtering
    mlflow.set_tag("stage", "dataset_creation")
    mlflow.set_tag("dataset_ready", "true")
    mlflow.set_tag("dataset_name", dataset_name)
    mlflow.set_tag("transformation_type", "classification_to_yolo")
    
else:
    print("No dataset transformation was completed.")
    print("Please run the transformation cells above first.")
    
    # Log failure
    mlflow.log_param("transformation_status", "failed")
    mlflow.set_tag("dataset_ready", "false")

## 6. Updated Summary and Next Steps

In this enhanced notebook, we've explored the drone imagery dataset, transformed it for YOLOv11 training, created Ground Truth labeling jobs, and tracked everything with MLFlow. Here's a comprehensive summary of what we've accomplished:

### Completed Tasks:

1. **Data Exploration with MLFlow**:
   - Listed and displayed sample images from the S3 bucket
   - Analyzed image characteristics (dimensions, aspect ratios, file sizes)
   - Visualized image statistics
   - Tracked all metrics and artifacts in MLFlow

2. **Data Preparation**:
   - Checked for existing labeled data
   - Created YOLO dataset structure in S3
   - Logged dataset information to MLFlow

3. **Image Transformation** (NEW):
   - Discovered classes from raw-images/ directory structure
   - Split images into train/validation sets (80/20 split)
   - Created YOLO format labels for each image
   - Organized data in proper YOLOv11 structure
   - Generated data.yaml configuration file
   - Tracked entire transformation process in MLFlow

4. **Ground Truth Labeling**:
   - Configured labeling job parameters
   - Created input manifest for Ground Truth
   - Set up Ground Truth labeling job for object detection
   - Implemented job monitoring functionality

5. **Comprehensive MLFlow Tracking**:
   - Used MLFlow to track all data exploration and labeling activities
   - Saved visualizations as artifacts
   - Logged parameters and metrics for reproducibility
   - Tracked transformation process with detailed metrics
   - Created experiment lineage from exploration to dataset creation

### Key Features of the Enhanced Transformation:

- **Complete MLFlow Integration**: Every step is tracked with parameters, metrics, and artifacts
- **Automatic Class Discovery**: Discovers classes from S3 directory structure
- **Smart Label Generation**: Creates a centered placeholder bounding box spanning 80% of each image's width and height (64% of its area)
- **Train/Val Split Tracking**: Logs split ratios and per-class distributions
- **YOLO Format Compliance**: Generates proper YOLO format labels with normalized coordinates
- **Dataset Verification**: Reports how many images were processed in each split so the counts can be compared against the planned split
- **Configuration Management**: Creates data.yaml and dataset_info.json with full metadata
- **Artifact Management**: Saves all configuration files and summaries as MLFlow artifacts
- **Experiment Tagging**: Uses tags for easy filtering and organization
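
To make the YOLO format requirement concrete, here is a minimal sketch of a label-line check. `parse_yolo_label` is a hypothetical helper, not part of the notebook or of any YOLO library. It only checks the five-field layout, the class ID range, and that coordinates are normalized to [0, 1].

```python
def parse_yolo_label(line, num_classes):
    """Parse one YOLO label line and raise ValueError if it is malformed."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")

    class_id = int(parts[0])
    coords = tuple(float(p) for p in parts[1:])

    if not 0 <= class_id < num_classes:
        raise ValueError(f"class_id {class_id} outside 0..{num_classes - 1}")
    if any(not 0.0 <= c <= 1.0 for c in coords):
        raise ValueError(f"coordinates must be normalized to [0, 1]: {coords}")

    return class_id, coords


# The default label written by the transformation for class 0
parsed = parse_yolo_label("0 0.500000 0.500000 0.800000 0.800000", num_classes=3)
print(parsed)  # (0, (0.5, 0.5, 0.8, 0.8))
```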

### MLFlow Experiment Organization:

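The notebook tracks the following stages in MLFlow. The transformation, dataset creation, finalization, and summary cells log to the run that is already active rather than starting runs of their own: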
- **Data Exploration**: Image analysis and statistics
- **Image Analysis**: Detailed image characteristics
- **Data Preparation**: Dataset structure preparation
- **Labeling Job Creation**: Ground Truth job configuration
- **Image Transformation**: Classification to YOLO conversion
- **Dataset Creation**: YOLO dataset assembly
- **Dataset Finalization**: Configuration file creation
- **Dataset Summary**: Final validation and metadata

### Next Steps

1. **Use the transformed dataset** in the ML Engineer notebook for YOLOv11 training
2. **Monitor Ground Truth labeling jobs** if created
3. **Review MLFlow experiments** in the SageMaker Studio MLFlow UI
4. **Fine-tune the bounding boxes** if needed (every image currently gets the same 80% centered placeholder box, not a real annotation)
5. **Adjust train/val split ratio** if different proportions are needed
6. **Proceed to model training** using the generated dataset
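
When adjusting the split ratio, small classes can end up with an empty train or validation set. The sketch below shows one possible guard, assuming that any class with at least two images should appear in both splits. `split_with_minimum` is a hypothetical variant, not the notebook's `split_train_val`.

```python
import random


def split_with_minimum(items, train_ratio=0.8, seed=42):
    """Shuffle and split items, keeping both sides non-empty when possible."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)

    split_idx = int(len(shuffled) * train_ratio)
    if len(shuffled) >= 2:
        # Clamp so that each split receives at least one item
        split_idx = min(max(split_idx, 1), len(shuffled) - 1)

    return shuffled[:split_idx], shuffled[split_idx:]


train, val = split_with_minimum(list(range(10)), train_ratio=0.5)
print(len(train), len(val))  # 5 5
```

With a single image the clamp does not apply, so that image lands in whichever split the ratio selects.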

### Dataset Ready for Training!

Your dataset is now in the proper YOLOv11 format and ready for training. The ML Engineer can use the dataset path provided in the transformation summary to train YOLOv11 models.

### Accessing Your Work

- **MLFlow UI**: Go to "Experiments and trials" > "MLflow" in SageMaker Studio
- **Ground Truth Console**: Monitor labeling jobs in the SageMaker console
- **S3 Data**: All data and artifacts are stored in your S3 bucket
- **Dataset Info**: Complete metadata available in dataset_info.json
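
As a sketch of how the saved metadata might be read back, the helpers below split an S3 URI and load `dataset_info.json`. Both function names are hypothetical. `s3_client` is assumed to be a boto3 S3 client like the one created at the top of this notebook, and the URI shown is a placeholder.

```python
import json


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key


def load_dataset_info(s3_client, info_uri):
    """Fetch and parse dataset_info.json using a boto3 S3 client."""
    bucket, key = split_s3_uri(info_uri)
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


# Placeholder URI for illustration only
print(split_s3_uri("s3://example-bucket/datasets/demo/dataset_info.json"))
# ('example-bucket', 'datasets/demo/dataset_info.json')
```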

### Benefits Summary

**Ground Truth Integration**:
- Quality control with professional annotation
- Scalable handling of large datasets
- Cost management with budget controls
- Flexible workforce options (private/public)
- Seamless SageMaker integration

**MLFlow Integration**:
- Complete activity tracking and lineage
- Full reproducibility with logged parameters
- Team collaboration through shared experiments
- Comprehensive artifact management
- Organized experiment structure with tagging
- Easy filtering and searching capabilities
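
The tags set in the summary cell (`stage`, `dataset_ready`, and so on) can be used to filter runs programmatically. This is a minimal sketch: `build_tag_filter` and `find_dataset_runs` are hypothetical helpers, and the filter builder assumes tag keys and values contain no quotes or special characters.

```python
def build_tag_filter(tags):
    """Build an MLflow search filter that matches all of the given tags."""
    return " and ".join(f"tags.{key} = '{value}'" for key, value in tags.items())


def find_dataset_runs(experiment_name, tags):
    """Return a DataFrame of runs in the experiment that carry the given tags."""
    import mlflow  # imported here so the filter helper works without MLflow installed

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=build_tag_filter(tags),
    )


print(build_tag_filter({"stage": "dataset_creation", "dataset_ready": "true"}))
# tags.stage = 'dataset_creation' and tags.dataset_ready = 'true'
```

Calling `find_dataset_runs(experiment_name, {"dataset_ready": "true"})` against the tracking server would then list only the runs that produced a usable dataset.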

**Image Transformation**:
- Automatic conversion from classification to object detection format
- Proper YOLO format compliance
- Intelligent bounding box generation
- Complete dataset validation
- Ready-to-use training configuration
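
One way to check the resulting dataset is to confirm that every image has a label file and vice versa. The sketch below works on plain lists of S3 keys, such as those returned by a listing call. `find_unpaired` is a hypothetical helper and the keys shown are placeholders.

```python
import posixpath


def _stems(keys):
    """Return the file names without extension for a list of S3 keys."""
    return {posixpath.splitext(posixpath.basename(key))[0] for key in keys}


def find_unpaired(image_keys, label_keys):
    """Return (images without labels, labels without images) by file stem."""
    image_stems = _stems(image_keys)
    label_stems = _stems(label_keys)
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Placeholder keys for illustration only
missing_labels, orphan_labels = find_unpaired(
    ["train/images/a.jpg", "train/images/b.jpg"],
    ["train/labels/a.txt", "train/labels/c.txt"],
)
print(missing_labels, orphan_labels)  # ['b'] ['c']
```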

For more detailed functionality, refer to the comprehensive notebooks in the `notebooks/data-labeling/` directory.