# Data Scientist Core Workflow

This notebook walks Data Scientists through exploring a drone imagery dataset in S3 and preparing it for YOLOv11 model training. It covers only the core steps and keeps the code deliberately simple.

## Workflow Overview

1. **Data Exploration**: Analyze and visualize the drone imagery dataset
2. **Data Preparation**: Prepare data for YOLOv11 training

## Prerequisites

- AWS account with appropriate permissions
- AWS CLI configured with a profile named `ab`
- SageMaker Studio access with Data Scientist role
- Access to the drone imagery dataset in S3 bucket: `lucaskle-ab3-project-pv`

Let's start by importing the necessary libraries and setting up our environment.

In [None]:
import os
import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from IPython.display import display, HTML
import io
from PIL import Image

# Set up AWS session with "ab" profile
session = boto3.Session(profile_name='ab')
s3_client = session.client('s3')
region = session.region_name
account_id = session.client('sts').get_caller_identity()['Account']

# Set up visualization
plt.rcParams["figure.figsize"] = (12, 6)

# Define bucket name
BUCKET_NAME = 'lucaskle-ab3-project-pv'

print(f"Data Bucket: {BUCKET_NAME}")
print(f"Region: {region}")
print(f"Account ID: {account_id}")

## 1. Data Exploration

Let's start by exploring the drone imagery dataset stored in S3.

In [None]:
# Function to list objects in S3 bucket
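def list_s3_objects(bucket, prefix="", max_items=100):
    """List up to max_items objects in an S3 bucket under the given prefix.

    This makes a single request, and S3 returns at most 1,000 keys per
    request, so larger prefixes are truncated.
    """
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        MaxKeys=max_items
    )
    
    # 'Contents' is absent when nothing matches the prefix
    return response.get('Contents', [])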

# Function to filter image files
def filter_image_files(objects):
    """Filter image files from S3 objects list"""
    # str.endswith accepts a tuple of suffixes
    image_extensions = (".jpg", ".jpeg", ".png", ".tiff", ".tif")
    return [obj for obj in objects
            if obj['Key'].lower().endswith(image_extensions)]

# List raw images in the bucket
raw_objects = list_s3_objects(BUCKET_NAME, prefix="raw-images/")
raw_images = filter_image_files(raw_objects)

print(f"Found {len(raw_images)} raw images among the first {len(raw_objects)} objects listed")

# Display the first few image keys
if raw_images:
    print("\nSample image keys:")
    for i, img in enumerate(raw_images[:5]):
        print(f"  {i+1}. {img['Key']}")
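
The listing above makes a single request capped at `max_items`, so the count reflects only the first objects under the prefix. A minimal sketch of counting every image with the boto3 `list_objects_v2` paginator follows; the helper names `list_all_keys` and `count_images` are hypothetical and not part of the original notebook.

```python
def list_all_keys(client, bucket, prefix=""):
    """Yield every object key under prefix, following S3 pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects carries no 'Contents' entry
        for obj in page.get("Contents", []):
            yield obj["Key"]


def count_images(keys, extensions=(".jpg", ".jpeg", ".png", ".tiff", ".tif")):
    """Count keys that end with one of the image extensions (case-insensitive)."""
    return sum(1 for key in keys if key.lower().endswith(extensions))
```

With the session from the setup cell, `count_images(list_all_keys(s3_client, BUCKET_NAME, "raw-images/"))` would return the full image count. For a very large bucket this issues many requests, so it is worth running once rather than in every cell.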

### 1.1 Display Sample Images

Let's display some sample images from the dataset to get a visual understanding.

In [None]:
# Function to download and display images
def display_sample_images(bucket, image_objects, num_samples=4):
    """Download and display sample images from S3"""
    # Limit to the requested number of samples
    samples = image_objects[:num_samples]
    if not samples:
        print("No images to display")
        return
    
    # squeeze=False always returns a 2D array, so a single sample needs no special case
    fig, axes = plt.subplots(1, len(samples), figsize=(16, 4), squeeze=False)
    axes = axes[0]
    
    # Download and display each image
    for i, img_obj in enumerate(samples):
        try:
            # Download image from S3
            response = s3_client.get_object(Bucket=bucket, Key=img_obj['Key'])
            img_data = response['Body'].read()
            
            # Open image with PIL
            img = Image.open(io.BytesIO(img_data))
            
            # Display image
            axes[i].imshow(img)
            axes[i].set_title(os.path.basename(img_obj['Key']))
            axes[i].axis('off')
            
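        except Exception as e:
            print(f"Error displaying image {img_obj['Key']}: {e}")
            # Place the message at the centre of the axes regardless of data limits
            axes[i].text(0.5, 0.5, "Error loading image", ha='center', va='center',
                         transform=axes[i].transAxes)
            axes[i].axis('off')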
    
    plt.tight_layout()
    plt.show()

# Display sample images
if raw_images:
    display_sample_images(BUCKET_NAME, raw_images, num_samples=4)
else:
    print("No images found to display")

### 1.2 Basic Image Analysis

Let's analyze some basic characteristics of the images in our dataset.

In [None]:
# Function to analyze image characteristics
def analyze_images(bucket, image_objects, sample_size=20):
    """Analyze basic characteristics of images"""
    # Limit to sample size
    samples = image_objects[:min(sample_size, len(image_objects))]
    
    # Initialize lists to store image characteristics
    widths = []
    heights = []
    aspect_ratios = []
    file_sizes = []
    formats = []
    
    print(f"Analyzing {len(samples)} sample images...")
    
    # Process each image
    for img_obj in samples:
        try:
            # Download image from S3
            response = s3_client.get_object(Bucket=bucket, Key=img_obj['Key'])
            img_data = response['Body'].read()
            
            # Open image with PIL and read its properties first, so a failure
            # here leaves all of the result lists the same length
            img = Image.open(io.BytesIO(img_data))
            width, height = img.size
            
            # Record file size in MB
            file_sizes.append(len(img_data) / (1024 * 1024))
            
            # Record image dimensions
            widths.append(width)
            heights.append(height)
            
            # Record aspect ratio (width / height)
            aspect_ratios.append(width / height)
            
            # Record image format (fall back to a label if PIL cannot determine it)
            formats.append(img.format or "unknown")
            
        except Exception as e:
            print(f"Error analyzing image {img_obj['Key']}: {str(e)}")
    
    # Calculate statistics
    stats = {
        'count': len(widths),
        'avg_width': np.mean(widths) if widths else 0,
        'avg_height': np.mean(heights) if heights else 0,
        'min_width': min(widths) if widths else 0,
        'max_width': max(widths) if widths else 0,
        'min_height': min(heights) if heights else 0,
        'max_height': max(heights) if heights else 0,
        'avg_aspect_ratio': np.mean(aspect_ratios) if aspect_ratios else 0,
        'avg_file_size': np.mean(file_sizes) if file_sizes else 0,
        'formats': list(set(formats)) if formats else []
    }
    
    return {
        'stats': stats,
        'widths': widths,
        'heights': heights,
        'aspect_ratios': aspect_ratios,
        'file_sizes': file_sizes,
        'formats': formats
    }

# Analyze sample images
if raw_images:
    analysis_results = analyze_images(BUCKET_NAME, raw_images, sample_size=20)
    
    # Display statistics
    stats = analysis_results['stats']
    print("\nImage Statistics:")
    print(f"Total images analyzed: {stats['count']}")
    print(f"Average dimensions: {stats['avg_width']:.1f}x{stats['avg_height']:.1f} pixels")
    print(f"Dimension range: {stats['min_width']}x{stats['min_height']} to {stats['max_width']}x{stats['max_height']} pixels")
    print(f"Average aspect ratio: {stats['avg_aspect_ratio']:.2f}")
    print(f"Average file size: {stats['avg_file_size']:.2f} MB")
    print(f"Image formats: {', '.join(stats['formats'])}")
else:
    print("No images found to analyze")
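
`analyze_images` takes the first `sample_size` objects, and S3 lists keys in lexicographic order, so the statistics may describe only one folder or flight. A minimal sketch of drawing a reproducible random sample instead follows; the helper name `sample_objects` and the default `seed` are assumptions made for illustration.

```python
import random


def sample_objects(image_objects, sample_size=20, seed=42):
    """Return a reproducible random sample of the listed objects."""
    if len(image_objects) <= sample_size:
        # Fewer objects than requested: use all of them
        return list(image_objects)
    # A dedicated Random instance keeps the global random state untouched
    return random.Random(seed).sample(image_objects, sample_size)
```

The result can be passed straight to the existing function, for example `analyze_images(BUCKET_NAME, sample_objects(raw_images), sample_size=20)`.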

### 1.3 Visualize Image Characteristics

Let's create some visualizations to better understand our dataset.

In [None]:
# Visualize image characteristics
if 'analysis_results' in locals() and analysis_results['stats']['count'] > 0:
    # Create a figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Plot image dimensions
    axes[0, 0].scatter(analysis_results['widths'], analysis_results['heights'])
    axes[0, 0].set_xlabel('Width (pixels)')
    axes[0, 0].set_ylabel('Height (pixels)')
    axes[0, 0].set_title('Image Dimensions')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Plot aspect ratio distribution
    axes[0, 1].hist(analysis_results['aspect_ratios'], bins=10)
    axes[0, 1].set_xlabel('Aspect Ratio (width/height)')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].set_title('Aspect Ratio Distribution')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Plot file size distribution
    axes[1, 0].hist(analysis_results['file_sizes'], bins=10)
    axes[1, 0].set_xlabel('File Size (MB)')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].set_title('File Size Distribution')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Plot format distribution
    format_counts = {}
    for fmt in analysis_results['formats']:
        if fmt in format_counts:
            format_counts[fmt] += 1
        else:
            format_counts[fmt] = 1
    
    formats = list(format_counts.keys())
    counts = list(format_counts.values())
    
    axes[1, 1].bar(formats, counts)
    axes[1, 1].set_xlabel('Image Format')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].set_title('Image Format Distribution')
    
    plt.tight_layout()
    plt.show()
else:
    print("No analysis results available for visualization")

## 2. Data Preparation for YOLOv11 Training

Now let's prepare our data for YOLOv11 training. We first check whether labeled data already exists in the bucket, then organize the data in the structure YOLOv11 expects.

In [None]:
# Function to check if labeled data exists
def check_labeled_data(bucket, prefix="labeled-data/"):
    """Check if labeled data exists in the bucket"""
    objects = list_s3_objects(bucket, prefix=prefix)
    
    if objects:
        print(f"Found {len(objects)} objects in labeled data directory")
        
        # Group by job name (assuming directory structure)
        jobs = {}
        for obj in objects:
            key = obj['Key']
            parts = key.split('/')
            if len(parts) > 2:
                job_name = parts[1]
                if job_name not in jobs:
                    jobs[job_name] = []
                jobs[job_name].append(key)
        
        # Display job information
        if jobs:
            print(f"\nFound {len(jobs)} labeling jobs:")
            for job, files in jobs.items():
                print(f"  - {job}: {len(files)} files")
        
        return jobs
    else:
        print("No labeled data found")
        return {}

# Check for labeled data
labeled_jobs = check_labeled_data(BUCKET_NAME)
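
The grouping above treats the second path segment as the job name. If the jobs were written by SageMaker Ground Truth, each job normally stores its consolidated labels at `<job-name>/manifests/output/output.manifest` under the output prefix. A sketch of locating those manifests from a list of keys follows; `find_output_manifests` is a hypothetical helper, and the `labeled-data/` layout is assumed from the cell above.

```python
def find_output_manifests(keys, prefix="labeled-data/"):
    """Map job name -> key of its Ground Truth output manifest, if present."""
    suffix = "/manifests/output/output.manifest"
    manifests = {}
    for key in keys:
        if key.startswith(prefix) and key.endswith(suffix):
            job_name = key[len(prefix):-len(suffix)]
            # Only accept manifests stored directly under <prefix><job_name>/
            if job_name and "/" not in job_name:
                manifests[job_name] = key
    return manifests
```

Because `check_labeled_data` returns lists of keys per job, one possible call is `find_output_manifests([k for files in labeled_jobs.values() for k in files])`.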

### 2.1 Prepare Data Structure for YOLOv11

YOLOv11 expects each split to keep images and their label files in parallel `images/` and `labels/` directories. Let's create that structure in S3.

In [None]:
# Function to create YOLO dataset structure
def prepare_yolo_structure(bucket, job_name=None):
    """Prepare YOLO dataset structure in S3 (job_name is accepted but not used yet)"""
    # Define dataset structure
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dataset_name = f"yolov11_dataset_{timestamp}"
    
    # Define directories
    base_prefix = f"datasets/{dataset_name}/"
    train_prefix = f"{base_prefix}train/"
    val_prefix = f"{base_prefix}val/"
    
    # S3 has no real directories; zero-byte objects whose keys end in "/"
    # act as folder placeholders (and show up as folders in the console)
    for prefix in [train_prefix, val_prefix]:
        for subdir in ["images/", "labels/"]:
            full_prefix = f"{prefix}{subdir}"
            s3_client.put_object(Bucket=bucket, Key=full_prefix)
    
    print(f"Created YOLO dataset structure at s3://{bucket}/{base_prefix}")
    print("\nDirectory structure:")
    print(f"s3://{bucket}/{base_prefix}")
    print("├── train/")
    print("│   ├── images/")
    print("│   └── labels/")
    print("└── val/")
    print("    ├── images/")
    print("    └── labels/")
    
    return {
        'dataset_name': dataset_name,
        'base_prefix': base_prefix,
        'train_prefix': train_prefix,
        'val_prefix': val_prefix
    }

# Create YOLO dataset structure
yolo_structure = prepare_yolo_structure(BUCKET_NAME)
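
Once labeled images exist, each one has to land under either the `train/` or the `val/` prefix created above, with its label file at the matching path under `labels/`. A minimal sketch of planning that assignment follows; `plan_split`, the 20% validation fraction, and the `seed` are assumptions, and the function only computes destination keys without touching S3.

```python
import posixpath
import random


def plan_split(image_keys, train_prefix, val_prefix, val_fraction=0.2, seed=42):
    """Assign each image to train or val and compute its destination keys."""
    keys = sorted(image_keys)  # sort first so the shuffle does not depend on input order
    random.Random(seed).shuffle(keys)
    n_val = int(round(len(keys) * val_fraction))
    plan = []
    for i, key in enumerate(keys):
        prefix = val_prefix if i < n_val else train_prefix
        filename = posixpath.basename(key)
        stem = posixpath.splitext(filename)[0]
        plan.append({
            "source": key,
            "image_key": f"{prefix}images/{filename}",
            # The label file shares the image's stem and uses a .txt extension
            "label_key": f"{prefix}labels/{stem}.txt",
        })
    return plan
```

The prefixes can come from the cell above, for example `plan_split(keys, yolo_structure['train_prefix'], yolo_structure['val_prefix'])`. Each planned image could then be copied with `s3_client.copy_object`. Note that this sketch keys on file names only, so two source images with the same name in different folders would collide.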

### 2.2 Data Preparation Instructions

To complete the data preparation for YOLOv11 training, follow these steps:

1. **Create Ground Truth Labeling Job**:
   - Use the `create_labeling_job.ipynb` notebook to create a labeling job
   - Label your images with bounding boxes for objects of interest

2. **Convert Ground Truth Output to YOLOv11 Format**:
   - After labeling is complete, convert the output to YOLOv11 format
   - Use the conversion function in the `create_labeling_job.ipynb` notebook

3. **Split Data into Train/Validation Sets**:
   - Split your labeled data into training and validation sets
   - Typically use 80% for training and 20% for validation

4. **Upload Data to YOLO Structure**:
   - Upload images to the `images/` directories
   - Upload corresponding label files to the `labels/` directories

5. **Create Dataset Configuration File**:
   - Create a YAML configuration file for your dataset
   - Specify paths to train and validation data
   - Define class names and IDs
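
To illustrate step 2: a Ground Truth bounding-box job writes one JSON object per line to its output manifest, with boxes given in pixels as `left`, `top`, `width`, and `height`, while YOLO label files hold `class x_center y_center width height` normalized to the image size. The sketch below is not the conversion function from `create_labeling_job.ipynb`; `manifest_line_to_yolo` is a hypothetical helper, and `label_attribute` stands for whatever label attribute name the job was created with.

```python
import json


def manifest_line_to_yolo(line, label_attribute):
    """Convert one Ground Truth bounding-box manifest line to YOLO label rows."""
    record = json.loads(line)
    label = record[label_attribute]
    size = label["image_size"][0]
    img_w, img_h = size["width"], size["height"]
    rows = []
    for box in label["annotations"]:
        # Ground Truth stores the top-left corner plus size in pixels;
        # YOLO wants the box centre and size as fractions of the image
        x_center = (box["left"] + box["width"] / 2) / img_w
        y_center = (box["top"] + box["height"] / 2) / img_h
        rows.append(
            f"{box['class_id']} {x_center:.6f} {y_center:.6f} "
            f"{box['width'] / img_w:.6f} {box['height'] / img_h:.6f}"
        )
    # source-ref is the S3 URI of the image the rows belong to
    return record["source-ref"], rows
```

Joining the returned rows with newlines gives the content of the `.txt` label file for that image. An image with no annotations yields an empty list, which corresponds to an empty label file.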

Once these steps are complete, your data will be ready for YOLOv11 training.
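
The configuration file from step 5 is short enough to generate from code. The sketch below assumes the Ultralytics dataset YAML layout (`path`, `train`, `val`, `names`) and the `train/` and `val/` structure used in this notebook; `build_dataset_yaml` is a hypothetical helper, and `dataset_root` is assumed to be the local directory the dataset is downloaded to for training, not an S3 URI.

```python
def build_dataset_yaml(dataset_root, class_names):
    """Return the text of an Ultralytics-style dataset YAML file."""
    lines = [
        f"path: {dataset_root}",   # dataset root directory
        "train: train/images",     # training images, relative to path
        "val: val/images",         # validation images, relative to path
        "names:",
    ]
    # Class IDs are assigned in list order and must match the label files
    lines += [f"  {class_id}: {name}" for class_id, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"
```

The order of `class_names` has to match the class IDs used during labeling, since the label files refer to classes by ID only.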

## 3. Summary and Next Steps

In this notebook, we've explored the drone imagery dataset and prepared the structure for YOLOv11 training. Here's a summary of what we've accomplished:

1. **Data Exploration**:
   - Listed and displayed sample images from the S3 bucket
   - Analyzed image characteristics (dimensions, aspect ratios, file sizes)
   - Visualized image statistics

2. **Data Preparation**:
   - Checked for existing labeled data
   - Created YOLO dataset structure in S3
   - Provided instructions for completing data preparation

### Next Steps

1. **Create or use existing Ground Truth labeling jobs** to annotate your images
2. **Convert labeled data to YOLOv11 format** using the provided tools
3. **Organize your data** in the YOLO structure we created
4. **Proceed to model training** using the ML Engineer notebook

For more detailed functionality, refer to the comprehensive notebooks in the `notebooks/` directory.