# Enhanced S3 to COG Converter with Automatic AWS Authentication

This notebook converts TIF files from S3 to Cloud Optimized GeoTIFFs (COGs) with:
- **Automatic AWS credential detection** (no .env file needed)
- **Download caching** to avoid re-downloading large files
- **COG validation** before uploading
- **Support for multiple AWS authentication methods**

Author: Kyle Lesinger (Enhanced version)

In [74]:
import os
import pandas as pd
import json
import tempfile
import boto3
import rasterio
import rioxarray as rxr
import s3fs
import fsspec
from rasterio.warp import calculate_default_transform, reproject, Resampling
from botocore.exceptions import NoCredentialsError, ClientError
from pathlib import Path
from datetime import datetime
import time

print("✅ Libraries imported successfully!")
print(f"Boto3 version: {boto3.__version__}")

✅ Libraries imported successfully!
Boto3 version: 1.37.3


In [None]:
# Add path for importing custom modules
import sys
from pathlib import Path

# Add the scripts directory to the Python path
scripts_dir = Path('../scripts').resolve()
if str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))

# Import functions from list_s3crawler_files module
from list_s3crawler_files import (
    load_drcs_data,
    get_tif_files_from_path,
    get_files_with_full_paths,
    list_available_directories
)

# Import COG and cache utilities
from cog_utilities import (
    check_cache_status,
    clear_cache,
    validate_cog
)

# Import AWS S3 utilities
from aws_s3_utils import (
    initialize_s3_client,
    verify_s3_client,
    get_all_s3_keys
)

# Import batch processing utilities
from batch_processing import (
    process_file_batch,
    print_batch_summary
)

print("✅ Custom modules imported successfully!")
print(f"   Module path: {scripts_dir}")

# Useful links

[drcs_activations OLD Directory](https://data.disasters.openveda.cloud/browseui/browseui/#drcs_activations/)

[VEDA docs for file naming conventions](https://docs.openveda.cloud/user-guide/content-curation/dataset-ingestion/file-preparation.html)

## List of new 2nd level directories

    "Sentinel-1"
    "Sentinel-2"
    "Landsat"
    "MODIS"
    "VIIRS"
    "ASTER"
    "MASTER"
    "ECOSTRESS"
    "Planet"
    "Maxar"
    "HLS"
    "IMERG"
    "GOES"
    "SMAP"
    "ICESat"
    "GEDI"
    "COMSAR"
    "UAVSAR"
    "WB-57"

In [76]:
# DO NOT CHANGE
DIR_OLD_BASE = 'drcs_activations'
DIR_NEW_BASE = 'drcs_activations_new'

In [77]:
EVENT_NAME = '202405_Flood_TX'
PRODUCT_NAME = 'sentinel1'

RENAME_PRODUCT = 'Sentinel-1'

PATH_OLD = f'{DIR_OLD_BASE}/{EVENT_NAME}/{PRODUCT_NAME}'  # Updated to use actual available directory
DIRECTORY_NEW = f'{DIR_NEW_BASE}/{RENAME_PRODUCT}'

## Load TIF Files from DRCS Data

This cell loads the pre-analyzed DRCS activation data from `drcs_activations_tif_files.json` which contains a complete inventory of all .tif files in the NASA Disasters S3 bucket.

The code will:
1. Load the JSON file containing the file inventory
2. Parse the `PATH_OLD` variable to find the corresponding directory
3. Extract all .tif filenames from that directory
4. Store them in `files_to_process` for later use

In [78]:
# Load the pre-analyzed DRCS TIF files data using imported functions
# The JSON path is relative to the notebook location
json_path = Path('../../s3-crawler/drcs_activations_tif_files.json')

# Load DRCS data
drcs_data = load_drcs_data(json_path)

if drcs_data:
    # Get TIF files from the specified PATH_OLD using the imported function
    tif_files = get_tif_files_from_path(PATH_OLD, drcs_data, DIR_OLD_BASE)
    
    if tif_files:
        print(f"\n📁 Found {len(tif_files)} .tif files in {PATH_OLD}:")
        print("\nFirst 10 files:")
        for i, file in enumerate(tif_files[:10], 1):
            print(f"  {i:2d}. {file}")
        if len(tif_files) > 10:
            print(f"  ... and {len(tif_files) - 10} more files")
        
        # Get files with full paths using the imported function
        files_to_process = get_files_with_full_paths(PATH_OLD, drcs_data, DIR_OLD_BASE, json_path)
        print(f"\n✅ Files ready for processing. Stored in 'files_to_process' variable.")
    else:
        print(f"\n❌ No files found. Please check the PATH_OLD variable.")
        files_to_process = []
else:
    print(f"\n❌ Could not load DRCS data.")
    files_to_process = []

✅ Loaded DRCS data from ../../s3-crawler/drcs_activations_tif_files.json

📁 Found 11 .tif files in drcs_activations/202405_Flood_TX/sentinel1:

First 10 files:
   1. S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif
   2. S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif
   3. S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif
   4. S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif
   5. S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif
   6. S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif
   7. S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif
   8. S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif
   9. S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif
  10. S1_20240430_20240507_WM_diff.tif
  ... and 1 more files

✅ Files ready for processing. Stored in 'files_to_process' variable.


In [79]:
files_to_process

['drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_a

In [80]:
# # Example: List available activation events using the imported function
# print("📂 Available activation events in DRCS data:")
# events = list_available_directories('drcs_activations', drcs_data, json_path)

# # Show first 10 events
# for event in events[:10]:
#     print(f"  - {event}")
# if len(events) > 10:
#     print(f"  ... and {len(events) - 10} more events")

# # Example: List subdirectories for a specific event
# print(f"\n📁 Subdirectories in {EVENT_NAME}:")
# subdirs = list_available_directories(f'drcs_activations/{EVENT_NAME}', drcs_data, json_path)
# for subdir in subdirs:
#     print(f"  - {subdir}")

# The listing above contains three types of files

1. WM = water mask
2. rgb = red, green, blue
3. WM_diff = water mask difference between two dates

### Each file type needs its own output directory

WM, rgb, and WM_diff files are kept in separate directories (`WM`, `rgb`, and `WM_diff`).
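
The suffix rules above can be sketched as a small classifier. This is only an illustration: `classify_product` is a hypothetical helper (it is not part of the imported modules), and it assumes every file name ends in one of the three suffixes listed.

```python
from pathlib import PurePosixPath


def classify_product(key: str) -> str:
    """Return 'WM_diff', 'WM', 'rgb', or 'unknown' for an S3 key (hypothetical helper)."""
    stem = PurePosixPath(key).stem  # file name without directories or extension
    # Check the most specific suffix first so diff files are not mistaken for plain WM
    if stem.endswith("_WM_diff"):
        return "WM_diff"
    if stem.endswith("_WM"):
        return "WM"
    if stem.endswith("_rgb"):
        return "rgb"
    return "unknown"


examples = [
    "drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif",
    "drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif",
    "drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif",
]
for key in examples:
    print(f"{classify_product(key):8s} <- {PurePosixPath(key).name}")
```

Matching on the end of the file stem, rather than on a substring anywhere in the path, avoids accidental matches if a directory name ever contains `WM` or `rgb`.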

In [87]:
# Split the file list by product type using list comprehensions.
# Different products may need to be renamed in different ways,
# so a similar split is repeated later on the S3 keys.

# NOTE: These entries have the same paths as the S3 objects, so they can be used directly. They are built again later.

water_mask = [f for f in files_to_process if "_WM.tif" in f]
rgb = [f for f in files_to_process if "rgb.tif" in f]
water_mask_diff = [f for f in files_to_process if "WM_diff.tif" in f]
water_mask_diff


['drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240507_20240512_WM_diff.tif']

In [82]:
# NOTE: For the diff files, "diff" must be inserted directly before the first date.
# Otherwise VEDA will treat the first date as the most important one.

# Example: S1_diff20240430_20240507_WM.tif

In [83]:
config_WM = {
    "data_acquisition_method": "s3",
    "raw_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/WM",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}

config_rgb = {
    "data_acquisition_method": "s3",
    "raw_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/rgb",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}

In [None]:
# Add configuration for water mask diff files
config_WM_diff = {
    "data_acquisition_method": "s3",
    "raw_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters", #DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/WM_diff",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}

## Configure bucket and paths (no need to create session manually)

In [37]:
# Configure bucket and paths (no need to create session manually)
bucket_name = config_WM["cog_data_bucket"]
raw_data_bucket = config_WM["raw_data_bucket"]
raw_data_prefix = config_WM["raw_data_prefix"]

cog_data_bucket = config_WM['cog_data_bucket']
cog_data_prefix = config_WM["cog_data_prefix"]

print(f"Configuration loaded:")
print(f"  Source bucket: {raw_data_bucket}")
print(f"  Source prefix: {raw_data_prefix}")
print(f"  Target bucket: {cog_data_bucket}")
print(f"  Target prefix: {cog_data_prefix}")

Configuration loaded:
  Source bucket: nasa-disasters
  Source prefix: drcs_activations/202405_Flood_TX/sentinel1
  Target bucket: nasa-disasters
  Target prefix: drcs_activations_new/Sentinel-1/WM


## Initialize AWS S3 Client with automatic credential detection

In [84]:
# Initialize AWS S3 Client using the imported function
s3_client, fs_read = initialize_s3_client(bucket_name='nasa-disasters', verbose=True)

⚠️ S3 client initialized (limited bucket list access)
✅ Confirmed access to nasa-disasters bucket
✅ S3 filesystem (fsspec) initialized


In [85]:
# Verify S3 client is ready using the imported function
verify_s3_client(s3_client, bucket_name='nasa-disasters', verbose=True)

✅ S3 client ready for operations
   Bucket: nasa-disasters
   Ready to process files


True

In [86]:
# Get all TIF files using the imported function
keys = get_all_s3_keys(s3_client, raw_data_bucket, raw_data_prefix, ".tif") if s3_client else []

if keys:
    print(f"✅ Found {len(keys)} .tif files in the S3 bucket.")
else:
    print("No keys found or S3 client not initialized")
    
keys

✅ Found 11 .tif files in the S3 bucket.


['drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_a

In [48]:
def convert_sentinel_datetime(datetime_str):
    """
    Convert Sentinel datetime format to ISO 8601 format with UTC timezone.
    
    Args:
        datetime_str: String like '20240430T002653'
    
    Returns:
        String like '2024-04-30T00:26:53Z'
    """
    # Extract components
    year = datetime_str[0:4]
    month = datetime_str[4:6]
    day = datetime_str[6:8]
    hour = datetime_str[9:11]
    minute = datetime_str[11:13]
    second = datetime_str[13:15]
    
    # Format with dashes and colons, add Z for UTC
    return f"{year}-{month}-{day}T{hour}:{minute}:{second}Z"

# Test
datetime_str = '20240430T002653'
result = convert_sentinel_datetime(datetime_str)
print(result)  # 2024-04-30T00:26:53Z

2024-04-30T00:26:53Z
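
The slicing approach above trusts that the input is well formed. As an alternative sketch, the standard library's `datetime.strptime` can do the same conversion while rejecting malformed strings. The function name `convert_sentinel_datetime_strict` is hypothetical and is not used elsewhere in this notebook.

```python
from datetime import datetime, timezone


def convert_sentinel_datetime_strict(datetime_str: str) -> str:
    """Parse 'YYYYMMDDTHHMMSS' and return ISO 8601 with a trailing 'Z' (hypothetical variant)."""
    # strptime raises ValueError for malformed input or impossible dates
    parsed = datetime.strptime(datetime_str, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(convert_sentinel_datetime_strict("20240430T002653"))  # 2024-04-30T00:26:53Z

try:
    convert_sentinel_datetime_strict("not-a-date")
except ValueError as err:
    print(f"Rejected: {err}")
```

The trade-off is that a bad file name now stops processing with an exception instead of silently producing a garbled timestamp.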


In [None]:
def create_cog_filename_WM(f, EVENT_NAME):
    """Create COG filename for water mask files."""
    f2 = Path(f).stem
    fsplit = f2.split('_')
    
    # Check if it's a diff file
    if "WM_diff" in f:
        # For diff files: S1_20240430_20240507_WM_diff.tif
        # Need to add "diff" before the first date
        # Result: 202405_Flood_TX_S1_diff20240430_20240507_WM.tif
        return f'{EVENT_NAME}_S1_diff{fsplit[1]}_{fsplit[2]}_WM.tif'
    else:
        # Regular WM files
        cog_filename = f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_{convert_sentinel_datetime(fsplit[2])}.tif'
        return cog_filename

def create_cog_filename_rgb(f, EVENT_NAME):
    """Create COG filename for RGB files."""
    f2 = Path(f).stem
    fsplit = f2.split('_')
    
    # For RGB files, similar to WM but keep rgb suffix
    cog_filename = f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_rgb_{convert_sentinel_datetime(fsplit[2])}.tif'
    return cog_filename

def create_cog_filename_diff(f, EVENT_NAME):
    """Create COG filename for water mask diff files."""
    f2 = Path(f).stem
    fsplit = f2.split('_')
    
    # For diff files: S1_20240430_20240507_WM_diff.tif
    # Need to add "diff" before the first date
    # Result: 202405_Flood_TX_S1_diff20240430_20240507_WM.tif
    return f'{EVENT_NAME}_S1_diff{fsplit[1]}_{fsplit[2]}_WM.tif'

# Test functions
print("Testing WM filename:")
print(create_cog_filename_WM('drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif', EVENT_NAME))

print("\nTesting RGB filename:")
print(create_cog_filename_rgb('drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif', EVENT_NAME))

print("\nTesting diff filename:")
print(create_cog_filename_diff('drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif', EVENT_NAME))
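
The filename builders above pick fields by position (`fsplit[0:2]`, `fsplit[3:8]`), which depends on every scene name having the same number of underscore-separated parts. A possible alternative, shown only as a sketch, is to parse the name into named fields with a regular expression. The pattern, the group names, and `parse_scene_name` are all assumptions based on the example file names in this notebook, not an official naming specification.

```python
import re
from pathlib import PurePosixPath

# Assumed layout, based on names like
# S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif
SCENE_PATTERN = re.compile(
    r"^(?P<platform>S1[A-Z])_(?P<mode>[A-Z]{2})_(?P<acquired>\d{8}T\d{6})_"
    r"(?P<processing>.+)_(?P<product>WM|rgb)$"
)


def parse_scene_name(key: str) -> dict:
    """Split a scene file name into named fields (hypothetical helper)."""
    stem = PurePosixPath(key).stem
    match = SCENE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unrecognized scene file name: {stem}")
    return match.groupdict()


fields = parse_scene_name(
    "drcs_activations/202405_Flood_TX/sentinel1/"
    "S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif"
)
for name, value in fields.items():
    print(f"{name:10s} = {value}")
```

Diff files such as `S1_20240430_20240507_WM_diff.tif` deliberately do not match this pattern and raise `ValueError`, so they would still need their own handling.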

In [56]:
# f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_{convert_sentinel_datetime(fsplit[2])}.tif'

In [57]:
# Define COG profile for rasterio
COG_PROFILE = {
    "driver": "COG",
    "compress": "DEFLATE",
}

## Define COG Conversion Function

This function handles the conversion of files to Cloud Optimized GeoTIFFs with proper CRS and caching.
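
The caching idea can be shown in isolation before reading the full function. The sketch below is not the notebook's implementation: `fetch_with_cache` and `fake_download` are hypothetical names, and `fake_download` stands in for a real S3 download so the example runs without AWS access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_with_cache(key: str, cache_dir: Path, download: Callable[[str, Path], None]) -> Path:
    """Return the local path for `key`, calling `download` only on a cache miss."""
    local_path = cache_dir / key  # mirror the S3 key layout under the cache directory
    if local_path.exists():
        return local_path  # cache hit: nothing to download
    local_path.parent.mkdir(parents=True, exist_ok=True)
    download(key, local_path)
    return local_path


# Stand-in for an S3 download; records each call so cache hits are visible
calls: List[str] = []


def fake_download(key: str, dest: Path) -> None:
    calls.append(key)
    dest.write_bytes(b"raster bytes")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data_download"
    fetch_with_cache("event/product/scene.tif", cache, fake_download)
    fetch_with_cache("event/product/scene.tif", cache, fake_download)  # served from cache
    print(f"downloads performed: {len(calls)}")
```

Passing the download step in as a function keeps the caching logic independent of boto3, which also makes it easy to test.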

In [None]:
def convert_to_proper_CRS_and_cogify(name, cog_filename, cog_data_bucket, cog_data_prefix, local_output_dir=None):
    """
    Convert a file to Cloud Optimized GeoTIFF with proper CRS.
    
    This function includes:
    - Download caching to avoid re-downloading files
    - CRS reprojection to EPSG:4326
    - COG validation before upload
    - Upload to S3
    """
    s3_key = f"{cog_data_prefix}/{cog_filename}"
    reproject_filename = f"reproj/{cog_filename}"
    
    # Create necessary directories
    os.makedirs("reproj", exist_ok=True)
    
    # Create data_download directory for caching
    data_download_dir = "data_download"
    os.makedirs(data_download_dir, exist_ok=True)
    
    # Create subdirectory structure to match S3 path
    s3_path_parts = name.split('/')
    local_subdir = os.path.join(data_download_dir, *s3_path_parts[:-1])
    os.makedirs(local_subdir, exist_ok=True)
    
    # Local path for the downloaded file (persistent storage)
    local_download_path = os.path.join(data_download_dir, name)
    
    # Temporary file for processing
    temp_input_file = f"temp_{os.path.basename(name)}"

    try:
        import shutil  # used for the local file copies below

        # Download from S3 only when the file is not already cached locally
        if os.path.exists(local_download_path):
            print(f"   [CACHE HIT] Using cached file: {local_download_path}")
        else:
            print(f"   [DOWNLOAD] Downloading from S3...")
            s3_client.download_file(raw_data_bucket, name, local_download_path)
            print(f"   [DOWNLOAD] ✅ Saved to cache")

        # Work on a temporary copy so the cached download stays untouched
        shutil.copy(local_download_path, temp_input_file)
        
        # Reproject to EPSG:4326
        print(f"   [REPROJECT] Converting to EPSG:4326...")
        with rasterio.open(temp_input_file) as src:
            dst_crs = "EPSG:4326"
            
            # Check if reprojection is needed
            if src.crs and src.crs.to_string() == dst_crs:
                print(f"   [REPROJECT] Already in {dst_crs}, skipping reprojection")
                import shutil
                shutil.copy(temp_input_file, reproject_filename)
            else:
                transform, width, height = calculate_default_transform(
                    src.crs, dst_crs, src.width, src.height, *src.bounds
                )
                kwargs = src.meta.copy()
                kwargs.update({
                    "driver": "COG",
                    "compress": "DEFLATE",
                    "crs": dst_crs,
                    "transform": transform,
                    "width": width,
                    "height": height
                })

                with rasterio.open(reproject_filename, "w", **kwargs) as dst:
                    for band_idx in range(1, src.count + 1):
                        reproject(
                            source=rasterio.band(src, band_idx),
                            destination=rasterio.band(dst, band_idx),
                            src_transform=src.transform,
                            src_crs=src.crs,
                            dst_transform=transform,
                            dst_crs=dst_crs,
                            resampling=Resampling.nearest,
                            wrapdateline=True
                        )

        # COGify & upload
        print(f"   [COGIFY] Creating COG...")
        ds = rxr.open_rasterio(reproject_filename)
        
        # Handle coordinate naming
        if "y" in ds.dims and "x" in ds.dims:
            ds = ds.rename({"y": "lat", "x": "lon"})
            ds.rio.set_spatial_dims("lon", "lat", inplace=True)
        
        ds.rio.write_nodata(-9999, inplace=True)

        with tempfile.NamedTemporaryFile(suffix='.tif', delete=False) as tmp:
            tmp_name = tmp.name
            ds.rio.to_raster(tmp_name, **COG_PROFILE)
            
            # Validate COG
            print(f"   [VALIDATE] Checking COG validity...")
            is_valid_cog, validation_details = validate_cog(tmp_name)
            
            if is_valid_cog:
                print(f"   [VALIDATE] ✅ Valid COG")
            else:
                print(f"   [VALIDATE] ⚠️ COG validation warnings")
                critical_errors = [e for e in validation_details['errors'] if 'Invalid driver' in e]
                if critical_errors:
                    raise ValueError(f"Critical COG validation failed: {critical_errors}")
            
            # Upload to S3
            print(f"   [UPLOAD] Uploading to S3...")
            s3_client.upload_file(
                Filename=tmp_name,
                Bucket=cog_data_bucket,
                Key=s3_key
            )
            print(f"   [SUCCESS] ✅ Uploaded to s3://{cog_data_bucket}/{s3_key}")
            
            # Save locally if specified
            if local_output_dir:
                os.makedirs(local_output_dir, exist_ok=True)
                local_path = os.path.join(local_output_dir, cog_filename)
                import shutil
                shutil.copy(tmp_name, local_path)
            
    except Exception as e:
        print(f"   [ERROR] Failed: {str(e)}")
        raise
            
    finally:
        # Clean up temporary files
        for temp_file in [temp_input_file, reproject_filename]:
            if os.path.exists(temp_file):
                os.remove(temp_file)
        if 'tmp_name' in locals() and os.path.exists(tmp_name):
            os.remove(tmp_name)

print("✅ COG conversion function defined")

In [88]:
# Check current cache status using the imported function
check_cache_status()

📁 Cache directory does not exist: data_download/


(0, 0)
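
`check_cache_status` comes from `cog_utilities`, whose source is not shown here. As a rough sketch of what such a check might do, the hypothetical `summarize_cache` below walks a directory and returns a pair. It assumes the `(0, 0)` above means (file count, total bytes), which is a guess rather than something the notebook confirms.

```python
import os
import tempfile
from typing import Tuple


def summarize_cache(cache_dir: str = "data_download") -> Tuple[int, int]:
    """Return (file_count, total_bytes) for all files under cache_dir (hypothetical helper)."""
    if not os.path.isdir(cache_dir):
        return 0, 0  # nothing cached yet
    file_count, total_bytes = 0, 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes


# Demonstrate on a throwaway directory instead of the real cache
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "scene.tif"), "wb") as fh:
        fh.write(b"\x00" * 1024)
    count, size = summarize_cache(tmp)
    print(f"{count} file(s), {size} bytes")
```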

In [None]:
## Process files using batch processing function

# Separate files by type
water_mask = [f for f in keys if "_WM.tif" in f and "WM_diff" not in f]
rgb = [f for f in keys if "rgb.tif" in f]
water_mask_diff = [f for f in keys if "WM_diff.tif" in f]

print("📊 File categorization:")
print(f"  - Water mask files: {len(water_mask)}")
print(f"  - RGB files: {len(rgb)}")
print(f"  - Water mask diff files: {len(water_mask_diff)}")
print(f"  - Total files: {len(keys)}")

# Initialize combined results DataFrame
all_files_processed = pd.DataFrame()

# Process water mask files
if water_mask:
    print("\n" + "="*50)
    print("🌊 Processing Water Mask Files")
    print("="*50)
    
    wm_results = process_file_batch(
        file_list=water_mask,
        s3_client=s3_client,
        config=config_WM,
        filename_creator_func=create_cog_filename_WM,
        processing_func=convert_to_proper_CRS_and_cogify,
        event_name=EVENT_NAME,
        save_metadata=True,
        save_csv=True,
        verbose=True
    )
    all_files_processed = pd.concat([all_files_processed, wm_results], ignore_index=True)

# Process RGB files
if rgb:
    print("\n" + "="*50)
    print("🎨 Processing RGB Files")
    print("="*50)
    
    rgb_results = process_file_batch(
        file_list=rgb,
        s3_client=s3_client,
        config=config_rgb,
        filename_creator_func=create_cog_filename_rgb,
        processing_func=convert_to_proper_CRS_and_cogify,
        event_name=EVENT_NAME,
        save_metadata=True,
        save_csv=True,
        verbose=True
    )
    all_files_processed = pd.concat([all_files_processed, rgb_results], ignore_index=True)

# Process water mask diff files
if water_mask_diff:
    print("\n" + "="*50)
    print("🔄 Processing Water Mask Diff Files")
    print("="*50)
    
    diff_results = process_file_batch(
        file_list=water_mask_diff,
        s3_client=s3_client,
        config=config_WM_diff,
        filename_creator_func=create_cog_filename_diff,
        processing_func=convert_to_proper_CRS_and_cogify,
        event_name=EVENT_NAME,
        save_metadata=True,
        save_csv=True,
        verbose=True
    )
    all_files_processed = pd.concat([all_files_processed, diff_results], ignore_index=True)

# Print overall summary
print_batch_summary(all_files_processed)

In [None]:
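# Save metadata if there are processed files.
# `all_files_processed` is the combined DataFrame built in the batch-processing cell;
# without this alias, `files_processed` would be undefined here.
files_processed = all_files_processed
if len(files_processed) > 0: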
    # Get metadata from one of the processed files
    sample_file = files_processed.iloc[0]['file_name']
    temp_sample_file = f"temp_{os.path.basename(sample_file)}"
    
    # Download sample file to extract metadata
    s3_client.download_file(raw_data_bucket, sample_file, temp_sample_file)
    
    with rasterio.open(temp_sample_file) as src:
        metadata = {
            "description": src.tags(),
            "driver": src.driver,
            "dtype": str(src.dtypes[0]),
            "nodata": src.nodata,
            "width": src.width,
            "height": src.height,
            "count": src.count,
            "crs": str(src.crs),
            "transform": list(src.transform),
            "bounds": list(src.bounds),
            "total_files_processed": len(files_processed),
            "year": "2000"
        }
    
    # Upload metadata
    with tempfile.NamedTemporaryFile(mode="w+") as fp:
        json.dump(metadata, fp, indent=2)
        fp.flush()
        
        s3_client.upload_file(
            Filename=fp.name,
            Bucket=bucket_name,
            Key=f"{cog_data_prefix}/metadata.json",
        )
        print(f"Uploaded metadata to s3://{bucket_name}/{cog_data_prefix}/metadata.json")
    
    # Clean up sample file
    if os.path.exists(temp_sample_file):
        os.remove(temp_sample_file)

# Save the files_processed DataFrame to CSV using the same s3_client
with tempfile.NamedTemporaryFile(mode="w+", suffix=".csv") as fp:
    files_processed.to_csv(fp.name, index=False)
    fp.flush()
    
    s3_client.upload_file(
        Filename=fp.name,
        Bucket=bucket_name,
        Key=f"{cog_data_prefix}/files_converted.csv",
    )
    print(f"Saved processing log to s3://{bucket_name}/{cog_data_prefix}/files_converted.csv")

In [None]:
# Display final results
print(f"\n📊 Final Processing Results:")
print(f"Total files processed: {len(all_files_processed)}")
print(f"\nProcessed files DataFrame:")
all_files_processed

## Enhanced Features in This Version

This enhanced notebook includes several improvements over the original:

### 🔐 **Automatic AWS Authentication**
- No need for `.env` files or manual credential configuration
- Automatically detects credentials from:
  - Environment variables
  - AWS CLI configuration
  - IAM roles (EC2/Lambda)
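
Before running the notebook, it can help to check which of these sources are visible on the current machine. The sketch below uses only the standard library and the usual AWS environment variable names and `~/.aws` file locations. `visible_credential_sources` is a hypothetical helper: it only reports hints, does not replicate boto3's actual resolution order, and cannot detect IAM roles, which are resolved at request time.

```python
import os
from pathlib import Path
from typing import List, Mapping


def visible_credential_sources(env: Mapping[str, str], home: Path) -> List[str]:
    """List credential sources that appear to be configured (hypothetical helper)."""
    sources = []
    # Static keys exported in the environment
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # A named profile selected through the environment
    if env.get("AWS_PROFILE"):
        sources.append(f"named profile '{env['AWS_PROFILE']}'")
    # Files written by `aws configure`
    aws_dir = home / ".aws"
    if (aws_dir / "credentials").is_file() or (aws_dir / "config").is_file():
        sources.append("shared AWS config files")
    return sources


found = visible_credential_sources(os.environ, Path.home())
print(found or ["none visible; an IAM role may still apply"])
```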

### 🚀 **Simplified Setup**
- Removed dependency on `python-dotenv`
- Direct boto3 client initialization
- Better error handling for authentication issues

### 📊 **Additional Features**
- fsspec filesystem integration for alternative S3 operations
- Graceful handling of limited S3 permissions
- Download caching to avoid re-downloading large files
- COG validation before upload
- Comprehensive error messages

### 💡 **Usage Tips**
1. Ensure AWS credentials are configured via one of the standard methods
2. The notebook will automatically detect and use available credentials
3. Check the authentication cell output to confirm S3 access
4. Use the cache management utilities to monitor downloaded files

Because credentials are resolved through the standard AWS methods rather than a `.env` file, this enhanced version is more portable and easier to use across different environments.