# Enhanced S3 to COG Converter with Automatic AWS Authentication

This notebook converts TIF files from S3 to Cloud Optimized GeoTIFFs (COGs) with:
- **Automatic AWS credential detection** (no .env file needed)
- **Download caching** to avoid re-downloading large files
- **COG validation** before uploading
- **Support for multiple AWS authentication methods**

Author: Kyle Lesinger (Enhanced version)

In [17]:
import os
import pandas as pd
import json
import tempfile
import boto3
import rasterio
import rioxarray as rxr
import s3fs
import fsspec
from rasterio.warp import calculate_default_transform, reproject, Resampling
from botocore.exceptions import NoCredentialsError, ClientError
from pathlib import Path
from datetime import datetime
import time

print("✅ Libraries imported successfully!")
print(f"Boto3 version: {boto3.__version__}")

✅ Libraries imported successfully!
Boto3 version: 1.37.3


In [18]:
# Add path for importing custom modules
import sys
from pathlib import Path

# Add the scripts directory to the Python path
scripts_dir = Path('../scripts').resolve()
if str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))

# Import functions from list_s3crawler_files module
from list_s3crawler_files import (
    load_drcs_data,
    get_tif_files_from_path,
    get_files_with_full_paths,
    list_available_directories
)

print("✅ Custom S3 crawler functions imported successfully!")
print(f"   Module path: {scripts_dir}")

✅ Custom S3 crawler functions imported successfully!
   Module path: /home/jovyan/conversion_scripts/convert-files-and-move/scripts


# Useful links

[drcs_activations OLD Directory](https://data.disasters.openveda.cloud/browseui/browseui/#drcs_activations/)

[VEDA docs for file naming conventions](https://docs.openveda.cloud/user-guide/content-curation/dataset-ingestion/file-preparation.html)

## List of new 2nd level directories

    "Sentinel-1"
    "Sentinel-2"
    "Landsat"
    "MODIS"
    "VIIRS"
    "ASTER"
    "MASTER"
    "ECOSTRESS"
    "Planet"
    "Maxar"
    "HLS"
    "IMERG"
    "GOES"
    "SMAP"
    "ICESat"
    "GEDI"
    "COMSAR"
    "UAVSAR"
    "WB-57"

In [19]:
# DO NOT CHANGE
DIR_OLD_BASE = 'drcs_activations'
DIR_NEW_BASE = 'drcs_activations_new'

In [33]:
EVENT_NAME = '202405_Flood_TX'
PRODUCT_NAME = 'sentinel1'

RENAME_PRODUCT = 'Sentinel-1'

PATH_OLD = f'{DIR_OLD_BASE}/{EVENT_NAME}/{PRODUCT_NAME}'  # Updated to use actual available directory
DIRECTORY_NEW = f'{DIR_NEW_BASE}/{RENAME_PRODUCT}'

## Load TIF Files from DRCS Data

This cell loads the pre-analyzed DRCS activation data from `drcs_activations_tif_files.json`, which contains a complete inventory of all .tif files in the NASA Disasters S3 bucket.

The code will:
1. Load the JSON file containing the file inventory
2. Parse the `PATH_OLD` variable to find the corresponding directory
3. Extract all .tif filenames from that directory
4. Store them in `files_to_process` for later use

In [30]:
# Load the pre-analyzed DRCS TIF files data using imported functions
# The JSON path is relative to the notebook location
json_path = Path('../../s3-crawler/drcs_activations_tif_files.json')

# Load DRCS data
drcs_data = load_drcs_data(json_path)

if drcs_data:
    # Get TIF files from the specified PATH_OLD using the imported function
    tif_files = get_tif_files_from_path(PATH_OLD, drcs_data, DIR_OLD_BASE)
    
    if tif_files:
        print(f"\n📁 Found {len(tif_files)} .tif files in {PATH_OLD}:")
        print("\nFirst 10 files:")
        for i, file in enumerate(tif_files[:10], 1):
            print(f"  {i:2d}. {file}")
        if len(tif_files) > 10:
            print(f"  ... and {len(tif_files) - 10} more files")
        
        # Get files with full paths using the imported function
        files_to_process = get_files_with_full_paths(PATH_OLD, drcs_data, DIR_OLD_BASE, json_path)
        print(f"\n✅ Files ready for processing. Stored in 'files_to_process' variable.")
    else:
        print(f"\n❌ No files found. Please check the PATH_OLD variable.")
        files_to_process = []
else:
    print(f"\n❌ Could not load DRCS data.")
    files_to_process = []

✅ Loaded DRCS data from ../../s3-crawler/drcs_activations_tif_files.json

📁 Found 11 .tif files in drcs_activations/202405_Flood_TX/sentinel1:

First 10 files:
   1. S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif
   2. S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif
   3. S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif
   4. S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif
   5. S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif
   6. S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif
   7. S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif
   8. S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif
   9. S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif
  10. S1_20240430_20240507_WM_diff.tif
  ... and 1 more files

✅ Files ready for processing. Stored in 'files_to_process' variable.


In [31]:
files_to_process

['drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_a

In [32]:
# # Example: List available activation events using the imported function
# print("📂 Available activation events in DRCS data:")
# events = list_available_directories('drcs_activations', drcs_data, json_path)

# # Show first 10 events
# for event in events[:10]:
#     print(f"  - {event}")
# if len(events) > 10:
#     print(f"  ... and {len(events) - 10} more events")

# # Example: List subdirectories for a specific event
# print(f"\n📁 Subdirectories in {EVENT_NAME}:")
# subdirs = list_available_directories(f'drcs_activations/{EVENT_NAME}', drcs_data, json_path)
# for subdir in subdirs:
#     print(f"  - {subdir}")

# These files fall into three types

1. WM = water mask
2. rgb = red, green, blue
3. WM_diff = water mask difference between two dates

### These need two separate target directories

Water mask (WM) and rgb files are kept in separate directories.

In [25]:
# For simplicity, let's use python list comprehension to return the files
# We may need to rename them in different ways
# We will do a similar process later

water_mask = [f for f in files_to_process if "_WM.tif" in f]
rgb = [f for f in files_to_process if "rgb.tif" in f]
water_mask_diff = [f for f in files_to_process if "WM_diff.tif" in f]
water_mask_diff

['drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240507_20240512_WM_diff.tif']

In [None]:
# NOTE: for the diff files, "diff" must be placed directly in front of the first date.
# Otherwise VEDA will think that the first date is the most important.

# Example: S1_diff20240430_20240507_WM.tif
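The renaming rule in the note above can be sketched as a small function. This is only an illustration: `rename_diff_file` is a hypothetical helper that is not part of the notebook's scripts, and it assumes every diff file follows the `<platform>_<date1>_<date2>_WM_diff.tif` pattern seen in the listing.

```python
from pathlib import Path


def rename_diff_file(filename):
    """Move the trailing 'diff' marker in front of the first date.

    Assumes the stem looks like '<platform>_<date1>_<date2>_WM_diff'.
    """
    path = Path(filename)
    parts = path.stem.split("_")
    if len(parts) < 3 or parts[-1] != "diff":
        raise ValueError(f"Not a diff filename: {filename}")
    parts = parts[:-1]              # drop the trailing 'diff' token
    parts[1] = f"diff{parts[1]}"    # prefix the first date with 'diff'
    return "_".join(parts) + path.suffix


print(rename_diff_file("S1_20240430_20240507_WM_diff.tif"))
# S1_diff20240430_20240507_WM.tif
```

The function returns only the file name, so a full S3 key can be passed in and the directory part is discarded.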

In [34]:
config_WM = {
    "data_acquisition_method": "s3",
    "raw_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters", #DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/WM",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}

config_rgb = {
    "data_acquisition_method": "s3",
    "raw_data_bucket": "nasa-disasters",  # DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters", #DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/rgb",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}
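The two configs differ only in the target subdirectory, so they could also be generated from one helper. The sketch below is a possible refactoring, not the notebook's own code: `make_config` is a hypothetical name, and the literal paths are the values that `PATH_OLD`, `DIRECTORY_NEW` and `EVENT_NAME` take in this run.

```python
def make_config(path_old, directory_new, event_name, subdir):
    """Build a conversion config; only the target subdirectory varies."""
    return {
        "data_acquisition_method": "s3",
        "raw_data_bucket": "nasa-disasters",
        "raw_data_prefix": path_old,
        "cog_data_bucket": "nasa-disasters",
        "cog_data_prefix": f"{directory_new}/{subdir}",
        "local_output_dir": f"output/{event_name}",
        "transformation": {},
    }


# One config per target subdirectory, keyed by the subdirectory name.
configs = {
    subdir: make_config(
        "drcs_activations/202405_Flood_TX/sentinel1",
        "drcs_activations_new/Sentinel-1",
        "202405_Flood_TX",
        subdir,
    )
    for subdir in ("WM", "rgb")
}

print(configs["rgb"]["cog_data_prefix"])
```

Keeping the shared keys in one place reduces the chance that the WM and rgb configs drift apart when a bucket or prefix changes.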

## Initialize AWS S3 Client with automatic credential detection

In [35]:
# Initialize AWS S3 Client with automatic credential detection
try:
    # Create S3 client - will automatically use AWS credentials from:
    # 1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    # 2. AWS CLI configuration (~/.aws/credentials)
    # 3. IAM role (if running on EC2/Lambda)
    s3_client = boto3.client('s3')
    
    # Test connection
    try:
        # Try to list buckets (might fail due to permissions, that's OK)
        response = s3_client.list_buckets()
        print(f"✅ S3 client initialized successfully")
        print(f"   Found {len(response.get('Buckets', []))} accessible buckets")
    except ClientError as e:
        if e.response['Error']['Code'] == 'AccessDenied':
            print(f"⚠️ S3 client initialized (limited bucket list access)")
            # Test access to nasa-disasters bucket specifically
            try:
                s3_client.head_bucket(Bucket='nasa-disasters')
                print(f"✅ Confirmed access to nasa-disasters bucket")
            except ClientError as bucket_error:
                print(f"❌ Cannot access nasa-disasters bucket: {bucket_error}")
                s3_client = None
        else:
            raise
    
    # Also initialize fsspec filesystem for S3
    fs_read = fsspec.filesystem("s3", anon=False, skip_instance_cache=False)
    print(f"✅ S3 filesystem (fsspec) initialized")
    
except NoCredentialsError:
    print("❌ No AWS credentials found. Please configure credentials using:")
    print("   - Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)")
    print("   - AWS CLI: aws configure")
    print("   - IAM role (if on EC2)")
    s3_client = None
    fs_read = None
except Exception as e:
    print(f"❌ Error initializing S3 client: {e}")
    s3_client = None
    fs_read = None

⚠️ S3 client initialized (limited bucket list access)
✅ Confirmed access to nasa-disasters bucket
✅ S3 filesystem (fsspec) initialized


In [36]:
# Verify S3 client is ready
if s3_client is None:
    print("❌ S3 client not initialized. Please check your AWS credentials.")
    print("   The notebook will not be able to download files from S3.")
else:
    print("✅ S3 client ready for operations")
    print("   Bucket: nasa-disasters")
    print("   Ready to process files")

✅ S3 client ready for operations
   Bucket: nasa-disasters
   Ready to process files


## Configure bucket and paths (no need to create session manually)

In [37]:
# Configure bucket and paths (no need to create session manually)
bucket_name = config_WM["cog_data_bucket"]
raw_data_bucket = config_WM["raw_data_bucket"]
raw_data_prefix = config_WM["raw_data_prefix"]

cog_data_bucket = config_WM['cog_data_bucket']
cog_data_prefix = config_WM["cog_data_prefix"]

print(f"Configuration loaded:")
print(f"  Source bucket: {raw_data_bucket}")
print(f"  Source prefix: {raw_data_prefix}")
print(f"  Target bucket: {cog_data_bucket}")
print(f"  Target prefix: {cog_data_prefix}")

Configuration loaded:
  Source bucket: nasa-disasters
  Source prefix: drcs_activations/202405_Flood_TX/sentinel1
  Target bucket: nasa-disasters
  Target prefix: drcs_activations_new/Sentinel-1/WM


In [40]:
def get_all_s3_keys(bucket, prefix, ext):
    """List every key under `prefix` in `bucket` that ends with `ext`.

    Keys containing "historical" are skipped. Returns an empty list if the
    S3 client is not initialized.
    """
    if s3_client is None:
        print("❌ S3 client not initialized")
        return []

    keys = []
    kwargs = {"Bucket": bucket, "Prefix": f"{prefix}/"}
    while True:
        try:
            resp = s3_client.list_objects_v2(**kwargs)
        except ClientError as e:
            print(f"❌ Error listing objects: {e}")
            break

        for obj in resp.get("Contents", []):
            if obj["Key"].endswith(ext) and "historical" not in obj["Key"]:
                keys.append(obj["Key"])

        # Results are paginated; keep requesting pages while a token is returned
        token = resp.get("NextContinuationToken")
        if not token:
            break
        kwargs["ContinuationToken"] = token

    return keys

# Get all TIF files
keys = get_all_s3_keys(raw_data_bucket, raw_data_prefix, ".tif") if s3_client else []
if keys:
    print(f"✅ Found {len(keys)} .tif files in the S3 bucket.")
else:
    print("No keys found or S3 client not initialized")
    
keys

✅ Found 11 .tif files in the S3 bucket.


['drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_a
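The continuation-token loop used for listing can be exercised without AWS access. The sketch below is an assumption-laden illustration: `FakeS3Client` is a hypothetical stand-in that mimics only the parts of the `list_objects_v2` response the loop reads (`Contents`, `NextContinuationToken`), and `list_keys` is a simplified variant of the function above that takes the client as an argument.

```python
class FakeS3Client:
    """Hypothetical stand-in for an S3 client that serves keys in small pages."""

    def __init__(self, keys, page_size=2):
        self._keys = list(keys)
        self._page_size = page_size

    def list_objects_v2(self, Bucket, Prefix, ContinuationToken=None):
        matching = [k for k in self._keys if k.startswith(Prefix)]
        start = int(ContinuationToken or 0)
        end = start + self._page_size
        resp = {"Contents": [{"Key": k} for k in matching[start:end]]}
        if len(matching) > end:
            # More keys remain, so hand back a token for the next page
            resp["NextContinuationToken"] = str(end)
        return resp


def list_keys(client, bucket, prefix, ext):
    """Collect every key under prefix that ends with ext, following pagination."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": f"{prefix}/"}
    while True:
        resp = client.list_objects_v2(**kwargs)
        keys.extend(o["Key"] for o in resp.get("Contents", []) if o["Key"].endswith(ext))
        token = resp.get("NextContinuationToken")
        if not token:
            return keys
        kwargs["ContinuationToken"] = token


client = FakeS3Client([f"event/s1/file{i}.tif" for i in range(5)] + ["event/s1/notes.txt"])
found = list_keys(client, "any-bucket", "event/s1", ".tif")
print(len(found))
```

With a page size of two, the six stored keys arrive in three pages, and the `.txt` file is filtered out by the extension check.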

In [48]:
def convert_sentinel_datetime(datetime_str):
    """
    Convert Sentinel datetime format to ISO 8601 format with UTC timezone.
    
    Args:
        datetime_str: String like '20240430T002653'
    
    Returns:
        String like '2024-04-30T00:26:53Z'
    """
    # Extract components
    year = datetime_str[0:4]
    month = datetime_str[4:6]
    day = datetime_str[6:8]
    hour = datetime_str[9:11]
    minute = datetime_str[11:13]
    second = datetime_str[13:15]
    
    # Format with dashes and colons, add Z for UTC
    return f"{year}-{month}-{day}T{hour}:{minute}:{second}Z"

# Test
datetime_str = '20240430T002653'
result = convert_sentinel_datetime(datetime_str)
print(result)  # 2024-04-30T00:26:53Z

2024-04-30T00:26:53Z
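The slicing approach above accepts any string, including date-only tokens such as those in the `WM_diff` file names, and silently produces a malformed result. A stricter variant, sketched here with the standard library, raises an error instead. `convert_sentinel_datetime_strict` is a hypothetical name and not a function used elsewhere in the notebook.

```python
from datetime import datetime, timezone


def convert_sentinel_datetime_strict(datetime_str):
    """Parse 'YYYYMMDDTHHMMSS' and return ISO 8601 with a trailing 'Z'.

    Raises ValueError if the string does not match the expected layout.
    """
    parsed = datetime.strptime(datetime_str, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(convert_sentinel_datetime_strict("20240430T002653"))
# 2024-04-30T00:26:53Z
```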


In [55]:
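def create_cog_filename_WM(f, EVENT_NAME):
    """Build the COG filename for a water mask (WM) file.

    The acquisition timestamp (third token of the source name) is moved to the
    end in ISO 8601 form, and the trailing "WM" token is dropped.
    """
    fsplit = Path(f).stem.split('_')

    cog_filename = f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_{convert_sentinel_datetime(fsplit[2])}.tif'

    return cog_filename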

# Test function below
create_cog_filename_WM(f='drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif', EVENT_NAME=EVENT_NAME)

'202405_Flood_TX_S1A_IW_DVR_RTC20_G_gpuned_0610_2024-04-30T00:26:53Z.tif'
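To make the index arithmetic in the filename function easier to follow, the snippet below splits the same test key into its tokens. The variable names are illustrative labels chosen here, not terminology from the source files.

```python
from pathlib import Path

key = (
    "drcs_activations/202405_Flood_TX/sentinel1/"
    "S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif"
)
parts = Path(key).stem.split("_")

platform_mode = parts[0:2]   # kept at the front of the new name
timestamp = parts[2]         # converted to ISO 8601 and moved to the end
product_tokens = parts[3:8]  # kept in the middle of the new name
dropped = parts[8:]          # not carried over into the new name

print(platform_mode, timestamp, product_tokens, dropped)
```

For this key the dropped token is `WM`, which is consistent with the output above: the product type is carried by the target prefix rather than by the file name.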

In [56]:
# f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_{convert_sentinel_datetime(fsplit[2])}.tif'

In [57]:
# Define COG profile for rasterio
COG_PROFILE = {
    "driver": "COG",
    "compress": "DEFLATE",
}

In [58]:
def validate_cog(filepath):
    """
    Validate that a file is a proper Cloud Optimized GeoTIFF.
    
    Args:
        filepath: Path to the file to validate
    
    Returns:
        tuple: (is_valid, details_dict) where is_valid is boolean and 
               details_dict contains validation information
    """
    import rasterio
    from rasterio.env import Env
    
    validation_details = {
        'is_cog': False,
        'has_tiles': False,
        'has_overviews': False,
        'tile_size': None,
        'overview_levels': [],
        'compression': None,
        'driver': None,
        'errors': []
    }
    
    try:
        with Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR'):
            with rasterio.open(filepath) as src:
                # Check driver
                validation_details['driver'] = src.driver
                
                # Check if it's a GeoTIFF
                if src.driver != 'GTiff' and src.driver != 'COG':
                    validation_details['errors'].append(f"Invalid driver: {src.driver}, expected GTiff or COG")
                    return False, validation_details
                
                # Check for tiling
                if src.profile.get('tiled', False):
                    validation_details['has_tiles'] = True
                    validation_details['tile_size'] = (
                        src.profile.get('blockxsize', 0),
                        src.profile.get('blockysize', 0)
                    )
                else:
                    validation_details['errors'].append("File is not tiled")
                
                # Check for overviews
                overviews = src.overviews(1)  # Check band 1
                if overviews:
                    validation_details['has_overviews'] = True
                    validation_details['overview_levels'] = overviews
                else:
                    validation_details['errors'].append("No overviews found")
                
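                # Check compression (normalized to upper case, since rasterio
                # may report the name in lower case, e.g. 'deflate')
                compression = src.profile.get('compress', None)
                if compression is not None:
                    compression = str(compression).upper()
                validation_details['compression'] = compression
                if compression not in ['DEFLATE', 'LZW', 'ZSTD', 'WEBP', 'JPEG']:
                    validation_details['errors'].append(f"Compression '{compression}' may not be optimal for COG")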
                
                # Check if file structure is cloud optimized
                # A COG should have IFD (Image File Directory) offsets arranged properly
                # This is a simplified check - true COG validation would check IFD ordering
                is_likely_cog = (
                    validation_details['has_tiles'] and 
                    validation_details['has_overviews'] and
                    validation_details['compression'] in ['DEFLATE', 'LZW', 'ZSTD', 'WEBP', 'JPEG']
                )
                
                validation_details['is_cog'] = is_likely_cog
                
                # Additional check for internal structure
                if hasattr(src, 'is_tiled') and src.is_tiled:
                    # Check tile size is reasonable (typically 256 or 512)
                    tile_x, tile_y = validation_details['tile_size']
                    if tile_x not in [256, 512, 1024] or tile_y not in [256, 512, 1024]:
                        validation_details['errors'].append(f"Non-standard tile size: {tile_x}x{tile_y}")
                
                return is_likely_cog, validation_details
                
    except Exception as e:
        validation_details['errors'].append(f"Validation error: {str(e)}")
        return False, validation_details

print("✅ COG validation function added")

✅ COG validation function added
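The pass/fail decision inside `validate_cog` comes down to three conditions: tiling, overviews, and a recognized compression. The sketch below isolates that rule so it can be checked without opening a raster. `looks_like_cog` is a hypothetical helper, and the case-insensitive compression match is an assumption added here. As the function's own comments note, this is a heuristic and does not inspect IFD ordering.

```python
COG_COMPRESSIONS = {"DEFLATE", "LZW", "ZSTD", "WEBP", "JPEG"}


def looks_like_cog(has_tiles, has_overviews, compression):
    """Return True only if all three heuristic conditions hold."""
    normalized = str(compression).upper() if compression else None
    return bool(has_tiles and has_overviews and normalized in COG_COMPRESSIONS)


print(looks_like_cog(True, True, "deflate"))
print(looks_like_cog(True, False, "DEFLATE"))
```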


## Download Cache Management

The updated `convert_to_proper_CRS_and_cogify` function caches downloads locally:

1. **First-time download**: Files are downloaded from S3 and saved to `data_download/` directory
2. **Subsequent runs**: Files are loaded from the local cache, avoiding re-download
3. **Cache structure**: Preserves the original S3 directory structure for organization

Benefits:
- ⚡ **Faster processing** on repeated runs
- 💾 **Bandwidth savings** - large files aren't re-downloaded
- 🔄 **Resumable** - if processing fails, downloads are preserved
- 📁 **Organized** - maintains S3 directory structure locally
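The cache lookup described above can be sketched as a mapping from an S3 key to a local path. This is a minimal illustration under stated assumptions: `cache_path_for` and `is_cached` are hypothetical helpers, the key shown is a made-up example, and treating an empty file as "not cached" is an extra safeguard added here rather than something the notebook's function checks up front.

```python
import os
import tempfile


def cache_path_for(s3_key, cache_dir="data_download"):
    """Map an S3 key to its local cache path, keeping the key's directory layout."""
    return os.path.join(cache_dir, *s3_key.split("/"))


def is_cached(s3_key, cache_dir="data_download"):
    """A file counts as cached only if it exists and is non-empty."""
    path = cache_path_for(s3_key, cache_dir)
    return os.path.isfile(path) and os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as cache_dir:
    key = "drcs_activations/202405_Flood_TX/sentinel1/example.tif"
    print(is_cached(key, cache_dir))        # nothing downloaded yet

    target = cache_path_for(key, cache_dir)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as fh:
        fh.write(b"placeholder bytes")      # stands in for a real download
    print(is_cached(key, cache_dir))
```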

In [59]:
# Cache utilities
def check_cache_status():
    """Check the status of the download cache."""
    data_download_dir = "data_download"
    
    if not os.path.exists(data_download_dir):
        print(f"📁 Cache directory does not exist: {data_download_dir}/")
        return
    
    # Count files and calculate total size
    total_files = 0
    total_size = 0
    file_list = []
    
    for root, dirs, files in os.walk(data_download_dir):
        for file in files:
            if file.endswith('.tif'):
                file_path = os.path.join(root, file)
                file_size = os.path.getsize(file_path)
                total_files += 1
                total_size += file_size
                file_list.append((os.path.relpath(file_path, data_download_dir), file_size))
    
    print(f"📊 Cache Status:")
    print(f"  - Directory: {data_download_dir}/")
    print(f"  - Total files: {total_files}")
    print(f"  - Total size: {total_size / (1024**3):.2f} GB")
    
    if file_list:
        print(f"\n📁 Cached files (first 10):")
        for file_path, file_size in sorted(file_list)[:10]:
            print(f"  - {file_path} ({file_size / (1024**2):.1f} MB)")
        if len(file_list) > 10:
            print(f"  ... and {len(file_list) - 10} more files")
    
    return total_files, total_size

def clear_cache(confirm=False):
    """Clear the download cache."""
    data_download_dir = "data_download"
    
    if not os.path.exists(data_download_dir):
        print(f"Cache directory does not exist: {data_download_dir}/")
        return
    
    if not confirm:
        print("⚠️ This will delete all cached downloads!")
        print(f"Directory: {data_download_dir}/")
        print("To confirm, run: clear_cache(confirm=True)")
        return
    
    import shutil
    shutil.rmtree(data_download_dir)
    print(f"✅ Cache cleared: {data_download_dir}/ removed")

# Check current cache status
check_cache_status()

📁 Cache directory does not exist: data_download/


In [60]:
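def convert_to_proper_CRS_and_cogify(name, cog_filename, cog_data_bucket, cog_data_prefix, local_output_dir=None):
    """Download one S3 object (or reuse the cached copy), reproject it to EPSG:4326,
    write it as a COG, validate it, and upload it under cog_data_prefix in cog_data_bucket.

    Relies on the notebook-level `s3_client` and `raw_data_bucket` variables.
    """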
    s3_key = f"{cog_data_prefix}/{cog_filename}"
    reproject_filename = f"reproj/{cog_filename}"
    
    # Create necessary directories
    os.makedirs("reproj", exist_ok=True)
    
    # Create data_download directory at the same level as the script
    data_download_dir = "data_download"
    os.makedirs(data_download_dir, exist_ok=True)
    
    # Create subdirectory structure to match S3 path
    # This preserves the original directory structure for better organization
    s3_path_parts = name.split('/')
    local_subdir = os.path.join(data_download_dir, *s3_path_parts[:-1])
    os.makedirs(local_subdir, exist_ok=True)
    
    # Local path for the downloaded file (persistent storage)
    local_download_path = os.path.join(data_download_dir, name)
    
    # Temporary file for processing (will be cleaned up)
    temp_input_file = f"temp_{os.path.basename(name)}"

    try:
        # Check if file already exists locally
        if os.path.exists(local_download_path):
            print(f"[CACHE HIT] Found existing file: {local_download_path}")
            print(f"[CACHE] File size: {os.path.getsize(local_download_path):,} bytes")
            # Copy from local cache to temp file for processing
            import shutil
            shutil.copy(local_download_path, temp_input_file)
        else:
            # Download the file from S3
            print(f"[DOWNLOAD] Downloading {name} from S3...")
            print(f"[DOWNLOAD] Target: {local_download_path}")
            
            # Download directly to the persistent location
            s3_client.download_file(raw_data_bucket, name, local_download_path)
            print(f"[DOWNLOAD] ✅ Saved to cache: {local_download_path}")
            print(f"[DOWNLOAD] File size: {os.path.getsize(local_download_path):,} bytes")
            
            # Copy to temp file for processing
            import shutil
            shutil.copy(local_download_path, temp_input_file)
        
        # Reproject using the temp file
        print(f"[REPROJECT] {name} → {reproject_filename} (EPSG:4326)")
        with rasterio.open(temp_input_file) as src:
            # Check current CRS
            print(f"[REPROJECT] Source CRS: {src.crs}")
            
            dst_crs = "EPSG:4326"
            
            # Check if reprojection is needed
            if src.crs and src.crs.to_string() == dst_crs:
                print(f"[REPROJECT] Already in {dst_crs}, skipping reprojection")
                # Just copy the file
                import shutil
                shutil.copy(temp_input_file, reproject_filename)
            else:
                transform, width, height = calculate_default_transform(
                    src.crs, dst_crs, src.width, src.height, *src.bounds
                )
                kwargs = src.meta.copy()
                kwargs.update({
                    "driver": "COG",                 # write a COG instead of plain GTiff
                    "compress": "DEFLATE",           # or "LZW"
                    "crs": dst_crs,
                    "transform": transform,
                    "width": width,
                    "height": height
                })

                with rasterio.open(f"{reproject_filename}", "w", **kwargs) as dst:
                    for band_idx in range(1, src.count + 1):
                        reproject(
                            source=rasterio.band(src, band_idx),
                            destination=rasterio.band(dst, band_idx),
                            src_transform=src.transform,
                            src_crs=src.crs,
                            dst_transform=transform,
                            dst_crs=dst_crs,
                            resampling=Resampling.nearest,
                            wrapdateline=True
                        )

        # COGify & upload
        print(f"[COGIFY] {reproject_filename} → s3://{cog_data_bucket}/{s3_key}")
        ds = rxr.open_rasterio(reproject_filename)
        
        # Handle coordinate naming based on what's present
        if "y" in ds.dims and "x" in ds.dims:
            ds = ds.rename({"y": "lat", "x": "lon"})
            ds.rio.set_spatial_dims("lon", "lat", inplace=True)
        elif "lat" not in ds.dims or "lon" not in ds.dims:
            print(f"[COGIFY] Warning: Unexpected dimension names: {list(ds.dims)}")
        
        ds.rio.write_nodata(-9999, inplace=True)

        with tempfile.NamedTemporaryFile(suffix='.tif', delete=False) as tmp:
            tmp_name = tmp.name
            ds.rio.to_raster(tmp_name, **COG_PROFILE)
            
            # Validate COG before uploading
            print(f"[VALIDATE] Checking if {cog_filename} is a valid COG...")
            is_valid_cog, validation_details = validate_cog(tmp_name)
            
            if is_valid_cog:
                print(f"[VALIDATE] ✅ Valid COG confirmed:")
                print(f"  - Tiled: {validation_details['has_tiles']} (tile size: {validation_details['tile_size']})")
                print(f"  - Overviews: {len(validation_details['overview_levels'])} levels {validation_details['overview_levels']}")
                print(f"  - Compression: {validation_details['compression']}")
            else:
                print(f"[VALIDATE] ⚠️ COG validation warnings:")
                if validation_details['errors']:
                    for error in validation_details['errors']:
                        print(f"  - {error}")
                
                # Decide whether to continue or fail based on severity
                critical_errors = [e for e in validation_details['errors'] if 'Invalid driver' in e]
                if critical_errors:
                    raise ValueError(f"Critical COG validation failed: {', '.join(critical_errors)}")
                else:
                    print(f"[VALIDATE] Proceeding with upload despite warnings...")
            
            # Upload to S3
            s3_client.upload_file(
                Filename = tmp_name, 
                Bucket = cog_data_bucket, 
                Key = s3_key)
            print(f"[SUCCESS] Uploaded to s3://{cog_data_bucket}/{s3_key}")
            
            # Save locally if output directory is specified (this is for COGs, separate from downloads)
            if local_output_dir:
                os.makedirs(local_output_dir, exist_ok=True)
                local_path = os.path.join(local_output_dir, cog_filename)
                
                # Copy the COG file to local directory
                import shutil
                shutil.copy(tmp_name, local_path)
                print(f"[LOCAL SAVE] Saved COG to {local_path}")
            
    except Exception as e:
        print(f"[ERROR] Failed to process {name}: {str(e)}")
        # Remove an empty (incomplete) download so the next run fetches it again
        if os.path.exists(local_download_path) and os.path.getsize(local_download_path) == 0:
            os.remove(local_download_path)
            print(f"[CLEANUP] Removed incomplete download: {local_download_path}")
        raise
            
    finally:
        # Clean up temporary files (but NOT the cached download)
        if os.path.exists(temp_input_file):
            os.remove(temp_input_file)
            print(f"[CLEANUP] Removed temporary input file {temp_input_file}")
            
        # Clean up local intermediate
        if os.path.exists(reproject_filename):
            os.remove(reproject_filename)
            print(f"[CLEANUP] Removed intermediate {reproject_filename}")
            
        # Clean up temp COG file
        if 'tmp_name' in locals() and os.path.exists(tmp_name):
            os.remove(tmp_name)
            print(f"[CLEANUP] Removed temporary COG file")
        
        # Report cache status
        print(f"[CACHE] Downloaded files are preserved in: {data_download_dir}/")

In [61]:
water_mask = [f for f in keys if "_WM.tif" in f]
rgb = [f for f in keys if "rgb.tif" in f]
water_mask_diff = [f for f in keys if "WM_diff.tif" in f]
water_mask_diff

['drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240507_20240512_WM_diff.tif']

In [None]:
# Initialize DataFrame to track processed files
files_processed = pd.DataFrame(columns=["file_name", "COGs_created"])

# Get local output directory from config
local_output_dir = config_WM.get("local_output_dir")

# Create output directories
if local_output_dir:
    os.makedirs(local_output_dir, exist_ok=True)
    print(f"Local COGs will be saved to: {local_output_dir}")

# Process the water mask files only: config_WM and create_cog_filename_WM
# apply to the "_WM.tif" files, not to the rgb or WM_diff files
for name in sorted(water_mask):
    cog_filename = create_cog_filename_WM(name, EVENT_NAME)
    print(f"\nProcessing: {name}")
    print(f"Output filename: {cog_filename}")
    
    # Process the file with local output directory
    convert_to_proper_CRS_and_cogify(name, cog_filename, cog_data_bucket, cog_data_prefix, local_output_dir)
    
    # Add to tracking DataFrame (row added by label; avoids the private _append method)
    files_processed.loc[len(files_processed)] = [name, cog_filename]
    print(f"Generated and saved COG: {cog_filename}")

print("\nDone generating COGs")
if local_output_dir:
    print(f"COGs saved locally to: {local_output_dir}")

In [None]:
# Save metadata if there are processed files
if len(files_processed) > 0:
    # Get metadata from one of the processed files
    sample_file = files_processed.iloc[0]['file_name']
    temp_sample_file = f"temp_{os.path.basename(sample_file)}"
    
    # Download sample file to extract metadata
    s3_client.download_file(raw_data_bucket, sample_file, temp_sample_file)
    
    with rasterio.open(temp_sample_file) as src:
        metadata = {
            "description": src.tags(),
            "driver": src.driver,
            "dtype": str(src.dtypes[0]),
            "nodata": src.nodata,
            "width": src.width,
            "height": src.height,
            "count": src.count,
            "crs": str(src.crs),
            "transform": list(src.transform),
            "bounds": list(src.bounds),
            "total_files_processed": len(files_processed),
            "year": EVENT_NAME[:4]  # event year taken from the EVENT_NAME prefix, e.g. "2024"
        }
    
    # Upload metadata
    with tempfile.NamedTemporaryFile(mode="w+") as fp:
        json.dump(metadata, fp, indent=2)
        fp.flush()
        
        s3_client.upload_file(
            Filename=fp.name,
            Bucket=bucket_name,
            Key=f"{cog_data_prefix}/metadata.json",
        )
        print(f"Uploaded metadata to s3://{bucket_name}/{cog_data_prefix}/metadata.json")
    
    # Clean up sample file
    if os.path.exists(temp_sample_file):
        os.remove(temp_sample_file)

# Save the files_processed DataFrame to CSV using the same s3_client
with tempfile.NamedTemporaryFile(mode="w+", suffix=".csv") as fp:
    files_processed.to_csv(fp.name, index=False)
    fp.flush()
    
    s3_client.upload_file(
        Filename=fp.name,
        Bucket=bucket_name,
        Key=f"{cog_data_prefix}/files_converted.csv",
    )
    print(f"Saved processing log to s3://{bucket_name}/{cog_data_prefix}/files_converted.csv")

In [None]:
# Display summary
print(f"\nProcessing Summary:")
print(f"Total files found: {len(keys)}")
print(f"Files processed: {len(files_processed)}")
print(f"\nProcessed files:")
files_processed

## Enhanced Features in This Version

This enhanced notebook includes several improvements over the original:

### 🔐 **Automatic AWS Authentication**
- No need for `.env` files or manual credential configuration
- Automatically detects credentials from:
  - Environment variables
  - AWS CLI configuration
  - IAM roles (EC2/Lambda)

### 🚀 **Simplified Setup**
- Removed dependency on `python-dotenv`
- Direct boto3 client initialization
- Better error handling for authentication issues

### 📊 **Additional Features**
- fsspec filesystem integration for alternative S3 operations
- Graceful handling of limited S3 permissions
- Download caching to avoid re-downloading large files
- COG validation before upload
- Comprehensive error messages

### 💡 **Usage Tips**
1. Ensure AWS credentials are configured via one of the standard methods
2. The notebook will automatically detect and use available credentials
3. Check the authentication cell output to confirm S3 access
4. Use the cache management utilities to monitor downloaded files
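Before running the authentication cell, it can help to see which credential sources appear to be configured. The sketch below uses only the standard library and is an illustration, not part of the notebook: `describe_credential_sources` is a hypothetical helper, it inspects only environment variables and the shared credentials file, and it cannot detect an IAM role or confirm that any credentials are valid.

```python
import os


def describe_credential_sources(env=None, home=None):
    """List which common credential sources appear to be configured.

    Only the environment and the file system are inspected; AWS is not contacted.
    """
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home
    sources = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    if env.get("AWS_PROFILE"):
        sources.append(f"named profile: {env['AWS_PROFILE']}")
    if os.path.isfile(os.path.join(home, ".aws", "credentials")):
        sources.append("shared credentials file")
    return sources


print(describe_credential_sources())
```

An empty list does not necessarily mean access will fail, since credentials may still come from an IAM role attached to the compute environment.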

This enhanced version relies on boto3's default credential lookup instead of a `.env` file, which makes the notebook more portable and easier to use across different environments.