# Enhanced S3 to COG Converter with Automatic AWS Authentication

This notebook converts TIF files from S3 to Cloud Optimized GeoTIFFs (COGs) with:
- **Automatic AWS credential detection** (no .env file needed)
- **Download caching** to avoid re-downloading large files
- **COG validation** before uploading
- **Support for multiple AWS authentication methods**

Author: Kyle Lesinger (Enhanced version)

In [89]:
import os
import pandas as pd
import json
import tempfile
import boto3
import rasterio
import rioxarray as rxr
import s3fs
import fsspec
from rasterio.warp import calculate_default_transform, reproject, Resampling
from botocore.exceptions import NoCredentialsError, ClientError
from pathlib import Path
from datetime import datetime
import time

print("✅ Libraries imported successfully!")
print(f"Boto3 version: {boto3.__version__}")

✅ Libraries imported successfully!
Boto3 version: 1.37.3


In [90]:
# Add path for importing custom modules
import sys
from pathlib import Path

# Add the scripts directory to the Python path
scripts_dir = Path('../scripts').resolve()
if str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))

# Import functions from list_s3crawler_files module
from list_s3crawler_files import (
    load_drcs_data,
    get_tif_files_from_path,
    get_files_with_full_paths,
    list_available_directories
)

# Import COG and cache utilities
from cog_utilities import (
    check_cache_status,
    clear_cache,
    validate_cog
)

# Import AWS S3 utilities
from aws_s3_utils import (
    initialize_s3_client,
    verify_s3_client,
    get_all_s3_keys
)

print("✅ Custom modules imported successfully!")
print(f"   Module path: {scripts_dir}")

✅ Custom modules imported successfully!
   Module path: /home/jovyan/conversion_scripts/convert-files-and-move/scripts


# Useful links

[drcs_activations OLD Directory](https://data.disasters.openveda.cloud/browseui/browseui/#drcs_activations/)

[VEDA docs for file naming conventions](https://docs.openveda.cloud/user-guide/content-curation/dataset-ingestion/file-preparation.html)

## List of new 2nd level directories

    "Sentinel-1"
    "Sentinel-2"
    "Landsat"
    "MODIS"
    "VIIRS"
    "ASTER"
    "MASTER"
    "ECOSTRESS"
    "Planet"
    "Maxar"
    "HLS"
    "IMERG"
    "GOES"
    "SMAP"
    "ICESat"
    "GEDI"
    "COMSAR"
    "UAVSAR"
    "WB-57"

# Start editing here

In [91]:
# DO NOT CHANGE
DIR_OLD_BASE = 'drcs_activations'
DIR_NEW_BASE = 'drcs_activations_new'

In [92]:
EVENT_NAME = '202405_Flood_TX'
PRODUCT_NAME = 'sentinel1'

RENAME_PRODUCT = 'Sentinel-1'

PATH_OLD = f'{DIR_OLD_BASE}/{EVENT_NAME}/{PRODUCT_NAME}'  # Updated to use actual available directory
DIRECTORY_NEW = f'{DIR_NEW_BASE}/{RENAME_PRODUCT}'

## Initialize AWS S3 Client with automatic credential detection

In [122]:
# Initialize AWS S3 Client using the imported function
s3_client, fs_read = initialize_s3_client(bucket_name='nasa-disasters', verbose=True)
# Verify S3 client is ready using the imported function
verify_s3_client(s3_client, bucket_name='nasa-disasters', verbose=True)

# Get all TIF files using the imported function
keys = get_all_s3_keys(s3_client, 'nasa-disasters', PATH_OLD, ".tif") if s3_client else []

if keys:
    print(f"✅ Found {len(keys)} .tif files in the S3 bucket. These are the files we will be processing!")
else:
    print("No keys found or S3 client not initialized")
    
keys

⚠️ S3 client initialized (limited bucket list access)
✅ Confirmed access to nasa-disasters bucket
✅ S3 filesystem (fsspec) initialized
✅ S3 client ready for operations
   Bucket: nasa-disasters
   Ready to process files
✅ Found 11 .tif files in the S3 bucket. These are the files we will be processing!


['drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002719_DVR_RTC20_G_gpuned_F141_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240507T122323_DVR_RTC20_G_gpuned_5BA0_rgb.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002655_DVR_RTC20_G_gpuned_EC9C_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002720_DVR_RTC20_G_gpuned_D32B_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240512T002745_DVR_RTC20_G_gpuned_3F78_WM.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_a

# File types
### The file listing above contains three different types of files

1. WM = water mask
2. rgb = red green blue
3. WM_diff = water mask difference between dates

### These files need two separate output directories

Water mask (WM) and rgb files are kept in separate directories.

In [116]:
# Use list comprehensions to split the keys by file type.
# Different products may need different renaming rules,
# and a similar split is used again later.

# NOTE: These lists hold the full S3 keys, so they can be reused directly when processing the files.

water_mask = [f for f in keys if "_WM.tif" in f]
rgb = [f for f in keys if "rgb.tif" in f]
water_mask_diff = [f for f in keys if "WM_diff.tif" in f]
water_mask_diff


['drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif',
 'drcs_activations/202405_Flood_TX/sentinel1/S1_20240507_20240512_WM_diff.tif']
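The substring checks above work for this listing. A slightly stricter variant is sketched below, assuming the only suffixes of interest are `_WM.tif`, `_rgb.tif`, and `_WM_diff.tif`. It uses `str.endswith`, so a key matches at most one group and anything unexpected is collected under `other`. The name `group_keys_by_suffix` is hypothetical and is not part of the imported modules.

```python
from collections import defaultdict

# Suffixes assumed from the file listing above (hypothetical mapping)
SUFFIX_TO_TYPE = (
    ("_WM_diff.tif", "water_mask_diff"),
    ("_WM.tif", "water_mask"),
    ("_rgb.tif", "rgb"),
)

def group_keys_by_suffix(keys):
    """Return a dict mapping file type -> list of keys; unmatched keys go under 'other'."""
    groups = defaultdict(list)
    for key in keys:
        for suffix, file_type in SUFFIX_TO_TYPE:
            if key.endswith(suffix):
                groups[file_type].append(key)
                break
        else:
            # No known suffix matched this key
            groups["other"].append(key)
    return dict(groups)

example_keys = [
    "drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif",
    "drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif",
    "drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif",
]
groups = group_keys_by_suffix(example_keys)
print({file_type: len(found) for file_type, found in groups.items()})
```

Checking `groups.get("other")` after a run is a quick way to spot files that do not follow the expected naming.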

In [98]:
# NOTE: For the diff files, move "diff" so that it sits directly before the first date.
# Otherwise VEDA will think that the first date is the most important.

# Example: S1_diff20240430_20240507_WM.tif
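The renaming rule in the note above can be sketched as a small helper. This is a minimal sketch, assuming every diff file stem has exactly the five underscore-separated fields seen in the listing (`S1`, first date, second date, `WM`, `diff`). The function name `move_diff_before_first_date` is hypothetical.

```python
from pathlib import Path

def move_diff_before_first_date(f):
    """Rename e.g. 'S1_20240430_20240507_WM_diff.tif' to 'S1_diff20240430_20240507_WM.tif'."""
    path = Path(f)
    parts = path.stem.split("_")  # e.g. ['S1', '20240430', '20240507', 'WM', 'diff']
    if len(parts) != 5 or parts[-1] != "diff":
        raise ValueError(f"Unexpected diff filename layout: {path.name}")
    platform, first_date, second_date, product, _ = parts
    # Attach 'diff' directly in front of the first date and drop the trailing '_diff'
    return f"{platform}_diff{first_date}_{second_date}_{product}{path.suffix}"

new_name = move_diff_before_first_date(
    "drcs_activations/202405_Flood_TX/sentinel1/S1_20240430_20240507_WM_diff.tif"
)
print(new_name)
```

The helper returns only the file name, not the full key, so the target prefix can be chosen separately.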

## Load TIF Files from DRCS Data
### If no files were found in the previous block (for `keys`), this section can help diagnose the issue.

This cell loads the pre-analyzed DRCS activation data from `drcs_activations_tif_files.json` which contains a complete inventory of all .tif files in the NASA Disasters S3 bucket.

The code will:
1. Load the JSON file containing the file inventory
2. Parse the `PATH_OLD` variable to find the corresponding directory
3. Extract all .tif filenames from that directory
4. Store them in `files_to_process` for later use

In [115]:
# # Load the pre-analyzed DRCS TIF files data using imported functions
# # The JSON path is relative to the notebook location

# # Load DRCS data
# json_path = Path('../../s3-crawler/drcs_activations_tif_files.json')
# drcs_data = load_drcs_data(json_path)

# if drcs_data:
#     # Get TIF files from the specified PATH_OLD using the imported function
#     tif_files = get_tif_files_from_path(PATH_OLD, drcs_data, DIR_OLD_BASE)
    
#     if tif_files:
#         print(f"\n📁 Found {len(tif_files)} .tif files in {PATH_OLD}:")
#         print("\nFirst 10 files:")
#         for i, file in enumerate(tif_files[:10], 1):
#             print(f"  {i:2d}. {file}")
#         if len(tif_files) > 10:
#             print(f"  ... and {len(tif_files) - 10} more files")
        
#         # Get files with full paths using the imported function
#         files_to_process = get_files_with_full_paths(PATH_OLD, drcs_data, DIR_OLD_BASE, json_path)
#         print(f"\n✅ Files ready for processing. Stored in 'files_to_process' variable.")
#     else:
#         print(f"\n❌ No files found. Please check the PATH_OLD variable.")
#         files_to_process = []
# else:
#     print(f"\n❌ Could not load DRCS data.")
#     files_to_process = []
# files_to_process

In [96]:
# # Example: List available activation events using the imported function
# print("📂 Available activation events in DRCS data:")
# events = list_available_directories('drcs_activations', drcs_data, json_path)

# # Show first 10 events
# for event in events[:10]:
#     print(f"  - {event}")
# if len(events) > 10:
#     print(f"  ... and {len(events) - 10} more events")

# # Example: List subdirectories for a specific event
# print(f"\n📁 Subdirectories in {EVENT_NAME}:")
# subdirs = list_available_directories(f'drcs_activations/{EVENT_NAME}', drcs_data, json_path)
# for subdir in subdirs:
#     print(f"  - {subdir}")

## Configure bucket and paths (no need to create session manually)

In [None]:
                                                                  
def return_bucket_info(config):
   """
   Extract bucket information from configuration and return as dictionary.
   
   Args:
       config: Configuration dictionary containing bucket and prefix information
   
   Returns:
       Dictionary with bucket and prefix information
   """
   # Configure bucket and paths (no need to create session manually)
   bucket_name = config["cog_data_bucket"]
   raw_data_bucket = config["raw_data_bucket"]
   raw_data_prefix = config["raw_data_prefix"]

   cog_data_bucket = config['cog_data_bucket']
   cog_data_prefix = config["cog_data_prefix"]

   print(f"Configuration loaded:")
   print(f"  Source bucket: {raw_data_bucket}")
   print(f"  Source prefix: {raw_data_prefix}")
   print(f"  Target bucket: {cog_data_bucket}")
   print(f"  Target prefix: {cog_data_prefix}")

   return {
       "bucket_name": bucket_name,
       "raw_data_bucket": raw_data_bucket,
       "raw_data_prefix": raw_data_prefix,
       "cog_data_bucket": cog_data_bucket,
       "cog_data_prefix": cog_data_prefix
   }



## Each file type needs its own config. The `cog_data_prefix` in each config is the final directory in the new S3 bucket.

In [107]:
config_WM = {
    "data_acquisition_method": "s3",
    "raw_data_bucket" : "nasa-disasters", #DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters", #DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/WM",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}

config_rgb = {
    "data_acquisition_method": "s3",
    "raw_data_bucket" : "nasa-disasters", #DO NOT CHANGE
    "raw_data_prefix": PATH_OLD,
    "cog_data_bucket": "nasa-disasters", #DO NOT CHANGE
    "cog_data_prefix": f"{DIRECTORY_NEW}/rgb",
    "local_output_dir": f"output/{EVENT_NAME}",  # Local directory to save COGs
    "transformation": {}
}
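Because the two configs differ only in the last path component of `cog_data_prefix`, they could also be generated from one helper. The sketch below is an illustration only: `make_config` is a hypothetical name, and the path values are repeated as literals so the example runs on its own.

```python
def make_config(path_old, directory_new, event_name, file_type):
    """Build one config dict per file type; only cog_data_prefix varies."""
    return {
        "data_acquisition_method": "s3",
        "raw_data_bucket": "nasa-disasters",
        "raw_data_prefix": path_old,
        "cog_data_bucket": "nasa-disasters",
        "cog_data_prefix": f"{directory_new}/{file_type}",
        "local_output_dir": f"output/{event_name}",  # Local directory to save COGs
        "transformation": {},
    }

# Values copied from the cells above so the sketch is self-contained
example_configs = {
    file_type: make_config(
        path_old="drcs_activations/202405_Flood_TX/sentinel1",
        directory_new="drcs_activations_new/Sentinel-1",
        event_name="202405_Flood_TX",
        file_type=file_type,
    )
    for file_type in ("WM", "rgb")
}
print(example_configs["rgb"]["cog_data_prefix"])
```

Adding another file type then only requires adding its name to the tuple.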

In [108]:
#Create bucket information
wm_bucket = return_bucket_info(config_WM)
rgb_bucket = return_bucket_info(config_rgb)

rgb_bucket

Configuration loaded:
  Source bucket: nasa-disasters
  Source prefix: drcs_activations/202405_Flood_TX/sentinel1
  Target bucket: nasa-disasters
  Target prefix: drcs_activations_new/Sentinel-1/WM
Configuration loaded:
  Source bucket: nasa-disasters
  Source prefix: drcs_activations/202405_Flood_TX/sentinel1
  Target bucket: nasa-disasters
  Target prefix: drcs_activations_new/Sentinel-1/rgb


{'bucket_name': 'nasa-disasters',
 'raw_data_bucket': 'nasa-disasters',
 'raw_data_prefix': 'drcs_activations/202405_Flood_TX/sentinel1',
 'cog_data_bucket': 'nasa-disasters',
 'cog_data_prefix': 'drcs_activations_new/Sentinel-1/rgb'}

# File naming

In [104]:
def convert_sentinel_datetime(datetime_str):
    """
    Convert Sentinel datetime format to ISO 8601 format with UTC timezone.
    
    Args:
        datetime_str: String like '20240430T002653'
    
    Returns:
        String like '2024-04-30T00:26:53Z'
    """
    # Extract components
    year = datetime_str[0:4]
    month = datetime_str[4:6]
    day = datetime_str[6:8]
    hour = datetime_str[9:11]
    minute = datetime_str[11:13]
    second = datetime_str[13:15]
    
    # Format with dashes and colons, add Z for UTC
    return f"{year}-{month}-{day}T{hour}:{minute}:{second}Z"

# Test
datetime_str = '20240430T002653'
result = convert_sentinel_datetime(datetime_str)
print(result)  # 2024-04-30T00:26:53Z

2024-04-30T00:26:53Z
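The slicing approach above returns a malformed string when the input is not in the expected layout, for example a date-only field such as `20240507` from a diff file. A stricter alternative, sketched here with the hypothetical name `convert_sentinel_datetime_strict`, parses the string with `datetime.strptime` so that bad input raises `ValueError` instead.

```python
from datetime import datetime, timezone

def convert_sentinel_datetime_strict(datetime_str):
    """Convert '20240430T002653' to '2024-04-30T00:26:53Z', raising ValueError on bad input."""
    parsed = datetime.strptime(datetime_str, "%Y%m%dT%H%M%S")
    # Sentinel timestamps are treated as UTC here, matching the 'Z' suffix used above
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")

print(convert_sentinel_datetime_strict("20240430T002653"))

try:
    convert_sentinel_datetime_strict("20240507")
except ValueError as err:
    print(f"Rejected: {err}")
```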


# [File Types](#file-types)

In [124]:
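def create_cog_filename_WM_and_rgb(f, EVENT_NAME):
    """Build the COG filename for a WM or rgb file from its S3 key."""
    # Split the stem, e.g. 'S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb'
    fsplit = Path(f).stem.split('_')

    # Event name, fields 0-1, fields 3-7, then the datetime (field 2) in ISO 8601.
    # The trailing WM/rgb field is dropped; the target directory identifies the type.
    cog_filename = f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_{convert_sentinel_datetime(fsplit[2])}.tif'

    return cog_filename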

# Test function below
# create_cog_filename_WM_and_rgb(f='drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_WM.tif', EVENT_NAME=EVENT_NAME)
create_cog_filename_WM_and_rgb(f='drcs_activations/202405_Flood_TX/sentinel1/S1A_IW_20240430T002653_DVR_RTC20_G_gpuned_0610_rgb.tif', EVENT_NAME=EVENT_NAME)


'202405_Flood_TX_S1A_IW_DVR_RTC20_G_gpuned_0610_2024-04-30T00:26:53Z.tif'

In [118]:
# f'{EVENT_NAME}_{"_".join(fsplit[0:2])}_{"_".join(fsplit[3:8])}_{convert_sentinel_datetime(fsplit[2])}.tif'

In [119]:
# Define COG profile for rasterio
COG_PROFILE = {
    "driver": "COG",
    "compress": "DEFLATE",
}

In [120]:
# Check current cache status using the imported function
check_cache_status()

📁 Cache directory does not exist: data_download/


(0, 0)

In [None]:
# Process one file type at a time. This cell handles the water mask (WM) files;
# repeat with `rgb`, `config_rgb`, and `rgb_bucket` for the rgb files.
config = config_WM
files_to_convert = water_mask
cog_data_bucket = wm_bucket["cog_data_bucket"]
cog_data_prefix = wm_bucket["cog_data_prefix"]

# Get local output directory from config
local_output_dir = config.get("local_output_dir")

# Create output directories
if local_output_dir:
    os.makedirs(local_output_dir, exist_ok=True)
    print(f"Local COGs will be saved to: {local_output_dir}")

# Collect one row per processed file, then build the tracking DataFrame once
processed_rows = []

# Process all files of this type
for name in sorted(files_to_convert):
    cog_filename = create_cog_filename_WM_and_rgb(name, EVENT_NAME)
    print(f"\nProcessing: {name}")
    print(f"Output filename: {cog_filename}")

    # NOTE: convert_to_proper_CRS_and_cogify is not defined in this notebook;
    # it must be defined or imported before this cell is run.
    convert_to_proper_CRS_and_cogify(name, cog_filename, cog_data_bucket, cog_data_prefix, local_output_dir)

    processed_rows.append({"file_name": name, "COGs_created": cog_filename})
    print(f"Generated and saved COG: {cog_filename}")

# DataFrame to track processed files
files_processed = pd.DataFrame(processed_rows, columns=["file_name", "COGs_created"])

print("\nDone generating COGs")
if local_output_dir:
    print(f"COGs saved locally to: {local_output_dir}")

In [None]:
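# Unpack the bucket settings for the file type that was processed (water mask here)
raw_data_bucket = wm_bucket["raw_data_bucket"]
bucket_name = wm_bucket["bucket_name"]
cog_data_prefix = wm_bucket["cog_data_prefix"]

# Save metadata if there are processed files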
if len(files_processed) > 0:
    # Get metadata from one of the processed files
    sample_file = files_processed.iloc[0]['file_name']
    temp_sample_file = f"temp_{os.path.basename(sample_file)}"
    
    # Download sample file to extract metadata
    s3_client.download_file(raw_data_bucket, sample_file, temp_sample_file)
    
    with rasterio.open(temp_sample_file) as src:
        metadata = {
            "description": src.tags(),
            "driver": src.driver,
            "dtype": str(src.dtypes[0]),
            "nodata": src.nodata,
            "width": src.width,
            "height": src.height,
            "count": src.count,
            "crs": str(src.crs),
            "transform": list(src.transform),
            "bounds": list(src.bounds),
            "total_files_processed": len(files_processed),
            "year": "2000"
        }
    
    # Upload metadata
    with tempfile.NamedTemporaryFile(mode="w+") as fp:
        json.dump(metadata, fp, indent=2)
        fp.flush()
        
        s3_client.upload_file(
            Filename=fp.name,
            Bucket=bucket_name,
            Key=f"{cog_data_prefix}/metadata.json",
        )
        print(f"Uploaded metadata to s3://{bucket_name}/{cog_data_prefix}/metadata.json")
    
    # Clean up sample file
    if os.path.exists(temp_sample_file):
        os.remove(temp_sample_file)

# Save the files_processed DataFrame to CSV using the same s3_client
with tempfile.NamedTemporaryFile(mode="w+", suffix=".csv") as fp:
    files_processed.to_csv(fp.name, index=False)
    fp.flush()
    
    s3_client.upload_file(
        Filename=fp.name,
        Bucket=bucket_name,
        Key=f"{cog_data_prefix}/files_converted.csv",
    )
    print(f"Saved processing log to s3://{bucket_name}/{cog_data_prefix}/files_converted.csv")

In [None]:
# Display summary
print(f"\nProcessing Summary:")
print(f"Total files found: {len(keys)}")
print(f"Files processed: {len(files_processed)}")
print(f"\nProcessed files:")
files_processed

## Enhanced Features in This Version

This enhanced notebook includes several improvements over the original:

### 🔐 **Automatic AWS Authentication**
- No need for `.env` files or manual credential configuration
- Automatically detects credentials from:
  - Environment variables
  - AWS CLI configuration
  - IAM roles (EC2/Lambda)
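As a rough pre-flight check, the first two sources in this list can be inspected with the standard library alone. The sketch below is an assumption-laden illustration, not how `initialize_s3_client` works: the function name `describe_credential_sources` is hypothetical, it only looks for the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variables and the `~/.aws/credentials` and `~/.aws/config` files, and it cannot detect IAM roles. boto3 resolves credentials on its own regardless of what this reports.

```python
import os
from pathlib import Path

def describe_credential_sources(env=None, home=None):
    """Return a list of credential sources that appear to be configured locally."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    found = []
    # Source 1: access keys exported as environment variables
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")
    # Source 2: files written by `aws configure`
    aws_dir = home / ".aws"
    if (aws_dir / "credentials").is_file() or (aws_dir / "config").is_file():
        found.append("AWS CLI configuration")
    return found

# Example with an explicit environment mapping and a home directory that does not exist
example_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
print(describe_credential_sources(example_env, "/nonexistent-home-dir"))
```

An empty result does not mean authentication will fail, since an IAM role attached to the compute environment would not show up here.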

### 🚀 **Simplified Setup**
- Removed dependency on `python-dotenv`
- Direct boto3 client initialization
- Better error handling for authentication issues

### 📊 **Additional Features**
- fsspec filesystem integration for alternative S3 operations
- Graceful handling of limited S3 permissions
- Download caching to avoid re-downloading large files
- COG validation before upload
- Comprehensive error messages

### 💡 **Usage Tips**
1. Ensure AWS credentials are configured via one of the standard methods
2. The notebook will automatically detect and use available credentials
3. Check the authentication cell output to confirm S3 access
4. Use the cache management utilities to monitor downloaded files

This enhanced version follows AWS best practices and makes the notebook more portable and easier to use across different environments.