# BYOC Data Ingestion Tool

This notebook guides you through the process of ingesting data into a BYOC (Bring Your Own COG) collection in Sentinel Hub / Copernicus Data Space Ecosystem.

## Prerequisites

Before using this notebook, you need:

1. A Sentinel Hub account with BYOC access
2. A Client ID and Client Secret for authentication
3. Access to an S3 bucket, on a supported system, that contains your data
4. S3 credentials (username and password) for that bucket
5. Data in Cloud Optimized GeoTIFF (COG) format

## Notebook Structure

This notebook is organized into the following sections:

1. **Import Required Libraries**: Setup and import all necessary dependencies
2. **Set Up Credentials**: Configure access to Sentinel Hub and S3 storage
3. **Configure Collection Parameters**: Set up collection information, data structure, and band information
4. **Create Collection and Ingest Data**: Execute the ingestion process
5. **Monitoring and Validation**: Verify the successful ingestion


## 1. Import Required Libraries

First, let's import all the necessary libraries for the ingestion process.

In [None]:
import json
from src.auth import Configurator
from src.collections import Ingestor, TileListParameters

## 2. Set Up Credentials

Now, let's set up the credentials required for accessing Sentinel Hub and the S3 bucket. Here we will load credentials from a `config.json` file. This keeps sensitive information separate from your code and makes it easier to avoid accidentally exposing your credentials.

### What Credentials You'll Need

To use this notebook, you'll need:

1. **CDSE Sentinel Hub OAuth Credentials**
   - Client ID
   - Client Secret

2. **S3 Storage Access Credentials**
   - Username
   - Password
   - Bucket URL (for CreoDIAS, use `s3.waw3-1.cloudferro.com`)
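
The loading cell in this section reads five keys from `config.json`. The sketch below shows one possible template for that file. The key names are the ones the notebook reads with `config.get(...)`; every value except the CreoDIAS bucket URL is a placeholder to replace with your own.

```python
import json

# Placeholder values only; replace them with your own credentials.
# The key names match those read by the loading cell in this notebook.
config_template = {
    "sentinel_hub_client_id": "<your-client-id>",
    "sentinel_hub_client_secret": "<your-client-secret>",
    "s3_username": "<your-s3-username>",
    "s3_password": "<your-s3-password>",
    "bucket_url": "s3.waw3-1.cloudferro.com",
}

# Print the JSON text that would go into config.json.
config_text = json.dumps(config_template, indent=4)
print(config_text)
```

Save the printed text as `config.json` next to the notebook and keep the file out of version control.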

In [None]:
# Load credentials from config.json file
try:
    with open('config.json', 'r') as f:
        config = json.load(f)
    
    # Extract credentials from config file
    CLIENT_ID = config.get('sentinel_hub_client_id')
    CLIENT_SECRET = config.get('sentinel_hub_client_secret')
    S3_USERNAME = config.get('s3_username')
    S3_PASSWORD = config.get('s3_password')
    BUCKET_URL = config.get('bucket_url')
    
except FileNotFoundError:
    print("config.json file not found in the current directory.")
except json.JSONDecodeError:
    print("config.json is not a valid JSON file. Please check its format.")
except Exception as e:
    print(f"Error loading credentials: {str(e)}")
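
Because `config.get(...)` returns `None` for a missing key, an incomplete `config.json` would only surface later as an authentication or listing error. The following is a minimal sketch of an early check. The helper name `find_missing_keys` is hypothetical and not part of `src`.

```python
# Keys the loading cell expects to find in config.json.
REQUIRED_KEYS = (
    "sentinel_hub_client_id",
    "sentinel_hub_client_secret",
    "s3_username",
    "s3_password",
    "bucket_url",
)


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in config (hypothetical helper)."""
    return [key for key in required if not config.get(key)]


# Example with a deliberately incomplete configuration.
example_config = {"sentinel_hub_client_id": "abc", "s3_username": "user"}
missing = find_missing_keys(example_config)
print(missing)
# ['sentinel_hub_client_secret', 's3_password', 'bucket_url']
```

Running `find_missing_keys(config)` right after loading and stopping if the list is non-empty keeps credential problems separate from ingestion problems.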

## 3. Configure Collection Parameters

Now we need to configure the parameters for the BYOC collection.

### Understanding Path Configuration

The path configuration parameters are crucial for correctly extracting datetime information and identifying bands. When ingesting data into BYOC, the system uses these parameters to properly identify and organize your files.

#### File Path Structure in BYOC

In BYOC, files are referenced using a path with a `(BAND)` placeholder. For example, if your tile path is set to `folder/(BAND).tiff`, the system will replace `(BAND)` with the band source name to locate the actual files, like `folder/B01.tiff` or `folder/B02.tiff`.
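
The placeholder substitution described above can be illustrated with plain string replacement. This is only a local illustration of the naming rule, not the service's implementation, and `resolve_band_path` is a hypothetical helper.

```python
def resolve_band_path(tile_path, band_source):
    """Replace the (BAND) placeholder with a band source name (illustration only)."""
    return tile_path.replace("(BAND)", band_source)


paths = [resolve_band_path("folder/(BAND).tiff", band) for band in ("B01", "B02")]
print(paths)
# ['folder/B01.tiff', 'folder/B02.tiff']
```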

#### Example Configuration

Suppose you have files with the following structure:
```
bucket-name/
└── path_to_data/
    ├── tile1_20220101/
    │   ├── file_name_B01.tif
    │   ├── file_name_B02.tif
    │   └── file_name_B03.tif
    └── tile2_20220102/
        ├── file_name_B01.tif
        ├── file_name_B02.tif
        └── file_name_B03.tif
```

For this structure, you would configure:

**DATETIME_POSITION**:
- `path`: 1 (the folder "tile1_20220101" is at index 1 in the path; counting starts at 0 and the bucket name is not counted)
- `delimiter`: "_" (the delimiter in "tile1_20220101")
- `position`: 1 (the date part "20220101" is at index 1 after splitting on the delimiter)
- `format`: "%Y%m%d" (the format of "20220101")

**BAND_POSITION**:
- `path`: -1 (the filename "file_name_B01.tif" is the last part of the path)
- `delimiter`: "_" (the delimiter in "file_name_B01.tif")
- `position`: 2 (the band part "B01" is at index 2 after splitting, counting from 0)
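
To make these settings concrete, here is a minimal sketch of how a date and a band name could be pulled from an object key. It assumes 0-based indices, that negative indices count from the end of the path, and that the file extension is dropped before splitting. The functions `extract_datetime` and `extract_band` are hypothetical, and the actual logic in `src.collections` may differ.

```python
from datetime import datetime
from pathlib import PurePosixPath


def extract_datetime(key, path, delimiter, position, fmt):
    """Parse the datetime from one path component of an object key (sketch)."""
    parts = PurePosixPath(key).parts          # bucket name is not part of the key
    token = parts[path].split(delimiter)[position]
    return datetime.strptime(token, fmt)


def extract_band(key, path, delimiter, position):
    """Return the band token from one path component, ignoring the extension (sketch)."""
    parts = PurePosixPath(key).parts
    stem = PurePosixPath(parts[path]).stem    # "file_name_B01.tif" -> "file_name_B01"
    return stem.split(delimiter)[position]


key = "path_to_data/tile1_20220101/file_name_B01.tif"
sensing_time = extract_datetime(key, path=1, delimiter="_", position=1, fmt="%Y%m%d")
band = extract_band(key, path=-1, delimiter="_", position=2)
print(sensing_time.date(), band)
# 2022-01-01 B01
```

Trying your own keys against a sketch like this before ingestion is a quick way to catch an off-by-one index.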

Remember, all tiles in a collection must contain the same set of files with consistent naming. If a tile is missing one or more files, it will fail to ingest.

You'll need to adjust these parameters based on your specific file structure.
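
Since a tile with a missing file fails to ingest, it can help to check a file listing for completeness beforehand. The sketch below assumes that file names are identical across tile folders, as in the example structure above. The helper `find_incomplete_tiles` is hypothetical and works on any list of object keys.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_incomplete_tiles(keys):
    """Map each tile folder to the file names it lacks, relative to all tiles (sketch)."""
    files_by_tile = defaultdict(set)
    for key in keys:
        path = PurePosixPath(key)
        files_by_tile[str(path.parent)].add(path.name)

    # The expected set is the union of file names seen in any tile folder.
    expected = set().union(*files_by_tile.values())
    return {
        tile: sorted(expected - names)
        for tile, names in files_by_tile.items()
        if expected - names
    }


keys = [
    "path_to_data/tile1_20220101/file_name_B01.tif",
    "path_to_data/tile1_20220101/file_name_B02.tif",
    "path_to_data/tile2_20220102/file_name_B01.tif",
]
incomplete = find_incomplete_tiles(keys)
print(incomplete)
# {'path_to_data/tile2_20220102': ['file_name_B02.tif']}
```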

### Understanding BAND_INFORMATION

The `BAND_INFORMATION` parameter is essential when working with bands that have special characters in their names, particularly hyphens (`-`). This is common in CLMS (Copernicus Land Monitoring Service) datasets where band names might contain hyphens or other special characters.

When a band name contains a hyphen, the Sentinel Hub API requires additional configuration beyond the basic path structure. This is where `BAND_INFORMATION` comes in:

```python
BAND_INFORMATION = [
    {
        "name": "B1",         # The name you want to use for this band in your collection
        "source": "B1-NDVI",  # The actual source name with hyphen as it appears in your files
        "bit_depth": "8",     # The bit depth of this band
        "sample_format": "UINT" # The sample format (UINT, INT, FLOAT)
    }
]
```

#### Key Points:

1. **Handling Hyphens**: When band names in your files contain hyphens (like `B1-NDVI`), you must explicitly declare them in `BAND_INFORMATION`

2. **Band Renaming**: You can use this to rename complex band names to simpler ones:
   - `"name"`: What you want the band to be called in your collection
   - `"source"`: The actual name in the file path (including hyphens)

3. **Optional Usage**: If your band names don't contain hyphens or special characters, you can set `BAND_INFORMATION` to `None` or omit it entirely

This allows the ingestion process to correctly identify and rename the bands while handling the special characters in the file paths.
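
A quick structural check of `BAND_INFORMATION` before creating the collection can catch typos early. The sketch below assumes that every entry needs the four fields shown in the example above and that each `source` should appear only once. Both are assumptions, and `build_band_mapping` is a hypothetical helper.

```python
# Fields shown in the BAND_INFORMATION example; treated as required here (assumption).
REQUIRED_FIELDS = {"name", "source", "bit_depth", "sample_format"}


def build_band_mapping(band_information):
    """Return a source -> name mapping, raising ValueError on malformed entries (sketch)."""
    mapping = {}
    for entry in band_information:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"Band entry {entry!r} is missing fields: {sorted(missing)}")
        if entry["source"] in mapping:
            raise ValueError(f"Duplicate band source: {entry['source']!r}")
        mapping[entry["source"]] = entry["name"]
    return mapping


band_mapping = build_band_mapping(
    [{"name": "B1", "source": "B1-NDVI", "bit_depth": "8", "sample_format": "UINT"}]
)
print(band_mapping)
# {'B1-NDVI': 'B1'}
```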

In [None]:
# Collection information
COLLECTION_NAME = "my-collection-name" # Provide a name for your collection

# Bucket information
BUCKET_NAME = "your-bucket-name" # Replace with your actual bucket name
BASE_PATH = "path/to/your/data" # Replace with the base path where your data is stored (one level above the COGs)

DATETIME_POSITION = {
    "path": 2,             # Position of the folder with datetime info in the path
    "delimiter": "_",      # Delimiter in the folder name
    "position": 1,         # Position of the datetime in the folder name after splitting
    "format": "%Y%m%d"     # Format of the datetime string
}

BAND_POSITION = {
    "path": 2,            # Position of the filename in the path
    "delimiter": "_",      # Delimiter in the filename
    "position": 2          # Position of the band name in the filename after splitting
}

# Band information
BAND_INFORMATION = [
    {
        "name": "B1",
        "source": "B-1",
        "bit_depth": "8",
        "sample_format": "UINT"
    },
    {
        "name": "B2",
        "source": "B-2",
        "bit_depth": "8",
        "sample_format": "UINT"
    }
]

## 4. Create Collection and Ingest Data

Now that we have configured all the necessary parameters, we can create the collection and ingest the data. This process involves:

1. Creating a BYOC collection in Sentinel Hub
2. Listing all files in the S3 bucket that match our criteria
3. Ingesting these files into the collection
4. Monitoring the ingestion process

This might take some time depending on the amount of data being ingested.

In [None]:
# Create a configuration object for SH
configurator = Configurator(CLIENT_ID, CLIENT_SECRET)
sh_config = configurator.config

In [None]:
# Create a BYOC collection
sh_collection = Ingestor(sh_config)

try:
    sh_collection.create_byoc_collection(
        collection_name=COLLECTION_NAME,
        bucket_name=BUCKET_NAME,
        band_information=BAND_INFORMATION # Can be None if not needed, see specifics in the dedicated cell above
    )
    
except Exception as e:
    raise ValueError(f"Failed to create collection: {e}") from e

In [None]:
# Get list of files
try:
    tile_list_params = TileListParameters(
        base_path=BASE_PATH,
        bucket_name=BUCKET_NAME,
        creodias_username=S3_USERNAME,
        creodias_password=S3_PASSWORD,
        # For WAW3-1 use "s3.waw3-1.cloudferro.com"; for eodata use "eodata.dataspace.copernicus.eu"
        bucket_url=BUCKET_URL,
    )
    sh_collection.list_tiles(tile_list_params)

    print(
        f"Number of files to be ingested: {len(sh_collection.file_list)}"
    )
except Exception as e:
    raise ValueError(f"Failed to list files: {e}") from e

In [None]:
# Ingest the files
try:
    sh_collection.ingest_tiles_to_collection(
        DATETIME_POSITION,
        BAND_POSITION,
    )
except Exception as e:
    raise ValueError(f"Failed to ingest tiles: {e}") from e

## 5. Monitoring and Validation

After the ingestion process is complete, we can verify the collection and its contents.
If any tiles failed to ingest, we can identify them and diagnose the issues.

In [None]:
failure_report = sh_collection.collection_tile_report()

print(failure_report[0])

if failure_report[0]["Failed"] > 0:
    print(failure_report[1])