### Install helper libraries for uploads

This cell installs a few Python helper tools:

- **boto3** – a Python library that lets us talk to our storage system (MinIO/S3) from code.  
- **tqdm** – shows a nice progress bar while files are uploading.  
- **python-dotenv (`dotenv`)** – reads settings (like bucket names and URLs) from a simple text file called `config.env`.

These tools are used in later cells to upload the dataset into the workshop’s storage.

In [None]:
!pip install boto3 tqdm python-dotenv

### Import Python libraries used for configuration and uploads

This cell imports the libraries we just installed, along with Python's built-in `os` module, so we can use them:

- **boto3** – to connect to the object storage service (MinIO/S3).  
- **os** – to work with files and read environment variables.  
- **TransferConfig** – lets us tune how uploads behave (for example, splitting big files into pieces).  
- **tqdm** – to show a progress bar while each file uploads.  
- **load_dotenv** – to load settings from the `config.env` file.

Nothing is uploaded yet; we’re just getting our tools ready.

In [None]:
import boto3
import os
from boto3.s3.transfer import TransferConfig
from tqdm import tqdm
from dotenv import load_dotenv

### Load workshop configuration from `config.env`

This cell reads settings from the `config.env` file and exposes them to Python as environment variables.

- **`config.env`** is a simple text file with lines like `NAME=value`.  
- **Environment variables** are just named settings the code can read, such as the project **namespace**.  
- The **namespace** is a label that identifies your project in the cluster (for example, `myuser-raft-workshop`). We’ll reuse it to build folder paths in storage.

This gives the rest of the notebook the information it needs to know “who you are” and where to store data.
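
To make the mechanism concrete, here is a rough, standard-library-only sketch of what loading a `NAME=value` file into environment variables involves. It is an illustration only, not python-dotenv's actual implementation (which also handles quoting and other cases); the helper name `load_env_file`, the variable `DEMO_PROJECT_NAMESPACE`, and the sample value are made up for this example.

```python
import os
import tempfile


def load_env_file(path, override=False):
    """Read NAME=value lines from `path` into os.environ (hypothetical helper)."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw_line in fh:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything without "=".
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, value = line.split("=", 1)
            name, value = name.strip(), value.strip()
            # Like load_dotenv's default, keep variables that are already set.
            if override or name not in os.environ:
                os.environ[name] = value
            loaded[name] = os.environ[name]
    return loaded


# Demo with a throwaway file and a made-up namespace value.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, "config.env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write("# workshop settings\nDEMO_PROJECT_NAMESPACE=myuser-raft-workshop\n")
    settings = load_env_file(env_path)

print(os.getenv("DEMO_PROJECT_NAMESPACE"))
```

If the variable is missing from both the file and the environment, `os.getenv` returns `None`, so it is worth checking the value before relying on it.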

In [None]:
# ---------------------------------------------------------------------
# LOAD CONFIGURATION
# ---------------------------------------------------------------------
load_dotenv("config.env")

DATASCIENCE_PROJECT_NAMESPACE = os.getenv('DATASCIENCE_PROJECT_NAMESPACE')

### Define where the dataset lives and where it will be uploaded

This cell sets:

- The **local folder** where your generated dataset is stored (`directory`).  
- The **key base** (`key_base`), which is the “folder path” we’ll use inside object storage.  
- The **bucket name** (`bucket_name`), read from the `AWS_S3_BUCKET` environment variable supplied by the storage connection configuration.

> **Bucket? Key?**  
> - A **bucket** is like a top-level folder in the storage system.  
> - A **key** is the path to a specific file inside that bucket.

The cell then prints a short summary showing:

- Which local directory will be uploaded.  
- Which bucket and path in storage it will go to, in a URL-style form like:  
  `s3://<bucket>/<key_base>/`

This is your chance to double-check that the upload destination looks right.
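
One thing to watch for in that summary: `os.getenv` returns `None` when a variable is missing, and an f-string turns that into the literal text `None`, so a misconfigured notebook would show a path like `None/dataset`. The sketch below shows one possible guard; `build_destination` is a hypothetical helper, not part of the workshop code, and the bucket and namespace values are invented.

```python
def build_destination(namespace, bucket, subfolder="dataset"):
    """Return (key_base, s3_uri), refusing missing settings (hypothetical helper)."""
    if not namespace:
        raise ValueError("DATASCIENCE_PROJECT_NAMESPACE is not set")
    if not bucket:
        raise ValueError("AWS_S3_BUCKET is not set")
    key_base = f"{namespace}/{subfolder}"
    return key_base, f"s3://{bucket}/{key_base}/"


# Made-up values for illustration.
key_base_demo, uri_demo = build_destination("myuser-raft-workshop", "workshop-bucket")
print(uri_demo)  # s3://workshop-bucket/myuser-raft-workshop/dataset/

# Without a guard, a missing namespace silently becomes the text "None".
print(f"{None}/dataset")  # None/dataset
```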


In [None]:
directory = '/opt/app-root/src/raft-workshop/dataset'
key_base = f'{DATASCIENCE_PROJECT_NAMESPACE}/dataset'
bucket_name = os.getenv("AWS_S3_BUCKET")

print(
    f"""
Upload Configuration
--------------------
Local directory  : {directory}
S3 bucket        : {bucket_name}
S3 key base      : {key_base}

Result:
Files from the local dataset directory will be uploaded to:
s3://{bucket_name}/{key_base}/
"""
)

### Configure efficient uploads for large files

This cell tunes how files are uploaded to object storage using `TransferConfig`.

Key ideas:

- **Multi-part upload** – big files are automatically split into smaller pieces (“chunks”) and uploaded in parts.  
  - This makes uploads **faster** and **more reliable**, especially over slow or unstable networks (such as hotel Wi-Fi).
- **Concurrency / threads** – multiple parts can be uploaded at the same time to speed things up.

You don’t need to change these values for the workshop; they’re sensible defaults for uploading dataset files.
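
To get a feel for how a chunk size translates into parts, the arithmetic can be sketched as below. The helper `count_parts` and the example sizes are hypothetical. The two limits it checks (parts of at least 5 MiB except the last one, and at most 10,000 parts per upload) are Amazon S3's documented multipart limits, assumed here to apply to the workshop's S3-compatible storage as well. boto3 may handle out-of-range values differently; this sketch simply rejects them.

```python
import math

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def count_parts(file_size, chunk_size, threshold):
    """Return how many parts a file would be split into (1 = single request)."""
    if file_size < threshold:
        return 1  # files below the threshold go up in a single request
    if chunk_size < MIN_PART_SIZE:
        raise ValueError("chunk size is below the 5 MiB multipart minimum")
    parts = math.ceil(file_size / chunk_size)
    if parts > MAX_PARTS:
        raise ValueError("too many parts; use a larger chunk size")
    return parts


print(count_parts(10 * MIB, 25 * MIB, 25 * MIB))   # 1
print(count_parts(100 * MIB, 25 * MIB, 25 * MIB))  # 4
print(count_parts(101 * MIB, 25 * MIB, 25 * MIB))  # 5
```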

In [None]:
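# Configure S3 transfer settings for efficient multi-part uploads
MB = 1024 * 1024

config = TransferConfig(
    multipart_threshold=25 * MB,   # files of 25 MB or more are uploaded in parts
    max_concurrency=10,            # up to 10 upload threads at once
    multipart_chunksize=25 * MB,   # size of each part
    use_threads=True
)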

### Upload all dataset files to MinIO with a progress bar

This is the main upload step.

The cell does the following:

1. **Walks through the local dataset folder**  
   - Finds every file inside `directory` (including files in subfolders).

2. **Builds the storage path (key) for each file**  
   - Keeps the same relative folder structure, but under `key_base` in the bucket.

3. **Uploads each file to MinIO**  
   - Creates a **boto3 client** that connects to the object storage service using the `AWS_S3_ENDPOINT` setting.  
   - Uses the `TransferConfig` settings (`config`) from the previous cell so large files are uploaded in chunks.  
   - Shows a **tqdm progress bar** so you can see how many bytes have been uploaded for each file.

4. **Reports success or errors**  
   - Prints a message when a file is successfully uploaded.  
   - If something goes wrong, it catches the error and prints it instead of crashing the whole notebook.
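
The progress bar in step 3 works because boto3 calls the `Callback` repeatedly with the number of bytes sent since the previous call, not a running total, and those calls can come from several worker threads. A minimal stand-in for the progress bar, using only the standard library, might look like this (`ProgressTracker` is a hypothetical class, and the byte counts are simulated rather than coming from a real upload):

```python
import threading


class ProgressTracker:
    """Accumulate incremental byte counts, as an upload callback would."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.sent = 0
        self._lock = threading.Lock()  # callbacks may arrive from several threads

    def __call__(self, bytes_transferred):
        with self._lock:
            self.sent += bytes_transferred

    @property
    def percent(self):
        return 100.0 * self.sent / self.total_bytes if self.total_bytes else 100.0


# Simulate four 256-byte chunks reported by four threads.
tracker = ProgressTracker(total_bytes=1024)
threads = [threading.Thread(target=tracker, args=(256,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{tracker.sent} bytes, {tracker.percent:.0f}%")  # 1024 bytes, 100%
```

An instance like this could be passed as `Callback=tracker` to `upload_file` when a visual progress bar is not needed.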

> **MinIO / S3 endpoint**  
> - **MinIO** is a storage service used in the workshop that behaves like Amazon’s S3.  
> - The **endpoint URL** is simply the web address of that storage service inside the cluster.

After this cell finishes, your dataset is safely stored in the workshop’s object storage and ready to be used by other jobs or notebooks.
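
Before running the upload, the mapping from local paths to storage keys (steps 1 and 2) can be previewed without contacting storage. The sketch below is a dry run over a temporary folder; `iter_upload_keys` and the sample file names are invented for illustration. It joins keys with `/` explicitly because object keys use forward slashes regardless of the local operating system.

```python
import os
import tempfile


def iter_upload_keys(local_dir, key_base):
    """Yield (file_path, key) pairs for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            file_path = os.path.join(root, name)
            rel_path = os.path.relpath(file_path, local_dir)
            # Object keys use "/" even when the local OS uses another separator.
            yield file_path, f"{key_base}/{rel_path.replace(os.sep, '/')}"


# Build a tiny throwaway dataset folder to demonstrate the mapping.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "train"))
    for rel in ("README.md", os.path.join("train", "part-0.jsonl")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as fh:
            fh.write("demo\n")
    keys = sorted(key for _path, key in iter_upload_keys(tmp, "myuser-raft-workshop/dataset"))

print(keys)
```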

In [None]:
# Create one S3 client up front and reuse it for every file
s3_client = boto3.client('s3', endpoint_url=os.getenv("AWS_S3_ENDPOINT"))

# Walk the local dataset directory and upload each file to MinIO with a progress bar
for root, dirs, files in os.walk(directory):
    for file in files:
        file_path = os.path.join(root, file)

        # Keep the relative folder structure under key_base; object keys always use "/"
        rel_path = os.path.relpath(file_path, directory)
        key_name = f"{key_base}/{rel_path.replace(os.sep, '/')}"

        print(key_name)
        try:
            with tqdm(
                total=os.path.getsize(file_path),
                unit='B',
                unit_scale=True,
                desc=file_path
            ) as pbar:
                s3_client.upload_file(
                    file_path,
                    bucket_name,  # bucket read from AWS_S3_BUCKET
                    key_name,
                    Config=config,
                    # Called with the number of bytes sent since the last call
                    Callback=lambda bytes_transferred: pbar.update(bytes_transferred)
                )
            print(f'File {file_path} uploaded to {bucket_name}/{key_name}')
        except Exception as e:
            # Report the failure and continue with the remaining files
            print(f'Error occurred while uploading {file_path}: {e}')