* notebook created by nov05 on 2025-01-12  
* Registry of Open Data on AWS: [**Amazon Bin Image Dataset**](https://registry.opendata.aws/amazon-bin-imagery/)      
  https://us-east-1.console.aws.amazon.com/s3/buckets/aft-vbi-pds  

In [None]:
## Windows command to open the AWS config and credentials files in Notepad
# !notepad C:\Users\guido\.aws\config
!notepad C:\Users\guido\.aws\credentials

# 👉 **Download metadata from S3** 

Download a portion of the metadata from the public S3 bucket containing the **Amazon Bin Image Dataset** to your local system.  

In [None]:
## Example: download a single file from the public S3 bucket
import os
import boto3
from botocore import UNSIGNED
from botocore.client import Config
## Create an S3 client that sends unsigned requests (anonymous public access)
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
filename = '../data/bin-images/100313.jpg'
os.makedirs(os.path.dirname(filename), exist_ok=True)  ## download_file does not create missing directories
s3_client.download_file(
    Bucket='aft-vbi-pds',
    Key='bin-images/100313.jpg',
    Filename=filename
)

In [None]:
import os
import json
from tqdm import tqdm
import boto3
from botocore import UNSIGNED
from botocore.client import Config

def download_and_arrange_data(
        prefix='bin-images', 
        file_extension='.jpg',
        download_dir='../data/train',
        partition=True):
    
    s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

    ## file_list.json maps each class (object count) to a list of file paths;
    ## the progress output below sums to 10441 files.
    with open('file_list.json', 'r') as f:
        d = json.load(f)

    for k, v in d.items():  ## There are 5 items (for 5 classes) in the JSON file.
        print(f"Downloading images/metadata of images with {k} object...")
        ## Use a per-class target directory; reassigning download_dir inside the loop
        ## would nest each class folder inside the previous one (1/2/3/...).
        target_dir = os.path.join(download_dir, k) if partition else download_dir
        os.makedirs(target_dir, exist_ok=True)
        for file_path in tqdm(v):
            file_name = os.path.splitext(os.path.basename(file_path))[0] + file_extension
            s3_client.download_file(
                'aft-vbi-pds', 
                prefix+'/'+file_name,  ## e.g. metadata/100313.json
                os.path.join(target_dir, file_name))
            
## Download the metadata files (17.9 MB in total; took 56m 57.4s)
download_and_arrange_data(
    prefix='metadata', 
    file_extension='.json',
    download_dir='../data/metadata',
    partition=False)

```text
Downloading images/metadata of images with 1 object...
100%|██████████| 1228/1228 [06:36<00:00,  3.09it/s]
Downloading images/metadata of images with 2 object...
100%|██████████| 2299/2299 [12:38<00:00,  3.03it/s]
Downloading images/metadata of images with 3 object...
100%|██████████| 2666/2666 [14:35<00:00,  3.04it/s]
Downloading images/metadata of images with 4 object...
100%|██████████| 2373/2373 [12:54<00:00,  3.06it/s]
Downloading images/metadata of images with 5 object...
100%|██████████| 1875/1875 [10:11<00:00,  3.07it/s]  
```
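
The progress bars above show roughly 3 files per second, because the loop downloads one small file at a time. A minimal sketch of running the same per-file download on a thread pool, which may shorten the wait: `download_many` and `make_s3_downloader` are hypothetical helper names, and `max_workers=16` is an arbitrary assumption rather than a tuned value.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_many(keys, download_one, max_workers=16):
    """Run download_one(key) for every key on a thread pool.

    Returns a list of (key, exception) pairs for the downloads that failed.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, key): key for key in keys}
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                failures.append((futures[future], error))
    return failures


def make_s3_downloader(bucket, download_dir):
    """Build a download_one(key) callable backed by an anonymous boto3 client."""
    import os
    import boto3  # imported lazily so download_many can be tried without boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    os.makedirs(download_dir, exist_ok=True)

    def download_one(key):
        # Save each object under its base name, e.g. metadata/100313.json -> 100313.json
        target = os.path.join(download_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, target)

    return download_one
```

For example, `download_many(['metadata/100313.json'], make_s3_downloader('aft-vbi-pds', '../data/metadata'))` would fetch one file and return an empty list if it succeeds. Collecting failures instead of raising lets the remaining downloads finish, so the failed keys can be retried afterwards.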

In [1]:
print("total metadata file number:", 1228 + 2299 + 2666 + 2373 + 1875)

total metadata file number: 10441
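
The total above is typed in by hand from the progress bars. A minimal sketch of deriving the same counts from the file list itself, assuming `file_list.json` maps each class to a list of file paths (as the download cell reads it); `count_files` and `load_and_count` are hypothetical helper names.

```python
import json


def count_files(file_lists):
    """Return (per-class counts, overall total) for a {class: [paths]} mapping."""
    per_class = {label: len(paths) for label, paths in file_lists.items()}
    return per_class, sum(per_class.values())


def load_and_count(path='file_list.json'):
    """Read the JSON file list from disk and count its entries."""
    with open(path, 'r') as f:
        return count_files(json.load(f))
```

Calling `load_and_count()` next to the notebook should give a total that matches the sum of the progress-bar counts, provided the file list has not changed since the download.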


# 👉 **Upload metadata to S3**

Upload this portion of the metadata to my own S3 bucket for further experimental analysis using AWS Glue, Athena, and other services.  

In [4]:
## Example: upload a single file to S3. Note which profile is used.
import boto3
session = boto3.Session(profile_name='admin')  ## use the profile name in the credentials file
s3_client = session.client('s3')
bucket = 'dataset-aft-vbi-pds'
key = 'metadata/100313.json'
filename = '../data/metadata/100313.json'
s3_client.upload_file(
    Filename=filename,
    Key=key,
    Bucket=bucket
)

In [16]:
## Example: directory traversal (prints only the first file of each directory)
import os
local_folder = '../data/metadata'
for root, dirs, files in os.walk(local_folder):
    print(root, dirs)
    for file in files:
        local_file = os.path.join(root, file)
        print(local_file)
        relative_path = os.path.relpath(local_file, '../data/')
        print(relative_path)
        break  ## stop after the first file; the outer loop continues with the next directory

../data/metadata []
../data/metadata\00004.json
metadata\00004.json
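
The output above shows Windows backslashes in the relative path, while S3 keys use forward slashes. A minimal sketch of converting such a path into a key with the standard library only; `to_s3_key` is a hypothetical helper name, and it assumes the input is a relative path.

```python
import posixpath
from pathlib import PureWindowsPath


def to_s3_key(relative_path, s3_folder=''):
    """Join an S3 folder prefix and a relative local path using '/' separators.

    PureWindowsPath splits on both '\\' and '/', so the result is the same
    whether the path was produced on Windows or on a POSIX system.
    """
    parts = PureWindowsPath(relative_path).parts
    return posixpath.join(s3_folder, *parts)


print(to_s3_key('metadata\\00004.json'))          # metadata/00004.json
print(to_s3_key('00004.json', 'metadata/'))       # metadata/00004.json
```

Splitting the path into parts avoids a blanket string replace, which would also rewrite any backslash that happens to be part of the prefix.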


In [None]:
import os
from tqdm import tqdm
import boto3
from botocore.exceptions import NoCredentialsError
def upload_folder_to_s3(local_folder, bucket_name, s3_folder=''):
    session = boto3.Session(profile_name='admin')  ## use the profile name in the credentials file
    s3_client = session.client('s3')
    for root, _, files in os.walk(local_folder):
        for file in tqdm(files):
            local_file = os.path.join(root, file)
            relative_path = os.path.relpath(local_file, local_folder)  # Get relative file path
            s3_file = os.path.join(s3_folder, relative_path).replace("\\", "/")  # Handle folder structure in S3
            try:
                s3_client.upload_file(local_file, bucket_name, s3_file)
            except NoCredentialsError:
                ## Credentials are missing for every file, so stop instead of retrying each one.
                print("AWS credentials not available.")
                return
bucket = 'dataset-aft-vbi-pds'
local_folder = '../data/metadata/'  # Local folder path
s3_folder = 'metadata/'  # The folder in S3 to upload to (optional)
upload_folder_to_s3(local_folder, bucket, s3_folder)
## 53m 23.5s for uploading 10445 json files

100%|██████████| 10445/10445 [53:23<00:00,  3.26it/s] 
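
To check how many objects actually landed in the bucket, the keys under the prefix can be counted. A minimal sketch using the `list_objects_v2` paginator; `count_objects` and `count_bucket_objects` are hypothetical helper names, and the `admin` profile is assumed to be the same one used for the upload.

```python
def count_objects(pages):
    """Sum the number of objects across list_objects_v2 response pages.

    A page without any matching keys has no 'Contents' entry, hence the default.
    """
    return sum(len(page.get('Contents', [])) for page in pages)


def count_bucket_objects(bucket, prefix, profile_name='admin'):
    """Count the objects stored under a prefix in an S3 bucket."""
    import boto3  # imported lazily so count_objects can be tried without boto3

    session = boto3.Session(profile_name=profile_name)
    paginator = session.client('s3').get_paginator('list_objects_v2')
    return count_objects(paginator.paginate(Bucket=bucket, Prefix=prefix))
```

For example, `count_bucket_objects('dataset-aft-vbi-pds', 'metadata/')` would return the number of keys under `metadata/`. A zero-byte folder placeholder object, if one exists under that prefix, is counted as well.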
