
# Kaggle to S3 Upload Notebook

> **Note:** Files can also be uploaded manually through the AWS console or with direct file-transfer tools; this notebook provides an automated, repeatable alternative.

- **Why this approach?**
  - Avoids downloading datasets to your local machine.
  - Runs on **Google Colab**, whose free runtimes typically provide **roughly 80 GB of disk space** (the exact amount varies by runtime), so large datasets can be downloaded and unzipped directly in the cloud.
  - Uploads selected files (such as `.csv`) to an **Amazon S3 bucket** using the `boto3` SDK.


## Step 1: Set Up Kaggle API Credentials

In [None]:

# Create the hidden .kaggle directory in the user's home if it doesn't already exist
!mkdir -p ~/.kaggle

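# Copy the kaggle.json file (which contains Kaggle API credentials) to the .kaggle directory
!cp kaggle.json ~/.kaggle/

# Restrict the file to the current user so the Kaggle client does not warn about readable credentials
!chmod 600 ~/.kaggle/kaggle.json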
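
The shell commands above assume `kaggle.json` has already been uploaded to the Colab working directory. As a rough Python equivalent, the sketch below copies the file, checks that it has the classic two-field layout (`username` and `key`, which is an assumption about the token format), and restricts it to the owner. The name `install_kaggle_credentials` and its parameters are illustrative and not part of any library.

```python
import json
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", dest_dir="~/.kaggle"):
    """Copy a Kaggle token file into dest_dir and make it owner-only (hypothetical helper)."""
    src_path = Path(src).expanduser()
    dest_dir_path = Path(dest_dir).expanduser()

    # Fail early if the token lacks a field (assumes the username/key layout)
    with open(src_path) as fh:
        token = json.load(fh)
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"{src_path} is missing fields: {sorted(missing)}")

    # Same effect as `mkdir -p` followed by `cp`
    dest_dir_path.mkdir(parents=True, exist_ok=True)
    dest_path = dest_dir_path / "kaggle.json"
    shutil.copyfile(src_path, dest_path)

    # Same effect as `chmod 600`: read/write for the owner only
    os.chmod(dest_path, stat.S_IRUSR | stat.S_IWUSR)
    return dest_path
```

Calling `install_kaggle_credentials()` with no arguments mirrors the two shell commands, with the added permission step.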


## Step 2: Install Required Python Packages

In [None]:

# Install boto3 (AWS SDK for S3) and the kaggle client; both are imported in Step 3
!pip install boto3 kaggle
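
Before moving on, it can help to confirm that the modules the notebook imports are actually available in the current runtime. A small sketch using only the standard library; `missing_packages` is a hypothetical helper name:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names this notebook relies on; an empty list means both are installed
print(missing_packages(["boto3", "kaggle"]))
```

`find_spec` only locates a module without importing it, so this check does not trigger the Kaggle client's authentication.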


## Step 3: Download Dataset from Kaggle and Upload CSVs to S3

In [None]:

# Import essential libraries
import os                     # For file and directory operations
import kaggle                 # To interact with Kaggle API
import boto3                  # AWS SDK for Python to interact with S3
from google.colab import userdata  # For securely retrieving stored user credentials in Colab

# Define the Kaggle dataset identifier (e.g., username/dataset-name)
dataset_name = "ukveteran/adventure-works"  # Change this as needed

# Define local directory path where the downloaded dataset will be stored
local_path = "./data"

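# Retrieve Kaggle credentials stored as Colab secrets and expose them through the
# environment variables that the Kaggle client reads
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')  # Kaggle username
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_API_KEY')        # Kaggle API key

# `import kaggle` has already authenticated with ~/.kaggle/kaggle.json from Step 1;
# authenticate again so the values above are picked up
kaggle.api.authenticate()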

# Download and unzip the Kaggle dataset to the specified local path
kaggle.api.dataset_download_files(dataset=dataset_name, path=local_path, unzip=True)

# Define AWS S3 bucket name and folder prefix (like a path inside the bucket)
bucket_name = userdata.get("S3_BUCKET_NAME")       # Retrieve target S3 bucket
prefix = userdata.get("S3_FOLDER_NAME")            # Retrieve the folder path/prefix inside the bucket

# Initialize a boto3 S3 client using AWS credentials securely stored in Colab
s3 = boto3.client(
    's3',
    aws_access_key_id=userdata.get("AWS_ACCESS_KEY_ID"),         # AWS access key
    aws_secret_access_key=userdata.get("AWS_SECRET_ACCESS_KEY"), # AWS secret key
    region_name='ap-south-1'                                      # Specify AWS region
)

# Upload only CSV files from the top level of the local dataset directory to the S3 bucket
key_prefix = (prefix or "").strip("/")                            # Normalize the prefix so keys never get a missing or doubled "/"
for file in os.listdir(local_path):                               # Iterate over entries in the local directory
    file_path = os.path.join(local_path, file)                    # Create full file path
    if os.path.isfile(file_path) and file.endswith('.csv'):      # Proceed only if it's a CSV file
        key = f"{key_prefix}/{file}" if key_prefix else file      # Build the S3 object key
        s3.upload_file(file_path, bucket_name, key)               # Upload the file to S3
        print(f"Uploaded {file} to s3://{bucket_name}/{key}")     # Log upload confirmation

# Final status message
print("CSV files uploaded to S3")
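
The loop above only looks at files directly inside `local_path`. If a dataset unzips into subfolders, one option is to walk the directory tree and derive each S3 key from the file's relative path. A minimal sketch under that assumption: `iter_csv_uploads` is a hypothetical helper that only computes `(file_path, key)` pairs, so it can be checked without AWS access. Unlike the loop above, it matches the `.csv` extension case-insensitively.

```python
import os
import posixpath


def iter_csv_uploads(local_dir, prefix=""):
    """Yield (file_path, s3_key) for every .csv under local_dir, including subfolders."""
    clean_prefix = prefix.strip("/")
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            # Case-insensitive match, so "DATA.CSV" is included as well
            if not name.lower().endswith(".csv"):
                continue
            file_path = os.path.join(root, name)
            # S3 keys use forward slashes regardless of the local OS
            rel = os.path.relpath(file_path, local_dir).replace(os.sep, "/")
            key = posixpath.join(clean_prefix, rel) if clean_prefix else rel
            yield file_path, key
```

Each pair can then be handed to the existing client, for example `s3.upload_file(file_path, bucket_name, key)` inside a `for file_path, key in iter_csv_uploads(local_path, prefix):` loop.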
