# Load Data from Kaggle via API

In [0]:
%pip install kaggle

## Process to set up a Databricks secret scope for API key handling

### Step 1: Create a Secret Scope either via UI or CLI (terminal)
- Via UI: Navigate to `https://<workspace-url>#secrets/createScope`. The URL is case-sensitive: `createScope` must be written with an uppercase `S`.
- Via CLI: Use the command `databricks secrets create-scope <scope-name>`

### Step 2: Add Secrets to the Scope
- Via CLI: Run `databricks secrets put-secret <scope-name> <key-name>`. The CLI prompts you to enter the value securely, so it does not appear in your terminal history.

Use the built-in `dbutils.secrets` utility to fetch the keys at runtime, as in the code cell below. Databricks redacts these values as `[REDACTED]` when they are printed in notebook output.

## Security Best Practices
- **Use External Backends**: For enterprise-grade security, back your Databricks secret scopes with an external secret store such as _Azure Key Vault_ or _AWS Secrets Manager_, where your cloud platform supports it. This lets your security team manage rotation and audit trails without accessing Databricks.
- **Principle of Least Privilege**: Grant READ permission on a scope only to the specific users or service principals that need it.
- **Environment Isolation**: Create separate scopes for different environments (e.g., `kaggle-dev`, `kaggle-prod`) so that production credentials cannot leak into development notebooks by accident.
- **Avoid Manual `os.environ` Where Possible**: For Databricks Apps or jobs, consider injecting secrets directly into the compute environment configuration rather than calling `dbutils` inside the code.
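
The environment-isolation practice above can be sketched as a small helper that picks the scope from an environment name. This is only an illustration: `SCOPE_BY_ENV`, `resolve_scope`, and `load_kaggle_credentials` are hypothetical names, the scope names are the examples from the list above, and the key names `username` and `api_key` are assumed to match the secrets used later in this notebook. The secret getter is passed in as a function so the helper does not depend on `dbutils` directly.

```python
from typing import Callable, Dict, Mapping

# Hypothetical mapping from deployment environment to secret scope name.
SCOPE_BY_ENV: Mapping[str, str] = {"dev": "kaggle-dev", "prod": "kaggle-prod"}


def resolve_scope(env: str) -> str:
    """Return the secret scope for an environment, failing loudly on unknown names."""
    try:
        return SCOPE_BY_ENV[env]
    except KeyError:
        raise ValueError(
            f"Unknown environment {env!r}; expected one of {sorted(SCOPE_BY_ENV)}"
        ) from None


def load_kaggle_credentials(
    get_secret: Callable[[str, str], str], env: str
) -> Dict[str, str]:
    """Fetch the Kaggle username and API key from the scope that belongs to `env`."""
    scope = resolve_scope(env)
    return {
        "KAGGLE_USERNAME": get_secret(scope, "username"),
        "KAGGLE_KEY": get_secret(scope, "api_key"),
    }
```

In a notebook, the getter could be `lambda scope, key: dbutils.secrets.get(scope=scope, key=key)`, and the returned dictionary could be passed to `os.environ.update(...)`.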

In [0]:
import os

# Prerequisite: Create & store a Kaggle credential secret in your Databricks secret scope
# Fetch credentials from secret scope
kaggle_user = dbutils.secrets.get(scope="kaggle_creds", key="username")
kaggle_key = dbutils.secrets.get(scope="kaggle_creds", key="api_key")

# Export the fetched secret values (not string literals) for the Kaggle CLI
os.environ["KAGGLE_USERNAME"] = kaggle_user
os.environ["KAGGLE_KEY"] = kaggle_key

print("Kaggle credentials configured!")
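
Before the download step, it can help to confirm that both variables the Kaggle CLI reads are actually set. The check below is a minimal sketch; `missing_env_vars` is a hypothetical helper name, and it only verifies that the values are non-empty, not that Kaggle accepts them.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or blank in `environ`, in input order."""
    return [name for name in names if not environ.get(name, "").strip()]
```

For example, `missing_env_vars(["KAGGLE_USERNAME", "KAGGLE_KEY"])` returns an empty list when both are configured, so a notebook could raise an error early if the list is non-empty.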

In [0]:
# import os

# dbutils.widgets.text("kaggle_username", "", "Enter Kaggle Username")
# dbutils.widgets.text("kaggle_api_key", "", "Enter Kaggle API Key")

# kaggle_user = dbutils.widgets.get("kaggle_username")
# kaggle_key = dbutils.widgets.get("kaggle_api_key")

# os.environ["KAGGLE_USERNAME"] = kaggle_user
# os.environ["KAGGLE_KEY"] = kaggle_key

In [0]:
# # Remove widgets for security reasons
# dbutils.widgets.remove("kaggle_username")
# dbutils.widgets.remove("kaggle_api_key")

In [0]:
spark.sql("""
-- DROP SCHEMA IF EXISTS workspace.ecommerce;
CREATE SCHEMA IF NOT EXISTS workspace.ecommerce;
""")

In [0]:
spark.sql("""
-- DROP VOLUME IF EXISTS workspace.ecommerce.ecommerce_data;
CREATE VOLUME IF NOT EXISTS workspace.ecommerce.ecommerce_data;
""")

In [0]:
%sh
cd /Volumes/workspace/ecommerce/ecommerce_data
kaggle datasets download -d mkechinov/ecommerce-behavior-data-from-multi-category-store

In [0]:
%sh
cd /Volumes/workspace/ecommerce/ecommerce_data
unzip -o ecommerce-behavior-data-from-multi-category-store.zip
ls -lh

In [0]:
%sh
cd /Volumes/workspace/ecommerce/ecommerce_data
rm -f ecommerce-behavior-data-from-multi-category-store.zip
ls -lh
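
The unzip and cleanup cells above could also be done from Python with the standard library, which avoids shelling out. This is a sketch under assumptions: `extract_and_remove` is a hypothetical helper, and it assumes the volume path supports ordinary file operations from the Python process.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_and_remove(archive: Path, dest: Path) -> List[str]:
    """Extract every member of `archive` into `dest`, then delete the archive.

    Returns the member names so the caller can see which files were extracted.
    """
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        names = zf.namelist()
    # Delete the archive only after extraction finished without raising.
    archive.unlink()
    return names
```

With the paths used in this notebook, the call would look like `extract_and_remove(Path("/Volumes/workspace/ecommerce/ecommerce_data/ecommerce-behavior-data-from-multi-category-store.zip"), Path("/Volumes/workspace/ecommerce/ecommerce_data"))`.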

In [0]:
%restart_python

In [0]:
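# The CSV files start with a header row; inferSchema costs an extra pass over the data
df_n = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv",
    header=True,
    inferSchema=True,
)
display(df_n.limit(10))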

In [0]:
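# Read the header row as column names so it is not counted as an event
df = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv",
    header=True,
    inferSchema=True,
)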

In [0]:
print(f"October 2019 - Total Events: {df.count():,}")
print("\n" + "="*60)
print("SCHEMA:")
print("="*60)
df.printSchema()

In [0]:
print("\n" + "="*60)
print("SAMPLE DATA (First 10 rows):")
print("="*60)
# df.show(5, truncate=False)
display(df.limit(10))