# AWS S3, IAM, and Spark Configuration

## 1. Overview
This notebook configures connectivity between our local PySpark environment (running in Docker) and the AWS data lake on S3.

## 2. AWS Setup (Manual Steps)

Before running the cells below, complete the following steps in the AWS Console:

### A. Create S3 Bucket & Folders
1.  Go to **S3 Console**.
2.  Create a bucket (e.g., `self-yourname`). *Bucket names must be globally unique.*
3.  Inside the bucket, create a folder named `dw-with-pyspark`.
4.  Inside `dw-with-pyspark`, create the following sub-folders:
    *   `landing`
    *   `archive`
    *   `warehouse`
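
S3 has no real directories; a console "folder" is a zero-byte object whose key ends in `/`. As a minimal sketch of the same layout created from code, the helper names `folder_keys` and `create_folders` below are hypothetical, and the commented call assumes `boto3` is installed and credentials are already configured:

```python
def folder_keys(root, subfolders):
    """Return the zero-byte 'folder' keys for the root and each sub-folder."""
    root = root.strip("/")
    return [f"{root}/"] + [f"{root}/{name.strip('/')}/" for name in subfolders]


def create_folders(s3_client, bucket, keys):
    """Create one empty object per key; the S3 console displays these as folders."""
    for key in keys:
        s3_client.put_object(Bucket=bucket, Key=key)


keys = folder_keys("dw-with-pyspark", ["landing", "archive", "warehouse"])
print(keys)

# Assumption: boto3 is installed and credentials are configured.
#   import boto3
#   create_folders(boto3.client("s3"), "YOUR_BUCKET_NAME", keys)
```

Passing the client in as an argument keeps the helper independent of how credentials are loaded.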

### B. Configure IAM User
1.  Go to **IAM Console**.
2.  **Create Group:** Create a group (e.g., `S3DemoAccess`) and attach the policy **`AmazonS3FullAccess`**.
3.  **Create User:** Create a user (e.g., `s3_pyspark_user`).
4.  **Add to Group:** Add this user to the `S3DemoAccess` group.
5.  **Generate Keys:**
    *   Go to the user's **Security Credentials**.
    *   Create an **Access Key** (select Command Line Interface usage).
    *   **IMPORTANT:** Copy the `Access Key ID` and `Secret Access Key` now; AWS shows the secret key only once. We will use both below.
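
`AmazonS3FullAccess` grants access to every bucket in the account, which is convenient for a demo. As a hedged alternative, the sketch below builds a policy document scoped to one bucket and prefix. The function name `build_s3_policy` is hypothetical, and the action list is an assumption that may need extending (for example, for multipart uploads) depending on what Spark does against your bucket.

```python
import json


def build_s3_policy(bucket, prefix):
    """Build an IAM policy document limited to one bucket and key prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object reads and writes are granted only under the prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = build_s3_policy("YOUR_BUCKET_NAME", "dw-with-pyspark")
print(json.dumps(policy, indent=2))
```

The printed JSON could be pasted into the IAM policy editor and attached to the group in place of the managed policy.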

In [None]:
import os

# ==========================================
# INPUT YOUR AWS CONFIGURATION HERE
# ==========================================
aws_access_key = "YOUR_ACCESS_KEY_ID"
aws_secret_key = "YOUR_SECRET_ACCESS_KEY"
s3_bucket_name = "YOUR_BUCKET_NAME" 

# Define paths (Based on the Docker container structure)
# Usually Spark is located at /spark or ~/spark inside these images
spark_home = os.environ.get('SPARK_HOME', '/spark')
spark_conf_dir = os.path.join(spark_home, 'conf')

print(f"Configuring for Bucket: {s3_bucket_name}")
print(f"Spark Config Directory: {spark_conf_dir}")

## 3. Configure AWS Credentials
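We will create the `~/.aws/credentials` file. This allows libraries like `boto3` and the AWS CLI to authenticate with AWS. Spark's S3A connector does not read this file unless it is configured with a profile-based credentials provider, so the keys are also set in `spark-defaults.conf` in Section 4.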
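
The credentials file uses INI syntax, so it can be read back with the standard library to confirm that a profile holds both keys. This is a minimal sketch: `read_aws_profile` is a hypothetical helper, and the demo runs against a throwaway file with placeholder values rather than your real `~/.aws/credentials`.

```python
import configparser
import os
import tempfile


def read_aws_profile(path, profile="default"):
    """Return the key/value pairs of one profile, or {} if it is missing."""
    # interpolation=None keeps any '%' characters in values literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        return {}
    return dict(parser.items(profile))


# Demo against a temporary file containing placeholder values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credentials")
    with open(demo_path, "w") as f:
        f.write(
            "[default]\n"
            "aws_access_key_id = YOUR_ACCESS_KEY_ID\n"
            "aws_secret_access_key = YOUR_SECRET_ACCESS_KEY\n"
        )
    profile = read_aws_profile(demo_path)

# Print only the key names so that secrets never reach the notebook output
print(sorted(profile))
```

To check the real file, call `read_aws_profile(os.path.expanduser('~/.aws/credentials'))` after the next cell has run.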

In [None]:
# Create .aws directory
aws_dir = os.path.expanduser('~/.aws')
os.makedirs(aws_dir, exist_ok=True)

# Define credentials content
credentials_content = f"""[default]
aws_access_key_id = {aws_access_key}
aws_secret_access_key = {aws_secret_key}
"""

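# Write the file and restrict it to the current user, since it holds secrets
credentials_path = os.path.join(aws_dir, 'credentials')
with open(credentials_path, 'w') as f:
    f.write(credentials_content)
os.chmod(credentials_path, 0o600)

print(f"AWS Credentials file created at {credentials_path}")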

## 4. Configure `spark-defaults.conf`
We need to tell Spark to:
1.  Load the Delta Lake and Hadoop AWS libraries.
2.  Use the Delta Catalog.
3.  Use S3A FileSystem for `s3a://` schemes.
4.  Set the default Data Warehouse location to our S3 bucket.
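
Each of these goals maps to one or more properties in the file. As a rough sketch for checking that a generated file covers them, the hypothetical helper `parse_spark_defaults` below handles only the whitespace-separated `key value` form used in this notebook, and the sample text is an abbreviated stand-in for the real file:

```python
def parse_spark_defaults(text):
    """Parse 'key<whitespace>value' lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


sample = """
# --- S3 Configuration ---
spark.master                            local[*]
spark.sql.warehouse.dir                 s3a://YOUR_BUCKET_NAME/dw-with-pyspark/warehouse
spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
"""

settings = parse_spark_defaults(sample)
required = [
    "spark.sql.warehouse.dir",
    "spark.hadoop.fs.s3a.impl",
    "spark.sql.extensions",
]
missing = [key for key in required if key not in settings]
print(missing)  # the sample omits the Delta extension on purpose
```

Running the same check on the real `spark_defaults_content` string should produce an empty list.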

In [None]:
# Define the warehouse path
warehouse_dir = f"s3a://{s3_bucket_name}/dw-with-pyspark/warehouse"

spark_defaults_content = f"""
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

spark.master                            local[*]
spark.driver.memory                     4g
spark.executor.memory                   4g

# --- Packages ---
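# Note: Ensure these versions are compatible with the Spark version in your Docker image.
# The video uses specific versions; the ones below target Spark 3.3.x (Delta Lake 2.1.x requires it),
# and hadoop-aws should match the Hadoop version bundled with your Spark build.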
spark.jars.packages                     io.delta:delta-core_2.12:2.1.0,org.apache.hadoop:hadoop-aws:3.3.1

# --- Delta Lake Configuration ---
spark.sql.extensions                    io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog         org.apache.spark.sql.delta.catalog.DeltaCatalog

# --- S3 Configuration ---
spark.sql.warehouse.dir                 {warehouse_dir}
spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key          {aws_access_key}
spark.hadoop.fs.s3a.secret.key          {aws_secret_key}
spark.hadoop.fs.s3a.endpoint            s3.amazonaws.com
"""

# Write the file (this overwrites any existing spark-defaults.conf)
os.makedirs(spark_conf_dir, exist_ok=True)
with open(os.path.join(spark_conf_dir, 'spark-defaults.conf'), 'w') as f:
    f.write(spark_defaults_content)

print("spark-defaults.conf written.")

## 5. Configure `hive-site.xml`
This file configures the Hive metastore. Table metadata is kept in a local embedded Derby database, while `hive.metastore.warehouse.dir` points table data at our S3 warehouse location.
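
A `hive-site.xml` file is a flat list of `<property>` elements, each with a `<name>` and a `<value>`. The sketch below reads such a document into a dictionary so the warehouse location can be checked before Spark starts. `parse_hive_site` is a hypothetical helper, and the sample XML is a shortened stand-in with a placeholder bucket name.

```python
import xml.etree.ElementTree as ET


def parse_hive_site(xml_text):
    """Return a {name: value} dict from a Hadoop-style <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


sample_xml = """<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=/home/jovyan/metastore_db;create=true</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>s3a://YOUR_BUCKET_NAME/dw-with-pyspark/warehouse</value>
    </property>
</configuration>"""

props = parse_hive_site(sample_xml)
print(props["hive.metastore.warehouse.dir"])
```

The same function can be applied to the `hive_site_content` string to confirm that the bucket name was substituted correctly.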

In [None]:
hive_site_content = f"""<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=/home/jovyan/metastore_db;create=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>{warehouse_dir}</value>
        <description>location of default database for the warehouse</description>
    </property>
</configuration>
"""

# Write the file
with open(os.path.join(spark_conf_dir, 'hive-site.xml'), 'w') as f:
    f.write(hive_site_content)

print("hive-site.xml created.")

## 6. Restart Kernel
Spark reads these configuration files only when a session starts, so you **MUST restart your Spark Session** for them to take effect.
1.  Stop the Kernel (Kernel -> Shut Down Kernel).
2.  Refresh the page or start a new notebook to initialize a fresh SparkContext.

In [None]:
# 7. Verification
# After restarting the kernel, run this cell to verify S3 access.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("S3 Connectivity Test") \
    .enableHiveSupport() \
    .getOrCreate()

print("Spark Session Created.")

# Test S3 Access
# The kernel restart cleared earlier variables, so set the bucket name again.
s3_bucket_name = "YOUR_BUCKET_NAME"

# Check that the bucket root is reachable (an empty bucket should not raise an error).
try:
    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    path = sc._jvm.org.apache.hadoop.fs.Path(f"s3a://{s3_bucket_name}/")
    # Resolve the filesystem from the path itself; FileSystem.get(conf) would
    # return the default (local) filesystem and reject an s3a:// path.
    fs = path.getFileSystem(hadoop_conf)
    exists = fs.exists(path)
    print(f"Connection successful! Bucket '{s3_bucket_name}' exists: {exists}")
except Exception as e:
    print("Error connecting to S3:")
    print(e)