# Using Iceberg tables in a SageMaker PySparkProcessor job


References:

* https://iceberg.apache.org/docs/latest/aws/
* https://iceberg.apache.org/docs/1.5.0/spark-writes/#overwriting-data

## Setup

Download the jar dependencies, namely:
* iceberg-spark-runtime-3.3_2.12-1.5.2.jar
* iceberg-aws-bundle-1.5.2.jar
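
Both jars live on Maven Central, and the download URLs follow the standard repository layout (group path, artifact, version). As a minimal sketch, the URLs can be derived from the Maven coordinates rather than typed by hand; the helper name `maven_central_jar_url` is an illustrative assumption, not part of any library.

```python
def maven_central_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the repo1.maven.org URL of a jar from its Maven coordinates."""
    group_path = group_id.replace(".", "/")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"
    )


ICEBERG_VERSION = "1.5.2"

# Map each local jar file name to the URL it can be downloaded from.
jar_urls = {
    f"{artifact}-{ICEBERG_VERSION}.jar": maven_central_jar_url(
        "org.apache.iceberg", artifact, ICEBERG_VERSION
    )
    for artifact in ("iceberg-spark-runtime-3.3_2.12", "iceberg-aws-bundle")
}

for file_name, url in jar_urls.items():
    print(file_name, url)
```

Keeping the version in one place makes it harder for the two jars to drift apart when upgrading Iceberg.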

In [1]:
# https://iceberg.apache.org/releases/
!wget "https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.5.2/iceberg-spark-runtime-3.3_2.12-1.5.2.jar" -O iceberg-spark-runtime-3.3_2.12-1.5.2.jar

--2024-07-24 17:35:53--  https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.5.2/iceberg-spark-runtime-3.3_2.12-1.5.2.jar
Resolving search.maven.org (search.maven.org)... 34.234.198.27, 35.153.115.170
Connecting to search.maven.org (search.maven.org)|34.234.198.27|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.5.2/iceberg-spark-runtime-3.3_2.12-1.5.2.jar [following]
--2024-07-24 17:35:53--  https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.5.2/iceberg-spark-runtime-3.3_2.12-1.5.2.jar
Resolving repo1.maven.org (repo1.maven.org)... 151.101.20.209, 2a04:4e42:5::209
Connecting to repo1.maven.org (repo1.maven.org)|151.101.20.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41472221 (40M) [application/java-archive]
Saving to: ‘iceberg-spark-runtime-3.3_2.12-1

In [2]:
!wget "https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar" -O iceberg-aws-bundle-1.5.2.jar

--2024-07-24 17:35:53--  https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar
Resolving search.maven.org (search.maven.org)... 35.153.115.170, 34.234.198.27
Connecting to search.maven.org (search.maven.org)|35.153.115.170|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar [following]
--2024-07-24 17:35:54--  https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar
Resolving repo1.maven.org (repo1.maven.org)... 151.101.20.209, 2a04:4e42:5::209
Connecting to repo1.maven.org (repo1.maven.org)|151.101.20.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30512098 (29M) [application/java-archive]
Saving to: ‘iceberg-aws-bundle-1.5.2.jar’


2024-07-24 17:35:54 (204 MB/s) - ‘iceberg-aws-bundle-1.5.2.jar’ saved [

## Create a batch script to read and write Iceberg tables

In [3]:
%%writefile read_write_iceberg_tables.py
import argparse
from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    parser.add_argument("--database_name", type=str, help="name of the database in glue")
    parser.add_argument("--table_name", type=str, help="name of the glue table holding iceberg data")
    parser.add_argument("--warehouse_path", type=str, help="S3 URI serving as the warehouse of the iceberg table")
    args = parser.parse_args()

    catalog_name = "glue_catalog"
    database_name = args.database_name
    table_name = args.table_name
    warehouse_path = args.warehouse_path

    spark = (
        SparkSession.builder.appName("PySparkIceberg")
        .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}")
        .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
        .getOrCreate()
    )

    # In Spark 3.3 the spark.catalog API only covers the built-in session catalog,
    # so list the Glue-backed Iceberg catalog through SQL instead.
    print("Available Databases:")
    spark.sql(f"SHOW NAMESPACES IN {catalog_name}").show(truncate=False)

    print("Available Tables:")
    spark.sql(f"SHOW TABLES IN {catalog_name}.{database_name}").show(truncate=False)

    # Read a table
    df = spark.table(f"{catalog_name}.{database_name}.{table_name}")

    print(df.head())

    columns_to_keep = ["VendorID", "passenger_count", "trip_distance", "fare_amount", "tip_amount"]
    cleaned_df = df.select(*columns_to_keep)

    cleaned_table_name = "cleaned_" + table_name
    cleaned_df.writeTo(f"{catalog_name}.{database_name}.{cleaned_table_name}").createOrReplace()


if __name__ == "__main__":
    main()


Overwriting read_write_iceberg_tables.py
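
The script registers the Glue-backed Iceberg catalog purely through `spark.sql.catalog.<name>.*` settings. The sketch below collects those same keys in a plain dictionary so the naming pattern is easier to see; the helper `iceberg_glue_catalog_conf` and the bucket `my-bucket` are hypothetical names used for illustration only.

```python
def iceberg_glue_catalog_conf(catalog_name: str, warehouse_path: str) -> dict:
    """Return the Spark settings the script uses to register an Iceberg Glue catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse_path,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


# "my-bucket" is a placeholder, not a bucket used anywhere in this notebook.
conf = iceberg_glue_catalog_conf("glue_catalog", "s3://my-bucket/iceberg_catalog")
for key, value in conf.items():
    print(f"{key} = {value}")
```

In a Spark script, each pair could then be applied with `builder = builder.config(key, value)` before calling `getOrCreate()`. The catalog name chosen here is also the first part of every table identifier, as in `glue_catalog.<database>.<table>`.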


## Set up a data processing job and a SageMaker pipeline step with Iceberg and AWS Glue

In [4]:
import json
import boto3
import sagemaker
from time import gmtime, strftime

s3 = boto3.resource("s3")
role = sagemaker.get_execution_role()
default_bucket = sagemaker.Session().default_bucket()
region = sagemaker.Session().boto_region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


The `Classification` of each entry in the Spark configuration must be one of the following: `['core-site', 'hadoop-env', 'hadoop-log4j', 'hive-env', 'hive-log4j', 'hive-exec-log4j', 'hive-site', 'spark-defaults', 'spark-env', 'spark-log4j', 'spark-hive-site', 'spark-metrics', 'yarn-env', 'yarn-site']`. The template for each of these configuration files is documented on the corresponding project's official website, where you can look up the properties available for configuration.
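
Because the configuration is uploaded to S3 as plain JSON, a typo in a classification name may only surface once the job runs. A minimal sketch of a local check against the list above follows; `validate_spark_configuration` is a hypothetical helper, not part of the SageMaker SDK.

```python
ALLOWED_CLASSIFICATIONS = {
    "core-site", "hadoop-env", "hadoop-log4j", "hive-env", "hive-log4j",
    "hive-exec-log4j", "hive-site", "spark-defaults", "spark-env",
    "spark-log4j", "spark-hive-site", "spark-metrics", "yarn-env", "yarn-site",
}


def validate_spark_configuration(configuration: list) -> None:
    """Raise ValueError if an entry has an unknown classification or no properties dict."""
    for entry in configuration:
        classification = entry.get("Classification")
        if classification not in ALLOWED_CLASSIFICATIONS:
            raise ValueError(f"Unsupported classification: {classification!r}")
        if not isinstance(entry.get("Properties"), dict):
            raise ValueError(f"'Properties' must be a dict for {classification!r}")


example_configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.executor.memory": "2g", "spark.executor.cores": "1"},
    }
]

validate_spark_configuration(example_configuration)
print("configuration looks valid")
```

Calling the check just before the upload keeps the failure close to the place where the configuration is written.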

In [5]:
def upload_to_s3(bucket, prefix, body):
    s3_object = s3.Object(bucket, prefix)
    s3_object.put(Body=body)


default_spark_configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "2g",
            "spark.executor.cores": "1"
        }
    }
]

# Upload the default Spark configuration to S3 as a JSON file
prefix = "sagemaker/parametrize-spark-config-pysparkprocessor/"
default_spark_conf_prefix = "{}spark/conf/cores_1/configuration.json".format(prefix)
default_spark_configuration_object_s3_uri = "s3://{}/{}".format(
    default_bucket, default_spark_conf_prefix
)

upload_to_s3(default_bucket, default_spark_conf_prefix, json.dumps(default_spark_configuration))
# print(default_spark_configuration_object_s3_uri)

In [6]:
from sagemaker.workflow.parameters import ParameterString

spark_config_s3_uri = ParameterString(
    name="SparkConfigS3Uri",
    default_value=default_spark_configuration_object_s3_uri,
)

In [7]:
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.pipeline_context import LocalPipelineSession

local = False

if local:
    pipeline_session = LocalPipelineSession()
    instance_count = 1
else:
    pipeline_session = PipelineSession()
    instance_count = 2


In [8]:
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.processing import ProcessingInput
from sagemaker.spark.processing import _SparkProcessorBase


pyspark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.3",
    role=role,
    instance_count=instance_count,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
    sagemaker_session=pipeline_session,
)

In [9]:
pyspark_processor._conf_container_base_path, pyspark_processor._conf_container_input_name

('/opt/ml/processing/input/', 'conf')

In [10]:
# Specify the Glue Catalog database name, table name, and warehouse uri
glue_database_name = "default"
catalog_name = "iceberg_catalog"
glue_catalog_uri = f"s3://{default_bucket}/{catalog_name}"
table_name = "taxi_dataset"

In [11]:
step_args = pyspark_processor.run(
    submit_app="read_write_iceberg_tables.py",
    submit_jars=["iceberg-spark-runtime-3.3_2.12-1.5.2.jar", "iceberg-aws-bundle-1.5.2.jar"],
    inputs=[
        ProcessingInput(
            source=spark_config_s3_uri,
            destination=f"{pyspark_processor._conf_container_base_path}{pyspark_processor._conf_container_input_name}",
            input_name=_SparkProcessorBase._conf_container_input_name,
        )
    ],
    arguments=[
        "--database_name",
        glue_database_name,
        "--table_name",
        table_name,
        "--warehouse_path",
        glue_catalog_uri
    ]
)
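
The `arguments` list is handed to `read_write_iceberg_tables.py` as command-line arguments, so its flags have to match the ones the script's parser defines. As a rough local check, the same list can be parsed with an equivalent `argparse` parser before the job is submitted; the bucket name `my-bucket` below is a placeholder assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the three flags defined in read_write_iceberg_tables.py."""
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    parser.add_argument("--database_name", type=str)
    parser.add_argument("--table_name", type=str)
    parser.add_argument("--warehouse_path", type=str)
    return parser


# Same shape as the `arguments` list passed to the processor; the S3 URI is made up.
arguments = [
    "--database_name", "default",
    "--table_name", "taxi_dataset",
    "--warehouse_path", "s3://my-bucket/iceberg_catalog",
]

args = build_parser().parse_args(arguments)
print(args.database_name, args.table_name, args.warehouse_path)
```

A misspelled flag makes `parse_args` exit with an error locally, which is cheaper to discover than a failed processing job.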



In [12]:
from sagemaker.workflow.steps import ProcessingStep

spark_step_process = ProcessingStep(name="IcebergTablesIO", step_args=step_args)

## Setting up a SageMaker Pipeline

In [13]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "PySparkIcebergPipeline"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        spark_config_s3_uri
    ],
    steps=[spark_step_process],
    sagemaker_session=pipeline_session
)

Create the pipeline, or update it if it already exists. This does not start a pipeline execution.

In [14]:
pipeline.upsert(role_arn=role)

INFO:sagemaker.spark.processing:Copying dependency from local path iceberg-spark-runtime-3.3_2.12-1.5.2.jar to tmpdir /tmp/tmpwbpwlb63
INFO:sagemaker.spark.processing:Copying dependency from local path iceberg-aws-bundle-1.5.2.jar to tmpdir /tmp/tmpwbpwlb63
INFO:sagemaker.spark.processing:Uploading dependencies from tmpdir /tmp/tmpwbpwlb63 to S3 s3://sagemaker-us-west-2-AWS_ACCOUNT_ID/PySparkIcebergPipeline/code/4bdb7329b49bf6df2347db0434c2223b/jars
INFO:sagemaker.spark.processing:Copying dependency from local path iceberg-spark-runtime-3.3_2.12-1.5.2.jar to tmpdir /tmp/tmpeim_xjuk
INFO:sagemaker.spark.processing:Copying dependency from local path iceberg-aws-bundle-1.5.2.jar to tmpdir /tmp/tmpeim_xjuk
INFO:sagemaker.spark.processing:Uploading dependencies from tmpdir /tmp/tmpeim_xjuk to S3 s3://sagemaker-us-west-2-AWS_ACCOUNT_ID/PySparkIcebergPipeline/code/4bdb7329b49bf6df2347db0434c2223b/jars


{'PipelineArn': 'arn:aws:sagemaker:us-west-2:AWS_ACCOUNT_ID:pipeline/PySparkIcebergPipeline',
 'ResponseMetadata': {'RequestId': '706ae9a3-18ab-4604-90da-e5f58e73c864',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '706ae9a3-18ab-4604-90da-e5f58e73c864',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '90',
   'date': 'Wed, 24 Jul 2024 17:35:59 GMT'},
  'RetryAttempts': 0}}

In [15]:
execution = pipeline.start()

In [16]:
execution.list_steps()

[]
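
The empty list most likely means that no step had been scheduled yet when `list_steps()` was called right after `pipeline.start()`. A small sketch for summarizing step status once steps appear is shown below; it assumes each entry returned by `list_steps()` carries `StepName` and `StepStatus` keys, and the sample data is made up for illustration.

```python
def summarize_steps(steps: list) -> dict:
    """Map each step name to its status; steps without a status are marked 'Unknown'."""
    return {step["StepName"]: step.get("StepStatus", "Unknown") for step in steps}


# Made-up sample shaped like one entry of a list_steps() result.
sample_steps = [{"StepName": "IcebergTablesIO", "StepStatus": "Executing"}]
print(summarize_steps(sample_steps))

# In the notebook (needs AWS access, so it is left commented out here):
# execution.wait()  # blocks while the execution is still running
# print(summarize_steps(execution.list_steps()))
```

Waiting before listing the steps avoids reading the status while the execution is still being set up.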