Run Spark 3.3.0 locally with remote AWS S3 using Glue Metastore

@jirislav jirislav released this 26 Jan 08:26
d885a99

What and Why?

The only purpose of this fork is to create this release and share the pre-built JARs with the community.

It allowed me to run a local SparkSession connected to the AWS Glue Data Catalog, with AWS S3 as the backend storage and Iceberg tables. In essence, I can debug locally the jobs that would normally run within an AWS EMR cluster.

How to use it?

I assume you already have the relevant Spark version installed and the SPARK_HOME environment variable set.

I used Spark 3.3.0 because that is the version used in AWS EMR, and I wanted my local setup to match it as closely as possible.

  1. Download the pre-built JARs, verify the checksum, and unpack them into the jars directory in $SPARK_HOME.
cd /tmp
wget https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz
sha512sum -c <(curl -sL https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz.sha512)

cd "$SPARK_HOME/jars"
tar -xf /tmp/spark-3.3.0-jars.tgz
  2. Make sure to use appropriate settings for Hive and Spark. I suggest you keep these configuration files in ~/.config/spark/ and export that directory as SPARK_CONF_DIR.

First, set up a clean Spark configuration directory.

# Put this into your .bashrc / .zshrc, or export it every time you run Spark
export SPARK_CONF_DIR=~/.config/spark
mkdir -p "$SPARK_CONF_DIR"  # create the directory if it does not exist yet
cd "$SPARK_CONF_DIR"

Next, put the configuration there:

spark-defaults.conf
cat <<EOF > "$SPARK_CONF_DIR/spark-defaults.conf"

spark.hadoop.hive.metastore.client.factory.class	com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.hadoop.hive.metastore.warehouse.dir		s3://YOUR_S3_BUCKET/default
spark.hadoop.fs.s3a.aws.credentials.provider		org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, com.amazonaws.auth.EnvironmentVariableCredentialsProvider, org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
spark.hadoop.fs.s3a.impl                    		org.apache.hadoop.fs.s3a.S3AFileSystem

spark.sql.warehouse.dir					hdfs:///user/spark/warehouse
spark.sql.extensions					org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

spark.sql.catalogImplementation			hive
spark.sql.catalog.iceberg				org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl			org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.iceberg.io-impl			org.apache.iceberg.aws.s3.S3FileIO
# Your IAM role needs permission to perform dynamodb:DescribeTable on resource: arn:aws:dynamodb:YOUR_REGION:1234567890:table/IcebergLockTable
#spark.sql.catalog.iceberg.lock-impl			org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
#spark.sql.catalog.iceberg.lock.table			IcebergLockTable
spark.sql.catalog.iceberg.warehouse			s3://YOUR_S3_BUCKET/iceberg

spark.sql.emr.internal.extensions			com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.sql.hive.metastore.sharedPrefixes			com.amazonaws.services.dynamodbv2
spark.sql.parquet.output.committer.class		com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.sql.sources.partitionOverwriteMode		dynamic
spark.sql.thriftserver.scheduler.pool			fair
spark.sql.ui.explainMode				extended

spark.sql.parquet.fs.optimized.committer.optimization-enabled true
EOF

hive-site.xml
cat <<EOF > "$SPARK_CONF_DIR/hive-site.xml"
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

    <property>
        <name>aws.glue.endpoint</name>
        <value>https://glue.YOUR_REGION.amazonaws.com</value>
    </property>

    <property>
        <name>aws.glue.region</name>
        <value>YOUR_REGION</value>
    </property>

    <property>
        <name>aws.glue.connection-timeout</name>
        <value>30000</value>
    </property>

    <property>
        <name>aws.glue.socket-timeout</name>
        <value>30000</value>
    </property>

    <!--
    <property>
        <name>aws.glue.proxy.host</name>
        <value>YOUR_GLUE_PROXY_HOST</value>
    </property>

    <property>
        <name>aws.glue.proxy.port</name>
        <value>8888</value>
    </property>
    -->

    <property>
        <!-- Setting for Hive2. See https://github.com/awslabs/aws-glue-catalog-sync-agent-for-hive/issues/3 -->
        <name>hive.imetastoreclient.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
    </property>

    <property>
        <!-- Setting for Hive3. See https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore -->
        <name>hive.metastore.client.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
    </property>

    <!-- Hive Metastore connection settings -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://FOR_YOUR_METASTORE_URI_SEE_SPARK_UI_ENVIRONMENT_TAB:9083</value>
        <description>URI for client to connect to metastore server</description>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>s3://YOUR_S3_BUCKET/default</value>
        <description>Location of default database for the warehouse</description>
    </property>

    <property>
        <name>hive.metastore.connect.retries</name>
        <value>15</value>
    </property>

    <property>
        <name>aws.glue.cache.table.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>aws.glue.cache.table.size</name>
        <value>1000</value>
    </property>
    <property>
        <name>aws.glue.cache.table.ttl-mins</name>
        <value>30</value>
    </property>

    <property>
        <name>aws.glue.cache.db.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>aws.glue.cache.db.size</name>
        <value>1000</value>
    </property>
    <property>
        <name>aws.glue.cache.db.ttl-mins</name>
        <value>30</value>
    </property>
</configuration>
EOF
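
Both files above still contain the YOUR_S3_BUCKET and YOUR_REGION placeholders. As a minimal sketch, they could be filled in with a small helper like the one below. The function name fill_placeholders and the example values my-bucket and eu-west-1 are assumptions for illustration, not part of the original setup. The metastore URI and the proxy settings are left untouched and need to be edited by hand if you use them.

```shell
# Hypothetical helper: replace the YOUR_S3_BUCKET / YOUR_REGION placeholders
# in one config file. It writes to a temp file first, so it does not rely on
# `sed -i`, which behaves differently on GNU and BSD sed.
# Assumes the bucket and region values contain no `|` or `&` characters.
fill_placeholders() {
    file=$1
    bucket=$2
    region=$3
    tmp=$(mktemp) || return 1
    sed -e "s|YOUR_S3_BUCKET|$bucket|g" \
        -e "s|YOUR_REGION|$region|g" \
        "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example values only; substitute your own bucket and region.
# for f in spark-defaults.conf hive-site.xml; do
#     fill_placeholders "$SPARK_CONF_DIR/$f" my-bucket eu-west-1
# done
```

The loop is commented out on purpose, so that pasting the block only defines the function and does not modify any file until you choose real values.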

And you're done! Now, with the SPARK_HOME and SPARK_CONF_DIR environment variables set, you can launch Spark locally with a remote connection to the data on S3. Enjoy! 🎉
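
Before the first launch, it may help to confirm that everything from the steps above is in place. The following is a rough sketch of such a check. The function name check_spark_setup is hypothetical, and it only looks at the environment variables, the two config files, and the YOUR_S3_BUCKET / YOUR_REGION placeholders. It does not verify AWS credentials or permissions.

```shell
# Hypothetical pre-flight check (not part of the release): verifies that the
# environment variables and config files described above are in place.
check_spark_setup() {
    [ -n "${SPARK_HOME:-}" ] && [ -d "$SPARK_HOME/jars" ] \
        || { echo "SPARK_HOME is unset or has no jars directory" >&2; return 1; }
    [ -n "${SPARK_CONF_DIR:-}" ] \
        || { echo "SPARK_CONF_DIR is unset" >&2; return 1; }
    for f in spark-defaults.conf hive-site.xml; do
        [ -f "$SPARK_CONF_DIR/$f" ] \
            || { echo "missing $SPARK_CONF_DIR/$f" >&2; return 1; }
        # Ignore '#' comment lines, then look for placeholders left unreplaced.
        if grep -v '^[[:space:]]*#' "$SPARK_CONF_DIR/$f" \
                | grep -Eq 'YOUR_(S3_BUCKET|REGION)'; then
            echo "unreplaced placeholder in $SPARK_CONF_DIR/$f" >&2
            return 1
        fi
    done
    echo "Spark setup looks complete"
}

# Possible use: run a trivial query against the catalog once the check passes.
# check_spark_setup && "$SPARK_HOME/bin/spark-sql" -e 'SHOW DATABASES'
```

The commented-out last line assumes your Spark distribution ships the spark-sql launcher and that AWS credentials are available to it, for example through environment variables, as the credentials provider list in spark-defaults.conf suggests.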

How to build it yourself?

Anyone can build this release. Here are the steps I used to build the JARs (I assume you already have git and mvn installed):

  1. Build the Hive JARs with the Spark patch.
cd /tmp
wget https://issues.apache.org/jira/secure/attachment/12958418/HIVE-12679.branch-2.3.patch

git clone https://github.com/apache/hive.git
cd hive
git checkout branch-2.3

patch -p0 </tmp/HIVE-12679.branch-2.3.patch
mvn clean install -DskipTests
  2. Build the AWS Glue Catalog Client for the patched Hive Metastore.
git clone https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore.git
cd aws-glue-data-catalog-client-for-apache-hive-metastore/
mvn clean install -DskipTests  # This actually failed on Hive3 build, but that's okay since I'm only interested in the build of Spark libraries
  3. Gather all the (Spark-relevant!) JARs in one place (to later copy them into $SPARK_HOME/jars).
mkdir -p /tmp/hive-jars
find ~/.m2/repository/org/apache/hive/ -type f -name "*.jar" | grep /2.3 | grep -v -- '-tests' | xargs -I{} cp '{}' /tmp/hive-jars/
find ~/.m2/repository/com/amazonaws/glue/ -type f -name "*.jar" | grep -vE 'shim|-tests' | xargs -I{} cp '{}' /tmp/hive-jars/
find ~/.m2/repository/org/apache/thrift -type f -name "libthrift-0.1*.jar" | xargs -I{} cp '{}' /tmp/hive-jars/

I have created the release file at this point, which you can download below ⬇.
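
For anyone who wants to produce a similar archive from their own build, here is one way it could be done. This is only a sketch: the package_jars helper is hypothetical, and I am assuming a flat tarball plus a SHA-512 file, which is what the download and unpack commands in the usage section expect. It is not necessarily how the published file was created.

```shell
# Hypothetical packaging helper: bundles every file in a directory into a
# .tgz and writes a SHA-512 checksum file next to it.
package_jars() {
    src_dir=$1
    out_dir=$2
    name=$3
    # -C makes the archive paths relative to src_dir, so extracting inside
    # $SPARK_HOME/jars drops the JARs directly into that directory.
    tar -czf "$out_dir/$name" -C "$src_dir" . || return 1
    # Run sha512sum from inside out_dir so the checksum file records a bare
    # file name, which `sha512sum -c` can then resolve in the download dir.
    ( cd "$out_dir" && sha512sum "$name" > "$name.sha512" )
}

# Example call matching the paths used above:
# package_jars /tmp/hive-jars /tmp spark-3.3.0-jars.tgz
```

Keep the output directory separate from the source directory, otherwise tar would try to include the archive in itself.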