In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
spark = SparkSession\
    .builder\
    .appName("PythonSQL")\
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region","us-east-2")\
    .config("spark.yarn.access.hadoopFileSystems","s3a://go01-demo")\
    .getOrCreate()

Setting spark.hadoop.yarn.resourcemanager.principal to pauldefusco


In [3]:
spark.sparkContext.getConf().getAll()

[('spark.eventLog.enabled', 'true'),
 ('spark.ui.proxyRedirectUri',
  'https://spark-7hywdjpjbd7u8chr.ml-4c5feac0-3ec.go01-dem.ylcu-atmi.cloudera.site'),
 ('spark.hadoop.fs.s3a.s3guard.ddb.region', 'us-east-2'),
 ('spark.network.crypto.enabled', 'true'),
 ('spark.driver.memory', '1525m'),
 ('spark.kerberos.renewal.credentials', 'ccache'),
 ('spark.dynamicAllocation.maxExecutors', '49'),
 ('spark.eventLog.dir', 'file:///sparkeventlogs'),
 ('spark.hadoop.yarn.resourcemanager.principal', 'pauldefusco'),
 ('spark.app.id', 'spark-application-1682732794249'),
 ('spark.kubernetes.driver.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict',
  'false'),
 ('spark.ui.port', '20049'),
 ('spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict',
  'false'),
 ('spark.io.encryption.enabled', 'true'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.submit.deployMode', 'client'),
 ('spark.yarn.access.hadoopFileSystems', 's3a://go01-demo'),
 ('spark.master', 'k
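
The full configuration list is long. A minimal sketch, assuming you only care about one family of settings, is to filter the `(key, value)` pairs returned by `getAll()` by key prefix. The helper name `conf_with_prefix` is hypothetical, and the sample pairs are copied from the output above.

```python
def conf_with_prefix(pairs, prefix):
    """Return the (key, value) pairs whose key starts with `prefix`, sorted by key."""
    return sorted((k, v) for k, v in pairs if k.startswith(prefix))


# A few pairs copied from the output above; in the notebook you would pass
# spark.sparkContext.getConf().getAll() instead of this sample list.
sample = [
    ("spark.eventLog.enabled", "true"),
    ("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-2"),
    ("spark.driver.memory", "1525m"),
    ("spark.hadoop.yarn.resourcemanager.principal", "pauldefusco"),
]

for key, value in conf_with_prefix(sample, "spark.hadoop."):
    print(f"{key} = {value}")
```
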

#### Hive Metastore

Spark SQL uses a Hive metastore to manage the metadata of persistent relational entities (e.g. databases, tables, columns, partitions). Two distinct pieces are involved. The Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists table data. The Hive metastore itself (aka metastore_db) is a relational database that stores the metadata for those entities, which keeps metadata lookups fast.
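
To see the split between metadata and data for a concrete table, one option is to ask the metastore where the table's files live. The sketch below assumes that `DESCRIBE FORMATTED` returns rows with `col_name` and `data_type` columns, including a row whose `col_name` is `Location`. The helper name `table_location` is hypothetical, and `spark` is expected to be an active SparkSession.

```python
def table_location(spark, table_name):
    """Return the storage location the metastore records for `table_name`, or None.

    The answer comes from the metastore (metadata); the returned path points
    into the warehouse directory where the table data is stored.
    """
    rows = spark.sql(f"DESCRIBE FORMATTED {table_name}").collect()
    for row in rows:
        # Detail rows carry the property name in col_name and its value in data_type.
        if row.col_name.strip() == "Location":
            return row.data_type
    return None
```

Once a table exists, a call such as `table_location(spark, "spark.partitioned_table")` would be one way to check which warehouse path it uses.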

#### Hive Warehouse Connector

The Hive Warehouse Connector (HWC) is software for securely accessing Hive tables from Spark. You need HWC to access Hive managed tables from Spark. To write to managed tables, you call the HiveWarehouseConnector API explicitly. Reads can happen implicitly: when you run a Spark SQL query on a Hive managed table, HWC reads the table for you, so you might use HWC without realizing it.

You do not need HWC to read or write Hive external tables; native Spark SQL is enough. Tables created from Spark are still tracked in the Hive Metastore (HMS).

In this tutorial we will not use the HWC.
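
Because the choice between HWC and native Spark SQL depends on whether a table is managed or external, it can help to list the tables in a database grouped by type. This is a minimal sketch using PySpark's `spark.catalog.listTables`. The helper name `tables_by_type` is hypothetical, and the type strings Spark reports (such as `MANAGED` or `EXTERNAL`) are an assumption that may not match Hive's own classification on every platform.

```python
def tables_by_type(spark, database):
    """Group the non-temporary tables in `database` by the table type Spark reports."""
    grouped = {}
    for table in spark.catalog.listTables(database):
        if table.isTemporary:
            continue  # temporary views are not stored in the metastore
        grouped.setdefault(table.tableType, []).append(table.name)
    return grouped
```

In this notebook the call would be `tables_by_type(spark, "spark")` after the tables below have been created.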

In [4]:
# Show catalog and database
spark.sql("SHOW CURRENT NAMESPACE").show()

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|  default|
+-------------+---------+



In [5]:
# Create a new database
spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog.spark")
spark.sql("USE spark_catalog.spark")

Hive Session ID = ce3280e8-5184-4b22-8d58-c5b0dc7b298c


DataFrame[]

In [6]:
# Show catalog and database
spark.sql("SHOW CURRENT NAMESPACE").show()

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|    spark|
+-------------+---------+



#### Non-Partitioned Table

In [7]:
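spark.sql("DROP TABLE IF EXISTS spark.non_partitioned_table")
# No USING clause here: the plans further down show this table is read as a
# Hive SerDe table (LazySimpleSerDe), not as Parquet
spark.sql("CREATE TABLE IF NOT EXISTS spark.non_partitioned_table\
            (id BIGINT, state STRING, country STRING)")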

DataFrame[]

In [8]:
spark.sql("INSERT INTO spark.non_partitioned_table VALUES (1, 'CA', 'USA'),(2, 'CA', 'USA'),\
                    (3, 'AZ', 'USA'),\
                    (4, 'ON', 'CAN'),\
                    (5, 'AL', 'CAN')")

                                                                                

DataFrame[]

In [10]:
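from pyspark.sql.utils import AnalysisException

try:
    spark.sql("SHOW PARTITIONS spark.non_partitioned_table").show()
except AnalysisException:
    # SHOW PARTITIONS is rejected for a table that has no partition columns
    print("There are no partitions to be shown")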

There are no partitions to be shown
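
Catching an exception works, but the catalog can also be asked directly which columns are partition columns. The following is a sketch, assuming `spark.catalog.listColumns` reports an `isPartition` flag for each column of the table, as PySpark's catalog API does. The helper name `partition_columns` is hypothetical.

```python
def partition_columns(spark, table_name, database):
    """Return the names of the partition columns of `database`.`table_name`."""
    columns = spark.catalog.listColumns(table_name, database)
    return [column.name for column in columns if column.isPartition]
```

An empty list from `partition_columns(spark, "non_partitioned_table", "spark")` would indicate that there is nothing for `SHOW PARTITIONS` to list.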


Hue Screenshot here

#### Partitioned Table

In [11]:
spark.sql("DROP TABLE IF EXISTS spark.partitioned_table")
# Parquet data source table; rows are split into one partition per country value
spark.sql("CREATE TABLE IF NOT EXISTS spark.partitioned_table\
    (id BIGINT, state STRING, country STRING)\
    USING PARQUET\
    PARTITIONED BY (country)")

DataFrame[]

In [12]:
spark.sql("INSERT INTO spark.partitioned_table VALUES (1, 'CA', 'USA'),(2, 'CA', 'USA'),\
                    (3, 'AZ', 'USA'),\
                    (4, 'ON', 'CAN'),\
                    (5, 'AL', 'CAN')")

                                                                                

DataFrame[]

In [13]:
spark.sql("SHOW PARTITIONS spark.partitioned_table").show()

+-----------+
|  partition|
+-----------+
|country=CAN|
|country=USA|
+-----------+
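
The partition names follow the Hive-style `column=value` convention, where each partition is a subdirectory of the table location. The plain-Python sketch below only simulates that layout for the five inserted rows; it is not what Spark executes. The table root is the one printed by the formatted plan later in this notebook, and `partition_layout` is a hypothetical helper name.

```python
from collections import defaultdict

# Table root as printed by the formatted plan later in this notebook.
TABLE_ROOT = "s3a://go01-demo/warehouse/tablespace/external/hive/spark.db/partitioned_table"

# The rows inserted into spark.partitioned_table: (id, state, country).
ROWS = [
    (1, "CA", "USA"),
    (2, "CA", "USA"),
    (3, "AZ", "USA"),
    (4, "ON", "CAN"),
    (5, "AL", "CAN"),
]


def partition_layout(rows, root, partition_column="country"):
    """Map each Hive-style partition directory to the rows it would hold.

    The partition value is the last element of each row. It is encoded in the
    directory name, so only the remaining fields are kept for the data files.
    """
    layout = defaultdict(list)
    for *data_fields, partition_value in rows:
        directory = f"{root}/{partition_column}={partition_value}"
        layout[directory].append(tuple(data_fields))
    return dict(layout)


for directory, data in sorted(partition_layout(ROWS, TABLE_ROOT).items()):
    print(directory, data)
```

This is consistent with the `ReadSchema: struct<id:bigint,state:string>` entry in the plans for this table: `country` is not read from the Parquet files because its value comes from the partition directory.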



Hue Screenshot here

In [14]:
df_notp = spark.sql("SELECT * FROM spark.non_partitioned_table")

In [15]:
df_p = spark.sql("SELECT * FROM spark.partitioned_table")

In [16]:
df_notp.explain(mode="extended")

== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation [spark, non_partitioned_table], [], false

== Analyzed Logical Plan ==
id: bigint, state: string, country: string
Project [id#108L, state#109, country#110]
+- SubqueryAlias spark_catalog.spark.non_partitioned_table
   +- HiveTableRelation [`spark`.`non_partitioned_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#108L, state#109, country#110], Partition Cols: []]

== Optimized Logical Plan ==
HiveTableRelation [`spark`.`non_partitioned_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#108L, state#109, country#110], Partition Cols: []]

== Physical Plan ==
Scan hive spark.non_partitioned_table [id#108L, state#109, country#110], HiveTableRelation [`spark`.`non_partitioned_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#108L, state#109, country#110], Partition Cols: []]



In [21]:
df_notp.explain(mode="codegen")

Found 0 WholeStageCodegen subtrees.



In [22]:
df_notp.explain(mode="cost")

== Optimized Logical Plan ==
HiveTableRelation [`spark`.`non_partitioned_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#108L, state#109, country#110], Partition Cols: []], Statistics(sizeInBytes=45.0 B)

== Physical Plan ==
Scan hive spark.non_partitioned_table [id#108L, state#109, country#110], HiveTableRelation [`spark`.`non_partitioned_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#108L, state#109, country#110], Partition Cols: []]




In [23]:
df_notp.explain(mode="formatted")

== Physical Plan ==
Scan hive spark.non_partitioned_table (1)


(1) Scan hive spark.non_partitioned_table
Output [3]: [id#108L, state#109, country#110]
Arguments: [id#108L, state#109, country#110], HiveTableRelation [`spark`.`non_partitioned_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#108L, state#109, country#110], Partition Cols: []]




In [19]:
df_p.explain(mode="extended")

== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation [spark, partitioned_table], [], false

== Analyzed Logical Plan ==
id: bigint, state: string, country: string
Project [id#114L, state#115, country#116]
+- SubqueryAlias spark_catalog.spark.partitioned_table
   +- Relation spark.partitioned_table[id#114L,state#115,country#116] parquet

== Optimized Logical Plan ==
Relation spark.partitioned_table[id#114L,state#115,country#116] parquet

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet spark.partitioned_table[id#114L,state#115,country#116] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://go01-demo/warehouse/tablespace/external/hive/spark.db/partitioned..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,state:string>



In [20]:
df_p.explain(mode="codegen")

Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 (maxMethodCodeSize:324; maxConstantPoolSize:139(0.21% used); numInnerClasses:0) ==
*(1) ColumnarToRow
+- FileScan parquet spark.partitioned_table[id#114L,state#115,country#116] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://go01-demo/warehouse/tablespace/external/hive/spark.db/partitioned..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,state:string>

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private int columnartorow_batchIdx_0;
/* 010 */   private org.apache.spark.sql.execution.vectorized.OnHe

In [24]:
df_p.explain(mode="cost")

== Optimized Logical Plan ==
Relation spark.partitioned_table[id#114L,state#115,country#116] parquet, Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet spark.partitioned_table[id#114L,state#115,country#116] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://go01-demo/warehouse/tablespace/external/hive/spark.db/partitioned..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,state:string>




In [25]:
df_p.explain(mode="formatted")

== Physical Plan ==
* ColumnarToRow (2)
+- Scan parquet spark.partitioned_table (1)


(1) Scan parquet spark.partitioned_table
Output [3]: [id#114L, state#115, country#116]
Batched: true
Location: CatalogFileIndex [s3a://go01-demo/warehouse/tablespace/external/hive/spark.db/partitioned_table]
ReadSchema: struct<id:bigint,state:string>

(2) ColumnarToRow [codegen id : 1]
Input [3]: [id#114L, state#115, country#116]
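
Every plan for the partitioned table above reports `PartitionFilters: []`, because the query selects all rows and so no partition can be skipped. A small sketch for checking this programmatically is shown below. It assumes that `explain` prints through Python's standard output, as PySpark's implementation does, and both helper names (`explain_text`, `partition_filters`) are hypothetical.

```python
import contextlib
import io
import re


def explain_text(df, mode="extended"):
    """Capture what df.explain(mode=...) prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain(mode=mode)
    return buffer.getvalue()


def partition_filters(plan_text):
    """Return the text inside the first 'PartitionFilters: [...]' entry, or None."""
    match = re.search(r"PartitionFilters: \[(.*?)\]", plan_text)
    return match.group(1) if match else None


# Physical plan line copied from the output above (path left truncated as printed).
plan = (
    "FileScan parquet spark.partitioned_table[id#114L,state#115,country#116] "
    "Batched: true, DataFilters: [], Format: Parquet, "
    "Location: CatalogFileIndex(1 paths)[s3a://go01-demo/warehouse/tablespace/external/hive/spark.db/partitioned..., "
    "PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,state:string>"
)
print(repr(partition_filters(plan)))  # prints '' because no partition filter was applied
```

In the notebook, `partition_filters(explain_text(df_p))` would apply the same check to the live plan. For a query that filters on the partition column, for example `WHERE country = 'CAN'`, the entry would be expected to be non-empty, which is the sign that Spark prunes partitions instead of scanning all of them.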




#### Spark Challenges

Spark Merge Into Workaround

Spark Schema Evolution Workaround

Spark Partition Evolution Workaround