## Spark Hive Examples - Hadoop Execution Engine

- JEG (Jupyter Enterprise Gateway) notebook kernels connect to the Hive Metastore automatically
- In CDH, tables defined through Impala are accessible through the same metastore
  - Impala can be used to create tables on different data sources, e.g. Kudu or HDFS
- Spark "temporary" tables vs. Hive tables in spark.sql

In [1]:
%%time
# Note - spark.sql returns a DataFrame and does not display the query result; call an action such as .show() to see the rows
spark.sql("show tables")

CPU times: user 11 ms, sys: 2.54 ms, total: 13.6 ms
Wall time: 7.37 s


DataFrame[database: string, tableName: string, isTemporary: boolean]

In [2]:
%%time
spark.sql("show tables").show(6, truncate=False)

+--------+------------------------+-----------+
|database|tableName               |isTemporary|
+--------+------------------------+-----------+
|default |atlas_higgs_100x        |false      |
|default |atlas_higgs_100x_parquet|false      |
|default |atlas_higgs_demo        |false      |
|default |customer_history        |false      |
|default |customers               |false      |
|default |hive_test               |false      |
+--------+------------------------+-----------+
only showing top 6 rows

CPU times: user 3.38 ms, sys: 637 µs, total: 4.01 ms
Wall time: 737 ms
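
The difference between the two cells above can be sketched as a small helper: spark.sql only hands back a DataFrame, and rows reach the notebook once an action such as collect() or show() runs. The helper name list_tables and its limit parameter are illustrative and not part of the original notebook; the column names assume the Spark 2.x "show tables" output shown above.

```python
def list_tables(spark, limit=6):
    """Return (database, tableName, isTemporary) tuples for the first `limit` tables.

    `spark` is assumed to be the SparkSession that the notebook kernel provides.
    """
    tables_df = spark.sql("show tables")      # returns a DataFrame; prints nothing
    rows = tables_df.limit(limit).collect()   # action: rows are brought to the driver
    return [(r["database"], r["tableName"], r["isTemporary"]) for r in rows]
```

In a notebook cell, list_tables(spark) would return plain Python tuples that can be filtered further, for example to keep only the rows where isTemporary is true.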


### Benchmark different Hive formats


**Example:** Query 4 columns from a 25-million-row dataset.

- **Table1: atlas_higgs_100x, default Hive storage**

In [3]:
%%time
df_table1 = spark.sql("""select EventId, DER_mass_MMC, DER_mass_vis, DER_sum_pt 
                    from atlas_higgs_100x
                    where EventId > 100000 and EventId < 120000 and DER_mass_MMC > 1
                    """)

CPU times: user 2.36 ms, sys: 232 µs, total: 2.59 ms
Wall time: 443 ms


In [4]:
%%time
df_table1.count()

CPU times: user 8.85 ms, sys: 3.49 ms, total: 12.3 ms
Wall time: 46.4 s


1697200

- **Table2: atlas_higgs_100x_parquet, Hive table stored as Parquet**

In [5]:
%%time
df_table2 = spark.sql("""select EventId, DER_mass_MMC, DER_mass_vis 
                    from atlas_higgs_100x_parquet
                    where EventId > 100000 and EventId < 120000 and DER_mass_MMC > 1
                    """)

CPU times: user 2.26 ms, sys: 1.58 ms, total: 3.84 ms
Wall time: 755 ms


In [6]:
%%time
df_table2.count()

CPU times: user 9.57 ms, sys: 1.53 ms, total: 11.1 ms
Wall time: 5.97 s


1697200
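
The notebook does not show how the Parquet copy of the table was created. One common way, offered here only as an assumption, is a CREATE TABLE ... STORED AS PARQUET AS SELECT statement. The helper parquet_ctas below is hypothetical and only builds the SQL text.

```python
def parquet_ctas(source_table, target_table):
    """Build a CTAS statement that copies a Hive table into Parquet storage."""
    return (
        f"CREATE TABLE {target_table} "
        f"STORED AS PARQUET "
        f"AS SELECT * FROM {source_table}"
    )

# Table names are the ones used in the benchmark above.
ddl = parquet_ctas("atlas_higgs_100x", "atlas_higgs_100x_parquet")
print(ddl)

# Assumption: in a Hive-enabled Spark session the statement could be run with
# spark.sql(ddl); this is not necessarily how the original table was built.
```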

### Result

1 executor: Hive 51 seconds vs. Hive + Parquet 6.7 seconds

4 executors: Hive 46 seconds vs. Hive + Parquet 5.9 seconds
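
To put the timings above in relative terms, the short calculation below divides the default-storage time by the Parquet time for each executor count. The numbers are copied from the result lines; the names results and speedup are illustrative.

```python
# Wall-clock totals quoted in the Result section, in seconds.
results = {
    1: {"hive": 51.0, "parquet": 6.7},
    4: {"hive": 46.0, "parquet": 5.9},
}

def speedup(baseline_s, improved_s):
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s

for executors, t in results.items():
    ratio = speedup(t["hive"], t["parquet"])
    print(f"{executors} executor(s): Parquet is {ratio:.1f}x faster")
```

With these figures the Parquet table is roughly 7.6x faster on 1 executor and 7.8x faster on 4, while going from 1 to 4 executors changes either format's time by only about 10 to 15 percent.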

---

## Part 2 - "Registering temp tables" and spark.sql scope

In [7]:
%%time
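# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is the equivalent call
df_table2.createOrReplaceTempView("tmp_table_dftable2")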

CPU times: user 2.4 ms, sys: 246 µs, total: 2.64 ms
Wall time: 67.3 ms


Note: pass truncate=False to show() so that long column values are displayed in full instead of being cut off:

In [8]:
spark.sql("show tables").show(truncate=False)

+--------+------------------------+-----------+
|database|tableName               |isTemporary|
+--------+------------------------+-----------+
|default |atlas_higgs_100x        |false      |
|default |atlas_higgs_100x_parquet|false      |
|default |atlas_higgs_demo        |false      |
|default |customer_history        |false      |
|default |customers               |false      |
|default |hive_test               |false      |
|default |kudu_external_t1        |false      |
|default |sample_07               |false      |
|default |sample_071              |false      |
|default |sample_0711             |false      |
|default |sample_072              |false      |
|default |sample_073              |false      |
|default |sample_08               |false      |
|default |sample_082              |false      |
|default |t1                      |false      |
|default |t2                      |false      |
|default |test001                 |false      |
|default |web_logs                |false      |

In [9]:
# Note: at this point spark.sql can query the Hive/Impala/Kudu tables as well as the temporary table (isTemporary = true) registered in this Spark session
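
A minimal sketch of such a query against the temporary table registered in cell 7 might look like the following. The helper count_temp_view_rows is hypothetical, and spark is assumed to be the session that registered tmp_table_dftable2; a temporary table is scoped to that session and is not visible to other kernels.

```python
def count_temp_view_rows(spark, view_name="tmp_table_dftable2"):
    """Count the rows of a session-scoped temporary table through spark.sql."""
    # The temporary table is referenced by name exactly like a Hive table.
    result_df = spark.sql(f"select count(*) as n from {view_name}")
    # collect() is the action that actually runs the query.
    return result_df.collect()[0]["n"]
```

Because the temporary table is just a named DataFrame, the count should match df_table2.count() from the benchmark above.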