# Accessing Hive 3.0 via Spark with the Hive Warehouse Connector

In HDP 3.0, Hive 3.0 changed how Hive managed tables can be accessed from Spark.

Spark and Hive now use independent catalogs for accessing SparkSQL or Hive tables on the same platform. A remote Livy session that accesses Hive via SparkSQL no longer shares the same Hive Catalog as Beeline or Hive JDBC clients. To better understand the Hive 3.0 changes, let's take a look at the three methods for accessing Hive 3.0:

**SparkSQL** - Hive thrift
- Hive Data scope: Spark Catalog
- Native SparkSQL Access to Hive
- Spark Driver accesses the metastore, Spark Executors access the data from HDFS in parallel.

**HiveWarehouseConnector - Hive JDBC**
- Hive Data scope: Hive Catalog
- HWC Library needed to execute queries and retrieve results as a DataFrame
- Query is submitted to Hive and run on the Hive Engine. 

**HiveWarehouseConnector - Hive LLAP JDBC**
- Hive Data scope: Hive Catalog
- Hive Interactive Queries
- Query is submitted to Hive LLAP and run on the LLAP Daemons. 


The [Hive Warehouse Connector](https://github.com/hortonworks-spark/spark-llap/tree/master) is a library to read/write DataFrames and Streaming DataFrames to/from Apache Hive™ using LLAP. With Apache Ranger™, this library provides row/column level fine-grained access controls.

See the [HDP 3.0.X Documentation](https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html) for further details on these changes. 

---

<div class="alert alert-block alert-info"> **Prerequisite:** The Hive Warehouse Connector libraries are distributed on the edge nodes of an HDP cluster. The DSXHI administrator can upload them to a shared location on HDFS, or end users may place them on any accessible path, such as their HDFS home directories.</div>

## Table of Contents
This notebook contains these main sections:

1. [Using SparkSQL to access the Spark Catalog in Hive](#Spark_Catalog)
2. [Using the HWC to access Hive Managed Tables](#HWC_HIVE)
3. [Using the HWC to access LLAP Hive Managed Tables](#HWC_HIVE_LLAP)

<a id='Spark_Catalog'></a>
# 1. Using SparkSQL to access the Spark Catalog in Hive

[SparkSQL](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html) lets you use the **SparkSession**, which is automatically created in a remote Livy session, to access the Spark Catalog within Hive.

### 1.1 Create a Livy Session
First, let's import the necessary Python dependencies and see the registered Hadoop Systems available.

In [1]:
import dsx_core_utils
%load_ext sparkmagic.magics

# Retrieve a list of registered Hadoop Integration systems.
DSXHI_SYSTEMS = dsx_core_utils.get_dsxhi_info(showSummary=True)

Available Hadoop systems: 

    systemName  LIVYSPARK  LIVYSPARK2                  imageId
0  Azeroth-301             livyspark2  dsx-scripted-ml-python2
1        zinc1  livyspark  livyspark2  dsx-scripted-ml-python2


**Configure the Spark properties** that we will use for interacting with Hive via a remote Livy session.

Additional [Livy properties](https://livy.incubator.apache.org/docs/latest/rest-api.html) can be provided, such as the YARN queue and the driver memory.

In [2]:
# Set up sparkmagic to connect to the selected registered HI systemName above.
myConfig={
 "queue": "default",
 "driverMemory": "1G",
 "numExecutors": 1,
 "executorMemory": "1G"
}

HI_CONFIG = dsx_core_utils.setup_livy_sparkmagic(
  system="Azeroth-301", 
  livy="livyspark2",
  imageId=None,
  addlConfig=myConfig)

# (Re-)load spark magic to apply the new configs.
%reload_ext sparkmagic.magics

sparkmagic has been configured to use https://azeroth-301-edge.fyre.ibm.com:8443/gateway/kanchws-52-nfs-master-1/livy2/v1 
success configuring sparkmagic livy.


In [3]:
session_name = 'spark_catalog_access'
livy_endpoint = HI_CONFIG['LIVY']
%spark add -s $session_name -l python -k -u $livy_endpoint

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
197,application_1544476065609_0015,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


### 1.2 Running SQL queries

The Spark Catalog can be accessed natively via SparkSQL or SparkMagic.

#### 1.2.1 Show Databases / Tables 

In [4]:
%%spark
spark.sql("show databases").show(1)

+------------+
|databaseName|
+------------+
|     default|
+------------+

In [5]:
%%spark
spark.sql("use default").show(0)

++
||
++
++

#### 1.2.2 Show tables using Spark Magic
**Note** Accessing the Spark Catalog requires [Hive Acid Transactions](https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/managing-hive/content/hive_acid_operations.html) to be enabled. 

`Tip: Spark Magic may return an Encoding stacktrace on the first run of this cell. Re-running the cell allows it to render properly.`

In [7]:
%%spark -c sql -s $session_name
show tables

Unnamed: 0,database,tableName,isTemporary
0,default,spark_catalog_tabletest,False
1,default,t1,False


In [8]:
%%spark -c sql -s $session_name 
insert into spark_catalog_tabletest values("Other","Two")

#### 1.2.3 Store query results in a Spark DataFrame

In [10]:
%%spark 
REMOTE_DF = spark.sql("select * from spark_catalog_tabletest")

REMOTE_DF.schema

StructType(List(StructField(col1,StringType,true),StructField(cold2,StringType,true)))

#### 1.2.4 Retrieve query results to Jupyter and visualize

Visualization libraries require the data to be present within Jupyter's kernel (running on Watson Studio).

This means that a Spark DataFrame must be retrieved from the running session into Jupyter before a visualization library can be applied to it.

In [11]:
%%spark -o WSL_DF -n 5
WSL_DF = spark.sql("select * from spark_catalog_tabletest")

**Note** `WSL_DF` is now a local DataFrame within Jupyter. The following cells run locally within Jupyter, as they do **not** include `%%spark`.

In [12]:
import pixiedust

Pixiedust database opened successfully


In [None]:
display(WSL_DF)

col1,cold2
Other,Two
Other,Two
Other,Two
Other,Two
Other,Two


**Important** Close the existing Livy session before proceeding to Part 2, as we'll be creating a **new** Livy session for each test.

In [14]:
%spark delete -s $session_name

<a id='HWC_HIVE'></a>
---
# 2. Using the HWC to access Hive Managed Tables

The Hive Warehouse Connector gives Spark access to the following Hive operations:

**Supported Catalog Operations**
- Set the current database for unqualified Hive table references<br>
&nbsp;&nbsp;&nbsp;&nbsp;`hivecon.setDatabase(<database>)`

- Execute a catalog operation and return a DataFrame<br>
&nbsp;&nbsp;&nbsp;&nbsp;`hivecon.execute("describe extended web_sales").show(100)`

- Show databases<br>
&nbsp;&nbsp;&nbsp;&nbsp;`hivecon.showDatabases().show(100)`

- Show tables for the current database<br>
&nbsp;&nbsp;&nbsp;&nbsp;`hivecon.showTables().show(100)`

- Describe a table<br>
&nbsp;&nbsp;&nbsp;&nbsp;`hivecon.describeTable(<table_name>).show(100)`

- Create a database<br>
&nbsp;&nbsp;&nbsp;&nbsp;`hivecon.createDatabase(<database_name>,<ifNotExists>)`

**Supported Read Operations**
- Execute a Hive SELECT query and return a DataFrame.<br>
&nbsp;&nbsp;&nbsp;&nbsp;`DF1 = hivecon.executeQuery("select * from web_sales")`
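
As a minimal sketch of how the catalog operations above can be combined, the helper below selects a database and returns its table names as a plain Python list. The function name `hive_table_names` is hypothetical. `setDatabase` and `showTables` are the HWC calls listed above, `collect()` is the standard Spark DataFrame method, and the `tab_name` column is an assumption based on the `showTables()` output shown in section 2.2.1. It is meant to run inside a `%%spark` cell once `hivecon` has been built as in section 2.1.1.

```python
def hive_table_names(hivecon, database="default"):
    """Return the sorted table names of `database` in the Hive Catalog.

    `hivecon` is assumed to be a HiveWarehouseSession (see section 2.1.1).
    """
    # Unqualified table references resolve against this database from now on.
    hivecon.setDatabase(database)
    # showTables() returns a DataFrame; collect() brings its rows to the driver.
    rows = hivecon.showTables().collect()
    # Assumption: the table name column is called `tab_name`.
    return sorted(row["tab_name"] for row in rows)
```

Calling `hive_table_names(hivecon)` would then list the tables of the `default` database without printing a formatted table.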

---

### 2.1 Create a new Livy Session
Create a new Livy session, this time passing in the Spark configuration properties **jars** and **pyFiles**, which indicate the location of the Hive Warehouse Connector libraries (on HDFS, or on the local filesystem for Kerberized clusters, as noted below).

**Recommended** The `conf` settings can be applied in Ambari as spark2-defaults for ALL Spark2-client dependent applications. This may be ideal for some situations, as it will make it easier for Data Scientists when configuring their Spark Session Properties. 

Required Spark Conf For Kerberized Clusters:

- spark.sql.hive.hiveserver2.jdbc.url
- spark.datasource.hive.warehouse.metastoreUri
- spark.datasource.hive.warehouse.load.staging.dir
- spark.hadoop.hive.llap.daemon.service.hosts
- spark.hadoop.hive.zookeeper.quorum
- spark.sql.hive.hiveserver2.jdbc.url.principal
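
A missing property is easy to overlook when the `conf` dictionary is written by hand. The following is a small sketch, not part of `dsx_core_utils` or HWC, that reports which of the properties listed above are absent from a `conf` dictionary before it is passed to Livy. The names `REQUIRED_HWC_CONF` and `missing_hwc_conf` are hypothetical.

```python
# Property names copied from the list above (Kerberized clusters).
REQUIRED_HWC_CONF = (
    "spark.sql.hive.hiveserver2.jdbc.url",
    "spark.datasource.hive.warehouse.metastoreUri",
    "spark.datasource.hive.warehouse.load.staging.dir",
    "spark.hadoop.hive.llap.daemon.service.hosts",
    "spark.hadoop.hive.zookeeper.quorum",
    "spark.sql.hive.hiveserver2.jdbc.url.principal",
)


def missing_hwc_conf(conf):
    """Return the required property names that are absent or empty in `conf`."""
    return [key for key in REQUIRED_HWC_CONF if not conf.get(key)]


# Example: only the JDBC URL has been set so far.
partial = {"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://..."}
for name in missing_hwc_conf(partial):
    print("missing: " + name)
```

Properties already set cluster-wide in Ambari would not need to appear in the session `conf`, so a non-empty result is a prompt to check, not necessarily an error.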

For a detailed explanation on **where** to obtain the `conf` properties from Ambari, see: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_configure_a_spark_hive_connection.html

**Notice the `.jdbc.url` endpoints:**
- `zooKeeperNamespace=hiveserver2`: connects to Hive managed tables using HWC.
- `zooKeeperNamespace=hiveserver2-interactive`: connects to LLAP Hive managed tables using HWC.
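
To make the difference between the two endpoints concrete, here is a minimal sketch that assembles the JDBC URL from a ZooKeeper quorum and a namespace. The helper name `build_hs2_jdbc_url` and the `example.com` host names are hypothetical; the URL layout follows the example configuration used in this notebook, and the authoritative value should still be copied from Ambari.

```python
def build_hs2_jdbc_url(zk_hosts, namespace="hiveserver2"):
    """Assemble a ZooKeeper service-discovery JDBC URL for HiveServer2.

    zk_hosts:  list of "host:port" strings for the ZooKeeper quorum.
    namespace: "hiveserver2" or "hiveserver2-interactive" (LLAP).
    """
    quorum = ",".join(zk_hosts)
    return (
        "jdbc:hive2://{0}/;serviceDiscoveryMode=zooKeeper;"
        "zooKeeperNamespace={1}".format(quorum, namespace)
    )


# Placeholder hosts, for illustration only.
zk = ["zk1.example.com:2181", "zk2.example.com:2181"]
print(build_hs2_jdbc_url(zk))
print(build_hs2_jdbc_url(zk, "hiveserver2-interactive"))
```

Only the trailing `zooKeeperNamespace` value differs between the two printed URLs.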

<div class="alert alert-block alert-info"> **IMPORTANT:** If working in a Kerberized Environment, the Hive Warehouse Connector Libraries used in `jars` and `pyFiles` MUST be read from the local filesystem, e.g. `file:///usr/hdp/current` and **not** from HDFS. </div>

In [15]:
myConfig={
 "queue": "default",
 "driverMemory": "1G",
 "numExecutors": 1,
 "jars": ["file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar"],
 "pyFiles": ["file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip"],
 "conf": { "spark.sql.hive.hiveserver2.jdbc.url": 
          "jdbc:hive2://azeroth-301-master-1.fyre.ibm.com:2181,azeroth-301-ambari.fyre.ibm.com:2181,azeroth-301-master-2.fyre.ibm.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2",
           "spark.datasource.hive.warehouse.metastoreUri":"thrift://azeroth-301-master-1.fyre.ibm.com:9083",
           "spark.datasource.hive.warehouse.load.staging.dir":"/tmp",
           "spark.hadoop.hive.llap.daemon.service.hosts":"@llap0",
           "spark.hadoop.hive.zookeeper.quorum":"azeroth-301-master-2.fyre.ibm.com:2181,azeroth-301-ambari.fyre.ibm.com:2181,azeroth-301-master-1.fyre.ibm.com:2181",
           "spark.sql.hive.hiveserver2.jdbc.url.principal":"hive/_HOST@FYRE.IBM.COM"
         }}
HI_CONFIG = dsx_core_utils.setup_livy_sparkmagic(
  system="Azeroth-301", 
  livy="livyspark2",
  imageId=None,
  addlConfig=myConfig)

# (Re-)load spark magic to apply the new configs.
%reload_ext sparkmagic.magics

sparkmagic has been configured to use https://azeroth-301-edge.fyre.ibm.com:8443/gateway/kanchws-52-nfs-master-1/livy2/v1 
success configuring sparkmagic livy.


In [16]:
session_name = 'hive_catalog_access'
livy_endpoint = HI_CONFIG['LIVY']
%spark add -s $session_name -l python -k -u $livy_endpoint

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
198,application_1544476065609_0016,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


#### 2.1.1 Loading the Hive Warehouse connector
See the [HDP 3.0.1](https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehousesession_api_operations.html) docs for extended usage.

In [17]:
%%spark -s $session_name 
from pyspark_llap import HiveWarehouseSession
hivecon = HiveWarehouseSession.session(spark).build()

### 2.2 Accessing Hive Managed Tables via HWC

#### 2.2.1 Show databases / tables
<br>

In [18]:
%%spark -s $session_name 
hivecon.showDatabases().show(5)

+-------------+
|database_name|
+-------------+
|      default|
+-------------+

<div class="alert alert-block alert-info"> Notice - The Spark Catalog tables and Hive Catalog tables returned will differ: </div>

**Hive Catalog**

In [19]:
%%spark -s $session_name 
hivecon.showTables().show(5)

+------------------+
|          tab_name|
+------------------+
|       atlas_higgs|
| hive_catalog_test|
|hive_catalog_test2|
|hive_catalog_test3|
|         web_sales|
+------------------+

**Spark Catalog**

**Note** SparkMagic (`%%spark -c sql`) is **not** supported for HWC, as it relies on the Spark Catalog.

In [21]:
%%spark -s $session_name -c sql
show tables

Unnamed: 0,database,tableName,isTemporary
0,default,spark_catalog_tabletest,False
1,default,t1,False


#### 2.2.2 Create test table

In [22]:
%%spark -s $session_name 
hivecon.createTable("hive_catalog_test3").ifNotExists().column("sold_time_sk", "bigint"
                                             ).column("ws_ship_date_sk", "bigint"
                                             ).create()

In [23]:
%%spark -s $session_name
hivecon.showTables().show(5)

+------------------+
|          tab_name|
+------------------+
|       atlas_higgs|
| hive_catalog_test|
|hive_catalog_test2|
|hive_catalog_test3|
|         web_sales|
+------------------+

In [24]:
%spark cleanup

<a id='HWC_HIVE_LLAP'></a>
---

# 3. Using the HWC to access Hive Managed Interactive Tables (LLAP)

Connecting to LLAP requires the same set of properties as section 2.1, with an updated `spark.sql.hive.hiveserver2.jdbc.url`.

The JDBC URL can be found in Ambari > Hive Summary, under `HIVESERVER2 INTERACTIVE JDBC URL`.

Note: for this example, the following properties have been configured in Ambari under Spark > Configs > Advanced > Custom Spark-defaults, so they are not repeated in the session configuration:
- `spark.datasource.hive.warehouse.metastoreUri`
- `spark.datasource.hive.warehouse.load.staging.dir`
- `spark.hadoop.hive.llap.daemon.service.hosts`
- `spark.hadoop.hive.zookeeper.quorum`
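
Since only the JDBC URL changes between the two connection types, one option is to derive the LLAP URL from the one used in section 2.1 by swapping its namespace. The sketch below does that with a regular expression. The helper name `with_zk_namespace` and the placeholder hosts are hypothetical, and it assumes both HiveServer2 instances register in the same ZooKeeper quorum, as in this notebook's example URLs.

```python
import re


def with_zk_namespace(jdbc_url, namespace):
    """Return `jdbc_url` with its zooKeeperNamespace value replaced."""
    new_url, count = re.subn(
        r"zooKeeperNamespace=[^;]*",
        # A function replacement avoids backslash handling in `namespace`.
        lambda _match: "zooKeeperNamespace=" + namespace,
        jdbc_url,
    )
    if count == 0:
        raise ValueError("URL has no zooKeeperNamespace parameter")
    return new_url


# Placeholder URL, for illustration only.
hive_url = ("jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181/;"
            "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2")
print(with_zk_namespace(hive_url, "hiveserver2-interactive"))
```

When in doubt, the value shown in Ambari remains the reference; this helper is only a convenience for keeping the two configurations in sync.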

In [25]:
myConfig={
 "queue": "default",
 "driverMemory": "1G",
 "numExecutors": 1,
 "jars": ["file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar"],
 "pyFiles": ["file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip"],
 "conf": { "spark.sql.hive.hiveserver2.jdbc.url": 
          "jdbc:hive2://azeroth-301-master-1.fyre.ibm.com:2181,azeroth-301-ambari.fyre.ibm.com:2181,azeroth-301-master-2.fyre.ibm.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive"
         }}
HI_CONFIG = dsx_core_utils.setup_livy_sparkmagic(
  system="Azeroth-301", 
  livy="livyspark2",
  imageId=None,
  addlConfig=myConfig)

# (Re-)load spark magic to apply the new configs.
%reload_ext sparkmagic.magics

sparkmagic has been configured to use https://azeroth-301-edge.fyre.ibm.com:8443/gateway/kanchws-52-nfs-master-1/livy2/v1 
success configuring sparkmagic livy.


### 3.1 Start a new Livy Session

In [26]:
session_name = 'llap_access'
livy_endpoint = HI_CONFIG['LIVY']
%spark add -s $session_name -l python -k -u $livy_endpoint

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
199,application_1544476065609_0017,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [27]:
%%spark -s $session_name 
from pyspark_llap import HiveWarehouseSession
hivecon = HiveWarehouseSession.session(spark).build()

### 3.2 Accessing Hive Managed Tables via HWC

In [28]:
%%spark -s $session_name 
hivecon.showDatabases().show(5)

+-------------+
|database_name|
+-------------+
|      default|
+-------------+

The **hive_catalog_test3** table created in section 2.2.2 is visible to both LLAP and non-LLAP connections via HWC.

In [29]:
%%spark -s $session_name 
hivecon.showTables().show(5)

+------------------+
|          tab_name|
+------------------+
|       atlas_higgs|
| hive_catalog_test|
|hive_catalog_test2|
|hive_catalog_test3|
|         web_sales|
+------------------+

In [30]:
%%spark -s $session_name  
DF1 = hivecon.executeQuery("select * from hive_catalog_test3")

In [31]:
%%spark -s $session_name 
DF1.describe()

DataFrame[summary: string, sold_time_sk: string, ws_ship_date_sk: string]

In [32]:
%spark cleanup