
# Spark ETL with Hive tables

[1. Read data from Source (MongoDB)](#1)

[2. Create and Save Hive table from dataframe](#3)

[3. Create temp Hive view from dataframe](#4)

[4. Create global Hive view from dataframe](#5)

[5. List database and tables in database](#6)

[6. Drop all the created tables and views in default database](#7)

[7. Create Dataeng database and create global and temp view using SQL](#8) 

[8. Access global table from other session](#9)


## Load libraries

In [1]:
# load required Libraries

from pyspark.sql import SparkSession

In [2]:
# start the Spark session
# config("spark.sql.warehouse.dir", "warehouse_location") sets the directory where saved tables are stored


spark = SparkSession.builder.appName("Hive Tables") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1')\
    .config("spark.sql.warehouse.dir","warehouse_location")\
    .getOrCreate()

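# SparkSession already provides .sql and .catalog, so reuse it under the name used below
# (the SparkSession constructor expects a SparkContext, not another SparkSession)
sqlContext = spark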

# show only errors, not warnings
spark.sparkContext.setLogLevel("ERROR")



<a id = 1>  </a>
## Read data from Source (MongoDB)


In [3]:
mongo_df = spark.read.format("mongo") \
    .option("uri", "mongodb://localhost:27017/") \
    .option("database", "dataengineering") \
    .option("collection", "employee") \
    .load()

In [4]:
mongo_df.printSchema()

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- department_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: integer (nullable = true)



In [5]:
mongo_df.show(5)

+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows



<a id = 3>  </a>
## Create and Save Hive table from dataframe

`saveAsTable` stores the DataFrame as a managed table. By default the data is written as Parquet files in a `hivesampletable` folder under the warehouse directory set by `spark.sql.warehouse.dir`.


In [6]:
# saves the table as Parquet files (the default format) in a hivesampletable folder under the warehouse directory

mongo_df.write.saveAsTable("hivesampletable")

<a id = 4>  </a>
## Create temp Hive view from dataframe

A temporary view is accessible only within the Spark session in which it was created.


In [7]:
mongo_df.createOrReplaceTempView("sampletempview")

<a id = 5>  </a>
## Create global Hive view from dataframe

A global temporary view is accessible from any Spark session of the same Spark application.

In [8]:
mongo_df.createOrReplaceGlobalTempView("sampleglobalview")

<a id = 6>  </a>
## List database and tables in database

In [9]:
# show databases

sqlContext.sql("show databases").show()


# can also be written as

sqlContext.catalog.listDatabases()

+---------+
|namespace|
+---------+
|  default|
+---------+



[Database(name='default', description='default database', locationUri='file:/c:/Users/Admin/Desktop/pyspark/Hive Tables/warehouse_location')]

In [10]:
#show tables

sqlContext.sql("show tables").show()



# can also be written as

sqlContext.catalog.listTables()

+---------+---------------+-----------+
|namespace|      tableName|isTemporary|
+---------+---------------+-----------+
|  default|hivesampletable|      false|
|         | sampletempview|       true|
+---------+---------------+-----------+



[Table(name='hivesampletable', database='default', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='sampletempview', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [11]:
# get column details of the Hive table
sqlContext.catalog.listColumns("hivesampletable")


# sampletempview is a temporary view and does not belong to the default database, so its columns are not listed here

# sqlContext.catalog.listColumns("sampletempview")

[Column(name='_id', description=None, dataType='struct<oid:string>', nullable=True, isPartition=False, isBucket=False),
 Column(name='department_id', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='first_name', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='id', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='last_name', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='salary', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False)]

In [12]:
sqlContext.sql("SELECT * FROM hivesampletable").show(5)

+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows



In [13]:
# showing dataset of tempView

sqlContext.sql("SELECT * FROM sampletempview").show(5)

+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows



In [14]:
# showing data from Global View

# since it is a global view, the view name must be prefixed with global_temp.
sqlContext.sql("SELECT * FROM global_temp.sampleglobalview").show(5)

+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows



<a id = 7>  </a>
## Drop all the created tables and views in default database

In [15]:
spark.catalog.dropGlobalTempView("sampleglobalview")
spark.catalog.dropTempView("sampletempview")

True

<a id = 8>  </a>
## Create Dataeng database and create global and temp view using SQL

In [16]:
sqlContext.sql("create database dataeng")
sqlContext.sql("use dataeng")
sqlContext.sql("show databases").show()

+---------+
|namespace|
+---------+
|  dataeng|
|  default|
+---------+



In [17]:
# creates and saves hivesampletable inside the current database (dataeng)
mongo_df.write.saveAsTable("hivesampletable")


mongo_df.createOrReplaceGlobalTempView("sampleglobalview")

mongo_df.createOrReplaceTempView("sampletempview")

In [18]:
spark.sql("show tables").show()

+---------+---------------+-----------+
|namespace|      tableName|isTemporary|
+---------+---------------+-----------+
|  dataeng|hivesampletable|      false|
|         | sampletempview|       true|
+---------+---------------+-----------+



<a id = 9>  </a>
## Access global table from other session

In [19]:
new_Session = spark.newSession()
new_Session

In [20]:
# since it is a global view, it can be accessed even from a new session
new_Session.sql("SELECT * FROM global_temp.sampleglobalview").show(5)

+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows



In [21]:
# new_Session.sql("SELECT * FROM sampletempview").show(5)
# would raise an error, because sampletempview is a temp view created in the other session, spark.
# Querying it from spark, the session that created it, returns the data:

spark.sql("select * from sampletempview").show(5)


+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows

