# PySpark - Hive
## PySpark Save DataFrame to Hive Table
To save a PySpark DataFrame to a Hive table, use the saveAsTable() function, or run a SQL CREATE TABLE statement on top of a temporary view. Either way, you first need to create a SparkSession with enableHiveSupport().

This method is available at pyspark.sql.SparkSession.builder.enableHiveSupport(). It enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

The steps to save a PySpark DataFrame to a Hive table are:<br>
Step 1 – Create SparkSession with Hive enabled<br>
Step 2 – Create PySpark DataFrame<br>
Step 3 – Save PySpark DataFrame to Hive table<br>
Step 4 – Confirm Hive table is created<br>

### Create Spark Session with Hive Enabled
To save a PySpark DataFrame to a Hive table, you first need to create a SparkSession with Hive support enabled.

In [1]:
import findspark
findspark.init()

In [2]:
from os.path import abspath
from pyspark.sql import SparkSession

#enableHiveSupport() -> enables sparkSession to connect with Hive
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

### PySpark Save DataFrame to Hive Table
Using saveAsTable() from DataFrameWriter, you can save or write a PySpark DataFrame to a Hive table. Pass the name of the table you want to write to as an argument, preferably in the form database.tablename. If the database doesn’t exist, you will get an error. To start with, you can also use just the table name without a database, in which case the table is created in the current database (default, unless you have switched to another one).

You can use this to write a PySpark DataFrame to a new Hive table or to overwrite an existing one. On a Hive cluster, PySpark writes the data to the default Hive warehouse location, /user/hive/warehouse. When running locally, it writes to a spark-warehouse directory under the current directory. You can change this location with the spark.sql.warehouse.dir configuration when creating the SparkSession.

Since we are running it locally, it creates a metadata database, metastore_db, and a spark-warehouse directory under the current directory.
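Whether the write creates a new table or replaces an existing one is controlled by the save mode passed to mode(). The sketch below wraps saveAsTable() in a small helper to make the available modes explicit. The helper name save_to_hive and its validation are assumptions made for illustration and are not part of PySpark; the mode strings are the ones DataFrameWriter.mode() accepts.

```python
# Save modes accepted by DataFrameWriter.mode()
SAVE_MODES = {
    "append": "add rows to the existing table",
    "overwrite": "replace the existing table's data",
    "ignore": "do nothing if the table already exists",
    "error": "raise an error if the table already exists (the default)",
    "errorifexists": "same as 'error'",
}


def save_to_hive(df, table_name, mode="error"):
    """Write `df` to the Hive table `table_name` using the given save mode.

    `df` is expected to be a PySpark DataFrame. `save_to_hive` is a
    hypothetical helper used only for this illustration.
    """
    if mode not in SAVE_MODES:
        raise ValueError(
            f"Unknown save mode {mode!r}; expected one of {sorted(SAVE_MODES)}"
        )
    df.write.mode(mode).saveAsTable(table_name)
```

For a DataFrame df, calling save_to_hive(df, "employee", mode="append") would add its rows to the employee table instead of replacing the table's contents.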

### Save DataFrame as Internal Table from PySpark
By default, saveAsTable() saves a PySpark DataFrame as a managed Hive table. Managed tables, also known as internal tables, are owned and managed by Hive: Hive owns both the table structure and the data files. When you drop an internal table, both the data and the table's metadata are removed.

In [3]:
columns = ["id", "name","age","gender"]

# Create DataFrame 
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
    (3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]
sampleDF = spark.sparkContext.parallelize(data).toDF(columns)

# Create Hive Internal table
sampleDF.write.mode('overwrite') \
         .saveAsTable("employee")

# Read Hive table
df = spark.read.table("employee")
df.show()

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  1|   James| 30|     M|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
+---+--------+---+------+



It creates the Hive metastore metastore_db and Hive warehouse location spark-warehouse in the current directory. The employee table is created inside the warehouse directory.

Also note that, by default, it writes the files in Parquet format with Snappy compression.
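If the defaults don't suit you, the file format and compression codec can be set on the writer before calling saveAsTable(). The following is a minimal sketch, assuming a helper named save_with_format (a hypothetical name); it relies only on the format(), option(), mode(), and saveAsTable() methods of DataFrameWriter.

```python
def save_with_format(df, table_name, file_format="parquet", compression="snappy"):
    """Overwrite `table_name` with `df`, stored in the given file format.

    `save_with_format` is a hypothetical helper for illustration. The
    defaults mirror what Spark uses when nothing is specified.
    """
    (
        df.write
        .format(file_format)                 # e.g. "parquet" or "orc"
        .option("compression", compression)  # codec must be valid for the format
        .mode("overwrite")
        .saveAsTable(table_name)
    )
```

For instance, save_with_format(sampleDF, "employee_orc", file_format="orc", compression="zlib") would store the data as zlib-compressed ORC files; the table name employee_orc is only an example.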

If you want to create a table within a database, prefix the table name with the database name. If the database doesn’t exist yet, you can create it first.

In [4]:
# Create database 
spark.sql("CREATE DATABASE IF NOT EXISTS emp")

# Create Hive Internal table
sampleDF.write.mode('overwrite') \
    .saveAsTable("emp.employee")
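To confirm that the table was actually created (the last of the steps listed at the top), you can query the metastore with SHOW TABLES. The helper below is a sketch with a hypothetical name, hive_table_exists; it assumes the result of SHOW TABLES exposes the tableName and isTemporary columns, as current Spark versions do.

```python
def hive_table_exists(spark, database, table_name):
    """Return True if `table_name` is a (non-temporary) table in `database`.

    `spark` is a Hive-enabled SparkSession. `hive_table_exists` is a
    hypothetical helper used only for this illustration.
    """
    rows = spark.sql(f"SHOW TABLES IN {database}").collect()
    # The metastore stores table names in lower case, so compare in lower case.
    # Temporary views also show up in SHOW TABLES, so they are filtered out.
    wanted = table_name.lower()
    return any(
        row["tableName"] == wanted and not row["isTemporary"] for row in rows
    )
```

After the cell above has run, hive_table_exists(spark, "emp", "employee") should return True.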

### Save as External Table
To create an external table, specify a path of your choice using option(). The data in an external table is not owned or managed by Hive. Dropping an external table removes only the metadata, not the actual data, which remains accessible outside of Hive.

In [7]:
# Create Hive External table
sampleDF.write.mode('overwrite') \
        .option("path", "../resources/tmp/employee") \
        .saveAsTable("emp.employee")
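To check whether a table ended up managed or external, you can inspect the catalog. The sketch below uses spark.catalog.listTables(), whose entries carry a tableType field (for example MANAGED or EXTERNAL); the helper name hive_table_type is an assumption made for this illustration.

```python
def hive_table_type(spark, database, table_name):
    """Return the catalog's table type for `database.table_name`.

    Returns a string such as "MANAGED" or "EXTERNAL", or None if the
    table is not found. `hive_table_type` is a hypothetical helper.
    """
    wanted = table_name.lower()  # table names are stored in lower case
    for table in spark.catalog.listTables(database):
        if table.name == wanted and not table.isTemporary:
            return table.tableType
    return None
```

For the table written with the path option above, hive_table_type(spark, "emp", "employee") is expected to report EXTERNAL, while a table saved without a path is reported as MANAGED.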

### Using PySpark SQL Temporary View to Save Hive Table
Use the SparkSession.sql() method with a CREATE TABLE statement to create a Hive table from a PySpark temporary view. The example below first registers the DataFrame as a temporary view named sampleView, then creates a database and a table in the Hive metastore using SQL, and finally inserts the data from the view into the Hive table.

In [8]:
# Create temporary view
sampleDF.createOrReplaceTempView("sampleView")

# Create the ct database
spark.sql("CREATE DATABASE IF NOT EXISTS ct")

# Create a table named sampleTable under the ct database
spark.sql("CREATE TABLE ct.sampleTable (id Int, name String, age Int, gender String)")

# Insert into sampleTable using sampleView
spark.sql("INSERT INTO TABLE ct.sampleTable SELECT * FROM sampleView")

# Let's view the data in the table
spark.sql("SELECT * FROM ct.sampleTable").show()

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  1|   James| 30|     M|
|  2|     Ann| 40|     F|
|  3|    Jeff| 41|     M|
|  4|Jennifer| 20|     F|
+---+--------+---+------+
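The CREATE TABLE and INSERT INTO steps can also be combined into a single CREATE TABLE ... AS SELECT statement, which derives the column definitions from the view. This is a minimal sketch; the helper name create_table_as_select and the target table name used in the note after it are hypothetical.

```python
def create_table_as_select(spark, source_view, target_table):
    """Create `target_table` from all rows of `source_view` in one statement.

    Returns the SQL that was executed. `create_table_as_select` is a
    hypothetical helper used only for this illustration.
    """
    # IF NOT EXISTS makes the call a no-op when the table is already there,
    # so rerunning it does not fail or duplicate rows.
    statement = (
        f"CREATE TABLE IF NOT EXISTS {target_table} "
        f"AS SELECT * FROM {source_view}"
    )
    spark.sql(statement)
    return statement
```

For example, create_table_as_select(spark, "sampleView", "ct.sampleTableCtas") would create a second table in the ct database with the same columns and rows as the view.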




## PySpark SQL Read Hive Table
PySpark SQL supports reading a Hive table into a DataFrame in two ways: the SparkSession.read.table() method and the SparkSession.sql() statement.

In order to read a Hive table, you need to create a SparkSession with enableHiveSupport(). This method is available at pyspark.sql.SparkSession.builder.enableHiveSupport() which is used to enable Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

Steps to Read Hive Table into PySpark DataFrame<br>
Step 1 – Import PySpark<br>
Step 2 – Create SparkSession with Hive enabled<br>
Step 3 – Read Hive table into Spark DataFrame using spark.sql()<br>
Step 4 – Read using spark.read.table()<br>
Step 5 – Connect to remote Hive<br>

On a Hive cluster, PySpark reads the data from the default Hive warehouse location, /user/hive/warehouse. When running locally, it reads from the spark-warehouse directory under the current directory. You can change this location with the spark.sql.warehouse.dir configuration when creating the SparkSession.

### PySpark Read Hive Table into DataFrame

In [9]:
# Read Hive table
df = spark.sql("select * from emp.employee")
df.show()

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  1|   James| 30|     M|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
+---+--------+---+------+



### Using spark.read.table()
Alternatively, you can read the table using the spark.read.table() method. Here, spark.read is an object of the class DataFrameReader.

In [10]:
# Read Hive table
df = spark.read.table("employee")
df.show()

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  1|   James| 30|     M|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
+---+--------+---+------+
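Because spark.read.table() returns a regular DataFrame, you can chain further transformations on the result instead of writing the whole query in SQL. The sketch below assumes the table has the name and age columns used in this tutorial; the helper name names_older_than is hypothetical.

```python
def names_older_than(spark, table_name, min_age):
    """Return a DataFrame with the names of rows whose age exceeds `min_age`.

    Roughly equivalent to:
        SELECT name FROM <table_name> WHERE age > <min_age>
    `names_older_than` is a hypothetical helper used only for illustration.
    """
    df = spark.read.table(table_name)
    # filter() accepts a SQL expression string; int() keeps the value numeric
    return df.filter(f"age > {int(min_age)}").select("name")
```

With the sample data above, names_older_than(spark, "emp.employee", 30).show() would list Ann and Jeff (row order is not guaranteed).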



### PySpark Read Hive Table from Remote Hive

In [13]:
from pyspark.sql import SparkSession

# enableHiveSupport() enables the SparkSession to connect to Hive;
# hive.metastore.uris points it at the remote metastore
spark = SparkSession \
    .builder \
    .appName("SparkByExamples.com") \
    .config("spark.sql.warehouse.dir", "/hive/warehouse/dir") \
    .config("hive.metastore.uris", "thrift://remote-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()

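# Alternatively, collect the settings in a SparkConf and pass it to the builder.
# Both settings must be in place before the session is created, so run this
# in a fresh Python process rather than after the code above.
from pyspark import SparkConf

conf = SparkConf() \
    .set("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .set("hive.metastore.uris", "thrift://localhost:9083")
spark = SparkSession.builder \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()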
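When you connect to more than one environment, it can help to keep the connection settings separate from the session creation. The following is a minimal sketch under that assumption; the function names remote_hive_conf and build_hive_session are hypothetical, and the default port 9083 and warehouse path are simply the values used in this tutorial.

```python
def remote_hive_conf(metastore_host, metastore_port=9083,
                     warehouse_dir="/user/hive/warehouse"):
    """Return the Spark settings needed to reach a remote Hive metastore."""
    return {
        "hive.metastore.uris": f"thrift://{metastore_host}:{metastore_port}",
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_hive_session(app_name, settings):
    """Create a Hive-enabled SparkSession with `settings` applied."""
    # Imported here so remote_hive_conf() can be used without PySpark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()
```

A session for the remote host used above could then be created with spark = build_hive_session("SparkByExamples.com", remote_hive_conf("remote-host")).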