## Import the modules 

- In this scenario, we import the `pyspark` and `pyspark.sql` modules and create a Spark session, as shown below:

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

In [2]:
# Create a Spark session; spark.jars points at the PostgreSQL JDBC driver jar

spark = SparkSession.builder.config("spark.jars", "/home/jovyan/work/jars/postgresql-42.4.0.jar") \
    .master("local").appName("PySpark-Postgres").getOrCreate()

## Read Data from the table
- Here we read a table from PostgreSQL and load it into a DataFrame.
- To do so, we use the `spark.read` JDBC reader, passing the JDBC URL and the PostgreSQL driver class. The driver jar path was already supplied when the session was created.

In [3]:
## Read the 'customers' table
## from the 'metastore' database on host 'oasispostgres',
## connecting as user 'hive'

df = spark.read.format("jdbc").option("url", "jdbc:postgresql://oasispostgres:5432/metastore") \
    .option("driver", "org.postgresql.Driver").option("dbtable", "customers") \
    .option("user", "hive").option("password", "hive").load()
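
The chained `.option(...)` calls above can also be collected into one dictionary and passed with `DataFrameReader.options(**...)`. The following is a minimal sketch, not part of the original notebook: `postgres_jdbc_options` and `read_table` are hypothetical helper names, and `PG_USER` / `PG_PASSWORD` are assumed environment variable names used to keep credentials out of the notebook. The defaults mirror the values used in the cell above.

```python
import os


def postgres_jdbc_options(host, database, table, user, password, port=5432):
    """Build the option dict for Spark's JDBC reader (hypothetical helper)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# PG_USER and PG_PASSWORD are assumed names; if they are not set,
# fall back to the credentials used in the cell above.
options = postgres_jdbc_options(
    host="oasispostgres",
    database="metastore",
    table="customers",
    user=os.environ.get("PG_USER", "hive"),
    password=os.environ.get("PG_PASSWORD", "hive"),
)


def read_table(spark, options):
    """Load the table described by `options` as a DataFrame."""
    return spark.read.format("jdbc").options(**options).load()
```

With the session from `In [2]`, `df = read_table(spark, options)` should produce the same DataFrame as the chained form.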

## To view the Schema
- Here we print the schema of the DataFrame loaded from the table, as shown below.

In [4]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- updated: timestamp (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)



## To view the content of the table
- Here we are going to read the content of the table as a dataframe. 
- We will print the top 5 rows from the dataframe as shown below.

In [5]:
df.show(5)

+---+-------------------+--------------------+----------+---------+--------------------+
| id|            created|             updated|first_name|last_name|               email|
+---+-------------------+--------------------+----------+---------+--------------------+
|  1|2021-02-16 00:16:06|2022-07-29 11:58:...|     Scott|   Haines|  scott@coffeeco.com|
|  2|2021-02-16 00:16:06|2022-07-29 11:58:...|      John|     Hamm|  john.hamm@acme.com|
|  3|2021-02-16 00:16:06|2022-07-29 11:58:...|      Milo|   Haines|mhaines@coffeeco.com|
|  4|2021-02-21 21:00:00|2022-07-29 11:58:...|     Penny|   Haines|  penny@coffeeco.com|
|  5|2021-02-21 22:00:00|2022-07-29 11:58:...|     Cloud|     Fast| cloud.fast@acme.com|
+---+-------------------+--------------------+----------+---------+--------------------+
only showing top 5 rows



- Interacting with a JDBC-backed DataFrame abstracts away the complexities of connecting to a remote (external) RDBMS.
- Calling `show(5)` on the JDBC-backed DataFrame returns five rows from the customers table in your PostgreSQL database. You can also select specific columns and cap the number of rows with `limit`:

In [7]:
df.select("updated","id", "first_name", "email").limit(8).show()

+--------------------+---+----------+--------------------+
|             updated| id|first_name|               email|
+--------------------+---+----------+--------------------+
|2022-07-29 11:58:...|  1|     Scott|  scott@coffeeco.com|
|2022-07-29 11:58:...|  2|      John|  john.hamm@acme.com|
|2022-07-29 11:58:...|  3|      Milo|mhaines@coffeeco.com|
|2022-07-29 11:58:...|  4|     Penny|  penny@coffeeco.com|
|2022-07-29 11:58:...|  5|     Cloud| cloud.fast@acme.com|
|2022-07-29 11:58:...|  6|   Marshal|   paws@coffeeco.com|
|2022-07-29 11:58:...|  7|    Willow| willow@coffeeco.com|
|2022-07-29 11:58:...|  8|    Clover|    pup@coffeeco.com|
+--------------------+---+----------+--------------------+
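
The `select` and `limit` above are written against the Spark DataFrame. As an alternative sketch, Spark's JDBC source also accepts a `query` option, which sends a SQL statement to PostgreSQL and reads only its result. `query` and `dbtable` cannot be set together, so the hypothetical helper `with_pushdown_query` below drops `dbtable` before adding `query`. The `ORDER BY id` is an assumption added to make the eight rows deterministic.

```python
def with_pushdown_query(options, sql):
    """Return a copy of `options` that reads the result of `sql`
    instead of a whole table (hypothetical helper)."""
    # 'dbtable' and 'query' are mutually exclusive in Spark's JDBC source.
    pushed = {key: value for key, value in options.items() if key != "dbtable"}
    pushed["query"] = sql
    return pushed


# Connection settings copied from the read cell earlier in this notebook.
base_options = {
    "url": "jdbc:postgresql://oasispostgres:5432/metastore",
    "driver": "org.postgresql.Driver",
    "dbtable": "customers",
    "user": "hive",
    "password": "hive",
}

sql = "SELECT updated, id, first_name, email FROM customers ORDER BY id LIMIT 8"
query_options = with_pushdown_query(base_options, sql)
```

Passing the result to the reader, as in `spark.read.format("jdbc").options(**query_options).load()`, lets the database do the column selection and row limiting before any data reaches Spark.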



## Describing Views and Tables

- You have learned to use JDBC to connect to your PostgreSQL Docker container.
- You can also describe the table's schema with a SQL-style `DESCRIBE` by first registering the DataFrame as a temporary view.

In [8]:
df.createOrReplaceTempView("customers")
spark.sql("desc customers").show()

+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|        id|      int|   null|
|   created|timestamp|   null|
|   updated|timestamp|   null|
|first_name|   string|   null|
| last_name|   string|   null|
|     email|   string|   null|
+----------+---------+-------+



## Conclusion
- Here we learned to read data from PostgreSQL in PySpark.