# 3.1 Read and write data from a PostgreSQL server

See the official Spark [JDBC data source documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).

# 3.1.1 Check your PostgreSQL server connection
``` bash
# -h takes a host name (not a URL), -W forces a password prompt,
# and the last argument is the database name
psql -h postgresql-955091 -p 5432 -U pengfei -W test
# password: test
```
Once connected:

1. Show the database list with `\l`.
2. Check the version of your PostgreSQL server:

``` sql
SELECT version();
```
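
The host, port, and database name verified with `psql` are the same values that go into the JDBC URL used later in this notebook. A minimal sketch of that relationship, where `build_jdbc_url` is a hypothetical helper (it is not part of Spark or the PostgreSQL driver):

```python
def build_jdbc_url(host: str, port: int, database: str) -> str:
    """Assemble a PostgreSQL JDBC URL from the values checked with psql."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


# Same host, port and database as in the psql command above
db_url = build_jdbc_url("postgresql-955091", 5432, "test")
print(db_url)
```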

# 3.1.2 Get your postgresql jdbc driver
The Maven dependency for the PostgreSQL JDBC driver:

```xml
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.2.24</version>
</dependency>
```
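
Spark's `spark.jars.packages` setting, used when the SparkSession is built in this notebook, expects the same dependency written as `groupId:artifactId:version`. A small sketch of the conversion, with `maven_coordinate` as a hypothetical helper name:

```python
def maven_coordinate(group_id: str, artifact_id: str, version: str) -> str:
    """Join the three Maven fields into the form spark.jars.packages expects."""
    return ":".join([group_id, artifact_id, version])


postgres_package = maven_coordinate("org.postgresql", "postgresql", "42.2.24")
print(postgres_package)  # org.postgresql:postgresql:42.2.24
```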


In [1]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import lit, col, when, concat, udf
import os

In [3]:
local=False
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .config('spark.jars.packages', 'org.postgresql:postgresql:42.2.24') \
                  .appName("RemoveDuplicates").getOrCreate()
    db_url="jdbc:postgresql://localhost:5432/test"
    table_name="employee"
    user="pengfei"
    password="toto"
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("RemoveDuplicates") \
                      .config("spark.kubernetes.container.image",os.environ['IMAGE_NAME']) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages','org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()
    db_url="jdbc:postgresql://postgresql-124499:5432/test"
    table_name="employee"
    user="user-pengfei"
    password="toto"



:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ef9e7f5a-15fb-4a43-8c78-a8275b48a833;1.0
	confs: [default]
	found org.postgresql#postgresql;42.2.24 in central
	found org.checkerframework#checker-qual;3.5.0 in central
downloading https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.24/postgresql-42.2.24.jar ...
	[SUCCESSFUL ] org.postgresql#postgresql;42.2.24!postgresql.jar (134ms)
downloading https://repo1.maven.org/maven2/org/checkerframework/checker-qual/3.5.0/checker-qual-3.5.0.jar ...
	[SUCCESSFUL ] org.checkerframework#checker-qual;3.5.0!checker-qual.jar (28ms)
:: resolution report :: resolve 773ms :: artifacts dl 167ms
	:: modules in use:
	org.checkerframework#checker-qual;3.5.0 from central in [default]
	org.postgresql#postgresql;42.2.24 from central in [default]
	-------------------------------

In [4]:
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
           (2, "Rose", 1, "2010", "20", "M", 4000),
           (3, "Williams", 1, "2018", "21", "M", 1000),
           (4, "Jones", 2, "2005", "31", "F", 2000),
           (5, "Brown", 2, "2010", "30", "F", -1),
           (6, "Foobar", 2, "2010", "150", "F", -1)
           ]
emp_col_names = ["emp_id", "name", "superior_emp_id", "dept_creation_year",
                     "emp_dept_id", "gender", "salary"]
df = spark.createDataFrame(data=emp, schema=emp_col_names)
df.printSchema()
df.show(truncate=False)

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- dept_creation_year: string (nullable = true)
 |-- emp_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)





+------+--------+---------------+------------------+-----------+------+------+
|emp_id|name    |superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|
+------+--------+---------------+------------------+-----------+------+------+
|1     |Smith   |-1             |2018              |10         |M     |3000  |
|2     |Rose    |1              |2010              |20         |M     |4000  |
|3     |Williams|1              |2018              |21         |M     |1000  |
|4     |Jones   |2              |2005              |31         |F     |2000  |
|5     |Brown   |2              |2010              |30         |F     |-1    |
|6     |Foobar  |2              |2010              |150        |F     |-1    |
+------+--------+---------------+------------------+-----------+------+------+



                                                                                

# 3.1.3 Write a Spark DataFrame to the PostgreSQL server as a table

There are two ways to write a DataFrame via JDBC:

1. Use `df.write.format("jdbc").option(...).save()`


```python
df.write \
    .format("jdbc") \
    .option("url", db_url) \
    .option("dbtable", "emp2") \
    .option("user", user) \
    .option("password", password) \
    .option("driver", driver) \
    .save()
```
2. Use `df.write.jdbc(...)`
``` python
# PostgreSQL connection config
db_url="jdbc:postgresql://postgresql-955091:5432/test"
table="employee"
user="pengfei"
password="test"
driver="org.postgresql.Driver"
# Note: the driver value must be changed if you use another database
# e.g. MySQL:      com.mysql.jdbc.Driver
#      PostgreSQL: org.postgresql.Driver
db_properties={"user": user, "password": password, "driver" : driver }
df.write.jdbc(url=db_url,table=table,mode='overwrite',properties=db_properties)
```
The code cells further down run both approaches and write the DataFrame to the database server as a table.

After writing, check that **the generated table has the same schema as the DataFrame**:

```sql
SELECT 
   table_name, 
   column_name, 
   data_type 
FROM 
   information_schema.columns
WHERE 
   table_name = 'employee';
```   

The result is shown below. Spark's `long` is converted to `bigint`, and `string` is converted to `text`.

``` text
 table_name |    column_name     | data_type
------------+--------------------+-----------
 employee   | superior_emp_id    | bigint
 employee   | emp_id             | bigint
 employee   | salary             | bigint
 employee   | gender             | text
 employee   | dept_creation_year | text
 employee   | name               | text
 employee   | emp_dept_id        | text
```
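
This comparison can be automated once the `information_schema` rows are available in Python. The sketch below encodes only the two mappings observed here (`long` to `bigint`, `string` to `text`); `OBSERVED_TYPE_MAP` and `schema_mismatches` are hypothetical names, and any other Spark type would need its own entry.

```python
# Only the mappings observed in this notebook; extend for other types.
OBSERVED_TYPE_MAP = {"long": "bigint", "string": "text"}


def schema_mismatches(spark_cols, pg_cols):
    """Return columns whose PostgreSQL type differs from the expected mapping.

    spark_cols: dict of column name -> Spark type name (as printed by printSchema)
    pg_cols:    dict of column name -> data_type from information_schema.columns
    """
    mismatches = {}
    for name, spark_type in spark_cols.items():
        expected = OBSERVED_TYPE_MAP.get(spark_type)
        actual = pg_cols.get(name)
        if expected is None or actual != expected:
            mismatches[name] = (spark_type, actual)
    return mismatches


spark_cols = {
    "emp_id": "long", "name": "string", "superior_emp_id": "long",
    "dept_creation_year": "string", "emp_dept_id": "string",
    "gender": "string", "salary": "long",
}
pg_cols = {
    "superior_emp_id": "bigint", "emp_id": "bigint", "salary": "bigint",
    "gender": "text", "dept_creation_year": "text", "name": "text",
    "emp_dept_id": "text",
}
print(schema_mismatches(spark_cols, pg_cols))  # {}
```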


In [11]:
# PostgreSQL connection config
driver="org.postgresql.Driver"
# Note: the driver value must be changed if you use another database
# e.g. MySQL:      com.mysql.jdbc.Driver
#      PostgreSQL: org.postgresql.Driver
db_properties={"user": user, "password": password, "driver" : driver }


In [12]:
# We use solution 1 to write to a table named emp2
df.write \
    .format("jdbc") \
    .option("url", db_url) \
    .option("dbtable", "emp2") \
    .option("user", user) \
    .option("password", password) \
    .option("driver", driver) \
    .save()


In [7]:
# We use solution 2 to write to a table named employee
# note that the db_properties is a dictionary that contains user and password
df.write.jdbc(url=db_url,table=table_name,mode='overwrite',properties=db_properties)
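
The `mode='overwrite'` argument used above replaces the target table if it already exists. The other save modes that `DataFrameWriter` accepts are `append`, `ignore`, and `error` (alias `errorifexists`, which is the default when no mode is given, as in the `.save()` call of solution 1). The sketch below wraps the call so that an unexpected mode fails early; `write_table` and `VALID_MODES` are hypothetical names, and `df` is assumed to be any Spark DataFrame.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_table(df, url, table, properties, mode="errorifexists"):
    """Write df to a JDBC table after validating the save mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported save mode: {mode!r}")
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


# Usage (assumes df, db_url and db_properties from this notebook):
# write_table(df, db_url, "employee", db_properties, mode="append")
```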

                                                                                

# 3.1.4 Read a table into a Spark DataFrame

As with writing, there are two ways to read a table into a DataFrame:
1. Use `spark.read.jdbc(...)`
2. Use `spark.read.format("jdbc").option(...).load()`

Use solution 1 to generate a DataFrame from a table:

In [15]:
# table_name is the variable defined in the session setup cell
df_read1=spark.read.jdbc(url=db_url, table=table_name, properties=db_properties)
df_read1.show()
df_read1.printSchema()

+------+--------+---------------+------------------+-----------+------+------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|
+------+--------+---------------+------------------+-----------+------+------+
|     2|    Rose|              1|              2010|         20|     M|  4000|
|     3|Williams|              1|              2018|         21|     M|  1000|
|     5|   Brown|              2|              2010|         30|     F|    -1|
|     6|  Foobar|              2|              2010|        150|     F|    -1|
|     4|   Jones|              2|              2005|         31|     F|  2000|
|     1|   Smith|             -1|              2018|         10|     M|  3000|
+------+--------+---------------+------------------+-----------+------+------+

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- dept_creation_year: string (nullable = true)
 |-- emp_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

Use solution 2 to generate a DataFrame from a table:

In [17]:
df_read2=spark.read.format("jdbc") \
    .option("url", db_url) \
    .option("dbtable", table_name) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", driver) \
    .load()

df_read2.show()
df_read2.printSchema()

+------+--------+---------------+------------------+-----------+------+------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|
+------+--------+---------------+------------------+-----------+------+------+
|     2|    Rose|              1|              2010|         20|     M|  4000|
|     3|Williams|              1|              2018|         21|     M|  1000|
|     5|   Brown|              2|              2010|         30|     F|    -1|
|     6|  Foobar|              2|              2010|        150|     F|    -1|
|     4|   Jones|              2|              2005|         31|     F|  2000|
|     1|   Smith|             -1|              2018|         10|     M|  3000|
+------+--------+---------------+------------------+-----------+------+------+

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- dept_creation_year: string (nullable = true)
 |-- emp_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
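
Both read solutions load the whole table. When only part of it is needed, the JDBC source in recent Spark versions also accepts a `query` option in place of `dbtable`, so the filtering runs in PostgreSQL. A minimal sketch, assuming the `spark`, `db_url`, `user` and `password` variables from the cells above; `read_query` is a hypothetical helper and the SQL statement in the usage comment is only an illustration.

```python
def read_query(spark, url, sql, user, password, driver="org.postgresql.Driver"):
    """Load the result of a SQL query into a DataFrame via JDBC."""
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("query", sql)  # use instead of "dbtable", not together with it
        .option("user", user)
        .option("password", password)
        .option("driver", driver)
        .load()
    )


# Usage (assumes the notebook's session and connection variables):
# df_paid = read_query(
#     spark, db_url,
#     "SELECT emp_id, name, salary FROM employee WHERE salary > 0",
#     user, password,
# )
# df_paid.show()
```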

## Conclusion

Two important things:
1. You need to make the JDBC driver available to your Spark session.
   For example, in a notebook, you can add `.config('spark.jars.packages','org.postgresql:postgresql:42.2.24')` to the `SparkSession.builder`.
   In submit mode, you need to add options such as `--driver-class-path path/to.jar --jars path/to.jar` or `--packages org.postgresql:postgresql:42.2.24`.
2. When you read or write, you need to specify your JDBC driver class.
   For example, `.option("driver", "org.postgresql.Driver")` for PostgreSQL
   or `.option("driver", "com.mysql.jdbc.Driver")` for MySQL.
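
Point 2 can be captured in a small lookup so that the driver class follows from the JDBC URL. The two class names are the ones listed above; `DRIVER_BY_PREFIX` and `driver_for_url` are hypothetical names for this sketch, and other databases would need their own entries.

```python
# URL prefix -> driver class, limited to the two drivers named in this notebook
DRIVER_BY_PREFIX = {
    "jdbc:postgresql:": "org.postgresql.Driver",
    "jdbc:mysql:": "com.mysql.jdbc.Driver",
}


def driver_for_url(url: str) -> str:
    """Return the JDBC driver class matching the URL prefix."""
    for prefix, driver_class in DRIVER_BY_PREFIX.items():
        if url.startswith(prefix):
            return driver_class
    raise ValueError(f"no JDBC driver registered for URL: {url}")


print(driver_for_url("jdbc:postgresql://localhost:5432/test"))  # org.postgresql.Driver
```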


When should you use `--jars`, and when `--packages`? See this [answer](https://stackoverflow.com/questions/51434808/spark-submit-packages-vs-jars).