# Purpose

Explore PySpark and the JDBC connection functionality to read from operational databases.

In this notebook we will set up a PostgreSQL instance and populate it with the Pagila dataset. We will then connect to the database from PySpark via a JDBC connector.

# Setup

## PostgreSQL

First, let's install PostgreSQL in this Colab instance.

In [None]:
!sudo apt install -y postgresql postgresql-contrib

In [None]:
!service postgresql start

Set a password for the default `postgres` user in PostgreSQL ([stackoverflow](https://stackoverflow.com/questions/12720967/how-to-change-postgresql-user-password/12721020#12721020)).


In [None]:
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'test';"

ALTER ROLE


Store your database password in an environment variable so that we do not need to type it in every time (not advisable in general).

We'll use the notebook magic `%env`.

In [None]:
%env PGPASSWORD=test
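
Since `%env` writes to the kernel's environment, later Python cells can read the same variable instead of repeating the password. A minimal sketch, assuming a hypothetical helper named `get_pg_password` (not part of the original notebook) that fails loudly when the variable is missing:

```python
import os


def get_pg_password(env=None):
    """Return the PostgreSQL password stored in PGPASSWORD.

    `get_pg_password` is a hypothetical helper for this notebook. `env`
    defaults to the process environment; a plain dict can be passed instead.
    """
    env = os.environ if env is None else env
    password = env.get("PGPASSWORD")
    if not password:
        raise RuntimeError("PGPASSWORD is not set; run the %env cell first")
    return password
```

Calling `get_pg_password()` after the `%env` cell above should return the value set there.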

## Pagila

Now, let's populate the PostgreSQL database with the Pagila data from the tutorial.

In [None]:
!git clone https://github.com/spatialedge-ai/pagila.git

In [None]:
!psql -h localhost -U postgres -c "create database pagila"

In [None]:
!psql -h localhost -U postgres -d pagila -f "pagila/pagila-schema.sql"

In [None]:
!psql -h localhost -U postgres -d pagila -f "pagila/pagila-data.sql"

## PySpark Setup

Now, let's download what is necessary for initiating JDBC connections, as well as what is required to run PySpark itself.

In [None]:
# https://stackoverflow.com/questions/34948296/using-pyspark-to-connect-to-postgresql
!wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar

In [None]:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np  

%config Completer.use_jedi = False


SPARKVERSION='2.4.8'
HADOOPVERSION='2.7'
pwd=os.getcwd()

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"{pwd}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}"

print(os.environ['SPARK_HOME'])


/content/spark-2.4.8-bin-hadoop2.7


In [None]:
!sudo apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://archive.apache.org/dist/spark/spark-{SPARKVERSION}/spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}.tgz
!tar xf spark-{SPARKVERSION}-bin-hadoop{HADOOPVERSION}.tgz

In [None]:
!cp postgresql-42.5.0.jar spark-2.4.8-bin-hadoop2.7/jars
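
Because the driver file name appears in several cells, it can help to confirm that the jar actually landed in Spark's `jars` directory before starting a session. A small sketch, assuming a hypothetical helper `jdbc_jar_installed` and the jar name used in the download cell above:

```python
from pathlib import Path


def jdbc_jar_installed(spark_home, jar_name="postgresql-42.5.0.jar"):
    """Return True if `jar_name` exists in `<spark_home>/jars`.

    `jdbc_jar_installed` is a hypothetical helper; the default jar name
    is an assumption taken from the wget cell in this notebook.
    """
    return (Path(spark_home) / "jars" / jar_name).is_file()
```

In this notebook it could be called as `jdbc_jar_installed(os.environ["SPARK_HOME"])`; a `False` result usually points to a version mismatch between the downloaded and the referenced jar.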

In [None]:
!pip install findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [None]:
import findspark
findspark.init()

# get a spark session
from pyspark.sql import SparkSession
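spark = (
    SparkSession.builder
    # the jar name must match the driver downloaded above (42.5.0)
    .config("spark.jars", "postgresql-42.5.0.jar")
    # point the driver classpath at the copy placed in Spark's jars directory
    .config("spark.driver.extraClassPath",
            "spark-2.4.8-bin-hadoop2.7/jars/postgresql-42.5.0.jar")
    .getOrCreate()
)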
print(spark.conf.get('spark.jars'))

%env PYARROW_IGNORE_TIMEZONE=1

env: PYARROW_IGNORE_TIMEZONE=1
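
With the session in place, a JDBC read is configured through a handful of options on `spark.read`. The sketch below shows one way to assemble them. The helper names `pagila_jdbc_options` and `read_table` are hypothetical, and port `5432` is an assumption (the PostgreSQL default); the user, database, and password variable come from the setup cells above.

```python
import os


def pagila_jdbc_options(table, host="localhost", port=5432, database="pagila"):
    """Build the option dict for a Spark JDBC read of a single table.

    Hypothetical helper: the port is assumed to be the PostgreSQL default,
    and the password is read from PGPASSWORD as set earlier via %env.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": "postgres",
        "password": os.environ.get("PGPASSWORD", ""),
        "driver": "org.postgresql.Driver",
    }


def read_table(spark, table):
    """Load `table` as a Spark DataFrame over JDBC (needs a live session)."""
    return spark.read.format("jdbc").options(**pagila_jdbc_options(table)).load()
```

Keeping the options in one dict means each question below can reuse the same connection settings and only change the table name.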


# Questions

### Question 1

Using a PySpark DataFrame, print the schema of the `customer` table in the pagila PostgreSQL database by utilising a JDBC connection.

In [1]:
# pyspark code

### Question 2

Use the Spark SQL API to query the customer table, compute the number of unique email addresses in that table and print the result in the notebook.

In [2]:
# pyspark code

### Question 3 

Repeat this calculation using only the DataFrame API and print the result.

In [3]:
# pyspark code

### Question 4 

How many partitions are present in the DataFrame resulting from Question 3? Additionally, provide the code necessary to determine that.

### Question 5

Compute the min and max of customer.create_date and print the result (once more using the Spark DataFrame API and not the Spark SQL API).

### Question 6.1

Determine which first names occur more than once using the Spark SQL API, and print the result.

### Question 6.2

Determine which first names occur more than once using the Spark DataFrame API, and print the result once more.

### Question 7

Port the PostgreSQL query below to the PySpark DataFrame API and execute the query within Spark (not directly on PostgreSQL):

```
SELECT
   staff.first_name
   ,staff.last_name
   ,SUM(payment.amount)
 FROM payment
   INNER JOIN staff ON payment.staff_id = staff.staff_id
 WHERE payment.payment_date BETWEEN '2007-01-01' AND '2007-02-01'
 GROUP BY
   staff.last_name
   ,staff.first_name
 ORDER BY SUM(payment.amount)
 ;
```

### Question 8

Are you currently executing commands on a driver node, or a worker? Provide the code you ran to determine that.