# Objectives
* Read data into Spark
* Use data.table, vaex and lazy queries
* Run background jobs in Jupyter
* Run queries on Google Colab against a local database
* Run H2O.ai to create an ML model
* Save results to a Parquet file
* Use dbt and Airflow to orchestrate everything
* Use GitHub Actions with dbt
* Convert to a script with classes
* Create a front end using Django
* Test the following tools:
  * DVC for data versioning
  * MLflow for model versioning
  * MLflow vs Airflow
  * Model monitoring

# Packages

In [1]:
import pandas as pd
from datetime import datetime, date
import sys

In [2]:
# Make the PySpark package bundled with the local Spark installation importable
sys.path.insert(0, "C:\\Users\\User\\AppData\\Local\\spark\\spark-3.1.1-bin-hadoop3.2\\python")

from pyspark.sql import SparkSession
from pyspark import __version__ as py_ver
from py4j import __version__ as py4_ver
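
The hard-coded path above ties the notebook to one machine. A minimal sketch of a more portable variant follows, assuming a `SPARK_HOME` environment variable points at the Spark installation. The helper names `spark_python_paths` and `add_spark_to_sys_path` are hypothetical and not part of PySpark.

```python
import os
import sys
from pathlib import Path


def spark_python_paths(spark_home):
    """Return the sys.path entries needed to import PySpark from a Spark home."""
    python_dir = Path(spark_home) / "python"
    entries = [str(python_dir)]
    # Spark distributions ship py4j as a zip under python/lib;
    # the version in the file name differs between releases.
    lib_dir = python_dir / "lib"
    if lib_dir.is_dir():
        entries += [str(z) for z in sorted(lib_dir.glob("py4j-*-src.zip"))]
    return entries


def add_spark_to_sys_path(spark_home=None):
    """Prepend the PySpark directories to sys.path (assumes SPARK_HOME is set)."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("Set SPARK_HOME or pass spark_home explicitly")
    for entry in spark_python_paths(spark_home):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return spark_home
```

Calling `add_spark_to_sys_path()` before `from pyspark.sql import SparkSession` would then replace the explicit `sys.path.insert` call.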

# PySpark
References:
* https://phoenixnap.com/kb/install-spark-on-windows-10
* https://spark.apache.org/docs/latest/api/python/index.html
* https://www.tutorialspoint.com/pyspark/pyspark_environment_setup.htm

We will reuse an existing Spark installation: <br>
$spark_home <br>
[1] "C:\\Users\\User\\AppData\\Local\\spark\\spark-3.1.1-bin-hadoop3.2"

In [31]:
py_ver
py4_ver

# Python executable path
sys.executable

'3.1.1'

In [34]:
sc = SparkSession.builder.getOrCreate()
sc

## Importing data
### From Pandas

In [35]:
pandas_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})

sp_df = sc.createDataFrame(pandas_df)
sp_df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [36]:
sp_df.printSchema()
sp_df.show()

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+



### From CSV

In [37]:
sp_csv = sc.read.csv("../inputs/task2_data1.csv")
sp_csv

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string]

In [38]:
sp_csv.show()

+--------------------+--------------------+-----------------+--------+
|                 _c0|                 _c1|              _c2|     _c3|
+--------------------+--------------------+-----------------+--------+
|        Company.Name|             Address|             City|Postcode|
|        Carsten Helm|         Ulmenstr. 8|           Wismar|   23966|
|Zirpel & Pautzsch...|    Paditzer Str. 33|        Altenburg|    4600|
|     Eberhard Zessin|   Steingartenweg 12|       Heidelberg|   69118|
|        Gerold Fuchs|          Mühlweg 12|        Dietingen|   78661|
|     Rudi Biedritzky|  Zaisentalstr. 70/1|       Reutlingen|   72760|
|      Wolfgang Jäger|       Wiesenstr. 11|           Rodgau|   63110|
|       Mario Tsiknas|          Am Delf 31|  Bad Zwischenahn|   26160|
|Matthias Essers G...|Leopold-Hoesch-St...|    Geilenkirchen|   52511|
|       Andre Hanisch|   Im Kressgraben 18|   Untereisesheim|   74257|
|         Paul Strigl|Thomas-Schwarz-St...|           Dachau|   85221|
|     
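
In the output above, the header line (`Company.Name`, `Address`, ...) is read as a data row and the columns get the default names `_c0` to `_c3`. A minimal sketch of a read that treats the first line as column names is shown below. `read_csv_with_header` is a hypothetical helper, while `header` and `inferSchema` are existing options of `DataFrameReader.csv`.

```python
# Options passed to spark.read.csv:
#   header=True       -> use the first line as column names
#   inferSchema=False -> keep every column as string (e.g. postcodes stay text)
CSV_OPTIONS = {"header": True, "inferSchema": False}


def read_csv_with_header(spark, path):
    """Read a CSV file whose first line holds the column names."""
    return spark.read.csv(path, **CSV_OPTIONS)
```

With the session from this notebook, this would be called as `read_csv_with_header(sc, "../inputs/task2_data1.csv")`.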

### From SQL

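Add the PostgreSQL JDBC driver to the driver classpath by setting `spark.driver.extraClassPath` to `E:\\Softwares\\postgresql-42.2.22.jar` in `conf/spark-defaults.conf`. In client mode this setting cannot be applied through `SparkConf` from inside the application, because the driver JVM has already started by then, so the commented-out code chunk below fails.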

References:
* https://spark.apache.org/docs/latest/configuration.html
* https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
* https://stackoverflow.com/questions/30983982/how-to-use-jdbc-source-to-write-and-read-data-in-pyspark

In [39]:
# sc.sparkContext.stop()
# confs = [("spark.driver.extraClassPath", "E:\\Softwares\\postgresql-42.2.22.jar")]
# conf = sc.sparkContext._conf.setAll(confs)
# sc.sparkContext.stop()
# sc = SparkSession.builder.config(conf=conf).getOrCreate()
# sc.sparkContext._conf.getAll()
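
One alternative to editing `conf/spark-defaults.conf` is to pass `--driver-class-path` through the `PYSPARK_SUBMIT_ARGS` environment variable before the first session is created. The sketch below assumes PySpark splits that variable with shell-like rules when it launches the JVM, and that no Spark session is running yet in the current Python process. `driver_classpath_submit_args` is a hypothetical helper.

```python
import os
import shlex


def driver_classpath_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that puts one jar on the driver classpath."""
    # shlex.quote keeps paths containing backslashes or spaces as a single argument.
    # The value ends with "pyspark-shell", the default PySpark uses when the variable is unset.
    return f"--driver-class-path {shlex.quote(jar_path)} pyspark-shell"


# Set this before SparkSession.builder.getOrCreate() is called for the first time.
os.environ["PYSPARK_SUBMIT_ARGS"] = driver_classpath_submit_args(
    "E:\\Softwares\\postgresql-42.2.22.jar"
)
```

If a session already exists, the kernel has to be restarted for the new classpath to take effect.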

In [40]:
sc.sparkContext._conf.getAll()

[('spark.executor.id', 'driver'),
 ('spark.driver.port', '55215'),
 ('spark.app.name', 'pyspark-shell'),
 ('spark.sql.warehouse.dir',
  'C:/Users/User/AppData/Local/spark/spark-3.1.1-bin-hadoop3.2/tmp/hive'),
 ('spark.driver.host', 'DESKTOP-BTP6V55'),
 ('spark.app.id', 'local-1645581651266'),
 ('spark.local.dir',
  'C:/Users/User/AppData/Local/spark/spark-3.1.1-bin-hadoop3.2/tmp/local'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.extraClassPath', 'E:\\Softwares\\postgresql-42.2.22.jar'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.startTime', '1645581649062'),
 ('spark.ui.showConsoleProgress', 'true')]

In [41]:
sp_pg = sc.read.jdbc(
    url="jdbc:postgresql://localhost:5432/chinook", 
    table="album", 
    properties={"user":"postgres", "password":"john"})

sp_pg

DataFrame[AlbumId: int, Title: string, ArtistId: int]
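
The call above embeds the database password in the notebook. A minimal sketch that reads the connection details from environment variables instead is shown below. It assumes the standard PostgreSQL variables (`PGHOST`, `PGPORT`, `PGUSER`, `PGPASSWORD`) are set, and `postgres_jdbc_settings` is a hypothetical helper. The `driver` entry names the PostgreSQL JDBC driver class explicitly.

```python
import os


def postgres_jdbc_settings(database, env=None):
    """Return (url, properties) for spark.read.jdbc, built from PG* variables."""
    env = os.environ if env is None else env
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": env.get("PGUSER", "postgres"),
        # No default on purpose: fail loudly if the password is not configured.
        "password": env["PGPASSWORD"],
        "driver": "org.postgresql.Driver",
    }
    return url, properties
```

With the session from this notebook, usage would look like `url, props = postgres_jdbc_settings("chinook")` followed by `sc.read.jdbc(url=url, table="album", properties=props)`.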

In [42]:
sp_pg.show()

+-------+--------------------+--------+
|AlbumId|               Title|ArtistId|
+-------+--------------------+--------+
|      1|For Those About T...|       1|
|      2|   Balls to the Wall|       2|
|      3|   Restless and Wild|       2|
|      4|   Let There Be Rock|       1|
|      5|            Big Ones|       3|
|      6|  Jagged Little Pill|       4|
|      7|            Facelift|       5|
|      8|      Warner 25 Anos|       6|
|      9|Plays Metallica B...|       7|
|     10|          Audioslave|       8|
|     11|        Out Of Exile|       8|
|     12| BackBeat Soundtrack|       9|
|     13|The Best Of Billy...|      10|
|     14|Alcohol Fueled Br...|      11|
|     15|Alcohol Fueled Br...|      11|
|     16|       Black Sabbath|      12|
|     17|Black Sabbath Vol...|      12|
|     18|          Body Count|      13|
|     19|    Chemical Wedding|      14|
|     20|The Best Of Buddy...|      15|
+-------+--------------------+--------+
only showing top 20 rows



## Exporting data
### To Pandas

In [43]:
sp_pg.toPandas()

Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3
...,...,...,...
342,343,Respighi:Pines of Rome,226
343,344,Schubert: The Late String Quartets & String Qu...,272
344,345,Monteverdi: L'Orfeo,273
345,346,Mozart: Chamber Music,274
