# Learning Objectives

In this notebook, you will:
- learn the concept of ETL
- write ETL jobs for CSV files from [`pgexercises`](https://pgexercises.com/gettingstarted.html)

# What's ETL or ELT?

ETL stands for Extract, Transform, Load. In the context of Spark, ETL refers to the process of extracting data from various sources, transforming it into a desired format or structure, and loading it into a target system, such as a data warehouse or a data lake.

Here's a breakdown of each step in the ETL process:

## Extract
This step involves extracting data from multiple sources, such as databases, files (CSV, JSON, Parquet), APIs, or streaming data sources. Spark provides connectors and APIs to read data from a wide range of sources, allowing you to extract data in parallel and efficiently handle large datasets.

## Transform
In the transform step, the extracted data is processed and transformed according to specific business logic or requirements. This may involve cleaning the data, applying calculations or aggregations, performing data enrichment, filtering, joining datasets, or any other data manipulation operations. Spark provides a powerful set of transformation functions and SQL capabilities to perform these operations efficiently in a distributed and scalable manner.

## Load
Once the data has been transformed, it is loaded into a target system, such as a data warehouse, a data lake, or another storage system. Spark allows you to write the transformed data to various output formats and storage systems, including databases, distributed file systems and object stores (such as the Hadoop Distributed File System or Amazon S3), and file or table formats such as Apache Parquet (columnar) or Delta Lake (a table format built on Parquet). The data can be partitioned, sorted, or structured to optimize querying and analysis.

Spark's distributed computing capabilities, scalability, and rich ecosystem of libraries make it a popular choice for ETL workflows. It can handle large-scale data processing, perform complex transformations, and efficiently load data into different target systems.

By leveraging Spark for ETL, organizations can extract data from diverse sources, apply transformations to ensure data quality and consistency, and load the transformed data into a central repository for further analysis, reporting, or machine learning tasks.
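
The three steps can be tried without a cluster. The sketch below is a minimal plain-Python illustration using only the standard library, not Spark: it extracts rows from an in-memory CSV string, transforms them by casting types and filtering, and loads them into an in-memory SQLite table. The sample rows, the function names (`extract`, `transform`, `load`, `run_etl`), and the `bookings_clean` table name are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a source CSV file.
RAW_CSV = """bookid,facid,memid,slots
0,3,1,2
1,4,1,2
2,6,0,2
3,7,1,0
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast every field to int and drop bookings with zero slots."""
    typed = [{key: int(value) for key, value in row.items()} for row in rows]
    return [row for row in typed if row["slots"] > 0]


def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE bookings_clean "
        "(bookid INTEGER, facid INTEGER, memid INTEGER, slots INTEGER)"
    )
    conn.executemany(
        "INSERT INTO bookings_clean VALUES (:bookid, :facid, :memid, :slots)",
        rows,
    )
    conn.commit()


def run_etl(text):
    """Chain the three steps and return the connection holding the result."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


conn = run_etl(RAW_CSV)
loaded_count = conn.execute("SELECT COUNT(*) FROM bookings_clean").fetchone()[0]
print(loaded_count)
```

Keeping each step in its own function mirrors how the Spark jobs later in this notebook are organized: a read, an optional transformation, and a write to a table.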

# Enable DBFS UI

- Setting -> Admin Console -> search for dbfs

<img src="https://raw.githubusercontent.com/jarviscanada/jarvis_data_eng_demo/feature/data/spark/notebook/spark_fundamentals/img/entable_dbfs.jpg" width="700">

- Refresh the page and view DBFS files from UI

<img src="https://raw.githubusercontent.com/jarviscanada/jarvis_data_eng_demo/feature/data/spark/notebook/spark_fundamentals/img/dbfs%20ui.png" width="700">

## Import `pgexercises` CSV files

- The pgexercises CSV data files can be found [here](https://github.com/jarviscanada/jarvis_data_eng_demo/tree/feature/data/spark/data/pgexercises).
- The pgexercises schema can be found [here](https://pgexercises.com/gettingstarted.html) (for reference purposes).
- Upload the `bookings.csv`, `facilities.csv`, and `members.csv` files using the Databricks UI (see the screenshot below).
- You can view the imported files from the DBFS UI.

![Upload Files](https://raw.githubusercontent.com/jarviscanada/jarvis_data_eng_demo/feature/data/spark/notebook/spark_fundamentals/img/upload%20file.png)

# Interview Questions

While completing the rest of the practice, try to answer the following questions:

## Concepts
- What is ETL? (Hint: Explain each step)
ETL stands for extract, transform, and load. It refers to the process of extracting data from different sources, transforming that data into the desired format, and then loading it into a storage system or a data warehouse for further analysis.  

Extract:

Transform:

Loading:

## Databricks
- What is Databricks?
- What is a Notebook?
- What is DBFS?
- What is a cluster? 
- Is Databricks a data lake or a data warehouse?

## Managed Table
- What is a managed table in Databricks?
- Can you explain how to create a managed table in Databricks?
- Can you compare a managed table with an RDBMS table? (Hint: Schema on read vs schema on write)
- What is the Hive metastore and how does it relate to managed tables in Databricks?
- How does a managed table differ from an unmanaged (external) table in Databricks? (Hint: Consider what happens to the data when the table is deleted)
- How can you define a schema for a managed table?

## Spark
`df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(file_location)`
- What does `option("inferSchema", "true")` do?
- What does `option("header", "true")` do?
- How can you write data to a managed table?
- How can you read data from a managed table into a DataFrame?
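
As a rough mental model for the two options in the line above: `header` decides whether the first line supplies the column names (without it, Spark names the columns `_c0`, `_c1`, and so on), and `inferSchema` asks the reader to scan the values and guess a type per column instead of leaving every column as a string. The plain-Python sketch below imitates that idea with the standard library. It is a simplified analogy, not Spark's actual inference algorithm: it only knows `int`, `float`, and `string`, whereas Spark's reader recognizes more types. The names `guess_type` and `read_csv_sketch` and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample data with a header line.
SAMPLE = """bookid,starttime,slots
0,2012-07-03 11:00:00,2
1,2012-07-03 08:00:00,2
"""


def guess_type(values):
    """Guess one type name for a column: try int, then float, else string."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def read_csv_sketch(text, header=False, infer_schema=False):
    """Return (column name, type name) pairs, mimicking the two options."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        # The first line supplies the column names; the rest is data.
        names, data = rows[0], rows[1:]
    else:
        # Every line is data and columns get positional names.
        names, data = [f"_c{i}" for i in range(len(rows[0]))], rows
    columns = list(zip(*data))
    if infer_schema:
        types = [guess_type(column) for column in columns]
    else:
        types = ["string"] * len(names)
    return list(zip(names, types))


with_options = read_csv_sketch(SAMPLE, header=True, infer_schema=True)
without_options = read_csv_sketch(SAMPLE)
print(with_options)
print(without_options)
```

Because inference is only a guess based on the values it sees, an explicitly defined schema is the more reliable choice when the target data types are already known, for example from a DDL.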

# ETL `bookings.csv` file

- **Extract**: Load data from the CSV file into a DataFrame (DF)
- **Transform**: No transformation is needed, because we want to load the data as is
- **Load**: Save the DF into a managed table (or Hive table)

# Managed Table
This is an important interview topic. Some people may refer to managed tables as Hive tables.

https://docs.databricks.com/data-governance/unity-catalog/create-tables.html

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

file_location = "/FileStore/tables/bookings.csv"

# What do `option("header", "true")` and `option("inferSchema", "true")` do?
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(file_location)

# Why doesn't the df schema match the DDL data types? https://pgexercises.com/gettingstarted.html (hint: `option("inferSchema", "true")`)
df.printSchema()

# Here is the solution to define schema manually
# Define schema for the bookings table
schema = StructType([
    StructField("bookid", IntegerType(), True),
    StructField("facid", IntegerType(), True),
    StructField("memid", IntegerType(), True),
    StructField("starttime", TimestampType(), True),
    StructField("slots", IntegerType(), True)
])

# Read data from CSV file into DataFrame with predefined schema
df = spark.read.format("csv").option("header", "true").schema(schema).load(file_location)

# Transform: no transformation needed, the data is loaded as is

# Drop the table if it already exists
spark.sql("DROP TABLE IF EXISTS bookings")

# Write data from DataFrame into managed table
df.write.saveAsTable("bookings")

# show() prints the rows itself and returns None, so wrapping it in display() adds nothing
df.show(5)


root
 |-- bookid: integer (nullable = true)
 |-- facid: integer (nullable = true)
 |-- memid: integer (nullable = true)
 |-- starttime: timestamp (nullable = true)
 |-- slots: integer (nullable = true)

+------+-----+-----+-------------------+-----+
|bookid|facid|memid|          starttime|slots|
+------+-----+-----+-------------------+-----+
|     0|    3|    1|2012-07-03 11:00:00|    2|
|     1|    4|    1|2012-07-03 08:00:00|    2|
|     2|    6|    0|2012-07-03 18:00:00|    2|
|     3|    7|    1|2012-07-03 19:00:00|    2|
|     4|    8|    1|2012-07-03 10:00:00|    1|
+------+-----+-----+-------------------+-----+
only showing top 5 rows



# Complete ETL Jobs

- Complete ETL for `facilities.csv` and `members.csv`
- Tips
  - The Databricks community version will terminate the cluster after a few hours of inactivity. As a result, all managed tables will be deleted. You will need to rerun this notebook to perform the ETL on all files for the other exercises.
  - DBFS data will not be deleted when a cluster becomes inactive or is deleted.

In [0]:
# Write an ETL job for `facilities.csv`

# EXTRACT

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType

facilities_file = "/FileStore/tables/facilities.csv"

# Define schema for the facilities table
facilities_schema = StructType([
    StructField("facid", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("membercost", DecimalType(5, 2), True),
    StructField("guestcost", DecimalType(5, 2), True),
    StructField("initialoutlay", DecimalType(8, 2), True),
    StructField("monthlymaintenance", DecimalType(8, 2), True)
])

facilities_df = spark.read.format("csv").option("header", "true").schema(facilities_schema).load(facilities_file)

facilities_df.printSchema()
facilities_df.show(5)  # Display first 5 rows

# TRANSFORM
# Not needed in this scenario

# LOAD

# drop table if it already exists
spark.sql("DROP TABLE IF EXISTS facilities") 

# write data from dataframe into a managed table
facilities_df.write.saveAsTable("facilities")

# show() prints the rows itself and returns None, so wrapping it in display() adds nothing
facilities_df.show(5)






root
 |-- facid: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- membercost: decimal(5,2) (nullable = true)
 |-- guestcost: decimal(5,2) (nullable = true)
 |-- initialoutlay: decimal(8,2) (nullable = true)
 |-- monthlymaintenance: decimal(8,2) (nullable = true)

+-----+---------------+----------+---------+-------------+------------------+
|facid|           name|membercost|guestcost|initialoutlay|monthlymaintenance|
+-----+---------------+----------+---------+-------------+------------------+
|    0| Tennis Court 1|      5.00|    25.00|     10000.00|            200.00|
|    1| Tennis Court 2|      5.00|    25.00|      8000.00|            200.00|
|    2|Badminton Court|      0.00|    15.50|      4000.00|             50.00|
|    3|   Table Tennis|      0.00|     5.00|       320.00|             10.00|
|    4| Massage Room 1|     35.00|    80.00|      4000.00|           3000.00|
+-----+---------------+----------+---------+-------------+------------------+
only showing top

In [0]:
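# Import every type used below so this cell also runs on its own
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType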

members_file = "/FileStore/tables/members.csv"

# Define schema for the members table
members_schema = StructType([
    StructField("memid", IntegerType(), True),
    StructField("surname", StringType(), True),
    StructField("firstname", StringType(), True),
    StructField("address", StringType(), True),
    StructField("zipcode", IntegerType(), True),
    StructField("telephone", StringType(), True),
    StructField("recommendedby", IntegerType(), True),
    StructField("joindate", DateType(), True)
])

# Read members data
members_df = spark.read.format("csv").option("header", "true").schema(members_schema).load(members_file)

# Drop table if exists
spark.sql("DROP TABLE IF EXISTS members")

# Write data from DataFrame into managed table
members_df.write.saveAsTable("members")

# show() prints the rows itself and returns None, so wrapping it in display() adds nothing
members_df.show(5)


+-----+--------+---------+--------------------+-------+--------------+-------------+----------+
|memid| surname|firstname|             address|zipcode|     telephone|recommendedby|  joindate|
+-----+--------+---------+--------------------+-------+--------------+-------------+----------+
|    0|   GUEST|    GUEST|               GUEST|      0|(000) 000-0000|         null|2012-07-01|
|    1|   Smith|   Darren|8 Bloomsbury Clos...|   4321|  555-555-5555|         null|2012-07-02|
|    2|   Smith|    Tracy|8 Bloomsbury Clos...|   4321|  555-555-5555|         null|2012-07-02|
|    3|  Rownam|      Tim|23 Highway Way, B...|  23423|(844) 693-0723|         null|2012-07-03|
|    4|Joplette|   Janice|20 Crossing Road,...|    234|(833) 942-4710|            1|2012-07-03|
+-----+--------+---------+--------------------+-------+--------------+-------------+----------+
only showing top 5 rows



# Save your work to Git

- Export the notebook in IPython Notebook (`.ipynb`) format: `notebook top menu bar -> File -> Export -> IPython Notebook`
- Upload it to your Git repository under `your_repo/spark/notebooks/`
- GitHub can render IPython notebooks, for example: https://github.com/josephcslater/JupyterExamples/blob/master/Calc_Review.ipynb