## Lab Overview
- We can create a PySpark SQL DataFrame from different database sources such as MySQL, Oracle, and PostgreSQL. In this lab, we will explore how to read data from MySQL using a JDBC connection. We will use the “classicmodels” database in this example.

### Lab Objectives
- Identify how to set up a JDBC connector for Spark to establish communication with RDBMS sources.
- Illustrate how to use PySpark's DataFrame API to read data from an RDBMS source using JDBC connections.
- List various DataFrame operations and methods to manipulate and analyze the data retrieved from the RDBMS source.

### 1 - Setting Up the MySQL Connector for Spark
- For Spark to communicate with an RDBMS, it needs a compatible JDBC connector. For MySQL, the connector is MySQL Connector/J, which is available from the MySQL downloads site. Once you have downloaded it, move the JAR file into the jars folder of your Spark installation and RESTART your Jupyter notebook.
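
A quick way to confirm that the connector landed in the right place is to list the matching JAR files before restarting the notebook. The sketch below is a minimal check and not part of the original lab: the helper name `find_connector_jars`, the `mysql-connector-*.jar` file name pattern, and the use of the `SPARK_HOME` environment variable to locate the installation are all assumptions, so adjust them to your setup.

```python
import glob
import os


def find_connector_jars(jars_dir, pattern="mysql-connector-*.jar"):
    """Return the sorted file names in jars_dir that match the pattern."""
    matches = glob.glob(os.path.join(jars_dir, pattern))
    return sorted(os.path.basename(path) for path in matches)


if __name__ == "__main__":
    # Assumption: SPARK_HOME points at the Spark installation folder.
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None:
        print("SPARK_HOME is not set; pass the jars folder path explicitly.")
    else:
        jars = find_connector_jars(os.path.join(spark_home, "jars"))
        print(jars if jars else "No MySQL connector JAR found.")
```

An empty result suggests the JAR is missing or named differently from the assumed pattern.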

### 2 - Read Full Data from MySQL into a PySpark DataFrame
- Creating a SparkSession

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Test SQL app").getOrCreate()


### Example 1: Read a full table
- In this example, we will read data from a MySQL database into a DataFrame using JDBC (Java Database Connectivity). The code connects to the "classicmodels" database and loads the "orders" table into a Spark DataFrame.

- Note: Do not forget to change the user and password to match your own database.

In [None]:
df = spark.read.format("jdbc").options(driver="com.mysql.cj.jdbc.Driver",
                                       user="root",
                                       password="password",
                                       url="jdbc:mysql://localhost:3306/classicmodels",
                                       dbtable="classicmodels.orders").load()
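
Since the user name and password differ from one machine to the next, one option is to read them from environment variables instead of editing the cell. The following is a minimal sketch under stated assumptions: the variable names `MYSQL_USER` and `MYSQL_PASSWORD` and the helpers `jdbc_options` and `read_table` are hypothetical, and the fallback values simply mirror the placeholders used in this lab.

```python
import os


def jdbc_options(dbtable, env=None):
    """Build the JDBC options used in this lab, reading credentials from env."""
    env = os.environ if env is None else env
    return {
        "driver": "com.mysql.cj.jdbc.Driver",
        # Hypothetical variable names; the fallbacks mirror the lab's placeholders.
        "user": env.get("MYSQL_USER", "root"),
        "password": env.get("MYSQL_PASSWORD", "password"),
        "url": "jdbc:mysql://localhost:3306/classicmodels",
        "dbtable": dbtable,
    }


def read_table(spark, dbtable, env=None):
    """Load a table (or an aliased subquery) through an existing SparkSession."""
    return spark.read.format("jdbc").options(**jdbc_options(dbtable, env)).load()
```

With a running session, `read_table(spark, "classicmodels.orders")` would perform the same read as the cell above.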


In [3]:
# Count the number of rows available
df.count()

326

In [4]:
# Print the schema of the 'orders' table
df.printSchema()

root
 |-- orderNumber: integer (nullable = true)
 |-- orderDate: date (nullable = true)
 |-- requiredDate: date (nullable = true)
 |-- shippedDate: date (nullable = true)
 |-- status: string (nullable = true)
 |-- comments: string (nullable = true)
 |-- customerNumber: integer (nullable = true)



In [5]:
# Show the first rows (20 by default) of the 'orders' table, with all columns
df.show()

+-----------+----------+------------+-----------+-------+--------------------+--------------+
|orderNumber| orderDate|requiredDate|shippedDate| status|            comments|customerNumber|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
|      10100|2003-01-06|  2003-01-13| 2003-01-10|Shipped|                NULL|           363|
|      10101|2003-01-09|  2003-01-18| 2003-01-11|Shipped|Check on availabi...|           128|
|      10102|2003-01-10|  2003-01-18| 2003-01-14|Shipped|                NULL|           181|
|      10103|2003-01-29|  2003-02-07| 2003-02-02|Shipped|                NULL|           121|
|      10104|2003-01-31|  2003-02-09| 2003-02-01|Shipped|                NULL|           141|
|      10105|2003-02-11|  2003-02-21| 2003-02-12|Shipped|                NULL|           145|
|      10106|2003-02-17|  2003-02-24| 2003-02-21|Shipped|                NULL|           278|
|      10107|2003-02-24|  2003-03-03| 2003-02-26|Shipped|Dif

### Example 2: Read with a custom query
- Spark does not limit us to reading an entire table at a time. We can also pass a SELECT query through the dbtable option of spark.read and get the query result as a DataFrame. The query must be wrapped in parentheses and given an alias so that Spark can treat it like a table.
- In this example, we will connect to a MySQL database and load a filtered subset of the "orders" table: the rows where customerNumber = 144.

In [6]:
query = "(select * from orders where customerNumber = 144) as cust"

df = spark.read.format("jdbc").options(driver="com.mysql.cj.jdbc.Driver",
                                       user="root",
                                       password="password",
                                       url="jdbc:mysql://localhost:3306/classicmodels",
                                       dbtable=query).load()
df.show()


+-----------+----------+------------+-----------+-------+--------------------+--------------+
|orderNumber| orderDate|requiredDate|shippedDate| status|            comments|customerNumber|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
|      10112|2003-03-24|  2003-04-03| 2003-03-29|Shipped|Customer requeste...|           144|
|      10320|2004-11-03|  2004-11-13| 2004-11-07|Shipped|                NULL|           144|
|      10326|2004-11-09|  2004-11-16| 2004-11-10|Shipped|                NULL|           144|
|      10334|2004-11-19|  2004-11-28|       NULL|On Hold|The outstaniding ...|           144|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
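
Spark's JDBC source also accepts a `query` option (available since Spark 2.4) as an alternative to an aliased subquery in `dbtable`. Spark wraps the statement in a subquery itself, and the two options cannot be set together. The sketch below shows both forms; the helper names `as_subquery` and `read_query` and the `base_options` dictionary are illustrative and not part of the lab.

```python
def as_subquery(sql, alias="cust"):
    """Wrap a SELECT statement so it can be passed as the dbtable option."""
    # Drop surrounding whitespace and any trailing semicolon,
    # which is not valid inside a subquery.
    cleaned = sql.strip().rstrip(";").rstrip()
    return f"({cleaned}) as {alias}"


def read_query(spark, sql, base_options):
    """Read a query result using the `query` option instead of `dbtable`."""
    # base_options is assumed to hold driver, user, password and url;
    # any dbtable entry is removed because it conflicts with `query`.
    options = {k: v for k, v in base_options.items() if k != "dbtable"}
    return spark.read.format("jdbc").options(**options).option("query", sql).load()
```

Calling `as_subquery("select * from orders where customerNumber = 144")` reproduces the string assigned to `query` in this example.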



### Example 3: Querying Multiple Records from MySQL in PySpark

- In this example, we will connect to a MySQL database and retrieve orders for customers with customerNumber = 144 or customerNumber = 128. The SQL query is wrapped in parentheses and aliased (as cust) because Spark places the dbtable value in the FROM clause of the statement it sends over JDBC. The filtered results are loaded into a Spark DataFrame and displayed using .show(). Because the filter runs inside MySQL, only the matching rows are transferred to Spark.
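
The query in this example chains two conditions with `or`. For a longer list of customers, an `in (...)` list expresses the same filter more compactly. The following is a minimal sketch, assuming the customer numbers arrive as a Python list; the helper name `orders_for_customers` is hypothetical. Each value is converted with `int()` so that only numeric literals end up in the SQL text.

```python
def orders_for_customers(customer_numbers, alias="cust"):
    """Build an aliased subquery selecting orders for the given customers."""
    if not customer_numbers:
        raise ValueError("at least one customer number is required")
    # int() raises ValueError for anything that is not a plain number.
    ids = ", ".join(str(int(n)) for n in customer_numbers)
    return f"(select * from orders where customerNumber in ({ids})) as {alias}"
```

The string returned by `orders_for_customers([144, 128])` can be passed as the `dbtable` option in the same way as the hand-written query.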

In [None]:
query = "(select * from orders where customerNumber = 144 or customerNumber = 128) as cust"

df = spark.read.format("jdbc").options(driver="com.mysql.cj.jdbc.Driver",
                                       user="root",
                                       password="password",
                                       url="jdbc:mysql://localhost:3306/classicmodels",
                                       dbtable=query).load()
df.show()


+-----------+----------+------------+-----------+-------+--------------------+--------------+
|orderNumber| orderDate|requiredDate|shippedDate| status|            comments|customerNumber|
+-----------+----------+------------+-----------+-------+--------------------+--------------+
|      10101|2003-01-09|  2003-01-18| 2003-01-11|Shipped|Check on availabi...|           128|
|      10230|2004-03-15|  2004-03-24| 2004-03-20|Shipped|Customer very con...|           128|
|      10300|2003-10-04|  2003-10-13| 2003-10-09|Shipped|                NULL|           128|
|      10323|2004-11-05|  2004-11-12| 2004-11-09|Shipped|                NULL|           128|
|      10112|2003-03-24|  2003-04-03| 2003-03-29|Shipped|Customer requeste...|           144|
|      10320|2004-11-03|  2004-11-13| 2004-11-07|Shipped|                NULL|           144|
|      10326|2004-11-09|  2004-11-16| 2004-11-10|Shipped|                NULL|           144|
|      10334|2004-11-19|  2004-11-28|       NULL|On Hold|The