# Run `pyspark_intro.ipynb` first to write the partitioned data to the local directory

In [1]:
import pyspark
from pyspark.sql import SparkSession

# instantiate a Spark session, an object that we use to interact with Spark
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

In [2]:
import pandas as pd

**You will be able to see all 24 partitions as files in the `fhvhv/2021/01/` directory**

**You can also see a `parquet` job at http://localhost:4040/jobs/ and click into it to view more information, such as the DAG**

In [3]:
!dir fhvhv\2021\01

 Volume in drive C has no label.
 Volume Serial Number is 08A3-CF2D

 Directory of C:\Users\nimz\Documents\de_zoomcamp\week5_batch_processing\fhvhv\2021\01

05/09/2023  08:47 PM    <DIR>          .
05/09/2023  08:47 PM    <DIR>          ..
05/09/2023  08:47 PM            71,484 .part-00000-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-c000.snappy.parquet.crc
05/09/2023  08:47 PM            71,468 .part-00001-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-c000.snappy.parquet.crc
05/09/2023  08:47 PM            71,476 .part-00002-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-c000.snappy.parquet.crc
05/09/2023  08:47 PM            71,480 .part-00003-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-c000.snappy.parquet.crc
05/09/2023  08:47 PM            71,480 .part-00004-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-c000.snappy.parquet.crc
05/09/2023  08:47 PM            71,484 .part-00005-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-c000.snappy.parquet.crc
05/09/2023  08:47 PM            71,444 .part-00006-1fc571d4-0079-41cb-ad93-7b7dd5e2db6c-

In [6]:
# read the partitioned files back into a Spark dataframe
df_spark = spark.read.parquet('fhvhv/2021/01/')

In [7]:
# look at the DataFrame schema
#    - Parquet files embed the schema and compress data column by column, which helps keep them small
df_spark.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: string (nullable = true)



## What can we do with Spark DataFrames?

We can do most of the usual things we do with pandas DataFrames

In [8]:
# select only specific columns
df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')

DataFrame[pickup_datetime: timestamp, dropoff_datetime: timestamp, PULocationID: int, DOLocationID: int]

In [13]:
# do filtering
df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \
    .filter(df_spark.hvfhs_license_num == 'HV0003') \
    .show()

+-------------------+-------------------+------------+------------+
|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|
+-------------------+-------------------+------------+------------+
|2021-01-03 15:59:58|2021-01-03 16:13:50|         144|         261|
|2021-01-01 14:39:29|2021-01-01 14:59:45|         148|          68|
|2021-01-01 07:25:16|2021-01-01 07:50:46|          61|          76|
|2021-01-02 01:05:28|2021-01-02 01:11:40|          42|          42|
|2021-01-02 13:01:44|2021-01-02 13:25:23|         155|         177|
|2021-01-01 05:51:46|2021-01-01 06:03:24|          49|         177|
|2021-01-01 02:12:08|2021-01-01 02:19:49|          94|         174|
|2021-01-01 02:17:17|2021-01-01 02:34:03|          42|           4|
|2021-01-01 01:05:04|2021-01-01 01:17:42|         231|         265|
|2021-01-03 01:05:38|2021-01-03 01:09:14|         229|         141|
|2021-01-03 00:37:31|2021-01-03 01:01:18|         179|          14|
|2021-01-01 17:23:04|2021-01-01 17:44:37|       

`.repartition()` and `.filter()` do not run as soon as they are called, because Spark executes some operations right away and defers others

# Actions vs. Transformations
- **Actions** = code that is executed immediately (eager)
    - `show()`, `take()`, `head()`, writing with `write.parquet(...)`, etc.
- **Transformations** = code that is lazy (i.e., not executed immediately)
    - Selecting columns, filtering, joins, and group-by operations
    - In these cases, Spark only records the steps in a plan; the plan runs when we call an action such as `show()`.

***Spark creates a sequence of transformations until an action is executed***
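
As a rough analogy in plain Python (this is not how Spark is implemented; the `LazyPipeline` class, the sample rows, and the `rows_examined` list are all made up for illustration), the sketch below records transformations and only runs them once an action is called:

```python
class LazyPipeline:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # transformation: returns a new pipeline and touches no data
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        # transformation: also only recorded
        return LazyPipeline(self._rows, self._steps + (("select", columns),))

    def _apply(self, row):
        # run every recorded step on one row; return None if the row is filtered out
        for kind, arg in self._steps:
            if kind == "filter":
                if not arg(row):
                    return None
            else:
                row = {column: row[column] for column in arg}
        return row

    def take(self, n):
        # action: the recorded steps finally run, and stop once n rows are found
        out = []
        for row in self._rows:
            result = self._apply(row)
            if result is not None:
                out.append(result)
                if len(out) == n:
                    break
        return out


# made-up rows that reuse the column names from the schema above
trips = [
    {"hvfhs_license_num": "HV0005", "PULocationID": 28, "DOLocationID": 130},
    {"hvfhs_license_num": "HV0003", "PULocationID": 144, "DOLocationID": 261},
    {"hvfhs_license_num": "HV0003", "PULocationID": 148, "DOLocationID": 68},
]

rows_examined = []  # records which rows the filter has actually looked at


def is_hv0003(row):
    rows_examined.append(row["PULocationID"])
    return row["hvfhs_license_num"] == "HV0003"


pipeline = LazyPipeline(trips).filter(is_hv0003).select("PULocationID", "DOLocationID")
examined_before_action = list(rows_examined)  # still empty: nothing has run yet

first = pipeline.take(1)  # the action triggers the work
```

Until `take(1)` is called, `rows_examined` stays empty. After the call, only the first two rows have been examined, because the toy pipeline stops as soon as it has the one row it was asked for.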

In [15]:
# do filtering
df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \
    .filter(df_spark.hvfhs_license_num == 'HV0003') \
    .take(5)  # or .head(5)

[Row(pickup_datetime=datetime.datetime(2021, 1, 3, 15, 59, 58), dropoff_datetime=datetime.datetime(2021, 1, 3, 16, 13, 50), PULocationID=144, DOLocationID=261),
 Row(pickup_datetime=datetime.datetime(2021, 1, 1, 14, 39, 29), dropoff_datetime=datetime.datetime(2021, 1, 1, 14, 59, 45), PULocationID=148, DOLocationID=68),
 Row(pickup_datetime=datetime.datetime(2021, 1, 1, 7, 25, 16), dropoff_datetime=datetime.datetime(2021, 1, 1, 7, 50, 46), PULocationID=61, DOLocationID=76),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 1, 5, 28), dropoff_datetime=datetime.datetime(2021, 1, 2, 1, 11, 40), PULocationID=42, DOLocationID=42),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 13, 1, 44), dropoff_datetime=datetime.datetime(2021, 1, 2, 13, 25, 23), PULocationID=155, DOLocationID=177)]

In [16]:
# do group by's (kept commented out; a sketch that filters first, then counts trips per pickup zone)
# df_spark \
#     .filter(df_spark.hvfhs_license_num == 'HV0003') \
#     .groupBy('PULocationID') \
#     .count() \
#     .show(5)

Why bother with the above when `SELECT * FROM df WHERE hvfhs_license_num = 'HV0003'` in SQL works?

Spark is more flexible, and gives us **user-defined functions**

But before we get into that, we can look at **Spark-provided functions**

In [17]:
# collection of functions Spark already has
from pyspark.sql import functions as F

In [18]:
# # type in "F." and hit TAB to see the list of functions
# F.

In [20]:
# # take a datetime and keep only the date
# F.to_date()

# add a new column to the dataframe
df_spark \
    .withColumn('pickup_date', F.to_date(df_spark.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df_spark.dropoff_datetime)) \
    .show()

+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+-----------+------------+
|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|pickup_date|dropoff_date|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+-----------+------------+
|           HV0005|              B02510|2021-01-02 11:31:29|2021-01-02 11:37:35|          28|         130|   null| 2021-01-02|  2021-01-02|
|           HV0003|              B02877|2021-01-03 15:59:58|2021-01-03 16:13:50|         144|         261|   null| 2021-01-03|  2021-01-03|
|           HV0005|              B02510|2021-01-02 20:41:20|2021-01-02 20:58:35|         138|         232|   null| 2021-01-02|  2021-01-02|
|           HV0005|              B02510|2021-01-02 12:32:53|2021-01-02 12:37:51|          42|         116|   null| 2021-01-02|  2021-01-02|
|           HV0003| 

In [22]:
# # take a datetime and keep only the date
# F.to_date()

# add a new column to the dataframe
df_spark \
    .withColumn('pickup_date', F.to_date(df_spark.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df_spark.dropoff_datetime)) \
    .select('pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
    .show()

+-----------+------------+------------+------------+
|pickup_date|dropoff_date|PULocationID|DOLocationID|
+-----------+------------+------------+------------+
| 2021-01-02|  2021-01-02|          28|         130|
| 2021-01-03|  2021-01-03|         144|         261|
| 2021-01-02|  2021-01-02|         138|         232|
| 2021-01-02|  2021-01-02|          42|         116|
| 2021-01-01|  2021-01-01|         148|          68|
| 2021-01-01|  2021-01-01|          61|          76|
| 2021-01-02|  2021-01-02|          42|          42|
| 2021-01-02|  2021-01-02|         155|         177|
| 2021-01-01|  2021-01-01|          49|         177|
| 2021-01-01|  2021-01-01|          94|         174|
| 2021-01-01|  2021-01-01|          42|           4|
| 2021-01-01|  2021-01-01|         231|         265|
| 2021-01-03|  2021-01-03|         229|         141|
| 2021-01-03|  2021-01-03|         179|          14|
| 2021-01-01|  2021-01-01|          76|          91|
| 2021-01-03|  2021-01-03|           7|       

**Again, we can also define our own functions**

**This is not something we'd typically do in data warehouses, because it can be cumbersome**

**But in PySpark, we can keep all the code in one place, cover it with tests, and make sure it works before running it on your DataFrames**

In [23]:
def cant_do_in_sql(base_num):
    num = int(base_num[1:])
    if num % 7 == 0:
        return f's/{num:03x}'
    elif num % 3 == 0:
        return f'a/{num:03x}'
    else:
        return f'e/{num:03x}'

In [25]:
cant_do_in_sql('B02884')

's/b44'

**The above can live in a separate Python module, and we can test it with unit tests**
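
A minimal sketch of such a test, using plain `assert` statements. The function is repeated here so the snippet runs on its own; in a real project it would be imported from its module, and the test name `test_cant_do_in_sql` and the extra base numbers are made up for illustration:

```python
def cant_do_in_sql(base_num):
    num = int(base_num[1:])
    if num % 7 == 0:
        return f's/{num:03x}'
    elif num % 3 == 0:
        return f'a/{num:03x}'
    else:
        return f'e/{num:03x}'


def test_cant_do_in_sql():
    # the example from the notebook: 2884 is divisible by 7
    assert cant_do_in_sql('B02884') == 's/b44'
    # 21 is divisible by both 7 and 3: the 7 branch is checked first
    assert cant_do_in_sql('B00021') == 's/015'
    # divisible by 3 only; the hex value is zero-padded to three digits
    assert cant_do_in_sql('B00003') == 'a/003'
    # divisible by neither 7 nor 3
    assert cant_do_in_sql('B02510') == 'e/9ce'


test_cant_do_in_sql()
```

A test runner such as `pytest` would pick up `test_cant_do_in_sql` automatically; the direct call at the end is only there so the snippet also works when pasted into a notebook cell.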

In [27]:
from pyspark.sql import types

In [28]:
# turn our user-defined Python function into a Spark function
cant_do_in_sql_udf = F.udf(cant_do_in_sql, returnType=types.StringType())

In [31]:
# reference columns through df_spark, the DataFrame being transformed;
# the original run used a different variable, `df`, and failed with
# "AnalysisException: Resolved attribute(s) pickup_datetime#2 missing from ..."
# because columns taken from another DataFrame cannot be resolved against df_spark
df_spark \
    .withColumn('pickup_date', F.to_date(df_spark.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df_spark.dropoff_datetime)) \
    .withColumn('base_id', cant_do_in_sql_udf(df_spark.dispatching_base_num)) \
    .select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
    .show()


In [None]:


df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \
  .filter(df_spark.hvfhs_license_num == 'HV0003')

