# *Make sure you have run all of `01_pyspark_intro.ipynb` so that the partitioned data is written to the local data directory*

In [1]:
!dir .\data\fhvhv\2021\01

 Volume in drive C has no label.
 Volume Serial Number is 08A3-CF2D

 Directory of C:\Users\nimz\Documents\de_zoomcamp\week5_batch_processing\data\fhvhv\2021\01

02/27/2024  06:51 PM    <DIR>          .
02/27/2024  06:51 PM    <DIR>          ..
02/27/2024  06:51 PM            71,484 .part-00000-41925c18-80ed-4acf-be10-12b01f6fd8d4-c000.snappy.parquet.crc
02/27/2024  06:51 PM            71,468 .part-00001-41925c18-80ed-4acf-be10-12b01f6fd8d4-c000.snappy.parquet.crc
02/27/2024  06:51 PM            71,476 .part-00002-41925c18-80ed-4acf-be10-12b01f6fd8d4-c000.snappy.parquet.crc
02/27/2024  06:51 PM            71,480 .part-00003-41925c18-80ed-4acf-be10-12b01f6fd8d4-c000.snappy.parquet.crc
02/27/2024  06:51 PM            71,480 .part-00004-41925c18-80ed-4acf-be10-12b01f6fd8d4-c000.snappy.parquet.crc
02/27/2024  06:51 PM            71,484 .part-00005-41925c18-80ed-4acf-be10-12b01f6fd8d4-c000.snappy.parquet.crc
02/27/2024  06:51 PM            71,444 .part-00006-41925c18-80ed-4acf-be10-12b01f6f

In [2]:
import pandas as pd
import pyspark
from pyspark.sql import SparkSession

# Instantiate a Spark session, an object that we use to interact with Spark
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

#### You should now see a Spark UI available at http://localhost:4040/jobs/

In [3]:
# Read the partitioned files into a Spark dataframe
df_spark = spark.read.parquet('./data/fhvhv/2021/01/')

#### Note that Parquet files are smaller than plain-text formats such as CSV because they store the schema alongside the data and use column-oriented encoding and compression

In [4]:
# look at the DataFrame schema
df_spark.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: string (nullable = true)



# What can we do with Spark DataFrames?

**We can do the usual things that we do with pandas**

In [5]:
# Select only specific columns
df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')

DataFrame[pickup_datetime: timestamp, dropoff_datetime: timestamp, PULocationID: int, DOLocationID: int]

In [6]:
# Do filtering
df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \
    .filter(df_spark.hvfhs_license_num == 'HV0003') \
    .show()

+-------------------+-------------------+------------+------------+
|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|
+-------------------+-------------------+------------+------------+
|2021-01-03 15:59:58|2021-01-03 16:13:50|         144|         261|
|2021-01-01 14:39:29|2021-01-01 14:59:45|         148|          68|
|2021-01-01 07:25:16|2021-01-01 07:50:46|          61|          76|
|2021-01-02 01:05:28|2021-01-02 01:11:40|          42|          42|
|2021-01-02 13:01:44|2021-01-02 13:25:23|         155|         177|
|2021-01-01 05:51:46|2021-01-01 06:03:24|          49|         177|
|2021-01-01 02:12:08|2021-01-01 02:19:49|          94|         174|
|2021-01-01 02:17:17|2021-01-01 02:34:03|          42|           4|
|2021-01-01 01:05:04|2021-01-01 01:17:42|         231|         265|
|2021-01-03 01:05:38|2021-01-03 01:09:14|         229|         141|
|2021-01-03 00:37:31|2021-01-03 01:01:18|         179|          14|
|2021-01-01 17:23:04|2021-01-01 17:44:37|       

#### You should see a new job in the Spark UI for the command above, but NOT for the earlier `.select()` command

**Note that `.repartition()`, `.filter()`, and `.select()` are *lazy* commands in Spark: on their own they do not run as Spark jobs. A job starts only when we call something that needs a result, such as `.show()`.**

This is because Spark splits operations into two kinds: some are executed right away, and some are only recorded until a result is needed


# Actions vs. Transformations
- **Actions** are code that is executed *immediately* (eager)
    - Examples include `.show()`, `.take()`, `.head()`, `.collect()`, `.write.csv()`, `.write.parquet()`, etc.
- **Transformations** are code that is *lazy* (i.e., *NOT* executed immediately)
    - Examples include selecting columns, filtering rows, joins, GROUP BY operations, repartitioning, etc.
    - In these cases, Spark creates a sequence of transformations that is executed *only* when we call some method like `.show()`, which is an example of an Action

***To summarize, Spark creates a sequence of Transformations that aren't executed until an Action is executed***

**See more at**:
- https://spark.apache.org/docs/latest/rdd-programming-guide.html
- https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
- https://data-flair.training/blogs/apache-spark-lazy-evaluation/
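
The Transformations-then-Action idea can be illustrated with a toy model in plain Python. This is only an analogy and not how Spark is implemented: the `LazyRows` class and its methods are hypothetical names made up for this sketch, and the sample rows are modeled on the outputs shown earlier. Transformations only record a step in a plan, and the action `take()` is what finally runs it.

```python
class LazyRows:
    """Toy stand-in for a lazily evaluated table (NOT Spark's implementation)."""

    def __init__(self, rows, plan=()):
        self._rows = rows  # source data, untouched until an action runs
        self._plan = plan  # recorded transformations, in call order

    # Transformations: only extend the plan and return a new object
    def filter(self, predicate):
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyRows(self._rows, self._plan + (("select", columns),))

    # Action: runs the recorded plan and returns real results
    def take(self, n):
        out = []
        for row in self._rows:
            if len(out) >= n:
                break
            keep = True
            for kind, arg in self._plan:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "select"
                    row = {c: row[c] for c in arg}
            if keep:
                out.append(row)
        return out


trips = [
    {"hvfhs_license_num": "HV0003", "PULocationID": 144, "DOLocationID": 261},
    {"hvfhs_license_num": "HV0005", "PULocationID": 28, "DOLocationID": 130},
    {"hvfhs_license_num": "HV0003", "PULocationID": 148, "DOLocationID": 68},
]

checked = []  # records every row the predicate actually looks at


def is_hv0003(row):
    checked.append(row["PULocationID"])
    return row["hvfhs_license_num"] == "HV0003"


# Transformations only: the predicate has not been called yet
pending = LazyRows(trips).filter(is_hv0003).select("PULocationID", "DOLocationID")
calls_before_action = len(checked)

# The action triggers execution of the whole recorded plan
result = pending.take(5)
calls_after_action = len(checked)

print(calls_before_action, calls_after_action)  # 0 3
print(result)
# [{'PULocationID': 144, 'DOLocationID': 261}, {'PULocationID': 148, 'DOLocationID': 68}]
```

In this toy the filter is recorded before the select because the select drops the `hvfhs_license_num` key; Spark resolves column references in its own way, so the ordering constraint here is a property of the sketch only.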

In [7]:
# Do filtering and execute a job (i.e. a filter transformation) by performing a Spark Action
df_spark.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \
    .filter(df_spark.hvfhs_license_num == 'HV0003') \
    .take(5)  # or .head(5)

[Row(pickup_datetime=datetime.datetime(2021, 1, 3, 15, 59, 58), dropoff_datetime=datetime.datetime(2021, 1, 3, 16, 13, 50), PULocationID=144, DOLocationID=261),
 Row(pickup_datetime=datetime.datetime(2021, 1, 1, 14, 39, 29), dropoff_datetime=datetime.datetime(2021, 1, 1, 14, 59, 45), PULocationID=148, DOLocationID=68),
 Row(pickup_datetime=datetime.datetime(2021, 1, 1, 7, 25, 16), dropoff_datetime=datetime.datetime(2021, 1, 1, 7, 50, 46), PULocationID=61, DOLocationID=76),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 1, 5, 28), dropoff_datetime=datetime.datetime(2021, 1, 2, 1, 11, 40), PULocationID=42, DOLocationID=42),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 13, 1, 44), dropoff_datetime=datetime.datetime(2021, 1, 2, 13, 25, 23), PULocationID=155, DOLocationID=177)]

In [10]:
# Do a group-by count and trigger it with the collect() action
df_spark.groupBy(df_spark.hvfhs_license_num) \
    .count() \
    .collect()

[Row(hvfhs_license_num='HV0004', count=110015),
 Row(hvfhs_license_num='HV0005', count=3094325),
 Row(hvfhs_license_num='HV0003', count=8704128)]

# Spark Functions

#### Why bother with the above filter statement in Spark when `SELECT * FROM df WHERE hvfhs_license_num = 'HV0003'` in SQL works just fine?

We bother because **Spark is more flexible** and gives us the ability to create **user-defined functions (UDFs)**

But before we get into UDFs, let's first look at **Spark-provided functions**

In [11]:
# Import a collection of functions that Spark already has
from pyspark.sql import functions as F

In [15]:
# # type in `F.` and hit TAB to see the list of functions
# F.

# Or do this
dir(F)

['Any',
 'ArrayType',
 'Callable',
 'Column',
 'DataFrame',
 'DataType',
 'Dict',
 'Iterable',
 'List',
 'Optional',
 'PandasUDFType',
 'PythonEvalType',
 'SparkContext',
 'StringType',
 'StructType',
 'TYPE_CHECKING',
 'Tuple',
 'Union',
 'UserDefinedFunction',
 'ValuesView',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_create_column_from_literal',
 '_create_lambda',
 '_create_udf',
 '_get_jvm_function',
 '_get_lambda_parameters',
 '_invoke_binary_math_function',
 '_invoke_function',
 '_invoke_function_over_columns',
 '_invoke_function_over_seq_of_columns',
 '_invoke_higher_order_function',
 '_options_to_str',
 '_test',
 '_to_java_column',
 '_to_seq',
 '_unresolved_named_lambda_variable',
 'abs',
 'acos',
 'acosh',
 'add_months',
 'aggregate',
 'approxCountDistinct',
 'approx_count_distinct',
 'array',
 'array_contains',
 'array_distinct',
 'array_except',
 'array_intersect',
 'array_join',
 'array_max',
 'array_m

In [19]:
# # take a datetime and keep only the date
# F.to_date()

# Add a new column to the DataFrame wherein we take a datetime column and keep ONLY the DATE
df_spark \
    .withColumn('pickup_date', F.to_date(df_spark.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df_spark.dropoff_datetime)) \
    .show(5)

+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+-----------+------------+
|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|pickup_date|dropoff_date|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+-----------+------------+
|           HV0005|              B02510|2021-01-02 11:31:29|2021-01-02 11:37:35|          28|         130|   null| 2021-01-02|  2021-01-02|
|           HV0003|              B02877|2021-01-03 15:59:58|2021-01-03 16:13:50|         144|         261|   null| 2021-01-03|  2021-01-03|
|           HV0005|              B02510|2021-01-02 20:41:20|2021-01-02 20:58:35|         138|         232|   null| 2021-01-02|  2021-01-02|
|           HV0005|              B02510|2021-01-02 12:32:53|2021-01-02 12:37:51|          42|         116|   null| 2021-01-02|  2021-01-02|
|           HV0003| 

In [20]:
# Do the same as above but select only a few columns
df_spark \
    .withColumn('pickup_date', F.to_date(df_spark.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df_spark.dropoff_datetime)) \
    .select('pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
    .show(5)

+-----------+------------+------------+------------+
|pickup_date|dropoff_date|PULocationID|DOLocationID|
+-----------+------------+------------+------------+
| 2021-01-02|  2021-01-02|          28|         130|
| 2021-01-03|  2021-01-03|         144|         261|
| 2021-01-02|  2021-01-02|         138|         232|
| 2021-01-02|  2021-01-02|          42|         116|
| 2021-01-01|  2021-01-01|         148|          68|
+-----------+------------+------------+------------+
only showing top 5 rows



# User-Defined Functions

**Again, we can also define our *own* functions**

**This is not something we'd typically do in a data warehouse because it can become cumbersome**

**But in PySpark, we can store all the code easily in a repository, cover it with *tests*, and really make sure the code works before executing it on DataFrames**

In [22]:
# Define a plain Python function (it is turned into a Spark UDF further down)
def cant_do_in_sql(base_num):
    
    num = int(base_num[1:])
    
    if num % 7 == 0:
        return f's/{num:03x}'
    elif num % 3 == 0:
        return f'a/{num:03x}'
    else:
        return f'e/{num:03x}'

In [23]:
# Call the function directly in Python to check that it works
cant_do_in_sql('B02884')

's/b44'

#### The function above can live in a separate Python module, and we can test it with *unit tests*
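
As a rough sketch of what such unit tests might look like, the block below repeats the function and adds a few pytest-style test functions. The test names are hypothetical, and in a real project the function would be imported from its own module rather than redefined. The expected values for `B02884`, `B02510`, and `B02877` come from the outputs in this notebook; the remaining cases are small inputs worked out by hand.

```python
def cant_do_in_sql(base_num):
    # Same logic as the notebook cell above, repeated so the block is self-contained
    num = int(base_num[1:])

    if num % 7 == 0:
        return f's/{num:03x}'
    elif num % 3 == 0:
        return f'a/{num:03x}'
    else:
        return f'e/{num:03x}'


def test_multiple_of_seven_gets_s_prefix():
    assert cant_do_in_sql('B02884') == 's/b44'
    assert cant_do_in_sql('B02877') == 's/b3d'


def test_multiple_of_three_gets_a_prefix():
    # 3 is formatted as zero-padded, three-digit hex
    assert cant_do_in_sql('B00003') == 'a/003'


def test_other_numbers_get_e_prefix():
    assert cant_do_in_sql('B02510') == 'e/9ce'


def test_seven_takes_precedence_over_three():
    # 21 is divisible by both 7 and 3; the % 7 branch is checked first
    assert cant_do_in_sql('B00021') == 's/015'
```

Saved in a file that pytest discovers, these run with a plain `pytest` command and need no Spark session, which is the main benefit of keeping the logic in an ordinary Python function.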

In [24]:
from pyspark.sql import types

# Turn our user-defined Python function into a *Spark* function using Spark's Functions library (F)
cant_do_in_sql_udf = F.udf(cant_do_in_sql, returnType=types.StringType())

In [25]:
# Use the Spark UDF
df_spark \
    .withColumn('pickup_date', F.to_date(df_spark.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df_spark.dropoff_datetime)) \
    .withColumn('base_id', cant_do_in_sql_udf(df_spark.dispatching_base_num)) \
    .select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
    .show(5)

+-------+-----------+------------+------------+------------+
|base_id|pickup_date|dropoff_date|PULocationID|DOLocationID|
+-------+-----------+------------+------------+------------+
|  e/9ce| 2021-01-02|  2021-01-02|          28|         130|
|  s/b3d| 2021-01-03|  2021-01-03|         144|         261|
|  e/9ce| 2021-01-02|  2021-01-02|         138|         232|
|  e/9ce| 2021-01-02|  2021-01-02|          42|         116|
|  e/b35| 2021-01-01|  2021-01-01|         148|          68|
+-------+-----------+------------+------------+------------+
only showing top 5 rows
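
One thing to watch for: the schema lists `dispatching_base_num` as nullable, and Spark hands null values to a Python UDF as `None`, in which case `base_num[1:]` raises a `TypeError`. The sketch below shows one possible defensive wrapper. The name `cant_do_in_sql_safe` and the choice to return `None` for unusable input are assumptions made for illustration, not part of the original notebook.

```python
from typing import Optional


def cant_do_in_sql(base_num):
    # Same logic as the notebook's function, repeated so the block is self-contained
    num = int(base_num[1:])

    if num % 7 == 0:
        return f's/{num:03x}'
    elif num % 3 == 0:
        return f'a/{num:03x}'
    else:
        return f'e/{num:03x}'


def cant_do_in_sql_safe(base_num: Optional[str]) -> Optional[str]:
    """Hypothetical null-tolerant wrapper: returns None instead of raising."""
    if base_num is None:
        # A null column value arrives in the Python function as None
        return None
    try:
        return cant_do_in_sql(base_num)
    except ValueError:
        # The part after the first character is not a valid integer
        return None


print(cant_do_in_sql_safe('B02884'))  # s/b44
print(cant_do_in_sql_safe(None))      # None
print(cant_do_in_sql_safe('Bxyz'))    # None
```

The wrapper can be turned into a Spark function in the same way as before, with `F.udf(cant_do_in_sql_safe, returnType=types.StringType())`, so that rows with a missing base number produce a null `base_id` instead of failing the job.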

