In [1]:
import pyspark
from pyspark.sql import SparkSession

**Step 1**: **SparkSession** is the entry point to all functionality in Spark.

We usually create a SparkSession as in the following example:

In [2]:
# instantiate a Spark session, an object that we use to interact with Spark
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

In the above code:
- `SparkSession` is the class of the object that we instantiate
- `builder` is a class attribute, not a method
    - It's an instance of `pyspark.sql.session.SparkSession.Builder` that is used to construct `SparkSession` instances
- `master()` sets the Spark master URL to connect to
    - The `local[*]` string means Spark will run locally, on this machine, rather than on a cluster
        - `[*]` tells Spark to use all available CPU cores (to use only 2 cores, we would write `local[2]`)
- `appName()` sets the name of our application, which is shown in the Spark UI (by default at http://localhost:4040/)
- `getOrCreate()` returns the existing `SparkSession` if one was previously created; otherwise it creates a new one
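
The master string can also be built programmatically when the number of cores should be configurable. The sketch below uses a hypothetical helper, `local_master` (it is not part of PySpark), that returns either `local[*]` or `local[N]`:

```python
def local_master(n_cores=None):
    """Return a Spark master URL for local mode.

    `local_master` is a hypothetical helper for illustration, not a PySpark API.
    """
    if n_cores is None:
        # "*" asks Spark to use all available cores
        return "local[*]"
    if n_cores < 1:
        raise ValueError("n_cores must be a positive integer")
    return f"local[{n_cores}]"


print(local_master())   # local[*]
print(local_master(2))  # local[2]
```

The returned string would be passed to the builder, e.g. `SparkSession.builder.master(local_master(2))`.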

In [3]:
# read the zone data
df_zone = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

In [4]:
# see the dataframe
df_zone.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly

In [5]:
# read fhvhv_tripdata_2021-01.csv.gz and let Spark infer the column types
df_fhv = spark.read \
    .option("header", "true") \
    .option("inferSchema", True) \
    .csv('fhvhv_tripdata_2021-01.csv.gz')

In [6]:
# see the FHV data
df_fhv.show(5)

+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|           HV0003|              B02682|2021-01-01 00:33:44|2021-01-01 00:49:07|         230|         166|   null|
|           HV0003|              B02682|2021-01-01 00:55:19|2021-01-01 01:18:21|         152|         167|   null|
|           HV0003|              B02764|2021-01-01 00:23:56|2021-01-01 00:38:05|         233|         142|   null|
|           HV0003|              B02764|2021-01-01 00:42:51|2021-01-01 00:45:50|         142|         143|   null|
|           HV0003|              B02764|2021-01-01 00:48:14|2021-01-01 01:08:42|         143|          78|   null|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
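
`inferSchema` makes Spark take an extra pass over the file to guess the column types. One alternative is to pass an explicit schema as a DDL-formatted string. The following is a minimal sketch, not a definitive schema: the column names come from the output above, the types are assumptions based on the values shown (the type of `SR_Flag` is a guess, since only nulls are visible), and `read_fhv_csv` is a hypothetical helper.

```python
# Column names come from the DataFrame output; the types are assumptions.
FHV_COLUMNS = [
    ("hvfhs_license_num", "STRING"),
    ("dispatching_base_num", "STRING"),
    ("pickup_datetime", "TIMESTAMP"),
    ("dropoff_datetime", "TIMESTAMP"),
    ("PULocationID", "INT"),
    ("DOLocationID", "INT"),
    ("SR_Flag", "STRING"),  # assumption: only nulls are visible in the sample
]

# DDL-formatted schema string, e.g. "col_a STRING, col_b INT"
FHV_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in FHV_COLUMNS)


def read_fhv_csv(spark, path):
    """Read the FHV CSV with an explicit schema instead of inferSchema.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    return (
        spark.read
        .option("header", "true")
        .schema(FHV_SCHEMA_DDL)  # DataFrameReader.schema accepts a DDL string
        .csv(path)
    )
```

With the session created earlier, this would be called as `df_fhv = read_fhv_csv(spark, 'fhvhv_tripdata_2021-01.csv.gz')`.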

In [7]:
# write the zone data as Parquet; Spark creates an output directory named 'zones'
df_zone.write.parquet('zones')

In [8]:
# write the FHV data as Parquet; Spark creates an output directory named 'fhv'
df_fhv.write.parquet('fhv')
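
By default, `write.parquet` raises an error if the output path already exists, so re-running the two write cells fails. Here is a minimal sketch of one way around this: write in overwrite mode, then read the result back as a sanity check. `write_parquet_checked` is a hypothetical helper, and overwriting existing output is an assumption about what is wanted here.

```python
def write_parquet_checked(spark, df, path):
    """Write df to Parquet, replacing any existing output, then verify the row count.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    # "overwrite" replaces the output directory if it already exists
    df.write.mode("overwrite").parquet(path)

    # read the Parquet directory back and compare row counts
    df_back = spark.read.parquet(path)
    return df_back.count() == df.count()
```

For example, `write_parquet_checked(spark, df_zone, 'zones')` would return `True` when the written data has the same number of rows as `df_zone`.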