In [1]:
import pyspark
from pyspark.sql import SparkSession

**Step 1**: **SparkSession** = the entry point into all functionality in Spark.

We usually create a SparkSession as in the following example:

In [2]:
# instantiate a Spark session, an object that we use to interact with Spark
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

In the above code:
- `SparkSession` is the class of the object that we instantiate
- `builder` is the starting point of the builder pattern (an attribute, not a method, which is why it has no parentheses)
    - It's a class attribute holding an instance of `pyspark.sql.session.SparkSession.Builder`, which is used to construct `SparkSession` instances
- `master()` sets the Spark master URL to connect to
    - The `local[*]` string means Spark runs locally, on this machine, instead of connecting to a remote cluster
        - `[*]` means Spark uses as many worker threads as the machine has logical cores
            - e.g., to use only 2 cores, we would write `local[2]`
- `appName()` defines the name of our application/session, which will show in the Spark UI at http://localhost:4040/
- `getOrCreate()` will create the session or recover the object if it was previously created
    - i.e., returns an existing `SparkSession`, if available, or creates a new one

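Similarly to pandas, Spark can read CSV files into DataFrames, a tabular data structure. Unlike pandas, Spark can handle much bigger datasets, but by default it does not infer the data type of each column: unless the `inferSchema` option is enabled or a schema is supplied, every column is read as a string.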

In [3]:
# read the zone data
df_zone = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

In [4]:
# see the dataframe
df_zone.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly

In [5]:
# read fhvhv_tripdata_2021-01.csv.gz and let Spark infer the column types
df_fhv = spark.read \
    .option("header", "true") \
    .option("inferSchema", True) \
    .csv('fhvhv_tripdata_2021-01.csv.gz')

In [6]:
# see the FHV data
df_fhv.show(5)

+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|           HV0003|              B02682|2021-01-01 00:33:44|2021-01-01 00:49:07|         230|         166|   null|
|           HV0003|              B02682|2021-01-01 00:55:19|2021-01-01 01:18:21|         152|         167|   null|
|           HV0003|              B02764|2021-01-01 00:23:56|2021-01-01 00:38:05|         233|         142|   null|
|           HV0003|              B02764|2021-01-01 00:42:51|2021-01-01 00:45:50|         142|         143|   null|
|           HV0003|              B02764|2021-01-01 00:48:14|2021-01-01 01:08:42|         143|          78|   null|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+

In [7]:
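# write the zones DataFrame as Parquet into a 'zones' folder
# (with the default save mode this raises an error if the folder already exists)
df_zone.write.parquet('zones')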

In [8]:
# write the FHV DataFrame as Parquet into an 'fhv' folder
df_fhv.write.parquet('fhv')

In [11]:
!dir zones

 Volume in drive C has no label.
 Volume Serial Number is 08A3-CF2D

 Directory of C:\Users\nimz\Documents\de_zoomcamp\week5_batch_processing\zones

05/09/2023  07:46 PM    <DIR>          .
05/09/2023  07:46 PM    <DIR>          ..
05/09/2023  07:46 PM                56 .part-00000-fea82282-46ba-4d79-a5e2-d4a128854b68-c000.snappy.parquet.crc
05/09/2023  07:46 PM                 8 ._SUCCESS.crc
05/09/2023  07:46 PM             5,916 part-00000-fea82282-46ba-4d79-a5e2-d4a128854b68-c000.snappy.parquet
05/09/2023  07:46 PM                 0 _SUCCESS
               4 File(s)          5,980 bytes
               2 Dir(s)  370,420,756,480 bytes free


In [12]:
!dir fhv

 Volume in drive C has no label.
 Volume Serial Number is 08A3-CF2D

 Directory of C:\Users\nimz\Documents\de_zoomcamp\week5_batch_processing\fhv

05/09/2023  07:47 PM    <DIR>          .
05/09/2023  07:47 PM    <DIR>          ..
05/09/2023  07:47 PM         1,288,716 .part-00000-b140ce80-a715-4d83-89e7-09279e1b1d2c-c000.snappy.parquet.crc
05/09/2023  07:47 PM                 8 ._SUCCESS.crc
05/09/2023  07:47 PM       164,954,455 part-00000-b140ce80-a715-4d83-89e7-09279e1b1d2c-c000.snappy.parquet
05/09/2023  07:47 PM                 0 _SUCCESS
               4 File(s)    166,243,179 bytes
               2 Dir(s)  370,420,690,944 bytes free


In [16]:
df_fhv.head(5)

[Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 33, 44), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 49, 7), PULocationID=230, DOLocationID=166, SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 55, 19), dropoff_datetime=datetime.datetime(2021, 1, 1, 1, 18, 21), PULocationID=152, DOLocationID=167, SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 23, 56), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 38, 5), PULocationID=233, DOLocationID=142, SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 42, 51), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 45, 50), PULocationID=142, DOLocationID=143, SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_dat

Note that Spark may fail to infer the data types of some columns; for example, our `datetime` columns here might have been read as strings. We can check the schema to find out:

In [17]:
df_fhv.schema

StructType([StructField('hvfhs_license_num', StringType(), True), StructField('dispatching_base_num', StringType(), True), StructField('pickup_datetime', TimestampType(), True), StructField('dropoff_datetime', TimestampType(), True), StructField('PULocationID', IntegerType(), True), StructField('DOLocationID', IntegerType(), True), StructField('SR_Flag', IntegerType(), True)])

But it worked this time: both datetime columns were inferred as `TimestampType` and the ID columns as `IntegerType`.

Check the zones data too. It was read without `inferSchema`, so every column, including `LocationID`, is a string:

In [19]:
df_zone.schema

StructType([StructField('LocationID', StringType(), True), StructField('Borough', StringType(), True), StructField('Zone', StringType(), True), StructField('service_zone', StringType(), True)])

In [20]:
df_zone.head(5)

[Row(LocationID='1', Borough='EWR', Zone='Newark Airport', service_zone='EWR'),
 Row(LocationID='2', Borough='Queens', Zone='Jamaica Bay', service_zone='Boro Zone'),
 Row(LocationID='3', Borough='Bronx', Zone='Allerton/Pelham Gardens', service_zone='Boro Zone'),
 Row(LocationID='4', Borough='Manhattan', Zone='Alphabet City', service_zone='Yellow Zone'),
 Row(LocationID='5', Borough='Staten Island', Zone='Arden Heights', service_zone='Boro Zone')]