# Data Engineering ZoomCamp
## Week 5: Batch Processing
Maria Fisher 

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the FHV 2019-10 data: [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz)


In [28]:
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import types

### Question 1: 

**Install Spark and PySpark** 

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?

In [52]:
pyspark.__version__

'3.3.2'

In [None]:
pyspark.__file__

In [None]:
#!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz
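
The commented-out download above fetches a gzip-compressed file, while the later cells read the uncompressed `fhv_tripdata_2019-10.csv`. A minimal sketch of the decompression step, using only the standard library, could look like this; the helper name `gunzip_file` is hypothetical and not part of the original notebook.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src: str, dst: str) -> Path:
    """Decompress the gzip file at `src` into `dst` and return the output path."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Stream in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)


# Example usage with the file names from this notebook:
# gunzip_file("fhv_tripdata_2019-10.csv.gz", "fhv_tripdata_2019-10.csv")
```

Spark's CSV reader can generally read the `.csv.gz` file directly as well, so this step is only needed to match the file name used in the cells below.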

In [2]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

24/03/01 15:51:12 WARN Utils: Your hostname, river resolves to a loopback address: 127.0.1.1; using 192.168.1.252 instead (on interface wlp1s0)
24/03/01 15:51:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/03/01 15:51:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/01 15:51:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/03/01 15:51:14 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [3]:
!ls -lh fhv_tripdata_2019-10.csv

-rw-r--r-- 1 malu malu 115M Nov 21  2022 fhv_tripdata_2019-10.csv


In [4]:
!wc -l fhv_tripdata_2019-10.csv

1897494 fhv_tripdata_2019-10.csv


In [6]:
df = spark.read \
    .option("header", "true") \
    .csv('fhv_tripdata_2019-10.csv')

In [7]:
df.schema

StructType([StructField('dispatching_base_num', StringType(), True), StructField('pickup_datetime', StringType(), True), StructField('dropOff_datetime', StringType(), True), StructField('PUlocationID', StringType(), True), StructField('DOlocationID', StringType(), True), StructField('SR_Flag', StringType(), True), StructField('Affiliated_base_number', StringType(), True)])

In [27]:
df.dtypes

[('dispatching_base_num', 'string'),
 ('pickup_datetime', 'timestamp'),
 ('dropOff_datetime', 'timestamp'),
 ('PULocationID', 'int'),
 ('DOLocationID', 'int'),
 ('SR_Flag', 'string'),
 ('Affiliated_base_number', 'string')]

In [9]:
from pyspark.sql import types

In [10]:
schema = types.StructType([
    types.StructField('dispatching_base_num', types.StringType(), True),
    types.StructField('pickup_datetime', types.TimestampType(), True),
    types.StructField('dropOff_datetime', types.TimestampType(), True),
    types.StructField('PULocationID', types.IntegerType(), True),
    types.StructField('DOLocationID', types.IntegerType(), True),
    types.StructField('SR_Flag', types.StringType(), True),
    types.StructField('Affiliated_base_number', types.StringType(), True)
])

In [12]:
df.head()

Row(dispatching_base_num='B00009', pickup_datetime='2019-10-01 00:23:00', dropOff_datetime='2019-10-01 00:35:00', PUlocationID='264', DOlocationID='264', SR_Flag=None, Affiliated_base_number='B00009')

### Question 2: 

**FHV October 2019**

Read the October 2019 FHV data into a Spark DataFrame with a schema, as we did in the lessons.

Repartition the DataFrame to 6 partitions and save it to Parquet.

What is the average size (in MB) of the Parquet files (those ending with the .parquet extension) that were created? Select the answer which most closely matches.

- 1MB
- 6MB x
- 25MB
- 87MB


In [16]:
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv('fhv_tripdata_2019-10.csv')


In [17]:
df = df.repartition(6)

In [18]:
df.write.parquet('fhv/2019/10/')

In [19]:
df = spark.read.parquet('fhv/2019/10/')

In [20]:
df.printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropOff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: string (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)
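
The cells above write the Parquet files but do not show how their average size was measured. One way to check it, sketched here with the standard library only, is to average the sizes of the `*.parquet` files in the output directory. The function name `average_parquet_size_mb` is an assumption made for illustration.

```python
from pathlib import Path


def average_parquet_size_mb(directory: str) -> float:
    """Return the mean size in MB of the *.parquet files directly inside `directory`."""
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory!r}")
    # 1 MB is taken as 1024 * 1024 bytes here, matching `ls -lh`.
    return sum(sizes) / len(sizes) / (1024 * 1024)


# Example usage against the output directory written above:
# average_parquet_size_mb("fhv/2019/10/")
```

The glob pattern skips Spark's `_SUCCESS` marker and `.crc` checksum files, so only the data files contribute to the average.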



### Question 3: 

**Count records** 

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 108,164
- 12,856
- 452,470
- 62,610 x

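> [!IMPORTANT]
> Be aware of the column order when defining the schema: a schema passed to the CSV reader is applied by position, so it must list the columns in the same order as the file.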

In [None]:
from pyspark.sql import functions as F

In [22]:
df.show()

+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|dispatching_base_num|    pickup_datetime|   dropOff_datetime|PULocationID|DOLocationID|SR_Flag|Affiliated_base_number|
+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|              B00445|2019-10-07 03:54:41|2019-10-07 04:02:41|         252|         138|   null|                B00445|
|              B02613|2019-10-05 00:47:05|2019-10-05 01:11:53|         264|          39|   null|                B02613|
|              B01437|2019-10-02 20:29:31|2019-10-02 20:52:31|         264|         173|   null|                B01437|
|              B02292|2019-10-02 16:40:52|2019-10-02 16:53:57|         264|          17|   null|                B02292|
|              B01338|2019-10-03 17:21:08|2019-10-03 17:26:58|         264|          32|   null|                B01338|
|              B02715|2019-10-08 13:10:2

In [23]:
df \
    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
    .filter("pickup_date = '2019-10-15'") \
    .count()

62610

### Question 4: 

**Longest trip for each day** 

What is the length of the longest trip in the dataset in hours?

- 631,152.50 Hours x
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours


In [29]:
df \
    .withColumn('duration_seconds', F.unix_timestamp(df.dropOff_datetime) - F.unix_timestamp(df.pickup_datetime)) \
    .withColumn('duration_hours', F.col('duration_seconds') / 3600) \
    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
    .groupBy('pickup_date') \
    .max('duration_hours') \
    .orderBy('max(duration_hours)', ascending=False) \
    .limit(5) \
    .show()




+-----------+-------------------+
|pickup_date|max(duration_hours)|
+-----------+-------------------+
| 2019-10-28|           631152.5|
| 2019-10-11|           631152.5|
| 2019-10-31|  87672.44083333333|
| 2019-10-01|  70128.02805555555|
| 2019-10-17|             8794.0|
+-----------+-------------------+
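
As a sanity check on the numbers above, the same duration arithmetic can be reproduced in plain Python. This is only a sketch: it ignores time zones, the helper `duration_hours` is hypothetical, and the first call uses the first row from the `df.show()` output earlier in the notebook. Converting the reported maximum to years gives roughly 72, which suggests the drop-off timestamps in those rows are probably data-entry errors rather than real trips.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def duration_hours(pickup: str, dropoff: str) -> float:
    """Trip length in hours: difference in seconds divided by 3600."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return delta.total_seconds() / 3600


# First row of the df.show() output: an 8-minute trip, about 0.13 hours.
print(duration_hours("2019-10-07 03:54:41", "2019-10-07 04:02:41"))

# The reported maximum, converted to years (365.25-day years as an approximation).
max_hours = 631152.5
print(max_hours / 24 / 365.25)
```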



### Question 5: 

**User Interface**

Spark’s user interface, which shows the application's dashboard, runs on which local port by default?

- 80
- 443
- 4040 x
- 8080

### Question 6: 

**Least frequent pickup location zone**

Load the zone lookup data into a temp view in Spark:
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)

Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?

- East Chelsea
- Jamaica Bay x
- Union Sq
- Crown Heights North

In [44]:
df_zones = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

df_zones.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly

In [47]:
# Join the FHV trips with the zone lookup data on the pickup location ID
joined_data = df.join(df_zones, df['PULocationID'] == df_zones['LocationID'])


In [48]:
joined_data

DataFrame[dispatching_base_num: string, pickup_datetime: timestamp, dropOff_datetime: timestamp, PULocationID: int, DOLocationID: int, SR_Flag: string, Affiliated_base_number: string, LocationID: string, Borough: string, Zone: string, service_zone: string]

In [49]:
# Group by pickup zone and count the trips
zone_counts = joined_data.groupBy('Zone').count()

In [50]:
# Show the 5 least frequent pickup locations
zone_counts.orderBy('count').limit(5).show()



+--------------------+-----+
|                Zone|count|
+--------------------+-----+
|         Jamaica Bay|    1|
|Governor's Island...|    2|
| Green-Wood Cemetery|    5|
|       Broad Channel|    8|
|     Highbridge Park|   14|
+--------------------+-----+



## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw5
- Deadline: See the website