## Week 5 Homework 

In this homework we'll put what we learned about Spark into practice.

This homework uses the FHV 2019-10 data: [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz)


In [7]:
%mkdir -p data
FILE = "fhv_tripdata_2019-10.csv.gz"
!wget -c https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/$FILE -O data/$FILE > /dev/null 2>&1
data_path = f"data/{FILE}"

In [58]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import types
from pyspark.sql import functions as F

In [59]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

### Question 1: 

**Install Spark and PySpark** 

- Install Spark
- Run PySpark
- Create a local spark session
- Execute `spark.version`

What's the output?


**Solution**: 

- `spark.version`: 3.3.2

### Question 2: 

**FHV October 2019**

Read the October 2019 FHV data into a Spark DataFrame with a schema, as we did in the lessons.

Repartition the DataFrame to 6 partitions and save it to Parquet.

What is the average size (in MB) of the Parquet files (those ending with the `.parquet` extension) that were created? Select the answer that most closely matches.

- 1MB
- 6MB
- 25MB
- 87MB

In [60]:
# Defining the schema for the csv file
schema = types.StructType([
    types.StructField('dispatching_base_num', types.StringType(), True),
    types.StructField('pickup_datetime', types.TimestampType(), True),
    types.StructField('dropOff_datetime', types.TimestampType(), True),
    types.StructField('PULocationID', types.IntegerType(), True),
    types.StructField('DOLocationID', types.IntegerType(), True),
    types.StructField('SR_Flag', types.DoubleType(), True),
    types.StructField('Affiliated_base_number', types.StringType(), True),
])

df_fhv = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv(data_path)

In [61]:
df_fhv.printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropOff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: double (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)



In [36]:
df_fhv.repartition(6).write.parquet(
    "data/question2/", 
    mode="overwrite"
)

Getting the file sizes of all generated parquet files of the repartitioned dataframe:

In [63]:
!du -sh data/question2/* | grep parquet

6,4M	data/question2/part-00000-f5129de5-1f20-4eea-ad41-1101fb113f40-c000.snappy.parquet
6,4M	data/question2/part-00001-f5129de5-1f20-4eea-ad41-1101fb113f40-c000.snappy.parquet
6,4M	data/question2/part-00002-f5129de5-1f20-4eea-ad41-1101fb113f40-c000.snappy.parquet
6,4M	data/question2/part-00003-f5129de5-1f20-4eea-ad41-1101fb113f40-c000.snappy.parquet
6,4M	data/question2/part-00004-f5129de5-1f20-4eea-ad41-1101fb113f40-c000.snappy.parquet
6,4M	data/question2/part-00005-f5129de5-1f20-4eea-ad41-1101fb113f40-c000.snappy.parquet


**Solution**:

- `6 MB`
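
The average can also be computed directly instead of being read off the `du` listing. The following is a minimal sketch using only the standard library; the helper name `average_parquet_size_mb` is hypothetical, and it assumes the files sit directly in the output directory, as they do in `data/question2/` above. It divides bytes by 1024², which is roughly what `du -h` reports.

```python
from pathlib import Path


def average_parquet_size_mb(directory: str) -> float:
    """Return the mean size, in MiB, of the *.parquet files in `directory`.

    Marker files such as _SUCCESS are ignored because they do not match
    the *.parquet pattern.
    """
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory!r}")
    return sum(sizes) / len(sizes) / (1024 * 1024)


# Usage (assumes the Question 2 output directory exists):
# print(f"{average_parquet_size_mb('data/question2'):.1f} MB")
```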

### Question 3: 

**Count records** 

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 108,164
- 12,856
- 452,470
- 62,610

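> [!IMPORTANT]
> Mind the column order when defining the schema: with an explicit schema, Spark's CSV reader assigns fields by position and, by default, ignores the header names, so a misordered schema mislabels the columns.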
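
One way to guard against a misordered schema is to compare the file's header row with the schema's field names before reading the data with Spark. The sketch below uses only the standard library; `header_mismatches` is a hypothetical helper, and `EXPECTED_COLUMNS` simply repeats the field names from the schema defined in Question 2. Names are compared case-insensitively, on the assumption that only the order matters.

```python
import csv
import gzip

# Field names copied from the schema defined for the FHV file above.
EXPECTED_COLUMNS = [
    "dispatching_base_num",
    "pickup_datetime",
    "dropOff_datetime",
    "PULocationID",
    "DOLocationID",
    "SR_Flag",
    "Affiliated_base_number",
]


def header_mismatches(path, expected=EXPECTED_COLUMNS):
    """Compare a gzipped CSV's header with `expected`, position by position.

    Returns a list of (position, expected_name, actual_name) tuples;
    an empty list means the order matches.
    """
    with gzip.open(path, "rt", newline="") as f:
        header = next(csv.reader(f))
    mismatches = [
        (i, want, got)
        for i, (want, got) in enumerate(zip(expected, header))
        if want.lower() != got.lower()
    ]
    if len(header) != len(expected):
        mismatches.append(("column count", len(expected), len(header)))
    return mismatches


# Usage (assumes the file downloaded at the top of the notebook):
# print(header_mismatches("data/fhv_tripdata_2019-10.csv.gz"))
```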

Version with a SQL query (the day-of-month filter is sufficient here because the file is the October 2019 extract):

In [64]:
df_fhv.createOrReplaceTempView("trips_data")

In [40]:
df_result = spark.sql("""
SELECT
    COUNT(*) AS trips_15_oct
FROM
    trips_data
WHERE
    EXTRACT(day FROM pickup_datetime) = 15
""")
df_result.show()

+------------+
|trips_15_oct|
+------------+
|       62610|
+------------+



Version with Spark DataFrame functions:

In [41]:
df_fhv.filter(
    df_fhv.pickup_datetime.cast("date") == F.lit("2019-10-15")
).count()

62610

**Solution**:

- `62610`
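
The two versions above filter differently: the SQL query checks only the day of the month, while the DataFrame version compares the full date. They return the same count for this file, but they would diverge on data spanning several months. A small plain-Python illustration, using made-up timestamps rather than rows from the dataset:

```python
from datetime import date, datetime

# Hypothetical pickup timestamps, not taken from the FHV file.
pickups = [
    datetime(2019, 10, 15, 8, 30),
    datetime(2019, 10, 15, 23, 59),
    datetime(2019, 10, 16, 0, 0),
    datetime(2019, 11, 15, 12, 0),  # same day of month, different month
]

# Equivalent of EXTRACT(day FROM pickup_datetime) = 15
by_day_of_month = sum(1 for t in pickups if t.day == 15)

# Equivalent of casting to a date and comparing with 2019-10-15
by_full_date = sum(1 for t in pickups if t.date() == date(2019, 10, 15))

print(by_day_of_month)  # 3
print(by_full_date)     # 2
```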

### Question 4: 

**Longest trip for each day** 

What is the length of the longest trip in the dataset in hours?

- 631,152.50 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours

In [65]:
# Trip duration in hours: cast both timestamps to epoch seconds, subtract, divide by 3600
df_fhv = df_fhv.withColumn(
    "trip_duration",
    (df_fhv.dropOff_datetime.cast("long") - df_fhv.pickup_datetime.cast("long")) / 3600,
)

df_fhv = df_fhv.withColumn("trip_date", F.to_date(df_fhv.pickup_datetime))

df_fhv \
    .groupBy("trip_date") \
    .agg(F.max("trip_duration").alias("max_trip_duration")) \
    .orderBy("max_trip_duration", ascending=False) \
    .show(1)

+----------+-----------------+
| trip_date|max_trip_duration|
+----------+-----------------+
|2019-10-28|         631152.5|
+----------+-----------------+
only showing top 1 row



**Solution**:

- `631,152.50` Hours
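
A duration of 631,152.5 hours is about 72 years, which most likely reflects a bad drop-off timestamp rather than a real trip. The sketch below repeats the same arithmetic as the Spark expression (seconds between the two timestamps divided by 3600) in plain Python. The pickup time of day is a made-up value; only the date and the duration come from the result above.

```python
from datetime import datetime, timedelta


def duration_hours(pickup: datetime, dropoff: datetime) -> float:
    """Trip length in hours, computed from the difference in seconds."""
    return (dropoff - pickup).total_seconds() / 3600


# Hypothetical time of day; the date matches the trip_date in the output above.
pickup = datetime(2019, 10, 28, 9, 0)
dropoff = pickup + timedelta(hours=631152.5)

print(dropoff)                          # 2091-10-28 09:30:00
print(duration_hours(pickup, dropoff))  # 631152.5
```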

### Question 5: 

**User Interface**

On which local port does Spark's user interface, which shows the application's dashboard, run?

- 80
- 443
- 4040
- 8080

**Solution**:

- `4040` (the default; if that port is already in use, Spark tries 4041, 4042, and so on)
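
The fallback to the next port can be pictured with a small socket probe. This is only an illustration of the idea, not Spark's own implementation: `first_free_port` is a hypothetical helper, and the default of 16 retries mirrors Spark's `spark.port.maxRetries` setting.

```python
import socket


def first_free_port(start=4040, max_retries=16, host="127.0.0.1"):
    """Return the first port in [start, start + max_retries] that can be bound.

    Returns None if every candidate port is already taken.
    """
    last = min(start + max_retries, 65535)  # stay within the valid port range
    for port in range(start, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                continue  # port in use, try the next one
            return port
    return None


# Usage:
# print(first_free_port())
```

In a running session, `spark.sparkContext.uiWebUrl` reports the address the UI is actually using.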

### Question 6: 

**Least frequent pickup location zone**

Load the zone lookup data into a temp view in Spark<br>
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)

Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?

- East Chelsea
- Jamaica Bay
- Union Sq
- Crown Heights North


**Getting the data**

In [66]:
%mkdir -p data/question4/
!wget -c https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv -O data/question4/taxi_zone_lookup.csv > /dev/null 2>&1
!du -sh data/question4/*

zones_path = "data/question4/taxi_zone_lookup.csv"

16K	data/question4/taxi_zone_lookup.csv
20K	data/question4/taxi_zone_lookup.parquet


**Loading data to dataframe**

In [67]:
schema = types.StructType([
    types.StructField('LocationID', types.IntegerType(), True),
    types.StructField('Borough', types.StringType(), True),
    types.StructField('Zone', types.StringType(), True),
    types.StructField('service_zone', types.StringType(), True),
])

df_zones = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv(zones_path)

df_zones.write.parquet(
    zones_path.replace(".csv", ".parquet"), 
    mode="overwrite"
)

df_zones = spark.read.parquet(zones_path.replace(".csv", ".parquet"))

**Joining the DataFrames and finding the least frequent pickup zones**

In [70]:
df_join = df_fhv.join(
    df_zones, 
    on=df_fhv.PULocationID == df_zones.LocationID,
    how="inner"
)

df_results = df_join \
    .groupBy("Zone").count() \
    .orderBy("count", ascending=True)

In [85]:
df_results.show(5)

+--------------------+-----+
|                Zone|count|
+--------------------+-----+
|         Jamaica Bay|    1|
|Governor's Island...|    2|
| Green-Wood Cemetery|    5|
|       Broad Channel|    8|
|     Highbridge Park|   14|
+--------------------+-----+
only showing top 5 rows



**Solution**:

- `Jamaica Bay`