## Week 5 Homework 

In this homework we'll put what we learned about Spark into practice.

For this homework we will use the FHV 2019-10 data: [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz)

## Question 1: 
### Answer 1: `3.3.2`

**Install Spark and PySpark** 

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?


In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

spark.version

24/03/20 12:39:40 WARN Utils: Your hostname, DESKTOP-PSK1GCJ resolves to a loopback address: 127.0.1.1; using 172.23.129.77 instead (on interface eth0)
24/03/20 12:39:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/03/20 12:39:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'3.3.2'

## Question 2: 
### Answer 2: `6MB`

**FHV October 2019**

Read the October 2019 FHV data into a Spark DataFrame with a schema, as we did in the lessons.<br>
Repartition the DataFrame to 6 partitions and save it to Parquet.<br>
What is the average size (in MB) of the Parquet files (those ending with the .parquet extension) that were created? Select the answer that most closely matches.<br>

- 1MB
- 6MB
- 25MB
- 87MB

Download the FHV data for October 2019:

In [2]:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz

--2024-03-20 12:39:48--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz
Resolving github.com (github.com)... 20.87.245.0, 64:ff9b::1457:f500
Connecting to github.com (github.com)|20.87.245.0|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/efdfcf82-6d5c-44d1-a138-4e8ea3c3a3b6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240320%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240320T173021Z&X-Amz-Expires=300&X-Amz-Signature=5d3b533057fe8d317b1b143ae56c804deb527049468c9f12b320de77a4aa2ccf&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=513814948&response-content-disposition=attachment%3B%20filename%3Dfhv_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream [following]
--2024-03-20 12:39:48--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814

In [3]:
# Decompress the file (-f overwrites an existing output file)
!gunzip -f fhv_tripdata_2019-10.csv.gz

In [4]:
!ls -lh fhv_tripdata_2019-10.csv

-rw-r--r-- 1 kunmuli kunmuli 115M Dec  2  2022 fhv_tripdata_2019-10.csv


In [5]:
# Get the line count (this includes the header row)
!wc -l fhv_tripdata_2019-10.csv

1897494 fhv_tripdata_2019-10.csv
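
Note that `wc -l` counts every line, header included, so the number of data records is one less than the figure above. A minimal pure-Python sketch of the same count follows; `count_data_rows` is a hypothetical helper (not part of the homework), and it assumes one record per line with no embedded newlines in quoted fields.

```python
from pathlib import Path


def count_data_rows(csv_path: str) -> int:
    """Count the lines of a CSV file, excluding the single header line.

    Assumes one record per line (no newlines inside quoted fields).
    """
    with Path(csv_path).open("r", encoding="utf-8") as fh:
        total_lines = sum(1 for _ in fh)
    # Subtract the header; an empty file has zero data rows.
    return max(total_lines - 1, 0)
```

In the notebook this could be called as `count_data_rows('fhv_tripdata_2019-10.csv')`.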


Explore the datatypes of the schema and read it with Spark as we did in the lessons. We will use this dataset for all the remaining questions.

In [6]:
df = spark.read \
    .option("header", "true") \
    .csv('fhv_tripdata_2019-10.csv')

                                                                                

In [7]:
df.show()

+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|dispatching_base_num|    pickup_datetime|   dropOff_datetime|PUlocationID|DOlocationID|SR_Flag|Affiliated_base_number|
+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|              B00009|2019-10-01 00:23:00|2019-10-01 00:35:00|         264|         264|   null|                B00009|
|              B00013|2019-10-01 00:11:29|2019-10-01 00:13:22|         264|         264|   null|                B00013|
|              B00014|2019-10-01 00:11:43|2019-10-01 00:37:20|         264|         264|   null|                B00014|
|              B00014|2019-10-01 00:56:29|2019-10-01 00:57:47|         264|         264|   null|                B00014|
|              B00014|2019-10-01 00:23:09|2019-10-01 00:28:27|         264|         264|   null|                B00014|
|     B00021         |2019-10-01 00:00:4

In [8]:
df.schema

StructType([StructField('dispatching_base_num', StringType(), True), StructField('pickup_datetime', StringType(), True), StructField('dropOff_datetime', StringType(), True), StructField('PUlocationID', StringType(), True), StructField('DOlocationID', StringType(), True), StructField('SR_Flag', StringType(), True), StructField('Affiliated_base_number', StringType(), True)])

In [9]:
from pyspark.sql import types

schema = types.StructType([
    types.StructField('dispatching_base_num', types.StringType(), True),
    types.StructField('pickup_datetime', types.TimestampType(), True),
    types.StructField('dropoff_datetime', types.TimestampType(), True),
    types.StructField('PULocationID', types.IntegerType(), True),
    types.StructField('DOLocationID', types.IntegerType(), True),
    types.StructField('SR_Flag', types.StringType(), True),
    types.StructField('Affiliated_base_number', types.StringType(), True),
])

In [10]:
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv('fhv_tripdata_2019-10.csv')

df = df.repartition(6)

# 'overwrite' lets the cell be re-run without failing on an existing path
df.write.mode('overwrite').parquet('data/pq/fhv/2019/10/')

                                                                                

In [11]:
df = spark.read.parquet('data/pq/fhv/2019/10/')

In [12]:
# proposed command
!ls -lh data/pq/fhv/2019/10/

total 37M
-rw-r--r-- 1 kunmuli kunmuli    0 Mar 20 12:40 _SUCCESS
-rw-r--r-- 1 kunmuli kunmuli 6.1M Mar 20 12:40 part-00000-98a29d99-f30f-4701-80c2-7283685ac02b-c000.snappy.parquet
-rw-r--r-- 1 kunmuli kunmuli 6.1M Mar 20 12:40 part-00001-98a29d99-f30f-4701-80c2-7283685ac02b-c000.snappy.parquet
-rw-r--r-- 1 kunmuli kunmuli 6.0M Mar 20 12:40 part-00002-98a29d99-f30f-4701-80c2-7283685ac02b-c000.snappy.parquet
-rw-r--r-- 1 kunmuli kunmuli 6.1M Mar 20 12:40 part-00003-98a29d99-f30f-4701-80c2-7283685ac02b-c000.snappy.parquet
-rw-r--r-- 1 kunmuli kunmuli 6.0M Mar 20 12:40 part-00004-98a29d99-f30f-4701-80c2-7283685ac02b-c000.snappy.parquet
-rw-r--r-- 1 kunmuli kunmuli 6.1M Mar 20 12:40 part-00005-98a29d99-f30f-4701-80c2-7283685ac02b-c000.snappy.parquet
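
The listing above shows six files of 6.0M to 6.1M each, so the average is about 6 MB. As a cross-check that does not rely on reading `ls` output by eye, the average can be computed directly. This is only a sketch: `average_parquet_size_mb` is a hypothetical helper, and it reports sizes in units of 1024 * 1024 bytes, which is what `ls -lh` uses.

```python
from pathlib import Path


def average_parquet_size_mb(directory: str) -> float:
    """Return the mean size of the *.parquet files in `directory`.

    Sizes are in units of 1024 * 1024 bytes. Files such as _SUCCESS
    do not match the glob and are ignored.
    """
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory!r}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

For the output directory used here, the call would be `average_parquet_size_mb('data/pq/fhv/2019/10/')`.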


## Question 3: 
### Answer 3: `62610`

**Count records** 

How many taxi trips were there on the 15th of October?<br><br>
Consider only trips that started on the 15th of October.<br>

- 108,164
- 12,856
- 452,470
- 62,610

> [!IMPORTANT]
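> Be aware of column order when defining the schema: by default Spark applies a user-supplied schema to the CSV columns by position, not by header name.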

In [13]:
from pyspark.sql import functions as F

In [14]:
df \
    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
    .filter("pickup_date = '2019-10-15'") \
    .count()

62610

You can also use a temporary view:

In [15]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
df.createOrReplaceTempView('fhv_2019_10')



In [16]:
spark.sql("""
SELECT
    COUNT(1)
FROM 
    fhv_2019_10
WHERE
    to_date(pickup_datetime) = '2019-10-15';
""").show()

+--------+
|count(1)|
+--------+
|   62610|
+--------+



## Question 4: 
### Answer 4: `631,152.50 Hours`

**Longest trip for each day** 

What is the length of the longest trip in the dataset in hours?<br>

- 631,152.50 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours

In [17]:
df.columns

['dispatching_base_num',
 'pickup_datetime',
 'dropoff_datetime',
 'PULocationID',
 'DOLocationID',
 'SR_Flag',
 'Affiliated_base_number']

In [18]:
df \
    .withColumn('duration_in_seconds', df.dropoff_datetime.cast('long') - df.pickup_datetime.cast('long')) \
    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
    .groupBy('pickup_date') \
    .max('duration_in_seconds') \
    .orderBy('max(duration_in_seconds)', ascending=False) \
    .limit(5) \
    .show()

+-----------+------------------------+
|pickup_date|max(duration_in_seconds)|
+-----------+------------------------+
| 2019-10-28|              2272149000|
| 2019-10-11|              2272149000|
| 2019-10-31|               315620787|
| 2019-10-01|               252460901|
| 2019-10-17|                31658400|
+-----------+------------------------+



In [19]:
spark.sql("""
SELECT
    to_date(pickup_datetime) AS pickup_date,
    MAX((CAST(dropoff_datetime AS LONG) - CAST(pickup_datetime AS LONG)) / 3600) AS duration_in_hours
FROM 
    fhv_2019_10
GROUP BY
    1
ORDER BY
    2 DESC
LIMIT 20;
""").show()

+-----------+------------------+
|pickup_date| duration_in_hours|
+-----------+------------------+
| 2019-10-28|          631152.5|
| 2019-10-11|          631152.5|
| 2019-10-31| 87672.44083333333|
| 2019-10-01| 70128.02805555555|
| 2019-10-17|            8794.0|
| 2019-10-26| 8784.166666666666|
| 2019-10-30|1464.5344444444445|
| 2019-10-25|1056.8266666666666|
| 2019-10-02| 769.2313888888889|
| 2019-10-23| 745.6166666666667|
| 2019-10-03|          745.3825|
| 2019-10-04| 744.6166666666667|
| 2019-10-07| 744.1666666666666|
| 2019-10-05| 697.1808333333333|
| 2019-10-06| 674.0077777777777|
| 2019-10-08| 625.0822222222222|
| 2019-10-16| 604.0666666666667|
| 2019-10-09| 601.3102777777777|
| 2019-10-10| 577.3888888888889|
| 2019-10-12|          528.9125|
+-----------+------------------+
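
The two outputs above agree: the largest `max(duration_in_seconds)` value, 2272149000, divided by 3600 gives the 631152.5 hours reported by the SQL query. The sketch below repeats that arithmetic and adds a rough conversion to years. The 365.25-day year is an assumption used only as a sanity check, and the result suggests that the drop-off timestamp is probably a data-entry error rather than a real trip.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365.25  # average year length, for a rough sanity check


def seconds_to_hours(seconds: int) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


# Largest max(duration_in_seconds) value from the DataFrame output above
longest_seconds = 2_272_149_000
longest_hours = seconds_to_hours(longest_seconds)

print(longest_hours)                             # 631152.5
print(round(longest_hours / HOURS_PER_YEAR, 1))  # 72.0
```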



## Question 5: 
### Answer 5: `4040`

**User Interface**

Spark’s User Interface, which shows the application's dashboard, runs on which local port?<br>

- 80
- 443
- 4040
- 8080
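
4040 is Spark's default UI port; if it is already taken, Spark tries the next ports in turn (4041, 4042, and so on). One way to confirm that the UI is reachable while a session is running is a plain TCP check. This is a sketch only: `is_port_open` is a hypothetical helper that uses nothing but the standard library.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False
```

With an active session, `is_port_open('localhost', 4040)` should return `True` when the UI is on the default port. Inside the notebook, `spark.sparkContext.uiWebUrl` reports the address the UI actually bound to.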

## Question 6: 
### Answer 6: `Jamaica Bay`

**Least frequent pickup location zone**

Load the zone lookup data into a temp view in Spark.<br>
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)<br>

Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?<br>

- East Chelsea
- Jamaica Bay
- Union Sq
- Crown Heights North

In [20]:
df_zones = spark.read.parquet('zones')

In [21]:
df_zones.columns

['LocationID', 'Borough', 'Zone', 'service_zone']
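
The `zones` Parquet directory read above is assumed to exist already (for example, from the lessons). If only the lookup CSV is available, the view can be built from it directly. The helper below is a hypothetical sketch: the function name, the default `csv_path`, and the default `view_name` are illustrative and do not come from the notebook. It takes an existing `SparkSession` as its first argument.

```python
def register_zones_view(spark, csv_path="taxi_zone_lookup.csv", view_name="zones"):
    """Read the zone lookup CSV and expose it to Spark SQL as a temp view.

    `spark` is an existing SparkSession. `csv_path` and `view_name` are
    illustrative defaults, not values taken from the notebook.
    """
    df_zones = (
        spark.read
        .option("header", "true")  # the first line holds the column names
        .csv(csv_path)
    )
    # Make the DataFrame queryable by name from spark.sql(...)
    df_zones.createOrReplaceTempView(view_name)
    return df_zones
```

Without an explicit schema every column, including `LocationID`, is read as a string, so supply a schema or cast the column if a typed value is needed.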

In [22]:
df.columns

['dispatching_base_num',
 'pickup_datetime',
 'dropoff_datetime',
 'PULocationID',
 'DOLocationID',
 'SR_Flag',
 'Affiliated_base_number']

In [23]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
df_zones.createOrReplaceTempView('zones')

In [24]:
spark.sql("""
SELECT
    pul.Zone,
    COUNT(1)
FROM 
    fhv_2019_10 fhv LEFT JOIN zones pul ON fhv.PULocationID = pul.LocationID
GROUP BY 
    1
ORDER BY
    2 ASC
LIMIT 10;
""").show()

+--------------------+--------+
|                Zone|count(1)|
+--------------------+--------+
|         Jamaica Bay|       1|
|Governor's Island...|       2|
| Green-Wood Cemetery|       5|
|       Broad Channel|       8|
|     Highbridge Park|      14|
|        Battery Park|      15|
|Saint Michaels Ce...|      23|
|Breezy Point/Fort...|      25|
|Marine Park/Floyd...|      26|
|        Astoria Park|      29|
+--------------------+--------+



                                                                                