## Preparation

In [1]:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz
!gunzip fhv_tripdata_2019-10.csv.gz


## Q1
**Install Spark and PySpark**
- Install Spark
- Run PySpark
- Create a local Spark session
- Execute `spark.version`

In [2]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("homework").getOrCreate()
spark.version

24/03/03 12:10:04 WARN Utils: Your hostname, sugab-archlinux resolves to a loopback address: 127.0.1.1; using 192.168.81.206 instead (on interface wlan0)
24/03/03 12:10:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/03 12:10:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'3.5.1'

## Q2

**FHV October 2019**

Read the October 2019 FHV data into a Spark DataFrame with a schema, as we did in the lessons.

Repartition the DataFrame to 6 partitions and save it to Parquet.

What is the average size (in MB) of the Parquet files (ending with the .parquet extension) that were created? Select the answer that most closely matches.

In [3]:
df = spark.read.option("header", "true").csv("fhv_tripdata_2019-10.csv")
df.show(5)

                                                                                

+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|dispatching_base_num|    pickup_datetime|   dropOff_datetime|PUlocationID|DOlocationID|SR_Flag|Affiliated_base_number|
+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|              B00009|2019-10-01 00:23:00|2019-10-01 00:35:00|         264|         264|   NULL|                B00009|
|              B00013|2019-10-01 00:11:29|2019-10-01 00:13:22|         264|         264|   NULL|                B00013|
|              B00014|2019-10-01 00:11:43|2019-10-01 00:37:20|         264|         264|   NULL|                B00014|
|              B00014|2019-10-01 00:56:29|2019-10-01 00:57:47|         264|         264|   NULL|                B00014|
|              B00014|2019-10-01 00:23:09|2019-10-01 00:28:27|         264|         264|   NULL|                B00014|
+--------------------+------------------

In [4]:
df.printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: string (nullable = true)
 |-- dropOff_datetime: string (nullable = true)
 |-- PUlocationID: string (nullable = true)
 |-- DOlocationID: string (nullable = true)
 |-- SR_Flag: string (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)



### Infer the Correct Types

In [5]:
# Take the header plus the first 1,000 rows as a sample for type inference
!head -n 1001 "fhv_tripdata_2019-10.csv" > head.csv

In [6]:
# read head csv using pandas

import pandas as pd

df_pandas = pd.read_csv("head.csv")
df_pandas.dtypes

dispatching_base_num       object
pickup_datetime            object
dropOff_datetime           object
PUlocationID              float64
DOlocationID              float64
SR_Flag                   float64
Affiliated_base_number     object
dtype: object

In [7]:
# convert pandas dataframe to spark dataframe
spark.createDataFrame(df_pandas).printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: string (nullable = true)
 |-- dropOff_datetime: string (nullable = true)
 |-- PUlocationID: double (nullable = true)
 |-- DOlocationID: double (nullable = true)
 |-- SR_Flag: double (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)



Schema inferred from the pandas sample (the timestamp and location ID types are refined in the next section):

```
StructType(
    [
        StructField('dispatching_base_num', StringType(), True),
        StructField('pickup_datetime', StringType(), True),
        StructField('dropoff_datetime', StringType(), True),
        StructField('PULocationID', DoubleType(), True),
        StructField('DOLocationID', DoubleType(), True),
        StructField('SR_Flag', DoubleType(), True),
        StructField('Affiliated_base_number', StringType(), True)
    ]
)
```

### Import Data with Correct Types

In [8]:
from pyspark.sql import types

schema = types.StructType(
    [
        types.StructField("dispatching_base_num", types.StringType(), True),
        types.StructField("pickup_datetime", types.TimestampType(), True),
        types.StructField("dropoff_datetime", types.TimestampType(), True),
        types.StructField("PULocationID", types.IntegerType(), True),
        types.StructField("DOLocationID", types.IntegerType(), True),
        types.StructField("SR_Flag", types.StringType(), True),
        types.StructField("Affiliated_base_number", types.StringType(), True),
    ]
)

In [9]:
df = spark.read.option("header", "true").schema(schema).csv("fhv_tripdata_2019-10.csv")
df.printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: string (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)



### Partition and Save to Parquet

In [10]:
df_partition = df.repartition(6)

In [11]:
df_partition.write.parquet("fhv/2019/10/", mode="overwrite")

                                                                                

In [13]:
!ls -lh fhv/2019/10/

total 37M
-rw-r--r-- 1 sugab sugab 6,2M Mar  3 12:16 part-00000-909093c6-d9db-4110-9d54-98400cabc774-c000.snappy.parquet
-rw-r--r-- 1 sugab sugab 6,2M Mar  3 12:16 part-00001-909093c6-d9db-4110-9d54-98400cabc774-c000.snappy.parquet
-rw-r--r-- 1 sugab sugab 6,2M Mar  3 12:16 part-00002-909093c6-d9db-4110-9d54-98400cabc774-c000.snappy.parquet
-rw-r--r-- 1 sugab sugab 6,2M Mar  3 12:16 part-00003-909093c6-d9db-4110-9d54-98400cabc774-c000.snappy.parquet
-rw-r--r-- 1 sugab sugab 6,2M Mar  3 12:16 part-00004-909093c6-d9db-4110-9d54-98400cabc774-c000.snappy.parquet
-rw-r--r-- 1 sugab sugab 6,2M Mar  3 12:16 part-00005-909093c6-d9db-4110-9d54-98400cabc774-c000.snappy.parquet
-rw-r--r-- 1 sugab sugab    0 Mar  3 12:16 _SUCCESS


**Average Parquet file size: about 6 MB** (each of the six files is 6.2 MB)
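
The average can also be computed rather than read off the listing. A minimal sketch, assuming the output directory is `fhv/2019/10/` as above; the helper name `average_parquet_size_mb` is hypothetical, and one MB is taken as 1024 * 1024 bytes (the same unit `ls -lh` reports):

```python
from pathlib import Path


def average_parquet_size_mb(directory: str) -> float:
    """Return the mean size, in MB, of the *.parquet files in `directory`."""
    # Only names ending in .parquet are counted, so _SUCCESS and any
    # checksum (.crc) files are ignored.
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory!r}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

Calling `average_parquet_size_mb("fhv/2019/10/")` after the write step should return a value close to the per-file size shown in the listing.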

## Q3

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

In [14]:
df_partition.show(5)



+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|Affiliated_base_number|
+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+
|              B00937|2019-10-08 13:14:56|2019-10-08 13:24:40|         264|         243|   NULL|                B00937|
|              B00789|2019-10-02 14:15:10|2019-10-02 14:44:02|         264|         264|   NULL|                B00789|
|              B02509|2019-10-05 07:58:49|2019-10-05 08:16:04|         264|          70|   NULL|                B02682|
|              B01984|2019-10-03 09:22:00|2019-10-03 09:42:00|         264|          11|   NULL|                B01984|
|              B02735|2019-10-06 13:50:16|2019-10-06 14:24:55|         264|         265|   NULL|                B00882|
+--------------------+------------------

                                                                                

In [15]:
# register the DataFrame as a temporary view (registerTempTable is deprecated)
df_partition.createOrReplaceTempView("fhv_tripdata")



In [22]:
spark.sql(
    """
SELECT
    COUNT(*)
FROM
    fhv_tripdata
WHERE
    DATE(pickup_datetime) = '2019-10-15'
"""
).show()



+--------+
|count(1)|
+--------+
|   62610|
+--------+



                                                                                

## Q4

**Longest trip for each day**

What is the length of the longest trip in the dataset in hours?

In [41]:
spark.sql(
    """
SELECT 
    DATE(pickup_datetime) AS trip_date,
    MAX(CAST(TO_TIMESTAMP(dropoff_datetime) - TO_TIMESTAMP(pickup_datetime) AS LONG) / (1000 * 60 * 60)) AS longest_trip_duration_hours
FROM 
    fhv_tripdata
GROUP BY 
    1
ORDER BY 
    2 DESC;
"""
).show()



+----------+---------------------------+
| trip_date|longest_trip_duration_hours|
+----------+---------------------------+
|2019-10-28|                   631.1525|
|2019-10-11|                   631.1525|
|2019-10-31|          87.67244083333334|
|2019-10-01|          70.12802805555556|
|2019-10-17|                      8.794|
|2019-10-26|          8.784166666666666|
|2019-10-30|         1.4645344444444444|
|2019-10-25|         1.0568266666666666|
|2019-10-02|         0.7692313888888889|
|2019-10-23|         0.7456166666666667|
|2019-10-03|                  0.7453825|
|2019-10-04|         0.7446166666666667|
|2019-10-07|         0.7441666666666666|
|2019-10-05|         0.6971808333333334|
|2019-10-06|         0.6740077777777778|
|2019-10-08|         0.6250822222222222|
|2019-10-16|         0.6040666666666666|
|2019-10-09|         0.6013102777777778|
|2019-10-10|         0.5773888888888888|
|2019-10-12|                  0.5289125|
+----------+---------------------------+
only showing top

                                                                                

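**Answer: 631,152.5 hours.** Casting the interval to `LONG` yields seconds, not milliseconds, so the divisor `1000 * 60 * 60` in the query makes every value in the table above 1,000 times too small. Dividing by `60 * 60` instead gives 631,152.5 hours for the longest trip.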
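
As a sanity check on the units, the same duration arithmetic can be reproduced outside Spark. A minimal sketch using only the standard library; the helper `trip_hours` is hypothetical, and the sample timestamps are the first row of the Q2 preview:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def trip_hours(pickup: str, dropoff: str) -> float:
    """Return the trip duration in hours for two 'YYYY-MM-DD HH:MM:SS' strings."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    # total_seconds() returns seconds, so dividing by 60 * 60 gives hours.
    return delta.total_seconds() / (60 * 60)


# First row of the Q2 preview: a 12-minute trip.
print(trip_hours("2019-10-01 00:23:00", "2019-10-01 00:35:00"))
```

Checking a known short trip this way makes it easy to spot a divisor that assumes the wrong unit.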

## Q5

**User Interface**

On which local port does Spark's user interface, which shows the application's dashboard, run?
- 80
- 443
- **4040**
- 8080

## Q6

**Least frequent pickup location zone**

Load the zone lookup data into a temp view in Spark [Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv) or from `02_1_test.ipynb`

Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?

In [42]:
df_zones = spark.read.parquet("zones/")
df_zones.show(5)

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
+----------+-------------+--------------------+------------+
only showing top 5 rows



In [44]:
df_result = df_partition.join(
    df_zones, df_partition.PULocationID == df_zones.LocationID
)
df_result.show(5)



+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+----------+-------+----+------------+
|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|Affiliated_base_number|LocationID|Borough|Zone|service_zone|
+--------------------+-------------------+-------------------+------------+------------+-------+----------------------+----------+-------+----+------------+
|              B00937|2019-10-08 13:14:56|2019-10-08 13:24:40|         264|         243|   NULL|                B00937|       264|Unknown|  NV|         N/A|
|              B00789|2019-10-02 14:15:10|2019-10-02 14:44:02|         264|         264|   NULL|                B00789|       264|Unknown|  NV|         N/A|
|              B02509|2019-10-05 07:58:49|2019-10-05 08:16:04|         264|          70|   NULL|                B02682|       264|Unknown|  NV|         N/A|
|              B01984|2019-10-03 09:22:00|2019-10-03 09:42

                                                                                

In [45]:
df_zones.createOrReplaceTempView("zones")



In [47]:
spark.sql(
    """
SELECT
    zones.Zone as pickup_zone,
    COUNT(*)
FROM
    fhv_tripdata
JOIN zones ON
    fhv_tripdata.PULocationID = zones.LocationID
GROUP BY
    1
ORDER BY
    2
"""
).show()



+--------------------+--------+
|         pickup_zone|count(1)|
+--------------------+--------+
|         Jamaica Bay|       1|
|Governor's Island...|       2|
| Green-Wood Cemetery|       5|
|       Broad Channel|       8|
|     Highbridge Park|      14|
|        Battery Park|      15|
|Saint Michaels Ce...|      23|
|Breezy Point/Fort...|      25|
|Marine Park/Floyd...|      26|
|        Astoria Park|      29|
|    Inwood Hill Park|      39|
|       Willets Point|      47|
|Forest Park/Highl...|      53|
|  Brooklyn Navy Yard|      57|
|        Crotona Park|      62|
|        Country Club|      77|
|     Freshkills Park|      89|
|       Prospect Park|      98|
|     Columbia Street|     105|
|  South Williamsburg|     110|
+--------------------+--------+
only showing top 20 rows



                                                                                

**Answer: Jamaica Bay**
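
One caveat with the inner join above: a zone with no pickups at all never appears in the result, so "least frequent" here means least frequent among zones with at least one trip. A plain-Python sketch of the same join-and-count logic on made-up data (the IDs, zone names, and trips below are hypothetical, not taken from the dataset):

```python
from collections import Counter

# Hypothetical lookup table: LocationID -> Zone.
zones = {1: "Zone A", 2: "Zone B", 3: "Zone C"}

# Hypothetical pickup location IDs, one per trip; 99 has no match in `zones`.
pickups = [1, 1, 2, 1, 99]

# Inner join + GROUP BY zone + COUNT(*): unmatched IDs are dropped.
counts = Counter(zones[loc] for loc in pickups if loc in zones)

# ORDER BY count ascending, first row.
least_zone, least_count = min(counts.items(), key=lambda kv: kv[1])
print(least_zone, least_count)

# Zones with zero pickups are absent from `counts` entirely.
never_picked = sorted(set(zones.values()) - set(counts))
print(never_picked)
```

If zones with zero pickups should be reported as well, a left join from the zone table to the trips would be the usual alternative.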