# Module 5 Homework


In this homework we'll put what we learned about Spark into practice.

For this homework we will use the Yellow Taxi trip data for October 2024 from the official website:

In [2]:
%mkdir -p data
FILE = "yellow_tripdata_2024-10.parquet"
!wget -c https://d37ci6vzurychx.cloudfront.net/trip-data/$FILE -O data/$FILE > /dev/null 2>&1
data_path = f"data/{FILE}"

In [3]:
import pyspark
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()


## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?

### Answer 1

In [5]:
spark.version

'3.3.2'

## Question 2: Yellow October 2024

Read the October 2024 Yellow data into a Spark DataFrame.

Repartition the DataFrame to 4 partitions and save it to Parquet.

What is the average size (in MB) of the Parquet files (those ending with the .parquet extension) that were created? Select the answer which most closely matches.

- 6MB
- 25MB
- 75MB
- 100MB


### Answer 2

In [6]:
# Parquet files carry their own schema, so the CSV-only "header" option is not needed
df_yellow = spark.read.parquet(data_path)

In [7]:
df_yellow.repartition(4).write.parquet(
    "data/question2/",
    mode="overwrite"
)

In [8]:
!du -sh data/question2/* | grep parquet

25M	data/question2/part-00000-eecd2ad1-7e96-493b-8d97-5113ae391001-c000.snappy.parquet
25M	data/question2/part-00001-eecd2ad1-7e96-493b-8d97-5113ae391001-c000.snappy.parquet
25M	data/question2/part-00002-eecd2ad1-7e96-493b-8d97-5113ae391001-c000.snappy.parquet
25M	data/question2/part-00003-eecd2ad1-7e96-493b-8d97-5113ae391001-c000.snappy.parquet


`Answer`: 25MB
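
The `du` listing above can also be averaged programmatically. A minimal sketch using only the standard library; the helper name `average_parquet_size_mb` is hypothetical, and it assumes the partition files sit directly inside the output directory (as they do in `data/question2/`):

```python
from pathlib import Path


def average_parquet_size_mb(directory: str) -> float:
    """Return the mean size, in MiB, of the *.parquet files directly inside `directory`."""
    # Only names ending in .parquet are counted, so _SUCCESS and .crc files are skipped.
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory!r}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

Calling `average_parquet_size_mb("data/question2")` after the write should land close to the 25M that `du` reports for each file.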

## Question 3: Count records 

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 85,567
- 105,567
- 125,567
- 145,567

### Answer 3

In [9]:
df_yellow.createOrReplaceTempView("trips_data")

In [10]:
df_result = spark.sql("""
SELECT
    COUNT(1) AS trips_15_oct
FROM
    trips_data
WHERE
    EXTRACT(day FROM tpep_pickup_datetime) = 15 
""")
df_result.show()



+------------+
|trips_15_oct|
+------------+
|      125567|
+------------+



In [11]:
from pyspark.sql import functions as F

In [12]:
df_yellow.filter(
    df_yellow.tpep_pickup_datetime.cast("date") == F.lit("2024-10-15")
).count()

125567

`Answer`: 125,567
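
The SQL query and the DataFrame filter use different predicates: one checks only the day of the month, the other compares the full date. Both give 125,567 here, but they would diverge if the file held pickups dated the 15th of another month. A small illustration in plain Python with made-up timestamps (none of these values come from the dataset):

```python
from datetime import date, datetime

# Hypothetical pickup timestamps, for illustration only
pickups = [
    datetime(2024, 10, 15, 8, 30),
    datetime(2024, 10, 15, 23, 59),
    datetime(2024, 9, 15, 12, 0),   # the 15th, but of a different month
    datetime(2024, 10, 16, 0, 0),
]

# Day-of-month check: the analogue of EXTRACT(day FROM ts) = 15
by_day_of_month = sum(1 for t in pickups if t.day == 15)

# Full-date check: the analogue of casting to date and comparing with 2024-10-15
by_full_date = sum(1 for t in pickups if t.date() == date(2024, 10, 15))

print(by_day_of_month, by_full_date)
```

The day-of-month count includes the September record while the full-date count does not, so the full-date comparison is the safer of the two when the contents of the file are not known in advance.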

## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

- 122
- 142
- 162
- 182

### Answer 4

In [13]:
# Trip duration in hours: difference of the epoch-second timestamps divided by 3600
df_q4 = df_yellow.withColumn(
    "trip_duration",
    (df_yellow.tpep_dropoff_datetime.cast("long") -
     df_yellow.tpep_pickup_datetime.cast("long")) / 3600
)

df_q4 = df_q4.withColumn("trip_date", F.to_date(df_q4.tpep_pickup_datetime))

# Longest trip per pickup date, sorted so that the overall maximum comes first
df_q4 \
    .groupBy("trip_date") \
    .agg(F.max("trip_duration").alias("max_trip_duration")) \
    .orderBy("max_trip_duration", ascending=False) \
    .show(1)



+----------+------------------+
| trip_date| max_trip_duration|
+----------+------------------+
|2024-10-16|162.61777777777777|
+----------+------------------+
only showing top 1 row



`162` hours
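
The duration column is a difference of epoch seconds divided by 3600. The same arithmetic in plain Python, as a sketch: the helper `trip_hours` and the pickup timestamp are hypothetical, and the 585,424-second gap is back-calculated from the 162.6177... hours shown above.

```python
from datetime import datetime, timedelta


def trip_hours(pickup: datetime, dropoff: datetime) -> float:
    """Return the trip duration in hours (seconds elapsed divided by 3600)."""
    return (dropoff - pickup).total_seconds() / 3600


# Hypothetical pickup time; only the 585,424-second gap matters here
pickup = datetime(2024, 10, 16, 0, 0, 0)
dropoff = pickup + timedelta(seconds=585_424)

hours = trip_hours(pickup, dropoff)
whole_hours = int(hours)  # truncate to whole hours to match the answer options
```

Truncating `hours` gives 162, which is why the closest option is 162 rather than 182.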

## Question 5: User Interface

On which local port does Spark's User Interface, which shows the application's dashboard, run?

- 80
- 443
- 4040
- 8080


### Answer 5

`4040` (the default; if that port is already in use, Spark tries 4041, 4042, and so on)

## Question 6: Least frequent pickup location zone

Load the zone lookup data into a temp view in Spark:

```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```

Using the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone?

- Governor's Island/Ellis Island/Liberty Island
- Arden Heights
- Rikers Island
- Jamaica Bay

In [16]:
zone_path = "data/taxi_zone_lookup.csv"
!wget -c https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv -O $zone_path > /dev/null 2>&1

### Answer 6

In [17]:
df_zones = spark.read.option("header", "true").csv(zone_path)
df_zones.write.parquet(
    zone_path.replace(".csv", ".parquet"),
    mode="overwrite"
)
df_zones = spark.read.parquet(zone_path.replace(".csv", ".parquet"))

In [22]:
df_join = df_yellow.join(
    df_zones,
    on=df_yellow.PULocationID == df_zones.LocationID,
    how="inner"
)

df_result = df_join \
    .groupBy("Zone").count() \
    .orderBy("count", ascending=True)

In [25]:
df_result.show(5)



+--------------------+-----+
|                Zone|count|
+--------------------+-----+
|Governor's Island...|    1|
|       Rikers Island|    2|
|       Arden Heights|    2|
| Green-Wood Cemetery|    3|
|         Jamaica Bay|    3|
+--------------------+-----+
only showing top 5 rows



`Governor's Island/Ellis Island/Liberty Island`
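
One caveat about the inner join: a zone with no pickups at all in the month never appears in the result, so the query finds the least frequent zone among those with at least one trip. A plain-Python sketch with made-up zones and IDs (not taken from the lookup file) shows the difference between inner-join and left-join counting:

```python
from collections import Counter

# Hypothetical lookup table and pickup IDs, for illustration only
zone_lookup = {1: "Zone A", 2: "Zone B", 3: "Zone C"}
pickup_ids = [1, 1, 2, 1, 2, 1]  # zone 3 never appears as a pickup

counts = Counter(pickup_ids)

# Inner-join behaviour: only zones that occur in the trip data are counted
inner = sorted(
    ((zone_lookup[i], n) for i, n in counts.items() if i in zone_lookup),
    key=lambda row: row[1],
)

# Left-join behaviour (lookup table on the left): zones without trips get a count of 0
left = sorted(
    ((name, counts.get(i, 0)) for i, name in zone_lookup.items()),
    key=lambda row: row[1],
)

print(inner[0], left[0])
```

With the inner-join style the least frequent zone is "Zone B" with 2 pickups, while the left-join style surfaces "Zone C" with 0. Whether any real zone has zero pickups in October 2024 is not checked here.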