## Instantiating Spark

Import the PySpark library and the `SparkSession` class.

In [1]:
import pyspark
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local[*]").appName("spark_sql_groupby_join").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 15:48:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
df_green = spark.read.parquet("data/raw/green/*/*")

In [4]:
df_green.show()

+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|VendorID|lpep_pickup_datetime|lpep_dropoff_datetime|store_and_fwd_flag|RatecodeID|PULocationID|DOLocationID|passenger_count|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|ehail_fee|improvement_surcharge|total_amount|payment_type|trip_type|congestion_surcharge|
+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|       2| 2019-12-18 15:52:30|  2019-12-18 15:54:39|                 N|       1.0|         264|         264|            5.0|          0.0|        3.5|  0.5|    0.

With the sample data loaded (green taxi trips for 2020 and 2021), we can run another `sql` query. As in the previous exercise, this one uses a `GROUP BY` statement.

In [5]:
# First, register the DataFrame as a temporary view: spark.sql() can only query tables and views, not DataFrame variables
df_green.createOrReplaceTempView("green")



In [6]:
# Second, write a query that breaks down the revenue and the number of trips by pickup hour and zone
df_green_revenue = spark.sql(
    """
SELECT
    EXTRACT(HOUR FROM lpep_pickup_datetime) AS hour,
    PULocationID AS zone,

    SUM(total_amount) as revenue,
    COUNT(1) as number_records
FROM green
WHERE lpep_pickup_datetime >= '2020-01-01 00:00:00'
GROUP BY 1,2
ORDER BY 1,2
"""
)
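
To make the aggregation concrete, here is a minimal pure-Python sketch of what the query computes. The four rows are made up for illustration and are not taken from the dataset. Note that `EXTRACT(HOUR FROM ...)` returns the hour of day (0 to 23), so trips from different dates land in the same bucket, and that rows before 2020 (like the 2019 record visible in the `show()` output above) are dropped by the `WHERE` clause.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (lpep_pickup_datetime, PULocationID, total_amount)
trips = [
    (datetime(2020, 1, 1, 0, 15), 7, 10.0),
    (datetime(2020, 1, 2, 0, 40), 7, 5.5),       # different day, same hour and zone
    (datetime(2020, 1, 1, 9, 5), 17, 20.0),
    (datetime(2019, 12, 18, 15, 52), 264, 3.5),  # excluded by the WHERE filter
]

cutoff = datetime(2020, 1, 1)

# (hour, zone) -> [revenue, number_records]
totals = defaultdict(lambda: [0.0, 0])
for pickup, zone, amount in trips:
    if pickup < cutoff:          # WHERE lpep_pickup_datetime >= '2020-01-01 00:00:00'
        continue
    key = (pickup.hour, zone)    # GROUP BY hour, zone
    totals[key][0] += amount     # SUM(total_amount)
    totals[key][1] += 1          # COUNT(1)

# ORDER BY hour, zone
report = [(hour, zone, rev, n) for (hour, zone), (rev, n) in sorted(totals.items())]
for row in report:
    print(row)
```

Spark performs the same grouping, but distributed across partitions rather than in a single loop.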

In [7]:
df_green_revenue.show()



+----+----+------------------+--------------+
|hour|zone|           revenue|number_records|
+----+----+------------------+--------------+
|   0|   3|            386.14|            11|
|   0|   4|             74.31|             2|
|   0|   5|            179.12|             3|
|   0|   7| 23819.25999999975|          1754|
|   0|   8|             10.79|             1|
|   0|   9|            187.71|             5|
|   0|  10|            805.69|            23|
|   0|  11|            378.27|            10|
|   0|  13|             61.55|             1|
|   0|  14|1721.9299999999998|            36|
|   0|  15|            289.06|             5|
|   0|  16|484.18000000000006|            15|
|   0|  17| 4357.509999999998|           220|
|   0|  18|2221.7199999999993|            72|
|   0|  19|            280.63|             8|
|   0|  20|1001.5400000000001|            54|
|   0|  21| 941.6099999999999|            23|
|   0|  22|1592.9499999999998|            58|
|   0|  23|176.10000000000002|    
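
The long decimal tails in the `revenue` column (for example `1721.9299999999998`) are presumably the result of summing `total_amount` as binary floating-point values. The sketch below reproduces the effect in plain Python with made-up amounts and shows two common remedies: rounding the final sum, or summing exact decimals.

```python
from decimal import Decimal

amounts = [0.1, 0.2, 0.3]  # hypothetical amounts stored as doubles

# Binary floats cannot represent most decimal fractions exactly,
# so small representation errors accumulate in the sum.
raw_total = sum(amounts)
print(raw_total)

# Remedy 1: round the final result for reporting.
print(round(raw_total, 2))

# Remedy 2: sum exact decimal values instead of floats.
exact_total = sum(Decimal(str(a)) for a in amounts)
print(exact_total)
```

In the query itself, one option would be to wrap the aggregate as `ROUND(SUM(total_amount), 2)`; whether that is appropriate depends on how the report is consumed downstream.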

In [8]:
# Lastly, let's write the output of our query to Parquet
df_green_revenue.repartition(2).write.parquet('data/report/revenue_green', mode='overwrite')
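
Note that `write.parquet` produces a directory rather than a single file: Spark writes one `part-*` file per partition (so typically two here, given `repartition(2)`) plus a `_SUCCESS` marker. A minimal sketch for inspecting the result, where `list_part_files` is a hypothetical helper name and not part of PySpark:

```python
from pathlib import Path


def list_part_files(output_dir: str) -> list[str]:
    """Return the sorted names of Parquet part files in a Spark output directory."""
    return sorted(p.name for p in Path(output_dir).glob("part-*.parquet"))


# Prints an empty list if the directory has not been written yet.
print(list_part_files("data/report/revenue_green"))
```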

In [9]:
# Repeat the same steps for the yellow taxi data

df_yellow = spark.read.parquet('data/raw/yellow/*/*')

In [10]:
df_yellow.show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       1| 2020-01-01 00:28:15|  2020-01-01 00:33:03|            1.0|          1.2|       1.0|                 N|         238|         239|           1|        6.0|  3.0|    0.5|      1.4

In [11]:
df_yellow.createOrReplaceTempView("yellow")

In [12]:
df_yellow_revenue = spark.sql(
    """
SELECT
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour,
    PULocationID AS zone,

    SUM(total_amount) as revenue,
    COUNT(1) as number_records
FROM yellow
WHERE tpep_pickup_datetime >= '2020-01-01 00:00:00'
GROUP BY 1,2
ORDER BY 1,2
""")
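
The yellow query differs from the green one only in the view name and the pickup column prefix (`tpep_` instead of `lpep_`). One way to avoid maintaining two copies is to generate the SQL from a small helper. This is a sketch, not the notebook's original approach; `build_revenue_query` and its parameters are hypothetical names.

```python
def build_revenue_query(table: str, pickup_col: str,
                        start: str = "2020-01-01 00:00:00") -> str:
    """Return the hourly revenue-by-zone query for a registered temp view.

    The arguments are interpolated directly into the SQL text, so they
    should only ever come from trusted, hard-coded values.
    """
    return f"""
SELECT
    EXTRACT(HOUR FROM {pickup_col}) AS hour,
    PULocationID AS zone,

    SUM(total_amount) AS revenue,
    COUNT(1) AS number_records
FROM {table}
WHERE {pickup_col} >= '{start}'
GROUP BY 1, 2
ORDER BY 1, 2
"""


green_sql = build_revenue_query("green", "lpep_pickup_datetime")
yellow_sql = build_revenue_query("yellow", "tpep_pickup_datetime")
print(yellow_sql)
```

With the session and temp views from the cells above, the result would be passed along as `spark.sql(build_revenue_query("yellow", "tpep_pickup_datetime"))`.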

In [13]:
# Lastly, let's write the output of our query to Parquet
df_yellow_revenue.repartition(2).write.parquet('data/report/revenue_yellow', mode='overwrite')
