## Instantiating Spark

Import the PySpark library and the `SparkSession` class.

In [1]:
import pyspark
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.master("local[*]").appName("spark_sql_groupby_join").getOrCreate()

In [None]:
df_green = spark.read.parquet("data/raw/green/*/*")

In [None]:
df_green.show()

With the sample data loaded (green taxi data for 2020 and 2021), let's write another `sql` query. As in the previous exercise, it uses a `GROUP BY` clause.

In [None]:
# First, register the DataFrame as a temporary view: Spark SQL queries refer to tables and views by name
df_green.createOrReplaceTempView("green")

In [None]:
# Second, write a query that breaks down revenue and number of trips by pickup hour of day (0-23) and pickup zone
df_green_revenue = spark.sql(
    """
SELECT
    EXTRACT(HOUR FROM lpep_pickup_datetime) AS hour,
    PULocationID AS zone,

    SUM(total_amount) as revenue,
    COUNT(1) as number_records
FROM green
WHERE lpep_pickup_datetime >= '2020-01-01 00:00:00'
GROUP BY 1,2
ORDER BY 1,2
"""
)

In [None]:
df_green_revenue.show()
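To make the aggregation concrete, here is a minimal plain-Python sketch of what the query above computes. The sample rows are made up for illustration and are not taken from the taxi dataset. Note that `EXTRACT(HOUR FROM ...)` returns the hour of day, so trips from different dates that start in the same hour land in the same group, and `GROUP BY 1,2` refers to the first two columns of the `SELECT` list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (lpep_pickup_datetime, PULocationID, total_amount)
trips = [
    (datetime(2020, 1, 1, 8, 5), 74, 10.0),
    (datetime(2020, 1, 2, 8, 30), 74, 5.5),    # same hour of day, different date
    (datetime(2020, 1, 1, 9, 0), 75, 7.0),
    (datetime(2019, 12, 31, 8, 0), 74, 99.0),  # excluded by the WHERE filter
]

cutoff = datetime(2020, 1, 1)  # mirrors WHERE lpep_pickup_datetime >= '2020-01-01 00:00:00'

# (hour, zone) -> [revenue, number_records]
totals = defaultdict(lambda: [0.0, 0])

for pickup, zone, amount in trips:
    if pickup < cutoff:
        continue
    key = (pickup.hour, zone)   # GROUP BY 1,2
    totals[key][0] += amount    # SUM(total_amount)
    totals[key][1] += 1         # COUNT(1)

# ORDER BY 1,2
report = [(hour, zone, revenue, n) for (hour, zone), (revenue, n) in sorted(totals.items())]

for row in report:
    print(row)
```

The two 8 o'clock trips in zone 74 collapse into a single row even though they happened on different days, which is worth keeping in mind when reading the Spark output.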

In [None]:
# Finally, write the query result to Parquet, split into two partitions
df_green_revenue.repartition(2).write.parquet('data/report/revenue_green', mode='overwrite')

In [None]:
# Repeat the same steps for the yellow taxi data

df_yellow = spark.read.parquet('data/raw/yellow/*/*')

In [None]:
df_yellow.show()

In [None]:
df_yellow.createOrReplaceTempView("yellow")

In [None]:
df_yellow_revenue = spark.sql(
    """
SELECT
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour,
    PULocationID AS zone,

    SUM(total_amount) as revenue,
    COUNT(1) as number_records
FROM yellow
WHERE tpep_pickup_datetime >= '2020-01-01 00:00:00'
GROUP BY 1,2
ORDER BY 1,2
""")

In [None]:
# Finally, write the query result to Parquet, split into two partitions
df_yellow_revenue.repartition(2).write.parquet('data/report/revenue_yellow', mode='overwrite')