# Trip Data Aggregation

### Group By Columns
1. Year
2. Month
3. Pickup Location ID
4. Drop Off Location ID

### Aggregated Columns
1. Total Trip Count
2. Total Fare Amount

### Purpose of the Notebook
Demonstrate the integration between the Spark Pool and the Serverless SQL Pool:

1. Create the aggregated table in Spark Pool
2. Access the data from Serverless SQL Pool 

In [6]:
#Set Folder Paths
raw_folder_path = 'abfss://nyc-taxi@synlakehousedev.dfs.core.windows.net/raw'
processed_folder_path = 'abfss://nyc-taxi@synlakehousedev.dfs.core.windows.net/processed'
curated_folder_path = 'abfss://nyc-taxi@synlakehousedev.dfs.core.windows.net/curated'

In [2]:
# Set Spark config: disable partition column type inference so that
# PartitionYear and PartitionMonth are read as strings
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
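
To illustrate why this setting matters, here is a minimal plain-Python sketch of how Hive-style `key=value` partition directories turn into column values. It is not Spark's implementation. The helper `parse_partition_values` and the sample path are hypothetical, and the zero-padded month `01` is an assumption about how the processed data is laid out.

```python
# Plain-Python illustration only; Spark's real partition discovery is more involved.
def parse_partition_values(path, infer_types):
    """Extract key=value directory segments from a partitioned file path."""
    values = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # not a partition directory
        key, raw = segment.split("=", 1)
        # With inference on, numeric-looking values become ints,
        # so a zero-padded month such as "01" turns into 1.
        values[key] = int(raw) if infer_types and raw.isdigit() else raw
    return values


# Hypothetical file path, used only for illustration.
sample_path = (
    "processed/trip_partitioned/"
    "PartitionYear=2020/PartitionMonth=01/part-00000.parquet"
)

inferred = parse_partition_values(sample_path, infer_types=True)
as_strings = parse_partition_values(sample_path, infer_types=False)

print(inferred)
print(as_strings)
```

With inference disabled the values keep their original text form, which is what the later `partitionBy` write relies on to reproduce the same directory names.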

In [12]:
%%sql
-- Create Database
CREATE DATABASE IF NOT EXISTS NYC_Taxi_Spark
LOCATION 'abfss://nyc-taxi@synlakehousedev.dfs.core.windows.net/curated_spark';

In [8]:
# Read the Processed Data 
trip_processed_df = spark.read.parquet(f"{processed_folder_path}/trip_partitioned") 

trip_processed_df.head()

In [9]:
# Perform the required aggregations
# Import under an alias so Spark's sum/round do not shadow Python's built-ins
from pyspark.sql import functions as F

trip_curated_agg_df = trip_processed_df \
    .groupBy("PartitionYear", "PartitionMonth", "PULocationID", "DOLocationID") \
    .agg(
        F.count(F.lit(1)).alias("TotalTripCount"),
        F.round(F.sum("FareAmount"), 2).alias("TotalFareAmount"),
    )
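
The grouping and aggregation logic can be sanity-checked on a handful of rows without a Spark session. The sketch below mirrors the same group-by keys and the two aggregates in plain Python. The sample rows and the helper `aggregate_trips` are made up for illustration and are not part of the notebook's data.

```python
from collections import defaultdict

# Hypothetical sample rows:
# (PartitionYear, PartitionMonth, PULocationID, DOLocationID, FareAmount)
sample_trips = [
    ("2020", "01", 1, 2, 10.50),
    ("2020", "01", 1, 2, 4.25),
    ("2020", "01", 3, 4, 7.00),
    ("2020", "02", 1, 2, 12.00),
]


def aggregate_trips(rows):
    """Group by (year, month, pickup, drop-off); count trips and sum fares."""
    totals = defaultdict(lambda: [0, 0.0])
    for year, month, pickup_id, dropoff_id, fare in rows:
        entry = totals[(year, month, pickup_id, dropoff_id)]
        entry[0] += 1      # TotalTripCount
        entry[1] += fare   # running sum for TotalFareAmount
    # Round the fare total to 2 decimals, as the Spark aggregation does.
    # Note: Python's round() and Spark's round() can differ on exact ties.
    return {
        key: (trip_count, round(fare_sum, 2))
        for key, (trip_count, fare_sum) in totals.items()
    }


result = aggregate_trips(sample_trips)
for key, (trip_count, total_fare) in sorted(result.items()):
    print(key, trip_count, total_fare)
```

Each distinct combination of year, month, pickup location, and drop-off location yields one output row, which is the shape the curated table is expected to have.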

In [11]:
# Write the aggregated data to the curated layer for consumption
trip_curated_agg_df.write \
    .mode("overwrite") \
    .partitionBy("PartitionYear", "PartitionMonth") \
    .format("parquet") \
    .saveAsTable("NYC_Taxi_Spark.TripAggregated")
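
For the second step, accessing the data from the Serverless SQL Pool, a query along the following lines could be used. This is a sketch under several assumptions: that the Spark table is exposed to the serverless pool through Synapse's shared metadata, that it appears under the `dbo` schema with lowercased database and table names, and that the partition columns are available as regular columns. The helper `build_serverless_query` and the year and month values are hypothetical.

```python
def build_serverless_query(database, table, year, month):
    """Return a T-SQL query string for the synchronized Spark table.

    Assumes the table is visible as <database>.dbo.<table> in the
    serverless SQL pool. Plain string formatting is used here for
    illustration only and is not safe for untrusted input.
    """
    return (
        "SELECT PULocationID, DOLocationID, TotalTripCount, TotalFareAmount\n"
        f"FROM {database}.dbo.{table}\n"
        f"WHERE PartitionYear = '{year}' AND PartitionMonth = '{month}';"
    )


# Assumed lowercased names and example partition values.
query = build_serverless_query("nyc_taxi_spark", "tripaggregated", "2020", "01")
print(query)
```

The printed statement can be pasted into a SQL script attached to the built-in serverless pool. If the table does not show up there, check the actual database, schema, and table names before running it.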