# Advanced Analytics & Window Functions with PySpark

**Dataset**: `samples.nyctaxi.trips`

In this notebook you will:
1. Load the NYC Taxi sample dataset
2. Engineer features (trip duration, speed, tip rate)
3. Use advanced window functions (ranking, moving averages, per-group stats)
4. Use complex aggregations and pivots


In [None]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Load the built-in Databricks NYC Taxi dataset
nyc_taxi_df = spark.read.table("samples.nyctaxi.trips")

# Quick sanity check
display(nyc_taxi_df.limit(5))
# count() returns the exact row count and triggers a full scan of the table
print(f"Row count: {nyc_taxi_df.count()}")


## 1. Feature Engineering

We'll:
- Compute `trip_duration_min`
- Compute `avg_speed_mph`
- Compute `tip_rate` = tip / fare
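
As a quick sanity check of the three formulas above, the following plain-Python sketch applies them to one hand-made trip. The function name `trip_features` and the sample values are hypothetical, chosen only for illustration. The sketch also assumes that speed should be left undefined when the duration is not positive.

```python
from datetime import datetime


def trip_features(pickup, dropoff, distance_miles, fare, tip):
    """Return (duration_min, avg_speed_mph, tip_rate); None where a value is undefined."""
    duration_min = (dropoff - pickup).total_seconds() / 60.0
    # Speed is undefined for zero distance or a non-positive duration
    if distance_miles > 0 and duration_min > 0:
        avg_speed_mph = distance_miles / (duration_min / 60.0)
    else:
        avg_speed_mph = None
    # Tip rate is undefined when there is no positive fare to divide by
    tip_rate = tip / fare if fare > 0 else None
    return duration_min, avg_speed_mph, tip_rate


# Hypothetical trip: 3 miles in 15 minutes, $12 fare, $3 tip
features = trip_features(
    datetime(2016, 1, 1, 9, 0, 0), datetime(2016, 1, 1, 9, 15, 0), 3.0, 12.0, 3.0
)
print(features)  # (15.0, 12.0, 0.25)
```

A 15-minute, 3-mile trip works out to 12 mph, and a $3 tip on a $12 fare is a tip rate of 0.25.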


In [None]:
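# Column names can differ between versions of the sample table; check with nyc_taxi_df.printSchema().
# Columns used below: tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, fare_amount, tip_amount
# If one of them (for example tip_amount) is missing from your copy, the cells that use it will fail.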

taxi_feat_df = (
    nyc_taxi_df
    .withColumn(
        "trip_duration_min",
        (F.col("tpep_dropoff_datetime").cast("long") - F.col("tpep_pickup_datetime").cast("long")) / 60.0
    )
    .withColumn(
        "avg_speed_mph",
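        # Require a positive distance and a positive duration to avoid dividing by zero
        F.when((F.col("trip_distance") > 0) & (F.col("trip_duration_min") > 0),
               F.col("trip_distance") / (F.col("trip_duration_min") / 60.0)
        )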
    )
    .withColumn(
        "tip_rate",
        F.when(F.col("fare_amount") > 0,
               F.col("tip_amount") / F.col("fare_amount")
        )
    )
)

display(
    taxi_feat_df
    .select("tpep_pickup_datetime", "tpep_dropoff_datetime", "trip_distance", "trip_duration_min",
            "avg_speed_mph", "fare_amount", "tip_amount", "tip_rate")
    .limit(10)
)


## 2. Window Functions – Ranking Top Tippers Per Day

- Create a `pickup_date` column from the pickup timestamp
- Partition by `pickup_date`
- Rank trips by `tip_rate` within each day, highest first
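
To make the ranking semantics concrete, here is a small plain-Python sketch of a dense rank within partitions. It is an illustration of the idea only, not how Spark executes it; the helper `dense_rank_desc` and the sample rows are hypothetical, and null handling is left out.

```python
from itertools import groupby


def dense_rank_desc(rows, partition_key, order_key):
    """Attach a dense rank (1 = highest order_key) to each row within its partition."""
    ranked = []
    # sorted() is stable, so rows keep their input order inside each partition
    rows_sorted = sorted(rows, key=lambda r: r[partition_key])
    for _, group in groupby(rows_sorted, key=lambda r: r[partition_key]):
        group = list(group)
        # Distinct values, highest first: ties share a rank and no rank is skipped
        distinct_desc = sorted({r[order_key] for r in group}, reverse=True)
        rank_of = {value: i + 1 for i, value in enumerate(distinct_desc)}
        ranked.extend({**r, "rank": rank_of[r[order_key]]} for r in group)
    return ranked


# Hypothetical rows: two trips tie for the best tip rate on the first day
trips = [
    {"pickup_date": "2016-01-01", "tip_rate": 0.30},
    {"pickup_date": "2016-01-01", "tip_rate": 0.30},
    {"pickup_date": "2016-01-01", "tip_rate": 0.20},
    {"pickup_date": "2016-01-01", "tip_rate": 0.10},
    {"pickup_date": "2016-01-02", "tip_rate": 0.50},
]
ranks = [r["rank"] for r in dense_rank_desc(trips, "pickup_date", "tip_rate")]
print(ranks)  # [1, 1, 2, 3, 1]
```

Because tied trips share a rank, filtering on a dense rank of at most N can return more than N rows per day. A row-number style ranking would return exactly N but break ties arbitrarily.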


In [None]:
taxi_with_date_df = taxi_feat_df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))

# Define a window partitioned by day, ordered by tip_rate descending
day_tip_window = Window.partitionBy("pickup_date").orderBy(F.col("tip_rate").desc())

ranked_tips_df = taxi_with_date_df.withColumn("tip_rank_day", F.dense_rank().over(day_tip_window))

# Keep the 5 highest tip-rate ranks per day; dense_rank ties can yield more than 5 trips per day
top_tippers_df = ranked_tips_df.filter(F.col("tip_rank_day") <= 5)

display(
    top_tippers_df.select("pickup_date", "tip_rank_day", "fare_amount", "tip_amount", "tip_rate", "trip_distance")
    .orderBy("pickup_date", "tip_rank_day")
)


## 3. Rolling / Moving Average with Window

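We'll compute:
- A trailing 1-hour moving average of `fare_amount`, calculated within each day.

This uses a **rangeBetween** window whose bounds are expressed in seconds, because the window is ordered by the pickup time as epoch seconds.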
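
The difference between a range-based window and a fixed number of rows is easiest to see on a tiny example. The sketch below is plain Python, not Spark; the helper `trailing_range_avg` and the sample events are hypothetical, and the quadratic loop is only suitable for illustration.

```python
def trailing_range_avg(events, lookback_seconds=3600):
    """For each (epoch_seconds, value) event, average all values whose
    timestamp lies in [t - lookback_seconds, t], bounds inclusive."""
    averages = []
    for t, _ in events:
        # The window is defined by timestamp distance, not by a row count
        in_window = [v for (s, v) in events if t - lookback_seconds <= s <= t]
        averages.append(sum(in_window) / len(in_window))
    return averages


# Hypothetical fares at 0 s, 30 min, 60 min, and just over 2 hours
events = [(0, 10.0), (1800, 20.0), (3600, 30.0), (7201, 40.0)]
rolling = trailing_range_avg(events)
print(rolling)  # [10.0, 15.0, 20.0, 40.0]
```

The last event has no other trips in the preceding hour, so its average is just its own fare. The number of rows in each window varies with how dense the data is.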


In [None]:
# Add the pickup time as epoch seconds so the range window can use numeric bounds
taxi_ts_df = taxi_with_date_df.withColumn("pickup_ts", F.col("tpep_pickup_datetime").cast("long"))

# Define a window: from 1 hour before the current row up to the current row, partitioned by date
# NOTE: rangeBetween bounds are in the units of the orderBy column, here epoch seconds.
# Because of the date partition, trips just after midnight do not see the previous day's trips.
time_window = (
    Window
    .partitionBy("pickup_date")
    .orderBy("pickup_ts")
    .rangeBetween(-3600, 0)   # last 1 hour, including rows that share the current timestamp
)

rolling_df = taxi_ts_df.withColumn(
    "rolling_hour_fare_avg",
    F.avg("fare_amount").over(time_window)
)

display(
    rolling_df
    .select("tpep_pickup_datetime", "pickup_date", "fare_amount", "rolling_hour_fare_avg")
    .orderBy("tpep_pickup_datetime")
    .limit(50)
)


## 4. Per-Group Statistics with Window Functions

We'll compute the **z-score** of `trip_distance` within each day, using that day's mean (μ) and sample standard deviation (σ):

\[
z = \frac{x - \mu}{\sigma}
\]
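
A worked example of the formula, as a plain-Python sketch using the standard library. The helper `zscores` and the sample distances are hypothetical. It assumes the sample standard deviation (dividing by n - 1) and returns `None` when the deviation is zero or cannot be computed.

```python
from statistics import mean, stdev


def zscores(values):
    """Return the z-score of each value, or None where it is undefined."""
    # The sample standard deviation needs at least two observations
    if len(values) < 2:
        return [None] * len(values)
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if sigma == 0:
        return [None] * len(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical trip distances for a single day
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [round(z, 4) for z in zscores(distances)]
print(scores)  # [-1.2649, -0.6325, 0.0, 0.6325, 1.2649]
```

Here the mean is 3.0 and the sample variance is 2.5, so a 5-mile trip sits about 1.26 standard deviations above that day's average.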


In [None]:
distance_stats_window = Window.partitionBy("pickup_date")

distance_stats_df = (
    taxi_with_date_df
    .withColumn("mean_distance", F.avg("trip_distance").over(distance_stats_window))
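    # F.stddev is the sample standard deviation; it is undefined for a day with a single trip
    .withColumn("std_distance", F.stddev("trip_distance").over(distance_stats_window))
    .withColumn(
        "distance_zscore",
        # Leave the z-score null when the standard deviation is zero or null
        F.when(F.col("std_distance") > 0,
               (F.col("trip_distance") - F.col("mean_distance")) / F.col("std_distance"))
    )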
)

display(
    distance_stats_df
    .select("pickup_date", "trip_distance", "mean_distance", "std_distance", "distance_zscore")
    .orderBy(F.col("pickup_date"), F.col("distance_zscore").desc())
    .limit(50)
)


## 5. Pivoting & Complex Aggregations

Example:
- Bucket `trip_distance` into categories
- Aggregate **average tip_rate** per distance bucket and passenger count
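
Conceptually, the pivot turns each distinct passenger count into a column and fills each cell with an average. The plain-Python sketch below mimics that shape on a few hand-made rows. The helper `pivot_avg` and the data are hypothetical, and it assumes missing tip rates are skipped when averaging.

```python
from collections import defaultdict


def pivot_avg(rows, row_key, col_key, value_key, digits=3):
    """Build {row value: {column value: rounded average}} from a list of dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # (row, col) -> [running sum, count]
    for r in rows:
        if r[value_key] is None:
            continue  # skip missing values instead of counting them as zero
        cell = totals[(r[row_key], r[col_key])]
        cell[0] += r[value_key]
        cell[1] += 1
    table = defaultdict(dict)
    for (row_value, col_value), (total, n) in totals.items():
        table[row_value][col_value] = round(total / n, digits)
    return dict(table)


# Hypothetical rows: distance bucket, passenger count, tip rate
rows = [
    {"distance_bucket": "<1 mile", "passenger_count": 1, "tip_rate": 0.2},
    {"distance_bucket": "<1 mile", "passenger_count": 1, "tip_rate": 0.3},
    {"distance_bucket": "<1 mile", "passenger_count": 2, "tip_rate": 0.1},
    {"distance_bucket": "1-3 miles", "passenger_count": 1, "tip_rate": None},
    {"distance_bucket": "1-3 miles", "passenger_count": 2, "tip_rate": 0.25},
]
pivoted = pivot_avg(rows, "distance_bucket", "passenger_count", "tip_rate")
print(pivoted)  # {'<1 mile': {1: 0.25, 2: 0.1}, '1-3 miles': {2: 0.25}}
```

A bucket and passenger-count combination with no usable tip rate simply has no entry here; in a pivoted table the corresponding cell would be empty.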


In [None]:
# Create distance buckets
bucketed_df = (
    taxi_feat_df
    .withColumn(
        "distance_bucket",
        F.when(F.col("trip_distance") < 1, "<1 mile")
         .when(F.col("trip_distance") < 3, "1-3 miles")
         .when(F.col("trip_distance") < 7, "3-7 miles")
         .otherwise(">=7 miles")
    )
)

agg_df = (
    bucketed_df
    .groupBy("distance_bucket")
    .pivot("passenger_count")  # one output column per distinct passenger count
    .agg(F.round(F.avg("tip_rate"), 3))
    .orderBy("distance_bucket")  # string sort: "1-3 miles" and "3-7 miles" come before "<1 mile"
)

display(agg_df)
