# Advanced Analytics & Window Functions with PySpark

**Dataset**: `samples.nyctaxi.trips`

In this notebook you will:
1. Load the NYC Taxi sample dataset
2. Engineer features (trip duration, speed, tip rate)
3. Use advanced window functions (ranking, moving averages, per-group stats)
4. Use complex aggregations and pivots


In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Load the built-in Databricks NYC Taxi dataset
nyc_taxi_df = spark.read.table("samples.nyctaxi.trips")

# Quick sanity check
display(nyc_taxi_df.limit(5))
print(f"Row count: {nyc_taxi_df.count()}")  # count() returns an exact count


tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,fare_amount,pickup_zip,dropoff_zip
2016-02-13T21:47:53.000Z,2016-02-13T21:57:15.000Z,1.4,8.0,10103,10110
2016-02-13T18:29:09.000Z,2016-02-13T18:37:23.000Z,1.31,7.5,10023,10023
2016-02-06T19:40:58.000Z,2016-02-06T19:52:32.000Z,1.8,9.5,10001,10018
2016-02-12T19:06:43.000Z,2016-02-12T19:20:54.000Z,2.3,11.5,10044,10111
2016-02-23T10:27:56.000Z,2016-02-23T10:58:33.000Z,2.6,18.5,10199,10022


Row count: 21932


## 1. Feature Engineering

We'll:
- Compute `trip_duration_min`
- Compute `avg_speed_mph`
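- Compute `tip_rate` = tip / fare, when the table has a `tip_amount` column (the preview above shows none, in which case this step is skipped)


In [0]:
# Column names can differ between versions of the sample table; adjust if needed.
# Columns seen in the preview above: tpep_pickup_datetime, tpep_dropoff_datetime,
# trip_distance, fare_amount, pickup_zip, dropoff_zip. tip_amount is optional.

taxi_feat_df = (
    nyc_taxi_df
    .withColumn(
        "trip_duration_min",
        # Casting a timestamp to long yields epoch seconds.
        (F.col("tpep_dropoff_datetime").cast("long") - F.col("tpep_pickup_datetime").cast("long")) / 60.0
    )
    .withColumn(
        "avg_speed_mph",
        # Null when distance or duration is not positive (avoids dividing by zero).
        F.when(
            (F.col("trip_distance") > 0) & (F.col("trip_duration_min") > 0),
            F.col("trip_distance") / (F.col("trip_duration_min") / 60.0)
        )
    )
)

# tip_rate = tip / fare; only computed when the table exposes tip_amount.
if "tip_amount" in nyc_taxi_df.columns:
    taxi_feat_df = taxi_feat_df.withColumn(
        "tip_rate",
        F.when(F.col("fare_amount") > 0, F.col("tip_amount") / F.col("fare_amount"))
    )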


display(taxi_feat_df.select("tpep_pickup_datetime", "tpep_dropoff_datetime",
                            "trip_distance", "trip_duration_min", "avg_speed_mph", "fare_amount").limit(10))


tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,trip_duration_min,avg_speed_mph,fare_amount
2016-02-13T21:47:53.000Z,2016-02-13T21:57:15.000Z,1.4,9.366666666666667,8.967971530249109,8.0
2016-02-13T18:29:09.000Z,2016-02-13T18:37:23.000Z,1.31,8.233333333333333,9.546558704453442,7.5
2016-02-06T19:40:58.000Z,2016-02-06T19:52:32.000Z,1.8,11.566666666666666,9.337175792507203,9.5
2016-02-12T19:06:43.000Z,2016-02-12T19:20:54.000Z,2.3,14.183333333333334,9.729729729729728,11.5
2016-02-23T10:27:56.000Z,2016-02-23T10:58:33.000Z,2.6,30.616666666666667,5.095264017419706,18.5
2016-02-13T00:41:43.000Z,2016-02-13T00:46:52.000Z,1.4,5.15,16.31067961165048,6.5
2016-02-18T23:49:53.000Z,2016-02-19T00:12:53.000Z,10.4,23.0,27.130434782608692,31.0
2016-02-18T20:21:45.000Z,2016-02-18T20:38:23.000Z,10.15,16.633333333333333,36.61322645290581,28.5
2016-02-03T10:47:50.000Z,2016-02-03T11:07:06.000Z,3.27,19.266666666666666,10.183391003460208,15.0
2016-02-19T01:26:39.000Z,2016-02-19T01:40:01.000Z,4.42,13.366666666666667,19.840399002493765,15.0
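
As a sanity check on the first output row, the same arithmetic can be reproduced in plain Python, outside Spark. This is a minimal sketch: the helper name `trip_features` is hypothetical, and the extra guard on non-positive durations is an assumption added here, not something the output above depends on.

```python
from datetime import datetime

def trip_features(pickup_iso, dropoff_iso, distance_miles):
    """Return (duration in minutes, average speed in mph) for one trip."""
    pickup = datetime.fromisoformat(pickup_iso)
    dropoff = datetime.fromisoformat(dropoff_iso)
    duration_min = (dropoff - pickup).total_seconds() / 60.0
    # Speed is left undefined (None) when distance or duration is not positive.
    if distance_miles <= 0 or duration_min <= 0:
        return duration_min, None
    return duration_min, distance_miles / (duration_min / 60.0)

# First row of the output above: 1.4 miles, 21:47:53 to 21:57:15.
duration_min, speed_mph = trip_features(
    "2016-02-13T21:47:53", "2016-02-13T21:57:15", 1.4
)
print(round(duration_min, 3), round(speed_mph, 3))  # 9.367 8.968
```

The trip lasts 562 seconds, so the speed is 1.4 miles over 562/3600 hours, which matches the `avg_speed_mph` value in the first row.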


## 2. Window Functions: Ranking Top Tippers Per Day

- We'll create a **date** column
- Partition by `pickup_date`
- Rank trips by `tip_rate` within each day


In [0]:
taxi_with_date_df = taxi_feat_df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))

# The ranking needs a tip_rate column, which samples.nyctaxi.trips may not provide.
if "tip_rate" in taxi_with_date_df.columns:
    # Window partitioned by day, ordered by tip_rate descending (nulls last)
    day_tip_window = Window.partitionBy("pickup_date").orderBy(F.col("tip_rate").desc_nulls_last())

    ranked_tips_df = taxi_with_date_df.withColumn("tip_rank_day", F.dense_rank().over(day_tip_window))

    # Keep the top 5 tip-rate ranks per day; dense_rank can return more than 5 rows on ties
    top_tippers_df = ranked_tips_df.filter(F.col("tip_rank_day") <= 5)

    display(
        top_tippers_df.select("pickup_date", "tip_rank_day", "fare_amount", "tip_amount", "tip_rate", "trip_distance")
        .orderBy("pickup_date", "tip_rank_day")
    )
else:
    print("Skipping tip ranking: no tip_rate column is available.")
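
To make the `dense_rank` semantics concrete, here is a small plain-Python sketch of what the function assigns within one partition. The helper `dense_rank_desc` and the tip rates are hypothetical, chosen only to show how ties behave; this is not Spark code.

```python
def dense_rank_desc(values):
    """Dense rank with the largest value ranked 1; ties share a rank, no gaps."""
    distinct_sorted = sorted(set(values), reverse=True)
    rank_of = {value: rank for rank, value in enumerate(distinct_sorted, start=1)}
    return [rank_of[value] for value in values]

# Hypothetical tip rates for a single day (illustrative values, not from the dataset).
tip_rates = [0.30, 0.25, 0.30, 0.10, 0.25]
ranks = dense_rank_desc(tip_rates)
print(ranks)  # [1, 2, 1, 3, 2]
```

Because tied rows share a rank, a filter such as `tip_rank_day <= 5` keeps the top 5 distinct tip rates per day, which can be more than 5 rows.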


## 3. Rolling / Moving Average with Window

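We'll compute:
- A trailing 1-hour moving average of `fare_amount`, restarted for each day.

This uses a **rangeBetween** window whose bounds are in seconds, because the window is ordered by the pickup time cast to epoch seconds. Since the window is partitioned by `pickup_date`, trips shortly after midnight do not look back into the previous day.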


In [0]:
# Make sure the pickup time is a timestamp; it is cast to epoch seconds in the window below
taxi_ts_df = taxi_with_date_df.withColumn("pickup_ts", F.col("tpep_pickup_datetime").cast("timestamp"))

# Define a window: from 1 hour before the current row up to the current row, partitioned by date
# NOTE: rangeBetween bounds are in the units of the orderBy expression; here that is epoch seconds.
time_window = (
    Window
    .partitionBy("pickup_date")
    .orderBy(F.col("pickup_ts").cast("long"))
    .rangeBetween(-3600, 0)   # last 1 hour
)

rolling_df = taxi_ts_df.withColumn(
    "rolling_hour_fare_avg",
    F.avg("fare_amount").over(time_window)
)

display(
    rolling_df
    .select("tpep_pickup_datetime", "pickup_date", "fare_amount", "rolling_hour_fare_avg")
    .orderBy("tpep_pickup_datetime")
    .limit(50)
)


tpep_pickup_datetime,pickup_date,fare_amount,rolling_hour_fare_avg
2016-01-01T00:04:30.000Z,2016-01-01,5.0,5.0
2016-01-01T00:11:29.000Z,2016-01-01,24.5,14.75
2016-01-01T00:12:48.000Z,2016-01-01,5.0,11.5
2016-01-01T00:13:37.000Z,2016-01-01,10.0,11.125
2016-01-01T00:15:46.000Z,2016-01-01,8.5,10.6
2016-01-01T00:15:57.000Z,2016-01-01,11.0,10.666666666666666
2016-01-01T00:20:57.000Z,2016-01-01,5.5,9.928571428571429
2016-01-01T00:22:30.000Z,2016-01-01,4.0,9.1875
2016-01-01T00:23:34.000Z,2016-01-01,13.0,9.61111111111111
2016-01-01T00:24:29.000Z,2016-01-01,10.0,9.65
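
The range-based window can be checked by hand against the first rows of the output above. The following plain-Python sketch (the helper `rolling_avg` is hypothetical) averages every fare whose pickup time falls within the hour up to and including the current row. It assumes all rows belong to the same day, so the per-day partitioning is not modeled.

```python
from datetime import datetime, timedelta

# (pickup time, fare) pairs copied from the first rows of the output above.
trips = [
    ("2016-01-01T00:04:30", 5.0),
    ("2016-01-01T00:11:29", 24.5),
    ("2016-01-01T00:12:48", 5.0),
    ("2016-01-01T00:13:37", 10.0),
    ("2016-01-01T00:15:46", 8.5),
]

def rolling_avg(rows, window_seconds=3600):
    """For each row, average the fares with pickup time in [t - window, t]."""
    parsed = [(datetime.fromisoformat(ts), fare) for ts, fare in rows]
    window = timedelta(seconds=window_seconds)
    averages = []
    for current_ts, _ in parsed:
        in_range = [fare for ts, fare in parsed
                    if current_ts - window <= ts <= current_ts]
        averages.append(sum(in_range) / len(in_range))
    return averages

print(rolling_avg(trips))  # [5.0, 14.75, 11.5, 11.125, 10.6]
```

These values match the `rolling_hour_fare_avg` column for the first five rows. Unlike a row-based frame, a range-based frame includes however many trips fall inside the hour, so the number of rows averaged varies from row to row.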


## 4. Per-Group Statistics with Window Functions

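We'll compute the **z-score** of `trip_distance` within each day, where `x` is the trip distance, `μ` is that day's mean, and `σ` is that day's sample standard deviation (`F.stddev` is an alias for `stddev_samp`):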

\[
z = \frac{x - \mu}{\sigma}
\]


In [0]:
distance_stats_window = Window.partitionBy("pickup_date")

distance_stats_df = (
    taxi_with_date_df
    .withColumn("mean_distance", F.avg("trip_distance").over(distance_stats_window))
    .withColumn("std_distance", F.stddev("trip_distance").over(distance_stats_window))
    .withColumn(
        "distance_zscore",
        (F.col("trip_distance") - F.col("mean_distance")) / F.col("std_distance")
    )
)

display(
    distance_stats_df
    .select("pickup_date", "trip_distance", "mean_distance", "std_distance", "distance_zscore")
    .orderBy(F.col("pickup_date"), F.col("distance_zscore").desc())
    .limit(500)
)


pickup_date,trip_distance,mean_distance,std_distance,distance_zscore
2016-01-01,21.7,3.1345398773006137,3.418745285349174,5.43048942612383
2016-01-01,19.9,3.1345398773006137,3.418745285349174,4.903980473346976
2016-01-01,18.85,3.1345398773006137,3.418745285349174,4.596850250893812
2016-01-01,18.7,3.1345398773006137,3.418745285349174,4.5529745048290735
2016-01-01,18.1,3.1345398773006137,3.418745285349174,4.377471520570123
2016-01-01,17.76,3.1345398773006137,3.418745285349174,4.27801982949005
2016-01-01,17.64,3.1345398773006137,3.418745285349174,4.242919232638259
2016-01-01,17.47,3.1345398773006137,3.418745285349174,4.193193387098223
2016-01-01,16.56,3.1345398773006137,3.418745285349174,3.927013860972147
2016-01-01,10.9,3.1345398773006137,3.418745285349174,2.271435709462707
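
The z-score in the first output row can be recomputed directly from the mean and standard deviation shown there. The sketch below also contrasts the sample and population standard deviations on a small hypothetical list, since Spark's `F.stddev` uses the sample form (n - 1 in the denominator).

```python
import statistics

# Values copied from the first output row above for 2016-01-01.
mean_distance = 3.1345398773006137
std_distance = 3.418745285349174
z = (21.7 - mean_distance) / std_distance
print(round(z, 4))  # 5.4305

# Sample vs population standard deviation on hypothetical distances.
distances = [1.0, 2.0, 3.0, 4.0]
sample_std = statistics.stdev(distances)       # n - 1 denominator
population_std = statistics.pstdev(distances)  # n denominator
print(round(sample_std, 4), round(population_std, 4))  # 1.291 1.118
```

With thousands of trips per day the two forms are nearly identical. The difference matters mainly for days with very few trips, and a day with a single trip has no defined sample standard deviation.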


## 5. Pivoting & Complex Aggregations

Example:
- Bucket `trip_distance` into categories
- Aggregate **average tip_rate** per distance bucket and passenger count


In [0]:
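# Create distance buckets (rows with a null trip_distance fall through to ">=7 miles")
bucketed_df = (
    taxi_feat_df
    .withColumn(
        "distance_bucket",
        F.when(F.col("trip_distance") < 1, "<1 mile")
         .when(F.col("trip_distance") < 3, "1-3 miles")
         .when(F.col("trip_distance") < 7, "3-7 miles")
         .otherwise(">=7 miles")
    )
)

# The pivot needs tip_rate and passenger_count, which samples.nyctaxi.trips may not provide.
missing_cols = {"tip_rate", "passenger_count"} - set(bucketed_df.columns)

if missing_cols:
    print(f"Skipping pivot: missing columns {sorted(missing_cols)}")
else:
    agg_df = (
        bucketed_df
        .groupBy("distance_bucket")
        .pivot("passenger_count")  # one output column per passenger count
        .agg(F.round(F.avg("tip_rate"), 3))
        .orderBy("distance_bucket")  # sorts bucket labels alphabetically, not by distance
    )

    display(agg_df)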
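
The bucketing and pivot logic can be illustrated without Spark. In this plain-Python sketch the rows and the helper `distance_bucket` are hypothetical; it only shows how the chained conditions assign buckets and how each (bucket, passenger count) cell becomes one averaged value.

```python
from collections import defaultdict

def distance_bucket(trip_distance):
    """Mirror the chained when/otherwise: the first matching condition wins."""
    if trip_distance < 1:
        return "<1 mile"
    if trip_distance < 3:
        return "1-3 miles"
    if trip_distance < 7:
        return "3-7 miles"
    return ">=7 miles"

# Hypothetical (trip_distance, passenger_count, tip_rate) rows, not from the dataset.
rows = [(0.5, 1, 0.20), (0.8, 2, 0.10), (2.0, 1, 0.15), (2.5, 1, 0.25), (8.0, 2, 0.30)]

# Group tip rates by (bucket, passenger_count), the two keys of the pivot.
cells = defaultdict(list)
for distance, passengers, tip_rate in rows:
    cells[(distance_bucket(distance), passengers)].append(tip_rate)

# One row per bucket, one entry per passenger count, averaged and rounded to 3 places.
pivot = defaultdict(dict)
for (bucket, passengers), rates in cells.items():
    pivot[bucket][passengers] = round(sum(rates) / len(rates), 3)

for bucket in sorted(pivot):
    print(bucket, pivot[bucket])
```

Sorting the labels as strings places "1-3 miles" before "<1 mile", so a numeric sort key would be needed to list the buckets in distance order.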
