# 5. Data Manipulation III - Grouping and Aggregation

The goal of this module is to become familiar with grouping and aggregation. The three main topics we'll cover are:
1. `pl.DataFrame.group_by`, the main entrypoint to group operations, along with `.agg`, the way to compute aggregations within groups.
2. Using `polars`'s implementation of window functions, `pl.Expr.over()`.
3. Two-way group-bys: pivot tables.

But first we import `polars`...

In [1]:
import polars as pl

... and load the data, this time using the name mapping from the last module.

In [11]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
# Bind the renamed dataframe to `df` so later cells can reuse it.
df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)
df.head()

vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
i32,datetime[ns],datetime[ns],i64,f64,i64,str,i32,i32,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64
1,2024-03-01 00:18:51,2024-03-01 00:23:45,0,1.3,1,"""N""",142,239,1,8.6,3.5,0.5,2.7,0.0,1.0,16.3,2.5,0.0
1,2024-03-01 00:26:00,2024-03-01 00:29:06,0,1.1,1,"""N""",238,24,1,7.2,3.5,0.5,3.0,0.0,1.0,15.2,2.5,0.0
2,2024-03-01 00:09:22,2024-03-01 00:15:24,1,0.86,1,"""N""",263,75,2,7.9,1.0,0.5,0.0,0.0,1.0,10.4,0.0,0.0
2,2024-03-01 00:33:45,2024-03-01 00:39:34,1,0.82,1,"""N""",164,162,1,7.9,1.0,0.5,1.29,0.0,1.0,14.19,2.5,0.0
1,2024-03-01 00:05:43,2024-03-01 00:26:22,0,4.9,1,"""N""",263,7,2,25.4,3.5,0.5,0.0,0.0,1.0,30.4,2.5,0.0


## 5.1. Grouping Dataframes

In previous modules, we've seen how to compute basic aggregations over an entire dataframe with functions such as `.sum()`, `.min()`, `.max()`, `.count()`, and `.mean()`. Collecting some of those examples gives us a short summary of our dataset:

In [12]:
(
    df
    .select(
        (
            pl.col("passenger_count").eq(0).sum() / pl.col("passenger_count").count()
        ).alias("fraction_rides_zero_passengers"),
        pl.col("passenger_count").max().name.suffix("_max"),
        pl.col("trip_distance").min().name.suffix("_min"),
        pl.col("trip_distance").max().name.suffix("_max"),
    )
)

fraction_rides_zero_passengers,passenger_count_max,trip_distance_min,trip_distance_max
f64,i64,f64,f64
0.01279,9,0.0,176836.3


One taxi ride had 9 passengers? We haven't seen that yet, and it's a lot of passengers!

But what if we wanted to know these statistics for the different pickup zones? To do that, we need to crack open the `group_by` functionality.