# 5. Data Manipulation III - Grouping and Aggregation - Quiz.ipynb

## 5.0. Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

## 5.1. Question 1

What is the maximum trip distance for trips with `pu_location_id = 1`?

In [3]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        #### YOUR CODE HERE
        "pu_location_id"
    )
    .agg(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 2)
┌────────────────┬───────────────┐
│ pu_location_id ┆ trip_distance │
│ ---            ┆ ---           │
│ i32            ┆ f64           │
╞════════════════╪═══════════════╡
│ 1              ┆ 35.75         │
└────────────────┴───────────────┘


1. 35.75 - correct
2. 29.7 - incorrect - are you sure you chose the right `pu_location_id`?
3. 176744.79 - incorrect - make sure to use `pu_location_id`, not `do_location_id`!
4. 0.0 - incorrect - are you taking the minimum or the maximum?

## 5.2. Question 2

Group the data by `do_location_id` and calculate the average `fare_amount` and average `trip_distance` for each group. Sort descending by `mean_fare_amount`, then by `mean_trip_distance`. What is the top `do_location_id`?

In [22]:
result = (
    #### YOUR CODE HERE
    df
    .group_by("do_location_id")
    .agg(
        pl.col("fare_amount").mean().name.prefix("mean_"),
        pl.col("trip_distance").mean().name.prefix("mean_"),
    )
    .sort(
        by=["mean_fare_amount", "mean_trip_distance"],
        descending=True
    )
)
print(result.head())

shape: (5, 3)
┌────────────────┬──────────────────┬────────────────────┐
│ do_location_id ┆ mean_fare_amount ┆ mean_trip_distance │
│ ---            ┆ ---              ┆ ---                │
│ i32            ┆ f64              ┆ f64                │
╞════════════════╪══════════════════╪════════════════════╡
│ 204            ┆ 121.428571       ┆ 28.712857          │
│ 109            ┆ 100.6356         ┆ 23.93              │
│ 5              ┆ 99.553333        ┆ 27.622667          │
│ 84             ┆ 98.601515        ┆ 27.881818          │
│ 265            ┆ 96.026596        ┆ 36.313287          │
└────────────────┴──────────────────┴────────────────────┘


1. 207 - incorrect - are you sure you sorted in the right direction?
2. 205 - incorrect - be sure to sort by mean_fare_amount then mean_trip_distance, and not the other way around!
3. 265 - incorrect - are you sure you're using `mean` and not `max`?
4. 204 - correct

## 5.3. Question 3

Which date for `tpep_dropoff_datetime` had the most rides?

In [32]:
result = (
    #### YOUR CODE HERE
    df
    .group_by(pl.col("tpep_dropoff_datetime").dt.date())
    .agg(pl.len().alias("num_rides"))
    .sort(by="num_rides", descending=True)
)
print(result)

shape: (36, 2)
┌───────────────────────┬───────────┐
│ tpep_dropoff_datetime ┆ num_rides │
│ ---                   ┆ ---       │
│ date                  ┆ u32       │
╞═══════════════════════╪═══════════╡
│ 2024-03-09            ┆ 140383    │
│ 2024-03-14            ┆ 136541    │
│ 2024-03-06            ┆ 132711    │
│ 2024-03-28            ┆ 132054    │
│ 2024-03-16            ┆ 131129    │
│ …                     ┆ …         │
│ 2024-04-01            ┆ 629       │
│ 2024-02-29            ┆ 3         │
│ 2024-04-02            ┆ 2         │
│ 2003-01-01            ┆ 1         │
│ 2002-12-31            ┆ 1         │
└───────────────────────┴───────────┘


1. 2002-12-31 - incorrect - make sure to get the date with the most rides, not the least.
2. 2024-03-14 22:04:00 - incorrect - we're looking for the date with the most rides, not the datetime with the most rides.
3. 2024-03-09 - correct
4. 2024-04-01 - incorrect - try again.

## 5.4. Question 4

Create a pivot table that shows the average fare amount for each combination of `vendor_id` and `payment_type`. What is the average fare amount associated with `vendor_id` 2 and `payment_type` 2?

In [39]:
result = (
    #### YOUR CODE HERE
    df
    .pivot(
        values="fare_amount",
        index="vendor_id",
        columns="payment_type",
        aggregate_function="mean"
    )
)
print(result)

shape: (3, 6)
┌───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ vendor_id ┆ 1         ┆ 2         ┆ 4         ┆ 3         ┆ 0         │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ i32       ┆ f64       ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 1         ┆ 18.582571 ┆ 16.996263 ┆ 16.23786  ┆ 15.228555 ┆ 18.562753 │
│ 2         ┆ 19.523008 ┆ 18.390497 ┆ -0.002382 ┆ 0.0       ┆ 18.746054 │
│ 6         ┆ null      ┆ null      ┆ null      ┆ null      ┆ 46.21087  │
└───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘


1. 850.0 - incorrect - be sure to take the mean, not the max.
2. 18.390497 - correct
3. 3.411724 - incorrect - are you sure your choice of `values` is `fare_amount`?
4. 18.582571 - incorrect - close! but the question is about precisely `vendor_id` 2 and `payment_type` 2.

## 5.5. Question 5

Create a pivot table that shows the average trip distance for every combination of `vendor_id` and whether or not the ride has an airport fee. What is the average trip distance for rides with `vendor_id = 1` that have an airport fee?

In [44]:
result = (
    #### YOUR CODE HERE
    df
    .with_columns(pl.col("airport_fee").gt(0).name.prefix("has_"))
    .pivot(
        values="trip_distance",
        index="vendor_id",
        columns="has_airport_fee",
        aggregate_function="mean"
    )
)
print(result)

shape: (3, 4)
┌───────────┬──────────┬───────────┬───────────┐
│ vendor_id ┆ false    ┆ true      ┆ null      │
│ ---       ┆ ---      ┆ ---       ┆ ---       │
│ i32       ┆ f64      ┆ f64       ┆ f64       │
╞═══════════╪══════════╪═══════════╪═══════════╡
│ 1         ┆ 3.377438 ┆ 12.570663 ┆ 1.756034  │
│ 2         ┆ 2.409085 ┆ 13.309613 ┆ 15.076071 │
│ 6         ┆ null     ┆ null      ┆ 10.129087 │
└───────────┴──────────┴───────────┴───────────┘


1. 115.3 - incorrect - are you sure you're taking the average trip_distance and not the maximum?
2. 13.309613 - incorrect - Make sure to check for `vendor_id = 1`, not `2`.
3. 12.570663 - correct
4. 2.409085 - incorrect - Make sure to check for rides that had an airport fee, not ones that didn't.

## 5.6. Question 6

Repeat Question 5, but instead of using `pivot`, use `filter` to keep only the rows with `vendor_id = 1` and `airport_fee > 0`, and then `select` to compute the average `trip_distance`. Do you arrive at the same answer as in Question 5?

In [46]:
result = (
    #### YOUR CODE HERE
    df
    .filter(pl.col("airport_fee").gt(0).and_(pl.col("vendor_id").eq(1)))
    .select(pl.col("trip_distance").mean())
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 12.570663     │
└───────────────┘


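1. Yes - correct - both approaches give 12.570663.
2. No - incorrect - compare the result with the `true` column for `vendor_id = 1` in Question 5.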

## 5.7. Question 7

Using `rank().over()`, find the 10th largest `trip_distance` for each `vendor_id`. What is the sum of these values across all `vendor_id`s?

In [59]:
result = (
    #### YOUR CODE HERE
    df
    .filter(
        pl.col("trip_distance")
        .rank(descending=True)
        .over(pl.col("vendor_id"))
        .eq(10)
    )
    .select(pl.col("trip_distance").sum())
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 113916.29     │
└───────────────┘


1. 113916.29 - correct
2. 115599.84 - incorrect - are you sure you're taking the 10th largest trip distance, and not the 9th?
3. 0.76 - incorrect - make sure to correctly set the `descending` argument in the `rank` function.
4. 1178.31 - incorrect - are you sure you're using `trip_distance` and not `total_amount`?

## 5.8. Question 8

What is the sum of the `total_amount`s for the three `pu_location_id`s with the highest maximum `trip_distance`s?

In [68]:
result = (
    df
    .group_by("pu_location_id")
    .agg(
        pl.col("total_amount").sum(),
        pl.col("trip_distance").max().name.prefix("max_")
    )
    .sort("max_trip_distance", descending=True)
    .head(3)
    .select(pl.col("total_amount").sum())
)
print(result)

shape: (1, 1)
┌──────────────┐
│ total_amount │
│ ---          │
│ f64          │
╞══════════════╡
│ 2.3340e7     │
└──────────────┘


1. 6.0735e6 - incorrect - make sure to take only the top three `pu_location_id`s.
2. 891.48 - incorrect - you might be taking the three `pu_location_id`s with the lowest maximum `trip_distance`, not the highest maximum `trip_distance`.
3. 4.6460e6 - correct
4. 2.3340e7 - incorrect - remember, you want the sum `total_amount` of the three `pu_location_id`s with the highest `max_trip_distance`, not the highest sum `total_amount`.

## 5.9. Question 9

Question 1 involved filtering then grouping data, but we could have done it the other way around, i.e. grouping and then filtering. In fact, let's try that here, and do a timing test. Which approach is faster and why?

In [4]:
%%timeit -n 3
result = (
    df
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
)

1.28 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [5]:
%%timeit -n 3
result = (
    df
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .filter(
        pl.col("pu_location_id").eq(1)
    )
)

82.8 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


1. They are the same, because the query optimization engine doesn't care about the order of operations. - incorrect - there is no query optimization in in-memory mode.
2. Filtering before the group-by is faster, because you reduce the amount of data handled by the group by operation. - correct
3. Filtering after the group-by is faster, because the computer doesn't have to worry about the expensive filter operation until the end. - incorrect - Filtering is actually not an expensive operation; grouping is far more expensive.
4. Filtering after the group-by is faster, since the filter occurs on grouped data, thus it has fewer total rows to filter out. - incorrect - Though it might be true that there are fewer rows to filter out, the true bottleneck of the operation is the grouping itself!

## 5.10. Question 10

Repeat Question 9, but load the data afresh with `pl.scan_parquet`. Which is faster now: filtering before grouping, or grouping before filtering?

In [6]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
lf = (
    pl.scan_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

In [7]:
%%timeit -n 3
result = (
    lf
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .collect()
)

23.3 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [8]:
%%timeit -n 3
result = (
    lf
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .collect()
)

22.4 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


1. They are the same, because the query optimization engine doesn't care about the order of operations. - correct - after the query hits the query optimization engine, they become the same query.
2. Filtering before the group-by is faster, because you reduce the amount of data handled by the group by operation. - incorrect - this might be the case in in-memory mode, but not in lazy mode!
3. Filtering after the group-by is faster, because the computer doesn't have to worry about the expensive filter operation until the end. - incorrect - Filtering is actually not an expensive operation; grouping is far more expensive.
4. Filtering after the group-by is faster, since the filter occurs on grouped data, thus it has fewer total rows to filter out. - incorrect - Though it might be true that there are fewer rows to filter out, the true bottleneck of the operation is the grouping itself!