# 5. Data Manipulation III - Grouping and Aggregation - Quiz.ipynb

## 5.0 Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

## 5.1 Question 1: Maximum Trip Distance for Specific Location

What is the maximum trip distance for trips with `pu_location_id = 1`?

In [3]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        #### YOUR CODE HERE
        "pu_location_id"
    )
    .agg(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 2)
┌────────────────┬───────────────┐
│ pu_location_id ┆ trip_distance │
│ ---            ┆ ---           │
│ i32            ┆ f64           │
╞════════════════╪═══════════════╡
│ 1              ┆ 35.75         │
└────────────────┴───────────────┘


1. 35.75 - correct - Exactly! The maximum trip distance for trips with `pu_location_id = 1` is 35.75. See “Grouping DataFrames with .group_by().”
2. 29.7 - incorrect - You may not have chosen the right `pu_location_id`. See “Grouping DataFrames with .group_by().”
3. 176744.79 - incorrect - Make sure to use `pu_location_id`, not `do_location_id`. See “Grouping DataFrames with .group_by().”
4. 0.0 - incorrect - You probably found the minimum rather than the maximum. See “Grouping DataFrames with .group_by().”

## 5.2 Question 2: Top Vendor by Average Fare and Trip Distance

Group the data by `vendor_id` and calculate the average `fare_amount` and average `trip_distance` for each vendor. Then sort in descending order first by `mean_fare_amount` and then by `mean_trip_distance`. What is the top `vendor_id`?

In [4]:
result = (
    #### YOUR CODE HERE
    df
    .group_by("vendor_id")
    .agg(
        pl.col("fare_amount").mean().name.prefix("mean_"),
        pl.col("trip_distance").mean().name.prefix("mean_"),
    )
    .sort(
        by=["mean_fare_amount", "mean_trip_distance"],
        descending=True
    )
)
print(result.head())

shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ vendor_id ┆ mean_fare_amount ┆ mean_trip_distance │
│ ---       ┆ ---              ┆ ---                │
│ i32       ┆ f64              ┆ f64                │
╞═══════════╪══════════════════╪════════════════════╡
│ 6         ┆ 46.21087         ┆ 10.129087          │
│ 2         ┆ 18.792448        ┆ 4.783591           │
│ 1         ┆ 18.324351        ┆ 3.682561           │
└───────────┴──────────────────┴────────────────────┘


1. 207 - incorrect - You may have sorted in ascending order instead of descending order. See "Grouping DataFrames with .group_by()."
2. 205 - incorrect - Be sure to sort by `mean_fare_amount` and then by `mean_trip_distance` and not the other way around. See "Grouping DataFrames with .group_by()."
3. 265 - incorrect - You may have found the max rather than the mean. See "Grouping DataFrames with .group_by()."
4. 6 - correct - Exactly! The vendor with `vendor_id = 6` has a much higher `mean_fare_amount` than the other vendors. See "Grouping DataFrames with .group_by()."

## 5.3 Question 3: Date with Most Rides

Which date for `tpep_dropoff_datetime` had the most rides?

In [5]:
result = (
    #### YOUR CODE HERE
    df
    .group_by(pl.col("tpep_dropoff_datetime").dt.date())
    .agg(pl.len().alias("num_rides"))
    .sort(by="num_rides", descending=True)
)
print(result)

shape: (36, 2)
┌───────────────────────┬───────────┐
│ tpep_dropoff_datetime ┆ num_rides │
│ ---                   ┆ ---       │
│ date                  ┆ u32       │
╞═══════════════════════╪═══════════╡
│ 2024-03-09            ┆ 140383    │
│ 2024-03-14            ┆ 136541    │
│ 2024-03-06            ┆ 132711    │
│ 2024-03-28            ┆ 132054    │
│ 2024-03-16            ┆ 131129    │
│ …                     ┆ …         │
│ 2024-04-01            ┆ 629       │
│ 2024-02-29            ┆ 3         │
│ 2024-04-02            ┆ 2         │
│ 2003-01-01            ┆ 1         │
│ 2002-12-31            ┆ 1         │
└───────────────────────┴───────────┘


1. 2002-12-31 - incorrect - The question asks for the date with the most rides, not the least. See "Grouping DataFrames with .group_by()."
2. 2024-03-14 22:04:00 - incorrect - The question asks for the date, not the datetime, with the most rides. See "Grouping DataFrames with .group_by()."
3. 2024-03-09 - correct - Exactly! There were 140,383 rides on March 9, 2024, more than on any other day. See "Grouping DataFrames with .group_by()."
4. 2024-04-01 - incorrect - Confirm that you found the maximum of the correct column. See "Grouping DataFrames with .group_by()."

## 5.4 Question 4: Average Fare by Vendor and Payment Type

Create a pivot table that shows the average fare amount for each combination of `vendor_id` and `payment_type`. What is the average fare amount associated with `vendor_id` 2 and `payment_type` 2?

In [16]:
result = (
    #### YOUR CODE HERE
    df
    .pivot(
        values="fare_amount",
        index="vendor_id",
        on="payment_type",
        aggregate_function="mean"
    )
)
print(result)

shape: (3, 6)
┌───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ vendor_id ┆ 1         ┆ 2         ┆ 4         ┆ 3         ┆ 0         │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ i32       ┆ f64       ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 1         ┆ 18.582571 ┆ 16.996263 ┆ 16.23786  ┆ 15.228555 ┆ 18.562753 │
│ 2         ┆ 19.523008 ┆ 18.390497 ┆ -0.002382 ┆ 0.0       ┆ 18.746054 │
│ 6         ┆ null      ┆ null      ┆ null      ┆ null      ┆ 46.21087  │
└───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘


1. 850.0 - incorrect - If you arrived at this amount, you probably found the max instead of the mean. See "Pivot Tables with .pivot()."
2. 18.390497 - correct - Exactly! The pivot table shows that this amount is similar across most `payment_type`/`vendor_id` combinations. See "Pivot Tables with .pivot()."
3. 3.411724 - incorrect - You may not have chosen `fare_amount` for values. See "Pivot Tables with .pivot()."
4. 18.582571 - incorrect - Close! However, the question asks specifically about `vendor_id` 2 and `payment_type` 2. See "Pivot Tables with .pivot()."

## 5.5 Question 5: Average Trip Distance for Airport Fee Rides

Create a pivot table that shows the average trip distance for every `vendor_id` by whether or not the ride has an airport fee. What is the average trip distance for rides with `vendor_id = 1` that have an airport fee?

In [17]:
result = (
    #### YOUR CODE HERE
    df
    .with_columns(pl.col("airport_fee").gt(0).name.prefix("has_"))
    .pivot(
        values="trip_distance",
        index="vendor_id",
        on="has_airport_fee",
        aggregate_function="mean"
    )
)
print(result)

shape: (3, 4)
┌───────────┬──────────┬───────────┬───────────┐
│ vendor_id ┆ false    ┆ true      ┆ null      │
│ ---       ┆ ---      ┆ ---       ┆ ---       │
│ i32       ┆ f64      ┆ f64       ┆ f64       │
╞═══════════╪══════════╪═══════════╪═══════════╡
│ 1         ┆ 3.377438 ┆ 12.570663 ┆ 1.756034  │
│ 2         ┆ 2.409085 ┆ 13.309613 ┆ 15.076071 │
│ 6         ┆ null     ┆ null      ┆ 10.129087 │
└───────────┴──────────┴───────────┴───────────┘


1. 115.3 - incorrect - To find the correct answer, you need to find the average `trip_distance`, not the maximum distance. See "Pivot Tables with .pivot()."
2. 13.309613 - incorrect - Make sure to check for `vendor_id = 1`, not 2. See "Pivot Tables with .pivot()."
3. 12.570663 - correct - Exactly right! Tip: You could also have solved this by using `.filter()` and then `.mean()`. See "Pivot Tables with .pivot()."
4. 2.409085 - incorrect - If you got this result, you probably looked at rides that did not have an airport fee, but the question asks about rides where a fee was charged. See "Pivot Tables with .pivot()."

## 5.6 Question 6: Average Trip Distance for Airport Fee Rides (Alternative Method)

What is the average trip distance for rides with `vendor_id = 1` that have an airport fee? Use `.filter()` to include only data where `vendor_id = 1` and `airport_fee > 0`, and then `.select()` to measure the average `trip_distance`.

In [8]:
result = (
    #### YOUR CODE HERE
    df
    .filter(pl.col("airport_fee").gt(0).and_(pl.col("vendor_id").eq(1)))
    .select(pl.col("trip_distance").mean())
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 12.570663     │
└───────────────┘


1. 115.3 - incorrect - You may have found the maximum rather than the average trip distance. See "Grouping DataFrames with .group_by()."
2. 13.309613 - incorrect - Make sure to check for `vendor_id = 1`, not 2. See "Grouping DataFrames with .group_by()."
3. 12.570663 - correct - Exactly right! Hint: You could have also used a pivot table to solve this. See "Grouping DataFrames with .group_by()."
4. 2.409085 - incorrect - Make sure to check for rides that had an airport fee, not ones that didn't. See "Grouping DataFrames with .group_by()."

## 5.7 Question 7: Sum of 10th Largest Trip Distances by Vendor

Using `.rank().over()`, find the 10th-largest `trip_distance` within each `vendor_id`. What is the sum of those values across all `vendor_id`s?

In [9]:
result = (
    #### YOUR CODE HERE
    df
    .filter(
        pl.col("trip_distance")
        .rank(descending=True)
        .over(pl.col("vendor_id"))
        .eq(10)
    )
    .select(pl.col("trip_distance").sum())
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 113916.29     │
└───────────────┘


1. 113916.29 - correct - Exactly! Using `.rank().over()`, you can create a column that represents the `trip_distance` rank within each `vendor_id`, filter for the 10th-largest trip distances, and then sum. See "Window Functions in Polars."
2. 115599.84 - incorrect - If you arrived at this result, you likely found the 9th-largest trip distance rather than the 10th. See "Window Functions in Polars."
3. 0.76 - incorrect - It's important to set the `descending` argument correctly in the rank function. See "Window Functions in Polars."
4. 1178.31 - incorrect - If you got this result, you may have used `total_amount`, but the question is about `trip_distance`. See "Window Functions in Polars."

## 5.8 Question 8: Total Amount for Top 3 Pickup Locations

What is the sum of the `total_amount`s for all rides taken with one of the three `pu_location_id`s with the highest maximum `trip_distance`?

In [10]:
result = (
    #### YOUR CODE HERE
    df
    .group_by("pu_location_id")
    .agg(
        pl.col("total_amount").sum(),
        pl.col("trip_distance").max().name.prefix("max_")
    )
    .sort("max_trip_distance", descending=True)
    .head(3)
    .select(pl.col("total_amount").sum())
)
print(result)

shape: (1, 1)
┌──────────────┐
│ total_amount │
│ ---          │
│ f64          │
╞══════════════╡
│ 4.6460e6     │
└──────────────┘


1. 6.0735e6 - incorrect - Make sure to take only the three `pu_location_id`s with the highest maximum `trip_distance`. See "Grouping DataFrames with .group_by()."
2. 891.48 - incorrect - You might have taken the three `pu_location_id`s with the lowest maximum `trip_distance` rather than the highest maximum `trip_distance`. See "Grouping DataFrames with .group_by()."
3. 4.6460e6 - correct - Nice! There is more than one aggregation involved here. First, group by `pu_location_id` and compute both the summed `total_amount` and the maximum `trip_distance`. Then take only the top three `pu_location_id`s by maximum `trip_distance`. Finally, add the `total_amount`s across those three `pu_location_id`s. See "Grouping DataFrames with .group_by()."
4. 2.3340e7 - incorrect - You need to sum the `total_amount` of the three `pu_location_id`s with the highest `max_trip_distance`, not the highest `total_amount`. See "Grouping DataFrames with .group_by()."

## 5.9 Question 9: Performance Comparison: Filtering vs Grouping (In-Memory)

Sometimes we want to both filter and group data; for example, in this question we want to group by `pu_location_id` and view the results for just one `pu_location_id`. In these cases, we can filter first or group first and get the same result. So which is faster, and why: grouping then filtering, or filtering then grouping? Run the following timing tests to find out, and choose the best explanation. (Note: we are in in-memory mode here.)

In [11]:
%%timeit -n 3
result = (
    df
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
)

1.58 ms ± 385 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [12]:
%%timeit -n 3
result = (
    df
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .filter(
        pl.col("pu_location_id").eq(1)
    )
)

153 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)


1. They are the same, because the query optimization engine doesn't care about the order of operations. - incorrect - There is no query optimization in in-memory mode. See “Grouping DataFrames with .group_by().”
2. Filtering before the group-by is faster, because you reduce the amount of data handled by the group-by operation. - correct - Since group-by is a relatively expensive operation, reducing the amount of data it needs to operate on speeds up performance. See “Grouping DataFrames with .group_by().”
3. Filtering after the group-by is faster, because the computer doesn't have to worry about the expensive filter operation until the end. - incorrect - Filtering is not a very expensive operation; grouping is far more expensive. See “Grouping DataFrames with .group_by().”
4. Filtering after the group-by is faster, since the filter occurs on grouped data and thus has fewer total rows to filter out. - incorrect - Though it might be true that there are fewer rows to filter out, the true bottleneck of the operation is the grouping itself! See “Grouping DataFrames with .group_by().”

## 5.10 Question 10: Performance Comparison: Filtering vs Grouping (Lazy Mode)

Sometimes we want to both filter and group data; for example, in this question we want to group by `pu_location_id` and view the results for just one `pu_location_id`. In these cases, we can filter first or group first and get the same result. So which is faster, and why: grouping then filtering, or filtering then grouping? Run the following timing tests to find out, and choose the best explanation. (Note: we are in lazy mode here.)

In [13]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
lf = (
    pl.scan_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

In [14]:
%%timeit -n 3
result = (
    lf
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .collect()
)

13.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [15]:
%%timeit -n 3
result = (
    lf
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .collect()
)

12.6 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)


1. They are the same, because the query optimization engine doesn't care about the order of operations. - correct - After the query passes through the query optimization engine, both orderings become the same query. See “Grouping DataFrames with .group_by().”
2. Filtering before the group-by is faster, because you reduce the amount of data handled by the group-by operation. - incorrect - This might be the case in in-memory mode, but not in lazy mode! See “Grouping DataFrames with .group_by().”
3. Filtering after the group-by is faster, because the computer doesn't have to worry about the expensive filter operation until the end. - incorrect - Filtering is actually not an expensive operation; grouping is far more expensive. Either way, it doesn't apply in lazy mode! See “Grouping DataFrames with .group_by().”
4. Filtering after the group-by is faster, since the filter occurs on grouped data and thus has fewer total rows to filter out. - incorrect - Though it might be true that there are fewer rows to filter out, the true bottleneck of the operation is the grouping itself. Regardless, the query optimization engine makes this irrelevant in lazy mode! See “Grouping DataFrames with .group_by().”