# 5. Data Manipulation III - Grouping and Aggregation - Quiz

## 5.0 Import `polars` and Load Data

In [None]:
import polars as pl

In [None]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

## 5.1 Question 1: Maximum Trip Distance for Specific Location

What is the maximum trip distance for trips with `pu_location_id = 1`?

In [None]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
    )
    .group_by(
        #### YOUR CODE HERE
    )
    .agg(
        #### YOUR CODE HERE
    )
)
print(result)

1. 35.75
2. 29.7
3. 176744.79
4. 0.0

## 5.2 Question 2: Top Vendor by Average Fare and Trip Distance

Group the data by `vendor_id` and calculate the average `fare_amount` and average `trip_distance` for each vendor. Then sort in descending order first by `mean_fare_amount` and then by `mean_trip_distance`. What is the top `vendor_id`?

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result.head())

1. 207
2. 205
3. 265
4. 6

## 5.3 Question 3: Date with Most Rides

On which pickup date (the date part of `tpep_pickup_datetime`) were the most rides taken?

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result)

1. 2002-12-31
2. 2024-03-14 22:04:00
3. 2024-03-09
4. 2024-04-01

## 5.4 Question 4: Average Fare by Vendor and Payment Type

Create a pivot table that shows the average fare amount for each combination of `vendor_id` and `payment_type`. What is the average fare amount associated with `vendor_id` 2 and `payment_type` 2?

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result)

1. 850.0
2. 18.390497
3. 3.411724
4. 18.582571

## 5.5 Question 5: Average Trip Distance for Airport Fee Rides

Create a pivot table that shows the average trip distance for every `vendor_id` by whether or not the ride has an airport fee. What is the average trip distance for rides with `vendor_id = 1` that have an airport fee?

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result)

1. 115.3
2. 13.309613
3. 12.570663
4. 2.409085

## 5.6 Question 6: Average Trip Distance for Airport Fee Rides (Alternative Method)

What is the average trip distance for rides with `vendor_id = 1` that have an airport fee? Use `.filter()` to include only data where `vendor_id = 1` and `airport_fee > 0`, and then `.select()` to measure the average `trip_distance`.

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result)

1. 115.3
2. 13.309613
3. 12.570663
4. 2.409085

## 5.7 Question 7: Sum of 10th Largest Trip Distances by Vendor

Using `rank().over()`, find the 10th-largest `trip_distance` for each `vendor_id`. What is the sum of these values across all `vendor_id`s?

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result)

1. 113916.29
2. 115599.84
3. 0.76
4. 1178.31

## 5.8 Question 8: Total Amount for Top 3 Pickup Locations

Find the three `pu_location_id`s with the highest maximum `trip_distance`. What is the sum of `total_amount` over all rides picked up at any of those three locations?

In [None]:
result = (
    #### YOUR CODE HERE
)
print(result)

1. 6.0735e6
2. 891.48
3. 4.6460e6
4. 2.3340e7

## 5.9 Question 9: Performance Comparison: Filtering vs Grouping (In-Memory)

Sometimes we want to both filter and group data. In this question, for example, we want to group by `pu_location_id` but view the result for just one `pu_location_id`. In such cases we can filter first or group first and get the same result. Which is faster, and why: filtering then grouping, or grouping then filtering? Run the following timing tests to find out, and choose the best explanation. (Note: this question uses in-memory mode.)

In [None]:
%%timeit -n 3
result = (
    df
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
)

In [None]:
%%timeit -n 3
result = (
    df
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .filter(
        pl.col("pu_location_id").eq(1)
    )
)

1. They are the same, because the query optimization engine doesn't care about the order of operations.
2. Filtering before the group-by is faster, because you reduce the amount of data handled by the group-by operation.
3. Filtering after the group-by is faster, because the computer doesn't have to worry about the expensive filter operation until the end.
4. Filtering after the group-by is faster, because the filter runs on grouped data and therefore has fewer rows to filter out.

## 5.10 Question 10: Performance Comparison: Filtering vs Grouping (Lazy Mode)

Sometimes we want to both filter and group data. In this question, for example, we want to group by `pu_location_id` but view the result for just one `pu_location_id`. In such cases we can filter first or group first and get the same result. Which is faster, and why: filtering then grouping, or grouping then filtering? Run the following timing tests to find out, and choose the best explanation. (Note: this question uses lazy mode.)

In [None]:
column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
lf = (
    pl.scan_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(column_rename_mapping)
)

In [None]:
%%timeit -n 3
result = (
    lf
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .collect()
)

In [None]:
%%timeit -n 3
result = (
    lf
    .group_by(
        "pu_location_id"
    )
    .agg(
        pl.col("trip_distance").max()
    )
    .filter(
        pl.col("pu_location_id").eq(1)
    )
    .collect()
)

1. They are the same, because the query optimization engine doesn't care about the order of operations.
2. Filtering before the group-by is faster, because you reduce the amount of data handled by the group-by operation.
3. Filtering after the group-by is faster, because the computer doesn't have to worry about the expensive filter operation until the end.
4. Filtering after the group-by is faster, because the filter runs on grouped data and therefore has fewer rows to filter out.