# 6. Data Manipulation IV - Combining Data - Quiz

## 6.0 Import `polars` and Load Data

In [1]:
import polars as pl

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}
march_yellow_rides_df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
)

In [4]:
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename({
        "LocationID": "location_id",
        "Borough": "borough",
        "Zone": "zone",
    })
)

## 6.1 Question 1: Most Common Pickup-Dropoff Zone Pair

Join `zones_df` into `march_yellow_rides_df`. Which (`pu_zone`, `do_zone`) pair had the most rides?

In [5]:
result = (
    march_yellow_rides_df
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
    #### YOUR CODE HERE
    .group_by(["pu_zone", "do_zone"])
    .agg(pl.len().alias("num_rides"))
    .sort("num_rides", descending=True)
    .head(1)
)
print(result)

shape: (1, 3)
┌───────────────────────┬───────────────────────┬───────────┐
│ pu_zone               ┆ do_zone               ┆ num_rides │
│ ---                   ┆ ---                   ┆ ---       │
│ str                   ┆ str                   ┆ u32       │
╞═══════════════════════╪═══════════════════════╪═══════════╡
│ Upper East Side South ┆ Upper East Side North ┆ 21477     │
└───────────────────────┴───────────────────────┴───────────┘


1. (Upper East Side South, Upper East Side North) - correct - To get this result, you joined `zones_df` into `march_yellow_rides_df` twice (once for pickup and once for dropoff), as in the module, and grouped by the combination of `pu_zone` and `do_zone`. See "Joining DataFrames with .join()."
2. (Erasmus, Astoria) - incorrect - Make sure to find the (`pu_zone`, `do_zone`) pair with the most rides, not the fewest. See "Joining DataFrames with .join()."
3. (Upper East Side North, Upper East Side South) - incorrect - Make sure to find the (`pu_zone`, `do_zone`) pair with the most rides, not the second-most rides. See "Joining DataFrames with .join()."
4. (Midtown Center, Upper East Side North) - incorrect - If you got this result, you may not have grouped by the right columns. See "Joining DataFrames with .join()."

## 6.2 Question 2: Average Passenger Count for Specific Route

What is the average `passenger_count` for rides that started in the zone "Midtown Center" and ended in the zone "Upper East Side North"?

In [8]:
result = (
    march_yellow_rides_df
    #### YOUR CODE HERE
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
    .filter(
        pl.col("pu_zone").eq("Midtown Center")
        .and_(pl.col("do_zone").eq("Upper East Side North"))
    )
    .select(pl.col("passenger_count").mean())
)
print(result)

shape: (1, 1)
┌─────────────────┐
│ passenger_count │
│ ---             │
│ f64             │
╞═════════════════╡
│ 1.277752        │
└─────────────────┘


1. 13.526644 - incorrect - You might be measuring average `fare_amount` instead of average `passenger_count`. See "Joining DataFrames with .join()."
2. 1.932876 - incorrect - You might be measuring average `trip_distance` instead of average `passenger_count`. See "Joining DataFrames with .join()."
3. 1.277752 - correct - Exactly! First, join `zones_df` into `march_yellow_rides_df` twice (once for pickup and once for dropoff). Then, filter for the right pickup and dropoff zones. Finally, take the `.mean()` of `passenger_count`. See "Joining DataFrames with .join()."
4. 21.211721 - incorrect - You might be measuring average `total_amount` instead of average `passenger_count`. See "Joining DataFrames with .join()."

## 6.3 Question 3: Diagonal Concatenation Result Shape

Take the two toy dataframes below and concatenate them diagonally. What is the shape of the result?

In [41]:
toy_1_df = pl.DataFrame({
    "a": [1, 2, 4],
    "b": [5, 7, 8],
})
toy_2_df = pl.DataFrame({
    "a": [10, 20, 50, 90],
    "b": [50, 3, 99, 3],
    "c": [4, 6, 10, 12]
})
result = (
    #### YOUR CODE HERE
    pl.concat(
        [
            toy_1_df,
            toy_2_df
        ],
        how="diagonal"
    )
    .shape
)
print(result)

(7, 3)


1. (6, 2) - incorrect - Make sure you're not concatenating `toy_1_df` to itself! See "Concatenating DataFrames with .concat()."
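2. (7, 3) - correct - The 3 rows of `toy_1_df` and the 4 rows of `toy_2_df` stack to give 7 rows. Since `toy_2_df` has a column that `toy_1_df` doesn't, an extra column is added, and the rows from `toy_1_df` are filled with `null` in that column. See "Concatenating DataFrames with .concat()."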
3. (8, 3) - incorrect - Make sure you're not concatenating `toy_2_df` to itself! See "Concatenating DataFrames with .concat()."
4. (7, 2) - incorrect - This would be correct if `toy_2_df` didn't have the extra column `c`. See "Concatenating DataFrames with .concat()."