# 7. Data Manipulation V - Working With Data Types - Quiz

## 7.0 Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 7.1 Question 1: Highest Average Fare by Day and Hour

Extract the day of the week (as an ISO weekday number, where Monday is 1 and Sunday is 7) and the hour from the `tpep_pickup_datetime` column. Then, calculate the average fare amount for each day-hour combination. Finally, sort the results by average fare amount in descending order. Which day-hour combination has the highest average fare amount?

In [4]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns([
        pl.col("tpep_pickup_datetime").dt.weekday().alias("day_of_week"),
        pl.col("tpep_pickup_datetime").dt.hour().alias("hour")
    ])
    .group_by(["day_of_week", "hour"])
    .agg(avg_fare=pl.col("fare_amount").mean())
    .sort("avg_fare", descending=True)
)
print(result)

shape: (168, 3)
┌─────────────┬──────┬───────────┐
│ day_of_week ┆ hour ┆ avg_fare  │
│ ---         ┆ ---  ┆ ---       │
│ i8          ┆ i8   ┆ f64       │
╞═════════════╪══════╪═══════════╡
│ 3           ┆ 4    ┆ 28.732735 │
│ 2           ┆ 4    ┆ 28.361275 │
│ 1           ┆ 4    ┆ 28.345536 │
│ 1           ┆ 1    ┆ 28.123612 │
│ 7           ┆ 23   ┆ 27.736774 │
│ …           ┆ …    ┆ …         │
│ 6           ┆ 10   ┆ 15.954722 │
│ 6           ┆ 1    ┆ 15.582336 │
│ 6           ┆ 3    ┆ 15.483906 │
│ 7           ┆ 2    ┆ 15.34928  │
│ 6           ┆ 2    ┆ 14.980229 │
└─────────────┴──────┴───────────┘


1. day=6, hour=2 - incorrect - you might be sorting ascending rather than descending. See "Working with Temporal Columns: The .dt Namespace."
2. day=3, hour=4 - correct - When you sort using `descending=True`, this is indeed the top result! See "Working with Temporal Columns: The .dt Namespace."
3. day=7, hour=23 - incorrect - you might be taking the average `total_amount` rather than the average `fare_amount`. See "Working with Temporal Columns: The .dt Namespace."
4. day=1, hour=0 - incorrect - you might be taking the weekday and hour from `tpep_dropoff_datetime` instead of `tpep_pickup_datetime`. See "Working with Temporal Columns: The .dt Namespace."

## 7.2 Question 2: Dropoff Zone with Longest Trip Duration

Which `do_zone` is the dropoff zone of the ride with the longest trip duration (where "trip duration" is the difference between `tpep_dropoff_datetime` and `tpep_pickup_datetime`, measured with `.dt.total_seconds()`)?

In [5]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns(
        pl.col("tpep_dropoff_datetime")
        .sub(pl.col("tpep_pickup_datetime"))
        .dt.total_seconds()
        .alias("trip_duration")
    )
    .sort("trip_duration", descending=True)
    .head(1)
    .select(["do_zone", "trip_duration"])
)
print(result)

shape: (1, 2)
┌─────────────────────────────────┬───────────────┐
│ do_zone                         ┆ trip_duration │
│ ---                             ┆ ---           │
│ str                             ┆ i64           │
╞═════════════════════════════════╪═══════════════╡
│ Saint Michaels Cemetery/Woodsi… ┆ 545553        │
└─────────────────────────────────┴───────────────┘


1. Saint Michaels Cemetery/Woodside - correct - Nice! You could solve this in a few ways. The most straightforward way is to create the `trip_duration` column and sort it in descending order. See "Working with Temporal Columns: The .dt Namespace."
2. Midtown Center - incorrect - you might be taking the `do_zone` with the lowest trip duration. See "Working with Temporal Columns: The .dt Namespace."
3. 207 - incorrect - almost right! But the answer should be a `do_zone`, not a `do_location_id`. See "Working with Temporal Columns: The .dt Namespace."
4. Woodside - incorrect - you may not have used `.total_seconds()` as the measure of trip duration. See "Working with Temporal Columns: The .dt Namespace."

## 7.3 Question 3: Pickup Location with Most Diverse Dropoff Zones

With a group-by in `polars`, you are not limited to computing a summary statistic per group: passing a bare column expression such as `pl.col("do_zone")` to `.agg()` collects all of that group's values into a list (see below). Using this, build for each `pu_location_id` a column that holds every `do_zone` associated with it. Which `pu_location_id` has the longest list of associated `do_zone`s? (Hint: use the `.list` namespace.)

In [6]:
result = (
    rides_df
    .group_by("pu_location_id")
    .agg(pl.col("do_zone"))
    #### YOUR CODE HERE
    .sort(pl.col("do_zone").list.len(), descending=True)
)
print(result)

shape: (259, 2)
┌────────────────┬─────────────────────────────────┐
│ pu_location_id ┆ do_zone                         │
│ ---            ┆ ---                             │
│ i32            ┆ list[str]                       │
╞════════════════╪═════════════════════════════════╡
│ 161            ┆ ["Lenox Hill West", "Greenwich… │
│ 132            ┆ ["South Jamaica", "Clinton Hil… │
│ 237            ┆ ["Lenox Hill West", "Union Sq"… │
│ 236            ┆ ["Upper East Side South", "Blo… │
│ 162            ┆ ["Flatiron", "Midtown Center",… │
│ …              ┆ …                               │
│ 199            ┆ ["Midtown Center", "Times Sq/T… │
│ 84             ┆ ["Sheepshead Bay"]              │
│ 156            ┆ ["Coney Island"]                │
│ 109            ┆ ["Great Kills"]                 │
│ 5              ┆ ["Arden Heights"]               │
└────────────────┴─────────────────────────────────┘


1. 161 - correct - Exactly! Note that aggregating all elements of the group into a list takes all values (i.e., it doesn't take just unique values). See "Working with List Columns: The .list Namespace."
2. 5 - incorrect - are you sure you didn't find the `pu_location_id` with the shortest list of associated `do_zone`s? See "Working with List Columns: The .list Namespace."
3. Midtown Center - incorrect - we are looking for the `pu_location_id` with the longest list of associated `do_zone`s, not the `pu_zone`! See "Working with List Columns: The .list Namespace."
4. Arden Heights - incorrect - it looks like you chose the wrong sort order, and looked for `pu_zone` rather than `pu_location_id`! See "Working with List Columns: The .list Namespace."

## 7.4 Question 4: Most Common Second Word in Zone Names

Using just `zones_df`, split `zone` into a list on `" "` as seen during the module, and take the 2nd element of every list (index 1, since lists are zero-indexed) using `.list.get()`. Then, using `group_by`, answer the question: what is the most commonly occurring second word in `zones_df` (excluding `null`)?

In [9]:
result = (
    zones_df
    .with_columns(pl.col("zone").str.split(" ").alias("zone_splitted"))
    #### YOUR CODE HERE
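    # Index 1 is the second word (lists are zero-indexed). The output printed below was
    # produced with index 3 (the fourth word), which is why it shows "North" and not "Park".
    .group_by(pl.col("zone_splitted").list.get(1))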
    .agg(pl.len().alias("num_occurences"))
    .sort("num_occurences", descending=True)
)
print(result)

shape: (11, 2)
┌───────────────┬────────────────┐
│ zone_splitted ┆ num_occurences │
│ ---           ┆ ---            │
│ str           ┆ u32            │
╞═══════════════╪════════════════╡
│ null          ┆ 250            │
│ North         ┆ 3              │
│ Island        ┆ 3              │
│ South         ┆ 2              │
│ West          ┆ 1              │
│ …             ┆ …              │
│ Plaza         ┆ 1              │
│ Village       ┆ 1              │
│ Avenue        ┆ 1              │
│ Beach         ┆ 1              │
│ Point         ┆ 1              │
└───────────────┴────────────────┘


1. Park - correct - Indeed, "Park" is a common second word in New York City zone names! See "Working with List Columns: The .list Namespace."
2. `null` - incorrect - The question specifies "excluding null"... take another look! See "Working with List Columns: The .list Namespace."
3. East - incorrect - Make sure to take the 2nd element (index 1), not the 1st (index 0)! See "Working with List Columns: The .list Namespace."
4. North - incorrect - Make sure to take the 2nd element (index 1), not the 4th (index 3)! See "Working with List Columns: The .list Namespace."

## 7.5 Question 5: Rides with Duration Between 60-120 Seconds

How many rides had a duration of more than 60 seconds and less than 120 seconds?

In [15]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns((pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")).dt.total_seconds().alias("trip_duration"))
    .filter(pl.col("trip_duration").lt(120).and_(pl.col("trip_duration").gt(60)))
    .shape
)
print(result)

(27538, 26)


1. 28426 - incorrect - make sure to use "less than", not "less than or equal to". See "Working with Temporal Columns: The .dt Namespace."
2. 27994 - incorrect - make sure to use "greater than", not "greater than or equal to". See "Working with Temporal Columns: The .dt Namespace."
3. 27538 - correct - Perfect! You could solve this with an `.and_()` combination of `.gt()` and `.lt()`, or try the more advanced `.is_between()` with `closed="none"` so that both bounds are excluded. See "Working with Temporal Columns: The .dt Namespace."
4. 1 - incorrect - check again, you might have got the order of the pickup-dropoff subtraction incorrect! See "Working with Temporal Columns: The .dt Namespace."

## 7.6 Question 6: Rides Longer Than One Day

How many rides had a duration of more than 1 day (hint: instead of using `.dt.total_seconds()`, you can use `.dt.total_days()`)?

In [23]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns((pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")).dt.total_days().alias("trip_duration_days"))
    # .dt.total_days() truncates to whole days, so `>= 1` keeps every ride that lasted
    # at least one full day; `> 1` would keep only rides of two days or more.
    .filter(pl.col("trip_duration_days").ge(1))
    .shape
)
print(result)

(20, 26)


1. 3541969 - incorrect - are you sure you're not accidentally using `.total_minutes()`? See "Working with Temporal Columns: The .dt Namespace."
2. 20 - correct - Exactly! Note that you could also use `.dt.total_seconds()` and divide by seconds per day. See "Working with Temporal Columns: The .dt Namespace."
3. 3581500 - incorrect - are you sure you're not accidentally using `.total_seconds()`? See "Working with Temporal Columns: The .dt Namespace."
4. 6 - incorrect - `.dt.total_days()` returns whole days, so filtering for "greater than 1" keeps only rides of two days or more; filter the whole-day count with "greater than or equal to 1" instead. See "Working with Temporal Columns: The .dt Namespace."

## 7.7 Question 7: Zones Containing 'North'

How many zones in `zones_df` contain the word "North"?

In [33]:
result = (
    zones_df
    #### YOUR CODE HERE
    .select(pl.col("zone").str.contains("North").sum())
)
print(result)

shape: (1, 1)
┌──────┐
│ zone │
│ ---  │
│ u32  │
╞══════╡
│ 15   │
└──────┘


1. 15 - correct - Indeed many neighborhoods in New York City have the word "North" in their name! See "Working with String Columns: The .str Namespace."
2. 0 - incorrect - note that string containment checks are case-sensitive in polars, so checking for "north" won't work! See "Working with String Columns: The .str Namespace."
3. 265 - incorrect - looks like you accidentally got the total number of rows! Perhaps you accidentally used `count()` or `len()`... See "Working with String Columns: The .str Namespace."
4. 19 - incorrect - make sure to not accidentally check for "South"! See "Working with String Columns: The .str Namespace."

## 7.8 Question 8: Most Common First Word in Zone Names (Reversed)

For `zones_df`, what is the most common `0th` word in the `zone` column, spelled in reverse? (Hint: split the `zone` column by `" "` into a list of strings; then, take the `0th` element of each list in that column, and apply `.str.reverse()` to it.)

In [37]:
result = (
    zones_df
    #### YOUR CODE HERE
    .group_by(pl.col("zone").str.split(" ").list.get(0).str.reverse())
    .agg(pl.len().alias("num_occurrences"))
    .sort("num_occurrences", descending=True)
)
print(result)

shape: (195, 2)
┌─────────────┬─────────────────┐
│ zone        ┆ num_occurrences │
│ ---         ┆ ---             │
│ str         ┆ u32             │
╞═════════════╪═════════════════╡
│ tsaE        ┆ 13              │
│ tseW        ┆ 5               │
│ nwotdiM     ┆ 4               │
│ htuoS       ┆ 4               │
│ reppU       ┆ 4               │
│ …           ┆ …               │
│ noskcaJ     ┆ 1               │
│ dlroW       ┆ 1               │
│ sdnaltalF   ┆ 1               │
│ hgielretseW ┆ 1               │
│ sserpyC     ┆ 1               │
└─────────────┴─────────────────┘


1. East - incorrect - Don't forget to reverse the string! See "Working with String Columns: The .str Namespace."
2. tseW - incorrect - Not quite, this is the second most common 0th word! See "Working with String Columns: The .str Namespace."
3. kraP - incorrect - You might be taking the 1st element of the list rather than the 0th! See "Working with String Columns: The .str Namespace."
4. tsaE - correct - Exactly! "East" is a very common first word in zone names, and this is its reverse! See "Working with String Columns: The .str Namespace."