# 7. Data Manipulation V - Data Types - Quiz

## 7.0. Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 7.1. Question 1

Extract the day of the week (as a string) and the hour from the 'tpep_pickup_datetime' column. Then, calculate the average fare amount for each day-hour combination, and sort the results by average fare amount. Which day-hour combination had the highest average fare amount?

In [4]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns([
        pl.col("tpep_pickup_datetime").dt.weekday().alias("day_of_week"),
        pl.col("tpep_pickup_datetime").dt.hour().alias("hour")
    ])
    .group_by(["day_of_week", "hour"])
    .agg(avg_fare=pl.col("fare_amount").mean())
    .sort("avg_fare", descending=True)
)
print(result)

shape: (168, 3)
┌─────────────┬──────┬───────────┐
│ day_of_week ┆ hour ┆ avg_fare  │
│ ---         ┆ ---  ┆ ---       │
│ i8          ┆ i8   ┆ f64       │
╞═════════════╪══════╪═══════════╡
│ 3           ┆ 4    ┆ 28.732735 │
│ 2           ┆ 4    ┆ 28.361275 │
│ 1           ┆ 4    ┆ 28.345536 │
│ 1           ┆ 1    ┆ 28.123612 │
│ 7           ┆ 23   ┆ 27.736774 │
│ …           ┆ …    ┆ …         │
│ 6           ┆ 10   ┆ 15.954722 │
│ 6           ┆ 1    ┆ 15.582336 │
│ 6           ┆ 3    ┆ 15.483906 │
│ 7           ┆ 2    ┆ 15.34928  │
│ 6           ┆ 2    ┆ 14.980229 │
└─────────────┴──────┴───────────┘


1. day=6, hour=2 - incorrect - you might be sorting ascending rather than descending.
2. day=3, hour=4 - correct
3. day=7, hour=23 - incorrect - you might be taking the average `total_amount` rather than the average `fare_amount`.
4. day=1, hour=0 - incorrect - you might be taking the weekday and hour from `tpep_dropoff_datetime` instead of `tpep_pickup_datetime`.

## 7.2. Question 2

Which is the `do_zone` with the highest trip duration (where "trip duration" is measured as the `.total_seconds()` between `tpep_pickup_datetime` and `tpep_dropoff_datetime`)?

In [5]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns(
        pl.col("tpep_dropoff_datetime")
        .sub(pl.col("tpep_pickup_datetime"))
        .dt.total_seconds()
        .alias("trip_duration")
    )
    .sort("trip_duration", descending=True)
    .head(1)
    .select(["do_zone", "trip_duration"])
)
print(result)

shape: (1, 2)
┌─────────────────────────────────┬───────────────┐
│ do_zone                         ┆ trip_duration │
│ ---                             ┆ ---           │
│ str                             ┆ i64           │
╞═════════════════════════════════╪═══════════════╡
│ Saint Michaels Cemetery/Woodsi… ┆ 545553        │
└─────────────────────────────────┴───────────────┘


1. Saint Michaels Cemetery/Woodside - correct
2. Midtown Center - incorrect - you might be taking the `do_zone` with the lowest trip duration.
3. 207 - incorrect - almost right! But the answer should be a `do_zone`, not a `do_location_id`.
4. Woodside - not quite!

## 7.3. Question 3

With a group-by in `polars`, instead of finding some aggregate summary statistic for each group, you can also collect all the elements for each group into a list by simply passing in the column you'd like to aggregate to a list as a name (see below). With this, for each `pu_location_id`, make a column that aggregates all the `do_zones` associated with that `pu_location_id`; what is the `pu_location_id` with the longest list of associated `do_zone`s (hint: use the `.list` namespace)?

In [6]:
result = (
    rides_df
    .group_by("pu_location_id")
    .agg(pl.col("do_zone"))
    #### YOUR CODE HERE
    .sort(pl.col("do_zone").list.len(), descending=True)
)
print(result)

shape: (259, 2)
┌────────────────┬─────────────────────────────────┐
│ pu_location_id ┆ do_zone                         │
│ ---            ┆ ---                             │
│ i32            ┆ list[str]                       │
╞════════════════╪═════════════════════════════════╡
│ 161            ┆ ["Lenox Hill West", "Greenwich… │
│ 132            ┆ ["South Jamaica", "Clinton Hil… │
│ 237            ┆ ["Lenox Hill West", "Union Sq"… │
│ 236            ┆ ["Upper East Side South", "Blo… │
│ 162            ┆ ["Flatiron", "Midtown Center",… │
│ …              ┆ …                               │
│ 199            ┆ ["Midtown Center", "Times Sq/T… │
│ 84             ┆ ["Sheepshead Bay"]              │
│ 156            ┆ ["Coney Island"]                │
│ 109            ┆ ["Great Kills"]                 │
│ 5              ┆ ["Arden Heights"]               │
└────────────────┴─────────────────────────────────┘


1. 161 - correct
2. 5 - incorrect - are you sure you didn't find the `pu_location_id` with the shortest list of associated `do_zone`s?
3. Midtown Center - incorrect - we are looking for the `pu_location_id` with the longest list of associated `do_zone`s, not the `pu_zone`!
4. Arden Heights - incorrect - it looks like you chose the wrong sort order, and looked for `pu_zone` rather than `pu_location_id`!

## 7.4. Question 4

You could have also answered Question 3 by finding the `pl.len()` of all the `do_zone`s for each group, rather than first putting them all into a list. Answer the same question now, but using a simple `.group_by().agg()`. Do you get the same answer as in Question 3?

In [7]:
result = (
    rides_df
    .group_by("pu_location_id")
    #### YOUR CODE HERE
    .agg(pl.col("do_zone").len())
    .sort(pl.col("do_zone"), descending=True)
)
print(result)

shape: (259, 2)
┌────────────────┬─────────┐
│ pu_location_id ┆ do_zone │
│ ---            ┆ ---     │
│ i32            ┆ u32     │
╞════════════════╪═════════╡
│ 161            ┆ 163269  │
│ 132            ┆ 157706  │
│ 237            ┆ 155631  │
│ 236            ┆ 146044  │
│ 162            ┆ 123805  │
│ …              ┆ …       │
│ 199            ┆ 2       │
│ 84             ┆ 1       │
│ 156            ┆ 1       │
│ 109            ┆ 1       │
│ 5              ┆ 1       │
└────────────────┴─────────┘


1. Yes - correct
2. No

## 7.5. Question 5

Questions 3 and 4 create a list of the `do_zones` associated with all rides for every given `pu_location_id`. The result of this is that all the duplicates get counted! To handle this, there is a function `.list.unique()` to remove duplicates from a list column. Answer Question 3 again, but this time removing duplicate values of `do_zone` with `.list.unique()`. Is the answer the same as Question 3?

In [8]:
result = (
    rides_df
    .group_by("pu_location_id")
    .agg(pl.col("do_zone"))
    #### YOUR CODE HERE
    .sort(pl.col("do_zone").list.unique().list.len(), descending=True)
)
print(result)

shape: (259, 2)
┌────────────────┬─────────────────────────────────┐
│ pu_location_id ┆ do_zone                         │
│ ---            ┆ ---                             │
│ i32            ┆ list[str]                       │
╞════════════════╪═════════════════════════════════╡
│ 132            ┆ ["South Jamaica", "Clinton Hil… │
│ 138            ┆ ["Old Astoria", "Lincoln Squar… │
│ 70             ┆ ["Flushing", "Gramercy", … "Ja… │
│ 230            ┆ ["Midtown North", "Clinton Wes… │
│ 186            ┆ ["Lenox Hill West", "Murray Hi… │
│ …              ┆ …                               │
│ 199            ┆ ["Midtown Center", "Times Sq/T… │
│ 84             ┆ ["Sheepshead Bay"]              │
│ 156            ┆ ["Coney Island"]                │
│ 109            ┆ ["Great Kills"]                 │
│ 5              ┆ ["Arden Heights"]               │
└────────────────┴─────────────────────────────────┘


1. Yes
2. No - correct

## 7.6. Question 6

Using just `zones_df`, split `zone` into a list on `" "` as seen during the module, and take the 2nd element of every list using `.list.get()`. Then, using `group_by`, answer the question--what is the most commonly occurring second word in `zones_df` (excluding `null`)?

In [9]:
result = (
    zones_df
    .with_columns(pl.col("zone").str.split(" ").alias("zone_splitted"))
    #### YOUR CODE HERE
    .group_by(pl.col("zone_splitted").list.get(3))
    .agg(pl.len().alias("num_occurences"))
    .sort("num_occurences", descending=True)
)
print(result)

shape: (11, 2)
┌───────────────┬────────────────┐
│ zone_splitted ┆ num_occurences │
│ ---           ┆ ---            │
│ str           ┆ u32            │
╞═══════════════╪════════════════╡
│ null          ┆ 250            │
│ North         ┆ 3              │
│ Island        ┆ 3              │
│ South         ┆ 2              │
│ Village       ┆ 1              │
│ …             ┆ …              │
│ Beach         ┆ 1              │
│ West          ┆ 1              │
│ Plaza         ┆ 1              │
│ Point         ┆ 1              │
│ Field         ┆ 1              │
└───────────────┴────────────────┘


1. Park - correct
2. `null` - incorrect - The question specifies "excluding null"... take another look!
3. East - incorrect - Make sure to take the 2nd element, not the 0th!
4. North - incorrect - Make sure to take the 2nd element, not the 3rd!

## 7.7. Question 7

How many rides had a duration of more than 60 seconds and less than 120 seconds?

In [12]:
result = (
    rides_df
    #### YOUR CODE HERE
    .with_columns((pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")).dt.total_seconds().alias("trip_duration"))
    .filter(pl.col("trip_duration").lt(120).and_(pl.col("trip_duration").gt(60)))
    .shape
)
print(result)

(27538, 26)


3. 27538 - correct