# 3. Data Manipulation I: Basics - Quiz

## 3.0. Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
)

## 3.1. Question 1

Using `.select()`, create a column from `df` that indicates whether a toll was paid as part of the trip. What is the datatype of that new column? (Hint: you can check if a toll was paid by seeing if `tolls_amount` is greater than 0.)

In [3]:
result = (
    df
    .select(
        ### YOUR CODE HERE
        pl.col("tolls_amount").gt(0)
    )
)
print(result)

shape: (3_582_628, 1)
┌──────────────┐
│ tolls_amount │
│ ---          │
│ bool         │
╞══════════════╡
│ false        │
│ false        │
│ false        │
│ false        │
│ false        │
│ …            │
│ false        │
│ false        │
│ false        │
│ false        │
│ false        │
└──────────────┘


1. `int64` - incorrect - Are you sure you're checking for "greater than 0" with the right function? Try using `.gt()`!
2. `float64` - incorrect - This is the original datatype of the `tolls_amount` column... are you sure your function worked correctly?
3. `str` - incorrect - Are you sure you are performing the operation on the correct column?
4. `bool` - correct - Indeed, the function `.gt()` will return a boolean--true if greater than, false if not.

## 3.2. Question 2

What was the longest trip (by `trip_distance`) on which no tolls were paid? (Hint: use `.filter()` to keep only the trips with `tolls_amount` equal to `0`, and `.select()` with `.max()` to find the longest trip.)

In [4]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("tolls_amount").eq(0)
    )
    .select(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 176836.3      │
└───────────────┘


1. `176836.3` - correct - Exactly!
2. `0` - incorrect - You might be using `.min()` instead of `.max()` by accident...
3. `176744.79` - incorrect - Make sure to check for rides for which `tolls_amount` was equal to 0, not rides where `tolls_amount` is greater than 0!
4. `176329.23` - incorrect - Are you sure you're using the `tolls_amount` column and not accidentally the `tip_amount` column?

## 3.3. Question 3

What is the mean tip amount for trips where the fare amount was greater than $20?

In [5]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("fare_amount").gt(20)
    )
    .select(
        #### YOUR CODE HERE
        pl.col("tip_amount").mean()
    )
)
print(result)

shape: (1, 1)
┌────────────┐
│ tip_amount │
│ ---        │
│ f64        │
╞════════════╡
│ 5.872741   │
└────────────┘


1. 39.32003 - incorrect - We're looking for the mean `tip_amount` for trips where the `fare_amount` was greater than `$20`, not the mean `fare_amount` for trips where the `fare_amount` was greater than `$20`!
2. 598.58 - incorrect - That's a big tip amount! You might accidentally be taking the maximum tip amount rather than the mean tip amount...
3. 5.870243 - incorrect - Make sure you're using the `.gt` function ("greater than") and not the `.ge` function ("greater than or equal to")!
4. 5.872741 - correct - Exactly!

## 3.4. Question 4

Find the maximum trip distance for trips with a passenger count of 1 or 2.

In [6]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("passenger_count").eq(1)
        .or_(pl.col("passenger_count").eq(2))
    )
    .select(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 66907.9       │
└───────────────┘


1. 159.74 - incorrect - That's the maximum `trip_distance` for trips that had a `passenger_count` of exactly 2... Check your code again!
2. 66907.9 - correct - Exactly!
3. 3.530788 - incorrect - That's the mean `trip_distance` for trips with a passenger count of 1 or 2, but we're looking for the maximum!
4. 1021.99 - incorrect - That's the maximum `total_amount` for trips with a passenger count of 1 or 2, but we're looking for the maximum `trip_distance`!

## 3.5. Question 5

Sort the dataframe by `total_amount` in descending order; then, select and display only the top 5 rows and the columns `trip_distance` and `total_amount`. What are the two values of `trip_distance` associated with the two trips with the highest `total_amount`?

In [7]:
result = (
    df
    .sort(
        #### YOUR CODE HERE
        "total_amount", descending=True
    )
    .select([
        #### YOUR CODE HERE
        "trip_distance", "total_amount"
    ])
    .head()  # Displays the top 5 rows
)
print(result)

shape: (5, 2)
┌───────────────┬──────────────┐
│ trip_distance ┆ total_amount │
│ ---           ┆ ---          │
│ f64           ┆ f64          │
╞═══════════════╪══════════════╡
│ 3.8           ┆ 1021.99      │
│ 181.5         ┆ 951.26       │
│ 0.0           ┆ 851.0        │
│ 0.1           ┆ 841.0        │
│ 159.74        ┆ 807.75       │
└───────────────┴──────────────┘


1. 0.0, 159.74 - incorrect - You might be sorting in ascending order rather than descending order...
2. 3.8, 181.5 - correct
3. 1021.99, 951.26 - incorrect - These are the two highest values for `total_amount`, but we are looking for the values of `trip_distance` associated with the two highest values of `total_amount`!
4. 5.1, 8.3 - incorrect - Are you sure you're looking at the right columns?

## 3.6. Question 6

Calculate the maximum fare amount for trips that had a `tip_amount` greater than `$10` and a `trip_distance` greater than `10` miles.

In [8]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("tip_amount").gt(10)
        .and_(pl.col("trip_distance").gt(10))
    )
    .select(
        #### YOUR CODE HERE
        pl.col("fare_amount").max()
    )
)
print(result)

shape: (1, 1)
┌─────────────┐
│ fare_amount │
│ ---         │
│ f64         │
╞═════════════╡
│ 472.0       │
└─────────────┘


1. 472.0 - correct - Exactly!
2. 900.0 - incorrect - We're looking for the maximum fare amount for trips that had both a `tip_amount` greater than `$10` and a `trip_distance` greater than `10` miles, not just one or the other!
3. 66.124137 - incorrect - We're looking for the maximum fare amount, not the average!
4. 633.3 - incorrect - Are you sure you're looking at the right column?

## 3.7. Question 7

Find the maximum `price_per_mile` (by dividing the `total_amount` by the `trip_distance`) for trips with a distance greater than `30`.

In [9]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("trip_distance").gt(30)
    )
    .select([
        #### YOUR CODE HERE
        (pl.col("total_amount") / pl.col("trip_distance")).max().alias("max_price_per_mile")
    ])
)
print(result)

shape: (1, 1)
┌────────────────────┐
│ max_price_per_mile │
│ ---                │
│ f64                │
╞════════════════════╡
│ 14.067142          │
└────────────────────┘


1. 0.034265 - incorrect - Are you sure you're filtering for just trips with a distance greater than 30, and not filtering for trips with a distance greater than 300?
2. 4.868929 - incorrect - This is the average `price_per_mile`, not the maximum!
3. 14.067142 - correct - Exactly!
4. 9.384736 - incorrect - Are you sure you're looking at the right columns?

## 3.8. Question 8

Of all the trips which had a `trip_distance` of exactly `5`, what was the latest `tpep_pickup_datetime`?

In [10]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("trip_distance").eq(5))
    .select(pl.col("tpep_pickup_datetime").max())
)
print(result)

shape: (1, 1)
┌──────────────────────┐
│ tpep_pickup_datetime │
│ ---                  │
│ datetime[μs]         │
╞══════════════════════╡
│ 2024-03-31 23:46:21  │
└──────────────────────┘


1. 2024-03-31 23:46:21 - correct
2. 2024-03-31 23:57:07 - incorrect - Remember--we're looking for the maximum pickup datetime, not the dropoff datetime!
3. 2024-03-01 00:05:00 - incorrect - Remember--we're looking for the maximum pickup datetime, not the minimum!
4. 2024-04-01 00:34:55 - incorrect - Are you sure you're filtering for only trips that had a `trip_distance` of exactly 5?

## 3.9. Question 9

Filtering only for trips that had a `fare_amount` and a `tip_amount` greater than 0, what was the lowest tip percentage (expressed as a fraction) that somebody paid? (Hint: divide `tip_amount` by `fare_amount`).

In [11]:
result = (
    df
    #### YOUR CODE HERE
    .filter(
        pl.col("fare_amount").gt(0)
        .and_(pl.col("tip_amount").gt(0))
    )
    .select(
        (pl.col("tip_amount") / pl.col("fare_amount"))
        .min().alias("min_tip_percentage")
    )
)
print(result)

shape: (1, 1)
┌────────────────────┐
│ min_tip_percentage │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.00003            │
└────────────────────┘


1. 0.01 - incorrect - Are you sure you're using the right columns for calculating the tip percentage of each ride?
2. 0.276066 - incorrect - We're looking for the minimum tip, not the average tip!
3. -40.0 - incorrect - Don't forget to filter the data!
4. 0.00003 - correct - Exactly!

## 3.10. Question 10

How many trips had a `tip_amount` greater than the `Airport_fee`?

In [12]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("tip_amount").gt(pl.col("Airport_fee")))
    .shape
)
print(result)

(2461463, 19)


1. 3089862 - incorrect - Make sure to check for rides where the `tip_amount` is greater than `Airport_fee`, not rides where `tip_amount` is greater than or equal to `Airport_fee`!
2. 3582628 - incorrect - That's the total number of rides in the dataset! Don't forget to filter.
3. 2461463 - correct - Exactly!
4. 982746 - incorrect - Are you sure you're using the correct columns?