# 3. Data Manipulation I: Basics - Quiz

## 3.0 Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
)

## 3.1 Question 1: Toll Payment Column Datatype

Using `.select()`, fetch a column from `df` that represents whether or not a toll was paid as part of the trip. What is the datatype of that new column? (Hint: You can check whether a toll was paid by seeing if `tolls_amount` is greater than 0.)

In [3]:
result = (
    df
    .select(
        ### YOUR CODE HERE
        pl.col("tolls_amount").gt(0)
    )
)
print(result)

shape: (3_582_628, 1)
┌──────────────┐
│ tolls_amount │
│ ---          │
│ bool         │
╞══════════════╡
│ false        │
│ false        │
│ false        │
│ false        │
│ false        │
│ …            │
│ false        │
│ false        │
│ false        │
│ false        │
│ false        │
└──────────────┘


1. `int64` - incorrect - You may not be checking for "greater than 0" with the right function. Try using `.gt()`. See "Getting Started with Column Expressions."
2. `float64` - incorrect - This is the original datatype of the `tolls_amount` column. See "Getting Started with Column Expressions."
3. `str` - incorrect - If you got this answer, you are probably not performing the operation on the correct column. See "Getting Started with Column Expressions."
4. `bool` - correct - The function `.gt()` will return a boolean datatype: true if greater than, false if not. See "Getting Started with Column Expressions."

## 3.2 Question 2: Longest Trip with Zero Tolls

What was the distance of the longest trip that had zero tolls paid? (Hint: Use `.filter()` to keep only the trips with `tolls_amount` equal to 0, then use `.select()` with `.max()` on `trip_distance` to find the longest trip.)

In [4]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("tolls_amount").eq(0)
    )
    .select(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 176836.3      │
└───────────────┘


1. 176836.3 - correct - This is precisely the result you get when you filter and then take the max. And it's quite a long trip indeed! See "Getting Started with Column Expressions."
2. 0 - incorrect - If you arrived at zero, you might be using `.min()` instead of `.max()`. See "Getting Started with Column Expressions."
3. 176744.79 - incorrect - Make sure to check for rides for which `tolls_amount` is equal to 0, not rides where `tolls_amount` is greater than 0. See "Getting Started with Column Expressions."
4. 176329.23 - incorrect - You may be using the `tip_amount` column rather than the `tolls_amount` column. See "Getting Started with Column Expressions."

## 3.3 Question 3: Mean Tip for High Fare Trips

What is the mean tip amount for trips where the fare amount was greater than $20?

In [5]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("fare_amount").gt(20)
    )
    .select(
        #### YOUR CODE HERE
        pl.col("tip_amount").mean()
    )
)
print(result)

shape: (1, 1)
┌────────────┐
│ tip_amount │
│ ---        │
│ f64        │
╞════════════╡
│ 5.872741   │
└────────────┘


1. 39.32003 - incorrect - You're looking for the mean `tip_amount` for trips where the `fare_amount` is greater than $20, not the mean `fare_amount` for trips where the `fare_amount` is greater than $20. See "Getting Started with Column Expressions."
2. 598.58 - incorrect - That's a big average tip! You might have found the maximum tip rather than the mean tip. See "Getting Started with Column Expressions."
3. 5.870243 - incorrect - You may have used the `.ge()` function ("greater than or equal to") instead of the `.gt()` function ("greater than"). See "Getting Started with Column Expressions."
4. 5.872741 - correct - This is indeed the mean tip amount for trips with a fare greater than $20. You correctly used `gt()` and not `ge()`. See "Getting Started with Column Expressions."

## 3.4 Question 4: Max Trip Distance for 1-2 Passengers

Find the maximum `trip_distance` for trips with a `passenger_count` of 1 or 2.

In [6]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("passenger_count").eq(1)
        .or_(pl.col("passenger_count").eq(2))
    )
    .select(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 66907.9       │
└───────────────┘


1. 159.74 - incorrect - This is the maximum `trip_distance` for trips with a `passenger_count` of exactly 2, not 1 or 2. See "The .filter() Query Statement."
2. 66907.9 - correct - Precisely, and it's a long trip indeed! There are certainly some outliers in this dataset. See "The .filter() Query Statement."
3. 3.530788 - incorrect - This is the mean `trip_distance` for trips with a passenger count of 1 or 2, but the question is asking for the maximum. See "The .filter() Query Statement."
4. 1021.99 - incorrect - This is the maximum `total_amount` for trips with a passenger count of 1 or 2, but the question is asking for the maximum `trip_distance`. See "The .filter() Query Statement."

## 3.5 Question 5: Top 5 Trips by Total Amount

Sort the DataFrame by `total_amount` in descending order. Then select and display only the top five rows and the columns `trip_distance` and `total_amount`. What are the two values of `trip_distance` associated with the two trips with the highest `total_amount`?

In [7]:
result = (
    df
    .sort(
        #### YOUR CODE HERE
        "total_amount", descending=True
    )
    .select([
        #### YOUR CODE HERE
        "trip_distance", "total_amount"
    ])
    .head()  # Displays the top 5 rows
)
print(result)

shape: (5, 2)
┌───────────────┬──────────────┐
│ trip_distance ┆ total_amount │
│ ---           ┆ ---          │
│ f64           ┆ f64          │
╞═══════════════╪══════════════╡
│ 3.8           ┆ 1021.99      │
│ 181.5         ┆ 951.26       │
│ 0.0           ┆ 851.0        │
│ 0.1           ┆ 841.0        │
│ 159.74        ┆ 807.75       │
└───────────────┴──────────────┘


1. 0.0, 159.74 - incorrect - This is the result you get if you sort in ascending order rather than descending order. See "The .sort() Query Statement."
2. 3.8, 181.5 - correct - Exactly! This is the result when you sort by `total_amount` descending. See "The .sort() Query Statement."
3. 1021.99, 951.26 - incorrect - These are the two highest values for `total_amount`, but the question asks for the values of `trip_distance` associated with the two highest values of `total_amount`. See "The .sort() Query Statement."
4. 5.1, 8.3 - incorrect - If you arrived at this answer, you are probably not looking at the right columns. See "The .sort() Query Statement."

## 3.6 Question 6: Maximum Fare for Long Trips with High Tips

Calculate the maximum fare amount for trips with a `tip_amount` greater than $10 and a `trip_distance` greater than 10 miles.

In [8]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("tip_amount").gt(10)
        .and_(pl.col("trip_distance").gt(10))
    )
    .select(
        #### YOUR CODE HERE
        pl.col("fare_amount").max()
    )
)
print(result)

shape: (1, 1)
┌─────────────┐
│ fare_amount │
│ ---         │
│ f64         │
╞═════════════╡
│ 472.0       │
└─────────────┘


1. 472.0 - correct - This problem is solved by using `.filter()` and then `.select(...max())`, and the maximum fare is high indeed! See "The .filter() Query Statement."
2. 900.0 - incorrect - The question asks for the maximum fare for trips that had a `tip_amount` greater than $10 and (not or) a `trip_distance` greater than 10 miles. See "The .filter() Query Statement."
3. 66.124137 - incorrect - The question asks for the maximum fare, but this is the average. See "The .filter() Query Statement."
4. 633.3 - incorrect - If you arrived at this result, you probably did not look at the right column. See "The .filter() Query Statement."

## 3.7 Question 7: Maximum Price per Mile for Long Trips

Find the maximum `price_per_mile` (computed by dividing `total_amount` by `trip_distance`) for trips with a `trip_distance` greater than 30 miles.

In [9]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("trip_distance").gt(30)
    )
    .select([
        #### YOUR CODE HERE
        (pl.col("total_amount") / pl.col("trip_distance")).max().alias("max_price_per_mile")
    ])
)
print(result)

shape: (1, 1)
┌────────────────────┐
│ max_price_per_mile │
│ ---                │
│ f64                │
╞════════════════════╡
│ 14.067142          │
└────────────────────┘


1. 0.034265 - incorrect - To arrive at the correct answer, you need to filter for trips with a distance greater than 30, not for trips with a distance greater than 300. See "The .filter() Query Statement."
2. 4.868929 - incorrect - This is the average `price_per_mile`, not the maximum. See "The .filter() Query Statement."
3. 14.067142 - correct - You correctly used `.filter()`, then column arithmetic to compute `price_per_mile`, and finally `.max()` to get the answer. See "The .filter() Query Statement."
4. 9.384736 - incorrect - If you arrived at this result, you are probably not looking at the right columns. See "The .filter() Query Statement."

## 3.8 Question 8: Latest Pickup for 5-Mile Trips

Of all the trips with a `trip_distance` of exactly 5, what is the latest `tpep_pickup_datetime`?

In [10]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("trip_distance").eq(5))
    .select(pl.col("tpep_pickup_datetime").max())
)
print(result)

shape: (1, 1)
┌──────────────────────┐
│ tpep_pickup_datetime │
│ ---                  │
│ datetime[μs]         │
╞══════════════════════╡
│ 2024-03-31 23:46:21  │
└──────────────────────┘


1. 2024-03-31 23:46:21 - correct - Exactly! Only a few trips have a `trip_distance` of exactly 5, and this is the latest one. See "The .filter() Query Statement."
2. 2024-03-31 23:57:07 - incorrect - Remember that you're looking for the maximum pickup datetime, not the dropoff datetime. See "The .filter() Query Statement."
3. 2024-03-01 00:05:00 - incorrect - Remember that you're looking for the maximum pickup datetime, not the minimum. See "The .filter() Query Statement."
4. 2024-04-01 00:34:55 - incorrect - Remember to filter for only those trips with a `trip_distance` of exactly 5. See "The .filter() Query Statement."

## 3.9 Question 9: Lowest Tip Percentage

On those trips with a `fare_amount` and a `tip_amount` greater than 0, what is the lowest tip percentage (expressed as a fraction) that somebody paid? (Hint: Divide `tip_amount` by `fare_amount`.)

In [11]:
result = (
    df
    #### YOUR CODE HERE
    .filter(
        pl.col("fare_amount").gt(0)
        .and_(pl.col("tip_amount").gt(0))
    )
    .select(
        (pl.col("tip_amount") / pl.col("fare_amount"))
        .min().alias("min_tip_percentage")
    )
)
print(result)

shape: (1, 1)
┌────────────────────┐
│ min_tip_percentage │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.00003            │
└────────────────────┘


1. 0.01 - incorrect - If you got this result, you probably did not use the right columns for calculating the tip percentage of each ride. See "The .filter() Query Statement."
2. 0.276066 - incorrect - The question asks for the minimum tip percentage, not the average tip percentage. See "The .filter() Query Statement."
3. -40.0 - incorrect - This is the result if you overlook filtering the data to look at just trips with a positive `fare_amount` and `tip_amount`. See "The .filter() Query Statement."
4. 0.00003 - correct - You correctly used `.filter()`, then column arithmetic to compute the tip percentage, and finally `.min()` to get the answer. See "The .filter() Query Statement."

## 3.10 Question 10: Trips with Tips Exceeding Airport Fee

How many trips have a `tip_amount` greater than the `Airport_fee`?

In [12]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("tip_amount").gt(pl.col("Airport_fee")))
    .shape
)
print(result)

(2461463, 19)


1. 3582628 - incorrect - This is the total number of rides in the dataset! Don't forget to filter. See "Getting Started with Column Expressions."
2. 2 - incorrect - If you arrived at this result, you are probably not counting the rows that remain after the filter. See "Getting Started with Column Expressions."
3. 0 - incorrect - If you arrived at this result, your filter is probably not comparing `tip_amount` with `Airport_fee`. See "Getting Started with Column Expressions."
4. 2461463 - correct - You correctly used `.filter()` and then used one of the many ways to check the DataFrame shape: `.shape`, `display()`, `print()`, etc. See "Getting Started with Column Expressions."