# 3. Data Manipulation I: Basics - Quiz

## 3.0. Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={"tpep_pickup_datetime": pl.Datetime, "tpep_dropoff_datetime": pl.Datetime}
)

## 3.1. Question 1

Using `.select()`, fetch a column from `df` that represents whether a toll was paid as part of the trip. What is the data type of that new column?

In [3]:
result = (
    df
    .select(
        #### YOUR CODE HERE
        pl.col("tolls_amount").gt(0)
    )
)
print(result)

shape: (3_582_628, 1)
┌──────────────┐
│ tolls_amount │
│ ---          │
│ bool         │
╞══════════════╡
│ false        │
│ false        │
│ false        │
│ false        │
│ false        │
│ …            │
│ false        │
│ false        │
│ false        │
│ false        │
│ false        │
└──────────────┘


1. `int64`
2. `float64`
3. `str`
4. `bool` - correct

## 3.2. Question 2

What was the distance of the longest trip that had `0` tolls paid? (Hint: use `.filter()` to keep only the trips with `0` tolls, and `.select()` to find the longest trip.)

In [4]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("tolls_amount").eq(0)
    )
    .select(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 176836.3      │
└───────────────┘


1. `176836.3` - correct
2. `100039.2`
3. `89312.0`
4. `3749.5`

## 3.3. Question 3

What is the mean tip amount for trips where the fare amount was greater than $20?

In [5]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("fare_amount") > 20
    )
    .select(
        #### YOUR CODE HERE
        pl.col("tip_amount").mean()
    )
)
print(result)

shape: (1, 1)
┌────────────┐
│ tip_amount │
│ ---        │
│ f64        │
╞════════════╡
│ 5.872741   │
└────────────┘


1. 22.384950
2. 7.192837
3. 18.28375
4. 5.872741 - correct

## 3.4. Question 4

Find the maximum trip distance for trips with a passenger count of 1 or 2.

In [6]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("passenger_count").eq(1)
        .or_(pl.col("passenger_count").eq(2))
    )
    .select(
        #### YOUR CODE HERE
        pl.col("trip_distance").max()
    )
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 66907.9       │
└───────────────┘


1. 3894.1
2. 66907.9 - correct
3. 898.0
4. 5.4

## 3.5. Question 5

Sort the dataframe by `total_amount` in descending order. Display only the top 5 rows and the columns `trip_distance` and `total_amount`. What are the two values of `trip_distance` associated with the two trips with the highest `total_amount`?

In [7]:
result = (
    df
    .sort(
        #### YOUR CODE HERE
        "total_amount", descending=True
    )
    .select([
        #### YOUR CODE HERE
        "trip_distance", "total_amount"
    ])
    .head()
)
print(result)

shape: (5, 2)
┌───────────────┬──────────────┐
│ trip_distance ┆ total_amount │
│ ---           ┆ ---          │
│ f64           ┆ f64          │
╞═══════════════╪══════════════╡
│ 3.8           ┆ 1021.99      │
│ 181.5         ┆ 951.26       │
│ 0.0           ┆ 851.0        │
│ 0.1           ┆ 841.0        │
│ 159.74        ┆ 807.75       │
└───────────────┴──────────────┘


1. 2.2, 1.1
2. 3.8, 181.5 - correct
3. 4.6, 99.2
4. 5.1, 8.3

## 3.6. Question 6

Calculate the maximum fare amount for trips that had a `tip_amount` greater than `$10` and a `trip_distance` greater than `10` miles.

In [8]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("tip_amount").gt(10)
        .and_(pl.col("trip_distance").gt(10))
    )
    .select(
        #### YOUR CODE HERE
        pl.col("fare_amount").max()
    )
)
print(result)

shape: (1, 1)
┌─────────────┐
│ fare_amount │
│ ---         │
│ f64         │
╞═════════════╡
│ 472.0       │
└─────────────┘


1. 472.0 - correct
2. 522.1
3. 30.3
4. 633.3

## 3.7. Question 7

Find the maximum `price_per_mile` (by dividing the `total_amount` by the `trip_distance`) for trips with a distance greater than `30`.

In [9]:
result = (
    df
    .filter(
        #### YOUR CODE HERE
        pl.col("trip_distance").gt(30)
    )
    .select([
        #### YOUR CODE HERE
        (pl.col("total_amount") / pl.col("trip_distance")).max().alias("max_price_per_mile")
    ])
)
print(result)


1. 2.059684
2. 59.293857
3. 14.067142 - correct
4. 9.384736

## 3.8. Question 8

Of all the trips which had a `trip_distance` of exactly `5`, what was the latest `tpep_pickup_datetime`?

In [10]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("trip_distance").eq(5))
    .select(pl.col("tpep_pickup_datetime").max())
)
print(result)

shape: (1, 1)
┌──────────────────────┐
│ tpep_pickup_datetime │
│ ---                  │
│ datetime[μs]         │
╞══════════════════════╡
│ 2024-03-31 23:46:21  │
└──────────────────────┘


1. 2024-03-31 23:46:21 - correct
2. 2024-03-30 10:30:33
3. 2024-03-30 17:48:11
4. 2024-03-31 23:52:25

## 3.9. Question 9

Filtering only for trips that had a `fare_amount` greater than 0, what was the lowest tip percentage (expressed as a fraction) that somebody paid? (Hint: divide `tip_amount` by `fare_amount`).

In [11]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("fare_amount").gt(0))
    .select(
        (pl.col("tip_amount") / pl.col("fare_amount"))
        .min().alias("min_tip_percentage")
    )
)
print(result)

shape: (1, 1)
┌────────────────────┐
│ min_tip_percentage │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.0                │
└────────────────────┘


1. 0.01
2. 0.02
3. 0.006
4. 0.0 - correct

## 3.10. Question 10

How many trips had a `tip_amount` greater than the `Airport_fee`?

In [12]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("tip_amount").gt(pl.col("Airport_fee")))
    .shape
)
print(result)

(2461463, 19)


1. 394871
2. 1980384
3. 2461463 - correct
4. 982746