# 4. Data Manipulation II - Advanced Selecting - Quiz.ipynb

## 4.0. Import `polars` and Load Data

In [2]:
import polars as pl

In [3]:
df = pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")

## 4.1. Question 1

Select all columns from the dataframe that have the datatype `pl.String`. How many columns are there in the result?

In [7]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col(pl.String))
    .shape
)
print(result)

(3582628, 1)


1. 1 - correct
2. 2
3. 3
4. 4

## 4.2. Question 2

Select all columns from the dataframe that have the datatype `pl.Int64`. How many columns are there in the result?

In [8]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col(pl.Int64))
    .shape
)
print(result)

(3582628, 3)


1. 1
2. 2
3. 3 - correct
4. 4

## 4.3. Question 3

What fraction of rows have at least one string column equal to an empty string (i.e. `""`)? (Hint: use `pl.any_horizontal()`)

In [10]:
result = (
    df
    .select(
        pl.any_horizontal(pl.col(pl.String).eq("")).mean().alias("fraction")
    )
)
print(result)

shape: (1, 1)
┌──────────┐
│ fraction │
│ ---      │
│ f64      │
╞══════════╡
│ 0.0      │
└──────────┘


1. 0 - correct
2. .25
3. .50
4. 1.0

## 4.4. Question 4

Create a new column called `tip_amount_plus_fare_amount` and sort by it in descending order. What are the `tip_amount` and `fare_amount` for the highest `tip_amount_plus_fare_amount`?

In [12]:
result = (
    df
    #### YOUR CODE HERE
    .select(
        pl.col(["tip_amount", "fare_amount"]),
        (pl.col("tip_amount") + pl.col("fare_amount")).alias("tip_amount_plus_fare_amount")
    )
    .sort(by="tip_amount_plus_fare_amount", descending=True)
)
print(result)

shape: (3_582_628, 3)
┌────────────┬─────────────┬─────────────────────────────┐
│ tip_amount ┆ fare_amount ┆ tip_amount_plus_fare_amount │
│ ---        ┆ ---         ┆ ---                         │
│ f64        ┆ f64         ┆ f64                         │
╞════════════╪═════════════╪═════════════════════════════╡
│ 999.99     ┆ 17.0        ┆ 1016.99                     │
│ 0.0        ┆ 900.0       ┆ 900.0                       │
│ 0.0        ┆ 850.0       ┆ 850.0                       │
│ 0.0        ┆ 840.0       ┆ 840.0                       │
│ 0.0        ┆ 800.0       ┆ 800.0                       │
│ …          ┆ …           ┆ …                           │
│ 0.0        ┆ -558.8      ┆ -558.8                      │
│ 1.0        ┆ -600.0      ┆ -599.0                      │
│ 0.0        ┆ -619.7      ┆ -619.7                      │
│ 161.2      ┆ -800.0      ┆ -638.8                      │
│ 0.01       ┆ -800.0      ┆ -799.99                     │
└────────────┴─────────────┴─────────────────────────────┘

1. `tip_amount = 17.0`, `fare_amount = 999.99`
2. `tip_amount = 17.0`, `fare_amount = 17.0`
3. `tip_amount = 999.99`, `fare_amount = 17.0` - correct
4. `tip_amount = 999.99`, `fare_amount = 999.99`

## 4.5. Question 5

What fraction of rides had the same pickup and dropoff location?

In [16]:
result = (
    df
    .select(pl.col("DOLocationID").eq(pl.col("PULocationID")).mean())
)
print(result)

shape: (1, 1)
┌──────────────┐
│ DOLocationID │
│ ---          │
│ f64          │
╞══════════════╡
│ 0.050603     │
└──────────────┘


1. 0.949397
2. 0.050603 - correct
3. 1.0
4. 0.0

## 4.6. Question 6

Using `.with_columns()`, `pl.all()`, and `.name.suffix()`, add a copy of every column to the dataframe, with the suffix `_new` appended to each copied column's name. How many columns are there in the resulting dataframe?

In [21]:
result = (
    df
    #### YOUR CODE HERE
    .with_columns(pl.all().name.suffix("_new"))
)
print(result.shape)

(3582628, 38)


1. 3582628
2. 38 - correct
3. 19
4. 22

## 4.7. Question 7

Add a new column to the dataframe for every `pl.String` column that checks if that column has an empty string (i.e. equal to `""`). How many columns are in the resultant dataframe?

In [24]:
result = (
    df
    #### YOUR CODE HERE
    .with_columns(pl.col(pl.String).eq("").name.suffix("_is_empty"))
    .shape
)
print(result)

(3582628, 20)


1. 19 - Did you give the new column its own name? Without `.name.suffix()`, `.with_columns()` overwrites the original column instead of adding one.
2. 20 - correct
3. 22 - Are you sure you're performing the operations on `pl.String` columns?
4. 38 - incorrect - You may have tried something with `pl.all()`. Try again!

## 4.8. Question 8

We are now interested only in rides that started and ended in the same location. Filter for trips where the dropoff location equals the pickup location, then remove one of the two pickup/dropoff location columns. What is the shape of the resulting dataframe?

In [27]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("DOLocationID").eq(pl.col("PULocationID")))
    .drop("DOLocationID")
    .shape
)
print(result)

(181291, 18)


1. (181291, 18) - correct
2. (3582628, 18) - incorrect - are you making sure to filter appropriately?
3. (181291, 19) - incorrect - did you make sure to remove either the pickup or dropoff location column from the dataframe?
4. (3401337, 18) - incorrect - are you sure you didn't accidentally take only trips where the dropoff location and pickup location are not equal?
