# 4. Data Manipulation II - Advanced Selecting - Quiz.ipynb

## 4.0. Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
df = pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")

## 4.1. Question 1

Select all columns from the dataframe that have the datatype `pl.String`. How many columns are there in the result?

In [3]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col(pl.String))
    .shape
)
print(result)

(3582628, 1)


1. 1 - correct - Exactly!
2. 19 - incorrect - These are all the columns in the original dataframe! Are you sure you have your `.select()` statement correct?
3. 3 - incorrect - Are you sure you're not checking for a datatype other than `pl.String`?
4. 0 - incorrect - There is at least one column with the `pl.String` data type... check again!

## 4.2. Question 2

Select all columns from the dataframe that have the datatype `pl.Int64`. How many columns are there in the result?

In [4]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col(pl.Int64))
    .shape
)
print(result)

(3582628, 3)


1. 19 - incorrect - These are all the columns in the original dataframe! Are you sure you have your `.select()` statement correct?
2. 1 - incorrect - Are you sure you're not checking for a datatype other than `pl.Int64`?
3. 3 - correct - Exactly!
4. 3582628 - incorrect - This is the number of rows, not the number of columns!

## 4.3. Question 3

What fraction of rows have at least one of their `pl.Float64` columns equal to exactly `0`? (Hint: use `pl.any_horizontal()`.)

In [5]:
result = (
    df
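    # NOTE: this cell selects pl.Float32, not the pl.Float64 the question asks
    # for, so the output below corresponds to option 1. It also applies .eq(0)
    # after pl.any_horizontal() rather than to each column before combining.
    .select(
        pl.any_horizontal(pl.col(pl.Float32)).eq(0).mean().alias("fraction")
    )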
)
print(result)

shape: (1, 1)
┌──────────┐
│ fraction │
│ ---      │
│ f64      │
╞══════════╡
│ 0.0      │
└──────────┘


1. 0 - incorrect - Make sure you're using `pl.Float64` and not another datatype.
2. 0.000076 - correct - Exactly!
3. 0.50 - incorrect - Are you sure you're using the correct datatype?
4. 1.0 - incorrect - This is all the rows! Check your code again, maybe you have the aggregation incorrect...

## 4.4. Question 4

Create a new column called `tip_amount_plus_fare_amount` (the sum of `tip_amount` and `fare_amount`), then sort the dataframe by this new column in descending order. What are the `tip_amount` and `fare_amount` for the row with the highest `tip_amount_plus_fare_amount`?

In [6]:
result = (
    df
    #### YOUR CODE HERE
    .select(
        pl.col(["tip_amount", "fare_amount"]),
        (pl.col("tip_amount") + pl.col("fare_amount")).alias("tip_amount_plus_fare_amount")
    )
    .sort(by="tip_amount_plus_fare_amount", descending=True)
)
print(result)

shape: (3_582_628, 3)
┌────────────┬─────────────┬─────────────────────────────┐
│ tip_amount ┆ fare_amount ┆ tip_amount_plus_fare_amount │
│ ---        ┆ ---         ┆ ---                         │
│ f64        ┆ f64         ┆ f64                         │
╞════════════╪═════════════╪═════════════════════════════╡
│ 999.99     ┆ 17.0        ┆ 1016.99                     │
│ 0.0        ┆ 900.0       ┆ 900.0                       │
│ 0.0        ┆ 850.0       ┆ 850.0                       │
│ 0.0        ┆ 840.0       ┆ 840.0                       │
│ 0.0        ┆ 800.0       ┆ 800.0                       │
│ …          ┆ …           ┆ …                           │
│ 0.0        ┆ -558.8      ┆ -558.8                      │
│ 1.0        ┆ -600.0      ┆ -599.0                      │
│ 0.0        ┆ -619.7      ┆ -619.7                      │
│ 161.2      ┆ -800.0      ┆ -638.8                      │
│ 0.01       ┆ -800.0      ┆ -799.99                     │
└────────────┴─────────────┴─────────────────────────────┘

1. `tip_amount = 17.0`, `fare_amount = 999.99` - incorrect - You might have the two columns mixed up!
2. `tip_amount = 0.01`, `fare_amount = -800.0` - incorrect - Make sure to sort in descending order, not ascending!
3. `tip_amount = 999.99`, `fare_amount = 17.0` - correct - Exactly!
4. `tip_amount = 999.99`, `fare_amount = 999.99` - incorrect - Are you sure you're computing the column `tip_amount_plus_fare_amount` correctly?

## 4.5. Question 5

What fraction of rides had the same pickup and dropoff location?

In [7]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col("DOLocationID").eq(pl.col("PULocationID")).mean())
)
print(result)

shape: (1, 1)
┌──────────────┐
│ DOLocationID │
│ ---          │
│ f64          │
╞══════════════╡
│ 0.050603     │
└──────────────┘


1. 0.949397 - incorrect - Close... the answer is actually 1 minus this number! Check your code again.
2. 0.050603 - correct - Exactly!
3. 1.0 - incorrect - This would mean that all the rides had the same pickup and dropoff location; try again!
4. 0.0 - incorrect - This would mean that none of the rides had the same pickup and dropoff location; try again!

## 4.6. Question 6

Using `.with_columns()`, `pl.all()`, and `.name.suffix()`, add a copy of every column to the dataframe, with `_new` appended to each copied column's name. How many columns are there in the resulting dataframe?

In [8]:
result = (
    df
    #### YOUR CODE HERE
    .with_columns(pl.all().name.suffix("_new"))
)
print(result.shape)

(3582628, 38)


1. 3582628 - incorrect - This is the number of rows, not columns; try again!
2. 38 - correct - Exactly!
3. 19 - incorrect - This is the number of columns in the original dataframe; the answer should actually be two times this number!
4. 22 - incorrect - Are you sure you selected all the columns in your `.with_columns()` call?

## 4.7. Question 7

Add a new column to the dataframe for every `pl.String` column that checks whether each value in that column is an empty string (i.e. equal to `""`). How many columns are in the resulting dataframe?

In [9]:
result = (
    df
    #### YOUR CODE HERE
    .with_columns(pl.col(pl.String).eq("").name.suffix("_is_empty"))
    .shape
)
print(result)

(3582628, 20)


1. 1 - incorrect - Are you using `.select()`? Be sure to use `.with_columns()` since we are adding columns.
2. 20 - correct - Exactly! There's only one column with the datatype `pl.String`.
3. 22 - incorrect - Are you sure you're performing the operations on `pl.String` columns?
4. 38 - incorrect - You may have tried something with `pl.all()`. Try again!

## 4.8. Question 8

We'd like a dataframe of only rides that took place in one location (i.e. where `DOLocationID` equals `PULocationID`). This also means that we no longer need both of the columns `DOLocationID` and `PULocationID` (since they are equal). So, filter for same-pickup-dropoff trips, and remove either one of the pickup/dropoff location columns, and rename the other one to just be `LocationID`. What is the shape of the resultant dataframe?

In [10]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("DOLocationID").eq(pl.col("PULocationID")))
    .drop("DOLocationID")
    .rename({"PULocationID": "LocationID"})
    .shape
)
print(result)

(181291, 18)


1. (181291, 18) - correct - Exactly! A small fraction of the rides had the same pickup and dropoff location.
2. (3582628, 18) - incorrect - This is the number of rows in the original dataframe; are you making sure to filter appropriately?
3. (181291, 19) - incorrect - Did you make sure to remove either the pickup or dropoff location column from the dataframe?
4. (3401337, 18) - incorrect - Are you sure you didn't accidentally take only trips where the dropoff location and pickup location are not equal?