# 4. Data Manipulation II - Advanced Selecting - Quiz

## 4.0 Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
df = pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")

## 4.1 Question 1: String Columns Count

Select all columns from the DataFrame that have the datatype `pl.String`. How many columns are there in the result?

In [3]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col(pl.String))
    .shape
)
print(result)

(3582628, 1)


1. 1 - correct - Just as in the example in the module, only one column in the DataFrame is a string datatype. See "Operating on Multiple Columns at the Same Time."
2. 19 - incorrect - This is the number of columns in the original DataFrame. Your `.select()` statement may be incorrect. See "Operating on Multiple Columns at the Same Time."
3. 3 - incorrect - You may have checked for a datatype other than `pl.String`. See "Operating on Multiple Columns at the Same Time."
4. 0 - incorrect - There is at least one column with the `pl.String` datatype. See "Operating on Multiple Columns at the Same Time."

## 4.2 Question 2: Int64 Columns Count

How many columns in the DataFrame have the datatype `pl.Int64`?

In [4]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col(pl.Int64))
    .shape
)
print(result)

(3582628, 3)


1. 19 - incorrect - This is the number of columns in the original DataFrame. Your `.select()` statement may be incorrect. See "Operating on Multiple Columns at the Same Time."
2. 1 - incorrect - You may have checked for a datatype other than `pl.Int64`. See "Operating on Multiple Columns at the Same Time."
3. 3 - correct - Just like the example in the module, three columns in the DataFrame have a `pl.Int64` datatype. See "Operating on Multiple Columns at the Same Time."
4. 3582628 - incorrect - This is the number of rows, not the number of columns. See "Operating on Multiple Columns at the Same Time."

## 4.3 Question 3: Rows with Zero Float64 Values

What fraction of rows have at least one of their `pl.Float64` columns equal to exactly 0? (Hint: Use `pl.any_horizontal()`.)

In [5]:
result = (
    df
    .select(
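        # Note: this cell uses pl.Float32 rather than pl.Float64 and reproduces
        # the incorrect answer 0 (option 1). An expression matching the question
        # applies the zero check to each Float64 column before combining:
        #   pl.any_horizontal(pl.col(pl.Float64).eq(0)).mean().alias("fraction")
        pl.any_horizontal(pl.col(pl.Float32)).eq(0).mean().alias("fraction")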
    )
)
print(result)

shape: (1, 1)
┌──────────┐
│ fraction │
│ ---      │
│ f64      │
╞══════════╡
│ 0.0      │
└──────────┘


1. 0 - incorrect - You may not have used the `pl.Float64` datatype. See "Operating on Multiple Columns at the Same Time."
2. 0.000076 - correct - There are a few ways you could have gotten to this answer. The most straightforward is to construct a column expression like 'has at least one Float64 value equal to 0' and take the `.mean()`. See "Operating on Multiple Columns at the Same Time."
3. 0.50 - incorrect - You may not have used the correct datatype. See "Operating on Multiple Columns at the Same Time."
4. 1.0 - incorrect - The value 1.0 represents all the rows! Check your code again; your aggregation may be incorrect. See "Operating on Multiple Columns at the Same Time."

## 4.4 Question 4: Highest Tip and Fare for New Column

Create a new column called `tip_amount_plus_fare_amount`. Then sort the DataFrame by this new column in descending order. What are the `tip_amount` and `fare_amount` for the highest `tip_amount_plus_fare_amount`?

In [6]:
result = (
    df
    #### YOUR CODE HERE
    .select(
        pl.col(["tip_amount", "fare_amount"]),
        (pl.col("tip_amount") + pl.col("fare_amount")).alias("tip_amount_plus_fare_amount")
    )
    .sort(by="tip_amount_plus_fare_amount", descending=True)
)
print(result)

shape: (3_582_628, 3)
┌────────────┬─────────────┬─────────────────────────────┐
│ tip_amount ┆ fare_amount ┆ tip_amount_plus_fare_amount │
│ ---        ┆ ---         ┆ ---                         │
│ f64        ┆ f64         ┆ f64                         │
╞════════════╪═════════════╪═════════════════════════════╡
│ 999.99     ┆ 17.0        ┆ 1016.99                     │
│ 0.0        ┆ 900.0       ┆ 900.0                       │
│ 0.0        ┆ 850.0       ┆ 850.0                       │
│ 0.0        ┆ 840.0       ┆ 840.0                       │
│ 0.0        ┆ 800.0       ┆ 800.0                       │
│ …          ┆ …           ┆ …                           │
│ 0.0        ┆ -558.8      ┆ -558.8                      │
│ 1.0        ┆ -600.0      ┆ -599.0                      │
│ 0.0        ┆ -619.7      ┆ -619.7                      │
│ 161.2      ┆ -800.0      ┆ -638.8                      │
│ 0.01       ┆ -800.0      ┆ -799.99                     │
└────────────┴─────────────┴─────────────────────────────┘

1. `tip_amount = 17.0, fare_amount = 999.99` - incorrect - If you got this result, you probably have the two columns mixed up. See "Adding New Columns with .with_columns()."
2. `tip_amount = 0.01, fare_amount = -800.0` - incorrect - Make sure to sort in descending order, not ascending. See "Adding New Columns with .with_columns()."
3. `tip_amount = 999.99, fare_amount = 17.0` - correct - This is a very high tip amount! It's likely some noise; there is a lot of that in the dataset. See "Adding New Columns with .with_columns()."
4. `tip_amount = 999.99, fare_amount = 999.99` - incorrect - If you got this result, you're probably not computing the column `tip_amount_plus_fare_amount` correctly. See "Adding New Columns with .with_columns()."

## 4.5 Question 5: Same Pickup-Dropoff Fraction

What fraction of rides have the same location for pickup and dropoff?

In [7]:
result = (
    df
    #### YOUR CODE HERE
    .select(pl.col("DOLocationID").eq(pl.col("PULocationID")).mean())
)
print(result)

shape: (1, 1)
┌──────────────┐
│ DOLocationID │
│ ---          │
│ f64          │
╞══════════════╡
│ 0.050603     │
└──────────────┘


1. 0.949397 - incorrect - Close... the answer is actually 1 minus this number. Check your code again. See "Adding New Columns with .with_columns()."
2. 0.050603 - correct - There are a few ways you could arrive at this answer. The most straightforward is to construct a column expression of 'same-pickup-dropoff' and take the `.mean()`. See "Adding New Columns with .with_columns()."
3. 1.0 - incorrect - A fraction of 1.0 would mean that all the rides have the same pickup and dropoff location. See "Adding New Columns with .with_columns()."
4. 0.0 - incorrect - An answer of zero would mean that none of the rides have the same pickup and dropoff location. See "Adding New Columns with .with_columns()."

## 4.6 Question 6: Duplicate Columns with Suffix

Using `.with_columns()`, `pl.all()`, and `.name.suffix()`, add to the DataFrame a copy of all the columns with `_new` added to the end of each column name. How many columns are there in the resultant DataFrame?

In [8]:
result = (
    df
    #### YOUR CODE HERE
    .with_columns(pl.all().name.suffix("_new"))
)
print(result.shape)

(3582628, 38)


1. 3582628 - incorrect - This is the number of rows, not columns. See "Adding New Columns with .with_columns()."
2. 38 - correct - There are 19 columns in the original DataFrame, so when you add each column with just a new name, the number of columns doubles to 38. See "Adding New Columns with .with_columns()."
3. 19 - incorrect - This is the number of columns in the original DataFrame; the answer should actually be two times this number. See "Adding New Columns with .with_columns()."
4. 22 - incorrect - If you arrived at this result, you may not have selected all the columns in your `.with_columns()` call. See "Adding New Columns with .with_columns()."

## 4.7 Question 7: Empty String Check Columns

Add a new column to the DataFrame for every `pl.String` column that checks whether that column has an empty string (i.e., a string equal to `""`). How many columns are in the resultant DataFrame?

In [9]:
result = (
    df
    #### YOUR CODE HERE
    .with_columns(pl.col(pl.String).eq("").name.suffix("_is_empty"))
    .shape
)
print(result)

(3582628, 20)


1. 1 - incorrect - Be sure to use `.with_columns()` rather than `.select()`, since you are adding columns. See "Adding New Columns with .with_columns()."
2. 20 - correct - Exactly! There's only one column with the datatype `pl.String`, so one new column is added to the original 19 for a total of 20. See "Adding New Columns with .with_columns()."
3. 22 - incorrect - You may not have performed the operations on the `pl.String` columns. See "Adding New Columns with .with_columns()."
4. 38 - incorrect - You may have tried to perform operations using `pl.all()`. See "Adding New Columns with .with_columns()."

## 4.8 Question 8: Same Location Trips DataFrame

Your goal is to produce a DataFrame of only the rides that took place in one location (i.e., where `DOLocationID` equals `PULocationID`). This means that you no longer need both of the columns `DOLocationID` and `PULocationID` (since they are equal). So filter for trips with the same pickup and dropoff sites and remove either one of the pickup/dropoff location columns. Then rename the remaining one to `LocationID`. What is the shape of the resultant DataFrame?

In [10]:
result = (
    df
    #### YOUR CODE HERE
    .filter(pl.col("DOLocationID").eq(pl.col("PULocationID")))
    .drop("DOLocationID")
    .rename({"PULocationID": "LocationID"})
    .shape
)
print(result)

(181291, 18)


1. (181291, 18) - correct - Exactly! A small fraction of the rides have the same pickup and dropoff location. See "Renaming Columns with .rename()."
2. (3582628, 18) - incorrect - This is the number of rows in the original DataFrame. Check that you filtered appropriately. See "Renaming Columns with .rename()."
3. (181291, 19) - incorrect - You may have overlooked removing either the pickup or dropoff location column from the DataFrame. See "Renaming Columns with .rename()."
4. (3401337, 18) - incorrect - If you got this result, you may have included only trips where the dropoff location and pickup location are not equal. See "Renaming Columns with .rename()."