# 10. Summative - Quiz

## 10.0 Import `polars` and Load Data

In [None]:
import matplotlib.pyplot as plt
import polars as pl
from sklearn.metrics import mean_absolute_error

pl.Config.set_tbl_rows(16)

In [None]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [None]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df_raw = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 10.1 Question 1: Identifying Data Types in a New DataFrame

Create a DataFrame from the data. Which of the following datatypes can be found in the resultant DataFrame? Select all that apply.

In [None]:
address_data_dict = {
    "id": [1, 2, 3, 4, ],
    "street": ["Thoreau Rd", "Flanders St", "Candy Ave", "State St", ],
    "street_number": [17, 18, 39, 30, ]
}
#### YOUR CODE HERE

1. `i64`
2. `u64`
3. `i8`
4. `str`

## 10.2 Question 2: Schema Override for Trip Distance

Load the rides data from the CSV file, using `schema_overrides` to force `trip_distance` to be `pl.Int64`. What happens?

In [None]:
result_df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={
        #### YOUR CODE HERE
    }
)
result_df.head()

1. An error is thrown, stating that "`schema_overrides` only works on `str` columns".
2. The code runs successfully, casting the would-be `float` column to `pl.Int64` upon instantiation of the dataframe.
3. An error is thrown, stating that data from the column can't be parsed to `pl.Int64`.
4. The code runs successfully, ignoring the schema override and simply loading the data as `pl.Float64`.

## 10.3 Question 3: Maximum Congestion Surcharge

What is the maximum `congestion_surcharge` in `rides_df_raw`?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. -2.5
2. 900.0
3. 2.5
4. 3.4

## 10.4 Question 4: Top Trip Distance After Sorting

Sort `rides_df_raw` in descending order by the following columns in this order: `congestion_surcharge`, `tip_amount`, `trip_distance`. What is the `trip_distance` of the top trip?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. 176836.3
2. 0.0
3. 166.1
4. 28.9

## 10.5 Question 5: Trips Within Distance Range

How many trips had a `trip_distance` greater than 1 kilometer (km) and less than 2 km? Note that `trip_distance` is recorded in miles, so convert it with the factor defined below.

In [None]:
kilometers_per_mile = 1.61
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. 838278
2. 1112153
3. 1
4. 99283

## 10.6 Question 6: Comparison of Different Data Types

Try to add a column that checks whether `do_zone` is greater than zero. What happens, and why?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. The code crashes because a string cannot be compared with an integer
2. The code crashes because a Boolean column cannot be added to a DataFrame with `.with_columns()`
3. The code runs successfully, adding a column that checks whether the `do_zone` is alphabetically greater than the string "0"
4. The code runs successfully; however, since strings and integers can't be compared in Polars, the resultant column is null everywhere

## 10.7 Question 7: Equivalent Renaming Expression

Consider the renaming below, which uses `.select()` and `.rename()`. Which expression, using `.select()` and a name transformation from the `.name` namespace, would accomplish the same thing?

In [None]:
columns_to_rename = ["do_location_id", "pu_location_id"]
df_w_cols_renamed = (
    rides_df_raw
    .select(columns_to_rename)
    .rename({
        column_to_rename: f"{column_to_rename}_renamed"
        for column_to_rename in columns_to_rename
    })
    .head()
)
print(df_w_cols_renamed)

1. `.select(pl.col(pl.Int32).name.suffix("_renamed"))`
2. `.select(pl.col(["do_location_id", "pu_location_id"]).name.suffix("_renamed"))`
3. `.select(pl.col(["do_location_id", "pu_location_id"]).alias("_renamed"))`
4. `.select(pl.col(pl.String).name.suffix("_renamed"))`

## 10.8 Question 8: Adding Suffixed Columns for Int8 Type

Add new columns to the DataFrame that copy all `pl.Int8` columns and give the copies the suffix `_new`. What is the shape of the resultant DataFrame?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. (3582628, 28)
2. (0, 0)
3. (3582628, 25)
4. (3582628, 35)

## 10.9 Question 9: Most Common Congestion Surcharge

Use `.group_by()` to determine the most common value of `congestion_surcharge` in the dataset. What is it?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. -0.75
2. 202.18
3. 134.51
4. 2.5

## 10.10 Question 10: Zero-Tip Two-Passenger Rides Count

Create a pivot table with `.pivot()` where the rows reflect whether the `tip_amount` was 0 or not and the columns are the values of `passenger_count`. Then, for each combination of "`tip_amount` is 0" and `passenger_count`, compute the number of instances by setting `aggregate_function="len"`. How many rides had a `tip_amount` of 0 and a `passenger_count` of 2?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. 104748
2. 349149
3. 1.7435e6
4. 1358

## 10.11 Question 11: Left-Join Result Shape

Join the following two DataFrames using a left-join (`restaurant_df` into `name_df`, on the name of the restaurant). What is the shape of the resultant DataFrame?

In [None]:
name_df = pl.DataFrame({
    "name": ["dan", "stan", "ran", "cran"],
    "favorite_restaurant": ["Pablo's Pizza", "Taco Time!", "Taco Time!", "Pablo's Pizza"]
})
restaurant_df = pl.DataFrame({
    "restaurant_name": ["Pablo's Pizza", "Taco Time!", "Toledo's Burritos",],
    "restaurant_city": ["Boston", "New York", "Los Angeles",]
})
result = (
    #### YOUR CODE HERE
)
print(result)

1. (5, 3)
2. The code doesn't run, throwing a `ColumnNotFoundError`
3. (4, 3)
4. (0, 2)

## 10.12 Question 12: Anti-Join Result Shape

Join the two DataFrames using an anti-join (`restaurant_df` into `name_df`, on the name of the restaurant). What is the shape of the resultant DataFrame?

In [None]:
name_df = pl.DataFrame({
    "name": ["dan", "stan", "ran", "cran"],
    "favorite_restaurant": ["Pablo's Pizza", "Taco Time!", "Taco Time!", "Pablo's Pizza"]
})
restaurant_df = pl.DataFrame({
    "restaurant_name": ["Pablo's Pizza", "Taco Time!", "Toledo's Burritos",],
    "restaurant_city": ["Boston", "New York", "Los Angeles",]
})
result = (
    #### YOUR CODE HERE
)
print(result)

1. (1, 2)
2. The code doesn't run, throwing a `ColumnNotFoundError`
3. (4, 3)
4. (0, 2)

## 10.13 Question 13: Null Count After Diagonal Concatenation

Vertically concatenate the DataFrames, setting `how="diagonal"`. How many null values are there in the resultant DataFrame?

In [None]:
df1 = pl.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, 7, 8]
})
df2 = pl.DataFrame({
    "A": [9, 10, None],
    "B": [None, 13, 14],
    "C": [15, 16, 17]
})
df3 = pl.DataFrame({
    "A": [18, None, 20],
    "B": [21, 22, 23],
    "D": [24, 25, None]
})
#### YOUR CODE HERE

1. 20
2. 6
3. The code doesn't run due to a `ShapeError`
4. 0

## 10.14 Question 14: Fraction of Early March Rides

What fraction of rides in the DataFrame started before 2024-03-15 (i.e., had a `tpep_pickup_datetime` before 2024-03-15)?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. 0.455174
2. 0.455173
3. 1.0
4. 1630716

## 10.15 Question 15: List Column Aggregation Behavior

Group the following DataFrame by `class`, and `.sum()` the `salient_appendages` column (which is a list type) to create a list of all `salient_appendages` for that animal class. What happens?

In [None]:
animal_appendages_df = pl.DataFrame({
    "animal": ["human", "elephant", "shark", "stingray"],
    "class": ["mammal", "mammal", "chondrichthyes", "chondrichthyes"],
    "salient_appendages": [
        ["arm", "leg", "head"],
        ["arm", "leg", "head", "trunk"],
        ["fin", "tail", "jaw"],
        ["fin", "tail", "stinger"],
    ]
})
result = (
    #### YOUR CODE HERE
)
print(result)

1. The code runs smoothly, but the resultant aggregated column is filled with nulls
2. The code runs smoothly, and the resultant column is a combined list of all `salient_appendages` for that animal class
3. The code crashes with a datatype error, saying that "list columns cannot be added together in a group_by"
4. The code runs smoothly, and the resultant column is a list of lists of all values for `salient_appendages` for that animal class

## 10.16 Question 16: Weekend vs Weekday Tip Comparison

True or False: weekend taxi trips (trips that have a `tpep_pickup_datetime` on Saturday or Sunday) have on average higher tip amounts than non-weekend taxi trips.

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. False
2. True

## 10.17 Question 17: NumPy to Polars DataFrame Conversion

Convert the following NumPy array to a Polars DataFrame. What are the datatypes of the resultant DataFrame?

In [None]:
import numpy as np
np_array = np.random.rand(4, 2)
#### YOUR CODE HERE

1. `f64`, `f64`
2. `f32`, `f32`
3. `f64`, `f64`, `f64`, `f64`
4. The code doesn't run due to a datatype conversion error

## 10.18 Question 18: Polars to Pandas to Polars Series Conversion

Create a Polars Series, convert it to pandas, and then convert it back to Polars. True or False: the result upon returning to Polars is now a single-column DataFrame.

In [None]:
polars_series = pl.Series(name="polars_series", values=list(range(10)))
#### YOUR CODE HERE

1. True
2. False

## 10.19 Question 19: Feature Most Correlated with Tip Amount

Using `rides_df_raw`, which feature is most highly correlated (either positively or negatively) with `tip_amount` (excluding `tip_amount` itself)? Also, please filter out `null` values as done in the module!

In [None]:
#### YOUR CODE HERE

1. `tip_amount`
2. `total_amount`
3. `congestion_surcharge`
4. `passenger_count`

## 10.20 Question 20: Hourly Average Total Amount Analysis

With `rides_df_raw`, make a plot of "hour of day of taxi ride" vs. "average `total_amount`". Which of the following statements are true? Select all that apply.

In [None]:
#### YOUR CODE HERE

1. There appears to be a peak in average total amount around 2:00.
2. There appears to be a peak in average total amount around 5:00.
3. There appears to be a peak in average total amount around 16:00.
4. There appears to be a peak in average total amount around 23:00.