# 10. Summative - Quiz

## 10.0 Import `polars` and Load Data

In [1]:
import matplotlib.pyplot as plt
import polars as pl
from sklearn.metrics import mean_absolute_error

pl.Config.set_tbl_rows(16)

polars.config.Config

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df_raw = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 10.1 Question 1: Identifying Data Types in a New DataFrame

Create a DataFrame from the data. Which of the following datatypes can be found in the resultant DataFrame? Select all that apply.

In [4]:
address_data_dict = {
    "id": [1, 2, 3, 4, ],
    "street": ["Thoreau Rd", "Flanders St", "Candy Ave", "State St", ],
    "street_number": [17, 18, 39, 30, ]
}
#### YOUR CODE HERE
pl.DataFrame(address_data_dict)

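shape: (4, 3)
┌─────┬─────────────┬───────────────┐
│ id  ┆ street      ┆ street_number │
│ --- ┆ ---         ┆ ---           │
│ i64 ┆ str         ┆ i64           │
╞═════╪═════════════╪═══════════════╡
│ 1   ┆ Thoreau Rd  ┆ 17            │
│ 2   ┆ Flanders St ┆ 18            │
│ 3   ┆ Candy Ave   ┆ 39            │
│ 4   ┆ State St    ┆ 30            │
└─────┴─────────────┴───────────────┘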


1. `i64` - correct - Both `id` and `street_number` get loaded as `i64`. See Module 2, "Getting Started."
2. `u64` - incorrect - Though the values of both `street_number` and `id` appear to be strictly positive, Polars's default behavior is to load integers as `i64`. See Module 2, "Getting Started."
3. `i8` - incorrect - Though the values of both `street_number` and `id` appear to fit in the range of 8-bit values, Polars's default behavior is to load integers as `i64`. See Module 2, "Getting Started."
4. `str` - correct - The column `street` gets loaded as a `str`. See Module 2, "Getting Started."

## 10.2 Question 2: Schema Override for Trip Distance

Load the rides data from the CSV file, using `schema_overrides` to force `trip_distance` to be `pl.Int64`. What happens?

In [5]:
result_df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={
        #### YOUR CODE HERE
        "trip_distance": pl.Int64
    }
)
result_df.head()

ComputeError: could not parse `0.73` as dtype `i64` at column 'trip_distance' (column number 5)

The current offset in the file is 323447842 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `0.73` to the `null_values` list.

Original error: ```remaining bytes non-empty```

1. An error is thrown, stating that "`schema_overrides` only works on `str` columns". - incorrect - An error is indeed thrown, but not this one. See Module 2, "Getting Started."
2. The code runs successfully, casting the would-be `float` column to `pl.Int64` upon instantiation of the DataFrame. - incorrect - This code does not run successfully. See Module 2, "Getting Started."
3. An error is thrown, stating that data from the column can't be parsed to `pl.Int64`. - correct - Fractional values such as `0.73` cannot be parsed as `pl.Int64` while the file is being read. You can still cast the column afterward, once the data has been loaded as floats. See Module 2, "Getting Started."
4. The code runs successfully, ignoring the schema override and simply loading the data as `pl.Float64` - incorrect - This code does not run successfully. See Module 2, "Getting Started."

## 10.3 Question 3: Maximum Congestion Surcharge

What is the maximum `congestion_surcharge` in `rides_df_raw`?

In [6]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .select(
        pl.col("congestion_surcharge").max().name.suffix("_max")
    )
)
print(result)

shape: (1, 1)
┌──────────────────────────┐
│ congestion_surcharge_max │
│ ---                      │
│ f64                      │
╞══════════════════════════╡
│ 2.5                      │
└──────────────────────────┘


1. -2.5 - incorrect - This is the minimum `congestion_surcharge`, but the question asks for the maximum. See Module 3, "Data Manipulation I: Basics."
2. 900.0 - incorrect - This is the maximum `fare_amount`, but the question asks for the maximum `congestion_surcharge`. See Module 3, "Data Manipulation I: Basics."
3. 2.5 - correct - Exactly! You find this value by using the `.max()` function. See Module 3, "Data Manipulation I: Basics."
4. 3.4 - incorrect - If you came up with this result, you may not be using the correct column. See Module 3, "Data Manipulation I: Basics."

## 10.4 Question 4: Top Trip Distance After Sorting

Sort `rides_df_raw` in descending order by the following columns in this order: `congestion_surcharge`, `tip_amount`, `trip_distance`. What is the `trip_distance` of the top trip?

In [7]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .sort(
        by=[
            "congestion_surcharge",
            "tip_amount",
            "trip_distance"
        ],
        descending=True
    )
    .select("trip_distance")
    .head(1)
)
print(result)

shape: (1, 1)
┌───────────────┐
│ trip_distance │
│ ---           │
│ f64           │
╞═══════════════╡
│ 28.9          │
└───────────────┘


1. 176836.3 - incorrect - Make sure you're sorting by the columns in precisely the correct order: `congestion_surcharge`, `tip_amount`, and then `trip_distance`. See Module 3, "Data Manipulation I: Basics."
2. 0.0 - incorrect - Make sure to sort in descending order, not ascending. See Module 3, "Data Manipulation I: Basics."
3. 166.1 - incorrect - If you got this result, you probably didn't select the correct column. See Module 3, "Data Manipulation I: Basics."
4. 28.9 - correct - Exactly! This is the top row after sorting by all the specified columns. See Module 3, "Data Manipulation I: Basics."

## 10.5 Question 5: Trips Within Distance Range

How many trips had a `trip_distance` greater than 1 kilometer (km) and less than 2 km?

In [8]:
kilometers_per_mile = 1.61
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .select(
        # Convert miles to kilometers; the column keeps the name `trip_distance`.
        pl.col("trip_distance") * kilometers_per_mile
    )
    .filter(
        pl.col("trip_distance").gt(1)
        .and_(pl.col("trip_distance").lt(2))
    )
    .shape
)
print(result)

(838278, 1)


1. 838278 - correct - You arrived at this answer by using `.gt()` and `.lt()`, after converting to kilometers. See Module 3, "Data Manipulation I: Basics."
2. 1112153 - incorrect - You probably forgot to convert miles to kilometers. See Module 3, "Data Manipulation I: Basics."
3. 1 - incorrect - This is the number of columns in the result's `.shape`; the number of trips is the number of rows. See Module 3, "Data Manipulation I: Basics."
4. 99283 - incorrect - If you got this result, you likely did not use the correct column for `trip_distance`. See Module 3, "Data Manipulation I: Basics."

## 10.6 Question 6: Comparison of Different Data Types

Try to add a column that checks whether `do_zone` is greater than zero. What happens and why?

In [9]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .with_columns(pl.col("do_zone").gt(0))
)
print(result)

ComputeError: cannot compare string with numeric type (i32)

1. The code crashes because a string cannot be compared with an integer - correct - To compare these two datatypes, something must be done to make them comparable, either casting the integer to a string or casting the string to an integer somehow. See Module 7, "Data Manipulation V: Data Types."
2. The code crashes because a Boolean column cannot be added to a DataFrame with `.with_columns()` - incorrect - There's no problem with adding a Boolean column to a DataFrame, and in fact you saw it done in the module! See Module 7, "Data Manipulation V: Data Types."
3. The code runs successfully, adding a column that checks whether the `do_zone` is alphabetically greater than the string "0" - incorrect - Polars does not do typecasting like this; you must request it explicitly. See Module 7, "Data Manipulation V: Data Types."
4. The code runs successfully; however, since strings and integers can't be compared in Polars, the resultant column is null everywhere - incorrect - Polars does not do this kind of handling, though you could probably use column expressions to do something similar. See Module 7, "Data Manipulation V: Data Types."

## 10.7 Question 7: Equivalent Renaming Expression

The cell below renames two columns by using `.select()` and `.rename()`. What expression using `.select()` and name transformations from the `.name` namespace would accomplish the same thing?

In [10]:
columns_to_rename = ["do_location_id", "pu_location_id"]
df_w_cols_renamed = (
    rides_df_raw
    .select(columns_to_rename)
    .rename({
        column_to_rename: f"{column_to_rename}_renamed"
        for column_to_rename in columns_to_rename
    })
    .head()
)
print(df_w_cols_renamed)

shape: (5, 2)
┌────────────────────────┬────────────────────────┐
│ do_location_id_renamed ┆ pu_location_id_renamed │
│ ---                    ┆ ---                    │
│ i32                    ┆ i32                    │
╞════════════════════════╪════════════════════════╡
│ 239                    ┆ 142                    │
│ 24                     ┆ 238                    │
│ 75                     ┆ 263                    │
│ 162                    ┆ 164                    │
│ 7                      ┆ 263                    │
└────────────────────────┴────────────────────────┘


In [12]:
df_w_cols_renamed = (
    rides_df_raw
    .select(pl.col(["do_location_id", "pu_location_id"]).name.suffix("_renamed"))
    .head()
)
print(df_w_cols_renamed)

shape: (5, 2)
┌────────────────────────┬────────────────────────┐
│ do_location_id_renamed ┆ pu_location_id_renamed │
│ ---                    ┆ ---                    │
│ i32                    ┆ i32                    │
╞════════════════════════╪════════════════════════╡
│ 239                    ┆ 142                    │
│ 24                     ┆ 238                    │
│ 75                     ┆ 263                    │
│ 162                    ┆ 164                    │
│ 7                      ┆ 263                    │
└────────────────────────┴────────────────────────┘


1. `.select(pl.col(pl.Int32).name.suffix("_renamed"))` - incorrect - This almost works, but it accidentally includes `vendor_id` in the renaming. See Module 4, "Data Manipulation II: Advanced Selecting."
2. `.select(pl.col(["do_location_id", "pu_location_id"]).name.suffix("_renamed"))` - correct - Using the `.name.suffix()` function, you can accomplish the same thing that `.rename()` does. See Module 4, "Data Manipulation II: Advanced Selecting."
3. `.select(pl.col(["do_location_id", "pu_location_id"]).alias("_renamed"))` - incorrect - Using `alias()` will simply replace the column name altogether, not add a suffix. Furthermore, this code will rename both columns to the same thing and thus give an error. See Module 4, "Data Manipulation II: Advanced Selecting."
4. `.select(pl.col(pl.String).name.suffix("_renamed"))` - incorrect - This will select the wrong columns. See Module 4, "Data Manipulation II: Advanced Selecting."

## 10.8 Question 8: Adding Suffixed Columns for Int8 Type

Add a few new columns to the DataFrame that copy all `pl.Int8` columns and give them the suffix `_new`. What is the size of the resultant DataFrame?

In [13]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .with_columns(pl.col(pl.Int8).name.suffix("_new"))
    .shape
)
print(result)

(3582628, 25)


1. (3582628, 28) - incorrect - You may have checked for `pl.Int32` columns instead of `pl.Int8` columns. See Module 4, "Data Manipulation II: Advanced Selecting."
2. (0, 0) - incorrect - You probably used `.select()` instead of `.with_columns()`. See Module 4, "Data Manipulation II: Advanced Selecting."
3. (3582628, 25) - correct - Exactly! There are no columns with this datatype, so the shape remains the same. See Module 4, "Data Manipulation II: Advanced Selecting."
4. (3582628, 35) - incorrect - Make sure to check for `pl.Int8` columns, not `pl.Float64` columns. See Module 4, "Data Manipulation II: Advanced Selecting."

## 10.9 Question 9: Most Common Congestion Surcharge

Use `.group_by()` to determine the most common value of `congestion_surcharge` in the dataset. What is it?

In [16]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .group_by("congestion_surcharge")
    .agg(pl.len().alias("num_occurrences"))
    .sort("num_occurrences", descending=True)
)
print(result)

shape: (7, 2)
┌──────────────────────┬─────────────────┐
│ congestion_surcharge ┆ num_occurrences │
│ ---                  ┆ ---             │
│ f64                  ┆ u32             │
╞══════════════════════╪═════════════════╡
│ 2.5                  ┆ 2880218         │
│ null                 ┆ 426190          │
│ 0.0                  ┆ 239793          │
│ -2.5                 ┆ 36421           │
│ 1.0                  ┆ 3               │
│ 0.75                 ┆ 2               │
│ -0.75                ┆ 1               │
└──────────────────────┴─────────────────┘


1. -0.75 - incorrect - This is the value of `congestion_surcharge` with the least occurrences in the dataset, but the question asks for the value that occurs the most. See Module 5, "Data Manipulation III: Grouping and Aggregation."
2. 202.18 - incorrect - If you got this result, you probably didn't use the correct column. See Module 5, "Data Manipulation III: Grouping and Aggregation."
3. 134.51 - incorrect - If you got this result, you probably didn't use the correct column. See Module 5, "Data Manipulation III: Grouping and Aggregation."
4. 2.5 - correct - Exactly! You can group by `congestion_surcharge`, then aggregate with `pl.len()`, and finally sort. See Module 5, "Data Manipulation III: Grouping and Aggregation."

## 10.10 Question 10: Zero-Tip Two-Passenger Rides Count

Create a pivot table with `.pivot()` where the rows reflect whether the `tip_amount` was 0 or not and the columns are `passenger_count`. Then, for each combination of "tip_amount is 0" and `passenger_count`, compute the number of instances by using `aggregate_function` "len". How many rides had a `tip_amount` of 0 and a `passenger_count` of 2?

In [17]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .with_columns([
        pl.col("tip_amount").eq(0).name.suffix("_eq_0")
    ])
    .pivot(
        index="tip_amount",
        on="passenger_count",
        values="tip_amount",
        aggregate_function="len",
        sort_columns=True
    )
)
print(result)

shape: (4_363, 12)
┌────────────┬──────┬────────┬────────┬───┬──────┬──────┬──────┬────────┐
│ tip_amount ┆ 0    ┆ 1      ┆ 2      ┆ … ┆ 7    ┆ 8    ┆ 9    ┆ null   │
│ ---        ┆ ---  ┆ ---    ┆ ---    ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---    │
│ f64        ┆ u32  ┆ u32    ┆ u32    ┆   ┆ u32  ┆ u32  ┆ u32  ┆ u32    │
╞════════════╪══════╪════════╪════════╪═══╪══════╪══════╪══════╪════════╡
│ 2.7        ┆ 384  ┆ 7180   ┆ 1358   ┆ … ┆ null ┆ null ┆ null ┆ 74     │
│ 3.0        ┆ 1426 ┆ 65889  ┆ 12053  ┆ … ┆ null ┆ null ┆ null ┆ 893    │
│ 0.0        ┆ 9740 ┆ 525349 ┆ 104748 ┆ … ┆ 4    ┆ null ┆ 1    ┆ 368549 │
│ 1.29       ┆ 5    ┆ 640    ┆ 146    ┆ … ┆ null ┆ null ┆ null ┆ 43     │
│ 5.13       ┆ null ┆ 504    ┆ 113    ┆ … ┆ null ┆ null ┆ null ┆ 31     │
│ 2.04       ┆ 7    ┆ 1251   ┆ 258    ┆ … ┆ null ┆ null ┆ null ┆ 148    │
│ 2.0        ┆ 2588 ┆ 127344 ┆ 20620  ┆ … ┆ null ┆ null ┆ null ┆ 2007   │
│ 4.54       ┆ 12   ┆ 4747   ┆ 1102   ┆ … ┆ null ┆ null ┆ null ┆ 64     │
│ …          ┆ …   

1. 104748 - correct - Exactly right. This is the value you find by viewing the appropriate cell in the pivot table. See Module 5, "Data Manipulation III: Grouping and Aggregation."
2. 349149 - incorrect - Close! This is the number of rides with two passengers who did give a tip, but the question asks about rides with no tip. See Module 5, "Data Manipulation III: Grouping and Aggregation."
3. 1.7435e6 - incorrect - Your answer might look like this if you used "sum" instead of "len" as an aggregate_function. See Module 5, "Data Manipulation III: Grouping and Aggregation."
4. 1358 - incorrect - You may have overlooked first creating a Boolean column for "tip_amount is 0", as instructed in the question. The pivot table is easier to read when you start with this step. See Module 5, "Data Manipulation III: Grouping and Aggregation."

## 10.11 Question 11: Left-Join Result Shape

Join the following two DataFrames using a left-join (`restaurant_df` into `name_df`, on the name of the restaurant). What is the shape of the resultant DataFrame?

In [18]:
name_df = pl.DataFrame({
    "name": ["dan", "stan", "ran", "cran"],
    "favorite_restaurant": ["Pablo's Pizza", "Taco Time!", "Taco Time!", "Pablo's Pizza"]
})
restaurant_df = pl.DataFrame({
    "restaurant_name": ["Pablo's Pizza", "Taco Time!", "Toledo's Burritos",],
    "restaurant_city": ["Boston", "New York", "Los Angeles",]
})
result = (
    #### YOUR CODE HERE
    name_df
    .join(
        restaurant_df,
        left_on="favorite_restaurant",
        right_on="restaurant_name",
        how="left",
    )
)
print(result)

shape: (4, 3)
┌──────┬─────────────────────┬─────────────────┐
│ name ┆ favorite_restaurant ┆ restaurant_city │
│ ---  ┆ ---                 ┆ ---             │
│ str  ┆ str                 ┆ str             │
╞══════╪═════════════════════╪═════════════════╡
│ dan  ┆ Pablo's Pizza       ┆ Boston          │
│ stan ┆ Taco Time!          ┆ New York        │
│ ran  ┆ Taco Time!          ┆ New York        │
│ cran ┆ Pablo's Pizza       ┆ Boston          │
└──────┴─────────────────────┴─────────────────┘


1. (5, 3) - incorrect - To get this result, you probably joined `name_df` into `restaurant_df`. The order of the join is important! See Module 6, "Data Manipulation IV: Combining Data."
2. The code doesn't run, throwing a `ColumnNotFoundError` - incorrect - This error results from trying to join using the `on` input argument. Be sure to use `left_on` and `right_on`, since the restaurant-name column is named differently in the two DataFrames (`favorite_restaurant` in `name_df`, `restaurant_name` in `restaurant_df`). See Module 6, "Data Manipulation IV: Combining Data."
3. (4, 3) - correct - Exactly! The join columns combine into one, yielding a total of three columns. See Module 6, "Data Manipulation IV: Combining Data."
4. (0, 2) - incorrect - If you got this result, you may have done an anti-join rather than a left-join. See Module 6, "Data Manipulation IV: Combining Data."

## 10.12 Question 12: Anti-Join Result Shape

Join the two DataFrames using an anti-join (`restaurant_df` into `name_df`, on the name of the restaurant). What is the shape of the resultant DataFrame?

In [19]:
name_df = pl.DataFrame({
    "name": ["dan", "stan", "ran", "cran"],
    "favorite_restaurant": ["Pablo's Pizza", "Taco Time!", "Taco Time!", "Pablo's Pizza"]
})
restaurant_df = pl.DataFrame({
    "restaurant_name": ["Pablo's Pizza", "Taco Time!", "Toledo's Burritos",],
    "restaurant_city": ["Boston", "New York", "Los Angeles",]
})
result = (
    #### YOUR CODE HERE
    name_df
    .join(
        restaurant_df,
        left_on="favorite_restaurant",
        right_on="restaurant_name",
        how="anti",
    )
)
print(result)

shape: (0, 2)
┌──────┬─────────────────────┐
│ name ┆ favorite_restaurant │
│ ---  ┆ ---                 │
│ str  ┆ str                 │
╞══════╪═════════════════════╡
└──────┴─────────────────────┘


1. (1, 2) - incorrect - To get this result, you probably anti-joined `name_df` into `restaurant_df`. The order of the anti-join is important! See Module 6, "Data Manipulation IV: Combining Data."
2. The code doesn't run, throwing a `ColumnNotFoundError` - incorrect - This error results from trying to join using the `on` input argument. Be sure to use `left_on` and `right_on`, since the restaurant-name column is named differently in the two DataFrames (`favorite_restaurant` in `name_df`, `restaurant_name` in `restaurant_df`). See Module 6, "Data Manipulation IV: Combining Data."
3. (4, 3) - incorrect - If you got this result, you may have used a left-join. See Module 6, "Data Manipulation IV: Combining Data."
4. (0, 2) - correct - Exactly! Every row in `name_df` has a matching row in `restaurant_df`, so the anti-join returns no rows. See Module 6, "Data Manipulation IV: Combining Data."

## 10.13 Question 13: Null Count After Diagonal Concatenation

Vertically concatenate the DataFrames, setting `how="diagonal"`. How many null values are there in the resultant DataFrame?

In [20]:
df1 = pl.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, 7, 8]
})
df2 = pl.DataFrame({
    "A": [9, 10, None],
    "B": [None, 13, 14],
    "C": [15, 16, 17]
})
df3 = pl.DataFrame({
    "A": [18, None, 20],
    "B": [21, 22, 23],
    "D": [24, 25, None]
})
#### YOUR CODE HERE
result = (
    pl.concat([df1, df2, df3], how="diagonal")
    .null_count()
    .select(pl.sum_horizontal(pl.all()))
)
print(result)

shape: (1, 1)
┌─────┐
│ A   │
│ --- │
│ u32 │
╞═════╡
│ 20  │
└─────┘


1. 20 - correct - Exactly! There are a few nulls in the DataFrame to begin with, and there are even more after a diagonal concatenation. See Module 6, "Data Manipulation IV: Combining Data."
2. 6 - incorrect - This is the number of nulls across the DataFrames to begin with. The concatenation adds more. See Module 6, "Data Manipulation IV: Combining Data."
3. The code doesn't run due to a `ShapeError` - incorrect - Some DataFrames have nonoverlapping columns, so make sure to set `how="diagonal"`. See Module 6, "Data Manipulation IV: Combining Data."
4. 0 - incorrect - There are null values in the DataFrame to begin with, and the concatenation adds more. Did you replace them with something else somehow? See Module 6, "Data Manipulation IV: Combining Data."

## 10.14 Question 14: Fraction of Early March Rides

What fraction of rides in the DataFrame started before 2024-03-15 (i.e., had a `tpep_pickup_datetime` before 2024-03-15)?

In [21]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .select(pl.col("tpep_pickup_datetime").lt(pl.date(2024, 3, 15)).mean())
)
print(result)

shape: (1, 1)
┌──────────────────────┐
│ tpep_pickup_datetime │
│ ---                  │
│ f64                  │
╞══════════════════════╡
│ 0.455173             │
└──────────────────────┘


1. 0.455174 - incorrect - You may have used `.le()` rather than `.lt()` to find the rides that started before the given date. See Module 9, "Integrating with the Data Science Workflow."
2. 0.455173 - correct - You can do a commonsense check to confirm: March 15 is almost halfway through the month, so it makes sense that the number here is almost 0.5! See Module 9, "Integrating with the Data Science Workflow."
3. 1.0 - incorrect - The value of 1.0 represents all the rides! Ensure you are capturing just the fraction of rides specified. See Module 9, "Integrating with the Data Science Workflow."
4. 1630716 - incorrect - The question asks for the fraction of rides that started before March 15, 2024, not the number of rides. See Module 9, "Integrating with the Data Science Workflow."

## 10.15 Question 15: List Column Aggregation Behavior

Group the following DataFrame by `class`, and `.sum()` the `salient_appendages` column (which is a list type) to create a list of all `salient_appendages` for that animal class. What happens?

In [22]:
animal_appendages_df = pl.DataFrame({
    "animal": ["human", "elephant", "shark", "stingray"],
    "class": ["mammal", "mammal", "chondrichthyes", "chondrichthyes"],
    "salient_appendages": [
        ["arm", "leg", "head"],
        ["arm", "leg", "head", "trunk"],
        ["fin", "tail", "jaw"],
        ["fin", "tail", "stinger"],
    ]
})
result = (
    #### YOUR CODE HERE
    animal_appendages_df
    .group_by("class")
    .agg(pl.col("salient_appendages").sum())
)
print(result)

shape: (2, 2)
┌────────────────┬────────────────────┐
│ class          ┆ salient_appendages │
│ ---            ┆ ---                │
│ str            ┆ list[str]          │
╞════════════════╪════════════════════╡
│ mammal         ┆ null               │
│ chondrichthyes ┆ null               │
└────────────────┴────────────────────┘


1. The code runs smoothly, but the resultant aggregated column is filled with nulls - correct - Clearly a different approach to concatenating the lists is needed. As you advance in your mastery of Polars, you will learn more about this. See Module 7, "Data Manipulation V: Data Types."
2. The code runs smoothly, and the resultant column is a combined list of all `salient_appendages` for that animal class - incorrect - This is the desired result, but it requires an advanced technique to achieve. See Module 7, "Data Manipulation V: Data Types."
3. The code crashes with a datatype error, saying that "list columns cannot be added together in a group_by" - incorrect - Indeed, the code doesn't successfully produce the desired outcome. However, it doesn't throw an error. See Module 7, "Data Manipulation V: Data Types."
4. The code runs smoothly, and the resultant column is a list of lists of all values for `salient_appendages` for that animal class - incorrect - This is what would happen if you didn't use the `.sum()` function and just left the aggregation as a simple `.agg(pl.col("salient_appendages"))`. See Module 7, "Data Manipulation V: Data Types."

## 10.16 Question 16: Weekend vs Weekday Tip Comparison

True or False: weekend taxi trips (trips that have a `tpep_pickup_datetime` on Saturday or Sunday) have on average higher tip amounts than non-weekend taxi trips.

In [23]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
    .with_columns(
        pl.col("tpep_pickup_datetime").dt.weekday().eq(6)
        .or_(pl.col("tpep_pickup_datetime").dt.weekday().eq(7))
        .alias("is_weekend_pickup")
    )
    .group_by("is_weekend_pickup")
    .agg(pl.col("tip_amount").mean())
)
print(result)

shape: (2, 2)
┌───────────────────┬────────────┐
│ is_weekend_pickup ┆ tip_amount │
│ ---               ┆ ---        │
│ bool              ┆ f64        │
╞═══════════════════╪════════════╡
│ true              ┆ 2.966726   │
│ false             ┆ 3.294616   │
└───────────────────┴────────────┘


1. False - correct - Exactly! Weekend pickups have an average `tip_amount` of 2.97, while non-weekend pickups have an average `tip_amount` of 3.29. See Module 7, "Data Manipulation V: Data Types."
2. True - incorrect - If you found the statement to be true, you may not have created the column for grouping correctly. See Module 7, "Data Manipulation V: Data Types."

## 10.17 Question 17: NumPy to Polars DataFrame Conversion

Convert the following NumPy array to a Polars DataFrame. What are the datatypes of the resultant DataFrame?

In [24]:
import numpy as np
np_array = np.random.rand(4, 2)
#### YOUR CODE HERE
pl.from_numpy(np_array).head(0)

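shape: (0, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
└──────────┴──────────┘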


1. `f64`, `f64` - correct - Exactly! The data were Float64's in the NumPy array, and they stayed that way when they got converted to a Polars DataFrame. See Module 8, "Data Manipulation VI: Interoperation and IO."
2. `f32`, `f32` - incorrect - Despite being random floats between 0 and 1, the floats are still 64-bit precision. See Module 8, "Data Manipulation VI: Interoperation and IO."
3. `f64`, `f64`, `f64`, `f64` - incorrect - If you got this result, you may have transposed the data somehow. See Module 8, "Data Manipulation VI: Interoperation and IO."
4. The code doesn't run due to a datatype conversion error - incorrect - There is no problem with creating a Polars DataFrame from a NumPy array. See Module 8, "Data Manipulation VI: Interoperation and IO."

## 10.18 Question 18: Polars to Pandas to Polars Series Conversion

Create a Polars series, convert it to Pandas, and then convert it back to Polars. True or False: the result upon returning to Polars is now a single-column DataFrame.

In [25]:
polars_series = pl.Series(name="polars_series", values=list(range(10)))
#### YOUR CODE HERE
pl.from_pandas(polars_series.to_pandas())

shape: (10,)
Series: 'polars_series' [i64]
[
	0
	1
	2
	3
	4
	5
	6
	7
	8
	9
]


1. True - incorrect - If you are seeing a single column, perhaps you accidentally converted it to a DataFrame yourself. See Module 8, "Data Manipulation VI: Interoperation and IO."
2. False - correct - Exactly! The round trip returns a Polars Series, not a DataFrame: Polars cleanly converts a Polars Series to a Pandas Series and back. See Module 8, "Data Manipulation VI: Interoperation and IO."

## 10.19 Question 19: Feature Most Correlated with Tip Amount

Using `rides_df_raw`, which feature is most highly correlated (either positively or negatively) with `tip_amount` (excluding `tip_amount` itself)? Also, please filter out `null` values as done in the module!

In [26]:
#### YOUR CODE HERE
corr = (
    rides_df_raw
    .select(pl.col([pl.Int32, pl.Int64, pl.Float64]))
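    # The null filter below is left commented out. Without it the
    # correlations with `tip_amount` come back empty (see option 4);
    # uncomment it to filter out nulls as the question instructs.
    # .filter(
    #     pl.all_horizontal(pl.all().is_not_null())
    # )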
    .corr()
)
(
    corr
    .select(
        pl.Series(corr.columns),
        "tip_amount"
    )
    .sort(
        pl.col("tip_amount").abs(),
        descending=True
    )
    .head(2)
)

Unnamed: 0_level_0,tip_amount
str,f64
"""passenger_count""",
"""ratecode_id""",


1. `tip_amount` - incorrect - If this were included in the scope of the question, it would be correct since a value always correlates perfectly with itself. However, the question explicitly excludes self-correlation. See Module 9, "Integrating with the Data Science Workflow."
2. `total_amount` - correct - Indeed, `total_amount` is most highly correlated with `tip_amount`. This makes sense since `total_amount` is a sum that includes `tip_amount`! See Module 9, "Integrating with the Data Science Workflow."
3. `congestion_surcharge` - incorrect - This is the feature least correlated with `tip_amount`. You may have made an error in your filtering, sorting, and/or selecting. See Module 9, "Integrating with the Data Science Workflow."
4. `passenger_count` - incorrect - If you got this result, you may have overlooked the instruction to filter out nulls before checking the correlations. See Module 9, "Integrating with the Data Science Workflow."

## 10.20 Question 20: Hourly Average Total Amount Analysis

With `rides_df_raw`, make a plot of 'hour of day of taxi ride' vs 'average total_amount'. Which of the following statements are true? (Hint: more than one option is correct.)

In [4]:
#### YOUR CODE HERE
(
    rides_df_raw
    .group_by(pl.col("tpep_pickup_datetime").dt.hour().alias("tpep_pickup_hour"))
    .agg(pl.col("total_amount").mean().name.prefix("average_"))
    # Sort by hour so the line is drawn in time order.
    .sort("tpep_pickup_hour")
    .plot
    .line(
        x="tpep_pickup_hour",
        y="average_total_amount",
    )
)

1. There appears to be a peak in average total amount around 2:00 - incorrect - There is no peak around 2:00 a.m. In fact, this is the hour with the smallest `average_total_amount`. See Module 9, "Integrating with the Data Science Workflow."
2. There appears to be a peak in average total amount around 5:00 - correct - There is indeed a peak at 5:00 a.m.! Check the plot against your x-axis. See Module 9, "Integrating with the Data Science Workflow."
3. There appears to be a peak in average total amount around 16:00 - correct - There is indeed a peak at 4:00 p.m.! Check the plot against your x-axis. See Module 9, "Integrating with the Data Science Workflow."
4. There appears to be a peak in average total amount around 23:00 - correct - There is indeed a peak at 11:00 p.m.! Check the plot against your x-axis. See Module 9, "Integrating with the Data Science Workflow."