# 9. Integrating Polars Into the Data Science Workflow - Quiz

## 9.0. Import `polars` and Load Data

In [1]:
import matplotlib.pyplot as plt
import polars as pl
from sklearn.metrics import mean_absolute_error

pl.Config.set_tbl_rows(16)

polars.config.Config

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df_raw = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 9.1. Question 1

Which feature is least correlated with `"passenger_count"` (either negatively or positively)? (Hint: you might need the polars function for absolute value, `.abs()`. Also, please filter out `null` values as done in the module!)

In [4]:
#### YOUR CODE HERE
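corr = (
    rides_df_raw
    # Keep only the Int32, Int64 and Float64 columns.
    .select(pl.col([pl.Int32, pl.Int64, pl.Float64]))
    # Drop every row that contains a null in any column.
    .filter(
        pl.all_horizontal(pl.all().is_not_null())
    )
    .corr()
)
(
    corr
    # Label each row of the correlation matrix with its feature name.
    .select(
        pl.Series(corr.columns),
        "passenger_count"
    )
    # Rank by absolute correlation, smallest first.
    .sort(
        pl.col("passenger_count").abs(),
        descending=False
    )
    .head(2)
)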

| | passenger_count |
| --- | --- |
| str | f64 |
| "trip_distance" | 0.000956 |
| "improvement_surcharge" | -0.003378 |


1. `passenger_count` - incorrect - we are not including self-correlation here.
2. `extra` - incorrect - we are looking for the lowest absolute correlation, so don't forget to take the absolute value!
3. `trip_distance` - correct
4. `vendor_id` - incorrect - we are looking for the least correlated, not the most correlated!

## 9.2. Question 2

Plot `total_amount` as a function of `trip_distance`. Which of the following statements about the resultant plot is false?

In [5]:
#### YOUR CODE HERE
rides_df_raw.sample(10000).plot.scatter(x="trip_distance", y="total_amount")

ModuleUpgradeRequiredError: altair>=5.4.0 is required for `.plot`

1. A second, smaller share of the data adheres to a correlation line with a slope of approximately `$20/mile - $22/mile`. - correct - this statement is false; there is no such correlation line.
2. The majority of the data adheres to a correlation line with a slope of approximately `$5/mile - $7/mile`. - incorrect - this statement is true; some data deviates from this trend, but the majority follows it.
3. Some rides appear to have a negative trip distance. - incorrect - this statement is true; in fact, a non-negligible number of rides have a negative trip distance.
4. A non-negligible minority of the data appears to have a trip distance of exactly 0. - incorrect - this statement is true; in fact, there is a spike of data along the y-axis, where trip distance equals 0.

## 9.3. Question 3

Plot a histogram of `fare_amount`. Is the resulting distribution unimodal or multimodal (i.e. does the distribution have one peak or several)? (Hint: exclude any noisy spikes!)

In [None]:
#### YOUR CODE HERE
rides_df_raw.plot.hist("fare_amount", bins=1000, xlim=(-1, 200))

1. Unimodal - correct
2. Multimodal - incorrect

## 9.4. Question 4

Given the following toy dataframe of `y_predicted` and `y_truth`, measure the `mean_absolute_error`. True or False: the result is greater than `0.5`. (Hint: use the `sklearn` implementation of `mean_absolute_error`.)

In [None]:
toy_result_df = pl.DataFrame({
    "y_predicted": [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, ],
    "y_truth": [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ]
})
#### YOUR CODE HERE
print(mean_absolute_error(toy_result_df["y_truth"], toy_result_df["y_predicted"]))

1. True - correct
2. False - incorrect
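
As a cross-check on the `sklearn` result, the mean absolute error can also be computed by hand from its definition, the mean of `|y_truth - y_predicted|`. The sketch below copies the toy values into plain Python lists so that it runs without any third-party packages; for 0/1 labels the result is simply the share of mismatched rows.

```python
# Same values as toy_result_df, written as plain lists.
y_predicted = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1,
               0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
y_truth = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,
           0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

# Absolute error per row; with 0/1 labels each entry is 1 for a mismatch, else 0.
absolute_errors = [abs(t - p) for t, p in zip(y_truth, y_predicted)]

# Mean absolute error = total absolute error / number of rows.
mae = sum(absolute_errors) / len(absolute_errors)

print(sum(absolute_errors), len(absolute_errors), mae)
```

The lists disagree in 21 of 37 positions, so the error is 21/37, roughly 0.568, which is greater than 0.5.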

## 9.5. Question 5

In the module, we used `.sample()` to reduce our data to a fixed number of rows by passing in the number of rows we wanted, e.g. `.sample(10000)`. However, `.sample()` can also take a fraction of the rows, via `.sample(fraction=X)`, where `X` must be between 0 and 1. Use this option to reduce the data to 2% of its original size. How many rows does the result have?

In [None]:
result = (
    rides_df_raw
    .sample(fraction=0.02)
    .shape
)
print(result)

1. 3582628 - incorrect - it looks like `.sample()` didn't work; `3582628` is just the size of the entire dataframe!
2. 2 - incorrect - if you're going to pass in a fraction as an argument, you have to pass it to the keyword argument `fraction`!
3. 0 - incorrect - make sure to convert `2%` to a fraction, and use the `fraction` keyword argument!
4. 71652 - correct
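
The expected row count can be sanity-checked with plain arithmetic. This sketch takes the full dataframe size from the first answer option and assumes that the fractional row count is truncated toward zero, which is consistent with the correct answer above.

```python
total_rows = 3_582_628  # size of the entire dataframe, per the first answer option
fraction = 0.02         # 2% expressed as a fraction between 0 and 1

# Assumption: the fractional row count (71652.56) is truncated, not rounded.
expected_rows = int(total_rows * fraction)

print(expected_rows)
```

Rounding instead of truncating would give 71653, which is not among the answer options.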