# 9. Integrating Polars Into the Data Science Workflow - Quiz

## 9.0. Import `polars` and Load Data

In [13]:
import matplotlib.pyplot as plt
import polars as pl
from sklearn.metrics import mean_absolute_error
import hvplot.polars

pl.Config.set_tbl_rows(16)

polars.config.Config

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df_raw = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 9.1 Question 1: Feature Least Correlated with Passenger Count

Using `rides_df_raw`, which feature is least correlated with `passenger_count` (whether negatively or positively)? (Hint: you might need the Polars expression method for absolute value, `.abs()`; also, remember to filter out `null` values as done in the module!)

In [4]:
#### YOUR CODE HERE
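# Correlation matrix over the numeric columns, using only rows with no nulls.
corr = (
    rides_df_raw
    .select(pl.col([pl.Int32, pl.Int64, pl.Float64]))
    .filter(
        pl.all_horizontal(pl.all().is_not_null())
    )
    .corr()
)
# Label each row with its feature name, then sort by absolute correlation
# so the value closest to zero comes first.
(
    corr
    .select(
        pl.Series(corr.columns),
        "passenger_count"
    )
    .sort(
        pl.col("passenger_count").abs(),
        descending=False
    )
    .head(2)
)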

| (unnamed) | passenger_count |
| --- | --- |
| str | f64 |
| "trip_distance" | 0.000956 |
| "improvement_surcharge" | -0.003378 |


1. `passenger_count` - incorrect - Self-correlation is always 1.0, or 100%, so it cannot be the least correlated feature. See "Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars."
2. `extra` - incorrect - The question asks for the lowest absolute correlation (closest to zero, whether positive or negative). You may have forgotten to take the absolute value. See "Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars."
3. `trip_distance` - correct - Exactly! By using `.corr()`, we can compute correlations, and this feature has the lowest absolute correlation with `passenger_count`. See "Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars."
4. `vendor_id` - incorrect - You probably identified the feature that is most correlated with `passenger_count` rather than the least correlated feature. See "Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars."

## 9.2 Question 2: Total Amount vs Trip Distance Plot Analysis

Plot `total_amount` as a function of `trip_distance`. Which statements about the plot are true?

In [9]:
#### YOUR CODE HERE
rides_df_raw.sample(5000).plot.scatter(x="trip_distance", y="total_amount")

1. There is a second sub-majority of the data which adheres to a correlation line which has a slope of approximately $20/mile - $22/mile - incorrect - There is no such correlation line. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
2. The majority of the data adheres to a correlation line which has a slope of approximately $5/mile - $7/mile - correct - This statement is true; while some data deviates from this trend, the majority adheres to the correlation line described. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
3. Some rides appear to have a negative trip distance - correct - Indeed, a significant number of rides have a negative trip distance. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
4. A non-negligible minority of the data appears to have a trip distance of exactly 0 - correct - In fact, there is a spike of data along the y-axis, where trip distance equals zero. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”


## 9.3 Question 3: Fare Amount Distribution Analysis

Plot a histogram of `fare_amount`. Is the resulting distribution unimodal or multimodal (i.e., is there one peak in the distribution or several)? (Hint: exclude any noisy spikes!)

In [14]:
#### YOUR CODE HERE
rides_df_raw.hvplot.hist("fare_amount", bins=1000, xlim=(-1, 200))

1. Unimodal - correct - Exactly! The distribution looks like a unimodal log-normal distribution. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
2. Multimodal - incorrect - You may have overlooked excluding spikes of statistical noise from your analysis. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”


## 9.4 Question 4: Mean Absolute Error Calculation

Given the following toy DataFrame of `y_predicted` and `y_truth`, measure the `mean_absolute_error`. True or False: the result is greater than 0.5. (Hint: use the `sklearn` implementation of `mean_absolute_error`.)

In [None]:
toy_result_df = pl.DataFrame({
    "y_predicted": [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, ],
    "y_truth": [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ]
})
#### YOUR CODE HERE
print(mean_absolute_error(toy_result_df["y_truth"], toy_result_df["y_predicted"]))

1. True - correct - Exactly! The data can be passed directly into `mean_absolute_error` as a Polars series. See “Machine Learning Model Building, Evaluation, and Discussion.”
2. False - incorrect - Are you sure? Make sure you're using the correct function from scikit-learn! See “Machine Learning Model Building, Evaluation, and Discussion.”


## 9.5 Question 5: Sampling DataFrame with Fraction

In the module, we used `.sample()` to reduce our data to a fixed number of rows by passing in the desired row count, e.g. `.sample(10000)`. However, `.sample()` can also take a fraction of rows via `.sample(fraction=X)`, where `X` must be between 0 and 1. Use this second form to reduce the data to 2% of its original size. What is the shape of the result?

In [None]:
result = (
    rides_df_raw
    .sample(fraction=0.02)
    .shape
)
print(result)

1. 3582628 - incorrect - It looks like the `.sample()` function didn't work, since 3582628 is the size of the entire DataFrame. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
2. 2 - incorrect - To pass in a fraction, you have to use the `fraction` keyword argument. You may have overlooked that here. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
3. 0 - incorrect - Make sure to convert 2% to a fraction and use the `fraction` keyword argument. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”
4. 71652 - correct - Exactly! There are 3,582,628 rows in the original DataFrame, and 2% of 3,582,628 is 71,652. See “Brief Data Exploration: Plots, Correlations, and Summary Statistics with Polars.”