# 10. Integrating Polars Into the Data Science Workflow

By this point in this course, we've gained a lot of dexterity with `polars`; now, it's time to use it in a real data science problem!

The problem we will be solving is **tip amount prediction**, using the NYC taxi rides data that we've used throughout this course. The goal is to provide a service for taxi drivers: before starting a ride, a driver enters the details of the ride and receives a prediction of the tip that customer can be expected to pay. Drivers could use this prediction to decide how best to interact with their passengers to increase their tip. Furthermore, feature importance insights from such a model could help a driver choose the best places in the city to work, the best times of the week, and so on.

We'll go through the entire data science process:
1. Setting Up the Environment
2. Loading Data
3. Brief Data Exploration
4. Cleaning and Preparing the Data
5. Feature Engineering and Selection
6. Model Building
7. Model Evaluation
8. Discussion

Let's get into it!

## 10.1. Setting up the Environment

Import `polars` as usual...

In [1]:
import polars as pl

But also import some other tools from the data science suite that we'll need throughout this notebook.

In [None]:
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor

## 10.2. Loading Data

First, we load the data about zones:

In [None]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

Then, we load the yellow taxi trips data for both February and March (using `*` file pattern notation), joining in the zone information as before:

In [None]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

zone_df_columns = ["borough", "zone", "service_zone",]

rides_df_raw = (
    pl.read_parquet("../data/yellow_tripdata_2024-*.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(zones_df, left_on="pu_location_id", right_on="location_id")
    .rename({zone_df_column: f"pu_{zone_df_column}" for zone_df_column in zone_df_columns})
    .join(zones_df, left_on="do_location_id", right_on="location_id")
    .rename({zone_df_column: f"do_{zone_df_column}" for zone_df_column in zone_df_columns})
)

In [None]:
display(rides_df_raw.shape)
display(rides_df_raw.head())

Looks good to go!

## 10.3. Brief Data Exploration

Before we proceed with any machine learning, let's first have a look at the data, viewed through the lens of the machine learning problem we're going to solve, keeping an eye out for anything we'll need to take care of during data cleaning and preparation:

In [None]:
rides_df_raw.describe()

A few notes:
- Some rides have pickup and dropoff times outside the data's expected range of the beginning of February 2024 through the end of March 2024 (for example, the minimum `"tpep_pickup_datetime"` is `"2002-12-31 22:17:10"`).
- A few columns are missing data, all in exactly `611800` rows (about `10%` of the data): `"passenger_count"`, `"ratecode_id"`, `"store_and_fwd_flag"`, `"congestion_surcharge"`, and `"airport_fee"`.
- As we've seen throughout the course, some fields have impossible negative values, namely `"trip_distance"` and some of the currency columns (e.g. the minimum `"fare_amount"` was `-$999`).

Regarding the columns with missing data, let's double-check the fraction of missing values in each column:

In [None]:
rides_df_raw.select(pl.all().null_count() / pl.len())

Since all of these columns have exactly the same number of missing values, hopefully they are null in the same rows. Let's check:

In [None]:
cols_with_null = ["passenger_count", "ratecode_id", "store_and_fwd_flag", "congestion_surcharge", "airport_fee"]
rides_df_raw.select(
    pl.all_horizontal(pl.col(cols_with_null).is_null()).sum() / pl.len()
)

Indeed, they are all null together! We'll take care of this when it comes time to clean and prepare the data.

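Next, let's plot an ECDF (empirical cumulative distribution function) of the target variable. We can do this by passing a `polars` Series directly to `matplotlib` (note that `plt.ecdf` requires Matplotlib 3.8 or newer):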

In [None]:
x = rides_df_raw["tip_amount"]
plt.ecdf(x)

Already we see some weird stuff with `"tip_amount"`--the highest tip 

## 10.4. Cleaning and Preparing the Data

Throughout this course, we've compiled filters for impossible data, such as rides without passengers and rides whose pickup occurred after the dropoff, alongside the other data issues we just discovered in the prior section (such as missing values). We'd like to handle this bad data before jumping into machine learning, so first let's quantify the scope of the issue:

In [None]:
remove_impossible_data_filter = (
    pl.col("tpep_pickup_datetime").lt(pl.col("tpep_dropoff_datetime")) &
    pl.col("passenger_count").gt(0)
)

In [None]:
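# Count null filter results (e.g. a missing passenger_count) as unusable, matching what `.filter` would drop.
rides_df_raw.select(remove_impossible_data_filter.fill_null(False).mean().alias("fraction_data_usable"))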

Looks fine to remove!

While we're here, before we go

However, we don't want to remove all the bad data just yet: **we only want to remove bad data from our training data**. This is because, when the model is running live in the wild, we have no guarantee that incoming data will be as clean as we'd like; when taxi drivers use our system, perhaps they accidentally enter "0 passengers" 