Prepare data for modeling with modular preprocessing steps.

# INTRODUCTION

...

In [1]:
import ibis

con = ibis.duckdb.connect("nycflights13.ddb")

We’ll turn on interactive mode, which partially executes queries to give users a preview of the results. 

In [2]:
ibis.options.interactive = True

In [3]:
flights = con.table("flights")
flights = flights.mutate(
    dep_time=(
        flights.dep_time.lpad(4, "0").substr(0, 2)
        + ":"
        + flights.dep_time.substr(-2, 2)
        + ":00"
    ).try_cast("time"),
    arr_delay=flights.arr_delay.try_cast(int),
    air_time=flights.air_time.try_cast(int),
)
flights
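The `dep_time` expression above pads the raw value and splices it into an `HH:MM:00` string before casting it to a time. The plain-Python sketch below mirrors that idea outside of Ibis. It assumes departure times are stored as strings such as `"517"` or `"1545"`; the helper name `to_time_string` is hypothetical and is not part of Ibis.

```python
def to_time_string(dep_time: str) -> str:
    """Turn an 'HMM' or 'HHMM' departure time into 'HH:MM:00'."""
    # Left-pad to four characters so the hour always has two digits.
    padded = dep_time.rjust(4, "0")
    # The first two characters are the hour, the last two are the minute.
    return f"{padded[:2]}:{padded[-2:]}:00"


print(to_time_string("517"))   # 05:17:00
print(to_time_string("1545"))  # 15:45:00
```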

In [4]:
weather = con.table("weather")
weather

# THE NEW YORK CITY FLIGHT DATA

Let’s use the [nycflights13 data](https://github.com/hadley/nycflights13) to predict whether a plane arrives at least 30 minutes late. This data set contains information on 325,819 flights departing near New York City in 2013. Let’s start by loading the data and making a few changes to the variables:

In [5]:
flight_data = (
    flights.mutate(
        # Convert the arrival delay to a categorical label ("late" or "on_time")
        arr_delay=ibis.ifelse(flights.arr_delay >= 30, "late", "on_time"),
        # We will use the date (not date-time) in the recipe below
        date=flights.time_hour.date(),
    )
    # Include the weather data
    .inner_join(weather, ["origin", "time_hour"])
    # Only retain the specific columns we will use
    .select(
        "dep_time",
        "flight",
        "origin",
        "dest",
        "air_time",
        "distance",
        "carrier",
        "date",
        "arr_delay",
        "time_hour",
    )
    # Exclude missing data
    .dropna()
)
flight_data

We can see that about 16% of the flights in this data set arrived at least 30 minutes late.

In [6]:
flight_data.arr_delay.value_counts().rename(n="arr_delay_count").mutate(
    prop=ibis._.n / ibis._.n.sum()
)
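For readers less familiar with the `value_counts` pattern, the same count-then-normalize calculation can be sketched in plain Python. The labels below are made up to match the roughly 16% figure quoted above; they are not the real flight data, and `class_proportions` is a hypothetical helper.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each distinct label in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Illustrative labels only: 16 late flights out of 100.
labels = ["late"] * 16 + ["on_time"] * 84
props = class_proportions(labels)
print(props)
```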

# DATA SPLITTING

To get started, let’s split this single dataset into two: a training set and a testing set. We’ll keep most of the rows in the original dataset (subset chosen randomly) in the training set. The training data will be used to fit the model, and the testing set will be used to measure model performance.
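One common way to get a reproducible split is to hash a unique key for each row and assign the row to a set based on the hash value. The sketch below shows that idea in plain Python. It is not necessarily how this tutorial performs the split: the 75/25 ratio, the `flight-N` keys, and the `assign_split` helper are all assumptions made for illustration.

```python
import hashlib


def assign_split(key: str, test_fraction: float = 0.25, buckets: int = 10_000) -> str:
    """Deterministically assign a row key to 'train' or 'test'."""
    # Hash the key so the assignment looks random but never changes between runs.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "test" if bucket < test_fraction * buckets else "train"


# Hypothetical unique row keys.
keys = [f"flight-{i}" for i in range(10_000)]
train = [k for k in keys if assign_split(k) == "train"]
test = [k for k in keys if assign_split(k) == "test"]
print(len(train), len(test))
```

Because the assignment depends only on the key, the same row always lands in the same set, which keeps test rows from leaking into training when the data is reloaded.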

# CREATE FEATURES

In [7]:
import ibisml as ml

flights_rec = ml.Recipe(
    ml.ExpandDate("date", components=["dow", "month"]),
    ml.Drop("date"),
    ml.OneHotEncode(ml.nominal()),
    ml.ZeroVariance(ml.everything()),
    ml.MutateAt("dep_time", ibis._.hour() * 60 + ibis._.minute()),
    ml.MutateAt(ml.timestamp(), ibis._.epoch_seconds()),
)
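To make the date and time steps in the recipe more concrete, here is a plain-Python sketch of the values they appear to derive: day of week and month from the date, minutes since midnight from `dep_time`, and epoch seconds from a timestamp. The function names and output keys are hypothetical, the day-of-week numbering (Monday as 0) follows Python's `weekday()` and may differ from the recipe's output, and timestamps are assumed to be in UTC.

```python
from datetime import date, datetime, time, timezone


def expand_date(d: date) -> dict:
    """Derive day-of-week (Monday=0) and month features from a date."""
    return {"date_dow": d.weekday(), "date_month": d.month}


def minutes_since_midnight(t: time) -> int:
    """Same arithmetic as the recipe's hour * 60 + minute expression."""
    return t.hour * 60 + t.minute


def epoch_seconds(ts: datetime) -> int:
    """Seconds since 1970-01-01, treating a naive timestamp as UTC (an assumption)."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp())


print(expand_date(date(2013, 1, 1)))
print(minutes_since_midnight(time(5, 17)))
print(epoch_seconds(datetime(2013, 1, 1, 5, 0)))
```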

# FIT A MODEL WITH A RECIPE

Let’s use logistic regression to model the flight data.

We will want to use our recipe across several steps as we train and test our model. We will:

1. **Process the recipe using the training set:** This involves any estimation or calculations based on the training set. For our recipe, the training set will be used to determine which predictors should be converted to dummy variables and which predictors have zero variance in the training set and should be slated for removal.

2. **Apply the recipe to the training set:** We create the final predictor set on the training set.

3. **Apply the recipe to the test set:** We create the final predictor set on the test set. Nothing is recomputed and no information from the test set is used here; the dummy variable and zero-variance results from the training set are applied to the test set.
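The three steps above follow the familiar fit-then-transform pattern. The sketch below illustrates it with a toy zero-variance filter written in plain Python. `ZeroVarianceDropper` and the sample rows are hypothetical and only stand in for the recipe's real steps.

```python
class ZeroVarianceDropper:
    """Toy step: learn which columns vary in the training rows, then keep only those."""

    def fit(self, rows):
        columns = rows[0].keys()
        # A column is kept only if it takes more than one value in the training data.
        self.keep_ = [c for c in columns if len({row[c] for row in rows}) > 1]
        return self

    def transform(self, rows):
        # Apply the decision learned in fit(); nothing is recomputed here.
        return [{c: row[c] for c in self.keep_} for row in rows]


train_rows = [
    {"carrier": "UA", "year": 2013, "distance": 1400},
    {"carrier": "AA", "year": 2013, "distance": 1089},
]
test_rows = [{"carrier": "B6", "year": 2014, "distance": 1576}]

step = ZeroVarianceDropper().fit(train_rows)  # step 1: estimate on the training set
train_out = step.transform(train_rows)        # step 2: apply to the training set
test_out = step.transform(test_rows)          # step 3: apply to the test set
print(test_out)
```

Note that `year` is dropped from the test row even though its value differs there, because the decision was made from the training rows alone.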

To simplify this process, we can use a [scikit-learn `Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([("flights_rec", flights_rec), ("lr_mod", LogisticRegression())])

Now a single method call, `pipe.fit`, prepares the recipe and trains the model on the resulting predictors:

In [9]:
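# Separate the predictors from the target column
X_train = flight_data.drop("arr_delay")
y_train = flight_data.arr_delay
# Fit the recipe, then train the model on the transformed predictors
pipe.fit(X_train, y_train)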

# USE A TRAINED WORKFLOW TO PREDICT

...

In [10]:
pipe.score(X_train, y_train)

0.8387448245805186
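For a scikit-learn classifier, `score` returns the mean accuracy, that is, the fraction of rows whose predicted label matches the true label. The sketch below computes that quantity by hand and applies it to a trivial predictor that always answers `on_time`. The labels are made up to match the roughly 16% late rate quoted earlier, so the result is only a rough reference point for reading the score above, not a measurement on the real data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only: 16 late flights out of 100.
y_true = ["late"] * 16 + ["on_time"] * 84
always_on_time = ["on_time"] * len(y_true)

baseline = accuracy(y_true, always_on_time)
print(baseline)
```

Comparing a model's accuracy against this kind of majority-class baseline helps show how much the model has learned beyond the class balance.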