## Working with Polars and Scikit-learn
By the end of this lecture you will be able to:
- use Polars objects to fit Scikit-learn models
- work with categorical columns in a gradient boosting model
- output a Polars `DataFrame` from Scikit-learn pipeline preprocessing tools

You may need to install scikit-learn first

In [None]:
# %pip install scikit-learn

In [None]:
import polars as pl
import numpy as np

from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.ensemble import HistGradientBoostingClassifier,HistGradientBoostingRegressor

from sklearn.metrics import root_mean_squared_error,accuracy_score


In this lecture we fit simple models that aim to predict the binary `Survived` column in the Titanic data with features from a selection of other columns

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = (
    pl.read_csv(csv_file)
    .select("Pclass","Sex","Age","Fare","Embarked","Survived")
)
df.head()

## Fitting a model with Polars
We can pass a Polars `DataFrame` and `Series` directly to Scikit-learn.

Here we fit a logistic regression model using just the numeric features with some simple `null`-filling

In [None]:
model = LogisticRegression()
# Make the feature matrix and target vector
X = df.select(pl.col("Age","Fare").fill_null(strategy="mean"))
y = df["Survived"]

We can pass this Polars `DataFrame` and `Series` directly to scikit-learn

In [None]:
# Fit the model
model.fit(X,y)

We then make a new `DataFrame` with actual labels and predicted labels

In [None]:
# Make predictions on the training data
pred_df = pl.DataFrame(
    {
        "label":y,
        "pred":model.predict(X)
    }
)

print(f'Accuracy: {100*accuracy_score(pred_df["label"],pred_df["pred"]):.2f}%')

Note that `model.predict(X)` returns a NumPy `ndarray`, which we turn into a column in a Polars `DataFrame`

In [None]:
type(model.predict(X))

## Categorical features in a gradient boosting model
Typically we have to encode categorical features manually (we see how to do this in pipelines later in this lecture). However, if we use Scikit-learn's gradient boosting model `HistGradientBoostingClassifier` (or `HistGradientBoostingRegressor`) we can pass a Polars `pl.Categorical` column directly to the model without encoding it.

In this example we cast the integer `Pclass` and string `Sex` column to categorical. Recall that to cast an integer column to categorical we must first cast to `pl.String`.

We can pass a Polars `DataFrame` directly to `train_test_split`

In [None]:
df_train, df_test = train_test_split(
    df.select("Pclass","Age","Fare","Sex","Survived"),
    test_size=0.2, 
    random_state=0
)

We then create our feature matrix `X` and target vector `y` by:
- casting `Pclass` and `Sex` to categorical
- filling `nulls` in the `Age` column

In [None]:
X_train = (
    df_train
    .select(
        pl.col("Pclass").cast(pl.String).cast(pl.Categorical),
        pl.col("Sex").cast(pl.Categorical),
        pl.col("Age").fill_null(pl.col("Age").median()),
        pl.col("Fare")
    )
)
y_train = df_train["Survived"]

We can now fit the model. Passing the `categorical_features="from_dtype"` argument tells the model to treat columns with a categorical dtype as categorical features, so we can pass them directly without encoding them

In [None]:
model_grad_boost = HistGradientBoostingClassifier(categorical_features="from_dtype")

model_grad_boost.fit(X_train,y_train)

Now we make the test feature matrix.

We use `with_context` to fill `nulls` in the `Age` column of the test data with the median of the training data. We do this by:
- converting `df_test` to lazy mode
- referencing `df_train` in `with_context` and getting the fill value as a column called `Age_median`
- filling `nulls` in `Age` with the `Age_median` expression

In [None]:
X_test = (
    df_test
    .lazy()
    .with_context(
        df_train
        .lazy()
        .select(pl.col("Age").median().alias("Age_median"))
    )
    .select(
        pl.col("Pclass").cast(pl.String).cast(pl.Categorical),
        pl.col("Sex").cast(pl.Categorical),
        pl.col("Age").fill_null(pl.col("Age_median")),
        pl.col("Fare")
    )
    .collect()
)
y_test = df_test["Survived"]

We now make predictions on the test data and check the accuracy

In [None]:
pred_df = pl.DataFrame(
    {
        "label":y_test,
        "pred":model_grad_boost.predict(X_test)
    }
)
print(f'Accuracy: {100*accuracy_score(pred_df["label"],pred_df["pred"]):.2f}%')

## Scikit-learn pipelines
A more systematic way to produce ML pipelines is to use Scikit-learn's preprocessing tools: 
- `Pipeline` to compose multiple transformation steps on a column together
- `ColumnTransformer` to run transformations on multiple columns together
- `Pipeline` (again) to compose the preprocessing and model fit/predict steps together

In this example we do not cast the categorical columns to `pl.Categorical`. Instead we use Scikit-learn's preprocessing tools to one-hot encode the categorical features inside the pipeline

In [None]:
df_train, df_test = train_test_split(
    df.select("Pclass","Age","Fare","Sex","Survived"),
    test_size=0.2, 
    random_state=0
)
X_train,y_train = df_train.drop("Survived"),df_train["Survived"]
X_test,y_test = df_test.drop("Survived"),df_test["Survived"]

We first create the `Pipeline` for the columns with numerical features. In this example we:
- use `SimpleImputer` to fill nulls with the median value and
- `StandardScaler` to convert numerical columns to their z-score (i.e. subtracting the mean and dividing by the standard deviation)

In [None]:
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")), 
        ("scaler", StandardScaler())
    ]
)
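
To see what these two steps do, here is a small sketch on a made-up single-column NumPy array (the values and `_demo` names are illustrative). The missing value is replaced by the median, and the scaled column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One numeric column with a missing value
ages_demo = np.array([[1.0], [np.nan], [3.0]])

numeric_demo = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# The NaN is filled with the median (2.0), then the column is standardised
scaled_demo = numeric_demo.fit_transform(ages_demo)
```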

We then create a one-hot encoding pipeline for categorical features. Note that we have to set `sparse_output=False` if we want the output as a Polars `DataFrame`. If you have a lot of sparse data you may be better off using the normal sparse matrix as the output and not using Polars in the pipeline

In [None]:
categorical_features = ["Sex", "Pclass"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore",sparse_output=False)),
    ]
)

We then make a `ColumnTransformer` to handle the preprocessing for the different types of columns

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
)

We create another `Pipeline` to handle preprocessing and model fitting. In this example we set the output to be Polars objects by calling `set_output` on the `Pipeline`

In [None]:
preprocess_model_pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
preprocess_model_pipeline.set_output(transform="polars")

preprocess_model_pipeline.fit(X_train, y_train)
print("model score: %.3f" % preprocess_model_pipeline.score(X_test, y_test))

pred_df = pl.DataFrame(
    {
        "label":y_test,
        "pred":preprocess_model_pipeline.predict(X_test)
    }
)
pred_df

## Exercises

### Exercise one
We want to predict the tip amount in the NYC taxi database from the other columns

In [None]:
taxi_data = "../data/nyc_trip_data_1k.csv"
df_nyc = pl.read_csv(taxi_data,try_parse_dates=True)
df_nyc

Randomly split `df_nyc` into train and test `DataFrames`

Create `X_train` with:
- `VendorID` as a categorical variable
- `trip_minutes` as the duration of the trip in minutes
- `passenger_count`,`trip_distance`,`fare_amount` as numerical features

and `y_train` with `tip_amount` as the target vector

Instantiate and fit the gradient boosting model

Make a `DataFrame` with `actual` and `pred` columns on the training data

Make a scatter plot of the `actual` versus `pred`

Get the root-mean-squared error of the prediction of the tip amount

Create the test feature matrix and target vector

Make predictions on the test `DataFrame` and make a scatterplot

Get the root-mean-squared error on the test data

### Exercise two
We now fit models and make predictions with a `Pipeline` approach.

Begin by creating `df_nyc_train` and `df_nyc_test` by:
- creating any feature columns and
- splitting `df_nyc`

Now create the train/test feature matrices and target vectors

Create a pipeline for the numerical features by:
- imputing missing values with the median
- scaling values by the min/max of that column

Create the categorical `VendorID` feature

Make a `ColumnTransformer` to preprocess all features

Make a `Pipeline` to preprocess the features and do **linear regression**

Make a `DataFrame` of actual and predicted values on the **test** data

Get the RMSE of the model on the test data

## Solutions

### Solution to exercise one
We want to predict the tip amount in the NYC taxi database from the other columns

In [None]:
taxi_data = "../data/nyc_trip_data_1k.csv"
df_nyc = pl.read_csv(taxi_data,try_parse_dates=True)
df_nyc

Randomly split `df_nyc` into train and test `DataFrames`

In [None]:
df_train,df_test = train_test_split(df_nyc,test_size=0.2,random_state=0)

Create `X_train` with:
- `VendorID` as a categorical variable
- `trip_minutes` as the duration of the trip in minutes
- `passenger_count`,`trip_distance`,`fare_amount` as numerical features

and `y_train` with `tip_amount` as the target vector

In [None]:
X_train = (
    df_train
    .with_columns(
        # Cast via String first in case VendorID is read as an integer column
        pl.col("VendorID").cast(pl.String).cast(pl.Categorical),
        trip_minutes = (pl.col("dropoff") - pl.col("pickup")).dt.total_minutes(),
    )
    .drop("pickup","dropoff","tip_amount")
)
y_train = df_train["tip_amount"]
X_train.head()

Instantiate and fit the gradient boosting model

In [None]:
model = HistGradientBoostingRegressor(categorical_features="from_dtype")

model.fit(X_train,y_train)

Make a `DataFrame` with `actual` and `pred` columns on the training data

In [None]:
train_pred_df = pl.DataFrame(
    {
        "actual":y_train,
        "pred":model.predict(X_train)
    }
)

Make a scatter plot of the `actual` versus `pred`

In [None]:
train_pred_df.plot.scatter(
    x="pred",
    y="actual",
    aspect='equal',
    width=200
    )

Get the root-mean-squared error of the prediction of the tip amount

In [None]:
rmse = root_mean_squared_error(train_pred_df["actual"], train_pred_df["pred"])

print(f"RMSE: {rmse}")
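
For reference, the root-mean-squared error is the square root of the mean of the squared differences between actual and predicted values. A hand-rolled NumPy version on made-up numbers looks like this:

```python
import numpy as np

# Made-up actual and predicted values
actual_demo = np.array([1.0, 2.0, 3.0])
pred_demo = np.array([1.0, 2.0, 5.0])

# Squared errors are [0, 0, 4], so the mean squared error is 4/3
rmse_manual = float(np.sqrt(np.mean((actual_demo - pred_demo) ** 2)))
```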

Create the test feature matrix and target vector

In [None]:
X_test = (
    df_test
    .with_columns(
        # Cast via String first in case VendorID is read as an integer column
        pl.col("VendorID").cast(pl.String).cast(pl.Categorical),
        trip_minutes = (pl.col("dropoff") - pl.col("pickup")).dt.total_minutes(),
    )
    .drop("pickup","dropoff","tip_amount")
)
y_test = df_test["tip_amount"]

Make predictions on the test `DataFrame` and make a scatterplot

In [None]:
test_pred_df = pl.DataFrame(
    {
        "label":y_test,
        "pred":model.predict(X_test)
    }
)
test_pred_df.plot.scatter(
    x="pred",
    y="label",
    aspect='equal',
    width=200
    )

Get the root-mean-squared error on the test data

In [None]:
rmse = root_mean_squared_error(test_pred_df["label"], test_pred_df["pred"])

print(f"RMSE: {rmse}")

### Solution to exercise two
We now fit models and make predictions with a `Pipeline` approach.

Begin by creating `df_nyc_train` and `df_nyc_test` by:
- creating any feature columns and
- splitting `df_nyc`

In [None]:
df_nyc_train, df_nyc_test = train_test_split(
    df_nyc.select(
        "VendorID",
        (pl.col("dropoff") - pl.col("pickup")).dt.total_minutes().alias("trip_minutes"),
        "passenger_count",
        "trip_distance",
        "fare_amount",
        "tip_amount"
    ),
    test_size=0.2, 
    random_state=0
)

Now create the train/test feature matrices and target vectors

In [None]:
X_train,y_train = df_nyc_train.drop("tip_amount"),df_nyc_train["tip_amount"]
X_test,y_test = df_nyc_test.drop("tip_amount"),df_nyc_test["tip_amount"]

Create a pipeline for the numerical features by:
- imputing missing values with the median
- scaling values by the min/max of that column

In [None]:
from sklearn.preprocessing import MinMaxScaler

numeric_features = ["trip_minutes", "passenger_count", "trip_distance", "fare_amount"]
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")), 
        ("scaler", MinMaxScaler())
    ]
)
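
With its default settings, `MinMaxScaler` maps the smallest value in each column to 0 and the largest to 1. A quick sketch on a made-up column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data
fares_demo = np.array([[0.0], [5.0], [10.0]])

# Each value becomes (x - min) / (max - min) for its column
scaled_fares = MinMaxScaler().fit_transform(fares_demo)
```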

Create the categorical `VendorID` feature

In [None]:
categorical_features = ["VendorID"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore",sparse_output=False)),
    ]
)

Make a `ColumnTransformer` to preprocess all features

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
)

Make a `Pipeline` to preprocess the features and do **linear regression**

In [None]:
preprocess_model_pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", LinearRegression())]
)
preprocess_model_pipeline.set_output(transform="polars")

preprocess_model_pipeline.fit(X_train, y_train)

Make a `DataFrame` of actual and predicted values on the **test** data

In [None]:
pred_df = pl.DataFrame(
    {
        "actual":y_test,
        "pred":preprocess_model_pipeline.predict(X_test)
    }
)
pred_df

Get the RMSE of the model on the test data

In [None]:
rmse = root_mean_squared_error(pred_df["actual"], pred_df["pred"])

print(f"RMSE: {rmse}")