This notebook contains an overview of some quality-of-life improvements introduced over the last few major releases of scikit-learn. Note that I normally prefer to put all of my imports at the top of the notebook, but for learning purposes I will import as necessary. For now we'll just import pandas and numpy and set a magic seed for random states.

In [None]:
import pandas as pd
import numpy as np

SEED = 101

## Faster parser in fetch_openml [1.2.1]

[OpenML](https://www.openml.org/) is a website that allows users to upload and share machine learning datasets. I haven't used it much, but scikit-learn has functionality to pull datasets directly. Let's pull a dataset so we have something to explore the rest of the quality of life features with. I decided to pull a recently uploaded dataset named [adult](https://www.openml.org/search?type=data&status=active&id=45068) (ID:45068). This is the listed description:

> Prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
    
Pulling data into memory is relatively straightforward. Just use the fetch_openml function from sklearn.datasets. However, be sure to specify the data_id rather than the name alone, as there are likely multiple datasets with similar names (particularly for simple names).

In [None]:
from sklearn.datasets import fetch_openml

data = fetch_openml(
    name=None, # Optional, I would definitely specify data_id
    version='active', # It is possible to specify an exact version
    data_id=45068,
    data_home='../data/', # You can specify the data location here!
    target_column='default-target', # Allows you to specify the target column(s)
    cache=True,
    return_X_y=False, # Possible to return (X,y) instead of data dictionary
    as_frame='auto', # Bit misleading, will return a dataframe as an element in the data dictionary
    parser='liac-arff' # Your choices are pandas or liac-arff. The latter is a pure-Python method.
)
data.keys()

In [None]:
df = data['frame']
df.head()

Based on the description, the last column is the binary target. However, we can see that fetch_openml also returns a version of the dataframe separated into data and target.

In [None]:
data['data'].head()

In [None]:
data['target']

Incidentally, I found that the built-in parser, liac-arff, does not handle missing values properly. Remember to inspect your data, kids!

In [None]:
df['occupation'].isnull().sum()

In [None]:
data = fetch_openml(
    name=None, # Optional, I would definitely specify data_id
    version='active', # It is possible to specify an exact version
    data_id=45068,
    data_home='../data/', # You can specify the data location here!
    target_column='default-target', # Allows you to specify the target column(s)
    cache=True,
    return_X_y=False, # Possible to return (X,y) instead of data dictionary
    as_frame='auto', # Bit misleading, will return a dataframe as an element in the data dictionary
    parser='pandas' # Your choices are pandas or liac-arff. The latter is a pure-Python method.
)
df = data['frame']
df.head()

In [None]:
df['occupation'].isnull().sum()

## Feature Names Support [1.0.2] 
## get_feature_names_out Available in all Transformers [1.1.3]
## Pandas output with set_output API [1.2.1]

Now that we have some data, let's talk about some of our fancy new features. Coming from the pre-1.0 release days, applying transformations and training models directly from dataframes was somewhat fraught. Sometimes the model would fail, and it definitely would not pass the column names through to the transformer/model. There was support in some models for defining and inspecting features (I'm looking at you, [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)), but the poor data science practitioner was forced to set the feature names directly.

Let's start by looking at the data. We already know we have categorical features, as well as some missing values.

In [None]:
df.info()

Normally, you would need to explicitly declare the datatype for categorical variables. With fetch_openml, however, it will set the datatype for you. Also, as we saw above, there are several feature columns with missing values. Let's create a transformer pipeline to both encode the categorical variables and impute missing values on the numerical feature columns.

In [None]:
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# There are multiple ways to do this, but I am explicitly building a list of features (lazily).
categorical_features = list(df.dtypes[df.dtypes == 'category'].index)
numerical_features = list(df.dtypes[df.dtypes != 'category'].index)
# Drop the target, which is included in the numerical features list.
numerical_features.remove('class')

In [None]:
numerical_features

In [None]:
categorical_features
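
As one of the "multiple ways" mentioned above, here is a minimal sketch that uses make_column_selector from sklearn.compose to pick columns by dtype. The toy dataframe and its column names are made up for illustration and are not the adult data; I drop the target first so it cannot end up in either list.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame standing in for the real data (illustrative column names only).
toy = pd.DataFrame(
    {
        "age": [25, 40, 55],
        "hours": [20.0, 40.0, 60.0],
        "job": pd.Categorical(["clerk", "sales", "clerk"]),
        "class": pd.Categorical(["<=50K", ">50K", "<=50K"]),
    }
)

# Remove the target before selecting, so it is never treated as a feature.
features = toy.drop(columns="class")

# Each selector is a callable that returns the matching column names.
select_categorical = make_column_selector(dtype_include="category")
select_numerical = make_column_selector(dtype_exclude="category")

toy_categorical = select_categorical(features)
toy_numerical = select_numerical(features)
print(toy_categorical, toy_numerical)
```

A selector can also be handed to ColumnTransformer in place of an explicit list, in which case the columns are resolved when fit is called.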

Now we can create our transformation pipelines for each feature type and encapsulate them in a ColumnTransformer object.

In [None]:
numerical_transformer = Pipeline(
    [
        ('impute', SimpleImputer(strategy='median')),
        ('scale', RobustScaler())
    ]
)

categorical_transformer = Pipeline(
    [
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]
)

prep = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Now we'll create a single pipeline for training the model. Technically, we don't need to use a pipeline, but it's a nice mechanism for keeping things neat.

In [None]:
pipe = Pipeline(
    [
        ('prep', prep),
        ('model', LogisticRegression(max_iter=2000, fit_intercept=False))
    ]
)

Finally, we can create a train-test split and train a model. I'll also go ahead and cast the target to an integer.

In [None]:
y = df['class'].map({'<=50K':0, '>50K':1}).values
y

In [None]:
from sklearn.model_selection import train_test_split

y = df['class'].map({'<=50K':0, '>50K':1}).values

X_train, X_test, y_train, y_test = train_test_split(df, y, train_size=0.7, shuffle=True, random_state=SEED, stratify=y)

X_train.head()

In [None]:
pipe.fit(X_train, y_train)

This rich view is somewhat new to me! It was released in [0.23.2](https://scikit-learn.org/0.23/auto_examples/release_highlights/plot_release_highlights_0_23_0.html) and offers a nice way to explore complicated pipelines. You can click the HTML figure to expand elements and view the params.
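
If you want the same diagram outside of a notebook, here is a small sketch that renders it to an HTML string with estimator_html_repr from sklearn.utils and saves it. The toy pipeline and the file name pipeline.html are placeholders of mine, and I write into a temporary directory just to keep the example tidy.

```python
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# A small stand-in pipeline; an unfitted estimator can be rendered too.
toy_pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)

# Render the same diagram the notebook shows as a plain HTML string.
html = estimator_html_repr(toy_pipe)

# Save it somewhere; the path here is only an example.
out_path = Path(tempfile.mkdtemp()) / "pipeline.html"
out_path.write_text(html, encoding="utf-8")
```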

Now let's use the get_feature_names_out method to retrieve the feature names and pair them with the coefficients to inspect the most important features.

In [None]:
pipe['prep'].get_feature_names_out()

In [None]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['prep'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

In [None]:
df_coef.tail()

Unfortunately, there are some limitations. If we introduce a PolynomialFeatures transformer _after_ the ColumnTransformer, we lose the link back to the original feature names, because the downstream transformer is fitted on a plain array. We also can't incorporate it into the ColumnTransformer, but we can restore the names by assigning the output feature names from the ColumnTransformer to the input feature names of the PolynomialFeatures transformer.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

numerical_transformer = Pipeline(
    [
        ('impute', SimpleImputer(strategy='median')),
        ('scale', RobustScaler())
    ]
)

categorical_transformer = Pipeline(
    [
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]
)

prep = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

pipe = Pipeline(
    [
        ('prep', prep),
        ('poly', PolynomialFeatures()),
        ('model', LogisticRegression(max_iter=2000, fit_intercept=False))
    ]
)

In [None]:
pipe.fit(X_train, y_train)

In [None]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['poly'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

In [None]:
pipe['poly'].feature_names_in_ = pipe['prep'].get_feature_names_out()

In [None]:
pipe['poly'].get_feature_names_out()

In [None]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['poly'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

In [None]:
df_coef.tail()

Alternatively, you can use the new functionality for setting the output of a transformer, introduced in 1.2.1. However, pandas output does not support sparse matrices, so any one-hot encoded outputs will need to be returned as dense values. Keep in mind that this will increase memory usage.

In [None]:
numerical_transformer = Pipeline(
    [
        ('impute', SimpleImputer(strategy='median')),
        ('scale', RobustScaler())
    ]
)

categorical_transformer = Pipeline(
    [
        ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]
)

prep = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
).set_output(transform='pandas')

pipe = Pipeline(
    [
        ('prep', prep),
        ('poly', PolynomialFeatures()),
        ('model', LogisticRegression(max_iter=2000, fit_intercept=False))
    ]
)

In [None]:
pipe.fit(X_train, y_train)

In [None]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['poly'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

In [None]:
df_coef.shape

## Keyword and positional arguments [1.0.2]

This is really more of a stylistic choice, but the maintainers of scikit-learn have decided to enforce the usage of keyword arguments. This isn't necessarily a bad thing, but you should be aware that code created prior to this release may break. Also, rather confusingly, some positional arguments will still work.

For instance, let's instantiate a LogisticRegression model with keyword arguments.

In [None]:
model_keyword = LogisticRegression(
    penalty='l1',
    dual=False,
    tol=0.0001,
    C=0.1
)

Now try the same thing with positional arguments.

In [None]:
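model_position = LogisticRegression('l1', False, 0.0001, 0.1) # Raises a TypeError: everything after penalty is keyword-only

In [None]:
model_position = LogisticRegression('l1') # Works: penalty can still be passed positionally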

You can, however, still instantiate a new object with one or two positional arguments, depending on the class. Do your future self a favor, however, and just write out the keywords!

## New and enhanced displays [1.2.1]