This notebook gives an overview of some quality-of-life improvements introduced over the last few major releases of scikit-learn. Note that I normally prefer to put all of my imports at the top of the notebook, but for learning purposes I will import as necessary. For now we'll just import pandas and numpy and set a magic seed for random states.

In [1]:
import pandas as pd
import numpy as np

SEED = 101

## Faster parser in fetch_openml [1.2.1]

[OpenML](https://www.openml.org/) is a website that allows users to upload and share machine learning datasets. I haven't used it much, but scikit-learn has functionality to pull datasets directly. Let's pull a dataset so we have something to explore the rest of the quality of life features with. I decided to pull a recently uploaded dataset named [adult](https://www.openml.org/search?type=data&status=active&id=45068) (ID:45068). This is the listed description:

> Prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
    
Pulling data into memory is relatively straightforward. Just use the fetch_openml function from sklearn.datasets. However, be sure to specify the data_id rather than the name alone, as there are likely multiple datasets with similar names (particularly for simple names).

In [2]:
from sklearn.datasets import fetch_openml

data = fetch_openml(
    name=None, # Optional, I would definitely specify data_id
    version='active', # It is possible to specify an exact version
    data_id=45068,
    target_column='default-target', # Allows you to specify the target column(s)
    return_X_y=False, # Possible to return (X,y) instead of data dictionary
    as_frame='auto', # Bit misleading, will return a dataframe as an element in the data dictionary
    parser='liac-arff' # Choices are 'auto', 'pandas', or 'liac-arff'; the last is a pure-Python parser.
)
data.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [3]:
df = data['frame']
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country,class
0,19,134974.0,10,0.0,0.0,20,,Some-college,Never-married,,Own-child,White,Female,United-States,<=50K
1,41,195096.0,13,0.0,0.0,50,Self-emp-inc,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
2,31,152109.0,9,0.0,0.0,50,Private,HS-grad,Never-married,Exec-managerial,Not-in-family,White,Male,United-States,<=50K
3,40,202872.0,12,0.0,0.0,45,Private,Assoc-acdm,Never-married,Adm-clerical,Own-child,White,Female,United-States,<=50K
4,35,98989.0,5,0.0,0.0,38,,9th,Divorced,,Own-child,Amer-Indian-Eskimo,Male,United-States,<=50K


Based on the description, the last column is the binary target. However, we can see that fetch_openml also returns a version of the dataframe already separated into data and target.

In [4]:
data['data'].head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,19,134974.0,10,0.0,0.0,20,,Some-college,Never-married,,Own-child,White,Female,United-States
1,41,195096.0,13,0.0,0.0,50,Self-emp-inc,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
2,31,152109.0,9,0.0,0.0,50,Private,HS-grad,Never-married,Exec-managerial,Not-in-family,White,Male,United-States
3,40,202872.0,12,0.0,0.0,45,Private,Assoc-acdm,Never-married,Adm-clerical,Own-child,White,Female,United-States
4,35,98989.0,5,0.0,0.0,38,,9th,Divorced,,Own-child,Amer-Indian-Eskimo,Male,United-States


In [5]:
data['target']

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
48837     >50K
48838     >50K
48839     >50K
48840     >50K
48841     >50K
Name: class, Length: 48842, dtype: object

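Incidentally, I found that the built-in liac-arff parser does not pick up the missing values in this dataset: the null count for occupation comes back as zero, whereas the pandas parser reports 2,809. Remember to inspect your data, kids!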

In [6]:
df['occupation'].isnull().sum()

0

In [7]:
data = fetch_openml(
    name=None, # Optional, I would definitely specify data_id
    version='active', # It is possible to specify an exact version
    data_id=45068,
    target_column='default-target', # Allows you to specify the target column(s)
    return_X_y=False, # Possible to return (X,y) instead of data dictionary
    as_frame='auto', # Bit misleading, will return a dataframe as an element in the data dictionary
    parser='pandas' # Choices are 'auto', 'pandas', or 'liac-arff'; the last is a pure-Python parser.
)
df = data['frame']
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country,class
0,19,134974.0,10,0.0,0.0,20,,Some-college,Never-married,,Own-child,White,Female,United-States,<=50K
1,41,195096.0,13,0.0,0.0,50,Self-emp-inc,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
2,31,152109.0,9,0.0,0.0,50,Private,HS-grad,Never-married,Exec-managerial,Not-in-family,White,Male,United-States,<=50K
3,40,202872.0,12,0.0,0.0,45,Private,Assoc-acdm,Never-married,Adm-clerical,Own-child,White,Female,United-States,<=50K
4,35,98989.0,5,0.0,0.0,38,,9th,Divorced,,Own-child,Amer-Indian-Eskimo,Male,United-States,<=50K


In [8]:
df['occupation'].isnull().sum()

2809
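
To run the same check across every column at once, `isna().sum()` on the whole frame does the job. Here is a minimal sketch on a small made-up frame; the `toy` name and its values are illustrative only and are not the adult data.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few missing entries (illustrative values only).
toy = pd.DataFrame(
    {
        "age": [19, 41, 31, 40],
        "workclass": [np.nan, "Self-emp-inc", "Private", "Private"],
        "occupation": [np.nan, "Prof-specialty", "Exec-managerial", np.nan],
    }
)

# Count missing values per column, then keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `df` after each `fetch_openml` call should make a parser difference like the one above easy to spot.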

## Feature Names Support [1.0.2] 
## get_feature_names_out Available in all Transformers [1.1.3]
## Pandas output with set_output API [1.2.1]

Now that we have some data, let's talk about some of our fancy new features. In the pre-1.0 release days, applying transformations and training models directly from dataframes was somewhat fraught. Sometimes the model would fail, and the column names were definitely not passed through to the transformer or model. Some models did support defining and inspecting features (I'm looking at you, [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)), but the poor data science practitioner was forced to set the feature names by hand.

Let's start by looking at the data. We already know we have categorical features, as well as some missing values.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   fnlwgt          48842 non-null  float64 
 2   education-num   48842 non-null  int64   
 3   capital-gain    48842 non-null  float64 
 4   capital-loss    48842 non-null  float64 
 5   hours-per-week  48842 non-null  int64   
 6   workclass       46043 non-null  category
 7   education       48842 non-null  category
 8   marital-status  48842 non-null  category
 9   occupation      46033 non-null  category
 10  relationship    48842 non-null  category
 11  race            48842 non-null  category
 12  sex             48842 non-null  category
 13  native-country  47985 non-null  category
 14  class           48842 non-null  object  
dtypes: category(8), float64(3), int64(3), object(1)
memory usage: 3.0+ MB


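Normally, you would need to explicitly declare the datatype for categorical variables. With fetch_openml and the pandas parser, however, the category dtype is set for you. Also, as the info output shows, three of the categorical columns (workclass, occupation, and native-country) have missing values, while the numerical columns currently have none. Let's create a transformer pipeline that one-hot encodes the categorical variables, which gives missing values their own category, and that imputes and scales the numerical feature columns in case missing values show up there later.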

In [10]:
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# There are multiple ways to do this, but I am explicitly building a list of features (lazily).
categorical_features = list(df.dtypes[df.dtypes == 'category'].index)
numerical_features = list(df.dtypes[df.dtypes != 'category'].index)
# Drop the target, which is included in the numerical features list.
numerical_features.remove('class')

In [11]:
numerical_features

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [12]:
categorical_features

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
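
As the comment in the cell above says, there are multiple ways to build these lists. One alternative is `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. This is a sketch on a tiny made-up frame; the `toy`, `toy_categorical`, and `toy_numerical` names are assumptions for illustration, with values borrowed from the first rows shown earlier.

```python
import pandas as pd

# Tiny stand-in for the adult dataframe.
toy = pd.DataFrame(
    {
        "age": [19, 41, 31],
        "fnlwgt": [134974.0, 195096.0, 152109.0],
        "education": pd.Categorical(["Some-college", "Bachelors", "HS-grad"]),
        "class": ["<=50K", "<=50K", "<=50K"],
    }
)

# Drop the target first so it cannot end up in either feature list.
features = toy.drop(columns="class")
toy_categorical = features.select_dtypes(include="category").columns.tolist()
toy_numerical = features.select_dtypes(include="number").columns.tolist()
print(toy_categorical, toy_numerical)
```

Dropping the target up front also removes the need for the separate `remove('class')` step.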

Now we can create our transformation pipelines for each feature type and encapsulate them in a ColumnTransformer object.

In [13]:
numerical_transformer = Pipeline(
    [
        ('impute', SimpleImputer(strategy='median')),
        ('scale', RobustScaler())
    ]
)

categorical_transformer = Pipeline(
    [
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]
)

prep = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Now we'll create a single pipeline for training the model. Technically, we don't need to use a pipeline, but it's a nice mechanism for keeping things neat.

In [14]:
pipe = Pipeline(
    [
        ('prep', prep),
        ('model', LogisticRegression(max_iter=2000, fit_intercept=False))
    ]
)

Finally, we can create a train-test split and train a model. I'll also go ahead and encode the target as integers.

In [15]:
y = df['class'].map({'<=50K':0, '>50K':1}).values
y

array([0, 0, 0, ..., 1, 1, 1])

In [16]:
from sklearn.model_selection import train_test_split

y = df['class'].map({'<=50K':0, '>50K':1}).values

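# df still contains the 'class' column, but the ColumnTransformer only selects the
# listed feature columns and drops the rest (remainder='drop' is the default),
# so the target does not leak into the model.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, train_size=0.7, shuffle=True, random_state=SEED, stratify=y
)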

X_train.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country,class
5072,23,219519.0,10,0.0,0.0,30,Private,Some-college,Never-married,Sales,Not-in-family,Black,Female,United-States,<=50K
38633,41,47902.0,10,0.0,0.0,35,Private,Some-college,Married-civ-spouse,Sales,Husband,White,Male,United-States,>50K
29153,33,182714.0,15,0.0,0.0,65,Federal-gov,Prof-school,Never-married,Prof-specialty,Not-in-family,White,Female,United-States,>50K
19183,49,87928.0,9,0.0,0.0,40,Private,HS-grad,Married-civ-spouse,Adm-clerical,Husband,White,Male,United-States,>50K
38706,34,100882.0,13,0.0,0.0,47,Private,Bachelors,Married-civ-spouse,Sales,Husband,White,Male,United-States,>50K


In [17]:
pipe.fit(X_train, y_train)

The rich HTML view that the notebook renders for the fitted pipeline is somewhat new to me! It was introduced in [0.23](https://scikit-learn.org/0.23/auto_examples/release_highlights/plot_release_highlights_0_23_0.html) and offers a nice way to explore complicated pipelines. You can click the HTML figure to expand elements and view the params.

Now let's use the get_feature_names_out method to retrieve the feature names and pair them with the coefficients to inspect the most important features.

In [18]:
pipe['prep'].get_feature_names_out()

array(['num__age', 'num__fnlwgt', 'num__education-num',
       'num__capital-gain', 'num__capital-loss', 'num__hours-per-week',
       'cat__workclass_Federal-gov', 'cat__workclass_Local-gov',
       'cat__workclass_Never-worked', 'cat__workclass_Private',
       'cat__workclass_Self-emp-inc', 'cat__workclass_Self-emp-not-inc',
       'cat__workclass_State-gov', 'cat__workclass_Without-pay',
       'cat__workclass_nan', 'cat__education_10th', 'cat__education_11th',
       'cat__education_12th', 'cat__education_1st-4th',
       'cat__education_5th-6th', 'cat__education_7th-8th',
       'cat__education_9th', 'cat__education_Assoc-acdm',
       'cat__education_Assoc-voc', 'cat__education_Bachelors',
       'cat__education_Doctorate', 'cat__education_HS-grad',
       'cat__education_Masters', 'cat__education_Preschool',
       'cat__education_Prof-school', 'cat__education_Some-college',
       'cat__marital-status_Divorced',
       'cat__marital-status_Married-AF-spouse',
       'cat__mari

In [19]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['prep'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

Unnamed: 0,feature,coef
58,cat__relationship_Wife,1.041103
2,num__education-num,0.87768
33,cat__marital-status_Married-civ-spouse,0.783164
41,cat__occupation_Exec-managerial,0.57464
0,num__age,0.513933


In [20]:
df_coef.tail()

Unnamed: 0,feature,coef
42,cat__occupation_Farming-fishing,-0.955251
45,cat__occupation_Other-service,-0.996107
35,cat__marital-status_Never-married,-1.196122
56,cat__relationship_Own-child,-1.219501
64,cat__sex_Female,-1.494701
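
Sorting by the signed coefficient splits the strongest effects between the head and the tail. To get a single ranking, one option is to sort by absolute value instead. The sketch below uses a handful of the coefficients printed above; the `coef_sample` and `ranked` names are made up for illustration, and coefficient magnitude is only a rough guide to importance.

```python
import pandas as pd

# A handful of coefficients copied from the outputs above.
coef_sample = pd.DataFrame(
    {
        "feature": [
            "cat__relationship_Wife",
            "num__education-num",
            "num__age",
            "cat__relationship_Own-child",
            "cat__sex_Female",
        ],
        "coef": [1.041103, 0.87768, 0.513933, -1.219501, -1.494701],
    }
)

# Sort by magnitude so strong negative and strong positive effects share one ranking.
ranked = coef_sample.sort_values("coef", key=lambda s: s.abs(), ascending=False)
print(ranked)
```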


Unfortunately, there are some limitations. If we add a PolynomialFeatures transformer _after_ the ColumnTransformer, we lose the link back to the feature names: PolynomialFeatures receives a plain array rather than a dataframe, so it falls back to generic names such as x2 and x3. We also can't incorporate it into the ColumnTransformer, but you can explicitly set the input features for the downstream transformer by assigning the output feature names of the ColumnTransformer to the feature_names_in_ attribute of the fitted PolynomialFeatures transformer.

In [21]:
from sklearn.preprocessing import PolynomialFeatures

numerical_transformer = Pipeline(
    [
        ('impute', SimpleImputer(strategy='median')),
        ('scale', RobustScaler())
    ]
)

categorical_transformer = Pipeline(
    [
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]
)

prep = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

pipe = Pipeline(
    [
        ('prep', prep),
        ('poly', PolynomialFeatures()),
        ('model', LogisticRegression(max_iter=2000, fit_intercept=False))
    ]
)

In [22]:
pipe.fit(X_train, y_train)

In [23]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['poly'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

Unnamed: 0,feature,coef
322,x2 x3,0.000183
531,x4 x5,0.000143
456,x3 x33,0.000109
323,x2 x4,0.000102
476,x3 x53,8.2e-05


In [24]:
pipe['poly'].feature_names_in_ = pipe['prep'].get_feature_names_out()

In [25]:
pipe['poly'].get_feature_names_out()

array(['1', 'num__age', 'num__fnlwgt', ...,
       'cat__native-country_Yugoslavia^2',
       'cat__native-country_Yugoslavia cat__native-country_nan',
       'cat__native-country_nan^2'], dtype=object)

In [26]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['poly'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

Unnamed: 0,feature,coef
322,num__education-num num__capital-gain,0.000183
531,num__capital-loss num__hours-per-week,0.000143
456,num__capital-gain cat__marital-status_Married-...,0.000109
323,num__education-num num__capital-loss,0.000102
476,num__capital-gain cat__relationship_Husband,8.2e-05


In [27]:
df_coef.tail()

Unnamed: 0,feature,coef
4,num__capital-gain,-6.8e-05
487,num__capital-gain cat__sex_Female,-7.2e-05
449,num__capital-gain cat__education_HS-grad,-7.3e-05
477,num__capital-gain cat__relationship_Not-in-family,-7.5e-05
458,num__capital-gain cat__marital-status_Never-ma...,-0.000105


Alternatively, you can use the set_output API introduced in 1.2.1 to have a transformer return a pandas dataframe. However, pandas output does not support sparse matrices, so any one-hot encoded outputs will need to be returned as dense values. Keep in mind that this will increase memory usage.

In [28]:
numerical_transformer = Pipeline(
    [
        ('impute', SimpleImputer(strategy='median')),
        ('scale', RobustScaler())
    ]
)

categorical_transformer = Pipeline(
    [
        ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]
)

prep = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
).set_output(transform='pandas')

pipe = Pipeline(
    [
        ('prep', prep),
        ('poly', PolynomialFeatures()),
        ('model', LogisticRegression(max_iter=2000, fit_intercept=False))
    ]
)

In [29]:
pipe.fit(X_train, y_train)

In [30]:
df_coef = pd.DataFrame(
    {
        'feature':pipe['poly'].get_feature_names_out(),
        'coef':pipe['model'].coef_[0]
    }
)
df_coef = df_coef.sort_values('coef', ascending=False)
df_coef.head()

Unnamed: 0,feature,coef
322,num__education-num num__capital-gain,0.000183
531,num__capital-loss num__hours-per-week,0.000143
456,num__capital-gain cat__marital-status_Married-...,0.000109
323,num__education-num num__capital-loss,0.000102
476,num__capital-gain cat__relationship_Husband,8.2e-05


In [31]:
df_coef.shape

(5886, 2)

## Keyword and positional arguments [1.0.2]

This is really more of a stylistic choice, but the maintainers of scikit-learn have decided to enforce the use of keyword arguments for most parameters. This isn't necessarily a bad thing, but you should be aware that code written prior to this release may break. Also, rather confusingly, some positional arguments will still work.

For instance, let's instantiate a LogisticRegression model with keyword arguments.

In [32]:
model_keyword = LogisticRegression(
    penalty='l1',
    dual=False,
    tol=0.0001,
    C=0.1
)

Now try the same thing with positional arguments.

In [33]:
model_position = LogisticRegression('l1', False, 0.0001, 0.1)

TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given

In [34]:
model_position = LogisticRegression('l1')

You can, however, still instantiate a new object with one or two positional arguments, depending on the class. Do your future self a favor, though, and just write out the keywords!
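
For context, Python supports this kind of enforcement with a bare `*` in the signature: everything after it must be passed by keyword. The class below is a hypothetical stand-in, not scikit-learn's actual code, and only illustrates why one positional argument can work while several fail.

```python
class ToyEstimator:
    """Hypothetical estimator; not part of scikit-learn."""

    def __init__(self, penalty="l2", *, tol=1e-4, C=1.0):
        # penalty may be positional; tol and C are keyword-only because of the bare *.
        self.penalty = penalty
        self.tol = tol
        self.C = C


# One positional argument plus keywords works.
ok = ToyEstimator("l1", tol=1e-3, C=0.1)

# Passing the keyword-only parameters positionally raises a TypeError.
message = None
try:
    ToyEstimator("l1", 1e-3, 0.1)
except TypeError as err:
    message = str(err)

print(message)
```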