# Introduction to scikit-learn: basic preprocessing for basic model fitting

In this notebook, we introduce a typical approach to building predictive
models on tabular datasets.

In particular we will highlight:
* the difference between numerical and categorical variables;
* the importance of scaling numerical variables;
* typical ways to deal with categorical variables;
* how to train predictive models on different kinds of data;
* how to evaluate the performance of a model via cross-validation.

## Introduce the dataset

To this aim, we will use data from the 1994 Census Bureau database. The goal
with this data is to predict a person's income class (above or below 50K) from
heterogeneous data such as age, employment, education, family information,
etc.

Let's first load the data, either from OpenML or from the local copy located
in the `datasets` folder.

In [1]:
import pandas as pd

df = pd.read_csv("https://www.openml.org/data/get_csv/1595261/adult-census.csv")

# Or use the local copy:
# df = pd.read_csv('../datasets/adult-census.csv')

Let's have a look at the first records of this data frame:

In [2]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


The target variable in our study will be the "class" column while we will use
the other columns as input variables for our model. This target column divides
the samples (also known as records) into two groups: high income (>50K) vs low
income (<=50K). The resulting prediction problem is therefore a binary
classification problem.

For simplicity, we will ignore the "fnlwgt" (final weight) column that was
crafted by the creators of the dataset when sampling the dataset to be
representative of the full census database.

In [3]:
target_name = "class"
target = df[target_name].to_numpy()
data = df.drop(columns=[target_name, "fnlwgt"])

We can check the number of samples and the number of features available in
the dataset:

In [4]:
print(
    f"The dataset contains {data.shape[0]} samples and {data.shape[1]} "
    "features"
)

The dataset contains 48842 samples and 13 features


## Working with numerical data

Numerical data is the most natural type of data used in machine learning
and can (almost) directly be fed to predictive models. We can quickly have a
look at such data by selecting the corresponding subset of columns from the
original data.

We will use this subset of data to fit a linear classification model to
predict the income class.

In [5]:
data.columns

Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [6]:
data.dtypes

age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [7]:
numerical_columns = ['age', 'education-num', 'hours-per-week',
                     'capital-gain', 'capital-loss']
data_numeric = data[numerical_columns]
data_numeric.head()

Unnamed: 0,age,education-num,hours-per-week,capital-gain,capital-loss
0,25,7,40,0,0
1,38,9,50,0,0
2,28,12,40,0,0
3,44,10,40,7688,0
4,18,10,30,0,0
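
The list of numerical columns above is typed by hand. As a rough sketch, the
same selection can be derived from the dtypes with pandas' `select_dtypes`.
The toy frame below is made up for illustration and only mimics a few columns
of the census data. This shortcut assumes that every numeric column is a true
quantity, which does not always hold: integer-coded categories would be picked
up as well.

```python
import pandas as pd

# Toy frame (made-up values) mimicking a few columns of the census data.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
    "sex": ["Male", "Male", "Female"],
})

# Columns with a numeric dtype versus all the others.
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical)    # ['age', 'hours-per-week']
print(categorical)  # ['workclass', 'sex']
```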


When building a machine learning model, it is important to leave out a
subset of the data which we can use later to evaluate the trained model.
The data used to fit a model is called training data while the data used to
assess a model is called testing data.

Scikit-learn provides a helper function `train_test_split` which splits the
dataset into a training set and a testing set. By default, it shuffles the
data randomly before splitting.

In [8]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42
)

print(
    f"The training dataset contains {data_train.shape[0]} samples and "
    f"{data_train.shape[1]} features"
)
print(
    f"The testing dataset contains {data_test.shape[0]} samples and "
    f"{data_test.shape[1]} features"
)

The training dataset contains 36631 samples and 5 features
The testing dataset contains 12211 samples and 5 features
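
The sizes printed above correspond to the default behaviour of
`train_test_split`, which holds out 25% of the samples when neither
`test_size` nor `train_size` is given. Here is a small sketch on made-up data
that makes the proportion explicit. The `stratify` argument is optional and is
used here only to keep the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, balanced binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```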


We will build a linear classification model called "Logistic Regression". The
`fit` method is called to train the model from the input and target data. Only
the training data should be given for this purpose.

In addition, we measure the time required to train the model and check the
number of iterations performed by the solver to find a solution.

In [9]:
from sklearn.linear_model import LogisticRegression
import time

model = LogisticRegression()
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start

print(
    f"The model {model.__class__.__name__} was trained in "
    f"{elapsed_time:.3f} seconds for {model.n_iter_} iterations"
)

The model LogisticRegression was trained in 0.416 seconds for [100] iterations


  n_iter_i = _check_optimize_result(solver, opt_res, max_iter)


Let's ignore the convergence warning for now and instead let's try
to use our model to make some predictions on the first five records
of the held-out test set:

In [10]:
target_predicted = model.predict(data_test)
target_predicted[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

In [11]:
target_test[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

In [12]:
predictions = data_test.copy()
predictions['predicted-class'] = target_predicted
predictions['expected-class'] = target_test
predictions['correct'] = target_predicted == target_test
predictions.head()

Unnamed: 0,age,education-num,hours-per-week,capital-gain,capital-loss,predicted-class,expected-class,correct
7762,56,9,40,0,0,<=50K,<=50K,True
23881,25,9,40,0,0,<=50K,<=50K,True
30507,43,13,40,14344,0,>50K,>50K,True
28911,32,9,40,0,0,<=50K,<=50K,True
19484,39,13,30,0,0,<=50K,<=50K,True


To quantitatively evaluate our model, we can use the method `score`. It
computes the classification accuracy when dealing with a classification
problem.

In [13]:
print(
    f"The test accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f}"
)

The test accuracy using a LogisticRegression is 0.818


This is mathematically equivalent to computing the fraction of test samples
for which the model makes a correct prediction:

In [14]:
(target_test == target_predicted).mean()

0.8177053476373761

Let's now consider the `ConvergenceWarning` message that was raised previously
when calling the `fit` method to train our model. This warning informs us that
our model stopped learning because it reached the maximum number of
iterations allowed by the user. This could potentially be detrimental for the
model accuracy. We can follow the (bad) advice given in the warning message
and increase the maximum number of iterations allowed.

In [15]:
model = LogisticRegression(max_iter=50000)
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start
print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model.n_iter_} iterations"
)

The accuracy using a LogisticRegression is 0.818 with a fitting time of 0.387 seconds in [105] iterations


We can observe that the solver now runs for a few more iterations, without any
significant improvement in the predictive performance. Instead of increasing
the number of iterations, we can try to help the solver converge faster by
scaling the data first. A range of preprocessing algorithms in scikit-learn
allow us to transform the input data before training a model. We can easily
combine these sequential operations with a scikit-learn `Pipeline`, which
chains the operations and can be used like any other classifier or regressor.
The helper function `make_pipeline` creates a `Pipeline` from the successive
transformations to perform.

In our case, we will standardize the data and then train a new logistic
regression model on that new version of the dataset.

In [16]:
data_train.describe()

Unnamed: 0,age,education-num,hours-per-week,capital-gain,capital-loss
count,36631.0,36631.0,36631.0,36631.0,36631.0
mean,38.642352,10.078131,40.431247,1087.077721,89.665311
std,13.725748,2.570143,12.423952,7522.692939,407.110175
min,17.0,1.0,1.0,0.0,0.0
25%,28.0,9.0,40.0,0.0,0.0
50%,37.0,10.0,40.0,0.0,0.0
75%,48.0,12.0,45.0,0.0,0.0
max,90.0,16.0,99.0,99999.0,4356.0


In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_train_scaled = scaler.fit_transform(data_train)
data_train_scaled

array([[ 0.17177061,  0.35868902, -2.28845333, -0.14450843,  5.71188483],
       [ 0.02605707,  1.1368665 , -0.27618374, -0.14450843, -0.22025127],
       [-0.33822677,  1.1368665 ,  0.77019645, -0.14450843, -0.22025127],
       ...,
       [-0.77536738, -0.03039972, -0.03471139, -0.14450843, -0.22025127],
       [ 0.53605445,  0.35868902, -0.03471139, -0.14450843, -0.22025127],
       [ 1.48319243,  1.52595523, -2.69090725, -0.14450843, -0.22025127]])

In [18]:
data_train_scaled = pd.DataFrame(data_train_scaled, columns=data_train.columns)
data_train_scaled.describe()

Unnamed: 0,age,education-num,hours-per-week,capital-gain,capital-loss
count,36631.0,36631.0,36631.0,36631.0,36631.0
mean,-2.273364e-16,1.219606e-16,1.844684e-16,3.5303100000000004e-17,3.8406670000000006e-17
std,1.000014,1.000014,1.000014,1.000014,1.000014
min,-1.576792,-3.532198,-3.173852,-0.1445084,-0.2202513
25%,-0.7753674,-0.4194885,-0.03471139,-0.1445084,-0.2202513
50%,-0.1196565,-0.03039972,-0.03471139,-0.1445084,-0.2202513
75%,0.681768,0.7477778,0.3677425,-0.1445084,-0.2202513
max,3.741752,2.304133,4.714245,13.14865,10.4797
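
For reference, standardization replaces each value `x` by `(x - mean) / std`,
where the mean and standard deviation are estimated on the training data. The
sketch below redoes this by hand with the standard library on made-up numbers.
It also suggests why `describe()` reports a standard deviation of 1.000014
rather than exactly 1: `StandardScaler` divides by the population standard
deviation (denominator `n`), whereas pandas reports the sample standard
deviation (denominator `n - 1`).

```python
import statistics

# Made-up "age" values, for illustration only.
ages = [25, 38, 28, 44, 18]

mean = statistics.fmean(ages)
std = statistics.pstdev(ages)  # population standard deviation (divides by n)

scaled = [(age - mean) / std for age in ages]

# The scaled values have (numerically) zero mean and unit population
# standard deviation...
print(statistics.fmean(scaled), statistics.pstdev(scaled))
# ...while the sample standard deviation (divides by n - 1) is above 1.
print(statistics.stdev(scaled) > 1.0)
```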


In [19]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start
print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
)

The accuracy using a Pipeline is 0.818 with a fitting time of 0.102 seconds in [13] iterations


We can see that the training time is much shorter and the number of iterations
much smaller, while the predictive performance is equivalent.
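
Under the hood, fitting the pipeline amounts to fitting the scaler on the
training data, transforming that data, and fitting the classifier on the
result. At prediction time the already-fitted scaler is reused, so no
statistics are computed from the test data. Here is a sketch comparing the
pipeline with the manual steps, using synthetic data generated with
`make_classification` (not the census data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: scaling and classification chained in a single estimator.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# Manual equivalent: the scaler is fitted on the training data only and
# then reused, without refitting, to transform the test data.
scaler = StandardScaler().fit(X_train)
classifier = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_predictions = classifier.predict(scaler.transform(X_test))

# Both approaches should give the same predictions.
print((pipeline.predict(X_test) == manual_predictions).all())
```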

In the previous example, we split the original data into a training set and a
testing set. This strategy has several issues: in the setting where the amount
of data is limited, the subset of data used to train or test will be small;
and the splitting was done in a random manner and we have no information
regarding the confidence of the results obtained.

Instead, we can use cross-validation. Cross-validation consists of
repeating this splitting into training and testing sets and aggregating
the model performance. By repeating the experiment, one can get an estimate of
the variability of the model performance.

The function `cross_val_score` implements such an experimental protocol: it
takes the model, the data and the target. Since there exist several
cross-validation strategies, `cross_val_score` takes a parameter `cv` which
defines the splitting strategy.

In [20]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, data_numeric, target, cv=5)
print(f"The accuracy (mean +/- 1 std. dev.) is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")
print(f"The different scores obtained are: \n{scores}")

The accuracy (mean +/- 1 std. dev.) is: 0.814 +/- 0.004
The different scores obtained are: 
[0.81216092 0.8096018  0.81337019 0.81326781 0.82207207]


Note that by computing the standard deviation of the cross-validation scores
we can get an idea of the uncertainty of our estimation of the predictive
performance of the model: in the above results, only the first 2 decimals seem
to be trustworthy. Using a single train / test split would not allow us to
know anything about the level of uncertainty of the accuracy of the model.

Setting `cv=5` created 5 distinct splits to get 5 variations for the training
and testing sets. Each training set is used to fit one model which is then
scored on the matching test set. This strategy is called K-fold
cross-validation where `K` corresponds to the number of splits.
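
To make the splitting concrete, here is a minimal pure-Python sketch of how
K-fold partitions the sample indices: each sample lands in a test set exactly
once. The helper `k_fold_indices` is hypothetical and written for
illustration. It is not the scikit-learn implementation, which (for
classifiers and an integer `cv`) additionally stratifies the folds by class.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each of the n_splits folds.

    Hypothetical helper: contiguous folds, no shuffling, no stratification.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=10, n_splits=5):
    print(f"train={train} test={test}")
```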

In [21]:
# TODO add CV diagram for 5-fold CV from the gallery

## Working with categorical data

In the previous section, we dealt with data for which numerical algorithms are
mathematically designed to work natively. However, real datasets contain types
of data which do not belong to this category and will require some
preprocessing. Such preprocessing will transform these data to be numerical
and thus natively handled by machine learning algorithms.

Categorical data are broadly encountered in data science. Numerical data
represent continuous quantities corresponding to real numbers, while
categorical data take discrete values. For instance, the variable `sex` in our
dataset is a categorical variable because it encodes the data with
the two categories `Male` and `Female`.

In the remainder of this section, we will present different strategies to
encode categorical data into numerical data which can be used by a
machine-learning algorithm.

In [22]:
data.dtypes

age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [23]:
categorical_columns = [
    'workclass', 'education', 'marital-status', 'occupation', 'relationship',
    'race', 'sex', 'native-country'
]
data_categorical = data[categorical_columns]
data_categorical.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


### Encoding categories having an ordering

The most intuitive strategy is to encode each category by a numerical value.
The `OrdinalEncoder` will transform the data in such a manner.

In [24]:
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()

The dataset is composed of 8 features


Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [25]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
data_encoded = encoder.fit_transform(data_categorical)

print(f"The dataset encoded contains {data_encoded.shape[1]} features")
data_encoded[:5]

The dataset encoded contains 8 features


array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])

We can see that all categories have been encoded for each feature
independently. We can also notice that the number of features before and after
the encoding is the same.

However, one has to be careful when using this encoding strategy. Using this
integer representation makes the assumption that the categories are ordered: 0
is smaller than 1 which is smaller than 2, etc. Furthermore, the
lexicographical order used by `OrdinalEncoder` by default to map from string
labels to integers might be meaningless. For instance, if you have a "size"
categorical variable with categories such as "S", "M", "L", "XL", it is better
to map those to 0, 1, 2, 3 rather than 2, 1, 0, 3 as the lexicographical
strategy would do. The `OrdinalEncoder` class accepts a `categories`
constructor argument to pass the correct ordering explicitly.
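
The "size" example can be sketched as follows on made-up data. The
`categories` argument expects one list of categories per feature, given in the
desired order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up "size" feature, for illustration only.
sizes = np.array([["S"], ["M"], ["L"], ["XL"]], dtype=object)

# Default behaviour: categories are sorted lexicographically (L, M, S, XL).
default_codes = OrdinalEncoder().fit_transform(sizes)
print(default_codes.ravel())  # [2. 1. 0. 3.]

# Explicit ordering: one list of categories per feature.
ordered = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered_codes = ordered.fit_transform(sizes)
print(ordered_codes.ravel())  # [0. 1. 2. 3.]
```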

If a categorical variable does not carry any meaningful order information then
this encoding might not be adequate and you should consider using one-hot
encoding instead (see below).

Note however that the impact of violating this ordering assumption really
depends on the downstream models (for instance, linear models are much more
sensitive than models built from an ensemble of decision trees).

### Encoding categories without assuming any order

`OneHotEncoder` is an alternative encoder that can prevent the downstream
models from making a false assumption about the ordering of categories. For a
given feature, it will create as many new columns as there are possible
categories. For a given sample, the value of the column corresponding to the
category will be set to `1` while all the columns of the other categories will
be set to `0`.
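
The mechanism can be sketched in a few lines of pure Python for a single
feature. The helper `one_hot` is hypothetical and only meant to illustrate the
idea; it is not how scikit-learn implements the encoder.

```python
def one_hot(values):
    """Return (categories, rows) for a single categorical feature.

    Hypothetical helper for illustration; not the scikit-learn implementation.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Made-up values for a single "workclass"-like feature.
categories, rows = one_hot(["Private", "Local-gov", "Private", "State-gov"])

print(categories)  # ['Local-gov', 'Private', 'State-gov']
for row in rows:
    print(row)
```

Each row contains exactly one `1`, located in the column of the sample's
category.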

In [26]:
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()

The dataset is composed of 8 features


Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [27]:
from sklearn.preprocessing import OneHotEncoder

# On scikit-learn < 1.2, use `sparse=False` instead of `sparse_output=False`.
encoder = OneHotEncoder(sparse_output=False)
data_encoded = encoder.fit_transform(data_categorical)
print(f"The dataset encoded contains {data_encoded.shape[1]} features")
data_encoded

The dataset encoded contains 102 features


array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [28]:
# On scikit-learn < 1.0, use `get_feature_names` instead.
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


The number of features after the encoding is more than 10 times larger than in
the original data because some variables have many possible categories.

We can integrate this encoder inside a machine learning pipeline as in the
case with numerical data. In the following, we train a linear classifier on
the encoded data and check the performance of this machine learning pipeline
using cross-validation.

In [29]:
model = make_pipeline(OneHotEncoder(handle_unknown='ignore'),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"The different scores obtained are: \n{scores}")

The accuracy is: 0.833 +/- 0.002
The different scores obtained are: 
[0.83222438 0.83560242 0.82872645 0.83312858 0.83466421]
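
The pipeline above passes `handle_unknown='ignore'` to the encoder. During
cross-validation, a rare category may be absent from a training fold yet
present in the matching test fold; with the default setting, the encoder would
raise an error in that case. Here is a sketch on made-up data of what
`'ignore'` does instead: the unseen category is encoded as a row of zeros.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "Never-worked" does not appear in the training values.
train = [["Private"], ["Local-gov"], ["Private"]]
test = [["Local-gov"], ["Never-worked"]]

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# The output is a sparse matrix by default; convert it for display.
encoded = encoder.transform(test).toarray()
print(encoder.categories_)
print(encoded)
```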


Exercise:
- Try to use the OrdinalEncoder instead. What do you observe?

In case you have issues with unknown categories, try to precompute the list
of possible categories ahead of time and pass it explicitly to the constructor
of the encoder:

    categories = [data[column].unique() for column in data[categorical_columns]]
    OrdinalEncoder(categories=categories)

## Combining different transformers used for different column types

In the previous sections, we saw that we need to treat data specifically
depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a `ColumnTransformer` class which will dispatch some
specific columns to a specific transformer making it easy to fit a single
predictive model on a dataset that combines both kinds of variables together
(heterogeneously typed tabular data).

We can first define the columns depending on their data type:
* **binary encoding** will be applied to categorical columns with only two
  possible values (e.g. sex=male or sex=female in this example). Each binary
  categorical column will be mapped to one numerical column with 0 or 1
  values.
* **one-hot encoding** will be applied to categorical columns with more than
  two possible categories. This encoding will create one additional column for
  each possible categorical value.
* **numerical scaling** will be applied to numerical features, which will be
  standardized.

In [30]:
binary_encoding_columns = ['sex']
one_hot_encoding_columns = ['workclass', 'education', 'marital-status',
                            'occupation', 'relationship',
                            'race', 'native-country']
scaling_columns = ['age', 'education-num', 'hours-per-week',
                   'capital-gain', 'capital-loss']

We can now create our `ColumnTransformer` by specifying a list of triplets
(preprocessor name, transformer, columns). Finally, we can define a pipeline
to stack this "preprocessor" with our classifier (logistic regression).

In [31]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('binary-encoder', OrdinalEncoder(), binary_encoding_columns),
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
     one_hot_encoding_columns),
    ('standard-scaler', StandardScaler(), scaling_columns)
])
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
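
To see what such a column transformer produces on its own, here is a sketch on
a made-up toy frame with one column of each kind. The names `toy` and
`toy_preprocessor` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy frame (made-up values) with one column of each kind.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "workclass": ["Private", "Local-gov", "Private", "State-gov"],
    "age": [25, 38, 28, 44],
})

toy_preprocessor = ColumnTransformer([
    ("binary-encoder", OrdinalEncoder(), ["sex"]),
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("standard-scaler", StandardScaler(), ["age"]),
])
transformed = toy_preprocessor.fit_transform(toy)

# 1 column for "sex" + 3 one-hot columns for "workclass" + 1 for "age".
print(transformed.shape)  # (4, 5)
```

The outputs of the individual transformers are concatenated column-wise, in
the order in which the transformers are listed.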

The final model is more complex than the previous models but still follows the
same API:
- the `fit` method is called to preprocess the data then train the classifier;
- the `predict` method can make predictions on new data;
- the `score` method is used to predict on the test data and compare the
  predictions to the expected test labels to compute the accuracy.

In [32]:
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
model.fit(data_train, target_train)
model.predict(data_test)[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)

In [33]:
target_test[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

In [34]:
data_test.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
7762,56,Private,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,0,40,United-States
23881,25,Private,HS-grad,9,Married-civ-spouse,Transport-moving,Own-child,Other,Male,0,0,40,United-States
30507,43,Private,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,14344,0,40,United-States
28911,32,Private,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States
19484,39,Private,Bachelors,13,Married-civ-spouse,Sales,Wife,White,Female,0,0,30,United-States


In [35]:
model.score(data_test, target_test)

0.8577512079272787

This model can also be cross-validated as usual (instead of using a single
train-test split):

In [36]:
scores = cross_val_score(model, data, target, cv=5)
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
print(f"The different scores obtained are: \n{scores}")

The accuracy is: 0.851 +- 0.003
The different scores obtained are: 
[0.85116184 0.8498311  0.84756347 0.85268223 0.85513923]


## Fitting a more powerful model

Linear models are very nice because they are usually very cheap to train and
give a good baseline.

However, it is often useful to check whether more complex models such as
ensembles of decision trees can lead to higher predictive performance.

In the following, we try a scalable implementation of the Gradient Boosting
Machine algorithm. For this class of models, we know that, contrary to linear
models, it is unnecessary to scale the numerical features; furthermore, it is
both safe and significantly more computationally efficient to use an arbitrary
integer encoding for the categorical variables. Therefore, we adapt the
preprocessing pipeline as follows:

In [37]:
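# This import is only required on scikit-learn < 1.0, where the estimator was
# still experimental; it can be dropped on more recent versions.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa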
from sklearn.ensemble import HistGradientBoostingClassifier

# For each categorical column, extract the list of all possible categories
# in some arbitrary order.
categories = [data[column].unique() for column in data[categorical_columns]]

preprocessor = ColumnTransformer([
    ('categorical', OrdinalEncoder(categories=categories), categorical_columns),
], remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
model.fit(data_train, target_train)
print(model.score(data_test, target_test))

0.8801080992547703


We can observe that we get a significantly higher accuracy with the Gradient
Boosting model. This is often what we observe whenever the dataset has a large
number of samples and a limited number of informative features (e.g. fewer
than 1000) with a mix of numerical and categorical variables.

This explains why Gradient Boosting Machines are very popular among data
science practitioners who work with tabular data.

Exercises:
- Check that scaling the numerical features does not impact the speed or
  accuracy of `HistGradientBoostingClassifier`.
- Check that one-hot encoding the categorical variables does not improve the
  accuracy of `HistGradientBoostingClassifier` but slows down the training.