# Building design matrices with `ModelSpec`

In [1]:
x=4
import numpy as np, pandas as pd
%load_ext rpy2.ipython

from ISLP import load_data
from ISLP.models import ModelSpec

import statsmodels.api as sm

In [2]:
Carseats = load_data('Carseats')
%R -i Carseats
Carseats.columns

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

## Let's break up income into groups

In [3]:
Carseats['OIncome'] = pd.cut(Carseats['Income'], 
                             [0,50,90,200], 
                             labels=['L','M','H'])
Carseats['OIncome']

0      M
1      L
2      L
3      H
4      M
      ..
395    H
396    L
397    L
398    M
399    L
Name: OIncome, Length: 400, dtype: category
Categories (3, object): ['L' < 'M' < 'H']

Let's also create an unordered version:

In [4]:
Carseats['UIncome'] = pd.cut(Carseats['Income'], 
                             [0,50,90,200], 
                             labels=['L','M','H'],
                             ordered=False)
Carseats['UIncome']

0      M
1      L
2      L
3      H
4      M
      ..
395    H
396    L
397    L
398    M
399    L
Name: UIncome, Length: 400, dtype: category
Categories (3, object): ['L', 'M', 'H']
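
To make the binning rule concrete, here is a minimal sketch on a handful of made-up income values (the name `income_demo` is purely illustrative and not part of `Carseats`). By default `pd.cut` uses right-closed intervals, so the bins here are (0, 50], (50, 90] and (90, 200].

```python
import pandas as pd

# Hypothetical incomes chosen to hit every bin, including the bin edges.
income_demo = pd.Series([73, 48, 35, 100, 50, 90])

# Bins are right-closed by default: (0, 50], (50, 90], (90, 200].
ordered = pd.cut(income_demo, [0, 50, 90, 200], labels=['L', 'M', 'H'])
unordered = pd.cut(income_demo, [0, 50, 90, 200], labels=['L', 'M', 'H'],
                   ordered=False)

print(list(ordered))           # label assigned to each value
print(ordered.cat.ordered)     # True: the categories carry an order
print(unordered.cat.ordered)   # False: same labels, no order
```

The two results hold the same labels; only the `ordered` flag of the categorical dtype differs, and that flag is what distinguishes an ordinal column from a plain categorical one.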

## A simple model

In [5]:
design = ModelSpec(['Price', 'Income'])
X = design.fit_transform(Carseats)
X.columns

Index(['intercept', 'Price', 'Income'], dtype='object')

In [6]:
Y = Carseats['Sales']
M = sm.OLS(Y, X).fit()
M.params

intercept    12.661546
Price        -0.052213
Income        0.012829
dtype: float64

## Basic procedure

The design matrix is built by assembling a set of columns, possibly transforming them along the way.
A `pd.DataFrame` is essentially a list of columns. One of the first tasks in `ModelSpec.fit`
is to inspect the dataframe for column information. The column `ShelveLoc` is categorical:

In [7]:
Carseats['ShelveLoc']

0         Bad
1        Good
2      Medium
3      Medium
4         Bad
        ...  
395      Good
396    Medium
397    Medium
398       Bad
399      Good
Name: ShelveLoc, Length: 400, dtype: category
Categories (3, object): ['Bad', 'Good', 'Medium']

`ModelSpec` records this in the form of `Column` objects, which are just named tuples with two methods,
`get_columns` and `fit_encoder`.

In [8]:
design.column_info_['ShelveLoc']

Column(idx='ShelveLoc', name='ShelveLoc', is_categorical=True, is_ordinal=False, columns=('ShelveLoc[Good]', 'ShelveLoc[Medium]'), encoder=Contrast())

It recognizes ordinal columns as well.

In [9]:
design.column_info_['OIncome']

Column(idx='OIncome', name='OIncome', is_categorical=True, is_ordinal=True, columns=('OIncome',), encoder=OrdinalEncoder())

In [10]:
income = design.column_info_['Income']
cols, names = income.get_columns(Carseats)
(cols[:4], names)

(array([ 73,  48,  35, 100]), ('Income',))

## Encoding a column

In building a design matrix we must extract columns from our dataframe (or `np.ndarray`). Categorical
variables are usually encoded by several columns, typically one fewer than the number of categories.
This task is handled by the `encoder` of the `Column`. The encoder must follow the `sklearn` transform
model, i.e. `fit` on some array and `transform` on future arrays. The `fit_encoder` method of `Column` fits
its encoder the first time data is passed to it.

In [11]:
shelve = design.column_info_['ShelveLoc']
cols, names = shelve.get_columns(Carseats)
(cols[:4], names)

(array([[0., 0.],
        [1., 0.],
        [0., 1.],
        [0., 1.]]),
 ['ShelveLoc[Good]', 'ShelveLoc[Medium]'])

In [12]:
oincome = design.column_info_['OIncome']
oincome.get_columns(Carseats)[0][:4]

array([[2.],
       [1.],
       [1.],
       [0.]])
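
The fit-then-transform contract described above can be sketched with `sklearn.preprocessing.OneHotEncoder`. This is only a stand-in for illustration: it is an assumption that it behaves like the `Contrast` encoder shown in `column_info_`, whose internals are not covered here. The arrays `train` and `future` are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column with three categories.
train = np.array([['Bad'], ['Good'], ['Medium'], ['Medium']])

# `drop='first'` removes the first category (here 'Bad'), leaving
# one column fewer than the number of categories.
enc = OneHotEncoder(drop='first').fit(train)
train_cols = enc.transform(train).toarray()

# The fitted encoder is reused, unchanged, on later data.
future = np.array([['Good'], ['Bad']])
future_cols = enc.transform(future).toarray()

print(enc.categories_[0])
print(train_cols)
print(future_cols)
```

Because the categories are learned once in `fit`, later arrays are encoded with the same columns even when they do not contain every level.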

## The terms

The design matrix consists of several sets of columns. `ModelSpec` manages these through
the `terms` argument, which should be a sequence. The elements of `terms` are usually
strings (or tuples of strings for interactions, see below), but each is converted to a
`Variable` object and stored in the `terms_` attribute of the fitted `ModelSpec`. A `Variable` is just a named tuple.

In [13]:
design.terms

['Price', 'Income']

In [14]:
design.terms_

[Variable(variables=('Price',), name='Price', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),
 Variable(variables=('Income',), name='Income', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]

While each `Column` can itself extract data, they are all promoted to `Variable` to be of a uniform type. A
`Variable` can also create columns through the `build_columns` method of `ModelSpec`:

In [15]:
price = design.terms_[0]
design.build_columns(Carseats, price)

(     Price
 0      120
 1       83
 2       80
 3       97
 4      128
 ..     ...
 395    128
 396    120
 397    159
 398     95
 399    120
 
 [400 rows x 1 columns],
 ['Price'])

Note that `Variable` objects have a tuple of `variables` as well as an `encoder` attribute. The
columns for each entry of `variables` are first concatenated into a single dataframe, which is then
passed through `encoder.transform`. The `encoder.fit` method of each `Variable` is run once, during
the call to `ModelSpec.fit`.

In [16]:
from ISLP.models.model_spec import Variable

new_var = Variable(('Price', 'Income', 'UIncome'), name='mynewvar', encoder=None)
design.build_columns(Carseats, new_var)

(     Price  Income  UIncome[L]  UIncome[M]
 0    120.0    73.0         0.0         1.0
 1     83.0    48.0         1.0         0.0
 2     80.0    35.0         1.0         0.0
 3     97.0   100.0         0.0         0.0
 4    128.0    64.0         0.0         1.0
 ..     ...     ...         ...         ...
 395  128.0   108.0         0.0         0.0
 396  120.0    23.0         1.0         0.0
 397  159.0    26.0         1.0         0.0
 398   95.0    79.0         0.0         1.0
 399  120.0    37.0         1.0         0.0
 
 [400 rows x 4 columns],
 ['Price', 'Income', 'UIncome[L]', 'UIncome[M]'])

Let's now transform these columns with an encoder. Within `ModelSpec`, the arrays above are built first,
then `pca.fit` is called, and finally `pca.transform` is called within `design.build_columns`.

In [17]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(design.build_columns(Carseats, new_var)[0]) # this is done within `ModelSpec.fit`
pca_var = Variable(('Price', 'Income', 'UIncome'), name='mynewvar', encoder=pca)
design.build_columns(Carseats, pca_var)



(     mynewvar[0]  mynewvar[1]
 0      -3.608693    -4.853177
 1      15.081506    35.708630
 2      27.422871    40.774250
 3     -33.973209    13.470489
 4       6.567316   -11.290100
 ..           ...          ...
 395   -36.846346   -18.415783
 396    45.741500     3.245602
 397    49.097533   -35.725355
 398   -13.577772    18.845139
 399    31.927566     0.978436
 
 [400 rows x 2 columns],
 ['mynewvar[0]', 'mynewvar[1]'])
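
The sequence of steps (concatenate, fit once, transform, name the outputs) can be sketched outside of `ModelSpec` on synthetic data. The dataframe `df` and its columns `a`, `b`, `c` are made up, and the `name[i]` naming simply imitates the output above. This is not a claim about how `build_columns` is implemented.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size=20),
                   'b': rng.normal(size=20),
                   'c': rng.normal(size=20)})

# Step 1: gather the block of columns for each variable and concatenate.
blocks = [df[['a']], df[['b', 'c']]]
stacked = pd.concat(blocks, axis=1)

# Step 2: fit the encoder once.
encoder = PCA(n_components=2).fit(stacked)

# Step 3: transform, naming the outputs `name[i]`.
name = 'mynewvar'
scores = pd.DataFrame(encoder.transform(stacked),
                      index=df.index,
                      columns=[f'{name}[{i}]' for i in range(2)])
print(scores.head())
```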

The elements of the `variables` attribute may be column identifiers (`"Price"`), `Column` instances,
or `Variable` instances (`price`, `pca_var`).

In [18]:
fancy_var = Variable(('Price', price, pca_var), name='fancy', encoder=None)
design.build_columns(Carseats, fancy_var)



(     Price  Price  mynewvar[0]  mynewvar[1]
 0    120.0  120.0    -3.608693    -4.853177
 1     83.0   83.0    15.081506    35.708630
 2     80.0   80.0    27.422871    40.774250
 3     97.0   97.0   -33.973209    13.470489
 4    128.0  128.0     6.567316   -11.290100
 ..     ...    ...          ...          ...
 395  128.0  128.0   -36.846346   -18.415783
 396  120.0  120.0    45.741500     3.245602
 397  159.0  159.0    49.097533   -35.725355
 398   95.0   95.0   -13.577772    18.845139
 399  120.0  120.0    31.927566     0.978436
 
 [400 rows x 4 columns],
 ['Price', 'Price', 'mynewvar[0]', 'mynewvar[1]'])

If we wanted, we could of course run PCA again on these features.

In [19]:
pca2 = PCA(n_components=2)
pca2.fit(design.build_columns(Carseats, fancy_var)[0]) # this is done within `ModelSpec.fit`
pca2_var = Variable(('Price', price, pca_var), name='fancy_pca', encoder=pca2)
design.build_columns(Carseats, pca2_var)



(     fancy_pca[0]  fancy_pca[1]
 0       -6.951792      4.859283
 1       55.170148    -24.694875
 2       59.418556    -38.033572
 3       34.722389     28.922184
 4      -21.419184     -3.120673
 ..            ...           ...
 395    -18.257348     40.760122
 396    -10.546709    -45.021658
 397    -77.706359    -37.174379
 398     36.668694      7.730851
 399     -9.540535    -31.059122
 
 [400 rows x 2 columns],
 ['fancy_pca[0]', 'fancy_pca[1]'])

## Building the design matrix

With these notions in mind, the final design is essentially:

In [20]:
X_hand = np.column_stack([design.build_columns(Carseats, v)[0] for v in design.terms_])[:4]

An intercept column is added if `design.intercept` is `True`. If the original argument to `transform` is
a dataframe, the index is adjusted accordingly.

In [21]:
design.intercept

True

In [22]:
design.transform(Carseats)[:4]

   intercept  Price  Income
0        1.0    120      73
1        1.0     83      48
2        1.0     80      35
3        1.0     97     100
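
As a rough illustration of the assembly described above, the following sketch builds the same four rows by hand with plain `pandas`. The string index is deliberately non-default (and made up) to show that a dataframe's index can be carried through. The name `X_sketch` is illustrative only.

```python
import numpy as np
import pandas as pd

# First four rows of Price and Income as displayed above,
# with a hypothetical non-default index.
df = pd.DataFrame({'Price': [120, 83, 80, 97],
                   'Income': [73, 48, 35, 100]},
                  index=['a', 'b', 'c', 'd'])

# One block of columns per term, in the order the terms are given.
blocks = [df[['Price']], df[['Income']]]
X_sketch = pd.concat(blocks, axis=1)

# Prepend the intercept column of ones.
X_sketch.insert(0, 'intercept', np.ones(len(df)))
print(X_sketch)
```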


## Predicting

The `transform` method constructs the design matrix at new values of the features.

In [23]:
new_data = pd.DataFrame({'Price':[10,20], 'Income':[40, 50]})
new_X = design.transform(new_data)
M.get_prediction(new_X).predicted_mean

array([12.65257604, 12.25873428])

In [24]:
%%R -i new_data,Carseats
predict(lm(Sales ~ Price + Income, data=Carseats), new_data)

       0        1 
12.65258 12.25873 
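
For a linear model the predicted mean is just the design matrix times the coefficient vector. The sketch below checks this by hand, using the rounded coefficients printed from `M.params` earlier, so the result agrees with the predictions above only up to that rounding. The names `params` and `new_X_sketch` are illustrative.

```python
import numpy as np
import pandas as pd

# Coefficients copied (rounded) from `M.params` above.
params = pd.Series({'intercept': 12.661546,
                    'Price': -0.052213,
                    'Income': 0.012829})

new_data = pd.DataFrame({'Price': [10, 20], 'Income': [40, 50]})

# Rebuild the design matrix by hand: intercept first, then the terms.
new_X_sketch = new_data[['Price', 'Income']].copy()
new_X_sketch.insert(0, 'intercept', 1.0)

# The predicted mean of a linear model is X @ beta.
pred = new_X_sketch.values @ params[new_X_sketch.columns].values
print(pred)
```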


### Difference between using `pd.DataFrame` and `np.ndarray`

If the `terms` only refer to a few columns of the data frame, the `transform` method only needs a dataframe with those columns.

If we had used an `np.ndarray`, the column identifiers would be integers identifying specific columns so,
in order to work correctly, `transform` would need another `np.ndarray` where the columns have the same meaning.

In [25]:
Carseats_np = np.asarray(Carseats[['Price', 'ShelveLoc', 'US', 'Income']])
design_np = ModelSpec([0,3]).fit(Carseats_np)
design_np.transform(Carseats_np)[:4]

array([[1.0, 120, 73],
       [1.0, 83, 48],
       [1.0, 80, 35],
       [1.0, 97, 100]], dtype=object)

The following will fail for hopefully obvious reasons:

In [26]:
try:
    new_D = np.zeros((2,2))
    new_D[:,0] = [10,20]
    new_D[:,1] = [40,50]
    M.get_prediction(new_D).predicted_mean
except ValueError as e:
    print(e)

shapes (2,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)


Ultimately, `M` expects 3 columns for new predictions because it was fit
with a matrix having 3 columns (the first representing an intercept).

We might be tempted to try as with the `pd.DataFrame` and produce
an `np.ndarray` with only the necessary variables.

In [27]:
try:
    new_X = np.zeros((2,2))
    new_X[:,0] = [10,20]
    new_X[:,1] = [40,50]
    new_D = design_np.transform(new_X)
    M.get_prediction(new_D).predicted_mean
except IndexError as e:
    print(e)

index 3 is out of bounds for axis 1 with size 2


This fails because `design_np` is looking for column `3` from its `terms`:

In [28]:
design_np.terms_

[Variable(variables=(0,), name='0', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False),
 Variable(variables=(3,), name='3', encoder=None, use_transform=True, pure_columns=True, override_encoder_colnames=False)]

However, if we have an `np.ndarray` in which the first column indeed represents `Price` and the fourth indeed
represents `Income`, then we can arrive at the correct answer by supplying such an array to `design_np.transform`:

In [29]:
new_X = np.zeros((2,4))
new_X[:,0] = [10,20]
new_X[:,3] = [40,50]
new_D = design_np.transform(new_X)
M.get_prediction(new_D).predicted_mean

array([12.65257604, 12.25873428])

Given this subtlety about needing to supply arrays with identical column structure to `transform` when
using `np.ndarray`, we presume that using a `pd.DataFrame` will be the more popular use case.
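
The underlying difference is between label-based and position-based column selection, which can be seen with plain `pandas` and `numpy`. The small frames and arrays below are made up for illustration.

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({'Price': [120, 83], 'ShelveLoc': ['Bad', 'Good'],
                     'US': ['Yes', 'Yes'], 'Income': [73, 48]})

# Label-based selection only needs the named columns to be present,
# in any order.
subset = pd.DataFrame({'Income': [40, 50], 'Price': [10, 20]})
by_label = subset[['Price', 'Income']]

# Position-based selection depends on the full column layout.
positions = full.columns.get_indexer(['Price', 'Income'])   # positions 0 and 3
narrow = np.array([[10, 40], [20, 50]])
try:
    narrow[:, positions]
    failed = False
except IndexError as e:
    failed = True
    print(e)

# Padding the array back to the original layout makes the positions valid.
wide = np.zeros((2, 4))
wide[:, 0] = [10, 20]
wide[:, 3] = [40, 50]
by_position = wide[:, positions]
print(by_position)
```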

## A model with some categorical variables

Categorical variables become `Column` instances with encoders.

In [30]:
design = ModelSpec(['Population', 'Price', 'UIncome', 'ShelveLoc']).fit(Carseats)
design.column_info_['UIncome']

Column(idx='UIncome', name='UIncome', is_categorical=True, is_ordinal=False, columns=('UIncome[L]', 'UIncome[M]'), encoder=Contrast())

In [31]:
X = design.fit_transform(Carseats)
X.columns

Index(['intercept', 'Population', 'Price', 'UIncome[L]', 'UIncome[M]',
       'ShelveLoc[Good]', 'ShelveLoc[Medium]'],
      dtype='object')

In [32]:
sm.OLS(Y, X).fit().params

intercept            11.876012
Population            0.001163
Price                -0.055725
UIncome[L]           -1.042297
UIncome[M]           -0.119123
ShelveLoc[Good]       4.999623
ShelveLoc[Medium]     1.964278
dtype: float64

In [33]:
%%R
lm(Sales ~ Population + Price + UIncome + ShelveLoc, data=Carseats)$coef

    (Intercept)      Population           Price        UIncomeM        UIncomeH 
    10.83371503      0.00116301     -0.05572469      0.92317388      1.04229679 
  ShelveLocGood ShelveLocMedium 
     4.99962319      1.96427771 
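
The `ModelSpec` fit and the `R` fit above differ only in the reference level of `UIncome` (`H` for the former, `L` for the latter), so one set of coefficients can be converted into the other by simple arithmetic. The sketch below uses the rounded values printed above. The variable names are illustrative.

```python
import numpy as np

# Coefficients from the `ModelSpec` fit above (baseline level: H).
py_intercept = 11.876012
py_L, py_M = -1.042297, -0.119123

# Re-express relative to baseline L, as in the `R` fit:
# each level's offset is measured from the new baseline.
r_intercept = py_intercept + py_L    # intercept now absorbs level L
r_M = py_M - py_L                    # M relative to L
r_H = 0.0 - py_L                     # H relative to L

print(r_intercept, r_M, r_H)
```

Up to rounding, these reproduce the `(Intercept)`, `UIncomeM` and `UIncomeH` entries reported by `R`, while the remaining coefficients are unchanged.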


## Getting the encoding you want

By default the level dropped by `ModelSpec` will be the first of the `categories_` values from 
`sklearn.preprocessing.OneHotEncoder()`. We might wish to change this. It seems
as if the correct way to do this would be something like `Variable(('UIncome',), 'mynewencoding', new_encoder)`
where `new_encoder` would somehow drop the column we want dropped. 

However, when using the convenient identifier `UIncome` in the `variables` argument, this maps to the `Column` associated to `UIncome` within `design.column_info_`:

In [34]:
design.column_info_['UIncome']

Column(idx='UIncome', name='UIncome', is_categorical=True, is_ordinal=False, columns=('UIncome[L]', 'UIncome[M]'), encoder=Contrast())

This column already has an encoder, and `Column` instances are immutable as named tuples. Further, there are times when
we may want to encode `UIncome` differently within the same model. In the model below, the main effect of `UIncome` is encoded with two columns, while in the interaction (see below) `UIncome` has three columns. This is a design of interest,
and we need a way to allow different encodings of the same column of `Carseats`.

In [35]:
%%R
lm(Sales ~ UIncome:ShelveLoc + UIncome, data=Carseats)


Call:
lm(formula = Sales ~ UIncome:ShelveLoc + UIncome, data = Carseats)

Coefficients:
             (Intercept)                  UIncomeM                  UIncomeH  
                  5.1317                    0.1151                    1.1561  
  UIncomeL:ShelveLocGood    UIncomeM:ShelveLocGood    UIncomeH:ShelveLocGood  
                  4.5121                    5.5752                    3.7381  
UIncomeL:ShelveLocMedium  UIncomeM:ShelveLocMedium  UIncomeH:ShelveLocMedium  
                  1.2473                    2.4782                    1.5141  



We can create a new `Column` with the encoder we want. For categorical variables, there is a convenience function to do so.

In [36]:
from ISLP.models.model_spec import contrast
pref_encoding = contrast('UIncome', 'drop', 'L')

In [37]:
design.build_columns(Carseats, pref_encoding)

(     UIncome[M]  UIncome[H]
 0           1.0         0.0
 1           0.0         0.0
 2           0.0         0.0
 3           0.0         1.0
 4           1.0         0.0
 ..          ...         ...
 395         0.0         1.0
 396         0.0         0.0
 397         0.0         0.0
 398         1.0         0.0
 399         0.0         0.0
 
 [400 rows x 2 columns],
 ['UIncome[M]', 'UIncome[H]'])
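
The effect of choosing which level to drop can be imitated with `pd.get_dummies`: build one indicator per level, then remove the column for the chosen baseline. This is a sketch with plain `pandas`, not the implementation behind `contrast`. The series `u` holds the first five values of `UIncome` shown earlier.

```python
import pandas as pd

# First five rows of `UIncome` as displayed earlier.
u = pd.Series(pd.Categorical(['M', 'L', 'L', 'H', 'M'],
                             categories=['L', 'M', 'H']),
              name='UIncome')

# One indicator column per level, in category order.
full = pd.get_dummies(u, dtype=float)
full.columns = [str(c) for c in full.columns]

# Drop the chosen baseline level and label the remaining columns.
baseline = 'L'
kept = full.drop(columns=[baseline])
kept.columns = [f'UIncome[{level}]' for level in kept.columns]
print(kept)
```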

In [38]:
design = ModelSpec(['Population', 'Price', pref_encoding, 'ShelveLoc']).fit(Carseats)
X = design.fit_transform(Carseats)
X.columns

Index(['intercept', 'Population', 'Price', 'UIncome[M]', 'UIncome[H]',
       'ShelveLoc[Good]', 'ShelveLoc[Medium]'],
      dtype='object')

In [39]:
sm.OLS(Y, X).fit().params

intercept            10.833715
Population            0.001163
Price                -0.055725
UIncome[M]            0.923174
UIncome[H]            1.042297
ShelveLoc[Good]       4.999623
ShelveLoc[Medium]     1.964278
dtype: float64

In [40]:
%%R
lm(Sales ~ Population + Price + UIncome + ShelveLoc, data=Carseats)$coef

    (Intercept)      Population           Price        UIncomeM        UIncomeH 
    10.83371503      0.00116301     -0.05572469      0.92317388      1.04229679 
  ShelveLocGood ShelveLocMedium 
     4.99962319      1.96427771 


## Interactions

We've referred to interactions above. These are specified (for convenience) as tuples in the `terms` argument
to `ModelSpec`.

In [41]:
design = ModelSpec([('UIncome', 'ShelveLoc'), 'UIncome'])
X = design.fit_transform(Carseats)
sm.OLS(Y, X).fit().params

intercept                       7.866634
UIncome[L]:ShelveLoc[Good]      4.512054
UIncome[L]:ShelveLoc[Medium]    1.247275
UIncome[M]:ShelveLoc[Good]      5.575170
UIncome[M]:ShelveLoc[Medium]    2.478163
UIncome[L]                     -2.734895
UIncome[M]                     -2.619745
dtype: float64

Each tuple in `terms` is converted to a `Variable` in the formalized `terms_` attribute. Its
`variables` are set to the tuple and its encoder is an `Interaction` encoder, which (unsurprisingly) creates the interaction columns from the concatenated data frames of `UIncome` and `ShelveLoc`.

In [42]:
design.terms_[0]

Variable(variables=('UIncome', 'ShelveLoc'), name='UIncome:ShelveLoc', encoder=Interaction(column_names={'ShelveLoc': ['ShelveLoc[Good]', 'ShelveLoc[Medium]'],
                          'UIncome': ['UIncome[L]', 'UIncome[M]']},
            columns={'ShelveLoc': range(2, 4), 'UIncome': range(0, 2)},
            variables=['UIncome', 'ShelveLoc']), use_transform=True, pure_columns=False, override_encoder_colnames=False)
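
For two categorical variables, interaction columns are elementwise products of the indicator columns of each factor. The sketch below assumes that is what is being computed here (the internals of `Interaction` are not shown), with column names ordered as in the fitted parameters above. The indicator blocks `A` and `B` are made up.

```python
import numpy as np

# Hypothetical indicator blocks for four observations.
# Columns of A: UIncome[L], UIncome[M].
# Columns of B: ShelveLoc[Good], ShelveLoc[Medium].
A = np.array([[0., 1.], [1., 0.], [1., 0.], [0., 0.]])
B = np.array([[0., 0.], [1., 0.], [0., 1.], [0., 1.]])
A_names = ['UIncome[L]', 'UIncome[M]']
B_names = ['ShelveLoc[Good]', 'ShelveLoc[Medium]']

# Every column of A times every column of B, with A varying slowest.
inter = np.column_stack([A[:, i] * B[:, j]
                         for i in range(A.shape[1])
                         for j in range(B.shape[1])])
inter_names = [f'{a}:{b}' for a in A_names for b in B_names]

print(inter_names)
print(inter)
```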

Compare this to the previous `R` model:

In [43]:
%%R
lm(Sales ~ UIncome:ShelveLoc + UIncome, data=Carseats)


Call:
lm(formula = Sales ~ UIncome:ShelveLoc + UIncome, data = Carseats)

Coefficients:
             (Intercept)                  UIncomeM                  UIncomeH  
                  5.1317                    0.1151                    1.1561  
  UIncomeL:ShelveLocGood    UIncomeM:ShelveLocGood    UIncomeH:ShelveLocGood  
                  4.5121                    5.5752                    3.7381  
UIncomeL:ShelveLocMedium  UIncomeM:ShelveLocMedium  UIncomeH:ShelveLocMedium  
                  1.2473                    2.4782                    1.5141  



We note a few important things:

1. `R` has reorganized the columns of the design from the formula: although we wrote `UIncome:ShelveLoc` first these
columns have been built later. **`ModelSpec` builds columns in the order determined by `terms`!**

2. As noted above, `R` has encoded `UIncome` differently in the main effect and in the interaction. For `ModelSpec`, the reference to `UIncome` always refers to the column in `design.column_info_` and will always build only the columns for `L` and `M`. **`ModelSpec` does no inspection of terms to decide how to encode categorical variables.**

A few notes:

- **Why not try to inspect the terms?** For any nontrivial formula in `R` with several categorical variables and interactions, predicting what columns will be produced from a given formula is not simple. **`ModelSpec` errs on the side of being explicit.**

- **Is it impossible to build the design as `R` has?** No. An advanced user who *knows* they want the columns built as `R` has can do so (fairly) easily.

In [44]:
full_encoding = contrast('UIncome', None)
design.build_columns(Carseats, full_encoding)

(     UIncome[H]  UIncome[L]  UIncome[M]
 0           0.0         0.0         1.0
 1           0.0         1.0         0.0
 2           0.0         1.0         0.0
 3           1.0         0.0         0.0
 4           0.0         0.0         1.0
 ..          ...         ...         ...
 395         1.0         0.0         0.0
 396         0.0         1.0         0.0
 397         0.0         1.0         0.0
 398         0.0         0.0         1.0
 399         0.0         1.0         0.0
 
 [400 rows x 3 columns],
 ['UIncome[H]', 'UIncome[L]', 'UIncome[M]'])

In [45]:
design = ModelSpec([pref_encoding, (full_encoding, 'ShelveLoc')])
X = design.fit_transform(Carseats)
sm.OLS(Y, X).fit().params

intercept                       5.131739
UIncome[M]                      0.115150
UIncome[H]                      1.156118
UIncome[H]:ShelveLoc[Good]      3.738052
UIncome[H]:ShelveLoc[Medium]    1.514104
UIncome[L]:ShelveLoc[Good]      4.512054
UIncome[L]:ShelveLoc[Medium]    1.247275
UIncome[M]:ShelveLoc[Good]      5.575170
UIncome[M]:ShelveLoc[Medium]    2.478163
dtype: float64

## Special encodings

For flexible models, we may want to consider transformations of features, e.g. polynomial
or spline transformations. Given transforms that follow the `fit/transform` paradigm,
we can of course achieve this with a `Column` and an `encoder`. The `ISLP.transforms`
package includes a `Poly` transform:

In [46]:
from ISLP.models.model_spec import poly
poly('Income', 3)

Variable(variables=('Income',), name='poly(Income, 3)', encoder=Poly(degree=3), use_transform=True, pure_columns=False, override_encoder_colnames=True)

In [47]:
design = ModelSpec([poly('Income', 3), 'ShelveLoc'])
X = design.fit_transform(Carseats)
sm.OLS(Y, X).fit().params

intercept              5.440077
poly(Income, 3)[0]    10.036373
poly(Income, 3)[1]    -2.799156
poly(Income, 3)[2]     2.399601
ShelveLoc[Good]        4.808133
ShelveLoc[Medium]      1.889533
dtype: float64

Compare:

In [48]:
%%R
lm(Sales ~ poly(Income, 3) + ShelveLoc, data=Carseats)$coef

     (Intercept) poly(Income, 3)1 poly(Income, 3)2 poly(Income, 3)3 
        5.440077        10.036373        -2.799156         2.399601 
   ShelveLocGood  ShelveLocMedium 
        4.808133         1.889533 
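
The agreement with `R`'s `poly` suggests an orthogonal polynomial basis. One standard way to construct such a basis is a QR decomposition of the powers of the centered feature, sketched below. Whether `Poly` uses exactly this construction is an assumption. The function name `ortho_poly` is hypothetical, and column signs may differ from `R`'s because a QR factorization is only unique up to sign.

```python
import numpy as np

def ortho_poly(x, degree):
    """Orthonormal polynomial basis of `x`, excluding the constant column."""
    x = np.asarray(x, dtype=float)
    # Powers 0..degree of the centered data.
    V = np.vander(x - x.mean(), degree + 1, increasing=True)
    # QR orthogonalizes the columns in order of increasing degree.
    Q, _ = np.linalg.qr(V)
    # Drop the constant column; the rest are orthonormal and sum to zero.
    return Q[:, 1:]

# A few made-up income values with more distinct points than the degree.
x = np.array([73., 48., 35., 100., 64., 113.])
P = ortho_poly(x, 3)
print(np.round(P.T @ P, 10))   # identity matrix up to rounding
```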


## Splines

Support for natural splines and B-splines is also included.

In [49]:
from ISLP.models.model_spec import ns, bs, pca

## Custom encoding

Instead of PCA, we might run some clustering on some features and then use the clusters to
create new features. This can be done with `derived_variable`. Indeed, `pca`, `ns` and `bs` are all examples
of this.

In [50]:
from ISLP.models.model_spec import derived_variable, Contrast

In [51]:
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
cluster = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=0))
group = Variable(('Income', 'Price', 'Advertising', 'Population'), 'group', None)
X = design.build_submodel(Carseats, [group]).drop('intercept', axis=1)
cluster.fit(X.values)
cluster.predict(X.values)

array([1, 1, 2, 1, 2, 1, 0, 1, 0, 0, 0, 1, 2, 2, 0, 1, 2, 1, 0, 0, 0, 2,
       2, 2, 1, 2, 1, 0, 0, 1, 0, 1, 2, 1, 2, 0, 0, 2, 2, 2, 0, 2, 0, 2,
       0, 2, 0, 0, 2, 0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 2, 2, 0, 1, 2,
       0, 1, 1, 2, 1, 1, 2, 0, 0, 1, 1, 0, 2, 0, 1, 0, 0, 2, 2, 0, 1, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 0, 1, 2, 0, 0, 2, 0, 0, 1, 2, 0, 1, 0,
       0, 1, 0, 2, 0, 2, 0, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0,
       0, 0, 2, 1, 0, 2, 1, 1, 1, 2, 0, 0, 2, 0, 2, 1, 0, 0, 0, 1, 2, 2,
       1, 0, 2, 2, 0, 2, 2, 2, 2, 0, 0, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 0,
       1, 0, 0, 2, 1, 0, 2, 1, 2, 1, 0, 2, 2, 1, 1, 2, 2, 0, 1, 1, 2, 2,
       1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2, 2, 2, 1, 1, 0, 0, 1, 2, 2, 1, 1,
       1, 2, 0, 2, 2, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 2, 1, 2, 2, 0, 0, 0,
       2, 2, 2, 2, 1, 0, 0, 0, 1, 0, 0, 2, 1, 0, 2, 1, 2, 1, 1, 2, 1, 2,
       2, 2, 1, 1, 0, 2, 2, 2, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 2,
       1, 2, 2, 1, 1, 0, 1, 0, 0, 1, 2, 1, 2, 1, 0,

For clustering, we often want to use the `predict` method rather than the `transform` method. If the ultimate
features all use `transform`, then we do not even need these two calls to `make_pipeline`.

In [52]:
cluster2 = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=0))
cluster_var = derived_variable(['Income', 'Price', 'Advertising', 'Population'], 
                               name='myclus', 
                               encoder=cluster2,
                               use_transform=False)
design = ModelSpec([cluster_var]).fit(Carseats)
design.transform(Carseats)



     intercept  myclus
0          1.0       1
1          1.0       1
2          1.0       2
3          1.0       1
4          1.0       2
..         ...     ...
395        1.0       1
396        1.0       2
397        1.0       2
398        1.0       0
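
The distinction between `predict` and `transform` matters for a `KMeans` pipeline: `transform` returns the distance from each observation to each cluster centre, while `predict` returns the label of the nearest centre. A minimal sketch on synthetic data follows. The array `features` is a made-up stand-in for the four `Carseats` columns, and `n_init=10` is set explicitly only to keep behaviour consistent across `sklearn` versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(60, 4))   # synthetic stand-in for four columns

pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
pipe.fit(features)

distances = pipe.transform(features)  # one column per cluster centre
labels = pipe.predict(features)       # index of the nearest centre

print(distances.shape, labels.shape)
```

With `use_transform=False`, as in the cell above, it is the single column of labels that ends up in the design matrix rather than the three distance columns.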


Somewhat clunkily, we can make this a categorical variable by creating a `Variable` with a
categorical encoder.

In [53]:
cluster2 = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=0))
cluster_var = derived_variable(['Income', 'Price', 'Advertising', 'Population'], 
                               name='myclus', 
                               encoder=cluster2,
                               use_transform=False)
cat_cluster = Variable((cluster_var,), name='mynewcat', encoder=Contrast(method='drop'))
cat_cluster

Variable(variables=(Variable(variables=('Income', 'Price', 'Advertising', 'Population'), name='myclus', encoder=Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=3, random_state=0))]), use_transform=False, pure_columns=False, override_encoder_colnames=True),), name='mynewcat', encoder=Contrast(), use_transform=True, pure_columns=False, override_encoder_colnames=False)

In [54]:
design = ModelSpec([cat_cluster]).fit(Carseats)

design.transform(Carseats)



     intercept    1    2
0          1.0  1.0  0.0
1          1.0  1.0  0.0
2          1.0  0.0  1.0
3          1.0  1.0  0.0
4          1.0  0.0  1.0
..         ...  ...  ...
395        1.0  1.0  0.0
396        1.0  0.0  1.0
397        1.0  0.0  1.0
398        1.0  0.0  0.0
