# Scikit-Learn's new integration with Pandas

Scikit-Learn will make one of its biggest upgrades in recent years with its mammoth version 0.20 release. For those who like to do exploratory data analysis in Pandas and then move to Scikit-Learn, this will likely mean a huge improvement to your workflow.

All of Scikit-Learn's machine learning models require the input to be a two-dimensional NumPy array of numeric values. No string values are allowed. Although there are many methods to encode string variables as numeric, Scikit-Learn never provided a canonical way to handle this very common occurrence.

This led to numerous tutorials handling string columns in a variety of different ways. People were turning to the Pandas `get_dummies` function, creating their own custom estimators, or even developing entire packages, such as [sklearn-pandas][1], to support this trouble spot. This lack of standardization made for a painful experience for those wanting to build machine learning models with string columns.

Furthermore, there was poor support for applying transformations to specific columns rather than to the entire array, for instance, standardizing continuous features but not categorical ones.

# Introducing `ColumnTransformer` and the upgraded `OneHotEncoder`
With the upgrade to version 0.20, many workflows from Pandas to Scikit-Learn should start looking more similar. The `ColumnTransformer` estimator will apply a transformation to a specific subset of columns of your Pandas DataFrame (or array).

The `OneHotEncoder` estimator is not new but has been upgraded to encode string columns. Before, it only encoded columns containing numeric categorical data.

Let's see how these new additions work to handle string columns in a Pandas DataFrame.

# Kaggle Housing Dataset
The [Housing Prices: Advanced Regression Techniques][2] competition is a beginner's machine learning competition that runs permanently on Kaggle. The goal is to predict housing prices given about 80 features. There is a mix of continuous and categorical columns. Download the data; I suggest using the [command line tool API][3].

### Inspect the data
Let's read in our DataFrame and output the first few rows.

[1]: https://github.com/scikit-learn-contrib/sklearn-pandas
[2]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
[3]: https://github.com/Kaggle/kaggle-api

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv('data/housing/train.csv')
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
train.shape

(1460, 81)

### Remove the target variable from the training set
The target variable is `SalePrice` which we remove and assign as an array to its own variable.

In [4]:
y = train.pop('SalePrice').values

# Encoding a single string column
To start off, let's encode a single string column, `HouseStyle`, which describes the style of the dwelling. Let's output the count of each unique string value.

In [5]:
vc = train['HouseStyle'].value_counts()
vc

1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64

We have 8 unique values in this column.

In [6]:
len(vc)

8

## Scikit-Learn Gotcha - Must have 2D data
Most Scikit-Learn estimators require that data be strictly 2-dimensional. If we select the column above as `train['HouseStyle']`, this technically creates a Pandas Series, which is a single dimension of data. We can force Pandas to create a one-column DataFrame by passing a single-item list to the brackets.

In [7]:
hs_train = train[['HouseStyle']].copy()
hs_train.ndim

2
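
As a small illustration of the difference, the sketch below compares the two selection styles on a made-up three-row DataFrame. The name `toy` and its values are placeholders for illustration, not the Kaggle data.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy = pd.DataFrame({'HouseStyle': ['1Story', '2Story', '1Story']}, dtype=object)

as_series = toy['HouseStyle']    # single brackets -> 1D Series
as_frame = toy[['HouseStyle']]   # single-item list -> 2D one-column DataFrame

print(as_series.ndim, as_series.shape)
print(as_frame.ndim, as_frame.shape)
```

The Series reports one dimension and the one-column DataFrame reports two, which is the shape Scikit-Learn transformers expect.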

# Import, Instantiate, Fit - The three-step process for each estimator
The scikit-learn API is consistent for all estimators and uses a three-step process to train or fit the data. 

1. Import the estimator we want from the module it's located in
1. Instantiate the estimator, possibly changing its defaults
1. Fit the estimator to the data. Possibly transform the data to its new space if need be.

Below, we import `OneHotEncoder`, instantiate it and ensure that we get a dense (and not sparse) array returned and then encode our single column with the `fit_transform` method.

In [8]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
hs_train_transformed = ohe.fit_transform(hs_train)
hs_train_transformed

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

As expected, it has encoded each unique value as its own binary column.

In [9]:
hs_train_transformed.shape

(1460, 8)
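
To check the encoding by hand, here is a minimal sketch on a made-up four-row column (`toy` and `ohe_toy` are illustrative names). Because later Scikit-Learn releases renamed the `sparse` argument to `sparse_output`, this sketch keeps the default sparse output and converts it with `.toarray()`, which should behave the same across versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up values; only the column name echoes the housing data.
toy = pd.DataFrame({'HouseStyle': ['2Story', '1Story', 'SLvl', '1Story']},
                   dtype=object)

ohe_toy = OneHotEncoder()
# The default output is a SciPy sparse matrix; convert it to a dense array.
encoded = ohe_toy.fit_transform(toy).toarray()

# categories_ holds the sorted unique values found in each column during fit.
print(ohe_toy.categories_[0])
print(encoded)
```

Each row contains exactly one 1, located in the column of its category, and the columns follow the sorted order stored in `categories_`.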

# We have a NumPy array. Where are the column names?
Notice that our output is a NumPy array and not a Pandas DataFrame. Scikit-Learn was not originally built to be directly integrated with Pandas. All Pandas objects are converted to NumPy arrays internally and NumPy arrays are always returned after a transformation.

We can still get our column names from the `OneHotEncoder` object through its `get_feature_names` method.

In [10]:
feature_names = ohe.get_feature_names()
feature_names

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_SFoyer', 'x0_SLvl'], dtype=object)

## Verifying our first row of data is correct
It's good to verify that our estimator is properly working. Let's look at the first row of encoded data.

In [11]:
row0 = hs_train_transformed[0]
row0

array([0., 0., 0., 0., 0., 1., 0., 0.])

This encodes the 6th value in the array as 1. Let's use boolean indexing to reveal the feature name.

In [12]:
feature_names[row0 == 1]

array(['x0_2Story'], dtype=object)

Now, let's verify that the first value in our original DataFrame column is the same.

In [13]:
hs_train.values[0]

array(['2Story'], dtype=object)

### Use `inverse_transform` to automate this
Just like most transformer objects, there is an `inverse_transform` method that will get you back your original data. Here we must wrap `row0` in a list to make it a 2D array.

In [14]:
ohe.inverse_transform([row0])

array([['2Story']], dtype=object)

We can verify all values by inverting the entire transformed array.

In [15]:
hs_inv = ohe.inverse_transform(hs_train_transformed)
hs_inv

array([['2Story'],
       ['1Story'],
       ['2Story'],
       ...,
       ['2Story'],
       ['1Story'],
       ['1Story']], dtype=object)

In [16]:
np.array_equal(hs_inv, hs_train.values)

True
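
The same round trip can be sketched on a tiny made-up column (the names `toy` and `ohe_toy` are illustrative). Slicing with `encoded[:1]` keeps the first row two-dimensional, which is an alternative to wrapping it in a list.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'HouseStyle': ['2Story', '1Story', 'SLvl']}, dtype=object)

ohe_toy = OneHotEncoder()
encoded = ohe_toy.fit_transform(toy).toarray()

# inverse_transform expects 2D input, so slice instead of indexing a single row.
first_row = ohe_toy.inverse_transform(encoded[:1])
round_trip = ohe_toy.inverse_transform(encoded)

print(first_row)
print(np.array_equal(round_trip, toy.values))
```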

## Applying transformation to the test set
Whatever transformation we do to our training set, we must apply to our test set. Let's read in the test set and get the same column and apply our transformation.

In [17]:
test = pd.read_csv('data/housing/test.csv')
test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [18]:
hs_test = test[['HouseStyle']].copy()

In [19]:
hs_test_transformed = ohe.transform(hs_test)
hs_test_transformed

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

We should again get 8 columns and we do.

In [20]:
hs_test_transformed.shape

(1459, 8)

# Trouble area #1 - Categories unique to the test set
What happens if we have a home with a house style that is unique to just the test set? Say something like `3Story`. Let's change the first value of the house styles and see what the default is from Scikit-Learn.

In [21]:
hs_test = test[['HouseStyle']].copy()
hs_test.iloc[0, 0] = '3Story'
hs_test.head(3)

Unnamed: 0,HouseStyle
0,3Story
1,1Story
2,2Story


In [22]:
ohe.transform(hs_test)

ValueError: Found unknown categories ['3Story'] in column 0 during transform

## Error: Unknown Category
By default, our encoder will produce an error. This is likely what we want as we need to know if there are unique strings in the test set. If you do have this problem then there could be something much deeper that needs investigating. For now, we will ignore this problem and encode this row as all 0's by setting the `handle_unknown` parameter to 'ignore' upon instantiation.

In [23]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(hs_train)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=False)

In [24]:
hs_test_transformed = ohe.transform(hs_test)
hs_test_transformed

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

Let's verify that the first row is all 0's.

In [25]:
hs_test_transformed[0]

array([0., 0., 0., 0., 0., 0., 0., 0.])
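
The two behaviors can be compared side by side on made-up data. In this sketch, `toy_train`, `toy_test`, `strict`, and `lenient` are illustrative names; the only encoder argument used is `handle_unknown`, as above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'HouseStyle': ['1Story', '2Story']}, dtype=object)
toy_test = pd.DataFrame({'HouseStyle': ['3Story', '2Story']}, dtype=object)

# Default behavior: an unseen category raises a ValueError during transform.
strict = OneHotEncoder().fit(toy_train)
try:
    strict.transform(toy_test)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': the unseen category becomes a row of all zeros.
lenient = OneHotEncoder(handle_unknown='ignore').fit(toy_train)
encoded = lenient.transform(toy_test).toarray()

print(raised)
print(encoded)
```

An all-zero row is indistinguishable from "none of the known categories", so it is worth counting such rows before trusting predictions made on them.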

# Trouble area #2 - Missing Values in test set
If you have missing values in your test set (NaN or None), then these will be ignored, that is, encoded as a row of all 0's, as long as `handle_unknown` is set to 'ignore'.

In [26]:
hs_test = test[['HouseStyle']].copy()
hs_test.iloc[0, 0] = np.nan
hs_test.iloc[1, 0] = None
hs_test.head(4)

Unnamed: 0,HouseStyle
0,
1,
2,2Story
3,2Story


In [27]:
hs_test_transformed = ohe.transform(hs_test)
hs_test_transformed[:4]

array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.]])

# Trouble area #3 - Missing Values in training set
Missing values in the training set are more of an issue. As of now, the `OneHotEncoder` estimator cannot fit with missing values.

In [28]:
hs_train = hs_train.copy()
hs_train.iloc[0, 0] = np.nan
hs_train.head(3)

Unnamed: 0,HouseStyle
0,
1,1Story
2,2Story


In [29]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit_transform(hs_train)

TypeError: '<' not supported between instances of 'str' and 'float'

It would be nice if there were an option to ignore them, as happens when transforming the test set above, but no such option exists.

# Must impute missing values
For now, we must impute the missing values. The old `Imputer` from the preprocessing module was deprecated. A new module, `impute`, was formed in its place, with a new estimator, `SimpleImputer`, and a new strategy, 'constant'.

In [30]:
hs_train = train[['HouseStyle']].copy()
hs_train.iloc[0, 0] = np.nan

from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='constant', fill_value='MISSING')
hs_train_imputed = si.fit_transform(hs_train)
hs_train_imputed

array([['MISSING'],
       ['1Story'],
       ['2Story'],
       ...,
       ['2Story'],
       ['1Story'],
       ['1Story']], dtype=object)

From here, we can then encode the imputed array as we did previously.

In [31]:
hs_train_transformed = ohe.fit_transform(hs_train_imputed)
hs_train_transformed

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

Notice that we now have an extra column and an extra feature name.

In [32]:
hs_train_transformed.shape

(1460, 9)

In [33]:
ohe.get_feature_names()

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_MISSING', 'x0_SFoyer', 'x0_SLvl'], dtype=object)
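
Here is a compact sketch of the same impute-then-encode sequence on a made-up column with one missing value. The names `toy`, `si_toy`, and `ohe_toy` are illustrative, and the encoder output is converted with `.toarray()` rather than relying on the `sparse` argument.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'HouseStyle': [np.nan, '1Story', '2Story']}, dtype=object)

# Step 1: replace missing values with a constant placeholder string.
si_toy = SimpleImputer(strategy='constant', fill_value='MISSING')
imputed = si_toy.fit_transform(toy)

# Step 2: one-hot encode; the placeholder becomes a category of its own.
ohe_toy = OneHotEncoder()
encoded = ohe_toy.fit_transform(imputed).toarray()

print(imputed.ravel())
print(ohe_toy.categories_[0])
print(encoded.shape)
```

The placeholder string is treated like any other category, which is why the encoded array gains one column.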

### Apply both transformations to test set
We can manually apply each of the two steps above in order like this:

In [34]:
hs_test = test[['HouseStyle']].copy()
hs_test.iloc[0, 0] = 'reasdf'

In [35]:
hs_test_imputed = si.transform(hs_test)
hs_test_transformed = ohe.transform(hs_test_imputed)
hs_test_transformed.shape

(1459, 9)

In [36]:
ohe.get_feature_names()

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_MISSING', 'x0_SFoyer', 'x0_SLvl'], dtype=object)

## Use a `Pipeline` instead
Scikit-Learn provides a Pipeline transformer and estimator that takes a list of transformations and applies them in succession. You can also run a machine learning model as the final estimator. Here we simply impute and encode.

In [37]:
from sklearn.pipeline import Pipeline

Each step is a two-item tuple consisting of a string that labels the step and the instantiated estimator.

In [38]:
si_step = ('si', SimpleImputer(strategy='constant', fill_value='MISSING'))
ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
steps = [si_step, ohe_step]

pipe = Pipeline(steps)

In [39]:
hs_train = train[['HouseStyle']].copy()
hs_train.iloc[0, 0] = np.nan

In [40]:
pipe.fit_transform(hs_train)

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [41]:
hs_test = test[['HouseStyle']].copy()
pipe.transform(hs_test)

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
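
To make the pipeline's behavior easy to verify, the sketch below runs the same two steps on made-up data with one missing training value and one category that appears only in the test set. `toy_train`, `toy_test`, and `toy_pipe` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'HouseStyle': [np.nan, '1Story', '2Story']}, dtype=object)
toy_test = pd.DataFrame({'HouseStyle': ['2Story', '3Story']}, dtype=object)

toy_pipe = Pipeline([
    ('si', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# fit_transform learns the fill value and the categories from the training data.
train_encoded = toy_pipe.fit_transform(toy_train).toarray()
# transform reuses what was learned; nothing is refit on the test data.
test_encoded = toy_pipe.transform(toy_test).toarray()

print(train_encoded.shape)
print(test_encoded)
```

The test rows are encoded against the training categories, so `3Story` comes out as a row of zeros instead of adding a column.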

# Multiple String Columns
Encoding multiple string columns is not a problem. Select the columns you want and then pass the new DataFrame through the pipeline again.

In [42]:
string_cols = ['RoofMatl', 'HouseStyle']
string_train = train[string_cols]
string_train.head(3)

Unnamed: 0,RoofMatl,HouseStyle
0,CompShg,2Story
1,CompShg,1Story
2,CompShg,2Story


In [43]:
si_step = ('si', SimpleImputer(strategy='constant', fill_value='MISSING'))
ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
steps = [si_step, ohe_step]

pipe = Pipeline(steps)
pipe.fit_transform(string_train)

array([[0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

### Get individual pieces of the pipeline
It is possible to get each individual transformer through its name. In this instance, we get the one-hot encoder so that we can output the feature names.

In [44]:
ohe = pipe.named_steps['ohe']
ohe.get_feature_names()

array(['x0_ClyTile', 'x0_CompShg', 'x0_Membran', 'x0_Metal', 'x0_Roll',
       'x0_Tar&Grv', 'x0_WdShake', 'x0_WdShngl', 'x1_1.5Fin', 'x1_1.5Unf',
       'x1_1Story', 'x1_2.5Fin', 'x1_2.5Unf', 'x1_2Story', 'x1_SFoyer',
       'x1_SLvl'], dtype=object)

# Use the new `ColumnTransformer` to choose columns
The brand new transformer `ColumnTransformer` (part of the new `compose` module) allows you to choose which columns get which transformations. Categorical columns will almost always need different transformations than continuous columns.

The `ColumnTransformer` is currently experimental, meaning that its functionality can change in the future. There also appear to be a few bugs with it. I even [found one][1] while writing this tutorial.

The `ColumnTransformer` works similarly to `Pipeline` in that it takes a list of tuples, but instead of two items, each tuple is three items in length and will look like this:
```
('name1', SomeTransformer(parameters), columns)
```

Where `columns` is a list of the DataFrame columns that you want to transform during that step. You can also choose the columns with integer indexes, a boolean array, or even a function.
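
The different ways of choosing columns can be sketched on a made-up DataFrame. Here `toy` and `selectors` are illustrative names, `StandardScaler` is used only as a convenient transformer, and the boolean mask is passed as a NumPy array on the assumption that this is the safest form across versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up rows; only the column names echo the housing data.
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, 11250.0],
    'YrSold': [2008.0, 2007.0, 2009.0],
    'Street': pd.Series(['Pave', 'Pave', 'Grvl'], dtype=object),
})

# Four equivalent ways of selecting the first two columns.
selectors = {
    'names': ['LotArea', 'YrSold'],
    'positions': [0, 1],
    'mask': np.array([True, True, False]),
    'function': lambda df: ['LotArea', 'YrSold'],
}

outputs = {}
for label, columns in selectors.items():
    ct_toy = ColumnTransformer([('ss', StandardScaler(), columns)])
    outputs[label] = ct_toy.fit_transform(toy)
    print(label, outputs[label].shape)
```

All four selections produce the same two scaled columns. The unselected `Street` column is dropped because the `remainder` parameter defaults to 'drop'.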

### Pass a `Pipeline` to the `ColumnTransformer`
We can even pass a pipeline of many transformations to the column transformer, and in fact there is a bug that forces us to do just that.

Below, we reproduce the above imputing and encoding using the ColumnTransformer.

[1]: https://github.com/scikit-learn/scikit-learn/issues/11969

In [45]:
from sklearn.compose import ColumnTransformer

In [157]:
si_step = ('si', SimpleImputer(strategy='constant', fill_value='MISSING'))
ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
steps = [si_step, ohe_step]

cat_pipe = Pipeline(steps)
cat_cols = ['RoofMatl', 'HouseStyle']
transformers = [('cat', cat_pipe, cat_cols)]

ct = ColumnTransformer(transformers=transformers)

In [47]:
X_train = ct.fit_transform(train)
X_train

array([[0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [48]:
X_train.shape

(1460, 16)

We can now transform our test set in the same manner.

In [49]:
X_test = ct.transform(test)

In [50]:
X_test.shape

(1459, 16)

### Retrieving the feature names
We have to do a little searching to get the feature names. We use the `named_transformers_` attribute of the column transformer. First, we select our transformer (there is only one here, a pipeline named `cat`).

In [51]:
pl = ct.named_transformers_['cat']

Then from this pipeline we select the one-hot encoder object and finally get the feature names.

In [52]:
ohe = pl.named_steps['ohe']
ohe.get_feature_names()

array(['x0_ClyTile', 'x0_CompShg', 'x0_Membran', 'x0_Metal', 'x0_Roll',
       'x0_Tar&Grv', 'x0_WdShake', 'x0_WdShngl', 'x1_1.5Fin', 'x1_1.5Unf',
       'x1_1Story', 'x1_2.5Fin', 'x1_2.5Unf', 'x1_2Story', 'x1_SFoyer',
       'x1_SLvl'], dtype=object)
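
Later Scikit-Learn releases replaced `get_feature_names` with `get_feature_names_out`. As a version-independent alternative, the sketch below rebuilds readable names from the encoder's fitted `categories_` attribute. The data and the names `toy`, `ct_toy`, and `names` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['RoofMatl', 'HouseStyle']
toy = pd.DataFrame({
    'RoofMatl': ['CompShg', 'Metal', 'CompShg'],
    'HouseStyle': ['2Story', '1Story', '2Story'],
    'LotArea': [8450, 9600, 11250],
}).astype({'RoofMatl': object, 'HouseStyle': object})

cat_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
ct_toy = ColumnTransformer([('cat', cat_pipe_toy, cat_cols)])
ct_toy.fit(toy)

# Walk down: column transformer -> fitted pipeline -> fitted encoder.
ohe_toy = ct_toy.named_transformers_['cat'].named_steps['ohe']

# Pair each original column with the categories learned for it.
names = [f'{col}_{cat}'
         for col, cats in zip(cat_cols, ohe_toy.categories_)
         for cat in cats]
print(names)
```

Prefixing with the original column name avoids the generic `x0_`, `x1_` prefixes shown above.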

# Transforming the numeric columns
The numeric columns will need a different set of transformations. Instead of imputing missing values with a constant, the median or mean is often chosen. And instead of encoding the values, we usually standardize them by subtracting the mean of each column and dividing by the standard deviation. This helps many models like ridge regression produce a better fit.

## Usually all the numeric columns
We can select all of the numeric columns by first finding the dtype of each column and then testing whether its `kind` attribute is anything other than 'O' (object). See the [NumPy docs][1] for more on the `kind` attribute.

[1]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.kind.html

In [144]:
train.dtypes.head()

Id               int64
MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
dtype: object

In [145]:
kinds = np.array([dt.kind for dt in train.dtypes])
kinds

array(['i', 'i', 'O', 'f', 'i', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
       'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'O', 'O', 'O', 'O', 'O',
       'f', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'i', 'O', 'i', 'i', 'i',
       'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'i', 'i', 'i', 'i', 'i',
       'i', 'O', 'i', 'O', 'i', 'O', 'O', 'f', 'O', 'i', 'i', 'O', 'O',
       'O', 'i', 'i', 'i', 'i', 'i', 'i', 'O', 'O', 'O', 'i', 'i', 'i',
       'O', 'O'], dtype='<U1')

In [148]:
numeric_cols = train.columns[kinds != 'O'].values
numeric_cols

array(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'], dtype=object)
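
The selection logic can be checked on a made-up three-column DataFrame. The name `toy` is illustrative, and `select_dtypes` is shown only as an alternative Pandas method that should give the same answer for this data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Id': [1, 2],
    'MSZoning': pd.Series(['RL', 'RM'], dtype=object),
    'LotFrontage': [65.0, np.nan],
})

# 'i' = integer, 'f' = float, 'O' = object (strings here).
kinds = np.array([dt.kind for dt in toy.dtypes])
numeric_cols = toy.columns[kinds != 'O'].values

# Alternative: let Pandas pick the numeric columns directly.
alt_numeric_cols = toy.select_dtypes(include='number').columns.values

print(kinds)
print(list(numeric_cols))
```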

Once we have our numeric column names, we can use the same steps as above with help from the `ColumnTransformer`.

In [149]:
from sklearn.preprocessing import StandardScaler

In [165]:
si_step = ('si', SimpleImputer(strategy='median'))
ss_step = ('ss', StandardScaler())
steps = [si_step, ss_step]

num_pipe = Pipeline(steps)
transformers = [('num', num_pipe, numeric_cols)]

ct = ColumnTransformer(transformers=transformers)

In [166]:
X = ct.fit_transform(train)
X

array([[-1.73086488,  0.07337496, -0.22087509, ..., -0.08768781,
        -1.5991111 ,  0.13877749],
       [-1.7284922 , -0.87256276,  0.46031974, ..., -0.08768781,
        -0.48911005, -0.61443862],
       [-1.72611953,  0.07337496, -0.08463612, ..., -0.08768781,
         0.99089135,  0.13877749],
       ...,
       [ 1.72611953,  0.30985939, -0.1754621 , ...,  4.95311151,
        -0.48911005,  1.64520971],
       [ 1.7284922 , -0.87256276, -0.08463612, ..., -0.08768781,
        -0.8591104 ,  1.64520971],
       [ 1.73086488, -0.87256276,  0.23325479, ..., -0.08768781,
        -0.1191097 ,  0.13877749]])

In [167]:
X.shape

(1460, 37)
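
To confirm what the numeric pipeline does, this sketch applies the same two steps to a made-up DataFrame with one missing value and checks the result. `toy` and `num_pipe_toy` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'LotArea': [8000.0, 9000.0, 10000.0, 11000.0],
})

num_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='median')),
    ('ss', StandardScaler()),
])
scaled = num_pipe_toy.fit_transform(toy)

# The imputer stores one fill value per column in statistics_.
print(num_pipe_toy.named_steps['si'].statistics_)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After the transformation each column has a mean of roughly 0 and a standard deviation of roughly 1. `StandardScaler` divides by the population standard deviation, so NumPy's default `std` (with `ddof=0`) is the matching check.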

# Combining both categorical and numerical column transformations
We can apply separate transformations to each section of our DataFrame and `ColumnTransformer` will automatically concatenate our results together. Below we create a pipeline for both categorical and numerical columns and then use the `ColumnTransformer` to independently transform them.

In [187]:
all_columns = train.columns.values
kinds = np.array([dt.kind for dt in train.dtypes])
is_numeric = kinds != 'O'
numeric_cols = all_columns[is_numeric]
cat_cols = all_columns[~is_numeric]

si_cat_step = ('si', SimpleImputer(strategy='constant', fill_value='MISSING'))
ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
cat_steps = [si_cat_step, ohe_step]
cat_pipe = Pipeline(cat_steps)

si_num_step = ('si', SimpleImputer(strategy='median'))
ss_step = ('ss', StandardScaler())
num_steps = [si_num_step, ss_step]
num_pipe = Pipeline(num_steps)

transformers = [('cat', cat_pipe, cat_cols),
                ('num', num_pipe, numeric_cols)]

ct = ColumnTransformer(transformers=transformers)

X = ct.fit_transform(train)
X.shape

(1460, 305)
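
The concatenation is easier to see on a made-up two-column DataFrame. In this sketch the names ending in `_toy` are illustrative, and `sparse_threshold=0` is set so that the combined output is always a dense array regardless of how sparse the encoded part is.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'HouseStyle': pd.Series(['2Story', np.nan, '1Story', '2Story'], dtype=object),
    'LotArea': [8450.0, 9600.0, np.nan, 9550.0],
})

cat_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
num_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='median')),
    ('ss', StandardScaler()),
])

ct_toy = ColumnTransformer(
    [('cat', cat_pipe_toy, ['HouseStyle']),
     ('num', num_pipe_toy, ['LotArea'])],
    sparse_threshold=0,  # always return a dense array
)
X_toy = ct_toy.fit_transform(toy)
print(X_toy.shape)
```

The output columns follow the order of the transformers list: three one-hot columns (two styles plus `MISSING`) followed by the single scaled numeric column.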

# Machine Learning
We can create one final pipeline and add a machine learning model as the final estimator.

In [210]:
from sklearn.linear_model import Ridge

In [211]:
ml_pipe = Pipeline([('transform', ct), ('ridge', Ridge())])

In [212]:
ml_pipe.fit(train, y)

Pipeline(memory=None,
     steps=[('transform', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('cat', Pipeline(memory=None,
     steps=[('si', SimpleImputer(copy=True, fill_value='MISSING', missing_values=nan,
       strategy='constant', verbos...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [213]:
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True)

In [214]:
cross_val_score(ml_pipe, train, y, cv=kf).mean()

0.8119226710668714
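
Passing the whole pipeline to `cross_val_score` means the imputers, scaler, and encoder are refit on the training portion of every fold. The sketch below shows this on synthetic data. The data-generating formula and all names ending in `_toy` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: price depends linearly on area plus a bump for two-story homes.
rng = np.random.RandomState(0)
n = 60
toy = pd.DataFrame({
    'style': pd.Series(rng.choice(['1Story', '2Story'], size=n), dtype=object),
    'area': rng.normal(1500, 300, size=n),
})
y_toy = (100 * toy['area'].values
         + 20000 * (toy['style'] == '2Story').values
         + rng.normal(0, 1000, size=n))

cat_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
num_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='median')),
    ('ss', StandardScaler()),
])
ct_toy = ColumnTransformer([('cat', cat_pipe_toy, ['style']),
                            ('num', num_pipe_toy, ['area'])])
ml_pipe_toy = Pipeline([('transform', ct_toy), ('ridge', Ridge())])

# Fixing random_state makes the shuffled folds repeatable between runs.
kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(ml_pipe_toy, toy, y_toy, cv=kf_toy)
print(scores.round(3))
```

For a regressor, `cross_val_score` reports R-squared by default. Without a `random_state` on `KFold`, as in the cell above, the score changes slightly from run to run.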

# Selecting parameters when Grid Searching

In [215]:
from sklearn.model_selection import GridSearchCV

In [205]:
param_grid = {
    'transform__num__si__strategy': ['mean', 'median'],
    'ridge__alpha': [.001, 0.1, 1.0, 10, 100],
}
gs = GridSearchCV(ml_pipe, param_grid, cv=kf)

In [206]:
gs.fit(train, y)

GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
       error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('transform', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('cat', Pipeline(memory=None,
     steps=[('si', SimpleImputer(copy=True, fill_value='MISSING', missing_values=nan,
       strategy='constant', verbos...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'transform__num__si__strategy': ['mean', 'median'], 'ridge__alpha': [0.001, 0.1, 1.0, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [207]:
gs.best_params_

{'ridge__alpha': 10, 'transform__num__si__strategy': 'mean'}
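
The keys of `param_grid` are built by joining step names with double underscores, from the outermost pipeline down to the parameter itself. A quick way to discover the valid names is `get_params`, sketched here on an unfitted pipeline with the same step names as above. The column names `style` and `area` and the names ending in `_toy` are placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
num_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='median')),
    ('ss', StandardScaler()),
])
ct_toy = ColumnTransformer([('cat', cat_pipe_toy, ['style']),
                            ('num', num_pipe_toy, ['area'])])
ml_pipe_toy = Pipeline([('transform', ct_toy), ('ridge', Ridge())])

# get_params lists every nested parameter; filter to the ones tuned above.
searchable = sorted(name for name in ml_pipe_toy.get_params()
                    if name.endswith(('__strategy', '__alpha')))
print(searchable)

# The same names work with set_params, which is what the grid search relies on.
ml_pipe_toy.set_params(transform__num__si__strategy='mean')
print(ml_pipe_toy.get_params()['transform__num__si__strategy'])
```

Reading `transform__num__si__strategy` from left to right: the `transform` step of the outer pipeline, its `num` transformer, that pipeline's `si` step, and finally the imputer's `strategy` parameter.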

In [209]:
pd.DataFrame(gs.cv_results_)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ridge__alpha,param_transform__num__si__strategy,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.034049,0.002207,0.009171,0.001338,0.001,mean,"{'ridge__alpha': 0.001, 'transform__num__si__s...",0.901581,0.856938,0.893413,...,0.564308,0.544383,10,0.935158,0.939936,0.929458,0.945877,0.937442,0.937574,0.005408
1,0.034064,0.000914,0.007842,0.000284,0.001,median,"{'ridge__alpha': 0.001, 'transform__num__si__s...",0.901578,0.856964,0.893429,...,0.564316,0.544383,9,0.935158,0.939935,0.929453,0.945877,0.937442,0.937573,0.00541
2,0.030157,0.000369,0.007588,8.9e-05,0.1,mean,"{'ridge__alpha': 0.1, 'transform__num__si__str...",0.899387,0.854853,0.895557,...,0.800789,0.103319,8,0.934084,0.938858,0.928206,0.944976,0.935014,0.936228,0.005548
3,0.034269,0.002525,0.007771,0.000425,0.1,median,"{'ridge__alpha': 0.1, 'transform__num__si__str...",0.899382,0.854888,0.895568,...,0.800795,0.103326,7,0.934083,0.938857,0.928201,0.944977,0.935014,0.936227,0.005549
4,0.032,0.001501,0.008202,0.000403,1.0,mean,"{'ridge__alpha': 1.0, 'transform__num__si__str...",0.894367,0.844483,0.884532,...,0.806715,0.091847,4,0.922084,0.928361,0.915705,0.933772,0.931898,0.926364,0.006655
5,0.032464,0.000531,0.007599,9.6e-05,1.0,median,"{'ridge__alpha': 1.0, 'transform__num__si__str...",0.894384,0.844506,0.88453,...,0.80672,0.091857,3,0.922086,0.92836,0.915707,0.933771,0.931898,0.926364,0.006654
6,0.030082,0.000297,0.007481,8.7e-05,10.0,mean,"{'ridge__alpha': 10, 'transform__num__si__stra...",0.89376,0.840727,0.863183,...,0.811317,0.083677,1,0.897985,0.906401,0.896868,0.909821,0.919559,0.906127,0.008321
7,0.032869,0.000938,0.007679,0.000114,10.0,median,"{'ridge__alpha': 10, 'transform__num__si__stra...",0.893885,0.840658,0.863156,...,0.811305,0.083714,2,0.897979,0.906406,0.896882,0.909821,0.919561,0.90613,0.008319
8,0.030034,0.000132,0.007594,0.000181,100.0,mean,"{'ridge__alpha': 100, 'transform__num__si__str...",0.893595,0.833907,0.841184,...,0.80642,0.08161,5,0.862478,0.873634,0.867825,0.878383,0.892835,0.875031,0.010389
9,0.033327,0.001151,0.007782,0.000405,100.0,median,"{'ridge__alpha': 100, 'transform__num__si__str...",0.893816,0.833685,0.841175,...,0.806398,0.081688,6,0.862424,0.873642,0.867819,0.878376,0.892844,0.875021,0.010406


In [53]:
from sklearn.base import BaseEstimator, TransformerMixin

In [63]:
class ObviousTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, cat_threshold=None, num_strategy='median', return_df=False):
        self.cat_threshold = cat_threshold
        self.num_strategy = num_strategy
        self.return_df = return_df
        self._columns = None
        
    def fit(self, X, y=None):
        self._columns = X.columns.values
        self._dtypes = X.dtypes.values
        self._kinds = np.array([dt.kind for dt in X.dtypes])
        self._column_dtypes = {}
        is_cat = self._kinds == 'O'
        self._column_dtypes['cat'] = self._columns[is_cat]
        self._column_dtypes['num'] = self._columns[~is_cat]
        self._feature_names = self._column_dtypes['num']
        self._cat_cols = {}
        for col in self._column_dtypes['cat']:
            vc = X[col].value_counts()
            if self.cat_threshold is not None:
                vc = vc[vc > self.cat_threshold]
            vals = vc.index.values
            self._cat_cols[col] = vals
            self._feature_names = np.append(self._feature_names, col + '_' + vals)
            
        self._total_cat_cols = sum([len(v) for col, v in self._cat_cols.items()])
        self._num_fill = X[self._column_dtypes['num']].agg(self.num_strategy)
        # store the training mean and std so transform scales new data consistently
        X_num = X[self._column_dtypes['num']].fillna(self._num_fill)
        self._num_mean = X_num.mean()
        self._num_std = X_num.std()
        return self
        
    def transform(self, X):
        if set(self._columns) != set(X.columns):
            raise ValueError('Passed DataFrame has different columns than fit DataFrame')
        elif len(self._columns) != len(X.columns):
            raise ValueError('Passed DataFrame has different number of columns than fit DataFrame')
            
        X_num = X[self._column_dtypes['num']].fillna(self._num_fill)
        X_num = (X_num - self._num_mean) / self._num_std
        X_num = X_num.values
        X_cat = np.empty((len(X), self._total_cat_cols), dtype='int')
        i = 0
        for col in self._column_dtypes['cat']:
            vals = self._cat_cols[col]
            for val in vals:
                X_cat[:, i] = X[col] == val
                i += 1
                
        data = np.column_stack((X_num, X_cat))
        if self.return_df:
            return pd.DataFrame(data=data, columns=self._feature_names)
        else:
            return data
    
    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

In [173]:
ot = ObviousTransformer(cat_threshold=0, return_df=True)

In [174]:
X = ot.fit_transform(train)

In [176]:
X.shape

(1460, 289)

In [75]:
from sklearn.linear_model import LinearRegression, Ridge

In [83]:
lr = LinearRegression()
ridge = Ridge(alpha=100)

In [84]:
lr.fit(X_train, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [85]:
ridge.fit(X_train, y)

Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [86]:
ridge.score(X_train, y)

0.8702432916493581

In [101]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split

In [133]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

lr.fit(X_train, y_train)
lr.score(X_train, y_train)

0.9079027436742987

In [134]:
lr.score(X_test, y_test)

-7.480134477170103e+19

In [113]:
X_test.shape

(365, 202)

In [96]:
kf = KFold(n_splits=5, shuffle=True)

In [136]:
cross_val_score(ridge, X, y, cv=kf).mean()

0.8293505911080119

In [100]:
cross_val_score(lr, X_train, y, cv=kf).mean()

-1.9144976430754385e+20

In [74]:
lr.score(X_train, y)

0.8961668139404931

In [57]:
a.shape

(1460, 202)

In [62]:
ot._feature_names

array(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'MSZoning_RL', 'MSZoning_RM', 'MSZoning_FV',
       'MSZoning_RH', 'Street_Pave', 'Alley_Grvl', 'Alley_Pave',
       'LotShape_Reg', 'LotShape_IR1', 'LotShape_IR2', 'LandContour_Lvl',
       'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low',
       'Utilities_AllPub', 'LotConfig_Inside', 'LotConfig_Corner',
       'LotConfig_CulDSac', 'LotConfig_FR2', 'LandSlope_Gtl',
       'LandSlope_Mod', 'Neighborhood_NAmes', 'Neighbo