# <font color='#eb3483'> Feature Engineering </font>

One of the most important steps in the machine learning pipeline is engineering features - it's often the determining factor in whether you'll get a successful model! Feature engineering is the process of making new features in your dataset that better represent the problem you're trying to model. In this module we won't be exploring any new packages or skills, but will try to highlight the importance of taking the time to craft useful features when you're approaching a machine learning model.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

### <font color='#eb3483'> Why bother feature engineering? </font>
In most problems, the data we're given is messy (hence why we need to do data processing), and might not be in the format most conducive for learning. Let's consider a simple example - predicting the area of a circle. The original data we're given is the radius of each circle, and we want to build a linear regression to predict the area.

In [2]:
area = pd.DataFrame({ 'radius': np.arange(10), 'area': 3.14* np.arange(10)**2 })
area.head()

Unnamed: 0,radius,area
0,0,0.0
1,1,3.14
2,2,12.56
3,3,28.26
4,4,50.24


Let's try building the model with the data as is:

In [3]:
predictor = LinearRegression()
mse = cross_val_score(predictor, area.drop('area',axis=1),
                area['area'], scoring="neg_mean_squared_error", 
                cv=3).mean()

print("MSE :",-mse)

MSE : 4337.58693164567


Yikes, that's a terrible mean squared error for such a simple problem. What if we engineered a new feature, radius^2, and gave it to the model alongside the radius?

In [4]:
area['radius_sq'] = area['radius']**2
predictor = LinearRegression()
mse = cross_val_score(predictor, area.drop('area',axis=1),
                area['area'], scoring="neg_mean_squared_error", 
                cv=3).mean()
print("MSE :",-round(mse,2))

MSE : 0.0


Now we have a perfect prediction! In this example, all we had to do was engineer a feature that was more in-line with the problem we were trying to model. 
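To see why the engineered feature works so well, it can help to look at what the model actually learns. The sketch below rebuilds the toy circle data (the names `circles` and `model` are just for illustration) and fits a regression on `radius_sq` alone. Since the area was generated as 3.14 times the squared radius, the fitted slope should land very close to 3.14 and the intercept close to 0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the toy circle data used in this section
circles = pd.DataFrame({"radius": np.arange(10)})
circles["radius_sq"] = circles["radius"] ** 2
circles["area"] = 3.14 * circles["radius_sq"]

# Fit on the engineered feature only
model = LinearRegression().fit(circles[["radius_sq"]], circles["area"])

print("slope on radius_sq:", model.coef_[0])
print("intercept:", model.intercept_)
```

The relationship between `radius_sq` and `area` is exactly linear, which is all a linear regression can represent. The raw radius only has a curved relationship with the area, so no straight line fits it well.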

For more complicated models (e.g. neural networks) some of the feature engineering can be done directly in the model (in a multi-layer neural network, the internal layers can act as a learned representation of the data you're feeding in), but these models require more training and are more prone to over-fitting. In general, feature engineering can lead to big boosts in predictive power with relatively little work on your end, so it's always a great place to start!

### <font color='#eb3483'> Feature Engineering Walkthrough </font>

For this section we are going to use the [Ames Housing dataset](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt), which is often used as a modernized and expanded alternative to the Boston Housing dataset. This [link](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) has the data dictionary.

In [5]:
ames = pd.read_csv("data/ames.csv").drop(columns="PID").sample(500, random_state=42)
ames.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemod/Add', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'W

In [6]:
ames.shape

(500, 80)

In [7]:
ames.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1357,70,RM,,5100,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,6,2008,WD,Normal,161000
2367,160,RM,21.0,1890,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,7,2006,WD,Normal,116000
2822,60,RL,62.0,7162,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,5,2006,WD,Normal,196500
2126,20,RL,60.0,8070,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,123600
1544,30,RM,50.0,7000,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,7,2008,WD,Normal,126000


Let's take a peek at our datatypes.

In [8]:
ames.dtypes.head()

MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
Street          object
dtype: object

## <font color='#eb3483'> Data Processing </font>

Looks like we have a lot of ordinal data (i.e. data that has ordered categories) and categorical data (i.e. data that has unordered categories). We are going to encode the ordinal and categorical variables as numbers using `pandas`, and clean up the numerical variables using `sklearn`.

In [9]:
#Remember target is what we're trying to predict
target = "SalePrice"
#Independent variables are things we're using to try to predict it
independent_variables = ames.drop(columns=target).columns

In [10]:
numerical_cols = ames[independent_variables].select_dtypes(np.number).columns
categorical_cols = ames.select_dtypes(exclude=np.number).columns

#Let's make an ordered mapping of all our ordinal data (i.e. values on the right are better)
ordinal_var_dict = {'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],
 'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],
 'LandSlope': ['Sev', 'Mod', 'Gtl'],
 'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'BsmtQual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'BsmtCond': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'BsmtExposure': ['NA', 'No', 'Mn', 'Av', 'Gd'],
 'BsmtFinType1': ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
 'BsmtFinType2': ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
 'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Min2', 'Min1', 'Typ'],
 'FireplaceQu': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'GarageFinish': ['NA', 'Unf', 'RFn', 'Fin'],
 'GarageQual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'GarageCond': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
 'PavedDrive': ['N', 'P', 'Y'],
 'PoolQC': ['NA', 'Fa', 'TA', 'Gd', 'Ex'],
 'Fence': ['NA', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']}


#Let's keep track of all our ordinal and categorical data
ordinal_cols = list(ordinal_var_dict.keys())
categorical_cols = list(set(categorical_cols) - set(ordinal_cols))

### <font color='#eb3483'> Numerical Data </font>
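For numerical data, we're going to do a two-step process: *impute* our missing values using the median, and then *normalize* the data. One caveat: sklearn's `normalize` function rescales each row (i.e. each house) to unit norm. It does not subtract the mean and divide by the standard deviation of each column, which is what `StandardScaler` does. We're going to use built-in functions from sklearn. Check them out using `?`.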

In [11]:
from sklearn.preprocessing import normalize
from sklearn.impute import SimpleImputer

SimpleImputer?

Init signature:
SimpleImputer(
    *,
    missing_values=nan,
    strategy='mean',
    fill_value=None,
    verbose='deprecated',
    copy=True,
    add_indicator=False,
)
Docstring:
Univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or
most frequent) along each column, or using a constant value.

Read more in the :ref:`User Guide <impute>`.

.. versionadded:: 0.20
   `SimpleIm

Let's check out the imputer first. We'll look at the LotFrontage column (which has 83 missing values) and see what happens when we impute the data.

In [12]:
ames[numerical_cols[1]]

1357     NaN
2367    21.0
2822    62.0
2126    60.0
1544    50.0
        ... 
1047    24.0
332     21.0
1920    75.0
2396    86.0
1832     NaN
Name: LotFrontage, Length: 500, dtype: float64

In [13]:
#Simple imputer takes a strategy (i.e. replace missing values with the median value)
#And has a fit_transform function which takes a dataframe and returns a numpy array with the data and no missing vals
imputed = SimpleImputer(strategy="median").fit_transform(ames[numerical_cols])

#Notice our LotFrontage data now has no missing values!
imputed[:10,1] #we'll just look at the first 10 rows

array([ 68.,  21.,  62.,  60.,  50., 102.,  35.,  24.,  50.,  59.])
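One detail worth seeing on a small scale: the imputer learns its fill value when it is fitted, and then reuses that value on any data it transforms later. The sketch below uses two made-up frames (`train` and `new` are hypothetical names, not part of the Ames workflow) to show the learned median being applied to unseen rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for a numerical column
train = pd.DataFrame({"LotFrontage": [21.0, np.nan, 62.0, 60.0]})
new = pd.DataFrame({"LotFrontage": [np.nan, 80.0]})

# fit() computes the median of the non-missing training values
imputer = SimpleImputer(strategy="median").fit(train)
learned_median = imputer.statistics_[0]

# transform() fills missing values in new data with that same median
new_filled = imputer.transform(new)

print(learned_median)
print(new_filled)
```

Fitting on training data and only transforming test data keeps information about the test set from leaking into preprocessing.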

Now let's see how to normalize. For that we'll use sklearn's `normalize` function. Same idea: we feed in a dataframe or numpy matrix and it returns a numpy matrix. By default it rescales each row to unit L2 norm (`axis=1`), so it works per house rather than per column. Note that we can't feed in our raw data (it'll throw an error if there are missing values), which is why we'll use the imputed data.

In [14]:
normalize(pd.DataFrame(imputed))

array([[1.02769404e-02, 9.98331349e-03, 7.48748511e-01, ...,
        0.00000000e+00, 8.80880602e-04, 2.94801375e-01],
       [3.45326064e-02, 4.53240459e-03, 4.07916413e-01, ...,
        0.00000000e+00, 1.51080153e-03, 4.32952552e-01],
       [6.99641316e-03, 7.22962693e-03, 8.35138517e-01, ...,
        0.00000000e+00, 5.83034430e-04, 2.33913413e-01],
       ...,
       [1.83821813e-03, 6.89331801e-03, 8.96131341e-01, ...,
        4.59554534e-02, 3.67643627e-04, 1.84465190e-01],
       [4.93232538e-03, 7.06966637e-03, 9.09603005e-01, ...,
        0.00000000e+00, 8.22054229e-04, 1.64904078e-01],
       [5.39685941e-03, 6.11644067e-03, 8.60978972e-01, ...,
        0.00000000e+00, 5.39685941e-04, 1.80524947e-01]])
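It's easy to mix up `normalize` with standardization, so here is a small side-by-side sketch on a made-up matrix (`X` is hypothetical, not the Ames data). `normalize` divides each row by its own L2 norm, while `StandardScaler` subtracts each column's mean and divides by that column's standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy matrix: 3 rows (houses), 2 columns (features)
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# normalize: every ROW ends up with length (L2 norm) 1
row_scaled = normalize(X)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler: every COLUMN ends up with mean 0 and standard deviation 1
col_scaled = StandardScaler().fit_transform(X)

print(row_scaled)
print(row_norms)
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

Notice that the first two rows become identical after `normalize`, even though one house has values twice as large as the other. If the goal is to put every column on a comparable scale, `StandardScaler` is probably the better fit.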

Now let's package it together into one beautiful extended line of code

In [15]:
numerical_data_imputed_normalized = pd.DataFrame(
    #We've created a new dataframe where our columns have been imputed and normalized
    normalize(SimpleImputer(strategy="median").fit_transform(ames[numerical_cols])),
    columns=numerical_cols
)
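An alternative way to package these steps is sklearn's `make_pipeline`, which chains transformers into one object that can be fitted once and reused. This is only a sketch: it runs on a made-up frame (`toy` stands in for `ames[numerical_cols]`), and it swaps in `StandardScaler` for `normalize`, which is a different technique (column-wise standardization rather than row-wise scaling).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for ames[numerical_cols]
toy = pd.DataFrame({"LotFrontage": [21.0, np.nan, 62.0, 60.0],
                    "LotArea": [1890, 5100, 7162, 8070]})

# Step 1 fills missing values with the median, step 2 standardizes each column
numeric_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# fit_transform returns a numpy array, so we wrap it back into a dataframe
toy_processed = pd.DataFrame(
    numeric_pipeline.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_processed)
```

Keeping the original index here means the result can be joined back to other frames without resetting indices first.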

### <font color='#eb3483'> Categorical Variables </font>

For categorical data we're going to use [1-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f), which means that each category gets its own binary column (i.e. if the column was gender we'd have one column for male and one for female, and a 1 in the male column means the person is male). This is super common in machine learning, and pandas even has a function for it called `get_dummies` (check out the help docs). Passing `drop_first=True` drops the first category of each variable, since it is implied whenever all the other columns for that variable are 0.

In [16]:
categorical_data_dummy = pd.get_dummies(ames[categorical_cols], drop_first=True)

In [17]:
categorical_data_dummy.head()

Unnamed: 0,Alley_Pave,Heating_GasW,Heating_Grav,Heating_Wall,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,Condition2_Feedr,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
1357,False,False,False,False,False,False,False,False,True,False,...,False,False,True,False,False,False,False,False,False,False
2367,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
2822,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1544,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
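To make the effect of `drop_first=True` concrete, here is a tiny sketch on a made-up frame (`toy` is hypothetical, with column names borrowed from the Ames data). It compares the columns produced with and without dropping the first category.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns, one with a missing value
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                    "Alley": ["Grvl", None, "Pave"]})

# One binary column per category
full = pd.get_dummies(toy)

# Same encoding, but the first (alphabetical) category of each variable is dropped
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

By default a missing value gets no column of its own, so that row is simply 0 (or False) in every dummy column for that variable. After `drop_first=True` this means a missing value looks the same as the dropped category, which is something to keep in mind for columns like `Alley` that are mostly empty.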


### <font color='#eb3483'>Ordinal variables </font>

Checking the [data dictionary](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt), there are many ordinal variables (measuring quality levels of different aspects of the houses from worst to best). As a reminder, ordinal data means that there are categories (like categorical data) but there's an ordering to them (i.e. one category is better than another). To represent ordinal data we want to convert it to numeric values that preserve that ordering. To do that we'll use pandas' built-in functionality for categorical data. The high-level steps are:
- For each column we'll convert it to categorical data (which means each string value will have an associated number, starting from 0, i.e. 0 = 'Female', 1 = 'Male')
- We'll set the ordering of the categories to be what we have in our dictionary (i.e. so the 'worst' category is first, best is last)
- Then we'll set our column to just use the underlying category numbers, which now preserve the order we want
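The steps above can be sketched on a single made-up column before running them on the whole dataset. The series `kitchen` and the list `quality_order` are hypothetical examples that mirror the quality scale in the data dictionary.

```python
import numpy as np
import pandas as pd

# Quality levels from worst to best
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Hypothetical column with one missing value
kitchen = pd.Series(["TA", "Ex", np.nan, "Po", "Gd"], name="KitchenQual")

codes = (
    kitchen.astype("category")                        # strings -> category dtype
    .cat.set_categories(quality_order, ordered=True)  # impose our worst-to-best order
    .cat.codes                                        # position of each value in that order
)

print(codes.tolist())
```

Missing values (and any value not in the list) get the code -1. Note that `pd.read_csv` treats the string `NA` as missing by default, so a level written as `NA` in the file will show up as -1 rather than matching an `'NA'` entry in the ordering.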

In [18]:
ordinal_data = ames[ordinal_cols].copy() #copy so we modify our own dataframe and not a slice of ames

In [19]:
#We're going to iterate through the ordinal columns and fix them
for col_ordinal, values in ordinal_var_dict.items():
    ordinal_data[col_ordinal] = (
        ordinal_data[col_ordinal] #first let's grab all our column's data
        .astype("category") #Convert it to category type
        #for the category we're going to set the ordering of the possible values to be what we have in our ordinal_var_dict
        .cat.set_categories(values)
        #This will make sure we're using the category numbers (which will be in the order we want)
        .cat.codes
    )

In [20]:
ordinal_data.head()

Unnamed: 0,LotShape,Utilities,LandSlope,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,KitchenQual,Functional,FireplaceQu,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence
1357,3,3,2,2,3,3,3,1,1,1,1,3,6,4,1,3,3,2,-1,3
2367,3,3,2,2,2,3,3,1,5,1,4,2,6,-1,1,3,3,2,-1,-1
2822,3,3,2,3,2,4,3,1,1,1,4,3,6,4,3,3,3,2,-1,-1
2126,3,3,2,2,2,4,3,1,6,1,4,2,6,-1,-1,-1,-1,2,-1,-1
1544,3,3,2,2,3,3,3,1,3,6,2,2,6,-1,1,3,3,1,-1,3


Finally, we join the three datasets. We reset each index first so that the rows line up by position.

In [21]:
ames_processed = pd.concat([
    numerical_data_imputed_normalized.reset_index(drop=True),
    categorical_data_dummy.reset_index(drop=True),
    ordinal_data.reset_index(drop=True)
], axis=1)

In [22]:
ames_processed.shape

(500, 171)
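The `reset_index(drop=True)` calls matter because `pd.concat` with `axis=1` matches rows by index label, not by position. The sketch below uses two made-up frames (`left` and `right` are hypothetical) to show what can go wrong: one keeps the labels of sampled rows, the other has a fresh 0, 1, 2 index like a dataframe built from a numpy array.

```python
import pandas as pd

# Sampled rows keep their original labels
left = pd.DataFrame({"a": [1, 2, 3]}, index=[1357, 2367, 2822])

# A frame built from a numpy array gets a fresh 0..n-1 index
right = pd.DataFrame({"b": [10, 20, 30]})

# Labels don't match, so pandas takes the union of both indexes and fills with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes lines the rows up by position
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape, aligned.shape)
```

A quick check of `.shape` after a concat, as done above, is a cheap way to catch this kind of silent misalignment.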

In [23]:
ames_processed.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemod/Add,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,HeatingQC,KitchenQual,Functional,FireplaceQu,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence
0,0.010277,0.009983,0.748749,0.001175,0.001028,0.282616,0.29304,0.0,0.0,0.0,...,1,3,6,4,1,3,3,2,-1,3
1,0.034533,0.004532,0.407916,0.001295,0.001511,0.425614,0.425614,0.082015,0.060864,0.0,...,4,2,6,-1,1,3,3,2,-1,-1
2,0.006996,0.00723,0.835139,0.000816,0.000583,0.233564,0.23368,0.022155,0.0,0.0,...,4,3,6,4,3,3,3,2,-1,-1
3,0.002176,0.006527,0.877833,0.000435,0.000544,0.216902,0.217011,0.0,0.063961,0.0,...,4,2,6,-1,-1,-1,-1,2,-1,-1
4,0.003655,0.006091,0.852737,0.000731,0.000975,0.234624,0.243395,0.0,0.036424,0.004873,...,2,2,6,-1,1,3,3,1,-1,3


And just like that we have a beautiful feature engineered dataset! We've covered some standard tools, but remember it's important to think about what new features are suited to the problem (e.g. is `YearBuilt` useful as a raw number, or do we want to bin it into categories for old and new houses?). That's where the creative part of data science comes in, and why having domain expertise or understanding where your data is coming from is so important!
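As one possible take on the `YearBuilt` question, here is a small sketch on made-up rows (`toy` is hypothetical, and the bin edges and labels are arbitrary choices for illustration). It builds two candidate features: a binned era using `pd.cut`, and the age of the house at the time of sale.

```python
import pandas as pd

# Hypothetical rows using two column names from the Ames data
toy = pd.DataFrame({"YearBuilt": [1925, 1968, 1999, 2007],
                    "YrSold": [2008, 2006, 2006, 2007]})

# Bin the build year into eras (bins are right-inclusive by default)
toy["Era"] = pd.cut(
    toy["YearBuilt"],
    bins=[0, 1950, 1990, 2100],
    labels=["old", "mid", "new"],
)

# Age of the house when it was sold
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]

print(toy)
```

Whether either feature actually helps is an empirical question, so it's worth comparing cross-validated scores with and without it.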