**This notebook is an exercise in the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course.  You can reference the tutorial at [this link](https://www.kaggle.com/ryanholbrook/creating-features).**

---


# Introduction #

In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

Run this cell to set everything up!

In [1]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


def score_dataset(X, y, model=XGBRegressor()):
    # Work on a copy so the caller's dataframe is not modified in place
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")
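
The scoring helper above reports RMSLE. As a rough sketch of what that number means, the metric can be computed by hand with NumPy. The function name `rmsle` and the prices below are made up for illustration and are not part of the course code; the `log1p` form is assumed to mirror scikit-learn's `neg_mean_squared_log_error` scorer.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) on both inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(squared_log_error.mean()))


# Hypothetical sale prices, not taken from the Ames data
y_true = [200000, 150000, 100000]

perfect = rmsle(y_true, y_true)                              # no error at all
over_by_10pct = rmsle(y_true, [1.1 * p for p in y_true])     # every guess 10% high
print(perfect, over_by_10pct)
```

Because the error is measured on the log scale, a prediction that is 10% too high costs about the same whether the house is cheap or expensive, which is why the metric suits prices that span a wide range.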

In [2]:
pd.set_option("display.max_columns", None)
df.head(10)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,FirstFlrSF,SecondFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,Threeseasonporch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YearSold,SaleType,SaleCondition,SalePrice
0,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,141.0,31770.0,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,Gtl,North_Ames,Norm,Norm,OneFam,One_Story,Above_Average,Average,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,Typical,Typical,CBlock,Typical,Good,Gd,BLQ,2.0,Unf,0.0,441.0,1080.0,GasA,Fair,Y,SBrkr,1656.0,0.0,0.0,1656.0,1,0,1,0,3,1,Typical,7,Typ,2,Good,Attchd,Fin,2,528.0,Typical,Typical,Partial_Pavement,210.0,62.0,0.0,0.0,0.0,0.0,No_Pool,No_Fence,,0.0,5,2010,WD,Normal,215000
1,One_Story_1946_and_Newer_All_Styles,Residential_High_Density,80.0,11622.0,Pave,No_Alley_Access,Regular,Lvl,AllPub,Inside,Gtl,North_Ames,Feedr,Norm,OneFam,One_Story,Average,Above_Average,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,Typical,Typical,CBlock,Typical,Typical,No,Rec,6.0,LwQ,144.0,270.0,882.0,GasA,Typical,Y,SBrkr,896.0,0.0,0.0,896.0,0,0,1,0,2,1,Typical,5,Typ,0,No_Fireplace,Attchd,Unf,1,730.0,Typical,Typical,Paved,140.0,0.0,0.0,0.0,120.0,0.0,No_Pool,Minimum_Privacy,,0.0,6,2010,WD,Normal,105000
2,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,81.0,14267.0,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,Gtl,North_Ames,Norm,Norm,OneFam,One_Story,Above_Average,Above_Average,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,Typical,Typical,CBlock,Typical,Typical,No,ALQ,1.0,Unf,0.0,406.0,1329.0,GasA,Typical,Y,SBrkr,1329.0,0.0,0.0,1329.0,0,0,1,1,3,1,Good,6,Typ,0,No_Fireplace,Attchd,Unf,1,312.0,Typical,Typical,Paved,393.0,36.0,0.0,0.0,0.0,0.0,No_Pool,No_Fence,Gar2,12500.0,6,2010,WD,Normal,172000
3,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,93.0,11160.0,Pave,No_Alley_Access,Regular,Lvl,AllPub,Corner,Gtl,North_Ames,Norm,Norm,OneFam,One_Story,Good,Average,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Good,Typical,CBlock,Typical,Typical,No,ALQ,1.0,Unf,0.0,1045.0,2110.0,GasA,Excellent,Y,SBrkr,2110.0,0.0,0.0,2110.0,1,0,2,1,3,1,Excellent,8,Typ,2,Typical,Attchd,Fin,2,522.0,Typical,Typical,Paved,0.0,0.0,0.0,0.0,0.0,0.0,No_Pool,No_Fence,,0.0,4,2010,WD,Normal,244000
4,Two_Story_1946_and_Newer,Residential_Low_Density,74.0,13830.0,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,OneFam,Two_Story,Average,Average,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,Typical,Typical,PConc,Good,Typical,No,GLQ,3.0,Unf,0.0,137.0,928.0,GasA,Good,Y,SBrkr,928.0,701.0,0.0,1629.0,0,0,2,1,3,1,Typical,6,Typ,1,Typical,Attchd,Fin,2,482.0,Typical,Typical,Paved,212.0,34.0,0.0,0.0,0.0,0.0,No_Pool,Minimum_Privacy,,0.0,3,2010,WD,Normal,189900
5,Two_Story_1946_and_Newer,Residential_Low_Density,78.0,9978.0,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,OneFam,Two_Story,Above_Average,Above_Average,1998,1998,Gable,CompShg,VinylSd,VinylSd,BrkFace,20.0,Typical,Typical,PConc,Typical,Typical,No,GLQ,3.0,Unf,0.0,324.0,926.0,GasA,Excellent,Y,SBrkr,926.0,678.0,0.0,1604.0,0,0,2,1,3,1,Good,7,Typ,1,Good,Attchd,Fin,2,470.0,Typical,Typical,Paved,360.0,36.0,0.0,0.0,0.0,0.0,No_Pool,No_Fence,,0.0,6,2010,WD,Normal,195500
6,One_Story_PUD_1946_and_Newer,Residential_Low_Density,41.0,4920.0,Pave,No_Alley_Access,Regular,Lvl,AllPub,Inside,Gtl,Stone_Brook,Norm,Norm,TwnhsE,One_Story,Very_Good,Average,2001,2001,Gable,CompShg,CemntBd,CmentBd,,0.0,Good,Typical,PConc,Good,Typical,Mn,GLQ,3.0,Unf,0.0,722.0,1338.0,GasA,Excellent,Y,SBrkr,1338.0,0.0,0.0,1338.0,1,0,2,0,2,1,Good,6,Typ,0,No_Fireplace,Attchd,Fin,2,582.0,Typical,Typical,Paved,0.0,0.0,170.0,0.0,0.0,0.0,No_Pool,No_Fence,,0.0,4,2010,WD,Normal,213500
7,One_Story_PUD_1946_and_Newer,Residential_Low_Density,43.0,5005.0,Pave,No_Alley_Access,Slightly_Irregular,HLS,AllPub,Inside,Gtl,Stone_Brook,Norm,Norm,TwnhsE,One_Story,Very_Good,Average,1992,1992,Gable,CompShg,HdBoard,HdBoard,,0.0,Good,Typical,PConc,Good,Typical,No,ALQ,1.0,Unf,0.0,1017.0,1280.0,GasA,Excellent,Y,SBrkr,1280.0,0.0,0.0,1280.0,0,0,2,0,2,1,Good,5,Typ,0,No_Fireplace,Attchd,RFn,2,506.0,Typical,Typical,Paved,0.0,82.0,0.0,0.0,144.0,0.0,No_Pool,No_Fence,,0.0,1,2010,WD,Normal,191500
8,One_Story_PUD_1946_and_Newer,Residential_Low_Density,39.0,5389.0,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Inside,Gtl,Stone_Brook,Norm,Norm,TwnhsE,One_Story,Very_Good,Average,1995,1996,Gable,CompShg,CemntBd,CmentBd,,0.0,Good,Typical,PConc,Good,Typical,No,GLQ,3.0,Unf,0.0,415.0,1595.0,GasA,Excellent,Y,SBrkr,1616.0,0.0,0.0,1616.0,1,0,2,0,2,1,Good,5,Typ,1,Typical,Attchd,RFn,2,608.0,Typical,Typical,Paved,237.0,152.0,0.0,0.0,0.0,0.0,No_Pool,No_Fence,,0.0,3,2010,WD,Normal,236500
9,Two_Story_1946_and_Newer,Residential_Low_Density,60.0,7500.0,Pave,No_Alley_Access,Regular,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,OneFam,Two_Story,Good,Average,1999,1999,Gable,CompShg,VinylSd,VinylSd,,0.0,Typical,Typical,PConc,Typical,Typical,No,Unf,7.0,Unf,0.0,994.0,994.0,GasA,Good,Y,SBrkr,1028.0,776.0,0.0,1804.0,0,0,2,1,3,1,Good,7,Typ,1,Typical,Attchd,Fin,2,442.0,Typical,Typical,Paved,140.0,60.0,0.0,0.0,0.0,0.0,No_Pool,No_Fence,,0.0,6,2010,WD,Normal,189000


-------------------------------------------------------------------------------

Let's start with a few mathematical combinations. We'll focus on features describing areas -- having the same units (square feet) makes it easy to combine them in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums.

# 1) Create Mathematical Transforms

Create the following features:

- `LivLotRatio`: the ratio of `GrLivArea` to `LotArea`
- `Spaciousness`: the sum of `FirstFlrSF` and `SecondFlrSF` divided by `TotRmsAbvGrd`
- `TotalOutsideSF`: the sum of `WoodDeckSF`, `OpenPorchSF`, `EnclosedPorch`, `Threeseasonporch`, and `ScreenPorch`

In [3]:
# YOUR CODE HERE
X_1 = pd.DataFrame()  # dataframe to hold new features

X_1["LivLotRatio"] = df["GrLivArea"] / df["LotArea"]
X_1["Spaciousness"] = (df["FirstFlrSF"] + df["SecondFlrSF"]) / df["TotRmsAbvGrd"]
X_1["TotalOutsideSF"] = df[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].sum(axis=1)


# Check your answer
q_1.check()

<span style="color:#33cc33">Correct</span>

-------------------------------------------------------------------------------

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

```
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
```
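
To see what these three steps produce, here is a minimal sketch on a made-up frame. The names `toy`, `Kind`, and `Area` are hypothetical stand-ins for `df`, `Categorical`, and `Continuous`; the values are invented.

```python
import pandas as pd

# Hypothetical data: one categorical column and one continuous column
toy = pd.DataFrame({
    "Kind": ["House", "Flat", "House"],
    "Area": [1500.0, 800.0, 2100.0],
})

# One indicator column per category, named Kind_<category>
dummies = pd.get_dummies(toy["Kind"], prefix="Kind")

# Row-wise multiply: each row keeps its Area only in the column of its own category
interaction = dummies.mul(toy["Area"], axis=0)
print(interaction)
```

Each row ends up with its area in exactly one column and zeros elsewhere, so the model can fit a separate effect of area for each category.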

# 2) Interaction with a Categorical

We discovered an interaction between `BldgType` and `GrLivArea` in Exercise 2. Now create their interaction features.

In [4]:
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
print('__'*10, 'one hot encoding of Bldg type, using appropriate prefix for readability','__'*10)
X_2 = pd.get_dummies(df['BldgType'], prefix='Bldg')
print(X_2.head(5))
# Multiply
print('__'*10, 'multiply building type boolean by GrLivArea to manifest categorical interaction','__'*10)
X_2 = X_2.mul(df['GrLivArea'], axis=0)
print(X_2.head(5))


# Check your answer
q_2.check()

____________________ one hot encoding of Bldg type, using appropriate prefix for readability ____________________
   Bldg_Duplex  Bldg_OneFam  Bldg_Twnhs  Bldg_TwnhsE  Bldg_TwoFmCon
0        False         True       False        False          False
1        False         True       False        False          False
2        False         True       False        False          False
3        False         True       False        False          False
4        False         True       False        False          False
____________________ multiply building type boolean by GrLivArea to manifest categorical interaction ____________________
   Bldg_Duplex  Bldg_OneFam  Bldg_Twnhs  Bldg_TwnhsE  Bldg_TwoFmCon
0          0.0       1656.0         0.0          0.0            0.0
1          0.0        896.0         0.0          0.0            0.0
2          0.0       1329.0         0.0          0.0            0.0
3          0.0       2110.0         0.0          0.0            0.0
4          0.0       1629.0         0.0          0.0            0.0

<span style="color:#33cc33">Correct</span>

# 3) Count Feature

Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature `PorchTypes` that counts how many of the following are greater than 0.0:

```
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
```
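
One way to build such a count, sketched on invented data: compare each column to zero to get booleans, then sum across the row, since `True` counts as 1. The frame `toy` and its column names are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical outdoor-area columns (square feet); 0.0 means the feature is absent
toy = pd.DataFrame({
    "DeckSF": [210.0, 0.0, 0.0],
    "PorchSF": [62.0, 0.0, 35.0],
    "ScreenSF": [0.0, 0.0, 120.0],
})
area_cols = ["DeckSF", "PorchSF", "ScreenSF"]

has_area = toy[area_cols].gt(0)   # boolean frame: is each area present?
kinds = has_area.sum(axis=1)      # number of True values in each row
print(kinds.tolist())
```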

In [5]:
print('__'*10,'previewing columns to be counted','__'*10)
df[['WoodDeckSF',
    'OpenPorchSF',
    'EnclosedPorch',
    'Threeseasonporch',
    'ScreenPorch']].head(5)

____________________ previewing columns to be counted ____________________


| | WoodDeckSF | OpenPorchSF | EnclosedPorch | Threeseasonporch | ScreenPorch |
|---|---|---|---|---|---|
| 0 | 210.0 | 62.0 | 0.0 | 0.0 | 0.0 |
| 1 | 140.0 | 0.0 | 0.0 | 0.0 | 120.0 |
| 2 | 393.0 | 36.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 212.0 | 34.0 | 0.0 | 0.0 | 0.0 |


In [6]:
X_3 = pd.DataFrame()

# YOUR CODE HERE
PorchTypes = ['WoodDeckSF',
              'OpenPorchSF',
              'EnclosedPorch',
              'Threeseasonporch',
              'ScreenPorch']
X_3["PorchTypes"] = df[PorchTypes].gt(0).sum(axis=1)

print('__'*10,'X_3 dataframe shows counts for different types of porches','__'*10)
print(X_3.head(5))
# Check your answer
q_3.check()

____________________ X_3 dataframe shows counts for different types of porches ____________________
   PorchTypes
0           2
1           2
2           2
3           0
4           2


<span style="color:#33cc33">Correct</span>

# 4) Break Down a Categorical Feature

`MSSubClass` describes the type of a dwelling:

In [7]:
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting `MSSubClass` at the first underscore `_`. (Hint: In the `split` method use an argument `n=1`.)
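
As a small sketch of how `n=1` behaves, the snippet below splits a few sample labels. The Series `labels` is made up for illustration; with `n=1` the string is cut only at the first underscore, so each row holds at most two pieces.

```python
import pandas as pd

# Hypothetical labels in the same style as MSSubClass
labels = pd.Series(["One_Story_1946_and_Newer", "Split_Foyer", "Duplex_All_Styles"])

# Each row becomes a two-element list: [first word, everything else]
parts = labels.str.split("_", n=1)

# .str[0] picks the first element of each list
first_word = parts.str[0]

# expand=True returns the pieces as DataFrame columns instead of lists
both = labels.str.split("_", n=1, expand=True)
print(first_word.tolist())
```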

In [8]:
X_4 = pd.DataFrame()

# YOUR CODE HERE
# .str[0] indexes element-wise into the list stored in each row of the Series
X_4["MSClass"] = df["MSSubClass"].str.split("_", n=1).str[0]
print(X_4.head(5))

# Check your answer
q_4.check()

  MSClass
0     One
1     One
2     One
3     One
4     Two


<span style="color:#33cc33">Correct</span>

# 5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature `MedNhbdArea` that describes the *median* of `GrLivArea` grouped on `Neighborhood`.
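
A grouped transform differs from a grouped aggregation in the shape of its result. The sketch below shows both on a made-up frame; `toy`, `Nhbd`, and `Area` are hypothetical names and the numbers are invented.

```python
import pandas as pd

# Hypothetical homes in two neighborhoods
toy = pd.DataFrame({
    "Nhbd": ["A", "A", "B", "B", "B"],
    "Area": [1000.0, 2000.0, 900.0, 1200.0, 1500.0],
})

# agg: one value per group, indexed by the group key
per_group = toy.groupby("Nhbd")["Area"].agg("median")

# transform: the group's value repeated on every row, aligned with the original index
per_row = toy.groupby("Nhbd")["Area"].transform("median")

print(per_group)
print(per_row.tolist())
```

Because the transform result has one entry per original row, it can be assigned directly as a new column, which is what a per-home feature needs.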

In [9]:
X_5 = pd.DataFrame()

# YOUR CODE HERE
# transform() broadcasts each group's median back to every row; agg() would return one row per group
X_5["MedNhbdArea"] = df.groupby("Neighborhood")["GrLivArea"].transform("median")

# Check your answer
q_5.check()

<span style="color:#33cc33">Correct</span>

Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

In [10]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)

0.13954039591355258

# Keep Going #

[**Untangle spatial relationships**](https://www.kaggle.com/ryanholbrook/clustering-with-k-means) by adding cluster labels to your dataset.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/feature-engineering/discussion) to chat with other learners.*