# Kaggle Feature Engineering micro-course
- Better features make better models.
- Discover how to get the most out of your data.
- https://www.kaggle.com/learn/feature-engineering

## 3.- Exercise: Creating Features
- In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

In [2]:
### Import necessary libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

import requests
import zipfile as zfm
import io

In [3]:
### Write url w/zipfile path

# Data that define repo and filename w/path
ro = 'jmonti-gh'                  # repo_owner
rn = 'Datasets'                   # repo_name
zipfln = 'FE_CourseData_1.zip'
dataset = 'ames.csv'

# Data necessary if a proxy is used
proxies = {
  'http': 'http://jorge.monti:jorgemonti2009@172.16.1.49:3128',
  'https': 'http://jorge.monti:jorgemonti2009@172.16.1.49:3128'    # https://jorge.monti:jorgemonti2009@172.16.1.49:3128
}

# url where to obtain the response
url = f'https://raw.githubusercontent.com/{ro}/{rn}/main/{zipfln}'

In [4]:
### try-except block to get the zipfile containing the dataset
try:
    r = requests.get(url)
    print('No Proxy needed')
except OSError as oe:
    if 'ProxyError' in str(oe):
        r = requests.get(url, proxies=proxies)
        print('Proxy used!')
    else:
        # Any other error: report it, then re-raise so `r` is never left undefined
        ln = '-' * 5 + '\n'
        for er in [oe, oe.args]:
            print(ln, er, '\nType: ', type(er), sep='')
        raise

No Proxy needed


In [5]:
### Read the zipfile and load the dataset
with zfm.ZipFile(io.BytesIO(r.content)) as zf:
    print(zf.namelist())
    df = pd.read_csv(zf.open(dataset))

print('\nDataset loaded:', dataset, '->', df.shape)
df.iloc[[0, 9, -9, -1]]

['abalone.csv', 'accidents.csv', 'airbnb.csv', 'ames.csv', 'autos.csv', 'bike-sharing.csv', 'caravan.csv', 'concrete.csv', 'customer.csv', 'DataDocumentation.txt']

Dataset loaded: ames.csv -> (2930, 79)


| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YearSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 141.0 | 31770.0 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Corner | ... | 0.0 | No_Pool | No_Fence | | 0.0 | 5 | 2010 | WD | Normal | 215000 |
| 9 | Two_Story_1946_and_Newer | Residential_Low_Density | 60.0 | 7500.0 | Pave | No_Alley_Access | Regular | Lvl | AllPub | Inside | ... | 0.0 | No_Pool | No_Fence | | 0.0 | 6 | 2010 | WD | Normal | 189000 |
| 2921 | Duplex_All_Styles_and_Ages | Residential_Low_Density | 55.0 | 12640.0 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Inside | ... | 0.0 | No_Pool | No_Fence | | 0.0 | 7 | 2006 | WD | Normal | 150900 |
| 2929 | Two_Story_1946_and_Newer | Residential_Low_Density | 74.0 | 9627.0 | Pave | No_Alley_Access | Regular | Lvl | AllPub | Inside | ... | 0.0 | No_Pool | No_Fence | | 0.0 | 11 | 2006 | WD | Normal | 188000 |


In [6]:
# 79 Cols, let's see them
# df.info()
## From .info() we can see that there are a lot of NaNs
## and a lot of object cols, plus int64 and float64 cols

In [7]:
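### Function that computes the cross-validated score (RMSLE)
def score_dataset(X, y, model=XGBRegressor()):
    X = X.copy()  # work on a copy so the caller's DataFrame is not modified
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()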
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score
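
As a rough illustration of the metric that `score_dataset` reports, here is a minimal sketch of RMSLE computed by hand with NumPy. The helper name `rmsle` and the prices are made up for the example; the `log(1 + x)` form matches how scikit-learn defines mean squared log error.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) for both arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical prices, not taken from the Ames data
y_true = [100_000, 200_000, 300_000]
y_pred = [110_000, 180_000, 300_000]
print(rmsle(y_true, y_pred))
```

The `"neg_mean_squared_log_error"` scorer returns the negated squared error, which is why `score_dataset` multiplies by -1 before taking the square root. Note that it takes the root of the mean over the five folds, not the mean of per-fold roots.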

In [8]:
### Separate features (predictors) from target
X = df.copy()
#display(X['SalePrice'])
y = X.pop('SalePrice')
#display(X['SalePrice'])  # KeyError: 'SalePrice'

Let's start with a few mathematical combinations. We'll focus on features describing areas -- having the same units (square-feet) makes it easy to combine them in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums.

### 1) Create Mathematical Transforms

Create the following features:

- `LivLotRatio`: the ratio of `GrLivArea` to `LotArea`
- `Spaciousness`: the sum of `FirstFlrSF` and `SecondFlrSF` divided by `TotRmsAbvGrd`
- `TotalOutsideSF`: the sum of `WoodDeckSF`, `OpenPorchSF`, `EnclosedPorch`, `Threeseasonporch`, and `ScreenPorch`

In [9]:
### NEW df (X_1) to hold new features
X_1 = pd.DataFrame()

X_1['LivLotRatio'] = X.GrLivArea / X.LotArea
X_1["Spaciousness"] = (X.FirstFlrSF + X.SecondFlrSF) / X.TotRmsAbvGrd

cols_to_sum = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
               'Threeseasonporch', 'ScreenPorch']
X_1["TotalOutsideSF"] = X[cols_to_sum].sum(axis=1)

#X_1.LivLotRatio.name
X_1.iloc[[0, 9, -9, -1]]

| | LivLotRatio | Spaciousness | TotalOutsideSF |
|---|---|---|---|
| 0 | 0.052125 | 236.571429 | 272.0 |
| 9 | 0.240533 | 257.714286 | 200.0 |
| 2921 | 0.136709 | 216.0 | 40.0 |
| 2929 | 0.207749 | 222.222222 | 238.0 |


__Solution:__

X_1["LivLotRatio"] = df.GrLivArea / df.LotArea    
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd    
X_1["TotalOutsideSF"] = df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + df.Threeseasonporch + df.ScreenPorch
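
The notebook's `sum(axis=1)` and the solution's chain of `+` give the same `TotalOutsideSF` only when none of the summed columns has a missing value. The sketch below uses a hypothetical two-column toy frame to show the difference; whether it matters for the Ames columns depends on whether they contain NaNs, which is not checked here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value (not the Ames data)
toy = pd.DataFrame({
    "WoodDeckSF": [100.0, 0.0, 50.0],
    "OpenPorchSF": [20.0, np.nan, 0.0],
})

# DataFrame.sum skips NaN by default, so a missing value counts as 0
by_sum = toy[["WoodDeckSF", "OpenPorchSF"]].sum(axis=1)

# The + operator propagates NaN into the result
by_plus = toy.WoodDeckSF + toy.OpenPorchSF

print(pd.DataFrame({"sum(axis=1)": by_sum, "plus": by_plus}))
```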

-------------------------------------------------------------------------------

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

```
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
```
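
To make the pattern concrete, here is a small runnable sketch on a toy frame. The column names `Categorical` and `Continuous` follow the snippet above and the values are invented; `dtype=float` is passed so the dummies are numeric regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Categorical": ["A", "B", "A"],
    "Continuous": [10.0, 20.0, 30.0],
})

# One indicator column per category: Cat_A, Cat_B
dummies = pd.get_dummies(toy.Categorical, prefix="Cat", dtype=float)

# Row-by-row product: each row keeps its Continuous value only in the
# column of its own category, and 0.0 elsewhere
interaction = dummies.mul(toy.Continuous, axis=0)
print(interaction)
```

Each resulting column lets the model fit a separate effect of the continuous feature for that category.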

### 2) Interaction with a Categorical

We discovered an interaction between `BldgType` and `GrLivArea` in Exercise 2. Now create their interaction features.

In [13]:
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(df.BldgType, prefix='Bldg')
# Multiply
X_2 = X_2.mul(df.GrLivArea, axis=0)
X_2

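| | Bldg_Duplex | Bldg_OneFam | Bldg_Twnhs | Bldg_TwnhsE | Bldg_TwoFmCon |
|---|---|---|---|---|---|
| 0 | 0.0 | 1656.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 896.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1329.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 2110.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1629.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 2925 | 0.0 | 1003.0 | 0.0 | 0.0 | 0.0 |
| 2926 | 0.0 | 902.0 | 0.0 | 0.0 | 0.0 |
| 2927 | 0.0 | 970.0 | 0.0 | 0.0 | 0.0 |
| 2928 | 0.0 | 1389.0 | 0.0 | 0.0 | 0.0 |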


### 3) Count Feature

Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature `PorchTypes` that counts how many of the following are greater than 0.0:

```
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
```
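
One way to build such a count, sketched on a hypothetical three-column toy frame: compare every column with 0 to get a boolean mask, then sum the mask across each row, since `True` counts as 1.

```python
import pandas as pd

# Hypothetical toy data using a subset of the column names above
toy = pd.DataFrame({
    "WoodDeckSF": [210.0, 0.0, 0.0],
    "OpenPorchSF": [62.0, 0.0, 84.0],
    "ScreenPorch": [0.0, 0.0, 120.0],
})

mask = toy.gt(0.0)              # True where the area is greater than 0
porch_types = mask.sum(axis=1)  # number of True values per row
print(porch_types)
```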

In [18]:
print(df.select_dtypes('bool').columns)


Index([], dtype='object')


In [21]:
X_3 = pd.DataFrame()
outdoors = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
            'Threeseasonporch', 'ScreenPorch']
X_3['PorchTypes'] = df[outdoors].gt(0.0).sum(axis=1)

In [28]:
df['PorchTypes'] = X_3['PorchTypes']
#df[outdoors + ['PorchTypes']]
# show two rows (first and last) for each value of the new col 'PorchTypes'
auxdf = pd.DataFrame()
for i in df.PorchTypes.value_counts().index:
    #print(i)
    minidf = df[outdoors + ['PorchTypes']].\
            loc[df.PorchTypes.eq(i)].iloc[[0,-1]]
    auxdf = pd.concat([auxdf, minidf])
auxdf

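| | WoodDeckSF | OpenPorchSF | EnclosedPorch | Threeseasonporch | ScreenPorch | PorchTypes |
|---|---|---|---|---|---|---|
| 6 | 0.0 | 0.0 | 170.0 | 0.0 | 0.0 | 1 |
| 2926 | 164.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 0 | 210.0 | 62.0 | 0.0 | 0.0 | 0.0 | 2 |
| 2929 | 190.0 | 48.0 | 0.0 | 0.0 | 0.0 | 2 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2922 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 15 | 503.0 | 36.0 | 0.0 | 0.0 | 210.0 | 3 |
| 2884 | 128.0 | 53.0 | 0.0 | 0.0 | 155.0 | 3 |
| 2212 | 174.0 | 24.0 | 120.0 | 0.0 | 228.0 | 4 |
| 2212 | 174.0 | 24.0 | 120.0 | 0.0 | 228.0 | 4 |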


In [34]:
### The sample above shows no Threeseasonporch > 0, so check that such rows exist
df['Threeseasonporch'].describe()
df[outdoors + ['PorchTypes']][df.Threeseasonporch.gt(0)].iloc[[0, 9, -9, -1]]

| | WoodDeckSF | OpenPorchSF | EnclosedPorch | Threeseasonporch | ScreenPorch | PorchTypes |
|---|---|---|---|---|---|---|
| 79 | 370.0 | 70.0 | 0.0 | 238.0 | 0.0 | 3 |
| 655 | 0.0 | 0.0 | 0.0 | 140.0 | 0.0 | 1 |
| 2094 | 0.0 | 84.0 | 0.0 | 196.0 | 0.0 | 2 |
| 2765 | 0.0 | 130.0 | 0.0 | 130.0 | 0.0 | 2 |


### 4) Break Down a Categorical Feature
- `MSSubClass` describes the type of a dwelling.

In [35]:
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting `MSSubClass` at the first underscore `_`. (Hint: In the `split` method use an argument `n=1`.)
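
As a quick sketch of what the hint does, here is `str.split` applied to three of the category names listed above. With `n=1` the string is cut only at the first underscore, and `expand=True` returns the two pieces as columns `0` and `1`.

```python
import pandas as pd

# Three category names copied from the MSSubClass values above
s = pd.Series([
    "One_Story_1946_and_Newer_All_Styles",
    "Split_Foyer",
    "Duplex_All_Styles_and_Ages",
])

parts = s.str.split("_", n=1, expand=True)  # at most one split per string
first_word = parts[0]                       # text before the first underscore
print(parts)
```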

In [51]:
X_4 = pd.DataFrame()
X_4['MSClass'] = (
    df['MSSubClass']
    .str
    .split('_', n=1, expand=True))[0]

### 5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature `MedNhbdArea` that describes the *median* of `GrLivArea` grouped on `Neighborhood`.
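
A minimal sketch of a grouped transform on hypothetical data: unlike a plain aggregation, which returns one row per group, `transform` broadcasts each group's statistic back to every row of that group, so the result lines up with the original frame.

```python
import pandas as pd

# Hypothetical toy data: two neighborhoods, five homes
toy = pd.DataFrame({
    "Neighborhood": ["N1", "N1", "N1", "N2", "N2"],
    "GrLivArea": [1000.0, 1200.0, 2000.0, 900.0, 1100.0],
})

# One median per neighborhood (2 rows)
per_group = toy.groupby("Neighborhood")["GrLivArea"].median()

# The same medians repeated for every home (5 rows, same index as toy)
toy["MedNhbdArea"] = toy.groupby("Neighborhood")["GrLivArea"].transform("median")
print(toy)
```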

In [52]:
X_5 = pd.DataFrame()

# YOUR CODE HERE
X_5["MedNhbdArea"] = (
    df.groupby('Neighborhood')['GrLivArea']
    .transform('median'))

X_5.head(10)

| | MedNhbdArea |
|---|---|
| 0 | 1200.0 |
| 1 | 1200.0 |
| 2 | 1200.0 |
| 3 | 1200.0 |
| 4 | 1560.0 |
| 5 | 1560.0 |
| 6 | 1767.0 |
| 7 | 1767.0 |
| 8 | 1767.0 |
| 9 | 1560.0 |


In [53]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)

0.13865658128932104