<H1>Feature Engineering</H1>

<p>Extracted from https://pycaret.gitbook.io/docs/get-started/preprocessing/feature-engineering</p>

<H2>Polynomial Features</H2>
<p>In machine learning experiments, the relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case, and the relationship can be more complex. Creating new polynomial features can help capture such a relationship, which might otherwise go unnoticed. In PyCaret, this is enabled with the polynomial_features parameter of setup.</p>
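<p>To make the idea concrete, the following is a minimal sketch in plain Python of a degree-2 polynomial expansion. It is not PyCaret's internal code; the helper names polynomial_features and n_polynomial_terms are hypothetical and exist only for this illustration.</p>

```python
from itertools import combinations_with_replacement
from math import comb, prod


def polynomial_features(row, degree=2):
    """Return every product of the row's values up to the given total degree."""
    columns = list(row)
    expanded = {}
    for d in range(1, degree + 1):
        # Each combination (with repetition) of column names is one monomial.
        for combo in combinations_with_replacement(columns, d):
            expanded[" ".join(combo)] = prod(row[c] for c in combo)
    return expanded


def n_polynomial_terms(n_features, degree=2):
    """Count the expanded terms, excluding the constant (bias) term."""
    return comb(n_features + degree, degree) - 1


# Two price columns taken from the first row of the juice dataset.
row = {"PriceCH": 1.75, "PriceMM": 1.99}
expanded = polynomial_features(row, degree=2)
for name, value in expanded.items():
    print(f"{name}: {value:.4f}")

# Number of terms a degree-2 expansion produces for 18 input features.
print(n_polynomial_terms(18))
```

<p>For 18 input features the count is 189, which would be consistent with the transformed data shape of 190 columns (189 features plus the target) reported by setup for the juice data.</p>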

In [20]:
import pandas as pd
pd.options.display.max_rows = 4000


In [4]:
# load dataset
from pycaret.datasets import get_data
juice = get_data('juice')

# init setup with polynomial feature expansion enabled
from pycaret.classification import setup
clf1 = setup(data=juice, target='Purchase', polynomial_features=True)

index,Id,Purchase,WeekofPurchase,StoreID,PriceCH,PriceMM,DiscCH,DiscMM,SpecialCH,SpecialMM,LoyalCH,SalePriceMM,SalePriceCH,PriceDiff,Store7,PctDiscMM,PctDiscCH,ListPriceDiff,STORE
0,1,CH,237,1,1.75,1.99,0.0,0.0,0,0,0.5,1.99,1.75,0.24,No,0.0,0.0,0.24,1
1,2,CH,239,1,1.75,1.99,0.0,0.3,0,1,0.6,1.69,1.75,-0.06,No,0.150754,0.0,0.24,1
2,3,CH,245,1,1.86,2.09,0.17,0.0,0,0,0.68,2.09,1.69,0.4,No,0.0,0.091398,0.23,1
3,4,MM,227,1,1.69,1.69,0.0,0.0,0,0,0.4,1.69,1.69,0.0,No,0.0,0.0,0.0,1
4,5,CH,228,7,1.69,1.69,0.0,0.0,0,0,0.956535,1.69,1.69,0.0,Yes,0.0,0.0,0.0,0


index,Description,Value
0,Session id,4026
1,Target,Purchase
2,Target type,Binary
3,Target mapping,"CH: 0, MM: 1"
4,Original data shape,"(1070, 19)"
5,Transformed data shape,"(1070, 190)"
6,Transformed train set shape,"(749, 190)"
7,Transformed test set shape,"(321, 190)"
8,Ordinal features,1
9,Numeric features,17


In [17]:
clf1.get_config('dataset_transformed')

index,age,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Self-emp-inc,workclass_Federal-gov,workclass_Never-worked,workclass_Without-pay,education_HS-grad,...,race_Black,race_Asian-Pac-Islander,race_Amer-Indian-Eskimo,race_Other,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
11518,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,30.0,0.246101,0
14847,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,55.0,0.246101,1
29022,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,50.0,0.246101,0
22295,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2339.0,60.0,0.246101,0
29925,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0,0.246101,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27318,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,0.246101,0
28452,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,4386.0,0.0,40.0,0.418963,1
20781,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,20.0,0.246101,0
11019,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0,0.246101,0


<H2>Group Features</H2>
<p>When a dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, new statistical features such as the mean, median, variance and standard deviation of that group can be created from the existing features using the group_features parameter.</p>
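<p>As an illustration of what a group feature is, the sketch below computes row-wise statistics over a set of related columns in plain Python. It is an assumption-laden example rather than PyCaret's implementation: the helper group_statistics and the output names with the bill prefix are hypothetical, and only the three bill amounts visible in the sample output of the credit data are used.</p>

```python
import statistics


def group_statistics(row, group, prefix):
    """Summarise the values of one row across a group of related columns."""
    values = [row[col] for col in group]
    return {
        f"{prefix}_mean": statistics.mean(values),
        f"{prefix}_median": statistics.median(values),
        f"{prefix}_variance": statistics.variance(values),  # sample variance
        f"{prefix}_std": statistics.stdev(values),  # sample standard deviation
    }


# Three bill amounts from one row of the credit dataset (row index 1).
row = {"BILL_AMT4": 14331.0, "BILL_AMT5": 14948.0, "BILL_AMT6": 15549.0}
stats = group_statistics(row, ["BILL_AMT4", "BILL_AMT5", "BILL_AMT6"], prefix="bill")
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

<p>Each statistic becomes one extra column per group, so the number of added features depends on how many statistics are computed, not on how many columns the group contains.</p>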

In [28]:
# load dataset
from pycaret.datasets import get_data
credit = get_data('credit')

# init setup, treating the six BILL_AMT columns as one feature group
from pycaret.classification import setup
clf2 = setup(
    data=credit,
    target='default',
    group_features=['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'],
)


index,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,90000,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
2,50000,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
3,50000,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
4,50000,1,1,2,37,0,0,0,0,0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0


index,Description,Value
0,Session id,939
1,Target,default
2,Target type,Binary
3,Original data shape,"(24000, 24)"
4,Transformed data shape,"(24000, 30)"
5,Transformed train set shape,"(16800, 30)"
6,Transformed test set shape,"(7200, 30)"
7,Numeric features,23
8,Preprocess,True
9,Imputation type,simple


In [32]:

# retrieve the transformed dataset and save it for inspection
df2 = clf2.get_config('dataset_transformed')
df2.to_csv('groupfeatures.csv', index=False)


<H2>Bin Numeric Features</H2>
<p>Feature binning turns a continuous variable into a categorical one using a predefined number of bins. It is useful when a continuous feature has too many unique values or a few extreme values outside the expected range. Such extreme values can unduly influence the trained model and reduce its prediction accuracy. In PyCaret, continuous numeric features can be binned into intervals using the bin_numeric_features parameter. PyCaret uses the ‘sturges’ rule to determine the number of bins and K-Means clustering to convert continuous numeric features into categorical features.</p>
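<p>The two steps described above, choosing a bin count with Sturges' rule and assigning values to bins with one-dimensional K-Means, can be sketched in plain Python as follows. This is a simplified illustration and not PyCaret's code: the function names are hypothetical, the ages are made-up values, and the evenly spaced starting centers are a choice made here to keep the run deterministic.</p>

```python
import math


def sturges_bins(n_samples):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n samples."""
    return math.ceil(math.log2(n_samples)) + 1


def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on a single feature; returns the sorted centers."""
    lo, hi = min(values), max(values)
    # Deterministic start: centers spread evenly over the observed range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def assign_bins(values, centers):
    """Replace each value with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: abs(v - centers[i])) for v in values
    ]


# Illustrative ages (made up), including one extreme value.
ages = [18, 19, 21, 22, 35, 36, 38, 40, 61, 63, 65, 90]
k = sturges_bins(len(ages))
centers = kmeans_1d(ages, k)
bins = assign_bins(ages, centers)
print(k, centers)
print(bins)
```

<p>After binning, the extreme age is represented only by its bin index, so its distance from the rest of the data no longer enters the model directly.</p>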

In [33]:
# load dataset
from pycaret.datasets import get_data
income = get_data('income')

# init setup, binning the continuous 'age' feature
from pycaret.classification import setup
clf3 = setup(data=income, target='income >50K', bin_numeric_features=['age'])

index,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


index,Description,Value
0,Session id,7694
1,Target,income >50K
2,Target type,Binary
3,Original data shape,"(32561, 14)"
4,Transformed data shape,"(32561, 64)"
5,Transformed train set shape,"(22792, 64)"
6,Transformed test set shape,"(9769, 64)"
7,Ordinal features,1
8,Numeric features,5
9,Categorical features,8


In [36]:
# retrieve the transformed dataset and save it for inspection
df3 = clf3.get_config('dataset_transformed')
df3.to_csv('agebins.csv', index=False)


index,age,workclass_Private,workclass_Local-gov,workclass_Self-emp-not-inc,workclass_Federal-gov,workclass_Self-emp-inc,workclass_State-gov,workclass_Never-worked,workclass_Without-pay,education_Some-college,...,race_Other,race_Black,race_Asian-Pac-Islander,race_Amer-Indian-Eskimo,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
29521,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,35.0,0.246292,0
21239,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0,0.246292,0
15339,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,40.0,0.246292,0
24895,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40.0,0.246292,1
27532,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,15024.0,0.0,45.0,0.246292,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
843,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,48.0,0.246292,0
28263,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,30.0,0.246292,0
11121,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,48.0,0.246292,1
4379,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,20.0,0.246292,0


<H2>Combine Rare Levels</H2>
<p>Sometimes a dataset has one or more categorical features with a very large number of levels (i.e. high-cardinality features). If such features are encoded into numeric values, the resulting matrix is sparse. This not only slows the experiment down, because the number of features and hence the size of the dataset grows many times over, but also introduces noise. A sparse matrix can be avoided by combining the rare levels of high-cardinality features. In PyCaret, this is done with the rare_to_value parameter, which sets the minimum fraction of rows a level must account for to be kept; less frequent levels are merged into a single level.</p>
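<p>A minimal sketch of the idea in plain Python is shown below. It is not PyCaret's implementation: the function combine_rare_levels, its min_fraction and rare_value arguments, and the country counts are all assumptions made for this illustration.</p>

```python
from collections import Counter


def combine_rare_levels(values, min_fraction=0.1, rare_value="rare"):
    """Replace levels that cover less than min_fraction of the rows."""
    counts = Counter(values)
    n_rows = len(values)
    keep = {level for level, c in counts.items() if c / n_rows >= min_fraction}
    return [v if v in keep else rare_value for v in values]


# Illustrative column (made-up counts): one dominant level, one mid, one rare.
country = ["United-States"] * 16 + ["Mexico"] * 3 + ["Cuba"] * 1
combined = combine_rare_levels(country, min_fraction=0.1)
print(Counter(country))
print(Counter(combined))
```

<p>With a threshold of 0.1, the level covering 5% of the rows is merged while the one covering 15% is kept, so one-hot encoding the result produces fewer, denser columns.</p>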


In [38]:
# load dataset
from pycaret.datasets import get_data
income = get_data('income')

# init setup, merging levels that cover less than 10% of the rows
from pycaret.classification import setup
clf4 = setup(data=income, target='income >50K', rare_to_value=0.1)

index,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


index,Description,Value
0,Session id,706
1,Target,income >50K
2,Target type,Binary
3,Original data shape,"(32561, 14)"
4,Transformed data shape,"(32561, 30)"
5,Transformed train set shape,"(22792, 30)"
6,Transformed test set shape,"(9769, 30)"
7,Ordinal features,4
8,Numeric features,5
9,Categorical features,8


In [40]:
# retrieve the transformed dataset and save it for inspection
df4 = clf4.get_config('dataset_transformed')
df4.to_csv('combineRareLevel.csv', index=False)
