# Feature Engineering

What are features?
- variables we use to help predict our target.
- not our target variable
- not all of the independent variables we start with.
- the independent variables we END with, the ones we use in modeling.

Why would we choose some variables and not others?
- it doesn't influence the target.
- it may overfit the model.
- it has too many null values.
- it depends on another attribute, so it adds little new information.
- it's a category with too many values to encode.
- it carries information that could lead to discrimination or unethical decisions.

Why would we want to create new variables?
- two variables depend on each other, so blend them into one.
- a categorical variable has too many values, so bin it into fewer categories.
- a continuous variable has a lot of noise (binning is one way to tame it).
- a calculation of 2 variables is more useful than either alone, like length x width.
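
As a rough illustration of blending and binning, here is a minimal pandas sketch. The `rooms` frame, its column names, and the bin edges are all made up for the example; they are not part of the student or tips data used later.

```python
import pandas as pd

# Hypothetical data: room dimensions and a categorical column.
rooms = pd.DataFrame({
    'length': [10, 12, 8, 20],
    'width': [10, 9, 8, 15],
    'city': ['Austin', 'Dallas', 'Waco', 'Austin'],
})

# Calculation of 2 variables: blend length and width into a single feature.
rooms['area'] = rooms['length'] * rooms['width']

# Binning a categorical: keep the most common value, lump the rest into 'other'.
top = rooms['city'].value_counts().index[0]
rooms['city_binned'] = rooms['city'].where(rooms['city'] == top, 'other')

# Binning a continuous variable into a few ordered buckets.
rooms['size_bucket'] = pd.cut(rooms['area'], bins=[0, 100, 200, 400],
                              labels=['small', 'medium', 'large'])
print(rooms)
```

Whether a derived column like `area` should replace or sit alongside its inputs is a modeling decision; the sketch keeps both.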


Why do we try to limit the number of variables?
- curse of dimensionality
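
One way to get a feel for the curse of dimensionality is to watch distances lose their contrast as columns are added. The sketch below uses random points only; `distance_contrast` is a hypothetical helper written for this note, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=500):
    """Return (farthest - nearest) / nearest distance from one query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# As the number of dimensions grows, near and far points look more alike.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 2))
```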


What is feature engineering?
- creating new features
- removing features
- selecting top features
- transforming features

Goal in feature engineering:
I want to make it easy for the computer to see the patterns

Algorithmic feature selection methods:
- Filter methods: score each feature individually against the target (for example, by correlation) and keep the highest-scoring ones. Limitations: a filter can't check for things like confidential info; it won't notice when 3 features are weak individually but strong together; and it could end up giving you 3 features that all carry the same information.
- Wrapper methods: build n different models on different sets of features, evaluate their performance, and keep the features from the model that performed best. Computationally expensive.
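
The filter idea, and its blind spot for redundant features, can be sketched with plain pandas. Everything here is synthetic: `hours`, `minutes`, `noise`, and `score` are invented columns, and ranking by absolute correlation is just one possible filter score.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# Made-up data: 'hours' drives the target, 'minutes' repeats the same
# information in different units, and 'noise' is unrelated.
hours = rng.normal(5, 2, n)
toy = pd.DataFrame({
    'hours': hours,
    'minutes': hours * 60,
    'noise': rng.normal(0, 1, n),
})
toy['score'] = 3 * toy['hours'] + rng.normal(0, 1, n)

# Filter step: rank features by absolute correlation with the target.
corr_with_target = toy.drop(columns='score').corrwith(toy['score']).abs()
top_two = corr_with_target.sort_values(ascending=False).head(2).index.tolist()
print(top_two)

# Redundancy check the filter does not do for us: feature-to-feature correlation.
print(toy[['hours', 'minutes']].corr().loc['hours', 'minutes'])
```

Both selected features carry the same information here, which is the "3 features that all give the same information" problem in miniature.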

Importance of scaling:
    
If one variable has significantly larger units than another, its apparent importance can be distorted. So, scale before running feature selection.

Must scale X's, do not scale y.
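
To make the "fit on train, apply everywhere" idea concrete, here is a hand-rolled min-max scaling in numpy. It is only a sketch of the formula `(x - min) / (max - min)`; the toy arrays and the `min_max` helper are made up, and y is deliberately left out.

```python
import numpy as np

# Made-up feature matrices: column 0 is in the thousands, column 1 is 0-5.
X_train_toy = np.array([[1000.0, 1.0], [3000.0, 5.0], [2000.0, 3.0]])
X_validate_toy = np.array([[2500.0, 2.0], [4000.0, 0.0]])

# "Fit": learn each column's min and max from train only.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

def min_max(X):
    """Apply the train-derived min-max scaling to any split."""
    return (X - col_min) / (col_max - col_min)

print(min_max(X_train_toy))     # every train column now spans 0 to 1
print(min_max(X_validate_toy))  # other splits can fall outside 0 to 1
```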

**Features are the difference**

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

Wrangle

- Acquire

In [2]:
df = pd.read_csv('student/student-mat.csv', sep=';')

In [3]:
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


Summarize

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

Nulls
- No missing values

Numeric Columns

In [5]:
df.describe()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.749367,2.521519,1.448101,2.035443,0.334177,3.944304,3.235443,3.108861,1.481013,2.291139,3.55443,5.708861,10.908861,10.713924,10.41519
std,1.276043,1.094735,1.088201,0.697505,0.83924,0.743651,0.896659,0.998862,1.113278,0.890741,1.287897,1.390303,8.003096,3.319195,3.761505,4.581443
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,3.0,0.0,8.0,9.0,8.0
50%,17.0,3.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,4.0,11.0,11.0,11.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,8.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,20.0


Object Columns

- How many unique values in each column?

In [6]:
mask = np.array(df.dtypes == 'object')

filter df columns by using the mask

In [8]:
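# df.dtypes == 'object' returns a series; the mask above is that series
# converted to a boolean array (True for each object column).
# iloc keeps only the columns where the mask is True.
obj_df = df.iloc[:, mask]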

loop through all the object columns and generate value counts of each unique value.

In [11]:
# loop through each column name in the list of columns
# print the value_counts

for col in obj_df.columns:
    print(obj_df[col].value_counts())

GP    349
MS     46
Name: school, dtype: int64
F    208
M    187
Name: sex, dtype: int64
U    307
R     88
Name: address, dtype: int64
GT3    281
LE3    114
Name: famsize, dtype: int64
T    354
A     41
Name: Pstatus, dtype: int64
other       141
services    103
at_home      59
teacher      58
health       34
Name: Mjob, dtype: int64
other       217
services    111
teacher      29
at_home      20
health       18
Name: Fjob, dtype: int64
course        145
home          109
reputation    105
other          36
Name: reason, dtype: int64
mother    273
father     90
other      32
Name: guardian, dtype: int64
no     344
yes     51
Name: schoolsup, dtype: int64
yes    242
no     153
Name: famsup, dtype: int64
no     214
yes    181
Name: paid, dtype: int64
yes    201
no     194
Name: activities, dtype: int64
yes    314
no      81
Name: nursery, dtype: int64
yes    375
no      20
Name: higher, dtype: int64
yes    329
no      66
Name: internet, dtype: int64
no     263
yes    132
Name: romantic, 

In [12]:
# create df with new dummy vars
dummy_df = pd.get_dummies(obj_df, dummy_na=False, drop_first=True)

In [13]:
# concatenate the dataframe with dummies to our original dataframe
#via column (axis=1)
df = pd.concat([df, dummy_df], axis=1)

In [14]:
# drop object columns from df
df.drop(columns=obj_df.columns, inplace=True)
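
The three encoding steps above (make dummies, concatenate, drop the originals) can be seen end to end on a tiny frame. This is a sketch on made-up rows; `toy`, `cat_cols`, and `encoded` are names invented for the example.

```python
import pandas as pd

# Made-up frame with the same kind of columns as the student data.
toy = pd.DataFrame({
    'age': [15, 16, 17],
    'school': ['GP', 'MS', 'GP'],
    'Mjob': ['teacher', 'at_home', 'other'],
})

cat_cols = ['school', 'Mjob']

# drop_first=True drops one level per column (the first in sorted order),
# since it is implied when all the other dummies are 0.
dummies = pd.get_dummies(toy[cat_cols], drop_first=True)

# Attach the dummies and remove the original text columns in one pass.
encoded = pd.concat([toy.drop(columns=cat_cols), dummies], axis=1)
print(encoded.columns.tolist())
```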

Split

Split data into train, validate, test

In [16]:
from sklearn.model_selection import train_test_split
train_validate, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)

In [17]:
train.shape, validate.shape, test.shape

((221, 42), (95, 42), (79, 42))
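
The row counts follow from the two split fractions. The arithmetic below is a sketch that assumes the held-out share is rounded up to a whole row, which is consistent with the shapes shown above.

```python
import math

n_rows = 395

# First split: 20% held out for test (assumed to round up).
n_test = math.ceil(0.2 * n_rows)
n_train_validate = n_rows - n_test

# Second split: 30% of the remainder held out for validate.
n_validate = math.ceil(0.3 * n_train_validate)
n_train = n_train_validate - n_validate

print(n_train, n_validate, n_test)  # 221 95 79
```

So train ends up with roughly 56% of the rows, validate 24%, and test 20%.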

Split into X and y dataframes

In [18]:
# x df's are all cols except G3
X_train = train.drop(columns=['G3'])
X_validate = validate.drop(columns=['G3'])
X_test = test.drop(columns=['G3'])

# y df's are just G3
y_train = train[['G3']]
y_validate = validate[['G3']]
y_test = test[['G3']]

Explore

Scale

In [24]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True).fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

Feature Selection
    1. SelectKBest
    2. RFE: Recursive Feature Elimination

SelectKBest

- filter method
- score each attribute individually against the target variable (f_regression is based on each attribute's correlation with the target) and keep the k highest-scoring attributes

In [26]:
# the scaler returns a numpy array; rebuild a dataframe with the original column names and index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)

In [27]:
from sklearn.feature_selection import SelectKBest, f_regression

Initialize the f_selector object, defining the scoring method.

In [28]:
f_selector = SelectKBest(f_regression, k=13)

Fit the object to our X and y data (train!)
This will score, rank and ID the top k features

In [31]:
f_selector.fit(X_train_scaled, y_train.G3)

SelectKBest(k=13, score_func=<function f_regression at 0x7fdbae222680>)

Transform our dataset to reduce to the K best features.

In [34]:
X_train_reduced = f_selector.transform(X_train_scaled)

print(X_train_reduced.shape)
print(X_train.shape)

(221, 13)
(221, 41)
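
Besides the reduced array, a fitted selector also exposes the score it gave each feature, which helps when judging how close the cutoff was. The sketch below runs on synthetic data (`X_toy`, `y_toy` are made up) and uses the `scores_`, `pvalues_`, and `get_support()` attributes of a fitted `SelectKBest`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Made-up data: only column 'a' is related to the target.
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y_toy = 4 * X_toy['a'] + rng.normal(size=200)

selector = SelectKBest(f_regression, k=1).fit(X_toy, y_toy)

# One row per feature: its F score, p-value, and whether it made the cut.
scores = pd.DataFrame({
    'feature': X_toy.columns,
    'f_score': selector.scores_,
    'p_value': selector.pvalues_,
    'selected': selector.get_support(),
}).sort_values('f_score', ascending=False)
print(scores)
```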


In [35]:
f_support = f_selector.get_support()
print(f_support)

[ True  True  True  True  True  True False False False False False False
 False  True  True False  True False False False False  True False False
 False False False False False False  True False  True False False False
 False False  True False False]


In [38]:
#using iloc, the df will filter out all the index locations where mask is false
#the : before the comma is for rows (so if we wanted to filter rows
# we could say like 10:20), and after the comma is for columns.

In [40]:
X_reduced_scaled = X_train_scaled.iloc[:,f_support]

In [36]:
f_feature = X_train_scaled.iloc[:,f_support].columns.tolist()

In [42]:
#This new dataframe is ready for modeling!

In [41]:
X_reduced_scaled.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,G1,G2,sex_M,Mjob_other,reason_reputation,guardian_other,higher_yes
142,0.0,1.0,1.0,0.0,0.666667,0.0,0.357143,0.578947,0.0,0.0,0.0,0.0,1.0
326,0.333333,0.75,0.75,0.0,0.0,0.0,0.714286,0.789474,1.0,1.0,1.0,0.0,1.0
88,0.166667,0.5,0.5,0.333333,0.333333,0.333333,0.5,0.526316,1.0,0.0,1.0,0.0,1.0
118,0.333333,0.25,0.75,0.666667,0.333333,0.333333,0.357143,0.368421,1.0,1.0,0.0,0.0,1.0
312,0.666667,0.25,0.5,0.0,0.333333,0.333333,0.642857,0.578947,1.0,1.0,0.0,1.0,1.0


In [37]:

f_feature

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'G1',
 'G2',
 'sex_M',
 'Mjob_other',
 'reason_reputation',
 'guardian_other',
 'higher_yes']

In [43]:
#We could run through it again with a different k value, and select those best features.
#We can then run the different dataframes through models, and select the best model 

**Recursive Feature Elimination, RFE**

Wrapper method

- recursively build model after model with fewer and fewer features: fit the model on all the features, drop the least important one (for a linear model, the one with the smallest coefficient magnitude), and repeat until only the number of features we asked for remains. Those are the features we will keep.
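
The recursive part can be sketched by hand in numpy. This is not scikit-learn's implementation, only a rough illustration that assumes features are ranked by the absolute size of their linear-regression coefficients and removed one at a time; `rfe_sketch` and the data are made up.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit ordinary least squares (with intercept) on the remaining columns.
        A = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Skip the intercept, find the weakest feature, and eliminate it.
        weakest = int(np.argmin(np.abs(coefs[1:])))
        remaining.pop(weakest)
    return remaining

# Made-up data: only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = 5 * X_toy[:, 0] + 2 * X_toy[:, 2] + rng.normal(0, 0.1, 200)

print(rfe_sketch(X_toy, y_toy, n_keep=2))
```

Because the ranking relies on coefficient size, the features need to be on comparable scales for the comparison to be fair.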

In [44]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

Initialize the linear regression object

In [45]:
lm = LinearRegression()

Initialize the RFE object, setting the hyperparameters to be our linear model above (lm), and the number of features we want returned.

In [46]:
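# pass the number of features by keyword; newer scikit-learn versions require it
rfe = RFE(lm, n_features_to_select=13)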

In [49]:
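# fit() returns the fitted RFE object itself, so X_rfe here is the selector;
# transform() then returns the reduced array.
X_rfe = rfe.fit(X_train_scaled, y_train)
X_rfe.transform(X_train_scaled)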

  y = column_or_1d(y, warn=True)


array([[0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        1.        ],
       [0.33333333, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [0.16666667, 0.33333333, 0.33333333, ..., 0.        , 0.        ,
        1.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.16666667, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [0.16666667, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [51]:
X_rfe = rfe.fit_transform(X_train_scaled, y_train)

  y = column_or_1d(y, warn=True)


In [52]:
X_rfe

array([[0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        1.        ],
       [0.33333333, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [0.16666667, 0.33333333, 0.33333333, ..., 0.        , 0.        ,
        1.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.16666667, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [0.16666667, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [54]:
mask = rfe.support_

In [56]:
X_reduced_scaled_rfe = X_train_scaled.iloc[:,mask]

In [57]:
# features selected from selectkbest
X_reduced_scaled.columns.tolist()

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'G1',
 'G2',
 'sex_M',
 'Mjob_other',
 'reason_reputation',
 'guardian_other',
 'higher_yes']

In [59]:
# features selected from rfe
X_reduced_scaled_rfe.columns.tolist()

['age',
 'traveltime',
 'failures',
 'famrel',
 'absences',
 'G1',
 'G2',
 'Mjob_health',
 'Mjob_other',
 'Mjob_services',
 'schoolsup_yes',
 'famsup_yes',
 'internet_yes']

In [65]:
from pydataset import data

In [66]:
df = data('tips')

In [67]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [68]:
df['tip_percentage'] = df.tip / df.total_bill

In [69]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


Create a column named price_per_person. This should be the total bill divided by the party size.

In [72]:
df['price_per_person'] = df.total_bill / df['size']

In [73]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,6.1475


Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

- total_bill, day, time, size

Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?

In [75]:
f_selector = SelectKBest(f_regression, k=2)


In [77]:
tips = df.drop(columns=['sex', 'smoker', 'day', 'time'])

In [78]:
tips.head()

Unnamed: 0,total_bill,tip,size,tip_percentage,price_per_person
1,16.99,1.01,2,0.059447,8.495
2,10.34,1.66,3,0.160542,3.446667
3,21.01,3.5,3,0.166587,7.003333
4,23.68,3.31,2,0.13978,11.84
5,24.59,3.61,4,0.146808,6.1475


In [79]:
train_and_validate, test = train_test_split(tips, train_size=0.8, random_state=123)
train, validate = train_test_split(train_and_validate, train_size=0.8, random_state=123)

In [80]:
train.shape, validate.shape, test.shape

((156, 5), (39, 5), (49, 5))

In [82]:
## Scale the data

In [81]:
scaler = MinMaxScaler(copy=True).fit(train)


In [83]:
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values])


In [84]:
## Set aside validate and test for 'out of sample data'

In [85]:
validate_scaled = pd.DataFrame(scaler.transform(validate), columns=validate.columns.values).set_index([validate.index.values])
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])

In [86]:
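# Note: X and y are taken from the unscaled train and validate frames;
# the scaled copies created above are not used in the steps below.
X_train = train.drop(columns='tip')
y_train = train[['tip']]

X_validate = validate.drop(columns='tip')
y_validate = validate[['tip']]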

In [87]:
## Select K best to find the most relevant features

In [88]:
f_selector.fit(X_train, y_train)


  y = column_or_1d(y, warn=True)


SelectKBest(k=2, score_func=<function f_regression at 0x7fdbae222680>)

In [89]:
##Transforming dataset to reduce to the best features

In [90]:
X_reduced = f_selector.transform(X_train)

print(X_train.shape)
print(X_reduced.shape)

(156, 4)
(156, 2)


In [91]:
f_support = f_selector.get_support()

print(f_support) 

[ True  True False False]


In [92]:
f_feature = X_train.loc[:,f_support].columns.tolist()

print(str(len(f_feature)), 'selected features')
print(f_feature)

2 selected features
['total_bill', 'size']


In [93]:
#Recursive Feature Elimination

- Initialize the linear regression object

In [94]:
lm = LinearRegression()


- Initialize the RFE object

In [95]:
rfe = RFE(lm, n_features_to_select=2)


In [96]:
#Fit and transform our data to include only two features

In [97]:
X_rfe = rfe.fit_transform(X_train,y_train)  


  y = column_or_1d(y, warn=True)


In [98]:
#Get a list of the features that remain

In [99]:
mask = rfe.support_


In [100]:
rfe_features = X_train.loc[:,mask].columns.tolist()


In [101]:
print(str(len(rfe_features)), 'selected features')
print(rfe_features)

2 selected features
['size', 'tip_percentage']


In [102]:
var_ranks = rfe.ranking_
var_names = X_train.columns.tolist()

pd.DataFrame({'Var': var_names, 'Rank': var_ranks})

Unnamed: 0,Var,Rank
0,total_bill,2
1,size,1
2,tip_percentage,1
3,price_per_person,3


In [103]:
### With K Best, total_bill and party size were the most relevant features.
### With RFE, tip_percentage and size were the most relevant.
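
One possible reason the two methods disagree: RFE with a linear model ranks features by the size of their coefficients, and a coefficient's size depends on the feature's units. The X_train used here is unscaled, so a feature stored as a small fraction can end up with a large coefficient. The sketch below shows the effect on made-up data; `slope` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Made-up data: the same quantity expressed as a fraction and as a percent.
fraction = rng.uniform(0.05, 0.30, n)
percent = fraction * 100
target = 10 * fraction + rng.normal(0, 0.1, n)

def slope(x, y):
    """Least-squares slope of y on a single feature x (with intercept)."""
    A = np.column_stack([np.ones(len(x)), x])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coefs[1]

print(slope(fraction, target))  # coefficient when the feature is a fraction
print(slope(percent, target))   # about 100 times smaller for the same information
```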

Run the process again using tip percentage as the target variable.

In [104]:
train.head()

Unnamed: 0,total_bill,tip,size,tip_percentage,price_per_person
126,29.8,4.2,6,0.14094,4.966667
189,18.15,3.5,3,0.192837,6.05
84,32.68,5.0,2,0.152999,16.34
241,27.18,2.0,2,0.073584,13.59
40,31.27,5.0,3,0.159898,10.423333


In [106]:
X_train = train.drop(columns='tip_percentage')
y_train = train[['tip_percentage']]

X_validate = validate.drop(columns='tip_percentage')
y_validate = validate[['tip_percentage']]

Create the f_selector object

In [107]:
f_selector = SelectKBest(f_regression, k=2)


Fit the selector to the training data

In [108]:
f_selector.fit(X_train, y_train)


  y = column_or_1d(y, warn=True)


SelectKBest(k=2, score_func=<function f_regression at 0x7fdbae222680>)

In [109]:
## Transform our dataset to reduce to the K Best features

In [110]:
X_reduced = f_selector.transform(X_train)

print(X_train.shape)
print(X_reduced.shape)

(156, 4)
(156, 2)


In [111]:
## Find our list of features

In [112]:
f_support = f_selector.get_support()

print(f_support) 

[ True  True False False]


In [113]:
f_feature = X_train.loc[:,f_support].columns.tolist()

print(str(len(f_feature)), 'selected features')
print(f_feature)

2 selected features
['total_bill', 'tip']


In [114]:
## Recursive Feature Elimination

Initialize the linear regression object

In [115]:
lm = LinearRegression()


In [116]:
## Initialize the RFE object

In [117]:
rfe = RFE(lm, n_features_to_select=2)


In [118]:
## Fit and transform the data

In [119]:
X_rfe = rfe.fit_transform(X_train,y_train)  


  y = column_or_1d(y, warn=True)


In [120]:
mask = rfe.support_


In [121]:
rfe_features = X_train.loc[:,mask].columns.tolist()


In [122]:
print(str(len(rfe_features)), 'selected features')
print(rfe_features)

2 selected features
['tip', 'size']


In [123]:
var_ranks = rfe.ranking_
var_names = X_train.columns.tolist()

pd.DataFrame({'Var': var_names, 'Rank': var_ranks})

Unnamed: 0,Var,Rank
0,total_bill,3
1,tip,1
2,size,1
3,price_per_person,2


In [125]:
## Using K best the top two features are total bill and tip. 
## Using RFE the top two features are tip and size.


In [126]:
def select_kbest(X,y,num):
    '''Return the names of the num features chosen by SelectKBest with f_regression.'''
    f_selector = SelectKBest(f_regression, k=num)
    f_selector.fit(X, y)
    # boolean mask over the columns: True for each selected feature
    f_support = f_selector.get_support()
    f_feature = X.loc[:,f_support].columns.tolist()
    return f_feature

In [127]:
select_kbest(X_train, y_train, 2)

  y = column_or_1d(y, warn=True)


['total_bill', 'tip']
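
A matching helper for RFE could look like the sketch below. `select_rfe` is a hypothetical name chosen to mirror `select_kbest`, and the demo data is made up; the linear model as estimator is the same choice used earlier in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def select_rfe(X, y, num):
    """Return the names of the num features kept by RFE with a linear model."""
    rfe = RFE(LinearRegression(), n_features_to_select=num)
    rfe.fit(X, np.ravel(y))  # ravel so a one-column DataFrame becomes 1-D
    return X.columns[rfe.support_].tolist()

# Made-up data to exercise the helper: only 'w' and 'z' matter.
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(rng.normal(size=(150, 4)), columns=['w', 'x', 'y', 'z'])
y_demo = pd.DataFrame(
    {'target': 5 * X_demo['w'] - 3 * X_demo['z'] + rng.normal(0, 0.1, 150)}
)

print(select_rfe(X_demo, y_demo, 2))
```

Flattening y inside the helper also avoids the `column_or_1d` warning seen in the cells above.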