# Feature Engineering Exercises

1. Create tips dataset and add columns
- Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")


In [2]:
from pydataset import data
tips = data('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
tips['tip_percentage'] = tips.tip / tips.total_bill
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


- Create a column named price_per_person. This should be the total bill divided by the party size.

In [4]:
tips['price_per_person'] = (tips.total_bill / tips['size'])
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,6.1475


- Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

In [5]:
# convert object columns to encoded columns
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
enc.fit(tips[['sex','smoker','day','time']])
tips[['sex','smoker','day','time']] = enc.transform(tips[['sex','smoker','day','time']])

In [6]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,0.0,0.0,2.0,0.0,2,0.059447,8.495
2,10.34,1.66,1.0,0.0,2.0,0.0,3,0.160542,3.446667
3,21.01,3.5,1.0,0.0,2.0,0.0,3,0.166587,7.003333
4,23.68,3.31,1.0,0.0,2.0,0.0,2,0.13978,11.84
5,24.59,3.61,0.0,0.0,2.0,0.0,4,0.146808,6.1475


In [7]:
# My guess at the most important features for predicting tip amount: total_bill, time, day

- Use select k best and recursive feature elimination to select the top 2 features for predicting tip amount. What are they?

In [8]:
# split the data into train, validate and test
train, test = train_test_split(tips, test_size = 0.2, random_state = 123)
train, validate = train_test_split(train, test_size = 0.25, random_state = 123)
train.shape, validate.shape, test.shape

((146, 9), (49, 9), (49, 9))

In [9]:
# create X and y versions of each split: y is a Series holding only the target (tip), X holds the features.
# tip_percentage is dropped from X because it is computed from tip and would leak the target.

X_train = train.drop(columns=['tip', 'tip_percentage'])
y_train = train.tip

X_validate = validate.drop(columns=['tip', 'tip_percentage'])
y_validate = validate.tip

X_test = test.drop(columns=['tip','tip_percentage'])
y_test = test.tip

In [10]:
# create the scaler
scaler = sklearn.preprocessing.MinMaxScaler()

# fit the scaler on the training data only
scaler.fit(X_train)

MinMaxScaler()

In [11]:
# create X versions scaled
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

In [12]:
# wrap the scaled array in a DataFrame, keeping the original index and column names
X_train_scaled = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)

In [13]:
from sklearn.feature_selection import SelectKBest, f_regression

# parameters: f_regression stats test, give me 2 features
f_selector = SelectKBest(f_regression, k=2)

# score each X against y and keep the top 2
f_selector.fit(X_train_scaled, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

# get list of top K features. 
f_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

f_feature

['total_bill', 'size']

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# initialize the ML algorithm
lm = LinearRegression()

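# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with.
# n_features_to_select must be passed by keyword in recent scikit-learn versions.
rfe = RFE(lm, n_features_to_select=2)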

# fit the data using RFE
rfe.fit(X_train_scaled,y_train)  

# get the mask of the columns selected
feature_mask = rfe.support_

# get list of the column names. 
rfe_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

rfe_feature

['total_bill', 'price_per_person']

- Use select k best and recursive feature elimination to select the top 2 features for predicting tip percentage. What are they?

In [15]:
# switch the target to tip_percentage; the X versions stay the same as before.
y_train = train.tip_percentage
y_validate = validate.tip_percentage
y_test = test.tip_percentage

In [16]:
from sklearn.feature_selection import SelectKBest, f_regression

# parameters: f_regression stats test, give me 2 features
f_selector = SelectKBest(f_regression, k=2)

# score each X against y and keep the top 2
f_selector.fit(X_train_scaled, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

# get list of top K features. 
f_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

f_feature

['total_bill', 'price_per_person']

In [17]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# initialize the ML algorithm
lm = LinearRegression()

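# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with.
# n_features_to_select must be passed by keyword in recent scikit-learn versions.
rfe = RFE(lm, n_features_to_select=2)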

# fit the data using RFE
rfe.fit(X_train_scaled,y_train)  

# get the mask of the columns selected
feature_mask = rfe.support_

# get list of the column names. 
rfe_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

rfe_feature

['size', 'price_per_person']

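- Why do you think select k best and recursive feature elimination might give different answers for the top features?
    - They use different methods. Select k best scores each feature on its own with a univariate test (here, f_regression) against the target. Recursive feature elimination fits a model on all the features together and repeatedly drops the one with the smallest coefficient, so a feature's rank depends on the other features in the model.
- Does this change as you change the number of features you are selecting?
    - Yes. The selected set depends on k, so the two methods can agree at one value of k and disagree at another.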
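The difference between the two approaches can be illustrated with a small synthetic example. This is only a sketch: the data and the column names `noisy_sum`, `part_a`, and `part_b` are made up for illustration and are not part of the tips dataset. The target is built from `part_a` and `part_b`, while `noisy_sum` is a noisy copy of their sum. On its own `noisy_sum` correlates most strongly with the target, so a univariate test tends to favor it. In a model that already contains `part_a` and `part_b` it adds almost nothing, so RFE tends to drop it.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the tips dataset)
rng = np.random.default_rng(0)
n = 500
part_a = rng.normal(size=n)
part_b = rng.normal(size=n)

X_demo = pd.DataFrame({
    "noisy_sum": part_a + part_b + rng.normal(scale=0.5, size=n),
    "part_a": part_a,
    "part_b": part_b,
})
# The target depends only on part_a and part_b, plus a little noise
y_demo = part_a + part_b + rng.normal(scale=0.1, size=n)

# Univariate selection: each column is scored against y on its own
kbest = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
kbest_features = X_demo.columns[kbest.get_support()].tolist()

# Recursive elimination: fit on all columns, drop the smallest coefficient, repeat
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_features = X_demo.columns[rfe_demo.support_].tolist()

print("SelectKBest:", kbest_features)
print("RFE:        ", rfe_features)
```

Comparing the two printed lists shows how a feature that looks strong in isolation can be redundant once the other features are in the model.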

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [19]:
def select_kbest(X, y, k):
    # parameters: f_regression stats test, give me k features
    f_selector = SelectKBest(f_regression, k=k)

    # score each X against y and keep the top k
    f_selector.fit(X, y)

    # boolean mask of whether the column was selected or not. 
    feature_mask = f_selector.get_support()

    # get list of top K features. 
    f_feature = X.iloc[:,feature_mask].columns.tolist()

    return f_feature

In [20]:
select_kbest(X_train_scaled, y_train, 2)

['total_bill', 'price_per_person']

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [23]:
def rfe(X, y, k):
    # initialize the ML algorithm
    lm = LinearRegression()

    # create the RFE selector, indicating the ML object (lm) and the number of features to keep.
    # named selector so it does not shadow this function's own name
    selector = RFE(lm, n_features_to_select=k)

    # fit the data using RFE
    selector.fit(X, y)

    # get the mask of the columns selected
    feature_mask = selector.support_

    # get list of the column names.
    rfe_feature = X.iloc[:, feature_mask].columns.tolist()

    return rfe_feature
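Besides the boolean mask in `support_`, a fitted `RFE` object also exposes `ranking_`, which records the order in which features were eliminated (selected features get rank 1). The sketch below uses a hypothetical helper, `rfe_ranking`, and synthetic columns `a` through `d` that are assumptions made for illustration. It shows one way to inspect the full ordering rather than just the top k.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def rfe_ranking(X, y, k):
    """Hypothetical helper: return the RFE rank of every column (1 = selected)."""
    selector = RFE(LinearRegression(), n_features_to_select=k)
    selector.fit(X, y)
    return pd.Series(selector.ranking_, index=X.columns).sort_values()

# Synthetic data (an assumption for illustration): a and b matter most, c a little, d not at all
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + 0.5 * X_demo["c"] + rng.normal(scale=0.1, size=200)

ranking = rfe_ranking(X_demo, y_demo, 2)
print(ranking)
```

Columns eliminated earlier receive larger rank values, so the ranking also shows which feature would be added next if k were increased by one.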

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).
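One possible way to organize this exercise is to wrap the split, scale, and select steps into a single helper. The sketch below is not a solution for the swiss data. The helper name `top_features`, its parameters, and the synthetic stand-in frame are all assumptions made so the example runs on its own. With the real data, the frame would presumably be loaded with `data('swiss')` from pydataset, as was done for tips, and the target would be `'Fertility'`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def top_features(df, target, k, random_state=123):
    """Hypothetical helper: scale the training predictors, then run both selectors."""
    train, _ = train_test_split(df, test_size=0.2, random_state=random_state)
    X_train = train.drop(columns=[target])
    y_train = train[target]

    # fit the scaler on train only, and keep the column names for reporting
    scaler = MinMaxScaler().fit(X_train)
    X_scaled = pd.DataFrame(scaler.transform(X_train),
                            index=X_train.index, columns=X_train.columns)

    kbest = SelectKBest(f_regression, k=k).fit(X_scaled, y_train)
    rfe_sel = RFE(LinearRegression(), n_features_to_select=k).fit(X_scaled, y_train)
    return {
        "select_kbest": X_scaled.columns[kbest.get_support()].tolist(),
        "rfe": X_scaled.columns[rfe_sel.support_].tolist(),
    }

# Synthetic stand-in frame (NOT the swiss data): x1, x2, x3 drive the target
rng = np.random.default_rng(7)
demo = pd.DataFrame(rng.normal(size=(500, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
demo["target"] = 4 * demo["x1"] + 3 * demo["x2"] + 3 * demo["x3"] + rng.normal(scale=0.1, size=500)

result = top_features(demo, "target", 3)
print(result)
```

Returning both lists side by side makes it easy to see where the two methods agree and where they differ for a given k.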