# Feature Engineering Exercises

__Step 1__

Load the tips dataset:
* Create a column named tip_percentage. This should be the tip amount divided by the total bill.
* Create a column named price_per_person. This should be the total bill divided by the party size.
* Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?
* Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?
* Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?
* Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features your are selecting?

In [1]:
import numpy as np
import pandas as pd
from pydataset import data

In [2]:
tips = data('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
#Create tip percentage
tips['tip_percentage'] = tips.tip / tips.total_bill

#Create price per person
tips['price_per_person'] = tips.total_bill / tips.size

I think that the most important features for predicting the tip amount would be total_bill and size.

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.preprocessing import LabelEncoder, StandardScaler
import wrangle

In [5]:
#Convert categorical columns to numeric values
for column in tips.columns:
    if tips[column].dtype == np.number:
        continue
    tips[column] = LabelEncoder().fit_transform(tips[column])

  if tips[column].dtype == np.number:


__Finding best features for predicting 'tip'__

In [6]:
#Now split data into train, validate, test sets
train, validate, test = wrangle.train_validate_test_split(tips)

#Split the data into X and y groups
X_train, y_train = train.drop(columns = ['tip', 'tip_percentage']), train.tip
X_validate, y_validate = validate.drop(columns = ['tip', 'tip_percentage']), validate.tip
X_test, y_test = test.drop(columns = ['tip', 'tip_percentage']), test.tip

#Scale the data
scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)
validate_scaled = scaler.transform(X_validate)
test_scaled = scaler.transform(X_test)

In [11]:
#Select K Best
kbest = SelectKBest(f_regression, k=2)
kbest.fit(train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f8b2b26ce50>)

In [12]:
kbest.get_support()

array([ True, False, False, False, False, False,  True])

In [14]:
#What columns are they?
X_train.columns[kbest.get_support()]

Index(['total_bill', 'price_per_person'], dtype='object')

In [16]:
#RFE
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(train_scaled, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [17]:
rfe.get_support()

array([ True, False, False, False, False, False,  True])

In [18]:
X_train.columns[rfe.get_support()]

Index(['total_bill', 'price_per_person'], dtype='object')

The top two features for predicting tip amount are 'total_bill' and 'price_per_person'. Both SelectKBest and RFE agree.

__Finding best features for predicting 'tip_percentage'__

In [19]:
#Change the y variables
y_train, y_validate, y_test = train.tip_percentage, validate.tip_percentage, test.tip_percentage

In [20]:
#Select K Best
kbest = SelectKBest(f_regression, k=2)
kbest.fit(train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f8b2b26ce50>)

In [21]:
kbest.get_support()

array([ True, False, False, False, False, False,  True])

In [22]:
X_train.columns[kbest.get_support()]

Index(['total_bill', 'price_per_person'], dtype='object')

In [26]:
#RFE
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(train_scaled, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [27]:
rfe.get_support()

array([ True, False, False, False, False, False,  True])

In [37]:
X_train.columns[rfe.get_support()].tolist()

['total_bill', 'price_per_person']

The best two features for predicting 'tip_percentage' are 'total_bill' and 'price_per_person'. Both SelectKBest and RFE agree.

I think SelectKBest and RFE could give different results is because SelectKBest evaluates each feature individually, but RFE evaluates the features together and removes the worst performing features one at a time. RFE could find a relationship between multiple variables that helps predict better than just using one individually.

__Step 2__

Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [78]:
def select_kbest(X_train, y_train, k):
    #Will need to scale the X_train data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    
    kbest = SelectKBest(f_regression, k=k)
    kbest.fit(X_train_scaled, y_train)
    return X_train.columns[kbest.get_support()].tolist()

In [79]:
#Reset the splits
#Now split data into train, validate, test sets
train, validate, test = wrangle.train_validate_test_split(tips)

#Split the data into X and y groups
X_train, y_train = train.drop(columns = ['tip', 'tip_percentage']), train.tip
X_validate, y_validate = validate.drop(columns = ['tip', 'tip_percentage']), validate.tip
X_test, y_test = test.drop(columns = ['tip', 'tip_percentage']), test.tip

In [80]:
#Testing
select_kbest(X_train, y_train, 2)

['total_bill', 'price_per_person']

__Step 3__

Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [76]:
def rfe(X_train, y_train, n):
    #Scale the X_train data
    scaler = StandardScaler()
    scaler.fit(X_train, y_train)
    X_train_scaled = scaler.transform(X_train)
    
    rfe = RFE(LinearRegression(), n_features_to_select=n)
    rfe.fit(X_train_scaled, y_train)
    return X_train.columns[rfe.get_support()].tolist()

In [77]:
#Testing
rfe(X_train, y_train, 2)

['total_bill', 'price_per_person']

__Step 4__

Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [82]:
swiss = data('swiss')
swiss.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [83]:
#Now split data into train, validate, test sets
train, validate, test = wrangle.train_validate_test_split(swiss)

#Split the data into X and y groups
X_train, y_train = train.drop(columns = ['Fertility']), train.Fertility
X_validate, y_validate = validate.drop(columns = ['Fertility']), validate.Fertility
X_test, y_test = test.drop(columns = ['Fertility']), test.Fertility

In [84]:
#Find best features using SelectKBest
select_kbest(X_train, y_train, 3)

['Examination', 'Catholic', 'Infant.Mortality']

In [85]:
#Find best features using RFE
rfe(X_train, y_train, 3)

['Examination', 'Catholic', 'Infant.Mortality']