## Feature Engineering

In [1]:
import pandas as pd
import numpy as np
import pydataset
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### 1. Load the tips dataset.



In [3]:
tips = pydataset.data('tips')
tips.head(1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2


#### a. Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [4]:
tips['tip_percentage'] = tips.tip / tips.total_bill
tips.head(1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447


#### b. Create a column named price_per_person. This should be the total bill divided by the party size.


In [19]:
tips = tips.rename(columns={'size': 'party_size'}) 
tips['price_per_person'] = tips.total_bill / tips.party_size
tips.head(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,party_size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,6.1475


#### c. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?


* total_bill, day, time

#### d. Use select k best and recursive feature elimination to select the top 2 features for predicting tip amount. What are they?


In [23]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   party_size        244 non-null    int64  
 7   tip_percentage    244 non-null    float64
 8   price_per_person  244 non-null    float64
dtypes: float64(4), int64(1), object(4)
memory usage: 27.2+ KB


In [26]:
tips.day.value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

In [28]:
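# Predictors: drop the target and the categorical (object) columns.
# Note: tip_percentage is computed from tip, so keeping it in X leaks target information.
X = tips.drop(columns=['tip', 'sex', 'smoker', 'day', 'time'])
y = tips.tip

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
# Fit the scaler on the training split only, then apply the same transform to the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)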

In [31]:
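# Score each feature against the target with f_regression and keep the top 2.
kbest = SelectKBest(f_regression, k=2)
kbest.fit(X_train_scaled, y_train)
# get_support() returns a boolean mask over the columns, in their original order.
X_train.columns[kbest.get_support()]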

Index(['total_bill', 'party_size'], dtype='object')
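
The cell above covers only select k best. A minimal sketch of the recursive feature elimination half follows, assuming a `LinearRegression` estimator (the notebook imports it but does not show which estimator was used). The data is a synthetic stand-in with hypothetical column names, not the tips data, so that the block runs on its own; in the notebook one would pass `X_train_scaled` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; NOT the tips dataset).
rng = np.random.default_rng(123)
X_demo = pd.DataFrame({
    'feature_a': rng.normal(size=200),
    'feature_b': rng.normal(size=200),
    'feature_c': rng.normal(size=200),
})
# The target depends strongly on feature_a, weakly on feature_b, not on feature_c.
y_demo = 4 * X_demo['feature_a'] + 1.5 * X_demo['feature_b'] + rng.normal(scale=0.5, size=200)

# Scale so that coefficient magnitudes are comparable across features.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# RFE repeatedly fits the estimator and drops the weakest feature until 2 remain.
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X_demo_scaled, y_demo)

rfe_selected = X_demo.columns[rfe_demo.support_]
rfe_ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns)
print(rfe_selected.tolist())
print(rfe_ranking)
```

`ranking_` assigns 1 to every selected feature; larger values mark features that were eliminated earlier.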

#### e. Use select k best and recursive feature elimination to select the top 2 features for predicting tip percentage. What are they?
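
When the target changes to tip_percentage, the predictor set has to change as well: tip_percentage is computed as tip / total_bill, so tip should not stay among the predictors. The sketch below shows one way to build X and y for this part. The helper name `make_xy` and the tiny frame (made-up values under the notebook's column names) are hypothetical, for illustration only.

```python
import pandas as pd

def make_xy(df, target, leak_cols=()):
    """Split df into numeric predictors X and the target y.

    leak_cols lists columns that would reveal the target
    (e.g. tip when predicting tip_percentage); they are dropped
    from X along with the target itself.
    """
    X = df.drop(columns=[target, *leak_cols]).select_dtypes(include='number')
    y = df[target]
    return X, y

# Tiny illustrative frame; the values are made up, not taken from the tips data.
demo = pd.DataFrame({
    'total_bill': [20.0, 10.0, 30.0, 40.0],
    'tip': [3.0, 2.0, 4.5, 4.0],
    'day': ['Sun', 'Sat', 'Sun', 'Fri'],
    'party_size': [2, 1, 3, 4],
})
demo['tip_percentage'] = demo.tip / demo.total_bill
demo['price_per_person'] = demo.total_bill / demo.party_size

X_pct, y_pct = make_xy(demo, target='tip_percentage', leak_cols=['tip'])
print(X_pct.columns.tolist())
```

The resulting X can then go through the same split, scale, and select steps used for tip amount.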


#### f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?
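
One common reason for disagreement: select k best scores each feature against the target on its own, while RFE ranks features by their coefficients in a model fit on all remaining features together. A redundant feature can therefore score well on its own yet be dropped by RFE. The sketch below illustrates this on synthetic data. All column names and coefficients are assumptions chosen to make the effect visible; it says nothing about the tips results.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X_syn = pd.DataFrame({
    'signal': signal,
    # Near-copy of `signal`: highly correlated with y, adds little new information.
    'signal_copy': signal + 0.3 * rng.normal(size=n),
    # Independent feature with a smaller but genuine effect on y.
    'independent': rng.normal(size=n),
})
y_syn = 3 * X_syn['signal'] + X_syn['independent'] + 0.5 * rng.normal(size=n)

X_syn_scaled = StandardScaler().fit_transform(X_syn)

# Univariate scoring versus model-based elimination on the same data.
kbest_syn = SelectKBest(f_regression, k=2).fit(X_syn_scaled, y_syn)
rfe_syn = RFE(LinearRegression(), n_features_to_select=2).fit(X_syn_scaled, y_syn)

kbest_features = X_syn.columns[kbest_syn.get_support()].tolist()
rfe_features = X_syn.columns[rfe_syn.support_].tolist()
print('SelectKBest:', kbest_features)
print('RFE:        ', rfe_features)
```

In this setup select k best is expected to keep both correlated columns, while RFE keeps the independent one. The disagreement also depends on the number of features selected: when that number equals the total number of features, both methods necessarily return the same set.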

### 2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.
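
A minimal sketch of such a function, assuming X is a DataFrame of numeric predictors and that `f_regression` is the scoring function (as in the manual cell earlier). The check at the bottom uses synthetic data with hypothetical columns, not the tips data, so the block runs without `pydataset`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

def select_kbest(X, y, k):
    """Return the names of the k features with the highest f_regression scores.

    X is expected to be a DataFrame of numeric predictors, y the target.
    """
    selector = SelectKBest(f_regression, k=k)
    selector.fit(X, y)
    # get_support() is a boolean mask aligned with X.columns.
    return X.columns[selector.get_support()].tolist()

# Quick check on synthetic data (hypothetical columns, not the tips data).
rng = np.random.default_rng(42)
X_check = pd.DataFrame(rng.normal(size=(300, 4)), columns=['a', 'b', 'c', 'd'])
# Only columns a and c influence the target.
y_check = 5 * X_check['a'] - 3 * X_check['c'] + rng.normal(scale=0.5, size=300)
kbest_check = select_kbest(X_check, y_check, k=2)
print(kbest_check)
```

In the notebook, the corresponding call would be `select_kbest(X_train, y_train, 2)`. The F-statistic from `f_regression` does not change when a feature is rescaled, so the function can take the unscaled frame and still report column names.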


### 3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.
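
A minimal sketch of such a function, assuming a `LinearRegression` estimator and scaling inside the function; both are choices made here for illustration, not something the exercise prescribes. The check uses synthetic data with hypothetical columns rather than the tips data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def rfe(X, y, k):
    """Return the names of the k features kept by RFE with a linear regression.

    Features are standardized first so that coefficient magnitudes,
    which RFE uses to rank features, are comparable across columns.
    """
    X_scaled = StandardScaler().fit_transform(X)
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k)
    selector.fit(X_scaled, y)
    # support_ is a boolean mask aligned with X.columns.
    return X.columns[selector.support_].tolist()

# Quick check on synthetic data (hypothetical columns, not the tips data).
rng = np.random.default_rng(7)
X_check = pd.DataFrame(rng.normal(size=(300, 4)), columns=['a', 'b', 'c', 'd'])
# Only columns a and c influence the target.
y_check = 5 * X_check['a'] - 3 * X_check['c'] + rng.normal(scale=0.5, size=300)
rfe_check = rfe(X_check, y_check, k=2)
print(rfe_check)
```

Because the function scales internally, it should be given the training split only, for example `rfe(X_train, y_train, 2)` in the notebook, so that no test-set statistics enter the scaler.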


### 4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).