# Feature Engineering Exercises

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression

from wrangle import scale_data

1. Load the tips dataset.

In [2]:
tips = sns.load_dataset('tips')


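# Skip the train/validate/test split for this exercise: scale_data is called with
# three dataframes and returns three, so pass the full dataset each time and keep one result.

tips_scaled, _, _ = scale_data(tips, tips, tips)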

tips_scaled.head()

Unnamed: 0,total_bill,tip,size
0,0.291579,0.001111,0.2
1,0.152283,0.073333,0.4
2,0.375786,0.277778,0.4
3,0.431713,0.256667,0.2
4,0.450775,0.29,0.6


a. Create a column named price_per_person. This should be the total bill divided by the party size.   

In [9]:
tips['price_per_person'] = tips.total_bill / tips['size']
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
0,16.99,1.01,Female,No,Sun,Dinner,2,8.495
1,10.34,1.66,Male,No,Sun,Dinner,3,3.446667
2,21.01,3.5,Male,No,Sun,Dinner,3,7.003333
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84
4,24.59,3.61,Female,No,Sun,Dinner,4,6.1475


b. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?  

- total_bill, size, time, and day, in order of expected importance.

c. Use select k best to select the top 2 features for predicting tip amount. What are they?

d. Use recursive feature elimination to select the top 2 features for tip amount. What are they?

In [14]:
X_tips = tips[['total_bill','size','price_per_person']]
y_tips = tips.tip

f_select = SelectKBest(f_regression, k=2)

f_select.fit(X_tips, y_tips)

feature_mask = f_select.get_support()

f_feature = X_tips.iloc[:,feature_mask].columns.tolist()
f_feature

['total_bill', 'size']
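
The cell above only covers part c. A minimal sketch of part d follows, assuming a plain `LinearRegression` as the estimator (the exercise does not name one). So that it runs without downloading anything, it uses a synthetic stand-in for the three tips columns; `make_demo_tips` is a hypothetical helper, and its values are not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def make_demo_tips(n=200, seed=0):
    """Hypothetical stand-in for the tips columns (synthetic, not real data)."""
    rng = np.random.default_rng(seed)
    size = rng.integers(1, 7, n)
    total_bill = 8 * size + rng.normal(0, 4, n) + 5
    X = pd.DataFrame({
        "total_bill": total_bill,
        "size": size,
        "price_per_person": total_bill / size,
    })
    y = pd.Series(0.15 * total_bill + rng.normal(0, 0.5, n), name="tip")
    return X, y


X_demo, y_demo = make_demo_tips()

# Assumption: a linear regression is the model RFE uses to rank features.
rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns, like get_support() in part c.
rfe_features = X_demo.columns[rfe_selector.support_].tolist()
print(rfe_features)
```

With the real data, swap `X_demo` and `y_demo` for `X_tips` and `y_tips`. RFE ranks features by coefficient magnitude, so results on unscaled columns can differ from results on scaled ones.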

e. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.
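
One possible shape for `select_kbest`, wrapping the same steps as the manual cell earlier. This is a sketch, not the only valid answer: it assumes `X` is a pandas DataFrame and uses `f_regression` as the scoring function, as the manual cell does. The demo data is synthetic and exists only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def select_kbest(X, y, k):
    """Return the names of the k columns of X with the highest f_regression scores."""
    selector = SelectKBest(score_func=f_regression, k=k)
    selector.fit(X, y)
    # get_support() is a boolean mask aligned with X's columns.
    return X.columns[selector.get_support()].tolist()


# Synthetic demo data (an assumption for illustration, not the tips dataset):
# y depends on "a" and "b" only; "noise" is unrelated to y.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "noise"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(0, 0.1, 300)

kbest_features = select_kbest(X_demo, y_demo, k=2)
print(kbest_features)
```

To test against the exercise data, call `select_kbest(X_tips, y_tips, 2)` and compare the result with the manual output.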

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.
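
A comparable sketch for `rfe`. The estimator is an assumption (the exercise does not specify one); `LinearRegression` is used here, and `X` is assumed to be a pandas DataFrame. The demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rfe(X, y, k):
    """Return the names of the k features kept by recursive feature elimination."""
    # Assumption: rank features with an ordinary linear regression.
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k)
    selector.fit(X, y)
    return X.columns[selector.support_].tolist()


# Synthetic demo data (illustrative only): "noise" has no effect on y.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "noise"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(0, 0.1, 300)

rfe_top = rfe(X_demo, y_demo, k=2)
print(rfe_top)
```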

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

# Appendix Exercises

Our scenario continues:
As a customer analyst, I want to know who has spent the most money with us over their lifetime. I have monthly charges and tenure, so I think I will be able to use those two attributes as features to estimate total_charges. I need the estimate to be within an average error of $5.00 per customer.

1. Write a function, select_kbest_freg(), that takes X_train, y_train, and k as input (X_train and y_train should not be scaled!) and returns a list of the top k features.

2. Write a function, select_kbest_freg(), that takes X_train, y_train (scaled), and k as input and returns a list of the top k features.

3. Write a function, ols_backward_elimination(), that takes X_train and y_train (scaled) as input and returns the features selected by the OLS backward elimination method.

4. Write a function, lasso_cv_coef() that takes X_train and y_train as input and returns the coefficients for each feature, along with a plot of the features and their weights.
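
One way `lasso_cv_coef` might look, assuming `X_train` is a pandas DataFrame. The 5-fold setting, the horizontal bar chart, and returning the axes alongside the coefficients are illustrative choices, not requirements from the exercise. The demo data is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV


def lasso_cv_coef(X_train, y_train):
    """Fit LassoCV and return (coefficients as a Series, bar-chart axes)."""
    model = LassoCV(cv=5).fit(X_train, y_train)  # cv=5 is an assumed setting
    coef = pd.Series(model.coef_, index=X_train.columns).sort_values()
    ax = coef.plot(kind="barh", title="LassoCV coefficients")
    ax.set_xlabel("weight")
    return coef, ax


# Synthetic demo data (illustrative only): "noise" has no effect on y.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "noise"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(0, 0.1, 300)

coef, ax = lasso_cv_coef(X_demo, y_demo)
print(coef)
```

Features whose coefficient is shrunk to zero are the ones the lasso penalty would drop.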

5. Write three functions for recursive feature elimination: the first computes the optimum number of features (n) using RFE, the second takes n as input and returns the top n features, and the third takes the list of the top n features as input and returns new X_train and X_test dataframes containing only those features. Then write a function, recursive_feature_elimination(), that computes the optimum number of features (n) and returns the top n features.
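
A sketch of the three helpers described in item 5. The function names are hypothetical, the estimator is assumed to be `LinearRegression`, and the optimum n is found with scikit-learn's cross-validated `RFECV` rather than a hand-written loop over candidate values of n. The demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def optimal_number_of_features(X_train, y_train):
    """Pick n by cross-validated RFE (scored with the estimator's R^2)."""
    selector = RFECV(estimator=LinearRegression(), cv=5)
    selector.fit(X_train, y_train)
    return int(selector.n_features_)


def top_n_features(X_train, y_train, n):
    """Return the names of the n features kept by RFE."""
    selector = RFE(estimator=LinearRegression(), n_features_to_select=n)
    selector.fit(X_train, y_train)
    return X_train.columns[selector.support_].tolist()


def keep_top_features(X_train, X_test, features):
    """Return X_train and X_test restricted to the selected columns."""
    return X_train[features], X_test[features]


# Synthetic demo data (illustrative only): y depends on "a" and "b".
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "noise"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(0, 0.1, 300)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

n = optimal_number_of_features(X_train, y_train)
features = top_n_features(X_train, y_train, n)
X_train_top, X_test_top = keep_top_features(X_train, X_test, features)
print(n, features)
```

A single `recursive_feature_elimination()` wrapper could call the first two helpers in sequence and return `features`.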