 
# Feature Engineering

In [29]:
#General libraries
import pandas as pd
import numpy as np

#Disable Warnings
import warnings
warnings.filterwarnings("ignore")

#Import Dataset
from pydataset import data

#sklearn imports
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Tips Dataset

<tr>
    <td> <img src="Photos/pexels-karolina-grabowska-4386321.jpg"/> </td>
</tr>

[Source Photo 1](https://www.pexels.com/photo/crop-anonymous-person-calculating-profit-on-smartphone-calculator-near-banknotes-4386321/)

### Load Data

We will utilize the <code>tips</code> dataset.

In [30]:
#Load the dataset
tips = data("tips")
#Show head of the DataFrame
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


### Tip Percentage

**Task**
    
Create a column named <code>tip_percentage</code>. This should be the tip amount divided by the total bill.

In [31]:
tips["tip_percentage"] = tips.tip / tips.total_bill
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


### Price Per Person

**Task**
    
Create a column named <code>price_per_person</code>. This should be the total bill divided by the party size.


In [32]:
tips['price_per_person'] = tips.total_bill / tips['size']
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,6.1475


### Predictions

**Question**
    
Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?


**Answer**
    
I believe `total_bill` will be most important in calculating the `tip` value because people usually tip based on a percentage of their total bill. Speaking personally, I always tip at the same rate, even if I don't like the service.

I also believe `total_bill` will be predictive of `tip_percentage`. People who have a smaller total bill may feel obligated to leave a larger percentage of tip, so that their tip is not tiny. Likewise, people who have a high total bill may feel more comfortable giving a lesser percentage, since a smaller tip percentage still means a large tip. 

### Select K Best and Recursive Feature Elimination

#### K Best for tip amount

**Task**
    
Use select k best and recursive feature elimination to select the top 2 features for predicting tip amount. What are they?


First, we will convert the `smoker` and `time` columns to integer values, since the feature selection methods used below require numeric input.
 - For the `smoker` column, `0` is non-smoking and `1` is smoking.
 - For the `time` column, `0` is lunch and `1` is dinner.

In [33]:
tips['smoker'] = (tips.smoker == 'Yes').astype(int)
tips['time'] = (tips.time == 'Dinner').astype(int)
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,0,Sun,1,2,0.059447,8.495
2,10.34,1.66,Male,0,Sun,1,3,0.160542,3.446667
3,21.01,3.5,Male,0,Sun,1,3,0.166587,7.003333
4,23.68,3.31,Male,0,Sun,1,2,0.13978,11.84
5,24.59,3.61,Female,0,Sun,1,4,0.146808,6.1475
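The `sex` and `day` columns are still strings and are left out of the feature sets used below. If we wanted to include them, one option would be one-hot encoding with `pd.get_dummies`. The following is a minimal sketch on a small hand-made frame (`demo` is a hypothetical stand-in that only borrows column names from `tips`); it is not part of the original workflow.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as the tips data
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "sex": ["Female", "Male", "Male", "Male"],
    "day": ["Sun", "Sun", "Sat", "Thur"],
})

# One-hot encode the string columns.
# drop_first=True drops one level per column to avoid a redundant indicator.
encoded = pd.get_dummies(demo, columns=["sex", "day"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With more than two categories, as in `day`, this produces one indicator column per remaining level, so the feature count grows with the number of categories.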


Now we split the data into `train` and `test` sets, using `tip` as our target. This is a plain random split; no `stratify` argument is passed.

In [34]:
X = tips[['total_bill', 'size', 'smoker', 'time', 'tip_percentage', 'price_per_person']]
y = tips.tip

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
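`train_test_split` only stratifies when its `stratify` argument is passed, which the call above does not do. That argument expects discrete labels, so a continuous target such as `tip` generally cannot be used for it directly. One possible approach, sketched here on synthetic data, is to bin the target with `pd.qcut` and stratify on the bins. The frame `demo` and the choice of four quartile bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: tip is roughly 15% of the bill plus noise
rng = np.random.default_rng(123)
demo = pd.DataFrame({"total_bill": rng.uniform(5, 50, 100)})
demo["tip"] = demo.total_bill * 0.15 + rng.normal(0, 0.5, 100)

# Bin the continuous target into quartiles (labels 0..3)
tip_bins = pd.qcut(demo.tip, q=4, labels=False)

# Stratify on the bins so each quartile is represented proportionally
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["total_bill"]], demo.tip,
    test_size=0.2, random_state=123, stratify=tip_bins,
)

# Count how many test rows fall in each quartile
print(tip_bins.loc[X_te.index].value_counts().sort_index())
```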

Now we can scale our data using `StandardScaler`. The scaler is fit on the training data only and then applied to the test data.

In [35]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
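`StandardScaler` learns each column's mean and standard deviation from the training data and reuses those statistics when transforming the test data. A small sketch on made-up numbers (the arrays `train` and `test` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on different scales
train = np.array([[10.0, 2.0], [20.0, 4.0], [30.0, 6.0]])
test = np.array([[20.0, 2.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale it
test_scaled = scaler.transform(test)        # reuse the training mean/std

# Scaled training columns have mean 0 and (population) standard deviation 1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# The test row is shifted and scaled by the training statistics
print(test_scaled)
```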

We are ready to use Select K Best.

In [36]:
# Create model
kbest = SelectKBest(f_regression, k=2)
#Fit the model
kbest.fit(X_train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7fe1db220310>)

We can use this model to find the top two features most predictive of `tip` amount.

In [37]:
X_train.columns[kbest.get_support()]

Index(['total_bill', 'size'], dtype='object')
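Beyond the boolean mask from `get_support()`, a fitted `SelectKBest` also exposes the per-feature F-scores and p-values through its `scores_` and `pvalues_` attributes, which show how far apart the features actually are. The sketch below uses synthetic data; the column names `strong`, `weak`, and `noise` and the coefficients are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: y depends strongly on one column, weakly on another
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "strong": rng.normal(size=200),
    "weak": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = 3 * X_demo["strong"] + 1.5 * X_demo["weak"] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# Collect the univariate scores next to the selection mask
scores = pd.DataFrame(
    {
        "f_score": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    },
    index=X_demo.columns,
).sort_values("f_score", ascending=False)
print(scores)
```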

**Conclusion**

We conclude that `total_bill` and `size` are most predictive of `tip`.


#### Recursive Feature Elimination for tip amount

Let's see if we get a different result for the features most predictive of `tip` using the recursive feature elimination method.

In [38]:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)
X_train.columns[rfe.get_support()]

Index(['total_bill', 'tip_percentage'], dtype='object')
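A fitted `RFE` object also records the order in which features were eliminated in its `ranking_` attribute: selected features get rank 1, and higher ranks were dropped earlier. A minimal sketch on synthetic data (the columns `a` through `d` and their coefficients are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: coefficients shrink from a to c, and d is unrelated to y
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 4 * X_demo["a"] + 2 * X_demo["b"] + 1 * X_demo["c"] + rng.normal(scale=0.1, size=200)

selector = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

# Rank 1 = kept; larger ranks were eliminated earlier
ranking = pd.Series(selector.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```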

**Conclusion**

According to recursive feature elimination, `total_bill` and `tip_percentage` are most predictive of tip amount.

#### K Best for `tip_percentage`

**Task**
    
Use select k best and recursive feature elimination to select the top 2 features for predicting tip percentage. What are they?

Again, we split the data into `train` and `test` sets, this time using `tip_percentage` as the target. As before, the split is not stratified.

In [39]:
X = tips[['total_bill', 'size', 'smoker', 'time', 'tip', 'price_per_person']]
y = tips.tip_percentage

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)

We scale our data using `StandardScaler`.

In [40]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Now we use Select K Best.

In [41]:
# Create model
kbest = SelectKBest(f_regression, k=2)
#Fit the model
kbest.fit(X_train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7fe1db220310>)

We use the model to find the features most predictive of `tip_percentage`.

In [42]:
X_train.columns[kbest.get_support()]

Index(['tip', 'price_per_person'], dtype='object')

**Conclusion**
    
The features `tip` and `price_per_person` are most predictive of `tip_percentage`.


#### Recursive Feature Elimination for `tip_percentage`

Let's see if we get a different result for the most predictive features of `tip_percentage` using recursive feature elimination.

In [43]:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)
X_train.columns[rfe.get_support()]

Index(['total_bill', 'tip'], dtype='object')

**Conclusion**

According to the recursive feature elimination model, `total_bill` and `tip` are the most predictive features of `tip_percentage`.


### Comparing methods

**Task**
    
Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?


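These methods likely give different answers for the top features because they measure importance differently and the features are close in relative importance. Select K Best scores each feature on its own with a univariate F-test against the target. Recursive Feature Elimination fits a model on all of the features together and repeatedly drops the one with the smallest coefficient. When the features are close, the two rankings can come out in a different order.

We can test whether the results converge as we select more features. We will use Select K Best and Recursive Feature Elimination for `tip_percentage` as an example.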

In [44]:
# Create model
kbest = SelectKBest(f_regression, k=4)
#Fit the model
kbest.fit(X_train_scaled, y_train)
#Get most important columns
X_train.columns[kbest.get_support()]

Index(['total_bill', 'size', 'tip', 'price_per_person'], dtype='object')

In [45]:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X_train_scaled, y_train)
X_train.columns[rfe.get_support()]

Index(['total_bill', 'size', 'tip', 'price_per_person'], dtype='object')

**Conclusion**

As you can see, if we simply increase the number of features selected to 4, the results of both methods are identical. The relative importance of these features is likely close.
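To illustrate how the two methods can disagree at a small number of features, here is a sketch on synthetic data. It assumes a feature `x1_copy` that is nearly a duplicate of `x1`, plus an independent, weaker feature `x3`; all names and coefficients are made up for illustration. Because Select K Best scores each column on its own, a near-duplicate of a strong feature scores almost as well as the original. Recursive Feature Elimination fits all columns together, so the duplicate receives a coefficient near zero and is dropped first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
X_demo = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 + rng.normal(scale=0.1, size=n),  # nearly redundant with x1
    "x3": rng.normal(size=n),                       # independent, weaker signal
})
# The target depends on x1 and x3 only
y_demo = X_demo["x1"] + 0.5 * X_demo["x3"] + rng.normal(scale=0.1, size=n)

# Univariate selection: each column is scored separately against y
kbest = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
kbest_cols = X_demo.columns[kbest.get_support()].tolist()

# Model-based selection: columns are ranked by their coefficients in a joint fit
rfe_selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_cols = X_demo.columns[rfe_selector.get_support()].tolist()

print("SelectKBest:", kbest_cols)
print("RFE:        ", rfe_cols)
```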

## Functions

### Select K Best function

**Task**
    
Write a function named <code>select_kbest</code> that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the <code>SelectKBest</code> class. Test your function with the <code>tips</code> dataset. You should see the same results as when you did the process manually.

#### Create function

In [46]:
def select_kbest(X, y, k):
    """Return the names of the top k features of X for predicting y, using SelectKBest."""
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
    
    # Scale the data (the scaler is fit on the training set only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Create the selector
    kbest = SelectKBest(f_regression, k=k)
    # Fit the selector on the training data
    kbest.fit(X_train_scaled, y_train)
    
    # Return the names of the selected columns
    return X_train.columns[kbest.get_support()]

#### Test Function

First we will test the function to find which features are most predictive of `tip`.

In [47]:
X = tips[['total_bill', 'size', 'smoker', 'time', 'tip_percentage', 'price_per_person']]
y = tips.tip
select_kbest(X, y, 2)

Index(['total_bill', 'size'], dtype='object')

**Conclusion**
    
Our Select K Best function tells us that the features <code>total_bill</code> and <code>size</code> are most predictive of <code>tip</code>, which is the same result we got above.

In [48]:
X = tips[['total_bill', 'size', 'smoker', 'time', 'tip', 'price_per_person']]
y = tips.tip_percentage
select_kbest(X, y, 2)

Index(['tip', 'price_per_person'], dtype='object')

**Conclusion**

Our Select K Best function tells us that the features <code>tip</code> and <code>price_per_person</code> are most predictive of <code>tip_percentage</code>. This is the same result we got above. 

### Recursive Feature Elimination Function

**Task** 
    
Write a function named <code>rfe</code> that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the <code>RFE</code> class. Test your function with the <code>tips</code> dataset. You should see the same results as when you did the process manually.

#### Create function

In [49]:
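def rfe(X, y, k):
    """Return the names of the top k features of X for predicting y, using RFE."""
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
    
    # Scale the data (the scaler is fit on the training set only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Create the selector (named `selector` so it does not shadow this function's name)
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k)
    
    # Fit the selector on the training data
    selector.fit(X_train_scaled, y_train)
    
    # Return the names of the selected columns
    return X_train.columns[selector.get_support()]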
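`select_kbest` and `rfe` repeat the same split-and-scale steps and differ only in the selector they fit. One possible refactor, shown here as a sketch, is a single helper that accepts any selector object. The name `select_features` and the synthetic demo data are assumptions for illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def select_features(X, y, selector):
    """Split, scale, fit `selector` on the training data, and return the selected column names."""
    X_train, _, y_train, _ = train_test_split(X, y, test_size=.2, random_state=123)
    X_train_scaled = StandardScaler().fit_transform(X_train)
    selector.fit(X_train_scaled, y_train)
    return X_train.columns[selector.get_support()].tolist()


# Synthetic demo: y depends on columns a and b, not on c
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = 5 * X_demo["a"] + 3 * X_demo["b"] + rng.normal(scale=0.1, size=100)

kbest_result = select_features(X_demo, y_demo, SelectKBest(f_regression, k=2))
rfe_result = select_features(X_demo, y_demo, RFE(LinearRegression(), n_features_to_select=2))
print(kbest_result, rfe_result)
```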

#### Test function

We will use our function to determine which features are most predictive of `tip`.

In [50]:
X = tips[['total_bill', 'size', 'smoker', 'time', 'tip_percentage', 'price_per_person']]
y = tips.tip
rfe(X, y, 2)

Index(['total_bill', 'tip_percentage'], dtype='object')

   
**Conclusion**
    
Our Recursive Feature Elimination function tells us that the features <code>total_bill</code> and <code>tip_percentage</code> are most predictive of <code>tip</code>, which is the same result we got by using Recursive Feature Elimination above.

We will now use our function to determine which features are most predictive of `tip_percentage`.

In [51]:
X = tips[['total_bill', 'size', 'smoker', 'time', 'tip', 'price_per_person']]
y = tips.tip_percentage
rfe(X, y, 2)

Index(['total_bill', 'tip'], dtype='object')

**Conclusion**
    
Our Recursive Feature Elimination function tells us that the features <code>total_bill</code> and <code>tip</code> are most predictive of <code>tip_percentage</code>, which is the same result we got by using Recursive Feature Elimination above.
    

## Swiss Dataset

 <tr>
    <td> <img src="Photos/pexels-louis-2399391.jpg"/> </td>
</tr>

[Source Photo 2](https://www.pexels.com/photo/photo-of-people-near-clock-tower-during-daytime-2399391/) 

**Task**
    
Load the <code>swiss</code> dataset and use all the other features to predict <code>Fertility</code>. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).
  

### Load Dataset

In [52]:
swiss = data("swiss")
swiss.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


### Select K Best

We will use our Select K Best function that we defined above.

In [53]:
#Define x
X = swiss[['Agriculture', 'Examination', 'Education', 'Catholic', 'Infant.Mortality']]
#Define y, using the target variable
y = swiss.Fertility
select_kbest(X, y, 3)

Index(['Examination', 'Education', 'Catholic'], dtype='object')

**Conclusion**

Using the Select K Best function, we conclude that `Examination`, `Education`, and `Catholic` are most predictive of `Fertility`.
   

### Recursive Feature Elimination

In [54]:
X = swiss[['Agriculture', 'Examination', 'Education', 'Catholic', 'Infant.Mortality']]
y = swiss.Fertility
rfe(X, y, 3)

Index(['Agriculture', 'Education', 'Catholic'], dtype='object')


**Conclusion**

Using Recursive Feature Elimination, we conclude `Agriculture`, `Education`, and `Catholic` are most predictive of `Fertility`.
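The two selections for `Fertility` overlap on two of three features. A quick way to compare them is with set operations; the sketch below simply hard-codes the two results reported above.

```python
# Feature sets reported by the two methods above
kbest_features = {"Examination", "Education", "Catholic"}
rfe_features = {"Agriculture", "Education", "Catholic"}

shared = sorted(kbest_features & rfe_features)      # chosen by both methods
only_kbest = sorted(kbest_features - rfe_features)  # chosen only by Select K Best
only_rfe = sorted(rfe_features - kbest_features)    # chosen only by RFE

print("Shared:", shared)
print("Only Select K Best:", only_kbest)
print("Only RFE:", only_rfe)
```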
    
