### Feature Engineering

To get comfortable with common feature engineering techniques and how to implement them in Python, we will use a toy dataset. First, let's import the necessary libraries and create the dataset.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'response': [2.4, 3.3, -4.2, 5.6, 1.5, 8.7],
                   'x1': ['yes', 'no', 'yes', 'maybe', 'no', 'yes'],
                   'x2': [-1, -3, np.nan, 0, np.nan, 1],
                   'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                   'x4': [np.nan, np.nan, 1, 1, 1, 1],
                   'x5': ['A', 'B', np.nan, 'A', 'A', 'A']})
df

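| | response | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|
| 0 | 2.4 | yes | -1.0 | 2.4 | NaN | A |
| 1 | 3.3 | no | -3.0 | 15.0 | NaN | B |
| 2 | -4.2 | yes | NaN | 3.3 | 1.0 | NaN |
| 3 | 5.6 | maybe | 0.0 | 2.4 | 1.0 | A |
| 4 | 1.5 | no | NaN | 1.8 | 1.0 | A |
| 5 | 8.7 | yes | 1.0 | 0.4 | 1.0 | A |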


`1.` Fit a linear model of the response on the three quantitative x-variables in the dataset (`x2`, `x3`, and `x4`), including an intercept. Use the results to answer the first quiz question below.

In [2]:
df['intercept'] = 1
lm = sm.OLS(df['response'], df[['intercept','x2','x3','x4']])
results = lm.fit()
results.summary()

MissingDataError: exog contains inf or nans
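
The error above indicates that the design matrix contains missing values. One way to see which columns are responsible is to count the `NaN` entries per column. The following is a minimal sketch on a copy of the quantitative columns from the toy dataset; the names `predictors` and `missing_counts` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Quantitative predictors copied from the toy dataset
predictors = pd.DataFrame({
    'x2': [-1, -3, np.nan, 0, np.nan, 1],
    'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
    'x4': [np.nan, np.nan, 1, 1, 1, 1],
})

# Count the missing values in each column
missing_counts = predictors.isna().sum()
print(missing_counts)
```

Here `x2` and `x4` each have two missing entries, while `x3` is complete.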

`2.` Use the sklearn documentation [here](http://scikit-learn.org/stable/modules/preprocessing.html) and the previous video to fill in the missing values of each quantitative column with that column's mean. Then use the imputed columns to re-fit the linear model from question `1.`, and use the results to answer quiz 2 below.

In [3]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df[['x2', 'x3', 'x4']])
df[['x2', 'x3', 'x4']] = imp.transform(df[['x2', 'x3', 'x4']])
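
As a sanity check, mean imputation can also be reproduced with plain pandas. The sketch below assumes the same quantitative columns as the toy dataset; `raw`, `imputed`, and `by_hand` are hypothetical names used only here. It compares the `SimpleImputer` result with `fillna` applied to the column means.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Quantitative columns before imputation
raw = pd.DataFrame({
    'x2': [-1, -3, np.nan, 0, np.nan, 1],
    'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
    'x4': [np.nan, np.nan, 1, 1, 1, 1],
})

# Mean imputation with scikit-learn (returns a NumPy array)
imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(raw)

# The same operation by hand: DataFrame.mean() skips NaN by default,
# and fillna with a Series fills each column with its own mean
by_hand = raw.fillna(raw.mean())

print(np.allclose(imputed, by_hand.to_numpy()))
```

Note that every observed value of `x4` is 1, so its mean is 1 and the imputed column becomes constant.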

In [4]:
lm = sm.OLS(df['response'], df[['intercept','x2','x3','x4']])
results = lm.fit()
results.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | response | R-squared: | 0.489 |
| Model: | OLS | Adj. R-squared: | 0.149 |
| Method: | Least Squares | F-statistic: | 1.438 |
| Date: | Fri, 25 Oct 2019 | Prob (F-statistic): | 0.365 |
| Time: | 23:10:02 | Log-Likelihood: | -14.742 |
| No. Observations: | 6 | AIC: | 35.48 |
| Df Residuals: | 3 | BIC: | 34.86 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 1.1507 | 1.111 | 1.036 | 0.376 | -2.384 | 4.685 |
| x2 | 5.1202 | 3.053 | 1.677 | 0.192 | -4.594 | 14.835 |
| x3 | 1.0487 | 0.752 | 1.394 | 0.258 | -1.345 | 3.442 |
| x4 | 1.1507 | 1.111 | 1.036 | 0.376 | -2.384 | 4.685 |

| | | | |
|---|---|---|---|
| Omnibus: | | Durbin-Watson: | 2.043 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 2.515 |
| Skew: | -1.531 | Prob(JB): | 0.284 |
| Kurtosis: | 3.828 | Cond. No. | 2.83e+16 |
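
In the summary, `intercept` and `x4` have identical estimates, `Df Model` is 2 rather than 3, and the condition number is very large. All three are consistent with `x4` being a constant column of ones after mean imputation, which makes it indistinguishable from the intercept. The sketch below rebuilds the design matrix by hand (the array `X` is an illustration, not part of the notebook) and checks its rank.

```python
import numpy as np

# Design matrix after mean imputation: intercept, x2, x3, x4
X = np.column_stack([
    np.ones(6),                    # intercept
    [-1, -3, -0.75, 0, -0.75, 1],  # x2, missing values filled with the mean -0.75
    [2.4, 15, 3.3, 2.4, 1.8, 0.4], # x3, no missing values
    np.ones(6),                    # x4, missing values filled with the mean 1
])

# Four columns, but only three are linearly independent
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)
```

A rank below the number of columns means the individual coefficients for `intercept` and `x4` are not separately identifiable from this data.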


`3.` A common way to scale features is to subtract the mean and divide by the standard deviation. For some machine learning algorithms you should always consider this type of scaling (or another form of normalization), as discussed [here](https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning). Use the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) and the previous video to apply this scaling to the three imputed quantitative columns in your dataset.

To confirm that you performed these transformations correctly, answer quiz 3 below.

In [5]:
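norm = StandardScaler()
# Learn each column's mean and standard deviation
norm.fit(df[['x2','x3','x4']])
# Returns the scaled values as a NumPy array; df itself is not modified
norm.transform(df[['x2','x3','x4']])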

array([[-0.20701967, -0.3706604 ,  0.        ],
       [-1.86317701,  2.20015852,  0.        ],
       [ 0.        , -0.18703048,  0.        ],
       [ 0.621059  , -0.3706604 ,  0.        ],
       [ 0.        , -0.49308035,  0.        ],
       [ 1.44913767, -0.7787269 ,  0.        ]])
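
The scaled values can be reproduced by hand, which also explains the column of zeros for `x4`. This is a minimal sketch using only pandas; the frame `imputed` holds the mean-imputed values from the earlier step, and the names `stds`, `varying`, and `z` are introduced for illustration. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Quantitative columns after mean imputation
imputed = pd.DataFrame({
    'x2': [-1, -3, -0.75, 0, -0.75, 1],
    'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
    'x4': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# Population standard deviation of each column (ddof=0)
stds = imputed.std(ddof=0)

# x4 is constant, so its standard deviation is zero;
# standardize only the columns that actually vary
varying = ['x2', 'x3']
z = (imputed[varying] - imputed[varying].mean()) / stds[varying]
print(z.round(4))
```

The `x2` and `x3` columns of `z` agree with the first two columns of the array above. Because `x4` has zero variance, dividing by its standard deviation is undefined, and the scaler output simply leaves that column at zero.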