# <font color="blue">Lesson 6 - Feature Engineering and Selection</font>
## Feature Scaling and Selection 


In [None]:
import pandas as pd
from sklearn import preprocessing
# IPython magic: loads numpy and matplotlib into the namespace and shows plots inline
%pylab inline

Read in the wine dataset. 

In [None]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=";")

In [None]:
data.head()

Separate the quality column (our target) from the features, then split the data into training and test sets: 

In [None]:
from sklearn.model_selection import train_test_split
y = data.quality
X = data.drop('quality', axis=1)

# X holds the features, y holds the target
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
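
To see what stratify does, here is a minimal sketch on made-up, imbalanced labels (the `labels` and `features` lists are hypothetical and not part of the wine data). With stratify set, each split should keep roughly the same class proportions as the full dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels invented for illustration: 80 of class 5, 20 of class 7
labels = [5] * 80 + [7] * 20
features = [[i] for i in range(len(labels))]

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=123, stratify=labels
)

# Count how many samples of each class ended up in each split
train_counts = Counter(l_train)
test_counts = Counter(l_test)
print(train_counts)
print(test_counts)
```

Both splits keep the 80/20 class ratio of the original labels, which is the reason for passing stratify=y above.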

## Feature Scaling
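Feature scaling rescales features so that they share a comparable range. Two common approaches are min-max normalization, which maps each feature to a fixed range (usually 0 to 1), and standardization, which rescales each feature to zero mean and unit variance. Scaling is an important preprocessing step for many models, especially those that are sensitive to feature magnitudes. Here we use min-max normalization via MinMaxScaler.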
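
As a rough illustration of min-max scaling, the sketch below applies the formula (value - min) / (max - min) to a plain Python list. The helper `min_max_scale` and the sample values are made up for this example; in the notebook itself the work is done by MinMaxScaler.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; this sketch maps everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up alcohol percentages, for illustration only
alcohol = [9.0, 10.0, 11.0, 13.0]
scaled = min_max_scale(alcohol)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```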

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()

We fit the scaler on the training data and transform it. Here we scale both the training features and the training target: 

In [None]:
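X_train_norm = mms.fit_transform(X_train)
# use a separate scaler for the target so the fit on X_train is not overwritten
y_mms = MinMaxScaler()
y_train_norm = y_mms.fit_transform(y_train.values.reshape(-1, 1))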

Note that y_train is a one-dimensional Series, while fit_transform() expects a 2D array of shape (n_samples, n_features), so we used y_train.values.reshape(-1, 1) to convert it into a single column. 
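
The effect of reshape(-1, 1) can be checked on a small array. The values below are invented for illustration; the -1 tells NumPy to infer the number of rows.

```python
import numpy as np

# A made-up 1-D target with three values
target = np.array([5, 6, 7])
print(target.shape)  # (3,)

# reshape(-1, 1) turns it into a single column: 3 rows, 1 feature
column = target.reshape(-1, 1)
print(column.shape)  # (3, 1)

# ravel() flattens it back to 1-D when a function expects that shape
flat = column.ravel()
print(flat.shape)  # (3,)
```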

We can look at the first row to see the scaled values: 

In [None]:
X_train_norm[0]

In [None]:
y_train_norm[0]
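
A common follow-up step, sketched here on toy arrays rather than the wine data, is to reuse the scaler fitted on the training set for the test set. fit_transform() learns the minimum and maximum and applies them in one call; transform() only applies what was already learned. The arrays `train` and `test` below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data, invented for illustration
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [6.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=1 and max=5 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
print(test_scaled.ravel().tolist())   # [0.25, 1.25]
```

The test value 6.0 lies outside the training range, so it maps above 1. That is expected: the test set is scaled with the training statistics so that no information from the test data leaks into preprocessing.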

## Feature Selection
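Now that we've normalized our data to a common scale, let's find out which features are most strongly related to the target. SelectKBest scores each feature with a scoring function (here f_regression, a univariate F-test for regression) and keeps the k highest-scoring features. 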

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

In [None]:
# select the four highest-scoring features
test = SelectKBest(score_func=f_regression, k=4)

# score each feature against the target (ravel() flattens the 2D target to 1D)
fit = test.fit(X_train_norm, y_train_norm.ravel())

In [None]:
# View the 1-based column positions of the top 4 features
[1+zero_based_index for zero_based_index in list(test.get_support(indices=True))]



In [None]:
# number the columns from 1 so they line up with the 1-based positions above
list(enumerate(X_train.columns, start=1))
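
Instead of matching positions by eye, the selected features can be read off by name. The sketch below uses a synthetic DataFrame (the column names `signal_a`, `noise_a`, and so on are invented) in which only two columns actually drive the target, so it is easy to check that SelectKBest picks them.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data for illustration: two informative columns, two pure-noise columns
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
target = 3.0 * toy["signal_a"] - 2.0 * toy["signal_b"] + rng.normal(scale=0.1, size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(toy, target)

# get_support() returns a boolean mask over the columns; use it to get the names
mask = selector.get_support()
selected = list(toy.columns[mask])

# scores_ holds the F-statistic of each feature; sort to see the ranking
scores = pd.Series(selector.scores_, index=toy.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

The same pattern, X_train.columns[test.get_support()], should give the names of the four selected wine features.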

## Answer the following:
1. What pandas function can be used to provide summary statistics on this dataframe? 
2. When using sklearn.model_selection train_test_split function, what do the parameters test_size, random_state and stratify mean? 
3. What does fit_transform do and when should you use it? 
4. Which four features did you determine to be the most influential?