### Centering and Scaling
Scaling the features can improve model performance, although not always: in the Congressional voting records dataset, for example, all of the features are binary, so scaling has minimal impact.

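This section explores scaling on the White Wine Quality dataset. The 'quality' feature of the wine is used to build the binary target variable: if 'quality' is less than 5, the target variable is 1 (stored as the label 'TRUE' in the code below); otherwise, it is 0 (stored as 'FALSE').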

Notice how some features seem to have different units of measurement. 'density', for instance, only takes values between 0 and 1, while 'total sulfur dioxide' has a maximum value of 289. As a result, it may be worth scaling the features here. 

In [5]:
import pandas as pd
import numpy as np

In [60]:
# A forward slash avoids the invalid '\w' escape sequence and works on every OS
df = pd.read_csv('data/white-wine.csv')

In [61]:
# The 'quality' feature of the wine is turned into the binary target variable: 
# If 'quality' is less than 5, the label is 'TRUE' (class 1), and otherwise, it is 'FALSE' (class 0).

df['quality'] = np.where(df['quality'] < 5, 'TRUE', 'FALSE')

In [63]:
# build predictor and target df
X, y = df.drop('quality', axis=1).values, df['quality'].values

In [65]:
# Import scale
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}\n".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled)))  
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Mean of Unscaled Features: 18.432687072460002
Standard Deviation of Unscaled Features: 41.54494764094571

Mean of Scaled Features: 2.7314972981668206e-15
Standard Deviation of Scaled Features: 0.9999999999999999
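
The figures above are computed over the whole array, so they summarise all features at once. Standardization itself works column by column: each feature has its own mean subtracted and is then divided by its own standard deviation. Below is a minimal NumPy sketch of that computation. The array `X_toy` is made up for illustration and is not taken from the wine data.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
X_toy = np.array([[0.99, 120.0],
                  [0.98, 250.0],
                  [1.00,  60.0],
                  [0.97, 180.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)   # population standard deviation (ddof=0)
X_toy_scaled = (X_toy - col_mean) / col_std

print("Per-feature mean before: {}".format(col_mean))
print("Per-feature std before:  {}".format(col_std))
print("Per-feature mean after:  {}".format(X_toy_scaled.mean(axis=0)))
print("Per-feature std after:   {}".format(X_toy_scaled.std(axis=0)))
```

With its default arguments, `sklearn.preprocessing.scale` performs this same column-wise computation.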


### Pipelined Centering and Scaling
Whether scaling helps is an empirical question. The following checks whether scaling the features of the White Wine Quality dataset affects classifier performance: a k-NN classifier is fit as part of a pipeline that includes scaling and, for comparison, a second k-NN classifier is fit on the unscaled data.

The feature array and target variable array have been pre-loaded as X and y. Additionally, KNeighborsClassifier and train_test_split have been imported from sklearn.neighbors and sklearn.model_selection, respectively.
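
One reason scaling can matter for k-NN in particular is that the classifier relies on distances between samples, and a feature with a large numeric range can dominate those distances. The sketch below illustrates this with made-up numbers (the array `X_toy` is hypothetical; its columns only mimic the ranges of 'density' and 'total sulfur dioxide').

```python
import numpy as np

# Hypothetical samples: column 0 has a 'density'-like range,
# column 1 a 'total sulfur dioxide'-like range
X_toy = np.array([[0.990, 100.0],
                  [0.999, 105.0],
                  [0.991, 160.0]])

# Share of the squared Euclidean distance between samples 0 and 1
# that comes from column 0, on the raw data
diff_raw = X_toy[0] - X_toy[1]
share_raw = diff_raw[0] ** 2 / np.sum(diff_raw ** 2)

# Same share after standardizing each column
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
diff_std = X_std[0] - X_std[1]
share_std = diff_std[0] ** 2 / np.sum(diff_std ** 2)

print("Column 0 share of squared distance (raw):    {:.6f}".format(share_raw))
print("Column 0 share of squared distance (scaled): {:.6f}".format(share_std))
```

On the raw data the small-range column contributes almost nothing to the distance; after standardization both columns are measured in comparable units.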

In [66]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split 

In [67]:
# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Set up the pipeline steps: steps[transformer, estimator]
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))


Accuracy with Scaling: 0.964625850340136
Accuracy without Scaling: 0.9666666666666667
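
A practical benefit of placing the scaler inside the pipeline is that its mean and standard deviation are learned from the training split only and then reused on the test split. The sketch below checks this on synthetic data; `make_classification` and the names `X_demo`, `y_demo` and `pipe` are stand-ins introduced for illustration, not part of the wine example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the wine dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
pipe.fit(X_tr, y_tr)

# The scaler inside the pipeline holds statistics of the training split only
fitted_scaler = pipe.named_steps['scaler']
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))

# pipe.predict applies the fitted scaler to the test data, then calls k-NN
manual_pred = pipe.named_steps['knn'].predict(fitted_scaler.transform(X_te))
print(np.array_equal(pipe.predict(X_te), manual_pred))
```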


### SVM (classification) with Pipelined k-fold cross-validation to tune hyperparameters

The following builds a pipeline that includes __scaling__ and __hyperparameter tuning__ to __classify__ wine quality.

Fit an [__SVM classifier__](C:\Users\seanf\Documents\Machine_Learning\DataCamp_Supervised_ML\lecture_slides\SVM.pdf). The hyperparameters to tune are __C__ and __gamma__. C controls the __regularization strength__ (in scikit-learn, a smaller C means stronger regularization) and is analogous to the C tuned for logistic regression in Chapter 3, while gamma is the kernel coefficient. A kernel is essentially a similarity function: given two samples, it determines how similar they are.
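
To make the role of gamma concrete, here is a small sketch of the RBF kernel, which is the default kernel of scikit-learn's `SVC`. The helper `rbf_kernel_value` and the points `a` and `b` are hypothetical names used only for this illustration.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])   # squared distance from a is 5

# Identical samples always have similarity 1; a larger gamma makes the
# similarity between distinct samples fall off faster
for gamma in (0.01, 0.1):
    print("gamma={}: k(a, a)={:.3f}, k(a, b)={:.3f}".format(
        gamma, rbf_kernel_value(a, a, gamma), rbf_kernel_value(a, b, gamma)))
```

A larger gamma therefore makes each training sample influence only its close neighbourhood, which tends to produce a more flexible decision boundary.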

The following modules have been pre-loaded: Pipeline, svm, train_test_split, GridSearchCV, classification_report, accuracy_score. The feature and target variable arrays X and y have also been pre-loaded.

In [68]:
# Import necessary modules
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import classification_report 
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

In [69]:
# Setup the pipeline (transformer, estimator)
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space using the notation 'step_name__parameter_name'.
# Here, the step_name is SVM, and the parameter_names are C and gamma.
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid = parameters)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.9693877551020408
             precision    recall  f1-score   support

      FALSE       0.97      1.00      0.98       951
       TRUE       0.43      0.10      0.17        29

avg / total       0.96      0.97      0.96       980

Tuned Model Parameters: {'SVM__C': 100, 'SVM__gamma': 0.01}
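
The 'step_name__parameter_name' names accepted by the grid can be listed with the pipeline's `get_params()` method, and the number of folds can be set explicitly through the `cv` argument of `GridSearchCV`. The sketch below shows both on synthetic data; `demo_pipeline`, `X_demo` and `y_demo` are illustrative stand-ins, and `cv=3` is an arbitrary choice, not the setting used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('SVM', SVC())])

# List the tunable parameter names that belong to the 'SVM' step
svm_param_names = sorted(name for name in demo_pipeline.get_params()
                         if name.startswith('SVM__'))
print(svm_param_names)

# Synthetic stand-in data (not the wine dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

# cv=3 requests 3-fold cross-validation for every parameter combination
demo_cv = GridSearchCV(demo_pipeline,
                       param_grid={'SVM__C': [1, 10], 'SVM__gamma': [0.1, 0.01]},
                       cv=3)
demo_cv.fit(X_demo, y_demo)

print("Best parameters: {}".format(demo_cv.best_params_))
print("Mean cross-validated accuracy of the best setting: {}".format(demo_cv.best_score_))
```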


### ElasticNet (regression) with Pipelined Imputer (missing data), Scaler and l1_ratio (Regularization Ratio) tuning

Build a pipeline that imputes the missing data, scales the features, and fits an [ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) regressor to the Gapminder data. Tune the __l1_ratio__ of __ElasticNet__ using __GridSearchCV__.
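
As a quick illustration of the imputation step, the sketch below fills missing values with the column mean. It assumes scikit-learn 0.20 or later, where the imputer class is `SimpleImputer` in `sklearn.impute`; the array `X_missing` is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array with one missing value in each column
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan]])

# strategy='mean' replaces each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X_missing)

print("Column means used for filling: {}".format(imputer.statistics_))
print(X_filled)
```

Inside a pipeline, these column means are learned from the training data and reused when transforming new data.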

In [77]:
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

In [78]:
df = pd.read_csv('data/gm_2008_region.csv')

df = df.drop(['Region'], axis = 1)

In [79]:
# Note the use of .drop() to drop the target variable 'life' from the feature array X as well as the use of the 
# .values attribute to ensure X and y are NumPy arrays. Without using .values, X and y are a DataFrame and Series 
# respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape.

# Build the feature and target arrays
X, y = df.drop('life', axis=1).values, df['life'].values

In [85]:
# Set up the pipeline steps: steps
# SimpleImputer always imputes column-wise, so no axis argument is needed
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline (transformer, estimator)
pipeline = Pipeline(steps)

# Specify the hyperparameter space for the l1 ratio using the following notation: 'step_name__parameter_name'. 
# Here, the step_name is elasticnet, and the parameter_name is l1_ratio.
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid = parameters)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1_ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))

Tuned ElasticNet l1_ratio: {'elasticnet__l1_ratio': 1.0}
Tuned ElasticNet R squared: 0.8862016570888217
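
The grid search above selected an l1_ratio of 1.0. In scikit-learn's `ElasticNet`, l1_ratio sets the mix between the L1 and L2 penalties, and a value of 1.0 means a pure L1 penalty, which is the Lasso. The sketch below checks this equivalence on synthetic data; `make_regression` and the names `X_demo` and `y_demo` are stand-ins for illustration, not the Gapminder data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data (not the Gapminder dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

# With l1_ratio=1.0 the ElasticNet penalty reduces to the Lasso penalty
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
print("Same coefficients as Lasso: {}".format(np.allclose(enet_l1.coef_, lasso.coef_)))

# A mixed penalty for comparison
enet_mixed = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)
print("Zero coefficients, l1_ratio=1.0: {}".format(np.sum(enet_l1.coef_ == 0)))
print("Zero coefficients, l1_ratio=0.5: {}".format(np.sum(enet_mixed.coef_ == 0)))
```

Note that the grid only varies l1_ratio; the overall penalty strength `alpha` stays at its default unless it is added to the parameter grid as `elasticnet__alpha`.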
