# Pipeline Assignment

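`Pipeline()` chains preprocessing steps and a final model into a single estimator, so that different options and different models can be evaluated consistently under cross-validation.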

## Basic Understanding
After covering this material, we need to answer the following questions:
 1. Why is it wrong to say that y can be predicted from X?
 2. What is wrong with applying the steps in order?
 3. Why do the training sets overlap the testing sets?
 4. Why does the pipeline avoid the issue?
 
## Modify the code to understand the trend
Small number of features:
 1. Reduce the number of components of X by 100.
 2. In the Kernel submenu, click "Restart & Run All" to rerun.
 3. What are the new values for $R^2$?
 4. Are the new values lower or higher? Explain.

Large number of features:
 1. Increase the number of components of X by a factor of 10, to 1000.
 2. Rerun.
 3. What is the relationship between the number of components and $R^2$?

## Model pipeline optimization
 1. How did we pass parameters to different models?
 2. What is the advantage of using a single call to GridSearchCV()?
 
 
The example is taken from Chapter 6 of *Introduction to Machine Learning with Python*.
[Github for book code](https://github.com/amueller/introduction_to_ml_with_python)

In [7]:
# NumPy library:
import numpy as np

# Create uncorrelated Random Variables:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 100))
y = rnd.normal(size=(100,))

print('X has {} observations of {}-dimensional vectors'.format(X.shape[0], X.shape[1]))
print('y has {} responses.'.format(y.shape[0]))

X has 100 observations of 100-dimensional vectors
y has 100 responses.


## Feature Selection
Linear regression is performed using:

       y = m*X[i] + c
       
where the goal is to predict y from the ith-feature in X.

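The F-value tests whether the slope "m" differs from zero. Its p-value is the probability of observing an F-value at least this large if "m" were actually zero. Thus, a low p-value indicates that "m" is unlikely to be zero.

In selecting percentiles, the features (columns of X) are ranked by their F-values. Setting `percentile=5` keeps the top 5% of the features, that is, the features whose F-values are higher than those of 95% of all features. With 100 features, 5 are kept.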

scikit-learn references for feature selection:
[percentiles option](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html),
[fitting using f_regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html),
[pre-processing](https://scikit-learn.org/stable/modules/preprocessing.html).
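
To make the ranking concrete, the following is a minimal sketch that calls `f_regression` directly and then checks which columns `SelectPercentile` keeps. It regenerates the same random data as the notebook; the variable names `F`, `pvalues`, and `kept` are chosen here for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

# Same synthetic data as in the notebook: X and y are independent noise.
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 100))
y = rnd.normal(size=(100,))

# f_regression scores every column of X separately and returns
# one F-value and one p-value per column.
F, pvalues = f_regression(X, y)

# SelectPercentile ranks the columns by F-value and keeps the top 5%.
select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
kept = np.flatnonzero(select.get_support())

print("Number of features kept:", kept.size)
print("Indices of kept features:", kept)
print("p-values of kept features:", np.round(pvalues[kept], 3))
```

Because y is pure noise, any low p-values among the kept features arise by chance from screening 100 candidates, which is worth keeping in mind for the questions above.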

In [8]:
# Feature Selection Example.
from sklearn.feature_selection import SelectPercentile, f_regression

# f_regression:
#   1. Fits a univariate linear regression of y on each feature of X.
#   2. Computes the F-value (and p-value) for each feature.
# SelectPercentile.fit() then marks the top 5% of the features by F-value.
select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)

# Keep only the selected columns of X. No rescaling is applied: the mean and
# standard deviation printed below come from the standard normal data itself.
X_selected = select.transform(X)

print("X_selected.shape: {}".format(X_selected.shape))
print("Transformed X has mean={:0.3f} and stdev={:0.3f}".format(np.mean(X_selected), np.std(X_selected)))

X_selected.shape: (100, 5)
Transformed X has mean=-0.010 and stdev=0.988


Ridge regression estimates y from the reduced set of variables by choosing the weights w that minimize:
$$ || y - X w ||_2^2 + \alpha || w ||_2^2 $$

scikit-learn references:
[Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html),
[Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html),
[Ridge regression example](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression).
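
As a sanity check on the penalized least-squares objective (squared residual norm plus `alpha` times the squared norm of w), the sketch below compares its closed-form minimizer with the coefficients found by `Ridge`. Two assumptions are made for simplicity: the intercept is switched off with `fit_intercept=False` (the notebook uses the default, which fits one), and `X_small` is a hypothetical 5-column stand-in for `X_selected`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rnd = np.random.RandomState(seed=0)
X_small = rnd.normal(size=(100, 5))  # hypothetical stand-in for X_selected
y = rnd.normal(size=(100,))
alpha = 1.0

# Closed-form minimizer of ||y - Xw||_2^2 + alpha * ||w||_2^2 (no intercept):
#   w = (X^T X + alpha * I)^{-1} X^T y
n_features = X_small.shape[1]
w_closed = np.linalg.solve(
    X_small.T @ X_small + alpha * np.eye(n_features), X_small.T @ y)

# The same problem solved by scikit-learn.
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X_small, y)

print("Largest coefficient difference:",
      np.max(np.abs(w_closed - ridge.coef_)))
```

Larger values of `alpha` shrink the weights toward zero; `alpha=0` would reduce the problem to ordinary least squares.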

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

print("Cross-validation accuracy (cv only on ridge): {:.2f}".format(
      np.mean(cross_val_score(Ridge(), X_selected, y, cv=5))))
print("This mean R-squared result is wrong. Why?")

Cross-validation accuracy (cv only on ridge): 0.13
This mean R-squared result is wrong. Why?


In [10]:
# Use a pipeline so that feature selection is refit inside each cross-validation fold
from sklearn.pipeline import Pipeline

pipe = Pipeline([("select", SelectPercentile(score_func=f_regression,
                                             percentile=5)),
                 ("ridge", Ridge())])

corrected_result = cross_val_score(pipe, X, y, cv=5)
print("R-squared values:", corrected_result)

print("Cross-validation accuracy (pipeline): {:.2f}".format(
      np.mean(corrected_result)))
print("Why is this the correct result?")


R-squared values: [-0.72450081 -0.25121839 -0.75010265 -0.33567149 -0.74246843]
Cross-validation accuracy (pipeline): -0.56
Why is this the correct result?
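
One way to see what `cross_val_score` does with the pipeline is to write the fold loop out by hand. The sketch below assumes the default behavior for regressors with `cv=5`, namely an unshuffled `KFold(n_splits=5)`; the names `manual_scores` and `pipe_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 100))
y = rnd.normal(size=(100,))

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Feature selection sees only the training part of the fold.
    select = SelectPercentile(score_func=f_regression, percentile=5)
    X_train_sel = select.fit_transform(X[train_idx], y[train_idx])
    # The held-out part is only transformed, never used for fitting.
    X_test_sel = select.transform(X[test_idx])
    ridge = Ridge().fit(X_train_sel, y[train_idx])
    manual_scores.append(ridge.score(X_test_sel, y[test_idx]))

pipe = Pipeline([("select", SelectPercentile(score_func=f_regression,
                                             percentile=5)),
                 ("ridge", Ridge())])
pipe_scores = cross_val_score(pipe, X, y, cv=5)

print("Manual loop:", np.round(manual_scores, 3))
print("Pipeline:   ", np.round(pipe_scores, 3))
```

In the earlier cell, by contrast, `select` was fitted on all 100 observations before cross-validation started, so the held-out rows had already influenced which features were kept.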


In [11]:
# Examine the ridge parameters within the pipeline:
print(pipe.named_steps["ridge"].get_params())

{'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001}
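
The parameters of each step can also be reached from the pipeline itself, using the `<step name>__<parameter name>` convention. The sketch below rebuilds the same two-step pipeline and changes two parameters; the new values (10.0 and 10) are arbitrary choices for illustration.

```python
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipe = Pipeline([("select", SelectPercentile(score_func=f_regression,
                                             percentile=5)),
                 ("ridge", Ridge())])

# Step parameters are exposed with a double underscore after the step name.
nested_names = sorted(k for k in pipe.get_params() if "__" in k)
print(nested_names)

# set_params routes each value to the matching step.
pipe.set_params(ridge__alpha=10.0, select__percentile=10)
print("ridge alpha:", pipe.named_steps["ridge"].alpha)
print("select percentile:", pipe.named_steps["select"].percentile)
```

This is the same naming scheme that a parameter grid uses for keys such as `classifier__C`.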


## Model optimization
A single grid search over the pipeline allows us to select among different classifiers and different preprocessing steps.

In [12]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

pipe = Pipeline([('preprocessing', StandardScaler()), 
                 ('classifier', SVC())])

param_grid = [
    {'classifier': [SVC()], 
     'preprocessing': [StandardScaler(), None],
     'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
     'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    {'classifier': [RandomForestClassifier(n_estimators=100)],
     'preprocessing': [None], 
     'classifier__max_features': [1, 2, 3]}]

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))

Best params:
{'classifier': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False), 'classifier__C': 10, 'classifier__gamma': 0.01, 'preprocessing': StandardScaler(copy=True, with_mean=True, with_std=True)}

Best cross-validation score: 0.99
Test-set score: 0.98
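
To get a feel for how much work the single `GridSearchCV` call performs, the grid can be expanded with `ParameterGrid` without fitting anything. This is a small sketch that repeats the `param_grid` from the cell above; `n_candidates` and `n_folds` are names introduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = [
    {'classifier': [SVC()],
     'preprocessing': [StandardScaler(), None],
     'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
     'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    {'classifier': [RandomForestClassifier(n_estimators=100)],
     'preprocessing': [None],
     'classifier__max_features': [1, 2, 3]}]

# Each dictionary is expanded separately: 2 * 6 * 6 = 72 SVC settings
# plus 3 random forest settings.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5

print("Candidate settings:", n_candidates)               # 75
print("Fits during the search:", n_candidates * n_folds)  # 375
```

With the default `refit=True`, one more fit on the full training set follows for the best setting.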
