# Pipelines

Pipelines make it easy to chain multiple transformations together with a final model. For example:
```
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Scale the features, project onto 2 principal components, then classify.
pipeline_lr = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=0))
])
# X_train and y_train are assumed to be defined already.
pipeline_lr.fit(X_train, y_train)
```
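To make the chaining concrete, the sketch below compares the pipeline with the same steps written out by hand. The synthetic data from `make_classification` and the 70/30 split are assumptions made only so the example runs on its own; they are not part of the notes above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; any feature matrix and labels work).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipeline_lr = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=0)),
])
pipeline_lr.fit(X_train, y_train)
pipe_pred = pipeline_lr.predict(X_test)

# The same steps by hand: each transformer is fit on the training data only,
# and the fitted transformers are then reused on the test data.
scl = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=0)
X_train_pca = pca.fit_transform(scl.fit_transform(X_train))
clf.fit(X_train_pca, y_train)
manual_pred = clf.predict(pca.transform(scl.transform(X_test)))

# Both routes should give the same predictions.
print(np.array_equal(pipe_pred, manual_pred))
print(pipeline_lr.score(X_test, y_test))
```

The pipeline version is shorter, and it also makes it harder to accidentally fit a transformer on the test data.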

# Assessing Model Performance

Two common techniques for estimating generalization error are holdout validation and k-fold cross-validation.

## Holdout

Data is separated into training, validation, and test sets. The model is fit on the training set, the validation set is used for model selection and tuning, and the test set, held back until the end, gives an estimate of generalization error.

The problem with this method is that the estimate is sensitive to how the samples are split up; the next method is more robust in comparison.
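A minimal sketch of a holdout split follows. The 60/20/20 proportions, the synthetic data, and the helper name `holdout_split` are all assumptions for illustration. Repeating the split with different seeds gives a feel for how much the validation score can move with the split alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not from the notes).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)


def holdout_split(X, y, seed):
    """Hypothetical helper: 60% train, 20% validation, 20% test."""
    # Carve off the test set first, then split the remainder 75/25.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test


val_scores = []
for seed in range(5):
    X_train, X_val, X_test, y_train, y_val, y_test = holdout_split(X, y, seed)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Only the validation set is scored here; the test set stays untouched.
    val_scores.append(clf.score(X_val, y_val))

print(len(X_train), len(X_val), len(X_test))
print(val_scores)
```

The model and the data are the same in every iteration, so any spread in `val_scores` comes from the split.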

## K-fold

Data is split into $k$ folds without replacement; $k-1$ folds are used for training and the remaining fold for testing. The procedure is repeated $k$ times (usually $k = 10$) so that each fold serves as the test fold exactly once, giving $k$ models and $k$ performance estimates, which are then averaged.

Because the split changes on every repetition, no single split dominates the estimate: every sample is used for testing exactly once and for training $k-1$ times.
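The procedure can be written out as an explicit loop. This is a sketch on synthetic data (an assumption); the `test_counts` array is a hypothetical bookkeeping variable that checks each sample lands in a test fold exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not from the notes).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
test_counts = np.zeros(len(X), dtype=int)

for train_idx, test_idx in kfold.split(X):
    # A fresh model is trained on the k-1 training folds...
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # ...and scored on the one fold that was held out.
    scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

print(len(scores))                 # one estimate per fold
print(np.all(test_counts == 1))    # every sample tested exactly once
print(np.mean(scores), np.std(scores))
```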

For smaller datasets, $k$ can be set to the number of samples, so that a single sample is left out as the test set in each iteration - this is called leave-one-out (LOO) cross-validation.
```
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
# Pipelines can be used as well.
# X_train and y_train are assumed to be defined already.
scores = cross_val_score(estimator=lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
# scores holds one accuracy value per fold.
```
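Leave-one-out can be run through the same helper by passing a `LeaveOneOut` splitter as `cv`. The small synthetic dataset below is an assumption chosen to keep the number of fits low.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption); LOO fits one model per sample.
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

# Each test set is a single sample, so each score is either 0.0 or 1.0.
print(len(scores))
print(scores.mean())
```

Since the number of fits equals the number of samples, this is usually practical only when the dataset is small.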

## Grid Search 

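While parameters learned from the dataset are optimized by the learning algorithm itself, external parameters, also called hyperparameters (say, the depth of a decision tree), need to be optimized separately.

Grid search works by exhaustively trying every combination from a set of candidate hyperparameter values and keeping the combination that scores best under cross-validation.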
```
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svc = SVC()

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = {
    'C': param_range,
    'kernel': ['linear', 'rbf'],
    'gamma': param_range
}

gs = GridSearchCV(estimator=svc,
                    param_grid=param_grid,
                    scoring='accuracy',
                    cv=10,
                    n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_, gs.best_params_)
```

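The main downside of this method is that it can be very computationally intensive: the grid above has 8 × 2 × 8 = 128 combinations, each of which is fit 10 times. `RandomizedSearchCV` instead draws a fixed number of parameter settings from given distributions, which is often a faster approach.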
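A minimal sketch of a randomized search over the same kind of SVC parameters follows. The synthetic data, the log-uniform distributions from `scipy.stats`, their ranges, and the budget of 20 draws are all assumptions picked for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not from the notes).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Continuous distributions replace the fixed candidate lists of a grid.
# The ranges here are illustrative only.
param_distributions = {
    'C': loguniform(1e-3, 1e2),
    'gamma': loguniform(1e-3, 1e2),
    'kernel': ['linear', 'rbf'],
}

rs = RandomizedSearchCV(estimator=SVC(),
                        param_distributions=param_distributions,
                        n_iter=20,        # number of sampled settings
                        scoring='accuracy',
                        cv=5,
                        random_state=0,
                        n_jobs=1)
rs.fit(X, y)
print(rs.best_score_, rs.best_params_)
```

The cost is set by `n_iter` rather than by the size of the grid, so the budget stays fixed however many parameters are searched.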

Grid search can also be combined with k-fold cross-validation from above: once the optimal hyperparameters are found, the model is fit again with those settings to learn the parameters from the dataset.

# Finding the Best Classifier

When it isn't clear which classifier to use, it can be smart to compare several and pick the most promising one as the starting point for parameter tuning.

```
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    cv_results = model_selection.cross_val_score(model, X, Y, cv=10, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
```

```
LR: 0.766969 (0.035426)
LDA: 0.773496 (0.034665)
KNN: 0.721377 (0.044168)
CART: 0.704375 (0.064984)
NB: 0.756494 (0.033037)
SVM: 0.651059 (0.003418)
```
