## Problem Statement
- Build a machine learning pipeline

#### ML Pipeline
- Some standard workflows in machine learning projects can be automated.
- In scikit-learn, the **Pipeline** utility helps to clearly define and automate these workflows.

#### What is Pipeline utility?
- It allows a linear sequence of data transforms to be chained together into a single modeling process that can be evaluated as a whole.

###  [input] - [wf-1] - [wf-2] - [wf-3] - [wf-n] - [predictions]
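
To make the chaining idea concrete, here is a minimal sketch that compares a hand-written chain with the same steps wrapped in a `Pipeline`. The synthetic data, the `LogisticRegression` classifier, and the names `X_demo`, `y_demo`, `manual_pred`, and `pipe_pred` are assumptions chosen for illustration; they are not part of this notebook's dataset or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only, not the Pima dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Chained by hand: fit the scaler, transform, then fit the classifier
scaler = StandardScaler().fit(X_demo)
clf = LogisticRegression().fit(scaler.transform(X_demo), y_demo)
manual_pred = clf.predict(scaler.transform(X_demo))

# The same chain expressed as a Pipeline of (name, step) tuples
pipe = Pipeline([("standardize", StandardScaler()), ("clf", LogisticRegression())])
pipe_pred = pipe.fit(X_demo, y_demo).predict(X_demo)

# Compare the two sets of predictions
print(np.array_equal(manual_pred, pipe_pred))
```

Each step can be retrieved by name afterwards, for example `pipe.named_steps["standardize"]`.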

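- Pipelines help prevent data leakage.
- For example, a data preparation step such as standardization is fitted only on the training portion of each cross-validation fold, so no information from the held-out portion reaches the model during training.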
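
The per-fold behavior described above can be sketched by writing the cross-validation loop by hand and comparing it with `cross_val_score` on a pipeline. This is an illustrative sketch on synthetic data; `LogisticRegression` and all variable names here are assumptions, not the notebook's own setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(120, 4))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

pipe = Pipeline([("standardize", StandardScaler()), ("clf", LogisticRegression())])
cv = KFold(n_splits=4)

# Pipeline version: the scaler is refitted inside every fold
pipeline_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

# Manual version: fit the scaler on the training rows only,
# then apply the fitted scaler to the held-out rows
manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    clf = LogisticRegression().fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(clf.score(scaler.transform(X_demo[test_idx]), y_demo[test_idx]))

# Compare the per-fold accuracies of both versions
print(np.allclose(pipeline_scores, manual_scores))
```

Fitting the scaler once on all rows before splitting would instead let the held-out rows influence the mean and standard deviation used during training, which is the leakage the pipeline avoids.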


#### Load Python libraries and dataset

In [None]:
import pandas as pd
from numpy import set_printoptions

In [None]:
data = pd.read_csv("../data/pima-indians-diabetes.csv")

#### Check Your Data

In [None]:
# check the first 5 rows of the dataset
print(data.head(5))

### Separate input and target variables

In [None]:
# separate the input features (first 8 columns) from the target (last column)
data_array = data.values
X = data_array[:,0:8]
y = data_array[:,8]

## <span style="color:red"> Data Preparation and Modeling Pipeline</span>

- Create a pipeline that standardizes the data with `StandardScaler` before training the model. Inside cross-validation, the scaler is fitted on the training portion of each fold.
### [input] - [standardize] - [classifier] - [predictions]

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

from sklearn.pipeline import Pipeline

In [None]:
# define pipeline workflows as a list of (name, step) tuples
estimator1 = [] # list instantiation
estimator1.append(('standardize', StandardScaler()))
estimator1.append(('clf', DecisionTreeClassifier()))

# instantiate Pipeline class with pipeline workflows
pipe1 = Pipeline(estimator1)

## Evaluate Pipeline

In [None]:
# Instantiate KFold class with number of splits
kfold = KFold(n_splits=5)

# cross validation on Kfolds 
results = cross_val_score(pipe1, X, y, cv=kfold, n_jobs=1)

## Analyze Results

In [None]:
print("Accuracy per fold\n======================================")
for i in range(len(results)):
    print("Accuracy - Fold-{}  -> {}".format(i, results[i]))

print("\nAverage accuracy\n=====================================")
print("Accuracy - {}".format(results.mean()*100.0))

## Feature Selection
- Statistical tests can be used to select the features that have the strongest relationship with the output variable.
- scikit-learn provides the **SelectKBest** class for feature selection.
- It can be combined with a range of statistical tests to select a specific number of features.
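
As a small illustration of the points above, the sketch below runs `SelectKBest` with the chi-squared test on synthetic data and uses `get_support()` to recover the names of the selected columns. The data and the column names (`informative_a`, `noise_a`, and so on) are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data (illustration only): two columns depend on the label, two do not
rng = np.random.default_rng(2)
n = 200
y_demo = rng.integers(0, 2, size=n)
informative_a = y_demo * 5 + rng.integers(0, 3, size=n)
informative_b = (1 - y_demo) * 4 + rng.integers(0, 3, size=n)
noise_a = rng.integers(0, 10, size=n)
noise_b = rng.integers(0, 10, size=n)

X_demo = np.column_stack([informative_a, noise_a, informative_b, noise_b])
names = np.array(["informative_a", "noise_a", "informative_b", "noise_b"])

# Keep the 2 columns with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected = names[selector.get_support()]
print(selected)
print(selector.transform(X_demo).shape)
```

Using the mask avoids having to match scores back to column names by hand.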

### Problem Statement
- Use Pima Indians Diabetes dataset and **select best 4 features**

In [None]:
# we already have X and y from the Pima Indians dataset

# Load Python library for feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import operator
set_printoptions(precision=3)

# Feature selection (k)
select_feat = SelectKBest(score_func=chi2, k=4)
select_feat_fit = select_feat.fit(X, y)

# Summarize Scores
feat_scores = select_feat_fit.scores_


# Summarize selected features
# names of the 8 input features (the target column 'class' is not scored)
feat_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
feature_score_map = dict(zip(feat_names, feat_scores))
sorted_feature_score = dict(sorted(feature_score_map.items(), key=operator.itemgetter(1), reverse=True))
for k, v in sorted_feature_score.items():
    print(k, ":", v)
    


In [None]:
# get feature set
features = select_feat_fit.transform(X)
print(f'\n{features[0:5, :]}')

## <span style="color:red"> Data Preparation, Feature Extraction and Modeling Pipeline</span>

- Create a pipeline that normalizes the data, selects the best features, and trains a classification model.
### [input] - [normalizer] - [feature-selection] - [Classifier] - [Predictions]
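
One practical reason to pair a min-max normalizer with the chi-squared test is that scikit-learn's `chi2` only accepts non-negative inputs. The sketch below, on synthetic data with assumed variable names, shows that standardized data (which contains negative values) is rejected while min-max scaled data is accepted.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data (illustration only)
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 100, size=(50, 5))
y_demo = rng.integers(0, 2, size=50)

# StandardScaler centers each column at zero, so some values become negative
X_std = StandardScaler().fit_transform(X_demo)
try:
    SelectKBest(score_func=chi2, k=2).fit(X_std, y_demo)
    chi2_accepts_standardized = True
except ValueError:
    chi2_accepts_standardized = False

# MinMaxScaler maps each column into [0, 1], which chi2 accepts
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)
X_selected = SelectKBest(score_func=chi2, k=2).fit_transform(X_minmax, y_demo)

print(chi2_accepts_standardized, X_selected.shape)
```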

In [None]:
# define pipeline workflows
estimator2 = []
estimator2.append(('minmaxscaler', MinMaxScaler(feature_range=(0, 1))))
estimator2.append(('select_best', SelectKBest(score_func=chi2, k=4)))
estimator2.append(('clf', DecisionTreeClassifier()))

# instantiate Pipeline class with pipeline workflows
pipe2 = Pipeline(estimator2)

## Evaluate Pipeline

In [None]:
# Instantiate KFold class with number of splits
kfold = KFold(n_splits=10)

# cross validation on Kfolds 
results = cross_val_score(pipe2, X, y, cv=kfold)

In [None]:
print("Accuracy per fold\n======================================")
for i in range(len(results)):
    print("Accuracy - Fold-{}  -> {}".format(i, results[i]))

print("\nAverage accuracy\n=====================================")
print("Accuracy - {}".format(results.mean()*100.0))
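
The per-fold accuracies from a decision tree may vary slightly between runs unless a seed is fixed. As a hedged sketch, the example below fixes `random_state` on both the classifier and a shuffled `KFold`, then reports the mean and standard deviation across folds. The synthetic data, the seed values, and the helper `run_cv` are assumptions for illustration and are not part of the notebook above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustration only)
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

def run_cv():
    """Hypothetical helper: build a seeded pipeline and cross-validate it."""
    pipe = Pipeline([
        ("standardize", StandardScaler()),
        ("clf", DecisionTreeClassifier(random_state=0)),
    ])
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(pipe, X_demo, y_demo, cv=cv)

first, second = run_cv(), run_cv()

# Report the spread across folds alongside the mean
print("Accuracy: {:.1f}% (+/- {:.1f}%)".format(first.mean() * 100.0, first.std() * 100.0))

# With fixed seeds, repeated runs give the same per-fold scores
print(np.array_equal(first, second))
```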