# Introduction
<hr style="border:2px solid black"> </hr>


**What?** How to use a `Pipeline` in scikit-learn to avoid data leakage



In [None]:
"""
An easy trap to fall into in applied machine learning is leaking information
from your test dataset into your training dataset. To avoid this trap you need
a robust test harness with strong separation of training and testing.
Pipelines help prevent data leakage in your test harness by ensuring that
data preparation steps such as standardization are fitted only on the
training portion of each fold of your cross-validation procedure.
"""
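
To make "constrained to each fold" concrete, the sketch below fits the scaler by hand on the training rows of every fold and compares the scores with those of an equivalent `Pipeline`. It is an illustration only: the synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `manual_scores` and `pipeline_scores` are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Pima dataset used in this notebook)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Manual version: the scaler only ever sees the training rows of each fold
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    lda = LinearDiscriminantAnalysis()
    lda.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(
        lda.score(scaler.transform(X_demo[test_idx]), y_demo[test_idx])
    )

# Pipeline version: cross_val_score refits every step on each training fold
pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=kfold)

print("Manual mean:  ", np.mean(manual_scores))
print("Pipeline mean:", pipeline_scores.mean())
```

Both approaches should report the same per-fold accuracy, which suggests the pipeline is simply a safer and shorter way of writing the manual loop.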

In [17]:
# Import python modules
from pandas import read_csv
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import KFold
from IPython.display import Markdown, display
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [2]:
# Additional cosmetic function
def myPrint(string, c = "blue"):
    """Display a string as bold, coloured Markdown text.

    The colour defaults to blue and can be changed via `c`.
    """
    colorstr = "<span style='color:{}'>{}</span>".format(c, '**' + string + '**')
    display(Markdown(colorstr))

In [9]:
myPrint("Importing dataset")
filename = './datasetCollections/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
print("Input size: ", X.shape)
print("Labels size: ", Y.shape)

<span style='color:blue'>**Importing dataset**</span>

Input size:  (768, 8)
Labels size:  (768,)


In [16]:
myPrint("Create a pipeline that standardizes the data then creates a model")

"""
The pipeline is defined with two steps:
[1] Standardize the data
[2] Learn a Linear Discriminant Analysis model

The pipeline is then evaluated using 10-fold cross validation
"""

# Create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
pipeLineModel = Pipeline(estimators)
# evaluate pipeline
kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)
results = cross_val_score(pipeLineModel, X, Y, cv = kfold)
print("Mean: ", results.mean())

print("")
# Inspect the attributes and parameters of the pipeline defined above
print(dir(pipeLineModel))
print(pipeLineModel.get_params())

<span style='color:blue'>**Create a pipeline that standardizes the data then creates a model**</span>

Mean:  0.7669685577580315

['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_fit_params', '_check_n_features', '_estimator_type', '_final_estimator', '_fit', '_get_param_names', '_get_params', '_get_tags', '_inverse_transform', '_iter', '_log_message', '_more_tags', '_pairwise', '_replace_estimator', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_set_params', '_sk_visual_block_', '_transform', '_validate_data', '_validate_names', '_validate_steps', 'classes_', 'decision_function', 'fit', 'fit_predict', 'fit_transform', 'get_params', 'inverse_t
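
The raw `dir()` listing is hard to read, so here is a more targeted way to inspect a pipeline, offered as a sketch rather than the notebook's own method. It rebuilds the same two steps under the hypothetical name `demo_pipeline` and relies on `named_steps`, `get_params()` and `set_params()`, which address nested parameters as `<step name>__<parameter name>`.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical copy of the pipeline above, rebuilt so the example is self-contained
demo_pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])

# Steps are accessible by the names given when the pipeline was built
print(demo_pipeline.named_steps["lda"])

# Nested parameters are exposed as "<step name>__<parameter name>"
params = demo_pipeline.get_params()
print(sorted(key for key in params if "__" in key))

# The same naming scheme is used to change a parameter of a single step
demo_pipeline.set_params(lda__solver="lsqr")
print(demo_pipeline.named_steps["lda"].solver)
```

The double-underscore names are also what a hyperparameter search would use to tune a step inside the pipeline.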

In [29]:
myPrint("Features extraction within the pipeline")

"""
Feature extraction is another procedure that is susceptible to data leakage.
Like data preparation, feature extraction procedures must be restricted to
the data in your training dataset. scikit-learn's FeatureUnion allows the
results of multiple feature selection and extraction procedures to be
combined into a larger dataset on which a model can be trained.

The code below performs 4 steps:
[1] Feature extraction with Principal Component Analysis (3 features)
[2] Feature selection with univariate statistical tests (6 features)
[3] Feature union
[4] Learn a Logistic Regression model
"""

# Create feature union
features = []
features.append(('pca', PCA(n_components = 3)))
features.append(('select_best', SelectKBest(k = 6)))
feature_union = FeatureUnion(features)

# Create pipeline
estimators = []
estimators.append(('feature_union', feature_union)) 
estimators.append(('logistic', LogisticRegression(max_iter = 250)))
model = Pipeline(estimators)

# Evaluate pipeline
kfold = KFold(n_splits=10, shuffle = True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Mean", results.mean())

<span style='color:blue'>**Features extraction within the pipeline**</span>

Mean 0.7721633629528366
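
Why feature selection must live inside the pipeline can be illustrated with a toy experiment on pure noise. This is a hedged sketch: the random data and the names `X_noise`, `y_noise`, `leaky_score` and `safe_score` are assumptions introduced for the demonstration. Since the features carry no information about the labels, an honest estimate should sit near chance, while selecting features on all rows before cross-validation typically inflates the score.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: the features carry no information about the labels (assumed toy data)
rng = np.random.RandomState(0)
X_noise = rng.normal(size=(100, 5000))
y_noise = np.array([0, 1] * 50)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Leaky: features are selected using ALL rows, including the future test folds
X_leaky = SelectKBest(k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=kfold
).mean()

# Safe: selection is a pipeline step, so it is refitted on each training fold
safe_model = Pipeline([
    ("select_best", SelectKBest(k=20)),
    ("logistic", LogisticRegression(max_iter=1000)),
])
safe_score = cross_val_score(safe_model, X_noise, y_noise, cv=kfold).mean()

print("Leaky estimate:", leaky_score)
print("Safe estimate: ", safe_score)
```

The gap between the two numbers is the optimism introduced by leakage; on real data the effect is usually smaller but works in the same direction.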


# References
<hr style="border:2px solid black"> </hr>


- https://machinelearningmastery.com/data-preparation-without-data-leakage/

