# How to create a machine learning pipeline using scikit-learn

* First of all, let us see what a pipeline is.
- It is a very important concept for data scientists and for people in software engineering.
- A pipeline is created to allow data to flow from its raw format to some useful information.
- It also provides a mechanism for building several ML pipelines side by side in order to compare the results of several ML methods.
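
To make the idea concrete, here is a minimal sketch (not part of the original notebook) that runs the same two steps twice: once by hand and once through a `Pipeline`. The names `scaler`, `manual_model` and `pipe`, and the `max_iter=1000` setting, are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Manual version: fit each step by hand and pass its output to the next one.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
manual_model = LogisticRegression(random_state=0, max_iter=1000).fit(X_scaled, y)
manual_pred = manual_model.predict(scaler.transform(X))

# Pipeline version: a single object runs the same steps in the same order.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=0, max_iter=1000)),
])
pipe_pred = pipe.fit(X, y).predict(X)

# Both versions should produce the same predictions.
same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```

The pipeline version is shorter, and it also guarantees that the scaler fitted on the training data is the one applied at prediction time.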

 
 * In this example we will use the
 <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html?highlight=breast%20cancer%20wisconsin%20diagnostic%20dataset">breast_cancer</a> dataset.

In [13]:
# necessary imports 
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB 

In [14]:
# load the dataset 
df_breast_concer = load_breast_cancer()
df_breast_concer.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])
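
Before going further, it can help to look at what the loaded object contains. The sketch below is an addition to the notebook; it assumes only the standard attributes of the object returned by `load_breast_cancer()` (`data`, `target`, `target_names`, `feature_names`). Note that, despite the `df_` prefix in its name, this object is a scikit-learn `Bunch`, not a pandas DataFrame.

```python
from sklearn.datasets import load_breast_cancer

df_breast_concer = load_breast_cancer()

# Number of samples (rows) and features (columns)
n_samples, n_features = df_breast_concer.data.shape
print(n_samples, n_features)

# Class labels: target value 0 is the first name, 1 is the second
target_names = list(df_breast_concer.target_names)
print(target_names)

# First few feature names
print(list(df_breast_concer.feature_names[:3]))
```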

In [15]:
# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    df_breast_concer.data, df_breast_concer.target, test_size=0.3, random_state=0
)
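
As a possible variant (an assumption, not what the notebook does), the split can also be stratified so that both sets keep roughly the same class balance. The sketch below adds `stratify=y` to the same call and checks the resulting shapes; `train_share` and `test_share` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same test_size and random_state as in the notebook, plus stratify=y
# (an addition) to preserve the class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)

# Share of the positive class (target == 1) in each set
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```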

In [16]:
# Each pipeline standardizes the features with StandardScaler,
# reduces them to 2 principal components with PCA, then fits a classifier.

# Logistic regression
pipeline_lr = Pipeline([('sc1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression(random_state=0))])

# Decision tree
pipeline_dt = Pipeline([('sc2', StandardScaler()),
                        ('pca2', PCA(n_components=2)),
                        ('dt_classifier', DecisionTreeClassifier())])

# Random forest
pipeline_randomforest = Pipeline([('sc3', StandardScaler()),
                                  ('pca3', PCA(n_components=2)),
                                  ('rf_classifier', RandomForestClassifier())])

# Naive Bayes
pipeline_naivebayes = Pipeline([('sc4', StandardScaler()),
                                ('pca4', PCA(n_components=2)),
                                ('nb_classifier', GaussianNB())])
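
Since the four pipelines differ only in their last step, they could also be built with a small helper. This is a sketch of one possible refactoring, not the notebook's own code; `build_pipeline`, `classifiers`, `pipelines` and the step names are hypothetical.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_pipeline(classifier):
    """Return a scaler -> 2-component PCA -> classifier pipeline (hypothetical helper)."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("classifier", classifier),
    ])


# One entry per model, keyed by a readable name
classifiers = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "Decision Tree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
}

pipelines = {name: build_pipeline(clf) for name, clf in classifiers.items()}
print(list(pipelines))
```

Keying the pipelines by name also removes the need for a separate index-to-name dictionary.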

In [20]:
# Collect the pipelines in a list
my_pipelines = [pipeline_lr,pipeline_dt,pipeline_randomforest,pipeline_naivebayes]

# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'RandomForest',3: "Naive Bayes"}

# Fit the pipelines
for pip in my_pipelines:
    pip.fit(X_train, y_train)

In [30]:
# Accuracy for each model 
def accuracy():
    for v,model in enumerate(my_pipelines):
        print("{} Test Accuracy is: {}".format(pipe_dict[v],model.score(X_test,y_test)))
accuracy()

Logistic Regression Test Accuracy is: 0.935672514619883
Decision Tree Test Accuracy is: 0.8888888888888888
RandomForest Test Accuracy is: 0.9064327485380117
Naive Bayes Test Accuracy is: 0.8947368421052632
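
A single train/test split gives only one accuracy estimate per model. As a hedged complement (not part of the original notebook), the sketch below scores the logistic regression pipeline with 5-fold cross-validation using `cross_val_score`; the choice of `cv=5` is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline_lr = Pipeline([
    ("sc1", StandardScaler()),
    ("pca1", PCA(n_components=2)),
    ("lr_classifier", LogisticRegression(random_state=0)),
])

# The whole pipeline is refit on each training fold, so the scaler and PCA
# never see the held-out fold.
scores = cross_val_score(pipeline_lr, X, y, cv=5)
print("Mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

Passing the pipeline itself (rather than pre-scaled data) to `cross_val_score` is what keeps the preprocessing inside each fold.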


In [31]:
best_accuracy = 0.0
best_classifier = 0
best_pipeline = None

# Find the model with the best test accuracy
for v, model in enumerate(my_pipelines):
    score = model.score(X_test, y_test)  # compute the score once per model
    if score > best_accuracy:
        best_accuracy = score
        best_pipeline = model
        best_classifier = v
print('Classifier with best accuracy :{}'.format(pipe_dict[best_classifier]))

Classifier with best accuracy :Logistic Regression
