## Pipelines In SkLearn

Data collection --> Data cleaning --> Feature engineering/selection --> Model training --> Model evaluation

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Design plan 

We create one pipeline for each of three classifiers: logistic regression, decision tree, and random forest.

Each pipeline applies standard scaling, reduces the dimension from 4 to 2 using PCA, and then applies its classifier.
We then choose the classifier with the best test accuracy.

In [None]:
iris_df=load_iris()

In [None]:
# load_iris() returns a Bunch with data, feature_names, target, target_names, and DESCR
# Show the feature (column) names


iris_df.feature_names

In [None]:

iris_df.data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    iris_df.data, iris_df.target, test_size=0.25, random_state=42
)

## Pipelines Creation

### Data Preprocessing using Standard Scaler -->  Reduce Dimension using PCA-->  Apply  Classifier

Creating Three Pipelines for three different classifiers

# Standard scaler
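Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples and s is their standard deviation, both computed per feature.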

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

Ref: 
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#:~:text=where%20u%20is%20the%20mean,or%20one%20if%20with_std%3DFalse%20.
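To make the formula concrete, the sketch below computes the standard score by hand on the iris features and compares it with `StandardScaler`. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standard score: u is the per-feature mean, s the per-feature
# standard deviation (population form, ddof=0).
u = X.mean(axis=0)
s = X.std(axis=0)
z_manual = (X - u) / s

# The same transformation done by scikit-learn
z_sklearn = StandardScaler().fit_transform(X)

matches = np.allclose(z_manual, z_sklearn)
print("manual and StandardScaler agree:", matches)
print("column means:", z_sklearn.mean(axis=0).round(6))
print("column stds: ", z_sklearn.std(axis=0).round(6))
```

In the pipelines, the scaler learns u and s from the training split only and reuses them on the test split.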

# Principal component analysis (PCA).

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
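As a rough illustration of this description, the sketch below reduces the scaled iris features from 4 to 2 components and then repeats the projection by hand, using the fitted `mean_` and `components_` attributes. The variable names are illustrative, and the explained variance ratio is printed rather than assumed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 scaled features onto the first 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("reduced shape:", X_2d.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)

# PCA centers the data (subtracts mean_) but does not rescale it;
# the projection is then a matrix product with the component vectors.
X_centered = X_scaled - pca.mean_
X_2d_manual = X_centered @ pca.components_.T
same_projection = np.allclose(X_2d_manual, X_2d)
print("manual projection matches:", same_projection)
```

Because PCA does not scale the features itself, placing `StandardScaler` before it in the pipeline keeps features with large numeric ranges from dominating the components.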



In [None]:
# Each step is a (name, estimator) tuple; the last step is the classifier
pipeline_lr = Pipeline([('scaler1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression(random_state=0))])

In [None]:
pipeline_dt = Pipeline([('scaler2', StandardScaler()),
                        ('pca2', PCA(n_components=2)),
                        ('dt_classifier', DecisionTreeClassifier())])

In [36]:
pipeline_randomforest = Pipeline([('scaler3', StandardScaler()),
                                  ('pca3', PCA(n_components=2)),
                                  ('rf_classifier', RandomForestClassifier())])

In [37]:
# Let's make a list of the pipelines
pipelines = [pipeline_lr, pipeline_dt, pipeline_randomforest]

In [17]:
best_accuracy=0.0
best_classifier=0
best_pipeline=""

In [40]:
# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'RandomForest'}

# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)
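Once a pipeline is fitted, its intermediate steps can be inspected. The sketch below is a self-contained example with one logistic regression pipeline; the step names `scaler`, `pca`, and `clf` and the variable `demo_pipeline` are chosen here for illustration and are not the names used in the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(random_state=0)),
])
# fit() fits each transformer on the training data in order,
# then fits the classifier on the transformed result
demo_pipeline.fit(X_train, y_train)

# Fitted steps are available by name
fitted_pca = demo_pipeline.named_steps["pca"]
print("explained variance ratio:", fitted_pca.explained_variance_ratio_)

# Slicing off the last step gives a pipeline of the transformers only
X_test_2d = demo_pipeline[:-1].transform(X_test)
print("transformed test shape:", X_test_2d.shape)

# predict() applies the same fitted transformers before the classifier
predictions = demo_pipeline.predict(X_test)
```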

In [None]:
# Print the test accuracy of each classifier
for i, model in enumerate(pipelines):
    print(f"{pipe_dict[i]} Test Accuracy: {model.score(X_test, y_test)}")

In [None]:
# Keep the pipeline with the highest test accuracy (on a tie, the earlier one wins)
for i, model in enumerate(pipelines):
    score = model.score(X_test, y_test)  # score once per pipeline
    if score > best_accuracy:
        best_accuracy = score
        best_pipeline = model
        best_classifier = i
print('Classifier with best accuracy: {}'.format(pipe_dict[best_classifier]))
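Picking a winner from a single test split of about 38 samples can be noisy. As an optional alternative, the sketch below compares the same three kinds of pipeline with 5-fold cross-validation on the training data. The `random_state=0` on the tree-based models, the 5 folds, and the names `candidates` and `cv_means` are assumptions made here for reproducibility, not part of the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

# random_state is fixed here only so repeated runs give the same numbers
candidates = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

cv_means = {}
for name, clf in candidates.items():
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", clf),
    ])
    # The whole pipeline is refitted inside each fold, so the scaler and
    # PCA never see the held-out fold during fitting
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name} mean CV accuracy: {cv_means[name]:.3f}")

best_name = max(cv_means, key=cv_means.get)
print("Best by cross-validation:", best_name)
```

With this approach the test split is left untouched until the final check of the chosen pipeline.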

## make_pipeline In SkLearn

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV  # used for the grid search below

In [None]:
# Create a pipeline; make_pipeline names the step "randomforestclassifier"
pipe = make_pipeline(RandomForestClassifier())
# Create a list of dictionaries with candidate learning algorithms and their hyperparameters
grid_param = [
    {"randomforestclassifier": [RandomForestClassifier()],
     "randomforestclassifier__n_estimators": [10, 100, 1000],
     "randomforestclassifier__max_depth": [5, 8, 15, 25, 30, None],
     "randomforestclassifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
     "randomforestclassifier__max_leaf_nodes": [2, 5, 10]}]
# Create a grid search over the pipeline with 5-fold cross-validation
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)
# fit() returns the fitted GridSearchCV object, which predicts with the best estimator found
best_model = gridsearch.fit(X_train, y_train)

In [None]:
best_model.score(X_test,y_test)
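To see which hyperparameters a search selected, the fitted `GridSearchCV` object exposes `best_params_` and `best_score_`. The sketch below uses a deliberately small grid so it runs quickly; the grid values, `random_state=0`, and the names `small_grid` and `search` are illustrative assumptions, not the grid used above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

# make_pipeline derives each step name from the lowercased class name
pipe = make_pipeline(RandomForestClassifier(random_state=0))
step_names = [name for name, _ in pipe.steps]
print("step names:", step_names)

# Parameters are addressed as <step name>__<parameter name>
small_grid = {
    "randomforestclassifier__n_estimators": [10, 50],
    "randomforestclassifier__max_depth": [3, None],
}
search = GridSearchCV(pipe, small_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```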