# Machine Learning with Scikit-Learn

#### Introduction to Machine Learning

__Why Machine Learning?__

We generate 2.5 quintillion bytes of data every day, and none of it arrives in a single, uniform format. That is equal to the data stored on 100 million Blu-ray discs; stacked up, they would be as tall as four Eiffel Towers. Processing this much data manually is not possible. Machine learning helps analyze it easily and quickly.

__Purpose of Machine Learning__

Machine learning is a great tool to analyze data, find hidden data patterns and relationships, and extract information to enable information-driven decisions and provide insights.

Machine learning analyzes the data and helps us identify patterns and relationships, gain insights into unknown data, and make information-driven decisions. To do this, ML applies statistical and mathematical models to the dataset.

This process can be either __semi-automated__ or __fully automated__.

##### Machine Learning Terminology:

These are some machine learning terms that you will come across in this lesson:
1. __Observations__:
    * The rows of the given dataset, also called records, examples, or samples.
    
2. __Features__:
    * Inputs or attributes (columns) that define a given dataset.
    
3. __Response__:
    * The label, outcome, target, or defined answer attached to each observation in the given dataset.
    

##### Machine Learning Approach
The machine learning approach starts with either a problem that you need to solve or a given dataset that you need to analyze.
1. Understand the problem/dataset
2. Extract the features from the dataset
3. Identify the problem type
4. Choose the right model
5. Train and test the model
6. Strive for accuracy

###### Steps 1 & 2
![Steps%201%20and%202.PNG](attachment:Steps%201%20and%202.PNG)

###### Steps 3 and 4: Identify the Problem Type and Learning Model

1. __Concept:__

Machine learning can be either supervised or unsupervised. The problem type is then selected based on the type of learning model.

__Supervised Learning__:

In supervised learning, the dataset used to train a model should have observations, features, and responses. The model is trained to predict the “right” response for a given set of data points. Supervised learning models are used to predict an outcome. The goal of this model is to “generalize” a dataset so that the “general rule” can be applied to new data as well.

__Unsupervised Learning__:

In unsupervised learning, the response or the outcome of the data is not known. Unsupervised learning models are used to identify and visualize patterns in data by grouping similar types of data. The goal of this model is to “represent” data in a way that meaningful information can be extracted.

2. __Problem Types:__

Data can either be continuous or categorical. Based on whether it is supervised or unsupervised learning, the problem type will differ.

![problemtype.PNG](attachment:problemtype.PNG)

3. __Examples:__

__Supervised Learning:__ Categorizing news articles by topic.

__Unsupervised Learning:__ Grouping similar stories from different news networks.

###### Working of Supervised Learning Model
In supervised learning, a known dataset with observations, features, and response is used to create and train a machine learning algorithm. A predictive model, built on top of this algorithm, is then used to predict the response for a new dataset that has the same features.

###### Working of Unsupervised Learning Model
In unsupervised learning, a known dataset has a set of observations with features, but the response is not known. The model uses these features to learn how to group and represent the data points of new or unseen data.

###### Steps 5 and 6: Train, Test, and Optimize the Model
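To train supervised learning models, data analysts usually divide a known dataset into training and testing sets. The model is fit on the training set and evaluated on the testing set, which it has not seen during training.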
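As a minimal sketch of this split, the example below uses Scikit-Learn's `train_test_split` on a small made-up array. The toy data, the 25% test share, and the `random_state` value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset (made up for illustration): 20 observations, 2 features
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 25% of the observations for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Shapes of the training and testing features
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split repeatable, which helps when comparing models on the same held-out data.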

![step_5and6.PNG](attachment:step_5and6.PNG)

![step_5and6_2.PNG](attachment:step_5and6_2.PNG)

![image.png](attachment:image.png)

###### Supervised Learning Model Considerations
Some considerations of supervised and unsupervised learning models are shown here.

![sup_mod_consideration.PNG](attachment:sup_mod_consideration.PNG)

#### Scikit-Learn
Scikit-learn is a powerful, modern Python machine learning library for fully and semi-automated data analysis and information extraction. It offers:

* Efficient tools to identify and organize problems (supervised/unsupervised)
* Free and open datasets
* Rich set of libraries for learning and predicting
* Model support for every problem type
* Model persistence
* Open source community and vendor support

##### Scikit-Learn: Problem-Solution Approach
Scikit-learn helps Data Scientists organize their work through its problem-solution approach.

![sckit-learn.PNG](attachment:sckit-learn.PNG)

##### Scikit-Learn: Problem-Solution Considerations
While working with a Scikit-Learn dataset or loading your own data into Scikit-Learn, always consider these points:
* Create separate objects for features and response
* Ensure that features and response have only numeric values
* Features and response should be in the form of a NumPy ndarray
* Since features and response are arrays, they have shapes and sizes; the number of observations must match in both
* By convention, features are mapped to X and the response is mapped to y
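The points above can be checked directly on a built-in dataset. This is a small sketch using `load_iris`; the variable names `X` and `y` follow the usual convention and are not required by the library.

```python
import numpy as np
from sklearn.datasets import load_iris

# return_X_y=True yields the features and the response as separate objects
X, y = load_iris(return_X_y=True)

# Both objects are NumPy ndarrays holding only numeric values
print(type(X), type(y))
print(X.dtype, y.dtype)

# X is 2-D (observations x features); y is 1-D (one response per observation)
print(X.shape, y.shape)
```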

#### Supervised Learning Models
##### Supervised Learning Models: Linear Regression
Linear regression is a supervised learning model used to analyze continuous data.
* It is easy to use as the model does not require a lot of tuning
* It is the most basic and widely used technique to predict a value of an attribute
* It runs very fast, which makes it more time-efficient

The linear regression equation is based on the formula for a simple linear equation. 
![Lin_reg_formu.PNG](attachment:Lin_reg_formu.PNG)

Linear regression is the most basic technique to predict a value of an attribute.
![Lin_reg_formu2.PNG](attachment:Lin_reg_formu2.PNG)
__The model coefficients are usually fitted using the “least squares” approach.__

The smaller the sum of squared residuals (SSR), also called the sum of squared errors (SSE), the closer the predictions are to the observed values and the better the model fits.
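To make the least squares idea concrete, here is a minimal NumPy sketch. It assumes the usual single-feature form, prediction = intercept + slope * x, and uses made-up data; the second line it compares against is arbitrary and exists only to show that the fitted line has the smaller SSE.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with a column of ones so the fit includes an intercept
A = np.column_stack([np.ones_like(x), x])

# Least squares: pick the coefficients that minimize the sum of squared errors
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

# SSE of the fitted line, and of a deliberately worse line for comparison
sse_fit = np.sum((y - (intercept + slope * x)) ** 2)
sse_other = np.sum((y - (1.0 + 1.5 * x)) ** 2)

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"SSE fitted={sse_fit:.3f}, SSE other line={sse_other:.3f}")
```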

![Lin_reg_formu3.PNG](attachment:Lin_reg_formu3.PNG)

Let us see how linear regression works in Scikit-Learn
![Lin_reg_formu4.PNG](attachment:Lin_reg_formu4.PNG)

###### Demo - 01: Loading a Dataset
__Problem Statement: Demonstrate how to load a built-in scikit-learn dataset.__

In [1]:
#import necessary libraries
import numpy as np
import pandas as pd

#import the dataset loader from scikit-learn
#note: load_boston was removed in scikit-learn 1.2, so this demo needs an older release
from sklearn.datasets import load_boston
boston_dataset = load_boston()

#use built-in attributes to explore & understand the data
print(boston_dataset['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [2]:
#print the features of the dataset
print(boston_dataset['feature_names'])

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [3]:
#store data into a dataframe
df_boston = pd.DataFrame(boston_dataset.data)

In [4]:
#set features as columns in the dataset
df_boston.columns = boston_dataset.feature_names
#view first 5 rows
df_boston.head()

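|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |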


In [5]:
#print dataset matrix (observations & Features matrix)
print(boston_dataset.data.shape)

(506, 13)


In [6]:
#print dataset target or response shape
print(boston_dataset.target.shape)

(506,)


In [7]:
#view target or response
print(boston_dataset['target'])

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22.  20.3 20.5 17.3 18.8 21.4
 15.7 16.2 18.  14.3 19.2 19.6 23.  18.4 15.6 18.1 17.4 17.1 13.3 17.8
 14.  14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
 17.  15.6 13.1 41.3 24.3 23.3 27.  50.  50.  50.  22.7 25.  50.  23.8
 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
 37.9 32.5 26.4 29.6 50.  32.  29.8 34.9 37.  30.5 36.4 31.1 29.1 50.
 33.3 3

###### Demo - 02: Linear Regression model
__Problem Statement: Demonstrate how to create and train a linear regression model and calculate its mean squared error (MSE) and variance score on the given dataset.__

In [8]:
#import necessary Libraries
import numpy as np
import pandas as pd

#import Sci-kit learn datasets
from sklearn.datasets import load_boston
boston_dataset = load_boston()

#store data into a dataframe
df_boston = pd.DataFrame(boston_dataset.data)

#set features as columns in the dataset
df_boston.columns = boston_dataset.feature_names

#append the target as a new 'Price' column (added as the last column)
df_boston['Price'] = boston_dataset.target

#print top 5 observations
df_boston.head()

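|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |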


In [9]:
#assign the features to X_features
X_features = boston_dataset.data

#assign the target (response) to Y_target
Y_target = boston_dataset.target

#import the Linear Model (the Estimator)
from sklearn.linear_model import LinearRegression
lineReg = LinearRegression()

#fit the data into the estimator
lineReg.fit(X_features,Y_target)

LinearRegression()

In [10]:
#print the intercept
print ('The estimator intercept %.2f'%lineReg.intercept_)

The estimator intercept 36.46


In [11]:
#print the number of coefficients (one per feature)
print('The number of coefficients is %d'%len(lineReg.coef_))

The number of coefficients is 13


In [12]:
#split the whole dataset into training and testing sets (model selection)
from sklearn import model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X_features,Y_target)

#print the dataset shape
print(boston_dataset.data.shape)

(506, 13)


In [13]:
#Print shape of training and testing dataset
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(379, 13) (127, 13) (379,) (127,)


In [14]:
#fit training sets to model
lineReg.fit(X_train,Y_train)

LinearRegression()

In [15]:
#the Mean Squared Error (MSE): the mean of the squared residuals on the test set
print('MSE value is %.2f'%np.mean((lineReg.predict(X_test)-Y_test)**2))

MSE value is 19.95


In [16]:
#calculate the variance score (R^2, the coefficient of determination) using lineReg.score(X_test,Y_test)
print('Variance Score is %.2f'%lineReg.score(X_test,Y_test))

Variance Score is 0.70


__The closer the variance score is to 1, the higher the model accuracy.__
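The same two numbers can also be obtained from `sklearn.metrics`. The sketch below uses synthetic data from `make_regression` rather than the Boston dataset (an assumption made so that it runs on current Scikit-Learn releases); the sample size, noise level, and random seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration, not the Boston data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE: mean of the squared residuals on the test set
mse = mean_squared_error(y_test, y_pred)
# R^2: the same value that model.score(X_test, y_test) returns
r2 = r2_score(y_test, y_pred)

print(f"MSE={mse:.2f}, R^2={r2:.2f}")
```

MSE is expressed in squared units of the response, while R^2 is unitless, so R^2 is the easier of the two to compare across datasets.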

##### Supervised Learning Models: Logistic Regression
Logistic regression is a generalization of the linear regression model used for classification problems.

![logi_reg1.PNG](attachment:logi_reg1.PNG)

__The purpose of logistic regression is to predict the class for each observation.__

![logi_reg2.PNG](attachment:logi_reg2.PNG)

![logi_reg3.PNG](attachment:logi_reg3.PNG)
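One way to see how a linear model becomes a classifier is the sigmoid function, sketched below for the binary case. This assumes the standard formulation, in which a linear score is passed through a sigmoid and thresholded at 0.5; the intercept and coefficient are hypothetical values, not fitted ones.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature binary classifier
intercept, coef = -4.0, 2.0

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
# Linear score, computed exactly as in linear regression
z = intercept + coef * x
# Probability of the positive class
p = sigmoid(z)
# Predict class 1 when the probability is at least 0.5
labels = (p >= 0.5).astype(int)

print(p.round(3))
print(labels)
```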

##### Supervised Learning Models: K-Nearest Neighbors
K-Nearest Neighbors, or K-NN, is one of the simplest machine learning algorithms used for both classification and regression problem types.
![kmeans1.PNG](attachment:kmeans1.PNG)

If you are using this method for binary classification, choose an odd number for k to avoid a tie between the two classes.
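The voting idea behind K-NN classification can be sketched in a few lines of NumPy. The helper `knn_predict` and the toy data are hypothetical and written only for illustration; in practice `KNeighborsClassifier` does this work.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest observations
    nearest = np.argsort(distances)[:k]
    # Most common class label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy two-class data (made up for illustration)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))
print(knn_predict(X_train, y_train, np.array([4.8, 5.0]), k=3))
```

With two classes and an odd k, the vote count can never be split evenly, which is why an odd k is recommended for binary problems.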

![kmeans2.PNG](attachment:kmeans2.PNG)

![kmeans3.PNG](attachment:kmeans3.PNG)

###### Demo - 03: K-NN and Logistic Regression models
__Problem Statement: Demonstrate the use of K-NN and logistic regression models__

__K-NN Model__

In [17]:
#import necessary Libraries
import numpy as np
import pandas as pd

#import Sci-kit learn datasets
from sklearn.datasets import load_iris
Iris_dataset = load_iris()

#display the dataset type
type(Iris_dataset)

sklearn.utils.Bunch

In [18]:
#view information using the built-in DESCR (description) attribute
print(Iris_dataset.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [19]:
#view features
print(Iris_dataset.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [20]:
#view target(response)
print (Iris_dataset.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [21]:
#find the number of observations and features
print(Iris_dataset.data.shape)

(150, 4)


In [22]:
#assign the feature data to X_features
X_features = Iris_dataset.data
#assign the target data to Y_target
Y_target = Iris_dataset.target

#view the shape of both objects
print(X_features.shape, Y_target.shape)

(150, 4) (150,)


In [23]:
#use the KNN classifier to find the nearest data points for a given value of K
#import KNN from sklearn
from sklearn.neighbors import KNeighborsClassifier

#instantiate the KNN classifier (the estimator)
KNN = KNeighborsClassifier(n_neighbors=1)
#print the KNN
print(KNN)

KNeighborsClassifier(n_neighbors=1)


In [24]:
#fit the data into KNN model(estimator)
KNN.fit(X_features,Y_target)

KNeighborsClassifier(n_neighbors=1)

In [25]:
#create object with new values for prediction
X_new = [[3,5,4,1],[5,3,4,2]]

#predict the outcome of the new observation using KNN Classifier
KNN.predict(X_new)

array([1, 1])

__Logistic Regression Model__

In [26]:
#use the Logistic Regression as Estimator
from sklearn.linear_model import LogisticRegression
#LogReg = LogisticRegression(random_state = 0,solver = 'liblinear',multi_class = 'auto')
LogReg = LogisticRegression(solver = 'liblinear')
print(LogReg)

LogisticRegression(solver='liblinear')


In [27]:
#fit the data into the logistic regression estimator
LogReg.fit(X_features, Y_target)

LogisticRegression(solver='liblinear')

In [28]:
#predict the outcome using the logistic regression estimator
LogReg.predict(X_new)

array([0, 2])

#### Unsupervised Learning Models: Clustering
A cluster is a group of similar data points.

Clustering is used to:
* Extract the structure of the data
* Identify groups in the data

__Greater similarity between data points results in better clustering.__

![unsup1.PNG](attachment:unsup1.PNG)

##### Unsupervised Learning Models: K-Means Clustering
K-means finds the best centroids by alternating between two steps: assigning each data point to its nearest centroid (starting from randomly chosen centroids), and recomputing each centroid as the mean of the data points assigned to it. It repeats this process until the model is optimized, that is, until the centroids stop moving.
![unsup2.PNG](attachment:unsup2.PNG)


![unsup3.PNG](attachment:unsup3.PNG)

Choose a mean from each cluster as a centroid
![unsup4.PNG](attachment:unsup4.PNG)

Reassign data points to new centroids
![unsup5.PNG](attachment:unsup5.PNG)

Iterate the process till the model is optimized
![unsup6.PNG](attachment:unsup6.PNG)
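The assign, update, and iterate steps can be sketched in plain NumPy. The `kmeans` function below is a hypothetical bare-bones version written for illustration, not Scikit-Learn's implementation; it picks random observations as starting centroids and stops when the centroids no longer move.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """A bare-bones k-means loop; illustrative only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (made up for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))
```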

Let us see how the k-means algorithm works in Scikit-Learn.
![unsup7.PNG](attachment:unsup7.PNG)

###### Demo - 04 - K-Means Clustering
__Problem Statement: Demonstrate how to use k-means clustering to group data points into clusters.__

In [29]:
#import required libraries
import numpy as np
import pandas as pd
#import the KMeans class from sklearn.cluster
from sklearn.cluster import KMeans

#import the make_blobs dataset generator
from sklearn.datasets import make_blobs

#define number of samples
n_samples = 300
#define random state value to initialize the centroids
random_state = 20

#generate a dataset with 5 features
#note: random_state=None draws different data on every run
X,y = make_blobs(n_samples=n_samples, n_features=5, random_state=None)

#form 3 clusters with a fixed random state, fit the features, and predict a cluster for each data point
predict_y = KMeans(n_clusters=3, random_state=random_state).fit_predict(X)

#print the estimator prediction
predict_y

array([1, 1, 1, 2, 2, 0, 1, 0, 1, 1, 0, 2, 1, 1, 0, 2, 1, 0, 2, 0, 2, 2,
       2, 0, 2, 1, 1, 2, 1, 0, 0, 2, 1, 2, 0, 1, 1, 1, 2, 1, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 2, 0, 1, 2, 1, 0, 1, 0, 2, 2, 0, 2, 0, 0, 2, 1, 1,
       1, 2, 1, 2, 1, 2, 2, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 2, 1, 0, 2,
       0, 0, 0, 2, 0, 0, 1, 2, 1, 2, 0, 0, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1,
       0, 1, 1, 2, 2, 0, 0, 2, 2, 2, 0, 0, 1, 2, 2, 1, 2, 0, 2, 2, 2, 0,
       1, 1, 0, 2, 2, 0, 0, 2, 2, 1, 0, 2, 2, 1, 0, 0, 1, 0, 2, 0, 0, 1,
       1, 1, 0, 0, 1, 2, 0, 1, 1, 2, 0, 1, 0, 0, 2, 1, 1, 0, 2, 1, 2, 0,
       1, 0, 1, 0, 2, 1, 0, 2, 0, 1, 0, 0, 2, 2, 0, 2, 0, 0, 1, 1, 1, 0,
       0, 0, 2, 1, 0, 0, 2, 1, 1, 1, 2, 0, 2, 1, 0, 1, 0, 0, 1, 1, 1, 2,
       2, 2, 0, 2, 0, 1, 1, 1, 0, 2, 0, 1, 0, 0, 1, 2, 2, 0, 1, 2, 2, 1,
       2, 2, 2, 0, 0, 2, 0, 2, 2, 1, 1, 2, 1, 2, 2, 1, 0, 1, 1, 1, 1, 1,
       2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 2, 0, 1, 2, 2, 2,
       2, 2, 1, 0, 1, 2, 1, 1, 0, 2, 1, 0, 0, 2])

The output is the cluster label of each data point. Each label shows which cluster the data point belongs to.

##### Unsupervised Learning Models: Dimensionality Reduction
Dimensionality reduction converts a high-dimensional dataset into a dataset with fewer dimensions. This makes it easier and faster for the algorithm to analyze the data.

![unsup8.PNG](attachment:unsup8.PNG)
These are some techniques used for dimensionality reduction:

![image.png](attachment:image.png)

##### Unsupervised Learning Models: Principal Component Analysis
Principal component analysis (PCA) is a linear dimensionality reduction method that uses singular value decomposition of the data and keeps only the most significant singular vectors to project the data to a lower-dimensional space.
* It is primarily used to compress or reduce the data.
* PCA tries to capture the variance, which helps it pick up interesting features.
* PCA is used to reduce dimensionality in the dataset and to build our feature vector.
* Here, the principal axes in the feature space represent the directions of maximum variance in the data.
![unsup10.PNG](attachment:unsup10.PNG)
__This method is used to capture variance.__
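The link between PCA and singular value decomposition can be sketched with NumPy and cross-checked against Scikit-Learn. The random data, the seed, and the choice of two components are assumptions for illustration. Component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data with 6 features (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))

# Step 1: center each feature at zero
X_centered = X - X.mean(axis=0)

# Step 2: singular value decomposition of the centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: keep the top right singular vectors as principal axes and project
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

# Share of the total variance captured by each kept component
explained_ratio = (S ** 2)[:n_components] / np.sum(S ** 2)

# Cross-check against scikit-learn (signs may differ, so compare magnitudes)
pca = PCA(n_components=n_components).fit(X)
print(np.allclose(np.abs(pca.transform(X)), np.abs(X_reduced)))
print(np.allclose(pca.explained_variance_ratio_, explained_ratio))
```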

![unsup11.PNG](attachment:unsup11.PNG)

###### Demo - 05: Principal Component Analysis(PCA)
__Problem Statement: Demonstrate how to use the PCA model to reduce the dimensions of a dataset.__

In [30]:
#import required library
from sklearn.decomposition import PCA

#import dataset generator
from sklearn.datasets import make_blobs

#define sample count and random state
n_sample = 20
random_state = 20

#generate the dataset with 10 features (dimensions)
#note: random_state=None draws different data on every run, so the variable above is unused
X,Y = make_blobs(n_samples = n_sample, n_features = 10, random_state = None)

#view the shape of the dataset
X.shape

(20, 10)

In [31]:
#define the PCA estimator with the number of reduced components
pca = PCA(n_components=3)

#fit data into the PCA estimator
pca.fit(X)

print(pca.explained_variance_ratio_)

[0.65355764 0.32635588 0.00577919]


In [32]:
#print the first PCA component
first_pca= pca.components_[0]
print(first_pca)

[ 0.09363603 -0.15707354 -0.52456026  0.15954887  0.36759938  0.46877667
 -0.04549001 -0.43837883  0.28934121  0.18192404]


In [33]:
#project the data onto the principal components using the transform method
pca_reduced = pca.transform(X)

#view the reduced shape (lower dimension)
pca_reduced.shape

(20, 3)

__From the output, we can see the number of features is successfully reduced from 10 to 3.__

##### Pipeline
* A pipeline simplifies the process where more than one model is required or used.
* All models in the pipeline except the last must be transformers. The last model can be a transformer, a classifier, a regressor, or another such object.
* Once the data is fit into the pipeline's models or estimators, the predict method can be called.

__An estimator is also known as a ‘model instance’.__

###### Demo - 06: Pipeline
__Problem Statement: Demonstrate how to build a pipeline.__

In [34]:
#import pipeline class
from sklearn.pipeline import Pipeline

#import linear estimator
from sklearn.linear_model import LinearRegression

#import PCA estimator for dimensionality reduction
from sklearn.decomposition import PCA

#chain the estimators together
#Note: All the estimators in the pipeline except the last one must be transformers

estimator = [('dim_reduction',PCA()),('linear_model',LinearRegression())]

#put the chain of estimators in a pipeline object
pipeline_estimator = Pipeline(estimator)
pipeline_estimator

Pipeline(steps=[('dim_reduction', PCA()), ('linear_model', LinearRegression())])

In [35]:
#view first step
#the steps attribute helps to review the pipeline one step at a time
pipeline_estimator.steps[0]

('dim_reduction', PCA())

In [36]:
#view second step
pipeline_estimator.steps[1]

('linear_model', LinearRegression())

In [37]:
#view all the steps in pipeline
pipeline_estimator.steps

[('dim_reduction', PCA()), ('linear_model', LinearRegression())]

__In this demo, we saw how to create a pipeline, how to chain different estimators together, and how to view the pipeline as a whole and step by step.__
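The demo builds the pipeline but does not fit it. As a minimal sketch of the next step, the example below fits the same two-step chain on synthetic data and calls `predict`. The data from `make_regression` and the choice of `n_components=5` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 100 observations, 10 features
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Same two steps as in the demo, with PCA limited to 5 components
pipe = Pipeline([
    ("dim_reduction", PCA(n_components=5)),
    ("linear_model", LinearRegression()),
])

# fit runs PCA.fit_transform, then fits the regression on the reduced data
pipe.fit(X, y)

# predict applies PCA.transform, then LinearRegression.predict
predictions = pipe.predict(X[:3])
print(predictions.shape)

# Individual steps stay accessible by name
print(pipe.named_steps["dim_reduction"].n_components_)
```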

#### Model Persistence
We can save a trained model for future use. This avoids the need to retrain the model.
* A model can be saved using Python's pickle module.
* pickle can also be replaced with the joblib library.
* joblib.dump saves a model to a file, and joblib.load restores it.
* joblib is more efficient for models that carry large NumPy arrays.

###### Demo - 07: Model Persistence
__Problem Statement: Demonstrate how to persist a model for future use.__

In [38]:
#import required libraries and dataset
from sklearn.datasets import load_iris
iris_dataset = load_iris()

#view feature name of the dataset
iris_dataset.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [39]:
#view target of the dataset
iris_dataset.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [40]:
#define features and target objects
X_feature = iris_dataset.data
Y_target = iris_dataset.target

#create object with new values for prediction
X_new = [[3,5,4,1],[5,3,4,2]]

#use logistic regression estimators
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver = 'liblinear')

#fit data into the logistic regression estimator
logreg.fit(X_feature,Y_target)

LogisticRegression(solver='liblinear')

In [41]:
#predict the outcome using logistic regression estimator
logreg.predict(X_new)

array([0, 2])

__The predicted outcomes, 0 and 2, are valid target labels (the dataset uses the labels 0, 1, and 2), which shows that the model is working as expected.__

In [42]:
#import library for model persistence
import pickle as pkl

#use dumps method to persist (save) the model
persist_model = pkl.dumps(logreg)
persist_model

b"\x80\x04\x95*\x03\x00\x00\x00\x00\x00\x00\x8c\x1esklearn.linear_model._logistic\x94\x8c\x12LogisticRegression\x94\x93\x94)\x81\x94}\x94(\x8c\x07penalty\x94\x8c\x02l2\x94\x8c\x04dual\x94\x89\x8c\x03tol\x94G?\x1a6\xe2\xeb\x1cC-\x8c\x01C\x94G?\xf0\x00\x00\x00\x00\x00\x00\x8c\rfit_intercept\x94\x88\x8c\x11intercept_scaling\x94K\x01\x8c\x0cclass_weight\x94N\x8c\x0crandom_state\x94N\x8c\x06solver\x94\x8c\tliblinear\x94\x8c\x08max_iter\x94Kd\x8c\x0bmulti_class\x94\x8c\x04auto\x94\x8c\x07verbose\x94K\x00\x8c\nwarm_start\x94\x89\x8c\x06n_jobs\x94N\x8c\x08l1_ratio\x94N\x8c\x0en_features_in_\x94K\x04\x8c\x08classes_\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h\x1c\x8c\x05dtype\x94\x93\x94\x8c\x02i4\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x0c\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x94t\x94b\x8c\x05coef_\x94h\

In [43]:
#use joblib.dump to persist the model into a file
#joblib is an external library; its dump function saves the model to a file
#pass the fitted model and the file name
import joblib
joblib.dump(logreg, 'regresfilename.pkl')

['regresfilename.pkl']

In [44]:
#use joblib.load to load the persisted model from the file
#create new estimator from the saved model
new_logreg_estimator = joblib.load('regresfilename.pkl')

#view the new estimator
new_logreg_estimator

LogisticRegression(solver='liblinear')

In [45]:
#validate & use new estimator to predict
new_logreg_estimator.predict(X_new)

array([0, 2])

__This proves that the model can be created, trained and persisted for future use.__
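The pickle cell above serializes the model with `dumps` but does not restore it. A minimal round-trip sketch is shown below. It uses the default solver with a raised `max_iter` rather than `liblinear`, an assumption made only to keep the example portable across Scikit-Learn releases.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Default solver with a higher iteration cap (an assumption for this sketch)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted model to bytes, then rebuild it from those bytes
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored estimator should give the same predictions as the original
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Pickled files can execute code when loaded, so only load model files from sources you trust.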

#### Model Evaluation: Metric Functions
You can use the functions in Scikit-Learn's metrics module (sklearn.metrics) to evaluate the accuracy of your model’s predictions.

![modelper1.PNG](attachment:modelper1.PNG)
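As a small sketch of two commonly used metric functions for classification, the example below computes accuracy and a confusion matrix. The true and predicted labels are hypothetical values chosen for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions for six observations
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
```

The confusion matrix shows which classes are mistaken for which, a detail that a single accuracy number hides.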