# Exercises - Week 2 - Transformers & Estimators - Blackjack

## Contents
1. Data Lab notebooks
2. Exercises

## 1. Data Lab notebooks
1. [Objects](https://bentley.cloud.databricks.com/#notebook/90404)
2. [Classes](https://bentley.cloud.databricks.com/#notebook/191802)
3. [Train & test datasets](https://bentley.cloud.databricks.com/#notebook/958305)
4. [Transformer classes](https://bentley.cloud.databricks.com/#notebook/430288)
5. [Estimator classes](https://bentley.cloud.databricks.com/#notebook/958181)
6. [Pipelines](https://bentley.cloud.databricks.com/#notebook/409455)

## 2. Exercises

__Exercise:__ Verify your understanding of the `fit` and `predict` methods of the pipeline `est`
by using only the `fit`, `transform` and `predict` methods of the individual transformers and the estimator used to create the pipeline.
Then compare your output and show that it is identical. 

All of the objects are created below.

The following cell imports the needed tools from the scikit-learn package. It creates an object called `est`, a pipeline made of an imputer, a scaler, and a logistic regression estimator. The imputer fills in missing values; per the code specification (`strategy="mean"`), it replaces each missing value with the mean of the observed values in the same column. The scaler rescales each column with min-max scaling, placing all values in that column between 0 and 1 so that the features share a common range. The `logreg` step of the `est` pipeline is a logistic regression estimator, which fits a logistic regression model to the given inputs. Beneath the `est` pipeline, standalone imputer, scaler and logistic regression objects (`imp`, `sca`, `log`) are created separately, allowing us to run each step by hand and better understand what happens "inside" a pipeline.

In [7]:
from sklearn.pipeline      import Pipeline
from sklearn.linear_model  import LogisticRegression
from sklearn.preprocessing import Imputer, MinMaxScaler

est = Pipeline([
  ('imputer', Imputer(strategy="mean")),
  ('scaler',  MinMaxScaler()),
  ('logreg',  LogisticRegression())
])
imp = Imputer(strategy="mean")
sca = MinMaxScaler()
log = LogisticRegression()
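
To make the two preprocessing steps concrete, here is a minimal NumPy sketch of mean imputation followed by min-max scaling, done by hand. The toy matrix `X` is made up for illustration, and the arithmetic is only meant to mirror what the imputer and scaler are described as doing, not to reproduce scikit-learn's implementation.

```python
import numpy as np

# Toy feature matrix (hypothetical values) with one missing entry.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

# Mean imputation: replace each NaN with the mean of the observed
# values in the same column.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# Min-max scaling: map each column linearly onto the interval [0, 1].
col_min = X_imputed.min(axis=0)
col_max = X_imputed.max(axis=0)
X_scaled = (X_imputed - col_min) / (col_max - col_min)

print(X_imputed)
print(X_scaled)
```

After these two steps every column has minimum 0 and maximum 1, and no missing values remain.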

In [8]:
import pandas as pd
import numpy as np
import sklearn as sk

The following loads the iris dataset and extracts the feature matrix and the target vector.

In [10]:
from sklearn.datasets import load_iris
iris_features = load_iris().data
iris_target   = load_iris().target
(iris_features.shape, 
 iris_target.shape
)

The following splits the data into training and test sets, giving us something to fit and evaluate the estimator pipeline on.
`test_size` sets the fraction of the data held out for testing, here 20%.
The `random_state` parameter fixes the random seed, so the same shuffled indices are generated on every run.

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target,test_size=0.2,random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
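
The role of the seed can be sketched with a hand-rolled split. The helper `seeded_split` below is hypothetical and is not how `train_test_split` is implemented internally; it only illustrates the idea that shuffling with a fixed seed gives the same partition every time.

```python
import numpy as np

def seeded_split(X, y, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle row indices with a fixed seed, then slice."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Stand-in data with the same number of rows as iris (150).
X = np.arange(300).reshape(150, 2)
y = np.arange(150)

first = seeded_split(X, y)
second = seeded_split(X, y)

print(first[0].shape, first[1].shape)
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```

Calling the helper twice with the same seed returns identical arrays, which is what makes results reproducible across runs.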

The following code reshapes the response (y) variable into a single-column 2-D array for use with our estimator pipeline. (scikit-learn classifiers also accept the original 1-D target.)

In [14]:
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
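
As a quick illustration of what `reshape(-1, 1)` does to the shape, here is a small sketch on a made-up target vector. The `-1` asks NumPy to infer that dimension from the array's length.

```python
import numpy as np

y = np.array([0, 1, 2, 1])     # 1-D target, shape (4,)
y_col = y.reshape(-1, 1)       # column vector, shape (4, 1)
y_flat = y_col.ravel()         # flattened back to 1-D, shape (4,)

print(y.shape, y_col.shape, y_flat.shape)
```

The values are unchanged; only the shape differs.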

In [15]:
print('numpy  :',np.__version__)
print('pandas :',pd.__version__)
print('sklearn:',sk.__version__)

The following lines of code take the concept of the pipeline and expand on it by breaking out each step. First, the imputer and scaler objects fill in missing values and scale all values into a common range. Once the data has been run through the imputer and scaler, it is ready to be passed to the logistic regression step. Note that the imputer and scaler objects both have `fit` and `transform` methods: `fit` learns the parameters from the data (column means for the imputer, column minima and maxima for the scaler), and `transform` applies them. The `fit_transform()` method combines the two in one call. Note that scaling the target values is generally not required.

In [17]:
x_train_imp = imp.fit_transform(x_train)
x_train_sca = sca.fit_transform(x_train_imp)

In [18]:
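# Reuse the parameters learned from the training data (as the pipeline's
# predict does); calling fit_transform here would refit on the test set.
x_test_imp = imp.transform(x_test)
x_test_sca = sca.transform(x_test_imp)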
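
To illustrate the difference between `fit` and `transform`, here is a minimal sketch with a hypothetical `TinyMinMax` class. It is a simplified stand-in for a min-max scaler, not scikit-learn's `MinMaxScaler`, and the numbers are made up.

```python
import numpy as np

class TinyMinMax:
    """Hypothetical, simplified min-max scaler (illustration only)."""

    def fit(self, X):
        # Learn the per-column minimum and range from the data passed in.
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        # Apply the previously learned parameters; nothing is re-learned.
        return (X - self.min_) / self.range_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

train = np.array([[0.0], [10.0]])
test = np.array([[2.0], [4.0]])

scaler = TinyMinMax().fit(train)
reused = scaler.transform(test)            # scaled with the training min/max
refit = TinyMinMax().fit_transform(test)   # scaled with the test set's own min/max

print(reused.ravel())
print(refit.ravel())
```

The two results differ: reusing the training parameters maps the test values to 0.2 and 0.4, while refitting on the test set stretches them to 0 and 1. A model trained on training-scaled features expects the first kind of input.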

`astype(int)` casts the values of `y_train` to integers, so the class labels are not stored as floats or strings, avoiding potential errors.

In [20]:
x_train_reg = log.fit(x_train_sca, y_train.astype(int))

The output below shows the predicted values for the test set.

In [22]:
x_test_hold = x_train_reg.predict(x_test_sca)
x_test_hold

The `.T` attribute transposes an array. The code here aligns the shape of the original targets with that of the predicted results.

In [24]:
y=y_test.T
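
Why the shapes need aligning can be seen in a small sketch with made-up labels: subtracting a column vector from a 1-D array triggers NumPy broadcasting and produces a full matrix instead of an element-wise difference.

```python
import numpy as np

pred = np.array([0, 1, 2])            # 1-D predictions, shape (3,)
y_col = np.array([[0], [1], [1]])     # column-vector targets, shape (3, 1)

wrong = pred - y_col                  # broadcasts to shape (3, 3)
right = pred - y_col.T[0]             # first row of the transpose, shape (3,)

print(wrong.shape, right.shape)
print(right)
```

Taking the first row of the transposed targets gives a 1-D array of the same length as the predictions, so the subtraction is element-wise.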

The following computes the element-wise error (predicted minus actual) from the output generated above.

In [26]:
error = x_test_hold - y[0]
error
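
Because the targets are class labels, the size of each difference carries no meaning; only whether it is zero does. A minimal sketch, using a made-up `error` vector, of turning such differences into a misclassification count and an accuracy:

```python
import numpy as np

# Hypothetical error vector: 0 means the prediction matched the label.
error = np.array([0, 0, 1, 0, -1])

n_wrong = np.count_nonzero(error)     # number of misclassified samples
accuracy = np.mean(error == 0)        # fraction of correct predictions

print(n_wrong, accuracy)
```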

In [27]:
x_train_reg.intercept_, x_train_reg.coef_
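
The fitted intercepts and coefficients define one linear score per class, and the predicted class is the one with the highest score. The sketch below uses made-up values for `coef` and `intercept` (they are not the values fitted above) to show the arithmetic.

```python
import numpy as np

# Hypothetical parameters for 3 classes and 2 features.
coef = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, -1.0]])         # shape (n_classes, n_features)
intercept = np.array([0.0, 0.0, 0.5])   # shape (n_classes,)

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])              # shape (n_samples, n_features)

scores = X @ coef.T + intercept         # one score per sample and class
pred = np.argmax(scores, axis=1)        # class with the highest score

print(scores)
print(pred)
```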

The following uses the `est` pipeline, which performs all of the steps listed above in a single object.

In [29]:
pipe_1_train = est.fit(x_train, y_train.astype(int))
y_predict = pipe_1_train.predict(x_test)
error_pipe=y_predict-y_test.T
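
What a pipeline does when `fit` and `predict` are called can be sketched with a few lines of plain Python. `TinyPipeline`, `Center` and `SignClassifier` are hypothetical classes written for this illustration; they are a simplified model of the behaviour, not scikit-learn's `Pipeline`.

```python
class TinyPipeline:
    """Hypothetical, simplified stand-in for a pipeline (illustration only)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object); the last one is the estimator

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X)   # fit each transformer on the training data
        self.steps[-1][1].fit(X, y)     # then fit the final estimator
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)       # reuse fitted parameters, no refitting
        return self.steps[-1][1].predict(X)


class Center:
    """Toy transformer: subtracts the mean learned during fitting."""

    def fit_transform(self, X):
        self.mean_ = sum(X) / len(X)
        return self.transform(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy estimator: predicts 1 for positive inputs, 0 otherwise."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [int(x > 0) for x in X]


pipe = TinyPipeline([('center', Center()), ('clf', SignClassifier())])
pipe.fit([1.0, 2.0, 3.0], [0, 0, 1])
preds = pipe.predict([0.0, 5.0])
print(preds)
```

The point of the sketch is the asymmetry: `fit` calls `fit_transform` on every transformer, while `predict` calls only `transform`, so new data is processed with the parameters learned from the training data.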

The error from the pipeline method is equal to the one we calculated by applying the steps (imputer, scaler, logistic regression) separately.

In [31]:
error_pipe==error
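
For reference, here is one self-contained way to run the whole comparison and reduce it to a single `True`/`False`. It is a sketch under a few assumptions: the import fallback (aliased as `MeanImputer`, a name chosen here) is used because newer scikit-learn releases provide the imputer as `sklearn.impute.SimpleImputer` instead of `sklearn.preprocessing.Imputer`; the target is kept 1-D; and `np.array_equal` is used in place of an element-wise `==`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

try:  # newer scikit-learn releases
    from sklearn.impute import SimpleImputer as MeanImputer
except ImportError:  # older releases
    from sklearn.preprocessing import Imputer as MeanImputer

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Pipeline route: one fit, one predict.
pipe = Pipeline([
    ('imputer', MeanImputer(strategy="mean")),
    ('scaler', MinMaxScaler()),
    ('logreg', LogisticRegression()),
])
pipe_pred = pipe.fit(x_train, y_train).predict(x_test)

# Manual route: fit on the training data, only transform the test data.
imp, sca, log = MeanImputer(strategy="mean"), MinMaxScaler(), LogisticRegression()
x_train_sca = sca.fit_transform(imp.fit_transform(x_train))
log.fit(x_train_sca, y_train)
manual_pred = log.predict(sca.transform(imp.transform(x_test)))

print(np.array_equal(pipe_pred, manual_pred))
```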

__The End__