## Ensemble Technique

An ensemble technique combines multiple models (multiple predictors) to build a stronger model.

The ensemble can combine different kinds of models, for instance:
- logistic regression + decision trees

The outputs of the predictors are combined by an averaging method, such as:
- weighted averages,
- plain (unweighted) averages,
- or votes,

and a final prediction value is derived.
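
The three combination rules listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not part of the notebook's pipeline; the function names, scores, and weights are all hypothetical.

```python
from collections import Counter

def plain_average(scores):
    """Unweighted mean of the predictors' scores."""
    return sum(scores) / len(scores)

def weighted_average(scores, weights):
    """Mean of the scores, each scaled by its predictor's weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def majority_vote(labels):
    """Most common class label among the predictors."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical probabilities of class 1 from three predictors for one sample
scores = [0.9, 0.6, 0.3]
weights = [0.5, 0.3, 0.2]  # hypothetical trust placed in each predictor

avg = plain_average(scores)
wavg = weighted_average(scores, weights)
# Turn each score into a 0/1 label with a 0.5 cut-off, then vote
vote = majority_vote([int(s >= 0.5) for s in scores])
print(avg, wavg, vote)
```

Averaging suits predictors that output scores or probabilities, while voting suits predictors that output class labels.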

Ensemble methods are used because a combination of models is usually more effective than any individual model; they are therefore heavily used to build machine learning models.

Ensemble methods can be implemented by either:
- bagging
- or boosting.

### Bagging

The independent models/predictors are built, each on its own random subsample (bootstrap) of the data.

Then an average (weighted, plain, or by voting) of the scores from the different predictors is taken to get the final score/prediction.

The most famous bagging method is **Random Forest**.
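
As a rough illustration of the bagging idea, the sketch below trains several one-feature threshold classifiers ("stumps"), each on its own bootstrap sample, and combines them by majority vote. It is a toy in plain Python with made-up data and hypothetical function names, and it is far simpler than a Random Forest.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagging(points, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def predict_bagging(models, x):
    """Combine the independently trained stumps by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): class 0 for small x, class 1 for large x
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]
models = fit_bagging(data)
predictions = [predict_bagging(models, x) for x in (1.2, 4.8)]
print(predictions)
```

Because every stump sees a different resample, individual thresholds vary, and the vote smooths out that variation.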


### Boosting
In boosting, the predictors are not trained independently but sequentially.

For example, we build a logistic regression model on a subsample/bootstrap of the original training data set. Then we take the errors of this model (for example, its misclassified examples or residuals) and use them to train a decision tree, and so on.

- input -> logistic regression -> errors -> the previous model's errors guide the training of -> decision tree -> errors -> continues

The aim of this sequential training is for each subsequent model to learn from the mistakes of the previous models.

**Gradient Boosting** is an example of a boosting method.


## Gradient Boosting

Unlike weight-based boosting methods such as AdaBoost, gradient boosting doesn't increase the weights of misclassified outcomes from one learner to the next. Instead, each new learner is trained to reduce the loss function of the ensemble built by the previous learners.
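
To make the idea concrete, here is a minimal sketch of gradient boosting for regression with squared loss, where the negative gradient of the loss is simply the residual. It uses two-leaf stumps on a single feature, plain Python, and made-up data; the function names, the learning rate, and the number of rounds are assumptions for illustration, and this is not how `BoostedTreesClassifier` is implemented.

```python
def fit_stump(xs, residuals):
    """Find the split on x that best fits the residuals with two constant leaves."""
    best = None
    for split in sorted(set(xs))[:-1]:  # skip the largest x so both leaves are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, split, left_value, right_value)
    return best[1:]

def predict(base, stumps, learning_rate, x):
    """Sum the base prediction and the scaled contribution of every stump."""
    total = base
    for split, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= split else right_value)
    return total

def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # start from a constant model
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - F(x)
        residuals = [y - predict(base, stumps, learning_rate, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

def mse(base, stumps, learning_rate, xs, ys):
    return sum((y - predict(base, stumps, learning_rate, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]              # hypothetical feature values
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]  # hypothetical targets
base, stumps = fit_gradient_boosting(xs, ys)
initial_mse = mse(base, [], 0.3, xs, ys)
final_mse = mse(base, stumps, 0.3, xs, ys)
print(initial_mse, final_mse)
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training loss shrinks as learners are added.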

Here we demonstrate building a boosted trees classifier, which uses the gradient boosting method under the hood.


### Dataset

The dataset chosen is the Iris dataset, which consists of:
- features: `SepalLength`, `SepalWidth`, `PentalLength`, `PentalWidth` (column names as spelled in the code)
- target: `Species`

The dataset has three species. For this demonstration we avoid multiclass classification and work with only two species classes.

### Data Transformation

Through exploratory data analysis we saw a bimodal distribution in the target, because we selected only two species. We can also see the correlation between each feature and the target.

Data normalization was applied, because it is highly recommended when features have different value ranges.

### TensorFlow Pipeline
A TensorFlow pipeline was built to slice, shuffle, and batch the data in order to generate the training and test datasets for the boosted trees model.

### Model Verification
To verify how well the model predicts the two species, three metrics were computed: accuracy, precision, and recall. The results were:
- Training
    - Training Data Accuracy (%) =  98.75
    - Training Data Precision (%) =  97.73
    - Training Data Recall (%) =  100.0
- Test
    - Test Data Accuracy (%) =  80.0
    - Test Data Precision (%) =  63.64

For this case we can also evaluate the same model using the F1 score metric.
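
The F1 score is the harmonic mean of precision and recall, so it can be derived from the figures already listed. The sketch below does this for the training results; the helper name `f1_from` is hypothetical. The test recall is not listed above, so the test F1 cannot be derived from the figures shown.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Training precision and recall reported above, converted from percentages
train_f1 = f1_from(0.9773, 1.0)
print('Training Data F1 (%) = ', round(train_f1 * 100, 2))
```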

### Load Libraries

In [6]:
from __future__ import absolute_import, division, print_function, unicode_literals
import pandas as pd
import seaborn as sb
import tensorflow as tf
from tensorflow import keras as ks

from tensorflow.estimator import BoostedTreesClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

### Load dataset

The dataset is the well-known Iris dataset, which categorizes three species of plants. The dataset consists of the features:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

and the categorical targets (species) are:
- Setosa
- Versicolor
- Virginica

In [7]:
# Define column names
col_names = ['SepalLength', 'SepalWidth', 'PentalLength', 'PentalWidth', 'Species']
target_dimensions = ['Setosa', 'Versicolor', 'Virginica']

# Load dataset
# Define the path to read the dataset
training_data_path = tf.keras.utils.get_file("iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_data_path = tf.keras.utils.get_file("iris_test.csv","https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

# Train dataset
# header=0 discards the file's original header row in favour of col_names
training = pd.read_csv(training_data_path, names=col_names, header=0)

# Keep only the rows whose Species is not 0 (copy to avoid modifying a view)
training = training[training['Species'] >= 1].copy()

# Replace the Species values 1, 2 with 0, 1
training['Species'] = training['Species'].replace([1,2], [0,1])

# Test dataset
# header=0 discards the file's original header row in favour of col_names
test = pd.read_csv(test_data_path, names=col_names,header=0)

# Keep only the rows whose Species is not 0 (copy to avoid modifying a view)
test = test[test['Species'] >= 1].copy()

# Replace the Species values 1, 2 with 0, 1
test['Species'] = test['Species'].replace([1,2], [0,1])

In [8]:
# Reset the index of dataframes
training.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

# Concatenate the two dataframes row-wise
iris_dataset = pd.concat([training, test], axis=0)

# Compute summary statistics
iris_stats = iris_dataset.describe()

# Transpose so that each feature is a row, which is easier to read
iris_stats = iris_stats.transpose()

# Show the statistics
iris_stats

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SepalLength | 100.0 | 6.262 | 0.662834 | 4.9 | 5.8 | 6.3 | 6.7 | 7.9 |
| SepalWidth | 100.0 | 2.872 | 0.332751 | 2.0 | 2.7 | 2.9 | 3.025 | 3.8 |
| PentalLength | 100.0 | 4.906 | 0.825578 | 3.0 | 4.375 | 4.9 | 5.525 | 6.9 |
| PentalWidth | 100.0 | 1.676 | 0.424769 | 1.0 | 1.3 | 1.6 | 2.0 | 2.5 |
| Species | 100.0 | 0.5 | 0.502519 | 0.0 | 0.0 | 0.5 | 1.0 | 1.0 |


In [9]:
# Select the required columns: features (everything except Species) and target
X_data = iris_dataset[[i for i in iris_dataset.columns if i not in ['Species']]]
Y_data = iris_dataset[['Species']]

# Split the dataset into training and test sets (80% / 20%)
training_features, test_features, training_labels, test_labels = train_test_split(X_data , Y_data , test_size=0.2)

print('No. of rows in Training Features: ', training_features.shape[0])
print('No. of rows in Test Features: ', test_features.shape[0])
print('No. of columns in Training Features: ', training_features.shape[1])
print('No. of columns in Test Features: ', test_features.shape[1])
print('No. of rows in Training Label: ', training_labels.shape[0])
print('No. of rows in Test Label: ', test_labels.shape[0])
print('No. of columns in Training Label: ', training_labels.shape[1])
print('No. of columns in Test Label: ', test_labels.shape[1])

No. of rows in Training Features:  80
No. of rows in Test Features:  20
No. of columns in Training Features:  4
No. of columns in Test Features:  4
No. of rows in Training Label:  80
No. of rows in Test Label:  20
No. of columns in Training Label:  1
No. of columns in Test Label:  1


In [10]:
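# Normalize the dataset (z-score: subtract the mean, divide by the standard deviation)
def norm(x):
    stats = x.describe()
    stats = stats.transpose()
    
    return (x - stats['mean']) / stats['std']

# Note: each split is standardized with its own statistics
normed_train_features = norm(training_features)
normed_test_features = norm(test_features)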
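
The cell above standardizes each split with its own mean and standard deviation. A common alternative is to compute the statistics on the training split only and reuse them for the test split, so that both are expressed on the same scale. The sketch below shows that variant in plain Python on made-up values; the function names are hypothetical and this is not what the notebook ran.

```python
from statistics import mean, stdev

def fit_standardizer(train_column):
    """Return the mean and sample standard deviation of a training column."""
    return mean(train_column), stdev(train_column)

def standardize(column, mu, sigma):
    """Apply (x - mean) / std using statistics computed on the training data."""
    return [(x - mu) / sigma for x in column]

train_sepal_length = [5.8, 6.3, 6.7, 4.9, 7.9]  # hypothetical training values
test_sepal_length = [6.0, 7.1]                  # hypothetical test values

mu, sigma = fit_standardizer(train_sepal_length)
normed_train = standardize(train_sepal_length, mu, sigma)
# The test values are scaled with the training statistics, not their own
normed_test = standardize(test_sepal_length, mu, sigma)
print(normed_test)
```

With the DataFrames used here, the equivalent would be to take the mean and standard deviation from `training_features` and apply them to `test_features`.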

## Build the input pipeline for the TensorFlow model

In [11]:
def feed_input(features_dataframe, target_dataframe, shuffle=True, num_of_epochs=10, batch_size=32):
    # Returns an input function that shuffles the dataset and creates batches for each epoch
    def input_feed_function():
        # Slice the feature columns and the targets into one element per row
        dataset = tf.data.Dataset.from_tensor_slices((dict(features_dataframe), target_dataframe))
        
        # Shuffle with a buffer of 2000 elements, which is larger than this dataset
        if shuffle:
            dataset = dataset.shuffle(2000)
            
        dataset = dataset.batch(batch_size).repeat(num_of_epochs)
        
        return dataset
    
    return input_feed_function
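
To see what the `shuffle`, `batch`, and `repeat` chain produces, the sketch below imitates it with a plain Python generator. It is only an approximation of the behaviour, assuming the dataset is reshuffled on every epoch and that batches never span two epochs; the function name `batches_per_epoch` is hypothetical.

```python
import random

def batches_per_epoch(rows, batch_size=32, num_of_epochs=10, shuffle=True, seed=0):
    """Yield lists of rows: reshuffle on every epoch, then cut into batches."""
    rng = random.Random(seed)
    for _ in range(num_of_epochs):
        epoch_rows = list(rows)
        if shuffle:
            rng.shuffle(epoch_rows)
        # The last batch of an epoch may be smaller than batch_size
        for start in range(0, len(epoch_rows), batch_size):
            yield epoch_rows[start:start + batch_size]

rows = list(range(80))  # stand-in for the 80 training rows
batches = list(batches_per_epoch(rows))
batch_sizes = [len(b) for b in batches[:3]]
print(len(batches), batch_sizes)
```

With 80 rows and a batch size of 32, each epoch yields two full batches and one partial batch.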

In [12]:
# Input function that feeds the training data to the boosted trees model
train_feed_input = feed_input(normed_train_features, training_labels)

# For evaluation we don't need to shuffle the dataset, and a single epoch is enough
train_feed_input_testing = feed_input(normed_train_features, training_labels, num_of_epochs=1, shuffle=False)

# Input function that feeds the test data to the boosted trees model
test_feed_input = feed_input(normed_test_features, test_labels, num_of_epochs=1, shuffle=False)

# Declare every feature as a numeric column
feature_columns_numeric = [tf.feature_column.numeric_column(m) for m in training_features.columns]

# Build the boosted decision trees classifier with the numeric columns
btree_model = BoostedTreesClassifier(feature_columns=feature_columns_numeric, n_batches_per_layer=1)

# Train the model on the training data
btree_model.train(train_feed_input)

## Predictions

In [15]:
# Make predictions on the training data (unshuffled, single epoch)
train_predictions = btree_model.predict(train_feed_input_testing)

# Make predictions on the test data
test_predictions = btree_model.predict(test_feed_input)

# Transform the training predictions into a Series of class labels
train_predictions_series = pd.Series([p['classes'][0].decode("utf-8") for p in train_predictions])

# Transform the test predictions into a Series of class labels
test_predictions_series = pd.Series([p['classes'][0].decode("utf-8") for p in test_predictions])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpybj1_ru1/model.ckpt-29
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpybj1_ru1/model.ckpt-29
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [18]:
# Convert Series into Dataframes
train_predictions_df = pd.DataFrame(train_predictions_series, columns=['predictions'])
test_predictions_df = pd.DataFrame(test_predictions_series, columns=['predictions'])

# Reset the indices so that labels and predictions align when joined column-wise
training_labels.reset_index(drop=True, inplace=True)
train_predictions_df.reset_index(drop=True, inplace=True)

test_labels.reset_index(drop=True, inplace=True)
test_predictions_df.reset_index(drop=True, inplace=True)

# Join Dataframes
train_labels_with_predictions_df = pd.concat([training_labels, train_predictions_df], axis=1)
test_labels_with_predictions_df = pd.concat([test_labels,test_predictions_df], axis=1)

## Calculate Metrics

This part allows us to check the quality of our model. In classification problems, the metrics usually used are:
- accuracy
- precision
- recall
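
For reference, all three metrics follow from the counts of true/false positives and negatives. The sketch below computes them by hand in plain Python on made-up labels; the function name `binary_scores` is hypothetical, and the notebook itself relies on the scikit-learn functions imported earlier.

```python
def binary_scores(y_true, y_pred):
    """Return accuracy, precision and recall for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
accuracy, precision, recall = binary_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found.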

In [19]:
def calculate_binary_class_scores(y_true, y_pred):
    # y_true : original target values
    # y_pred : predicted values
    accuracy = accuracy_score(y_true,y_pred.astype('int64'))
    precision = precision_score(y_true,y_pred.astype('int64'))
    recall = recall_score(y_true, y_pred.astype('int64'))
    
    return accuracy, precision, recall



train_accuracy_score, train_precision_score, train_recall_score = calculate_binary_class_scores(training_labels, train_predictions_series)
test_accuracy_score, test_precision_score, test_recall_score = calculate_binary_class_scores(test_labels, test_predictions_series)
print('Training Data Accuracy (%) = ', round(train_accuracy_score*100,2))
print('Training Data Precision (%) = ', round(train_precision_score*100,2))
print('Training Data Recall (%) = ', round(train_recall_score*100,2))
print('-'*50)
print('Test Data Accuracy (%) = ', round(test_accuracy_score*100,2))
print('Test Data Precision (%) = ', round(test_precision_score*100,2))

Training Data Accuracy (%) =  98.75
Training Data Precision (%) =  97.73
Training Data Recall (%) =  100.0
--------------------------------------------------
Test Data Accuracy (%) =  80.0
Test Data Precision (%) =  63.64
