# Optimizing an ML Pipeline in Azure


## Table of contents
   * [Overview](#Overview)
   * [Summary](#Summary)
   * [Scikit-learn Pipeline](#Scikit-learn-Pipeline)
   * [AutoML](#AutoML)
   * [Pipeline comparison](#Pipeline-comparison)
   * [Future work](#Future-work)
   * [Proof of cluster clean up](#Proof-of-cluster-clean-up)
   * [Citation](#Citation)
   * [References](#References)
   

## Overview
This project is part of the Udacity Azure ML Nanodegree.  
In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. The model's hyperparameters are tuned with HyperDrive. The best model is then compared with a model generated by an Azure AutoML run.


## Summary

### The data set
The "Bank Marketing" data set contains information about direct marketing campaigns of a Portuguese banking institution.  

In this project we seek to predict whether a client will subscribe to a bank term deposit ('yes') or not ('no'). Answering this question can help the financial institution choose more effective strategies for future marketing campaigns.

### The best model  

The best-performing model was a Voting Ensemble generated by an AutoML job. It combined nine models built with the XGBoostClassifier and LightGBM algorithms.

The primary metric was accuracy, and the model achieved 0.9167.  
However, since the data set is unbalanced, it is worth examining other metrics as well:  
- f1_score_micro: 0.9167
- precision_score_weighted: 0.9107
- AUC_weighted: 0.9477
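
To illustrate why accuracy alone can mislead on an unbalanced data set, the sketch below scores a trivial baseline that always predicts 'no'. The 90/10 class split is a hypothetical value chosen for illustration, not the measured ratio of this data set.

```python
# Hypothetical labels: 90 "no" and 10 "yes" (illustrative split, not the real ratio).
y_true = ["no"] * 90 + ["yes"] * 10

# A trivial baseline that never predicts a subscription.
y_pred = ["no"] * len(y_true)

def accuracy(truth, pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def recall(truth, pred, positive="yes"):
    """Fraction of truly positive samples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(truth, pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.2f}, recall(yes)={baseline_recall:.2f}")
```

The baseline reaches high accuracy while finding none of the subscribing clients, which is why weighted precision and AUC are reported alongside accuracy.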


## Scikit-learn Pipeline

### Data Analysis

The data set contains information about 10,000 clients. There are demographic attributes such as age, type of job, marital status and education, as well as attributes about the clients' financial profile and the related campaign: whether they have a housing loan, credit in default, a personal loan, and others.  

The target variable **y** indicates whether the client has subscribed to a term deposit.

### Preprocessing data

The Scikit-learn pipeline downloads the data from the provided URL. After the download, a number of data cleaning steps are carried out, including:

1. Removing NAs from the dataset.
2. One-hot encoding job titles, contact, and education variables.
3. Encoding a number of other categorical variables.
4. Encoding months of the year.
5. Encoding the target variable.
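
The steps above can be sketched with pandas as follows. This is a minimal illustration on a tiny made-up sample, not the project's actual cleaning script: the `clean_data` helper, the sample rows, and the choice to encode `marital` as a binary flag are assumptions, and only `job` is one-hot encoded to keep the example short.

```python
import pandas as pd

# Tiny hypothetical sample; column names follow the attributes described above.
raw = pd.DataFrame({
    "age": [30, 45, None, 52],
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "married", "divorced"],
    "month": ["may", "jun", "may", "nov"],
    "y": ["no", "yes", "no", "no"],
})

# Map month abbreviations to their calendar number.
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def clean_data(df):
    """Return (features, target) after the cleaning steps listed above."""
    df = df.dropna().copy()                                            # 1. remove NAs
    df = pd.get_dummies(df, columns=["job"], prefix="job", dtype=int)  # 2. one-hot encode
    df["marital"] = (df["marital"] == "married").astype(int)           # 3. binary encode
    df["month"] = df["month"].map(MONTHS)                              # 4. month -> number
    target = (df.pop("y") == "yes").astype(int)                        # 5. encode target
    return df, target

features, target = clean_data(raw)
print(features.columns.tolist())
```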


### Model

Before training the model, the data was split into training and test sets, with the test set holding 30% of all entries. The classification algorithm was **Logistic Regression**, whose tunable parameters include the inverse regularization strength **C** and the **maximum number of iterations**.
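
A minimal sketch of this split-and-train step with scikit-learn is shown below. The synthetic data from `make_classification` stands in for the cleaned bank data, and the values `C=1.0` and `max_iter=100` are placeholder defaults, not the tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features and target (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 30% of the entries for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# C is the inverse of the regularization strength; max_iter caps solver iterations.
model = LogisticRegression(C=1.0, max_iter=100)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.4f}")
```

Smaller values of `C` apply stronger regularization, which is why it is a natural candidate for tuning.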


### Hyperparameter Tuning

Azure's HyperDrive service was used for hyperparameter tuning with the following key elements:

#### Parameter sampling

Random parameter sampling was used.
Grid sampling performs an exhaustive search over all possible values, which can be very expensive.
Random sampling is usually cheaper, and unlike Bayesian sampling it can make full use of all available nodes.
With Bayesian sampling, the number of concurrent runs affects the effectiveness of the tuning process, because some runs start without benefiting from the results of runs that are still in progress.
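
The difference in cost between random and grid sampling can be illustrated in plain Python. This is a conceptual sketch, not the HyperDrive API: the search ranges, the `sample_random` and `build_grid` helpers, and the run counts are all hypothetical.

```python
import itertools
import random

# Hypothetical search space; the ranges used in the project are not stated here.
C_RANGE = (0.01, 10.0)                  # continuous range for C
MAX_ITER_CHOICES = [50, 100, 150, 200]  # discrete choices for max_iter

def sample_random(n_runs, seed=0):
    """Draw n_runs independent configurations; the budget is fixed up front."""
    rng = random.Random(seed)
    return [
        {"C": rng.uniform(*C_RANGE), "max_iter": rng.choice(MAX_ITER_CHOICES)}
        for _ in range(n_runs)
    ]

def build_grid(c_values):
    """Grid search needs discrete values and tries every combination."""
    return [
        {"C": c, "max_iter": m}
        for c, m in itertools.product(c_values, MAX_ITER_CHOICES)
    ]

random_configs = sample_random(n_runs=8)
grid_configs = build_grid(c_values=[0.01, 0.1, 1.0, 5.0, 10.0])
print(len(random_configs), "random runs vs", len(grid_configs), "grid runs")
```

Because each random draw is independent of the others, all draws can be evaluated in parallel, whereas the grid size grows multiplicatively with every added value.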

#### Early stopping policy  

An early termination policy automatically ends poorly performing runs. This improves computational efficiency, because the hyperparameter tuning service does not have to let every training run execute to completion. This project uses the **Bandit Policy** because it allows greater savings.
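
The Bandit rule with a slack factor can be sketched as follows for a metric that is being maximized: a run is cancelled when its best metric falls below the best run's metric divided by `1 + slack_factor`. The accuracies and the `slack_factor=0.1` value are hypothetical, and `bandit_should_stop` is an illustrative helper rather than part of the Azure SDK.

```python
def bandit_should_stop(run_best, overall_best, slack_factor):
    """Return True if a run falls outside the allowed slack (goal: maximize).

    A run is cancelled when its best metric so far is below
    overall_best / (1 + slack_factor).
    """
    threshold = overall_best / (1 + slack_factor)
    return run_best < threshold

# Hypothetical accuracies reported by three runs at the same evaluation interval.
best_so_far = {"run_a": 0.915, "run_b": 0.905, "run_c": 0.800}
overall_best = max(best_so_far.values())

# slack_factor=0.1 is an illustrative value, not the one used in this project.
stopped = [name for name, acc in best_so_far.items()
           if bandit_should_stop(acc, overall_best, slack_factor=0.1)]
print(stopped)  # ['run_c']
```

Here the threshold is 0.915 / 1.1, roughly 0.832, so only the clearly lagging run is stopped while close competitors continue.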

The best model used a C value of 5.00 and a maximum of 147 iterations. Its accuracy was 91.54%. The SKLearn pipeline experiment was executed on a compute cluster.



## AutoML  

The AutoML pipeline used the same steps before training: the data were cleaned, preprocessed and split into training and test sets.  

The AutoML experiment ran locally, and the pipeline tested RandomForest, XGBoostClassifier, GradientBoosting, LogisticRegression, ExtremeRandomTrees and LightGBM.  

The best model selected by AutoML was a voting ensemble, with accuracy as the primary metric, reaching 91.67%. The ensemble combined nine models built with the XGBoostClassifier and LightGBM algorithms.  
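
To give an intuition for how a voting ensemble combines its members, the sketch below averages class probabilities with per-model weights (soft voting). The three member outputs and the weights are hypothetical; the weights and members AutoML actually selected are not reproduced here.

```python
def soft_vote(probabilities, weights):
    """Combine per-model class probabilities with a weighted average."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

# Hypothetical outputs of three member models for one client: [P(no), P(yes)].
member_probs = [[0.70, 0.30], [0.40, 0.60], [0.55, 0.45]]
# Hypothetical ensemble weights; AutoML chooses its own.
member_weights = [0.5, 0.3, 0.2]

combined = soft_vote(member_probs, member_weights)
prediction = ["no", "yes"][combined.index(max(combined))]
print(combined, prediction)
```

Even though one member favors 'yes', the weighted average favors 'no', which shows how an ensemble can smooth out the errors of individual models.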


## Pipeline comparison
**Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?**

The two models performed very similarly in terms of accuracy: the HyperDrive model achieved 91.54% and the AutoML model achieved 91.67%. However, this study showed me that the AutoML process can deliver excellent results with less effort. It also performed well on other metrics such as precision, recall and AUC, which are more suitable for unbalanced data like this.


## Future work
**What are some areas of improvement for future experiments? Why might these improvements help the model?**

In the future we can improve the model by: 

- Exploring feature engineering,
- Testing scaling methods for the continuous variables,
- Testing oversampling methods such as SMOTE and ADASYN,
- Running AutoML for longer (increasing `experiment_timeout_minutes`).
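
As a rough illustration of the oversampling idea, the sketch below generates synthetic minority-class points by interpolating between a point and one of its nearest minority neighbors, which is the core step of SMOTE. It is a simplified, standard-library-only version: the `smote_like` helper, the sample points, and `k=2` are assumptions, and a real experiment would more likely rely on a dedicated library.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Hypothetical scaled minority-class points with two continuous features.
minority_points = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.9, 0.8)]
new_points = smote_like(minority_points, n_new=6)
print(len(new_points), "synthetic points generated")
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.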

