# End-to-End Machine Learning Steps
---

## Step 1: Define the problem

* Can the problem be solved with supervised, unsupervised, reinforcement or mixed learning?
    * For supervised learning: What is/are the target variable(s)?
    * For unsupervised learning: Clustering? Outlier/Anomaly detection?
    * For reinforcement learning: What will be the agent's reward?

* Establish your performance metrics
* Define baseline
    * Human level performance
    * Other systems/tools
* What would success look like?
    * Establish Specific, Relevant and Attainable targets



**Important note:** *It is very important to clearly understand what the real problem is before starting data acquisition and exploration. It is equally important to define the model's goals and how its performance will be measured.*
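
One way to make the baseline concrete is to score a trivial model with the chosen metric before any real modelling. The sketch below is only an illustration: the toy data, the choice of RMSE as the metric, and scikit-learn's `DummyRegressor` as the baseline are all assumptions, not part of the original workflow.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: 100 samples, 3 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Baseline that always predicts the mean of the target it was fitted on
baseline = DummyRegressor(strategy="mean").fit(X, y)
baseline_rmse = np.sqrt(mean_squared_error(y, baseline.predict(X)))

print(f"Baseline RMSE to beat: {baseline_rmse:.3f}")
```

Any candidate model should then be judged against this number (and against human-level or existing-tool performance where available).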
    
---

## Step 2: Acquire Data

* Find data and document sources
* Check how much storage space the data needs
* Check the terms and conditions of use
* If applicable: get access
* Prepare your environment
    - [venv](https://docs.python.org/3/tutorial/venv.html)
    - [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
* Load libraries
* Import data
* Deal with sensitive information (delete, protect, anonymize)
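
As a rough illustration of the last point, the sketch below deletes one sensitive column and replaces another with a salted hash. The column names, the sample records and the `SALT` value are hypothetical, and salted hashing is pseudonymization rather than full anonymization, so treat it as a starting point only.

```python
import hashlib

import pandas as pd

# Hypothetical records containing personal data
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "email": ["ann@example.com", "bob@example.com"],
    "age": [34, 51],
})

SALT = "replace-with-a-secret-value"  # assumption: stored outside version control


def pseudonymize(value: str) -> str:
    """Return a salted SHA-256 hex digest of the value."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()


# Delete: drop columns that are not needed for modelling
df = df.drop(columns=["name"])

# Protect: replace a direct identifier with a pseudonym so rows can still be joined
df["email"] = df["email"].map(pseudonymize)
```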


In [None]:
# Import libraries

import pandas as pd
import numpy as np

In [None]:
# Import data from file

dataframe = pd.read_csv('filename.csv')

In [None]:
# Import data from a database

import pyodbc

# Define the connection to the database and the query
# (replace the placeholder server, database and credentials with your own)

connector = pyodbc.connect('DRIVER={SQL Server};SERVER=Server_Name;DATABASE=DB_Name;UID=User;PWD=Password')
query = "SELECT * FROM TableName"

# Import data into a dataframe

dataframe = pd.read_sql(query, connector)

## Step 3: Exploratory data analysis

**Understanding your data**

* Size of dataset
* Variable datatypes
* Relationships between variables
* Distributions
* Correlations
* Outliers
* Missing values
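
Most items on this checklist map to one or two pandas calls. The sketch below uses a small made-up dataframe; the column names and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one extreme value
df = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, 250.0, np.nan],
    "quantity": [1, 2, 2, 40, 3],
    "category": ["a", "b", "a", "b", "a"],
})

print(df.shape)       # size of dataset
print(df.dtypes)      # variable datatypes
print(df.describe())  # summary of numerical distributions

# Missing values per column
missing = df.isna().sum()

# Pairwise Pearson correlations between numerical columns
correlations = df[["price", "quantity"]].corr()

# Flag outliers in one column with the 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
```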

---

#### Other libraries suitable for visualization:
* [Bokeh](https://bokeh.org/)
* [Plotly](https://plot.ly/)

In [None]:
# Import visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Plot types:

- **Comparison**
    * Line Chart: best suited for time series
    * Bar Chart: best suited to compare numerical values across categories
    * Radar Chart: best suited to plot multiple variables, each on its own axis
- **Relation**
    * Scatter Plot: for 2 or 3 numerical variables
    * Bubble Plot: extends for one additional variable (size of the dot)
    * Pair Plot: grid of scatter plots and histograms for every pair of variables
    * Heatmap: for multivariate datasets
- **Composition**
    * Pie Chart: to depict numerical portions in a circle
    * Stacked Bar Chart: to show how a category is divided into sub-categories and the proportion of the sub-category, in comparison to the overall category
    * Stacked Area Chart: to show trends for part-of-a-whole relations where the values of several groups are illustrated on top of one another
- **Distribution**
    * Histogram: visualizes the distribution of a single numerical variable
    * KDE or Density Plot: variation of a histogram that uses kernel smoothing
    * Box Plot: used for single numerical variables to show ranges, quartiles and outliers
    * Violin Plot: combination between box and density plots
- **Geo plots**: to visualize geospatial data
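
As a small illustration of the distribution plots listed above, the following sketch draws a histogram and a box plot of the same variable side by side with matplotlib. The sample data is randomly generated and purely hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 500 draws from a normal distribution
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shape of the distribution
ax_hist.hist(values, bins=30)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("value")

# Box plot: median, quartiles and points beyond the whiskers
ax_box.boxplot(values)
ax_box.set_title("Box Plot")

fig.tight_layout()
```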

## Step 4: Feature Engineering

* Impute Missing Values
    - Numerical - mean, median, mode
    - Categorical - “missing”, mode
* Categorical Values
    - Encode Rare values ---> “rare”
    - Categorical to numerical
* Remove Outliers (can be skipped for Tree-based algorithms)
* Scaling (can be skipped for Tree-based algorithms)
* Normalization (can be skipped for Tree-based algorithms)
* Dimensionality reduction
    - LDA
    - PCA
    - tSNE
    - IsoMap
* Quantitative Feature selection (if applicable)

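**Important note:** the order of the steps above matters, because each transformer is fitted on the output of the previous one (for example, impute missing values before encoding or scaling)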
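
One way to keep the steps in a fixed order is to chain them in a scikit-learn `Pipeline`. The sketch below is a minimal example under assumptions: the columns `age` and `city` are hypothetical, and median imputation, a constant "missing" label, one-hot encoding and standard scaling are just one possible combination of the options listed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with missing values in both column types
X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "city": ["Oslo", "Rome", np.nan, "Oslo"],
})

# Numerical columns: impute first, then scale
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with a "missing" label, then encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", categorical_steps, ["city"]),
])

# One scaled numeric column plus one column per category (Oslo, Rome, missing)
X_prepared = preprocess.fit_transform(X)
```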


### Recommended libraries for easier feature engineering
* [Feature Engine](http://feature-engine.readthedocs.io/)
* [Feature Tools](https://www.featuretools.com/)

---

**Important _feature engineering_ notes:** 

- Split your data into train and test sets before proceeding with feature engineering
- Fit transformers on your training data only
- Transform both train and test
- Don't remove observations (e.g. outliers) from your test data
    - Remove outliers only in training data
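
These rules can be sketched as follows. The data is synthetic, and both the 3-standard-deviation outlier rule and `StandardScaler` are assumptions used only to show where each step happens relative to the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, 200 samples
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# 1. Split first, so nothing learned later can leak from the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Remove outliers from the training set only (here: beyond 3 standard deviations)
mean, std = X_train.mean(), X_train.std()
keep = np.abs(X_train[:, 0] - mean) <= 3 * std
X_train, y_train = X_train[keep], y_train[keep]

# 3. Fit the transformer on train only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The test set keeps all 50 of its rows, so the evaluation reflects data as it would arrive in production.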
    
---

## Step 5: Model Training

### Linear Models

- Linear Regression
- Polynomial Regression
- LogisticRegression
- Support Vector
    - SVC
    - Linear SVC
    - NuSVC
- Quadratic Discriminant Analysis
- Linear Discriminant Analysis
- Stochastic Gradient Descent
- Passive Aggressive


### Tree based

- DecisionTree
- RandomForest
- AdaBoost
- XGBoost
- CatBoost
- LightGBM

### Distance based

- Nearest Neighbors
- K Nearest
- Nearest Centroid
- Radius Neighbors


### Bayesian

- Gaussian
- Complement
- Multinomial
- Bernoulli

### Neural Network
- MultiLayer Perceptron
- ANN
- CNN
- RNN
- DBN
- GAN



In [None]:
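# Compare your models and select the best ones for hyperparameter tuning
# Comparison table example for regression
# Assumes X_train, X_test, y_train and y_test already exist

from math import sqrt

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Example candidates only; swap in the regressors that suit your problem
algorithms = []
algorithms.append(('Linear Regression', LinearRegression()))
algorithms.append(('Decision Tree', DecisionTreeRegressor()))
algorithms.append(('Random Forest', RandomForestRegressor()))


names = []
train_rmse = []
test_rmse = []
train_r2 = []
test_r2 = []

for name, model in algorithms:
    model.fit(X_train, y_train)
    # Predict once per split and reuse the predictions for both metrics
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    train_rmse.append(sqrt(mean_squared_error(y_train, train_pred)))
    test_rmse.append(sqrt(mean_squared_error(y_test, test_pred)))
    train_r2.append(r2_score(y_train, train_pred))
    test_r2.append(r2_score(y_test, test_pred))
    names.append(name)

# Lowest test RMSE first
clf_comparison = pd.DataFrame({'Algorithm': names, 'Train RMSE': train_rmse, 'Test RMSE': test_rmse,
                               'Train r square': train_r2, 'Test r square': test_r2})
clf_comparison.sort_values(by=['Test RMSE'])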

---
## Step 6: Model Tuning
[https://en.wikipedia.org/wiki/Hyperparameter_optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization)


- [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)
- [RandomSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
- [Bayesian search (Hyperopt)](http://hyperopt.github.io/)
- [Genetic programming (TPOT)](https://epistasislab.github.io/tpot/)
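
As a minimal sketch of the first option, the example below runs scikit-learn's `GridSearchCV` over a small grid. The synthetic data, the random forest model, the parameter values and the RMSE scoring are assumptions for illustration; a real search would use the shortlisted models and wider grids.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Every combination in the grid is evaluated: 2 x 2 = 4 candidates
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negation
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```

`RandomizedSearchCV` has a very similar interface and samples a fixed number of candidates instead of trying them all, which tends to scale better for large search spaces.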

---

## Step 7: Present results

- Feature importances
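
Feature importances can be obtained in several ways. The sketch below shows two that scikit-learn provides: the impurity-based `feature_importances_` attribute of tree ensembles and the model-agnostic `permutation_importance`. The synthetic data and the random forest model are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: 5 features, only 2 of them informative
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances (computed from the training data, sum to 1)
impurity_importances = model.feature_importances_

# Permutation importances: drop in test score when one feature is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
```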
