# Algorithms and Tools

# Algorithms
As an ML practitioner, it is really important that you keep learning. Machine learning, as a field, is still growing. Every day, new research papers are published and new algorithms are introduced. One of the most important skills to have as an ML practitioner is the ability to read research papers and learn new algorithms.

RandomForest has been the go-to algorithm for quite a while. After the success of bagging algorithms (like RandomForest), a new class of algorithms called Gradient Boosted Trees (GBTs) took over the title. As of now, Gradient Boosted Trees are the state of the art for structured data problems. __GBTs__ are used by the winners of almost every Kaggle competition with tabular data.

__XGBoost__ became famous in 2016 as the first major public release of its kind. __LightGBM__ (released by Microsoft) followed soon after, and then Yandex released its own version of GBTs called __CatBoost__. These are all very popular packages, each implementing its own version of GBTs. Learning them is a good chance to practice the skills mentioned above (reading research papers and learning new algorithms).

# Gradient Boosted Trees (GBTs)
Here are some links to get you started with GBTs. You are expected to go through these links and learn to use GBTs on your own.

* [CatBoost vs Light GBM vs XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db) discusses the structural differences, the treatment of categorical variables, and the different hyper-parameters of each algorithm.

* [XGBoost vs LightGBM](https://medium.com/kaggle-nyc/gradient-boosting-decision-trees-xgboost-vs-lightgbm-and-catboost-72df6979e0bb) discusses the implementation details of XGBoost and LightGBM.

* [Which boosting algorithm should I use?](https://medium.com/riskified-technology/xgboost-lightgbm-or-catboost-which-boosting-algorithm-should-i-use-e7fda7bb36bc) compares the algorithms on aspects like speed, accuracy, and the size of the dataset they can handle.

# Implementation
* [Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost](https://machinelearningmastery.com/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/) is quite different from the blogs listed above: it provides a code implementation for each algorithm. Luckily, all these libraries offer an sklearn-style API (`fit`/`predict`), so you will feel at home.

We highly recommend that you take a dataset and try all of these algorithms at least once. The next step is to visit their official websites and read the docs. Here are the links: [XGBoost](https://xgboost.ai/), [LightGBM](https://github.com/microsoft/LightGBM), [CatBoost](https://catboost.ai/).
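
To make the "try them on a dataset" advice concrete, here is a minimal sketch that compares a bagging model with a boosting model. It uses only scikit-learn (its `GradientBoostingClassifier` stands in for XGBoost, LightGBM, and CatBoost), and the synthetic dataset and hyper-parameter values are assumptions chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a real dataset (illustrative assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One bagging model and one boosting model, both with the same fit/score API.
models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

If you have the other packages installed, `xgboost.XGBClassifier`, `lightgbm.LGBMClassifier`, and `catboost.CatBoostClassifier` can be added to the `models` dictionary in the same way, since they also expose `fit` and `predict`.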


# Tools
We have only just begun to scratch the surface of machine learning. While the fundamentals are covered, there's much more to learn.

# Class Imbalance
__Class imbalance__ can be a real problem when working with real-world datasets. If left unnoticed, class imbalance can trick you into trusting a poor model. This happens because evaluation metrics like accuracy can give you very high scores even if your model is blindly predicting the majority class. Hence, it is not recommended to use accuracy_score as a metric if your dataset is not balanced.
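
A small sketch of why accuracy misleads here. The labels below are made up (95 negatives, 5 positives), and the "model" simply predicts the majority class every time. The alternative metrics shown, balanced accuracy and minority-class recall, are just two of several reasonable options.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 95 negatives and 5 positives (a 95:5 imbalance).
y_true = [0] * 95 + [1] * 5
# A "model" that blindly predicts the majority class for every sample.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy          = {accuracy:.2f}")         # 0.95
print(f"balanced accuracy = {balanced:.2f}")         # 0.50
print(f"minority recall   = {minority_recall:.2f}")  # 0.00
```

The accuracy looks excellent even though the model never finds a single positive example, which is exactly the trap described above.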

Here's a blog discussing the class imbalance problem and how to deal with it.

[imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) is an amazing library for dealing with imbalanced classes. Here's a [Kaggle kernel](https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets) showing it in action.

# Categorical Encoding
There are many different ways of encoding the categorical features in your data, but you cannot use just any encoding with any algorithm. Different algorithms make different assumptions and thus have different requirements. This can get messy very quickly.
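
As a rough illustration of how two encodings differ, the sketch below encodes a made-up `colors` feature with scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The toy data is an assumption, and the right choice still depends on the algorithm you pair it with.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single categorical feature with three levels.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# One-hot: one binary column per category, with no implied order.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Ordinal: a single numeric column. Categories are sorted alphabetically,
# so the codes imply an order (blue < green < red) that may not be meaningful.
ordinal = OrdinalEncoder().fit_transform(colors)

print(one_hot.shape)    # (4, 3)
print(ordinal.ravel())  # [2. 1. 0. 1.]
```

A model that treats numbers as magnitudes (a linear model, for example) will read the ordinal codes as "red is greater than blue", while one-hot columns avoid that assumption at the cost of more features.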

Here are some blogs that discuss which encoding to use with which algorithm.

* [One-Hot Encoding is making your Tree-Based Ensembles worse, here’s why?](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769)

* [Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)

* [Are categorical variables getting lost in your random forests?](https://web.archive.org/web/20200924113639/https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)

* An intense [discussion](https://www.kaggle.com/c/zillow-prize-1/discussion/38793) on Kaggle about categorical encoding, which you can read at your leisure.

While sklearn provides a lot of different transformers for encoding your categorical features, it is by no means complete. There are many newer, more sophisticated ways of encoding (like mean/target encoding) that tend to give better results. The [Category_encoders](https://github.com/scikit-learn-contrib/category_encoders) library implements many of these encoding techniques and has the same interface as sklearn, which makes it extremely easy to use. Here's a [blog](https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159) to get you started.
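
To show the idea behind mean (target) encoding, here is a hand-rolled sketch in pandas. It is not the Category_encoders implementation, and the `city` and `target` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Mean (target) encoding: replace each category with the mean target
# observed for that category.
category_means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(category_means)

print(df)
```

Note that this naive version computes the means on the same rows it encodes, which leaks the target into the feature. In practice you would compute the means on training folds only, and usually smooth them for rare categories.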

# Ensembling & model stacking
Ensembling has taken the machine learning world by storm. Almost every winning solution on Kaggle uses ensembling in some form. Ensembling is not just about averaging the predictions from different models; it's much more than that. Data scientists use many different techniques and novel ideas to combine models.

[Kaggle ensembling guide](https://mlwave.com/kaggle-ensembling-guide/) is the de facto reference, discussing ensembling in depth.

Ensembling is an effective technique that can help you climb the leaderboard (if you do it right). As with everything else, data scientists and developers have already implemented these ideas for you; all you have to do is USE THEM! Here are two libraries to help you with ensembling:

* [ML-Ensemble](http://ml-ensemble.com/) is a widely used library. Refer to [this](http://ml-ensemble.com/info/tutorials/start.html) to get started.

* [mlxtend](http://rasbt.github.io/mlxtend/) is a Python library of useful tools for day-to-day data science tasks. Refer to [this](http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/) for classification and [this](http://rasbt.github.io/mlxtend/user_guide/regressor/StackingRegressor/) for regression.
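
If you want a feel for stacking before installing either library, the sketch below uses scikit-learn's own `StackingClassifier` instead of ML-Ensemble or mlxtend. The synthetic data, the choice of base models, and the logistic regression meta-learner are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real problem (illustrative assumption).
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two deliberately different base models.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The meta-learner is trained on cross-validated (out-of-fold) predictions
# of the base models, rather than on a plain average of their outputs.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"stacked test accuracy = {stack_accuracy:.3f}")
```

Stacking tends to help most when the base models make different kinds of errors, which is why the sketch pairs a tree ensemble with a nearest-neighbour model.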

# Model Interpretation
Using the right encoding technique, choosing the right model, tuning your hyper-parameters, and ensembling your models can easily get you to the top of the leaderboard. But when you move from Kaggle to the real world, things are different. Building a model with a very high score is not enough; you also have to justify it. You should be able to explain to your business management why your model is making a certain decision. The process of understanding your model is known as __model interpretation.__

For a really long time, practitioners treated machine learning models as black boxes. Today, however, a lot of effort is being invested in understanding these seemingly black-box models.

Here's a [great resource](https://github.com/cog-data/ML_Interpretability_tutorial/blob/master/Machine_Learning_Interpretability_tutorial.ipynb) on model interpretability in machine learning.

Here are some libraries to help you with model interpretation:

* Local Interpretable Model-agnostic Explanations ([LIME](https://github.com/marcotcr/lime)) is a popular Python library that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.

* [ELI5](https://github.com/TeamHG-Memex/eli5) is a Python library that lets you visualize and debug various machine learning models using a unified API.

* SHapley Additive exPlanations ([SHAP](https://github.com/slundberg/shap)) is a unified Python library for explaining the output of any machine learning model.

* [YellowBrick](https://www.scikit-yb.org/en/latest/) offers a lot of plots for machine learning visualization, ranging from feature selection to target visualization. HIGHLY RECOMMENDED!
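
The libraries above each have their own APIs, so the sketch below does not use any of them. Instead it shows one simple, model-agnostic interpretation technique, permutation importance, using scikit-learn's `permutation_importance`. The synthetic dataset (where only the first two columns carry signal) is an assumption made so the result is easy to sanity-check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the informative features come first,
# so columns 0 and 1 carry the signal and columns 2-5 are pure noise.
X, y = make_classification(
    n_samples=800, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure how much
# the score drops; a larger drop means the model relies on that feature more.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = result.importances_mean

for index, value in enumerate(importances):
    print(f"feature {index}: mean importance = {value:.3f}")
```

This gives a global view of which features the model depends on. LIME and SHAP go further by explaining individual predictions.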

You have only just started. KEEP LEARNING!