## Missing Data

**Case Deletion (CD)** - Also known as complete case analysis. It is available in all statistical packages and is the default method in many programs. This method consists of discarding all instances (cases) with missing values for at least one feature. A variation of this method consists of determining the extent of missing data on each instance and attribute, and deleting the instances and/or attributes with high levels of missing data. Before deleting any attribute, it is necessary to evaluate its relevance to the analysis.
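
A minimal sketch of both variants, assuming the data is a NumPy array in which `np.nan` marks a missing value. The toy matrix and the 0.5 missing-fraction cutoff are illustrative assumptions, not recommended values.

```python
import numpy as np

# Toy data: 5 instances (rows) x 3 attributes (columns); np.nan marks a missing value.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, 5.0, 6.0],
    [np.nan, 8.0, np.nan],
    [10.0, 11.0, 12.0],
    [13.0, 14.0, np.nan],
])

# Plain case deletion: keep only the rows with no missing values.
complete_rows = ~np.isnan(X).any(axis=1)
X_cd = X[complete_rows]

# Variation: first drop attributes whose missing fraction exceeds a cutoff
# (0.5 is an arbitrary, illustrative choice), then drop the remaining incomplete rows.
missing_per_column = np.isnan(X).mean(axis=0)
keep_columns = missing_per_column <= 0.5
X_reduced = X[:, keep_columns]
X_variant = X_reduced[~np.isnan(X_reduced).any(axis=1)]

print(X_cd.shape, X_variant.shape)
```

Here plain deletion keeps only two of the five instances, while dropping the mostly-missing third attribute first keeps four. Whether that attribute can be dropped still depends on its relevance to the analysis.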

**Mean Imputation (MI)** - This is one of the most frequently used methods. It consists of replacing the missing data for a given feature (attribute) with the mean of all known values of that attribute in the class to which the instance with the missing attribute belongs.
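
A minimal sketch of class-conditional mean imputation for a single attribute, assuming NaN marks a missing value. The helper name `impute_by_class` and the toy data are hypothetical.

```python
import numpy as np

def impute_by_class(x, labels, statistic=np.nanmean):
    """Replace NaNs in x with a per-class statistic of the known values.

    If a class has no known values at all, its entries stay NaN.
    """
    x = x.astype(float).copy()
    for c in np.unique(labels):
        in_class = labels == c
        fill = statistic(x[in_class])        # nanmean ignores NaNs within the class
        x[in_class & np.isnan(x)] = fill
    return x

x = np.array([1.0, 3.0, np.nan, 10.0, np.nan, 14.0])
labels = np.array([0, 0, 0, 1, 1, 1])
x_mi = impute_by_class(x, labels)
print(x_mi)
```

The missing value in class 0 is filled with the mean of 1 and 3, and the one in class 1 with the mean of 10 and 14.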

**Median Imputation (MDI)** - Since the mean is affected by the presence of outliers, it seems natural to use the median instead to ensure robustness. In this case, the missing data for a given feature is replaced by the median of all known values of that attribute in the class to which the instance with the missing feature belongs. This method is also a recommended choice when the distribution of the values of a given feature is skewed.
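
To illustrate the robustness argument, here is a small sketch on made-up values for one attribute within a single class. With several classes, the same fill would be computed separately per class.

```python
import numpy as np

# Known values of one attribute within a class, with a single outlier (1000.0)
# and one missing entry (np.nan).
x = np.array([2.0, 3.0, 4.0, 1000.0, np.nan])

mean_fill = np.nanmean(x)      # pulled far upward by the outlier
median_fill = np.nanmedian(x)  # stays near the bulk of the data

x_mdi = np.where(np.isnan(x), median_fill, x)
print(mean_fill, median_fill)
```

The mean of the known values is 252.25, far from three of the four observations, while the median is 3.5.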

**KNN Imputation (KNNI)** - In this method, the missing values of an instance are imputed from a given number (k) of instances that are most similar to the instance of interest. The similarity of two instances is determined using a distance function.
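
A from-scratch sketch of the idea, not the implementation of any particular library. It assumes a Euclidean-style distance over the attributes both instances have observed, and fills a missing value with the mean of that attribute over the k nearest neighbours. The function name `knn_impute`, the distance choice, and k=2 are all illustrative assumptions.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that attribute over the k most similar rows.

    Distance is the root mean squared difference over attributes observed in both
    rows (an assumed choice). Neighbours must have the target attribute observed.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        for j in np.where(missing[i])[0]:
            candidates = []
            for r in range(len(X)):
                shared = ~missing[i] & ~missing[r]
                if r == i or missing[r, j] or not shared.any():
                    continue
                dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((dist, X[r, j]))
            if candidates:  # leave the NaN in place if no usable neighbour exists
                candidates.sort(key=lambda pair: pair[0])
                filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

X = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [1.0, 1.1, np.nan],
])
X_knn = knn_impute(X, k=2)
print(X_knn[3])
```

The last instance is closest to the first two rows, so its missing value is filled from them rather than from the distant third row.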

See also:
1. https://pypi.python.org/pypi/fancyimpute
2. https://machinelearningmastery.com/handle-missing-data-python/

## Stochastic Gradient Descent vs. Batch Gradient Descent


Whereas batch gradient descent has to scan through the entire training set before taking a single 
step (a costly operation if m, the number of training examples, is large), stochastic gradient 
descent can start making progress right away, and continues to make progress with each example it looks at. 

Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch 
gradient descent. (Note, however, that it may never “converge” to the minimum, and the parameters
θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the 
minimum will be reasonably good approximations to the true minimum.) 

For these reasons, particularly when the training set is large, stochastic gradient descent 
is often preferred over batch gradient descent.
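
A small sketch of the difference on a synthetic least-squares problem. The data, the learning rate of 0.1, and the squared-error cost J(θ) are assumptions chosen for illustration. Both methods get to look at each training example exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])   # intercept + one feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + rng.normal(0, 0.1, m)

def cost(theta):
    """J(theta): half the mean squared error over the full training set."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

alpha = 0.1

# Batch gradient descent: a single step already requires all m examples.
theta_batch = np.zeros(2)
theta_batch -= alpha * X.T @ (X @ theta_batch - y) / m

# Stochastic gradient descent: the same pass over the data yields m small steps.
theta_sgd = np.zeros(2)
for i in rng.permutation(m):
    theta_sgd -= alpha * (X[i] @ theta_sgd - y[i]) * X[i]

print(cost(theta_batch), cost(theta_sgd))
```

After one pass over the data, the stochastic iterate is already close to the minimum, while batch gradient descent has taken only its first step.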


Note - 
While it is more common to run stochastic gradient descent as described above, with a 
fixed learning rate α, slowly letting the learning rate α decrease to zero as the algorithm 
runs makes it possible to ensure that the parameters converge to the global minimum rather
than merely oscillate around it.
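
The effect of a decaying learning rate can be sketched as follows. The schedule α_t = 0.3 / (1 + t / 50), the synthetic data, and the helper name `sgd_tail_distance` are illustrative assumptions; the reference minimizer is computed in closed form with least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])
y = X @ np.array([2.0, -3.0]) + rng.normal(0, 1.0, m)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of J(theta)

def sgd_tail_distance(schedule, steps=20000, tail=1000):
    """Run SGD; return the mean distance to theta_star over the last `tail` steps."""
    theta = np.zeros(2)
    distances = []
    for t in range(steps):
        i = rng.integers(m)
        theta -= schedule(t) * (X[i] @ theta - y[i]) * X[i]
        if t >= steps - tail:
            distances.append(np.linalg.norm(theta - theta_star))
    return float(np.mean(distances))

fixed = sgd_tail_distance(lambda t: 0.3)                    # constant alpha
decaying = sgd_tail_distance(lambda t: 0.3 / (1 + t / 50))  # alpha shrinks toward zero
print(fixed, decaying)
```

With the fixed rate the iterates keep wandering around the minimizer, whereas with the decaying rate they settle much closer to it.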

Creds: Andrew Ng




## Bias-Variance Tradeoff

The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an
algorithm to miss the relevant relations between features and target outputs (underfitting).
As a rough estimate, Bias ≈ Mean(y - ŷ), the average gap between the true targets y and the predictions ŷ.

The variance is error from sensitivity to small fluctuations in the training set. High variance 
can cause overfitting: modeling the random noise in the training data, rather than the intended 
outputs.

**The more complex the model f(x) is, the more data points it will capture, and the lower the bias
will be. However, complexity will make the model "move" more to capture the data points, and hence
its variance will be larger.**
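
The tradeoff can be made concrete with a small simulation. The sketch below refits polynomials of two different degrees on many noisy training sets drawn from an assumed true function sin(2πx), then estimates squared bias and variance of the predictions. The function, noise level, and degrees are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.05, 0.95, 50)

def f(x):
    """Assumed true function generating the data."""
    return np.sin(2 * np.pi * x)

def bias2_and_variance(degree, n_trials=200, noise=0.3):
    """Refit a polynomial on many noisy training sets and summarize its predictions."""
    preds = np.empty((n_trials, len(x_test)))
    for t in range(n_trials):
        y_train = f(x_train) + rng.normal(0, noise, len(x_train))
        coefs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coefs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # average prediction vs truth
    variance = np.mean(preds.var(axis=0))                    # spread across training sets
    return bias2, variance

simple = bias2_and_variance(degree=1)     # straight line: cannot follow the sine
flexible = bias2_and_variance(degree=7)   # flexible: follows the sine, and the noise
print(simple, flexible)
```

The straight line shows the larger squared bias, and the degree-7 polynomial shows the larger variance.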

Creds: Wikipedia
    
    
    

## Bias-Variance Tradeoff: Decision Trees (DT), Random Forests (RF), Gradient Boosting (GBM)

Shallow decision trees have high bias and low variance.
Deep decision trees have low bias and high variance. 



**Bagging**

Before we talk about RFs, let’s first talk about Bagging. 
Bagging is a simpler version of RFs and can be described as follows:

1. Create a new training set by sampling the original training set with replacement. 
The new training set has the same number of data points as the original, so there will 
generally be duplicates. This process is known as bootstrapping. 

2. Bootstrap a bunch of times, say 500. Train a DT on each bootstrap.

3. Use all 500 DTs to make predictions. For regression, this can be simply the average. 
For classification, majority voting.

Bagging stands for Bootstrap Aggregation, and can be applied to any model class, although 
historically it was most often used with DTs. 

**The key intuition of Bagging is that it reduces the variance of your model class.**

In the simple 1-D regression example from the original answer (the figure is not reproduced here), 
Bagging tries to bring the prediction as close to the black line as possible. So when you use Bagging, 
you’re incentivized to use deep decision trees because they have high variance and low bias.
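
The three Bagging steps can be sketched from scratch on a synthetic 1-D regression problem. The tiny regression tree below is a simplified stand-in written for this example, not a library implementation. The true function sin(2πx), the noise level, the depth limit of 8, and the use of 50 trees instead of 500 (to keep the run short) are all assumptions.

```python
import numpy as np

def fit_tree(x, y, max_depth):
    """Fit a 1-D regression tree; returns a leaf value or (threshold, left, right)."""
    if max_depth == 0 or len(y) < 2 or np.all(x == x[0]):
        return float(y.mean())
    best_sse, best_t = np.inf, None
    for t in np.unique(x)[:-1]:                 # every threshold leaves both sides non-empty
        left = x <= t
        sse = ((y[left] - y[left].mean()) ** 2).sum() + ((y[~left] - y[~left].mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_t = sse, t
    left = x <= best_t
    return (best_t,
            fit_tree(x[left], y[left], max_depth - 1),
            fit_tree(x[~left], y[~left], max_depth - 1))

def predict_tree(tree, x):
    """Walk each input down the tree until a leaf value is reached."""
    out = np.empty(len(x))
    for i, xi in enumerate(x):
        node = tree
        while isinstance(node, tuple):
            node = node[1] if xi <= node[0] else node[2]
        out[i] = node
    return out

def f(x):
    return np.sin(2 * np.pi * x)                # assumed true function

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 100)
y_train = f(x_train) + rng.normal(0, 0.3, 100)
x_test = np.linspace(0, 1, 200)

# Baseline: a single deep tree fit on the original training set.
single_pred = predict_tree(fit_tree(x_train, y_train, max_depth=8), x_test)

# Bagging: bootstrap, train one deep tree per bootstrap, average the predictions.
n_trees = 50
bagged_pred = np.zeros(len(x_test))
for _ in range(n_trees):
    idx = rng.integers(0, len(x_train), len(x_train))    # sample with replacement
    tree = fit_tree(x_train[idx], y_train[idx], max_depth=8)
    bagged_pred += predict_tree(tree, x_test) / n_trees

mse_single = np.mean((single_pred - f(x_test)) ** 2)
mse_bagged = np.mean((bagged_pred - f(x_test)) ** 2)
print(mse_single, mse_bagged)
```

Averaging many deep trees smooths out the noise each individual tree has memorized, which is the variance reduction described above.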




**Random Forests**

A Random Forest is a generalization of Bagging that is specific to DTs. At each branch in the 
decision tree, Random Forest training also subsamples the features in addition to the training examples.
Intuitively, this process further de-correlates the individual trees, which is good for Bagging, since 
the main limitation of Bagging is that bootstrapping is not the same as drawing fresh samples from the 
true data distribution.
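
The extra ingredient, per-split feature subsampling, can be sketched as below. The helper `best_split`, the synthetic data with a single informative feature, and the subset size of sqrt(d) (a commonly used choice, assumed here) are illustrative.

```python
import numpy as np

def best_split(X, y, feature_indices):
    """Return (sse, feature, threshold) of the best split among the given features."""
    best = (np.inf, None, None)
    for j in feature_indices:
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            sse = ((y[left] - y[left].mean()) ** 2).sum() + ((y[~left] - y[~left].mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t)
    return best

rng = np.random.default_rng(0)
n, d = 200, 9
X = rng.normal(size=(n, d))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, n)     # only feature 0 is informative

# Plain Bagging: every split may consider all d features,
# so every tree tends to split on the same dominant feature first.
_, bagging_feature, _ = best_split(X, y, range(d))

# Random Forest: each split only considers a fresh random subset of the features.
subset = rng.choice(d, size=int(np.sqrt(d)), replace=False)
_, rf_feature, _ = best_split(X, y, subset)

print(bagging_feature, subset, rf_feature)
```

Because the dominant feature is often left out of the subset, different trees are forced to split differently, which is one way to see the de-correlation effect.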



**Boosting**

There are a few different boosting methods around, including AdaBoost and Gradient Boosting. All of them take
the same high-level approach:

1. Keep an overall predictor that is the (weighted) average of a bunch of models.

2. Train first model on original training data, and initialize overall predictor as just this single model.

3. Assess the error of the overall predictor and modify the training data to focus on areas of high error.

   a) For AdaBoost, this means re-weighting the data points so that poorly modeled data points get higher weight.
   
   b) For Gradient Boosting, this means redefining the supervised prediction target to be some kind of residual 
   between the ground truth and the overall predictor.
   
4. Train a new model on the modified training data, and add it to the overall predictor.

5. Repeat Steps 3 and 4.
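
The steps above, in their Gradient Boosting variant (3b) for squared error, can be sketched with depth-1 trees (stumps) as the shallow base model. The stump helpers, the synthetic data, and the shrinkage factor `learning_rate` are assumptions made for this illustration. The shrinkage factor is a common addition that the steps above do not mention.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree on 1-D inputs: (threshold, left_value, right_value)."""
    best = None
    for t in np.unique(x)[:-1]:
        left = x <= t
        lv, rv = y[left].mean(), y[~left].mean()
        sse = ((y[left] - lv) ** 2).sum() + ((y[~left] - rv) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)

learning_rate, n_rounds = 0.1, 50

# Step 2: the first model is trained on the original targets.
prediction = predict_stump(fit_stump(x, y), x)
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residual = y - prediction                    # step 3b: the new target is the residual
    stump = fit_stump(x, residual)               # step 4: train on the modified target
    prediction = prediction + learning_rate * predict_stump(stump, x)
    train_mse.append(np.mean((y - prediction) ** 2))

print(train_mse[0], train_mse[-1])
```

Each stump alone is a high-bias model, but the running sum fits the training data progressively better. The training error never increases from one round to the next.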

One can interpret boosting as trying to minimize the bias of the overall predictor. So when you use boosting, 
you’re incentivized to use shallow decision trees because they have low variance and high bias. Using high variance
base models in boosting runs a much higher risk of overfitting than approaches like Bagging.



**Summary**

Bagging and RFs try to create multiple representative data sets, and then predict the average of the models 
trained on these data sets. This process is exactly what you want to do if you want to minimize the variance 
& overfitting of your model class, but does nothing about minimizing bias. On the other hand, Boosting surgically
manipulates the training set to focus on areas of high error. This is exactly what you’d do if you worry about 
your model class having high bias and being unable to globally model the data distribution well. However, Boosting
itself ignores the overfitting issue. So when you use Bagging & RFs, try to use high variance & low bias models.
Conversely, when you use Boosting, try to use low variance & high bias models.

Creds: https://www.quora.com/What-are-the-differences-between-Random-Forest-and-Gradient-Tree-Boosting-algorithms



## Tuning

**GBM:**
1. https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/


**XGBoost:**
1. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
2. https://www.kaggle.com/babatee/intro-xgboost-classification


**LightGBM:**
1. https://github.com/Microsoft/LightGBM/blob/master/docs/Features.rst#leaf-wise-best-first-tree-growth
2. https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters-Tuning.rst
3. https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
4. https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide



**Why LightGBM is faster than GBM (XGBoost) and when to use which**

XGBoost is essentially a faster, regularized version of GBM, and both use a *level-wise* 
approach to grow trees. In this strategy, nodes closer to the tree root are split first, 
so the tree grows one level at a time.

However, LightGBM uses a *leaf-wise* tree growth strategy, where the tree grows by splitting the leaf with the largest loss reduction. This can produce unbalanced, deep trees that overfit, so controlling the *max_depth* parameter is important.

Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit. Leaf-wise growth tends to excel on larger datasets, where it is considerably faster than level-wise growth.
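
As a hedged starting point, a LightGBM parameter dictionary that reins in leaf-wise growth might look like the sketch below. The parameter names (`num_leaves`, `max_depth`, `min_data_in_leaf`, `learning_rate`, `objective`) are LightGBM parameters, but the values are illustrative assumptions, not tuned recommendations.

```python
# Illustrative values only; tune them for the dataset at hand.
params = {
    "objective": "binary",      # assumed task: binary classification
    "max_depth": 6,             # caps how deep leaf-wise growth can go
    "num_leaves": 31,           # caps the number of leaves per tree
    "min_data_in_leaf": 20,     # avoids leaves fit to a handful of examples
    "learning_rate": 0.05,
}

# A binary tree of depth max_depth has at most 2 ** max_depth leaves,
# so a num_leaves above that bound could never be reached.
max_possible_leaves = 2 ** params["max_depth"]
print(params["num_leaves"], max_possible_leaves)
```

Keeping `num_leaves` well below `2 ** max_depth` is one way to keep leaf-wise trees from growing as complex as a full level-wise tree of the same depth.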

Creds: https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/

**Linear Models**
1. http://scikit-learn.org/stable/modules/linear_model.html

**Feature Engineering**
1. https://elitedatascience.com/feature-engineering-best-practices