# Missing Data Analysis


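Deletion methods:
1. Listwise deletion: drops every row that has a missing value. Leaves a smaller dataset, but does not introduce the artificial bias that imputation can (the estimates stay unbiased only if the data is MCAR)
2. Column deletion: drops a variable that has too many missing values

Imputation methods:
(Based on the assumption that missing data is MCAR (missing completely at random) or at least MAR (missing at random))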
1. Numeric variables:
> * Mean imputation: rarely used, because it does not preserve the relationships between variables, shrinks the variance (and hence the standard errors), which inflates type 1 errors, and is sensitive to outliers
> * Linear regression imputation: better than mean imputation, but the imputed values fall exactly on the regression line, so it still understates standard errors and introduces bias (correlations between variables are overstated)
> * MLE imputation: estimates the distribution parameters from the observed data and uses them to predict the missing values. Difficult to configure
> * Multiple imputation: the imputation (mean/mode/regressor/classifier, with random variation) is done multiple times to create several completed datasets; the analysis is run on each of them and the final results are pooled.
> * MICE (Multiple imputation with chained equations): each variable with missing values is modelled in turn (linear/logistic regression/other classifiers) with the other variables as predictors and its missing values as target. Mean/median/random values are used as the initial substitution. The regressions are repeated over the variables in sequence for several cycles.
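
The variance shrinkage caused by mean imputation can be seen on a toy column. This is a minimal sketch using only the standard library; the data, the 30% missing rate and the variable names are made up for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical numeric column with roughly 30% of the values missing (None).
complete = [random.gauss(50, 10) for _ in range(200)]
column = [x if random.random() > 0.3 else None for x in complete]

observed = [x for x in column if x is not None]
mean_observed = statistics.mean(observed)

# Mean imputation: every missing entry gets the same value.
imputed = [x if x is not None else mean_observed for x in column]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(f"std dev of observed values:    {sd_observed:.2f}")
print(f"std dev after mean imputation: {sd_imputed:.2f}")
```

The mean is unchanged, but the spread is smaller after imputation because the filled-in values add no variation while increasing the sample size.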









#  **Categorical variables encoding techniques**
# 1. Label encoding: 
* suitable only for ordinal variables, not nominal ones, because the integer codes imply an order between the categories

# 2. One-hot encoding:
* suitable when there are only a few categorical variables with few categories each, because it increases dimensionality
* not well suited to tree based algorithms: the sparse dummy columns can deteriorate model performance, and each dummy on its own offers little gain, so the tree is less likely to pick the categorical variable for a split close to the root
* better suited for linear models
* binary encoding is a variation of this that needs only about log(n+1)/log 2 columns for n categories
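
A small sketch comparing label, one-hot and binary encoding on a made-up `colours` feature. It assumes the binary encoding writes label codes that start at 1 in base 2, which is the convention under which log(n+1)/log 2 columns are needed; all names here are illustrative.

```python
import math

colours = ["red", "green", "blue", "green", "red", "yellow"]  # hypothetical nominal feature
categories = sorted(set(colours))   # ['blue', 'green', 'red', 'yellow']
n = len(categories)

# Label encoding: one integer per category (implies an order, so only safe for ordinal data).
label_code = {cat: i + 1 for i, cat in enumerate(categories)}

def one_hot(value):
    """One column per category; exactly one of them is 1."""
    return [1 if value == cat else 0 for cat in categories]

# Binary encoding: write the label code in base 2, one column per bit.
n_bits = math.ceil(math.log2(n + 1))

def binary(value):
    code = label_code[value]
    return [(code >> bit) & 1 for bit in reversed(range(n_bits))]

for c in categories:
    print(c, label_code[c], one_hot(c), binary(c))
```

With 4 categories, one-hot needs 4 columns and binary encoding 3; the gap widens quickly as the number of categories grows.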

# 3. Target encoding:
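* encodes each category as the mean of the target variable over the rows in that category.
* works well with both linear and tree-based models
* prone to data leakage, because each row's own target value feeds into its encoding. Variations to avoid the same:

> * cross fold target encoding: the category means are computed on the other folds only
> * leave one out target encoding: the category mean is computed without the current row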
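
A minimal sketch of plain versus leave-one-out target encoding on made-up rows. The fallback to the global mean for categories that occur once is an assumption of this sketch, not a fixed rule.

```python
from collections import defaultdict

# Hypothetical training rows: (category, binary target).
rows = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("c", 1)]

sums = defaultdict(float)
counts = defaultdict(int)
for cat, y in rows:
    sums[cat] += y
    counts[cat] += 1

global_mean = sum(y for _, y in rows) / len(rows)

def plain_encoding(cat):
    """Naive target encoding: mean target of the category (includes the row itself)."""
    return sums[cat] / counts[cat]

def loo_encoding(cat, y):
    """Leave-one-out: mean target of the category without the current row."""
    if counts[cat] == 1:
        return global_mean  # assumption: fall back to the global mean for singletons
    return (sums[cat] - y) / (counts[cat] - 1)

for cat, y in rows:
    print(cat, y, round(plain_encoding(cat), 3), round(loo_encoding(cat, y), 3))
```

In the plain version the row's own target is part of its feature value; the leave-one-out version removes that direct leak.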

# 4. Problem specific feature engineering: TBD examples










# **Linear Regression**

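* Ways to solve linear regression: the least squares estimate can be found either analytically (differentiate the squared error and solve the resulting normal equations, usually written with matrices) or numerically (gradient descent). Numerical solutions are used when the number of features is so large that forming and inverting XᵀX becomes too expensive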
http://math.mit.edu/~gs/linearalgebra/ila0403.pdf
https://machinelearningmastery.com/linear-regression-with-maximum-likelihood-estimation/
* univariate regression https://are.berkeley.edu/courses/EEP118/current/derive_ols.pdf
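
Both routes can be sketched for the univariate case. The data, learning rate and iteration count below are arbitrary choices for illustration; the two solutions should agree to several decimals.

```python
# Hypothetical data, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
n = len(xs)

# Analytical solution: set the derivatives of the squared error to zero.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Numerical solution: gradient descent on the mean squared error.
w, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2 / n * sum(errors)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"closed form:      slope={slope:.4f}, intercept={intercept:.4f}")
print(f"gradient descent: slope={w:.4f}, intercept={b:.4f}")
```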


(https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/)
Assumptions that the data has to satisfy for the model to be valid:
* 1) Linearity: the relationship between the predictors and the target is linear. The predicted vs actual plot should follow a straight line, and the residual plot (standardised residual vs predicted) should show no pattern


> * use a non-linear model/add non-linear terms


* 2) No multicollinearity: checked using VIF (variance inflation factor) https://blog.minitab.com/blog/starting-out-with-statistical-software/what-in-the-world-is-a-vif
> * remove correlated variables
> * perform dimensionality reduction
> * combine correlated variables
* 3) Normality of residuals: Anderson-Darling test (p value < 0.05 = non-normal)
* 4) No autocorrelation of residuals: Durbin-Watson test (a statistic close to 2 indicates no autocorrelation)
> * use lag terms
* 5) Homoscedasticity: equal variance among error terms.
> * use weighted least squares regression
> * include missing variables
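
Two of the checks above are simple enough to compute by hand. The sketch below implements the Durbin-Watson statistic and, for the special case of only two predictors, the VIF (where the R² of one predictor regressed on the other is just their squared correlation). The residual series and predictors are made up; a VIF above roughly 5 to 10 is a common rule of thumb for problematic multicollinearity.

```python
import math

def durbin_watson(residuals):
    """Close to 2: no first-order autocorrelation; towards 0: positive; towards 4: negative."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def pearson_r(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with only two predictors R^2 is their squared correlation."""
    return 1.0 / (1.0 - pearson_r(x1, x2) ** 2)

# Hypothetical residual series.
alternating = [1, -1, 1, -1, 1, -1]   # negative autocorrelation
trending = [1, 1, 1, -1, -1, -1]      # positive autocorrelation
print(durbin_watson(alternating), durbin_watson(trending))

# Hypothetical predictors: x2 is almost a multiple of x1, x3 is nearly uncorrelated with x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5, 1, 4, 2, 6, 3]
print(vif_two_predictors(x1, x2), vif_two_predictors(x1, x3))
```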







**Goodness of fit measures for linear regression:**
* R2: what fraction of the total variation in y does the model explain?
* adjusted R2: the same fraction, penalised for the number of predictors, so adding a predictor that does not improve the fit lowers it
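
A short sketch of both measures on made-up targets and predictions. The predictions are held fixed while the assumed number of predictors changes, to show the penalty in isolation.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]  # hypothetical targets
y_pred = [3.5, 4.5, 7.5, 8.5, 11.5, 12.5]  # hypothetical model predictions

r2 = r_squared(y_true, y_pred)
adj_one = adjusted_r_squared(r2, n_samples=6, n_predictors=1)
adj_three = adjusted_r_squared(r2, n_samples=6, n_predictors=3)
print(f"R2={r2:.4f}, adjusted (1 predictor)={adj_one:.4f}, adjusted (3 predictors)={adj_three:.4f}")
```

For the same fit, the adjusted value drops as more predictors are used to achieve it.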

**Confidence intervals vs Prediction intervals**
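* CI: a range for the *mean* of y at a given x. A 95% CI comes from a procedure that, over repeated samples, produces intervals containing the true mean 95% of the time
* PI: a range for a single unknown y (the exact value and not the mean), used while predicting. It is always wider than the CI, because it also covers the noise of an individual observation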
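
For simple (one predictor) linear regression the two intervals at a point $x_0$ differ only by the extra $1$ under the square root. The notation below is an assumption of this note: $s = \sqrt{SSE/(n-2)}$ is the residual standard error and $t_{n-2,\,1-\alpha/2}$ the t quantile.

$$
\text{CI for the mean of } y: \quad \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
$$

$$
\text{PI for a new } y: \quad \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
$$

Both intervals are narrowest at $x_0 = \bar{x}$, and the CI shrinks towards zero width as $n$ grows while the PI does not.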

# **Regularised Regression**
> * Ridge regression: l2 regularisation (shrinks coefficients towards zero, but not exactly to zero)
> * Lasso regression: l1 regularisation (can shrink coefficients exactly to zero, hence used for feature selection)

when to use what: experimental, but l1 if focus is on feature selection

why does l1 norm result in feature selection:
https://towardsdatascience.com/regularization-in-machine-learning-connecting-the-dots-c6e030bfaddd


> * regularisation requires variables to be standardised, because the penalty depends on the size of the coefficients, which in turn depends on the scale of each variable
https://www.kaggle.com/questions-and-answers/59305
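
One way to see why l1 gives exact zeros is the special case of orthonormal predictors (XᵀX = I), which is an assumption made here purely for illustration. With the objectives ½‖y − Xb‖² + λ‖b‖₁ (lasso) and ½‖y − Xb‖² + (λ/2)‖b‖² (ridge), each coefficient has a closed form in terms of its unregularised value: lasso subtracts λ and clips at zero, ridge divides by 1 + λ.

```python
def lasso_coefficient(ols_coef, lam):
    """Soft-thresholding: shrink by lam and clip at zero."""
    if ols_coef > lam:
        return ols_coef - lam
    if ols_coef < -lam:
        return ols_coef + lam
    return 0.0

def ridge_coefficient(ols_coef, lam):
    """Proportional shrinkage: smaller, but never exactly zero for a non-zero input."""
    return ols_coef / (1.0 + lam)

ols_coefs = [3.0, -1.5, 0.4, -0.2]  # hypothetical unregularised coefficients
lam = 0.5

lasso = [lasso_coefficient(c, lam) for c in ols_coefs]
ridge = [ridge_coefficient(c, lam) for c in ols_coefs]
print("lasso:", lasso)
print("ridge:", ridge)
```

The two small coefficients are removed entirely by lasso, while ridge keeps all four at reduced size.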



# **Logistic Regression**

> * loss function is logloss, i.e. the negative log likelihood (binary), or cross entropy (multiclass). It penalises confident wrong predictions far more heavily than small deviations
https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html#introduction
http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function
>* the output probability distribution is a Bernoulli distribution (for binary) or a categorical (single-trial multinomial) distribution (for multiclass)

##### (the loss function is differentiated with respect to the parameters via the predicted y (the function that learns the predictions), using the chain rule)
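
A minimal sketch of the binary log loss and its gradient for a model with a single feature. The function names and the single-feature setup are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p):
    """Binary cross entropy for one example with label y in {0, 1} and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The penalty grows sharply as the prediction for a true positive moves towards 0.
for p in (0.9, 0.6, 0.4, 0.1, 0.01):
    print(f"y=1, p={p} -> loss={log_loss(1, p):.3f}")

def gradient(w, b, x, y):
    """Chain rule: dL/dw = (p - y) * x and dL/db = (p - y), with p = sigmoid(w * x + b)."""
    p = sigmoid(w * x + b)
    return (p - y) * x, (p - y)

print(gradient(w=0.0, b=0.0, x=2.0, y=1))
```

The gradient has the same "prediction minus target" form as in linear regression, which is what makes gradient descent on this loss convenient.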


# **Evaluation Metrics**

> * Confusion matrix derived metrics: Accuracy, precision, recall, f1 score (harmonic mean of precision and recall)
precision = the number of correctly identified positives among the total identified as positive
recall = number of correctly identified positives among the total actual positives
> * probability based metrics: logloss 
> * auroc and aupr both have the disadvantage of being ranking metrics only (they ignore how well calibrated the probabilities are). auroc has the additional disadvantage of looking overly optimistic on imbalanced data (positive class is smaller), which p-r curves are more resistant against
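
The confusion matrix metrics, computed on made-up counts for an imbalanced problem (10 positives in 1000 rows). The false positive rate is included because it is the x-axis of the ROC curve.

```python
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical confusion matrix counts.
metrics = classification_metrics(tp=8, fp=12, fn=2, tn=978)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and the false positive rate both look excellent here because true negatives dominate, while precision shows that most predicted positives are wrong. This is the reason ROC based summaries can flatter a model on imbalanced data.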

## **TBD: Imbalanced data handling, K-fold cross validation**

# **Random Forest**
> * deep, fully grown decision trees (low bias, high variance) are built independently of each other (so they can be trained in parallel) using bagging (bootstrap aggregation: sampling with replacement)
> * each tree is built:
> > * on a bootstrap sample, which contains roughly 2/3rd of the distinct samples; the remaining 1/3rd (out-of-bag) is used to calculate oob error
> > * with all features available, but only a random subset of them is considered when choosing the split at each node
>>> * gini impurity/information gain: gini impurity is computationally less expensive because it doesn't contain the log term. Regression trees use MSE as the splitting criteria
>>* each tree is built without pruning
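
A quick simulation of the out-of-bag share of a bootstrap sample, plus the two impurity measures. The sample size and seed are arbitrary; the theoretical out-of-bag share is (1 − 1/n)ⁿ, which approaches 1/e ≈ 0.368.

```python
import math
import random

random.seed(42)
n_samples = 10_000

# Bootstrap: draw n_samples row indices with replacement.
bootstrap = [random.randrange(n_samples) for _ in range(n_samples)]
oob_fraction = 1 - len(set(bootstrap)) / n_samples
theory = (1 - 1 / n_samples) ** n_samples
print(f"out-of-bag fraction: {oob_fraction:.3f} (theory: {theory:.3f})")

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (no logarithm needed)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Entropy in bits, the impurity behind information gain."""
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return sum(p * math.log2(1 / p) for p in proportions)

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # maximally mixed node
print(gini([1, 1, 1, 1]), entropy([1, 1, 1, 1]))  # pure node
```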

## Favourable characteristics:
> * some implementations can handle missing data directly (should experiment with other missing data imputation methods regardless)
> * robust to outliers in predictors (not target)
>* non-linear classifier

## Handling data imbalance for random forest
>* weighted random forest: assign higher weights to the minority class, so that classification errors on the minority class are penalised more heavily
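
One common way to pick the weights is inverse class frequency, n_samples / (n_classes × class count), which is the heuristic behind scikit-learn's `class_weight="balanced"` option. A sketch on a made-up target:

```python
from collections import Counter

labels = [0] * 95 + [1] * 5  # hypothetical imbalanced target: 5% positives

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: the rarer the class, the larger its weight.
class_weight = {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
print(class_weight)

# With these weights each class contributes the same total weight.
total_weight = {cls: class_weight[cls] * counts[cls] for cls in counts}
print(total_weight)
```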







# **XGBoost**

* Ensemble model which essentially uses a form of gradient descent to iteratively find an update (a weak learner in this case) in the right direction 
* Each weak learner is a tree which predicts the residuals of the latest predicted target values. 
* The model building starts with an initial predicted value which is the same for all instances. The residuals are calculated using this value and a tree is fitted to them. The output of the tree is scaled using a learning rate and added to the previous predictions. From the new predictions the new pseudo residuals are calculated, and so on until the residuals are very small or the max number of iterations is reached. 
* The split at each node is decided based on gain, which is computed from the similarity scores (these play the role that impurity plays in ordinary trees) of the node and its children 
* The tree is pruned based on the gain: branches whose gain falls below a threshold (gamma) are removed
* The output of the leaf node is regularized. The reg parameter (lambda) shrinks similarity scores which results in more pruning which leads to less overfitting.
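
The split and pruning logic can be sketched for regression with squared error loss, where the similarity score of a node is (sum of residuals)² / (number of residuals + lambda). The residuals and the gamma threshold below are made up, and this is a simplified illustration rather than the library's actual implementation.

```python
def similarity(residuals, reg_lambda):
    """Similarity score of a node for squared-error loss."""
    return sum(residuals) ** 2 / (len(residuals) + reg_lambda)

def leaf_output(residuals, reg_lambda):
    """Regularised leaf value: the plain mean of the residuals when reg_lambda is 0."""
    return sum(residuals) / (len(residuals) + reg_lambda)

def gain(left, right, reg_lambda):
    """Gain of a split: similarity of the children minus similarity of the parent."""
    parent = left + right
    return (similarity(left, reg_lambda) + similarity(right, reg_lambda)
            - similarity(parent, reg_lambda))

# Hypothetical residuals on either side of a candidate split.
left, right = [-10.5, -7.5], [6.5, 7.5]
gamma = 200.0  # hypothetical pruning threshold

for reg_lambda in (0.0, 1.0):
    g = gain(left, right, reg_lambda)
    print(f"lambda={reg_lambda}: gain={g:.2f}, "
          f"left leaf={leaf_output(left, reg_lambda):.2f}, keep split={g > gamma}")
```

With lambda = 0 the split survives the threshold; with lambda = 1 the scores shrink, the gain falls below gamma and the split would be pruned, and the leaf output moves closer to zero.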

Advantages:
* is fast because it takes the computer hardware into account (parallelisation, cache awareness and hard disk usage for very large datasets)
* can handle large feature sets and data
* can handle missing values well

refs: https://medium.com/syncedreview/tree-boosting-with-xgboost-why-does-xgboost-win-every-machine-learning-competition-ca8034c0b283
https://www.youtube.com/watch?v=oRrKeUCEbq8