# Meta Machine Learning

Predictive modeling is an umbrella term for approaches, including machine learning and artificial intelligence, that focus on **predictions**. While the primary goal of predictive modeling is to generate accurate predictions, a secondary concern is **interpretability**.

Beware of the reasons why predictive modeling fails: 
   - inadequate pre-processing of data; 
   - inadequate model validation; 
   - training and testing data are not from the same distribution, or lack of stability; 
   - overfitting. 

Note the importance of **expert judgments**:
   - "In the end, predictive modeling is not a substitute for intuition, but a complement."
   - "Traditional experts make better decisions when they are provided the results of statistical predictions."

It is important to **understand the predictors**:
 - Predictor sets may contain numerically redundant information.
 - Predictors may be sparse.
 - Predictors may not be relevant to the response. **Feature selection** is the process of determining the minimum set of relevant predictors needed.
 - One needs to be mindful of the number of samples relative to the number of predictors.
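
The checks in the list above can be sketched in a few lines. This is a minimal illustration using only the standard library, not a prescribed workflow: the helper names (`pearson`, `redundant_pairs`, `zero_fraction`), the 0.95 correlation threshold, and the toy data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose |correlation| exceeds threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                pairs.append((a, b))
    return pairs

def zero_fraction(values):
    """Fraction of entries equal to zero, a crude sparsity measure."""
    return sum(1 for v in values if v == 0) / len(values)

# Hypothetical toy data: x2 is an exact rescaling of x1, x3 is mostly zeros.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "x3": [0.0, 0.0, 0.0, 7.0, 0.0, 0.0],
}

print("redundant pairs:", redundant_pairs(data))
print("zero fraction:", {k: zero_fraction(v) for k, v in data.items()})
print("samples per predictor:", len(data["x1"]) / len(data))
```

A high absolute correlation only flags pairwise linear redundancy; redundancy spread across three or more predictors would need a different check.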
    
   

It is generally good practice to **visualize the data** before doing anything else.
  - If the predictors are low-dimensional, a scatterplot probably suffices.
  - If there are multiple predictors, plots that show the pairwise relationships between predictors are needed.

For any given problem, if possible, **evaluate several models**.
 - Model assessment for supervised machine learning problems is usually done with [**cross-validation**](cross_validation.ipynb), which estimates prediction error.
 - Traditional statistical learning, which emphasizes interpretability, often uses [**information criteria and other metrics**](metrics_and_information_criterions.ipynb), more often than not calculated by resampling.
 - Visualization can be useful in gauging model performance.
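
As a rough sketch of comparing several models by cross-validation, the example below runs k-fold cross-validation by hand on two deliberately simple models: a mean-only baseline and a least-squares line. The function names, the choice of five folds, and the synthetic data are assumptions for illustration; in practice a library implementation would normally be used.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=5, seed=0):
    """Mean squared error averaged over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errs = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Hypothetical data: a noisy linear trend.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

# The same folds (same seed) are used for both models, so the scores are comparable.
scores = {"mean": cv_mse(fit_mean, xs, ys), "line": cv_mse(fit_line, xs, ys)}
print(scores)
```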

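**As an aside**, when encoding one-hot target vectors for multi-class classification with $K$ classes and a KL-divergence loss, it is sometimes useful to set the target to $[\epsilon, \epsilon, 1-(K-1)\epsilon, \epsilon, \epsilon]$ (shown here for $K=5$ with the third class as the true label) instead of $[0, 0, 1, 0, 0]$. This is called **label smoothing**. It aids gradient descent because the smoothed target can be matched by finite logits, whereas an exact 0/1 target pushes the logits toward infinity.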
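
A minimal sketch of label smoothing with a KL-divergence loss, written in plain Python for clarity. The function names, the value $\epsilon = 0.02$, and the example logits are assumptions for illustration; the target puts $\epsilon$ on each wrong class and $1-(K-1)\epsilon$ on the true class.

```python
import math

def smooth_one_hot(true_class, num_classes, eps=0.02):
    """Target with eps on every wrong class and 1 - (K - 1) * eps on the true one."""
    if not 0 <= eps < 1 / num_classes:
        raise ValueError("eps must satisfy 0 <= eps < 1 / num_classes")
    target = [eps] * num_classes
    target[true_class] = 1.0 - (num_classes - 1) * eps
    return target

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q); terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# K = 5 classes, true class at index 2.
target = smooth_one_hot(true_class=2, num_classes=5, eps=0.02)
pred = softmax([0.1, 0.2, 3.0, -0.5, 0.0])  # hypothetical model logits

print("target:", target)
print("loss:", kl_divergence(target, pred))
```

The loss is zero exactly when the predicted distribution equals the smoothed target, which a softmax over finite logits can achieve.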

## Reference

- Kuhn, M. and Johnson, K., *Applied Predictive Modeling*, Chapters 1-2