# Advice and Good Practice

Predictive modeling is an umbrella term covering machine learning, artificial intelligence, and related fields that focus on **predictions**. While its primary goal is to generate accurate predictions, a secondary concern is **interpretability**.

Note the importance of **expert judgment**:
   - "In the end, predictive modeling is not a substitute for intuition, but a complement."
   - "Traditional experts make better decisions when they are provided the results of statistical predictions."

Things to watch out for when preparing **data**:
- **Leakage**: Information about labels sneaks into features
- **Sample bias**: Test inputs and deployment inputs have different distributions
- **Nonstationarity**: The thing you are modeling changes over time
    - **Covariate shift**: the input distribution changes over time
    - **Concept shift**: the correct output for a given input changes over time
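
Two of these issues can be screened for with simple checks. The following is a minimal sketch on synthetic data, not a definitive procedure: the column names (`age`, `income`, `followup_flag`) and the thresholds (`0.95`, `0.01`) are illustrative assumptions. It flags features that are suspiciously correlated with the label (possible leakage) and uses a two-sample Kolmogorov-Smirnov test to compare training inputs against later inputs (possible sample bias or covariate shift).

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n = 500

# Synthetic training data; all column names are hypothetical.
train = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
})
train["label"] = (train["income"] + rng.normal(0, 15, n) > 50).astype(int)
# A leaky feature: recorded after the outcome is known, so it almost equals the label.
train["followup_flag"] = train["label"] + rng.normal(0, 0.05, n)

# Leakage heuristic: a feature that is almost perfectly correlated with the
# label deserves a closer look (the 0.95 threshold is an arbitrary assumption).
features = ["age", "income", "followup_flag"]
label_corr = train[features].corrwith(train["label"]).abs()
suspicious = label_corr[label_corr > 0.95].index.tolist()

# Inputs seen later (e.g. at deployment); here "income" has drifted upward.
deploy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(65, 15, n),
})

# Distribution check: a small p-value suggests the two samples of a feature
# were not drawn from the same distribution.
shift_pvalues = {
    col: ks_2samp(train[col], deploy[col]).pvalue for col in ["age", "income"]
}
shifted = [col for col, p in shift_pvalues.items() if p < 0.01]

print("possible leakage:", suspicious)
print("possible shift:", shifted)
```

A high correlation does not prove leakage, and a per-feature test cannot detect concept shift, so these checks are only a starting point for a closer look at how the data was collected.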

Look at [data preprocessing](data_preprocessing.ipynb) for more detailed suggestions.

It is important to **understand the predictors** (see also the next point):
 - Predictor sets may contain **numerically redundant information**.
 - Predictor sets may contain **missing values**.
 - Predictors may be **sparse**.
 - Predictors may not be relevant to the response; **feature selection** is the process of determining the minimum set of relevant predictors needed.
 - Be mindful of the number of samples relative to the number of predictors. Regularization may come in handy when there are few samples per predictor.
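
The checks in this list can be sketched with a few lines of pandas. This is an illustrative example on synthetic data: the column names and the `0.99` correlation cutoff are assumptions, and absolute pairwise correlation is only one of several ways to detect redundant predictors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors; all column names are hypothetical.
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
X["x1_copy"] = 2.0 * X["x1"] + 1.0                       # redundant with x1
X["mostly_zero"] = (rng.random(n) < 0.05).astype(float)  # sparse predictor
X.loc[X.index[:20], "x2"] = np.nan                       # 10% missing values

# Samples vs. predictors.
n_samples, n_predictors = X.shape

# Fraction of missing values and of exact zeros per predictor.
missing_fraction = X.isna().mean()
zero_fraction = (X == 0).mean()

# Redundancy: pairs of predictors with very high absolute correlation.
corr = X.corr().abs()
# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.99
]

print(n_samples, n_predictors)
print(missing_fraction)
print(zero_fraction)
print(redundant_pairs)
```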

It is generally good practice to **visualize the data**: see `scatter_matrix` from `pandas.plotting`; `DataFrame.plot` also offers a `kind="scatter"` option.

**Before ML**
  - If the predictors are low-dimensional, a scatterplot probably suffices.
  - If there are multiple predictors, plots that show the cross-relationships between predictors are needed.
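
Both cases can be sketched with pandas plotting: a single scatter plot via `DataFrame.plot(kind="scatter", ...)` and a grid of pairwise scatter plots via `scatter_matrix` from `pandas.plotting`. The data and column names below are synthetic placeholders, and the `Agg` backend is selected only so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)

# Synthetic data with two correlated predictors and one response.
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = 0.8 * df["x1"] + 0.2 * rng.normal(size=100)
df["y"] = df["x1"] - df["x2"] + rng.normal(scale=0.5, size=100)

# Low-dimensional case: one predictor against the response.
ax = df.plot(kind="scatter", x="x1", y="y")

# Several predictors: every pairwise scatter plot, with histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```

In the scatter matrix, the strong linear band in the `x1` vs. `x2` panel is the kind of cross-relationship between predictors that a single plot against the response would not reveal.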
  
**After ML**
  - Check both in-sample and out-of-sample model fit against the ground truth; a scatter plot works here too. Better yet, identify the cases where the model does a poor job.
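
As a rough sketch of this check, the example below fits a linear model to synthetic data, compares in-sample and out-of-sample fit, and lists the test cases with the largest absolute errors. The model choice, the split ratio, and the column names in `results` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic regression problem.
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# In-sample vs. out-of-sample fit (R^2); a large gap hints at overfitting.
r2_in = model.score(X_train, y_train)
r2_out = model.score(X_test, y_test)

# Ground truth vs. prediction on held-out data.
results = pd.DataFrame({"truth": y_test, "predicted": model.predict(X_test)})
results["abs_error"] = (results["truth"] - results["predicted"]).abs()

# The cases where the model does the worst job.
worst_cases = results.nlargest(5, "abs_error")

print(f"R^2 in-sample: {r2_in:.3f}, out-of-sample: {r2_out:.3f}")
print(worst_cases)
```

The `results` frame can be passed to `results.plot(kind="scatter", x="truth", y="predicted")` to get the ground-truth vs. fitted scatter plot; points far from the diagonal correspond to the rows in `worst_cases`.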

For any given problem, if possible, **evaluate several models**.
 - Model assessment for supervised machine learning problems is usually done with [**cross-validation**](cross_validation.ipynb), which estimates prediction error.
 - Traditional statistical learning, which emphasizes interpretability, often uses [**information criteria and other metrics**](evaluation_metrics_and_information_criterions.ipynb), frequently calculated by resampling.
 - Visualization can be useful in gauging model performance (see the point above about scatterplot between ground-truth and fitted).
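
A minimal sketch of evaluating several models with cross-validation follows. The three candidate models, their hyperparameters, the 5-fold setting, and mean squared error as the metric are all illustrative choices on synthetic data, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

# Synthetic regression problem with a linear ground truth.
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# 5-fold cross-validated mean squared error for each candidate.
cv_mse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()  # scikit-learn reports the negated MSE

best = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("lowest cross-validated MSE:", best)
```

Differences between candidates that are small relative to the fold-to-fold spread of the scores should not be over-interpreted.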

## Reference

- Applied Predictive Modeling, Chapters 1-2
- MLEDU: Lecture 1
- Hands-On Machine Learning, Chapter 2