# Data Preprocessing

**Feature Engineering** is how the predictors are encoded.

   - A given encoding may work well for some models but poorly for others.
   - Including multiple encodings of the same data can cause problems for some models, although some feature selection algorithms can cope with this.

**Data Transformation on Single Predictors**

 - Centering and scaling: may lose the interpretability of the data, since values are no longer in their original units.
 - Resolve skewness/heavy-tailedness: make the data more symmetric with a log, square root, or inverse transform, or with the more general Box-Cox (1964) transform:
 \begin{align}
 x^{*}=
 \begin{cases}
   \dfrac{x^{\lambda}-1}{\lambda}, & \text{if }\lambda\neq 0,\\
   \log(x), & \text{if }\lambda=0.
 \end{cases}
 \end{align}
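
A minimal sketch of both transforms on synthetic data, assuming `scipy` and `sklearn` are installed. The log-normal sample is made up for illustration. `scipy.stats.boxcox` estimates $\lambda$ by maximum likelihood when no value is supplied, and it requires strictly positive inputs.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, right-skewed, strictly positive predictor (Box-Cox needs x > 0).
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Centering and scaling: subtract the mean, divide by the standard deviation.
x_scaled = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

# Box-Cox: with no lambda given, SciPy returns the transformed data
# together with the maximum-likelihood estimate of lambda.
x_bc, lam = stats.boxcox(x)

# The same transform written out from the formula above.
x_manual = np.log(x) if lam == 0 else (x**lam - 1) / lam

skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)
print(f"lambda = {lam:.3f}, skewness {skew_before:.2f} -> {skew_after:.2f}")
```

In a real pipeline, the scaler and $\lambda$ should be fitted on the training set only and then applied unchanged to the test set.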

**Outlier Removal**

 - Outliers can be spotted via visualization.
 - Be cautious about removing outliers, especially when the sample size is small, since doing so may discard important distributional information. Investigate these points first.
 - Some models are robust to outliers, e.g. trees and SVMs.
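
As a numeric companion to visual inspection, the sketch below flags points with the 1.5 × IQR rule, the same rule a standard box plot uses to draw its whiskers. This is an illustrative choice rather than something the notes prescribe, and the data are made up. Flagged points are candidates for investigation, not for automatic removal.

```python
import pandas as pd

# Made-up sample with one suspicious value.
s = pd.Series([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers of a box plot.
flagged = s[(s < lower) | (s > upper)]
print(flagged)
```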

**Data Reduction and Feature Extraction: PCA** - see the [PCA](PCA.ipynb) notebook.

**Missing Values**

 - It is important to know why data are missing, since the reason can inform the analysis: missing at random vs. missing not at random.
 - Censored data: the exact value is missing but we know something about it (e.g. a bound). In predictive modeling, it is common to treat censored values as missing, or simply to use the censored value as if it were observed.
 - Missing values are usually concentrated in particular predictors (columns of the data table) rather than in particular data points (rows of the data table). So removing predictors may be preferable to removing rows.
 - On the implementation side, check out the `fillna` and `dropna` methods in `pandas` for simple handling of missing data. `sklearn` also has a class called `SimpleImputer` that can systematically fill in missing values with the mean, median, most frequent value, etc.
 - Some models are robust to missing values, such as Trees, MARS and kNN **(from 10.7 in ESL, need to think about this...)**
 - The most "official" way to tackle missing values is perhaps imputation: exploit the relationships between predictors, using e.g. kNN, to estimate the missing entries. There is a small literature; see Further Reading.
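
The options above can be sketched on a toy table as follows. The column names, values, and the 0.5 missing-fraction threshold are assumptions for illustration. `KNNImputer` is `sklearn`'s kNN-based imputer and is used here as one possible way to impute from related predictors.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data; one column is almost entirely missing.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 38.0],
    "income": [40.0, 52.0, np.nan, 88.0, 61.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Fraction of missing values per predictor (column).
missing_frac = df.isna().mean()

# Remove predictors that are mostly missing (threshold is an arbitrary choice).
df_reduced = df.loc[:, missing_frac <= 0.5]

# Simple pandas options: drop incomplete rows, or fill with column medians.
rows_dropped = df_reduced.dropna()
median_filled = df_reduced.fillna(df_reduced.median())

# The same median fill as a reusable, fit/transform-style step.
X_simple = SimpleImputer(strategy="median").fit_transform(df_reduced)

# kNN imputation: fill each gap from the nearest rows on the observed columns.
X_knn = KNNImputer(n_neighbors=2).fit_transform(df_reduced)
```

The imputer objects are worth preferring over ad hoc `fillna` calls when the same fill values must be reused on new data, since they store the statistics learned in `fit`.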

**Removing Predictors**

 - **Zero- and near-zero-variance predictors**: a predictor with a single unique value has zero variance. A near-zero-variance predictor has a low fraction of unique values relative to the sample size and a large ratio of the frequency of the most prevalent value to that of the second most prevalent.
 - **Collinearity**: (1) Use the Variance Inflation Factor in [linear models](linear_regression.ipynb) to diagnose it. (2) If a few of the top PCA components account for a high percentage of the variance, this indicates high collinearity; the loadings of the components also help identify which predictors are involved.
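
Both diagnostics can be sketched with `numpy` and `pandas` alone. The helper names `near_zero_variance` and `vif`, the cutoffs (10% unique values, frequency ratio of 20), and the synthetic data are assumptions for illustration. The first helper flags a predictor only when both conditions hold. The second uses the definition $\mathrm{VIF}_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the others.

```python
import numpy as np
import pandas as pd

def near_zero_variance(s: pd.Series, unique_cut=0.10, freq_cut=20.0) -> bool:
    """Flag a predictor with few unique values and one dominant value."""
    counts = s.value_counts()  # sorted by frequency, descending
    if len(counts) < 2:
        return True  # a single unique value: exactly zero variance
    pct_unique = len(counts) / len(s)
    freq_ratio = counts.iloc[0] / counts.iloc[1]
    return bool(pct_unique < unique_cut and freq_ratio > freq_cut)

def vif(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor of each column, via least squares."""
    out = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                  # unrelated predictor
flag = pd.Series(np.r_[np.ones(2), np.zeros(n - 2)])  # 2 ones, 198 zeros

print(near_zero_variance(flag), near_zero_variance(pd.Series(x1)))
vifs = vif(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
print(vifs.round(1))
```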

**Adding Predictors**

 - Dummy variables
 - For classification, class centroids can be calculated and each data point's distance to each centroid can be added as an extra predictor.
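
Both ideas can be sketched with `pandas` on a toy table; the column names and values are made up. `drop_first=True` is an optional choice that avoids a set of dummy columns summing to one, which matters for linear models with an intercept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "x1": [1.0, 1.2, 5.0, 5.5, 0.8, 4.8],
    "x2": [2.0, 1.8, 6.0, 6.2, 2.1, 5.9],
    "label": ["a", "a", "b", "b", "a", "b"],
})

# Dummy variables for a categorical predictor.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Class centroids: mean of the numeric predictors within each class.
# In practice, compute these on the training set only.
features = ["x1", "x2"]
centroids = df.groupby("label")[features].mean()

# Add the Euclidean distance to each class centroid as a new predictor.
for cls, center in centroids.iterrows():
    diff = df[features].to_numpy() - center.to_numpy()
    df[f"dist_to_{cls}"] = np.linalg.norm(diff, axis=1)

print(dummies.columns.tolist())
print(df[["label", "dist_to_a", "dist_to_b"]])
```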

**Binning Predictors**

**Manual** categorization is discouraged since it generally harms performance, although in some cases it might improve interpretability.

## Reference

- Applied Predictive Modeling, Chapter 3
- Hands-On Machine Learning, Chapter 2

### Further Reading

- Applied Predictive Modeling, Chapters 12-15
- Saar-Tsechansky M, Provost F (2007). "Handling Missing Values When Applying Classification Models." Journal of Machine Learning Research, 8, 1625-1657.
- Jerez J, Molina I, Garcia-Laencina P, Alba R, Ribelles N, Martin M, Franco L (2010). "Missing Data Imputation Using Statistical and Machine Learning Methods in a Real Breast Cancer Problem." Artificial Intelligence in Medicine, 50, 105-115.