# Decision Trees (DT)

## Model Specification

### Variants and Generalizations

## Theoretical Properties

### Advantages
- DTs are simple to understand and easy to interpret. They are intuitive: the sequence of if-then rules is close to the logical thinking of doctors.
- DT requires little data preparation: no data normalization or dummy variable creation is needed. The DT procedure is also insensitive to the magnitude of features, i.e. a strictly monotonic transform of an input does not alter the resulting partition.
- Inference is relatively cheap: the cost of a prediction is proportional to the depth of the tree, which is roughly [logarithmic](http://scikit-learn.org/stable/modules/tree.html) in the number of data points used to train the tree when the tree is reasonably balanced. (Probably a reference to each split partitioning the data set roughly in half; more reference needed.)
- Able to handle both numerical and categorical data, i.e. mixed data types.
- Able to handle multiple outputs.
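
The invariance to monotonic transforms can be checked with a small sketch. The helper `best_split_partition` below is a hypothetical, simplified single-feature split search using Gini impurity (not the scikit-learn implementation), and the data are made up for illustration. Because a split depends only on the ordering of the feature values, applying `log` should leave the chosen partition unchanged.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split_partition(x, y):
    """Return the set of indices sent left by the best single-feature split.

    Candidate cuts lie between consecutive distinct sorted values; the score
    is the size-weighted Gini impurity of the two children.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        if x[order[cut - 1]] == x[order[cut]]:
            continue  # no threshold can separate tied values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(x)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Illustrative data: one feature spanning several orders of magnitude.
x = [0.5, 1.0, 2.0, 8.0, 40.0, 300.0]
y = [0, 0, 0, 1, 0, 1]
x_log = [math.log(v) for v in x]  # strictly increasing transform

same_partition = best_split_partition(x, y) == best_split_partition(x_log, y)
print(same_partition)
```

Only the threshold value changes under the transform; the observations sent to each child stay the same.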

### Disadvantages
- DT typically has lower prediction accuracy, resulting from high variance or instability. One incarnation of the instability is that a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The main reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it.
- Learning an optimal tree is a hard optimization problem. The global optimum is almost impossible to achieve, so we have to make do with a greedy algorithm and a local optimum. Computational efficiency may also be an issue.
- There are concepts that are hard to learn because DT does not express them easily, such as XOR, parity or multiplexer problems. [(reference needed)](http://scikit-learn.org/stable/modules/tree.html)
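
A minimal sketch of why XOR is awkward for a greedy learner, assuming Gini impurity as the split criterion (the helper names are illustrative). On the four XOR points, no single axis-aligned split reduces the impurity, so a greedy search sees no reason to prefer any first split, even though a depth-two tree separates the classes perfectly.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def weighted_gini_after_split(X, y, feature, threshold):
    """Size-weighted Gini impurity of the children of one axis-aligned split."""
    left = [y[i] for i in range(len(X)) if X[i][feature] <= threshold]
    right = [y[i] for i in range(len(X)) if X[i][feature] > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [a ^ b for a, b in X]  # XOR labels

root_impurity = gini(y)
# Impurity reduction of the only sensible split on each feature.
gains = [root_impurity - weighted_gini_after_split(X, y, j, 0.5) for j in (0, 1)]

def depth_two_impurity(X, y):
    """Impurity after splitting on feature 0, then on feature 1 in each child."""
    total = 0.0
    for v in (0, 1):
        idx = [i for i in range(len(X)) if X[i][0] == v]
        Xc, yc = [X[i] for i in idx], [y[i] for i in idx]
        total += len(idx) * weighted_gini_after_split(Xc, yc, 1, 0.5)
    return total / len(X)

print(gains, depth_two_impurity(X, y))
```

Both first-level gains are zero, while the two-level tree reaches zero impurity, which is the kind of structure a one-step-ahead greedy criterion cannot anticipate.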

### Relation to Other Models
- To address the high variance and lower prediction accuracy of DT, multiple extensions have been proposed; see the notebooks on [bagging](Bagging.ipynb), [boosting](Boosting.ipynb) and [ensemble methods](Emsemble.ipynb). A general theme is that interpretability is sacrificed in order to achieve greater prediction accuracy.
- DT, or CART (Classification and Regression Trees), is also related to [Multivariate Adaptive Regression Splines (MARS)](MARS.ipynb). MARS is considered to be better at fitting a smooth surface and capturing additive structure (see Disadvantages under Empirical Performance).

## Empirical Performance

### Advantages
- DT is computationally scalable, compared to [SVM](svm.ipynb) or neural nets.
- Possible to validate a model using [statistical tests](http://scikit-learn.org/stable/modules/tree.html). That makes it possible to account for the reliability of the model. (Reference needed)
- Performs well even if its [assumptions are somewhat violated by the true model](http://scikit-learn.org/stable/modules/tree.html) from which the data were generated (references needed)
- To some extent, DT is robust to irrelevant inputs.
- DT is robust to outliers in the input space.

### Disadvantages
- DT can create biased trees in classification if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree, probably by using the loss matrix extension detailed below and/or the Gini/entropy loss function.
- The prediction surface lacks smoothness: it is piecewise constant, which is a particular drawback in the regression setting. [MARS](Multivariate Adaptive Regression Splines.ipynb) can be viewed as alleviating this.
- DT has difficulty capturing additive structure. Although it can do so with sufficient data, DT makes no special encouragement to find structure such as $Y=c_1I(X_1<t_1)+c_2I(X_2>t_2)$. This, too, is something that [MARS](Multivariate Adaptive Regression Splines.ipynb) can be viewed as alleviating.
- Handling missing data is nontrivial, except by treating missing values as another category. This may not be a disadvantage of DT per se.
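
The lack of smoothness can be seen in a small sketch. The code below is a hypothetical, simplified greedy regression tree on one feature with squared-error splits (not any library's implementation), fit to the linear target $y=x$. A tree of depth $d$ can output at most $2^d$ distinct values, so the fitted function is a staircase rather than a line.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(pairs, depth):
    """Greedy 1-D regression tree.

    pairs: list of (x, y) sorted by x.
    Returns a leaf mean (float) or a tuple (threshold, left, right).
    """
    ys = [y for _, y in pairs]
    if depth == 0 or pairs[0][0] == pairs[-1][0]:
        return sum(ys) / len(ys)
    best_cut, best_cost = None, float("inf")
    for cut in range(1, len(pairs)):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue  # cannot split between tied x values
        cost = sse(ys[:cut]) + sse(ys[cut:])
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    threshold = (pairs[best_cut - 1][0] + pairs[best_cut][0]) / 2
    return (threshold,
            fit_tree(pairs[:best_cut], depth - 1),
            fit_tree(pairs[best_cut:], depth - 1))

def predict(tree, x):
    """Walk down the tree until a leaf mean is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

xs = [i / 20 for i in range(21)]
pairs = [(x, x) for x in xs]          # perfectly linear target y = x
tree = fit_tree(pairs, depth=2)
preds = [predict(tree, x) for x in xs]
n_levels = len(set(preds))            # number of distinct predicted values
print(n_levels)
```

Increasing the depth refines the staircase but never makes it smooth; each extra level also halves the data available per node.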

## Implementation Details and Practical Tricks

**Splitting Categorical Predictors** 

When splitting a predictor having $q$ possible unordered values, there are $2^{q-1}-1$ possible partitions into two groups, which may be computationally daunting. For a binary (0-1) outcome, one trick is to order the predictor classes according to the proportion falling in outcome class 1, and then split the predictor as if it were an ordered predictor. One can show that this gives the optimal split in terms of cross-entropy or Gini index. For regression problems, the same approach applies, ordering the classes by increasing mean of the outcome. Although intuitive, the proofs of these assertions are not trivial.

One problem is that the partitioning algorithm tends to favor categorical predictors with many levels $q$: the number of candidate partitions grows exponentially in $q$, and the more choices there are, the more likely one of them fits the data at hand by chance. This can lead to severe overfitting if $q$ is large, and such variables should be avoided.
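
The ordering trick can be checked numerically on a small example. The sketch below assumes a binary outcome and Gini impurity; the level names and counts are made up for illustration. It compares an exhaustive search over all $2^{q-1}-1$ partitions with a search over only the $q-1$ cuts of the ordered levels.

```python
from itertools import combinations

def gini(n_pos, n):
    """Gini impurity of a node with n_pos class-1 cases out of n."""
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2.0 * p * (1.0 - p)

def split_score(counts, left_levels):
    """Size-weighted Gini of the two children when left_levels go left.

    counts maps level -> (number in class 1, total number).
    """
    lp = sum(counts[c][0] for c in left_levels)
    ln = sum(counts[c][1] for c in left_levels)
    tp = sum(v[0] for v in counts.values())
    tn = sum(v[1] for v in counts.values())
    return (ln * gini(lp, ln) + (tn - ln) * gini(tp - lp, tn - ln)) / tn

def brute_force_best(counts):
    """Try every partition into two non-empty groups."""
    levels = sorted(counts)
    first, rest = levels[0], levels[1:]
    best, n_partitions = float("inf"), 0
    # Keep `first` on the left so mirror-image partitions are not counted twice;
    # r < len(rest) guarantees the right group is non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            n_partitions += 1
            best = min(best, split_score(counts, (first,) + extra))
    return best, n_partitions

def ordered_best(counts):
    """Order levels by class-1 proportion, then try only the q-1 ordered cuts."""
    ordered = sorted(counts, key=lambda c: counts[c][0] / counts[c][1])
    return min(split_score(counts, ordered[:cut]) for cut in range(1, len(ordered)))

# Illustrative counts: level -> (class-1 cases, total cases).
counts = {"a": (2, 10), "b": (9, 12), "c": (5, 10), "d": (1, 8), "e": (7, 9)}
brute, n_partitions = brute_force_best(counts)
ordered = ordered_best(counts)
print(n_partitions, abs(brute - ordered) < 1e-12)
```

With $q=5$ the exhaustive search evaluates 15 partitions while the ordered search evaluates 4, and the gap widens exponentially as $q$ grows.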

**The Loss Matrix**

In classification problems, there are cases where misclassifying certain classes is more catastrophic than misclassifying others, or where the classes are imbalanced. **One approach** to mitigate this is to weight the impurity measure by a loss matrix $L$, where $L_{kk'}$ is the loss incurred for classifying a class $k$ observation as class $k'$. For instance, rewrite the Gini index as $\sum_{k\neq k'}L_{kk'}\hat{p}_{mk}\hat{p}_{mk'}$. This has no effect in the two-class case, because the coefficient of $\hat{p}_{mk}\hat{p}_{mk'}$ is $L_{kk'}+L_{k'k}$, so the index is merely rescaled. **Another approach** is to weight the observations, e.g. weighting the observations in class $k$ by $L_{kk'}$. The effect of observation weighting is to alter the prior probabilities of the classes. In a terminal node, the empirical Bayes rule implies that we classify to class $k(m)=\arg\min_k\sum_l L_{lk}\hat{p}_{ml}$.
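
A small numeric sketch of both points, using made-up node proportions and loss matrices (the function names are illustrative). It shows that in the two-class case the loss-weighted Gini index is a constant multiple of the unweighted one, so it cannot change which split is chosen, and that the terminal-node rule $\arg\min_k\sum_l L_{lk}\hat{p}_{ml}$ can still change the predicted class.

```python
def loss_weighted_gini(p, L):
    """Sum over k != k' of L[k][k'] * p[k] * p[k']."""
    K = len(p)
    return sum(L[k][j] * p[k] * p[j]
               for k in range(K) for j in range(K) if k != j)

def bayes_class(p, L):
    """Class k minimizing the expected loss sum_l L[l][k] * p[l].

    L[l][k] is the loss for predicting class k when the truth is class l.
    """
    K = len(p)
    expected = [sum(L[l][k] * p[l] for l in range(K)) for k in range(K)]
    return min(range(K), key=lambda k: expected[k])

L_sym = [[0, 1], [1, 0]]    # equal misclassification costs
L_asym = [[0, 1], [5, 0]]   # assumption: missing a class-1 case costs 5x more

# Two-class case: the ratio of the two indices is the same for every node,
# namely (1 + 5) / (1 + 1), so the ranking of candidate splits is unchanged.
ratios = [loss_weighted_gini([q, 1 - q], L_asym) / loss_weighted_gini([q, 1 - q], L_sym)
          for q in (0.1, 0.3, 0.5, 0.7, 0.9)]

# Terminal-node rule: the asymmetric loss flips the prediction for this node.
p_node = [0.7, 0.3]
pred_sym = bayes_class(p_node, L_sym)
pred_asym = bayes_class(p_node, L_asym)
print(ratios, pred_sym, pred_asym)
```

In this node the majority class is 0, yet the asymmetric loss makes predicting class 1 the cheaper decision in expectation.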

**Missing Predictor Values**

Two tricks. (1) **If the missing data is on a categorical variable, just make a new category for 'missing'**; we may discover a pattern in, or even reasons for, why these values are missing. (2) Look for **surrogate variables**. First, train the tree using only observations with non-missing values. Next, find surrogate splits: a surrogate is a predictor and corresponding split point that best mimics the partition of the training data achieved by the primary split (ESL gives few details; more reference needed). Finally, when doing inference, if the primary predictor is missing, use the surrogate splits in order (it is not clear that this can be done during training, despite what ESL says). Note that by the nature of DT, surrogate variables can only be used one at a time, i.e. we cannot use, say, a linear combination of variables.
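
Since ESL gives few details, the following is only one plausible reading of surrogate splits, not a definitive algorithm. The sketch assumes that a surrogate is scored by the fraction of training observations it sends to the same side as the primary split, and that an observation with no usable variable falls back to a default direction; the data, the variable names `x2` and `x3`, and both helper functions are hypothetical.

```python
def best_surrogate(x_other, goes_left):
    """Best threshold on one candidate predictor for mimicking the primary split.

    Returns (agreement, threshold, below_goes_left), where agreement is the
    fraction of observations routed to the same side as the primary split.
    """
    values = sorted(set(x_other))
    best = (0.0, None, True)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        agree = sum((v <= t) == g for v, g in zip(x_other, goes_left)) / len(x_other)
        # Also consider the reversed direction (values below t go right).
        for score, below_left in ((agree, True), (1 - agree, False)):
            if score > best[0]:
                best = (score, t, below_left)
    return best

def route_left(obs, primary, surrogates):
    """Decide the direction for one observation, using surrogates if needed.

    obs: dict of variable name -> value (None if missing).
    primary: (name, threshold); surrogates: list of (name, threshold, below_goes_left)
    ordered from best to worst agreement.
    """
    name, t = primary
    if obs.get(name) is not None:
        return obs[name] <= t
    for s_name, s_t, below_left in surrogates:
        if obs.get(s_name) is not None:
            return (obs[s_name] <= s_t) == below_left
    return True  # assumption: default direction when every variable is missing

# Primary split: x1 <= 3.5 on the training data.
x1 = [1, 2, 3, 4, 5, 6]
goes_left = [v <= 3.5 for v in x1]
candidates = {"x2": [10, 12, 11, 20, 25, 9],   # tracks x1 fairly closely
              "x3": [5, 1, 4, 2, 6, 3]}        # mostly unrelated to x1

scored = {name: best_surrogate(vals, goes_left) for name, vals in candidates.items()}
ranked = sorted(scored, key=lambda name: scored[name][0], reverse=True)
surrogates = [(name, scored[name][1], scored[name][2]) for name in ranked]

obs = {"x1": None, "x2": 11.0, "x3": 4}
print(ranked, route_left(obs, ("x1", 3.5), surrogates))
```

Each surrogate is a single-variable split, which matches the remark that surrogates are used one at a time rather than in combination.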

**Why only Binary Splits**

Because multiway splits fragment the data too quickly, leaving insufficient data at the next level down.

**Linear Combination Splits**

Using a linear combination split $\sum a_jX_j \leq s$ may improve the predictive power, but it hurts interpretability.

## Use Cases

## Results Interpretation, Metrics and Visualization

## References

- ESL, Section 9.2
- [scikit-learn documentation, Section 1.10](http://scikit-learn.org/stable/modules/tree.html)

### Further Reading

## Misc.