In [5]:
# jupyter nbconvert "TCXP - MAPI.ipynb" --to slides --post serve
# PENDING: clear up pendings
# PENDING: quote on truth vs. clarity


# TCXP - A Scalable Algorithm for Explaining Individual Tree-based Classifier Predictions

<div> 
<img width="400" height="300" src="ml_scratching.png" />

</div>

# Outline 


* Explanations??? Please explain yourself!
* What is TCXP? 
* TCXP vs. LIME 
* Demo on real data


## Explanations (or lack thereof) in the context of Machine Learning 



* In industries (such as banking, insurance):
    * Sales staff sometimes ask about individual predictions...
    * Predictive analytics promises **actionable insights**:
        * Individual prediction $\rightarrow$ individual action 
        
* Explanation for prediction: 
    * Answer the question: **Why** did the model predict $\hat{y}^{(i)}$ on input $x^{(i)}$?
    * What *features*/*variables* contributed (the most) to a prediction? 
    * What was each *feature*'s contribution to the prediction?
    
* PENDING: mention new European legislation

## A dichotomy

*  Predictions by *simple algorithms* are "easy" to explain. For example, in logistic regression:
 $$P(x^{(i)} \in C^+) = \sigma( z^{(i)} ) \quad \text{ with }\quad z^{(i)} = \beta_0 + \sum_{j=1}^{f} \beta_j x^{(i)}_j$$
    Why is $P(x^{(i)} \in C^+)$ close to 1? Typically because a few of the terms $\beta_j x^{(i)}_j$ are positive and large enough to dominate $z^{(i)}$.
    $$\Delta z_j^{(i)} := \beta_j x_j^{(i)}  $$
    is the $j$-th feature's contribution to the $i$-th example's prediction.
    
* Predictions by *advanced algorithms*, e.g. random forests, neural networks, XGBoost, are *hard* to explain:
    * because of the black-box nature and high internal complexity of these models
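
For the simple-model case above, the per-feature contributions $\Delta z_j^{(i)}$ can be read off directly. The following is a minimal sketch; the function name `logistic_contributions`, the coefficients, and the input vector are all hypothetical values chosen for illustration.

```python
import math

def logistic_contributions(beta0, betas, x):
    """Return (probability, per-feature contributions beta_j * x_j)."""
    contributions = [b * xj for b, xj in zip(betas, x)]
    z = beta0 + sum(contributions)
    prob = 1.0 / (1.0 + math.exp(-z))  # sigma(z)
    return prob, contributions

# Hypothetical coefficients and input, not taken from any real model.
beta0 = -1.0
betas = [2.0, -0.5, 0.1]
x = [1.5, 1.0, 3.0]

prob, contribs = logistic_contributions(beta0, betas, x)
for j, c in enumerate(contribs, start=1):
    print(f"feature {j}: delta z = {c:+.2f}")
print(f"P(x in C+) = {prob:.3f}")
```

Here the first feature's term dominates $z^{(i)}$, which is why the predicted probability ends up well above 0.5.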

## Explaining advanced ML algorithms to sales staff

<div>
    <p>        
    <img height="250" width="250" src="confused-girl.jpg" style="display: inline-block"/>
    <img height="250" width="250" src="confused-boy.jpg" style="display: inline-block" />
    <img height="250" width="250" src="confused-black-girl.jpg" style="display: inline-block" />
    </p>
</div>

# What is TCXP?


* An algorithm to generate **interpretable explanations** for *individual* tree-based classifier predictions.
  -  **Simple** and **scalable**
  
* Definition: If model $M$ yields a probability $p^+ (i) = p^+_M( \mathbf{x}(i))$ that the $i$-th data point belongs to the positive class, an **explanation** for this prediction is:
    $$( p_0(i), \Delta p_{1}(i), \dots, \Delta p_{f}(i)  )  \quad \text{ such that }\quad  
     p_{0}(i) + \sum_{j=1}^f \Delta p_{j}(i) = p^+  (i ) $$
$\Delta p_j(i)$ is interpreted as the _contribution_ to the prediction coming from the $j$-th feature, $x_j(i)$.     

  
* **How**?
  * **Basic idea:**  carry out *careful accounting* of probability contributions of each variable.

   

## Binary classification through a tree: leaf counts
<div> 
    <img  src="tree_00.png" />    
</div>

For each leaf node $k$ we record the positive-class count and the total count: $( n^+_k ,\, n^+_k + n^-_k )  $

## Binary classification through a tree: probability estimates

A classification tree has **internal decision** nodes, each testing a single variable, and terminal (non-decision) **leaves**.

<div> 
    <img  src="tree_01.png" />    
</div>

For each leaf $k$, $p^+_k = n^+_k / (n^+_k + n^-_k) $

## Explanation generation: all node counts 
    
<div>
    <img  src="tree_02.png" />    
</div>


For each internal node $k$, compute  $( n^+_k, \, (n^+_k + n^-_k)) $

## Explanation generation: all node probs
    
<div>
    <img  src="tree_03.png" />    
</div>


For each internal node $k$, $p^+_k = n^+_k / (n^+_k + n^-_k) $
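
The count and probability steps can be sketched as one bottom-up pass. This sketch assumes that an internal node's counts are the sums of its children's counts (i.e. the training examples routed through it). The nested-dict tree layout, the key names `pos`, `neg`, `p_plus`, and the leaf counts are hypothetical.

```python
def annotate(node):
    """Fill in pos/neg counts and p_plus for every node, bottom-up."""
    for child in node.get("children", []):
        annotate(child)
    if "children" in node:
        # Assumption: an internal node sees exactly the examples of its children.
        node["pos"] = sum(c["pos"] for c in node["children"])
        node["neg"] = sum(c["neg"] for c in node["children"])
    node["p_plus"] = node["pos"] / (node["pos"] + node["neg"])
    return node

# Hypothetical tree: only the leaves carry counts initially.
tree = {
    "feature": "Y",
    "children": [
        {"feature": "Z",
         "children": [{"pos": 5, "neg": 15}, {"pos": 20, "neg": 10}]},
        {"pos": 40, "neg": 10},
    ],
}

annotate(tree)
print(tree["pos"], tree["neg"], tree["p_plus"])
```

Each node is visited once, so the pass is linear in the number of nodes.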

## Explanation generation: deltas on all edges
    
<div>
    <img  src="tree_04.png" />    
</div>


If $k$ is the parent of $l$,  compute $\Delta p^+ _{(k,l)} := p^+_l - p^+_k$
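
A short derivation shows why these edge deltas yield an explanation in the sense defined earlier. It assumes, as the cases on the following slides suggest, that each edge delta is credited to the variable tested at the edge's parent node. The symbols $k_0, \dots, k_m$ (the root-to-leaf path followed by $\mathbf{x}(i)$) and $v(k)$ (the variable tested at node $k$) are notation introduced only for this sketch. The deltas along the path telescope:

$$\sum_{s=1}^{m} \Delta p^+_{(k_{s-1},k_s)} = \sum_{s=1}^{m} \left( p^+_{k_s} - p^+_{k_{s-1}} \right) = p^+_{k_m} - p^+_{k_0}$$

Taking $p_0(i) := p^+_{k_0}$ (the root probability) and $\Delta p_j(i) := \sum_{s \,:\, v(k_{s-1}) = j} \Delta p^+_{(k_{s-1},k_s)}$ then gives

$$p_0(i) + \sum_{j=1}^f \Delta p_j(i) = p^+_{k_m} = p^+(i)$$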

## Explanation generation: assigning deltas to variables -  Case 1
    
<div>
    <img  src="tree_c1.png" />    
</div>

First delta is attributable to $Y$, second to $Z$, third to $X$

## Explanation generation: assigning deltas to variables -  Case 2
    
<div>
    <img  src="tree_c2.png" />    
</div>

First delta is attributable to $Y$, second to $Z$, third to $Y$ again.

## Explanation generation: assigning deltas to variables -  Case 3
    
<div>
    <img  src="tree_c3.png" />    
</div>

First delta is attributable to $Y$, second to $X$, none to $Z$
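
The three cases can be sketched as a single root-to-leaf walk that credits each edge delta to the variable tested at the parent node. This is a minimal sketch, not the reference implementation. The tree layout, the `threshold` key, the "go left if value <= threshold" rule, and all probabilities are hypothetical; the example mirrors Case 2 ($Y$, then $Z$, then $Y$ again), with $X$ never tested.

```python
def explain(tree, x, features):
    """Walk x down the tree; return (p_0, per-feature contributions, leaf p_plus)."""
    node = tree
    contributions = {name: 0.0 for name in features}
    p0 = node["p_plus"]  # baseline: probability at the root
    while "children" in node:
        feat = node["feature"]
        # Assumed split rule: left child if value <= threshold, else right.
        child = node["children"][0 if x[feat] <= node["threshold"] else 1]
        contributions[feat] += child["p_plus"] - node["p_plus"]
        node = child
    return p0, contributions, node["p_plus"]

# Hypothetical tree with precomputed node probabilities.
tree = {
    "feature": "Y", "threshold": 5.0, "p_plus": 0.50,
    "children": [
        {"p_plus": 0.20},
        {"feature": "Z", "threshold": 1.0, "p_plus": 0.70,
         "children": [
             {"p_plus": 0.40},
             {"feature": "Y", "threshold": 8.0, "p_plus": 0.80,
              "children": [{"p_plus": 0.60}, {"p_plus": 0.90}]},
         ]},
    ],
}

x = {"X": 0.0, "Y": 9.0, "Z": 2.0}
p0, contribs, leaf_p = explain(tree, x, ["X", "Y", "Z"])
print(p0, contribs, leaf_p)
```

In this example $Y$ collects two deltas, $Z$ one, and $X$ none, and the baseline plus all contributions adds up to the leaf probability.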

## Extension to forest/GBT and complexity

Consider a classifier $\mathcal{T}$ consisting of $T$ trees with weights $w_1, w_2, \dots, w_T$ (where $\sum^T_{t=1} w_t = 1$),  
and let $\mathbf{e}^{(t)}(i) := ( p_0^{(t)}(i), \Delta p^{(t)}_{1}(i), \dots, \Delta p^{(t)}_{f}(i)  )$ denote the explanation computed on tree $t$, $t=1, \dots, T$. 
Then we can define the explanation for the whole classifier as the weighted sum:
$$\mathbf{e}^{(\mathcal T)}(i) := \sum_{t=1}^T w_t \mathbf{e}^{(t)}(i) $$
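
The weighted sum can be sketched as follows. The function name `combine_explanations` and the per-tree explanation vectors are hypothetical; each vector is laid out as $(p_0, \Delta p_1, \dots, \Delta p_f)$ with $f = 2$.

```python
def combine_explanations(explanations, weights):
    """Weighted sum of per-tree explanation vectors (p_0, dp_1, ..., dp_f)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    length = len(explanations[0])
    return [
        sum(w * e[j] for w, e in zip(weights, explanations))
        for j in range(length)
    ]

# Hypothetical explanations of one data point on two trees.
tree_explanations = [
    [0.50, 0.20, 0.10],   # tree 1 predicts 0.50 + 0.20 + 0.10 = 0.80
    [0.40, 0.10, -0.10],  # tree 2 predicts 0.40 + 0.10 - 0.10 = 0.40
]
weights = [0.5, 0.5]

combined = combine_explanations(tree_explanations, weights)
print(combined, sum(combined))
```

Because the combination is linear, the entries of the combined vector still add up to the classifier's weighted-average prediction (0.60 in this example).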

### Complexity

If the trees in $\mathcal T$ have a total of $V$ nodes and the maximum depth of any tree is $d$, then:

  * Precomputing conditional probabilities at all nodes takes $O(V)$ time and $O(V)$ space.
  * Computing an individual explanation takes $O(d)$ time per tree (so $O(T d)$ for the whole classifier) and $O(f)$ space.
  
Incidentally, both bounds are dominated by the corresponding bounds for training the classifier and generating a prediction.

## Another approach for explanation generation: LIME

**LIME**: **L**ocal **I**nterpretable **M**odel-agnostic **E**xplanations

**Basic Idea**: For each $\mathbf x (i)$:
  * Generate hundreds of random samples in a neighborhood around $\mathbf x(i)$.
  * Compute the prediction of model $M$ for each sample.
  * Fit a linear model to these predictions.
  * Interpret the coefficients of the linear model as variable importances.
  
**Main drawback**:
   * It is **very slow** to compute an explanation for a single datapoint! 
     * $\rightarrow$ This doesn't scale!
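
The four steps above can be sketched in plain Python. This is a simplified illustration of the idea, not the `lime` library's implementation: it fits an unweighted least-squares model, whereas LIME itself also weights samples by proximity to $\mathbf x(i)$. The function names, the Gaussian sampling scale, and the stand-in `black_box` model are all hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def local_linear_explanation(model, x, n_samples=500, scale=0.5, seed=0):
    """Fit an ordinary least-squares model to predictions sampled around x."""
    rng = random.Random(seed)
    rows, targets = [], []
    for _ in range(n_samples):
        sample = [xj + rng.gauss(0.0, scale) for xj in x]
        rows.append([1.0] + sample)    # leading 1.0 is the intercept column
        targets.append(model(sample))  # one model evaluation per sample
    k = len(x) + 1
    # Normal equations: (R^T R) coef = R^T y
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve_linear_system(gram, rhs)
    return coef[0], coef[1:]

def black_box(sample):
    """Stand-in for model M; linear here so the fit can be checked exactly."""
    return 0.2 + 0.3 * sample[0] - 0.1 * sample[1]

intercept, importances = local_linear_explanation(black_box, [1.0, 2.0])
print(intercept, importances)
```

Note that every explanation costs `n_samples` model evaluations plus a regression fit, which is where the scaling problem comes from.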
 

## Comparison between LIME and TCXP

<div>
    <img  src="lime_vs_tcxp.png" />    
</div>