## Rulefit
## Tutorials

Rulefit is a [Python implementation](https://github.com/christophM/rulefit) of **rule ensemble models**, developed by [Friedman and Popescu (2008)](https://arxiv.org/pdf/0811.1679.pdf), which are a variant of ensemble learning models. These models first generate an ensemble of base learners $\{f(x; \gamma_m)\}_{m=1}^M$, which are then combined in a (regularized) linear model:
\begin{equation}
F(x) = \alpha_0 + \sum_{m=1}^M \alpha_mf(x;\gamma_m)
\end{equation}
where the coefficients $\{\alpha_m\}$ are estimated by minimizing the loss with an L1 (lasso) penalty:
\begin{equation}
\{\hat{\alpha}_m\}_{m=0}^M = \underset{\{\alpha_m\}_{m=0}^M}{\mathrm{argmin}}\sum_{i=1}^N L\Big(y_i; \alpha_0 + \sum_{m=1}^M \alpha_mf(x_i;\gamma_m)\Big) + \lambda\sum_{m=1}^M|\alpha_m|
\end{equation}
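
As a minimal sketch of this post-processing step (not the Rulefit library's own code), the block below fits a lasso over the predictions of a few shallow trees. The synthetic data, the subsampled trees standing in for $f(x; \gamma_m)$ and the penalty value are all assumptions made for illustration. Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2N)$, so its `alpha` matches $\lambda$ only up to that scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + 0.1 * rng.normal(size=200)

# Hypothetical base learners f(x; gamma_m): shallow trees fit on random subsamples.
learners = []
for m in range(20):
    idx = rng.choice(len(X), size=100, replace=False)
    tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=m)
    learners.append(tree.fit(X[idx], y[idx]))

# Column m of F holds f(x_i; gamma_m) for every observation i.
F = np.column_stack([tree.predict(X) for tree in learners])

# L1-penalized linear combination; the intercept alpha_0 is not penalized.
post = Lasso(alpha=0.05).fit(F, y)
n_selected = int(np.sum(post.coef_ != 0))
print("base learners kept by the lasso:", n_selected, "of", F.shape[1])
```

The L1 penalty typically sets many $\alpha_m$ exactly to zero, which is what makes the final model sparse.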

The set of base learners $\{f(x; \gamma_m)\}_{m=1}^M$ may be generated through the **ISLE algorithm** (importance sampled learning ensembles):
1. Initialize $F_0(x) = \underset{c}{\mathrm{argmin}} \sum_{i=1}^N L(y_i, c)$.
2. For each $m \in \{1, 2, ..., M\}$:
    * Estimate:
    \begin{equation}
        \displaystyle \gamma_m = \underset{\gamma}{\mathrm{argmin}} \sum_{i \in S_m(\eta)} L(y_i, F_{m-1}(x_i) + f(x_i; \gamma))
    \end{equation}
    <br> where $S_m(\eta)$ is a random sample of size $\eta N$ drawn in iteration $m$, typically generated without replacement.
    <br>
    <br>
    * Update $F_m(x) = F_{m-1}(x) + v \cdot f(x; \gamma_m)$, where $v$ is the learning rate.
<br>
<br>
3. Then, the final ISLE ensemble of learners is given by $\mathcal{T}_{ISLE} = \{f(x; \gamma_1), f(x; \gamma_2), ..., f(x; \gamma_M)\}$.
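
The steps above can be sketched for the special case of squared-error loss, where the initial constant is the mean of $y$ and the minimization in step 2 reduces to fitting a tree to the current residuals. This is an illustrative sketch, not the Rulefit implementation: the function name `isle_ensemble`, its defaults and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def isle_ensemble(X, y, M=50, eta=0.5, v=0.01, max_leaf_nodes=4, seed=0):
    """Generate M tree base learners with subsampling (eta) and shrinkage (v)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Step 1: for squared loss the best constant is the mean of y.
    F = np.full(n, y.mean())
    trees = []
    for m in range(M):
        # Random subsample S_m(eta) of size eta * N, drawn without replacement.
        S = rng.choice(n, size=max(1, int(eta * n)), replace=False)
        # Step 2a: for squared loss, fit the tree to the residuals on S.
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                     random_state=seed + m)
        tree.fit(X[S], y[S] - F[S])
        # Step 2b: shrunken update of the running prediction.
        F = F + v * tree.predict(X)
        trees.append(tree)
    # Step 3: the ensemble is the list of fitted trees.
    return trees, F

# Illustrative data and call.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = 3.0 * (X[:, 0] > 0) + X[:, 1] + 0.1 * rng.normal(size=300)
trees, F = isle_ensemble(X, y, M=50, eta=0.5, v=0.1)
print(len(trees), "trees generated")
```

A small $v$ keeps each tree from explaining too much on its own, which tends to produce a more diverse ensemble for the later lasso step.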

When each base learner $f(x; \gamma_m)$ is a decision tree, it is straightforward to derive a set of rules $\{r_k(x)\}$ from all of its non-root nodes. Each rule $r_k(x)$ is a binary (0/1) variable defined by:
<br>
<br>
\begin{equation}
r_k(x) = \underset{s_{jk} \neq S_j}{\prod}I(x \in s_{jk})
\end{equation}
Where $s_{jk}$ is a subset of the domain $S_j$ of input variable $x_j$.
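
To make the definition concrete, the following sketch evaluates a rule as a product of indicator functions on a small array. The helper `rule_indicator`, the interval encoding of each $s_{jk}$ and the toy values are assumptions chosen for illustration.

```python
import numpy as np

def rule_indicator(X, conditions):
    """Evaluate r(x) as a product of indicators.

    conditions is a list of (feature_index, lower, upper) triples, each
    meaning lower < x_j <= upper (a hypothetical encoding of s_jk).
    """
    r = np.ones(len(X), dtype=int)
    for j, lower, upper in conditions:
        r *= ((X[:, j] > lower) & (X[:, j] <= upper)).astype(int)
    return r

X = np.array([[6.5, 2.0],
              [8.5, 3.0],
              [7.0, 1.0],
              [5.0, 4.0]])

# Hypothetical rule: x_0 <= 8.0 and x_1 > 1.5.
rule = [(0, -np.inf, 8.0), (1, 1.5, np.inf)]
r = rule_indicator(X, rule)
print(r)         # [1 0 0 1]
print(r.mean())  # 0.5, the share of rows satisfying the rule
```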
<br>
Therefore, a **rule ensemble model** uses $\{r_k(x)\}$ instead of $f(x; \gamma_m)$ in the (regularized) linear model:
\begin{equation}
F(x) = \alpha_0 + \sum_{k=1}^K \alpha_kr_k(x)
\end{equation}
Again, $\{\alpha_k\}$ is then estimated through L1 regularization:
\begin{equation}
\{\hat{\alpha}_k\}_{k=0}^K = \underset{\{\alpha_k\}_{k=0}^K}{\mathrm{argmin}}\sum_{i=1}^N L\Big(y_i; \alpha_0 + \sum_{k=1}^K \alpha_kr_k(x_i)\Big) + \lambda\sum_{k=1}^K|\alpha_k|
\end{equation}
<br>
where $K$ is the total number of rules obtained from the $M$ base learners, with tree $m$ having $t_m$ terminal nodes:
\begin{equation}
K = \sum_{m=1}^M 2(t_m - 1)
\end{equation}
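
As a quick check of this count, the sketch below evaluates the formula for a few tree sizes and compares it with the number of non-root nodes of a fitted scikit-learn tree. The helper name `total_rules` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def total_rules(tree_sizes):
    """K = sum over trees of 2 * (t_m - 1), with t_m terminal nodes in tree m."""
    return sum(2 * (t - 1) for t in tree_sizes)

k_fixed = total_rules([4, 4, 4])   # three trees with 4 leaves each -> 18
k_mixed = total_rules([2, 3, 5])   # 2 + 4 + 8 -> 14
print(k_fixed, k_mixed)

# A binary tree with t leaves has 2t - 1 nodes, hence 2(t - 1) non-root nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + X[:, 1] ** 2
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, y)
non_root_nodes = tree.tree_.node_count - 1
print(non_root_nodes == total_rules([tree.get_n_leaves()]))
```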

From the brief presentation above, it follows that the main hyperparameters of a rule ensemble model are:
* $\eta$: subsample parameter. It is denoted by *sample_fract* in the Rulefit library.
* $t_m$: tree size (number of terminal nodes) of base learner $m$. It is controlled by the *tree_size* argument, which sets the mean tree size $\overline{L}$ when $t_m$ is a random variable following an exponential distribution, or the common tree size when $t_m$ is fixed across $m$.
* $M$: number of base learners. In the Rulefit library, it is controlled indirectly through the maximum number of rules, *max_rules*:
\begin{equation}
max\_rules \geq K
\end{equation}
* $v$: learning rate, defined as *memory_par* in the Rulefit library.

Additionally, the following arguments may be tuned when fitting a rule ensemble model with the Rulefit library:
* *rfmode*: selects between regression (`'regress'`) and classification (`'classify'`).
* *lin_standardise* and *lin_trim_quantile*: control the standardization and quantile trimming of the original inputs (linear terms) before the linear model is estimated.
* *exp_rand_tree_size*: defines whether $t_m$ is treated as a random or a fixed variable.
* *model_type*: declares whether the original inputs are added to the linear model together with the set of rules.
* *tol*, *max_iter* and *n_jobs*: parameters for the cross-validated Lasso estimation of the linear model.
    * Therefore, $\lambda$ is selected automatically through cross-validation when the Rulefit library is used.
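
To illustrate what a random tree size means, the sketch below draws $t_m = 2 + \lfloor \gamma \rfloor$ with $\gamma$ exponentially distributed with mean $\overline{L} - 2$, the scheme described by Friedman and Popescu (2008). Whether the Rulefit library draws sizes in exactly this way is an assumption here, and `sample_tree_sizes` is a hypothetical helper, not part of the library.

```python
import numpy as np

def sample_tree_sizes(mean_size, n_trees, seed=0):
    """Draw tree sizes t_m = 2 + floor(gamma), gamma ~ Exp(mean = mean_size - 2)."""
    rng = np.random.default_rng(seed)
    gamma = rng.exponential(scale=mean_size - 2, size=n_trees)
    return (2 + np.floor(gamma)).astype(int)

sizes = sample_tree_sizes(mean_size=4, n_trees=1000)
print("smallest tree:", sizes.min())
print("largest tree:", sizes.max())
print("rules implied by these trees:", int(np.sum(2 * (sizes - 1))))
```

Every tree has at least two terminal nodes, while the occasional large tree allows higher-order interactions to enter the rule set.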

This notebook provides a first application of the Rulefit library, following the tutorial scripts found on [GitHub](https://github.com/christophM/rulefit).

**Summary:**
1. [Libraries](#libraries).
2. [Importing data](#imports).
3. [Basic usage](#basic_usage).
4. [Specifying the base learner generator](#spec_base_gen).
    * [Regression problem](#regression).
    * [Classification problem](#classification).
5. [Passing a base learner generator](#pass_base_gen).

<a id='libraries'></a>

## Libraries

In [44]:
# pip install git+https://github.com/christophM/rulefit.git
import rulefit

import pandas as pd
import numpy as np
import os
import json

from datetime import datetime
import time

import progressbar
from time import sleep

from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score, auc, precision_recall_curve, brier_score_loss

import pickle

<a id='imports'></a>

## Importing data

In [45]:
df = pd.read_csv('../Datasets/boston.csv', index_col=0)
print('\033[1mShape of df:\033[0m ' + str(df.shape) + '.')
df.head()

Shape of df: (506, 14).


Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


<a id='basic_usage'></a>

## Basic usage

### Data pre-processing

In [46]:
print('\033[1mAssessing missing values:\033[0m')
df.isnull().sum()

Assessing missing values:


crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
medv       0
dtype: int64

In [47]:
print('\033[1mData types:\033[0m')
df.dtypes

Data types:


crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object

In [48]:
features = df.drop('medv', axis=1).columns
X_train = df.drop('medv', axis=1).values
y_train = df.medv.values

### Training the rule ensemble model

In [7]:
# Creating the estimation object:
rf = rulefit.RuleFit()

# Training the model:
rf.fit(X_train, y_train, feature_names=features)



RuleFit(Cs=None, cv=3, exp_rand_tree_size=True, lin_standardise=True,
        lin_trim_quantile=0.025, max_iter=1000, max_rules=2000, memory_par=0.01,
        model_type='rl', n_jobs=None, random_state=None, rfmode='regress',
        sample_fract='default', tol=0.0001,
        tree_generator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.01,
                                                 loss='ls', max_depth=100,
                                                 max_features=None,
                                                 max_leaf_nodes=4,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                

In [8]:
# Predictions:
y_pred = rf.predict(X_train)

### Creating rules

In [9]:
rules = rf.get_rules()
print('\033[1mShape of rules:\033[0m ' + str(rules.shape) + '.')
rules.head()

Shape of rules: (1718, 5).


Unnamed: 0,rule,type,coef,support,importance
0,crim,linear,-0.0,1.0,0.0
1,zn,linear,0.001507,1.0,0.034022
2,indus,linear,-0.003772,1.0,0.025
3,chas,linear,-0.0,1.0,0.0
4,nox,linear,-0.0,1.0,0.0


In [10]:
rules = rules[rules.coef != 0].sort_values("support", ascending=False)
rules.head(10)

Unnamed: 0,rule,type,coef,support,importance
1,zn,linear,0.001507,1.0,0.034022
6,age,linear,-0.041714,1.0,1.166747
2,indus,linear,-0.003772,1.0,0.025
1301,rm <= 8.386499881744385 & rm > 5.0989999771118...,rule,0.070733,0.940171,0.016776
416,rm <= 8.742000102996826 & black > 105.02000045...,rule,0.157133,0.918803,0.042919
1145,rm <= 8.157999992370605 & dis > 1.173600018024...,rule,-1.324359,0.91453,0.370264
1486,rad > 2.5 & rm <= 8.577499866485596 & crim <= ...,rule,0.224524,0.888889,0.070561
770,crim <= 16.34310007095337 & rm <= 8.3169999122...,rule,0.548662,0.888889,0.172428
577,dis > 1.1736000180244446 & rm <= 7.07449984550...,rule,-1.26512,0.880342,0.410609
956,lstat > 5.144999980926514,rule,-0.01016,0.876068,0.003348


<a id='spec_base_gen'></a>

## Specifying the base learner generator

In [27]:
help(rulefit.RuleFit)

Help on class RuleFit in module rulefit.rulefit:

class RuleFit(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  RuleFit(tree_size=4, sample_fract='default', max_rules=2000, memory_par=0.01, tree_generator=None, rfmode='regress', lin_trim_quantile=0.025, lin_standardise=True, exp_rand_tree_size=True, model_type='rl', Cs=None, cv=3, tol=0.0001, max_iter=None, n_jobs=None, random_state=None)
 |  
 |  Rulefit class
 |  
 |  
 |  Parameters
 |  ----------
 |      tree_size:      Number of terminal nodes in generated trees. If exp_rand_tree_size=True,
 |                      this will be the mean number of terminal nodes.
 |      sample_fract:   fraction of randomly chosen training observations used to produce each tree.
 |                      FP 2004 (Sec. 2)
 |      max_rules:      approximate total number of rules generated for fitting. Note that actual
 |                      number of rules will usually be lower than this due to duplicates.
 |      memory_par:     scale 

<a id='regression'></a>

### Regression problem

#### Training the rule ensemble model

In [39]:
# Creating the estimation object:
rf = rulefit.RuleFit(rfmode='regress', tree_generator=None,
                     exp_rand_tree_size=True,
                     sample_fract='default', tree_size=4, max_rules=2000, memory_par=0.01,
                     lin_standardise=True, lin_trim_quantile=0.025,
                     random_state=1)

# Training the model:
rf.fit(X_train, y_train, feature_names=features)

RuleFit(Cs=None, cv=3, exp_rand_tree_size=True, lin_standardise=True,
        lin_trim_quantile=0.025, max_iter=1000, max_rules=2000, memory_par=0.01,
        model_type='rl', n_jobs=None, random_state=1, rfmode='regress',
        sample_fract='default', tol=0.0001,
        tree_generator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.01,
                                                 loss='ls', max_depth=100,
                                                 max_features=None,
                                                 max_leaf_nodes=2,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                   

In [40]:
# Predictions:
y_pred = rf.predict(X_train)
print('insample_rmse = ' + str(np.sqrt(np.sum((y_pred - y_train)**2)/len(y_train))))

insample_rmse = 1.5222615689844674


#### Creating rules

In [41]:
rules = rf.get_rules()
print('\033[1mShape of rules:\033[0m ' + str(rules.shape) + '.')
rules = rules[(rules.coef != 0)].sort_values("support", ascending=False)
rules.head(10)

Shape of rules: (1722, 5).


Unnamed: 0,rule,type,coef,support,importance
1,zn,linear,0.002677,1.0,0.06042
6,age,linear,-0.032031,1.0,0.895907
612,dis > 1.133300006389618 & rm <= 7.819999933242798,rule,-1.866426,0.957265,0.377502
1583,dis > 1.1716500520706177 & tax > 219.0,rule,-2.129946,0.952991,0.450818
224,rm <= 8.752500057220459 & dis > 1.338699996471...,rule,-0.070616,0.940171,0.016748
564,dis > 1.1736000180244446 & lstat > 4.650000095...,rule,-2.465786,0.888889,0.774922
1282,rm <= 8.157999992370605 & dis > 1.564000010490...,rule,-0.967384,0.888889,0.304019
1601,lstat <= 23.880000114440918,rule,0.028642,0.884615,0.009151
657,lstat <= 28.785000801086426 & black > 105.2399...,rule,0.390027,0.871795,0.130393
67,rm <= 8.742000102996826 & ptratio > 14.5499997...,rule,0.307528,0.867521,0.104255
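
The rule strings in the table are plain conjunctions of comparisons, so they can be evaluated directly with `DataFrame.query`. The sketch below applies one rule from the table to a small made-up frame; the toy values are assumptions, and the resulting share of matching rows is not expected to reproduce the `support` column, which the library computes on its own data.

```python
import pandas as pd

# Toy data with the same column names as the Boston frame (values are made up).
toy = pd.DataFrame({
    "rm":  [6.5, 8.9, 7.1, 5.9],
    "dis": [4.1, 1.0, 2.3, 1.2],
    "tax": [296, 242, 200, 222],
})

# Rule text copied from the table above.
rule = "dis > 1.1716500520706177 & tax > 219.0"

matched = toy.query(rule)
support = len(matched) / len(toy)
print(matched.index.tolist())  # [0, 3]
print(support)                 # 0.5
```

The same call on the full data frame gives a quick way to inspect which observations a selected rule covers.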


In [14]:
# Rules derived from base learners:
rules[(rules.coef != 0) & (rules.type == 'rule')].sort_values("support", ascending=False).head(10)

Unnamed: 0,rule,type,coef,support,importance
612,dis > 1.133300006389618 & rm <= 7.819999933242798,rule,-1.866426,0.957265,0.377502
1583,dis > 1.1716500520706177 & tax > 219.0,rule,-2.129946,0.952991,0.450818
224,rm <= 8.752500057220459 & dis > 1.338699996471...,rule,-0.070616,0.940171,0.016748
564,dis > 1.1736000180244446 & lstat > 4.650000095...,rule,-2.465786,0.888889,0.774922
1282,rm <= 8.157999992370605 & dis > 1.564000010490...,rule,-0.967384,0.888889,0.304019
1601,lstat <= 23.880000114440918,rule,0.028642,0.884615,0.009151
657,lstat <= 28.785000801086426 & black > 105.2399...,rule,0.390027,0.871795,0.130393
67,rm <= 8.742000102996826 & ptratio > 14.5499997...,rule,0.307528,0.867521,0.104255
1014,lstat <= 28.684999465942383 & rm <= 8.74200010...,rule,0.718026,0.846154,0.259065
735,rm <= 8.742000102996826 & ptratio > 14.5499997...,rule,0.083887,0.846154,0.030267
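
For rule terms, the `importance` column in this table is numerically consistent with the measure proposed by Friedman and Popescu (2008), $|\hat{\alpha}_k| \sqrt{s_k (1 - s_k)}$, where $s_k$ is the rule's support. The check below uses three rows copied from the table; that the library computes importance with exactly this formula is inferred from these numbers rather than taken from its documentation.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table above (rules 612, 1583 and 224).
shown = pd.DataFrame({
    "coef":       [-1.866426, -2.129946, -0.070616],
    "support":    [0.957265, 0.952991, 0.940171],
    "importance": [0.377502, 0.450818, 0.016748],
})

# |coef| times the standard deviation of a 0/1 variable with mean = support.
recomputed = shown["coef"].abs() * np.sqrt(shown["support"] * (1 - shown["support"]))
print(np.allclose(recomputed, shown["importance"], atol=1e-4))
```

Under this measure a rule with a large coefficient can still rank low when its support is close to 0 or 1, since such a rule barely varies across observations.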


In [15]:
# Original inputs added into the (regularized) linear model:
rules[(rules.coef != 0) & (rules.type == 'linear')].sort_values("support", ascending=False).head(10)

Unnamed: 0,rule,type,coef,support,importance
1,zn,linear,0.002677,1.0,0.06042
6,age,linear,-0.032031,1.0,0.895907


<a id='classification'></a>

### Classification problem

In [49]:
# Creating a categorical variable from the original continuous response:
y_class = y_train.copy()
y_class[y_class<21] = -1
y_class[y_class>=21] = +1
N = X_train.shape[0]

#### Training the rule ensemble model

In [50]:
# Creating the estimation object:
rf = rulefit.RuleFit(rfmode='classify', tree_generator=None,
                     exp_rand_tree_size=True,
                     sample_fract='default', tree_size=4, max_rules=2000, memory_par=0.01,
                     lin_standardise=True, lin_trim_quantile=0.025,
                     random_state=1)

# Training the model:
rf.fit(X_train, y_class, feature_names=features)

RuleFit(Cs=None, cv=3, exp_rand_tree_size=True, lin_standardise=True,
        lin_trim_quantile=0.025, max_iter=1000, max_rules=2000, memory_par=0.01,
        model_type='rl', n_jobs=None, random_state=1, rfmode='classify',
        sample_fract='default', tol=0.0001,
        tree_generator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.01,
                                                  loss='deviance',
                                                  max_depth=100,
                                                  max_features=None,
                                                  max_leaf_nodes=2,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
             

In [51]:
# Predictions:
y_pred = rf.predict(X_train)
y_proba = rf.predict_proba(X_train)

# Accuracy is the mean of the element-wise match; len() of the comparison
# would only count elements and always yield 1.0.
print('insample_acc = ' + str(np.mean(y_pred == y_class)))


#### Creating rules

In [52]:
rules = rf.get_rules()
print('\033[1mShape of rules:\033[0m ' + str(rules.shape) + '.')
rules.head()

Shape of rules: (1734, 5).


Unnamed: 0,rule,type,coef,support,importance
0,crim,linear,0.000609,1.0,0.003422
1,zn,linear,-0.002788,1.0,0.06292
2,indus,linear,0.003397,1.0,0.022515
3,chas,linear,0.0,1.0,0.0
4,nox,linear,-0.004137,1.0,0.000478


In [53]:
# Rules derived from base learners:
rules[(rules.coef != 0) & (rules.type == 'rule')].sort_values("support", ascending=False).head(10)

Unnamed: 0,rule,type,coef,support,importance
1586,rm > 4.190999984741211,rule,-0.032777,0.995726,0.002138
1588,rm > 4.000499963760376,rule,-0.075028,0.991453,0.006907
657,dis > 1.3150500059127808,rule,-0.039182,0.991453,0.003607
1587,rm > 4.115499973297119,rule,0.012214,0.991453,0.001124
108,dis <= 9.886650085449219,rule,-0.001366,0.987179,0.000154
1000,dis > 1.344599962234497,rule,0.03615,0.987179,0.004067
661,dis > 1.3392000198364258,rule,-0.021292,0.982906,0.00276
660,dis > 1.3452500104904175,rule,-0.030263,0.978632,0.004376
454,tax > 195.5,rule,-0.042275,0.978632,0.006113
656,dis > 1.338699996471405,rule,0.082866,0.974359,0.013098


In [54]:
# Original inputs added into the (regularized) linear model:
rules[(rules.coef != 0) & (rules.type == 'linear')].sort_values("support", ascending=False).head(10)

Unnamed: 0,rule,type,coef,support,importance
0,crim,linear,0.000609,1.0,0.003422
1,zn,linear,-0.002788,1.0,0.06292
2,indus,linear,0.003397,1.0,0.022515
4,nox,linear,-0.004137,1.0,0.000478
5,rm,linear,-0.002611,1.0,0.001676
6,age,linear,-0.001619,1.0,0.045293
7,dis,linear,0.011904,1.0,0.024117
8,rad,linear,0.003998,1.0,0.034781
10,ptratio,linear,0.002213,1.0,0.004762
11,black,linear,-2.3e-05,1.0,0.00203


<a id='pass_base_gen'></a>

## Passing a base learner generator

In [29]:
# Creating a tree generator from GBM:
base_gen = GradientBoostingRegressor(subsample=0.75,
                                     max_depth=10,
                                     learning_rate=0.01,
                                     n_estimators=500)

# Creating the estimation object:
rf = rulefit.RuleFit(tree_generator=base_gen, max_iter=5000)

# Training the model:
rf.fit(X_train, y_train, feature_names=features)

# Creating rules:
rules = rf.get_rules()
rules.sort_values('importance', ascending=False).head(10)

Unnamed: 0,rule,type,coef,support,importance
464,rm > 6.940999984741211 & nox <= 0.669499993324...,rule,2.31238,0.147757,0.820569
6,age,linear,-0.028063,1.0,0.784916
479,dis > 1.1716500520706177 & rm <= 7.47949981689...,rule,-2.621037,0.931398,0.662534
144,lstat > 15.065000057220459 & rm <= 6.977999925...,rule,-1.406672,0.313984,0.652851
338,black > 105.23999786376953 & lstat <= 19.82999...,rule,1.298384,0.633245,0.625716
792,dis > 1.227150022983551 & ptratio > 13.8499999...,rule,-1.144772,0.461741,0.570708
110,dis > 1.338699996471405 & ptratio > 15.25 & cr...,rule,-1.2818,0.738786,0.56309
577,lstat > 19.605000495910645 & rm <= 6.837500095...,rule,-1.416412,0.171504,0.533915
500,lstat <= 5.1549999713897705 & dis <= 3.2074499...,rule,2.506396,0.036939,0.472739
630,rm <= 7.442999839782715 & crim > 0.60821500420...,rule,-1.031817,0.253298,0.448737


In [36]:
rules[rules.type=='linear']

Unnamed: 0,rule,type,coef,support,importance
0,crim,linear,-0.0,1.0,0.0
1,zn,linear,0.009153,1.0,0.206586
2,indus,linear,-0.0,1.0,0.0
3,chas,linear,-0.0,1.0,0.0
4,nox,linear,-0.0,1.0,0.0
5,rm,linear,0.240687,1.0,0.154502
6,age,linear,-0.028063,1.0,0.784916
7,dis,linear,0.0,1.0,0.0
8,rad,linear,0.0,1.0,0.0
9,tax,linear,-0.0,1.0,0.0
