#### Usage example on congress dataset

These examples use the RIPPER algorithm. IREP usage is similar, with only slight hyperparameter differences.

In [1]:
import wittgenstein as lw
import pandas as pd

Load our dataset:

In [2]:
df = pd.read_csv('../datasets/house-votes-84.csv')

Split our data into train-test sets:

In [3]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, random_state=42)

#### Training

Create a ruleset classifier:

In [4]:
ripper_clf = lw.RIPPER(random_state=42)
ripper_clf

<RIPPER(verbosity=0, random_state=42, dl_allowance=64, prune_size=0.33, max_total_conds=None, k=2, n_discretize_bins=10, max_rules=None, max_rule_conds=None)>

Train the ruleset classifier on the trainset:

In [5]:
ripper_clf.fit(train, class_feat='Party', pos_class='democrat')
ripper_clf.ruleset_ # Access underlying model

<Ruleset [physician-fee-freeze=n] V [adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n^immigration=n]>
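The printed ruleset can be read as a disjunction (V) of rules, each of which is a conjunction (^) of feature=value conditions: a record is predicted positive when at least one rule covers it. A minimal pure-Python sketch of that reading follows. The list-of-dicts representation, the helper names rule_covers and ruleset_predicts, and the three voter records are illustrative assumptions, not wittgenstein internals or rows from the dataset.

```python
# The fitted ruleset from above, hand-transcribed as a list of rules.
# Each rule is a dict of {feature: required_value} conditions.
ruleset = [
    {"physician-fee-freeze": "n"},
    {
        "adoption-of-the-budget-resolution": "y",
        "anti-satellite-test-ban": "n",
        "immigration": "n",
    },
]


def rule_covers(rule, record):
    """A rule covers a record when every one of its conditions holds (AND)."""
    return all(record.get(feat) == val for feat, val in rule.items())


def ruleset_predicts(ruleset, record):
    """A ruleset predicts positive when any rule covers the record (OR)."""
    return any(rule_covers(rule, record) for rule in ruleset)


# Hypothetical voting records, made up for illustration.
voter_a = {"physician-fee-freeze": "n", "immigration": "y"}
voter_b = {
    "physician-fee-freeze": "y",
    "adoption-of-the-budget-resolution": "y",
    "anti-satellite-test-ban": "n",
    "immigration": "n",
}
voter_c = {"physician-fee-freeze": "y", "immigration": "n"}

print(ruleset_predicts(ruleset, voter_a))  # True: the first rule covers it
print(ruleset_predicts(ruleset, voter_b))  # True: the second rule covers it
print(ruleset_predicts(ruleset, voter_c))  # False: no rule covers it
```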

The fit method is flexible: it also accepts separate train_x and train_y inputs, as well as numpy arrays.

Unlike dataframes, arrays don't have feature names, so the rules refer to column indices:

In [6]:
X_train, y_train = train.drop('Party', axis=1), train['Party']
X_array, y_array = X_train.values, y_train.values
ripper_clf.fit(X_array, y_array, pos_class='democrat')
ripper_clf.ruleset_

<Ruleset [3=n] V [10=y^2=y^6=n] V [2=?^0=y]>

But we can pass them in:

In [7]:
X_array, y_array = train.drop('Party', axis=1).values, train['Party'].values
ripper_clf.fit(X_array, y_array,
               pos_class='democrat', class_feat='Party',
               feature_names=df.columns[1:])
ripper_clf.ruleset_

<Ruleset [physician-fee-freeze=n] V [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n] V [adoption-of-the-budget-resolution=?^Handicapped-infants=y]>
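Comparing the index-based ruleset with the named one shows which column index corresponds to which feature. The sketch below substitutes names for indices in the index-based rule text. The mapping is read off the two outputs above, on the assumption that both fits produced the same rules. It lists only the columns that appear in those rules, and name_conditions is a hypothetical helper, not part of wittgenstein.

```python
import re

# Index-to-name pairs inferred by lining up the two rulesets above.
# Only the columns that appear in those rules are listed.
index_to_name = {
    0: "Handicapped-infants",
    2: "adoption-of-the-budget-resolution",
    3: "physician-fee-freeze",
    6: "anti-satellite-test-ban",
    10: "synfuels-corporation-cutback",
}

indexed_rules = "[3=n] V [10=y^2=y^6=n] V [2=?^0=y]"


def name_conditions(rules_text, names):
    """Replace each column index that directly precedes '=' with its name."""
    return re.sub(r"\d+(?==)", lambda m: names[int(m.group())], rules_text)


named_rules = name_conditions(indexed_rules, index_to_name)
print(named_rules)
```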

We can force a simpler ruleset using max_rules, max_total_conds, or max_rule_conds.

In [8]:
ripper_clf = lw.RIPPER(max_rules=2, random_state=1)
ripper_clf.fit(train, class_feat='Party', pos_class='democrat')
ripper_clf.ruleset_ 

<Ruleset [physician-fee-freeze=n] V [synfuels-corporation-cutback=y^physician-fee-freeze=?]>

Setting verbosity prints the training steps:

In [9]:
ripper_clf = lw.RIPPER(random_state=42, verbosity=1) # Scale of 0-5
ripper_clf.fit(train, class_feat='Party', pos_class='democrat')
ripper_clf.ruleset_


GREW INITIAL RULESET:
[[physician-fee-freeze=n] V
[synfuels-corporation-cutback=y] V
[superfund-right-to-sue=?^export-administration-act-south-africa=y] V
[adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n^immigration=n]]

optimization run 1 of 2

OPTIMIZED RULESET:
[[physician-fee-freeze=n] V
[synfuels-corporation-cutback=y] V
[superfund-right-to-sue=?^export-administration-act-south-africa=y] V
[adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n^immigration=n]]

No changes were made. Halting optimization.
GREW FINAL RULES
[[physician-fee-freeze=n] V
[synfuels-corporation-cutback=y] V
[superfund-right-to-sue=?^export-administration-act-south-africa=y] V
[adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n^immigration=n] V
[physician-fee-freeze=n] V
[synfuels-corporation-cutback=y] V
[superfund-right-to-sue=?^export-administration-act-south-africa=y] V
[adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n^immigration=n]]

FINAL RULESET:
[[phys

<Ruleset [physician-fee-freeze=n] V [adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n^immigration=n]>

#### Model selection

Some sklearn model selection tools are supported. Cross-validation:

In [10]:
from sklearn.model_selection import cross_val_score

# One-hot encode the features and binarize the target so sklearn accepts them
X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes('object').columns)
y_train = y_train.map(lambda x: 1 if x=='democrat' else 0)

ripper_clf = lw.RIPPER(random_state=42)
cross_val_score(ripper_clf, X_train, y_train)


array([0.95454545, 0.90769231, 0.89230769, 0.92307692, 0.90769231])

Grid-search:

In [11]:
from sklearn.model_selection import GridSearchCV
param_grid = {"prune_size": [0.1, 0.25, 0.33, 0.5], "k": [1, 2]}
grid = GridSearchCV(estimator=ripper_clf, param_grid=param_grid)
grid.fit(X_train, y_train)
grid.best_params_

{'k': 1, 'prune_size': 0.33}
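To get a feel for how much work this grid search does, the sketch below enumerates the candidates in param_grid using only the standard library. The fold count of 5 is an assumption based on sklearn's default cv, which matches the five scores returned by cross_val_score above.

```python
from itertools import product

param_grid = {"prune_size": [0.1, 0.25, 0.33, 0.5], "k": [1, 2]}

# Every combination of the listed values is one candidate model.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 5  # assumption: sklearn's default number of CV folds
total_cv_fits = len(candidates) * n_folds

print(len(candidates))  # 8 candidates (4 prune sizes x 2 values of k)
print(total_cv_fits)    # 40 cross-validation fits
```

With the default refit=True, GridSearchCV then fits the best candidate once more on the full training data.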

Ensemble:

In [12]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

nb = GaussianNB()
tree = DecisionTreeClassifier(random_state=42)
estimators = [("rip", ripper_clf), ("tree", tree), ("nb", nb)]
ensemble_clf = StackingClassifier(
  estimators=estimators, final_estimator=LogisticRegression()
)


#### Testing

How good is our model?

In [13]:
X_test = test.drop('Party', axis=1)
y_test = test['Party']
ripper_clf = lw.RIPPER(random_state=42)
ripper_clf.fit(train, class_feat='Party', pos_class='democrat')
ripper_clf.score(X_test, y_test) # Default metric is accuracy

0.908256880733945

We can also score it on custom metrics, including sklearn's:

In [14]:
from sklearn.metrics import precision_score, recall_score
precision = ripper_clf.score(X_test, y_test, precision_score)
recall = ripper_clf.score(X_test, y_test, recall_score)
print(f'precision: {precision}')
print(f'recall: {recall}')

precision: 1.0
recall: 0.855072463768116
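As a worked check of what these metrics mean, the sketch below recomputes accuracy, precision, and recall from confusion-matrix counts. The counts are not printed by the library. They are one set of values inferred to be consistent with the three scores reported above, and should be treated as an assumption.

```python
# Confusion-matrix counts inferred from the reported scores (an assumption):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 59, 0, 10, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
precision = tp / (tp + fp)  # share of predicted democrats that are democrats
recall = tp / (tp + fn)     # share of actual democrats that were found

print(f"accuracy: {accuracy}")
print(f"precision: {precision}")
print(f"recall: {recall}")
```

Under these counts the model never labels a republican as a democrat (precision of 1.0), but it misses some democrats, which is what pulls recall and accuracy below 1.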


#### Prediction

To make predictions, use the predict method.

In [15]:
ripper_clf.predict(X_test.tail(10))

[False, True, True, True, False, False, False, True, False, True]

For predicted probabilities, use predict_proba.

In [16]:
ripper_clf.predict_proba(X_test.tail(10))

array([[0.75903614, 0.24096386],
       [0.01086957, 0.98913043],
       [0.01086957, 0.98913043],
       [0.01086957, 0.98913043],
       [0.75903614, 0.24096386],
       [0.75903614, 0.24096386],
       [0.75903614, 0.24096386],
       [0.01086957, 0.98913043],
       [0.75903614, 0.24096386],
       [0.01086957, 0.98913043]])
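The two outputs above can be compared directly. The sketch below copies both and checks that, for these ten rows, the second probability column exceeds 0.5 exactly where predict returned True. Treating the second column as the positive-class probability is an assumption that agrees with this output. It is an observation about these rows, not a description of how predict is implemented.

```python
# Values copied from the predict and predict_proba outputs above.
predictions = [False, True, True, True, False, False, False, True, False, True]
probas = [
    [0.75903614, 0.24096386],
    [0.01086957, 0.98913043],
    [0.01086957, 0.98913043],
    [0.01086957, 0.98913043],
    [0.75903614, 0.24096386],
    [0.75903614, 0.24096386],
    [0.75903614, 0.24096386],
    [0.01086957, 0.98913043],
    [0.75903614, 0.24096386],
    [0.01086957, 0.98913043],
]

# Each row holds two class probabilities that sum to 1.
rows_sum_to_one = all(abs(sum(row) - 1.0) < 1e-6 for row in probas)

# Assumption: column 1 is the probability of the positive class.
thresholded = [row[1] > 0.5 for row in probas]

print(rows_sum_to_one)             # True
print(thresholded == predictions)  # True
```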

We can also ask our model to give us the reasons for its predictions.

In [17]:
ripper_clf.predict(X_test.tail(), give_reasons=True)

([False, False, True, False, True],
 [[],
  [],
  [<Rule [physician-fee-freeze=n]>],
  [],
  [<Rule [physician-fee-freeze=n]>]])