# 1. Learn Boolean decision rules
### Import libraries

In [28]:
from pyrulelearn.imli import imli
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

### Model Configuration

We now create an instance of the `imli` object. First we learn a classification rule in CNF, that is, the decision rule is an AND of ORs of input features. For that, we specify `rule_type="CNF"` in the model. In this example, we learn a 2-clause rule with `data_fidelity=10`. The `data_fidelity` parameter sets the priority between accuracy and rule sparsity: a higher value favors a more accurate rule over a sparser one. Learning the Boolean rule requires a MaxSAT solver; in this example, we use `open-wbo`.

In [2]:
model = imli(rule_type="CNF", num_clause=2,  data_fidelity=10, solver="open-wbo", work_dir=".", verbose=True)
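
To make the CNF form concrete, the following minimal sketch evaluates an AND of ORs on Boolean features. The helper `eval_cnf` and the feature names `a`, `b`, `c` are hypothetical and are not part of `pyrulelearn`; they only illustrate how such a rule is read.

```python
# A CNF rule is represented here as a list of clauses;
# each clause is a list of Boolean feature names.
def eval_cnf(rule, sample):
    """Return True when every clause has at least one true feature."""
    return all(any(sample[name] for name in clause) for clause in rule)


# Hypothetical 2-clause rule: (a OR b) AND (b OR c)
rule = [["a", "b"], ["b", "c"]]

print(eval_cnf(rule, {"a": False, "b": True, "c": False}))  # True: b satisfies both clauses
print(eval_cnf(rule, {"a": True, "b": False, "c": False}))  # False: the second clause is unsatisfied
```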

### Load dataset
In this example, we learn a decision rule on the Iris flower dataset. While the original dataset is used for multiclass classification, we modify it for binary classification. Our objective is to learn a decision rule that separates `Iris Versicolour` from the other two classes of Iris: `Iris Setosa` and `Iris Virginica`. 

Our framework requires the training set to be discretized. In the following, we apply entropy-based discretization to the dataset. Alternatively, one can discretize the dataset by other means and use it directly.

In [3]:
X, y, features = model.discretize_orange("../benchmarks/iris_orange.csv")

Applying entropy based discretization using Orange library
- file name:  ../benchmarks/iris_orange.csv
- the number of discretized features: 22
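
For readers who prefer to discretize the data themselves, the sketch below shows one possible Boolean encoding of a numeric column: one 0/1 feature per interval between cut points. The helper `one_hot_intervals`, the cut points, and the choice of which interval end is closed are all assumptions of this sketch; the encoding produced by `discretize_orange` may differ.

```python
def one_hot_intervals(value, cut_points):
    """Return one 0/1 entry per interval defined by sorted cut points.

    The intervals are (-inf, c0), [c0, c1), ..., [ck, +inf).
    Which side of each interval is closed is an assumption of this sketch.
    """
    bounds = [float("-inf")] + list(cut_points) + [float("inf")]
    return [int(lo <= value < hi) for lo, hi in zip(bounds[:-1], bounds[1:])]


# Illustrative cut points for a single numeric column.
cuts = [2.45, 4.75]

print(one_hot_intervals(1.4, cuts))  # [1, 0, 0]
print(one_hot_intervals(4.0, cuts))  # [0, 1, 0]
print(one_hot_intervals(5.1, cuts))  # [0, 0, 1]
```

Concatenating such 0/1 vectors over all columns gives a Boolean feature matrix of the kind the framework expects.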


### Split dataset into train and test set

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Train the model

In [5]:
model.fit(X_train,y_train)


Training started for batch:  1
- number of soft clauses:  93
- number of Boolean variables: 157
- number of hard and soft clauses: 863


Batch tarining complete
- number of literals in the rule: 2
- number of training errors:    3 out of 49

Training started for batch:  2
- number of soft clauses:  95
- number of Boolean variables: 161
- number of hard and soft clauses: 890


Batch tarining complete
- number of literals in the rule: 4
- number of training errors:    1 out of 51
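
The log above reports two quantities per batch: the number of literals in the rule and the number of training errors. The `data_fidelity` parameter weighs these against each other. The sketch below illustrates that trade-off under an assumed cost of the form `literals + data_fidelity * errors`; this form, the helper names, and the candidate rules are assumptions made for illustration, not the exact objective solved by the library.

```python
def rule_cost(num_literals, num_errors, data_fidelity):
    # Assumed cost: each literal costs 1, each training error costs data_fidelity.
    return num_literals + data_fidelity * num_errors


# Hypothetical candidate rules as (number of literals, number of training errors).
candidates = {"sparse": (2, 3), "detailed": (6, 1)}


def best_candidate(data_fidelity):
    """Pick the candidate with the lowest assumed cost."""
    return min(
        candidates,
        key=lambda name: rule_cost(*candidates[name], data_fidelity),
    )


print(best_candidate(1))   # sparse:   2 + 1*3 = 5   beats 6 + 1*1 = 7
print(best_candidate(10))  # detailed: 6 + 10*1 = 16 beats 2 + 10*3 = 32
```

Under this assumed cost, a low `data_fidelity` prefers the shorter rule, while a high value prefers the rule with fewer errors.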


### The following function is used to assess the performance of the trained model

In [6]:
def measurement(cnf_matrix):
    # print(cnf_matrix)
    TN, FP, FN, TP = cnf_matrix.ravel()

    # Sensitivity, hit rate, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP)
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)

    # Overall accuracy
    ACC = (TP+TN)/(TP+FP+FN+TN)
    return TPR, TNR, PPV, NPV, FPR, FNR, FDR, ACC*100
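
As a worked example of these formulas, the sketch below counts the four confusion-matrix entries by hand for a small, made-up set of labels and derives a few of the metrics. The order TN, FP, FN, TP matches what `confusion_matrix(...).ravel()` returns for binary 0/1 labels. The helper `binary_confusion_counts` and the label vectors are hypothetical.

```python
def binary_confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for labels in {0, 1}."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up labels: 4 negatives followed by 6 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

TN, FP, FN, TP = binary_confusion_counts(y_true, y_pred)
print(TN, FP, FN, TP)                          # 3 1 2 4
print(round(TP / (TP + FN), 3))                # recall (TPR): 0.667
print(round(TN / (TN + FP), 3))                # specificity (TNR): 0.75
print(round(TP / (TP + FP), 3))                # precision (PPV): 0.8
print(100 * (TP + TN) / (TP + FP + FN + TN))   # accuracy in percent: 70.0
```

Note that any of these ratios is undefined when its denominator is zero, for example precision when the model never predicts the positive class.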

### Report performance of the learned rule

In [9]:
yhat_train = model.predict(X_train)
_, _, _, _, _, _, _, train_acc = measurement(confusion_matrix(y_train, yhat_train))
yhat_test = model.predict(X_test)
_, _, _, _, _, _, _, test_acc = measurement(confusion_matrix(y_test, yhat_test))
print("\ntraining    accuracy: ", train_acc)
print("test        accuracy: ", test_acc)



Prediction through MaxSAT formulation
- number of soft clauses:  188
- number of Boolean variables: 344
- number of hard and soft clauses: 2488

Prediction through MaxSAT formulation
- number of soft clauses:  138
- number of Boolean variables: 194
- number of hard and soft clauses: 1288

training    accuracy:  96.0
test        accuracy:  98.0


### Show the learned rule

In [9]:
rule = model.get_rule(features)
print("Learned rule is: \n")
print("An Iris flower is predicted as Iris Versicolor if")
print(rule)

Learned rule is: 

An Iris flower is predicted as Iris Versicolor if
( sepal length = (5.45 - 7.05) OR petal length = (2.45 - 4.75) ) AND 
( petal length = (2.45 - 4.75) OR petal width = (0.8 - 1.75))
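
The printed rule can be applied by hand to raw measurements. The sketch below transcribes it into plain Python; the function names are hypothetical, and treating each interval as closed on the left and open on the right is an assumption, since the printed rule does not state how boundaries are handled.

```python
def in_interval(value, low, high):
    # Closed on the left, open on the right: an assumption of this sketch.
    return low <= value < high


def predicts_versicolor(sepal_length, petal_length, petal_width):
    """Transcription of the learned 2-clause CNF rule shown above."""
    clause_1 = in_interval(sepal_length, 5.45, 7.05) or in_interval(petal_length, 2.45, 4.75)
    clause_2 = in_interval(petal_length, 2.45, 4.75) or in_interval(petal_width, 0.8, 1.75)
    return clause_1 and clause_2


# Hypothetical measurements in centimetres.
print(predicts_versicolor(6.0, 4.2, 1.3))  # True: both clauses are satisfied
print(predicts_versicolor(5.0, 1.4, 0.2))  # False: the first clause is unsatisfied
```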


# 2. Learn decision rules as DNF

To learn a decision rule as a DNF (an OR of ANDs of input features), we specify `rule_type="DNF"` in the parameters of the model. In the following, we learn a 2-clause DNF decision rule. 

In [10]:
model = imli(rule_type="DNF", num_clause=2,  data_fidelity=10, solver="open-wbo", work_dir=".", verbose=False)
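
For comparison with CNF, the following minimal sketch evaluates an OR of ANDs on Boolean features. The helper `eval_dnf` and the feature names are hypothetical and only illustrate how a DNF rule is read.

```python
# A DNF rule is represented here as a list of clauses;
# each clause is a list of Boolean feature names.
def eval_dnf(rule, sample):
    """Return True when at least one clause has all of its features true."""
    return any(all(sample[name] for name in clause) for clause in rule)


# Hypothetical 2-clause rule: (a AND b) OR (c)
rule = [["a", "b"], ["c"]]

print(eval_dnf(rule, {"a": True, "b": True, "c": False}))   # True: the first clause holds
print(eval_dnf(rule, {"a": True, "b": False, "c": False}))  # False: no clause holds
```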

In [12]:
X, y, features = model.discretize_orange("../benchmarks/iris_orange.csv")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model.fit(X_train,y_train)
yhat_train = model.predict(X_train)
_, _, _, _, _, _, _, train_acc = measurement(confusion_matrix(y_train, yhat_train))
yhat_test = model.predict(X_test)
_, _, _, _, _, _, _, test_acc = measurement(confusion_matrix(y_test, yhat_test))
print("\ntraining    accuracy: ", train_acc)
print("test        accuracy: ", test_acc)
rule = model.get_rule(features)
print("Learned rule is: \n")
print("An Iris flower is predicted as Iris Versicolor if")

print(rule)


training    accuracy:  96.0
test        accuracy:  98.0
Learned rule is: 

An Iris flower is predicted as Iris Versicolor if
( sepal length >=  7.05 AND not_petal width = (0.8 - 1.75) ) OR 
( not_petal length = (2.45 - 4.75))


# 3. Learn more expressive decision rules

Our framework allows one to learn more expressive decision rules, which we call relaxed_CNF rules. These rules allow thresholds on the satisfaction of clauses and literals, and can therefore learn more complex decision boundaries. See the [ECAI-2020](https://bishwamittra.github.io/publication/ecai_2020/paper.pdf) paper for more details. 


In our framework, set the parameter `rule_type="relaxed_CNF"` to learn such a rule.

In [13]:
model = imli(rule_type="relaxed_CNF", num_clause=2,  data_fidelity=10, solver="cplex", work_dir=".", verbose=False)

In [15]:
X, y, features = model.discretize_orange("../benchmarks/iris_orange.csv")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model.fit(X_train,y_train)
yhat_train = model.predict(X_train)
_, _, _, _, _, _, _, train_acc = measurement(confusion_matrix(y_train, yhat_train))
yhat_test = model.predict(X_test)
_, _, _, _, _, _, _, test_acc = measurement(confusion_matrix(y_test, yhat_test))
print("\ntraining    accuracy: ", train_acc)
print("test        accuracy: ", test_acc)


training    accuracy:  95.0
test        accuracy:  98.0


### Understanding the decision rule

In this example, we ask the framework to learn a 2-clause rule. During training, we learn the thresholds on clauses and literals while fitting the dataset. The learned rule operates at two levels. At the first level, a clause is satisfied if the number of satisfied literals in the clause meets the learned threshold on literals for that clause. At the second level, the formula is satisfied if the number of satisfied clauses meets the learned threshold on clauses.

In [16]:
rule = model.get_rule(features)
print("Learned rule is: \n")
print("An Iris flower is predicted as Iris Versicolor if")
print(rule)
print("\nThreshold on clause:", model.get_threshold_clause())
print("Threshold on literals: (this is a list where each entry denotes the threshold on literals of one clause)")
print(model.get_threshold_literal())

Learned rule is: 

An Iris flower is predicted as Iris Versicolor if
[ (  sepal width >=  2.95  + petal length = (2.45 - 4.75)  + petal width = (0.8 - 1.75)   )>= 2  ] +
[ ( )>= 0  ]  >= 2

Threshold on clause: 2
Threshold on literals: (this is a list where each entry denotes the threshold on literals of one clause)
[2, 0]
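
The two-level evaluation can be sketched in a few lines of Python. The evaluator below counts satisfied literals per clause, compares each count with that clause's threshold, and then compares the number of satisfied clauses with the clause threshold. The function `eval_relaxed_cnf` and the feature keys are hypothetical names chosen for this sketch; the thresholds are the ones printed above.

```python
def eval_relaxed_cnf(clauses, literal_thresholds, clause_threshold, sample):
    """Evaluate a relaxed-CNF rule on a dict of Boolean features."""
    satisfied_clauses = 0
    for clause, threshold in zip(clauses, literal_thresholds):
        # Level 1: a clause holds when enough of its literals are true.
        true_literals = sum(1 for name in clause if sample[name])
        if true_literals >= threshold:
            satisfied_clauses += 1
    # Level 2: the formula holds when enough clauses are satisfied.
    return satisfied_clauses >= clause_threshold


# The learned rule above, with hypothetical keys for the three Boolean features.
clauses = [["sepal_width_ge_2.95", "petal_length_2.45_4.75", "petal_width_0.8_1.75"], []]
literal_thresholds = [2, 0]
clause_threshold = 2

two_true = {"sepal_width_ge_2.95": False, "petal_length_2.45_4.75": True, "petal_width_0.8_1.75": True}
one_true = {"sepal_width_ge_2.95": True, "petal_length_2.45_4.75": False, "petal_width_0.8_1.75": False}

print(eval_relaxed_cnf(clauses, literal_thresholds, clause_threshold, two_true))  # True
print(eval_relaxed_cnf(clauses, literal_thresholds, clause_threshold, one_true))  # False
```

The empty second clause has threshold 0 and is therefore always satisfied, so in this rule the prediction depends only on whether at least two of the three literals in the first clause hold.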


# extra

In [1]:
import numpy as np 
train = np.load("../data/train.npa")
test = np.load("../data/test.npa")
labels = np.load("../data/labels.npa")
assert len(labels) == len(train)

In [2]:
from pyrulelearn.imli import imli
clf = imli(rule_type="CNF", num_clause=10, data_fidelity=1, solver="open-wbo", work_dir="../data/", verbose=True)
clf.fit(train, labels)

Training started for batch:  433
- number of soft clauses:  4042
- number of Boolean variables: 4142
- number of hard and soft clauses: 24372


Batch tarining complete
- number of literals in the rule: 94
- number of training errors:    2 out of 42

Training started for batch:  434
- number of soft clauses:  4044
- number of Boolean variables: 4154
- number of hard and soft clauses: 26385


Batch tarining complete
- number of literals in the rule: 94
- number of training errors:    1 out of 44

Training started for batch:  435
- number of soft clauses:  4042
- number of Boolean variables: 4142
- number of hard and soft clauses: 24372


Batch tarining complete
- number of literals in the rule: 95
- number of training errors:    3 out of 42

Training started for batch:  436
- number of soft clauses:  4044
- number of Boolean variables: 4154
- number of hard and soft clauses: 26385


Batch tarining complete
- number of literals in the rule: 93
- number of training errors:    4 out of 44



In [3]:
print(clf.get_rule([str(i) for i in range(len(train[0]))]))
print()
print(clf.get_selected_column_index())

( not 28 AND not 37 AND not 105 AND not 109 AND not 179 AND 52 AND 115 AND 188 ) OR 
( not 2 AND not 24 AND not 85 AND not 130 AND not 187 AND 45 AND 112 AND 175 AND 182 AND 195 ) OR 
( not 2 AND not 7 AND not 12 AND not 27 AND not 35 AND not 88 AND not 90 AND not 135 AND not 174 AND 164 AND 167 ) OR 
( not 93 AND not 123 AND not 132 AND 109 AND 126 AND 148 AND 172 AND 174 AND 180 AND 183 AND 186 AND 187 AND 188 AND 190 AND 194 ) OR 
( not 31 AND not 88 AND not 101 AND 2 AND 32 AND 129 AND 161 AND 166 AND 172 AND 181 AND 189 ) OR 
( not 0 AND not 1 AND not 32 AND not 34 AND 40 AND 137 ) OR 
( not 0 AND not 48 AND not 77 AND 13 AND 25 AND 160 AND 189 AND 197 ) OR 
( not 12 AND not 26 AND not 50 AND not 62 AND not 72 AND not 117 AND 38 AND 115 AND 128 AND 143 AND 186 ) OR 
( not 6 AND not 105 AND 31 AND 52 AND 100 AND 112 AND 139 AND 178 AND 182 ) OR 
( not 22 AND not 69 AND not 99 AND 47 AND 96 AND 185)

[[-28, -37, -105, -109, -179, 52, 115, 188], [-2, -24, -85, -130, -187, 45, 112, 17

In [4]:
yhat = clf.predict(test)
len(yhat)

113077

In [5]:
from sklearn.metrics import classification_report
print(classification_report(labels, clf.predict(train), target_names=['0','1']))

              precision    recall  f1-score   support

           0       1.00      0.02      0.05     16619
           1       0.25      1.00      0.40      5417

    accuracy                           0.26     22036
   macro avg       0.63      0.51      0.22     22036
weighted avg       0.82      0.26      0.14     22036


