# ABCN2 Algorithm

This notebook imports a dataset and applies the ABCN2 algorithm to it. ABCN2 uses the CN2 rule-based learning algorithm together with expert rules, which we also derive from the dataset in this notebook.

Data: Heart Attack Prediction, https://www.kaggle.com/imnikhilanand/heart-attack-prediction/downloads/heart-attack-prediction.zip/1

## Imports and reading data

In [1]:
import numpy as np
import pandas as pd
import Orange

In [2]:
FILE_INPUT = 'data.csv'

In [3]:
df = pd.read_csv(FILE_INPUT)
df = df.replace('?', np.nan)

## Data exploration

In [4]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130,132.0,0,2,185,0,0.0,,,,0
1,29,1,2,120,243.0,0,0,160,0,0.0,,,,0
2,29,1,2,140,,0,0,170,0,0.0,,,,0
3,30,0,1,170,237.0,0,1,170,0,0.0,,,6.0,0
4,31,0,2,100,219.0,0,1,150,0,0.0,,,,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 14 columns):
age           294 non-null int64
sex           294 non-null int64
cp            294 non-null int64
trestbps      293 non-null object
chol          271 non-null object
fbs           286 non-null object
restecg       293 non-null object
thalach       293 non-null object
exang         293 non-null object
oldpeak       294 non-null float64
slope         104 non-null object
ca            3 non-null object
thal          28 non-null object
num           294 non-null int64
dtypes: float64(1), int64(4), object(9)
memory usage: 32.3+ KB
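
The `object` dtypes reported above are what pandas falls back to when a column mixes numbers with `'?'` placeholders. As a minimal sketch (on a made-up three-row frame, not the real dataset), such columns can be coerced to numeric with `pd.to_numeric`, which turns unparseable entries into `NaN`. The frame `toy` and the list `numeric_cols` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame standing in for the heart-attack data (illustrative values only).
toy = pd.DataFrame({
    "age": [28, 29, 30],
    "chol": ["132", "?", "237"],
    "trestbps": ["130", "120", "?"],
})

# Hypothetical list of columns that were read as strings because of '?'.
numeric_cols = ["chol", "trestbps"]

# errors="coerce" converts anything that is not a number (here '?') to NaN.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())
```

After the coercion, `describe()` would also cover these columns, since it only summarises numeric dtypes by default.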


In [6]:
df.describe()

Unnamed: 0,age,sex,cp,oldpeak,num
count,294.0,294.0,294.0,294.0,294.0
mean,47.826531,0.72449,2.982993,0.586054,0.360544
std,7.811812,0.447533,0.965117,0.908648,0.480977
min,28.0,0.0,1.0,0.0,0.0
25%,42.0,0.0,2.0,0.0,0.0
50%,49.0,1.0,3.0,0.0,0.0
75%,54.0,1.0,4.0,1.0,1.0
max,66.0,1.0,4.0,5.0,1.0


In [7]:
# Number of unique values in each column
df.nunique()

age            38
sex             2
cp              4
trestbps       31
chol          153
fbs             2
restecg         3
thalach        71
exang           2
oldpeak        10
slope           3
ca              1
thal            3
num             2
dtype: int64

In [8]:
# Number of NaNs in each column
df.isnull().sum()

age             0
sex             0
cp              0
trestbps        1
chol           23
fbs             8
restecg         1
thalach         1
exang           1
oldpeak         0
slope         190
ca            291
thal          266
num             0
dtype: int64
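
The counts above suggest that a few columns are almost entirely missing. One way to act on this, sketched here under assumptions, is to compute the fraction of missing values per column and keep only the columns below a cut-off. The toy frame and the `MAX_MISSING` threshold of 0.5 are hypothetical choices for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one sparse column (illustrative values only).
toy = pd.DataFrame({
    "age": [28, 29, 30, 31],
    "chol": [132.0, np.nan, 237.0, 219.0],
    "ca": [np.nan, np.nan, np.nan, 0.0],
})

MAX_MISSING = 0.5  # hypothetical cut-off: drop columns with more than 50% NaNs

# isnull() gives a boolean frame; its column-wise mean is the missing fraction.
missing_fraction = toy.isnull().mean()

# Keep only the columns whose missing fraction is within the cut-off.
keep = missing_fraction[missing_fraction <= MAX_MISSING].index
reduced = toy[keep]

print(missing_fraction)
print(list(reduced.columns))
```

A fraction-based threshold is used rather than an absolute count so that the same rule applies regardless of how many rows the dataset has.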

## CN2 Algorithm

Documentation: https://github.com/biolab/orange3, https://docs.biolab.si//3/data-mining-library/
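
CN2 scores candidate rules by how pure the class distribution of the covered examples is. The sketch below assumes an entropy-based score, reported as negative entropy so that higher is better; to my understanding this matches the default evaluator of Orange's `CN2Learner`, but treat that as an assumption rather than a statement about the library. The function `negative_entropy` is a hypothetical helper, not part of Orange.

```python
import math


def negative_entropy(class_counts):
    """Return minus the Shannon entropy (in bits) of a class-count vector."""
    total = sum(class_counts)
    entropy = 0.0
    for count in class_counts:
        if count > 0:  # empty classes contribute nothing to the entropy
            p = count / total
            entropy -= p * math.log2(p)
    return -entropy


# A rule covering examples of a single class is perfectly pure.
print(negative_entropy([0, 5]))
# A rule covering both classes equally is maximally impure for two classes.
print(negative_entropy([3, 3]))
```

Under this assumption, a rule whose class distribution contains only one non-zero entry, such as `[0, 5]`, gets the best possible score of zero, which would explain a printed quality of `-0.0`.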

In [155]:
data = Orange.data.Table(FILE_INPUT)

# Exclude columns slope, ca and thal, which contain many missing values.
# The first 10 columns become attributes; the last column (num) becomes the class.
new_domain = Orange.data.Domain(
    list(data.domain.attributes[:10]),
    list(data.domain.attributes[13:])
)


heart_attack = Orange.data.Table(new_domain, data)

In [163]:
print(heart_attack[0:10])

[[28, 1, 2, 130, 132, 0, 2, 185, 0, 0.0 | 0],
 [29, 1, 2, 120, 243, 0, 0, 160, 0, 0.0 | 0],
 [29, 1, 2, 140, ?, 0, 0, 170, 0, 0.0 | 0],
 [30, 0, 1, 170, 237, 0, 1, 170, 0, 0.0 | 0],
 [31, 0, 2, 100, 219, 0, 1, 150, 0, 0.0 | 0],
 [32, 0, 2, 105, 198, 0, 0, 165, 0, 0.0 | 0],
 [32, 1, 2, 110, 225, 0, 0, 184, 0, 0.0 | 0],
 [32, 1, 2, 125, 254, 0, 0, 155, 0, 0.0 | 0],
 [33, 1, 3, 120, 298, 0, 0, 185, 0, 0.0 | 0],
 [34, 0, 2, 130, 161, 0, 0, 190, 0, 0.0 | 0]]


In [157]:
heart_attack.domain.attributes

(ContinuousVariable(name='age', number_of_decimals=0),
 DiscreteVariable(name='sex', values=['0', '1']),
 ContinuousVariable(name='cp', number_of_decimals=0),
 ContinuousVariable(name='trestbps', number_of_decimals=0),
 ContinuousVariable(name='chol', number_of_decimals=0),
 DiscreteVariable(name='fbs', values=['0', '1']),
 ContinuousVariable(name='restecg', number_of_decimals=0),
 ContinuousVariable(name='thalach', number_of_decimals=0),
 DiscreteVariable(name='exang', values=['0', '1']),
 ContinuousVariable(name='oldpeak', number_of_decimals=1))

In [158]:
for x in heart_attack.domain.attributes:
    n_miss = sum(1 for d in heart_attack if np.isnan(d[x]))
    print("%4.1f%% %s" % (100.0 * n_miss / len(heart_attack), x.name))

 0.0% age
 0.0% sex
 0.0% cp
 0.3% trestbps
 7.8% chol
 2.7% fbs
 0.3% restecg
 0.3% thalach
 0.3% exang
 0.0% oldpeak


In [159]:
heart_attack.domain.class_var

DiscreteVariable(name='num', values=['0', '1'])

In [162]:
# Construct a learning algorithm and classifier
cn2_learner = Orange.classification.rules.CN2Learner()
cn2_classifier = cn2_learner(heart_attack)

# Print each induced rule with its class distribution (curr_class_dist) and quality
for rule in cn2_classifier.rule_list:
    print(rule.curr_class_dist.tolist(), rule, rule.quality)

[0, 5] IF age>=63.0 THEN num=1  -0.0
[1, 0] IF restecg>=2.0 AND sex==0 THEN num=0  -0.0
[6, 0] IF exang==0 AND trestbps>=170.0 THEN num=0  -0.0
[0, 3] IF exang==0 AND chol>=491.0 THEN num=1  -0.0
[4, 0] IF exang==0 AND restecg>=2.0 THEN num=0  -0.0
[0, 1] IF restecg>=2.0 THEN num=1  -0.0
[3, 0] IF thalach>=132.0 AND age>=59.0 THEN num=0  -0.0
[0, 6] IF oldpeak>=1.0 AND chol>=388.0 THEN num=1  -0.0
[21, 0] IF exang==0 AND chol>=297.0 THEN num=0  -0.0
[0, 9] IF oldpeak>=1.0 AND chol>=329.0 THEN num=1  -0.0
[3, 0] IF exang!=0 AND age>=59.0 THEN num=0  -0.0
[2, 0] IF exang!=0 AND chol>=328.0 THEN num=0  -0.0
[0, 3] IF trestbps>=170.0 THEN num=1  -0.0
[0, 3] IF sex==0 AND chol>=288.0 THEN num=1  -0.0
[14, 0] IF sex==0 AND restecg>=1.0 THEN num=0  -0.0
[0, 5] IF exang!=0 AND trestbps>=160.0 THEN num=1  -0.0
[1, 0] IF chol<=201.0 AND fbs!=0 THEN num=0  -0.0
[3, 0] IF exang!=0 AND thalach>=160.0 THEN num=0  -0.0
[3, 0] IF thalach>=135.0 AND thalach>=184.0 THEN num=0  -0.0
[0, 6] IF oldpeak>=1.