The article "Association Rule Mining to Detect Factors Which Contribute to Heart Disease in Males and Females" applies association rule mining to extract rules and patterns related to heart disease risk. The data used in the paper is the UCI Cleveland Heart Disease dataset. The goal was to extract rules that differentiate between healthy and sick patients, and to determine whether these patterns differ between males and females.

Three rule mining algorithms were applied: Apriori, Predictive Apriori, and Tertius. Each algorithm used a different criterion for selecting the most important rules. Apriori used a confidence threshold above 90%, Predictive Apriori selected rules with accuracy over 99%, and Tertius chose rules with confirmation levels above 79%. The analysis was conducted in two phases: first on the entire dataset to identify general sick and healthy rules, and then separately on males and females to identify rules specific to each gender.
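To make the support and confidence thresholds concrete, here is a minimal sketch that computes both measures by hand for a single rule. The transactions and the helper name `support` are made up for illustration; they are not taken from the paper or from the Cleveland data.

```python
# Toy transactions (illustrative only, not real patient records).
transactions = [
    {"cp: asympt", "exang: yes", "sick"},
    {"cp: asympt", "exang: yes", "sick"},
    {"cp: asympt", "exang: no", "healthy"},
    {"cp: notang", "exang: no", "healthy"},
    {"cp: asympt", "exang: yes", "healthy"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent = {"cp: asympt", "exang: yes"}
consequent = {"sick"}

# Support of the rule: share of transactions containing antecedent and consequent.
rule_support = support(antecedent | consequent, transactions)
# Confidence: how often the consequent holds when the antecedent holds.
rule_confidence = rule_support / support(antecedent, transactions)

print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Here the rule {cp: asympt, exang: yes} -> {sick} is supported by 2 of 5 transactions, and the antecedent occurs in 3 of 5, so the confidence is 2/3. A 90% confidence threshold would discard this rule.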

The results showed that females were more often associated with healthy rules, indicating that women have a lower risk of heart disease. For both genders, certain factors were strong indicators of sickness, such as having asymptomatic chest pain and the presence of exercise-induced angina. However, some risk factors were more specific to each gender. For females, a normal or hypertrophic resting ECG and a flat ST segment slope were indicators of heart disease. In men, only a hypertrophic resting ECG appeared prominently as a risk factor. On the other hand, indicators of good health common to both genders included an upsloping ST segment slope, zero colored major vessels, and an oldpeak value of 0.56 or less.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
pd.set_option('display.max_colwidth', None)

In [2]:
df = pd.read_csv("processed.cleveland.data", delimiter=",", header=None)

feature_names = pd.read_csv("costs/heart-disease.cost", delimiter=":", header=None).drop(1, axis=1).values.reshape(1,13)
df.columns = np.append(feature_names,'class') 

display(df.head(3))

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1


In [3]:
qm_counts_per_column = (df == '?').sum()
print(qm_counts_per_column)

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
class       0
dtype: int64


The ca and thal columns contain missing values, marked as '?'.
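As an alternative to comparing against '?' after loading, pandas can treat the marker as missing at read time through the `na_values` parameter of `read_csv`. The sketch below uses a few made-up rows and a shortened, hypothetical column list instead of the real file.

```python
import io

import pandas as pd

# Three made-up rows in the same comma-separated layout; "?" marks a missing value.
raw = io.StringIO(
    "63.0,1.0,0.0,6.0,0\n"
    "67.0,1.0,?,3.0,2\n"
    "52.0,0.0,1.0,?,0\n"
)
cols = ["age", "sex", "ca", "thal", "class"]  # shortened column list, illustration only

# na_values="?" turns the marker into NaN while parsing.
sample = pd.read_csv(raw, header=None, names=cols, na_values="?")

missing_per_column = sample.isna().sum()
clean = sample.dropna().reset_index(drop=True)

print(missing_per_column)
print(clean)
```

With this approach ca and thal load as floats rather than strings, so a later mapping for thal would need numeric keys (3.0, 6.0, 7.0) instead of the string keys used in this notebook.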

In [4]:
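# Exclude rows that contain the missing-value marker '?'.
# .copy() avoids chained-assignment warnings when columns are reassigned later.
df = df[~(df == '?').any(axis=1)].copy()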

In [5]:
qm_counts_per_column = (df == '?').sum()
print(qm_counts_per_column)

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
class       0
dtype: int64


In [6]:
#sns.pairplot(df)

In [7]:
# Map numeric category codes to descriptive nominal labels

df['sex'] = df['sex'].map({0: 'female', 1: 'male'})
df['cp'] = df['cp'].map({1: 'cp: angina', 2: 'cp: abnang', 3:'cp: notang', 4:'cp: asympt'})
df['fbs'] = df['fbs'].map({0: 'fbs: False', 1: 'fbs: True'})
df['restecg'] = df['restecg'].map({0: 'restecg: norm', 1: 'restecg: abn', 2:'restecg: hyp'})
df['exang'] = df['exang'].map({0: 'exang: no', 1: 'exang: yes'})
df['slope'] = df['slope'].map({1: 'slope: upsloping', 2: 'slope: flat', 3:'slope: downsloping'})
df['ca'] = pd.to_numeric(df['ca'], errors='coerce')
df = df.dropna(subset=['ca'])
df['ca'] = df['ca'].astype(int)
df['ca'] = df['ca'].map({0: 'col. vessle: 0', 1: 'col. vessle: 1', 2: 'col vessle: 2', 3:'col vessle: 3'})
df['thal'] = df['thal'].map({'3.0': 'thal: normal','6.0': 'thal: fixed defect', '7.0':'thal: reversable defect'})
df['class'] = np.where(df['class'] == 0, 'healthy', 'sick')

display(df.head(3))

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
0,63.0,male,cp: angina,145.0,233.0,fbs: True,restecg: hyp,150.0,exang: no,2.3,slope: downsloping,col. vessle: 0,thal: fixed defect,healthy
1,67.0,male,cp: asympt,160.0,286.0,fbs: False,restecg: hyp,108.0,exang: yes,1.5,slope: flat,col vessle: 3,thal: normal,sick
2,67.0,male,cp: asympt,120.0,229.0,fbs: False,restecg: hyp,129.0,exang: yes,2.6,slope: flat,col vessle: 2,thal: reversable defect,sick


In [8]:
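def bin_column(df, column, n_bins=5):
    """Replace a numeric column (in place) with equal-width bin labels."""
    min_val = df[column].min()
    max_val = df[column].max()

    # Equal-width edges; the outer edges are opened to -inf/+inf so no value falls outside.
    bins = np.linspace(min_val, max_val, n_bins + 1).round(2)
    bins[0] = -np.inf
    bins[-1] = np.inf

    # Prefix each label with the column name so items stay unique across columns.
    labels = [f"{column} ({bins[i]}, {bins[i+1]}]" for i in range(len(bins) - 1)]

    df[column] = pd.cut(df[column], bins=bins, labels=labels, include_lowest=True)

    return df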

In [9]:
numeric_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
numeric_columns

['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

In [10]:
for i in numeric_columns:
    bin_column(df, i)
    
display(df.head(3))

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
0,"age (57.8, 67.4]",male,cp: angina,"trestbps (136.4, 157.6]","chol (213.6, 301.2]",fbs: True,restecg: hyp,"thalach (149.6, 175.8]",exang: no,"oldpeak (1.24, 2.48]",slope: downsloping,col. vessle: 0,thal: fixed defect,healthy
1,"age (57.8, 67.4]",male,cp: asympt,"trestbps (157.6, 178.8]","chol (213.6, 301.2]",fbs: False,restecg: hyp,"thalach (97.2, 123.4]",exang: yes,"oldpeak (1.24, 2.48]",slope: flat,col vessle: 3,thal: normal,sick
2,"age (57.8, 67.4]",male,cp: asympt,"trestbps (115.2, 136.4]","chol (213.6, 301.2]",fbs: False,restecg: hyp,"thalach (123.4, 149.6]",exang: yes,"oldpeak (2.48, 3.72]",slope: flat,col vessle: 2,thal: reversable defect,sick
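The bin labels in the output above come from equal-width binning. A minimal sketch of the same idea on a small made-up series (the values are illustrative; only the minimum of 29 and maximum of 77 are chosen so that the edges match the age bins shown above):

```python
import numpy as np
import pandas as pd

ages = pd.Series([29, 41, 54, 63, 77], name="age")  # made-up values

n_bins = 5
# Five equal-width intervals between the minimum and maximum.
edges = np.linspace(ages.min(), ages.max(), n_bins + 1).round(2)
# Open the outer edges so values outside the observed range would still be binned.
edges[0], edges[-1] = -np.inf, np.inf

labels = [f"age ({edges[i]}, {edges[i + 1]}]" for i in range(n_bins)]
binned = pd.cut(ages, bins=edges, labels=labels)

print(binned.tolist())
```

Each value lands in exactly one right-closed interval, and the label carries the column name so that bins from different columns remain distinct items once the rows are turned into transactions.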


In [11]:
df = df.reset_index(drop=True)

### Apriori

In [12]:
transactions = df.astype(str).values.tolist()

te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df_transformed = pd.DataFrame(te_array, columns=te.columns_)
df_transformed.head(3)

Unnamed: 0,"age (-inf, 38.6]","age (38.6, 48.2]","age (48.2, 57.8]","age (57.8, 67.4]","age (67.4, inf]","chol (-inf, 213.6]","chol (213.6, 301.2]","chol (301.2, 388.8]","chol (388.8, 476.4]","chol (476.4, inf]",...,"thalach (-inf, 97.2]","thalach (123.4, 149.6]","thalach (149.6, 175.8]","thalach (175.8, inf]","thalach (97.2, 123.4]","trestbps (-inf, 115.2]","trestbps (115.2, 136.4]","trestbps (136.4, 157.6]","trestbps (157.6, 178.8]","trestbps (178.8, inf]"
0,False,False,False,True,False,False,True,False,False,False,...,False,False,True,False,False,False,False,True,False,False
1,False,False,False,True,False,False,True,False,False,False,...,False,False,False,False,True,False,False,False,True,False
2,False,False,False,True,False,False,True,False,False,False,...,False,True,False,False,False,False,True,False,False,False


In [13]:
frequent_itemsets = apriori(df_transformed, min_support=0.15, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.93)

rules = rules.sort_values(by='confidence', ascending=False)

def rule_filter(df, target_class):
    return df[
        (df['antecedents'].apply(lambda x: len(x) >= 3)) &  
        (df['consequents'] == frozenset([target_class])) &  
        ((df['support'] > 0.16) | (df['confidence'] > 0.94))
    ][['antecedents', 'consequents', 'support', 'confidence']].reset_index(drop=True)


rules_sick = rule_filter(rules, 'sick')
rules_healthy = rule_filter(rules, 'healthy')


print("Rules - sick:")
display(rules_sick)

print("Rules - 'healthy':")
display(rules_healthy)


Rules - sick:


Unnamed: 0,antecedents,consequents,support,confidence
0,"(slope: flat, cp: asympt, thal: reversable defect)",(sick),0.151515,0.957447
1,"(cp: asympt, exang: yes, thal: reversable defect)",(sick),0.161616,0.941176


Rules - 'healthy':


Unnamed: 0,antecedents,consequents,support,confidence
0,"(col. vessle: 0, female, exang: no)",(healthy),0.161616,0.96
1,"(col. vessle: 0, thal: normal, female, exang: no)",(healthy),0.154882,0.958333
2,"(fbs: False, col. vessle: 0, female, exang: no)",(healthy),0.151515,0.957447
3,"(col. vessle: 0, trestbps (115.2, 136.4], thal: normal, exang: no)",(healthy),0.171717,0.944444
4,"(fbs: False, thal: normal, female, exang: no)",(healthy),0.188552,0.933333


The algorithm was run with a minimum support threshold of 0.15 and a confidence threshold of 0.93. To obtain the same association rules as those reported in the reference paper, an additional filter was applied that keeps rules with at least three antecedent items and either support greater than 0.16 or confidence greater than 0.94. These specific thresholds are not explicitly mentioned in the paper.
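In `rule_filter` the support and confidence conditions are joined with `|`, so a rule is kept when either one holds. The sketch below shows how that differs from requiring both. The first two rows reuse the support and confidence values from the sick rules above; the third row and the name `rules_demo` are hypothetical.

```python
import pandas as pd

rules_demo = pd.DataFrame({
    "antecedents": [
        frozenset({"slope: flat", "cp: asympt", "thal: reversable defect"}),
        frozenset({"cp: asympt", "exang: yes", "thal: reversable defect"}),
        frozenset({"cp: asympt", "exang: yes"}),  # hypothetical two-item rule
    ],
    "consequents": [frozenset({"sick"})] * 3,
    "support": [0.151515, 0.161616, 0.155000],
    "confidence": [0.957447, 0.941176, 0.935000],
})

long_enough = rules_demo["antecedents"].apply(len) >= 3
targets_sick = rules_demo["consequents"].apply(lambda c: c == frozenset({"sick"}))

# Either condition is enough (as in rule_filter) versus both required.
either = (rules_demo["support"] > 0.16) | (rules_demo["confidence"] > 0.94)
both = (rules_demo["support"] > 0.16) & (rules_demo["confidence"] > 0.94)

kept_or = rules_demo[long_enough & targets_sick & either]
kept_and = rules_demo[long_enough & targets_sick & both]

print(len(kept_or), len(kept_and))
```

The first rule has support below 0.16 but passes on confidence alone, which is why it appears in the output above. Requiring both conditions would drop it.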

<b>Rule extraction for males and females</b>

In [14]:
male_df = df.loc[df['sex']=='male'].drop('sex', axis=1)
male_df = male_df.reset_index(drop=True)
male_df.head(3)

Unnamed: 0,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
0,"age (57.8, 67.4]",cp: angina,"trestbps (136.4, 157.6]","chol (213.6, 301.2]",fbs: True,restecg: hyp,"thalach (149.6, 175.8]",exang: no,"oldpeak (1.24, 2.48]",slope: downsloping,col. vessle: 0,thal: fixed defect,healthy
1,"age (57.8, 67.4]",cp: asympt,"trestbps (157.6, 178.8]","chol (213.6, 301.2]",fbs: False,restecg: hyp,"thalach (97.2, 123.4]",exang: yes,"oldpeak (1.24, 2.48]",slope: flat,col vessle: 3,thal: normal,sick
2,"age (57.8, 67.4]",cp: asympt,"trestbps (115.2, 136.4]","chol (213.6, 301.2]",fbs: False,restecg: hyp,"thalach (123.4, 149.6]",exang: yes,"oldpeak (2.48, 3.72]",slope: flat,col vessle: 2,thal: reversable defect,sick


In [15]:
female_df = df.loc[df['sex']=='female'].drop('sex', axis=1)
female_df = female_df.reset_index(drop=True)
female_df.head(3)

Unnamed: 0,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
0,"age (38.6, 48.2]",cp: abnang,"trestbps (115.2, 136.4]","chol (-inf, 213.6]",fbs: False,restecg: hyp,"thalach (149.6, 175.8]",exang: no,"oldpeak (1.24, 2.48]",slope: upsloping,col. vessle: 0,thal: normal,healthy
1,"age (57.8, 67.4]",cp: asympt,"trestbps (136.4, 157.6]","chol (213.6, 301.2]",fbs: False,restecg: hyp,"thalach (149.6, 175.8]",exang: no,"oldpeak (2.48, 3.72]",slope: downsloping,col vessle: 2,thal: normal,sick
2,"age (48.2, 57.8]",cp: asympt,"trestbps (115.2, 136.4]","chol (301.2, 388.8]",fbs: False,restecg: norm,"thalach (149.6, 175.8]",exang: yes,"oldpeak (-inf, 1.24]",slope: upsloping,col. vessle: 0,thal: normal,healthy


In [16]:
transactions_male = male_df.astype(str).values.tolist()

te_array = te.fit(transactions_male).transform(transactions_male)
male_df_transformed = pd.DataFrame(te_array, columns=te.columns_)
male_df_transformed.head(3)

Unnamed: 0,"age (-inf, 38.6]","age (38.6, 48.2]","age (48.2, 57.8]","age (57.8, 67.4]","age (67.4, inf]","chol (-inf, 213.6]","chol (213.6, 301.2]","chol (301.2, 388.8]",col vessle: 2,col vessle: 3,...,"thalach (-inf, 97.2]","thalach (123.4, 149.6]","thalach (149.6, 175.8]","thalach (175.8, inf]","thalach (97.2, 123.4]","trestbps (-inf, 115.2]","trestbps (115.2, 136.4]","trestbps (136.4, 157.6]","trestbps (157.6, 178.8]","trestbps (178.8, inf]"
0,False,False,False,True,False,False,True,False,False,False,...,False,False,True,False,False,False,False,True,False,False
1,False,False,False,True,False,False,True,False,False,True,...,False,False,False,False,True,False,False,False,True,False
2,False,False,False,True,False,False,True,False,True,False,...,False,True,False,False,False,False,True,False,False,False


In [17]:
transactions_fe = female_df.astype(str).values.tolist()

te_array = te.fit(transactions_fe).transform(transactions_fe)
female_df_transformed = pd.DataFrame(te_array, columns=te.columns_)
female_df_transformed.head(3)

Unnamed: 0,"age (-inf, 38.6]","age (38.6, 48.2]","age (48.2, 57.8]","age (57.8, 67.4]","age (67.4, inf]","chol (-inf, 213.6]","chol (213.6, 301.2]","chol (301.2, 388.8]","chol (388.8, 476.4]","chol (476.4, inf]",...,"thalach (-inf, 97.2]","thalach (123.4, 149.6]","thalach (149.6, 175.8]","thalach (175.8, inf]","thalach (97.2, 123.4]","trestbps (-inf, 115.2]","trestbps (115.2, 136.4]","trestbps (136.4, 157.6]","trestbps (157.6, 178.8]","trestbps (178.8, inf]"
0,False,True,False,False,False,True,False,False,False,False,...,False,False,True,False,False,False,True,False,False,False
1,False,False,False,True,False,False,True,False,False,False,...,False,False,True,False,False,False,False,True,False,False
2,False,False,True,False,False,False,False,True,False,False,...,False,False,True,False,False,False,True,False,False,False


In [18]:
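# Class shares per gender. Only the last expression in a cell is displayed,
# so the output below shows the female subset.
male_df['class'].value_counts() / male_df['class'].value_counts().sum()
female_df['class'].value_counts() / female_df['class'].value_counts().sum()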

class
healthy    0.739583
sick       0.260417
Name: count, dtype: float64

<b>Rule extraction for females</b>

In [19]:
frequent_itemsets = apriori(female_df_transformed, min_support=0.08, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.91)

rules = rules.sort_values(by='confidence', ascending=False)

def rule_filter_sick(df):
    return df[
        (df['antecedents'].apply(lambda x: len(x) >= 3)) &
        (df['consequents'] == frozenset(['sick'])) &
        ((df['support'] > 0.08) | (df['confidence'] > 0.90))
    ][['antecedents', 'consequents', 'support', 'confidence']].reset_index(drop=True)

def rule_filter_healthy(df):
    return df[
        (df['antecedents'].apply(lambda x: len(x) >= 4)) &
        (df['consequents'] == frozenset(['healthy'])) &
        ((df['support'] > 0.2) & (df['confidence'] > 0.9))
    ][['antecedents', 'consequents', 'support', 'confidence']].reset_index(drop=True)

rules_sick = rule_filter_sick(rules)
rules_healthy = rule_filter_healthy(rules)

print("Rules - sick:")
display(rules_sick)

print("Rules - healthy:")
display(rules_healthy.head(10))


Rules - sick:


Unnamed: 0,antecedents,consequents,support,confidence
0,"(cp: asympt, exang: yes, thal: reversable defect)",(sick),0.083333,1.0
1,"(cp: asympt, exang: yes, thalach (123.4, 149.6])",(sick),0.083333,1.0
2,"(slope: flat, cp: asympt, thal: reversable defect)",(sick),0.104167,1.0
3,"(cp: asympt, thal: reversable defect, restecg: hyp)",(sick),0.083333,1.0
4,"(fbs: False, cp: asympt, thal: reversable defect)",(sick),0.09375,1.0
5,"(fbs: False, cp: asympt, thal: reversable defect, slope: flat)",(sick),0.083333,1.0


Rules - healthy:


Unnamed: 0,antecedents,consequents,support,confidence
0,"(fbs: False, oldpeak (-inf, 1.24], restecg: norm, col. vessle: 0, exang: no)",(healthy),0.21875,1.0
1,"(fbs: False, restecg: norm, col. vessle: 0, exang: no, thal: normal)",(healthy),0.260417,1.0
2,"(fbs: False, slope: flat, col. vessle: 0, exang: no, thal: normal)",(healthy),0.208333,1.0
3,"(oldpeak (-inf, 1.24], restecg: norm, col. vessle: 0, exang: no, thal: normal)",(healthy),0.229167,1.0
4,"(oldpeak (-inf, 1.24], restecg: norm, exang: no, slope: upsloping, thal: normal)",(healthy),0.229167,1.0
5,"(oldpeak (-inf, 1.24], restecg: norm, thalach (149.6, 175.8], exang: no, thal: normal)",(healthy),0.208333,1.0
6,"(fbs: False, oldpeak (-inf, 1.24], restecg: norm, slope: upsloping, thal: normal)",(healthy),0.229167,1.0
7,"(fbs: False, oldpeak (-inf, 1.24], restecg: norm, exang: no, slope: upsloping)",(healthy),0.21875,1.0
8,"(fbs: False, oldpeak (-inf, 1.24], restecg: norm, exang: no, thal: normal)",(healthy),0.291667,1.0
9,"(fbs: False, restecg: norm, exang: no, slope: upsloping, thal: normal)",(healthy),0.229167,1.0


The results differ from those in the paper. One likely reason is that different confidence and support thresholds were used to filter the rules. Almost 75% of the female records are labeled healthy, so many more healthy rules are extracted, while sick rules have much lower support; the paper may therefore have used different filtering criteria for each class. I tried to reproduce this in the code, but could not obtain the same results as in the paper.
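One way to see the effect of this class imbalance is to compare rules against the baseline share of each class. The sketch below uses the female class shares from the output above and the standard definition of lift (confidence divided by the share of the consequent). The helper name `lift` is an assumption for illustration; mlxtend's `association_rules` also returns a `lift` column, which this notebook does not select.

```python
# Class shares for females, taken from the value_counts output above.
p_healthy = 0.739583
p_sick = 0.260417


def lift(confidence, consequent_share):
    """Lift: confidence divided by the baseline share of the consequent."""
    return confidence / consequent_share


# A rule that predicts "sick" can never have support above the share of sick patients.
max_support_sick = p_sick

# Lift of a rule with confidence 1.0 for each class.
lift_healthy = lift(1.0, p_healthy)
lift_sick = lift(1.0, p_sick)

print(f"max support for a sick rule: {max_support_sick:.3f}")
print(f"lift at confidence 1.0 -> healthy: {lift_healthy:.2f}, sick: {lift_sick:.2f}")
```

A healthy rule with confidence 1.0 improves on the baseline by a factor of about 1.35, while a sick rule with the same confidence improves on it by about 3.84. This is one reason a single support threshold treats the two classes very differently.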

From the extracted rules it can be concluded that, for females, asymptomatic chest pain, a reversible defect in the heart muscle, the presence of exercise-induced angina, and fasting blood sugar not above 120 mg/dl are strong indicators of sickness. Conversely, no exercise-induced angina, an upsloping ST segment slope, ST depression induced by exercise relative to rest of up to 1.24, a normal heart status and, as in the sick rules, fasting blood sugar not above 120 mg/dl are strong indicators of a healthy person.

<b>Males</b>

In [20]:
frequent_itemsets = apriori(male_df_transformed, min_support=0.16, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.91)

rules = rules.sort_values(by='confidence', ascending=False)

def rule_filter(df, target_class):
    return df[
        (df['antecedents'].apply(lambda x: len(x) >= 3)) & 
        (df['consequents'] == frozenset([target_class])) &  
        ((df['support'] > 0.16) | (df['confidence'] > 0.95))
    ][['antecedents', 'consequents', 'support', 'confidence']].reset_index(drop=True)

rules_sick = rule_filter(rules, 'sick')
rules_healthy = rule_filter(rules, 'healthy')

print("Rules - sick:")
display(rules_sick)

print("Rules - healthy:")
display(rules_healthy)

Rules - sick:


Unnamed: 0,antecedents,consequents,support,confidence
0,"(slope: flat, cp: asympt, thal: reversable defect)",(sick),0.174129,0.945946
1,"(exang: yes, thal: reversable defect, cp: asympt)",(sick),0.199005,0.930233
2,"(fbs: False, exang: yes, thal: reversable defect, cp: asympt)",(sick),0.174129,0.921053
3,"(slope: flat, cp: asympt, exang: yes)",(sick),0.169154,0.918919
4,"(exang: yes, restecg: hyp, cp: asympt)",(sick),0.169154,0.918919
5,"(fbs: False, cp: asympt, exang: yes, slope: flat)",(sick),0.164179,0.916667


Rules - healthy:


Unnamed: 0,antecedents,consequents,support,confidence
0,"(col. vessle: 0, thal: normal, slope: upsloping)",(healthy),0.189055,0.926829
1,"(oldpeak (-inf, 1.24], col. vessle: 0, thal: normal, slope: upsloping)",(healthy),0.174129,0.921053
2,"(slope: upsloping, col. vessle: 0, thal: normal, exang: no)",(healthy),0.174129,0.921053
3,"(fbs: False, col. vessle: 0, thal: normal, slope: upsloping)",(healthy),0.169154,0.918919
4,"(oldpeak (-inf, 1.24], col. vessle: 0, thal: normal)",(healthy),0.208955,0.913043


Almost the same rules are extracted as in the paper. Note, however, that the paper also used the 'Sex' column in the rules, which could affect the extracted association rules.

As for females, strong indicators that a person has heart disease are asymptomatic chest pain, a reversible defect in the heart muscle, the presence of exercise-induced angina, and a flat ST segment slope. Indicators that a person does not have heart disease are zero colored major vessels, an upsloping ST segment slope, no exercise-induced angina, and a normal heart status.