In [1]:
import pandas as pd  # pd is used below; import it explicitly instead of relying on the star import
from rule_learner import *

In [2]:
data_file = "../data_sets/covid_categorical_good.csv"

data = pd.read_csv(data_file)
data = data.dropna(how="any")
print("Columns:",data.columns)

Columns: Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')


First, let's try without age (numeric attributes are expensive to process).

In [3]:
data_categorical = data.drop(columns=['age'])
column_list = data_categorical.columns.to_numpy().tolist()
class_labels = data_categorical[column_list[-1]].unique()

print(class_labels)

['alive' 'dead']


I will learn rules for each class separately, because the accuracy (and also the coverage) that can be reached for "alive" and for "dead" is very different.

In [4]:
import time
from operator import attrgetter

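# first dead
# (judging by the printed rules, 30 and 0.6 look like the minimum coverage
# and minimum accuracy thresholds; this is an assumption about learn_rules)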
start_time = time.time()
rules = learn_rules(column_list, data_categorical, ["dead"], 30, 0.6)
print("--- deadly rules in %d minutes ---" % ((time.time() - start_time) // 60))

# sort rules by accuracy, then by coverage, both descending
rules.sort(key=attrgetter('accuracy', 'coverage'), reverse=True)
for rule in rules[:20]:
    print(rule)

--- deadly rules in 0 minutes ---
If [renal_chronic=yes, diabetes=yes, cardiovascular=yes, obesity=no, sex=male, imm_supr=no, hypertension=yes, asthma=no] then dead. Coverage:70, accuracy: 0.6571428571428571
If [renal_chronic=yes, diabetes=yes, obesity=no, copd=yes, tobacco=no, hypertension=yes, imm_supr=no, asthma=no, sex=female] then dead. Coverage:31, accuracy: 0.6129032258064516
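
The coverage and accuracy printed with each rule can be checked by hand. The sketch below assumes that coverage is the number of rows matching all conditions of a rule and that accuracy is the fraction of those rows carrying the predicted class. The printed numbers are consistent with this reading (0.657... is 46/70), but it is an assumption about `rule_learner`, not its actual code. The helper `rule_stats` and the toy data are made up for illustration.

```python
import pandas as pd

def rule_stats(df, conditions, class_column, class_label):
    """Return (coverage, accuracy) of a rule on df.

    coverage: number of rows that satisfy every attribute=value condition
    accuracy: fraction of the covered rows whose class equals class_label
    """
    mask = pd.Series(True, index=df.index)
    for column, value in conditions.items():
        mask &= df[column] == value
    covered = df[mask]
    coverage = len(covered)
    if coverage == 0:
        return 0, 0.0
    accuracy = (covered[class_column] == class_label).sum() / coverage
    return coverage, float(accuracy)

# Toy data, invented for this example (not taken from the COVID data set)
toy = pd.DataFrame({
    "diabetes":      ["yes", "yes",   "yes",  "no",    "yes"],
    "renal_chronic": ["yes", "yes",   "no",   "yes",   "yes"],
    "outcome":       ["dead", "alive", "dead", "alive", "dead"],
})

coverage, accuracy = rule_stats(
    toy, {"diabetes": "yes", "renal_chronic": "yes"}, "outcome", "dead"
)
print("Coverage:", coverage, "accuracy:", accuracy)
```

The same function could be pointed at `data_categorical` with the conditions of a printed rule to confirm its numbers.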


In [5]:
# now alive
start_time = time.time()
rules = learn_rules(column_list, data_categorical, ["alive"], 60, 0.9)
print("---alive  rules in %d minutes ---" % ((time.time() - start_time)//60))

# sort rules by accuracy, then by coverage, both descending
rules.sort(key=attrgetter('accuracy', 'coverage'), reverse=True)
for rule in rules[:20]:
    print(rule)

---alive  rules in 0 minutes ---
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, asthma=yes, copd=no, imm_supr=no, renal_chronic=no] then alive. Coverage:88, accuracy: 0.9886363636363636
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, copd=no, imm_supr=no, renal_chronic=no, asthma=no] then alive. Coverage:2351, accuracy: 0.9766056997022544
If [hypertension=no, sex=female, diabetes=no, asthma=yes, obesity=no, imm_supr=no, copd=no, cardiovascular=no, tobacco=no] then alive. Coverage:1686, accuracy: 0.9673784104389087
If [hypertension=no, sex=female, diabetes=no, obesity=no, copd=no, imm_supr=no, renal_chronic=no, cardiovascular=no, asthma=no, tobacco=no] then alive. Coverage:54563, accuracy: 0.9620255484485823
If [hypertension=no, asthma=yes, diabetes=no, copd=no, imm_supr=no, sex=female, tobacco=no, obesity=yes] then alive. Coverage:531, accuracy: 0.9566854990583804
If [asthma=yes, hypertension=no, obesity=yes, sex=male, tobacco=yes, renal_c

However, age is probably a very important factor in determining the COVID outcome,
so we will repeat the experiment with age included (on the original data).

In [6]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")

column_list = data.columns.to_numpy().tolist()

In [7]:
# first dead
start_time = time.time()
rules = learn_rules(column_list, data, ["dead"], 30, 0.5)
print("--- deadly rules with age in %d minutes ---" % ((time.time() - start_time) // 60))

# sort rules by accuracy, then by coverage, both descending
rules.sort(key=attrgetter('accuracy', 'coverage'), reverse=True)
for rule in rules[:20]:
    print(rule)

--- deadly rules with age in 1 minutes ---
If [age>=78, renal_chronic=yes, diabetes=yes, tobacco=no, hypertension=yes, obesity=no, imm_supr=no, asthma=no] then dead. Coverage:40, accuracy: 0.7
If [age>=80, renal_chronic=yes, diabetes=yes, hypertension=yes, sex=female, imm_supr=no, tobacco=no, asthma=no, cardiovascular=no] then dead. Coverage:36, accuracy: 0.6388888888888888
If [age>=80, sex=male, obesity=yes, diabetes=yes, tobacco=no, imm_supr=no, cardiovascular=no, renal_chronic=no] then dead. Coverage:64, accuracy: 0.625
If [age>=80, renal_chronic=yes, hypertension=yes, cardiovascular=no, sex=male, diabetes=no, obesity=no] then dead. Coverage:44, accuracy: 0.5909090909090909
If [age>=80, sex=male, hypertension=yes, tobacco=no, copd=yes, diabetes=yes] then dead. Coverage:36, accuracy: 0.5833333333333334
If [age>=80, sex=male, hypertension=yes, copd=no, diabetes=no, cardiovascular=yes, renal_chronic=no, obesity=no, tobacco=no, imm_supr=no] then dead. Coverage:81, accuracy: 0.5802469135
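
The rules above contain threshold conditions such as `age>=78`, which hints at why numeric attributes are expensive. As a rough sketch, assume the learner tries one equality test per distinct value of a categorical attribute, and a `<` and a `>=` test at every distinct value (except the smallest) of a numeric attribute. This counting scheme and the helper `count_candidate_conditions` are assumptions for illustration, not the behaviour of `rule_learner`.

```python
import pandas as pd

def count_candidate_conditions(df, class_column):
    """Count candidate conditions per attribute under the assumed scheme:
    categorical: one equality test per distinct value;
    numeric: a '<' and a '>=' test at each distinct value except the smallest.
    """
    counts = {}
    for column in df.columns:
        if column == class_column:
            continue
        n_values = df[column].nunique()
        if pd.api.types.is_numeric_dtype(df[column]):
            counts[column] = 2 * (n_values - 1)
        else:
            counts[column] = n_values
    return counts

# Toy data, invented for this example
toy = pd.DataFrame({
    "sex":     ["male", "female", "male", "female"],
    "age":     [25, 40, 63, 80],
    "outcome": ["alive", "alive", "dead", "dead"],
})

counts = count_candidate_conditions(toy, "outcome")
print(counts)
```

Under this assumption a yes/no attribute contributes two candidates, while an age column with many distinct values contributes far more, and each candidate has to be evaluated against the data.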

In [8]:
# now alive
start_time = time.time()
rules = learn_rules(column_list, data, ["alive"], 60, 0.9)
print("---alive  rules with age in %d minutes ---" % ((time.time() - start_time)//60))

# sort rules by accuracy, then by coverage, both descending
rules.sort(key=attrgetter('accuracy', 'coverage'), reverse=True)
for rule in rules[:20]:
    print(rule)

---alive  rules with age in 14 minutes ---
If [age<29, hypertension=no, sex=female, tobacco=yes, imm_supr=no] then alive. Coverage:331, accuracy: 1.0
If [age<26, asthma=yes, obesity=no, sex=female] then alive. Coverage:247, accuracy: 1.0
If [age<36, hypertension=no, sex=female, obesity=no, imm_supr=no, diabetes=no, asthma=yes] then alive. Coverage:102, accuracy: 1.0
If [age<30, hypertension=no, obesity=no, sex=female, imm_supr=no, tobacco=yes] then alive. Coverage:96, accuracy: 1.0
If [age<26, tobacco=yes, sex=female, obesity=yes] then alive. Coverage:87, accuracy: 1.0
If [age<30, obesity=yes, diabetes=no, sex=female, hypertension=yes] then alive. Coverage:84, accuracy: 1.0
If [age<34, obesity=no, hypertension=no, sex=female, tobacco=yes, imm_supr=no] then alive. Coverage:363, accuracy: 0.9972451790633609
If [age<26, tobacco=yes, obesity=no, hypertension=no, renal_chronic=no, imm_supr=no, sex=male] then alive. Coverage:703, accuracy: 0.9971550497866287
If [age<26, tobacco=yes, sex=fema
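
The ordering of the rules with accuracy 1.0 (coverage 331, 247, 102, ...) follows from the sort key: `attrgetter('accuracy', 'coverage')` returns a tuple, so ties in accuracy are broken by coverage. A small self-contained illustration, with a made-up `Rule` tuple standing in for the objects returned by `learn_rules`:

```python
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for the rule objects; only the two sort fields matter here
Rule = namedtuple("Rule", ["name", "accuracy", "coverage"])

rules_demo = [
    Rule("a", 0.95, 100),
    Rule("b", 1.0, 84),
    Rule("c", 1.0, 331),
    Rule("d", 0.95, 500),
]

# The key is the tuple (accuracy, coverage); reverse=True sorts both descending
rules_demo.sort(key=attrgetter("accuracy", "coverage"), reverse=True)
order = [rule.name for rule in rules_demo]
print(order)  # ['c', 'b', 'd', 'a']
```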

It does seem that age plays the most important role in determining COVID outcomes. Also, judging by the rules above, tobacco use seems to help young people survive.

Copyright © 2022 Marina Barsky. All rights reserved.