# Rules of survival

### Mini-project

In this small project you will use the PRISM Rule Learner algorithm to learn some rules about COVID-19 comorbidity factors. Write as much about your findings as possible. You may add external information/additional datasets for an extra-credit.

## 1. Algorithm

Copy your implementation of the correct and tested algorithm in the cell below. You do not need to supply any comments or explanations. 

In [30]:

import numpy as np
import pandas as pd
import math



class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0
        self.coverage_list =[]  # indexes of the rows this rule covers

    def addCondition(self, condition):
        self.conditions.append(condition)

    def setParams(self, accuracy, coverage, coverage_list):
        self.accuracy = accuracy
        self.coverage = coverage
        self.coverage_list = coverage_list


    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)


class Condition:
    def __init__(self, attribute, value, all, true_false=None):
        self.attribute = attribute
        self.value = value
        self.all = all # index of all rows that have this attributes value.
        self.true_false = true_false


    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
            return "{}>={}:{}".format(self.attribute, self.value, self.true_false)


def parse(data):
    # probobly a better way to do this check https://stackoverflow.com/questions/54196959/is-there-any-faster-alternative-to-col-drop-duplicates
    new_df = []
    [new_df.append(pd.DataFrame(data[i].unique(), columns=[i])) for i in data.columns]
    new_df = pd.concat(new_df, axis=1)

    columns_list = data.columns.to_numpy().tolist()

    class_name = columns_list.pop(-1)

    return (new_df,columns_list, class_name)



def compare(rule1, rule2, rule_list):

    if rule1.accuracy > rule2.accuracy:
        rule2 = rule1
        rule_list = [rule2]

    elif rule1.accuracy == rule2.accuracy:
        if rule1.coverage > rule2.coverage:
            rule2 = rule1
            rule_list = [rule2]

        elif rule1.coverage == rule2.coverage:
            rule_list.append(rule1)

    return rule_list

def Copy(x, rule, i ):

    rule.class_label = x.class_label

    for j in range(i+1):
        attribute = x.conditions[j].attribute
        value = x.conditions[j].value
        all = x.conditions[j].all
        true_false = x.conditions[j].true_false

        temp = Condition(attribute,  value, all, true_false)
        rule.addCondition(temp)

    rule.accuracy =x.accuracy
    rule.coverage = x.coverage
    rule.coverage_list = x.coverage_list

    return rule


def makeRule(x, subset, all_subset, truefalse, attribute, value, min_coverage, class_label = None, i = 0):

    numCount = len(subset)
    denCount = len(all_subset)

    if denCount == 0 or numCount < min_coverage:
        return None

    temp_coverage_list = np.array(subset.index)

    if x != None:
        temp_rule = Rule(None)
        temp_rule = Copy(x, temp_rule, i)

    else:
        temp_rule = Rule(class_label)

    temp_cond = Condition(attribute, value, np.array(all_subset.index), truefalse)
    temp_rule.addCondition(temp_cond)
    temp_rule.accuracy = numCount / denCount
    temp_rule.coverage = numCount
    temp_rule.coverage_list = temp_coverage_list

    return temp_rule


def refine_rule(data, column_list, rule_list, un_data, class_name, min_coverage ):


    count = 0
    while(True):

        count+=1

        temp = Rule(None)
        temp.accuracy = -math.inf
        temp.coverage = -math.inf
        real_best_rules = [temp]

        for x in rule_list:
            # This temporarily gets rid of all rows that don't have the attributes of the first conditions.
            new_df = data[data.index.isin(x.conditions[count - 1].all)]

            if len(new_df.index) >= min_coverage:
                Sbest_rules = x
                best_rules = [Sbest_rules]

                for y in column_list:
                    for z in un_data[y]:
                        flag = False

                        if pd.isnull(z):
                            continue

                        #Special cases for if the value is numeric
                        if isinstance(z, int) or isinstance(z, float):
                            flag = True
                            subset = new_df[(new_df[y] >= z) & (new_df[class_name] == x)]
                            all_subset = new_df[new_df[y] >= z]
                            truefalse = True

                        else:
                            subset =  new_df[( new_df[y] == z) & ( new_df[class_name] == x.class_label)]
                            all_subset =  new_df[ new_df[y] == z]
                            truefalse = None

                        temp_rule =  makeRule(x, subset, all_subset, truefalse, y, z, min_coverage, None, count - 1)
                        
                        if temp_rule != None:
                       
                            best_rules = compare(temp_rule, Sbest_rules, best_rules)
                            Sbest_rules = best_rules[0]

                        
                        if flag:

                            subset = new_df[(new_df[y] < z) & (new_df[class_name] == x)]
                            all_subset = new_df[new_df[y] < z]

                            
                            temp_rule =  makeRule(x, subset, all_subset, False, y, z, min_coverage, None, count - 1)
                            
                            if temp_rule != None:

                                best_rules = compare(temp_rule, Sbest_rules, best_rules)
                                Sbest_rules = best_rules[0]

                real_best_rules = compare(best_rules[0], real_best_rules[0], real_best_rules)


        if real_best_rules[0].accuracy == 1 or rule_list[0] == real_best_rules[0]:

            return real_best_rules[0]


        if real_best_rules[0].class_label == None:

            return rule_list[0]

        rule_list = real_best_rules

def find_one_rule(data, un_data, columns_list, class_name, min_accuracy,  min_coverage  ,classAtt = []):

    class_list = un_data[class_name].to_numpy().tolist()

    if classAtt == []:
        classAtt = class_list

    Sbest_rules = Rule(None)
    Sbest_rules.accuracy = -math.inf
    Sbest_rules.coverage = -math.inf
    best_rules = [Sbest_rules]

    for x in classAtt:
        for y in columns_list:
            for z in un_data[y]:

                flag = False

                if pd.isnull(z):
                    continue

                if isinstance(z, int) or isinstance(z, float):
                    flag = True
                    subset = data[(data[y] >= z) & (data[class_name] == x)]
                    all_subset = data[data[y] >= z]
                    truefalse = True

                else:

                    subset = data[(data[y] == z) & (data[class_name] == x)]
                    all_subset = data[data[y] == z]
                    truefalse = None

                temp_rule = makeRule(None, subset, all_subset, truefalse, y, z, min_coverage, x)

                if temp_rule != None:

                    best_rules = compare(temp_rule, Sbest_rules, best_rules)
                    Sbest_rules = best_rules[0]

                
                if flag:

                    subset = data[(data[y] < z) & (data[class_name] == x)]
                    all_subset = data[data[y] < z]

                    temp_rule = makeRule(None, subset, all_subset, False, y, z, min_coverage, x)

                    if temp_rule != None:

                        best_rules = compare(temp_rule, Sbest_rules, best_rules)
                        Sbest_rules = best_rules[0]

    if best_rules[0].accuracy == 1 and best_rules[0].coverage >= min_coverage:
        return best_rules[0]

    rule = refine_rule(data, columns_list, best_rules, un_data,   class_name, min_coverage)

    if rule.accuracy >= min_accuracy and rule.coverage >= min_coverage:
        return rule

    return None


def find_rules(data, un_data, columns_list, class_name, classAtt = [], min_accuracy = 1,  min_coverage = 1):

    rule_list = []
    temp = Rule(None)
    flag = 1
    while(flag):

        temp = find_one_rule(data, un_data, columns_list, class_name, min_accuracy,  min_coverage , classAtt)

        if temp == None:
            break
        rule_list.append(temp)

        data = data[~data.index.isin(temp.coverage_list)]

        if data.empty:
            flag = 0


    return rule_list



## 2. Titanic dataset: the rules of survival

Our very familiar Titanic [dataset](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing).

In [43]:
data_file = "../titanic.csv - titanic.csv"

In [44]:
data = pd.read_csv(data_file)

# take a subset of attributes
data = data[['Pclass', 'Sex', 'Age', 'Survived']]

# drop all columns and rows with missing values
data = data.dropna(how="any")
print("Total rows", len(data))

column_list = data.columns.to_numpy().tolist()
print("Columns:", column_list)


Total rows 714
Columns: ['Pclass', 'Sex', 'Age', 'Survived']


In [38]:
rule_list = []


(unique, columns_list, class_name) = parse(data)

rule_list = find_rules(data, unique, columns_list, class_name, [], 0.7, 30 )

for rule in rule_list[:10]:
    print(rule)

If [Sex=male] then 0.0. Coverage:360, accuracy: 0.7947019867549668
If [Sex=male] then 1.0. Coverage:93, accuracy: 1.0
If [Pclass>=2.0:False] then 1.0. Coverage:82, accuracy: 0.9647058823529412
If [Pclass>=3.0:False] then 1.0. Coverage:68, accuracy: 0.8831168831168831


In [46]:
# we can set different accuracy thresholds
# here we can reorder class labels - to first learn the rules with class label "survived".
    

rule_list = [0.0]


(unique, columns_list, class_name) = parse(data)

rule_list = find_rules(data, unique, columns_list, class_name, rule_list, 0.7, 30 )

for rule in rule_list[:10]:
    print(rule)

If [Sex=male] then 0.0. Coverage:360, accuracy: 0.7947019867549668


## 3. Coronavirus: symptoms and outcome

Coronavirus [dataset](https://drive.google.com/file/d/1uVd09ekR1ArLrA8qN-Xtu4l-FFbmetVy/view?usp=sharing) (preprocessed as outlined [here](rules_motivation.ipynb)).

In [51]:
data_file = "covid_categorical_good.csv"

In [52]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')

Most accurate rules will have class label "alive". There could be too many rules, and we might never get to the class label "dead" if we rank them by accuracy. 

If we want to see which combination of attributes leads to "dead", we might want to run the algorithm with only this class label and set the lower accuracy threshold.

Remove the _age_ attribute and run your algorithm with parameters shown below.

In [41]:
# We really want to learn first what makes covid deadly
class_labels = ["dead"]

data.drop(["age"], axis = 1, inplace = True)

(unique, columns_list, class_name) = parse(data)

rule_list = find_rules(data, unique, columns_list, class_name, class_labels, 0.6, 30 )
for rule in rule_list[:20]:
    print(rule)

If [renal_chronic=yes, diabetes=yes, cardiovascular=yes, obesity=no, sex=male, imm_supr=no, hypertension=yes, asthma=no] then dead. Coverage:46, accuracy: 0.6571428571428571


In [35]:
# We really want to learn first what makes covid deadly
class_labels = ["dead"]


(unique, columns_list, class_name) = parse(data)

rule_list = find_rules(data, unique, columns_list, class_name, class_labels, 0.6, 30 )
for rule in rule_list[:20]:
    print(rule)

If [age>=80:True, renal_chronic=yes, diabetes=yes, hypertension=yes, imm_supr=no, sex=female, tobacco=no] then dead. Coverage:32, accuracy: 0.6037735849056604
If [age>=80:True, sex=male, obesity=yes, diabetes=yes, tobacco=no, imm_supr=no, cardiovascular=no, renal_chronic=no] then dead. Coverage:40, accuracy: 0.625


Now try on both classes and for the entire dataset including _age_. Collect top 20 most accurate rules.

In [None]:
# This may take some time to run (took 12 min on my computer - what about your implementation?)
(unique, columns_list, class_name) = parse(data)
rule_list = find_rules(data, unique, columns_list, class_name, [], 0.6, 30 )
for rule in rule_list[:20]:
    print(rule)

The following rules are from running my program on my main machine, because the VM dosn't seem capable of running it on such a big dataset. Even on my main machine it took about four hours to run. 

Results:

If [age>=26:False, tobacco=yes, asthma=yes] then alive. Coverage:47, accuracy: 1.0

If [age>=26:False, tobacco=yes, sex=female, obesity=yes] then alive. Coverage:83, accuracy: 1.0

If [age>=26:False, tobacco=yes, obesity=no, hypertension=no, sex=female] then alive. Coverage:273, accuracy: 0.9963503649635036

If [age>=26:False, asthma=yes, obesity=no, sex=female] then alive. Coverage:246, accuracy: 1.0

If [age>=29:False, hypertension=no, sex=female, tobacco=yes, obesity=yes] then alive. Coverage:79, accuracy: 1.0

If [age>=26:False, hypertension=no, imm_supr=no, sex=female, obesity=no, diabetes=no, renal_chronic=no, cardiovascular=no, tobacco=no] then alive. Coverage:7734, accuracy: 0.9949826321883443

If [age>=30:False, tobacco=yes, asthma=yes] then alive. Coverage:53, accuracy: 1.0

If [age>=30:False, tobacco=yes, obesity=no, hypertension=no, imm_supr=no, sex=female] then alive. Coverage:333, accuracy: 0.9970059880239521

If [age>=30:False, hypertension=no, obesity=no, tobacco=yes, sex=male, renal_chronic=no, imm_supr=no] then alive. Coverage:1392, accuracy: 0.9949964260185847

If [age>=30:False, hypertension=no, obesity=no, asthma=yes, renal_chronic=no, sex=male] then alive. Coverage:380, accuracy: 0.9921671018276762

If [age>=30:False, hypertension=no, obesity=no, imm_supr=no, diabetes=no, renal_chronic=no, tobacco=no, sex=male, asthma=no, copd=no, cardiovascular=no] then alive. Coverage:13192, accuracy: 0.9903903903903903

If [age>=35:False, sex=female, hypertension=no, tobacco=yes, obesity=no, imm_supr=no] then alive. Coverage:469, accuracy: 0.9936440677966102

If [age>=35:False, sex=female, hypertension=no, obesity=no, asthma=yes, diabetes=no] then alive. Coverage:485, accuracy: 0.9918200408997955

If [age>=35:False, sex=female, hypertension=no, obesity=no, imm_supr=no, diabetes=no, renal_chronic=no, asthma=no, tobacco=no, cardiovascular=no, copd=no] then alive. Coverage:14126, accuracy: 0.9912982456140351

If [age>=38:False, tobacco=yes, diabetes=no, sex=female, obesity=yes, hypertension=yes] then alive. Coverage:32, accuracy: 1.0

If [age>=38:False, tobacco=yes, diabetes=no, sex=female, obesity=yes, asthma=no] then alive. Coverage:272, accuracy: 0.9890909090909091

If [age>=38:False, diabetes=no, tobacco=yes, obesity=no, asthma=yes] then alive. Coverage:33, accuracy: 1.0

If [age>=38:False, diabetes=no, tobacco=yes, obesity=no, hypertension=no, sex=female, imm_supr=no] then alive. Coverage:237, accuracy: 0.983402489626556

If [age>=38:False, diabetes=no, renal_chronic=no, sex=female, obesity=yes, cardiovascular=yes, hypertension=no] then alive. Coverage:40, accuracy: 1.0

If [age>=38:False, diabetes=no, renal_chronic=no, sex=female, obesity=yes, tobacco=no, imm_supr=no, cardiovascular=no, hypertension=yes, asthma=no] then alive. Coverage:238, accuracy: 0.9834710743801653


## 4. Discussion

Write here a discussion about the rules that you have learned from both datasets. 

Did any of these rules surprise you?

Do you have a meaningful logical explanation for these rules?

What additional research is needed to understand the meaning of your findings?

Copyright &copy; 2022 Marina Barsky. All rights reserved.

In this project we implemented the PRiSM algorithm. PRiSM is an algorithm for finding rules from a given dataset. We care about rules beacsue it can be very hard to visualize a decision tree so learinig anything from it is nearly impossable. With rules on the other hand we have clear pieces of knowledge we can glean from the data, that can give us useful insight. 

We can also use rules to classify objects by going down the list of rules and seeing if our object match any of the rules. But we don't really use it for classification as it is less reliable for that, and much more fit for learning localized facts.  

The actual algorithm works by divide and separation, where we find the best rule, remove all the rows that rule covers, and then find the next best rule on the rest of the data. More details on the algorithm can be found in the implementation file _PRiSM_algorithm_  


For the titanic data all the rules were pretty expected. The first rule is that 79 percent of males died which makes sense. The second rule was abvious becaseu all its really saying is that 21 percent of males survived. The thid rule says that 96.5 percent of the women in first class survived which again makes sense. And the last rule is that 88 percent of women in second class survived, which is mostly expected. 


The covid-19 data when filtered for "dead" rules, was suprising in that there werent any good rules. There was only one rule with a 65 percent accuracy and 46 coverage, that associated death with some of the more severe probloms like diabetes. If we did not take out age then the results were basically that over the age of 88 with some other problems resulted in death, which again makes sense. 

The unfiltered covid-19 data had a little more interesting result. The main indicator for living seems to be age as the top 20 rules are for age less then 38. This makes sense and is a known fact about covid. More interesting is that yes to smoking appears in 12 out of the top 20 rules, including the top three rules. This is very suprising as we would naturally assume that smoking makes covid worse not better. 

Of course more thought into what data distribution we started out with is needed to really analyze these rules. For eacmple if most of the people were smokers then it would make more sense that smoking is appears in many rules. Additinally we would need to look at data from patients with the same attributes, but without covid, to varify that the survived status is at all related to covid. 

