In [15]:
import numpy as np
import pandas as pd
import math




Each rule consists of an antecedent (left-hand side) and a consequent (right-hand side). The LHS holds one or more conditions joined with AND, and the RHS is a class label. A rule also stores its accuracy and coverage, plus a list of the indexes of the rows it covers.

In [2]:
class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0
        self.coverage_list =[]  # indexes of the rows this rule covers

    def addCondition(self, condition):
        self.conditions.append(condition)

    def setParams(self, accuracy, coverage, coverage_list):
        self.accuracy = accuracy
        self.coverage = coverage
        self.coverage_list = coverage_list


    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)


The list of conditions contains several objects of class _Condition_. 

Each condition includes the _attribute name_, its _value_, and a list of indexes, _all_, of all rows that have that value.

If the _value_ is numeric, then the condition also includes an additional field `true_false`, which means the following:
- *if true_false == True then values are >= value* 
- *if true_false == False then values are < value*
- If *true_false is None*, then this condition is simply of form *categorical attribute = value*.

In [3]:
class Condition:
    def __init__(self, attribute, value, all, true_false=None):
        self.attribute = attribute
        self.value = value
        self.all = all  # indexes of all rows that have this attribute value
        self.true_false = true_false


    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
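            # Show the comparison actually applied: >= when true_false is True, < when it is False
            op = ">=" if self.true_false else "<"
            return "{}{}{}".format(self.attribute, op, self.value)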
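
To make the three `true_false` cases concrete, the sketch below turns a condition into a pandas boolean mask. The helper `condition_mask` and the toy dataframe are hypothetical and are not part of the notebook, which builds these masks inline.

```python
import pandas as pd

def condition_mask(df, attribute, value, true_false=None):
    """Hypothetical helper: boolean mask of the rows satisfying one condition."""
    if true_false is None:         # categorical: attribute == value
        return df[attribute] == value
    if true_false:                 # numeric: attribute >= value
        return df[attribute] >= value
    return df[attribute] < value   # numeric: attribute < value

# Made-up data, only for illustration.
toy = pd.DataFrame({"age": [20, 35, 50], "astigmatism": ["no", "yes", "no"]})

print(condition_mask(toy, "astigmatism", "no").tolist())  # [True, False, True]
print(condition_mask(toy, "age", 35, True).tolist())      # [False, True, True]
print(condition_mask(toy, "age", 35, False).tolist())     # [True, False, False]
```

The index of the rows where the mask is true is what a condition would keep in its `all` field.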




First we call the _parse_ function to extract some information we need from the data. _parse_ accepts a dataframe and returns a dataframe that holds only the unique values of each column.
It also returns the names of all the attribute columns and the name of the class column.

In [4]:

def parse(data):
    # There is probably a better way to do this; see https://stackoverflow.com/questions/54196959/is-there-any-faster-alternative-to-col-drop-duplicates
    # Build one single-column dataframe of unique values per column;
    # concat pads the shorter columns with NaN.
    new_df = [pd.DataFrame(data[i].unique(), columns=[i]) for i in data.columns]
    new_df = pd.concat(new_df, axis=1)

    columns_list = data.columns.to_numpy().tolist()

    class_name = columns_list.pop(-1)

    return (new_df,columns_list, class_name)
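
As a small illustration of what _parse_ returns, the sketch below runs it on a made-up dataframe. The function is restated compactly so the snippet runs on its own, and the column names are assumptions chosen for the example. Columns with fewer unique values end up padded with NaN, which is presumably why the search loops later skip null values.

```python
import pandas as pd

def parse(data):
    # Same logic as the notebook's parse, restated so the example is self-contained.
    unique_cols = [pd.DataFrame(data[c].unique(), columns=[c]) for c in data.columns]
    columns_list = data.columns.to_numpy().tolist()
    class_name = columns_list.pop(-1)  # the last column is treated as the class
    return pd.concat(unique_cols, axis=1), columns_list, class_name

# Made-up data: "age" has three unique values, the other columns have two.
toy = pd.DataFrame({
    "age":    ["young", "medium", "old", "old"],
    "tears":  ["reduced", "normal", "normal", "reduced"],
    "lenses": ["none", "soft", "none", "none"],
})

un_data, columns_list, class_name = parse(toy)
print(columns_list)                  # ['age', 'tears']
print(class_name)                    # lenses
print(un_data.shape)                 # (3, 3)
print(pd.isnull(un_data["tears"][2]))  # True: padded because "tears" has only two unique values
```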

We use this function to decide whether one rule is better than another
and to update a list of rules based on the result. The function accepts two rules and a list of rules.
It then compares the accuracy and coverage of the rules as follows:
if rule one has better accuracy, or equal accuracy and better coverage, than rule two,
it replaces the list with one that holds only rule one. If the rules are equal in accuracy and coverage,
it appends rule one to the list of rules. Finally, if rule one is worse than rule two, it does nothing.
_compare_ then returns the updated list of rules.

In [5]:


def compare(rule1, rule2, rule_list):

    if rule1.accuracy > rule2.accuracy:
        rule2 = rule1
        rule_list = [rule2]

    elif rule1.accuracy == rule2.accuracy:
        if rule1.coverage > rule2.coverage:
            rule2 = rule1
            rule_list = [rule2]

        elif rule1.coverage == rule2.coverage:
            rule_list.append(rule1)

    return rule_list
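
The three outcomes of _compare_ can be traced with a few stand-in objects. In this sketch `SimpleNamespace` objects play the role of rules (only `accuracy` and `coverage` matter to the comparison), and _compare_ is restated in a slightly condensed form so the snippet is self-contained.

```python
from types import SimpleNamespace

def compare(rule1, rule2, rule_list):
    # Restated from the notebook so the example runs on its own.
    if rule1.accuracy > rule2.accuracy:
        return [rule1]
    if rule1.accuracy == rule2.accuracy:
        if rule1.coverage > rule2.coverage:
            return [rule1]
        if rule1.coverage == rule2.coverage:
            rule_list.append(rule1)
    return rule_list

# Hypothetical stand-ins for Rule objects.
a = SimpleNamespace(name="a", accuracy=0.8, coverage=5)
b = SimpleNamespace(name="b", accuracy=0.8, coverage=7)
c = SimpleNamespace(name="c", accuracy=0.8, coverage=7)
d = SimpleNamespace(name="d", accuracy=0.5, coverage=20)

best = [a]
best = compare(b, best[0], best)  # same accuracy, more coverage: b replaces a
best = compare(c, best[0], best)  # full tie: c is appended
best = compare(d, best[0], best)  # lower accuracy: list unchanged
print([r.name for r in best])     # ['b', 'c']
```

Note that accuracy always dominates: `d` covers far more rows than `b`, but it is still rejected.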

We use this function to copy one rule into another.
It accepts the rule to copy from, an empty rule,
and the index _i_ of the last condition to copy (conditions 0 through _i_ are copied). It then returns the new rule with the data of the old rule.


In [6]:
def Copy(x, rule, i ):

    rule.class_label = x.class_label

    for j in range(i+1):
        attribute = x.conditions[j].attribute
        value = x.conditions[j].value
        all = x.conditions[j].all
        true_false = x.conditions[j].true_false

        temp = Condition(attribute,  value, all, true_false)
        rule.addCondition(temp)

    rule.accuracy =x.accuracy
    rule.coverage = x.coverage
    rule.coverage_list = x.coverage_list

    return rule
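
The sketch below shows the effect of _Copy_ on a small hand-built rule: the copy gets its own list of new condition objects, so extending the copy leaves the original untouched. The classes and the function are restated in reduced form so the snippet runs on its own, and the attribute values are made up.

```python
class Condition:
    def __init__(self, attribute, value, all, true_false=None):
        self.attribute, self.value = attribute, value
        self.all, self.true_false = all, true_false

class Rule:
    def __init__(self, class_label):
        self.conditions = []
        self.class_label = class_label
        self.accuracy = 0
        self.coverage = 0
        self.coverage_list = []

def Copy(x, rule, i):
    # Restated from the notebook: copy the class label, conditions 0..i, and the statistics.
    rule.class_label = x.class_label
    for j in range(i + 1):
        c = x.conditions[j]
        rule.conditions.append(Condition(c.attribute, c.value, c.all, c.true_false))
    rule.accuracy, rule.coverage, rule.coverage_list = x.accuracy, x.coverage, x.coverage_list
    return rule

original = Rule("soft")
original.conditions.append(Condition("astigmatism", "no", [0, 1, 2]))
original.conditions.append(Condition("tears", "normal", [1, 2]))

clone = Copy(original, Rule(None), 1)  # i = 1 copies conditions 0 and 1
clone.conditions.append(Condition("age", "young", [1]))

print(len(original.conditions))                       # 2: the original is not affected
print(len(clone.conditions))                          # 3
print(clone.conditions[0] is original.conditions[0])  # False: conditions are new objects
```

The index lists themselves (`all` and `coverage_list`) are shared by reference rather than duplicated, which is harmless as long as they are never modified in place.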


Helper function that creates a new rule for a new condition.
It accepts a rule to start from, a dataframe of the rows covered by the new rule,
a dataframe of all rows with the given attribute value, the `true_false` flag of the condition, the attribute, the attribute value,
the minimum coverage allowed, the class label to use when x is None, and the index of the last condition
to copy from x. It then creates a rule from these values and returns it. If the rule's coverage falls
below min_coverage, it returns None.

In [7]:

def makeRule(x, subset, all_subset, truefalse, attribute, value, min_coverage, class_label = None, i = 0):

    numCount = len(subset)
    denCount = len(all_subset)

    if denCount == 0 or numCount < min_coverage:
        return None

    temp_coverage_list = np.array(subset.index)

    if x != None:
        temp_rule = Rule(None)
        temp_rule = Copy(x, temp_rule, i)

    else:
        temp_rule = Rule(class_label)

    temp_cond = Condition(attribute, value, np.array(all_subset.index), truefalse)
    temp_rule.addCondition(temp_cond)
    temp_rule.accuracy = numCount / denCount
    temp_rule.coverage = numCount
    temp_rule.coverage_list = temp_coverage_list

    return temp_rule
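
The two numbers that _makeRule_ stores can be worked out by hand on a tiny example. In the sketch below (made-up data, hypothetical column names) the candidate rule is "if tears = normal then lenses = soft". Coverage is the number of matching rows that also carry the class label, and accuracy is that count divided by the number of all rows matching the condition.

```python
import pandas as pd

# Made-up data, only for illustration.
toy = pd.DataFrame({
    "tears":  ["reduced", "normal", "normal", "normal", "reduced"],
    "lenses": ["none",    "soft",   "hard",   "soft",   "none"],
})

# Rows matching the condition (all_subset in makeRule).
all_subset = toy[toy["tears"] == "normal"]
# Rows matching the condition that also have the class label (subset in makeRule).
subset = toy[(toy["tears"] == "normal") & (toy["lenses"] == "soft")]

coverage = len(subset)                    # numCount
accuracy = len(subset) / len(all_subset)  # numCount / denCount
print(coverage, round(accuracy, 3))       # 2 0.667
print(subset.index.tolist())              # [1, 3]: what would go into coverage_list
```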


Once we have our best single-condition rules, and their accuracy is less than one,
we use this function to refine the rules as much as possible and return the best one.

The function works by running a loop that finds the best possible condition to add in order to increase accuracy.
Every iteration of the loop adds a new condition, and the loop stops only when we either reach accuracy one
or fail to add a condition in an iteration.

The function accepts the data, the names of the attribute columns, the list of rules we want to refine,
a dataframe with all the unique attribute values, the class name, and the minimum coverage.
It returns the first of the best rules found.


In [8]:

def refine_rule(data, column_list, rule_list, un_data, class_name, min_coverage ):


    count = 0
    while(True):

        count+=1

        temp = Rule(None)
        temp.accuracy = -math.inf
        temp.coverage = -math.inf
        real_best_rules = [temp]

        for x in rule_list:
            # Keep only the rows matched by the conditions added so far
            # (the index list stored on the most recently added condition).
            new_df = data[data.index.isin(x.conditions[count - 1].all)]

            if len(new_df.index) >= min_coverage:
                Sbest_rules = x
                best_rules = [Sbest_rules]

                for y in column_list:
                    for z in un_data[y]:
                        flag = False

                        if pd.isnull(z):
                            continue

                        #Special cases for if the value is numeric
                        if isinstance(z, int) or isinstance(z, float):
                            flag = True
                            # Compare against the rule's class label, not the Rule object itself
                            subset = new_df[(new_df[y] >= z) & (new_df[class_name] == x.class_label)]
                            all_subset = new_df[new_df[y] >= z]
                            truefalse = True

                        else:
                            subset =  new_df[( new_df[y] == z) & ( new_df[class_name] == x.class_label)]
                            all_subset =  new_df[ new_df[y] == z]
                            truefalse = None

                        temp_rule =  makeRule(x, subset, all_subset, truefalse, y, z, min_coverage, None, count - 1)

                        if temp_rule != None:

                            best_rules = compare(temp_rule, Sbest_rules, best_rules)
                            Sbest_rules = best_rules[0]

                        else:
                            if flag:

                                # Compare against the rule's class label, not the Rule object itself
                                subset = new_df[(new_df[y] < z) & (new_df[class_name] == x.class_label)]
                                all_subset = new_df[new_df[y] < z]

                                temp_rule =  makeRule(x, subset, all_subset, False, y, z, min_coverage, None, count - 1)
                                if temp_rule == None:

                                    continue

                                best_rules = compare(temp_rule, Sbest_rules, best_rules)
                                Sbest_rules = best_rules[0]

                real_best_rules = compare(best_rules[0], real_best_rules[0], real_best_rules)


        if real_best_rules[0].accuracy == 1 or rule_list[0] == real_best_rules[0]:

            return real_best_rules[0]


        if real_best_rules[0].class_label == None:

            return rule_list[0]

        rule_list = real_best_rules
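
The idea behind refinement, adding a condition to raise accuracy, can be seen on a toy dataset. This is a minimal sketch rather than the notebook's implementation: the helper `accuracy_coverage` is hypothetical, it handles only categorical equality tests, and it assumes at least one row matches the conditions.

```python
import pandas as pd

# Made-up data, only for illustration.
toy = pd.DataFrame({
    "astigmatism": ["no", "no", "no", "yes", "yes"],
    "tears":       ["normal", "normal", "reduced", "normal", "reduced"],
    "lenses":      ["soft", "soft", "none", "hard", "none"],
})

def accuracy_coverage(df, conditions, class_name, class_label):
    """Hypothetical helper: score a conjunction of (attribute, value) equality tests."""
    mask = pd.Series(True, index=df.index)
    for attribute, value in conditions:
        mask &= df[attribute] == value          # AND the conditions together
    matched = df[mask]                          # assumed non-empty in this sketch
    correct = int((matched[class_name] == class_label).sum())
    return correct / len(matched), correct

acc1, cov1 = accuracy_coverage(toy, [("astigmatism", "no")], "lenses", "soft")
acc2, cov2 = accuracy_coverage(
    toy, [("astigmatism", "no"), ("tears", "normal")], "lenses", "soft"
)
print(round(acc1, 3), cov1)  # 0.667 2
print(round(acc2, 3), cov2)  # 1.0 2
```

Here the second condition removes the one misclassified row, so accuracy reaches one without losing coverage, which is the stopping condition of the loop above.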


This function implements the algorithm that finds the best rule for a given dataset. It works by iterating over all possible classes and attribute values to find the single-condition rule with the best accuracy and coverage.
Then, if that rule does not have accuracy one, it calls _refine_rule_ to add conditions and get the best possible rule.

The function accepts the data, a dataframe with all the unique attribute values, the names of the attribute columns,
the class name, the minimum allowed accuracy and coverage, and the class labels we care about.

If the resulting rule meets the minimum accuracy and coverage, the function returns the rule;
if not, it returns None.

In [9]:

def find_one_rule(data, un_data, columns_list, class_name, min_accuracy,  min_coverage  ,classAtt = []):

    class_list = un_data[class_name].to_numpy().tolist()

    if classAtt == []:
        classAtt = class_list

    Sbest_rules = Rule(None)
    Sbest_rules.accuracy = -math.inf
    Sbest_rules.coverage = -math.inf
    best_rules = [Sbest_rules]

    for x in classAtt:
        for y in columns_list:
            for z in un_data[y]:

                flag = False

                if pd.isnull(z):
                    continue

                if isinstance(z, int) or isinstance(z, float):
                    flag = True
                    subset = data[(data[y] >= z) & (data[class_name] == x)]
                    all_subset = data[data[y] >= z]
                    truefalse = True

                else:

                    subset = data[(data[y] == z) & (data[class_name] == x)]
                    all_subset = data[data[y] == z]
                    truefalse = None

                temp_rule = makeRule(None, subset, all_subset, truefalse, y, z, min_coverage, x)

                if temp_rule != None:

                    best_rules = compare(temp_rule, Sbest_rules, best_rules)
                    Sbest_rules = best_rules[0]

                else:
                    if flag:

                        subset = data[(data[y] < z) & (data[class_name] == x)]
                        all_subset = data[data[y] < z]

                        temp_rule = makeRule(None, subset, all_subset, False, y, z, min_coverage, x)

                        if temp_rule == None:
                            continue

                        best_rules = compare(temp_rule, Sbest_rules, best_rules)
                        Sbest_rules = best_rules[0]

    if best_rules[0].accuracy == 1 and best_rules[0].coverage >= min_coverage:
        return best_rules[0]

    rule = refine_rule(data, columns_list, best_rules, un_data,   class_name, min_coverage)

    if rule.accuracy >= min_accuracy and rule.coverage >= min_coverage:
        return rule

    return None
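
The single-condition search at the start of _find_one_rule_ can be sketched as three nested loops over class labels, attributes, and attribute values. The version below is a simplified illustration under several assumptions: it handles categorical attributes only, uses plain tuples instead of `Rule` objects, and keeps only the first best candidate, whereas the notebook keeps every tied rule in a list.

```python
import pandas as pd

# Made-up data, only for illustration.
toy = pd.DataFrame({
    "astigmatism": ["no", "no", "yes", "yes", "no"],
    "tears":       ["reduced", "normal", "normal", "reduced", "reduced"],
    "lenses":      ["none", "soft", "hard", "none", "none"],
})
class_name = "lenses"
attributes = ["astigmatism", "tears"]

best = None  # (accuracy, coverage, attribute, value, class_label)
for label in toy[class_name].unique():
    for attribute in attributes:
        for value in toy[attribute].unique():
            matched = toy[toy[attribute] == value]
            correct = int((matched[class_name] == label).sum())
            if correct == 0:
                continue  # plays the role of min_coverage = 1
            candidate = (correct / len(matched), correct, attribute, value, label)
            # Compare accuracy first, then coverage, as compare does.
            if best is None or candidate[:2] > best[:2]:
                best = candidate

print(best)  # (1.0, 3, 'tears', 'reduced', 'none')
```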

We call this function to find all the rules for a given dataset. _find_rules_ uses a loop to call _find_one_rule_ and get a single rule. After each rule it gets, it removes all the rows covered by that rule from the dataset. The loop continues until either the dataset is empty or there are no more rules to find, which happens when _find_one_rule_ returns None.

The function accepts the data, a dataframe of unique attribute values, the names of all attribute columns,
the class name, the class labels we care about, and the minimum accuracy and coverage.

It returns a list with all the rules it can find.

In [10]:
def find_rules(data, un_data, columns_list, class_name, classAtt = [], min_accuracy = 1,  min_coverage = 1):

    rule_list = []
    temp = Rule(None)
    flag = 1
    while(flag):

        temp = find_one_rule(data, un_data, columns_list, class_name, min_accuracy,  min_coverage , classAtt)

        if temp == None:
            break
        rule_list.append(temp)

        data = data[~data.index.isin(temp.coverage_list)]

        if data.empty:
            flag = 0


    return rule_list
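
The covering loop itself is independent of how a single rule is found. The sketch below pairs it with a deliberately simple stand-in for _find_one_rule_: the hypothetical `best_single_condition` only considers one categorical condition and only accepts rules with accuracy one. The data and column names are made up.

```python
import pandas as pd

def best_single_condition(df, attributes, class_name):
    """Hypothetical stand-in for find_one_rule: best perfect one-condition rule, or None."""
    best = None
    for label in df[class_name].unique():
        for attribute in attributes:
            for value in df[attribute].unique():
                matched = df[df[attribute] == value]
                covered = matched[matched[class_name] == label]
                if len(covered) != len(matched):
                    continue  # keep only rules with accuracy 1
                if best is None or len(covered) > len(best[3]):
                    best = (attribute, value, label, covered.index)
    return best

toy = pd.DataFrame({
    "astigmatism": ["no", "no", "yes", "yes", "no"],
    "tears":       ["reduced", "normal", "normal", "reduced", "reduced"],
    "lenses":      ["none", "soft", "hard", "none", "none"],
})

rules = []
remaining = toy
while not remaining.empty:
    found = best_single_condition(remaining, ["astigmatism", "tears"], "lenses")
    if found is None:
        break  # no acceptable rule is left
    attribute, value, label, covered_index = found
    rules.append((attribute, value, label, len(covered_index)))
    remaining = remaining[~remaining.index.isin(covered_index)]  # drop covered rows

for rule in rules:
    print(rule)
# ('tears', 'reduced', 'none', 3)
# ('astigmatism', 'no', 'soft', 1)
# ('astigmatism', 'yes', 'hard', 1)
```

Because each rule is scored only on the rows left over by the earlier ones, the result is best read as an ordered list: in this toy run "astigmatism = no then soft" is perfect only after the "tears = reduced" rows have been removed.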

In [28]:

data_file = "contact_lenses.csv"
data = pd.read_csv(data_file)
data.drop(["id"], axis = 1, inplace = True)

data.columns

conditions = [ data['lenses type'].eq(1), data['lenses type'].eq(2), data['lenses type'].eq(3)]
choices = ["hard","soft","none"]

data['lenses type'] = np.select(conditions, choices)

# age groups
conditions = [ data['age'].eq(1), data['age'].eq(2), data['age'].eq(3)]
choices = ["young","medium","old"]

data['age'] = np.select(conditions, choices)

# spectacles
conditions = [ data['spectacles'].eq(1), data['spectacles'].eq(2)]
choices = ["nearsighted","farsighted"]

data['spectacles'] = np.select(conditions, choices)

# astigmatism
conditions = [ data['astigmatism'].eq(1), data['astigmatism'].eq(2)]
choices = ["no","yes"]

data['astigmatism'] = np.select(conditions, choices)

# tear production rate
conditions = [ data['tear production rate'].eq(1), data['tear production rate'].eq(2)]
choices = ["reduced","normal"]

data['tear production rate'] = np.select(conditions, choices)
    
class_labels = []

rule_list = []
min_accuracy = 1
min_coverage = 1

(cp_data, columns_list, class_name) = parse(data)

rule_list = find_rules(data, cp_data, columns_list, class_name, class_labels, min_accuracy, min_coverage )
   

for x in rule_list:
    print(x, "\n")

If [tear production rate=reduced] then none. Coverage:12, accuracy: 1.0 

If [astigmatism=no, spectacles=farsighted] then soft. Coverage:3, accuracy: 1.0 

If [astigmatism=yes, spectacles=nearsighted] then hard. Coverage:3, accuracy: 1.0 

If [age=old] then none. Coverage:2, accuracy: 1.0 

If [spectacles=nearsighted] then soft. Coverage:2, accuracy: 1.0 

If [age=medium] then none. Coverage:1, accuracy: 1.0 

If [age=young] then hard. Coverage:1, accuracy: 1.0 
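
The integer-to-label recoding in the cell above repeats the same `np.select` pattern for every column. One possible alternative, shown here as a sketch on a made-up dataframe with hypothetical lookup tables, is `Series.map` with one dictionary per column. A code that is missing from the dictionary becomes NaN, whereas `np.select` falls back to its `default` argument.

```python
import pandas as pd

# Toy frame with the same integer coding used above (values are made up for illustration).
toy = pd.DataFrame({"age": [1, 2, 3, 1], "astigmatism": [1, 2, 2, 1]})

# Hypothetical lookup tables, one per column.
labels = {
    "age": {1: "young", 2: "medium", 3: "old"},
    "astigmatism": {1: "no", 2: "yes"},
}

for column, mapping in labels.items():
    toy[column] = toy[column].map(mapping)  # codes missing from the mapping become NaN

print(toy["age"].tolist())          # ['young', 'medium', 'old', 'young']
print(toy["astigmatism"].tolist())  # ['no', 'yes', 'yes', 'no']
```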

