## Libraries & Dataset Directory

In [15]:
import pandas as pd
import numpy as np 
import os

for file_name in os.listdir("../data_sets"):
    print(file_name)


contact_lenses.csv
covid_categorical_good.csv
StudentEvaluations.csv
titanic.csv
two_houses.csv


In [16]:
data_file = "contact_lenses.csv"
data = pd.read_csv(data_file, index_col=['id'])
print("\nColumns of the contact_lenses dataset:\n", data.columns)


Columns of the contact_lenses dataset:
 Index(['age', 'spectacles', 'astigmatism', 'tear production rate',
       'lenses type'],
      dtype='object')


### Given Implementation

In [17]:
# classes
conditions = [ data['lenses type'].eq(1), data['lenses type'].eq(2), data['lenses type'].eq(3)]
choices = ["hard","soft","none"]

data['lenses type'] = np.select(conditions, choices)

# age groups
conditions = [ data['age'].eq(1), data['age'].eq(2), data['age'].eq(3)]
choices = ["young","medium","old"]

data['age'] = np.select(conditions, choices)

# spectacles
conditions = [ data['spectacles'].eq(1), data['spectacles'].eq(2)]
choices = ["nearsighted","farsighted"]

data['spectacles'] = np.select(conditions, choices)

# astigmatism
conditions = [ data['astigmatism'].eq(1), data['astigmatism'].eq(2)]
choices = ["no","yes"]

data['astigmatism'] = np.select(conditions, choices)

# tear production rate
conditions = [ data['tear production rate'].eq(1), data['tear production rate'].eq(2)]
choices = ["reduced","normal"]

data['tear production rate'] = np.select(conditions, choices)
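
The same decoding can be written more compactly with `Series.map` and one dictionary per column. The sketch below is an alternative, not the given implementation: it runs on a small made-up frame (`toy`), and the name `code_maps` is hypothetical. One difference to keep in mind is that a code missing from the dictionary becomes `NaN` with `map`, whereas `np.select` falls back to its `default` argument.

```python
import pandas as pd

# Hypothetical lookup tables: integer code -> label, one dict per column.
code_maps = {
    "age": {1: "young", 2: "medium", 3: "old"},
    "astigmatism": {1: "no", 2: "yes"},
}

# Made-up rows for illustration only (not the contact_lenses data).
toy = pd.DataFrame({"age": [1, 3, 2], "astigmatism": [2, 1, 1]})

for column, mapping in code_maps.items():
    # Series.map replaces each code with its label; unmapped codes become NaN.
    toy[column] = toy[column].map(mapping)

print(toy)
```

Keeping the codes and labels side by side in one dictionary makes it harder for a condition list and a choice list to drift out of order.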

In [18]:
class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0

    def addCondition(self, condition):
        self.conditions.append(condition)

    def setParams(self, accuracy, coverage):
        self.accuracy = accuracy
        self.coverage = coverage
    
    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)

In [19]:
class Condition:
    def __init__(self, attribute, value, true_false = None):
        self.attribute = attribute
        self.value = value
        self.true_false = true_false

    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
            return "{}>={}:{}".format(self.attribute, self.value, self.true_false)

## Learn One Rule

My first approach to developing "learn_one_rule" was to follow the Prism algorithm as described in the "Classification Rules" slides (slide 30). I first implemented it with plain loops and, once a rough version worked, tried to improve it with list comprehensions. I also used pandas DataFrame operations such as groupby and apply with lambda functions for better performance. To debug the code, I repeatedly experimented with individual lines of my code and of the given functions to see what each variable held after every iteration.

In [21]:
def learn_one_rule(columns, data, class_label, min_coverage = 30, min_accuracy = 0.6):
    covered_subset = data.copy()

    # YOUR CODE      
    rule = Rule(class_label)
    
    # Set of features still available for building conditions.
    features = set(columns[:-1])  # exclude the last column, which holds the class labels
    target = columns[-1]

    # As long as a feature remains, try to add another condition.
    while features:

        # Reset the best candidate (feature, value, accuracy, coverage) for this pass.
        best_accuracy = 0
        best_feature = None
        best_value = 0
        best_coverage = 0

        # Loop over each feature that is still available.

        for col in features:
            
            # Compute accuracy and coverage for one feature at a time.
            '''
                The pandas expression below starts from the currently covered subset.
                We "groupby" the feature column and extract the target column (the last column, which holds the class labels).
                Then "apply" computes, for each value of the feature, the fraction of rows whose class equals class_label (accuracy)
                and the number of rows with that value (coverage, taken from shape).

                For example:
                    With "age" as the feature (col) and "none" as class_label, "age" takes three values: "old", "medium", and "young".
                    The result below says that 75% of the rows with age "old" have class "none".
                    That percentage is the accuracy, and 8 is the coverage.
                    
                    age
                    old        (0.75, 8)
                    medium    (0.625, 8)
                    young       (0.5, 8) 

                    Sorting in descending order puts the value with the highest accuracy first; the other values are not needed.

            ''' 
            cur = covered_subset.groupby(col)[target].apply(lambda x: ((x == class_label).mean(),x.shape[0])).sort_values(ascending=False)
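            # Keep only values covering more than min_coverage rows. Note that this test is strict (>),
            # while the final check at the end of the function accepts coverage equal to min_coverage.
            cur = cur[cur.apply(lambda x: x[1] > min_coverage)]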
            if cur.size == 0:
                continue

            # Select the best accuracy for this specific column.
            cur = cur.head(1) 

            # Keep track of the most accurate feature so far; ties are broken by higher coverage.
            if best_accuracy < cur.values[0][0] or (cur.values[0][0] == best_accuracy and cur.values[0][1] > best_coverage):
                best_feature = col
                best_value = cur.index[0]
                best_accuracy = cur.values[0][0]
                best_coverage = cur.values[0][1]

        if best_feature is None:
            return None, None

        # After selecting a feature, remove it from the available set so it is not used again in the next iteration of the loop.
        features.remove(best_feature)
        # Add the condition (best feature = best value) to the rule.
        rule.addCondition(Condition(best_feature, best_value))
        # Keep only the rows that satisfy the rule so far.
        covered_subset = covered_subset[covered_subset[best_feature] == best_value].copy()
        
        # If the accuracy achieved so far is higher than min_accuracy, stop.
        # Otherwise, go back to the top of the while loop.
        if best_accuracy > min_accuracy:
            break

    # Either the minimal accuracy was reached or no feature is left to build a condition from.
    rule.setParams(best_accuracy, covered_subset.shape[0])   
    #print(rule.coverage, rule.accuracy)

    # Check that both the minimal coverage and the minimal accuracy are met.
    if (rule.coverage < min_coverage) or (rule.accuracy < min_accuracy): 
        rule = None 

    return rule, covered_subset                                  
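
The per-value accuracy and coverage described in the comment block above can also be computed with named aggregations instead of tuples. This is only a sketch on made-up rows: the frame `toy`, the helper column `hit`, and the result name `stats` are hypothetical and not part of the assignment code.

```python
import pandas as pd

# Made-up rows for illustration only (not the contact_lenses data).
toy = pd.DataFrame({
    "age":    ["old", "old", "old", "old", "young", "young"],
    "lenses": ["none", "none", "none", "soft", "none", "hard"],
})
class_label = "none"

# Flag rows of the target class, then aggregate per attribute value:
# the mean of the flag is the accuracy, the group size is the coverage.
stats = (
    toy.assign(hit=toy["lenses"].eq(class_label))
       .groupby("age")["hit"]
       .agg(accuracy="mean", coverage="size")
       .sort_values(["accuracy", "coverage"], ascending=False)
)
print(stats)
```

Two named columns avoid indexing into tuples with `x[0]` and `x[1]`, and sorting on both columns makes the tie-break on coverage explicit.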

## Learn Rules

In [20]:
def learn_rules (columns, data, classes=None, 
                 min_coverage = 30, min_accuracy = 0.6):
    # List of final rules
    rules = []
    
    # If list of classes of interest is not provided - it is extracted from the last column of data
    if classes is not None:
        class_labels = classes
    else:
        class_labels = data[columns[-1]].unique().tolist()

    current_data = data.copy()
    
    # This follows the logic of the original PRISM algorithm
    # It processes each class in turn. 
    for class_label in class_labels:
        done = False
        while len(current_data) > min_coverage and not done:
            # Learn one rule 
            
            rule, subset = learn_one_rule(columns, current_data, class_label, min_coverage, min_accuracy)
            #print(current_data.shape, rule.accuracy)
            # If the best rule does not pass the coverage threshold - we are done with this class
            if rule is None:
                break

            # If we get the rule with accuracy and coverage above threshold
            
            if rule.accuracy >= min_accuracy:
                rules.append(rule)

                # remove rows covered by this rule
                # you have to remove the rows where all of the conditions hold
                current_data = current_data[~current_data.index.isin(subset.index)].copy()

            else:
                done = True         
                
    return rules
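
The removal step above relies on the index of the subset returned by "learn_one_rule". An equivalent way to find "the rows where all of the conditions hold" is to build a boolean mask directly from the conditions. The sketch below assumes a hypothetical helper, `rows_matching`, that takes plain `(attribute, value)` pairs rather than `Condition` objects, and it runs on made-up rows.

```python
import pandas as pd

def rows_matching(df, conditions):
    """Boolean mask of rows where every (attribute, value) pair holds."""
    mask = pd.Series(True, index=df.index)
    for attribute, value in conditions:
        # A row stays selected only if it also satisfies this condition.
        mask &= df[attribute].eq(value)
    return mask

# Made-up rows for illustration only (not the contact_lenses data).
toy = pd.DataFrame({
    "astigmatism": ["no", "no", "yes", "yes"],
    "spectacles":  ["farsighted", "nearsighted", "farsighted", "nearsighted"],
}, index=[10, 11, 12, 13])

covered = rows_matching(toy, [("astigmatism", "no"), ("spectacles", "farsighted")])
remaining = toy[~covered]  # rows not covered by the rule
print(remaining.index.tolist())
```

A mask of this kind can also be used to recompute a rule's coverage on any frame with `covered.sum()`.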

## Results

In [22]:
print(data.shape)
column_list = data.columns.to_numpy().tolist()
rules = learn_rules(column_list, data, None, 1, 0.95)
for rule in rules[:20]:
    print(rule)

(24, 5)
If [tear production rate=reduced] then none. Coverage:12, accuracy: 1.0
If [astigmatism=no, spectacles=farsighted] then soft. Coverage:3, accuracy: 1.0
If [astigmatism=yes, spectacles=nearsighted] then hard. Coverage:3, accuracy: 1.0



Results are given below for comparison:

If [tear production rate=reduced] then none. Coverage:12, accuracy: 1.0

If [astigmatism=no, spectacles=farsighted] then soft. Coverage:3, accuracy: 1.0

If [astigmatism=no, age=young] then soft. Coverage:1, accuracy: 1.0

If [astigmatism=no, age=medium] then soft. Coverage:1, accuracy: 1.0

If [age=young] then hard. Coverage:2, accuracy: 1.0

If [spectacles=nearsighted, astigmatism=yes] then hard. Coverage:2, accuracy: 1.0

Comparing the rules produced by my implementation with the results provided by Professor Barsky reveals both similarities and differences.

Both outputs begin with the same two rules:

    If [tear production rate=reduced] then none. Coverage:12, accuracy: 1.0

    If [astigmatism=no, spectacles=farsighted] then soft. Coverage:3, accuracy: 1.0

However, the remaining rules are completely different. The for loop over "class_label" in the "learn_rules" function keeps working on the same class until no acceptable rule is returned, and here it switches immediately from "none" to "soft" and then to "hard", for a total of three output rules. It seems that the first rule learned for each class is already the best one available, so the loop moves on to the next class label and finishes with "If [astigmatism=yes, spectacles=nearsighted] then hard. Coverage:3, accuracy: 1.0". At that point, we reach a stage where there is nothing better that we can do.

For that reason, I wondered whether the coverage values shown in Barsky's given results were printed at random, since it feels that there might have been other rules between one rule and the next.
