## Scratch OneR Algorithm

### Overview

OneR, short for "One Rule", is a simple yet accurate classification algorithm. It generates one rule for each predictor in the data, then selects the rule with the smallest total error as its "one rule". To create the rule for a predictor, we build a frequency table of that predictor's values against the target classes. OneR has been reported to produce rules only slightly less accurate than those of state-of-the-art classification algorithms, while remaining simple for humans to interpret.

### Algorithm Definition

In [5]:
# For each predictor,
     # For each value of that predictor, make a rule as follows:
           # Count how often each value of target (class) appears
           # Find the most frequent class
           # Make the rule assign that class to this value of the predictor
     # Calculate the total error of the rules of each predictor
# Choose the predictor with the smallest total error.
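
The steps above can be sketched in plain Python before bringing in pandas. This is a minimal illustration, not the implementation used in the rest of this notebook: the helper names `one_rule_for_predictor` and `one_r`, and the small weather-style dataset, are made up for the example.

```python
from collections import Counter, defaultdict

def one_rule_for_predictor(values, labels):
    """Build the value -> class rule for one predictor and count its errors."""
    # Count how often each class appears for each value of the predictor
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    # Assign the most frequent class to each value
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Total error: rows whose label differs from the class the rule assigns
    errors = sum(label != rule[value] for value, label in zip(values, labels))
    return rule, errors

def one_r(columns, labels):
    """Return (name, rule, errors) for the predictor with the smallest total error."""
    best = None
    for name, values in columns.items():
        rule, errors = one_rule_for_predictor(values, labels)
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best

# Hypothetical toy data, invented for this sketch
columns = {
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast", "rainy"],
    "windy":   ["no", "yes", "no", "no", "yes", "yes", "no"],
}
play = ["no", "no", "yes", "yes", "no", "yes", "yes"]

best_name, best_rule, best_errors = one_r(columns, play)
print(best_name, best_rule, best_errors)
```

Here `outlook` misclassifies one row and `windy` misclassifies two, so `outlook` is chosen.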

### Scratch Code

This is a from-scratch implementation of OneR. I used pandas to simplify wrangling the input data for fit.

In [6]:
import pandas as pd

In [7]:
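class OneR(object):
    """One Rule classifier: keeps the single predictor whose rules fit the training data best."""

    def __init__(self):
        self.ideal_variable = None
        self.max_accuracy = 0
        self.rules = None  # value -> class mapping of the chosen predictor

    def fit(self, X, y):
        response = list()

        # Reset state so that repeated calls to fit do not reuse earlier results
        self.ideal_variable = None
        self.max_accuracy = 0
        self.rules = None

        dfx = pd.DataFrame(X)

        for i in dfx:
            # Frequency table: predictor values (rows) against target classes (columns)
            join_data = pd.DataFrame({"variable": dfx[i], "label": y})
            cross_table = pd.crosstab(join_data.variable, join_data.label)
            # Each predictor value is assigned its most frequent class
            rules = dict(cross_table.idxmax(axis=1))

            # Training accuracy of this predictor's rules
            predicted = join_data["variable"].map(rules)
            counts = int((predicted == join_data["label"]).sum())
            accuracy = counts / len(y)

            if accuracy > self.max_accuracy:
                self.max_accuracy = accuracy
                self.ideal_variable = i
                self.rules = rules

            result_feature = {"variable": str(i), "accuracy": accuracy, "rules": rules}
            response.append(result_feature)

        return response

    def predict(self, X):
        if self.rules is None:
            raise ValueError("fit must be called before predict")
        dfx = pd.DataFrame(X)
        # Values never seen during fit have no rule and come back as NaN
        return dfx[self.ideal_variable].map(self.rules).tolist()

    def __repr__(self):
        if self.ideal_variable is not None:
            # English: "The best variable for your data is: ..."
            txt = "La mejor variable para tus datos es: " + str(self.ideal_variable)
        else:
            # English: "The best variable has not been found yet, try running fit first"
            txt = "La mejor variable aun no se ha encontrado, intente ejecutar el metodo fit previamente"
        return txt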

In [8]:
X = [ ["a", "1"],
      ["a", "2"],
      ["a", "3"],
      ["b", "1"],
      ["b", "2"],
      ["b", "3"],
      ["c", "1"],
      ["c", "1"],
      ]

y = ["p", "p", "e",
     "e", "p", "e",
     "p", "p"]

In [9]:
clf = OneR()
results = clf.fit(X, y)

print(results)
print(clf)

[{'variable': '0', 'accuracy': 0.75, 'rules': {'a': 'p', 'b': 'e', 'c': 'p'}}, {'variable': '1', 'accuracy': 0.875, 'rules': {'1': 'p', '2': 'p', '3': 'e'}}]
La mejor variable para tus datos es: 1
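
The rules printed above are plain dictionaries, so they can be applied to new rows with a lookup. The sketch below assumes a hypothetical helper `apply_rule` and a hypothetical `default` class for values that never appeared during training; neither is part of the class defined earlier.

```python
# Rules copied from the output above for the best variable (column 1)
rules = {"1": "p", "2": "p", "3": "e"}

def apply_rule(rules, values, default=None):
    """Map each predictor value to its class; unseen values get `default`."""
    return [rules.get(value, default) for value in values]

new_values = ["3", "1", "7"]  # "7" never appeared in the training data
predictions = apply_rule(rules, new_values, default="p")
print(predictions)  # ['e', 'p', 'p']
```

Falling back to a fixed class is only one possible choice; returning `None` for unseen values makes the gap explicit instead.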


### Testing with Less Trivial Data: the Mushroom Dataset

The sample contains 8124 instances of mushrooms from 23 species of the Agaricus and Lepiota family. Given the nature of the problem, mushrooms of unknown edibility were assigned to the class of definitely poisonous mushrooms.

Data source: https://archive.ics.uci.edu/ml/datasets/mushroom

In [12]:
data = pd.read_csv('data/mushrooms.csv')
y_mush = data['type']

x_mush = data.loc[:,'cap_shape':]

clf_mush = OneR()
results = clf_mush.fit(x_mush, y_mush)

print(clf_mush)

La mejor variable para tus datos es: odor
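
Printing the classifier shows only the winning variable, but the list returned by fit also holds the accuracy of every other predictor. A small sketch of ranking those entries follows; `rank_predictors` is a hypothetical helper, and the sample entries are the toy results shown earlier rather than mushroom results.

```python
def rank_predictors(results):
    """Sort fit's per-predictor summaries from most to least accurate."""
    return sorted(results, key=lambda entry: entry["accuracy"], reverse=True)

# Same structure as the list returned by fit, using the toy output from earlier
sample_results = [
    {"variable": "0", "accuracy": 0.75, "rules": {"a": "p", "b": "e", "c": "p"}},
    {"variable": "1", "accuracy": 0.875, "rules": {"1": "p", "2": "p", "3": "e"}},
]

ranking = rank_predictors(sample_results)
for entry in ranking:
    print(entry["variable"], entry["accuracy"])
```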


### Checking Cross Validation

In [13]:
from sklearn.model_selection import train_test_split

num = 10
clf_mush_cv = OneR()
accuracy_items = list()

for i in range(num):

    # An integer test_size is an absolute row count, so only 2 rows are held out.
    # With a fixed random_state, every iteration draws the same split.
    x_train, x_test, y_train, y_test = train_test_split(
        x_mush, y_mush,
        test_size=2,
        random_state=42)

    clf_mush_cv.fit(x_train, y_train)
    # max_accuracy is the accuracy of the best rule on the training rows
    accuracy_items.append(clf_mush_cv.max_accuracy)

# Mean training accuracy over the repetitions
print(sum(accuracy_items) / num)

0.9852253139620786
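
The figure above is measured on the rows used for fitting. To estimate how a rule generalizes, it can instead be scored on held-out rows, with a different random split on each repetition. The following is a minimal standard-library sketch under stated assumptions: `fit_rule` and `holdout_accuracy` are hypothetical helpers, the data is made up, and a single predictor stands in for the full OneR search.

```python
import random
from collections import Counter, defaultdict

def fit_rule(values, labels):
    """Majority class per predictor value, plus an overall fallback class."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
    fallback = Counter(labels).most_common(1)[0][0]
    return rule, fallback

def holdout_accuracy(values, labels, test_fraction, seed):
    """Fit on a random training split and score on the held-out rows."""
    indices = list(range(len(values)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rule, fallback = fit_rule([values[i] for i in train_idx],
                              [labels[i] for i in train_idx])
    # Values unseen during training fall back to the overall majority class
    hits = sum(rule.get(values[i], fallback) == labels[i] for i in test_idx)
    return hits / n_test

# Made-up data: "x" is always class "p"; "y" is class "e" four times out of five
values, labels = [], []
for i in range(50):
    values += ["x", "y"]
    labels += ["p", "p" if i % 5 == 0 else "e"]

# A different seed per repetition gives a different split each time
scores = [holdout_accuracy(values, labels, 0.2, seed) for seed in range(10)]
mean_score = sum(scores) / len(scores)
print(mean_score)
```

Averaging over several distinct splits is what makes the repetition informative; repeating one identical split would return the same number every time.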
