# Association rules for multiclass prediction

## Dataset: CAR Acceptability

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

CAR: car acceptability
- PRICE: overall price
  - buying: buying price
  - maint: price of the maintenance
- TECH: technical characteristics
  - COMFORT: comfort
    - doors: number of doors
    - persons: capacity in terms of persons to carry
    - lug_boot: the size of luggage boot
  - safety: estimated safety of the car
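
All leaf attributes of this structure are categorical. As a rough sketch, assuming the attribute values of the standard UCI Car Evaluation data (the `DOMAINS` name is introduced here only for illustration), the six input attributes span 4 × 4 × 4 × 3 × 3 × 3 = 1728 combinations, which is the row count of the dataset used in this notebook:

```python
from itertools import product

# Assumed attribute domains (values as they appear in the UCI Car Evaluation data).
DOMAINS = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every combination of input values corresponds to one car description.
all_cars = list(product(*DOMAINS.values()))
print(len(all_cars))  # 1728
```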

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

In [3]:
dataset = pd.read_csv("files/CAR_Acceptability.csv", header=0)

In [4]:
dataset.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,car_accept
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [5]:
dataset.describe()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,car_accept
count,1728,1728,1728,1728,1728,1728,1728
unique,4,4,4,3,3,3,4
top,med,med,3,more,med,med,unacc
freq,432,432,432,576,576,576,1210


# Problem: predict car acceptability

## Multiclass Classification problem

### One approach: Random forest multiclass prediction

In [6]:
from sklearn.ensemble import RandomForestClassifier

In [7]:
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                    min_samples_split=2, random_state=0)

In [8]:
# It is necessary to encode variables
from sklearn.preprocessing import LabelEncoder

In [9]:
features = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'car_accept']

In [10]:
features_enc = [f + '_enc' for f in features]

In [11]:
dataset_enc = dataset.copy()
le = LabelEncoder()

for f in features:
    print(f)
    # Pass a 1-D Series (not a one-column DataFrame) to avoid DataConversionWarning
    dataset_enc[f + '_enc'] = le.fit_transform(dataset[f])

buying
maint
doors
persons
lug_boot
safety
car_accept
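
`LabelEncoder` assigns integer codes following the sorted order of the labels, not their natural ordinal order. A minimal pure-Python sketch of that mapping (the helper name `label_codes` is hypothetical and only mimics the sorted-classes behaviour for string labels) shows the consequence for `buying`:

```python
def label_codes(values):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

buying_codes = label_codes(["vhigh", "high", "med", "low"])
print(buying_codes)  # {'high': 0, 'low': 1, 'med': 2, 'vhigh': 3}
```

The codes therefore do not follow the price order low < med < high < vhigh; they are just identifiers for the categories.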


In [19]:
# Divide dataset into train and test
train_enc, test_enc = train_test_split(dataset_enc, test_size = 0.3, random_state = 0)

# Save into files
train_enc[features].to_csv("files/CAR_Acceptability_train.csv", header=True, index=False)
test_enc[features].to_csv("files/CAR_Acceptability_test.csv", header=True, index=False)

In [20]:
# Now we are ready to train the model
clf.fit(train_enc[features_enc[0:-1]], train_enc[features_enc[-1]])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [21]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
test_enc = test_enc.copy()
test_enc['pred'] = clf.predict(test_enc[features_enc[0:-1]])


In [22]:
test_enc.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,car_accept,buying_enc,maint_enc,doors_enc,persons_enc,lug_boot_enc,safety_enc,car_accept_enc,pred
1318,low,vhigh,2,more,med,med,unacc,1,3,0,2,1,2,2,2
124,vhigh,high,2,4,big,med,unacc,3,0,0,1,0,2,2,2
648,high,med,2,2,small,low,unacc,0,2,0,0,2,1,2,2
249,vhigh,med,3,2,big,low,unacc,3,2,1,0,0,1,2,2
1599,low,med,5more,2,big,low,unacc,1,2,3,0,0,1,2,2


In [23]:
# Check accuracy: 'hit' is 1 where the prediction matches the encoded target
test_enc['hit'] = (test_enc['pred'] == test_enc['car_accept_enc']).astype(int)
test_enc['hit'].mean()

0.9576107899807321
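
For context, a rough baseline can be worked out from the `describe()` output above, where `unacc` accounts for 1210 of the 1728 rows. Always predicting the majority class would score about 0.70 on the full dataset (the variable names below are illustrative):

```python
# Counts taken from the describe() output: 'unacc' is the most frequent class.
n_rows = 1728
n_majority = 1210

# Accuracy of a classifier that always answers 'unacc' (on the full dataset).
baseline = n_majority / n_rows
print(round(baseline, 4))  # 0.7002
```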

## Another approach: Association rules

In [24]:
dataset.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,car_accept
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


## Idea

From each observation in the training dataset we can generate multiple rules, each with a subset of the features on the left-hand side and the target variable on the right-hand side:

- From first observation
```
...
[buy = vhigh]                                                                       => [car_accept = unacc]
[buy = vhigh] + [mai = vhigh] + [doo = 2]                                           => [car_accept = unacc]
[buy = vhigh]                 + [doo = 2] + [per = 2]                               => [car_accept = unacc]
                                            [per = 2] + [lub = small] + [saf = low] => [car_accept = unacc]
...
```

- From second observation
```
...
                                                                        [saf = med] => [car_accept = unacc]
                                            [per = 2] + [lub = small] + [saf = med] => [car_accept = unacc]
...
```
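
The enumeration above can be sketched with `itertools.combinations`. This is only an illustration: the function name `rules_from_observation` and the `FEATURES` list are hypothetical, and the feature abbreviations are the ones used in the examples.

```python
from itertools import combinations

# Abbreviated feature names, as in the rule examples.
FEATURES = ["buy", "mai", "doo", "per", "lub", "saf"]

def rules_from_observation(values, target):
    """Yield (antecedent, consequent) pairs for every subset of feature values."""
    items = list(zip(FEATURES, values))
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield subset, ("car_accept", target)

first_obs = ["vhigh", "vhigh", "2", "2", "small", "low"]
rules = list(rules_from_observation(first_obs, "unacc"))
print(len(rules))  # 64 rules, including the one with an empty left-hand side
print(rules[1])    # ((('buy', 'vhigh'),), ('car_accept', 'unacc'))
```

With six features, each observation yields 2^6 = 64 rules, one per subset of features.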

## Pseudocode

```
Generate all possible (or selected) rules

Train:   
    For each observation:
        Update the rules base with values from the observation (features => target)
    
Predict:
    For each observation:
        Apply all rules whose left hand side matches the observation
        Select the rules with the highest support and / or confidence
        Target(s) are the right hand sides of the selected rules
```
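
As a compact, in-memory sketch of this pseudocode (not the implementation used in the cells that follow), the rule base can be a dictionary of counters. The names `subsets`, `train_rules`, `predict_one` and the toy rows are all hypothetical, and this version keeps only the single most confident rule without any minimum-support filter.

```python
from collections import Counter, defaultdict
from itertools import combinations

def subsets(n):
    """All index subsets of range(n), as tuples."""
    return [c for k in range(n + 1) for c in combinations(range(n), k)]

def train_rules(rows):
    """Count, for each (subset, matched values) antecedent, how often each target occurs."""
    rules = defaultdict(Counter)
    for *features, target in rows:
        for idx in subsets(len(features)):
            rules[(idx, tuple(features[i] for i in idx))][target] += 1
    return rules

def predict_one(rules, features):
    """Return the target of the matching rule with the highest confidence."""
    best_target, best_conf = None, -1.0
    for idx in subsets(len(features)):
        counts = rules.get((idx, tuple(features[i] for i in idx)))
        if not counts:
            continue
        target, support = counts.most_common(1)[0]
        confidence = support / sum(counts.values())
        if confidence > best_conf:
            best_target, best_conf = target, confidence
    return best_target

# Hypothetical toy data: two features followed by the target.
toy_train = [("low", "high", "acc"), ("low", "low", "unacc"), ("vhigh", "low", "unacc")]
model = train_rules(toy_train)
print(predict_one(model, ("vhigh", "low")))  # unacc
print(predict_one(model, ("low", "high")))   # acc
```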


In [25]:
from datetime import timedelta, date, datetime
from heapq import nlargest
from operator import itemgetter
import math
import re

In [34]:
MIN_SUPPORT = 1
NUM_OPTS = 1
FILE_TRAIN = 'files/CAR_Acceptability_train.csv'
FILE_TEST = 'files/CAR_Acceptability_test.csv'

In [35]:
antecedents_index = [0,1,2,3,4,5]
target_index = 6

# Generate all possible combinations of attributes
ant_list = []

def list_powerset(lst):
    # the power set of the empty set has one element, the empty set
    result = [[]]
    for x in lst:
        result.extend([subset + [x] for subset in result])
    return result


ant_list = list_powerset(antecedents_index)
ant_collections = {tuple(a): dict() for a in ant_list}
totals = {tuple(a): 0 for a in ant_list}
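
To see what `list_powerset` produces, here is a small standalone check on three indices. The function body is repeated only so that the snippet runs on its own.

```python
def list_powerset(lst):
    # Each new element doubles the list of subsets built so far.
    result = [[]]
    for x in lst:
        result.extend([subset + [x] for subset in result])
    return result

print(list_powerset([0, 1, 2]))
# [[], [0], [1], [0, 1], [2], [0, 2], [1, 2], [0, 1, 2]]
print(len(list_powerset(range(6))))  # 64
```

For the six antecedent indices this gives 64 subsets, so `ant_collections` and `totals` each hold 64 entries, including the empty antecedent `()`.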

In [36]:
def append_rule(antecedents, consequent, weight, rule_collection):
    """
    Antecedents are usually a tuple of matched attributes and consequent
    is the target.
    weight is the weight assigned to the generated rule, instead of just 1.
    rule_collection is the collection to append the generated rule to.
    """
    if antecedents in rule_collection:
        if consequent in rule_collection[antecedents]:
            rule_collection[antecedents][consequent] += weight
        else:
            rule_collection[antecedents][consequent] = weight
        rule_collection[antecedents]['__all__'] += weight
    else:
        rule_collection[antecedents] = {'__all__': weight}
        rule_collection[antecedents][consequent] = weight
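
A short usage sketch may help to picture the structure that `append_rule` builds. The function is repeated here only to keep the snippet self-contained, and the three calls use made-up observations.

```python
def append_rule(antecedents, consequent, weight, rule_collection):
    # Same logic as the cell above: per-target counts plus an '__all__' total.
    if antecedents in rule_collection:
        if consequent in rule_collection[antecedents]:
            rule_collection[antecedents][consequent] += weight
        else:
            rule_collection[antecedents][consequent] = weight
        rule_collection[antecedents]['__all__'] += weight
    else:
        rule_collection[antecedents] = {'__all__': weight}
        rule_collection[antecedents][consequent] = weight

collection = {}
append_rule(('vhigh',), 'unacc', 1, collection)
append_rule(('vhigh',), 'unacc', 1, collection)
append_rule(('vhigh',), 'acc', 1, collection)
print(collection)
# {('vhigh',): {'__all__': 3, 'unacc': 2, 'acc': 1}}
```

Each antecedent maps to the support of every target seen with it, and `'__all__'` holds the support of the antecedent itself.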


In [37]:
def apply_rules(antecedents, consequents, rule_collection):
    """
    Apply the rule stored for antecedents in rule_collection, updating the
    consequents dict (target -> highest confidence seen so far) in place.
    Return the number of consequents added or updated.
    """
    new_consequents = 0
    if antecedents in rule_collection:
        rule = rule_collection[antecedents]
        support_antecedents = float(rule['__all__']) + 1e-9
        if support_antecedents <= MIN_SUPPORT:
            return 0
        
        # Only one sorting per antecedents for all items
        if 'topitems' in rule:
            topitems = rule['topitems']
        else:
            rule_items = [(target, support / (0.0 + support_antecedents)) 
                for target, support in rule.items() if target != '__all__' 
                    and target != 'topitems' and support >= MIN_SUPPORT]
                    
            topitems = nlargest(NUM_OPTS, rule_items, key=itemgetter(1))
            rule['topitems'] = topitems
            
        # if target already in consequents, add only if higher confidence
        for topitem in topitems:
            target, new_confidence = topitem
            if target in consequents.keys():
                if consequents[target] < new_confidence:
                    consequents[target] = new_confidence
                    new_consequents += 1
            else:  # otherwise just add it to the list
                consequents[target] = new_confidence
                new_consequents += 1

    return new_consequents
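
The ranking step inside `apply_rules` boils down to confidence = support of (antecedent and target) / support of the antecedent. A simplified sketch, assuming a rule entry shaped like the ones built by `append_rule` (the helper name `top_consequents` is hypothetical, and the small `1e-9` offset and the `topitems` cache are left out):

```python
from heapq import nlargest
from operator import itemgetter

def top_consequents(rule, num_opts=1, min_support=1):
    """Rank the targets of one rule entry by confidence = support / antecedent support."""
    total = rule["__all__"]
    scored = [(target, support / total)
              for target, support in rule.items()
              if target != "__all__" and support >= min_support]
    return nlargest(num_opts, scored, key=itemgetter(1))

rule = {"__all__": 3, "unacc": 2, "acc": 1}
print(top_consequents(rule))  # [('unacc', 0.6666666666666666)]
print(top_consequents(rule, num_opts=2))
```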

In [38]:
def train():
    """Read FILE_TRAIN and add one rule per attribute subset for every row."""
    with open(FILE_TRAIN, 'r') as myfile:
        next(myfile, None)  # skip the header row

        for raw_line in myfile:
            myline = raw_line.rstrip('\n').split(',')
            if len(myline) <= 1:
                break  # stop at the first blank line

            # Each training instance contributes with weight 1
            weight = 1
            conseq = myline[target_index]

            # For each possible subset of attributes
            for ant in ant_list:
                # Values of the current instance for this subset
                cur_ant = tuple(myline[x] for x in ant)

                # Update the rule base for this subset
                append_rule(cur_ant, conseq, weight, ant_collections[tuple(ant)])

In [39]:
def predict():
    """Apply the learned rules to FILE_TEST and print the hit rate."""
    num_options = 1  # number of top predictions counted as a hit
    match, total = 0, 0

    with open(FILE_TEST, 'r') as myfile:
        next(myfile, None)  # skip the header row

        for raw_line in myfile:
            myline = raw_line.rstrip('\n').split(',')
            if len(myline) <= 1:
                break  # stop at the first blank line

            target_real = myline[target_index]
            filled = {}  # target -> highest confidence found so far

            # For each possible subset of attributes
            for ant in ant_list:
                # Values of the current instance for this subset
                cur_ant = tuple(myline[x] for x in ant)

                # Apply the matching rule, if any
                totals[tuple(ant)] += apply_rules(cur_ant, filled, ant_collections[tuple(ant)])

            # Select the target(s) with the highest confidence
            clusters_conf = sorted(filled.items(), key=lambda x: x[1], reverse=True)[:NUM_OPTS]
            myresult = [c[0] for c in clusters_conf]

            # Evaluate: a hit means the real target is among the top predictions
            current_match = 1 if target_real in myresult[:num_options] else 0
            print("{obs} {pred}".format(obs=myline, pred=myresult))

            # Update metrics
            total += 1
            match += current_match

    # Final result
    print("\ntotal:\t{}\nmatch:\t{}\nrate:\t{}".format(
            total,
            match,
            match / (0.0 + total),
        ))
    #print(totals)

In [40]:
train()

In [41]:
predict()

['low', 'vhigh', '2', 'more', 'med', 'med', 'unacc'] ['unacc']
['vhigh', 'high', '2', '4', 'big', 'med', 'unacc'] ['unacc']
['high', 'med', '2', '2', 'small', 'low', 'unacc'] ['unacc']
['vhigh', 'med', '3', '2', 'big', 'low', 'unacc'] ['unacc']
['low', 'med', '5more', '2', 'big', 'low', 'unacc'] ['unacc']
['low', 'low', '2', 'more', 'med', 'high', 'good'] ['unacc']
['high', 'low', '3', '2', 'small', 'low', 'unacc'] ['unacc']
['low', 'vhigh', '4', '4', 'med', 'high', 'acc'] ['acc']
['med', 'low', '3', '4', 'med', 'med', 'acc'] ['unacc']
['med', 'high', '2', '4', 'med', 'med', 'unacc'] ['unacc']
['vhigh', 'vhigh', '2', '2', 'big', 'low', 'unacc'] ['unacc']
['vhigh', 'high', '4', 'more', 'small', 'high', 'unacc'] ['unacc']
['low', 'low', '2', '4', 'med', 'med', 'acc'] ['unacc']
['low', 'vhigh', '4', 'more', 'small', 'low', 'unacc'] ['unacc']
['med', 'med', '3', '4', 'big', 'low', 'unacc'] ['unacc']
['med', 'vhigh', '3', '4', 'big', 'med', 'acc'] ['acc']
['vhigh', 'high', '2', '4', 'small'