CSE 5243 - Yuxiao Zhao (zhao.2379) & Shuo Lin (lin.2237) - April 3 2018

# Lab 4 — Association Rule Mining

In this lab, you'll preprocess a dataset and use the [MLxtend](http://rasbt.github.io/mlxtend/) machine
learning library to mine frequent patterns and association rules.

Please follow all instructions carefully. There are 16 steps. When you are finished, you
will submit your work via Carmen as usual.

**This lab is due on Tuesday, April 3, 2018, at 11:59 PM ET.**


### BEFORE YOU BEGIN: Install the MLxtend machine learning library.

If you're using Anaconda, this is really easy! At the command prompt, enter the following:

```
$ conda install -c conda-forge mlxtend
```

Answer 'yes' to install any dependencies or updates. That's it! If you run into trouble,
[email me right away](mailto:burkhardt.5@osu.edu)!

In [1]:
# These are the libraries you are most likely to need, but you may include others as well.
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

## Part 1. Download the data.

For this lab, you will use the [Zoo dataset](https://archive.ics.uci.edu/ml/datasets/Zoo), which 
is available for download at the UCI Machine Learning Repository.

### 1. Download the file (`zoo.data`). The file should contain 101 rows $\times$ 18 columns.


In [2]:
# If you don't already have a copy of the Zoo data file, you can download it here.
DATAURL = "https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data"
import urllib.request  # importing the submodule explicitly; plain `import urllib` does not guarantee it is loaded
urllib.request.urlretrieve(DATAURL, "zoo.data")

('zoo.data', <http.client.HTTPMessage at 0x1af1fa5b080>)

## Part 2. Preprocessing

### 2. Use the input data to generate a new data frame with 24 columns, as follows:

| Column No. | Column Name | Description |
|-|-|-|
|1|name| The name of the animal (string)|
|2-16|hair, feathers, ...| All (15) asymmetric binary attributes from the original data file (hair, feathers, etc.)|
|17|has_legs| Does the animal have legs? (1=yes, 0=no) |
|18|mammal|Is the animal a mammal? (1=yes, 0=no) |
|19|bird|Is the animal a bird? (1=yes, 0=no) |
|20|reptile|Is the animal a reptile? (1=yes, 0=no) |
|21|fish|Is the animal a fish? (1=yes, 0=no) |
|22|amphibian|Is the animal an amphibian? (1=yes, 0=no) |
|23|insect|Is the animal an insect? (1=yes, 0=no) |
|24|mollusk|Is the animal a mollusk? (1=yes, 0=no) |

In [3]:
# zoo.data has 18 comma-separated columns and no header row.
# The derived columns (has_legs, mammal, ...) are created in the next cell.
zoo = pd.read_csv("zoo.data", header=None,
                  names=["name", "hair", "feathers", "eggs", "milk", "airborne", "aquatic",
                         "predator", "toothed", "backbone", "breathes", "venomous", "fins",
                         "legs", "tail", "domestic", "catsize", "type"])

In [4]:
zoo['has_legs'] = np.where(zoo['legs']>0, 1, 0)
zoo['mammal'] =  np.where(zoo['type']==1, 1, 0)
zoo['bird'] = np.where(zoo['type']==2, 1, 0)
zoo['reptile'] = np.where(zoo['type']==3, 1, 0)
zoo['fish'] = np.where(zoo['type']==4, 1, 0)
zoo['amphibian'] = np.where(zoo['type']==5, 1, 0)
zoo['insect'] = np.where(zoo['type']==6, 1, 0)
zoo['mollusk'] = np.where(zoo['type']==7, 1, 0)
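
The indicator columns above all follow one pattern: each is 1 exactly when `type` equals a given code, and `has_legs` is 1 when `legs` is positive. A minimal plain-Python sketch of that mapping, assuming the code-to-name order used in the cell above (`TYPE_NAMES` and `encode_type` are illustrative names, not part of the lab):

```python
# Type codes 1-7, in the same order as the np.where calls above.
TYPE_NAMES = {1: "mammal", 2: "bird", 3: "reptile", 4: "fish",
              5: "amphibian", 6: "insect", 7: "mollusk"}

def encode_type(type_code, legs):
    """Return the derived binary items for one animal (hypothetical helper)."""
    row = {name: int(type_code == code) for code, name in TYPE_NAMES.items()}
    row["has_legs"] = int(legs > 0)
    return row

# Example input values (type code 4, no legs) chosen for illustration.
print(encode_type(type_code=4, legs=0))
```

Keeping the codes in one dictionary means the same mapping can be reused later when checking which animal types appear in the itemsets or rules.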

In [5]:
zoo = zoo.drop(['legs','type'],axis=1)
zoo.head()

Unnamed: 0,name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,...,domestic,catsize,has_legs,mammal,bird,reptile,fish,amphibian,insect,mollusk
0,aardvark,1,0,0,1,0,0,1,1,1,...,0,1,1,1,0,0,0,0,0,0
1,antelope,1,0,0,1,0,0,0,1,1,...,0,1,1,1,0,0,0,0,0,0
2,bass,0,0,1,0,0,1,1,1,1,...,0,0,0,0,0,0,1,0,0,0
3,bear,1,0,0,1,0,0,1,1,1,...,0,1,1,1,0,0,0,0,0,0
4,boar,1,0,0,1,0,0,1,1,1,...,0,1,1,1,0,0,0,0,0,0


### 3. Save the data frame to a CSV file and give it a descriptive name (e.g. `zoo_basket.csv`).

In [6]:
zoo.to_csv("zoo_basket.csv")

## Part 3. Generate Frequent Patterns

### 4. Use the [MLxtend **apriori**](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/#apriori) function to find frequent patterns with $min\_support=0.5$ and $max\_len=5$.

In [7]:
apriori(zoo.iloc[:,1:], min_support=0.5, use_colnames=True, max_len=5)

Unnamed: 0,support,itemsets
0,0.584158,[eggs]
1,0.554455,[predator]
2,0.60396,[toothed]
3,0.821782,[backbone]
4,0.792079,[breathes]
5,0.742574,[tail]
6,0.772277,[has_legs]
7,0.60396,"[toothed, backbone]"
8,0.514851,"[toothed, tail]"
9,0.683168,"[backbone, breathes]"


### 5. How many frequent itemsets were generated? List all the frequent 3-itemsets.

21 frequent itemsets were generated. The frequent 3-itemsets are the following:

[toothed, backbone, tail], [backbone, breathes, tail], [backbone, breathes, has_legs], [backbone, tail, has_legs], 
[breathes, tail, has_legs]
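
To make the roles of `min_support` and `max_len` concrete, here is a brute-force sketch on a made-up four-row basket (not the Zoo data). It tests every candidate itemset directly, whereas the real `apriori` function prunes candidates; `support` and `frequent_itemsets` are hypothetical helper names.

```python
from itertools import combinations

# Toy one-hot rows (made-up data): 1 means the item is present.
rows = [
    {"eggs": 1, "tail": 1, "backbone": 1},
    {"eggs": 0, "tail": 1, "backbone": 1},
    {"eggs": 1, "tail": 0, "backbone": 1},
    {"eggs": 1, "tail": 1, "backbone": 0},
]

def support(itemset, rows):
    """Fraction of rows in which every item of `itemset` equals 1."""
    hits = sum(all(row[item] == 1 for item in itemset) for row in rows)
    return hits / len(rows)

def frequent_itemsets(rows, min_support, max_len):
    """Enumerate all itemsets of size <= max_len with support >= min_support."""
    items = sorted(rows[0])
    found = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = support(combo, rows)
            if s >= min_support:
                found[combo] = s
    return found

result = frequent_itemsets(rows, min_support=0.5, max_len=2)
for itemset, s in result.items():
    print(itemset, s)
```

Raising `min_support` removes itemsets that occur in too few rows, and lowering `max_len` stops the search before larger itemsets are considered.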

### 6. Experiment with different values of $min\_support$ and $max\_len$.
* Are you able to find non-zero values for the parameters $min\_support$ and $max\_len$
  that yield frequent itemsets containing all 7 of the animal type items?
* How many frequent patterns were generated using these parameters?

In [8]:
zoo_set = apriori(zoo.iloc[:,1:], min_support=0.03, use_colnames=True, max_len=1)
animal_types = ['mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusk']

# Collect every item that appears in at least one frequent itemset.
frequent_items = set().union(*zoo_set['itemsets'])

# Report, for each animal type, whether it appears among the frequent itemsets.
for animal in animal_types:
    if animal in frequent_items:
        print('%s is in the itemsets' % animal)
    else:
        print('%s is not in the itemsets' % animal)

mammal is in the itemsets
bird is in the itemsets
reptile is in the itemsets
fish is in the itemsets
amphibian is in the itemsets
insect is in the itemsets
mollusk is in the itemsets


In [9]:
zoo_set.shape

(23, 2)

Yes. With min_support=0.03 and max_len=1, all 7 animal types appear among the frequent itemsets, and 23 frequent itemsets are generated.

## Part 4. Association Rule Mining

### 7. What is the maximum number of association rules that can be extracted from this dataset, including rules that have zero support?

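There are 23 items (attributes) in the dataset. In any rule, each item is in the antecedent, in the consequent, or absent from the rule, which gives $3^{23}$ assignments. A valid rule needs a non-empty antecedent and a non-empty consequent, so we subtract the $2^{23}$ assignments with an empty antecedent and the $2^{23}$ assignments with an empty consequent, then add back the one assignment (all items absent) that was subtracted twice.
$3^{23} - 2^{24} + 1 = 94{,}126{,}401{,}612$ is the maximum number of rules that can be extracted from this dataset.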
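
As a sanity check on the counting argument, the sketch below compares the closed form $3^d - 2^{d+1} + 1$ with a brute-force count for small $d$, then evaluates it at $d = 23$. The function names are illustrative only.

```python
from itertools import product

def count_rules_brute_force(d):
    """Count rules X -> Y over d items with X and Y non-empty and disjoint."""
    count = 0
    # Each item is assigned to: 0 = absent, 1 = antecedent, 2 = consequent.
    for assignment in product((0, 1, 2), repeat=d):
        if 1 in assignment and 2 in assignment:
            count += 1
    return count

def count_rules_formula(d):
    """Closed form: all assignments minus those with an empty side."""
    return 3**d - 2**(d + 1) + 1

# The two counts agree for every small d that is cheap to enumerate.
for d in range(1, 8):
    assert count_rules_brute_force(d) == count_rules_formula(d)

print(count_rules_formula(23))  # 94126401612
```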

### 8. Regenerate frequent patterns using $min\_support=0.1$ and $max\_len=5$.

In [10]:
zoo_freqsets = apriori(zoo.iloc[:,1:], min_support=0.1, use_colnames=True, max_len=5)
zoo_freqsets.head()

Unnamed: 0,support,itemsets
0,0.425743,[hair]
1,0.19802,[feathers]
2,0.584158,[eggs]
3,0.405941,[milk]
4,0.237624,[airborne]


### 9. Use the [MLxtend **association_rules**](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/#association_rules) function to find association rules with $min\_threshold=0.7$.

In [11]:
zoo_rules = association_rules(zoo_freqsets, metric='confidence', min_threshold=0.7)
zoo_rules.head()

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(hair),(milk),0.425743,0.405941,0.386139,0.906977,2.23426,0.213312,6.386139
1,(milk),(hair),0.405941,0.425743,0.386139,0.95122,2.23426,0.213312,11.772277
2,(hair),(toothed),0.425743,0.60396,0.376238,0.883721,1.46321,0.119106,3.405941
3,(hair),(backbone),0.425743,0.821782,0.386139,0.906977,1.10367,0.036271,1.915842
4,(hair),(breathes),0.425743,0.792079,0.425743,1.0,1.2625,0.088521,inf


### 10. How many association rules were found?

In [12]:
zoo_rules.shape[0]

8584

8584 association rules were found.

### 11. Construct a contingency table for the rule

$$ \{predator, catsize, toothed, mammal\} \rightarrow \{tail\} $$

In [13]:
rule_table = pd.DataFrame(index = ['{predator,catsize,toothed,mammal}','not {predator,catsize,toothed,mammal}','total'],
                         columns = ['tail','not tail', 'total'])

In [14]:
rule_table.loc['{predator,catsize,toothed,mammal}','tail'] = len(zoo[(zoo['predator']==1) & (zoo['catsize']==1) & (zoo['toothed']==1) & (zoo['mammal']==1) & (zoo['tail']==1)])
rule_table.loc['{predator,catsize,toothed,mammal}','not tail'] = len(zoo[(zoo['predator']==1) & (zoo['catsize']==1) & (zoo['toothed']==1) & (zoo['mammal']==1) & (zoo['tail']==0)])
rule_table.loc['{predator,catsize,toothed,mammal}','total'] = len(zoo[(zoo['predator']==1) & (zoo['catsize']==1) & (zoo['toothed']==1) & (zoo['mammal']==1)])
rule_table.loc['not {predator,catsize,toothed,mammal}','tail'] = len(zoo[((zoo['predator']==0) | (zoo['catsize']==0) | (zoo['toothed']==0) | (zoo['mammal']==0)) & (zoo['tail']==1)])
rule_table.loc['not {predator,catsize,toothed,mammal}','not tail'] = len(zoo[((zoo['predator']==0) | (zoo['catsize']==0) | (zoo['toothed']==0) | (zoo['mammal']==0)) & (zoo['tail']==0)])
rule_table.loc['not {predator,catsize,toothed,mammal}','total'] = len(zoo[(zoo['predator']==0) | (zoo['catsize']==0) | (zoo['toothed']==0) | (zoo['mammal']==0)])
rule_table.loc['total','tail'] = len(zoo[(zoo['tail']==1)])
rule_table.loc['total','not tail'] = len(zoo[(zoo['tail']==0)])
rule_table.loc['total','total'] = len(zoo)
rule_table

Unnamed: 0,tail,not tail,total
"{predator,catsize,toothed,mammal}",15,4,19
"not {predator,catsize,toothed,mammal}",60,22,82
total,75,26,101


### 12. Use the contingency table to compute support for the rule. Does your result agree with the output from the *association_rules* function?

In [15]:
support = rule_table.iloc[0,0]/rule_table.iloc[2,2]
print('support of the rule is %f'%support)

support of the rule is 0.148515


In [16]:
zoo_rules.iloc[8077,:]

antecedants           (predator, tail, catsize, toothed)
consequents                                     (mammal)
antecedent support                              0.188119
consequent support                              0.405941
support                                         0.148515
confidence                                      0.789474
lift                                              1.9448
leverage                                       0.0721498
conviction                                       2.82178
Name: 8077, dtype: object

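The support computed from the contingency table (15/101 ≈ 0.148515) agrees with the `support` value reported by the association_rules function. Note that row 8077 shows a different rule over the same five items, {predator, tail, catsize, toothed} → {mammal}. Support depends only on the union of the antecedent and consequent, so it is identical for both rules; the other metrics (confidence, lift, and so on) are specific to the rule shown.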
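
The remaining metrics for the rule in step 11 can also be derived from the contingency table. The sketch below uses the counts from that table and the usual definitions of confidence, lift, leverage, and conviction; the variable names are my own.

```python
# Counts copied from the contingency table in step 11.
n_both = 15        # antecedent present and tail present
n_antecedent = 19  # row total for {predator, catsize, toothed, mammal}
n_consequent = 75  # column total for tail
n_total = 101      # all animals

support = n_both / n_total
antecedent_support = n_antecedent / n_total
consequent_support = n_consequent / n_total

# Confidence: how often the consequent holds when the antecedent holds.
confidence = support / antecedent_support
# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / consequent_support
# Leverage: observed joint support minus the support expected under independence.
leverage = support - antecedent_support * consequent_support
# Conviction is undefined (infinite) when confidence equals 1; not the case here.
conviction = (1 - consequent_support) / (1 - confidence)

print("support    %f" % support)
print("confidence %f" % confidence)
print("lift       %f" % lift)
print("leverage   %f" % leverage)
print("conviction %f" % conviction)
```

For this rule the lift is close to 1, which suggests the antecedent says little about `tail` beyond its baseline frequency.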

### 13. List the "most interesting" rules—those with the highest measures of confidence, lift, and leverage. If there are ties (i.e. more than one rule sharing the highest value) then show them all.

In [17]:
# Those with highest leverage
zoo_rules_max = zoo_rules[(zoo_rules['leverage'] == (zoo_rules['leverage'].max()))]
zoo_rules_max

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
30,(mammal),(milk),0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
31,(milk),(mammal),0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
405,"(backbone, mammal)",(milk),0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
406,"(backbone, milk)",(mammal),0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
408,(mammal),"(backbone, milk)",0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
409,(milk),"(backbone, mammal)",0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
421,"(breathes, mammal)",(milk),0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
422,"(breathes, milk)",(mammal),0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
424,(mammal),"(breathes, milk)",0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf
425,(milk),"(breathes, mammal)",0.405941,0.405941,0.405941,1.0,2.463415,0.241153,inf


In [18]:
# Those with highest lift
zoo_rules_max = zoo_rules[(zoo_rules['lift'] == (zoo_rules['lift'].max()))]
zoo_rules_max

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
344,"(eggs, fins)",(fish),0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
346,(fish),"(eggs, fins)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1880,"(eggs, aquatic, fins)",(fish),0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1883,"(eggs, fins)","(aquatic, fish)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1887,"(aquatic, fish)","(eggs, fins)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1889,(fish),"(eggs, aquatic, fins)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1935,"(eggs, fins, toothed)",(fish),0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1938,"(eggs, fins)","(fish, toothed)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1942,"(fish, toothed)","(eggs, fins)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf
1944,(fish),"(eggs, fins, toothed)",0.128713,0.128713,0.128713,1.0,7.769231,0.112146,inf


In [19]:
# Those with highest confidence
zoo_rules_max = zoo_rules[(zoo_rules['confidence'] == (zoo_rules['confidence'].max()))]
zoo_rules_max

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(hair),(breathes),0.425743,0.792079,0.425743,1.0,1.262500,0.088521,inf
9,(feathers),(eggs),0.198020,0.584158,0.198020,1.0,1.711864,0.082345,inf
11,(feathers),(backbone),0.198020,0.821782,0.198020,1.0,1.216867,0.035291,inf
12,(feathers),(breathes),0.198020,0.792079,0.198020,1.0,1.262500,0.041172,inf
13,(feathers),(tail),0.198020,0.742574,0.198020,1.0,1.346667,0.050975,inf
14,(feathers),(has_legs),0.198020,0.772277,0.198020,1.0,1.294872,0.045094,inf
15,(feathers),(bird),0.198020,0.198020,0.198020,1.0,5.050000,0.158808,inf
16,(bird),(feathers),0.198020,0.198020,0.198020,1.0,5.050000,0.158808,inf
21,(bird),(eggs),0.198020,0.584158,0.198020,1.0,1.711864,0.082345,inf
22,(fish),(eggs),0.128713,0.584158,0.128713,1.0,1.711864,0.053524,inf
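
The three cells above share one pattern: find the maximum of a metric, then keep every rule that ties with it. A plain-Python sketch of that pattern, using a handful of rows copied from the outputs above (`top_rules` is a hypothetical helper, and the tolerance-based comparison is my own choice rather than what the cells do):

```python
import math

# A few rules copied from the outputs above.
rules = [
    {"antecedent": {"hair"}, "consequent": {"milk"}, "confidence": 0.906977, "lift": 2.23426},
    {"antecedent": {"mammal"}, "consequent": {"milk"}, "confidence": 1.0, "lift": 2.463415},
    {"antecedent": {"feathers"}, "consequent": {"bird"}, "confidence": 1.0, "lift": 5.05},
    {"antecedent": {"eggs", "fins"}, "consequent": {"fish"}, "confidence": 1.0, "lift": 7.769231},
]

def top_rules(rules, metric, rel_tol=1e-9):
    """Return every rule whose `metric` ties with the maximum value."""
    best = max(rule[metric] for rule in rules)
    return [rule for rule in rules if math.isclose(rule[metric], best, rel_tol=rel_tol)]

for metric in ("confidence", "lift"):
    winners = top_rules(rules, metric)
    print(metric, [sorted(rule["antecedent"]) for rule in winners])
```

Exact equality works in the cells above because the maximum is taken from the same column it is compared against; a tolerance is only a safeguard if the values were computed in different ways.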


## Part 5. Association Rules as Predictors

### 14. How many association rules were found in which the consequent contains exactly one element, and that element is one of the 7 animal types?

(Let's call this "Ruleset 2". Save these to another data frame or matrix, and answer
the remaining questions in this section based ONLY ON RULESET2.)


In [20]:
Ruleset2  = zoo_rules[(zoo_rules['consequents'] == {'mammal'}) | (zoo_rules['consequents'] == {'bird'}) | (zoo_rules['consequents'] == {'reptile'}) | 
                      (zoo_rules['consequents'] == {'fish'}) | (zoo_rules['consequents'] == {'amphibian'}) | (zoo_rules['consequents'] == {'insect'}) | (zoo_rules['consequents'] == {'mollusk'})]
Ruleset2.sample(5)

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
3318,"(predator, hair, milk, breathes)",(mammal),0.19802,0.405941,0.19802,1.0,2.463415,0.117636,inf
2271,"(has_legs, milk, toothed)",(mammal),0.366337,0.405941,0.366337,1.0,2.463415,0.217626,inf
3988,"(backbone, has_legs, hair, milk)",(mammal),0.376238,0.405941,0.376238,1.0,2.463415,0.223507,inf
8084,"(predator, has_legs, tail, toothed)",(mammal),0.168317,0.405941,0.148515,0.882353,2.173601,0.080188,5.049505
1314,"(backbone, has_legs, hair)",(mammal),0.376238,0.405941,0.376238,1.0,2.463415,0.223507,inf


In [21]:
Ruleset2.shape[0]

317

There are 317 rules in this case.

### 15. How many rules are there for each animal type?

In [22]:
print("There are %d rules for mammal."%len(Ruleset2[Ruleset2['consequents']=={'mammal'}]))
print("There are %d rules for bird."%len(Ruleset2[Ruleset2['consequents']=={'bird'}]))
print("There are %d rules for reptile."%len(Ruleset2[Ruleset2['consequents']=={'reptile'}]))
print("There are %d rules for fish."%len(Ruleset2[Ruleset2['consequents']=={'fish'}]))
print("There are %d rules for insect."%len(Ruleset2[Ruleset2['consequents']=={'insect'}]))
print("There are %d rules for amphibian."%len(Ruleset2[Ruleset2['consequents']=={'amphibian'}]))
print("There are %d rules for mollusk."%len(Ruleset2[Ruleset2['consequents']=={'mollusk'}]))

There are 215 rules for mammal.
There are 71 rules for bird.
There are 0 rules for reptile.
There are 31 rules for fish.
There are 0 rules for insect.
There are 0 rules for amphibian.
There are 0 rules for mollusk.
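
The per-type counts can also be produced in one pass. This sketch uses a made-up list of consequents standing in for `Ruleset2['consequents']`, and relies on `collections.Counter` returning 0 for types that never occur.

```python
from collections import Counter

ANIMAL_TYPES = ["mammal", "bird", "reptile", "fish", "amphibian", "insect", "mollusk"]

# Toy consequents (made-up) standing in for the rule table's consequents column.
consequents = [
    frozenset({"mammal"}),
    frozenset({"bird"}),
    frozenset({"mammal"}),
    frozenset({"fish"}),
    frozenset({"milk", "mammal"}),  # two elements, so it is skipped below
]

# Keep only single-element consequents and count them by their sole item.
counts = Counter(next(iter(c)) for c in consequents if len(c) == 1)

for animal in ANIMAL_TYPES:
    print("There are %d rules for %s." % (counts[animal], animal))
```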


## Part 6. Thought Experiment

Don't implement anything for this section. Just write out your answer in a markdown cell.

### 16. Explain how you might use association rules to suggest an animal's type based on the other attributes. Based on your observations, what changes would you make in the frequent itemset generation and rule generation steps?

I would select the rules whose consequent is a single animal type and that have the highest support and confidence, then use their antecedents to predict the animal type: if an animal's attributes contain a rule's antecedent, predict that rule's consequent.

With the settings used above, only mammal, bird, and fish have any rules. I would therefore increase `max_len`, lower the confidence threshold, or lower `min_support`, and regenerate the frequent itemsets and association rules until they cover at least six of the seven animal types. For each animal type, I would keep the rule with the highest support and confidence, and finally use these rules to predict the animal type from the other attributes.
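
The prediction idea can be sketched in a few lines. The three rules below are copied from the outputs in Part 4; the matching strategy (highest confidence, ties broken by support) and the `predict_type` helper are assumptions for illustration, not something the lab specifies.

```python
# (antecedent, predicted type, confidence, support), copied from the rule outputs above.
RULES = [
    (frozenset({"milk"}), "mammal", 1.0, 0.405941),
    (frozenset({"feathers"}), "bird", 1.0, 0.198020),
    (frozenset({"eggs", "fins"}), "fish", 1.0, 0.128713),
]

def predict_type(animal_items, rules, default=None):
    """Return the type of the best matching rule, or `default` if none match."""
    # A rule matches when its whole antecedent is contained in the animal's items.
    matches = [rule for rule in rules if rule[0] <= animal_items]
    if not matches:
        return default
    # Prefer higher confidence, then higher support.
    best = max(matches, key=lambda rule: (rule[2], rule[3]))
    return best[1]

print(predict_type({"hair", "milk", "backbone"}, RULES))  # mammal
print(predict_type({"eggs", "aquatic", "fins"}, RULES))   # fish
print(predict_type({"eggs", "venomous"}, RULES))          # None
```

The last call returns no prediction, which shows why the rule set needs to cover more animal types before it is useful as a classifier.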

## Part 7. Submit your work

**Submit your completed notebook, along with the data file you created in Part 2, via Carmen as LAB 4.**