# Project 1 Association Analysis

## Task

* Dataset1: Select from kaggle.com / UCI
* Dataset2: Use IBM Quest Synthetic Data Generator 
    * https://sourceforge.net/projects/ibmquestdatagen/ 
    * Generate different datasets
* Implement **Apriori Algorithm** and apply on these datasets 
    * Hash? Tree? (optional)
    * **FP-growth**
* Use association analysis tools (e.g. WEKA) to generate association rules from the datasets you generate
* Compare your results

In [1]:
import itertools
import pandas as pd

## Dataset 2

### Preprocessing

The dataset is generated with the `lit` mode of the IBM Quest Synthetic Data Generator, with the following parameters adjusted:

```
-ntran 1
-tlen 3
-nitems 20
-npats 10
-patlen 5
```

In [2]:
inputfile = open('data/data_1_3_20_10_5.data', 'r')
outputfile = open('data/data_1_3_20_10_5.csv', 'w')

In [3]:
outputfile.write('CustID,TransID,Item\n')

20

In [4]:
while True:

    s = inputfile.readline().rstrip('\n')
    
    # if this is the eof
    if len(s) == 0:
        break
        
    # Parse the three whitespace-separated fields of the line and cast them to integers.
    CustID, TransID, Item = (int(val) for val in s.split())
    output_str = '%d,%d,%d\n' % (CustID, TransID, Item)
    outputfile.write(output_str)

In [5]:
inputfile.close()
outputfile.close()
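
The conversion above can also be written with a single pass over the file object. A minimal sketch, assuming every non-blank input line holds three whitespace-separated integers; `quest_to_csv` is a hypothetical helper name, not part of the notebook, and unlike the loop above it skips blank lines rather than treating the first one as end of file:

```python
import io

def quest_to_csv(src, dst):
    """Copy whitespace-separated 'CustID TransID Item' lines from src to dst as CSV."""
    dst.write('CustID,TransID,Item\n')
    rows = 0
    for line in src:
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        cust_id, trans_id, item = (int(v) for v in fields)
        dst.write('%d,%d,%d\n' % (cust_id, trans_id, item))
        rows += 1
    return rows

# In-memory demonstration with made-up rows.
demo_in = io.StringIO('1 1 4089\n1 1 8704\n\n2 2 38\n')
demo_out = io.StringIO()
n_rows = quest_to_csv(demo_in, demo_out)
```

With real files, the two streams would come from `with open(...) as src, open(..., 'w') as dst:` so that both are closed even if parsing fails.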

In [6]:
# Load data
FILE_PATH = 'data/data_1_3_20_10_5.csv'
df = pd.read_csv(FILE_PATH)
df = df.astype({'Item': str})

In [7]:
df.head()

| | CustID | TransID | Item |
|---|---|---|---|
| 0 | 1 | 1 | 4089 |
| 1 | 1 | 1 | 8704 |
| 2 | 1 | 1 | 9205 |
| 3 | 1 | 1 | 9430 |
| 4 | 1 | 1 | 12679 |


### Before Getting Started

In [8]:
# Parameters
MINSUP = 90
MINCONF = 0.8

In [9]:
# Candidate 1-itemset
C1_df = df['Item'].value_counts()
# Frequent 1-itemset
L1_df = C1_df.loc[C1_df.values >= MINSUP]
L1 = L1_df.index.values.tolist()
tmp = L1_df.values.tolist()
L1_freq = {key: value for key, value in zip(L1, tmp)}
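
The same frequent 1-itemset step can be sketched without pandas. This is an illustration under the assumption that each item occurs at most once per transaction, so that counting rows equals counting transactions; `frequent_items` is a hypothetical name:

```python
from collections import Counter

def frequent_items(items, minsup):
    """Return {item: count} for items whose count is at least minsup, most frequent first."""
    counts = Counter(items)
    return {item: c for item, c in counts.most_common() if c >= minsup}

# Toy item column, one entry per (transaction, item) row.
demo_items = ['a', 'b', 'a', 'c', 'a', 'b']
demo_l1 = frequent_items(demo_items, minsup=2)
```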

In [10]:
print(L1)

['11264', '4639', '18299', '38', '8799', '12219', '6056', '12779', '17679', '1034', '19970', '8704', '9205', '12679', '4089']


In [11]:
L1_freq

{'11264': 290,
 '4639': 288,
 '18299': 283,
 '38': 283,
 '8799': 273,
 '12219': 196,
 '6056': 194,
 '12779': 177,
 '17679': 109,
 '1034': 105,
 '19970': 93,
 '8704': 92,
 '9205': 91,
 '12679': 91,
 '4089': 90}

In [12]:
# Init dictionary for every transaction
trans_num = df['TransID'].max()
transaction_db = {}
for i in range(1, trans_num + 1):
    transaction_db[i] = []
# Extract info from df to dictionary
df_num = len(df)
for i in range(df_num):
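    # Use explicit (row, column) positions; chained df.iloc[i][0] relies on
    # positional fallback on a labelled Series, which newer pandas deprecates
    index = df.iloc[i, 0]
    item = df.iloc[i, 2]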
    transaction_db[index] += [item]
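
Grouping rows into transactions does not need a per-row `iloc` lookup. A minimal sketch using only the standard library, assuming the rows are available as `(trans_id, item)` pairs (with the notebook's DataFrame these could come from `zip(df['TransID'], df['Item'])`); `build_transactions` is a hypothetical name:

```python
from collections import defaultdict

def build_transactions(rows):
    """Group (trans_id, item) pairs into {trans_id: [items]}, preserving row order."""
    db = defaultdict(list)
    for trans_id, item in rows:
        db[trans_id].append(item)
    return dict(db)

demo_rows = [(1, '4089'), (1, '8704'), (2, '38'), (1, '9205')]
demo_db = build_transactions(demo_rows)
```

Unlike the loop above, this sketch creates no empty list for a transaction ID that never occurs in the rows.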

In [13]:
# transaction_db

### Apriori Algorithm

In [14]:
def Apriori_gen(x, k):
    Ck = []
    # Combination of k items in Lk
    for subset1 in itertools.combinations(x, k):
        # Change subset1 into `set` type for set operation
        tmp = [set(item) for item in subset1]
        union_result = set()
        # Combination of k-1 items in subset1
        for subset2 in itertools.combinations(tmp, k - 1):
            # Intersection of all items in subset2 (k-1 items)
            result = subset2[0]
            for i in range(k - 1):
                result = result.intersection(subset2[i])
            union_result = union_result.union(result)
        if len(union_result) == k:
            Ck.append(list(union_result))
    return Ck
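
`Apriori_gen()` above scans every k-combination of the previous level. The textbook alternative joins pairs of (k-1)-itemsets that share a prefix and then prunes candidates with an infrequent subset. A minimal sketch of that approach, assuming itemsets are sorted tuples; `apriori_gen_join` is a hypothetical helper, not the notebook's method:

```python
from itertools import combinations

def apriori_gen_join(prev_frequent, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(set(prev_frequent))
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join step: merge two itemsets that share their first k-2 items
        if a[:k - 2] == b[:k - 2]:
            cand = a + (b[-1],)
            # Prune step: every (k-1)-subset must itself be frequent
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates

demo_l2 = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd')]
demo_c3 = apriori_gen_join(demo_l2, 3)
```

In the toy input, `('b', 'c', 'd')` is joined but then pruned because `('c', 'd')` is not frequent.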

In [15]:
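def Apriori(tdb, L1, minsup):
    # L1 maps each frequent item to its count (e.g. the L1_freq dictionary)
    Lk = [(item,) for item in L1]
    # Use the L1 argument instead of the global L1_freq
    Lk_freq = {(key,): value for (key, value) in L1.items()}
    k = 2
    FreqPat = []
    FreqPat_freq = {}
    while Lk != []:
        # Add Lk to the frequent patterns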
        for item in Lk:
            FreqPat.append(item)
        # Add Lk_freq in FreqPat_freq
        FreqPat_freq.update(Lk_freq)
        # Use previous Lk to generate Ck
        Ck = Apriori_gen(Lk, k)
        # Count the number of every item in Ck appears in DB
        Ck_freq = {}
        for item in Ck:
            count = 0
            for transaction in tdb.values():
                if all(x in transaction for x in item):
                    count += 1
            Ck_freq[tuple(item)] = count
        # Generate Lk
        Lk = []
        Lk_freq = {}
        for (key, value) in Ck_freq.items():
            if value >= minsup:
                # Tuples are ordered, so sort to get one canonical key per itemset
                new_key = tuple(sorted(key))
                Lk.append(new_key)
                Lk_freq[new_key] = value
        k += 1
    return FreqPat, FreqPat_freq
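
The support-counting loop in `Apriori()` tests membership against list-based transactions. A sketch of the same counting with sets, so that each containment check is a subset test; `count_support` and the toy data are illustrative only:

```python
def count_support(transactions, candidates):
    """Return {candidate: number of transactions containing every item of it}."""
    tx_sets = [set(t) for t in transactions]      # convert once, reuse for every candidate
    support = {}
    for cand in candidates:
        cand_set = set(cand)
        # Key by the sorted tuple so each itemset has one canonical key
        support[tuple(sorted(cand))] = sum(1 for t in tx_sets if cand_set <= t)
    return support

demo_tx = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['a']]
demo_support = count_support(demo_tx, [('a', 'b'), ('c', 'b')])
```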

In [16]:
FreqPat, FreqPat_freq = Apriori(transaction_db, L1_freq, MINSUP)
FreqPat

[('11264',),
 ('4639',),
 ('18299',),
 ('38',),
 ('8799',),
 ('12219',),
 ('6056',),
 ('12779',),
 ('17679',),
 ('1034',),
 ('19970',),
 ('8704',),
 ('9205',),
 ('12679',),
 ('4089',),
 ('11264', '4639'),
 ('11264', '18299'),
 ('11264', '38'),
 ('11264', '8799'),
 ('18299', '4639'),
 ('38', '4639'),
 ('4639', '8799'),
 ('18299', '38'),
 ('18299', '8799'),
 ('38', '8799'),
 ('12219', '6056'),
 ('12219', '17679'),
 ('1034', '12219'),
 ('17679', '6056'),
 ('1034', '6056'),
 ('12779', '19970'),
 ('12679', '8704'),
 ('11264', '18299', '4639'),
 ('11264', '38', '4639'),
 ('11264', '4639', '8799'),
 ('11264', '18299', '38'),
 ('11264', '18299', '8799'),
 ('11264', '38', '8799'),
 ('18299', '38', '4639'),
 ('18299', '4639', '8799'),
 ('38', '4639', '8799'),
 ('18299', '38', '8799'),
 ('12219', '17679', '6056'),
 ('11264', '18299', '38', '4639'),
 ('11264', '18299', '4639', '8799'),
 ('11264', '38', '4639', '8799'),
 ('11264', '18299', '38', '8799'),
 ('18299', '38', '4639', '8799'),
 ('11264',

### FP-growth

In [17]:
class HeaderTableNode:
    def __init__(self):
        self.head = None
        self.tail = None

class FPtreeNode:
    def __init__(self, val, parent=None):
        self.val = val
        self.count = 1
        self.parent = parent
        self.children = []
        self.next = None
    def insert_frequent_items(self, items, hdtable):
        # If there is no frequent item
        if len(items) == 0:
            return
        item = items[0]
        for child in self.children:
            if child.val == item:
                child.count += 1
                child.insert_frequent_items(items[1:], hdtable)
                return
        # If cannot find the item among children
        new_child = FPtreeNode(item, self)
        # Add new node to header table
        if hdtable[item].head is None:
            hdtable[item].head = new_child
            hdtable[item].tail = new_child
        else:
            hdtable[item].tail.next = new_child
            hdtable[item].tail = new_child
        # Add new node to current node's children
        self.children.append(new_child)
        new_child.insert_frequent_items(items[1:], hdtable)

class CondPatternBase:
    def __init__(self, pattern, freq):
        self.pattern = pattern
        self.freq = freq

In [18]:
def FP_growth(transaction_db, L1, minsup):
    trans_num = len(transaction_db)
    # Init dictionary for ordered frequent items of every transaction
    ofi = {}
    for i in range(1, trans_num + 1):
        ofi[i] = []
    # Construct ordered frequent items of every transaction
    for i in range(1, trans_num + 1):
        for item in L1:
            if item in transaction_db[i]:
                ofi[i] += [item]

    # Init header table
    HeaderTable = {}
    for item in L1:
        new_node = HeaderTableNode()
        HeaderTable[item] = new_node

    # Construct FP-tree
    FPtree = FPtreeNode('root')
    for i in range(1, trans_num + 1):
        FPtree.insert_frequent_items(ofi[i], HeaderTable)

    # Generate conditional pattern base
    CondBase = {}
    for item in L1:
        # Init
        CondBase[item] = []
        # Start from head, and no need to traverse the leaf node
        listnode = HeaderTable[item].head
        treenode = listnode.parent
        # Traversal of linked-list
        while True:
            pattern = []
            # Traversal of tree
            while True:
                if treenode.val == 'root':
                    # print()
                    break
                # print('%s ' % treenode.val, end = '')
                pattern.insert(0, treenode.val)
                treenode = treenode.parent
            # Create a new base for this item
            if len(pattern) > 0:
                new_base = CondPatternBase(pattern, listnode.count)
                CondBase[item].append(new_base)
                # print('item = %s, count = %d: ' % (item, listnode.count), end = '\t')
                # print(pattern)
            # Reach the end of the list of this item
            if listnode.next is None:
                break
            # Continue to next node in the list, and no need to traverse the leaf node
            listnode = listnode.next
            treenode = listnode.parent

    # Accumulate the count for each item in the base
    freq = {}
    for item1 in L1:
        freq[item1] = {}
        for item2 in L1:
            freq[item1][item2] = 0
        for base in CondBase[item1]:
            for item3 in base.pattern:
                freq[item1][item3] += base.freq    
        # print(item1)
        # print(freq[item1])

    # Conditional FP-tree (not a tree actually)
    condFPtree = {}
    for item1 in L1:
        tmp_pattern = []
        for item2 in L1:
            if freq[item1][item2] >= minsup:
                tmp_pattern.append(item2)
        if len(tmp_pattern) > 0:
            condFPtree[item1] = tmp_pattern

    # Generate frequent patterns
    FreqPat2 = []
    for key in condFPtree:
        x = condFPtree[key]
        for L in range(1, len(x)+1):
            for subset in itertools.combinations(x, L):
                pat = list(subset)
                pat.append(key)
                FreqPat2.append(tuple(sorted(pat)))
    # Add L1            
    for item in L1:
        FreqPat2.append((item,))
    return FreqPat2
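
`FP_growth()` above builds each conditional pattern base once and then enumerates every combination of the items that reach `minsup` within it. The usual formulation instead recurses and re-counts items inside each conditional pattern base. A minimal sketch of that recursion follows. For brevity it works on weighted prefix lists rather than an explicit tree, so it is pattern growth on projected databases and not a full FP-tree; `pattern_growth` and the toy data are illustrative names, not part of the notebook:

```python
from collections import Counter

def pattern_growth(weighted_db, minsup, suffix=()):
    """weighted_db: list of (items_tuple, count). Returns {sorted itemset tuple: support}."""
    counts = Counter()
    for items, cnt in weighted_db:
        for item in set(items):
            counts[item] += cnt
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    # Rank items by descending count (ties broken by name)
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    result = {}
    for item in order:
        result[tuple(sorted(suffix + (item,)))] = frequent[item]
        # Conditional pattern base: for each transaction containing `item`,
        # keep only the frequent items ranked before it
        cond_db = []
        for items, cnt in weighted_db:
            if item in items:
                prefix = tuple(i for i in items if i in rank and rank[i] < rank[item])
                if prefix:
                    cond_db.append((prefix, cnt))
        # Recurse: counts are recomputed inside the conditional base
        result.update(pattern_growth(cond_db, minsup, suffix + (item,)))
    return result

demo_tx = [('a', 'b', 'c'), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'b', 'c')]
demo_patterns = pattern_growth([(t, 1) for t in demo_tx], minsup=3)
```

Because supports are re-counted at every level, `('a', 'b', 'c')` (support 2) is not reported in the toy example even though each of its pairs is frequent.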

In [19]:
FP_growth(transaction_db, L1, MINSUP)

[('11264', '4639'),
 ('11264', '18299'),
 ('18299', '4639'),
 ('11264', '18299', '4639'),
 ('11264', '38'),
 ('38', '4639'),
 ('18299', '38'),
 ('11264', '38', '4639'),
 ('11264', '18299', '38'),
 ('18299', '38', '4639'),
 ('11264', '18299', '38', '4639'),
 ('11264', '8799'),
 ('4639', '8799'),
 ('18299', '8799'),
 ('38', '8799'),
 ('11264', '4639', '8799'),
 ('11264', '18299', '8799'),
 ('11264', '38', '8799'),
 ('18299', '4639', '8799'),
 ('38', '4639', '8799'),
 ('18299', '38', '8799'),
 ('11264', '18299', '4639', '8799'),
 ('11264', '38', '4639', '8799'),
 ('11264', '18299', '38', '8799'),
 ('18299', '38', '4639', '8799'),
 ('11264', '18299', '38', '4639', '8799'),
 ('12219', '6056'),
 ('12219', '17679'),
 ('17679', '6056'),
 ('12219', '17679', '6056'),
 ('1034', '12219'),
 ('1034', '6056'),
 ('1034', '12219', '6056'),
 ('12779', '19970'),
 ('12679', '8704'),
 ('11264',),
 ('4639',),
 ('18299',),
 ('38',),
 ('8799',),
 ('12219',),
 ('6056',),
 ('12779',),
 ('17679',),
 ('1034',),
 

### Rule Generation

In [20]:
# Rule generation
def rule_gen(FreqPat, FreqPat_freq, minconf, N):
    rules = []
    for pattern in FreqPat:
        pattern_len = len(pattern)
        if pattern_len == 1:
            continue
        for length in range(2, pattern_len):
            for subset in itertools.combinations(pattern, length):
                # print(pattern, subset)
                sup = float(FreqPat_freq[pattern]) / N
                conf = float(FreqPat_freq[pattern]) / FreqPat_freq[subset]
                if conf >= minconf:
                    rhs = set(pattern).difference(set(subset))
                    rules.append('%s -> %s, conf=%.3f, sup=%.3f' % (list(subset), list(rhs), conf, sup))
    return rules
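
`rule_gen()` above starts its antecedent sizes at 2 (`range(2, pattern_len)`). A sketch of a variant that tries every non-empty proper subset as the antecedent, assuming `freq` holds a count for every frequent itemset keyed by a sorted tuple; `rule_gen_all` is a hypothetical name and the counts are made up:

```python
from itertools import combinations

def rule_gen_all(freq, minconf, n):
    """freq maps sorted itemset tuples to counts; returns (lhs, rhs, conf, sup) tuples."""
    rules = []
    for pattern, count in freq.items():
        for size in range(1, len(pattern)):          # antecedent sizes 1 .. len-1
            for lhs in combinations(pattern, size):
                conf = count / freq[lhs]
                if conf >= minconf:
                    rhs = tuple(i for i in pattern if i not in lhs)
                    rules.append((lhs, rhs, conf, count / n))
    return rules

demo_freq = {('a',): 4, ('b',): 5, ('a', 'b'): 4}
demo_rules = rule_gen_all(demo_freq, minconf=0.9, n=10)
```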

In [21]:
rule_gen(FreqPat, FreqPat_freq, MINCONF, len(transaction_db))

["['11264', '18299'] -> ['4639'], conf=0.964, sup=0.421",
 "['11264', '4639'] -> ['18299'], conf=0.960, sup=0.421",
 "['18299', '4639'] -> ['11264'], conf=0.982, sup=0.421",
 "['11264', '38'] -> ['4639'], conf=0.960, sup=0.415",
 "['11264', '4639'] -> ['38'], conf=0.946, sup=0.415",
 "['38', '4639'] -> ['11264'], conf=0.978, sup=0.415",
 "['11264', '4639'] -> ['8799'], conf=0.932, sup=0.409",
 "['11264', '8799'] -> ['4639'], conf=0.977, sup=0.409",
 "['4639', '8799'] -> ['11264'], conf=0.981, sup=0.409",
 "['11264', '18299'] -> ['38'], conf=0.968, sup=0.423",
 "['11264', '38'] -> ['18299'], conf=0.978, sup=0.423",
 "['18299', '38'] -> ['11264'], conf=0.985, sup=0.423",
 "['11264', '18299'] -> ['8799'], conf=0.939, sup=0.410",
 "['11264', '8799'] -> ['18299'], conf=0.981, sup=0.410",
 "['18299', '8799'] -> ['11264'], conf=0.981, sup=0.410",
 "['11264', '38'] -> ['8799'], conf=0.934, sup=0.404",
 "['11264', '8799'] -> ['38'], conf=0.966, sup=0.404",
 "['38', '8799'] -> ['11264'], conf=0.

### Generate Association Rule using WEKA

#### Preprocessing

In [22]:
df = pd.read_csv('data/data_1_3_20_10_5.csv')

In [23]:
# Init dictionary for every transaction
trans_num = df['TransID'].max()
di = {}
for i in range(1, trans_num + 1):
    di[i] = []
# Extract info from df to dictionary
df_num = len(df)
for i in range(df_num):
    # Use explicit (row, column) positions instead of chained df.iloc[i][0]
    index = df.iloc[i, 0]
    item = df.iloc[i, 2]
    di[index] += [item]

In [24]:
outputfile = open('data/weka_data_1_3_20_10_5.csv', 'w')
outputfile.write('TransID')

title = list(df['Item'].unique())
for item in title:
    outputfile.write(',%s' % item)
outputfile.write('\n')

trans_num = df['TransID'].max()

In [25]:
for i in range(1, trans_num + 1):
    outputfile.write('%d' % i)
    for item in title:
        if item in di[i]:
            outputfile.write(',1')
        else:
            outputfile.write(',0')
    outputfile.write('\n')
outputfile.close()
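
The one-hot export can also be sketched with the standard `csv` module. This version writes the item columns only, so no `TransID` column has to be dropped afterwards; `write_one_hot` and the toy data are illustrative assumptions:

```python
import csv
import io

def write_one_hot(db, items, out):
    """Write one row per transaction with a 0/1 column for each item."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(items)
    for trans_id in sorted(db):
        present = set(db[trans_id])
        writer.writerow([1 if item in present else 0 for item in items])

demo_db = {1: ['38', '4639'], 2: ['4639']}
demo_out = io.StringIO()
write_one_hot(demo_db, ['38', '4639', '8799'], demo_out)
```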

In [26]:
df_weka = pd.read_csv('data/weka_data_1_3_20_10_5.csv')
df_weka.head()

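| | TransID | 4089 | 8704 | 9205 | 9430 | 12679 | 12779 | 18927 | 19970 | 38 | ... | 17679 | 15854 | 16381 | 9914 | 7711 | 14049 | 19119 | 17637 | 5085 | 11480 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |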


In [27]:
df_weka.drop(['TransID'], axis=1, inplace=True)
df_weka.to_csv('data/weka_data_1_3_20_10_5.csv', index=False)

The association rules generated by WEKA are as follows:

```
Apriori
=======

Minimum support: 0.4 (254 instances)
Minimum metric <confidence>: 0.8
Number of cycles performed: 12

Generated sets of large itemsets:

Size of set of large itemsets L(1): 5

Size of set of large itemsets L(2): 10

Size of set of large itemsets L(3): 9

Size of set of large itemsets L(4): 2

Best rules found:

 1. 38=1 4639=1 18299=1 261 ==> 11264=1 258      <conf:(0.99)> lift:(2.16) lev:(0.22) [138] conv:(35.4)
 2. 38=1 18299=1 272 ==> 11264=1 268             <conf:(0.99)> lift:(2.15) lev:(0.23) [143] conv:(29.52)
 3. 4639=1 8799=1 18299=1 259 ==> 11264=1 255    <conf:(0.98)> lift:(2.15) lev:(0.22) [136] conv:(28.11)
 4. 4639=1 8799=1 11264=1 259 ==> 18299=1 255    <conf:(0.98)> lift:(2.21) lev:(0.22) [139] conv:(28.68)
 5. 4639=1 18299=1 272 ==> 11264=1 267           <conf:(0.98)> lift:(2.15) lev:(0.22) [142] conv:(24.6)
 6. 8799=1 18299=1 265 ==> 11264=1 260           <conf:(0.98)> lift:(2.14) lev:(0.22) [138] conv:(23.96)
 7. 8799=1 11264=1 265 ==> 18299=1 260           <conf:(0.98)> lift:(2.2)  lev:(0.22) [141] conv:(24.45)
 8. 4639=1 8799=1 264 ==> 11264=1 259            <conf:(0.98)> lift:(2.14) lev:(0.22) [138] conv:(23.87)
 9. 4639=1 8799=1 264 ==> 18299=1 259            <conf:(0.98)> lift:(2.2)  lev:(0.22) [141] conv:(24.36)
10. 38=1 4639=1 11264=1 263 ==> 18299=1 258      <conf:(0.98)> lift:(2.2)  lev:(0.22) [140] conv:(24.27)
```

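Almost all of the rules WEKA found also appear among the rules I generated, for example<br/>
rule 1, `['18299', '38', '4639'] -> ['11264'], conf=0.989, sup=0.407`, and<br/>
rule 2, `['18299', '38'] -> ['11264'], conf=0.985, sup=0.423`.<br/>
Some differences remain. For instance, the highest-confidence rule I generated is<br/>
`['18299', '38', '4639', '8799'] -> ['11264'], conf=0.992, sup=0.388`, not WEKA's rule 1.<br/>
WEKA also seems reluctant to produce rules with two or more items on the RHS, such as<br/>
`['38', '8799'] -> ['11264', '4639', '18299'], conf=0.939, sup=0.388`.<br/>
Both of these rules have a support of 0.388, which is below the minimum support of 0.4 reported in the WEKA output, so this threshold likely explains why WEKA does not list them.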

## Dataset 1

Dataset source: https://www.kaggle.com/abcsds/pokemon

Schema
* #: ID for each pokemon
* Name: Name of each pokemon
* Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
* Type 2: Some pokemon are dual type and have 2
* Total: sum of all stats that come after this, a general guide to how strong a pokemon is
* HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
* Attack: the base modifier for normal attacks (e.g. Scratch, Punch)
* Defense: the base damage resistance against normal attacks
* Sp. Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
* Sp. Def: special defense, the base damage resistance against special attacks
* Speed: determines which pokemon attacks first each round
* Generation: generation of the pokemon, from 1 to 6
* Legendary: whether the pokemon is legendary

Preprocessing approach:
* Numeric columns are split into high, medium, and low using the column's 75th and 50th percentiles.
* Categorical columns are one-hot encoded over all of their categories.
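
The binning rule for numeric columns can be sketched with the standard library. The notebook itself hard-codes the cut-offs read from `df1.describe()`; here they are computed with `statistics.quantiles(..., method='inclusive')`, which interpolates linearly between data points and is assumed (not verified here) to agree with the percentiles pandas reports. `discretize` and the sample values are illustrative only:

```python
import statistics

def discretize(values, name):
    """Label each value low/medium/high using the column's median and 75th percentile."""
    _, median, q75 = statistics.quantiles(values, n=4, method='inclusive')
    labels = []
    for v in values:
        if v > q75:
            labels.append('high ' + name)
        elif v > median:
            labels.append('medium ' + name)
        else:
            labels.append('low ' + name)
    return labels

demo_labels = discretize([10, 20, 30, 40, 50], 'HP')
```

Values equal to a cut-off fall into the lower bin, matching the strict `>` comparisons used in the notebook.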

### Preprocessing


In [28]:
df1 = pd.read_csv('data/Pokemon.csv')

In [29]:
df1.head()

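| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |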


In [30]:
df1.describe()

| | # | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 |
| mean | 362.81375 | 435.1025 | 69.25875 | 79.00125 | 73.8425 | 72.82 | 71.9025 | 68.2775 | 3.32375 |
| std | 208.343798 | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 1.66129 |
| min | 1.0 | 180.0 | 1.0 | 5.0 | 5.0 | 10.0 | 20.0 | 5.0 | 1.0 |
| 25% | 184.75 | 330.0 | 50.0 | 55.0 | 50.0 | 49.75 | 50.0 | 45.0 | 2.0 |
| 50% | 364.5 | 450.0 | 65.0 | 75.0 | 70.0 | 65.0 | 70.0 | 65.0 | 3.0 |
| 75% | 539.25 | 515.0 | 80.0 | 100.0 | 90.0 | 95.0 | 90.0 | 90.0 | 5.0 |
| max | 721.0 | 780.0 | 255.0 | 190.0 | 230.0 | 194.0 | 230.0 | 180.0 | 6.0 |


In [31]:
# (medium, high) cut-offs per column: the 50th and 75th percentiles from df1.describe()
thresholds = {
    'Total': (450, 515),
    'HP': (65, 80),
    'Attack': (75, 100),
    'Defense': (70, 90),
    'Sp. Atk': (65, 95),
    'Sp. Def': (70, 90),
    'Speed': (65, 90),
}

def discretize(x, col, medium, high):
    # Map a numeric value to 'high <col>', 'medium <col>' or 'low <col>'
    if x > high: return 'high ' + col
    elif x > medium: return 'medium ' + col
    else: return 'low ' + col

for col, (medium, high) in thresholds.items():
    df1[col] = df1[col].apply(discretize, args=(col, medium, high))

In [32]:
# Assign the result back; in-place fillna on a selected column is deprecated in newer pandas
df1['Type 2'] = df1['Type 2'].fillna('Does not have Type 2')

In [33]:
df1 = df1.astype({'Generation': str, 'Legendary': str})
df1.drop(['#', 'Name'], axis=1, inplace=True)

### Before Getting Started

In [34]:
# Parameters
MINSUP = 300
MINCONF = 0.8

In [35]:
# Candidate 1-itemset
cols_names = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
C1_series = df1['Type 1'].value_counts()
tmp = df1['Type 2'].value_counts()
C1_series = C1_series.add(tmp, fill_value=0)
C1_series = C1_series.astype(int)
for col in cols_names:
    tmp = df1[col].value_counts()
    C1_series = pd.concat([C1_series, tmp])

# Frequent 1-itemset
L1_series = C1_series.loc[C1_series.values >= MINSUP]
L1 = L1_series.index.values.tolist()
tmp = L1_series.values.tolist()
L1_freq = {key: value for key, value in zip(L1, tmp)}

In [36]:
L1_freq

{'Does not have Type 2': 386,
 'low Total': 404,
 'low HP': 405,
 'low Attack': 409,
 'low Defense': 438,
 'low Sp. Atk': 411,
 'low Sp. Def': 433,
 'low Speed': 410,
 'False': 735}

In [37]:
# Extract info from df to dictionary
trans_num = len(df1)
cols_names = df1.columns.to_list()
transaction_db = {}
for index, row in df1.iterrows():
    new_index = index + 1
    transaction_db[new_index] = []
    for col in cols_names:
        transaction_db[new_index].append(row[col])

In [38]:
# transaction_db

### Apply Apriori to Dataset 1

In [39]:
FreqPat, FreqPat_freq = Apriori(transaction_db, L1_freq, MINSUP)
FreqPat

[('Does not have Type 2',),
 ('low Total',),
 ('low HP',),
 ('low Attack',),
 ('low Defense',),
 ('low Sp. Atk',),
 ('low Sp. Def',),
 ('low Speed',),
 ('False',),
 ('Does not have Type 2', 'False'),
 ('low HP', 'low Total'),
 ('low Attack', 'low Total'),
 ('low Defense', 'low Total'),
 ('low Sp. Atk', 'low Total'),
 ('low Sp. Def', 'low Total'),
 ('False', 'low Total'),
 ('False', 'low HP'),
 ('False', 'low Attack'),
 ('low Defense', 'low Sp. Def'),
 ('False', 'low Defense'),
 ('low Sp. Atk', 'low Sp. Def'),
 ('False', 'low Sp. Atk'),
 ('False', 'low Sp. Def'),
 ('False', 'low Speed'),
 ('False', 'low HP', 'low Total'),
 ('False', 'low Attack', 'low Total'),
 ('False', 'low Defense', 'low Total'),
 ('False', 'low Sp. Atk', 'low Total'),
 ('False', 'low Sp. Def', 'low Total'),
 ('False', 'low Defense', 'low Sp. Def'),
 ('False', 'low Sp. Atk', 'low Sp. Def')]

### Apply FP-Growth to Dataset 1

In [40]:
FP_growth(transaction_db, L1, MINSUP)

[('low HP', 'low Total'),
 ('low Attack', 'low Total'),
 ('low Defense', 'low Total'),
 ('low Sp. Atk', 'low Total'),
 ('low Sp. Def', 'low Total'),
 ('low Defense', 'low Sp. Def'),
 ('low Sp. Atk', 'low Sp. Def'),
 ('low Defense', 'low Sp. Def', 'low Total'),
 ('low Sp. Atk', 'low Sp. Def', 'low Total'),
 ('low Defense', 'low Sp. Atk', 'low Sp. Def'),
 ('low Defense', 'low Sp. Atk', 'low Sp. Def', 'low Total'),
 ('Does not have Type 2', 'False'),
 ('False', 'low Total'),
 ('False', 'low HP'),
 ('False', 'low Attack'),
 ('False', 'low Defense'),
 ('False', 'low Sp. Atk'),
 ('False', 'low Sp. Def'),
 ('False', 'low Speed'),
 ('Does not have Type 2', 'False', 'low Total'),
 ('Does not have Type 2', 'False', 'low HP'),
 ('Does not have Type 2', 'False', 'low Attack'),
 ('Does not have Type 2', 'False', 'low Defense'),
 ('Does not have Type 2', 'False', 'low Sp. Atk'),
 ('Does not have Type 2', 'False', 'low Sp. Def'),
 ('Does not have Type 2', 'False', 'low Speed'),
 ('False', 'low HP', '

### Rule Generation of Dataset 1

In [41]:
rule_gen(FreqPat, FreqPat_freq, MINCONF, len(transaction_db))

["['False', 'low HP'] -> ['low Total'], conf=0.802, sup=0.400",
 "['low HP', 'low Total'] -> ['False'], conf=1.000, sup=0.400",
 "['low Attack', 'low Total'] -> ['False'], conf=1.000, sup=0.394",
 "['False', 'low Total'] -> ['low Defense'], conf=0.807, sup=0.407",
 "['low Defense', 'low Total'] -> ['False'], conf=1.000, sup=0.407",
 "['low Sp. Atk', 'low Total'] -> ['False'], conf=1.000, sup=0.394",
 "['False', 'low Total'] -> ['low Sp. Def'], conf=0.832, sup=0.420",
 "['low Sp. Def', 'low Total'] -> ['False'], conf=1.000, sup=0.420",
 "['low Defense', 'low Sp. Def'] -> ['False'], conf=0.991, sup=0.419",
 "['low Sp. Atk', 'low Sp. Def'] -> ['False'], conf=1.000, sup=0.391"]

### Generate Association Rule of Dataset 1 using WEKA

#### Preprocessing

In [42]:
outputfile = open('data/weka_pokemon.csv', 'w')
outputfile.write('TransID')

title = C1_series.index.to_list()
for item in title:
    outputfile.write(',%s' % item)
outputfile.write('\n')

trans_num = len(transaction_db)

In [43]:
for i in range(1, trans_num + 1):
    outputfile.write('%d' % i)
    for item in title:
        if item in transaction_db[i]:
            outputfile.write(',1')
        else:
            outputfile.write(',0')
    outputfile.write('\n')
outputfile.close()

In [44]:
df_weka = pd.read_csv('data/weka_pokemon.csv')
df_weka.head()

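| | TransID | Bug | Dark | Does not have Type 2 | Dragon | Electric | Fairy | Fighting | Fire | Flying | ... | medium Speed | high Speed | 1 | 5 | 3 | 4 | 2 | 6 | False | True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |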


In [45]:
df_weka.drop(['TransID'], axis=1, inplace=True)
df_weka.to_csv('data/weka_pokemon.csv', index=False)

The association rules generated by WEKA are as follows:

```
Apriori
=======

Minimum support: 0.4 (320 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 12

Generated sets of large itemsets:

Size of set of large itemsets L(1): 9

Size of set of large itemsets L(2): 12

Size of set of large itemsets L(3): 4

Best rules found:

 1. low Total=1 404 ==> False=1 404                  <conf:(1)>    lift:(1.09) lev:(0.04) [32] conv:(32.83)
 2. low Total=1 low Sp. Def=1 336 ==> False=1 336    <conf:(1)>    lift:(1.09) lev:(0.03) [27] conv:(27.3)
 3. low Total=1 low Defense=1 326 ==> False=1 326    <conf:(1)>    lift:(1.09) lev:(0.03) [26] conv:(26.49)
 4. low Total=1 low HP=1 320 ==> False=1 320         <conf:(1)>    lift:(1.09) lev:(0.03) [26] conv:(26)
 5. low Sp. Atk=1 411 ==> False=1 410                <conf:(1)>    lift:(1.09) lev:(0.04) [32] conv:(16.7)
 6. low Sp. Def=1 433 ==> False=1 430                <conf:(0.99)> lift:(1.08) lev:(0.04) [32] conv:(8.8)
 7. low Defense=1 low Sp. Def=1 338 ==> False=1 335  <conf:(0.99)> lift:(1.08) lev:(0.03) [24] conv:(6.87)
 8. low Speed=1 410 ==> False=1 406                  <conf:(0.99)> lift:(1.08) lev:(0.04) [29] conv:(6.66)
 9. low Attack=1 409 ==> False=1 404                 <conf:(0.99)> lift:(1.08) lev:(0.04) [28] conv:(5.54)
10. low HP=1 405 ==> False=1 399                     <conf:(0.99)> lift:(1.07) lev:(0.03) [26] conv:(4.7)
```

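Again, almost all of the rules WEKA found appear among the rules I generated, for example<br/>
rule 2, `['low Sp. Def', 'low Total'] -> ['False'], conf=1.000, sup=0.420`,<br/>
rule 3, `['low Defense', 'low Total'] -> ['False'], conf=1.000, sup=0.407`, and<br/>
rule 4, `['low HP', 'low Total'] -> ['False'], conf=1.000, sup=0.400`.<br/>
Rule 1, however, is missing from my rules. The likely cause is that `rule_gen()` only enumerates antecedents of two or more items (`range(2, pattern_len)`), so a rule with a single-item LHS such as `low Total -> False` is never generated.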

## Find and answer

What are rules with
* High support, high confidence ? 
* High support, low confidence ? 
* Low support, low confidence ? 
* Low support, high confidence ?

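For a rule X -> Y over N transactions, support is count(X ∪ Y) / N and confidence is count(X ∪ Y) / count(X). High support therefore means that the count of X ∪ Y is high, and low support that it is low. Every transaction that contains X ∪ Y also contains X, so count(X) is never smaller than count(X ∪ Y); confidence is high when the two counts are close and low when count(X) is much larger.

* High support, high confidence: the count of X ∪ Y is high, and the count of X is about the same.
* High support, low confidence: the count of X ∪ Y is high, but the count of X is much larger still.
* Low support, low confidence: the count of X ∪ Y is low, while the count of X is much larger.
* Low support, high confidence: the count of X ∪ Y is low, and the count of X is about equally low.

The rules produced by `rule_gen()` come from frequent itemsets, so they all have high support (at least `MINSUP`). Rules below `MINCONF` are then filtered out, so they all have high confidence as well.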
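
The four combinations can be illustrated on a small made-up transaction list. This is only a sketch: the 20 transactions, the item names, and the helper `support_confidence` are assumptions for illustration, and "high" and "low" are relative to this toy data rather than to the thresholds used in the notebook:

```python
def support_confidence(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs over a list of item sets."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence

# 20 toy transactions
demo_tx = (
    [{'a', 'b'}] * 8                      # a and b always occur together
    + [{'c', 'd'}] * 1                    # c and d occur together, but rarely
    + [{'e', 'f'}] * 4 + [{'e'}] * 5      # e is common, f follows less than half the time
    + [{'g', 'h'}] * 1 + [{'g'}] * 1      # g is rare, h follows half the time
)

demo_stats = {
    'a -> b': support_confidence(demo_tx, {'a'}, {'b'}),  # higher support, high confidence
    'e -> f': support_confidence(demo_tx, {'e'}, {'f'}),  # higher support, low confidence
    'g -> h': support_confidence(demo_tx, {'g'}, {'h'}),  # low support, low confidence
    'c -> d': support_confidence(demo_tx, {'c'}, {'d'}),  # low support, high confidence
}
```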