# Finding frequent itemsets and association rules

### Library: mlxtend
We will use the mlxtend library, which includes an implementation of the Apriori algorithm; see its <a href="http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/">user guide</a> for details. We recommend installing it with

pip install mlxtend

or

conda install mlxtend

These commands sometimes fail to install the library into the right environment, especially when several Python versions are installed alongside your Jupyter notebook. In that case, we recommend running the command in the next cell directly in the notebook; it installs the package for the interpreter that the notebook kernel is running:

In [4]:
import sys
!{sys.executable} -m pip install mlxtend

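To check whether a package is visible to the interpreter behind the current kernel, something like the following sketch can be run in a cell. The helper name `is_installed` is made up for illustration; it relies only on the standard library.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the running interpreter can locate the package."""
    return importlib.util.find_spec(package_name) is not None


# Path of the Python interpreter the notebook kernel is using
print(sys.executable)
# True once mlxtend is installed for this interpreter, False otherwise
print(is_installed("mlxtend"))
```

If the second line prints `False` after an installation, the package most likely went to a different Python installation than the one shown on the first line.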


### One-hot encoding

Categorical data, where a feature may take more than two values, sometimes has to be converted into a one-hot encoding, in which each feature takes only the value 0 or 1 (or True/False). For example, consider the following set of baskets, each containing all the items bought during a single trip to the supermarket. The script below uses mlxtend's TransactionEncoder to produce a one-hot encoding of the data, which mlxtend's apriori function requires:

In [2]:
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
# te_ary
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False


Observe that the number of features now equals the number of distinct items in the data. Each column corresponds to an item (e.g. "Milk"), and each entry specifies whether the basket in that row contains the item in that column.
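
To make the encoding concrete, here is a minimal plain-Python sketch of the same idea on a tiny, hypothetical set of baskets. It is only an illustration of what the table above represents, not mlxtend's actual implementation.

```python
# Toy baskets (hypothetical data, smaller than the dataset used above)
baskets = [["Milk", "Eggs"], ["Eggs", "Onion", "Onion"], ["Milk"]]

# One column per distinct item, in alphabetical order
items = sorted({item for basket in baskets for item in basket})

# One row per basket: True where the basket contains the item of that column
one_hot = [[item in basket for item in items] for basket in baskets]

print(items)
for row in one_hot:
    print(row)
```

Note that the repeated "Onion" in the second basket still yields a single True: the encoding records presence, not quantity.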

### Frequent itemsets and association rules

We can now run apriori on the one-hot encoded data and derive association rules from the resulting frequent itemsets, using the corresponding functions from mlxtend. The full code is below:

In [4]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

frequent_itemsets

from mlxtend.frequent_patterns import association_rules

a=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

a[["antecedents","consequents","support","confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,(Eggs),(Kidney Beans),0.8,1.0
1,(Kidney Beans),(Eggs),0.8,0.8
2,(Eggs),(Onion),0.6,0.75
3,(Onion),(Eggs),0.6,1.0
4,(Milk),(Kidney Beans),0.6,1.0
5,(Onion),(Kidney Beans),0.6,1.0
6,(Yogurt),(Kidney Beans),0.6,1.0
7,"(Eggs, Onion)",(Kidney Beans),0.6,1.0
8,"(Eggs, Kidney Beans)",(Onion),0.6,0.75
9,"(Kidney Beans, Onion)",(Eggs),0.6,1.0

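The support and confidence columns can be checked by hand. The sketch below recomputes them for the rule (Eggs) → (Onion) from the same five baskets, taking support as the fraction of baskets that contain an itemset and confidence as support(antecedent ∪ consequent) / support(antecedent). The helper names `support` and `confidence` are introduced here for illustration only.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Work with sets so that repeated items in a basket count once
baskets = [set(basket) for basket in dataset]


def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"Eggs", "Onion"})
rule_confidence = confidence({"Eggs"}, {"Onion"})
print(round(rule_support, 2), round(rule_confidence, 2))
```

Eggs and Onion appear together in 3 of the 5 baskets and Eggs appears in 4, which gives the support 0.6 and confidence 0.75 reported in row 2 of the table.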

### DataFrame

One convenient way to load data from an input file and preprocess it is to use a pandas DataFrame. The documentation describes a DataFrame as "two-dimensional, size-mutable, potentially heterogeneous tabular data." It comes with many tools for processing and handling tabular data. See the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">documentation</a> for details.

To use a DataFrame, the "pandas" library must be installed; any of the installation methods shown above works.



In [5]:
import pandas as pd

#read order_data.csv and create a DataFrame with that content
data = pd.read_csv(r"order_data.csv",delimiter=" ",header=None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,toothpaste,brush,milk,cereals,honey,bread,butter,cheese,yogurt
1,milk,cereals,honey,bread,cheese,razor,gel,shampoo,
2,milk,cereals,honey,cheese,soap,shampoo,,,
3,honey,bread,butter,cheese,mouthwash,toothpaste,,,
4,cereals,honey,bread,butter,gel,soap,,,
5,cheesse,yogurt,milk,cereals,honey,shampoo,gel,,
6,honey,bread,cheese,razor,butter,yogurt,,,
7,honey,bread,cheese,butter,milk,,,,
8,cereals,butter,cookies,chips,,,,,
9,cerals,cheese,yogurt,cookies,chips,,,,


In [6]:
#we run apriori on the order_data.csv file

import math
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


data = pd.read_csv(r"order_data.csv",delimiter=" ",header=None)

#preprocessing: change to one hot encoding so as to be able to use apriori from mlxtend
d=data.values.tolist()

#removing nan values (rows shorter than the longest one are padded with NaN, which is a float)
d=[[item for item in row if not (isinstance(item,float) and math.isnan(item))] for row in d]

te = TransactionEncoder()
te_ary = te.fit(d).transform(d)
df = pd.DataFrame(te_ary, columns=te.columns_)

#computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

a=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

#visualizing association rules results
a[["antecedents","consequents","support","confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,(bread),(butter),0.35,0.636364
1,(butter),(bread),0.35,0.875000
2,(cheese),(bread),0.35,0.700000
3,(bread),(cheese),0.35,0.636364
4,(honey),(bread),0.30,0.666667
...,...,...,...,...
99,"(shampoo, cereals)","(milk, honey)",0.20,0.800000
100,"(shampoo, honey)","(milk, cereals)",0.20,1.000000
101,"(milk, cereals)","(shampoo, honey)",0.20,0.666667
102,"(milk, honey)","(shampoo, cereals)",0.20,0.666667


In [7]:
type(a["antecedents"][99]) #prints frozenset

frozenset

In [8]:
#a frozenset is very similar to a Python set, with the main difference that it cannot be changed.
#so we can check whether a rule has shampoo among its antecedents and honey among its consequents as follows
i=99
if ("shampoo" in a["antecedents"][i] and "honey" in a["consequents"][i] ):
    print ("Rule "+str(i)+" talks about shampoo and honey")
else:
    print ("Rule "+str(i)+" does not talk about shampoo and honey")

Rule 99 talks about shampoo and honey

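The same membership test extends to a whole collection of rules. The sketch below uses a small, hypothetical list of (antecedents, consequents) pairs instead of the DataFrame `a`, and keeps the positions of the rules that mention every wanted item on either side.

```python
# Hypothetical rules, each written as an (antecedents, consequents) pair of frozensets
rules = [
    (frozenset({"shampoo", "cereals"}), frozenset({"milk", "honey"})),
    (frozenset({"bread"}), frozenset({"butter"})),
    (frozenset({"milk", "honey"}), frozenset({"shampoo", "cereals"})),
]

wanted = {"shampoo", "honey"}

# A rule matches if its antecedents and consequents together contain all wanted items
matching = [i for i, (antecedents, consequents) in enumerate(rules)
            if wanted <= (antecedents | consequents)]
print(matching)  # prints [0, 2]
```

The `<=` operator tests whether one set is a subset of another and works between a set and a frozenset, so the same condition could presumably be applied to each row of `a` as well.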

## Question 1:
<p style="font-size:20px;">Report 3 rules with support at least 0.2 and confidence at least 0.9. Specify for each of them the support and the confidence.</p>

In [14]:
import math
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


data = pd.read_csv(r"mammographic_masses.csv",delimiter=",",header=0)

#preprocessing: change to one hot encoding so as to be able to use apriori from mlxtend
d=data.values.tolist()


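#removing missing values, which are marked with "?" in this file
for i in range(len(d)):
    j = 0
    while True:
        if d[i][j] == '?':  # is this entry the missing-value marker '?'
            del d[i][j]     # delete the entry
            j -= 1          # step back so that the next entry is not skipped
        j += 1
        if j > len(d[i]) - 1:
            break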


#adding attributes: prefix each remaining value with a column name, taken by its position in the filtered row
for i in range(len(d)):
    for j in range(len(d[i])):
        d[i][j]=data.columns[j] + "=" +str(d[i][j])

            
te = TransactionEncoder()
te_ary = te.fit(d).transform(d)

df = pd.DataFrame(te_ary, columns=te.columns_)

#computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

a=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)

selected_rules = a[(a['support'] >= 0.2) & (a['confidence'] >= 0.9)]
selected_rules[["antecedents", "consequents", "support", "confidence"]]




Unnamed: 0,antecedents,consequents,support,confidence
0,(Severity=1),(Density=3),0.391259,0.933002
1,"(Margin=1, Density=3)",(BI-RADS=4),0.263267,0.900356
2,"(Margin=1, Severity=0)",(BI-RADS=4),0.269511,0.91844
3,"(Severity=1, BI-RADS=5)",(Density=3),0.277836,0.933566
4,"(BI-RADS=5, Density=3)",(Severity=1),0.277836,0.911263
5,"(Shape=4, Severity=1)",(Density=3),0.288241,0.92953
6,"(Margin=1, BI-RADS=4, Density=3)",(Severity=0),0.238293,0.905138
7,"(Margin=1, Severity=0, Density=3)",(BI-RADS=4),0.238293,0.927126
8,"(Shape=4, Severity=1, BI-RADS=5)",(Density=3),0.221644,0.930131
9,"(Shape=4, BI-RADS=5, Density=3)",(Severity=1),0.221644,0.914163


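<p style="font-size:20px;">The table lists every rule with support of at least 0.2 and confidence of at least 0.9, together with its support and confidence. Any three of them answer the question, for example rules 0, 1 and 2: (Severity=1) → (Density=3) with support 0.391 and confidence 0.933; (Margin=1, Density=3) → (BI-RADS=4) with support 0.263 and confidence 0.900; (Margin=1, Severity=0) → (BI-RADS=4) with support 0.270 and confidence 0.918.</p>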
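
One caveat about the preprocessing in the cell above: the '?' entries are deleted before `data.columns[j]` is attached, so in a row with a gap the values after the gap are paired by position with the name of an earlier column. A possible variant attaches the column name first and drops the missing entries afterwards. The sketch below uses hypothetical column names and rows rather than the mammographic data, and is not the code that produced the table above.

```python
# Hypothetical column names and rows; "?" marks a missing value
columns = ["Shape", "Margin", "Density"]
rows = [["4", "?", "3"], ["1", "1", "?"]]

# Pair each value with its own column name first, then skip the missing ones
transactions = [
    [f"{column}={value}" for column, value in zip(columns, row) if value != "?"]
    for row in rows
]
print(transactions)  # prints [['Shape=4', 'Density=3'], ['Shape=1', 'Margin=1']]
```

With delete-then-label, the first row would instead become `['Shape=4', 'Margin=3']`, because the density value moves into the second position once the '?' is removed.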