<a href="https://colab.research.google.com/github/kaiqizhao/760/blob/master/arm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Frequent Pattern Mining and Association Analysis on Retail Data



### Download dataset

We use a transnational data set containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer [1].
      
[1] Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197

In [0]:
!pip install mlxtend
!pip install Orange3-Associate

from urllib.request import urlretrieve

# download the Online Retail spreadsheet from the UCI repository into the working directory
urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx','online_retail.xlsx')


### Preprocessing

We need to assign a numeric id to each item and arrange the data as a list of transactions, where each transaction is the list of items bought together on one invoice. For example:


```
[['shoes', 'tennis rackets'],
 ['milk', 'coke', 'bread'],
  ...
 ['book', 'bookmark', 'paper', 'ink']]
```
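To make the target layout concrete, here is a minimal pure-Python sketch of the same idea on a handful of made-up rows. The invoice numbers, stock codes, and the names `rows`, `code_to_id`, and `toy_T` are illustrative assumptions, not part of the dataset pipeline used in this notebook.

```python
# Hypothetical (invoice number, stock code) rows, one row per purchased item.
rows = [
    ("536365", "85123A"),
    ("536365", "71053"),
    ("536366", "22633"),
    ("536367", "85123A"),
    ("536367", "22633"),
    ("536367", "84879"),
]

# Assign each distinct stock code a numeric id in order of first appearance.
code_to_id = {}
for _, code in rows:
    code_to_id.setdefault(code, len(code_to_id))

# Collect the item ids of each invoice into one transaction.
transactions_by_invoice = {}
for invoice, code in rows:
    transactions_by_invoice.setdefault(invoice, []).append(code_to_id[code])

toy_T = list(transactions_by_invoice.values())
print(toy_T)  # [[0, 1], [2], [0, 2, 3]]
```

Each inner list is one transaction, and the same item always maps to the same id, which is the shape a frequent itemset miner expects.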





In [0]:
import pandas as pd
import numpy as np

# load the first 5000 rows of the dataset, keeping only the columns we need
data_frame = pd.read_excel('online_retail.xlsx', header=0, usecols=['InvoiceNo', 'StockCode', 'Description'], nrows=5000)

# transform the stock codes into numeric ids (numbered in order of first appearance)
data_frame['StockCode'] = pd.Categorical(data_frame['StockCode'], categories=data_frame['StockCode'].unique()).codes

# get a dictionary of stock id --> description
item_dict = {row[0]:row[1] for row in data_frame[['StockCode', 'Description']].to_numpy()}
print('Total number of unique items: {}'.format(len(item_dict)))

# transform the dataset into transactions: np.unique returns the row index at which
# each invoice first appears, and np.split cuts the item ids at those indices.
# This assumes the rows of each invoice are contiguous and ordered by invoice number.
uids, index = np.unique(data_frame['InvoiceNo'].apply(str).values, return_index=True)
T = np.split(data_frame['StockCode'].values, index[1:])
print('Total number of transactions: {}'.format(len(T)))

Total number of unique items: 1595
Total number of transactions: 300


## Use Orange3-Associate for frequent pattern mining

Orange3-Associate implements the FP-growth algorithm for mining frequent itemsets.

### Frequent itemset mining

We can mine frequent itemsets with the FP-growth implementation in Orange3-Associate. Let's use a minimum support of 10, i.e. an itemset counts as frequent if it appears in at least 10 transactions.
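To illustrate what a minimum support threshold means, the following sketch counts itemsets by brute force on a tiny made-up transaction list. It is not FP-growth and would not scale to the retail data; it only produces the same kind of result (a mapping from itemset to the number of transactions containing it). The function `toy_frequent_itemsets` and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def toy_frequent_itemsets(transactions, min_support):
    """Count every subset of every transaction and keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Enumerate all non-empty subsets of this transaction.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    # Keep itemsets contained in at least `min_support` transactions.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


toy_transactions = [[0, 1, 2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]
toy_itemsets = toy_frequent_itemsets(toy_transactions, 3)

for itemset, support in sorted(toy_itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

With a threshold of 3, every single item and every pair is frequent, while the triple `{0, 1, 2}` appears in only two transactions and is dropped. FP-growth reaches the same answer without enumerating every subset, which is what makes it practical on real data.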

In [0]:
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules
itemsets = dict(frequent_itemsets(T, 10))
print('Total number of frequent itemsets found: {}'.format(len(itemsets)))

Total number of frequent itemsets found: 129613


> Let's take a look at what we have found for 3-itemsets

In [0]:
# keep only the itemsets that contain exactly three items
three_itemsets = {itemset: supp for itemset, supp in itemsets.items() if len(itemset) == 3}
print('# of 3-itemsets: {}'.format(len(three_itemsets)))

# print the item descriptions of the ten 3-itemsets with the highest support
most_freq_itemsets = sorted(three_itemsets.items(), key=lambda kv: kv[1], reverse=True)[:10]
for itemset, supp in most_freq_itemsets:
  print(',  '.join([item_dict[item] for item in itemset]))

# of 3-itemsets: 1625
WHITE HANGING HEART T-LIGHT HOLDER,  KNITTED UNION FLAG HOT WATER BOTTLE,  RED WOOLLY HOTTIE WHITE HEART.
RETRO COFFEE MUGS ASSORTED,  KNITTED UNION FLAG HOT WATER BOTTLE,  RED WOOLLY HOTTIE WHITE HEART.
WHITE HANGING HEART T-LIGHT HOLDER,  VINTAGE BILLBOARD DRINK ME MUG,  RETRO COFFEE MUGS ASSORTED
WHITE HANGING HEART T-LIGHT HOLDER,  WHITE METAL LANTERN,  KNITTED UNION FLAG HOT WATER BOTTLE
WHITE HANGING HEART T-LIGHT HOLDER,  WHITE METAL LANTERN,  RED WOOLLY HOTTIE WHITE HEART.
WHITE METAL LANTERN,  KNITTED UNION FLAG HOT WATER BOTTLE,  RED WOOLLY HOTTIE WHITE HEART.
RETRO COFFEE MUGS ASSORTED,  WHITE HANGING HEART T-LIGHT HOLDER,  KNITTED UNION FLAG HOT WATER BOTTLE
RETRO COFFEE MUGS ASSORTED,  WHITE HANGING HEART T-LIGHT HOLDER,  RED WOOLLY HOTTIE WHITE HEART.
WHITE HANGING HEART T-LIGHT HOLDER,  VINTAGE BILLBOARD DRINK ME MUG,  KNITTED UNION FLAG HOT WATER BOTTLE
WHITE HANGING HEART T-LIGHT HOLDER,  VINTAGE BILLBOARD DRINK ME MUG,  RED WOOLLY HOTTIE WHITE HE

### Mining association rules

Having found the frequent itemsets, we can now mine association rules from them with a minimum confidence of 80%. Let's focus on rules that have 3 items on the LHS (the antecedent) and 1 item on the RHS (the consequent).

In [0]:
# generate rules from the frequent itemsets; each rule is a tuple
# (antecedent, consequent, support, confidence)
rules = [(P, Q, supp, conf) for P, Q, supp, conf in association_rules(itemsets, .8) if len(P)==3 and len(Q)==1]

# print each row, replace ids with the actual names of the items
for ante, cons, supp, conf in rules[:5]:
  print(', '.join(item_dict[i] for i in ante), '-->', ', '.join(item_dict[j] for j in cons), '(supp: {}, conf: {})'.format(supp, conf))

RED WOOLLY HOTTIE WHITE HEART., HAND WARMER SCOTTY DOG DESIGN, HAND WARMER UNION JACK --> WHITE HANGING HEART T-LIGHT HOLDER (supp: 10, conf: 1.0)
WHITE HANGING HEART T-LIGHT HOLDER, HAND WARMER SCOTTY DOG DESIGN, HAND WARMER UNION JACK --> RED WOOLLY HOTTIE WHITE HEART. (supp: 10, conf: 0.8333333333333334)
WHITE HANGING HEART T-LIGHT HOLDER, RED WOOLLY HOTTIE WHITE HEART., HAND WARMER UNION JACK --> HAND WARMER SCOTTY DOG DESIGN (supp: 10, conf: 1.0)
WHITE HANGING HEART T-LIGHT HOLDER, RED WOOLLY HOTTIE WHITE HEART., HAND WARMER SCOTTY DOG DESIGN --> HAND WARMER UNION JACK (supp: 10, conf: 0.9090909090909091)
HAND WARMER RED RETROSPOT, RED WOOLLY HOTTIE WHITE HEART., HAND WARMER SCOTTY DOG DESIGN --> WHITE HANGING HEART T-LIGHT HOLDER (supp: 10, conf: 1.0)
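
The confidence of a rule P --> Q is the support of P and Q together divided by the support of P alone. Under that definition, a rule with support 10 and confidence 0.8333 has an antecedent that appears in 12 transactions, since 10 / 12 = 0.8333. The sketch below applies this formula to a small made-up dictionary of itemset supports. The function `toy_rules` is hypothetical and, as a simplification, only generates rules with a single item on the RHS; it is not the Orange3-Associate implementation.

```python
def toy_rules(itemsets, min_confidence):
    """Derive rules with a one-item consequent from a {frozenset: support} dict."""
    rules = []
    for itemset, supp in itemsets.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for item in itemset:
            antecedent = itemset - {item}
            # Every subset of a frequent itemset is frequent, so the
            # antecedent's support is expected to be in the dictionary.
            conf = supp / itemsets[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, frozenset([item]), supp, conf))
    return rules


# Hypothetical supports: item 0 occurs 12 times, item 1 occurs 10 times,
# and the two occur together 10 times.
toy_supports = {
    frozenset([0]): 12,
    frozenset([1]): 10,
    frozenset([0, 1]): 10,
}

for ante, cons, supp, conf in toy_rules(toy_supports, 0.8):
    print(sorted(ante), '-->', sorted(cons), '(supp: {}, conf: {:.4f})'.format(supp, conf))
```

Both directions pass the 80% threshold here, but with different confidence: 1 --> 0 holds in every transaction containing item 1, while 0 --> 1 holds in only 10 of the 12 transactions containing item 0.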
