## Data Understanding and Pre-Processing

In [111]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [24]:
df = pd.read_csv("Groceries_dataset.csv")
df.head(10)

,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk
5,4941,14-02-2015,rolls/buns
6,4501,08-05-2015,other vegetables
7,3803,23-12-2015,pot plants
8,2762,20-03-2015,whole milk
9,4119,12-02-2015,tropical fruit


In [15]:
df.shape

(38765, 3)

In [16]:
df.isnull().sum()

Member_number      0
Date               0
itemDescription    0
dtype: int64

In [17]:
df.isnull().values.any()

False

In [18]:
df.dtypes

Member_number       int64
Date               object
itemDescription    object
dtype: object

In [22]:
df.nunique()

Member_number      3898
Date                728
itemDescription     167
dtype: int64

Group the rows by Member_number and Date, then convert the groups to a list. Each Member_number + Date pair is treated as one transaction.

In [78]:
theList = list(df.groupby(['Member_number','Date']))
theList

[((1000, '15-03-2015'),
         Member_number        Date      itemDescription
  4843            1000  15-03-2015              sausage
  8395            1000  15-03-2015           whole milk
  20992           1000  15-03-2015  semi-finished bread
  24544           1000  15-03-2015               yogurt),
 ((1000, '24-06-2014'),
         Member_number        Date itemDescription
  13331           1000  24-06-2014      whole milk
  29480           1000  24-06-2014          pastry
  32851           1000  24-06-2014     salty snack),
 ((1000, '24-07-2015'),
         Member_number        Date  itemDescription
  2047            1000  24-07-2015      canned beer
  18196           1000  24-07-2015  misc. beverages),
 ((1000, '25-11-2015'),
         Member_number        Date   itemDescription
  6388            1000  25-11-2015           sausage
  22537           1000  25-11-2015  hygiene articles),
 ((1000, '27-05-2015'),
         Member_number        Date     itemDescription
  1629            

For each Member_number + Date group, collect its itemDescription values into a list

In [80]:
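# Build one list of item names per (Member_number, Date) group
transactions = []
for item in theList:
    # item is a (group key, sub-DataFrame) tuple; item[1] holds the group's rows
    transactions.append(item[1]['itemDescription'].tolist())
print(transactions)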

[['sausage', 'whole milk', 'semi-finished bread', 'yogurt'], ['whole milk', 'pastry', 'salty snack'], ['canned beer', 'misc. beverages'], ['sausage', 'hygiene articles'], ['soda', 'pickled vegetables'], ['frankfurter', 'curd'], ['sausage', 'whole milk', 'rolls/buns'], ['whole milk', 'soda'], ['beef', 'white bread'], ['frankfurter', 'soda', 'whipped/sour cream'], ['frozen vegetables', 'other vegetables'], ['butter', 'whole milk'], ['tropical fruit', 'sugar'], ['butter milk', 'specialty chocolate'], ['sausage', 'rolls/buns'], ['root vegetables', 'detergent'], ['frozen meals', 'dental care'], ['rolls/buns', 'rolls/buns'], ['dish cleaner', 'cling film/bags'], ['canned beer', 'frozen fish'], ['other vegetables', 'hygiene articles'], ['pip fruit', 'whole milk', 'tropical fruit'], ['rolls/buns', 'red/blush wine', 'chocolate'], ['other vegetables', 'shopping bags'], ['whole milk', 'chocolate', 'packaged fruit/vegetables', 'rolls/buns'], ['root vegetables', 'whole milk', 'pastry'], ['rolls/buns
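The grouping and list-building steps above can also be written as a single pandas expression. The sketch below is a minimal illustration on a hypothetical five-row table (the `toy` frame and its values are made up for the example, laid out like the three columns of the CSV); it is an alternative formulation, not the code the notebook ran.

```python
import pandas as pd

# Hypothetical toy rows in the same three-column layout as the grocery data
toy = pd.DataFrame(
    {
        "Member_number": [1000, 1000, 1000, 1001, 1000],
        "Date": ["15-03-2015", "24-06-2014", "15-03-2015", "07-02-2015", "24-06-2014"],
        "itemDescription": ["sausage", "whole milk", "whole milk", "soda", "pastry"],
    }
)

# Group by member and date, then collect each group's items into a list.
# groupby sorts the group keys by default and keeps row order inside each group.
toy_transactions = (
    toy.groupby(["Member_number", "Date"])["itemDescription"]
    .apply(list)
    .tolist()
)
print(toy_transactions)
```

Each inner list is one basket, so the result has the same shape as `transactions` above.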

## Generating Association Rules from Frequent Itemsets

- Association rules are if/then statements that describe relationships or patterns among the items in a dataset, which can be stored in various kinds of databases, including relational ones
    - An association rule has 2 parts:
        - Antecedent (if): Something that's found in the data
        - Consequent (then): Item that is found in combination with the antecedent
    - Itemset = List of all the items in the antecedent and consequent
    - For example: {Bread, Eggs} --> {Milk}
        - Antecedent = Bread, Eggs
        - Consequent = Milk
        - Itemset = {Bread, Eggs, Milk}

- Association rules are created by thoroughly analyzing the data and looking for frequent if/then patterns
    - In order to understand the strength of the association, we can use 3 different metrics: Support, Confidence, Lift
    - Support (s) = Total # of transactions containing both ItemsetX and ItemsetY / Total # of transactions
        - Support tells us how frequently an itemset appears across all transactions
        - Support is used to help us identify the association rules that are worth considering in further analysis
    - Confidence (c) = Total # of transactions containing both ItemsetX and ItemsetY / Total # of transactions containing ItemsetX
        - Confidence tells us how likely the consequent (ItemsetY) is to be in a purchase given that the purchase already contains the antecedent (ItemsetX)
    - Lift = Confidence / Support of ItemsetY = (Transactions containing both ItemsetX and ItemsetY / Transactions containing ItemsetX) / Fraction of transactions containing ItemsetY
        - Lift is the *lift* that ItemsetX gives to our confidence for having ItemsetY in the purchase
        - Lift > 1 means a positive association: customers who purchase ItemsetX are more likely than average to also purchase ItemsetY, and the higher the lift, the stronger the association
        - Lift = 1 means ItemsetX and ItemsetY occur independently of each other
        - Lift < 1 means a negative association: customers who purchase ItemsetX are less likely than average to also purchase ItemsetY
        
- The goal of association rule mining is to find all the rules having a support and confidence greater than or equal to a threshold value
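The three metrics above can be computed by hand on a small example. The sketch below uses five hypothetical baskets (made up for illustration, not taken from the grocery data) and the rule {bread, eggs} --> {milk}; the helper name `support_of` is also just for this example.

```python
# Five hypothetical baskets, each stored as a set of item names
baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "milk"},
    {"eggs", "milk"},
    {"milk"},
]

def support_of(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent = {"bread", "eggs"}   # ItemsetX
consequent = {"milk"}            # ItemsetY

# Support of the rule: baskets containing both X and Y (1 of 5)
support_xy = support_of(antecedent | consequent)
# Confidence: of the baskets containing X (2 of 5), the share that also contain Y
confidence = support_xy / support_of(antecedent)
# Lift: confidence relative to how common Y is overall (4 of 5 baskets)
lift = confidence / support_of(consequent)

print(f"support={support_xy:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

Here the lift comes out below 1: milk is in 4 of 5 baskets overall but in only 1 of the 2 baskets that contain both bread and eggs.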

In [91]:
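# One-hot encode the transactions: one row per transaction, one boolean column per item
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
# Note: this rebinds df, replacing the raw table loaded from the CSV
df = pd.DataFrame(te_ary, columns=te.columns_)
df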

,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14958,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,False
14959,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14960,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14961,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
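To make the boolean table above easier to read, here is a pure-Python sketch of the same kind of one-hot encoding on three hypothetical baskets. It illustrates what the encoded table contains and is not mlxtend's implementation; the names `toy_baskets`, `columns`, and `rows` are made up for the example.

```python
# Three hypothetical baskets; the last one repeats an item on purpose
toy_baskets = [
    ["sausage", "whole milk", "yogurt"],
    ["whole milk", "pastry"],
    ["rolls/buns", "rolls/buns"],
]

# One column per distinct item, in sorted order
columns = sorted({item for basket in toy_baskets for item in basket})

# One row per basket: True where the basket contains that column's item
rows = [[item in set(basket) for item in columns] for basket in toy_baskets]

print(columns)
for row in rows:
    print(row)
```

A repeated item still yields a single `True`, because the encoding records only whether an item is present, not how many times it was bought.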


### Apriori Algorithm

- The Apriori algorithm is used to find the frequent itemsets from which the most important association rules are generated
    - The Apriori algorithm is based on the idea that every subset of a frequent itemset must also be a frequent itemset
    - Frequent itemset = An itemset whose support is greater than or equal to a minimum threshold value (aka, minsup)
    - Rather than brute-forcing every possible itemset, Apriori builds candidates level by level (1-itemsets, then 2-itemsets, and so on), checks the support of each candidate, and skips any candidate that has an infrequent subset
- To determine minsup, start with a high value and lower it until you get a result set that is satisfactory for your data mining goals
    - In the example below, I used a minsup of 0.01, so an itemset must appear in at least 1% of transactions to count as frequent
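The level-by-level search described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not mlxtend's implementation: the function name `apriori_sketch` and the toy baskets are hypothetical, and the code favors readability over speed.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: keep the single items that meet the minimum support
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets; "jam" is rare, so no itemset containing it is ever counted
toy = [
    ["bread", "eggs", "milk"],
    ["bread", "milk"],
    ["bread", "eggs", "jam"],
    ["eggs", "milk"],
    ["bread", "eggs", "milk"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With a minsup of 0.5, the three common items and their three pairs are frequent, while the three-item set is generated as a candidate but falls short because it appears in only 2 of 5 baskets.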

In [110]:
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True, verbose=1)
frequent_itemsets

Processing 18 combinations | Sampling itemset size 3 2


,support,itemsets
0,0.021386,(UHT-milk)
1,0.033950,(beef)
2,0.021787,(berries)
3,0.016574,(beverages)
4,0.045312,(bottled beer)
...,...,...
64,0.010559,"(other vegetables, rolls/buns)"
65,0.014837,"(other vegetables, whole milk)"
66,0.013968,"(rolls/buns, whole milk)"
67,0.011629,"(soda, whole milk)"


### association_rules() to specify metric of interest and threshold

- Once you have found your frequent itemsets using the apriori algorithm, use the association_rules() function to specify the metric of interest and the corresponding threshold
- Supported metrics include confidence and lift; the output table also reports leverage and conviction for each rule
- In the example below I chose confidence as the metric with a min_threshold of 0.1, so only rules whose confidence is at least 10% are kept

In [125]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(other vegetables),(whole milk),0.122101,0.157923,0.014837,0.121511,0.76943,-0.004446,0.958551
1,(rolls/buns),(whole milk),0.110005,0.157923,0.013968,0.126974,0.804028,-0.003404,0.96455
2,(soda),(whole milk),0.097106,0.157923,0.011629,0.119752,0.758296,-0.003707,0.956636
3,(yogurt),(whole milk),0.085879,0.157923,0.011161,0.129961,0.82294,-0.002401,0.967861


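Looking at index 1 as an example, a customer who purchases rolls/buns has a 12.7% chance of also purchasing whole milk (confidence = 0.127). That is lower than the 15.8% of all transactions that contain whole milk, which is why the lift is below 1 (0.80). All four rules have a lift below 1, so none of them indicates a positive association.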
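The figures in the rules table can be reproduced by hand from the supports alone. The sketch below copies the rounded supports shown in the outputs above for four items and four pairs, applies the 10% confidence threshold, and recomputes lift, leverage, and conviction with their standard definitions. It is an illustration of the arithmetic, not mlxtend's code, and because the inputs are rounded to six decimals the results match the table only to about four decimals.

```python
# Supports copied from the notebook outputs above (rounded to 6 decimals)
support = {
    frozenset(["other vegetables"]): 0.122101,
    frozenset(["rolls/buns"]): 0.110005,
    frozenset(["soda"]): 0.097106,
    frozenset(["whole milk"]): 0.157923,
    frozenset(["other vegetables", "rolls/buns"]): 0.010559,
    frozenset(["other vegetables", "whole milk"]): 0.014837,
    frozenset(["rolls/buns", "whole milk"]): 0.013968,
    frozenset(["soda", "whole milk"]): 0.011629,
}
MIN_CONFIDENCE = 0.1

rules = {}
for itemset, s_xy in support.items():
    if len(itemset) != 2:
        continue
    # Try both directions: each item in turn is the antecedent
    for item in itemset:
        antecedent = frozenset([item])
        consequent = itemset - antecedent
        confidence = s_xy / support[antecedent]
        if confidence < MIN_CONFIDENCE:
            continue  # rule is filtered out by the threshold
        (consequent_item,) = consequent
        rules[(item, consequent_item)] = {
            "confidence": confidence,
            "lift": confidence / support[consequent],
            "leverage": s_xy - support[antecedent] * support[consequent],
            "conviction": (1 - support[consequent]) / (1 - confidence),
        }

for (lhs, rhs), metrics in sorted(rules.items()):
    print(lhs, "->", rhs, {name: round(value, 4) for name, value in metrics.items()})
```

Only the rules pointing to whole milk survive. The reverse direction, such as whole milk --> rolls/buns, divides the same pair support by the larger support of whole milk, which gives a confidence below 10%.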