# Association Rules

## Dataset

In [28]:
import pandas as pd

# reading xlsx doc
df = pd.read_excel('Online retail.xlsx', names=['Items'])
df.head()

Unnamed: 0,Items
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt


## Data Preprocessing

In [29]:
# splitting Items
df_split = df['Items'].str.split(',', expand=True)
df_split.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,


In [30]:
# replacing None (missing values) with empty strings
df_split = df_split.fillna('')
df_split

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,
7496,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,
7497,chicken,,,,,,,,,,,,,,,,,,
7498,escalope,green tea,,,,,,,,,,,,,,,,,


In [31]:
transactions=[]
for i in range(0, df_split.shape[0]):
    transactions.append([str(df_split.values[i,j]) for j in range(0, df_split.shape[1])])

In [32]:
transactions[0:5]

[['burgers',
  'meatballs',
  'eggs',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['chutney',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['turkey',
  'avocado',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['mineral water',
  'milk',
  'energy bar',
  'whole wheat rice',
  'green tea',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['low fat yogurt',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '']]
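
The padded lists above carry an empty string for every unused column. Below is a minimal sketch of one way to drop that padding and any stray whitespace, assuming each raw row is a comma-separated string as in `df['Items']`. The helper name `clean_basket` and the sample rows are illustrative and not part of the notebook.

```python
def clean_basket(raw_row):
    """Split a comma-separated basket, dropping blanks and stray whitespace."""
    items = (item.strip() for item in raw_row.split(','))
    return [item for item in items if item]

# Illustrative rows shaped like the first entries of df['Items']
raw_rows = ["burgers,meatballs,eggs", "chutney", "turkey, avocado,"]
baskets = [clean_basket(row) for row in raw_rows]
print(baskets)
```

In the notebook, the same helper could presumably be applied row by row with `df['Items'].apply(clean_basket)`, provided every row is a string.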

In [33]:
# appending to dataframe
df1 = pd.DataFrame(pd.Series(transactions))
df1.columns=['Items']
df1.head()

Unnamed: 0,Items
0,"[burgers, meatballs, eggs, , , , , , , , , , ,..."
1,"[chutney, , , , , , , , , , , , , , , , , , ]"
2,"[turkey, avocado, , , , , , , , , , , , , , , ..."
3,"[mineral water, milk, energy bar, whole wheat ..."
4,"[low fat yogurt, , , , , , , , , , , , , , , ,..."


In [34]:
# getting dummies
df_dummies = df1['Items'].str.join(sep=',').str.get_dummies(sep=',')
df_dummies.head()

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# duplicates
df_dummies.duplicated().sum()

2347

In [40]:
# handling duplicates
df_dummies1 = df_dummies.drop_duplicates()
df_dummies1.head()

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Association Rule Mining

In [41]:
!pip install mlxtend



I'm going to approach the problem in two ways:

1.   Including duplicates, since repeated baskets may reflect genuinely strong associations.
2.   Excluding duplicates, to limit their influence on the mined associations in case they are data-entry errors.
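
To see why the two approaches can give different numbers, here is a small sketch of how removing repeated baskets changes support. The toy baskets and the `support` helper are assumptions made for illustration; they are not taken from the dataset.

```python
# Toy baskets; the third basket repeats the first (illustrative data only)
baskets = [
    frozenset({"spaghetti", "mineral water"}),
    frozenset({"eggs"}),
    frozenset({"spaghetti", "mineral water"}),
    frozenset({"eggs", "mineral water"}),
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Order-preserving deduplication, analogous to DataFrame.drop_duplicates()
unique_baskets = list(dict.fromkeys(baskets))

pair = {"spaghetti", "mineral water"}
support_with = support(pair, baskets)            # 2 of 4 baskets
support_without = support(pair, unique_baskets)  # 1 of 3 baskets
print(support_with, support_without)
```

Support can move in either direction, depending on which baskets are repeated. Here the repeated basket contains the pair, so its support falls once duplicates are dropped.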



### With Duplicates

In [42]:
# frequent items
from mlxtend.frequent_patterns import apriori

frequent_items = apriori(df_dummies, min_support=0.01, use_colnames=True)
frequent_items



Unnamed: 0,support,itemsets
0,0.020267,(almonds)
1,0.033200,(avocado)
2,0.010800,(barbecue sauce)
3,0.014267,(black tea)
4,0.011467,(body spray)
...,...,...
254,0.011067,"(ground beef, milk, mineral water)"
255,0.017067,"(spaghetti, ground beef, mineral water)"
256,0.015733,"(spaghetti, milk, mineral water)"
257,0.010267,"(spaghetti, olive oil, mineral water)"


In [43]:
# sorting in descending order
frequent_items.sort_values('support', ascending=False, inplace=True)
frequent_items



Unnamed: 0,support,itemsets
46,0.238267,(mineral water)
19,0.179733,(eggs)
63,0.174133,(spaghetti)
24,0.170933,(french fries)
13,0.163867,(chocolate)
...,...,...
251,0.010133,"(spaghetti, mineral water, french fries)"
177,0.010133,"(frozen vegetables, low fat yogurt)"
123,0.010133,"(soup, chocolate)"
164,0.010000,"(shrimp, french fries)"


In [50]:
# association rules
from mlxtend.frequent_patterns import association_rules

# num_itemsets is the number of transactions in the original data, not the number of frequent itemsets
rules = association_rules(frequent_items, metric='lift', min_threshold=1, num_itemsets=len(df_dummies))
rules.sort_values('lift', ascending=False).head(10)



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
162,(ground beef),(herb & pepper),0.098267,0.049467,0.016,0.162822,3.291555,1.0,0.011139,1.135402,0.77206,0.121457,0.119255,0.243136
163,(herb & pepper),(ground beef),0.049467,0.098267,0.016,0.32345,3.291555,1.0,0.011139,1.332841,0.732423,0.121457,0.249723,0.243136
129,"(mineral water, spaghetti)",(ground beef),0.059733,0.098267,0.017067,0.285714,2.90754,1.0,0.011197,1.262427,0.697745,0.121097,0.207875,0.229696
132,(ground beef),"(mineral water, spaghetti)",0.098267,0.059733,0.017067,0.173677,2.90754,1.0,0.011197,1.137893,0.727562,0.121097,0.121182,0.229696
387,"(mineral water, spaghetti)",(olive oil),0.059733,0.065733,0.010267,0.171875,2.614731,1.0,0.00634,1.128171,0.656783,0.08912,0.11361,0.164031
390,(olive oil),"(mineral water, spaghetti)",0.065733,0.059733,0.010267,0.156187,2.614731,1.0,0.00634,1.114306,0.661001,0.08912,0.102581,0.164031
160,(frozen vegetables),(tomatoes),0.095333,0.0684,0.016133,0.169231,2.474134,1.0,0.009613,1.12137,0.658605,0.109304,0.108234,0.202549
161,(tomatoes),(frozen vegetables),0.0684,0.095333,0.016133,0.235867,2.474134,1.0,0.009613,1.183913,0.639564,0.109304,0.155344,0.202549
145,(frozen vegetables),(shrimp),0.095333,0.071333,0.016667,0.174825,2.45082,1.0,0.009866,1.125418,0.654355,0.111111,0.111441,0.204235
144,(shrimp),(frozen vegetables),0.071333,0.095333,0.016667,0.233645,2.45082,1.0,0.009866,1.18048,0.637444,0.111111,0.152887,0.204235


## Analysis & Interpretation

* Given herb & pepper, there is a 32% chance that the customer will also buy ground beef (lift 3.29, the strongest rule).
* Given mineral water and spaghetti, there is a 29% chance that the customer will also buy ground beef (lift 2.91).
* Given mineral water and spaghetti, there is a 17% chance that the customer will also buy olive oil (lift 2.61).
* Similarly, frozen vegetables are associated with tomatoes (lift 2.47) and with shrimp (lift 2.45).
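
The percentages above are the `confidence` column, and they can be reproduced from the support columns. The sketch below does the arithmetic for the rule (herb & pepper) -> (ground beef), using the rounded values shown in the rules table, so the results agree with the table only to a few decimal places.

```python
# Values copied from the rule (herb & pepper) -> (ground beef) in the rules table
antecedent_support = 0.049467   # support of (herb & pepper)
consequent_support = 0.098267   # support of (ground beef)
joint_support = 0.016           # support of both together (rounded in the display)

confidence = joint_support / antecedent_support
lift = confidence / consequent_support
leverage = joint_support - antecedent_support * consequent_support

print(round(confidence, 3), round(lift, 2), round(leverage, 4))
```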







### Excluding Duplicates

In [47]:
# frequent items
frequent_items1 = apriori(df_dummies1, min_support=0.01, use_colnames=True)
frequent_items1



Unnamed: 0,support,itemsets
0,0.029303,(almonds)
1,0.011062,(antioxydant juice)
2,0.045993,(avocado)
3,0.012614,(bacon)
4,0.015525,(barbecue sauce)
...,...,...
431,0.014749,"(spaghetti, olive oil, mineral water)"
432,0.016689,"(pancakes, spaghetti, mineral water)"
433,0.012420,"(shrimp, spaghetti, mineral water)"
434,0.010867,"(soup, spaghetti, mineral water)"


In [48]:
# sorting in descending order
frequent_items1.sort_values('support', ascending=False, inplace=True)
frequent_items1



Unnamed: 0,support,itemsets
54,0.299825,(mineral water)
73,0.230157,(spaghetti)
24,0.208034,(eggs)
17,0.203765,(chocolate)
30,0.192897,(french fries)
...,...,...
373,0.010091,"(burgers, milk, mineral water)"
377,0.010091,"(spaghetti, mineral water, chicken)"
402,0.010091,"(eggs, mineral water, french fries)"
388,0.010091,"(mineral water, chocolate, green tea)"


In [51]:
# association rules
# num_itemsets is the number of transactions in the original data, not the number of frequent itemsets
rules1 = association_rules(frequent_items1, metric='lift', min_threshold=1, num_itemsets=len(df_dummies1))
rules1.sort_values('lift', ascending=False).head(10)



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
671,(olive oil),(whole wheat pasta),0.08791,0.040753,0.011062,0.125828,3.087575,1.0,0.007479,1.09732,0.741288,0.094059,0.088689,0.198628
670,(whole wheat pasta),(olive oil),0.040753,0.08791,0.011062,0.271429,3.087575,1.0,0.007479,1.251888,0.704846,0.094059,0.201207,0.198628
533,(soup),"(milk, mineral water)",0.071221,0.067922,0.01242,0.174387,2.567474,1.0,0.007583,1.128953,0.657327,0.098009,0.114224,0.178622
532,"(milk, mineral water)",(soup),0.067922,0.071221,0.01242,0.182857,2.567474,1.0,0.007583,1.136618,0.655001,0.098009,0.120197,0.178622
147,(herb & pepper),(ground beef),0.066757,0.136425,0.022899,0.343023,2.514365,1.0,0.013792,1.314468,0.645368,0.127018,0.239236,0.255438
146,(ground beef),(herb & pepper),0.136425,0.066757,0.022899,0.167852,2.514365,1.0,0.013792,1.121487,0.697433,0.127018,0.108326,0.255438
740,(frozen vegetables),"(shrimp, mineral water)",0.130409,0.033573,0.010479,0.080357,2.393528,1.0,0.006101,1.050872,0.669518,0.068268,0.04841,0.196248
737,"(shrimp, mineral water)",(frozen vegetables),0.033573,0.130409,0.010479,0.312139,2.393528,1.0,0.006101,1.264195,0.602432,0.068268,0.208983,0.196248
512,"(frozen vegetables, spaghetti)",(ground beef),0.0392,0.136425,0.012614,0.321782,2.358668,1.0,0.007266,1.2733,0.599534,0.077381,0.214639,0.207122
513,(ground beef),"(frozen vegetables, spaghetti)",0.136425,0.0392,0.012614,0.092461,2.358668,1.0,0.007266,1.058687,0.667032,0.077381,0.055433,0.207122


## Analysis and Interpretation

* Given whole wheat pasta, there is a 27% chance that the customer will also buy olive oil (lift 3.09).
* Given milk and mineral water, there is an 18% chance that the customer will also buy soup (lift 2.57).
* Given herb & pepper, there is a 34% chance that the customer will also buy ground beef (lift 2.51).
* Similarly, (frozen vegetables, spaghetti) is associated with ground beef, and (shrimp, mineral water) with frozen vegetables.
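
The rule (herb & pepper) -> (ground beef) appears in both runs. Its confidence is slightly higher without duplicates, yet its lift is lower, because ground beef itself has a higher support once duplicates are removed. A short sketch using the values from the two rules tables:

```python
# (herb & pepper) -> (ground beef), values copied from the two rules tables
runs = {
    "with duplicates": {"confidence": 0.323450, "consequent_support": 0.098267},
    "without duplicates": {"confidence": 0.343023, "consequent_support": 0.136425},
}

# lift = confidence / support of the consequent
lifts = {name: r["confidence"] / r["consequent_support"] for name, r in runs.items()}
for name, value in lifts.items():
    print(f"{name}: lift = {value:.3f}")
```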


## Interview Questions



1. **What is lift and why is it important in Association rules?**

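**Lift** measures the strength of the association in a rule by comparing how often the items actually co-occur with how often they would co-occur if they were independent. A lift above 1 means X and Y occur together more often than expected under independence, a lift of 1 means they are independent, and a lift below 1 means they occur together less often than expected. It is important because, unlike confidence, it accounts for how common the consequent already is, so a rule is not judged strong merely because its consequent is popular.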


$$\text{Lift}(X \rightarrow Y) = \frac{P(Y|X)}{P(Y)} = \frac{P(X,Y)}{P(X)P(Y)}$$

2.   **What are support and confidence, and how do you calculate them?**

**Support** measures how frequently an itemset appears in the dataset.

$$\text{Support}(X \rightarrow Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}$$

**Confidence** is the conditional probability of finding the consequent **Y** in a transaction, given that the antecedent **X** is present.

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} = P(Y|X)$$
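
As a rough illustration of these definitions, the sketch below computes support, confidence, and lift directly from a list of transactions. The toy transactions and function names are assumptions made for the example; this is not how `mlxtend` computes the metrics internally.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(X and Y together) / Support(X), an estimate of P(Y | X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence(X -> Y) / Support(Y)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Illustrative transactions, not taken from the dataset
transactions = [
    ["spaghetti", "ground beef", "mineral water"],
    ["spaghetti", "ground beef"],
    ["spaghetti", "olive oil"],
    ["mineral water", "eggs"],
    ["eggs"],
]
X, Y = ["spaghetti"], ["ground beef"]
support_xy = support(X + Y, transactions)        # 2 of 5 transactions
confidence_xy = confidence(X, Y, transactions)   # (2/5) / (3/5)
lift_xy = lift(X, Y, transactions)               # (2/3) / (2/5)
print(support_xy, confidence_xy, lift_xy)
```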

3. **What are some limitations or challenges of association rule mining?**



*   The number of candidate itemsets grows exponentially with the number of distinct items, so mining large datasets is computationally intensive.
*   Results are sensitive to the chosen thresholds. Thresholds set too high can discard important associations, while thresholds set too low produce many weak or unwanted associations.
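
Both limitations can be seen on a toy example. The sketch below enumerates every possible itemset by brute force (which Apriori avoids through pruning) and counts how many survive at different support thresholds. The transactions and the `frequent_itemsets` helper are illustrative assumptions.

```python
from itertools import combinations

# Illustrative transactions, not taken from the dataset
transactions = [
    {"spaghetti", "ground beef", "mineral water"},
    {"spaghetti", "ground beef"},
    {"spaghetti", "olive oil"},
    {"mineral water", "eggs"},
    {"eggs"},
]
items = sorted(set().union(*transactions))

def frequent_itemsets(min_support):
    """Brute-force search over all non-empty itemsets (2**n - 1 candidates)."""
    found = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = sum(set(candidate) <= t for t in transactions) / len(transactions)
            if s >= min_support:
                found[candidate] = s
    return found

# Number of itemsets kept at each threshold, and the size of the search space
counts = {threshold: len(frequent_itemsets(threshold)) for threshold in (0.2, 0.4, 0.6)}
n_candidates = 2 ** len(items) - 1
print(counts, n_candidates)
```

With only five distinct items there are already 31 candidate itemsets, and each additional item doubles the search space. The number of itemsets kept also changes sharply with the threshold.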

