# Association Rule - Apriori and ECLAT 

Training association rule models (Apriori and ECLAT) to find the most related items bought by customers of a french supermarket during a week. All 7501 lines of the dataset represent items bought by an unique customer, during this week.

This algorithm associate products preferences by most of the customers and can be used to generate products recommendation and help on displaying products strategy.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Data Loading
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

# Adding all customers into a list of lists
transactions = []
for i in range(0, len(dataset)):
    transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])

In [3]:
dataset.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
5,low fat yogurt,,,,,,,,,,,,,,,,,,,
6,whole wheat pasta,french fries,,,,,,,,,,,,,,,,,,
7,soup,light cream,shallot,,,,,,,,,,,,,,,,,
8,frozen vegetables,spaghetti,green tea,,,,,,,,,,,,,,,,,
9,french fries,,,,,,,,,,,,,,,,,,,


### Apriori implementation using apyori library 
source: https://github.com/ymoch/apyori

The output of this part is to see which are the products that used to be more bought in combination compared to other combinations using apriori algorithm.

We will put some transformations to fit on dataframes and to make the visualization easier.

In [4]:
# Inspecting elements
transactions[:3]

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers',
  'meatballs',
  'eggs',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['chutney',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan']]

In [5]:
# Training Apriori on the dataset
# The hyperparameters choosen on this training are:
# min_support = items bought more than 3 times a day * 7 days (week) / 7500 customers = 0.0028
# min_confidence: at least 20%, min_lift = minimum of 3 (less than that is too low)
# min_length: we want at least 2 items to be associated. No point in having a single item in the result

from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

In [6]:
# Visualising the results
results = list(rules)

In [7]:
lift = []
association = []
for i in range (0, len(results)):
    lift.append(results[:len(results)][i][2][0][3])
    association.append(list(results[:len(results)][i][0]))

### Visualizing results in a dataframe

In [8]:
rank = pd.DataFrame([association, lift]).transpose()
rank.columns = ['Association', 'Lift']

In [9]:
# Show top 10 higher lift scores
rank.sort_values('Lift', ascending=False).head(10)

Unnamed: 0,Association,Lift
128,"[mineral water, whole wheat pasta, nan, olive ...",6.11586
58,"[mineral water, whole wheat pasta, olive oil]",6.11586
96,"[mineral water, soup, milk, frozen vegetables]",5.48441
146,"[soup, milk, frozen vegetables, mineral water,...",5.48441
28,"[fromage blanc, honey, nan]",5.16427
3,"[fromage blanc, honey]",5.16427
16,"[nan, chicken, light cream]",4.84395
0,"[chicken, light cream]",4.84395
2,"[escalope, pasta]",4.70081
26,"[escalope, nan, pasta]",4.70081


By the study, "olive oil, whole wheat pasta, mineral water" are the most commom combined items from this week for the supermarket in question.  

## ECLAT Implementation

This is an implementation of the ECLAT code by hand. It calculate the pairs that have been bought more frequently comparing to other pairs. At the end, we expect to see what is the most common combination of products during the week. 

An extension of the code can calculate the three most common combination, 4, and so on.

#### Getting the list of products bought this week by all customers

In [10]:
# Putting all transactions in a single list
itens = []
for i in range(0, len(transactions)):
    itens.extend(transactions[i])

# Finding unique items from transactions and removing nan
uniqueItems = list(set(itens))
uniqueItems.remove('nan')

In [11]:
uniqueItems

['fresh tuna',
 'strong cheese',
 'meatballs',
 'low fat yogurt',
 'magazines',
 'shampoo',
 'mint',
 'candy bars',
 'tomato juice',
 'chocolate',
 'corn',
 'green tea',
 'tomatoes',
 'body spray',
 'gluten free bar',
 'champagne',
 'burger sauce',
 'strawberries',
 'fromage blanc',
 'salt',
 'mayonnaise',
 'cookies',
 'cooking oil',
 'barbecue sauce',
 'bacon',
 'shrimp',
 'cider',
 'yogurt cake',
 'melons',
 'ketchup',
 'spaghetti',
 'muffins',
 'ground beef',
 'asparagus',
 'cauliflower',
 'energy drink',
 'dessert wine',
 'chutney',
 'extra dark chocolate',
 'french wine',
 'pet food',
 'pepper',
 'french fries',
 'antioxydant juice',
 'spinach',
 'oatmeal',
 'sandwich',
 'mashed potato',
 'whole wheat rice',
 'energy bar',
 'hot dogs',
 'shallot',
 'tomato sauce',
 'frozen smoothie',
 'rice',
 'bramble',
 'pickles',
 'soda',
 'hand protein bar',
 'clothes accessories',
 'salad',
 'oil',
 'green grapes',
 'brownies',
 'soup',
 'eggplant',
 'cottage cheese',
 'whole wheat pasta',
 '

#### Creating combinations with the items - pairs

In [12]:
pair = []
for j in range(0, len(uniqueItems)):
    k = 1;
    while k <= len(uniqueItems):
        try:
            pair.append([uniqueItems[j], uniqueItems[j+k]])
        except IndexError:
            pass
        k = k + 1;       

In [13]:
pair

[['fresh tuna', 'strong cheese'],
 ['fresh tuna', 'meatballs'],
 ['fresh tuna', 'low fat yogurt'],
 ['fresh tuna', 'magazines'],
 ['fresh tuna', 'shampoo'],
 ['fresh tuna', 'mint'],
 ['fresh tuna', 'candy bars'],
 ['fresh tuna', 'tomato juice'],
 ['fresh tuna', 'chocolate'],
 ['fresh tuna', 'corn'],
 ['fresh tuna', 'green tea'],
 ['fresh tuna', 'tomatoes'],
 ['fresh tuna', 'body spray'],
 ['fresh tuna', 'gluten free bar'],
 ['fresh tuna', 'champagne'],
 ['fresh tuna', 'burger sauce'],
 ['fresh tuna', 'strawberries'],
 ['fresh tuna', 'fromage blanc'],
 ['fresh tuna', 'salt'],
 ['fresh tuna', 'mayonnaise'],
 ['fresh tuna', 'cookies'],
 ['fresh tuna', 'cooking oil'],
 ['fresh tuna', 'barbecue sauce'],
 ['fresh tuna', 'bacon'],
 ['fresh tuna', 'shrimp'],
 ['fresh tuna', 'cider'],
 ['fresh tuna', 'yogurt cake'],
 ['fresh tuna', 'melons'],
 ['fresh tuna', 'ketchup'],
 ['fresh tuna', 'spaghetti'],
 ['fresh tuna', 'muffins'],
 ['fresh tuna', 'ground beef'],
 ['fresh tuna', 'asparagus'],
 ['fre

#### Calculating score
The calculation is done looking at the number of customers that bought both items (the pair) and divided by all customers of the week (7501). This calculation is done for all pairs possible and the score is returned on "score" list.

$ score = \frac{\text{number of lists that contain [item x and item y]}} {\text{number of all lists}} $

In [14]:
%%time
score = []
for i in pair:
    cond = []
    for item in i:
        cond.append('("%s") in s' %item)
    mycode = ('[s for s in transactions if ' + ' and '.join(cond) + ']')
    #mycode = "print 'hello world'"
    score.append(len(eval(mycode))/7501.)

CPU times: user 21.7 s, sys: 4.03 ms, total: 21.7 s
Wall time: 21.7 s


#### Showing results

Top 10 Most common pairs of items of this week

In [15]:
ranking_ECLAT = pd.DataFrame([pair, score]).transpose()
ranking_ECLAT.columns = ['Pair', 'Score']

In [16]:
ranking_ECLAT.sort_values('Score', ascending=False).head(10)

Unnamed: 0,Pair,Score
3174,"[spaghetti, mineral water]",0.0597254
1095,"[chocolate, mineral water]",0.0526596
5948,"[mineral water, eggs]",0.0509265
5926,"[mineral water, milk]",0.0479936
3349,"[ground beef, mineral water]",0.0409279
3136,"[spaghetti, ground beef]",0.0391948
1055,"[chocolate, spaghetti]",0.0391948
3208,"[spaghetti, eggs]",0.0365285
4198,"[french fries, eggs]",0.0363951
5929,"[mineral water, frozen vegetables]",0.0357286


### What if we do that for trios?

In [17]:
# Creating trios
trio = []
for j in range(0, len(uniqueItems)):
    for k in range(j, len(uniqueItems)):
        for l in range(k, len(uniqueItems)):
            if (k != j) and (j != l) and (k != l):
                try:
                    trio.append([uniqueItems[j], uniqueItems[j+k], uniqueItems[j+l]])
                except IndexError:
                    pass 

In [18]:
trio[:5]

[['fresh tuna', 'strong cheese', 'meatballs'],
 ['fresh tuna', 'strong cheese', 'low fat yogurt'],
 ['fresh tuna', 'strong cheese', 'magazines'],
 ['fresh tuna', 'strong cheese', 'shampoo'],
 ['fresh tuna', 'strong cheese', 'mint']]

In [19]:
%%time
score_trio = []
for i in trio:
    cond = []
    for item in i:
        cond.append('("%s") in s' %item)
    mycode = ('[s for s in transactions if ' + ' and '.join(cond) + ']')
    #mycode = "print 'hello world'"
    score_trio.append(len(eval(mycode))/7501.)

CPU times: user 6min 13s, sys: 116 ms, total: 6min 13s
Wall time: 6min 13s


In [20]:
ranking_ECLAT_trio = pd.DataFrame([trio, score_trio]).transpose()
ranking_ECLAT_trio.columns = ['Trio', 'Score']
ranking_ECLAT_trio.sort_values('Score', ascending=False).head(10)

Unnamed: 0,Trio,Score
56149,"[chocolate, spaghetti, mineral water]",0.0158646
125142,"[spaghetti, mineral water, milk]",0.0157312
125164,"[spaghetti, mineral water, eggs]",0.0142648
58901,"[chocolate, mineral water, milk]",0.0139981
58923,"[chocolate, mineral water, eggs]",0.0134649
125145,"[spaghetti, mineral water, frozen vegetables]",0.0119984
125135,"[spaghetti, mineral water, pancakes]",0.0114651
128223,"[ground beef, mineral water, milk]",0.0110652
56324,"[chocolate, ground beef, mineral water]",0.0109319
56161,"[chocolate, spaghetti, milk]",0.0109319


## What about comparing the results from Apriori and ECLAT?

We got from Apriori that the combination that lead to more "attractiveness power" is "olive oil", "whole wheat pasta" and "mineral water". If we run the ECLAT code for this set of items, we will obtain: 0.0039.

This score of 3 items has not enough score to be placed among top 10, but they are measuring different metrics.  According to apriori these are the items that when picked one lead to another items more frequently than other combinations, i.e. when a person pick 'olive oil', the probability of picking 'whole wheat pasta' and 'mineral water' is much higher than picking another combination. ECLAT in another hand is just sorting as the most common combinations of all lists, not caring about how one item isolatedly can influence in the purchase of another.

In [21]:
i = ["olive oil", "whole wheat pasta", "mineral water"]
cond = []
for item in i:
    cond.append('("{}") in s'.format(item))
mycode = ('[s for s in transactions if ' + ' and '.join(cond) + ']')
tra = eval(mycode)

In [22]:
cond

['("olive oil") in s', '("whole wheat pasta") in s', '("mineral water") in s']

In [23]:
mycode

'[s for s in transactions if ("olive oil") in s and ("whole wheat pasta") in s and ("mineral water") in s]'

In [24]:
tra

[['herb & pepper',
  'whole wheat pasta',
  'ground beef',
  'mineral water',
  'olive oil',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['whole wheat pasta',
  'spaghetti',
  'mineral water',
  'olive oil',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['whole wheat pasta',
  'mineral water',
  'olive oil',
  'pancakes',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['frozen vegetables',
  'whole wheat pasta',
  'ground beef',
  'mineral water',
  'chocolate',
  'milk',
  'olive oil',
  'almonds',
  'french wine',
  'yogurt cake',
  'fresh bread',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['whole wheat pasta',
  'mineral water',
  'olive oil',
  'cooking oi

In [25]:
print ('Score for "olive oil", "whole wheat pasta", "mineral water": {}'.format(len(tra)/7501.))

Score for "olive oil", "whole wheat pasta", "mineral water": 0.0038661511798426876
