# Association Rule - Apriori and ECLAT 

Training association rule models (Apriori and ECLAT) to find the most related items bought by customers of a French supermarket during a week. Each of the 7501 rows of the dataset lists the items bought by a unique customer during that week.

These algorithms identify products that customers tend to buy together, which can be used to generate product recommendations and to inform product placement strategy.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Data Loading
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

# Adding all customers into a list of lists
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])

In [3]:
dataset.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


### Apriori implementation using apyori library 
source: https://github.com/ymoch/apyori

The goal of this part is to find, using the Apriori algorithm, which products tend to be bought together more often than other combinations.


In [4]:
# Inspecting elements
transactions[:2]

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers',
  'meatballs',
  'eggs',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan']]
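
The `'nan'` entries are the string form of the empty cells that pandas fills in when a row has fewer than 20 items. Left in place, they are treated as if they were a product. A minimal sketch of one way to build the transactions without them, reading the rows with the standard `csv` module; the two-line `sample` stands in for the real file and `clean_transactions` is an illustrative name, not part of the notebook:

```python
import csv
import io

# Two rows shaped like the CSV file: a variable number of items per line.
sample = io.StringIO("burgers,meatballs,eggs\nchutney\n")

# csv.reader yields only the cells present on each line; the `if item`
# filter also drops empty cells left by any trailing commas.
clean_transactions = [[item for item in row if item] for row in csv.reader(sample)]
```

With the real data, `sample` would be replaced by an open handle on `Market_Basket_Optimisation.csv`.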

In [5]:
# Training Apriori on the dataset
# The hyperparameters chosen for this training are:
# min_support: items bought 3 times a day * 7 days (week) / 7500 customers = 0.0028, rounded to 0.003
# min_confidence: at least 20%; min_lift: at least 3 (anything lower is too weak)

from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
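
The three thresholds follow the usual definitions: support is the share of transactions containing an itemset, the confidence of a rule A -> B is support(A and B) / support(A), and lift is that confidence divided by support(B). A small sketch of these definitions on toy baskets; the helper names and the `baskets` data are illustrative and are not apyori's API:

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= set(b)) / float(len(baskets))


def confidence(base, add, baskets):
    """Share of the baskets containing `base` that also contain `add`."""
    return support(set(base) | set(add), baskets) / support(base, baskets)


def lift(base, add, baskets):
    """Confidence of base -> add relative to the plain support of `add`."""
    return confidence(base, add, baskets) / support(add, baskets)


baskets = [
    ['pasta', 'olive oil'],
    ['pasta', 'olive oil', 'water'],
    ['water'],
    ['water', 'eggs'],
]

pasta_oil_support = support(['pasta', 'olive oil'], baskets)  # 2 of 4 baskets
pasta_oil_lift = lift(['pasta'], ['olive oil'], baskets)      # 1.0 / 0.5
```

A lift above 1 means the items occur together more often than they would if bought independently, which is why `min_lift = 3` keeps only strongly associated combinations.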

In [6]:
# Materialising the rules generator into a list
results = list(rules)

In [7]:
lift = []
association = []
for rule in results:
    # rule[2][0][3] is the lift of the rule's first ordered statistic
    lift.append(rule[2][0][3])
    # rule[0] is the set of items that make up the rule
    association.append(list(rule[0]))

### Visualizing results in a dataframe

In [8]:
rank = pd.DataFrame([association, lift]).T
rank.columns = ['Association', 'Lift']

In [9]:
# Show the 10 highest lift scores
rank.sort_values('Lift', ascending=False).head(10)

Unnamed: 0,Association,Lift
128,"[nan, whole wheat pasta, olive oil, mineral wa...",6.11586
58,"[whole wheat pasta, olive oil, mineral water]",6.11586
96,"[mineral water, soup, milk, frozen vegetables]",5.48441
146,"[frozen vegetables, nan, soup, mineral water, ...",5.48441
28,"[honey, nan, fromage blanc]",5.16427
3,"[honey, fromage blanc]",5.16427
16,"[chicken, nan, light cream]",4.84395
0,"[chicken, light cream]",4.84395
2,"[escalope, pasta]",4.70081
26,"[nan, escalope, pasta]",4.70081


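According to this analysis, "olive oil, whole wheat pasta, mineral water" is the combination with the highest lift for this week at the supermarket in question, i.e. the most strongly associated items rather than the most frequently bought ones.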
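
In the table above, each itemset appears twice, once with and once without `nan`, because the empty cells were loaded as an item. A sketch of removing those duplicates after the fact, assuming `association` and `lift` lists shaped like the ones built earlier (the values here are copied from the table purely for illustration, and `deduped` is a hypothetical name):

```python
association = [
    ['nan', 'escalope', 'pasta'],
    ['escalope', 'pasta'],
    ['honey', 'nan', 'fromage blanc'],
    ['honey', 'fromage blanc'],
]
lift = [4.70081, 4.70081, 5.16427, 5.16427]

seen = set()
deduped = []
for items, value in zip(association, lift):
    # Ignore the placeholder so both variants map to the same key.
    key = frozenset(item for item in items if item != 'nan')
    if key in seen:
        continue
    seen.add(key)
    deduped.append((sorted(key), value))

# Highest lift first.
deduped.sort(key=lambda entry: entry[1], reverse=True)
```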

## ECLAT Implementation

This is a hand-written implementation of the ECLAT idea. It calculates how frequently each pair of items was bought compared with other pairs. At the end, we expect to see the most common combination of products during the week.

The code can be extended to find the most common combinations of three items, four items, and so on.
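
ECLAT is usually described over a vertical layout: each item maps to the set of transaction ids (a tid-set) that contain it, and the support of a larger itemset is the size of the intersection of its items' tid-sets. A minimal sketch of that idea on toy data, which works for any itemset size by changing `size`. The function names and `baskets` are illustrative, and the depth-first search with minimum-support pruning of a full ECLAT is left out:

```python
from itertools import combinations


def build_tidsets(baskets):
    """Map each item to the set of basket indices that contain it."""
    tidsets = {}
    for tid, basket in enumerate(baskets):
        for item in basket:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def itemset_support(itemset, tidsets, n_baskets):
    """Support of `itemset`, obtained by intersecting tid-sets."""
    tids = set.intersection(*(tidsets[item] for item in itemset))
    return len(tids) / float(n_baskets)


baskets = [
    ['mineral water', 'spaghetti', 'eggs'],
    ['mineral water', 'spaghetti'],
    ['eggs', 'milk'],
    ['mineral water', 'milk'],
]
tidsets = build_tidsets(baskets)

size = 2  # set to 3 for trios, 4 for quartets, and so on
scores = {
    combo: itemset_support(combo, tidsets, len(baskets))
    for combo in combinations(sorted(tidsets), size)
}
best = max(scores, key=scores.get)
```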

#### Getting the list of products bought this week by all customers

In [11]:
# Putting all transactions in a single list
itens = []
for i in range(0, len(transactions)):
    itens.extend(transactions[i])

# Finding unique items from transactions and removing nan
uniqueItems = list(set(itens))
uniqueItems.remove('nan')

In [12]:
# test code
#tra = [s for s in transactions if ("mineral water") in s and ("ground beef") in s and ("shrimp") in s]

#### Creating combinations with the items - pairs

In [13]:
# Build every unordered pair of distinct items exactly once
from itertools import combinations
pair = [list(p) for p in combinations(uniqueItems, 2)]

#### Calculating score
The score is the number of customers that bought both items of the pair divided by the total number of customers in the week (7501). It is computed for every possible pair and stored in the `score` list.

<center> <b>score = (# lists that contain both item x and item y) / (# all lists)</b> </center>

In [14]:
score = []
for i in pair:
    # Count the transactions that contain every item of the pair
    matches = [s for s in transactions if all(item in s for item in i)]
    score.append(len(matches)/7501.)
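
The loop above scans all transactions once per candidate pair. An alternative sketch counts the pairs in a single pass over the transactions with `collections.Counter`, so only pairs that actually occur are ever stored. The `baskets` list is a toy stand-in for `transactions`, and `pair_counts` / `pair_scores` are illustrative names:

```python
from collections import Counter
from itertools import combinations

baskets = [
    ['mineral water', 'spaghetti', 'nan'],
    ['spaghetti', 'mineral water', 'eggs'],
    ['eggs', 'nan', 'nan'],
]

pair_counts = Counter()
for basket in baskets:
    # Drop the 'nan' placeholder and sort so each pair has one canonical order.
    items = sorted(set(basket) - {'nan'})
    pair_counts.update(combinations(items, 2))

# Same score as before: co-occurrence count divided by the number of baskets.
pair_scores = {p: c / float(len(baskets)) for p, c in pair_counts.items()}
top_pair = pair_counts.most_common(1)[0][0]
```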

#### Showing results

Top 10 most common pairs of items this week

In [15]:
ranking_ECLAT = pd.DataFrame([pair, score]).T
ranking_ECLAT.columns = ['Pair', 'Score']

In [16]:
ranking_ECLAT.sort_values('Score', ascending=False).head(10)

Unnamed: 0,Pair,Score
257,"[mineral water, spaghetti]",0.0597254
327,"[mineral water, chocolate]",0.0526596
328,"[mineral water, eggs]",0.0509265
314,"[mineral water, milk]",0.0479936
353,"[mineral water, ground beef]",0.0409279
2579,"[spaghetti, ground beef]",0.0391948
2553,"[spaghetti, chocolate]",0.0391948
2554,"[spaghetti, eggs]",0.0365285
3373,"[french fries, eggs]",0.0363951
252,"[mineral water, frozen vegetables]",0.0357286


### What if we do that for trios?

In [27]:
# Creating trios
trio = []
for j in range(0, len(uniqueItems)):
    for k in range(j, len(uniqueItems)):
        for l in range(k, len(uniqueItems)):
            if (k != j) and (j != l) and (k != l):
                try:
                    trio.append([uniqueItems[j], uniqueItems[j+k], uniqueItems[j+l]])
                except IndexError:
                    pass 
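
The index arithmetic above (`j+k`, `j+l`) appears to skip some combinations once `j > 0`: with `j = 1` the smallest `k` is 2, so the second item starts at index 3 and a trio such as indices (1, 2, 3) is never produced. If that reading is right, the trio ranking may be missing candidates. A sketch that compares the loop's indexing with `itertools.combinations`, which enumerates every 3-item combination exactly once, on a short stand-in list instead of the real `uniqueItems`:

```python
from itertools import combinations

unique_items = ['a', 'b', 'c', 'd', 'e']  # stand-in for uniqueItems
n = len(unique_items)

# Every 3-item combination, each exactly once.
all_trios = [list(t) for t in combinations(unique_items, 3)]

# Same indexing as the loop above, with the IndexError case written as a bound check.
loop_trios = []
for j in range(n):
    for k in range(j + 1, n):
        for l in range(k + 1, n):
            if j + l < n:
                loop_trios.append(
                    [unique_items[j], unique_items[j + k], unique_items[j + l]]
                )

missing = [t for t in all_trios if t not in loop_trios]
```

On this five-item list, `missing` holds the trios that the loop's indexing does not generate.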

In [29]:
trio[:5]

[['pet food', 'green tea', 'whole wheat rice'],
 ['pet food', 'green tea', 'antioxydant juice'],
 ['pet food', 'green tea', 'chicken'],
 ['pet food', 'green tea', 'milk'],
 ['pet food', 'green tea', 'mint green tea']]

In [30]:
score_trio = []
for i in trio:
    # Count the transactions that contain every item of the trio
    matches = [s for s in transactions if all(item in s for item in i)]
    score_trio.append(len(matches)/7501.)

In [31]:
ranking_ECLAT_trio = pd.DataFrame([trio, score_trio]).T
ranking_ECLAT_trio.columns = ['Trio', 'Score']
ranking_ECLAT_trio.sort_values('Score', ascending=False).head(10)

Unnamed: 0,Trio,Score
134586,"[spaghetti, chocolate, mineral water]",0.0158646
35350,"[milk, spaghetti, mineral water]",0.0157312
135293,"[spaghetti, mineral water, eggs]",0.0142648
37930,"[milk, chocolate, mineral water]",0.0139981
38637,"[milk, mineral water, eggs]",0.0130649
86786,"[frozen vegetables, spaghetti, mineral water]",0.0119984
37543,"[milk, ground beef, mineral water]",0.0110652
33418,"[milk, frozen vegetables, mineral water]",0.0110652
35320,"[milk, spaghetti, chocolate]",0.0109319
134588,"[spaghetti, chocolate, eggs]",0.0105319


## What about comparing the results from Apriori and ECLAT?

Apriori identified "olive oil", "whole wheat pasta" and "mineral water" as the combination with the highest lift. Running the ECLAT code on this set of items gives a score of 0.0039.

This score is too low for the trio to rank among the top 10, but the two approaches measure different things. Apriori, as used here, ranks by lift: how much more often the items are bought together than would be expected if they were picked independently. In other words, a customer who picks part of this combination is far more likely than the average customer to pick the rest of it. ECLAT, on the other hand, simply ranks combinations by how often they occur across all transactions, regardless of how strongly one item influences the purchase of another.

In [33]:
i = ["olive oil", "whole wheat pasta", "mineral water"]
# Transactions that contain all three items
tra = [s for s in transactions if all(item in s for item in i)]




In [34]:
print 'Score for "olive oil", "whole wheat pasta", "mineral water":', len(tra)/7501.

Score for "olive oil", "whole wheat pasta", "mineral water": 0.00386615117984
