# Understanding the apriori algorithm

Let's study three key concepts that we need to get our algorithm working well:
<br>
First, we need to define our **support**. In our example (supermarket analysis), it is the share of all shopping carts that contain a product or a combination of products. When we set a minimum support, we remove from our rules every product and combination that appears less often than that threshold.<br><br>
Imagine that we have 10 customers.<br>
- 4 of them bought coffee -> coffee support is 0.4<br>
- 3 of those 4 also bought bread -> coffee and bread (together) support is 0.3<br>
- 2 customers bought milk -> milk support is 0.2<br><br>

Alright, if we set our minimum support to 0.3, we keep (coffee) and (coffee + bread) in our rules, but drop milk.<br><br>
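To make these numbers concrete, here is a minimal sketch that recomputes the supports in plain Python. The ten carts are an assumption built to match the example (the text does not say what the other four customers bought, so they get a placeholder item), and `support` is a hypothetical helper, not a library function.

```python
# Ten toy carts matching the example. 'other' is a placeholder item
# (an assumption) for the four customers the example does not describe.
carts = (
    [{'coffee', 'bread'}] * 3   # coffee buyers who also bought bread
    + [{'coffee'}]              # the fourth coffee buyer
    + [{'milk'}] * 2
    + [{'other'}] * 4
)

def support(itemset, carts):
    """Fraction of carts that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= cart for cart in carts) / len(carts)

min_support = 0.3
candidates = [{'coffee'}, {'coffee', 'bread'}, {'milk'}]

# Keep only the itemsets that reach the minimum support.
kept = [c for c in candidates if support(c, carts) >= min_support]

for c in candidates:
    print(sorted(c), support(c, carts))
print('kept:', [sorted(c) for c in kept])
```

With this threshold, `kept` holds (coffee) and (coffee + bread), while milk falls out.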

The second parameter we need to define is the **confidence**: of the customers who bought A, the fraction who also bought B. In our example:<br>
- If a customer buys coffee, then they buy bread -> 3/4 = 0.75 (because we have 4 coffee buyers and 3 of them also bought bread)
- If a customer buys bread, then they buy coffee -> 3/3 = 1.0 (because everybody who bought bread also bought coffee)
<br><br>
When we set our minimum confidence to 0.8, we exclude every rule below it, so (coffee -> bread) is dropped while (bread -> coffee) stays.<br><br>
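The two confidences can be checked with a short sketch. It assumes the same ten toy carts as the example (the four undescribed customers get a placeholder item), and `count` and `confidence` are hypothetical helper names.

```python
# Toy carts for the example: 3 x (coffee + bread), 1 x coffee only, 2 x milk,
# and 4 carts holding a placeholder item (an assumption).
carts = [{'coffee', 'bread'}] * 3 + [{'coffee'}] + [{'milk'}] * 2 + [{'other'}] * 4

def count(itemset, carts):
    """Number of carts that contain every item in `itemset`."""
    return sum(set(itemset) <= cart for cart in carts)

def confidence(a, b, carts):
    """Confidence of the rule a -> b: carts with a and b / carts with a."""
    return count(set(a) | set(b), carts) / count(a, carts)

min_confidence = 0.8
rules = [({'coffee'}, {'bread'}), ({'bread'}, {'coffee'})]

# Keep only the rules that reach the minimum confidence.
kept = [(a, b) for a, b in rules if confidence(a, b, carts) >= min_confidence]

for a, b in rules:
    print(sorted(a), '->', sorted(b), confidence(a, b, carts))
```

Working with counts instead of dividing two supports keeps the arithmetic exact (3/4 and 3/3).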

The last key concept we need to know about is the **lift**. It is the confidence of (A -> B) divided by the support of (B):<br>
- Confidence of (coffee -> bread) = 0.75
- Support of (bread) = 3/10 = 0.3 (3 of the 10 customers bought bread)
- So, the lift for this combination is 0.75 / 0.3 = 2.5<br><br>
But what does it mean?<br>
It means that a customer who buys coffee (A) is 2.5 times as likely to buy bread (B) as a customer picked at random.<br><br>
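
As a quick check of this arithmetic, the sketch below computes the lift from the toy example. The carts are an assumption that matches the numbers in the text (with a placeholder item for the four undescribed customers), and `count` and `lift` are hypothetical helpers.

```python
# Toy carts for the example: 3 x (coffee + bread), 1 x coffee only, 2 x milk,
# and 4 carts holding a placeholder item (an assumption).
carts = [{'coffee', 'bread'}] * 3 + [{'coffee'}] + [{'milk'}] * 2 + [{'other'}] * 4

def count(itemset, carts):
    """Number of carts that contain every item in `itemset`."""
    return sum(set(itemset) <= cart for cart in carts)

def lift(a, b, carts):
    """Lift of the rule a -> b: confidence(a -> b) / support(b)."""
    confidence = count(set(a) | set(b), carts) / count(a, carts)
    support_b = count(b, carts) / len(carts)
    return confidence / support_b

print(round(lift({'coffee'}, {'bread'}, carts), 2))
print(round(lift({'bread'}, {'coffee'}, carts), 2))
```

Lift is symmetric, so both directions of the rule give the same value. A lift close to 1 would mean that buying A tells us little about buying B.<br><br>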
Okay, sorry for the long theoretical exposition, but I think it is really important. Let's finally get our algorithm working...



# Importing required modules

In [1]:
import pandas as pd
from apyori import apriori

# 1. Loading our dataset

In [2]:
df = pd.read_csv('market_df.csv', header=None)

In [3]:
df.head()

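| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | | | | | | | | | | | | | | | | | |
| 2 | chutney | | | | | | | | | | | | | | | | | | | |
| 3 | turkey | avocado | | | | | | | | | | | | | | | | | | |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | | | | | | | | | | | | | | | |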


In this dataset, we have 7501 rows. Each row is one shopping cart that a customer bought at this market over one week (7 days). For example:
- The first customer bought **shrimp**, **almonds**, **avocado** and much more
- The second customer bought **burgers**, **meatballs** and **eggs**
- The third customer bought only **chutney**
- And so forth...

# 2. Preprocessing the data

It is very important to know that the **apriori** algorithm requires the input data as a list of lists, one list per shopping cart. So let's get it done now...

In [4]:
transactions = []
for i in range(df.shape[0]):
    transactions.append([str(df.values[i, j]) for j in range(df.shape[1])])

Now we have a list of lists! Every single customer is represented by a list. Let's check the first three customers to better understand what we are doing.

In [5]:
# First customer
transactions[0]

['shrimp',
 'almonds',
 'avocado',
 'vegetables mix',
 'green grapes',
 'whole weat flour',
 'yams',
 'cottage cheese',
 'energy drink',
 'tomato juice',
 'low fat yogurt',
 'green tea',
 'honey',
 'salad',
 'mineral water',
 'salmon',
 'antioxydant juice',
 'frozen smoothie',
 'spinach',
 'olive oil']

In [6]:
# Second customer
transactions[1]

['burgers',
 'meatballs',
 'eggs',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan']

In [7]:
# Third customer
transactions[2]

['chutney',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan',
 'nan']
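
The `'nan'` strings above are what `str()` produces for the empty cells, so each one ends up in the cart as if it were a product. A minimal sketch of a variant that filters them out is shown here, using two shortened carts typed by hand. The rest of this notebook keeps the padded lists, so the outputs further down were not produced with this variant.

```python
# Two padded carts shaped like the ones above (shortened for readability).
padded = [
    ['burgers', 'meatballs', 'eggs', 'nan', 'nan'],
    ['chutney', 'nan', 'nan', 'nan', 'nan'],
]

# Keep only real product names; 'nan' is the string form of a missing cell.
clean = [[item for item in cart if item != 'nan'] for cart in padded]

print(clean)
```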

# 3. Getting our rules

First, let's define:<br>
- min_support = 0.3
- min_confidence = 0.8
- min_lift = 3

In [8]:
rules = apriori(transactions, min_support=0.3, min_confidence=0.8, min_lift=3)
results = list(rules)
print("Amount of entries generated: {}".format(len(results)))

Amount of entries generated: 0


This happened because a support of 30% is nearly impossible to reach with so many different products. With such a large variety, we will hardly ever see the same product in 30% of the carts. So, let's improve our parameters... <br><br>
Imagine that we want the products that are bought at least 4 times a day. Our dataset covers the products sold in one week (7 days), so such a product was bought at least 28 times that week (4 times a day x 7 days).<br><br>

Okay, so our product was sold in at least 28 of the 7501 shopping carts:

In [9]:
28/7501

0.0037328356219170776

Great. We can round that down and use 0.003 as our support to get some insights!

In [10]:
rules = apriori(transactions, min_support=0.003, min_confidence=0.8, min_lift=3)
results = list(rules)
print("Amount of entries generated: {}".format(len(results)))

Amount of entries generated: 0


It also generated no entries, because this time the confidence is too high (0.8). Let's lower it to 0.2.

In [11]:
rules = apriori(transactions, min_support=0.003, min_confidence=0.2, min_lift=3)
results = list(rules)
print("Amount of entries generated: {}".format(len(results)))

Amount of entries generated: 160


Looks like the apriori algo generated 160 entries! Each entry can hold more than one rule, so let's unpack them!

In [12]:
A = [] # IF
B = [] # THEN
support = []
confidence = []
lift = []

for result in results:
    s = result[1] # support
    result_rules = result[2] # all rules
    for result_rule in result_rules:
        a = list(result_rule[0]) # if
        b = list(result_rule[1]) # then
        c = result_rule[2] # confidence
        l = result_rule[3] # lift
        # print('IF ', a, ', THEN ', b, ' -> confidence: ', c, ' | lift: ', l)
        A.append(a)
        B.append(b)
        support.append(s)
        confidence.append(c)
        lift.append(l)
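
The numeric indexes in the loop above are easier to follow with names. The sketch below assumes that apyori's results are named tuples with the fields `items`, `support` and `ordered_statistics` per entry, and `items_base`, `items_add`, `confidence` and `lift` per rule. To stay runnable without the library, it defines stand-in tuples with those names and one hand-written entry that reuses the coffee and bread numbers from the introduction, not real apriori output.

```python
from collections import namedtuple

# Stand-ins mirroring the tuple layout that the loop above indexes into
# (the field names are an assumption about apyori's result objects).
RelationRecord = namedtuple('RelationRecord', ['items', 'support', 'ordered_statistics'])
OrderedStatistic = namedtuple('OrderedStatistic', ['items_base', 'items_add', 'confidence', 'lift'])

# One hand-written entry with illustrative values.
results = [
    RelationRecord(
        items=frozenset({'coffee', 'bread'}),
        support=0.3,
        ordered_statistics=[
            OrderedStatistic(frozenset({'coffee'}), frozenset({'bread'}), 0.75, 2.5),
        ],
    )
]

rows = []
for record in results:
    for stat in record.ordered_statistics:
        rows.append({
            'A': sorted(stat.items_base),   # IF, same as stat[0]
            'B': sorted(stat.items_add),    # THEN, same as stat[1]
            'support': record.support,      # same as record[1]
            'confidence': stat.confidence,  # same as stat[2]
            'lift': stat.lift,              # same as stat[3]
        })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get one row per rule.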

# 4. Generating Pandas DataFrame

In [13]:
rules_df = pd.DataFrame({'A': A, 'B': B, 'support': support, 'confidence': confidence, 'lift': lift}) 
rules_df = rules_df.sort_values(by='lift', ascending=False)
rules_df

| | A | B | support | confidence | lift |
|---|---|---|---|---|---|
| 344 | [soup, frozen vegetables] | [nan, milk, mineral water] | 0.003066 | 0.383333 | 7.987176 |
| 177 | [soup, frozen vegetables] | [milk, mineral water] | 0.003066 | 0.383333 | 7.987176 |
| 349 | [nan, soup, frozen vegetables] | [milk, mineral water] | 0.003066 | 0.383333 | 7.987176 |
| 173 | [frozen vegetables, olive oil] | [milk, mineral water] | 0.003333 | 0.294118 | 6.128268 |
| 339 | [nan, frozen vegetables, olive oil] | [milk, mineral water] | 0.003333 | 0.294118 | 6.128268 |
| ... | ... | ... | ... | ... | ... |
| 238 | [shrimp, nan, ground beef] | [spaghetti] | 0.005999 | 0.523256 | 3.005315 |
| 67 | [shrimp, ground beef] | [spaghetti] | 0.005999 | 0.523256 | 3.005315 |
| 198 | [tomatoes, frozen vegetables, mineral water] | [spaghetti] | 0.003066 | 0.522727 | 3.002280 |
| 368 | [nan, tomatoes, frozen vegetables, mineral water] | [spaghetti] | 0.003066 | 0.522727 | 3.002280 |


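Great. The 'nan' strings come from the empty cells, which were treated as a product, so the same rule shows up with and without 'nan'. Let's just remove those 'nan' values and the duplicated rows.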

In [14]:
def removenan(x):
    # Drop the 'nan' placeholder (if present) and sort the items in place
    if 'nan' in x:
        x.remove('nan')
    x.sort()
    return x

In [15]:
rules_df['A'] = rules_df['A'].apply(removenan)
rules_df['B'] = rules_df['B'].apply(removenan)
rules_df['A'] = rules_df['A'].apply(lambda x: ' & '.join(x))
rules_df['B'] = rules_df['B'].apply(lambda x: ' & '.join(x))
rules_df.drop_duplicates(subset=['A','B'], keep='first',inplace=True)
rules_df.reset_index(drop=True,inplace=True)

In [16]:
rules_df

| | A | B | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | frozen vegetables & soup | milk & mineral water | 0.003066 | 0.383333 | 7.987176 |
| 1 | frozen vegetables & olive oil | milk & mineral water | 0.003333 | 0.294118 | 6.128268 |
| 2 | mineral water & whole wheat pasta | olive oil | 0.003866 | 0.402778 | 6.128268 |
| 3 | milk & soup | frozen vegetables & mineral water | 0.003066 | 0.201754 | 5.646864 |
| 4 | tomato sauce | ground beef & spaghetti | 0.003066 | 0.216981 | 5.535971 |
| ... | ... | ... | ... | ... | ... |
| 125 | chocolate & eggs & mineral water | ground beef | 0.003999 | 0.297030 | 3.023093 |
| 126 | milk & mineral water & spaghetti | frozen vegetables | 0.004533 | 0.288136 | 3.022804 |
| 127 | frozen vegetables & spaghetti | shrimp | 0.005999 | 0.215311 | 3.018781 |
| 128 | ground beef & shrimp | spaghetti | 0.005999 | 0.523256 | 3.005315 |


There it is. Lots of rules with a support of at least 0.003, a confidence of at least 0.2 and a lift of at least 3!<br><br>
Enjoy!

---

Reach me at:
- Email: contact@mathfigueiredo.com
- Linkedin: https://www.linkedin.in/mathfigueiredo
- Portfolio: https://mathfigueiredo.com