## Association Rules

#### Association Rule Learning is a rule-based learning method for identifying associations between variables in a database. One of the best-known examples of Association Rule Learning is Market Basket Analysis, which looks for the items that have the highest probability of being bought together by a customer.

### Support
Support measures how popular an item is overall. It is the number of transactions containing a particular item divided by the total number of transactions. Suppose we want to find the support for item A.

This can be calculated as:

Support(A) = (Transactions containing (A))/(Total Transactions)

For instance, if 100 out of 1000 transactions contain Milk, then the support for Milk can be calculated as:

Support(Milk) = (Transactions containing Milk)/(Total Transactions)

Support(Milk) = 100/1000 = 10%
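As a minimal sketch, support can be computed directly from a list of transactions. The `support` helper and the toy `transactions` list below are illustrative assumptions, not part of any library; the toy data is scaled down so that Milk again appears in 10% of transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

# Toy data (hypothetical): 10 transactions, 1 of which contains Milk
transactions = [{"Milk", "Bread"}] + [{"Bread"}] * 9

milk_support = support(transactions, {"Milk"})
print(f"Support(Milk) = {milk_support:.0%}")
```

Passing a set with more than one item gives the support of that combination, which is the quantity confidence is built on.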

### Confidence
Confidence refers to the likelihood that an item B is also bought if item A is bought. 
It can be calculated by finding the number of transactions where A and B are bought together, 
divided by total number of transactions where A is bought. Mathematically, it can be represented as:

Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)  

Example (using a separate set of counts): Milk was bought in 150 transactions, and Milk and Diaper were bought together in 50 of them. The likelihood of buying Diaper when Milk is bought is the confidence of Milk -> Diaper, which can be written as:

Confidence(Milk→Diaper) = (Transactions containing both (Milk and Diaper))/(Transactions containing Milk)

Confidence(Milk→Diaper) = 50/150 = 33.3%
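The same arithmetic can be checked in a couple of lines. Confidence is an estimate of the conditional probability of Diaper given Milk; the variable names below are illustrative only.

```python
# Counts taken from the example above
both_milk_and_diaper = 50   # transactions containing Milk and Diaper
milk = 150                  # transactions containing Milk

confidence_milk_diaper = both_milk_and_diaper / milk
print(f"Confidence(Milk -> Diaper) = {confidence_milk_diaper:.1%}")
```

Note that confidence is not symmetric: Confidence(Diaper -> Milk) would divide by the number of transactions containing Diaper instead.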

### Lift
Lift(A -> B) measures how much more likely B is to be sold when A is sold, compared with how often B is sold overall. Lift(A -> B) is calculated by dividing Confidence(A -> B) by Support(B). Mathematically it can be represented as:

Lift(A→B) = (Confidence (A→B))/(Support (B))  

Coming back to our Milk and Diaper problem, and taking Support(Diaper) to be 10% as the calculation that follows does, the Lift(Milk -> Diaper) can be calculated as:

Lift(Milk→Diaper) = (Confidence (Milk→Diaper))/(Support (Diaper))

Lift(Milk→Diaper) = 33.3%/10% = 3.33
    
A Lift of 3.33 tells us that a customer who buys Milk is 3.33 times as likely to buy Diaper as a customer in general.

A Lift of 1 means there is no association between products A and B.
A Lift greater than 1 means products A and B are more likely to be bought together than chance alone would suggest. Finally, a Lift of less than 
1 means the two products are unlikely to be bought together.
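A small sketch of the lift calculation and its three-way interpretation follows. The helper names `lift` and `describe` are hypothetical, and the 10% figure for Support(Diaper) is the value used in the calculation above.

```python
def lift(confidence_a_to_b, support_b):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    return confidence_a_to_b / support_b

def describe(lift_value, tol=1e-9):
    """Map a lift value onto the three cases described above."""
    if abs(lift_value - 1.0) <= tol:
        return "no association"
    if lift_value > 1.0:
        return "more likely to be bought together"
    return "unlikely to be bought together"

# Confidence(Milk -> Diaper) = 50/150, Support(Diaper) = 10%
value = lift(50 / 150, 0.10)
print(round(value, 2), "->", describe(value))
```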


#### The Apriori algorithm finds the frequent itemsets in a transaction database and identifies association rules between the items.

**How it helps the business:**
1. A and B can be placed together so that a customer who buys one of the products doesn't have to go far to buy the other.
2. People who buy one of the products can be targeted through an advertisement campaign to buy the other.
3. Collective discounts can be offered on these products if the customer buys both of them.
4. Both A and B can be packaged together.

# Example

In [1]:
import pandas as pd
df = pd.DataFrame([[0,1,1,0,0,0],
              [1,1,0,1,1,0],
              [1,0,1,1,0,1],
              [1,1,1,1,0,0],
              [0,1,1,1,0,1]], 
             columns=['Beer', 'Bread', 'Milk', 'Diaper', 'Eggs', 'Coke'], index=['T1','T2','T3','T4','T5'])
df

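| | Beer | Bread | Milk | Diaper | Eggs | Coke |
|---|---|---|---|---|---|---|
| T1 | 0 | 1 | 1 | 0 | 0 | 0 |
| T2 | 1 | 1 | 0 | 1 | 1 | 0 |
| T3 | 1 | 0 | 1 | 1 | 0 | 1 |
| T4 | 1 | 1 | 1 | 1 | 0 | 0 |
| T5 | 0 | 1 | 1 | 1 | 0 | 1 |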
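With the data in this 0/1 form, the three measures defined earlier can be read straight off the table. The sketch below rebuilds the same DataFrame and evaluates the rule Beer -> Diaper; the choice of that rule is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 1, 0, 0, 0],
                   [1, 1, 0, 1, 1, 0],
                   [1, 0, 1, 1, 0, 1],
                   [1, 1, 1, 1, 0, 0],
                   [0, 1, 1, 1, 0, 1]],
                  columns=['Beer', 'Bread', 'Milk', 'Diaper', 'Eggs', 'Coke'],
                  index=['T1', 'T2', 'T3', 'T4', 'T5'])

# Support of each single item: the column mean of the 0/1 matrix
item_support = df.mean()

# Support of {Beer, Diaper}: share of rows where both columns are 1
pair_support = (df['Beer'] & df['Diaper']).mean()

confidence = pair_support / item_support['Beer']   # Confidence(Beer -> Diaper)
lift = confidence / item_support['Diaper']         # Lift(Beer -> Diaper)

print(f"support={pair_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Every transaction that contains Beer also contains Diaper, so the confidence is 1; the lift is only modestly above 1 because Diaper is already in four of the five transactions.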


In [1]:
# Install apyori (Apyori is a simple implementation of Apriori algorithm in Python)
# !pip install apyori

#### Import the Libraries

In [3]:
import numpy as np
import pandas as pd
from apyori import apriori
import matplotlib.pyplot as plt

#### Read the Dataset

In [10]:
# store_data = pd.read_csv('store_data.csv')
store_data = pd.read_csv('store_data.csv',header=None)

#### Understand the data

In [11]:
store_data.shape

(7501, 20)

In [12]:
store_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [6]:
store_data.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
7496,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7497,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7498,chicken,,,,,,,,,,,,,,,,,,,
7499,escalope,green tea,,,,,,,,,,,,,,,,,,
7500,eggs,frozen smoothie,yogurt cake,low fat yogurt,,,,,,,,,,,,,,,,


In [13]:
store_data.isnull().sum()

0        0
1     1754
2     3112
3     4156
4     4972
5     5637
6     6132
7     6520
8     6847
9     7106
10    7245
11    7347
12    7414
13    7454
14    7476
15    7493
16    7497
17    7497
18    7498
19    7500
dtype: int64

#### Data Preprocessing

The apyori library requires our dataset to be in the form of a list of lists,
where the whole dataset is the outer list and each transaction is an inner list within it.
Currently we have the data in the form of a pandas DataFrame, with short transactions padded by NaN.
To convert the DataFrame into a list of lists, execute the following script:

In [17]:
records = []
# Build one inner list per transaction in store_data.csv,
# skipping the NaN cells that pad short transactions
n_rows, n_cols = store_data.shape  # (7501, 20) for this dataset
for i in range(n_rows):
    records.append([str(store_data.values[i, j]) for j in range(n_cols)
                    if str(store_data.values[i, j]) != 'nan'])
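An alternative sketch of the same conversion lets pandas drop the padding per row. It is shown on a three-row stand-in (`store_data_sample`, a hypothetical name) built from the first rows displayed by `head()`, so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Small stand-in for store_data: short transactions are padded with NaN
store_data_sample = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
    ["turkey", "avocado", np.nan],
])

# dropna() removes the padding; tolist() yields one inner list per row
records_sample = [row.dropna().astype(str).tolist()
                  for _, row in store_data_sample.iterrows()]
print(records_sample)
```

This avoids hard-coding the row and column counts and does not rely on comparing against the string `'nan'`.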

#### Applying Apriori

In [18]:
records

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt'],
 ['whole wheat pasta', 'french fries'],
 ['soup', 'light cream', 'shallot'],
 ['frozen vegetables', 'spaghetti', 'green tea'],
 ['french fries'],
 ['eggs', 'pet food'],
 ['cookies'],
 ['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'],
 ['spaghetti', 'champagne', 'cookies'],
 ['mineral water', 'salmon'],
 ['mineral water'],
 ['shrimp',
  'chocolate',
  'chicken',
  'honey',
  'oil',
  'cooking oil',
  'low fat yogurt'],
 ['turkey', 'eggs'],
 ['turkey',
  'fresh tuna',
  'tomatoes',
  'spagh

In [19]:
type(records)

list

### Steps involved in Apriori Algorithm
1. Set a minimum value for support and confidence. 
This means that we are only 
interested in finding rules for items that occur often enough overall 
(i.e. support) and that co-occur often enough 
with other items (i.e. confidence).

2. Extract all the itemsets whose support is higher than the minimum threshold.

3. Select all the rules from those itemsets whose confidence is higher than the minimum threshold.
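The three steps above can be sketched in plain Python. This is a teaching sketch on the five toy transactions from the earlier example, not the implementation used by apyori; the function names `frequent_itemsets` and `rules` are hypothetical, and thresholds are treated as "greater than or equal to".

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 2: level-wise search for itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    level = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while level:
        scored = {c: support(c) for c in level}
        kept = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(kept)
        k += 1
        # Join frequent itemsets into candidates one item larger
        level = {a | b for a in kept for b in kept if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        level = {c for c in level
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Step 3: split each frequent itemset into antecedent -> consequent."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    found.append((set(antecedent), set(itemset - antecedent), conf))
    return found

transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diaper", "Eggs"},
    {"Beer", "Milk", "Diaper", "Coke"},
    {"Beer", "Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Step 1: choose the thresholds
frequent = frequent_itemsets(transactions, min_support=0.6)
found = rules(frequent, min_confidence=0.8)
print(found)
```

The pruning step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so larger candidates are built only from itemsets that already passed the support threshold.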

#### Parameters
##### min_support - The minimum support (popularity) an itemset must have to be considered.
##### min_confidence - The minimum confidence a rule must have, i.e. how likely item B is to be bought given that item A is bought.
##### min_lift - The minimum lift a rule must have, i.e. how many times more likely item B is to be bought when item A is bought than it is to be bought in general.


In [20]:
# Run the Apriori algorithm on the transaction data with minimum support, confidence and lift values, and convert the output to a list
association_results = apriori(records,min_support=0.0046,min_confidence=0.2,\
                            min_lift=2)

association_results = list(association_results)
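To get a feel for what `min_support=0.0046` means on this dataset, it can be translated into a transaction count. The sketch below assumes that an itemset is kept when its support is at least `min_support`; the variable names are illustrative.

```python
import math

n_transactions = 7501   # number of rows reported by store_data.shape
min_support = 0.0046

# Smallest whole number of transactions whose support reaches min_support
min_count = math.ceil(min_support * n_transactions)
print(min_count, "transactions out of", n_transactions)
```

In other words, under that assumption an itemset has to appear in a few dozen baskets before any rule built from it is considered.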


In [21]:
print(len(association_results))

135


In [22]:
print(association_results)

[RelationRecord(items=frozenset({'burgers', 'almonds'}), support=0.005199306759098787, ordered_statistics=[OrderedStatistic(items_base=frozenset({'almonds'}), items_add=frozenset({'burgers'}), confidence=0.25490196078431376, lift=2.923577382023146)]), RelationRecord(items=frozenset({'burgers', 'ham'}), support=0.005599253432875617, ordered_statistics=[OrderedStatistic(items_base=frozenset({'ham'}), items_add=frozenset({'burgers'}), confidence=0.21105527638190955, lift=2.420681388594348)]), RelationRecord(items=frozenset({'milk', 'cereals'}), support=0.007065724570057326, ordered_statistics=[OrderedStatistic(items_base=frozenset({'cereals'}), items_add=frozenset({'milk'}), confidence=0.2746113989637306, lift=2.119197637476279)]), RelationRecord(items=frozenset({'tomato sauce', 'chocolate'}), support=0.005065991201173177, ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}), items_add=frozenset({'chocolate'}), confidence=0.3584905660377358, lift=2.1879883936932925)

In [23]:
print(association_results[1])

RelationRecord(items=frozenset({'burgers', 'ham'}), support=0.005599253432875617, ordered_statistics=[OrderedStatistic(items_base=frozenset({'ham'}), items_add=frozenset({'burgers'}), confidence=0.21105527638190955, lift=2.420681388594348)])


* Each RelationRecord reflects all rules associated with a specific itemset (items) that has relevant rules. Support (support) is the fraction of transactions in which those items appear together; it is the same for every rule involving those items, and so appears only once per RelationRecord. ordered_statistics is a list of all rules that met our min_confidence and min_lift requirements (passed as parameters when we called apriori()). Each OrderedStatistic contains the antecedent (items_base) and consequent (items_add) for the rule, as well as the associated confidence and lift.
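The structure described above can be read by field name rather than by position. So that the sketch runs without apyori installed, it uses `namedtuple` stand-ins that mirror the field names in the printed output; they are assumptions for illustration, not apyori's own classes.

```python
from collections import namedtuple

# Stand-ins mirroring the field names shown in the printed record
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

record = RelationRecord(
    items=frozenset({"burgers", "ham"}),
    support=0.005599253432875617,
    ordered_statistics=[
        OrderedStatistic(frozenset({"ham"}), frozenset({"burgers"}),
                         0.21105527638190955, 2.420681388594348)],
)

lines = []
for stat in record.ordered_statistics:
    antecedent = ", ".join(sorted(stat.items_base))
    consequent = ", ".join(sorted(stat.items_add))
    lines.append(f"{antecedent} -> {consequent}: support={record.support:.4f}, "
                 f"confidence={stat.confidence:.2f}, lift={stat.lift:.2f}")
print("\n".join(lines))
```

Reading `items_base` and `items_add` is the reliable way to get the direction of a rule, because the `items` frozenset itself has no order.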

In [24]:
association_results[0]

RelationRecord(items=frozenset({'burgers', 'almonds'}), support=0.005199306759098787, ordered_statistics=[OrderedStatistic(items_base=frozenset({'almonds'}), items_add=frozenset({'burgers'}), confidence=0.25490196078431376, lift=2.923577382023146)])

In [25]:
association_results[0][0]

frozenset({'almonds', 'burgers'})

In [26]:
association_results[0][1]

0.005199306759098787

In [27]:
association_results[0][2]

[OrderedStatistic(items_base=frozenset({'almonds'}), items_add=frozenset({'burgers'}), confidence=0.25490196078431376, lift=2.923577382023146)]

In [28]:
item=association_results[0]

In [29]:
print(item.items)
print(item.support)
print(item.ordered_statistics)

frozenset({'burgers', 'almonds'})
0.005199306759098787
[OrderedStatistic(items_base=frozenset({'almonds'}), items_add=frozenset({'burgers'}), confidence=0.25490196078431376, lift=2.923577382023146)]


In [30]:
# Print Support, Confidence and Lift for the first rule
print("Support = ",item[1])
print("Confidence = ",item[2][0][2])
print("Lift = ",item[2][0][3])

Support =  0.005199306759098787
Confidence =  0.25490196078431376
Lift =  2.923577382023146


In [22]:
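for item in association_results:

    # item[0] is the frozenset of items involved in the rule.
    # NOTE: a frozenset is unordered, so items[0] -> items[1] printed here
    # is not guaranteed to be antecedent -> consequent; the true direction
    # is given by item[2][0].items_base and item[2][0].items_add.
    pair = item[0]
    items = [x for x in pair]
    print(items)
    print("Rule: " + items[0] + " -> " + items[1])

    # item[1] is the support of the itemset
    print("Support: " + str(item[1]))

    # item[2][0] is the first OrderedStatistic of the record;
    # its fields at index 2 and 3 are the confidence and the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")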

['burgers', 'almonds']
Rule: burgers -> almonds
Support: 0.005199306759098787
Confidence: 0.25490196078431376
Lift: 2.923577382023146
['burgers', 'ham']
Rule: burgers -> ham
Support: 0.005599253432875617
Confidence: 0.21105527638190955
Lift: 2.420681388594348
['milk', 'cereals']
Rule: milk -> cereals
Support: 0.007065724570057326
Confidence: 0.2746113989637306
Lift: 2.119197637476279
['tomato sauce', 'chocolate']
Rule: tomato sauce -> chocolate
Support: 0.005065991201173177
Confidence: 0.3584905660377358
Lift: 2.1879883936932925
['mushroom cream sauce', 'escalope']
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
['pasta', 'escalope']
Rule: pasta -> escalope
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
['extra dark chocolate', 'mineral water']
Rule: extra dark chocolate -> mineral water
Support: 0.005732568990801226
Confidence: 0.47777777777777775
Lift: 2.0043686303753416


**Rule Interpretation:**

Rule: ham -> burgers
* Support: 0.005599253432875617 - Ham and burgers are bought together in about 0.56% of all transactions.
* Confidence: 0.21105527638190955 - The probability of burgers being bought when ham is bought is 0.21 (21%). (Out of all the transactions containing ham, 21% also contain burgers.)
* Lift: 2.420681388594348 - A customer who buys ham is about 2.4 times as likely to buy burgers as a customer in general.
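The three numbers are tied together by the definitions given at the start, so the supports of ham and of burgers on their own can be recovered from them. The sketch below only rearranges Confidence(A→B) = Support(A and B)/Support(A) and Lift(A→B) = Confidence(A→B)/Support(B); the variable names are illustrative.

```python
support_pair = 0.005599253432875617   # Support({ham, burgers})
confidence = 0.21105527638190955      # Confidence(ham -> burgers)
lift = 2.420681388594348              # Lift(ham -> burgers)
n_transactions = 7501                 # rows in store_data

support_ham = support_pair / confidence   # Support(antecedent)
support_burgers = confidence / lift       # Support(consequent)

# Convert the fractions back into approximate transaction counts
count_pair = round(support_pair * n_transactions)
count_ham = round(support_ham * n_transactions)
count_burgers = round(support_burgers * n_transactions)
print(count_pair, count_ham, count_burgers)
```

Seeing the rule as raw counts makes it easier to judge how much evidence stands behind it before acting on it.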