# Problem Statement

Apriori is an algorithm for association rule mining that relies primarily on three measures:
Lift, Support and Confidence. Using this algorithm, try to find the rules that describe the
relations between the products bought by the customers, as recorded in the dataset linked
below.

Dataset Link: Store Data
    
https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing

Association Rule Mining is used when we want to find associations between different
objects in a set, or frequent patterns in a transaction database or relational database.
Its applications are found in marketing, basket data analysis (or market basket analysis)
in retailing, clustering and classification. It can be used to find which items customers
frequently buy together by generating a set of rules called Association Rules.

# <center>Association Rule Mining</center>


## Overview:

Association rule mining is a technique to identify underlying relations between different items. Take the example of a supermarket where customers can buy a variety of items. Usually, there is a pattern in what the customers buy. For instance, parents with babies buy baby products such as milk and diapers, some shoppers buy makeup items, and others buy beer and chips. In short, transactions follow patterns. More profit can be generated if the relationships between the items purchased in different transactions can be identified.

For instance, if items A and B are frequently bought together, several steps can be taken to increase profit. For example:

- A and B can be placed together so that a customer who buys one of the products doesn't have to go far to buy the other.
- People who buy one of the products can be targeted through an advertisement campaign to buy the other.
- Collective discounts can be offered on these products if the customer buys both of them.
- Both A and B can be packaged together.

The process of identifying associations between products is called association rule mining.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases using some measure of interestingness. The Apriori algorithm is one such method used to identify these strong rules. It is an algorithm for frequent itemset mining and association rule learning over relational databases.

## Introduction:

Frequent pattern mining is one of the most important techniques of data mining for discovering relationships between different items in a dataset. These relationships are represented in the form of association rules.
Apriori is an algorithm used to identify frequent itemsets (in our case, item pairs). It does so using a "bottom-up" approach, first identifying individual items that satisfy a minimum occurrence threshold. It then extends the itemset, adding one item at a time and checking whether the resulting itemset still satisfies the specified threshold. The algorithm stops when there are no more items to add that meet the minimum occurrence requirement.
A set of items together is called an itemset. An itemset is called frequent if its support meets a minimum threshold. A rule derived from a frequent itemset is kept if its confidence also meets a minimum threshold.
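The bottom-up search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular library. The function name `frequent_itemsets` and the toy transactions are assumptions made up for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (bottom-up) search for itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: individual items that meet the threshold
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


# Toy transactions (made up for illustration, not taken from store_data.csv)
toy = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "beer"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
]

found = frequent_itemsets(toy, min_support=0.6)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its smaller subsets are frequent, so most combinations are never counted.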

Association rule mining is defined as:

    “Let I = {….} be a set of ‘n’ binary attributes called items. Let D = {….} be the set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form A -> B where A, B ⊆ I. The sets of items A and B are called the antecedent and consequent of the rule respectively.”
    
    
## Various Metrics to Measure Association:

There are five key metrics to consider when evaluating association rules:

### Support

This is the percentage of orders that contain the itemset. The minimum support required by Apriori can be set based on knowledge of your domain. In the grocery dataset, for example, since there could be thousands of distinct items and an order can contain only a small fraction of these items, setting the support threshold to 0.01% (a fraction of 0.0001) may be reasonable.
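As a quick illustration, support can be computed by counting the transactions that contain every item of the itemset. The helper `support` and the toy transactions below are made up for this sketch and are not part of the notebook's data.

```python
# Toy transactions (made up for illustration, not taken from store_data.csv)
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "beer"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"milk"}, transactions))             # 4 of 5 transactions
print(support({"milk", "diapers"}, transactions))  # 3 of 5 transactions
```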

### Confidence

Given two items, A and B, confidence measures the percentage of times that B is purchased, given that item A was purchased.
This is expressed as:

    Confidence(A->B) = support (A, B) / support (A)
    
Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased.

    Confidence(B->A) = support (A, B) / support(B)
    
A confidence value of 0.75 implies that out of all orders that contain A, 75% of them also contain B.
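The directionality is easy to see on a small example. The sketch below uses made-up toy transactions and a hypothetical helper `confidence` that counts transactions directly.

```python
# Toy transactions (made up for illustration, not taken from store_data.csv)
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "beer"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
]


def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return n_both / n_antecedent


print(confidence({"milk"}, {"diapers"}, transactions))  # 3 of the 4 milk orders -> 0.75
print(confidence({"diapers"}, {"milk"}, transactions))  # 3 of the 3 diaper orders -> 1.0
```

Here milk -> diapers has a confidence of 0.75, while diapers -> milk has a confidence of 1.0, even though both rules are built from the same pair of items.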

### Lift 

Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items are occurring together in the same orders simply by chance (i.e. at random). Unlike the confidence metric whose value may vary depending on the direction, lift has no direction. This means that the lift (A, B) is always equal to the lift (B, A).

    Lift (A, B) = Lift (B, A) = Confidence(B->A)/support(A) = support (A, B)/(support(A) * support(B))
    
One way to understand lift is to think of the denominator as the likelihood that A and B appear in the same order if there were no relationship between them. Suppose A occurs in 80% of the orders and B occurs in 60% of the orders. If there were no relationship between them, we would expect both of them to show up together in the same order 48% of the time (i.e. 80% * 60%). The numerator, on the other hand, represents how often A and B actually appear together in the same order. Dividing the numerator by the denominator tells us how many times more often A and B actually appear in the same order than they would if there were no relationship between them (i.e. if they occurred together purely at random).

In summary, lift can take on the following values:

- Lift = 1 implies no relationship between A and B (i.e. A and B occur together only by chance).
- Lift > 1 implies that there is a positive relationship between A and B (i.e. A and B occur together more often than random).
- Lift < 1 implies that there is a negative relationship between A and B (i.e. A and B occur together less often than random).
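The three cases can be checked with the 80% / 60% example from above. The sketch below works from raw counts (out of 100 hypothetical orders) so the arithmetic stays exact. The function name `lift` and the counts are assumptions made for illustration.

```python
def lift(n_a, n_b, n_ab, n_total):
    """Lift from raw counts: support(A, B) / (support(A) * support(B))."""
    return (n_ab * n_total) / (n_a * n_b)


# 100 orders: A appears in 80 of them, B appears in 60 of them
print(lift(80, 60, 48, 100))  # together in 48 orders, as expected by chance -> 1.0
print(lift(80, 60, 54, 100))  # together more often than chance -> 1.125
print(lift(80, 60, 36, 100))  # together less often than chance -> 0.75
```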
    
### Leverage
$$\text{leverage}(A\rightarrow C) = \text{support}(A\rightarrow C) - \text{support}(A) \times \text{support}(C), \;\;\; \text{range: } [-1, 1]$$

Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.

### Conviction
$$\text{conviction}(A\rightarrow C) = \frac{1 - \text{support}(C)}{1 - \text{confidence}(A\rightarrow C)}, \;\;\; \text{range: } [0, \infty]$$

A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
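Leverage and conviction can both be computed directly from the two formulas above. The following is a small sketch with made-up support values. The function names `leverage` and `conviction` are chosen for this example only.

```python
import math


def leverage(support_ac, support_a, support_c):
    """Observed co-occurrence minus the co-occurrence expected under independence."""
    return support_ac - support_a * support_c


def conviction(support_c, confidence_ac):
    """(1 - support(C)) / (1 - confidence(A -> C)); 'inf' for perfect confidence."""
    if confidence_ac == 1:
        return math.inf
    return (1 - support_c) / (1 - confidence_ac)


# Made-up values: A in 80% of orders, C in 60%, both together in 60%,
# so confidence(A -> C) = 0.6 / 0.8 = 0.75
print(leverage(0.6, 0.8, 0.6))   # positive: A and C co-occur more than expected
print(conviction(0.6, 0.75))     # greater than 1: C depends on A
print(conviction(0.6, 0.6))      # confidence equals support(C), i.e. independence
print(conviction(0.6, 1.0))      # inf
```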
    
## Steps involved in Apriori Algorithm

For large sets of data, there can be hundreds of items in hundreds of thousands of transactions. In principle, rules could be extracted for every possible combination of items. For instance, lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4, then item 2 and item 3, item 2 and item 4, and then for larger combinations, e.g. item 1, item 2 and item 3; similarly item 1, item 2 and item 4, and so on.

As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:

- Set a minimum value for support and confidence. This means that we are only interested in finding rules for items that occur with a certain minimum frequency (i.e. support) and have a minimum level of co-occurrence with other items (i.e. confidence).
- Extract all the subsets whose support is higher than the minimum threshold.
- Select all the rules from these subsets whose confidence is higher than the minimum threshold.
- Order the rules by descending lift.
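The four steps above can be sketched end to end on a toy example. This is a simplified illustration under stated assumptions: the function `mine_rules` and the toy transactions are made up, and candidate itemsets are enumerated by brute force for brevity, whereas Apriori prunes candidates level by level.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift), sorted by lift."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    # Step 2: keep every itemset whose support meets the minimum threshold
    frequent = {}
    for k in range(1, len(items) + 1):
        level = {}
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            s = sum(1 for t in transactions if itemset <= t) / n
            if s >= min_support:
                level[itemset] = s
        if not level:
            break  # no frequent k-itemsets means no larger ones either
        frequent.update(level)

    # Step 3: split each frequent itemset into antecedent -> consequent
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                conf = s / frequent[antecedent]
                if conf >= min_confidence:
                    rule_lift = conf / frequent[consequent]
                    rules.append((set(antecedent), set(consequent), s, conf, rule_lift))

    # Step 4: strongest rules (highest lift) first
    rules.sort(key=lambda rule: rule[4], reverse=True)
    return rules


# Toy transactions (made up for illustration, not taken from store_data.csv)
toy = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "beer"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
]

# Step 1: choose the thresholds
for rule in mine_rules(toy, min_support=0.4, min_confidence=0.7):
    print(rule)
```

Looking up `frequent[antecedent]` is safe because every subset of a frequent itemset is itself frequent, so it has already been stored.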

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
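# Note: read_csv treats the first row of the file as the header by default.
# The column names in the output below are item names, so the first transaction
# is being used as the header; pass header=None to keep it as a data row.
data = pd.read_csv('store_data.csv')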
data.head()

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,,


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   shrimp             7500 non-null   object 
 1   almonds            5746 non-null   object 
 2   avocado            4388 non-null   object 
 3   vegetables mix     3344 non-null   object 
 4   green grapes       2528 non-null   object 
 5   whole weat flour   1863 non-null   object 
 6   yams               1368 non-null   object 
 7   cottage cheese     980 non-null    object 
 8   energy drink       653 non-null    object 
 9   tomato juice       394 non-null    object 
 10  low fat yogurt     255 non-null    object 
 11  green tea          153 non-null    object 
 12  honey              86 non-null     object 
 13  salad              46 non-null     object 
 14  mineral water      24 non-null     object 
 15  salmon             7 non-null      object 
 16  antioxydant juice  3 non

In [4]:
data.shape

(7500, 20)

# Apriori
Informal definition: "Customers who bought this also buy..." The Apriori algorithm finds such patterns, which can then be used to optimize how items are combined (for example, in placement or bundling).

Definition of terms

1) Support
<img src="1.png"> 

2) Confidence
<img src="2.png">

3) Lift
<img src="3.png">

Steps
<img src="step 1.png">

In [5]:
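# Convert each DataFrame row into a list of item strings, as expected by apyori.
# Note: empty cells become the string 'nan' (see the output below) and are
# therefore treated as an item in their own right.
transactions = []
for i in range(data.shape[0]):
    transactions.append([str(data.values[i, j]) for j in range(data.shape[1])])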

print(transactions[0])

['burgers', 'meatballs', 'eggs', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']
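As the output shows, every empty cell ends up as the string `'nan'`, which the miner would count like any other item. One possible way to drop these placeholders is sketched below. The two shortened rows are hypothetical stand-ins modelled on the output above, and the names `raw` and `cleaned` are made up for this example.

```python
# Hypothetical rows in the same shape as the transactions printed above
raw = [
    ['burgers', 'meatballs', 'eggs', 'nan', 'nan'],
    ['chutney', 'nan', 'nan', 'nan', 'nan'],
]

# Keep only real items; 'nan' is the string form of an empty cell
cleaned = [[item for item in row if item != 'nan'] for row in raw]

print(cleaned[0])  # ['burgers', 'meatballs', 'eggs']
print(cleaned[1])  # ['chutney']
```

Rerunning the analysis on cleaned transactions may change which rules are returned, so the results shown later in this notebook apply to the unfiltered lists.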


In [7]:
pip install apyori



In [9]:
from apyori import apriori

# min_support: number of transactions containing the set of items / total number of transactions
#   --> products that are bought at least 3 times a day (3 * 7 = 21 purchases)
#   --> 21 / 7501 ≈ 0.0028, rounded up to 0.003
# min_confidence: should not be too high, as that leads to only obvious rules
# Try several combinations of these values to experiment with the model.
rules = apriori(transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)

# apriori returns a generator; materialize it into a list to view the rules
results = list(rules)

In [10]:
#Transferring the list to a table

results = pd.DataFrame(results)
results.head(5)

|   | items | support | ordered_statistics |
|---|---|---|---|
| 0 | (light cream, chicken) | 0.004533 | [((light cream), (chicken), 0.2905982905982906... |
| 1 | (escalope, mushroom cream sauce) | 0.005733 | [((mushroom cream sauce), (escalope), 0.300699... |
| 2 | (escalope, pasta) | 0.005867 | [((pasta), (escalope), 0.37288135593220345, 4.... |
| 3 | (fromage blanc, honey) | 0.003333 | [((fromage blanc), (honey), 0.2450980392156863... |
| 4 | (ground beef, herb & pepper) | 0.016 | [((herb & pepper), (ground beef), 0.3234501347... |


In [11]:
results['ordered_statistics'][0]

[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.2905982905982906, lift=4.843304843304844)]
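The nested `ordered_statistics` entries are easier to read once they are flattened into one row per rule. The sketch below is one possible way to do this. To stay runnable without apyori, it defines namedtuple stand-ins that only mirror the field names visible in the output above. The helper `flatten` is hypothetical, and the sample record reuses the values printed above.

```python
from collections import namedtuple

# Stand-ins that mirror the field names seen in the apyori output above
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)


def flatten(records):
    """Turn nested rule records into one flat dict per rule."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows


# Sample record built from the values printed above
sample = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004533,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"light cream"}),
                items_add=frozenset({"chicken"}),
                confidence=0.2905982905982906,
                lift=4.843304843304844,
            )
        ],
    )
]

rows = flatten(sample)
print(rows[0])
```

With real output, the same function would be applied to the list of records (the value of `list(rules)` before it is converted to a DataFrame), and the resulting list of dicts can be passed to `pd.DataFrame` for display.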

As you can see, each rule comes with its Lift, Confidence and Support.