# Apriori algorithm

The Apriori algorithm is an association rule mining algorithm that finds associations and relationships in large datasets. A common application is market basket analysis, which lets retailers determine which items customers most frequently buy together. Given a set of transactions, we can infer which other items a customer is likely to have bought from a subset of their basket. The algorithm gets its name from its use of prior knowledge of frequent itemset properties.

The algorithm relies on the following property: if an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent.
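This property can be checked by brute force on a small example. The sketch below uses a made-up set of four transactions (the item names and the `support` helper are illustrative, not part of the dataset or of apyori) and confirms that no itemset has a higher support than any of its subsets.

```python
from itertools import combinations

# Hypothetical toy transactions, used only for illustration
toy = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate every non-empty itemset over the toy items
items = sorted(set().union(*toy))
all_itemsets = [frozenset(c)
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

# A subset can never be rarer than any of its supersets
holds = all(support(sub, toy) >= support(sup, toy)
            for sub in all_itemsets
            for sup in all_itemsets
            if sub < sup)
print(holds)
```

Because adding an item to an itemset can only remove transactions from the set that contains it, the check succeeds for any dataset, not only this one.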

## How the algorithm works

The algorithm works in two phases:
* Finding frequent itemsets - in this phase, the algorithm scans the dataset for all itemsets that meet the minimum support threshold. The support of an itemset is the proportion of transactions that contain it. The algorithm uses a level-wise approach, finding all frequent itemsets of size 1, then size 2, and so on.
* Generating association rules - the algorithm generates association rules from the frequent itemsets. An association rule is a statement of the form A → B, where A and B are itemsets, A is the antecedent (the left-hand side of the rule) and B is the consequent (the right-hand side of the rule). The algorithm generates all possible rules from each frequent itemset and evaluates their strength using a measure called confidence. The confidence of A → B is the proportion of transactions that contain both A and B, divided by the proportion of transactions that contain A. The algorithm keeps the association rules that meet the minimum confidence threshold.
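The two phases can be sketched in plain Python as follows. This is a minimal illustration on a made-up toy dataset, not how apyori is implemented; the names `support`, `frequent_itemsets` and `generate_rules` are assumptions introduced for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Phase 1: level-wise search for itemsets meeting `min_support`."""
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in items]   # candidates of size 1
    frequent = {}
    k = 1
    while level:
        # Keep the size-k candidates that meet the support threshold
        current = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

def generate_rules(frequent, min_confidence):
    """Phase 2: yield (antecedent, consequent, confidence) for strong rules."""
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support has already been recorded
                confidence = s / frequent[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence

# Hypothetical toy transactions, used only for illustration
toy = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]
freq = frequent_itemsets(toy, min_support=0.5)
for antecedent, consequent, confidence in generate_rules(freq, min_confidence=0.6):
    print(set(antecedent), "->", set(consequent), round(confidence, 3))
```

The pruning step is where the subset property pays off: `{bread, milk, butter}` is never counted against the data, because its subset `{milk, butter}` is already known to be infrequent.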

## Import required libraries

In [9]:
import numpy as np
import pandas as pd
from apyori import apriori

## Import the dataset

In [10]:
df = pd.read_csv(r"C:\Users\pjhop\OneDrive\Documents\Programming & Coding\Python\Projects\Datasets\Market_Basket_Optimisation.csv", header=None)
df.head()

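| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | | | | | | | | | | | | | | | | | |
| 2 | chutney | | | | | | | | | | | | | | | | | | | |
| 3 | turkey | avocado | | | | | | | | | | | | | | | | | | |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | | | | | | | | | | | | | | | |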


In [11]:
df.shape

(7501, 20)

In [12]:
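# Convert the DataFrame to a list of transactions, skipping empty (NaN) cells
transactions = []
for row in df.values:
    transactions.append([str(item) for item in row if pd.notna(item)])

The code above stores the transactions in a list format by iterating over the rows of the DataFrame and appending the string values of each row to the `transactions` list. Empty cells, which pandas reads as `NaN`, are skipped so that they do not end up in the transactions as a spurious `'nan'` item.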
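One detail worth checking in this conversion is how missing values are handled, since most rows have fewer than 20 items. The sketch below uses a hand-written row in which `float("nan")` stands in for the empty cells pandas would produce; the row contents and variable names are illustrative only.

```python
# A hypothetical row as pandas might return it: three items, then empty cells
row = ["burgers", "meatballs", "eggs", float("nan"), float("nan")]

# Converting every cell with str() turns each empty cell into the item 'nan'
naive = [str(item) for item in row]
print(naive)

# Keeping only the string cells leaves just the real items
cleaned = [item for item in row if isinstance(item, str)]
print(cleaned)
```

If the `'nan'` strings are kept, they behave like an item that appears in almost every basket, which adds needless candidates even when the resulting rules are later filtered out.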

In [13]:
rules = apriori(transactions=transactions, min_support=0.005, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)

The `apriori` function takes a few parameters, which can be altered to give different outcomes:
* `transactions` takes the transaction data as a list of lists.
* `min_support` is the minimum support an itemset must have to be considered 'frequent'. Here an itemset must appear in at least 0.5% of transactions.
* `min_confidence` is the minimum confidence a rule must have. Here, for a rule A → B, at least 20% of the transactions that contain A must also contain B.
* `min_lift` is the minimum lift a rule must have. Lift measures the strength of the association: it is the confidence of A → B divided by the support of B, so a lift above 1 means A and B occur together more often than they would if they were independent. Setting a minimum lift keeps only strong associations.
* `max_length` is the maximum number of items allowed in each itemset. `min_length` is intended to set the minimum, but apyori does not appear to read this argument, so it may have no effect here.
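To make the three thresholds concrete, the sketch below computes support, confidence and lift for a single rule A → B from raw transaction counts. The counts and the `rule_metrics` helper are made up for illustration and do not come from the dataset or from apyori.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift of A -> B from transaction counts."""
    support = n_both / n_total                 # share of baskets with A and B
    confidence = n_both / n_a                  # share of A-baskets that also hold B
    lift = (n_both * n_total) / (n_a * n_b)    # confidence / support(B)
    return support, confidence, lift

# Hypothetical counts: 1000 baskets, 100 with A, 100 with B, 40 with both
support, confidence, lift = rule_metrics(n_total=1000, n_a=100, n_b=100, n_both=40)

# The same thresholds as in the apriori call above
passes = support >= 0.005 and confidence >= 0.2 and lift >= 3
print(support, confidence, lift, passes)
```

In this example B appears in 10% of all baskets but in 40% of the baskets that contain A, which is what a lift of 4 expresses.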

In [14]:
results = list(rules)
print(results)

[RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]), RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]), RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.015997866951073192, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}), items_add=frozenset({'ground beef'}), confidence=0.3234501347708895, lift=3.2919938411349285)]), RelationRecord(items=frozenset({'ground beef', 'tomato sauce'}), support=0.005332622317024397, ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}), items_add=frozenset({'ground b

In [15]:
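def inspect(results):
    # Each RelationRecord is (items, support, ordered_statistics). Only the first
    # OrderedStatistic (items_base, items_add, confidence, lift) of each record is
    # used, and only the first item on each side, which is enough for 2-item rules.
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

# Collect the rules in a DataFrame for easier inspection
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])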

In [19]:
resultsinDataFrame.head(6)

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 1 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 2 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 3 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 4 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 5 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
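As a quick sanity check on these numbers, recall that lift is the confidence divided by the support of the right-hand side. Dividing confidence by lift should therefore recover the same support for every rule that shares a right-hand side. The sketch below does this for the first four rows of the table above; the variable names are illustrative.

```python
# (left-hand side, right-hand side, confidence, lift) copied from the table above
rows = [
    ("mushroom cream sauce", "escalope", 0.300699, 3.790833),
    ("pasta", "escalope", 0.372881, 4.700812),
    ("herb & pepper", "ground beef", 0.32345, 3.291994),
    ("tomato sauce", "ground beef", 0.377358, 3.840659),
]

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
implied = {}
for lhs, rhs, confidence, lift in rows:
    implied.setdefault(rhs, []).append(confidence / lift)

for rhs, values in implied.items():
    print(rhs, [round(v, 4) for v in values])
```

Both escalope rules imply the same support for escalope, and both ground beef rules imply the same support for ground beef, which is what we would expect if the reported confidence and lift values are consistent.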


## Limitations of the algorithm

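The Apriori algorithm can be very slow, because it scans the whole dataset repeatedly, once for each itemset size. At every level it also builds candidate itemsets from the frequent itemsets of the previous level and counts the support of each one, and the number of candidates can grow very quickly with the number of distinct items. This makes it slow and inefficient on large datasets, particularly when the minimum support threshold is low.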
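To get a feel for this growth, the sketch below counts how many itemsets of each size exist before any pruning. The figure of 100 distinct items is a hypothetical round number, not the number of items in this dataset, and `max_candidates` is a helper introduced for the example.

```python
from math import comb

def max_candidates(n_items, k):
    """Number of distinct k-itemsets that can be formed from n_items items."""
    return comb(n_items, k)

n_items = 100  # hypothetical number of distinct items
for k in range(1, 5):
    print(k, max_candidates(n_items, k))
```

Pruning removes many of these candidates in practice, but each remaining one still has to be counted against every transaction, which is why the support threshold has such a large effect on running time.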