# <u>Apriori Algorithm With Supermarket Purchase Data</u>

The Apriori algorithm is a data mining technique used to find frequent itemsets and association rules in a dataset. To be more efficient, it uses the Apriori property, which is that any subset of a frequent itemset must also be frequent. These association rules can be used for market basket analysis (understanding consumer purchase behaviour, e.g. buying shampoo/conditioner together) and recommendation systems.

Dataset is from Kaggle: https://www.kaggle.com/datasets/ayushish12/market-basket-optimisation 

The Apriori algorithm is an example of unsupervised learning.

In [1]:
import pandas as pd

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [2]:
%pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: apyori
  Building wheel for apyori (pyproject.toml): started
  Building wheel for apyori (pyproject.toml): finished with status 'done'
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5976 sha256=0c4f551fd720a35f50673a45a78679c64c8b2a60c5c78fb0c269af1218831e83
  Stored in directory: c:\users\nathan\appdata\local\pip\cache\wheels\7f\49\e3\42c73b19a264de37129fadaa0c52f26cf50e87de08fb9804af
Successfully built apyori
Installing collected pa

In [3]:
# reading the dataset
data = pd.read_csv("market basket optimisation.csv",
                  header = None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7497,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7498,chicken,,,,,,,,,,,,,,,,,,,
7499,escalope,green tea,,,,,,,,,,,,,,,,,,


In [4]:
data_rows = data.shape[0]
data_cols = data.shape[1]

## Data Exploration

In [5]:
# finding items sold by supermarket and quantity sold
sales = {}
for i in range(data_cols):
    for j in range(data_rows):
        item = data[i][j]
        if pd.isna(item):
            break
        if item in sales:
            sales[item] += 1
        else: sales[item] = 1
print(sales)
print("Number of items sold: ", len(sales))

{'shrimp': 325, 'burgers': 576, 'chutney': 7, 'turkey': 458, 'mineral water': 578, 'low fat yogurt': 47, 'whole wheat pasta': 95, 'soup': 78, 'frozen vegetables': 373, 'french fries': 244, 'eggs': 280, 'cookies': 270, 'spaghetti': 354, 'meatballs': 34, 'red wine': 123, 'rice': 10, 'parmesan cheese': 51, 'ground beef': 218, 'sparkling water': 4, 'herb & pepper': 232, 'pickles': 38, 'energy bar': 67, 'fresh tuna': 129, 'escalope': 143, 'avocado': 58, 'tomato sauce': 24, 'clothes accessories': 9, 'energy drink': 19, 'chocolate': 391, 'grated cheese': 293, 'yogurt cake': 31, 'mint': 10, 'asparagus': 3, 'champagne': 64, 'ham': 120, 'muffins': 69, 'french wine': 18, 'chicken': 44, 'pasta': 12, 'tomatoes': 212, 'pancakes': 80, 'frozen smoothie': 33, 'carrots': 3, 'yams': 25, 'shallot': 4, 'butter': 52, 'light mayo': 11, 'pepper': 61, 'candy bars': 25, 'cooking oil': 21, 'milk': 181, 'green tea': 98, 'bug spray': 6, 'oil': 24, 'olive oil': 68, 'salmon': 30, 'cake': 98, 'almonds': 12, 'salt': 7

In [6]:
# finding 15 most popular items
sorted_sales = sorted(sales.items(), key=lambda x: x[1], reverse=True)
for i in range(15):
    print(sorted_sales[i])

('mineral water', 578)
('burgers', 576)
('turkey', 458)
('chocolate', 391)
('frozen vegetables', 373)
('spaghetti', 354)
('shrimp', 325)
('grated cheese', 293)
('eggs', 280)
('cookies', 270)
('french fries', 244)
('herb & pepper', 232)
('ground beef', 218)
('tomatoes', 212)
('milk', 181)


## Apriori Algorithm

In [7]:
# transforming data for apriori
transactions = []
for i in range(data_rows):
    temp_row = []
    for j in range(data_cols):
        item = data[j][i]
        if pd.isna(item): break
        else: temp_row.append(item)
    transactions.append(temp_row)
print(transactions[:5])

[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs'], ['chutney'], ['turkey', 'avocado'], ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea']]


#### Apriori Terms

Support:
- proportion of transactions an itemset appears in

Confidence:
- likelihood of item Y being purchased when X is purchased
- support(X,Y)/support(X)
- may be inflated if both X/Y are popular

Lift:
- likelihood of item Y being purchased when X is purchased and controls for popularity of item Y
- support(X,Y)/(support(X) x support(Y))
- 1: no association; <1: Y unlikely bought if X bought; >1: Y likely bought if X bought

In [8]:
from apyori import apriori

# using apriori algorithm to find association rules with min support = 0.002, min confidence = 0.2, min lift = 2.5, rule length 2
rules = apriori(transactions = transactions, 
                min_support = 0.002, 
                min_confidence = 0.2, 
                min_lift = 2.5, 
                min_length = 2, 
                max_length = 2)
results = list(rules)
results[0] # example of an association rule

RelationRecord(items=frozenset({'burgers', 'almonds'}), support=0.005199306759098787, ordered_statistics=[OrderedStatistic(items_base=frozenset({'almonds'}), items_add=frozenset({'burgers'}), confidence=0.25490196078431376, lift=2.923577382023146)])

In [9]:
# transforming data into easier to read format
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    support     = [result[1] for result in results]
    confidence  = [result[2][0][2] for result in results]
    lift        = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, support, confidence, lift))
    
resultsinDataFrame = pd.DataFrame(inspect(results), 
                                  columns = ["Left Hand Side", "Right Hand Side", "Support", "Confidence", "Lift"])
resultsinDataFrame[:5]

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
0,almonds,burgers,0.005199,0.254902,2.923577
1,bacon,ground beef,0.002133,0.246154,2.505292
2,bacon,pancakes,0.002133,0.246154,2.589621
3,barbecue sauce,turkey,0.002533,0.234568,3.751586
4,blueberries,ground beef,0.0024,0.26087,2.655065


In [10]:
# displaying the results sorted by descending lift
sorted_results = resultsinDataFrame.sort_values(by="Lift", ascending=False)
sorted_results[:5]

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
11,fromage blanc,honey,0.003333,0.245098,5.164271
6,light cream,chicken,0.004533,0.290598,4.843951
8,pasta,escalope,0.005866,0.372881,4.700812
21,pasta,shrimp,0.005066,0.322034,4.506672
19,whole wheat pasta,olive oil,0.007999,0.271493,4.12241


## Applications

The association rule with the greatest lift is fromage blanc -> honey. About 1 out of 300 (0.33%) purchases contain both these items, and there is a 24.5% chance that honey is purchased when fromage blanc is purchased. For a supermarket, it may be a good idea to put a small display of honey near the cheese section. Additionally, a sale on fromage blanc should also increase sales on honey.

In [11]:
# Finding all instances of pasta on LHS
pasta_data = resultsinDataFrame[(resultsinDataFrame["Left Hand Side"] == "pasta") | (resultsinDataFrame["Left Hand Side"] == "whole wheat pasta")]
pasta_data.sort_values(by="Lift", ascending=False)

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
8,pasta,escalope,0.005866,0.372881,4.700812
21,pasta,shrimp,0.005066,0.322034,4.506672
19,whole wheat pasta,olive oil,0.007999,0.271493,4.12241
18,whole wheat pasta,milk,0.009865,0.334842,2.583999


The data above suggests that consumers who purchase regular pasta also tend to purchase a meat product, while people who purchase whole wheat pasta tend to purchase olive oil or milk. This observation may/may not hold for all association rules involving pasta/whole wheat pasta, but could lead to further investigation into consumers who purchase different types of pasta. For example, if the people who purchase regular pasta and whole wheat pasta are different groups, both pasta types should be on sale at the same time to increase sales of meat products, olive oil, and milk (since people won't buy both types of pasta).

The association rules I found only have a length of 2, but there are more rules with greater length. If I wanted to find more rules, I could lower the minimum support, confidence, and lift values. There may be more interesting and important insights in the data, especially for more popular items such as burgers and spaghetti.