In [1]:
import kagglehub

path = kagglehub.dataset_download("heeraldedhia/groceries-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/heeraldedhia/groceries-dataset?dataset_version_number=1...


100%|██████████| 257k/257k [00:00<00:00, 36.7MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/heeraldedhia/groceries-dataset/versions/1





In [2]:
import pandas as pd
import mlxtend
from mlxtend.frequent_patterns import apriori, association_rules

In [30]:
df = pd.read_csv("Groceries_dataset.csv")
# One row per member, one column per item; True if the member ever bought the item.
df1 = pd.crosstab(df['Member_number'], df['itemDescription']) > 0
frequent = apriori(df1, min_support = .05, use_colnames = True)
frequent


      support                                        itemsets
0    0.078502                                      (UHT-milk)
1    0.119548                                          (beef)
2    0.079785                                       (berries)
3    0.062083                                     (beverages)
4    0.158799                                  (bottled beer)
..        ...                                             ...
160  0.050539  (other vegetables, whole milk, tropical fruit)
161  0.071832          (yogurt, whole milk, other vegetables)
162  0.065162                  (whole milk, soda, rolls/buns)
163  0.065931                (yogurt, whole milk, rolls/buns)


In [41]:
associate = association_rules(frequent, metric = "confidence", min_threshold = .3)
print(associate.iloc[:, :7])

                  antecedents         consequents  ...  confidence      lift
0                      (beef)  (other vegetables)  ...    0.424893  1.128223
1                      (beef)        (whole milk)  ...    0.536481  1.170886
2              (bottled beer)  (other vegetables)  ...    0.431341  1.145345
3              (bottled beer)        (rolls/buns)  ...    0.397415  1.136555
4              (bottled beer)              (soda)  ...    0.347334  1.107946
..                        ...                 ...  ...         ...       ...
127      (yogurt, rolls/buns)        (whole milk)  ...    0.592166  1.292420
128  (whole milk, rolls/buns)            (yogurt)  ...    0.369253  1.304939
129            (yogurt, soda)        (whole milk)  ...    0.557895  1.217622
130      (yogurt, whole milk)              (soda)  ...    0.361158  1.152042
131        (soda, whole milk)            (yogurt)  ...    0.359932  1.271999

[132 rows x 7 columns]


Each rule above says that baskets containing the antecedent also tend to contain the consequent. Only rules with a confidence of at least 0.3 are kept.

1. Support is the frequency of an itemset across all of the baskets, so it measures how often the itemset occurs, or how probable it is to occur. Confidence measures how strongly the antecedent implies the consequent: it is the number of baskets containing both, divided by the number of baskets containing the antecedent. Lift is the confidence divided by the support of the consequent, so it compares how often the antecedent and consequent occur together with how often they would occur together if they were independent. A lift greater than 1 indicates a positive correlation, and a lift less than 1 indicates a negative correlation.

2. Support is the basis for most of the other calculations: confidence, interestingness, and lift are all derived from it. It tells us how popular an itemset is, so it is the starting point for everything else. Confidence then filters the itemsets that reach the minimum support, keeping only those whose items occur together often relative to how often the antecedent occurs. It doesn't mean much if a pair reaches the support threshold but its items occur together once and separately 20 times; that is more likely a coincidence than a real association. Lift compares the observed confidence with what would be expected if the antecedent and consequent were independent, so it suggests a positive or negative correlation depending on whether they occur together more or less often than chance. Lift is computed from the confidence, and it helps decide whether a rule we are confident in actually has meaning. In that sense it measures the strength of the association between two itemsets.
Association rules are built by first tracking the support of single items and then using the monotonicity property (every subset of a frequent itemset is also frequent) to build up frequent pairs, frequent triplets, and so on. Once we have the frequent itemsets, we look for rules among them: for each one we check how often the consequent occurs given the antecedent, and keep only the rules that reach a minimum confidence, because we want them to have a certain strength of association.
Lastly, once we find association rules, we can use them to decide where to place items. If a rule has a high confidence or a positive correlation, placing the consequent next to the antecedent item or items should help, because people like to buy them together. A negative correlation suggests the items are not usually bought together, so placing them side by side wouldn't do us any good. We can also put emphasis on the rules with the highest confidence, since those are the purchases most likely to happen together.
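
The definitions above can be checked by hand on a small example. The sketch below uses made-up baskets (not the groceries data) and hypothetical helper names (`support`, `confidence`, `lift`) to compute the three metrics directly, and uses the monotonicity property to build candidate pairs from frequent single items only.

```python
from itertools import combinations

# Toy baskets, made up for illustration (not taken from the groceries data).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
    {"bread"},
    {"milk", "bread", "butter"},
]
n = len(baskets)

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / n

def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's support; 1.0 means independence."""
    return confidence(antecedent, consequent) / support(consequent)

min_support = 0.5
# Monotonicity: a pair can only be frequent if both of its items are frequent,
# so candidate pairs are built from the frequent single items only.
frequent_items = sorted(i for i in set().union(*baskets) if support({i}) >= min_support)
frequent_pairs = [p for p in combinations(frequent_items, 2) if support(p) >= min_support]

for a, b in frequent_pairs:
    print(f"{a} -> {b}: support={support((a, b)):.2f}, "
          f"confidence={confidence({a}, {b}):.2f}, lift={lift({a}, {b}):.2f}")
```

In this toy data, bread and butter are each frequent on their own but the pair is not, so it is dropped. The rule bread -> milk has a lift below 1 and butter -> milk has a lift above 1, which shows that a rule can pass a confidence threshold and still be negatively correlated.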

In [159]:
from collections import defaultdict
from itertools import combinations

def PCY(transaction, supp_threshold, conf_threshold, n_buckets=50):
    n = len(transaction)
    item_counts = defaultdict(int)
    hash_table = defaultdict(int)

    # Pass 1: count single items and hash every pair in each basket into a bucket.
    for basket in transaction:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(basket, 2):
            hash_table[hash(pair) % n_buckets] += 1

    # Frequent items, and a bitmap marking the buckets that meet the support threshold.
    freqitems = {item for item, count in item_counts.items() if count / n >= supp_threshold}
    bitmap = {bucket: count / n >= supp_threshold for bucket, count in hash_table.items()}

    # Pass 2: count only pairs that hash to a frequent bucket and whose items are both frequent.
    candidate_pairs = defaultdict(int)
    for basket in transaction:
        for pair in combinations(basket, 2):
            if bitmap[hash(pair) % n_buckets] and pair[0] in freqitems and pair[1] in freqitems:
                candidate_pairs[pair] += 1

    # NOTE: count is a raw number of occurrences while supp_threshold is a fraction,
    # so this filter keeps every candidate pair that was counted at least once.
    frequentpairs = {pair: count for pair, count in candidate_pairs.items() if count >= supp_threshold}

    # Confidence of pair[0] -> pair[1]: pair count divided by the count of the first item.
    confident_itemsets = {}
    for pair, count in frequentpairs.items():
        conf = count / item_counts[pair[0]]
        if conf >= conf_threshold:
            confident_itemsets[pair] = conf
    return confident_itemsets

# Each basket is the full list of items a member bought. Repeat purchases are kept and
# pairs follow list order, so ('a', 'b') and ('b', 'a') are counted separately.
items = df.groupby('Member_number')['itemDescription'].apply(list).tolist()
print(PCY(items, .05, .3))




{('canned beer', 'whole milk'): 0.4407252440725244, ('sausage', 'whole milk'): 0.685064935064935, ('sausage', 'yogurt'): 0.41233766233766234, ('whole milk', 'whole milk'): 0.37849720223820943, ('frankfurter', 'whole milk'): 0.656896551724138, ('frankfurter', 'soda'): 0.4689655172413793, ('frankfurter', 'rolls/buns'): 0.5086206896551724, ('beef', 'whole milk'): 0.5968992248062015, ('beef', 'soda'): 0.32945736434108525, ('beef', 'rolls/buns'): 0.37209302325581395, ('sausage', 'soda'): 0.42857142857142855, ('sausage', 'rolls/buns'): 0.47943722943722944, ('whole milk', 'rolls/buns'): 0.3225419664268585, ('curd', 'whole milk'): 0.31906614785992216, ('tropical fruit', 'whole milk'): 0.5067829457364341, ('tropical fruit', 'other vegetables'): 0.37112403100775193, ('root vegetables', 'rolls/buns'): 0.3099906629318394, ('other vegetables', 'rolls/buns'): 0.3103266596417281, ('other vegetables', 'whole milk'): 0.37565858798735513, ('pip fruit', 'rolls/buns'): 0.3588709677419355, ('pip fruit', 'w
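
The output above contains the pair ('whole milk', 'whole milk'), which comes from repeat purchases inside a basket, and the `count >= supp_threshold` check in `PCY` compares a raw count with a fraction. As a point of comparison, here is a minimal sketch of a PCY frequent-pair pass that deduplicates each basket and applies the support threshold as a fraction of baskets. The names `pcy_frequent_pairs` and `bucket_of`, the use of `zlib.crc32` as the hash, and the toy baskets are my own assumptions for illustration. The sketch stops at frequent pairs and leaves out the confidence step, and it has not been run on the groceries data here.

```python
from collections import Counter
from itertools import combinations
import zlib

def bucket_of(pair, n_buckets):
    # crc32 gives the same bucket on every run, unlike the built-in hash() on strings.
    return zlib.crc32("|".join(pair).encode()) % n_buckets

def pcy_frequent_pairs(transactions, min_support, n_buckets=50):
    """Return {pair: support} for pairs whose support is at least min_support."""
    # Deduplicate and sort each basket so every pair has one canonical order.
    baskets = [sorted(set(b)) for b in transactions]
    n = len(baskets)
    min_count = min_support * n

    # Pass 1: count items and hash each pair into a bucket.
    item_counts = Counter()
    bucket_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(basket, 2):
            bucket_counts[bucket_of(pair, n_buckets)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_count}
    frequent_buckets = {b for b, c in bucket_counts.items() if c >= min_count}

    # Pass 2: count only pairs of frequent items that fall in a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        kept = [i for i in basket if i in frequent_items]
        for pair in combinations(kept, 2):
            if bucket_of(pair, n_buckets) in frequent_buckets:
                pair_counts[pair] += 1

    # The support threshold is applied as a fraction of baskets, the same as for items.
    return {p: c / n for p, c in pair_counts.items() if c >= min_count}

# Toy baskets, made up for illustration; the first one has a repeat purchase.
toy = [
    ["milk", "bread", "milk"],
    ["milk", "bread", "butter"],
    ["butter", "milk"],
    ["bread"],
]
print(pcy_frequent_pairs(toy, min_support=0.5))
print(pcy_frequent_pairs(toy, min_support=0.5, n_buckets=3))
```

Because the final filter is an exact count, the bucket count in this version only changes how many candidate pairs get counted in the second pass, not which pairs come out as frequent.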

In [164]:
# Itemset behind each Apriori rule: antecedent items followed by consequent items.
apriorisets = [list(a) + list(c)
               for a, c in zip(associate['antecedents'], associate['consequents'])]

# Pairs returned by PCY (the keys of the confidence dict).
pcysets = list(PCY(items, .05, .3).keys())

print("Number of association rules found in apriori algorithm")
print(len(apriorisets))
print()
print("Number of frequent item sets found in PCY algorithm")
print(len(pcysets))
print()

# Count every (rule, pair) combination that contains exactly the same items.
num_of_similar_sets = 0
for i in apriorisets:
    for j in pcysets:
        if set(i) == set(j):
            num_of_similar_sets += 1
print("Number of similar sets")
print(num_of_similar_sets)



Number of association rules found in apriori algorithm
132

Number of frequent item sets found in PCY algorithm
84

Number of similar sets
43


Comparing the two algorithms, Apriori found a lot of association rules quickly on data of this size. On larger data, PCY is expected to find frequent itemsets faster, because its hash buckets prune candidate pairs before they are counted, so it does less work in the pair-counting pass. Comparing the PCY pairs with the itemsets behind the Apriori rules gave 43 matches. I also noticed that changing the number of hash buckets in my PCY implementation can change the number of itemsets it finds drastically.
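
One caveat about the match count: the nested loop counts every (rule, pair) combination, so an itemset that shows up as both A -> B and B -> A can be counted more than once. If the goal is the number of distinct itemsets both methods agree on, one option is to normalize everything to frozensets first. The sketch below uses a hypothetical helper `unique_itemsets` and made-up stand-in lists rather than the real `apriorisets` and `pcysets`.

```python
def unique_itemsets(collections_of_items):
    """Normalize each itemset to a frozenset so order and rule direction do not matter."""
    return {frozenset(items) for items in collections_of_items}

# Hypothetical stand-ins for the Apriori rule itemsets and the PCY pairs.
apriori_like = [["beef", "whole milk"], ["whole milk", "beef"], ["yogurt", "soda", "whole milk"]]
pcy_like = [("beef", "whole milk"), ("whole milk", "beef"), ("sausage", "soda")]

# Set intersection counts each shared itemset once.
shared = unique_itemsets(apriori_like) & unique_itemsets(pcy_like)
print(len(shared), shared)
```

On these stand-in lists the nested-loop approach would report four matches for what is a single shared itemset.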