<h1> Task 1: Implement the Apriori algorithm to mine frequent itemsets </h1>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools

In [1]:
# Make dummy data
data = pd.DataFrame(np.random.randint(0, 2, size=(10, 8)), columns=list('ABCDEFGH'))
data

Unnamed: 0,A,B,C,D,E,F,G,H
0,0,1,1,1,1,0,0,0
1,0,1,0,0,1,1,1,1
2,0,0,1,0,1,0,0,0
3,1,0,1,1,0,1,0,1
4,0,0,1,1,1,0,1,0
5,0,0,1,1,0,1,0,0
6,1,0,1,0,0,0,0,0
7,1,1,1,1,0,0,0,1
8,0,1,0,0,1,1,1,1
9,1,0,1,1,0,1,0,0


In [2]:
# Count the number of 0s and 1s in each column
# The number of 1s is the number of times each item appears
value_counts = data.apply(pd.value_counts)
value_counts

Unnamed: 0,A,B,C,D,E,F,G,H
0,6,6,2,4,5,5,7,6
1,4,4,8,6,5,5,3,4


In [3]:
value_counts['A'][1]

4

Using the lecture notes explanation of the Apriori Algorithm, we have 4 steps to do.
1. Candidate Generation
2. Candidate Pruning
3. Support Counting
4. Candidate Elimination

Sample code for 1 and 2 itemset

Define the min support

In [4]:
min_support = 4

In [100]:
# Combined dictionary of frequent itemsets
combined_freq_itemsets = {}

Generate F1 (frequent 1-itemsets)

In [101]:
# Get the frequent itemsets with count greater than or equal to min_support
columns = data.columns
frequent_itemsets = {}
for column in columns:
    # Append the itemset and its count to the dictionary if the count is greater than or equal to min_support
    if value_counts[column][1] >= min_support:
        frequent_itemsets[column] = value_counts[column][1]
        # frequent_itemsets.append((column, value_counts[column][1]))
        # data.drop(column, axis=1, inplace=True)

print(frequent_itemsets)

dummy_dict = frequent_itemsets.copy()
for key, item in dummy_dict.copy().items():
    dummy_dict[(tuple(key))] = dummy_dict.pop(key)
print(dummy_dict)
    
combined_freq_itemsets.update(dummy_dict)

{'A': 4, 'B': 4, 'C': 8, 'D': 6, 'E': 5, 'F': 5, 'H': 4}
{('A',): 4, ('B',): 4, ('C',): 8, ('D',): 6, ('E',): 5, ('F',): 5, ('H',): 4}


Step 1: Candidate Generation

In [102]:
# Generate all possible combinations of frequent itemsets with k+1 items
combinations = []
k = 1
combinations.append(list(itertools.combinations(frequent_itemsets.keys(), k+1)))

combinations

[[('A', 'B'),
  ('A', 'C'),
  ('A', 'D'),
  ('A', 'E'),
  ('A', 'F'),
  ('A', 'H'),
  ('B', 'C'),
  ('B', 'D'),
  ('B', 'E'),
  ('B', 'F'),
  ('B', 'H'),
  ('C', 'D'),
  ('C', 'E'),
  ('C', 'F'),
  ('C', 'H'),
  ('D', 'E'),
  ('D', 'F'),
  ('D', 'H'),
  ('E', 'F'),
  ('E', 'H'),
  ('F', 'H')]]

Step 2: Candidate Pruning (do not need to prune for 2 itemset as F1 items are all frequent)

Step 3: Support Counting

In [103]:
# Convert the list of lists of tuples to a list of tuples
combinations = combinations[0]

In [104]:
# Count the number of occurences of each combination in the data
combinations_count = {}
for combination in combinations:
    # Using groupby and size to count the number of occurences of each combination
    # Resetting the index to get the count of each combination as a column in the dataframe
    test = data.groupby(list(combination)).size().reset_index(name='count')
    
    # Append the combination and its count to the dictionary
    # The count of each combination is the last value in the count column
    # Moreover, we need to check whether the last row is a combination of 1s instead of 1s and 0s
    # If it is a combination of 1s, then we append the combination and its count to the dictionary
    # Otherwise, we do not append it to the dictionary
    if test[test.columns[0]].iloc[-1] == 1 and test[test.columns[1]].iloc[-1] == 1:
        combinations_count[combination] = test['count'].iloc[-1]

# print(test)
combinations_count

{('A', 'B'): 1,
 ('A', 'C'): 4,
 ('A', 'D'): 3,
 ('A', 'F'): 2,
 ('A', 'H'): 2,
 ('B', 'C'): 2,
 ('B', 'D'): 2,
 ('B', 'E'): 3,
 ('B', 'F'): 2,
 ('B', 'H'): 3,
 ('C', 'D'): 6,
 ('C', 'E'): 3,
 ('C', 'F'): 3,
 ('C', 'H'): 2,
 ('D', 'E'): 2,
 ('D', 'F'): 3,
 ('D', 'H'): 2,
 ('E', 'F'): 2,
 ('E', 'H'): 2,
 ('F', 'H'): 3}

In [105]:
# test.index.values[-1].count(1)
test1 = test
count = test1['count'].iloc[-1]
count

3

Step 4: Candidate Elimination

In [106]:
# Prune the combinations with count less than min_support
for combination in combinations_count.copy().keys():
    if combinations_count[combination] < min_support:
        combinations_count.pop(combination)

print(combinations_count)
combined_freq_itemsets.update(combinations_count)

{('A', 'C'): 4, ('C', 'D'): 6}


Candidate generation for 2 or more frequent itemsets

In [107]:
# Merge the combinations if the first k-1 items are the same
# and the last item is different
# This is done to generate combinations with k+1 items
# from combinations with k items

# Compare first k-1 items of each combination
# If they are the same, merge them
# If they are not the same, do not merge them
# The merged combinations are stored in a dictionary
merged_combinations = {}
# for combination1 in combinations_count.keys():
#     for combination2 in combinations_count.keys():
#         # Check if the first k-1 items are the same
#         if combination1[:-1] == combination2[:-1]:
#             # Check if the last item is different
#             if combination1[-1] != combination2[-1]:
#                 # Merge the combinations
#                 merged_combinations[combination1 + (combination2[-1],)] = 0

for index, combination1 in enumerate(combinations_count.keys()):
    for combination2 in list(combinations_count.keys())[index+1:]:
        # Check if the first k-1 items are the same
        if combination1[:-1] == combination2[:-1]:
            # Check if the last item is different
            if combination1[-1] != combination2[-1]:
                # Merge the combinations
                merged_combinations[combination1 + (combination2[-1],)] = 0


merged_combinations


{}

Support counting

In [108]:
# Count the number of occurences of each combination in the data
merged_combinations_count = {}
for combination in merged_combinations.keys():
    # Using groupby and size to count the number of occurences of each combination
    # Resetting the index to get the count of each combination as a column in the dataframe
    test = data.groupby(list(combination)).size().reset_index(name='count')

    # Append the combination and its count to the dictionary
    # The count of each combination is the last value in the count column
    # as the last row of the dataframe is when both items are present in one transaction in the original data dataframe
    merged_combinations_count[combination] = test['count'].iloc[-1]

# print(test)
merged_combinations_count

{}

In [109]:
# Prune the combinations with count less than min_support
for combination in merged_combinations_count.copy().keys():
    if merged_combinations_count[combination] < min_support:
        merged_combinations_count.pop(combination)

print(merged_combinations_count)
combined_freq_itemsets.update(merged_combinations_count)


{}


In [110]:
combined_freq_itemsets 

{('A',): 4,
 ('B',): 4,
 ('C',): 8,
 ('D',): 6,
 ('E',): 5,
 ('F',): 5,
 ('H',): 4,
 ('A', 'C'): 4,
 ('C', 'D'): 6}

Part 2: Rule generation

In [166]:
lis = ['Mineral Water', 'Ground Beef', 'Spagetti']

for i in range(1, len(lis)):  #  xrange will return the values 1,2,3,4 in this loop
    combinations = []
    combinations.append(list(itertools.combinations(lis, i)))
    if combinations:
        combinations = combinations[0]
        print(combinations)
        for combination in combinations:
            print(combination)

combinations

[('Mineral Water',), ('Ground Beef',), ('Spagetti',)]
('Mineral Water',)
('Ground Beef',)
('Spagetti',)
[('Mineral Water', 'Ground Beef'), ('Mineral Water', 'Spagetti'), ('Ground Beef', 'Spagetti')]
('Mineral Water', 'Ground Beef')
('Mineral Water', 'Spagetti')
('Ground Beef', 'Spagetti')


[('Mineral Water', 'Ground Beef'),
 ('Mineral Water', 'Spagetti'),
 ('Ground Beef', 'Spagetti')]

In [137]:
# Generate rules for frequent itemsets with k+1 items with min confidence
# The rules are generated by splitting the combination into two parts
min_confidence = 0.5
rules = {}
for key in combined_freq_itemsets.keys():
    combinations = []
    for i in range(1, len(key)):  #  xrange will return the values 1,2,3,4 in this loop
        combinations.append(list(itertools.combinations(key, i)))
        if combinations:
            combinations = combinations[0]
            for combination in combinations:
                antecedent = combination
                consequent = tuple(set(key) - set(combination))
                confidence = combined_freq_itemsets[key] / combined_freq_itemsets[antecedent]
                if confidence >= min_confidence:
                    rules[(antecedent, consequent)] = confidence
    # Split the combination into two parts
    # The first part is the antecedent and the second part is the consequent
    # for i in range(1, len(key)):
    #     antecedent_1 = key[:i]
    #     consequent_1 = key[i:]

    #     antecedent_2 = key[i:]
    #     consequent_2 = key[:i]

    #     # Calculate the confidence of the rule
    #     # Confidence = support of combination / support of antecedent
    #     confidence_1 = combined_freq_itemsets[key] / combined_freq_itemsets[antecedent_1]
    #     confidence_2 = combined_freq_itemsets[key] / combined_freq_itemsets[antecedent_2]

    #     # Check if the confidence is greater than min_confidence
    #     if confidence_1 >= min_confidence:
    #         # Append the rule to the rules dictionary
    #         rules[(antecedent_1, consequent_1)] = confidence_1
        
    #     if confidence_2 >= min_confidence:
    #         # Append the rule to the rules dictionary
    #         rules[(antecedent_2, consequent_2)] = confidence_2

rules


{(('A',), ('C',)): 1.0,
 (('C',), ('A',)): 0.5,
 (('C',), ('D',)): 0.75,
 (('D',), ('C',)): 1.0}

In [17]:
# Prune smaller rules based on confidence of larger rules
# If larger rule has confidence less than min_confidence, smaller rules are pruned

# Sort the rules in descending order of confidence
sorted_rules = sorted(rules.items(), key=lambda x: x[1], reverse=True)
sorted_rules

# Prune the rules
pruned_rules = {}
for rule in sorted_rules:
    # Append the rule to the pruned_rules dictionary if it is not a subset of any rule in the dictionary
    if not any([set(rule[0]).issubset(set(pruned_rule[0])) for pruned_rule in pruned_rules.keys()]):
        pruned_rules[rule[0]] = rule[1]

pruned_rules

{(('A',), ('C',)): 1.0,
 (('A',), ('E',)): 1.0,
 (('A',), ('C', 'E')): 1.0,
 (('A', 'C'), ('E',)): 1.0,
 (('C',), ('D',)): 0.75}

In [59]:
# Apriori algorithm
# We combine the above steps to generate frequent itemsets with k+1 items
# from frequent itemsets with k items
# We continue this process until we get no frequent itemsets with k+1 items
# We then combine the frequent itemsets with k items to generate association rules
# We continue this process until we get no association rules
# We then combine the association rules to generate association rules with k+1 items


# Function to generate frequent itemsets with 1 item (initialisation)
def generate_freq_1_itemsets(data, min_support, combined_freq_itemsets):
    # Count the number of 0s and 1s in each column
    # The number of 1s is the number of times each item appears
    value_counts = data.apply(pd.value_counts)

    # Get the frequent itemsets with count greater than or equal to min_support
    columns = data.columns
    frequent_itemsets = {}
    for column in columns:
        # Append the itemset and its count to the dictionary if the count is greater than or equal to min_support
        if value_counts[column][1] >= min_support:
            frequent_itemsets[column] = value_counts[column][1]
            # frequent_itemsets.append((column, value_counts[column][1]))
            # data.drop(column, axis=1, inplace=True)

    dummy_dict = frequent_itemsets.copy()
    for key, item in dummy_dict.copy().items():
        # For dummy data
        # dummy_dict[(tuple(key))] = dummy_dict.pop(key)
        # For real data
        dummy_dict[(key,)] = dummy_dict.pop(key)
    print(dummy_dict)

    combined_freq_itemsets.update(dummy_dict)

    print(frequent_itemsets)
    return frequent_itemsets


# Function to generate frequent itemsets with k+1 items
def generate_k_plus_1_candidate_itemsets(frequent_itemsets, k):
    # Generate all possible combinations of frequent itemsets with k+1 items

    # If k = 1, we do not need to merge the combinations
    if k == 1:
        combinations = []
        combinations.append(list(itertools.combinations(frequent_itemsets.keys(), k+1)))
        return combinations
    
    else:
        # Merge the combinations if the first k-1 items are the same
        # and the last item is different
        # This is done to generate combinations with k+1 items
        # from combinations with k items
        # Compare first k-1 items of each combination
        # If they are the same, merge them
        # If they are not the same, do not merge them
        # The merged combinations are stored in a dictionary
        merged_combinations = {}
        

        for index, combination1 in enumerate(frequent_itemsets.keys()):
            for combination2 in list(frequent_itemsets.keys())[index+1:]:
                # Check if the first k-1 items are the same
                if combination1[:-1] == combination2[:-1]:
                    # Check if the last item is different
                    if combination1[-1] != combination2[-1]:
                        # Merge the combinations
                        merged_combinations[combination1 + (combination2[-1],)] = 0

    
        return merged_combinations

# Function to count the number of occurences of each combination in the candidate itemsets
def k_plus_1_itemsets_support_counting(k_plus_1_candidate_itemsets, k, data):
    # If k = 1, we need to convert the list of lists of tuples to a list of tuples
    if k == 1:
        k_plus_1_candidate_itemsets = k_plus_1_candidate_itemsets[0]

    # Count the number of occurences of each combination in the data
    candidate_itemsets_count = {}
    for candidate_itemset in k_plus_1_candidate_itemsets:
        # Using groupby and size to count the number of occurences of each combination
        # Resetting the index to get the count of each combination as a column in the dataframe
        test = data.groupby(list(candidate_itemset)).size().reset_index(name='count')

        # Append the combination and its count to the dictionary
        # The count of each combination is the last value in the count column
        # Moreover, we need to check whether the last row is a combination of 1s instead of 1s and 0s
        # If it is a combination of 1s, then we append the combination and its count to the dictionary
        # Otherwise, we do not append it to the dictionary
        # if test[test.columns[0]].iloc[-1] == 1 and test[test.columns[1]].iloc[-1] == 1:
        #     candidate_itemsets_count[candidate_itemset] = test['count'].iloc[-1]
        num_ones = 0
        print(test, len(test.columns))
        for i in range(len(test.columns)-1):
            if test[test.columns[i]].iloc[-1] != 1:
                break
            else:
                num_ones += 1
                continue
            
        if num_ones == len(test.columns)-1:
            candidate_itemsets_count[candidate_itemset] = test['count'].iloc[-1]

    return candidate_itemsets_count


def candidate_elimination(combinations_count, min_support, combined_freq_itemsets):
    
    # Prune the combinations with count less than min_support
    for combination in combinations_count.copy().keys():
        if combinations_count[combination] < min_support:
            combinations_count.pop(combination)
    
    combined_freq_itemsets.update(combinations_count)
    return combinations_count

def generate_rules(combined_freq_itemsets, min_confidence):
    # Generate rules for frequent itemsets with k+1 items with min confidence
    # The rules are generated by splitting the combination into two parts
    rules = {}
    for key in combined_freq_itemsets.keys():
        
        for i in range(1, len(key)):  #  xrange will return the values 1,2,3,4 in this loop
            combinations = []
            combinations.append(list(itertools.combinations(key, i)))
            print(key, combinations)
            if combinations:
                combinations = combinations[0]
                for combination in combinations:
                    if type(combination) == str:
                        combination = (combination,)
                    antecedent = combination
                    consequent = tuple(set(key) - set(combination))
                    # print("Combinations is ", combinations, "Combination is: ", combination, "Antecedent is: ", antecedent, "Consequent is: ", consequent)
                    confidence = combined_freq_itemsets[key] / combined_freq_itemsets[antecedent]
                    # print('antecedent: ', antecedent, 'consequent: ', consequent, 'confidence: ', confidence)
                    if confidence >= min_confidence:
                        rules[(antecedent, consequent)] = confidence
        # Split the combination into two parts
        # The first part is the antecedent and the second part is the consequent
        # for i in range(1, len(key)):
        #     antecedent_1 = key[:i]
        #     consequent_1 = key[i:]

        #     antecedent_2 = key[i:]
        #     consequent_2 = key[:i]
        #     # Calculate the confidence of the rule
        #     # Confidence = support of combination / support of antecedent
        #     confidence_1 = combined_freq_itemsets[key] / combined_freq_itemsets[antecedent_1]
        #     confidence_2 = combined_freq_itemsets[key] / combined_freq_itemsets[antecedent_2]

        #     print(antecedent_1, consequent_1, confidence_1)
        #     print(antecedent_2, consequent_2, confidence_2)
        #     # Check if the confidence is greater than min_confidence
        #     if confidence_1 >= min_confidence:
        #         # Append the rule to the rules dictionary
        #         rules[(antecedent_1, consequent_1)] = confidence_1

        #     if confidence_2 >= min_confidence:
        #         # Append the rule to the rules dictionary
        #         rules[(antecedent_2, consequent_2)] = confidence_2
                
    return rules
    

In [34]:
def my_apriori(data, min_support, min_confidence):
    
    # Combined dictionary of frequent itemsets
    combined_freq_itemsets = {}

    # Get frequent 1 itemsets
    frequent_1_itemsets = generate_freq_1_itemsets(data, min_support, combined_freq_itemsets)

    k_plus_1_candidate_itemsets = None
    k_plus_1_itemsets_support_count = None
    k_plus_1_frequent_itemsets = None
    
    k = 1

    while True:
        # print(k)
        if k == 1:
            k_plus_1_candidate_itemsets = generate_k_plus_1_candidate_itemsets(frequent_1_itemsets, k)
        else:
            k_plus_1_candidate_itemsets = generate_k_plus_1_candidate_itemsets(k_plus_1_frequent_itemsets, k)
        print(combined_freq_itemsets)
        # print(k_plus_1_candidate_itemsets)
        k_plus_1_itemsets_support_count = k_plus_1_itemsets_support_counting(k_plus_1_candidate_itemsets, k, data)
        
        k_plus_1_frequent_itemsets = candidate_elimination(k_plus_1_itemsets_support_count, min_support, combined_freq_itemsets)
        # print(k_plus_1_frequent_itemsets)
        k += 1
        print('k: ', k)
        # If there are no frequent itemsets with k+1 items, break
        if len(k_plus_1_frequent_itemsets) == 0:
            break

    # Generate rules for frequent itemsets with k+1 items with min confidence
    # The rules are generated by splitting the combination into two parts
    rules = generate_rules(combined_freq_itemsets, min_confidence)
    
    return combined_freq_itemsets, rules


In [139]:
combined_freq_itemsets, rules = my_apriori(data, 4, 0.5)

{('A',): 4, ('B',): 4, ('C',): 8, ('D',): 6, ('E',): 5, ('F',): 5, ('H',): 4}
{'A': 4, 'B': 4, 'C': 8, 'D': 6, 'E': 5, 'F': 5, 'H': 4}
{('A',): 4, ('B',): 4, ('C',): 8, ('D',): 6, ('E',): 5, ('F',): 5, ('H',): 4}
k:  2
{('A',): 4, ('B',): 4, ('C',): 8, ('D',): 6, ('E',): 5, ('F',): 5, ('H',): 4, ('A', 'C'): 4, ('C', 'D'): 6}
k:  3


In [140]:
print('combined frequent itemsets: ', combined_freq_itemsets)

combined frequent itemsets:  {('A',): 4, ('B',): 4, ('C',): 8, ('D',): 6, ('E',): 5, ('F',): 5, ('H',): 4, ('A', 'C'): 4, ('C', 'D'): 6}


In [141]:
freq_itemsets_df = pd.DataFrame.from_dict(combined_freq_itemsets, orient='index', columns=['support'])
freq_itemsets_df

Unnamed: 0,support
"(A,)",4
"(B,)",4
"(C,)",8
"(D,)",6
"(E,)",5
"(F,)",5
"(H,)",4
"(A, C)",4
"(C, D)",6


In [142]:
print('rules: ', rules)
for key, item in rules.items():
    for i in range(1, len(key)):
        antecedent = key[:i]
        consequent = key[i:]
        print('antecedent: ', list(sum(antecedent, ())), '-> consequent: ', list(sum(consequent, ())), 'confidence: ', item)

rules:  {(('A',), ('C',)): 1.0, (('C',), ('A',)): 0.5, (('C',), ('D',)): 0.75, (('D',), ('C',)): 1.0}
antecedent:  ['A'] -> consequent:  ['C'] confidence:  1.0
antecedent:  ['C'] -> consequent:  ['A'] confidence:  0.5
antecedent:  ['C'] -> consequent:  ['D'] confidence:  0.75
antecedent:  ['D'] -> consequent:  ['C'] confidence:  1.0


<h3> Showing the results of my code is correct by using the actual official Apriori algorithm library extension </h3>

In [24]:
!pip install mlxtend

Collecting mlxtend
  Obtaining dependency information for mlxtend from https://files.pythonhosted.org/packages/73/da/d5d77a9a7a135c948dbf8d3b873655b105a152d69e590150c83d23c3d070/mlxtend-0.23.0-py3-none-any.whl.metadata
  Downloading mlxtend-0.23.0-py3-none-any.whl.metadata (7.3 kB)
Downloading mlxtend-0.23.0-py3-none-any.whl (1.4 MB)
   ---------------------------------------- 0.0/1.4 MB ? eta -:--:--
   -------- ------------------------------- 0.3/1.4 MB 9.6 MB/s eta 0:00:01
   ---------------- ----------------------- 0.6/1.4 MB 7.4 MB/s eta 0:00:01
   ----------------------- ---------------- 0.8/1.4 MB 6.6 MB/s eta 0:00:01
   ----------------------------- ---------- 1.1/1.4 MB 6.1 MB/s eta 0:00:01
   -------------------------------------- - 1.4/1.4 MB 6.4 MB/s eta 0:00:01
   ---------------------------------------- 1.4/1.4 MB 6.2 MB/s eta 0:00:00
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.0



[notice] A new release of pip is available: 23.2.1 -> 23.3
[notice] To update, run: C:\Users\tengwei\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


In [143]:
from mlxtend.frequent_patterns import apriori, association_rules

freq_items = apriori(data, min_support=0.4, use_colnames=True)
freq_items



Unnamed: 0,support,itemsets
0,0.4,(A)
1,0.4,(B)
2,0.8,(C)
3,0.6,(D)
4,0.5,(E)
5,0.5,(F)
6,0.4,(H)
7,0.4,"(C, A)"
8,0.6,"(C, D)"


In [144]:
rules = association_rules(freq_items, metric='confidence', min_threshold=0.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(C),(A),0.8,0.4,0.4,0.5,1.25,0.08,1.2,1.0
1,(A),(C),0.4,0.8,0.4,1.0,1.25,0.08,inf,0.333333
2,(C),(D),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0
3,(D),(C),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5


<h1> Task 2: Use 3 datasets to run Apriori algorithm with different min-support thresholds </h1>

<h2> 1. Grocery store dataset </h2>

In [82]:
df = pd.read_csv('Market_Basket_Optimisation.csv', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [83]:
# Data CLeaning
df.fillna(0, inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,chutney,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,turkey,avocado,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,mineral water,milk,energy bar,whole wheat rice,green tea,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [84]:
# Get the unique items in the dataset
unique_items = pd.unique(df.values.ravel('K'))
unique_items

array(['shrimp', 'burgers', 'chutney', 'turkey', 'mineral water',
       'low fat yogurt', 'whole wheat pasta', 'soup', 'frozen vegetables',
       'french fries', 'eggs', 'cookies', 'spaghetti', 'meatballs',
       'red wine', 'rice', 'parmesan cheese', 'ground beef',
       'sparkling water', 'herb & pepper', 'pickles', 'energy bar',
       'fresh tuna', 'escalope', 'avocado', 'tomato sauce',
       'clothes accessories', 'energy drink', 'chocolate',
       'grated cheese', 'yogurt cake', 'mint', 'asparagus', 'champagne',
       'ham', 'muffins', 'french wine', 'chicken', 'pasta', 'tomatoes',
       'pancakes', 'frozen smoothie', 'carrots', 'yams', 'shallot',
       'butter', 'light mayo', 'pepper', 'candy bars', 'cooking oil',
       'milk', 'green tea', 'bug spray', 'oil', 'olive oil', 'salmon',
       'cake', 'almonds', 'salt', 'strong cheese', 'hot dogs', 'pet food',
       'whole wheat rice', 'antioxydant juice', 'honey', 'sandwich',
       'salad', 'magazines', 'protein bar', '

In [85]:
# Set the unique items as the column names
transactions_data = pd.DataFrame(columns=unique_items)
transactions_data.drop(columns= 0, inplace=True)
transactions_data

Unnamed: 0,shrimp,burgers,chutney,turkey,mineral water,low fat yogurt,whole wheat pasta,soup,frozen vegetables,french fries,...,ketchup,cream,hand protein bar,body spray,oatmeal,zucchini,water spray,tea,napkins,asparagus


In [86]:
# Iterate through the supermarket dataset
# Each row is a transaction
# If the item is present in the transaction, set the value as 1

for i in range(0, len(df)):
    transaction = df.iloc[i, :].values
    # Remove the 0s from the transaction
    transaction = transaction[transaction != 0]

    # Set the value as 1 if the item is present in the transaction
    for item in transaction:
        transactions_data.at[i, item] = 1

In [87]:
transactions_data.head()

Unnamed: 0,shrimp,burgers,chutney,turkey,mineral water,low fat yogurt,whole wheat pasta,soup,frozen vegetables,french fries,...,ketchup,cream,hand protein bar,body spray,oatmeal,zucchini,water spray,tea,napkins,asparagus
0,1.0,,,,1.0,1.0,,,,,...,,,,,,,,,,
1,,1.0,,,,,,,,,...,,,,,,,,,,
2,,,1.0,,,,,,,,...,,,,,,,,,,
3,,,,1.0,,,,,,,...,,,,,,,,,,
4,,,,,1.0,,,,,,...,,,,,,,,,,


In [89]:
transactions_data.fillna(0, inplace=True)
transactions_data.head()

Unnamed: 0,shrimp,burgers,chutney,turkey,mineral water,low fat yogurt,whole wheat pasta,soup,frozen vegetables,french fries,...,ketchup,cream,hand protein bar,body spray,oatmeal,zucchini,water spray,tea,napkins,asparagus
0,1,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [182]:
combined_freq_itemsets, rules = my_apriori(transactions_data, 100, 0.4)

{('shrimp',): 536, ('burgers',): 654, ('turkey',): 469, ('mineral water',): 1788, ('low fat yogurt',): 574, ('whole wheat pasta',): 221, ('soup',): 379, ('frozen vegetables',): 715, ('french fries',): 1282, ('eggs',): 1348, ('cookies',): 603, ('spaghetti',): 1306, ('meatballs',): 157, ('red wine',): 211, ('rice',): 141, ('parmesan cheese',): 149, ('ground beef',): 737, ('herb & pepper',): 371, ('energy bar',): 203, ('fresh tuna',): 167, ('escalope',): 595, ('avocado',): 250, ('tomato sauce',): 106, ('energy drink',): 200, ('chocolate',): 1229, ('grated cheese',): 393, ('yogurt cake',): 205, ('mint',): 131, ('champagne',): 351, ('ham',): 199, ('muffins',): 181, ('french wine',): 169, ('chicken',): 450, ('pasta',): 118, ('tomatoes',): 513, ('pancakes',): 713, ('frozen smoothie',): 475, ('carrots',): 115, ('butter',): 226, ('light mayo',): 204, ('pepper',): 199, ('cooking oil',): 383, ('milk',): 972, ('green tea',): 991, ('oil',): 173, ('olive oil',): 494, ('salmon',): 319, ('cake',): 608

In [183]:
print('combined frequent itemsets: ', combined_freq_itemsets)

combined frequent itemsets:  {('shrimp',): 536, ('burgers',): 654, ('turkey',): 469, ('mineral water',): 1788, ('low fat yogurt',): 574, ('whole wheat pasta',): 221, ('soup',): 379, ('frozen vegetables',): 715, ('french fries',): 1282, ('eggs',): 1348, ('cookies',): 603, ('spaghetti',): 1306, ('meatballs',): 157, ('red wine',): 211, ('rice',): 141, ('parmesan cheese',): 149, ('ground beef',): 737, ('herb & pepper',): 371, ('energy bar',): 203, ('fresh tuna',): 167, ('escalope',): 595, ('avocado',): 250, ('tomato sauce',): 106, ('energy drink',): 200, ('chocolate',): 1229, ('grated cheese',): 393, ('yogurt cake',): 205, ('mint',): 131, ('champagne',): 351, ('ham',): 199, ('muffins',): 181, ('french wine',): 169, ('chicken',): 450, ('pasta',): 118, ('tomatoes',): 513, ('pancakes',): 713, ('frozen smoothie',): 475, ('carrots',): 115, ('butter',): 226, ('light mayo',): 204, ('pepper',): 199, ('cooking oil',): 383, ('milk',): 972, ('green tea',): 991, ('oil',): 173, ('olive oil',): 494, ('s

In [188]:
index = 1
for key, item in rules.items():
    for i in range(1, len(key)):
        antecedent = key[:i]
        consequent = key[i:]
        print('Rule ', index, ': antecedent -> consequent: ', list(sum(antecedent, ())), '-> ', list(sum(consequent, ())), 'confidence: ', item)
        index += 1

Rule  1 : antecedent -> consequent:  ['soup'] ->  ['mineral water'] confidence:  0.45646437994722955
Rule  2 : antecedent -> consequent:  ['ground beef'] ->  ['mineral water'] confidence:  0.41655359565807326
Rule  3 : antecedent -> consequent:  ['olive oil'] ->  ['mineral water'] confidence:  0.4190283400809717
Rule  4 : antecedent -> consequent:  ['salmon'] ->  ['mineral water'] confidence:  0.4012539184952978
Rule  5 : antecedent -> consequent:  ['eggs', 'chocolate'] ->  ['mineral water'] confidence:  0.40562248995983935
Rule  6 : antecedent -> consequent:  ['mineral water', 'ground beef'] ->  ['spaghetti'] confidence:  0.4169381107491857
Rule  7 : antecedent -> consequent:  ['spaghetti', 'ground beef'] ->  ['mineral water'] confidence:  0.43537414965986393
Rule  8 : antecedent -> consequent:  ['spaghetti', 'chocolate'] ->  ['mineral water'] confidence:  0.40476190476190477
Rule  9 : antecedent -> consequent:  ['spaghetti', 'milk'] ->  ['mineral water'] confidence:  0.44360902255639

<h3> Verify with official Apriori library </h3>

In [96]:
transactions_data.shape

(7501, 120)

In [173]:
freq_items = apriori(transactions_data, min_support=0.0133, use_colnames=True)
freq_items



Unnamed: 0,support,itemsets
0,0.071457,(shrimp)
1,0.087188,(burgers)
2,0.062525,(turkey)
3,0.238368,(mineral water)
4,0.076523,(low fat yogurt)
...,...,...
182,0.013465,"(eggs, mineral water, chocolate)"
183,0.017064,"(ground beef, spaghetti, mineral water)"
184,0.015865,"(spaghetti, mineral water, chocolate)"
185,0.015731,"(spaghetti, mineral water, milk)"


In [174]:
rules = association_rules(freq_items, metric='confidence', min_threshold=0.4)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(soup),(mineral water),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255,0.503221
1,(ground beef),(mineral water),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
2,(olive oil),(mineral water),0.065858,0.238368,0.027596,0.419028,1.757904,0.011898,1.310962,0.461536
3,(salmon),(mineral water),0.042528,0.238368,0.017064,0.401254,1.683336,0.006927,1.272045,0.423972
4,"(eggs, chocolate)",(mineral water),0.033196,0.238368,0.013465,0.405622,1.701663,0.005552,1.281394,0.426498
5,"(ground beef, spaghetti)",(mineral water),0.039195,0.238368,0.017064,0.435374,1.826477,0.007722,1.348914,0.470957
6,"(ground beef, mineral water)",(spaghetti),0.040928,0.17411,0.017064,0.416938,2.394681,0.009938,1.41647,0.607262
7,"(spaghetti, chocolate)",(mineral water),0.039195,0.238368,0.015865,0.404762,1.698053,0.006522,1.279541,0.42786
8,"(spaghetti, milk)",(mineral water),0.035462,0.238368,0.015731,0.443609,1.861024,0.007278,1.368879,0.479672
9,"(milk, chocolate)",(mineral water),0.032129,0.238368,0.013998,0.435685,1.82778,0.00634,1.349656,0.467922


<h2> 2. Titanic dataset </h2>

In [23]:
survival_df = pd.read_csv('titanic/gender_submission.csv')
survival_df.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [24]:
train_titanic_df = pd.read_csv('titanic/train.csv')
test_titanic_df = pd.read_csv('titanic/test.csv')

In [25]:
train_titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [28]:
# Drop the columns that are not required
train_titanic_df.drop(columns=['PassengerId', 'Name','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin'], inplace=True)
train_titanic_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Embarked
0,0,3,male,22.0,S
1,1,1,female,38.0,C
2,1,3,female,26.0,S
3,1,1,female,35.0,S
4,0,3,male,35.0,S


In [29]:
# Categorise the Age column
# Age 21 and below is a Child
# Age between 21 and 55 is an Adult
# Age above 55 is an Elderly
train_titanic_df['Age'] = pd.cut(train_titanic_df['Age'], bins=[0, 21, 55, 80], labels=['Child', 'Adult', 'Elderly'])
train_titanic_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Embarked
0,0,3,male,Adult,S
1,1,1,female,Adult,C
2,1,3,female,Adult,S
3,1,1,female,Adult,S
4,0,3,male,Adult,S


In [30]:
# Convert into one hot encoding
train_titanic_df = pd.get_dummies(train_titanic_df)
train_titanic_df.head()

Unnamed: 0,Survived,Pclass,Sex_female,Sex_male,Age_Child,Age_Adult,Age_Elderly,Embarked_C,Embarked_Q,Embarked_S
0,0,3,0,1,0,1,0,0,0,1
1,1,1,1,0,0,1,0,1,0,0
2,1,3,1,0,0,1,0,0,0,1
3,1,1,1,0,0,1,0,0,0,1
4,0,3,0,1,0,1,0,0,0,1


In [31]:
# Converet Pclass into one hot encoding
train_titanic_df = pd.get_dummies(train_titanic_df, columns=['Pclass'])
train_titanic_df.head()


Unnamed: 0,Survived,Sex_female,Sex_male,Age_Child,Age_Adult,Age_Elderly,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
0,0,0,1,0,1,0,0,0,1,0,0,1
1,1,1,0,0,1,0,1,0,0,1,0,0
2,1,1,0,0,1,0,0,0,1,0,0,1
3,1,1,0,0,1,0,0,0,1,1,0,0
4,0,0,1,0,1,0,0,0,1,0,0,1


In [45]:
train_titanic_df.groupby(['Sex_male', 'Sex_female', 'Survived']).size().reset_index(name='count')

Unnamed: 0,Sex_male,Sex_female,Survived,count
0,0,1,0,81
1,0,1,1,233
2,1,0,0,468
3,1,0,1,109


In [49]:
len(train_titanic_df.columns)

12

In [60]:
# Run the apriori algorithm
combined_freq_itemsets, rules = my_apriori(train_titanic_df, 30, 0.4)

{('Survived',): 342, ('Sex_female',): 314, ('Sex_male',): 577, ('Age_Child',): 204, ('Age_Adult',): 470, ('Age_Elderly',): 40, ('Embarked_C',): 168, ('Embarked_Q',): 77, ('Embarked_S',): 644, ('Pclass_1',): 216, ('Pclass_2',): 184, ('Pclass_3',): 491}
{'Survived': 342, 'Sex_female': 314, 'Sex_male': 577, 'Age_Child': 204, 'Age_Adult': 470, 'Age_Elderly': 40, 'Embarked_C': 168, 'Embarked_Q': 77, 'Embarked_S': 644, 'Pclass_1': 216, 'Pclass_2': 184, 'Pclass_3': 491}
{('Survived',): 342, ('Sex_female',): 314, ('Sex_male',): 577, ('Age_Child',): 204, ('Age_Adult',): 470, ('Age_Elderly',): 40, ('Embarked_C',): 168, ('Embarked_Q',): 77, ('Embarked_S',): 644, ('Pclass_1',): 216, ('Pclass_2',): 184, ('Pclass_3',): 491}
   Survived  Sex_female  count
0         0           0    468
1         0           1     81
2         1           0    109
3         1           1    233 3
   Survived  Sex_male  count
0         0         0     81
1         0         1    468
2         1         0    233
3      

In [61]:
print('combined frequent itemsets: ', combined_freq_itemsets)

combined frequent itemsets:  {('Survived',): 342, ('Sex_female',): 314, ('Sex_male',): 577, ('Age_Child',): 204, ('Age_Adult',): 470, ('Age_Elderly',): 40, ('Embarked_C',): 168, ('Embarked_Q',): 77, ('Embarked_S',): 644, ('Pclass_1',): 216, ('Pclass_2',): 184, ('Pclass_3',): 491, ('Survived', 'Sex_female'): 233, ('Survived', 'Sex_male'): 109, ('Survived', 'Age_Child'): 87, ('Survived', 'Age_Adult'): 191, ('Survived', 'Embarked_C'): 93, ('Survived', 'Embarked_Q'): 30, ('Survived', 'Embarked_S'): 217, ('Survived', 'Pclass_1'): 136, ('Survived', 'Pclass_2'): 87, ('Survived', 'Pclass_3'): 119, ('Sex_female', 'Age_Child'): 84, ('Sex_female', 'Age_Adult'): 168, ('Sex_female', 'Embarked_C'): 73, ('Sex_female', 'Embarked_Q'): 36, ('Sex_female', 'Embarked_S'): 203, ('Sex_female', 'Pclass_1'): 94, ('Sex_female', 'Pclass_2'): 76, ('Sex_female', 'Pclass_3'): 144, ('Sex_male', 'Age_Child'): 120, ('Sex_male', 'Age_Adult'): 302, ('Sex_male', 'Age_Elderly'): 31, ('Sex_male', 'Embarked_C'): 95, ('Sex_m

In [62]:
index = 1
for key, item in rules.items():
    for i in range(1, len(key)):
        antecedent = key[:i]
        consequent = key[i:]
        print('Rule ', index, ': antecedent -> consequent: ', list(sum(antecedent, ())), '-> ', list(sum(consequent, ())), 'confidence: ', item)
        index += 1

Rule  1 : antecedent -> consequent:  ['Survived'] ->  ['Sex_female'] confidence:  0.6812865497076024
Rule  2 : antecedent -> consequent:  ['Sex_female'] ->  ['Survived'] confidence:  0.7420382165605095
Rule  3 : antecedent -> consequent:  ['Age_Child'] ->  ['Survived'] confidence:  0.4264705882352941
Rule  4 : antecedent -> consequent:  ['Survived'] ->  ['Age_Adult'] confidence:  0.5584795321637427
Rule  5 : antecedent -> consequent:  ['Age_Adult'] ->  ['Survived'] confidence:  0.40638297872340423
Rule  6 : antecedent -> consequent:  ['Embarked_C'] ->  ['Survived'] confidence:  0.5535714285714286
Rule  7 : antecedent -> consequent:  ['Survived'] ->  ['Embarked_S'] confidence:  0.6345029239766082
Rule  8 : antecedent -> consequent:  ['Pclass_1'] ->  ['Survived'] confidence:  0.6296296296296297
Rule  9 : antecedent -> consequent:  ['Pclass_2'] ->  ['Survived'] confidence:  0.47282608695652173
Rule  10 : antecedent -> consequent:  ['Age_Child'] ->  ['Sex_female'] confidence:  0.4117647058