# Task 1 - Apriori Algorithm for Recommender System (33 points)

**Goal:** The aim of this programming assignment is to implement the Apriori algorithm and apply it to mine frequent itemsets for recommendation. You are required to implement the algorithm from scratch by using only native Python libraries and NumPy. For efficiency you will need to convert the items to ids and sort them.

**Input:** The provided input file (`orders.txt`) contains 461 lists of items, each an order placed by a customer in an online retail shop. [1] Each line in the file corresponds to one order and lists the products that customer bought. An example:

*Alarm Clock Bakelite Green;Panda And Bunnies Sticker Sheet*

In the example above, a customer ordered the products "Alarm Clock Bakelite Green" and "Panda And Bunnies Sticker Sheet".
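
As a minimal sketch of how such a line can be split into product names (the variable names `line` and `items` are illustrative and not part of the assignment template):

```python
# One raw line as it would be read from orders.txt, including the newline.
line = "Alarm Clock Bakelite Green;Panda And Bunnies Sticker Sheet\n"

# Strip the trailing newline, then split on the ';' separator.
items = line.strip().split(";")
print(items)
```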

**Output:** Implement the Apriori algorithm and use it to mine frequent itemsets. Set the relative minimum support to 0.025 and run the algorithm on the 461 orders of retail shop products. In other words, you need to extract all the itemsets that have an absolute support greater than or equal to 12 (0.025 × 461 = 11.525, rounded up to the next whole order).
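
The absolute threshold follows from rounding the relative support up to the next whole order. A small sketch of that arithmetic, with illustrative variable names that do not appear in the assignment template:

```python
import math

n_orders = 461            # number of orders in orders.txt
rel_min_support = 0.025   # relative minimum support from the task

# An itemset is frequent if count / n_orders >= rel_min_support,
# i.e. count >= 11.525, so the smallest qualifying integer count is 12.
abs_min_support = math.ceil(rel_min_support * n_orders)
print(abs_min_support)
```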

[1] The dataset is a modified version of the France dataset from pycaret (https://github.com/pycaret/pycaret/blob/master/datasets/france.csv).

In [1]:
# TODO: uncomment the packages you used; please do not import additional non-native packages
# You may change the imports to the following format: from [package] import [class, method, etc.]

import collections
import itertools
#import numpy as np


## 1.1 Loading the data and preprocessing (3 points)
**Task:** Solve the tasks explained in the TODOs and comments.

In [2]:
# TODO: read the data from the input file /data/orders.txt (1 point)

orders = []

# A context manager closes the file even if reading fails.
with open("data/orders.txt", "r") as orders_file:
    for line in orders_file:
        line = line.strip()
        if line:  # skip blank lines, e.g. an empty line at the end of the file
            orders.append(line.split(";"))

In [3]:
# TODO: determine the unique items and map the item to ids using enumerate (1 point)
unique_items = []
seen = set()  # set membership is O(1), unlike scanning the list

for record in orders:
    for item in record:
        if item not in seen:
            seen.add(item)
            unique_items.append(item)

# ids are list positions, so the list itself serves as the id -> item lookup
id_to_item = unique_items
item_to_id = {item: item_id for item_id, item in enumerate(unique_items)}

In [4]:
# TODO: map the items of the records to ids and sort each record (1 point)
# In the following tasks use the mapped records to compute the frequent itemsets.
mapped_records = [sorted(item_to_id[item] for item in record) for record in orders]

## 1.2 Apriori algorithm (21 points)
### A) Prune the infrequent items (3 points)
**Task:** Solve the tasks explained in the TODOs and comments.
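
As a rough illustration of this step, the sketch below counts length-1 supports on toy records and keeps the frequent ones. The data and the names `toy_records`, `counts`, `min_support`, and `frequent` are made up for the example, and it assumes each record lists an item at most once.

```python
import collections

# Toy records of item ids (hypothetical, not taken from orders.txt).
toy_records = [[0, 1, 2], [0, 2], [1, 2], [2, 3]]

# Count in how many records each item occurs.
counts = collections.Counter()
for record in toy_records:
    counts.update(record)

# Keep only the items that reach the (toy) absolute minimum support,
# stored as length-1 tuples so they can be used as dictionary keys.
min_support = 2
frequent = {(item,): n for item, n in counts.items() if n >= min_support}
print(frequent)
```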

In [5]:
# TODO: calculate the support of length-1 itemsets using Counter or defaultdict (1 point)
l1_items = collections.Counter()

for record in mapped_records:
    # count each item at most once per order, so a repeated item cannot inflate the support
    l1_items.update(set(record))

In [6]:
# TODO: filter out the frequent length-1 itemsets with their support (1 point)
frequent_l1_items = {}

for item in l1_items:
    support = l1_items[item]
    if support >= 12:
        frequent_l1_items[(item,)] = support

# Store all frequent itemsets (keys) with their support (value) in this dictionary.
# Hint: Convert the itemsets to tuples or sets so that you can use them as keys.
# TODO: save the length-1 frequent items and their supports to frequent_itemsets (1 point)
frequent_itemsets = {}

for item in frequent_l1_items:
    frequent_itemsets[item] = frequent_l1_items[item]

### B) Determine the frequent n itemsets (15 points)
**Task:** Solve the tasks explained in the TODOs and comments.
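
Candidate generation in Apriori is commonly described as a join step followed by a prune step. The sketch below shows both on toy data, assuming itemsets are sorted tuples of ids; the function name `join_and_prune` and the toy input are hypothetical, and the version in the lecture slides may differ in detail.

```python
import itertools

def join_and_prune(frequent_k):
    """Build (k+1)-candidates from sorted frequent k-itemsets (tuples of ids)."""
    frequent = set(frequent_k)
    candidates = set()
    for p in frequent:
        for q in frequent:
            # Join: same first k-1 items, last item of p smaller than last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune: keep a candidate only if every k-subset is itself frequent.
    return sorted(
        c for c in candidates
        if all(s in frequent for s in itertools.combinations(c, len(c) - 1))
    )

# Toy frequent 2-itemsets (hypothetical ids).
toy_l2 = [(0, 1), (0, 2), (1, 2), (1, 3)]
print(join_and_prune(toy_l2))
```

In this toy input, (1, 2, 3) is generated by the join but pruned because its subset (2, 3) is not frequent, so only (0, 1, 2) remains.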


In [7]:
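# TODO: implement the apriori_gen algorithm based on the lecture slides
def apriori_gen(itemsets):
    # A set gives O(1) membership tests in the prune step
    # (the main loop passes a list after the first iteration).
    frequent = set(itemsets)

    # TODO: generate candidates (4 points)
    # Join step: combine two itemsets that share their first k-1 items.
    C_k = set()
    for p in frequent:
        for q in frequent:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                C_k.add(p + (q[-1],))

    # TODO: prune the candidates and return them (4 points)
    # Prune step: every k-subset of a candidate must itself be frequent.
    def all_subsets_in_itemsets(x):
        return all(
            subset in frequent
            for subset in itertools.combinations(x, len(x) - 1)
        )

    return list(filter(all_subsets_in_itemsets, C_k))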

In [8]:
# TODO: implement an algorithm to calculate the support of the given itemset (2 points)
# You do not need to implement a Hash Tree for calculating the supports.
def calculate_support(itemset):
    if len(itemset) == 1:
        # length-1 supports were already counted; infrequent items are reported as 0
        return frequent_l1_items.get(itemset, 0)

    # count the records that contain every item of the itemset
    return sum(
        1 for record in mapped_records
        if all(item in record for item in itemset)
    )

In [9]:
# TODO: set the initial frequent itemsets that are used in the first iteration (1 point)
# (It will be updated after each iteration.)
frequent_n_itemsets = frequent_l1_items

# TODO: set the loop condition that determines how long the Apriori algorithm runs (1 point)
# Stop as soon as an iteration yields no frequent itemsets.
while frequent_n_itemsets:
    candidates = apriori_gen(frequent_n_itemsets)
    supports = map(calculate_support, candidates)

    # TODO: filter out the frequent candidates (2 points)
    frequent_candidates = {}
    for candidate, support in zip(candidates, supports):
        if support >= 12:
            frequent_candidates[candidate] = support

    # TODO: add the frequent candidates to frequent_itemsets (1 point)
    for item in frequent_candidates:
        frequent_itemsets[item] = frequent_candidates[item]
    
    # replace the frequent_n_itemsets for the next iteration
    frequent_n_itemsets = [itemset for itemset in frequent_candidates]

### C) Save your results (3 points)

**Task:** Save all the frequent itemsets along with their absolute supports into a text file named `patterns.txt` and place it in the root of your zip file. Every line corresponds to exactly one frequent itemset and should be in the following format:

*support:product1;product2;product3;...*

For example, suppose an itemset (Mini Paint Set Vintage;Picture Dominoes) has an absolute support of 46; then the line corresponding to this frequent itemset in `patterns.txt` should be:

*46:Mini Paint Set Vintage;Picture Dominoes*
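
A minimal sketch of producing one such line from a support and a list of product names; the helper `format_pattern` is a hypothetical name introduced only for this illustration:

```python
def format_pattern(support, product_names):
    """Return one patterns.txt line in the form support:product1;product2;..."""
    return f"{support}:{';'.join(product_names)}"

line = format_pattern(46, ["Mini Paint Set Vintage", "Picture Dominoes"])
print(line)
```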

In [10]:
with open("patterns.txt","w") as patterns_file:
    for itemset in frequent_itemsets:
        support = frequent_itemsets[itemset]
        products = ';'.join(map(lambda x: id_to_item[x], itemset))
        patterns_file.write(f"{support}:{products}\n")

## 1.3 Recommendation (9 points)

**Task:** Suppose you need to recommend 2 further products to a customer who has added "Pack Of 6 Skull Paper Cups" and "Pack Of 20 Skull Paper Napkins" to the cart. Based on the results of the Apriori algorithm, implement an algorithm that returns the 2 products to display on the website that maximize the confidence that the customer will buy them. (6 points)

**Report:** Explain your method (comments in code or summary) and display your recommendations with the confidence scores. (3 points)
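
One possible way to rank recommendations is by the confidence of the rule cart → item, that is, support(cart plus item) divided by support(cart). The sketch below illustrates this on made-up supports; the function `recommend`, the dictionary `toy_supports`, and all numbers in it are hypothetical and not derived from `orders.txt`.

```python
# Hypothetical supports for illustration only (not taken from orders.txt).
# Keys are sorted tuples of product names.
toy_supports = {
    ("cups", "napkins"): 20,
    ("cups", "napkins", "plates"): 15,
    ("cups", "napkins", "straws"): 12,
    ("cups", "plates"): 18,
}

def recommend(cart, supports, k):
    """Return up to k (item, confidence) pairs, highest confidence first."""
    cart = tuple(sorted(cart))
    base = supports[cart]  # support of the cart itself
    scores = {}
    for itemset, support in supports.items():
        # Consider only supersets of the cart with exactly one extra item.
        if len(itemset) == len(cart) + 1 and set(cart) <= set(itemset):
            (extra,) = set(itemset) - set(cart)
            scores[extra] = support / base
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(recommend(["napkins", "cups"], toy_supports, 2))
```

Here ("cups", "plates") is ignored because it does not contain the whole cart, so only "plates" and "straws" are scored.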


In [11]:
#inputs

#list of product names
products = ["Pack Of 6 Skull Paper Cups", "Pack Of 20 Skull Paper Napkins"]
#amount of recommended products in the output
recommendation_amount = 2

#tuple after changing the product names to their ids and then sorting them
products = tuple(sorted(map(lambda x: item_to_id[x], products)))

#the confidence of a recommendation is the support of the cart products together with
# the recommended product, divided by the support of the cart products alone:
#conf = support(products + recommendation) / support(products)

#dict to save recommendations with their confidence
confidences = {}
#calculating support(products)
max_support = frequent_itemsets[products]

#iterate over all frequent itemsets that are supersets of the cart and contain exactly one additional product.
#one extra item is enough: by the anti-monotone property, every item that appears in a larger frequent
# superset also appears in a frequent superset with exactly one extra item, so no candidate is missed.
for itemset in filter(
        lambda x: len(x) == len(products) + 1 and all(y in x for y in products), frequent_itemsets
    ):
    #filtering the already added items
    for item in filter(lambda x: x not in products, itemset):
        #calculating the confidence for recommending this item and adding it to the confidence dict
        confidences[item] = frequent_itemsets[itemset] / max_support
        
#sorting the confidences from highest to lowest
confidences = sorted(confidences.items(), key=lambda x: x[1], reverse=True)

#printing the recommendations
for item_id, confidence in confidences[:recommendation_amount]:
    #item_id avoids shadowing the built-in id()
    recommendation = id_to_item[item_id]
    print(f"\"{recommendation}\" is recommended by {confidence * 100:.2f}% confidence.")

"Pack Of 6 Skull Paper Plates" is recommended by 93.75% confidence.
"Set/20 Red Retrospot Paper Napkins" is recommended by 81.25% confidence.
