## Input
The provided input file ("categories.txt") contains the category lists of 77,185 places in the US. Each line is the category list of one place and consists of one or more category instances (e.g., hotels, restaurants) separated by semicolons.

An example line is provided below:

Local Services;IT Services & Computer Repair

In the example above, the corresponding place has two category instances: "Local Services" and "IT Services & Computer Repair".
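
As a small illustration of the line format, the example above can be split into its category instances as follows. This is only a sketch: `parse_categories` is a hypothetical helper name, not part of the assignment, and dropping empty fields is an assumption about how stray semicolons should be treated.

```python
def parse_categories(line):
    """Split one line of categories.txt into its category instances."""
    # strip the trailing newline, split on ';', and drop empty fields (assumption)
    return [c for c in line.strip().split(';') if c]

example = "Local Services;IT Services & Computer Repair\n"
categories = parse_categories(example)
print(categories)
```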

## Output
You need to implement the Apriori algorithm and use it to mine category sets that are frequent in the input data. When implementing the Apriori algorithm, you may use any programming language you like. We only need your result pattern file, not your source code file.

After implementing the Apriori algorithm, set the relative minimum support to 0.01 and run it on the 77,185 category lists. Since 0.01 × 77,185 = 771.85, this means you need to extract all the category sets that have an absolute support larger than 771.
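
The conversion from relative to absolute support can be checked with a few lines of arithmetic. The variable names below are illustrative assumptions; the only facts used are the 77,185 lists and the 0.01 threshold stated above.

```python
import math

n_transactions = 77185
relative_min_support = 0.01

threshold = relative_min_support * n_transactions  # about 771.85
# a support count "larger than 771" is the same as a count of at least ceil(threshold)
min_count = math.ceil(threshold)
print(min_count)
```

With a `>=` comparison, `min_count` is therefore the smallest integer count that still qualifies as frequent.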

### Pseudocode for Apriori
Step 1. Find the frequent 1-itemsets, L_1.  
Steps 2-10. L_(k-1) is used to generate the candidates C_k in order to find L_k, for k >= 2.  
Step 3. Generate the candidates, then use the Apriori property to eliminate those that have a subset that is not frequent.  
Step 4. Scan the database to count the candidates.  
Step 5. For each transaction, a subset function finds all subsets of the transaction that are candidates.  
Steps 6-7. The count for each of these candidates is accumulated.  
Steps 9-11. All candidates satisfying the minimum support form the set of frequent itemsets, L.  
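
To make the steps concrete, here is a minimal, self-contained sketch of the level-wise search on a five-transaction toy dataset. It is not the implementation used in the rest of this notebook: `toy_apriori`, the toy data, and the `min_count` of 3 are all assumptions chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

def toy_apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    sets = [frozenset(t) for t in transactions]
    # Step 1: count single items and keep the frequent ones (L_1)
    counts = Counter(frozenset([item]) for t in sets for item in t)
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # join L_(k-1) with itself, keeping unions of exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # scan the transactions and accumulate the candidate counts
        counts = Counter(c for t in sets for c in candidates if c <= t)
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

toy = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
patterns = toy_apriori(toy, min_count=3)
```

On this toy data every single item and every pair is frequent, while the triple A, B, C appears in only two transactions and is dropped.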

In [1]:
from collections import Counter

### Read in data

In [2]:
# read "categories.txt" into list of lists
# https://stackoverflow.com/questions/18448847/import-txt-file-and-having-each-line-as-a-list
transactions = []
with open('categories.txt', 'rt') as f:
    for line in f:
        transactions.append(line.strip().split(';'))
# print(type(transactions))
# transactions

### Apriori
Credit: https://gist.github.com/Stiivi/4730288

In [3]:
def distinct_items(transactions, support=None):
    """Return the set of distinct items; if `support` is given, keep only items that occur at least `support` times."""
    counter = Counter()
    for trans in transactions:
        counter.update(trans)

    if support is not None:
        return set(item for item in counter if counter[item] >= support)
    else:
        return set(counter)
    
def frequent_single_itemsets(transactions, support):
    """Return one-item itemsets with at least `support` support."""
    distinct = distinct_items(transactions, support)
    return set(frozenset([i]) for i in distinct)

def itemsets_support(transactions, itemsets):
    """Count, for each itemset in `itemsets`, the transactions that contain it."""
    support_set = Counter()

    for trans in transactions:
        trans_set = set(trans)  # transactions are stored as lists, so convert once per transaction
        support_set.update(itemset for itemset in itemsets if itemset <= trans_set)

    return support_set

In [4]:
def apriori_gen(L, k):
    """Generate the size-`k` candidates by taking unions of pairs of itemsets in `L`"""
    candidates = set()
    for l1 in L:
        for l2 in L:
            unionset = l1 | l2
            if len(unionset) == k and l1 != l2:
                candidates.add(unionset)
    return candidates
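
The join above keeps every union of size `k`; it does not apply the subset check described in Step 3 of the pseudocode. The final result is unaffected, because infrequent candidates are removed after counting, but fewer candidates need to be counted if the check is applied first. A minimal sketch of such a check is given below; `prune_by_subsets` and the toy itemsets are hypothetical and are not part of the credited gist.

```python
from itertools import combinations

def prune_by_subsets(candidates, previous_level):
    """Keep only candidates whose every (k-1)-subset is in `previous_level`."""
    kept = set()
    for candidate in candidates:
        k = len(candidate)
        # Apriori property: a frequent k-itemset has only frequent (k-1)-subsets
        if all(frozenset(s) in previous_level for s in combinations(candidate, k - 1)):
            kept.add(candidate)
    return kept

# toy frequent 2-itemsets and the size-3 unions a pairwise join would produce
L2 = {frozenset({"A", "B"}), frozenset({"A", "C"}),
      frozenset({"B", "C"}), frozenset({"B", "D"})}
C3 = {frozenset({"A", "B", "C"}), frozenset({"A", "B", "D"}),
      frozenset({"B", "C", "D"})}
pruned = prune_by_subsets(C3, L2)
```

Here only the candidate A, B, C survives: A, B, D is dropped because A, D is not frequent, and B, C, D is dropped because C, D is not frequent.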

In [5]:
def apriori_prune(counter, support):
    """Return the itemsets in `counter` whose count is at least `support`"""
    return {item for item in counter if counter[item] >= support}

### Run Main Procedure

In [6]:
%%time
min_support = 772  # smallest count "larger than 771"; the helpers compare with >=
candidates = frequent_single_itemsets(transactions, min_support)
result = list(candidates)  # the result starts with the frequent 1-itemsets

k = 2
while candidates:
    candidates = apriori_gen(candidates, k)  # join the previous level into size-k candidates
    supports = itemsets_support(transactions, candidates)  # count each candidate in the data
    candidates = apriori_prune(supports, min_support)  # keep only candidates meeting min_support
    result.extend(candidates)  # add the qualifying candidates to the frequent itemsets
    k += 1

CPU times: user 40.5 s, sys: 117 ms, total: 40.6 s
Wall time: 40.8 s


# Part 1

Please output all the length-1 frequent categories with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one frequent category and should be in the following format:

support:category

For example, suppose a category (Fast Food) has an absolute support 3000, then the line corresponding to this frequent category set in "patterns.txt" should be:

3000:Fast Food

In [7]:
%%time
distinct = frequent_single_itemsets(transactions, min_support)
frequent_1_itemsets = itemsets_support(transactions, distinct)

CPU times: user 2.02 s, sys: 6.34 ms, total: 2.03 s
Wall time: 2.03 s


In [8]:
frequent_1_itemsets

Counter({frozenset({'Burgers'}): 1774,
         frozenset({'Sandwiches'}): 2364,
         frozenset({'Pets'}): 1497,
         frozenset({'Auto Repair'}): 1716,
         frozenset({'Mexican'}): 2515,
         frozenset({'Pizza'}): 2657,
         frozenset({'Food'}): 9250,
         frozenset({'Nightlife'}): 5088,
         frozenset({'Beauty & Spas'}): 6583,
         frozenset({'Sushi Bars'}): 798,
         frozenset({"Women's Clothing"}): 1138,
         frozenset({'Bars'}): 4328,
         frozenset({'Health & Medical'}): 5121,
         frozenset({'Automotive'}): 4208,
         frozenset({'Ice Cream & Frozen Yogurt'}): 1018,
         frozenset({'Restaurants'}): 25071,
         frozenset({'Dentists'}): 1195,
         frozenset({'Nail Salons'}): 1667,
         frozenset({'Chinese'}): 1629,
         frozenset({'Home Services'}): 4785,
         frozenset({'Cafes'}): 1002,
         frozenset({'Hotels'}): 1431,
         frozenset({'Coffee & Tea'}): 2199,
         frozenset({'Pet Services'}): 87

In [9]:
# write the length-1 patterns as "support:category" lines
# https://stackoverflow.com/questions/17801665/how-to-get-an-arbitrary-element-from-a-frozenset/17844057#17844057
with open('part1/part1_v1.txt', 'wt') as f:
    for itemset, support in frequent_1_itemsets.items():
        (category,) = itemset  # unpack the single element of the 1-itemset
        f.write(str(support) + ':' + str(category) + '\n')

# Part 2

Please write all the frequent category sets along with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one frequent category set and should be in the following format:

support:category_1;category_2;category_3;...

For example, suppose a category set (Fast Food; Restaurants) has an absolute support 2851, then the line corresponding to this frequent category set in "patterns.txt" should be:

2851:Fast Food;Restaurants
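
The line format above can be produced with a small helper such as the one sketched here. `format_pattern` is a hypothetical name, and sorting the categories is an assumption made only to get reproducible output, since the task statement does not specify an order within a category set.

```python
def format_pattern(itemset, support):
    """Render one frequent category set as 'support:category_1;category_2;...'."""
    # sorted() fixes the order; iterating a set directly gives an arbitrary order
    return f"{support}:{';'.join(sorted(itemset))}"

line = format_pattern(frozenset({"Restaurants", "Fast Food"}), 2851)
print(line)
```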



In [11]:
all_frequent_itemsets = itemsets_support(transactions, result)

In [12]:
all_frequent_itemsets

Counter({frozenset({'Pizza'}): 2657,
         frozenset({'Coffee & Tea', 'Food'}): 2199,
         frozenset({'Sandwiches'}): 2364,
         frozenset({'Bars', 'Sports Bars'}): 818,
         frozenset({'Fashion', 'Shopping', "Women's Clothing"}): 1138,
         frozenset({'Fitness & Instruction'}): 1442,
         frozenset({'Auto Repair'}): 1716,
         frozenset({'Food'}): 9250,
         frozenset({'Active Life', 'Fitness & Instruction'}): 1442,
         frozenset({'Pizza', 'Restaurants'}): 2657,
         frozenset({"Women's Clothing"}): 1138,
         frozenset({'Bars'}): 4328,
         frozenset({'Ice Cream & Frozen Yogurt'}): 1018,
         frozenset({'Restaurants'}): 25071,
         frozenset({'Bakeries'}): 1115,
         frozenset({'Doctors', 'Health & Medical'}): 1694,
         frozenset({'Chinese'}): 1629,
         frozenset({'Mexican', 'Restaurants'}): 2515,
         frozenset({'Breakfast & Brunch', 'Restaurants'}): 1369,
         frozenset({'Dentists', 'General Dentistry'}):

In [13]:
# https://stackoverflow.com/questions/2399112/python-print-delimited-list
for k, v in all_frequent_itemsets.items():
    # join the categories of each itemset with ';' (set iteration order is arbitrary)
    print(str(v) + ':' + ';'.join(k) + '\n')

2657:Pizza

2199:Food;Coffee & Tea

2364:Sandwiches

874:Nightlife;Pubs

1424:Food;Grocery

774:Fast Food;Burgers

1442:Fitness & Instruction

1716:Auto Repair

9250:Food

1442:Fitness & Instruction;Active Life

2657:Restaurants;Pizza

1138:Women's Clothing

4328:Bars

1018:Ice Cream & Frozen Yogurt

25071:Restaurants

1115:Bakeries

875:Financial Services

1694:Health & Medical;Doctors

1629:Chinese

2515:Restaurants;Mexican

1369:Restaurants;Breakfast & Brunch

823:General Dentistry;Dentists

1424:Home Services;Real Estate

4785:Home Services

1002:Cafes

1138:Fashion;Women's Clothing

2199:Coffee & Tea

1667:Nail Salons;Beauty & Spas

1694:Doctors

1150:Specialty Food

1586:Home & Garden

1431:Event Planning & Services;Hotels & Travel;Hotels

1497:Pets

2091:Hair Salons

2851:Restaurants;Fast Food

1369:Breakfast & Brunch

1424:Grocery

1138:Fashion;Women's Clothing;Shopping

2416:American (Traditional)

823:General Dentistry;Health & Medical

2423:Bars;Restaurants

11233:Shopping



In [14]:
# write every frequent category set as a "support:category_1;category_2;..." line
with open('part2/part2_v1.txt', 'wt') as f:
    for k, v in all_frequent_itemsets.items():
        f.write(str(v) + ':' + ';'.join(k) + "\n")