## Input
The provided input file ("categories.txt") contains the category lists of 77,185 places in the US. Each line is the category list of one place: a sequence of category instances (e.g., hotels, restaurants) separated by semicolons.

An example line is provided below:

Local Services;IT Services & Computer Repair

In the example above, the corresponding place has two category instances: "Local Services" and "IT Services & Computer Repair".
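
As a quick illustration of this format, the sketch below splits one line on semicolons. The helper name `parse_line` is hypothetical and not part of the assignment.

```python
def parse_line(line):
    """Split one line of categories.txt into its category instances."""
    # strip() drops the trailing newline before splitting on ';'
    return line.strip().split(';')

example = "Local Services;IT Services & Computer Repair\n"
categories = parse_line(example)
print(categories)
```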

## Output
You need to implement the Apriori algorithm and use it to mine category sets that are frequent in the input data. When implementing the Apriori algorithm, you may use any programming language you like. We only need your result pattern file, not your source code file.

After implementing the Apriori algorithm, please set the relative minimum support to 0.01 and run it on the 77,185 category lists. Since 0.01 × 77,185 = 771.85 and support counts are integers, this means you need to extract all the category sets that have an absolute support larger than 771.
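
The threshold arithmetic can be checked with a small sketch. The variable names here are illustrative only; the point is that, for integer counts, "at least 771.85" and "larger than 771" select the same itemsets.

```python
import math

n_transactions = 77185
rel_min_sup = 0.01

# The exact threshold is about 771.85. Support counts are integers,
# so "count >= threshold" is equivalent to "count > 771" (i.e. count >= 772).
threshold = rel_min_sup * n_transactions
min_count_exclusive = math.floor(threshold)  # compare with: count > 771
min_count_inclusive = math.ceil(threshold)   # compare with: count >= 772

print(min_count_exclusive, min_count_inclusive)
```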

### Read in data

In [1]:
# read "categories.txt" into list of lists
# https://stackoverflow.com/questions/18448847/import-txt-file-and-having-each-line-as-a-list
transactions = []
with open('categories.txt', 'rt') as f:
    for line in f:
        transactions.append(line.strip().split(';'))
print(type(transactions))
print(len(transactions))

<class 'list'>
77185


**Credit**: http://adataanalyst.com/machine-learning/apriori-algorithm-python-3-0/

In [2]:
def createC1(Database):
    """Generate the candidate 1-itemsets (one per distinct item) from the database of transactions"""
    C1 = []
    for transaction in Database:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    return list(map(frozenset, C1)) # use frozenset so we can use it as a key in a dict

In [3]:
def scanD(D, Ck, min_sup):
    """
    Given database, Ck (list of candidate sets), and min_support, 
    generate Lk from Ck and also return a dictionary with support values
    """
    
    # for each candidate in Ck, count the transactions that contain it
    supCounts = dict() # dictionary with itemset (frozenset) as key, and counts as value
    for tid in D:
        for candidate in Ck:
            if candidate.issubset(tid):
                if candidate not in supCounts:
                    supCounts[candidate] = 1
                else:
                    supCounts[candidate] += 1
                    
    # only add candidates to return list whose counts are greater than the minimum support
    # recall key = itemset and value = support count    
    retList = list()
    supportData = dict()  
    for key, value in supCounts.items():
        # if value >= min_sup:
        if value > min_sup:
            retList.insert(0, key)
            supportData[key] = value
    return retList, supportData

###  Apriori Pseudo-code
While the current list of frequent itemsets is not empty:
* generate a list of candidate itemsets of length k
* scan the database to see if each itemset is frequent
* keep frequent itemsets to create itemsets of length k + 1
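
The loop above can be sketched on a toy database as follows. This is a minimal, self-contained illustration and not the notebook's implementation: the function name `apriori_sketch` and the toy data are made up, and the join step here simply unions any two frequent k-itemsets whose union has k + 1 items, rather than using the prefix-based join.

```python
from itertools import combinations

def apriori_sketch(transactions, min_count):
    """Return {frozenset: support} for itemsets with support > min_count."""
    data = [set(t) for t in transactions]
    # candidate itemsets of length 1: one per distinct item
    candidates = {frozenset([item]) for t in data for item in t}
    frequent = {}
    k = 1
    while candidates:
        # scan the database and count the transactions containing each candidate
        counts = {c: sum(1 for t in data if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n > min_count}
        frequent.update(current)
        # keep the frequent k-itemsets and join them into (k+1)-candidates
        k += 1
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
    return frequent

toy = [['A', 'B'], ['A', 'B', 'C'], ['A', 'C'], ['B']]
result = apriori_sketch(toy, min_count=1)
for itemset, support in result.items():
    print(support, sorted(itemset))
```

On this toy input the loop stops after the length-3 pass, because the only length-3 candidate appears in a single transaction.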

In [4]:
def apriori_gen(Lk, k):
    """
    Given L(k-1), the list of frequent (k-1)-itemsets (passed in as Lk), and k,
    the size of the candidates to build, produce Ck, a list of candidate k-itemsets.
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            # sort before slicing: frozenset iteration order is arbitrary,
            # so the first k-2 items are only comparable once sorted (pg 249 of book)
            L1 = sorted(Lk[i])[:k-2]
            L2 = sorted(Lk[j])[:k-2]
            if L1 == L2: # if first k-2 elements are equal
                retList.append(Lk[i] | Lk[j]) # set union
    return retList # BE CAREFUL OF TAB!        
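
The candidate generation above performs only the join step. Textbook Apriori usually also prunes any candidate that has an infrequent (k-1)-subset before scanning the database. The sketch below shows one way that check could look; `has_infrequent_subset` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Return True if any (k-1)-subset of candidate is missing from frequent_prev."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# toy frequent 2-itemsets: {B, C} is deliberately absent
frequent_2 = {frozenset(['A', 'B']), frozenset(['A', 'C'])}
candidate = frozenset(['A', 'B', 'C'])
prune = has_infrequent_subset(candidate, frequent_2)
print(prune)  # True, because {B, C} is not in frequent_2
```

Pruning does not change the final result, since the database scan would reject such candidates anyway; it only reduces the number of candidates that need counting.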

In [5]:
def apriori(database, min_sup = 771):
    """Main function for apriori algorithm"""
    C1 = createC1(database)
    D = list(map(set, database)) # becomes list of sets
    L1, supportData = scanD(D, C1, min_sup)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0): # L[k-2] holds the frequent (k-1)-itemsets (list index is 0-based)
        Ck = apriori_gen(L[k-2], k)
        Lk, supK = scanD(D, Ck, min_sup) # scan DB to get Lk
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

# Part 1

Please output all the length-1 frequent categories with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one frequent category and should be in the following format:

support:category

For example, suppose a category (Fast Food) has an absolute support of 3000; then the line corresponding to this frequent category in "patterns.txt" should be:

3000:Fast Food
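
A minimal sketch of producing one such line, assuming the support is an integer count and the category is a plain string (`format_single` is a hypothetical helper name):

```python
def format_single(category, support):
    """Format one length-1 pattern as 'support:category'."""
    return f"{support}:{category}"

line = format_single("Fast Food", 3000)
print(line)  # 3000:Fast Food
```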

In [6]:
%%time
min_sup = 771
C1 = createC1(transactions)
L1, supData_part1 = scanD(transactions, C1, min_sup)

CPU times: user 20.7 s, sys: 55.5 ms, total: 20.8 s
Wall time: 20.9 s


In [7]:
supData_part1

{frozenset({'Nightlife'}): 5088,
 frozenset({'Home Services'}): 4785,
 frozenset({'Home & Garden'}): 1586,
 frozenset({'Pets'}): 1497,
 frozenset({'Pubs'}): 874,
 frozenset({'Shopping'}): 11233,
 frozenset({'Food'}): 9250,
 frozenset({'Bars'}): 4328,
 frozenset({'Nail Salons'}): 1667,
 frozenset({'Real Estate'}): 1424,
 frozenset({'Doctors'}): 1694,
 frozenset({'Hotels'}): 1431,
 frozenset({'Coffee & Tea'}): 2199,
 frozenset({'Burgers'}): 1774,
 frozenset({'Sandwiches'}): 2364,
 frozenset({'Local Services'}): 3468,
 frozenset({'Pet Services'}): 870,
 frozenset({'Restaurants'}): 25071,
 frozenset({'Japanese'}): 848,
 frozenset({'Health & Medical'}): 5121,
 frozenset({'Breakfast & Brunch'}): 1369,
 frozenset({'Professional Services'}): 1025,
 frozenset({'Hair Salons'}): 2091,
 frozenset({'Auto Repair'}): 1716,
 frozenset({'Mexican'}): 2515,
 frozenset({'General Dentistry'}): 823,
 frozenset({'Event Planning & Services'}): 2975,
 frozenset({"Women's Clothing"}): 1138,
 frozenset({'Chinese

In [8]:
# write the length-1 results to a file, one "support:category" line per category
# https://stackoverflow.com/questions/17801665/how-to-get-an-arbitrary-element-from-a-frozenset/17844057#17844057
with open('part1/part1_v2.txt', 'wt') as f:
    for k, v in supData_part1.items():
        x, = k
        f.write(str(v) + ':' + str(x) + '\n')

# Part 2

Please write all the frequent category sets along with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one frequent category set and should be in the following format:

support:category_1;category_2;category_3;...

For example, suppose a category set (Fast Food; Restaurants) has an absolute support of 2851; then the line corresponding to this frequent category set in "patterns.txt" should be:

2851:Fast Food;Restaurants
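
A minimal sketch of producing one such line from a `frozenset` and its count. The helper name `format_pattern` is hypothetical, and sorting the categories is an assumption made only to get deterministic output; the format above does not say that order matters.

```python
def format_pattern(itemset, support):
    """Format one pattern as 'support:category_1;category_2;...'."""
    # sorted() gives a stable order; set iteration order is otherwise arbitrary
    return f"{support}:" + ";".join(sorted(itemset))

line = format_pattern(frozenset({"Restaurants", "Fast Food"}), 2851)
print(line)  # 2851:Fast Food;Restaurants
```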



In [9]:
%%time
L, supData_part2 = apriori(transactions)

CPU times: user 18.3 s, sys: 78.9 ms, total: 18.4 s
Wall time: 18.5 s


In [10]:
len(supData_part2)

101

In [11]:
# supData_part2

In [12]:
# https://stackoverflow.com/questions/2399112/python-print-delimited-list
category_set = []
for k, v in supData_part2.items():
    # category_set = list(k)
    category_set = [x for x in k]
    print(str(v) + ':' + ';'.join(map(str, category_set)) + '\n')

1195:Dentists;Health & Medical

1716:Automotive;Auto Repair

1586:Home & Garden

1774:Burgers

1138:Shopping;Women's Clothing

1431:Hotels & Travel;Hotels

774:Burgers;Fast Food

2423:Restaurants;Bars

2091:Hair Salons

848:Restaurants;Japanese

9250:Food

1667:Nail Salons

1694:Doctors

774:Restaurants;Burgers;Fast Food

1424:Home Services;Real Estate

2271:Arts & Entertainment

2975:Event Planning & Services

3468:Local Services

2101:Food;Restaurants

870:Pet Services

25071:Restaurants

5121:Health & Medical

6583:Beauty & Spas

818:Sports Bars;Nightlife

823:General Dentistry;Dentists

874:Bars;Pubs;Nightlife

4328:Bars;Nightlife

3078:Shopping;Fashion

2515:Mexican

823:General Dentistry

823:General Dentistry;Dentists;Health & Medical

1150:Specialty Food

1848:Italian

2515:Restaurants;Mexican

823:General Dentistry;Health & Medical

1115:Bakeries

1369:Restaurants;Breakfast & Brunch

1431:Event Planning & Services;Hotels

1593:American (New);Restaurants

1002:Cafes

4208:Autom

In [13]:
category_set = []
with open('part2/v2/patterns_test.txt', 'wt') as f:
    for k, v in supData_part2.items():
        # category_set = list(k)
        category_set = [x for x in k]
        f.write(str(v) + ':' + ';'.join(map(str, category_set)) + '\n')