# Description  
  
In this programming assignment, you are required to implement the Apriori algorithm and apply it to mine frequent itemsets from a real-life data set.  
  
### Input  
  
The provided input file ("categories.txt") consists of the category lists of 77,185 places in the US. Each line corresponds to the category list of one place, where the list consists of a number of category instances (e.g., hotels, restaurants, etc.) that are separated by semicolons.  
  
An example line is provided below:
  
Local Services;IT Services & Computer Repair
  
In the example above, the corresponding place has two category instances: "Local Services" and "IT Services & Computer Repair".  

### Output  
  
You need to implement the Apriori algorithm and use it to mine category sets that are frequent in the input data. When implementing the Apriori algorithm, you may use any programming language you like. We only need your result pattern file, not your source code file.  
  
After implementing the Apriori algorithm, please set the relative minimum support to 0.01 and run it on the 77,185 category lists. Since 0.01 × 77,185 = 771.85, this means you need to extract all the category sets with absolute support larger than (non-inclusive) 771.  
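
The threshold arithmetic above can be checked with a short sketch. The names `N_TRANSACTIONS`, `REL_MIN_SUPPORT`, `threshold`, and `is_frequent` are hypothetical and are not part of the assignment.

```python
import math

N_TRANSACTIONS = 77185   # number of category lists in categories.txt
REL_MIN_SUPPORT = 0.01   # relative minimum support from the assignment

# 0.01 * 77185 is about 771.85, so the largest count that still fails is 771.
threshold = math.floor(N_TRANSACTIONS * REL_MIN_SUPPORT)

def is_frequent(count, threshold=threshold):
    """A category set is frequent when its count is strictly above the threshold."""
    return count > threshold

print(threshold)
print(is_frequent(772), is_frequent(771))
```

Under this reading, a category set needs to appear in at least 772 category lists to be reported.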
  
#### Part 1: length-1 frequent categories
  
Please output all the length-1 frequent categories with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one frequent category and should be in the following format:

support:category

For example, if a category (Fast Food) has an absolute support of 3000, then the line corresponding to this frequent category in "patterns.txt" should be:

3000:Fast Food




In [1]:
# Unlike a bare open(), which must be paired with an explicit close(),
# the with statement closes the file automatically when the block exits.
# https://docs.python.org/3/reference/compound_stmts.html#with

# A Counter is an element:count dict subclass for counting hashable objects.
# https://docs.python.org/3/library/collections.html#counter-objects

import re
from collections import Counter

# Raw strings (r'...') keep the Windows backslashes literal instead of treating them as escapes.
input_txt = r'G:\Other computers\My Laptop\Documents\MCS-DS\2024 Fall CS 412 IntroToDataMining\Pattern Discovery in Data Mining\programming-1-frequent-itemset-mining-using-apriori\categories.txt'
output_txt = r'G:\Other computers\My Laptop\Documents\MCS-DS\2024 Fall CS 412 IntroToDataMining\Pattern Discovery in Data Mining\programming-1-frequent-itemset-mining-using-apriori\patterns.txt'

def calculate_frequency(input_file, output_file, min_count=771):

    # Read the content of the input file
    with open(input_file, 'r') as file:
        data = file.read()  # returns a string
    # Split on semicolons and newlines; drop the empty string left by the trailing newline
    terms = [t for t in re.split(r';|\n', data) if t]
    # Count occurrences of each term (equals the support as long as no
    # category is repeated within a single line)
    term_frequency = Counter(terms)
    # Write "support:category" (no space after the colon) for frequent terms only
    with open(output_file, 'w') as output:
        for term, frequency in term_frequency.items():  # (element, count) pairs
            if frequency > min_count:  # absolute support must exceed 771
                output.write(f"{frequency}:{term}\n")

calculate_frequency(input_txt, output_txt)



#### Part 2: ALL the frequent category sets

Please write all the frequent category sets along with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one frequent category set and should be in the following format:

support:category_1;category_2;category_3;...

For example, if a category set (Fast Food; Restaurants) has an absolute support of 2851, then the line corresponding to this frequent category set in "patterns.txt" should be:

2851:Fast Food;Restaurants

#### Important Tips  
  
Make sure that you format each line correctly in the output file. For instance, use a *semicolon* instead of another character to separate the categories for each frequent category set.  
  
In the result pattern file, the order of the categories does not matter. For example, the following two cases will be considered equivalent by the grader:  
  
Case 1:  
  
2851:Fast Food;Restaurants  
  
Case 2:  
  
2851:Restaurants;Fast Food   
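
The equivalence of the two cases can be illustrated by parsing each line into an unordered set. This is only a sketch of one way such a comparison could work; how the grader actually compares lines is not stated, and `parse_pattern_line` is a hypothetical helper.

```python
def parse_pattern_line(line):
    """Split 'support:cat_1;cat_2' into (frozenset of categories, support)."""
    # partition() splits on the first colon only
    support, _, categories = line.strip().partition(':')
    return frozenset(categories.split(';')), int(support)

case_1 = parse_pattern_line("2851:Fast Food;Restaurants")
case_2 = parse_pattern_line("2851:Restaurants;Fast Food")
print(case_1 == case_2)  # True: a frozenset ignores the order of its members
```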

In [2]:
# Build a list of lists from the file: [[line 1 items], [line 2 items], ...]
def load_text_file(file_path):
    result = []
    with open(file_path, 'r') as file:
        for line in file:
            # strip() removes the trailing newline; split(';') turns the line into a list of categories
            items = line.strip().split(';')
            result.append(items)
    return result

data = load_text_file(input_txt)

# Print the loaded data
for transaction in data:
    print(transaction)


['Breakfast & Brunch', 'American (Traditional)', 'Restaurants']
['Sandwiches', 'Restaurants']
['Local Services', 'IT Services & Computer Repair']
['Restaurants', 'Italian']
['Food', 'Coffee & Tea']
['Fast Food', 'Restaurants']
['Mortgage Brokers', 'Home Services', 'Real Estate']
['Brasseries', 'Restaurants']
['Bars', 'Sports Bars', 'Nightlife', 'American (New)', 'Chicken Wings', 'Restaurants']
['Automotive', 'Windshield Installation & Repair', 'Auto Detailing', 'Wheel & Rim Repair']
['Automotive', 'Auto Parts & Supplies']
['Food', 'Grocery', 'CSA', 'Farmers Market']
['Specialty Schools', 'CPR Classes', 'First Aid Classes', 'Education']
['Event Planning & Services', 'Venues & Event Spaces']
['Shopping', 'Home Decor', 'Home & Garden', 'Furniture Stores']
['Books, Mags, Music & Video', 'Shopping', 'Bookstores']
['Auto Repair', 'Automotive']
['Local Services', 'Dry Cleaning & Laundry']
['Burgers', 'American (New)', 'Restaurants']
['Pizza', 'Restaurants']
['Massage', 'Beauty & Spas']
['Food

In [3]:
# !pip install apyori


In [4]:
from apyori import apriori

# Each result is a RelationRecord: (items, support, ordered_statistics), where
# support is relative and ordered_statistics carries confidence and lift.
results = list(apriori(transactions=data, min_support=0.01))
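
Since the assignment asks for an implementation of Apriori itself, here is a minimal from-scratch sketch for comparison with the library call above. It is not the apyori implementation. The function name `apriori_sketch`, its `min_count` parameter, and the toy transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def apriori_sketch(transactions, min_count):
    """Return {frozenset(itemset): count} for itemsets with count > min_count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items and keep the frequent ones
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([item]): c for item, c in counts.items() if c > min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count step: one pass over the transactions per level
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {c: n for c, n in counts.items() if n > min_count}
        frequent.update(current)
        k += 1
    return frequent

# Toy data (hypothetical); on the real file one would pass data and min_count=771
toy = [['Fast Food', 'Restaurants'], ['Pizza', 'Restaurants'],
       ['Fast Food', 'Restaurants'], ['Food', 'Coffee & Tea']]
patterns = apriori_sketch(toy, min_count=1)
for itemset, count in patterns.items():
    print(f"{count}:{';'.join(sorted(itemset))}")
```

The sketch favors readability over speed: the count step checks every candidate against every transaction, which can be slow on the full 77,185 lines.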

In [19]:
# str.join builds the "category_1;category_2" string directly from the itemset,
# so no regex cleanup of str(list) is needed (that cleanup would also mangle
# any category name that contains a comma, bracket, or quote).

with open(output_txt, 'w') as output:
    for rr in results:  # for each RelationRecord
        itemset = rr[0]  # frozenset of the categories in this pattern
        item = ';'.join(itemset)
        # rr[1] is the relative support (count / number of transactions);
        # round() recovers the integer count, where floor() could lose 1 to float error
        support = round(rr[1] * len(data))
        print(f"{support}:{item}")
        output.write(f"{support}:{item}\n")

3103:Active Life
1593:American (New)
2416:American (Traditional)
2271:Arts & Entertainment
1716:Auto Repair
4208:Automotive
1115:Bakeries
4328:Bars
6583:Beauty & Spas
1369:Breakfast & Brunch
1774:Burgers
1002:Cafes
1629:Chinese
2199:Coffee & Tea
1195:Dentists
1694:Doctors
2975:Event Planning & Services
3078:Fashion
2851:Fast Food
875:Financial Services
1442:Fitness & Instruction
9250:Food
823:General Dentistry
1424:Grocery
2091:Hair Salons
5120:Health & Medical
1586:Home & Garden
4785:Home Services
1430:Hotels
2495:Hotels & Travel
1018:Ice Cream & Frozen Yogurt
1848:Italian
848:Japanese
3468:Local Services
2515:Mexican
1667:Nail Salons
5088:Nightlife
870:Pet Services
1497:Pets
2657:Pizza
1025:Professional Services
874:Pubs
1424:Real Estate
25071:Restaurants
2364:Sandwiches
11233:Shopping
1150:Specialty Food
818:Sports Bars
798:Sushi Bars
1138:Womens Clothing
1442:Active Life;Fitness & Instruction
1593:American (New);Restaurants
2416:American (Traditional);Restaurants
1716:Automotive;Au