PROBLEM 2 : Kosarak Association Rules
Your task is to take a dataset of nearly one million clicks on a news site and use the Weka Explorer to identify interesting association rules. 
Ordinarily this would be a point-and-click task; 
however, the input data format is a list of transactions (each line in the file includes a list of anonymized news item IDs), 
whereas Weka requires a tabular format. 
Specifically, each distinct news item id should be represented via a column/attribute, 
and each row/instance should be a sequence of binary values, indicating whether or not the user visited the corresponding news item.

A. Write a Python program which takes as its argument the path to a text file of data (assumed to be in the itemset format above) and produces as output to the console a sparse ARFF file.
# Note: the attribute indices within each sparse row need to be sorted in ascending order
B. Use your program to convert the kosarak.dat file to a sparse kosarak.arff. About how long did it take to run?
C. Load the resulting file into Weka (as described above; you should have 41,270 attributes and 990,002 instances). About how long did it take to load this file?

D. Use Weka’s FP-Growth implementation to find rules that have support count of at least 49,500 and confidence of at least 99% – record your rules (there should be 2).
E. Run the algorithm at least 5 times. Then look to the log and record how much time each took. How does the average time compare to the time necessary to convert the dataset and then load into Weka?


In [9]:
#A
%pip install liac-arff
import arff
import os  # standard library module; it is not installed through pip
%pip install tqdm
from tqdm import tqdm

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [24]:
data_file = "kosarak.dat"
data_file_path = os.path.join(os.getcwd(), data_file)
data = []             # one sorted list of item ids per transaction
unique_items = set()  # every distinct item id seen in the file
with open(data_file_path, 'r') as file:
    for line in file:
        items = [int(item) for item in line.split()]
        items.sort()
        data.append(items)
        # Record the item ids from this line (a set ignores repeats)
        unique_items.update(items)

# Write the sparse ARFF file by hand
with open('kosarak.arff', 'w') as f:
    f.write('@RELATION kosarak\n')
    # One binary {0,1} attribute per distinct item id
    for item in sorted(unique_items):
        f.write(f'@ATTRIBUTE {item} {{0,1}}\n')
    f.write('\n')
    f.write('@DATA\n')
    # One sparse row per transaction: {index value,...} with 0-based, ascending indices
    for items in tqdm(data):
        f.write('{')
        # index = item - 1 assumes the item ids are exactly 1..N with no gaps.
        # Re-sorting and scanning every attribute for every row likely dominates the run time.
        f.write(','.join([f'{item-1} 1' for item in sorted(unique_items) if item in items]))
        f.write('}\n')
#output: Sparse ARFF file

100%|██████████| 990002/990002 [53:59<00:00, 305.62it/s]  
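
The loop above checks every attribute against every transaction, which is probably why it needs close to an hour. The following is a minimal sketch of an alternative, not the code that produced the timing above: it touches only the items present in each row and looks up attribute indices in a dictionary, so it does not depend on the ids being contiguous. The function name `write_sparse_arff` and the three-line sample are made up for illustration, and its run time on kosarak.dat was not measured.

```python
import io


def write_sparse_arff(lines, out, relation="kosarak"):
    """Convert whitespace-separated item-id transactions to sparse ARFF."""
    # Deduplicate and sort each transaction once, up front.
    rows = [sorted({int(tok) for tok in line.split()}) for line in lines]
    # Map every distinct item id to its 0-based attribute index.
    items = sorted({item for row in rows for item in row})
    index = {item: i for i, item in enumerate(items)}

    out.write(f"@RELATION {relation}\n")
    for item in items:
        out.write(f"@ATTRIBUTE {item} {{0,1}}\n")
    out.write("\n@DATA\n")
    for row in rows:
        # Rows are sorted by item id, so the mapped indices are ascending too.
        out.write("{" + ",".join(f"{index[item]} 1" for item in row) + "}\n")


# Hypothetical three-transaction sample in the same format as kosarak.dat.
sample = ["1 2 3", "1", "4 2 6 7"]
buf = io.StringIO()
write_sparse_arff(sample, buf)
print(buf.getvalue())
```

To convert the real file, the same function could be called with an open file handle for `lines` and either `sys.stdout` or an output file for `out`.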


# OBSERVATIONS & RESPONSES

## B. Use your program to convert the kosarak.dat file to a sparse kosarak.arff. About how long did it take to run?

According to the tqdm progress bar, the conversion took about 54 minutes (53:59).

## C. Load the resulting file into Weka (as described above; you should have 41,270 attributes and 990,002 instances). About how long did it take to load this file?

Loading was quick: roughly 3 seconds at most.


## D. Use Weka’s FP-Growth implementation to find rules that have support count of at least 49,500 and confidence of at least 99% – record your rules (there should be 2).

=== Run information ===

Scheme:       weka.associations.FPGrowth -P 2 -I -1 -N 2 -T 0 -C 0.99 -D 0.05 -U 1.0 -M 49500.0 -S

Relation:     kosarak

Instances:    990002

Attributes:   41270

[list of attributes omitted]

=== Associator model (full training set) ===


FPGrowth found 2 rules

1. [11=1, 218=1, 148=1]: 50098 ==> [6=1]: 49866   <conf:(1)> lift:(1.64) lev:(0.02) conv:(84.4) 

2. [11=1, 148=1]: 55759 ==> [6=1]: 55230   <conf:(0.99)> lift:(1.63) lev:(0.02) conv:(41.3) 


## Using the wiki's documentation to interpret these results

"the number before the arrow is the number of instances for which the antecedent is true;
that after the arrow is the number of instances for which the consequent is true also; 
and the confidence (in parentheses) is the ratio between the two."

These rules state:
1. Of the 50098 transactions that include news items 11, 218, and 148, 49866 also include item 6. The reported conf of (1) is the rounded ratio 49866 / 50098 = 0.995.

2. Of the 55759 transactions that include items 11 and 148, 55230 also include item 6, giving a confidence of 55230 / 55759 = 0.99.
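
As a quick check on the arithmetic, the sketch below recomputes confidence and support from the counts in the Weka output above. The helper name `rule_metrics` is made up for illustration; the counts and the instance total are copied from the run information.

```python
N_INSTANCES = 990_002        # instances reported by Weka
MIN_SUPPORT_COUNT = 49_500   # threshold from the problem statement

# rule -> (antecedent count, antecedent-and-consequent count), from the output above
rules = {
    "[11, 218, 148] => [6]": (50_098, 49_866),
    "[11, 148] => [6]": (55_759, 55_230),
}


def rule_metrics(antecedent_count, rule_count, total):
    """Return (confidence, support) for one association rule."""
    confidence = rule_count / antecedent_count
    support = rule_count / total
    return confidence, support


for name, (antecedent_count, rule_count) in rules.items():
    conf, supp = rule_metrics(antecedent_count, rule_count, N_INSTANCES)
    meets_support = rule_count >= MIN_SUPPORT_COUNT
    print(f"{name}: confidence={conf:.4f}, support={supp:.4f}, "
          f"meets support count: {meets_support}")
```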



## E. Run the algorithm at least 5 times. Then look to the log and record how much time each took. How does the average time compare to the time necessary to convert the dataset and then load into Weka?

Out of 6 runs:

3 runs took 1 second from start to finish.

3 runs started and finished within the same second (start - finish = 0).

That averages out to roughly 0.5 seconds per run. By comparison, converting the dataset took about 54 minutes and loading the result into Weka took about 3 seconds. Running FP-Growth is therefore by far the cheapest step: preparing the data took several thousand times longer than mining it.
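
The comparison can be worked out from the figures recorded in this notebook. This is only back-of-the-envelope arithmetic: the run times are whole seconds read from the log, and the 3-second load time is the rough estimate from part C.

```python
from statistics import mean

# FP-Growth run times in seconds, as read from the Weka log (6 runs)
fp_growth_runs_s = [1, 1, 1, 0, 0, 0]

conversion_s = 53 * 60 + 59  # tqdm reported 53:59 for the conversion
load_s = 3                   # rough estimate of the Weka load time from part C

avg_run_s = mean(fp_growth_runs_s)
prep_s = conversion_s + load_s

print(f"average FP-Growth run: {avg_run_s} s")
print(f"convert + load: {prep_s} s")
print(f"preparation is about {prep_s / avg_run_s:.0f}x the average run time")
```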