# Description
In this programming assignment, you are required to implement a contiguous sequential pattern mining algorithm and apply it to text data to mine potential phrase candidates.

## Input
The provided input file ("reviews_sample.txt") consists of 10,000 online reviews from Yelp users. The reviews have been stemmed (word suffixes are removed so that words with similar meanings share the same form), and most of the punctuation has been removed. Each line is therefore a list of strings separated by spaces.

An example line is provided below:

`cold cheap beer good bar food good service looking great pittsburgh style fish 
sandwich place breading light fish plentiful good side home cut fry good 
grilled chicken salad steak soup day homemade lot special great place lunch 
bar snack beer`
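
As a minimal sketch, a line in this format can be turned into a list of items with `str.split()`. The helper name `tokenize` is hypothetical and not part of the assignment.

```python
def tokenize(line):
    """Split one whitespace-separated review line into a list of word items."""
    # str.split() with no argument splits on any run of whitespace and
    # drops leading/trailing whitespace, so no empty strings are produced.
    return line.split()


sample = "cold cheap beer good bar food good service"  # start of the example line
print(tokenize(sample))
```

Each review then becomes one sequence in the database, and each stemmed word is one item.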

## Task
You need to implement an algorithm to mine contiguous sequential patterns that are frequent in the input data. A contiguous sequential pattern is a sequence of items that frequently appears as a consecutive subsequence in a database of many sequences. For example, if the database is

`A,B,A,C
A,C,A,B,A,B
B,A,A,C,D`

and the minimum support is 2, then "A,B,A" and "A,C" are both frequent contiguous sequential patterns. The pattern "A,A" is not, because its two A's are adjacent only in the third sequence; in the first two sequences they are separated by other items. Note that "A,A" is still a frequent sequential pattern in the non-contiguous sense, since every sequence contains two A's in order.

Also note that multiple appearances of a subsequence within a single sequence record count only once. For example, the pattern "A,B" appears 1 time in the first sequence and 2 times in the second, but its support is 2, because only 2 records contain the subsequence "A,B".
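
The support definition above can be sketched as follows. This is an illustration on the toy database only; the function names `contains_contiguous` and `support` are hypothetical and not prescribed by the assignment.

```python
def contains_contiguous(sequence, pattern):
    """Return True if `pattern` occurs as a consecutive run inside `sequence`."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))


def support(database, pattern):
    """Count the sequences that contain `pattern` at least once."""
    # Each sequence contributes 0 or 1, however often the pattern repeats in it.
    return sum(contains_contiguous(seq, pattern) for seq in database)


database = [
    ["A", "B", "A", "C"],
    ["A", "C", "A", "B", "A", "B"],
    ["B", "A", "A", "C", "D"],
]

for pattern in (["A", "B"], ["A", "B", "A"], ["A", "C"], ["A", "A"]):
    print(",".join(pattern), support(database, pattern))
```

With a minimum support of 2, the first three patterns pass and "A,A" (support 1) does not.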


## GSP Implementation
Source: https://github.com/jacksonpradolima/gsp/blob/master/gsp.py

In [1]:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
===============================================
GSP (Generalized Sequential Pattern) algorithm
===============================================

GSP algorithm made with Python3 to deal with arrays as transactions.

Example:

transactions = [
                ['Bread', 'Milk'],
                ['Bread', 'Diaper', 'Beer', 'Eggs'],
                ['Milk', 'Diaper', 'Beer', 'Coke'],
                ['Bread', 'Milk', 'Diaper', 'Beer'],
                ['Bread', 'Milk', 'Diaper', 'Coke']
            ]
"""

import logging
import multiprocessing as mp
import numpy as np
import time

from collections import Counter
from itertools import chain
from itertools import product

__author__ = "Jackson Antonio do Prado Lima"
__email__ = "jacksonpradolima@gmail.com"
__license__ = "GPL"
__version__ = "1.0"


class GSP:

    def __init__(self, raw_transactions):
        self.freq_patterns = []
        self._pre_processing(raw_transactions)

    def _pre_processing(self, raw_transactions):
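        '''
        Prepare the data

        Parameters:
                raw_transactions: the sequences to analyse (a list of
                                  item sequences)
        '''
        # the longest transaction bounds the length of any pattern
        self.max_size = max(len(item) for item in raw_transactions)
        # tuples, so that slices can be compared with candidate tuples
        self.transactions = [tuple(i) for i in raw_transactions]
        counts = Counter(chain.from_iterable(raw_transactions))
        # every distinct item is an initial length-1 candidate
        self.unique_candidates = [(k,) for k in counts]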

    def _is_slice_in_list(self, s, l):
        len_s = len(s)  # so we don't recompute length of s on every iteration
        return any(s == l[i:len_s+i] for i in range(len(l) - len_s+1))

    def _calc_frequency(self, results, item, minsup):
        # Number of transactions that contain the candidate as a contiguous
        # slice; repeated occurrences within one transaction count once
        frequency = len(
            [t for t in self.transactions if self._is_slice_in_list(item, t)])
        if frequency >= minsup:
            results[item] = frequency
        return results

    def _support(self, items, minsup=0):
        '''
        Count the support of each candidate sequence and keep the
        candidates that reach minsup.

        The support of a sequence is computed here as the number of
        data-sequences that contain it as a contiguous slice (an absolute
        count, not a fraction).

        Parameters
                items: candidate sequences (tuples) to evaluate
                minsup: minimum absolute support
        '''
        results = mp.Manager().dict()
        pool = mp.Pool(processes=mp.cpu_count())

        for item in items:
            pool.apply_async(self._calc_frequency,
                             args=(results, item, minsup))
        pool.close()
        pool.join()

        return dict(results)

    def _print_status(self, run, candidates):
        logging.debug("""
        Run {}
        There are {} candidates.
        The candidates have been filtered down to {}.\n"""
                      .format(run,
                              len(candidates),
                              len(self.freq_patterns[run-1])))

    def search(self, minsup=0.2):
        '''
        Run GSP mining algorithm

        Parameters
                minsup: relative minimum support, a fraction in (0, 1]
        '''
        assert (0.0 < minsup) and (minsup <= 1.0)
        # convert the relative threshold into an absolute transaction count
        minsup = len(self.transactions) * minsup

        # the set of frequent 1-sequence: all singleton sequences
        # (k-itemsets/k-sequence = 1) - Initially, every item in DB is a
        # candidate
        candidates = self.unique_candidates

        # scan transactions to collect support count for each candidate
        # sequence & filter
        self.freq_patterns.append(self._support(candidates, minsup))

        # (k-itemsets/k-sequence = 1)
        k_items = 1

        self._print_status(k_items, candidates)

        # repeat until no frequent sequence or no candidate can be found
        while len(self.freq_patterns[k_items - 1]) and (k_items + 1 <= self.max_size):
            k_items += 1

            # Generate candidate sets Ck (set of candidate k-sequences).
            # np.unique flattens the frequent (k-1)-patterns into the sorted
            # set of items they contain.
            items = np.unique(
                list(set(self.freq_patterns[k_items - 2].keys())))

            # Every length-k sequence over those items becomes a candidate,
            # i.e. len(items) ** k_items candidates.
            candidates = list(product(items, repeat=k_items))

            # count support and keep only candidates that reach minsup
            self.freq_patterns.append(self._support(candidates, minsup))

            self._print_status(k_items, candidates)
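        # Drop empty levels. Slicing off the last level unconditionally would
        # lose patterns when the loop stops because max_size was reached.
        return [level for level in self.freq_patterns if level]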
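
The `search` method builds its length-k candidates as the cross product of all items that still appear in frequent patterns, so the candidate count grows as n^k for n surviving items. A common alternative is an Apriori-style join that only overlaps frequent patterns of length k-1. The sketch below is not part of the original `GSP` class, and the name `join_candidates` is hypothetical.

```python
def join_candidates(frequent):
    """Build length-k candidates by overlapping frequent length-(k-1) patterns.

    Patterns p and q are joined when p without its first item equals q
    without its last item. A contiguous pattern can only be frequent if
    both of its length-(k-1) ends are frequent, so no frequent pattern
    is missed by this join.
    """
    frequent = list(frequent)
    return sorted(p + (q[-1],)
                  for p in frequent
                  for q in frequent
                  if p[1:] == q[:-1])


# Frequent 2-patterns of the toy example from the problem statement
frequent_2 = [("A", "B"), ("A", "C"), ("B", "A")]
print(join_candidates(frequent_2))
```

On the toy example this gives 3 candidates of length 3, where the cross product over the items A, B and C gives 27.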

## Experimental database

In [2]:
import logging

logging.basicConfig(level=logging.DEBUG)

# transactions = [
#     ['Bread', 'Milk'],
#     ['Bread', 'Diaper', 'Beer', 'Eggs'],
#     ['Milk', 'Diaper', 'Beer', 'Coke'],
#     ['Bread', 'Milk', 'Diaper', 'Beer'],
#     ['Bread', 'Milk', 'Diaper', 'Coke']
# ]

# transactions = [[3, 5, 2, 0, 4, 4, 1, 1], [2, 5, 5], [5, 3, 2, 4, 4, 0, 4], [4, 3, 0, 0], [
#     1, 0, 4, 0, 0, 4], [2, 5, 1, 3, 5, 2, 5, 3], [0, 4, 0, 4, 5], [4, 2],
#     [5], [2, 3, 0, 0, 0, 3, 0, 2, 3]]


# use the example from the problem set
# input is a list of lists
transactions = [
    ['A','B','A','C'],
    ['A','C','A','B','A','B'],
    ['B','A','A','C','D']]


# relative minimum support 2/3 = absolute support 2 of 3 transactions,
# as in the problem statement
result = GSP(transactions).search(2/3)

print("========= Status =========")
print("Transactions: {}".format(transactions))
print("GSP: {}".format(result)) # output is a list of dictionaries

DEBUG:root:
        Run 1
        There are 4 candidates.
        The candidates have been filtered down to 3.

DEBUG:root:
        Run 2
        There are 9 candidates.
        The candidates have been filtered down to 3.

DEBUG:root:
        Run 3
        There are 27 candidates.
        The candidates have been filtered down to 1.

DEBUG:root:
        Run 4
        There are 16 candidates.
        The candidates have been filtered down to 0.



Transactions: [['A', 'B', 'A', 'C'], ['A', 'C', 'A', 'B', 'A', 'B'], ['B', 'A', 'A', 'C', 'D']]
GSP: [{('A',): 3, ('B',): 3, ('C',): 3}, {('A', 'B'): 2, ('A', 'C'): 3, ('B', 'A'): 3}, {('A', 'B', 'A'): 2}]


In [3]:
# check that transactions is list of lists
print(type(transactions))
print(type(transactions[0]))
print(len(transactions))

<class 'list'>
<class 'list'>
3


In [4]:
# check that output is list of dictionaries
# value is the support, key is the sequential pattern
print(type(result))
print(type(result[0]))

<class 'list'>
<class 'dict'>


## Format Results + Write to File

In [7]:
# print each pattern as support:item_1;item_2;...
# (the explicit '\n' adds a blank line after every pattern)
for seq_dict in result:
    for pattern, support in seq_dict.items():
        print(str(support) + ':' + ';'.join(map(str, pattern)) + '\n')

3:A

3:B

3:C

2:A;B

3:A;C

3:B;A

2:A;B;A



In [8]:
import os

os.makedirs('output', exist_ok=True)  # open() does not create missing directories
with open('output/patterns_test.txt', 'wt') as f:
    for seq_dict in result:
        for pattern, support in seq_dict.items():
            f.write(str(support) + ':' + ';'.join(map(str, pattern)) + '\n')

## Output
Please set the relative minimum support to 0.01 and run your algorithm on the given text file. In other words, you need to extract all the frequent contiguous sequential patterns that have an absolute support of at least 100 (0.01 of the 10,000 reviews).

Please write all the frequent contiguous sequential patterns along with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one pattern you found and should be in the following format:

support:item_1;item_2;item_3

For example, suppose the phrase "parking lot" has an absolute support of 133. Then the line corresponding to this frequent contiguous sequential pattern in "patterns.txt" should be:

133:parking;lot

Notice that the order does matter in sequential pattern mining. That is to say,

133:lot;parking

may be graded as incorrect.
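
A minimal sketch of the required line format is shown here. The helper name `format_pattern` is hypothetical, and the pattern is assumed to be stored as a tuple of items in their original order.

```python
def format_pattern(support, pattern):
    """Render one pattern as 'support:item_1;item_2;...' with item order preserved."""
    return "{}:{}".format(support, ";".join(map(str, pattern)))


print(format_pattern(133, ("parking", "lot")))
```

Because the items are joined in tuple order, "parking;lot" and "lot;parking" stay distinct patterns.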

In [9]:
# read "reviews_sample.txt" into a list of lists
# https://stackoverflow.com/questions/18448847/import-txt-file-and-having-each-line-as-a-list
transactions = []
with open('reviews_sample.txt', 'rt') as f:
    for line in f:
        transactions.append(line.strip().split(' '))
print(type(transactions))
print(type(transactions[0]))
print(len(transactions))

<class 'list'>
<class 'list'>
10000


In [10]:
%%time
# run the GSP algorithm from above on our transactions list
logging.basicConfig(level=logging.DEBUG)

# set relative minimum support to 0.01
result = GSP(transactions).search(0.01)

print("========= Status =========")
# print only the count; the full list holds 10,000 reviews
print("Transactions: {}".format(len(transactions)))
print("GSP: {}".format(result))

DEBUG:root:
        Run 1
        There are 22104 candidates.
        The candidates have been filtered down to 977.

Process ForkPoolWorker-50:
Process ForkPoolWorker-49:
Process ForkPoolWorker-54:
Process ForkPoolWorker-53:
Process ForkPoolWorker-48:
Process ForkPoolWorker-51:
Process ForkPoolWorker-47:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/jonathan/anaconda/envs/cs498/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/jonathan/anaconda/envs/cs498/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/jonathan/anaconda/envs/cs498/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jonathan/anaconda/envs/cs498/lib

KeyboardInterrupt: 
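
The run above was interrupted by hand. With 977 frequent single items, the cross-product step in `search` produces 977 × 977 = 954,529 candidates of length 2, and each one is checked against all 10,000 reviews. One possible alternative, sketched here on the assumption that only contiguous patterns are needed, is to count the n-grams that actually occur in the data instead of enumerating candidates. The function `mine_contiguous` is hypothetical, is not part of the GSP class, and has not been timed on the review file.

```python
from collections import Counter


def mine_contiguous(transactions, min_count):
    """Return {pattern_tuple: support} for all contiguous patterns whose
    support is at least min_count, growing the pattern length step by step."""
    transactions = [tuple(t) for t in transactions]
    frequent = {}
    prev = None  # frequent patterns of the previous length
    k = 1
    while True:
        counts = Counter()
        for t in transactions:
            seen = set()  # de-duplicate so each transaction counts once
            for i in range(len(t) - k + 1):
                gram = t[i:i + k]
                # Apriori check: both length-(k-1) ends must be frequent
                if prev is None or (gram[:-1] in prev and gram[1:] in prev):
                    seen.add(gram)
            counts.update(seen)
        level = {g: c for g, c in counts.items() if c >= min_count}
        if not level:
            return frequent
        frequent.update(level)
        prev = level
        k += 1


toy = [
    ['A', 'B', 'A', 'C'],
    ['A', 'C', 'A', 'B', 'A', 'B'],
    ['B', 'A', 'A', 'C', 'D'],
]
print(mine_contiguous(toy, 2))
```

On the toy database with an absolute minimum support of 2, this returns the same seven patterns and supports as the GSP test run shown earlier, merged into a single dictionary rather than one dictionary per length.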

In [11]:
import os

os.makedirs('output', exist_ok=True)  # open() does not create missing directories
with open('output/patterns.txt', 'wt') as f:
    for seq_dict in result:
        for pattern, support in seq_dict.items():
            f.write(str(support) + ':' + ';'.join(map(str, pattern)) + '\n')