# Description
In this programming assignment, you are required to implement a contiguous sequential pattern mining algorithm and apply it to text data to mine potential phrase candidates.

## Input
The provided input file ("reviews_sample.txt") consists of 10,000 online reviews from Yelp users. The reviews have been stemmed (word suffixes are removed so that words with similar semantics share the same form), and most of the punctuation has been removed. Each line is therefore a list of strings separated by spaces.

An example line is provided below:

`cold cheap beer good bar food good service looking great pittsburgh style fish 
sandwich place breading light fish plentiful good side home cut fry good 
grilled chicken salad steak soup day homemade lot special great place lunch 
bar snack beer`

## Task
You need to implement an algorithm to mine contiguous sequential patterns that are frequent in the input data. A contiguous sequential pattern is a sequence of items that frequently appears as a consecutive subsequence in a database of many sequences. For example, if the database is

`A,B,A,C
A,C,A,B,A,B
B,A,A,C,D`

and the minimum support is 2, then "A,B,A" and "A,C" are both frequent contiguous sequential patterns, while "A,A" is not: it occurs contiguously only in the third sequence, because in the first two sequences the two A's are not adjacent. Note that "A,A" is still a frequent sequential pattern if gaps are allowed.

Also, notice that multiple appearances of a pattern within a single sequence record count only once. For example, the pattern "A,B" appears once in the first sequence and twice in the second, but its support is 2, because only 2 records contain "A,B" as a contiguous subsequence.
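
The support definition above can be checked on the toy database with a short sketch. It is written in Python 3 (the notebook cells below use Python 2), and the helper names `contains_contiguous` and `support` are made up for this illustration; they are not part of the assignment or of SegPhrase.

```python
def contains_contiguous(sequence, pattern):
    """Return True if `pattern` occurs as a consecutive run inside `sequence`."""
    n, m = len(sequence), len(pattern)
    return any(sequence[i:i + m] == pattern for i in range(n - m + 1))


def support(database, pattern):
    """Number of sequences containing `pattern`; repeats within one sequence count once."""
    return sum(1 for sequence in database if contains_contiguous(sequence, pattern))


# The three example sequences from the task description.
database = [
    "A,B,A,C".split(","),
    "A,C,A,B,A,B".split(","),
    "B,A,A,C,D".split(","),
]

for pattern in (["A", "B", "A"], ["A", "C"], ["A", "B"], ["A", "A"]):
    print(",".join(pattern), support(database, pattern))
```

With a minimum support of 2, every pattern in the loop except "A,A" passes the threshold, which matches the discussion above.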


## SegPhrase
* Paper - http://hanj.cs.illinois.edu/pdf/sigmod15_jliu.pdf
* Implementation (in Python 2) - https://github.com/shangjingbo1226/SegPhrase/tree/master/src/frequent_phrase_mining


### Adapted from frequent_pattern_mining.py

In [144]:
# from sets import Set

def frequentPatternMining(tokens, patternOutputFilename, threshold):
    dict = {}

    tokensNumber = len(tokens)
    for i in xrange(tokensNumber):
        token = tokens[i]
        if token == '$':
            continue
        if token in dict:
            dict[token].append(i)
        else:
            dict[token] = [i]
    print "# of distinct tokens = ", len(dict)

    patternOutput = open(patternOutputFilename, 'w')

    frequentPatterns = []
    patternLength = 1
    while (len(dict) > 0):
        if patternLength > 6:
            break
        #print "working on length = ", patternLength
        patternLength += 1
        newDict = {}
        for pattern, positions in dict.items():
            occurrence = len(positions)
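            # `occurrence` is the total number of positions in the token stream,
            # not the number of distinct reviews, so it is only a pre-filter here.
            # With ">" a pattern occurring exactly `threshold` times is dropped;
            # the assignment's "no smaller than" corresponds to ">=".
            #if occurrence >= threshold:
            if occurrence > threshold: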
                frequentPatterns.append(pattern)
                
                patternOutput.write(pattern + "," + str(occurrence) + "\n")
                for i in positions:
                    if i + 1 < tokensNumber:
                        if tokens[i + 1] == '$':
                            continue
                        newPattern = pattern + " " + tokens[i + 1]
                        if newPattern in newDict:
                            newDict[newPattern].append(i + 1)
                        else:
                            newDict[newPattern] = [i + 1]
        dict.clear()
        dict = newDict
    patternOutput.close()
    return frequentPatterns
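
The position-list idea used above (keep the end positions of every frequent pattern, then extend each one by the token that follows it) can also be written so that it counts support per record directly. The following is a minimal Python 3 sketch under that assumption, not the SegPhrase code: the function name `mine_contiguous_patterns` and the `max_length` parameter are hypothetical, patterns are tuples rather than space-joined strings, and record boundaries replace the `$` sentinel.

```python
from collections import defaultdict


def mine_contiguous_patterns(sequences, min_support, max_length=6):
    """Level-wise mining: extend each frequent pattern by the token that follows it.

    `sequences` is a list of token lists. Returns {pattern_tuple: support}, where
    support is the number of sequences containing the pattern at least once.
    """
    # Level 1: record every (sequence index, end position) of each single token.
    occurrences = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, token in enumerate(seq):
            occurrences[(token,)].append((sid, pos))

    frequent = {}
    length = 1
    while occurrences and length <= max_length:
        next_occurrences = defaultdict(list)
        for pattern, ends in occurrences.items():
            # Distinct sequence ids give the per-record support.
            sup = len({sid for sid, _ in ends})
            if sup < min_support:
                continue  # an infrequent pattern cannot have a frequent extension
            frequent[pattern] = sup
            for sid, pos in ends:
                if pos + 1 < len(sequences[sid]):
                    extended = pattern + (sequences[sid][pos + 1],)
                    next_occurrences[extended].append((sid, pos + 1))
        occurrences = next_occurrences
        length += 1
    return frequent


toy = [s.split(",") for s in ("A,B,A,C", "A,C,A,B,A,B", "B,A,A,C,D")]
result = mine_contiguous_patterns(toy, min_support=2)
for pattern, sup in sorted(result.items()):
    print(",".join(pattern), sup)
```

Because support is taken from the set of sequence ids, no second pass over the reviews is needed to turn occurrence counts into supports.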

### Adapted from main.py

In [145]:
%%time
# from frequent_pattern_mining import *
import re
import sys

# def main(argv):
ENDINGS = ".!?,;:\"[]\n" # add newline character


threshold = 100 # 1000
# rawTextInput = 'test1.txt'
rawTextInput = 'reviews_sample.txt'
patternOutputFilename = 'patterns.csv'
# argc = len(argv)
# for i in xrange(argc):
#     if argv[i] == "-raw" and i + 1 < argc:
#         rawTextInput = argv[i + 1]
#     elif argv[i] == "-thres" and i + 1 < argc:
#         threshold = int(argv[i + 1])
#     elif argv[i] == "-o" and i + 1 < argc:
#         patternOutputFilename = argv[i + 1]

raw = open(rawTextInput, 'r')
tokens = []
for line in raw:
    inside = 0
    chars = []
    for ch in line:
        if ch == '(':
            inside += 1
        elif ch == ')':
            inside -= 1
        elif inside == 0:
            if ch.isalpha():
                chars.append(ch.lower())
            elif ch == '\'':
                chars.append(ch)
            else:
                if len(chars) > 0:
                    tokens.append(''.join(chars))
                chars = []
        if ch in ENDINGS:
            tokens.append('$')
    if len(chars) > 0:
        tokens.append(''.join(chars))
        chars = []

print "# tokens = ", len(tokens)

frequentPatterns = frequentPatternMining(tokens, patternOutputFilename, threshold)

print "# of frequent pattern = ", len(frequentPatterns)
    
# if __name__ == "__main__":
#     main(sys.argv[1 : ])

# tokens =  621914
# of distinct tokens =  21753
# of frequent pattern =  1167
CPU times: user 5.77 s, sys: 103 ms, total: 5.87 s
Wall time: 5.85 s


In [171]:
print(type(frequentPatterns))
print(type(frequentPatterns[1])) # list of strings
print(frequentPatterns[2])

<type 'list'>
<type 'str'>
deli


### Count Support of Frequent Patterns

In [178]:
# read "reviews_sample.txt" into a list of strings (one review per line)
# https://stackoverflow.com/questions/3277503/how-do-i-read-a-file-line-by-line-into-a-list
with open('reviews_sample.txt', 'rt') as f:
    transactions = f.readlines()
print(type(transactions)) # list of strings
print(type(transactions[1]))
print(transactions[1])
print(len(transactions))

<type 'list'>
<type 'str'>
excellent food superb customer service miss mario machine used still great place steeped tradition

10000


In [179]:
%%time
# https://docs.python.org/2/library/collections.html
from collections import Counter

cnt = Counter()
# for each pattern, iterate through transactions to update counter

for pat in frequentPatterns:
    for line in transactions:
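        # `pat in line` is a character-level substring test, so a short pattern
        # such as 'th' also matches inside longer words like 'thing' or 'nothing'
        if pat in line: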
            #cnt[pat] += 1
            cnt.update({pat: 1})
print(len(cnt))

1167
CPU times: user 7.98 s, sys: 54.7 ms, total: 8.03 s
Wall time: 8.02 s
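
Since `pat in line` matches at the character level, a token-level count may be closer to what the assignment defines as support. The sketch below is one possible way to do that in Python 3, assuming patterns are space-joined token strings as in `frequentPatterns`; the function name `count_support` and the toy lines are made up for illustration.

```python
from collections import Counter


def count_support(patterns, lines):
    """Count, for each space-separated pattern, the lines that contain it as whole tokens."""
    support = Counter()
    # Pad each line with spaces so a padded pattern can only match at token boundaries.
    padded_lines = [" " + " ".join(line.split()) + " " for line in lines]
    for pattern in patterns:
        needle = " " + pattern + " "
        support[pattern] = sum(1 for line in padded_lines if needle in line)
    return support


# Toy data, not taken from reviews_sample.txt.
lines = [
    "good deli nothing special\n",
    "best deli town\n",
    "delicious thin crust\n",
]
support = count_support(["deli", "th", "good deli"], lines)
print(support)
```

On these toy lines a plain substring test would count "deli" in all three lines and "th" in two of them, whereas the padded comparison only counts whole tokens.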


In [180]:
cnt

Counter({'limited': 143,
         'forget': 167,
         'chinese': 132,
         'really good': 274,
         'lack': 466,
         'dollar': 148,
         'month': 332,
         'four': 245,
         'asian': 87,
         'dish': 629,
         'hate': 364,
         'worked': 164,
         'make sure': 211,
         'apartment': 65,
         'bike': 89,
         'th': 6888,
         'dressing': 176,
         'smile': 112,
         'sorry': 152,
         'worth': 802,
         'sound': 181,
         'woman': 229,
         'appointment': 244,
         'worse': 110,
         'sitting': 285,
         'deli': 1725,
         'one favorite': 138,
         'fan': 934,
         'surprise': 300,
         'awful': 153,
         'ticket': 121,
         'much better': 162,
         'school': 163,
         'list': 547,
         'large': 682,
         'frozen': 144,
         'small': 940,
         'clothes': 70,
         'enjoy': 919,
         'neighborhood': 298,
         'food great': 164,
      

## Remove patterns whose support is below the minimum support

In [157]:
# https://stackoverflow.com/questions/15861739/removing-objects-whose-counts-are-less-than-threshold-in-counter
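min_sup = 100
# iterate over a copy of the keys so entries can be deleted safely
for i in list(cnt):
    # `<=` also removes patterns whose support is exactly min_sup;
    # keeping them ("no smaller than 100") would need `cnt[i] < min_sup`
    if cnt[i] <= min_sup:
        del cnt[i]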

In [159]:
print(len(cnt))

1088


In [160]:
cnt

Counter({'limited': 143,
         'forget': 167,
         'chinese': 132,
         'really good': 274,
         'lack': 466,
         'dollar': 148,
         'month': 332,
         'four': 245,
         'dish': 629,
         'hate': 364,
         'worked': 164,
         'make sure': 211,
         'th': 6888,
         'dressing': 176,
         'smile': 112,
         'sorry': 152,
         'worth': 802,
         'sound': 181,
         'woman': 229,
         'appointment': 244,
         'worse': 110,
         'sitting': 285,
         'deli': 1725,
         'one favorite': 138,
         'fan': 934,
         'surprise': 300,
         'awful': 153,
         'ticket': 121,
         'much better': 162,
         'school': 163,
         'list': 547,
         'large': 682,
         'frozen': 144,
         'small': 940,
         'enjoy': 919,
         'neighborhood': 298,
         'food great': 164,
         'tea': 1186,
         'bacon': 197,
         'past': 1060,
         'burger': 369,
       

## Hunch: sort dictionary by key
https://stackoverflow.com/questions/9001509/how-can-i-sort-a-dictionary-by-key

In [164]:
import collections
sorted_dict = collections.OrderedDict(sorted(cnt.items()))

In [165]:
sorted_dict

OrderedDict([('able', 2800),
             ('absolutely', 392),
             ('across', 261),
             ('across street', 113),
             ('actual', 656),
             ('actually', 570),
             ('add', 685),
             ('added', 165),
             ('addition', 174),
             ('afternoon', 191),
             ('ago', 458),
             ('ahead', 127),
             ('almost', 436),
             ('alone', 133),
             ('along', 200),
             ('already', 241),
             ('also', 1985),
             ('although', 356),
             ('always', 1475),
             ('am', 4810),
             ('amazing', 864),
             ('ambiance', 148),
             ('american', 153),
             ('amount', 286),
             ('another', 756),
             ('anyone', 277),
             ('anything', 582),
             ('anyway', 220),
             ('anywhere', 207),
             ('apparently', 131),
             ('appetizer', 466),
             ('apple', 194),
             ('ap

### Write counter to output

In [166]:
# write the sorted results to "output/segphrase_sorted.txt" as support:item_1;item_2
with open('output/segphrase_sorted.txt', 'wt') as f:
    for k, v in sorted_dict.items():
        f.write(str(v) + ':' + str(k).replace(" ", ";") + '\n')

### Reformat output .csv as .txt file for submission

In [93]:
import pandas as pd

df = pd.read_csv("patterns.csv", header=None, names=['pattern', 'support'])

In [92]:
df.tail(30)

Unnamed: 0,pattern,support
1148,service good,128
1149,good thing,110
1150,happy hour,216
1151,definitely back,141
1152,french toast,127
1153,first time,331
1154,food great,164
1155,great place,283
1156,looked like,122
1157,place like,108


In [87]:
df.dtypes

pattern    object
support     int64
dtype: object

### Test Iteration

In [106]:
# use test first
df_test = df[1100:]
df_test.head()

Unnamed: 0,pattern,support
1100,book,173
1101,incredibly,145
1102,priced,250
1103,customer service,228
1104,good food,210


In [111]:
# use iterrows
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
# https://stackoverflow.com/questions/1007481/how-do-i-replace-whitespaces-with-underscore-and-vice-versa
df_test = df[1100:]
for index, row in df_test.iterrows():
    print str(row['support']) + ":" + row['pattern'].replace(" ", ";")

173:book
145:incredibly
250:priced
228:customer;service
210:good;food
201:whole;food
103:market;square
134:would;recommend
112:good;place
248:every;time
423:ice;cream
102:next;day
105:dining;room
100:beer;cave
249:come;back
181:look;like
113:really;nice
133:wait;staff
154:great;food
100:one;thing
205:hot;dog
154:going;back
120:dor;stop
115:tasted;like
110:reasonable;price
103:mac;cheese
121:across;street
269:even;though
165:last;time
130:would;definitely
134:place;get
158:last;night
239:feel;like
128:long;time
155:great;service
189:one;best
219:highly;recommend
149:good;service
101:nothing;special
208:beer;selection
125:pasta;trio
167:much;better
124:food;service
143:one;favorite
104:reasonably;priced
107:lobster;roll
144:grocery;store
115:saturday;night
128:service;good
110:good;thing
216:happy;hour
141:definitely;back
127:french;toast
331:first;time
164:food;great
283:great;place
122:looked;like
108:place;like
235:next;time
163:strip;district
181:love;place
124:felt;like
252:food;goo

## Output
Please set the relative minimum support to 0.01 and run it on the given text file. In other words, you need to extract all the frequent contiguous sequential patterns that have an absolute support no smaller than 100.

Please write all the frequent contiguous sequential patterns along with their absolute supports into a text file named "patterns.txt". Every line corresponds to exactly one pattern you found and should be in the following format:

support:item_1;item_2;item_3

For example, suppose the phrase "parking lot" has an absolute support 133, then the line corresponding to this frequent contiguous sequential pattern in "patterns.txt" should be:

133:parking;lot

Notice that the order does matter in sequential pattern mining. That is to say,

133:lot;parking

may be graded as incorrect.
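
The threshold and the line format described above can be sketched as follows. This is a minimal Python 3 illustration, not the notebook's own code: the helper names `select_frequent` and `format_pattern_line` are hypothetical, and the supports in the example are toy values.

```python
import math


def format_pattern_line(pattern, support):
    """Render one output line, e.g. ("parking lot", 133) -> "133:parking;lot"."""
    return "{}:{}".format(support, ";".join(pattern.split()))


def select_frequent(supports, n_records, relative_min_support):
    """Keep patterns whose absolute support is no smaller than the threshold."""
    # Round before taking the ceiling to guard against floating-point noise.
    min_support = math.ceil(round(relative_min_support * n_records, 9))
    return {p: s for p, s in supports.items() if s >= min_support}


# Toy supports; 0.01 * 10,000 reviews gives an absolute threshold of 100.
supports = {"parking lot": 133, "beer cave": 100, "bike": 89}
kept = select_frequent(supports, n_records=10000, relative_min_support=0.01)
for pattern, support in kept.items():
    print(format_pattern_line(pattern, support))
```

Note that the comparison is `>=`, so a pattern with support exactly 100 is kept, and that token order inside the pattern is preserved when spaces are replaced by semicolons.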

In [132]:
# subset rows where support >= 100 (the assignment's minimum support)
df = df[df['support'] >= 100]
df.head()

Unnamed: 0,pattern,support
0,four,249
1,looking,832
2,deli,156
3,wednesday,100
4,straight,115


In [113]:
print(df.shape)

(1178, 2)


In [116]:
# use iterrows
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
# https://stackoverflow.com/questions/1007481/how-do-i-replace-whitespaces-with-underscore-and-vice-versa
with open('output/patterns.txt', 'wt') as f:
    for index, row in df.iterrows():
        f.write(str(row['support']) + ":" + row['pattern'].replace(" ", ";") + "\n")