# Frequent Itemsets
Let's work with the set of tweets used in the assignments.

In [1]:
import zipfile
from pathlib import Path
if not Path('../assignments/10000 tweets-NEW.json').exists():
    print("Unzipping tweets")
    with zipfile.ZipFile('../assignments/cleaned-tweets.zip') as myzip:
        myzip.extractall('../assignments')

For this lecture, we will count the frequencies of pairs of words in the collection of tweets. The following code implements a Python generator that yields the list of words in each tweet after removing a pre-defined set of very frequent words (the so-called *stop words*). Tweets are split on whitespace with `str.split`, and the stop words come from NLTK's English stop-word list.

In [2]:
import nltk
import json

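nltk.download('stopwords')
# A set gives constant-time membership tests in the filter below.
stop_words = set(nltk.corpus.stopwords.words('english'))

def stream_tweets():
    """Yield the non-stop-word tokens of each tweet, one list per tweet."""
    with open('../assignments/10000 tweets-NEW.json', encoding='iso8859-1') as jfile:
        for line in jfile:
            try:
                next_tweet = json.loads(line)['body']
                tweet_words = [w for w in next_tweet.split()
                               if w.lower() not in stop_words]
            except (ValueError, KeyError, TypeError, AttributeError):
                # Skip lines that are not valid JSON or have no text 'body'.
                continue
            yield tweet_words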

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
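
As a small illustration of the filtering step inside `stream_tweets`, the sketch below applies the same kind of list comprehension to a single made-up sentence. The tiny `toy_stop_words` set, the `remove_stop_words` helper and the sample text are assumptions for illustration only; the notebook itself uses NLTK's full English list.

```python
# Hypothetical stop-word set, standing in for NLTK's English list.
toy_stop_words = {"the", "a", "to", "is", "and", "of"}

def remove_stop_words(text, stop_words):
    """Split on whitespace and drop tokens whose lowercase form is a stop word."""
    return [w for w in text.split() if w.lower() not in stop_words]

sample = "The weather is going to change and a storm is likely"
filtered = remove_stop_words(sample, toy_stop_words)
print(filtered)
```

Because the split is on whitespace only, punctuation stays attached to the words, which is why tokens such as `'work.'` appear in the tweet output.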


In [3]:
counter = 0
for s in stream_tweets():
    if counter > 10:
        break
    counter += 1
    print(s)

['Register', '#Convergence2016', 'hear@ChelleMelbourne', 'talk', 'manage', 'change', 'digital', 'transformation', 'https://t.co/7pxwwDeaXm']
['CONGRATULATIONS', 'Suzie', 'Walker', 'beautiful', 'little', 'man', 'FANTASTIC', 'commitment', 'hard', 'work.', 'You...', 'https://t.co/m4QLVq0BTr']
['Wantirna,', 'VIC,', 'AU', '11:00', 'Temp', '19.8Â°C,', 'RH', '67pct,', 'Winds', 'NNW', '@', '0.0', 'km/h,', 'Rain', 'Today', '0', 'mm,', '1014.3', 'hpa', '&amp;', 'Steady.', '#vicweather']
['much', 'see', 'Visit', 'Central', 'Australia', '"We', 'kind', 'beaches', '#RedCentreNT!', 'The...', 'https://t.co/eRyP7eTk9X']
["Friday's", '3pm', 'Free', 'BBQ,', 'Bikini', 'Girls', '&amp;', 'Beverages', 'Meat', 'Raffles', 'around', 'Lunch', 'Time', 'Seafood', 'Raffle', '4pm...', 'https://t.co/etPRxJ2l7x']
['Say', 'Hello', 'Gorgeous', 'Gingham', 'Dress!', 'Perfect', 'day', 'Play', 'Party.', 'Originally', '$50', '$40!...', 'https://t.co/ILI5uaCIHw']
['Great', 'read', 'MBS', 'alumna', 'Leanda', 'Lee', '(MBA,', '1

## A-Priori

The following code implements the first pass of the A-Priori algorithm, which counts how often each individual word occurs. It uses `Counter` from Python's `collections` module to count the words.

In [4]:
from collections import Counter

counts = Counter()
for tweet_words in stream_tweets():
    counts.update(tweet_words)

In [5]:
counts.most_common(20)

[('RT', 1575),
 ('-', 427),
 ('&amp;', 202),
 ("I'm", 180),
 ('via', 163),
 ('like', 160),
 ('need', 143),
 ('April', 139),
 ('one', 127),
 ('get', 127),
 ('time', 123),
 ('going', 113),
 ('people', 112),
 ('Clinton', 104),
 ('New', 103),
 ('new', 102),
 ('threat', 100),
 ('right', 98),
 ('good', 95),
 ('Hillary', 93)]

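The output above lists the 20 most frequent "words"; because tweets are split on whitespace, these include tokens such as `RT`, `-` and `&amp;`. The following code performs pass 2 of A-Priori with a support threshold of 30: a pair is counted only if both of its words occur at least 30 times.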

In [6]:
import itertools
threshold = 30
counts_pairs = dict()
for tweet_words in stream_tweets():
    for wi, wj in itertools.combinations(tweet_words, 2):
        if counts[wi] < threshold or counts[wj] < threshold:
            continue
        if (wi, wj) not in counts_pairs:
            counts_pairs[(wi, wj)] = 1
        else:
            counts_pairs[(wi, wj)] += 1

In [7]:
len(counts_pairs)

3554

In [8]:
len(counts)

27306

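The output above shows that 3554 candidate pairs were counted, a tiny number compared with the hundreds of millions of pairs that 27306 distinct words could form. Let's see how many of these pairs have a support at or above the threshold: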

In [9]:
c = 0
for k in counts_pairs:
    if counts_pairs[k] >= threshold:
        c+=1

In [10]:
c

194

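We have found 194 frequent pairs of words. Note that `itertools.combinations` is applied to the word list in tweet order, so `(a, b)` and `(b, a)` are counted as different pairs and a word that is repeated in a tweet can be paired with itself. Let's print some of them.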

In [11]:
max_shown = 30  # print limit, independent of the support threshold
shown = 0
for pair, count in counts_pairs.items():
    if count >= threshold:
        print(pair, count)
        shown += 1
        if shown > max_shown:
            break

('April', 'Fools') 39
('RT', '-') 81
('RT', '&amp;') 79
('RT', 'people') 37
('RT', 'time') 41
('RT', 'like') 38
('RT', 'April') 31
('RT', 'Hillary') 59
('RT', 'one') 54
('-', '-') 70
('RT', 'activist') 47
('RT', "I'm") 68
('RT', 'love') 35
('liked', '@YouTube') 43
('liked', 'video') 43
('@YouTube', 'video') 43
('RT', 'going') 69
('RT', 'threat') 51
('RT', '#iHeartAwards') 58
('RT', '#5SOSFam') 51
('RT', '#BestFanArmy') 55
('#iHeartAwards', '#5SOSFam') 40
('#iHeartAwards', '#BestFanArmy') 43
('#5SOSFam', '#BestFanArmy') 47
('RT', '@Tha5SOSFamily:') 43
('@Tha5SOSFamily:', '#iHeartAwards') 41
('@Tha5SOSFamily:', '#BestFanArmy') 37
('@Tha5SOSFamily:', '#5SOSFam') 38
('#BestFanArmy', '#5SOSFam') 36
('RT', 'via') 52
('RT', 'need') 85
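
To recap the two passes on something small enough to check by hand, here is a minimal sketch of A-Priori for pairs on a toy dataset. The function name `apriori_pairs` and the baskets are assumptions for illustration. Unlike the notebook code above, this sketch deduplicates the items of each basket and sorts them, so support means "number of baskets containing the pair" and `(a, b)` is never counted separately from `(b, a)`.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, threshold):
    """Return {pair: support} for pairs whose support is >= threshold."""
    # Pass 1: support of each item (number of baskets that contain it).
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= threshold}

    # Pass 2: count only the pairs made of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}

# Hypothetical toy baskets (not taken from the tweets).
toy_baskets = [
    ["bread", "milk", "eggs"],
    ["milk", "bread"],
    ["bread", "butter", "milk"],
    ["eggs", "butter"],
    ["milk", "bread", "bread"],
]
toy_result = apriori_pairs(toy_baskets, threshold=3)
print(toy_result)
```

Only `bread` and `milk` reach the threshold in pass 1, so pass 2 never allocates a counter for any pair involving `eggs` or `butter`.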


## PCY
The following code performs pass 1 of the PCY algorithm. We will use a hash table with 100000 buckets.

In [12]:
nbuckets = 100000
def my_hash(i,j):
    return hash((i,j)) % nbuckets
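
One caveat about `my_hash`: Python salts the built-in `hash` of strings per process, so bucket assignments, and with them the number of pairs that survive pass 2, can differ from one run to the next unless `PYTHONHASHSEED` is fixed. A minimal sketch of a reproducible alternative follows, assuming CRC32 spreads the pairs well enough for this purpose. The name `stable_hash` is hypothetical and not used elsewhere in this notebook.

```python
import zlib

def stable_hash(i, j, nbuckets):
    """Map a pair of words to a bucket, identically in every Python process."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding by construction.
    key = f"{i}\x00{j}".encode("utf-8")
    return zlib.crc32(key) % nbuckets

bucket = stable_hash("April", "Fools", 100000)
print(bucket)
```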

In [13]:
import numpy as np
buckets = np.zeros(nbuckets)

counts = Counter()
for tweet_words in stream_tweets():
    counts.update(tweet_words)
    for i, j in itertools.combinations(tweet_words, 2):
        buckets[my_hash(i,j)] += 1

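The following code performs pass 2 of the PCY algorithm: a pair is counted only if both of its words are frequent and the pair hashes to a bucket whose count reached the threshold.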

In [14]:
counts_pairs = dict()
for tweet_words in stream_tweets():
    for i, j in itertools.combinations(tweet_words, 2):
        if counts[i] < threshold or counts[j] < threshold:
            continue
        if buckets[my_hash(i,j)] < threshold:
            continue
        # print("Counting pair", i, j)
        if (i,j) not in counts_pairs:
            counts_pairs[(i,j)] = 1
        else:
            counts_pairs[(i,j)] += 1

Let's see how many pairs were counted:

In [15]:
len(counts_pairs)

212
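
The cells above keep the full array of bucket counts for pass 2. PCY is usually described with one extra step between the passes: the counts are replaced by a bitmap with one bit per bucket, which frees memory for the pair counters. The sketch below shows that variant on a toy dataset. The names `bucket_of` and `pcy_pairs`, the CRC32-based hash and the baskets are assumptions for illustration, and items are deduplicated and sorted within each basket, unlike in the notebook code.

```python
from collections import Counter
from itertools import combinations
import zlib

def bucket_of(pair, nbuckets):
    """Deterministic bucket index for a pair of strings."""
    return zlib.crc32("\x00".join(pair).encode("utf-8")) % nbuckets

def pcy_pairs(baskets, threshold, nbuckets):
    """Return {pair: support} for frequent pairs, using PCY's bucket filter."""
    # Pass 1: count items and hash every pair of each basket into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * nbuckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, nbuckets)] += 1

    # Between passes: keep a single boolean per bucket.
    bitmap = [n >= threshold for n in bucket_counts]

    # Pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if item_counts[i] >= threshold)
        for pair in combinations(items, 2):
            if bitmap[bucket_of(pair, nbuckets)]:
                pair_counts[pair] += 1
    return {p: n for p, n in pair_counts.items() if n >= threshold}

# Hypothetical toy baskets (not taken from the tweets).
toy_baskets = [
    ["bread", "milk", "eggs"],
    ["milk", "bread"],
    ["bread", "butter", "milk"],
    ["eggs", "butter"],
    ["milk", "bread", "bread"],
]
pcy_result = pcy_pairs(toy_baskets, threshold=3, nbuckets=8)
print(pcy_result)
```

A frequent pair always lands in a frequent bucket, because the bucket count is at least the pair's own support. The bucket filter can therefore let infrequent pairs through after a collision, but it never drops a frequent one.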

The following code tries several numbers of buckets and prints the number of pairs counted for each.

In [16]:
def my_hash(i, j, nbuckets):
    return hash((i,j)) % nbuckets

plotdata = []
for nbuckets in (5000, 10000, 20000, 50000, 100000, 200000, 500000):
    print("Trying with %i buckets" % nbuckets)
    # Pass 1
    buckets = np.zeros(nbuckets)
    counts = Counter()
    for tweet_words in stream_tweets():
        counts.update(tweet_words)
        for i, j in itertools.combinations(tweet_words, 2):
            buckets[my_hash(i, j, nbuckets)] += 1
            
    # Pass 2
    counts_pairs = dict()
    for tweet_words in stream_tweets():
        for i, j in itertools.combinations(tweet_words, 2):
            if counts[i] < threshold or counts[j] < threshold:
                continue
            if buckets[my_hash(i, j, nbuckets)] < threshold:
                continue
            # print("Counting pair", i, j)
            if (i,j) not in counts_pairs:
                counts_pairs[(i,j)] = 1
            else:
                counts_pairs[(i,j)] += 1 
    print("Number of pairs counted:", len(counts_pairs))
    plotdata.append(len(counts_pairs))

Trying with 5000 buckets
Number of pairs counted: 3554
Trying with 10000 buckets
Number of pairs counted: 2417
Trying with 20000 buckets
Number of pairs counted: 394
Trying with 50000 buckets
Number of pairs counted: 241
Trying with 100000 buckets
Number of pairs counted: 212
Trying with 200000 buckets
Number of pairs counted: 203
Trying with 500000 buckets
Number of pairs counted: 196
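
To read these numbers against the A-Priori run, the short sketch below computes, for each number of buckets, how many fewer pairs PCY counted than the 3554 pairs counted by A-Priori. The figures are copied from the outputs above; the variable names are hypothetical.

```python
# Numbers copied from the outputs above.
apriori_counted = 3554
pcy_counted = {5000: 3554, 10000: 2417, 20000: 394, 50000: 241,
               100000: 212, 200000: 203, 500000: 196}

# Fraction of A-Priori's pair counters that PCY avoided allocating.
reduction = {n: 1 - c / apriori_counted for n, c in pcy_counted.items()}

for nbuckets, counted in pcy_counted.items():
    print(f"{nbuckets:>7} buckets: {counted:>4} pairs counted, "
          f"{reduction[nbuckets]:.1%} fewer than A-Priori")
```

With 5000 buckets the filter removes nothing, which suggests that every bucket reached the threshold. As the number of buckets grows, the count approaches the 194 pairs that are actually frequent, which is the lowest value it can reach.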
