## Task 2: Finding Similar Items

Using the ACL Anthology abstracts dataset, extract a list of the authors and editors per publication, create baskets, and perform a search for similar items:

basket 1: Mostafazadeh Davani Aida, Kiela Douwe, Lambert Mathias, Vidgen Bertie, Prabhakaran Vinodkumar, Waseem Zeerak<br>
basket 2: Singh Sumer, Li Sheng

1. Find the frequent pairs of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each of these, compare the execution time and results for supports s=10, 50, 100. Comment on your results.

2. For the PCY algorithm, create up to 5 compact hash tables. What is the difference in results and execution time for 1, 2, 3, 4 and 5 tables? Comment on your results.

3. Find the final list of k-frequent items (k-tuples) for k=3 and 4. Experiment a bit and describe the best value for the support in each case. *Warning*: You can use any of the three algorithms, but be careful, because the algorithm can take too long if you don't choose it properly (well, basically don't use the naïve approach ;)).

4. Using one of the results of the previous items, for one k (k=2 or 3), find the possible clusters using the 1-NN criterion. Comment on your results.

> 1-NN means that if two tuples such as {A,B,C} and {C,E,F} share one element ({C}), they belong to the same cluster {A,B,C,E,F}.
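
The 1-NN rule above amounts to merging every group of tuples that is linked by shared elements. A minimal sketch of that merging, assuming the frequent tuples are available as an iterable of sets (the function name `cluster_tuples` and the toy data are hypothetical, not part of the assignment):

```python
def cluster_tuples(tuples):
    """Merge itemsets that share at least one element into disjoint clusters."""
    clusters = []  # list of pairwise disjoint sets
    for t in tuples:
        merged = set(t)
        remaining = []
        for c in clusters:
            if c & merged:
                # this cluster shares an element with the new tuple: absorb it
                merged |= c
            else:
                remaining.append(c)
        remaining.append(merged)
        clusters = remaining
    return clusters


# toy input: the first two tuples share "C", the third shares nothing
example = [frozenset("ABC"), frozenset("CEF"), frozenset("XY")]
clusters = cluster_tuples(example)
print(clusters)
```

Because the clusters are kept disjoint at every step, a new tuple can bridge several existing clusters and they all collapse into one.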

### Extracting a list of the authors and editors per publication and creating baskets
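
As a compact illustration of what a basket is, the sketch below turns the value of one BibTeX `author` or `editor` field into a list of "Last First" names. It is only a sketch under stated assumptions: the helper `names_to_basket` is hypothetical, it expects names in the `Last, First` form joined by `and`, and it is not the extraction used in the cells that follow.

```python
import re


def names_to_basket(field_value):
    """Split a BibTeX name list ('Last, First and Last, First') into 'Last First' items."""
    # names are separated by the word "and" surrounded by whitespace (possibly a newline)
    names = re.split(r'\s+and\s+', field_value.strip())
    # drop the comma between last and first name
    return [" ".join(part.strip() for part in name.split(",")) for name in names if name]


# toy field value in the layout used by the anthology file
value = "Mostafazadeh Davani, Aida  and\n      Kiela, Douwe  and\n      Waseem, Zeerak"
print(names_to_basket(value))
```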

In [1]:
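# download the ACL Anthology BibTeX dump (with abstracts) and preview it
from urllib.request import urlopen
from io import BytesIO
import gzip

url = 'https://aclanthology.org/anthology+abstracts.bib.gz'
# decompress the archive and save it under the file name read by the later cells
with gzip.open(BytesIO(urlopen(url).read()), 'rb') as fb:
    with open('anthology+abstracts.bib', 'wb') as f:
        f.write(fb.read())

# print the first 20 lines of the file
with open('anthology+abstracts.bib', encoding='utf8') as file:
    for n in range(20):
        print(file.readline().rstrip('\n'))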

@proceedings{woah-2021-online,
    title = "Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)",
    editor = "Mostafazadeh Davani, Aida  and
      Kiela, Douwe  and
      Lambert, Mathias  and
      Vidgen, Bertie  and
      Prabhakaran, Vinodkumar  and
      Waseem, Zeerak",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.woah-1.0",
}
@inproceedings{singh-li-2021-exploiting,
    title = "Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers",
    author = "Singh, Sumer  and
      Li, Sheng",
    booktitle = "Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)",
    month = aug,


In [2]:
# putting authors into a list
authors = []
copy = False

with open('anthology+abstracts.bib', encoding="utf8") as bibtex_file:
    for line in bibtex_file:
        if line.startswith( '    author = "' ):
            copy = True
        if line.startswith( '    booktitle = "' ):
            copy = False
        if copy:
            authors.append(line)

In [3]:
# showing the first 10 rows of authors
authors[:10]

['    author = "Singh, Sumer  and\n',
 '      Li, Sheng",\n',
 '    author = "Hahn, Vanessa  and\n',
 '      Ruiter, Dana  and\n',
 '      Kleinbauer, Thomas  and\n',
 '      Klakow, Dietrich",\n',
 '    author = "Caselli, Tommaso  and\n',
 '      Basile, Valerio  and\n',
 "      Mitrovi{\\'c}, Jelena  and\n",
 '      Granitzer, Michael",\n']

In [4]:
# Cleaning the list
authors = [x.replace('    author = ', '') for x in authors]
authors = [x.replace('      ', '') for x in authors]
authors = [x.replace('  and\n', '') for x in authors]
authors = [x.replace(', ', ' ') for x in authors]
authors = [x.replace('"', '') for x in authors]
authors = [x.replace(',', '') for x in authors]

In [5]:
import re

authors_new = []
for s in authors:
    authors_new.extend(re.split(r'(?=\n)', s))

In [6]:
authors_new = [x.replace('\n', '') for x in authors_new]
authors_new[:10]

['Singh Sumer',
 'Li Sheng',
 '',
 'Hahn Vanessa',
 'Ruiter Dana',
 'Kleinbauer Thomas',
 'Klakow Dietrich',
 '',
 'Caselli Tommaso',
 'Basile Valerio']

In [7]:
# putting editors into a list
editors = []
copy = False

with open('anthology+abstracts.bib', encoding="utf8") as bibtex_file:
    for line in bibtex_file:
        if line.startswith( '    editor = "' ):
            copy = True
        if line.startswith( '    month = ' ):
            copy = False
        if copy:
            editors.append(line)

In [8]:
# showing the first 10 rows of editors list
editors[:10]

['    editor = "Mostafazadeh Davani, Aida  and\n',
 '      Kiela, Douwe  and\n',
 '      Lambert, Mathias  and\n',
 '      Vidgen, Bertie  and\n',
 '      Prabhakaran, Vinodkumar  and\n',
 '      Waseem, Zeerak",\n',
 '    editor = "Xu, Wei  and\n',
 '      Ritter, Alan  and\n',
 '      Baldwin, Tim  and\n',
 '      Rahimi, Afshin",\n']

In [9]:
# Cleaning the list
editors = [x.replace('    editor = ', '') for x in editors]
editors = [x.replace('      ', '') for x in editors]
editors = [x.replace('  and\n', '') for x in editors]
editors = [x.replace(', ', ' ') for x in editors]
editors = [x.replace('"', '') for x in editors]
editors = [x.replace(',', '') for x in editors]

In [10]:
import re

editors_new = []
for s in editors:
    editors_new.extend(re.split(r'(?=\n)', s))

In [11]:
editors_new = [x.replace('\n', '') for x in editors_new]
editors_new[:10]

['Mostafazadeh Davani Aida',
 'Kiela Douwe',
 'Lambert Mathias',
 'Vidgen Bertie',
 'Prabhakaran Vinodkumar',
 'Waseem Zeerak',
 '',
 'Xu Wei',
 'Ritter Alan',
 'Baldwin Tim']

In [12]:
#uniting authors and editors lists
all_authors = authors_new + editors_new
all_authors[:10]

['Singh Sumer',
 'Li Sheng',
 '',
 'Hahn Vanessa',
 'Ruiter Dana',
 'Kleinbauer Thomas',
 'Klakow Dietrich',
 '',
 'Caselli Tommaso',
 'Basile Valerio']

In [13]:
# the number of entries (names plus the empty strings that separate baskets)
len(all_authors)

349876

In [14]:
# saving all the authors and editors as a txt file
with open('all_authors.txt', 'w', encoding='utf-8') as output_file:
    for author in all_authors:
        output_file.write(author + '\n')

In [15]:
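import itertools  # used by readdata below

def readdata(k, fname="all_authors.txt", report=False):
    """Yield every k-item combination (as a frozenset) of each basket in fname."""
    C_k = []
    b = 0

    # the file was written as UTF-8, so read it back with the same encoding
    with open(fname, "rt", encoding='utf-8') as f: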
        for line in f:
            line = line.replace('\n', '')  # remove newline symbol
            if report:
                print(line)
         
            if line != "":
                # gather all items in one basket
                C_k.append(line)
            else:
                # end of basket, report all itemsets
                for itemset in itertools.combinations(C_k, k):
                    yield frozenset(itemset)
                C_k = []
                
                if report:
                    print("")

                # report progress
                # print every 1000th element to reduce clutter
                if report:
                    if b % 1000 == 0:
                        print('processing bin ', b)
                    b += 1

    # last basket
    if len(C_k) > 0:
        for itemset in itertools.combinations(C_k, k):
            yield frozenset(itemset)
    

In [16]:
readdata(k = 1)

<generator object readdata at 0x000001FA9F7E09E0>

In [17]:
import itertools
nitems = 5
for C_k in readdata(k = 1,report=True):
    print(C_k)
    nitems -= 1
    
    if nitems == 0:
        break

Singh Sumer
Li Sheng

frozenset({'Singh Sumer'})
frozenset({'Li Sheng'})

processing bin  0
Hahn Vanessa
Ruiter Dana
Kleinbauer Thomas
Klakow Dietrich

frozenset({'Hahn Vanessa'})
frozenset({'Ruiter Dana'})
frozenset({'Kleinbauer Thomas'})


## 1. Find the frequent pair of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each of these compare the time of execution and results for supports s=10, 50, 100. Comment your results. 

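For many frequent-itemset algorithms, main memory is the critical resource.

The naive approach reads the file once and counts the occurrences of every pair in main memory. Its problem is that when there are too many distinct pairs, the counts do not fit into memory.

A-priori limits the need for main memory by making two passes: the first counts individual items, and the second counts only the pairs whose items are both frequent.

The PCY algorithm is a slightly modified version of A-priori. In the first pass it stores the individual item counts and also hashes every pair into a bucket of a hash table, keeping one count per bucket. In the second pass a pair is counted only if both of its items are frequent and it hashes to a frequent bucket.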
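
The difference in memory use between the naive approach and A-priori can be seen on a handful of baskets. The sketch below is illustrative only: the toy `baskets` and the threshold `s = 3` are made up, and it uses `collections.Counter` rather than the dictionaries used in the cells of this notebook.

```python
from collections import Counter
from itertools import combinations

# toy baskets (hypothetical data)
baskets = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["B", "D"],
    ["A", "B", "D"],
]
s = 3  # support threshold

# naive: keep one counter for every pair that occurs
naive_counts = Counter(
    frozenset(p) for b in baskets for p in combinations(sorted(b), 2)
)
naive_frequent = {p for p, n in naive_counts.items() if n >= s}

# A-priori, pass 1: count single items and keep the frequent ones
item_counts = Counter(i for b in baskets for i in b)
frequent_items = {i for i, n in item_counts.items() if n >= s}

# A-priori, pass 2: count only pairs made of frequent items
apriori_counts = Counter(
    frozenset(p)
    for b in baskets
    for p in combinations(sorted(i for i in b if i in frequent_items), 2)
)
apriori_frequent = {p for p, n in apriori_counts.items() if n >= s}

print(len(naive_counts), "pair counters (naive)")
print(len(apriori_counts), "pair counters (A-priori)")
print(naive_frequent == apriori_frequent)
```

Both approaches return the same frequent pairs; A-priori simply never allocates counters for pairs that contain an infrequent item.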


## Naive Approach

In [18]:
import time

def get_C(k):

    start = time.time()
    C = {}
    for key in readdata(k):  # report defaults to False
        if key not in C:
            C[key] = 1
        else:
            C[key] += 1
    print("Took {}s for k={}".format((time.time() - start), k))
    return C


C1 = get_C(1)
C2 = get_C(2)

Took 0.4076042175292969s for k=1
Took 0.7630152702331543s for k=2


In [19]:
print(len(C1),len(C2))

74315 248623


In [20]:
nitems = 10
for ck,n in C2.items():
    print(ck, n)
    nitems -= 1
    if nitems == 0: break

frozenset({'Singh Sumer', 'Li Sheng'}) 1
frozenset({'Ruiter Dana', 'Hahn Vanessa'}) 1
frozenset({'Hahn Vanessa', 'Kleinbauer Thomas'}) 1
frozenset({'Klakow Dietrich', 'Hahn Vanessa'}) 1
frozenset({'Ruiter Dana', 'Kleinbauer Thomas'}) 1
frozenset({'Ruiter Dana', 'Klakow Dietrich'}) 5
frozenset({'Klakow Dietrich', 'Kleinbauer Thomas'}) 2
frozenset({'Basile Valerio', 'Caselli Tommaso'}) 3
frozenset({"Mitrovi{\\'c} Jelena", 'Caselli Tommaso'}) 3
frozenset({'Granitzer Michael', 'Caselli Tommaso'}) 3


### Naive Approach, s = 10

In [21]:
start = time.time()
s = 10 # support threshold
L2 = {}
for key, n in C2.items():
    if n >= s:
        L2[key] = n
print('Naive approach: {} pairs with >={} occurrences'.format(len(L2), s))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

Naive approach: 1713 pairs with >=10 occurrences
It took 0.04 sec.


### Naive Approach, s = 50

In [22]:
start = time.time()
s = 50 # support threshold
L2 = {}
for key, n in C2.items():
    if n >= s:
        L2[key] = n
print('Naive approach: {} pairs with >={} occurrences'.format(len(L2), s))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

Naive approach: 12 pairs with >=50 occurrences
It took 0.04 sec.


### Naive Approach, s = 100

In [23]:
start = time.time()
s = 100 # support threshold
L2 = {}
for key, n in C2.items():
    if n >= s:
        L2[key] = n
print('Naive approach: {} pairs with >={} occurrences'.format(len(L2), s))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

Naive approach: 0 pairs with >=100 occurrences
It took 0.03 sec.


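As we increase the support threshold, the number of frequent pairs decreases: 1713 pairs for s = 10, 12 for s = 50, and none for s = 100. Note that the times reported above (0.03 to 0.04 s) cover only the filtering step; counting all pairs in `get_C(2)` took about 0.76 s and is the same for every threshold.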
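
Since the naive counts do not depend on the threshold, the counting can be done once and the filter applied for every support value. A minimal sketch of that idea, assuming any iterable of itemsets as input (the helper `frequent_by_threshold` and the toy pairs are hypothetical):

```python
import time
from collections import Counter


def frequent_by_threshold(itemsets, thresholds):
    """Count all itemsets once, then filter the counts for each threshold."""
    start = time.perf_counter()
    counts = Counter(itemsets)  # the expensive, memory-hungry step
    count_seconds = time.perf_counter() - start

    # one cheap filtering pass over the counts per threshold
    frequent = {
        s: {key: n for key, n in counts.items() if n >= s}
        for s in thresholds
    }
    return frequent, count_seconds


# toy stream of pairs standing in for the pairs read from the file
toy_pairs = [frozenset("AB")] * 3 + [frozenset("AC")] * 2 + [frozenset("BD")]
frequent, count_seconds = frequent_by_threshold(toy_pairs, [1, 2, 3])
for s, pairs in frequent.items():
    print(s, len(pairs))
```

With the notebook's data the call would presumably look like `frequent_by_threshold(readdata(k=2), [10, 50, 100])`, and the returned `count_seconds` gives the part of the running time that the filtering cells above leave out.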

## A-priori

### A-priori, s = 10

In [24]:
start = time.time()

N = 10  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
37555 items
A-priori: 1713 items with >10 occurances
It took 35.75 sec.


### A-priori, s = 50

In [25]:
start = time.time()
N = 50  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
521 items with >50 occurances
3516 items
A-priori: 12 items with >50 occurances
It took 3.82 sec.


### A-priori, s = 100

In [26]:
start = time.time()
N = 100  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
117 items with >100 occurances
426 items
A-priori: 0 items with >100 occurances
It took 0.84 sec.


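As we increase the frequency threshold, the number of frequent pairs decreases (1713, 12, and 0, the same results as the naive approach) and so does the running time (35.75 s, 3.82 s, and 0.84 s). Most of the time at low thresholds likely goes into building `C2_items`, which unions every pair of frequent items: with 4316 frequent items at N = 10 that is about 18.6 million unions.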
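
The candidate test does not require materializing every union of frequent items. A pair is a candidate exactly when each of its items is frequent, so the check can be done per pair. The sketch below assumes `L1` is keyed by 1-item frozensets, as in the cells above; the helper `is_candidate` and the toy values are hypothetical.

```python
def is_candidate(itemset, frequent_items):
    """True if every single item of `itemset` is frequent (A-priori monotonicity)."""
    return all(item in frequent_items for item in itemset)


# toy stand-in for L1, keyed by 1-item frozensets
L1_toy = {frozenset(["A"]): 12, frozenset(["B"]): 15, frozenset(["C"]): 11}

# flatten the keys into a plain set of item names
frequent_items = {item for key in L1_toy for item in key}

pairs = [frozenset(["A", "B"]), frozenset(["A", "Z"]), frozenset(["B", "C"])]
kept = [p for p in pairs if is_candidate(p, frequent_items)]
print(kept)
```

The set of names grows linearly with the number of frequent items, whereas the set of all unions grows quadratically.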

## PCY

### PCY, N = 10

In [27]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
H1 = np.zeros((max_hash1,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])

del H1

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1733 items
PCY: 1713 items with >10 occurances
It took 36.25 sec.


### PCY, N = 50

In [28]:
start = time.time()

# hash table
N = 50

max_hash1 = 10*1000000
H1 = np.zeros((max_hash1,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])

del H1


# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))


# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
521 items with >50 occurances
12 items
PCY: 12 items with >50 occurances
It took 4.39 sec.


### PCY, N = 100

In [29]:
start = time.time()

# hash table
N = 100

max_hash1 = 10*1000000
H1 = np.zeros((max_hash1,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])

del H1


# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))


# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
117 items with >100 occurances
0 items
PCY: 0 items with >100 occurances
It took 1.64 sec.


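As the frequency threshold increases, the number of frequent pairs and the running time both decrease (36.25 s, 4.39 s, and 1.64 s). The frequent pairs are the same as with A-priori, but the hash filter leaves far fewer candidate pairs to count in the second stage: 1733, 12, and 0 instead of 37555, 3516, and 426. The running time is not lower than A-priori's, likely because this implementation reads the data one extra time to fill the hash table.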
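
The two PCY passes can be summarized on toy data. This is a sketch under stated assumptions rather than the notebook's implementation: it uses `zlib.crc32` as a deterministic hash (Python's built-in `hash()` on strings is salted per process, so bucket numbers differ between runs), and the names `stable_bucket`, `pcy_pairs`, and the toy `baskets` are hypothetical.

```python
import zlib
from collections import Counter
from itertools import combinations


def stable_bucket(pair, n_buckets):
    """Deterministic bucket index: crc32 of the sorted, joined item names."""
    data = "|".join(sorted(pair)).encode("utf-8")
    return zlib.crc32(data) % n_buckets


def pcy_pairs(baskets, s, n_buckets=1000):
    # pass 1: count single items and hash every pair into a bucket
    item_counts = Counter()
    buckets = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(basket, 2):
            buckets[stable_bucket(pair, n_buckets)] += 1
    frequent_items = {i for i, n in item_counts.items() if n >= s}
    bitmap = [n >= s for n in buckets]  # compact table: one flag per bucket

    # pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(basket, 2):
            if (pair[0] in frequent_items and pair[1] in frequent_items
                    and bitmap[stable_bucket(pair, n_buckets)]):
                pair_counts[frozenset(pair)] += 1
    return {p: n for p, n in pair_counts.items() if n >= s}


baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"], ["A", "B", "D"]]
print(pcy_pairs(baskets, 3))
```

A frequent pair always lands in a frequent bucket, so the bucket filter can only remove infrequent pairs and never changes the final result.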

## 2. For the PCY algorithm, create up to 5 compact hash tables. What is  the difference in results and time of execution for 1,2,3,4 and 5 tables? Comment your results.

### 1 table

In [30]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
H1 = np.zeros((max_hash1,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])

del H1

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1733 items
PCY: 1713 items with >10 occurances
It took 34.27 sec.


### 2 tables

In [31]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
max_hash2 = 10*1000000+1000
H1 = np.zeros((max_hash1,), dtype=int)
H2 = np.zeros((max_hash2,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    hash_cell_2 = hash(key) % max_hash2
    H2[hash_cell_2] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])
H_good_2 = set(np.where(H2 >= N)[0])

del H1
del H2
# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue
    hash_cell_2 = hash(key) % max_hash2
    if hash_cell_2 not in H_good_2:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1713 items
PCY: 1713 items with >10 occurances
It took 80.48 sec.


### 3 tables

In [32]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
max_hash2 = 10*1000000+1000
max_hash3 = 10*1000000+2000
H1 = np.zeros((max_hash1,), dtype=int)
H2 = np.zeros((max_hash2,), dtype=int)
H3 = np.zeros((max_hash3,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    hash_cell_2 = hash(key) % max_hash2
    H2[hash_cell_2] += 1
    hash_cell_3 = hash(key) % max_hash3
    H3[hash_cell_3] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])
H_good_2 = set(np.where(H2 >= N)[0])
H_good_3 = set(np.where(H3 >= N)[0])

del H1
del H2
del H3
# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue
    hash_cell_2 = hash(key) % max_hash2
    if hash_cell_2 not in H_good_2:
        continue
        
    hash_cell_3 = hash(key) % max_hash3
    if hash_cell_3 not in H_good_3:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1713 items
PCY: 1713 items with >10 occurances
It took 99.70 sec.


### 4 tables

In [33]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
max_hash2 = 10*1000000+1000
max_hash3 = 10*1000000+2000
max_hash4 = 10*1000000+3000
H1 = np.zeros((max_hash1,), dtype=int)
H2 = np.zeros((max_hash2,), dtype=int)
H3 = np.zeros((max_hash3,), dtype=int)
H4 = np.zeros((max_hash4,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    hash_cell_2 = hash(key) % max_hash2
    H2[hash_cell_2] += 1
    hash_cell_3 = hash(key) % max_hash3
    H3[hash_cell_3] += 1
    hash_cell_4 = hash(key) % max_hash4
    H4[hash_cell_4] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])
H_good_2 = set(np.where(H2 >= N)[0])
H_good_3 = set(np.where(H3 >= N)[0])
H_good_4 = set(np.where(H4 >= N)[0])

del H1
del H2
del H3
del H4
# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue
    hash_cell_2 = hash(key) % max_hash2
    if hash_cell_2 not in H_good_2:
        continue
        
    hash_cell_3 = hash(key) % max_hash3
    if hash_cell_3 not in H_good_3:
        continue
        
    hash_cell_4 = hash(key) % max_hash4
    if hash_cell_4 not in H_good_4:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1713 items
PCY: 1713 items with >10 occurances
It took 102.78 sec.


### 5 tables

In [34]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
max_hash2 = 10*1000000+1000
max_hash3 = 10*1000000+2000
max_hash4 = 10*1000000+3000
max_hash5 = 10*1000000+4000
H1 = np.zeros((max_hash1,), dtype=int)
H2 = np.zeros((max_hash2,), dtype=int)
H3 = np.zeros((max_hash3,), dtype=int)
H4 = np.zeros((max_hash4,), dtype=int)
H5 = np.zeros((max_hash5,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    hash_cell_2 = hash(key) % max_hash2
    H2[hash_cell_2] += 1
    hash_cell_3 = hash(key) % max_hash3
    H3[hash_cell_3] += 1
    hash_cell_4 = hash(key) % max_hash4
    H4[hash_cell_4] += 1
    hash_cell_5 = hash(key) % max_hash5
    H5[hash_cell_5] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])
H_good_2 = set(np.where(H2 >= N)[0])
H_good_3 = set(np.where(H3 >= N)[0])
H_good_4 = set(np.where(H4 >= N)[0])
H_good_5 = set(np.where(H5 >= N)[0])

del H1
del H2
del H3
del H4
del H5
# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue
    hash_cell_2 = hash(key) % max_hash2
    if hash_cell_2 not in H_good_2:
        continue
        
    hash_cell_3 = hash(key) % max_hash3
    if hash_cell_3 not in H_good_3:
        continue
        
    hash_cell_4 = hash(key) % max_hash4
    if hash_cell_4 not in H_good_4:
        continue
        
    hash_cell_5 = hash(key) % max_hash5
    if hash_cell_5 not in H_good_5:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1713 items
PCY: 1713 items with >10 occurances
It took 72.59 sec.


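Adding hash tables tightens the candidate filter: with one table 1733 candidate pairs reach the counting stage, while with two or more tables only 1713 do, which is exactly the number of frequent pairs. Each extra table here is a full-size array, so memory use while hashing grows with the number of tables rather than shrinks; the saving is in the number of pair counters kept in the counting stage.

The number of frequent pairs did not change with the number of hash tables (1713 in every case).

The running time grew from 34.27 s with one table to 80.48 s, 99.70 s, and 102.78 s with two, three, and four tables, and was 72.59 s with five. The extra hashing and set lookups per pair cost more than the 20 candidates they remove, and the five-table figure suggests some run-to-run variation in the timings.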
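
The five cells above repeat the same code with one more table each time. The sketch below shows how the number of tables could become a parameter. It is an illustration under stated assumptions: each table uses a differently seeded SHA-256 hash (a swapped-in choice; the cells above reuse `hash()` with different moduli), and `seeded_bucket`, `multihash_candidates`, and the toy `baskets` are hypothetical names.

```python
import hashlib
from itertools import combinations


def seeded_bucket(pair, seed, n_buckets):
    """Bucket index from SHA-256 of the seed plus the sorted pair."""
    data = (str(seed) + "|" + "|".join(sorted(pair))).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % n_buckets


def multihash_candidates(baskets, s, n_tables, n_buckets=1000):
    """Return the pairs that fall into a frequent bucket in every table."""
    # pass 1: fill one bucket array per table
    tables = [[0] * n_buckets for _ in range(n_tables)]
    for basket in baskets:
        for pair in combinations(basket, 2):
            for seed, table in enumerate(tables):
                table[seeded_bucket(pair, seed, n_buckets)] += 1
    bitmaps = [[n >= s for n in table] for table in tables]

    # pass 2: keep a pair only if all tables agree its bucket is frequent
    candidates = set()
    for basket in baskets:
        for pair in combinations(basket, 2):
            if all(bm[seeded_bucket(pair, seed, n_buckets)]
                   for seed, bm in enumerate(bitmaps)):
                candidates.add(frozenset(pair))
    return candidates


baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"], ["A", "B", "D"]]
for n_tables in range(1, 6):
    print(n_tables, len(multihash_candidates(baskets, 3, n_tables)))
```

Each added table is one more filter, so the candidate set can only shrink or stay the same as `n_tables` grows, while truly frequent pairs always survive.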

## 3. Find the final list of k-frequent items (k-tuples) for k=3 and 4. Experiment a bit and describe the best value for the support in each case. Warning: You can use any of the three algorithms, but be careful, because the algorithm can take too long if you don't choose it properly (well, basically don't use the naïve approach).

In [35]:
# Frequent Items for k = 3

start = time.time()

N = 10  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python


# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count

C3_items = set([a.union(b) for a in L2.keys() for b in L2.keys()]) # List comprehensions in python


# find frequent 3-tuples
C3 = {}
for key in readdata(k=3):
    # filter out non-frequent tuples
    if key not in C3_items:
        continue

    # record frequent tuples
    if key not in C3:
        C3[key] = 1
    else:
        C3[key] += 1
        
print("{} items".format(len(C3)))

# filter stage
L3 = {}
for key, count in C3.items():
    if count >= N:
        L3[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L3), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
37555 items
2377 items
A-priori: 343 items with >10 occurances
It took 107.46 sec.


In [36]:
# Frequent Items for k = 4

start = time.time()

N = 10  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python


# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count

C3_items = set([a.union(b) for a in L2.keys() for b in L2.keys()]) # List comprehensions in python


# find frequent 3-tuples
C3 = {}
for key in readdata(k=3):
    # filter out non-frequent tuples
    if key not in C3_items:
        continue

    # record frequent tuples
    if key not in C3:
        C3[key] = 1
    else:
        C3[key] += 1
        
print("{} items".format(len(C3)))

# filter stage
L3 = {}
for key, count in C3.items():
    if count >= N:
        L3[key] = count

C4_items = set([a.union(b) for a in L3.keys() for b in L3.keys()]) # List comprehensions in python


# find frequent 4-tuples
C4 = {}
for key in readdata(k=4):
    # filter out non-frequent tuples
    if key not in C4_items:
        continue

    # record frequent tuples
    if key not in C4:
        C4[key] = 1
    else:
        C4[key] += 1
        
print("{} items".format(len(C4)))

# filter stage
L4 = {}
for key, count in C4.items():
    if count >= N:
        L4[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L4), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
37555 items
2377 items
291 items
A-priori: 103 items with >10 occurances
It took 113.26 sec.


### Trying other support values

#### N = 20, k = 3

In [37]:
start = time.time()

N = 20  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python


# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count

C3_items = set([a.union(b) for a in L2.keys() for b in L2.keys()]) # List comprehensions in python


# find frequent 3-tuples
C3 = {}
for key in readdata(k=3):
    # filter out non-frequent tuples
    if key not in C3_items:
        continue

    # record frequent tuples
    if key not in C3:
        C3[key] = 1
    else:
        C3[key] += 1
        
print("{} items".format(len(C3)))

# filter stage
L3 = {}
for key, count in C3.items():
    if count >= N:
        L3[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L3), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
1886 items with >20 occurances
16809 items
153 items
A-priori: 8 items with >20 occurances
It took 22.15 sec.


#### N = 20, k = 4

In [38]:
# Frequent Items for k = 4

start = time.time()

N = 20  # frequency threshold

# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python


# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count

C3_items = set([a.union(b) for a in L2.keys() for b in L2.keys()]) # List comprehensions in python


# find frequent 3-tuples
C3 = {}
for key in readdata(k=3):
    # filter out non-frequent tuples
    if key not in C3_items:
        continue

    # record frequent tuples
    if key not in C3:
        C3[key] = 1
    else:
        C3[key] += 1
        
print("{} items".format(len(C3)))

# filter stage
L3 = {}
for key, count in C3.items():
    if count >= N:
        L3[key] = count

C4_items = set([a.union(b) for a in L3.keys() for b in L3.keys()]) # List comprehensions in python


# find frequent 4-tuples
C4 = {}
for key in readdata(k=4):
    # filter out non-frequent tuples
    if key not in C4_items:
        continue

    # record frequent tuples
    if key not in C4:
        C4[key] = 1
    else:
        C4[key] += 1
        
print("{} items".format(len(C4)))

# filter stage
L4 = {}
for key, count in C4.items():
    if count >= N:
        L4[key] = count
print('A-priori: {} items with >{} occurances'.format(len(L4), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
1886 items with >20 occurances
16809 items
153 items
0 items
A-priori: 0 items with >20 occurances
It took 19.24 sec.


The best value for the support is 10. Raising it to 20 leaves only 8 frequent 3-tuples and no frequent 4-tuples, while lowering it makes the run much slower: the time already grows from about 22 s at N = 20 to about 107 s at N = 10 for k = 3.

With a support of 10, there are 343 frequent 3-tuples and 103 frequent 4-tuples.
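
The cells above repeat the same count-and-filter pass for each k. As a minimal sketch (not the code that produced the timings above), the passes can be folded into one function so that several support values are easy to compare. It assumes the baskets fit in memory as a list of sets instead of being streamed by `readdata`; the names `apriori` and `toy_baskets` are made up for this illustration. Unlike the cells above, it checks that every (k-1)-subset of a candidate is frequent, which is a stricter filter than the union-based candidate sets.

```python
from collections import Counter
from itertools import combinations


def apriori(baskets, k_max, support):
    """Return {k: {frozenset: count}} with the frequent k-tuples for k = 1..k_max."""
    frequent = {}
    alive = None  # items that can still be part of a frequent tuple; None = all
    for k in range(1, k_max + 1):
        counts = Counter()
        for basket in baskets:
            # drop items that did not appear in any frequent (k-1)-tuple
            items = sorted(x for x in basket if alive is None or x in alive)
            for combo in combinations(items, k):
                # A-priori pruning: every (k-1)-subset must already be frequent
                if k > 1 and any(
                    frozenset(sub) not in frequent[k - 1]
                    for sub in combinations(combo, k - 1)
                ):
                    continue
                counts[frozenset(combo)] += 1
        # filter stage
        frequent[k] = {t: c for t, c in counts.items() if c >= support}
        alive = set().union(*frequent[k])
    return frequent


# hypothetical toy baskets, only to show the call pattern
toy_baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"B", "C"},
    {"A", "C", "D"},
]

for support in (2, 3):
    result = apriori(toy_baskets, k_max=3, support=support)
    print(support, {k: len(v) for k, v in result.items()})
```

On the toy data, a support of 2 keeps one frequent 3-tuple and a support of 3 keeps none, which mirrors the trade-off described above on a small scale.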

## 4. Using one of the results of the previous items, for one k (k=2 or 3) find the possible clusters using the 1-NN criteria. Comment your results.

> 1-NN means that if you have a tuple {A,B,C} and {C,E,F} then because they share one element {C}, then they belong to the same cluster  {A,B,C,E,F}.

#### k = 2, N = 10

In [39]:
start = time.time()
import numpy as np
# hash table
N = 10

max_hash1 = 10*1000000
max_hash2 = 10*1000000+1000
H1 = np.zeros((max_hash1,), dtype=int)
H2 = np.zeros((max_hash2,), dtype=int)
for key in readdata(k=2, report=False):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1
    hash_cell_2 = hash(key) % max_hash2
    H2[hash_cell_2] += 1
    
# compact hash table
H_good_1 = set(np.where(H1 >= N)[0])
H_good_2 = set(np.where(H2 >= N)[0])

del H1
del H2
# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >{} occurances'.format(len(L1), N))

C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

# find frequent 2-tuples
C2 = {}
for key in readdata(k=2):
    # hash-based filtering stage from PCY
    hash_cell_1 = hash(key) % max_hash1
    if hash_cell_1 not in H_good_1:
        continue
    hash_cell_2 = hash(key) % max_hash2
    if hash_cell_2 not in H_good_2:
        continue

    # filter out non-frequent tuples
    if key not in C2_items:
        continue

    # record frequent tuples
    if key not in C2:
        C2[key] = 1
    else:
        C2[key] += 1
        
print("{} items".format(len(C2)))

# filter stage
L2 = {}
for key, count in C2.items():
    if count >= N:
        L2[key] = count
print('PCY: {} items with >{} occurances'.format(len(L2), N))

end = time.time()
print('It took {:.2f} sec.'.format(end-start))

74315 items
4316 items with >10 occurances
1713 items
PCY: 1713 items with >10 occurances
It took 70.76 sec.


In [40]:
L2

{frozenset({'Androutsopoulos Ion', 'Pavlopoulos John'}): 13,
 frozenset({'Sumita Eiichiro', 'Utiyama Masao'}): 90,
 frozenset({'Sumita Eiichiro', 'Watanabe Taro'}): 26,
 frozenset({'Aguilar Gustavo', 'Solorio Thamar'}): 10,
 frozenset({'Lam Wai', 'Zhang Wenxuan'}): 10,
 frozenset({'Chersoni Emmanuele', 'Santus Enrico'}): 19,
 frozenset({'Rashid Ahmad', 'Rezagholizadeh Mehdi'}): 10,
 frozenset({'Sagot Beno{\\^\\i}t', "Seddah Djam{\\'e}"}): 15,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Chatterjee Rajen'}): 10,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Federmann Christian'}): 15,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Graham Yvette'}): 14,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Haddow Barry'}): 21,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Huck Matthias'}): 14,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Kocmi Tom'}): 13,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Koehn Philipp'}): 16,
 frozenset({'Bojar Ond{\\v{r}}ej', 'Monz Christof'}): 13,
 frozenset({'Chatterjee Rajen', 'Federmann Christian'}): 12,
 frozenset({'Chatterjee 

In [41]:
def find_clusters( tuples ):
    # clusterlist contains at each position either a set
    # representing an actual cluster, or an int referring
    # to another cluster that has eaten this one here.
    # the cluster id is its position within this list
    clusterlist=[]
    # clustermap maps an element to the id of the containing
    # cluster within clusterlist
    clustermap = {}

    # here we find the current cluster id for elem, by following the
    # chain within clusterlist, and replace that entire chain
    # with the new cluster id n.   We return the old cluster id.
    def set_cluster_id( elem, n ):
        if elem not in clustermap:
            return None
        k = clustermap[elem]
        # clusters may be replaced by references to other clusters,
        # we follow that chain
        while k < n and isinstance( clusterlist[k], int ):
            k1 = clusterlist[k]
            # this is optional, we make the chain shorter
            # by making this entry point directly to the current cluster
            clusterlist[k] = n
            k = k1
        return k

    for t in tuples:
        # for each tuple we create a new cluster
        thiscluster = set(t)
        n = len( clusterlist ) # the id of thiscluster
        for x in t:
            # we absorb existing clusters into the new one
            # if there is overlap
            k = set_cluster_id(x, n)
            if k is not None and k != n:
                thiscluster.update( clusterlist[k] )
                # we replace the existing cluster
                # with a reference to the new one
                clusterlist[k] = n 
            clustermap[x] = n
        clusterlist.append(thiscluster)

    return [ tuple(x) for x in clusterlist if isinstance( x, set ) ]
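
As a cross-check for the function above, the 1-NN criterion can also be read as finding connected components: every tuple links its members, and a cluster is everything reachable through shared members. The following is a small union-find sketch of that idea. `one_nn_clusters` is a hypothetical helper, not part of the original notebook, and it assumes that every tuple is non-empty.

```python
def one_nn_clusters(tuples):
    """Group items into clusters; tuples sharing at least one item are merged."""
    parent = {}

    def find(x):
        # return the representative of x's cluster, registering x if it is new
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for t in tuples:
        items = list(t)
        root = find(items[0])
        for other in items[1:]:
            # attach the other item's cluster to the first item's cluster
            parent[find(other)] = root

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# the example from the task statement, plus one unrelated pair
example = [{"A", "B", "C"}, {"C", "E", "F"}, {"X", "Y"}]
for cluster in one_nn_clusters(example):
    print(sorted(cluster))
```

The two tuples that share `C` end up in one cluster of five items, while the unrelated pair stays on its own.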

In [42]:
clusters = find_clusters(L2)

In [43]:
clusters

[('Rashid Ahmad', 'Rezagholizadeh Mehdi'),
 ('Gwinnup Jeremy', 'Erdmann Grant', 'Anderson Tim'),
 ('Wang Mingxuan', 'Li Lei', 'Zhou Hao'),
 ('Yang Hao',
  'Guo Jiaxin',
  'Wang Minghan',
  'Lei Lizhi',
  'Qin Ying',
  'Shang Hengchao',
  'Wei Daimeng'),
 ('Grundkiewicz Roman',
  'Junczys-Dowmunt Marcin',
  'Heafield Kenneth',
  'Bogoychev Nikolay'),
 ('Senellart Jean', 'Crego Josep'),
 ('Fujita Atsushi', 'Marie Benjamin'),
 ('Nagoudi El Moatez Billah', 'Abdul-Mageed Muhammad'),
 ('Lichouri Mohamed', 'Abbas Mourad'),
 ('Schulte im Walde Sabine', 'Schlechtweg Dominik'),
 ('Caines Andrew', 'Buttery Paula'),
 ('Kahane Sylvain', 'Gerdes Kim'),
 ('Foster Jennifer', 'Wagner Joachim'),
 ('Socher Richard', 'Xiong Caiming'),
 ('Opitz Juri', 'Frank Anette'),
 ('Bjerva Johannes', 'Augenstein Isabelle'),
 ('Koller Alexander', 'Groschwitz Jonas'),
 ('Illina Irina', 'Fohr Dominique'),
 ('Gonzalez-Hernandez Graciela', 'Weissenbacher Davy'),
 ('Chakravarthi Bharathi Raja', 'McCrae John P.', 'Priyadhars

In [44]:
len(clusters)

300

In [45]:
two_member = []
for x in clusters:
    if len(x) == 2:
        two_member.append(x)
        
len(two_member)

167

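Interpretation:

The frequent pairs from PCY with k = 2 and N = 10 were used as input.

There are 300 clusters in total.

167 of them have exactly 2 members. Among the frequent pairs, each of these authors is linked only to the other member of the pair: no third author reaches the threshold of 10 shared publications with either of them.

The remaining 133 clusters have more than 2 members, which shows that there are groups of authors who tend to collaborate with each other. Because of the 1-NN criterion, the members of such a cluster are connected through chains of frequent pairs, so not every pair inside a cluster is necessarily frequent itself.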
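
To go beyond the count of two-member clusters, the full distribution of cluster sizes can be tabulated. A minimal sketch, using a made-up `example_clusters` list in place of the real `clusters` variable (`cluster_size_histogram` is also a name chosen only for this illustration):

```python
from collections import Counter


def cluster_size_histogram(clusters):
    """Map cluster size -> number of clusters of that size, sorted by size."""
    return dict(sorted(Counter(len(c) for c in clusters).items()))


# hypothetical clusters, same shape as the output of find_clusters
example_clusters = [("a", "b"), ("c", "d"), ("e", "f", "g"), ("h", "i", "j", "k")]
hist = cluster_size_histogram(example_clusters)
print(hist)  # {2: 2, 3: 1, 4: 1}
```

Calling the same function on `clusters` would show how many of the larger groups are triples and how many are bigger collaboration networks.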