# Project 1: Mining information from Text Data 
<hr>

## Task 2: Mining information from Text Data 

Using the whole anthology abstracts dataset, extract a list of the authors and editors per publication, create baskets, and search for similar items. For example:

- basket 1: Mostafazadeh Davani Aida, Kiela Douwe, Lambert Mathias, Vidgen Bertie, Prabhakaran Vinodkumar, Waseem Zeerak
- basket 2: Singh Sumer, Li Sheng

1. Find the frequent pair of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each of these compare the time of execution and results for supports s=10, 50, 100. Comment your results. 

2. For the PCY algorithm, create up to 5 compact hash tables. What is  the difference in results and time of execution for 1,2,3,4 and 5 tables? Comment your results.

3. Find the final list of k-frequent items (k-tuples) for k=3 and 4. Experiment a bit and describe the best value for the support in each case. *Warning*: You can use any of the three algorithms, but be careful, because the algorithm can take too long if you don't choose it properly (well, basically don't use the naïve approach ;)).

4. Using one of the results of the previous items, for one k (k=2 or 3) find the possible clusters using the 1-NN criteria. Comment your results.

> 1-NN means that if you have a tuple {A,B,C} and {C,E,F} then because they share one element {C}, then they belong to the same cluster  {A,B,C,E,F}.

0. Import libraries

In [22]:
from urllib.request import urlopen
from io import BytesIO
from time import time

import pandas as pd
import gzip
import numpy as np
import itertools
import os

1. Download the data files

In [2]:
url  = 'https://aclanthology.org/anthology+abstracts.bib.gz'   # url where the file is stored
filename = 'anthology+abstracts.bib'                           # bib filename
folder   = 'data'                                              # folder name
minimum = 200                                                  # minimum number of words in the abstract to be considered

# Create the path to store the files
if not os.path.exists(folder):
    os.makedirs(folder)

file = folder + '/' + filename

# Download the file if it doesn't exist locally
if not os.path.exists(file):
    print("Downloading " + url + " to /" + folder + "...")
    with gzip.open(BytesIO(urlopen(url).read()), 'rb') as fb:
        with open(file, 'wb') as f:
            f.write(fb.read())
else:
    print("File " + filename + " already available in folder /" + folder)

File anthology+abstracts.bib already available in folder /data


2. Load data

In [3]:
# Read authors / editors from the bib file

elements = []
with open(file, 'r',errors='ignore') as f:
    string = ''
    found = False
    # skip all lines until author/editor
    for line in f:
      if found:
        if '=' in line:                                        
          elements.append(string)
          string = ''
          found = False
        else:
          string = string + line
      if 'author = "' in line:                                 
        found = True
        string = string + line       
      if 'editor = "' in line:                                 
        found = True
        string = string + line        

3. Save list of authors/editors in a file

In [4]:
# Preprocess and clean the data and save it to the file ./data/authors.txt
import re

authors_fname = './data/authors.txt'
baskets = []
for e in elements:
    # Remove only the standalone separator "and"; a plain replace("and", "")
    # would also strip it from inside names such as "Alexander".
    new_string = re.sub(r"\band\b", "", e)
    new_string = new_string.replace("\n", "")
    new_string = new_string.replace("    editor = ", "")
    new_string = new_string.replace("    author = ", "")
    new_string = new_string.replace(',', "")
    new_string = new_string.replace('"', "")
    new_string = new_string.replace('        ', ",")
    baskets.append(new_string)

# Use a name other than `file`, which already holds the path of the bib file
with open(authors_fname, 'w') as out:
    for s in baskets:
        print(s, file=out)

nitems = 0
with open(authors_fname, "rt", encoding='latin1') as f:
    for line in f:
        C_k  = line.rstrip().split(',')
        nitems = nitems + len(C_k)

nbaskets = len(baskets)
print('Number of baskets:', nbaskets)
print('Number of items:', nitems)


Number of baskets: 70449
Number of items: 217901


<hr>

#### 1. Find the frequent pair of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each of these compare the time of execution and results for supports s=10, 50, 100. Comment your results. 

### Naïve approach

In [5]:
# Read the data and yield each k-item combination of a basket as a frozenset
def readdata(k, fname=authors_fname):
    with open(fname, "rt", encoding='latin1') as f:
        for line in f:
            C_k = line.rstrip().split(',')
            for itemset in itertools.combinations(C_k, k):
                yield frozenset(itemset)

In [6]:
nitems = 10
for C_k in readdata(k=2):
    print(C_k)    
    nitems -= 1
    if nitems == 0: 
        break    

frozenset({'Kiela Douwe', 'Mostafazadeh Davani Aida'})
frozenset({'Lambert Mathias', 'Mostafazadeh Davani Aida'})
frozenset({'Mostafazadeh Davani Aida', 'Vidgen Bertie'})
frozenset({'Prabhakaran Vinodkumar', 'Mostafazadeh Davani Aida'})
frozenset({'Mostafazadeh Davani Aida', 'Waseem Zeerak'})
frozenset({'Kiela Douwe', 'Lambert Mathias'})
frozenset({'Kiela Douwe', 'Vidgen Bertie'})
frozenset({'Prabhakaran Vinodkumar', 'Kiela Douwe'})
frozenset({'Kiela Douwe', 'Waseem Zeerak'})
frozenset({'Lambert Mathias', 'Vidgen Bertie'})


In [7]:
def get_C(k):
    start = time()
    C = {}
    for key in readdata(k):
        if key not in C:
            C[key] = 1
        else:
            C[key] += 1
    print("Took {}s for k={}".format((time() - start), k))
    return C

In [8]:
C1 = get_C(1)
print("C1 contains {} items".format(len(C1)))
C2 = get_C(2)
print("C2 contains {} items".format(len(C2)))

Took 0.2580387592315674s for k=1
C1 contains 61389 items
Took 0.7174451351165771s for k=2
C2 contains 247358 items


In [9]:
for (ck, n), _ in zip(C2.items(), range(5)):
    print(ck,n)

frozenset({'Kiela Douwe', 'Mostafazadeh Davani Aida'}) 2
frozenset({'Lambert Mathias', 'Mostafazadeh Davani Aida'}) 1
frozenset({'Mostafazadeh Davani Aida', 'Vidgen Bertie'}) 2
frozenset({'Prabhakaran Vinodkumar', 'Mostafazadeh Davani Aida'}) 3
frozenset({'Mostafazadeh Davani Aida', 'Waseem Zeerak'}) 2


In [11]:
def association_rules(nitems, C1, C2, L2):
    # nitems is the total number of baskets; L2 is a list of frequent pairs
    for i in range(len(L2)):
        A, B = L2[i]
        support_AB = C2[frozenset([A, B])]
        support_A = C1[frozenset([A])]
        conf_A_leads_to_B = support_AB / support_A    # confidence: P(B | A)

        support_B = C1[frozenset([B])]
        prob_B = support_B / nitems                    # baseline: P(B)

        # interest: how much A raises the probability of B above its baseline
        interest_A_leads_to_B = conf_A_leads_to_B - prob_B

        if interest_A_leads_to_B > 0.7:
            print("{} --> {} with interest {:3f}".format(A, B, interest_A_leads_to_B))
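
The confidence and interest used above can be checked by hand on a toy case. The following is a minimal sketch with made-up counts; the helper name `confidence_and_interest` and the numbers are hypothetical and do not come from the dataset.

```python
def confidence_and_interest(support_ab, support_a, support_b, n_baskets):
    """Return (confidence, interest) of the rule A --> B."""
    confidence = support_ab / support_a      # P(B | A)
    prob_b = support_b / n_baskets           # P(B)
    return confidence, confidence - prob_b

# Hypothetical counts: A appears in 10 baskets, B in 20, both together in 8,
# out of 100 baskets in total.
conf, interest = confidence_and_interest(
    support_ab=8, support_a=10, support_b=20, n_baskets=100
)
print(conf, interest)   # roughly 0.8 and 0.6
```

A rule only passes the 0.7 threshold when B almost always accompanies A and B is rare overall, which is why the rules printed for this dataset mostly involve tight groups of co-authors.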

In [26]:
%time
supports = [10, 50, 100]
naive = []
L2s = []
for s in supports:
    t = time()
    L2 = {}
    for key, n in C2.items():
        if n >= s:
            L2[key] = n    

    L2 = [elem for elem in L2 if len(elem) > 1]  # keep true pairs only; a name repeated within a basket yields a 1-element frozenset
    L2s.append(L2)
    t2 = round(time() - t,3)
    naive.append(str(len(L2)) + ' items with > ' + str(s) + ' occurrences in ' + str(t2) + 's')
    print("\nSupport: {}".format(s))
    association_rules(sum(1 for line in open(authors_fname)), C1, C2, L2)    

Wall time: 0 ns

Support: 10
Utiyama Masao --> Sumita Eiichiro with interest 0.740508
Rashid Ahmad --> Rezagholizadeh Mehdi with interest 0.999773
Chatterjee Rajen --> Turchi Marco with interest 0.722761
Negri Matteo --> Turchi Marco with interest 0.751200
Escolano Carlos --> Fonollosa Jos{\'e} A. R. with interest 0.832680
Wang Mingxuan --> Li Lei with interest 0.749063
Wang Xing --> Tu Zhaopeng with interest 0.903896
Wei Daimeng --> Shang Hengchao with interest 0.908935
Wei Daimeng --> Wang Minghan with interest 0.908864
Wei Daimeng --> Lei Lizhi with interest 0.908949
Wei Daimeng --> Yang Hao with interest 0.999716
Wei Daimeng --> Qin Ying with interest 0.908835
Shang Hengchao --> Wang Minghan with interest 0.908864
Shang Hengchao --> Lei Lizhi with interest 0.908949
Shang Hengchao --> Yang Hao with interest 0.999716
Guo Jiaxin --> Yang Hao with interest 0.999716
Yang Hao --> Wang Minghan with interest 0.749773
Lei Lizhi --> Yang Hao with interest 0.999716
Dong Li --> Wei Furu with i

In [14]:
print(naive)

['1705 items with > 10 occurrences in 0.037s', '12 items with > 50 occurrences in 0.039s', '0 items with > 100 occurrences in 0.037s']


### A-priori algorithm

In [36]:
supports = [10, 50, 100]
apriori = []
L2s = []
for s in supports:

    print('Threshold = ', s)
    t = time()
    C1 = get_C(1)
    print("C1 contains {} items".format(len(C1)))
    
    # filter stage
    L1 = {}
    for key, count in C1.items():
        if count >= s:
            L1[key] = count
    
    print('{} of those items with > {} occurrences'.format(len(L1),s))

    C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python
    
    # find frequent 2-tuples
    C2 = {}
    for key in readdata(k=2):
        # filter out non-frequent tuples
        if key not in C2_items:
            continue

        # record frequent tuples
        if key not in C2:
            C2[key] = 1
        else:
            C2[key] += 1
    
    print("C2 contains {} items".format(len(C2)))

    # Save output to file for Task 3    
    filename = 'authors_out' + str(s) + '.txt'   
    if os.path.exists(filename):
        os.remove(filename) 
    
    # filter stage
    L2 = {}    
    for key, count in C2.items():
        if count >= s:
            L2[key] = count        
            with open(filename, 'a') as fi:
                print(key, count, file=fi)    
    #print(L2)
    t2 = round(time() - t,3)
    print('A-priori: {} items with >{} occurrences\n'.format(len(L2), s))    
    apriori.append(str(len(L2)) + ' items with > ' + str(s) + ' occurrences in ' + str(t2) + 's')
    L2s.append(L2) 


Threshold =  10
Took 0.2825186252593994s for k=1
C1 contains 61389 items
4154 of those items with > 10 occurrences
C2 contains 37361 items
A-priori: 1705 items with >10 occurrences

Threshold =  50
Took 0.2680206298828125s for k=1
C1 contains 61389 items
450 of those items with > 50 occurrences
C2 contains 3506 items
A-priori: 12 items with >50 occurrences

Threshold =  100
Took 0.3299996852874756s for k=1
C1 contains 61389 items
95 of those items with > 100 occurrences
C2 contains 436 items
A-priori: 0 items with >100 occurrences



In [37]:
print(apriori)

['1705 items with > 10 occurrences in 47.02s', '12 items with > 50 occurrences in 3.675s', '0 items with > 100 occurrences in 0.715s']
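
Much of the 47 s at s=10 is probably spent building `C2_items`, which enumerates every union of two frequent items (4154² is about 17 million). One possible alternative, sketched here on toy baskets, is to drop infrequent items from each basket before forming pairs, so no candidate set has to be materialized. The function name `count_frequent_pairs` and the toy data are assumptions for illustration; unlike `readdata`, this sketch also deduplicates items within a basket.

```python
from collections import Counter
from itertools import combinations

def count_frequent_pairs(baskets, s):
    """A-priori for pairs: count only pairs whose two items are both frequent."""
    # Pass 1: count single items and keep those with support >= s
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count candidate pairs built only from frequent items
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[frozenset(pair)] += 1

    # Filter stage: keep pairs with support >= s
    return {pair: n for pair, n in pair_counts.items() if n >= s}

toy_baskets = [["A", "B", "C"], ["A", "B"], ["A", "B", "D"], ["C", "D"]]
print(count_frequent_pairs(toy_baskets, s=2))
```

On the toy data only the pair {A, B} reaches a support of 2. The per-basket intersection costs time proportional to the basket size, independent of how many frequent items exist overall.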


### PCY algorithm

In [38]:
# Hash table
max_hash1 = 10 * 1000000
H1 = np.zeros((max_hash1, ), dtype=np.int64)

for key in readdata(k=2):
    hash_cell_1 = hash(key) % max_hash1
    H1[hash_cell_1] += 1

In [39]:
supports = [10, 50, 100]
pcy = []
L2s = []
for s in supports:

    print('Threshold = ', s)
    t = time()
    C1 = get_C(1)
    print("C1 contains {} items".format(len(C1)))
    
    L1 = {}
    for key, count in C1.items():
        if count >= s:
            L1[key] = count
    
    print('{} of those items with > {} occurrences'.format(len(L1),s))

    C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python  

    # find frequent 2-tuples
    C2 = {}
    N = 10
    for key in readdata(k=2):
    
        # hash-based filtering stage from PCY
        hash_cell_1 = hash(key) % max_hash1
        if H1[hash_cell_1] < s:
            continue

        # filter out non-frequent tuples
        if key not in C2_items:
            continue

        # record frequent tuples
        if key not in C2:
            C2[key] = 1
        else:
            C2[key] += 1
        
    print("C2 contains {} items".format(len(C2)))

    # filter stage
    L2 = {}    
    for key, count in C2.items():
        if count >= s:
            L2[key] = count
            
    t2 = round(time() - t,3)
    print('PCY: {} items with >{} occurrences\n'.format(len(L2), s))
    pcy.append(str(len(L2)) + ' items with > ' + str(s) + ' occurrences in ' + str(t2) + 's')    
    L2s.append(L2) 

Threshold =  10
Took 0.2849905490875244s for k=1
C1 contains 61389 items
4154 of those items with > 10 occurrences
C2 contains 1732 items
PCY: 1705 items with >10 occurrences

Threshold =  50
Took 1.3420419692993164s for k=1
C1 contains 61389 items
450 of those items with > 50 occurrences
C2 contains 12 items
PCY: 12 items with >50 occurrences

Threshold =  100
Took 0.3149707317352295s for k=1
C1 contains 61389 items
95 of those items with > 100 occurrences
C2 contains 0 items
PCY: 0 items with >100 occurrences



In [40]:
print('Naïve:', naive)
print('A-priori:', apriori)
print('PCY:', pcy)

Naïve: ['1705 items with > 10 occurrences in 0.044s', '12 items with > 50 occurrences in 0.044s', '0 items with > 100 occurrences in 0.043s']
A-priori: ['1705 items with > 10 occurrences in 47.02s', '12 items with > 50 occurrences in 3.675s', '0 items with > 100 occurrences in 0.715s']
PCY: ['1705 items with > 10 occurrences in 47.092s', '12 items with > 50 occurrences in 4.926s', '0 items with > 100 occurrences in 0.99s']


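All three methods return the same frequent pairs (1705 at s=10, 12 at s=50 and none at s=100); what changes is how many candidate pairs have to be counted. At s=10 the naïve approach counts all 247358 distinct pairs, A-priori counts 37361 and PCY only 1732, so A-priori and especially PCY need far less memory for the counts. The timings are not directly comparable: the naïve figure covers only the final filtering step (counting all pairs took about 0.7 s in `get_C(2)`), the PCY hash table `H1` is built outside the timed loop, and the A-priori and PCY figures include building `C2_items` from every combination of frequent items, which most likely dominates at s=10.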
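
The PCY filtering can be sketched end to end on toy baskets. This is a minimal illustration, not the notebook's implementation: `pcy_pairs`, `pair_bucket` and the table size of 1009 are hypothetical, and `zlib.crc32` replaces the built-in `hash()` because string hashes in Python are salted per process unless `PYTHONHASHSEED` is set, so bucket assignments would otherwise differ between sessions.

```python
from collections import Counter
from itertools import combinations
import zlib

def pair_bucket(pair, n_buckets):
    """Deterministic bucket index for a pair of item names."""
    return zlib.crc32("|".join(sorted(pair)).encode("utf-8")) % n_buckets

def pcy_pairs(baskets, s, n_buckets=1009):
    # Pass 1: count items and hash every pair into a bucket counter
    item_counts = Counter()
    buckets = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            buckets[pair_bucket(pair, n_buckets)] += 1
    frequent_items = {i for i, n in item_counts.items() if n >= s}
    bitmap = [n >= s for n in buckets]   # compact summary of frequent buckets

    # Pass 2: count a pair only if both items and its bucket are frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[pair_bucket(pair, n_buckets)]:
                pair_counts[frozenset(pair)] += 1

    frequent = {p: n for p, n in pair_counts.items() if n >= s}
    return frequent, len(pair_counts)    # frequent pairs, number of candidates

toy_baskets = [["A", "B", "C"], ["A", "B"], ["A", "B", "D"], ["C", "D"]]
frequent, n_candidates = pcy_pairs(toy_baskets, s=2)
print(frequent, n_candidates)
```

The second return value plays the role of `len(C2)` in the cells above: it is the number of pairs that survive both filters and therefore need a counter.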

<hr>

#### 2. For the PCY algorithm, create up to 5 compact hash tables. What is  the difference in results and time of execution for 1,2,3,4 and 5 tables? Comment your results.

In [59]:
supports = [10, 50, 100]

# Define the hash table sizes
max_hash1 = 5*1000000-673
max_hash2 = 5*1000000+673
max_hash3 = 5*1000000-1673
max_hash4 = 5*1000000+1673
max_hash5 = 5*1000000+11673

for s in supports:

    t = time()
    print('Threshold = {}\n'.format(s))

    H1 = np.zeros((max_hash1,), dtype=int)
    H2 = np.zeros((max_hash2,), dtype=int)
    H3 = np.zeros((max_hash3,), dtype=int)
    H4 = np.zeros((max_hash4,), dtype=int)
    H5 = np.zeros((max_hash5,), dtype=int)

    for key in readdata(k=2):
        hash_cell_1 = hash(key) % max_hash1
        H1[hash_cell_1] += 1
        hash_cell_2 = hash(key) % max_hash2
        H2[hash_cell_2] += 1
        hash_cell_3 = hash(key) % max_hash3
        H3[hash_cell_3] += 1
        hash_cell_4 = hash(key) % max_hash4
        H4[hash_cell_4] += 1
        hash_cell_5 = hash(key) % max_hash5
        H5[hash_cell_5] += 1

    # compact hash table
    H_good_1 = set(np.where(H1 >= s)[0])
    H_good_2 = set(np.where(H2 >= s)[0])
    H_good_3 = set(np.where(H3 >= s)[0])
    H_good_4 = set(np.where(H4 >= s)[0])
    H_good_5 = set(np.where(H5 >= s)[0])

    del H1
    del H2
    del H3
    del H4
    del H5

    # find frequent 1-tuples (individual items)
    C1 = {}
    for key in readdata(k=1):
        if key not in C1:
            C1[key] = 1
        else:
            C1[key] += 1    

    # filter stage
    L1 = {}
    for key, count in C1.items():
        if count >= s:
            L1[key] = count

    C2_items = set([a.union(b) for a in L1.keys() for b in L1.keys()]) # List comprehensions in python

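    # find frequent 2-tuples
    # NOTE: C2 is created once, outside the loop over i, so counts accumulate
    # from one pass to the next and the times printed below are cumulative.
    C2 = {}

    for i in range(1, 6):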
        for key in readdata(k=2):
            # hash-based filtering stage from PCY
            if i >= 1:
                hash_cell_1 = hash(key) % max_hash1
                if hash_cell_1 not in H_good_1:
                    continue
            if i >= 2:
                hash_cell_2 = hash(key) % max_hash2
                if hash_cell_2 not in H_good_2:
                    continue
            if i >= 3:
                hash_cell_3 = hash(key) % max_hash3
                if hash_cell_3 not in H_good_3:
                    continue
            if i >= 4:
                hash_cell_4 = hash(key) % max_hash4
                if hash_cell_4 not in H_good_4:
                    continue
            if i >= 5:
                hash_cell_5 = hash(key) % max_hash5
                if hash_cell_5 not in H_good_5:
                    continue
                    
            # filter out non-frequent tuples
            if key not in C2_items:
                continue

            # record frequent tuples
            if key not in C2:
                C2[key] = 1
            else:
                C2[key] += 1
        
        # filter stage
        L2 = {}
        for key, count in C2.items():
            if count >= s:
                L2[key] = count
        t2 = round(time() - t,3)

        print('{} items with > {} occurrences found in {} using {} hashing functions'.format(len(L2), s, t2, i))

Threshold = 10

1705 items with > 10 occurances found in 274.353 using 1 hashing functions
1706 items with > 10 occurances found in 274.897 using 2 hashing functions
1706 items with > 10 occurances found in 275.379 using 3 hashing functions
1706 items with > 10 occurances found in 275.898 using 4 hashing functions
1706 items with > 10 occurances found in 276.436 using 5 hashing functions
Threshold = 50

12 items with > 50 occurances found in 7.11 using 1 hashing functions
12 items with > 50 occurances found in 7.495 using 2 hashing functions
12 items with > 50 occurances found in 7.873 using 3 hashing functions
12 items with > 50 occurances found in 8.253 using 4 hashing functions
12 items with > 50 occurances found in 8.631 using 5 hashing functions
Threshold = 100

0 items with > 100 occurances found in 2.685 using 1 hashing functions
0 items with > 100 occurances found in 3.062 using 2 hashing functions
0 items with > 100 occurances found in 3.437 using 3 hashing functions
0 items w

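Adding hash functions does not reduce the execution time here. The reported times are cumulative (the timer starts once per threshold and includes building the hash tables and `C2_items`), so each additional pass adds only about half a second on top of the shared setup cost. The result is also essentially unchanged, which is expected: a frequent pair always lands in frequent buckets, so extra tables can only remove infrequent candidates, never frequent pairs. The extra pair reported at s=10 with two or more tables (1706 instead of 1705) is most likely an artifact of `C2` being shared across the passes, so that counts accumulate, rather than a genuinely frequent pair. In general, when a fixed amount of memory is split among several hash tables, each table gets smaller and suffers more collisions; with too many tables the filtering of infrequent pairs becomes inefficient.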
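
A variant that makes the comparison between 1 to 5 tables independent is sketched below on toy baskets: each run gets fresh counters and splits a fixed bucket budget among its tables. The names `multihash_pairs` and `bucket`, the salted `zlib.crc32` hash functions and the budget of 5000 buckets are assumptions for illustration, not the notebook's code.

```python
from collections import Counter
from itertools import combinations
import zlib

def bucket(pair, salt, n_buckets):
    """Bucket index of a pair under hash function number `salt`."""
    data = (str(salt) + "|" + "|".join(sorted(pair))).encode("utf-8")
    return zlib.crc32(data) % n_buckets

def multihash_pairs(baskets, s, n_tables, total_buckets=5000):
    n_buckets = total_buckets // n_tables      # fixed memory split among tables
    tables = [[0] * n_buckets for _ in range(n_tables)]
    item_counts = Counter()

    # Pass 1: count items and fill every hash table
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            for t in range(n_tables):
                tables[t][bucket(pair, t, n_buckets)] += 1
    frequent_items = {i for i, n in item_counts.items() if n >= s}

    # Pass 2: fresh counters; a pair must hit a frequent bucket in every table
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            if all(tables[t][bucket(pair, t, n_buckets)] >= s
                   for t in range(n_tables)):
                pair_counts[frozenset(pair)] += 1
    return {p: n for p, n in pair_counts.items() if n >= s}

toy_baskets = [["A", "B", "C"], ["A", "B"], ["A", "B", "D"], ["C", "D"]]
for n_tables in range(1, 6):
    print(n_tables, multihash_pairs(toy_baskets, s=2, n_tables=n_tables))
```

Because the counters are rebuilt for every value of `n_tables`, the set of frequent pairs is the same for any number of tables; only the number of surviving candidates and the running time can differ.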

<hr>

#### 3. Find the final list of k-frequent items (k-tuples) for k=3 and 4. Experiment a bit and describe the best value for the support in each case. *Warning*: You can use any of the three algorithms, but be careful, because the algorithm can take too long if you don't choose it properly (well, basically don't use the naïve approach ;)).

In [82]:
supports = [10, 20, 30]
k_tuples = [3, 4]

for s in supports:
    for k in k_tuples:

        t = time()

        # find frequent 1-tuples (individual items)
        C1 = {}
        for key in readdata(k=1):
            if key not in C1:
                C1[key] = 1
            else:
                C1[key] += 1    
            
        # filter stage
        L1 = {}
        for key, count in C1.items():
            if count >= s:
                L1[key] = count
            
        # find frequent 2-tuples    
        C2 = {}
        for key in readdata(k=2):
            # count every 2-tuple (no candidate filtering at this stage)
            if key not in C2:
                C2[key] = 1
            else:
                C2[key] += 1

        # filter stage
        L2 = {}
        for key, count in C2.items():
            if count >= s:
                L2[key] = count

        if k == 3:

            C3_items = set([a.union(b) for a in L2.keys() for b in L2.keys() ]) # List comprehensions in python

            # Hash table
            max_hash1 = 10 * 1000000
            H1 = np.zeros((max_hash1, ), dtype=int)

            for key in readdata(k=3):
                hash_cell_1 = hash(key) % max_hash1
                H1[hash_cell_1] += 1
    
            # find frequent 3-tuples
            C3 = {}
            for key in readdata(k=3):
                # hash-based filtering stage from PCY
                hash_cell_1 = hash(key) % max_hash1
                if H1[hash_cell_1] < s:
                    continue
    
                # filter out non-frequent tuples
                if key not in C3_items:
                    continue

                # record frequent tuples
                if key not in C3:
                    C3[key] = 1
                else:
                    C3[key] += 1

            # filter stage
            L3 = {}
            for key, count in C3.items():
                if count >= s:
                    L3[key] = count
            t2 = round(time() - t,3)

            print('{} {}-tuples items with > {} occurrences found in {}'.format(len(L3), k, s, t2))    

        else:
            # find frequent 3-tuples
            C3 = {}
            for key in readdata(k=3):            
                # record frequent tuples
                if key not in C3:
                    C3[key] = 1
                else:
                    C3[key] += 1

            # filter stage
            L3 = {}
            for key, count in C3.items():
                if count >= s:
                    L3[key] = count
    
            C4_items = set([a.union(b) for a in L3.keys() for b in L3.keys() ]) # List comprehensions in python  

            # hash table
            max_hash1 = 10 * 1000000
            H1 = np.zeros((max_hash1, ), dtype=int)

            for key in readdata(k=4):
                hash_cell_1 = hash(key) % max_hash1
                H1[hash_cell_1] += 1
            
            # find frequent 4-tuples
            C4 = {}

            for key in readdata(k=4):
                # hash-based filtering stage from PCY
                hash_cell_1 = hash(key) % max_hash1
                if H1[hash_cell_1] < s:
                    continue
            
                # filter out non-frequent tuples
                if key not in C4_items:
                    continue

                # record frequent tuples
                if key not in C4:
                    C4[key] = 1
                else:
                    C4[key] += 1

            # filter stage
            L4 = {}
            for key, count in C4.items():
                if count >= s:
                    L4[key] = count
            t2 = round(time() - t,3)    
            print('{} {}-tuples items with > {} occurrences found in {}\n'.format(len(L4), k, s, t2))      
                      
            # saving result for next exercise
            if s == 10:
                result = L4   

342 3-tuples items with > 10 occurrences found in 11.655
103 4-tuples items with > 10 occurrences found in 13.99

8 3-tuples items with > 20 occurrences found in 3.309
0 4-tuples items with > 20 occurrences found in 11.333

1 3-tuples items with > 30 occurrences found in 3.264
0 4-tuples items with > 30 occurrences found in 11.325
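
The candidate sets `C3_items` and `C4_items` above are unions of two frequent smaller tuples, so they also admit a triple such as {A, B} ∪ {B, C} when {A, C} is infrequent. A stricter test, based on the fact that every subset of a frequent itemset must itself be frequent, can be sketched as follows. The helper `is_candidate` and the toy set `L2_toy` are hypothetical.

```python
from itertools import combinations

def is_candidate(itemset, frequent_smaller):
    """A k-itemset can be frequent only if all its (k-1)-subsets are frequent."""
    k = len(itemset)
    return all(frozenset(sub) in frequent_smaller
               for sub in combinations(itemset, k - 1))

# Toy frequent pairs: {A,B}, {A,C}, {B,C} and {C,D}
L2_toy = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]}

print(is_candidate(frozenset("ABC"), L2_toy))   # True: AB, AC and BC are all frequent
print(is_candidate(frozenset("ACD"), L2_toy))   # False: AD is not frequent
```

The check costs k set lookups per tuple and needs no precomputed candidate set, which matters when the number of frequent (k-1)-tuples is large.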



<hr>

#### 4. Using one of the results of the previous items, for one k (k=2 or 3) find the possible clusters using the 1-NN criteria. Comment your results.
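
One way to apply the 1-NN criterion is a union-find structure: every tuple links its elements together, so two tuples that share an element end up in the same cluster. The sketch below runs on toy tuples; `cluster_tuples` is a hypothetical helper, and in the notebook it could presumably be called on the keys of a frequent-itemset dictionary such as `L2` or `L3`.

```python
def cluster_tuples(tuples):
    """Merge tuples that share at least one element (1-NN criterion)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every element of a tuple to the tuple's first element
    for t in tuples:
        items = list(t)
        for item in items:
            union(items[0], item)

    # Group elements by the root of their component
    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())

toy_tuples = [{"A", "B", "C"}, {"C", "E", "F"}, {"X", "Y"}]
for cluster in cluster_tuples(toy_tuples):
    print(sorted(cluster))
```

On the toy input, the first two tuples share C and merge into {A, B, C, E, F}, while {X, Y} stays on its own, matching the example given in the task statement.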