## 1. Introduction

This project explores and analyzes the information stored in the ACL Anthology dataset (https://aclanthology.org/). We apply different techniques for obtaining valuable information from it.

### Task 1: Finding Similar Items
 
Randomly select 1000 abstracts from the whole dataset. Find the similar items using pairwise Jaccard similarities, MinHash and LSH (vectorized versions).

1. Compare the performance in time and the results for k-shingles = 3, 5 and 10, for the three methods and similarity thresholds s=0.1 and 0.2. Use 50 hashing functions. Comment on your results.

2. Compare the results obtained for MinHash and LSH for different similarity thresholds s = 0.1, 0.2 and 0.25 and 50, 100 and 200 hashing functions. Comment on your results.

3. For MinHashing using 100 hashing functions and s = 0.1 and 0.2, find the Jaccard distances (1 - Jaccard similarity) for all possible pairs. Use the obtained values within a k-NN algorithm, and for k = 1, 3 and 5 identify the clusters with similar abstracts for each s. Describe the obtained clusters: are they different? Randomly select at least 5 abstracts per cluster; upon visual inspection, what are the main topics?
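
As a quick illustration of the quantities these tasks refer to, the sketch below builds character shingles for two toy strings and computes their Jaccard similarity and Jaccard distance. The helper names `char_shingles` and `jaccard` are hypothetical and only used here; the notebook's own functions hash the shingles with CRC32 first, which this sketch skips for readability.

```python
def char_shingles(text, k):
    """Return the set of all k-character substrings (shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = char_shingles("abcd", 2)   # {'ab', 'bc', 'cd'}
b = char_shingles("bcde", 2)   # {'bc', 'cd', 'de'}
similarity = jaccard(a, b)     # 2 shared shingles out of 4 distinct ones -> 0.5
distance = 1 - similarity      # Jaccard distance -> 0.5
```

Longer shingles make a match less likely, which is why the number of similar pairs drops quickly as k grows.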

In [1]:
# import libraries

from io import BytesIO
from zipfile import ZipFile
from urllib import request
import binascii
import gzip
import itertools
import os
import re
import shutil
import time  # used as time.time() throughout the notebook

import numpy
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
from sklearn.neighbors import NearestNeighbors

%matplotlib inline

In [2]:
# Download and extract dataset

url1 = "https://aclanthology.org/anthology+abstracts.bib.gz"
file_name1 = re.split(pattern='/', string=url1)[-1]
r1 = request.urlretrieve(url=url1, filename=file_name1)
txt1 = re.split(pattern=r'\.', string=file_name1)[0] + ".txt"

# Extract it
with gzip.open(file_name1, 'rb') as f_in:
    with open(txt1, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

fname = txt1        

In [3]:
# Various Functions
#=================================================#
#=================================================#
#=================================================#

# Abstract extraction function
# fname is the name of the file containing all abstracts
# n is the number of abstracts to extract
def read_abstracts(fname, n):
    abstracts = []  # avoid shadowing the built-in abs()
    with open(fname, 'r', encoding="utf-8") as f:
        # scan the file line by line, keeping only the abstract fields
        for line in f:
            if "abstract =" in line:
                # the abstract is the text between the first pair of double quotes
                abstract = line.split('"')[1]
                if len(abstract) >= 5:  # skip empty or near-empty abstracts
                    abstracts.append(abstract)
                if len(abstracts) == n:  # stop once n abstracts are collected
                    return abstracts

    return abstracts
#=================================================
    
# Shingle function
# k is the shingle length (number of characters per shingle)

def get_shingles(abstract, k):
    """Return the set of hashed k-character shingles of an abstract."""
    L = len(abstract)
    shingles = set()  # we use a set to automatically eliminate duplicates
    for i in range(L-k+1):
        shingle = abstract[i:i+k]
        crc = binascii.crc32(shingle.encode('utf-8')) # hash the shingle to a 32-bit integer
        shingles.add(crc)
    return shingles
#=================================================

# jaccard similarity score Function
def jaccard_similarity_score(x, y, errors='ignore'):
    """
    Jaccard Similarity J (A,B) = | Intersection (A,B) | /
                                    | Union (A,B) |
    """
    intersection_cardinality = len(set(x).intersection(set(y)))
    union_cardinality = len(set(x).union(set(y)))
    if float(union_cardinality) == 0:
        ja = 0
    else:
        ja = intersection_cardinality / float(union_cardinality)
    return ja
#=================================================

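# Pairwise similarity function
# k is the shingle length and s is the similarity threshold
# abstract_list is the list of 1000 abstracts
# Note: the shingles of both abstracts are recomputed for every pair, which adds considerably to the run time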

def similar_items(abstract_list, k, s):
    candidates = []
    #abstract_list = read_abstracts(fname,n)
    for pair in itertools.combinations(abstract_list,2):
        js = jaccard_similarity_score(get_shingles(pair[0], k),get_shingles(pair[1], k))
        
        if js > s:
            #print(pair)
            candidates.append(pair)
            
    return candidates
#=================================================


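# Fast implementation of the MinHash algorithm:
# computes all random hash functions for a shingle at once using vector operations
# and keeps the element-wise minimum of the two vectors in place.
# Note: with int64 parameters, A*ShingleID can exceed the int64 range and wrap around silently.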
def minhash_vectorized(shingles, A, B, nextPrime, maxShingleID, nsig):
    signature = numpy.ones((nsig,)) * (maxShingleID + 1)

    for ShingleID in shingles:
        hashCodes = ((A*ShingleID + B) % nextPrime) % maxShingleID
        numpy.minimum(signature, hashCodes, out=signature)

    return signature
#=================================================

# Candidate pair function (MinHash)
# Note: reads the global A, B, nextPrime, maxShingleID and nsig
def candidate_pair(abstract_list, k, s):
    signatures = []  # signatures for all abstracts
    for abstract in abstract_list:
        shingles = get_shingles(abstract, k)
        signature = minhash_vectorized(shingles, A, B, nextPrime, maxShingleID, nsig)
        signatures.append(signature)

    Nfiles = len(signatures)
    candidates = []
    for i in range(Nfiles):
        for j in range(i+1, Nfiles):
            # fraction of matching signature entries, an estimate of the Jaccard similarity
            Jsim = numpy.mean(signatures[i] == signatures[j])
            if Jsim >= s:
                candidates.append((i,j))

    return len(candidates)  # number of candidate pairs
#=================================================

# Modified function for Jaccard similarity:
# looks up the shingle sets of abstracts a and b by index

def jaccard_similarity_score_mod2(a, b, shingles_list, errors='ignore'):
    return jaccard_similarity_score(shingles_list[a], shingles_list[b])
#=================================================

# LSH candidates function
def LSH(signatures, bands, rows, Ab, Bb, nextPrime, maxShingleID, s, shingles_list):
    """Locality Sensitive Hashing
    """
    numItems = signatures.shape[1]
    signBands = numpy.array_split(signatures, bands, axis=0)
    candidates = set()
    for nb in range(bands):
        hashTable = {}
        for ni in range(numItems):
            item = signBands[nb][:,ni]
            # hash the band of item ni into a bucket (named bucket to avoid shadowing the built-in hash)
            bucket = (numpy.dot(Ab[nb,:], item) + Bb[nb]) % nextPrime % maxShingleID
            if bucket not in hashTable:
                hashTable[bucket] = [ni]
            else:
                hashTable[bucket].append(ni)
        for _,items in hashTable.items():
            if len(items) > 1:
                L = len(items)
                for i in range(L-1):
                    for j in range(i+1, L):
                        a = items[i]
                        b = items[j]
                        jsim = jaccard_similarity_score_mod2(a,b, shingles_list)  # exact Jaccard check of the candidate pair
                        if jsim >= s:
                            # numpy.sort returns a copy, so sort explicitly before storing the pair
                            candidates.add(tuple(sorted((a, b))))
    return candidates
#=================================================

# LSH candidates length function
# Note: reads the global A, B, nextPrime, maxShingleID, nsig, bands and rows
def LSH_candidates(abstract_list, k, s):
    signatures = []  # signatures for all abstracts
    shingles_list = []
    for abstract in abstract_list:
        shingles = get_shingles(abstract, k)
        signature = minhash_vectorized(shingles, A, B, nextPrime, maxShingleID, nsig)
        signatures.append(signature)
        shingles_list.append(shingles)

    # one vector of A parameters and one B parameter per band
    # (integer division keeps the upper bound an integer)
    A2 = numpy.random.randint(0, nextPrime // 2, size=(bands, rows), dtype=numpy.int64)
    B2 = numpy.random.randint(0, nextPrime // 2, size=(bands, ), dtype=numpy.int64)
    signatures = numpy.array(signatures).T  # LSH needs a matrix of signatures, not a list of vectors

    candidates = LSH(signatures, bands, rows, A2, B2, nextPrime, maxShingleID, s, shingles_list)

    return len(candidates)  # number of verified candidate pairs
#=================================================

# MinHash wrapper for Task 1 Number 2 (a)
def Sim_Method_Property(abstract_list,k,s,bands,rows):

    # number of hashing functions requested (signature length).
    # Note: candidate_pair reads the global A, B and nsig, so this local value
    # is only recorded in the results table and does not change the signatures.
    nsig = bands*rows

    startTime = time.time()
    cand = candidate_pair(abstract_list, k, s)
    execTime = round(((time.time() - startTime)),2)

    minHash_k.append(k)
    minHash_s.append(s)
    Hashing_fn.append(nsig)
    minHash_sim.append(cand)
    minHash_execT.append(execTime)

    results = {'k': minHash_k, 's': minHash_s, 'Hashing_fn': Hashing_fn,'#sim': minHash_sim, 'execTime(sec)': minHash_execT}
    df = pd.DataFrame(results)

    return df
#=================================================

# LSH wrapper for Task 1 Number 2 (b)
def Sim_Method_Property2(abstract_list,k,s,bands,rows):

    # number of hashing functions requested (signature length).
    # Note: LSH_candidates reads the global A, B, nsig, bands and rows, so the
    # arguments bands and rows only affect the value recorded in the results table.
    nsig = bands*rows

    startTime = time.time()
    candi = LSH_candidates(abstract_list,k, s)
    execTime = round(((time.time() - startTime)),2)

    LSH_k.append(k)
    LSH_s.append(s)
    Hashing_fn2.append(nsig)
    LSH_sim.append(candi)
    LSH_execT.append(execTime)

    results = {'k': LSH_k, 's': LSH_s, 'Hashing_fn': Hashing_fn2,'#sim': LSH_sim, 'execTime(sec)': LSH_execT}
    df = pd.DataFrame(results)

    return df
#=================================================

# Functions for Task 1 Number 3
# Jaccard distance calculator function

def jacc_dist_calc(abstract_list,k,s,bands,rows):

    nsig = bands*rows  # number of hashing functions (signature length)

    jd_df = candidate_pair_jacc_dist(abstract_list, k, s, nsig)

    return jd_df
#==============

# A modified candidate pair function that also returns the Jaccard distance of each pair
# Note: A and B are read from the globals and must have length nsig

def candidate_pair_jacc_dist(abstract_list, k, s, nsig):
    signatures = []  # signatures for all abstracts
    shingles_list = []
    for abstract in abstract_list:
        shingles = get_shingles(abstract, k)
        signature = minhash_vectorized(shingles, A, B, nextPrime, maxShingleID, nsig)
        signatures.append(signature)
        shingles_list.append(shingles)

    Nfiles = len(signatures)
    candidates = []
    jaccard_distance = []
    s_list = []
    for i in range(Nfiles):
        for j in range(i+1, Nfiles):
            # fraction of matching signature entries, an estimate of the Jaccard similarity
            Jsim = numpy.mean(signatures[i] == signatures[j])
            if Jsim >= s:
                js = jaccard_similarity_score_mod2(i,j, shingles_list)
                jaccard_distance.append(1-js)  # Jaccard distance = 1 - Jaccard similarity
                s_list.append(s)
                candidates.append((i,j))

    results = {'s': s_list, 'candidates': candidates, 'jacc_distance': jaccard_distance}
    df = pd.DataFrame(results)
    return df
#=================================================
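
The MinHash idea used by `minhash_vectorized` can be checked on small sets. The sketch below is a plain-Python variant, not the notebook's implementation: it uses Python integers, which have arbitrary precision, so `a * x + b` cannot overflow the way an int64 product can. The names `make_hash_params`, `minhash_signature` and `estimate_jaccard` are hypothetical, and the toy shingle IDs are just small integers.

```python
import random

PRIME = 4294967311  # same prime as nextPrime in the notebook


def make_hash_params(nsig, seed=0):
    """Draw nsig random (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(nsig)]


def minhash_signature(shingle_ids, params):
    """Signature = minimum hash value per hash function (shingle_ids must be non-empty)."""
    return [min((a * x + b) % PRIME for x in shingle_ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


params = make_hash_params(nsig=100)
set_a = set(range(0, 80))
set_b = set(range(40, 120))          # exact Jaccard similarity = 40 / 120 = 1/3
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)
estimate = estimate_jaccard(sig_a, sig_b)  # an estimate of 1/3, subject to sampling noise
```

The estimate fluctuates around the exact value; a longer signature reduces the fluctuation at the cost of more hash evaluations.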


In [4]:
# Run a test: extract the first 1000 abstracts in the file and shingle them with k=3
abstract_list = read_abstracts(fname,1000) 
shingles_vectors = []

for item in abstract_list[:1000]: 
    sh = list(get_shingles(item, 3))
    shingles_vectors.append(sh)
    

# the first two abstracts
jaccard_similarity_score(shingles_vectors[0], shingles_vectors[1])

0.25058823529411767

In [5]:
# sample size
sample_size = numpy.array(abstract_list).T
sample_size.shape

(1000,)
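
`read_abstracts(fname, 1000)` returns the first 1000 usable abstracts in file order. If a random sample is wanted, as the task statement asks, one option is to read all abstracts first (for example by passing an `n` larger than the number of abstracts) and then sample from that list. A minimal sketch, assuming a hypothetical helper `sample_abstracts` and a toy list in place of the real abstracts:

```python
import random


def sample_abstracts(abstracts, n, seed=42):
    """Return n abstracts drawn without replacement; the seed makes the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(abstracts, n)


all_abstracts = [f"abstract {i}" for i in range(10)]  # toy stand-in for the full list
sample = sample_abstracts(all_abstracts, 3)
```

Fixing the seed keeps the sample identical between runs, so timings and counts stay comparable across the three methods.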

#### Task1 Number 1

Compare the performance in time and the results for k-shingles = 3, 5 and 10, for the three methods and similarity thresholds s=0.1 and 0.2. Use 50 hashing functions. Comment on your results.

##### (a) Finding similar items using Pairwise Jaccard similarities
Each cell takes several minutes to run because of the volume of data.

In [6]:
# initializing collector list
similarity_k = []
similarity_s = []
similarity_sim = []
similarity_execT = []

- k-shingles, k=3
- similarity thresholds, s=0.1

In [7]:
k = 3
s = 0.1

startTime = time.time()
sim = len(similar_items(abstract_list,k,s))
execTime = round(((time.time() - startTime)/60),2)

similarity_k.append(k)
similarity_s.append(s)
similarity_sim.append(sim)
similarity_execT.append(execTime)

#print("Number of similar items: {}".format(sim))
#print("Execution time:  {}".format(execTime) + " mins")

- k-shingles, k=3
- similarity thresholds, s=0.2

In [8]:
k = 3
s = 0.2

startTime = time.time()
sim = len(similar_items(abstract_list,k,s))
execTime = round(((time.time() - startTime)/60),2)

similarity_k.append(k)
similarity_s.append(s)
similarity_sim.append(sim)
similarity_execT.append(execTime)

#print("Number of similar items: {}".format(sim))
#print("Execution time:  {}".format(execTime) + " mins")

- k-shingles, k=5
- similarity thresholds, s=0.1

In [9]:
k = 5
s = 0.1

startTime = time.time()
sim = len(similar_items(abstract_list,k,s))
execTime = round(((time.time() - startTime)/60),2)

similarity_k.append(k)
similarity_s.append(s)
similarity_sim.append(sim)
similarity_execT.append(execTime)

#print("Number of similar items: {}".format(sim))
#print("Execution time:  {}".format(execTime) + " mins")

- k-shingles, k=5
- similarity thresholds, s=0.2

In [10]:
k = 5
s = 0.2

startTime = time.time()
sim = len(similar_items(abstract_list,k,s))
execTime = round(((time.time() - startTime)/60),2)

similarity_k.append(k)
similarity_s.append(s)
similarity_sim.append(sim)
similarity_execT.append(execTime)

#print("Number of similar items: {}".format(sim))
#print("Execution time:  {}".format(execTime) + " mins")

- k-shingles, k=10
- similarity thresholds, s=0.1

In [11]:
k = 10
s = 0.1

startTime = time.time()
sim = len(similar_items(abstract_list,k,s))
execTime = round(((time.time() - startTime)/60),2)

similarity_k.append(k)
similarity_s.append(s)
similarity_sim.append(sim)
similarity_execT.append(execTime)

#print("Number of similar items: {}".format(sim))
#print("Execution time:  {}".format(execTime) + " mins")

- k-shingles, k=10
- similarity thresholds, s=0.2

In [12]:
k = 10
s = 0.2

startTime = time.time()
sim = len(similar_items(abstract_list,k,s))
execTime = round(((time.time() - startTime)/60),2)

similarity_k.append(k)
similarity_s.append(s)
similarity_sim.append(sim)
similarity_execT.append(execTime)

#print("Number of similar items: {}".format(sim))
#print("Execution time:  {}".format(execTime) + " mins")

In [13]:
# collect the results in a table (named results to avoid shadowing the built-in dict)
results = {'k': similarity_k, 's': similarity_s, '#sim': similarity_sim, 'execTime(min)': similarity_execT}
df = pd.DataFrame(results)
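
The six cells above differ only in `k` and `s`, so the same measurements could be collected in one loop. A minimal sketch, assuming a hypothetical helper `run_sweep`; the toy function `count_long_items` merely stands in for `similar_items`, `candidate_pair` or `LSH_candidates`, which take the same `(items, k, s)` arguments.

```python
import itertools
import time


def run_sweep(method, items, ks, thresholds):
    """Time method(items, k, s) for every (k, s) combination and collect one row per run."""
    rows = []
    for k, s in itertools.product(ks, thresholds):
        start = time.perf_counter()
        result = method(items, k, s)
        elapsed = time.perf_counter() - start
        rows.append({"k": k, "s": s, "#sim": result, "execTime(sec)": round(elapsed, 2)})
    return rows


def count_long_items(items, k, s):
    """Toy stand-in: counts the items that are at least k characters long (s is ignored)."""
    return sum(1 for item in items if len(item) >= k)


rows = run_sweep(count_long_items, ["abc", "abcdef", "a"], ks=[3, 5, 10], thresholds=[0.1, 0.2])
```

Passing `rows` to `pd.DataFrame` would give a table with the same columns as the ones built by hand, with every time in the same unit.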

##### (b) Finding similar items using MinHash

In [14]:
minHash_k = []
minHash_s = []
minHash_sim = []
minHash_execT = []

In [15]:
# set global parameters to process the whole dataset
bands = 10
rows = 5
nsig = bands*rows  #50 hashing function: number of elements in signature, or the number of different random hash functions

maxShingleID = 2**32-1  # record the maximum shingle ID that we assigned
nextPrime = 4294967311  # next prime number after maxShingleID

A = numpy.random.randint(0, nextPrime, size=(nsig,),dtype=numpy.int64)
B = numpy.random.randint(0, nextPrime, size=(nsig,),dtype=numpy.int64)

- k-shingles, k=3
- similarity thresholds, s=0.1

In [16]:
k = 3
s = 0.1

startTime = time.time()
candi = candidate_pair(abstract_list, k, s)
execTime = round(((time.time() - startTime)),2)

minHash_k.append(k)
minHash_s.append(s)
minHash_sim.append(candi)
minHash_execT.append(execTime)


- k-shingles, k=3
- similarity thresholds, s=0.2

In [17]:
k = 3
s = 0.2

startTime = time.time()
candi = candidate_pair(abstract_list, k, s)
execTime = round(((time.time() - startTime)),2)

minHash_k.append(k)
minHash_s.append(s)
minHash_sim.append(candi)
minHash_execT.append(execTime)


- k-shingles, k=5
- similarity thresholds, s=0.1

In [18]:
k = 5
s = 0.1

startTime = time.time()
cand = candidate_pair(abstract_list, k, s)
execTime = round(((time.time() - startTime)),2)

minHash_k.append(k)
minHash_s.append(s)
minHash_sim.append(cand)
minHash_execT.append(execTime)


- k-shingles, k=5
- similarity thresholds, s=0.2

In [19]:
k = 5
s = 0.2

startTime = time.time()
cand = candidate_pair(abstract_list, k, s)
execTime = round(((time.time() - startTime)),2)

minHash_k.append(k)
minHash_s.append(s)
minHash_sim.append(cand)
minHash_execT.append(execTime)


- k-shingles, k=10
- similarity thresholds, s=0.1

In [20]:
k = 10
s = 0.1

startTime = time.time()
cand = candidate_pair(abstract_list, k, s)
execTime = round(((time.time() - startTime)),2)

minHash_k.append(k)
minHash_s.append(s)
minHash_sim.append(cand)
minHash_execT.append(execTime)

- k-shingles, k=10
- similarity thresholds, s=0.2

In [21]:
k = 10
s = 0.2

startTime = time.time()
cand = candidate_pair(abstract_list, k, s)
execTime = round(((time.time() - startTime)),2)

minHash_k.append(k)
minHash_s.append(s)
minHash_sim.append(cand)
minHash_execT.append(execTime)

In [22]:
dict2 = {'k': minHash_k, 's': minHash_s, '#sim': minHash_sim, 'execTime(sec)': minHash_execT} 
df2 = pd.DataFrame(dict2)

##### (c) Finding similar items using Locality-Sensitive Hashing ( LSH )

In [23]:
# initializing collector list
LSH_k = []
LSH_s = []
LSH_sim = []
LSH_execT = []

- k-shingles, k=3
- similarity thresholds, s=0.1

In [24]:
k = 3
s = 0.1

startTime = time.time()
candi = LSH_candidates(abstract_list,k, s)
execTime = round(((time.time() - startTime)),2)

LSH_k.append(k)
LSH_s.append(s)
LSH_sim.append(candi)
LSH_execT.append(execTime)

- k-shingles, k=3
- similarity thresholds, s=0.2

In [25]:
k = 3
s = 0.2

startTime = time.time()
candi = LSH_candidates(abstract_list,k, s)
execTime = round(((time.time() - startTime)),2)

LSH_k.append(k)
LSH_s.append(s)
LSH_sim.append(candi)
LSH_execT.append(execTime)

- k-shingles, k=5
- similarity thresholds, s=0.1

In [26]:
k = 5
s = 0.1

startTime = time.time()
candi = LSH_candidates(abstract_list,k, s)
execTime = round(((time.time() - startTime)),2)

LSH_k.append(k)
LSH_s.append(s)
LSH_sim.append(candi)
LSH_execT.append(execTime)

- k-shingles, k=5
- similarity thresholds, s=0.2

In [27]:
k = 5
s = 0.2

startTime = time.time()
candi = LSH_candidates(abstract_list,k, s)
execTime = round(((time.time() - startTime)),2)

LSH_k.append(k)
LSH_s.append(s)
LSH_sim.append(candi)
LSH_execT.append(execTime)

- k-shingles, k=10
- similarity thresholds, s=0.1

In [28]:
k = 10
s = 0.1

startTime = time.time()
candi = LSH_candidates(abstract_list,k, s)
execTime = round(((time.time() - startTime)),2)

LSH_k.append(k)
LSH_s.append(s)
LSH_sim.append(candi)
LSH_execT.append(execTime)

- k-shingles, k=10
- similarity thresholds, s=0.2

In [29]:
k = 10
s = 0.2

startTime = time.time()
candi = LSH_candidates(abstract_list,k, s)
execTime = round(((time.time() - startTime)),2)

LSH_k.append(k)
LSH_s.append(s)
LSH_sim.append(candi)
LSH_execT.append(execTime)

In [30]:
dict3 = {'k': LSH_k, 's': LSH_s, '#sim': LSH_sim, 'execTime(sec)': LSH_execT} 
df3 = pd.DataFrame(dict3)


##### Number 1 - results

In [31]:

print('pairwise jaccard_similarity results')
print(df)
print('')
print('minHash results')
print(df2)
print('')
print('LSH results')
print(df3)


pairwise jaccard_similarity results
    k    s    #sim  execTime(min)
0   3  0.1  496590          10.13
1   3  0.2  285927           9.69
2   5  0.1    5308           9.72
3   5  0.2      24          10.34
4  10  0.1      22          10.61
5  10  0.2       7          10.98

minHash results
    k    s    #sim  execTime(sec)
0   3  0.1  470846          14.40
1   3  0.2  224575          13.41
2   5  0.1  149250          14.94
3   5  0.2    2608          13.92
4  10  0.1     422          15.35
5  10  0.2      13          13.90

LSH results
    k    s  #sim  execTime(sec)
0   3  0.1  2761           6.76
1   3  0.2  2268           6.17
2   5  0.1     3           8.32
3   5  0.2     2           7.44
4  10  0.1     1           8.91
5  10  0.2     1           8.88


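COMMENT

- Note the units: the pairwise Jaccard times are in minutes, while the MinHash and LSH times are in seconds. Pairwise Jaccard takes about 10 minutes per run, MinHash about 13 to 15 seconds and LSH about 6 to 9 seconds, so MinHash is roughly 40 times faster than the pairwise comparison and LSH is the fastest of the three.
- As the shingle length k and the similarity threshold s increase, the number of similar pairs decreases for all three methods.
- The speed comes with a trade-off in accuracy. MinHash only estimates the Jaccard similarity from 50 hash values, so its counts differ from the exact ones (for k=5 and s=0.1 it reports 149250 pairs against 5308 exact pairs). LSH only compares pairs that fall into the same bucket in at least one band, so it returns far fewer pairs (2761 against 496590 for k=3 and s=0.1); many similar abstracts are never compared.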
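
The low LSH counts are consistent with the banding scheme. Under the usual idealized analysis, two items with Jaccard similarity s agree on all r rows of a given band with probability s^r, so they share at least one of b bands with probability 1 - (1 - s^r)^b. The sketch below evaluates this for the notebook's setting of 10 bands and 5 rows; the function names are hypothetical, and hash collisions between different bands are ignored.

```python
def lsh_candidate_probability(s, bands, rows):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1 - (1 - s ** rows) ** bands


def lsh_threshold(bands, rows):
    """Approximate similarity at which the S-curve rises steeply: (1/b)^(1/r)."""
    return (1 / bands) ** (1 / rows)


bands, rows = 10, 5
threshold = lsh_threshold(bands, rows)                 # about 0.63
p_at_02 = lsh_candidate_probability(0.2, bands, rows)  # about 0.003
p_at_08 = lsh_candidate_probability(0.8, bands, rows)  # about 0.98
```

With a threshold near 0.63, pairs whose similarity is only 0.1 or 0.2 are very unlikely to become candidates, whatever value of s is used for the final check. Fewer rows per band would lower the threshold.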

#### Task1 Number 2

Compare the results obtained for MinHash and LSH for different similarity thresholds s = 0.1, 0.2 and 0.25 and 50, 100 and 200 hashing functions. Comment on your results.

We use shingle lengths k = 3 and k = 5.

##### (a) Finding similar items using MinHash 

In [32]:
##### K_shingles = 3 and all the other parameters
#================================================

# initializing the collector list
minHash_k = []
minHash_s = []
Hashing_fn = []
minHash_sim = []
minHash_execT = []

# hashing function 50
Sim_Method_Property(abstract_list,k=3,s=0.1,bands=10,rows=5)
Sim_Method_Property(abstract_list,k=3,s=0.2,bands=10,rows=5)
Sim_Method_Property(abstract_list,k=3,s=0.25,bands=10,rows=5)

# hashing function 100
Sim_Method_Property(abstract_list,k=3,s=0.1,bands=20,rows=5)
Sim_Method_Property(abstract_list,k=3,s=0.2,bands=20,rows=5)
Sim_Method_Property(abstract_list,k=3,s=0.25,bands=20,rows=5)

# hashing function 200
Sim_Method_Property(abstract_list,k=3,s=0.1,bands=40,rows=5)
Sim_Method_Property(abstract_list,k=3,s=0.2,bands=40,rows=5)

df_21_k3 = Sim_Method_Property(abstract_list,k=3,s=0.25,bands=40,rows=5)


#### K_shingles = 5 and all the other parameters
#===============================================

# initializing the collector list
minHash_k = []
minHash_s = []
Hashing_fn = []
minHash_sim = []
minHash_execT = []

# hashing function 50
Sim_Method_Property(abstract_list,k=5,s=0.1,bands=10,rows=5)
Sim_Method_Property(abstract_list,k=5,s=0.2,bands=10,rows=5)
Sim_Method_Property(abstract_list,k=5,s=0.25,bands=10,rows=5)

# hashing function 100
Sim_Method_Property(abstract_list,k=5,s=0.1,bands=20,rows=5)
Sim_Method_Property(abstract_list,k=5,s=0.2,bands=20,rows=5)
Sim_Method_Property(abstract_list,k=5,s=0.25,bands=20,rows=5)

# hashing function 200
Sim_Method_Property(abstract_list,k=5,s=0.1,bands=40,rows=5)
Sim_Method_Property(abstract_list,k=5,s=0.2,bands=40,rows=5)

df_21_k5 = Sim_Method_Property(abstract_list,k=5,s=0.25,bands=40,rows=5)


##### (b) Finding similar items using LSH 

In [33]:
##### K_shingles = 3 and all the other parameters
#================================================

# initializing the collector list

LSH_k = []
LSH_s = []
Hashing_fn2 = []
LSH_sim = []
LSH_execT = []

# hashing function 50
Sim_Method_Property2(abstract_list,k=3,s=0.1,bands=10,rows=5)
Sim_Method_Property2(abstract_list,k=3,s=0.2,bands=10,rows=5)
Sim_Method_Property2(abstract_list,k=3,s=0.25,bands=10,rows=5)

# hashing function 100
Sim_Method_Property2(abstract_list,k=3,s=0.1,bands=20,rows=5)
Sim_Method_Property2(abstract_list,k=3,s=0.2,bands=20,rows=5)
Sim_Method_Property2(abstract_list,k=3,s=0.25,bands=20,rows=5)

# hashing function 200
Sim_Method_Property2(abstract_list,k=3,s=0.1,bands=40,rows=5)
Sim_Method_Property2(abstract_list,k=3,s=0.2,bands=40,rows=5)

df_22_k3 = Sim_Method_Property2(abstract_list,k=3,s=0.25,bands=40,rows=5)



#### K_shingles = 5 and all the other parameters
#===============================================

# initializing the collector list

LSH_k = []
LSH_s = []
Hashing_fn2 = []
LSH_sim = []
LSH_execT = []

# hashing function 50
Sim_Method_Property2(abstract_list,k=5,s=0.1,bands=10,rows=5)
Sim_Method_Property2(abstract_list,k=5,s=0.2,bands=10,rows=5)
Sim_Method_Property2(abstract_list,k=5,s=0.25,bands=10,rows=5)

# hashing function 100
Sim_Method_Property2(abstract_list,k=5,s=0.1,bands=20,rows=5)
Sim_Method_Property2(abstract_list,k=5,s=0.2,bands=20,rows=5)
Sim_Method_Property2(abstract_list,k=5,s=0.25,bands=20,rows=5)

# hashing function 200
Sim_Method_Property2(abstract_list,k=5,s=0.1,bands=40,rows=5)
Sim_Method_Property2(abstract_list,k=5,s=0.2,bands=40,rows=5)

df_22_k5 = Sim_Method_Property2(abstract_list,k=5,s=0.25,bands=40,rows=5)


##### Number 2 - results

In [34]:
# RESULTS

print('Minhash K= 3'); print(df_21_k3); print(''); print('minHash k = 5'); print(df_21_k5)
print("==========================================")
print('LSH K= 3'); print(df_22_k3); print(''); print('LSH k = 5'); print(df_22_k5)

Minhash K= 3
   k     s  Hashing_fn    #sim  execTime(sec)
0  3  0.10          50  470846          13.23
1  3  0.20          50  224575          13.04
2  3  0.25          50   72679          12.43
3  3  0.10         100  470846          14.09
4  3  0.20         100  224575          13.77
5  3  0.25         100   72679          11.87
6  3  0.10         200  470846          14.02
7  3  0.20         200  224575          13.55
8  3  0.25         200   72679          12.22

minHash k = 5
   k     s  Hashing_fn    #sim  execTime(sec)
0  5  0.10          50  149250          13.94
1  5  0.20          50    2608          13.13
2  5  0.25          50     120          13.61
3  5  0.10         100  149250          14.17
4  5  0.20         100    2608          14.07
5  5  0.25         100     120          13.21
6  5  0.10         200  149250          14.77
7  5  0.20         200    2608          14.23
8  5  0.25         200     120          14.29
LSH K= 3
   k     s  Hashing_fn  #sim  execTime(sec)

COMMENT

- First, LSH is faster than MinHash.
- Second, this speed comes at a cost: for the same set of parameters, LSH finds fewer similarities than MinHash.
- Third, for a given shingle length k and similarity threshold s, the number of similar abstracts is identical for 50, 100 and 200 hashing functions. This is a consequence of the implementation rather than of MinHash itself: `candidate_pair` and `LSH_candidates` read the global `A`, `B` and `nsig` defined in In [15], so every run builds the same 50-value signatures and the `Hashing_fn` column only records the requested value.
- Lastly, the execution time varies only marginally (about 12 to 15 seconds for MinHash) across the settings, which is consistent with the previous point.

On these runs, 50 hashing functions give usable results at low cost. Confirming that a larger number would not change them requires passing `nsig`, `A` and `B` to the functions explicitly.
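
When the signature length does change, the main effect of more hashing functions is a less noisy similarity estimate. Treating each signature position as an independent match with probability J (an idealizing assumption), the MinHash estimate has standard error sqrt(J(1 - J) / n) for n hashing functions. A short worked sketch with a hypothetical helper `minhash_standard_error`:

```python
import math


def minhash_standard_error(jaccard, nsig):
    """Standard error of the MinHash estimate for true similarity jaccard and nsig hash functions."""
    return math.sqrt(jaccard * (1 - jaccard) / nsig)


# Standard error at a true similarity of 0.2 for the three settings of the task
errors = {n: minhash_standard_error(0.2, n) for n in (50, 100, 200)}
# errors is approximately {50: 0.057, 100: 0.040, 200: 0.028}
```

At a threshold of s = 0.2, an error of about 0.06 is large relative to the threshold itself, so counts of pairs near the threshold can shift noticeably between 50 and 200 hashing functions.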

#### Task1 Number 3
For MinHashing using 100 hashing functions and s = 0.1 and 0.2, find the Jaccard distances (1 - Jaccard similarity) for all possible pairs. Use the obtained values within a k-NN algorithm, and for k = 1, 3 and 5 identify the clusters with similar abstracts for each s. Describe the obtained clusters: are they different? Randomly select at least 5 abstracts per cluster; upon visual inspection, what are the main topics?

In [35]:
bands = 20
rows = 5
nsig = bands*rows
A = numpy.random.randint(0, nextPrime, size=(nsig,),dtype=numpy.int64)
B = numpy.random.randint(0, nextPrime, size=(nsig,),dtype=numpy.int64)

In [36]:
s01 = jacc_dist_calc(abstract_list,k=3,s=0.1,bands=bands,rows=rows)
s02 = jacc_dist_calc(abstract_list,k=3,s=0.2,bands=bands,rows=rows)

In [37]:
s01.head(3)

     s candidates  jacc_distance
0  0.1     (0, 1)       0.749412
1  0.1     (0, 2)       0.772559
2  0.1     (0, 3)       0.772973


In [38]:
input_data = s01[['jacc_distance']]

##### For similarity threshold s = 0.1

In [39]:
# For Nearest Neighbor, k = 1
k = 1

# fitting the model
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(input_data)
distances, indices = knn_model.kneighbors(input_data)

# looping over the indices to make a list of the clusters
ind = []
for item in indices:
    ind.append(item)

# making a dataframe of it
di = {'indices': ind}
dfi = pd.DataFrame(di)    
dfi.head(3)  # each row holds one index into s01, i.e. one candidate pair (two abstracts)

    indices
0  [441682]
1  [395874]
2  [277707]


In [40]:
# A row of equal signs separates abstracts in the same cluster; a row of hash signs separates clusters

print(abstract_list[s01['candidates'][dfi['indices'][1589][0]][0]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][1589][0]][1]])
print('##############################################################################')
print(abstract_list[s01['candidates'][dfi['indices'][3579][0]][0]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][3579][0]][1]])
#print('##############################################################################')
#print(abstract_list[s01['candidates'][dfi['indices'][345876][0]][0]])
#print('========================================================')
#print(abstract_list[s01['candidates'][dfi['indices'][345876][0]][1]])

This paper describes the Global Tone Communication Co., Ltd.{'}s submission of the WMT21 shared news translation task. We participate in six directions: English to/from Hausa, Hindi to/from Bengali and Zulu to/from Xhosa. Our submitted systems are unconstrained and focus on multilingual translation odel, backtranslation and forward-translation. We also apply rules and language model to filter monolingual, parallel sentences and synthetic sentences.
The objective of subtask 2 of SemEval-2021 Task 6 is to identify techniques used together with the span(s) of text covered by each technique. This paper describes the system and model we developed for the task. We first propose a pipeline system to identify spans, then to classify the technique in the input sequence. But it severely suffers from handling the overlapping in nested span. Then we propose to formulize the task as a question answering task by MRC framework which achieves a better result compared to the pipeline method. Moreover, 

Comment:
- With k=1 nearest neighbor and s=0.1, each cluster consists of a single candidate pair (two abstracts), as shown above
- The first pair is about language translation
- The second pair is about dialogue
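
In the cells above, `NearestNeighbors` is fitted on the single column of distance values, so the neighbors it returns are candidate pairs with a similar distance value. Another way to use the Jaccard distances within k-NN is to treat each abstract as a point and the Jaccard distance between two abstracts as the metric. The sketch below shows that alternative in plain Python; the helper `nearest_neighbors` and the toy pairs are hypothetical, and pairs that were never compared are assumed to be maximally distant. scikit-learn's `NearestNeighbors` accepts `metric='precomputed'` for the same purpose.

```python
def nearest_neighbors(pairs, distances, n_items, k):
    """For each item, return the k closest other items by Jaccard distance.

    pairs: list of (i, j) index tuples; distances: the matching Jaccard distances.
    Pairs that are not listed are treated as maximally distant (distance 1.0).
    """
    dist = [[1.0] * n_items for _ in range(n_items)]
    for (i, j), d in zip(pairs, distances):
        dist[i][j] = d
        dist[j][i] = d  # the distance is symmetric
    neighbors = []
    for i in range(n_items):
        others = [j for j in range(n_items) if j != i]
        others.sort(key=lambda j: dist[i][j])  # stable sort: ties keep index order
        neighbors.append(others[:k])
    return neighbors


pairs = [(0, 1), (0, 2), (1, 2), (2, 3)]
distances = [0.2, 0.7, 0.6, 0.1]
neighbors = nearest_neighbors(pairs, distances, n_items=4, k=2)
# neighbors == [[1, 2], [0, 2], [3, 1], [2, 0]]
```

With this reading, each row lists abstract indices directly, so the abstracts of a cluster can be printed with `abstract_list[j]` for each neighbor `j`.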


In [41]:
# For Nearest Neighbor, k = 3
k = 3
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(input_data)
distances, indices = knn_model.kneighbors(input_data)

# looping over the indices to make a list of the clusters
ind = []
for item in indices:
    ind.append(item)

di = {'indices': ind}
dfi = pd.DataFrame(di)    
dfi.head(3)    # The indices show the clusters found in the dataframe s01.
               # The numbers on each row are indices into s01 that identify
               # candidate pairs. For example, the candidate pairs at indexes
               # 22478, 441682 and 428193 of s01 (the dataframe with s=0.1)
               # are in one cluster, meaning their abstracts are similar. This cluster
               # contains up to 6 abstracts (each index refers to a pair of abstracts)

                    indices
0   [22478, 441682, 428193]
1  [450667, 395874, 195403]
2   [216492, 277707, 11131]


In [42]:
print(abstract_list[s01['candidates'][dfi['indices'][412652][0]][0]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][412652][0]][1]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][412652][1]][0]])
#print('========================================================')
#print(abstract_list[s01['candidates'][dfi['indices'][412652][1]][1]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][412652][2]][0]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][412652][2]][1]])

In this paper, we describe our approaches for task six of Social Media Mining for Health Applications (SMM4H) shared task in 2021. The task is to classify twitter tweets containing COVID-19 symptoms in three classes (self-reports, non-personal reports {\&} literature/news mentions). We implemented BERT and XLNet for this text classification task. Best result was achieved by XLNet approach, which is F1 score 0.94, precision 0.9448 and recall 0.94448. This is slightly better than the average score, i.e. F1 score 0.93, precision 0.93235 and recall 0.93235.
The upsurge of prolific blogging and microblogging platforms enabled the abusers to spread negativity and threats greater than ever. Detecting the toxic portions substantially aids to moderate or exclude the abusive parts for maintaining sound online platforms. This paper describes our participation in the SemEval 2021 toxic span detection task. The task requires detecting spans that convey toxic remarks from the given text. We explore 

Comment:

With k=3 nearest neighbors and s=0.1, each cluster contains 3 candidate pairs, i.e. up to 6 abstracts (fewer when pairs share an abstract).

- This cluster is about text translation models.
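
The repeated `print` calls above can be replaced by a small helper that gathers the distinct abstract ids behind one row of neighbor indices. This is a minimal sketch: the function name `cluster_abstract_ids` and the toy inputs are hypothetical stand-ins for `s01['candidates']` and one row of `dfi['indices']`.

```python
def cluster_abstract_ids(pair_indices, candidates):
    """Return the distinct abstract ids covered by the candidate pairs of one k-NN row."""
    seen = []
    for pair_idx in pair_indices:
        for abstract_id in candidates[pair_idx]:
            # Pairs can share an abstract, so skip ids already collected.
            if abstract_id not in seen:
                seen.append(abstract_id)
    return seen

# Hypothetical stand-ins for s01['candidates'] and one row of dfi['indices'].
toy_candidates = [(0, 1), (0, 3), (2, 3), (4, 5)]
toy_row = [0, 1, 2]
print(cluster_abstract_ids(toy_row, toy_candidates))  # [0, 1, 3, 2]
```

In the toy example three candidate pairs cover only four distinct abstracts, which shows why a k=3 cluster can hold fewer than 6 abstracts.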

In [43]:
# For Nearest Neighbor, k = 5
k = 5
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(input_data)
distances, indices = knn_model.kneighbors(input_data)

# looping over the indices to make a list of the clusters
ind = []
for item in indices:
    ind.append(item)

di = {'indices': ind}
dfi = pd.DataFrame(di)    
dfi.head(3) 

Unnamed: 0,indices
0,"[428193, 0, 479697, 441682, 22478]"
1,"[195403, 455819, 433467, 395874, 450667]"
2,"[11131, 405813, 216492, 277707, 404999]"


In [44]:
# Print abstracts from the first three candidate pairs of the cluster at row 442152
print(abstract_list[s01['candidates'][dfi['indices'][442152][0]][0]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][442152][0]][1]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][442152][1]][0]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][442152][1]][1]])
print('========================================================')
print(abstract_list[s01['candidates'][dfi['indices'][442152][2]][0]])
#print('========================================================')
#print(abstract_list[s01['candidates'][dfi['indices'][442152][2]][1]])

In this paper, we describe our system submitted to SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. The task aims at predicting whether the given text is humorous, the average humor rating given by the annotators, and whether the humor rating is controversial. In addition, the task also involves predicting how offensive the text is. Our approach adopts the DeBERTa architecture with disentangled attention mechanism, where the attention scores between words are calculated based on their content vectors and relative position vectors. We also took advantage of the pre-trained language models and fine-tuned the DeBERTa model on all the four subtasks. We experimented with several BERT-like structures and found that the large DeBERTa model generally performs better. During the evaluation phase, our system achieved an F-score of 0.9480 on subtask 1a, an RMSE of 0.5510 on subtask 1b, an F-score of 0.4764 on subtask 1c, and an RMSE of 0.4230 on subtask 2a (rank 3 on the 

Comment:

With k=5 nearest neighbors and s=0.1, each cluster contains 5 candidate pairs, i.e. up to 10 abstracts (fewer when pairs share an abstract).

- This cluster is about language tracking.

##### For similarity threshold s = 0.2

In [45]:
s02.head(3)

Unnamed: 0,s,candidates,jacc_distance
0,0.2,"(0, 1)",0.749412
1,0.2,"(0, 3)",0.772973
2,0.2,"(0, 4)",0.708238


In [46]:
input_data2 = s02[['jacc_distance']]
input_data2.head(3)

Unnamed: 0,jacc_distance
0,0.749412
1,0.772973
2,0.708238


In [47]:
# For Nearest Neighbor, k = 1
k = 1

# fitting the model
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(input_data2)
distances, indices = knn_model.kneighbors(input_data2)

# looping over the indices to make a list of the clusters
ind = []
for item in indices:
    ind.append(item)

# making a dataframe of it
di = {'indices': ind}
dfi = pd.DataFrame(di)    
dfi.head(3)   # with k=1 each row holds a single candidate pair (two abstracts)

Unnamed: 0,indices
0,[274592]
1,[68479]
2,[2]


In [48]:
### Lines of '=' separate abstracts within the same cluster; lines of '#' separate clusters

print(abstract_list[s02['candidates'][dfi['indices'][1589][0]][0]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][1589][0]][1]])
print('##############################################################################')
print(abstract_list[s02['candidates'][dfi['indices'][3579][0]][0]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][3579][0]][1]])
#print('##############################################################################')
#print(abstract_list[s02['candidates'][dfi['indices'][345][0]][0]])
#print('========================================================')
#print(abstract_list[s02['candidates'][dfi['indices'][345][0]][1]])

In this paper, we develop Sindhi subjective lexicon using a merger of existing English resources: NRC lexicon, list of opinion words, SentiWordNet, Sindhi-English bilingual dictionary, and collection of Sindhi modifiers. The positive or negative sentiment score is assigned to each Sindhi opinion word. Afterwards, we determine the coverage of the proposed lexicon with subjectivity analysis. Moreover, we crawl multi-domain tweet corpus of news, sports, and finance. The crawled corpus is annotated by experienced annotators using the Doccano text annotation tool. The sentiment annotated corpus is evaluated by employing support vector machine (SVM), recurrent neural network (RNN) variants, and convolutional neural network (CNN).
Sentiment analysis has come a long way for high-resource languages due to the availability of large annotated corpora. However, it still suffers from lack of training data for low-resource languages. To tackle this problem, we propose Conditional Language Adversaria

Comment:

With k=1 nearest neighbor and s=0.2, mostly no larger clusters form; each result is just a candidate pair, as shown above.
- The first pair is about human-machine collaboration.
- The second pair is about sentence similarity, parsing or processing.

In [49]:
# For Nearest Neighbor, k = 3
k = 3
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(input_data2)
distances, indices = knn_model.kneighbors(input_data2)

# looping over the indices to make a list of the clusters
ind = []
for item in indices:
    ind.append(item)

di = {'indices': ind}
dfi = pd.DataFrame(di)    
dfi.head(3) 

Unnamed: 0,indices
0,"[247295, 12897, 274592]"
1,"[108945, 68479, 127469]"
2,"[2, 207107, 166264]"


In [50]:
# Print the abstracts of the candidate pairs in the cluster at row 41264
print(abstract_list[s02['candidates'][dfi['indices'][41264][0]][0]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][0]][1]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][1]][0]])
#print('========================================================')
#print(abstract_list[s02['candidates'][dfi['indices'][412652][1]][1]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][2]][0]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][2]][1]])

We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialised using a general-domain French language model CamemBERT which follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks of offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and annotated by our team, filling in the gap of such analytic datasets in French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These need

Comment:

With k=3 nearest neighbors and s=0.2, each cluster contains 3 candidate pairs, i.e. up to 6 abstracts (fewer when pairs share an abstract).

- This cluster is about various classification systems.

In [51]:
# For Nearest Neighbor, k = 5
k = 5
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(input_data2)
distances, indices = knn_model.kneighbors(input_data2)

# looping over the indices to make a list of the clusters
ind = []
for item in indices:
    ind.append(item)

di = {'indices': ind}
dfi = pd.DataFrame(di)    
dfi.head(3) 

Unnamed: 0,indices
0,"[274592, 0, 247295, 12897, 20722]"
1,"[127469, 6483, 153000, 68479, 108945]"
2,"[2, 207107, 166264, 193469, 30877]"


In [52]:
# Print abstracts from the first three candidate pairs of the cluster at row 41264
print(abstract_list[s02['candidates'][dfi['indices'][41264][0]][0]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][0]][1]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][1]][0]])
#print('========================================================')
#print(abstract_list[s02['candidates'][dfi['indices'][412652][1]][1]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][2]][0]])
print('========================================================')
print(abstract_list[s02['candidates'][dfi['indices'][41264][2]][1]])

Modern Natural Language Processing (NLP) makes intensive use of deep learning methods because of the accuracy they offer for a variety of applications. Due to the significant environmental impact of deep learning, cost-benefit analysis including carbon footprint as well as accuracy measures has been suggested to better document the use of NLP methods for research or deployment. In this paper, we review the tools that are available to measure energy use and CO2 emissions of NLP methods. We describe the scope of the measures provided and compare the use of six tools (carbon tracker, experiment impact tracker, green algorithms, ML CO2 impact, energy usage and cumulator) on named entity recognition experiments performed on different computational set-ups (local server vs. computing facility). Based on these findings, we propose actionable recommendations to accurately measure the environmental impact of NLP experiments.
This paper describes our submissions for the Social Media Mining for H

Comment:

With k=5 nearest neighbors and s=0.2, each cluster contains 5 candidate pairs, i.e. up to 10 abstracts (fewer when pairs share an abstract).

- This cluster is about human dialogue catalog processing.
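
The cells above give `NearestNeighbors` one `jacc_distance` value per candidate pair, so the neighbors it returns are candidate pairs with similar distance values. A different reading of the task, sketched here as an assumption and not as the method used in this notebook, is to build the full matrix of pairwise Jaccard distances between abstracts and pass it with `metric='precomputed'`, so that the neighbors are abstracts. The shingle sets and all names below are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical shingle sets standing in for four abstracts.
toy_shingles = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}, {8, 9, 10}]
n = len(toy_shingles)

# Full symmetric matrix of pairwise Jaccard distances (zero diagonal).
dist_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = jaccard_distance(toy_shingles[i], toy_shingles[j])
        dist_matrix[i, j] = dist_matrix[j, i] = d

# metric='precomputed' makes sklearn treat the input as distances, not features.
knn = NearestNeighbors(n_neighbors=1, metric='precomputed').fit(dist_matrix)
# Calling kneighbors() without arguments excludes each abstract from its own neighbor list.
neighbor_dist, neighbor_idx = knn.kneighbors()
print(neighbor_idx.ravel())   # nearest other abstract for each abstract
print(neighbor_dist.ravel())
```

With this layout each row of `neighbor_idx` lists abstracts directly, so a cluster for k=3 or k=5 would be read off without going through candidate pairs.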