## Creating a test set of tag relations

In this notebook, we create subsets of tags for testing different methods for word sense disambiguation (WSD), word sense induction (WSI), and learning relations between tags.

We will test how well different methods can spot relationships between tags. We will select tags that have different kinds of relationships between them:
- Synonymy
- Specificity (hypernym/hyponym)
- Acronyms
- Words with the same root
- Semantic extension
- Different spellings (including misspellings) of the same word
- Same named entity

We will use coverage, precision, and recall for evaluation.
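As a rough illustration of the evaluation, the sketch below scores a set of predicted related pairs against a hand-labelled set. The function name `evaluate_pairs` and the toy ids are hypothetical, and the definition of coverage used here (the share of sampled tags that appear in at least one predicted pair) is an assumption; the notebook does not define it.

```python
def evaluate_pairs(predicted, gold, all_pairs):
    """Score predicted related pairs against a hand-labelled gold set.

    All arguments are sets of (tag_id, tag_id) tuples with the smaller id first.
    """
    # Ignore predictions that fall outside the sampled pairs
    predicted_in_scope = predicted & all_pairs
    true_positives = predicted_in_scope & gold

    precision = len(true_positives) / len(predicted_in_scope) if predicted_in_scope else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0

    # Assumed definition of coverage: fraction of sampled tags that
    # appear in at least one predicted pair
    tags_all = {tag for pair in all_pairs for tag in pair}
    tags_predicted = {tag for pair in predicted_in_scope for tag in pair}
    coverage = len(tags_predicted) / len(tags_all) if tags_all else 0.0

    return {"precision": precision, "recall": recall, "coverage": coverage}


# Toy ids, not taken from the tag data
all_pairs = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
gold = {(1, 2), (3, 4)}
predicted = {(1, 2), (1, 3)}

scores = evaluate_pairs(predicted, gold, all_pairs)
print(scores)
```

Because related pairs are rare among all possible pairs, precision and recall are more informative here than plain accuracy.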


In [2]:
import pandas as pd 
import numpy as np 

In [3]:
df_tags = pd.read_csv("data/tags_selected.csv")
df_tags.head()

Unnamed: 0,tag_id,book_count,count,tag_name
0,0,7,24,-
1,1,2,6,--1-
2,15,2,6,--6-
3,21,2,19,-calif--
4,22,3,27,-d-c--


We pick an arbitrary run of 100 consecutive tags (rows 13450 to 13549 of `df_tags`) to explore the existing relationships between them.

In [4]:
# df_syn = df_tags[df_tags.tag_id >= 28500] 
# df_syn = df_syn[df_syn.tag_id <= 28600]
df_sample = df_tags[13450:13550]
df_sample.head(100)

Unnamed: 0,tag_id,book_count,count,tag_name
13450,28446,2,20,sports-romances
13451,28448,2,19,sports-star
13452,28450,3,30,sports-theme
13453,28454,4,41,sporty
13454,28456,3,24,spouses
...,...,...,...,...
13545,28621,2,19,stegner
13546,28623,10,1322,steinbeck
13547,28624,3,25,steinbeck-john
13548,28626,30,203,stem
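
The slice above uses a fixed start position. If a reproducible random window were wanted instead, one option is to draw the start offset from a seeded generator. This is a minimal sketch on a toy frame; `df_demo`, `window`, and the seed value are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_tags (hypothetical data)
df_demo = pd.DataFrame({
    "tag_id": range(0, 2000, 2),
    "tag_name": [f"tag-{i}" for i in range(1000)],
})

window = 100
rng = np.random.default_rng(seed=0)  # the seed value is arbitrary

# integers() excludes the upper bound, so the last valid start is len - window
start = int(rng.integers(0, len(df_demo) - window + 1))
df_window = df_demo.iloc[start:start + window]

print(start, len(df_window))
```

Keeping the rows consecutive matters here because the tags are sorted alphabetically, so neighbouring rows are more likely to be related.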


In [4]:
pd.set_option('display.max_rows', 1000)
display(df_sample)

Unnamed: 0,tag_id,book_count,count,tag_name
13450,28446,2,20,sports-romances
13451,28448,2,19,sports-star
13452,28450,3,30,sports-theme
13453,28454,4,41,sporty
13454,28456,3,24,spouses
13455,28457,2,30,sprawl
13456,28460,7,50,spring
13457,28464,3,8,spritual
13458,28466,222,11294,spy
13459,28468,3,10,spy-action


We will extract the following pairs of similar words: 

Extension of meaning: 
- "spy": edges (28466, i) for i in range(28468, 28489), i.e. ids 28468 through 28488 that are present in the sample
- "spy-thriller": (28482, 28483) 
- "stand-alone": (28525, 28526), (28525, 28528)
- "stage": (28510, 28511), (28510, 28512)
- "star-wars": edges (28541, i) for i in range(28542, 28567)
- "star-wars-legends": (28552, 28553)
- "starcrossed": (28570, 28571)
- "stark": (28573, 28574), (28573, 28575)

Specificity: 
- "book-novel": (28471, 28473), (28542, 28556)
- "book-fiction": (28471, 28477), (28471, 28479)

Synonymy: 
- "started, did not finish": (28583, 28585), (28583, 28586), (28585, 28586)

Words with same root: 
- "stalk":  (28520, 28522) 

Different spellings:
- "star-wars": (28541, 28567), (28541, 28597), (28567, 28597)
- "star-wars-canon": (28544, 28545)
- "start-up" and "startups": (28580, 28594), (28581, 28596)
- "steampunk": (28609, 28610)
- "steel-danielle": (28615, 28616), (28615, 28618), (28616, 28618)
- "steinbeck": (28623, 28624) 

Different forms of word: 
- "spy": (28466, 28488)
- "novel": (28477, 28479)
- "own": (28559, 28560)
- "start-up" and "startup": (28580, 28581), (28594, 28596)


Acronymy: 
- "star-wars-eu", "star-wars-expanded-universe": (28548, 28549)


In [5]:
edges = set()

In [31]:
# Extension of meaning
extension_set = set()
extension_set = extension_set.union([(28466, i) for i in range(28468, 28489) if i in df_sample.tag_id.values]) #"spy"
extension_set = extension_set.union([(28482, 28483)]) #"spy-thriller"
extension_set = extension_set.union([(28525, 28526), (28525, 28528)]) # "stand-alone"
extension_set = extension_set.union([(28510, 28511), (28510, 28512)]) # "stage"
extension_set = extension_set.union([(28541, i ) for i in range(28542, 28566) if i in df_sample.tag_id.values]) #star-wars
extension_set = extension_set.union([(28570, 28571)]) #"starcrossed"
extension_set = extension_set.union([(28573, 28574), (28573, 28575)]) #"stark"

#Length of the set 
print("Number of pairs: ", len(extension_set))

#Test if values are true
for pair in list(extension_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  40
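
The membership check is repeated for every relation type in the cells that follow. As a sketch of one way to factor it out, the hypothetical helper `missing_ids` below returns the offending ids instead of stopping at the first failed `assert`, which makes mistyped ids easier to spot. The ids in the example are only illustrative.

```python
def missing_ids(pairs, valid_ids):
    """Return every tag id used in `pairs` that is not in `valid_ids`."""
    valid = set(valid_ids)
    return {tag_id for pair in pairs for tag_id in pair if tag_id not in valid}


# Illustrative ids; in the notebook, valid_ids would be df_sample.tag_id.values
sample_ids = [28466, 28468, 28488]
pairs = {(28466, 28468), (28466, 28999)}

missing = missing_ids(pairs, sample_ids)
print(missing)  # {28999}
```

An empty result means every pair refers only to tags in the sample.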


In [33]:
# Specificity
spec_set = set() 
spec_set = spec_set.union([(28471, 28473), (28542, 28556)]) #"book-novel"
spec_set = spec_set.union([(28471, 28477), (28471, 28479)]) #"book-fiction"

#Length of the set 
print("Number of pairs: ", len(spec_set))

#Test if values are true
for pair in list(spec_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  4


In [43]:
#Synonymy: 
syn_set = set()
syn_set = syn_set.union([(28583, 28585), (28583, 28586), (28585, 28586)]) #"started, did not finish"

#Length of the set 
print("Number of pairs: ", len(syn_set))

#Test if values are true
for pair in list(syn_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  3


In [44]:
# Words with same root:
root_set = set()
root_set = root_set.union([(28520, 28522)]) #"stalk"

#Length of the set 
print("Number of pairs: ", len(root_set))

#Test if values are true
for pair in list(root_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  1


In [1]:
# Different forms of word:
forms_set = set() 
forms_set = forms_set.union([(28466, 28488)]) #"spy"
forms_set = forms_set.union([(28477, 28479)]) #"novel"
forms_set = forms_set.union([(28559, 28560)]) #"own"
forms_set = forms_set.union([(28580, 28581), (28594, 28596)]) #"start-up" and "startup"

#Length of the set 
print("Number of pairs: ", len(forms_set))

#Test if values are true
for pair in list(forms_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  5


In [53]:
#Different spellings:
spell_set = set() 
spell_set = spell_set.union([(28541, 28567), (28541, 28597), (28567, 28597)]) #"star-wars"
spell_set = spell_set.union([(28544, 28545)]) #"star-wars-canon"
spell_set = spell_set.union([(28580, 28594), (28581, 28596)]) #"start-up" and "startups"
spell_set = spell_set.union([(28609, 28610)]) #"steampunk"
spell_set = spell_set.union([(28615, 28616), (28615, 28618), (28616, 28618)]) #"steel-danielle"
spell_set = spell_set.union([(28623, 28624)]) #"steinbeck"

#Length of the set 
print("Number of pairs: ", len(spell_set))

#Test if values are true
for pair in list(spell_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  11


In [54]:
#Acronymy: 
acro_set = set() 
acro_set = acro_set.union([(28548, 28549)]) #"star-wars-eu", "star-wars-expanded-universe"

print("Number of pairs: ", len(acro_set))

#Test if values are true
for pair in list(acro_set): 
    assert(pair[0] in df_sample.tag_id.values)
    assert(pair[1] in df_sample.tag_id.values)

Number of pairs:  1


In [55]:
#Join all 
similarity_set = extension_set | spec_set | syn_set | root_set | forms_set | spell_set | acro_set
print("Number of pairs: ", len(similarity_set))

Number of pairs:  64


The similarity set constructed so far does not take into account that the similarity relation is symmetric. We will construct a set that also contains the reversed pair (b, a) for every pair (a, b).

In [59]:
import copy 
similarity_set_sym = copy.deepcopy(similarity_set)
for pair in list(similarity_set): 
    similarity_set_sym.add((pair[1], pair[0]))  
print("Number of pairs: ", len(similarity_set_sym))

Number of pairs:  128
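
An alternative to storing both orientations is to bring every pair into one canonical order before comparing sets. The sketch below shows both approaches side by side; the helper names `canonical` and `symmetric_closure` are hypothetical, and it assumes tag ids are integers that can be ordered.

```python
def canonical(pair):
    """Order a pair so that the smaller id comes first."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def symmetric_closure(pairs):
    """Return the pairs together with their reversed counterparts."""
    return set(pairs) | {(b, a) for a, b in pairs}


pairs = {(28580, 28581), (28594, 28596)}
closure = symmetric_closure(pairs)
print(len(closure))  # 4

# A prediction given in the "wrong" order still matches after normalisation
predicted = {(28581, 28580)}
hits = {canonical(p) for p in predicted} & pairs
print(hits)  # {(28580, 28581)}
```

Canonical ordering keeps the labelled set directly comparable with a set built from `itertools.combinations`, which yields each unordered pair only once.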


Now we will construct the set of all possible pairs, using 2-combinations of the tag ids in the sample. Combinations are unordered and never pair an id with itself, so each pair of distinct tags appears exactly once.

In [97]:
len(list(df_sample.tag_id.values))

100

In [99]:
#Set of all tag pairs, without the reflexive ones
import itertools
all_pairs_set= set() 

for element in itertools.combinations(list(df_sample.tag_id.values), 2):
    all_pairs_set.add(element)

In [100]:
print("Number of pairs: ", len(all_pairs_set))

Number of pairs:  4950
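
As a quick sanity check on this count, the number of unordered pairs of n distinct items is n(n-1)/2, which gives 4950 for n = 100. The short sketch below confirms this with stand-in ids rather than the real tag ids.

```python
import itertools
import math

n = 100
ids = list(range(n))  # stand-in ids; only the count matters here

pair_count = sum(1 for _ in itertools.combinations(ids, 2))

# Three equivalent ways to obtain the number of unordered pairs
assert pair_count == math.comb(n, 2) == n * (n - 1) // 2
print(pair_count)  # 4950
```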


In [101]:
# What fraction of all tag pairs is marked as related
print("Ratio of related tags: ", len(similarity_set) / len(all_pairs_set))

Ratio of related tags:  0.01292929292929293


In [102]:
#Save test datasets 
import pickle 
file_sim_set = "test_data/simipar_tags_test_set.pickle"
file_all_set = "test_data/all_tags_test_set.pickle"

with open(file_sim_set, 'wb') as f_sim:
    pickle.dump(similarity_set, f_sim)
    
with open(file_all_set, 'wb') as f_all:
    pickle.dump(all_pairs_set, f_all)

In [103]:
df_sample.to_csv("test_data/tags_sample.csv")
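
For reference, reading these artifacts back might look like the sketch below. It uses a temporary directory, an in-memory buffer, and toy data so that it runs on its own; the file name `pairs_demo.pickle` is hypothetical. Since `to_csv` writes the index as an unnamed first column by default, passing `index_col=0` to `read_csv` avoids an extra `Unnamed: 0` column on reload.

```python
import io
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Round-trip a set of pairs through pickle
pairs = {(28623, 28624), (28609, 28610)}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pairs_demo.pickle"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(pairs, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)
print(restored == pairs)

# Round-trip a small frame through CSV, restoring the original index
df_demo = pd.DataFrame(
    {"tag_id": [28623, 28624], "tag_name": ["steinbeck", "steinbeck-john"]},
    index=[13546, 13547],
)
buffer = io.StringIO()
df_demo.to_csv(buffer)
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=0)
print(list(df_back.columns))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.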
