### Introduction

This notebook records the experiments I ran for the article "Computing Semantic Similarity of Concepts in Knowledge Graphs". To reproduce the experiments, install Sematch and use this notebook as a reference.

In [None]:
from sematch.semantic.similarity import WordNetSimilarity
from IPython.display import display
import pandas as pd

### A simple example of word similarity

In [None]:
wns = WordNetSimilarity()
words = ['artist', 'musician', 'scientist', 'physicist', 'actor', 'movie']
sim_matrix = [[wns.word_similarity(w1, w2, 'wpath') for w1 in words] for w2 in words]
df = pd.DataFrame(sim_matrix, index=words,columns=words)
display(df)

### Evaluations on Word Similarity Datasets

We have collected several well-known word similarity datasets for evaluating semantic similarity metrics. Several Python classes can be used to split the datasets for special purposes and to evaluate a metric function automatically.

We put them together and provide a uniform framework for evaluating different semantic measures. The word similarity datasets include:

- [Rubenstein and Goodenough (RG)](http://www.cs.cmu.edu/~mfaruqui/word-sim/EN-RG-65.txt) 

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Commun. ACM 8, 10 (October 1965), 627-633. DOI=10.1145/365628.365657 

- [Miller and Charles (MC)](http://www.cs.cmu.edu/~mfaruqui/word-sim/EN-MC-30.txt) 

Miller, George A., and Walter G. Charles. "Contextual correlates of semantic similarity." Language and cognitive processes 6.1 (1991): 1-28.

- [Wordsim353 (WS353)](http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/) 

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin, "Placing Search in Context: The Concept Revisited", ACM Transactions on Information Systems, 20(1):116-131, January 2002 

- [wordsim353 similarity and relatedness (WS353Sim)](http://alfonseca.org/eng/research/wordsim353.html) 

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa, A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches, In Proceedings of NAACL-HLT 2009.

- [SimLex-999 (SIMLEX)](http://www.cl.cam.ac.uk/~fh295/simlex.html) 

SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. 2014. Felix Hill, Roi Reichart and Anna Korhonen. Preprint published on arXiv. arXiv:1408.3456
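
As a rough illustration of what evaluating a metric on such a dataset involves, the sketch below scores a few word pairs with a toy metric and computes the Spearman rank correlation against human ratings. It assumes Spearman correlation is the reported statistic, which is the usual choice for these datasets. The word pairs, ratings, `toy_scores`, and the helper names are made up for illustration and are not part of Sematch.

```python
def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))


def evaluate_metric(metric, dataset):
    """Correlate metric scores with the human ratings of a dataset."""
    human = [score for _, _, score in dataset]
    predicted = [metric(w1, w2) for w1, w2, _ in dataset]
    return spearman(human, predicted)


# Hypothetical (word1, word2, human rating) triples, for illustration only.
toy_dataset = [('cat', 'dog', 3.5), ('cat', 'tiger', 3.8),
               ('cat', 'car', 0.5), ('dog', 'bone', 2.0)]
# Hypothetical metric scores standing in for a real similarity function.
toy_scores = {('cat', 'dog'): 0.6, ('cat', 'tiger'): 0.8,
              ('cat', 'car'): 0.1, ('dog', 'bone'): 0.7}
toy_metric = lambda w1, w2: toy_scores[(w1, w2)]

print(evaluate_metric(toy_metric, toy_dataset))
```

Only the ranking of the pairs matters here, so metrics with different value ranges can be compared on the same dataset.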


In [None]:
from sematch.evaluation import WordSimEvaluation
from sematch.semantic.similarity import WordNetSimilarity, YagoTypeSimilarity
from nltk.corpus import wordnet as wn

data_word_noun = ['noun_rg','noun_mc','noun_ws353','noun_ws353-sim','noun_simlex']
data_word_graph = ['graph_rg','graph_mc','graph_ws353','graph_ws353-sim','graph_simlex']
data_word_type = ['type_rg','type_mc','type_ws353','type_ws353-sim','type_simlex']

sim_methods_noun = ['path','lch','wup','li','res','lin','jcn','wpath']
sim_methods_graph = ['path','lch','wup','li','res','res_graph','lin','jcn','wpath','wpath_graph']
sim_methods_type = ['path','lch','wup','li','res','res_graph','lin','lin_graph','jcn','jcn_graph','wpath','wpath_graph']

ws_eval = WordSimEvaluation()
wns = WordNetSimilarity()
yagosim = YagoTypeSimilarity()

To produce TABLE 2 in the article, "The illustration of Semantic Similarity Methods on Some Concept Pair Examples", we manually create the word-to-synset mapping and compute the semantic similarity scores with different semantic similarity metrics.

In [None]:
aspects = {'beef':wn.synset('beef.n.02'), 'lamb':wn.synset('lamb.n.05'), 'octopus':wn.synset('octopus.n.01'),
          'shellfish':wn.synset('shellfish.n.01'), 'meat':wn.synset('meat.n.01'), 'seafood':wn.synset('seafood.n.01'),
          'food':wn.synset('food.n.02'), 'service':wn.synset('service.n.02'),'atmosphere':wn.synset('atmosphere.n.01'),
          'coffee':wn.synset('coffee.n.01')}
aspect_pairs = [('beef', 'octopus'), ('beef', 'lamb'), ('meat','seafood'), ('octopus', 'shellfish'),
               ('beef','service'),('beef','atmosphere'),('beef', 'coffee'), ('food','coffee')]
aspects_sim_matrix = [[wns.similarity(aspects[w1], aspects[w2], m) for m in sim_methods_noun] 
                      for w1, w2 in aspect_pairs]
aspect_index = [x+'-'+y for x, y in aspect_pairs]
aspect_df = pd.DataFrame(aspects_sim_matrix, index=aspect_index, columns=sim_methods_noun)
display(aspect_df)

#### WPATH method with different K in Word Noun Datasets

The data_word_noun datasets contain word pairs that can be mapped to the WordNet noun taxonomy. The value of k is varied from 0.1 to 1.0 in steps of 0.1.

In [None]:
wpath_cors = [ws_eval.evaluate_wpath_k(dataset) for dataset in data_word_noun]
cors_matrix = [[cors[i] for cors in wpath_cors] for i in range(1, 11)]
# Build the k labels as a list so they can be reused as the index of the later tables.
wpath_index = [str(x / 10.0) for x in range(1, 11)]
df_wpath = pd.DataFrame(cors_matrix, index=wpath_index, columns=data_word_noun)
display(df_wpath)
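
To see what the parameter k controls, the sketch below implements the WPATH formula as I recall it from the article: sim = 1 / (1 + length * k^IC(lcs)), where length is the shortest-path length between the two concepts and IC(lcs) is the information content of their least common subsumer. The function name `wpath_sketch` and the toy inputs are assumptions for illustration; Sematch's own implementation may differ in detail.

```python
def wpath_sketch(path_length, ic_lcs, k):
    """WPATH-style similarity: path length weighted by k ** IC(lcs)."""
    if not 0 < k <= 1:
        raise ValueError("k is expected to lie in (0, 1]")
    return 1.0 / (1.0 + path_length * k ** ic_lcs)


# Hypothetical concept pair: 4 edges apart, with IC(lcs) = 3.0.
for k in [x / 10.0 for x in range(1, 11)]:
    print(k, round(wpath_sketch(4, 3.0, k), 3))
```

With k = 1 the IC term vanishes and the score depends on the path length alone. Smaller values of k let the shared information content contribute more to the score.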

#### WPATH method with different K in Word Graph Datasets

On the word graph datasets, we evaluate wpath with different values of k using corpus-based IC and graph-based IC, respectively.

In [None]:
# evaluate with corpus-based IC
wpath_cors = [ws_eval.evaluate_wpath_k(dataset) for dataset in data_word_graph]
cors_matrix = [[cors[i] for cors in wpath_cors] for i in range(1, 11)]
df_wpath_graph = pd.DataFrame(cors_matrix, index=wpath_index, columns=data_word_graph)
display(df_wpath_graph)

In [None]:
# evaluate with graph-based IC
wpath_cors = [ws_eval.evaluate_wpath_k(dataset, 'graph') for dataset in data_word_graph]
cors_matrix = [[cors[i] for cors in wpath_cors] for i in range(1, 11)]
df_wpath_graph = pd.DataFrame(cors_matrix, index=wpath_index, columns=data_word_graph)
display(df_wpath_graph)

#### WPATH method with different K in Word Type Datasets

In [None]:
# evaluate with corpus-based IC
wpath_cors = [ws_eval.evaluate_wpath_k(dataset) for dataset in data_word_type]
cors_matrix = [[cors[i] for cors in wpath_cors] for i in range(1, 11)]
df_wpath_type = pd.DataFrame(cors_matrix, index=wpath_index, columns=data_word_type)
display(df_wpath_type)

In [None]:
# evaluate with graph-based IC
wpath_cors = [ws_eval.evaluate_wpath_k(dataset, 'graph') for dataset in data_word_type]
cors_matrix = [[cors[i] for cors in wpath_cors] for i in range(1, 11)]
df_wpath_type = pd.DataFrame(cors_matrix, index=wpath_index, columns=data_word_type)
display(df_wpath_type)

#### Baseline semantic similarity metrics on Word Noun Datasets

In [None]:
path = lambda x, y: wns.word_similarity(x, y, 'path')
lch = lambda x, y: wns.word_similarity(x, y, 'lch')
wup = lambda x, y: wns.word_similarity(x, y, 'wup')
li = lambda x, y: wns.word_similarity(x, y, 'li')
res = lambda x, y: wns.word_similarity(x, y, 'res')
lin = lambda x, y: wns.word_similarity(x, y, 'lin')
jcn = lambda x, y: wns.word_similarity(x, y, 'jcn')

methods = {'path':path, 'lch':lch, 'wup':wup, 'li':li, 'res':res, 'lin':lin, 'jcn':jcn}
cor_dicts = [ws_eval.evaluate_multiple_metrics(methods, dataset) for dataset in data_word_noun]
baseline_cors_matrix = [[cors[m] for cors in cor_dicts] for m in sim_methods_noun[0:7]]
df_baselines_noun = pd.DataFrame(baseline_cors_matrix, index=sim_methods_noun[0:7], columns=data_word_noun)
display(df_baselines_noun)

#### Baseline semantic similarity metrics on Word Graph Datasets

In [None]:
res_graph = lambda x, y: yagosim.word_similarity(x, y, 'res_graph')
methods['res_graph'] = res_graph
cor_dicts = [ws_eval.evaluate_multiple_metrics(methods, dataset) for dataset in data_word_graph]
baseline_cors_matrix = [[cors[m] for cors in cor_dicts] for m in sim_methods_graph[0:8]]
df_baselines_graph = pd.DataFrame(baseline_cors_matrix, index=sim_methods_graph[0:8], columns=data_word_graph)
display(df_baselines_graph)

#### Baseline semantic similarity metrics on Word Type Datasets

In [None]:
lin_graph = lambda x, y: yagosim.word_similarity(x, y, 'lin_graph')
jcn_graph = lambda x, y: yagosim.word_similarity(x, y, 'jcn_graph')
methods['lin_graph'] = lin_graph
methods['jcn_graph'] = jcn_graph
cor_dicts = [ws_eval.evaluate_multiple_metrics(methods, dataset) for dataset in data_word_type]
baseline_cors_matrix = [[cors[m] for cors in cor_dicts] for m in sim_methods_type[0:10]]
df_baselines_type = pd.DataFrame(baseline_cors_matrix, index=sim_methods_type[0:10], columns=data_word_type)
display(df_baselines_type)

#### Steiger's Z Significance Test on Word Noun Dataset

In [None]:
wpath_rg = lambda x, y: wns.word_similarity_wpath(x, y, 0.9)
wpath_mc = lambda x, y: wns.word_similarity_wpath(x, y, 0.4)
wpath_ws353 = lambda x, y: wns.word_similarity_wpath(x, y, 0.5)
wpath_ws353sim = lambda x, y: wns.word_similarity_wpath(x, y, 0.8)
wpath_simlex = lambda x, y: wns.word_similarity_wpath(x, y, 0.8)

methods = {'wpath_rg':wpath_rg, 'wpath_mc':wpath_mc, 'wpath_ws353':wpath_ws353, 
           'wpath_ws353sim':wpath_ws353sim,'wpath_simlex':wpath_simlex}

cor_dicts = [ws_eval.evaluate_multiple_metrics(methods, dataset) for dataset in data_word_noun]

In [None]:
wpath_dic = {'noun_rg':'wpath_rg', 'noun_mc':'wpath_mc', 'noun_ws353':'wpath_ws353',
            'noun_ws353-sim':'wpath_ws353sim', 'noun_simlex':'wpath_simlex'}

cors_matrix = [[cor_dicts[i][wpath_dic[dataset]] for i, dataset in enumerate(data_word_noun)]]
df_cors = pd.DataFrame(cors_matrix, index=['metrics'], columns=data_word_noun)
display(df_cors)

To perform Steiger's Z significance test, one can use the implementation integrated in the Sematch framework or the R package cocor. Example scripts that use cocor to run the statistical test on the SimLex dataset are shown below:
```
require(cocor) # load package
# j is the variable shared by both correlations (human ratings); k and h are the two methods being compared
# e.g. wpath with human (jk), jcn with human (jh), and wpath with jcn (kh)
#simlex
#wpath with path Pass
cocor.dep.groups.overlap(r.jk=+0.603, r.jh=+0.584, r.kh=+0.955, n=666, alternative="greater", alpha=0.05, conf.level=0.95, null.value=0)
#wpath with lch Pass
cocor.dep.groups.overlap(r.jk=+0.603, r.jh=+0.584, r.kh=+0.955, n=666, alternative="greater", alpha=0.05, conf.level=0.95, null.value=0)
#wpath with wup Pass
cocor.dep.groups.overlap(r.jk=+0.603, r.jh=+0.542, r.kh=+0.946, n=666, alternative="greater", alpha=0.05, conf.level=0.95, null.value=0)
#wpath with li Pass
cocor.dep.groups.overlap(r.jk=+0.603, r.jh=+0.586, r.kh=+0.965, n=666, alternative="greater", alpha=0.05, conf.level=0.95, null.value=0)
#wpath with res Pass
cocor.dep.groups.overlap(r.jk=+0.603, r.jh=+0.535, r.kh=+0.913, n=666, alternative="greater", alpha=0.05, conf.level=0.95, null.value=0)
#wpath with lin Pass
cocor.dep.groups.overlap(r.jk=+0.603, r.jh=+0.582, r.kh=+0.944, n=666, alternative="greater", alpha=0.05, conf.level=0.95, null.value=0)
```
An example of using the integrated statistical test is shown in the following code.

In [None]:
stats_tests = []
for dataset in data_word_noun:
    stats = {}
    for m in sim_methods_noun[0:7]:
        cor, p_value = ws_eval.statistical_test(wpath_dic[dataset], m, dataset)
        stats[m] = '('+str(round(cor,3))+','+str(p_value)+')'
    stats_tests.append(stats)
stats_matrix = [[cors[m] for cors in stats_tests] for m in sim_methods_noun[0:7]]
df_stats = pd.DataFrame(stats_matrix, index=sim_methods_noun[0:7], columns=data_word_noun)
display(df_stats)
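
For readers without R, the statistic behind the cocor calls above can be sketched in plain Python. The function below follows Steiger's (1980) Z for two dependent correlations that share one variable, as that formula is commonly stated. It is a sketch under that assumption, not the Sematch or cocor implementation, and the name `steiger_z` is hypothetical.

```python
import math


def steiger_z(r_jk, r_jh, r_kh, n):
    """Steiger's Z for H1: r_jk > r_jh, where both correlations share variable j."""
    # Fisher z-transform of the two correlations being compared.
    z_jk = math.atanh(r_jk)
    z_jh = math.atanh(r_jh)
    # Pooled correlation used in the covariance term.
    r2 = ((r_jk + r_jh) / 2.0) ** 2
    c = (r_kh * (1 - 2 * r2) - 0.5 * r2 * (1 - 2 * r2 - r_kh ** 2)) / (1 - r2) ** 2
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)
    # One-sided p-value from the standard normal survival function.
    p_greater = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_greater


# Values taken from the "wpath with path" call in the R script above.
z, p = steiger_z(r_jk=0.603, r_jh=0.584, r_kh=0.955, n=666)
print(round(z, 3), round(p, 4))
```

A p-value below the chosen alpha (0.05 in the R script) corresponds to a comparison marked "Pass" there.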

In [16]:
import os
os.system("gsutil -m cp gs://nis-dataproc/data/follow-unfollow/namedEntities.p ./")

0

In [17]:
import pickle
# Use a context manager so the file handle is closed after loading.
with open("namedEntities.p", "rb") as f:
    namedEntities = pickle.load(f)

In [22]:
names, dbpedia_urls = zip(*[(v['name'], v['dbpedia_url']) for v in namedEntities.values()])

In [27]:
all_names_pairs = [(n1, n2) for n1 in names for n2 in names]
all_urls_pairs = [(n1, n2) for n1 in dbpedia_urls for n2 in dbpedia_urls]

In [28]:
all_names_pairs[:4]

[(u'Dineshwar_Sharma', u'Dineshwar_Sharma'),
 (u'Dineshwar_Sharma', u'Cairn_India'),
 (u'Dineshwar_Sharma', u'McKinsey_%26_Company'),
 (u'Dineshwar_Sharma', u'Intercontinental_ballistic_missile')]

In [29]:
all_urls_pairs[:4]

[(u'http://dbpedia.org/resource/Dineshwar_Sharma',
  u'http://dbpedia.org/resource/Dineshwar_Sharma'),
 (u'http://dbpedia.org/resource/Dineshwar_Sharma',
  u'http://dbpedia.org/resource/Cairn_India'),
 (u'http://dbpedia.org/resource/Dineshwar_Sharma',
  u'http://dbpedia.org/resource/McKinsey_%26_Company'),
 (u'http://dbpedia.org/resource/Dineshwar_Sharma',
  u'http://dbpedia.org/resource/Intercontinental_ballistic_missile')]

In [1]:
from sematch.semantic.similarity import EntitySimilarity
sim = EntitySimilarity()
import datetime
st = datetime.datetime.now()
# print sim.similarity('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Barcelona') #0.409923677282
# print sim.similarity('http://dbpedia.org/resource/Narendra_Modi','http://dbpedia.org/resource/Steve_Jobs')#0.0904545454545
# print sim.relatedness('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Barcelona')#0.457984139871
# print sim.relatedness('http://dbpedia.org/resource/Arun_Jaitley', 'http://dbpedia.org/resource/Narendra_Modi')#0.465991132787
en = datetime.datetime.now()
print en - st

0:00:00.000034


In [2]:
st = datetime.datetime.now()
print sim.di_relatedness('http://dbpedia.org/resource/Cristiano_Ronaldo','http://dbpedia.org/resource/Madrid')
print sim.di_relatedness('http://dbpedia.org/resource/Arun_Jaitley', 'http://dbpedia.org/resource/Narendra_Modi')
print sim.di_relatedness('http://dbpedia.org/resource/Sachin_Tendulkar', 'http://dbpedia.org/resource/Cricket')
en = datetime.datetime.now()
print en - st

0.902403335865
0.986416248169
0.930793437854
0:00:09.498396


In [3]:
st = datetime.datetime.now()
print sim.di_relatedness('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Cristiano_Ronaldo')
print sim.di_relatedness('http://dbpedia.org/resource/Narendra_Modi', 'http://dbpedia.org/resource/Arun_Jaitley')
print sim.di_relatedness('http://dbpedia.org/resource/Cricket', 'http://dbpedia.org/resource/Sachin_Tendulkar')
en = datetime.datetime.now()
print en - st

0.384560921554
0.788008763981
0.21519414076
0:00:02.233770


In [5]:
st = datetime.datetime.now()
print sim.di_relatedness('http://dbpedia.org/resource/Sachin_Tendulkar', 'http://dbpedia.org/resource/Narendra_Modi')
en = datetime.datetime.now()
print en - st

0.640476618239
0:00:01.540255


In [6]:
st = datetime.datetime.now()
print sim.di_relatedness('http://dbpedia.org/resource/Cricket', 'http://dbpedia.org/resource/Narendra_Modi')
en = datetime.datetime.now()
print en - st

-0.149850346799
0:00:01.638156


In [8]:
st = datetime.datetime.now()
print sim.di_relatedness('http://dbpedia.org/resource/Cricket', 'http://dbpedia.org/resource/Madrid')
en = datetime.datetime.now()
print en - st

-0.149850346799
0:00:02.235042


In [26]:
st = datetime.datetime.now()
print sim.di_relatedness('http://dbpedia.org/resource/Bharatiya_Janata_Party', 'http://dbpedia.org/resource/Bharatiya_Janata_Party')
print sim.di_relatedness('http://dbpedia.org/resource/Rahul_Gandhi', 'http://dbpedia.org/resource/Indian_National_Congress')
print sim.di_relatedness('http://dbpedia.org/resource/Indian_National_Congress', 'http://dbpedia.org/resource/Rahul_Gandhi')
en = datetime.datetime.now()
print en - st

1.07704736658
1.01002662437
0.442058902002
0:00:02.734767
