# Applications of Semantic Similarity
* Grouping similar words into semantic concepts.
* As a building block in natural language understanding tasks.
    * Textual entailment
    * Paraphrasing

### **WordNet**
* Semantic dictionary of (mostly) English words, interlinked by semantic relations.
* Includes rich linguistic information
    * part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, ...
* Machine-readable, freely available

#### **Semantic similarity using WordNet**
* WordNet organizes information in a hierarchy
* Many similarity measures use the hierarchy in some way

<img src="./assets/semantic_sim.png" alt="" />

* How to get the similarity between:
    * deer and giraffe
    * deer and elk
    * deer and horse

##### **Path Similarity**
* Find the shortest path between the two concepts
* Similarity is inversely related to path distance: $PathSim(u, v) = {1 \over {1 + dist(u, v)}}$
* deer and giraffe - distance = 2
* deer and elk - distance = 1
* deer and horse - distance = 6
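
The distances above can be reproduced with a small sketch. The `PARENT` map is a hypothetical toy taxonomy, assumed to mirror the hierarchy in the figure rather than loaded from WordNet, and `ancestors`, `path_distance` and `path_similarity` are illustrative helper names. The score assumes the $1 / (1 + distance)$ form, which is what NLTK's `path_similarity` computes.

```python
# Hypothetical toy taxonomy (child -> parent), assumed to match the figure.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}


def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_distance(u, v):
    """Number of edges on the shortest path between u and v in the tree."""
    steps_from_v = {node: steps for steps, node in enumerate(ancestors(v))}
    for steps_from_u, node in enumerate(ancestors(u)):
        if node in steps_from_v:
            # First shared ancestor on the way up from u: the paths meet here.
            return steps_from_u + steps_from_v[node]
    raise ValueError("concepts are not connected")


def path_similarity(u, v):
    """Similarity that shrinks as the path between u and v grows."""
    return 1 / (1 + path_distance(u, v))


for other in ("giraffe", "elk", "horse"):
    print(other, path_distance("deer", other), path_similarity("deer", other))
```

Because the toy hierarchy is a tree, walking up from one concept until the other concept's ancestor chain is reached is enough to find the shortest path. WordNet itself allows multiple hypernyms per synset, so a real implementation has to consider several paths.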

##### **Lowest common subsumer (LCS)**
* Find the closest ancestor to both concepts
* LCS(deer and giraffe) = ruminant
* LCS(deer and elk) = deer
* LCS(deer and horse) = ungulate
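
As a rough illustration, the LCS can be found by collecting every ancestor of one concept and then walking up from the other until a shared node appears. The `PARENT` map is again a hypothetical toy taxonomy assumed to match the figure, and `lowest_common_subsumer` is an illustrative name, not a library function.

```python
# Hypothetical toy taxonomy (child -> parent), assumed to match the figure.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}


def lowest_common_subsumer(u, v):
    """Closest concept that subsumes both u and v (a concept subsumes itself)."""
    # Collect u and all of its ancestors.
    seen = set()
    node = u
    while node is not None:
        seen.add(node)
        node = PARENT.get(node)
    # Walk up from v; the first node already seen is the closest shared ancestor.
    node = v
    while node is not None:
        if node in seen:
            return node
        node = PARENT.get(node)
    return None  # the two concepts share no ancestor


for other in ("giraffe", "elk", "horse"):
    print("deer", other, lowest_common_subsumer("deer", other))
```

Note that a concept counts as its own subsumer here, which is why the LCS of deer and elk is deer itself.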

##### **Lin Similarity**
* Similarity measure based on the information contained in the LCS of the two concepts
    * $LinSim(u, v) = {{2 \times \log P (LCS(u, v))}\over{\log P(u) + \log P(v)}}$
* $P(u)$ is the probability of encountering concept $u$, estimated from counts over a large corpus; $-\log P(u)$ is the information content of $u$
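
The formula can be evaluated directly once the three probabilities are known. The sketch below is a minimal version under stated assumptions: `lin_similarity` is an illustrative function (distinct from NLTK's synset method of the same name), and the probabilities are made-up values chosen only to show the arithmetic, not corpus estimates.

```python
import math


def lin_similarity(p_u, p_v, p_lcs):
    """Lin similarity computed from concept probabilities in (0, 1]."""
    return (2 * math.log(p_lcs)) / (math.log(p_u) + math.log(p_v))


# Hypothetical probabilities, for illustration only.
p_deer = 0.001
p_elk = 0.0002

# LCS(deer, elk) = deer, so P(LCS) is P(deer).
print(lin_similarity(p_deer, p_elk, p_lcs=p_deer))

# A concept compared with itself scores 1.
print(lin_similarity(p_deer, p_deer, p_lcs=p_deer))
```

A more specific LCS has a lower probability and therefore more information content, which pushes the score toward 1. An LCS at the root of the hierarchy has probability 1 and contributes no information, so the score is 0. The sketch does not guard against the degenerate case where both concepts have probability 1.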

In [8]:
from nltk.corpus import wordnet as wn 

# 'deer.n.01' selects the first (01) noun (n) sense of the word 'deer'
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
giraffe = wn.synset('giraffe.n.01')
horse = wn.synset('horse.n.01')

print('Path Similarity:')
print('* deer and giraffe - ' + str(deer.path_similarity(giraffe)))
print('* deer and elk - ' + str(deer.path_similarity(elk)))
print('* deer and horse - ' + str(deer.path_similarity(horse)))

Path Similarity:
* deer and giraffe - 0.3333333333333333
* deer and elk - 0.5
* deer and horse - 0.14285714285714285


In [13]:
import nltk
nltk.download('wordnet_ic')

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

print('Lin Similarity:')
print('* deer and giraffe - ' + str(deer.lin_similarity(giraffe, brown_ic)))
print('* deer and elk - ' + str(deer.lin_similarity(elk, brown_ic)))
print('* deer and horse - ' + str(deer.lin_similarity(horse, brown_ic)))

[nltk_data] Downloading package wordnet_ic to /home/user/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


Lin Similarity:
* deer and giraffe - 1.5837629004603727e-299
* deer and elk - 0.8623778273893673
* deer and horse - 0.7726998936065773

* The near-zero score for deer and giraffe reflects missing corpus counts rather than a meaningful similarity: `giraffe` has no counts in the Brown information-content file, so NLTK assigns it a very large placeholder information content (`1e300`), which drives the ratio toward zero.


##### **Collocations and Distributional similarity**
* Two words that frequently appear in similar contexts are more likely to be semantically related.
* **Context**:
    * Words before, after, within a small window.
    * Parts of speech of words before, after, in a small window.
    * Specific syntactic relation to the target word.
    * Words in the same sentence, same document, ...
* Strength of association between words
    * How often do the two words occur together? Words that rarely co-occur are unlikely to be related.
    * How frequent is each word on its own? 'the' is so frequent that it co-occurs with almost every word, so raw co-occurrence counts overstate its association.
    * Pointwise Mutual Information normalizes the joint probability by the individual probabilities: $PMI(w, c) = {\log({{P(w, c)}\over{P(w) \times P(c)}})}$
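
To make the PMI formula concrete, here is a minimal sketch that estimates it from raw counts. It assumes the context of a word is simply the next word, uses maximum-likelihood estimates for the probabilities, and takes the logarithm in base 2; `bigram_pmi` and the sample sentence are illustrative, not part of NLTK.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Map each adjacent word pair (w, c) to its PMI, using log base 2."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w, c), count in bigrams.items():
        p_wc = count / n_pairs        # joint probability P(w, c)
        p_w = unigrams[w] / n_words   # marginal probability P(w)
        p_c = unigrams[c] / n_words   # marginal probability P(c)
        scores[(w, c)] = math.log2(p_wc / (p_w * p_c))
    return scores


tokens = "the cat sat on the mat and the cat slept".split()
scores = bigram_pmi(tokens)

# Print the pairs from highest to lowest PMI.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this sample the pair `('the', 'cat')` occurs twice but still scores lower than `('sat', 'on')`, which occurs once, because 'the' is frequent on its own. The same effect means PMI tends to favor rare pairs, so on real corpora a minimum frequency filter is usually applied first.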

In [18]:
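from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# from_words expects a sequence of tokens; passing the raw string would
# iterate over its characters and score character bigrams instead
tokens = 'The quick brown fox jumped over the yellow fence'.split()
finder = BigramCollocationFinder.from_words(tokens)

# every bigram occurs exactly once in this tiny sample, so all PMI scores tie
finder.nbest(bigram_measures.pmi, 10)

[('The', 'quick'),
 ('brown', 'fox'),
 ('fox', 'jumped'),
 ('jumped', 'over'),
 ('over', 'the'),
 ('quick', 'brown'),
 ('the', 'yellow'),
 ('yellow', 'fence')]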

#### **Important Concepts**
* Finding similarity between words and text is non-trivial
* WordNet is a useful resource for semantic relationships between words
* Many similarity functions exist
* NLTK is a useful package for many such tasks