# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import numpy as np, nltk, pandas as pd
from nltk.corpus import wordnet as wn
_ = nltk.download(['punkt', 'averaged_perceptron_tagger', 'wordnet', 'stopwords'], quiet=True)
SsStopwords = set(nltk.corpus.stopwords.words())            # set of strings of stop words

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

Review the code Professor Melnikov used to measure the similarity between a pair of sentences.

To compare two sentences using the WordNet ontological database, first tokenize each sentence and assign POS tags, which help identify the most relevant synset. Not all tokens map to synsets, so some preprocessing helps, for example changing plural nouns to singular or inflected verbs to their base form. Next, for each word sense in one document, locate the most similar word sense in the second document and compute the path similarity between the two. Finally, average these path similarities to obtain the similarity between the two texts. This process is highly simplified, but it leaves room for improvement.

Recall that WordNet can help with preprocessing because it has built-in functions to lemmatize words. Below, use the functions `morphy` and `lemmatize` to look up the base forms of the given strings.

In [None]:
wn.morphy('cats', pos='n')
nltk.WordNetLemmatizer().lemmatize('cats', pos='n')
nltk.WordNetLemmatizer().lemmatize('walking', pos='v')

The UDF `SplitNTag()` tokenizes an argument `sDoc` and retrieves POS tags for the resulting tokens.

In [None]:
DssTags = dict(N='n', J='a', R='r', V='v') # convert NLTK POS to WordNet POS: noun, adjective, adverb, verb

def SplitNTag(sDoc='I ate a red bean.', sDefaultTag='n', SsStop=SsStopwords, nMinLen=2):
    ''' Tokenize & clean a document. Return list of tuples (word, WordNet tag).
    sDoc:        a string sentence or document
    sDefaultTag: default tag to use if the first letter of the NLTK tag is not a key in DssTags
    SsStop:      set of stopword strings to discard
    nMinLen:     keep only words longer than nMinLen characters  '''
    LTssTokTag = nltk.pos_tag(nltk.word_tokenize(sDoc)) # word and NLTK tag
    # keep tokens that are not stop words and are longer than nMinLen characters
    return [(w, DssTags.get(t[:1], sDefaultTag)) for w, t in LTssTokTag if (w not in SsStop) and len(w)>nMinLen]
SplitNTag()
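
To see how the first letter of an NLTK (Penn Treebank) tag selects the WordNet tag, the lookup can be tried in isolation. The sketch below is illustrative only: the tag strings are typed by hand rather than produced by `nltk.pos_tag`, and `to_wordnet_tag` is a hypothetical helper that is not part of the notebook.

```python
# Mapping from the first letter of a Penn Treebank tag to a WordNet POS tag
# (same content as DssTags above).
tag_map = dict(N='n', J='a', R='r', V='v')

def to_wordnet_tag(treebank_tag, default='n'):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS tag."""
    return tag_map.get(treebank_tag[:1], default)

# Hand-typed example tags; real tags would come from nltk.pos_tag().
examples = ['NNS', 'VBD', 'JJ', 'RB', 'DT', 'PRP']
mapped = {t: to_wordnet_tag(t) for t in examples}
print(mapped)
```

Tags whose first letter is not in the dictionary, such as `DT` and `PRP`, fall back to the default noun tag. Matching on the first letter is coarse: a particle tag such as `RP` also starts with `R` and would be treated as an adverb.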

The UDF `Doc2Syn()` retrieves the first synset (if available) for each token+POS in a document. It relies on the strong assumption that the first synset is the most likely one, which is not always the case. Using a text’s context (for example, via a pre-trained language model such as SBERT) may yield a more relevant synset, possibly at a higher computational cost.

In [None]:
Doc2Syn = lambda s='I rode a bike': [wn.synsets(i, z)[0] for i, z in SplitNTag(s) if wn.synsets(i, z)] # list of synsets in document

print([s.name() for s in Doc2Syn()])
print([s for s in Doc2Syn('brand new laptop')])

The UDF `SynSim()` is a wrapper for `wn.path_similarity()`. When no path is found between two synsets, `wn.path_similarity()` returns `None`, and the wrapper returns zero instead.

In [None]:
def SynSim(ss1=wn.synset('cat.n.01'), ss2=wn.synset('cat.v.01')):
    ''' Return similarity score between two synsets '''
    nSim = wn.path_similarity(ss1, ss2) # path similarity between 2 synset objects
    return nSim if nSim else 0          # replace None with 0, which will be ignored by max()

SynSim(ss1=wn.synset('cat.n.01'), ss2=wn.synset('cat.n.02'))
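
For intuition about the score itself, path similarity is based on the length of the shortest is-a path between two nodes and equals 1 / (1 + path length). The sketch below reproduces that idea on a small invented taxonomy. The `parent` dictionary, `neighbors`, and `toy_path_similarity` are assumptions made for illustration and do not use real WordNet data.

```python
from collections import deque

# Toy is-a taxonomy (child -> parent); an invented example, not real WordNet data.
parent = {
    'lion': 'big_cat', 'tiger': 'big_cat',
    'big_cat': 'feline', 'house_cat': 'feline',
    'feline': 'animal', 'dog': 'animal',
}

def neighbors(node):
    """Nodes one is-a link away: any children plus the parent, if present."""
    linked = [c for c, p in parent.items() if p == node]
    if node in parent:
        linked.append(parent[node])
    return linked

def toy_path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or 0 if the nodes are not connected."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:                      # breadth-first search finds the shortest path
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0  # mirrors SynSim(): no path means zero similarity

print(toy_path_similarity('lion', 'lion'))   # 0 links apart
print(toy_path_similarity('lion', 'tiger'))  # 2 links apart
print(toy_path_similarity('lion', 'dog'))    # 4 links apart
```

Identical nodes score 1, and the score shrinks as the two nodes sit further apart in the hierarchy.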

The UDF `SynsSim()` takes two lists of synsets and computes average similarity for best-matched pairs from the two lists. If `debug` is turned on, it prints the intermediate similarities for the pairs.

Note that this UDF computes an asymmetric similarity: the similarity between `LS1` and `LS2` may differ from the similarity between `LS2` and `LS1`. This is because the function finds the best-matched synset in `LS2` for each synset in `LS1`, and the average is taken over the synsets in `LS1`.

While this function can be improved, it serves our purpose for demonstrating the computation of the rough similarity between two texts.

In [None]:
max0 = lambda X: max(X, default=0)   # wrapper for max() that returns 0 for an empty iterable

def SynsSim(LS1=wn.synsets('cat'), LS2=wn.synsets('dog'), debug=False):
    '''Return average similarity for best-matched pairs of synsets from two lists of synsets.
     debug: if True, print the best-matched synset pairs and the average similarity instead of returning it'''
    if debug: print('-'*20, f'\n>LS1: {[ss.name() for ss in LS1]}', f'\n>LS2: {[ss.name() for ss in LS2]}')
    if not LS1 or not LS2:
        print('WARNING: At least one synset list is empty')
        return 0

    # Ensure similarity is the first in a tuple
    LnSims = [max0([(SynSim(ss1, ss2), ss1.name(), ss2.name()) for ss2 in LS2]) for ss1 in LS1] # double loop
    nAvgSim = sum(list(zip(*LnSims))[0])/len(LnSims) if len(LnSims) > 0 else 0  # average similarity
    if debug: print('Best-matched synset pairs:',', '.join([f'{s}|{s2}|{n:.2f}' for n,s,s2 in LnSims]), f"\nAvg sim: {nAvgSim:.2f}")
    else: return nAvgSim

LSyn1, LSyn2 = wn.synsets('lion'), wn.synsets('tiger')
SynsSim(LSyn1, LSyn2, debug=True)
SynsSim(LSyn2, LSyn1, debug=True)
SynsSim(Doc2Syn('it rains outside'), Doc2Syn('rain is outside'), debug=True)
SynsSim(Doc2Syn('it rains outside'), Doc2Syn('it pours outdoors'), debug=True)

Finally, with `debug` switched off, `SynsSim()` returns the score, so `DocSim()` can average the two directional scores into a symmetric similarity between sentences. In the first pair, `rain` is a *verb* in the first sentence and a *noun* in the second. This mismatch yields less closely related synsets and therefore a relatively low similarity. The last two results are in line with the expectation that *'raining cats and dogs'* is still related to rain, while *'brand new laptop'* is not.

In [None]:
def DocSim(sDoc1='it rains outside', sDoc2='rain is outside'):
    LTs1, LTs2 = Doc2Syn(sDoc1), Doc2Syn(sDoc2)
    return (SynsSim(LTs1, LTs2) + SynsSim(LTs2, LTs1)) / 2

print(DocSim('it rains outside', 'rain is outside'))               # rain is a verb and a noun
print(DocSim('it rains outside', 'it is a pouring rain outdoors')) # rain is a verb and a noun
print(DocSim('it rains outside', 'raining cats and dogs'))         # rain is a verb in both sentences
print(DocSim('it rains outside', 'brand new laptop'))              # there is no `rain` word in the 2nd sentence
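
The asymmetry of the best-match average, and the way averaging both directions removes it, can be checked by hand on a tiny example. The sketch below is a simplified stand-in, not the notebook's code: the similarity values in `sim` are invented, and `pair_sim`, `directional_sim`, and `symmetric_sim` are hypothetical names that play the roles of `SynSim()`, `SynsSim()`, and `DocSim()`.

```python
# Invented pairwise similarities between items of two short "documents".
sim = {
    ('rain', 'rain'): 1.0, ('rain', 'storm'): 0.5, ('rain', 'laptop'): 0.1,
    ('outside', 'rain'): 0.2, ('outside', 'storm'): 0.2, ('outside', 'laptop'): 0.1,
}

def pair_sim(a, b):
    """Look up a similarity regardless of argument order; 0 if the pair is unknown."""
    return sim.get((a, b), sim.get((b, a), 0))

def directional_sim(doc_a, doc_b):
    """Average, over items in doc_a, of the best match found in doc_b."""
    if not doc_a or not doc_b:
        return 0
    return sum(max(pair_sim(a, b) for b in doc_b) for a in doc_a) / len(doc_a)

def symmetric_sim(doc_a, doc_b):
    """Average of both directions, the same idea as DocSim()."""
    return (directional_sim(doc_a, doc_b) + directional_sim(doc_b, doc_a)) / 2

doc1 = ['rain', 'outside']
doc2 = ['rain', 'storm', 'laptop']

print(round(directional_sim(doc1, doc2), 3))  # best matches for doc1's items
print(round(directional_sim(doc2, doc1), 3))  # best matches for doc2's items
print(round(symmetric_sim(doc1, doc2), 3))    # same value in either argument order
```

With these invented values the score is 0.6 in one direction and about 0.533 in the other, because each direction averages over a different number of items. The symmetric version does not depend on the argument order.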

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now you will practice computing similarities between documents. Consider this list of quotes about language.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell to see whether you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

In [None]:
LsQuote=["A different language is a different vision of life.", # Federico Fellini
  "The limits of my language mean the limits of my world.",     # Ludwig Wittgenstein
  "One language sets you in a corridor for life. Two languages open every door along the way.",  # Frank Smith
  "He who knows no foreign languages knows nothing of his own.",  # Johann Wolfgang von Goethe
  "You can never understand one language until you understand at least two.",  # Geoffrey Willans
  "To have another language is to possess a second soul.",      # Charlemagne
  "Change your language and you change your thoughts.",         # Karl Albrecht
  "Knowledge of languages is the doorway to wisdom.",           # Roger Bacon
  "Language is the blood of the soul into which thoughts run and out of which they grow.",  # Oliver Wendell Holmes
  "Learn a new language and get a new soul.",                   # Czech Proverb
  "A special kind of beauty exists which is born in language, of language, and for language.",  # Gaston Bachelard
  "Learning is a treasure that will follow its owner everywhere.",  # Chinese Proverb
  "One should not aim at being possible to understand but at being impossible to misunderstand.",  # Marcus Fabius Quintilian
  "A mistake is to commit a misunderstanding.",                  # Bob Dylan
  "Language is to the mind more than light is to the eye."]     # William Gibson

## Task 1

Compute a similarity score for each quote in `LsQuote` in relation to the search quote `'languages around the World'`.

<b>Hint:</b> Use <code>DocSim()</code> to compute similarity for each quote. You can do this in a loop or list comprehension.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
sQuery = 'languages around the World'
sorted([(round(DocSim(sQuery, q), 3), q) for q in LsQuote], reverse=True)
</pre>
</details> 
</font>

<hr>