
The similarity between two sentences is computed with the **Wu-Palmer (WUP) similarity** measure over WordNet. The process works as follows:

1. **Tokenization**:
    - Each sentence is tokenized into words using `word_tokenize()`.

2. **POS Tagging**:
    - Each token is assigned a Part-of-Speech (POS) tag using `pos_tag()`.

3. **POS Mapping**:
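    - The Penn Treebank tags returned by `pos_tag()` are mapped to WordNet POS constants using the `get_wordnet_pos()` function. Tags that are not adjectives, verbs, nouns, or adverbs default to noun.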

4. **Synset Extraction**:
    - For each word in the sentences, WordNet synsets are retrieved with `wn.synsets()`, restricted to the mapped POS. Words of the first sentence that have no synsets are skipped.

5. **Similarity Calculation**:
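    - For each word in the first sentence, every one of its synsets is compared with every synset of every word in the second sentence. Each pair is scored with the **Wu-Palmer Similarity (WUP)** via `wn.wup_similarity()`.
    - The highest score found for each word in the first sentence is retained. Words with no positive score are dropped.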

6. **Final Score**:
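    - The average of the retained similarity scores is the final similarity score between the two sentences. If no score was retained, the result is `0.0`. Because only the words of the first sentence are iterated, swapping the two sentences can change the score.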
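
For intuition about the word-level measure in step 5: Wu-Palmer scores two concepts by the depth of their deepest shared ancestor (the least common subsumer, LCS) relative to their own depths, i.e. `2 * depth(LCS) / (depth(a) + depth(b))`. The sketch below applies that formula to a tiny hand-made tree. The `PARENT` table and the `toy_wup()` helper are illustrative assumptions, not WordNet data or NLTK API, and depth is counted in nodes so that the root has depth 1. Real WordNet is a graph in which a synset can have several hypernyms, so NLTK's own computation is more involved than this.

```python
# Hypothetical toy taxonomy (child -> parent); this is NOT WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(node))

def toy_wup(a, b):
    """Wu-Palmer score: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a
    # is the deepest common ancestor (the LCS) in a tree.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(f"dog vs cat: {toy_wup('dog', 'cat'):.4f}")  # share 'animal' (depth 2) -> 0.6667
print(f"dog vs car: {toy_wup('dog', 'car'):.4f}")  # share only 'entity' (depth 1) -> 0.3333
```

Concepts that meet deeper in the hierarchy score closer to 1, and a concept compared with itself scores exactly 1.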


In [None]:
from nltk.corpus import wordnet as wn
from nltk import word_tokenize, pos_tag
import nltk

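# Download the WordNet corpus, the tokenizer model, and the POS tagger model.
# Note: NLTK 3.9+ may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')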

def get_wordnet_pos(treebank_tag):
    """Map POS tag to first character for lemmatization."""
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('N'):
        return wn.NOUN
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN

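def wup_similarity(sent1, sent2):
    """Average, over the words of sent1, of the best WUP score against any word of sent2."""
    pos1 = pos_tag(word_tokenize(sent1))
    pos2 = pos_tag(word_tokenize(sent2))

    # Look up the synsets of every word in sent2 once, outside the main loop.
    synsets2 = [syn
                for word2, tag2 in pos2
                for syn in wn.synsets(word2, pos=get_wordnet_pos(tag2))]

    sims = []
    for word1, tag1 in pos1:
        synsets1 = wn.synsets(word1, pos=get_wordnet_pos(tag1))
        if not synsets1:
            continue  # word1 has no WordNet entry under this POS
        best_sim = 0
        for syn1 in synsets1:
            for syn2 in synsets2:
                # wn.wup_similarity() can return None when the synsets share no ancestor
                sim = wn.wup_similarity(syn1, syn2)
                if sim and sim > best_sim:
                    best_sim = sim
        if best_sim:  # words with no positive match are left out of the average
            sims.append(best_sim)

    return sum(sims) / len(sims) if sims else 0.0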



[nltk_data] Downloading package wordnet to /Users/nikisha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/nikisha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nikisha/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
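
Since `wup_similarity(sent1, sent2)` only loops over the words of `sent1`, calling it with the arguments swapped can give a different score. One possible way to remove that dependence is to average the two directions. The sketch below isolates the "best match, then average" step behind a pluggable word-level scorer so it runs without NLTK. The names `directed_score()`, `symmetric_score()`, and the `TOY_SIM` lookup table with its values are hypothetical and stand in for the synset comparison used in the notebook.

```python
def directed_score(words1, words2, word_sim):
    """Mean of each word1's best score against words2; words with no positive score are skipped."""
    best_scores = []
    for w1 in words1:
        # Treat None (no score available) as 0.0, mirroring the `if sim` check above.
        best = max((word_sim(w1, w2) or 0.0 for w2 in words2), default=0.0)
        if best:
            best_scores.append(best)
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

def symmetric_score(words1, words2, word_sim):
    """Average both directions so the argument order no longer matters."""
    forward = directed_score(words1, words2, word_sim)
    backward = directed_score(words2, words1, word_sim)
    return (forward + backward) / 2

# Hypothetical word-level scores standing in for a WUP lookup.
TOY_SIM = {("dog", "cat"): 0.8, ("dog", "rock"): 0.2}

def toy_sim(a, b):
    """Return the toy score for an unordered word pair, or None if unknown."""
    return TOY_SIM.get((a, b)) or TOY_SIM.get((b, a))

s1, s2 = ["dog"], ["cat", "rock"]
print(f"{directed_score(s1, s2, toy_sim):.2f}")   # 0.80: 'dog' matches 'cat' best
print(f"{directed_score(s2, s1, toy_sim):.2f}")   # 0.50: mean of 0.8 ('cat') and 0.2 ('rock')
print(f"{symmetric_score(s1, s2, toy_sim):.2f}")  # 0.65: average of the two directions
```

To try this with the notebook's data, `word_sim` could be replaced by a function that returns the best `wn.wup_similarity()` value across the synsets of two words.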


In [2]:
sent1 = "Dogs are wonderful pets."
sent2 = "Cats are amazing companions."
score = wup_similarity(sent1, sent2)
print(f"WordNet WUP Similarity: {score:.4f}")

WordNet WUP Similarity: 0.8036
