In this notebook, we will explore WordNet synsets, presenting a simple method for finding all mentions of all hyponyms of a given node in the WordNet hierarchy (e.g., finding all buildings in a text).

Source code adapted from: https://github.com/dbamman/anlp21/blob/main/10.wordnet/ExploreWordNet.ipynb

# WordNet

In [21]:
import nltk, re, spacy
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn
# only tokenization, tagging and lemmatization are needed, so disable the NER and parser components
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/owenmonroe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/owenmonroe/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Get the synsets for a given word. The synsets here are roughly ordered by frequency of use (in a small tagged dataset), so that more frequent senses occur first.

In [22]:
synsets=wn.synsets('blue')
for synset in synsets:
    print(synset, synset.definition())

Synset('blue.n.01') blue color or pigment; resembling the color of the clear sky in the daytime
Synset('blue.n.02') blue clothing
Synset('blue.n.03') any organization or party whose uniforms or badges are blue
Synset('blue_sky.n.01') the sky as viewed during daylight
Synset('bluing.n.01') used to whiten laundry or hair or give it a bluish tinge
Synset('amobarbital_sodium.n.01') the sodium salt of amobarbital that is used as a barbiturate; used as a sedative and a hypnotic
Synset('blue.n.07') any of numerous small butterflies of the family Lycaenidae
Synset('blue.v.01') turn blue
Synset('blue.s.01') of the color intermediate between green and violet; having a color similar to that of a clear unclouded sky; - Helen Hunt Jackson
Synset('blue.s.02') used to signify the Union forces in the American Civil War (who wore blue uniforms)
Synset('gloomy.s.02') filled with melancholy and despondency
Synset('blasphemous.s.02') characterized by profanity or cursing
Synset('blue.s.05') suggestive of 

In [23]:
for lemma in wn.synset("blue.n.01").lemmas():
    print(lemma.name())

# relation functions from http://www.nltk.org/howto/wordnet.html; passed to closure(), they yield *all* of a synset's hyponyms/hypernyms
hypo = lambda s: s.hyponyms()
hyper = lambda s: s.hypernyms()

blue
blueness


In [24]:
# find all the synsets that are hyponyms of the target synset (descendants in the WordNet hierarchy)
list(wn.synset("blue.n.01").closure(hypo))

[Synset('azure.n.01'),
 Synset('dark_blue.n.01'),
 Synset('greenish_blue.n.01'),
 Synset('powder_blue.n.01'),
 Synset('prussian_blue.n.02'),
 Synset('purplish_blue.n.01'),
 Synset('steel_blue.n.01'),
 Synset('ultramarine.n.02')]
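
To make the idea of a closure concrete, here is a minimal sketch of a breadth-first transitive closure over a toy hierarchy. The dictionary `TOY_HYPONYMS`, the helpers `toy_hypo` and `toy_closure`, and the node names are all illustrative assumptions; this is not WordNet data and not NLTK's implementation, only the same general idea of repeatedly applying a relation function.

```python
from collections import deque

# toy hierarchy standing in for WordNet (hypothetical nodes, parent -> direct children)
TOY_HYPONYMS = {
    "color": ["chromatic_color", "achromatic_color"],
    "chromatic_color": ["blue", "red"],
    "blue": ["azure", "dark_blue"],
    "dark_blue": ["navy"],
}

# relation function: direct children of a node, analogous in role to `hypo` above
toy_hypo = lambda node: TOY_HYPONYMS.get(node, [])

def toy_closure(node, rel):
    """Yield every node reachable from `node` via `rel`, breadth-first, without repeats."""
    seen = set()
    queue = deque(rel(node))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        yield current
        queue.extend(rel(current))

blue_descendants = list(toy_closure("blue", toy_hypo))
color_descendants = list(toy_closure("color", toy_hypo))
print(blue_descendants)   # ['azure', 'dark_blue', 'navy']
print(len(color_descendants))  # 7
```

The start node itself is not part of the result, which is why `get_words_in_hypo` below appends the target synset explicitly.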

In [25]:
# find all the synsets that are hypernyms (ancestors up the tree) of the target synset
list(wn.synset("blue.n.01").closure(hyper))

[Synset('chromatic_color.n.01'),
 Synset('color.n.01'),
 Synset('visual_property.n.01'),
 Synset('property.n.02'),
 Synset('attribute.n.02'),
 Synset('abstraction.n.06'),
 Synset('entity.n.01')]

In [26]:
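# return the set of words/phrases (lemma names) for a synset and all of its hyponyms
def get_words_in_hypo(synset):
    words=set()
    hyponym_synsets=list(synset.closure(hypo))
    # closure() does not include the starting synset, so add it explicitly
    hyponym_synsets.append(synset)
    for hyponym_synset in hyponym_synsets:
        for lemma in hyponym_synset.lemmas():
            # WordNet joins multi-word lemmas with underscores (e.g., "Prussian_blue")
            words.add(lemma.name().replace("_", " "))

    return words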

get_words_in_hypo(wn.synset("color.n.01"))

{"Davy's gray",
 "Davy's grey",
 'Indian red',
 'Paris green',
 'Prussian blue',
 'Turkey red',
 'Tyrian purple',
 'Vandyke brown',
 'Venetian red',
 'achromasia',
 'achromatic color',
 'achromatic colour',
 'alabaster',
 'alizarine red',
 'amber',
 'apatetic coloration',
 'aposematic coloration',
 'apricot',
 'aqua',
 'aquamarine',
 'ash gray',
 'ash grey',
 'azure',
 'beige',
 'black',
 'blackness',
 'bleach',
 'blond',
 'blonde',
 'blondness',
 'blue',
 'blue green',
 'blueness',
 'bluish green',
 'bone',
 'bottle green',
 'brick red',
 'brown',
 'brownish yellow',
 'brownness',
 'buff',
 'burgundy',
 'burnt sienna',
 'burnt umber',
 'canary',
 'canary yellow',
 'caramel',
 'caramel brown',
 'cardinal',
 'carmine',
 'carnation',
 'cerise',
 'cerulean',
 'chalk',
 'charcoal',
 'charcoal gray',
 'charcoal grey',
 'chartreuse',
 'cherry',
 'cherry red',
 'chestnut',
 'chocolate',
 'chromatic color',
 'chromatic colour',
 'chromatism',
 'chrome green',
 'chrome red',
 'claret',
 'coal b

In [27]:
# for a given set of words, find each instance among a list of tokens already processed by spacy.  
# return a list of token indexes that match.
# note this only identifies single words, not multi-word phrases.
def find_all_words_in_text(words, spacy_tokens):
    all_matches=[]
    for idx, token in enumerate(spacy_tokens):
        if token.lemma_ in words:
            all_matches.append(idx)
    return all_matches

# for a given set of token indexes, print out a window of words around each match, in the style of a concordance.
def print_concordance(matches, spacy_tokens, window=3):
    RED="\x1b[31m"
    BLACK="\x1b[0m"
    
    spacing=window*10
    for match in matches:
        start=match-window
        end=match+window+1
        if start < 0:
            start=0
        if end > len(spacy_tokens):
            end=len(spacy_tokens)
        pre=' '.join([token.text for token in spacy_tokens[start:match]])
        post=' '.join([token.text for token in spacy_tokens[match+1:end]])
        print("%s %s%s%s %s" % (pre.rjust(spacing), RED, spacy_tokens[match].text, BLACK, post))

# read a text, replacing all whitespace sequences with a single space
def read_text(filename):
    with open(filename, encoding="utf-8") as file:
        return re.sub(r"\s+", " ", file.read())
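
As noted in the comments for `find_all_words_in_text`, matching is done one token at a time, so multi-word entries such as "Prussian blue" can never match. One possible extension, sketched here on plain lists of lemma strings rather than spaCy tokens, is a greedy longest-match scan. The names `find_phrases` and `max_len` and the sample data are hypothetical and introduced only for illustration.

```python
def find_phrases(words, lemmas, max_len=3):
    """Return (start, end) spans whose space-joined lemmas appear in `words`.

    `lemmas` is a plain list of strings (e.g., [t.lemma_ for t in spacy_tokens]).
    At each position the longest matching n-gram wins, and matched tokens are skipped.
    """
    spans = []
    i = 0
    while i < len(lemmas):
        # try the longest candidate first, down to a single token
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            if " ".join(lemmas[i:i + n]) in words:
                spans.append((i, i + n))
                i += n
                break
        else:
            # no n-gram starting at position i matched
            i += 1
    return spans

# illustrative data only
targets = {"blue", "dark blue", "bottle green"}
lemmas = ["she", "wear", "a", "dark", "blue", "coat", "and", "a", "bottle", "green", "hat"]
spans = find_phrases(targets, lemmas)
print(spans)  # [(3, 5), (8, 10)]
```

Because the longest candidate is tried first, "dark blue" is reported as one span rather than as a separate match on "blue".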

In [34]:
# use Pride and Prejudice as an example
book=read_text("/Users/owenmonroe/Desktop/GitHub/TextMiningFall23/Lab7_Oct16_WordNet_WordEmbedding/Datasets/pride_and_prejudice.txt")
spacy_tokens=nlp(book)

# search through all the tokens in the spacy_tokens argument to find any mention of words in the synset or any of its hyponyms
def wordnet_search(synset, spacy_tokens):
    targets=get_words_in_hypo(synset)
    matches=find_all_words_in_text(targets, spacy_tokens)
    print(len(matches), "matches")
    print_concordance(matches, spacy_tokens)

Let's do a very coarse tagging of a document to find all of the mentions of a specific WordNet synset and all of its hyponyms. Using the functions above, find all the color terms in Pride and Prejudice. Because matching is on lemmas alone, with no word sense disambiguation, expect false positives (e.g., the "tone" of a voice).

In [35]:
wordnet_search(wn.synset("color.n.01"), spacy_tokens)

79 matches
                     he wore a blue coat , and
                    and rode a black horse . An
                   a bottle of wine a day .
                     I liked a red coat myself very
                  given to her complexion , and doubt
              till summoned to coffee . She was
                 walking , the tone of her voice
                   with a fine complexion and good -
                   , but their colour and shape ,
             Nicholls has made white soup enough ,
                        is _ a shade in a character
            reject the offered olive - branch .
                   idea of the olive - branch perhaps
                     come in a scarlet coat , and
                  in any other colour . As for
                 In a softened tone she declared herself
                . Both changed colou