## A tool for generating relation pairs from wordnets

It supports any wordnet available through the `wn` package.
It exports the following pairs:
 * Hypernym-hyponym (hypernym, instance_hypernym)
 * Hypernym-*-hyponym (grand[grand]child of hypernym)
 * Holonym-meronym (meronym, mero_location, mero_member, mero_part, mero_portion, mero_substance)
 * Cohyponym (i.e., two hyponyms of the same hypernym)
 * Synonym (i.e., two senses of the same synset)
 * Antonym (sense-to-sense relation)
 
Generation is sped up by caching calls to the underlying SQLite database. Optionally, the database can be moved to a RAM disk.
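
The caching idea can be illustrated without a database. The sketch below is only an illustration: `slow_lookup` and `call_count` are hypothetical names standing in for a database-backed call, and `functools.cache` requires Python 3.9 or newer.

```python
from functools import cache

# Counts how many times the "expensive" function body really runs.
call_count = 0


@cache
def slow_lookup(key: str) -> str:
    """Hypothetical stand-in for a database-backed call."""
    global call_count
    call_count += 1
    return key.upper()


first = slow_lookup("dog")   # runs the function body
second = slow_lookup("dog")  # served from the cache, the body is skipped
third = slow_lookup("cat")   # new argument, runs the body again
```

Note that `@cache` keys on the arguments, so they have to be hashable, and the cache grows without bound for the lifetime of the process.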

In [21]:
import wn
from typing import List, Dict, Optional, Generator
from tqdm.notebook import tqdm

In [None]:
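# Optional: put the database on a RAM disk to speed everything up a bit.
# The commands below are macOS-specific; ram://2097152 is 2097152 512-byte sectors, i.e. 1 GiB.
# !diskutil erasevolume HFS+ 'RAMDisk' `hdiutil attach -nobrowse -nomount ram://2097152`
# !cp -r ~/.wn_data /Volumes/RAMDisk/

from pathlib import Path

# Note: _dbpath is a private attribute of wn.config; run this cell only if the RAM disk exists
wn.config._dbpath = Path("/Volumes/RAMDisk/.wn_data/wn.db")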

In [3]:
LEXICON_ID: str = "omw-en31"

In [4]:
REL_HYPERNYM: str = "hypernym"
REL_INSTANCE_HYPERNYM: str = "instance_hypernym"
REL_HOLONYM: str = "holonym"
REL_ANTONYM: str = "antonym"
REL_HYPERNYM_LEAP: str = "hypernym_leap_%s"

REL_SYNONYM: str = "synonym"
REL_COHYPONYM: str = "co_hyponym"


SYNSET_RELATIONS: List[str] = [
    # Covered by synset.hypernyms
    # REL_HYPERNYM,
    # REL_INSTANCE_HYPERNYM,
    
    # Covered computationally
    # Also hypernym_leap_1, hypernym_leap_2...

    # Covered by synset.meronyms
    # REL_HOLONYM,
    
    # Covered computationally
    # Also synonym
    # Also co_hyponym
]
    
# Sense-to-sense relation stats:
# [('derivation', 50397),
#  ('pertainym', 7920),
#  ('antonym', 7772),
#  ('is_exemplified_by', 390),
#  ('also', 324),
#  ('domain_region', 98),
#  ('participle', 73),
#  ('domain_topic', 12),
#  ('has_domain_topic', 11),
#  ('exemplifies', 8),
#  ('has_domain_region', 4),
#  ('similar', 2)]
    
SENSE_RELATIONS: List[str] = [
    REL_ANTONYM,
]

In [5]:
from collections import namedtuple

root: wn.Synset = wn.synset(id="omw-en31-00001740-n", lexicon=LEXICON_ID)
sample: wn.Synset = wn.synset(id="omw-en31-05990115-n", lexicon=LEXICON_ID)
sample2: wn.Synset = wn.synset(id="omw-en31-07961030-n", lexicon=LEXICON_ID)

Relation = namedtuple(
    "Relation",
    [
        "synset_id_left",
        "synset_id_right",
        "sense_id_left",
        "sense_id_right",
        "pos_left",
        "pos_right",
        "rel",
        "lemma_left",
        "lemma_right",
        "path_len",  # Min length of the path between two synsets on the hypernym/hyponym tree
        "level_left",  # Min depth on the hypernymy tree
        "level_right",
    ],
)

In [26]:
from itertools import combinations, product
from functools import cache

# Wrap expensive (database-backed) calls in the @cache decorator
@cache
def get_sense_lemma(sense: wn.Sense) -> str:
    return sense.word().lemma()

@cache
def get_shortest_path_len(synset_left: wn.Synset, synset_right: wn.Synset) -> int:
    try:
        return len(synset_left.shortest_path(synset_right))
    except wn.Error:
        # No path between the two synsets: return a large sentinel value
        return 1000
        

@cache
def get_level(synset: wn.Synset) -> int:
    return synset.min_depth()

@cache
def get_synset_from_sense(sense: wn.Sense) -> wn.Synset:
    return sense.synset()


def get_relation_record(
    sense_left: wn.Sense, sense_right: wn.Sense, rel_type: str
) -> Relation:
    synset_left: wn.Synset = get_synset_from_sense(sense_left)
    synset_right: wn.Synset = get_synset_from_sense(sense_right)

    return Relation(
        synset_id_left=synset_left.id,
        synset_id_right=synset_right.id,
        sense_id_left=sense_left.id,
        sense_id_right=sense_right.id,
        pos_left=synset_left.pos,
        pos_right=synset_right.pos,
        rel=rel_type,
        lemma_left=get_sense_lemma(sense_left),
        lemma_right=get_sense_lemma(sense_right),
        path_len=get_shortest_path_len(synset_left, synset_right),
        level_left=get_level(synset_left),
        level_right=get_level(synset_right),
    )


def export_hypernyms(
    hypernym: wn.Synset, hyponym: wn.Synset, curr_depth: int, max_depth: int
) -> List[Relation]:
    if curr_depth == 0:
        rel: str = REL_HYPERNYM
    else:
        rel = REL_HYPERNYM_LEAP % curr_depth

    res: List[Relation] = []
    for a, b in product(hypernym.senses(), hyponym.senses()):
        res.append(get_relation_record(sense_left=a, sense_right=b, rel_type=rel))

    if curr_depth < max_depth - 1:
        for child_hyponym in hyponym.hyponyms():
            res += export_hypernyms(
                hypernym=hypernym,
                hyponym=child_hyponym,
                curr_depth=curr_depth + 1,
                max_depth=max_depth,
            )

    return res


def extract_relations(synset: wn.Synset, hypernym_depth: int = 2) -> List[Relation]:
    res: List[Relation] = []

    # Synonyms
    for a, b in combinations(synset.senses(), 2):
        res.append(
            # TODO: add reverse relation?
            get_relation_record(sense_left=a, sense_right=b, rel_type=REL_SYNONYM)
        )

    # hypernyms:
    for hyponym in synset.hyponyms():
        res += export_hypernyms(
            hypernym=synset, hyponym=hyponym, curr_depth=0, max_depth=hypernym_depth
        )

    # holonyms:
    for meronym in synset.meronyms():
        for a, b in product(synset.senses(), meronym.senses()):
            res.append(
                get_relation_record(sense_left=a, sense_right=b, rel_type=REL_HOLONYM)
            )

    # cohyponyms:
    for hyp1, hyp2 in combinations(synset.hyponyms(), 2):
        # TODO: check for reverse relations?
        for a, b in product(hyp1.senses(), hyp2.senses()):
            res.append(
                get_relation_record(sense_left=a, sense_right=b, rel_type=REL_COHYPONYM)
            )

    # Sense 2 Sense relations
    for sense in synset.senses():
        for rel, related_senses in sense.relations(*SENSE_RELATIONS).items():
            for related_sense in related_senses:
                res.append(
                    get_relation_record(
                        sense_left=sense, sense_right=related_sense, rel_type=rel
                    )
                )

    return res
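
To make the depth labelling in `export_hypernyms` easier to follow, here is a minimal sketch of the same recursion on a toy taxonomy. It does not touch `wn`: `TOY_HYPONYMS` and `toy_export` are hypothetical names, and plain strings stand in for synsets.

```python
from typing import Dict, List, Tuple

# Hypothetical toy taxonomy: each word maps to its direct hyponyms.
TOY_HYPONYMS: Dict[str, List[str]] = {
    "animal": ["dog"],
    "dog": ["puppy"],
    "puppy": [],
}


def toy_export(
    hypernym: str, hyponym: str, curr_depth: int, max_depth: int
) -> List[Tuple[str, str, str]]:
    # Depth 0 is the direct relation; deeper levels are labelled as "leaps".
    rel = "hypernym" if curr_depth == 0 else "hypernym_leap_%s" % curr_depth
    res = [(hypernym, hyponym, rel)]

    # Descend while we are still allowed to go one level deeper.
    if curr_depth < max_depth - 1:
        for child in TOY_HYPONYMS[hyponym]:
            res += toy_export(hypernym, child, curr_depth + 1, max_depth)
    return res


pairs_depth_1: List[Tuple[str, str, str]] = []
pairs_depth_2: List[Tuple[str, str, str]] = []
for child in TOY_HYPONYMS["animal"]:
    pairs_depth_1 += toy_export("animal", child, curr_depth=0, max_depth=1)
    pairs_depth_2 += toy_export("animal", child, curr_depth=0, max_depth=2)
```

With `max_depth=1` only the direct pair is produced; with `max_depth=2` the grandchild is added with the `hypernym_leap_1` label.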

In [27]:
# extract_relations(root)

In [28]:
def get_all_relations(
    hypernym_depth: int = 2, first_n: Optional[int] = None
) -> Generator[Relation, None, None]:
    with tqdm(desc="Relations out") as pbar:
        for i, synset in enumerate(
            tqdm(wn.synsets(lexicon=LEXICON_ID), desc="Synsets in")
        ):
            for rel in extract_relations(synset, hypernym_depth=hypernym_depth):
                pbar.update(1)
                yield rel

            if first_n is not None and i + 1 >= first_n:
                break

In [29]:
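import csv

import smart_open

# smart_open.open picks the compression from the file extension;
# the built-in open() would write plain text despite the ".bz2" name.
with smart_open.open("all_relations_raw.depth3.csv.bz2", "w") as fp_out: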
    w = csv.DictWriter(fp_out, fieldnames=Relation._fields)
    w.writeheader()
    for rel in get_all_relations(hypernym_depth=3):
        w.writerow(rel._asdict())

Relations out: 0it [00:00, ?it/s]

Synsets in:   0%|          | 0/117791 [00:00<?, ?it/s]
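
A quick way to check that an export is really compressed and reads back correctly is a round trip on a small sample. The sketch below is an assumption-laden illustration: it uses the standard-library `bz2` module as a stand-in for the writer above, a temporary file, and two hypothetical rows with only three of the `Relation` columns.

```python
import bz2
import csv
import os
import tempfile

# Hypothetical sample rows; the real export has all Relation fields.
fieldnames = ["lemma_left", "lemma_right", "rel"]
sample_rows = [
    {"lemma_left": "dog", "lemma_right": "puppy", "rel": "hypernym"},
    {"lemma_left": "hot", "lemma_right": "cold", "rel": "antonym"},
]

path = os.path.join(tempfile.mkdtemp(), "sample.csv.bz2")

# Write a bz2-compressed CSV in text mode.
with bz2.open(path, "wt", newline="", encoding="utf-8") as fp_out:
    writer = csv.DictWriter(fp_out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(sample_rows)

# A bz2 stream starts with the magic bytes "BZh".
with open(path, "rb") as fp_raw:
    magic = fp_raw.read(3)

# Read the rows back; csv.DictReader yields one dict per row.
with bz2.open(path, "rt", newline="", encoding="utf-8") as fp_in:
    loaded = [dict(row) for row in csv.DictReader(fp_in)]
```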

## Some remarks on further dataset cleansing
 * The dataset is huge. Think twice before using ALL of the pairs.
 * You may (or may not) want to remove multi-word lemmas.
 * You definitely want to remove all **word** pairs that belong to more than one relation (this mostly happens because of polysemy).
 * For your convenience, each pair comes with the distance (`path_len`) between its two synsets on the hypernym/hyponym tree. You might want to filter out some pairs that are too close, for example, antonyms that share a hypernym.
 * Each pair also comes with the depth of each synset (`level_left`, `level_right`), i.e., its distance from the top of the tree. You might want to remove noun pairs that are too close to the top (the abstract part) of the wordnet.
 * You might merge the `hypernym_leap_*` relations with direct hypernym-hyponym relations (to see whether a classifier can learn a general notion of hypernymy), keep them separate (to see whether it can tell a direct hypernym-hyponym pair from an indirect one), or remove them altogether.
 * Build your train/val/test split carefully to avoid data leakage, for example, when a hypernym-hyponym pair ends up in train and a related hypernym-*-hyponym pair ends up in test.
 * Pay attention to the negative pairs (random relation). They must not appear in the relation dataset above (for obvious reasons), and the two words shouldn't be too close to each other either (i.e., the shortest path length should be more than, say, 7).
 * Pay attention to class imbalance (for example, there are far fewer antonyms than other relations).
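
Two of the remarks above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a full cleaning pipeline: the rows are hypothetical, `pair_key`, `drop_ambiguous_pairs` and `assign_split` are made-up helper names, and grouping the split by `lemma_left` is just one possible choice of grouping key.

```python
import hashlib
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set

Row = Dict[str, str]

# Hypothetical rows; only the columns needed here are shown.
rows: List[Row] = [
    {"lemma_left": "dog", "lemma_right": "puppy", "rel": "hypernym"},
    {"lemma_left": "puppy", "lemma_right": "dog", "rel": "co_hyponym"},
    {"lemma_left": "hot", "lemma_right": "cold", "rel": "antonym"},
    {"lemma_left": "car", "lemma_right": "automobile", "rel": "synonym"},
    {"lemma_left": "automobile", "lemma_right": "car", "rel": "synonym"},
]


def pair_key(row: Row) -> FrozenSet[str]:
    # Unordered word pair, so (a, b) and (b, a) count as the same pair.
    return frozenset((row["lemma_left"], row["lemma_right"]))


def drop_ambiguous_pairs(rows: List[Row]) -> List[Row]:
    # Collect every relation seen for each word pair ...
    rels_by_pair: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for row in rows:
        rels_by_pair[pair_key(row)].add(row["rel"])
    # ... and keep only pairs that carry exactly one relation.
    return [row for row in rows if len(rels_by_pair[pair_key(row)]) == 1]


def assign_split(group: str, val_pct: int = 10, test_pct: int = 10) -> str:
    # Deterministic bucket in [0, 100) derived from the group key, so all
    # rows sharing a key always land in the same split.
    bucket = int(hashlib.sha256(group.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


clean = drop_ambiguous_pairs(rows)
splits = [assign_split(row["lemma_left"]) for row in clean]
```

Grouping by `lemma_left` keeps a hypernym lemma and all its direct and leap pairs in one split, but the right-hand lemma can still appear in several splits; a stricter split would need a coarser grouping key.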