# Get Synsets

Code to link Wikipedia pages for mammal species to WordNet synsets using the [BabelNet](https://babelnet.org/) API. This code supports the analyses in Appendix D of "Quantifying Bias in Hierarchical Category Systems". BabelNet v5.3 was accessed from November 26 to November 29, 2023 for use in this paper.

Mammal Wikipedia IDs ("mammal_wiki_ids.txt") were generated by scraping the Wikipedia page IDs for mammal species from various mammal lists on Wikipedia. The pages are listed below:

- https://en.wikipedia.org/wiki/List_of_placental_mammals
- https://en.wikipedia.org/wiki/List_of_rodents
- https://en.wikipedia.org/wiki/List_of_primates 
- https://en.wikipedia.org/wiki/List_of_lagomorphs 
- https://en.wikipedia.org/wiki/List_of_bats 
- https://en.wikipedia.org/wiki/List_of_carnivorans 
- https://en.wikipedia.org/wiki/List_of_artiodactyls 
- https://en.wikipedia.org/wiki/List_of_monotremes_and_marsupials

Domestic Wikipedia IDs ("domestic_wiki_ids.txt") were generated by manually selecting the Wikipedia page IDs of mammals from the Domesticated Animals table on the following page: https://en.wikipedia.org/wiki/List_of_domesticated_animals. These Wikipedia pages are subject to change; all analyses in the paper are based on the pages as they were scraped on November 26, 2023.



Instructions for obtaining a BabelNet API key are found here: https://babelnet.org/guide.
Due to API request limits, this code may have to be run in chunks over multiple days. The current position in the list of Wikipedia IDs is saved to "current_idx.txt", and parsed WordNet synsets are saved to either "wild_nodes.txt" or "domestic_nodes.txt".
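
The resume-from-checkpoint idea described above can be sketched in isolation. This is a minimal illustration, not the notebook's own code: the helper names `load_index` and `save_index` are hypothetical, and the demonstration writes to a temporary directory rather than the real "current_idx.txt".

```python
import os
import tempfile

def load_index(path):
    """Return the saved list position, or 0 if no checkpoint exists yet."""
    try:
        with open(path, 'r') as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def save_index(path, idx):
    """Overwrite the checkpoint file with the current list position."""
    with open(path, 'w') as f:
        f.write(str(idx))

# Demonstration in a temporary directory so no real checkpoint is touched
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, 'current_idx.txt')
    start = load_index(checkpoint)        # no file yet, so this is 0
    save_index(checkpoint, start + 250)   # pretend 250 IDs were handled
    resumed = load_index(checkpoint)      # a later run would start here
```

Storing only the position keeps the checkpoint small; the synsets themselves are appended to the output files separately.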


## Imports

In [8]:
from nltk.corpus import wordnet as wn
import requests

## Functions

In [10]:
BAD_IDS = []

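def get_synsets(wiki_ids, key):
    '''
    Find Wikipedia pages in BabelNet and extract their corresponding
    WordNet synset if one exists.

    Returns (synsets, count, flag): the synsets found (None if no key is
    given), the number of IDs handled, and a flag that is False when the
    key is missing or a request failed before the list was finished.
    '''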
    not_in_wordnet = []
    synsets = []
    count = 0 # place in list
    records_process = 0 # how many records were actually processed
    flag = True
    if key == '':
        print('No BabelNet API key')
        return None, count, False
    for wiki_id in wiki_ids:
        url1 = f'https://babelnet.io/v8/getSynsetIdsFromResourceID?id={wiki_id}&searchLang=EN&pos=NOUN&source=WIKI&key={key}'
        response = requests.get(url1)
        # Stop on any failed request (e.g., when the API request limit is reached)
        if response.status_code != 200:
            print(response.status_code)
            print(response.json())
            flag = False
            break
        # If no BabelNet entry is found, retry with the lowercased wiki ID
        babel = response.json()
        if len(babel) == 0:
            url1 = f'https://babelnet.io/v8/getSynsetIdsFromResourceID?id={wiki_id.lower()}&searchLang=EN&pos=NOUN&source=WIKI&key={key}'
            response = requests.get(url1)
            if response.status_code != 200:
                print(response.status_code)
                print(response.json())
                flag = False
                break
            # If the wiki ID does not link to an entry in BabelNet
            babel = response.json()
            if len(babel) == 0:
                print(f'ERROR: problem with wiki ID {wiki_id}')
                BAD_IDS.append(wiki_id)
                count += 1
                continue
        babel_id = babel[0]['id']
        url2 = f'https://babelnet.io/v8/getSynset?id={babel_id}&key={key}'
        response = requests.get(url2)
        # If the API request limit is reached
        if response.status_code != 200:
            print(response.status_code)
            print(response.json())
            flag = False
            break
        senses = response.json()['senses']
        synset_data = [data for data in senses if data['type'] == 'WordNetSense']
        # If the BabelNet entry is not linked to a WordNet synset
        if len(synset_data) == 0:
            not_in_wordnet.append(wiki_id)
        else: 
            # Extract WordNet synset
            offset = synset_data[0]['properties']['wordNetOffset']
            synsets.append(wn.of2ss(offset))
        records_process += 1
        count += 1
        if records_process % 10 == 0:
            print(f'{records_process} records processed')
    print(f'{records_process} records processed')
    return synsets, count, flag
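
The sense-filtering step inside `get_synsets` can be tried without calling the API. The sketch below is an assumption-laden illustration: the helper name `first_wordnet_offset` is hypothetical, the field names (`senses`, `type`, `properties`, `wordNetOffset`) are the ones the function above reads, and the sample dictionaries are made-up, heavily trimmed stand-ins rather than real BabelNet responses.

```python
def first_wordnet_offset(synset_json):
    """Return the offset of the first WordNet sense, or None if there is none."""
    for sense in synset_json.get('senses', []):
        if sense.get('type') == 'WordNetSense':
            return sense['properties']['wordNetOffset']
    return None

# Hypothetical responses; the offset is a placeholder, not a real synset
linked = {'senses': [
    {'type': 'WikipediaSense', 'properties': {}},
    {'type': 'WordNetSense', 'properties': {'wordNetOffset': '01234567n'}},
]}
unlinked = {'senses': [{'type': 'WikipediaSense', 'properties': {}}]}

offset_found = first_wordnet_offset(linked)
offset_missing = first_wordnet_offset(unlinked)
```

A page whose entry yields `None` here corresponds to the `not_in_wordnet` case in the function above.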

## Load IDs

In [11]:
with open('mammal_wiki_ids.txt', 'r') as f:
    mammalWiki = f.read().splitlines()

with open('domestic_wiki_ids.txt', 'r') as f:
    domesticWiki = f.read().splitlines()

print(f'There are {len(mammalWiki)} mammal wikipedia IDs and {len(domesticWiki)} domestic.')

There are 5559 mammal wikipedia IDs and 28 domestic.


## Link Data
Link Wikipedia pages to WordNet synsets using the BabelNet API. Variable "key" must be set to your BabelNet API key.
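
One way to avoid pasting the key into the notebook is to read it from an environment variable. This is only a sketch: the helper `read_api_key` and the variable name `BABELNET_API_KEY` are assumptions chosen for illustration, not something BabelNet or this repository defines.

```python
import os

def read_api_key(var_name='BABELNET_API_KEY'):
    """Return the key stored in the given environment variable, or '' if unset."""
    return os.environ.get(var_name, '').strip()
```

Setting `key = read_api_key()` would then work with the cells below, since `get_synsets` already returns early when the key is an empty string.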

In [26]:
key = '' #BabelNet API Key
try:
    with open('current_idx.txt', 'r') as f:
        idx = int(f.read().splitlines()[0])
except FileNotFoundError:
    idx = 0

In [None]:
flag = True
while idx < len(mammalWiki) and flag:
    print(f'Start Idx {idx}')
    wild_nodes, count, flag = get_synsets(mammalWiki[idx:], key)
    if wild_nodes is not None:
        # Save mammal synsets found so far
        with open('wild_nodes.txt', 'a') as f:
            for node in wild_nodes:
                if node is not None:
                    f.write(f'{node.name()}\n')
    # Advance past the IDs handled in this pass and save the position, so the
    # loop terminates once the list is finished and a later run can resume
    idx += count
    with open('current_idx.txt', 'w') as f:
        f.write(f'{idx}')
    print(f'End Idx {idx}')

In [None]:
domestic_nodes, count, flag = get_synsets(domesticWiki, key)
if domestic_nodes is not None: 
    with open('domestic_nodes.txt', 'a') as f:
        for node in domestic_nodes:
            if node is not None:
                f.write(f'{node.name()}\n')
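
Because both output files are opened in append mode, rerunning a cell can write the same synset name more than once. If duplicates are unwanted (an assumption; the notebook does not say how repeats are treated downstream), a small order-preserving filter such as the hypothetical `unique_lines` below could be applied when reading the files back.

```python
def unique_lines(lines):
    """Drop blank lines and repeats while keeping first-seen order."""
    seen = set()
    result = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result

# Hypothetical file contents after an overlapping rerun
sample = ['dog.n.01\n', 'cat.n.01\n', 'dog.n.01\n', '\n']
cleaned = unique_lines(sample)
```

The same function can be given an open file handle, for example `unique_lines(open('wild_nodes.txt'))`, since it only iterates over lines.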