# Domestic Bias in WordNet 
Code for the Appendix D analyses in "Quantifying Bias in Hierarchical Category Systems". The code below takes lists of wild and domestic starting nodes (starting synsets) and computes category-count, level, and descendant bias toward domestic mammal synsets in WordNet.

## Imports

In [1]:
from nltk.corpus import wordnet as wn
from statistics import mean, median, mode

## Load Data

In [2]:
with open('domestic_nodes.txt', 'r') as f:
    domestic_nodes = f.read().splitlines()

with open('wild_nodes.txt', 'r') as f:
    wild_nodes = f.read().splitlines()
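
The loading step above passes each line of the two files to `wn.synset` later on, so each line is presumably a WordNet synset name such as `dog.n.01`. A minimal sketch of a sanity check on the loaded lines follows; the helper `clean_node_names` and the regular expression are illustrative assumptions, not part of the original analysis, and the pattern only checks the shape of a name, not whether the synset exists in WordNet.

```python
import re

# Assumed shape of an NLTK synset name: lemma, POS tag (n, v, a, s, r),
# and a zero-padded sense number, e.g. "dog.n.01".
SYNSET_NAME = re.compile(r"^.+\.[nvasr]\.\d{2,}$")

def clean_node_names(lines):
    """Hypothetical helper: strip whitespace, skip blank lines, and
    separate names that do not look like synset names."""
    cleaned, rejected = [], []
    for line in lines:
        name = line.strip()
        if not name:
            continue  # ignore blank lines, e.g. a trailing newline
        if SYNSET_NAME.match(name):
            cleaned.append(name)
        else:
            rejected.append(name)
    return cleaned, rejected

cleaned, rejected = clean_node_names(["dog.n.01", "", " cat.n.01 ", "wolf"])
print(cleaned)   # ['dog.n.01', 'cat.n.01']
print(rejected)  # ['wolf']
```

Inspecting `rejected` before running the analysis would surface malformed lines early, rather than as a lookup error inside `wn.synset`.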


## Functions

In [3]:
'''
Count the number of direct and indirect hyponyms of a given synset.
A hyponym reachable through more than one path is counted once per path.
'''
def get_num_descendants(synset):
    kids = synset.hyponyms()
    num_descendants = 0 
    for kid in kids:
        num_descendants += get_num_descendants(kid)
    return len(kids) + num_descendants    

'''
Compute the depth of a synset. WordNet is not a strict hierarchy, so a
synset may have several hypernyms (several chains of hypernyms leading to
the root). In that case the depth is 1 plus the mean depth of its direct
hypernyms, applied recursively.
'''
def get_depth(synset):
    parents = synset.hypernyms()
    if len(parents) == 0:
        return 0 
    elif len(parents) == 1:
        return 1 + get_depth(parents[0])
    else:
        sub_depth = 0
        for syn in parents:
            sub_depth += get_depth(syn)
        avg_depth = sub_depth/len(parents)
    
    return 1 + avg_depth

'''
Collect the name of a node and of all its descendants
(direct and indirect hyponyms) into catList.
'''
def collect_nodes(synset, catList):
    catList.append(synset.name())
    for kid in synset.hyponyms():
        collect_nodes(kid, catList)

'''
Count the total number of unique nodes given 
a list of starting nodes. 
'''
def count_nodes(nodes):
    catList = []
    for node in nodes:
        synset = wn.synset(node)
        collect_nodes(synset, catList)
    return len(set(catList))  # duplicates are removed
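
The functions above differ in how they treat synsets with more than one hypernym. The sketch below reproduces the same recursion on a small hand-made graph so the behavior can be checked without WordNet. The graph and the `toy_*` names are hypothetical and chosen only for illustration.

```python
# Toy hypernym graph (hypothetical): each node maps to its direct parents.
PARENTS = {
    "entity": [],
    "animal": ["entity"],
    "pet": ["entity"],
    "mammal": ["animal"],
    "dog": ["mammal", "pet"],  # two parents, like a multi-hypernym synset
    "puppy": ["dog"],
}

# Invert the graph to get direct children (the analogue of hyponyms()).
CHILDREN = {}
for child, parents in PARENTS.items():
    CHILDREN.setdefault(child, [])
    for parent in parents:
        CHILDREN.setdefault(parent, []).append(child)

def toy_depth(node):
    """1 plus the mean depth of the direct parents; the root has depth 0."""
    parents = PARENTS[node]
    if not parents:
        return 0
    return 1 + sum(toy_depth(p) for p in parents) / len(parents)

def toy_num_descendants(node):
    """Path-based count: a node reached via two parents is counted twice."""
    kids = CHILDREN[node]
    return len(kids) + sum(toy_num_descendants(k) for k in kids)

def toy_unique_nodes(starts):
    """Unique nodes reachable from the starting nodes, starts included."""
    seen, stack = set(), list(starts)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN[node])
    return len(seen)

print(toy_depth("dog"))               # 2.5 = 1 + (2 + 1) / 2
print(toy_num_descendants("entity"))  # 7: "dog" and "puppy" counted twice
print(toy_unique_nodes(["entity"]))   # 6
```

In this toy graph the path-based descendant count (7) exceeds the number of distinct descendants (5), which is why the descendant statistics and the unique node count need not add up.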

In [4]:
def run_analysis(nodes, node_label):
    num_desc = 0
    avg_depth = 0
    descs = []
    node_count = count_nodes(nodes) # node count
    for node in nodes:
        synset = wn.synset(node)
        avg_depth += get_depth(synset) # node depth 
        desc = get_num_descendants(synset) # number of descendants
        num_desc += desc
        descs.append(desc)
    avg_depth = avg_depth / len(nodes)
    avg_num_desc = num_desc / len(nodes)
    # collect descendant counts for non-leaf nodes (nodes with 1+ hyponyms)
    non_leaf = [d for d in descs if d != 0]
    print(f'{node_label}\n\tStarting Node Count: {len(nodes)}\n\tNode Count: {node_count}\n\tAvg. Depth: {round(avg_depth, 2)}\n\tMean # of Descendants: {round(avg_num_desc, 2)}')
    print(f'\tMode # of Descendants: {round(mode(descs))}\n\tMedian # of Descendants: {round(median(descs))}')
    print(f'\tNumber of non-leaf starting nodes: {len(non_leaf)} ({round((len(non_leaf) / len(nodes))*100, 2)}%)\n\tMean # of desc per non-leaf start node: {round(mean(non_leaf), 2)}')
    print(f'\tMedian number of descendants per non-leaf starting nodes: {round(median(non_leaf), 2)}')
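
`run_analysis` reports the mean, median, and mode of the descendant counts and then repeats the mean and median for non-leaf starting nodes only. The sketch below isolates that summary step on a plain list of counts. The function name `summarize_descendants` and the `None` fallback are assumptions added for illustration: `statistics.mean` and `statistics.median` raise `StatisticsError` on an empty list, which would happen if every starting node were a leaf.

```python
from statistics import mean, median, mode

def summarize_descendants(descs):
    """Hypothetical helper mirroring the statistics printed by run_analysis."""
    non_leaf = [d for d in descs if d != 0]  # nodes with 1+ descendants
    return {
        "mean": mean(descs),
        "median": median(descs),
        "mode": mode(descs),
        "non_leaf_share": len(non_leaf) / len(descs),
        # Guard against an all-leaf input, where non_leaf is empty.
        "non_leaf_mean": mean(non_leaf) if non_leaf else None,
        "non_leaf_median": median(non_leaf) if non_leaf else None,
    }

summary = summarize_descendants([0, 0, 0, 2, 10])
print(summary["mean"], summary["median"], summary["mode"])  # 2.4 0 0
print(summary["non_leaf_share"], summary["non_leaf_mean"])  # 0.4 6
```

Because most starting nodes are leaves, the median and mode sit at or near zero while the mean is pulled up by a few large subtrees, so the non-leaf statistics give a clearer view of the subtrees that do exist.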


## Analysis
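There is evidence of bias toward domestic mammals in WordNet. The 19 domestic starting synsets expand to more unique synsets than the 309 wild ones (398 vs. 344), and domestic starting synsets have far more hyponyms (descendants) on average (mean 20.0 vs. 0.13; 63.16% vs. 7.77% of starting nodes are non-leaf). Average depth is nearly identical for the two groups (14.34 vs. 14.27).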

In [5]:
# Deduplicate the wild nodes and drop any that also appear in the domestic list.
wild_nodes = [node for node in list(set(wild_nodes)) if node not in domestic_nodes]
run_analysis(domestic_nodes, 'DOMESTIC NODES')
print("")
run_analysis(wild_nodes, 'WILD NODES')

DOMESTIC NODES
	Starting Node Count: 19
	Node Count: 398
	Avg. Depth: 14.34
	Mean # of Descendants: 20.0
	Mode # of Descendants: 0
	Median # of Descendants: 2
	Number of non-leaf starting nodes: 12 (63.16%)
	Mean # of desc per non-leaf start node: 31.67
	Median number of descendants per non-leaf starting nodes: 6.5

WILD NODES
	Starting Node Count: 309
	Node Count: 344
	Avg. Depth: 14.27
	Mean # of Descendants: 0.13
	Mode # of Descendants: 0
	Median # of Descendants: 0
	Number of non-leaf starting nodes: 24 (7.77%)
	Mean # of desc per non-leaf start node: 1.67
	Median number of descendants per non-leaf starting nodes: 1.0
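
As a quick consistency check on the figures printed above, the overall mean number of descendants should equal the non-leaf mean scaled by the share of non-leaf starting nodes, since leaf nodes contribute zero. The sketch below redoes that arithmetic from the rounded values in the output; the helper name `implied_mean` is hypothetical.

```python
def implied_mean(non_leaf_count, non_leaf_mean, start_count):
    """Overall mean implied by the non-leaf statistics (leaves add 0)."""
    return non_leaf_count * non_leaf_mean / start_count

# Values copied from the printed output above.
domestic = implied_mean(12, 31.67, 19)
wild = implied_mean(24, 1.67, 309)

print(round(domestic, 2))  # 20.0
print(round(wild, 2))      # 0.13
```

Both values match the reported `Mean # of Descendants` lines, up to the rounding of the inputs.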
