# ConceptNet Category Hyponyms
## Extract English Terms from ConceptNet

ConceptNet contains many different associations between words in a wide range of languages. For the wordlist generation, I am currently only interested in English nouns that are connected via an "is-a" relation, so that I can extract exemplars (hyponyms) for a category.

Later on, other relations might be interesting for a thorough analysis of gameplay, both human and machine.

To re-run this notebook, download the original data (the assertions file of ConceptNet version 5.7.0) from https://github.com/commonsense/conceptnet5/wiki/Downloads and put it into a data folder:
- `data/`
    - `conceptnet-assertions-5.7.0.csv`

In [1]:
with open("data/conceptnet-assertions-5.7.0.csv") as conceptnet:
    for line in conceptnet:
        print(line)
        break

/a/[/r/Antonym/,/c/ab/агыруа/n/,/c/ab/аҧсуа/]	/r/Antonym	/c/ab/агыруа/n	/c/ab/аҧсуа	{"dataset": "/d/wiktionary/en", "license": "cc:by-sa/4.0", "sources": [{"contributor": "/s/resource/wiktionary/en", "process": "/s/process/wikiparsec/2"}], "weight": 1.0}
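The line above has five tab-separated columns, the last of which is a JSON object. As a minimal sketch (the helper name `parse_assertion` is my own, and the sample line is shortened by leaving out the `sources` entry), a single line could be split into named fields like this:

```python
import json

def parse_assertion(line):
    """Split one tab-separated assertion line into named fields."""
    uri, relation, start, end, metadata = line.rstrip("\n").split("\t")
    return {
        "uri": uri,
        "relation": relation,
        "start": start,
        "end": end,
        "metadata": json.loads(metadata),  # the last column is a JSON object
    }

# Shortened copy of the first line of the assertions file (no "sources" entry).
sample = (
    "/a/[/r/Antonym/,/c/ab/агыруа/n/,/c/ab/аҧсуа/]\t/r/Antonym\t"
    "/c/ab/агыруа/n\t/c/ab/аҧсуа\t"
    '{"dataset": "/d/wiktionary/en", "license": "cc:by-sa/4.0", "weight": 1.0}'
)
parsed = parse_assertion(sample)
print(parsed["relation"], parsed["metadata"]["weight"])
```

The extraction further down only needs columns two to four, so the metadata column is ignored there.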



ConceptNet URIs consist of slash-separated segments (see also https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy). For example, `/c/en/aardvark/n` is the English noun concept "aardvark".

first segment: type
- /a/ - assertion or edge
- /c/ - concept

second segment: language code
- en - English

third segment: the word

fourth segment (optional): part of speech
- n - noun
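To make the segment layout concrete, here is a small sketch that splits a concept URI into its parts. The function name `split_concept_uri` is hypothetical and not part of ConceptNet; it only makes sense for `/c/` URIs, not for assertion URIs.

```python
def split_concept_uri(uri):
    """Return (type, language, term, pos) for a concept URI; pos may be None."""
    parts = uri.strip("/").split("/")
    # Pad with None so URIs without a part-of-speech tag still unpack cleanly.
    kind, language, term, pos = (parts + [None] * 4)[:4]
    return kind, language, term, pos

print(split_concept_uri("/c/en/aardvark/n"))  # ('c', 'en', 'aardvark', 'n')
print(split_concept_uri("/c/en/aardvark"))    # ('c', 'en', 'aardvark', None)
```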

In [2]:
def extract_noun_from_conceptnet(cn_word, require_POS=True):
    """Return the term of an English noun concept URI, or None if it does not qualify."""
    # '/c/en/word/n' splits into ['', 'c', 'en', 'word', 'n']
    word_parts = cn_word.split('/')

    # check that the URI denotes a concept and is an English word
    if len(word_parts) < 4 or word_parts[1] != 'c' or word_parts[2] != 'en':
        return None

    has_POS = len(word_parts) > 4
    # if a part-of-speech tag exists, it must mark a noun
    if has_POS and word_parts[4] != 'n':
        return None
    # without a tag, the word only qualifies if the tag is optional
    if require_POS and not has_POS:
        return None

    return word_parts[3]

with open("data/conceptnet-assertions-5.7.0.csv") as conceptnet, \
        open("data/conceptnet-nouns-en.csv", 'w') as en_conceptnet:
    count = 0
    for line in conceptnet:
        # columns: assertion URI, relation, start concept, end concept, JSON metadata
        uri, relationship, first_word_id, second_word_id, _ = line.split('\t')
        relationship = relationship.removeprefix('/r/')  # requires Python 3.9+

        # the first word must be tagged as a noun; the second may lack a POS tag
        first_word = extract_noun_from_conceptnet(first_word_id)
        second_word = extract_noun_from_conceptnet(second_word_id, require_POS=False)
        if first_word and second_word:
            count += 1
            en_conceptnet.write(f"{relationship}; {first_word}; {second_word}\n")

print(count)

1850333


While developing the function above, I recorded the following counts of matching assertions:
- all English words: `6356320`
- all English nouns and concepts without a part-of-speech tag: `2735738`
- all English nouns: `1850333`

The extracted English nouns are connected by the following relations:
- Antonym
- AtLocation
- DistinctFrom
- FormOf
- HasContext
- InstanceOf
- IsA (probably the only one relevant for Board Generation)
- PartOf
- RelatedTo
- SimilarTo
- Synonym
- UsedFor
- dbpedia/capital
- dbpedia/field
- dbpedia/genre
- dbpedia/genus
- dbpedia/influencedBy
- dbpedia/knownFor
- dbpedia/language
- dbpedia/leader
- dbpedia/occupation
- dbpedia/product
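A list like the one above can be produced by tallying the first field of every line in `data/conceptnet-nouns-en.csv`. The following is only a sketch under that assumption: `count_relations` is a name I made up, and the sample lines are invented stand-ins for the real file.

```python
from collections import Counter

def count_relations(lines):
    """Count how often each relation occurs in 'relation; first; second' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The relation is everything before the first ';'.
        relation = line.split(";", 1)[0].strip()
        counts[relation] += 1
    return counts

# Invented sample lines in the format written by the extraction cell.
sample_lines = [
    "IsA; aardvark; animal\n",
    "IsA; albatross; bird\n",
    "Synonym; movie; film\n",
]
for relation, n in count_relations(sample_lines).most_common():
    print(relation, n)
```

Passing the open file object instead of `sample_lines` would tally the whole extracted file.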

In [3]:
with open("data/conceptnet-nouns-en.csv") as conceptnet_nouns:
    with open("data/conceptnet-isa-relations.csv", 'w+') as conceptnet_isa_relations:
        count = 0
        for line in conceptnet_nouns:
            line = line.strip()
            relationship, first_word, second_word = line.split(';')
            
            if relationship == "IsA":
                count += 1
                conceptnet_isa_relations.write(f"{relationship}; {first_word}; {second_word}\n")
            
print(count)

173668


English noun pairs connected by an "is-a" relation: `173668`

## Getting Hyponyms for Categories

In [11]:
def get_hyponyms_for_categories(categories):
    hyponyms = {}
    with open("data/conceptnet-isa-relations.csv") as conceptnet_isa_relations:
        count = 0
        for line in conceptnet_isa_relations:
            line = line.strip()
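            # strip each field so the lookup does not depend on the exact spacing after ';'
            relationship, first_word, second_word = [part.strip() for part in line.split(';')]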

            if second_word in categories:
                if second_word not in hyponyms:
                    hyponyms[second_word] = set()
                hyponyms[second_word].add(first_word)
                count += 1

    print(count)
    return hyponyms
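To judge how well a category list is covered, it helps to look at the number of hyponyms per category and at the categories that received none. The helper below is a hypothetical sketch that works on a dictionary of the shape returned by `get_hyponyms_for_categories`; the example data is invented.

```python
def summarize_coverage(categories, hyponyms):
    """Return (sizes, missing): hyponym counts per category and categories without any."""
    sizes = {category: len(hyponyms.get(category, ())) for category in categories}
    missing = [category for category, size in sizes.items() if size == 0]
    return sizes, missing

# Invented example data, not taken from ConceptNet.
example_categories = ["animal", "bird", "musical_instrument"]
example_hyponyms = {"animal": {"aardvark", "badger"}, "bird": {"albatross"}}

sizes, missing = summarize_coverage(example_categories, example_hyponyms)
print(sizes)
print(missing)
```

Category names containing spaces, such as `jungle animal`, cannot match ConceptNet terms, which join words with underscores (e.g. `american_goldfinch`), so they will always show up as missing unless they are normalized first.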

In [5]:
import json
with open('../category lists/schroeder et al.json') as file:
    categories_schroeder = json.load(file)["categories"]
    print(categories_schroeder)

['animal', 'bird', 'fruit', 'vegetable', 'clothing', 'furniture', 'vehicle', 'tool', 'musical_instrument', 'profession', 'sport']


In [12]:
get_hyponyms_for_categories(categories_schroeder) # 1019

1019


{'animal': {'1_sex',
  'aardvark',
  'aardwolf',
  'acrodont',
  'adult',
  'albatross',
  'allosaurus',
  'american_goldfinch',
  'animal',
  'animal_human_food_source',
  'apatosaurus',
  'ape',
  'arapaima',
  'archimedes',
  'arctotherium',
  'armadillidiidae',
  'arthropod',
  'asexual_organism',
  'asian_giant_hornet',
  'athletic_physique',
  'auk',
  'australian_magpie',
  'average_physical_build',
  'awake_thing',
  'axis',
  'aye_aye',
  'baboon',
  'badger',
  'barreleye',
  'bat',
  'bear',
  'bee_eater',
  'bigfin_reef_squid',
  'biped',
  'bird',
  'black_mamba',
  'blobfish',
  'bonobo',
  'brazilian_wandering_spider',
  'bull_shark',
  'bullock',
  'calf',
  'camel',
  'cancer',
  'canine',
  'cannibal',
  'captive',
  'captive_animal',
  'caribou',
  'castoroides',
  'cephalopod',
  'channichthyidae',
  'child',
  'chimpanzee',
  'choerodon_fasciatus',
  'chordate',
  'cicada',
  'coelomate',
  'cold_blooded_animal',
  'coon',
  'coypu',
  'crawling_posture',
  'creepy

In [13]:
with open('../category lists/jaramillo et al cleaned.json') as file:
    categories_jaramillo = json.load(file)["categories"]
    print(categories_jaramillo)

['animal', 'body', 'clothes', 'color', 'day', 'dessert', 'food', 'relative', 'room', 'shape', 'sound', 'toy', 'beverage', 'bird', 'building', 'coin', 'collectable', 'condiment', 'container', 'dinosaur', 'direction', 'emotion', 'flower', 'fruit', 'holiday', 'ingredient', 'insect', 'instrument', 'job', 'jungle animal', 'liquid', 'measure', 'month', 'movie', 'story', 'pattern', 'planet', 'plant', 'reptile', 'rhyming', 'season', 'sense', 'silverware', 'size', 'solid', 'sport', 'transportation', 'tool', 'vegetable', 'writing', 'ability', 'businesses', 'city', 'country', 'communication', 'continent', 'currency', 'exercise', 'habitat', 'hazard', 'mammal', 'material', 'metal', 'ocean', 'president', 'school subject', 'seasoning', 'state', 'symbol', 'texture', 'tree', 'weather', 'colony', 'ancient civilization', 'constellation', 'cuisine', 'element', 'landmark', 'government type', 'gas', 'gem', 'organ', 'language', 'mineral', 'mountain', 'music type', 'precipitation', 'book', 'religion', 'tradit

In [14]:
get_hyponyms_for_categories(categories_jaramillo) # 5635

5635


{'movie': {'1_movie_genre',
  'action_movie',
  'adult_movie',
  'adventure_movie',
  'ai_gone_wrong_movie',
  'animated_movie',
  'batman_movie',
  'biblical_movie',
  'celebrity_sex_tape',
  "children's_movie",
  'cinema_verite',
  'classic_movie',
  'collage_film',
  'comedy_movie',
  'coming_attraction',
  'crime_movie',
  'cult_movie',
  'documentary',
  'documentary_film',
  'drama_movie',
  'educational_movie',
  'experimental_movie',
  'family_movie',
  'fantasy_movie',
  'feature',
  'film_noir',
  'final_cut',
  'foreign_film',
  'free_movie',
  'g',
  'history_movie',
  'home_movie',
  'horror_movie',
  'imax',
  'independent_film',
  'musical',
  'mystery_movie',
  'nc_17',
  'not_yet_rated_rating',
  'pg',
  'pg_13',
  'r',
  'romance_movie',
  'rough_cut',
  'satirical_movie',
  'science_fiction_movie',
  'shoot_em_up',
  'short_subject',
  'silent_movie',
  'skin_flick',
  'slow_motion',
  'sports_movie',
  'star_wars_movie',
  'superman_movie',
  'suspense_movie',
  'ta

The category exemplars are either very sparse (especially given the number of categories taken from Jaramillo et al.) or, again, spread over different levels of the hierarchy, with some of the more typical items on the direct level. I would again need to retrieve hyponyms recursively, as I tried for WordNet, with the caveat that the levels of the hierarchy diverge more and more. I will therefore look for another dataset that already contains categories with useful, typical exemplars.
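For reference, the recursive retrieval mentioned above could look roughly like the following sketch. It is not what this notebook runs: `collect_hyponyms`, the `max_depth` parameter, and the toy pairs are assumptions made for illustration. It walks down the is-a edges breadth-first and records the depth at which each hyponym was reached, which makes the diverging hierarchy levels visible.

```python
from collections import defaultdict, deque

def collect_hyponyms(isa_pairs, category, max_depth=2):
    """Breadth-first walk down is-a edges; returns {hyponym: depth}."""
    children = defaultdict(set)
    for hyponym, hypernym in isa_pairs:
        children[hypernym].add(hyponym)

    depths = {}
    queue = deque([(category, 0)])
    while queue:
        term, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in children[term]:
            # Skip the category itself and terms already reached (guards against cycles).
            if child == category or child in depths:
                continue
            depths[child] = depth + 1
            queue.append((child, depth + 1))
    return depths

# Invented toy pairs (hyponym, hypernym); 'animal' is-a 'animal' mirrors the
# self-reference visible in the ConceptNet output above.
toy_pairs = [
    ("bird", "animal"),
    ("albatross", "bird"),
    ("animal", "animal"),
    ("wandering_albatross", "albatross"),
]
print(collect_hyponyms(toy_pairs, "animal", max_depth=2))
```

A depth limit keeps the result manageable, but it does not solve the underlying problem that typical exemplars sit at different depths for different categories.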