# Exploring Metadata Keyword Relationships with word2vec

This notebook demonstrates using a pretrained word2vec model to explore how keywords are related to other words and other keywords. 

The examples of how to use pretrained models, and some of the code, were taken from https://github.com/3Top/word2vec-api


This notebook runs on the docker jupyter/datascience-notebook container, with gensim additionally installed as shown below.

In [None]:
%%bash
pip install gensim

In [197]:
from gensim.models.word2vec import Word2Vec as w
import gensim.models
from gensim import utils, matutils

## Loading a pretrained word2vec model

This model was sourced from http://nlp.stanford.edu/data/glove.6B.zip. To use it you'll need to download it yourself, as it's too big to package alongside this notebook.

To load it successfully, the file has been modified by adding a new first line, '400000 300', above the rest of the data. This is the header the word2vec text format expects: the number of words followed by the vector size.
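The header can also be added programmatically rather than by hand. The following is a minimal sketch, not the method used for this notebook: `add_word2vec_header` is a hypothetical helper name, and it assumes the GloVe file is space-separated with one word per line. Newer gensim releases are understood to offer a `no_header=True` argument to `load_word2vec_format` that may make this step unnecessary; check the documentation for your installed version.

```python
def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending a '<vocab_size> <dimensions>' line.

    Hypothetical helper: assumes one word per line followed by
    space-separated vector components.
    """
    vocab_size = 0
    dimensions = None
    # First pass: count the rows and read the vector width from the first row.
    with open(glove_path, encoding="utf-8") as source:
        for line in source:
            if dimensions is None:
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then stream the original rows unchanged.
    with open(glove_path, encoding="utf-8") as source, \
            open(output_path, "w", encoding="utf-8") as target:
        target.write(f"{vocab_size} {dimensions}\n")
        for line in source:
            target.write(line)
    return vocab_size, dimensions
```

Writing to a separate output file keeps the downloaded original intact, at the cost of some extra disk space.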

In [204]:
%%bash 
head -2 ./glove.6B.300d.txt

400000 300 
the 0.04656 0.21318 -0.0074364 -0.45854 -0.035639 0.23643 -0.28836 0.21521 -0.13486 -1.6413 -0.26091 0.032434 0.056621 -0.043296 -0.021672 0.22476 -0.075129 -0.067018 -0.14247 0.038825 -0.18951 0.29977 0.39305 0.17887 -0.17343 -0.21178 0.23617 -0.063681 -0.42318 -0.11661 0.093754 0.17296 -0.33073 0.49112 -0.68995 -0.092462 0.24742 -0.17991 0.097908 0.083118 0.15299 -0.27276 -0.038934 0.54453 0.53737 0.29105 -0.0073514 0.04788 -0.4076 -0.026759 0.17919 0.010977 -0.10963 -0.26395 0.07399 0.26236 -0.1508 0.34623 0.25758 0.11971 -0.037135 -0.071593 0.43898 -0.040764 0.016425 -0.4464 0.17197 0.046246 0.058639 0.041499 0.53948 0.52495 0.11361 -0.048315 -0.36385 0.18704 0.092761 -0.11129 -0.42085 0.13992 -0.39338 -0.067945 0.12188 0.16707 0.075169 -0.015529 -0.19499 0.19638 0.053194 0.2517 -0.34845 -0.10638 -0.34692 -0.19024 -0.2004 0.12154 -0.29208 0.023353 -0.11618 -0.35768 0.062304 0.35884 0.02906 0.0073005 0.0049482 -0.15048 -0.12313 0.19337 0.12173 0.44503 0.25147 0.10781 -0.

In [199]:
model = gensim.models.KeyedVectors.load_word2vec_format('./glove.6B.300d.txt', binary=False)

## Keywords describing datasets

A keywords.json file was produced from backend data harvested as part of populating the knowledge network http://kn.csiro.au

In [200]:
import json

In [201]:
with open('./keywords.json') as keywords_json_file:
    raw_keywords = json.load(keywords_json_file)

In [128]:
keywords_dict = [keyword for keyword in raw_keywords if 'keywords' in keyword and keyword['keywords'] is not None]

In [131]:
for anItem in keywords_dict[0:10]:
    print(anItem)

{'_id': '59159e9b3dd97b0a4c165378', 'keywords': ['entomology', 'insects', 'spiders', 'fauna']}
{'_id': '59159e9b3dd97b0a4c165379', 'keywords': ['plants']}
{'_id': '59159e9b3dd97b0a4c16537b', 'keywords': ['microbial']}
{'_id': '59159e9b3dd97b0a4c16537e', 'keywords': ['plants']}
{'_id': '59159e9b3dd97b0a4c16537f', 'keywords': ['herbarium', 'herbaria', 'angiosperms', 'dicots', 'monocots', 'pteridophytes', 'mosses', 'liverworts', 'lichens', 'fungi', 'algae', 'plants']}
{'_id': '59159e9b3dd97b0a4c165380', 'keywords': ['microbial']}
{'_id': '59159e9b3dd97b0a4c165381', 'keywords': ['microbial']}
{'_id': '59159e9b3dd97b0a4c165382', 'keywords': ['microbes']}
{'_id': '59159e9b3dd97b0a4c165383', 'keywords': ['arachnids', 'spiders', 'myriapods', 'centipedes', 'millipedes', 'onychophorans', '"velvet', 'worms"', 'tardigrades', '"water', 'bears"', 'fauna']}
{'_id': '59159e9b3dd97b0a4c165384', 'keywords': ['entomology', 'insects', 'hexapoda']}


## Keyword cleaning and considerations

Keywords are flattened to a unique list. However, many of the keywords aren't single words, and these aren't currently handled.

In [132]:
flat_keywords = [keyword for keyword_item in keywords_dict for keyword in keyword_item['keywords']]

In [133]:
unique_keywords = list(set(flat_keywords))

In [105]:
len(unique_keywords)

1912
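One possible way to start handling multi-word keywords is to split each keyword into lowercase word tokens before looking them up. This is only a sketch under assumptions: `tokenize_keyword` is a hypothetical helper, the sample keywords are taken from the outputs in this notebook, and lowercasing is assumed to be appropriate because the glove.6B vocabulary is lowercase.

```python
import re

def tokenize_keyword(keyword):
    """Split a keyword phrase into lowercase word tokens.

    Hypothetical helper: keeps only runs of ASCII letters and digits, so
    punctuation and stray quote characters are dropped (as are accented
    characters, which may or may not be acceptable for your data).
    """
    return re.findall(r"[a-z0-9]+", keyword.lower())

# Sample keywords copied from the notebook outputs.
sample_keywords = ["risk assessment", "Western Australia Coast South", "sea", '"velvet']
tokens_by_keyword = {keyword: tokenize_keyword(keyword) for keyword in sample_keywords}
```

Each token could then be looked up individually, although how to combine the token vectors back into one vector per keyword (for example by averaging) is a separate design decision not covered here.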

## Finding similar words

Here we find the top 10 words most similar to a keyword. Note that this doesn't find the most similar other keywords; it finds words from the corpus the vectors were trained on (in this case Wikipedia + Gigaword), only a small number of which are existing keywords. Also note that not all keywords are in the corpus. If a keyword isn't present, an error is thrown.

In [155]:
for keyword in unique_keywords[1:10]:
    print(keyword)
    try:
        print(str(model.most_similar_cosmul(positive=[keyword], topn=10)) + "\n")
    except KeyError:
        # raised when the keyword is not in the model vocabulary
        print("Word not in corpus" + "\n")

Workspace
Word not in corpus

Ferrous iron index
Word not in corpus

risk assessment
Word not in corpus

pulsars, neutron stars, BPSR, HIPSR
Word not in corpus

Oenanthe pimpinelloides
Word not in corpus

sea
[('ocean', 0.8422356247901917), ('waters', 0.8335513472557068), ('seas', 0.8311473727226257), ('mediterranean', 0.7880582213401794), ('coast', 0.7716078162193298), ('coastal', 0.7678077816963196), ('aegean', 0.7616022825241089), ('arctic', 0.7600924968719482), ('ships', 0.7546911239624023), ('ship', 0.7542240619659424)]

Western Australia Coast South
Word not in corpus

ICT Centre
Word not in corpus

clams
[('mussels', 0.8888778686523438), ('oysters', 0.8605103492736816), ('scallops', 0.835378110408783), ('clam', 0.8343192338943481), ('crabs', 0.8148018717765808), ('lobsters', 0.8023141622543335), ('shrimp', 0.7868390083312988), ('squid', 0.7804455757141113), ('prawns', 0.7790178060531616), ('shellfish', 0.7775545120239258)]
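Instead of catching the error, the keywords could be checked against the vocabulary up front. The sketch below uses a plain Python set as a stand-in for the model; it assumes the loaded model supports the `in` operator, which gensim's `KeyedVectors` is understood to do. `split_by_vocabulary` and `toy_vocabulary` are hypothetical names used for illustration.

```python
def split_by_vocabulary(keywords, vocabulary):
    """Return (known, unknown) keyword lists, preserving input order.

    `vocabulary` can be anything supporting the `in` operator.
    """
    known = [keyword for keyword in keywords if keyword in vocabulary]
    unknown = [keyword for keyword in keywords if keyword not in vocabulary]
    return known, unknown

# Toy stand-in for the model vocabulary (an assumption for this example).
toy_vocabulary = {"sea", "clams", "ocean"}
known, unknown = split_by_vocabulary(
    ["Workspace", "sea", "ICT Centre", "clams"], toy_vocabulary
)
```

Keeping the unknown keywords in their own list makes it easy to report how much of the keyword set the model cannot cover.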



## Finding only similar existing keywords

In some use cases we might only want to find existing keywords that are similar to other keywords. This might help a user select a new keyword based on keywords similar to one already selected. The best way to do this might be to reset the word2vec model to use the keywords as its corpus and then "copy" the weights from an existing model (because we can't easily train specifically on the small keyword corpus). An alternative is to "brute force" this by building ordered dictionaries, so that each keyword is associated with a list of all other keywords ordered by similarity. This is not implemented yet, but might begin like the following:

In [207]:
print(unique_keywords[6] + '\n')
for keyword in unique_keywords[1:30]:
    try:
        print(keyword + ': ' + str(model.similarity(keyword, unique_keywords[6])))
    except KeyError:
        # keyword is not in the model vocabulary, so skip it
        pass

sea

sea: 1.0
clams: 0.224150595646
entomology: -0.138087730563
fiber: 0.110444760662
ethanol: -0.0278519565875
rumen: -0.0532214256698
atlas: 0.0858158608241
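The "brute force" idea described above could be sketched roughly as follows. This is an illustration only: it uses small made-up two-dimensional vectors instead of the real model, computes cosine similarity in plain Python, and `rank_keywords_by_similarity` is a hypothetical name. With 1912 unique keywords the all-pairs comparison is small enough that efficiency is unlikely to matter much.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_keywords_by_similarity(vectors):
    """Map each keyword to all other keywords, most similar first."""
    ranking = {}
    for keyword, vector in vectors.items():
        others = [
            (other, cosine_similarity(vector, other_vector))
            for other, other_vector in vectors.items()
            if other != keyword
        ]
        ranking[keyword] = sorted(others, key=lambda item: item[1], reverse=True)
    return ranking

# Made-up vectors for illustration; real vectors would come from the model.
toy_vectors = {
    "sea": [1.0, 0.0],
    "ocean": [0.9, 0.1],
    "entomology": [0.0, 1.0],
}
ranking = rank_keywords_by_similarity(toy_vectors)
```

Looking up `ranking["sea"]` then gives the other keywords ordered by similarity, which is the structure a keyword suggestion feature would need.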


## What else might we be able to do with word2vec vectors and keywords

One thing that might be possible is to look at how closely related the keywords attributed to a dataset actually are. We can compare word vector distances ourselves, and so calculate the average distance between a record's keywords.

In [None]:
model.word_vec(unique_keywords[6])

In [211]:
import numpy, itertools, collections

average_keyword_dist_record = {}

def get_word_vec(word, word_vec):
    # return the cached vector (or cached None) if this word was already looked up
    if word in word_vec:
        return word_vec[word]
    try:
        a_vec = model.word_vec(word)
    except KeyError:
        # word is not in the model vocabulary
        a_vec = None
    word_vec[word] = a_vec
    return a_vec

for record in keywords_dict:    
    metadata_record_keywords = record['keywords']
    total_distance = 0
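    # count starts at 1 (not 0) so records with no usable pairs avoid a division by zero;
    # as a result each average below is total_distance / (number of pairs + 1)
    count = 1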
    word_vec = {}
    sets = [frozenset(pair) for pair in itertools.product(metadata_record_keywords, metadata_record_keywords)]
    unique_pairs = set(sets)
    for pair in unique_pairs:
        pair = list(pair)
        if len(pair) < 2:
            continue
        else:
            a_vec = get_word_vec(pair[0], word_vec)
            b_vec = get_word_vec(pair[1], word_vec)
            if a_vec is not None and b_vec is not None:
                distance = numpy.linalg.norm(a_vec-b_vec)
            else:
                continue
            count += 1
            total_distance += distance
    average_distance = total_distance / count
    if average_distance in average_keyword_dist_record.keys():
        average_keyword_dist_record[average_distance].append(record)
    else:
        average_keyword_dist_record[average_distance] = [record]

# first 10 results to keep notebook output smallish
for result in list(collections.OrderedDict(sorted(average_keyword_dist_record.items(), reverse=True)).items())[0:10]:
    print(str(result) + '\n')



(9.4812989864709234, [{'_id': '5915a6be3dd97b0a4c19fcfa', 'keywords': ['soil', 'agriculture', 'crops', 'pastures', 'carbon', 'sequestration', 'baseline', 'stocks', 'fractions', 'nitrogen', 'carbonate', 'particulate', 'humus', 'resistant', 'charcoal']}])

(8.8938780725002289, [{'_id': '5915a6be3dd97b0a4c19fd45', 'keywords': ['invertebrates', 'insects', 'biosecurity', 'taxonomy', 'arthropods', 'systematic']}])

(8.8470295563988071, [{'_id': '5915a6be3dd97b0a4c19fd04', 'keywords': ['temperature', 'salinity', 'depth', 'fluorescence', 'oxygen', 'PAR', 'ocean', 'phosphate', 'nitrate', 'ammonium', 'silicate', 'Australia, Western Australia coast', 'East Indian Ocean', 'Indian Ocean']}])

(8.8263469372155523, [{'_id': '59159e9b3dd97b0a4c165387', 'keywords': ['malacology', 'molluscs', 'mollusc', 'chitons', 'clams', 'mussels', 'snails', 'nudibranchs', 'sea', 'slugs', 'tusk', 'shells', 'octopus', 'squid', 'fauna']}])

(8.7827231287956238, [{'_id': '5915a6be3dd97b0a4c19fcc8', 'keywords': ['permanen

## Can we assess good keyword tagging this way? Could we score keyword descriptiveness as users enter them? 

(2.9496593475341797, [{'_id': '59159e9b3dd97b0a4c1653bb', 'keywords': ['microbial', 'microbes']}])

vs 

(8.1640734042017922, [{'_id': '59159e9b3dd97b0a4c165439', 'keywords': ['herbarium', 'herbaria', 'plants', 'angiosperms', 'dicots', 'monocots', 'gymnosperms', 'pteridophytes', 'mosses', 'liverworts', 'lichens', 'fungi', 'algae', 'fossils', 'wood', 'microbes']}])
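If such a score were computed as users enter keywords, a small standalone function might be a reasonable starting point. The sketch below is an assumption-laden illustration, not the notebook's implementation: `mean_pairwise_distance` is a hypothetical name, the vectors are made up, and it divides by the actual number of pairs, whereas the cell above starts its counter at 1. Its values would therefore differ from the outputs shown in this notebook.

```python
import itertools
import math

def mean_pairwise_distance(keywords, vectors):
    """Average Euclidean distance over all pairs of known, distinct keywords.

    Returns None when fewer than two keywords have vectors, since no
    pair exists to measure.
    """
    # dict.fromkeys removes duplicate keywords while keeping their order.
    known = [keyword for keyword in dict.fromkeys(keywords) if keyword in vectors]
    pairs = list(itertools.combinations(known, 2))
    if not pairs:
        return None
    total = sum(math.dist(vectors[a], vectors[b]) for a, b in pairs)
    return total / len(pairs)

# Made-up vectors for illustration; real vectors would come from the model.
toy_vectors = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [0.0, 8.0]}
score = mean_pairwise_distance(["a", "b", "c"], toy_vectors)
```

Whether a low or a high score indicates "good" tagging is still an open question: a tight cluster such as 'microbial' and 'microbes' may simply be redundant, while a wide spread may describe a dataset more completely.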


