# Exploring Metadata Keyword Relationships with word2vec

This notebook demonstrates using a pretrained word2vec model to explore how keywords are related to other words and other keywords. 

Examples of how to use pretrained models, and some of the code, were taken from https://github.com/3Top/word2vec-api


This notebook runs on the Docker jupyter/datascience-notebook container, with gensim additionally installed as shown below.

In [119]:
%%bash
pip install gensim



You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
bash: line 2: syntax error near unexpected token `;'
bash: line 2: `;'


In [26]:
from gensim.models.word2vec import Word2Vec as w
import gensim.models
from gensim import utils, matutils

## Loading a pretrained word2vec model

This model was sourced from http://nlp.stanford.edu/data/glove.6B.zip

To load it successfully, the file was modified by adding a new first line, '400000 300', above the rest of the data. The word2vec text format expects this header: the number of words followed by the vector size.

In [120]:
model = gensim.models.KeyedVectors.load_word2vec_format('./glove.6B.300d.txt', binary=False)
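
The header edit described above can also be scripted rather than done by hand. The following is a minimal sketch, assuming the GloVe file is plain text with one space-separated token and vector per line; `add_word2vec_header` and the toy file names are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, adding a '<vocab_size> <dimensions>' first line."""
    vocab_size = 0
    dimensions = None
    # First pass: count the rows and read the vector length from the first row.
    with open(glove_path, encoding='utf-8') as source:
        for line in source:
            if dimensions is None:
                # Each row is a token followed by its vector components.
                dimensions = len(line.rstrip('\n').split(' ')) - 1
            vocab_size += 1
    if vocab_size == 0:
        raise ValueError('no vectors found in ' + glove_path)
    # Second pass: write the header, then copy the rows unchanged.
    with open(glove_path, encoding='utf-8') as source, \
            open(output_path, 'w', encoding='utf-8') as target:
        target.write('{} {}\n'.format(vocab_size, dimensions))
        for line in source:
            target.write(line)
    return vocab_size, dimensions

# Tiny made-up file, used only to show the behaviour.
with tempfile.TemporaryDirectory() as folder:
    glove_path = os.path.join(folder, 'toy_glove.txt')
    output_path = os.path.join(folder, 'toy_word2vec.txt')
    with open(glove_path, 'w', encoding='utf-8') as toy:
        toy.write('sea 0.1 0.2 0.3\nocean 0.2 0.1 0.4\n')
    header = add_word2vec_header(glove_path, output_path)
    with open(output_path, encoding='utf-8') as result:
        first_line = result.readline().strip()
print(header, first_line)
```

For glove.6B.300d.txt this should produce the same '400000 300' line that was added manually.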

## Keywords describing datasets

A keywords.json file was produced from backend data harvested as part of populating the knowledge network http://kn.csiro.au

In [126]:
import json

In [127]:
with open('./keywords.json') as keywords_json_file:
    raw_keywords = json.load(keywords_json_file)

In [128]:
# Keep only the records that have a non-empty 'keywords' entry (despite the name, this is a list of records)
keywords_dict = [record for record in raw_keywords if record.get('keywords') is not None]

In [131]:
for anItem in keywords_dict[0:10]:
    print(anItem)

{'_id': '59159e9b3dd97b0a4c165378', 'keywords': ['entomology', 'insects', 'spiders', 'fauna']}
{'_id': '59159e9b3dd97b0a4c165379', 'keywords': ['plants']}
{'_id': '59159e9b3dd97b0a4c16537b', 'keywords': ['microbial']}
{'_id': '59159e9b3dd97b0a4c16537e', 'keywords': ['plants']}
{'_id': '59159e9b3dd97b0a4c16537f', 'keywords': ['herbarium', 'herbaria', 'angiosperms', 'dicots', 'monocots', 'pteridophytes', 'mosses', 'liverworts', 'lichens', 'fungi', 'algae', 'plants']}
{'_id': '59159e9b3dd97b0a4c165380', 'keywords': ['microbial']}
{'_id': '59159e9b3dd97b0a4c165381', 'keywords': ['microbial']}
{'_id': '59159e9b3dd97b0a4c165382', 'keywords': ['microbes']}
{'_id': '59159e9b3dd97b0a4c165383', 'keywords': ['arachnids', 'spiders', 'myriapods', 'centipedes', 'millipedes', 'onychophorans', '"velvet', 'worms"', 'tardigrades', '"water', 'bears"', 'fauna']}
{'_id': '59159e9b3dd97b0a4c165384', 'keywords': ['entomology', 'insects', 'hexapoda']}


## Keyword cleaning and considerations

Keywords are flattened to a unique list. However, many keywords are not single words, and these are not handled currently.

In [132]:
flat_keywords = [keyword for keyword_item in keywords_dict for keyword in keyword_item['keywords']]

In [133]:
unique_keywords = list(set(flat_keywords))

In [105]:
len(unique_keywords)

1912
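
One possible way to start handling multi-word keywords is to split each into lowercase word tokens before looking them up. This is only a sketch under that assumption: `keyword_tokens` is a hypothetical helper, and whether token-level lookups are meaningful for phrases such as species names is an open question. Lowercasing is included because the glove.6B vectors are uncased.

```python
import re

def keyword_tokens(keyword):
    """Split a keyword phrase into lowercase word tokens."""
    # \w+ keeps runs of letters, digits and underscores, so commas,
    # quotes and spaces act as separators.
    return re.findall(r'\w+', keyword.lower())

# Keywords of the kind that appear in this dataset.
examples = ['Ferrous iron index', 'pulsars, neutron stars, BPSR, HIPSR', '"velvet']
for example in examples:
    print(example, '->', keyword_tokens(example))
```

Each token could then be looked up individually; tokens such as acronyms may still be missing from the model vocabulary.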

## Finding similar words

Here we find the top 10 words most similar to each keyword. Note that this does not find the most similar other keywords; it finds words from the vocabulary of the pretrained model (in this case trained on Wikipedia + Gigaword), only a small number of which are existing keywords. Also note that not all keywords are in that vocabulary; if a keyword isn't present, an error is thrown.

In [155]:
for keyword in unique_keywords[1:10]:
    print(keyword)
    try:
        print(str(model.most_similar_cosmul(positive=[keyword], topn=10)) + "\n")
    except KeyError:
        # gensim raises KeyError when the keyword is not in the model vocabulary
        print("Word not in corpus" + "\n")

Workspace
Word not in corpus

Ferrous iron index
Word not in corpus

risk assessment
Word not in corpus

pulsars, neutron stars, BPSR, HIPSR
Word not in corpus

Oenanthe pimpinelloides
Word not in corpus

sea
[('ocean', 0.8422356247901917), ('waters', 0.8335513472557068), ('seas', 0.8311473727226257), ('mediterranean', 0.7880582213401794), ('coast', 0.7716078162193298), ('coastal', 0.7678077816963196), ('aegean', 0.7616022825241089), ('arctic', 0.7600924968719482), ('ships', 0.7546911239624023), ('ship', 0.7542240619659424)]

Western Australia Coast South
Word not in corpus

ICT Centre
Word not in corpus

clams
[('mussels', 0.8888778686523438), ('oysters', 0.8605103492736816), ('scallops', 0.835378110408783), ('clam', 0.8343192338943481), ('crabs', 0.8148018717765808), ('lobsters', 0.8023141622543335), ('shrimp', 0.7868390083312988), ('squid', 0.7804455757141113), ('prawns', 0.7790178060531616), ('shellfish', 0.7775545120239258)]



## Finding only similar existing keywords

In some use cases we might only want to find existing keywords that are similar to other keywords. This could help a user select a new keyword based on keywords similar to one already selected. The best way to do this might be to reset the word2vec model to use the keywords as its vocabulary and then "copy" the weights from an existing model (because we can't easily train on the small keyword corpus alone). An alternative is to "brute force" it by building ordered dictionaries, so that each keyword is associated with a list of all other keywords ordered by similarity. This is not implemented yet, but it might begin like the following:

In [147]:
target_keyword = unique_keywords[6]
print(target_keyword)
for keyword in unique_keywords:
    try:
        print(keyword + ': ' + str(model.similarity(keyword, target_keyword)))
    except KeyError:
        # Skip keywords that are not in the model vocabulary
        pass

sea
sea: 1.0
clams: 0.224150595646
entomology: -0.138087730563
fiber: 0.110444760662
ethanol: -0.0278519565875
rumen: -0.0532214256698
atlas: 0.0858158608241
amphibian: 0.0912710886103
geophysics: 0.0635767892918
pulsar: -0.0616135738105
nutrients: 0.135089925386
biosecurity: -0.013429143093
cropping: -0.0462017966651
birds: 0.40292392226
waves: 0.404814288266
gorgonians: 0.0264709870194
nematoda: -0.174774502264
guild: 0.00533698418086
ocean: 0.684472884777
rarefaction: -0.0488151855678
molecule: 0.00805777661849
runoff: 0.117485779389
spiders: 0.157447788603
flower: 0.16018151524
carbonate: 0.127139506334
invertebrates: 0.21980052183
charcoal: 0.060828164567
conductivity: -0.00876307080493
herbarium: -0.0916143439752
software: 0.0584844322381
sucrose: -0.0344297328679
crustacea: -0.0299677499344
cattle: 0.153258218679
arachnids: -0.0823761398388
standardisation: -0.0272053032805
projections: 0.0487768129987
mycorrhiza: -0.136208198412
weather: 0.313883307572
bushfire: -0.100511608219
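
The "brute force" idea could be taken one step further by ranking every keyword against every other. The sketch below assumes the keyword vectors have already been collected into a plain dictionary; `rank_keywords_by_similarity` and the 2-d toy vectors are hypothetical and only illustrate the shape of the result. Cosine similarity is used because that is what `model.similarity` computes.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_keywords_by_similarity(keyword_vectors):
    """Map each keyword to a list of (other_keyword, similarity), most similar first."""
    ranked = {}
    for keyword, vector in keyword_vectors.items():
        scores = [
            (other, cosine_similarity(vector, other_vector))
            for other, other_vector in keyword_vectors.items()
            if other != keyword
        ]
        ranked[keyword] = sorted(scores, key=lambda item: item[1], reverse=True)
    return ranked

# Toy 2-d vectors invented for illustration; real vectors would come from the model.
toy_vectors = {
    'sea': np.array([1.0, 0.0]),
    'ocean': np.array([0.9, 0.1]),
    'software': np.array([0.0, 1.0]),
}
ranking = rank_keywords_by_similarity(toy_vectors)
print(ranking['sea'])
```

With the real model, the dictionary would hold only the keywords found in the vocabulary. Every pair is compared, so the cost grows with the square of the number of keywords.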

## What else might we be able to do with word2vec vectors and keywords

One thing that might be possible is to look at how closely related the keywords attributed to a particular dataset actually are. We can compare word vector distances ourselves, and so can calculate the average distance between a record's keywords.

In [149]:
model.word_vec(unique_keywords[6])

array([ 0.29919001, -0.11731   , -0.0089925 , -0.37059   , -0.06722   ,
        0.15163   , -0.061105  ,  0.29587001,  0.36515999, -1.50870001,
        0.46160001, -0.15762   ,  0.015131  ,  0.31378999,  0.49033999,
        0.23762   ,  0.27667001,  0.44819999, -0.64633   ,  0.66012001,
       -0.65131003,  0.36984   , -0.41850001, -0.053622  , -0.0097837 ,
       -0.12772   ,  0.47055   ,  0.65263999, -0.37119001,  0.48045999,
        0.39286   , -0.06189   , -0.88924998, -0.55094999,  0.35034001,
       -0.32709   ,  0.29975   , -0.056088  , -0.035726  ,  0.46678001,
       -0.27544001, -0.01794   ,  0.41068   ,  0.16943   , -0.38510999,
        0.29284   ,  0.51858997,  0.56309003,  0.24123   ,  0.099607  ,
       -0.20422   ,  0.11269   , -0.49399999, -0.87515002, -0.31327   ,
        0.43037   , -0.11785   ,  0.46608001,  0.13485   , -0.29508001,
        0.071264  ,  0.31647   ,  1.10640001, -0.35238999,  0.11418   ,
       -0.30375999, -0.56996   ,  0.70823997, -0.15449999,  0.16

In [194]:
import numpy, itertools, collections

average_keyword_dist_record = {}

def get_word_vec(word, word_vec):
    """Return the vector for word (None if it is not in the model), caching lookups in word_vec."""
    if word not in word_vec:
        try:
            word_vec[word] = model.word_vec(word)
        except KeyError:
            # Keyword is not in the model vocabulary
            word_vec[word] = None
    return word_vec[word]

for record in keywords_dict:
    metadata_record_keywords = record['keywords']
    total_distance = 0
    # NOTE: starting at 1 avoids division by zero for records with no usable pairs,
    # but it makes the divisor (pairs + 1), so the result is below the true mean
    count = 1
    word_vec = {}
    # Build every unordered pair of distinct keywords in this record
    sets = [frozenset(pair) for pair in itertools.product(metadata_record_keywords, metadata_record_keywords)]
    unique_pairs = set(sets)
    for pair in unique_pairs:
        pair = list(pair)
        if len(pair) < 2:
            # A keyword paired with itself collapses to a single-element set
            continue
        a_vec = get_word_vec(pair[0], word_vec)
        b_vec = get_word_vec(pair[1], word_vec)
        # Compare with "is None": "!= None" on a numpy array is evaluated elementwise
        if a_vec is None or b_vec is None:
            continue
        # Euclidean distance between the two keyword vectors
        total_distance += numpy.linalg.norm(a_vec - b_vec)
        count += 1
    average_distance = total_distance / count
    # Group records that share the same average distance
    average_keyword_dist_record.setdefault(average_distance, []).append(record)

# Print records from the largest average distance to the smallest
for result in collections.OrderedDict(sorted(average_keyword_dist_record.items(), reverse=True)).items():
    print(str(result) + '\n')



(9.4812989864709234, [{'_id': '5915a6be3dd97b0a4c19fcfa', 'keywords': ['soil', 'agriculture', 'crops', 'pastures', 'carbon', 'sequestration', 'baseline', 'stocks', 'fractions', 'nitrogen', 'carbonate', 'particulate', 'humus', 'resistant', 'charcoal']}])

(8.8938780725002289, [{'_id': '5915a6be3dd97b0a4c19fd45', 'keywords': ['invertebrates', 'insects', 'biosecurity', 'taxonomy', 'arthropods', 'systematic']}])

(8.8470295563988071, [{'_id': '5915a6be3dd97b0a4c19fd04', 'keywords': ['temperature', 'salinity', 'depth', 'fluorescence', 'oxygen', 'PAR', 'ocean', 'phosphate', 'nitrate', 'ammonium', 'silicate', 'Australia, Western Australia coast', 'East Indian Ocean', 'Indian Ocean']}])

(8.8263469372155523, [{'_id': '59159e9b3dd97b0a4c165387', 'keywords': ['malacology', 'molluscs', 'mollusc', 'chitons', 'clams', 'mussels', 'snails', 'nudibranchs', 'sea', 'slugs', 'tusk', 'shells', 'octopus', 'squid', 'fauna']}])

(8.7827231287956238, [{'_id': '5915a6be3dd97b0a4c19fcc8', 'keywords': ['permanen
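
The loop above starts `count` at 1, so the figures it prints are divided by the number of pairs plus one; for a record with a single usable pair, the reported value is half that pair's distance. A stricter mean over unordered pairs could be sketched as follows. `mean_pairwise_distance` and the toy vectors are hypothetical, and the function returns None when there are no pairs to compare, which is one of several reasonable choices.

```python
import itertools
import numpy as np

def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all unordered pairs of vectors.

    Returns None when fewer than two vectors are supplied.
    """
    distances = [
        np.linalg.norm(a - b)
        for a, b in itertools.combinations(vectors, 2)
    ]
    if not distances:
        return None
    return float(np.mean(distances))

# Toy 2-d vectors invented for illustration: the pair distances are 5, 4 and 3.
toy = [np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([0.0, 4.0])]
print(mean_pairwise_distance(toy))
```

Applying it to a record would mean passing the vectors of the record's keywords that exist in the model vocabulary.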

## Can we assess good keyword tagging this way? Could we score keyword descriptiveness as users enter them? 

(2.9496593475341797, [{'_id': '59159e9b3dd97b0a4c1653bb', 'keywords': ['microbial', 'microbes']}])

vs 

(8.1640734042017922, [{'_id': '59159e9b3dd97b0a4c165439', 'keywords': ['herbarium', 'herbaria', 'plants', 'angiosperms', 'dicots', 'monocots', 'gymnosperms', 'pteridophytes', 'mosses', 'liverworts', 'lichens', 'fungi', 'algae', 'fossils', 'wood', 'microbes']}])
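
As a starting point for scoring keywords as users enter them, one could measure how far a candidate keyword sits from the keywords already on the record. This is a speculative sketch, not an implemented feature: `candidate_spread` and the toy vectors are hypothetical, and whether a small or a large distance indicates better tagging is exactly the open question raised above.

```python
import numpy as np

def candidate_spread(existing_vectors, candidate_vector):
    """Mean Euclidean distance from a candidate keyword to the keywords already entered.

    Returns None when no keywords have been entered yet.
    """
    if not existing_vectors:
        return None
    distances = [np.linalg.norm(candidate_vector - v) for v in existing_vectors]
    return float(np.mean(distances))

# Toy 2-d vectors invented for illustration only.
existing = [np.array([0.0, 0.0]), np.array([0.0, 2.0])]
near = candidate_spread(existing, np.array([0.0, 1.0]))
far = candidate_spread(existing, np.array([3.0, 4.0]))
print(near, far)
```

A candidate such as 'microbes' on a record already tagged 'microbial' would be expected to score low on this measure, while an unrelated term would score high.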


