This notebook relies on the sample produced by the `sample.ipynb` notebook. We extract the metadata descriptions and then fit models on them.

In [135]:
# Import collection metadata
import pickle

# Use a context manager so the file handle is closed after loading
with open('metadata.p', 'rb') as f:
    long_names, metadata = pickle.load(f)

We now need to make a small corpus. First naive strategy: concatenate all the leaves of the structure in random order.

NOTE: this discards important information along the paths to the leaves.
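
To make that loss concrete, here is a minimal sketch on a made-up record. The `record` dictionary and the `iter_paths` helper are illustrative assumptions, not part of the collection metadata or of the cells below. With leaves only, nothing says which value names a platform and which names an instrument; with the path kept, each value carries its context.

```python
def iter_paths(node, prefix=()):
    """Yield (path, leaf) pairs for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            for item in iter_paths(value, prefix + (key,)):
                yield item
    elif isinstance(node, (list, tuple)):
        for value in node:
            for item in iter_paths(value, prefix):
                yield item
    else:
        yield prefix, node


# Hypothetical record, loosely shaped like a metadata entry
record = {
    "Platform": {
        "ShortName": "JASON-1",
        "Instrument": {"ShortName": "DORIS"},
    }
}

pairs = list(iter_paths(record))

# Leaves only: the role of each value is lost
leaves_only = [leaf for _, leaf in pairs]

# Path plus leaf: each value keeps its context
with_paths = [" ".join(path + (leaf,)) for path, leaf in pairs]

print(leaves_only)
print(with_paths)
```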

In [3]:
# source: https://stackoverflow.com/questions/12507206/python-recommended-way-to-walk-complex-dictionary-structures-imported-from-json

# This generator walks a nested dictionary and yields one list per leaf:
# the keys leading to the leaf, followed by the leaf value.
# Ancestor keys are prepended, so they appear nearest-ancestor first.
def dict_generator(indict, pre=None):
    pre = pre[:] if pre else []
    if isinstance(indict, dict):
        for key, value in indict.items():
            if isinstance(value, dict):
                for d in dict_generator(value, [key] + pre):
                    yield d
            elif isinstance(value, (list, tuple)):
                for v in value:
                    for d in dict_generator(v, [key] + pre):
                        yield d
            else:
                yield pre + [key, value]
    else:
        # Scalar list item: yield it as a path so callers can rely on path[-1]
        yield pre + [indict]

In [4]:
# source: https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-float
def is_number(s):
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # TypeError covers None and other non-string, non-numeric values
        return False

In [212]:
# WARNING: Naive implementation
import json
import re
from random import shuffle

import validators  # third-party package that provides validators.url()

def struct2Doc(cln):
    # Round-trip through JSON to get plain dicts, lists and strings
    hierarchy = json.loads(json.dumps(cln))

    path_gen = dict_generator(hierarchy)
    leaves = []

    for path in path_gen:
        leaf = path[-1]

        # Do some filtering on the leaves (see below)
        if leaf == '' or leaf is None:
            continue
        if is_number(leaf):
            continue
        if validators.url(leaf):
            continue

        # TODO: Extend
        # Replace each run of punctuation with a single space
        leaf = re.sub(r'[^\s\w]+', ' ', leaf)
        leaves.append(leaf.lower())

    # Shuffle so that incidental proximity of leaves to each other is not taken into account
    shuffle(leaves)
    document = ' '.join(leaves)

    return document

# Just from looking at the result it seems we should filter out:
#  - Empty leaves
#  - Numbers
#  - URLs and emails
#  - Dates?
    

In [213]:
import json
import re

import validators  # third-party package that provides validators.url()

# WARNING: This may be too reliant on structured data

# Structured output: one "sentence" per leaf, made of the path keys plus the leaf value
def struct2Sentence(cln):
    # Round-trip through JSON to get plain dicts, lists and strings
    hierarchy = json.loads(json.dumps(cln))

    path_gen = dict_generator(hierarchy)
    sentences = []

    for path in path_gen:
        leaf = path[-1]

        # Do some filtering on the leaves (see below)
        if leaf == '' or leaf is None:
            continue
        if is_number(leaf):
            continue
        if validators.url(leaf):
            continue

        sentence = ' '.join(path)

        # Replace punctuation with spaces and make the sentence lowercase
        sentence = re.sub(r'[^\s\w]+', ' ', sentence)
        sentences.append(sentence.lower())

    return sentences

# Just from looking at the result it seems we should filter out:
#  - Empty leaves
#  - Numbers
#  - URLs and emails
#  - Dates?
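
The filtering notes above (empty leaves, numbers, URLs and emails, dates) could be sketched as a single predicate. This is only an illustration using the standard library: the regular expressions are rough stand-ins rather than full validators, the date formats are assumed, and `keep_leaf` and `looks_like_date` are hypothetical names that do not appear elsewhere in this notebook.

```python
import re
from datetime import datetime

# Rough patterns; assumptions for illustration, not full validators
URL_RE = re.compile(r"^(https?|ftp)://\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Assumed ISO-like timestamp layouts
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%SZ")


def looks_like_date(text):
    """Return True if text parses with one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def keep_leaf(leaf):
    """Return True for leaves that look like free text worth keeping."""
    if not isinstance(leaf, str):
        return False
    text = leaf.strip()
    if not text:
        return False
    try:
        float(text)
        return False  # plain number
    except ValueError:
        pass
    if URL_RE.match(text) or EMAIL_RE.match(text):
        return False
    return not looks_like_date(text)


samples = ["", "42", "3.5", "http://example.com/data", "user@example.com",
           "2012-05-31T00:00:00", "DORIS receiver"]
kept = [s for s in samples if keep_leaf(s)]
print(kept)
```

A predicate like this keeps the filtering rules in one place, so both document builders could share it instead of repeating the same `if ... continue` chain.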
    

In [216]:
cln = metadata[0]
document = struct2Sentence(cln)
print(document)

[u'concept id c1000000000 cddis', u'instrument instruments platform platforms collection shortname doris receiver', u'platform platforms collection shortname cryosat 2', u'platform platforms collection type  ', u'platform platforms collection longname cryosat 2', u'instrument instruments platform platforms collection shortname doris beacon', u'platform platforms collection shortname ground stations', u'platform platforms collection type  ', u'platform platforms collection longname ground stations', u'instrument instruments platform platforms collection shortname doris receiver', u'platform platforms collection shortname hy 2a', u'platform platforms collection type  ', u'platform platforms collection longname haiyang 2a', u'instrument instruments platform platforms collection shortname doris receiver', u'platform platforms collection shortname jason 1', u'platform platforms collection type  ', u'platform platforms collection longname jason 1', u'instrument instruments platform platforms

The next step is to convert the collections to documents in bulk.

In [154]:
import multiprocessing

# Parallelize for speed
pool = multiprocessing.Pool()
#documents = pool.map(struct2Doc, metadata)
sentences = pool.map(struct2Sentence, metadata)

sentences[0]

[u'conceptid c1000000000cddis',
 u'collection lastupdate 20120531t000000',
 u'collection description the doppler orbitography by radiopositioning integrated on satellite doris was developed by the centre national detudes spatiales cnes with cooperation from other french government agencies the system was developed to provide precise orbit determination and high accuracy location of ground beacons for point positioning doris is a dualfrequency doppler system that has been included as an experiment on various space missions such as topexposeidon spot2 3 4 and 5 envisat and jason satellites unlike many other navigation systems doris is based on an uplink device the receivers are on board the satellite with the transmitters are on the ground this creates a centralized system in which the complete set of observations is downloaded by the satellite to the ground center from where they are distributed after editing and processing an accurate measurment is made of the doppler shift on radiofre

In [123]:
def flattenGenerator(listOfLists):
    for list2 in listOfLists:
        for item in list2:
            yield item

sentenceGen = flattenGenerator(sentences)
next(sentenceGen)

u'conceptid c1000000000cddis'

For word2vec we need a single flat list of sentences, where each sentence is a list of tokens.

In [124]:
# Split each sentence on spaces and drop the empty strings left by repeated spaces
sentences_flat = [[w for w in item.split(' ') if w] for sublist in sentences for item in sublist]
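
As a small illustration of the flattening and tokenizing step, the sketch below uses made-up sentence lists (`nested` is a hypothetical stand-in for `sentences`). It also shows one detail worth knowing: splitting on a single space leaves newlines inside tokens, whereas `split()` with no argument breaks on any whitespace. That difference may be relevant to the `"\n"` token that shows up in the LDA topics further down.

```python
from itertools import chain

# Hypothetical per-collection sentence lists, standing in for `sentences`
nested = [
    ["collection shortname  doris", "platform longname jason\n1"],
    ["collection description sea surface height"],
]

# One flat stream of sentence strings
flat = list(chain.from_iterable(nested))

# split(' ') only breaks on single spaces, so empty strings must be dropped
# and a token can still contain a newline
tokens_space = [[w for w in s.split(" ") if w] for s in flat]

# split() breaks on any run of whitespace, including newlines
tokens_any = [s.split() for s in flat]

print(tokens_space[1])
print(tokens_any[1])
```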

Now we can examine document similarity between collection metadata.

In [125]:
import logging, gensim
from gensim.models import word2vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Try word2vec on sentences

In [126]:
model = word2vec.Word2Vec(sentences_flat, size=100, window=5, min_count=5, workers=7)

2018-02-13 20:44:09,022 : INFO : collecting all words and their counts
2018-02-13 20:44:09,026 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-02-13 20:44:09,072 : INFO : PROGRESS: at sentence #10000, processed 78389 words, keeping 4601 word types
2018-02-13 20:44:09,120 : INFO : PROGRESS: at sentence #20000, processed 162222 words, keeping 7006 word types
2018-02-13 20:44:09,166 : INFO : PROGRESS: at sentence #30000, processed 239825 words, keeping 9093 word types
2018-02-13 20:44:09,215 : INFO : PROGRESS: at sentence #40000, processed 325109 words, keeping 9797 word types
2018-02-13 20:44:09,265 : INFO : PROGRESS: at sentence #50000, processed 404400 words, keeping 12067 word types
2018-02-13 20:44:09,321 : INFO : PROGRESS: at sentence #60000, processed 490160 words, keeping 14474 word types
2018-02-13 20:44:09,369 : INFO : PROGRESS: at sentence #70000, processed 573714 words, keeping 16293 word types
2018-02-13 20:44:09,422 : INFO : PROGRESS: at sente

In [127]:
model.save('word2vec_structured.m')
# To reload later: model = word2vec.Word2Vec.load('word2vec_structured.m')

2018-02-13 20:44:21,460 : INFO : saving Word2Vec object under simple_model.m, separately None
2018-02-13 20:44:21,465 : INFO : not storing attribute syn0norm
2018-02-13 20:44:21,466 : INFO : not storing attribute cum_table
2018-02-13 20:44:21,580 : INFO : saved simple_model.m


In [133]:
model.wv['rain']
model.wv.most_similar('rainfall')

[(u'measurement', 0.7558643817901611),
 (u'ii', 0.7248314619064331),
 (u'iii', 0.7035893797874451),
 (u'islscp', 0.685666024684906),
 (u'fire', 0.6785022616386414),
 (u'convection', 0.6691360473632812),
 (u'transcom', 0.6663325428962708),
 (u'measuring', 0.6540623903274536),
 (u'tropical', 0.6412253379821777),
 (u'icesat', 0.6358033418655396)]
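
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea with the standard library only; the three-dimensional vectors and the words attached to them are made up for illustration, and this `most_similar` function is a simplified stand-in, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; the word2vec vectors above have 100 dimensions
vectors = {
    "rainfall": [0.9, 0.1, 0.3],
    "precipitation": [0.8, 0.2, 0.4],
    "satellite": [0.1, 0.9, 0.2],
}


def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("rainfall")
print(ranking)
```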

Time for doc2vec

In [186]:
from gensim.models import doc2vec
from gensim.models.doc2vec import TaggedDocument

# We need to feed it tagged sentences: every sentence of a collection
# is tagged with the collection index and the collection's long name
doc_sentences = []
for idx, sentence_list in enumerate(sentences):
    ln = metadata[idx]['Collection']['LongName']

    for sentence in sentence_list:
        words = [w for w in sentence.split(' ') if w]
        ls = TaggedDocument(words=words, tags=[unicode(idx), ln])
        doc_sentences.append(ls)



In [187]:
doc_sentences[0]

TaggedDocument(words=[u'conceptid', u'c1000000000cddis'], tags=[u'0', 'Doppler Orbitography by Radiopositioning Integrated on Satellite Range-Rate Observation Data (cycle format) from NASA CDDIS'])
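
The tagging scheme can be sketched without gensim. In the snippet below, `Tagged` is a stand-in namedtuple with the same two fields as `TaggedDocument`, and the `collections` data is made up. It shows that every sentence of a collection carries the same two tags, so each tag is trained on all sentences of its collection.

```python
from collections import Counter, namedtuple

# Stand-in for gensim's TaggedDocument so the sketch runs without gensim
Tagged = namedtuple("Tagged", ["words", "tags"])

# Hypothetical collections: (long name, list of sentences)
collections = [
    ("DORIS range-rate data", ["conceptid c1", "collection shortname doris"]),
    ("CALIPSO lidar profiles", ["conceptid c2"]),
]

tagged = []
for idx, (long_name, sentence_list) in enumerate(collections):
    for sentence in sentence_list:
        # Every sentence of one collection carries the same two tags
        tagged.append(Tagged(words=sentence.split(), tags=[str(idx), long_name]))

# Number of training examples behind each tag
examples_per_tag = Counter(tag for doc in tagged for tag in doc.tags)
print(examples_per_tag)
```

Because the long name is a tag in its own right, similarity queries return both index tags and long names, and collections that share a long name also share the vector for that tag.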

In [188]:
model = doc2vec.Doc2Vec(doc_sentences, size=100, window=8, min_count=5, workers=7)

2018-02-13 23:40:50,596 : INFO : collecting all words and their counts
2018-02-13 23:40:50,598 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-02-13 23:40:50,767 : INFO : PROGRESS: at example #10000, processed 78389 words (467134/s), 4601 word types, 146 tags
2018-02-13 23:40:50,931 : INFO : PROGRESS: at example #20000, processed 162222 words (522203/s), 7006 word types, 245 tags
2018-02-13 23:40:51,095 : INFO : PROGRESS: at example #30000, processed 239825 words (478516/s), 9093 word types, 305 tags
2018-02-13 23:40:51,259 : INFO : PROGRESS: at example #40000, processed 325109 words (524534/s), 9797 word types, 425 tags
2018-02-13 23:40:51,426 : INFO : PROGRESS: at example #50000, processed 404400 words (479667/s), 12067 word types, 583 tags
2018-02-13 23:40:51,588 : INFO : PROGRESS: at example #60000, processed 490160 words (536476/s), 14474 word types, 722 tags
2018-02-13 23:40:51,748 : INFO : PROGRESS: at example #70000, processed 573714 words (

In [198]:
model.save('doc2vec_structured.m')

2018-02-13 23:55:30,659 : INFO : saving Doc2Vec object under doc2vec_simple_model.m, separately None
2018-02-13 23:55:30,663 : INFO : not storing attribute syn0norm
2018-02-13 23:55:30,664 : INFO : not storing attribute cum_table
2018-02-13 23:55:30,803 : INFO : saved doc2vec_simple_model.m


In [197]:
# Now let's see which documents are most similar to a chosen document

model.docvecs.most_similar(200) 

[(u'151', 0.9998267889022827),
 ('CALIPSO Lidar Level 2 aerosol profile data using the CALIPSO Lidar Ratio selection algorithm (CAL_LID_L2_05kmAPro-Prov-V3-30)',
  0.9759185314178467),
 (u'154', 0.9758481979370117),
 (u'156', 0.9694593548774719),
 ('CALIPSO Lidar Level 2 Cloud Profile data (CAL_LID_L2_05kmCPro-Prov-V3-30)',
  0.9688910245895386),
 (u'169', 0.9673362374305725),
 ('CERES Single Scanner Satellite Footprint, TOA, Surface Fluxes and Clouds (SSF) data in HDF (CER_SSF_NPP-FM5-VIIRS_Edition1A)',
  0.9670361280441284),
 ('CALIPSO Lidar Level 2 5 km aerosol layer data (CAL_LID_L2_05kmALay-Prov-V3-30)',
  0.966292142868042),
 (u'153', 0.9662625789642334),
 ('CALIPSO Lidar Level 2 5 km cloud layer data (CAL_LID_L2_05kmCLay-Prov-V3-30)',
  0.9644289016723633)]

In [261]:
print(metadata[200]['Collection']['ShortName'])

FGDL


This works, but what would have happened if we had used a different representation of documents that captured less of the structure?

In [203]:
documents[0]

u'20030123t170000 applicationecho10xml usa   20120531t000000 cryosat2 doppler orbitography by radiopositioning integrated on satellite rangerate observation data cycle format from nasa cddis jason1 archiver telephone doris receiver doris receiver haiyang2a spot5 doris beacon doris receiver   c1000000000cddis 3016146542 topexposeidon cryosat2 ocean topography experiment the doppler orbitography by radiopositioning integrated on satellite doris was developed by the centre national detudes spatiales cnes with cooperation from other french government agencies the system was developed to provide precise orbit determination and high accuracy location of ground beacons for point positioning doris is a dualfrequency doppler system that has been included as an experiment on various space missions such as topexposeidon spot2 3 4 and 5 envisat and jason satellites unlike many other navigation systems doris is based on an uplink device the receivers are on board the satellite with the transmitters

In [217]:
import multiprocessing

# Parallelize for speed
pool = multiprocessing.Pool()
documents = pool.map(struct2Doc, metadata)

In [220]:
from gensim.models import doc2vec
from gensim.models.doc2vec import TaggedDocument

# We need to feed it tagged documents: one per collection,
# tagged with the collection index and the collection's long name
doc_sentences2 = []
for idx, document in enumerate(documents):
    ln = metadata[idx]['Collection']['LongName']
    sentence = [w for w in document.split(' ') if w]
    td = TaggedDocument(words=sentence, tags=[unicode(idx), ln])
    doc_sentences2.append(td)

In [221]:
model_simplified = doc2vec.Doc2Vec(doc_sentences2, size=100, window=8, min_count=5, workers=7)

2018-02-14 00:12:24,561 : INFO : collecting all words and their counts
2018-02-14 00:12:24,566 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-02-14 00:12:25,026 : INFO : collected 32366 word types and 3430 unique tags from a corpus of 2516 examples and 1081725 words
2018-02-14 00:12:25,027 : INFO : Loading a fresh vocabulary
2018-02-14 00:12:25,099 : INFO : min_count=5 retains 8858 unique words (27% of original 32366, drops 23508)
2018-02-14 00:12:25,100 : INFO : min_count=5 leaves 1045917 word corpus (96% of original 1081725, drops 35808)
2018-02-14 00:12:25,141 : INFO : deleting the raw counts dictionary of 32366 items
2018-02-14 00:12:25,144 : INFO : sample=0.001 downsamples 49 most-common words
2018-02-14 00:12:25,145 : INFO : downsampling leaves estimated 881349 word corpus (84.3% of prior 1045917)
2018-02-14 00:12:25,147 : INFO : estimated required memory for 8858 words and 100 dimensions: 13573400 bytes
2018-02-14 00:12:25,187 : INFO : reset

In [222]:
model_simplified.save('doc2vec_simple.m')

2018-02-14 00:13:49,948 : INFO : saving Doc2Vec object under doc2vec_simple_model.m, separately None
2018-02-14 00:13:49,952 : INFO : not storing attribute syn0norm
2018-02-14 00:13:49,954 : INFO : not storing attribute cum_table
2018-02-14 00:13:50,092 : INFO : saved doc2vec_simple_model.m


Let's see how this version of the documents compares.

In [224]:
model_simplified.docvecs.most_similar(200) 

2018-02-14 00:14:26,897 : INFO : precomputing L2-norms of doc weight vectors


[(u'151', 0.9997702836990356),
 ('CALIPSO Lidar Level 2 5 km aerosol layer data (CAL_LID_L2_05kmALay-Prov-V3-30)',
  0.9831189513206482),
 (u'153', 0.9828567504882812),
 (u'157', 0.9820994138717651),
 ('CALIPSO Lidar Level 2 1/3 km cloud layer data (CAL_LID_L2_333mCLay-ValStage1-V3-30)',
  0.9817156791687012),
 ('CALIPSO Lidar Level 2 Cloud Profile data (CAL_LID_L2_05kmCPro-Prov-V3-30)',
  0.9773738980293274),
 (u'156', 0.9772941470146179),
 (u'154', 0.9766090512275696),
 ('CALIPSO Lidar Level 2 aerosol profile data using the CALIPSO Lidar Ratio selection algorithm (CAL_LID_L2_05kmAPro-Prov-V3-30)',
  0.9758360981941223),
 (u'159', 0.9728275537490845)]

In [225]:
model.docvecs.most_similar(200)

[(u'151', 0.9998267889022827),
 ('CALIPSO Lidar Level 2 aerosol profile data using the CALIPSO Lidar Ratio selection algorithm (CAL_LID_L2_05kmAPro-Prov-V3-30)',
  0.9759185314178467),
 (u'154', 0.9758481979370117),
 (u'156', 0.9694593548774719),
 ('CALIPSO Lidar Level 2 Cloud Profile data (CAL_LID_L2_05kmCPro-Prov-V3-30)',
  0.9688910245895386),
 (u'169', 0.9673362374305725),
 ('CERES Single Scanner Satellite Footprint, TOA, Surface Fluxes and Clouds (SSF) data in HDF (CER_SSF_NPP-FM5-VIIRS_Edition1A)',
  0.9670361280441284),
 ('CALIPSO Lidar Level 2 5 km aerosol layer data (CAL_LID_L2_05kmALay-Prov-V3-30)',
  0.966292142868042),
 (u'153', 0.9662625789642334),
 ('CALIPSO Lidar Level 2 5 km cloud layer data (CAL_LID_L2_05kmCLay-Prov-V3-30)',
  0.9644289016723633)]

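Looks about the same: both models rank tag `151` first with a similarity of about 0.9998, and most of the remaining neighbours in both lists are CALIPSO Lidar Level 2 collections.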
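
One way to quantify the comparison is the overlap between the two top-10 lists. The sketch below copies the neighbours of document 200 from the two outputs above, abbreviating each long name to the identifier in its parentheses, and computes the Jaccard overlap. The choice of Jaccard as the measure is an assumption made here, not something the models report.

```python
# Top-10 neighbours of document 200, copied from the two outputs above;
# long names are abbreviated to the identifier in parentheses
structured = [
    "151", "CAL_LID_L2_05kmAPro-Prov-V3-30", "154", "156",
    "CAL_LID_L2_05kmCPro-Prov-V3-30", "169",
    "CER_SSF_NPP-FM5-VIIRS_Edition1A", "CAL_LID_L2_05kmALay-Prov-V3-30",
    "153", "CAL_LID_L2_05kmCLay-Prov-V3-30",
]
simplified = [
    "151", "CAL_LID_L2_05kmALay-Prov-V3-30", "153", "157",
    "CAL_LID_L2_333mCLay-ValStage1-V3-30", "CAL_LID_L2_05kmCPro-Prov-V3-30",
    "156", "154", "CAL_LID_L2_05kmAPro-Prov-V3-30", "159",
]

shared = set(structured) & set(simplified)
union = set(structured) | set(simplified)

# Jaccard overlap: shared neighbours divided by all distinct neighbours
jaccard = len(shared) / float(len(union))

print(len(shared), len(union), round(jaccard, 3))
```

Seven of the ten neighbours appear in both lists, which supports the impression that the two document representations behave similarly for this query. A single query is a weak basis for a general conclusion, though.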

Now let's use LDA on the data. First we have to make a corpus.

In [230]:
from gensim import corpora

# Tokenize each document on spaces, dropping empty strings
plain_sentences = [[w for w in document.split(' ') if w] for document in documents]
dictionary = corpora.Dictionary(plain_sentences)
corpus = [dictionary.doc2bow(sentence) for sentence in plain_sentences]
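
The bag-of-words representation built here can be sketched with the standard library. The sketch below is a simplified illustration on made-up documents, assuming ids are assigned in order of first appearance; gensim's `Dictionary` may assign ids differently, and `token2id`, `to_bow` and `bow_corpus` here are local stand-ins.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for `plain_sentences`
docs = [
    ["doris", "receiver", "doris", "beacon"],
    ["jason", "doris", "receiver"],
]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse list of `(token_id, count)` pairs, so word order is discarded entirely. This is why the shuffled-leaves documents are a reasonable input for LDA.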

2018-02-14 00:37:42,361 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-02-14 00:37:43,851 : INFO : built Dictionary(32366 unique tokens: [u'tilton', u'fermanv1', u'woods', u'netcdf', u'spiders']...) from 2516 documents (total 1081725 corpus positions)


u'jason'

In [248]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=100, id2word=dictionary)

2018-02-14 00:56:22,606 : INFO : using symmetric alpha at 0.01
2018-02-14 00:56:22,609 : INFO : using symmetric eta at 3.08966199098e-05
2018-02-14 00:56:22,621 : INFO : using serial LDA version on this node
2018-02-14 00:56:40,970 : INFO : running online (single-pass) LDA training, 100 topics, 1 passes over the supplied corpus of 2516 documents, updating model once every 2000 documents, evaluating perplexity every 2516 documents, iterating 50x with a convergence threshold of 0.001000
2018-02-14 00:56:40,978 : INFO : PROGRESS: pass 0, at document #2000/2516
2018-02-14 00:56:53,571 : INFO : merging changes from 2000 documents into a model of 2516 documents
2018-02-14 00:56:54,800 : INFO : topic #14 (0.010): 0.020*"and" + 0.019*"the" + 0.013*"data" + 0.012*"00" + 0.011*"earth" + 0.011*"ceres" + 0.011*"nasa" + 0.008*"provided" + 0.007*"radiometer" + 0.006*"in"
2018-02-14 00:56:54,804 : INFO : topic #90 (0.010): 0.029*"the" + 0.018*"and" + 0.017*"not" + 0.014*"of" + 0.013*"provided" + 0.01

In [258]:
lda.print_topics(10)

2018-02-14 00:58:57,109 : INFO : topic #93 (0.010): 0.014*"earth" + 0.011*"\n" + 0.010*"land" + 0.009*"data" + 0.009*"provided" + 0.008*"not" + 0.008*"science" + 0.007*"the" + 0.006*"use" + 0.006*"cover"
2018-02-14 00:58:57,111 : INFO : topic #38 (0.010): 0.016*"the" + 0.014*"data" + 0.014*"of" + 0.014*"and" + 0.011*"earth" + 0.008*"science" + 0.007*"00" + 0.006*"surface" + 0.006*"in" + 0.005*"es"
2018-02-14 00:58:57,113 : INFO : topic #36 (0.010): 0.028*"iris" + 0.018*"data" + 0.017*"earth" + 0.016*"the" + 0.015*"science" + 0.012*"of" + 0.012*"and" + 0.010*"a" + 0.010*"00" + 0.010*"land"
2018-02-14 00:58:57,115 : INFO : topic #37 (0.010): 0.030*"data" + 0.028*"documentation" + 0.025*"set" + 0.023*"ornl" + 0.022*"00z" + 0.021*"daac" + 0.018*"10" + 0.018*"ornl_daac" + 0.017*"oak" + 0.017*"ridge"
2018-02-14 00:58:57,117 : INFO : topic #20 (0.010): 0.027*"the" + 0.026*"and" + 0.021*"of" + 0.020*"water" + 0.018*"data" + 0.017*"provided" + 0.015*"risks" + 0.014*"earth" + 0.014*"science" + 0.

[(93,
  u'0.014*"earth" + 0.011*"\n" + 0.010*"land" + 0.009*"data" + 0.009*"provided" + 0.008*"not" + 0.008*"science" + 0.007*"the" + 0.006*"use" + 0.006*"cover"'),
 (38,
  u'0.016*"the" + 0.014*"data" + 0.014*"of" + 0.014*"and" + 0.011*"earth" + 0.008*"science" + 0.007*"00" + 0.006*"surface" + 0.006*"in" + 0.005*"es"'),
 (36,
  u'0.028*"iris" + 0.018*"data" + 0.017*"earth" + 0.016*"the" + 0.015*"science" + 0.012*"of" + 0.012*"and" + 0.010*"a" + 0.010*"00" + 0.010*"land"'),
 (37,
  u'0.030*"data" + 0.028*"documentation" + 0.025*"set" + 0.023*"ornl" + 0.022*"00z" + 0.021*"daac" + 0.018*"10" + 0.018*"ornl_daac" + 0.017*"oak" + 0.017*"ridge"'),
 (20,
  u'0.027*"the" + 0.026*"and" + 0.021*"of" + 0.020*"water" + 0.018*"data" + 0.017*"provided" + 0.015*"risks" + 0.014*"earth" + 0.014*"science" + 0.013*"00"'),
 (91,
  u'0.052*"the" + 0.018*"of" + 0.017*"earth" + 0.016*"data" + 0.014*"for" + 0.013*"to" + 0.011*"and" + 0.009*"science" + 0.009*"string" + 0.008*"a"'),
 (87,
  u'0.028*"the" + 0.01

topic #25 (0.010): 0.045*"fertilizer" + 0.018*"and" + 0.016*"image" + 0.015*"the" + 0.014*"data" + 0.010*"00z" + 0.009*"p" + 0.009*"daac" + 0.009*"for" + 0.008*"soil"

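Clearly this needs a lot more fine-tuning: the topics are dominated by common words such as "the", "and" and "of", plus stray tokens such as "00" and a newline, so these should be removed before training.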
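
A minimal sketch of such a cleanup step, using only the standard library, is shown below. The stop list is a small hand-picked assumption, the documents are made up, and dropping every token that contains a digit is a blunt rule chosen for illustration.

```python
from collections import Counter

# Small hand-picked stop list; an assumption, extend as needed
STOP_WORDS = {"the", "and", "of", "in", "a", "for", "to", "not"}

# Hypothetical tokenized documents
docs = [
    ["the", "ceres", "radiometer", "data", "00"],
    ["the", "land", "cover", "data", "of", "earth"],
    ["water", "and", "soil", "data", "00z"],
]


def clean(doc):
    """Drop stop words and tokens that contain a digit."""
    return [t for t in doc
            if t not in STOP_WORDS and not any(ch.isdigit() for ch in t)]


cleaned = [clean(doc) for doc in docs]

# Document frequency: in how many documents each token occurs
doc_freq = Counter(t for doc in cleaned for t in set(doc))

# Also drop tokens present in every document, such as "data" here
n_docs = len(cleaned)
filtered = [[t for t in doc if doc_freq[t] < n_docs] for doc in cleaned]
print(filtered)
```

With gensim, a similar document-frequency cut could probably be applied through `Dictionary.filter_extremes` before building the corpus; suitable thresholds would need to be found by experiment.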

In [259]:
lda.save('lda_simple.m')

2018-02-14 01:11:54,553 : INFO : saving LdaState object under lda_simple.m.state, separately None
2018-02-14 01:11:54,637 : INFO : saved lda_simple.m.state
2018-02-14 01:11:54,746 : INFO : saving LdaModel object under lda_simple.m, separately ['expElogbeta', 'sstats']
2018-02-14 01:11:54,748 : INFO : not storing attribute id2word
2018-02-14 01:11:54,752 : INFO : storing np array 'expElogbeta' to lda_simple.m.expElogbeta.npy
2018-02-14 01:11:54,825 : INFO : not storing attribute state
2018-02-14 01:11:54,828 : INFO : not storing attribute dispatcher
2018-02-14 01:11:54,831 : INFO : saved lda_simple.m
