# philo2vec

A TensorFlow implementation of word2vec applied to the [Stanford Encyclopedia of Philosophy](http://plato.stanford.edu/). The implementation supports both CBOW and skip-gram models.

For more background, have a look at these papers:

* [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) (Mikolov et al., 2013)
* [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546) (Mikolov et al., 2013)

After training, the model returns some interesting results; see the *Some interesting results* section below. For example:

Evaluating hume - empiricist + rationalist:

descartes
malebranche
spinoza
hobbes
herder

*(plot of the trained word embeddings)*

## Some interesting results

### Similarities

Similar words to death:

untimely
ravages
grief
torment

Similar words to god:

divine
De Providentia
christ
Hesiod

Similar words to love:

friendship
affection
christ
reverence

Similar words to life:

career
live
lifetime
community
society

Similar words to brain:

neurological
senile
nerve
nervous

### Operations

Evaluating hume - empiricist + rationalist:

descartes
malebranche
spinoza
hobbes
herder

Evaluating ethics - rational:

hiroshima

Evaluating ethic - reason:

inegalitarian
anti-naturalist
austere

Evaluating moral - rational:

commonsense

Evaluating life - death + love:

self-positing
friendship
care
harmony

Evaluating death + choice:

regret
agony
misfortune
impending

Evaluating god + human:

divine
inviolable
yahweh
god-like
man

Evaluating god + religion:

amida
torah
scripture
buddha
sokushinbutsu

Evaluating politic + moral:

rights-oriented
normative
ethics
integrity

The repo contains:

* `PlatoData`: an object to crawl data from the philosophy encyclopedia.
* `VocabBuilder`: an object to build the vocabulary based on the crawled data.
* `Philo2Vec`: the model that computes the continuous distributed representations of words.

## Installation

The dependencies used for this module can be easily installed with pip:

```
pip install -r requirements.txt
```

The params for the `VocabBuilder`:

* `min_frequency`: the minimum frequency a word must have to be used in the model.
* `size`: the size of the vocabulary; the model then uses the `size` most frequent words (see the sketch after this list).
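A minimal sketch of what that filtering amounts to, assuming only the two params above (the actual `VocabBuilder` in this repo may track more state):

```python
from collections import Counter

def build_vocab(tokens, min_frequency=5, size=10000):
    # keep the `size` most frequent words that occur at least `min_frequency`
    # times; everything else maps to a catch-all UNK id (a common word2vec
    # convention, assumed here for illustration)
    counts = Counter(tokens)
    kept = [w for w, c in counts.most_common(size) if c >= min_frequency]
    word_to_id = {'UNK': 0}
    for w in kept:
        word_to_id[w] = len(word_to_id)
    return word_to_id
```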

The hyperparams of the model:

* `optimizer`: an instance of a TensorFlow `Optimizer`, such as `GradientDescentOptimizer`, `AdagradOptimizer`, or `MomentumOptimizer`.
* `model`: the model used to create the vectorized representation; possible values: `CBOW`, `SKIP_GRAM`.
* `loss_fct`: the loss function used to calculate the error; possible values: `SOFTMAX`, `NCE`.
* `embedding_size`: the dimensionality of the word embeddings.
* `neg_sample_size`: the number of negative samples drawn for each positive sample.
* `num_skips`: the number of skips for a `SKIP_GRAM` model, i.e. how many context words are predicted per target (see the sketch after this list).
* `context_window`: the window size used to create the context for calculating the vector representations: `[context_window target context_window]`.
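To make `context_window` and `num_skips` concrete, here is a hedged sketch of how training pairs could be derived from a token stream (illustrative only; the actual batching logic lives in `models.py` and may differ):

```python
import random

def cbow_pairs(tokens, context_window=2):
    # CBOW: predict the target word from its surrounding window
    # [context_window words, target, context_window words]
    pairs = []
    for i in range(context_window, len(tokens) - context_window):
        context = tokens[i - context_window:i] + tokens[i + 1:i + 1 + context_window]
        pairs.append((context, tokens[i]))
    return pairs

def skip_gram_pairs(tokens, context_window=2, num_skips=4):
    # skip-gram: predict up to `num_skips` sampled context words from each target
    pairs = []
    for i in range(context_window, len(tokens) - context_window):
        context = tokens[i - context_window:i] + tokens[i + 1:i + 1 + context_window]
        for c in random.sample(context, min(num_skips, len(context))):
            pairs.append((tokens[i], c))
    return pairs
```

With `context_window=2` and `num_skips=4`, each target word yields four (target, context) pairs drawn from the four words around it.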

## Quick usage

`Philo2Vec`, `VocabBuilder`, `get_data`, and `StemmingLookup` all come from this repo's modules. Training a CBOW model with the NCE loss:

```python
params = {
    'model': Philo2Vec.CBOW,
    'loss_fct': Philo2Vec.NCE,
    'context_window': 5,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```

Training a skip-gram model with the softmax loss:

```python
params = {
    'model': Philo2Vec.SKIP_GRAM,
    'loss_fct': Philo2Vec.SOFTMAX,
    'context_window': 2,
    'num_skips': 4,
    'neg_sample_size': 2,
}
x_train = get_data()
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]
vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```

## About stemming

Since the words are stemmed as part of the preprocessing, some operations are sometimes necessary:

```python
StemmingLookup.stem('religious')  # returns "religi"

StemmingLookup.original_form('religi')  # returns "religion"
```
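For illustration, a plausible sketch of the idea behind `StemmingLookup`, assuming a Porter-style stemmer plus a reverse map from each stem to the original forms that produced it (the real class in `preprocessors.py` may resolve originals differently):

```python
from collections import Counter, defaultdict
from nltk.stem import PorterStemmer

class StemmingLookupSketch:
    """Stem words and remember which original forms produced each stem."""
    _stemmer = PorterStemmer()
    _originals = defaultdict(Counter)  # stem -> Counter of original forms

    @classmethod
    def stem(cls, word):
        stemmed = cls._stemmer.stem(word)
        cls._originals[stemmed][word] += 1
        return stemmed

    @classmethod
    def original_form(cls, stemmed):
        # return the most frequently seen original form for this stem,
        # falling back to the stem itself if it was never recorded
        candidates = cls._originals.get(stemmed)
        return candidates.most_common(1)[0][0] if candidates else stemmed
```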

## Getting similarities

```python
pv.get_similar_words(['rationalist', 'empirist'])
```

## Evaluating operations

```python
pv.evaluate_operation('moral - rational')
```
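Both `get_similar_words` and `evaluate_operation` boil down to nearest-neighbour search over the learned embedding matrix. A minimal sketch of that computation, assuming a numpy matrix `embeddings` of shape `(vocab_size, embedding_size)` and `word_to_id` / `id_to_word` lookup dicts (all hypothetical names, not the repo's actual internals):

```python
import numpy as np

def nearest_words(query_vec, embeddings, id_to_word, top_k=5):
    # cosine similarity between the query and every row-normalized embedding
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = normed.dot(query)
    return [id_to_word[i] for i in np.argsort(-scores)[:top_k]]

def evaluate_difference(word_a, word_b, embeddings, word_to_id, id_to_word):
    # 'moral - rational' reduces to plain vector arithmetic on the embeddings,
    # followed by a nearest-neighbour lookup of the resulting vector
    vec = embeddings[word_to_id[word_a]] - embeddings[word_to_id[word_b]]
    return nearest_words(vec, embeddings, id_to_word)
```

A real implementation would typically also exclude the query words themselves from the returned neighbours.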

## Plotting vectorized words

```python
pv.plot(['hume', 'empiricist', 'descart', 'rationalist'])
```
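`plot` presumably projects the selected word vectors down to two dimensions before drawing them. A hedged sketch of one way to do that with scikit-learn's t-SNE and matplotlib (the actual implementation in `models.py` may use a different projection):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_words(words, embeddings, word_to_id):
    # project the chosen embeddings to 2-D and label each point with its word
    vectors = embeddings[[word_to_id[w] for w in words]]
    points = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), word in zip(points, words):
        plt.annotate(word, xy=(x, y))
    plt.show()
```

(`perplexity` must stay below the number of plotted words for t-SNE to run.)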

## Training details

skip_gram:

*(plots of the skip_gram loss, embeddings, weights, and biases)*

cbow:

*(plots of the cbow loss, embedding, weights, and biases)*
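For context, the plotted `loss`, `embeddings`, `w`, and `b` correspond to the variables of a word2vec graph. A minimal TensorFlow 1.x-style sketch of the NCE variant (shapes and names are illustrative, not the repo's actual code):

```python
import tensorflow as tf

# hypothetical sizes mirroring the hyperparams above
vocab_size, embedding_size, neg_sample_size = 20000, 128, 64

train_inputs = tf.placeholder(tf.int32, shape=[None])
train_labels = tf.placeholder(tf.int32, shape=[None, 1])

# the embedding matrix is what gets visualized after training
embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# w and b are the weights and biases of the output (NCE) layer
nce_w = tf.Variable(tf.truncated_normal([vocab_size, embedding_size],
                                        stddev=1.0 / embedding_size ** 0.5))
nce_b = tf.Variable(tf.zeros([vocab_size]))

loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_w, biases=nce_b,
                                     labels=train_labels, inputs=embed,
                                     num_sampled=neg_sample_size,
                                     num_classes=vocab_size))
```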