word embedding #798

Closed
stevenbird opened this Issue Dec 1, 2014 · 16 comments

@stevenbird
Member

stevenbird commented Dec 1, 2014

Implement some word embedding algorithms

@dimazest
Contributor

dimazest commented Dec 2, 2014

Hi,

In my experience, a typical word vector workflow looks like this, regardless of whether the vectors are distributional (based on co-occurrence counts) or distributed (learned).

  1. Decide on the source corpus to build vectors from. I use the BNC and BNCReader for this (it's common to use Wikipedia or ukWaC, but as far as I know there are no readers for them).
  2. Decide on the target and context words (in the distributional approach). Target words are the words for which vectors are built; context words label the vector dimensions.
    • It's common to take the N most frequent words in the corpus as context words. Again, I don't know if there is a common approach to extract this information.
    • Words might be POS tagged.
    • Sometimes dependencies are taken into account.
  3. Extract co-occurrence statistics, or learn the model.
    • Here is how I do it. I've found Pandas very efficient and easy to use for this kind of data processing.
    • It's not clear what the best format for storing spaces is, as they become pretty large. I store spaces in HDF5 files using Pandas; they load pretty fast. An efficient data format is essential, because loading the published word2vec vectors with gensim takes quite a lot of time.
  4. Weight the model. scikit-learn provides most of the methods used in the related literature. Here is how I weight the space.
  5. Perform experiments. There are many options to compute similarity (e.g. cosine similarity or correlation); again, scipy and scikit-learn provide most of the functionality. Also, there are several common datasets that word vectors are evaluated on, so it would be nice to have an interface to them as well. (A sketch of steps 2-5 follows at the end of this comment.)

To sum up: word embeddings require fast IO operations. I don't think pickle would be sufficient (mainly because of version incompatibilities, and because it's now common for researchers to "release" their vectors). It would be cool if NLTK provided basic functionality and an API to

  • build a vector space
  • read and write it to a file format that is ready to be shared and published
  • perform basic weighting operations

On the other hand, NLTK should not reimplement what is already implemented in scikit-learn, scipy, gensim and dissect. Providing a unified API to them is probably a good idea.

P.S. I would be happy to work on this, and move relevant code from fowler.corpora to NLTK.
P.P.S. https://github.com/maciejkula/glove-python requires gcc 4.9!
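
As promised, a minimal sketch of steps 2-5 using plain numpy and scikit-learn. All names here are illustrative, not an existing NLTK API; sents is assumed to be an iterable of token lists from a corpus reader.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cooccurrence_counts(sents, targets, contexts, window=5):
    # Step 3: count target/context co-occurrences within a +/- window.
    t_index = {w: i for i, w in enumerate(targets)}
    c_index = {w: j for j, w in enumerate(contexts)}
    counts = np.zeros((len(targets), len(contexts)))
    for sent in sents:
        for i, word in enumerate(sent):
            if word not in t_index:
                continue
            lo = max(0, i - window)
            for j, neighbour in enumerate(sent[lo:i + window + 1], start=lo):
                if j != i and neighbour in c_index:
                    counts[t_index[word], c_index[neighbour]] += 1
    return counts

def ppmi(counts):
    # Step 4: positive pointwise mutual information weighting.
    total = counts.sum()
    joint = counts / total
    expected = (counts.sum(axis=1, keepdims=True) / total) * (counts.sum(axis=0, keepdims=True) / total)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(joint / expected)
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Step 5: cosine similarity between the first two target vectors.
# space = ppmi(cooccurrence_counts(sents, targets, contexts))
# cosine_similarity(space[[0]], space[[1]])[0, 0]

(For the HDF storage mentioned in step 3, something like pandas.DataFrame(space, index=targets, columns=contexts).to_hdf('space.h5', 'space') would do, assuming PyTables is installed.)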


@alvations
Contributor

alvations commented Feb 5, 2015

gensim has a range of very good vector space tools. I'm also working on a Python-based neural net with as few dependencies as possible. Currently I depend on scipy and numpy, and I want to cut scipy out because of its dependencies.

What if we "port" code from gensim? Would that be okay?


@kmike
Member

kmike commented Feb 5, 2015

Gensim is good! -1 for "porting" gensim code, because it means more code for NLTK to maintain, with no clear benefit.


@dimazest
Contributor

dimazest commented Feb 5, 2015

Agreed, it isn't worth reimplementing gensim inside NLTK.

However, would it make sense to implement functionality for word similarity tasks (for example, wordsim353) in NLTK? Should NLTK then depend on gensim?

Gensim's Dictionary is useful for deciding what the context words should be when building co-occurrence based vectors (see the sketch after the example below). But I didn't find code in it that builds a co-occurrence matrix.

It would be nice to do something like this:

>>> corpus = BNCReader(...)
>>> space = CooccurrenceSpace(
...    corpus,
...    targets=['car', 'bus'],
...    contexts=corpus.most_frequent_words(2000),
...    weighting='ppmi',
... )

>>> space.similarity('car', 'bus', measure='cosine')
0.82
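
For the context-word selection part, gensim's Dictionary already does the frequency bookkeeping. A small sketch (tokenized_sents stands in for the output of a corpus reader):

from gensim.corpora import Dictionary

# Keep only the 2000 most frequent word types as context words.
dictionary = Dictionary(tokenized_sents)
dictionary.filter_extremes(no_below=1, no_above=1.0, keep_n=2000)
contexts = list(dictionary.token2id)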
@alvations
Contributor

alvations commented Feb 5, 2015

CooccurrenceSpace sounds a little weird, but I get the idea.

Is the assumption for the space plain bag-of-words? Or should the module allow different spaces with various kinds of information and density, e.g. skip-grams or CBOW? (The C in CBOW stands for "continuous"; it does use a context window.) A gensim sketch of both architectures follows at the end of this comment.

A good thing is that we don't need compression modules, since a recent study has shown that no compression works best; but then it's a matter of 300k dimensions vs 500 dimensions, so users need bigger-than-laptop machines to train the space. See http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf.

From a user perspective, my code for vector-space projects uses a mish-mash of NLTK + gensim + scikit-learn. I'm not sure whether it's best practice, but it's the fastest way to churn out models.
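
For reference, in gensim the two architectures are one flag apart. A sketch, where sentences is assumed to be an iterable of token lists (the parameter is named size in the gensim of this era; gensim 4+ calls it vector_size):

from gensim.models import Word2Vec

cbow = Word2Vec(sentences, size=100, window=5, min_count=5, sg=0)      # CBOW
skipgram = Word2Vec(sentences, size=100, window=5, min_count=5, sg=1)  # skip-gram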


@dimazest
Contributor

dimazest commented Feb 5, 2015

The class name doesn't matter that much. The main idea is to make it easy to build the spaces used in the literature. It would be cool to have a common API that allows building common distributional (or distributed) models.


@alvations
Contributor

alvations commented Feb 6, 2015

At some point, we should make the module scale to this: http://www.cs.bgu.ac.il/~yoavg/publications/acl2014syntemb.pdf


@longdt219
Contributor

longdt219 commented Mar 13, 2015

Hi all,
Here are some thoughts:

(1) Using word2vec through gensim is a good start. However, having a common API for word representations, either distributional or distributed, is crucial, since this is the main contribution. There is a big comparison of word embedding methods at wordvectors.org/index.php

(2) We can incorporate some pre-trained English models into our corpora.
(3) Allow loading external word embeddings (binary and text formats); see the sketch after this list.
(4) Allow training the model using various configurations.
(5) Allow saving in binary / text format (currently gensim uses pickle).
(6) Cross-lingual word embeddings are a possible extension (http://arxiv.org/pdf/1404.4641.pdf)
(7) Go beyond words: phrase embeddings, sentence embeddings, document embeddings. These are already implemented in gensim (http://radimrehurek.com/2014/12/doc2vec-tutorial/)
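
A sketch for points (3) and (5), using gensim's existing word2vec-format IO (the filenames are illustrative; in gensim 4+ these methods live on KeyedVectors):

from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('vectors.bin', binary=True)  # binary format
model.save_word2vec_format('vectors.txt', binary=False)           # text format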


@edloper referenced this issue in nltk/nltk_book on Mar 19, 2015: Add NLTK support for embedding #131 (open)

@stevenbird
Member

stevenbird commented Apr 13, 2015

@longdt219 is there a good default pre-trained model that we could include with NLTK-data? I can add it, and then you can provide sample code for using it: load the model, then given some input, find the closest word, or compare two words, etc.
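
Something along these lines, assuming a word2vec-format model file shipped in nltk_data (the path is illustrative):

from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('pruned.word2vec.txt', binary=False)
print(model.most_similar('dog', topn=3))  # closest words to a given word
print(model.similarity('dog', 'cat'))     # similarity score for two words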


@stevenbird
Member

stevenbird commented May 1, 2015

@longdt219 can we please have an example based on the word2vec tutorial? Create a good model using one of the word2vec datasets, and prune the less-frequent words so that the model size is under 100MB. Then we need the code for training this, and for using it to do one of the standard word similarity tasks.
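
One way to meet the size budget is to raise min_count at training time, so rare words never enter the vocabulary. A sketch using the text8 corpus from the word2vec tutorial (min_count=50 is a guess to be tuned against the 100MB limit):

from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

model = Word2Vec(Text8Corpus('text8'), size=100, min_count=50)  # drop rare words
model.save_word2vec_format('pruned.word2vec.txt', binary=False)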


@longdt219
Contributor

longdt219 commented May 5, 2015

Created PR #971


@stevenbird
Member

stevenbird commented May 7, 2015

Sorry not to be clearer. The best way to provide an example is to create a file nltk/test/gensim.doctest. Then you can explain the steps in prose, and provide doctest examples etc. The result will then appear at: http://www.nltk.org/howto/

@longdt219 would you please rework your example to be a doctest file instead of a package?
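
For reference, a doctest file is just prose interleaved with interpreter sessions. A minimal sketch of what nltk/test/gensim.doctest could contain (the model path is illustrative):

    The model can be queried for the similarity of two words:

        >>> from gensim.models import Word2Vec
        >>> model = Word2Vec.load_word2vec_format('pruned.word2vec.txt')
        >>> model.similarity('university', 'school') > 0.3
        True

Comparing against a threshold keeps the expected output deterministic across runs.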


@longdt219
Contributor

longdt219 commented May 7, 2015

Sure @stevenbird.
Do you think we should provide a visualization of the embeddings? It would be a cool thing to do. t-SNE is the method most people use, and its reference implementation is free for non-commercial purposes.
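
If we go that way, scikit-learn ships a BSD-licensed t-SNE implementation, so only matplotlib would be an extra dependency. A sketch (model and the word list are assumed to exist):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ['car', 'bus', 'dog', 'cat', 'paris', 'london']
xy = TSNE(n_components=2, perplexity=2).fit_transform([model[w] for w in words])
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), word in zip(xy, words):
    plt.annotate(word, (x, y))  # label each point with its word
plt.show()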


@stevenbird
Member

stevenbird commented May 7, 2015

@longdt219 that sounds like a good thing to include, thanks.
Note that the word embedding data is now in the repository, but with a more specific name; please see nltk/nltk_data@a6db693


@longdt219
Contributor

longdt219 commented Aug 6, 2015

Hi @stevenbird,
Could you close this issue? It is resolved by PR #971.


@stevenbird
Member

stevenbird commented Aug 6, 2015

Thanks @longdt219

@stevenbird stevenbird closed this Aug 6, 2015
