SGNS-WNE : The word-like n-gram embedding version of skip-gram model with negative sampling

SGNS-WNE is an open-source implementation of our framework for learning distributed representations of words by embedding word-like character n-grams, described in the following paper:

Word-like character n-gram embedding http://aclweb.org/anthology/W18-6120

Requirements & Environment

Linux (tested with CentOS Linux release 7.4.1708)
gcc (>= 5)
hdf5
Python 3
NumPy
Pandas
h5py
scikit-learn
tqdm
cmdline : Download cmdline.h and place it in 2_count_ngram_frequency/, 4_count_expected_word_frequency/ and 5_SGNS_WNE/

Contents

1_preprocess/ : Pre-process the corpus. Sentences are concatenated and white spaces are replaced with another character for visualization.
2_count_ngram_frequency/ : Count n-gram frequencies. This implementation uses the lossy counting algorithm.
3_logistic_regression/ : Probabilistic predictor for word boundaries.
4_count_expected_word_frequency/ : Count the expected word frequency (ewf) of word-like n-grams.
5_SGNS_WNE/ : Compute distributed representations of word-like n-grams via the skip-gram model with negative sampling.
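
The first two steps can be illustrated with a minimal Python sketch. It is not the code in 1_preprocess/ or 2_count_ngram_frequency/ (the latter is written in C++). The function names, the "_" replacement symbol, the epsilon parameter, and the choice to feed all n-gram lengths into one stream are assumptions made for illustration only. The counting part follows the standard lossy counting scheme: entries whose count plus error bound falls to or below the current bucket index are pruned at each bucket boundary.

```python
import math
import re


def preprocess(sentences, space_symbol="_"):
    """Concatenate sentences and replace white space with a visible symbol.

    Assumption: sentences are joined without a separator, and "_" is only
    a stand-in for whatever symbol the real script uses.
    """
    text = "".join(sentence.strip() for sentence in sentences)
    return re.sub(r"\s", space_symbol, text)


def lossy_count_ngrams(text, max_n=2, epsilon=0.1):
    """Approximate character n-gram counts with lossy counting.

    Each surviving count underestimates the true count by at most
    epsilon * (number of n-grams seen).
    """
    width = math.ceil(1.0 / epsilon)  # bucket width
    counts = {}                       # n-gram -> [frequency, max error]
    seen = 0
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            seen += 1
            bucket = math.ceil(seen / width)  # current bucket index
            if gram in counts:
                counts[gram][0] += 1
            else:
                counts[gram] = [1, bucket - 1]
            if seen % width == 0:
                # Bucket boundary: drop entries that cannot be frequent.
                stale = [g for g, (f, d) in counts.items() if f + d <= bucket]
                for g in stale:
                    del counts[g]
    return {g: f for g, (f, _) in counts.items()}


if __name__ == "__main__":
    corpus = preprocess(["a b", "c d\n"])
    print(corpus)
    print(lossy_count_ngrams(corpus, max_n=2, epsilon=0.2))
```

A smaller epsilon keeps more n-grams and gives tighter counts at the cost of memory, which is the trade-off that makes lossy counting suitable for large corpora.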

.
├── 1_preprocess
│   └── main.py
├── 2_count_ngram_frequency
│   ├── cmdline.h
│   ├── lossycounting.cpp
│   ├── lossycounting.h
│   ├── main.cpp
│   ├── makefile
│   └── run.sh
├── 3_logistic_regression
│   └── main.py
├── 4_count_expected_word_frequency
│   ├── cmdline.h
│   ├── counting_word.cpp
│   ├── counting_word.h
│   ├── main.cpp
│   ├── makefile
│   └── run.sh
├── 5_SGNS_WNE
│   ├── cheaprand.h
│   ├── cmdline.h
│   ├── main.cpp
│   ├── makefile
│   ├── run.sh
│   ├── skipgram.cpp
│   └── skipgram.h
└── README.md

Submodules & Dependencies

Most of the C++ code used to compute n-gram representations with SGNS is taken from word2vec (Google Code) [1] and w2v-sembei [2].

References

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of ICLR 2013.
[2] Oshikiri, T. (2017). Segmentation-Free Word Embedding for Unsegmented Languages. In Proceedings of EMNLP 2017.
[3] Kudo, T., Yamamoto, K., & Matsumoto, Y. (2004). Applying Conditional Random Fields to Japanese Morphological Analysis. In Proceedings of EMNLP 2004.
[4] MeCab: Yet Another Part-of-Speech and Morphological Analyzer.
