Creates a word-embedding model using skip-gram with negative sampling. The model is trained on the Indonesian Wikipedia and OpenSubtitles corpora (roughly 270M words in total) with an output feature dimension of 256.
- Python 2 or 3
- Gensim
- sklearn
- w2v-model
$ python tsne_plot.py list_word
This collects roughly 500 words related to each keyword in list_word and creates a 2D t-SNE plot of their word vectors.
$ python tsne_plot.py siang komputer sendu kaki mati apel relativitas emansipasi jokowi
Wikipedia (~5.6GB) : https://dumps.wikimedia.org/idwiki/20200101/
OpenSubtitles (~702MB) : http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.id.gz