Build word embeddings in Bahasa Indonesia using skip-gram with negative sampling on a ~270M-word corpus. Use t-SNE to visualize the word vectors.


word-embedding

Create a word-embedding model using skip-gram with negative sampling. The model is trained on Wikipedia and OpenSubtitles corpora totaling ~270M words, with an output feature dimension of 256.
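Skip-gram with negative sampling replaces the full softmax over the vocabulary with a handful of binary classifications: each observed (center, context) pair is pushed toward label 1 while a few randomly drawn words are pushed toward 0. The following is a minimal NumPy sketch of that idea, not the repo's actual training code; real implementations draw negatives from a unigram^0.75 distribution, while for brevity this draws them uniformly.

```python
import numpy as np

def train_sgns(tokens, dim=256, window=2, neg=5, lr=0.025, epochs=3, seed=0):
    """Illustrative skip-gram/negative-sampling trainer on a flat token list."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = (rng.random((V, dim)) - 0.5) / dim   # target ("input") embeddings
    W_out = np.zeros((V, dim))                  # context ("output") embeddings
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    ids = [idx[w] for w in tokens]
    for _ in range(epochs):
        for pos, center in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for ctx in ids[lo:pos] + ids[pos + 1:hi]:
                # one observed (positive) pair plus `neg` uniform negatives
                pairs = [(ctx, 1.0)] + [(int(rng.integers(V)), 0.0)
                                        for _ in range(neg)]
                g_in = np.zeros(dim)  # accumulate gradient for the input vector
                for tgt, label in pairs:
                    score = sigmoid(W_in[center] @ W_out[tgt])
                    g = lr * (label - score)
                    g_in += g * W_out[tgt]
                    W_out[tgt] += g * W_in[center]
                W_in[center] += g_in
    return vocab, W_in
```

After training, `W_in` holds one 256-dimensional vector per vocabulary word; `W_out` is usually discarded.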

Requirements to use this program


To visualize word vectors using t-SNE, run:


$ python tsne_plot.py list_word

This selects ~500 words using list_word as seed keywords and creates a 2D plot of their word vectors.

Example :


$ python plot-tsne.py siang komputer sendu kaki mati apel relativitas emansipasi jokowi
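The dimensionality-reduction step inside the plotting script can be sketched as below; this is a minimal version assuming scikit-learn's TSNE, and the actual internals of the script (word selection around the keywords, model loading) are not shown here.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_2d(vectors, seed=0):
    """Reduce word vectors (n_words x 256) to 2D coordinates for plotting.
    perplexity must stay strictly below the number of points."""
    n = len(vectors)
    tsne = TSNE(n_components=2, perplexity=min(30, n - 1), init="pca",
                random_state=seed)
    return tsne.fit_transform(np.asarray(vectors))
```

The returned rows can then be scattered with matplotlib, annotating each point with its word so semantically related words appear as nearby labels.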

Plot example:

[t-SNE plot of the example words]

Corpus sources:

Wikipedia (~5.6GB) : https://dumps.wikimedia.org/idwiki/20200101/
OpenSubtitles (~702MB) : http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.id.gz
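Both dumps arrive as raw text that needs tokenizing before training; a hedged sketch of a streaming reader follows (the repo's actual cleaning pipeline is not shown, and the lowercase-letters-only regex is an assumption that works for Indonesian's ASCII orthography).

```python
import gzip
import re

def stream_tokens(path):
    """Yield one list of lowercase word tokens per non-empty corpus line.
    Streams the file so a multi-GB dump never has to fit in memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            tokens = re.findall(r"[a-z]+", line.lower())
            if tokens:
                yield tokens
```

Feeding the generator's output to the trainer sentence by sentence keeps memory flat even on the ~5.6GB Wikipedia dump.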
