Word embedding using fastText algorithm with EuroSense dataset

ibiscp/fastText

Sense Embeddings

The goal of this project is to implement and train the fastText model to create word embeddings, based on the paper 'Enriching Word Vectors with Subword Information' from Facebook's AI Research (FAIR) lab.
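The key idea of the paper is that a word's vector is the sum of the vectors of its character n-grams (plus the word itself), which lets the model build representations for rare and out-of-vocabulary words. A minimal sketch of the n-gram extraction, with boundary symbols `<` and `>` as in the paper (the function name and defaults are illustrative, not this repository's code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, padded with the
    boundary symbols '<' and '>' used by fastText."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

# The paper's example: the 3-grams of "where".
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Note that `her` here is distinct from the standalone word *her*, which would be represented as `<her>`.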

The dataset used for training was EuroSense, a multilingual sense-annotated resource covering 21 languages; only the English portion was used for this task.
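EuroSense annotates spans of each sentence with BabelNet synset ids. A common preprocessing step for sense embeddings, sketched below with a simplified, hypothetical annotation shape (not the actual EuroSense XML schema), is to replace each annotated anchor with a `lemma_synset` token so the model learns one vector per sense:

```python
def annotate(sentence, annotations):
    """Rewrite annotated anchors as lemma_synset tokens.

    annotations: list of (anchor, lemma, synset_id) tuples
    (a hypothetical shape, for illustration only).
    """
    for anchor, lemma, synset in annotations:
        sentence = sentence.replace(anchor, f"{lemma}_{synset}")
    return sentence

print(annotate(
    "The bank approved the loan",
    [("bank", "bank", "bn:00008364n")],
))  # The bank_bn:00008364n approved the loan
```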

For the correlation evaluation, the WordSimilarity-353 dataset is used.
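The standard WordSimilarity-353 evaluation computes, for each word pair, the cosine similarity between the two embeddings, then reports Spearman's rank correlation between those similarities and the human judgements. A self-contained sketch with toy vectors (the model and gold scores below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (assumes no ties,
    which is enough for a sketch)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy embeddings and gold scores (hypothetical); the real evaluation
# loads the trained model and the WordSim-353 pairs.
model = {"tiger": [1.0, 0.1], "cat": [0.9, 0.2], "car": [0.0, 1.0]}
gold = [("tiger", "cat", 7.35), ("tiger", "car", 1.0), ("cat", "car", 2.0)]
pred = [cosine(model[a], model[b]) for a, b, _ in gold]
print(spearman(pred, [s for *_, s in gold]))  # 1.0
```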

Training was done on a Google Compute Engine instance with a Tesla K80 GPU.


Figure: dimensionality reduction of the 40 words with the highest number of samples

Instructions

  • Generate dictionary

python preprocess.py [resource_folder] [file_name]

  • Train

python train.py [file_name]

  • Score

python train.py [resource_folder] [gold_file] [model_name]

  • Plot PCA

python pca.py [resource_folder] [filtered_vec_name]
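The last step projects the high-dimensional word vectors onto their top two principal components for plotting. A minimal sketch of that reduction using an SVD-based PCA (the function and data below are illustrative, not the contents of `pca.py`):

```python
import numpy as np

def pca_2d(vectors):
    """Project vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)           # center the data
    # SVD of the centered matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T              # 2-D coordinates for plotting

# Toy 3-D vectors stand in for the trained embeddings.
points = pca_2d([[1.0, 2.0, 0.0], [2.0, 4.1, 0.1], [3.0, 6.2, 0.0]])
print(points.shape)  # (3, 2)
```

The resulting 2-D points can then be scattered with a plotting library such as matplotlib, labelling each point with its word.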
