fastText models trainer

About

This repository contains a few scripts that automate training fastText word embedding models, using Wikipedia dumps as the corpus and following the instructions from the official documentation.

One thing that differs from those instructions is that we don't use the wikifil.pl script to extract and pre-process the text from the Wikipedia dumps, since it targets English and strips non-ASCII characters.

Instead, we use gensim's segment_wiki tool together with a custom pre-processing script (see bin/preprocess).
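
For reference, segment_wiki ships with gensim and can be run as a module; the dump and output filenames below are placeholders, not paths hard-coded by this repository:

$ python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

It emits one gzipped JSON record per article (title, section titles, section texts), which a pre-processing step like bin/preprocess can then flatten and normalize into the plain-text corpus fastText expects.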

Usage

Verify the configuration and training parameters:

$ editor conf/train.env
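
The file is an env-style configuration. The exact variables are defined by this repository, so the names below are only illustrative; they mirror common fastText training options:

$ cat conf/train.env
# Illustrative variables only; check the file for the real names.
MODEL=skipgram   # or cbow (fastText model type)
DIM=300          # embedding dimension (fastText -dim)
EPOCH=5          # number of training epochs (fastText -epoch)
MIN_COUNT=5      # discard words seen fewer times than this (fastText -minCount)
THREADS=8        # parallel training threads (fastText -thread)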

Set up or verify the corpora used to train the models:

$ editor conf/models.csv
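
Each row pairs a model with the corpus it is trained on. The column layout shown here is only a guess to illustrate the idea; the model name matches the files produced under resources/ below, and the URLs follow the standard Wikipedia dump naming scheme:

$ cat conf/models.csv
# Hypothetical layout, shown for illustration only.
wikipedia-en,https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wikipedia-fr,https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2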

Fetch dependencies and run the training process:

$ make train
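
Under the hood, the training step presumably boils down to the fastText CLI call described in the official documentation, roughly:

$ fasttext skipgram -input resources/datasets/wikipedia-en.txt -output resources/models/wikipedia-en -dim 300

The actual flags are taken from conf/train.env; -dim 300 here is just an example. fastText writes a binary model (.bin) and a plain-text vector file (.vec) next to the given -output prefix, which is what shows up under resources/models/ below.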

Use the models:

$ find resources/ -type f
resources/dumps/wikipedia-en.txt
resources/datasets/wikipedia-en.txt
resources/models/wikipedia-en.bin
resources/models/wikipedia-en.vec
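
The trained models can be queried directly with the fastText CLI (assuming the fasttext binary is on your PATH); the query word below is arbitrary:

$ echo "jazz" | fasttext print-word-vectors resources/models/wikipedia-en.bin
$ fasttext nn resources/models/wikipedia-en.bin

The .vec file is a plain-text word2vec-format file, so it can also be loaded with tools such as gensim's KeyedVectors.load_word2vec_format.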
