
fastText models trainer

About

This repository contains a few scripts that automate the training of fastText word embedding models, using Wikipedia dumps as the corpus and following the instructions from the official documentation.

One thing that differs from those instructions is that we don't use the wikifil.pl script to extract and pre-process the text from the Wikipedia dumps, since it is English-targeted and strips non-ASCII characters.

Instead, we use gensim's segment_wiki tool together with a custom pre-processing script (see bin/preprocess).
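
For reference, segment_wiki ships with gensim and can be invoked as a module on a raw dump. The invocation and dump file name below are only an illustrative sketch; the real run is driven by the scripts in this repository:

$ python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz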

Usage

Verify the configuration and training parameters:

$ editor conf/train.env
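
The variable names below are a hypothetical sketch of what such an env file typically holds (the fastText training mode plus its hyper-parameters); the authoritative list of keys is in conf/train.env itself:

# hypothetical example; check conf/train.env for the real keys
MODEL=skipgram      # fastText training mode: skipgram or cbow
DIM=300             # dimensionality of the word vectors
EPOCH=5             # number of training epochs
MIN_COUNT=5         # discard words appearing fewer times than this
THREADS=4           # parallel training threads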

Set up or verify the corpora used to train the models:

$ editor conf/models.csv
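
The exact column layout is defined by the repository; as a hypothetical sketch, an entry pairing a model name with the Wikipedia dump it is trained on could look like this:

# hypothetical example; check conf/models.csv for the actual columns
wikipedia-en,https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2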

Fetch dependencies and run the training process:

$ make train
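
How the Makefile wires everything up is specific to this repository, but the training step follows the official fastText documentation, whose core command looks like this (the paths shown here assume the resources layout listed below):

$ fasttext skipgram -input resources/datasets/wikipedia-en.txt -output resources/models/wikipedia-en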

Use the trained models:

$ find resources/ -type f
resources/dumps/wikipedia-en.txt
resources/datasets/wikipedia-en.txt
resources/models/wikipedia-en.bin
resources/models/wikipedia-en.vec
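
The .bin file is the full fastText model and can be queried directly with the fastText CLI, for example to interactively list the nearest neighbours of a word, while the .vec file is a plain-text word vector file usable with other tools such as gensim:

$ fasttext nn resources/models/wikipedia-en.bin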