fastText Decrapifier

This tool removes foreign and non-language words from Facebook's fastText (https://fasttext.cc/) vec-file.
The configuration file can be easily customized to work with any language. If this project helps you with your language, please submit a pull request or share your changes with us.

 

Prerequisites

  • Python 3.6 or later
  • fastText executable
  • MySQL or MariaDB
  • Voikko spell checker
  • libvoikko library 4.3 or later

 

Installation

git clone https://github.com/mikkorautiainen/fasttext-decrapifier
cd fasttext-decrapifier
python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

 

Language selection

Copy one of the predefined config files to config.json.
If your language is missing, create your own language-specific config file.

For Finnish:

cp config-fi.json config.json

Or Japanese:

cp config-ja.json config.json

 

Data and fastText

The code expects the following files to be in the project root:

  1. fastText executable
  2. word vectors in bin-format
  3. word vectors in vec-format

You can symbolically link the files to the project root:

ln -s /usr/src/fastText/fasttext .
ln -s /data/cc.fi.300.bin .
ln -s /data/cc.fi.300.vec .

 

Running CLI

The decrapifier tool runs each non-language word removal step as a sub-command, selected with the --action option.

 

Database initialization

The database connection parameters are specified in config.json:

  "DATABASE": {
    "dbname":  "decrapper",
    "table": "garbwords",
    "user":  "root",
    "password": "",
    "host": "localhost",
    "port": "3306"
  }

After setting the user and password, run the "init" action to create the database and table.

python decrapper.py --action init
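
To illustrate how the DATABASE block shown above could be read from Python, here is a minimal sketch. The helper name load_db_settings is hypothetical and not part of the tool, and the sketch assumes "DATABASE" is a top-level key of config.json.

```python
import json


def load_db_settings(path="config.json"):
    """Return the DATABASE block of the config as a dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    db = dict(config["DATABASE"])
    # The example config stores the port as a string; many database drivers
    # expect an integer, so convert it here.
    db["port"] = int(db["port"])
    return db
```

Calling load_db_settings() with the example config would give a dict whose "port" value is the integer 3306.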

Regex

Finds non-language words using regular expressions.

python decrapper.py --action regex
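
The idea behind this step can be sketched as follows. The pattern below is an assumption made for illustration (Finnish letters, optionally joined by single hyphens); the patterns the tool actually uses may differ.

```python
import re

# Hypothetical pattern: a word is one or more runs of Finnish letters,
# optionally joined by single hyphens. Anything else counts as garbage.
WORD_RE = re.compile(r"[a-zåäö]+(?:-[a-zåäö]+)*", re.IGNORECASE)


def is_garbage(token):
    """Return True when the token does not look like a plain word."""
    return WORD_RE.fullmatch(token) is None


tokens = ["talo", "linja-auto", "http://example.com", "12345", "!!!"]
garbage = [t for t in tokens if is_garbage(t)]
```

Here garbage ends up holding the URL, the number, and the punctuation run, while the two Finnish words pass.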

Nearest neighbor iteration

Generates non-language garbage words and finds their nearest neighbors in the vec-file.

python decrapper.py --action nn_query
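
As a rough illustration of the nearest neighbor idea, the sketch below ranks words by cosine similarity over a tiny hand-made vocabulary. It is a brute-force stand-in, not the tool's own query mechanism; the function names and the toy vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, k=2):
    """Return the k words whose vectors are most similar to the query word."""
    scores = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return [word for _, word in sorted(scores, reverse=True)[:k]]


# Toy 2-dimensional vectors: two garbage-like tokens and two Finnish words.
vectors = {
    "xkcd12": [0.9, 0.1],
    "qwrt77": [0.8, 0.2],
    "talo": [0.1, 0.9],
    "koti": [0.2, 0.8],
}
neighbors = nearest_neighbors("xkcd12", vectors, k=1)
```

Garbage tokens tend to cluster together in the vector space, so querying with one known garbage word surfaces more candidates; in the toy data the closest neighbor of "xkcd12" is the other garbage token.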

Spell checker

The nearest neighbor iteration also picks up words that are rarely used but valid in the target language.
The spell checker removes these valid words from the garbage word table (garbwords) in the database.

python decrapper.py --action spell_checker
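
The rescue logic of this step can be sketched with any spell-check callable. The function rescue_valid_words and the stand-in dictionary are hypothetical; with libvoikko installed, a Voikko spell-check method could presumably be passed in instead, but that wiring is an assumption and is not shown.

```python
def rescue_valid_words(garbage_words, is_valid):
    """Split candidates into (still_garbage, rescued) using a spell-check callable."""
    rescued = {word for word in garbage_words if is_valid(word)}
    return garbage_words - rescued, rescued


# Stand-in for a real spell checker: a tiny set of known-good words.
known_words = {"talo", "koti"}
still_garbage, rescued = rescue_valid_words(
    {"talo", "xkcd12"}, lambda word: word in known_words
)
```

Only the words left in still_garbage would remain candidates for removal from the vec-file.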

Create vec-file

Checks every word in the vec-file against the database.
This sub-command creates a new vec-file with the non-language words excluded.

python decrapper.py --action remove
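
A minimal sketch of the filtering idea, assuming the standard fastText vec layout (a header line with the word count and dimension, then one word and its vector per line). Here the garbage words come from an in-memory set rather than the database, and filter_vec is a hypothetical name.

```python
def filter_vec(src_path, dst_path, garbage):
    """Copy a vec-file, skipping words in `garbage`, and rewrite the header count."""
    with open(src_path, encoding="utf-8") as src:
        # Header line: "<number of words> <vector dimension>"
        _, dim = src.readline().split()
        # Each remaining line starts with the word, followed by its vector.
        kept = [line for line in src if line.split(" ", 1)[0] not in garbage]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(kept)} {dim}\n")
        dst.writelines(kept)
    return len(kept)
```

This version holds all kept lines in memory for simplicity; for a multi-gigabyte vec-file, a two-pass approach (count first, then stream) would be gentler on RAM.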

Create vocabulary

(Optional step) Replaces each word's vector with the word's lexical category and plurality.
This sub-command creates a new tab-delimited text file with the uncased vocabulary and lexical information.

python decrapper.py --action vocabulary

 

License

This project is licensed under the MIT License - see the LICENSE file for details.