fastText Decrapifier

This tool removes foreign and non-language words from a vec-file produced by Facebook's fastText (https://fasttext.cc/).
The configuration file can be customized to work with any language. If this project helps you with your language, please submit a pull request or share your changes with us.

Prerequisite
Installation
Language selection
Data and fastText
Running CLI
License

Prerequisite

Python 3.6 or later
fastText executable
MySQL or MariaDB
Voikko spell checker
libvoikko library 4.3 or later

Installation

git clone https://github.com/mikkorautiainen/fasttext-decrapifier
cd fasttext-decrapifier
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Language selection

Copy one of the predefined config files to config.json.
If your language is missing, create your own language-specific config file.

For Finnish:

cp config-fi.json config.json

Or Japanese:

cp config-ja.json config.json

Data and fastText

The code expects the following files to be in the project root:

fastText executable
word vectors in bin-format
word vectors in vec-format

You can symbolically link the files to the project root:

ln -s /usr/src/fastText/fasttext .
ln -s /data/cc.fi.300.bin .
ln -s /data/cc.fi.300.vec .
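
Before running any step, it can help to confirm that the three files are actually reachable from the project root. The following is a minimal sketch, not part of the tool: the helper name missing_files is hypothetical, and the file names are taken from the symlink example above, so adjust them for your own language.

```python
from pathlib import Path

# File names follow the symlink example; change them to match your own vectors.
REQUIRED_FILES = ("fasttext", "cc.fi.300.bin", "cc.fi.300.vec")


def missing_files(root, names=REQUIRED_FILES):
    """Return the names that do not resolve to an existing file under root."""
    root = Path(root)
    # Path.is_file() follows symlinks, so a dangling link counts as missing.
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing from project root:", ", ".join(missing))
    else:
        print("All expected files are present.")
```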

Running CLI

The decrapifier tool runs each non-language word removal step as a sub-command, selected with the --action option.

Database initialization

The database connection parameters are specified in config.json:

  "DATABASE": {
    "dbname":  "decrapper",
    "table": "garbwords",
    "user":  "root",
    "password": "",
    "host": "localhost",
    "port": "3306"
  }

After setting the user and password, run the "init" action to create the database and table.

python decrapper.py --action init
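
To illustrate how the DATABASE section could be read from a script, here is a small sketch. It assumes config.json is a single JSON object containing the DATABASE block shown above. The function name load_db_params is hypothetical and this is not necessarily how the tool itself loads its settings.

```python
import json


def load_db_params(config_text):
    """Parse config.json text and return the DATABASE section with an integer port."""
    config = json.loads(config_text)
    db = dict(config["DATABASE"])
    # The sample config stores the port as a string; database drivers
    # typically expect an integer, so convert it here.
    db["port"] = int(db["port"])
    return db


# Sample text mirroring the DATABASE block from this README.
SAMPLE = """
{
  "DATABASE": {
    "dbname": "decrapper",
    "table": "garbwords",
    "user": "root",
    "password": "",
    "host": "localhost",
    "port": "3306"
  }
}
"""

params = load_db_params(SAMPLE)
print(params["host"], params["port"])  # prints: localhost 3306
```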

Regex

Finds non-language words using regular expressions.

python decrapper.py --action regex
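
As a rough idea of what a regex filter for this step might look like, the sketch below flags any token that is not made purely of Finnish letters. The pattern and the name is_garbage are assumptions made for illustration; the project's actual patterns may differ.

```python
import re

# Hypothetical pattern: letters only (including å, ä, ö), optionally joined
# by single hyphens. Anything else is treated as a non-language token.
WORD_RE = re.compile(r"[a-zåäö]+(?:-[a-zåäö]+)*", re.IGNORECASE)


def is_garbage(token):
    """Return True when the token does not look like a plain word."""
    return WORD_RE.fullmatch(token) is None


for token in ["talo", "linja-auto", "abc123", "http://example.com"]:
    print(token, is_garbage(token))
```

A whitelist of allowed characters is used here rather than a blacklist, so URLs, numbers, and mixed-script tokens are all caught by the same rule.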

Nearest neighbor iteration

Generates non-language garbage words and finds their nearest neighbors in the vec-file.

python decrapper.py --action nn_query
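
The idea behind this step is that garbage words tend to cluster together in vector space, so the neighbors of a known garbage word are good garbage candidates. The sketch below shows that idea with plain cosine similarity on invented two-dimensional vectors. It is a stand-in for illustration and not necessarily how the tool queries neighbors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, k=2):
    """Return the k words most similar to query, excluding query itself."""
    target = vectors[query]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy vectors, invented for illustration only.
VECTORS = {
    "xkcd123": [1.0, 0.0],
    "qwerty99": [0.9, 0.1],
    "asdfgh": [0.8, 0.3],
    "talo": [0.0, 1.0],
}

print(nearest_neighbors("xkcd123", VECTORS))  # prints: ['qwerty99', 'asdfgh']
```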

Spell checker

The nearest neighbor iteration can also pick up words that are rarely used but valid in the target language.
The spell checker removes these valid words from the garbage word table (garbwords) in the database.

python decrapper.py --action spell_checker
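
The filtering logic can be sketched as follows. The checker is passed in as a function so the example runs without libvoikko. The names drop_valid_words and toy_spell are hypothetical, and the toy word list stands in for a real spell checker.

```python
def drop_valid_words(garbage_words, is_valid_word):
    """Return only the candidates that the spell checker does not accept."""
    return [word for word in garbage_words if not is_valid_word(word)]


# Stand-in checker for illustration. With libvoikko installed, a real
# checker such as Voikko("fi").spell could be passed instead (assumption).
KNOWN_WORDS = {"talo", "lumikenkä"}


def toy_spell(word):
    return word.lower() in KNOWN_WORDS


candidates = ["lumikenkä", "qwerty99", "asdfgh"]
print(drop_valid_words(candidates, toy_spell))  # prints: ['qwerty99', 'asdfgh']
```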

Create vec-file

Checks every word in the vec-file against the database.
This sub-command creates a new vec-file with the non-language words excluded.

python decrapper.py --action remove
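
For reference, a vec-file starts with a header line holding the word count and the vector dimension, followed by one line per word. The sketch below filters such a file against a set of garbage words and rewrites the header with the new count. It reads the garbage words from an in-memory set rather than the database, and the name filter_vec is hypothetical.

```python
import io


def _keep(line, garbage):
    """True when the line holds a word that is not in the garbage set."""
    word = line.split(" ", 1)[0].strip()
    return bool(word) and word not in garbage


def filter_vec(src, dst, garbage):
    """Copy a vec-format stream to dst without garbage words; return kept count."""
    dim = src.readline().split()[1]
    # First pass: count surviving words so the new header can be written first.
    kept = sum(1 for line in src if _keep(line, garbage))
    dst.write(f"{kept} {dim}\n")
    # Second pass: rewind, skip the old header, and copy surviving lines.
    src.seek(0)
    src.readline()
    for line in src:
        if _keep(line, garbage):
            dst.write(line)
    return kept


source = io.StringIO("3 2\ntalo 0.1 0.2\nqwerty99 0.3 0.4\nkissa 0.5 0.6\n")
target = io.StringIO()
filter_vec(source, target, {"qwerty99"})
print(target.getvalue())
```

Two passes are used so that the file never has to be held in memory, which matters for vec-files with millions of words.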

Create vocabulary

(Optional step) Replaces the word-vectors with the word’s lexical category and plurality.
This sub-command creates a new tab-delimited text file with the uncased vocabulary and lexical information.

python decrapper.py --action vocabulary
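
A tab-delimited vocabulary of this kind could be produced roughly as shown below. The column order, the labels, and the lookup function are assumptions made for illustration. The tool's actual output format may differ.

```python
import csv
import io


def write_vocabulary(words, lookup, out):
    """Write one lowercased word per row with its category and plurality."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    seen = set()
    for word in words:
        key = word.lower()
        if key in seen:
            continue  # uncased vocabulary: "Talo" and "talo" share one row
        seen.add(key)
        category, number = lookup(key)
        writer.writerow([key, category, number])


# Hypothetical lexical data standing in for a real morphological analyzer.
LEXICON = {"talo": ("noun", "singular"), "kissat": ("noun", "plural")}

out = io.StringIO()
write_vocabulary(["Talo", "talo", "kissat"], LEXICON.get, out)
print(out.getvalue())
```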

License

This project is licensed under the MIT License - see the LICENSE file for details.
