NLP tools for Keras

A set of tools for natural language processing using Keras.

MIMICK embeddings

An implementation of the paper "Mimicking Word Embeddings using Subword RNNs". MIMICK avoids the OOV (out-of-vocabulary) problem by imitating the original pre-trained word embeddings with a small character-based model, which makes <UNK> word embeddings unnecessary. Benefits (in comparison with FastText, for example): very low resource requirements and a significant boost in accuracy due to the absence of <UNK> in your training data.
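To make the idea concrete, here is a minimal sketch of how training data for a character-based mimicking model could be prepared: each known word is turned into a sequence of character indices, and its pre-trained vector becomes the regression target. The embedding values, the `encode` helper, and the `max_len` parameter are all hypothetical and chosen for illustration; this is not necessarily how kenlp encodes its input.

```python
# Toy stand-in for a pre-trained embedding table (hypothetical values).
pretrained = {
    "cat": [0.1, 0.3],
    "cart": [0.2, 0.1],
    "art": [0.0, 0.5],
}

PAD, UNK_CHAR = 0, 1  # reserved indices: padding and unseen characters

# Character vocabulary built from the known words only.
chars = sorted({c for word in pretrained for c in word})
char_index = {c: i + 2 for i, c in enumerate(chars)}


def encode(word, max_len=6):
    """Map a word to a fixed-length list of character indices (hypothetical helper)."""
    ids = [char_index.get(c, UNK_CHAR) for c in word[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Training pairs: character sequence in, pre-trained vector out.
inputs = [encode(w) for w in pretrained]
targets = [pretrained[w] for w in pretrained]

# An OOV word is encoded the same way, so a trained model can still
# produce a vector for it instead of falling back to <UNK>.
oov_input = encode("carts")
print(oov_input)
```

A character-level recurrent model trained on such pairs only needs a character vocabulary, not a word vocabulary, which is why it can produce a vector for any spelling.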

A short example that mimics word embeddings for OOV words and shows which known words are closest to the produced vector:

from kenlp.mimic import MimicEmbedding, pre_trained_mimic_model_path
from pathlib import Path

# Load the polyglot embeddings together with the pre-trained English MIMICK model
mem = MimicEmbedding.load(
    str(Path.home() / 'polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2'),
    pre_trained_mimic_model_path('en'))
while True:
    w = input('Enter a word: ')
    if w not in mem.embedding:
        print(f'{w!r} is an unknown word which has to be mimicked')
    # Take the vector for the single requested word
    word_vec = mem.get_vectors([w])[-1]
    # Look up the 10 known words closest to that vector
    sim_words = mem.nearest_to_vector(word_vec, 10)
    print('similar words:', sim_words, '\n')

Running the example above with some fictional words:

Enter a word: trintiful
'trintiful' is an unknown word which has to be mimicked
similar words: ['banal', 'cliched', 'preposterous', 'horrifying', 'horrid', 'baffling', 'disagreeable', 'pedantic', 'farcical', 'ludicrous']

Enter a word: unfrogable
'unfrogable' is an unknown word which has to be mimicked
similar words: ['unmanageable', 'unwelcome', 'unbreakable', 'irreplaceable', 'unsatisfying', 'anachronistic', 'unrestrained', 'outmoded', 'unfashionable', 'unattainable']

Enter a word: Kaa
'Kaa' is an unknown word which has to be mimicked
similar words: ['Ung', 'Tham', 'Kuang', 'Chon', 'Teh', 'Kor', 'Kum', 'Gam', 'Loh', 'Mah']

Enter a word: Karll
'Karll' is an unknown word which has to be mimicked
similar words: ['Helmer', 'Gerhardt', 'Bastian', 'Henrich', 'Tilman', 'Bosse', 'Laurin', 'Hartwig', 'Burkhard', 'Cori']

Enter a word: abroktose
'abroktose' is an unknown word which has to be mimicked
similar words: ['phosphine', 'adenine', 'albumin', 'oxalate', 'osmium', 'IgG', 'capsaicin', 'casein', 'amide', 'azide']

The word "trintiful" is fictional, but MIMICK "understands" that it is probably some kind of adjective. It assumed that "unfrogable" is a negating adjective. It was also able to guess that "Kaa" looks like an Asian name and that "Karll" is something European. "abroktose" turns out to be something vaguely scientific, probably related to chemistry or biology, which also makes sense.
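The "similar words" lists above come from a nearest-neighbour search in the embedding space. As a rough illustration of that kind of lookup, the sketch below ranks known words by cosine similarity to a query vector. The toy table and the `cosine` and `nearest_words` helpers are hypothetical, and the metric is an assumption: kenlp's `nearest_to_vector` may work differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(vector, table, k=3):
    """Return the k words in `table` whose vectors are most similar to `vector`."""
    ranked = sorted(table, key=lambda w: cosine(vector, table[w]), reverse=True)
    return ranked[:k]


# Toy embedding table (hypothetical values, two dimensions for readability).
table = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.1],
    "bad": [-1.0, 0.0],
}

# A vector produced for an unknown word would be queried the same way.
closest = nearest_words([0.8, 0.2], table, k=2)
print(closest)
```

Because the mimicked vector lives in the same space as the pre-trained ones, the same lookup works for known and unknown words alike.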

The code already contains pre-trained models for a few languages (see the example above on how to use them). To train a MIMICK model for a new language, first download polyglot's embeddings for that language and then run a command like this:

python -m kenlp.mimic --save new_mimic_model.h5 --embeddings <path to downloaded polyglot embedding>

Installation

Assuming you use virtualenv or venv, installation on Ubuntu >= 16.04 looks like this:

sudo apt-get install libicu-dev
git clone https://github.com/kpot/kenlp.git
cd kenlp
pip install .

On macOS:

brew install icu4c
brew link --force icu4c
git clone https://github.com/kpot/kenlp.git
cd kenlp
pip install .
brew unlink icu4c

You will also need to install one of the Keras backends, such as TensorFlow:

pip install tensorflow
