tiny_tokenizer is a simple sentence- and word-level tokenizer that is convenient for pre-processing Japanese text.
tiny_tokenizer requires the following libraries:
- Python
- MeCab (and natto-py)
- KyTea (and Mykytea-python)
You can install tiny_tokenizer via pip:

```
pip install tiny_tokenizer
```
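After installing, you can call the tokenizers from Python. The snippet below is a minimal sketch, assuming tiny_tokenizer exposes WordTokenizer and SentenceTokenizer classes that take the backend name in their constructor and return a list from tokenize; the exact signatures may differ between versions.

```python
# Minimal usage sketch (assumed API; check the package for exact signatures).
from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "我輩は猫である。名前はまだない"

# Split the document into sentences.
sentence_tokenizer = SentenceTokenizer()
sentences = sentence_tokenizer.tokenize(document)

# Split each sentence into words with the MeCab backend
# (KyTea can be selected the same way).
word_tokenizer = WordTokenizer(tokenizer="MeCab")
for sentence in sentences:
    print(word_tokenizer.tokenize(sentence))
```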
You can also use tiny_tokenizer from a Docker container. To do so, run the following commands:
```
docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer
python3 example/tokenize_document.py
```
The example script prints each sentence of the document followed by its word segmentation with MeCab and KyTea:

```
# python3 example/tokenize_document.py
我輩は猫である。
  words: MeCab
    我輩
    は
    猫
    で
    ある
    。
  words: KyTea
    我輩
    は
    猫
    で
    あ
    る
    。
名前はまだない
  words: MeCab
    名前
    は
    まだ
    ない
  words: KyTea
    名前
    は
    まだ
    な
    い
```
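For reference, a script that produces this kind of report could look roughly like the sketch below. It reuses the assumed WordTokenizer/SentenceTokenizer API from the earlier sketch and is not the actual example/tokenize_document.py; the output formatting is only an approximation of the listing above.

```python
# Rough sketch of a document-tokenization script (API names are assumed).
from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "我輩は猫である。名前はまだない"

sentence_tokenizer = SentenceTokenizer()
word_tokenizers = {
    "MeCab": WordTokenizer(tokenizer="MeCab"),
    "KyTea": WordTokenizer(tokenizer="KyTea"),
}

# Print each sentence followed by its word segmentation per backend.
for sentence in sentence_tokenizer.tokenize(document):
    print(sentence)
    for name, tokenizer in word_tokenizers.items():
        print(f"  words: {name}")
        for word in tokenizer.tokenize(sentence):
            print(f"    {word}")
```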
You can run the tests with the following command:

```
python -m unittest discover tests
```
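If you want to add a test of your own, a minimal unittest case might look like the sketch below; the file name, the assumed WordTokenizer API, and the expected segmentation (taken from the MeCab output above) are illustrative only.

```python
# Hypothetical test file, e.g. tests/test_word_tokenizer.py (name is illustrative).
import unittest

from tiny_tokenizer import WordTokenizer


class WordTokenizerTest(unittest.TestCase):
    def test_tokenize_with_mecab(self):
        tokenizer = WordTokenizer(tokenizer="MeCab")
        # Expected segmentation taken from the example output above.
        expected = ["名前", "は", "まだ", "ない"]
        self.assertEqual(expected, tokenizer.tokenize("名前はまだない"))


if __name__ == "__main__":
    unittest.main()
```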
