tiny_tokenizer is a simple sentence- and word-level tokenizer that is convenient for pre-processing Japanese text.
tiny_tokenizer requires the following libraries:
- Python
- MeCab (and natto-py)
- KyTea (and Mykytea-python)
You can install tiny_tokenizer via pip:

```
pip install tiny_tokenizer
```
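After installing, you can call the tokenizers from Python. The snippet below is a minimal sketch, assuming tiny_tokenizer exposes WordTokenizer and SentenceTokenizer classes that take the backend name in their constructor and return a list from tokenize; the exact signatures may differ between versions.

```python
# Minimal usage sketch (assumed API; check the package for exact signatures).
from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "我輩は猫である。名前はまだない"

# Split the document into sentences.
sentence_tokenizer = SentenceTokenizer()
sentences = sentence_tokenizer.tokenize(document)

# Split each sentence into words with the MeCab backend
# (KyTea can be selected the same way).
word_tokenizer = WordTokenizer(tokenizer="MeCab")
for sentence in sentences:
    print(word_tokenizer.tokenize(sentence))
```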
You can also use tiny_tokenizer from a Docker container. To do so, run the following commands:
```
docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer
python3 example/tokenize_document.py
```
The example script prints each sentence of the document followed by its word segmentation with MeCab and KyTea:

```
# python3 example/tokenize_document.py
我輩は猫である。
  words: MeCab
    我輩
    は
    猫
    で
    ある
    。
  words: KyTea
    我輩
    は
    猫
    で
    あ
    る
    。
名前はまだない
  words: MeCab
    名前
    は
    まだ
    ない
  words: KyTea
    名前
    は
    まだ
    な
    い
```
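For reference, a script that produces this kind of report could look roughly like the sketch below. It reuses the assumed WordTokenizer/SentenceTokenizer API from the earlier sketch and is not the actual example/tokenize_document.py; the output formatting is only an approximation of the listing above.

```python
# Rough sketch of a document-tokenization script (API names are assumed).
from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "我輩は猫である。名前はまだない"

sentence_tokenizer = SentenceTokenizer()
word_tokenizers = {
    "MeCab": WordTokenizer(tokenizer="MeCab"),
    "KyTea": WordTokenizer(tokenizer="KyTea"),
}

# Print each sentence followed by its word segmentation per backend.
for sentence in sentence_tokenizer.tokenize(document):
    print(sentence)
    for name, tokenizer in word_tokenizers.items():
        print(f"  words: {name}")
        for word in tokenizer.tokenize(sentence):
            print(f"    {word}")
```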
You can run the tests with the following command:

```
python -m unittest discover tests
```
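If you want to add a test of your own, a minimal unittest case might look like the sketch below; the file name, the assumed WordTokenizer API, and the expected segmentation (taken from the MeCab output above) are illustrative only.

```python
# Hypothetical test file, e.g. tests/test_word_tokenizer.py (name is illustrative).
import unittest

from tiny_tokenizer import WordTokenizer


class WordTokenizerTest(unittest.TestCase):
    def test_tokenize_with_mecab(self):
        tokenizer = WordTokenizer(tokenizer="MeCab")
        # Expected segmentation taken from the example output above.
        expected = ["名前", "は", "まだ", "ない"]
        self.assertEqual(expected, tokenizer.tokenize("名前はまだない"))


if __name__ == "__main__":
    unittest.main()
```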
