jun-harashima/tiny_tokenizer

A tiny sentence/word tokenizer for Japanese text written in Python

Overview

Tiny Tokenizer is a simple sentence/word tokenizer that is convenient for pre-processing Japanese text.

Quick start: Install tiny_tokenizer using pip

tiny_tokenizer relies on external word tokenizers; the example below uses MeCab and KyTea.

You can install tiny_tokenizer via pip:

pip install tiny_tokenizer
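
After installation, word tokenization might look like the following sketch. This is not a confirmed API: the WordTokenizer class name, its constructor argument, and the return value are assumptions based on the example output further below.

from tiny_tokenizer import WordTokenizer  # class name is an assumption, not confirmed by this README

# Pick a backend word tokenizer; MeCab and KyTea both appear in the example below.
tokenizer = WordTokenizer('MeCab')

# Assuming tokenize() returns the surface forms of the words.
print(tokenizer.tokenize('ζˆ‘θΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚'))
# expected, per the MeCab output below: ζˆ‘θΌ© / は / 猫 / で / ある / 。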

Quick start: Docker

You can also run tiny_tokenizer in a Docker container. To do so, run the following commands.

docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer

Example

python3 example/tokenize_document.py
ζˆ‘θΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚
words: MeCab
  ζˆ‘θΌ©
  は
  猫
  で
  ある
  。
words: KyTea
  ζˆ‘θΌ©
  は
  猫
  で
  あ
  γ‚‹
  。
名前はまだγͺい
words: MeCab
  名前
  は
  まだ
  γͺい
words: KyTea
  名前
  は
  まだ
  γͺ
  い
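
The example script presumably splits the document into sentences and then tokenizes each sentence with both backends. A rough sketch under that assumption (the SentenceTokenizer and WordTokenizer class names are assumed, not confirmed by this README):

from tiny_tokenizer import SentenceTokenizer, WordTokenizer  # class names are assumptions

document = 'ζˆ‘θΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰γ―γΎγ γͺい'

sentence_tokenizer = SentenceTokenizer()
word_tokenizers = [('MeCab', WordTokenizer('MeCab')),
                   ('KyTea', WordTokenizer('KyTea'))]

for sentence in sentence_tokenizer.tokenize(document):
    print(sentence)
    for name, word_tokenizer in word_tokenizers:
        print('words: ' + name)
        # Assuming tokenize() yields plain strings, as in the sketch above.
        for word in word_tokenizer.tokenize(sentence):
            print('  ' + word)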

Test

python -m unittest discover tests
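
A minimal test of the kind this command would discover, reusing the WordTokenizer sketch above (the file name, class name, and expected return type are illustrative assumptions, not taken from the repository):

# tests/test_word_tokenizer.py (hypothetical file name)
import unittest

from tiny_tokenizer import WordTokenizer  # class name is an assumption, as above


class TestWordTokenizer(unittest.TestCase):

    def test_mecab(self):
        tokenizer = WordTokenizer('MeCab')
        # Expected tokens copied from the MeCab output in the example above,
        # assuming tokenize() returns a list of surface strings.
        self.assertEqual(tokenizer.tokenize('ζˆ‘θΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚'),
                         ['ζˆ‘θΌ©', 'は', '猫', 'で', 'ある', '。'])


if __name__ == '__main__':
    unittest.main()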
