Himotoki (紐解き, "unraveling" or "untying strings") is a Python remake of ichiran, the comprehensive Japanese morphological analyzer. It provides sophisticated text segmentation, dictionary lookup, and conjugation analysis, all powered by a portable SQLite backend.
- 🚀 Fast & Portable: Uses SQLite for rapid dictionary lookups without the need for a complex PostgreSQL setup.
- 🧠 Smart Segmentation: Employs dynamic programming (Viterbi-style) to find the most linguistically plausible segmentation.
- 📚 Deep Dictionary Integration: Built on JMdict, providing rich metadata, glosses, and part-of-speech information.
- 🔄 Advanced Deconjugation: Recursively traces conjugated verbs and adjectives back to their dictionary forms.
- 📊 Scoring Engine: Implements the "synergy" and penalty rules from ichiran to ensure high-quality results.
- 🛠️ Developer Friendly: Clean Python API and a robust CLI for quick analysis.
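To make the "Viterbi-style" segmentation idea concrete, here is a minimal sketch of dynamic programming over split points. It is not Himotoki's implementation: the toy lexicon, its scores, the `UNKNOWN_PENALTY` value, and the `segment` function are all hypothetical and chosen only for illustration. The real engine scores candidates using JMdict data plus the synergy and penalty rules.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> score (higher is better).
# These numbers are invented for the example, not taken from Himotoki.
TOY_LEXICON: Dict[str, int] = {
    "学校": 10,
    "で": 2,
    "勉強": 10,
    "し": 1,
    "して": 4,
    "い": 1,
    "ます": 4,
    "います": 6,
}

UNKNOWN_PENALTY = -5  # assumed cost for a character missing from the lexicon


def segment(text: str, lexicon: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return (best_score, words) for the highest-scoring split of text."""
    max_len = max(map(len, lexicon), default=1)
    # best[i] holds the best (score, words) covering text[:i].
    best: List[Tuple[int, List[str]]] = [(0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            prev_score, prev_words = best[start]
            if piece in lexicon:
                candidates.append((prev_score + lexicon[piece], prev_words + [piece]))
            elif end - start == 1:
                # Fall back to a single unknown character so a path always exists.
                candidates.append((prev_score + UNKNOWN_PENALTY, prev_words + [piece]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1]


score, words = segment("学校で勉強しています", TOY_LEXICON)
print(score, words)
```

With these toy scores, the longer match います wins over い + ます because the best total score at each position is carried forward, which is the core of the dynamic programming approach.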
pip install himotoki
On first use, Himotoki will prompt you to download and initialize the dictionary database:
himotoki "日本語テキスト"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🧶 Welcome to Himotoki!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
First-time setup required. This will:
• Download JMdict dictionary data (~15MB compressed)
• Generate optimized SQLite database (~3GB)
• Store data in ~/.himotoki/
Proceed with setup? [Y/n]:
⚠️ Disk Space: The database requires approximately 3GB of free disk space.
The setup process takes approximately 10-20 minutes to complete.
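For scripted environments, it can help to check free disk space before starting a setup that takes this long. The following is a small sketch under assumptions: the helper names are hypothetical, the 3GB threshold is taken from the notice above, and the home directory is measured because ~/.himotoki/ may not exist yet. It calls the non-interactive `himotoki setup --yes` command.

```python
import shutil
import subprocess
from pathlib import Path

REQUIRED_BYTES = 3 * 1024**3  # roughly 3GB, per the setup notice


def has_enough_space(free_bytes: int, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True when the free space covers the expected database size."""
    return free_bytes >= required_bytes


def run_setup_if_space_allows() -> int:
    """Run non-interactive setup only if the home volume has enough room."""
    # Measure the volume holding the home directory, where ~/.himotoki/ will live.
    free = shutil.disk_usage(Path.home()).free
    if not has_enough_space(free):
        print(f"Only {free / 1024**3:.1f} GB free; about 3 GB is needed.")
        return 1
    return subprocess.run(["himotoki", "setup", "--yes"]).returncode
```

Calling `run_setup_if_space_allows()` returns the exit code of the setup command, or 1 when the check fails, so it can be used directly as a CI step result.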
You can also run setup manually:
himotoki setup # Interactive setup
himotoki setup --yes # Non-interactive (for scripts/CI)
Analyze Japanese text directly from your terminal:
# Default: Dictionary info only
himotoki "学校で勉強しています"
# Simple romanization
himotoki -r "学校で勉強しています"
# Full output (romanization + dictionary info)
himotoki -f "学校で勉強しています"
# Kana reading with spaces
himotoki -k "学校で勉強しています"
# JSON output for integration
himotoki -j "学校で勉強しています"
Integrate Himotoki into your own projects with ease:
import himotoki
# Optional: pre-warm caches for faster first request
himotoki.warm_up()
# Analyze Japanese text
results = himotoki.analyze("日本語を勉強しています")
for words, score in results:
    for w in words:
        print(f"{w.text} 【{w.kana}】 - {w.gloss[:50]}...")
Himotoki is designed with modularity in mind, keeping the database, logic, and output layers distinct.
himotoki/
├── himotoki/ # Main package
│ ├── 🧠 segment.py # Pathfinding and segmentation logic
│ ├── 📖 lookup.py # Dictionary retrieval and scoring
│ ├── 🔄 constants.py # Shared constants and SEQ definitions
│ ├── 🗄️ db/ # SQLAlchemy models and connection
│ ├── 📚 loading/ # JMdict and conjugation loaders
│ └── 🖥️ cli.py # Command line interface
├── scripts/ # Developer tools
│ ├── compare.py # Ichiran comparison suite
│ ├── init_db.py # Database initialization
│ └── report.py # HTML report generator
├── tests/ # Test suite
├── data/ # Dictionary data files
├── output/ # Generated results and reports
└── docs/ # Documentation
We welcome contributions! To get started:
git clone https://github.com/msr2903/himotoki.git
cd himotoki
pip install -e ".[dev]"
- Tests: pytest
- Coverage: pytest --cov=himotoki
- Linting: ruff check .
- Formatting: black .
- Run LLM evaluation: python -m scripts.llm_eval --quick
- Run with mock mode: python -m scripts.llm_eval --quick --mock
- Run one sentence: python -m scripts.llm_eval --onesentence "猫が食べる"
- Start labeler UI: python -m scripts.llm_labeler --host 127.0.0.1 --port 8008
Set LLM_PROVIDER=openai with OPENAI_BASE_URL (for example, http://127.0.0.1:3030/v1)
and OPENAI_API_KEY (use not-needed for local servers that ignore keys) to use a local
OpenAI-compatible server. Use --mock for offline runs.
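As an illustration, the settings above can be applied from a small Python launcher instead of exporting them in the shell. This is only a sketch: the function names are hypothetical, and the base URL is the example value from the text, so adjust it to your server.

```python
import os
import subprocess
import sys
from typing import Dict, List


def local_openai_env(base_url: str = "http://127.0.0.1:3030/v1") -> Dict[str, str]:
    """Copy the current environment and add the local-server settings."""
    env = dict(os.environ)
    env.update(
        {
            "LLM_PROVIDER": "openai",
            "OPENAI_BASE_URL": base_url,
            # Placeholder key for local servers that ignore keys.
            "OPENAI_API_KEY": "not-needed",
        }
    )
    return env


def eval_command(mock: bool = False) -> List[str]:
    """Build the quick-evaluation command, optionally in offline mock mode."""
    cmd = [sys.executable, "-m", "scripts.llm_eval", "--quick"]
    if mock:
        cmd.append("--mock")
    return cmd


def run_quick_eval(mock: bool = False) -> int:
    """Run the evaluation from the repository root and return its exit code."""
    return subprocess.run(eval_command(mock), env=local_openai_env()).returncode
```

Run `run_quick_eval()` from the repository root so that `scripts.llm_eval` is importable.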
Set LLM_PROVIDER=gemini with GEMINI_API_KEY and GEMINI_MODEL (default: gemini-3-flash-preview)
to use Gemini.
Use --concurrency 5 (or LLM_CONCURRENCY=5) to send multiple LLM requests in parallel.
Use --rpm (or LLM_RPM) to cap request rate per minute (defaults: 2 for openai, 1 for gemini).
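To show how a concurrency limit and a per-minute cap interact, here is a rough asyncio sketch. It is not the evaluation script's actual code: `MinuteRateLimiter` and `run_all` are hypothetical names, and the pacing strategy (spacing request starts at least 60/rpm seconds apart) is an assumption about one reasonable way to enforce such a cap.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, List, Sequence


class MinuteRateLimiter:
    """Spaces out request starts by at least 60/rpm seconds."""

    def __init__(self, rpm: float) -> None:
        self.interval = 60.0 / rpm
        self._next_slot = 0.0

    async def wait(self) -> None:
        now = time.monotonic()
        slot = max(now, self._next_slot)
        # Reserve the slot before yielding so concurrent callers queue up.
        self._next_slot = slot + self.interval
        await asyncio.sleep(slot - now)


async def run_all(
    jobs: Sequence[Callable[[], Awaitable[Any]]],
    concurrency: int,
    rpm: float,
) -> List[Any]:
    """Run jobs with at most `concurrency` in flight, paced by `rpm`."""
    limiter = MinuteRateLimiter(rpm)
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(job: Callable[[], Awaitable[Any]]) -> Any:
        async with semaphore:  # caps the number of in-flight requests
            await limiter.wait()  # caps how often a new request may start
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Under this model, a low rate cap such as 2 requests per minute dominates: requests start about 30 seconds apart, so a concurrency of 5 only helps when individual requests take longer than that interval.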
Install the optional dependencies for the labeler UI:
pip install -e ".[eval]"
Distributed under the MIT License. See LICENSE for more information.
"Unraveling the complexities of the Japanese language, one string at a time."