Hasami

Hasami is a tool for performing sentence segmentation on Japanese text.

Sentences are split on common sentence ending markers like ！？。
Enclosed sentence endings will not be split, i.e. those inside quotes or parentheses.
Runs of sentence ending markers are treated as a single sentence ending.
You can configure custom sentence ending markers and enclosures if the defaults don't cover your needs.
You can define exceptions for when not to split sentences.
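The rules above can be illustrated with a small standalone sketch. It is not Hasami's actual implementation and does not import the library. The function name `sketch_segment`, its default marker set and its default enclosure pairs are assumptions chosen for illustration only.

```python
def sketch_segment(text, markers="。！？…", enclosures="「」『』（）"):
    """Illustrative splitter; NOT the hasami implementation.

    `markers` and `enclosures` are assumed defaults for this sketch.
    `enclosures` is a string of opening/closing character pairs.
    """
    openers = enclosures[0::2]  # characters at even positions open an enclosure
    closers = enclosures[1::2]  # characters at odd positions close one
    sentences = []
    current = []
    depth = 0  # how many enclosures we are currently inside

    for i, ch in enumerate(text):
        current.append(ch)
        if ch in openers:
            depth += 1
        elif ch in closers:
            depth = max(depth - 1, 0)
        elif ch in markers and depth == 0:
            # Split only at the end of a run of markers.
            at_run_end = i + 1 == len(text) or text[i + 1] not in markers
            if at_run_end:
                sentences.append("".join(current))
                current = []

    if current:
        # Keep trailing text that has no sentence ending marker.
        sentences.append("".join(current))
    return sentences


plain = sketch_segment("これが最初の文。これは二番目の文。これが最後の文。")
run = sketch_segment("え、本当…！？嘘だろ…")
enclosed = sketch_segment("「うまく行くかな？」と思った。")
no_enclosures = sketch_segment("「うまく行くかな？」と思った。", enclosures="")
```

Tracking a nesting depth rather than a single flag keeps the sketch correct for nested quotes, at the cost of assuming that enclosures are balanced.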

Installation

$ pip install hasami

Usage

A simple command line interface is provided. Input is read from stdin or from a file.

$ echo "これが最初の文。これは二番目の文。これが最後の文。" | tee input.txt | hasami
これが最初の文。
これは二番目の文。
これが最後の文。

$ hasami input.txt
これが最初の文。
これは二番目の文。
これが最後の文。

Usage in code:

import hasami

hasami.segment_sentences('これが最初の文。これは二番目の文。これが最後の文。')
# => ['これが最初の文。', 'これは二番目の文。', 'これが最後の文。']

More examples:

import hasami

# Instead of splitting you can also just insert newlines.
hasami.insert_newlines('これが最初の文。これは二番目の文。これが最後の文。')
# => 'これが最初の文。\nこれは二番目の文。\nこれが最後の文。\n'

# Runs of sentence ending markers are treated as a single sentence ending.
hasami.segment_sentences('え、本当…！？嘘だろ…')
# => ['え、本当…！？', '嘘だろ…']

# Enclosed sentence endings are ignored.
hasami.segment_sentences('「うまく行くかな？」と思った。')
# => ['「うまく行くかな？」と思った。']
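Judging only from the outputs shown above, the newline-inserted string and the list of sentences carry the same information. The following standalone sketch converts between the two shapes without importing Hasami. The helper name `join_with_newlines` is hypothetical, and this is not a claim about how the library implements either function.

```python
def join_with_newlines(sentences):
    """Append a newline to every sentence and concatenate the results."""
    return "".join(sentence + "\n" for sentence in sentences)


sentences = ["これが最初の文。", "これは二番目の文。", "これが最後の文。"]

# Same shape as the insert_newlines output shown above.
joined = join_with_newlines(sentences)

# str.splitlines() drops the trailing newline and recovers the list.
recovered = joined.splitlines()
```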

Customization

The defaults should cover most of the punctuation found in natural text, but you can define custom enclosures and sentence ending markers if necessary. You can also define exceptions for cases where sentence segmentation should not happen, for example where punctuation is used in an atypical way.

from hasami import Hasami, DEFAULT_ENCLOSURES, DEFAULT_SENTENCE_ENDING_MARKERS

# Pass a string of pairs of opening/closing characters to define custom enclosures.
with_custom_enclosures = Hasami(enclosures=DEFAULT_ENCLOSURES + '<>')
with_custom_enclosures.segment_sentences('<うまく行くかな？>と思った。')
# => ['<うまく行くかな？>と思った。']

# Pass an empty string if you want all enclosures to be ignored.
without_enclosures = Hasami(enclosures='')
without_enclosures.segment_sentences('「うまく行くかな？」と思った。')
# => ['「うまく行くかな？', '」と思った。']

# Pass a string of characters that should be considered as sentence ending markers.
with_custom_endings = Hasami(sentence_ending_markers=DEFAULT_SENTENCE_ENDING_MARKERS + '．，')
with_custom_endings.segment_sentences('これが最初の文．これは二番目の文，これが最後の文．')
# => ['これが最初の文．', 'これは二番目の文，', 'これが最後の文．']

# Pass a list of patterns to define exceptions where segmentation should not happen.
# Make sure to include the newline which should be removed in the pattern.
with_exceptions = Hasami(exceptions=['君の名は。\n'])
with_exceptions.segment_sentences('君の名は。見たことあるの？')
# => ['君の名は。見たことあるの？']
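The note about including the newline in the pattern suggests that exceptions are applied after newlines have been inserted, by removing the newline wherever a pattern matches. The standalone sketch below illustrates that idea under two assumptions: patterns are treated as literal strings, and the hypothetical helper `apply_exceptions` stands in for whatever the library does internally. It does not import Hasami.

```python
def apply_exceptions(text_with_newlines, exceptions):
    """Undo sentence breaks wherever an exception pattern occurs.

    Assumption for this sketch: each pattern is a literal string that
    contains the newline to be removed.
    """
    for pattern in exceptions:
        without_newline = pattern.replace("\n", "")
        text_with_newlines = text_with_newlines.replace(pattern, without_newline)
    return text_with_newlines


# Text as it might look after newlines were inserted at sentence endings.
segmented = "君の名は。\n見たことあるの？\n"

merged = apply_exceptions(segmented, ["君の名は。\n"])
result = merged.splitlines()
```

A pattern without the newline would match but change nothing, which is consistent with the advice to include it.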

License

Released under the BSD-3-Clause License
