Handle punctuation+CJK combinations.
linas committed Jan 8, 2017
1 parent 6fb2c97 commit 2e48f83
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions scripts/ems/support/split-sentences.perl
@@ -128,13 +128,19 @@ sub preprocess {

# A normal full-stop or other Western sentence enders followed
# by an ideograph is an end-of-sentence, always.
$text =~ s/([\.?!]) *(\p{InCJK})/$1\n$2/g;
$text =~ s/([\.?!]) *(\p{CJK})/$1\n$2/g;

# Split close-paren-then-comma into two.
$text =~ s/(\p{Punctuation}) *(\p{Punctuation})/ $1 $2 /g;

# Chinese does not use any sort of white-space between ideographs.
# Nominally, each single ideograph corresponds to one word. Add
# spaces here, so that later processing stages can tokenize readily.
# Note that this handles mixed latinate+CJK.
$text =~ s/(\p{InCJK})/ $1 /g;
# TODO: perhaps also CJKExtA CJKExtB etc ??? CJK_Radicals_Sup ?
$text =~ s/(\p{Punctuation}) *(\p{CJK})/ $1 $2/g;
$text =~ s/(\p{CJK}) *(\p{Punctuation})/$1 $2 /g;
$text =~ s/([\p{CJK}\p{CJKSymbols}])/ $1 /g;
$text =~ s/ +/ /g;

# Special punctuation cases are covered. Check all remaining periods.