
Add MPUNC #575

Merged: merged 32 commits into opencog:master, Aug 11, 2017
Conversation

@ampli (Member) commented Aug 6, 2017

  1. Add the ability to split words at the MPUNC punctuation list.
  2. Add/adjust the random-splitter definitions to split hyphenated words, words with underbars, and words with an em-dash (-- and —).
  3. Add MPUNC to en/4.0.affix. Currently it contains only the ASCII em-dash (--). Also add an example sentence (from the old defect list).
    Most probably adding Unicode dashes (at least the em-dash) to the English dict is a good idea, since LPUNC/RPUNC already include them; currently they are unknown-words. A minimal sketch of such an entry appears after this list.
  4. Implement "negative-regex" for REGPRE/REGMID/REGSUF (not actually used for now).
  5. Fix some bugs/problems as described in the commit messages, especially ones that interfere with these changes.
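For concreteness, here is a minimal sketch of such a 4.0.affix entry, assuming the same "token": CLASS syntax that the commit messages below show for REGPRE (the exact entry added by this PR may differ):

    % Middle punctuation: split words at these tokens (sketch).
    "--": MPUNC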

ampli added 30 commits August 4, 2017 21:07
Reproduce by:
In ady/4.0.affix:
"[": REGPRE

link-grammar: Error: Failed to compile regex: "[" (REGPRE): Invalid regular expression
...
SEGV on unknown address 0x000000000000 (pc 0x7f540ade4904 bp 0x7ffe853b4270 sp 0x7ffe853b4240 T0)
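A minimal sketch of the kind of guard that avoids this crash, using POSIX regcomp()/regerror() (the function name and layout here are illustrative, not the library's actual code):

    #include <regex.h>
    #include <stdio.h>

    /* Sketch: refuse to use a pattern that failed to compile, instead of
     * dereferencing the invalid compiled regex later (the SEGV above). */
    static int compile_affix_regex(regex_t *re, const char *pattern,
                                   const char *class_name)
    {
        int rc = regcomp(re, pattern, REG_EXTENDED);
        if (rc != 0)
        {
            char errbuf[128];
            regerror(rc, re, errbuf, sizeof(errbuf));
            fprintf(stderr, "link-grammar: Error: Failed to compile regex: "
                    "\"%s\" (%s): %s\n", pattern, class_name, errbuf);
            return -1; /* caller must skip this class entry */
        }
        return 0;
    }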
Else alternative_id remains NULL and a crash results when it is used.
Also, we need it to be the correct alternative_id.

It happened in the "ady" language after adding MPUNC (yet-uncommitted code),
on the input sentence "1234--56".
A word that can split is considered known and is not subject to regex
(and also spell) guesses.
Comment:
The word could also be enclosed in '', but maybe it is better to
enclose all words in rare quotation marks like ❮ ❯, or not to enclose
any of them in quotation marks at all. So for now this aspect is not
changed here.
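As an aside, the split-is-known rule above might look roughly like this (a sketch with illustrative names; the real tokenizer logic differs):

    #include <stdbool.h>

    /* Illustrative predicates, not the library's API. */
    extern bool word_can_split(const char *w);
    extern bool word_is_in_dict(const char *w);

    /* Sketch of the rule above: a word that can split is treated as
     * known, so regex-guessing (and spell-guessing) is skipped for it. */
    static bool subject_to_guessing(const char *w)
    {
        return !word_can_split(w) && !word_is_in_dict(w);
    }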
This can happen if an alternative couldn't be created in
issue_word_alternative() (which returns NULL in such a case).
Since the word address serves as the alternative_id, if the word
already participates in an alternative (consisting of several words),
this alternative will be wrongly used.

Fix it by using only one word if it is marked by issued_unsplit.

XXX: This problem means we may have 2 different alternatives with the same
alternative_id. Can it cause a problem when determining the
hier_position?
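A minimal model of the collision described in the XXX note (hypothetical struct; the real Gword in the library has many more fields):

    /* Hypothetical model: the unsplit word's address doubles as the
     * alternative_id, so two alternatives issued from the same word
     * end up indistinguishable by id alone. */
    typedef struct Gword Gword;
    struct Gword
    {
        const char *subword;
        const Gword *alternative_id; /* address of the originating word */
    };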
(To avoid unrecognized symbol errors.)
This prevents recursive splitting of xPUNC tokens, e.g. "--" to "_" "_".
Issued randomly-split tokens should not be subject to yet another
random split.

Currently this is prevented by marking them with TS_ANYSPLIT. This means
separate_word() is entered again for each of these tokens, trying the
xPUNC splitting again on each, and then anysplit() just neglects them.

Instead, mark them (as done in a previous version) by TS_DONE, so they
will be popped out of the token queue and not re-processed. However,
regex-marking is still needed.

The fix is to use tokenization_done(), which now also does
regex-marking (and also adds TS_DONE).
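A sketch of the queue behavior being described (hypothetical types and loop; the real Tokenizing_step enum and token queue in the tokenizer differ):

    #include <stddef.h>

    /* Hypothetical model of the token-queue states discussed above. */
    typedef enum { TS_INITIAL, TS_ANYSPLIT, TS_DONE } Tokenizing_step;

    typedef struct
    {
        const char *word;
        Tokenizing_step status;
    } Token;

    /* With TS_ANYSPLIT, issued fragments re-enter separate_word() and
     * the xPUNC splitters run on them again; with TS_DONE (set by
     * tokenization_done(), which also regex-marks) they are popped
     * from the queue and never re-processed. */
    static void drain_queue(Token *queue, size_t n)
    {
        for (size_t i = 0; i < n; i++)
        {
            if (queue[i].status == TS_DONE)
                continue; /* finalized: regex-marked, no re-split */
            /* ... try xPUNC and random splits here ... */
        }
    }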
Else, after adding MPUNC "--", we get:
	Linkage 2, cost vector = (UNUSED=0 DIS= 0.09 LEN=6)

    +------------Wa------------+
    |        +-------Dsu-------+
    |        |        +----A---+
    |        |        |        |
LEFT-WALL this.p is--a[!HYPHENATED-WORDS].a test.n

Press RETURN for the next linkage.
linkparser>
	Linkage 3, cost vector = (UNUSED=0 DIS= 4.00 LEN=9)

    +---------Xx--------+
    +----->WV----->+    +---Js---+
    +---Wd---+-Ss*b+    |  +Ds**c+
    |        |     |    |  |     |
LEFT-WALL this.p is.v --.r a  test.n
The code is supposed to check for NULL by now.
After the fix to validate that the checked alternative belongs to the given
unsplit word, it doesn't detect duplicates when running the existing batches
(a bug in the check?). (Previously it might find unrelated duplicates.)

If this check is still needed, a better check can be done by
ensuring unique token positions for each alternative (hashing can be used).
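A sketch of that hashing idea, using FNV-1a over the token positions of an alternative (illustrative only; equal hashes would still need a final position-by-position comparison):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: hash an alternative's token positions so duplicate
     * alternatives can be detected cheaply by hash comparison. */
    static uint64_t hash_positions(const int *pos, size_t n)
    {
        uint64_t h = 1469598103934665603ULL; /* FNV-1a offset basis */
        for (size_t i = 0; i < n; i++)
        {
            h ^= (uint64_t)pos[i];
            h *= 1099511628211ULL; /* FNV-1a prime */
        }
        return h;
    }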
Only with this update is it supposed to be fully functional.
(The general idea of using several values of Tokenizing_step is not
abandoned yet.)
@linas (Member) commented Aug 9, 2017

will look at this "really soon now"

@linas merged commit 4cb9bb3 into opencog:master on Aug 11, 2017
@linas (Member) commented Aug 11, 2017

thanks!

@linas (Member) commented Aug 11, 2017

No comments, I looked at it and it seemed reasonable.
