
Add MPUNC #575

Merged: merged 32 commits into opencog:master, Aug 11, 2017
Conversation

@ampli (Member) commented Aug 6, 2017

  1. Add the ability to split words at the MPUNC punctuation list.
  2. Add/adjust the random-splitter definitions to split hyphenated words, words with underbars, and words with an em-dash (-- and —).
  3. Add MPUNC to en/4.0.affix. Currently it contains only the ASCII em-dash (--). Also add an example sentence (from the old defect list).
    Most probably adding Unicode dashes (at least the em-dash) to the English dict is a good idea, since LPUNC/RPUNC already include them; currently they are unknown-words. A minimal sketch of such an entry appears after this list.
  4. Implement "negative-regex" for REGPRE/REGMID/REGSUF (not actually used for now).
  5. Fix some bugs/problems as described in the commit messages, especially ones that interfere with these changes.
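For concreteness, here is a minimal sketch of such a 4.0.affix entry, assuming the same "token": CLASS syntax that the commit messages below show for REGPRE (the exact entry added by this PR may differ):

    % Middle punctuation: split words at these tokens (sketch).
    "--": MPUNC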

ampli added 30 commits August 4, 2017 21:07
Reproduce by:
In ady/4.0.affix:
"[": REGPRE

link-grammar: Error: Failed to compile regex: "[" (REGPRE): Invalid regular expression
...
SEGV on unknown address 0x000000000000 (pc 0x7f540ade4904 bp 0x7ffe853b4270 sp 0x7ffe853b4240 T0)
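A minimal sketch of the kind of guard that avoids this crash, using POSIX regcomp()/regerror() (the function name and layout here are illustrative, not the library's actual code):

    #include <regex.h>
    #include <stdio.h>

    /* Sketch: refuse to use a pattern that failed to compile, instead of
     * dereferencing the invalid compiled regex later (the SEGV above). */
    static int compile_affix_regex(regex_t *re, const char *pattern,
                                   const char *class_name)
    {
        int rc = regcomp(re, pattern, REG_EXTENDED);
        if (rc != 0)
        {
            char errbuf[128];
            regerror(rc, re, errbuf, sizeof(errbuf));
            fprintf(stderr, "link-grammar: Error: Failed to compile regex: "
                    "\"%s\" (%s): %s\n", pattern, class_name, errbuf);
            return -1; /* caller must skip this class entry */
        }
        return 0;
    }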
Else alternative_id remains NULL and a crash results when it is used.
Also, we need it to be the correct alternative_id.

It happened in the "ady" language after adding MPUNC (yet-uncommitted code),
on the input sentence "1234--56".
A word that can split is considered known and is not subject to regex
(and also spell) guesses.
Comment:
The word could also be enclosed in '', but maybe it is better to
enclose all words in rare quotation marks like ❮ ❯, or not to enclose
any of them in quotation marks at all. So for now this aspect is not
changed here.
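As an aside, the split-is-known rule above might look roughly like this (a sketch with illustrative names; the real tokenizer logic differs):

    #include <stdbool.h>

    /* Illustrative predicates, not the library's API. */
    extern bool word_can_split(const char *w);
    extern bool word_is_in_dict(const char *w);

    /* Sketch of the rule above: a word that can split is treated as
     * known, so regex-guessing (and spell-guessing) is skipped for it. */
    static bool subject_to_guessing(const char *w)
    {
        return !word_can_split(w) && !word_is_in_dict(w);
    }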
This can happen if an alternative couldn't be created in
issue_word_alternative() (which returns NULL in such a case).
Since the word address serves as the alternative_id, if the word
already participates in an alternative (consisting of several words),
this alternative will be wrongly used.

Fix it by using only one word if it is marked by issued_unsplit.

XXX: This problem means we may have 2 different alternatives with the same
alternative_id. Can it cause a problem when determining the
hier_position?
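A minimal model of the collision described in the XXX note (hypothetical struct; the real Gword in the library has many more fields):

    /* Hypothetical model: the unsplit word's address doubles as the
     * alternative_id, so two alternatives issued from the same word
     * end up indistinguishable by id alone. */
    typedef struct Gword Gword;
    struct Gword
    {
        const char *subword;
        const Gword *alternative_id; /* address of the originating word */
    };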
(To avoid unrecognized symbol errors.)
This prevents recursive splitting of xPUNC tokens, e.g. "--" to "_" "_".
Issued randomly-split tokens should not be subject to yet another
random split.

Currently this is prevented by marking them with TS_ANYSPLIT. This means
separate_word() is entered again for each of these tokens, trying the
xPUNC splitting again on each, and then anysplit() just neglects them.

Instead, mark them (as done in a previous version) by TS_DONE, so they
will be popped out of the token queue and not re-processed. However,
regex-marking is still needed.

The fix is to use tokenization_done(), which now also does
regex-marking (and also adds TS_DONE).
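A sketch of the queue behavior being described (hypothetical types and loop; the real Tokenizing_step enum and token queue in the tokenizer differ):

    #include <stddef.h>

    /* Hypothetical model of the token-queue states discussed above. */
    typedef enum { TS_INITIAL, TS_ANYSPLIT, TS_DONE } Tokenizing_step;

    typedef struct
    {
        const char *word;
        Tokenizing_step status;
    } Token;

    /* With TS_ANYSPLIT, issued fragments re-enter separate_word() and
     * the xPUNC splitters run on them again; with TS_DONE (set by
     * tokenization_done(), which also regex-marks) they are popped
     * from the queue and never re-processed. */
    static void drain_queue(Token *queue, size_t n)
    {
        for (size_t i = 0; i < n; i++)
        {
            if (queue[i].status == TS_DONE)
                continue; /* finalized: regex-marked, no re-split */
            /* ... try xPUNC and random splits here ... */
        }
    }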
Else, after adding MPUNC "--", we get:
	Linkage 2, cost vector = (UNUSED=0 DIS= 0.09 LEN=6)

    +------------Wa------------+
    |        +-------Dsu-------+
    |        |        +----A---+
    |        |        |        |
LEFT-WALL this.p is--a[!HYPHENATED-WORDS].a test.n

Press RETURN for the next linkage.
linkparser>
	Linkage 3, cost vector = (UNUSED=0 DIS= 4.00 LEN=9)

    +---------Xx--------+
    +----->WV----->+    +---Js---+
    +---Wd---+-Ss*b+    |  +Ds**c+
    |        |     |    |  |     |
LEFT-WALL this.p is.v --.r a  test.n
The code is supposed to check for NULL by now.
After the fix to validate that the checked alternative belongs to the given
unsplit word, it doesn't detect duplicates when running the existing batches
(a bug in the check?). (Previously it might find unrelated duplicates.)

If this check is still needed, a better check can be done by
ensuring unique token positions for each alternative (hashing can be used).
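A sketch of that hashing idea, using FNV-1a over the token positions of an alternative (illustrative only; equal hashes would still need a final position-by-position comparison):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: hash an alternative's token positions so duplicate
     * alternatives can be detected cheaply by hash comparison. */
    static uint64_t hash_positions(const int *pos, size_t n)
    {
        uint64_t h = 1469598103934665603ULL; /* FNV-1a offset basis */
        for (size_t i = 0; i < n; i++)
        {
            h ^= (uint64_t)pos[i];
            h *= 1099511628211ULL; /* FNV-1a prime */
        }
        return h;
    }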
Only with this update is it supposed to be fully functional.
(The general idea of using several values of Tokenizing_step is not
abandoned yet.)
@linas (Member) commented Aug 9, 2017

will look at this "really soon now"

@linas merged commit 4cb9bb3 into opencog:master on Aug 11, 2017
@linas (Member) commented Aug 11, 2017

thanks!

@linas (Member) commented Aug 11, 2017

No comments, I looked at it and it seemed reasonable.
