Add MPUNC #575
Merged
Repeat by: In ady/4.0.affix: "[": REGPRE link-grammar: Error: Failed to compile regex: "[" (REGPRE): Invalid regular expression ... SEGV on unknown address 0x000000000000 (pc 0x7f540ade4904 bp 0x7ffe853b4270 sp 0x7ffe853b4240 T0)
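The reproduction above has an invalid pattern ("[") that fails to compile, after which the missing regex is used anyway. The following is a minimal sketch of the guard, written in Python rather than the library's C. The function names `compile_affix_regex` and `matches_prefix` are hypothetical and are not taken from the link-grammar source; the actual fix in this commit is not shown on this page.

```python
import re

def compile_affix_regex(name, pattern):
    """Return a compiled regex, or None if the pattern is invalid.

    Hypothetical helper: 'name' stands for an affix class such as REGPRE.
    """
    try:
        return re.compile(pattern)
    except re.error as err:
        # Report the failure instead of leaving a dangling regex behind.
        print(f'Error: Failed to compile regex: "{pattern}" ({name}): {err}')
        return None

def matches_prefix(regex, word):
    """Match only when the regex was compiled successfully."""
    if regex is None:
        # A failed compile means "no match", not a crash.
        return False
    return regex.match(word) is not None
```

With this shape, a bad affix-file entry produces an error message and the word is simply treated as not matching.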
Otherwise alternative_id remains NULL and a crash results when it is used. Also, it needs to be the correct alternative_id. This happened in the "ady" language after adding MPUNC (code not yet committed), on the input sentence "1234--56".
A word that can split is considered as known and is not subject to regex (and also spell) guesses.
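The rule above can be sketched as follows, in Python rather than the library's C. This is an illustration under assumptions: the `MPUNC` list uses the two dashes named in this PR, the splitting logic is a simplified stand-in for the real tokenizer, and the pattern behind `HYPHENATED_WORDS_RE` is invented for the example (only the name HYPHENATED-WORDS appears in this PR).

```python
import re

# Assumption: middle-punctuation tokens, as named in this PR ("--" and the em-dash).
MPUNC = ("--", "\u2014")

# Assumption: an invented pattern standing in for the HYPHENATED-WORDS regex.
HYPHENATED_WORDS_RE = re.compile(r"[A-Za-z]+(-+[A-Za-z]+)+$")

def split_on_mpunc(word):
    """Split a word around middle punctuation; never re-split the punctuation."""
    if word in MPUNC:
        return [word]
    for punc in MPUNC:
        left, sep, right = word.partition(punc)
        if sep and left and right:
            return split_on_mpunc(left) + [sep] + split_on_mpunc(right)
    return [word]

def alternatives_for(word):
    """A word that can split counts as known, so no regex guess is issued."""
    parts = split_on_mpunc(word)
    if len(parts) > 1:
        return [parts]
    if HYPHENATED_WORDS_RE.match(word):
        return [[word + "[!HYPHENATED-WORDS]"]]
    return [[word]]
```

In this sketch "is--a" would also match the hyphenated-words pattern, but because it can split it yields only the split alternative.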
Comment: The word could also be enclosed in '', but maybe it is better to enclose all words in rare quotation marks like ❮ ❯, or not to enclose them in quotation marks at all. So for now this aspect is not changed here.
This can happen if an alternative couldn't be created in issue_word_alternative() (which returns NULL in such a case).
Since the word address serves as the alternative_id, if the word already participates in an alternative (consisting of several words), this alternative will be wrongly used. Fix it by using only one word if it is marked by issued_unsplit. XXX: This problem means we may have 2 different alternatives with the same alternative_id. Can it cause a problem when determining the hier_position?
(To avoid unrecognized symbol errors.)
This prevents recursive splitting of xPUNC tokens, e.g. "--" to "_" "_".
Issued randomly-split tokens should not be subject to yet another random split. Currently this is prevented by marking them with TS_ANYSPLIT. This means separate_word() is entered again for each of these tokens, trying the xPUNC splitting again on each, and then anysplit() just ignores them. Instead, mark them (as done in a previous version) with TS_DONE, so they are popped out of the token queue and not re-processed. However, regex-marking is still needed. The fix is to use tokenization_done(), which now also does regex-marking (and also adds TS_DONE).
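The queue behavior described above can be sketched as follows, in Python rather than the library's C. Only the names `tokenization_done` and `TS_DONE` come from the commit message; `TS_INITIAL`, the `Token` class, the midpoint split (a deterministic stand-in for a random split) and the `NUMBERS` pattern are assumptions made for the example.

```python
import re
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Step(Enum):
    TS_INITIAL = auto()  # hypothetical name for "not processed yet"
    TS_DONE = auto()

@dataclass
class Token:
    text: str
    status: Step = Step.TS_INITIAL
    regex_name: Optional[str] = None

NUMBERS_RE = re.compile(r"\d+$")  # assumption: stand-in for a dictionary regex

def tokenization_done(token):
    """Regex-mark the token, then mark it as finished."""
    if NUMBERS_RE.match(token.text):
        token.regex_name = "NUMBERS"
    token.status = Step.TS_DONE

def process(word):
    """Return the finished tokens and the number of split attempts made."""
    queue = deque([Token(word)])
    finished, split_attempts = [], 0
    while queue:
        token = queue.popleft()
        if token.status is Step.TS_DONE:
            finished.append(token)  # popped, never re-processed
            continue
        split_attempts += 1
        if len(token.text) < 2:
            tokenization_done(token)
            queue.append(token)
            continue
        mid = len(token.text) // 2  # deterministic stand-in for a random split
        for part in (token.text[:mid], token.text[mid:]):
            sub = Token(part)
            tokenization_done(sub)  # issued sub-tokens are final
            queue.append(sub)
    return finished, split_attempts
```

Because the issued sub-tokens are already marked done, splitting is attempted once for the input word and never again for its parts, while each part still gets its regex mark.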
Else, after adding MPUNC "--", we get: Linkage 2, cost vector = (UNUSED=0 DIS= 0.09 LEN=6) +------------Wa------------+ | +-------Dsu-------+ | | +----A---+ | | | | LEFT-WALL this.p is--a[!HYPHENATED-WORDS].a test.n Press RETURN for the next linkage. linkparser> Linkage 3, cost vector = (UNUSED=0 DIS= 4.00 LEN=9) +---------Xx--------+ +----->WV----->+ +---Js---+ +---Wd---+-Ss*b+ | +Ds**c+ | | | | | | LEFT-WALL this.p is.v --.r a test.n
The code is said to check for NULL by now.
After the fix that validates that the checked alternatives belong to the given unsplit word, the check no longer detects duplicates using the existing batches (a bug in the check?). (Previously it might find unrelated duplicates.) If this check is still needed, a better check can be done by ensuring unique token positions for each alternative (hashing can be used).
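The suggested position-based check could look roughly like this, in Python rather than the library's C. It is only a sketch of the idea: the function `add_alternative` and the representation of an alternative as a tuple of (start, end) offsets are assumptions, not the library's data structures.

```python
def add_alternative(seen, alternatives, word, split_points):
    """Add an alternative unless one with the same token positions exists.

    'split_points' are sorted offsets inside 'word' where it is split.
    'seen' is a set of hashed position tuples, one per issued alternative.
    """
    bounds = [0, *split_points, len(word)]
    spans = tuple(zip(bounds, bounds[1:]))  # ((start, end), ...) per token
    if spans in seen:
        return False  # duplicate alternative, skip it
    seen.add(spans)
    alternatives.append([word[a:b] for a, b in spans])
    return True
```

Keying on positions instead of token text means two alternatives are considered equal only when they cut the same word at the same places.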
Only with this update it is supposed to be fully functional.
(The general idea of using several values of Tokenizing_step is not abandoned yet.)
will look at this "really soon now"
thanks!
No comments, I looked at it and it seemed reasonable.
... "--" and "—"). ... en/4.affix. Currently it contains only the ASCII em-dash ("--"). Also add an example sentence (from the old defect list).
Most probably adding Unicode dashes to the English dict is a good idea (at least the em-dash), since LPUNC/RPUNC already include them and they are currently unknown words.