echo "storeDiagramString:true, text: But as usual,we couldn't make it stick." | nc localhost 9000 #743

Closed
cdmalcl opened this issue Apr 20, 2018 · 12 comments



cdmalcl commented Apr 20, 2018

    +----------------------------Xp----------------------------+
    +------------------------->WV------------------------>+    |
    |        +--------------->WV--------------->+---I*j---+    |
    +-->Wc---+------Wdc-----+-----Ss----+---I---+-Osm+    |    +--RW--+
    |        |              |           |       |    |    |    |      |
LEFT-WALL but.ij [as] usual,we[?].n couldn't make.v it stick.v . RIGHT-WALL

usual,we


ampli commented Apr 20, 2018

With the current setup, there is no tokenization of most punctuation that doesn't have white space before/after it, such as in usual,we.

However, you can experiment with tokenizing them too.
Just add them to the MPUNC definition in en/4.0.affix.

Current definition:

    -- ‒ – — ― "(" "[": MPUNC+;

You can modify it to (adding ... and ,):

    -- ‒ – — ― "(" "[" ... ,: MPUNC+;
Note that some tokens need quoting if you add them there, e.g. ":".
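
A quick way to check the effect of such a change (a minimal sketch, assuming the link-grammar Python bindings are installed and that the edited en/4.0.affix is the one the en dictionary actually loads):

    from linkgrammar import Dictionary, ParseOptions, Sentence

    # Parse the problematic sentence and print the first linkage diagram,
    # to see whether "usual,we" is now split at the comma.
    po = ParseOptions()
    sent = Sentence("But as usual,we couldn't make it stick.", Dictionary('en'), po)
    for linkage in sent.parse():
        print(linkage.diagram())
        break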


linas commented Apr 23, 2018

I just added ellipses, commas, and semicolons to the MPUNC list; this will be in version 5.5.0 later today.

Closing.

linas closed this as completed Apr 23, 2018

ampli commented Apr 24, 2018

What is the reason for including [ in MPUNC (a previous change) but not also ] (and similarly for ( and ))? See for example the following from en/corpus-fixes.batch:

We looked for 3-Amino-3-azabicyclo[3.3.0]octane hydrochloride

                                          +--------------------MX--------------------+
    +----->WV---->+                       |             +-------------Xd-------------+
    +->Wd--+--Sp--+--MVp--+-------J-------+             |         +---------A--------+------Xc------+
    |      |      |       |               |             |         |                  |              |
LEFT-WALL we looked.v-d for.p 3-Amino-3-azabicyclo[!].n [ 3.3.0]octane[?].a hydrochloride[!].n RIGHT-WALL

(In this particular case, however, the first parse, which uses the alternative of not splitting the word, may be better.)


ampli commented Apr 25, 2018

Middle-splitting on , has implications for numbers with commas, which may be undesired.
For example (from corpus-biolg.batch):
The enzyme has a weight of 125,000 to 130,000
In addition to the previous parses, like:

    +-------->WV-------->+
    +----->Wd-----+      +----Os---+     +------MVp------+
    |      +-Ds**v+-Ss*s-+   +Ds**c+--Mf-+      +--NIfn--+---NItn--+
    |      |      |      |   |     |     |      |        |         |
LEFT-WALL the enzyme.s has.v a weight.s of 125,000[!] to.j-ru 130,000[!]

we now also get:

    +-------------------------------Xx-------------------------------+
    +-------->WV-------->+                                           |
    +----->Wd-----+      +----Os---+     +------MVp------+           |
    |      +-Ds**v+-Ss*s-+   +Ds**c+--Mf-+      +--NIfn--+--NItn-+   +>Wa-+
    |      |      |      |   |     |     |      |        |       |   |    |
LEFT-WALL the enzyme.s has.v a weight.s of 125,000[!] to.j-ru 130[!] , 000[!]

which is incorrect.

If middle-splitting on , is still desired in general, I think this can mostly be solved by treating numbers with commas (and also spaces!) as variable-length idioms (there is no syntax for that yet, but they can be crafted by hand definitions).

The problem of an exponential increase in the number of parses (2**(number of splits)) will still remain even then, unless we find a way to represent parses compactly (you wrote about that need in another context).
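
As a rough illustration of that growth (a toy sketch, not LG code; the names words and splittable are made up for the example):

    from itertools import product

    # Each comma-containing token can either be kept whole or split, independently
    # of the others, so k such tokens yield 2**k tokenization alternatives.
    words = ["125,000", "to", "130,000"]
    splittable = [w for w in words if "," in w]
    alternatives = list(product(("whole", "split"), repeat=len(splittable)))
    print(len(splittable), "splittable tokens ->", len(alternatives), "alternatives")  # 2 -> 4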

Also note that forbidding splits for tokens that match regexes was abandoned long ago in favor of creating alternatives. Returning to the old way would rule out parsing of many useful constructs.
What we need is a mechanism for alternative cost that is independent of the sentence cost and has a relative cut-off. I have encountered this problem in many yet-unsolved LG problems, including spell corrections of words with capital letters, and recently when I tried to investigate corrections like it's.#its: its; that over-correct.
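
To make the idea concrete (a toy sketch of the proposal, not existing LG behavior; the costs and the prune_alternatives helper are invented for illustration):

    # Keep only alternatives whose cost is within a relative cut-off of the
    # cheapest alternative, independently of the total sentence cost.
    def prune_alternatives(alternatives, cutoff=1.0):
        best = min(cost for _, cost in alternatives)
        return [(alt, cost) for alt, cost in alternatives if cost <= best + cutoff]

    print(prune_alternatives([("usual,we", 0.0), ("usual , we", 0.5), ("usu al,we", 3.0)]))
    # -> [('usual,we', 0.0), ('usual , we', 0.5)]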


linas commented Apr 25, 2018

Ouch. I added only "[" and not "]" because I was looking at Project Gutenberg texts, which had footnote and reference constructions, e.g. "Studies show that this happens[32]." where the [32] is some reference. I will add closing square brackets now.

The addition of the comma is the law of unintended consequences. Perhaps I should remove the comma, for now.

I don't see any easy, obvious solution. The complex solutions all seem to be problematic. We have distinct needs: bad punctuation, spelling errors, European morphology, Hebrew morphology, other (e.g. Turkish) morphology. One unified system for all this is possibly not enough.

For spelling errors, single-word substitution is a short-term patch; I did it because it's an easy, cheap stunt. The correct long-term approach is some kind of transformative grammar, where we can mutate sentences from unusual forms into "standard" forms. Some of that mutation is bad-spelling and bad-grammar related.

linas added a commit that referenced this issue Apr 25, 2018

ampli commented Apr 25, 2018

The addition of the comma is the law of unintended consequences. Perhaps I should remove the comma, for now.

Most probably.

I don't see any easy, obvious solution. The complex solutions all seem to be problematic. We have distinct needs: bad punctuation, spelling errors, European morphology, Hebrew morphology, other (e.g. Turkish) morphology. One unified system for all this is possibly not enough.

Many, if not all, of these things can be done with the current LG, if: definition and lookup limitations are removed (e.g. extending idiom definitions to allow <a>_<b>); constructions and definitions are made orthogonal (so they do not interfere with each other); ambiguity is removed (e.g. a regex label on input); explicit rules are used everywhere instead of implicit hard-coded ones (examples of implicit hard-coded rules are not applying regexes to tokens that are in the dict, stopping on the first matching regex, etc.; all of that always has bad implications); and token-splitting rules are defined (or better, derived from the dict) and not hard-coded.

An additional important improvement is a general way to detect "virtual morphemes", like (but not limited to) phonology and capitalization, and to use appropriate disjuncts for them, so that everything can be done by the dict, without a need to separate words into different files (like a/an, which can be very complex for complex phonology) or to add all capitalized words with their own rules. Such "virtual morphemes" are exactly like doing such separations; they are just practical shortcuts that are easy to define and modify (I'm working on that).

One unified system for all this is possibly not enough.

I think it is definitely possible.


ampli commented Apr 25, 2018

I fixed the bad formatting of my previous message, so please read it on GitHub...


linas commented Apr 25, 2018

if definition and lookup limitations are removed
(e.g. extending idiom definitions to allow _),

OK, that's done now. It's not obviously terribly useful, unless you have some good example that you haven't told me about (or that I've forgotten).

constructions and definitions be made orthogonal (so they will not interfere with each other),

Can you provide examples? For this, and each of the points below, could you maybe open an issue and include a description of the problem, an example, and a proposed fix?

ambiguity get removed (e.g. regex label on input),

Again, not sure what that means.

explicit rules be used everywhere instead of implicit hard-coded ones
(examples of implicit hard-coded rules are not applying regex to tokens in the dict,

What would that do, and how would that help?

stopping on the first matching regex, etc., all of that always has bad implications),

OK. Yes, that seemed useful at the time, but perhaps it has outlived its utility.

token splitting rules be defined (or better- derived from the dict) and not hard coded, etc.

Yes. Well, the REGPARTS/REGMID/etc. mechanism was an attempt to do that; I'm not convinced it ever worked very well, but also, we've barely ever used it. Or are you referring to something else, some replacement, some different way of doing things?

I like the general sentiment; please do this. I'm flat out of bright ideas at this particular moment.


ampli commented Apr 26, 2018

if definition and lookup limitations are removed
(e.g. extending idiom definitions to allow _),

OK, that's done, now. Its not obviously terribly useful, unless you have some good example that you haven't told me (or I've forgotten)

I had a markdown problem in my post and thought I had corrected it, but the post still contained it. I tried again to fix it, and now the formatting is OK.

Anyway, here is the corrected line (I forgot to use backquotes for <a>_<b>):

(e.g. extending idiom definitions to allow <a>_<b>)

Indeed, I never updated you about a possible use of such constructs. I tried to solve the case of collocations with holes. I will open an issue to describe the test I did.

ambiguity get removed (e.g. regex label on input),

Again, not sure what that means.

This doesn't parse:

UNITS is a regex label used for units.
LEFT-WALL [UNITS] is.v a regex.n label.n used.v-d for.p units.n .

This parses fine:

UNITSX (X removed)  is a regex label used for units.
LEFT-WALL UNITSX[!] ( x.n removed.v-d ) is.v a regex.n label.n used.v-d for.p units.n .

I can open an issue with my proposal for solving it.

stopping on the first matching regex, etc., all of that always has bad implications),

OK. Yes, that seemed useful at the time, but perhaps it has outlived its utility.

This is one of the many things that are not "orthogonal": if you would like to add a regex, as I did in my test, you interfere with the rest of the regexes.

token splitting rules be defined (or better- derived from the dict) and not hard coded, etc.

Yes. Well, the REGPARTS/REGMID/etc was an attempt to do that; I'm not convinced it ever worked very well, but also, we've just barely ever used it.

Not exactly. Its only intended use is for amy and the like languages, to denote tokens that should not be produced by random splitting.

Or are you refering to something else, some replacement, some different way of doing things?

Yes. Say we add the ability to denote regex tokens in the xPUNC definitions.
For example:

    -- ‒ – — ― "(" "[" ... "," ";" /(,)[^0-9]/ : MPUNC+;

(A POSIX regex can return the matching group, which can serve for the actual split.)
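
To illustrate how the matching group could drive the split (a hypothetical sketch in Python; the middle_split helper is not part of LG):

    import re

    # Split on a comma only when it is not followed by a digit, using the
    # capturing group to locate the exact split point.
    MPUNC_RE = re.compile(r'(,)[^0-9]')

    def middle_split(token):
        m = MPUNC_RE.search(token)
        if not m:
            return [token]
        return [token[:m.start(1)], m.group(1), token[m.end(1):]]

    print(middle_split("usual,we"))  # ['usual', ',', 'we']
    print(middle_split("125,000"))   # ['125,000'] -- comma followed by a digit, no split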

I like the general sentiment; please do this. I'm flat out of bright ideas at this particular moment.

I have a list of things to implement. I can try to open issues on each of them.
But I will need your input on each...


linas commented Apr 26, 2018

UNITS is a regex label used for units.

Yes, please.

/(,)[^0-9]/ : MPUNC+;

Oh, that's interesting! Yeah, I guess I like that.

But I will need your input on each...

I can try to give it. I only have a finite amount of ability to read, focus and respond, and am already operating at the limit :-)


linas commented Apr 26, 2018

P.S. When should I publish version 5.5.0?


ampli commented Apr 26, 2018

Just now, if you are willing to have the rand fix go into the next release only.
This is because I will not be able to look at it until tomorrow.
Or maybe just wait 24 hours and include this fix too (I suppose I will be able to fix the problem).
