Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

islands_ok flag should not be true #140

Closed
ampli opened this issue Jul 9, 2015 · 7 comments
Closed

islands_ok flag should not be true #140

ampli opened this issue Jul 9, 2015 · 7 comments

Comments

@ampli
Copy link
Member

ampli commented Jul 9, 2015

There was a problem to make a direct detailed comparison of run batches with these 2 different versions of the program, because the dicts are slightly different, 5.3.0 prints long labels and has defaults to island_ok=1

So I ran both on the dict of 5.2.x, both with with -island_ok=0, and commented out producing long labels in 5.3.0. I also added -constituents=1 to both. I used en/4.0.batch. I did this check to validate that the latest fixes didn't break it. Previously (before the above mentioned changes between the versions) I checked it in a similar manner on the other batch files.

There was only two changes in the output (differences in blank lines ignored and their reason has not been investigated):

Expected:

< /tmp/link-grammar/link-grammar-5.2.5-maintenance

---
> /tmp/link-grammar/linkage
...
< LEFT-WALL [But] why did.v-d you send.v him that.j-d nasty.a note.n 

---
> LEFT-WALL [but] why did.v-d you send.v him that.j-d nasty.a note.n 
...
< (S {But} why did.v-d

---
> (S {but} why did.v-d

Minor tokenizing difference - Corp. got broken also to Corp . and the dict accepted both:

< Found 12 linkages (12 had no P.P. violations)

---
> Found 24 linkages (24 had no P.P. violations)
...
< Alfred.m Baird[!] , formerly vice.n-u president.t of Beevil[!] Corp..y , has.v been.v appointed.v-d as.e president.t 

> Alfred.m Baird[!] , formerly vice.n-u president.t of Beevil[!] Corp.y . , has.v been.v appointed.v-d as.e president.t 
@linas
Copy link
Member

linas commented Jul 9, 2015

Hmmm The default islands_ok should not have changed. That seems like a bug.

The downcasing of But is acceptable. The slitting of Corp. is .. odd. It would be better if it was not split.

@ampli
Copy link
Member Author

ampli commented Jul 9, 2015

From the recent change log:

  • Change default setting of 'islands_ok' to true.

I didn't like this change...

Regarding the splitting of Corp., Both Corp and . are in the dict, a thing that supports their splitting.

Should the rule be the following one?
If 2 tokens together are a string that is found in the dict, then they should not get split.
Edit: if 2 or more...

I fear it may not be general enough (for arbitrary languages), but I can try to add it (conditionally compiled by default, so we can experiment with it).

Another edit: it is surely not correct in general, so maybe the rule should be this:
If potential tokens, together, are a word in the dict, and at least one of them is LPUNC or RPUNC, then they should not get split.

I'm not sure that even this is general enough...

@linas
Copy link
Member

linas commented Jul 9, 2015

Gahhh. Yes, I now vaguely remember changing islands_ok to true, after reading the source code for it, and deciding it was a neat thing that we should always have enabled. I do not quite remember why I thought it would be a good idea to change it, but it did seem like a good idea. I guess I should have written down a justification ... Do yousee bad things happening, as a result?

Re corp vs corp.: OK, do not do anything. The right fix would be to assign the LG rules that link to "corp" to use a higher cost than those that link to "corp." That way, the form with the period would be preferred.

@ampli
Copy link
Member Author

ampli commented Jul 9, 2015

Consider the sentence:
*This is a test what

linkparser> !islands-ok=1
islands-ok set to 1
linkparser> this is a test what
No complete linkages found.
Found 8 linkages (8 had no P.P. violations) at null count 1
    Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=10)
    +--------------Ww-------------+
    |              +---Osm--+     |
    |        +-Ss*b+  +Ds**c+     |
    |        |     |  |     |     |
LEFT-WALL this.p is.v a  test.n what 
linkparser> !islands-ok=0
islands-ok set to 0
linkparser> this is a test what
No complete linkages found.
Found 6 linkages (6 had no P.P. violations) at null count 1
    Linkage 1, cost vector = (UNUSED=1 DIS= 0.00 LEN=7)

    +----->WV----->+---Osm--+       
    +---Wd---+-Ss*b+  +Ds**c+       
    |        |     |  |     |       
LEFT-WALL this.p is.v a  test.n [what] 

In the parse with islands-ok=1, it is harder to see that this is a broken sentence. In case of a more complex sentence, it may be harder to even find the discontinuity problem(s) at a glance. It is also less clear that the sentence is correct without the word what, and it is not shown how the sentence could be correctly parsed without it.

@linas
Copy link
Member

linas commented Jul 10, 2015

OK. I'll try to deal with this next week

linas added a commit that referenced this issue Nov 29, 2015
@linas
Copy link
Member

linas commented Nov 29, 2015

Fixed the Corp. thing in commit 40940eb

@linas linas changed the title Comparing the results of RC 5.3.0 to 5.2.5-maintenance 9ec8027 islands_ok flag should not be true Nov 29, 2015
linas added a commit that referenced this issue Dec 14, 2015
Just as it was in the 5.2.x series. This seemed like a good idea,
but ..really isn't. It just confusing, and makes it harder to see
bad parses.  Per bug report #140
@linas
Copy link
Member

linas commented Dec 14, 2015

reverted islands_ok in 2f97e3c

@ampli ampli closed this as completed Jan 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants