islands_ok flag should not be true #140
Hmmm. The default islands_ok should not have changed. That seems like a bug. The downcasing of But is acceptable. The splitting of Corp. is ... odd. It would be better if it were not split.
From the recent change log:
I didn't like this change... Regarding the splitting of Should the rule be the following one? I fear it may not be general enough (for arbitrary languages), but I can try to add it (conditionally compiled by default, so we can experiment with it). Another edit: it is surely not correct in general, so maybe the rule should be this: I'm not sure that even this is general enough...
Gahhh. Yes, I now vaguely remember changing islands_ok to true, after reading the source code for it, and deciding it was a neat thing that we should always have enabled. I do not quite remember why I thought it would be a good idea to change it, but it did seem like a good idea. I guess I should have written down a justification ... Do you see bad things happening as a result? Re corp vs corp.: OK, do not do anything. The right fix would be to give the LG rules that link to "corp" a higher cost than those that link to "corp." That way, the form with the period would be preferred.
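To illustrate the cost idea in the comment above, here is a toy sketch of choosing between two tokenizations by total cost. It is not link-grammar code: the `Alternative` class, the `preferred` helper and the cost values are all hypothetical, and the real parser ranks linkages with its own cost machinery.

```python
# Toy model only: the names and cost values are hypothetical and are not
# taken from the link-grammar dictionaries or source code.
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    tokens: tuple  # how the input word was tokenized
    cost: float    # total cost of the dictionary rules that were used


def preferred(alternatives):
    """Return the alternative with the lowest cost (first one wins on ties)."""
    return min(alternatives, key=lambda alt: alt.cost)


alternatives = [
    Alternative(tokens=("Corp", "."), cost=1.0),  # rules for "corp" made costlier
    Alternative(tokens=("Corp.",), cost=0.0),     # rules for "corp." left cheap
]

best = preferred(alternatives)
print(best.tokens)
```

With the costlier rules attached to the split form, the unsplit form is the one selected, while the split form remains available as a fallback.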
Consider the sentence:
In the parse with islands-ok=1, it is harder to see that this is a broken sentence. In case of a more complex sentence, it may be harder to even find the discontinuity problem(s) at a glance. It is also less clear that the sentence is correct without the word
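For readers unfamiliar with the term, an "island" is a group of words linked to each other but not to the rest of the sentence. The sketch below is a simplified model of that notion, not the parser's actual algorithm: `count_islands` and `acceptable` are hypothetical helpers, and the word indices and links are made up for illustration.

```python
def count_islands(num_words, links):
    """Count connected groups among words 0..num_words-1 using union-find."""
    parent = list(range(num_words))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for left, right in links:
        parent[find(left)] = find(right)

    return len({find(i) for i in range(num_words)})


def acceptable(num_words, links, islands_ok):
    """With islands_ok off, only a fully connected linkage is accepted."""
    return islands_ok or count_islands(num_words, links) == 1


# Hypothetical five-word linkage: words 0-2 are linked, and words 3-4
# are linked to each other only.
links = [(0, 1), (1, 2), (3, 4)]
print(count_islands(5, links))
print(acceptable(5, links, islands_ok=False))
print(acceptable(5, links, islands_ok=True))
```

In this model the disconnected linkage is accepted only when islands are allowed, which is why a broken sentence can look like a normal parse under islands-ok=1.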
OK. I'll try to deal with this next week.
Fixed the Corp. thing in commit 40940eb
Just as it was in the 5.2.x series. This seemed like a good idea, but ... really isn't. It is just confusing, and makes it harder to see bad parses. Per bug report #140
reverted islands_ok in 2f97e3c
It was difficult to make a direct, detailed comparison of batch runs between these two versions of the program, because the dicts are slightly different, and 5.3.0 prints long labels and defaults to island_ok=1.
So I ran both on the 5.2.x dict, both with -island_ok=0, and commented out the production of long labels in 5.3.0. I also added -constituents=1 to both. I used en/4.0.batch. I did this check to validate that the latest fixes didn't break it. Previously (before the above-mentioned changes between the versions) I checked it in a similar manner on the other batch files.
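A comparison of this kind can be scripted. The following is a minimal sketch, assuming the two batch outputs have already been captured as text; it is not the procedure actually used here. The helper names and the sample strings are hypothetical, and blank lines are dropped before diffing.

```python
import difflib


def non_blank_lines(text):
    """Split text into lines, dropping lines that are empty or whitespace-only."""
    return [line for line in text.splitlines() if line.strip()]


def diff_outputs(old_text, new_text):
    """Return unified-diff lines between two outputs, ignoring blank lines."""
    return list(difflib.unified_diff(
        non_blank_lines(old_text),
        non_blank_lines(new_text),
        fromfile="5.2.x",
        tofile="5.3.0",
        lineterm="",
    ))


# Hypothetical sample outputs that differ only in blank lines.
old_output = "line one\n\nline two\n"
new_output = "line one\nline two\n\n"
for line in diff_outputs(old_output, new_output):
    print(line)
```

An empty result means the two runs differ only in blank lines; any remaining lines are the differences worth inspecting.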
There were only two changes in the output (differences in blank lines were ignored, and their cause has not been investigated):
Expected:
Minor tokenizing difference: Corp. also got broken into Corp . and the dict accepted both: