Opaque failure - What to do when 'Learning fails'? #14

Closed
TimLovellSmith opened this issue Jul 24, 2017 · 4 comments

Comments

TimLovellSmith (Member) commented Jul 24, 2017

I see this code snippet in the samples. When I create my own dataset and try to learn from some not-very-nice real-world examples, my program tends to output "Error: Learning fails!"

What does it really mean that learning failed? Does it mean the grammar is too incomplete to build a suitable generalization, so the grammar needs to be extended? Could it also mean that generalization was too hard for the learning system and it gave up, in which case it might work with more examples? How can I determine the correct path forward?

This Learn() API really needs to throw a specific exception instead of just returning null!
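
For illustration, one stopgap is a caller-side wrapper that turns the null result into an exception at the call site. This is only a sketch, not part of the SDK: `LearnOrThrow` and the `learnTopRankedProgram` delegate are hypothetical stand-ins for whatever Learn() entry point the sample actually calls.

```csharp
using System;
using System.Collections.Generic;

public static class LearnGuard
{
    // Sketch: wrap a Learn() call that returns null on failure so the failure
    // surfaces as an exception. `learnTopRankedProgram` stands in for whatever
    // Learn() entry point the sample uses (hypothetical delegate).
    public static TProgram LearnOrThrow<TExample, TProgram>(
        Func<IReadOnlyList<TExample>, TProgram> learnTopRankedProgram,
        IReadOnlyList<TExample> examples)
        where TProgram : class
    {
        TProgram program = learnTopRankedProgram(examples);
        if (program == null)
        {
            // The SDK gives no reason today; this message only restates what a
            // null result implies (see the maintainers' replies below).
            throw new InvalidOperationException(
                $"Learning failed: no program in the DSL satisfies all {examples.Count} examples.");
        }
        return program;
    }
}
```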

TimLovellSmith changed the title from "What to do when 'Learning fails'?" to "Opaque failure - What to do when 'Learning fails'?" on Jul 24, 2017
danpere (Contributor) commented Jul 24, 2017

Since a program could not be learned from the examples given, more examples usually will not help. Normally, all programs expressible in the DSL that satisfy the examples are learned, so learning no programs means there are no programs in the DSL that satisfy all of the examples, and adding more examples would only further constrain the learning problem. (I say "normally" because, using the escape hatches of the learning procedure, you could write your own non-monotonic learning sub-procedure, but that's generally a bad idea because of exactly the confusion you bring up.) As you say, this likely means the grammar would have to be extended to express the desired operation.

We know the error reporting is poor and it's an issue we intend to address.

If you are comfortable sharing your data, it would be helpful to see your inputs, both to determine whether the task is in fact not expressible and, if so, to help us decide how we might extend the language to cover your scenario. You can e-mail me at danpere@microsoft.com if you don't want to share it publicly.
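
Given the monotonicity described above (adding examples can only rule candidate programs out), one way to localize the problem on the caller side is to re-learn on growing prefixes of the example list and note where learning first fails. A minimal sketch; the `learn` delegate is a placeholder for the real Learn() call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LearnDiagnostics
{
    // Sketch: find the first example whose addition makes learning fail.
    // Because adding examples only removes candidate programs, the first
    // failing prefix pinpoints an example the current grammar cannot
    // reconcile with the ones before it.
    public static int FindFirstFailingExample<TExample, TProgram>(
        Func<IReadOnlyList<TExample>, TProgram> learn,
        IReadOnlyList<TExample> examples)
        where TProgram : class
    {
        for (int count = 1; count <= examples.Count; count++)
        {
            if (learn(examples.Take(count).ToList()) == null)
            {
                return count - 1; // index of the offending example
            }
        }
        return -1; // all examples are jointly satisfiable
    }
}
```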

TimLovellSmith (Member, Author) commented

@danpere Thanks for the feedback. I am not sure, but I think one of the problems I might be running into is that there are (at least) two date formats mingled in the documents, yyyymmdd and mm/dd/yyyy, either of which could be the accepted 'output', and the grammar may be failing to generalize across them.

danpere (Contributor) commented Jul 24, 2017

To clarify: are you using the Extraction.Text language? (That's the sample that has that exact text as its error message.)

The differing formats might be the issue. Extraction.Text usually ends up being able to use context when the formats differ, but that might not apply in this case. Also, there is an internal regular expression for matching "dates" that is fairly flexible, but it can't cover everything. Extraction.Text does not currently support conditionals, but one way to work around that is to make multiple fields for the different date formats/contexts. @vuminhle may be able to give more tips on getting it to work on difficult scenarios.
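
The "multiple fields" workaround can be sketched as two independently learned programs, one per date format, with a fallback at extraction time. The `learn` and `run` delegates below abstract over the actual Extraction.Text Learn/Run calls, whose exact names are not confirmed here; this is a minimal sketch only.

```csharp
using System;
using System.Collections.Generic;

public static class DualFormatDates
{
    // Sketch of the "one field per date format" workaround: learn one program
    // from yyyymmdd examples and another from mm/dd/yyyy examples, then try
    // them in order at extraction time. `learn` and `run` are placeholders
    // for the real Extraction.Text Learn/Run calls.
    public static Func<string, string> BuildDateExtractor<TProgram>(
        Func<IReadOnlyList<(string Input, string Output)>, TProgram> learn,
        Func<TProgram, string, string> run,
        IReadOnlyList<(string Input, string Output)> yyyymmddExamples,
        IReadOnlyList<(string Input, string Output)> slashDateExamples)
        where TProgram : class
    {
        TProgram compact = learn(yyyymmddExamples)
            ?? throw new InvalidOperationException("Learning failed for the yyyymmdd field.");
        TProgram slashed = learn(slashDateExamples)
            ?? throw new InvalidOperationException("Learning failed for the mm/dd/yyyy field.");

        // Whichever program yields a non-null extraction wins.
        return input => run(compact, input) ?? run(slashed, input);
    }
}
```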

vuminhle (Member) commented

@danpere has covered all the main points.
If there is no program, it means your task cannot be expressed in the current grammar. We could have given you back the problematic examples (or a maximal subset of working examples), but I'm not sure that information would be useful. Furthermore, there may be more than one variation of such a set.
We do give more indicative messages if your examples conflict or contain duplicates.

As you rightly observed, we can solve this by extending the grammar to support the task. @danpere mentioned learning conditionals, which basically partitions your inputs into different clusters (each sharing the same format) and learns a program for each of them. This is ongoing work.
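
Until learned conditionals land, that partitioning can be emulated by hand: cluster the examples by which format they contain and learn a separate program per cluster. The regexes below are illustrative guesses at the two formats mentioned in this thread, and `learn` again stands in for the real Learn() call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ManualPartitioning
{
    // Sketch: emulate the conditional/partitioning behavior by clustering
    // examples on the date format they contain and learning per cluster.
    public static Dictionary<string, TProgram> LearnPerCluster<TProgram>(
        IEnumerable<(string Input, string Output)> examples,
        Func<IReadOnlyList<(string Input, string Output)>, TProgram> learn)
        where TProgram : class
    {
        var programs = new Dictionary<string, TProgram>();
        foreach (var cluster in examples.GroupBy(e =>
            Regex.IsMatch(e.Input, @"\b\d{2}/\d{2}/\d{4}\b") ? "mm/dd/yyyy"
            : Regex.IsMatch(e.Input, @"\b\d{8}\b") ? "yyyymmdd"
            : "other"))
        {
            TProgram program = learn(cluster.ToList())
                ?? throw new InvalidOperationException(
                    $"Learning failed for the '{cluster.Key}' cluster.");
            programs[cluster.Key] = program;
        }
        return programs;
    }
}
```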

Which API did you use? Did you extract a single substring out of a string, or a sequence of substrings out of a string?
It would be great if you could share one or two lines of your (anonymized) data, together with the fields you are extracting, so that we can analyze what is going on.
