In the case of timeouts, frequencies should be estimated and marked as such #6

Open
kupietz opened this issue Oct 22, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@kupietz
Member

kupietz commented Oct 22, 2021

At the moment, the warnings are worded rather vaguely:

682: Response time exceeded

They do not spell out the consequences.

When a timeout occurs, the returned frequency values are currently lower bounds, not estimates, but this is not made explicit anywhere.

Estimating frequencies probably also requires changes in Krill.

@kupietz kupietz added the bug Something isn't working label Oct 22, 2021
@Akron
Member

Akron commented Oct 22, 2021

Would estimated frequencies really be more helpful than lower bounds? I made a proposal for a change in the response from Krill that makes the lower bound explicit: set the results to -1 and add a separate result key, such as total_till_time_exceeded or similar. Would that help?
Estimation could be added, but it wouldn't be very sophisticated for a virtual corpus (VC): it could only extrapolate from the whole corpus.
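
A minimal sketch of how a client could read a response under this kind of convention. The field name `total_results` is purely hypothetical, and `total_till_time_exceeded` is only the working name floated above; neither is confirmed Krill output.

```python
from dataclasses import dataclass


@dataclass
class Frequency:
    value: int
    is_lower_bound: bool  # True if the search timed out before finishing


def read_frequency(meta: dict) -> Frequency:
    """Interpret a response dict under the proposed (hypothetical) convention."""
    total = meta.get("total_results", -1)  # hypothetical key, -1 means "unknown"
    if total >= 0:
        return Frequency(total, False)
    # Timed out: fall back to the count reached before the timeout.
    partial = meta.get("total_till_time_exceeded")  # hypothetical key
    if partial is None:
        raise ValueError("response carries neither an exact count nor a lower bound")
    return Frequency(partial, True)
```

With this shape, calling code cannot mistake a lower bound for an exact count, because it has to look at `is_lower_bound` explicitly.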

@kupietz
Member Author

kupietz commented Oct 22, 2021

Estimating frequencies, or some other workaround, is required when the frequency query is buried deep inside other functions, as in all collocation analysis functions, but also for simple frequency queries over vectors of queries and VCs.

Lower bounds alone would render the whole API client idea useless, unless perhaps timeouts happen rarely or can be resolved by a retry or something similar.
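
The retry idea could look roughly like the sketch below. `run_query` is a hypothetical callable standing in for whatever issues the frequency query; it is assumed to return a `(count, timed_out)` pair. Whether a retry actually helps depends on server-side behaviour (caching, load), which this sketch does not model.

```python
from typing import Callable, Tuple


def query_with_retry(
    run_query: Callable[[], Tuple[int, bool]],  # hypothetical: returns (count, timed_out)
    max_attempts: int = 3,
) -> Tuple[int, bool]:
    """Retry a frequency query; return (count, is_lower_bound)."""
    best = 0
    for _ in range(max_attempts):
        count, timed_out = run_query()
        if not timed_out:
            return count, False  # exact count
        # Every timed-out count is a lower bound, so the largest one is the tightest.
        best = max(best, count)
    return best, True  # still only a lower bound, and marked as such
```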

@Akron Akron added enhancement New feature or request and removed bug Something isn't working labels Oct 22, 2021
@Akron
Member

Akron commented Oct 22, 2021

I've labeled this as an enhancement, since estimation would be a completely different feature and, I guess, would need to be implemented on Krill's side, at least to the extent of returning the necessary numbers.

As I said, it may not work well with the numbers we currently get. I am not an expert in this field of statistics, but I would assume that a reasonable estimate requires a rough percentage of how much of the data in question we have already searched when the timeout hits, and how much is left. We can provide this information for the whole index (i.e. how many documents we have passed in relation to the whole corpus), but as far as I can see we can't provide it for a VC for now, because a VC is not evenly distributed over the whole corpus/index.

To be able to do that, we would have to calculate the number of documents in the VC and the number of documents in the VC that we have already passed (at least roughly). We could do that in a single run.
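
Given those two document counts, the simplest estimate is a linear extrapolation. This is only a sketch under a strong assumption: that matches are spread evenly across documents, which need not hold. The function and parameter names are illustrative and not part of Krill. For a VC, both counts would have to refer to documents inside the VC.

```python
def estimate_frequency(matches_so_far: int, docs_passed: int, docs_total: int) -> float:
    """Extrapolate a total match count from the share of documents searched.

    Assumes matches are evenly distributed over documents (a simplification).
    """
    if docs_total <= 0 or not 0 < docs_passed <= docs_total:
        raise ValueError("need 0 < docs_passed <= docs_total")
    # Scale the partial count by the inverse of the fraction already searched.
    return matches_so_far * docs_total / docs_passed


# 120 matches after passing 2,000 of 10,000 documents extrapolates to 600.
print(estimate_frequency(120, 2_000, 10_000))  # 600.0
```

The result should still be reported as an estimate, alongside the lower bound of 120, so that the two are never confused.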

I see three options for this:

  • Doing it in the first run every time. This would slow down all searches.
  • Doing it only if an "estimation" flag is set.
  • Doing it after a timeout. This, however, would defeat the purpose of the timeout, as the calculation in a redundant run could be quite costly.

However - this would be a Krill enhancement.


3 participants