
cold start handling in ranked batch sampling #28

Closed
zhangyu94 opened this issue Jan 10, 2019 · 10 comments
@zhangyu94 (Contributor)

Hi!

The behavior of cold start handling in ranked batch sampling seems to differ from Cardoso et al.'s "Ranked batch-mode active learning".

modAL/modAL/batch.py, lines 133 to 139 in 452898f:

```python
if classifier.X_training is None:
    labeled = select_cold_start_instance(X=unlabeled, metric=metric, n_jobs=n_jobs)
elif classifier.X_training.shape[0] > 0:
    labeled = classifier.X_training[:]

# Define our record container and the maximum number of records to sample.
instance_index_ranking = []
```

In modAL's implementation, the instance selected by select_cold_start_instance in the cold start case is not added to the instance list instance_index_ranking, whereas in "Ranked batch-mode active learning" the cold start instance appears to be the first item of instance_index_ranking. Currently, select_cold_start_instance returns only the instance itself:

```python
return X[best_coldstart_instance_index].reshape(1, -1)
```

If my understanding of the algorithm proposed in the paper and of modAL's implementation is correct, we could change the return of select_cold_start_instance to

```python
return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1)
```

store best_coldstart_instance_index in instance_index_ranking, and revise ranked_batch accordingly.
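For concreteness, a minimal sketch of the proposed change (the averaged-distance selection shown here is an assumption about select_cold_start_instance's criterion, not a copy of modAL's code):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def select_cold_start_instance(X, metric='euclidean', n_jobs=1):
    # Hypothetical revised version: return the chosen index *and* the
    # instance, so that ranked_batch can seed instance_index_ranking with it.
    average_distances = np.mean(
        pairwise_distances(X, metric=metric, n_jobs=n_jobs), axis=0)
    best_coldstart_instance_index = int(np.argmin(average_distances))
    return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1)

# Illustrative use inside ranked_batch's cold start branch:
unlabeled = np.random.rand(10, 3)
best_index, labeled = select_cold_start_instance(X=unlabeled)
instance_index_ranking = [best_index]  # the cold start instance now heads the ranking
```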

@cosmic-cortex (Member)

Great, thanks! It is certainly not added. I'll take a look at the paper as soon as possible!

@cosmic-cortex (Member)

Thanks for the PR! Fixed by #29.

@zhangyu94 (Contributor, Author)

Hmm... this issue and PR #29 actually address different problems.
This issue is about the best cold start instance not being added to the first batch (in ranked_batch), while PR #29 addresses the incorrect computation of the instance index (in select_instance).

I haven't opened a pull request for this issue yet, since solving it will very likely require changing the API of select_cold_start_instance.
If needed, I can open a PR for it later today.

@cosmic-cortex (Member)

Sorry, I remembered the issue wrong and didn't reread the post before commenting. Issue reopened! :)

cosmic-cortex reopened this Jan 14, 2019
@cosmic-cortex (Member)

In any case, there is no need to rush with the PR! I probably won't have time to work on it this week, so any help is appreciated!

@zhangyu94 (Contributor, Author)

Hi, I have opened PR #30 for this issue.

By the way, I think it would be great if we could compose the cold start handling mechanism that currently works for ranked batch sampling (and possibly other cold start handling strategies in the future) with the other active sampling strategies supported by modAL.

@cosmic-cortex (Member)

Alright, thanks for the PR! I finally had time to review and merge it. Currently, some cold start handling is implemented for the utility measure functions, but it only checks whether the estimator has been fitted yet and, if not, returns a zero array. Implementing the same density-based cold start criterion for a general query function is a good idea.
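That fitted-check pattern would look roughly like this (a simplified sketch for illustration; the utility computation and function name are stand-ins, not modAL's exact code):

```python
import numpy as np
from sklearn.exceptions import NotFittedError

def classifier_uncertainty(classifier, X):
    # If the estimator has not been fitted yet, fall back to a zero
    # utility array, as described above.
    try:
        class_probabilities = classifier.predict_proba(X)
    except NotFittedError:
        return np.zeros(shape=(X.shape[0],))
    return 1 - np.max(class_probabilities, axis=1)
```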

@zhangyu94 (Contributor, Author)

I see. I will take a look at how to integrate cold start handling mechanisms.

One thing I have been thinking about is whether it is better to pass the cold start function to the query strategy functions or to the Learner at initialization. It is logically sounder to pass the cold start criterion to a query strategy, since "cold start criteria" are part of the "query strategy"; in terms of implementation, however, it seems much easier to do it the other way. If we pass the cold start criterion to the Learner, it seems we only need to change the Learner.query method to support cold start handling for all the query strategies. By comparison, if the cold start criterion is passed to the query strategy functions, every query strategy function may need to be revised.
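For concreteness, a sketch of the Learner-side option (hypothetical class and parameter names, not modAL's actual ActiveLearner API):

```python
class Learner:
    # Hypothetical sketch: the cold start strategy is passed at initialization,
    # and query() dispatches to it while there is no training data yet.
    def __init__(self, estimator, query_strategy, cold_start_strategy=None):
        self.estimator = estimator
        self.query_strategy = query_strategy
        self.cold_start_strategy = cold_start_strategy
        self.X_training = None

    def query(self, X_pool, **query_kwargs):
        if self.X_training is None and self.cold_start_strategy is not None:
            return self.cold_start_strategy(X_pool, **query_kwargs)
        return self.query_strategy(self.estimator, X_pool, **query_kwargs)
```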

Thanks.

@cosmic-cortex (Member)

I agree completely. I think it is better if the cold start strategy is passed to the query strategy, even if all the query strategy functions need to be modified.

In connection with this, I also plan to refactor the query strategy functions. If you check the code, for instance here, the implementations of the uncertainty_sampling, margin_sampling, and entropy_sampling functions are almost identical, apart from the function they call to calculate the utility. This could be solved with a function factory or some similar construct. The only reason I implemented it this way is that I wanted to avoid adding docstrings one by one later. Do you have any idea what might work well here? We might kill two birds with one stone, because this would also solve the problem you outlined.
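A function factory of the kind described here might look like this (a sketch; multi_argmax stands in for the actual selection helper, and the signatures are illustrative):

```python
import numpy as np

def multi_argmax(values, n_instances=1):
    # Indices of the n_instances largest utility scores.
    return np.argpartition(-values, n_instances - 1)[:n_instances]

def make_query_strategy(utility_measure, docstring=None):
    # Build a query strategy from a utility measure, so uncertainty_sampling,
    # margin_sampling and entropy_sampling can share a single body.
    def query_strategy(classifier, X, n_instances=1, **utility_kwargs):
        utility = utility_measure(classifier, X, **utility_kwargs)
        query_idx = multi_argmax(utility, n_instances=n_instances)
        return query_idx, X[query_idx]
    query_strategy.__doc__ = docstring  # docstrings can still be attached here
    return query_strategy
```

The docstring concern could then be handled by passing each docstring to the factory once, rather than writing near-identical function bodies.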

@zhangyu94 (Contributor, Author)

Hmm... I don't have a better idea than using a function factory.

A possible alternative is to lift the query strategies from functions to instances of a QueryStrategy class. Different instances of this QueryStrategy class could have different scorers (e.g., classifier_entropy) and cold start handlers, but this solution doesn't seem to have a clear advantage over the function factory.
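A sketch of what that could look like (hypothetical names; the scorer and selector arguments mirror the factory idea above):

```python
class QueryStrategy:
    # Hypothetical sketch: a query strategy as a configurable object composing
    # a scorer with a selector and an optional cold start handler.
    def __init__(self, scorer, selector, cold_start_handler=None):
        self.scorer = scorer                    # e.g. classifier_entropy
        self.selector = selector                # e.g. multi_argmax
        self.cold_start_handler = cold_start_handler

    def __call__(self, classifier, X, n_instances=1):
        # Fall back to the cold start handler while no training data exists.
        if getattr(classifier, 'X_training', None) is None and self.cold_start_handler:
            return self.cold_start_handler(X, n_instances)
        utility = self.scorer(classifier, X)
        query_idx = self.selector(utility, n_instances=n_instances)
        return query_idx, X[query_idx]
```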
