
misclassified cases #5

Open

tommens opened this issue Aug 31, 2020 · 8 comments

@tommens
Contributor

tommens commented Aug 31, 2020

In those rare cases where bodega misclassifies a human as a bot, or a bot as a human, it would be nice to have a way to make bodega aware of this, so that the tool does not keep reporting the same error over and over again. I do not know what the best way to achieve this would be, but I can see different possibilities: whenever a misclassification is found, it could be recorded as such (in a file with a specific format and filename), and when the tool is run, it checks that file for known misclassifications. It would then be up to the user of bodega to decide whether to include the misclassified accounts when re-running bodega.
Where should such a file be stored? Different solutions can be envisioned:
(1) On the bodega GitHub repository itself, we could have a file containing all known misclassifications (i.e. all cases, reported to us and verified by us, of accounts that were misclassified when running bodega). When running bodega, this file can be consulted to report the correct classification of the account.
(2) On the GitHub repository that is being analysed by bodega. Again, when running bodega, this file can be consulted to report the correct classification of the account.
(3) In the directory of the user that is actually running the analysis with bodega (e.g. if that user does not have write access and therefore cannot use solution (2), or does not want to share the misclassification for whatever reason).
If we want to combine these multiple solutions, we should probably set a precedence order.
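To make the idea concrete, here is a minimal sketch of what such a lookup could look like (the file name, file format, precedence order and function names below are purely illustrative assumptions, none of this exists in bodega):

```python
# Illustrative sketch only, not part of bodega: look up manually corrected
# classifications ("overrides") before reporting the model's prediction.
# Assumed precedence: user file > analysed repository > bodega's own list.
import json
from pathlib import Path

def load_overrides(bodega_dir: Path, repo_dir: Path, user_dir: Path) -> dict:
    """Merge override files from the three locations; later entries win,
    so they are read from lowest to highest precedence."""
    overrides = {}
    for location in (bodega_dir, repo_dir, user_dir):  # solutions (1), (2), (3)
        path = location / "bodega-overrides.json"      # hypothetical filename
        if path.exists():
            # Assumed format: {"login": "Bot" | "Human", ...}
            overrides.update(json.loads(path.read_text()))
    return overrides

def final_classification(login: str, predicted: str, overrides: dict) -> str:
    """Return the manually corrected class if one is recorded,
    otherwise the model's prediction."""
    return overrides.get(login, predicted)
```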

@mehdigolzadeh
Owner

I didn't understand the second option, and the third option can only improve the results for that specific user. But the first one seems feasible. Could you please elaborate on this idea a bit more?

@AlexandreDecan
Collaborator

I think it's better to ask users (or to have a semi-automatic way to do it) to report those misclassified cases, so we can add them to the training set and release a new version of the tool with an improved model.

From a reusability point of view, it's better to improve the model than to maintain a list of "edge cases" whose target class overrides the model's prediction.
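To illustrate the idea, here is a generic sketch of retraining with corrected labels (this is not bodega's actual training pipeline; the model family and the feature/label arrays are assumptions):

```python
# Generic illustration, NOT bodega's actual training code: append confirmed
# misclassifications with their corrected labels and refit the classifier
# before releasing a new version of the model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # assumed model family

def retrain_with_corrections(X_train, y_train, X_corrected, y_corrected):
    """X_corrected holds the feature vectors of the reported accounts,
    y_corrected their manually verified (corrected) classes."""
    X = np.vstack([X_train, X_corrected])
    y = np.concatenate([y_train, y_corrected])
    return RandomForestClassifier(random_state=0).fit(X, y)
```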

@tommens
Contributor Author

tommens commented Aug 31, 2020

We already mention in the README that users can report misclassified cases to us if they find any. To have a semi-automatic way, would it be feasible to add support to the tool itself for reporting misclassified cases to us? I do not immediately see how to do this.

@AlexandreDecan
Collaborator

Since an API key has to be provided, we could add a subcommand to report invalid cases (e.g. enter the usernames that are misclassified in a given repository) that automatically opens an issue in this repository listing them?

I'm not convinced we need something like this, since we can simply ask/expect/hope users to report misclassified cases "manually".
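For what it's worth, opening such an issue with the token the user already provides would only take a few lines. A minimal sketch (the requests library, the repository path and the issue text below are assumptions, not existing bodega code):

```python
# Minimal sketch (not part of bodega): open a GitHub issue listing
# misclassified accounts, reusing the API key the user already provides.
import requests

BODEGA_REPO = "owner/bodega"  # placeholder repository path

def report_misclassifications(token: str, accounts: list[str], analysed_repo: str) -> str:
    payload = {
        "title": f"Misclassified accounts in {analysed_repo}",
        "body": "The following accounts seem to be misclassified:\n"
                + "\n".join(f"- {a}" for a in accounts),
    }
    response = requests.post(
        f"https://api.github.com/repos/{BODEGA_REPO}/issues",
        json=payload,
        headers={"Authorization": f"token {token}",
                 "Accept": "application/vnd.github.v3+json"},
    )
    response.raise_for_status()
    return response.json()["html_url"]  # URL of the newly created issue
```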

@tommens
Contributor Author

tommens commented Sep 1, 2020

It is probably too optimistic to think that people will report misclassified cases manually, just because it is mentioned in our README.
I think that any support that helps automate the process will reduce the workload, both for the user who wants to report the misclassification and for us when keeping track of reported misclassifications. Therefore, if it is possible and not too difficult to implement such a reporting scheme as part of the tool, one that automatically opens an issue on the bodega GitHub repository, that could be a nice solution.

@AlexandreDecan
Collaborator

Any built-in possibility to report misclassified cases as GitHub issues will require a second execution of the tool (since it is not interactive, and it won't be, given that we want to keep it as a reusable CLI). Why is a second execution needed? Because we should be able to reproduce the example, so we need the exact set of comments that were considered by the model (or, at least, the exact set of features that were considered for that specific case).

One "easy" possibility to do so would be to add an extra "--report" flag, accepting a list of accounts that are misclassified, e.g., if the tool was run with bodega request/request --key <my token> --start-date 01-01-2017 --verbose (example taken from the readme), one could use bodega request/request --key <my token> --start-date 01-01-2017 --verbose --report greenkeeperio-bot hktalent for example to report automatically these two accounts as misclassified. This should create an issue in the bodega repository, with enough information for each account so that we can check and confirm the misclassification. I believe we only need the version of bodega that was used and, for each account (accounts do NOT have to be provided) a list of considered comments (that way, we can download them, compute the features, predict its class, and add the "opposite" class in the training set for this example, rebuild the model, and release a new version of bodega).

Btw, doing all of this manually could be very time-consuming for us, but if that turns out to be the case, we can still try to implement all these steps as part of a CI (e.g. let's dream of a bot we would develop that downloads the comments, computes the features and the prediction, and posts all of this in the corresponding issue, so that one of us can "confirm" the misclassified case by putting a "confirmed" label on the issue; the CI then rebuilds the model and pushes it to the repository, with an incremented version of bodega and a tag for the new release). But honestly, given the work all of this represents, I think it's too much for a "research tool" ;)

@AlexandreDecan
Collaborator

Note that we could ask a student to do this (e.g., as an M1 project).

@tommens
Contributor Author

tommens commented Sep 1, 2020

Yes, this looks like an interesting master student project to pursue. Let's try that. If you want, you can close this issue for now (or leave it open until we have a working implementation, but that may take quite a while).
