
Several bugs: Parameters not used, misclassifications, and other errors... #15

Closed
bockthom opened this issue Jan 15, 2021 · 6 comments

@bockthom
Contributor

Dear BoDeGHa team, thanks for this interesting tool for detecting bots on GitHub; your approach sounds really promising and helpful. However, I tried to run the tool on just a single project and already identified a bunch of issues:

(1) Misclassifications: First of all, I tried the tool with its default parameters on the GitHub project owncloud/core. As I am interested in the complete project history, I used --start-date 01-01-2015. This resulted in 20 bots and almost 600 humans. However, many of those 20 bots aren't bots -- just checking a few issues or pull requests shows that they are not bots. I won't paste their names here (for data privacy reasons), but if you run your tool with the same parameters yourself, you will be able to verify this. I assume that they either have opened many pull requests following the pull-request template of owncloud, or they have only few comments (e.g., having 22 comments, some of them pull-request templates, may not be enough and may lead to a misclassification). But also for some of the other misclassified accounts, it might be a problem to look at only 100 comments and ignore all earlier ones, which leads to the next issue...

(2) In a second step, I tried to increase the number of comments analyzed per account to circumvent the problems listed above. To be precise, I wanted to set maxComments to 2000 (as I think 2000 might be an appropriate number to get rid of the accounts that are classified as bots but actually are humans). However, there are several bugs in your implementation:

  • From your README file it is not clear whether the parameter is called --maxComments or --maxComment (singular or plural), as the Usage section of your README contains both versions.
  • Independent of whether the parameter is called maxComments or maxComment, this parameter is never used in your code: the parameters minComments and maxComments are passed to your process_comments function, but within this function they are not used and are thus useless. Please fix this and pass them to the right places. Instead, I identified 5 lines in which the number 100 is hard-coded, so I guess all those 5 lines should make use of the maxComments parameter (the same holds for minComments); a rough sketch of what I mean follows below.
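To illustrate, here is only a rough sketch of what I would expect, not your actual code (the process_comments signature and the flag names are just my assumptions based on the README): the parsed values should be forwarded to wherever 100 and 10 are currently hard-coded.

```python
# Rough sketch, not BoDeGHa's actual code: the argument names and the
# process_comments signature are assumptions based on this issue report.
import argparse

def process_comments(comments_per_account, min_comments=10, max_comments=100):
    # Keep only accounts with at least min_comments comments and cap the
    # number of comments considered per account at max_comments, instead
    # of relying on a hard-coded 100.
    selected = {}
    for account, comments in comments_per_account.items():
        if len(comments) >= min_comments:
            selected[account] = comments[:max_comments]
    return selected

parser = argparse.ArgumentParser()
parser.add_argument('--minComments', type=int, default=10)
parser.add_argument('--maxComments', type=int, default=100)
args = parser.parse_args()

# The parsed values should actually reach the function, e.g.:
# result = process_comments(comments, args.minComments, args.maxComments)
```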

(3) To circumvent the bugs in your tool described above, I manually changed the hard-coded number 100 in your code to another number, at all 5 lines where 100 occurs. Unfortunately, whenever one of those 5 numbers is replaced by a number greater than 100, the tool runs into an error right at the beginning of downloading the comments (i.e., the error already occurs if just one of those 5 numbers is set to 101). I don't know what the problem is; maybe there are too many requests for one single API token then, but I don't think that is the cause, as reducing all the other numbers to very small values (e.g., 5) still produces the error as soon as one of them exceeds 100. So I am actually not sure what's going on when one of the numbers is greater than 100.

Please try to fix these issues to make your tool applicable in the wild 😉 The second issue should be easy to fix, but the third one, which is the most important to me, looks kind of strange. And in the end, I am still not sure how to treat the first one: as I could not try using 2000 comments, I don't know whether that would solve part (1) or not.

@mehdigolzadeh
Owner

Dear Thomas,
Thanks for considering our bot identification tool. I will have a look at the reported bugs as soon as possible and provide you with updates.

@tommens
Contributor

tommens commented Jan 15, 2021

@mehdigolzadeh, to start our analysis of the issues raised in this interesting and detailed issue report, could you share internally with the team (not through GitHub, obviously, for the usual privacy considerations that were also raised by Thomas) the output of running the latest version of BoDeGHa on the owncloud/core project? I suggest we arrange a group meeting sometime next week so that we can look at all of this in detail together. Let's take the rest of the discussion offline in order not to clutter this GitHub discussion until we come up with clarifications.

@mehdigolzadeh
Owner

Dear @bockthom,

We would like to thank you again for considering our tool for use in your project and for reporting these issues. We executed BoDeGHa on the GitHub project you mentioned (i.e., owncloud/core) and investigated the issues you raised. Here are our answers to the reported issues:

(1) Regarding the misclassifications: To start with, it is expected that the tool cannot classify everything correctly, as it is based on a classification model that is not 100% accurate; for more details about its accuracy on the basis of a ground-truth dataset, we refer to the accompanying research publication mentioned in the README file. On some projects there may be more misclassifications than on others; it really depends on the types of accounts and comments (details about the reasons for misclassifications can again be found in the accompanying research article).
After running the tool on owncloud/core we identified 18 bots, of which 9 indeed corresponded to misclassified humans. We noticed that the main reason for these misclassifications is a combination of a low number of comments and the use of comment templates. We don't see any short-term solution for this issue; in the longer term, it would require developing a classification model that is template-aware, but this is less trivial than it seems. We are currently thinking about future solutions to reduce such types of misclassifications.

(2) Regarding the maxComments parameter: BoDeGHa uses a fixed cutoff of 100 comments, since the underlying classification model has been evaluated on a ground-truth dataset that was manually rated for not more than 100 comments per account. Since we did not study the model's performance on more than 100 comments, we preferred not to allow this in the tool. In a similar way, we have set a minimum threshold of 10 comments per account, since the classification model started to show good performance from 10 comments onwards. (For details, we refer again to the accompanying paper.)
We understand that, from the user's perspective, this may sound a bit confusing, and we realise that the README file of the BoDeGHa repository was not very specific about this. To address this issue, we will clarify the limitations imposed by the model in the README file.
At the same time, we will upload a new version of the tool that allows considering more than 100 comments per account, but in that case the user needs to be aware that there is no guarantee on performance or accuracy, given that the underlying classification model has never been evaluated on more than 100 comments. (Intuitively, we believe that the tool should continue to work fine for more than 100 comments, but the execution time may become significantly longer as the number of comments to be considered increases. It will be up to the user to decide whether this is acceptable.)

(3) The third issue you raised relates to restrictions imposed by GitHub's GraphQL API, which does not allow requesting more than 100 items per request. This is not a limitation of our tool itself, but of that API. There is nothing we can do about that, and this is the reason why the value 100 was hard-coded there. (Note that this value of 100 is different from the value 100 in point (2) above, which is probably one of the sources of your confusion.)
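To make this concrete, the following is a rough sketch (not BoDeGHa's actual implementation) of the kind of query loop the GraphQL API imposes: the first argument of a connection is capped at 100 by GitHub, so any larger number of comments has to be collected page by page via cursors.

```python
# Rough sketch, not BoDeGHa's implementation: GitHub's GraphQL API caps the
# `first` argument of a connection at 100, so fetching more items requires
# cursor-based pagination.
import requests

QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor) {  # 100 is the maximum allowed by the API
      pageInfo { hasNextPage endCursor }
      nodes { author { login } bodyText }
    }
  }
}
"""

def fetch_all_issues(owner, name, token):
    headers = {'Authorization': f'bearer {token}'}
    cursor, nodes = None, []
    while True:
        variables = {'owner': owner, 'name': name, 'cursor': cursor}
        reply = requests.post('https://api.github.com/graphql',
                              json={'query': QUERY, 'variables': variables},
                              headers=headers).json()
        page = reply['data']['repository']['issues']
        nodes.extend(page['nodes'])
        if not page['pageInfo']['hasNextPage']:
            return nodes
        cursor = page['pageInfo']['endCursor']
```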

We hope that we have clarified the issues. Stay tuned for an update of BoDeGHa coming soon, especially to address point (2) above. We hope that our tool will be able to play a beneficial role in your projects, and we always welcome any feedback or specific usage scenarios about why and how BoDeGHa is being used in practice. This may allow us to further adapt the tool (or its underlying classification model) to user needs.

@bockthom
Contributor Author

Dear @mehdigolzadeh, thanks for your fast replies and investigations. Let me just add a few comments:

(1) I am aware that the tool is not 100% accurate. However, as half of the detected bots in the investigated project are not bots, the tool's result (without any manual investigation afterwards) doesn't look reliable, which was the reason for trying other configuration parameters, just to see if the output would then be closer to my expectations. No offense, I just wanted to tweak the results at my own pace 😉 I had already read your paper before (to be precise, I found the tool through the paper, which I stumbled upon during a literature search on bot detection), so I am aware of potential misclassifications and the potential reasons for them. Nevertheless, I think there is room for improvement, and considering comment templates (or simply ignoring comments that make use of such a template) could be very beneficial for your classification model. Just take my comments on that as a motivation for future work 😉
As there are that many misclassifications caused by comment templates at the moment, as this one example showed, it seems that one needs a semi-manual approach when using your results, i.e., manually checking for each detected bot whether it really is a bot and, if not, removing it from the list of bots.

(2) I know that you evaluated the performance on at most 100 comments (at least, from what you have stated in the paper), but I would just like to give using more comments a try 😄 If this reduces the number of misclassified humans, I would be happy with it. Even if you don't evaluate your tool for more than 100 comments and state this limitation in the README, I would be glad if you could provide the possibility to use more than 100 comments (at one's own risk), just to be able to try this, without any guarantees. So, I am looking forward to that. And yes, the execution time will increase, but I am aware of that and it is no problem for me.

(3) I had already assumed that this could be an API restriction, but, as you mentioned, I was confused because the hard-coded value of 100 is different from the 100 in point (2). So I hope that it is nevertheless possible to use more than 100 comments in some way.

Thanks a lot for clarifying the issues. I am looking forward to your updates regarding point (2). (And as I think that your tool could be very beneficial, I also hope that you can improve your approach regarding comment templates in the long run.)

@tommens
Contributor

tommens commented Jan 20, 2021

Thanks @bockthom for your response. Concerning the need to take comment templates into account: one of the problems we have encountered, and as a consequence one of the reasons why we have not integrated this, is that it is very difficult to know whether a project is using templates and, if it is, which templates are being used. Moreover, even if templates are being used, there is no historical record of this. As a result, it becomes difficult to take templates into account during comment analysis. There are many different project-specific ways in which templates can be used: GitHub's own template mechanism, some external service or tool, and so on; the use of templates is not standardised. I am not sure what the best way to deal with this problem is. A semi-manual approach could be an option, but it would be very effort-intensive. Allowing users to provide comment templates as input to the tool could be another option, but it is difficult for us to evaluate whether this would lead to better results for our classification model. In fact, we would need a kind of ground-truth dataset of templates used by GitHub projects for their issue and PR comments. Based on such a dataset we could see what can be done. How to build such a dataset (which should be big enough) is another question.
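To give an idea of how limited even the easy part is: the following hypothetical sketch (the helper name and token handling are my own; it uses GitHub's REST contents endpoint) would only tell us whether a repository currently defines templates in the standard locations, and nothing about external services or how templates changed over time.

```python
# Hypothetical sketch: check whether a repository currently defines issue/PR
# templates via GitHub's own mechanism. This covers only the standard
# locations and says nothing about past template versions or external tools.
import requests

TEMPLATE_PATHS = [
    '.github/ISSUE_TEMPLATE',            # directory of issue templates
    '.github/PULL_REQUEST_TEMPLATE.md',
    'docs/PULL_REQUEST_TEMPLATE.md',
    'PULL_REQUEST_TEMPLATE.md',
]

def uses_github_templates(owner, name, token):
    headers = {'Authorization': f'token {token}'}
    for path in TEMPLATE_PATHS:
        url = f'https://api.github.com/repos/{owner}/{name}/contents/{path}'
        if requests.get(url, headers=headers).status_code == 200:
            return True
    return False
```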

@bockthom
Contributor Author

Thanks for your insights, I see that taking comment templates into account is way more complex than I had expected. I had only thought about the templates stored in a ".github/" directory of the project; I was not aware of other, non-standardised templates, and I also disregarded template changes during a project's evolution. I completely agree with you that all of this has to be taken care of properly, which is not easy to achieve and needs a lot of further investigation. Maybe it would be easier to completely ignore the first message of a pull request or issue, as templates usually affect only the initial comment (however, I am not sure whether this is actually true). Nevertheless, if I remember correctly, this would conflict with your performance evaluation regarding the consideration of empty comments, which would be lost when ignoring the initial comment. Anyway, these are just my thoughts on that, without deeper knowledge of those templates. Thanks for the detailed responses.
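P.S. Just to sketch the "ignore the initial message" idea (purely hypothetical; the data layout is only my assumption, as I don't know how your comment data is structured internally):

```python
# Hypothetical sketch of the idea: drop the initial message of each issue/PR
# before classification, since templates usually only affect that first post.
# Assumes comments are stored per issue, ordered by creation date.
def drop_initial_messages(comments_per_issue):
    return {issue: comments[1:] for issue, comments in comments_per_issue.items()}
```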
