Several bugs: Parameters not used, misclassifications, and other errors... #15
Comments
Dear Thomas,
@mehdigolzadeh, to start our analysis of the issues raised in this interesting and detailed issue report, could you share internally with the team (not through GitHub, obviously, for the usual privacy considerations that were also raised by Thomas) the output of running the latest version of BoDeGHa on the owncloud/core project? This will allow us to start analysing the issues raised. I suggest we arrange a group meeting sometime next week so that we can look at all of this in detail together. Let's take the rest of the discussion offline, in order not to clutter this GitHub discussion, until we come up with clarifications.
Dear @bockthom, We would like to thank you again for considering our tool for use in your project and for reporting these issues. We executed BoDeGHa on the GitHub project you mentioned (i.e., owncloud/core) and investigated the issues you raised. Here are our answers to the reported issues:

(1) Regarding the misclassifications: It is expected that the tool cannot classify everything correctly, as it is based on a classification model that is not 100% accurate; for details about its accuracy on a ground-truth dataset, we refer to the accompanying research publication mentioned in the README file. Some projects may show more misclassifications than others; it really depends on the types of accounts and comments (the reasons for misclassifications are again discussed in the accompanying research article).

(2) Regarding the maxComments parameter: BoDeGHa uses a fixed cutoff of 100 comments because the underlying classification model was evaluated on a ground-truth dataset that was manually rated for at most 100 comments per account. Since we did not study the model's performance on more than 100 comments, we preferred not to allow this in the tool. Similarly, we set a minimum threshold of 10 comments per account, since the classification model only starts to show good performance from 10 comments onwards. (For details, we refer again to the accompanying paper.)

(3) The third issue relates to restrictions imposed by GitHub's GraphQL API, which does not allow retrieving more than 100 items in a single request. This is not a limitation of our tool itself but of that API; there is nothing we can do about it, which is why that value of 100 was hard-coded. (Note that this value of 100 is different from the value 100 in point (2) above, which is probably one source of your confusion.)

We hope that we have clarified the issues. Stay tuned for an update of BoDeGHa coming up soon, especially to address point (2) above. We hope that our tool will play a beneficial role in your projects, and we always welcome feedback or specific usage scenarios describing why and how BoDeGHa is being used in practice. This may allow us to further adapt the tool (or its underlying classification model) to user needs.
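The 100-item limit described in point (3) is a per-request page size, so larger totals are typically collected by cursor-based pagination rather than by raising the limit. The sketch below illustrates this pattern only; the function name and the `fetch_page` callback are hypothetical, not BoDeGHa's actual code:

```python
def fetch_all_comments(fetch_page, max_comments=100):
    """Collect comments across pages of at most 100 items each.

    `fetch_page(cursor)` must return (items, next_cursor), with
    next_cursor None when there are no further pages. GitHub's GraphQL
    API rejects `first:` values above 100, so totals beyond 100 require
    walking multiple pages via the `endCursor` of each result.
    """
    comments, cursor = [], None
    while len(comments) < max_comments:
        items, cursor = fetch_page(cursor)
        comments.extend(items)
        if cursor is None:  # no more pages available
            break
    return comments[:max_comments]
```

With such a loop, the page size stays at the API-imposed 100 while the comment cutoff becomes an independent, user-configurable value.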
Dear @mehdigolzadeh, thanks for your fast replies and investigations. Let me just add a few comments:

(1) I am aware that the tool is not 100% accurate. However, as half of the detected bots in the investigated project are not bots, the tool's result (without any manual investigation afterwards) doesn't look reliable, which was the reason for trying other configuration parameters, just to see if the output then comes closer to my expectations. No offense, I just wanted to tweak the results myself. 😉 I had already read your paper (to be precise, I found the tool through the paper, which I stumbled upon during literature research on bot detection), so I am aware of potential misclassifications and their potential reasons. Nevertheless, I think there is room for improvement, and considering comment templates (or just ignoring comments that use such a template) could be very beneficial for your classification model. Just take my comments on that as motivation for future work 😉

(2) I know that you evaluated the performance only on up to 100 comments (at least, from what you state in the paper), but I would just like to give it a try with more comments 😄 If this reduces the number of misclassified humans, I would be happy with it. Even if you don't evaluate your tool on more than 100 comments and state this limitation in the README, I would be glad if you could provide the possibility to use more than 100 comments (at one's own risk), just to be able to try this, without any guarantees. So, I am looking forward to that. And yes, the execution time will increase, but I am aware of that and it is no bother for me.

(3) I already assumed that this could be an API restriction, but, as you mentioned, I am/was confused that the hard-coded value of 100 is different from the 100 in point (2). So, I hope that it is nonetheless possible to use more than 100 comments in some way.

Thanks a lot for clarifying the issues. I am looking forward to your updates regarding point (2) soon. (And as I think that your tool could be very beneficial, I also hope that you can improve your approach regarding comment templates in the long run.)
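The "more than 100 comments at one's own risk" request in point (2) could, for illustration, be handled by warning instead of rejecting out-of-range values. This is a hypothetical sketch of such an argument parser, not BoDeGHa's actual CLI; the function name and the exact limits are illustrative:

```python
import argparse
import warnings

def parse_max_comments(value, evaluated_limit=100):
    """Parse a --maxComments-style flag, warning (rather than failing)
    above the range the classification model was evaluated on."""
    n = int(value)
    if n < 10:
        # below the minimum the model was shown to perform well on
        raise argparse.ArgumentTypeError(
            "the classification model needs at least 10 comments per account")
    if n > evaluated_limit:
        warnings.warn(
            f"maxComments={n} exceeds the {evaluated_limit}-comment range "
            "the model was evaluated on; results come without guarantees")
    return n
```

Wired in via `parser.add_argument('--maxComments', type=parse_max_comments)`, this keeps the documented default while leaving the door open for experimentation.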
Thanks @bockthom for your response. Concerning the need to take comment templates into account: one of the problems we have encountered, and consequently one of the reasons why we have not integrated this, is that it is very difficult to know whether a project uses templates and, if it does, which templates are being used. Moreover, if templates are used, there is no historical record of them, which makes it difficult to take them into account during comment analysis. There are many different project-specific ways in which templates can be used: GitHub's own template mechanism, an external service or tool, and so on; the use of templates is not standardised. I am not sure what the best way to deal with this problem is. A semi-manual approach could be an option, but it would be very effort-intensive. Allowing comment templates to be provided as input to the tool could be another option, but it is difficult for us to evaluate whether this would lead to better results for our classification model. In fact, we would need a kind of ground-truth dataset of templates used by GitHub projects for their issue and PR comments; based on such a dataset, we could try to see what can be done. How to build such a dataset (which should be big enough) is another matter.
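The "templates as input" option mentioned above could, in principle, be prototyped with a simple similarity filter that drops comments close to any user-supplied template. This is only a sketch; the function name, the similarity measure, and the threshold are illustrative choices, not an evaluated design:

```python
from difflib import SequenceMatcher

def drop_templated(comments, templates, threshold=0.8):
    """Filter out comments that closely match a known template.

    A comment is considered templated if its overall similarity to any
    template (SequenceMatcher ratio, case-insensitive) meets the
    threshold. Comments that merely fill in a template still score high
    because the template body dominates the text.
    """
    def is_templated(text):
        return any(
            SequenceMatcher(None, text.lower(), t.lower()).ratio() >= threshold
            for t in templates)
    return [c for c in comments if not is_templated(c)]
```

Whether removing such comments actually improves the classification model would of course need the kind of ground-truth evaluation discussed above.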
Thanks for your insights; I see that taking comment templates into account is far more complex than I had expected. I had only thought about the templates stored in a project's ".github/" directory; I was not aware of other, non-standardized templates, and I also disregarded template changes during a project's evolution. I completely agree with you that all of this has to be handled properly, which is not easy to achieve and needs a lot of further investigation. Maybe it would be easier to completely ignore the first message of a pull request or issue, as templates usually affect the initial comment (though I am not sure whether this is actually true). Nevertheless, if I remember correctly, this would conflict with your performance evaluation regarding the consideration of empty comments, which would be neglected when ignoring the initial comment. Anyway, these are just my thoughts on the matter, without deeper knowledge of those templates. Thanks for the detailed responses.
Dear BoDeGHa team, thanks for this interesting tool to detect bots on GitHub; your approach sounds really promising and helpful. I tried to run the tool on a single project and identified a bunch of issues:

(1) Misclassifications: First of all, I tried the tool with its default parameters on the GitHub project owncloud/core. As I am interested in the complete project history, I used `--start-date 01-01-2015`. This resulted in 20 bots and almost 600 humans. However, many of those 20 bots aren't bots: just checking a few issues or pull requests shows that they are not bots. I won't paste their names here (for data privacy reasons), but if you run your tool with the same parameters yourself, you will be able to verify that. I assume that they either have opened many pull requests following the pull-request template of owncloud, or they have only few comments (e.g., having 22 comments, some of them pull-request templates, may not be enough and may lead to the misclassification). But also for some of the other misclassified accounts, it might be a problem to look at only 100 comments and ignore all previous ones, which leads to the next issue...

(2) In a second step, I tried to increase the number of comments analyzed per account to circumvent the problems listed above. To be precise, I wanted to set `maxComments` to 2000 (as I think 2000 might be an appropriate number to get rid of the accounts classified as bots which actually are humans). However, there are several bugs in your implementation: It is unclear whether the parameter is `--maxComments` or `--maxComment` (singular or plural), as the Usage section of your README contains both versions. And no matter whether one uses `maxComments` or `maxComment`, this parameter is never used in your code: the parameters `minComments` and `maxComments` are passed to your `process_comments` function, but within this function they are not used, and thus useless. Please fix this to pass them to the right places. Instead, I identified 5 lines in which the number 100 is hard-coded, so I guess all those 5 lines should make use of the `maxComments` parameter. (The same holds for `minComments`.)

(3) To circumvent the bugs in your tool described above, I manually changed the hard-coded number 100 in your code to another number, at all 5 lines where 100 occurs. Unfortunately, whenever one of those 5 numbers is replaced by a number greater than 100, the tool runs into an error right at the beginning of downloading the comments (i.e., the error already occurs if just one of those 5 numbers is set to 101). I don't know what the problem is; maybe there are too many requests for one single API token then, but I don't think that is the problem, as reducing all the other numbers to very small values (e.g., 5) still produces the error if one number exceeds 100. So I am actually not sure what's going on when there is one number greater than 100.
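The bug described in point (2) amounts to a limit that is hard-coded at the call sites instead of being threaded through from the command line. A minimal sketch of what the fix could look like; the function signature is illustrative and is not BoDeGHa's actual implementation:

```python
def process_comments(comments, min_comments=10, max_comments=100):
    """Select an account's comments for classification.

    The buggy pattern would be `selected = comments[:100]` with a
    literal 100; here the limits come from the function's own
    arguments, so flags like --maxComments actually take effect.
    """
    if len(comments) < min_comments:
        return None  # too few comments to classify reliably
    return comments[:max_comments]
```

The same substitution would need to be applied at every one of the five call sites where the literal 100 currently appears.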
Please try to fix these issues to make your tool applicable in the wild 😉 The second issue should be easy to fix, but the third one, which is the most important to me, looks kind of strange. And in the end, I am still not sure how to treat the first one, as I could not try using 2000 comments, so I don't know whether that would solve part (1) or not.