Improving accuracy of citation counts #114
Dear Jonathan, thank you very much for this thoughtful suggestion, as well as your offer to help - I really appreciate it! I think your suggestion is certainly worth considering; let me start with some explanation of the status quo, and then turn to your suggestion.

You rightly point out that the current approach on atomistic.software of searching for both the name of a key author (or authors) and the name of the code excludes certain mentions of the code. This can happen, for example, when the name of the code author was not mentioned (no traditional "citation"), or when the reference section of the article was not indexed (e.g. the code is mentioned in the title or abstract but the full-text index is missing). This approach was inherited from the first static incarnation of the list and, I presume, was a natural evolution from wanting to count traditional citations.

Now, turning to your suggestion.
Could you perhaps elaborate a bit on what makes you confident in the premise of a constant false-positive rate? While your two data points for the case of MOPAC seem to match it, I can't come up with a fundamental reason why the growth rate of code "X" and of the term "X" in the rest of Google Scholar can be assumed to correlate more generally. I don't know the origin of the false positives in your MOPAC search, but for codes that are named after words from day-to-day language (WEST, exciting, ...), or perhaps even more so for those named after somewhat rare words (fleur, Amber, ORCA), I can certainly imagine a sudden uptick of false positives because the word appears in some other context that is becoming a hot topic of research. Or, conversely, a strong relative reduction of false positives because the code is young and its user base is growing much faster than the use of the search term in Google Scholar (citations of MOPAC have remained relatively stable over the last decade, while for, say, ORCA, annual citations have grown by one order of magnitude).

Secondly, one of the big advantages of using Google Scholar is that it allows atomistic.software to provide direct links to the search results (no paywall), so that users can browse the current research done with a code. I would consider it a significant downside to have those results littered with false positives resulting from a more general query.

Given this analysis, I am initially skeptical of implementing this idea - but I certainly remain open to discussion. At the same time, I am very interested in reducing false negatives in the current approach - a factor of 2 seems high to me, and not very satisfactory. So far, this has been done only for a couple of codes in the list on an "as needed" basis (e.g. for MOLCAS [1]), and I would very much appreciate help in rolling it out more broadly.

[1] #29
I appreciate the thoughtful reply, and I will reply in kind. You acknowledge an obvious problem: with the tools at hand, we cannot track the citations of all atomistic simulation codes with equal accuracy (for various reasons). While it is perhaps an oversimplification, I've attempted to quantify accuracy in my own citation-tracking efforts by taking a statistical perspective and assigning a false-positive rate (which I could estimate reasonably accurately) to the data I had available.

I should also clarify that while I used a time-invariant Google Scholar search string and assigned a time-invariant false-positive rate, there were time-specific factors that led to those choices. For example, MOPAC, like Gaussian and a few other codes, is very often cited with a year in the name (e.g. MOPAC2016), which is a complicating factor that doesn't affect much older citations of MOPAC. Certainly, the citation-gathering/quantifying process will need to change with time to maintain accuracy and relevance - if new sources of either false positives or false negatives emerge, then the search criteria need to be adjusted accordingly, and past citation counts may even change somewhat.

I understand your desire to keep the false-positive rate relatively low, as people might want to browse through lists of papers citing these codes (and you provide a link to do so). I tolerated a somewhat high false-positive rate in my own efforts because I was rescaling the data accordingly, but it is similarly reasonable to report raw numbers and cap the tolerable false-positive rate at a relatively small value (although this removes some flexibility from citation gathering).

Whether or not you decide to track or utilize a false-positive rate (to rescale the raw numbers) on this website, I think developers should take more responsibility for how their codes are perceived and more actively protect their work and scientific interests. Thus, while I do have some interest in better citation tracking of other codes, it is a much higher priority for me that MOPAC's citations are not severely undercounted. To that end, I've adjusted my own citation tracking of MOPAC to reduce the false-positive rate to a more acceptable number, by updating my Google Scholar search string.
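To give a sense of the shape of the updated string (the version terms below are illustrative assumptions on my part, not the actual query), it is essentially a long disjunction of exact phrases:

```python
# Illustrative only: a version-aware Google Scholar query built as a
# disjunction of exact phrases. The version list is an assumption,
# not the real search string.
versions = ["MOPAC2016", "MOPAC2012", "MOPAC2009", "MOPAC 93", "MOPAC 7"]
query = " OR ".join(f'"{v}"' for v in versions)
print(query)  # "MOPAC2016" OR "MOPAC2012" OR "MOPAC2009" OR "MOPAC 93" OR "MOPAC 7"
```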
This was the best I could do, because there is a limit to the length of a Google Scholar search string. Also, you should add the corresponding Google Scholar URL command.

My own estimate of a false-positive rate was targeting specifically papers with DOIs, and there will still be a nontrivial rate, even with this optimized search string, since Google Scholar indexes books, theses, conference proceedings, preprints, technical reports, and other research products that aren't scientific papers. However, with the more generous target of Google Scholar entries legitimately referring to the computer program MOPAC, my estimate of the false-positive rate is now 4% (estimated from 5 random samples per year between 1990 and 2020).

I think a simple yet fair course of action is to keep things as they are and accept revisions to your Google Scholar search strings that increase citation counts (thus apparently reducing false-negative rates) without causing a substantial false-positive rate. Developers should know better than anyone else how their codes are cited, and their expertise should be a welcome contribution to this website.
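As a minimal sketch of what I mean by the URL command (assuming, for illustration, that the relevant parameter is `as_vis`, which controls whether citation-only entries are shown; the query value here is a placeholder):

```python
from urllib.parse import urlencode

# Sketch: build a Google Scholar link for one year with citation-only
# entries hidden. The query value is a placeholder, not the real string.
params = {
    "q": '"MOPAC2016"',  # placeholder query
    "as_ylo": 2016,      # lower bound on publication year
    "as_yhi": 2016,      # upper bound on publication year
    "as_vis": 1,         # 1 = hide "[CITATION]" entries with no indexed document
}
print("https://scholar.google.com/scholar?" + urlencode(params))
```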
Thank you, Jonathan, for the detailed reply!
I agree. The main downside I see is the increasing complexity of the search strings (I hope this will not create more maintenance work down the line), but I fully understand that absolute citation counts are important to the code developers, and I guess that is the price to be paid. In the process, we may temporarily introduce some imbalance between the citation counts of codes whose search strings have already been updated and those that haven't been updated yet, but I guess we can live with that as well.

On the topic of requiring the author names to be part of the search string: I agree that the search string should mirror how the code is usually cited, so there can be valid exceptions to this rule of thumb as long as the false-positive rate remains small - say, below 10%. There may be some edge cases where one needs to be careful, like codes with >1000 citations (there certainly is some "order" to search results, i.e. the false-positive rate is not necessarily constant across results), but overall I am confident that the search strings can be improved in this way across the board.

Eventually, we may want to actively solicit input on the search strings from developers, e.g. by contacting them directly (not all at once, but one after the other). So far I have been hesitant to do this, since this can be a sensitive topic and I know how busy developers are, but as atomistic.software becomes more widely used, the cost/benefit analysis may change here.
Is that what this field does? In a few spot checks I did, there was either no difference in the citation count or only a very minor difference, so I guess the decision of whether or not to leave those entries in is not very consequential (unless you have made observations to the contrary). Since we have come to agree on how to move forward, I will close this issue.
The field toggles whether Google Scholar's "citation" entries (the ones marked [CITATION], which have no indexed document of their own) are included in the results. I excluded them from my counts because, when I sampled them, they didn't correspond to other papers.
Thanks for the clarification!
I still didn't quite get what you mean by "they didn't correspond to other papers", could you just elaborate on this last bit? I would like to understand this better, e.g. in case we ever feel the need to extend the citation period further back into the past. From the Google Scholar FAQ I would have concluded that Google Scholar came across these citations while processing the full-text archives of some journals, for which no corresponding online resources exist.
Well, an example with a large number of these would be the citations to MOPAC in 1993, the year of its final early open-source release (MOPAC 7) and its first commercial release (MOPAC93). There are clearly numerous, very similar citations to MOPAC and/or its manual that are sprawled out as distinct "citation" entries in Google Scholar.

My rough guess is that when Google Scholar ingests the bibliographies of papers, it assumes that each entry is a citation to a research product (i.e. a paper, book, or conference proceeding) or a footnote. If it sees a year in an entry, it just assumes that the entry is a research product, and it either connects the entry to a known research product or citation entry from that year, or adds an empty citation entry for that year if there is no sufficiently close match. The "cited by" links for each citation entry let you see the papers that caused these erroneous entries to appear. For example, the top entry from that year with the most citations, "MOPAC 93.00 Manual", comes from bibliography entries of exactly that form.
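As a toy sketch of that guess (my reading of the behavior, not Google's actual algorithm; the matching threshold and the example strings are made up):

```python
import difflib

# Toy model of the guessed ingestion heuristic: match a bibliography
# entry against known records from the same year, otherwise create a
# new, empty "citation" stub for that year.
def ingest(entry: str, year: int, records: dict[int, list[str]]) -> str:
    candidates = records.get(year, [])
    match = difflib.get_close_matches(entry, candidates, n=1, cutoff=0.8)
    if match:
        return match[0]                      # linked to an existing record
    records.setdefault(year, []).append(entry)
    return entry                             # new citation-only stub

records: dict[int, list[str]] = {1993: ["MOPAC 93.00 Manual"]}
# Slightly different bibliography strings can miss the cutoff and spawn
# distinct stubs, matching the sprawl of near-duplicate entries above.
print(ingest("MOPAC 93 manual, Fujitsu Limited, Tokyo, Japan, 1993", 1993, records))
```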
There is an ongoing trend toward citing software directly in bibliographies rather than citing software release papers (still the more common practice), and this might cause these sorts of problems to reemerge.
Thanks, I think I finally get it now.
Yes, although if we manage to get the community to adopt the FORCE11 software citation principles, in particular the use of unique, persistent identifiers, I think it can also be an improvement upon the status quo. E.g. code developers could get statistics about which versions of their software are being used. I like, for example, the implementation by Zenodo, where a piece of software has both a "concept DOI" and a separate DOI for each version (Figure 9 in the accompanying article to atomistic.software).
I am maintaining MOPAC, and as part of this activity, I've been exploring how to track its citations more accurately. My best attempt at this using Google Scholar data is: https://openmopac.github.io/_images/plot.pdf . There are two things to note: my estimates of citation counts for MOPAC are about twice what is in your database, and I have error bars.
My basic methodology has been to make the search criteria as broad as possible and then to estimate a false-positive rate by sampling the data by hand. The premise is that the false positives of an overly generous search are easier to quantify than the false negatives of an overly narrow search. In MOPAC's case, I estimated false-positive rates for two different, well-separated periods of time and found them to be very similar. Thus, I suspect that it is safe to use a constant false-positive rate for each code, but each code probably has a different false-positive rate, because some codes are more successful at inducing standardized citations than others.
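As a minimal sketch of the arithmetic I have in mind (my own simplification, assuming a constant rate estimated from a hand-checked sample; the numbers in the example are made up):

```python
import math

# Sketch: rescale a raw hit count by an estimated false-positive rate p,
# with a simple binomial standard error from the hand-checked sample.
def corrected_count(raw_hits: int, false_pos: int, sampled: int) -> tuple[float, float]:
    p = false_pos / sampled                                # estimated false-positive rate
    corrected = raw_hits * (1 - p)                         # rescaled citation count
    stderr = raw_hits * math.sqrt(p * (1 - p) / sampled)   # error bar on the rescaling
    return corrected, stderr

# e.g. 2000 raw hits with 6 false positives among 150 checked samples (4%)
print(corrected_count(2000, 6, 150))  # about (1920.0, 32.0)
```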
Would you be interested in adapting this sort of methodology for atomistic.software? It would require an entry in codes.json for the false-positive citation rate of each code, and you may need to adjust query_string somehow to accommodate more complicated nested logical expressions for the Google Scholar search strings. If you are willing to make the appropriate backend changes, then I'd be willing to go through the database entries to expand the search strings and estimate the false-positive rates.
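For concreteness, a hypothetical shape for such an entry, written as a Python dict for illustration (only query_string is a field actually discussed here; the added field name and the query text are my inventions, and the rate is the 4% MOPAC estimate from above):

```python
# Hypothetical codes.json entry with the proposed false-positive field.
entry = {
    "name": "MOPAC",
    "query_string": '"MOPAC2016" OR "MOPAC2012" OR ("MOPAC" "Stewart")',  # placeholder
    "false_positive_rate": 0.04,  # estimated by hand-sampling search hits
}
```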