Google Scholar crawling #130

Closed
genomematt opened this issue Jun 3, 2016 · 50 comments

@genomematt
Member

At the moment JOSS does not appear to be indexed by Google Scholar.
I think this is one of the sites where we want to ensure the journal is visible.

https://scholar.google.com.au/intl/en/scholar/inclusion.html#troubleshooting

@arfon
Member

arfon commented Jun 3, 2016

JOSS papers have included the Google Scholar metadata tags since this update about 10 days ago.

Their docs suggest it can take 4-6 weeks to be indexed so I think we're still in that window. Let's leave this open for now to keep tracking this issue. I agree we must be indexed by Google Scholar, I'm just not sure there's anything we need to do other than wait at this point 🕐

@pjotrp
Contributor

pjotrp commented Jul 15, 2016

Excellent!

@sherrillmix

At least a couple papers show up now:
https://scholar.google.com/scholar?hl=en&q=GeneNetwork%3A+framework+for+web-based+genetics
https://scholar.google.com/scholar?hl=en&q=R3D2%3A+Relativistic+Reactive+Riemann+problem+solver+for+Deflagrations+and+Detonations

However other articles on the same page currently do not:
https://scholar.google.com/scholar?q=Xenomapper%3A+Mapping+reads+in+a+mixed+species+context
https://scholar.google.com/scholar?q=pyuca%3A+a+Python+implementation+of+the+Unicode+Collation+Algorithm

Pretty cool that some are showing anyway and maybe google's getting around to processing the others.
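
For anyone who wants to repeat this check for other papers, the search links above follow a simple pattern. A minimal sketch, assuming only the `hl` and `q` query parameters visible in those links (the function name is made up for illustration):

```python
from urllib.parse import urlencode

SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def scholar_title_query(title: str, lang: str = "en") -> str:
    """Build a Google Scholar search URL for a paper title.

    Only the 'hl' and 'q' parameters seen in the links in this thread
    are used; anything else about the search endpoint is not assumed.
    """
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return f"{SCHOLAR_SEARCH}?{urlencode({'hl': lang, 'q': title})}"


if __name__ == "__main__":
    for title in (
        "GeneNetwork: framework for web-based genetics",
        "Xenomapper: Mapping reads in a mixed species context",
    ):
        print(scholar_title_query(title))
```

This only builds the link. Whether a hit comes from JOSS itself or from another host still has to be checked by eye in the results.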

@arfon
Member

arfon commented Sep 23, 2016

@sherrillmix, thanks for finding these. Let's keep monitoring this.

@FaustinCarter

It looks like Google Scholar is not actually indexing anything from JOSS. The two examples listed by @sherrillmix only show up in Google Scholar because the first is also listed at researchgate.net and the second is listed at eprints.soton.ac.uk. Google Scholar is indexing them based off those domains, not because of anything going on at JOSS.

@cMadan
Member

cMadan commented Dec 19, 2016

Maybe Google Scholar isn't crawling through the Github links? My JOSS paper is indexed, but that's from being uploaded to my own website (https://scholar.google.com/citations?view_op=view_citation&citation_for_view=2QJwoAwAAAAJ:3BvdIg-l-ZAC), but I feel like that's still sufficiently different than being on researchgate or a university's own repository.

@pjotrp
Contributor

pjotrp commented Dec 25, 2016

Maybe it is an idea to move the static pages off github anyway. I know people who do not want to publish with JOSS because of the tight github connection. I think it will be fine to use the issue tracker etc., but at least it won't look like JOSS being a github subsidiary.

@cMadan
Member

cMadan commented Dec 25, 2016

Github pages is more integrated now (no longer requires the gh-pages branch). It might be easier to switch to that now, rather than having the paper PDFs being loaded through the Github file preview.

@kyleniemeyer
Collaborator

OK, so I actually just submitted JOSS through Google Scholar's page for requesting indexing of a journal. This is the (automated) response I got:

Homepage: http://joss.theoj.org
Contact name: Kyle Niemeyer
Contact email: kyle.niemeyer@gmail.com
Inclusion type: Other journal website
Inclusion size: 51-100
Volume URLs: http://joss.theoj.org/papers/popular
Issue URLs: http://joss.theoj.org/papers/popular
TOC URLs: http://joss.theoj.org/papers/popular
Abstract URLs: http://joss.theoj.org/papers/10.21105/joss.00194
http://joss.theoj.org/papers/10.21105/joss.00011
http://joss.theoj.org/papers/10.21105/joss.00189
Article URLs: https://github.com/openjournals/joss-papers/blob/master/joss.00194/10.21105.joss.00194.pdf
https://github.com/openjournals/joss-papers/blob/master/joss.00011/10.21105.joss.00011.pdf
https://github.com/openjournals/joss-papers/blob/master/joss.00189/10.21105.joss.00189.pdf
https://github.com/openjournals/joss-papers/blob/master/joss.00012/10.21105.joss.00012.pdf
https://github.com/openjournals/joss-papers/blob/master/joss.00016/10.21105.joss.00016.pdf
If your content meets our guidelines, you can generally expect to find it included within the Google Scholar results within 4-6 weeks.

Please keep in mind that bibliographic data is extracted from your pages by automatic software. If you aren’t satisfied with the accuracy of your listings, please refer to our technical guidelines at http://scholar.google.com/intl/en/scholar/inclusion.html for ways to provide more accurate bibliographic data.

Regards,

The Google Scholar team

I agree that having the paper PDFs linked to directly rather than the GitHub file preview may be smart—not sure if that will matter for Google Scholar.

@arfon
Member

arfon commented Mar 10, 2017

I agree that having the paper PDFs linked to directly rather than the GitHub file preview may be smart—not sure if that will matter for Google Scholar.

Just in case that's an issue, I've updated the URLs on the site to link to the 'raw' GitHub URLs, which means they don't display in the GitHub UI. An example of this is:

@wojdyr

wojdyr commented Jul 5, 2017

I used to have the same problem a couple years ago when I put reprints of my papers into a github repository. I waited more than a year and it was still not in Google Scholar. Then I moved PDFs to a repository served through github pages -- and this helped.

I cannot be sure what the issue was, but perhaps it's because PDFs from github.com/.../raw/master/... are served as Content-Type "application/octet-stream" instead of "application/pdf".

Additional benefit of serving PDFs through github pages would be that the URL would look better, e.g.
https://openjournals.github.io/joss-articles/10.21105.joss.00194.pdf
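
The Content-Type guess above is easy to test for any given URL. A rough sketch using only the Python standard library; the helper names are hypothetical, and it assumes the server answers HEAD requests:

```python
import urllib.request


def media_type(content_type_header: str) -> str:
    """Return the bare media type from a Content-Type header value."""
    # Drop parameters such as '; charset=...' and normalise case.
    return content_type_header.split(";", 1)[0].strip().lower()


def served_as_pdf(content_type_header: str) -> bool:
    """True if the header declares the body as application/pdf."""
    return media_type(content_type_header) == "application/pdf"


def fetch_content_type(url: str, timeout: float = 10.0) -> str:
    """Issue a HEAD request and return the Content-Type header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


if __name__ == "__main__":
    # The two header values mentioned in the comment above.
    for header in ("application/octet-stream", "application/pdf"):
        print(header, "->", served_as_pdf(header))
```

Calling `served_as_pdf(fetch_content_type(url))` on a paper URL would show which of the two types a crawler receives. It does not prove that the header is what stops Google Scholar.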

@arfon
Member

arfon commented Jul 6, 2017

I used to have the same problem a couple years ago when I put reprints of my papers into a github repository. I waited more than a year and it was still not in Google Scholar. Then I moved PDFs to a repository served through github pages -- and this helped.

👍 thanks @wojdyr, that's very helpful. Good point about the application/octet-stream content type possibly upsetting the Google bot.

@arfon
Member

arfon commented Sep 22, 2017

👍 thanks @wojdyr, that's very helpful. Good point about the application/octet-stream content type possibly upsetting the Google bot.

OK in openjournals/whedon#11 I've modified the URLs we're serving to e.g. http://www.theoj.org/joss-papers/joss.00411/10.21105.joss.00411.pdf . Fingers-crossed that helps!
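
For reference, the three URL forms that have come up in this thread (GitHub file preview, 'raw', and the theoj.org path) can be related with a small sketch. The mapping is inferred from the example URLs posted here, not from the actual change in openjournals/whedon#11, and the function names are hypothetical:

```python
from urllib.parse import urlsplit


def _blob_segments(blob_url: str) -> list:
    """Split a github.com file-preview URL into its path segments."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    # Assumption: the branch name contains no '/' (true for 'master').
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file-preview URL: {blob_url}")
    return segments


def blob_to_raw(blob_url: str) -> str:
    """Swap '/blob/' for '/raw/' so the file is served outside the GitHub UI."""
    segments = _blob_segments(blob_url)
    segments[2] = "raw"
    return "https://github.com/" + "/".join(segments)


def blob_to_theoj(blob_url: str, base: str = "http://www.theoj.org/joss-papers") -> str:
    """Map the in-repo path onto the theoj.org prefix seen in this thread."""
    segments = _blob_segments(blob_url)
    return base + "/" + "/".join(segments[4:])


if __name__ == "__main__":
    url = ("https://github.com/openjournals/joss-papers/blob/master/"
           "joss.00194/10.21105.joss.00194.pdf")
    print(blob_to_raw(url))
    print(blob_to_theoj(url))
```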

@FaustinCarter

My paper is now indexed on Google Scholar! Others should check as well. That last modification might have done the trick.

@arfon
Member

arfon commented Oct 7, 2017

Which paper is that @FaustinCarter?

@FaustinCarter

FaustinCarter commented Oct 7, 2017

Actually, my excitement may have been premature. It looks like it may only be indexed because it was listed as a citation here (thanks to the kind soul who cited it in a more traditional publication): http://adsabs.harvard.edu//abs/2016JOSS.2016...46B.

The JOSS link is: http://joss.theoj.org/papers/10.21105/joss.00046

If you search for "pygtc" on http://scholar.google.com it shows up as the first link, but with a [CITATION] tag preceding the title. I think this means that it is grabbing it from the adsabs rather than indexing it directly. This is further motivated by the fact that on both Google Scholar and the Harvard Adsabs service the abstract is listed as "Not available".

Bummer.

@arfon
Member

arfon commented Dec 3, 2017

I believe this is now fixed.

@arfon arfon closed this as completed Dec 3, 2017
@arfon
Member

arfon commented Dec 3, 2017

See https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=10.21105&btnG= for example.

@Benjamin-Lee
Contributor

Still a little confused as to why my paper isn't being listed. @arfon is there any reason you can think of?

@arfon
Member

arfon commented Nov 10, 2018

@Benjamin-Lee - not sure. I'm following up with some folks about this.

@FaustinCarter

Another paper that hasn't made it is: https://joss.theoj.org/papers/cf6f8ac309d6a18b6d6cf08b64aa3f62

@arfon
Member

arfon commented Nov 10, 2018

@FaustinCarter - yes, it looks like something stopped working in early August this year.

@eendebakpt

@arfon This paper (https://joss.theoj.org/papers/0c6638f84a1a574913ed7c6dd1051847) was indexed, but the date was not (yet) extracted. The format of the JOSS papers does not meet the specifications that Google Scholar used to have. The specifications have changed a little, but currently the date in JOSS papers is still not formatted as suggested in section 2.a.C of https://scholar.google.com/intl/en/scholar/inclusion.html#indexing.
This issue was closed some time ago, should we reopen a new issue?

@arfon
Member

arfon commented Jan 7, 2019

This issue was closed some time ago, should we reopen a new issue?

AFAIK, Google Scholar doesn't index us directly, rather, our papers are picked up via ADS. I'll follow up with the folks at ADS to see if there's something different we should be doing.

@leios

leios commented Jan 18, 2019

I was going to create a new issue for this, but seeing as there seems to be recent discussion on this thread, I would like to also mention that our article does not seem to be picked up by Google Scholar.

I don't know if this information is helpful, but the last article that seems to be picked up by theoj.org is this one (50 days ago): https://www.theoj.org/joss-papers/joss.01102/10.21105.joss.01102.pdf.

The last one picked up by ads is this one (147 days ago): http://adsabs.harvard.edu/abs/2018JOSS....3..854M

It seems indexing via Google Scholar stopped roughly 50 days ago, but there doesn't seem to be any PR around that time that should affect this. I suppose a solution for now would be to upload my article to an institutional repository of my university?

@kgjerde

kgjerde commented Jul 2, 2019

@arfon

I wonder if this might be the problem:

  1. Published JOSS paper pages now seem to leave the Google Scholar tag "citation_author" empty:
    <meta name="citation_author" content="">.
    Example: https://joss.theoj.org/papers/10.21105/joss.01342

  2. Google Scholar (https://scholar.google.com/intl/en/scholar/inclusion.html#indexing) states that:

At least one author tag is required for inclusion in Google Scholar.

The time frame of the problem, as stated by you, supports this hypothesis:

@FaustinCarter - yes, it looks like something stopped working in early August this year.

Paper from 1 September 2018 with no author tag: https://joss.theoj.org/papers/efb9242db91adee8c8265f000f26ef5a

Paper from 29 June 2018 with author tag:
https://joss.theoj.org/papers/049f6d3dab9391e8353484028148dd0d
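
To check other paper pages for the same problem, something like the following could be run against the page HTML. It is a minimal sketch with the standard-library parser; the class and function names and the sample author are made up, and it only tests the one rule quoted above (at least one non-empty author tag):

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect (name, content) pairs from <meta name="citation_*"> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            content = (attributes.get("content") or "").strip()
            self.tags.append((name, content))


def citation_authors(html: str) -> list:
    """Return the non-empty citation_author values found in the page."""
    parser = CitationMetaParser()
    parser.feed(html)
    return [content for name, content in parser.tags
            if name == "citation_author" and content]


def has_required_author(html: str) -> bool:
    """True if the page carries at least one non-empty author tag."""
    return bool(citation_authors(html))


if __name__ == "__main__":
    # Hand-written samples, not copied from any JOSS page.
    empty = '<head><meta name="citation_author" content=""></head>'
    filled = '<head><meta name="citation_author" content="Ada Lovelace"></head>'
    print(has_required_author(empty), has_required_author(filled))
```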

@mschubert

mschubert commented Jul 27, 2019

The <meta name="citation_author" content=""> tag (marked "empty" in the lists below) is not enough to explain the missing indexing (marked red, #f03c15). I checked some of the papers below, and it doesn't explain why some are indexed directly from JOSS (marked green, #c5f015).

However, there's definitely still an issue with Google Scholar indexing. Maybe this issue should be reopened? @arfon

From 1 month ago:

Also 2 months ago:

  • #f03c15 QMRTools - not indexed empty
  • stingray - indexed, but also in arXiv empty
  • MPIFiles.jl - indexed, but also cited from the International Journal on Magnetic Particle Imaging empty
  • #c5f015 SMACT - actually indexed from theoj.org empty
  • #f03c15 pymcmcstat - only indexed as citation empty

Also 3 months ago:

  • HRDS - indexed, but from eprints.whiterose.ac.uk empty
  • anesthetic - indexed, but from arXiv empty
  • AMReX - indexed, but from cloudfront.escholarship.org empty
  • #c5f015 ggparliament - actually indexed from theoj.org empty
  • #f03c15 rGUIDANCE - only indexed as citation empty
  • #f03c15 uJVM - only indexed as citation empty

Also 6 months ago:

  • Multiblock PLS - indexed, but also on orbit.dtu.dk empty
  • fibergen - indexed, but also on researchgate empty
  • BladeX - indexed, but also on researchgate and arXiv empty

@arfon
Member

arfon commented Aug 4, 2019

Hi all, thanks for digging into this further. A couple of things:

  • I've just pushed an update to the site which should mean the <meta name="citation_author" content=""> tag is now populated for all papers.
  • I'm working with the folks at ADS to have JOSS indexed more regularly (daily) which should mean that papers start showing up on Google Scholar more rapidly.

I'm not sure what more to do at this point but am open to suggestions/improvements.

@pgrete

pgrete commented Nov 22, 2019

Is there any update on this, e.g., is there a way to manually update the indexing? (Specifically asking for https://joss.theoj.org/papers/10.21105/joss.01636) As a side note, I noticed that the "Copy bibtex" button doesn't seem to get updated automatically.

@labarba
Member

labarba commented Nov 22, 2019

I looked up that paper on Google Scholar, using the title, and in this case it is being indexed through adsabs.harvard.edu.

The citation info in Harvard style is given as:

Brummel-Smith, C., Bryan, G., Butsky, I., Corlies, L., Emerick, A., Forbes, J., Fujimoto, Y., Goldbaum, N., Grete, P., Hummels, C. and Kim, J.H., 2019. ENZO: An Adaptive Mesh Refinement Code for Astrophysics (Version 2.6). The Journal of Open Source Software, 4.

and the BibTeX is given as:

@article{brummel2019enzo,
  title={ENZO: An Adaptive Mesh Refinement Code for Astrophysics (Version 2.6)},
  author={Brummel-Smith, Corey and Bryan, Greg and Butsky, Iryna and Corlies, Lauren and Emerick, Andrew and Forbes, John and Fujimoto, Yusuke and Goldbaum, Nathan and Grete, Philipp and Hummels, Cameron and others},
  journal={The Journal of Open Source Software},
  volume={4},
  year={2019}
}

... which all seems OK. What is your concern about the indexing of this article?

@pgrete

pgrete commented Nov 22, 2019

I see. Looks like I got confused/missed it because neither authors nor references are parsed from ADS.
Regarding bibtex, when I click the button the content that ends up in my clipboard is "BibTex entry not available. Please check back later."

@labarba
Member

labarba commented Nov 22, 2019

Oh, I was grabbing the BibTeX info from the Google Scholar "Cite" dialog, not the JOSS website.

@arfon
Member

arfon commented Nov 22, 2019

Regarding bibtex, when I click the button the content that ends up in my clipboard is "BibTex entry not available. Please check back later."

Yeah, this is super-buggy and I've just removed it from the UI for now until we can find a long-term fix.

@sgbaird

sgbaird commented Aug 5, 2022

Any recommendations/updates in 2022?

@arfon
Member

arfon commented Aug 5, 2022

Any recommendations/updates in 2022?

On what in particular, sorry?

@sgbaird

sgbaird commented Aug 5, 2022

@arfon, sorry - for ensuring that a particular JOSS article gets indexed on Google Scholar for all co-authors.

@arfon
Member

arfon commented Aug 6, 2022

@arfon, sorry - for ensuring that a particular JOSS article gets indexed on Google Scholar for all co-authors.

Are you having issues with one of your papers being indexed? If so, could you describe the issue in more detail please?

@sgbaird

sgbaird commented Aug 6, 2022

@arfon looks like it went through OK. Just required some more patience on my part. Thanks!

@arfon
Member

arfon commented Aug 6, 2022

Great stuff!

@mschubert

Now that this was brought up again, I still see issues with GS correctly identifying citations.

For instance, this article cites my work here, but the citation is not listed in GS (but is listed in e.g. Dimensions).

Have others experienced similar things, or could this be caused by the line break in the journal name? (I can open up a separate issue if that's helpful)

@arfon
Member

arfon commented Aug 15, 2022

Not sure, sorry. Google Scholar is a bit of a black box to us all.

@ManavalanG

Hi @arfon ! My article was published in JOSS on Oct 23 but google scholar is still not seeing it. Is there anything I can do? It appears there is not much we can do based on the above comments but just wanted to double check given the time passed since the last comment :)

@arfon
Member

arfon commented Nov 15, 2023

I'm afraid not @ManavalanG . We don't have any visibility into the Google Scholar operations.

@hauschke

@ManavalanG

@hauschke My article was published on Oct 23 and is yet to be seen by google scholar. However when I checked multiple articles that were published in JOSS since then, all of them showed up in google scholar including an article published on Nov 11. This made me curious :)

@ManavalanG

ManavalanG commented Feb 7, 2024

Hi! I just wanted to note that our article published in Oct 2023 is yet to appear in Google Scholar 😞 I checked if there is a mechanism on my end to make it available via Google Scholar, but I didn't find any. I gather from the earlier conversation that JOSS admins can't do much about it, but I wanted to register my frustration here 😢

@Spaak

Spaak commented May 6, 2024

Hi all, just wanted to chime in that our paper: https://joss.theoj.org/papers/10.21105/joss.05566 is appearing in the Google Scholar feed of @matsvanes , but is not appearing in my own feed or that of the other co-authors (@robertoostenveld @schoffelen).

Is there anything that can be done? Perhaps metadata are OK for the first author (in whose feed it is appearing) but not for the co-authors?

@ManavalanG

My paper finally made it to Google Scholar a month or so ago, but it still doesn't appear in my Google Scholar feed or in those of the other authors. Looks like this is similar to the issue mentioned by @Spaak.

@matsvanes

@Spaak I actually added it to my Google Scholar manually. I saw they're trying to get JOSS articles indexed in PubMed automatically, but I don't know if any work is being done for Google Scholar?

@sneakers-the-rat
Contributor

Confirming again that all criteria listed in google scholar's content, crawling, and indexing requirements are met: https://scholar.google.com/intl/en/scholar/inclusion.html

  • with noscript enabled, can easily reach each paper in <10 clicks from homepage
  • robots.txt does not exclude crawlers
  • metadata on a dozen checked papers is correct according to the highwire press format (meta tags also present without JS)
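
The robots.txt item in the list above can be re-checked mechanically. A small sketch with Python's standard `urllib.robotparser`; the sample rules and the "Googlebot" user agent are assumptions for illustration, not the actual JOSS robots.txt or a confirmed Scholar crawler name:

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules: everything is open except an admin area.
    sample_rules = "User-agent: *\nDisallow: /admin/\n"
    paper_url = "https://joss.theoj.org/papers/10.21105/joss.01342"
    print(crawler_allowed(sample_rules, "Googlebot", paper_url))
```

Feeding it the live robots.txt text and a handful of paper URLs would repeat the check without a browser.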

The only thing I could think of is to make a special page with all the papers from the last two weeks (the current "most recent papers" feed only lists from ~6 days ago, depending on volume), and they say:

For websites with more than a hundred thousand papers, we recommend that you create an additional browse interface that lists only the articles added in the last two weeks. This smaller set of webpages can be recrawled more frequently than your entire browse interface, which will facilitate timely coverage of your recent papers by the search robots.

not sure if that would help but it's the only thing they recommend that we don't do.
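
If someone wants to try the two-week page, the selection logic is simple. A sketch under the assumption that each paper is available as a (publication date, DOI) pair; the data shape, function name, and DOIs below are placeholders, not how the JOSS site stores papers:

```python
from datetime import date, timedelta


def recent_papers(papers, today: date, window_days: int = 14) -> list:
    """Return DOIs published within the last window_days, newest first.

    papers is an iterable of (publication_date, doi) tuples.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [(published, doi) for published, doi in papers
              if cutoff <= published <= today]
    return [doi for published, doi in sorted(recent, reverse=True)]


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    sample = [
        (date(2024, 5, 18), "10.21105/joss.example1"),
        (date(2024, 5, 1), "10.21105/joss.example2"),
        (date(2024, 5, 6), "10.21105/joss.example3"),
    ]
    for doi in recent_papers(sample, today=date(2024, 5, 20)):
        print(doi)
```

A page built from this list would cover the full two weeks regardless of volume, which the current "most recent papers" feed does not.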
