No content is retrieved, potential error at the readHTMLTable stage. #23

Closed
bastienboutonnet opened this issue Nov 13, 2015 · 13 comments

@bastienboutonnet

As of today (3AM CET, as the earliest measured occurrence):

  • get_profile() returns an error with the table:
    Error in tables[[1]] : subscript out of bounds
  • get_citation_history() returns an empty df
    [1] year cites
    <0 rows> (or 0-length row.names)

I suspect something has changed in the Google API?

I tried to figure it out (though I'm not that skilled). A pull of the page content using RCurl, like so,

library(RCurl)
getURL('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')

returns a bunch of source code but with an error message at the end:

We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue.<p>Please try again later.</p>

I'll let you decide whether this is worth closing. I imagine it is, but there may be more to it that someone more expert can check out.

@bastienboutonnet bastienboutonnet changed the title subscript out of bounds bug: no content is retrieved, potential error at the readHTMLTable stage. Nov 17, 2015
@bastienboutonnet
Author

Looks like the problem is still there. I'm not sure if I can be more helpful but it may be a good idea to look into this. I've been able to reproduce the problem using a different scholar profile, and on a different computer.

Not sure where exactly the problem resides, but it would seem that when trying to read the table from the page using readHTMLTable(url), no content is retrieved:

> library(XML)
> t=readHTMLTable("http://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ")
> t
named list()

@jefferis
Collaborator

I can confirm that I see this. See also http://stackoverflow.com/questions/33741372/google-server-gives-a-server-error-with-the-first-request-in-private-browsing-mo

That SO post notes that repeating the request bypasses the issue. Doing:

library(httr)
res = GET('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')
content(res)   # first request returns the internal-server-error page
res2 = GET('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')
content(res2)  # repeating the same request returns the profile content

worked for me the second time, while getURL never worked.

@bastienboutonnet
Author

I can confirm that this indeed works.

If readHTMLTable() is run on the content of res2, i.e. readHTMLTable(content(res2)), then we obtain the tables needed for the rest of the functions to work.

What does @jkeirstead recommend for mending this issue? Should the functions be written so that a test for content retrieval is performed and, if it fails, the content is pulled using the method outlined above (i.e. requesting twice), with the rest of the function then running on that content?
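
Something along these lines, roughly (just a sketch of the idea; fetch_scholar_tables is only an illustrative name, not a function from the package):

library(httr)
library(XML)

# Illustrative helper: retry the request until readHTMLTable actually finds tables.
fetch_scholar_tables <- function(url, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    res <- GET(url)
    doc <- htmlParse(content(res, as = "text"), asText = TRUE)
    tables <- readHTMLTable(doc)
    if (length(tables) > 0) return(tables)  # content retrieved, stop retrying
  }
  stop("No tables retrieved after ", max_tries, " attempts")
}

tabs <- fetch_scholar_tables("https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ")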

Not sure how long this issue with Google will remain; it seems to have been around for quite a few days already. If a fix is judged necessary, I'd be happy to help with it :)

@Kyeongan

I came here from Stack Overflow. Your R package is pretty nice, and it seems we are all having the same issue with Google now. I would like to discuss ideas for solving it and share anything that might be helpful.

@jkeirstead
Owner

Thanks for raising this issue and posting the fix. I'm inclined to wait to see if Google fixes this since that's what the error message suggests.

@bastienboutonnet
Author

Makes sense. If it takes too long and you want the fixes implemented, I'll be happy to help.

@rogiersbart

OK guys, great! I just noticed the bug when looking at an empty citation history plot on my personal blog. Let's hope they fix this soon!

@LechMadeyski

@jefferis
Collaborator

That makes sense. I think httr GET looks after the cookie state.


On 19 Nov 2015, at 21:15, Lech Madeyski notifications@github.com wrote:

"Fixed the issue by having cookies when it requests URLs." see http://stackoverflow.com/questions/33741372/google-server-gives-a-server-error-with-the-first-request-in-private-browsing-mo



@jkeirstead
Owner

Thanks @LechMadeyski. That does indeed seem to be the problem; will try to get a fix out shortly.

@jkeirstead jkeirstead added the bug label Nov 20, 2015
@jkeirstead jkeirstead changed the title bug: no content is retrieved, potential error at the readHTMLTable stage. No content is retrieved, potential error at the readHTMLTable stage. Nov 20, 2015
@jkeirstead
Owner

This has now been fixed and the latest version is available on dev; a CRAN release should be out very soon.

For those who are curious, the problem was that cookies have to be accepted in order to access the content. The package now performs a one-off request to a dummy URL to pick up the cookies and then maintains a persistent curl handle for future queries.
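
A rough sketch of what that looks like (the names here are only illustrative, not the package's actual internals):

library(RCurl)
library(XML)

# One persistent handle with an in-memory cookie jar, shared by all queries.
scholar_handle <- getCurlHandle(cookiefile = "", followlocation = TRUE)

# One-off request so that Google sets its cookies on the handle...
invisible(getURL("https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ",
                 curl = scholar_handle))

# ...after which subsequent requests return the profile page as expected.
page <- getURL("https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ",
               curl = scholar_handle)
tables <- readHTMLTable(htmlParse(page, asText = TRUE))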

@guillaumelobet

It appears the issue is back, at least for me. I am trying to compile data from several colleagues (so multiple get_profile() queries) and I randomly get stuck with the Error in tables[[1]] : subscript out of bounds error...

Any ideas how to fix this, or any workaround?

@pzhaonet

I also have the same issue. Does anyone know how to fix it?

get_profile(id = "TErVoUAAAAJ")
