No content is retrieved, potential error at the readHTMLTable stage. #23

Closed
bastienboutonnet opened this issue Nov 13, 2015 · 13 comments

@bastienboutonnet

As of today (3AM CET, as the earliest measured occurrence):

  • get_profile() returns an error with the table:
    Error in tables[[1]] : subscript out of bounds
  • get_citation_history() returns an empty df
    [1] year cites
    <0 rows> (or 0-length row.names)

I suspect something has changed in the Google API?

I tried to figure it out (though I'm not that skilled). A pull of the page content using RCurl, like so,

library(RCurl)
getURL('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')

returns a bunch of source code but with an error message at the end:

We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue.<p>Please try again later.</p>

I'll let you decide whether this is worth closing. I imagine it is, but there may be more to it that someone more expert can check out.

@bastienboutonnet bastienboutonnet changed the title subscript out of bounds bug: no content is retrieved, potential error at the readHTMLTable stage. Nov 17, 2015
@bastienboutonnet
Author

Looks like the problem is still there. I'm not sure if I can be more helpful but it may be a good idea to look into this. I've been able to reproduce the problem using a different scholar profile, and on a different computer.

Not sure where exactly the problem resides, but it would seem that when trying to read the table from the page using readHTMLTable(url), no content is retrieved:

> library(XML)
> t=readHTMLTable("http://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ")
> t
named list()

@jefferis
Collaborator

I can confirm that I see this. See also http://stackoverflow.com/questions/33741372/google-server-gives-a-server-error-with-the-first-request-in-private-browsing-mo

That SO post notes that repeating the request bypasses the issue. Doing:

library(httr)
res = GET('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')
content(res)   # first request returns the internal-server-error page
res2 = GET('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')
content(res2)  # repeating the same request returns the profile content

worked for me the second time, while getURL never worked.

@bastienboutonnet
Author

I can confirm that this indeed works.

If readHTMLTable() is run on the content of res2, i.e. readHTMLTable(content(res2)), then we obtain the tables needed for the rest of the functions to work.

What does @jkeirstead recommend for mending this issue? Should the functions be written so that a test for content retrieval is performed and, if it fails, the content is pulled using the method outlined above (i.e. requesting twice), with the rest of the function then running on that content?
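
Something along these lines, roughly (just a sketch of the idea; fetch_scholar_tables is only an illustrative name, not a function from the package):

library(httr)
library(XML)

# Illustrative helper: retry the request until readHTMLTable actually finds tables.
fetch_scholar_tables <- function(url, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    res <- GET(url)
    doc <- htmlParse(content(res, as = "text"), asText = TRUE)
    tables <- readHTMLTable(doc)
    if (length(tables) > 0) return(tables)  # content retrieved, stop retrying
  }
  stop("No tables retrieved after ", max_tries, " attempts")
}

tabs <- fetch_scholar_tables("https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ")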

Not sure how long this issue with Google will remain; it seems to have been around for quite a few days already. If a fix is judged necessary, I'd be happy to help with it :)

@Kyeongan

I came here from Stack Overflow. Your R package is pretty nice, and it seems we are all having the same issue with Google now. I would like to discuss ideas for solving it and share anything that might be helpful.

@jkeirstead
Owner

Thanks for raising this issue and posting the fix. I'm inclined to wait to see if Google fixes this since that's what the error message suggests.

@bastienboutonnet
Author

Makes sense. If it takes too long and you want the fixes implemented, I'll be happy to help.

@rogiersbart

OK guys, great! I just noticed the bug when looking at an empty citation history plot on my personal blog. Let's hope they fix this soon!

@LechMadeyski

@jefferis
Collaborator

That makes sense. I think httr GET looks after the cookie state.


On 19 Nov 2015, at 21:15, Lech Madeyski notifications@github.com wrote:

"Fixed the issue by having cookies when it requests URLs." see http://stackoverflow.com/questions/33741372/google-server-gives-a-server-error-with-the-first-request-in-private-browsing-mo



@jkeirstead
Owner

Thanks @LechMadeyski. That does indeed seem to be the problem; will try to get a fix out shortly.

@jkeirstead jkeirstead added the bug label Nov 20, 2015
@jkeirstead jkeirstead changed the title bug: no content is retrieved, potential error at the readHTMLTable stage. No content is retrieved, potential error at the readHTMLTable stage. Nov 20, 2015
@jkeirstead
Owner

This has now been fixed and the latest version is available on dev; a CRAN release should be out very soon.

For those who are curious, the problem was that cookies have to be accepted in order to access the content. The package now performs a one-off request to a dummy URL to pick up the cookies and then maintains a persistent curl handle for future queries.
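
A rough sketch of what that looks like (the names here are only illustrative, not the package's actual internals):

library(RCurl)
library(XML)

# One persistent handle with an in-memory cookie jar, shared by all queries.
scholar_handle <- getCurlHandle(cookiefile = "", followlocation = TRUE)

# One-off request so that Google sets its cookies on the handle...
invisible(getURL("https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ",
                 curl = scholar_handle))

# ...after which subsequent requests return the profile page as expected.
page <- getURL("https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ",
               curl = scholar_handle)
tables <- readHTMLTable(htmlParse(page, asText = TRUE))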

@guillaumelobet

It appears the issue is back, at least for me. I am trying to compile data from several colleagues (so multiple get_profile() queries) and I randomly get stuck with the Error in tables[[1]] : subscript out of bounds error...

Any ideas how to fix this, or any workaround?

@pzhaonet

I also have the same issue. Does anyone know how to fix it?

get_profile(id = "TErVoUAAAAJ")
