Too many redirects error when adding/updating doi.org url #452

Closed
ghost opened this issue May 3, 2020 · 8 comments

@ghost

ghost commented May 3, 2020

Adding scientific papers to the database via a doi.org URL often fails with a "too many redirects" error. This isn't really buku's fault (I think), but it is still an inconvenience.

Example url: https://doi.org/10.1080/00140139.2013.790485

Log:

~ $ buku -z -u 3765
[DEBUG] buku v4.3
[DEBUG] Python v3.8.2
[DEBUG] netloc: doi.org
[DEBUG] Starting new HTTPS connection (1): doi.org:443
[DEBUG] https://doi.org:443 "GET /10.1080/00140139.2013.790485 HTTP/1.1" 302 211
[DEBUG] Incremented Retry for (url='https://doi.org/10.1080/00140139.2013.790485'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
[INFO] Redirecting https://doi.org/10.1080/00140139.2013.790485 -> http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
[DEBUG] Starting new HTTP connection (1): www.tandfonline.com:80
[DEBUG] http://www.tandfonline.com:80 "GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1" 301 0
[DEBUG] Incremented Retry for (url='http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
[INFO] Redirecting http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485 -> https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
[DEBUG] Starting new HTTPS connection (1): www.tandfonline.com:443
[DEBUG] https://www.tandfonline.com:443 "GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1" 302 None
[DEBUG] Incremented Retry for (url='https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
[INFO] Redirecting https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485 -> https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1
[DEBUG] https://www.tandfonline.com:443 "GET /doi/abs/10.1080/00140139.2013.790485?cookieSet=1 HTTP/1.1" 302 None
[ERROR] network_handler(): HTTPSConnectionPool(host='www.tandfonline.com', port=443): Max retries exceeded with url: https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1 (Caused by ResponseError('too many redirects'))
[ERROR] Index 3765: No title

[DEBUG] Thread 140005405382400: processed 1
[DEBUG] 1 threads completed

I used -u because I had already added this article to the database, but the same happens with the -a option.

@jarun
Owner

jarun commented May 3, 2020

Works fine for me.

~/GitHub/nnn$ buku -a doi.org
486. Digital Object Identifier System
   > doi.org

~/GitHub/nnn$ buku -u 486      
Title: [Digital Object Identifier System]
Index 486: updated

486. Digital Object Identifier System
   > doi.org

~/GitHub/nnn$

The following works too: buku -a 'https://www.doi.org/'

Maybe a temporary problem.

@jarun jarun closed this as completed May 3, 2020
@ghost
Author

ghost commented May 3, 2020

No, the problem is not with https://doi.org itself, but with a full URL that points to a specific paper, like this one: https://doi.org/10.1080/00140139.2013.790485

Here is a more accurate example:

~ $ buku -z -a 'https://doi.org/10.1080/00140139.2013.790485'
[DEBUG] buku v4.3
[DEBUG] Python v3.8.2
[DEBUG] netloc: doi.org
[DEBUG] Starting new HTTPS connection (1): doi.org:443
[DEBUG] https://doi.org:443 "GET /10.1080/00140139.2013.790485 HTTP/1.1" 302 211
[DEBUG] Incremented Retry for (url='https://doi.org/10.1080/00140139.2013.790485'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
[INFO] Redirecting https://doi.org/10.1080/00140139.2013.790485 -> http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
[DEBUG] Starting new HTTP connection (1): www.tandfonline.com:80
[DEBUG] http://www.tandfonline.com:80 "GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1" 301 0
[DEBUG] Incremented Retry for (url='http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
[INFO] Redirecting http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485 -> https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
[DEBUG] Starting new HTTPS connection (1): www.tandfonline.com:443
[DEBUG] https://www.tandfonline.com:443 "GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1" 302 None
[DEBUG] Incremented Retry for (url='https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
[INFO] Redirecting https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485 -> https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1
[DEBUG] https://www.tandfonline.com:443 "GET /doi/abs/10.1080/00140139.2013.790485?cookieSet=1 HTTP/1.1" 302 None
[ERROR] network_handler(): HTTPSConnectionPool(host='www.tandfonline.com', port=443): Max retries exceeded with url: https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1 (Caused by ResponseError('too many redirects'))
[DEBUG] Title: [None]
3766. Untitled
   > https://doi.org/10.1080/00140139.2013.790485

@jarun
Owner

jarun commented May 3, 2020

Sorry, I don't think I can do much here. It's a server-side redirect. You could check with the domain owner to find out why this is happening.

@ghost
Author

ghost commented May 3, 2020

Well, my browser (Firefox) follows the redirects correctly, and so does curl, so I don't think this is a server issue. It looks like the redirect limit is set to 3, and buku raises an error once that limit is reached.

Here is what curl -vsL https://doi.org/10.1080/00140139.2013.790485 > /dev/null says: curl.log
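
The "limit of 3" reading can be checked against the debug output itself. The sketch below is only a log-parsing aid, not part of buku: the helper name `retry_countdown` and the regular expression are assumptions made for illustration, and the sample lines are copied from the buku log earlier in this thread. It extracts the `total` and `redirect` counters that urllib3 prints after each hop.

```python
import re

# Lines copied verbatim from the buku debug output earlier in this thread.
LOG = """\
[DEBUG] Incremented Retry for (url='https://doi.org/10.1080/00140139.2013.790485'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
[DEBUG] Incremented Retry for (url='http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
[DEBUG] Incremented Retry for (url='https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
"""

# Hypothetical helper pattern: matches the "Retry(total=..., redirect=...)" repr.
RETRY_RE = re.compile(r"Retry\(total=(-?\d+), .*?redirect=(None|-?\d+)")


def retry_countdown(log_text):
    """Return (total, redirect) pairs in the order they were logged."""
    pairs = []
    for match in RETRY_RE.finditer(log_text):
        total = int(match.group(1))
        redirect = None if match.group(2) == "None" else int(match.group(2))
        pairs.append((total, redirect))
    return pairs


countdown = retry_countdown(LOG)
print(countdown)  # [(2, None), (1, None), (0, None)]
```

The counters run 2, 1, 0 with `redirect=None`, which suggests that no separate redirect limit was set and each hop was charged against a total budget that started at 3.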

@jarun
Owner

jarun commented May 3, 2020

The relevant function is get_PoolManager:

buku/buku

Line 3534 in 7722294

def get_PoolManager():

The documentation is here: https://urllib3.readthedocs.io/en/latest/reference/

I tried both the retries and redirect options but to no effect. See if you can figure it out.

@rachmadaniHaryono and @zmwangx any ideas?

@jarun
Owner

jarun commented May 3, 2020

Similar issue #445. Opening both.

@jarun jarun reopened this May 3, 2020
@ghost
Author

ghost commented May 3, 2020

I can make it work with urllib3 as shown below. I found this workaround here: urllib3/urllib3#1555 (comment)

What I don't understand is why buku fails after the third redirect. urllib3's Retry defaults to a total of 10.

import logging
import urllib3
import http.client as http_client

# Print the raw request and response headers from http.client to stdout.
http_client.HTTPConnection.debuglevel = 1

# Route urllib3's debug messages through the root logger's handler.
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logger = logging.getLogger("urllib3")
logger.setLevel(logging.DEBUG)
logger.propagate = True

manager = urllib3.PoolManager()

method = 'GET'
url = 'https://doi.org/10.1080/00140139.2013.790485'
# Pass an explicit Retry so that up to 10 redirects are allowed for this request.
retries = urllib3.util.Retry(redirect=10)

response = manager.request(method, url, retries=retries)

print('status:', response.status)

Result:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): doi.org:443
send: b'GET /10.1080/00140139.2013.790485 HTTP/1.1\r\nHost: doi.org\r\nAccept-Encoding: identity\r\n\r\n'
reply: 'HTTP/1.1 302 \r\n'
header: Date: Sun, 03 May 2020 14:28:17 GMT
header: Content-Type: text/html;charset=utf-8
header: Content-Length: 211
header: Connection: keep-alive
header: Set-Cookie: __cfduid=d6a870b2f44b644706821401247851a6b1588516097; expires=Tue, 02-Jun-20 14:28:17 GMT; path=/; domain=.doi.org; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept
header: Location: http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
header: Expires: Sun, 03 May 2020 14:49:41 GMT
header: CF-Cache-Status: DYNAMIC
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
header: Server: cloudflare
header: CF-RAY: 58daaae91f7040ce-HAM
header: cf-request-id: 027c8925b3000040ce0a8a2200000001
DEBUG:urllib3.connectionpool:https://doi.org:443 "GET /10.1080/00140139.2013.790485 HTTP/1.1" 302 211
DEBUG:urllib3.util.retry:Incremented Retry for (url='https://doi.org/10.1080/00140139.2013.790485'): Retry(total=9, connect=None, read=None, redirect=9, status=None)
INFO:urllib3.poolmanager:Redirecting https://doi.org/10.1080/00140139.2013.790485 -> http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): www.tandfonline.com:80
send: b'GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1\r\nHost: www.tandfonline.com\r\nAccept-Encoding: identity\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Date: Sun, 03 May 2020 14:28:17 GMT
header: Content-Length: 0
header: Connection: keep-alive
header: Set-Cookie: __cfduid=d704fc1829d6d78ad92f19bd3dc3c34961588516097; expires=Tue, 02-Jun-20 14:28:17 GMT; path=/; domain=.tandfonline.com; HttpOnly; SameSite=Lax
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Strict-Transport-Security: max-age=0; includeSubDomains
header: X-Frame-Options: SAMEORIGIN
header: Cache-Control: no-cache
header: Pragma: no-cache
header: Location: https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
header: CF-Cache-Status: DYNAMIC
header: Server: cloudflare
header: CF-RAY: 58daaae9fe05c2c2-FRA
header: cf-request-id: 027c8926370000c2c20fbbc200000001
DEBUG:urllib3.connectionpool:http://www.tandfonline.com:80 "GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1" 301 0
DEBUG:urllib3.util.retry:Incremented Retry for (url='http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=8, connect=None, read=None, redirect=8, status=None)
INFO:urllib3.poolmanager:Redirecting http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485 -> https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.tandfonline.com:443
send: b'GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1\r\nHost: www.tandfonline.com\r\nAccept-Encoding: identity\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Sun, 03 May 2020 14:28:18 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Set-Cookie: __cfduid=dc0820c61b8897776e8f52874db31acd21588516097; expires=Tue, 02-Jun-20 14:28:17 GMT; path=/; domain=.tandfonline.com; HttpOnly; SameSite=Lax
header: Cache-Control: private
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Strict-Transport-Security: max-age=2592000
header: X-Frame-Options: SAMEORIGIN
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1
header: Set-Cookie: I2KBRCK=1; domain=.tandfonline.com; path=/; secure; expires=Mon, 03-May-2021 14:28:18 GMT
header: CF-Cache-Status: DYNAMIC
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Server: cloudflare
header: CF-RAY: 58daaaec4ddcc2b3-FRA
header: cf-request-id: 027c8927a90000c2b33ca9a200000001
DEBUG:urllib3.connectionpool:https://www.tandfonline.com:443 "GET /doi/abs/10.1080/00140139.2013.790485 HTTP/1.1" 302 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485'): Retry(total=7, connect=None, read=None, redirect=7, status=None)
INFO:urllib3.poolmanager:Redirecting https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485 -> https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1
send: b'GET /doi/abs/10.1080/00140139.2013.790485?cookieSet=1 HTTP/1.1\r\nHost: www.tandfonline.com\r\nAccept-Encoding: identity\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Sun, 03 May 2020 14:28:18 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Set-Cookie: __cfduid=d75e314fc6d2e811e69b37c4ec3326c311588516098; expires=Tue, 02-Jun-20 14:28:18 GMT; path=/; domain=.tandfonline.com; HttpOnly; SameSite=Lax
header: Cache-Control: private
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Strict-Transport-Security: max-age=2592000
header: X-Frame-Options: SAMEORIGIN
header: Location: https://www.tandfonline.com/action/cookieAbsent
header: CF-Cache-Status: DYNAMIC
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Server: cloudflare
header: CF-RAY: 58daaaf0ab4bc2b3-FRA
header: cf-request-id: 027c892a6d0000c2b33cae9200000001
DEBUG:urllib3.connectionpool:https://www.tandfonline.com:443 "GET /doi/abs/10.1080/00140139.2013.790485?cookieSet=1 HTTP/1.1" 302 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1'): Retry(total=6, connect=None, read=None, redirect=6, status=None)
INFO:urllib3.poolmanager:Redirecting https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1 -> https://www.tandfonline.com/action/cookieAbsent
send: b'GET /action/cookieAbsent HTTP/1.1\r\nHost: www.tandfonline.com\r\nAccept-Encoding: identity\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sun, 03 May 2020 14:28:19 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Set-Cookie: __cfduid=d75e314fc6d2e811e69b37c4ec3326c311588516098; expires=Tue, 02-Jun-20 14:28:18 GMT; path=/; domain=.tandfonline.com; HttpOnly; SameSite=Lax
header: X-XSS-Protection: 1; mode=block
header: X-Content-Type-Options: nosniff
header: Strict-Transport-Security: max-age=2592000
header: X-Frame-Options: SAMEORIGIN
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 6eec6aa33dcc244dc9e5fb9a6fd33c95
header: Set-Cookie: JSESSIONID=aaaqWPWUcZ4Xzexwd0xhx; path=/; secure; HttpOnly
header: CF-Cache-Status: DYNAMIC
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Server: cloudflare
header: CF-RAY: 58daaaf21faac2b3-FRA
header: cf-request-id: 027c892b4b0000c2b33cb09200000001
DEBUG:urllib3.connectionpool:https://www.tandfonline.com:443 "GET /action/cookieAbsent HTTP/1.1" 200 None
status: 200
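
The two logs in this thread can be reconciled with a small model. The sketch below is a simplified illustration of a redirect budget, not urllib3's implementation: the names `REDIRECTS`, `follow` and `TooManyRedirects` are hypothetical, and the hops are copied from the output above. In the buku log the counter shows `total=2` after the first hop (a budget of 3), while the script's log shows `total=9` (a budget of 10). The chain needs four redirects before the final 200, which is served for `/action/cookieAbsent`.

```python
class TooManyRedirects(Exception):
    """Raised by this toy model when the redirect budget is used up."""


# Redirect hops copied from the debug output in this thread.
REDIRECTS = {
    "https://doi.org/10.1080/00140139.2013.790485":
        "http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485",
    "http://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485":
        "https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485",
    "https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485":
        "https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1",
    "https://www.tandfonline.com/doi/abs/10.1080/00140139.2013.790485?cookieSet=1":
        "https://www.tandfonline.com/action/cookieAbsent",
}

START = "https://doi.org/10.1080/00140139.2013.790485"


def follow(url, budget):
    """Follow REDIRECTS from url, spending one unit of budget per hop."""
    while url in REDIRECTS:
        budget -= 1
        if budget < 0:
            # The budget ran out while this URL still wanted to redirect.
            raise TooManyRedirects(url)
        url = REDIRECTS[url]
    return url


try:
    follow(START, budget=3)
except TooManyRedirects as exc:
    print("budget 3 exhausted at:", exc)

print("budget 10 ends at:", follow(START, budget=10))
```

With a budget of 3 the model stops at the `?cookieSet=1` URL, the same URL named in buku's "Max retries exceeded" error; with a budget of 4 or more it reaches the final page.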

jarun added a commit that referenced this issue May 3, 2020
@jarun
Owner

jarun commented May 3, 2020

Thank you so much!

Fixed at commit d0771a5.

@jarun jarun closed this as completed May 3, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2020