Stuck in Cloudflare hCaptcha loop. #31

Closed
GermanEngineering opened this issue Dec 14, 2020 · 15 comments

Comments

@GermanEngineering

GermanEngineering commented Dec 14, 2020

Hello and first of all thank you very much for your work!

It looks like this is exactly the code that I was looking for, but unfortunately I'm not able to get it running: I get stuck in an endless Cloudflare hCaptcha loop on https://www.blinkist.com/en/nc/login when I try to execute it for the first time.
The "One more step - Please complete the security check to access - I am human" page appears before I can enter the login information, and no matter how often I solve it, I always end up at the next captcha (tried it at least 9 times in a row).

My system:

  • Win 10
  • Chrome 87
  • Python 3.8
  • Venv with all requirements.txt modules installed.

I've already tried:

  • Running it on another Win 10 Laptop --> same problem
  • Different commands: python blinkistscraper email password / python main.py email password
  • Downloaded and specified ChromeDriver 87.0.4280.88 as argument
  • Downloaded Chrome 88 Beta and used ChromeDriver 88.0.4324.27
  • pip install --upgrade for all outdated modules
  • Different locations via VPN (Germany, Portugal and US)
  • Different Networks (DSL and Hotspot from Mobile Phone)
  • Ubuntu VM --> also getting stuck with the same problem

Unfortunately I don't have any other ideas at the moment and feel pretty lost/stupid.
Did you encounter this problem before and have an idea how to solve it?
Or are there some logfiles or something I can collect that might help in this case?

Thank you very much in advance!
Peter

@bckncook

Same issue here. Looking forward to a solution. Thank you!!!

@GermanEngineering
Author

Hello again,

I tested two more things:

  1. Tried to use cookies from chrome
  • logged in to blinkist in chrome
  • added chrome_options.add_argument(r"user-data-dir=C:\Users\Win10x64\AppData\Local\Google\Chrome\User Data") to chromedriver so that it uses the settings from Chrome
  • executed get_login_cookies() to get cookies.pkl
  • started initial code with login cookies
  • gui mode is running into Captcha loop again
  • headless mode is running into timeout
  • [1608123763.485][INFO]: Waiting for pending navigations...
    [1608123763.486][INFO]: Done waiting for pending navigations. Status: ok
    [1608123763.493][INFO]: Waiting for pending navigations...
    [1608123763.494][INFO]: Done waiting for pending navigations. Status: ok
    [1608123763.494][INFO]: [6319a21f140a99f67240dc6507ddab98] RESPONSE FindElement ERROR no such element: Unable to locate element: {"method":"class name","selector":"main-banner-headline-v2"}
    (Session info: headless chrome=87.0.4280.88)
    [1608123764.001][INFO]: [6319a21f140a99f67240dc6507ddab98] COMMAND FindElement {
    "sessionId": "6319a21f140a99f67240dc6507ddab98",
    "using": "class name",
    "value": "main-banner-headline-v2"
    }
  2. Tried selenium with Firefox
  • with driver = selenium.webdriver.Firefox()
  • --> also running into the same Captcha loop

Unfortunately nothing was successful, but maybe it helps to narrow down the root cause of the problem.
Thank you very much, again!
Peter
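
For readers following along, the cookie step in the first test above can be sketched roughly as follows. This is a minimal sketch and not the project's own get_login_cookies(): the helper names save_cookies and load_cookies are hypothetical, and the driver is assumed to expose Selenium's get_cookies() and add_cookie() methods.

```python
import pickle
from pathlib import Path


def save_cookies(driver, path):
    """Pickle the cookies of a logged-in browser session to `path`."""
    cookies = driver.get_cookies()  # Selenium returns a list of dicts
    Path(path).write_bytes(pickle.dumps(cookies))
    return len(cookies)


def load_cookies(driver, path):
    """Add previously pickled cookies to `driver` and return how many were added.

    Selenium only accepts cookies for the domain that is currently loaded,
    so the caller is assumed to have opened the site before calling this.
    """
    cookies = pickle.loads(Path(path).read_bytes())
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

As the tests in this comment show, valid cookies alone did not get past the captcha, so this only illustrates the mechanism.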

@leoncvlt
Owner

It seems like Blinkist / Cloudflare moved from Google's captchas (which worked fine) to hCaptcha, which causes this issue. From GermanEngineering's tests, it looks more like Cloudflare is detecting the ChromeDriver, since the loop persists even with legitimate cookies. Will need to look into it - any help welcome!

@GermanEngineering
Author

I found a solution that at least allows me to login and download the text.
It doesn't seem to work in headless mode though.
And with the --audio option I'm running into a json.decoder.JSONDecodeError exception.
I don't think that this is related to the change I made, but on the other hand I don't know if/how it was working before.

I tried to do a pull request, but I'm not really familiar with the GitHub process, so please excuse me if this is not the correct way to propose a change.
In the end it was just adding:
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
to the Chrome options in the scraper.py

Hope this helps.

@wywywywy

That's weird. I tried all these options and it still won't let me through the hcaptcha.

    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

@wywywywy

It'd be much better to convert this from Selenium to Puppeteer.

I just tried Puppeteer and that works well, especially with the Stealth plugin.

@rocketinventor
Contributor

rocketinventor commented Dec 21, 2020 via email

@mikaelaatan

mikaelaatan commented Dec 23, 2020

Hello, I'm not familiar with how GitHub works, but I'll just share what worked for me. I added chrome_options.add_argument("--disable-blink-features=AutomationControlled") from GermanEngineering's suggestion.

At first it worked, but in later sessions it started going back to the captcha again. The workaround is, after logging in and landing on the Cloudflare page, to redirect the browser back to the Blinkist.com homepage. This is at the point where the log says "waiting for user to solve recaptcha and login". After that, the scraper will proceed as expected.

@flowni

flowni commented Dec 27, 2020

Hello, I encounter the same problem as you guys, getting stuck in the infinite captcha loop...

I think we definitely have to add this line: chrome_options.add_argument("--disable-blink-features=AutomationControlled"). I also added headers and a user-data-dir to always use the same profile every time, but that's not enough, as the loop still appears, as already mentioned.

As a first quick fix, it worked for me to change from seleniumwire webdriver to the "normal" selenium webdriver. Doing this you can at least scrape the texts but to get the audio files you need to have access to the request tab, so audio scraping won't work any longer with this.
Does someone have an idea why the website could know it's a bot with seleniumwire webdriver with the exact same settings of the selenium webdriver?

Edit: I think the problem has something to do with the certificate as selenium-wire issues its own certificate (selenium-wire manual). I already added the Selenium Wire CA to Chrome's Authorities section, but the problem remains.

@usb4

usb4 commented Dec 29, 2020

I also run into the hCaptcha loop but can get around it with the following arguments:

    # prevent Cloudflare from detecting ChromeDriver as bot
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Even without these arguments, I find that my first scrape attempt after 12+ hours usually avoids triggering the captcha.

However, audio scraping still doesn't work.

[13:09:39] WARNING Could not find audio url in request, aborting audio scrape...
[13:09:39] ERROR Error processing audio url, aborting audio scrape...
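
To make the flags above easier to reuse, here is a minimal sketch of a helper that applies them to a Chrome options object. The helper name apply_anti_detection and the two constants are hypothetical; the object passed in is assumed to expose Selenium's add_argument() and add_experimental_option() methods. As other comments in this thread report, these flags were not sufficient on every setup.

```python
# Flags reported in this thread; whether they are sufficient varies by setup.
ANTI_DETECTION_ARGUMENTS = ["--disable-blink-features=AutomationControlled"]
ANTI_DETECTION_EXPERIMENTAL = {
    "excludeSwitches": ["enable-automation"],
    "useAutomationExtension": False,
}


def apply_anti_detection(chrome_options):
    """Apply the flags to an object exposing Selenium's ChromeOptions methods."""
    for argument in ANTI_DETECTION_ARGUMENTS:
        chrome_options.add_argument(argument)
    for name, value in ANTI_DETECTION_EXPERIMENTAL.items():
        chrome_options.add_experimental_option(name, value)
    return chrome_options


# Possible usage with Selenium (requires selenium and a matching ChromeDriver):
#   from selenium import webdriver
#   options = apply_anti_detection(webdriver.ChromeOptions())
#   driver = webdriver.Chrome(options=options)
```

Keeping the flags in one place makes it simple to add or drop one while testing which of them actually matter.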

@leoncvlt
Owner

In my tests, I had to override the user agent as well on top of implementing @usb4's flags. Although it still asked for the captcha when making a request for the blink's audio files.

Reading around, I found this discussion - https://stackoverflow.com/questions/32795460/loading-json-object-in-python-using-urllib-request-and-json-modules - and magically, yes, using urllib.request instead of requests doesn't seem to trigger the captcha. I tried implementing the other approach they suggested, where you connect to the IP address instead of the host, but was getting some SSL problems.

I pushed my changes in f4cab05, tested (albeit only on the free daily book) and seems to work fine on my end.
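
For illustration, a request made with urllib.request and an explicit user agent might look like the sketch below. This is not the code from f4cab05: the function names, the user agent string, and the cookie handling are assumptions made for the example. One known difference between the two libraries is the default User-Agent header they send (python-requests/... versus Python-urllib/...); whether that explains the captcha behaviour is not established here.

```python
import json
import urllib.request

# Hypothetical value for illustration; not taken from the project.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=BROWSER_USER_AGENT, cookies=None):
    """Build a urllib Request with an explicit User-Agent and optional cookies."""
    headers = {"User-Agent": user_agent}
    if cookies:
        # Serialize a {name: value} dict into a single Cookie header.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers)


def fetch_json(url, **kwargs):
    """Send the request and decode the response body as JSON."""
    with urllib.request.urlopen(build_request(url, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the server answers with an HTML captcha page instead of JSON, json.loads raises json.decoder.JSONDecodeError, which would be consistent with the exception reported earlier in this thread.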

@rocketinventor
Contributor

rocketinventor commented Dec 31, 2020 via email

@GermanEngineering
Author

Thank you very much leoncvlt!

@leoncvlt
Owner

leoncvlt commented Jan 1, 2021

> Leonardo, which user agent did you use with requests? The default one is a scraper user-agent. That could be why 'urllib.request' "magically" works. In my tests (Windows 10), it was enough to switch from 'seleniumwire.webdriver' to 'selenium.webdriver' (Flowni's "quick fix") and maybe also add in the "--disable-blink-features=AutomationControlled" argument (as per Peter's comment). However, it doesn't seem like any of the other arguments/lines, user-agents, data-dirs, etc, are needed at all. Perhaps those arguments could even prevent selenium-wire from accessing the audio URLs/requests properly. As far as the audio goes, it looks like there is a hard-coded URL now that points to the chapter audio... If so, it might be possible to completely ditch the chrome/selenium web-driver (except maybe to get the cookies). That should really get its own issue / pull-request, so I won't discuss the details much here.

In my case, the user agent was needed to access the actual library / books pages, not specifically for the audio files.

I'm using selenium wire to capture the original audio files request and re-use the cookies / auth information to request the rest of the audio blinks - if anyone can come up with an alternative way of accomplishing this, we could scrap the selenium wire requirements 😃
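
The capture-and-reuse idea described above could be sketched as follows. This is an illustration under stated assumptions, not the scraper's actual code: the function names, the "audio" URL marker, and the list of header names are hypothetical, and the captured requests are assumed to look like the entries of selenium-wire's driver.requests (objects with url and headers attributes).

```python
def find_audio_request(captured_requests, marker="audio"):
    """Return the first captured request whose URL contains `marker`, or None.

    `marker` is a hypothetical placeholder; the real audio URL pattern
    is not given in this thread.
    """
    for request in captured_requests:
        if marker in request.url:
            return request
    return None


def reusable_headers(request, names=("Cookie", "Authorization", "User-Agent")):
    """Copy the selected headers from a captured request so they can be re-sent."""
    return {name: request.headers[name] for name in names if name in request.headers}


# Possible usage with selenium-wire (requires a running browser session):
#   audio_request = find_audio_request(driver.requests)
#   headers = reusable_headers(audio_request) if audio_request else {}
```

The copied headers could then be attached to follow-up requests for the remaining audio blinks, which is the part that currently depends on selenium-wire.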

8 participants