Cloudflare blocking scraping #46
Comments
I think this issue may be the same as the "Captha taking longer than expected." Take a look to see if your problem is the same and if so, hopefully one of the solutions posted there will work for you. |
Yes, Cloudflare definitely detects unusual activity and you land in an endless cycle of captchas. No matter what I try, it doesn't let me through. |
I am experiencing the same thing as klochden. I have tried adjusting and even disabling uBlock, with no success. If there is any way I can assist with debugging or testing, let me know. |
I get the Chrome app popup and complete the captcha, but it keeps failing. I have even tried disabling uBlock in various ways. I can sign into Blinkist in regular Chrome, but the script still fails. Happy to test any specifics. Thank you! |
FYI: uBlock can be disabled using the --no-ublock switch. I also got the Cloudflare captcha loop. This seems to be new. |
Hey, thank you very much! Will try it out tomorrow since today is very late now. But to me, the audio files are the most important target, so I hope you can figure out how to get the script working.
I will let you know tomorrow!
Thanks again!
Regards
FYI: uBlock can be disabled using the --no-ublock switch.
I also got the cloudflare captcha loop. This seems to be new.
Currently this workaround seems to be working for me:
In scraper.py, change "from seleniumwire import webdriver" to "from selenium import webdriver".
This fixes the Cloudflare issue, but it will not let you download the audio files, as that part requires seleniumwire.
Everything else should work, though.
Let me know if this allows you to log in.
Will look into a fully functioning fix.
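The trade-off in this workaround (plain selenium gets past Cloudflare but cannot capture audio, which needs seleniumwire) could be made explicit in code rather than by hand-editing the import. The following is only a sketch: the helper names and the `need_audio` parameter are hypothetical and are not part of scraper.py.

```python
import importlib
import importlib.util


def choose_webdriver_module(need_audio, is_installed=None):
    """Return (package_name, can_capture_audio).

    Hypothetical helper: picks seleniumwire only when audio is wanted and
    the package is importable, otherwise falls back to plain selenium.
    """
    if is_installed is None:
        def is_installed(name):
            # find_spec returns None when a top-level package is missing
            return importlib.util.find_spec(name) is not None

    if need_audio and is_installed("seleniumwire"):
        return "seleniumwire", True
    return "selenium", False


def load_webdriver(need_audio):
    """Import and return the chosen `webdriver` module plus the audio flag."""
    package, can_capture_audio = choose_webdriver_module(need_audio)
    webdriver = importlib.import_module(package + ".webdriver")
    return webdriver, can_capture_audio
```

With something like this, a run that does not ask for audio would use the plain selenium driver, and only audio runs would use seleniumwire, which may still hit the captcha loop described in this thread.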
|
Yes, it worked now and I was able to log in without an issue. Also, the SSL certificate was active, which I think is important for Cloudflare! But I got only a JSON text file, no audio. Hopefully someone can restore the main functionality so that everything can be downloaded. |
By the way, with the chrome addon "Audio Downloader Prime" I could manually download the audio files without an issue. Maybe there is a possibility to implement an automated solution? |
I've looked into this issue a little bit... The project is using an old version of For example:
Also, we might be able to remove the I will try to look into those two things. |
Manually using the privacy-pass extension makes scraping audio, e.g., of the daily book, possible again because you get 30 passes when solving 1 captcha. |
I am able to add privacy-pass to my regular Chrome and add the 30 passes. When I run the scraper, it does not appear in the dev-tools instance and I am still asked to solve captchas, which keep looping. How do we add privacy-pass to the dev-tools instance? Thanks! |
I did not automate this process but increased the time allowed for solving the captcha and then manually installed privacy-pass in the chrome instance opened when running the scraper. For now, this needs to be done every time the scraper is run. |
Thanks for the feedback. I poked around, however I have no idea how to add
privacy-pass in the chrome instance or increase the time. I am not really a
developer, more a hack. I know my limits. All good. I hope that leoncvlt is
able to fix it soon.
…On Fri, Jun 4, 2021 at 2:00 AM Jonathan Schneider ***@***.***> wrote:
I did not automate this process but increased the time allowed for solving
the captcha and then manually installed privacy-pass in the chrome instance
opened when running the scraper. For now, this needs to be done every time
the scraper is run.
But this can definitely be automated, similar to the uBlock extension.
Maybe @leoncvlt <https://github.com/leoncvlt> or someone else has some
time to automate this process.
|
Same issue. Resolving captcha brings another captcha and so on, so can't get this script to work |
@vongyver you can change scraper.py (currently line 180): WebDriverWait(driver, 60) --> WebDriverWait(driver, 360). That will change the timeout from 60 seconds to 360 seconds. But I tried the privacy-pass method from @jonaschn and it does not work for me. I changed to selenium and it works, minus not being able to download the audio, which is a huge bummer. Hopefully, this gets fixed soon.
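Instead of editing the number in scraper.py each time, the wait could be made configurable. This is a minimal sketch, assuming a made-up environment variable name (BLINKIST_CAPTCHA_TIMEOUT) and a hypothetical wrapper function; only the WebDriverWait call itself comes from the comment above.

```python
import os

DEFAULT_CAPTCHA_TIMEOUT = 60  # seconds, the value quoted in the comment above


def captcha_timeout(env=None):
    """Read the timeout from a (hypothetical) environment variable.

    Falls back to the default when the variable is unset, not an integer,
    or not positive.
    """
    env = os.environ if env is None else env
    raw = env.get("BLINKIST_CAPTCHA_TIMEOUT", "")
    try:
        seconds = int(raw)
    except ValueError:
        return DEFAULT_CAPTCHA_TIMEOUT
    return seconds if seconds > 0 else DEFAULT_CAPTCHA_TIMEOUT


def wait_for_login(driver, condition):
    """Wait for `condition` using the configured timeout."""
    # Imported lazily so the helper above can be used without selenium.
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, captcha_timeout()).until(condition)
```

As noted later in the thread, a longer timeout only buys time to solve the captcha or install an extension by hand; it does not stop the captcha loop itself.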
|
No luck on that change. I didn't think it was the 60 sec limit, I am
resolving two sets of images within 20 seconds.
I get the classic bicycle or boat and after completing it, the session
flips to the blinkist login screen and then back to the "not a robot"
checkbox and image sets again, repeatedly. It does not look like it's
trying to enter the passed credentials.
Just to clarify, I have confirmed that I do have the latest version, having
cloned fresh a couple of times.
Just a note, I had a password with an "&" in it and had trouble passing
that to the blinkistscraper, so I changed the password.
I had no luck with adding privacy-pass either.
Thanks for the recommendation. Happy to test what's offered.
|
It was more to give you enough time to manually add privacy-pass extension to that instance of chrome before the timeout that was first suggested. Anyways, that did not work for me and I assume it would not for you either.
|
So I found a solution that worked for me. It requires a bit of manual work, but it downloads audio now. At least for me, the problem seems to be the user-agent and the version of selenium-wire identified by @rocketinventor. So this worked for me:
So if it only takes changing the user-agent, this should be easy to implement in the Chrome options. Alternatively, the annoyance of manual extension installation could be avoided by adding this extension the same way uBlock is added. The scraping script works so far with the updated packages, but I haven't done any extensive testing. I didn't need the privacy-pass extension, but if the above doesn't work for you, you can try installing it manually to check. |
Your method is working perfectly for me as well! Thank you for keeping this working. |
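Due to this error (#58) happening to me all the time, I got annoyed with having to reinstall user-agent every morning. I implemented 2 options to change the user-agent. Unfortunately, both require manual clicking, but less work than the above solution.
1. Change user-agent at start: add the following line in scraper.py (I added it in line 88).
chrome_options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A")
This will change it to a Safari user-agent. If this user-agent gets flagged by Cloudflare, then just change it to another user-agent. I think anything other than a Chrome user-agent should work. This option always required me to do the captcha at least once, so it is a little bit annoying. I tried option 2 below to see if I could get around solving the captcha.
2. Load extension at start: download the user-agent extension as a .crx file (google it if you don't know how) and place it into the bin folder (like uBlock); for me, it is in bin\useragent\User-Agent-1.1.0.crx. This can be anywhere as long as the code below points to it correctly. Then add the line below in scraper.py (I added it after line 88, as I left the first option in).
chrome_options.add_extension(os.path.join(os.getcwd(), "bin", "useragent", "User-Agent-1.1.0.crx"))
I did not have to solve the captcha with this route, but I did have to click on the extension to change the user-agent and then reload the page. I don't know how to set the user-agent from this extension automatically, but maybe that would save the clicking. If someone knows how to do this or has a better solution that doesn't require any manual clicking or captcha, that would be awesome. |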
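The two user-agent approaches from this comment could be folded into one helper so that neither needs to be pasted in by hand. This is a sketch only: the function names and the `use_extension` and `base_dir` parameters are hypothetical, while the user-agent string and the .crx location are the ones quoted in this thread. It assumes `chrome_options` is a Selenium Chrome options object.

```python
import os

# Safari user-agent string quoted in this thread.
SAFARI_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 "
    "(KHTML, like Gecko) Version/7.0.3 Safari/7046A194A"
)


def user_agent_argument(user_agent=SAFARI_USER_AGENT):
    """Build the Chrome argument in the same form used in the thread."""
    return "user-agent=" + user_agent


def extension_path(base_dir=None):
    """Location of the user-agent extension, relative to `base_dir`."""
    base_dir = os.getcwd() if base_dir is None else base_dir
    return os.path.join(base_dir, "bin", "useragent", "User-Agent-1.1.0.crx")


def apply_user_agent_options(chrome_options, use_extension=False, base_dir=None):
    """Option 1 always; option 2 only if requested and the file exists."""
    chrome_options.add_argument(user_agent_argument())
    if use_extension:
        path = extension_path(base_dir)
        if os.path.isfile(path):  # skip a missing .crx instead of failing
            chrome_options.add_extension(path)
    return chrome_options
```

Swapping in a different user-agent then means changing one constant, which matches the advice to try another string if this one gets flagged.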
hxh103, thanks for the recommendations, glad to see it's working for you.
Not working for me, tried both and switching agents about 6 times
with reloads.
I'm still getting the hCaptcha cycle. I expect my issue may be a little
different. I am not sure what Cloudflare is using for browser
fingerprinting, but I may be blocking that too.
FYI
…On Mon, Jul 19, 2021 at 3:23 PM hxh103 ***@***.***> wrote:
Due to this error happening to me all the time (#58), I got annoyed
with having to reinstall user-agent every morning. I implemented 2 options
to change user-agent. Unfortunately, both require manual clicking, but less
work than the above solution.
1. *change user-agent at start*: Add the following line in scraper.py
(I added in line 88).
chrome_options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A")
This will change it to a Safari user-agent. If this user-agent gets
flagged by Cloudflare, then just change it to another user-agent. I think
anything other than a chrome user-agent should work. This option always
required me to do the captcha at least once so a little bit annoying. I
tried option 2 below to see if I could get around solving captcha.
2. *load extension at start*: download the user-agent extension as a
.crx file (google it if you don't know how) and place it into the bin
folder (like ublock); so for me, I have it in
bin\useragent\User-Agent-1.1.0.crx. This can be anywhere as long as you
point it correctly in the code below. Then add the below line in scraper.py
(I added it after line 88 as I left the first option in)
chrome_options.add_extension(os.path.join(os.getcwd(), "bin", "useragent", "User-Agent-1.1.0.crx"))
I did not have to solve the captcha with this route, but I did have to
click on the extension to change the user-agent and then reload the page. I
don't know how to set user-agent from this extension automatically, but
maybe this would save from clicking. If someone knows how to do this or has
a better solution that doesn't require any manual clicking or captcha, that
would be awesome.
|
Working like a charm in my case, thanks hxh103. Sometimes I still get network blocks or errors while backing up all the files, but that's already good enough! |
Have you tried downloading on a different network? You have already set up an environment similar to mine and hxh103's, so the only remaining suspect is your network, firewall, and so on. |
Update. I was able to get it working by disabling my pihole dns for a
minute during login. I am using pi-hole for DNS to block tracking, ads and
malware. I expect there is something in one of the lists that is causing an
issue. I will see if I can find it and if so pass it along.
I also discovered, for me at least, that I could not have a "&" in my
password, as the script was not handling that properly, even with quotes.
Great to see this scraping again. Thank you hxh103!!
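On the "&" problem: most shells treat an unquoted & as a control operator, so a password containing it can be cut short before the script ever sees it. One way to sidestep shell parsing would be to read the credentials from the environment or an interactive prompt. This is a hypothetical sketch, not how blinkistscraper currently works (it takes the email and password as command-line arguments), and the variable names BLINKIST_EMAIL and BLINKIST_PASSWORD are made up.

```python
import getpass
import os


def read_credentials(env=None, prompt=getpass.getpass):
    """Return (email, password) without passing them on the command line.

    Values come from the (hypothetical) BLINKIST_EMAIL and
    BLINKIST_PASSWORD environment variables; anything missing is asked
    for interactively, so special characters never go through the shell.
    """
    env = os.environ if env is None else env
    email = env.get("BLINKIST_EMAIL") or input("Blinkist email: ")
    password = env.get("BLINKIST_PASSWORD") or prompt("Blinkist password: ")
    return email, password
```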
|
None of the above options worked for me! I keep getting thrown back to the captcha page. I didn't manage to log in even once. (I tried the extension as well as changing the user-agent at load.) Thank you for the amazing tool. |
If anyone is still stuck with this, use undetected-chromedriver. Replace your driver with it, fix a few errors caused by unwanted options, and voila, it works! 😊 |
Did you need to change scraper.py to import this or anything? Are you
using the original scraper.py?
Thanks Ravi.
…On Thu, Jul 22, 2021 at 10:09 PM Ravi Mandliya ***@***.***> wrote:
If anyone is still stuck with this, use undetected-chromedriver
<https://github.com/ultrafunkamsterdam/undetected-chromedriver>. Replace
your driver with this and fix few errors of unwanted options and voila it
works! 😊
|
Yes, imported in |
Can anyone confirm undetected-chromedriver does indeed fix the issue? If so, might be time for a PR 😄 |
I have tried to use undetected-chromedriver but I can't fix this error message. Can somebody help me?
python3 blinkistscraper ********@***m ******** --language de --audio --concat-audio --keep-noncat
[14:22:24] INFO Starting scrape run...
[14:22:25] INFO Initialising chromedriver at /home/user/.local/lib/python3.8/site-packages/chromedriver_autoinstaller/97/chromedriver...
[14:22:26] ERROR Message: invalid argument: cannot parse capability: goog:chromeOptions
from invalid argument: unrecognized chrome option: excludeSwitches
(Driver info: chromedriver=97.0.4692.20 (6559bb085abcaedffe35d268b3546c43f755151c-refs/branch-heads/4692@{#186}),platform=Linux 5.11.0-40-generic x86_64)
Traceback (most recent call last):
File "/home/user/Downloads/blinkist-scraper/blinkistscraper/__main__.py", line 412, in <module>
main()
File "/home/user/Downloads/blinkist-scraper/blinkistscraper/__main__.py", line 319, in main
driver = scraper.initialize_driver(
File "/home/user/Downloads/blinkist-scraper/blinkistscraper/scraper.py", line 102, in initialize_driver
driver = uc.Chrome(version_main=97,
File "/home/user/.local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 302, in __init__
super(Chrome, self).__init__(
File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 70, in __init__
super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 93, in __init__
RemoteWebDriver.__init__(
File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 268, in __init__
self.start_session(capabilities, browser_profile)
File "/home/user/.local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 582, in start_session
super(Chrome, self).start_session(capabilities, browser_profile)
File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 359, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
self.error_handler.check_response(response)
File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: cannot parse capability: goog:chromeOptions
from invalid argument: unrecognized chrome option: excludeSwitches
(Driver info: chromedriver=97.0.4692.20 (6559bb085abcaedffe35d268b3546c43f755151c-refs/branch-heads/4692@{#186}),platform=Linux 5.11.0-40-generic x86_64)
[14:22:26] CRITICAL Uncaught Exception. Exiting... |
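The traceback above shows chromedriver rejecting the excludeSwitches option, which fits the earlier advice to "fix a few errors of unwanted options" when switching to undetected-chromedriver. A possible approach is to drop the rejected option before creating the driver. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the scraper collects its experimental options in a plain dict; only `uc.Chrome(version_main=...)` appears in the traceback itself.

```python
def filter_experimental_options(options, rejected=("excludeSwitches",)):
    """Split experimental options into (kept, dropped_names).

    `rejected` defaults to the one option named in the error message above.
    """
    kept = {name: value for name, value in options.items() if name not in rejected}
    dropped = sorted(set(options) - set(kept))
    return kept, dropped


def build_undetected_driver(experimental_options, version_main=None):
    """Create an undetected-chromedriver instance without rejected options."""
    # Third-party import kept local so the filter above works without it.
    import undetected_chromedriver as uc

    chrome_options = uc.ChromeOptions()
    kept, _dropped = filter_experimental_options(experimental_options)
    for name, value in kept.items():
        chrome_options.add_experimental_option(name, value)
    return uc.Chrome(options=chrome_options, version_main=version_main)
```

If other options trigger the same "unrecognized chrome option" error, their names could be added to `rejected` in the same way.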
Yes I can confirm it. Just like @mandliya mentioned, the undetected-chromedriver fixes the infinite captcha-loop from cloudflare. |
I will check out your fix now, and thank you for mentioning the email ^^ Can you also edit it out? |
I got it to work, but I can't scrape any audio. I get this error message:
|
@orenaksakal are you able to download audio with this fix? |
You mean after you saved the cookies am I right? |
Hi @orenaksakal
How do you revert back to selenium? Could you please explain in a little more detail what you do after login or what you need to change in the After passing the captcha with undetected-chromedriver I try to run the program again with the default driver, but the new window opening goes back to the captcha loop. I also tried modifying |
Also have the issue where it just says |
Hello, captcha page stuck. I was wondering if you got it solved, then maybe I can use yours |
Hello, I have just started using this library and everything seems to be set up correctly. I ran python blinkistscraper email password with my credentials, and Cloudflare unfortunately detects (I assume) automated activity and blocks me from navigating to Blinkist.com in the browser instance opened by the script.
Any ideas?