Can WebsiteAgent support anti DDoS website #2658

Open
shawn8888 opened this issue Dec 22, 2019 · 13 comments

@shawn8888

WebsiteAgent has trouble accessing websites that use Cloudflare's DDoS protection.

Example website (adult): https://www.inthecrack.com/

Running curl gives me a "Checking your browser" message with a 503 error:
curl -Lv https://www.inthecrack.com
HTTP/1.1 503 Service Temporarily Unavailable

The error log of a dry run is also attached.
itc.error.log

Browsers such as Firefox, IE, or Chrome open the website without a problem, but they have to wait a few seconds to get past the “Checking your browser” page.

So, is there a way for WebsiteAgent to act like a browser?

@dsander
Collaborator

dsander commented Dec 22, 2019

A temporary solution is to copy the Cloudflare cookie from your browser and put it into the WebsiteAgent configuration, similar to this:

....
    headers:
    {
      "Cookie": "cookiedata"
    }
....

Others have had success using PhantomJSCloud.
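
To illustrate the cookie workaround outside Huginn, here is a minimal Python sketch that attaches a browser-copied cookie to a request. The value "cookiedata" is the same placeholder as in the config above. The helper name `build_request`, the example URL, and the User-Agent string are assumptions for illustration only. The sketch also assumes the cookie is only honoured while it is still valid, and possibly only together with the User-Agent of the browser it was copied from.

```python
import urllib.request


def build_request(url, cookie, user_agent):
    """Build a GET request that carries a cookie copied from a real browser session."""
    headers = {
        "Cookie": cookie,          # placeholder; paste the value copied from the browser
        "User-Agent": user_agent,  # assumption: should match the browser that got the cookie
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical values, for illustration only.
req = build_request("https://example.com/", "cookiedata", "Mozilla/5.0 (example)")

# urllib.request.urlopen(req) would actually send it; it is left out here so
# the sketch has no network side effects.
```

This mirrors what the "headers" option does in the WebsiteAgent configuration: the cookie simply travels along as a request header.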

@shawn8888
Author

shawn8888 commented Dec 23, 2019

@dsander Thanks a lot for your help!

I am very new to Huginn and web programming, but I am willing to learn! I registered on PhantomJSCloud and am still trying to figure out how to use it.

The headers/cookie option seems interesting, but I found very little help in the WebsiteAgent documentation. I found http://huginnio.herokuapp.com/scenarios/19, made by you, as an example. To make it work, do I just replace {% credential huginn_domain %} with my Huginn URL, and the user/pass? It is not working yet, and I am still trying to get it to work.

@shawn8888
Author

shawn8888 commented Dec 23, 2019

PhantomJSCloud doesn't work. Here is what I did:

First I entered the API key into Huginn:
[Screenshot: P 1]

Then I created a Phantom Js Cloud Agent and set it up like this:

{
  "mode": "clean",
  "api_key": "{% credential phantomjs_cloud %}",
  "url": "https://www.inthecrack.com",
  "render_type": "html",
  "output_as_json_radio": "false",
  "output_as_json": "false",
  "ignore_images_radio": "false",
  "ignore_images": "false",
  "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36",
  "wait_interval": "10000"
}

After a dry run, I got a URL like this:

[Screenshot: P 2]

I copied the URL and pasted it into a browser. I got the "checking your browser..." message and, after a few seconds, "404 Not Found".

[Screenshot: P 3]

Is there something wrong with my settings, or does PhantomJSCloud not support this DDoS protection at all?

@shawn8888
Author

shawn8888 commented Dec 24, 2019

I installed "phantomjs-2.1.1-windows". Now this is interesting. I created a simple test.js file

var page = require('webpage').create();
page.open('http://www.osbot.org', function() {
  // Wait 10 seconds before rendering so the challenge page can redirect.
  setTimeout(function() {
    page.render('test.png');
    phantom.exit();
  }, 10000);
});

It works, and the output test.png shows a screenshot of the webpage! I don't even need to change the default user_agent. However, if I change the timeout from 10 seconds to 1 second, it shows the "Checking your browser" message. This makes me believe the browser just has to wait 5 seconds to get past the DDoS protection page and be redirected to the real page.

Huginn's Phantom Js Cloud Agent does give me the option to set the timeout, but no matter what I change it to, it always gives me the 404 error.

So is it an Agent bug or PhantomJSCloud's bug?

@dsander
Collaborator

dsander commented Dec 27, 2019

I played with the PhantomJSCloudAgent a little bit and it seems this is a bug in the Agent.

It gives me this url:

https://phantomjscloud.com/api/browser/v2/myAPIKEY/?request=%7B%22url%22%3A%22http%3A%2F%2Fwww.osbot.org%2F%22%2C%22renderType%22%3A%22html%22%2C%22requestSettings%22%3A%7B%22userAgent%22%3A%22Huginn%20-%20https%3A%2F%2Fgithub.com%2Fhuginn%2Fhuginn%22%2C%22wait_interval%22%3A%2210000%22%7D%7D

The wait_interval parameter should be waitInterval. This works for me:

https://phantomjscloud.com/api/browser/v2/myAPIKEY/?request=%7B%22url%22%3A%22http%3A%2F%2Fwww.osbot.org%2F%22%2C%22renderType%22%3A%22html%22%2C%22requestSettings%22%3A%7B%22userAgent%22%3A%22Huginn%20-%20https%3A%2F%2Fgithub.com%2Fhuginn%2Fhuginn%22%2C%22waitInterval%22%3A%226000%22%7D%7D
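
For anyone who wants to build the corrected URL by hand until the Agent is fixed, here is a minimal Python sketch. It only mirrors the structure of the two URLs above: a JSON payload that is percent-encoded into the `request` query parameter. The function name `build_phantomjscloud_url` is hypothetical, and the payload fields are limited to the ones visible in this thread.

```python
import json
from urllib.parse import quote

API_BASE = "https://phantomjscloud.com/api/browser/v2"


def build_phantomjscloud_url(api_key, page_url, wait_interval_ms, user_agent):
    """Return a GET URL whose `request` parameter is the URL-encoded JSON payload."""
    payload = {
        "url": page_url,
        "renderType": "html",
        "requestSettings": {
            "userAgent": user_agent,
            # camelCase key, as in the working URL; `wait_interval` is the
            # spelling the Agent generated and that did not work.
            "waitInterval": str(wait_interval_ms),
        },
    }
    # Compact separators keep the JSON free of spaces between tokens.
    compact = json.dumps(payload, separators=(",", ":"))
    return f"{API_BASE}/{api_key}/?request={quote(compact, safe='')}"


url = build_phantomjscloud_url(
    "myAPIKEY",
    "http://www.osbot.org/",
    6000,
    "Huginn - https://github.com/huginn/huginn",
)
print(url)
```

With these inputs the sketch reproduces the working URL above, which can then be pasted into a WebsiteAgent as its url option.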

@shawn8888
Author

The wait_interval parameter should be waitInterval. This works for me:

Thank you for the troubleshooting! The 404 error is gone now. More questions:

  1. Can you fix this bug so the Phantom Js Cloud Agent can populate the URL correctly?

PhantomJS was discontinued in 2018. I don't know how long PhantomJSCloud will be alive. However, PhantomJS 2.1.1 can be downloaded and installed locally on any Linux or Windows box. So,

  2. Can Huginn's agent connect to a local PhantomJS instance instead of PhantomJSCloud?

@dsander
Collaborator

dsander commented Dec 28, 2019

Can you fix this bug so the Phantom Js Cloud Agent can populate the URL correctly?

Yeah we can do that.

PhantomJS was discontinued in 2018. I don't know how long PhantomJSCloud will be alive.

From their docs, it reads as though they have already moved away from PhantomJS, but they could also be maintaining a fork of it.

Can Huginn's agent connect to a local PhantomJS instance instead of PhantomJSCloud?

You can, if you expose it via an HTTP interface that Huginn can call. There is also fulldom-server.

@shawn8888
Author

shawn8888 commented Dec 29, 2019

fulldom-server runs PhantomJS, and it also seems to be discontinued.
I got this error when trying to install it on my Ubuntu 16 VM:
fulldom.error.txt
So I gave up on running it in Ubuntu.

Then I tried to run fulldom in a Docker container and found this image:
https://hub.docker.com/r/b1nitp7iw/fulldom

Once the container was running on port 3600, I tried to build a URL that the fulldom container can process, as per the wiki here:
https://github.com/huginn/huginn/wiki/Browser-Emulation-using-fulldom-server
The URL is this:
http://mydocker:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=body
Luckily, I got the "checking your browser..." message! But then I got this:

[Screenshot: T9]

2nd try: running PhantomJS in a Docker container and exposing the port.
I downloaded wernight/phantomjs from here:
https://hub.docker.com/r/wernight/phantomjs/
I ran the command from the overview, which starts PhantomJS in 'Remote WebDriver mode':
docker run -d -p 8910:8910 wernight/phantomjs phantomjs --webdriver=8910
Then I got this when I visited port 8910:
[Screenshot: T8]

So it looks like PhantomJS is working on port 8910. Since I am very new to PhantomJS and fulldom, could you please shed some light on how to make use of them in Huginn?

@dsander
Collaborator

dsander commented Dec 29, 2019

I got this error when trying to install it on my Ubuntu 16 VM:
fulldom.error.txt

I have no problems installing it in an Ubuntu 18.04 docker image. Your error shows a permission error, which is very odd when running the command as root.

So it looks like PhantomJS is working on port 8910. Since I am very new to PhantomJS and fulldom, could you please shed some light on how to make use of them in Huginn?

fulldom-server works similarly to PhantomJSCloud: you send it an HTTP request as documented on the website, it waits until the given selector is found, and then it returns the HTML.

The docker image you used isn't running fulldom-server; it is "just" phantomjs, which needs to be called using the webdriver protocol https://w3c.github.io/webdriver/. See the docker hub documentation https://hub.docker.com/r/wernight/phantomjs/

There seem to be two fulldom docker images, but I have never used them:

https://hub.docker.com/r/b1nitp7iw/fulldom

https://hub.docker.com/r/creadom/fulldom

@shawn8888
Author

The docker image you used isn't running fulldom-server

About Fulldom docker:
I did try the fulldom docker image, the b1nitp7iw one, but I got a "Cannot POST /" error. Please check my previous post. Any idea why?

it is "just" phantomjs, which needs to be called using the webdriver protocol https://w3c.github.io/webdriver/. See the docker hub documentation https://hub.docker.com/r/wernight/phantomjs/

About phantomjs docker:
Thanks for the useful link to the webdriver docs! I did check the hub documentation, but it only shows usage examples for Java and Python. How do I use it in Huginn?

@dsander
Collaborator

dsander commented Dec 29, 2019

Sorry I missed that part, this seems to work for me:

docker run --rm -p 3600:3600 b1nitp7iw/fulldom
curl "http://localhost:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=%23jsddm"

Thanks for the useful link to the webdriver docs! I did check the hub documentation, but it only shows usage examples for Java and Python. How do I use it in Huginn?

I don't think you can at the moment. It would require an Agent that can use the protocol.
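
As a side note on the fulldom request format used in the curl command above, here is a minimal Python sketch of how such a URL can be assembled. It is inferred only from the URLs in this thread (target page percent-encoded into the path, CSS selector in the `selector` query parameter), not from fulldom's source; the helper name `build_fulldom_url` is hypothetical.

```python
from urllib.parse import quote


def build_fulldom_url(server, page_url, selector):
    """Percent-encode the target page into the path and the CSS selector into the query."""
    encoded_page = quote(page_url, safe="")      # "/" and ":" must be escaped too
    encoded_selector = quote(selector, safe="")  # "#" would otherwise start a URL fragment
    return f"{server.rstrip('/')}/{encoded_page}?selector={encoded_selector}"


url = build_fulldom_url("http://localhost:3600", "https://www.osbot.org/", "#jsddm")
print(url)
# -> http://localhost:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=%23jsddm
```

The resulting string is what would go into the url option of a WebsiteAgent pointed at the fulldom container.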

@shawn8888
Author

Looks like fulldom is my only hope.
The docker container is working: if I send it the Google URL, it works, except that there are no pictures (not sure why).
http://mydocker:3600/https%3A%2F%2Fwww.google.com%2F?selector=body

[Screenshot: tt1]

If I put this URL, (selector=body)
http://mydocker:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=body
It gives me the "checking your browser..." message! But then I got "cannot post" error.
[Screenshot: T9]

If I put the URL like yours (selector=%23jsddm)
http://mydocker:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=%23jsddm
I do NOT get the "checking your browser..." message, and I got this error:
[Screenshot: tt2]
The full error log is here:

tt.error.txt

@dsander
Collaborator

dsander commented Dec 30, 2019

If I put this URL, (selector=body)
http://mydocker:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=body
It gives me the "checking your browser..." message!

That makes sense: fulldom does not wait a fixed amount of time but instead looks for the specified selector. Since body is present in almost all HTML pages, it returns the Cloudflare page immediately.
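
The difference between the two selectors can be shown with a toy Python sketch. This is not fulldom's implementation: it simulates a page as a list of hypothetical HTML snapshots over time and only understands a bare tag name or a "#id" selector. The markup of the "real" page, including the element carrying the jsddm id, is invented for illustration.

```python
from html.parser import HTMLParser


class _Finder(HTMLParser):
    """Records whether a given tag name or element id appears in the HTML."""

    def __init__(self, tag=None, element_id=None):
        super().__init__()
        self.tag = tag
        self.element_id = element_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if self.tag is not None and tag == self.tag:
            self.found = True
        if self.element_id is not None and dict(attrs).get("id") == self.element_id:
            self.found = True


def matches(html, selector):
    """Toy selector check: supports only a tag name ("body") or an id ("#jsddm")."""
    if selector.startswith("#"):
        finder = _Finder(element_id=selector[1:])
    else:
        finder = _Finder(tag=selector)
    finder.feed(html)
    return finder.found


def first_match(snapshots, selector):
    """Return the first snapshot that contains the selector, or None if none does."""
    for html in snapshots:
        if matches(html, selector):
            return html
    return None


# Hypothetical snapshots of one page over time: challenge first, real page later.
challenge = "<html><body>Checking your browser...</body></html>"
real_page = '<html><body><ul id="jsddm"><li>Forum</li></ul></body></html>'
snapshots = [challenge, real_page]

print(first_match(snapshots, "body") is challenge)    # True: body matches the challenge page
print(first_match(snapshots, "#jsddm") is real_page)  # True: only the real page has the id
```

In other words, the selector should name something that exists only on the page behind the challenge.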

If I put the URL like yours (selector=%23jsddm)
http://mydocker:3600/https%3A%2F%2Fwww.osbot.org%2F?selector=%23jsddm
I do NOT get the "checking your browser..." message, and I got this error:

You can probably filter out the error in Huginn using a Liquid regex_replace filter, or open an issue on the fulldom repository.

I think your best bet is still PhantomJSCloud, after manually fixing the URL our Agent generates and putting that into a WebsiteAgent.

@Unending Unending mentioned this issue Feb 16, 2023