HTTPS URLs are actually fetched using HTTP #6

Closed
guillaumepitel opened this issue Sep 27, 2017 · 3 comments

Comments

@guillaumepitel
Contributor

I know this sounds ridiculous, but it seems to be the case. Most sites are still crawled correctly despite this problem, because most sites serve an HTTP version of each HTTPS URL.

But I stumbled on the problem with a site that does not behave this way: every http://xxx/yyy is redirected to https://xxx/yyy (an example of a problematic site: www.kernix.com).

Since BUbiNG thinks it is actually fetching https://xxx/yyy, only the redirect is stored, and nothing more.

The behaviour is visible in the logs when DEBUG logging is enabled for org.apache.http:

2017-09-27 19:33:24,180 18536 DEBUG [FetchingThread-0] i.u.d.l.b.f.FetchingThread - Next URL: https://www.exensa.com/robots.txt
2017-09-27 19:33:24,217 18573 DEBUG [FetchingThread-0] o.a.h.c.p.RequestAddCookies - CookieSpec selected: compatibility
2017-09-27 19:33:24,275 18631 DEBUG [FetchingThread-0] o.a.h.c.p.RequestAuthCache - Auth cache not set in the context
2017-09-27 19:33:24,315 18671 DEBUG [FetchingThread-0] i.u.d.l.b.f.FetchingThread$BasicHttpClientConnectionManagerWithAlternateDNS - Get connection for route {}->http://37.59.88.172:80
2017-09-27 19:33:24,379 18735 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Opening connection {}->http://37.59.88.172:80
2017-09-27 19:33:24,381 18737 DEBUG [FetchingThread-0] o.a.h.i.c.DefaultHttpClientConnectionOperator - Connecting to /37.59.88.172:80
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.c.DefaultHttpClientConnectionOperator - Connection established 172.23.0.89:44024<->37.59.88.172:80
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.c.DefaultManagedHttpClientConnection - http-outgoing-0: set socket timeout to 60000
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Executing request GET /robots.txt HTTP/1.1
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Target auth state: UNCHALLENGED
2017-09-27 19:33:24,388 18744 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Proxy auth state: UNCHALLENGED
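
The mismatch in the log above (an https "Next URL" followed by a connection opened to an http route) can also be spotted mechanically. The following is only a rough sketch: `SchemeMismatchScanner` and `hasSchemeMismatch` are hypothetical names, not part of BUbiNG or HttpClient, and the code assumes the lines come from a single fetching thread, so that each "Opening connection" line belongs to the most recent "Next URL" line.

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not BUbiNG code): flags fetches whose connection scheme differs from the URL scheme. */
final class SchemeMismatchScanner {
    // Matches the FetchingThread line announcing the URL about to be fetched.
    private static final Pattern NEXT_URL = Pattern.compile("Next URL: (\\S+)");
    // Matches the HttpClient line announcing the route that is actually opened.
    private static final Pattern ROUTE =
            Pattern.compile("Opening connection \\{[^}]*\\}->(https?)://");

    /** Returns true if a route is opened with a scheme other than that of the preceding "Next URL". */
    static boolean hasSchemeMismatch(List<String> logLines) {
        String expectedScheme = null;
        for (String line : logLines) {
            Matcher next = NEXT_URL.matcher(line);
            if (next.find()) {
                // Assumption: the logged URL is well formed; URI.create throws otherwise.
                expectedScheme = URI.create(next.group(1)).getScheme();
                continue;
            }
            Matcher route = ROUTE.matcher(line);
            if (expectedScheme != null && route.find()
                    && !route.group(1).equalsIgnoreCase(expectedScheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two lines from the log above, with timestamp and thread name abbreviated to "...".
        List<String> lines = List.of(
                "... FetchingThread - Next URL: https://www.exensa.com/robots.txt",
                "... MainClientExec - Opening connection {}->http://37.59.88.172:80");
        System.out.println(hasSchemeMismatch(lines)); // prints true
    }
}
```

With several fetching threads, the lines would first have to be grouped by thread name before being passed to the scanner.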

The problem is that, when the host's address is already known (because it is cached internally in BUbiNG), the HttpHost passed to the HttpClient is created without the port or the scheme, so the connection falls back to plain HTTP on port 80, as the log shows.

What must be done is to change a few lines in FetchData:

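// Take scheme and port from the requested URI instead of relying on HttpHost defaults.
final URI uri = httpGet.getURI();
final String scheme = uri.getScheme();
// getPort() returns -1 when the URI has no explicit port: fall back to the scheme default.
final int port = uri.getPort() == -1 ? (scheme.equals("https") ? 443 : 80) : uri.getPort();
// When the IP address is cached in the visit state, connect to it, but keep port and scheme.
final HttpHost httpHost = visitState != null ?
        new HttpHost(InetAddress.getByAddress(visitState.workbenchEntry.ipAddress).getHostAddress(), port, scheme) :
        new HttpHost(uri.getHost(), port, scheme);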
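
To check the port and scheme logic in isolation, without BUbiNG or HttpClient on the classpath, a sketch using only `java.net.URI` might look like the following. `TargetResolver`, `effectivePort` and `target` are hypothetical names introduced for illustration; the string returned by `target` merely stands in for the `HttpHost` built in the snippet above.

```java
import java.net.URI;

/** Hypothetical stand-alone helper mirroring the port/scheme logic above; not BUbiNG code. */
final class TargetResolver {
    /** Returns the explicit port of the URI, or the scheme default (443 for https, 80 otherwise). */
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) return uri.getPort();
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    /** Builds "scheme://address:port", using the cached IP address when one is given. */
    static String target(URI uri, String cachedAddress) {
        String host = cachedAddress != null ? cachedAddress : uri.getHost();
        return uri.getScheme() + "://" + host + ":" + effectivePort(uri);
    }

    public static void main(String[] args) {
        URI uri = URI.create("https://www.exensa.com/robots.txt");
        System.out.println(target(uri, "37.59.88.172")); // prints https://37.59.88.172:443
        System.out.println(target(uri, null));            // prints https://www.exensa.com:443
    }
}
```

Unlike the snippet above, this sketch compares the scheme with `"https".equalsIgnoreCase(...)`, which also tolerates an upper-case or missing scheme; that is a small variation, not what the proposed change does.
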
@mapio mapio closed this as completed Sep 28, 2017
@mapio
Member

mapio commented Sep 28, 2017

Can you provide a pull request for this?

@mapio mapio reopened this Sep 28, 2017
@guillaumepitel
Contributor Author

Sure, I just need some time; I'm pretty busy for the next few days.

@mapio
Member

mapio commented Sep 28, 2017

Of course! Thanks

@mapio mapio closed this as completed in a6e3c76 Sep 28, 2017