I know this sounds ridiculous, but it seems to be the case. Most sites are nevertheless crawled correctly despite this problem, because most sites serve an HTTP version of each HTTPS URL.
But I stumbled on the problem with a site that does not behave this way: every http://xxx/yyy is redirected to https://xxx/yyy (an example of a problematic site: www.kernix.com).
Since BUbiNG thinks it is actually fetching https://xxx/yyy, only the redirect is stored, and nothing more.
The behaviour is visible in the logs when DEBUG logging is enabled for org.apache.http.
The problem is that, when the host address is already known (because it is cached internally by BUbiNG), the HttpHost passed to the HttpClient is created without the port or the scheme, so the request falls back to HttpHost's default scheme, http.
What needs to be done is to change a few lines in FetchData so that the HttpHost keeps the scheme and the port.
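To illustrate the idea (this is not the actual FetchData patch), here is a minimal sketch of deriving the scheme and the effective port from the URL being fetched, so that they can be handed to the HttpHost even when the address comes from the cache. `TargetParts` and `fromUrl` are hypothetical names of mine, not BUbiNG code, and the default ports (80 and 443) are an assumption that only http and https URLs are crawled.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper (not part of BUbiNG): the pieces an HttpHost needs. */
final class TargetParts {
    final String scheme;
    final String host;
    final int port;

    private TargetParts(String scheme, String host, int port) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
    }

    /** Derives scheme, host and effective port from an absolute URL. */
    static TargetParts fromUrl(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        if (port == -1) {
            // No explicit port in the URL: use the scheme's default (assumed http/https only).
            port = scheme.equals("https") ? 443 : 80;
        }
        return new TargetParts(scheme, uri.getHost(), port);
    }
}
```

With Apache HttpClient 4.x, these values could then be passed along as `new HttpHost(parts.host, parts.port, parts.scheme)`, or, when an already resolved `InetAddress` is available, `new HttpHost(address, parts.port, parts.scheme)`, instead of a constructor that takes only the host or address and therefore defaults to http.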