Docs-scraper not working when url has a port #103
Based on the algolia issue discussion I don't think the issue is caused by using localhost; instead it looks like the issue is triggered any time a port is provided. So in the example here, … That said, I haven't found in the code yet what would be causing it. The URLs are built using netloc from urlparse, which does include the port, and on first look I don't see anywhere that the port is removed or modified. Maybe it has something to do with the regular expressions being used on the URLs?
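For reference, here is that netloc behavior in plain urllib.parse (nothing scraper-specific):

  from urllib.parse import urlparse

  parsed = urlparse("http://localhost:8080/docs/")
  print(parsed.netloc)    # 'localhost:8080' -- netloc keeps the port
  print(parsed.hostname)  # 'localhost' -- hostname drops it
  print(parsed.port)      # 8080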
Hey @sanders41
I have not looked through the algolia code yet. What led me to the port idea was:
…
and
…
Then from that, where I started looking around here was:
…
I tried looking at references to these two places since they seem to be the main two places getting the base URL, but so far I haven't found anywhere that removes or changes the port information. There are a few different places in the UrlsParser class that use regular expressions, but nothing jumped out at me as looking at the port in any of them.
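To make the regex idea concrete, here is a purely hypothetical sketch (not taken from the UrlsParser source) of how a pattern anchored on the bare hostname would silently drop URLs that carry a port:

  import re

  # Hypothetical pattern built from the start URL without allowing for a port.
  pattern = re.compile(r"^https?://localhost/")

  print(bool(pattern.match("http://localhost/docs/")))       # True
  print(bool(pattern.match("http://localhost:8080/docs/")))  # False -- the ':8080' breaks the match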
@bidoubiwa I have been looking into this some more and I still can't find anywhere that removes or changes the port. I also can't find anywhere that would treat localhost specifically as an issue. I currently have another theory based on some behavior I have seen while testing, but I'm not very familiar with the doc scraper, so I'm wondering if you can tell me whether the following makes sense as a possibility.

If we take the MeiliSearch docs as an example, they are built with VuePress, so they are rendered as static pages on the live site, but viewed in development they are rendered with JavaScript. It looks like the doc scraper is supposed to check whether a site is rendered with JavaScript and use Selenium if it is. In testing, when I print out the HTML that the scraper gets, it does not include the JS-rendered elements, and when I check js_render it always seems to be False. So I'm wondering: could the issue be that the check for whether Selenium should be used is not working as it should, and therefore Selenium is not being used on dynamically generated sites?
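One quick way to test that theory from outside the scraper is to fetch the page with a plain HTTP request and see whether the content selectors find anything in the raw HTML; the URL and selector below are just the ones from the config that follows:

  import requests
  from bs4 import BeautifulSoup

  # VuePress in dev mode serves a near-empty <div id="app"> that client-side
  # JS fills in, so content selectors should find nothing in the raw HTML.
  raw = requests.get("http://localhost:8080").text
  soup = BeautifulSoup(raw, "html.parser")
  print(soup.select(".theme-default-content h1"))  # expect [] if JS rendering is required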
I looked through algolia's scraper and found some additional configuration settings I was able to use to get the scraper to work when the URL includes a port and/or when the page is rendered with JS, so I added these settings to the README file. To scrape the MeiliSearch docs on localhost, run with:
"index_uid": "docs",
"sitemap_urls": [],
"start_urls": ["http://localhost:8080"],
"js_render": true,
"js_wait": 1,
"allowed_domains": ["localhost"],
"selectors": {
"lvl0": {
"selector": ".sidebar-heading.open",
"global": true,
"default_value": "Documentation"
},
"lvl1": ".theme-default-content h1",
"lvl2": ".theme-default-content h2",
"lvl3": ".theme-default-content h3",
"lvl4": ".theme-default-content h4",
"lvl5": ".theme-default-content h5",
"text": ".theme-default-content p, .theme-default-content li"
},
"strip_chars": " .,;:#",
"scrap_start_urls": true,
"custom_settings": {
"synonyms": {
"relevancy": ["relevant", "relevance"],
"relevant": ["relevancy", "relevance"],
"relevance": ["relevancy", "relevant"]
}
}
}
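One detail worth noting in that config: allowed_domains lists the bare hostname ("localhost") while start_urls keeps the port. A hypothetical sanity check (not part of docs-scraper) for keeping the two consistent could look like:

  import json
  from urllib.parse import urlparse

  with open("config.json") as f:  # path is an assumption
      config = json.load(f)

  # urlparse().hostname drops the port, so comparing hostnames matches the
  # convention of listing bare hostnames in allowed_domains.
  for url in config["start_urls"]:
      host = urlparse(url).hostname
      assert host in config["allowed_domains"], f"{host} missing from allowed_domains"
  print("start_urls and allowed_domains are consistent")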
One thing I still have a question about is that a lot of the pages say 0 records. If I point the scraper at the live site's docs, removing the …
As of now, docs-scraper does not work on localhost. A workaround would be to use a tool like ngrok.
The bug should be fixed.
When trying to scrape with localhost you will receive the following result:
…
Problem existed in docsearch-scraper as well: algolia/docsearch-scraper#461 (comment)