
Docs-scraper not working when url has a port #103

Closed
bidoubiwa opened this issue Mar 1, 2021 · 5 comments · Fixed by #112
Labels
bug Something isn't working

Comments

@bidoubiwa
Contributor

bidoubiwa commented Mar 1, 2021

As of now, docs-scraper does not work on localhost. A workaround is to use a tool like ngrok.

Ngrok exposes local servers behind NATs and firewalls to the public internet over secure tunnels

The bug should be fixed.

When trying to scrape with localhost you will receive the following result:

Docs-Scraper: http://localhost:8080 (0 records)

Problem existed in docsearch-scraper as well: algolia/docsearch-scraper#461 (comment)

@sanders41
Collaborator

Based on the algolia issue discussion I don't think the issue is because of using localhost; instead it looks like the issue is caused any time a port is provided. So in the example here, http://localhost:8080, I think the 8080 is the problem and not the localhost. I think this also explains why ngrok works as a workaround: it gives you a url on the default port 80.

That said, I haven't yet found what in the code would be causing the issue. The urls use netloc from urlparse, which does include the port. On first look I don't see anywhere the port is removed or modified. Maybe it has something to do with the regular expressions applied to the urls?
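For reference, urlparse keeps the port inside netloc, so the stdlib parsing itself never strips it (a quick sanity check, not code from this repo):

```python
from urllib.parse import urlparse

o = urlparse("http://localhost:8080/docs/page.html?ref=1")
print(o.netloc)    # 'localhost:8080' -- the port stays in netloc
print(o.hostname)  # 'localhost'      -- hostname alone drops it
print(o.port)      # 8080

# A base-url reconstruction from scheme + netloc + path keeps the port as well
url_without_params = o.scheme + "://" + o.netloc + o.path
print(url_without_params)  # 'http://localhost:8080/docs/page.html'
```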

@bidoubiwa
Contributor Author

bidoubiwa commented Mar 29, 2021

Hey @sanders41
Thanks for the investigation. Could you link the related code in algolia? I would love to see it.

@bidoubiwa bidoubiwa changed the title Docs-scraper not working on localhost Docs-scraper not working when url has a port Mar 29, 2021
@sanders41
Collaborator

I have not looked through the algolia code yet. What led me to the port idea was:

It just how we parse the URL, using a port will broke everything. I would recommend you to use a local DNS

and

This is not a good practice to use a port when live, this is why we do not document it.

Thanks for sharing, I would recommend you to use the default 80 port and avoid to precise it.

From there, the places I started looking in this repo were:

return urlparse(url).netloc
and
url_without_params = o.scheme + "://" + o.netloc + o.path

I tried looking at references to these two places since they seem to be the main places getting the base url, but so far I haven't found anywhere that removes or changes the port information. There are a few places in the UrlsParser class that use regular expressions, but nothing in them jumped out as touching the port.
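To illustrate the kind of failure a url regular expression could introduce (a hypothetical pattern, not taken from the codebase): a pattern built from the bare hostname silently stops matching as soon as a port appears.

```python
import re

# Hypothetical allowed-url pattern built from the hostname only
pattern = re.compile(r"^https?://localhost(/|$)")

print(bool(pattern.match("http://localhost/guide")))       # True
print(bool(pattern.match("http://localhost:8080/guide")))  # False: ':8080' breaks the match
```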

@sanders41
Collaborator

@bidoubiwa I have been looking into this some more and I still can't find anywhere that removes or changes the port. I also can't find anywhere that would treat localhost specifically as an issue. I currently have another theory based on some behavior I have seen while testing, but I'm not very familiar with the doc scraper, so I'm wondering if you can tell me if the following makes sense as a possibility.

If we take the MeiliSearch docs as an example, they are built with VuePress, so they are rendered as static pages on the live site, but when viewed in development they are rendered with JavaScript. It looks like the doc scraper is supposed to check whether a site is rendered with JavaScript and use Selenium if it is. In testing, when I print out the HTML that the scraper gets, it does not include the JS-rendered elements, and when I check js_render it always seems to be False.

So I'm wondering if the issue could be that the check for whether Selenium should be used is not working as it should, and therefore Selenium is not being used on dynamically generated sites?

@sanders41
Collaborator

sanders41 commented Mar 31, 2021

I looked through algolia's scraper and found some additional configuration settings I was able to use to get the scraper to work when the url includes a port and/or when the page is rendered with JS, so I added these settings to the README file. To scrape the MeiliSearch docs run locally with yarn dev, I used the following settings:

{
  "index_uid": "docs",
  "sitemap_urls": [],
  "start_urls": ["http://localhost:8080"],
  "js_render": true,
  "js_wait": 1,
  "allowed_domains": ["localhost"],
  "selectors": {
    "lvl0": {
      "selector": ".sidebar-heading.open",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": ".theme-default-content h1",
    "lvl2": ".theme-default-content h2",
    "lvl3": ".theme-default-content h3",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "text": ".theme-default-content p, .theme-default-content li"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

js_render and js_wait were used because the pages are generated with JS when using yarn dev, so the pages needed ChromeDriver in order to be rendered. "allowed_domains": ["localhost"] is the part that got the scraper to work when a port is included.
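A plausible mechanism for why the explicit allowed_domains setting helps (a sketch under the assumption that the offsite check compares request hostnames against allowed domains, as Scrapy's OffsiteMiddleware does): if the allowed domain is instead derived from the start url's netloc, the port sneaks into it, so no request hostname ever matches and everything gets filtered.

```python
from urllib.parse import urlparse

def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    # Hostnames from urlparse never include the port
    host = urlparse(url).hostname
    return host in allowed_domains

# Allowed domain derived from netloc: the port sneaks in, so nothing matches
print(is_allowed("http://localhost:8080/docs", ["localhost:8080"]))  # False

# Explicit "allowed_domains": ["localhost"] from the config above
print(is_allowed("http://localhost:8080/docs", ["localhost"]))       # True
```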

One thing I still have a question about is that a lot of the pages say 0 records. If I point the scraper at the live site's docs, removing the js_render, js_wait, and allowed_domains settings and keeping all other settings the same, I get the exact same results, so I think it is correct?

@bors bors bot closed this as completed in #112 (db012ed) Mar 31, 2021