
Docs-scraper not working when url has a port #103

Closed
bidoubiwa opened this issue Mar 1, 2021 · 5 comments · Fixed by #112
Labels
bug Something isn't working

Comments

@bidoubiwa
Contributor

bidoubiwa commented Mar 1, 2021

As of now, docs-scraper does not work on localhost. A workaround is to use a tool like ngrok.

Ngrok exposes local servers behind NATs and firewalls to the public internet over secure tunnels

The bug should be fixed.

When trying to scrape with localhost you will receive the following result:

Docs-Scraper: http://localhost:8080 (0 records)

Problem existed in docsearch-scraper as well: algolia/docsearch-scraper#461 (comment)

@sanders41
Collaborator

Based on the algolia issue discussion I don't think the issue is because of using localhost; instead it looks like the issue is caused any time a port is provided. So in the example here, http://localhost:8080, I think the 8080 is the problem and not the localhost. I think this also explains why ngrok works as a workaround: it gives you a url on the default port 80.

That said, I haven't yet found what in the code would be causing the issue. The urls use netloc from urlparse, which does include the port. On first look I don't see anywhere the port is removed or modified. Maybe it has something to do with the regular expressions applied to the urls?
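For reference, urlparse keeps the port inside netloc, so the stdlib parsing itself never strips it (a quick sanity check, not code from this repo):

```python
from urllib.parse import urlparse

o = urlparse("http://localhost:8080/docs/page.html?ref=1")
print(o.netloc)    # 'localhost:8080' -- the port stays in netloc
print(o.hostname)  # 'localhost'      -- hostname alone drops it
print(o.port)      # 8080

# A base-url reconstruction from scheme + netloc + path keeps the port as well
url_without_params = o.scheme + "://" + o.netloc + o.path
print(url_without_params)  # 'http://localhost:8080/docs/page.html'
```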

@bidoubiwa
Contributor Author

bidoubiwa commented Mar 29, 2021

Hey @sanders41
Thanks for the investigation. Could you link the related code in algolia? I would love to see it.

@bidoubiwa bidoubiwa changed the title Docs-scraper not working on localhost Docs-scraper not working when url has a port Mar 29, 2021
@sanders41
Collaborator

I have not looked through the algolia code yet. What led me to the port idea was:

It just how we parse the URL, using a port will broke everything. I would recommend you to use a local DNS

and

This is not a good practice to use a port when live, this is why we do not document it.

Thanks for sharing, I would recommend you to use the default 80 port and avoid to precise it.

From there, the places I started looking in this repo were:

return urlparse(url).netloc
and
url_without_params = o.scheme + "://" + o.netloc + o.path

I tried looking at references to these two places since they seem to be the main places getting the base url, but so far I haven't found anywhere that removes or changes the port information. There are a few places in the UrlsParser class that use regular expressions, but nothing in them jumped out as touching the port.
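To illustrate the kind of failure a url regular expression could introduce (a hypothetical pattern, not taken from the codebase): a pattern built from the bare hostname silently stops matching as soon as a port appears.

```python
import re

# Hypothetical allowed-url pattern built from the hostname only
pattern = re.compile(r"^https?://localhost(/|$)")

print(bool(pattern.match("http://localhost/guide")))       # True
print(bool(pattern.match("http://localhost:8080/guide")))  # False: ':8080' breaks the match
```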

@sanders41
Collaborator

@bidoubiwa I have been looking into this some more and I still can't find anywhere that removes or changes the port. I also can't find anywhere that would treat localhost specifically as an issue. I currently have another theory based on some behavior I have seen while testing, but I'm not very familiar with the doc scraper, so I'm wondering if you can tell me if the following makes sense as a possibility.

If we take the MeiliSearch docs as an example, they are built with VuePress, so they are rendered as static pages on the live site, but when viewed in development they are rendered with JavaScript. It looks like the doc scraper is supposed to check whether a site is rendered with JavaScript and use Selenium if it is. In testing, when I print out the HTML that the scraper gets, it does not include the JS-rendered elements, and when I check js_render it always seems to be False.

So I'm wondering if the issue could be that the check for whether Selenium should be used is not working as it should, and therefore Selenium is not being used on dynamically generated sites?

@sanders41
Collaborator

sanders41 commented Mar 31, 2021

I looked through algolia's scraper and found some additional configuration settings I was able to use to get the scraper to work when the url includes a port and/or when the page is rendered with JS, so I added these settings to the README file. To scrape the MeiliSearch docs run locally with yarn dev, I used the following settings:

{
  "index_uid": "docs",
  "sitemap_urls": [],
  "start_urls": ["http://localhost:8080"],
  "js_render": true,
  "js_wait": 1,
  "allowed_domains": ["localhost"],
  "selectors": {
    "lvl0": {
      "selector": ".sidebar-heading.open",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": ".theme-default-content h1",
    "lvl2": ".theme-default-content h2",
    "lvl3": ".theme-default-content h3",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "text": ".theme-default-content p, .theme-default-content li"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

js_render and js_wait were used because the pages are generated with JS when using yarn dev, so the pages needed ChromeDriver in order to be rendered. "allowed_domains": ["localhost"] is the part that got the scraper to work when a port is included.
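A plausible mechanism for why the explicit allowed_domains setting helps (a sketch under the assumption that the offsite check compares request hostnames against allowed domains, as Scrapy's OffsiteMiddleware does): if the allowed domain is instead derived from the start url's netloc, the port sneaks into it, so no request hostname ever matches and everything gets filtered.

```python
from urllib.parse import urlparse

def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    # Hostnames from urlparse never include the port
    host = urlparse(url).hostname
    return host in allowed_domains

# Allowed domain derived from netloc: the port sneaks in, so nothing matches
print(is_allowed("http://localhost:8080/docs", ["localhost:8080"]))  # False

# Explicit "allowed_domains": ["localhost"] from the config above
print(is_allowed("http://localhost:8080/docs", ["localhost"]))       # True
```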

One thing I still have a question about is that a lot of the pages say 0 records. If I point the scraper at the live site's docs, removing the js_render, js_wait, and allowed_domains settings and keeping all other settings the same, I get the exact same results, so I think it is correct?

@bors bors bot closed this as completed in #112 (db012ed) Mar 31, 2021