Add `webdriver-manager` for managing chrome drivers #161

kinshukdua · 2021-10-06T09:32:22Z

Fixes #139
The library webdriver-manager is used to automatically download the correct version of ChromeDriver in case no path is passed in the environment variable.
The library will also automatically upgrade the drivers in case Chrome is updated. To maintain backwards compatibility or to use a specific version of the drivers, the user can always override it by setting the environment variable.

bidoubiwa

I'm wondering if we should force the download of the ChromeDriver to the user. @sanders41 do you have any opinion on the subject?

Pipfile

sanders41 · 2021-10-06T14:14:03Z

I'm wondering if we should force the download of the ChromeDriver to the user. @sanders41 do you have any opinion on the subject?

There are couple concerns I would have. What happens if a user already has chromedriver installed, but doesn't pass a path? In that case CHROMEDRIVER_PATH is going to be None, so webdriver-manager is going to install chromedriver. What happens in this case? Does this cause a conflict with the two version? I suspect Linux package managers, Homebrew, Chocolatey on Windows, etc aren't going to like this.

The second concern is at many companies, software like chromedriver has to be approved and installed from an approved list. In those cases the docs-scraper would be blocked from installing chromedriver and therefore fail making it unusable. In my dealings with this situation it would be easier to get chromedriver added to an approved list separately. Getting it approved because a Python package wants to install a binary could be seen as a security risk and be more likely to get denied.

bidoubiwa · 2021-10-06T14:45:23Z

Thanks a lot @sanders41. I 100% agree.

So what I suggest is:
If we cannot find the chromedriver of the user

We output to the user that we could not find their chromedriver. That it may be because the path is incorrect or because they do not have chromedriver
We open a prompt asking them if they want to download the chromedriver Y/n
In case of yes, we download it. It can always be blocked by the approved list. In which case we simply throw an error stating we could not download the driver
In case of no, we can shut down the script after telling him to re-run it with the correct path or remove the render_js option

What do you think?

kinshukdua · 2021-10-06T14:47:03Z

@sanders41 If a user already has chromedriver it doesn't cause a conflict since chromedriver is distributed as a portable binary so it doesn't need to be installed. You just have to download the required binary from the source, save it and pass the path to the python program. This is what the library automates. I'm not sure if a python package downloading a binary would be blocked or not in production environments but installation shouldn't be an issue.

sanders41 · 2021-10-06T14:54:08Z

@bidoubiwa your plan sounds good to me. I think by not forcing the download it prevents the issues I can think of, but would still make it an option if the user wants it giving an easier path.

bidoubiwa · 2021-10-06T14:55:12Z

@kinshukdua since we answered at the same moment, i'm not sure you saw my comment just above yours!

kinshukdua · 2021-10-06T14:57:59Z

@bidoubiwa I did, I agree with your suggestion, I'll add a new commit for the same. It just has to be a CLI prompt right?

bidoubiwa · 2021-10-06T15:15:49Z

Yes!

kinshukdua · 2021-10-07T07:45:30Z

@bidoubiwa I've added the requested changes, please review and let me know if anything needs to be changed.

bidoubiwa · 2021-10-07T08:34:05Z

Any idea why the tests are failing?

kinshukdua · 2021-10-07T12:20:25Z

@bidoubiwa They were failing because during testing, the BrowserHandler is called without CHROMEDRIVER_PATH being set so it starts the interactive prompt which pytest cannot provide an input to.
I've fixed this by automatically adding input using monkeypatch.setattr.
One test was failing because of linting issue which was fixed replacing exit with sys.exit

bidoubiwa · 2021-10-11T10:14:50Z

I will be reviewing this PR this week :)

kinshukdua · 2021-10-11T17:17:49Z

@bidoubiwa Thanks!

bidoubiwa

Okey so I found a few problems!

When running the script two times, both time it suggest to download the chromedriver.
In both cases, the scraping fails:

> Docs-Scraper: https://docs.meilisearch.com/learn/what_is_meilisearch/sdks.html 0 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/getting_started/installation.html 0 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/core_concepts/ 0 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/getting_started/quick_start.html 0 records)

I tried on the main branch, it still works with js_render and it renders correctly

> Docs-Scraper: https://docs.meilisearch.com/create/how_to/gcp.html 74 records)
> Docs-Scraper: https://docs.meilisearch.com/create/how_to/aws.html 83 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/what_is_meilisearch/sdks.html 43 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/getting_started/installation.html 37 records)

The problem is because you created you PR from your main branch you will have a hard time going back to this main branch.
I would suggest you reopen a PR with your PR on a branch by following these steps. If you prefer not, I suggest you re-clone this project in another directory.

Here is a test config file to see if your PR works:

{
  "index_uid": "docs",
  "sitemap_urls": ["https://docs.meilisearch.com/sitemap.xml"],
  "start_urls": ["https://docs.meilisearch.com"],
  "js_render": true,
  "selectors": {
    "lvl0": {
      "selector": ".sidebar-heading.open",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": ".theme-default-content h1",
    "lvl2": ".theme-default-content h2",
    "lvl3": ".theme-default-content h3",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "text": ".theme-default-content p, .theme-default-content li"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

README.md

scraper/src/config/browser_handler.py

kinshukdua · 2021-10-15T13:23:17Z

Hi, @bidoubiwa I've opened a new PR #165 on a new branch.

bidoubiwa · 2021-10-18T11:04:32Z

closed in favor of #165

kinshukdua added 2 commits October 6, 2021 14:53

Add webdriver_manager for chrome

ebdaab2

Fixed typo in README

26ea287

curquiza requested a review from bidoubiwa October 6, 2021 10:05

bidoubiwa suggested changes Oct 6, 2021

View reviewed changes

Pipfile Outdated Show resolved Hide resolved

Add specific version to webdriver-manager in Pipfile

e5519bf

Add chromdriver download prompt

3da4909

Fix Pytest and Linting errors

4b23d8b

bidoubiwa suggested changes Oct 14, 2021

View reviewed changes

README.md Show resolved Hide resolved

scraper/src/config/browser_handler.py Show resolved Hide resolved

bidoubiwa closed this Oct 18, 2021

Add webdriver-manager for managing chrome drivers #161

Add webdriver-manager for managing chrome drivers #161

Uh oh!

Conversation

kinshukdua commented Oct 6, 2021

Uh oh!

bidoubiwa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sanders41 commented Oct 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bidoubiwa commented Oct 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kinshukdua commented Oct 6, 2021

Uh oh!

sanders41 commented Oct 6, 2021

Uh oh!

bidoubiwa commented Oct 6, 2021

Uh oh!

kinshukdua commented Oct 6, 2021

Uh oh!

bidoubiwa commented Oct 6, 2021

Uh oh!

kinshukdua commented Oct 7, 2021

Uh oh!

bidoubiwa commented Oct 7, 2021

Uh oh!

kinshukdua commented Oct 7, 2021

Uh oh!

bidoubiwa commented Oct 11, 2021

Uh oh!

kinshukdua commented Oct 11, 2021

Uh oh!

bidoubiwa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kinshukdua commented Oct 15, 2021

Uh oh!

bidoubiwa commented Oct 18, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add `webdriver-manager` for managing chrome drivers #161

Add `webdriver-manager` for managing chrome drivers #161

sanders41 commented Oct 6, 2021 •

edited

Loading

bidoubiwa commented Oct 6, 2021 •

edited

Loading