Skip to content

Conversation

@kinshukdua
Copy link

Fixes #139
The library webdriver-manager is used to automatically download the correct version of ChromeDriver in case no path is passed in the environment variable.
The library will also automatically upgrade the drivers in case Chrome is updated. To maintain backwards compatibility or to use a specific version of the drivers, the user can always override it by setting the environment variable.

@curquiza curquiza requested a review from bidoubiwa October 6, 2021 10:05
Copy link
Contributor

@bidoubiwa bidoubiwa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should force the download of the ChromeDriver to the user. @sanders41 do you have any opinion on the subject?

@sanders41
Copy link
Collaborator

sanders41 commented Oct 6, 2021

I'm wondering if we should force the download of the ChromeDriver to the user. @sanders41 do you have any opinion on the subject?

There are couple concerns I would have. What happens if a user already has chromedriver installed, but doesn't pass a path? In that case CHROMEDRIVER_PATH is going to be None, so webdriver-manager is going to install chromedriver. What happens in this case? Does this cause a conflict with the two version? I suspect Linux package managers, Homebrew, Chocolatey on Windows, etc aren't going to like this.

The second concern is at many companies, software like chromedriver has to be approved and installed from an approved list. In those cases the docs-scraper would be blocked from installing chromedriver and therefore fail making it unusable. In my dealings with this situation it would be easier to get chromedriver added to an approved list separately. Getting it approved because a Python package wants to install a binary could be seen as a security risk and be more likely to get denied.

@bidoubiwa
Copy link
Contributor

bidoubiwa commented Oct 6, 2021

Thanks a lot @sanders41. I 100% agree.

So what I suggest is:
If we cannot find the chromedriver of the user

  • We output to the user that we could not find their chromedriver. That it may be because the path is incorrect or because they do not have chromedriver
  • We open a prompt asking them if they want to download the chromedriver Y/n
  • In case of yes, we download it. It can always be blocked by the approved list. In which case we simply throw an error stating we could not download the driver
  • In case of no, we can shut down the script after telling him to re-run it with the correct path or remove the render_js option

What do you think?

@kinshukdua
Copy link
Author

@sanders41 If a user already has chromedriver it doesn't cause a conflict since chromedriver is distributed as a portable binary so it doesn't need to be installed. You just have to download the required binary from the source, save it and pass the path to the python program. This is what the library automates. I'm not sure if a python package downloading a binary would be blocked or not in production environments but installation shouldn't be an issue.

@sanders41
Copy link
Collaborator

@bidoubiwa your plan sounds good to me. I think by not forcing the download it prevents the issues I can think of, but would still make it an option if the user wants it giving an easier path.

@bidoubiwa
Copy link
Contributor

@kinshukdua since we answered at the same moment, i'm not sure you saw my comment just above yours!

@kinshukdua
Copy link
Author

@bidoubiwa I did, I agree with your suggestion, I'll add a new commit for the same. It just has to be a CLI prompt right?

@bidoubiwa
Copy link
Contributor

Yes!

@kinshukdua
Copy link
Author

@bidoubiwa I've added the requested changes, please review and let me know if anything needs to be changed.

@bidoubiwa
Copy link
Contributor

Any idea why the tests are failing?

@kinshukdua
Copy link
Author

@bidoubiwa They were failing because during testing, the BrowserHandler is called without CHROMEDRIVER_PATH being set so it starts the interactive prompt which pytest cannot provide an input to.
I've fixed this by automatically adding input using monkeypatch.setattr.
One test was failing because of linting issue which was fixed replacing exit with sys.exit

@bidoubiwa
Copy link
Contributor

I will be reviewing this PR this week :)

@kinshukdua
Copy link
Author

@bidoubiwa Thanks!

Copy link
Contributor

@bidoubiwa bidoubiwa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okey so I found a few problems!

  1. When running the script two times, both time it suggest to download the chromedriver.

  2. In both cases, the scraping fails:

> Docs-Scraper: https://docs.meilisearch.com/learn/what_is_meilisearch/sdks.html 0 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/getting_started/installation.html 0 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/core_concepts/ 0 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/getting_started/quick_start.html 0 records)
  • I tried on the main branch, it still works with js_render and it renders correctly
> Docs-Scraper: https://docs.meilisearch.com/create/how_to/gcp.html 74 records)
> Docs-Scraper: https://docs.meilisearch.com/create/how_to/aws.html 83 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/what_is_meilisearch/sdks.html 43 records)
> Docs-Scraper: https://docs.meilisearch.com/learn/getting_started/installation.html 37 records)

The problem is because you created you PR from your main branch you will have a hard time going back to this main branch.
I would suggest you reopen a PR with your PR on a branch by following these steps. If you prefer not, I suggest you re-clone this project in another directory.

Here is a test config file to see if your PR works:

{
  "index_uid": "docs",
  "sitemap_urls": ["https://docs.meilisearch.com/sitemap.xml"],
  "start_urls": ["https://docs.meilisearch.com"],
  "js_render": true,
  "selectors": {
    "lvl0": {
      "selector": ".sidebar-heading.open",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": ".theme-default-content h1",
    "lvl2": ".theme-default-content h2",
    "lvl3": ".theme-default-content h3",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "text": ".theme-default-content p, .theme-default-content li"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

@kinshukdua
Copy link
Author

Hi, @bidoubiwa I've opened a new PR #165 on a new branch.

@bidoubiwa
Copy link
Contributor

closed in favor of #165

@bidoubiwa bidoubiwa closed this Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exception: Env CHROMEDRIVER_PATH='/usr/bin/chromedriver' is not a path to a file

3 participants