Closed
1 change: 1 addition & 0 deletions Pipfile
@@ -10,6 +10,7 @@ pytest = "==6.2.3"
 meilisearch = "==0.16.1"
 requests-iap = "==0.2.0"
 python-keycloak-client = "==0.2.3"
+webdriver-manager = "==3.4.2"
Review comment (Member):

You could update to webdriver-manager = "==3.5.2" 🎉


 [dev-packages]
 pylint = "==2.8.2"
274 changes: 174 additions & 100 deletions Pipfile.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion README.md
@@ -466,7 +466,7 @@ If used, `min_indexed_level` is ignored.

 When `js_render` is set to `true`, the scraper will use ChromeDriver. This is needed for pages that are rendered with JavaScript, for example, pages generated with React, Vue, or applications that are running in development mode: `autoreload` `watch`.

-After installing ChromeDriver, provide the path to the bin using the following environment variable `CHROMEDRIVER_PATH` (default value is `/usr/bin/chromedriver`).
+After installing ChromeDriver, provide the path to the bin using the following environment variable `CHROMEDRIVER_PATH`. If the variable is not set, the scraper will automatically download and use a compatible version of ChromeDriver.
Review comment (Member):

Suggested change
-After installing ChromeDriver, provide the path to the bin using the following environment variable `CHROMEDRIVER_PATH`. If the variable is not set, the scraper will automatically download and use a compatible version of ChromeDriver.
+After installing ChromeDriver, provide the path to the bin using the following environment variable `CHROMEDRIVER_PATH`. If the variable is not set, the scraper will ask for permission to download and use a compatible version of ChromeDriver.


 The default value of `js_render` is `false`.

45 changes: 39 additions & 6 deletions scraper/src/config/browser_handler.py
@@ -1,8 +1,11 @@
 import re
 import os
+import sys
+from distutils.util import strtobool
 from selenium import webdriver

 from selenium.webdriver.chrome.options import Options
+from webdriver_manager.chrome import ChromeDriverManager
 from ..custom_downloader_middleware import CustomDownloaderMiddleware
 from ..js_executor import JsExecutor

@@ -26,12 +29,42 @@ def init(config_original_content, js_render, user_agent):
             chrome_options.add_argument('--headless')
             chrome_options.add_argument('user-agent={0}'.format(user_agent))

-            CHROMEDRIVER_PATH = os.environ.get('CHROMEDRIVER_PATH',
-                                               "/usr/bin/chromedriver")
-            if not os.path.isfile(CHROMEDRIVER_PATH):
-                raise Exception(
-                    "Env CHROMEDRIVER_PATH='{}' is not a path to a file".format(
-                        CHROMEDRIVER_PATH))
+            CHROMEDRIVER_PATH = os.environ.get('CHROMEDRIVER_PATH', '')
+            if not CHROMEDRIVER_PATH or not os.path.isfile(CHROMEDRIVER_PATH):
+                print("Could not find ChromeDriver.")
+                print("Either the Env CHROMEDRIVER_PATH='{}' path is incorrect or "
+                      "ChromeDriver is not installed.".format(CHROMEDRIVER_PATH))
+                print("Do you want to automatically download ChromeDriver?")
+                while(True):
+                    user_input = input("[Y/n]: ")
Review comment (Member):

I would use only "[y/n]: ", because for Linux users an uppercase Y means that pressing Enter selects "Yes" by default.

Like this: [image]
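
A minimal sketch of the kind of prompt this comment asks for, assuming a lowercase "[y/n]" hint with no default, so that pressing Enter alone asks again. The helper name `ask_yes_no` and its `read` parameter are hypothetical and not part of the PR. The sketch parses the answer itself rather than calling `distutils.util.strtobool`, because `distutils` was deprecated in Python 3.10 and removed in Python 3.12.

```python
def ask_yes_no(question, read=input):
    """Ask until the user types an explicit yes or no; there is no default.

    `read` is a hypothetical injection point so the prompt can be tested
    without patching builtins.input.
    """
    truthy = {"y", "yes"}
    falsy = {"n", "no"}
    while True:
        answer = read("{} [y/n]: ".format(question)).strip().lower()
        if answer in truthy:
            return True
        if answer in falsy:
            return False
        # Empty input or anything else is rejected and the question repeats.
        print("Please enter a valid input.")
```

Because the reader function is a parameter, a test could pass `read=lambda _: "y"` directly.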

+                    try:
+                        yes = strtobool(user_input)
+                        break
+                    except ValueError:
+                        print("Please enter a valid input.")
+                        continue
+                if yes:
+                    try:
+                        CHROMEDRIVER_PATH = ChromeDriverManager().install()
+
Review comment (Member):

Suggested change
os.chmod(CHROMEDRIVER_PATH, 0o777)

I tested locally using Docker and couldn't use the driver because the downloaded file wasn't executable. I'm not sure this is the best way to handle it, but it worked in my Docker environment.
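
One possible alternative to `0o777`, sketched under the assumption that the downloaded driver only lacks the execute bit: add execute permission to the file's existing mode instead of also making it world-writable. The helper name `make_executable` is hypothetical; the PR does not define it.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group and others, keeping all other bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

In the PR's flow this could be called as `make_executable(CHROMEDRIVER_PATH)` right after `ChromeDriverManager().install()` returns.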

+                    except Exception as e:
+                        print("Could not download ChromeDriver. "
+                              "Please install ChromeDriver manually.")
+                        print(e)
+                        if sys.platform == "linux" or sys.platform == "darwin":
+                            os.system('read -s -n 1 -p "Press any key to continue..."')
+                        if sys.platform == "win32":
+                            os.system('pause')
Comment on lines +54 to +57
Review comment (Member):

I think you could remove these lines :)

+                        sys.exit(1)
+                else:
+                    print("Please install ChromeDriver and set the CHROMEDRIVER_PATH "
+                          "environment variable or remove the render_js option.")
+                    if sys.platform == "linux" or sys.platform == "darwin":
+                        os.system('read -s -n 1 -p "Press any key to continue..."')
+                    if sys.platform == "win32":
+                        os.system('pause')
Comment on lines +62 to +65
Review comment (Member):

These too.

When I answered "n" at the prompt, I got the message followed by a shell error. Looking at these lines, they don't add much value to your implementation, since the idea is simply to end the execution flow.

Please install ChromeDriver and set the CHROMEDRIVER_PATH environment variable or remove the render_js option.
/bin/sh: 1: read: Illegal option -s
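
The error is consistent with how `os.system` works on POSIX systems: the command runs through `/bin/sh`, and the `-s`, `-n` and `-p` options of `read` are bash extensions that a plain POSIX shell such as dash rejects. If the pause is dropped as suggested, the exit path could be as small as the sketch below. The helper name `abort` is hypothetical and not part of the PR.

```python
import sys


def abort(message):
    """End the run without pausing.

    Passing a string to sys.exit() prints it to stderr and exits with status 1.
    """
    sys.exit(message)
```

Both branches in the diff could then end with a single `abort(...)` call in place of the `print`, `os.system` and `sys.exit(1)` sequence.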

+                    sys.exit(1)
+
             driver = webdriver.Chrome(
                 CHROMEDRIVER_PATH,
                 options=chrome_options)
3 changes: 3 additions & 0 deletions tests/config_loader/get_extra_facets_test.py
@@ -19,6 +19,7 @@ def test_extra_facets_should_be_set_from_start_urls_variables_browser(self,
             monkeypatch):
         monkeypatch.setattr("selenium.webdriver.chrome",
                             lambda x: MockedInit())
+        monkeypatch.setattr('builtins.input', lambda _: "y")
Review comment (Member):

What about adding an "auto-yes" option, like `apt-get install -y`? It will be needed when we run in CI or other automated environments.
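
A rough sketch of what such an "auto-yes" switch might look like, assuming it is driven by an environment variable. The variable name `CHROMEDRIVER_AUTO_DOWNLOAD`, the function name `should_download` and its parameters are all hypothetical; nothing in the PR defines them.

```python
import os


def should_download(prompt, environ=os.environ, read=input):
    """Decide whether to download ChromeDriver.

    Returns True without prompting when the hypothetical opt-in variable
    CHROMEDRIVER_AUTO_DOWNLOAD is set to a truthy value; otherwise asks once.
    """
    flag = environ.get("CHROMEDRIVER_AUTO_DOWNLOAD", "").strip().lower()
    if flag in {"1", "true", "yes", "y"}:
        return True
    return read(prompt).strip().lower() in {"y", "yes"}
```

With a switch like this, a CI job would set the variable once, and the tests would no longer need to patch `builtins.input`.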


         c = config({
             "start_urls": [
@@ -43,6 +44,7 @@ def test_extra_facets_should_be_set_from_start_urls_variables_with_two_start_url
             self, monkeypatch):
         monkeypatch.setattr("selenium.webdriver.chrome",
                             lambda x: MockedInit())
+        monkeypatch.setattr('builtins.input', lambda _: "y")

         c = config({
             "js-render": True,
@@ -74,6 +76,7 @@ def test_extra_facets_should_be_set_from_start_urls_variables_with_multiple_tags
             self, monkeypatch):
         monkeypatch.setattr("selenium.webdriver.chrome",
                             lambda x: MockedInit())
+        monkeypatch.setattr('builtins.input', lambda _: "y")

         c = config({
             "start_urls": [
2 changes: 2 additions & 0 deletions tests/config_loader/open_selenium_browser_test.py
@@ -21,6 +21,7 @@ def test_browser_not_needed_by_default(self):
     def test_browser_needed_when_js_render_true(self, monkeypatch):
         monkeypatch.setattr("selenium.webdriver.chrome",
                             lambda x: MockedInit())
+        monkeypatch.setattr('builtins.input', lambda _: "y")
         # When
         c = config({
             "js_render": True
@@ -37,6 +38,7 @@ def test_browser_needed_when_config_contains_automatic_tag(self,
             monkeypatch):
         monkeypatch.setattr("selenium.webdriver.chrome",
                             lambda x: MockedInit())
+        monkeypatch.setattr('builtins.input', lambda _: "y")

         # When
         c = config({
1 change: 1 addition & 0 deletions tests/config_loader/start_urls_test.py
@@ -75,6 +75,7 @@ def test_start_urls_should_be_generated_when_there_is_automatic_tagging_browser(
             self, monkeypatch):
         monkeypatch.setattr("selenium.webdriver.chrome",
                             lambda x: MockedInit())
+        monkeypatch.setattr('builtins.input', lambda _: "y")

         # When
         c = config({