Exception: Env CHROMEDRIVER_PATH='/usr/bin/chromedriver' is not a path to a file #139

Closed
linlai163 opened this issue Jul 1, 2021 · 11 comments

Comments

@linlai163

Hey guys.
I'm having trouble using the Docker scraper. I tried it on macOS and Ubuntu, and every time I run the Docker image it tells me that chromedriver is not a file.
But I can run the scraper manually.
Can you help me resolve it? Is chromedriver expected to be in the Docker image or on the local machine?

@bidoubiwa
Contributor

Hey @ssaylo, thanks for your feedback :) could you show me your docs-scraper configuration file and your docker command?

@linlai163
Author

@bidoubiwa
Hey, this is my config.json file. The site uses React, so it needs js_render and js_wait.

{
  "index_uid": "trantor_docs",
  "start_urls": [
    {
      "url": "https://trantor-docs-dev.app.terminus.io/"
    }
  ],
  "selectors": {
    "lvl0": ".ant-page-header-heading-title",
    "lvl1": ".ant-card-body h1",
    "lvl2": ".ant-card-body h2",
    "lvl3": ".ant-card-body h3",
    "lvl4": ".ant-card-body h4",
    "lvl5": ".ant-card-body h5",
    "lvl6": ".ant-card-body h6",
    "text": ".ant-card-body p, .ant-card-body a, .ant-card-body li, .ant-card-body td, .ant-card-body code span, .antd-card-body code, .antd-card-body pre, .antd-card-body strong, .antd-card-body a, .antd-card-body"
  },
  "js_render": true,
  "js_wait": 1
}

And these are the docker commands I ran:

docker run -t --rm --network=host \
    -e MEILISEARCH_HOST_URL=127.0.0.1:80 \
    -e MEILISEARCH_API_KEY=myMasterKey \
    -e CHROMEDRIVER_PATH=/usr/local/bin/chromedriver \
    -v /Users/ssaylo/Company/docs-scraper/config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:v0.9.5 pipenv run ./docs_scraper config.json

docker run -t --rm \
    -e MEILISEARCH_HOST_URL=127.0.0.1:80 \
    -e MEILISEARCH_API_KEY=myMasterKey \
    -v /Users/ssaylo/Company/docs-scraper/config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

This problem has bothered me for a whole day and I don't know how to resolve it. Thank you for your help.

@bidoubiwa
Contributor

In your config file you are using the following options:

  "js_render": true,
  "js_wait": 1

These options require a downloaded Chrome binary. You probably already have it on your computer at the default path that docs-scraper expects, but the container cannot see files on your host. This means you should install ChromeDriver inside your Docker image.

See the docs-scraper documentation:

When js_render is set to true, the scraper will use ChromeDriver. This is needed for pages that are rendered with JavaScript, for example, pages generated with React, Vue, or applications that are running in development mode: autoreload watch.

After installing ChromeDriver, provide the path to the bin using the following environment variable CHROMEDRIVER_PATH (default value is /usr/bin/chromedriver).
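The error in the issue title is consistent with a check along the following lines. This is only a minimal sketch, not the actual docs-scraper source: the function names `needs_chromedriver` and `resolve_chromedriver_path` are hypothetical, while the variable name `CHROMEDRIVER_PATH`, the default `/usr/bin/chromedriver`, and the message wording are taken from this thread.

```python
import os

# Default quoted in the documentation excerpt above.
DEFAULT_CHROMEDRIVER_PATH = "/usr/bin/chromedriver"


def needs_chromedriver(config: dict) -> bool:
    """Return True when the config asks for JavaScript rendering."""
    return bool(config.get("js_render", False))


def resolve_chromedriver_path(environ=None) -> str:
    """Hypothetical helper: read CHROMEDRIVER_PATH and require a real file.

    The path is checked on the filesystem of the process that runs the
    scraper, so inside Docker it must exist in the container, not on the host.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("CHROMEDRIVER_PATH", DEFAULT_CHROMEDRIVER_PATH)
    if not os.path.isfile(path):
        raise Exception(f"Env CHROMEDRIVER_PATH='{path}' is not a path to a file")
    return path
```

Under this assumption, passing `-e CHROMEDRIVER_PATH=/usr/local/bin/chromedriver` to `docker run` only changes which path is checked inside the container; it does not make a host binary visible there.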

@linlai163
Author

The problem is that I have already installed Chrome and ChromeDriver, and I also set the Docker environment variable with -e CHROMEDRIVER_PATH=/usr/local/bin/chromedriver, but it isn't working. So sad.

@sanders41
Collaborator

Docker won't be able to use the ChromeDriver from your local environment. You could install it into the container by updating the apt-get section in the Dockerfile:

RUN apt-get update -y \
  && apt-get install -y python3-pip \
  && apt-get install -y chromium

Then build the image with docker build -t doc-scraper-chrome . and run your custom image. I think this should work, but I haven't had a chance to test it. @bidoubiwa please correct me if I'm wrong about this.
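One way to see what the scraper will actually find is to look for the binary from inside the container rather than on the host. The sketch below is an illustration only: `find_chromedriver` is a hypothetical helper, and the fallback directories are simply the two locations mentioned in this thread.

```python
import os
import shutil


def find_chromedriver(name="chromedriver", extra_dirs=("/usr/bin", "/usr/local/bin")):
    """Return the first executable called `name`, or None if there is none.

    Looks on PATH first, then in the fallback directories.
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        # The file must exist and carry the executable bit.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_chromedriver()
    print(path if path else "chromedriver not found in this environment")
```

If this prints a path on the host but reports nothing when run in the container, the image itself is missing the driver.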

@linlai163
Author

@sanders41 Much appreciated.

@linlai163
Author

@sanders41 The solution you suggested didn't work because it also needs chromedriver...
@bidoubiwa Finally, I found a solution. The problem is that the Docker base image, python:3.8.4-buster, doesn't include Chrome or ChromeDriver, so if you use js_render you must install both manually. For convenience, I used Algolia's base image, which solved it.
Thanks again.

FROM algolia/docsearch-scraper-base

WORKDIR /docs-scraper

COPY . .

ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8

RUN apt-get update -y \
    && apt-get install -y python3-pip
RUN pip3 install pipenv
RUN pipenv --python 3.6 install

Or you could add these to your Dockerfile:

# Install selenium
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN true

RUN useradd -d /home/seleuser -m seleuser
RUN chown -R seleuser /home/seleuser
RUN chgrp -R seleuser /home/seleuser

RUN apt-get update -y && apt-get install -yq \
    software-properties-common\
    python3.7
RUN add-apt-repository -y ppa:openjdk-r/ppa
RUN apt-get update -y && apt-get install -yq \
    curl \
    wget \
    sudo \
    gnupg \
    && curl -sL https://deb.nodesource.com/setup_8.x | sudo bash -
RUN apt-get update -y && apt-get install -yq \
    nodejs -yq
RUN apt-get update -y && apt-get install -yq \
  unzip \
  xvfb \
  libxi6 \
  libgconf-2-4 \
  default-jdk

RUN curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
RUN echo "deb [arch=amd64]  http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
RUN apt-get update -y && apt-get install -yq \
  google-chrome-stable=85.0.4183.102-1 \
  unzip
RUN wget -q https://chromedriver.storage.googleapis.com/85.0.4183.83/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip

RUN mv chromedriver /usr/bin/chromedriver
RUN chown root:root /usr/bin/chromedriver
RUN chmod +x /usr/bin/chromedriver

RUN wget -q https://selenium-release.storage.googleapis.com/3.13/selenium-server-standalone-3.13.0.jar
RUN wget -q http://www.java2s.com/Code/JarDownload/testng/testng-6.8.7.jar.zip
RUN unzip testng-6.8.7.jar.zip

# Install DocSearch dependencies
COPY Pipfile .
COPY Pipfile.lock .

ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8
ENV PIPENV_HIDE_EMOJIS 1
RUN apt-get update -y && apt-get install -yq \
    python3-pip
RUN pip3 install pipenv
RUN pipenv install --python 3.6

@sanders41
Collaborator

I'm surprised chromium didn't work; I've used it in place of Chrome with ChromeDriver before without issue. Maybe setting it as an executable made the difference. Either way, glad you got it working.

@bidoubiwa would it be worth adding chromedriver to the container by default? Seems to be more and more common to have JS rendered pages, but maybe not so much for docs specifically. I guess it's really an ease of use vs container size question.

@bidoubiwa
Contributor

bidoubiwa commented Oct 4, 2021

Sorry for the late answer, it flew under the radar 🙈

ChromeDriver is 16 MB, and it also needs to match the user's Chrome version. So if my Chrome is at 9.X and my ChromeDriver is at 9.Y, it throws an error.
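The version constraint can be sketched as a comparison of leading version components. This is an illustration under stated assumptions: `version_tuple` and `versions_match` are hypothetical helpers, and how many components must agree (`parts=3` here) is an assumption rather than something this thread specifies. The sample strings reuse the version numbers pinned in the Dockerfile posted above.

```python
import re


def version_tuple(version_output: str) -> tuple:
    """Extract a dotted version such as 85.0.4183.83 from a line of text."""
    match = re.search(r"\d+(?:\.\d+)+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def versions_match(chrome_output: str, chromedriver_output: str, parts: int = 3) -> bool:
    """Compare the first `parts` components (the default is an assumption)."""
    chrome = version_tuple(chrome_output)[:parts]
    driver = version_tuple(chromedriver_output)[:parts]
    return chrome == driver


if __name__ == "__main__":
    chrome = "Google Chrome 85.0.4183.102"
    driver = "ChromeDriver 85.0.4183.83"
    print(versions_match(chrome, driver))
```

A check like this could run at image build time so that a mismatch fails early instead of surfacing as a runtime error in the scraper.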

Alternatively we can:

Use the Algolia base image in our Dockerfile, since this project is after all based on their repo. @curquiza what do you think?
FROM python:3.8.4-buster becomes FROM algolia/docsearch-scraper-base in the Dockerfile.

Or pin this issue for future users.

Additionally, we should add some documentation to the ##js-wait section of the docs.

@mdraevich
Contributor

Hello,
If I'm not mistaken, the problem has been solved in this discussion, but there is no accepted pull request that resolves it completely.
Any updates?

@brunoocasali
Member

Hey people, I will look into it during this week :D

bors bot added a commit that referenced this issue Feb 10, 2022
184: Add libnss3 package to Dockerfile r=brunoocasali a=brunoocasali

Following the discussions about this issue #139 and after running this #165 locally, I had some trouble using `chrome_webdriver` because my `Dockerfile` didn't have this package.

This is not a fix for the mentioned issue; it is just a small part of it!

Co-authored-by: Bruno Casali <brunoocasali@gmail.com>
5 participants