This repository has been archived by the owner on Jan 22, 2020. It is now read-only.

Python Documentation #42

Open
tegansnyder opened this issue Dec 31, 2015 · 0 comments


It might be worth noting that you need a few things on your system to get this working for the Python example.

Python Modules:

You will receive this error if you try to run the example without installing a few modules:

  File "crawl_executor.py", line 25, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4

Install the following:

sudo pip install wget
sudo pip install beautifulsoup4
sudo pip install html5lib
sudo yum install -y libxml2-devel
sudo yum install -y libxslt-devel
sudo yum install -y python-devel
sudo pip install lxml
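
A quick way to confirm the installs above took effect is to try importing each module before launching the executors. The sketch below is only an illustration: the `missing_modules` helper and the `REQUIRED_MODULES` list are hypothetical names, and the list assumes the import names match the pip packages above (`beautifulsoup4` is imported as `bs4`).

```python
import importlib

# Import names assumed from the pip commands above (beautifulsoup4 -> bs4).
REQUIRED_MODULES = ["wget", "bs4", "html5lib", "lxml"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Same error class as in the traceback above.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this on the machine that hosts the executors should list anything still missing without having to start a crawl first.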

PhantomJS:

If the phantomjs executable is not installed or not on your PATH, you will get errors like the following:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "render_executor.py", line 62, in run_task
    if call(["phantomjs", "render.js", url, destination]) != 0:
  File "/usr/lib64/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

To resolve that, you need a PhantomJS binary on your PATH. If you can find a prebuilt binary for your Linux distro, go with that. I used a binary I found for CentOS 7 here. Note that there are some issues with bundling binaries for PhantomJS; see the thread here. If you must build from source, follow the steps below. The build can take an hour or so.

# packages needed to build phantomjs from source
sudo yum -y install gcc gcc-c++ make flex bison gperf ruby \
  openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel \
  libpng-devel libjpeg-devel

git clone --recurse-submodules https://github.com/ariya/phantomjs.git
cd phantomjs
./build.py
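
Once a binary is in place, a small check before the `call([...])` in render_executor.py could turn the bare `OSError: [Errno 2]` into a clearer message. This is a rough sketch rather than the project's code: `find_executable` and `require_phantomjs` are hypothetical helpers, and it assumes the binary is expected to be found through PATH under the name `phantomjs`, as the traceback suggests.

```python
import os


def find_executable(name):
    """Return the full path of `name` if it is an executable file on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def require_phantomjs():
    """Return the path to phantomjs, or raise with a readable message if it is missing."""
    path = find_executable("phantomjs")
    if path is None:
        raise RuntimeError(
            "phantomjs not found on PATH; install a binary or build it from source"
        )
    return path
```

Calling `require_phantomjs()` once at executor startup would surface the problem immediately instead of inside a worker thread.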

Parser Warning on BS4:

Also, the executor throws a warning about not explicitly specifying the parser for BS4, and this appears to halt the script.

Executor registered on slave 586d51bc-408a-4191-bce7-8527a6c0f2f4-S0
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this (See PR #41):

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")