
Installation

Notes:

  • Hyphe is built to run on a limited list of GNU/Linux distributions. Docker can be used to install Hyphe locally on other systems, including macOS; see the Docker doc. Making it work on Windows might be feasible but is not supported.
  • MongoDB is limited to 2 GB databases on 32-bit systems, so we recommend always installing Hyphe on a 64-bit machine.
  • Do not add sudo to any of the following example commands. Every line of shell written here should be run from Hyphe's root directory, and sudo should only be used when explicitly listed.
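
To check the 64-bit recommendation before going further, a minimal sketch along these lines can help. The helper name arch_is_64bit and the list of architecture names are assumptions made for illustration; they cover common cases only and are not part of Hyphe.

```shell
# Classify an architecture string as reported by `uname -m`.
# arch_is_64bit is a hypothetical helper; the list of names is not exhaustive.
arch_is_64bit() {
  case "$1" in
    x86_64|amd64|aarch64|arm64|ppc64le|s390x) return 0 ;;
    *) return 1 ;;
  esac
}

if arch_is_64bit "$(uname -m)"; then
  echo "64-bit system detected"
else
  echo "32-bit or unrecognized system: MongoDB databases would be capped at 2 GB"
fi
```

An unrecognized name is reported as a warning rather than an error, since the list above is only a guess at the most frequent values.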

The easiest way to install Hyphe is by uncompressing the gzipped release. It has been successfully tested on a variety of blank distributions of Ubuntu, Debian and CentOS. Please let us know if you get it working on other versions!

Distribution Version precision OK ?
Ubuntu 12.04.5 LTS server
Ubuntu 12.04.5 LTS desktop
Ubuntu 14.04.1 LTS server
Ubuntu 14.04.1 LTS desktop
Ubuntu 14.10 desktop
Ubuntu 15.04 desktop — (ScrapyD + Upstart issue with Ubuntu 15 so far)
CentOS 5.7 server — (issues due to missing upstart & python2.4)
CentOS 6.4 Final server
Debian 6.0.10 squeeze server
Debian 7.5 wheezy server
Debian 7.8 wheezy livecd gnome
Debian 8.0 jessie livecd gnome — (MongoDB not supporting Debian 8 yet)
RedHat 7.3 Maipo server ✓ (Be careful to use the step-by-step advanced installation. Warning: ScrapyD won't be installed as a service and will have to be run manually.)

Just uncompress the release archive, go into the directory and run the installation script.

Do not use sudo: the script will do so on its own and ask for your password only once. It works this way in order to install all missing dependencies at once, mainly Java (OpenJDK-6-JRE), Python (python-dev, pip, virtualEnv, virtualEnvWrapper...), Apache2, MongoDB & ScrapyD.

# WARNING: DO NOT prefix any of these commands with `sudo`!
tar xzvf hyphe-release-*.tar.gz
cd hyphe
./bin/install.sh

If you are not comfortable with this or if you prefer to install from git sources, please follow the steps below.

1) Clone the source code

git clone https://github.com/medialab/hyphe hyphe
cd hyphe

From here on, you can also run bin/install.sh to go faster as with the release, or follow the next steps.

2) Get requirements and dependencies

MongoDB (a NoSQL database server), ScrapyD (a crawler framework server), Python 2.6/2.7 and Java 6+ (with Maven 2+ and Thrift for contributors/developers) are required for the backend to work.

2.1) Prerequisites:

Install any missing basic requirements:

For Debian/Ubuntu:

sudo apt-get update
sudo apt-get install curl wget python-dev python-pip apache2 libapache2-mod-proxy-html libxml2-dev libxslt1-dev build-essential libffi-dev libssl-dev libstdc++6-dev

Or for CentOS/RedHat:

sudo yum check-update
sudo yum install curl wget python-devel python-setuptools python-pip httpd libxml2-devel libxslt-devel gcc libffi-devel openssl-devel libstdc++.so.6

# Fix possibly misnamed pip
pip > /dev/null || alias pip="python-pip"

# Activate Apache's autorestart on reboot
sudo chkconfig --levels 235 httpd on
sudo service httpd restart

2.2) Install MongoDB

As they are usually very old, we recommend not using the MongoDB packages shipped in the distributions' official repositories. Below are basic examples to manually install MongoDB (3.0) on Debian/Ubuntu/CentOS. It does not seem to be supported on all distributions yet, so please read the official documentation for more details. If you'd rather install an older 2.x version, see the dedicated instructions in the bin/install.sh script for examples.

On Debian/Ubuntu:

# Install the GPG key for the package repository
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

# Add the repository to apt's sources list
sudo apt-get install lsb-release
distrib=$(cat /etc/issue | sed -r 's/^(\S+) .*/\L\1/')
listrepo="main"
if [ "$distrib" = "ubuntu" ]; then listrepo="multiverse"; fi
echo "deb http://repo.mongodb.org/apt/$distrib "$(lsb_release -sc)"/mongodb-org/3.0 $listrepo" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list

# Update apt's sources list & install
sudo apt-get update
sudo apt-get install mongodb-org
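
To preview the sources line built above before writing it to /etc/apt/sources.list.d, the same logic can be wrapped in a small function. This is only a sketch: mongo_repo_line is a hypothetical helper, and the distribution name and codename are passed in by hand instead of being read from /etc/issue and lsb_release.

```shell
# Build the apt sources line for MongoDB 3.0.
#   $1: lowercase distribution name (e.g. "ubuntu" or "debian")
#   $2: release codename (e.g. "trusty" or "wheezy")
# mongo_repo_line is a hypothetical helper, not part of Hyphe.
mongo_repo_line() {
  mrl_component="main"
  if [ "$1" = "ubuntu" ]; then mrl_component="multiverse"; fi
  echo "deb http://repo.mongodb.org/apt/$1 $2/mongodb-org/3.0 $mrl_component"
}

mongo_repo_line ubuntu trusty
mongo_repo_line debian wheezy
```

Printing the line first makes it easy to spot an empty or unexpected distribution name before apt-get update fails on it.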

On CentOS/RedHat, this is slightly more complex:

# Test whether SELinux runs
# If it says enabled, you will have to do a few more steps after the installation, see here: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-red-hat/#run-mongodb
sestatus

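# Add the repository to yum's sources list
# (single quotes keep $releasever literal so that yum, not the shell, expands it;
# the file name must end with .repo to be read by yum)
echo '[mongodb-org-3.0]
name=MongoDB Repository
baseurl=http://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.0/x86_64/
gpgcheck=0
enabled=1' | sudo tee /etc/yum.repos.d/mongodb-org-3.0.repo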

# Update yum's sources list & install
sudo yum check-update
sudo yum install mongodb-org

# Let MongoDB autostart on reboot
sudo chkconfig mongod on
sudo service mongod restart

For development and administrative use, you can also optionally install one of the following projects to easily access MongoDB's databases:

2.3) Install ScrapyD

On all distributions, start by installing globally the Python dependencies required by Hyphe's Scrapy spider so that ScrapyD can use them (versions are pinned to avoid breakage: pymongo 3 currently breaks txmongo):

sudo pip install pymongo==2.7
sudo pip install txmongo==0.6
sudo pip install selenium==2.42.1
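
Because the versions are pinned on purpose, it can be worth confirming afterwards that nothing else has upgraded them. The sketch below compares the three pins with `pip freeze`-style text read from standard input; check_pins is a hypothetical helper, and the demonstration uses canned input rather than a real installation.

```shell
# Report whether each pinned package appears with the expected version
# in `pip freeze`-style text read from stdin.
# check_pins is a hypothetical helper, not part of Hyphe.
check_pins() {
  cp_frozen=$(cat)
  cp_status=0
  for cp_pin in pymongo==2.7 txmongo==0.6 selenium==2.42.1; do
    # -x: whole line, -F: fixed string, -i: ignore case
    if printf '%s\n' "$cp_frozen" | grep -qixF "$cp_pin"; then
      echo "ok      $cp_pin"
    else
      echo "differs $cp_pin"
      cp_status=1
    fi
  done
  return "$cp_status"
}

# Demonstration on canned input; on a real system, pipe in `pip freeze` instead.
printf 'pymongo==2.7\nselenium==2.42.1\ntxmongo==0.6\n' | check_pins
```

The function returns a non-zero status as soon as one pin is missing or different, so it can also be used in a script condition.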

Then easily install ScrapyD on Ubuntu:

# Install the GPG key for the package repository
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

# Add the repository to apt's sources list
echo "deb http://archive.scrapy.org/ubuntu scrapy main" | sudo tee /etc/apt/sources.list.d/scrapy.list

# Update apt's sources list & install
sudo apt-get update
sudo apt-get install scrapy-0.24
sudo apt-get install scrapyd

For other distributions, first install python scrapy globally via pip:

sudo pip install Scrapy==0.18

Then follow the next steps on CentOS & Debian: ScrapingHub unfortunately only provides ScrapyD packages for Ubuntu, so we had to build our own:

For Debian:

# Download our homemade package...
wget --no-check-certificate "https://github.com/medialab/scrapyd/raw/medialab-debian/debs/scrapyd_1.0~r0_all.deb"

sudo dpkg -i scrapyd_1.0~r0_all.deb
rm -rf scrapyd_1.0~r0_all.deb

# You can later remove the homemade package by running:
# sudo dpkg -r scrapyd

Or for CentOS:

# Download our homemade package...
wget --no-check-certificate "https://github.com/medialab/scrapyd/raw/medialab-centos/rpms/scrapyd_1.0.1-3.el6.x86_64.rpm"

sudo rpm -i scrapyd_1.0.1-3.el6.x86_64.rpm
rm -rf scrapyd_1.0.1-3.el6.x86_64.rpm

# You can later remove the homemade package by running:
# sudo rpm -e scrapyd

Alternatively, for RedHat > v6, or for others if none of the above works, you can install ScrapyD as a Python package and run it manually instead of as a service:

sudo pip install scrapyd==1.0.1

# Create environment
sudo mkdir -p /etc/scrapyd/conf.d /var/lib/scrapyd /var/log/scrapyd
# Replace <user> with your username
sudo chown -R <user>:<user> /var/lib/scrapyd /var/log/scrapyd

Finally, on all distributions, add Hyphe's specific config for ScrapyD:

sudo ln -s `pwd`/config/scrapyd.config /etc/scrapyd/conf.d/100-hyphe

Then restart the service on Debian/Ubuntu/CentOS:

sudo /etc/init.d/scrapyd restart

Or run it manually for RedHat > v6:

nohup scrapyd &

You can test whether ScrapyD is properly installed and running by querying http://localhost:6800/listprojects.json. If everything is normal, you should see something like this:

{"status": "ok", "projects": []}
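
The same check can be scripted. The sketch below uses a hypothetical helper, scrapyd_status_ok, that looks for the "ok" status in whatever JSON it receives on standard input. It relies on a plain grep, which is a rough test and not a real JSON parser.

```shell
# Succeed if the JSON read from stdin reports "status": "ok".
# scrapyd_status_ok is a hypothetical helper, not part of Hyphe or ScrapyD.
scrapyd_status_ok() {
  grep -q '"status": *"ok"'
}

# Demonstration on the sample response shown above:
if echo '{"status": "ok", "projects": []}' | scrapyd_status_ok; then
  echo "ScrapyD answered correctly"
fi

# Against a live instance (assumes curl is installed):
#   curl -s http://localhost:6800/listprojects.json | scrapyd_status_ok
```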

2.4) Install Java & Thrift

2.4.1) Java

Hyphe requires at least Java JRE 6. You can check which version is installed by running java -version and, if Java is missing, run:

# Debian/Ubuntu:
sudo apt-get install openjdk-6-jre
# CentOS/RedHat:
sudo yum install java-1.6.0-openjdk
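
Reading the output of java -version can be confusing, because Java 6 to 8 report themselves as 1.6 to 1.8 while later releases use the plain major number. As a rough sketch, the hypothetical helper java_major below reduces a version string to its major number so that it can be compared with 6.

```shell
# Extract the major Java version from a version string.
# Handles the legacy "1.x" scheme (1.6.0_45 is Java 6) and the newer one (11.0.2 is Java 11).
# java_major is a hypothetical helper, not part of Hyphe.
java_major() {
  jm_version="${1#1.}"            # drop a leading "1." if present
  echo "${jm_version%%[._]*}"     # keep what precedes the first "." or "_"
}

java_major 1.6.0_45   # prints 6
java_major 11.0.2     # prints 11

# With a real JVM, the version string can be obtained with (java -version writes to stderr):
#   java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p'
```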

2.4.2) Thrift

Important: If you're running these installation steps from a downloaded archive of a release, you can skip this section: it only applies if you chose to install from the git sources.

Hyphe uses Thrift version 0.8 to ensure the communication between the python Twisted core and the Java Lucene MemoryStructure.

As for the global installation process, a script allows you to run this part of the installation in just one line.

# WARNING: DO NOT prefix this command with sudo, the script will ask for your sudo rights on its own
./bin/install_thrift.sh

If you are not comfortable with this running on its own, follow the next steps.

Thrift requires a few dependencies including Java's JDK, as well as Ant & Maven 2+. On Ubuntu/Debian:

sudo apt-get install build-essential openjdk-6-jdk ant
sudo apt-get install maven || sudo apt-get install maven2

On CentOS/RedHat:

sudo yum install java-1.6.0-openjdk-devel ant

# Download and extract Maven binaries
wget http://www.eu.apache.org/dist/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
tar xvf apache-maven-3.1.1-bin.tar.gz

# Install Maven in system
sudo cp -r apache-maven-3.1.1 /usr/local/maven
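# (single quotes so that the variables are expanded when the profile is sourced, not when it is written)
echo 'export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}' | sudo tee /etc/profile.d/maven.sh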
source /etc/profile.d/maven.sh
rm -rf apache-maven-3.1.1*

You can then download and install Thrift:

# Download and extract Thrift sources
wget http://archive.apache.org/dist/thrift/0.8.0/thrift-0.8.0.tar.gz
tar xvf thrift-0.8.0.tar.gz

# Configure and compile Thrift with java & python
cd thrift-0.8.0
./configure --with-java --without-erlang --without-php
make
sudo make install
cd ..

Then finally build Hyphe's MemoryStructure jar using Thrift:

./bin/build_thrift.sh

2.5) Setup a Python virtual environment with Hyphe's dependencies

We recommend using virtualenv with virtualenvwrapper:

# Install VirtualEnv & Wrapper
sudo pip install virtualenv
sudo pip install virtualenvwrapper
source $(which virtualenvwrapper.sh)

# Create Hyphe's VirtualEnv & install dependencies
mkvirtualenv --no-site-packages hyphe
workon hyphe
add2virtualenv $(pwd)
pip install -r requirements.txt
deactivate

2.6) [Unnecessary for now] Install PhantomJS

Important: Crawling with PhantomJS is currently only possible as an advanced option in Hyphe. Do not bother with this section except for advanced use or development.

Hyphe ships with a compiled binary of PhantomJS-2.0 for Ubuntu. Unfortunately it is not compatible with other distributions, so on CentOS or Debian you should compile your own from sources:

./bin/install_phantom.sh

Note that PhantomJS 1.9.7 is easily downloadable as a binary, although it uses a very outdated version of WebKit, and PhantomJS 2+ is required to handle modern websites such as Facebook.

3) Prepare and configure

3.1) Setup the backend

  • Copy and adapt the sample config.json.example to config.json in the config directory:
sed "s|##HYPHEPATH##|"`pwd`"|" config/config.json.example > config/config.json
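
To see what this substitution does without touching the real files, here is a small sketch. The helper fill_hyphe_path and the sample JSON key are assumptions made for illustration; the actual config.json.example contains different content.

```shell
# Replace the ##HYPHEPATH## placeholder with the path given as $1,
# reading the template on stdin and writing the result to stdout.
# fill_hyphe_path is a hypothetical helper, not part of Hyphe.
fill_hyphe_path() {
  sed "s|##HYPHEPATH##|$1|"
}

# Hypothetical template line, for demonstration only:
echo '{"data_path": "##HYPHEPATH##/data"}' | fill_hyphe_path /opt/hyphe
```

Using | as the sed delimiter is what allows a path full of slashes to be inserted as-is. As in the command above, only the first placeholder on each line is replaced, and a path that itself contains | or & would need escaping.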

3.2) Setup the frontend

Copy and adapt the sample conf_default.js to conf.js in the hyphe_frontend/app/conf directory:

sed "s|##WEBPATH##|hyphe|" hyphe_frontend/app/conf/conf_default.js > hyphe_frontend/app/conf/conf.js

3.3) Serve everything with Apache

The backend core API relies on a Twisted web server listening on a dedicated port (defined as twisted.port in the config.json created just before). For external access, proxy redirection is handled by Apache.
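
To check which port is configured, the value can be read back from the config file with grep and sed. The sketch below runs on a throwaway file: the helper name twisted_port, the file layout and the port number 6978 are assumptions made for the demonstration.

```shell
# Print the digits found on the "twisted.port" line of the file given as $1.
# twisted_port is a hypothetical helper, not part of Hyphe.
twisted_port() {
  grep '"twisted.port"' "$1" | sed 's/[^0-9]//g'
}

# Hypothetical sample config, for demonstration only:
demo_conf=$(mktemp)
printf '{\n  "twisted.port": 6978,\n  "other.option": "x"\n}\n' > "$demo_conf"

twisted_port "$demo_conf"
rm -f "$demo_conf"
```

Since sed simply strips every non-digit character, this only works if the port is the sole number on that line.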

  • Copy and adapt the sample apache2_example.conf from the config directory:
twport=$(grep '"twisted.port"' config/config.json | sed 's/[^0-9]//g')
sed "s|##HYPHEPATH##|"`pwd`"|" config/apache2_example.conf |
sed "s|##TWISTEDPORT##|$twport|" |
sed "s|##WEBPATH##|hyphe|" > config/apache2.conf
  • Install it as an Apache site:

On Debian/Ubuntu:

# Enable use of mod_proxy & mod_proxy_http
sudo a2enmod proxy
sudo a2enmod proxy_http

# Install & enable site
sudo ln -s `pwd`/config/apache2.conf /etc/apache2/sites-available/hyphe
sudo a2ensite hyphe

# Reload Apache
sudo service apache2 reload

On CentOS/RedHat:

# Apache's mod_proxy & mod_proxy_http usually ship with httpd on CentOS machines, but they might be missing.
# Check that they are present by running the following command, and look up how to install them otherwise
grep -r "^\s*LoadModule.*mod_proxy_http" /etc/httpd/

# Install site
sudo ln -s `pwd`/config/apache2.conf /etc/httpd/conf.d/hyphe.conf

# Reload Apache
sudo service httpd reload

This first installs Hyphe locally only, at http://localhost/hyphe. The page should already be accessible, although the website will not work until the server is started (see next section).

If you encounter issues here or would like to serve Hyphe on the web, please see the related documentation.

3.4) Run Hyphe!

To start, stop or restart the server's daemon, run (with the proper rights, so no sudo if you installed as your user!):

bin/hyphe <start|restart|stop> [--nologs]

You should now be able to enjoy Hyphe at http://localhost/hyphe!