
This prototype is a focused web crawler that visits pre-defined websites, extracts their contents and saves them into an Elasticsearch datastore. The blog crawler is based on Scrapy, a Python framework that facilitates the development of web-scraping applications.

## Requirements

  • Linux or Mac OS X
  • Python 2.7 or newer
  • Scrapy 0.22 or newer
  • lxml
  • pyOpenSSL
  • Elasticsearch 1.2.1 or newer
  • Java Runtime Environment 7 or newer

## Installation

The easiest way to install the required software is to use the package manager of the OS. The commands below were tested on Ubuntu 14.04, but they should work on all distributions that use the apt-get package manager. For yum-based systems like Fedora and SuSE some modifications might be required.

The following commands will make sure that all requirements are met:

sudo apt-get install -y build-essential git python-pip python python-dev libxml2-dev libxslt-dev lib32z1-dev openjdk-7-jdk
sudo pip install pyopenssl lxml scrapy elasticsearch dateutils Twisted service_identity
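As a sanity check, the pip packages above can be verified from Python itself. The sketch below simply tries to import each one; note that the import names can differ from the pip package names (pyopenssl is imported as `OpenSSL`, Twisted as `twisted`):

```python
# Check that the Python dependencies listed above can be imported.
import importlib

def check_dependencies(modules):
    """Return a dict mapping module name -> True if it is importable."""
    results = {}
    for name in modules:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results

if __name__ == "__main__":
    status = check_dependencies(
        ["scrapy", "lxml", "OpenSSL", "elasticsearch", "twisted"]
    )
    for name, ok in sorted(status.items()):
        print("%-15s %s" % (name, "OK" if ok else "MISSING"))
```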

Elasticsearch cannot be installed via apt-get, so it has to be installed manually. First we download the latest Elasticsearch version:


wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-x.y.z.deb
where x.y.z is the current version number and thus needs to be replaced. Then the installation can be triggered via:

sudo dpkg -i elasticsearch-x.y.z.deb

Again, x.y.z refers to the latest version number and has to be replaced. The installation is now complete, and Elasticsearch can be started using the following command:

sudo service elasticsearch start
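To confirm that the server came up, its HTTP interface can be queried; by default it listens on port 9200. The sketch below assumes a local installation with the default configuration and uses only the standard library:

```python
# Quick sanity check that Elasticsearch is answering on its default port.
import json
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2.7

def cluster_url(host="localhost", port=9200):
    """URL of the root endpoint, which serves the cluster info document."""
    return "http://%s:%d/" % (host, port)

def elasticsearch_version(host="localhost", port=9200):
    """Return the version string reported by a running Elasticsearch node."""
    response = urlopen(cluster_url(host, port), timeout=5)
    info = json.loads(response.read().decode("utf-8"))
    return info["version"]["number"]
```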

Note that the service will not start automatically when the computer boots. If this is required, the following command has to be used:

sudo update-rc.d elasticsearch defaults 95 10

The blog crawler can be cloned from GitHub:

git clone

The crawler can be invoked by changing into the freshly cloned repository directory and starting the script. Note that the process can take several hours. To interrupt the process, press <ctrl+c>.
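Once the crawler has finished, the stored posts can be inspected through Elasticsearch's HTTP search API. The sketch below uses only the standard library; the index name "blog" and the field "content" are assumptions and have to be adjusted to whatever the crawler's pipeline actually writes:

```python
# Sketch: full-text query against the crawled data via the _search endpoint.
# Index name ("blog") and field name ("content") are assumptions.
import json
try:
    from urllib.request import urlopen, Request   # Python 3
except ImportError:
    from urllib2 import urlopen, Request          # Python 2.7

def build_search_request(query, index="blog", host="localhost", port=9200):
    """Build an HTTP request for a simple match query on the given index."""
    url = "http://%s:%d/%s/_search" % (host, port, index)
    body = json.dumps({"query": {"match": {"content": query}}})
    return Request(url, body.encode("utf-8"),
                   {"Content-Type": "application/json"})

def search_posts(query, **kwargs):
    """Return the matching source documents from a running Elasticsearch."""
    response = urlopen(build_search_request(query, **kwargs), timeout=5)
    result = json.loads(response.read().decode("utf-8"))
    return [hit["_source"] for hit in result["hits"]["hits"]]
```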

