Developers documentation

Architecture

Hyphe relies on the following main components/directories:

  • memory_structure: a Java Lucene database designed for handling WebEntities & WebEntityLinks, served as an API using Thrift
  • hyphe_backend: Python 2.6/2.7 controllers for the crawling and backend API, with a MongoDB buffer database to store crawled data
    • core.tac: a Twisted based JSON-RPC API controller
    • crawler: a Scrapy spider project to build and deploy on ScrapyD
    • lib: shared libraries between the two
    • memory_structure: Thrift-generated classes for easy dialogue with the Lucene MemoryStructure from Python
  • hyphe_frontend: a JavaScript web application powered by Angular.js to build and explore web corpora through the backend API

Other useful directories are:

  • bin for the executable scripts
  • config where all useful configuration files are
  • doc with this documentation among a few others

Note: hyphe_www_backend is the source code of an older implementation of the JavaScript web frontend, meant to work with Hyphe MonoCorpus (see setting MULTICORPUS); it is no longer maintained. _deprecated gathers old pieces of code and documentation.

Build & run the Java MemoryStructure

The MemoryStructure relies on a specific Lucene database made accessible through Apache Thrift, which makes it possible to call the MemoryStructure's Java API from the Python core. Building it therefore produces both a compiled jar and a set of Python classes. The Python core starts one instance of the MemoryStructure jar for each corpus and automatically shuts it down when inactive (see the dedicated code in hyphe_backend/lib/corpus.py).

All of this means that whenever the code in the memory_structure directory is modified, the jar and Python classes running the memory structure need to be rebuilt. A dedicated script does this:

bin/build_thrift.sh

The Lucene data model is defined in src/main/java/fr/sciencespo/medialab/hci/memorystructure/index/IndexConfiguration.java. The Thrift API and its list of routines are defined in src/main/java/memorystructure.thrift and src/main/java/fr/sciencespo/medialab/hci/memorystructure/thrift/MemoryStructureImpl.java. All other files in src/main/java/fr/sciencespo/medialab/hci/memorystructure/thrift are autogenerated and shouldn't be modified, except ThriftServer, which configures the API. Most of the algorithmic logic resides in memory_structure/src/main/java/fr/sciencespo/medialab/hci/memorystructure/index/ and memory_structure/src/main/java/fr/sciencespo/medialab/hci/memorystructure/cache/.

To run a single memory structure for tryouts without starting a Hyphe corpus, you can use the following command with example arguments:

java -server -Xms256m -Xmx1024m -jar hyphe_backend/memorystructure/MemoryStructureExecutable.jar log.level=DEBUG thrift.port=13500 corpus=TEST
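When scripting tryouts, it can be handy to assemble that command line programmatically. The following is only a minimal sketch, not the code used in hyphe_backend/lib/corpus.py: the function name build_memory_structure_command and its keyword arguments are hypothetical, and the default values are simply copied from the example command above.

```python
def build_memory_structure_command(
        corpus, thrift_port,
        jar_path="hyphe_backend/memorystructure/MemoryStructureExecutable.jar",
        min_heap_mb=256, max_heap_mb=1024, log_level="DEBUG"):
    """Return the argument list for starting one MemoryStructure instance.

    Hypothetical helper: it only mirrors the example command from this
    documentation and is not part of Hyphe's code base.
    """
    return [
        "java", "-server",
        "-Xms%dm" % min_heap_mb,         # initial JVM heap size
        "-Xmx%dm" % max_heap_mb,         # maximum JVM heap size
        "-jar", jar_path,
        "log.level=%s" % log_level,
        "thrift.port=%d" % thrift_port,  # port the Thrift API listens on
        "corpus=%s" % corpus,
    ]


command = build_memory_structure_command("TEST", 13500)
print(" ".join(command))
```

The resulting list can be passed as is to subprocess.Popen, which avoids shell quoting issues when a corpus name contains unusual characters.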

Build & deploy a Scrapy crawler

Hyphe's crawler is implemented as a Scrapy spider which needs to be deployed for each corpus on the ScrapyD server (more information here). The core API automatically takes care of this whenever a corpus is created.

For debugging purposes, it can be deployed as follows for a specific corpus:

bin/deploy_scrapy_spider.sh <corpus_name>

Whenever config.json, the code in hyphe_backend/crawler or hyphe_backend/lib/urllru.py is modified, the spider needs to be redeployed on the ScrapyD server for the changes to take effect. You can do this either by hand, by running the previous command, or by calling the Core API's method crawl.deploy_crawler (see API documentation).

Use the API from the command line

The entire frontend relies on calls to the core API, which can just as well be scripted or reimplemented. This is especially useful for exploiting Hyphe features which are not available from the web interface yet (for instance, tag all webentities from a list of URLs with tag CSV, crawl all IN webentities, etc.).

All of the API's functions are catalogued and described in the API documentation.
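Since the core is a JSON-RPC API, scripting it essentially means sending small JSON envelopes. Here is a hedged sketch of how such a request body could be built; it assumes a JSON-RPC 1.0 style envelope (method, params, id), which should be checked against the actual server, and build_jsonrpc_request is a hypothetical helper name. The endpoint URL is not given here, so sending the request is left out.

```python
import json


def build_jsonrpc_request(method, params=None, request_id=1):
    """Serialize one JSON-RPC call as a JSON string.

    Assumption: the server accepts a JSON-RPC 1.0 style envelope with
    positional parameters; this has not been verified against Hyphe's core.
    """
    envelope = {
        "method": method,             # e.g. "get_status" or "declare_page"
        "params": list(params or []), # positional arguments, in order
        "id": request_id,             # echoed back by the server in its reply
    }
    return json.dumps(envelope)


# Build the body for declaring a page in the corpus named "test".
body = build_jsonrpc_request(
    "declare_page", ["http://medialab.sciences-po.fr", "test"])
print(body)
```

Keeping the envelope construction separate from the HTTP transport makes it easy to log or replay calls before pointing the script at a real instance.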

A simple Python script, hyphe_backend/test_client.py (which could certainly be greatly improved), provides a way to call the API from the command line: stack the arguments after the name of the called function, and put the keyword array before any rich argument such as an array or an object. For instance:

source $(which virtualenvwrapper.sh)
workon hyphe
./hyphe_backend/test_client.py get_status
./hyphe_backend/test_client.py create_corpus test
./hyphe_backend/test_client.py declare_page http://medialab.sciences-po.fr test
./hyphe_backend/test_client.py declare_pages array '["http://medialab.sciences-po.fr", "http://www.sciences-po.fr"]' test
WEID=$(./hyphe_backend/test_client.py store.get_webentity_for_url http://medialab.sciences-po.fr test |
         grep "u'id':" |
         sed -r "s/^.*: u'(.*)',/\1/")
./hyphe_backend/test_client.py store.add_webentity_tag_value $WEID USER MyTags GreatValue test
./hyphe_backend/test_client.py crawl_webentity $WEID 1 False IN prefixes array '{}' test
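The argument convention used in the commands above (plain arguments stacked after the function name, keyword array before any JSON-encoded rich argument) can be sketched as follows. This is an illustration only, not the actual code of test_client.py: parse_cli_arguments is a hypothetical name, and plain arguments are kept as strings here, whereas the real client may convert values such as 1 or False.

```python
import json


def parse_cli_arguments(argv):
    """Split command-line tokens into (function_name, parameters).

    A token equal to "array" announces that the next token is a JSON
    document (array or object) to be decoded; every other token is kept
    as a plain string. Hypothetical sketch of the documented convention.
    """
    if not argv:
        raise ValueError("expected a function name")
    method = argv[0]
    params = []
    tokens = iter(argv[1:])
    for token in tokens:
        if token == "array":
            try:
                raw = next(tokens)  # consume the JSON payload that follows
            except StopIteration:
                raise ValueError("'array' must be followed by a JSON value")
            params.append(json.loads(raw))
        else:
            params.append(token)
    return method, params


method, params = parse_cli_arguments([
    "declare_pages", "array",
    '["http://medialab.sciences-po.fr", "http://www.sciences-po.fr"]',
    "test",
])
print(method, params)
```

One limitation of this convention is that a plain argument whose value is literally array cannot be passed; a real client would need an escape mechanism for that case.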

bin/samples/ contains multiple examples of advanced routines run directly from the shell using the command-line client. These are presently deprecated: they were written for the old MONOCORPUS version of Hyphe and still need to be updated.

Contribute to the frontend

The JavaScript dependencies are currently shipped with the git sources until a proper grunt/gulp/cat build is set up.

In the meantime, to update dependencies, you can run the following after having installed Node.js:

sudo npm install -g bower
cd hyphe_frontend
bower install

Build the API's documentation

bin/build_apidoc.sh

Update the list of TLDs used by the frontend from Mozilla's list

bin/update_tlds_list.sh

Build a release

bin/build_release.sh <optional version_id>