Permalink
8adc918 Sep 11, 2015
@boogheta @Yomguithereal
164 lines (77 sloc) 8.27 KB

Configuration

Backend core API

The default configuration should work by default for a local install (i.e. running on http://localhost/hyphe), but you may want to provide a few finer settings. You can configure Hyphe's options by editing config/config.json.

Default options should fit for most cases. Typical important options to set depending on your situation are highlighted as bold:

  • mongo-scrapy [object]: backend config for the database and the scrapy crawler

    • host [str]:

      usually "localhost", but possibly another domain name or IP (no http://) of a machine on which both MongoDB and ScrapyD are installed and accept external access

    • mongo_port [int]:

      usually 27017, the port on which MongoDB is served

    • proxy_host [str] & proxy_port [int]:

      in case you want the crawler to query the web through a http proxy (host should be a domain name, without http://)

    • db_name [str]:

      usually "hyphe", the name of the MongoDB database that will host Hyphe's data. Typically useful when wanting to deploy multiple Hyphe instances on the same server

    • scrapy_port [int]:

      usually 6800, the port on which ScrapyD is served

    • maxdepth [int]:

      usually 3, the maximum depth allowed to the users for each individual crawl (meaning the number of clicks to be followed within a crawled WebEntity). Note that crawls with a depth of 3 and more can easily take hours depending on the crawled website

    • download_delay [int]:

      usually 1, the pause time (in seconds) taken by the crawler between two queries of the same crawl

    • max_simul_requests [int]:

      usually 12, the maximum number of concurrent queries performed by the crawler

    • max_simul_requests_per_host [int]:

      usually 1, the maximum number of concurrent queries performed by the crawler on a same hostname

  • memoryStructure [object]: config for the Java Lucene part of Hyphe, defining the limits of possibly simultaneously running corpora, by default to 10

    • keepalive [int]:

      usually 1800, the time (in seconds) after which a corpus which has not been used will automatically stop and free a slot for other corpora

    • thrift.portrange [2-ints array]:

      usually [13500, 13509], an array of two ports values defining a minimum and a maximum values between which all possible ports can be used by each corpus' MemoryStructure to communicate via Thrift with the core API. Hyphe won't accept more simultaneously running corpus than the number of available ports

    • thrift.max_ram [int]:

      usually 2560, the maximum ram possibly allocated to the MemoryStructure of all simultaneously running corpora. By default a corpus will start with 256Mo, and, possibly restyart with 256 more whenever the corpus grows too big and runs out of memory

    • lucene.rootpath [str]:

      usually the lucene-data directory within Hyphe's code, the absolute path to the directory in which the MemoryStructure data for each corpus will be stored (can get as high as a few gigaoctets per corpus)

    • log.level [str]:

      usually "INFO", possibly "WARN", "DEBUG" or "TRACE" to get more log within each Lucene MemoryStructure's log files (such as log/hyphe-memory-structure-<corpus>.log)

    • max_simul_pages_indexing [int]:

      usually 100, advanced setting for internal performance adjustment, do not modify unless you know what you're doing

    • max_simul_links_indexing [int]:

      usually 10000, advanced setting for internal performance adjustment, do not modify unless you know what you're doing

  • twisted.port [int]:

    usually 6978, the port through which the server and the web interface will communicate. Typically useful when wanting to deploy multiple Hyphe instances on the same server

  • precisionLimit [int]:

    usually 2, the maximum precision to keep on links between crawled webpages, the value being the number of slashes after the root prefix of a WebEntity (read the wiki for more info). Do not modify unless you know what you're doing

  • defaultStartpagesMode [str | str array]:

    usually ["prefixes", "pages-5"], possibly one or many of "startpages", "prefixes", "pages-<N>". Sets the default behavior when crawling discovered WebEntities with no startpage manually set. When using only "startpages", crawl will fail on WebEntities with no humanly set startpage. With other options, Hyphe will try respectively the "N" most linked pages known of the WeEntity ("pages-<N>") or all of its prefixes ("prefixes"), then add them automatically to the WebEntity's startpages on success during crawl.

  • defaultCreationRule [str]:

    usually "domain", possibly one of "subdomain", "subdomain-<N>", "domain", "path-<N>", "page", "prefix+<N>". Sets the default behavior when discovering new web pages, meaning the creation of a new WebEntity for each different discovered "domain", "subdomain", etc. <N> being an integer. Read more about creation rules in the wiki and the dedicated code

  • creationRules [object]:

    see default values for example, an object defined with domain names as keys and creationrules as values (read defaultCreationRule above for explanations on creationrules)

  • discoverPrefixes [str array]:

    see default values for example, a list of domain names for which the crawler will automatically try to resolve redirections in order to avoid having links shorteners in the middle of the graph of links

  • phantom [object]: settings for crawl jobs using PhantomJS to simulate a human browsing the webpages, scrolling and clicking on any possible interactive part (still experimental, do not modify unless you know what you're doing)

    • autoretry [bool]:

      false for now, set to true to enable auto retry of crawl jobs having apparently failed (depth > 0 & pages found < 3)

    • timeout [int]:

      usually 600, the maximum time in seconds PhantomJS is allowed to spend on one single page (10 minutes are required for instance to load all hidden content on big Facebook group pages for instance)

    • idle_timeout [int]:

      usually 20, the maximum time in seconds after which PhantomJS will consider the page properly crawled if nothing happened within during that time

    • ajax_timeout [int]:

      usually 15, the maximum time in seconds allowed to any Ajax query performed within a crawled page

    • whitelist_domains [str array]:

      empty for now, a list of domain names for which the crawler will automatically use PhantomJS (meant for instance in the long term for Facebook, Twitter or Google)

  • MULTICORPUS [bool]:

    normally true, mainly for retrocompatibility, but can be set to false to allow only one corpus (called --hyphe--)

  • ADMIN_PASSWORD [str]:

    usually unset, but can be defined to a string that will be accepted as the password by all existing corpora to let admins access them for administration use

  • OPEN_CORS_API [bool]:

    usually set to false, enable only when you want to allow another frontend web instance to query the core API from another web domain

  • DEBUG [int]:

    a value from 0 to 2 indicating the level of verbosity desired from the API core in log/hyphe-core.log

Note: Many of these settings are configurable per corpus individually. Although the webapp interface does not allow to set them yet, they can be adjusted via the command line client to the API using the set_corpus_options method. See the list of settable corpus options here.

Frontend webapp

A few adjustments can be set to the frontend by editing the file hyphe_frontend/app/conf/conf.js:

  • serverURL:

    The path to Hyphe's core API. Default is /hyphe-api which corresponds to the url to which the API is proxied within config/apache2.conf. Useful to plug and develop a frontend onto an Hyphe instance without having it locally installed.

  • googleAnalyticsId:

    A Google Analytics ID to track use of the tool's web interface. Default is ''.