Core_API

Benjamin Ooghe-Tabanou edited this page Feb 28, 2013 · 6 revisions

JSON-RPC API

Results can take two forms :

  • success : {code: 'success', result: json_object}
  • error : {code: 'fail', message: error_string}
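
A client will usually want to turn this envelope into either a value or an exception. The sketch below assumes the response has already been decoded into a Python dict; `CoreAPIError` and `unwrap` are hypothetical helper names, not part of the core.

```python
class CoreAPIError(Exception):
    """Raised when the core answers with code 'fail' (hypothetical helper)."""


def unwrap(response):
    """Return the 'result' of a success envelope, or raise on failure."""
    code = response.get("code")
    if code == "success":
        return response.get("result")
    if code == "fail":
        raise CoreAPIError(response.get("message", "unknown error"))
    # Anything else does not match the two documented forms.
    raise ValueError("unexpected envelope: %r" % (response,))
```

With this helper, `unwrap({"code": "success", "result": "pong"})` returns `"pong"`, while a `fail` envelope raises `CoreAPIError` carrying the error string.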

Main Core functions

  • ping : answers pong when core API is alive

  • get_status : returns statistics and information on the core's general status and loops

  • reinitialize : reinitializes all databases, crawl jobs and the memory structure

  • listjobs : returns the list of crawling jobs past, running and pending

  • refreshjobs : updates and returns the list of crawling jobs past, running and pending

  • lookup_httpstatus : checks a webpage's existence and returns its http code status

    • url : string url of the looked up webpage
    • timeout : integer number of seconds to allow for lookup (default : 2)
  • lookup : checks a webpage's existence and returns a boolean: True when the webpage exists or is a redirection (httpstatus = 200, or > 300 and < 400), False otherwise

    • url : string url of the looked up webpage
    • timeout : integer number of seconds to allow for lookup (default : 2)
  • declare_pages : adds pages to the memory structure and, if necessary, creates webentities from these pages based on the default creation rule. Returns the webentities

    • list_urls : array of strings of the webpages urls
  • declare_page : adds a page to the memory structure and, if necessary, creates a webentity from this page based on the default creation rule. Returns the webentity

    • url : string url of the declared webpage
  • crawl_webentity : programs the future crawl of a webentity

    • webentity_id : string id of the webentity to crawl from Memory Structure
    • maxdepth : integer maximum crawling depth (default : None, which applies the main config's mongo-scrapy default maxdepth value)
    • all_pages_as_startpoints : boolean, True to use all existing pages from the webentity in the memory structure as crawl's starting points instead of the webentity's startpages (default : False)
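
As a rough illustration of how these functions might be called, the sketch below serializes a JSON-RPC request with positional parameters and restates the `lookup` rule as code. Several points are assumptions: this page does not state the JSON-RPC version (some servers also expect a `"jsonrpc": "2.0"` member), the positional order is assumed to follow the order in which parameters are listed above, and `build_request` and `status_means_exists` are hypothetical names.

```python
import json


def build_request(method, params=(), request_id=1):
    """Serialize a JSON-RPC call; params are sent positionally (assumption)."""
    payload = {"method": method, "params": list(params), "id": request_id}
    return json.dumps(payload)


def status_means_exists(http_status):
    """Mirror the documented lookup rule: 200, or strictly between 300 and 400."""
    return http_status == 200 or 300 < http_status < 400


# Example: ask for the HTTP status of a page with a 5 second timeout.
body = build_request("lookup_httpstatus", ["http://example.com", 5])
```

The resulting string can be posted to wherever the core API is exposed; that address depends on the local configuration and is not given on this page.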

Crawl functions

  • crawl.reinitialize : cleans the list of jobs, empties the crawled results in the mongo database and cancels all pending crawls

  • crawl.start : programs a crawl for a webentity

    • webentity_id : string id of the webentity to crawl from memory structure
    • starts : array of strings of the crawl's starting points urls
    • follow_prefixes : array of strings of LRU prefixes to follow within the crawled links until maxdepth is reached (usually set to the webentity's LRU prefixes)
    • nofollow_prefixes : array of strings of LRU prefixes to not follow within the crawled links (usually set to the list of LRU prefixes of a webentity's subwebentities)
    • discover_prefixes : array of strings of LRU prefixes to follow for redirections (default : config['discoverPrefixes'])
    • maxdepth : maximum depth of links to follow from the starting points (default : config['mongo-scrapy']['maxdepth'])
    • download_delay : integer number of seconds to wait between two consecutive requests on the same domain name (default : config['mongo-scrapy']['download_delay'])
  • crawl.cancel : cancels a running or pending crawl job

    • job_id : string id of the job given by listjobs
  • crawl.cancel_all : cancels all running or pending crawl jobs

  • crawl.list : returns the list of past, running and pending crawls

  • crawl.get_job_logs : returns a time-ordered list of logs for a specific crawl job

    • job_id : string id of the job given by listjobs
  • crawl.get_webentity_logs : returns a time-ordered list of logs for all crawls of a specific webentity

    • webentity_id : string id of the webentity in the memory structure
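
To make the `crawl.start` defaults concrete, here is a minimal sketch that resolves them on the client side from a configuration dict shaped like the `config[...]` keys quoted above. The helper name `crawl_start_params`, the positional order of the returned list, the configuration values and the webentity id are all assumptions for illustration; the LRU strings are placeholders built on the `s:http` form used elsewhere on this page.

```python
def crawl_start_params(config, webentity_id, starts, follow_prefixes,
                       nofollow_prefixes, discover_prefixes=None,
                       maxdepth=None, download_delay=None):
    """Return a positional parameter list for crawl.start (assumed order)."""
    scrapy_conf = config["mongo-scrapy"]
    # Fall back to the documented config defaults when a value is omitted.
    if discover_prefixes is None:
        discover_prefixes = config["discoverPrefixes"]
    if maxdepth is None:
        maxdepth = scrapy_conf["maxdepth"]
    if download_delay is None:
        download_delay = scrapy_conf["download_delay"]
    return [webentity_id, list(starts), list(follow_prefixes),
            list(nofollow_prefixes), list(discover_prefixes),
            maxdepth, download_delay]


# Hypothetical configuration and identifiers, for illustration only.
config = {"discoverPrefixes": [],
          "mongo-scrapy": {"maxdepth": 1, "download_delay": 1}}
params = crawl_start_params(config, "we-1", ["http://example.com/"],
                            ["s:http|h:com|h:example|"], [], maxdepth=2)
```

Here `maxdepth` is overridden explicitly, while `discover_prefixes` and `download_delay` fall back to the configuration values.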

Memory Structure functions

  • store.reinitialize : empties the memory structure and redefines the default webentity creation rule

  • store.declare_webentity_by_lru : creates if necessary a webentity for a specific LRUprefix and returns it

    • lru_prefix : string lru_prefix
  • store.get_webentity_by_url : tries to find the webentity corresponding to a url and returns it if it exists or returns None

    • url : string url to look for
  • store.get_webentities : returns the list of all webentities in the memory structure or of those whose IDs are given as input

    • list_ids : optional list of string ids of the webentities looked for (default : None)
  • store.get_webentity_pages : returns the list of all webpages stored in the memory structure corresponding to a specific webentity

    • webentity_id : string id of the webentity
  • store.get_webentity_subwebentities : returns a list of webentities having LRU prefixes starting with one of the webentity's prefixes

    • webentity_id : string id of the webentity
  • store.get_webentity_parentwebentities : returns a list of webentities whose LRU prefixes are shorter leading parts of one of the webentity's prefixes

    • webentity_id : string id of the webentity
  • store.get_precision_exceptions : returns the list of string LRU prefixes defined as precision exceptions

  • store.remove_precision_exceptions : removes a list of string LRU prefixes from precision exceptions if existing

    • list_exceptions : array of string lru prefixes
  • store.merge_webentity_into_another : Merges a webentity into another by adding all of its lru prefixes to the other one before removing it

    • old_webentity_id : string id of the webentity to merge into the other
    • good_webentity_id : string id of the webentity to host the merged one
    • include_tags : boolean True to add all tags from old_webentity to other one (default : False)
    • include_home_and_startpages_as_startpages : boolean True to add all startpages and the homepage from old_webentity as startpages of the other one (default : False)
  • store.delete_webentity : Removes a webentity from the memory structure (all its webpages will be associated with the default OUTSIDE WEB webentity for LRU prefix "s:http" or "s:https")

    • webentity_id : string id of the webentity
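
The sub/parent relation between webentities can be pictured with plain string prefixes. The sketch below is only an illustration: it assumes LRU prefixes can be compared as raw strings (the memory structure may well compare them differently), and the function names, webentity ids and LRU values are hypothetical.

```python
def is_longer_prefix(candidate, prefix):
    """True when candidate starts with prefix and is strictly longer."""
    return candidate != prefix and candidate.startswith(prefix)


def subwebentities(target_prefixes, others):
    """Ids of webentities having a prefix that extends one of the target's."""
    return sorted(
        we_id for we_id, prefixes in others.items()
        if any(is_longer_prefix(p, t)
               for p in prefixes for t in target_prefixes)
    )


def parentwebentities(target_prefixes, others):
    """Ids of webentities having a prefix that one of the target's extends."""
    return sorted(
        we_id for we_id, prefixes in others.items()
        if any(is_longer_prefix(t, p)
               for p in prefixes for t in target_prefixes)
    )


# Hypothetical webentities, keyed by id, with illustrative LRU prefixes.
others = {
    "blog": ["s:http|h:com|h:example|p:blog|"],
    "root": ["s:http|h:com|"],
    "other": ["s:http|h:org|h:other|"],
}
target = ["s:http|h:com|h:example|"]
subs = subwebentities(target, others)
parents = parentwebentities(target, others)
```

With this data, only "blog" extends the target prefix and only "root" is a shorter leading part of it; "other" is unrelated in both directions.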

Fine webentity fields accessors

  • store.rename_webentity : Defines a webentity's name field

    • webentity_id : string id of the webentity
    • new_name : string name
  • store.set_webentity_status : Defines a webentity's status

    • webentity_id : string id of the webentity
    • status : string status (UNDECIDED, IN, OUT or DISCOVERED)
  • store.set_webentity_homepage : Defines the homepage to display for a specific webentity

    • webentity_id : string id of the webentity
    • homepage : string url of the webentity's homepage
  • store.add_webentity_lruprefix : Adds an LRU prefix to a specific webentity. Removes it from another webentity if it was already defined there

    • webentity_id : string id of the webentity
    • lru_prefix : string lru prefix to add to the webentity
  • store.rm_webentity_lruprefix : Removes an LRU prefix from a specific webentity. Also removes the webentity if it has no LRU prefix left

    • webentity_id : string id of the webentity
    • lru_prefix : string lru prefix to remove from the webentity
  • store.add_webentity_startpage : Adds a starting point URL to a specific webentity

    • webentity_id : string id of the webentity
    • startpage_url : string url to add to the webentity's startpoints
  • store.rm_webentity_startpage : Removes a starting point URL from a specific webentity if existing

    • webentity_id : string id of the webentity
    • startpage_url : string url to remove from the webentity's startpoints
  • store.add_webentity_tag_value : Adds a namespace:key=value tag to a specific webentity

    • webentity_id : string id of the webentity
    • tag_namespace : string namespace (should not contain any ":")
    • tag_key : string key (should not contain any "=")
    • tag_value : string value
  • store.rm_webentity_tag_key : Removes all namespace:key tag values from a specific webentity if existing.

    • webentity_id : string id of the webentity
    • tag_namespace : string namespace (should not contain any ":")
    • tag_key : string key (should not contain any "=")
  • store.rm_webentity_tag_value : Removes a namespace:key=value tag from a specific webentity if existing.

    • webentity_id : string id of the webentity
    • tag_namespace : string namespace (should not contain any ":")
    • tag_key : string key (should not contain any "=")
    • tag_value : string value
  • store.set_webentity_tag_values : Removes all values for a specific namespace:key and replaces them with a specific list of tag_values.

    • webentity_id : string id of the webentity
    • tag_namespace : string namespace (should not contain any ":")
    • tag_key : string key (should not contain any "=")
    • tag_values : list of string values
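
The four tag accessors can be read as operations on a nested namespace / key / list-of-values mapping. The following in-memory sketch is an assumption about that model, not the core's actual storage; the function names are hypothetical, and skipping duplicate values on add is a choice made here rather than documented behaviour.

```python
def _check(namespace, key):
    # The page asks for no ":" in namespaces and no "=" in keys.
    if ":" in namespace or "=" in key:
        raise ValueError("invalid tag namespace or key")


def add_tag_value(tags, namespace, key, value):
    """Add one namespace:key=value tag (duplicates skipped, an assumption)."""
    _check(namespace, key)
    values = tags.setdefault(namespace, {}).setdefault(key, [])
    if value not in values:
        values.append(value)


def rm_tag_value(tags, namespace, key, value):
    """Remove one value if it exists; do nothing otherwise."""
    values = tags.get(namespace, {}).get(key, [])
    if value in values:
        values.remove(value)


def rm_tag_key(tags, namespace, key):
    """Remove every value stored under namespace:key, if any."""
    tags.get(namespace, {}).pop(key, None)


def set_tag_values(tags, namespace, key, values):
    """Replace all values of namespace:key with the given list."""
    _check(namespace, key)
    tags.setdefault(namespace, {})[key] = list(values)
```

Starting from an empty dict, adding "climate" then "energy" under a hypothetical `user:topic` key yields `{"user": {"topic": ["climate", "energy"]}}`, and `set_tag_values` then overwrites that list in a single step.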

Graph networks generation and retrieval

  • store.get_webentities_network_json : Returns a json representation of the whole network between linked webentities

  • store.generate_webentities_network_gexf : Generates a GEXF local file representing the whole network between linked webentities

  • store.get_webentity_nodelinks_network_json : Returns a json representation of the network between linked nodes for a specific webentity if set, or for the whole memory structure otherwise

    • webentity_id : string id of the webentity whose nodes to represent (default : None)
    • include_frontier : boolean True to include foreign links to nodes from other webentities or False to get only links within the webentity (default : False)
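
The effect of `include_frontier` can be sketched as a filter over links. This is only an illustration of the semantics described above: the data shapes (links as source/target pairs, a node-to-webentity mapping) and the name `select_links` are assumptions, and the JSON actually returned by the core is not specified on this page.

```python
def select_links(links, node_owner, webentity_id, include_frontier=False):
    """Keep links inside a webentity, plus frontier links when requested."""
    selected = []
    for source, target in links:
        source_inside = node_owner.get(source) == webentity_id
        target_inside = node_owner.get(target) == webentity_id
        if source_inside and target_inside:
            # Both ends belong to the webentity: always kept.
            selected.append((source, target))
        elif include_frontier and (source_inside or target_inside):
            # One end belongs to another webentity: kept only on request.
            selected.append((source, target))
    return selected


# Hypothetical nodes and owners, for illustration only.
links = [("a", "b"), ("a", "x"), ("x", "y")]
node_owner = {"a": "we1", "b": "we1", "x": "we2", "y": "we2"}
inner = select_links(links, node_owner, "we1")
with_frontier = select_links(links, node_owner, "we1", include_frontier=True)
```

In this toy data the link between "x" and "y" never appears for "we1", since neither end belongs to that webentity.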