Memory Structure Interface

To interface the core, written in Python, with the Lucene memory structure, written in Java, we will use Thrift (http://thrift.apache.org/). Closely related to: Core specification

We will write the interface specification using Thrift syntax.

Table of Contents

- inserting pages
  - create_pages_cache
  - index_pages_from_cache
  - get_precision_exceptions_from_cache
  - get_web_entities_flags_from_cache
  - delete_page_cache
  - implementation in thrift
- Page_links
- Web_entity_links
  - when to trigger the generation of web entity links ?
  - how to trigger the generation of web entity links ?
- specifications as a Thrift file

inserting pages

Inserting new pages into the memory structure requires the core and the memory structure to exchange information. To keep this efficient, we want to minimize the number of exchanges.

Pages will therefore be inserted in batches. The core first sends all the page information at once to the memory structure, which stores it in a cache, and then asks the memory structure to apply methods to this cache.

create_pages_cache

parameters : list of Page, a Page being an LRU plus an is_node boolean flag
returns : the cache id

Stores the pages inside a cache. Returns a cache id which is the unique reference for the list of pages in the memory structure.

index_pages_from_cache

parameters : cache_id
returns : acknowledgment

Inserts the pages from a cached list into the index.

get_precision_exceptions_from_cache

parameters : cache_id
information returned : list of lru_prefixes

Extracts the pages (and not the nodes) from the cache. For each page, the memory structure looks for precision exceptions in the LRU_item index:

precision_limit_lru_prefixes=Array()
for lru of included_pages which are not node (is_node==false):
- look for precision_limit exception on the LRU branch
- if found one :
  - precision_limit_lru_prefixes.append( lru_prefix_of_precision_limit_node)
return precision_limit_lru_prefixes
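
The pseudocode above could be sketched in Python as follows. This is only an illustration under stated assumptions: an LRU is taken to be a string of stems each terminated by "|", the precision exceptions are given as a plain set of LRU prefixes (the real lookup would go through the LRU_item index), and duplicate prefixes are reported once, which the pseudocode does not specify.

```python
def lru_prefixes(lru):
    """Yield every prefix of an LRU, longest first.

    Assumption: stems are terminated by '|', e.g. 's:http|h:fr|h:example|'.
    """
    stems = [stem for stem in lru.split("|") if stem]
    for length in range(len(stems), 0, -1):
        yield "|".join(stems[:length]) + "|"


def get_precision_exceptions(pages, precision_exception_prefixes):
    """Return the precision-limit LRU prefixes found on the branches of the pages.

    pages: iterable of (lru, is_node) tuples (hypothetical representation).
    precision_exception_prefixes: set of LRU prefixes flagged as exceptions.
    """
    found = []
    for lru, is_node in pages:
        if is_node:
            continue  # nodes are skipped, only real pages are checked
        for prefix in lru_prefixes(lru):
            if prefix in precision_exception_prefixes:
                if prefix not in found:  # assumption: report each prefix once
                    found.append(prefix)
                break
    return found
```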

get_web_entities_flags_from_cache

parameters : cache_id
information returned : list of triples (flag_type, lru_prefix, regexp)
- flag_type : web_entity_creation_rule OR no_web_entity
- lru_prefix : the LRU prefix of the flag, or the LRU of the page for the no_web_entity flag (see below)
- regexp : the regexp of the web_entity_creation_rule (null for the no_web_entity flag)

The core will use this function to make sure that, according to the web entity creation rules, the correct web entity exists to hold each page.

The flags are :

- web entity creation rule : this flag indicates that a web entity creation rule must be tested when inserting a new LRU that shares the LRU prefix of the flag.
- no web entity found : this flag must be created by the memory structure when neither a web entity nor a web entity creation rule was found on an LRU to be inserted.

To retrieve web entity flags, the memory structure should behave as described below:

for page in included_pages:
- retrieve the first flag on the page.lru branch which is web_entity or web_entity_creation_rule (i.e. the one with the longest lru_prefix, in number of stems)
- if flag == web_entity :
  - do nothing
- else if flag == web_entity_creation_rule :
  - add_flag(flag="web_entity_creation_rule", lru_prefix_of_flag, regexp_of_web_entity_creation_rule)
- else if no flag found :
  - add_flag(flag="no_web_entity", page.lru, Null)
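
As an illustration, the flag retrieval could look like the Python sketch below. The names Flag and get_web_entities_flags, the "|"-terminated stem format, and the in-memory set and dict standing in for the index are all assumptions made for the example. When a prefix carries both a web entity and a creation rule, the sketch arbitrarily lets the web entity win.

```python
from typing import NamedTuple, Optional


class Flag(NamedTuple):
    flag_type: str         # "web_entity_creation_rule" or "no_web_entity"
    lru_prefix: str        # prefix of the flag, or the page LRU for no_web_entity
    regexp: Optional[str]  # regexp of the creation rule, None otherwise


def lru_prefixes(lru):
    """Yield every prefix of an LRU, longest first (stems assumed '|'-terminated)."""
    stems = [stem for stem in lru.split("|") if stem]
    for length in range(len(stems), 0, -1):
        yield "|".join(stems[:length]) + "|"


def get_web_entities_flags(page_lrus, web_entity_prefixes, creation_rules):
    """Return the flags the core has to handle before indexing the pages.

    web_entity_prefixes: set of LRU prefixes already declared as web entities.
    creation_rules: dict mapping an LRU prefix to the regexp of its rule.
    """
    flags = []
    for lru in page_lrus:
        for prefix in lru_prefixes(lru):  # the longest prefix is tested first
            if prefix in web_entity_prefixes:
                break  # a web entity already holds this page: nothing to do
            if prefix in creation_rules:
                flags.append(Flag("web_entity_creation_rule", prefix,
                                  creation_rules[prefix]))
                break
        else:
            # no web entity and no creation rule anywhere on the branch
            flags.append(Flag("no_web_entity", lru, None))
    return flags
```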

delete_page_cache

parameter : cache_id
returns : status

Discards a page cache (or perhaps saves it to disk in debug mode?)
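
To show how the five calls fit together on the core side, here is a minimal Python sketch. InMemoryMemoryStructure is a hypothetical stand-in for the Thrift client, not the real Java service: its two get_* methods are stubs, the "ok" acknowledgment and the 0/1 status codes are invented, and the order of the calls between cache creation and deletion is only one plausible choice.

```python
import uuid


class InMemoryMemoryStructure:
    """Hypothetical stand-in for the Thrift client of the memory structure."""

    def __init__(self):
        self.caches = {}    # cache_id -> list of (lru, is_node)
        self.index = set()  # LRUs considered indexed

    def create_pages_cache(self, pages):
        cache_id = uuid.uuid4().hex
        self.caches[cache_id] = list(pages)
        return cache_id

    def get_precision_exceptions_from_cache(self, cache_id):
        return []  # stub: a real service would inspect the LRU_item index

    def get_web_entities_flags_from_cache(self, cache_id):
        return []  # stub: a real service would return the flags

    def index_pages_from_cache(self, cache_id):
        self.index.update(lru for lru, _is_node in self.caches[cache_id])
        return "ok"  # assumed acknowledgment value

    def delete_page_cache(self, cache_id):
        # assumed status codes: 0 if the cache existed, 1 otherwise
        return 0 if self.caches.pop(cache_id, None) is not None else 1


def insert_batch(memory, pages):
    """Send the pages once, then work on the cache by id only."""
    cache_id = memory.create_pages_cache(pages)
    try:
        exceptions = memory.get_precision_exceptions_from_cache(cache_id)
        flags = memory.get_web_entities_flags_from_cache(cache_id)
        # The core would act on exceptions and flags here,
        # for example by creating the missing web entities.
        memory.index_pages_from_cache(cache_id)
    finally:
        memory.delete_page_cache(cache_id)  # always release the cache
    return exceptions, flags
```

The page list crosses the interface only once; every later call carries just the cache id, which is the point of the batch design.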

implementation in thrift

The above interface is implemented in a Thrift file, reproduced below. I have made one change: the class WebEntityInfo is introduced for the objects returned by getWebEntitiesFromCache, which is better OO practice than returning lists of string triples.

The Thrift file is:

# HCI memory structure - core interface

# Objects as structures

namespace java fr.sciencespo.medialab.hci.memorystructure.thrift

struct MetadataItem {

  1: string id,
  2: string name,
  3: bool multiple,
  4: string dataType,
  5: string defaultValue

}

struct LRUItem {

  1: string id,
  2: string url,
  3: string lru,
  4: string crawlerTimestamp,
  5: i32 httpStatusCode,
  6: i32 depth,
  7: string errorCode,
  8: bool isFullPrecision = false,
  9: bool isNode,
  10: bool isPage,
  11: bool isWebEntity,
  12: list<MetadataItem> metadataItems

}

struct NodeLink {

  1: string id,
  2: string sourceLRU,
  3: string targetLRU,
  4: i32 weight=1

}

struct WebEntityInfo {

  1: string id,
  2: string flagType,
  3: string lruPrefix,
  4: string regExp

}

# Services

service MemoryStructure {

// heikki: implementation of the new interface described on http://jiminy.medialab.sciences-po.fr/hci/index.php/Memory_structure_interface

// create_pages_cache
/**
 * @param 1 lruItems : list of LRUItem objects
 * @return id of the created cache
 */
string createPagesCache(1:list<LRUItem> lruItems),

// index_pages_from_cache
/**
 * @param 1 cacheId : id of the cache
 * @return acknowledgement
 */
string indexPagesFromCache(1:string cacheId),

// get_precision_exceptions_from_cache
/**
 * @param 1 cacheId : id of the cache
 * @return list of lru prefixes
 */
list<string> getPrecisionExceptionsFromCache(1:string cacheId),

// get_web_entities_flags_from_cache
/**
 * @param 1 cacheId : id of the cache
 * @return list of WebEntityInfo
 */
list<WebEntityInfo> getWebEntitiesFromCache(1:string cacheId),

// delete_page_cache
/**
 * @param 1 cacheId : id of the cache
 * @return status
 */
i32 deleteCache(1:string cacheId),
 
 
// heikki: does it mean the rest of the earlier interface below is no longer necessary?

// LRUItems
/**
 * @param 1 lruItems : list of LRUItem objects
 * @return true if success, false else
 */
bool storeLRUItems(1:list<LRUItem> lruItems),

// NodeLinks
/**
 * @param 1 nodeLinks : list of NodeLink objects
 * @return true if success, false else
 */
bool storeNodeLinks(1:list<NodeLink> nodeLinks),

// WebEntity
/**
 * @param 1 lruItem : the lruItem to be marked as WebEntity
 * @return true if success, false else
 */
bool storeWebEntity(1:LRUItem lruItem),

}

Page_links

Page links are only kept if they target a node page, as described in the Precision limit page. The set of page links held by the memory structure is therefore only partial. The complete set is only available in the raw data level memory.

The system will nevertheless offer the possibility to recompute the page link information when one of the two Precision limit settings changes:

- the global Precision_limit setting
- the local FULL_PRECISION setting

If one of these settings is changed after harvesting, we would have to trigger the recomputation of the page link information.

I propose we delegate this to the raw data level which could use the memory structure API described here to update information accordingly.

Web_entity_links

The web entity links are the aggregation of page links, depending on the web entity declarations. Web entity links are thus only meta information: accessible, but not editable.
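
A minimal Python sketch of this aggregation is given below, under several assumptions: web entities are represented as a dict mapping an id to a set of LRU prefixes, a page belongs to the web entity with the longest matching prefix, links whose source or target has no web entity are ignored, and links inside a single web entity are kept. None of these choices is fixed by this page.

```python
from collections import Counter


def find_web_entity(lru, web_entities):
    """Return the id of the web entity with the longest prefix matching lru.

    web_entities: dict mapping a web entity id to a set of LRU prefixes.
    Returns None when no prefix matches.
    """
    best_id, best_length = None, -1
    for entity_id, prefixes in web_entities.items():
        for prefix in prefixes:
            if lru.startswith(prefix) and len(prefix) > best_length:
                best_id, best_length = entity_id, len(prefix)
    return best_id


def aggregate_web_entity_links(page_links, web_entities):
    """Sum page link weights per (source web entity, target web entity) pair.

    page_links: iterable of (lru_source, lru_target, weight) tuples.
    """
    weights = Counter()
    for lru_source, lru_target, weight in page_links:
        source = find_web_entity(lru_source, web_entities)
        target = find_web_entity(lru_target, web_entities)
        if source is None or target is None:
            continue  # assumption: links outside any web entity are skipped
        weights[(source, target)] += weight
    return dict(weights)
```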

The question to be discussed in this section is how and when to trigger the generation of web entity links.

when to trigger the generation of web entity links ?

The web entity link information should be reloaded after either of these events:

1. harvesting, i.e. new Page_link insertion/modification

   There is no need to reload web entity links at each Page_link insertion.
   We could reload link information only once enough Page_links have been inserted or modified.
   If we add a timestamp to the Page_link object and store the last aggregation time, we can count the Page_link modifications and trigger the aggregation only above a threshold.
   To be discussed!

2. web entity modification

   Each web entity modification should change the web entity links accordingly. We could then go for an on-demand aggregation trigger, which launches the work only when the functions related to web_entity_links are called.

The event "change in Precision_Limit settings (global and local)" is not listed here because it should actually trigger a Page_link reload from the raw data level, which falls back to event 1.
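
The threshold idea for event 1 could be sketched as below. The function name, the numeric timestamps and the strict "above the threshold" comparison are assumptions for the example; the actual policy is still to be discussed.

```python
def should_aggregate(page_link_timestamps, last_aggregation_time, threshold):
    """Decide whether enough Page_links changed since the last aggregation.

    page_link_timestamps: modification timestamps of the Page_link objects.
    last_aggregation_time: timestamp stored at the end of the last aggregation.
    threshold: number of modifications that must be exceeded.
    """
    modified = sum(1 for timestamp in page_link_timestamps
                   if timestamp > last_aggregation_time)
    return modified > threshold
```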

how to trigger the generation of web entity links ?

on demand by the core.

A method compute_webentitylinks() should be proposed by the memory structure.

This method will first insert a new reference in the aggregation task index. This reference will be identified by an insertion_timestamp and will contain a status ("done" or "ongoing").

The memory structure will then recompute only the node_links which were inserted after the last "done" aggregation task, but before the insertion_timestamp of the current task.

It should also recompute the links attached to any node contained in a web entity which has changed (or was created) in the same time frame.
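
The task bookkeeping described above could be sketched in Python as follows. The AggregationTasks class, the tuple layout of node_links and the half-open time window are assumptions; the sketch only selects the node_links to recompute and leaves out both the aggregation itself and the handling of changed web entities.

```python
import time


class AggregationTasks:
    """Hypothetical in-memory version of the aggregation task index."""

    def __init__(self):
        self.tasks = []  # each task: {"insertion_timestamp": ..., "status": ...}

    def last_done_timestamp(self):
        done = [task["insertion_timestamp"] for task in self.tasks
                if task["status"] == "done"]
        return max(done) if done else float("-inf")

    def start(self, now):
        task = {"insertion_timestamp": now, "status": "ongoing"}
        self.tasks.append(task)
        return task


def compute_webentitylinks(tasks, node_links, now=None):
    """Return the node_links that fall in the window of the current task.

    node_links: list of (source, target, insertion_timestamp) tuples.
    """
    now = time.time() if now is None else now
    since = tasks.last_done_timestamp()
    task = tasks.start(now)
    # Half-open window [since, now): a link is picked up by exactly one task.
    to_recompute = [link for link in node_links if since <= link[2] < now]
    # ... the actual aggregation of to_recompute would happen here ...
    task["status"] = "done"
    return to_recompute
```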

specifications as a Thrift file

# HCI memory structure - core interface

# Objects as structures

struct Page {

  1: string lru,
  2: bool full_precision = false,
  3: bool is_node

}

struct Web_entity {

  1: i32 id,
  2: optional string name,
  3: set<string> lru_prefixes,
  4: map<string,string> metadata

}

struct Web_entity_rule {

  1: string lru_prefix,
  2: string regex,
  3: optional string comments

}

struct Page_link {

  1: string lru_source,
  2: string lru_target,
  3: optional i32 weight=1

}

struct Web_entity_link {

  1: i32 web_entity_source,
  2: i32 web_entity_target,
  3: i32 weight=1

}

# Services

service memory_structure {

// PAGES

/**
 * This method creates a list of pages in batch, based on a list of LRUs. Before creating the Page object, the memory structure has to check whether a web entity has to be created.
 * See the "inserting pages" section above for more information about this algorithm.
 * @param 1 pages : list of Page objects
 * @return true if success, false else
 */
bool add_pages(1: list<Page> pages),

/**
 * Return the list of pages contained in a web entity. If no webentityId is given, list all pages.
 * @param webentityId : a web entity identification
 * @return list of pages
 */

/**
 * Return the number of pages.
 * @return number of pages indexed
 */
i32 count_pages(),

// WEB ENTITY

/**
 * Add a new web entity into the index if it does not already exist.
 * @param web_entity_to_add : the web entity to add
 * @param name : a user-readable name for the web entity
 * @return true if the WebEntity has been inserted, false if it was already present.
 */
bool add_web_entity(1: Web_entity web_entity_to_add),

/**
 * Try to update a WebEntity already present in the database.
 * Use this method to update the URL aliases of an already present WebEntity.
 * @param web_entity : the web_entity to update
 * @return true if the web_entity exists in the database, false else.
 */

/**
 * Remove a WebEntity from the database and propagate the deletion to WebEntityLinks(?)
 * @param web_entity_to_remove : the WebEntity to remove
 * @return true if the WebEntity has been removed from the database
 * @throws IOException
 */
bool remove_web_entity(1: Web_entity web_entity_to_remove),

/**
 * Find the deepest WebEntity that matches the specified page.
 * @param page : the target page
 * @return the deepest WebEntity that matches this page, or the Universal Web Entity if no match was found (to be discussed)
 * @throws IOException
 */
Web_entity find_web_entity(1: Page page),

/**
 * List all the web entities of the corpus.
 * @return a list of web entities
 */
list<Web_entity> list_web_entities(),

/**
 * Get the number of WebEntities inserted in this database.
 * @return the number of WebEntities.
 */
i32 count_web_entities(),

// WEB ENTITY RULES

/**
 * Add a Web_entity_rule object.
 * @param lru_prefix : the lru_prefix that selects the LRUs to be tested against the regexp
 * @param regex : the regular expression that returns as group 1 the new lru_prefix to be used as a web_entity
 * @return true if added, false else
 */
bool add_web_entity_rule(1: string lru_prefix, 2: string regex),

/**
 * Remove a Web_entity_rule object.
 * @param lru_prefix_to_remove : the lru_prefix to be removed
 * @return true if removed, false else
 */
bool remove_web_entity_rule(1: string lru_prefix_to_remove),

// well actually this method find_web_entity_rule could remain private in the memory structure?

/**
 * Find the deepest web_entity_rule that matches the specified page:
 *   for all the lru_prefix that match the beginning of page.lru
 *     match with the regex
 *   return the longest matching result
 * @param page : the page to be tested
 * @return the deepest lru_prefix (regex match result) that matches this page, or ??? exception when not found
 * @throws NOT FOUND ? ???
 */
string find_web_entity_rule(1: Page page),

// PAGE LINKS

// Keep in mind that only links toward node pages are kept.
// Thus the page links object reflects only part of the real information.
// The complete set of links is only available through the raw data level.
// See the Page_links section above.

/**
 * Try to add a link between pages.
 * If the link is already present, add the weight to the existing link.
 * @param link : the link to add
 * @return true if the 2 URLs exist in the database, false else.
 */
bool add_page_Links(1: Page_link link),

/**
 * Find the outgoing links of the specified Page.
 * @param source : the source Page
 * @return the outgoing links.
 * @throws IOException
 */
list<Page_link> find_out_links(1: Page source),

/**
 * Find the incoming links of the specified Page.
 * @param target : the target Page
 * @return the incoming links.
 * @throws IOException
 */
list<Page_link> find_in_page_links(1: Page target),

// WEB ENTITY LINK

// Keep in mind that web entity links are an aggregation of page links.
// Therefore there is no method to add such links.
// We could add a compute_web_entity function here, but it would perhaps be cleverer to delegate this to the internal memory structure (see the Web_entity_links section above).

/**
 * Find the outgoing links of the specified Web_entity.
 * @param source : the source Web_entity
 * @return the outgoing links.
 */
list<Web_entity_link> find_out_web_entity_links(1: Web_entity source),

/**
 * Find the incoming links of the specified Web_entity.
 * @param target : the target Web_entity
 * @return the incoming links.
 */
list<Web_entity_link> find_in_web_entity_links(1: Web_entity target),

}
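
The lookup described in the find_web_entity_rule comment of the service above could be sketched in Python as follows. The dict of rules, the example LRU format and the choice to return None when nothing matches are assumptions; the specification leaves the not-found behaviour open. Each regex is expected to define a group 1, as stated for add_web_entity_rule.

```python
import re


def find_web_entity_rule(page_lru, rules):
    """Return the longest group-1 match among the rules applying to page_lru.

    rules: dict mapping an lru_prefix to the regex of its Web_entity_rule.
    Returns None when no rule matches (assumption, see lead-in).
    """
    best = None
    for lru_prefix, regex in rules.items():
        if not page_lru.startswith(lru_prefix):
            continue  # the rule only applies below its lru_prefix
        match = re.match(regex, page_lru)
        if match and (best is None or len(match.group(1)) > len(best)):
            best = match.group(1)
    return best
```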
