System Collections #711

Open
kaplun opened this Issue · 7 comments

6 participants

@kaplun
Collaborator

Originally on 2011-06-27

For general repository health monitoring, and to refactor certain computationally intensive algorithms that are spread around the Invenio codebase by pre-computing special collections, it would be great to enhance WebColl, and the WebSearch module in general, to support a new type of collection called System Collections. These collections would behave like normal collections in everything but their definition, which can't be expressed by a normal query and must therefore be specified directly in the code base. These System Collections are:
Empty Records::
containing all the records that have an ID but nothing else (i.e. no XM)
Deleted Records::
containing all the records that have DELETED in 980__% (which is the convention in Invenio to mark a record as deleted).
Restricted Records::
containing all the records that belong to at least one restricted collection (this would greatly speed up the runtime computation for checking authorizations)
Classified Records::
containing all the records that belong to at least one real collection (a record that does not belong to any such collection will not be available to anyone but its owner or a superadmin)
Unclassified Records::
this is the counterpart of Classified Records, and will contain all the records that do not belong to any collection and are therefore accessible only to their owners or to a superadmin
Existing Records::
this is the union of the Classified and Unclassified Records collections
Public Records::
this will be a sort of alias for the Home collection, as it will contain all the records that are searchable from Home and are a priori discoverable by a crawler.
Note that a new record will initially not belong to any of the above collections (as webcoll will still need to be run). After webcoll has classified it, it will belong either to the Classified Records collection or to the Unclassified Records collection.
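The membership-based definitions above reduce to plain set algebra over record IDs. A minimal sketch, assuming built-in Python sets stand in for whatever record-ID set type the real code would use (Invenio would more likely use intbitset); the function name and its inputs (`all_recids`, `collections`, `restricted_names`) are hypothetical, and whether empty or deleted records should be excluded from `all_recids` is not specified in this ticket:

```python
def compute_system_collections(all_recids, collections, restricted_names):
    """Derive membership-based system collections (illustrative only).

    all_recids: set of record IDs under consideration (hypothetical input).
    collections: dict mapping a real collection name to its set of record IDs.
    restricted_names: set of names of the restricted collections.
    """
    classified = set()
    restricted = set()
    for name, recids in collections.items():
        # A record is classified if it belongs to at least one real collection.
        classified |= recids
        # A record is restricted if it belongs to at least one restricted one.
        if name in restricted_names:
            restricted |= recids
    # Ignore IDs that are not part of the universe we were given.
    classified &= all_recids
    restricted &= all_recids
    unclassified = all_recids - classified
    return {
        "Classified Records": classified,
        "Unclassified Records": unclassified,
        "Restricted Records": restricted,
        "Existing Records": classified | unclassified,
    }


result = compute_system_collections(
    all_recids={1, 2, 3, 4, 5},
    collections={"Articles": {1, 2}, "Theses": {2, 3}},
    restricted_names={"Theses"},
)
```

Here records 4 and 5 end up in Unclassified Records because no real collection contains them, and record 2 counts as restricted even though it also sits in a public collection.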


In order to make these collections safe, they will actually be given an improbable name such as "System Collection -- Empty Records" and be treated in special ways both by WebColl and by the WebSearch Admin Interface (e.g. it should not be possible to delete such a collection, and if an admin attaches these collections as real children of real collections, webcoll must ignore them in the computation of the real collection).
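The two safeguards mentioned here could look roughly like this. It is only a sketch: the prefix comes from the example name above, while the helper names (`is_system_collection`, `real_collection_recids`, `can_delete_collection`) and the dict-of-sets representation of children are assumptions, not existing WebColl or WebSearch Admin functions:

```python
# Prefix taken from the example name in this ticket.
SYSTEM_COLLECTION_PREFIX = "System Collection -- "


def is_system_collection(name):
    """True if the collection name marks a system collection."""
    return name.startswith(SYSTEM_COLLECTION_PREFIX)


def real_collection_recids(own_recids, children):
    """Union a real collection's own records with those of its real children.

    children: dict mapping child collection name -> set of record IDs.
    System collections attached as real children are skipped.
    """
    recids = set(own_recids)
    for child_name, child_recids in children.items():
        if is_system_collection(child_name):
            continue
        recids |= child_recids
    return recids


def can_delete_collection(name):
    """The admin interface would refuse to delete system collections."""
    return not is_system_collection(name)


merged = real_collection_recids(
    {1},
    {"Articles": {2}, "System Collection -- Empty Records": {99}},
)
```

In this example record 99 is left out of `merged`, since it only arrives through a system collection that was wrongly attached as a real child.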


tabcreate.sql can come with a default configuration in which there is an unattached System Collection with all of the above collections attached to it as virtual collections.

@kaplun kaplun self-assigned this
@tiborsimko
Owner

Originally on 2011-06-27

One more special collection would be Merged records that will list all the records that used to be independent but that cataloguers merged as dupes in BibMerge, via 970__d. This is a special category of deleted records that may be useful to single out. See also #514.

Speaking of terminology, "classified records" ("unclassified records") may create an unwanted link to BibClassify, so perhaps "alive records" ("zombie records") or "attributed records" ("unattributed records") would be better, as we mused about originally.

@jrbl
Collaborator

Originally on 2011-06-27

I think this ticket is a really good idea.

Perhaps if we want the system collections to be unlikely to have namespace conflicts, we should set and fetch their names via CFG variables (which probably shouldn't be in the normal place), and the names could be even less likely to produce conflicts. A SHA1 of the system time and CFG variable name, for example.
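A rough sketch of that naming idea, using only `hashlib` and `time` from the standard library. The function name and the CFG variable name in the usage line are hypothetical, and the generated name would have to be computed once and stored, because it changes on every call with a fresh timestamp:

```python
import hashlib
import time


def generate_system_collection_name(cfg_variable_name, now=None):
    """Return a 40-character hex name derived from the time and a CFG name."""
    if now is None:
        now = time.time()
    seed = "%r:%s" % (now, cfg_variable_name)
    return hashlib.sha1(seed.encode("utf-8")).hexdigest()


# Hypothetical CFG variable name, fixed timestamp for reproducibility.
name = generate_system_collection_name(
    "CFG_WEBSEARCH_SYSTEM_COLLECTION_EMPTY_RECORDS", now=1309132800.0
)
```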

@jeromecaffaro
Collaborator

Originally on 2011-06-28

I can also imagine (if possible and useful) that the following "system" collections would be good candidates:

Authority Records::
containing all the authority records (dunno the criteria yet)
Bibliographic Records::
containing all the bibliographic records (dunno the criteria yet)

(Just throwing it there, following some IRL musings with wiki:Team/ChristopherDickinson on authority records, though the authority collections might be handled in a slightly more flexible way, and might be out of the scope of this ticket)

@kaplun
Collaborator

Originally on 2011-06-28

Hi Joe,

Replying to [comment:2 jblayloc]:

I think this ticket is a really good idea.

Perhaps if we want the system collections to be unlikely to have namespace conflicts, we should set and fetch their names via CFG variables (which probably shouldn't be in the normal place), and the names could be even less likely to produce conflicts. A SHA1 of the system time and CFG variable name, for example.

Well, these collections will be defined from the beginning in Invenio, so each of them will have a well-defined name, which doesn't need to change all the time. In principle, calling them System Collection -- FOO Records should be geeky enough to avoid any conflict with admins, but yes, a CFG_WEBSEARCH_SYSTEM_COLLECTIONS variable (statically stored in search_engine_config.py) will make it easy to write checks in the Admin interface to prevent admins from using these special names (which is definitely unlikely :-) ).
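Such a check could be as small as the following sketch. Only the variable name CFG_WEBSEARCH_SYSTEM_COLLECTIONS comes from this comment; its shape (a tuple of full collection names) and the `check_new_collection_name` helper are assumptions:

```python
# Assumed shape: a static tuple of the reserved names, built from the
# collections listed in the ticket description.
CFG_WEBSEARCH_SYSTEM_COLLECTIONS = (
    "System Collection -- Empty Records",
    "System Collection -- Deleted Records",
    "System Collection -- Restricted Records",
    "System Collection -- Classified Records",
    "System Collection -- Unclassified Records",
    "System Collection -- Existing Records",
    "System Collection -- Public Records",
)


def check_new_collection_name(name):
    """Return an error message if the name is reserved, otherwise None."""
    cleaned = name.strip()
    if cleaned in CFG_WEBSEARCH_SYSTEM_COLLECTIONS:
        return "'%s' is reserved for a system collection" % cleaned
    return None


ok = check_new_collection_name("Articles")
error = check_new_collection_name(" System Collection -- Empty Records ")
```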

@kaplun
Collaborator

Originally on 2014-01-22

The current implementation suffers from performance issues (due to assigning recids one by one to system collections rather than using set theory and intbitset).

I have rebased the current implementation against the latest master and will work towards making it production ready. (See: sam/711-system-collections)

@kaplun kaplun modified the milestone: v1.x, v1.2.0
@jirikuncar
Owner

This could be done via special indexes on calculated record fields. See the implementation of collection indexes in #2587, related to RFC #2638.
