
BibUpload: optional use of bibxxx tables #671

tiborsimko opened this Issue · 7 comments



Originally on 2011-06-14

During bibupload, the incoming record is broken according to MARC tags
into many bibxxx tables (bib10x, bib11x, etc) which results in
many SQL queries being done by bibupload. The advantage of doing so is
that the end users can then simply search in any MARC
tag. The disadvantage is that the uploading step takes time,
and that we are preparing indexes that may perhaps not even be used by
the end users at all. (Since they typically search in logical field
indexes, say firstauthor:ellis, not in physical MARC tags, say

In certain situations, it would be better not to create these indexes
during upload time, but to defer handling them to the indexing time.
(Especially when using an external indexer such as Solr for the record
metadata.)

For this, it would be good to introduce a new configuration option
called say CFG_BIBUPLOAD_USE_BIBXXX that would be True by default
but that could optionally be set to False on a per-site basis. When
set to False, the stage 4 of bibupload (=filling of bibxxx tables)
would not be executed.
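
A minimal sketch of how such a switch could gate the bibxxx stage. Only the name CFG_BIBUPLOAD_USE_BIBXXX comes from this proposal; the function upload_record, the stage labels and the fill_bibxxx callback are hypothetical and do not reflect the actual bibupload code.

```python
# Hypothetical sketch: only CFG_BIBUPLOAD_USE_BIBXXX is taken from the proposal.
CFG_BIBUPLOAD_USE_BIBXXX = True  # proposed default: keep the current behaviour


def upload_record(record, fill_bibxxx, use_bibxxx=CFG_BIBUPLOAD_USE_BIBXXX):
    """Run the upload stages, skipping the bibxxx stage when it is disabled.

    `fill_bibxxx` stands in for stage 4 (filling of bibxxx tables).
    Returns the list of stage labels that were actually run.
    """
    stages_run = ["store_marcxml"]  # illustrative label for the earlier stages
    if use_bibxxx:
        fill_bibxxx(record)
        stages_run.append("fill_bibxxx")
    return stages_run
```

With use_bibxxx=False the callback is never invoked, which is the whole point of the option: the per-tag SQL work simply does not happen at upload time.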

This would result in bibupload speed-ups that can be illustrated by
the following example taken from INSPIRE-sized database (1M of

  • example record CERN-TH-6002-91 from INSPIRE TEST (record ID 315385)

  • timings to replace it, stage 4 enabled:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006    4.112    4.112
      256    0.003    0.000    4.095    0.016
        1    0.001    0.001    2.632    2.632
      109    0.001    0.000    2.605    0.024
        1    0.000    0.000    1.255    1.255
  • timings to replace it, stage 4 disabled:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.020    0.020
       37    0.001    0.000    0.017    0.000
        1    0.000    0.000    0.006    0.006

As can be seen, the upload is faster by about two orders of
magnitude (4.112 s versus 0.020 s cumulative time), since we are not
pre-creating those huge and possibly non-useful bibxxx indexes.
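
Timings in this layout (ncalls, tottime, cumtime) are what Python's cProfile prints. A minimal sketch of collecting such a report for any callable follows; profile_call is a hypothetical helper, and how the original numbers were actually collected is not stated here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Profile one call and return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()
```

Running the same upload once with the bibxxx stage enabled and once with it disabled, and comparing the top cumtime rows, gives a comparison of the kind shown above.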

*Important note:* while it is simple to introduce such a
CFG_BIBUPLOAD_USE_BIBXXX variable for record uploading processes,
this variable should be propagated to other Invenio modules such as
searcher/indexer that should read record metadata from pre-stored
MARCXML formats (see table bibfmt) rather than from bibxxx tables.
When bibxxx tables are not in use, other Invenio modules are not
free to rely on the existence of bibxxx tables anymore. So this
task is really bigger than it may seem. The settings of
CFG_BIBUPLOAD_USE_BIBXXX should therefore be progressively
propagated to all the Invenio modules that take the existence of
bibxxx for granted, starting with the most important modules
(indexer, searcher, editor, check for deleted records, etc).
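
To illustrate what "reading record metadata from pre-stored MARCXML" could look like for a consumer module, here is a minimal sketch using only the standard library. The helper get_field_values and the sample record are hypothetical, the MARCXML is assumed to carry no XML namespace, and fetching the MARCXML string from the bibfmt table is left out.

```python
import xml.etree.ElementTree as ET


def get_field_values(marcxml, tag, code):
    """Return all subfield values for a MARC tag and subfield code.

    Assumes un-namespaced MARCXML with <datafield tag="..."> elements
    containing <subfield code="..."> children.
    """
    root = ET.fromstring(marcxml)
    values = []
    for datafield in root.iter("datafield"):
        if datafield.get("tag") != tag:
            continue
        for subfield in datafield.findall("subfield"):
            if subfield.get("code") == code:
                values.append(subfield.text or "")
    return values


# Illustrative record, not taken from any real database.
SAMPLE_MARCXML = """
<record>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Ellis, J</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">PREPRINT</subfield>
  </datafield>
</record>
"""
```

For example, get_field_values(SAMPLE_MARCXML, "100", "a") yields the first-author value without touching any bibxxx table.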


Originally on 2011-06-14

I forgot to add a possibly obvious thing that in order to propagate the elimination of bibxxx tables to other Invenio modules faster, we can keep some of the most important ones (bib03x, bib97x, bib98x) so that the filtering of incoming records and handling of deleted records and collections and whatnot would not necessitate any codebase change and could be therefore kept as it is now. (So we would keep "small and useful" bibxxx tables, while we would eliminate only "big and not-so-useful" bibxxx tables, so to speak.)
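
A small sketch of the "keep only some tables" idea. The tag-to-table naming rule (first two digits of the tag plus "x") is an assumption inferred from the table names mentioned in this ticket (bib10x, bib03x, bib97x, bib98x); the helper names are hypothetical.

```python
# Tables proposed above as "small and useful".
KEPT_TABLES = frozenset({"bib03x", "bib97x", "bib98x"})


def bibxxx_table_for_tag(tag):
    """Map a MARC tag such as '035' or '980__a' to its bibxxx table name.

    Assumption: the table is named after the first two digits of the tag.
    """
    if len(tag) < 3 or not tag[:3].isdigit():
        raise ValueError("not a MARC tag: %r" % (tag,))
    return "bib%sx" % tag[:2]


def should_fill_at_upload(tag, kept_tables=KEPT_TABLES):
    """True when the tag's table is one of those still filled at upload time."""
    return bibxxx_table_for_tag(tag) in kept_tables
```

Note that under this assumed rule keeping a table keeps every tag that shares its first two digits, so bib03x would cover both 035 and 037.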


Originally on 2011-06-21

As discussed in the videoconf we had today, I implemented a light bibupload. I made the implementation more configurable than what is described here as we might still need to populate some tables for bibupload to run correctly. For example it might be a good idea to keep populating the tables that contain CFG_BIBUPLOAD_EXTERNAL_SYSNO_TAG and CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG so that bibupload can decide if a record is going to be overwritten.

The implementation relies on a new configuration variable which I called CFG_BIBUPLOAD_BIBXXX_TAGS and that accepts a comma-separated list of MARC tags. If left empty (default) then all tags will be stored and Invenio will run as normal.

Commit on Github


Originally on 2011-07-12

First version of this is available on my Github:

The configuration variable CFG_BIBUPLOAD_BIBXXX_TAGS is a comma-separated list of tags which are handled at upload time. It is recommended to keep storing 035, 037, 970 and 980 to allow bibupload and webcoll to run correctly. Depending on the collections' dbqueries, other tags might be necessary. If this variable is left empty, then Invenio will run normally, i.e. store all tags. This should remain the default behavior.
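
The semantics described here (comma-separated list, empty means "store everything") can be sketched as follows. The function names are hypothetical and this is not the code from the branch; matching on the first three characters of the tag is an assumption consistent with the later comment in this thread that tags are limited to their first three digits.

```python
def parse_bibxxx_tags(value):
    """Parse a comma-separated tag list, ignoring blanks and empty items."""
    return frozenset(item.strip() for item in value.split(",") if item.strip())


def tag_is_stored_at_upload(tag, configured_tags):
    """Decide whether a field is written to bibxxx at upload time.

    An empty configuration means the regular behaviour: store all tags.
    """
    if not configured_tags:
        return True
    return tag[:3] in configured_tags


# Recommended minimum from the comment above.
RECOMMENDED = parse_bibxxx_tags("035, 037, 970, 980")
```

With RECOMMENDED, a field such as 035__a is stored at upload time while 245__a is left for later; with an empty setting both are stored.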

At index time, bibindex first populates the bibxxx tables and then continues with its regular business. I've added my code to bibindex.bibindex_bibxxx_manager and only a call to this in bibindex.bibindex_engine.
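
The upload/index split can be pictured with a toy in-memory model. Everything in it is hypothetical (class name, data layout, the index_func callback); it only illustrates the ordering described above, where deferred tags are written just before regular indexing runs, and says nothing about how bibindex_bibxxx_manager is actually written.

```python
class DeferredBibxxx:
    """Toy model: some tags are stored at upload, the rest at index time."""

    def __init__(self, upload_tags):
        self.upload_tags = frozenset(upload_tags)
        self.stored = []   # (recid, tag, value) rows standing in for bibxxx tables
        self.pending = {}  # recid -> list of deferred (tag, value) pairs

    def upload(self, recid, fields):
        """Store configured tags immediately and defer the others."""
        for tag, value in fields:
            if not self.upload_tags or tag[:3] in self.upload_tags:
                self.stored.append((recid, tag, value))
            else:
                self.pending.setdefault(recid, []).append((tag, value))

    def index(self, recids, index_func):
        """Populate the deferred rows first, then run the regular indexing."""
        for recid in recids:
            for tag, value in self.pending.pop(recid, []):
                self.stored.append((recid, tag, value))
        return index_func(recids)
```

The total amount of work is unchanged in this model; it is only moved from upload time to index time, which matches the timings reported below.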

I tested the regular Invenio and the fast upload and the results are consistent:

  • regular upload took 16 minutes.
  • fast upload took 4 minutes and populating the bibxxx tables took 12 minutes. Same total time but the shorter upload time allows us to move quicker with the initial upload while indexing our metadata in Solr.

Comments are welcome.


Originally on 2011-07-13

Hi Benoit,

Replying to [comment:3 bthiell]:

It is recommended to keep storing 035, 037, 970 and 980 to allow bibupload and webcoll to run correctly.

I think you should also add the OAI-related fields (these depend on invenio.conf). By default these are: 909COo, 909COp and I would like to propose (in a future branch to come) 909COq.



Originally on 2011-07-13

For now the default is to store everything at upload time: see

And by the way, I've limited the tag description to the first 3 digits (i.e. 909 and not 909COo) as in my opinion it doesn't really make sense to populate the bibxxx tables both at upload and index time.



Originally on 2013-02-04

I think it would be cool to have this one in 1.2


This will have to be rebased and re-checked...

@kaplun kaplun modified the milestone: v1.x, v1.2.0
@kaplun kaplun self-assigned this