
Per project management and workflow

Vladimir Kotal edited this page Apr 1, 2020 · 32 revisions

Motivation

OpenGrok can be run with or without projects. A project is simply a directory directly underneath the OpenGrok source root directory. A project can have zero or more Source Code Management repositories underneath. In a setup without projects, all of the data has to be indexed at once. With projects, however, each project has its own index, so it is possible to index projects in parallel, which speeds up the overall process.

When working with project data, there are 2 types of processing that can take a long time:

  • synchronization: updating project data so that it matches its origin
    • usually involves running commands like git pull in all the repositories for given project.
  • indexing: updating the index so that it matches the project data

For some projects, either or both steps can take a long time. Say you have a repository whose origin resides on an NFS share across the Atlantic, so latency is high, and it uses a legacy VCS that operates on individual files rather than on changesets. Such a repository takes a long time (tens of minutes, if not hours) to synchronize. Or there is a repository with a large number of files, so the initial phase of indexing always takes a long time (because the whole project directory tree is scanned for changed files) even though the incremental changes are small.

Or maybe there are lots and lots of projects that exhibit some of these characteristics.

Previously, it was necessary to index the whole source root in order to discover new projects and add them to the configuration. Starting with OpenGrok 1.1, it is possible to manage and index projects separately.

As a result, indexing the complete source root is only necessary when upgrading across OpenGrok versions with incompatible Lucene indexes.

Combine these procedures with the parallel processing tools (see repository synchronization) and you have per-project management with parallel processing.

The following examples assume that the OpenGrok install base is under the /opengrok directory.

Workflow

It is possible to start from scratch, or to take an OpenGrok instance that already indexes all projects in one go and convert it to index projects separately and in parallel.

There are some design choices that need to be dealt with:

  • The indexer either has to discover projects and their repositories during the indexing preparation or it has to know them in advance.
  • The configuration file has to be written whenever a project is added, modified or indexed for the first time.

Thus, when indexing a newly added project, it is necessary to add it to the configuration first, then index it, and lastly make the new configuration persistent.

This page lists all the pieces and how to operate them.

Building blocks

The following assumes that the opengrok-projadm, opengrok-groups and opengrok-config-merge tools are in PATH. You can install them from the opengrok-tools Python package available in the release tarball.

Using the opengrok-projadm tool (which uses the opengrok-config-merge tool and the RESTful API), it is possible to manage the projects.

Configuration backup

The next sections start by suggesting that you back up the current configuration. This can be done, for example, by copying the configuration.xml file aside (it is written by the indexer when the -W indexer option is used) or by taking a file-system snapshot of the directory the configuration is stored in.

This is a precaution in case something goes wrong.
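
The "copy the file aside" variant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name backup_configuration is hypothetical, and the example path /opengrok/etc/configuration.xml is an assumed location rather than something this page prescribes.

```python
import shutil
import time
from pathlib import Path


def backup_configuration(config_path, backup_dir):
    """Copy the configuration file to backup_dir under a timestamped name.

    Hypothetical helper for illustration; returns the path of the copy.
    """
    config_path = Path(config_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{config_path.name}.{stamp}"
    # copy2 preserves the modification time, which helps when comparing backups.
    shutil.copy2(config_path, target)
    return target


# Example call (both paths are assumptions):
# backup_configuration("/opengrok/etc/configuration.xml", "/opengrok/backup")
```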

Adding a project

  • back up the current configuration
  • add the project data to a directory under the source root directory
    • this usually involves running VCS command such as git clone, extracting source code from an archive, etc.
  • perform any necessary authorization adjustments
  • add the project to configuration (also refreshes the configuration on disk):
   opengrok-projadm -b /opengrok -a PROJECT

Indexing a project

The indexing part of the wiki explains how to run the indexer in general.

Running the indexer for a single project has several constraints:

  • scanning for repositories/projects is not wanted - no -P or -S options
    • however, the indexer has to know the project/repository information, so it must either be retrieved from the web application or read from the persistent configuration on disk
  • it is undesirable to write the configuration that is created during the indexer run to disk - no -W option

Thus, running the indexer for a single project may look like this:

curl -s -X GET http://localhost:8080/source/api/v1/configuration -o fresh_config.xml
opengrok-indexer -a /opengrok/dist/lib/opengrok.jar -- \
    -c /usr/local/bin/ctags -U 'http://localhost:8080/source' -o /opengrok/etc/ctags.config \
    -H PROJECT_NAME -R fresh_config.xml PROJECT_NAME

This does not deal with logging to a separate log file.

There is also the opengrok-reindex-project script, which is the recommended way to do this. It downloads a fresh configuration from the web application so that the indexer knows about the indexed project and its repositories. It can also generate the logging configuration on the fly.

Once the project reindex is done, save the configuration. This is necessary so that the indexed flag of the project is persistent; if the configuration is not made persistent and the web application restarts, the project will not be accessible in the web application.

   opengrok-projadm -b /opengrok -r

The -R option can be used with opengrok-projadm to supply the path to a read-only configuration so that it is merged with the current configuration.
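
The add, index, persist order for a new project can be written down as an ordered list of command lines. The sketch below is hypothetical (the function plan_new_project is not part of opengrok-tools); it only assembles the commands shown on this page and does not execute anything.

```python
def plan_new_project(project, base="/opengrok",
                     uri="http://localhost:8080/source"):
    """Return the command lines for onboarding one project, in order.

    Hypothetical helper: it builds argument lists copied from the examples
    on this page and leaves running them to the caller.
    """
    config = "fresh_config.xml"
    return [
        # 1. Add the project to the configuration.
        ["opengrok-projadm", "-b", base, "-a", project],
        # 2. Fetch the current configuration from the web application.
        ["curl", "-s", "-X", "GET", uri + "/api/v1/configuration",
         "-o", config],
        # 3. Index just this project (no -P, -S or -W options).
        ["opengrok-indexer", "-a", base + "/dist/lib/opengrok.jar", "--",
         "-c", "/usr/local/bin/ctags", "-U", uri,
         "-o", base + "/etc/ctags.config",
         "-H", project, "-R", config, project],
        # 4. Persist the configuration so the indexed flag survives a restart.
        ["opengrok-projadm", "-b", base, "-r"],
    ]
```

Printing the result of plan_new_project("PROJECT") gives a checklist that can be reviewed before any of the commands are run.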

Deleting a project

  • back up the current configuration
  • delete the project from the configuration (this deletes the project's index data and refreshes the on-disk configuration). The -R option can be used to supply the path to a read-only configuration so that it is merged with the current configuration.
   opengrok-projadm -b /opengrok -d PROJECT

opengrok-sync

opengrok-sync provides a way to run a sequence of commands for a set of projects in parallel.

The script accepts the configuration in either JSON or YAML.

It can be used like this:

  $ opengrok-sync -c /scripts/sync.conf -d /ws-local/

where the sync.conf file contents might look like this:

commands:
- command:
  - http://localhost:8080/source/api/v1/messages
  - POST
  - cssClass: info
    duration: PT1H
    tags: ['%PROJECT%']
    text: resync + reindex in progress
- command: [sudo, -u, wsmirror, /opengrok/dist/bin/opengrok-mirror, -c, /opengrok/etc/mirror-config.yml, -U, 'http://localhost:8080/source']
- command: [sudo, -u, webservd, /opengrok/dist/bin/opengrok-reindex-project, -J=-d64,
    '-J=-XX:-UseGCOverheadLimit', -J=-Xmx16g, -J=-server, --jar, /opengrok/dist/lib/opengrok.jar,
    -t, /opengrok/etc/logging.properties.template, -p, '%PROJ%', -d, /opengrok/log/%PROJECT%,
    -P, '%PROJECT%', -U, 'http://localhost:8080/source', --, --renamedHistory, 'on', -r, dirbased, -G, -m, '256', -c,
    /usr/local/bin/ctags, -U, 'http://localhost:8080/source', -o, /opengrok/etc/ctags.config,
    -H, '%PROJECT%']
  env: {LC_ALL: en_US.UTF-8}
  limits: {RLIMIT_NOFILE: 1024}
- command: ['http://localhost:8080/source/api/v1/messages?tag=%PROJECT%', DELETE,
    '']
- command: [/scripts/check-indexer-logs.ksh]
cleanup:
  - command: ['http://localhost:8080/source/api/v1/messages?tag=%PROJECT%', DELETE, '']

Note: the -U 'http://localhost:8080/source' option appearing twice in the opengrok-reindex-project command above is not a typo. It must be specified twice: once for the Python script and once for the indexer.

The above opengrok-sync command takes all directories under /ws-local and, for each of them, runs the sequence of commands specified in the sync.conf file. This is done in parallel at the project level. The level of parallelism can be specified using the --workers option (by default it uses as many workers as there are CPUs in the system).
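
The idea of "one unit of work per directory, several at a time" can be illustrated with a short Python sketch. It is not how opengrok-sync is implemented: the function names are hypothetical, and a thread pool is used here only for brevity.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_projects(source_root):
    """Treat every directory directly under source_root as a project."""
    return sorted(
        entry.name for entry in os.scandir(source_root) if entry.is_dir()
    )


def process_in_parallel(source_root, handle_project, workers=None):
    """Call handle_project(name) for each project, several at a time.

    workers defaults to the CPU count, mirroring the default described
    above. Returns a dict mapping project name to the handler's result.
    """
    projects = list_projects(source_root)
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input list.
        results = pool.map(handle_project, projects)
        return dict(zip(projects, results))
```

Here handle_project stands for whatever runs the per-project command sequence and returns its outcome.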

Another way to specify the list of projects to synchronize is the --indexed option of opengrok-sync, which queries the web application configuration for the list of indexed projects and uses that list. Alternatively, the --projects option can be used to process just the specified projects.

The commands above will basically:

  • mark the project with an alert (to let the users know it is being synchronized/indexed) using the RESTful API call (the %PROJECT% string is replaced with the current project name)
  • pull the changes from all the upstream repositories that belong to the project using the opengrok-mirror command
  • reindex the project using opengrok-reindex-project
  • clear the alert using the second RESTful API call
  • execute the /scripts/check-indexer-logs.ksh script to perform some pattern matching in the indexer logs to see if there were any serious failures there. The script can look e.g. like this:
#!/usr/bin/ksh
#
# Check OpenGrok indexer logs in the last 24 hours for any signs of serious
# trouble.
#

if (( $# != 1 )); then
        print -u2 "usage: $0 <project_name>"
        exit 1
fi

project_name=$1

typeset -r log_dir="/opengrok/log/$project_name/"
if [[ ! -d $log_dir ]]; then
        print -u2 "cannot open log directory $log_dir"
        exit 1
fi

# Check the last log file.
if grep SEVERE "$log_dir/opengrok0.0.log"; then
        exit 1
fi

The opengrok-sync script prints any errors to the console and uses file-level locking so that only one instance runs at a time, which makes it convenient to run periodically from crontab.
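
This page does not say how the locking is implemented. One common way to get "only one run at a time" on POSIX systems is a non-blocking flock on a lock file, sketched below. The name exclusive_run and the choice of fcntl.flock are assumptions made for illustration, not a description of opengrok-sync internals.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_run(lock_path):
    """Yield True if the lock was acquired, False if another run holds it.

    POSIX-only sketch: uses a non-blocking exclusive flock on lock_path.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Somebody else holds the lock; let the caller skip this run.
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A cron job built this way can simply exit when the context manager yields False, so overlapping runs never pile up.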

Each "command" can be either a normal command execution (supplying the list of program arguments) or a RESTful API call (supplying the HTTP verb and optional payload).

Note that the cleanup is a set of commands. If any of them fails (i.e. returns a non-zero value), the process is not interrupted, unlike the main command sequence.
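
To make the two shapes concrete, the sketch below sorts a "command" entry into one of the two kinds. It rests on an assumption drawn only from the example configuration above: an entry whose first element starts with http:// or https:// is read as a RESTful API call of the form [URI, verb, payload]. The function name classify_command is hypothetical.

```python
def classify_command(command):
    """Split one "command" entry into a tagged tuple.

    Assumption for this sketch: a first element starting with http:// or
    https:// marks a RESTful API call [URI, verb, payload]; anything else
    is a program followed by its arguments.
    """
    first = command[0]
    if isinstance(first, str) and first.startswith(("http://", "https://")):
        if len(command) < 2:
            raise ValueError("RESTful API call needs an HTTP verb")
        payload = command[2] if len(command) > 2 else None
        return ("rest", first, command[1], payload)
    return ("exec", list(command))
```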

URI specification

Note that if the web application is listening on a non-standard host or port (localhost and 8080 are the defaults), the URI has to be specified everywhere it matters. Given that opengrok-sync performs RESTful API queries itself, one has to specify the location using the -U option of this script and then again in the configuration file: for any RESTful API calls and for the opengrok-indexer command (which also uses the -U option).

Cleanup

If any of the commands in "commands" fails, the "cleanup" commands will be executed. This is handy in this case since the first RESTful API call marks the project with an alert in the web UI, so if any of the commands that follow fails, the cleanup call is made to clear the alert.

Normal command execution can also be performed in the cleanup section.
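
The control flow described here (the main sequence stops at the first failure, then every cleanup command is attempted regardless of individual cleanup failures) can be sketched as follows. The function run_sequence and its run callback are hypothetical names used only for this illustration.

```python
def run_sequence(commands, cleanup, run):
    """Run commands in order; on the first failure, run all cleanup commands.

    run is a callable taking one command and returning its exit status
    (0 means success). Returns True if the whole main sequence succeeded.
    """
    for command in commands:
        if run(command) != 0:
            # The main sequence stops at the first failing command...
            for cleanup_command in cleanup:
                # ...but a failing cleanup command does not stop the rest.
                run(cleanup_command)
            return False
    return True
```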

Ignoring errors

Some projects are notorious for producing spurious errors, so their errors can be ignored via the "ignore_errors" section.

Run

In the above example it is assumed that opengrok-sync is run as root and synchronization and reindexing are done under different users. This is done so that the web application cannot tamper with source code even if compromised.

Pattern replacement and logging

The project name is appended to the arguments of each command unless one of the arguments contains %PROJECT%, in which case the pattern is substituted with the project name and nothing is appended.
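
A minimal sketch of this substitute-or-append rule, assuming a flat list of string arguments (nested payloads such as the message body in the example configuration are not handled here). The function name expand_args is hypothetical.

```python
PROJECT_PATTERN = "%PROJECT%"


def expand_args(args, project):
    """Substitute %PROJECT% in args, or append the project name if absent."""
    has_pattern = any(
        isinstance(arg, str) and PROJECT_PATTERN in arg for arg in args
    )
    if not has_pattern:
        # No pattern anywhere: the project name becomes the last argument.
        return list(args) + [project]
    return [
        arg.replace(PROJECT_PATTERN, project) if isinstance(arg, str) else arg
        for arg in args
    ]
```

With project "foo", ["/scripts/check-indexer-logs.ksh"] gains "foo" as a trailing argument, while ["-d", "/opengrok/log/%PROJECT%"] becomes ["-d", "/opengrok/log/foo"] with nothing appended. An argument such as "%PROJ%" is left alone because it does not contain the full %PROJECT% pattern.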

For per-project reindexing to work properly, opengrok-reindex-project uses the logging.properties.template to make sure each project has its own log directory. The file can look e.g. like this:

handlers= java.util.logging.FileHandler

.level= FINE

java.util.logging.FileHandler.pattern = /opengrok/log/%PROJ%/opengrok%g.%u.log
# Create one file per indexer run. This makes indexer log easy to check.
java.util.logging.FileHandler.limit = 0
java.util.logging.FileHandler.append = false
java.util.logging.FileHandler.count = 30
java.util.logging.FileHandler.formatter = org.opengrok.indexer.logger.formatter.SimpleFileLogFormatter

java.util.logging.ConsoleHandler.level = WARNING
java.util.logging.ConsoleHandler.formatter = org.opengrok.indexer.logger.formatter.SimpleFileLogFormatter

The %PROJ% pattern is passed to the script for substitution in the logging template. This pattern must differ from the %PROJECT% pattern; otherwise the opengrok-sync script would substitute it in the command arguments and the substitution in the template file would not happen.

You can find a logging.properties.template file in the release tarball, under the doc directory.
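
The template substitution itself amounts to a plain text replacement, as in the hypothetical helper below. It is a sketch of the idea rather than the code used by opengrok-reindex-project; note that java.util.logging placeholders such as %g and %u are left untouched.

```python
def render_logging_template(template_text, project, pattern="%PROJ%"):
    """Return the logging configuration text with pattern replaced by project."""
    return template_text.replace(pattern, project)


line = ("java.util.logging.FileHandler.pattern = "
        "/opengrok/log/%PROJ%/opengrok%g.%u.log")
rendered = render_logging_template(line, "foo")
# rendered now points at /opengrok/log/foo/opengrok%g.%u.log
```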