Commit b1e4fd1: document incremental training call
kermitt2 committed Nov 24, 2022 (1 parent: 2dce9b9)
1 changed file: doc/Training-the-models-of-Grobid.md (44 additions, 12 deletions)
Grobid uses different sequence labelling models depending on the labeling task to realize:

* table

The models are located under `grobid/grobid-home/models`. Each of these models can be retrained using amended or additional training data. For production, a model is trained with all the available training data to maximize performance. For development purposes, it is also possible to evaluate a model with part of the training data held out as a frozen set (i.e. a holdout set), with an automatic random split, or with 10-fold cross-evaluation.

## Train and evaluate

When generating a new model, a segmentation of data can be done (e.g. 80%-20%) between training and evaluation data.

There are different ways to generate a new model and run the evaluation: training and evaluation can be run together or separately, and the training data can be split automatically or manually. In all cases, the newly generated model is saved directly under `grobid-home/models` and replaces the previous one. A rollback can be made by replacing the newly generated model with the backup record (`<model name>.wapiti.old`).
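
For instance, a manual rollback of the header model could look like the following minimal sketch (the exact file names under `grobid-home/models/header/` are assumed here from the convention above; check the directory for your version):

```bash
# Restore the backup record created when the model was retrained
# (file names assumed from the <model name>.wapiti.old convention above).
cp grobid-home/models/header/model.wapiti.old grobid-home/models/header/model.wapiti
```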

### Train and evaluation in one command (simple mode)

For simple training without particular parameters, a single command can be used as follows. All the available annotated files under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus` will be used for training, and all the available annotated files under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation` will be used for evaluation.

Under the main project directory `grobid/`, run the following command to execute both training and evaluation:

```bash
> ./gradlew train_<name_of_model>
```
Example: `train_header`, `train_date`, `train_name_header`, `train_name_citation`, `train_citation`, `train_affiliation_address`, `train_fulltext`, `train_patent_citation`, ...

Example for training the header model:

```bash
> ./gradlew train_header
```

Example for training the model for names in header:

```bash
> ./gradlew train_name_header
```

### Train and evaluation separately and using more parameters (full mode)

To have more flexibility and options for training and evaluating the models, use the following commands.

First be sure to have the full project libraries built locally (see [Install GROBID](Install-Grobid.md) for more details):

```bash
> ./gradlew clean install
```

Under the main project directory `grobid/`:

**Train** (generate a new model):

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <name of the model> -gH grobid-home
```

The training files considered are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`.
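
As an illustration, training the date model with all the corpus data (a sketch following the command above; keep the `<current version>` placeholder in line with your build):

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 date -gH grobid-home
```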

The training of the models can be controlled using different parameters. The `nbThreads` parameter in the configuration file `grobid-home/config/grobid.yaml` can be increased to speed up the training. Similarly, modifying the stopping criteria can help speed up the training. Please refer to [this comment](https://github.com/kermitt2/grobid/issues/336#issuecomment-412516422) for more details.
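
As a sketch, the parameter can be adjusted directly in the configuration file (the exact location of the `nbThreads` key inside `grobid.yaml` is an assumption here; verify it in your copy of the file):

```bash
# Illustrative only: locate and raise the nbThreads training parameter
# (key name taken from the paragraph above; its position in grobid.yaml may vary).
grep -n "nbThreads" grobid-home/config/grobid.yaml
sed -i 's/nbThreads:.*/nbThreads: 8/' grobid-home/config/grobid.yaml
```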

**Evaluate**:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 1 <name of the model> -gH grobid-home
```

The evaluation files considered are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation`.
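
For instance, evaluating the date model trained above (same placeholder convention as before):

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 1 date -gH grobid-home
```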

**Automatically split data, train and evaluate**:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 <name of the model> -gH grobid-home -s <segmentation ratio as a number between 0 and 1, e.g. 0.8 for 80%>
```

For instance, training the date model with a ratio of 75% for training and 25% for evaluation:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 date -gH grobid-home -s 0.75
```

A ratio of 1.0 means that all the data available under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus/` will be used for training the model, and the evaluation will be empty.
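
For instance, the following sketch would use all the date corpus for training and skip the evaluation:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 date -gH grobid-home -s 1.0
```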

**Incremental training**:

The previous commands start a training from scratch, using all the available training data in a single training task.
Incremental training starts from an existing, already trained model and applies a further training task using the available training data under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`.

Launching an incremental training is similar to the previous commands, with the additional parameter `-i`. An existing model under `grobid/grobid-home/models/*MODEL*` must be available. For example:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <name of the model> -gH grobid-home -i
```
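
For instance, an incremental training of the header model:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 header -gH grobid-home -i
```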

Note that a full training from scratch with all the training data should normally provide better accuracy than several iterative trainings on partitions of the training data. Incremental training makes sense, for example, when a model has been trained with a lot of data over days or weeks and only an update is required, or during training data development, when a model must be updated quickly in order to generate new training data.

In incremental training phases, the training parameters might need to be updated so that the training stops earlier than in a normal full training.

### N-folds cross-evaluation
