Generally speaking, Model = Data + Network + Training.
AdaSeq currently works in a file-based mode: you specify all of these arguments in a configuration file before starting training, and that configuration file is all you need to train a model.
The sections below introduce, step by step, how to train a model on a custom dataset.
Let's take `resume.yaml` as an example. For detailed descriptions of all configuration arguments, please refer to the tutorial Learning about Configs.
First, tell AdaSeq where to save all outputs during training.
```yaml
experiment:
  exp_dir: experiments/
  exp_name: resume
  seed: 42
```
The training logs, model checkpoints, prediction results, and a copy of the configuration file will all be saved to `./experiments/resume/${datetime}/`, where `${datetime}` is a timestamp generated when training starts.
The `dataset` argument determines what the dataset looks like (data format) and where to fetch it (data source).
```yaml
dataset:
  data_file:
    train: 'https://www.modelscope.cn/api/v1/datasets/damo/resume_ner/repo/files?Revision=master&FilePath=train.txt'
    valid: 'https://www.modelscope.cn/api/v1/datasets/damo/resume_ner/repo/files?Revision=master&FilePath=dev.txt'
    test: 'https://www.modelscope.cn/api/v1/datasets/damo/resume_ner/repo/files?Revision=master&FilePath=test.txt'
  data_type: conll
```
As the snippet shows, AdaSeq fetches the training, validation, and test dataset files from remote URLs. The data is declared to be in `conll` format, so the corresponding script will be used to parse it.
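If you haven't seen it before, a `conll`-format file stores one token and its label per line, separated by whitespace, with a blank line between sentences. Below is a generic BIO-tagged illustration of the format (not taken from the resume dataset, which is character-level Chinese with its own tag set):

```
EU      B-ORG
rejects O
German  B-MISC
call    O
to      O
boycott O
British B-MISC
lamb    O
.       O
```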
For more dataset loading approaches and supported data formats, please refer to the tutorial Customizing Dataset.
This part specifies the `task`, `preprocessor`, `data_collator`, and `model` used in training.
The basic data flow is:

`dataset -> preprocessor -> data_collator -> model`
- `preprocessor` defines how AdaSeq processes a single data sample. It requires a `model_dir` indicating which tokenizer to use, and turns a sentence into ids and masks.
- `data_collator` defines how data samples are collated into batches.
- `model` defines how the model is assembled, where `type` indicates the basic architecture. A model usually consists of several replaceable components such as `embedder`, `encoder`, etc.
```yaml
task: named-entity-recognition

preprocessor:
  type: sequence-labeling-preprocessor
  model_dir: sijunhe/nezha-cn-base
  max_length: 150

data_collator: SequenceLabelingDataCollatorWithPadding

model:
  type: sequence-labeling-model
  embedder:
    model_name_or_path: sijunhe/nezha-cn-base
  dropout: 0.1
  use_crf: true
```
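To make the data flow above concrete, here is a rough conceptual sketch in plain Python. It is illustrative pseudocode only, not AdaSeq's actual internals; `dataset`, `preprocessor`, `data_collator`, and `model` stand for the objects that AdaSeq builds from this configuration:

```python
def run_pipeline(dataset, preprocessor, data_collator, model, batch_size=16):
    """Illustrative sketch of the configured data flow (not AdaSeq's real code)."""
    # preprocessor: raw example -> dict of input ids, masks, and label ids
    samples = [preprocessor(example) for example in dataset]
    # data_collator: list of samples -> padded, batched tensors
    for start in range(0, len(samples), batch_size):
        batch = data_collator(samples[start:start + batch_size])
        # model: batched tensors -> loss (training) or decoded tag sequences
        yield model(**batch)
```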
Last but not least, set the training and evaluation arguments in `train` and `evaluation`. Model performance can vary widely under different training settings. This part is relatively easy to understand; you can copy an existing configuration and adjust the values as you like.
```yaml
train:
  max_epochs: 20
  dataloader:
    batch_size_per_gpu: 16
  optimizer:
    type: AdamW
    lr: 5.0e-5
    param_groups:
      - regex: crf
        lr: 5.0e-1
  lr_scheduler:
    type: LinearLR
    start_factor: 1.0
    end_factor: 0.0
    total_iters: 20

evaluation:
  dataloader:
    batch_size_per_gpu: 128
  metrics:
    - type: ner-metric
    - type: ner-dumper
      model_type: sequence_labeling
      dump_format: conll
```
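One setting worth highlighting is `param_groups`: parameters whose names match the given regex get their own learning rate. Here the CRF parameters are trained with a much larger learning rate (5.0e-1) than the pretrained encoder (5.0e-5), since the randomly initialized CRF layer benefits from bigger updates. In plain PyTorch terms, the effect is roughly the following (a hedged sketch, not AdaSeq's actual implementation):

```python
import re

import torch


def build_optimizer(model, base_lr=5.0e-5, crf_lr=5.0e-1):
    """Rough PyTorch equivalent of the param_groups setting above (illustrative only)."""
    crf_params, other_params = [], []
    for name, param in model.named_parameters():
        # Parameters whose names match the regex 'crf' go into the high-lr group
        (crf_params if re.search('crf', name) else other_params).append(param)
    return torch.optim.AdamW([
        {'params': other_params, 'lr': base_lr},
        {'params': crf_params, 'lr': crf_lr},
    ])
```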
Once you have a configuration file, it is easy to train a model. You can also use any of the configuration files in the examples. Just try it!
```bash
python scripts/train.py -c examples/bert_crf/configs/resume.yaml
```
We also provide advanced tutorials if you want to improve your training.
During training, the progress bar and evaluation results are logged both to the terminal and to a log file.
As we mentioned in 1.1 Meta Settings, all outputs are saved to `./experiments/resume/${datetime}/`. After training, the folder will contain 5 files:
```
./experiments/resume/${datetime}/
├── best_model.pth
├── config.yaml
├── metrics.json
├── out.log
└── pred.txt
```
You can collect the evaluation results from `metrics.json` or review the full training logs in `out.log`. `pred.txt` contains the predictions on the test dataset; you can analyze it to improve your model, or submit it to a competition. `best_model.pth` can be used for further tuning or deployment.
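For example, you can inspect these outputs programmatically. A minimal sketch, assuming `metrics.json` is plain JSON and `best_model.pth` is a standard PyTorch checkpoint (the exact file contents and key names may differ across AdaSeq versions):

```python
import json

import torch

run_dir = './experiments/resume/${datetime}/'  # substitute the actual timestamped folder

# Collect the evaluation results (key names depend on the AdaSeq version)
with open(run_dir + 'metrics.json') as f:
    metrics = json.load(f)
print(metrics)

# Load the checkpoint weights, e.g. for further tuning or deployment
state_dict = torch.load(run_dir + 'best_model.pth', map_location='cpu')
```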