paraformer

模型介绍

Paraformer是达摩院语音团队提出的一种高效的非自回归端到端语音识别框架。本项目为Paraformer中文通用语音识别模型，采用工业级数万小时的标注音频进行模型训练，保证了模型的通用识别效果。模型可以被应用于语音输入法、语音导航、智能会议纪要等场景。 Paraformer模型结构如上图所示，由 Encoder、Predictor、Sampler、Decoder 与 Loss function 五部分组成。Encoder可以采用不同的网络结构，例如self-attention，conformer，SAN-M等。Predictor 为两层FFN，预测目标文字个数以及抽取目标文字对应的声学向量。Sampler 为无可学习参数模块，依据输入的声学向量和目标向量，生产含有语义的特征向量。Decoder 结构与自回归模型类似，为双向建模（自回归为单向建模）。Loss function 部分，除了交叉熵（CE）与 MWER 区分性优化目标，还包括了 Predictor 优化目标 MAE。其核心点主要有：

Predictor 模块：基于 Continuous integrate-and-fire (CIF) 的预测器 (Predictor) 来抽取目标文字对应的声学特征向量，可以更加准确的预测语音中目标文字个数。
Sampler：通过采样，将声学特征向量与目标文字向量变换成含有语义信息的特征向量，配合双向的 Decoder 来增强模型对于上下文的建模能力。
基于负样本采样的 MWER 训练准则。

更详细的细节见：

模型推理

以16k paraformer-large中文模型为例

代码(infer.py)

若输入格式为文件wav.scp(注：文件名需要以.scp结尾)，可添加 output_dir 参数将识别结果写入文件中，api调用方式可参考如下范例:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    output_dir='./decode_dir'
)

rec_result = inference_pipeline(audio_in='wav.scp')
print(rec_result)

模型训练以16k paraformer-large中文模型为例

代码(finetune.py)：

import os
import json
import shutil

from modelscope.pipelines import pipeline
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from modelscope.utils.constant import Tasks

from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.compute_wer import compute_wer


def modelscope_finetune(params):
    if not os.path.exists(params["model_dir"]):
        os.makedirs(params["model_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["dataset_path"])
    kwargs = dict(
        model=params["modelscope_model_name"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["model_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()
    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
    required_files = ["am.mvn", "decoding.yaml", "configuration.json"]
    for file_name in required_files:
        shutil.copy(os.path.join(pretrained_model_path, file_name),
                    os.path.join(params["model_dir"], file_name))
    

def modelscope_infer(params):
    # prepare for decoding
    with open(os.path.join(params["model_dir"], "configuration.json")) as f:
        config_dict = json.load(f)
        config_dict["model"]["am_model_name"] = params["decoding_model_name"]
    with open(os.path.join(params["model_dir"], "configuration.json"), "w") as f:
        json.dump(config_dict, f, indent=4, separators=(',', ': '))
    decoding_path = os.path.join(params["model_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
    os.mkdir(decoding_path)

    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model=params["model_dir"],
        output_dir=decoding_path,
        batch_size=64
    )
    audio_in = os.path.join(params["test_data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

    # computer CER if GT text is set
    text_in = os.path.join(params["test_data_dir"], "text")
    if os.path.exists(text_in):
        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
        compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
        os.system("tail -n 3 {}".format(os.path.join(decoding_path, "text.cer")))

if __name__ == '__main__':
    finetune_params = {}
    finetune_params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
    finetune_params["dataset_path"] = "./example_data/"
    finetune_params["model_dir"] = "./checkpoint"
    finetune_params["dataset_type"] = "small"
    finetune_params["batch_bins"] = 2000
    finetune_params["max_epoch"] = 20
    finetune_params["lr"] = 0.00005

    modelscope_finetune(finetune_params)

    infer_params = {}
    infer_params["model_dir"] = "./checkpoint"
    infer_params["decoding_model_name"] = "20epoch.pb"
    infer_params["test_data_dir"] = "./example_data/test/"
    modelscope_infer(infer_params)

流程：

modelscope模型资源下载->本地数据加载->模型训练->模型测试并计算CER

运行命令

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 finetune.py > log.txt 2>&1

如果想在一台机器上跑多个多卡任务，请加上master_port参数，防止DDP通信端口被占用。具体执行命令为:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 --master_port 47770 finetune.py > log.txt 2>&1

训练代码主要参数说明

modelscope_model_name：需要finetune的modelscope模型名字
dataset_path：训练数据目录、需自行提供，目录格式如下：

tree ./example_data/
./example_data/
├── validation
│   ├── text
│   └── wav.scp
├── test
│   ├── text
│   └── wav.scp
└── train
    ├── text
    └── wav.scp

3 directories, 6 files

text文件中存放音频标注，wav.scp文件中存放wav音频绝对路径，样例如下：

cat text
BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购
BAC009S0002W0123 也 成 为 地 方 政 府 的 眼 中 钉
IC0004W0044 放 一 首 歌 the sound of silence

cat wav.scp
BAC009S0002W0122 /mnt/data/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /mnt/data/wav/train/S0002/BAC009S0002W0123.wav
IC0004W0044  /mnt/data/wav/train/S0002/IC0004W0044.wav

model_dir：训练模型、配置文件、解码结果保存目录
dataset_type：训练dataloader类型。训练数据小于1000小时推荐small，大于1000小时推荐large。small、large dataloader主要的区别在于训练读取数据是否做全局shuffle
batch_bins：训练的batch_size值，如果dataset_type="small"，batch_bins表示为fbank特征帧数，如果dataset_type="large"，batch_bins表示为样本的总时长(单位ms)。
max_epoch：训练的epoch数
lr：训练学习率

Provide feedback

Saved searches

Use saved searches to filter your results more quickly