# PTMS1-BERT

# BERT (Bidirectional Encoder Representations from Transformers)

In [2]:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1810.04805', width=1200, height=550)

## Q&A
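### Why "Bidirectional"?
https://blog.csdn.net/laobai1015/article/details/87937528

BERT stands for Bidirectional Encoder Representations from Transformers, i.e. the encoder of a bidirectional Transformer. "Bidirectional" means that when the model processes a given word, it can use information from both the preceding and the following words. This is where BERT differs from traditional left-to-right language models: BERT and ELMo both use bidirectional information, whereas OpenAI GPT uses only unidirectional information.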
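
A minimal sketch of the difference, assuming that "which words a position can use" is written as a 0/1 visibility matrix. The function names are illustrative and do not come from any library: row `i` lists the positions that token `i` may look at.

```python
def bidirectional_mask(seq_len):
    """BERT-style encoder: every position can see every position."""
    return [[1] * seq_len for _ in range(seq_len)]


def causal_mask(seq_len):
    """Left-to-right model (GPT-style): position i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


for row in bidirectional_mask(4):
    print(row)
print()
for row in causal_mask(4):
    print(row)
```

In the causal matrix the upper triangle is zero, so the first token never sees the words that follow it; in the bidirectional matrix it does.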


# Train
https://zhuanlan.zhihu.com/p/74090249

BERT training consists of two stages: pre-training and fine-tuning.

![](img/bert01.png)

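## BERT input
reference: microstrong

The BERT input is built from three parts:
* Token Embedding
* Segment Embedding, which marks whether a token belongs to sentence A or sentence B
* Position Embedding

The three embeddings are summed position by position, and the result is the final input embedding fed to BERT.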
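
The element-wise sum can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the tables are random, the sizes (hidden size 4, vocabulary of 10, maximum length 8) are made up, and `make_table` and `embed` are hypothetical helpers. The real model additionally applies layer normalization and dropout after the sum.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size (assumption); BERT-base uses 768


def make_table(rows, dim):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


token_table = make_table(10, HIDDEN)    # toy vocabulary of 10 token ids
segment_table = make_table(2, HIDDEN)   # 0 = sentence A, 1 = sentence B
position_table = make_table(8, HIDDEN)  # toy maximum sequence length of 8


def embed(token_ids, segment_ids):
    """Sum token, segment and position embeddings for each position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ]
        vectors.append(vec)
    return vectors


vectors = embed(token_ids=[1, 5, 2, 7, 2], segment_ids=[0, 0, 0, 1, 1])
print(len(vectors), len(vectors[0]))
```
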
### Special tokens
`[CLS]`: id 101 in the BERT vocabulary

`[SEP]`: id 102

`[UNK]`: id 100

`[PAD]`: id 0
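
A small sketch of how these ids frame a sentence pair, assuming the usual `[CLS] A [SEP] B [SEP]` layout followed by padding. The helper `build_pair_input` and the word ids (2001, 2002, 2003) are hypothetical and only serve the illustration; in practice a tokenizer produces these lists.

```python
CLS, SEP, UNK, PAD = 101, 102, 100, 0  # ids quoted in the notes above


def build_pair_input(ids_a, ids_b, max_len):
    """Build input ids, segment ids and attention mask for a sentence pair."""
    input_ids = [CLS] + ids_a + [SEP] + ids_b + [SEP]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP].
    segment_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    if pad < 0:
        raise ValueError("sequence is longer than max_len")
    return (
        input_ids + [PAD] * pad,
        segment_ids + [0] * pad,
        attention_mask + [0] * pad,
    )


ids, segs, mask = build_pair_input([2001, 2002], [2003], max_len=8)
print(ids)
print(segs)
print(mask)
```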

## Pre-training: How to train a new language model from scratch using Transformers and Tokenizers

https://huggingface.co/blog/how-to-train?nsukey=MwfrrZHYtrS9g%2F2Y3hxHCiUr6QiHNgZ9Nb%2BhPS2oFosDP0vUdsyh8Nrs%2F7sc7%2FPEN3yYaxo%2BNJQtoe%2BR1hZc%2BNf4hnknnCpCDzioGByvE5F6Zen4MoyyFWGNioRFeUDCpqDzr8DEbQL0bI1%2B4QGie1nCT2PeplBJKRi9IAd8DSfx64yFkZlstBx%2FAcFNr6ky8j3RbAKXkzaCulH5I3TWiA%3D%3D

https://github.com/huggingface/blog/tree/master/notebooks


In [None]:
cmd ="""
python run_language_modeling.py
    --train_data_file ./oscar.eo.txt
    --output_dir ./EsperBERTo-small-v1
    --model_type roberta
    --mlm
    --config_name ./EsperBERTo
    --tokenizer_name ./EsperBERTo
    --do_train
    --line_by_line
    --learning_rate 1e-4
    --num_train_epochs 1
    --save_total_limit 2
    --save_steps 2000
    --per_gpu_train_batch_size 16
    --seed 42
""".replace("\n", " ")

### run_language_modeling.py
https://github.com/huggingface/transformers/tree/master/examples/language-modeling

In [None]:
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, CTRL, BERT, RoBERTa, XLNet).


GPT, GPT-2 and CTRL are fine-tuned using a causal language modeling (CLM) loss. 

BERT and RoBERTa are fine-tuned using a masked language modeling (MLM) loss. 

XLNet is fine-tuned using a permutation language modeling (PLM) loss.
"""


import logging
import math
import os
from dataclasses import dataclass, field
from typing import Optional

from transformers import (
    CONFIG_MAPPING,
    MODEL_WITH_LM_HEAD_MAPPING,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
#     DataCollatorForPermutationLanguageModeling,
    HfArgumentParser,
    LineByLineTextDataset,
    PreTrainedTokenizer,
    TextDataset,
    Trainer,
    TrainingArguments,
    set_seed,
)


logger = logging.getLogger(__name__)


MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
    """

    model_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": "The model checkpoint for weights initialization. Leave None if you want to train a model from scratch."
        },
    )
    model_type: Optional[str] = field(
        default=None,
        metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )
    eval_data_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    line_by_line: bool = field(
        default=False,
        metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
    )

    mlm: bool = field(
        default=False, metadata={"help": "Train with masked-language modeling loss instead of language modeling."}
    )
    mlm_probability: float = field(
        default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
    )
    plm_probability: float = field(
        default=1 / 6,
        metadata={
            "help": "Ratio of length of a span of masked tokens to surrounding context length for permutation language modeling."
        },
    )
    max_span_length: int = field(
        default=5, metadata={"help": "Maximum length of a span of masked tokens for permutation language modeling."}
    )

    block_size: int = field(
        default=-1,
        metadata={
            "help": "Optional input sequence length after tokenization."
            "The training dataset will be truncated in block of this size for training."
            "Default to the model max input length for single sentence inputs (take into account special tokens)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )


def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(
            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )


def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if data_args.eval_data_file is None and training_args.do_eval:
        raise ValueError(
            "Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
            "or remove the --do_eval argument."
        )

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
        )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        training_args.local_rank,
        training_args.device,
        training_args.n_gpu,
        bool(training_args.local_rank != -1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
    set_seed(training_args.seed)

    # Load pretrained model and tokenizer
    #
    # Distributed training:
    # The .from_pretrained methods guarantee that only one local process can concurrently
    # download model & vocab.

    if model_args.config_name:
        config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
    elif model_args.model_name_or_path:
        config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
    else:
        config = CONFIG_MAPPING[model_args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")

    if model_args.tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
    elif model_args.model_name_or_path:
        tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
    else:
        raise ValueError(
            "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it,"
            "and load it from here, using --tokenizer_name"
        )

    if model_args.model_name_or_path:
        model = AutoModelWithLMHead.from_pretrained(
            model_args.model_name_or_path,
            from_tf=bool(".ckpt" in model_args.model_name_or_path),
            config=config,
            cache_dir=model_args.cache_dir,
        )
    else:
        logger.info("Training new model from scratch")
        model = AutoModelWithLMHead.from_config(config)

    model.resize_token_embeddings(len(tokenizer))

    if config.model_type in ["bert", "roberta", "distilbert", "camembert"] and not data_args.mlm:
        raise ValueError(
            "BERT and RoBERTa-like models do not have LM heads but masked LM heads. They must be run using the"
            "--mlm flag (masked language modeling)."
        )

    if data_args.block_size <= 0:
        data_args.block_size = tokenizer.max_len
        # Our input block size will be the max possible for the model
    else:
        data_args.block_size = min(data_args.block_size, tokenizer.max_len)

    # Get datasets

    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
    eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
    if config.model_type == "xlnet":
#         data_collator = DataCollatorForPermutationLanguageModeling(
#             tokenizer=tokenizer, plm_probability=data_args.plm_probability, max_span_length=data_args.max_span_length,
#         )
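        # The permutation-LM collator is commented out above, so fail loudly here
        # instead of reaching Trainer with `data_collator` undefined.
        raise NotImplementedError("XLNet (permutation language modeling) is disabled in this copy of the script.")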
    else:
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
        )

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        prediction_loss_only=True,
    )

    # Training
    if training_args.do_train:
        model_path = (
            model_args.model_name_or_path
            if model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path)
            else None
        )
        trainer.train(model_path=model_path)
        trainer.save_model()
        # For convenience, we also re-save the tokenizer to the same directory,
        # so that you can share your model easily on huggingface.co/models =)
        if trainer.is_world_master():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
    results = {}
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()

        perplexity = math.exp(eval_output["eval_loss"])
        result = {"perplexity": perplexity}

        output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
        if trainer.is_world_master():
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results *****")
                for key in sorted(result.keys()):
                    logger.info("  %s = %s", key, str(result[key]))
                    writer.write("%s = %s\n" % (key, str(result[key])))

        results.update(result)

    return results


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()
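
The evaluation step of the script reports perplexity as the exponential of the mean evaluation loss. A minimal sketch of that relationship, with a hypothetical helper name (for a masked LM the loss is computed only over the masked positions, so the number is not directly comparable to a causal LM's perplexity):

```python
import math


def perplexity(mean_loss):
    """Perplexity is exp of the mean cross-entropy loss (natural log)."""
    return math.exp(mean_loss)


# A loss of 0 means the model is always certain and right: perplexity 1.
print(perplexity(0.0))
# A loss of ln(4) corresponds to choosing uniformly among 4 options.
print(perplexity(math.log(4)))
```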

# BERT variants
https://huggingface.co/transformers/model_summary.html

## Ideas for improvement
https://medium.com/analytics-vidhya/what-happens-after-bert-summarize-those-ideas-behind-ee02f1eae5d9


### Increase coverage to improve MaskedLM
* Masking whole words: wwm
* Masking at the phrase level: ERNIE
* Scaling to a certain length: N-gram masking / span masking

Phrase-level masking requires a corresponding phrase list, and supplying such hand-crafted information may disturb the model and bias it. Masking longer spans looks like the better solution, so T5 experimented with different span lengths.

The results show that increasing the length is effective, but longer is not always better. SpanBERT offers a better solution: it uses probability sampling to reduce the chance of masking overly long spans.
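
A rough sketch of span masking with sampled lengths. It assumes span lengths are drawn from a geometric distribution (p = 0.2) clipped at 10 tokens, which is the setting described in the SpanBERT paper; everything else (function names, the 15% budget handling, uniform start positions) is a simplification for illustration, not the SpanBERT implementation.

```python
import random


def sample_span_length(rng, p=0.2, max_len=10):
    """Geometric sampling: short spans are likely, long spans are rare."""
    length = 1
    while length < max_len and rng.random() > p:
        length += 1
    return length


def span_mask(num_tokens, mask_ratio=0.15, seed=0):
    """Pick token positions to mask as a set of contiguous spans."""
    rng = random.Random(seed)
    budget = max(1, round(num_tokens * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Never exceed the masking budget with the current span.
        length = min(sample_span_length(rng), budget - len(masked))
        start = rng.randrange(0, num_tokens - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)


positions = span_mask(100)
print(positions)
```

Because long spans are sampled with low probability, most masked spans stay short while the occasional longer span is still possible.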

### Change the proportion of Masked
Google's T5 tries different masking ratios to explore the best setting. Surprisingly, BERT's original setting (15%) turns out to be the best.


### NextSentencePrediction 👎?
NSP learns sentence-level information by predicting whether two sentences are consecutive. According to the experimental results, it brings little improvement and even hurts performance on some tasks.

## ALBERT
Same as BERT but with a few tweaks:

* The embedding size E differs from the hidden size H. This is justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it is more logical to have H >> E. Also, the embedding matrix is large since it is V x E (V being the vocab size). If E < H, it has fewer parameters.

* Layers are split into groups that share parameters (to save memory).

* Next sentence prediction is replaced by sentence ordering prediction: the input is two consecutive sentences A and B, fed either as A followed by B or as B followed by A, and the model must predict whether they have been swapped.
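
The saving from a separate embedding size is simple arithmetic: V x H parameters when the embedding size is tied to the hidden size, versus V x E for the embedding matrix plus E x H for the projection up to the hidden size. The numbers below (V = 30000, H = 768, E = 128) are illustrative assumptions, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without a smaller embedding size."""
    if embedding_size is None:
        # BERT-style: embeddings live directly in the hidden dimension.
        return vocab_size * hidden_size
    # ALBERT-style: V x E lookup table plus an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128  # illustrative sizes (assumption)
tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(tied, factorized, round(tied / factorized, 2))
```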



In [2]:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1909.11942', width=1200, height=550)

## ABSTRACT
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times.

To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

# RoBERTa: A Robustly Optimized BERT Pretraining Approach

In [4]:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1907.11692', width=1200, height=550)

## Abstract
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.


Our modifications are simple, they include:
* (1) training the model longer, with bigger batches, over more data;
* (2) removing the next sentence prediction objective;
* (3) training on longer sequences; and
* (4) dynamically changing the masking pattern applied to the training data.

## Comparison with BERT

Same as BERT with better pretraining tricks:

* dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
* no NSP (next sentence prediction) loss; instead of putting just two sentences together, a chunk of contiguous text is packed together to reach 512 tokens (so the sentences are in an order that may span several documents)
* train with larger batches
* use BPE with bytes as a subunit rather than characters (because of unicode characters)
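
The difference between static and dynamic masking can be sketched as follows. This is a toy illustration with hypothetical helper names: with static masking the masked positions are chosen once and reused in every epoch, while with dynamic masking they are re-sampled each time the sequence is seen.

```python
import random


def choose_mask_positions(num_tokens, ratio, rng):
    """Sample which positions to mask for one pass over a sequence."""
    k = max(1, round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), k))


num_tokens, epochs = 20, 3

# Static masking: positions are fixed during preprocessing and reused.
static = choose_mask_positions(num_tokens, 0.15, random.Random(0))
static_per_epoch = [static for _ in range(epochs)]

# Dynamic masking: positions are drawn again for every epoch.
rng = random.Random(0)
dynamic_per_epoch = [choose_mask_positions(num_tokens, 0.15, rng) for _ in range(epochs)]

print(static_per_epoch)
print(dynamic_per_epoch)
```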



# DistilBERT
Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it has been trained to predict the same probabilities as the larger model. The actual objective is a combination of:

* finding the same probabilities as the teacher model
* predicting the masked tokens correctly (but no next-sentence objective)
* a cosine similarity between the hidden states of the student and the teacher model



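## Model distillation
https://zhuanlan.zhihu.com/p/71986772

Hinton introduced the concept of knowledge distillation at NIPS 2014 [1]. The goal is to transfer the knowledge learned by one large model, or by an ensemble of models, into a single lightweight model that is easier to deploy. Put simply, a new small model learns from the large model's predictions, which amounts to changing the objective function. It sounds easy, but can a small model really fit that well in practice? It is therefore worth studying other people's experiments and picking up some tricks.
### Glossary

    teacher - the original model or model ensemble
    student - the new model
    transfer set - the dataset used to transfer the teacher's knowledge and train the student
    soft target - the teacher's predicted output (usually the probabilities after softmax)
    hard target - the original label of a sample
    temperature - a hyperparameter in the distillation objective
    born-again network - a form of distillation in which the student and teacher have exactly the same architecture and size
    teacher annealing - gradually reducing the weight of the soft targets during distillation so that the student's performance is not capped by the teacher

### Basic idea
1.1 Why distillation works

The goal of a good model is not to fit the training data but to learn how to generalize to new data. Distillation therefore aims to have the student learn the teacher's generalization ability, which in theory gives a better result than a student that simply fits the training data. In addition, for classification tasks, if the soft targets have higher entropy than the hard targets, the student learns more information from them.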
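
A minimal sketch of a distillation objective with temperature, in plain Python. It follows the common formulation from Hinton et al. (soft-target cross-entropy at temperature T, scaled by T squared, mixed with the usual hard-label loss); the weight `alpha`, the default values and the function names are assumptions made for this illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter output."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def cross_entropy(target, pred):
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0)


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Weighted sum of the soft-target loss and the hard-target loss."""
    soft_targets = softmax(teacher_logits, temperature)
    soft_loss = cross_entropy(soft_targets, softmax(student_logits, temperature))
    hard_targets = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_targets, softmax(student_logits))
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss


teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.1]
print(entropy([1.0, 0.0, 0.0]))          # entropy of the hard target
print(entropy(softmax(teacher, 1.0)))    # entropy of the soft target at T = 1
print(entropy(softmax(teacher, 4.0)))    # entropy of the soft target at T = 4
print(distillation_loss(student, teacher, label=0))
```

The soft targets carry non-zero probability for the wrong classes, so their entropy is above that of the one-hot hard target, and raising the temperature increases it further.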

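# Domain-adaptive pretraining
https://mp.weixin.qq.com/s/qULq9ye_Pg56pEQIdvr8tQ

https://zhuanlan.zhihu.com/p/149210123

An ACL 2020 paper that received a Best Paper honorable mention, "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks", runs many language-model pretraining experiments and systematically analyzes how much continued pretraining improves downstream tasks. Its main conclusions are:

* Continued pretraining on a dataset from the target domain (DAPT) improves results; the less related the target-domain corpus is to RoBERTa's original pretraining corpus, the larger the gain from DAPT.
* Continued pretraining on the dataset of the specific task (TAPT) improves results very "cheaply".
* Combining the two (DAPT first, then TAPT) improves results further.
* If more task-related unlabeled data can be obtained for continued pretraining (Curated-TAPT), the results are best.
* If no additional task-related unlabeled data is available, a very lightweight, simple data selection strategy still improves results.

To understand the paper, keep two key terms in mind:

* DAPT: Domain-Adaptive Pretraining, i.e. continued pretraining on data from the relevant domain (e.g. medicine)
* TAPT: Task-Adaptive Pretraining, i.e. continued pretraining on the data of the specific task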

## Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

In [5]:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/2004.10964', width=1200, height=550)

### Abstract
Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task.

We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings.

Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.

Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

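## Pretraining models with PyTorch transformers
https://mp.weixin.qq.com/s/qULq9ye_Pg56pEQIdvr8tQ

Continued language-model pretraining on BERT is already a reliable way to gain points in algorithm competitions, but what makes the paper above valuable is its systematic analysis of the practice. Most Chinese language models are trained with TensorFlow; a common example is the Chinese RoBERTa project, see

https://github.com/brightmart/roberta_zh

Examples of pretraining a Chinese BERT language model with PyTorch are relatively scarce. huggingface's Transformers includes some code for language-model pretraining (it is not very rich, and many features such as wwm are not supported). To complete BERT language-model pretraining with as little code as possible, this article borrows some of that ready-made code and also shares some experience with language-model pretraining in PyTorch. There are three common Chinese BERT language models: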

bert-base-chinese

roberta-wwm-ext

ernie

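### bert-base-chinese

(https://huggingface.co/bert-base-chinese)

This is the most common Chinese BERT language model, pretrained on Chinese Wikipedia and related corpora. Using it as a baseline, continuing language-model pretraining on unlabeled in-domain data is simple: just use the official example.

https://github.com/huggingface/transformers/tree/master/examples/language-modeling

(This article uses transformers 3.0.2.)

The method is: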

In [None]:
python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=bert-base-chinese \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

Here $TRAIN_FILE is the path to the domain-specific Chinese corpus.



### roberta-wwm-ext

(https://github.com/ymcui/Chinese-BERT-wwm)

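A pretrained language model released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). It is pretrained with a RoBERTa-like recipe, e.g. dynamic masking and more training data. On many tasks this model outperforms bert-base-chinese.

Chinese RoBERTa-style PyTorch models are loaded as follows: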

In [None]:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

Be sure not to use the officially recommended:

In [None]:
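# Not recommended for the Chinese RoBERTa checkpoints
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")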

In [None]:
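dir_ = 'chinese-roberta-wwm-ext'  # directory in which to save the pretrained model
roberta.save_pretrained(dir_)  # `roberta` is the BertModel loaded above; writes the model .bin file and config.json
tokenizer.save_pretrained(dir_)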

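The reason is that the files shipped with Chinese RoBERTa-style checkpoints, such as vocab.txt, follow the BERT layout, whereas English RoBERTa models read vocab.json by default. Some English RoBERTa models can indeed be loaded automatically through AutoModel. This explains why the Chinese RoBERTa sample code in the huggingface model hub does not run. https://huggingface.co/models?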

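To continue pretraining RoBERTa with the run_language_modeling.py code above, two more changes are needed.

Download roberta-wwm-ext to a local directory hflroberta, and in config.json change "model_type":"roberta" to "model_type":"bert".

In run_language_modeling.py above, replace AutoModel and AutoTokenizer with BertModel and BertTokenizer (the script as listed uses AutoModelWithLMHead, so the BERT counterpart with a masked-LM head would be BertForMaskedLM).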
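
The config.json edit can also be scripted. The sketch below is only an illustration: `set_model_type` is a hypothetical helper, and the demo runs against a throwaway temporary directory with a made-up config rather than a real downloaded checkpoint.

```python
import json
import tempfile
from pathlib import Path


def set_model_type(model_dir, model_type="bert"):
    """Set (or add) the "model_type" field in <model_dir>/config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["model_type"] = model_type
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config


# Demo on a temporary directory standing in for the local model folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = {"model_type": "roberta", "hidden_size": 768}
    (Path(tmp) / "config.json").write_text(json.dumps(demo_config), encoding="utf-8")
    updated = set_model_type(tmp)
    reloaded = json.loads((Path(tmp) / "config.json").read_text(encoding="utf-8"))

print(updated["model_type"])
```

On a real checkpoint the call would be `set_model_type("hflroberta")`. The same helper also covers the case where the field is missing and has to be added.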

Then run the command:

In [None]:
python run_language_modeling_roberta.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=hflroberta \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

### ernie

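(https://github.com/nghuyong/ERNIE-Pytorch)

ERNIE is a pretrained model released by Baidu, trained on Chinese corpora such as Baidu Zhidao and Baidu Tieba combined with tasks such as entity prediction. On some tasks its accuracy is better than that of bert-base-chinese and roberta. Continuing pretraining on domain data from the ernie1.0 model needs only one modification.

Download ernie1.0 to a local directory ernie, and add the field "model_type":"bert" to config.json.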

In [None]:
python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=ernie \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

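Finally, language-model pretraining in the huggingface project still masks tokens the standard way: 15% of the tokens are randomly masked and the model then predicts them. For more advanced operations such as whole word masking or entity prediction, you can modify transformers.DataCollatorForLanguageModeling yourself.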
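
A plain-Python sketch of this masking scheme, written as an approximation of what the collator does rather than a copy of it. Assumptions: 15% of non-special tokens are selected; of those, 80% become `[MASK]`, 10% become a random token and 10% stay unchanged; unselected positions get the label -100 so the loss ignores them. The `[MASK]` id 103, the vocabulary size 21128 and the helper name `mask_tokens` are assumptions for the illustration.

```python
import random

MASK_ID = 103   # [MASK] id in the standard BERT vocabularies (assumption)
IGNORE = -100   # label value skipped by the cross-entropy loss


def mask_tokens(input_ids, vocab_size, special_ids, mlm_probability=0.15, seed=0):
    """Return (masked input ids, labels) for masked language modeling."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Special tokens are never masked; others are picked with prob. 15%.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                   # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


ids = [101] + list(range(1000, 1050)) + [102]
masked, labels = mask_tokens(ids, vocab_size=21128, special_ids={0, 101, 102})
print(masked)
print(labels)
```

Whole word masking would change only the selection step: positions would be chosen per word (all word pieces of a word together) instead of per token.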

The code for the experiments in this article is available here, ready to use:

https://github.com/zhusleep/pytorch_chinese_lm_pretrain