### © Copyright 2020 [George Mihaila](https://github.com/gmihaila).

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Info

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14KCDms4YLrE7Ekxl9VtrdT229UTDyim3#offline=true&sandboxMode=true)
[![Generic badge](https://img.shields.io/badge/GitHub-Source-greensvg)](https://github.com/gmihaila/machine_learning_things/blob/master/tutorial_notebooks/pretrain_transformer.ipynb)



This notebook is used to pretrain transformer models using [Huggingface](https://huggingface.co/transformers/). It is part of my collection of trusty notebooks for Machine Learning. Check out more similar content on my website [gmihaila.github.io/useful/useful/](https://gmihaila.github.io/useful/useful/), where I post useful notebooks like this one.

This notebook is **heavily inspired** by the Huggingface script used for training language models: [transformers/tree/master/examples/language-modeling](https://github.com/huggingface/transformers/tree/master/examples/language-modeling).

'Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, CTRL, BERT, RoBERTa, XLNet).
GPT, GPT-2 and CTRL are fine-tuned using a causal language modeling (CLM) loss. BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss. XLNet is fine-tuned using a permutation language modeling (PLM) loss.'

<br>

## How to use this notebook? 

This notebook is a code adaptation of [run_language_modeling.py](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).

**Models that are guaranteed to work:** [GPT](https://huggingface.co/transformers/model_summary.html#original-gpt), [GPT-2](https://huggingface.co/transformers/model_summary.html#gpt-2), [BERT](https://huggingface.co/transformers/model_summary.html#bert), [DistilBERT](https://huggingface.co/transformers/model_summary.html#distilbert), [RoBERTa](https://huggingface.co/transformers/model_summary.html#roberta) and [XLNet](https://huggingface.co/transformers/model_summary.html#xlnet).

The arguments are split into three groups: ModelArguments, DataArguments and TrainingArguments. The only variables you need to configure are `model_args`, `data_args` and `training_args` in **Parameters**:

* `model_args` of type **ModelArguments**: These are the arguments for the model that you want to use, such as `model_name_or_path`, `tokenizer_name` etc. You'll need these to load the model and tokenizer.

  Minimum setup:

  ```python
  model_args = ModelArguments(model_name_or_path, 
                            model_type,
                            tokenizer_name,
                            )
  ```

  * `model_name_or_path` path to existing transformers model or name of transformer model to be used: *bert-base-cased*, *roberta-base*, *gpt2* etc. More details [here](https://huggingface.co/transformers/pretrained_models.html).

  * `model_type` type of model used: *bert*, *roberta*, *gpt2*. More details [here](https://huggingface.co/transformers/pretrained_models.html).

  * `tokenizer_name` [tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html#tokenizer) used to process data for training the model. It usually has the same name as `model_name_or_path`: *bert-base-cased*, *roberta-base*, *gpt2* etc.


* `data_args` of type **DataArguments**: As the name suggests, these are the arguments needed for the dataset, such as the location of your data files. You'll need these to load and process the dataset.

  Minimum setup:

  ```python
  data_args = DataArguments(train_data_file,
                            eval_data_file,
                            mlm,
                            )
  ```
  
  * `train_data_file` path to your dataset. This is a plain text file that contains all the text data used to train the model. Use each line to separate examples: if your dataset is composed of multiple text documents, create a single file in which each line holds one document.

  * `eval_data_file` same story as `train_data_file`. This file is used to evaluate the model performance.

  * `mlm` is a flag that changes the loss function depending on the model architecture. It needs to be set to **True** when working with masked language models like *bert* or *roberta*.



* `training_args` of type **TrainingArguments**: These are the training hyper-parameters such as learning rate, batch size, weight decay, gradient accumulation steps etc. See all possible arguments [here](https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py). These are used by the Trainer.

  Minimum setup:

  ```python
  training_args = TrainingArguments(output_dir, 
                                  do_train, 
                                  do_eval,
                                  )
  ```

  * `output_dir` path where the pre-trained model is saved.
  * `do_train` flag that signals whether training data is used. Set it to **True** if you provided `train_data_file`.
  * `do_eval` flag that signals whether evaluation data is used. Set it to **True** if you provided `eval_data_file`.
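
The one-document-per-line layout described for `train_data_file` can be produced with a few lines of plain Python. This is a minimal sketch, not part of the original notebook: the helper name `write_one_document_per_line`, the toy documents and the temporary output path are all hypothetical and only illustrate the file format.

```python
import os
import tempfile


def write_one_document_per_line(documents, path):
    """Write each document on its own line, collapsing internal whitespace."""
    with open(path, "w", encoding="utf-8") as handle:
        for document in documents:
            # join on whitespace so a document never spans several lines
            single_line = " ".join(document.split())
            if single_line:  # skip documents that are empty after cleaning
                handle.write(single_line + "\n")


# hypothetical toy documents, only for illustration
documents = ["First document.\nIt has two lines.", "Second document.", "   "]
train_path = os.path.join(tempfile.mkdtemp(), "your_train_data")
write_one_document_per_line(documents, train_path)

# read the file back to check that it holds one document per line
with open(train_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

The resulting path is what you would pass as `train_data_file` (and, for a second file built the same way, as `eval_data_file`).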

<br>

## Example:

### Pre-train BERT

In the **Parameters** section, use the following arguments:

```python
# process model arguments. Check Info - Notes for more details
model_args = ModelArguments(model_name_or_path='bert-base-cased', 
                            model_type='bert',
                            tokenizer_name='bert-base-cased',
                            )

# process data arguments. Check Info - Notes for more details
data_args = DataArguments(train_data_file='/content/your_train_data',
                          eval_data_file='/content/your_test_data',
                          mlm=True,
                          )

# process training arguments. Check Info - Notes for more details
training_args = TrainingArguments(output_dir='/content/pretrained_bert', 
                                  do_train=True, 
                                  do_eval=True)
```
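
Whether `mlm` must be **True** depends only on the model type. The sketch below mirrors the check this notebook performs before training, which lists *bert*, *roberta*, *distilbert* and *camembert* as masked language models. The helper name `needs_mlm` is hypothetical and is not part of the transformers library.

```python
# model types that the notebook's sanity check treats as masked language models
MASKED_LM_TYPES = {"bert", "roberta", "distilbert", "camembert"}


def needs_mlm(model_type):
    """Return True when the model type must be trained with mlm=True."""
    return model_type in MASKED_LM_TYPES


# BERT-like models need the flag, causal models such as GPT-2 do not
for model_type in ["bert", "roberta", "gpt2"]:
    print(model_type, "-> mlm =", needs_mlm(model_type))
```

For a causal model such as *gpt2* you would therefore leave `mlm` at its default of **False**.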


<br>

## Notes:
* Parameter details are taken from [here](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb).

* **Models that are guaranteed to work:** [GPT](https://huggingface.co/transformers/model_summary.html#original-gpt), [GPT-2](https://huggingface.co/transformers/model_summary.html#gpt-2), [BERT](https://huggingface.co/transformers/model_summary.html#bert), [DistilBERT](https://huggingface.co/transformers/model_summary.html#distilbert), [RoBERTa](https://huggingface.co/transformers/model_summary.html#roberta) and [XLNet](https://huggingface.co/transformers/model_summary.html#xlnet). I plan on testing more models in the future.
* I used [The WikiText Long Term Dependency Language Modeling Dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) as an example. **To reduce training time I used the validation split for training and the test split for evaluation!**


In [2]:
# check GPU allocation
!nvidia-smi

Sun Aug  9 14:42:51 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Download

In [3]:
# download any dataset or manually upload it. I will use the wikitext raw:
!wget -q -nc https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip -q -n /content/wikitext-2-raw-v1.zip

# Installs

In [4]:
# install latest version of Transformers from GitHub
!pip install -q git+https://github.com/huggingface/transformers

  Building wheel for transformers (setup.py) ... done
  Building wheel for sacremoses (setup.py) ... done


# Imports

In [5]:
import logging
import math
import os
from dataclasses import dataclass, field
from typing import Optional
from transformers import (
                          CONFIG_MAPPING,
                          MODEL_WITH_LM_HEAD_MAPPING,
                          AutoConfig,
                          AutoModelWithLMHead,
                          AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          DataCollatorForPermutationLanguageModeling,
                          HfArgumentParser,
                          LineByLineTextDataset,
                          PreTrainedTokenizer,
                          TextDataset,
                          Trainer,
                          TrainingArguments,
                          set_seed,
                          )

# setup logger
logger = logging.getLogger(__name__)
# get config classes of models with language model heads
MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
# get all model types
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

# Helper Functions

In [6]:
class ModelArguments:
  """Class to define model arguments.
  These are the arguments for the model that you want to use such as the 
  model_name_or_path, tokenizer_name etc. 
  You'll need these to load the model and tokenizer.

  Arguments:
  
    model_name_or_path: path to existing transformers model or name of 
      transformer model to be used: bert-base-cased, roberta-base, gpt2 etc. 
      More details: https://huggingface.co/transformers/pretrained_models.html

    model_type: type of model used: bert, roberta, gpt2. 
      More details: https://huggingface.co/transformers/pretrained_models.html

    tokenizer_name: tokenizer used to process data for training the model. 
      It usually has same name as model_name_or_path: bert-base-cased, 
      roberta-base, gpt2 etc.
    
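    config_name: name or path of a pretrained config, used instead of 
      model_name_or_path to load the configuration when defined.

    cache_dir: path to cache files to save time when re-running code.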
  """
  def __init__(self, model_name_or_path=None, model_type=None, config_name=None, 
               tokenizer_name=None, cache_dir=None):
    self.model_name_or_path = model_name_or_path
    self.model_type = model_type
    self.config_name = config_name
    self.tokenizer_name = tokenizer_name
    self.cache_dir = cache_dir

    return


class DataArguments:
  """Class to define data arguments.
  Arguments needed for the dataset such as the directory name where your 
  files are stored etc. You'll need these to load/process the dataset.

  Arguments:
  
    train_data_file: path to your dataset. This is a plain file that 
      contains all your text data to train a model. Use each line to separate 
      examples: i.e. if you have a dataset composed of multiple text documents, 
      create a single file with each line in the file associated 
      to each text document.

    eval_data_file: same story as train_data_file. This file is used to evaluate 
      the model performance

    line_by_line: if each line is associated to a specific example in 
      your dataset.

    mlm: is a flag that changes loss function depending on model architecture. 
      This variable needs to be set to True when working with masked language 
      models like bert or roberta.

    mlm_probability: used when training masked language models. 
      Needs to have mlm set to True. 
      It represents the probability of masking tokens when training model.

    plm_probability: flag to define the ratio of length of a span of masked 
      tokens to surrounding context length for permutation language modeling.
      Used for XLNet.

    max_span_length: flag may also be used to limit the length of a span of 
      masked tokens used for permutation language modeling.
      Used for XLNet.

    block_size: it refers to block size of data. Set to -1 to use all data.

    overwrite_cache: if there are any cached files, overwrite them.
  """

  def __init__(self, train_data_file=None, eval_data_file=None, line_by_line=False, 
               mlm=False, mlm_probability=0.15, plm_probability=float(1/6), max_span_length=5,
               block_size=-1, overwrite_cache=False):
    self.train_data_file = train_data_file
    self.eval_data_file = eval_data_file
    self.line_by_line = line_by_line
    self.mlm = mlm
    self.mlm_probability = mlm_probability
    self.plm_probability = plm_probability
    self.max_span_length = max_span_length
    self.block_size = block_size
    self.overwrite_cache = overwrite_cache
    return


def get_dataset(args: DataArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
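  """Build the training or evaluation dataset from a text file.

  Uses eval_data_file when evaluate is True, otherwise train_data_file.
  """
  file_path = args.eval_data_file if evaluate else args.train_data_file
  if args.line_by_line:
    # each line of the file is one example
    return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
  else:
    # the whole file is tokenized and split into blocks of block_size tokens
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=args.block_size,
        overwrite_cache=args.overwrite_cache,
    )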
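
To make the difference between the two dataset modes in `get_dataset` concrete, here is a simplified pure-Python sketch that works on lists of token ids instead of real tokenizer output. It assumes that the block mode cuts the token stream into full blocks and drops the incomplete tail, and that the line-by-line mode truncates each non-empty line to `block_size`. The function names are hypothetical, and special tokens, caching and padding are ignored.

```python
def chunk_into_blocks(token_ids, block_size):
    """Block mode: cut one long token stream into full blocks, dropping the tail."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


def one_example_per_line(lines_of_ids, block_size):
    """Line-by-line mode: keep one example per non-empty line, truncated to block_size."""
    return [ids[:block_size] for ids in lines_of_ids if ids]


# toy token ids, only for illustration
stream = list(range(10))
print(chunk_into_blocks(stream, block_size=4))
# prints [[0, 1, 2, 3], [4, 5, 6, 7]]

lines = [[1, 2, 3], [], [4, 5, 6, 7, 8]]
print(one_example_per_line(lines, block_size=4))
# prints [[1, 2, 3], [4, 5, 6, 7]]
```

With `line_by_line=False`, as used in this notebook, line boundaries in the file do not define examples; the number of training examples depends only on the total token count and the block size.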


# Parameters

In [7]:
# process model arguments. Check Info - Notes for more details
model_args = ModelArguments(model_name_or_path='bert-base-cased', 
                            model_type='bert',
                            tokenizer_name='bert-base-cased',
                            )

# process data arguments. Check Info - Notes for more details
data_args = DataArguments(train_data_file='/content/wikitext-2-raw/wiki.valid.raw',
                          eval_data_file='/content/wikitext-2-raw/wiki.test.raw',
                          line_by_line=False,
                          mlm=True,
                          )

# process training arguments. Check Info - Notes for more details
training_args = TrainingArguments(output_dir='pretrain_bert', 
                                  do_train=True, 
                                  do_eval=True,
                                  overwrite_output_dir=True)

# check arguments
if data_args.eval_data_file is None and training_args.do_eval:
  # evaluation requires an evaluation data file
  raise ValueError(
      "Cannot do evaluation without an evaluation data file. "
      "Either supply a file to --eval_data_file "
      "or remove the --do_eval argument."
  )

if (
    os.path.exists(training_args.output_dir)
    and os.listdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir):
  # refuse to overwrite files that already exist in the output directory
  raise ValueError(
      f"Output directory ({training_args.output_dir}) already "
      "exists and is not empty. Use --overwrite_output_dir to overcome."
  )

# setup logger
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
    "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
    training_args.local_rank,
    training_args.device,
    training_args.n_gpu,
    bool(training_args.local_rank != -1),
    training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

# setup seed
set_seed(training_args.seed)

08/09/2020 14:43:26 - INFO - transformers.training_args -   PyTorch: setting up devices
08/09/2020 14:43:26 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='pretrain_bert', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Aug09_14-43-26_431ba4944b5f', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1, run_name=None)


# Load Model and Tokenizer

In [8]:
# Load pretrained model and tokenizer
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.

# check model configuration
if model_args.config_name:
  # use config name if defined
  config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
  # use model name or path if defined
  config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
else:
  # use config mapping
  config = CONFIG_MAPPING[model_args.model_type]()
  logger.warning("You are instantiating a new config instance from scratch.")

# check tokenizer configuration
if model_args.tokenizer_name:
  # use tokenizer name if defined
  tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
  # use model name or path if defined
  tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
else:
  # no tokenizer can be loaded
  raise ValueError(
      "You are instantiating a new tokenizer from scratch. "
      "This is not supported, but you can do it from another script, save it, "
      "and load it from here, using --tokenizer_name"
  )

# check if using pre-trained model or train from scratch
if model_args.model_name_or_path:
  # use pre-trained model
  model = AutoModelWithLMHead.from_pretrained(
      model_args.model_name_or_path,
      from_tf=bool(".ckpt" in model_args.model_name_or_path),
      config=config,
      cache_dir=model_args.cache_dir,
      )
else:
  # use model from configuration - train from scratch
  logger.info("Training new model from scratch")
  model = AutoModelWithLMHead.from_config(config)

# resize model to fit all tokens in tokenizer
model.resize_token_embeddings(len(tokenizer))

# make sure the `--mlm` flag is set for masked language models
if config.model_type in ["bert", "roberta", "distilbert", "camembert"] and not data_args.mlm:
  raise ValueError(
      "BERT and RoBERTa-like models do not have LM heads but "
      "masked LM heads. They must be run using the --mlm flag "
      "(masked language modeling)."
  )

# set up data block size
if data_args.block_size <= 0:
  # set block size to maximum length of tokenizer
  # input block size will be the max possible for the model
  data_args.block_size = tokenizer.max_len
else:
  # never go beyond tokenizer maximum length
  data_args.block_size = min(data_args.block_size, tokenizer.max_len)

08/09/2020 14:43:27 - INFO - filelock -   Lock 139780907161584 acquired on /root/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391.lock
08/09/2020 14:43:27 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp_g5qdb2_



08/09/2020 14:43:28 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json in cache at /root/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391
08/09/2020 14:43:28 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391
08/09/2020 14:43:28 - INFO - filelock -   Lock 139780907161584 released on /root/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391.lock
08/09/2020 14:43:28 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /root/




08/09/2020 14:43:29 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /root/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391
08/09/2020 14:43:29 - INFO - transformers.configuration_utils -   Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

08/09/2020 14:43:30 - INFO - filelock -   Lock 139780907289400 acquired on /root/.cache/torc


08/09/2020 14:43:31 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt in cache at /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
08/09/2020 14:43:31 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
08/09/2020 14:43:31 - INFO - filelock -   Lock 139780907289400 released on /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1.lock
08/09/2020 14:43:31 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /root/.cache/torch/t




08/09/2020 14:43:32 - INFO - filelock -   Lock 139780907161136 acquired on /root/.cache/torch/transformers/d8f11f061e407be64c4d5d7867ee61d1465263e24085cfa26abf183fdc830569.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2.lock
08/09/2020 14:43:32 - INFO - transformers.file_utils -   https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpfxci_az3



08/09/2020 14:43:38 - INFO - transformers.file_utils -   storing https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin in cache at /root/.cache/torch/transformers/d8f11f061e407be64c4d5d7867ee61d1465263e24085cfa26abf183fdc830569.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
08/09/2020 14:43:38 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/d8f11f061e407be64c4d5d7867ee61d1465263e24085cfa26abf183fdc830569.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
08/09/2020 14:43:38 - INFO - filelock -   Lock 139780907161136 released on /root/.cache/torch/transformers/d8f11f061e407be64c4d5d7867ee61d1465263e24085cfa26abf183fdc830569.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2.lock
08/09/2020 14:43:38 - INFO - transformers.modeling_utils -   loading weights file https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin from cache at /root/.cache/torch/transformers/d8f11f061e407be64c4d5d7




- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
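
The block-size step at the end of the cell above boils down to a small rule: a non-positive `block_size` means "use the tokenizer maximum", and any positive value is capped at that maximum. The sketch below isolates that rule as a standalone function. The name `resolve_block_size` is hypothetical, and the value 512 is taken from `max_position_embeddings` in the config printed above as a stand-in for the tokenizer maximum length.

```python
def resolve_block_size(requested, tokenizer_max_len):
    """Return the block size to use, never exceeding the tokenizer maximum."""
    if requested <= 0:
        # -1 (or any non-positive value) means: use the longest input possible
        return tokenizer_max_len
    # never go beyond the tokenizer maximum length
    return min(requested, tokenizer_max_len)


# 512 is assumed here as the tokenizer maximum for bert-base-cased
print(resolve_block_size(-1, 512))    # prints 512
print(resolve_block_size(128, 512))   # prints 128
print(resolve_block_size(1024, 512))  # prints 512
```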


# Dataset

In [9]:
# setup train dataset if `do_train` is set
train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None

# setup evaluation dataset if `do_eval` is set
eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None

# special dataset handle depending on model type
if config.model_type == "xlnet":
  # configure data for XLNET
  data_collator = DataCollatorForPermutationLanguageModeling(
      tokenizer=tokenizer, 
      plm_probability=data_args.plm_probability, 
      max_span_length=data_args.max_span_length,
      )
else:
  # configure data for rest of model types
  data_collator = DataCollatorForLanguageModeling(
      tokenizer=tokenizer, 
      mlm=data_args.mlm, 
      mlm_probability=data_args.mlm_probability
      )

08/09/2020 14:43:43 - INFO - filelock -   Lock 139780907175328 acquired on /content/wikitext-2-raw/cached_lm_BertTokenizer_510_wiki.valid.raw.lock
08/09/2020 14:43:43 - INFO - transformers.data.datasets.language_modeling -   Creating features from dataset file at /content/wikitext-2-raw
08/09/2020 14:43:45 - INFO - transformers.data.datasets.language_modeling -   Saving features into cached file /content/wikitext-2-raw/cached_lm_BertTokenizer_510_wiki.valid.raw [took 0.008 s]
08/09/2020 14:43:45 - INFO - filelock -   Lock 139780907175328 released on /content/wikitext-2-raw/cached_lm_BertTokenizer_510_wiki.valid.raw.lock
08/09/2020 14:43:45 - INFO - filelock -   Lock 139783131337168 acquired on /content/wikitext-2-raw/cached_lm_BertTokenizer_510_wiki.test.raw.lock
08/09/2020 14:43:45 - INFO - transformers.data.datasets.language_modeling -   Creating features from dataset file at /content/wikitext-2-raw
08/09/2020 14:43:47 - INFO - transformers.data.datasets.language_modeling -   Saving 
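
The `mlm_probability` passed to the data collator in the cell above controls how many tokens are selected for prediction. The following is a simplified, pure-Python sketch of BERT-style masking, not the library implementation: it assumes the usual 80/10/10 rule (replace with the mask token, replace with a random token, keep unchanged) and uses `-100` as the label for positions that are ignored by the loss. The function name, the `mask_id` value and the toy token ids are illustrative assumptions, and special tokens are not protected from masking here.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) after BERT-style masking of a list of token ids."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    for position, token in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected: input unchanged, no label
        labels[position] = token  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[position] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[position] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# toy token ids; mask_id=103 is an assumed placeholder, vocab_size comes from the config above
tokens = [1188, 1110, 170, 2774, 5650, 119]
inputs, labels = mask_tokens(tokens, mask_id=103, vocab_size=28996)
print(inputs)
print(labels)
```

With `mlm=False` no masking happens at all: the collator passes the tokens through and the model is trained with a causal language modeling loss instead.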

# Trainer

In [10]:
# initialize Trainer
trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  prediction_loss_only=True,
                  )

# train and save the model if `do_train` is set
if training_args.do_train:
  # pass the model path to the trainer only if it is a local directory
  model_path = (model_args.model_name_or_path 
                if model_args.model_name_or_path is not None and 
                os.path.isdir(model_args.model_name_or_path) 
                else None
                )
  trainer.train(model_path=model_path)
  trainer.save_model()
  # For convenience, we also re-save the tokenizer to the same directory,
  # so that you can share your model easily on huggingface.co/models =)
  if trainer.is_world_master():
    tokenizer.save_pretrained(training_args.output_dir)

08/09/2020 14:44:01 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
08/09/2020 14:44:01 - INFO - transformers.trainer -   To use comet_ml logging, run `pip/conda install comet_ml` see https://www.comet.ml/docs/python-sdk/huggingface/
08/09/2020 14:44:01 - INFO - transformers.trainer -   ***** Running training *****
08/09/2020 14:44:01 - INFO - transformers.trainer -     Num examples = 481
08/09/2020 14:44:01 - INFO - transformers.trainer -     Num Epochs = 3
08/09/2020 14:44:01 - INFO - transformers.trainer -     Instantaneous batch size per device = 8
08/09/2020 14:44:01 - INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 8
08/09/2020 14:44:01 - INFO - transformers.trainer -     Gradient Accumulation steps = 1
08/09/2020 14:44:01 - INFO - transformers.trainer -     Total optimization steps = 18


08/09/2020 14:47:17 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


08/09/2020 14:47:17 - INFO - transformers.trainer -   Saving model checkpoint to pretrain_bert
08/09/2020 14:47:17 - INFO - transformers.configuration_utils -   Configuration saved in pretrain_bert/config.json






08/09/2020 14:47:19 - INFO - transformers.modeling_utils -   Model weights saved in pretrain_bert/pytorch_model.bin
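
The training log above can be sanity-checked with a little arithmetic. This sketch recomputes the number of optimization steps from the logged values (481 examples, batch size 8, 3 epochs, no gradient accumulation), assuming the last incomplete batch of each epoch is kept, which matches `dataloader_drop_last=False` in the printed training arguments.

```python
import math

num_examples = 481     # "Num examples" in the training log
batch_size = 8         # "Total train batch size" in the training log
num_epochs = 3         # "Num Epochs" in the training log
grad_accum_steps = 1   # "Gradient Accumulation steps" in the training log

# the last batch of an epoch is smaller but still counts as one iteration
batches_per_epoch = math.ceil(num_examples / batch_size)
total_steps = (batches_per_epoch // grad_accum_steps) * num_epochs

print(batches_per_epoch)  # prints 61
print(total_steps)        # prints 183
```

The value 61 corresponds to the per-epoch iteration count of the run, and 183 is the final `step` reported in the evaluation output of this notebook.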


# Evaluate

In [11]:
# save results
results = {}

# check if `do_eval` flag is set
if training_args.do_eval:
  # evaluate model on evaluation data
  logger.info("*** Evaluate ***")
  # capture output of trainer evaluate
  eval_output = trainer.evaluate()
  # compute perplexity from model loss
  perplexity = math.exp(eval_output["eval_loss"])
  # save perplexity in result
  result = {"perplexity": perplexity}
  # set path for output evaluation file
  output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
  # dump evaluation results to file
  if trainer.is_world_master():
    with open(output_eval_file, "w") as writer:
      logger.info("***** Eval results *****")
      for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))
  # update results
  results.update(result)

print("results: ",results)

08/09/2020 14:47:19 - INFO - __main__ -   *** Evaluate ***
08/09/2020 14:47:19 - INFO - transformers.trainer -   ***** Running Evaluation *****
08/09/2020 14:47:19 - INFO - transformers.trainer -     Num examples = 552
08/09/2020 14:47:19 - INFO - transformers.trainer -     Batch size = 8



08/09/2020 14:47:45 - INFO - __main__ -   ***** Eval results *****
08/09/2020 14:47:45 - INFO - __main__ -     perplexity = 15.721432898648294



{'eval_loss': 2.7550249341605366, 'epoch': 3.0, 'step': 183}
results:  {'perplexity': 15.721432898648294}
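
The reported perplexity is simply the exponential of the evaluation loss. As a quick check, the sketch below recomputes it from the `eval_loss` value printed above.

```python
import math

# eval_loss copied from the evaluation output above
eval_loss = 2.7550249341605366

# perplexity is the exponential of the average cross-entropy loss
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # prints 15.72
```

Keep in mind that for a masked language model like BERT the loss is computed only on the masked positions, so this number is not directly comparable with the perplexity of a causal model such as GPT-2.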
