## Further Pre-training

Besides the training data of a target task, we can further pre-train a transformer on the data from the same domain.

![image.png](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-32381-3_16/MediaObjects/489562_1_En_16_Fig1_HTML.png)

Transformer models are pre-trained on general-domain corpora. For a text classification or regression task in a specific domain, such as readability assessment, the data distribution may differ from that of the pre-training corpus (e.g. RoBERTa was trained on BookCorpus, Wiki, CC-News, OpenWebText and Stories). The idea, therefore, is to further pre-train the transformer on the domain-specific data with the masked language model and next sentence prediction tasks (this kernel uses only the masked language model objective). Three further pre-training approaches are considered:

1) `Within-task pre-training (ITPT)`, in which the transformer is further pre-trained on the training data of the target task. `This Kernel.`

2) `In-domain pre-training (IDPT)`, in which the pre-training data is obtained from the same domain as the target task. For example, several different sentiment classification tasks may share a similar data distribution, so we can further pre-train the transformer on the combined training data from these tasks.

3) `Cross-domain pre-training (CDPT)`, in which the pre-training data is obtained from both the same domain as the target task and other, different domains.
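
The three approaches differ only in which texts go into the further pre-training corpus. The sketch below illustrates that selection on toy data; the `CORPORA` dictionary, its domain and task names, and the `select_pretraining_texts` helper are all hypothetical and are not part of this kernel or of the referenced paper.

```python
from typing import Dict, List

# Hypothetical toy corpora keyed by domain, then task; names and texts are illustrative only.
CORPORA: Dict[str, Dict[str, List[str]]] = {
    "sentiment": {
        "movie_reviews": ["a moving film", "dull and slow"],
        "product_reviews": ["works as advertised"],
    },
    "topic": {
        "news_topics": ["markets rallied on friday"],
    },
}


def select_pretraining_texts(strategy: str, domain: str, task: str) -> List[str]:
    """Return the texts used for further pre-training under the given strategy."""
    if strategy == "ITPT":
        # Within-task: only the target task's own training texts.
        return list(CORPORA[domain][task])
    if strategy == "IDPT":
        # In-domain: every task that shares the target task's domain.
        return [t for texts in CORPORA[domain].values() for t in texts]
    if strategy == "CDPT":
        # Cross-domain: the target domain plus all other domains.
        return [t for tasks in CORPORA.values() for texts in tasks.values() for t in texts]
    raise ValueError(f"unknown strategy: {strategy}")


itpt = select_pretraining_texts("ITPT", "sentiment", "movie_reviews")
idpt = select_pretraining_texts("IDPT", "sentiment", "movie_reviews")
cdpt = select_pretraining_texts("CDPT", "sentiment", "movie_reviews")
print(len(itpt), len(idpt), len(cdpt))  # 2 3 4
```

Each strategy's corpus contains the previous one, so the corpus grows from ITPT to IDPT to CDPT while its distribution moves further from the target task.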

#### Reference: [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf)

> Note: This Kernel implements ITPT, i.e. Within-Task Pre-training. First we further pre-train a transformer (SpanBERT in the configuration used here) and then use the resulting model for fine-tuning with different strategies.
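
To make the masked language model objective concrete, here is a minimal, framework-free sketch of the usual BERT-style masking recipe: each token is selected with probability `mlm_probability`, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The `mask_tokens` function and its arguments are illustrative assumptions. It works on plain strings rather than token ids and is not the code the kernel runs.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mlm_probability: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked inputs, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(list(vocab))  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


tokens = "the patient reports chest pain after exercise".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mlm_probability=0.15, seed=0)
print(masked)
print(labels)
```

Only the selected positions contribute to the loss, which is why the labels are empty everywhere else.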

#### Code Reference: 
`Transformer Examples` - https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm_no_trainer.py

> 90-95% of the code comes from the great `run_mlm_no_trainer.py` script in the `HuggingFace Examples Repository`. I have merely `changed a few lines to adapt the code to my task`.
    
    P.S. Make sure to understand everything instead of blindly copying the code.

### Install Dependencies

In [1]:
!nvidia-smi

Fri Apr 29 14:19:06 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    51W / 350W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install datasets accelerate 



In [3]:
import pandas as pd
import numpy as np

### Import Dependencies

In [4]:
import logging
import math
import os

import datasets
from datasets import load_dataset
from accelerate import Accelerator

from torch.optim import AdamW
import torch
from torch.utils.data import DataLoader

import transformers
from transformers import (
    CONFIG_MAPPING, 
    MODEL_MAPPING, 
    AutoConfig, 
    AutoModelForMaskedLM, 
    AutoTokenizer, 
    DataCollatorForLanguageModeling, 
    get_scheduler, 
    set_seed
)

logger = logging.getLogger(__name__)
MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

### Config

In [5]:
class TrainConfig:
    train_file = "mlm_data.csv"
    validation_file = "mlm_data_val.csv"  # held-out fold written by main(), not the training file
    pad_to_max_length = True
    fold = 0
    model_name_or_path = "SpanBERT/spanbert-large-cased"
    config_name = "SpanBERT/spanbert-large-cased"
    tokenizer_name = "SpanBERT/spanbert-large-cased"
    use_slow_tokenizer = False
    per_device_train_batch_size = 8
    per_device_eval_batch_size = 2
    learning_rate = 5e-5
    weight_decay = 0.0
    num_train_epochs = 100  # change to 5
    max_seq_length = 512
    max_train_steps = None
    gradient_accumulation_steps = 2
    lr_scheduler_type = "constant_with_warmup"
    num_warmup_steps = 0
    output_dir = "../output/spanbert"
    seed = 2021
    model_type = "bert"  # CONFIG_MAPPING key; only used when no config/model name is given
    mlm_column = "pn_history"
    line_by_line = False
    path_original_dataset = "../input/corpus.csv"
    preprocessing_num_workers = 4
    overwrite_cache = True
    mlm_probability = 0.15
    additional_tokens = []


config = TrainConfig()

if config.train_file is not None:
    extension = config.train_file.split(".")[-1]
    assert extension in [
        "csv",
        "json",
        "txt",
    ], "`train_file` should be a csv, json or txt file."
if config.validation_file is not None:
    extension = config.validation_file.split(".")[-1]
    assert extension in [
        "csv",
        "json",
        "txt",
    ], "`validation_file` should be a csv, json or txt file."
if config.output_dir is not None:
    os.makedirs(config.output_dir, exist_ok=True)
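
The batch-size, accumulation and epoch settings in `TrainConfig` together determine how many optimizer updates a run performs. The sketch below reproduces that arithmetic for a single process; the `training_schedule` helper and the example count of 1000 grouped sequences are hypothetical, and the multi-process case is only approximated.

```python
import math
from typing import Dict


def training_schedule(
    num_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_epochs: int,
    num_processes: int = 1,
) -> Dict[str, int]:
    """Derive effective batch size and number of optimizer updates."""
    # Number of mini-batches each process sees per epoch (approximation for num_processes > 1).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_processes))
    # One optimizer update per `grad_accum_steps` mini-batches, rounding up for the tail.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return {
        "total_batch_size": per_device_batch_size * num_processes * grad_accum_steps,
        "updates_per_epoch": updates_per_epoch,
        "max_train_steps": updates_per_epoch * num_epochs,
    }


# Batch size 8 and accumulation 2 as in TrainConfig; 1000 examples and 5 epochs are assumed.
schedule = training_schedule(
    num_examples=1000, per_device_batch_size=8, grad_accum_steps=2, num_epochs=5
)
print(schedule)
# {'total_batch_size': 16, 'updates_per_epoch': 63, 'max_train_steps': 315}
```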

### Run

In [6]:
def main():
    args = TrainConfig()
    accelerator = Accelerator()
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        filename="../output/transformer-ssl-spanbert-large.log",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )
    logger.info(accelerator.state)
    logger.setLevel(logging.INFO if accelerator.is_local_main_process else logging.ERROR)

    if accelerator.is_local_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
    if args.seed is not None:
        set_seed(args.seed)
    
    df = pd.read_csv(args.path_original_dataset)
    
    # Training split: every fold except the held-out one, keeping only the MLM text column.
    mlm_data = df.loc[df['fold']!=args.fold, [args.mlm_column]]
    mlm_data = mlm_data.rename(columns={args.mlm_column:'text'})
    mlm_data.to_csv('mlm_data.csv', index=False)

    # Validation split: the held-out fold.
    mlm_data_val = df.loc[df['fold']==args.fold, [args.mlm_column]]
    mlm_data_val = mlm_data_val.rename(columns={args.mlm_column:'text'})
    mlm_data_val.to_csv('mlm_data_val.csv', index=False)

    data_files = {}
    if args.train_file is not None:
        data_files["train"] = args.train_file
    if args.validation_file is not None:
        data_files["validation"] = args.validation_file
    extension = args.train_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    raw_datasets = load_dataset(extension, data_files=data_files)
    
    if args.config_name:
        config = AutoConfig.from_pretrained(args.config_name)
    elif args.model_name_or_path:
        config = AutoConfig.from_pretrained(args.model_name_or_path)
    else:
        config = CONFIG_MAPPING[args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")

    print("========= load tokenizer ============")
    if args.tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer)
    elif args.model_name_or_path:
        tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer)
    else:
        raise ValueError(
            "You are instantiating a new tokenizer from scratch. This is not supported by this script. "
            "You can do it from another script, save it, and load it from here, using --tokenizer_name."
        )
    print("========= loaded tokenizer ============")
    if len(args.additional_tokens):
        tokenizer.add_tokens(args.additional_tokens)
    
    if args.model_name_or_path:
        model = AutoModelForMaskedLM.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
        )
    else:
        logger.info("Training new model from scratch")
        model = AutoModelForMaskedLM.from_config(config)
    model.resize_token_embeddings(len(tokenizer))

    column_names = raw_datasets["train"].column_names
    text_column_name = "text" if "text" in column_names else column_names[0]
    
    print(f"tokenizer model_max_length: {tokenizer.model_max_length}")
    print("========= loaded tokenizer ============")
    if args.max_seq_length is None:
        max_seq_length = tokenizer.model_max_length
        if max_seq_length > 512:
            logger.warning(
                f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
                "Picking 512 instead. You can change that default value by setting `max_seq_length` in `TrainConfig`."
            )
            max_seq_length = 512
    else:
        if args.max_seq_length > tokenizer.model_max_length:
            logger.warning(
                f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the "
                f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}."
            )
        max_seq_length = min(args.max_seq_length, tokenizer.model_max_length)

    def tokenize_function(examples):
        return tokenizer(examples[text_column_name], return_special_tokens_mask=True)

    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not args.overwrite_cache,
    )

    def group_texts(examples):
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        total_length = (total_length // max_seq_length) * max_seq_length
        result = {
            k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
            for k, t in concatenated_examples.items()
        }
        return result

    tokenized_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        load_from_cache_file=not args.overwrite_cache,
    )
    train_dataset = tokenized_datasets["train"]
    eval_dataset = tokenized_datasets["validation"]

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=args.mlm_probability)
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=data_collator, batch_size=args.per_device_train_batch_size
    )
    eval_dataloader = DataLoader(eval_dataset, collate_fn=data_collator, batch_size=args.per_device_eval_batch_size)

    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate)

    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    if args.max_train_steps is None:
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    else:
        args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=args.max_train_steps,
    )

    total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps

    logger.info("***** Running training *****")
    logger.info(f"  Num examples = {len(train_dataset)}")
    logger.info(f"  Num Epochs = {args.num_train_epochs}")
    logger.info(f"  Instantaneous batch size per device = {args.per_device_train_batch_size}")
    logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
    logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")
    logger.info(f"  Total optimization steps = {args.max_train_steps}")
    completed_steps = 0

    best_perplexity = np.inf
    n_update_best_perplexity = 0
    for epoch in range(args.num_train_epochs):
        
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / args.gradient_accumulation_steps
            accelerator.backward(loss)
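            # Update once every `gradient_accumulation_steps` micro-batches (and on the last batch of the epoch)
            if (step + 1) % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: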
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
                completed_steps += 1

            if completed_steps >= args.max_train_steps:
                break

        model.eval()
        losses = []
        # inference_mode already disables gradient tracking, so no nested no_grad is needed
        with torch.inference_mode():
            for batch in eval_dataloader:
                outputs = model(**batch)

                loss = outputs.loss
                losses.append(accelerator.gather(loss.repeat(args.per_device_eval_batch_size)))

        losses = torch.cat(losses)
        losses = losses[: len(eval_dataset)]
        perplexity = math.exp(torch.mean(losses))

        logger.info(f"epoch {epoch}: perplexity: {perplexity}")

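        # Count every epoch that beats the last *saved* perplexity, but only write a checkpoint
        # (and update best_perplexity) on every 5th such improvement or on the final epoch.
        if perplexity < best_perplexity:
            n_update_best_perplexity += 1
            if n_update_best_perplexity % 5 == 0 or epoch==args.num_train_epochs-1:
                best_perplexity = perplexity
                accelerator.wait_for_everyone()
                unwrapped_model = accelerator.unwrap_model(model)
                unwrapped_model.save_pretrained(args.output_dir, save_function=accelerator.save)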
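
The `group_texts` step is easy to misread: it concatenates all tokenized texts in a batch and cuts the result into fixed-length chunks, silently dropping whatever is left over at the end. The sketch below runs the same logic on toy token ids so the effect is visible; passing `max_seq_length` as an argument and the tiny value of 4 are changes made only for illustration.

```python
from typing import Dict, List


def group_texts(examples: Dict[str, List[List[int]]], max_seq_length: int) -> Dict[str, List[List[int]]]:
    # Concatenate every list in the batch into one long sequence per key.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder so every chunk has exactly max_seq_length tokens.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }


batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
chunks = group_texts(batch, max_seq_length=4)
print(chunks)  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

Token `9` is discarded because it does not fill a complete chunk, and chunk boundaries do not respect the original text boundaries.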

In [None]:
if __name__ == "__main__":
    main()

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-8d6c8657429f80d9/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-8d6c8657429f80d9/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

loading configuration file https://huggingface.co/SpanBERT/spanbert-large-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1a1dfe6956710e7344f6fc7595b16b878615c5f6f2b91e9699f6c8787af0d2fb.6b8ba0ec4a9062565fc1a89d50d0feecbd2295e94a5dd52cde59d9109618a95a
Model config BertConfig {
  "_name_or_path": "SpanBERT/spanbert-large-cased",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}





Could not locate the tokenizer configuration file, will try to use the model config instead.



loading weights file https://huggingface.co/SpanBERT/spanbert-large-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/47c606d9dcccc6b316f18beea908fad9eccfec6213d9a8ea31e366fc4233934d.8466fabf7a827d20467e8c2781c1fff0ba40669185f5e0eb34035b8019a36d4f
All model checkpoint weights were used when initializing BertForMaskedLM.

Some weights of BertForMaskedLM were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer model_max_length: 1000000000000000019884624838656
        

#0:   0%|          | 0/9 [00:00<?, ?ba/s]

#1:   0%|          | 0/9 [00:00<?, ?ba/s]

#2:   0%|          | 0/9 [00:00<?, ?ba/s]

#3:   0%|          | 0/9 [00:00<?, ?ba/s]

        


Configuration saved in ../output/spanbert/config.json
Model weights saved in ../output/spanbert/pytorch_model.bin

In [None]:
%env TOKENIZERS_PARALLELISM=true
tokenizer = AutoTokenizer.from_pretrained("SpanBERT/spanbert-large-cased", trim_offsets=False)

In [None]:
tokenizer.save_pretrained('./tokenizer/')