<a href="https://colab.research.google.com/github/neqkir/working-with-tranformers/blob/main/training_bert_huggingface_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Pre-training BERT with Hugging Face Transformers in Python**

https://www.thepythoncode.com/article/pretraining-bert-huggingface-transformers-in-python

A pre-trained model is a model that was previously trained on a large dataset and saved for direct use or fine-tuning. In this tutorial, you will learn how you can train BERT (or any other transformer model) from scratch on your custom raw text dataset with the help of the Huggingface transformers library in Python.

Transformers can be pre-trained with self-supervised tasks. Two popular tasks used for BERT are:

Masked Language Modeling (MLM): This task consists of masking a certain percentage of the tokens in the sentence, and the model is trained to predict those masked words. We'll be using this one in this tutorial.

Next Sentence Prediction (NSP): The model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document.
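
As a rough illustration of the MLM objective described above, the sketch below hides a random subset of tokens and records what the model would have to predict at each hidden position. The function name `mask_tokens`, its parameters, and the fixed seed are assumptions made for this example and are not part of the transformers library. The sketch also simplifies the real procedure by always substituting `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_probability=0.15, mask_token="[MASK]", seed=0):
    """Hypothetical helper: hide each token with probability `mask_probability`."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            masked.append(mask_token)  # the model sees the mask...
            labels.append(token)       # ...and must predict the original token
        else:
            masked.append(token)       # token left visible
            labels.append(None)        # nothing to predict at this position
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_probability=0.3)
print(masked)
print(labels)
```

Only the positions that carry a label contribute to the training loss; the visible tokens serve as context.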

In [None]:
pip install datasets transformers==4.11.2 sentencepiece

In [5]:
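# import only the names this notebook uses instead of wildcard imports
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments, pipeline,
)
from tokenizers import BertWordPieceTokenizer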
import os
import json

The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset was prepared using news-please, an integrated web crawler and information extractor for news.
It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English-language subset of the CC-News dataset.

In [6]:
# download and prepare cc_news dataset
dataset = load_dataset("cc_news", split="train")

Downloading and preparing dataset cc_news/plain_text (download: 805.98 MiB, generated: 1.88 GiB, post-processed: Unknown size, total: 2.67 GiB) to /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/ae469e556251e6e7e20a789f93803c7de19d0c4311b6854ab072fecb4e401bd6...


Dataset cc_news downloaded and prepared to /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/ae469e556251e6e7e20a789f93803c7de19d0c4311b6854ab072fecb4e401bd6. Subsequent calls will reuse this data.


In [7]:
# split the dataset into training (90%) and testing (10%)
d = dataset.train_test_split(test_size=0.1)
d["train"], d["test"]

for t in d["train"]["text"][:3]:
  print(t)
  print("="*50)

By Richard Ndoma, Calabar Worried by the scarcity of kerosene, major stakeholders in oil and gas industry have advocated for the involvement of micro finance...
Staunchly conservative U.S. lawmaker Scalise among wounded in shooting
WASHINGTON, June 14 Steve Scalise, the Republican leader wounded in a gunman's attack on Wednesday on people practicing for a charity baseball game, is a staunch conservative and key figure in trying to push legislation through the U.S. House of Representatives.
Missouri State Highway Patrol trooper shoots robbery suspect
ST. JOSEPH, Mo. (AP) — Authorities say a Missouri State Highway Patrol trooper chasing two robbery suspects on foot returned fire and shot one of them.
The Missouri Highway Patrol said in a news release Monday that it was alerted at 10:46 a.m. Monday of a theft in progress at a home in Ridgeway, Missouri. The Harrison County Sheriff's Department provided a vehicle description and advised the suspect vehicle was southbound on Interstate 35.


Next, we need to train our tokenizer. To do that, we write our dataset to text files, because that is the input format the tokenizers library requires:

In [8]:
# if you want to train the tokenizer from scratch (especially if your custom
# dataset is loaded as a datasets object), run this cell to save it as text files;
# if your custom data is already in text files, you can skip this cell
def dataset_to_text(dataset, output_filename="data.txt"):
  """Utility function to save dataset text to disk,
  useful for using the texts to train the tokenizer
  (as the tokenizer accepts files)"""
  # write UTF-8 explicitly so non-ASCII text is saved the same way on every platform
  with open(output_filename, "w", encoding="utf-8") as f:
    for t in dataset["text"]:
      print(t, file=f)

# save the training set to train.txt
dataset_to_text(d["train"], "train.txt")
# save the testing set to test.txt
dataset_to_text(d["test"], "test.txt")

In [9]:
special_tokens = [
  "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]
# if you want to train the tokenizer on both sets
# files = ["train.txt", "test.txt"]
# training the tokenizer on the training set
files = ["train.txt"]
# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522
# maximum sequence length, lowering it results in faster training (when increasing the batch size)
max_length = 512
# whether to truncate
truncate_longer_samples = True

Since this is BERT, the default tokenizer is WordPiece. We initialize the `BertWordPieceTokenizer` class from the tokenizers library and call its `train()` method to train it.



In [10]:
# initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()
# train the tokenizer
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)
# enable truncation up to the maximum 512 tokens
tokenizer.enable_truncation(max_length=max_length)
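
To give a feel for what a trained WordPiece vocabulary does at encoding time, here is a minimal sketch of greedy longest-match-first splitting of a single word. The function `wordpiece_tokenize` and the toy vocabulary are made up for illustration. The sketch covers only how a word is split once a vocabulary exists, not how `train()` learns that vocabulary, and it is not the tokenizers library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Hypothetical helper: split one word into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# toy vocabulary, assumed for this example only
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```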

We save the tokenizer.

The `tokenizer.save_model()` method saves the vocabulary file to the `model_path` directory. We also manually save some tokenizer configuration, such as the special tokens:

`unk_token`: A special token that represents an out-of-vocabulary token. Even though the tokenizer is a WordPiece tokenizer, unknown tokens are rare but not impossible.

`sep_token`: A special token that separates two different sentences in the same input.

`pad_token`: A special token that is used to fill sentences that do not reach the maximum sequence length (since the arrays of tokens must be the same size).

`cls_token`: A special token representing the class of the input.

`mask_token`: This is the mask token we use for the Masked Language Modeling (MLM) pretraining task.
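
The layout these special tokens produce in a BERT-style input can be sketched as follows. The helper `build_input` is hypothetical and works on plain token strings rather than ids. It assumes the input fits in `max_length`, so truncation is left out.

```python
def build_input(tokens_a, tokens_b=None, max_length=10):
    """Hypothetical helper: lay out [CLS], [SEP] and [PAD] around one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]  # the second sentence is closed by its own [SEP]
    if len(tokens) > max_length:
        raise ValueError("input is longer than max_length; truncation is not handled in this sketch")
    attention_mask = [1] * len(tokens)   # 1 marks real tokens
    padding = max_length - len(tokens)
    # pad up to max_length; 0 in the attention mask marks padding
    return tokens + ["[PAD]"] * padding, attention_mask + [0] * padding

tokens, attention_mask = build_input(["hello", "world"], ["good", "bye"], max_length=10)
print(tokens)
# ['[CLS]', 'hello', 'world', '[SEP]', 'good', 'bye', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```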

In [11]:
model_path = "pretrained-bert"
# make the directory if not already there
os.makedirs(model_path, exist_ok=True)
# save the tokenizer  
tokenizer.save_model(model_path)
# dumping some of the tokenizer config to config file, 
# including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(model_path, "config.json"), "w") as f:
  tokenizer_cfg = {
      "do_lower_case": True,
      "unk_token": "[UNK]",
      "sep_token": "[SEP]",
      "pad_token": "[PAD]",
      "cls_token": "[CLS]",
      "mask_token": "[MASK]",
      "model_max_length": max_length,
      "max_len": max_length,
  }
  json.dump(tokenizer_cfg, f)

In [12]:
# when the tokenizer is trained and configured, load it  
tokenizer = BertTokenizerFast.from_pretrained(model_path)

Now that we have the tokenizer ready, we tokenize the dataset:



In [13]:
def encode_with_truncation(examples):
  """Mapping function to tokenize the sentences passed with truncation"""
  return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
  """Mapping function to tokenize the sentences passed without truncation"""
  return tokenizer(examples["text"], return_special_tokens_mask=True)

# the encode function will depend on the truncate_longer_samples variable
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

# tokenizing the train dataset
train_dataset = d["train"].map(encode, batched=True)
# tokenizing the testing dataset
test_dataset = d["test"].map(encode, batched=True)
if truncate_longer_samples:
  # keep only input_ids and attention_mask, returned as PyTorch tensors
  train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
  test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
  test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
  train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
train_dataset, test_dataset

(Dataset({
     features: ['attention_mask', 'date', 'description', 'domain', 'image_url', 'input_ids', 'special_tokens_mask', 'text', 'title', 'token_type_ids', 'url'],
     num_rows: 637416
 }), Dataset({
     features: ['attention_mask', 'date', 'description', 'domain', 'image_url', 'input_ids', 'special_tokens_mask', 'text', 'title', 'token_type_ids', 'url'],
     num_rows: 70825
 }))

In [14]:
# Truncating all sentences so that all samples have the same length
truncate_longer_samples=True

# Main data processing function that will concatenate all texts from our dataset and generate chunks of
# max_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    # Split into chunks of max_length.
    result = {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    return result
# Note that with `batched=True` and `batch_size=2_000`, this map processes 2,000 texts together, so
# group_texts throws away a remainder for each of those groups of 2,000 texts. You can adjust batch_size
# here, but a higher value might be slower to preprocess.
#
# To speed up this part, we use multiprocessing (num_proc=4). See the documentation of the map method for
# more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
if not truncate_longer_samples:
  train_dataset = train_dataset.map(group_texts, batched=True, batch_size=2_000,
                                    num_proc=4, desc=f"Grouping texts in chunks of {max_length}")
  test_dataset = test_dataset.map(group_texts, batched=True, batch_size=2_000,
                                  num_proc=4, desc=f"Grouping texts in chunks of {max_length}")
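
The concatenate-then-chunk logic of `group_texts` is easier to follow on a toy input. The sketch below applies the same idea to plain lists of integers instead of a batched `datasets` dictionary. The name `group_into_chunks` and the sample data are assumptions for illustration.

```python
def group_into_chunks(sequences, chunk_size):
    """Hypothetical helper: concatenate token lists, then cut them into fixed-size chunks."""
    concatenated = [token for sequence in sequences for token in sequence]
    total_length = len(concatenated)
    # as in group_texts: drop the remainder only when at least one full chunk exists
    if total_length >= chunk_size:
        total_length = (total_length // chunk_size) * chunk_size
    return [concatenated[i : i + chunk_size] for i in range(0, total_length, chunk_size)]

chunks = group_into_chunks([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens with a chunk size of 4 give two full chunks, and the last two tokens (9 and 10) are dropped. This is the per-batch remainder that the comments above refer to.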

In [15]:
# initialize the model with the config
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
model = BertForMaskedLM(config=model_config)

In [16]:
# initialize the data collator, randomly masking 20% (default is 15%) of the tokens for the Masked Language
# Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
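
For intuition about what the collator does to each batch, here is a simplified pure-Python sketch of BERT-style dynamic masking on a single sequence of token ids. It follows the usual recipe: a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time. Unselected positions get the label -100, which PyTorch's cross-entropy loss ignores by default. The function `mlm_collate`, the toy ids, and the seed are assumptions for this example. The real collator works on padded tensors and draws fresh masks on every call.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the loss

def mlm_collate(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.2, seed=0):
    """Hypothetical helper: apply 80/10/10 masking to one sequence of token ids."""
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for token_id, is_special in zip(input_ids, special_tokens_mask):
        # special tokens such as [CLS], [SEP] and [PAD] are never selected
        if is_special or rng.random() >= mlm_probability:
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token_id)  # the model must recover the original id
        roll = rng.random()
        if roll < 0.8:
            masked_ids.append(mask_token_id)              # replace with [MASK]
        elif roll < 0.9:
            masked_ids.append(rng.randrange(vocab_size))  # replace with a random token
        else:
            masked_ids.append(token_id)                   # keep the token unchanged
    return masked_ids, labels

# toy ids, assumed for this example: 2 = [CLS], 3 = [SEP], 4 = [MASK]
input_ids = [2, 101, 102, 103, 104, 105, 3]
special_tokens_mask = [1, 0, 0, 0, 0, 0, 1]
masked_ids, labels = mlm_collate(input_ids, special_tokens_mask, mask_token_id=4, vocab_size=200)
print(masked_ids)
print(labels)
```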

In [17]:
# initialize training arguments

training_args = TrainingArguments(
    output_dir=model_path,          # output directory where model checkpoints are saved
    evaluation_strategy="steps",    # evaluate every `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=10,            # number of training epochs, feel free to tweak
    per_device_train_batch_size=10, # the training batch size, set it as high as your GPU memory allows
    gradient_accumulation_steps=8,  # accumulate gradients over 8 batches before updating the weights
    per_device_eval_batch_size=64,  # evaluation batch size
    logging_steps=500,              # evaluate and log every 500 steps
    save_steps=500,                 # save a model checkpoint every 500 steps
    # load_best_model_at_end=True,  # whether to load the best model (in terms of loss) at the end of training
    # save_total_limit=3,           # if disk space is limited, keep only the 3 most recent checkpoints
)
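
To get a sense of scale for these settings, the arithmetic below estimates the effective batch size and the number of weight updates. It assumes a single GPU and uses the 637,416 training rows reported earlier. The step count is an approximation of how the Trainer counts updates, so treat the numbers as a rough guide.

```python
import math

num_train_rows = 637_416         # size of the training split shown above
per_device_train_batch_size = 10
gradient_accumulation_steps = 8
num_train_epochs = 10
num_devices = 1                  # assumption: a single GPU

# samples that contribute to one weight update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# batches per epoch, then weight updates per epoch (one update every 8 batches)
batches_per_epoch = math.ceil(num_train_rows / (per_device_train_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size)  # 80
print(updates_per_epoch)     # 7967
print(total_updates)         # 79670
```

With a checkpoint saved every 500 steps, a full run under these assumptions would write on the order of 150 checkpoints, which is why `save_total_limit` is worth considering when disk space is tight.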

Now we run the actual training:

In [None]:
# initialize the trainer and pass everything to it
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# train the model
trainer.train()

Once the model is trained, we can use it. We first reload it along with the tokenizer.

In [None]:
# load the model checkpoint (checkpoint-10000 is saved after 10,000 training steps; use one that exists in model_path)
model = BertForMaskedLM.from_pretrained(os.path.join(model_path, "checkpoint-10000"))
# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_path)

If you're on Google Colab, you have to save your checkpoints to Google Drive for later use. You can do that by setting model_path to a Drive path instead of a local path like we did here; just make sure you have enough space there.

Alternatively, you can push your model and tokenizer to the Hugging Face Hub: https://huggingface.co/docs/transformers/model_sharing

In [None]:
# Using our model
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Perform predictions
examples = [
  "Today's most trending hashtags on [MASK] is Donald Trump",
  "The [MASK] was cloudy yesterday, but today it's rainy.",
]
for example in examples:
  for prediction in fill_mask(example):
    print(f"{prediction['sequence']}, confidence: {prediction['score']}")
  print("="*50)
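
Conceptually, the fill-mask pipeline takes the model's scores (logits) at the `[MASK]` position, turns them into probabilities with a softmax, and reports the highest-scoring tokens. The sketch below reproduces that last step on made-up numbers. The function `top_k_predictions`, the four-word vocabulary, and the logits are assumptions for illustration, not output from the trained model.

```python
import math

def top_k_predictions(logits, vocab, k=2):
    """Hypothetical helper: softmax over logits, then return the k most likely tokens."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probabilities = [e / total for e in exps]
    ranked = sorted(zip(vocab, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_vocab = ["weather", "sky", "car", "news"]  # made-up candidates for the masked position
toy_logits = [3.0, 2.0, -1.0, 0.5]             # made-up scores
for token, score in top_k_predictions(toy_logits, toy_vocab, k=2):
    print(f"{token}, confidence: {score:.3f}")
```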