# Pre-training SmallBERTa - A tiny model to train on a tiny dataset
(Using HuggingFace Transformers)<br>
Admittedly, while language modeling is associated with terabytes of data, not all of us have the processing power or the resources to train huge models on such huge amounts of data.
In this example, we are going to train a relatively small neural net on a small dataset (which still happens to have over 3M rows).
<br>

The ***main purpose*** of this blog is not to achieve state-of-the-art performance on LM tasks but to show, in a simple way, how the recent run_language_modeling.py script can be used to train a Transformer model from scratch.

This notebook can be extended to niche use cases where general-purpose pre-trained models fail to perform well. Examples include medical datasets, scientific literature, legal documentation, etc.

Input:
  1. To the Tokenizer:<br>
      LM data in a directory containing all samples in separate *.txt files.
  
  2. To the Model:<br>
      LM data split into:<br>
        1. train.txt <br>
        2. eval.txt 
        
Output:<br>
  Trained model weights (that can be used elsewhere) and TensorBoard logs

## Install Dependencies

In [None]:
# tokenizers working version: 0.5.0
# transformers working version: 2.5.0
!pip install transformers
!pip install tokenizers
!pip install tensorboard==2.1.0

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/04/58/3d789b98923da6485f376be1e04d59ad7003a63bdb2b04b5eea7e02857e5/transformers-2.5.0-py3-none-any.whl (481kB)

## Fetch Data
We will be using a tiny dataset (The Examiner - SpamClickBait News) of around 3M rows from Kaggle to train our model. The dataset also contains output labels, which are dropped; only the text is used. For convenience, we use the Kaggle API to download the data directly from Kaggle, which saves time and effort.

In [None]:
import os
import getpass

#For a kaggle username & key, just go to your kaggle account and generate key
#The JSON file so downloaded contains both of them
if("examine-the-examiner.zip" not in os.listdir()):
  print("Copy these two values from the JSON file so generated")
  os.environ['KAGGLE_USERNAME'] = getpass.getpass(prompt='Kaggle username: ') 
  os.environ['KAGGLE_KEY'] =  getpass.getpass(prompt='Kaggle key: ')
  !kaggle datasets download -d therohk/examine-the-examiner
  !unzip /content/examine-the-examiner.zip

Copy these two values from the JSON file so generated
Kaggle username: ··········
Kaggle key: ··········
Downloading examine-the-examiner.zip to /content
 86% 123M/142M [00:00<00:00, 132MB/s]
100% 142M/142M [00:00<00:00, 163MB/s]
Archive:  /content/examine-the-examiner.zip
  inflating: examiner-date-text.csv  
  inflating: examiner-date-tokens.csv  
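
Instead of typing the two values at the prompt, they can also be read straight from the downloaded JSON file. This is a minimal sketch, assuming the file holds `username` and `key` fields; the helper name `load_kaggle_credentials` and the example path are hypothetical and not part of the original notebook.

```python
import json
import os

def load_kaggle_credentials(json_path):
    """Read a Kaggle API token file and export it as environment variables."""
    with open(json_path) as fp:
        creds = json.load(fp)
    # These are the same two variables the cell above sets by hand.
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]

# Example call (the path is an assumption; adjust it to where you saved the file):
# load_kaggle_credentials("/content/kaggle.json")
```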


## Load and Preprocess data

In [None]:
import regex as re
def basicPreprocess(text):
  try:
    processed_text = text.lower()
    processed_text = re.sub(r'\W +', ' ', processed_text)
  except Exception as e:
    print("Exception:",e,",on text:", text)
    return None
  return processed_text
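
To see what this pattern actually removes, here is a small standalone sketch. It uses the standard-library `re` module instead of `regex` (an assumption made to keep the example dependency-free), and the function name `basic_preprocess` is just for illustration. The pattern `\W +` only matches a non-word character that is followed by at least one space, so the apostrophe in "McDonald's" and hyphens inside words survive. Since a space is itself a non-word character, runs of spaces are collapsed as well.

```python
import re  # the standard-library module is enough for this pattern

def basic_preprocess(text):
    """Lowercase, then replace a non-word character followed by spaces with one space."""
    return re.sub(r'\W +', ' ', text.lower())

for headline in ["McDonald's ad in Spanish, provoking sparks",
                 "Washington Post: Obama has been lying"]:
    print(basic_preprocess(headline))
```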

In [None]:
import pandas as pd
from tqdm import tqdm

## Read and Prune the data
For our purpose we are going to read a subset (~200,000 samples) to train, just to see results quickly. Feel free to increase (or remove) this limitation.  

In [None]:
data = pd.read_csv("/content/examiner-date-text.csv")
print(data)

         publish_date                                      headline_text
0            20100101       100 Most Anticipated books releasing in 2010
1            20100101       10 best films of 2009 - What's on your list?
2            20100101  10 days of free admission at Lan Su Chinese Ga...
3            20100101      10 PlayStation games to watch out for in 2010
4            20100101  10 resolutions for a Happy New Year for you an...
...               ...                                                ...
3089776      20151231  Which is better investment, Lego bricks or gol...
3089777      20151231  Wild score three unanswered goals to defeat th...
3089778      20151231  With NASA and Russia on the sidelines, Europe ...
3089779      20151231  Wolf Pack battling opponents, officials on the...
3089780      20151231          Writespace hosts all genre open mic night

[3089781 rows x 2 columns]


In [None]:
data = data.sample(frac=1)  # shuffle the rows once; a second shuffle adds nothing
data = data[:200000]
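
Shuffling and then slicing can also be written as a single draw of a fixed size. The sketch below runs on toy data; the seed value 42 is an arbitrary choice for illustration and was not used for the outputs shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"headline_text": [f"headline {i}" for i in range(10)]})

# n picks the subset size directly; random_state makes the draw repeatable.
subset_a = toy.sample(n=4, random_state=42)
subset_b = toy.sample(n=4, random_state=42)

# Same seed, same rows in the same order.
print(subset_a.index.equals(subset_b.index))
```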

### Before Preprocessing 

In [None]:
print(data)

         publish_date                                      headline_text
618246       20100816  Triangle UFO low and silent over rural Deansbo...
1794117      20120420  Kevin Hart and 'Think Like a Man' co-stars lea...
3053438      20150920  Uma Thurman custody battle finally settled wit...
180273       20100313           Legislator confident of Health Care bill
938083       20101228         McDonald's ad in Spanish, provoking sparks
...               ...                                                ...
1737672      20120319  Washington Post: Obama has been lying to Ameri...
1780904      20120413  California retiree collects $227k Mega Million...
1614310      20120105    This Weekend at Miami Science Museum Laser Show
1565925      20111205           December 12th is National Poinsettia Day
1358212      20110731  Spartans' Cousins gives stirring, thought-prov...

[200000 rows x 2 columns]


In [None]:
data["headline_text"] = data["headline_text"].apply(basicPreprocess)
data = data.dropna(subset=["headline_text"])  # drop rows where preprocessing failed (empty/NaN values)

### After Preprocessing

In [None]:
print(data)

         publish_date                                      headline_text
618246       20100816  triangle ufo low and silent over rural deansbo...
1794117      20120420  kevin hart and 'think like a man co-stars lear...
3053438      20150920  uma thurman custody battle finally settled wit...
180273       20100313           legislator confident of health care bill
938083       20101228          mcdonald's ad in spanish provoking sparks
...               ...                                                ...
1737672      20120319  washington post obama has been lying to americ...
1780904      20120413  california retiree collects $227k mega million...
1614310      20120105    this weekend at miami science museum laser show
1565925      20111205           december 12th is national poinsettia day
1358212      20110731  spartans cousins gives stirring thought-provok...

[200000 rows x 2 columns]


Remove newline characters in case the input text contains any. The LineByLineTextDataset class used later by the training script assumes that samples are separated by newlines.

In [None]:
data = data["headline_text"]
data = data.str.replace("\n", " ")  # .str replaces inside each string; Series.replace only matches whole values
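
For newline removal to have any effect, the replacement must happen inside each string. The toy sketch below contrasts the two pandas calls: `Series.replace` compares whole cell values, while the `.str` accessor replaces substrings. The variable names are made up for the example.

```python
import pandas as pd

toy = pd.Series(["first line\nsecond line", "no newline here"])

# Series.replace matches whole cell values, so nothing changes here.
whole_value = toy.replace("\n", " ")
# The .str accessor replaces the substring inside every string.
per_string = toy.str.replace("\n", " ", regex=False)

print(per_string.tolist())
```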

## Train a custom tokenizer
I have used a ByteLevelBPETokenizer so that \<unk> tokens never occur: any input can be encoded from raw bytes.
Its train function reads the corpus from text files; in this notebook each sample is written to its own file.

In [None]:
txt_files_dir = "/tmp/text_split"
!mkdir {txt_files_dir}

Split LM data into individual files. These files are stored in /tmp/text_split and are used to train the tokenizer **only**.

In [None]:
for i, row in enumerate(tqdm(data.to_list())):
  file_name = os.path.join(txt_files_dir, str(i)+'.txt')
  try:
    # the context manager closes the file even if the write fails
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)

100%|██████████| 200000/200000 [00:09<00:00, 20693.63it/s]
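
The same one-file-per-sample layout can be sketched with `pathlib` on a handful of made-up samples. The temporary directory stands in for /tmp/text_split; everything else here is illustrative.

```python
import tempfile
from pathlib import Path

samples = ["first headline", "second headline", "third headline"]

out_dir = Path(tempfile.mkdtemp())  # stand-in for /tmp/text_split
for i, text in enumerate(samples):
    # One sample per file, named by its position in the list.
    (out_dir / f"{i}.txt").write_text(text, encoding="utf-8")

# Collect the paths in the same way the tokenizer cell does.
paths = sorted(str(p) for p in out_dir.glob("*.txt"))
print(len(paths))
```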


In [None]:
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


paths = [str(x) for x in Path(txt_files_dir).glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

vocab_size=5000
# Customize training
tokenizer.train(files=paths, vocab_size=vocab_size, min_frequency=5, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
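
To check that a byte-level tokenizer really avoids \<unk>, a tiny throwaway tokenizer can be trained and applied to a sentence with words it has never seen. This is a sketch under assumptions: the two-line corpus, the vocabulary size of 300 and the name `toy_tokenizer` are invented for the example, and it is written against the `train` and `encode` calls used in this notebook.

```python
import tempfile
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus; the real notebook trains on the files in txt_files_dir.
corpus = Path(tempfile.mkdtemp()) / "corpus.txt"
corpus.write_text("triangle ufo low and silent\nlegislator confident of health care bill\n")

toy_tokenizer = ByteLevelBPETokenizer()
toy_tokenizer.train(files=[str(corpus)], vocab_size=300, min_frequency=1,
                    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# "over" never appears in the corpus, yet it is still encoded from its bytes.
encoding = toy_tokenizer.encode("ufo over health care")
print(encoding.tokens)
```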

In [None]:
lm_data_dir = "/tmp/lm_data"
!mkdir {lm_data_dir}

## Split into Validation and Train set
We split the data into a train set and a validation set. The two resulting files are used to train and evaluate our model.

In [None]:
train_split = 0.9
train_data_size = int(len(data)*train_split)

with open(os.path.join(lm_data_dir,'train.txt') , 'w') as f:
    for item in data[:train_data_size].tolist():
        f.write("%s\n" % item)

with open(os.path.join(lm_data_dir,'eval.txt') , 'w') as f:
    for item in data[train_data_size:].tolist():
        f.write("%s\n" % item)
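
The split logic can be wrapped in a small helper and tried on toy data. The function name `write_split` is hypothetical. With 200,000 samples and a 0.9 split, this yields 180,000 training and 20,000 evaluation lines, which is consistent with the 5625 and 625 batches of 32 that appear in the training log further down.

```python
import os
import tempfile

def write_split(samples, out_dir, train_fraction=0.9):
    """Write the first train_fraction of samples to train.txt and the rest to eval.txt."""
    cut = int(len(samples) * train_fraction)
    for name, part in (("train.txt", samples[:cut]), ("eval.txt", samples[cut:])):
        with open(os.path.join(out_dir, name), "w") as f:
            # One sample per line, as the line-by-line dataset expects.
            f.writelines(f"{item}\n" for item in part)
    return cut, len(samples) - cut

sizes = write_split([f"headline {i}" for i in range(20)], tempfile.mkdtemp())
print(sizes)
```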

In [None]:
!mkdir /content/models
!mkdir /content/models/smallBERTa

In [None]:
tokenizer.save("/content/models/smallBERTa", "smallBERTa")

['/content/models/smallBERTa/smallBERTa-vocab.json',
 '/content/models/smallBERTa/smallBERTa-merges.txt']

In [None]:
!mv /content/models/smallBERTa/smallBERTa-vocab.json /content/models/smallBERTa/vocab.json
!mv /content/models/smallBERTa/smallBERTa-merges.txt /content/models/smallBERTa/merges.txt

In [None]:
train_path = os.path.join(lm_data_dir,"train.txt")
eval_path = os.path.join(lm_data_dir,"eval.txt")

## Set Model Configuration
For demo purposes, we are training a very small model.

In [None]:
import json
config = {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "num_attention_heads": 1,
  "num_hidden_layers": 1,
  "vocab_size": vocab_size,
  "intermediate_size": 256,
  "max_position_embeddings": 256
}
with open("/content/models/smallBERTa/config.json", 'w') as fp:
    json.dump(config, fp)
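
To get a feel for how small this model is, the weights can be estimated by hand from the configuration. The sketch below is a back-of-the-envelope count that assumes the standard Transformer encoder layout; it ignores layer norms, token-type embeddings and the LM head, so the number the library reports will differ somewhat.

```python
config = {"hidden_size": 128, "num_attention_heads": 1, "num_hidden_layers": 1,
          "vocab_size": 5000, "intermediate_size": 256, "max_position_embeddings": 256}

h, ff = config["hidden_size"], config["intermediate_size"]

# The hidden size must split evenly across the attention heads.
assert h % config["num_attention_heads"] == 0

# Token and position embedding tables.
embedding_params = (config["vocab_size"] + config["max_position_embeddings"]) * h
# Query, key, value and output projections, each with a bias.
attention_params = 4 * (h * h + h)
# Two dense layers of the feed-forward block, each with a bias.
feed_forward_params = (h * ff + ff) + (ff * h + h)

per_layer = attention_params + feed_forward_params
total = embedding_params + config["num_hidden_layers"] * per_layer
print(total)
```

Most of the estimated weights sit in the embedding tables, which is typical for a model this shallow.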

In [None]:
#%cd /content
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 19858 (delta 5), reused 6 (delta 0), pack-reused 19834
Receiving objects: 100% (19858/19858), 11.95 MiB | 4.05 MiB/s, done.
Resolving deltas: 100% (14423/14423), done.


## Run training using the run_language_modeling.py examples script

In [None]:
!nvidia-smi #just to confirm that you are on a GPU, if not go to Runtime->Change Runtime

Fri Feb 21 12:17:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [None]:
#Setting environment variables
os.environ["train_path"] = train_path
os.environ["eval_path"] = eval_path
os.environ["CUDA_LAUNCH_BLOCKING"]='1'  #Makes for easier debugging (just in case)
weights_dir = "/content/models/smallBERTa/weights"
!mkdir {weights_dir}

In [None]:
cmd = '''python /content/transformers/examples/run_language_modeling.py --output_dir {0}  \
    --model_type roberta \
    --mlm \
    --train_data_file {1} \
    --eval_data_file {2} \
    --config_name /content/models/smallBERTa \
    --tokenizer_name /content/models/smallBERTa \
    --do_train \
    --line_by_line \
    --overwrite_output_dir \
    --do_eval \
    --block_size 256 \
    --learning_rate 1e-4 \
    --num_train_epochs 5 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --logging_steps 500 \
    --per_gpu_eval_batch_size 32 \
    --per_gpu_train_batch_size 32 \
    --evaluate_during_training \
    --seed 42 \
    '''.format(weights_dir, train_path, eval_path)
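
An alternative to formatting one long string is to build the command from a list of arguments and quote each one, which keeps paths with spaces intact. This is only a sketch: the helper name `build_command` is hypothetical, and just a subset of the flags above is shown.

```python
import shlex

def build_command(weights_dir, train_path, eval_path):
    """Assemble the training command from a list of arguments and quote it for the shell."""
    args = [
        "python", "/content/transformers/examples/run_language_modeling.py",
        "--output_dir", weights_dir,
        "--model_type", "roberta",
        "--mlm",
        "--train_data_file", train_path,
        "--eval_data_file", eval_path,
        "--do_train", "--do_eval", "--line_by_line",
    ]
    # shlex.quote escapes arguments that contain spaces or shell metacharacters.
    return " ".join(shlex.quote(a) for a in args)

print(build_command("/content/models/smallBERTa/weights",
                    "/tmp/lm_data/train.txt", "/tmp/lm_data/eval.txt"))
```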

In [None]:
!{cmd}

Streaming output truncated to the last 5000 lines.
Evaluating: 100% 625/625 [00:05<00:00, 126.93it/s]
02/21/2020 12:30:10 - INFO - __main__ -   ***** Eval results  *****
02/21/2020 12:30:10 - INFO - __main__ -     perplexity = tensor(873.4072)

Iteration:  11% 628/5625 [00:31<44:27,  1.87it/s]
Iteration:  12% 668/5625 [00:32<03:24, 24.25it/s]
Iteration:  12% 672/5625 [00:
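
To put the perplexity from the eval log in context: perplexity is usually defined as the exponential of the mean cross-entropy loss, and the sketch below assumes that is what the script reports. Under that assumption, the loss can be recovered with a logarithm. For comparison, a model guessing uniformly over the 5,000-token vocabulary would have a perplexity of 5,000.

```python
import math

reported_perplexity = 873.4072               # value from the eval log
mean_loss = math.log(reported_perplexity)    # perplexity = exp(mean loss)
uniform_loss = math.log(5000)                # loss of a uniform guess over the vocabulary

print(round(mean_loss, 3), round(uniform_loss, 3))
```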

## View Results on Tensorboard

In [None]:
!tensorboard dev upload --logdir /content/runs


***** TensorBoard Uploader *****

This will upload your TensorBoard logs to https://tensorboard.dev/ from
the following directory:

/content/runs

This TensorBoard will be visible to everyone. Do not upload sensitive
data.

Your use of this service is subject to Google's Terms of Service
<https://policies.google.com/terms> and Privacy Policy
<https://policies.google.com/privacy>, and TensorBoard.dev's Terms of Service
<https://tensorboard.dev/policy/terms/>.

This notice will not be shown again while you are logged into the uploader.
To log out, run `tensorboard dev auth revoke`.

Continue? (yes/NO) yes
