In [1]:
from huggingface_hub import notebook_login
from datasets import load_dataset, ClassLabel
from IPython.display import display, HTML
import random
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

The two main language modeling objectives:

- Causal Language Modeling: the model predicts the next token in a sentence, so the target at each position is **the input token that follows it** (the inputs shifted by one position).

- Masked Language Modeling: the model predicts some masked tokens in a sentence and has access to the entire sentence, on both sides of each mask.
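
A toy illustration of the two objectives, using a made-up five-word sentence in place of real token ids. The sentence, the `[MASK]` string, and the hand-picked masked positions are assumptions for this sketch only:

```python
# Toy sentence standing in for a tokenised sequence (illustrative only).
tokens = ["the", "cat", "sat", "on", "mat"]

# Causal LM: at each position the target is the next token,
# and the context is everything up to and including that position.
causal_pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
for context, target in causal_pairs:
    print(context, "->", target)

# Masked LM: selected positions are hidden; the model sees the whole
# corrupted sentence and predicts only the hidden tokens.
masked_positions = [1, 3]  # picked by hand for the example
masked_input = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
masked_targets = {i: tokens[i] for i in masked_positions}
print(masked_input, "->", masked_targets)
```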

In [4]:
# A possible next step is to swap this dataset for another one
# wikiset = load_dataset('wikitext', 'wikitext-2-raw-v1')
wikiset = load_dataset(path="/home/kamal/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/")
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})


Found cached dataset arrow (/home/kamal/.cache/huggingface/datasets/arrow/wikitext-2-raw-v1-3f7e6bb1b926f47c/0.0.0/74f69db2c14c2860059d39860b1f400a03d11bf7fb5a8258ca38c501c878c137)


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
wikiset['train'][56]

{'text': ' = = Construction = = \n'}

In [12]:
len(wikiset['train'])

36718

In [6]:
# Writing a function to show random elements

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset)  # dataset must have sufficient data
    # draw num_examples distinct row indices
    picks = random.sample(range(len(dataset)), num_examples)
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(wikiset['train'])

Unnamed: 0,text
0,= = = Sculpture = = = \n
1,"Midway through the poem , there is a split between the two actions of the poem : the first attempts to identify with the nightingale and its song , and the second discusses the convergence of the past with the future while experiencing the present . This second theme is reminiscent of Keats 's view of human progression through the Mansion of Many Apartments and how man develops from experiencing and wanting only pleasure to understanding truth as a mixture of both pleasure and pain . The Elysian fields and the nightingale 's song in the first half of the poem represent the pleasurable moments that overwhelm the individual like a drug . However , the experience does not last forever , and the body is left desiring it until the narrator feels helpless without the pleasure . Instead of embracing the coming truth , the narrator clings to poetry to hide from the loss of pleasure . Poetry does not bring about the pleasure that the narrator original asks for , but it does liberate him from his desire for only pleasure . \n"
2,"Most restorations of ceratopsians show them with erect hindlimbs but semi @-@ sprawling forelimbs , which suggest that they were not fast movers . But Paul and Christiansen ( 2000 ) argued that at least the later ceratopsians had upright forelimbs and the larger species may have been as fast as rhinos , which can run at up to 56 km or 35 miles per hour . \n"
3,
4,"Many of the objects around her are also closely detailed , in particular the wooden floor and nails , the folds of the Magdalene 's dress , the costume of the figures in the exterior and the beads of Joseph 's rosary . The effect of falling light is closely studied ; Joseph 's crystal rosary beads have bright highlights , while subtle delineations of light and shade can be seen in the sideboard 's tracery and in the clasps of her book . Mary is absorbed in her reading and seemingly unaware of her surroundings . Van der Weyden has given her a quiet dignity although he is generally seen as the more emotional of the master Netherlandish painters of the era , in particular when contrasted with Jan van Eyck . \n"


We are going to take **all the texts** in our dataset and **concatenate them** after they are tokenized. Then we will split them into examples of a fixed sequence length. This way the model receives chunks of contiguous text.

The labels are a copy of the inputs: each position is trained to predict the token that follows it, and the shift by one position is applied inside the model.

In [8]:
model_cp = "gpt2"
tokenizer_cp = "sgugger/gpt2-like-tokenizer"

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from transformers import Trainer, TrainingArguments

tokeniser = AutoTokenizer.from_pretrained(tokenizer_cp)

Downloading tokenizer_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/396k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/678k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/90.0 [00:00<?, ?B/s]

In [9]:
def tokenise_text(examples):
    return tokeniser(examples['text'])

In [10]:
tokenised_wiki = wikiset.map(tokenise_text,
                             batched=True,
                             num_proc=4,
                             remove_columns=['text'])  # will return just ids / masks

Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

In [11]:
tokenised_wiki

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3760
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4358
    })
})

In [12]:
show_random_elements(tokenised_wiki['train'])

Unnamed: 0,input_ids,attention_mask
0,"[4257, 2744, 258, 455, 7035, 307, 198, 4015, 218, 2078, 2098, 311, 24958, 47, 310, 263, 1282, 245, 201, 7353, 201, 225, 258, 496, 218, 951, 9516, 2143, 281, 540, 218, 196, 3115, 459, 4050, 1811, 209, 261, 2405, 2929, 5288, 8860, 16677, 296, 19000, 288, 11498, 209, 795, 3161, 258, 4489, 14881, 307, 198, 726, 218, 6785, 209, 595, 1089, 920, 201, 6785, 201, 198, 24958, 47, 3827, 198, 2405, 1657, 227, 4257, 1750, 319, 4887, 234, 363, 9899, 209, 252]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
1,"[5237, 8499, 1499, 1706, 198, 6685, 196, 2923, 218, 1112, 573, 218, 1973, 263, 342, 4384, 661, 342, 1389, 1633, 3010, 209, 261, 24374, 223, 3135, 218, 342, 5688, 258, 196, 14488, 218, 4940, 1468, 201, 331, 1899, 201, 2267, 296, 196, 1304, 211, 218, 23439, 221, 230, 306, 914, 8005, 225, 20728, 6172, 18534, 296, 196, 1994, 8005, 225, 947, 7178, 370, 23651, 24756, 298, 201, 445, 331, 7936, 1927, 295, 198, 2970, 258, 239, 547, 277, 2587, 239, 225, 198, 2362, 239, 21918, 264, 8139, 424, 239, 209, 252]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
2,"[492, 11514, 2073, 1179, 311, 1617, 221, 8785, 296, 15672, 7997, 310, 440, 11954, 201, 5804, 252]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
3,"[238, 238, 4040, 3184, 238, 238, 252]","[1, 1, 1, 1, 1, 1, 1]"
4,"[238, 238, 12407, 238, 238, 252]","[1, 1, 1, 1, 1, 1]"


In [36]:
data_test = [[1, 2, 3,], [2, 5, 7], [5, 6, 8]]
# b = 0
# print(data_test[b] + data_test[b+1])

# data_test[0] is the flat list [1, 2, 3], so sum() tries [] + 1 and raises.
# To flatten the nested list, pass the outer list instead: sum(data_test, [])
sum(data_test[0], [])

TypeError: can only concatenate list (not "int") to list
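
The error above arises because `sum(iterable, start)` adds each element of the iterable to `start` in turn. A small sketch of the working form, with `itertools.chain` shown as a common alternative (it is not used elsewhere in this notebook):

```python
from itertools import chain

data_test = [[1, 2, 3], [2, 5, 7], [5, 6, 8]]

# [] + [1, 2, 3] + [2, 5, 7] + [5, 6, 8]
flat_sum = sum(data_test, [])

# Same result without building intermediate lists.
flat_chain = list(chain.from_iterable(data_test))

print(flat_sum)  # [1, 2, 3, 2, 5, 7, 5, 6, 8]
```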

In [48]:
test_data = tokenised_wiki['train'][:3]
test_data

{'input_ids': [[], [238, 8576, 9441, 2987, 238, 252], []],
 'attention_mask': [[], [1, 1, 1, 1, 1, 1], []]}

In [49]:
# test_data.keys() # dict_keys(['input_ids', 'attention_mask'])

concat_examples = {b: sum(test_data[b], []) for b in test_data.keys()}
concat_examples

{'input_ids': [238, 8576, 9441, 2987, 238, 252],
 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [50]:
concat_examples[list(test_data.keys())[0]]  # [238, 8576, 9441, 2987, 238, 252]


[238, 8576, 9441, 2987, 238, 252]

In [51]:
test_data.items()

dict_items([('input_ids', [[], [238, 8576, 9441, 2987, 238, 252], []]), ('attention_mask', [[], [1, 1, 1, 1, 1, 1], []])])

In [13]:
print(tokeniser.model_max_length)  # 1024: the maximum sequence length the tokenizer reports for the model
# block_size = tokeniser.model_max_length
block_size = 128

1024


In [15]:
def group_texts(examples):
    # Flatten each column (input_ids, attention_mask, ...) into one long list
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Total number of tokens in the batch, measured on the first column
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block; // is integer division,
    # e.g. (3268 // 128) * 128 == 3200
    tot_length = (total_length // block_size) * block_size
    # Split every column into chunks of block_size
    # (k is the column name, t is its flattened token list)
    result = {
        k: [t[i : i + block_size] for i in range(0, tot_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # Labels are a copy of the inputs; the model shifts them by one position
    # internally when it computes the loss
    result["labels"] = result["input_ids"].copy()
    return result
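
To see what `group_texts` does to a batch, here is a self-contained variant run on made-up token ids. As assumptions for the sketch, it takes `block_size` as an argument (the notebook's version reads the global) and uses a block size of 4 so the result is easy to read; the function name and the ids are arbitrary.

```python
def group_texts_demo(examples, block_size):
    # Flatten every column into one long list.
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Keep only as many tokens as fill whole blocks.
    usable_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, usable_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Four "texts" of different lengths, 10 tokens in total (illustrative ids).
batch = {
    "input_ids": [[], [11, 12, 13], [14, 15, 16, 17, 18], [19, 20]],
    "attention_mask": [[], [1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_demo(batch, block_size=4)
print(grouped["input_ids"])  # [[11, 12, 13, 14], [15, 16, 17, 18]]
```

Ten tokens fill two blocks of four; the last two tokens (19 and 20) are dropped, and block boundaries ignore the boundaries of the original texts.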

In [16]:
lm_datasets = tokenised_wiki.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=6
)

Map (num_proc=6):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=6):   0%|          | 0/3760 [00:00<?, ? examples/s]

Map (num_proc=6):   0%|          | 0/4358 [00:00<?, ? examples/s]

In [17]:
tokeniser.decode(lm_datasets['train'][1]['input_ids'])

' the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Ozawa. A large'

In [18]:
# bring in the config and the model


config = AutoConfig.from_pretrained(model_cp)
model = AutoModelForCausalLM.from_config(config)  # builds the architecture with randomly initialised weights (no pretrained weights are loaded)

In [19]:
model.get_memory_footprint()

510342192
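
The figure above can be reproduced by hand from the `gpt2` configuration (vocabulary 50257, 1024 positions, hidden size 768, 12 layers) with float32 weights. The buffer term is an assumption about the installed `transformers` version: each attention layer is taken to register a 1024 x 1024 boolean causal mask plus one float32 scalar as buffers. Under that assumption the total matches the reported value.

```python
vocab, n_pos, d, n_layer = 50257, 1024, 768, 12

embeddings = vocab * d + n_pos * d              # token + position embeddings
layer_norm = 2 * d                              # weight + bias
attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # expand to 4*d, then project back
per_layer = 2 * layer_norm + attention + mlp

# The LM head shares its weights with the token embedding, so it adds nothing.
n_params = embeddings + n_layer * per_layer + layer_norm  # last term: final layer norm

param_bytes = n_params * 4                      # float32 = 4 bytes
buffer_bytes = n_layer * (n_pos * n_pos + 4)    # assumed: bool mask + float32 scalar per layer
print(n_params, param_bytes + buffer_bytes)     # 124439808 510342192
```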

In [20]:
model_test = AutoModelForCausalLM.from_pretrained(model_cp)  # from_pretrained, by contrast, loads the pretrained weights
del model_test

In [11]:
notebook_login()


In [21]:
# training args

t_args = TrainingArguments(
    f"/home/kamal/training_files/{model_cp}-wiki",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    hub_model_id=f"kamaljp/{model_cp}-wiki",  # this repo must be available
    report_to="none"
)

In [22]:
# initiate trainer

trainer = Trainer(
    model=model,
    args=t_args,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['validation']
) 

In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,6.5547,6.475204
2,6.1568,6.2017
3,6.0132,6.114817


TrainOutput(global_step=6750, training_loss=6.392513888888889, metrics={'train_runtime': 708.6425, 'train_samples_per_second': 76.172, 'train_steps_per_second': 9.525, 'total_flos': 3526070648832000.0, 'train_loss': 6.392513888888889, 'epoch': 3.0})

In [24]:
import math
eval_results = trainer.evaluate()

In [25]:
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 452.51
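
Perplexity is the exponential of the mean per-token cross-entropy (in nats), which is why the cell above applies `math.exp` to `eval_loss`. A quick check against the epoch-3 validation loss from the training table, plus a toy case: a model that spreads probability uniformly over a hypothetical 4-token vocabulary has perplexity 4.

```python
import math

# Validation loss reported for epoch 3 in the training table.
ppl = math.exp(6.114817)
print(f"{ppl:.2f}")  # 452.51

# Toy case: uniform predictions over a 4-token vocabulary (illustrative).
vocab_size = 4
nll_per_token = [-math.log(1 / vocab_size)] * 10  # ten tokens, same loss each
mean_nll = sum(nll_per_token) / len(nll_per_token)
uniform_ppl = math.exp(mean_nll)
print(uniform_ppl)
```

Read this way, a perplexity of about 450 means the model is roughly as uncertain as if it were choosing uniformly among 450 tokens at each step.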


In [21]:
trainer.push_to_hub()

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Kamaljp/gpt2-wiki/commit/991f8b5abc3f5bca0970a6d7f349804b7fcf9fcc', commit_message='End of training', commit_description='', oid='991f8b5abc3f5bca0970a6d7f349804b7fcf9fcc', pr_url=None, pr_revision=None, pr_num=None)

### Masked Language Modeling

For Masked Language Modeling (MLM) we use the same preprocessing as before, with one additional step: some tokens are randomly masked (replaced by [MASK]) and the labels are adjusted to include only the masked tokens (the non-masked tokens do not have to be predicted).

**The random masking is done with DataCollator at the end.**

In [27]:
mask_model_cp = "bert-base-cased"
tokeniser_bert_cp = "sgugger/bert-like-tokenizer"

tokeniser_bert = AutoTokenizer.from_pretrained(tokeniser_bert_cp)

def tokenise_text(examples):
    return tokeniser_bert(examples['text'])

In [28]:
tokenise_text(wikiset['train'][1])

{'input_ids': [2, 33, 9773, 10627, 4171, 33, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [29]:
bert_tokenised_ds = wikiset.map(tokenise_text,  # this has to be a function
                               batched=True,
                               num_proc=4,
                               remove_columns=["text"])

       

Token indices sequence length is longer than the specified maximum sequence length for this model (571 > 512). Running this sequence through the model will result in indexing errors


       


Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (522 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (657 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (514 > 512). Running this sequence through the model will result in indexing errors


        


In [30]:
lm_bert_ds = bert_tokenised_ds.map(group_texts,
                                   batched=True,
                                   batch_size=1000,
                                   num_proc=4)

        


In [32]:
from transformers import AutoModelForMaskedLM

bert_config = AutoConfig.from_pretrained(mask_model_cp)
model = AutoModelForMaskedLM.from_config(bert_config)

In [39]:
# training args
tbert_args = TrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    hub_model_id=f"Kamaljp/{mask_model_cp}-wiki",
    report_to="none",
    num_train_epochs=1)

The data_collator is a function that is responsible for taking the samples and batching them into tensors.

Here we want to do the random-masking. We could do it as a pre-processing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. 

By doing this step inside the data_collator, we ensure this **random masking is done** in a new way each time we go over the data.
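
A simplified, pure-Python sketch of what this dynamic masking does for one sequence. It follows the usual BERT recipe that `DataCollatorForLanguageModeling` applies by default: select about 15% of positions; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged; set the label to -100 everywhere else so the loss ignores those positions. The function name, the token ids, `mask_id=4`, and `vocab_size=100` are made up for illustration; the real collator works on batched tensors and also skips special tokens.

```python
import random


def mask_tokens_demo(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=random):
    inputs = list(input_ids)
    labels = [-100] * len(inputs)           # -100 = position ignored by the loss
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue                         # position not selected
        labels[i] = token                    # model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id              # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


ids = list(range(10, 50))  # stand-in for one tokenised chunk
for epoch in range(2):
    masked, labels = mask_tokens_demo(ids, mask_id=4, vocab_size=100)
    n_selected = sum(label != -100 for label in labels)
    print(f"epoch {epoch}: {n_selected} positions selected")
```

Because the function is called again on every pass, each epoch draws a fresh set of masked positions for the same underlying chunk.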

In [40]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokeniser_bert,
                                               mlm_probability=0.15)

In [41]:
# trainer takes the additional DataCollator

trainer_bert = Trainer(
    model=model,
    args=tbert_args,
    train_dataset=lm_bert_ds['train'],
    eval_dataset=lm_bert_ds['validation'],
    data_collator=data_collator
)

In [42]:
trainer_bert.train()

Epoch,Training Loss,Validation Loss
1,7.2135,7.200663


TrainOutput(global_step=1173, training_loss=7.4592584868526215, metrics={'train_runtime': 421.4989, 'train_samples_per_second': 44.51, 'train_steps_per_second': 2.783, 'total_flos': 1234474385943552.0, 'train_loss': 7.4592584868526215, 'epoch': 1.0})

In [43]:
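# Use trainer_bert here: `trainer` still refers to the GPT-2 run above.
# exp(7.2007) is about 1340, so the 549.38 recorded below does not match
# the MLM validation loss in the table above.
eval_results = trainer_bert.evaluate()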
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 549.38


In [44]:
trainer_bert.push_to_hub()

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Kamaljp/bert-base-cased-wiki/commit/3087e13379cccef60e6ed884aee222a258c0e4c4', commit_message='End of training', commit_description='', oid='3087e13379cccef60e6ed884aee222a258c0e4c4', pr_url=None, pr_revision=None, pr_num=None)