In [1]:
!pip install transformers[torch]
!pip install datasets
!pip install SentencePiece
!pip install accelerate -U


In [2]:
# Import necessary libraries
from transformers import pipeline, TextDataset, GPT2Tokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling, GPT2LMHeadModel

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


In [30]:
dl_data = TextDataset(
    tokenizer = tokenizer,
    file_path='/content/Deep Learning.txt', # Text of "Deep Learning" by Goodfellow, Bengio, and Courville
    block_size = 64 # Length, in tokens, of each chunk of text used as a datapoint
)
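
The chunking step above can be pictured with a small pure-Python sketch. It assumes the text has already been tokenized into one long list of ids and that any trailing partial block is dropped; `chunk_token_ids` is a hypothetical helper written for illustration, not the library's code.

```python
def chunk_token_ids(token_ids, block_size):
    """Split one long list of token ids into consecutive fixed-size blocks.

    Assumption for this sketch: a trailing block shorter than
    block_size is discarded rather than padded.
    """
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]


# Toy stand-in for a tokenized book: ten ids, blocks of four.
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)
```

Because every block has the same length, batches built from these blocks need no padding.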



In [31]:
dl_data[0], dl_data[0].shape # Inspect the first datapoint and its shape

(tensor([29744, 18252,   198, 37776,  4599,    69,  5037,   198,    56,  3768,
          6413, 14964,   952,   198, 34451,  2734,  4244,   628,   200, 15842,
           198, 33420,   198,   198,    85,  4178,   198,   198, 39482, 11726,
           198,   198,    85, 15479,   198,   198,  3673,   341,   198,   198,
         29992,   198,   198,    16,   198,   198, 21906,   198,    16,    13,
            16,  5338, 10358,  4149,   770,  4897,    30,   764,   764,   764,
           764,   764,   764,   764]),
 torch.Size([64]))

In [32]:
print(tokenizer.decode(dl_data[0]))

Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville

Contents
Website

vii

Acknowledgments

viii

Notation

xi

1

Introduction
1.1 Who Should Read This Book?.......


In [33]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, # mlm=False selects causal (not masked) language modeling
)


In [34]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

0

In [35]:
collator_example = data_collator([tokenizer("I am an input"), tokenizer('so am I')])
collator_example

{'input_ids': tensor([[   40,   716,   281,  5128],
        [  568,   716,   314, 50257]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]]), 'labels': tensor([[  40,  716,  281, 5128],
        [ 568,  716,  314, -100]])}
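
What the collator did to this batch can be reproduced by hand. The following is a minimal sketch, assuming right-side padding and plain Python lists instead of tensors; `collate_for_causal_lm` is a hypothetical helper, not the library's implementation.

```python
def collate_for_causal_lm(sequences, pad_token_id, ignore_index=-100):
    """Pad variable-length id lists to a common length for causal LM training."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Inputs are padded with the pad token id.
        batch["input_ids"].append(list(seq) + [pad_token_id] * n_pad)
        # The mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs, with padding replaced by the ignore index.
        batch["labels"].append(list(seq) + [ignore_index] * n_pad)
    return batch


# Token ids taken from the collator output above.
batch = collate_for_causal_lm([[40, 716, 281, 5128], [568, 716, 314]], pad_token_id=50257)
print(batch)
```

One difference worth keeping in mind: this sketch masks only the positions it padded, whereas a collator that masks every label equal to the pad id would also mask genuine occurrences of that id.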

In [36]:
collator_example.input_ids # 50257 is our pad token id

tensor([[   40,   716,   281,  5128],
        [  568,   716,   314, 50257]])

In [37]:
tokenizer.pad_token_id

50257

In [38]:
collator_example.labels # Note the -100, which excludes the padded token from the loss calculation
# Reminder: the labels are shifted inside the GPT model, so we don't need to shift them ourselves

tensor([[  40,  716,  281, 5128],
        [ 568,  716,  314, -100]])
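
To make the shifting and the -100 convention concrete, here is a pure-Python sketch of a next-token loss for a single sequence. It assumes mean cross-entropy over the positions that are not ignored; `shifted_causal_lm_loss` and the toy four-token vocabulary are illustrative, not the model's actual code.

```python
import math


def shifted_causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence.

    logits[t] scores the token at position t + 1, so the logits are
    paired with the labels shifted left by one position.
    """
    total, count = 0.0, 0
    for position_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[target]
        count += 1
    return total / count if count else float("nan")


# Uniform scores over a toy vocabulary of 4 tokens; the last label is padding.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
toy_labels = [0, 1, 2, -100]
loss = shifted_causal_lm_loss(toy_logits, toy_labels)
print(loss)
```

With uniform scores, each counted position costs log(4), and the padded position is skipped, so only two targets enter the average.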

In [39]:
model = GPT2LMHeadModel.from_pretrained('gpt2') #Loading up the GPT2 model

pretrained_generator = pipeline(
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length':200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

In [41]:
# Before finetuning the model
print('------------------')
for generated_sequence in pretrained_generator('If a bias is for an output unit, then it is often beneﬁcial', num_return_sequences=5):
  print(generated_sequence['generated_text'])
  print('------------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


------------------
If a bias is for an output unit, then it is often beneﬁcial to specify bias on it. See, here, for the rule that: The output must be of such a shape as that of the bias. If the bias
------------------
If a bias is for an output unit, then it is often beneﬁcial to be sure, because it can affect all the outputs, which can be quite a bit (we will use the most basic of the many terms in the following
------------------
If a bias is for an output unit, then it is often beneﬁcial to consider a higher sample size to be more robust in its quality estimates.

Using data from a small number of people, we can calculate the bias from
------------------
If a bias is for an output unit, then it is often beneﬁcial to an input and an output unit, and hence the same bias can be added to the pair which should work, and so the same bias can be added to
------------------
If a bias is for an output unit, then it is often beneﬁcial for a bias measurement. However, when that bias is comput
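
The sampling settings named in the generator cell (temperature, top_k, top_p) each reshape the next-token distribution before a token is drawn. The sketch below shows one common way to combine them, in pure Python; `filter_distribution` is a hypothetical helper, and the order of operations (temperature, then top-k, then top-p) is an assumption of this sketch rather than a statement about the library's internals.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=10, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # Softmax over the surviving tokens.
    peak = max(scaled[i] for i in kept)
    weights = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in nucleus)
    return {i: probs[i] / norm for i in nucleus}


# Toy four-token distribution with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(filtered)
```

The samples above stop at roughly 50 tokens, which suggests the dictionary passed as `config=` may not have taken effect. Generation settings can also be passed directly when calling the pipeline, for example `max_length=200, do_sample=True, top_p=0.9, temperature=0.7, top_k=10` alongside `num_return_sequences`.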

In [44]:
# let's train the model

training_args = TrainingArguments(
    output_dir = "/content/drive/MyDrive/GPT_for_style_completion", # The output directory
    overwrite_output_dir = True, # Overwrite the contents of the output directory
    num_train_epochs = 3, # Number of training epochs
    per_device_train_batch_size=16, # Batch size for training
    per_device_eval_batch_size=16, # Batch size for evaluation
    warmup_steps=len(dl_data.examples) // 5, # Number of warmup steps for the learning-rate scheduler
    logging_steps=50,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',
    save_strategy='epoch'

)

trainer = Trainer(
    model = model,
    args=training_args,
    data_collator = data_collator,
    train_dataset = dl_data.examples[:int(len(dl_data.examples)*.8)],
    eval_dataset = dl_data.examples[int(len(dl_data.examples)*.8):],
)

trainer.evaluate() # Baseline evaluation loss before fine-tuning

{'eval_loss': 4.620068550109863,
 'eval_runtime': 10.2742,
 'eval_samples_per_second': 144.634,
 'eval_steps_per_second': 9.052}

In [45]:
trainer.train()



Epoch   Training Loss   Validation Loss
1       3.6066          3.679982
2       3.2024          3.543131
3       3.0213          3.527218


TrainOutput(global_step=1116, training_loss=3.469020570050858, metrics={'train_runtime': 424.1041, 'train_samples_per_second': 42.018, 'train_steps_per_second': 2.631, 'total_flos': 582028001280000.0, 'train_loss': 3.469020570050858, 'epoch': 3.0})
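
The step count in the training output can be cross-checked against the settings in the training cell. This is a back-of-the-envelope sketch: the dataset size is not printed anywhere, so it is inferred here from the logged evaluation throughput, and a single device with no gradient accumulation is assumed.

```python
import math

# Evaluation set size, inferred from the pre-training evaluate() output:
# eval_samples_per_second * eval_runtime.
eval_examples = round(144.634 * 10.2742)

# The eval split is the last 20% of the data, so the full dataset is
# about five times as large (an estimate, off by a few examples at most).
total_examples = eval_examples * 5
train_examples = total_examples - eval_examples

batch_size = 16
num_epochs = 3
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# warmup_steps as computed in the training cell: len(dl_data.examples) // 5
warmup_steps = total_examples // 5

print(eval_examples, train_examples, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the estimate reproduces the logged `global_step`. It also shows that `warmup_steps`, which counts optimizer steps rather than examples, comes out larger than the total number of training steps, so the learning rate would still be in its warmup phase when training ends.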

In [46]:
trainer.evaluate()

{'eval_loss': 3.5272176265716553,
 'eval_runtime': 8.1561,
 'eval_samples_per_second': 182.195,
 'eval_steps_per_second': 11.403,
 'epoch': 3.0}
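
Evaluation loss is easier to interpret as perplexity, the exponential of the mean cross-entropy. A short sketch using the two `eval_loss` values reported before and after fine-tuning, assuming the loss is a per-token average in natural-log units:

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


ppl_before = perplexity(4.620068550109863)   # eval_loss before fine-tuning
ppl_after = perplexity(3.5272176265716553)   # eval_loss after three epochs

print(f"perplexity before: {ppl_before:.1f}")
print(f"perplexity after:  {ppl_after:.1f}")
```

Perplexity falls from roughly 100 to the mid-30s on the held-out portion of the book, consistent with the model adapting to the book's vocabulary and style.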

In [47]:
trainer.save_model()

In [48]:
loaded_model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/GPT_for_style_completion')

finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    config={'max_length':200, 'do_sample':True, 'top_p': 0.9, 'temperature':0.7, 'top_k': 10}
)

In [49]:
# After finetuning the model
print('------------------')
for generated_sequence in finetuned_generator('If a bias is for an output unit, then it is often beneﬁcial', num_return_sequences=5):
  print(generated_sequence['generated_text'])
  print('------------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


------------------




If a bias is for an output unit, then it is often beneﬁcial for sampling to
frequently get a copy of the output, and to perform a sample-based
sample analysis. Some tools that can perform regularized sampling
------------------
If a bias is for an output unit, then it is often beneﬁcial to apply an undirected cost function in a low cost diﬀerent way. In fact, the unparameterized gradient
function can be
------------------
If a bias is for an output unit, then it is often beneﬁcial to
avoid such biases to the right of the input. Instead, most machine learning training examples are
represented by the nearest neighbor of an boasts. If there
------------------
If a bias is for an output unit, then it is often beneﬁcial to
put some parameters into the output unit that do not depend on each other. In one
case, a feature may depend on only 1 feature − eats
------------------
If a bias is for an output unit, then it is often beneﬁcial to use
a bias for other outputs rather than the input. For 