Source code for Python Machine Learning By Example, 4th Edition (Packt Publishing)

Chapter 13: Advancing Language Understanding and Generation with the Transformer Models

Author: Yuxi (Hayden) Liu (yuxi.liu.ece@gmail.com)

# Generating text using GPT 

## Writing your own War and Peace with GPT

In [1]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(0)
generator("I love machine learning",
          max_length=20,
          num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I love machine learning, so you should use machine learning as your tool for data production.\n\n'},
 {'generated_text': 'I love machine learning. I love learning and I love algorithms. I love learning to control systems.'},
 {'generated_text': 'I love machine learning, but it would be pretty difficult for it to keep up with the demands and'}]

In [2]:
from transformers import TextDataset, GPT2Tokenizer

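# Load the GPT-2 tokenizer from the local cache. If the files have not been
# downloaded yet, use the call without local_files_only to fetch them first:
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', local_files_only=True)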


In [3]:
text_dataset = TextDataset(tokenizer=tokenizer, file_path='warpeace_input.txt', block_size=128)



In [4]:
len(text_dataset)

6176
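
The count above follows from how the tokenized text is cut into fixed-size blocks. The sketch below is a plain-Python illustration of that chunking, not the library's code: `chunk_into_blocks` is a hypothetical helper, and it assumes the behavior of `TextDataset` with the GPT-2 tokenizer, where consecutive, non-overlapping blocks of `block_size` tokens are kept and a trailing partial block is dropped.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split token_ids into consecutive, non-overlapping blocks of block_size.

    Hypothetical helper for illustration. A trailing remainder shorter than
    block_size is dropped (assumed to match TextDataset's behavior).
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Toy example: 10 token IDs with block_size=4 give two full blocks;
# the last two IDs (8 and 9) are discarded.
toy_blocks = chunk_into_blocks(list(range(10)), block_size=4)
print(toy_blocks)
```

Under that assumption, 6176 blocks of 128 tokens mean the input file tokenizes to at least 6176 × 128 = 790,528 GPT-2 tokens.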

In [5]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
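
With `mlm=False`, the collator prepares batches for causal (next-token) language modeling rather than masked language modeling. The sketch below illustrates the idea in plain Python under stated assumptions: the labels are a copy of the input IDs with padding positions replaced by -100 (the index the loss ignores), and the one-position shift between inputs and targets happens later, inside the model's loss computation. The helper names `make_causal_lm_labels` and `shifted_pairs` are hypothetical and not part of `transformers`.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def make_causal_lm_labels(input_ids, pad_token_id=None):
    """Copy input_ids into labels, masking padding positions (illustrative)."""
    return [
        IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok
        for tok in input_ids
    ]


def shifted_pairs(input_ids, labels):
    """List (context, target) pairs: the tokens up to position i predict
    the label at position i + 1. Ignored targets are skipped."""
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: i + 1], target))
    return pairs


# Toy sequence padded with 0s to length 5
toy_ids = [10, 11, 12, 0, 0]
toy_labels = make_causal_lm_labels(toy_ids, pad_token_id=0)
toy_pairs = shifted_pairs(toy_ids, toy_labels)
print(toy_labels)
print(toy_pairs)
```

In this notebook every block has exactly 128 tokens, so no padding should be needed and the labels would simply mirror the inputs.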


In [6]:
import torch
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('gpt2')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [7]:
# Adam with a small learning rate, as is typical for fine-tuning
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

In [8]:
from transformers import Trainer, TrainingArguments

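training_args = TrainingArguments(
    output_dir='./gpt_results',       # where checkpoints are written
    num_train_epochs=20,
    per_device_train_batch_size=16,
    logging_dir='./gpt_logs',
    save_total_limit=1,               # keep only the most recent checkpoint
    logging_steps=500,                # log the training loss every 500 steps
)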


In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=text_dataset,
    optimizers=(optim, None),  # None: let Trainer create its default LR scheduler
)


In [10]:
trainer.train()

Step,Training Loss
500,3.4141
1000,3.1495
1500,3.0075
2000,2.8826
2500,2.7792
3000,2.6992
3500,2.6217
4000,2.5488
4500,2.4954
5000,2.4476


TrainOutput(global_step=7720, training_loss=2.6408239769812076, metrics={'train_runtime': 1411.9948, 'train_samples_per_second': 87.479, 'train_steps_per_second': 5.467, 'total_flos': 8068697948160000.0, 'train_loss': 2.6408239769812076, 'epoch': 20.0})
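
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly one batch of 16 blocks; all input numbers are copied from the outputs above. The perplexity computed here is derived from the mean training loss over the whole run, so it is only a rough indicator and not a held-out evaluation metric.

```python
import math

num_examples = 6176            # len(text_dataset)
batch_size = 16                # per_device_train_batch_size
num_epochs = 20                # num_train_epochs
train_runtime = 1411.9948      # seconds, from TrainOutput
mean_train_loss = 2.6408239769812076

# Assumption: one device, gradient_accumulation_steps left at 1
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Perplexity is the exponential of the mean cross-entropy loss
train_perplexity = math.exp(mean_train_loss)

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"train perplexity:   {train_perplexity:.2f}")
```

The computed step count and throughput should agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` reported by the trainer.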

In [11]:
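def generate_text(prompt_text, model, tokenizer, max_length):
    """Generate one continuation of prompt_text, up to max_length tokens in total."""
    input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to(device)

    # Generate the continuation. do_sample is left at its default (False), so
    # decoding is greedy and top_p has no effect; pass do_sample=True to
    # enable nucleus sampling.
    output_sequences = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # never generate the same bigram twice
        top_p=0.9,
    )

    # Decode each generated sequence back into text
    responses = []
    for response_id in output_sequences:
        response = tokenizer.decode(response_id, skip_special_tokens=True)
        responses.append(response)

    return responses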
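
The `no_repeat_ngram_size=2` argument prevents the model from producing any bigram twice. The sketch below shows the underlying idea in plain Python: before each new token is chosen, every token that would complete an already-seen n-gram is banned. `banned_next_tokens` is a hypothetical helper written for illustration; it is not the implementation used by `transformers`.

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Return the tokens that would repeat an n-gram already in generated_ids.

    Hypothetical helper: the last (ngram_size - 1) tokens form the current
    prefix; any token that followed the same prefix earlier is banned.
    """
    if ngram_size < 1 or len(generated_ids) + 1 < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    prefix = tuple(generated_ids[len(generated_ids) - prefix_len:])
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = generated_ids[start:start + ngram_size]
        if tuple(ngram[:prefix_len]) == prefix:
            banned.add(ngram[-1])
    return banned


# Token 5 was previously followed by 6 and by 7, so after the final 5
# both continuations would repeat a bigram and are banned.
toy_banned = banned_next_tokens([5, 6, 5, 7, 5], ngram_size=2)
print(sorted(toy_banned))
```

During decoding, the scores of the banned tokens would be set to negative infinity before the next token is selected, which is one reason the generated passage avoids the short loops that plain greedy decoding often falls into.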

In [12]:
prompt_text = "the emperor"
responses = generate_text(prompt_text, model, tokenizer, 100)

for response in responses:
    print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


the emperor's, and the Emperor Francis, who was in attendance on him,
was present.

The Emperor was present because he had received the news that the French
troops were advancing on Moscow, that Kutuzov had been wounded, the
Emperor's wife had died, a letter from Prince Andrew had come from
Prince Vasili, Prince Bolkonski had seen at the palace, news of the death of
the Emperor, but the most important news was that


---

Readers may ignore the next cell.

In [13]:
!jupyter nbconvert --to python ch13_part2.ipynb --TemplateExporter.exclude_input_prompt=True

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[NbConvertApp] Converting notebook ch13_part2.ipynb to python
[NbConvertApp] Writing 2580 bytes to ch13_part2.py
