# GPT2 Language Model Fine-tuning with Texts from Shakespeare
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fcakyon/gpt2-shakespeare/blob/main/gpt2-shakespeare.ipynb)

## 0. Install requirements

In [None]:
!pip install -U transformers datasets torch sentencepiece pyyaml

Collecting transformers
  Downloading transformers-4.11.0-py3-none-any.whl (2.9 MB)
Collecting datasets
  Downloading datasets-1.12.1-py3-none-any.whl (270 kB)
Collecting torch
  Downloading torch-1.9.1-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting pyyaml
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers

## 1. Initialize Model and Tokenizer

- Mount Google Drive, create the project directory, and preview the corpus:

In [None]:
# create the project directory in Google Drive

import os

from google.colab import drive

PROJECT_NAME = 'gpt_2_model'
drive.mount('/content/gdrive')
ROOT_DIR = "/content/gdrive/My Drive/" + PROJECT_NAME
os.makedirs(ROOT_DIR, exist_ok=True)

# list the project files and preview the first lines of the corpus
!ls -la "{ROOT_DIR}"
!head "{ROOT_DIR}/corpus_gpt2_final.txt"

Mounted at /content/gdrive
total 207
-rw------- 1 root root 105721 Sep 28 14:52 corpus_gpt2_final.txt
-rw------- 1 root root  86618 Sep 28 15:51 gpt2.txt
-rw------- 1 root root  18158 Sep 27 12:53 qa.txt
we decided to do the exhibition at our place at the Riad, opening will be on the 6th of may, so we have our conference at the 7th at ours as well. Everybody is happy with that and starts working. Amine told us yesterday to send a list of things we need: 1 screen earphones 3 projector speakers 5 power mulitplugs 2 Mediaplayer bluetac table.
I need a possibility to print ( Amine offered the possibility to print at his office) junior needs to print as well but just 4A format.
It is great to meet you and talk about all the Projects. Would it be possible to meet the Monday After?
But on Monday I have to be in Duesseldorf, because i am shortlisted for the art prize of the city. The presentation is on Tuesday, Monday is installing.
i bring my laptop at the moment there is not that much to see

In [None]:
# import the necessary libraries

import torch
import math
from transformers import GPT2Tokenizer, GPT2LMHeadModel, HfArgumentParser, TrainingArguments, Trainer, default_data_collator
from datasets import load_dataset, Dataset, DatasetDict

- Initialize a GPT2 model with a language modelling head:

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "

- Initialize GPT2 tokenizer:

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

## 2. Initialize Dataset

In [None]:
data_file=ROOT_DIR + "/corpus_gpt2_final.txt"
print("Loading from file:", data_file)
with open(data_file) as fin:
  total_lines = [line.rstrip('\n') for line in fin]
print('Loaded all files, num total lines:', len(total_lines))


dataset = Dataset.from_dict({'text': total_lines})
print(dataset)
print(dataset[0])
test_and_train_dsd = dataset.train_test_split(train_size=0.8)
train_ds = test_and_train_dsd['train']
test_and_validation_dsd = test_and_train_dsd['test'].train_test_split(train_size=0.5)
test_ds = test_and_validation_dsd['test']
validation_ds = test_and_validation_dsd['train']

def join_dataset(ds):
  return Dataset.from_dict({'text': ['\n'.join(ds['text'])]})

train_ds = join_dataset(train_ds)
test_ds = join_dataset(test_ds)
validation_ds = join_dataset(validation_ds)

datasets = DatasetDict({'test': test_ds, 'train': train_ds, 'validation': validation_ds})
print(datasets)

Loading from file: /content/gdrive/My Drive/gpt_2_model/corpus_gpt2_final.txt
Loaded all files, num total lines: 364
Dataset({
    features: ['text'],
    num_rows: 364
})
{'text': 'we decided to do the exhibition at our place at the Riad, opening will be on the 6th of may, so we have our conference at the 7th at ours as well. Everybody is happy with that and starts working. Amine told us yesterday to send a list of things we need: 1 screen earphones 3 projector speakers 5 power mulitplugs 2 Mediaplayer bluetac table.'}
DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
})
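
The cell above shuffles the 364 corpus lines, splits them roughly 80/10/10 into train, validation, and test, and then joins each part into a single long string (hence `num_rows: 1` per split). The sketch below reproduces only that logic with the standard library. The helper `split_lines` and its `seed` argument are hypothetical names for illustration, and the exact rounding of `Dataset.train_test_split` may differ.

```python
import random

def split_lines(lines, train_frac=0.8, seed=0):
    """Shuffle lines, then cut them into train / validation / test parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    rest = shuffled[n_train:]
    n_val = len(rest) // 2  # the remaining lines are halved into validation and test
    return shuffled[:n_train], rest[:n_val], rest[n_val:]

# stand-in for the 364 corpus lines
lines = [f"line {i}" for i in range(364)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))

# each part is then joined into one long text, as join_dataset does
train_text = "\n".join(train)
```

Because the notebook calls `train_test_split` without a seed, the split changes from run to run; passing a fixed `seed` to `train_test_split` should make it repeatable.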


- Tokenize all the texts:

In [None]:
column_names = datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    # tokenize the raw text; nothing is truncated here, splitting into blocks happens in the next step
    output = tokenizer(examples[text_column_name])
    return output

# tokenize dataset
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset",
)


Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ?ba/s]


In [None]:
print(column_names)
print(tokenized_datasets)

['text']
DatasetDict({
    test: Dataset({
        features: ['attention_mask', 'input_ids'],
        num_rows: 1
    })
    train: Dataset({
        features: ['attention_mask', 'input_ids'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids'],
        num_rows: 1
    })
})


- Split whole dataset into smaller sets of blocks:

In [None]:
# get block size (max input length of the model)
block_size = tokenizer.model_max_length
print(tokenizer.model_max_length)
if block_size > 1024:
    block_size = 1024
    
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the final partial block. Padding it instead would also work if the model
    # supported padding; adjust this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# split total dataset into smaller sets of length block_size
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}",
)
print(lm_datasets)

1024


Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ?ba/s]


DatasetDict({
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 1
    })
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 17
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 2
    })
})
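
To make the chunking step concrete, here is a minimal sketch of the same idea on a toy list of token ids. `group_into_blocks` is a hypothetical helper, not part of the notebook; it mirrors the remainder-dropping logic of `group_texts` for a single sequence.

```python
def group_into_blocks(token_ids, block_size):
    """Cut one long token sequence into equal blocks, dropping the remainder."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

toy_ids = list(range(10))
blocks = group_into_blocks(toy_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]  (ids 8 and 9 are dropped)

# for causal language modelling the labels are a copy of the inputs
labels = [block.copy() for block in blocks]
```

One consequence worth keeping in mind: a split with fewer than `block_size` tokens produces no blocks at all, so very small validation or test sets can end up empty.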


In [None]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

## 3. Initialize Trainer

In [None]:
training_args = TrainingArguments(output_dir = "output/", per_device_train_batch_size=1, num_train_epochs=30, save_total_limit=1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    # Data collator will default to DataCollatorWithPadding, so we change it.
    data_collator=default_data_collator,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


## 4. Perform Training

In [None]:
# perform training
train_result = trainer.train()

# save the model weights and config (the tokenizer passed to the Trainer is saved as well)
trainer.save_model()

# save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

# save training state
trainer.save_state()

***** Running training *****
  Num examples = 17
  Num Epochs = 30
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 510


Step,Training Loss
500,1.5207


Saving model checkpoint to output/checkpoint-500
Configuration saved in output/checkpoint-500/config.json
Model weights saved in output/checkpoint-500/pytorch_model.bin
tokenizer config file saved in output/checkpoint-500/tokenizer_config.json
Special tokens file saved in output/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to output/
Configuration saved in output/config.json
Model weights saved in output/pytorch_model.bin
tokenizer config file saved in output/tokenizer_config.json
Special tokens file saved in output/special_tokens_map.json


***** train metrics *****
  epoch                    =       30.0
  total_flos               =   248214GF
  train_loss               =     1.5054
  train_runtime            = 0:06:21.64
  train_samples_per_second =      1.336
  train_steps_per_second   =      1.336
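
The numbers in the training log can be checked by hand. The sketch below assumes one optimizer step per batch (gradient accumulation of 1, as logged); `total_optimization_steps` is a hypothetical helper for illustration.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """One optimizer step per batch, repeated for every epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_optimization_steps(num_examples=17, batch_size=1, num_epochs=30)
print(steps)  # 510

# train_runtime 0:06:21.64 expressed in seconds
runtime_seconds = 6 * 60 + 21.64
print(round(steps / runtime_seconds, 3))  # 1.336
```

With 510 steps in total, a single logged loss and a single checkpoint at step 500 are what one would expect if logging and saving both happen every 500 steps, which is assumed here to be the `TrainingArguments` default for this version.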


## 5. Evaluate Model

In [None]:
# perform evaluation over validation data
metrics = trainer.evaluate()

# calculate perplexity
try:
    perplexity = math.exp(metrics["eval_loss"])
except OverflowError:
    perplexity = float("inf")
    
# save perplexity
metrics["perplexity"] = perplexity
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

***** Running Evaluation *****
  Num examples = 2
  Batch size = 8


***** eval metrics *****
  epoch                   =       30.0
  eval_loss               =     4.4783
  eval_runtime            = 0:00:00.46
  eval_samples_per_second =      4.288
  eval_steps_per_second   =      2.144
  perplexity              =    88.0843
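
Perplexity is the exponential of the mean cross-entropy loss, so the reported value can be reproduced from `eval_loss` alone. The sketch below uses the rounded losses printed above; the function name `perplexity_from_loss` is a hypothetical helper.

```python
import math

def perplexity_from_loss(loss):
    """Perplexity is exp(mean cross-entropy loss); guard against overflow."""
    try:
        return math.exp(loss)
    except OverflowError:
        return float("inf")

eval_ppl = perplexity_from_loss(4.4783)   # eval_loss from the log above
train_ppl = perplexity_from_loss(1.5054)  # train_loss from the training metrics
print(round(eval_ppl, 2))   # 88.08
print(round(train_ppl, 2))  # 4.51
```

The large gap between the two values suggests that the model fits the 17 training blocks much better than the 2 held-out validation blocks, which would be consistent with overfitting on such a small corpus.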


## 6. Generate Samples

In [None]:
# fix the random seed so that sampling is reproducible
import torch
torch.manual_seed(2)

# generate a continuation for the given prompt
def generate_text(input_text):
  # tokenize the prompt and move it to the same device as the model
  ids = tokenizer.encode(input_text,
                         return_tensors='pt').to(model.device)

  # sample with top-p (nucleus) sampling; top_k=0 disables top-k filtering
  # max_length counts the prompt tokens as well as the generated ones
  sample_output = model.generate(
      ids,
      do_sample=True,
      max_length=200,
      top_p=0.92,
      top_k=0,
      temperature=0.2
  )
  return tokenizer.decode(sample_output[0], skip_special_tokens=True)

prompt = 'What inspires your art? A lot, experiences, texts, songs, films, conversations, observations, the daily life as news.'
output = generate_text(prompt)
# print the generated text
print("Output:\n" + 100 * '-')
print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
What inspires your art? A lot, experiences, texts, songs, films, conversations, observations, the daily life as news.
Great, that would be really cool.
The Avatar series is based on my latest work: On the one hand, on the VR work of Hans Bellmer and Bruce Nauman, two self-portrait series that use the iPhone to imagerically gather information about human interaction in the real world. On the other hand, the dystopian genre-western genre-western, where the iPhone is connected to the Internet, is routinely used to gather information on the human body and its interaction.
I would love to stay in contact. Attached you can find my portfolio.
The series heads got 55 images per panel, each of them dealing with a different aspect of the human body. From there, the series heads can be downloaded for individual presentation or as part of a larger exhibition.
I would love to invite you to m
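
For intuition about the `top_p` and `temperature` arguments used above, here is a minimal, standard-library sketch of top-p (nucleus) sampling over a toy list of logits. It is a simplified illustration, not the code that `model.generate` runs; `top_p_sample` and `nucleus` are hypothetical helpers.

```python
import math
import random

def nucleus(logits, top_p, temperature):
    """Return (token indices, probabilities) of the smallest top-p set."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # walk through tokens from most to least probable until top_p is covered
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]

def top_p_sample(logits, top_p=0.92, temperature=0.2, rng=random):
    """Sample one token index from the renormalised top-p set."""
    kept, weights = nucleus(logits, top_p, temperature)
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]
print(nucleus(toy_logits, top_p=0.92, temperature=0.2)[0])  # [0]
print(nucleus(toy_logits, top_p=0.92, temperature=1.0)[0])  # [0, 1, 2]
print(top_p_sample(toy_logits, rng=random.Random(2)))       # 0
```

With a temperature as low as 0.2 the distribution becomes sharply peaked, so the nucleus often contains a single token and sampling behaves almost like greedy decoding.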

In [None]:
# fix the random seed so that sampling is reproducible
import torch
torch.manual_seed(2)

# read the question / short-answer pairs, generate a continuation for each one,
# and write the continuations to a text file (one per line)
with open(ROOT_DIR + "/qa.txt", "r") as fin, open(ROOT_DIR + "/gpt2.txt", "w") as gpt2_output:
  for qa in fin:
    qa = qa.strip()
    # make sure the short answer ends with a period
    if not qa.endswith("."):
      qa = qa + "."
    output = generate_text(qa)
    # keep only the generated text, not the question and short answer
    output = output[len(qa)+1:].strip()

    # postprocess
    # replace newlines with spaces so that each output stays on a single line
    output = output.replace("\n", " ")
    # cut after the last period so that the output ends with a complete sentence
    output = output[:output.rfind(".")+1]
    # write the generated text to the file
    print(output, file=gpt2_output)
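
The postprocessing steps can be tried out without a model. The sketch below wraps them in a hypothetical helper, `postprocess`, and assumes the decoded text starts with the prompt exactly as it was passed in. Unlike the cell above, it slices at `len(prompt)` and relies on `strip()` to remove the separator, so no character is lost when the continuation follows the prompt directly.

```python
def postprocess(generated, prompt):
    """Strip the prompt, flatten newlines, and cut after the last period."""
    continuation = generated[len(prompt):].strip()
    continuation = continuation.replace("\n", " ")
    last_period = continuation.rfind(".")
    # rfind returns -1 when there is no period, which yields an empty string
    return continuation[: last_period + 1]

prompt = "What inspires your art? A lot."
generated = prompt + "\nGreat, that would be really cool.\nI would love to invite you to m"
print(postprocess(generated, prompt))  # Great, that would be really cool.
```

Note that a continuation without any period is reduced to an empty line, so the output file can contain blank lines for some prompts.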


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.