# Fine-tune GPT-2 on Shakespeare
In this notebook, you will fine-tune GPT-2 to generate text that sounds like Shakespeare, using the Hugging Face `transformers` library.

References:
* https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface
* https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb#scrollTo=m9lHS0mIMak4
* https://huggingface.co/docs/transformers/en/model_doc/gpt2

## Import required packages
Also check whether a GPU is available.

In [None]:
import torch

print(torch.__version__)
print(torch.__path__)
print(torch.cuda.is_available())  # True if PyTorch can use a CUDA GPU

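## Load Shakespeare data for training
We'll load the data with Hugging Face's `TextDataset`, which tokenizes each text file and splits it into fixed-length blocks of `block_size` tokens. Note that recent versions of `transformers` mark `TextDataset` as deprecated in favor of the `datasets` library.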

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [None]:
from transformers import TextDataset, DataCollatorForLanguageModeling

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='data/shakespeare_train.txt',
    block_size=64)

test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='data/shakespeare_test.txt',
    block_size=64)

# mlm=False selects causal (next-token) language modeling rather than masked LM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False)
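
To make `block_size` and `mlm=False` more concrete, here is a minimal pure-Python sketch of the idea. It is a conceptual illustration, not the library's actual code: the helper names `split_into_blocks` and `make_causal_lm_example` are hypothetical, and the toy integers stand in for real token ids.

```python
# Conceptual sketch only: hypothetical helpers, not the transformers implementation.

def split_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive, non-overlapping blocks.

    Trailing tokens that do not fill a whole block are dropped.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


def make_causal_lm_example(block):
    """Build one training example for causal language modeling.

    With mlm=False the labels are a copy of the inputs; the model shifts
    them by one position internally when it computes the loss.
    """
    return {"input_ids": list(block), "labels": list(block)}


toy_ids = list(range(10))  # stand-in for the token ids of a whole text file
blocks = split_into_blocks(toy_ids, block_size=4)
examples = [make_causal_lm_example(b) for b in blocks]

print(blocks)       # two full blocks; the last two ids do not fill a block
print(examples[0])
```

With `block_size=64`, as in the cell above, each training example would be a run of 64 consecutive tokens from the text file.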

## Initialize training settings

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")


training_args = TrainingArguments(
    output_dir="./gpt2-shakespeare",  # directory for checkpoints and the final model
    overwrite_output_dir=True,        # overwrite existing contents of output_dir
    num_train_epochs=1,               # number of passes over the training set
    per_device_train_batch_size=32,   # training batch size per device
    per_device_eval_batch_size=64,    # evaluation batch size per device
    # Update steps between evaluations. This only takes effect when the
    # evaluation strategy is set to "steps"; by default no evaluation runs.
    eval_steps=300,
    save_steps=800,                   # save a checkpoint every 800 update steps
    warmup_steps=500,                 # warmup steps for the learning-rate scheduler
    prediction_loss_only=True,        # report only the loss during evaluation
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
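
It can help to check how `warmup_steps` compares with the total number of update steps. The sketch below mirrors the shape of a linear warmup followed by linear decay, which is the default schedule in `Trainer` as far as I know. It assumes a single device and no gradient accumulation. The function names and the dataset size of 20,000 blocks are hypothetical; `5e-5` is assumed to be the default learning rate.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer updates in one epoch (last partial batch is kept)."""
    return math.ceil(num_examples / batch_size)


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples = 20_000  # hypothetical number of 64-token training blocks
total_steps = steps_per_epoch(num_examples, batch_size=32)  # one epoch

for step in (0, 250, 500, total_steps):
    lr = linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total_steps)
    print(step, lr)
```

If the training set is small enough that the total number of steps is below `warmup_steps`, the learning rate never reaches its peak value, so it may be worth lowering `warmup_steps` in that case.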

## Fine-tune (train) the model
Then save it to disk.

In [None]:
trainer.train()

In [None]:
trainer.save_model()  # saves to the output_dir set in TrainingArguments
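
One common way to judge the fine-tuned model is to convert the evaluation loss into perplexity. The sketch below assumes the metrics come from `trainer.evaluate()`, which returns a dictionary containing an `eval_loss` entry. The loss value here is a hypothetical placeholder, not a result from this notebook.

```python
import math


def perplexity_from_loss(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


# In the notebook this would be: metrics = trainer.evaluate()
metrics = {"eval_loss": 3.5}  # hypothetical placeholder value

print(f"perplexity: {perplexity_from_loss(metrics['eval_loss']):.2f}")
```

Lower perplexity on the held-out Shakespeare text suggests the model predicts that style of text better.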

## Generate text from the fine-tuned model

In [None]:
from transformers import pipeline

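# The Trainer was not given the tokenizer, so it was not saved with the model;
# load the original GPT-2 tokenizer instead.
shakespeare_gpt2 = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2')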

In [None]:
prefix = '' # FILL IN a word or two. This acts as the context (prompt) for generation.

print(shakespeare_gpt2(prefix)[0]['generated_text'])
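
The call above uses the pipeline's default generation settings. As a sketch of how to get several longer, more varied samples, the helper below forwards a few standard generation arguments (`max_new_tokens`, `do_sample`, `top_k`, `temperature`, `num_return_sequences`) to the pipeline. The helper name `generate_samples` and the specific values are illustrative assumptions, not tuned settings.

```python
def generate_samples(generator, prompt, num_samples=3, max_new_tokens=60):
    """Return several sampled continuations of `prompt`.

    `generator` is expected to behave like a Hugging Face text-generation
    pipeline: callable with a prompt and returning a list of dicts that
    each contain a 'generated_text' key.
    """
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,     # cap on newly generated tokens
        do_sample=True,                    # sample instead of greedy decoding
        top_k=50,                          # sample only from the 50 likeliest tokens
        temperature=0.9,                   # values below 1.0 sharpen the distribution
        num_return_sequences=num_samples,  # number of continuations to return
    )
    return [out["generated_text"] for out in outputs]


# Usage in this notebook (requires the pipeline created above):
# for text in generate_samples(shakespeare_gpt2, prefix):
#     print(text)
#     print("-" * 40)
```

Because sampling is random, each run produces different text; lowering `temperature` or `top_k` tends to make the output more conservative.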