# Fine-tuning GPT-2 on Shakespeare Text

This tutorial walks through fine-tuning a Hugging Face GPT-2 model on a Shakespeare text dataset. We'll:

1. Load and preprocess the text data from a `.txt` file.
2. Tokenize the data.
3. Set up the Trainer.
4. Fine-tune GPT-2.
5. Save the fine-tuned model.
6. Load the fine-tuned model.
7. Generate Shakespearean-style text with the fine-tuned model.

Run each cell in order.

## 1. Install Dependencies

Ensure you have the necessary libraries installed.

In [None]:
!pip install transformers datasets scikit-learn torch accelerate

## 2. Load and Preprocess Shakespeare Text

### ✅ Goal

Split the text like this:

```
Citizen:
This, here before you.
CORIOLANUS:
Thank you, sir: farewell.
O world, thy slippery turns! ...
```

Into structured chunks like:

- `Citizen: This, here before you.`
- `CORIOLANUS: Thank you, sir: farewell. O world, thy slippery turns! ...`

Each of these becomes one training sample.

---

### 🔍 Step-by-Step Breakdown

#### 1. **Speaker Format Detection**

We detect lines that look like:

```
CHARACTER_NAME:
```

A speaker line starts with an uppercase letter and ends with a colon, with nothing else on the line.

---

#### 2. **Regex Pattern Used**

```python
pattern = r'(?=^[A-Z][A-Za-z\' ]+-?:\n)'
```

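This matches a line that consists of:
- One uppercase letter followed by one or more letters, spaces, or apostrophes, which covers names with spaces (`First Servingman:`), names with apostrophes (`O'Conner:`), and all-caps names (`CORIOLANUS:`)
- An optional dash directly before the colon (a dash in the middle of a name, as in `Stage-Direction:`, is not matched)
- A colon followed immediately by a newline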

The `(?=...)` syntax is a **lookahead**, meaning it **splits the text just before** the speaker line, without removing it.
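
As a quick check of what the pattern accepts, the sketch below runs it against a few hand-written lines. The `candidates` list is illustrative only and is not taken from the dataset; the pattern is the same one as above, written with double quotes so the apostrophe needs no escape.

```python
import re

# Same pattern as above, double-quoted so the apostrophe needs no backslash
pattern = r"(?=^[A-Z][A-Za-z' ]+-?:\n)"

# Illustrative lines (assumed examples, not from the dataset)
candidates = [
    "First Servingman:\n",          # name with a space
    "O'Conner:\n",                  # name with an apostrophe
    "CORIOLANUS:\n",                # all-caps name
    "Thank you, sir: farewell.\n",  # dialogue: the comma interrupts the name part
    "citizen:\n",                   # does not start with an uppercase letter
]

# True where the lookahead finds a speaker line, False otherwise
is_speaker = {
    line: re.search(pattern, line, flags=re.MULTILINE) is not None
    for line in candidates
}

for line, matched in is_speaker.items():
    print(f"{line.strip()!r:32} speaker line: {matched}")
```

The first three lines are detected as speaker lines; the last two are not.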

---

#### 3. **Splitting the Text**

We use:

```python
re.split(pattern, text, flags=re.MULTILINE)
```

This breaks the raw text into chunks, one per speaker.

---

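#### 4. **Cleaning and Flattening**

Because the pattern is a zero-width lookahead, `re.split` consumes nothing: each chunk still begins with its own speaker line, so there is no need to capture the names separately and reattach them. (Calling `re.findall(...)` with this pattern would return only empty strings.)

Each block is then flattened (newlines → spaces) using: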

```python
part.replace('\n', ' ').strip()
```
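
The split and flatten steps can be tried on the excerpt from the Goal section. This is a minimal sketch: the `sample` string and variable names are illustrative, and it assumes Python 3.7 or later, where `re.split` splits on zero-width matches.

```python
import re

# Excerpt from the Goal section, used here as a tiny stand-in for the full file
sample = (
    "Citizen:\n"
    "This, here before you.\n"
    "CORIOLANUS:\n"
    "Thank you, sir: farewell.\n"
    "O world, thy slippery turns!\n"
)

pattern = r"(?=^[A-Z][A-Za-z' ]+-?:\n)"

# Split just before each speaker line; drop the empty chunk before the first match
chunks = [c for c in re.split(pattern, sample, flags=re.MULTILINE) if c.strip()]

# Flatten each chunk onto one line
flat = [c.replace("\n", " ").strip() for c in chunks]

for block in flat:
    print(block)
```

Each chunk keeps its speaker header, so the flattened blocks already have the `SPEAKER: speech` shape.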

---

#### ✅ Result

Each item in the final list looks like:

```python
"CORIOLANUS: Thank you, sir: farewell. O world, thy slippery turns! ..."
```

Keeping each speaker and speech together on one line gives GPT-2 self-contained dialogue turns to learn from.


In [None]:
import re

def split_by_speakers(text):
    # Zero-width lookahead for speaker lines such as "First Servingman:\n".
    # The optional dash sits outside the character class, directly before the colon.
    pattern = r'(?=^[A-Z][A-Za-z\' ]+-?:\n)'

    # Split just before each speaker line. The lookahead consumes nothing, so every
    # chunk still begins with its own speaker header (requires Python 3.7+).
    parts = re.split(pattern, text, flags=re.MULTILINE)

    # Drop empty chunks and flatten each block onto a single line
    full_blocks = [part.replace("\n", " ").strip() for part in parts if part.strip()]

    return full_blocks

In [None]:
from datasets import Dataset

def load_shakespeare_blocks(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        raw_text = f.read()
    cleaned_blocks = split_by_speakers(raw_text)
    return cleaned_blocks

# Load data
file_path = "./data/input.txt"  # Make sure the file exists at this path, relative to the working directory
text_blocks = load_shakespeare_blocks(file_path)

# Create a Dataset
dataset = Dataset.from_dict({"text": text_blocks})
dataset = dataset.train_test_split(test_size=0.2, seed=42)  # hold out 20% of the blocks for evaluation

dataset


In [None]:
print(dataset['train'][2])
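
To make the 80/20 split concrete, here is a rough plain-Python sketch of the idea behind `train_test_split`: shuffle with a fixed seed, then hold out a fraction. The helper `simple_train_test_split` and the `blocks` list are hypothetical, and the shuffling differs from what `datasets` does internally, so the exact rows selected will not match.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` and hold out a `test_size` fraction (illustrative helper)."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(items)         # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

# Hypothetical stand-in for the speaker blocks
blocks = [f"SPEAKER {i}: line {i}" for i in range(10)]
split = simple_train_test_split(blocks)

print(len(split["train"]), len(split["test"]))  # 8 2
```

Fixing the seed matters because the test set should stay the same across runs when comparing training configurations.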

## 3. Load GPT-2 Model and Tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS to pad texts shorter than max_length

model = AutoModelForCausalLM.from_pretrained(model_id)
model

## 4. Tokenize the Dataset

In [None]:
def tokenize(text):
    tokens = tokenizer(
        text["text"],
        truncation=True,
        max_length=64,
        padding="max_length",
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_train = dataset["train"].map(tokenize, remove_columns=["text"])
tokenized_test = dataset["test"].map(tokenize, remove_columns=["text"])

tokenized_train
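
The effect of `truncation=True` with `padding="max_length"` can be sketched in plain Python. This is only an illustration of the shape of the output for one sequence, not the tokenizer's implementation: the helper `pad_or_truncate`, the toy token ids, and the small `max_length=8` are assumptions. The pad id 50256 is GPT-2's end-of-text token, and padding goes on the right, which is the GPT-2 tokenizer's default side.

```python
EOS_ID = 50256  # GPT-2's end-of-text token id, reused as the pad id in this tutorial

def pad_or_truncate(ids, max_length=8, pad_id=EOS_ID):
    """Mimic truncation=True with padding="max_length" for a single sequence."""
    ids = ids[:max_length]                 # truncation: keep at most max_length tokens
    attention_mask = [1] * len(ids)        # 1 marks real tokens
    n_pad = max_length - len(ids)          # number of pad positions to append
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }

short = pad_or_truncate([10, 11, 12])        # shorter than max_length: padded
long = pad_or_truncate(list(range(20)))      # longer than max_length: truncated

print(short["input_ids"])
print(short["attention_mask"])
print(long["input_ids"])
```

With `max_length=64` in the cell above, any block longer than 64 tokens loses its tail, so it is worth checking how many blocks are affected before settling on that value.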

## 5. Set Up Data Collator and Training Arguments

In [None]:
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

training_args = TrainingArguments(
    output_dir="gpt2-shakespeare-finetuned",
    num_train_epochs=2,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    logging_steps=10,
    learning_rate=5e-5,
)
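
With `mlm=False`, the collator builds `labels` from `input_ids` for each batch and replaces pad positions with `-100`, the value the loss ignores. The plain-Python sketch below illustrates that masking step only; the helper `mask_padding` and the toy token ids are hypothetical.

```python
PAD_ID = 50256        # pad id; equal to GPT-2's EOS id because pad_token = eos_token
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss

def mask_padding(input_ids, pad_id=PAD_ID):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Toy sequence: three real tokens followed by two pad tokens
labels = mask_padding([464, 2068, 7586, PAD_ID, PAD_ID])
print(labels)  # [464, 2068, 7586, -100, -100]
```

Two consequences follow. The `labels` column created in the tokenize step is overwritten by the collator, so it is harmless but not required. And because the pad token is the EOS token, any genuine end-of-text token in a sample is masked out of the loss as well.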

## 6. Train the Model

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
)

trainer.train()
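
The evaluation loss logged after each epoch is a mean cross-entropy, which is easier to interpret as perplexity. A minimal sketch follows; the `perplexity` helper is hypothetical, and the commented usage assumes the `trainer` object from the cell above.

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_cross_entropy_loss)

# Hypothetical usage once training has finished:
# eval_loss = trainer.evaluate()["eval_loss"]
# print(f"Perplexity: {perplexity(eval_loss):.2f}")

print(perplexity(0.0))  # 1.0, the perplexity of a model that is never surprised
```

Lower is better; comparing the value before and after fine-tuning gives a rough sense of how much the model has adapted to the Shakespeare text.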

## 7. Save the Fine-Tuned Model

In [None]:
model.save_pretrained("gpt2-shakespeare-finetuned")
tokenizer.save_pretrained("gpt2-shakespeare-finetuned")
print("Model saved!")

## 8. Load the Fine-Tuned Model for Text Generation

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Load fine-tuned model and tokenizer
model_path = './gpt2-shakespeare-finetuned'  # Folder where the fine-tuned Shakespeare model was saved
tokenizer = AutoTokenizer.from_pretrained(model_path)  # AutoTokenizer picks the right tokenizer class from the saved files
model = AutoModelForCausalLM.from_pretrained(model_path)  # Load the fine-tuned weights from the same folder

# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)

## 9. Generate Shakespearean Text

In [None]:
# Provide a prompt to the model
prompt = "To be, or not to be"
results = generator(prompt, max_length=100, num_return_sequences=1)

print("Generated Text:\n")
print(results[0]["generated_text"])
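
For more varied output, sampling options can be passed through the pipeline to `generate`. The values below are illustrative starting points, not tuned settings, and the commented call assumes the `generator` and `prompt` defined above.

```python
# Illustrative sampling settings (assumed values, adjust to taste)
generation_kwargs = {
    "max_new_tokens": 80,       # length of the continuation, not counting the prompt
    "do_sample": True,          # sample instead of always taking the most likely token
    "top_k": 50,                # consider only the 50 most likely next tokens
    "top_p": 0.95,              # nucleus sampling: keep the top 95% of probability mass
    "temperature": 0.8,         # below 1.0 makes the distribution sharper
    "num_return_sequences": 1,
}

# Hypothetical usage with the pipeline created earlier:
# results = generator(prompt, **generation_kwargs)
# print(results[0]["generated_text"])
```

Unlike `max_length`, which counts the prompt tokens too, `max_new_tokens` bounds only the generated part.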