# JPT-2 (Joker Pre-trained and Tuned-Transformer)

<img src="https://i.postimg.cc/GmnXj8Ld/Gemini-Generated-Image-3xdk643xdk643xdk.png" width="300" height="300" alt="JPT-2 Logo">

# Overview
- The purpose of this project is to fine-tune GPT-2 into JPT-2 (Joker Pre-trained and Tuned-Transformer).
- *Why so serious..?*


In [113]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="jocker_lines.jsonl")

In [114]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 470
    })
})

In [115]:
dataset = dataset['train'].train_test_split(test_size=0.1, shuffle=True, seed=42)

In [116]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 423
    })
    test: Dataset({
        features: ['text'],
        num_rows: 47
    })
})

In [117]:
print(dataset['train'][0])
print(dataset['test'][0])

{'text': 'I love the smell of fear in the morning. It smells like... victory.'}
{'text': "I'm not clumsy. I'm just a random act of violence waiting to happen."}
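
The `json` loader reads one JSON object per line, and the `features: ['text']` output above suggests that each object carries a single `text` field. Here is a minimal sketch of that file layout using only the standard library. The file name `sample_lines.jsonl` is hypothetical, and the two records are the samples printed above.

```python
import json
import os
import tempfile

# Two sample records, copied from the printed examples above.
records = [
    {"text": "I love the smell of fear in the morning. It smells like... victory."},
    {"text": "I'm not clumsy. I'm just a random act of violence waiting to happen."},
]

# Hypothetical file name, written to a temporary directory for illustration.
path = os.path.join(tempfile.mkdtemp(), "sample_lines.jsonl")

# JSON Lines layout: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back, skipping blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "rows, fields:", sorted(loaded[0].keys()))
```

A file in this shape should be loadable with `load_dataset("json", data_files=...)` as in the cell above.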


In [118]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # Make new token for padding

In [119]:
tokenizer.pad_token

'[PAD]'

In [120]:
def tokenizer_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=64)

tokenized_datasets = dataset.map(
    tokenizer_function,
    batched=True,
    remove_columns=["text"]
)

In [121]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 423
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 47
    })
})
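
To make `padding="max_length"` and `truncation=True` concrete, here is a toy sketch of what happens to a single example. It is not the tokenizer's actual implementation. The token ids are made up, the helper name `pad_or_truncate` is hypothetical, and right-side padding is assumed. The pad id `50257` is the value reported in the training log further down.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Toy version of padding="max_length" with truncation=True (right-side padding)."""
    ids = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)     # number of pad tokens to append
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

PAD_ID = 50257  # id of the added [PAD] token, as reported in the training log

# Made-up token ids for a short line, padded to a length of 8.
short = pad_or_truncate([40, 1842, 262], max_length=8, pad_id=PAD_ID)
print(short["input_ids"])
print(short["attention_mask"])

# A line longer than max_length is cut down to max_length tokens.
long = pad_or_truncate(list(range(10)), max_length=8, pad_id=PAD_ID)
print(long["input_ids"])
```

With `max_length=64` and short one-line quotes, most positions in each row are padding, which is why the `attention_mask` column matters.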

In [122]:
from transformers import GPT2LMHeadModel

base_model = GPT2LMHeadModel.from_pretrained("gpt2")
base_model.resize_token_embeddings(len(tokenizer))

Embedding(50258, 768)

In [123]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

lora_model = get_peft_model(base_model, lora_config)



In [124]:
lora_model.print_trainable_parameters()

trainable params: 147,456 || all params: 124,588,032 || trainable%: 0.1184
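
The trainable parameter count can be reproduced by hand. The sketch below rests on assumptions that the notebook does not print: that LoRA is attached only to the fused attention projection `c_attn` in each block (which I believe is PEFT's default target for GPT-2), and that GPT-2 small has 12 blocks with hidden size 768.

```python
# Assumed GPT-2 small dimensions (not printed in the notebook).
n_layers = 12
d_model = 768
c_attn_out = 3 * d_model  # fused query/key/value projection: 768 -> 2304

r = 4            # from LoraConfig above
lora_alpha = 16  # from LoraConfig above

# Each adapted layer gets two small matrices: A (r x d_in) and B (d_out x r).
params_per_layer = r * d_model + c_attn_out * r
trainable = n_layers * params_per_layer

all_params = 124_588_032  # "all params" from the printed output above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the result matches the printed `147,456` and `0.1184`, so only about a tenth of a percent of the weights are updated during training.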


In [125]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="jpt2-lora",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="epoch",
    save_strategy="no",
    report_to="none",
    weight_decay=0.01
)

In [126]:
from transformers import DataCollatorForLanguageModeling

# The data collator builds the `labels` column from `input_ids`.
# It plays the same role as `result["labels"] = result["input_ids"].copy()`.
# It also pads each batch to a common length if the examples are not already padded.
# `mlm=False` means the collator does not mask any tokens (causal LM objective).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
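
As a rough illustration of what the collator does with `mlm=False`, the toy sketch below copies `input_ids` into `labels` and replaces padding positions with `-100`, so they do not contribute to the loss. It is a simplified stand-in, not the library code. The helper name `make_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def make_labels(input_ids, pad_id):
    """Copy input_ids into labels, hiding padding positions from the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

PAD_ID = 50257  # id of the added [PAD] token

# Made-up ids: three real tokens followed by two pad tokens.
input_ids = [40, 1842, 262, PAD_ID, PAD_ID]
labels = make_labels(input_ids, PAD_ID)
print(labels)
```

The shift by one position for next-token prediction happens inside the model's loss computation, so the labels here stay aligned with `input_ids`.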

In [127]:
from transformers import Trainer

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)



In [128]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50257}.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 3.553508        |
| 2     | No log        | 3.534223        |


TrainOutput(global_step=212, training_loss=3.5659856256449, metrics={'train_runtime': 738.503, 'train_samples_per_second': 1.146, 'train_steps_per_second': 0.287, 'total_flos': 27679535529984.0, 'train_loss': 3.5659856256449, 'epoch': 2.0})
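
A quick sanity check on these numbers, as a sketch. It assumes a single device, no gradient accumulation, and the default `logging_steps` of 500, none of which are set explicitly in the notebook. Perplexity is computed as `exp(loss)`.

```python
import math

train_rows = 423   # size of the train split
batch_size = 4     # per_device_train_batch_size
epochs = 2         # num_train_epochs

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)

# Assumed default logging interval; the run is shorter than one interval,
# which would explain the "No log" entries for the training loss.
assumed_logging_steps = 500
print("training loss logged during the run:", total_steps >= assumed_logging_steps)

# Validation perplexity per epoch, from the validation losses above.
perplexities = {epoch: math.exp(loss) for epoch, loss in [(1, 3.553508), (2, 3.534223)]}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity ~ {ppl:.1f}")
```

The step count agrees with `global_step=212`, and the validation perplexity barely moves between the two epochs.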

In [129]:
merged_model = lora_model.merge_and_unload()

merged_model.save_pretrained("./jpt2")

In [130]:
from transformers import AutoModelForCausalLM, GPT2Tokenizer, pipeline

jpt2 = AutoModelForCausalLM.from_pretrained("./jpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

generator = pipeline("text-generation", model=jpt2, tokenizer=tokenizer)

Device set to use cpu


In [135]:
prompt = "Why so serious about this city?"

outputs = generator(prompt,
                    max_new_tokens=64,
                    num_return_sequences=3,
                    do_sample=True,
                    top_k=50)

print(f"--- Prompt: {prompt} ---")
for i, output in enumerate(outputs):
    print(f"Result {i+1}: {output['generated_text']}")

--- Prompt: Why so serious about this city? ---
Result 1: Why so serious about this city? I would love to be able to walk my dog, but it just doesn't seem to be very productive. I know that I'm not in a position to buy a house, but I feel that this city needs to move away from the current housing model that is the key to revitalizing the area. I want to
Result 2: Why so serious about this city? Why do we have these people who are taking things into their own hands?"

"I don't think so," said the young woman, who appeared to be more frightened than she was.

"The only things that can really save us," said the woman, "are the people who are fighting back.
Result 3: Why so serious about this city? Well, that depends on how you look at it.

The city's population is about one million, and it's already well below the national average.

That means, of course, that it's not really the biggest city in the country, but it's still one of the most diverse.
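
For readers unfamiliar with `do_sample=True` and `top_k=50`, here is a toy sketch of top-k sampling over a five-token vocabulary. It is a simplified illustration, not the `generate` implementation. The logits and the helper name `top_k_sample` are made up.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary

# With k=2, only token 0 and token 1 can ever be drawn.
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
print(sorted(set(samples)))
```

Because the choice is random at every step, the three results above differ from each other even though the prompt is the same.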




# Result
- As the samples above show, the model failed to mimic the Joker's way of speaking.
- To improve the output, I even primed the prompt with a Joker-style phrase, for example "Why so serious about this city?".
- Even so, these experiments showed me the kinds of problems that appear when fine-tuning goes wrong:
  1. **Not working as intended**: The most common problem. Training does not take hold, so the desired Joker-style output never appears, and the output is almost identical to that of the model before fine-tuning.
  2. **Model collapse**: The model gets trapped in a specific pattern in order to reduce the loss.
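
One rough way to spot the "trapped in a specific pattern" failure is to measure how repetitive a generated text is. The sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) on whitespace tokens. This was not part of the original experiment. The helper name `distinct_n` and the looping string are hypothetical, and the varied sentence is taken from Result 3 above.

```python
def distinct_n(text, n=2):
    """Share of unique n-grams among all n-grams in `text` (whitespace tokens)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text too short to form any n-gram
    return len(set(ngrams)) / len(ngrams)

varied = "Well, that depends on how you look at it."  # from Result 3 above
looping = "ha ha ha ha ha ha ha ha"                   # hypothetical collapsed output

print("varied :", distinct_n(varied))
print("looping:", distinct_n(looping))
```

A value near 1.0 means little repetition, while a value near 0 hints that the output is looping on the same phrase.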

## Why did this project fail?
Here are the likely causes of the failure.
1. **Lack of data**: The project needed a dataset of Joker-style lines, which was hard to collect, so I generated one with 'Gemini-2.5-pro'. The resulting dataset was still small (470 lines) and low in quality, so the model could not learn what makes the Joker's speech distinctive.
2. **Model size**: The `gpt2` checkpoint used here has about 124M parameters (see the parameter count printed above), which is close to GPT-1's 117M and far smaller than the 1.5B of the largest GPT-2 model. That is probably not enough to learn the Joker's speech, which is ambiguous and sophisticated. A larger model might have given better results.
3. **Prompt**: In the first test, I used "City" as the prompt. "City" is a very general, predictable word, so the outputs were general and predictable too. After priming the prompt, the output improved somewhat.

# Final Thoughts
This project was a practical lesson in the challenges of fine-tuning a persona. Our attempt was ultimately constrained by a combination of three factors: a dataset too small to convey the character's complexity, a model (GPT-2) whose scale was insufficient to override its powerful pre-training, and the clear necessity of specific, guiding prompts to activate the new style. This demonstrates that creating a convincing persona is a careful balance between the quality of data, the capacity of the model, and the art of the prompt.