In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [2]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

# Go through data

In [3]:
df = pd.read_csv("/kaggle/input/filtered-ielts-writing-dataset/dataset_for_generating_evaluation.csv")
# Get essay with band greater than 5.5
df = df[df["computed_band"] >= 6.0][["prompt", "essay"]]

In [4]:
df

Unnamed: 0,prompt,essay
0,Interviews form the basic criteria for most la...,It is believed by some experts that the tradit...
2,Interview form the basic selection criteria fo...,The interview section is the most vital part o...
11,Interviews form the basic selection criteria f...,It is undeniable that most companies rely on i...
12,Interviews form the basic selecting criteria f...,"Nowadays, most companies employ workers after ..."
14,Interviews form the basic selection criteria f...,Interviews are commonly used as a way to scree...
...,...,...
10269,"As well as making money , businesses also have...",Businesses have sets of principles. Earning pr...
10270,"As well as making money, businesses also have ...",It is true that businesses need to make a prof...
10271,"As well as making money, businesses also have ...",The role of companies is to produce all the go...
10272,"As well as making money, businesses also have ...",Although earning money is one of the most impo...


In [5]:
df = df.reset_index()

In [6]:
df

Unnamed: 0,index,prompt,essay
0,0,Interviews form the basic criteria for most la...,It is believed by some experts that the tradit...
1,2,Interview form the basic selection criteria fo...,The interview section is the most vital part o...
2,11,Interviews form the basic selection criteria f...,It is undeniable that most companies rely on i...
3,12,Interviews form the basic selecting criteria f...,"Nowadays, most companies employ workers after ..."
4,14,Interviews form the basic selection criteria f...,Interviews are commonly used as a way to scree...
...,...,...,...
6383,10269,"As well as making money , businesses also have...",Businesses have sets of principles. Earning pr...
6384,10270,"As well as making money, businesses also have ...",It is true that businesses need to make a prof...
6385,10271,"As well as making money, businesses also have ...",The role of companies is to produce all the go...
6386,10272,"As well as making money, businesses also have ...",Although earning money is one of the most impo...


In [7]:
step = 5000
for i in range(500, len(df), step):
    print("Prompt:", df["prompt"][i])
    print("Essay:", df["essay"][i])


Prompt: Some people believe that the government should take care of old people and provide financial support after they retire. Others say individuals should save during their working years to fund their own retirement. What is your opinion? Give reasons for your answer and include examples from your own experience.
Essay: People have different views on whether a government should support the elderly and retired people financially or not. I believe that it is mostly an individual’s duty to save funds for their retirement, but I totally disagree that elderly people shouldn't receive any support from the state. The combination of personal support and the government’s assistance could be the best possible solution for the retired elderly people.

I think the regime should support the elderly people financially as this is the part of a social democracy which states the equality of opportunities and distribution of resources fairly. For example, many developed countries like Germany, the Un

In [8]:
n_test = 40

# test set: last 40 rows
test_df = df.iloc[-n_test:].reset_index(drop=True)

# train set: all the other rows
train_df = df.iloc[:-n_test].reset_index(drop=True)

In [9]:
test_df

Unnamed: 0,index,prompt,essay
0,10208,Some people think the best way to solve enviro...,It is commonly believed by many that inflating...
1,10209,Some people think that one of the best ways to...,The earth's average surface temperatures are i...
2,10210,Some people believe that the government should...,Some individuals suppose that the government s...
3,10211,Some people believe that the government should...,"Over the last 100 years, population has grown ..."
4,10212,Some people believe that it is the government’...,Some people believe that the government should...
5,10214,Some people think that the main purpose of sch...,Some people are of the opinion that the primar...
6,10215,Some people think the main purpose of school i...,The young generation of each country is its ma...
7,10216,Some people think the main purpose of school i...,there are several arguments about the institut...
8,10223,Some people believe that teenager should be re...,There are a lot of young people who go to do u...
9,10224,Some people think that all teenagers should be...,"Many youngsters work on a volunteer basis, and..."


In [10]:
text = "Prompt: " + test_df["prompt"][2] + "\nEssay: " + test_df["essay"][2]

In [11]:
print(text)

Prompt: Some people believe that the government should not spend money on international aid when they have their own disadvantaged people like homeless and unemployed. To what extent do you agree or disagree?
Essay: Some individuals suppose that the government should allocate financial resources to other impoverished countries while others believe that it is of paramount importance to focus on solving their domestic issues. From my point of view, I strongly agree with the proposal of determining national priority on internal problems.

On the one hand, it is undeniable that cross-border support is a symbol of humanity. Geographically, people are divided into different ethnicities but by nature, it will be the core of connecting features among individuals. Thus, aiding the poor stems from the consciousness and heart of each resident and is not solely contingent on monetary contributions. Not to mention that helping other nations to encounter  financial crises would eliminate the chanc

## Custom Dataset

In [12]:
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling, GPT2Config

2025-05-23 08:33:07.944938: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747989188.149077      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747989188.204571      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [13]:
class PromptEssayDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.examples = []
        for _, row in dataframe.iterrows():
            prompt = row['prompt'].strip()
            essay = (row['essay'] + " <end_of_essay>").strip()
            # Combine prompt and essay with EOS separators
            text = prompt + tokenizer.eos_token + essay
            # Tokenize + pad + truncate in one go (fast tokenizer)
            enc = tokenizer(
                text,
                truncation=True,
                max_length=self.max_length,
                padding='max_length',
                return_tensors='pt'
            )
            self.examples.append({
                'input_ids': enc['input_ids'].squeeze(),
                'attention_mask': enc['attention_mask'].squeeze(),
                'labels': enc['input_ids'].squeeze().clone()
            })

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

## Get tokenizer of GPT2

In [14]:

# Import tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
special_tokens_dict = {
    'eos_token': '',
    'pad_token': '',
    'additional_special_tokens': ['<end_of_essay>']
}
tokenizer.add_special_tokens(special_tokens_dict)

dataset = PromptEssayDataset(df, tokenizer, max_length=512)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

# Model

In [15]:
from transformers import GPT2Config

# Load cogfig of gpt2
config = GPT2Config.from_pretrained("gpt2-medium", loss_type="causal_lm")

""" Load model with config """
model = GPT2LMHeadModel.from_pretrained("gpt2-medium", config=config)

model.resize_token_embeddings(len(tokenizer))

model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 1024)

## Trainer api

In [16]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-essay-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=100,
    logging_dir="./logs",
    logging_steps=50,
    save_steps=250,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    report_to="none",
)

# 4. Data collator (just handles LM labels)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# 5. Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)


## Train model

In [17]:
print("Start Training")
trainer.train()
print("Train OK!")

Start Training


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
50,3.4062
100,3.012
150,2.8691
200,2.7874
250,2.7499
300,2.7214
350,2.6857
400,2.6869
450,2.6601
500,2.6753


Train OK!


## Save model

In [18]:
# Save
trainer.save_model("./gpt2-essay-finetuned")

## Call model if have

In [19]:
device = 'cuda' # 'cpu'

In [20]:
# # This model i finetune like above in anorther version
# from transformers import AutoTokenizer, AutoModelForCausalLM
# tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/fine-tunedgpt2/gpt2-essay-finetuned")
# model = AutoModelForCausalLM.from_pretrained("/kaggle/input/fine-tunedgpt2/gpt2-essay-finetuned")

In [21]:
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50258, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50258, bias=False)
)

# Generation Function

In [22]:
def generate_essay(prompt: str, device='cuda', max_length: int = 512):
    # Encode the input prompt with the EOS token
    input_ids = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors="pt").to(device)

    # Generate output
    output = model.generate(
        input_ids,
        max_length=input_ids.shape[-1] + max_length,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<end_of_essay>"),
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
        num_return_sequences=1
    )

    # Decode and post-process output
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Optional: Cut at the special end token if still included in output
    end_token = "<end_of_essay>"
    if end_token in generated_text:
        generated_text = generated_text.split(end_token)[0]

    return generated_text[len(prompt):].strip()


# Result

In [23]:
prompt = "Write an essay about the importance of artificial intelligence in modern society."
continuation = generate_essay(prompt)
print("Prompt:\n", prompt)
print("\nGenerated essay:\n", continuation)


Prompt:
 Write an essay about the importance of artificial intelligence in modern society.

Generated essay:
 Discuss the following topics.There are two main factors that motivate people to invent artificial intelligence. First, it is a matter of increasing the productivity of the workforce. In other words, AI is an artificial intelligence that can easily replace human labour. Secondly, it is a matter of improving the quality of education. In this essay, I will discuss the reasons for both these factors.

Firstly, Artificial Intelligence can be considered as a replacement for human labour. When people work with AI, they are performing repetitive tasks which results in a reduction in the quality of education. For example, when the number of pupils in school increases, so does the number of teachers. Therefore, the quality of education will be reduced.

Secondly, the quality of education will be improved with the advancement of AI. This will help to increase the productivity of the w

In [24]:
prompt = "The best way to solve world’s environmental problem is to increase the cost of fuel for cars and other vehicles. To what extent do you agree or disagree?"
continuation = generate_essay(prompt)
print("Prompt:\n", prompt)
print("\nGenerated essay:\n", continuation)


Prompt:
 The best way to solve world’s environmental problem is to increase the cost of fuel for cars and other vehicles. To what extent do you agree or disagree?

Generated essay:
 The world’s environmental problem is the major concern for the next generations. To address this problem, there are two main solutions that need to be considered. I completely agree that the most suitable way is to raise the cost of fuel for cars and other vehicles.

To begin with, the number of people who use personal vehicles is increasing due to the cost of fuel. In other words, people have to spend more money to travel to work or school. In order to reduce the burden on the government, the ministry should invest in public transport services, such as buses, trains or subways, that provide passengers with more comfortable and convenient transportation. In addition to this, people also need to use public transport to work and school, which can provide them with a good quality of life. Moreover, they can 

In [25]:
prompt = "Some people think that all teenagers should be required to do unpaid work in their free time to help the local community. They believe this would benefit both the individual teenager and society as a whole.\
Do you agree or disagree?"
continuation = generate_essay(prompt, device)
print("Prompt:\n", prompt)
print("\nGenerated essay:\n", continuation)

Prompt:
 Some people think that all teenagers should be required to do unpaid work in their free time to help the local community. They believe this would benefit both the individual teenager and society as a whole.Do you agree or disagree?

Generated essay:
 It is believed by a section of society that all children should participate in unpaid work during their leisure time to help the local community and this may be beneficial to both of them and society. I completely agree with this notion and will explain my opinion in the following paragraphs. 

To begin with, I believe this is a beneficial idea for the youngster as they will have time to develop their skills. In addition, there are many people who need some volunteer work to sustain their current financial status. For instance, a lot of people can not afford to pay for school fees. However, their child will help them by doing some unpaid work to earn money and pay for their schooling. 

Secondly, the community will be benefite