# Task 1: Text Generation with GPT-2

### Installing Dependencies

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install transformers datasets accelerate

Looking in indexes: https://download.pytorch.org/whl/cu121


### Importing the needed Libraries

In [2]:
import os  # Operating-system utilities

import torch  # PyTorch, used for the GPU checks and device placement
from transformers import (
    GPT2Tokenizer,                    # Tokenizer for GPT-2
    GPT2LMHeadModel,                  # GPT-2 with a language-modeling head
    TextDataset,                      # Loads a text file as fixed-size token blocks
    DataCollatorForLanguageModeling,  # Builds batches and labels for LM training
    Trainer,                          # Training loop
    TrainingArguments,                # Training hyperparameters
    pipeline,                         # High-level inference helper
)

2025-06-30 11:04:31.952182: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-30 11:04:32.046993: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751261672.084656    5828 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751261672.095008    5828 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1751261672.175186    5828 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

### Checking for GPU Availability

In [3]:
# Stop early if PyTorch cannot see a CUDA-capable GPU
if not torch.cuda.is_available():
    raise RuntimeError("GPU Unavailable, Ensure PyTorch was installed with CUDA support.")

print("CUDA Available", torch.cuda.is_available())  # Always True once the check above has passed
print("Using device:", torch.cuda.get_device_name())  # Name of the current CUDA device
print("PyTorch version:", torch.__version__)  # Installed PyTorch version
print("GPU Memory Summary:", torch.cuda.memory_summary())  # Allocator statistics for the current device

CUDA Available True
Using device: NVIDIA GeForce RTX 4070 Laptop GPU
PyTorch version: 2.5.1+cu121
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      

### Loading and Preparing Pretrained GPT-2 Model and Tokenizer

In [6]:
model_name = "gpt2"  # Defining the pretrained model's name
tokenizer = GPT2Tokenizer.from_pretrained(model_name)  # Defining the tokenizer for the GPT-2 model
model = GPT2LMHeadModel.from_pretrained(model_name)  # Loading the pre-trained GPT-2 language model with a language modeling head

# Set a pad token (GPT-2 does not define one by default)
tokenizer.pad_token = tokenizer.eos_token  # Reuses the EOS (end-of-sequence) token as the padding token
model.resize_token_embeddings(len(tokenizer))  # Keeps the embedding layer in sync with the tokenizer; the size stays at 50257 here because reusing EOS adds no new token

Embedding(50257, 768)
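
Reusing EOS as the pad token has one side effect worth knowing. With `mlm=False`, `DataCollatorForLanguageModeling` copies `input_ids` into `labels` and sets every position equal to the pad token ID to `-100`, the value the cross-entropy loss ignores. The pure-Python sketch below imitates that masking step on a toy batch. `mask_pad_labels` is a hypothetical helper written for illustration, and every token ID except 50256 (GPT-2's `<|endoftext|>` ID) is made up.

```python
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss
GPT2_EOS_ID = 50256   # ID of <|endoftext|> in the GPT-2 vocabulary


def mask_pad_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]


# Toy batch: row 1 is padded at the end, row 2 contains a genuine EOS in the middle
batch = [
    [464, 2068, 7586, GPT2_EOS_ID, GPT2_EOS_ID],
    [1212, 318, GPT2_EOS_ID, 257, 1332],
]
labels = mask_pad_labels(batch, pad_token_id=GPT2_EOS_ID)
print(labels)
```

Because padding and EOS share one ID, the genuine EOS in the second row is masked too, so the model receives no training signal at that position. This only matters if the training text contains `<|endoftext|>` or if batches are actually padded.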

### Preparing the Dataset and Data Collator for Language Model Training

In [7]:
def load_dataset(file_path, tokenizer, block_size=128):
    '''
    Loads a text file into a format suitable for training a language model.

    Args:
        file_path (str): Path to the training text file.
        tokenizer (PreTrainedTokenizer): The tokenizer used to encode the text.
        block_size (int): The length, in tokens, of each training block.

    Returns:
        TextDataset: A dataset object containing tokenized text in blocks.
    '''
    return TextDataset( #Returns TextDataset
        tokenizer=tokenizer, #Tokenizer used to tokenize the input text
        file_path=file_path, #Path to the text file to be loaded
        block_size=block_size, #Maximum sequence length per training example
    )

train_file = 'shake.txt' #Path to the text file used to train the language model

# Load the dataset using the helper function and tokenizer
dataset = load_dataset(train_file, tokenizer)  # Tokenizes the file into blocks of 128 tokens (the default block_size)
# Define a data collator that dynamically pads batches and prepares them for language modeling
# 'mlm=False' means this is for causal (auto-regressive) language modeling like GPT-2
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
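
To make the role of `block_size` concrete, here is a minimal sketch of the chunking that `TextDataset` is understood to perform: the whole file is tokenized once, then cut into consecutive, non-overlapping blocks, and an incomplete final block is dropped. It assumes the tokenizer adds no special tokens per block, which holds for GPT-2. `chunk_into_blocks` is a hypothetical stand-in, not the library's code, and the integers stand in for token IDs.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat token list into consecutive full blocks of block_size tokens."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


tokens = list(range(10))  # stand-in for the token IDs of a tokenized file
blocks = chunk_into_blocks(tokens, block_size=4)
print(blocks)  # two full blocks; the trailing tokens 8 and 9 are dropped
```

Since every block has the same length, the collator has nothing to pad for this dataset. Recent `transformers` releases mark `TextDataset` as deprecated in favor of the `datasets` library, so a newer notebook may need a different loading path.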



### Configuring Training Arguments for GPT-2 Fine-Tuning

In [29]:
training_args = TrainingArguments(
    output_dir="./gpt2-shakespeare-finetuned2",       # Directory to save the model and checkpoints
    overwrite_output_dir=True,                        # Overwrites the output directory if it exists
    num_train_epochs=5,                               # Number of training epochs (passes through the entire dataset)
    per_device_train_batch_size=5,                    # Batch size per GPU/CPU during training
    gradient_accumulation_steps=1,                    # Number of steps to accumulate gradients before updating model weights
    save_steps=500,                                   # Save a checkpoint every 500 steps
    save_total_limit=2,                               # Maximum number of checkpoints to keep (older ones are deleted)
    logging_dir="./logs",                             # Directory to store training logs for TensorBoard or other tools
    logging_steps=100,                                # Log training metrics every 100 steps
    fp16=True,                                        # Use 16-bit (mixed) precision training if supported by the hardware
    report_to="none",                                 # Disable integration with logging/reporting tools like WandB or TensorBoard
    dataloader_pin_memory=True,                       # Improves performance by enabling faster data transfer to GPU
)

### Setting Up Device for Model Training and Inference

In [30]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Selects GPU ('cuda') if available, otherwise defaults to CPU
model.to(device)                                                       # Moves the model to the selected device (GPU or CPU)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### Training and Saving the Fine-Tuned GPT-2 Model and Tokenizer

In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [32]:
trainer.train()  # This starts the training process

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 100  | 3.9648 |
| 200  | 3.6408 |
| 300  | 3.6319 |
| 400  | 3.5979 |
| 500  | 3.5362 |
| 600  | 3.4149 |
| 700  | 3.3541 |
| 800  | 3.3262 |
| 900  | 3.3265 |
| 1000 | 3.3551 |


TrainOutput(global_step=2640, training_loss=3.261607976393266, metrics={'train_runtime': 263.6034, 'train_samples_per_second': 50.075, 'train_steps_per_second': 10.015, 'total_flos': 862263705600000.0, 'train_loss': 3.261607976393266, 'epoch': 5.0})
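
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below copies the reported values, estimates how many training blocks the dataset holds, and converts the mean training loss into a perplexity. It assumes a single GPU, so the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`. The perplexity is computed on the training data only, since this notebook uses no held-out set.

```python
import math

# Values copied from the TrainOutput above
global_step = 2640
epochs = 5
train_runtime_s = 263.6034
samples_per_second = 50.075
mean_train_loss = 3.261607976393266

# Hyperparameters from TrainingArguments (single GPU assumed)
batch_size = 5
grad_accum = 1

steps_per_epoch = global_step // epochs
samples_seen = round(train_runtime_s * samples_per_second)  # total blocks processed
blocks_per_epoch = samples_seen // epochs                   # estimated dataset size
expected_steps = math.ceil(blocks_per_epoch / (batch_size * grad_accum))
perplexity = math.exp(mean_train_loss)

print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"estimated blocks per epoch: {blocks_per_epoch}")
print(f"steps per epoch implied by batch size: {expected_steps}")
print(f"training perplexity: {perplexity:.2f}")
```

The step count implied by the batch size matches the reported steps per epoch, which is a quick way to confirm that the whole dataset was consumed in every epoch.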

In [33]:
trainer.save_model("./gpt2-shakespeare-finetuned2")         # Saves the fine-tuned model to the specified directory
tokenizer.save_pretrained("./gpt2-shakespeare-finetuned2")  # Saves the tokenizer configuration and vocabulary to the same directory

('./gpt2-shakespeare-finetuned2/tokenizer_config.json',
 './gpt2-shakespeare-finetuned2/special_tokens_map.json',
 './gpt2-shakespeare-finetuned2/vocab.json',
 './gpt2-shakespeare-finetuned2/merges.txt',
 './gpt2-shakespeare-finetuned2/added_tokens.json')

### Generating Text Using the Fine-Tuned GPT-2 Model

In [35]:
# Generation
generator = pipeline("text-generation", model="./gpt2-shakespeare-finetuned2", tokenizer=tokenizer, device=0)
# Creates a text generation pipeline from the directory the fine-tuned model was saved to above
# 'device=0' assigns the pipeline to the first CUDA GPU

prompt = "Shakespeare Quote"  # Input text prompt that the model continues

output = generator(prompt, max_length=100, num_return_sequences=1)
# Requests a total length of 100 tokens (prompt included)
# If the generation config also sets 'max_new_tokens', that value takes precedence over 'max_length'
# Returns 1 generated sequence

print(output[0]["generated_text"])  # Prints the generated text from the first sequence in the output

Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Shakespeare Quote: Originally Posted by I think the only way I would go about it is to say that I don't really believe in a god. I would like to believe that there is something beyond the divine, but it's really hard to explain why. I would like to believe that there is something beyond the divine, but it's really hard to explain why.

You should have to explain why.

I think there is something beyond the divine but it's really hard to explain why.


I would like to believe that there is something beyond the divine but it's really hard to explain why.I think there is something beyond the divine but it's really hard to explain why.

There is a real possibility that there is. Like the Lord is the Creator and will be forever. There is a real possibility that there is. Like the Lord is the Creator and will be forever.

Sorcery (1/2):

I think there is something beyond the divine but it's really hard to explain why. I would like to believe that there is something beyond the divine but it's 
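
The sample above repeats itself heavily, which is common when decoding settings are left at their defaults. The sketch below shows one way to pass explicit generation parameters through the pipeline. The parameter names (`max_new_tokens`, `do_sample`, `top_k`, `top_p`, `temperature`, `repetition_penalty`, `no_repeat_ngram_size`) are standard `generate` arguments, but the values are untuned assumptions, and `build_generation_kwargs` and `generate_sample` are hypothetical helpers. The model path is assumed to be the directory the fine-tuned model was saved to.

```python
def build_generation_kwargs(max_new_tokens=100):
    """Return illustrative decoding settings (values are untuned assumptions)."""
    return {
        "max_new_tokens": max_new_tokens,  # caps only the generated continuation
        "do_sample": True,                 # sample instead of greedy decoding
        "top_k": 50,                       # keep the 50 most likely next tokens
        "top_p": 0.95,                     # nucleus sampling threshold
        "temperature": 0.8,                # values below 1 sharpen the distribution
        "repetition_penalty": 1.2,         # down-weight tokens already generated
        "no_repeat_ngram_size": 3,         # never repeat the same 3-gram
        "num_return_sequences": 1,
    }


def generate_sample(prompt, model_dir="./gpt2-shakespeare-finetuned2"):
    """Hypothetical helper: load the saved model and generate one continuation."""
    from transformers import pipeline  # lazy import; needs the saved model and a GPU

    generator = pipeline("text-generation", model=model_dir, device=0)
    outputs = generator(prompt, **build_generation_kwargs())
    return outputs[0]["generated_text"]


gen_kwargs = build_generation_kwargs()
print(gen_kwargs)
```

Calling `generate_sample("Shakespeare Quote")` would then produce one continuation of at most 100 new tokens. Setting only `max_new_tokens` also avoids the `max_length` conflict reported in the warning above.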