# Task 1: Text Generation with GPT-2

### 1. Environment Setup: Installing Dependencies

Before we begin, we need to install the essential Python libraries. This cell handles the installation of:

* **PyTorch (`torch`, `torchvision`, `torchaudio`)**: An open-source machine learning framework that provides the building blocks for defining and training neural networks. We install it from the `cu121` index URL so that the packages are built against CUDA 12.1 for GPU acceleration.
* **Hugging Face Libraries**:
    * `transformers`: Provides access to thousands of pre-trained models like GPT-2 and the tools needed to download, configure, and train them.
    * `datasets`: A library for easily accessing and processing large datasets.
    * `accelerate`: A library that simplifies running PyTorch training across different hardware setups (like single GPU, multiple GPUs, or TPUs) with minimal code changes.

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install transformers datasets accelerate

Looking in indexes: https://download.pytorch.org/whl/cu121
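
To confirm that the installation succeeded before moving on, a check along the following lines can list the installed versions. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Report the packages this notebook depends on
for name, ver in installed_versions(["torch", "transformers", "datasets", "accelerate"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `None` entry means the corresponding `pip install` line should be rerun before continuing.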


### 2. Importing Core Libraries

With the dependencies installed, we now import the specific modules and classes required for our task.

* `torch`: The core PyTorch library.
* From `transformers`:
    * `GPT2Tokenizer`: Responsible for converting raw text into a format (tokens) that the GPT-2 model can understand and vice-versa.
    * `GPT2LMHeadModel`: The GPT-2 model architecture with a language modeling head on top, which is essential for text generation.
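    * `TextDataset`: A utility class that loads a text file and splits it into fixed-size token blocks for language modeling. Recent `transformers` releases mark it as deprecated in favor of the `datasets` library, so it may emit a warning.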
    * `DataCollatorForLanguageModeling`: A helper that takes tokenized samples from our dataset and groups them into batches for the model. It also handles padding.
    * `Trainer` & `TrainingArguments`: High-level classes that manage the entire training and evaluation loop, abstracting away much of the boilerplate code.
    * `pipeline`: A high-level utility for performing inference tasks easily.
* `os`: A standard Python library for interacting with the operating system, which can be useful for managing files and directories.

In [2]:
import torch  #Core PyTorch library
from transformers import (
    GPT2Tokenizer,                    #Converts text to token IDs and back
    GPT2LMHeadModel,                  #GPT-2 with a language modeling head
    TextDataset,                      #Loads a text file as fixed-size token blocks
    DataCollatorForLanguageModeling,  #Builds padded batches for language modeling
    Trainer,                          #Runs the training loop
    TrainingArguments,                #Holds the training hyperparameters
    pipeline,                         #High-level inference helper
)
import os  #Standard library module for interacting with the operating system

2025-06-30 11:04:31.952182: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-30 11:04:32.046993: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751261672.084656    5828 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751261672.095008    5828 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1751261672.175186    5828 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

### 3. Verifying GPU Availability

Training large language models like GPT-2 is computationally intensive and can be extremely slow on a CPU. A CUDA-enabled GPU can accelerate this process by orders of magnitude. This code block verifies that PyTorch can detect and utilize the available GPU.

* `torch.cuda.is_available()`: Checks if a compatible NVIDIA GPU is found and if the installed PyTorch version has CUDA support.
* If a GPU is not found, a `RuntimeError` is raised to halt execution.
* If successful, it prints the name of the GPU, the current PyTorch version, and a detailed memory summary to confirm the setup.

In [3]:
#For checking if GPU is available
if not torch.cuda.is_available(): #If Torch discovers no GPU
    raise RuntimeError("GPU Unavailable, Ensure PyTorch was installed with CUDA support.")
    #Raises a Runtime Error stating that the local device Has no CUDA enabled GPU

print("CUDA Available", torch.cuda.is_available()) #If CUDA is available prints "CUDA Available"
print("Using device:", torch.cuda.get_device_name()) #If CUDA is available prints the CUDA device being used
print("PyTorch version:", torch.__version__) #Prints the version of Torch being used
print("GPU Memory Summary:", torch.cuda.memory_summary()) #Prints the summary of the GPU memory

CUDA Available True
Using device: NVIDIA GeForce RTX 4070 Laptop GPU
PyTorch version: 2.5.1+cu121
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      

### 4. Loading the Pre-trained GPT-2 Model and Tokenizer

We will use the standard `gpt2` model from the Hugging Face model hub as our starting point. This model has been pre-trained on a massive corpus of general text and already has a strong grasp of the English language.

* `GPT2Tokenizer.from_pretrained(model_name)`: Downloads and loads the tokenizer that was specifically trained with the `gpt2` model.
* `GPT2LMHeadModel.from_pretrained(model_name)`: Downloads and loads the pre-trained weights of the `gpt2` model.
* **Handling the Padding Token**: GPT-2 does not define a padding token. We set `pad_token` to the existing `eos_token` (end-of-sequence token), a common way to enable batching of sequences with different lengths.
* `model.resize_token_embeddings()`: Resizes the model's token embedding layer to match the tokenizer's vocabulary size. Because the padding token reuses an existing token, the vocabulary stays at 50,257 entries and the call leaves the embedding matrix unchanged, as the `Embedding(50257, 768)` output confirms. It is kept as a safeguard in case new tokens are added later.

In [6]:
model_name = "gpt2"  #Defining the pretrained model's name
tokenizer = GPT2Tokenizer.from_pretrained(model_name)  #Defining the tokenizer for the GPT-2 model
model = GPT2LMHeadModel.from_pretrained(model_name)  #Loading the pre-trained GPT-2 language model with a language modeling head

#Set pad token (as GPT-2 does not have one by default)
tokenizer.pad_token = tokenizer.eos_token  #Reuses the EOS (end-of-sequence) token as the padding token; no new token is added
model.resize_token_embeddings(len(tokenizer))  #Keeps the embedding layer in sync with the tokenizer's vocabulary (unchanged here at 50257 tokens)

Embedding(50257, 768)
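
To see why batching needs a padding token at all, the sketch below pads sequences of different lengths by hand. It uses plain Python lists rather than tensors, the helper `pad_batch` is hypothetical, and the short token IDs are made up for illustration; 50256 is the ID of GPT-2's `<|endoftext|>` token, which serves as the pad token here.

```python
EOS_ID = 50256  # GPT-2's end-of-sequence token ID, reused here as the pad ID

def pad_batch(sequences, pad_id=EOS_ID):
    """Right-pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_batch([[10, 11, 12], [20]])
print(ids)   # [[10, 11, 12], [20, 50256, 50256]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions even though they carry a valid token ID.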

### 5. Preparing the Dataset for Fine-Tuning

Now, we prepare our custom dataset (Shakespeare's text) for the model.

* `load_dataset()`: A small helper defined below (not to be confused with `datasets.load_dataset`) that wraps the `TextDataset` class from Hugging Face. It reads a text file (`shake.txt`), tokenizes its content, and splits it into chunks of `block_size` tokens. A `block_size` of 128 means the model is trained on segments of 128 tokens at a time.
* `DataCollatorForLanguageModeling`: This object assembles batches from our dataset. Setting `mlm=False` disables Masked Language Modeling, so the collator prepares inputs for Causal Language Modeling (CLM), the standard objective for auto-regressive models like GPT-2: the labels are a copy of the input tokens, and the model shifts them internally to predict each next token. The collator also pads batches so that all sequences in a batch have the same length.

In [7]:
def load_dataset(file_path, tokenizer, block_size=128):
    '''
    Loads a text file into a format suitable for training a language model.

    Args:
        file_path (str): Path to the training text file.
        tokenizer (PreTrainedTokenizer): The tokenizer to use for encoding the text.
        block_size (int): The maximum length of each input block after tokenization.

    Returns:
        TextDataset: A dataset object containing tokenized text in blocks.
    '''
    return TextDataset( #Returns TextDataset
        tokenizer=tokenizer, #Tokenizer used to tokenize the input text
        file_path=file_path, #Path to the text file to be loaded
        block_size=block_size, #Maximum sequence length per training example
    )

train_file = 'shake.txt' #Path to the text file used to train the language model

#Load the dataset using the custom function and tokenizer
dataset = load_dataset(train_file, tokenizer) #Defining the dataset and the tokenizer
#Define a data collator that dynamically pads batches and prepares them for language modeling
#'mlm=False' means this is for causal (auto-regressive) language modeling like GPT-2
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
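
The chunking into `block_size` segments can be pictured with plain lists. The sketch below is an illustration of the idea, not the library's implementation: the helper `split_into_blocks` is hypothetical, the token IDs are stand-ins, and it assumes that a trailing remainder shorter than `block_size` is dropped.

```python
def split_into_blocks(token_ids, block_size=128):
    """Cut a token-ID list into consecutive blocks of exactly block_size tokens.

    A trailing remainder shorter than block_size is dropped (assumption).
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

tokens = list(range(300))            # stand-in for a tokenized text file
blocks = split_into_blocks(tokens)   # 300 tokens give 2 full blocks; 44 tokens are left over
print(len(blocks), [len(b) for b in blocks])  # 2 [128, 128]
```

Each block becomes one training example, so the number of examples is roughly the token count of the file divided by `block_size`.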



### 6. Configuring Training Arguments

The `TrainingArguments` class allows us to define all the hyperparameters and settings for the training process in a single object.

* `output_dir`: The directory where the fine-tuned model checkpoints and final model will be saved.
* `overwrite_output_dir`: If `True`, it will overwrite the content of the output directory.
* `num_train_epochs`: The total number of times the model will iterate over the entire training dataset.
* `per_device_train_batch_size`: The number of training examples to use in a single batch on one GPU.
* `gradient_accumulation_steps`: The number of batches whose gradients are accumulated before the optimizer updates the model's weights. Each batch still runs its own forward and backward pass; only the weight update is deferred. This increases the effective batch size without increasing peak memory usage.
* `save_steps`: How often, in training steps, a model checkpoint is saved (here every 500 steps).
* `save_total_limit`: This limits the total number of checkpoints saved. Older ones are deleted to save space.
* `logging_steps`: How often to log training metrics like loss.
* `fp16`: Enables mixed-precision training, which uses both 16-bit and 32-bit floating-point types to speed up training and reduce memory usage on compatible hardware (like modern NVIDIA GPUs).
* `dataloader_pin_memory`: When set to `True`, it can speed up data transfer from the CPU to the GPU.
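
Gradient accumulation can be checked on a toy problem. The sketch below uses a hypothetical one-parameter model `y = w * x` with a mean-squared-error loss and made-up data; it shows that summing scaled micro-batch gradients reproduces the gradient of the full batch when the micro-batches have equal size.

```python
def mean_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1 / number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    n_micro = len(starts)
    total = 0.0
    for start in starts:
        end = start + micro_batch_size
        total += mean_gradient(w, xs[start:end], ys[start:end]) / n_micro
    return total

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
full = mean_gradient(0.5, xs, ys)              # one batch of 4 examples
accum = accumulated_gradient(0.5, xs, ys, 2)   # two micro-batches of 2 examples
print(abs(full - accum) < 1e-12)               # True
```

With `gradient_accumulation_steps=1`, as in this notebook, every batch triggers a weight update and no accumulation takes place.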

In [29]:
training_args = TrainingArguments(
    output_dir="./gpt2-shakespeare-finetuned2",       #Directory to save the model and checkpoints
    overwrite_output_dir=True,                        #Overwrites the output directory if it exists
    num_train_epochs=5,                               #Number of training epochs (passes through the entire dataset)
    per_device_train_batch_size=5,                    #Batch size per GPU/CPU during training
    gradient_accumulation_steps=1,                    #Number of steps to accumulate gradients before updating model weights
    save_steps=500,                                   #Save a checkpoint every 500 steps
    save_total_limit=2,                               #Maximum number of checkpoints to keep (older ones are deleted)
    logging_dir="./logs",                             #Directory to store training logs for TensorBoard or other tools
    logging_steps=100,                                #Log training metrics every 100 steps
    fp16=True,                                        #Use 16-bit (mixed) precision training if supported by the hardware
    report_to="none",                                 #Disable integration with logging/reporting tools like WandB or TensorBoard
    dataloader_pin_memory=True,                       #Improves performance by enabling faster data transfer to GPU
)
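
These settings determine how many weight updates the run performs. The sketch below estimates that number; the helper `total_optimizer_steps` is hypothetical, and the dataset size of 2,640 examples is an inference from the training output reported later in this notebook (about 13,200 samples processed over 5 epochs), not a value stated by the library.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Estimate the number of weight updates for a run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs

effective_batch = 5 * 1 * 1  # per-device batch size x accumulation steps x devices
steps = total_optimizer_steps(2640, 5, 1, 5)
print(effective_batch, steps)  # 5 2640
```

The estimate matches the `global_step=2640` value in the training output.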

### 7. Moving the Model to the GPU

Before we can start training, we must ensure that the model is loaded onto the correct computational device.

* `torch.device(...)`: This line creates a `device` object that points to the GPU (`cuda`) if one is available, or falls back to the CPU otherwise.
* `model.to(device)`: Moves all of the model's parameters and buffers to the selected device (in this case, the GPU), so that subsequent computations run on the accelerated hardware. The `Trainer` also places the model on the available device itself, so this call mainly makes the placement explicit.

In [30]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Selects GPU ('cuda') if available, otherwise defaults to CPU
model.to(device)                                                       # Moves the model to the selected device (GPU or CPU)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
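
The printed architecture is enough to estimate the model's size. The sketch below counts parameters from the layer shapes shown above, assuming that each `Conv1D` and `LayerNorm` layer has both weights and biases and that `lm_head` shares its weights with the `wte` embedding (weight tying), so it adds no parameters of its own.

```python
VOCAB, N_CTX, D_MODEL, N_LAYERS = 50257, 1024, 768, 12

def linear(n_in, n_out):
    """Weights plus biases of a dense (Conv1D) layer."""
    return n_in * n_out + n_out

layer_norm = 2 * D_MODEL                         # scale and shift vectors
block = (
    layer_norm                                   # ln_1
    + linear(D_MODEL, 3 * D_MODEL)               # attn.c_attn (nf=2304)
    + linear(D_MODEL, D_MODEL)                   # attn.c_proj
    + layer_norm                                 # ln_2
    + linear(D_MODEL, 4 * D_MODEL)               # mlp.c_fc (nf=3072)
    + linear(4 * D_MODEL, D_MODEL)               # mlp.c_proj
)
embeddings = VOCAB * D_MODEL + N_CTX * D_MODEL   # wte + wpe
total = embeddings + N_LAYERS * block + layer_norm  # final term is ln_f; lm_head assumed tied to wte

print(f"{total:,}")  # 124,439,808
```

At roughly 124 million parameters, this is the smallest GPT-2 variant, which is why it fits comfortably on a single laptop GPU.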

### 8. Initializing the Trainer

The `Trainer` class is a powerful utility from the Hugging Face `transformers` library that handles the entire training loop. We initialize it by passing all the components we have prepared so far:

* `model`: The GPT-2 model we loaded and moved to the GPU.
* `args`: The `TrainingArguments` object containing all our training configurations.
* `data_collator`: The data collator to create batches for training.
* `train_dataset`: Our tokenized Shakespeare dataset.

The `Trainer` will now manage everything from batching the data to calculating the loss, performing backpropagation, and updating the model's weights.

In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

### 9. Commencing the Fine-Tuning Process

This is the moment where the actual training happens. Calling `trainer.train()` starts the fine-tuning process.

The `Trainer` will now:
1.  Iterate through the dataset for the specified number of epochs.
2.  Feed batches of data to the model.
3.  Calculate the loss (a measure of how far off the model's predictions are from the actual text).
4.  Adjust the model's internal weights to minimize this loss.

The output will show the training progress, including the loss at each logging step. A decreasing loss indicates that the model is successfully learning the patterns in the Shakespearean text.
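
The loss in step 3 is a cross-entropy: the negative log-probability the model assigns to the token that actually comes next, averaged over positions. The sketch below computes it for a single position in plain Python; the helper `next_token_loss` and the example logits are hypothetical.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log softmax(logits)[target_id]."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_id]

vocab_size = 50257
uniform = next_token_loss([0.0] * vocab_size, target_id=0)      # model with no preference
confident = next_token_loss([5.0, 0.0, 0.0, 0.0], target_id=0)  # model favoring the right token
print(round(uniform, 2))    # 10.82
print(confident < uniform)  # True
```

A model that guesses uniformly over GPT-2's 50,257 tokens would score about 10.82, which gives a reference point for reading the loss values logged during training.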

In [32]:
trainer.train()  #This starts the training process

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 100  | 3.9648 |
| 200  | 3.6408 |
| 300  | 3.6319 |
| 400  | 3.5979 |
| 500  | 3.5362 |
| 600  | 3.4149 |
| 700  | 3.3541 |
| 800  | 3.3262 |
| 900  | 3.3265 |
| 1000 | 3.3551 |


TrainOutput(global_step=2640, training_loss=3.261607976393266, metrics={'train_runtime': 263.6034, 'train_samples_per_second': 50.075, 'train_steps_per_second': 10.015, 'total_flos': 862263705600000.0, 'train_loss': 3.261607976393266, 'epoch': 5.0})
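
A loss value is easier to interpret as a perplexity, the exponential of the average cross-entropy. The sketch below converts two numbers from the output above; the helper `perplexity` is hypothetical, and because these are training losses, the results describe the fit to the training text rather than performance on unseen text.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)

first_logged = perplexity(3.9648)            # loss at step 100
overall = perplexity(3.261607976393266)      # 'train_loss' from TrainOutput
print(f"{first_logged:.1f} -> {overall:.1f}")  # 52.7 -> 26.1
```

A perplexity of about 26 means the model is, on average, as uncertain as if it were choosing among roughly 26 equally likely next tokens.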

### 10. Saving the Final Model and Tokenizer

After the training is complete, it's essential to save our work. This saves the fine-tuned model weights and the tokenizer's configuration, allowing us to load it later for inference without needing to repeat the training process.

* `trainer.save_model(...)`: This saves the learned weights of the model, along with its configuration file, to the specified directory.
* `tokenizer.save_pretrained(...)`: This saves the tokenizer's vocabulary and configuration files to the same directory. This ensures that the exact same tokenization scheme is used when we later load the model for text generation.

In [33]:
trainer.save_model("./gpt2-shakespeare-finetuned2")         #Saves the fine-tuned model to the specified directory
tokenizer.save_pretrained("./gpt2-shakespeare-finetuned2")  #Saves the tokenizer configuration and vocabulary to the same directory

('./gpt2-shakespeare-finetuned2/tokenizer_config.json',
 './gpt2-shakespeare-finetuned2/special_tokens_map.json',
 './gpt2-shakespeare-finetuned2/vocab.json',
 './gpt2-shakespeare-finetuned2/merges.txt',
 './gpt2-shakespeare-finetuned2/added_tokens.json')
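
Before loading the model elsewhere, it can help to confirm that the directory holds what is expected. The sketch below is a hypothetical helper built on the standard library: the tokenizer file names are taken from the output above, `config.json` is assumed to be written alongside the weights, and `added_tokens.json` is left out on the assumption that it may be absent when no tokens were added.

```python
from pathlib import Path

# Tokenizer files from the output above, plus the model configuration file (assumed name)
EXPECTED_FILES = [
    "config.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.json",
    "merges.txt",
]

def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in the directory."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]

# An empty list means every expected file was found
print(missing_files("./gpt2-shakespeare-finetuned2"))
```

The weights file is not checked here because its name depends on the installed `transformers` version.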

### 11. Generating Text with the Fine-Tuned Model

With our model fine-tuned and saved, we can now use it to generate text. The Hugging Face `pipeline` provides a simple and high-level API for this task.

* `pipeline("text-generation", ...)`: We create a text generation pipeline.
* `model`: We load the model from the directory where we saved our fine-tuned version (`./gpt2-shakespeare-finetuned2`).
* `tokenizer`: We load the corresponding tokenizer.
* `device=0`: We assign the pipeline to run on the first GPU (`cuda:0`) for faster inference.
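* `generator()`: We call the pipeline with our `prompt`, `max_length` (an upper bound on the total number of tokens, prompt included), and `num_return_sequences` (how many different versions to generate). If `max_new_tokens` is also set, it takes precedence over `max_length`, as the warning in the output below reports for this run.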

The output will be a piece of text that starts with our prompt and is completed by the model in a style that it learned from the works of Shakespeare.

In [35]:
#Generation
generator = pipeline("text-generation", model="./gpt2-shakespeare-finetuned2", tokenizer=tokenizer, device=0)
#Creates a text generation pipeline from the fine-tuned model saved above and its tokenizer
#'device=0' assigns the pipeline to the first CUDA GPU (cuda:0)

prompt = "Shakespeare Quote"  #Input text prompt that the model continues

output = generator(prompt, max_length=100, num_return_sequences=1)
#Requests at most 100 tokens in total (prompt included)
#Returns 1 generated sequence

print(output[0]["generated_text"])  #Prints the generated text from the first sequence in the output

Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Shakespeare Quote: Originally Posted by I think the only way I would go about it is to say that I don't really believe in a god. I would like to believe that there is something beyond the divine, but it's really hard to explain why. I would like to believe that there is something beyond the divine, but it's really hard to explain why.

You should have to explain why.

I think there is something beyond the divine but it's really hard to explain why.


I would like to believe that there is something beyond the divine but it's really hard to explain why.I think there is something beyond the divine but it's really hard to explain why.

There is a real possibility that there is. Like the Lord is the Creator and will be forever. There is a real possibility that there is. Like the Lord is the Creator and will be forever.

Sorcery (1/2):

I think there is something beyond the divine but it's really hard to explain why. I would like to believe that there is something beyond the divine but it's 
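
The warning in the output explains why the text runs past 100 tokens: `max_new_tokens` overrides `max_length` when both are set. The sketch below restates that precedence rule as a small function; the helper `new_token_budget` and the prompt length of 3 tokens are hypothetical, and the function is not part of `transformers`.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Number of tokens generation may add, following the precedence in the warning."""
    if max_new_tokens is not None:
        return max_new_tokens                    # counts generated tokens only
    if max_length is not None:
        return max(max_length - prompt_len, 0)   # max_length includes the prompt
    raise ValueError("set max_length or max_new_tokens")

print(new_token_budget(3, max_length=100))                      # 97
print(new_token_budget(3, max_length=100, max_new_tokens=256))  # 256
```

Passing `max_new_tokens` to the pipeline call instead of `max_length` is one way to make the intended limit unambiguous.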