# **Task_01: Text Generation with GPT-2**

*Train a model to generate coherent and contextually relevant text based on a given prompt. Starting with GPT-2, a transformer model developed by OpenAI, you will learn how to fine-tune the model on a custom dataset to create text that mimics the style and structure of your training data.*

In [44]:
from google.colab import drive
drive.mount('/content/drive')





Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install transformers datasets evaluate
!pip install torch # If not already installed

Collecting evaluate
  Downloading evaluate-0.4.4-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.4-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 2.3 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.4
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from to

In [19]:
from datasets import Dataset
import pandas as pd

# Define the path to the data file in Google Drive
drive_path = "/content/drive/MyDrive/data.txt"

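# Read one example per non-empty line into a pandas DataFrame
# (pd.read_csv would split any line that contains a comma into several fields)
with open(drive_path, "r", encoding="utf-8") as f:
    df = pd.DataFrame({"text": [line.strip() for line in f if line.strip()]})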

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Display the dataset
print(dataset)

with open(drive_path, 'r') as file:
    data = file.read()
    print(data)

Dataset({
    features: ['text'],
    num_rows: 2
})
Once upon a time...
The knight rode through the misty forest...
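A small sketch of why a comma-delimited parser is a risky way to load free text: any line that contains a comma is split into several fields, while a plain line-by-line read keeps it whole. The sample text here is hypothetical, and the standard-library `csv` module stands in for the general comma-delimited convention.

```python
import csv
import io

# Hypothetical sample: the first line contains a comma, the second does not.
sample = "Once upon a time, there was a knight.\nThe knight rode on.\n"

# A comma-delimited parser splits the first line into two fields.
rows = list(csv.reader(io.StringIO(sample)))

# Reading line by line keeps each sentence as a single training example.
plain_lines = [line.strip() for line in sample.splitlines() if line.strip()]

print(rows[0])         # two fields
print(plain_lines[0])  # one string
```

With the two lines in `data.txt` shown above this makes no difference, since neither contains a comma, but it matters as soon as the training text does.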



In [20]:
from transformers import GPT2Tokenizer

In [21]:
# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Map:   0%|          | 0/2 [00:00<?, ? examples/s]
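To illustrate what `padding="max_length"` and `truncation=True` do to each example, here is a minimal pure-Python sketch. The helper `pad_to_max_length` and the token ids are hypothetical; the real tokenizer does this internally. It assumes right-side padding and uses GPT-2's end-of-text id (50256) as the pad id, mirroring `tokenizer.pad_token = tokenizer.eos_token`.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Truncate to max_length, then right-pad and build the attention mask."""
    ids = token_ids[:max_length]          # truncation=True
    n_pad = max_length - len(ids)         # padding="max_length"
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 0 marks padding
    }

EOS_ID = 50256  # GPT-2 end-of-text token, reused here as the pad token

# Hypothetical 3-token example padded to length 6 (the notebook uses 512).
example = pad_to_max_length([101, 102, 103], max_length=6, pad_id=EOS_ID)
print(example)
```

With `max_length=512`, two short sentences become two rows of 512 ids each, almost all of them padding. That is one reason training on this dataset is slow on CPU.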

In [22]:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))



Embedding(50257, 768)
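The `Embedding(50257, 768)` output reads as vocabulary size by hidden size. A quick back-of-the-envelope calculation, assuming 32-bit floats, shows how large that table is:

```python
vocab_size, hidden_size = 50257, 768  # values from the Embedding(...) output

# One learned vector of length hidden_size per vocabulary entry.
embedding_params = vocab_size * hidden_size

# Assumption: weights stored as float32 (4 bytes each).
embedding_megabytes = embedding_params * 4 / 1e6

print(embedding_params)
print(round(embedding_megabytes, 1))
```

Because reusing the end-of-text token as the pad token adds no new vocabulary entries, `len(tokenizer)` stays at 50257 and the resize call leaves the table unchanged.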

In [30]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
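As a sanity check on these settings, the number of optimizer steps can be worked out by hand. This sketch assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper, not part of the `transformers` API.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Steps = ceil(batches per epoch / accumulation) * epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

# 2 examples, per_device_train_batch_size=2, num_train_epochs=3
steps = total_optimizer_steps(num_examples=2, batch_size=2, epochs=3)

# With logging_steps=10, a loss is logged only at steps 10, 20, ...
logged_entries = steps // 10

print(steps, logged_entries)
```

Three steps in total matches `global_step=3` in the training output, and zero logged entries explains why the `Step,Training Loss` table is empty.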

In [27]:
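# For causal language modeling the labels are the input ids themselves;
# the model shifts them by one position internally when computing the loss.
# Note: DataCollatorForLanguageModeling(mlm=False) also builds labels from
# input_ids at batch time, so this step is optional when that collator is used.
def add_labels(examples):
    examples["labels"] = examples["input_ids"]
    return examples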

tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)
print(tokenized_dataset)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 2
})
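The sketch below shows, in plain Python, how labels for causal language modeling relate to the input ids. Padding positions are replaced by -100 so the loss ignores them, and each token is paired with the next token as its target. The helper names and token ids are hypothetical. Masking by pad id mirrors what `DataCollatorForLanguageModeling(mlm=False)` does at batch time, which also means genuine end-of-text tokens are masked when the pad token is the same id.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def causal_lm_labels(input_ids, pad_id):
    """Copy input ids, masking every position that holds the pad id."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in input_ids]

def shifted_pairs(input_ids, labels):
    """(context token, target token) pairs after the one-position shift."""
    return [
        (inp, tgt)
        for inp, tgt in zip(input_ids[:-1], labels[1:])
        if tgt != IGNORE_INDEX
    ]

PAD_ID = 50256  # GPT-2 end-of-text id, used as pad token in this notebook
input_ids = [11, 12, 13, PAD_ID, PAD_ID]  # hypothetical padded example

labels = causal_lm_labels(input_ids, PAD_ID)
pairs = shifted_pairs(input_ids, labels)
print(labels)
print(pairs)
```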


In [29]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [31]:
trainer.train()


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


TrainOutput(global_step=3, training_loss=4.136036554972331, metrics={'train_runtime': 399.3651, 'train_samples_per_second': 0.015, 'train_steps_per_second': 0.008, 'total_flos': 1567752192000.0, 'train_loss': 4.136036554972331, 'epoch': 3.0})
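Two of the reported numbers can be interpreted with a little arithmetic. Treating the training loss as an average cross-entropy in nats per token (an approximation, since it is averaged over steps), its exponential gives a rough perplexity. The throughput figure follows from 2 examples times 3 epochs divided by the runtime.

```python
import math

training_loss = 4.136036554972331   # from TrainOutput
train_runtime = 399.3651            # seconds, from TrainOutput

# Rough perplexity: exp of the average per-token cross-entropy.
perplexity = math.exp(training_loss)

# 2 examples seen in each of 3 epochs.
samples_per_second = (2 * 3) / train_runtime

print(round(perplexity, 1))
print(round(samples_per_second, 3))
```

With only two training sentences and three optimizer steps, these figures say little about generation quality; they mainly confirm that the run completed as configured.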

In [46]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Load tokenizer and model from local checkpoint
model_path = "/content/gpt2-finetuned/checkpoint-3"  # Ensure this path exists

tokenizer = GPT2Tokenizer.from_pretrained(model_path, local_files_only=True)
model = GPT2LMHeadModel.from_pretrained(model_path, local_files_only=True)

# Create text generation pipeline
text_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

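# Generate text
# Note: in this run max_length=100 was overridden by a max_new_tokens=256 default
# (see the warning in the output); pass max_new_tokens explicitly to control length.
prompt = "Once upon a time"
generated = text_gen(prompt, max_length=100, num_return_sequences=1)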

print(generated[0]["generated_text"])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Once upon a time, it was a place where one could make a living. However, this place was slowly dying.

On the way back, I noticed that the door to the room where the two of you were was still standing had been replaced with a door that was now locked and sealed.

"What's happening?"

"It's been changed so that there's no way to enter, so now it's the same as before."

"What?"

"I heard that the boss's boss is a bit of a bit of a mess, but he's the one who changed things, so I'm worried about his efficiency. That's why I'm going to be taking care of you after all."

"What's the matter?"

"There's a few things that need to be done, but first, I want to see if you can tell me all about it. You sure are an experienced warrior, right?"

"Yes, I know that."

Well, I'm not a warrior, but I'm an adventurer, so I don't need to tell you everything.

The boss's boss, who's a bit old, has already changed a lot of things in the past few days.
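The output above is much longer than 100 tokens because, as the warning states, `max_new_tokens` took precedence over `max_length`. The sketch below captures the difference between the two limits: `max_length` counts the prompt plus the generated tokens, while `max_new_tokens` counts only the generated ones. `new_token_budget` is a hypothetical helper, and the prompt length of 4 tokens is an assumption.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """How many tokens may be generated under the given limits."""
    if max_new_tokens is not None:
        return max_new_tokens                   # takes precedence when both are set
    if max_length is not None:
        return max(max_length - prompt_len, 0)  # prompt counts toward max_length
    return None                                 # no explicit limit given

PROMPT_LEN = 4  # assumed token count for "Once upon a time"

budget_in_this_run = new_token_budget(PROMPT_LEN, max_length=100, max_new_tokens=256)
budget_if_only_max_length = new_token_budget(PROMPT_LEN, max_length=100)

print(budget_in_this_run, budget_if_only_max_length)
```

Passing `max_new_tokens=100` to the pipeline call, without `max_length`, would be one way to cap the continuation at 100 tokens.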


