## Story Generation 1

## Remove Temporary Directory

In [6]:
!rm -r '/kaggle/working/wandb'



In [7]:
## Extract a 50k-pair subset because of resource limits

In [8]:
import random

def generate_random_pairs(input_file, output_file, num_pairs=50000):
    # Read the dataset from the input file
    with open(input_file, 'r', encoding='utf-8') as file:
        data = file.read()

    # Split the data into individual pairs (Keywords and Story)
    pairs = data.split('<|startoftext|>')

    # Filter out empty strings or any non-valid pair
    pairs = [pair.strip() for pair in pairs if pair.strip()]

    # Select a random subset of pairs (ensure we don't exceed the available number)
    random_pairs = random.sample(pairs, min(num_pairs, len(pairs)))

    # Rebuild the text in the same format (one record per line)
    output_text = "".join(f"<|startoftext|>{pair}\n" for pair in random_pairs)

    # Write the result to the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(output_text)

# Example usage:
input_file = '/kaggle/input/keyword-story-dataset/keyword_story_dataset.txt'  # Replace with the path to your input file
output_file = '/kaggle/working/keyword_story_dataset_50k.txt'  # Replace with the path where you want to save the output

generate_random_pairs(input_file, output_file)
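
Since `random.sample` is called without a seed, every run of the cell above picks a different subset. A minimal sketch of the same split-and-sample step on an in-memory string, with a fixed seed so the subset is reproducible. The `sample_pairs` helper, its `seed` parameter, and the toy records are illustrative assumptions and not part of the notebook.

```python
import random

def sample_pairs(text, num_pairs, seed=42, marker="<|startoftext|>"):
    """Split `text` on `marker` and return a reproducible random subset."""
    pairs = [p.strip() for p in text.split(marker) if p.strip()]
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(pairs, min(num_pairs, len(pairs)))

# Ten made-up records in the same "<|startoftext|>...\n" layout
toy = "".join(f"<|startoftext|>record {i}\n" for i in range(10))
subset = sample_pairs(toy, num_pairs=3)
print(subset)
```

Calling `sample_pairs` twice with the same seed returns the same records, which makes later training runs comparable.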


## Tokenization

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import torch

# 1. Load and Preprocess Dataset
def load_and_preprocess_data(file_path):
    # Read the dataset file
    with open(file_path, 'r', encoding='utf-8') as f:
        data = f.read().split('<|endoftext|>')  # Split stories
    
    # Prepare dataset format
    examples = [{"text": text.strip() + "<|endoftext|>"} for text in data if text.strip()]
    return Dataset.from_list(examples)

# Tokenizer function: Include 'labels'
def tokenize_function(examples, tokenizer, max_length=512):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=max_length)
    tokenized["labels"] = tokenized["input_ids"].copy()  # Copy input_ids to labels for loss calculation
    return tokenized

# 2. Load GPT-2 Model and Tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Add special tokens
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
model.resize_token_embeddings(len(tokenizer))

# 3. Load and Tokenize Dataset
dataset_path = "/kaggle/working/keyword_story_dataset_50k.txt"
raw_dataset = load_and_preprocess_data(dataset_path)
# Tokenize the dataset
print("Tokenize")
tokenized_dataset = raw_dataset.map(lambda x: tokenize_function(x, tokenizer), batched=True)

# Save the tokenized dataset to disk
tokenized_dataset.save_to_disk("/kaggle/working/tokenized_dataset_50k")

# Now you can load it later using:
# from datasets import load_from_disk
# tokenized_dataset = load_from_disk("/kaggle/working/tokenized_dataset_50k")


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Tokenize


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/50000 [00:00<?, ? examples/s]
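
In the tokenization cell, `labels` is a straight copy of `input_ids`, so every padded position (up to 512 per example) also counts toward the loss. A common alternative is to set the label to `-100` wherever the attention mask is 0, since the cross-entropy loss used by the Hugging Face causal LM heads ignores that value. The sketch below shows the idea on plain Python lists; `mask_padding_labels` and the toy ids are hypothetical and not part of the notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: the first row has two padded positions, the second has none
batch = {
    "input_ids": [[15, 27, 50257, 50257], [8, 9, 10, 11]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
batch["labels"] = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(batch["labels"])
```

Masking by attention mask rather than by token id also keeps a genuine end-of-text token as a training target when the pad token is set to the EOS token.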

In [None]:
from datasets import load_from_disk

tokenized_dataset_50k = load_from_disk("/kaggle/working/tokenized_dataset_50k")
# Split into training and validation
tokenized_dataset = tokenized_dataset_50k.train_test_split(test_size=0.1)

# 4. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_dir="./logs",
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True
)

# 5. Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer
)

# 6. Train the Model
trainer.train()

# 7. Save the Model
model.save_pretrained("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.4045        | 0.980985        |
| 1000 | 1.006         | 0.939133        |
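
The validation loss reported by the Trainer is a mean per-token cross-entropy, so exponentiating it gives a perplexity. A small worked example using the two logged values. Because the labels were copied from `input_ids` without masking, padded positions are presumably included in this average, so these figures are likely optimistic compared with a perplexity over real tokens only.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

validation_losses = {500: 0.980985, 1000: 0.939133}  # from the logged results above
for step, loss in validation_losses.items():
    print(f"step {step}: validation perplexity ~ {perplexity(loss):.2f}")
```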


In [50]:

# 8. Generate Stories
def generate_story(keywords, max_length=200):
    prompt = f"Keywords: {keywords}\nStory:"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)  # Move to model's device
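    # Note: do_sample=True is not set, so generate() decodes greedily by default
    # and the temperature/top_p/top_k values below have no effect.
    output = model.generate(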
        input_ids=input_ids,
        max_length=max_length,
        num_return_sequences=1,
        temperature=0.9,
        top_p=0.9,
        top_k=50,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Example: Generate a story
keywords = "magic, dragon, castle"
print(generate_story(keywords))



Keywords: magic, dragon, castle
Story: Once there was a magic dragon. He was very powerful and he was very brave. He was very big and he had a big castle. The castle was very big and it was very tall. The dragon was very brave. He was very strong and he had a big castle. The dragon was very tall and he had a big castle. The dragon was very brave and he had a big castle. The dragon was very strong and he had a big castle. The dragon was very brave and he had a big castle. The dragon was very strong and he had a big castle. The dragon was very strong and he had a big castle. The dragon was very brave and he had a big castle. The dragon was very strong and he had a big castle. The dragon was very strong and he had a big castle. The dragon was very strong and he had a big castle. The dragon was very brave and he had a big castle. The


In [51]:
# Example: Generate a story
keywords = "child are playing, there is house in field, airoplane is flying in sky,"
print(generate_story(keywords))

Keywords: child are playing, there is house in field, airoplane is flying in sky,
Story: There is a child are playing in the field. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing with aroids. The child is playing


In [52]:
keywords = "A curious cat found a glowing key in the attic, The robot discovered it had a secret mission hidden in its memory,On a rainy night, a letter arrived with no sender,  The ancient tree whispered secrets to anyone who touched it, A young girl woke up to find her reflection missing from the mirror "
print(generate_story(keywords))

Keywords: A curious cat found a glowing key in the attic, The robot discovered it had a secret mission hidden in its memory,On a rainy night, a letter arrived with no sender,  The ancient tree whispered secrets to anyone who touched it, A young girl woke up to find her reflection missing from the mirror 
Story: A curious cat found a mysterious key in the attic. The key was a secret mission hidden in its memory.   The cat was curious about the secret mission. It discovered that the key had a secret mission hidden in its memory.   The cat was curious about the secret mission. It was curious about the secret mission.   The cat found the key in the attic and was curious about the secret mission.   The cat was curious about the secret mission. It was curious about the secret mission.   The cat was curious about the secret mission. It was curious about the secret mission.   The cat was curious about the secret mission


In [53]:
keywords = "Moonlight Secret Journey Treasure Whisper Forest Shadow Magic Mystery Adventure"
print(generate_story(keywords))

Keywords: Moonlight Secret Journey Treasure Whisper Forest Shadow Magic Mystery Adventure
Story: The Shadow of the Moon is a magical land. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets. It is full of secrets
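
The looping sentences in the outputs above are typical of greedy decoding, which always picks the single most likely next token. Temperature and top-p only matter once tokens are sampled. The following is a conceptual sketch of what those two settings do to a next-token distribution, written in plain Python on made-up logits; it is not the `transformers` implementation, and the function names are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = softmax_with_temperature(toy_logits, temperature=0.9)
greedy_choice = max(range(len(probs)), key=lambda i: probs[i])
nucleus = top_p_filter(probs, top_p=0.9)
print("greedy always picks token", greedy_choice)
print("sampling draws from", nucleus)
```

With sampling, the next token is drawn from the filtered set instead of always being the top one, which is what breaks the repetition loop.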


In [57]:
import os
import zipfile

def create_zip_from_directory(directory_path, output_zip_path):
    with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(directory_path):
            for file in files:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, directory_path)  # Preserve directory structure
                zipf.write(file_path, arcname)
    print(f"Zip file created at: {output_zip_path}")

# Usage
directory_to_zip = '/kaggle/working/tokenized_dataset_50k'
output_zip_file = '/kaggle/working/tokenized_dataset_50k.zip'
create_zip_from_directory(directory_to_zip, output_zip_file)


Zip file created at: /kaggle/working/tokenized_dataset_50k.zip
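
For reference, the standard library's `shutil.make_archive` produces a comparable archive in one call, with paths stored relative to the zipped directory. A minimal, self-contained sketch on a temporary directory; the directory and file names are placeholders, not the real dataset files.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in directory tree (hypothetical file names)
    source_dir = os.path.join(tmp, "tokenized_dataset_demo")
    os.makedirs(os.path.join(source_dir, "nested"))
    for rel_path in ("state.json", os.path.join("nested", "data.arrow")):
        with open(os.path.join(source_dir, rel_path), "w") as f:
            f.write("placeholder")

    # make_archive appends ".zip" to base_name and stores paths relative to root_dir
    archive_path = shutil.make_archive(
        os.path.join(tmp, "tokenized_dataset_demo"), "zip", root_dir=source_dir
    )
    with zipfile.ZipFile(archive_path) as zf:
        # Skip directory entries so only files are listed
        archived = sorted(n for n in zf.namelist() if not n.endswith("/"))
    print(archived)
```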


## Story Generation 2

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import torch

In [2]:
import re

def preprocess_data(file_path):
    # Read and parse the dataset
    with open(file_path, 'r', encoding='utf-8') as f:
        data = f.read()
    # Extract all keyword-story pairs
    pattern = re.compile(r'<\|keywords\|>(.*?)<\|story\|>(.*?)<\|endoftext\|>', re.DOTALL)
    matches = pattern.findall(data)

    data_pairs = []

    # Keep at most the first 50,000 pairs
    for keywords, story in matches[:50000]:
        # Format the input text with special tokens
        formatted_text = f"<|keywords|> {keywords.strip()} <|story|> {story.strip()} <|endoftext|>"
        data_pairs.append(formatted_text)
    return data_pairs
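
To see what the regular expression in `preprocess_data` captures, here is the same pattern applied to a short made-up sample. The sample text is an assumption about the file layout, inferred from the pattern itself. `re.DOTALL` lets `.` match newlines, so a story spanning several lines is captured as one group.

```python
import re

PAIR_PATTERN = re.compile(
    r'<\|keywords\|>(.*?)<\|story\|>(.*?)<\|endoftext\|>', re.DOTALL
)

# Two made-up records in the assumed file format
sample = (
    "<|keywords|> cat, key, attic <|story|> A cat found a key.\n"
    "It was in the attic. <|endoftext|>\n"
    "<|keywords|> robot, city <|story|> A robot walked alone. <|endoftext|>\n"
)

pairs = [(kw.strip(), story.strip()) for kw, story in PAIR_PATTERN.findall(sample)]
for kw, story in pairs:
    print(repr(kw), "->", repr(story))
```

The non-greedy `.*?` groups stop at the first following marker, which keeps adjacent records from being merged into one match.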

In [3]:
from transformers import GPT2Tokenizer
from datasets import Dataset

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Add a padding token (GPT-2 does not have one by default)
tokenizer.pad_token = tokenizer.eos_token # Use EOS token as padding
tokenizer.add_special_tokens({"additional_special_tokens": ["<|keywords|>", "<|story|>", "<|endoftext|>"]})

# Load the GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# Load and preprocess the dataset from the file
data_pairs = preprocess_data('/kaggle/input/storydataset2/formatted_text.txt')  # Replace with your file path

# Convert it to Hugging Face Dataset format
train_dataset = Dataset.from_dict({'text': data_pairs})

def tokenize_function(examples):
    encoding = tokenizer(
        examples['text'], 
        padding="max_length",  # Ensures uniform length
        truncation=True, 
        max_length=512, 
        return_tensors="pt"
    )
    encoding["labels"] = encoding["input_ids"]  # Use input_ids as labels
    return encoding



train_dataset = train_dataset.map(tokenize_function, batched=True)


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [4]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

# Split into training and validation
tokenized_dataset = train_dataset.train_test_split(test_size=0.1)

# 4. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results-model2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_dir="./logs",
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator  # Handles padding during training
)
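
A rough estimate of how long this configuration runs, as a worked calculation. It assumes 50,000 preprocessed pairs, a single device, and no gradient accumulation; on a multi-GPU Kaggle session the effective batch size, and therefore the step count, would differ. The `training_steps` helper is hypothetical.

```python
import math

def training_steps(num_examples, test_size, batch_size, epochs,
                   num_devices=1, grad_accum=1):
    """Estimate optimizer steps per epoch and overall for a train/test split."""
    num_train = num_examples - round(num_examples * test_size)
    effective_batch = batch_size * num_devices * grad_accum
    per_epoch = math.ceil(num_train / effective_batch)
    return per_epoch, per_epoch * epochs

per_epoch, total = training_steps(50_000, test_size=0.1, batch_size=4, epochs=3)
# The last value is the number of 500-step evaluation/checkpoint points
print(per_epoch, total, total // 500)
```

With `save_total_limit=2`, only the most recent checkpoints are kept on disk, alongside the best one when `load_best_model_at_end=True`.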



In [6]:
# Fine-tune the model
trainer.train()

# 7. Save the Model
model.save_pretrained("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")

In [16]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model_path = "./fine_tuned_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)

# Ensure model is in evaluation mode
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50259, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50259, bias=False)
)
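
The embedding size of 50259 in the printout is GPT-2's original 50257-token vocabulary plus the two genuinely new special tokens, `<|keywords|>` and `<|story|>` (`<|endoftext|>` already exists in the base vocabulary). As a sanity check, the parameter count can be derived from the layer shapes shown above. This sketch assumes `lm_head` shares its weight with `wte`, which is GPT-2's default weight tying.

```python
def gpt2_parameter_count(vocab_size, n_positions=1024, d_model=768, n_layers=12):
    """Count trainable parameters from the layer shapes in the printed module tree."""
    d_ff = 4 * d_model
    embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe
    layer_norm = 2 * d_model                                   # weight + bias
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = 2 * layer_norm + attention + mlp
    # lm_head is assumed to be tied to wte, so it adds no parameters of its own
    return embeddings + n_layers * block + layer_norm  # final term is ln_f

base_vocab = 50257              # GPT-2's original vocabulary size
resized_vocab = base_vocab + 2  # plus <|keywords|> and <|story|>
print(resized_vocab, gpt2_parameter_count(resized_vocab))
```

The two added tokens contribute only 2 x 768 extra embedding weights, which is negligible next to the roughly 124M parameters of the base model.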

In [17]:
def generate_story(keywords):
    prompt = f"<|keywords|>{keywords}<|story|>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_length=300,
        do_sample=True,
        temperature=0.9,
        top_p=0.92,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id
    )
    story = tokenizer.decode(output[0], skip_special_tokens=False)
    story = story.split("<|story|>")[1].replace("<|endoftext|>", "").strip()
    return story
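
The `repetition_penalty=1.2` argument is what discourages the loops seen in the first model's outputs. Conceptually, the score of every token that has already been generated is lowered before the next token is chosen: positive scores are divided by the penalty and negative scores are multiplied by it. A plain-Python sketch of that rule on made-up logits; the function name is illustrative and this is not the library code itself.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [3.0, 1.5, -0.5, 0.2]  # made-up scores for a 4-token vocabulary
already_generated = [0, 2, 0]       # token ids produced so far
penalised = apply_repetition_penalty(toy_logits, already_generated)
print(penalised)
```

Tokens 0 and 2 drop in score while tokens 1 and 3 are unchanged, so unseen tokens become relatively more likely.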

In [18]:
generate_story("day, girl, named, Lily, found, needle, room, knew, play, wanted, share, mom, sew, button, shirt, went")

'One day a little girl named Lily found a needle in her room. She knew it was very special and couldn\'t play with it because she wanted to share it with her mom. Lily wanted the needle to be clean and shiny. So, she went to her mom and said, "I can sew a button on my shirt." Her mom was proud of her and they both went to put the button on the shirt. From that day onward, Lily knew that if she wanted something clean and pretty, she could share it with her mom.'
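
One simple way to judge outputs like the one above is to check how many of the requested keywords actually appear in the story. The helper below is a hypothetical illustration, not part of the notebook. It matches single-word keywords only and ignores case, so multi-word phrases and inflected forms would count as missing.

```python
import re

def keyword_coverage(keywords, story):
    """Return (fraction of keywords found, list of missing keywords), case-insensitive."""
    wanted = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    if not wanted:
        return 0.0, []
    words = set(re.findall(r"[a-z']+", story.lower()))
    missing = [k for k in wanted if k not in words]
    return (len(wanted) - len(missing)) / len(wanted), missing

coverage, missing = keyword_coverage(
    "day, girl, needle, dragon",
    "One day a little girl named Lily found a needle in her room.",
)
print(coverage, missing)
```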

In [19]:
keywords = "Moonlight Secret Journey Treasure Whisper Forest Shadow Magic Mystery Adventure"
print(generate_story(keywords))

Dungeon: On Behalf of God, there was a mighty guardian. He had the biggest and strongest will in the whole of the forest - and he was very powerful.  The guardian was so mighty that nobody could ever be able to be trusted. So every day, the guardian would go quietly into the deepest, darkest of dark places. But each time he went through this mysterious, frightening secret, nobody was able to be trusted.  Moral ofThe Day’s story is never too safe for you! When you are strong enough?’ “If only you could trust God.’


In [20]:
# Example: Generate a story
keywords = "Opnce upon a time, wand, danger dragon , black castle"
print(generate_story(keywords))
print("==================")
# Example: Generate a story
keywords = "Once upon a time, wand, danger dragon , black castle"
print(generate_story(keywords))

Openceupon was a brave and fierce warrior. He was always ready to fight when it was his turn!  The dragon was very strong and would often go out on the castle. But he was also very brave of himself.  One day, Joe had to face an even more fierce dragon! It was big and strong, like no other dragon in the whole castle.   Joe's bravery against such a fierce dragon made him even stronger than ever before.  Joe is now only one brave warrior but he is still able to do great things with his sword.
Once upon the time there was an evil dragon. The dragon was very dark and scary and could not go anywhere.  He had to go away when it was too dangerous. One day, he was brave enough to go into another castle.  But this time, it was much safer. The dragon was still safe from being in his dark castle.  He was ready for anything!


In [14]:
# Step 1: Improved dataset preprocessing
# def generate_random_pairs(input_file, output_file, num_pairs=50000):
#     with open(input_file, 'r') as file:
#         data = file.read().split('<|endoftext|>')
#     pairs = random.sample(data, min(num_pairs, len(data)))
#     output_text = ""
#     for pair in pairs:
#         if "Keywords:" in pair and "Story:" in pair:
#             kw, story = pair.split("Story:", 1)
#             kw = kw.replace("Keywords:", "").strip()
#             story = story.strip()
#             output_text += f"<|keywords|>{kw}<|story|>{story}<|endoftext|>\n"
#     with open(output_file, 'w') as file:
#         file.write(output_text)

# Step 2: Training with special tokens
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
# tokenizer.add_special_tokens({"additional_special_tokens": ["<|keywords|>", "<|story|>", "<|endoftext|>"]})
# model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
# model.resize_token_embeddings(len(tokenizer))

# Step 3: Generate with better prompts
# def generate_story(keywords):
#     prompt = f"<|keywords|>{keywords}<|story|>"
#     input_ids = tokenizer.encode(prompt, return_tensors="pt")
#     output = model.generate(
#         input_ids,
#         max_length=300,
#         do_sample=True,
#         temperature=0.9,
#         top_p=0.92,
#         repetition_penalty=1.2,
#         pad_token_id=tokenizer.eos_token_id
#     )
#     story = tokenizer.decode(output[0], skip_special_tokens=False)
#     story = story.split("<|story|>")[1].replace("<|endoftext|>", "").strip()
#     return story

In [15]:
# # Example usage:
# input_file = '/kaggle/input/keyword-story-dataset/keyword_story_dataset.txt'  # Replace with the path to your input file
# output_file = '/kaggle/working/keyword_story_dataset_50k.txt'  # Replace with the path where you want to save the output

# generate_random_pairs(input_file, output_file)

In [None]:
# generate_story("day, girl, named, Lily, found, needle, room, knew, play, wanted, share, mom, sew, button, shirt, went")

In [21]:
keywords = "An astronaut drifts alone in space, staring at the ruins of an ancient civilization on a forgotten planet, A detective dusts off an old book, revealing a hidden map that could expose a powerful secret society, A lone robot wanders through an abandoned city, searching for signs of the last human survivor"
print(generate_story(keywords))

Once there was An astronaut who wasn't afraid to be brave. He's just staring around at his ruins, staring up at the terrible ruins of an ancient civilization.  Suddenly he sees something shiny and brilliant. It's a tiny map! The astronaut's eyes are glimmering, but he's still scared.  The explorer looks carefully behind the map, until he's out of sight. He's only three-years-old - but now he is strong enough to reveal it once more.  The explorer has revealed the secrets from the ancient city, so all the robots and humans must be brave too. The astronaut is glad he's been brave, and no longer can be fearful. He's also glad he's able Toompose himself with the map as he walks away.


In [24]:
keywords = "forensic team, airoplan in sky, child are playing"
print(generate_story(keywords))

Forensic team has a Forensic team. It is very organized. Every child is in the team. The team members are all united in one single operation.  In this operation, they do everything together. They act like doctors, nurses and other special people. They act like the Forensics team and they act like the other children. Everyone is strong and united by their work.  The forensic team have a great time at the same time. They act like the other children and act like the Forensics team and act like the other children. They act like the doctors who act with power. And everyone is always united as the team members act like the Forensics team and act like the other children too!  The Forensic teams act and act for many years. Everywhere they go, everyone is there to act together. They act like the doctors who act with power and act like the Fore Forensic team. And everyone is so proud of them.


In [27]:
!zip -r fine_tuned_gpt2.zip '/kaggle/working/fine_tuned_gpt2'



  adding: kaggle/working/fine_tuned_gpt2/ (stored 0%)
  adding: kaggle/working/fine_tuned_gpt2/added_tokens.json (deflated 20%)
  adding: kaggle/working/fine_tuned_gpt2/config.json (deflated 51%)
  adding: kaggle/working/fine_tuned_gpt2/tokenizer_config.json (deflated 71%)
  adding: kaggle/working/fine_tuned_gpt2/model.safetensors (deflated 7%)
  adding: kaggle/working/fine_tuned_gpt2/vocab.json (deflated 68%)
  adding: kaggle/working/fine_tuned_gpt2/generation_config.json (deflated 24%)
  adding: kaggle/working/fine_tuned_gpt2/merges.txt (deflated 53%)
  adding: kaggle/working/fine_tuned_gpt2/special_tokens_map.json (deflated 81%)
