# Story Generation

## Remove Temporary Directory

In [1]:
!rm -r '/kaggle/working'

rm: cannot remove '/kaggle/working': Device or resource busy
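
The `rm` call above fails because it targets the directory itself. A likely cause (an assumption; the notebook does not confirm it) is that `/kaggle/working` is the notebook's current working directory or a mount point, which the kernel cannot delete. A minimal sketch of an alternative is to empty the directory and leave it in place. The helper name `clear_directory` is hypothetical.

```python
import shutil
from pathlib import Path

def clear_directory(path):
    """Delete everything inside `path` but keep the directory itself."""
    root = Path(path)
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # remove a sub-directory and its contents
        else:
            entry.unlink()         # remove a regular file or a symlink

# Intended use in this notebook (not executed here):
# clear_directory('/kaggle/working')
```

Because the directory is never removed, the kernel's working directory stays valid after the cleanup.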


## Story Generation

In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import torch

In [4]:
import re

def preprocess_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = f.read()
    # Extract all keyword-story pairs
    pattern = re.compile(r'<\|keywords\|>(.*?)<\|story\|>(.*?)<\|endoftext\|>', re.DOTALL)
    matches = pattern.findall(data)

    # Keep at most the first 50,000 pairs, re-formatted with the special tokens
    data_pairs = []
    for keywords, story in matches[:50000]:
        formatted_text = f"<|keywords|> {keywords.strip()} <|story|> {story.strip()} <|endoftext|>"
        data_pairs.append(formatted_text)
    return data_pairs
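
To see what the regular expression in `preprocess_data` captures, the sketch below applies the same pattern to a small in-memory string instead of the dataset file. The two sample records are hypothetical and only illustrate the expected `<|keywords|> ... <|story|> ... <|endoftext|>` layout.

```python
import re

# Same pattern as in preprocess_data; DOTALL lets a story span several lines
PATTERN = re.compile(r'<\|keywords\|>(.*?)<\|story\|>(.*?)<\|endoftext\|>', re.DOTALL)

# Hypothetical two-record sample in the dataset's format
sample = (
    "<|keywords|> cat, hat <|story|> A cat found a hat.\nIt wore the hat. <|endoftext|>\n"
    "<|keywords|> dog, ball <|story|> A dog chased a ball. <|endoftext|>"
)

# Each match is a (keywords, story) tuple; strip the surrounding whitespace
pairs = [(k.strip(), s.strip()) for k, s in PATTERN.findall(sample)]
for keywords, story in pairs:
    print(repr(keywords), "->", repr(story))
```

The lazy `(.*?)` groups stop at the first following marker, so consecutive records are not merged into one match.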

In [5]:
from transformers import GPT2Tokenizer
from datasets import Dataset

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # Use EOS token as padding
tokenizer.add_special_tokens({"additional_special_tokens": ["<|keywords|>", "<|story|>", "<|endoftext|>"]})

# Load the GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# Load and preprocess the dataset from the file
data_pairs = preprocess_data('/kaggle/input/storydataset2/formatted_text.txt')

# Convert it to Hugging Face Dataset format
train_dataset = Dataset.from_dict({'text': data_pairs})

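def tokenize_function(examples):
    # Pad/truncate every example to 512 tokens. Plain lists are returned
    # because Dataset.map stores the columns itself, so return_tensors is not needed.
    encoding = tokenizer(
        examples['text'],
        padding="max_length",  # Ensures uniform length
        truncation=True,
        max_length=512,
    )
    # Causal LM: labels are the input ids (the model shifts them internally).
    # Note: padded positions are included, so they also contribute to the loss.
    encoding["labels"] = encoding["input_ids"]
    return encoding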

train_dataset = train_dataset.map(tokenize_function, batched=True)
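
Because the labels are a copy of `input_ids`, the padded positions are scored by the loss as well. One common remedy, shown here as a sketch and not as what this notebook did, is to set the labels of padded positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The mask is taken from `attention_mask` and not from the pad id, since in this notebook the pad token is the EOS token and the real end-of-text label should be kept. The helper name `mask_padding_labels` and the toy batch are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Hypothetical batch: 50256 is GPT-2's EOS id, reused as padding in this notebook
input_ids = [[101, 102, 50256, 50256, 50256], [7, 8, 9, 10, 50256]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Inside `tokenize_function` this could be used as `encoding["labels"] = mask_padding_labels(encoding["input_ids"], encoding["attention_mask"])`, assuming the tokenizer returns plain lists. Losses obtained this way would not be directly comparable with the ones logged in this notebook.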

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [6]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

# Split into training and validation
tokenized_dataset = train_dataset.train_test_split(test_size=0.1)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./story_generation_model",
    overwrite_output_dir=True,
    num_train_epochs=4,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_dir="./logs",
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator  # Handles padding during training
)
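
As a rough check on how `save_steps` and `eval_steps` relate to the length of training, the sketch below estimates the number of optimizer steps. It assumes 45,000 training examples (50,000 pairs minus the 10% validation split), no gradient accumulation, and that the last partial batch is kept. The number of GPUs is not stated in the notebook, so both one and two devices are shown. The function name `optimizer_steps` is hypothetical.

```python
import math

def optimizer_steps(num_examples, per_device_batch, num_devices, epochs):
    """Return (steps per epoch, total steps) for a given effective batch size."""
    effective_batch = per_device_batch * num_devices
    per_epoch = math.ceil(num_examples / effective_batch)
    return per_epoch, per_epoch * epochs

n_train = 45000  # assumed: 50,000 pairs minus the 10% held out for validation

for devices in (1, 2):  # the device count is an assumption
    per_epoch, total = optimizer_steps(n_train, per_device_batch=4,
                                       num_devices=devices, epochs=4)
    print(f"{devices} device(s): {per_epoch} steps/epoch, {total} steps in total")
```

Under these assumptions an epoch takes 11,250 steps on one device or 5,625 on two, so the logged loss table, which ends at step 5000, covers less than one epoch in either case.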

In [7]:
# Fine-tune the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


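| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.1326 | 0.954637 |
| 1000 | 0.9733 | 0.90276  |
| 1500 | 0.9368 | 0.871572 |
| 2000 | 0.8987 | 0.851485 |
| 2500 | 0.8824 | 0.838151 |
| 3000 | 0.8722 | 0.824667 |
| 3500 | 0.8662 | 0.817094 |
| 4000 | 0.8602 | 0.806334 |
| 4500 | 0.8413 | 0.801199 |
| 5000 | 0.8462 | 0.79603  |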
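
The validation losses in the table above can be converted to perplexity as a sketch, assuming each value is a mean per-token cross-entropy in nats, in which case perplexity is `exp(loss)`. The values are copied from the table. Since the labels in this notebook include padded positions, these figures are probably not comparable with perplexities computed on unpadded text.

```python
import math

# (step, validation loss) pairs copied from the training log above
val_loss = [
    (500, 0.954637), (1000, 0.90276), (1500, 0.871572), (2000, 0.851485),
    (2500, 0.838151), (3000, 0.824667), (3500, 0.817094), (4000, 0.806334),
    (4500, 0.801199), (5000, 0.79603),
]

# Perplexity is the exponential of the mean cross-entropy
perplexity = [(step, math.exp(loss)) for step, loss in val_loss]
for step, ppl in perplexity:
    print(f"step {step:>4}: perplexity {ppl:.3f}")
```

Read this way, validation perplexity falls from about 2.6 at step 500 to about 2.2 at step 5000, and it decreases at every evaluation.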


('./fine_tuned_gpt2/tokenizer_config.json',
 './fine_tuned_gpt2/special_tokens_map.json',
 './fine_tuned_gpt2/vocab.json',
 './fine_tuned_gpt2/merges.txt',
 './fine_tuned_gpt2/added_tokens.json')

In [8]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model_path = "./fine_tuned_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)

model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50259, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50259, bias=False)
)

In [9]:
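def generate_story(keywords):
    """Sample a story from the fine-tuned model, conditioned on a keyword string."""
    prompt = f"<|keywords|>{keywords}<|story|>"
    # encode() returns only input ids; no attention_mask is passed to generate(),
    # which is what triggers the attention-mask warning on the first call.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_length=300,          # prompt plus generated tokens
        do_sample=True,          # sample instead of greedy decoding
        temperature=0.9,
        top_p=0.92,              # nucleus sampling
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id
    )
    # Keep the special tokens when decoding so the story marker can be located
    story = tokenizer.decode(output[0], skip_special_tokens=False)
    story = story.split("<|story|>")[1].replace("<|endoftext|>", "").strip()
    return story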
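
Two details of `generate_story` can be sketched in isolation, without loading the model. First, the training examples were formatted with spaces around the keywords (`<|keywords|> ... <|story|> ...`), whereas the prompt above has none; matching the training format may help, though this is untested here. Second, `split("<|story|>")[1]` raises an `IndexError` if the marker is missing from the decoded text. The helper names `build_prompt` and `extract_story` are hypothetical, and unlike the original, `extract_story` cuts the text at the first `<|endoftext|>` and does not keep anything after it.

```python
KEYWORDS_TOKEN = "<|keywords|>"
STORY_TOKEN = "<|story|>"
EOS_TOKEN = "<|endoftext|>"

def build_prompt(keywords):
    """Format keywords with the same spacing used for the training examples."""
    return f"{KEYWORDS_TOKEN} {keywords.strip()} {STORY_TOKEN}"

def extract_story(decoded):
    """Return the text after the first story marker, up to the first EOS marker."""
    _, marker, tail = decoded.partition(STORY_TOKEN)
    if not marker:          # marker missing: fall back to the whole text
        tail = decoded
    return tail.split(EOS_TOKEN, 1)[0].strip()

# Hypothetical decoded model output
decoded = "<|keywords|> cat, hat <|story|> A cat found a hat. <|endoftext|><|endoftext|>"
print(build_prompt(" cat, hat "))
print(extract_story(decoded))
```

In `generate_story`, `build_prompt(keywords)` would replace the f-string and `extract_story(...)` the final `split`/`replace` line.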

In [10]:
generate_story("day, girl, named, Lily, found, needle, room, knew, play, wanted, share, mom, sew, button, shirt, went")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


"One day a little girl named Lily found a needle in her room. She was very stubborn and didn't know how to do it herself. So she decided not go near the needle.   But one day, she found something special: a tiny button on her shirt! It was shiny and soft and pretty. She loved the button so much that she would never want anyone else to see it.   And from then on, she always did what her mom said. She just wanted to be brave and share her favorite thing with everyone."

In [13]:
keywords = "Moonlight Secret Journey Treasure Whisper Forest Shadow Magic Mystery Adventure"
print(generate_story(keywords))

- Bob, Lily, brother, sister, daydreaming about, going, safari, excited, see, said, bring, mom, dad, went, arrived, saw, lotion, plants, looked, touched, skin, smiled, felt, happy, helped  , followed, house, got, took, care, cleaned, dried and dressed. Inside the secret cave, there was a lotion. It was very strong and smelled good. All of it was dirty.  The party was over so the kids could go home. They were all excited to visit them because they had never been here before! But when it finally came time for them to go, the mummy and dad were not sure what else to do.  They hurried outside and when they arrived at their safari, everybody was amazed! Everything in sight looked beautiful and warm, just like the lotion that had helped them with everything.  Finally after a while, the friends were able come back inside. They were safe, but they also had some pretty lotion too. It was soft and smelled nice.  And then, the treasure started pouring out of the surprise container into something 

In [14]:
keywords = "Once upon a time, wand, danger dragon , black castle"
print(generate_story(keywords))

Once upon every once in awhile there was an orange dragon. The dragon was very big and it was ready to fight!  One day the dragon was so strong that it could almost fight off all of its own. Everyone around it was amazed by how powerful the purple dragon was.  But then one day it found out that it had been warned about being too brave with its magic. The dragon's strength wasn't enough for it but it still wanted to be safe and close away from all the other dragons.  So the dragon decided to try again. This time it felt more confident than ever before. It didn’t want any more dragons to fight against it anymore.  And just like that, the dragon learned that sometimes you have to be brave and never give up on what you do best.


In [15]:
keywords = "day, girl, named, Lily, found, needle, room, knew, play, wanted, share, mom, sew, button, shirt, went"
generate_story(keywords)

'One day a little girl named Lily found a needle in her room. She knew it would be fun to play with. She wanted to share the needle with her mom.  So, Lily and her mom went to sew a button on her shirt. They were very careful with their sewing. They also had some small button on their shirt that they could wear together.  When they went home from sewing, Lily and her mom put the needle on her shirt. They did a good job sewing the button. And they all looked very cozy inside. From then On, they always made sure not to sew too many buttons because they thought it would be fun for them every day!'

In [16]:
keywords = "An astronaut drifts alone in space, staring at the ruins of an ancient civilization on a forgotten planet, A detective dusts off an old book, revealing a hidden map that could expose a powerful secret society, A lone robot wanders through an abandoned city, searching for signs of the last human survivor"
print(generate_story(keywords))

A brave new alien drifts away into space, staring back at his ruined city. He is always looking to uncover more secrets as he slowly dusts off the book, and the map reveals a large hidden secret underground where the strongest and most powerful humans have been found!


In [17]:
keywords = "forensic team, airoplan in sky, child are playing"
print(generate_story(keywords))

Forensic team has a unique arocutus. Every child have it! The team is very fast and strong! Everyone in the team is so patient and patient with their arocuts every day.  The teams are always together. The children are great at playing together and the arocutuses are special to them too. The team is like a powerful magic team that can do anything they set their mind to.


In [18]:
!zip -r fine_tuned_gpt2.zip '/kaggle/working/fine_tuned_gpt2'

  adding: kaggle/working/fine_tuned_gpt2/ (stored 0%)
  adding: kaggle/working/fine_tuned_gpt2/tokenizer_config.json (deflated 71%)
  adding: kaggle/working/fine_tuned_gpt2/merges.txt (deflated 53%)
  adding: kaggle/working/fine_tuned_gpt2/config.json (deflated 51%)
  adding: kaggle/working/fine_tuned_gpt2/added_tokens.json (deflated 20%)
  adding: kaggle/working/fine_tuned_gpt2/vocab.json (deflated 68%)
  adding: kaggle/working/fine_tuned_gpt2/model.safetensors (deflated 7%)
  adding: kaggle/working/fine_tuned_gpt2/special_tokens_map.json (deflated 81%)
  adding: kaggle/working/fine_tuned_gpt2/generation_config.json (deflated 24%)
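
The same archive can be produced from Python without spawning a shell, sketched here with the standard library's `shutil.make_archive`. The helper name `zip_directory` is hypothetical. Unlike the `zip -r` call above, which stores entries under `kaggle/working/fine_tuned_gpt2/`, this version stores them under the folder name only.

```python
import shutil
from pathlib import Path

def zip_directory(src_dir, archive_stem):
    """Create `<archive_stem>.zip` containing `src_dir` under its own folder name.

    Returns the path of the archive that was written.
    """
    src = Path(src_dir).resolve()
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(src.parent),  # paths in the archive are relative to this
        base_dir=src.name,         # only this folder is added
    )

# Intended use in this notebook (not executed here):
# zip_directory('/kaggle/working/fine_tuned_gpt2', '/kaggle/working/fine_tuned_gpt2')
```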
