<a href="https://colab.research.google.com/github/iam-abbas/Transformer-Horror/blob/main/Story_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Horror Story Generator

##### ***Using PyTorch, Transformers and OpenAI's GPT-2***

*Installing Transformers*

In [1]:
!pip install -q transformers

*Importing all the required modules*

In [2]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AdamW, get_linear_schedule_with_warmup
from torch.utils.data import Dataset, DataLoader
import os

#### **Choosing a Model**
##### Transformers has 4 GPT-2 models
![Image by Jay Alammar from post The Illustrated GPT-2](https://i.imgur.com/yrIxPVX.png)

**Model Names**:
- `gpt2` (124M Model, often called GPT-2 small)
- `gpt2-medium` (345M Model)
- `gpt2-large` (774M Model)
- `gpt2-xl` (1558M Model)

*I wanted a lighter model than the full `gpt2-xl`; the cell below loads `gpt2-large`.*

In [3]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
    
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
model = model.to(device)

In [4]:
device

'cuda'

`choose_from_top`:
- Keeps the N most probable tokens, renormalizes their probabilities so they sum to 1, and samples the next token from that reduced distribution.

`generate_text`:
- At each prediction step, GPT-2 needs the whole preceding sequence to predict the next token. The function below tokenizes the starting text and then loops: each step predicts one new token and appends it to the sequence, which is fed back into the model on the next step. At the end, the token list is decoded back into text.
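
To make the top-N step concrete, here is a minimal standard-library sketch on a made-up 5-token distribution. The helper name `sample_from_top` and the probabilities are assumptions for illustration only; the notebook's own `choose_from_top` uses NumPy instead.

```python
import random

def sample_from_top(probs, n=3, rng=random):
    """Sample an index from the n most probable entries of probs."""
    # Indices of the n largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    # random.choices normalizes the weights, so only their relative sizes matter.
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy distribution over a 5-token vocabulary (made-up numbers).
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
random.seed(0)
samples = [sample_from_top(probs, n=3) for _ in range(1000)]

# Only the three most likely token ids (1, 3 and 4) can ever be drawn.
print(sorted(set(samples)))
```

A smaller `n` makes the output more predictable and more repetitive; a larger `n` lets rarer tokens through.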

In [5]:
def choose_from_top(probs, n=10):
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob)
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return int(token_id)

def generate_text(input_str, text_len = 100):
    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
    model.eval()
    with torch.no_grad():
        for i in range(text_len):
            # No labels are needed for generation; the first output is the logits.
            outputs = model(cur_ids)
            logits = outputs[0]
            softmax_logits = torch.softmax(logits[0,-1], dim=0) 
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=10)
            cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1) # Add the last word
        output_list = list(cur_ids.squeeze().to('cpu').numpy())
        output_text = tokenizer.decode(output_list)
        print(output_text)

### Testing the loaded model

In [6]:
generate_text("Once opon a time")

Once opon a time, they would be able to take advantage of the situation. They would then be able to get in front of the enemy team as well.

"If they are able to gain control of these areas, they would be able to make the enemy team lose a lot of time," says Zeng. "They would be able to make the entire game go faster."

But what if the enemy team already has a lead? What happens then? In that case, the team would have to


*That was interesting, wasn't it?*

## **Fine-tuning GPT-2 on Stories Dataset from Reddit**


In [7]:
class Text_Corpus(Dataset):
    def __init__(self, dataset_path = ''):
        super().__init__()
        corpus_path = os.path.join(dataset_path, 'stories.txt')
        self.token_list = []
        self.end_of_text_token = "<|endoftext|>"

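        with open(corpus_path) as f:
            data = f.read()
            # Despite the name, token_list holds whole stories, not tokens.
            self.token_list = data.split(self.end_of_text_token)

        # Wrap every story in end-of-text markers on both sides.
        for i in range(len(self.token_list)):
            self.token_list[i] = self.end_of_text_token+self.token_list[i]+self.end_of_text_token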

    def __len__(self):
        return len(self.token_list)

    def __getitem__(self, item):
        return self.token_list[item]
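
The splitting and wrapping in `Text_Corpus` can be tried on a tiny string. This is only a sketch: the two-story corpus is hypothetical, and the filtering variant at the end is an assumption of mine, not something the class above does.

```python
END = "<|endoftext|>"

# Hypothetical two-story corpus that ends with the delimiter.
raw = "First story." + END + "Second story." + END

# A trailing delimiter leaves an empty final piece after split().
pieces = raw.split(END)
print(pieces)

# Same wrapping as Text_Corpus.__init__: delimiter on both sides.
wrapped = [END + p + END for p in pieces]

# Optional variant (assumption, not in the notebook): drop blank pieces.
non_empty = [END + p.strip() + END for p in pieces if p.strip()]
print(len(wrapped), len(non_empty))
```

If the corpus file ends with the delimiter, the dataset therefore contains one extra item made of delimiters only.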

*Loading the dataset from `Text_Corpus`*

In [8]:
dataset = Text_Corpus()
# Each dataset item is a whole story (split on <|endoftext|>), so this counts stories, not tokens.
print("Number of Tokens Found:", len(dataset))
data_loader = DataLoader(dataset, batch_size=1, shuffle=True)

Number of Tokens Found: 713


*Assigning Parameters (Epochs, Batch Size, etc.)*

In [9]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [10]:
BATCH_SIZE = 1
EPOCHS = 20
LEARNING_RATE = 1e-5
WARMUP_STEPS = 10000
MAX_SEQ_LEN = 550
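
To see what `WARMUP_STEPS` does, here is a small sketch of a linear warmup followed by linear decay, which is how I understand `get_linear_schedule_with_warmup` to scale the learning rate. The function name `linear_warmup_factor` and the value of `TOTAL_STEPS` are assumptions for illustration.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

WARMUP_STEPS = 10000  # value from the cell above
TOTAL_STEPS = 50000   # hypothetical total, chosen only for this example

for step in (0, 1000, 10000, 30000, 50000):
    factor = linear_warmup_factor(step, WARMUP_STEPS, TOTAL_STEPS)
    print(step, factor)
```

With `LEARNING_RATE = 1e-5`, the multiplier after 1000 optimizer steps is 0.1, so the effective learning rate at that point is about 1e-6. A run with fewer than `WARMUP_STEPS` optimizer steps never reaches the full learning rate.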

### *Training the Model for 20 Epochs*

In [11]:
model = model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps = -1)
text_count = 0
sum_loss = 0.0
batch_count = 0

tmp_text_tens = None

for epoch in range(EPOCHS):

    print(f"EPOCH {epoch} started " + '=' * 30)
    for idx,text in enumerate(data_loader):
            
        text_tens = torch.tensor(tokenizer.encode(text[0])).unsqueeze(0).to(device)

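        # Drop stories that are too long to fit in one training sequence.
        if text_tens.size()[1] > MAX_SEQ_LEN:
            print("Skipping")
            continue
        # First story: start a new packed sequence.
        if not torch.is_tensor(tmp_text_tens):
            tmp_text_tens = text_tens
            print("Skipping")
            continue
        else:
            # The buffer is full: train on it and start a new one with this story.
            if tmp_text_tens.size()[1] + text_tens.size()[1] > MAX_SEQ_LEN:
                work_text_tens = tmp_text_tens
                tmp_text_tens = text_tens
            else:
                # Append the story (minus its leading <|endoftext|>) and keep packing.
                tmp_text_tens = torch.cat([tmp_text_tens, text_tens[:,1:]], dim=1)
                print("Skipping")
                continue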
                          
        outputs = model(work_text_tens, labels=work_text_tens)
        loss, logits = outputs[:2]                        
        loss.backward()
        sum_loss = sum_loss + loss.detach().data                    
        text_count = text_count + 1

        if text_count == BATCH_SIZE:
            text_count = 0    
            batch_count += 1
            optimizer.step()
            scheduler.step() 
            optimizer.zero_grad()
            model.zero_grad()
            
        if batch_count == 1000:
            print(f"sum loss {sum_loss}")
            batch_count = 0
            sum_loss = 0.0

Token indices sequence length is longer than the specified maximum sequence length for this model (4350 > 1024). Running this sequence through the model will result in indexing errors


Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
Skipping
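
The training loop above packs several short stories into one sequence before each forward pass, which is why most iterations only print "Skipping". The same logic can be sketched with plain lists of token ids. The helper `pack_sequences`, the stand-in id `EOT = 0` and the toy stories are assumptions for illustration, not part of the notebook.

```python
def pack_sequences(sequences, max_len):
    """Greedily concatenate token-id lists into chunks of at most max_len tokens."""
    packed, buffer = [], None
    for seq in sequences:
        if len(seq) > max_len:
            continue                   # too long on its own: dropped
        if buffer is None:
            buffer = list(seq)         # start the first chunk
        elif len(buffer) + len(seq) > max_len:
            packed.append(buffer)      # chunk is full: emit it for training
            buffer = list(seq)
        else:
            buffer = buffer + seq[1:]  # skip the duplicate leading delimiter
    return packed, buffer              # buffer is the unfinished last chunk

EOT = 0  # stand-in id for <|endoftext|> (hypothetical)
stories = [
    [EOT, 1, 2, EOT],
    [EOT, 3, EOT],
    [EOT, 4, 5, 6, EOT],
    [EOT] + [7] * 20 + [EOT],  # longer than max_len, so it is dropped
]
packed, leftover = pack_sequences(stories, max_len=8)
print(packed)
print(leftover)
```

Only the chunks in `packed` correspond to a forward and backward pass; the helper returns the unfinished chunk separately so the caller can decide what to do with it.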


### **Generating 5 Samples**

In [12]:
model.eval()
with torch.no_grad():
    
    for text_idx in range(5):

        cur_ids = torch.tensor(tokenizer.encode("The night she died she said to him")).unsqueeze(0).to(device)
        
        for i in range(250):
            # No labels are needed for generation; the first output is the logits.
            outputs = model(cur_ids)
            logits = outputs[0]
            softmax_logits = torch.softmax(logits[0,-1], dim=0)
            if i < 2:
                n = 15
            else:
                n = 3
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
            cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1)
            if next_token_id in tokenizer.encode('<|endoftext|>'):
                break
            
        output_list = list(cur_ids.squeeze().to('cpu').numpy())
        output_text = tokenizer.decode(output_list)
        # Note: str.capitalize() lowercases everything after the first character.
        print(f"SAMPLE {text_idx}: {output_text.capitalize()} \n")

SAMPLE 0: The night she died she said to him, "i love you."

"you're not the first one to die in my bed," she said.

"i'm not the first one to die," he said.

she said, "i know you're not the last one to die."

"i know i'm not the last one to die," he said.

"i know you're not the last one to die," she said.

"i'm going to be with you forever," he said.

"i know you're going to be with me forever," she said.

"i know you're going to be with me forever," he said.

he told her he loved her. she said, "i love you."

"i love you," he said.

he said, "you're my angel," she said.

"i'm going to be with you forever," he said. "you're my angel."

he told her she was beautiful and he loved her. she said, "i love you."

she said, "i love you."

"i love you," he said.

he said, "i know 

SAMPLE 1: The night she died she said to him, 'don't be sad. i'll be happy soon.'

"i don't think she was in the right place at the right time."<|endoftext|> 

SAMPLE 2: The night she died she said to him: "you'r

In [19]:
generate_text("It was the scariest face I have ever seen", 500)

It was the scariest face I have ever seen," he said.

Mr. Giffords was shot in the neck and left for dead by her assailant, Chris Cox, who later killed himself. The gunman also killed his wife.

Mr. Giffords, 52 years old, said he was not sure he wanted to talk about the case at a news conference, but he would. "I have been trying for a long time to tell the truth," he said.

He said he had never been afraid. He was not a political activist, he said, but had always had his eyes open and was trying to do what he thought was right.

"If it's a political statement, it's because I'm concerned about what I believe is going on around me," he said.

Mr. Giffords, who was born in Tucson and raised in Phoenix, said that he had voted for President Bush and was planning to do so again on Tuesday.

"We need to stop the madness, but if the madness has taken place, we need to make sure that we are not going to continue down this path," he said.

Mr. Giffords said he had been in touch with a group of