<a href="https://colab.research.google.com/github/lizarci3/gtp2_film_generation/blob/master/film_script_generation_gtp2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background

Original article can be found [here](https://towardsdatascience.com/film-script-generation-with-gpt-2-58601b00d371)

Repo [here](https://github.com/cdpierse/script_buddy_v2)

The author used about 60 MB of film scripts scraped from the Internet Movie Script Database (IMSDB) to fine-tune GPT-2 to write film scripts.




The author had only ~1,300 scripts to work with. However, a screenplay has about 30,000 words on average, so the dataset contains close to 40 million words.
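
As a quick sanity check on that estimate, the arithmetic can be written out. Both inputs are the rounded figures quoted above, not measured values:

```python
# Rough corpus-size estimate from the rounded figures quoted above.
num_scripts = 1300          # approximate number of scripts
words_per_script = 30_000   # approximate average screenplay length

estimated_words = num_scripts * words_per_script
print(f"{estimated_words:,} words")  # 39,000,000 words
```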

The author wanted the model to be able to generate entire sequences of scripts with mixed scripted elements in each sequence.

This fine-tuning is based on Hugging Face's language-model fine-tuning example, run_language_modeling.py.

In [1]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn’t guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp36-none-any.whl size=7413 sha256=60859cab976680383b695afa7644e649c9e617d05a06ee16a6204b13dd1255d3
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 12.8 GB  |     Proc size: 111.2 MB
GPU RAM Free: 11441MB | Used: 0MB | Util   0% | Total     11441MB


# Preparing the data

The script data is loaded into the model in batches, where the data has already been tokenized for GPT-2. In the repo, the ScriptData class splits the entire dataset into tokenized blocks of tensors.

Once the data has been properly prepared, these blocks are loaded in batches into GPT-2 in a training loop.
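
To make the block-splitting step concrete, here is a minimal pure-Python sketch. The helper name `split_into_blocks` and the toy ids are hypothetical; the real ScriptData class works on GPT-2 token ids and a block size of 512.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a flat list of token ids into consecutive blocks of block_size.

    Trailing tokens that do not fill a whole block are dropped.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


toy_ids = list(range(10))            # stand-in for real token ids
blocks = split_into_blocks(toy_ids, block_size=4)
print(blocks)                        # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The last two ids (8 and 9) do not fill a block, so they are left out.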

In [2]:
!pip install git+https://github.com/huggingface/transformers #just doing a pip install transformers creates some sync problems

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-_io5t9k1
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-_io5t9k1
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup
import numpy as np
import os
import logging
import pickle
import random

The gpt2-medium model used in the original work has 24 layers and ~345 million parameters, and took ~6 h to train (3 epochs). The cell below loads the smaller gpt2 checkpoint (12 layers, hidden size 768).

The first step is to load the pre-trained model and tokenizer from the transformers package.

In [4]:
device = 'cpu'

if torch.cuda.is_available():
  device = 'cuda'

device

'cuda'

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

* To fine-tune this pre-trained model you need a training loop that progressively loads batches of script sequences from the entire dataset.
* Each batch is a tokenized block of tensors from the data (built in ScriptData).
* An important parameter is the batch size. Large batch sizes can quickly exhaust GPU memory. To start, choose a batch size of 1 and then test how far you can increase it.
* In the original work the author used a batch size of 7.
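
When GPU memory caps the per-step batch, a common workaround is gradient accumulation: run several small batches, let the gradients add up, and only then take an optimizer step. The sketch below shows the idea on a hypothetical one-parameter model with a hand-written gradient. The names `grad` and `accumulate_and_step` and the toy numbers are illustrative and not taken from the repo.

```python
def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2 * (w * x - y) * x


def accumulate_and_step(w, micro_batches, lr):
    """Sum gradients over all micro-batches, then apply one SGD update."""
    accumulated = 0.0
    for x, y in micro_batches:
        accumulated += grad(w, x, y)   # what repeated loss.backward() calls do
    return w - lr * accumulated        # what a single optimizer.step() does


w = 0.0
micro_batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data with y = 2 * x
w = accumulate_and_step(w, micro_batches, lr=0.01)
print(w)  # moves from 0.0 toward the true slope of 2
```

Because the gradients are summed rather than averaged, the effective step grows with the number of accumulated batches, which is worth keeping in mind when choosing the learning rate.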

In [6]:
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
)


# Preparing the Data

The following ScriptData class splits the dataset into tokenized blocks of tensors. These blocks will then be loaded in batches into the training loop.

In [7]:
FILE_PATH = "/content/drive/My Drive/WJ/film_text.txt" # ~ 60 MB
logger = logging.getLogger(__name__)

class ScriptData(Dataset):

  def __init__(
      self, #instance of the class ScriptData
      tokenizer: PreTrainedTokenizer,
      file_path: str, 
      block_size = 512, # Fine-tuning item
      overwrite_cache = False
  ):

      assert os.path.isfile(file_path) #assert raises an error if condition False

      block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)

      directory, filename = os.path.split(file_path)

      # Build the path/filename for the cached file so that it is stored in the
      # same folder as the data and records which block size was used
    
      cached_features_file = os.path.join(directory,"gpt2"+"_"+str(block_size)+"_"+filename)

      #if the file already exists and if overwrite_cache is set to False don't overwrite
      if os.path.exists(cached_features_file) and not overwrite_cache:
        logger.info(f"Loading features from cached file {cached_features_file}")

        with open(cached_features_file, 'rb') as cache:
          self.examples = pickle.load(cache)
          logger.debug("Loaded examples from cache")

      else:

        logger.info(f"Creating features from file {filename} at {directory}")

        self.examples = []

        with open(file_path, encoding="utf-8") as f:
          text = f.read()
          logger.debug("Successfully read text from file")

        # convert_tokens_to_ids converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary
        tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

        # slice the tokenized text in steps of block_size
        # and append each block to examples

        for i in range(0, len(tokenized_text) - block_size + 1, block_size):
          self.examples.append(
              tokenizer.build_inputs_with_special_tokens( # adds the tokenizer's special tokens, if it defines any
                  tokenized_text[i : i + block_size]
              )
          )

        logger.info(f"Saving features into cached file {cached_features_file}")

        # save it
        with open(cached_features_file, "wb") as cache:
          pickle.dump(self.examples, cache, protocol=pickle.HIGHEST_PROTOCOL)

  def __len__(self):
    return len(self.examples)

  def __getitem__(self, item):
    return torch.tensor(self.examples[item], dtype=torch.long)




# Fine-tuning GPT-2: Training

A GPU is necessary when training this model. We are using a dataset of film scripts of about 60 MB for training. This text was prepared by scraping IMSDB (see the repo for the specifics of the scraping).

As discussed above, to fine-tune the model (that is, optimize it on a custom dataset of tokenized text) you need a training loop that progressively loads batches of script sequences from the dataset.

* Each batch (a batch of tokenized tensors) is run through the language model head as both its input and its target labels.
* From this step we get back the loss and the logits (i.e., prediction scores), and run the backward pass on the loss to compute the gradients.
* Every X batches, run an evaluation step that generates a batch of text. This helps us understand how well the model is being optimized and fine-tuned to the specific text.
* The generate function in transformers provides a number of different decoding methods to get the best results.
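
Passing the same tensor as input and labels works because the language-modeling head shifts the labels internally, so each position is scored on predicting the token that follows it. A minimal pure-Python sketch of that pairing, using made-up token ids and a hypothetical helper name:

```python
def next_token_pairs(token_ids):
    """Return (context_token, target_token) pairs for next-token prediction."""
    inputs = token_ids[:-1]    # every token except the last is a context position
    targets = token_ids[1:]    # every token except the first is a target
    return list(zip(inputs, targets))


toy_ids = [101, 7, 42, 9]          # hypothetical token ids
print(next_token_pairs(toy_ids))   # [(101, 7), (7, 42), (42, 9)]
```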

In [None]:
output_dir = '/content/drive/My Drive/WJ/'

In [None]:
dataset = ScriptData(tokenizer= tokenizer, file_path= FILE_PATH )
script_loader = DataLoader(dataset,batch_size=4,shuffle=True)



In [None]:
type(script_loader)

torch.utils.data.dataloader.DataLoader

### Parameters

In [None]:
BATCH_SIZE = 1 #starting point, author used in the end a batch_size of 7
EPOCHS = 1 # the author mentions he used in total 3 full epochs lasting ~6h for training
LEARNING_RATE = 0.00002
WARMUP_STEPS = 10000

Initialize the optimizer and the scheduler. The loss and batch counters are set to zero just before the training loop.

In [None]:
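model.train()

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
# One optimizer step is taken every BATCH_SIZE loader batches (see the training loop).
# num_training_steps=-1 makes the linear schedule return a learning rate of 0 once
# warmup ends, so pass the actual number of optimizer steps instead.
total_steps = max(1, (len(script_loader) * EPOCHS) // BATCH_SIZE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)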
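
To see what the scheduler does to the learning rate, here is a pure-Python sketch of a linear warmup followed by a linear decay. It is meant to mirror the multiplier applied by get_linear_schedule_with_warmup, but the function name `linear_schedule_multiplier` and the step counts are hypothetical, and the library source remains the authority.

```python
def linear_schedule_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay: ramp linearly from 1 down to 0 at total_steps, never below 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 0.00002
for step in (0, 50, 100, 550, 1000):
    factor = linear_schedule_multiplier(step, warmup_steps=100, total_steps=1000)
    print(step, base_lr * factor)
```

With these toy settings the factor is 0.5 halfway through warmup, 1.0 when warmup ends, 0.5 halfway through the decay, and 0.0 at the final step.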



In [None]:
device

'cuda'

In [None]:
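def calculate_perplexity(encoded_text, input_stride):
  # encoded_text: tensor of token ids with shape (1, sequence_length)
  # input_stride: number of tokens scored per forward pass
  lls = []

  for i in range(1, encoded_text.size(1), input_stride):
    begin_loc = max(i + input_stride - encoded_text.size(1), 0)
    end_loc = i + input_stride

    input_ids = encoded_text[:,begin_loc:end_loc].to(device)

    # score only the last input_stride tokens; -100 makes the loss ignore the rest
    target_ids = input_ids.clone()
    target_ids[:,:-input_stride] = -100

    with torch.no_grad():
      ppl_output = model(input_ids, labels=target_ids)
      # the model returns the mean loss over the scored tokens, so scale it back to a sum
      log_likelihood = ppl_output[0]*input_stride

    lls.append(log_likelihood)

  # exponentiate the summed negative log-likelihood divided by i (the start of
  # the last window), which approximates the per-token average
  perplex = torch.exp(torch.stack(lls).sum()/i)

  return(perplex.item())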
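
Stripped of the sliding window, perplexity is the exponential of the average negative log-likelihood per token. A minimal pure-Python sketch with made-up token probabilities (the function name `perplexity` is illustrative):

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options, so this is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values mean the model finds the text less surprising; a perplexity of 1 would mean every token was predicted with certainty.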

## Just for testing

In [None]:
script_count = 0
sum_loss = 0.0
batch_count = 0

# input text: You can use input_ids or bos_token_id to start your text generation
# bos_token_id should be 1 positive int (in the setup we have below, it initializes with a random word)
# in model generate choose how you want your text to begin

test_text = "Once upon a time there was a dog"
tokenized_test = torch.tensor(tokenizer.encode(test_text)).unsqueeze(0).to(device)

# or
bos_token_id_gen = random.randint(1,30000) 


# for perplexity
stride = 100

for epoch in range(EPOCHS):
  print(f"EPOCH {epoch} started"+'='*30)

  j = 0
  nsamples = 199

  for idx, script in enumerate(script_loader):

    if j>nsamples:
      break

    else:
      #print(idx, script, script[0].shape) #512 was what we used as block size in ScriptData

      outputs = model(script.to(device), labels=script.to(device))

      loss, logits = outputs[:2] # language modeling loss and prediction scores of the language modeling head 
      loss.backward()

      sum_loss = sum_loss + loss.detach().data
      script_count = script_count + 1
      #print('Sum loss', sum_loss, 'script_count', script_count)
      
      #once we have loaded enough scripts == batch_size

      j = j + 1 

      if script_count == BATCH_SIZE: #7 in original text
        #print('script_count equal to batch_size')
        script_count = 0 # re-start script counter
        batch_count = batch_count + 1 # how many full batches we have
        #print('batch_count', batch_count)
        
        optimizer.step() # perform an optimization step
        scheduler.step()
        optimizer.zero_grad() # clear the gradients from the last step, PyTorch accumulates the gradients so before starting to propagate we need to set them to zero
        model.zero_grad()


        ## Unlike the jokes generation script, this script generates sample text
        ## every 200 optimizer steps to see how training is doing
       
        if batch_count == 200:
            model.eval()  
            print(f"sum loss {sum_loss}")

            ## see https://huggingface.co/blog/how-to-generate
           
            sample_outputs = model.generate( #function added since version 2.4
                                    #bos_token_id=bos_token_id_gen, # beginning-of-sentence token, used when no prompt (input_ids) is provided
                                    input_ids = tokenized_test, # (optional) torch.LongTensor of shape (batch_size, sequence_length) used as the prompt for the generation
                                    do_sample=True, # if False, greedy decoding is used; otherwise the next token is sampled from the predicted distribution
                                    top_k=50, # number of highest-probability vocabulary tokens to keep for top-k filtering; an integer >= 1. Defaults to 50.
                                    max_length = 1000, # maximum length of the sequence to be generated. Defaults to 20.
                                    top_p=0.95, # nucleus sampling: keep the most probable tokens whose cumulative probability reaches top_p; must be between 0 and 1. Defaults to 1.
                                    num_return_sequences=1 # number of independently computed returned sequences; set > 1 to choose between options. Defaults to 1.
                                )

            print("Output:\n" + 100 * '-')
            #print('bos_token_id_gen', bos_token_id_gen)
            for i, sample_output in enumerate(sample_outputs):

                  ppl = calculate_perplexity(sample_output.unsqueeze(0), stride) # score each returned sequence on its own
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
                  print('Perplexity', ppl)
            
            batch_count = 0
            sum_loss = 0.0
            model.train()
        

        






Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


sum loss 442.0477294921875
Output:
----------------------------------------------------------------------------------------------------
0: Once upon a time there was a dog who lived in the house with a little brother. When we began to run away to the house a boy saw a little dog that he knew. He thought it was a black cat and he went away, crying.

A man called me and told me, "I don't know the names of the men who killed the dog. It was in the neighbourhood, so I am very sorry."

A woman said, "There were two brothers living, and we ran after them. They were afraid of me because I am black, so they followed them and killed our little dog."

At first I did not speak to them and they did not stop us. I found out later that they were trying to escape to the south.

On Christmas I went there to have a tea. Then when I came out there there was a lady of the same family, who had a dog, and she found a small dog with it, and she was crying.

On the Christmas eve of Christmas there was anothe

In [None]:
ppl = calculate_perplexity(sample_outputs, 300)
ppl

24.807331085205078

# Breaking down preparing the data

In [None]:
FILE_PATH = "/content/drive/My Drive/WJ/film_text.txt" # ~ 60 MB
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

model.train()

In [None]:
assert os.path.isfile(FILE_PATH)

In [None]:
tokenizer.max_len, tokenizer.max_len_single_sentence

(1024, 1024)

In [None]:
block_size = 512 # I am assuming this is the size of the block that will be loaded into the training loop

print('tokenizer max len', tokenizer.max_len, 'tokenizer max len single sentence', tokenizer.max_len_single_sentence)

block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)

directory = "/content/drive/My Drive/WJ/"
filename ='film_text.txt'

block_size, directory, filename

tokenizer max len 1024 tokenizer max len single sentence 1024


(512, '/content/drive/My Drive/WJ/', 'film_text.txt')

In [None]:
cached_features_file = os.path.join(directory, "gpt2" + "_" + str(block_size) + "_" + filename)
cached_features_file

'/content/drive/My Drive/WJ/gpt2_512_film_text.txt'

In [None]:
overwrite_cache = False
logger = logging.getLogger(__name__)


if os.path.exists(cached_features_file) and not overwrite_cache: # if the cache already exists and overwrite_cache is False, load it instead of rebuilding
      print('Loading features from cached file')
      logger.info(f"Loading features from your cached file {cached_features_file}") # report event

      with open(cached_features_file, "rb") as cache:
                examples = pickle.load(cache) # deserialize the binary data (no self here: we are outside the class)
                logger.debug("Loaded examples from cache")

else:
      logger.info(f"Creating features from file {filename} at {directory}") # report event
      print('Creating features from file')

     

Creating features from file
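
The cache check above is a load-or-build pattern. As a self-contained sketch of the same idea, the snippet below uses a hypothetical helper `load_or_build`, a temporary directory, and a stand-in builder instead of the real tokenization step:

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn, overwrite_cache=False):
    """Return cached data if present, otherwise build it and cache it."""
    if os.path.exists(cache_path) and not overwrite_cache:
        with open(cache_path, "rb") as cache:
            return pickle.load(cache)
    data = build_fn()
    with open(cache_path, "wb") as cache:
        pickle.dump(data, cache, protocol=pickle.HIGHEST_PROTOCOL)
    return data


calls = []

def build_examples():
    calls.append(1)                      # track how often we rebuild
    return [[1, 2, 3], [4, 5, 6]]        # stand-in for tokenized blocks


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gpt2_512_demo.txt")   # hypothetical cache file
    first = load_or_build(path, build_examples)     # builds and writes the cache
    second = load_or_build(path, build_examples)    # reads the cache instead

print(first == second, len(calls))   # True 1
```

Keeping the block size in the cache filename, as the notebook does, avoids silently reusing blocks that were built with a different size.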


Let's break down what happens inside the else branch:

In [None]:
examples = [] # outside the class there is no self, so use a plain list

with open(FILE_PATH, encoding="utf-8") as f:
    text = f.read()
    print('read_file')
    logger.debug("Successfully read text from file")

# tokenize the text
# convert_tokens_to_ids converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

read_file


In [None]:
len(tokenized_text)

28066436

In [None]:
examples = []

for i in range(0, len(tokenized_text) - block_size + 1, block_size):
    examples.append(
        tokenizer.build_inputs_with_special_tokens(
            tokenized_text[i:i + block_size]
        )
    )


In [None]:
len(examples)

54817
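
The block count follows directly from the token count reported above and the block size. A quick check of that arithmetic, using the same range as the loop:

```python
num_tokens = 28_066_436   # len(tokenized_text) reported above
block_size = 512

num_blocks = len(range(0, num_tokens - block_size + 1, block_size))
dropped = num_tokens - num_blocks * block_size

print(num_blocks)   # 54817
print(dropped)      # 132
```

The 132 tokens at the end of the file do not fill a complete block, so they are not used for training.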

In [None]:
examples[9]

In [None]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

device

'cuda'

# Understanding model generate parameters


* temperature: Controls how "creative" the output is. The higher the temperature, the crazier the result, because the network is allowed to make sub-optimal predictions.

* prefix: How you want your text to begin.

* length: Number of tokens to generate (default = 1023, which is also the maximum).

* top_k: Limits the generated guesses to the top k guesses (the default of 0 disables this behavior; if the generated output is super crazy, you may want to set top_k=40).
* top_p: Nucleus sampling. Limits the generated guesses to a cumulative probability (top_p = 0.9 gets good results on a dataset).
* truncate: Truncates the text at a pre-determined sequence (e.g. if truncate=<|endoftext|>, the returned text will include everything before the first of those tokens). It may be useful to combine this with a smaller length if the input texts are short.
* include_prefix: If using truncate and include_prefix=False, the specified prefix won't be included in the returned text.

* num_beams: Greedy search can miss high-probability words hidden behind a low-probability word. With num_beams set to n, the model keeps track of the n most probable paths at each step and ends up choosing the one with the greatest overall probability, even if it did not look like the best choice at the beginning.

Beam search is usually not very useful for open-ended generation, where you don't have a specific length.


* no_repeat_ngram_size: Ensures that there are not too many repetitions in the text. The most common n-gram penalty makes sure that no 2-gram appears twice (no_repeat_ngram_size = 2).
* num_return_sequences: If you want to choose the best output for your text, you can set num_return_sequences > 1 and then pick the text you like the most (num_return_sequences <= num_beams).
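
To make temperature, top_k and top_p less abstract, here is a pure-Python sketch on a toy four-token vocabulary. The helper names and logits are made up for illustration, and library implementations work on tensors and differ in detail.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]               # toy scores for a 4-token vocabulary
probs = softmax(logits)
print(sorted(top_k_filter(probs, k=2)))      # [0, 1]
print(sorted(top_p_filter(probs, p=0.9)))    # [0, 1, 2]
print(softmax(logits, temperature=2.0))      # flatter than probs
```

Sampling then draws the next token from the filtered, renormalized distribution, which is why lower top_k or top_p values give more conservative text.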



# References

* generate https://huggingface.co/transformers/v2.9.1/_modules/transformers/modeling_tf_utils.html#TFPreTrainedModel.generate

* model generate https://huggingface.co/transformers/v2.9.1/main_classes/model.html

* https://huggingface.co/blog/how-to-generate