# Prompt Generation for ROFT (Real or Fake Text?) -- http://roft.io/

Developed by Liam Dugan and Arun Kirubarajan in Spring 2020 ([GitHub](https://github.com/kirubarajan/roft.git))

## Step 1: Mount Drive and Clone the Repository

In [0]:
# Mount your google drive folder
from google.colab import drive
drive.mount('/content/drive')

# Change to the Google Drive folder and clone our repo
import os
os.chdir('/content/drive/My Drive')
!git clone https://github.com/kirubarajan/roft.git
os.chdir('/content/drive/My Drive/roft')

## Step 2: Install Dependencies

In [0]:
# GPT-2 currently only supports TensorFlow 1
%tensorflow_version 1.x
!pip3 install gpt-2-simple
import nltk 
nltk.download('punkt')

## Step 3: Sample Prompts from Human Text
At the moment, prompts are sampled only from the AI Dungeon training data, but we will expand this to include news and maybe blog posts later.

### Usage
1.   Determine desired NUM_GENERATIONS, MAX_PROMPT_LENGTH, and PERCENT_NONHUMAN
2.   Pick the desired SAMPLE_FILE
3.   That's it! Run and see!

### Notes
1.   For the PERCENT_NONHUMAN fraction of prompts, lengths are sampled *uniformly* from 1 to MAX_PROMPT_LENGTH - 1. The remaining prompts are all MAX_PROMPT_LENGTH sentences long (human text only).
2.   Prompts are only sampled from the AI Dungeon *test* file. We should probably concatenate the test and dev files to get maximum prompt diversity.
3.   The file is memory-mapped (mmap) instead of being loaded into RAM, so there is no need to worry about having a high-RAM instance of Colab Pro.
4.   We only sample from the start of generations (<|startoftext|>). This avoids cases where a prompt depends on context from an earlier, unseen part of the story.
5.   We use the NLTK punkt sentence tokenizer to split text into sentences. It is not a perfect tokenizer, so it might be worth looking into other options.
6.   There ARE duplicates. This is because the input text_adventures_test.txt has duplicates. We will probably add an MD5 hash check at some point in the future to get rid of these.
7.   Prompts are always sampled from the beginning of the file and not randomly. This is to avoid running the regex on the entire file for large corpora. This can be changed in the future, but for now, do not expect two runs to sample different stories.
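
The MD5 check mentioned in note 6 could look roughly like the sketch below. It is not part of the notebook yet: `dedupe_prompts` is a hypothetical helper, and it assumes each prompt is a list of sentence strings, as returned by `sample_ai_dungeon`.

```python
import hashlib

def dedupe_prompts(prompts):
  """Drop prompts whose sentence lists are identical, keeping the first occurrence."""
  seen = set()
  unique = []
  for prompt in prompts:
    # MD5 is used here only as a cheap fingerprint of the joined sentences
    digest = hashlib.md5('\n'.join(prompt).encode('utf-8')).hexdigest()
    if digest not in seen:
      seen.add(digest)
      unique.append(prompt)
  return unique

# Made-up example prompts; the first and third are exact duplicates
example = [['You are a knight.', 'You draw your sword.'],
           ['You are a wizard.'],
           ['You are a knight.', 'You draw your sword.']]
deduped = dedupe_prompts(example)
print(len(deduped))
```

Because prompts are truncated to random lengths, two prompts cut from the same duplicated story at different lengths would not be caught by this check. Hashing the full story span before truncation would be needed for that.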



In [0]:
import os, re, mmap, random
from nltk.tokenize import sent_tokenize

In [0]:
def random_sample_prompt_length(percent_nonhuman, max_prompt_length):
  if (random.random() > percent_nonhuman):
    return max_prompt_length
  else:
    return random.randint(1, max_prompt_length - 1)
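
As a quick sanity check of the sampling behaviour, the sketch below draws many lengths and counts them. The function body is repeated so the block runs on its own; the seed and the number of draws are arbitrary choices for illustration.

```python
import random
from collections import Counter

def random_sample_prompt_length(percent_nonhuman, max_prompt_length):
  # Same logic as the notebook cell, repeated so this block is self-contained
  if random.random() > percent_nonhuman:
    return max_prompt_length
  return random.randint(1, max_prompt_length - 1)

random.seed(0)  # arbitrary seed, only to make the check repeatable
NUM_DRAWS = 10000
counts = Counter(random_sample_prompt_length(0.75, 11) for _ in range(NUM_DRAWS))

# Print how often each length was drawn
for length in sorted(counts):
  print(length, counts[length])
```

With `percent_nonhuman = 0.75` and `max_prompt_length = 11`, roughly a quarter of the draws should come out as 11 (human text only), with the rest spread about evenly over 1 to 10.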

In [0]:
def sample_ai_dungeon(num_samples, max_prompt_length, percent_nonhuman):

  # AI Dungeon sample file is text_adventures_test.txt
  # (should probably update to include dev at some point)
  sample_file = './generation/text_adventures_test.txt'
  if not os.path.exists(sample_file):
    raise FileNotFoundError('AI Dungeon sample file "' + sample_file + '" does not exist')

  prompts = []
  successfully_sampled_prompts = 0
  with open(sample_file, 'r+b') as f:
    # mmap the file so we can regex it without loading it all into RAM
    data = mmap.mmap(f.fileno(), 0)

    # Grab all the spans of text that are between <|startoftext|> and <|endoftext|>
    # (use finditer instead of findall to only search for regex matches as necessary)
    pattern = re.compile(rb'<\|startoftext\|>((.|\n)*?)<\|endoftext\|>')
    for m in re.finditer(pattern, data):
      # If we're done sampling, no need to continue the loop
      if successfully_sampled_prompts >= num_samples: break

      # Randomly determine prompt length based on the specified percent nonhuman value
      prompt_length = random_sample_prompt_length(percent_nonhuman, max_prompt_length)

      # Use NLTK Sentence tokenizer to sample sentences and clean them
      tokenized_prompt = sent_tokenize(str(m.group(1), 'utf-8', 'ignore'))

      # Accept the prompt if it is longer than the desired length
      if len(tokenized_prompt) > prompt_length:
        prompts.append(tokenized_prompt[:prompt_length])
        successfully_sampled_prompts += 1

        print('Sampled prompt {0} of length {1}'.format(str(successfully_sampled_prompts), str(prompt_length)))
        for line in tokenized_prompt[:prompt_length]:
          print('\t' + repr(line))

  return prompts
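
To show what the delimiter regex in `sample_ai_dungeon` extracts, here is a minimal sketch on an in-memory byte string instead of the mmapped file. The sample text is made up, and `re.DOTALL` is swapped in for the `(.|\n)` alternation, which should match the same spans.

```python
import re

# Non-greedy match between the story delimiters; DOTALL lets '.' span newlines
pattern = re.compile(rb'<\|startoftext\|>(.*?)<\|endoftext\|>', re.DOTALL)

# Made-up stand-in for the contents of the sample file
data = (b'<|startoftext|>You wake up.\nIt is dark.<|endoftext|>\n'
        b'<|startoftext|>A door creaks.<|endoftext|>')

# Decode each captured span the same way the notebook does
stories = [str(m.group(1), 'utf-8', 'ignore') for m in pattern.finditer(data)]
print(stories)
```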

In [0]:
import random
import re
from nltk.tokenize import sent_tokenize

NUM_GENERATIONS = 56 # Number of prompts to sample
MAX_PROMPT_LENGTH = 11 # Maximum prompt length in sentences (nonhuman prompt lengths are uniformly sampled from 1 to this number minus 1)
PERCENT_NONHUMAN = 0.75 # Fraction of prompts whose length is randomly sampled below the maximum; the rest are human text only
SAMPLE_FILE = './generation/text_adventures_test.txt' # File to sample prompts from (note: sample_ai_dungeon currently hard-codes this same path)

prompts = sample_ai_dungeon(NUM_GENERATIONS, MAX_PROMPT_LENGTH, PERCENT_NONHUMAN)

## Step 4: Download, Load, and Fine-Tune GPT-2

### Usage
1.   Determine which GPT2_MODEL_NAME to use (sizes are in comments)
2.   Determine number of FINETUNING_STEPS (1000 steps on 774M w/ colab pro took about an hour for reference)
3.   Determine FINETUNING_FILE_NAME to fine-tune on
4.   Run and see!

### Notes
1.   The gpt_2_simple library has a bad habit of not working if you ever interrupt it, so try your best not to interrupt it while it is fine-tuning. If you do, you will likely have to restart the runtime.
2.   One good side effect of using gpt_2_simple is that it implicitly saves and loads checkpoints. This means that if you get disconnected at any point, you can restart and, as long as you have the same drive mounted to the same folder with the same parameters, it will find your most recently fine-tuned model.
3.   One bad side effect of using gpt_2_simple is that it implicitly saves and loads checkpoints. This means that if you ever want to fine-tune a different model or change some of your parameters, the library tends to assume you want to resume from a checkpoint and errors out. You can fix these errors by manually going into your Google Drive and deleting the checkpoint.
4.   The GPT-2 XL (1558M parameter) model cannot be fine-tuned using this library. I would love it if it were otherwise, but it's the sad truth that Colab Pro's High-RAM GPUs still don't have enough VRAM. I get the feeling that there's an easy way around this and that it's because of a bug, but investigating that will be for another day. For the time being, select XL at your own peril.
5.   The choice of 1000 fine-tuning steps has no real basis. We should probably double-check whether that is actually a sufficient amount of fine-tuning.
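
The manual checkpoint cleanup described in note 3 could also be scripted. The sketch below is hedged: `clear_checkpoint` is a hypothetical helper, and it assumes gpt_2_simple's default layout of `checkpoint/run1` relative to the working directory, so check where your checkpoints actually live before using it. It is demonstrated on a throwaway temporary directory so nothing real is deleted.

```python
import os
import shutil
import tempfile

def clear_checkpoint(checkpoint_dir='checkpoint', run_name='run1'):
  """Delete one run's checkpoint folder if it exists. Returns True if removed."""
  run_path = os.path.join(checkpoint_dir, run_name)
  if os.path.isdir(run_path):
    shutil.rmtree(run_path)
    return True
  return False

# Demo on a temporary directory standing in for the real checkpoint folder
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, 'run1'))
removed_first = clear_checkpoint(demo_root, 'run1')   # folder exists, gets removed
removed_second = clear_checkpoint(demo_root, 'run1')  # nothing left to remove
print(removed_first, removed_second)
```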





In [0]:
import gpt_2_simple as gpt2

# Note: trying to fine-tune GPT-2 XL crashes even high-RAM Colab Pro
GPT2_MODEL_NAME = "774M" # Small = 124M, Medium = 355M, Large = 774M, XL = 1558M
FINETUNING_STEPS = 1000
FINETUNING_FILE_NAME = './samples/text_adventures_train.txt'

if not os.path.isdir(os.path.join("models", GPT2_MODEL_NAME)):
	print(f"Downloading {GPT2_MODEL_NAME} model...")
	gpt2.download_gpt2(model_name=GPT2_MODEL_NAME)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess, FINETUNING_FILE_NAME, model_name=GPT2_MODEL_NAME, steps=FINETUNING_STEPS)

## Step 5: Generate Text with Fine-Tuned GPT-2 Model

### Usage
1.   Nothing special here, just run

### Notes
1.   We should probably write the generations to a JSON file every 20 or so generations instead of all at once at the end. When I ran this for the first time, Colab Pro crashed with an OOM error at 94 generations and I almost lost them all. This is probably the biggest TODO of the notebook in its current state.
2.   As with the prompt sampling, we purposely re-roll the generation if GPT-2 ever gives us <|endoftext|> or <|startoftext|>. This does mean that we're arbitrarily skewing the distribution, but hopefully that doesn't affect our results too much.
3.   The newlines in both the prompt and generated text are a bit of a nuisance. Currently we do not explicitly do anything with the newlines present in the prompt and generation, because NLTK tokenization generally takes care of them. However, this means we do have to join the tokenized sentences with newlines when feeding GPT-2 the prompt. We may want to change this to a space.
4.   We keep newlines untouched in an attempt to match the fine-tuning dataset as closely as possible, on the assumption that newlines are sometimes meaningful. However, in an ideal world, we would prefer to have sentences that do not span 10+ turns of dialogue (which we have actually seen, believe it or not). Maybe in the future we could solve this by replacing all newlines in the fine-tuning dataset with spaces. I wonder how much damage this would cause.
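
The periodic saving suggested in note 1 might be sketched as follows. This is an illustration only: `save_partial` and `SAVE_EVERY` are hypothetical names, the standard `json` module stands in for JSON5, the records are placeholders instead of real GPT-2 output, and the output path is a temporary file.

```python
import json
import os
import tempfile

def save_partial(generations, output_file):
  """Write everything collected so far, replacing the previous partial file."""
  tmp_file = output_file + '.tmp'
  with open(tmp_file, 'w', encoding='utf-8') as f:
    json.dump(generations, f, ensure_ascii=False, indent=4)
  # Rename over the old file so a crash mid-write does not clobber the last good save
  os.replace(tmp_file, output_file)

SAVE_EVERY = 20  # hypothetical interval, from the "every 20 or so" suggestion
output_file = os.path.join(tempfile.mkdtemp(), 'generations_partial.json')

generations = []
for index in range(45):
  # Placeholder record; in the notebook this would be built from the GPT-2 output
  generations.append({'prompt': 'p' + str(index), 'text': [], 'boundary': -1})
  if (index + 1) % SAVE_EVERY == 0:
    save_partial(generations, output_file)

save_partial(generations, output_file)  # final save for the leftover records
```

Rewriting the whole list each time keeps the file valid JSON at every save, at the cost of some redundant writing. For a few hundred generations that cost should be negligible.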





In [0]:
import json5

generations = []
for index, prompt in enumerate(prompts):
  if len(prompt) < MAX_PROMPT_LENGTH:
    while True:
      generated_text = gpt2.generate(sess, prefix='\n'.join(prompt), return_as_list=True)[0]
      # Re-roll if GPT-2 emitted a story delimiter token
      if '<|startoftext|>' not in generated_text and '<|endoftext|>' not in generated_text:
        break

    ## Create the final output by concatenating the generation to the prompt with the separating boundary token
    final_generation = prompt + sent_tokenize(generated_text)[len(prompt):MAX_PROMPT_LENGTH]
    boundary = len(prompt) - 1
  else:
    final_generation = prompt
    boundary = -1

  generation = {
      'prompt': final_generation[0],
      'text': final_generation[1:],
      'boundary': boundary,
  }

  print('=============GENERATION NUMBER: ' + str(index) + '=============')
  print(generation)
  generations.append(generation)

## Step 6: Output to JSON

### Notes:
1.   We use JSON5 here because it handles double quotes within JSON fields well, which is super important for dialogue-based prompts like the ones we see in text_adventures.
2.   We append the UNIX timestamp to the output file name to prevent overwriting. It's up to you to later go into your drive and combine however many of these JSON files you want.
3.   We need to give these prompts a unique ID, something to connect the annotation to them permanently. Maybe a combination of the timestamp plus their prompt number?
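
One possible shape for the ID scheme floated in note 3 is sketched below. The `assign_ids` helper and the `<timestamp>_<index>` format are assumptions for illustration, not something the notebook implements, and the example records are made up.

```python
import time

def assign_ids(generations, run_timestamp):
  """Return copies of the records with an 'id' of the form '<timestamp>_<index>'."""
  return [dict(record, id='{0}_{1}'.format(run_timestamp, index))
          for index, record in enumerate(generations)]

run_timestamp = int(time.time())
# Made-up records in the same shape as the notebook's generation dicts
records = [{'prompt': 'You enter the cave.', 'text': [], 'boundary': -1},
           {'prompt': 'The dragon wakes.', 'text': [], 'boundary': -1}]
tagged = assign_ids(records, run_timestamp)
print(tagged[0]['id'])
```

Reusing the timestamp from the output file name would tie each ID back to its file. Two runs started within the same second would collide, so a random suffix may be worth adding if that can happen.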



In [0]:
import time
output_file = './samples/generations_' + str(int(time.time())) + '.json'

with open(output_file, 'w', encoding='utf-8') as f:
    json5.dump(generations, f, ensure_ascii=False, indent=4)