# CIS700 - Project "TRICK" Prompt Generation Script
This is the notebook for generating prompts for the project [Learning To Trick Humans](https://github.com/kirubarajan/trick.git). 

Developed by Liam Dugan and Arun Kirubarajan in Spring 2020 for CIS700 "Interactive Fiction and Text Generation"

## Step 1: Mount Drive and Clone the Repository

In [0]:
# Mount your google drive folder
from google.colab import drive
drive.mount('/content/drive')

# Change to the Google Drive folder and clone our repo
import os
os.chdir('/content/drive/My Drive')
!git clone https://github.com/kirubarajan/trick.git
os.chdir('/content/drive/My Drive/trick')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
fatal: destination path 'trick' already exists and is not an empty directory.


## Step 2: Install Dependencies

In [0]:
# GPT-2 currently only supports TensorFlow 1
%tensorflow_version 1.x
!pip3 install gpt-2-simple
import nltk 
nltk.download('punkt')

TensorFlow 1.x selected.


## Step 3: Sample Prompts from Human Text
At the moment prompts are sampled only from the AI Dungeon training data, but we will expand this to include news and maybe blog posts later.

### Usage
1.   Determine desired NUM_GENERATIONS, MAX_PROMPT_LENGTH, and PERCENT_NONHUMAN
2.   Pick the desired SAMPLE_FILE
3.   That's it! Run and see!

### Notes
1.   Prompt lengths for prompts that get a generated continuation are sampled from a *uniform* distribution over 1 to MAX_PROMPT_LENGTH - 1 (every length is equally likely). Human-only prompts always use MAX_PROMPT_LENGTH.
2.   Prompts are sampled from a file that is a combined train/dev/test file. This means that certain prompts may have been seen by the model during fine-tuning. We might want to change this in the future.
3.   The sampling file is read line by line as needed in order to save RAM. We read in prompt_length lines, but lines are not necessarily equal to sentences. When the lines contain more sentences than prompt_length, this is no issue because we can just trim the excess. The reverse case (fewer sentences than lines) is a problem because the prompt ends up shorter than intended. We have yet to see it, but it can't be ruled out quite yet.
4.   We specifically re-roll any prompt that contains '<|startoftext|>' or '<|endoftext|>' for fear that it may be too much of a giveaway. It is not sufficient to just remove these with a regex because the topic and setting of the text change dramatically between appearances of these tokens. This does mean that text close to these tokens has a lower chance of appearing in a prompt.
5.   We use the nltk punkt sentence tokenizer to split text into sentences. It is not a perfect tokenizer, so it might be worth looking into other options.
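The length-sampling rule from Note 1 can be checked in isolation. The following is a minimal sketch rather than the notebook's own cell: the helper name `sample_prompt_length` and its `rng` parameter are hypothetical, added only so the rule can be run with a fixed seed.

```python
import random

def sample_prompt_length(max_prompt_length, percent_nonhuman, rng=random):
    """Return a prompt length in sentences (hypothetical helper).

    With probability (1 - percent_nonhuman) the prompt is human-only and gets
    the full max_prompt_length. Otherwise the length is drawn uniformly from
    1..max_prompt_length - 1, leaving room for at least one generated sentence.
    """
    if rng.random() > percent_nonhuman:
        return max_prompt_length
    return rng.randint(1, max_prompt_length - 1)

# Tabulate how often each length comes up for the notebook's settings.
rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
counts = {}
for _ in range(10000):
    length = sample_prompt_length(11, 0.75, rng)
    counts[length] = counts.get(length, 0) + 1
```

With these settings roughly a quarter of the prompts should come out at length 11 (human-only), and the remaining lengths 1 to 10 should each appear about equally often.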



In [0]:
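def read_file_lines(filename, start_line, stop_line):
    """Return the non-blank lines whose 0-based index is in [start_line, stop_line)."""
    output = []
    with open(filename, 'r') as f:
        for index, line in enumerate(f):
            if index >= stop_line:
                break  # past the window, no need to scan the rest of the file
            if index >= start_line:
                if line.isspace():
                    # Blank lines do not count: extend the window by one line
                    stop_line += 1
                else:
                    output.append(line)
    return output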

In [0]:
import random
import re
from nltk.tokenize import sent_tokenize

NUM_GENERATIONS = 56 # Number of prompts to sample
MAX_PROMPT_LENGTH = 11 # Maximum prompt length in sentences (human-only prompts use exactly this length)
PERCENT_NONHUMAN = 0.75 # Fraction of prompts whose length is sampled uniformly from 1 to MAX_PROMPT_LENGTH - 1; the rest are human-only
SAMPLE_FILE = './samples/text_adventures.txt' # file to sample prompts from

prompts = []
for i in range(NUM_GENERATIONS):
  # Randomly determine prompt length based on the specified percent nonhuman value
  if (random.random() > PERCENT_NONHUMAN):
    prompt_length = MAX_PROMPT_LENGTH
  else:
    prompt_length = random.randint(1, MAX_PROMPT_LENGTH - 1)

  # Randomly determine which file to sample from (at the moment we're only sampling from AI Dungeon because it's easier)
  # Eventually we should generalize this to sample from other domains
  wc_output = !wc -l {SAMPLE_FILE}
  file_len = wc_output[0].split()[0]

  prompt_is_good = False
  prompt_string = ''
  while not prompt_is_good:
    # Randomly decide which line in the given file to start from
    # (read_file_lines counts lines from 0, so valid indices are 0..file_len - 1)
    starting_line = random.randint(0, int(file_len) - 1)
    print('Starting Prompt at ' + str(starting_line))
    print('Prompt Length is ' + str(prompt_length))

    lines = read_file_lines(SAMPLE_FILE, starting_line, starting_line + prompt_length)
    prompt_string = ''.join(lines)

    # Re-roll if the sample is empty or contains a story boundary token
    prompt_is_good = bool(prompt_string.strip())
    if '<|startoftext|>' in prompt_string or '<|endoftext|>' in prompt_string:
      prompt_is_good = False

  # Instead of going by line, decide prompt length on number of sentences in the prompt
  cleaned_sentences = sent_tokenize(prompt_string)[:prompt_length]
  prompts.append(cleaned_sentences)
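
One possible way to handle the case raised in Note 3 of Step 3 (fewer sentences than lines) is to keep reading lines until enough sentences have accumulated. The sketch below is an assumption-laden illustration, not the notebook's method: `sample_sentences` and `naive_sent_tokenize` are hypothetical names, the input is an in-memory list of lines rather than the sample file, and the regex splitter is only a stand-in for nltk's `sent_tokenize` so the block runs without NLTK data.

```python
import re

def naive_sent_tokenize(text):
    """Stand-in for nltk's sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def sample_sentences(lines, start, num_sentences, tokenize=naive_sent_tokenize):
    """Collect non-blank lines from `start` until they hold enough sentences.

    Returns at most num_sentences sentences, and fewer only if the input
    runs out before enough sentences are found.
    """
    text = ''
    sentences = []
    for line in lines[start:]:
        if not line or line.isspace():
            continue  # blank lines carry no sentences
        text += line
        sentences = tokenize(text)
        if len(sentences) >= num_sentences:
            break
    return sentences[:num_sentences]

lines = [
    'You enter the cave and\n',
    'light a torch.\n',
    '\n',
    'A bat flies past! You duck.\n',
    'The tunnel splits in two.\n',
]
prompt = sample_sentences(lines, 0, 3)
```

Here the first sentence spans two lines, so three lines of text are needed to obtain three sentences. The re-roll on '<|startoftext|>' and '<|endoftext|>' would still have to be applied to whatever text this returns.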

## Step 4: Download, Load, and Fine-Tune GPT-2

### Usage
1.   Determine which GPT2_MODEL_NAME to use (sizes are in comments)
2.   Determine the number of FINETUNING_STEPS (for reference, 1000 steps on 774M with Colab Pro took about an hour)
3.   Determine the PRETRAINING_FILE_NAME to fine-tune on
4.   Run and see!

### Notes
1.   The gpt_2_simple library has a bad habit of not working if you ever interrupt it, so try your best not to interrupt it while it is fine-tuning. If you do, you will likely have to restart the runtime.
2.   One good side effect of using gpt_2_simple is that it implicitly saves and loads checkpoints. This means that if you get disconnected at any point, you can restart and, as long as you have the same drive mounted to the same folder with the same parameters, it will find your most recently fine-tuned model.
3.   One bad side effect of using gpt_2_simple is that it implicitly saves and loads checkpoints. This means that if you ever want to fine-tune a different model or change some of your parameters, the library tends to assume you want to resume from the existing checkpoint and raises an error. You can fix these errors by manually going into your Google Drive and deleting the checkpoint.
4.   The GPT-2 XL (1558M parameter) model cannot be fine-tuned using this library. I would love it if it were otherwise, but it's the sad truth that Colab Pro's High-RAM GPUs still don't have enough VRAM. I get the feeling that there's an easy way around this and that it's because of a bug, but investigating that will be for another day. For the time being, select XL at your own peril.
5.   The choice of 1000 fine-tuning steps is not based on any reference. We should probably double-check whether that is actually a sufficient amount of fine-tuning.
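
The manual checkpoint cleanup described in Note 3 could also be scripted. This is a minimal sketch under one assumption that should be verified against the installed gpt_2_simple version: that checkpoints live under `checkpoint/run1` relative to the working directory (the library's default `checkpoint_dir` and `run_name`, as far as we know). The helper name `clear_checkpoint` is hypothetical.

```python
import os
import shutil

def clear_checkpoint(checkpoint_dir='checkpoint', run_name='run1'):
    """Delete one run's checkpoint folder if present (hypothetical helper).

    Returns True if a folder was deleted, False if there was nothing to delete.
    The default paths are an assumption about gpt_2_simple's layout.
    """
    run_path = os.path.join(checkpoint_dir, run_name)
    if os.path.isdir(run_path):
        shutil.rmtree(run_path)  # irreversible: the fine-tuned weights are gone
        return True
    return False
```

Calling `clear_checkpoint()` before `gpt2.finetune(...)` when switching model size or parameters would force a fresh start. Since the deletion cannot be undone, copy the folder elsewhere first if the old model is still needed.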





In [0]:
import gpt_2_simple as gpt2

# Note: trying to fine-tune GPT-2 XL crashes even High-RAM Colab Pro
GPT2_MODEL_NAME = "774M" # Small = 124M, Medium = 355M, Large = 774M, XL = 1558M
FINETUNING_STEPS = 1000
PRETRAINING_FILE_NAME = './samples/text_adventures_train.txt'

if not os.path.isdir(os.path.join("models", GPT2_MODEL_NAME)):
  print(f"Downloading {GPT2_MODEL_NAME} model...")
  gpt2.download_gpt2(model_name=GPT2_MODEL_NAME)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess, PRETRAINING_FILE_NAME, model_name=GPT2_MODEL_NAME, steps=FINETUNING_STEPS)

## Step 5: Generate Text with Fine-Tuned GPT-2 Model

### Usage
1.   Nothing special here, just run

### Notes
1.   We should probably write the generations out to a JSON file every 20 or so generations. When I ran this for the first time, Colab Pro crashed with an OOM error at 94 generations and I almost lost them all. This is probably the biggest TODO of the notebook in its current state.
2.   As with prompt sampling, we purposely re-roll the generation if GPT-2 ever gives us <|endoftext|> or <|startoftext|>. This does mean that we're arbitrarily skewing the distribution, but hopefully that doesn't affect our results too much.
3.   The newlines in both the prompt and generated text are a bit of a nuisance. Currently we do not explicitly do anything with the newlines present in the prompt and generation because nltk tokenization generally takes care of them. However, this means we do have to join the tokenized sentences with newlines when feeding GPT-2 the prompt. We may want to join with spaces instead.
4.   We keep newlines untouched in an attempt to match the fine-tuning dataset as closely as possible, on the assumption that newlines are sometimes meaningful. However, in an ideal world, we would prefer not to have sentences that span 10+ turns of dialogue (which we have actually seen, believe it or not). Maybe in the future we could solve this by replacing all newlines in the fine-tuning dataset with spaces. I wonder how much damage this would cause.
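
The TODO in Note 1 could be approached by flushing results to disk in batches. The sketch below is one possible shape, not something the notebook currently does: `flush_generations`, `generate_with_checkpoints`, `SAVE_EVERY`, and the `_part<N>` file suffix are all hypothetical, and the standard `json` module is swapped in for JSON5 so the block has no extra dependencies.

```python
import json
import os
import time

SAVE_EVERY = 20  # assumed batch size, from the "every 20 or so" in Note 1

def flush_generations(batch, output_dir, part):
    """Write one batch of generation dicts to its own timestamped JSON file."""
    name = f'generations_{int(time.time())}_part{part}.json'
    path = os.path.join(output_dir, name)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(batch, f, ensure_ascii=False, indent=4)
    return path

def generate_with_checkpoints(items, make_generation, output_dir,
                              save_every=SAVE_EVERY):
    """Run make_generation over items, flushing results to disk in batches."""
    written, batch = [], []
    for item in items:
        batch.append(make_generation(item))
        if len(batch) >= save_every:
            written.append(flush_generations(batch, output_dir, len(written)))
            batch = []
    if batch:  # flush the final partial batch
        written.append(flush_generations(batch, output_dir, len(written)))
    return written
```

In the notebook, `make_generation` would wrap the body of the generation loop (prompt in, generation dict out), so a crash loses at most the current unsaved batch.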





In [0]:
import json5

generations = []
for index, prompt in enumerate(prompts):
  if len(prompt) < MAX_PROMPT_LENGTH:
    generation_is_good = False
    while(not generation_is_good):
      generated_text = gpt2.generate(sess, prefix='\n'.join(prompt), return_as_list=True)[0]
      generation_is_good = True
      if '<|startoftext|>' in generated_text or '<|endoftext|>' in generated_text:
        generation_is_good = False

    ## Create the final output by concatenating the generation to the prompt with the separating boundary token
    final_generation = prompt + sent_tokenize(generated_text)[len(prompt):MAX_PROMPT_LENGTH]
    boundary = len(prompt) - 1
  else:
    final_generation = prompt
    boundary = -1

  generation = {
      'prompt': final_generation[0],
      'text': final_generation[1:],
      'boundary': boundary,
  }

  print('=============GENERATION NUMBER: ' + str(index) + '=============')
  print(generation)
  generations.append(generation)
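
To make the record layout easier to reason about, the prompt/continuation bookkeeping from the cell above can be isolated as a pure function. This is a sketch of our reading of that cell, not a replacement for it: `build_generation` is a hypothetical name, and it assumes the tokenized model output begins with the same sentences as the prompt (the prompt is passed as the prefix), which the tokenizer does not strictly guarantee.

```python
def build_generation(prompt, generated_sentences, max_length):
    """Combine a human prompt with generated sentences into one record.

    prompt: non-empty list of human-written sentences.
    generated_sentences: tokenized model output, assumed to start with the
        prompt's sentences followed by the model's continuation.
    """
    if len(prompt) < max_length:
        # Skip the echoed prompt and cap the passage at max_length sentences
        continuation = generated_sentences[len(prompt):max_length]
        sentences = prompt + continuation
        boundary = len(prompt) - 1  # index of the last human-written sentence
    else:
        sentences = prompt
        boundary = -1  # human-only passage, so there is no boundary
    return {
        'prompt': sentences[0],
        'text': sentences[1:],
        'boundary': boundary,
    }

record = build_generation(['A.', 'B.'], ['A.', 'B.', 'C.', 'D.', 'E.'], 4)
```

In this example the two human sentences are kept, two generated sentences are appended, and `boundary` is 1, the index of the last human sentence within the full passage.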

## Step 6: Output to JSON

### Notes:
1.   Using JSON5 here because it handles double quotes within JSON fields well, which is super important for dialogue-based prompts like the ones we see in text_adventures.
2.   We append the UNIX timestamp to the output file name to prevent overwriting. It's up to you to later go into your Drive and combine however many of these JSON files you want.
3.   We need to give these prompts a unique ID, something to connect the annotations to them permanently. Maybe a combination of the timestamp and the prompt number?
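
Notes 2 and 3 could be handled together when the output files are combined. The following is a minimal sketch under stated assumptions: `merge_generation_files` is a hypothetical helper, the `<timestamp>-<index>` ID format is only one way to realize the idea in Note 3, and the loader defaults to the standard `json.load` so the block runs without extra packages. Files written with `json5.dump` may contain JSON5-only syntax, in which case `json5.load` would need to be passed as `load`.

```python
import glob
import json
import os
import re

def merge_generation_files(samples_dir, load=json.load):
    """Merge every generations_<timestamp>.json in samples_dir into one list.

    Each record gains an 'id' of the form '<timestamp>-<index>', where index
    is the record's position within its source file (hypothetical scheme).
    """
    merged = []
    pattern = os.path.join(samples_dir, 'generations_*.json')
    for path in sorted(glob.glob(pattern)):
        match = re.search(r'generations_(\d+)\.json$', os.path.basename(path))
        if match is None:
            continue  # skip files that do not follow the naming scheme
        with open(path, encoding='utf-8') as f:
            records = load(f)
        for index, record in enumerate(records):
            merged.append({'id': f'{match.group(1)}-{index}', **record})
    return merged
```

Because the timestamp comes from the file name and the index from the position in the file, the IDs stay stable as long as the original output files are not edited or renamed.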



In [3]:
import time
output_file = './samples/generations_' + str(int(time.time())) + '.json'

with open(output_file, 'w', encoding='utf-8') as f:
    json5.dump(generations, f, ensure_ascii=False, indent=4)

./samples/generations_1587592430.json
