# RoFT Prompt Sampler
This notebook efficiently samples prompts from large corpora for use in the RoFT project (http://roft.io).

**Input:** file of articles or stories -- one full article per line

**Output:** "prompts-{sample filename}.json" (the sample filename without its .txt extension) in the format:
  ```
{
  "sample-file": "nyt-articles-dev.txt",
  "date-sampled": "17/09/2020",
  "prompts": [
    [
      "The Newark Teachers Union filed a follow-up complaint with the Federal Department of Education yesterday, accusing the Newark school district of continuing to violate Federal laws barring discrimination against handicapped students.",
      "Under an agreement brokered in January, officials at the state-run district agreed to a 15-point plan intended to bring Newark schools into compliance with Federal laws.",
      "That plan has still not been implemented, said Mitchell Gerry, a spokesman for the 5,000-member union, which sent the complaint letter to the department's Office of Civil Rights.",
      "E. J. Miranda, a spokesman for the district, said the district had been making efforts to comply with the Rehabilitation Act of 1973."
    ],
    ...
  ]
}
  ```
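
As an illustration of the format above, the sketch below reads such a file back and checks that `prompts` is a list of sentence lists. The helper name `load_prompts` is hypothetical and not part of the notebook; it relies only on the `prompts` key shown above.

```python
import json

def load_prompts(path):
    """Load a prompts JSON file and return its list of prompts.

    Hypothetical helper, not part of the notebook: it relies only on the
    "prompts" key, which holds a list of prompts, each a list of sentences.
    """
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    prompts = data['prompts']
    for i, prompt in enumerate(prompts):
        # Every prompt must be a list of sentence strings
        if not isinstance(prompt, list) or not all(isinstance(s, str) for s in prompt):
            raise ValueError('Prompt #{0} is not a list of sentences'.format(i))
    return prompts
```

For the sample file used in this notebook, the call would presumably look like `load_prompts('prompts-nyt-articles-dev.json')`.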

# Usage

Just edit the three cells below with your desired preferences and run!

# Notes
1. PERCENT_MAX sets the chance that a prompt is sampled at the maximum length; it does not guarantee that exact percentage of max-length prompts in the output. If the sampler attempts to sample at max length and the article has fewer than MAX_LEN sentences, it accepts the full article instead of resampling a longer one. Each such article triggers a per-article warning, and the summary printed at the end of sampling reports the total, e.g. "Warning: 1 articles were too short for prompt length". This may be worth fixing in the future.

2. The sampler assumes the sample file has a .txt extension: the output filename is built by dropping the last four characters of the sample filename.
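
To make note 1 concrete, the sketch below estimates what share of prompts is requested at MAX_LEN under the length-sampling rule in `random_sample_prompt_len` (maximum length with probability PERCENT_MAX, otherwise a uniform draw from 1 to MAX_LEN). The helper names `expected_max_share` and `simulated_max_share` are hypothetical, and both ignore articles that are too short, which can only lower the share.

```python
import random

def expected_max_share(percent_max, max_len):
    """Expected fraction of prompts requested at max_len, before shortening.

    A prompt is forced to max_len with probability percent_max; otherwise its
    length is uniform on 1..max_len, which hits max_len with probability 1/max_len.
    """
    return percent_max + (1.0 - percent_max) / max_len

def simulated_max_share(percent_max, max_len, trials=100000, seed=0):
    """Monte Carlo check of the formula above using the same sampling rule."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if rng.random() < percent_max:
            length = max_len
        else:
            length = rng.randint(1, max_len)
        hits += (length == max_len)
    return hits / trials

print(expected_max_share(0.20, 10))
print(simulated_max_share(0.20, 10))
```

With the notebook's default settings (PERCENT_MAX = 0.20, MAX_LEN = 10) both values come out at roughly 0.28, because the uniform draw can also land on the maximum length.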

In [None]:
''' Browse the cloud bucket for your desired sample file '''
!gsutil ls gs://roft_datasets/prompts

In [None]:
''' Once you've found it, download it '''
!gsutil cp gs://roft_datasets/prompts/news/nyt-articles-dev.txt .

In [None]:
''' Set your preferences and you're good to go! '''
SAMPLE_FILE_PATH = '/content/nyt-articles-dev.txt' # The path to the downloaded sample file
NUM_SAMPLES = 200 # The total number of prompts you want to sample (will default to total number of articles in input file when too large)
MAX_LEN = 10 # The maximum number of sentences of each prompt
PERCENT_MAX = 0.20 # The percentage of prompts that will be sampled at that maximum length (i.e. the percentage of "all human" examples)

In [None]:
import os
import random
import mmap
import numpy as np
import json
import time
from datetime import date

import nltk
nltk.download('punkt')
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
class PromptSampler:
  def print_prompt_sampled(self, prompt, index, total, line):
    print('Sampled prompt {0}/{1} of length {2} from line {3}'.format(
            index, total, len(prompt), line))
    # Use a distinct loop variable so the `line` argument is not shadowed
    for sentence in prompt:
      print('\t' + repr(sentence))

  def print_prompt_too_short_warning(self, index, article_len, prompt_len):
    print('Warning: Article #{0} (len: {1}) is too short for prompt length of {2}'.format(
              str(index), str(article_len), str(prompt_len)))
    
  def random_sample_prompt_len(self, percent_human, max_prompt_len):
    if (random.random() < percent_human):
      return max_prompt_len
    else:
      return random.randint(1, max_prompt_len)
      
  def sample_corpus(self, sample_file, num_samples, max_prompt_len, percent_human, random_seed=436421):
    random.seed(random_seed)

    if not os.path.exists(sample_file):
      # Raise instead of calling exit(), which can stop the notebook kernel
      raise FileNotFoundError('Sample file "' + sample_file + '" does not exist')

    prompts = [] # The 2D array of prompts
    num_shortened = 0 # The number of prompts that were too small to be full length
    with open(sample_file, 'rb') as f:
      # mmap the file read-only to avoid loading the whole thing into RAM
      map_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      
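      # Count the articles (one per line) so we know the range to sample from.
      # Counting through the mmap also counts a final line with no trailing newline.
      num_lines = sum(1 for _ in iter(map_file.readline, b""))
      map_file.seek(0) # Rewind so the sampling pass below starts at the first article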

      # If we ask for more samples than we can give, give as much as we can
      if num_samples > num_lines: num_samples = num_lines
      
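      # Randomly decide which articles to grab our prompts from;
      # a set keeps the membership test in the loop below O(1) per line
      articles_to_sample = set(random.sample(range(num_lines), num_samples))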

      # Iterate over all articles in the file and sample from only the selected articles
      for index, line in enumerate(iter(map_file.readline, b"")):
        if index not in articles_to_sample: continue

        # Randomly determine this prompt's length based on the percent human value
        prompt_length = self.random_sample_prompt_len(percent_human, max_prompt_len)

        # Use NLTK Sentence tokenizer to split this prompt into sentences
        article = sent_detector.tokenize(str(line, 'utf-8', 'ignore'))

        # If article is shorter than the desired prompt length, shorten the prompt length
        if len(article) < prompt_length: 
          self.print_prompt_too_short_warning(index, len(article), prompt_length)
          prompt_length = len(article)
          num_shortened += 1

        # Append the prompt to the list of prompts
        prompts.append(article[:prompt_length])

        self.print_prompt_sampled(article[:prompt_length], len(prompts), num_samples, index)
    
    print('Warning: {0} articles were too short for prompt length'.format(str(num_shortened)))
    return prompts
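
The core of `sample_corpus` (choose article indices up front, walk the articles once, and keep the first few sentences of each chosen article) can be sketched without the file and NLTK dependencies. This is a simplified illustration, not the sampler itself: `sample_prompts_from_articles` and `naive_sentences` are hypothetical names, and the regex splitter is a crude stand-in for the Punkt tokenizer.

```python
import random
import re

def naive_sentences(text):
    """Crude stand-in for a sentence tokenizer: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def sample_prompts_from_articles(articles, num_samples, max_len, percent_max, seed=436421):
    """Sample one prompt (a list of leading sentences) from each chosen article.

    `articles` is a list of strings, one full article per entry.
    Returns the prompts and the number of articles that were too short.
    """
    rng = random.Random(seed)
    num_samples = min(num_samples, len(articles))
    chosen = set(rng.sample(range(len(articles)), num_samples))

    prompts = []
    num_shortened = 0
    for index, article in enumerate(articles):
        if index not in chosen:
            continue
        # Max length with probability percent_max, else uniform on 1..max_len
        if rng.random() < percent_max:
            length = max_len
        else:
            length = rng.randint(1, max_len)
        sentences = naive_sentences(article)
        if len(sentences) < length:
            num_shortened += 1  # Article too short: keep all of it
        prompts.append(sentences[:length])
    return prompts, num_shortened
```

Using a private `random.Random(seed)` instance here is a design choice of the sketch; it keeps the result reproducible without touching the global random state.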

In [None]:
# Sample the prompts
sampler = PromptSampler()
prompts = sampler.sample_corpus(SAMPLE_FILE_PATH, NUM_SAMPLES, MAX_LEN, PERCENT_MAX)

In [None]:
# Save the prompts to the json file
sample_file_name = SAMPLE_FILE_PATH.split('/')[-1]
to_save = {
  'sample-file': sample_file_name,
  'date-sampled': date.today().strftime("%d/%m/%Y"),
  'prompts': prompts
}
  
with open('prompts-{}.json'.format(sample_file_name[:-4]), 'w') as f:
  json.dump(to_save, f, indent=2)
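
After saving, it can be useful to check how the prompt lengths are distributed, for example to see what share actually reached MAX_LEN. A minimal sketch, where `prompt_length_counts` and `share_at_length` are hypothetical helpers operating on the `prompts` list of sentence lists:

```python
from collections import Counter

def prompt_length_counts(prompts):
    """Map each prompt length (in sentences) to the number of prompts with that length."""
    return Counter(len(p) for p in prompts)

def share_at_length(prompts, length):
    """Fraction of prompts that have exactly `length` sentences (0.0 if there are none)."""
    if not prompts:
        return 0.0
    return sum(1 for p in prompts if len(p) == length) / len(prompts)
```

In this notebook the second helper would be called as `share_at_length(prompts, MAX_LEN)`.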