# Dataset construction notebook

Walking through how we build scratchpad datasets from a sample RAG-like dataset.

### Background
Typical finetuning question-answering datasets follow a template like:  
```
Context: [insert (long) context here which may have distracting or irrelevant parts]

Question: What is X?

Answer: X is blah
```
and we train a model on them via next-token prediction, so that when faced with a new sample:
```
Context: [insert (long) context here which may have distracting or irrelevant parts]

Question: What is Y?
```
the model is finetuned to output text (e.g., `Answer: Y is blah`) that answers the new question given the new context.

However, models trained this way are prone to being distracted by the irrelevant parts of the provided context.

If we know which parts of the context are relevant in our training data, we can instead augment the dataset with a ["scratchpad"](https://arxiv.org/abs/2112.00114) to help models answer questions correctly.

With the scratchpad template, our samples will look like:
```
Context: [insert (long) context here which may have distracting or irrelevant parts]

Question: What is X?

Supporting context: [supporting chunk 1], [supporting chunk 2], ..

Question: What is X?

Answer: X is blah
```
so that, given the same new sample as above:
```
Context: [insert (long) context here which may have distracting or irrelevant parts]

Question: What is Y?
```

the model may now output the full scratchpad + answer:
```
Supporting context: [supporting chunk 1], [supporting chunk 2], ..

Question: What is Y?

Answer: Y is blah
```
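
As a rough illustration of the two templates above, the sketch below builds both versions of a training sample from the same fields. The helper name `build_example` and its arguments are assumptions made for this example only; the templating used later in this notebook has different wording.

```python
from typing import List, Optional, Tuple


def build_example(context: str, question: str, answer: str,
                  support: Optional[List[str]] = None) -> Tuple[str, str]:
    """Return (prompt, target) strings for one training sample.

    Without `support`, the target is just the answer. With `support`, the
    target is a scratchpad: supporting chunks, the repeated question, and
    then the answer.
    """
    prompt = f"Context: {context}\n\nQuestion: {question}\n\n"
    if support is None:
        target = f"Answer: {answer}"
    else:
        target = (
            f"Supporting context: {', '.join(support)}\n\n"
            f"Question: {question}\n\n"
            f"Answer: {answer}"
        )
    return prompt, target


prompt, target = build_example(
    context="[chunk A] [chunk B] [chunk C]",
    question="What is X?",
    answer="X is blah",
    support=["[chunk A]", "[chunk C]"],
)
print(prompt + target)
```

In both cases the model is only asked to generate the target part, so the scratchpad variant simply makes the model write out the relevant chunks before committing to an answer.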

We'll use Hotpot-QA (specifically the distractor setting) as an example. It already provides the supporting chunks, which we may need to identify ourselves for other datasets. With these chunks, we can construct a "scratchpad" version of the dataset to use for training.

Most of this should be fairly consistent and easy to apply to new datasets. The key step will be figuring out how to construct chunks and identify the supporting ones for datasets that don't come in Hotpot-QA's convenient format (e.g., for SCROLLS datasets such as Qasper, the context may just be raw text).
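
For datasets that arrive as one block of text, one possible starting point is sketched below: split on blank lines to get chunks, then mark a chunk as supporting if it contains an annotated evidence string. Both helpers (`chunk_by_paragraph`, `find_support_indices`) are hypothetical and not part of this repo, and the sketch assumes evidence is available as verbatim substrings of the context, which may not hold for every dataset.

```python
from typing import List


def chunk_by_paragraph(text: str) -> List[str]:
    """Split raw text into non-empty paragraph chunks on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def find_support_indices(chunks: List[str], evidence: List[str]) -> List[int]:
    """Indices of chunks that contain at least one evidence string verbatim."""
    return [
        ix for ix, chunk in enumerate(chunks)
        if any(ev in chunk for ev in evidence)
    ]


text = "Intro paragraph.\n\nX was founded in 1999.\n\nUnrelated details."
chunks = chunk_by_paragraph(text)
support_indices = find_support_indices(chunks, ["founded in 1999"])
print(chunks)
print(support_indices)
```

Exact substring matching is the simplest choice here; paraphrased evidence would need some fuzzier matching instead.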


### Setup

In [1]:
import sys
sys.path.append('../')  # get our imports

In [2]:
from omegaconf import OmegaConf  # how we specify configs

### Load dataset config

In [3]:
dataset_config = """
name: hotpot_qa
dataset_config:
  path: hotpot_qa
  name: distractor
  cache_dir: '/juice/scr/scr110/scr/nlp/data/hotpot-qa-hf'
  seed: 42
  num_train_samples: null
  num_val_samples: null
  include_support: false
pretrained_model_config: 
  pretrained_model_name_or_path: 'mistralai/Mistral-7B-v0.1'
  cache_dir: '/juice/scr/scr110/scr/nlp/data/neo/hub/'
"""
dataset_config = OmegaConf.create(dataset_config)

In [4]:
print(OmegaConf.to_container(dataset_config))  # see as dict

{'name': 'hotpot_qa', 'dataset_config': {'path': 'hotpot_qa', 'name': 'distractor', 'cache_dir': '/juice/scr/scr110/scr/nlp/data/hotpot-qa-hf', 'seed': 42, 'num_train_samples': None, 'num_val_samples': None, 'include_support': False}, 'pretrained_model_config': {'pretrained_model_name_or_path': 'mistralai/Mistral-7B-v0.1', 'cache_dir': '/juice/scr/scr110/scr/nlp/data/neo/hub/'}}


### Alternatively load from config file

In [5]:
config_path = '../configs/experiment/hotpot_qa_distractor.yaml'
dataset_config = OmegaConf.load(config_path).dataset
print(OmegaConf.to_container(dataset_config))  # see as dict

{'name': 'hotpot_qa', 'dataset_config': {'path': 'hotpot_qa', 'name': 'distractor', 'cache_dir': '/juice/scr/scr110/scr/nlp/data/hotpot-qa-hf', 'seed': 42, 'num_train_samples': None, 'num_val_samples': None, 'include_support': False}, 'pretrained_model_config': {'pretrained_model_name_or_path': 'mistralai/Mistral-7B-v0.1', 'cache_dir': '/juice/scr/scr110/scr/nlp/data/neo/hub/'}, 'preprocess_config': None}


## Dataset construction

#### Load raw data 

In [6]:
from datasets import load_dataset  # Huggingface datasets



In [7]:
dataset_kwargs = ['path', 'name', 'cache_dir']
raw_datasets = load_dataset(
    **{k: v for k, v in dataset_config.dataset_config.items() if k in dataset_kwargs})

In [8]:
# Can look at features and a sample
raw_datasets.keys()

dict_keys(['train', 'validation'])

In [9]:
raw_datasets['train']

Dataset({
    features: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context'],
    num_rows: 90447
})

In [10]:
raw_datasets['train'][0]

{'id': '5a7a06935542990198eaf050',
 'question': "Which magazine was started first Arthur's Magazine or First for Women?",
 'answer': "Arthur's Magazine",
 'type': 'comparison',
 'level': 'medium',
 'supporting_facts': {'title': ["Arthur's Magazine", 'First for Women'],
  'sent_id': [0, 0]},
 'context': {'title': ['Radio City (Indian radio station)',
   'History of Albanian football',
   'Echosmith',
   "Women's colleges in the Southern United States",
   'First Arthur County Courthouse and Jail',
   "Arthur's Magazine",
   '2014–15 Ukrainian Hockey Championship',
   'First for Women',
   'Freeway Complex Fire',
   'William Rast'],
  'sentences': [["Radio City is India's first private FM radio station and was started on 3 July 2001.",
    ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).',
    ' It plays Hindi, English and regional songs.',
    ' It was launch

### Converting to text

For Hotpot-QA, we'll need to convert the above dictionary into valid text prompts to feed into an LLM. We'll do so in two parts:  
1. **Dictionary preprocessing**: we'll clean up the dictionary into a consistent format that we can share across all datasets
2. **Prompt templating**: we'll take the cleaned dictionary and use it to populate a string as our prompt

### Dictionary preprocessing  

We'll aim to get the above raw dictionary into a standardized format:
```python
sample = {
    'question': sample['question'],
    'answer': sample['answer'],
    'context': context,
    'support': support,
    'support_indices': support_indices,
}
```
We can do so with the following:

In [11]:
# from dataloaders.hotpot_qa.py

def process_sample(sample: dict):
    """
    Preprocess data into question, answer, full context, support
    -> Include full paragraphs for supporting contexts
    """
    support = []
    context = []
    support_indices = []  # Which result they showed up in

    context_titles = sample['context']['title']
    support_titles = sample['supporting_facts']['title']

    context_sentences = sample['context']['sentences']
    support_sent_ids  = sample['supporting_facts']['sent_id']

    # Add contexts
    for cix, sentences in enumerate(context_sentences):
        _context = {'title': context_titles[cix], 'text': ''.join(sentences)}
        context.append(_context)
        #  Add supporting facts
        if context_titles[cix] in support_titles:
            support.append(_context)
            support_indices.append(cix)
            
    sample = {
        'question': sample['question'],
        'answer': sample['answer'],
        'context': context,
        'support': support,
        'support_indices': support_indices,
    }
    return sample
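
Note that `process_sample` keeps the full paragraph for each supporting title and reads `sent_id` without using it. If sentence-level support is wanted instead, a minimal sketch could look like the following. The helper `supporting_sentences` and the toy sample are assumptions for illustration, following the raw Hotpot-QA layout shown earlier.

```python
def supporting_sentences(sample: dict) -> list:
    """Return the annotated supporting sentences as (title, sentence) pairs."""
    title_to_sentences = dict(zip(sample['context']['title'],
                                  sample['context']['sentences']))
    facts = sample['supporting_facts']
    pairs = []
    for title, sent_id in zip(facts['title'], facts['sent_id']):
        sentences = title_to_sentences.get(title, [])
        if sent_id < len(sentences):  # defensively skip ids past the paragraph
            pairs.append((title, sentences[sent_id].strip()))
    return pairs


toy_sample = {
    'supporting_facts': {'title': ['A', 'C'], 'sent_id': [0, 1]},
    'context': {
        'title': ['A', 'B', 'C'],
        'sentences': [['A was first.', ' A is old.'],
                      ['B is a distractor.'],
                      ['C is new.', ' C came later.']],
    },
}
print(supporting_sentences(toy_sample))
```

Paragraph-level support (as in `process_sample`) gives the model more surrounding text to copy, while sentence-level support keeps the scratchpad shorter.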

Applying the above to our datasets (we'll just use the smaller val split for now), we get:

In [12]:
val_set = raw_datasets['validation']

# Convert to question, answer, context, support format
val_set   = val_set.map(process_sample, remove_columns=list(val_set.features),
                        load_from_cache_file=False)

Map: 100%|███████████████████████████████████████████████████████████████████████| 7405/7405 [00:01<00:00, 4543.38 examples/s]


In [13]:
val_set  # see how features are updated

Dataset({
    features: ['question', 'answer', 'context', 'support', 'support_indices'],
    num_rows: 7405
})

In [14]:
val_set[0]  # and we have a nicer set of chunks. We'll aim to get this format for other datasets

{'question': 'Were Scott Derrickson and Ed Wood of the same nationality?',
 'answer': 'yes',
 'context': [{'text': "Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood. The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau. Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast.",
   'title': 'Ed Wood (film)'},
  {'text': 'Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. He lives in Los Angeles, California. He is best known for directing horror films such as "Sinister", "The Exorcism of Emily Rose", and "Deliver Us From Evil", as well as the 2016 Marvel Cinematic Universe installment, "Doctor Strange."',
   'title': 'Scott Derrickson'},
  {'text': 'Woodson is a census-designated place (CDP)

### Prompt templating
Given those chunks, we'll now convert them into a single string that can be tokenized and fed to an LLM.

We can do the above with the following two functions, taken from `dataloaders.utils.py` (with slight adjustments)

In [15]:
from typing import Any, Callable
from functools import partial
from os.path import join, isdir

# We call .map() and .features below, so this is the Hugging Face Dataset
from datasets import Dataset, load_from_disk  # hf datasets


def tokenize_dataset(dataset: Dataset, split_name: str,
                     tokenize_func: Callable,
                     cache_dir: str,
                     tokenizer_name: str,
                     **tokenize_kwargs: Any):
    """
    Apply prompt formatting and tokenize dataset
    """
    save_path = join(cache_dir, f'{tokenizer_name}_{split_name}')  # e.g., '<cache_dir>/hotpot_qa_Mistral-7B-v0.1_val_lm_anc'
    
    try:  # If we've already formatted + tokenized, we save and can pull from disk
        dataset = load_from_disk(save_path)
        print(f'Tokenized dataset loaded from {save_path}!')
    
    except Exception as e:  # If not found, we'll format + tokenize
        print(e)
        if not isdir(save_path):  
            # For this notebook we won't save anything, so commenting out
            # os.makedirs(save_path)
            # print(f'-> Created {save_path}')
            # print(f'-> Tokenizing {split_name} dataset...')
            dataset = dataset.map(partial(tokenize_func, **tokenize_kwargs),
                                  remove_columns=list(dataset.features),
                                  load_from_cache_file=False)
            # dataset.save_to_disk(save_path)
            # print(f'Tokenized {split_name} dataset saved to {save_path}!')
    return dataset
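
The load-or-compute pattern that `tokenize_dataset` follows can be sketched without any Hugging Face dependencies. Below, `load_or_build` and `expensive_build` are hypothetical stand-ins (JSON on disk instead of `save_to_disk` / `load_from_disk`), with saving enabled so the second call is served from the cache.

```python
import json
import os
import tempfile
from typing import Callable


def load_or_build(save_path: str, build_func: Callable[[], list]) -> list:
    """Load a cached result from `save_path`, or build and cache it."""
    if os.path.isfile(save_path):
        with open(save_path) as f:
            return json.load(f)
    result = build_func()
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    with open(save_path, 'w') as f:
        json.dump(result, f)
    return result


calls = []


def expensive_build() -> list:
    calls.append(1)  # track how often we actually recompute
    return [[1, 2, 3], [4, 5]]


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, 'toy_tokenizer_val', 'data.json')
    first = load_or_build(path, expensive_build)   # builds and saves
    second = load_or_build(path, expensive_build)  # loads from disk

print(first == second, len(calls))
```

Because the cache key is just the path, anything that changes the tokenized output (model, template, scratchpad flag) should be reflected in the file name, otherwise a stale cache may be loaded.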

In [16]:
from transformers import AutoTokenizer


def tokenize_add_label(sample: dict, tokenizer: AutoTokenizer, 
                       context_source: str='context', 
                       include_label: bool=True,
                       instruct_tune: bool=False,
                       include_support: bool=False,):
    """
    Convert RAG training samples into a single input text and tokenize
    """
    question = sample['question']
    if not question.endswith('?'):  # Add punctuation (also safe for empty strings)
        question += '?'

    # Making the template to finetune our LLMs over. 
    # -> The additional flavor-text may not be necessary given we're finetuning,
    #    but it could be helpful / in general is taken from prior work, so we'll
    #    be consistent with that.
    template = f"""Write a high-quality answer for the given question using only the provided context (some of which might be irrelevant).

Question: {{question}}

Context:
{{context}}
    
Question: {{question}}

Answer:"""
    if instruct_tune:  
        # Instruction-tuned models often come with specific additional prompt templating,
        # e.g., adding [INST] and [/INST] to the beginning and end
        template = '[INST] ' + template + ' [/INST]'
    
    context = []  # Populating our context
    for ix, c in enumerate(sample[context_source]):
        context.append(f"Document (Title: {c['title']}) {c['text']}")
    context = '\n\n'.join(context)

    # Filling in the template
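    # Use the punctuated `question` from above. Note that str.capitalize() also
    # lowercases everything after the first character (e.g., proper nouns)
    prompt = template.format(context=context, question=question.capitalize())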

    # Tokenizing -> note we add the bos and eos tokens manually, and don't let the
    # tokenizer do this automatically (via add_special_tokens=False)
    prompt = f'{tokenizer.bos_token}{prompt}'
    prompt = tokenizer.encode(prompt, add_special_tokens=False)

    # For scratchpad, we'll extend our answer with the supporting context
    if include_support:
        # First we'll start by getting the model to repeat the question
        sample["answer"] = f"{question.capitalize()}\n\n" + sample["answer"]

        # Then we'll prepend the supporting chunks. Each chunk goes in front of the
        # text built so far, so chunks end up in reverse order of sample['support']
        for c in sample['support']:
            sample["answer"] = f"Document (Title: {c['title']}) {c['text']}\n\n" + sample["answer"]
    
    if include_label:  # For training, we use next-token prediction and so include the label in the text
        answer = tokenizer.encode(f'{sample["answer"]}{tokenizer.eos_token}', add_special_tokens=False)
    else:  # For final testing, we just want to give the context + question to the LLM
        answer = []
        target = tokenizer.encode(f'{sample["answer"]}{tokenizer.eos_token}', add_special_tokens=False)

    input_ids = prompt + answer
    attn_mask = [1] * len(input_ids)
    
    sample =  {
        "input_ids": input_ids,
        "attention_mask" : attn_mask,
        # we only finetune the model on the answer parts during training
        # setting label to -100 bc -100 is the default ignore_index of nn.CrossEntropyLoss
        "labels": [-100] * len(prompt) + answer if include_label else target,
    }
    return sample
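
To make the label layout concrete, here is a small sketch that reproduces only the masking logic of `tokenize_add_label` on made-up token ids. The helper `build_features` and the ids are assumptions for illustration; no real tokenizer is involved.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_features(prompt_ids: list, answer_ids: list,
                   include_label: bool = True) -> dict:
    """Mimic the masking layout above with pre-tokenized ids."""
    if include_label:
        # Training / val: the model sees prompt + answer, but only the
        # answer positions carry a real label
        input_ids = prompt_ids + answer_ids
        labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    else:
        # Testing: the model only sees the prompt; the answer is kept
        # separately as the target to compare generations against
        input_ids = list(prompt_ids)
        labels = list(answer_ids)
    return {'input_ids': input_ids,
            'attention_mask': [1] * len(input_ids),
            'labels': labels}


train = build_features([1, 10, 11, 12], [20, 21, 2], include_label=True)
test = build_features([1, 10, 11, 12], [20, 21, 2], include_label=False)
supervised = [t for t in train['labels'] if t != IGNORE_INDEX]
print(train['labels'])
print(supervised)
print(test['input_ids'], test['labels'])
```

With `include_label=True`, `input_ids` and `labels` have the same length. With `include_label=False` they generally do not, since `labels` then holds only the target answer.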

So if we now apply the above to our dataset, we'll be good to go.

### First get tokenizer

In [17]:
# First get model tokenizer
from dataloaders.utils import get_tokenizer_from_config



In [18]:
tokenizer = get_tokenizer_from_config(dataset_config.pretrained_model_config)
tokenizer

LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [19]:
dataset_config['name']

'hotpot_qa'

### Then tokenize -> first standard training dataset (no scratchpad)

In [20]:
# Then tokenize
name = dataset_config['name']
if dataset_config.dataset_config['include_support']:
    name += f'-is=1'
tokenizer_name = dataset_config['pretrained_model_config']['pretrained_model_name_or_path']
tokenizer_name = tokenizer_name.split('/')[-1]  # just get mistral-7b or smth
tokenizer_name = f'{name}_{tokenizer_name}'
print(tokenizer_name)

tokenize_kwargs = {
    'tokenizer': tokenizer,
    'tokenizer_name': tokenizer_name,
    'tokenize_func': tokenize_add_label,
    'context_source': 'context',  # for "gold" / no-distractor dataset can set this to 'support'
    'include_label': True,  # True for training or val splits, and False for testing
    'cache_dir': dataset_config.dataset_config['cache_dir'],
    'instruct_tune': 'instruct' in tokenizer_name.lower(),  # e.g., Mistral-7B-Instruct-v0.1 vs Mistral-7B-v0.1
    'include_support': dataset_config.dataset_config['include_support'],  # True for scratchpad, False otherwise
}

val_set_lm = tokenize_dataset(val_set, 'val_lm_anc', **tokenize_kwargs)

hotpot_qa_Mistral-7B-v0.1
Directory /juice/scr/scr110/scr/nlp/data/hotpot-qa-hf/hotpot_qa_Mistral-7B-v0.1_val_lm_anc not found


Map: 100%|████████████████████████████████████████████████████████████████████████| 7405/7405 [00:26<00:00, 282.68 examples/s]
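
The naming logic at the top of the tokenization cell can be factored into a small helper. The function `build_cache_name` below is a hypothetical refactor, not something from the repo; it follows the same string rules as the cell above.

```python
def build_cache_name(dataset_name: str, model_path: str,
                     include_support: bool) -> str:
    """Build the '<dataset>[-is=1]_<model>' prefix used for cached splits."""
    name = dataset_name + ('-is=1' if include_support else '')
    model_name = model_path.split('/')[-1]  # drop the org prefix, e.g. 'mistralai/'
    return f'{name}_{model_name}'


print(build_cache_name('hotpot_qa', 'mistralai/Mistral-7B-v0.1', False))
print(build_cache_name('hotpot_qa', 'mistralai/Mistral-7B-v0.1', True))
```

Passing the same flag that is given to `include_support` in `tokenize_kwargs` would keep standard and scratchpad caches from sharing a name prefix.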


### Inspect tokenized dataset

In [21]:
val_set_lm  # tokens are input_ids

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 7405
})

In [22]:
# View the sample
print(tokenizer.decode(val_set_lm[0]['input_ids']))

<s> Write a high-quality answer for the given question using only the provided context (some of which might be irrelevant).

Question: Were scott derrickson and ed wood of the same nationality?

Context:
Document (Title: Ed Wood (film)) Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood. The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau. Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast.

Document (Title: Scott Derrickson) Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. He lives in Los Angeles, California. He is best known for directing horror films such as "Sinister", "The Exorcism of Emily Rose", and "Deliver Us From Evil", as well as the 2016 Marvel Cinematic Universe installm

### Now tokenize scratchpad training dataset

In [23]:
# Then tokenize
name = dataset_config['name']
if dataset_config.dataset_config['include_support']:
    name += f'-is=1'
tokenizer_name = dataset_config['pretrained_model_config']['pretrained_model_name_or_path']
tokenizer_name = tokenizer_name.split('/')[-1]  # just get mistral-7b or smth
tokenizer_name = f'{name}_{tokenizer_name}'
print(tokenizer_name)

tokenize_kwargs = {
    'tokenizer': tokenizer,
    'tokenizer_name': tokenizer_name,
    'tokenize_func': tokenize_add_label,
    'context_source': 'context',  # for "gold" / no-distractor dataset can set this to 'support'
    'include_label': True,  # True for training or val splits, and False for testing
    'cache_dir': dataset_config.dataset_config['cache_dir'],
    'instruct_tune': 'instruct' in tokenizer_name.lower(),  # e.g., Mistral-7B-Instruct-v0.1 vs Mistral-7B-v0.1
    'include_support': True,  # True for scratchpad, False otherwise
}

val_set_scratchpad_lm = tokenize_dataset(val_set, 'val_scratchpad_lm', **tokenize_kwargs)

hotpot_qa_Mistral-7B-v0.1
Directory /juice/scr/scr110/scr/nlp/data/hotpot-qa-hf/hotpot_qa_Mistral-7B-v0.1_val_scratchpad_lm not found


Map: 100%|████████████████████████████████████████████████████████████████████████| 7405/7405 [00:24<00:00, 304.26 examples/s]


In [24]:
# View the sample -> note the additional chunks at the end
print(tokenizer.decode(val_set_scratchpad_lm[0]['input_ids']))

<s> Write a high-quality answer for the given question using only the provided context (some of which might be irrelevant).

Question: Were scott derrickson and ed wood of the same nationality?

Context:
Document (Title: Ed Wood (film)) Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood. The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau. Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast.

Document (Title: Scott Derrickson) Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. He lives in Los Angeles, California. He is best known for directing horror films such as "Sinister", "The Exorcism of Emily Rose", and "Deliver Us From Evil", as well as the 2016 Marvel Cinematic Universe installm

### Finally tokenize testing dataset 

Note that the tokenized sample will not contain the answer.

In [25]:
# Then tokenize
name = dataset_config['name']
if dataset_config.dataset_config['include_support']:
    name += f'-is=1'
tokenizer_name = dataset_config['pretrained_model_config']['pretrained_model_name_or_path']
tokenizer_name = tokenizer_name.split('/')[-1]  # just get mistral-7b or smth
tokenizer_name = f'{name}_{tokenizer_name}'
print(tokenizer_name)

tokenize_kwargs = {
    'tokenizer': tokenizer,
    'tokenizer_name': tokenizer_name,
    'tokenize_func': tokenize_add_label,
    'context_source': 'context',  # for "gold" / no-distractor dataset can set this to 'support'
    'include_label': False,  # True for training or val splits, and False for testing
    'cache_dir': dataset_config.dataset_config['cache_dir'],
    'instruct_tune': 'instruct' in tokenizer_name.lower(),  # e.g., Mistral-7B-Instruct-v0.1 vs Mistral-7B-v0.1
    'include_support': False,  # True for scratchpad, False otherwise
}

test_set = tokenize_dataset(val_set, 'test', **tokenize_kwargs)

hotpot_qa_Mistral-7B-v0.1
Directory /juice/scr/scr110/scr/nlp/data/hotpot-qa-hf/hotpot_qa_Mistral-7B-v0.1_test not found


Map: 100%|████████████████████████████████████████████████████████████████████████| 7405/7405 [00:17<00:00, 426.16 examples/s]


In [26]:
# View the sample -> note no answer at the end
print(tokenizer.decode(test_set[0]['input_ids']))

<s> Write a high-quality answer for the given question using only the provided context (some of which might be irrelevant).

Question: Were scott derrickson and ed wood of the same nationality?

Context:
Document (Title: Ed Wood (film)) Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Ed Wood. The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau. Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast.

Document (Title: Scott Derrickson) Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. He lives in Los Angeles, California. He is best known for directing horror films such as "Sinister", "The Exorcism of Emily Rose", and "Deliver Us From Evil", as well as the 2016 Marvel Cinematic Universe installm