# Preprocess SAR dataset
The goal of this notebook is to preprocess the SAR dataset into a format that `MegatronGPTModel` can consume.

It follows the process outlined in the [GPT Training Tutorial](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/gpt/gpt_training.html).

- Step 1: Input
    - `train_assoc_recall_<n-samples>_<vocab-size>_<context-length>.pt`
    - `test_assoc_recall_<n-samples>_<vocab-size>_<context-length>.pt`

- Step 2: Extract raw data - save each split as a JSON Lines file (`train_sar.jsonl` for the training split)

- Step 3: Tokenizer - define the `sar-vocab.txt` tokenizer file


- Step 4: Convert the training data into memory-map format
    - this generates `gpt_training_data_text_document.bin` and `gpt_training_data_text_document.idx`, which can then be loaded by the MegatronGPT model for training.

In [1]:
import os
import torch
from torch.utils.data import TensorDataset, Dataset, DataLoader
from typing import Dict
import numpy as np
from tqdm import tqdm
from collections import Counter

root_path = os.path.abspath('..')
print(root_path)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

/home/gkoren/scratch/code/github/guyk1971/NeMo
cuda


# Step 1: Explore the input

## safari tensors

In [2]:
data_root_path=os.path.join(root_path,'sandbox')
ds_files = [f for f in os.listdir(data_root_path) if f.endswith('.pt')]
ds_files

['train_assoc_recall_4000_40_1024.pt', 'test_assoc_recall_4000_40_1024.pt']

In [3]:
# load a dataset
num_examples=4000
vocab_size=40
input_seq_len=1024
train_tensor = torch.load(os.path.join(data_root_path, 
    f"train_assoc_recall_{num_examples}_{vocab_size}_{input_seq_len}.pt"))
test_tensor = torch.load(os.path.join(data_root_path, 
    f"test_assoc_recall_{num_examples}_{vocab_size}_{input_seq_len}.pt"))

print(train_tensor.shape)
print(test_tensor.shape)

torch.Size([4000, 2, 1026])
torch.Size([500, 2, 1026])


In [4]:
print(train_tensor[0,0,:])

tensor([ 6, 22, 15,  ..., 27, 39, 17])


In [5]:
print(train_tensor[0,1,:])

tensor([22, 15, 28,  ..., 39, 17, 27])


In [6]:
print(test_tensor[0,0,:])

tensor([ 8, 21, 19,  ..., 35, 39, 13])


In [7]:
print(test_tensor[0,1,:])

tensor([-100, -100, -100,  ..., -100, -100,   35])


In [8]:
tnp=test_tensor.numpy()
# drop the first 4 input tokens and append the last target token (the answer to the query)
tnp=np.concatenate((tnp[:,0,4:],tnp[:,1,-1].reshape(-1,1)),axis=-1)
tnp.shape

(500, 1023)

In [9]:
tnp[0]

array([ 5, 20,  6, ..., 39, 13, 35])
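
The printed rows suggest the layout of each sample: slot 0 holds the model inputs and slot 1 the targets. In the training split the targets look like the inputs shifted left by one position, while in the test split every target is `-100` except the last one. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so presumably only the last position is scored at test time. A minimal sketch of these two checks on a made-up toy array (the values and the names `toy_train`, `toy_test` are illustrative, not taken from the dataset):

```python
import numpy as np

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

# Toy stand-ins for the real tensors (values are made up for illustration).
# Layout: (num_samples, 2, seq_len); slot 0 = inputs, slot 1 = targets.
toy_inputs = np.array([[6, 22, 15, 28, 39, 15]])
toy_train = np.stack([toy_inputs, np.array([[22, 15, 28, 39, 15, 28]])], axis=1)
toy_test = np.stack([toy_inputs, np.array([[IGNORE_INDEX] * 5 + [28]])], axis=1)

# Train split: targets are the inputs shifted left by one position.
train_is_shifted = np.array_equal(toy_train[:, 0, 1:], toy_train[:, 1, :-1])

# Test split: only positions whose target is not IGNORE_INDEX are scored.
scored_per_sample = (toy_test[:, 1, :] != IGNORE_INDEX).sum(axis=-1)

print(train_is_shifted, scored_per_sample.tolist())
```

The same two expressions can be run on `train_tensor.numpy()` and `test_tensor.numpy()` to confirm the layout on the real data.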

## train_data.jsonl
`train_data.jsonl` will contain our training data in JSON Lines format. We are interested in the data under the `text` field.

In [None]:
import json

def load_jsonl_file(file_path, num_lines=None):
    data = []
    with open(file_path, 'r') as file:
        for idx, line in enumerate(file):
            if num_lines is not None and idx >= num_lines:
                break  # Stop reading after reaching the specified number of lines
            # Each line is a JSON object, so we parse it and add it to the list
            data.append(json.loads(line))
    return data

In [None]:
train_json_files = [f for f in os.listdir(data_root_path) if f.endswith('.jsonl')]
print(train_json_files)

In [None]:
file_path = os.path.join(data_root_path,'train_data.jsonl')
num_lines_to_read = 10  # Replace with the number of lines you want to read
result = load_jsonl_file(file_path, num_lines_to_read)
print(result)


In [None]:
result[0].keys()

In [None]:
result[0]['text']

In [None]:
type(result)

# Step 2: Extract Raw Data (Converting tensor to jsonl)
We treat each sample in the tensor as one document, i.e. `train_tensor[0,0,:]` plays the role of `result[0]['text']`, except that we also need to append the last target token (the value that answers the query). That suggests the model's sequence length should be 1024+3=1027, which is not a convenient number (although not much worse than 1026). Open question: do we need to pad it further?
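
As a sanity check on the conversion and on the length arithmetic, here is a small sketch on a made-up toy sample. The helper `tensor_to_documents` and its `skip` parameter are hypothetical names introduced for illustration; `skip` mirrors the `4:` slice used in this notebook. With 1026 input tokens plus one answer token a document has 1027 tokens, and dropping the first 4 gives 1023, which matches the `(500, 1023)` shape printed earlier.

```python
import numpy as np

def tensor_to_documents(pairs: np.ndarray, skip: int = 0) -> list:
    """Turn an (N, 2, L) input/target array into N space-separated documents.

    Each document is the input row (minus the first `skip` tokens) followed by
    the last target token, i.e. the value the model should recall.
    `skip` is a hypothetical knob mirroring the `4:` slice used in the notebook.
    """
    inputs = pairs[:, 0, skip:]
    answers = pairs[:, 1, -1:]  # keep 2-D so it can be concatenated
    tokens = np.concatenate([inputs, answers], axis=-1)
    return [" ".join(str(t) for t in row) for row in tokens]

# Toy sample (made-up values): the inputs end with the query key 2, the answer is 22.
toy = np.array([[[1, 21, 2, 22, 39, 2],
                 [-100, -100, -100, -100, -100, 22]]])
docs = tensor_to_documents(toy)
print(docs[0])                # 1 21 2 22 39 2 22
print(len(docs[0].split()))   # 7 = 6 inputs + 1 answer
```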






In [None]:
tnp=train_tensor.numpy()
# take the last token from tnp[:,1,:] and append it to tnp[:,0,4:] (the first 4 input tokens are dropped)
tnp=np.concatenate((tnp[:,0,4:],tnp[:,1,-1].reshape(-1,1)),axis=-1)
tnp.shape

In [None]:
" ".join([str(i) for i in tnp[0,:]])

In [None]:
# convert sample to string to create the equivalent to result[:]['text'] that should be saved to json
dict2jsn=[{'text':" ".join([str(i) for i in tnp[j,:]])} for j in range(len(tnp)) ]
dict2jsn[0]['text']

Write to a JSONL file:

In [None]:
import json

def write_jsonl_file(file_path, data):
    with open(file_path, 'w') as file:
        for item in data:
            # Convert each dictionary to a JSON string and write it as a line
            json_line = json.dumps(item)
            file.write(json_line + '\n')

In [None]:
file_path = os.path.join(data_root_path,'train_sar.jsonl')
write_jsonl_file(file_path, dict2jsn)

In [None]:
# validate
num_lines_to_read = 10  # Replace with the number of lines you want to read
result = load_jsonl_file(file_path, num_lines_to_read)
print(result[0])
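
The write/read round trip can also be checked in isolation. The sketch below uses a temporary directory and two made-up documents (the file name `toy_sar.jsonl` is hypothetical), so it does not touch the real `sandbox` files:

```python
import json
import os
import tempfile

# Made-up documents in the same shape as dict2jsn: one {"text": ...} per sample.
docs = [{"text": "1 21 2 22 39 2 22"}, {"text": "3 23 4 24 39 4 24"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_sar.jsonl")

    # Write one JSON object per line.
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")

    # Read the file back and parse each line independently.
    with open(path) as f:
        raw_lines = f.read().splitlines()
    loaded = [json.loads(line) for line in raw_lines]

print(raw_lines[0])    # {"text": "1 21 2 22 39 2 22"}
print(loaded == docs)  # True
```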

In [None]:
tnp.shape

In [None]:
tnp[0,:10]

In [None]:
tnp[0,-10:]

### debugging a sample from the batch (in MegatronGPTModel)

In [None]:
txt='25 10 33 12 32 10 33 11 22 4 27 7 37 3 37 9 23 9 23 6 29 2 21 17 27 6 29 14 38 17 27 2 21 4 27 2 21 14 38 5 23 16 28 14 38 6 29 14 38 4 27 10 33 18 34 9 23 1 21 4 27 9 23 8 34 13 25 5 23 12 32 12 32 11 22 15 25 5 23 6 29 16 28 13 25 1 21 19 22 10 33 10 33 11 22 10 33 3 37 15 25 19 22 6 29 16 28 12 32 15 25 10 33 7 37 11 22 9 23 4 27 19 22 15 25 15 25 15 25 17 27 2 21 7 37 17 27 9 23 15 25 17 27 4 27 16 28 4 27 8 34 11 22 4 27 13 25 16 28 12 32 3 37 2 21 8 34 18 34 1 21 10 33 15 25 9 23 9 23 15 25 6 29 15 25 14 38 15 25 1 21 11 22 8 34 16 28 4 27 7 37 1 21 10 33 13 25 6 29 3 37 7 37 15 25 19 22 16 28 15 25 15 25 15 25 14 38 14 38 16 28 14 38 5 23 10 33 3 37 18 34 6 29 10 33 13 25 9 23 13 25 1 21 14 38 19 22 16 28 17 27 9 23 15 25 15 25 2 21 7 37 19 22 15 25 10 33 2 21 10 33 1 21 11 22 14 38 5 23 14 38 2 21 11 22 19 22 7 37 2 21 5 23 3 37 4 27 5 23 6 29 17 27 12 32 11 22 7 37 12 32 4 27 12 32 3 37 14 38 2 21 9 23 2 21 3 37 4 27 5 23 1 21 13 25 19 22 4 27 3 37 2 21 8 34 17 27 2 21 12 32 8 34 8 34 18 34 7 37 9 23 10 33 9 23 8 34 4 27 4 27 19 22 3 37 7 37 10 33 5 23 5 23 10 33 16 28 18 34 15 25 12 32 14 38 2 21 10 33 18 34 13 25 6 29 6 29 9 23 11 22 4 27 12 32 12 32 19 22 5 23 3 37 11 22 7 37 4 27 2 21 7 37 7 37 5 23 7 37 8 34 16 28 14 38 16 28 13 25 12 32 4 27 2 21 3 37 2 21 6 29 14 38 9 23 17 27 16 28 9 23 9 23 14 38 18 34 7 37 13 25 9 23 4 27 12 32 3 37 4 27 17 27 19 22 10 33 8 34 5 23 10 33 17 27 17 27 14 38 4 27 18 34 17 27 13 25 2 21 18 34 4 27 15 25 4 27 5 23 16 28 13 25 3 37 4 27 14 38 10 33 5 23 4 27 12 32 19 22 16 28 18 34 19 22 15 25 9 23 9 23 6 29 17 27 3 37 7 37 9 23 7 37 7 37 17 27 15 25 13 25 15 25 15 25 10 33 2 21 6 29 8 34 17 27 7 37 1 21 15 25 1 21 3 37 10 33 19 22 18 34 10 33 13 25 17 27 7 37 1 21 12 32 3 37 5 23 8 34 6 29 13 25 17 27 19 22 2 21 2 21 13 25 3 37 6 29 16 28 8 34 11 22 2 21 6 29 18 34 5 23 13 25 3 37 9 23 5 23 9 23 14 38 16 28 6 29 18 34 16 28 19 22 19 22 1 21 15 25 3 37 14 38 8 34 6 29 16 28 1 21 2 21 19 22 5 23 1 21 8 34 7 37 16 28 16 28 14 38 15 25 6 29 5 23 6 29 6 29 2 21 18 34 1 21 13 25 10 33 11 22 16 28 2 21 3 37 9 23 14 38 15 25 11 22 4 27 1 21 6 29 3 37 16 28 2 21 2 21 5 23 8 34 13 25 10 33 18 34 15 25 12 32 13 25 14 38 14 38 7 37 11 22 39 2 21 13 24 15 35 6 32 11 31 11 31 11 31 9 28 8 36 18 21 10 26 16 20 18 21 11 31 13 24 8 36 13 24 1 35 2 22 2 22 4 26 4 26 11 31 7 37 1 35 19 32 2 22 6 32 17 35 10 26 11 31 7 37 5 27 13 24 8 36 14 34 6 32 18 21 10 26 5 27 5 27 8 36 10 26 11 31 8 36 4 26 2 22 16 20 18 21 16 20 1 35 2 22 12 34 10 26 19 32 6 32 1 35 10 26 9 28 6 32 13 24 17 35 6 32 14 34 12 34 7 37 14 34 12 34 11 31 14 34 16 20 9 28 3 20 11 31 14 34 12 34 8 36 15 35 1 35 12 34 16 20 15 35 8 36 13 24 14 34 15 35 17 35 9 28 16 20 6 32 1 35 11 31 4 26 6 32 3 20 19 32 18 21 9 28 19 32 18 21 5 27 15 35 1 35 16 20 18 21 4 26 13 24 18 21 12 34 3 20'

In [None]:
len(txt.split())

In [None]:
txt

In [None]:
txtn=np.array([int(k) for k in txt.split()])


In [None]:
np.where(txtn==39)

In [None]:
tnp[0,-10:]

In [None]:
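# values that follow each occurrence of key 16 in the first document (the last position has no successor)
tnp[0, np.where(tnp[0, :-1] == 16)[0] + 1]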


In [None]:
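# values that follow each occurrence of key 8 in the batch sample (the last position has no successor)
txtn[np.where(txtn[:-1] == 8)[0] + 1]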
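
The lookups above check by hand that a key is always followed by the same value. A more systematic version is sketched below. It assumes a document is laid out as `key value key value ... 39 query answer`, with token 39 acting as the marker before the query; that layout is inferred from the sample and is an assumption, and `check_recall` is a hypothetical helper. The batch sample above appears to start mid-pair and to continue into the next document after the answer, so it would need trimming before being passed in.

```python
def check_recall(tokens, sep=39):
    """Check one associative-recall document laid out as
    k1 v1 k2 v2 ... <sep> query answer (an assumed layout).

    Returns (query, answer, expected), where `expected` is the value paired
    with `query` in the context, or None if the query never appeared.
    """
    sep_pos = tokens.index(sep)
    context = tokens[:sep_pos]
    query, answer = tokens[sep_pos + 1], tokens[sep_pos + 2]
    # Build the key -> value map from consecutive (key, value) pairs.
    mapping = dict(zip(context[0::2], context[1::2]))
    return query, answer, mapping.get(query)

# Made-up toy document: pairs (1,21) (2,22) (1,21), then 39, query 2, answer 22.
toy_doc = [1, 21, 2, 22, 1, 21, 39, 2, 22]
query, answer, expected = check_recall(toy_doc)
print(query, answer, expected)  # 2 22 22
```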

# Step 3: Tokenizer
The objective is to define the `sar-vocab.txt` file that we'll use.  
We need to look at the `WordTokenizer` and check whether our vocabulary fits it.
Specifically: do we need special tokens? Does the 'copy' token have a special meaning?
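
To make the special-token question concrete, here is a toy whitespace tokenizer. It is a hypothetical stand-in, not NeMo's `WordTokenizer`, and it assumes that special tokens are assigned the first ids. Under that assumption the token ids no longer equal the integers in the text, and the effective vocabulary grows from 40 to 42, which would matter when setting the model's vocabulary size.

```python
class ToyWordTokenizer:
    """Hypothetical whitespace tokenizer, NOT NeMo's WordTokenizer."""

    def __init__(self, words, special_tokens=("<pad>", "<eos>")):
        # Assumption: special tokens take the first ids, regular words follow.
        self.id_to_token = list(special_tokens) + list(words)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

vocab_size = 40
tokenizer = ToyWordTokenizer([str(i) for i in range(vocab_size)])

ids = tokenizer.encode("6 22 39")
print(ids)                         # [8, 24, 41]
print(tokenizer.decode(ids))       # 6 22 39
print(len(tokenizer.id_to_token))  # 42
```

Whether the real tokenizer orders its ids this way is something to verify against the NeMo source before training.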

In [None]:
!head -n 1 gpt2-vocab.json

In [None]:
def write_json_file(file_path, data):
    with open(file_path, 'w') as file:
        # Write the dictionary to the file using json.dump()
        json.dump(data, file, indent=2)  # 'indent' adds pretty formatting (optional)

In [None]:
def write_vocab_file(file_path, vocab):
    with open(file_path, 'w') as file:
        # First line: special tokens as a JSON dict; then one quoted token per line
        file.write('{"pad_token":"<pad>","eos_token":"<eos>"}'+'\n')
        for key in vocab.keys():
            file.write(f'"{key}"'+'\n')
            

In [None]:
vocab_size=40
sar_vocab = {str(i):i for i in range(vocab_size)}
sar_vocab

In [None]:
fname=f'sar-vocab-{vocab_size}.txt'
file_path = os.path.join(data_root_path,fname)
write_vocab_file(file_path, sar_vocab)

In [None]:
!cat $file_path
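
The file layout can also be checked programmatically. The sketch below writes the same layout as `write_vocab_file` to a temporary directory and parses it back. It only verifies the layout itself; whether `WordTokenizer` accepts exactly this layout is an assumption that still has to be confirmed.

```python
import json
import os
import tempfile

vocab_size = 40
tokens = [str(i) for i in range(vocab_size)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, f"sar-vocab-{vocab_size}.txt")

    # Same layout as write_vocab_file: special-token dict, then quoted tokens.
    with open(path, "w") as f:
        f.write('{"pad_token":"<pad>","eos_token":"<eos>"}' + "\n")
        for tok in tokens:
            f.write(f'"{tok}"' + "\n")

    with open(path) as f:
        lines = f.read().splitlines()

special_tokens = json.loads(lines[0])                     # first line: JSON dict
parsed_tokens = [json.loads(line) for line in lines[1:]]  # remaining lines: quoted strings

print(special_tokens)
print(len(parsed_tokens), parsed_tokens[:3], parsed_tokens[-1])
```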

# Step 4: convert training data into memory map format
This is done with the script provided in the tutorial:
```
python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=train_data.jsonl \
--json-keys=text \
--tokenizer-library=megatron \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--output-prefix=hfbpe_gpt_training_data \
--append-eod \
--workers=32
```


Notes on the arguments:
- `json-keys`: the JSON field(s) to read from each line; our data is under `text`.
- `tokenizer-library`: megatron
- `vocab`: we need to feed our own vocab file instead of `gpt2-vocab.json`.
- `tokenizer-type`: we need something other than `GPT2BPETokenizer`; we need `WordTokenizer`.
- `merge-file`: optional; ignore it.
- `append-eod`: appends an end-of-document token to each document. Open question: do we need it?


So the command should be:
```
python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=sandbox/train_sar.jsonl \
--tokenizer-library=megatron \
--vocab-file=sandbox/sar-vocab-40.txt \
--dataset-impl=mmap \
--tokenizer-type=word \
--output-prefix=sar_gpt_training_data \
--workers=32
```
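
To launch the script from the notebook, the command can be assembled as an argument list. This is a sketch: the flags are the ones from the command above, `<NeMo_ROOT_FOLDER>` stays a placeholder, and both paths assume the files live in `sandbox/` relative to the directory the command is run from.

```python
import shlex

# Placeholder: point this at the local NeMo checkout.
nemo_root = "<NeMo_ROOT_FOLDER>"
script = f"{nemo_root}/scripts/nlp_language_modeling/preprocess_data_for_megatron.py"

# Flags taken from the command above; the two paths are assumptions.
options = {
    "--input": "sandbox/train_sar.jsonl",
    "--tokenizer-library": "megatron",
    "--vocab-file": "sandbox/sar-vocab-40.txt",
    "--dataset-impl": "mmap",
    "--tokenizer-type": "word",
    "--output-prefix": "sar_gpt_training_data",
    "--workers": "32",
}

cmd = ["python", script] + [f"{flag}={value}" for flag, value in options.items()]
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues; `shlex.join` is used only to print a copy-pasteable form.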
