# Imports

In [1]:
import lance
import pyarrow as pa

from datasets import load_dataset
from transformers import AutoTokenizer

from tqdm import tqdm

import warnings
warnings.simplefilter('ignore')

# Tokenizer and Dataset

We use the Alpaca instruction dataset in this walkthrough. We split it into train and validation splits and save each split as a separate Lance dataset.

In [2]:
split = 0.10 # We'll use 10% data for validation set

dataset = load_dataset("tatsu-lab/alpaca").shuffle(seed=42)
dataset['val'] = load_dataset("tatsu-lab/alpaca", split=f"train[:{int(split * 100)}%]").shuffle(seed=42)
dataset['train'] = load_dataset("tatsu-lab/alpaca", split=f"train[{int(split * 100)}%:]").shuffle(seed=42)

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
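To make the split strings above concrete, here is a minimal sketch of how the two percentage slices are built. `split_specs` is a hypothetical helper, not part of `datasets`, and it uses `round()` where the cell above uses `int()`; the two agree for `split = 0.10`, but `int()` can truncate for fractions such as `0.29`.

```python
from typing import Tuple


def split_specs(val_fraction: float) -> Tuple[str, str]:
    """Build (val, train) slice strings for a percentage-based split.

    Hypothetical helper for illustration; not part of the `datasets` library.
    """
    # round() guards against float truncation, e.g. 0.29 * 100 is slightly below 29
    pct = round(val_fraction * 100)
    return f"train[:{pct}%]", f"train[{pct}%:]"


val_spec, train_spec = split_specs(0.10)
print(val_spec, train_spec)
```

The two strings can then be passed as the `split` argument of `load_dataset`, as in the cell above.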


# Processing Samples

This is the most important function here. We go over each sample in the 🤗 dataset, tokenize every column value in that sample, and then yield a pyarrow `RecordBatch` containing the resulting token ids.

Each sample from the original dataset will be stored as a row in the pyarrow table (unlike the llm-pretraining dataset, where all the samples were stored in one large contiguous row of tokens).

This will make it considerably easier to iterate over them during training.

In [3]:
def process(dataset, tokenizer):
    for sample in tqdm(dataset):
        inst, inp, outp, text = sample['instruction'], sample['input'], sample['output'], sample['text']
        
        # Skip any sample in which at least one field is an empty string
        if not (inst and inp and outp and text):
            continue

        # Tokenize all the text data
        inst = tokenizer(inst)['input_ids']
        inp = tokenizer(inp)['input_ids']
        outp = tokenizer(outp)['input_ids']
        text = tokenizer(text)['input_ids']
        
        # Return a Pyarrow record batch with all the tokenized data
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([inst], pa.list_(pa.int64(), -1)),
                pa.array([inp], pa.list_(pa.int64(), -1)),
                pa.array([outp], pa.list_(pa.int64(), -1)),
                pa.array([text], pa.list_(pa.int64(), -1)),
            ],
            ["instructions", "inputs", "outputs", "texts"],
        )
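The skip-then-tokenize logic of `process` can be tried in isolation with a pure-Python sketch. Everything here is an assumption for illustration: `toy_tokenize` is a stand-in for the real tokenizer, and plain dicts take the place of the pyarrow `RecordBatch`.

```python
from typing import Dict, Iterable, Iterator, List

FIELDS = ("instruction", "input", "output", "text")


def toy_tokenize(s: str) -> List[int]:
    """Stand-in tokenizer (hypothetical): one integer per whitespace-separated word."""
    return [len(word) for word in s.split()]


def tokenize_rows(samples: Iterable[Dict[str, str]]) -> Iterator[Dict[str, List[int]]]:
    """Yield one dict of token-id lists per sample, skipping incomplete samples."""
    for sample in samples:
        # Same rule as `process`: drop the sample if any field is an empty string
        if not all(sample[f] for f in FIELDS):
            continue
        yield {f: toy_tokenize(sample[f]) for f in FIELDS}


samples = [
    {"instruction": "Add two numbers", "input": "2 3", "output": "5",
     "text": "Add two numbers 2 3 5"},
    {"instruction": "Say hello", "input": "", "output": "hello",
     "text": "Say hello hello"},
]
rows = list(tokenize_rows(samples))
print(len(rows))  # the second sample has an empty "input" and is skipped
```

Each surviving sample becomes exactly one row, which is the layout the real function produces with one single-row `RecordBatch` per sample.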

# Writing the dataset to disk

Now that our processing function is ready, we define a schema that tells pyarrow what data types to expect in the table. We then create a `RecordBatchReader` for each split (train and val) from that schema and the generator that yields the RecordBatches.

Finally, we pass those readers to `lance.write_dataset`, which writes the data to disk in the efficient Lance file format.

That's it!

In [4]:
# Schema to tell pyarrow the type of data we are expecting in our table
schema = pa.schema([
    pa.field("instructions", pa.list_(pa.int64(), -1)),
    pa.field("inputs", pa.list_(pa.int64(), -1)),
    pa.field("outputs", pa.list_(pa.int64(), -1)),
    pa.field("texts", pa.list_(pa.int64(), -1))
])

In [5]:
# These will be used by lance to write the dataset
train_reader = pa.RecordBatchReader.from_batches(schema, process(dataset['train'], tokenizer))
val_reader = pa.RecordBatchReader.from_batches(schema, process(dataset['val'], tokenizer))

# Write the train and val datasets to disk
lance.write_dataset(
    train_reader,
    "alpaca_train.lance",
    schema
)
lance.write_dataset(
    val_reader,
    "alpaca_val.lance",
    schema
)

100%|███████████████████████████████████| 46802/46802 [00:15<00:00, 3081.87it/s]
100%|█████████████████████████████████████| 5200/5200 [00:01<00:00, 2914.08it/s]


<lance.dataset.LanceDataset at 0x7feb91b24880>

## Sanity check

Let's load our newly created dataset and see how many samples we have in our dataset.

You might notice a significantly lower number of samples in our newly created training set below. This is because many samples contain an empty string in at least one field, and `process` skips those rows entirely, since empty values can cause problems with both the tokenizer and pyarrow.

If your use-case allows, you can always assign a placeholder token or any other string in place of an empty string to keep all the samples intact.
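A minimal sketch of the placeholder idea, assuming a hypothetical `fill_empty` helper and an arbitrary `"<empty>"` marker; neither comes from the original notebook, and the right placeholder depends on your tokenizer and task.

```python
from typing import Dict

PLACEHOLDER = "<empty>"  # hypothetical marker; choose one that suits your tokenizer


def fill_empty(sample: Dict[str, str], placeholder: str = PLACEHOLDER) -> Dict[str, str]:
    """Return a copy of `sample` with empty-string values replaced by `placeholder`."""
    return {key: (value if value else placeholder) for key, value in sample.items()}


sample = {"instruction": "Say hello", "input": "", "output": "hello"}
filled = fill_empty(sample)
print(filled["input"])
```

Calling `fill_empty(sample)` at the top of the loop in `process` would mean the empty-string check no longer drops any rows.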

In [6]:
# Load the dataset to inspect the total number of samples
ds = lance.dataset('alpaca_train.lance')
print(f"Total samples in alpaca-train set: {ds.count_rows():,d}")

Total samples in alpaca-train set: 18,385
