#Data Packaging

##There are two steps in Data Packaging
###1. Tokenizing and creating input_ids


In [13]:
ls drive/MyDrive/data

preprocessed_dataset.parquet


In [14]:
import datasets

dataset = datasets.load_dataset(
    "parquet",
    data_files = "drive/MyDrive/data/preprocessed_dataset.parquet",
    split="train"
)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 2000809
})


Use the `shard` method of the Hugging Face `Dataset` object to split the dataset into 10 smaller pieces, or *shards* (think shards of broken glass), and keep only the first one (`index=0`). You can read more about sharding in [the Hugging Face documentation](https://huggingface.co/docs/datasets/en/process#shard).

In [17]:
dataset = dataset.shard(num_shards=10, index=0)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 200081
})
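
The row count above can be checked by hand. The following is a minimal sketch, assuming the rows are spread across shards as evenly as possible, which is consistent with the 200081 rows printed for shard 0. The helper `shard_sizes` is hypothetical and is not part of the `datasets` library.

```python
def shard_sizes(num_rows: int, num_shards: int) -> list[int]:
    """Row count of each shard when rows are spread as evenly as possible."""
    base, extra = divmod(num_rows, num_shards)
    # The first `extra` shards each receive one additional row.
    return [base + (1 if i < extra else 0) for i in range(num_shards)]


sizes = shard_sizes(2_000_809, 10)
print(sizes[0])    # size of the shard selected with index=0
print(sum(sizes))  # all shards together cover every row
```

Under this assumption, nine of the ten shards hold 200081 rows and the last one holds 200080.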


In [37]:
from transformers import AutoTokenizer

model_path = "upstage/SOLAR-10.7B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

In [38]:
tokenizer.tokenize("Hi my name is Bye")

['▁Hi', '▁my', '▁name', '▁is', '▁By', 'e']

In [39]:
def tokenization(example):
  # Tokenize the text
  tokens = tokenizer.tokenize(example['text'])
  # Convert tokens to IDs
  token_ids = tokenizer.convert_tokens_to_ids(tokens)

  # Add <bos> and <eos> token IDs to the front and back of token_ids
  # bos: beginning of sequence, eos: end of sequence
  token_ids = (
      [tokenizer.bos_token_id]
      + token_ids
      + [tokenizer.eos_token_id]
  )

  example['input_ids'] = token_ids
  # This column is used to count the total number of tokens
  # in the final dataset
  example["num_tokens"] = len(token_ids)
  return example
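
To make the BOS/EOS step concrete without loading a real tokenizer, here is a small sketch. The vocabulary `TOY_VOCAB`, its IDs, and the helper `wrap_with_special_tokens` are all hypothetical stand-ins for illustration; the real IDs come from `tokenizer.convert_tokens_to_ids`, `tokenizer.bos_token_id`, and `tokenizer.eos_token_id`.

```python
# Hypothetical toy vocabulary; real IDs come from the tokenizer.
TOY_VOCAB = {"<s>": 1, "</s>": 2, "▁Hi": 10, "▁my": 11, "▁name": 12}
BOS_ID, EOS_ID = TOY_VOCAB["<s>"], TOY_VOCAB["</s>"]


def wrap_with_special_tokens(tokens):
    """Map tokens to IDs, then add BOS at the front and EOS at the back."""
    ids = [TOY_VOCAB[t] for t in tokens]
    return [BOS_ID] + ids + [EOS_ID]


wrapped = wrap_with_special_tokens(["▁Hi", "▁my", "▁name"])
print(wrapped)       # [1, 10, 11, 12, 2]
print(len(wrapped))  # 5: three text tokens plus BOS and EOS
```

Each example therefore contributes two more tokens to `num_tokens` than its text alone would.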

In [40]:
dataset = dataset.map(tokenization, load_from_cache_file=False)
print(dataset)

Dataset({
    features: ['text', 'input_ids', 'num_tokens'],
    num_rows: 200081
})


In [41]:
sample = dataset[0]

print(f"Text: {sample['text']}")
print(f"Input IDs: {sample['input_ids']}")
print("\nnum_tokens", sample["num_tokens"])

Text: One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.
Input IDs: [1, 2387, 1370, 28725, 264, 1628, 2746, 5160, 25492, 1419, 264, 25710, 297, 559, 2003, 28723, 985, 2580, 378, 403, 3796, 298, 1156, 395, 378, 1096, 378, 403, 10227, 28723, 25492, 2613, 298, 4098, 272, 25710, 395, 559, 1948, 28725, 579, 630, 829, 427, 28727, 264, 6261, 356, 559, 11

Check the total number of tokens in the dataset:

In [42]:
import numpy as np
np.sum(dataset['num_tokens'])

np.int64(48148950)

###2. Packing the data

Concatenate all input ids for all examples into a single list:

In [47]:
input_ids = np.concatenate(dataset['input_ids'])
print(len(input_ids))

48148950


In [50]:
max_seq_length = 32

In [51]:
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(total_length)

48148928


Discard the extra tokens at the end of the list (22 here) so that the number of tokens is exactly divisible by `max_seq_length`:

In [52]:
input_ids = input_ids[:total_length]
print(input_ids.shape)

(48148928,)


In [53]:
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape

(1504654, 32)
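
The concatenate, truncate, and reshape steps above can be sketched on a toy input in plain Python. The function `pack` and the sample sequences are hypothetical and exist only for illustration; the notebook itself does the same thing with NumPy.

```python
def pack(sequences, max_seq_length):
    """Concatenate sequences, drop the remainder, and cut into fixed-length rows."""
    # 1. Concatenate every sequence into one flat list of token IDs.
    flat = [token_id for seq in sequences for token_id in seq]
    # 2. Keep the largest prefix whose length is a multiple of max_seq_length.
    total_length = len(flat) - len(flat) % max_seq_length
    flat = flat[:total_length]
    # 3. Slice the flat list into rows of exactly max_seq_length tokens.
    return [flat[i:i + max_seq_length] for i in range(0, total_length, max_seq_length)]


# Three toy examples, each already wrapped in BOS (1) and EOS (2).
toy = [[1, 5, 6, 2], [1, 7, 2], [1, 8, 9, 10, 2]]
print(pack(toy, 4))  # [[1, 5, 6, 2], [1, 7, 2, 1], [8, 9, 10, 2]]
print(pack(toy, 5))  # [[1, 5, 6, 2, 1], [7, 2, 1, 8, 9]]

# The same arithmetic on the real token count from this notebook.
print(48148950 - 48148950 % 32)  # 48148928
```

Note that a packed row can span the boundary between two examples, as the second row of the first output shows. The BOS and EOS tokens are what mark where one example ends and the next begins.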

In [54]:
type(input_ids_reshaped)

numpy.ndarray

####Convert to Hugging Face Dataset

In [57]:
input_ids_list = input_ids_reshaped.tolist()

packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {'input_ids':input_ids_list}
)

print(packaged_pretrain_dataset)

Dataset({
    features: ['input_ids'],
    num_rows: 1504654
})


Save the packed dataset:

In [58]:
packaged_pretrain_dataset.to_parquet("drive/MyDrive/data/packaged_pretrain_dataset.parquet")

198614328