### Loading data example

This notebook covers:

1. How to load data into a `Dataset` using the Hugging Face `datasets` library
2. How to turn a `Dataset` into a `DataLoader`

In [None]:
from tqdm import tqdm
from transformers import AutoTokenizer

In [None]:
max_length = 128
batch_size = 8

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", truncation=True)

### Load text data into Dataset

Load text data using the `datasets` library.

In [None]:
filename = "./examples/data/text_forward.txt"

In [None]:
from datasets import load_dataset

In [None]:
# load data into Dataset
ds = load_dataset("text", data_files=filename)

# tokenize
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=max_length)

# Dataset with correct format
encoded_dataset = ds.map(preprocess_function, batched=True)

In [None]:
# ds['train'][:5] # uncomment to check
# encoded_dataset['train'][:5] # uncomment to check

In [None]:
ds_train = encoded_dataset['train']

In the `newlm` project, we merge several pieces of data into one example so that its length is close to `max_len`, which also minimizes padding.
<br>See: https://github.com/madenindya/newlm/blob/main/newlm/lm/bert/lm_builder.py#L120
<br>Note: In this implementation, we do not allow truncation in the middle of a sentence.
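The merging idea can be sketched in plain Python. This is only an illustration, not the `newlm` code: the function name `pack_sentences` is hypothetical, special tokens are ignored, and skipping sentences longer than `max_len` is an assumption made here so that nothing is ever cut mid-sentence.

```python
def pack_sentences(tokenized, max_len):
    """Greedily merge consecutive token lists into chunks of at most max_len tokens.

    Hypothetical helper for illustration; a sentence is never split.
    """
    chunks, current = [], []
    for tokens in tokenized:
        if len(tokens) > max_len:
            # Assumption: a sentence that cannot fit is skipped rather than truncated.
            continue
        if current and len(current) + len(tokens) > max_len:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = []
        current = current + tokens
    if current:
        chunks.append(current)
    return chunks


# Toy token ids standing in for tokenized sentences
packed = pack_sentences([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_len=5)
```

Each resulting chunk is as full as the greedy strategy allows, so fewer padding tokens are needed when the chunks are batched.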

### Dataset into DataLoader

Before we can train on the data, it usually needs to be wrapped in a DataLoader, which is an iterable that yields batches.
See: https://pytorch.org/docs/stable/data.html
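As a minimal sketch of what "iterable" means here, the toy example below wraps a list of equal-length tensors in a `DataLoader` and loops over it. The names `toy_data` and `toy_loader` are made up for illustration and are not part of this notebook's data.

```python
import torch
from torch.utils.data import DataLoader

# Six toy examples, each a tensor of length 2
toy_data = [torch.tensor([i, i + 1]) for i in range(6)]

# The default collate function stacks equal-length tensors into one batch tensor
toy_loader = DataLoader(toy_data, batch_size=4, shuffle=False)

# Iterating yields one batch at a time; the last batch may be smaller
shapes = [tuple(batch.shape) for batch in toy_loader]
```

Real tokenized text has varying lengths, so the default stacking is not enough on its own, which is why a collator is needed.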

1. Data Collator

PyTorch needs you to pass a `collate_fn` that turns a list of examples into a batch. Fortunately, Hugging Face already provides several collators that we can easily use.
See: https://huggingface.co/docs/transformers/main_classes/data_collator

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,  # if False, the collator prepares causal LM (GPT-style) labels instead
    mlm_probability=0.15,
)

You can create your own `collate_fn`, but make sure the output format is the same.
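As a rough sketch of a hand-written `collate_fn`, the function below pads a list of `{'input_ids': [...]}` dicts and returns a dict of tensors with `input_ids`, `attention_mask`, and `labels`. It is a simplification, not the Hugging Face implementation: the name `simple_mlm_collate` is hypothetical, the token ids are assumed to match `bert-base-uncased`, and every selected position is replaced by the mask token (no random or unchanged replacements).

```python
import torch

PAD_ID = 0                   # assumed [PAD] id for bert-base-uncased
MASK_ID = 103                # assumed [MASK] id for bert-base-uncased
SPECIAL_IDS = {0, 101, 102}  # assumed [PAD], [CLS], [SEP] ids; never masked


def simple_mlm_collate(features, mlm_probability=0.15, generator=None):
    """Pad to the longest example in the batch and apply simplified MLM masking."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids = torch.full((len(features), max_len), PAD_ID, dtype=torch.long)
    attention_mask = torch.zeros((len(features), max_len), dtype=torch.long)
    for i, f in enumerate(features):
        ids = torch.tensor(f["input_ids"], dtype=torch.long)
        input_ids[i, : len(ids)] = ids
        attention_mask[i, : len(ids)] = 1

    # Sample positions to mask, excluding special tokens and padding
    probs = torch.full(input_ids.shape, mlm_probability)
    special = torch.zeros_like(input_ids, dtype=torch.bool)
    for sid in SPECIAL_IDS:
        special |= input_ids == sid
    probs.masked_fill_(special, 0.0)
    masked = torch.bernoulli(probs, generator=generator).bool()

    # Labels keep the original id at masked positions and -100 elsewhere
    labels = input_ids.clone()
    labels[~masked] = -100
    input_ids[masked] = MASK_ID
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Two toy examples with arbitrary token ids between [CLS] and [SEP]
features = [{"input_ids": [101, 11, 12, 102]}, {"input_ids": [101, 13, 102]}]
batch = simple_mlm_collate(features, mlm_probability=0.5)
```

The value `-100` is used for positions that should not contribute to the loss, so a model computing cross-entropy over `labels` only scores the masked tokens.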

2. Get DataLoader

The Hugging Face Trainer has an internal function that creates the DataLoader, so we only need to pass it the data and its collator.

In [None]:
from transformers import TrainingArguments, Trainer, BertModel, BertConfig

cfg = BertConfig()
model = BertModel(cfg)

# Trainer is used here only as a helper to build the DataLoader
args = TrainingArguments(output_dir="tmpout", per_device_train_batch_size=batch_size)
trainer = Trainer(model=model, args=args, data_collator=data_collator, train_dataset=ds_train)

dl = trainer.get_train_dataloader() # called internally
batch = next(iter(dl))
# batch # uncomment to check

I recommend copying the implementation (see: https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/trainer.py#L587) and simply modifying it.

But if you need to code it manually, here is a brief example.

In [None]:
# keep only the field the collator needs
ds_train_man = [{'input_ids': x['input_ids']} for x in ds_train]
# ds_train_man[:10] # uncomment to check

A very basic DataLoader implementation:

In [None]:
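from torch.utils.data import DataLoader
from torch.utils.data.sampler import RandomSampler, SequentialSampler

# shuffle the training data; evaluation data is usually read in order
train_sampler = RandomSampler(ds_train_man)
# eval_sampler = SequentialSampler(ds_train_man)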

dl = DataLoader(
    ds_train_man,
    batch_size=batch_size,
    sampler=train_sampler,
    collate_fn=data_collator,
)

In [None]:
batch = next(iter(dl))
# batch # uncomment to check

Once your data is in DataLoader format, you can easily pass it to the PyTorch Lightning Trainer: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html