a question, thank you for your reply #15

Closed
zhouyuan888888 opened this issue May 9, 2024 · 3 comments

@zhouyuan888888

Hi, thank you for your nice work. I have a question about training: if I want to train your model on the pile-uncopyrighted dataset (i.e., just the uncopyrighted Pile), how should I prepare or preprocess the dataset?

@simran-arora
Collaborator

Hi!
The option we used was to follow the EleutherAI tokenization instructions ("prepare data") provided here: https://github.com/EleutherAI/gpt-neox
We then plugged in the file paths for the resulting train, val, and test splits here:

"train": ["/var/cr06_data/sim_data/pile/pile/pile_text_document"],

Our codebase also supports tokenization for any Hugging Face dataset without any additional effort on your part, e.g. if you want to try FineWeb, SlimPajama, etc.
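As a rough illustration of that route, here is a generic `datasets`/`transformers` sketch (not this repo's built-in pipeline; the dataset id `monology/pile-uncopyrighted` and the GPT-2 tokenizer are assumptions):

```python
# Generic Hugging Face tokenization sketch; not this repo's built-in
# pipeline. The dataset id and tokenizer choice are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the uncopyrighted-Pile mirror so nothing is downloaded up front.
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    # Pile records keep the raw document under the "text" field.
    return tokenizer(batch["text"])

tokenized = ds.map(tokenize, batched=True, remove_columns=["text"])

# Peek at the first tokenized document.
print(next(iter(tokenized))["input_ids"][:10])
```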

@zhouyuan888888
Author

Thank you for your timely reply!

@lisc110h

I am currently trying to train a model from scratch using the Pile dataset. I would like to add that it was necessary to run `make` in `based/train/datamodules/neox_utils` to compile `helpers.cpp`.
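For anyone scripting the setup, a minimal sketch of that build step (it assumes `make` is on your PATH and that the repository root is the working directory):

```python
# Minimal sketch: run make in the neox utils directory to compile the
# data helpers. Assumes make is installed and the repo root is the cwd.
import subprocess

subprocess.run(
    ["make"],
    cwd="based/train/datamodules/neox_utils",
    check=True,  # raise CalledProcessError if the build fails
)
```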
