# Multi-Domain Expert Learning (MDEL)
## Environment Setup

To set up the development environment, run `make setup_dev`. This will set up the pre-commit hooks.
## Creating an Expert Dataset

First, make sure you have followed the Environment Setup guidelines above.

To create an expert dataset using the Pile data, follow these steps (an end-to-end sketch is given after the list):

- Download the Pile shard 1 data: `./scripts/get_pile_shard1_data.sh`
- To set the domain, edit the variable `SUBSET_NAME` in `scripts/create_domain_pile_mix.sh`. It should be set to a valid value of the Pile's `pile_set_name` field; a list of valid values can be found below.
- Run the above script to process the dataset.
- Authenticate with Hugging Face: `export HF_ACCESS_TOKEN={YOUR HUGGINGFACE TOKEN}`
- Set the dataset name in `scripts/upload_to_hf.sh`.
- Run the above script to upload the processed dataset to the Hugging Face Hub.
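The heavy lifting in `scripts/create_domain_pile_mix.sh` is selecting the records whose `pile_set_name` matches the chosen domain. As a rough illustration only (the script's actual logic, file paths, and output format may differ), the filtering step looks something like the following, assuming the shard is JSON Lines with the Pile's usual `meta.pile_set_name` field:

```python
# Hypothetical sketch of the domain-filtering step; the real
# create_domain_pile_mix.sh may differ in paths and output format.
from datasets import load_dataset

SUBSET_NAME = "ArXiv"  # must be one of the pile_set_name values listed below

# Pile shards are JSON Lines; each record carries its subset label
# under meta.pile_set_name. The shard path here is a placeholder.
pile = load_dataset("json", data_files="pile/01.jsonl", split="train")
domain = pile.filter(lambda ex: ex["meta"]["pile_set_name"] == SUBSET_NAME)

domain.save_to_disk(f"{SUBSET_NAME.lower()}_domain")
# upload_to_hf.sh then pushes the processed dataset, conceptually:
# domain.push_to_hub("your_hf_username/your_dataset_name")
```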
### Valid `pile_set_name` Values

- Pile-CC
- PubMed Central
- Books3†
- OpenWebText2
- ArXiv
- Github
- FreeLaw
- Stack Exchange
- USPTO Backgrounds
- PubMed Abstracts
- Gutenberg (PG-19)†
- OpenSubtitles†
- Wikipedia (en)†
- DM Mathematics†
- Ubuntu IRC
- BookCorpus2
- EuroParl†
- HackerNews
- YoutubeSubtitles
- PhilPapers
- NIH ExPorter
- Enron Emails†
## Training an Expert

- Clone this repo and follow the Environment Setup instructions.
- Set up Hugging Face authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
- Set up Weights & Biases authentication: `export WANDB_API_KEY=[FILL ME]`
- Edit the variable `DATASET` in the script `src/mdel/train.sh` to match a valid dataset name on the MDEL Hugging Face organization (a quick way to list the valid names is sketched after this list).
- Run the above script in background mode to start the training: `./train.sh &`
- The trained model should be uploaded to the MDEL Hugging Face organization.
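Before launching the run, it can help to confirm that `DATASET` names a dataset that actually exists on the Hub. A small check, assuming the MDEL organization is `Multi-Domain-Expert-Layers` (the name used in the perplexity example below):

```python
# List datasets under the MDEL organization to find valid DATASET values.
from huggingface_hub import HfApi

api = HfApi()
for ds in api.list_datasets(author="Multi-Domain-Expert-Layers"):
    print(ds.id)  # e.g. Multi-Domain-Expert-Layers/arxiv
```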
## Merging Experts

- Clone this repo and follow the Environment Setup instructions.
- Set up Hugging Face authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
- Run the merge script:

```bash
python src/mdel/merge_experts.py \
    --hf-repo your_hf_username/desired_name_of_merged_model \
    -e mdel/expert_1 \
    -e mdel/expert_2 \
    -e mdel/expert_n
```
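Conceptually, the script combines the listed expert checkpoints into a single model pushed to `--hf-repo`. One common way to do this is element-wise averaging of the experts' weights; whether `merge_experts.py` averages all layers or only selected ones is not shown here, so treat the following as an illustrative sketch rather than the script's exact algorithm:

```python
# Illustrative merge-by-averaging sketch (assumption: simple element-wise
# weight averaging; the real merge_experts.py may merge only some layers).
import torch
from transformers import AutoModelForCausalLM

expert_ids = ["mdel/expert_1", "mdel/expert_2"]  # placeholder repo ids

models = [AutoModelForCausalLM.from_pretrained(eid) for eid in expert_ids]
merged = models[0]

with torch.no_grad():
    for name, param in merged.named_parameters():
        stacked = torch.stack([m.state_dict()[name] for m in models])
        param.copy_(stacked.mean(dim=0))  # average the experts' weights

merged.push_to_hub("your_hf_username/desired_name_of_merged_model")
```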
## Evaluating Perplexity

- Clone this repo and follow the Environment Setup instructions.
- Set up Hugging Face authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
- Run the perplexity script:

```bash
python3 src/mdel/calculate_perplexity.py \
    --model Multi-Domain-Expert-Layers/expert-arxiv \
    --dataset Multi-Domain-Expert-Layers/arxiv \
    --split validation_domain
```
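Perplexity here is the exponential of the mean per-token negative log-likelihood of the model on the given split. A minimal sketch of that computation (assumptions: the dataset has a `text` column, and the real script's batching and striding may differ):

```python
# Minimal perplexity sketch: exp(total NLL / total predicted tokens).
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

data = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation_domain")

total_nll, n_tokens = 0.0, 0
with torch.no_grad():
    for text in data["text"][:100]:  # subsample for a quick estimate
        ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
        if ids.size(1) < 2:
            continue  # need at least one predicted token
        loss = model(input_ids=ids, labels=ids).loss  # mean NLL per token
        total_nll += loss.item() * (ids.size(1) - 1)
        n_tokens += ids.size(1) - 1

print("perplexity:", math.exp(total_nll / n_tokens))
```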
## Citation

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, C. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.