# Multi-Domain Expert Learning (MDEL)
## Environment Setup

To set up the development environment, run `make setup_dev`. This will set up the pre-commit hooks.
## Creating an Expert Dataset

First, make sure you have followed the Environment Setup guidelines above.

To create an expert dataset using the Pile data, follow these steps (an end-to-end sketch is given after the list):

- Download the Pile shard 1 data: `./scripts/get_pile_shard1_data.sh`
- To set the domain, edit the variable `SUBSET_NAME` in `scripts/create_domain_pile_mix.sh`. It should be set to a valid value of the Pile's `pile_set_name` field; a list of valid values can be found below.
- Run the above script to process the dataset.
- Authenticate with Hugging Face: `export HF_ACCESS_TOKEN={YOUR HUGGINGFACE TOKEN}`
- Set the dataset name in `scripts/upload_to_hf.sh`.
- Run the above script to upload the processed dataset to the Hugging Face Hub.
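The heavy lifting in `scripts/create_domain_pile_mix.sh` is selecting the records whose `pile_set_name` matches the chosen domain. As a rough illustration only (the script's actual logic, file paths, and output format may differ), the filtering step looks something like the following, assuming the shard is JSON Lines with the Pile's usual `meta.pile_set_name` field:

```python
# Hypothetical sketch of the domain-filtering step; the real
# create_domain_pile_mix.sh may differ in paths and output format.
from datasets import load_dataset

SUBSET_NAME = "ArXiv"  # must be one of the pile_set_name values listed below

# Pile shards are JSON Lines; each record carries its subset label
# under meta.pile_set_name. The shard path here is a placeholder.
pile = load_dataset("json", data_files="pile/01.jsonl", split="train")
domain = pile.filter(lambda ex: ex["meta"]["pile_set_name"] == SUBSET_NAME)

domain.save_to_disk(f"{SUBSET_NAME.lower()}_domain")
# upload_to_hf.sh then pushes the processed dataset, conceptually:
# domain.push_to_hub("your_hf_username/your_dataset_name")
```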
### Valid `pile_set_name` Values

- Pile-CC
- PubMed Central
- Books3†
- OpenWebText2
- ArXiv
- Github
- FreeLaw
- Stack Exchange
- USPTO Backgrounds
- PubMed Abstracts
- Gutenberg (PG-19)†
- OpenSubtitles†
- Wikipedia (en)†
- DM Mathematics†
- Ubuntu IRC
- BookCorpus2
- EuroParl†
- HackerNews
- YoutubeSubtitles
- PhilPapers
- NIH ExPorter
- Enron Emails†
## Training an Expert

- Clone this repo and follow the Environment Setup instructions.
- Set up Hugging Face authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
- Set up Weights & Biases authentication: `export WANDB_API_KEY=[FILL ME]`
- Edit the variable `DATASET` in the script `src/mdel/train.sh` to match a valid dataset name on the MDEL Hugging Face organization (a quick way to list the valid names is sketched after this list).
- Run the above script in background mode to start the training: `./train.sh &`
- The trained model should be uploaded to the MDEL Hugging Face organization.
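Before launching the run, it can help to confirm that `DATASET` names a dataset that actually exists on the Hub. A small check, assuming the MDEL organization is `Multi-Domain-Expert-Layers` (the name used in the perplexity example below):

```python
# List datasets under the MDEL organization to find valid DATASET values.
from huggingface_hub import HfApi

api = HfApi()
for ds in api.list_datasets(author="Multi-Domain-Expert-Layers"):
    print(ds.id)  # e.g. Multi-Domain-Expert-Layers/arxiv
```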
## Merging Experts

- Clone this repo and follow the Environment Setup instructions.
- Set up Hugging Face authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
- Run the merge script:

```bash
python src/mdel/merge_experts.py \
    --hf-repo your_hf_username/desired_name_of_merged_model \
    -e mdel/expert_1 \
    -e mdel/expert_2 \
    -e mdel/expert_n
```
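Conceptually, the script combines the listed expert checkpoints into a single model pushed to `--hf-repo`. One common way to do this is element-wise averaging of the experts' weights; whether `merge_experts.py` averages all layers or only selected ones is not shown here, so treat the following as an illustrative sketch rather than the script's exact algorithm:

```python
# Illustrative merge-by-averaging sketch (assumption: simple element-wise
# weight averaging; the real merge_experts.py may merge only some layers).
import torch
from transformers import AutoModelForCausalLM

expert_ids = ["mdel/expert_1", "mdel/expert_2"]  # placeholder repo ids

models = [AutoModelForCausalLM.from_pretrained(eid) for eid in expert_ids]
merged = models[0]

with torch.no_grad():
    for name, param in merged.named_parameters():
        stacked = torch.stack([m.state_dict()[name] for m in models])
        param.copy_(stacked.mean(dim=0))  # average the experts' weights

merged.push_to_hub("your_hf_username/desired_name_of_merged_model")
```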
## Evaluating Perplexity

- Clone this repo and follow the Environment Setup instructions.
- Set up Hugging Face authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
- Run the perplexity script:

```bash
python3 src/mdel/calculate_perplexity.py \
    --model Multi-Domain-Expert-Layers/expert-arxiv \
    --dataset Multi-Domain-Expert-Layers/arxiv \
    --split validation_domain
```
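Perplexity here is the exponential of the mean per-token negative log-likelihood of the model on the given split. A minimal sketch of that computation (assumptions: the dataset has a `text` column, and the real script's batching and striding may differ):

```python
# Minimal perplexity sketch: exp(total NLL / total predicted tokens).
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

data = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation_domain")

total_nll, n_tokens = 0.0, 0
with torch.no_grad():
    for text in data["text"][:100]:  # subsample for a quick estimate
        ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
        if ids.size(1) < 2:
            continue  # need at least one predicted token
        loss = model(input_ids=ids, labels=ids).loss  # mean NLL per token
        total_nll += loss.item() * (ids.size(1) - 1)
        n_tokens += ids.size(1) - 1

print("perplexity:", math.exp(total_nll / n_tokens))
```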
## Citation

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, C. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.