Code for the paper "Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models" (accepted to EMNLP 2022). The arXiv version is available at https://arxiv.org/abs/2210.10289
Create a virtual environment if needed:
python3 -m venv .venv
source .venv/bin/activate
Install from PyPI (https://pypi.org/project/nlp.lmd/):
pip install nlp.lmd
Or install from source:
git clone git@github.com:haozhg/lmd.git
cd lmd
pip install -e .
To use the lmd CLI, run lmd --help (or, equivalently, python -m lmd.cli --help):
$ lmd --help
usage: Language Model Decomposition [-h] [--target TARGET] [--basis BASIS]
                                    [--tokenizer-name TOKENIZER_NAME]
                                    [--max-seq-length MAX_SEQ_LENGTH]
                                    [--batch-size BATCH_SIZE]
                                    [--dataset-name DATASET_NAME]
                                    [--dataset-config-name DATASET_CONFIG_NAME]
                                    [--val-split-percentage VAL_SPLIT_PERCENTAGE]
                                    [--test-split-percentage TEST_SPLIT_PERCENTAGE]
                                    [--max-train-samples MAX_TRAIN_SAMPLES]
                                    [--max-val-samples MAX_VAL_SAMPLES]
                                    [--max-test-samples MAX_TEST_SAMPLES]
                                    [--preprocessing-num-workers PREPROCESSING_NUM_WORKERS]
                                    [--overwrite_cache OVERWRITE_CACHE]
                                    [--preprocess-dir PREPROCESS_DIR]
                                    [--embedding-dir EMBEDDING_DIR]
                                    [--results-dir RESULTS_DIR]
                                    [--models-dir MODELS_DIR] [--alpha ALPHA]
                                    [--log-level LOG_LEVEL]
                                    [--try-models TRY_MODELS]
                                    [--pre-select-multiplier PRE_SELECT_MULTIPLIER]
                                    [--seed SEED]
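As a minimal sketch, an invocation could look like the one below. Only the flag names come from the help output above; the model names, the comma-separated format for --basis, and the dataset choice are illustrative assumptions, so check lmd --help and the paper for the values actually supported.

# Sketch: decompose a target LM using two basis LMs on a HuggingFace dataset.
# Model names and the --basis list format are assumptions for illustration.
lmd \
  --target bert-base-uncased \
  --basis roberta-base,albert-base-v2 \
  --dataset-name wikitext \
  --dataset-config-name wikitext-103-raw-v1 \
  --max-seq-length 128 \
  --batch-size 32 \
  --seed 42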
To reproduce the results in Appendix B of the paper, run bash scripts/run.sh. The results are also stored in results/128k.
If you find this paper/code useful, please cite us:
@misc{zhang2022lmd,
  doi = {10.48550/ARXIV.2210.10289},
  url = {https://arxiv.org/abs/2210.10289},
  author = {Zhang, Hao},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, I.2.7, 68T50 (Primary) 68T30, 68T07 (Secondary)},
  title = {Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}