
Evaluate a merged expert model's perplexity #53

mrcabbage972 opened this issue May 13, 2023 · 3 comments
mrcabbage972 commented May 13, 2023

The goal is to compute perplexity for a few models:

  1. A model that is a weighted average of several expert models
  2. A baseline model fine-tuned on the union of the experts' datasets
  3. Same as (2), but fine-tuning only the layers that were tuned for the experts (layers 9-13)

The model in (1) can be created using the script in this PR; a rough sketch of the weight averaging also follows the expert list below. The list of experts is:

  1. ArXiv
  2. Pubmed Central
  3. Pubmed Abstracts
  4. USPTO
  5. Philpapers
  6. Github
  7. Freelaw
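
The actual merging code lives in the linked PR; purely as an illustration, assuming all experts share the Pythia-1b architecture and uniform weights, the parameter averaging could look like the sketch below (the repo list is truncated to three of the seven experts for brevity).

```python
# Rough sketch of creating model (1) by weighted-averaging expert parameters.
# NOT the script from the linked PR; the repo list is truncated and uniform
# weights are an assumption.
from transformers import AutoModelForCausalLM

experts = [
    "Multi-Domain-Expert-Layers/expert-arxiv",
    "Multi-Domain-Expert-Layers/expert-freelaw",
    "Multi-Domain-Expert-Layers/expert-github",
]
weights = [1.0 / len(experts)] * len(experts)  # uniform average assumed

merged = AutoModelForCausalLM.from_pretrained(experts[0])
state = merged.state_dict()

# Scale the first expert's floating-point tensors; integer/bool buffers stay as-is.
for k, v in state.items():
    if v.is_floating_point():
        state[k] = v * weights[0]

# Accumulate the remaining experts into the running weighted sum.
for repo, w in zip(experts[1:], weights[1:]):
    other = AutoModelForCausalLM.from_pretrained(repo).state_dict()
    for k, v in other.items():
        if v.is_floating_point():
            state[k] += w * v

merged.load_state_dict(state)
merged.save_pretrained("merged-expert")
```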

The models in (2) and (3) should be prepared in the following issue.

The evaluation should be done on the evaluation fold of each expert's dataset, excluding the Pile part of it. The datasets are at MDEL HF. Perplexity can be computed with the Hugging Face Trainer's evaluate() method (see example here); a sketch follows.
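For a causal LM, perplexity is just exp(eval_loss), so a minimal sketch (assuming the MDEL repo ids and the max_length=1024 / "text" settings that appear in the results later in this thread) could be:

```python
# Minimal sketch of perplexity evaluation via Trainer.evaluate();
# model/dataset ids and settings are taken from the results in this thread.
import math

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize the domain validation split (the "text" column, truncated to 1024 tokens).
ds = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation_domain")
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=ds.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppl-eval", per_device_eval_batch_size=4),
    # mlm=False makes the collator copy input_ids into labels for causal-LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
metrics = trainer.evaluate(eval_dataset=ds)
print("perplexity:", math.exp(metrics["eval_loss"]))
```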

The deliverables of this issue should be:

  1. The perplexity numbers for each model
  2. A script to reproduce the result

mrcabbage972 commented May 14, 2023

@ontocord Can you please review the description of this issue? This is an important one, so I'd like to make sure we're aligned on the details.

@Stillerman

The perplexity script was run on all combinations of:
Model: expert-arxiv, expert-freelaw, expert-github, EleutherAI/pythia-1b-deduped
Dataset: arxiv, freelaw, github
Split: train, validation_domain, validation_pile
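
For reference, a sweep over those combinations might look like the sketch below. This is an illustration rather than the exact script that was run, and compute_perplexity is a hypothetical helper along the lines of the Trainer-based snippet earlier in the thread.

```python
# Hypothetical sweep over all model/dataset/split combinations (4 x 3 x 3 = 36 runs),
# appending one JSON record per run. compute_perplexity is a stand-in helper,
# not a real API, and the output filename is an assumption.
import itertools
import json
import time

models = [
    "Multi-Domain-Expert-Layers/expert-arxiv",
    "Multi-Domain-Expert-Layers/expert-freelaw",
    "Multi-Domain-Expert-Layers/expert-github",
    "EleutherAI/pythia-1b-deduped",
]
datasets = [f"Multi-Domain-Expert-Layers/{d}" for d in ("arxiv", "freelaw", "github")]
splits = ["train", "validation_domain", "validation_pile"]

with open("results.jsonl", "a") as out:
    for model, dataset, split in itertools.product(models, datasets, splits):
        start = time.time()
        ppl = compute_perplexity(model, dataset, split, max_length=1024)  # hypothetical
        out.write(json.dumps({
            "date": time.time(),
            "runtime": round(time.time() - start, 4),
            "model": model,
            "tokenizer": model,
            "dataset": dataset,
            "split": split,
            "max_length": 1024,
            "dataset_key": "text",
            "perplexity": ppl,
        }) + "\n")
```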

I will post the full results of the 36 experiments below, but first, to highlight the confusing parts:
[Screenshot 2023-05-17 11:03 AM: table of perplexity results]

Perplexity on the training set is higher for the expert than for the base model. When I looked in WandB I found a run titled pythia-1b-deduped-arxiv; assuming that is the run that produced this model, the loss was going down, as you can see in this chart:

[Screenshot 2023-05-17 11:06 AM: WandB training-loss curve for pythia-1b-deduped-arxiv]

All of the datasets showed the expert having higher perplexity than the base model on both the train and validation_domain splits.

distilgpt2 had very high perplexity on the datasets as expected.

Here are the complete results:

{"date": 1684281437.6640549, "runtime": 58.9195, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.1806483251580175}
{"date": 1684281438.6176283, "runtime": 59.0754, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.589675635229376}
{"date": 1684281442.473437, "runtime": 59.5134, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.7452095055781855}
{"date": 1684281445.6582773, "runtime": 59.3279, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.255214207031588}
{"date": 1684281446.2353806, "runtime": 59.2127, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.366141232556662}
{"date": 1684281446.580657, "runtime": 63.7466, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.561219412349417}
{"date": 1684281450.5335333, "runtime": 62.0645, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.331658572574814}
{"date": 1684281465.1216207, "runtime": 78.9296, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.827975770776606}
{"date": 1684281498.398999, "runtime": 118.9044, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.036974293841685}
{"date": 1684281501.5622356, "runtime": 118.0725, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.012206926101761}
{"date": 1684281503.5901659, "runtime": 118.9363, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.075443609481571}
{"date": 1684281504.2154381, "runtime": 119.622, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 6.14366348878873}
{"date": 1684281505.1637416, "runtime": 118.9924, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.332868620596422}
{"date": 1684281507.152357, "runtime": 119.2627, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.367354075345387}
{"date": 1684281516.2976973, "runtime": 133.1081, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.18129673498018}
{"date": 1684281521.0476763, "runtime": 136.0308, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.25646632845322}
{"date": 1684281792.4260871, "runtime": 412.6193, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 5.603551498049122}
{"date": 1684281796.6836188, "runtime": 416.5038, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 5.679911295033531}
{"date": 1684281798.704142, "runtime": 418.6623, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 5.672741968171792}
{"date": 1684281801.901714, "runtime": 415.2171, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_domain", "max_length": 1024, "dataset_key": "text", "perplexity": 5.675035573566157}
{"date": 1684282123.3902223, "runtime": 742.2889, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.301404276401705}
{"date": 1684282128.2113533, "runtime": 745.8973, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.335512205839025}
{"date": 1684282135.2589061, "runtime": 749.2396, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.226164572185025}
{"date": 1684282153.224812, "runtime": 766.9294, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/github", "split": "validation_pile", "max_length": 1024, "dataset_key": "text", "perplexity": 6.151672905395438}
{"date": 1684284992.2658331, "runtime": 228.1556, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/github", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 5.625481166685019}
{"date": 1684284993.4468331, "runtime": 229.1428, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/github", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 5.703393393203626}
{"date": 1684284993.8094637, "runtime": 229.3736, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/github", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 5.696060657477567}
{"date": 1684284994.4632788, "runtime": 229.3474, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/github", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 5.698515176536461}
{"date": 1684285044.694415, "runtime": 228.2315, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.013907204773048}
{"date": 1684285045.2155526, "runtime": 229.1972, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.0760585282682245}
{"date": 1684285045.391916, "runtime": 229.2373, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.144929907359411}
{"date": 1684285045.5122395, "runtime": 229.2458, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/freelaw", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.038489375419607}
{"date": 1684285211.8575993, "runtime": 229.2016, "model": "EleutherAI/pythia-1b-deduped", "tokenizer": "EleutherAI/pythia-1b-deduped", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.539700840094572}
{"date": 1684285236.932937, "runtime": 228.7231, "model": "Multi-Domain-Expert-Layers/expert-arxiv", "tokenizer": "Multi-Domain-Expert-Layers/expert-arxiv", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.572989707328704}
{"date": 1684285238.87368, "runtime": 229.1865, "model": "Multi-Domain-Expert-Layers/expert-github", "tokenizer": "Multi-Domain-Expert-Layers/expert-github", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.801440338123199}
{"date": 1684285240.2605352, "runtime": 229.3909, "model": "Multi-Domain-Expert-Layers/expert-freelaw", "tokenizer": "Multi-Domain-Expert-Layers/expert-freelaw", "dataset": "Multi-Domain-Expert-Layers/arxiv", "split": "train", "max_length": 1024, "dataset_key": "text", "perplexity": 6.722750104407284}

@mrcabbage972

Thanks @Stillerman! Let's close this issue?
