Testing grounds for LLMs - Validation Framework #35

Open · arnocandel opened this issue Apr 12, 2023 · 14 comments

Comments

@arnocandel

https://twitter.com/omarsar0/status/1641792530667675648/photo/1

@arnocandel commented Apr 24, 2023

https://github.com/EleutherAI/lm-evaluation-harness has out-of-the-box Hugging Face connectors and seems easiest to use. Some of its tasks, such as ARC-Easy, ARC-Challenge, and PIQA, also appear in the tweet above.
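
For reference, the same run can also be driven from Python instead of the CLI. A minimal sketch, assuming the harness is installed from a local clone (`pip install -e .`) and that this version exposes the `simple_evaluate` entry point with these arguments:

```python
# Minimal sketch: run a few harness tasks against an HF model programmatically.
# Assumes lm-evaluation-harness is installed and a CUDA GPU is available.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b",
    tasks=["arc_easy", "arc_challenge", "piqa"],
    num_fewshot=0,
    device="cuda",
)
print(results["results"])  # per-task acc / acc_norm with stderr
```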

https://arxiv.org/pdf/2303.17564v1.pdf (BloombergGPT)
(benchmark screenshots attached in the original comment)

@arnocandel commented Apr 24, 2023

CUDA_VISIBLE_DEVICES=0 torchrun main.py --model hf-causal --model_args pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-512-6.9b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| boolq         | 1       | acc      | 0.6266 | ± | 0.0085 |
| arc_challenge | 0       | acc      | 0.3225 | ± | 0.0137 |
|               |         | acc_norm | 0.3396 | ± | 0.0138 |
| openbookqa    | 0       | acc      | 0.2660 | ± | 0.0198 |
|               |         | acc_norm | 0.3660 | ± | 0.0216 |
| arc_easy      | 0       | acc      | 0.6776 | ± | 0.0096 |
|               |         | acc_norm | 0.6195 | ± | 0.0100 |
| hellaswag     | 0       | acc      | 0.4822 | ± | 0.0050 |
|               |         | acc_norm | 0.6465 | ± | 0.0048 |
| winogrande    | 0       | acc      | 0.6219 | ± | 0.0136 |
| piqa          | 0       | acc      | 0.7530 | ± | 0.0101 |
|               |         | acc_norm | 0.7606 | ± | 0.0100 |

h2ogpt-oig-oasst1-512-6.9b.eval.log
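
Since several of these logs get compared below, a small helper for pulling the results out of a log file saves copy-paste. This is a sketch under the assumption that the log contains the pipe-formatted table shown above; `parse_eval_log` is a hypothetical helper, not part of the harness:

```python
# Sketch: extract task/metric/value rows from an lm-evaluation-harness log.
# Assumes the log contains the pipe-delimited results table printed by main.py;
# parse_eval_log is illustrative, not a harness API.
import re

def parse_eval_log(path):
    rows, task = [], None
    with open(path) as f:
        for line in f:
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            # Data rows look like: task | version | metric | value | ± | stderr
            if len(cells) == 6 and re.fullmatch(r"0\.\d+", cells[3]):
                task = cells[0] or task  # continuation rows leave the task blank
                rows.append((task, cells[2], float(cells[3]), float(cells[5])))
    return rows

for task, metric, value, stderr in parse_eval_log("h2ogpt-oig-oasst1-512-6.9b.eval.log"):
    print(f"{task:15s} {metric:9s} {value:.4f} ± {stderr:.4f}")
```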

arnocandel changed the title from "Testing grounds for LLMs" to "Testing grounds for LLMs - Validation Framework" on Apr 24, 2023
@arnocandel commented Apr 24, 2023

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-12b.eval.log
CUDA_VISIBLE_DEVICES=1 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-20b.eval.log

@arnocandel commented Apr 25, 2023

h2ogpt-oasst1-512-12b.eval.log

hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| arc_easy      | 0       | acc      | 0.6932 | ± | 0.0095 |
|               |         | acc_norm | 0.6225 | ± | 0.0099 |
| openbookqa    | 0       | acc      | 0.2900 | ± | 0.0203 |
|               |         | acc_norm | 0.3740 | ± | 0.0217 |
| winogrande    | 0       | acc      | 0.6369 | ± | 0.0135 |
| hellaswag     | 0       | acc      | 0.5140 | ± | 0.0050 |
|               |         | acc_norm | 0.6803 | ± | 0.0047 |
| piqa          | 0       | acc      | 0.7682 | ± | 0.0098 |
|               |         | acc_norm | 0.7661 | ± | 0.0099 |
| boolq         | 1       | acc      | 0.6685 | ± | 0.0082 |
| arc_challenge | 0       | acc      | 0.3157 | ± | 0.0136 |
|               |         | acc_norm | 0.3507 | ± | 0.0139 |

h2ogpt-oasst1-512-20b.eval.log

hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-20b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| hellaswag     | 0       | acc      | 0.5419 | ± | 0.0050 |
|               |         | acc_norm | 0.7259 | ± | 0.0045 |
| boolq         | 1       | acc      | 0.7125 | ± | 0.0079 |
| piqa          | 0       | acc      | 0.7742 | ± | 0.0098 |
|               |         | acc_norm | 0.7775 | ± | 0.0097 |
| openbookqa    | 0       | acc      | 0.2800 | ± | 0.0201 |
|               |         | acc_norm | 0.4000 | ± | 0.0219 |
| arc_challenge | 0       | acc      | 0.3993 | ± | 0.0143 |
|               |         | acc_norm | 0.4420 | ± | 0.0145 |
| winogrande    | 0       | acc      | 0.6614 | ± | 0.0133 |
| arc_easy      | 0       | acc      | 0.7327 | ± | 0.0091 |
|               |         | acc_norm | 0.6894 | ± | 0.0095 |

@arnocandel commented Apr 25, 2023

Comparison with https://huggingface.co/databricks/dolly-v2-12b and related checkpoints:
| model                   | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa     | boolq    | gmean    |
|-------------------------|------------|----------|------------|-----------|---------------|----------|----------|----------|
| EleutherAI/pythia-2.8b  | 0.348      | 0.585859 | 0.589582   | 0.591217  | 0.323379      | 0.73395  | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b  | 0.368      | 0.604798 | 0.608524   | 0.631548  | 0.343857      | 0.761153 | 0.6263   | 0.543567 |
| databricks/dolly-v2-3b  | 0.384      | 0.611532 | 0.589582   | 0.650767  | 0.370307      | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b   | 0.364      | 0.627104 | 0.636148   | 0.668094  | 0.346416      | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B     | 0.382      | 0.621633 | 0.651144   | 0.662617  | 0.363481      | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408      | 0.63931  | 0.616417   | 0.707927  | 0.388225      | 0.757889 | 0.568196 | 0.56781  |
| databricks/dolly-v2-7b  | 0.392      | 0.633838 | 0.607735   | 0.686517  | 0.406997      | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b  | 0.41       | 0.62963  | 0.643252   | 0.676758  | 0.384812      | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402      | 0.683923 | 0.656669   | 0.7142    | 0.408703      | 0.784004 | 0.695413 | 0.602236 |
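
The gmean column is the geometric mean of the seven task accuracies, a single-number summary that works because all metrics live on the same 0–1 scale. A minimal check against the pythia-2.8b row above:

```python
# Sketch: geometric mean of per-task accuracies, as in the gmean column.
import math

accs = [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226]
gmean = math.exp(sum(math.log(a) for a in accs) / len(accs))
print(f"{gmean:.6f}")  # ~0.523431, matching the pythia-2.8b row
```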

@arnocandel commented Apr 25, 2023

What the tasks look like:
tasks.zip

Created with:

python scripts/write_out.py --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --num_fewshot 5 --num_examples 10 --output_base_path tasks

@arnocandel commented Apr 25, 2023

Let's see if Dolly v2 12B reproduces their reported numbers with this setup:

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> dolly-v2-12b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| winogrande    | 0       | acc      | 0.6298 | ± | 0.0136 |
| arc_easy      | 0       | acc      | 0.6713 | ± | 0.0096 |
|               |         | acc_norm | 0.6380 | ± | 0.0099 |
| hellaswag     | 0       | acc      | 0.5420 | ± | 0.0050 |
|               |         | acc_norm | 0.7109 | ± | 0.0045 |
| piqa          | 0       | acc      | 0.7399 | ± | 0.0102 |
|               |         | acc_norm | 0.7541 | ± | 0.0100 |
| arc_challenge | 0       | acc      | 0.3618 | ± | 0.0140 |
|               |         | acc_norm | 0.3823 | ± | 0.0142 |
| openbookqa    | 0       | acc      | 0.2980 | ± | 0.0205 |
|               |         | acc_norm | 0.4060 | ± | 0.0220 |
| boolq         | 1       | acc      | 0.5624 | ± | 0.0087 |

Yes, this is consistent with their reported numbers, so the command itself seems reasonable.

@arnocandel

Undertrained older models do indeed perform worse:

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-20b.eval.log

h2ogpt-oig-oasst1-256-20b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| hellaswag     | 0       | acc      | 0.5320 | ± | 0.0050 |
|               |         | acc_norm | 0.7115 | ± | 0.0045 |
| arc_easy      | 0       | acc      | 0.7138 | ± | 0.0093 |
|               |         | acc_norm | 0.6869 | ± | 0.0095 |
| boolq         | 1       | acc      | 0.6878 | ± | 0.0081 |
| piqa          | 0       | acc      | 0.7742 | ± | 0.0098 |
|               |         | acc_norm | 0.7786 | ± | 0.0097 |
| openbookqa    | 0       | acc      | 0.2760 | ± | 0.0200 |
|               |         | acc_norm | 0.3760 | ± | 0.0217 |
| arc_challenge | 0       | acc      | 0.3686 | ± | 0.0141 |
|               |         | acc_norm | 0.3942 | ± | 0.0143 |
| winogrande    | 0       | acc      | 0.6630 | ± | 0.0133 |

@arnocandel

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-12b.eval.log

h2ogpt-oig-oasst1-256-12b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| hellaswag     | 0       | acc      | 0.5189 | ± | 0.0050 |
|               |         | acc_norm | 0.6930 | ± | 0.0046 |
| arc_challenge | 0       | acc      | 0.3276 | ± | 0.0137 |
|               |         | acc_norm | 0.3797 | ± | 0.0142 |
| piqa          | 0       | acc      | 0.7628 | ± | 0.0099 |
|               |         | acc_norm | 0.7720 | ± | 0.0098 |
| winogrande    | 0       | acc      | 0.6527 | ± | 0.0134 |
| boolq         | 1       | acc      | 0.6602 | ± | 0.0083 |
| arc_easy      | 0       | acc      | 0.6999 | ± | 0.0094 |
|               |         | acc_norm | 0.6629 | ± | 0.0097 |
| openbookqa    | 0       | acc      | 0.2940 | ± | 0.0204 |
|               |         | acc_norm | 0.4020 | ± | 0.0219 |

@arnocandel

Using https://github.com/EleutherAI/lm-evaluation-harness at commit 4b701e228768052cfae9043dca13e82052ca5eea, with the following local patch:

diff --git a/lm_eval/models/huggingface.py b/lm_eval/models/huggingface.py
index 4d3aa24..34b6967 100644
--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -76,10 +76,10 @@ class HuggingFaceAutoLM(BaseLM):
         subfolder: Optional[str] = None,
         revision: Optional[str] = "main",
         batch_size: Optional[Union[int, str]] = 1,
-        max_gen_toks: Optional[int] = 256,
+        max_gen_toks: Optional[int] = 512,
         max_length: Optional[int] = None,
         add_special_tokens: Optional[bool] = None,
-        use_accelerate: Optional[bool] = False,
+        use_accelerate: Optional[bool] = True,
         device_map_option: Optional[str] = "auto",
         max_memory_per_gpu: Optional[Union[int, str]] = None,
         max_cpu_memory: Optional[Union[int, str]] = None,
@@ -87,9 +87,9 @@ class HuggingFaceAutoLM(BaseLM):
         dtype: Optional[Union[str, torch.dtype]] = None,
         device: Optional[Union[int, str]] = "cuda",
         peft: str = None,
-        load_in_8bit: Optional[bool] = False,
+        load_in_8bit: Optional[bool] = True,
         load_in_4bit: Optional[bool] = False,
-        trust_remote_code: Optional[bool] = False,
+        trust_remote_code: Optional[bool] = True,
         gptq_use_triton: Optional[bool] = False,
     ):
         """Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-falcon-40b.eval.log
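
The flipped defaults above (use_accelerate, load_in_8bit, trust_remote_code) are what make a 40B Falcon checkpoint fit on limited GPU memory: accelerate shards the weights across devices, int8 quantization roughly halves memory versus fp16, and Falcon ships custom modeling code on the Hub. Outside the harness, the equivalent load looks roughly like this (a sketch of the transformers/bitsandbytes API of that era, not code from this thread):

```python
# Sketch: what the patched harness defaults amount to when loading the model
# directly with transformers + accelerate + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "h2oai/h2ogpt-oig-oasst1-falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",       # use_accelerate=True: shard across available GPUs/CPU
    load_in_8bit=True,       # load_in_8bit=True: bitsandbytes int8 weights
    trust_remote_code=True,  # Falcon uses custom modeling code on the Hub
)
```

Note that int8 loading can shift benchmark numbers slightly relative to fp16, so results under these defaults are not strictly comparable to the earlier runs.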
