Testing grounds for LLMs - Validation Framework #35

Open · arnocandel opened this issue Apr 12, 2023 · 14 comments

Comments

@arnocandel

https://twitter.com/omarsar0/status/1641792530667675648/photo/1

@arnocandel commented Apr 24, 2023

https://github.com/EleutherAI/lm-evaluation-harness has out-of-the-box Hugging Face connectors and seems easiest to use. Some of its tasks, such as ARC-Easy, ARC-Challenge, and PIQA, also appear in the tweet above.
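
For reference, the same run can also be driven from Python instead of the CLI. A minimal sketch, assuming the harness is installed from a local clone (`pip install -e .`) and that this version exposes the `simple_evaluate` entry point with these arguments:

```python
# Minimal sketch: run a few harness tasks against an HF model programmatically.
# Assumes lm-evaluation-harness is installed and a CUDA GPU is available.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b",
    tasks=["arc_easy", "arc_challenge", "piqa"],
    num_fewshot=0,
    device="cuda",
)
print(results["results"])  # per-task acc / acc_norm with stderr
```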

https://arxiv.org/pdf/2303.17564v1.pdf (BloombergGPT)
(benchmark screenshots attached in the original comment)

@arnocandel commented Apr 24, 2023

CUDA_VISIBLE_DEVICES=0 torchrun main.py --model hf-causal --model_args pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-512-6.9b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| boolq         | 1       | acc      | 0.6266 | ± | 0.0085 |
| arc_challenge | 0       | acc      | 0.3225 | ± | 0.0137 |
|               |         | acc_norm | 0.3396 | ± | 0.0138 |
| openbookqa    | 0       | acc      | 0.2660 | ± | 0.0198 |
|               |         | acc_norm | 0.3660 | ± | 0.0216 |
| arc_easy      | 0       | acc      | 0.6776 | ± | 0.0096 |
|               |         | acc_norm | 0.6195 | ± | 0.0100 |
| hellaswag     | 0       | acc      | 0.4822 | ± | 0.0050 |
|               |         | acc_norm | 0.6465 | ± | 0.0048 |
| winogrande    | 0       | acc      | 0.6219 | ± | 0.0136 |
| piqa          | 0       | acc      | 0.7530 | ± | 0.0101 |
|               |         | acc_norm | 0.7606 | ± | 0.0100 |

h2ogpt-oig-oasst1-512-6.9b.eval.log
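
Since several of these logs get compared below, a small helper for pulling the results out of a log file saves copy-paste. This is a sketch under the assumption that the log contains the pipe-formatted table shown above; `parse_eval_log` is a hypothetical helper, not part of the harness:

```python
# Sketch: extract task/metric/value rows from an lm-evaluation-harness log.
# Assumes the log contains the pipe-delimited results table printed by main.py;
# parse_eval_log is illustrative, not a harness API.
import re

def parse_eval_log(path):
    rows, task = [], None
    with open(path) as f:
        for line in f:
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            # Data rows look like: task | version | metric | value | ± | stderr
            if len(cells) == 6 and re.fullmatch(r"0\.\d+", cells[3]):
                task = cells[0] or task  # continuation rows leave the task blank
                rows.append((task, cells[2], float(cells[3]), float(cells[5])))
    return rows

for task, metric, value, stderr in parse_eval_log("h2ogpt-oig-oasst1-512-6.9b.eval.log"):
    print(f"{task:15s} {metric:9s} {value:.4f} ± {stderr:.4f}")
```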

arnocandel changed the title from "Testing grounds for LLMs" to "Testing grounds for LLMs - Validation Framework" on Apr 24, 2023
@arnocandel commented Apr 24, 2023

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-12b.eval.log
CUDA_VISIBLE_DEVICES=1 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-20b.eval.log

@arnocandel commented Apr 25, 2023

h2ogpt-oasst1-512-12b.eval.log

hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| arc_easy      | 0       | acc      | 0.6932 | ± | 0.0095 |
|               |         | acc_norm | 0.6225 | ± | 0.0099 |
| openbookqa    | 0       | acc      | 0.2900 | ± | 0.0203 |
|               |         | acc_norm | 0.3740 | ± | 0.0217 |
| winogrande    | 0       | acc      | 0.6369 | ± | 0.0135 |
| hellaswag     | 0       | acc      | 0.5140 | ± | 0.0050 |
|               |         | acc_norm | 0.6803 | ± | 0.0047 |
| piqa          | 0       | acc      | 0.7682 | ± | 0.0098 |
|               |         | acc_norm | 0.7661 | ± | 0.0099 |
| boolq         | 1       | acc      | 0.6685 | ± | 0.0082 |
| arc_challenge | 0       | acc      | 0.3157 | ± | 0.0136 |
|               |         | acc_norm | 0.3507 | ± | 0.0139 |

h2ogpt-oasst1-512-20b.eval.log

hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-20b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| hellaswag     | 0       | acc      | 0.5419 | ± | 0.0050 |
|               |         | acc_norm | 0.7259 | ± | 0.0045 |
| boolq         | 1       | acc      | 0.7125 | ± | 0.0079 |
| piqa          | 0       | acc      | 0.7742 | ± | 0.0098 |
|               |         | acc_norm | 0.7775 | ± | 0.0097 |
| openbookqa    | 0       | acc      | 0.2800 | ± | 0.0201 |
|               |         | acc_norm | 0.4000 | ± | 0.0219 |
| arc_challenge | 0       | acc      | 0.3993 | ± | 0.0143 |
|               |         | acc_norm | 0.4420 | ± | 0.0145 |
| winogrande    | 0       | acc      | 0.6614 | ± | 0.0133 |
| arc_easy      | 0       | acc      | 0.7327 | ± | 0.0091 |
|               |         | acc_norm | 0.6894 | ± | 0.0095 |

@arnocandel commented Apr 25, 2023

Comparison with https://huggingface.co/databricks/dolly-v2-12b and related checkpoints:
| model                   | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa     | boolq    | gmean    |
|-------------------------|------------|----------|------------|-----------|---------------|----------|----------|----------|
| EleutherAI/pythia-2.8b  | 0.348      | 0.585859 | 0.589582   | 0.591217  | 0.323379      | 0.73395  | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b  | 0.368      | 0.604798 | 0.608524   | 0.631548  | 0.343857      | 0.761153 | 0.6263   | 0.543567 |
| databricks/dolly-v2-3b  | 0.384      | 0.611532 | 0.589582   | 0.650767  | 0.370307      | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b   | 0.364      | 0.627104 | 0.636148   | 0.668094  | 0.346416      | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B     | 0.382      | 0.621633 | 0.651144   | 0.662617  | 0.363481      | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408      | 0.63931  | 0.616417   | 0.707927  | 0.388225      | 0.757889 | 0.568196 | 0.56781  |
| databricks/dolly-v2-7b  | 0.392      | 0.633838 | 0.607735   | 0.686517  | 0.406997      | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b  | 0.41       | 0.62963  | 0.643252   | 0.676758  | 0.384812      | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402      | 0.683923 | 0.656669   | 0.7142    | 0.408703      | 0.784004 | 0.695413 | 0.602236 |
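
The gmean column is the geometric mean of the seven task accuracies, a single-number summary that works because all metrics live on the same 0–1 scale. A minimal check against the pythia-2.8b row above:

```python
# Sketch: geometric mean of per-task accuracies, as in the gmean column.
import math

accs = [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226]
gmean = math.exp(sum(math.log(a) for a in accs) / len(accs))
print(f"{gmean:.6f}")  # ~0.523431, matching the pythia-2.8b row
```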

@arnocandel commented Apr 25, 2023

What the tasks look like:
tasks.zip

Created with:

python scripts/write_out.py --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --num_fewshot 5 --num_examples 10 --output_base_path tasks

@arnocandel commented Apr 25, 2023

Let's see if Dolly v2 12B reproduces their reported numbers with this setup:

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> dolly-v2-12b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| winogrande    | 0       | acc      | 0.6298 | ± | 0.0136 |
| arc_easy      | 0       | acc      | 0.6713 | ± | 0.0096 |
|               |         | acc_norm | 0.6380 | ± | 0.0099 |
| hellaswag     | 0       | acc      | 0.5420 | ± | 0.0050 |
|               |         | acc_norm | 0.7109 | ± | 0.0045 |
| piqa          | 0       | acc      | 0.7399 | ± | 0.0102 |
|               |         | acc_norm | 0.7541 | ± | 0.0100 |
| arc_challenge | 0       | acc      | 0.3618 | ± | 0.0140 |
|               |         | acc_norm | 0.3823 | ± | 0.0142 |
| openbookqa    | 0       | acc      | 0.2980 | ± | 0.0205 |
|               |         | acc_norm | 0.4060 | ± | 0.0220 |
| boolq         | 1       | acc      | 0.5624 | ± | 0.0087 |

Yes, this is consistent with their reported numbers, so the command itself seems reasonable.

@arnocandel

Undertrained older models do indeed perform worse:

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-20b.eval.log

h2ogpt-oig-oasst1-256-20b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| hellaswag     | 0       | acc      | 0.5320 | ± | 0.0050 |
|               |         | acc_norm | 0.7115 | ± | 0.0045 |
| arc_easy      | 0       | acc      | 0.7138 | ± | 0.0093 |
|               |         | acc_norm | 0.6869 | ± | 0.0095 |
| boolq         | 1       | acc      | 0.6878 | ± | 0.0081 |
| piqa          | 0       | acc      | 0.7742 | ± | 0.0098 |
|               |         | acc_norm | 0.7786 | ± | 0.0097 |
| openbookqa    | 0       | acc      | 0.2760 | ± | 0.0200 |
|               |         | acc_norm | 0.3760 | ± | 0.0217 |
| arc_challenge | 0       | acc      | 0.3686 | ± | 0.0141 |
|               |         | acc_norm | 0.3942 | ± | 0.0143 |
| winogrande    | 0       | acc      | 0.6630 | ± | 0.0133 |

@arnocandel

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-12b.eval.log

h2ogpt-oig-oasst1-256-12b.eval.log

| Task          | Version | Metric   | Value  |   | Stderr |
|---------------|---------|----------|--------|---|--------|
| hellaswag     | 0       | acc      | 0.5189 | ± | 0.0050 |
|               |         | acc_norm | 0.6930 | ± | 0.0046 |
| arc_challenge | 0       | acc      | 0.3276 | ± | 0.0137 |
|               |         | acc_norm | 0.3797 | ± | 0.0142 |
| piqa          | 0       | acc      | 0.7628 | ± | 0.0099 |
|               |         | acc_norm | 0.7720 | ± | 0.0098 |
| winogrande    | 0       | acc      | 0.6527 | ± | 0.0134 |
| boolq         | 1       | acc      | 0.6602 | ± | 0.0083 |
| arc_easy      | 0       | acc      | 0.6999 | ± | 0.0094 |
|               |         | acc_norm | 0.6629 | ± | 0.0097 |
| openbookqa    | 0       | acc      | 0.2940 | ± | 0.0204 |
|               |         | acc_norm | 0.4020 | ± | 0.0219 |

@arnocandel

Using https://github.com/EleutherAI/lm-evaluation-harness at commit 4b701e228768052cfae9043dca13e82052ca5eea, with the following local patch:

diff --git a/lm_eval/models/huggingface.py b/lm_eval/models/huggingface.py
index 4d3aa24..34b6967 100644
--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -76,10 +76,10 @@ class HuggingFaceAutoLM(BaseLM):
         subfolder: Optional[str] = None,
         revision: Optional[str] = "main",
         batch_size: Optional[Union[int, str]] = 1,
-        max_gen_toks: Optional[int] = 256,
+        max_gen_toks: Optional[int] = 512,
         max_length: Optional[int] = None,
         add_special_tokens: Optional[bool] = None,
-        use_accelerate: Optional[bool] = False,
+        use_accelerate: Optional[bool] = True,
         device_map_option: Optional[str] = "auto",
         max_memory_per_gpu: Optional[Union[int, str]] = None,
         max_cpu_memory: Optional[Union[int, str]] = None,
@@ -87,9 +87,9 @@ class HuggingFaceAutoLM(BaseLM):
         dtype: Optional[Union[str, torch.dtype]] = None,
         device: Optional[Union[int, str]] = "cuda",
         peft: str = None,
-        load_in_8bit: Optional[bool] = False,
+        load_in_8bit: Optional[bool] = True,
         load_in_4bit: Optional[bool] = False,
-        trust_remote_code: Optional[bool] = False,
+        trust_remote_code: Optional[bool] = True,
         gptq_use_triton: Optional[bool] = False,
     ):
         """Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-falcon-40b.eval.log
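
The flipped defaults above (use_accelerate, load_in_8bit, trust_remote_code) are what make a 40B Falcon checkpoint fit on limited GPU memory: accelerate shards the weights across devices, int8 quantization roughly halves memory versus fp16, and Falcon ships custom modeling code on the Hub. Outside the harness, the equivalent load looks roughly like this (a sketch of the transformers/bitsandbytes API of that era, not code from this thread):

```python
# Sketch: what the patched harness defaults amount to when loading the model
# directly with transformers + accelerate + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "h2oai/h2ogpt-oig-oasst1-falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",       # use_accelerate=True: shard across available GPUs/CPU
    load_in_8bit=True,       # load_in_8bit=True: bitsandbytes int8 weights
    trust_remote_code=True,  # Falcon uses custom modeling code on the Hub
)
```

Note that int8 loading can shift benchmark numbers slightly relative to fp16, so results under these defaults are not strictly comparable to the earlier runs.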
