
Fail to reproduce the codellama baseline with the code base #16

Closed
xufangzhi opened this issue Dec 13, 2023 · 1 comment

Comments

@xufangzhi

xufangzhi commented Dec 13, 2023

Hello, thanks for sharing the great work!

We tried to reproduce the CodeLLaMA-13B PAL results on the GSM-Hard dataset with the following script:

set -ex

MODEL_NAME_OR_PATH="LOCAL CODELLAMA-13B MODEL WEIGHTS"

# DATA_LIST = ['math', 'gsm8k', 'gsm-hard', 'svamp', 'tabmwp', 'asdiv', 'mawps']
DATA_NAME="gsm-hard"
SPLIT="test"
PROMPT_TYPE="pal"
NUM_TEST_SAMPLE=-1

CUDA_VISIBLE_DEVICES=0 TOKENIZERS_PARALLELISM=false \
python -um infer.inference \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --seed 0 \
    --temperature 0 \
    --n_sampling 1 \
    --top_p 1 \
    --start 0 \
    --end -1

The output results are as follows:

Num samples: 1319
Num scores: 1319
Timeout samples: 10
Empty samples: 293
Mean score: [0.1]

Time use: 7540.60s
Time use: 125:40

This gives a mean score of 0.1, which is far below the reported result. Would you please share the baseline script for CodeLLaMA with the PAL strategy? Thanks a lot.

@ZubinGou
Contributor

ZubinGou commented Jan 3, 2024

Maybe check whether your environment is consistent with requirements.txt.
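A quick way to follow this advice is to compare the installed package versions against the pins in requirements.txt. The helper below is a minimal sketch (not part of the repo): it assumes simple `name==version` pins and uses the standard-library `importlib.metadata` to look up what is actually installed.

```python
# Hypothetical helper: report packages whose installed version differs from
# the "name==version" pins in a requirements.txt file.
from importlib import metadata


def check_requirements(path="requirements.txt"):
    """Return a list of (name, pinned_version, installed_version_or_None)."""
    mismatches = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and anything that is not an exact pin.
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, want = line.split("==", 1)
            try:
                have = metadata.version(name)
            except metadata.PackageNotFoundError:
                have = None  # package not installed at all
            if have != want:
                mismatches.append((name, want, have))
    return mismatches


if __name__ == "__main__":
    import os
    if os.path.exists("requirements.txt"):
        for name, want, have in check_requirements():
            print(f"{name}: required {want}, installed {have}")
```

Running it in the repo root prints every mismatched or missing dependency; version drift in packages such as transformers or the decoding backend is a common cause of degraded scores like the one above.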

@ZubinGou ZubinGou closed this as completed Jan 3, 2024