This repository contains the code implementation of "Estimating Large Language Model Capabilities without Labeled Test Data" by Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, and Robin Jia. It covers both the in-context learning LLM inference used to generate meta-training data and the meta-model training.
```
pip install torch==1.13.1
pip install transformers==4.22.1
```
To run experiments with the MMLU and MCQA datasets:

```
cd mcqa
```

To run experiments with the CBQA datasets:

```
cd cbqa
```

Please see `mcqa/config.py` and `cbqa/config.py` for the full ontology of datasets in each collection.
While under `mcqa/`, run the following command to do inference using the OPT model on the MMLU/MCQA datasets:

```
python opt_mmlu_worker.py \
  --model_size opt-6.7b --num_shots 5 --temperature 0 --template mmlu --seed 1
```
- `--model_size`: the size of the OPT model, such as `opt-6.7b` or `opt-13b`
- `--num_shots`: number of few-shot examples in the prompt
- `--template`: the prompt template used to demonstrate the few-shot examples; choose from `mmlu`, `subject`, `gopher`, `gpt`, and `user`
- `--temperature`: hyperparameter to control the randomness
- `--seed`: random seed
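Generating meta-training data typically requires running the worker across several model sizes and seeds. A minimal sketch of how such runs could be batched; the loop itself (and the particular sizes and seeds chosen) is illustrative and not part of this repo:

```python
import itertools
import shlex

# Illustrative choices; adjust to the runs you need.
MODEL_SIZES = ["opt-6.7b", "opt-13b"]
SEEDS = [1, 2, 3]

def build_command(model_size: str, seed: int) -> list:
    """Assemble one opt_mmlu_worker.py invocation as an argv list."""
    cmd = (f"python opt_mmlu_worker.py --model_size {model_size} "
           f"--num_shots 5 --temperature 0 --template mmlu --seed {seed}")
    return shlex.split(cmd)

if __name__ == "__main__":
    for size, seed in itertools.product(MODEL_SIZES, SEEDS):
        # Swap print for subprocess.run(...) to actually launch the jobs.
        print(" ".join(build_command(size, seed)))
```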
Similarly, while under `cbqa/`, run the following command to do inference using the OPT model on the CBQA datasets:

```
python opt_worker.py \
  --model_size opt-6.7b --num_shots 5 --seed 1
```
- `--model_size`: the size of the OPT model, such as `opt-6.7b` or `opt-13b`
- `--num_shots`: number of few-shot examples in the prompt
- `--seed`: random seed
Then, under either directory, run

```
python transform_embed.py
```

to retrieve and store the PCA-transformed embeddings.
Under `mcqa/`, run

```
python train_classifier.py \
  --setting cv --cv_k 5 --tasks mmlu --num_unlabeled 1000 --data_dim 100 --only_size 13B \
  --seed 1 --llama --mmlu --metric conf
```
- `--setting`: general setting for the train/test split
- `--cv_k`: number of splits for cross-validation
- `--tasks`: task defined in `config.py` as the metadata; choose from `mmlu` and `mcqa`
- `--num_unlabeled`: how much data to include in a single confidence profile
- `--data_dim`: dimension of the confidence profile
- `--only_size`: use only inference results from the specified LLM size
- `--only_shots`: use only inference results from the specified k-shot setting
- `--llama`/`--opt`: use the LLaMA model or the OPT model
- `--mmlu`/`--mcqa`: do inference on `mmlu` or `mcqa`
- `--metric`: metric for processing the confidence profile; choose from `conf`, `pca_embed`, and `conf_embed`
- `--train_size`: number of seeds to include in the training/test data
- `--seed`: random seed
- `--do_sigmoid`, `--dropout`, `--lr`, `--lr_lambda`, `--num_epochs`: MLP hyperparameters
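To give a sense of what a confidence profile is under the `conf` metric, here is a hedged numpy sketch: per-example model confidences on `--num_unlabeled` examples are sorted and resampled down to a `--data_dim`-length vector. The interpolation details below are illustrative assumptions, not necessarily what `train_classifier.py` does:

```python
import numpy as np

def confidence_profile(confidences: np.ndarray, data_dim: int = 100) -> np.ndarray:
    """Sort per-example confidences and resample to a fixed-length vector."""
    sorted_conf = np.sort(confidences)
    # Linearly interpolate the sorted curve down to `data_dim` points
    # (an illustrative choice of downsampling scheme).
    positions = np.linspace(0, len(sorted_conf) - 1, data_dim)
    return np.interp(positions, np.arange(len(sorted_conf)), sorted_conf)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    conf = rng.uniform(size=1000)   # stand-in for --num_unlabeled confidences
    profile = confidence_profile(conf, data_dim=100)
    print(profile.shape)  # (100,)
```

The resulting fixed-length vectors are what the meta-model (the MLP configured by the hyperparameter flags above) consumes as input features.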
Under `cbqa/`, run

```
python train_classifier.py \
  --setting cv --cv_k 5 --tasks cbqa --num_unlabeled 1000 --data_dim 100 --only_size llama13B \
  --seed 1 --llama --metric conf
```
- `--setting`: general setting for the train/test split
- `--cv_k`: number of splits for cross-validation
- `--tasks`: task defined in `config.py` as the metadata; choose from `cbqa` and `seq2seq`
- `--num_unlabeled`: how much data to include in a single confidence profile
- `--data_dim`: dimension of the confidence profile
- `--only_size`: use only inference results from the specified LLM size
- `--only_shots`: use only inference results from the specified k-shot setting
- `--llama`/`--opt`: use the LLaMA model or the OPT model
- `--mmlu`/`--mcqa`: do inference on `mmlu` or `mcqa`
- `--metric`: metric for processing the confidence profile; choose from `conf`, `pca_embed`, and `conf_embed`
- `--seed`: random seed
- `--do_sigmoid`, `--dropout`, `--lr`, `--lr_lambda`, `--num_epochs`: MLP hyperparameters
We did not include the meta-training data in this repo due to its large size. We also did not include the inference code for LLaMA models to avoid potential copyright issues. We thank 🤗 Hugging Face Datasets for making the datasets and LLMs easily accessible.