
Ranking LLM-Generated Loop Invariants for Program Verification

Official code release of our EMNLP 2023 work NeuralInvariantRanker.

About the Ranker

The tool ranks LLM-generated loop invariants for program verification. We evaluate invariants generated by two different LLMs (gpt-3.5-turbo and gpt-4).

News

  • February 2024: Official code release.

Step-by-step guide

  • Step 1 - Make sure all the requirements are successfully installed.
  • Step 2 - Process the data.
  • Step 3 - Gather the initial embeddings.
  • Step 4 - Train rankers.
  • Step 5 - Evaluate and Visualize the results of the ranker.

Requirements

We built the tool on Python 3.9 and evaluated it with PyTorch 1.13.1 (with CUDA 11.6) and Transformers 4.36.0. Other versions of these libraries may require code modifications, especially in trainer.py (and perhaps elsewhere). To set up the appropriate library versions, run pip install -r requirements.txt.

About the data

We provide two versions of the data. Follow the steps below to extract the datasets:

  1. Training data: available as a zip file at data/ranker_data/data.zip. Extract it with cd data/ranker_data; unzip data.zip;
  2. Evaluation data: available at data/llm-genrated-solutions/solutions.zip. Extract it with cd data/llm-genrated-solutions; unzip solutions.zip;

Training data is formatted as a JSONL file, where each line corresponds to one example. The format of an example is as follows:

{
    "problem_id": "<unique id of the problem>",
    "code": "code of the problem, i.e., problem description",
    "positives": [
        {
            "code": "<a candidate invariant>",
            "score": "<score between 0 and 1, denoting how good a candidate invariant is, higher is better.>"
        }
        ...
    ],
    "negatives": [
        {
            "code": "<a candidate invariant>",
            "score": "<score between 0 and 1, denoting how good a candidate invariant is, higher is better.>"
        }
        ...
    ]
}
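As a quick sanity check, the extracted JSONL can be loaded line by line. The sketch below is illustrative; the exact file name under data/ranker_data/fold_0 is an assumption, so adjust the path to whatever the extraction step produced.

import json

# Hypothetical path: adjust to the actual file layout under data/ranker_data.
path = "data/ranker_data/fold_0/train.jsonl"

with open(path) as f:
    for line in f:
        example = json.loads(line)  # one training example per line
        print(
            example["problem_id"],
            len(example["positives"]),
            len(example["negatives"]),
        )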

Initial Embeddings

We collect initial embeddings from two different OpenAI embedding models. The initial embeddings must be gathered before training starts. We provide this script to gather them. To run the script:

python data/get_initial_embeddings.py \
    --model_name <MODEL_NAME, default = ada_002> \
    --use_azure --azure_config <AZURE_CONFIG> \
    --token_per_minute TOKEN_PER_MINUTE \
    --output_dir <directory where the embeddings will be saved as MODEL_NAME.json, default = data/embeddings>
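For example, to gather the text-embedding-ada-002 embeddings directly through the OpenAI API (without Azure), the invocation might look as follows; the token-per-minute value is illustrative and should match your account's rate limit.

export OPENAI_API_KEY=<Your API Key>
python data/get_initial_embeddings.py \
    --model_name ada_002 \
    --token_per_minute 300000 \
    --output_dir data/embeddings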

Note that this script can use either the OpenAI API or the Azure OpenAI API. To use the OpenAI API, export your key as the OPENAI_API_KEY environment variable: export OPENAI_API_KEY=<Your API Key>. To use the Azure OpenAI API, pass an Azure config JSON file to the script above (--azure_config). The config file should be formatted as:

{
    "API_BASE": "<base url of your azure openai endpoint>",
    "API_KEY": "<your key>",
    "ENGINE": "<model/deployment name>",
    "API_VERSION": "<version of the api>"
}
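For reference, these fields map onto an Azure OpenAI embedding call roughly as sketched below. This uses the openai>=1.0 Python client; the repository's script may use an older client, so treat this purely as an illustration of what each config field means.

import json

from openai import AzureOpenAI  # assumes the openai>=1.0 client

with open("azure_config.json") as f:  # hypothetical config path
    cfg = json.load(f)

client = AzureOpenAI(
    azure_endpoint=cfg["API_BASE"],
    api_key=cfg["API_KEY"],
    api_version=cfg["API_VERSION"],
)

# ENGINE is the Azure deployment name that selects the embedding model.
response = client.embeddings.create(model=cfg["ENGINE"], input=["x >= 0"])
print(len(response.data[0].embedding))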

In the paper we experimented with two different embedding models, text-embedding-ada-002 and davinci-similarity. The davinci-similarity model has since been deprecated, so its embeddings can no longer be collected with the script above. To reproduce our results, please download the initial embeddings from this link.

Important note. The embedding files are large: the text-embedding-ada-002 file (ada_002.json) is 12GB and the davinci-similarity file (davinci.json) is 34GB. Both the training script (this file) and the evaluation script (this file) load the entire embedding dictionary into memory, so training and evaluation are very memory intensive. To replicate this tool in a resource-constrained environment, we suggest re-implementing the loading of the initial embeddings in a lazy fashion; we may release an update with lazy loading in the future.
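One possible workaround, sketched below under the assumption that the embedding file maps each input text to its embedding vector (as the Embedder interface suggests), is to stream-convert the large JSON file once into an on-disk key-value store and then look embeddings up on demand. This uses the third-party ijson package and Python's built-in shelve module; it is a suggestion, not part of the released code.

import shelve

import ijson  # streaming JSON parser: pip install ijson

# One-time conversion: stream top-level (text, vector) pairs from the large
# JSON file into a shelve database without loading everything into memory.
with open("data/embeddings/ada_002.json", "rb") as f, \
        shelve.open("data/embeddings/ada_002.shelf") as db:
    for text, vector in ijson.kvitems(f, ""):
        # ijson yields numbers as Decimal; store plain floats instead.
        db[text] = [float(v) for v in vector]

# Afterwards, embeddings can be fetched lazily, one key at a time.
with shelve.open("data/embeddings/ada_002.shelf") as db:
    vector = db.get("some problem or invariant text")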

To use an embedding model other than OpenAI's, implement your own embedder by extending the Embedder class. Your class should implement the following method:

def embed(
    self, 
    texts: List[str], 
    existing_embeddings: Optional[Dict[str, List[float]]] = {}
) -> Dict[str, List[float]]:
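For instance, a Hugging Face-based embedder could look like the sketch below. The import path of the Embedder base class, the choice of encoder checkpoint, and mean pooling are all assumptions; adapt them to the actual code in this repository.

from typing import Dict, List, Optional

import torch
from transformers import AutoModel, AutoTokenizer

from src.embedder import Embedder  # hypothetical import path of the base class


class HFEmbedder(Embedder):
    """Embeds texts with a Hugging Face encoder using mean pooling."""

    def __init__(self, model_name: str = "microsoft/codebert-base"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).eval()

    def embed(
        self,
        texts: List[str],
        existing_embeddings: Optional[Dict[str, List[float]]] = {},
    ) -> Dict[str, List[float]]:
        result = dict(existing_embeddings or {})
        with torch.no_grad():
            for text in texts:
                if text in result:  # reuse already-computed embeddings
                    continue
                encoded = self.tokenizer(text, truncation=True, return_tensors="pt")
                hidden = self.model(**encoded).last_hidden_state  # (1, L, H)
                result[text] = hidden.mean(dim=1).squeeze(0).tolist()
        return result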

Training the ranker

To train the ranker:

bash train_ranker.sh <model name (ada_002/davinci)> <fold id 0/1/2/3/4>

For example, running

bash train_ranker.sh ada_002 0

will train a ranker based on the training data in data/ranker_data/fold_0 and the initial embedding from text-embedding-ada-002 (initial embeddings should already be computed as data/embeddings/ada_002.json).

Training parameters:

The training parameters (e.g., batch_size, max_training_steps, evaluation_strategy, etc.) are controlled by an external JSON file, which is passed to the training in script/train_ranker.sh. By default, configs/ranker_config.json is used. Modify this file to experiment with other parameters, or create a new file and update this line with the appropriate config file path.
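For orientation, such a config might look like the snippet below; the key names are illustrative, taken from the parameters mentioned above, and the authoritative set of keys is whatever configs/ranker_config.json actually contains.

{
    "batch_size": 16,
    "max_training_steps": 10000,
    "evaluation_strategy": "steps"
}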

Evaluation and Visualization

After training, we provide this notebook to evaluate and visualize the ranking of the LLM-generated invariants. Note that we performed 5-fold cross-validation to evaluate the ranker. To get results from this notebook, make sure to train models for all 5 folds and for both the ada_002 and davinci models.
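One straightforward way to produce all the required checkpoints is to loop over both embedding models and all five folds:

for model in ada_002 davinci; do
    for fold in 0 1 2 3 4; do
        bash train_ranker.sh $model $fold
    done
done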

Alternatively, to rank other problems and their corresponding candidate invariants, we refer to this function. Its signature is as follows:

def rank_invariants(
    self, 
    model: nn.Module,
    examples: List[Dict[str, Any]],
    _cache: Dict[str, List[float]] = None,
    bar_tqdm=None
):

Here, model is the trained ranker model, examples is a list of ranking examples formatted as below, and _cache is the initial embeddings dictionary, which must contain the initial embeddings for the problem and the invariants in the examples. Each example passed to this function should be formatted as:

{
    "problem": <text of the problem description>,
    "invariants": [
        {
            "code": "<text of the candidate invariant>"
        },
        ...
    ]
}

The function reorders the "invariants" list in each example and adds two keys (i.e., "code_vector" and "similarity") to each invariant. In addition, the ranker populates a "code_vector" for the "problem" as well. These "code_vector" values are the embeddings produced by the trained ranker model.
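As an end-to-end sketch, a call might look as follows. The names here are heavily assumed: ranker stands for whatever object in the trainer/evaluation code exposes rank_invariants, and model is the trained checkpoint loaded by that code; both must be adapted to the actual repository.

import json

examples = [
    {
        "problem": "int x = 0; while (x < n) { x++; } assert(x == n);",
        "invariants": [
            {"code": "x <= n"},
            {"code": "x >= 0 && x <= n"},
        ],
    }
]

# The cache must already hold initial embeddings for the problem text and
# every candidate invariant (e.g., loaded from data/embeddings/ada_002.json).
with open("data/embeddings/ada_002.json") as f:
    cache = json.load(f)

# `ranker` and `model` are hypothetical; obtain them from this repository's
# trainer/evaluation code and a trained checkpoint.
ranker.rank_invariants(model, examples, _cache=cache)

# After the call, the "invariants" list is reordered by the ranker, and each
# invariant carries "code_vector" and "similarity" entries.
for invariant in examples[0]["invariants"]:
    print(invariant["similarity"], invariant["code"])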

Invariant Generation

While the focus of this project is training a ranker for invariants, we provide the prompt template we used to generate invariants with LLMs in src/codex_models/chat.py.

Known Limitations

  1. We assumed the cost of calling LLMs is negligible compared to the cost of calling the verifier to check an invariant. In cases where a call to the LLM is much more expensive than a call to the verifier, this ranker will still reduce the number of Z3 calls, but may not yield actual cost savings.

  2. While the ranker empirically improved the placement of correct invariants at the top of the ranked list, it should be studied in the actual deployment scenario before its rankings are relied on in real verification. Without such a study, there is no guarantee that the ranker will rank the correct invariant at the higher positions, which may threaten the overall efficacy of the verifier.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
