CodeGen

An exploration in code & assertion generation with large language models. An explanation for the prompting schema, data acquisition, and methodology can be found here, in my Master's thesis.

Project Abstract:

Software assertions play a critical role in the creation of test benches and the overall verification of systems. In the case of formal property verification, complex design specifications are interpreted by industry experts and translated into System Verilog Assertions (SVA). Recent research has pointed toward large language models as a potential tool for SVA generation, however, lack of data and standardization of software assertions has resulted in mixed results amongst methods of evaluations. Thus, this paper proposes a dataset of code and natural language data containing assertions in SystemVerilog and Python that can be used to train and test future collaborative coding models. Additionally, this paper provides a preliminary analysis vii and novel schema for the consistent generation of quality software assertions with OpenAI’s GPT-4.

A Brief Tour:

Main Components:
- AllInOne.ipynb is an aggregate of many of the processing files that includes code cleaning, prompt generation, and analysis. For more information on textbook analysis see GPT-ImageScrapper.ipynb.
- Data/exploration.ipynb gathers statistics and other visual analytics from the asserted supervised, asserted unsupervised, and textbook datasets.
- Organized Dataset/* contains cleaned results, prompts, image directories, & LLM responses.
  - The majority of this is focused on asserted code with the notable exception of Organized Dataset/Verilog Textbook Code/* which is all interpreted Verilog code.
- Data/* Contains most of the intermediary states of data as it is processed.
  - Data/BigQuery/* contains the raw code data from Google Cloud.
  - Data/Data/example-code/supervised-textbook/* contains the raw & partially-processed textbook data.
  - Data/example-code/verilog examples/open-titan/* contains the raw & partially-processed testbench data.
generally helpful documents:
- Gpt-Tutorial.ipynb has a brief overview of how to query openai's api & automate the process of extracting LLM-generated code
- GPTest.ipynb shows the train of thought and some preliminary analysis done with LLM responses.

Key:

any file with "supervised" in the title is related to the connection between a piece of code's behavioral description (as given by an LLM) and the code-text itself.
- likewise anything "unsupervised" relates exclusively to the code its assertions.
any file with "response" entails an LLM output.

For further questions feel free to reach me at korashughes+github@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
Data		Data
Organized Dataset		Organized Dataset
Testbench		Testbench
.DS_Store		.DS_Store
.gitignore		.gitignore
AllInOne.ipynb		AllInOne.ipynb
GPT-ImageScrapper.ipynb		GPT-ImageScrapper.ipynb
GPTest.ipynb		GPTest.ipynb
Gpt-Tutorial.ipynb		Gpt-Tutorial.ipynb
OpenTitan-Scraper.ipynb		OpenTitan-Scraper.ipynb
README.md		README.md
testpage.jpg		testpage.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CodeGen

Project Abstract:

A Brief Tour:

Key:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CodeGen

Project Abstract:

A Brief Tour:

Key:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages