WebApp1K is a coding benchmark that evaluates LLMs on their ability to develop real-world web applications.
WebApp1K is designed to run in any environment that has Python and Node.js.
Ensure your environment has the following installed:
- Python (version 3.8 or higher)
- Node.js (version 14 or higher)
- Clone the GitHub repository:
git clone https://github.com/onekq/WebApp1K.git
cd WebApp1K
- Optional: Create virtual environment:
python -m venv venv
source venv/bin/activate   # Linux
venv\Scripts\activate      # Windows
- Install Python dependencies:
pip install -r requirements.txt
- Install Node.js dependencies:
cd staging
npm install
cd ..
If you want deeper insight into the coding performance of a leading LLM available via a public token-based API, you can run this benchmark to obtain all raw results, i.e. the generated code and error logs.
Vendor | API key name | Code Generator | Models |
---|---|---|---|
OpenAI | OPENAI_API_KEY | gpt | GPT-4o, GPT-4o-mini, etc. |
Anthropic | ANTHROPIC_API_KEY | claude | Claude-3.5-Sonnet, etc. |
Google | GEMINI_API_KEY | gemini | Gemini-1.5-pro, etc. |
Mistral | MISTRAL_API_KEY | mistral | mixtral-8x22b-instruct, etc. |
Fireworks | FIREWORKS_API_KEY | fireworks | open source models |
The table above lists the latest popular LLMs and their hosting services. Depending on which LLMs interest you, purchase the corresponding API keys and save them as environment variables.
Using OpenAI as an example:
export OPENAI_API_KEY='your-openai-api-key' # For Linux/MacOS
set OPENAI_API_KEY=your-openai-api-key # For Windows
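Optionally, you can verify that the key is visible before starting a long run. The snippet below is just a sanity check (not part of the benchmark) and uses the variable name from the table above:

```python
import os

# Optional sanity check (not part of the benchmark): confirm the key is set.
key = os.environ.get("OPENAI_API_KEY")
if key:
    print(f"OPENAI_API_KEY is set ({len(key)} characters).")
else:
    print("OPENAI_API_KEY is not set. Export it before running the benchmark.")
```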
Make sure you have followed all the steps outlined in the Setup section and that you are in the WebApp1K directory. Then run the following script.
python run_eval.py -g <code_generator> -m <model>
The code generator names are listed in the table above. Using OpenAI as an example, if you want to evaluate the GPT-4o model, run:
python run_eval.py -g gpt -m gpt-4o
After the evaluation is finished, all the generated code can be found under a new `models` subdirectory. If a JavaScript file has the suffix `success.js`, this means the code passes the tests. If the file has the suffix `failure.js`, this means the code fails the tests. In addition, a companion `failure.log` file of the same name logs the test errors.
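Because of this naming convention, you can tally a pass rate directly from the generated files. The following is a minimal sketch (not part of the repository), assuming the generated files sit somewhere under the `models` directory as described above:

```python
import pathlib

# Minimal sketch (not part of the repository): count generated files by suffix.
models_dir = pathlib.Path("models")
passed = len(list(models_dir.rglob("*success.js")))
failed = len(list(models_dir.rglob("*failure.js")))
total = passed + failed

if total:
    print(f"{passed}/{total} passed ({passed / total:.1%})")
else:
    print("No generated files found yet.")
```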
Note: This script takes hours to finish. It is highly likely to crash due to various service-related exceptions such as rate limiting, server overload, or unsafe content. We purposely let it crash and let you decide what to do. A common scenario is hitting the daily token limit for model inference; the simplest solution is to wait until the next day. Resuming the run is easy: simply run the same script again. By examining the files in the `staging` and `models` folders, the script locates the breakpoint and continues the evaluation from there.
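In other words, a run is resumable because finished work can be recognized from the files already on disk. Here is a schematic sketch of that idea; the file-matching details are assumptions, and the actual breakpoint detection lives in `run_eval.py`:

```python
import pathlib

# Schematic only; the real breakpoint detection lives in run_eval.py.
# A test counts as finished once a success.js or failure.js file exists for it.
def is_finished(test_name: str, models_dir: pathlib.Path = pathlib.Path("models")) -> bool:
    return any(models_dir.rglob(f"{test_name}*success.js")) or \
           any(models_dir.rglob(f"{test_name}*failure.js"))
```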
If you pretrain or finetune your own LLM and would like to use this benchmark to evaluate your checkpoints, you should still follow the above process, but make the following changes.
- You won't need any API keys, and the majority of the code in `generate_code.py` won't be relevant to you.
- Modify the `generate_implementation` method to run inference against your own checkpoint (see the sketch below).
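As a rough illustration of that change, here is a hedged sketch using Hugging Face `transformers`. The method name comes from the repository, but the exact signature, prompt handling, and return format below are assumptions; adapt them to the real code in `generate_code.py`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: signature and return format are assumptions; adapt to the
# actual generate_implementation method in generate_code.py.
CHECKPOINT = "/path/to/your/checkpoint"  # hypothetical local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, device_map="auto")

def generate_implementation(prompt: str) -> str:
    """Return generated code for a single test prompt using a local checkpoint."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    # Strip the echoed prompt so only the completion is returned.
    completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    return completion
```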
For questions, inquiries, or suggestions, please reach out to info@onekq.ai.