[Feat] Add codegeex2 and Humanevalx #210

Merged · 12 commits · Aug 17, 2023
4 changes: 4 additions & 0 deletions configs/datasets/humanevalx/humanevalx_gen.py
@@ -0,0 +1,4 @@
from mmengine.config import read_base

with read_base():
    from .humanevalx_gen_fd5822 import humanevalx_datasets  # noqa: F401, F403
37 changes: 37 additions & 0 deletions configs/datasets/humanevalx/humanevalx_gen_fd5822.py
@@ -0,0 +1,37 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator

humanevalx_reader_cfg = dict(
    input_columns=['prompt'], output_column='task_id', train_split='test')

humanevalx_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='{prompt}'),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=1024))

humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address="localhost",  # replace with your code_eval_server IP address
            port=5000),  # see https://github.com/Ezra-Yu/code-evaluator to launch the server
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # rust is not supported yet
}

humanevalx_datasets = [
    dict(
        type=HumanevalXDataset,
        abbr=f'humanevalx-{lang}',
        language=lang,
        path='./data/humanevalx',
        reader_cfg=humanevalx_reader_cfg,
        infer_cfg=humanevalx_infer_cfg,
        eval_cfg=humanevalx_eval_cfg_dict[lang])
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
7 changes: 7 additions & 0 deletions configs/eval_codegeex2.py
@@ -0,0 +1,7 @@
from mmengine.config import read_base

with read_base():
    from .datasets.humanevalx.humanevalx_gen import humanevalx_datasets
    from .models.hf_codegeex2_6b import models

datasets = humanevalx_datasets
25 changes: 25 additions & 0 deletions configs/models/hf_codegeex2_6b.py
@@ -0,0 +1,25 @@
from opencompass.models import HuggingFace

# refer to https://github.com/THUDM/CodeGeeX2/tree/main
# For Pass@1:   n=20,  temperature=0.2, top_p=0.95
# For Pass@10:  n=200, temperature=0.8, top_p=0.95
# For Pass@100: n=200, temperature=0.8, top_p=0.95

models = [
    dict(
        type=HuggingFace,
        abbr='codegeex2-6b',
        path='THUDM/codegeex2-6b',
        tokenizer_path='THUDM/codegeex2-6b',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(trust_remote_code=True, device_map='auto'),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
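As a side note, the pass@k sampling recipes in the comments above come from the CodeGeeX2 repository. The sketch below shows how the Pass@1 settings (n=20, temperature=0.2, top_p=0.95) might be reproduced with the plain `transformers` API outside OpenCompass; the prompt string is an illustrative placeholder, not part of this PR:

```python
# Hedged sketch: sampling 20 completions per problem with the Pass@1 settings.
# Loading via AutoModel follows the CodeGeeX2 README; the prompt is a placeholder.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('THUDM/codegeex2-6b', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'THUDM/codegeex2-6b', trust_remote_code=True, device_map='auto').eval()

prompt = '# language: Python\n# write a bubble sort function\n'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,           # Pass@1 temperature
    top_p=0.95,
    num_return_sequences=20,   # n=20 samples per problem
    max_new_tokens=1024,
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```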
85 changes: 85 additions & 0 deletions docs/en/advanced_guides/code_eval_service.md
@@ -0,0 +1,85 @@
# Code Evaluation Service

We support evaluating datasets in multiple programming languages, similar to [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). Before starting, make sure the code evaluation service is running; refer to the [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) project to set it up.

## Launching the Code Evaluation Service

Make sure Docker is installed, then build the image and start a container running the service.

Build the Docker image:

```shell
git clone https://github.com/Ezra-Yu/code-evaluator.git
cd code-evaluator/docker
sudo docker build -t code-eval:latest .
```

After obtaining the image, create a container with the following commands:

```shell
# Run in the foreground and stream logs
sudo docker run -it -p 5000:5000 code-eval:latest python server.py

# Run the service in the background
# sudo docker run -itd -p 5000:5000 code-eval:latest python server.py

# Use a different port
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
```

Ensure that you can reach the service using the following commands (skip this step if the service is running on the local host):

```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
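If `ping` or `telnet` are unavailable on the compute node, the same reachability check can be done with a few lines of Python (a minimal sketch; the host and port are placeholders for your own service):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder values; substitute your code_eval_server address and port.
print(can_reach('localhost', 5000))
```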

```{note}
If your compute nodes cannot connect to the evaluation service, you can still run `python run.py xxx...` directly; the generated code will be saved under the 'outputs' folder. After migrating it to a machine that can reach the service, use [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) directly to obtain the results (the eval_cfg settings below can then be ignored).
```
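A typical offline workflow under that note might look like the following sketch (hostnames and paths are illustrative, and the `--mode infer` flag is an assumption about your OpenCompass version):

```shell
# Generate predictions only; results are written under ./outputs
python run.py configs/eval_codegeex2.py --mode infer

# Bundle the outputs and move them to a machine that can reach the service
tar -czf outputs.tar.gz outputs/
scp outputs.tar.gz user@eval-host:/path/to/workdir/

# On eval-host: unpack and score with code-evaluator (see its README)
```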

## Configuration File

We provide a [configuration file](https://github.com/InternLM/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating humaneval-x with codegeex2.

The dataset and related post-processing configuration files can be found at this [link](https://github.com/InternLM/opencompass/tree/main/configs/datasets/humanevalx). Note the `evaluator` field in `humanevalx_eval_cfg_dict`.

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator

humanevalx_reader_cfg = dict(
    input_columns=['prompt'], output_column='task_id', train_split='test')

humanevalx_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='{prompt}'),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=1024))

humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address="localhost",  # replace with your code_eval_server IP address
            port=5000),  # see https://github.com/Ezra-Yu/code-evaluator to launch the server
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # rust is not supported yet
}

humanevalx_datasets = [
    dict(
        type=HumanevalXDataset,
        abbr=f'humanevalx-{lang}',
        language=lang,
        path='./data/humanevalx',
        reader_cfg=humanevalx_reader_cfg,
        infer_cfg=humanevalx_infer_cfg,
        eval_cfg=humanevalx_eval_cfg_dict[lang])
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
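Once `ip_address` and `port` point at your running evaluation service, the benchmark can be launched from the OpenCompass root with the provided config:

```shell
python run.py configs/eval_codegeex2.py
```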
20 changes: 11 additions & 9 deletions docs/en/index.rst
@@ -37,15 +37,6 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
   user_guides/experimentation.md
   user_guides/metrics.md

.. _AdvancedGuides:
.. toctree::
   :maxdepth: 1
   :caption: Advanced Guides

   advanced_guides/new_dataset.md
   advanced_guides/new_model.md
   advanced_guides/evaluation_turbomind.md

.. _Prompt:
.. toctree::
   :maxdepth: 1
@@ -56,6 +47,17 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
   prompt/meta_template.md
   prompt/chain_of_thought.md


.. _AdvancedGuides:
.. toctree::
   :maxdepth: 1
   :caption: Advanced Guides

   advanced_guides/new_dataset.md
   advanced_guides/new_model.md
   advanced_guides/evaluation_turbomind.md
   advanced_guides/code_eval_service.md

.. _Tools:
.. toctree::
   :maxdepth: 1
86 changes: 86 additions & 0 deletions docs/zh_cn/advanced_guides/code_eval_service.md
@@ -0,0 +1,86 @@
# Code Evaluation Service

We support evaluating datasets in multiple programming languages, similar to [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). Before starting, make sure the code evaluation service is running; refer to the [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) project to set it up.

## Launching the Code Evaluation Service

Make sure Docker is installed, then build the image and start a container running the service.

Build the Docker image:

```shell
git clone https://github.com/Ezra-Yu/code-evaluator.git
cd code-evaluator/docker
sudo docker build -t code-eval:latest .
```

After obtaining the image, create a container with the following commands:

```shell
# Run in the foreground and stream logs
sudo docker run -it -p 5000:5000 code-eval:latest python server.py

# Run the service in the background
# sudo docker run -itd -p 5000:5000 code-eval:latest python server.py

# Use a different port
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
```

Ensure that you can reach the service using the following commands (skip this step if the service is running on the local host):

```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```

```{note}
If your compute nodes cannot connect to the evaluation service, you can still run `python run.py xxx...` directly; the generated code will be saved under the 'outputs' folder. After migrating it to a machine that can reach the service, use [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) directly to obtain the results (the eval_cfg settings below can then be ignored).
```

## Configuration File

We provide a [configuration file](https://github.com/InternLM/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating humaneval-x with codegeex2.

The dataset and related post-processing configuration files can be found at this [link](https://github.com/InternLM/opencompass/tree/main/configs/datasets/humanevalx). Note the `evaluator` field in `humanevalx_eval_cfg_dict`.

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator

humanevalx_reader_cfg = dict(
    input_columns=['prompt'], output_column='task_id', train_split='test')

humanevalx_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='{prompt}'),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=1024))

humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address="localhost",  # replace with your code_eval_server IP address
            port=5000),  # see https://github.com/Ezra-Yu/code-evaluator to launch the server
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # rust is not supported yet
}

humanevalx_datasets = [
    dict(
        type=HumanevalXDataset,
        abbr=f'humanevalx-{lang}',
        language=lang,
        path='./data/humanevalx',
        reader_cfg=humanevalx_reader_cfg,
        infer_cfg=humanevalx_infer_cfg,
        eval_cfg=humanevalx_eval_cfg_dict[lang])
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
1 change: 1 addition & 0 deletions docs/zh_cn/index.rst
@@ -56,6 +56,7 @@ OpenCompass 上手路线
   advanced_guides/new_dataset.md
   advanced_guides/new_model.md
   advanced_guides/evaluation_turbomind.md
   advanced_guides/code_eval_service.md

.. _工具:
.. toctree::
1 change: 1 addition & 0 deletions opencompass/datasets/__init__.py
@@ -30,6 +30,7 @@
from .hellaswag import * # noqa: F401, F403
from .huggingface import * # noqa: F401, F403
from .humaneval import * # noqa: F401, F403
from .humanevalx import * # noqa: F401, F403
from .iwslt2017 import * # noqa: F401, F403
from .jigsawmultilingual import * # noqa: F401, F403
from .lambada import * # noqa: F401, F403