Refactor dataset #802

Merged: 46 commits, May 6, 2024

Jintao-Huang (Collaborator) commented Apr 25, 2024

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support
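  1. Support datasets specified in the following formats:
    1. MS and HF hubs, plus sampling via dataset_sample. e.g. MS::alpaca-zh#200, HF::jd-sentiment-zh#200 (the default hub is controlled by the USE_HF environment variable and defaults to MS).
    2. Finer-grained control over subsets: by default, the subsets specified at registration are used (or 'default' if none were specified at registration). e.g. sharegpt-gpt4. If subsets are specified, the corresponding subsets of the dataset are used. e.g. sharegpt-gpt4:default/V3_format#2000. Subsets are separated by '/'. (Reason: ',' is unsuitable, and '|' has a special meaning on the command line.)
    3. dataset_id support. e.g. AI-ModelScope/alpaca-gpt4-data-zh#20, HF::llm-wizard/alpaca-gpt4-data-zh#20, hurner/alpaca-gpt4-data-zh#20, HF::shibing624/alpaca-zh#20. (If the dataset_id is already registered, the preprocessing function given at registration is used. Otherwise smartpreprocessor is used, with the 'default' subset and split set to 'train'.)
    4. Unified handling of the self-cognition dataset, custom_train_dataset_path, and custom_val_dataset_path. e.g. self-cognition#500, _custom#2000
    5. dataset_path support. e.g. alpaca.csv, swift_multi.jsonl, swift_multi.json#2. (Sampling is supported. A relative path is resolved relative to the directory the command is run from.)
  2. Backward compatibility with previously used command-line parameters: train_dataset_sample, val_dataset_sample, self_cognition_sample, train_dataset_mix_ratio, and train_dataset_mix_ds can still be used for training. Previously used dataset_name values are also still accepted, so earlier best practices continue to run.
  3. dataset_info.json as an additional way to "register" datasets (function-based registration is still used underneath). It supplements the existing registration mechanism and has the advantage of being simple. It supports dataset_id (MS and HF) and dataset_path. (A relative path is resolved relative to the directory containing dataset_info.json, which differs from passing the dataset directly.)
  4. Update sh, docs, and test_run.py to fit the new dataset registration and usage mechanism.
  5. Recompute the dataset statistics.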
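
To make the dataset string format in item 1 concrete, here is a minimal sketch of how such a string could be split into hub, name, subsets, and sample count. It is not the parser added in this PR: `DatasetSpec`, `default_hub`, and `parse_dataset_spec` are hypothetical names used only for illustration, and the accepted values of USE_HF ("1" or "true") are an assumption.

```python
import os
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetSpec:
    """Hypothetical container for one parsed dataset string."""
    hub: str                       # "MS" or "HF"
    name: str                      # registered name, dataset_id, or local path
    subsets: List[str] = field(default_factory=list)  # empty: use registered defaults
    sample: Optional[int] = None   # None: use the whole dataset


def default_hub() -> str:
    # Assumption: USE_HF counts as enabled when set to "1" or "true".
    use_hf = os.environ.get("USE_HF", "0").lower() in ("1", "true")
    return "HF" if use_hf else "MS"


def parse_dataset_spec(spec: str) -> DatasetSpec:
    """Split '[HUB::]name[:subset1/subset2][#sample]' into its parts."""
    hub = default_hub()
    prefix = re.match(r"(MS|HF)::", spec)
    if prefix:
        hub = prefix.group(1)
        spec = spec[prefix.end():]

    # The sample count follows the last '#'.
    sample = None
    if "#" in spec:
        spec, sample_str = spec.rsplit("#", 1)
        sample = int(sample_str)

    # Subsets follow the first ':' and are separated by '/'.
    # Limitation of this sketch: a path containing ':' (e.g. a Windows
    # drive letter) would be misread as a subset list.
    subsets: List[str] = []
    if ":" in spec:
        spec, subset_str = spec.split(":", 1)
        subsets = subset_str.split("/")

    return DatasetSpec(hub=hub, name=spec, subsets=subsets, sample=sample)


if __name__ == "__main__":
    for s in ("MS::alpaca-zh#200",
              "sharegpt-gpt4:default/V3_format#2000",
              "HF::llm-wizard/alpaca-gpt4-data-zh#20",
              "swift_multi.json#2"):
        print(parse_dataset_spec(s))
```

Note that the '/' inside a dataset_id such as AI-ModelScope/alpaca-gpt4-data-zh is left untouched, because only the part after ':' is treated as a subset list.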

Jintao-Huang marked this pull request as draft April 25, 2024 07:56
Jintao-Huang marked this pull request as ready for review May 5, 2024 04:57
tastelikefeet merged commit 9dc41ab into modelscope:main May 6, 2024
2 checks passed
tastelikefeet added a commit to tastelikefeet/swift that referenced this pull request May 10, 2024
* main: (24 commits)
  fix pre-commit
  traindataset exception message (modelscope#859)
  Feat/pack (modelscope#881)
  fix swift cli exit code if subprocess is failed (modelscope#879)
  support Deepseek-V2-Chat and InternVL-Chat-V1.5-int8 model  (modelscope#876)
  add llava-llama (modelscope#873)
  support ORPO algorithm (modelscope#854)
  fix dataset info deepcopy (modelscope#871)
  fix dataset_test_ratio=1 (modelscope#869)
  Refactor dataset (modelscope#802)
  Feat/loras (modelscope#865)
  Update tuner docs (modelscope#853)
  update docs (modelscope#850)
  Fix code format and docs (modelscope#847)
  update (modelscope#846)
  fix xcomposer device_map (modelscope#844)
  fix merge_lora_dtype (modelscope#842)
  Fix infer default dtype (modelscope#834)
  fix ui (modelscope#830)
  support Internvl-chat-v1.5 model (modelscope#824)
  ...

# Conflicts:
#	docs/source/LLM/自定义与拓展.md
#	docs/source_en/LLM/Customization.md
#	examples/pytorch/llm/custom.py
#	scripts/benchmark/exp_utils.py
#	scripts/utils/run_dataset_info.py
#	swift/aigc/diffusers/train_controlnet.py
#	swift/aigc/diffusers/train_controlnet_sdxl.py
#	swift/aigc/diffusers/train_text_to_image.py
#	swift/aigc/diffusers/train_text_to_image_lora.py
#	swift/aigc/diffusers/train_text_to_image_lora_sdxl.py
#	swift/aigc/diffusers/train_text_to_image_sdxl.py
#	swift/llm/__init__.py
#	swift/llm/deploy.py
#	swift/llm/dpo.py
#	swift/llm/export.py
#	swift/llm/infer.py
#	swift/llm/sft.py
#	swift/llm/tuner.py
#	swift/llm/utils/__init__.py
#	swift/llm/utils/argument.py
#	swift/llm/utils/client_utils.py
#	swift/llm/utils/dataset.py
#	swift/llm/utils/model.py
#	swift/llm/utils/preprocess.py
#	swift/llm/utils/template.py
#	swift/llm/utils/utils.py
#	swift/trainers/trainers.py
#	swift/tuners/base.py
#	swift/ui/llm_infer/llm_infer.py
#	swift/ui/llm_infer/runtime.py
#	swift/ui/llm_train/dataset.py
#	swift/ui/llm_train/llm_train.py
#	tests/llm/test_run.py