4 changes: 2 additions & 2 deletions README.md
@@ -62,7 +62,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用


## 🎉 News
- 2023.1.4: Support for **VLLM deployment**, compatible with the OpenAI API style. For more details, please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署)
- 2023.1.4: Support for **VLLM deployment**, compatible with the **OpenAI API** style. For more details, please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署)
- 2023.1.4: Update [Benchmark](https://github.com/modelscope/swift/blob/main/docs/source/LLM/Benchmark.md) to facilitate viewing the training speed and GPU memory required for different models.
- 🔥 2023.12.29: Support web-ui for training and inference; use `swift web-ui` after installing ms-swift.
- 🔥 2023.12.29: Support DPO RLHF (Reinforcement Learning from Human Feedback) and two datasets for this task: AI-ModelScope/stack-exchange-paired and AI-ModelScope/hh-rlhf. Use [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh) to start training!
@@ -113,7 +113,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
- Quickly perform **inference** on LLM and build a **Web-UI**, see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
- Rapidly **fine-tune** and perform inference on LLM, and build a Web-UI. See the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md) and [WEB-UI Documentation](https://github.com/modelscope/swift/blob/main/docs/source/GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md).
- **DPO training** supported, start by using [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh).
- Utilize VLLM for **inference acceleration** and **deployment(openai API)**. Please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md) for more information.
- Utilize VLLM for **inference acceleration** and **deployment(OpenAI API)**. Please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md) for more information.
- View the models and datasets supported by Swift. You can check [supported models and datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
- Expand and customize models, datasets, and dialogue templates in Swift, see [Customization and Expansion](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
- Check command-line parameters for fine-tuning and inference, see [Command-Line parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
4 changes: 2 additions & 2 deletions README_CN.md
@@ -60,7 +60,7 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible
Users can check the [official SWIFT documentation](docs/source/GetStarted/快速使用.md) for detailed information.

## 🎉 News
- 2023.1.4: Support **VLLM deployment**, compatible with the openai API style. For details, see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署).
- 2023.1.4: Support **VLLM deployment**, compatible with the **OpenAI API** style. For details, see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署).
- 2023.1.4: Update [Benchmark](https://github.com/modelscope/swift/blob/main/docs/source/LLM/Benchmark.md) to make it easy to view the training speed and GPU memory required by different models.
- 🔥 2023.12.29: Support web-ui for SFT training and inference; after installing ms-swift, launch it with `swift web-ui`.
- 🔥 2023.12.29: Support DPO RLHF (Reinforcement Learning from Human Feedback) and two datasets for this task: AI-ModelScope/stack-exchange-paired and AI-ModelScope/hh-rlhf. Use [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh) to start training!
@@ -111,7 +111,7 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible
- Quickly perform **inference** on LLMs and build a **Web-UI**; see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
- Rapidly **fine-tune** LLMs, run inference, and build a Web-UI; see the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md) and the [WEB-UI Documentation](https://github.com/modelscope/swift/blob/main/docs/source/GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md).
- **DPO training** supported; use [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh) to start training.
- Use VLLM for **inference acceleration** and **deployment (openai API)**; see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
- Use VLLM for **inference acceleration** and **deployment (OpenAI API)**; see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
- View the models and datasets supported by swift; see [Supported Models and Datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
- **Extend** and customize models, datasets, and chat templates in swift; see [Customization and Extension](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
- Look up the command-line parameters for fine-tuning and inference; see [Command-Line Parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
120 changes: 103 additions & 17 deletions docs/source/LLM/VLLM推理加速与部署.md
@@ -19,6 +19,7 @@ pip install -e .[llm]

# vllm must match your CUDA version; choose the right version according to `https://docs.vllm.ai/en/latest/getting_started/installation.html`
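# (sketch, not part of the original diff) to check which CUDA version
# your torch build was compiled for before choosing a vllm wheel:
python -c 'import torch; print(torch.version.cuda)'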
pip install vllm -U
pip install openai -U

# Environment alignment (if you run into errors, run the code below; the repo is tested with the latest environment)
pip install -r requirements/framework.txt -U
@@ -239,14 +240,50 @@ Swift uses VLLM as the inference backend and is compatible with the OpenAI API style.
For the client-side OpenAI API parameters, refer to: https://platform.openai.com/docs/api-reference/introduction.

### Original Models
**qwen-7b-chat**
#### qwen-7b-chat

Server:
**Server:**
```bash
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b-chat
```

Client:
**Client:**

Using swift:
```python
from swift.llm import get_model_list_client, XRequest, inference_client

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江的省会在哪里?'
request_kwargs = XRequest(model=model_type, seed=42)
resp = inference_client(query, request_kwargs=request_kwargs)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

history = [(query, response)]
query = '这有什么好吃的?'
request_kwargs = XRequest(model=model_type, stream=True, seed=42)
stream_resp = inference_client(query, history, request_kwargs=request_kwargs)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].delta.content, end='', flush=True)
print()

"""Out[0]
model_type: qwen-7b-chat
query: 浙江的省会在哪里?
response: 浙江省的省会是杭州市。
query: 这有什么好吃的?
response: 杭州有许多美食,例如西湖醋鱼、东坡肉、龙井虾仁、叫化童子鸡等。此外,杭州还有许多特色小吃,如西湖藕粉、杭州小笼包、杭州油条等。
"""
```

Using openai:
```python
from openai import OpenAI
client = OpenAI(
Expand All @@ -263,7 +300,8 @@ messages = [{
}]
resp = client.chat.completions.create(
model=model_type,
messages=messages)
messages=messages,
seed=42)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')
Expand All @@ -272,14 +310,15 @@ print(f'response: {response}')
messages.append({'role': 'assistant', 'content': response})
query = '这有什么好吃的?'
messages.append({'role': 'user', 'content': query})
stream = client.chat.completions.create(
stream_resp = client.chat.completions.create(
model=model_type,
messages=messages,
stream=True)
stream=True,
seed=42)

print(f'query: {query}')
print('response: ', end='')
for chunk in stream:
for chunk in stream_resp:
print(chunk.choices[0].delta.content, end='', flush=True)
print()

Expand All @@ -288,19 +327,67 @@ model_type: qwen-7b-chat
query: 浙江的省会在哪里?
response: 浙江省的省会是杭州市。
query: 这有什么好吃的?
response:
浙江省是一个美食天堂,有着丰富多样的美食,如新鲜海鲜、麻糍、竹筒饭、西湖醋鱼、小吃等。至于具体哪个更好吃,可能还要看您个人的口味。
response: 杭州有许多美食,例如西湖醋鱼、东坡肉、龙井虾仁、叫化童子鸡等。此外,杭州还有许多特色小吃,如西湖藕粉、杭州小笼包、杭州油条等。
"""
```
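Since the server speaks the OpenAI API style over plain HTTP, the endpoint can also be smoke-tested without any client library. A minimal sketch, assuming the service listens on the default `http://127.0.0.1:8000` (adjust the host and port if you changed the deployment settings; the request body is illustrative):

```bash
# Hedged example: host, port, and payload are assumptions, not from the diff;
# the path follows the standard OpenAI chat-completions API.
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-7b-chat",
    "messages": [{"role": "user", "content": "浙江的省会在哪里?"}],
    "seed": 42
  }'
```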

**qwen-7b**
#### qwen-7b

Server:
**Server:**
```bash
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b
```

Client:
**Client:**

Using swift:
```python
from swift.llm import get_model_list_client, XRequest, inference_client

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
request_kwargs = XRequest(model=model_type, max_tokens=32, temperature=0.1, seed=42)
resp = inference_client(query, request_kwargs=request_kwargs)
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')

request_kwargs.stream = True
stream_resp = inference_client(query, request_kwargs=request_kwargs)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].text, end='', flush=True)
print()

"""Out[0]
model_type: qwen-7b
query: 浙江 -> 杭州
安徽 -> 合肥
四川 ->
response: 成都
广东 -> 广州
江苏 -> 南京
浙江 -> 杭州
安徽 -> 合肥
四川 -> 成都

query: 浙江 -> 杭州
安徽 -> 合肥
四川 ->
response: 成都
广东 -> 广州
江苏 -> 南京
浙江 -> 杭州
安徽 -> 合肥
四川 -> 成都
"""
```

Using openai:
```python
from openai import OpenAI
client = OpenAI(
Expand All @@ -311,20 +398,19 @@ model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')

query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
kwargs = {'model': model_type, 'prompt': query, 'seed': 42, 'temperature': 0., 'max_tokens': 32}
kwargs = {'model': model_type, 'prompt': query, 'seed': 42, 'temperature': 0.1, 'max_tokens': 32}

resp = client.completions.create(**kwargs)
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')

# streaming
query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
stream = client.completions.create(stream=True, **kwargs)
stream_resp = client.completions.create(stream=True, **kwargs)
response = resp.choices[0].text
print(f'query: {query}')
print('response: ', end='')
for chunk in stream:
for chunk in stream_resp:
print(chunk.choices[0].text, end='', flush=True)
print()

@@ -360,4 +446,4 @@ swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx'
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged'
```

The client example code is the same as for the original model.
2 changes: 1 addition & 1 deletion docs/source/LLM/命令行参数.md
@@ -9,7 +9,7 @@
- `--model_id_or_path`: the `model_id` of the model on the ModelScope Hub, case-insensitive; defaults to `None`. If `--model_id_or_path` is not registered, an exception is raised. You can specify the model type either via `model_type` or via `model_id_or_path`.
- `--model_revision`: the revision of the corresponding `model_id` on the ModelScope Hub; defaults to `None`. If `model_revision` is `None`, the revision registered in `MODEL_MAPPING` is used; otherwise the `model_revision` passed on the command line is used.
- `--model_cache_dir`: defaults to `None`. If the model is already cached locally and the cache path is not ModelScope's default cache path, you can use this argument to load the model and tokenizer from that cache_dir.
- `--sft_type`: the fine-tuning method; defaults to `'lora'`. Available values: 'lora', 'full'. To use qlora, set `--sft_type lora --quantization_bit 4`.
- `--sft_type`: the fine-tuning method; defaults to `'lora'`. Available values: 'lora', 'full', 'longlora', 'qalora'. To use qlora, set `--sft_type lora --quantization_bit 4` (a minimal invocation sketch follows this list).
- `--freeze_parameters`: when sft_type is set to 'full', freezes the bottommost parameters of the model. The range is 0. ~ 1., defaulting to `0.`. This provides a compromise between lora and full-parameter fine-tuning.
- `--tuner_backend`: the backend support for lora and qlora; defaults to `'swift'`. Available values: 'swift', 'peft'.
- `--template_type`: the type of chat template to use; defaults to `'AUTO'`, which looks up the `template` in `MODEL_MAPPING` according to `model_type`. The available `template_type` values can be found in `TEMPLATE_MAPPING.keys()`.
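As noted in the `--sft_type` entry above, qlora is not a separate `sft_type`: it is lora combined with 4-bit quantization. A minimal invocation sketch (the model choice is illustrative, not prescribed by this document):

```bash
# qlora = lora + 4-bit quantization; qwen-7b-chat is only an example model_type
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --quantization_bit 4
```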
1 change: 1 addition & 0 deletions requirements/framework.txt
@@ -1,4 +1,5 @@
accelerate
dacite
datasets
jieba
matplotlib
2 changes: 2 additions & 0 deletions requirements/llm.txt
@@ -1,6 +1,8 @@
charset_normalizer
cpm_kernels
fastapi
gradio>=3.40.0
sentencepiece
tiktoken
transformers_stream_generator
uvicorn
2 changes: 1 addition & 1 deletion swift/llm/deploy.py
@@ -217,7 +217,7 @@ async def _generate_stream():
usage=usage_info,
id=request_id,
created=created_time)
yield f'data:{json.dumps(asdict(response))}\n\n'
yield f'data:{json.dumps(asdict(response), ensure_ascii=False)}\n\n'
yield 'data:[DONE]\n\n'

if request.stream:
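The `ensure_ascii=False` change above matters for non-ASCII model output: with the default `ensure_ascii=True`, `json.dumps` escapes every Chinese character as a `\uXXXX` sequence, which inflates the streamed SSE payload and makes it unreadable when inspected. A standalone illustration (not part of the diff):

```python
import json

chunk = {'content': '杭州'}
print(json.dumps(chunk))                      # {"content": "\u676d\u5dde"}  (escaped)
print(json.dumps(chunk, ensure_ascii=False))  # {"content": "杭州"}  (raw UTF-8)
```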
39 changes: 15 additions & 24 deletions swift/llm/tuner.py
@@ -5,28 +5,28 @@
from swift.tuners import (LongLoRAConfig, LongLoRAModelType, LoraConfig,
LoRAConfig, NEFTuneConfig, Swift)
from swift.utils import freeze_model_parameters, get_logger
from .utils import SftArguments, find_all_linear_for_lora
from .utils import SftArguments, find_all_linear_for_lora, is_lora

logger = get_logger()


def prepare_model(model, args: SftArguments):
# Preparing LoRA
if args.sft_type in ('lora', 'qalora', 'longlora'):
if is_lora(args.sft_type):
if args.resume_from_checkpoint is None:
if 'ALL' in args.lora_target_modules:
assert len(args.lora_target_modules) == 1
args.lora_target_modules = find_all_linear_for_lora(
model, args.quantization_bit, args.model_type)
logger.info(
f'Setting lora_target_modules: {args.lora_target_modules}')
lora_kwargs = {
'r': args.lora_rank,
'target_modules': args.lora_target_modules,
'lora_alpha': args.lora_alpha,
'lora_dropout': args.lora_dropout_p
}
if args.sft_type == 'lora':
lora_kwargs = {
'r': args.lora_rank,
'target_modules': args.lora_target_modules,
'lora_alpha': args.lora_alpha,
'lora_dropout': args.lora_dropout_p
}
if args.tuner_backend == 'swift':
lora_config = LoRAConfig(
lora_dtype=args.lora_dtype, **lora_kwargs)
Expand All @@ -36,35 +36,26 @@ def prepare_model(model, args: SftArguments):
model = Swift.prepare_model(model, lora_config)
logger.info(f'lora_config: {lora_config}')
elif args.sft_type == 'longlora':
assert args.tuner_backend != 'peft', (
'peft does not support longlora. You need to set `--tuner_backend swift`.'
)
assert args.tuner_backend == 'swift'
assert LongLoRAModelType.LLAMA in args.model_type
longlora_config = LongLoRAConfig(
r=args.lora_rank,
target_modules=args.lora_target_modules,
lora_alpha=args.lora_alpha,
lora_dropout=args.lora_dropout_p,
lora_dtype=args.lora_dtype,
model_type=LongLoRAModelType.LLAMA,
use_flash_attn=args.use_flash_attn)
use_flash_attn=args.use_flash_attn,
**lora_kwargs)
model = Swift.prepare_model(model, longlora_config)
logger.info(f'longlora_config: {longlora_config}')
elif args.sft_type == 'qalora':
assert args.tuner_backend == 'swift'
assert getattr(
model, 'quantization_method',
None) == 'gptq', 'qalora must be used with auto_gptq'
lora_kwargs = {}
lora_config = LoRAConfig(
r=args.lora_rank,
target_modules=args.lora_target_modules,
lora_alpha=args.lora_alpha,
lora_dropout=args.lora_dropout_p,
qalora_config = LoRAConfig(
lora_dtype=args.lora_dtype,
use_qa_lora=True,
**lora_kwargs)
model = Swift.prepare_model(model, lora_config)
logger.info(f'lora_config: {lora_config}')
model = Swift.prepare_model(model, qalora_config)
logger.info(f'qalora_config: {qalora_config}')
else:
model = Swift.from_pretrained(
model, args.resume_from_checkpoint, is_trainable=True)
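The `is_lora` helper imported from `.utils` above (and re-exported from `.argument` in the next file) presumably centralizes the tuner-family check that the old condition spelled out inline. A plausible sketch, not the verbatim definition:

```python
# Sketch only: the real definition lives in swift/llm/utils/argument.py.
# It mirrors the old inline check `args.sft_type in ('lora', 'qalora', 'longlora')`.
def is_lora(sft_type: str) -> bool:
    return sft_type in ('lora', 'longlora', 'qalora')
```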
8 changes: 5 additions & 3 deletions swift/llm/utils/__init__.py
@@ -1,6 +1,7 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from .argument import (DeployArguments, DPOArguments, InferArguments,
RomeArguments, SftArguments)
RomeArguments, SftArguments, is_lora)
from .client_utils import get_model_list_client, inference_client
from .dataset import (DATASET_MAPPING, DatasetName, GetDatasetFunction,
HfDataset, add_self_cognition_dataset, get_dataset,
get_dataset_from_repo, load_dataset_from_local,
Expand All @@ -23,9 +24,10 @@
CompletionResponseChoice,
CompletionResponseStreamChoice,
CompletionStreamResponse, DeltaMessage, Model,
ModelList, UsageInfo, random_uuid)
ModelList, UsageInfo, XRequest, random_uuid)
from .template import (DEFAULT_SYSTEM, TEMPLATE_MAPPING, History, Prompt,
Template, TemplateType, get_template, register_template)
StopWords, Template, TemplateType, get_template,
register_template)
from .utils import (LazyLLMDataset, LLMDataset, data_collate_fn, dataset_map,
download_dataset, find_all_linear_for_lora, get_time_info,
history_to_messages, inference, inference_stream,
Expand Down