Dev gzf deepspeed (#1833)

* update with main (#1817) * add cmakelist * add paraformer-torch * add debug for funasr-onnx-offline * fix redefinition of jieba StdExtension.hpp * add loading torch models * update funasr-onnx-offline * add SwitchArg for wss-server * add SwitchArg for funasr-onnx-offline * update cmakelist * update funasr-onnx-offline-rtf * add define condition * add gpu define for offlne-stream * update com define * update offline-stream * update cmakelist * update func CompileHotwordEmbedding * add timestamp for paraformer-torch * add C10_USE_GLOG for paraformer-torch * update paraformer-torch * fix func FunASRWfstDecoderInit * update model.h * fix func FunASRWfstDecoderInit * fix tpass_stream * update paraformer-torch * add bladedisc for funasr-onnx-offline * update comdefine * update funasr-wss-server * add log for torch * fix GetValue BLADEDISC * fix log * update cmakelist * update warmup to 10 * update funasrruntime * add batch_size for wss-server * add batch for bins * add batch for offline-stream * add batch for paraformer * add batch for offline-stream * fix func SetBatchSize * add SetBatchSize for model * add SetBatchSize for model * fix func Forward * fix padding * update funasrruntime * add dec reset for batch * set batch default value * add argv for CutSplit * sort frame_queue * sorted msgs * fix FunOfflineInfer * add dynamic batch for fetch * fix FetchDynamic * update run_server.sh * update run_server.sh * cpp http post server support (#1739) * add cpp http server * add some comment * remove some comments * del debug infos * restore run_server.sh * adapt to new model struct * 修复了onnxruntime在macos下编译失败的错误 (#1748) * Add files via upload 增加macos的编译支持 * Add files via upload 增加macos支持 * Add files via upload target_link_directories(funasr PUBLIC ${ONNXRUNTIME_DIR}/lib) target_link_directories(funasr PUBLIC ${FFMPEG_DIR}/lib) 添加 if(APPLE) 限制 --------- Co-authored-by: Yabin Li <wucong.lyb@alibaba-inc.com> * Delete docs/images/wechat.png * Add files via upload * fixed the issues about seaco-onnx timestamp * fix bug (#1764) 当语音识别结果包含 `http` 时，标点符号预测会把它会被当成 url * fix empty asr result (#1765) 解码结果为空的语音片段，text 用空字符串 * update export * update export * docs * docs * update export name * docs * update * docs * docs * keep empty speech result (#1772) * docs * docs * update wechat QRcode * Add python funasr api support for websocket srv (#1777) * add python funasr_api supoort * change little to README.md * add core tools stream * modified a little * fix bug for timeout * support for buffer decode * add ffmpeg decode for buffer * libtorch demo * update libtorch infer * update utils * update demo * update demo * update libtorch inference * update model class * update seaco paraformer * bug fix * bug fix * auto frontend * auto frontend * auto frontend * auto frontend * auto frontend * auto frontend * auto frontend * auto frontend * Dev gzf exp (#1785) * resume from step * batch * batch * batch * batch * batch * batch * batch * batch * batch * batch * batch * batch * batch * batch * batch * train_loss_avg train_acc_avg * train_loss_avg train_acc_avg * train_loss_avg train_acc_avg * log step * wav is not exist * wav is not exist * decoding * decoding * decoding * wechat * decoding key * decoding key * decoding key * decoding key * decoding key * decoding key * dynamic batch * start_data_split_i=0 * total_time/accum_grad * total_time/accum_grad * total_time/accum_grad * update avg slice * update avg slice * sensevoice sanm * sensevoice sanm * sensevoice sanm --------- Co-authored-by: 北念 <lzr265946@alibaba-inc.com> * auto frontend * update paraformer timestamp * [Optimization] support bladedisc fp16 optimization (#1790) * add cif_v1 and cif_export * Update SDK_advanced_guide_offline_zh.md * add cif_wo_hidden_v1 * [fix] fix empty asr result (#1794) * english timestamp for valilla paraformer * wechat * [fix] better solution for handling empty result (#1796) * update scripts * modify the qformer adaptor (#1804) Co-authored-by: nichongjia-2007 <nichongjia@gmail.com> * add ctc inference code (#1806) Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com> * Update auto_model.py 修复空字串进入speaker model时报raw_text变量不存在的bug * Update auto_model.py 修复识别出空串后spk_model内变量未定义问题 * update model name * fix paramter 'quantize' unused issue (#1813) Co-authored-by: ZihanLiao <liaozihan1@xdf.cn> * wechat * Update cif_predictor.py (#1811) * Update cif_predictor.py * modify cif_v1_export under extreme cases, max_label_len calculated by batch_len misaligns with token_num * Update cif_predictor.py torch.cumsum precision degradation, using float64 instead * update code --------- Co-authored-by: 雾聪 <wucong.lyb@alibaba-inc.com> Co-authored-by: zhaomingwork <61895407+zhaomingwork@users.noreply.github.com> Co-authored-by: szsteven008 <97944818+szsteven008@users.noreply.github.com> Co-authored-by: Ephemeroptera <605686962@qq.com> Co-authored-by: 彭震东 <zhendong.peng@qq.com> Co-authored-by: Shi Xian <40013335+R1ckShi@users.noreply.github.com> Co-authored-by: 维石 <shixian.shi@alibaba-inc.com> Co-authored-by: 北念 <lzr265946@alibaba-inc.com> Co-authored-by: xiaowan0322 <wanchen.swc@alibaba-inc.com> Co-authored-by: zhuangzhong <zhuangzhong@corp.netease.com> Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com> Co-authored-by: nichongjia-2007 <nichongjia@gmail.com> Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com> Co-authored-by: liugz18 <57401541+liugz18@users.noreply.github.com> Co-authored-by: Marlowe <54339989+ZihanLiao@users.noreply.github.com> Co-authored-by: ZihanLiao <liaozihan1@xdf.cn> Co-authored-by: zhong zhuang <zhuangz@lamda.nju.edu.cn> * sensevoice * sensevoice * sensevoice * sensevoice * sensevoice * sensevoice * sensevoice * sensevoice * sensevoice * sensevoice --------- Co-authored-by: 雾聪 <wucong.lyb@alibaba-inc.com> Co-authored-by: zhaomingwork <61895407+zhaomingwork@users.noreply.github.com> Co-authored-by: szsteven008 <97944818+szsteven008@users.noreply.github.com> Co-authored-by: Ephemeroptera <605686962@qq.com> Co-authored-by: 彭震东 <zhendong.peng@qq.com> Co-authored-by: Shi Xian <40013335+R1ckShi@users.noreply.github.com> Co-authored-by: 维石 <shixian.shi@alibaba-inc.com> Co-authored-by: 北念 <lzr265946@alibaba-inc.com> Co-authored-by: xiaowan0322 <wanchen.swc@alibaba-inc.com> Co-authored-by: zhuangzhong <zhuangzhong@corp.netease.com> Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com> Co-authored-by: nichongjia-2007 <nichongjia@gmail.com> Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com> Co-authored-by: liugz18 <57401541+liugz18@users.noreply.github.com> Co-authored-by: Marlowe <54339989+ZihanLiao@users.noreply.github.com> Co-authored-by: ZihanLiao <liaozihan1@xdf.cn> Co-authored-by: zhong zhuang <zhuangz@lamda.nju.edu.cn>
modelscope · Jun 20, 2024 · e65b1f7 · e65b1f7
1 parent 45d7aa9
commit e65b1f7
Show file tree

Hide file tree

Showing 10 changed files with 270 additions and 64 deletions.
diff --git a/examples/industrial_data_pretraining/llm_asr/demo_speech2text.py b/examples/industrial_data_pretraining/llm_asr/demo_speech2text.py
@@ -9,19 +9,20 @@
 
 from funasr import AutoModel
 
-ckpt_dir = "/nfs/beinian.lzr/workspace/GPT-4o/Exp/exp6/5m-8gpu/exp6_speech2text_linear_ddp_0609"
-ckpt_id = "model.pt.ep0.90000"
-jsonl = (
-    "/nfs/beinian.lzr/workspace/GPT-4o/Data/Speech2Text/TestData/aishell1_test_speech2text.jsonl"
-)
-output_dir = f"{os.path.join(ckpt_dir, ckpt_id)}"
-device = "cuda:0"
-
-ckpt_dir = sys.argv[1]
-ckpt_id = sys.argv[2]
-jsonl = sys.argv[3]
-output_dir = sys.argv[4]
-device = sys.argv[5]
+if len(sys.argv) > 1:
+    ckpt_dir = sys.argv[1]
+    ckpt_id = sys.argv[2]
+    jsonl = sys.argv[3]
+    output_dir = sys.argv[4]
+    device = sys.argv[5]
+else:
+    ckpt_dir = "/nfs/beinian.lzr/workspace/GPT-4o/Exp/exp6/5m-8gpu/exp6_speech2text_linear_ddp_0609"
+    ckpt_id = "model.pt.ep0.90000"
+    jsonl = "/nfs/beinian.lzr/workspace/GPT-4o/Data/Speech2Text/TestData/aishell1_test_speech2text.jsonl"
+    dataset = jsonl.split("/")[-1]
+    output_dir = os.path.join(ckpt_dir, f"inference-{ckpt_id}", dataset)
+    device = "cuda:0"
+
 
 model = AutoModel(
     model=ckpt_dir,

diff --git a/examples/industrial_data_pretraining/llm_asr/demo_speech2text_multi.py b/examples/industrial_data_pretraining/llm_asr/demo_speech2text_multi.py
@@ -0,0 +1,76 @@
+#!/usr/bin/env python3
+# -*- encoding: utf-8 -*-
+# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
+#  MIT License  (https://opensource.org/licenses/MIT)
+
+import json
+import os
+import sys
+
+from funasr import AutoModel
+
+
+if len(sys.argv) > 1:
+    ckpt_dir = sys.argv[1]
+    ckpt_id = sys.argv[2]
+    jsonl = sys.argv[3]
+    output_dir = sys.argv[4]
+    device = sys.argv[5]
+else:
+    ckpt_dir = "/nfs/beinian.lzr/workspace/GPT-4o/Exp/exp7/5m-8gpu/exp5-1-0619"
+    ckpt_id = "model.pt.ep6"
+    jsonl = (
+        "/nfs/beinian.lzr/workspace/GPT-4o/Data/Speech2Text/TestData/s2tchat.v20240619.test.jsonl"
+    )
+    dataset = jsonl.split("/")[-1]
+    output_dir = os.path.join(ckpt_dir, f"inference-{ckpt_id}", dataset)
+
+
+model = AutoModel(
+    model=ckpt_dir,
+    init_param=f"{os.path.join(ckpt_dir, ckpt_id)}",
+    output_dir=output_dir,
+    device=device,
+    fp16=False,
+    bf16=False,
+    llm_dtype="bf16",
+)
+
+
+with open(jsonl, "r") as f:
+    lines = f.readlines()
+
+tearchforing = False
+for i, line in enumerate(lines):
+
+    key_i = f"dialog_{i}"
+
+    data_dict = json.loads(line.strip())
+    data = data_dict["messages"]
+
+    contents = model.model.data_template(data)
+
+    system = contents["system"]
+    user = contents["user"]
+    assistant = contents["assistant"]
+
+    system_i, user_i, assistant_i = [], [], []
+
+    contents_i = []
+    for j, (system_prompt, user_prompt, target_out) in enumerate(zip(system, user, assistant)):
+        key = f"{key_i}_turn_{j}"
+
+        if j == 0:
+            contents_i.append({"role": "system", "content": system_prompt})
+
+        contents_i.append({"role": "user", "content": user_prompt})
+        contents_i.append({"role": "assistant", "content": target_out})
+
+        res = model.generate(
+            input=[contents_i],
+            tearchforing=tearchforing,
+            cache={},
+            key=key,
+        )
+
+        print(res)
diff --git a/funasr/__init__.py b/funasr/__init__.py
@@ -1,8 +1,6 @@
 """Initialize funasr package."""
 
 import os
-import pkgutil
-import importlib
 
 dirname = os.path.dirname(__file__)
 version_file = os.path.join(dirname, "version.txt")

diff --git a/funasr/auto/auto_model.py b/funasr/auto/auto_model.py
@@ -92,7 +92,8 @@ def prepare_data_iterator(data_in, input_len=None, data_type=None, key=None):
                 if isinstance(data_i, str) and os.path.exists(data_i):
                     key = misc.extract_filename_without_extension(data_i)
                 else:
-                    key = "rand_key_" + "".join(random.choice(chars) for _ in range(13))
+                    if key is None:
+                        key = "rand_key_" + "".join(random.choice(chars) for _ in range(13))
                 key_list.append(key)
 
     else:  # raw text; audio sample point, fbank; bytes

diff --git a/funasr/datasets/openai_datasets/datasets.py b/funasr/datasets/openai_datasets/datasets.py
@@ -283,10 +283,11 @@ def __init__(
 
         self.pattern = re.compile(r"(<\|startofspeech\|>.*?<\|endofspeech\|>)")
         # self.kwargs = kwargs
-        self.max_token_length = kwargs.get("max_token_length", 1024)
+        self.max_token_length = kwargs.get("max_token_length", 1500)
         self.batch_size_scale_ratio_max = kwargs.get("batch_size_scale_ratio_max", 1.5)
         self.batch_size_token_max = kwargs.get("batch_size_token_max", 2500)
         self.multiturn_num_max = kwargs.get("multiturn_num_max", 5)
+        self.max_source_length = kwargs.get("max_source_length", 3000)
 
     def get_source_len(self, index):
         item = self.index_ds[index]
@@ -334,6 +335,12 @@ def __getitem__(self, index):
             ):
                 if i >= self.multiturn_num_max:
                     break
+                if len(input_ids) > self.max_token_length:
+                    logging.info(
+                        f"input_ids > max_token_length: {len(input_ids)}>{self.max_token_length}, {item}"
+                    )
+                    break
+
                 if i == 0:
                     source_input = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"
                 else:
@@ -372,6 +379,11 @@ def __getitem__(self, index):
                                 frontend=self.frontend,
                                 is_final=True,
                             )  # speech: [b, T, d]
+                            if speech_lengths > self.max_source_length:
+                                logging.info(
+                                    f"speech_lengths > max_source_length: {speech_lengths}>{self.max_source_length}, {item}"
+                                )
+                                badcase_flag = True
                             if self.permute:
                                 speech = speech.permute(0, 2, 1)
                             # if speech_lengths > self.batch_size:
@@ -399,13 +411,9 @@ def __getitem__(self, index):
                 fbank_mask += fbank_mask_i
                 fbank_lens.append(speech_lengths)
 
-            if len(input_ids) > self.max_token_length:
-                logging.info(
-                    f"input_ids > max_token_length: {len(input_ids)}>{self.max_token_length}, {item}"
-                )
-                badcase_flag = True
             if badcase_flag:
                 continue
+
             input_ids = torch.tensor(input_ids, dtype=torch.int64)  # [: self.max_token_length]
             attention_mask = torch.tensor([1] * len(input_ids), dtype=torch.int32)
             labels = torch.tensor(labels, dtype=torch.int64)  # [: self.max_token_length]

diff --git a/funasr/datasets/openai_datasets/index_ds.py b/funasr/datasets/openai_datasets/index_ds.py
@@ -16,6 +16,12 @@ class OpenAIIndexDSJsonl(torch.utils.data.Dataset):  # torch.utils.data.Dataset
     def __init__(self, path: str, **kwargs):
         super().__init__()
 
+        self.max_source_length = kwargs.get("max_source_length", 3000)
+        self.min_source_length = kwargs.get("min_source_length", 0)
+        self.max_target_length = kwargs.get("max_target_length", 2048)
+        self.min_target_length = kwargs.get("min_target_length", 0)
+        self.max_token_length = kwargs.get("max_token_length", 2200)
+
         is_training = kwargs.get("is_training", True)
         if not (path.endswith(".jsonl") or path.endswith(".json")):
             # jsonl list file
@@ -47,6 +53,15 @@ def __init__(self, path: str, **kwargs):
                     data = data_dict["messages"]
                     speech_length = data_dict.get("speech_length", -1) // 8
                     text_length = data_dict.get("text_length", 0)
+                    if speech_length > self.max_source_length:
+                        logging.info(
+                            "speech_length: {speech_length} > {self.max_source_length}, drop it"
+                        )
+                        continue
+                    if text_length > self.max_target_length:
+                        continue
+
+                    self.max_target_length = kwargs.get("max_target_length", 2048)
 
                     system, user, assistant = [], [], []
                     for i, item in enumerate(data):

diff --git a/funasr/download/download_from_hub.py b/funasr/download/download_from_hub.py
@@ -84,6 +84,12 @@ def download_from_ms(**kwargs):
         from funasr.utils.install_model_requirements import install_requirements
 
         install_requirements(requirements)
+    if kwargs.get("trust_remote_code", False):
+
+        import model
+
+        # from funasr.register import tables
+        # tables.print("model")
     return kwargs