[bug] The codec.txt generated in the encoding stage cannot be read directly? #34

zyy-fc · 2024-03-26T07:08:27Z

Following the provided encoding_decoding.sh script, the encoding stage generates a codec.txt file.

Each line of this file looks like:
utts_id "space" json.dumps(codecs)
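To make the line layout concrete, here is a minimal sketch of writing and parsing one such line. The utterance ID and code values are made up for illustration, and the actual shape of `codecs` in a real codec.txt may differ; the helper names `format_codec_line` and `parse_codec_line` are hypothetical and not part of FunCodec.

```python
import json


def format_codec_line(utt_id: str, codecs: list) -> str:
    """Build one line: utterance id, a single space, then the JSON-encoded codes."""
    return f"{utt_id} {json.dumps(codecs)}"


def parse_codec_line(line: str):
    """Split on the first whitespace only, because the JSON payload contains spaces."""
    utt_id, payload = line.rstrip("\n").split(maxsplit=1)
    return utt_id, json.loads(payload)


# Hypothetical example values, not taken from a real codec.txt.
line = format_codec_line("utt_0001", [[12, 7, 301], [45, 9, 88]])
print(line)

utt_id, codecs = parse_codec_line(line)
print(utt_id, codecs)
```

Note that `json.dumps` puts a space after each comma by default, so splitting a line on every space would break the payload apart; limiting the split to the first whitespace keeps it intact.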

This format cannot be read directly by read_text.py; the load_jsonl_trans_int function needs to be rewritten as follows:

def load_jsonl_trans_int(path: Union[Path, str]) -> Dict[str, np.ndarray]:
    # read_2column_text maps each utterance id to the raw text after the first space
    d = read_2column_text(path)
    retval = {}
    for k, v in d.items():
        try:
            value = json.loads(v)
            if isinstance(value, dict):
                # original layout: a JSON object with a "trans" field
                retval[k] = np.array(value["trans"], dtype=int)
            elif isinstance(value, list):
                # codec.txt layout: a bare JSON list of codes
                retval[k] = np.array(value, dtype=int)
            else:
                raise TypeError
        except TypeError:
            logging.error(f'Error happened with path="{path}", id="{k}", value="{v}"')
            raise
    return retval


ZhihaoDU · 2024-03-27T03:21:36Z

Thanks for your report. As expected, codec.txt should be loaded with the load_codec_json function in funcodec/datasets/iterable_dataset.py. The load_jsonl_trans_int function in read_text.py is not used to load codec tokens. Thanks for your modification as well.

zyy-fc changed the title ~~The codec.txt generated in the encoding stage cannot be read directly?~~ [bug] The codec.txt generated in the encoding stage cannot be read directly? Mar 26, 2024
