Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

重写整个 rust-pinyin #36

Closed
wants to merge 8 commits into from
Closed

Conversation

upsuper
Copy link
Collaborator

@upsuper upsuper commented May 23, 2019

使用trait+iterator的方式重新实现了这个crate,数据全部由build.rs在编译时生成,运行时不进行任何分配。

我写了两个简单的程序对比当前实现和这个重写的各项指标:

// pinyin-old
use std::io::{self, Write as _};
use std::fs;

fn main() -> io::Result<()> {
    let content = fs::read_to_string("../weicheng.txt")?;
    let stdout = io::stdout();
    let mut stdout = stdout.lock();
    let args = pinyin::Args {
        style: pinyin::Style::Tone,
        heteronym: true,
    };
    let result = pinyin::pinyin(&content, &args);
    for item in result.iter() {
        for pinyin in item.iter() {
            write!(stdout, "{},", pinyin)?;
        }
        writeln!(stdout)?;
    }
    Ok(())
}
// pinyin-new
use pinyin::ToPinyinMulti;
use std::io::{self, Write as _};
use std::fs;

fn main() -> io::Result<()> {
    let content = fs::read_to_string("../weicheng.txt")?;
    let stdout = io::stdout();
    let mut stdout = stdout.lock();
    for multi in content.as_str().to_pinyin_multi() {
        if let Some(multi) = multi {
            for pinyin in multi.into_iter() {
                write!(stdout, "{},", pinyin.with_tone())?;
            }
        }
        writeln!(stdout)?;
    }
    Ok(())
}

根据我本地的粗略测量,使用#34 (comment)里提到的《围城》文本,在 --release 模式下,输出到 /dev/null 时,重写版本的性能大约是当前实现的10倍(0.15s vs. 1.54s)。

编译后的代码体积根据我在macOS上编译结果计算,重写版本大约791KB(strip以后686KB),当前版本1.4MB(strip以后1.3MB)。重写版本比当前版本小了一半,而且重写版本还支持使用feature裁剪不必要的数据。如果去除了多音字并仅支持带声调的样式的话,编译结果只有501KB(strip以后397KB)。

编译时间上,重写版本的debug编译时间从头开始大约5s,当前实现大约3.5s。

还需要进一步改进的地方:

  • 调整CI以测试feature组合,至少需要测试是否有多音字x不同样式
  • 更新README
  • 增加benchmark?
  • 增加集成测试?(目前已经包含了数个doctest覆盖提供的接口)

const INITIALS: &[&str] = &[
"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j", "q", "x", "r", "zh", "ch",
"sh", "z", "c", "s",
];
Copy link
Collaborator

@LuoZijun LuoZijun May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好像漏了 3 个: 'ẑ''ĉ''ŝ' , 这三个声母通常会写成 zhchsh
虽然不是很常用 ...

参考: http://www.moe.gov.cn/s78/A19/yxs_left/moe_810/s230/195802/t19580201_186000.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个数组是从原来的代码里拷出来的:

rust-pinyin/src/lib.rs

Lines 71 to 74 in 51cb892

const _INITIALS: [&str; 21] = [
"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j", "q", "x", "r", "zh", "ch", "sh",
"z", "c", "s",
];

似乎目前的拼音数据里也没有用到,所以加不加大概没什么关系?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

嗯,目前的数据是没有问题的。
我这里也只是借这个机会把它列出来。

声母简写:

  • zh =>
  • ch => ĉ
  • sh => ŝ

鼻音简写:

  • ng => ŋ

Copy link
Collaborator

@LuoZijun LuoZijun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里并非就是要改的,我也知道之前的代码就是这样。
我这里仅仅只是把它列出来。

build.rs Show resolved Hide resolved
build.rs Show resolved Hide resolved
@coveralls
Copy link

coveralls commented May 23, 2019

Coverage Status

Coverage decreased (-3.6%) to 96.386% when pulling 46a3e7a on upsuper:rewrite into 97d477b on mozillazg:master.

@upsuper
Copy link
Collaborator Author

upsuper commented May 24, 2019

@mozillazg @LuoZijun 请问两位是否有意合并这个PR?如果是的话还请指出还有哪些需要修改的地方,我可以继续完善。

@LuoZijun
Copy link
Collaborator

@upsuper

@mozillazg 最近比较忙,一般周末才有时间 :)

@mozillazg
Copy link
Owner

@upsuper Thanks!

请问两位是否有意合并这个PR?如果是的话还请指出还有哪些需要修改的地方,我可以继续完善。

对于重写这个事情我这边没问题,个人感觉需要在这个 PR 或合并后新开 PR 进行完善的地方:

  • 如你前面所说的,需要麻烦更新一下文档和测试。
  • 是否需要考虑兼容旧版 API ?这里的兼容旧版 API 指的是:用户升级后可以继续如旧版一样使用 lazy_pinyinpinyinArgsStyle 这几个常用 API。如果不方便兼容或者觉得没必要兼容的话也可以接受,我后面会更新一下文档说明这个不兼容的情况。

@LuoZijun 关于我上面提到的两点,你怎么看。以及对这个 PR 是否有其他需要补充的,比如:代码或设计方面或者其他方面。

@mozillazg mozillazg requested a review from LuoZijun May 25, 2019 04:36
@upsuper
Copy link
Collaborator Author

upsuper commented May 25, 2019

需要兼容旧版的话也是可以的,新版的接口相对灵活,可以额外添加一层兼容层兼容旧版接口然后标记为deprecate。不过这个或许可以在独立的PR里做,在新版发布之前加就好了我觉得。

@LuoZijun
Copy link
Collaborator

LuoZijun commented May 27, 2019

@mozillazg 我基本看过了,感觉挺好的,特别是 PinyinData 这块的设计 。

更新:

关于兼容旧有API这块,我觉得还是有必要。可以考虑向后2-4个版本这样子 :)

@upsuper
Copy link
Collaborator Author

upsuper commented May 27, 2019

添加了旧API的兼容模块,并且将原来的测试加回来作为兼容模块的测试了。

@LuoZijun
Copy link
Collaborator

@upsuper .travis.yml 这个冲突文件能解决下吗?解决完后,我们就可以开始合并了 :)

@upsuper
Copy link
Collaborator Author

upsuper commented May 28, 2019

两个改的行没有重叠我原以为不会冲突呢……

@LuoZijun
Copy link
Collaborator

@upsuper 对了,我看 @mozillazg 是先合并到 origin/develop 分支的,我猜你的代码应该是基于 master 分支的对吧?

@upsuper upsuper changed the base branch from master to develop May 28, 2019 13:14
@upsuper upsuper changed the base branch from develop to master May 28, 2019 13:14
@upsuper
Copy link
Collaborator Author

upsuper commented May 28, 2019

所以呢?我需要把base切换到develop分支吗?

(其实我觉得这样的项目并没有搞develop分支的必要……)

LuoZijun added a commit that referenced this pull request May 28, 2019
LuoZijun added a commit that referenced this pull request May 28, 2019
LuoZijun added a commit that referenced this pull request May 28, 2019
LuoZijun added a commit that referenced this pull request May 28, 2019
@LuoZijun
Copy link
Collaborator

@upsuper 我已经在 develop 分支手动合并了,你看下没问题,我就关闭这个啦。

@LuoZijun
Copy link
Collaborator

@upsuper 看了下 CI 结果,已经没什么问题了。所以这里我就先关闭了。

@mozillazg
Copy link
Owner

感谢 @upsuper 费心费时重写这个项目,以及感谢 @LuoZijun 的 review 👍 ❤️

@upsuper 什么时候感觉不再需要其他补充了,可以发新版了,欢迎 ping 我,我整理一下发个新版本。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants