Add a built-in forward maximum matching word segmentation module #81

Merged
merged 14 commits into from Oct 1, 2017

5 participants
@mozillazg
Owner

mozillazg commented May 7, 2017

closed #63

@mozillazg mozillazg changed the base branch from master to develop May 7, 2017

@coveralls

coveralls commented May 7, 2017

Coverage Status

Coverage decreased (-10.9%) to 88.221% when pulling ea15951 on trie-seg into bc9bac1 on develop.

@gumblex

Contributor

gumblex commented May 7, 2017

Tries are not memory-efficient in Python.

@mozillazg

Owner

mozillazg commented May 7, 2017

@gumblex Which data structure would be a better choice, then?

@gumblex

Contributor

gumblex commented May 7, 2017

I have always used a prefix dictionary (jieba uses one too); it saves a lot of memory.

mozillazg added some commits May 8, 2017

@mozillazg

Owner

mozillazg commented May 8, 2017

@gumblex I tried it, and it really does use much less memory than a trie 👍
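
For readers unfamiliar with the term, the prefix-dictionary idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code added in this PR: `build_prefix_set` and `forward_max_match` are hypothetical names, and the four-word vocabulary is made up. Instead of one nested dict per trie node, every prefix of every word is stored in a single flat set, which is then used to drive forward maximum matching.

```python
def build_prefix_set(words):
    """Return (vocab, prefixes): the words plus every prefix of every word."""
    vocab = set(words)
    prefixes = set()
    for word in vocab:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return vocab, prefixes


def forward_max_match(text, vocab, prefixes):
    """Greedy left-to-right segmentation that takes the longest known word."""
    result = []
    i = 0
    while i < len(text):
        longest = text[i]  # fall back to a single character
        j = i + 1
        # Keep extending the candidate while it is a prefix of some word.
        while j <= len(text) and text[i:j] in prefixes:
            if text[i:j] in vocab:
                longest = text[i:j]
            j += 1
        result.append(longest)
        i += len(longest)
    return result


# Hypothetical sample vocabulary, for illustration only.
vocab, prefixes = build_prefix_set(["北京", "北京大学", "大学", "学生"])
print(forward_max_match("北京大学生", vocab, prefixes))
```

The flat set trades some redundancy (each prefix is stored as its own string) for far fewer container objects than a dict-per-node trie, which is one plausible reason for the lower memory use reported here.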

@coveralls

coveralls commented May 8, 2017

Coverage Status

Coverage decreased (-10.7%) to 88.442% when pulling aaa435f on trie-seg into bc9bac1 on develop.

@coveralls

coveralls commented May 8, 2017

Coverage Status

Coverage decreased (-7.7%) to 91.429% when pulling aaa435f on trie-seg into bc9bac1 on develop.

@homu

homu commented May 14, 2017

☔️ The latest upstream changes (presumably bd6f00a) made this pull request unmergeable. Please resolve the merge conflicts.

@coveralls

coveralls commented May 14, 2017

Coverage Status

Coverage decreased (-5.6%) to 93.75% when pulling b5c442a on trie-seg into ad6fab3 on develop.

@homu

homu commented May 29, 2017

☔️ The latest upstream changes (presumably ddd19dd) made this pull request unmergeable. Please resolve the merge conflicts.

@mozillazg

Owner

mozillazg commented Sep 21, 2017

@bors-homu retry

mozillazg added some commits Sep 21, 2017

@mozillazg mozillazg changed the base branch from develop to master Sep 21, 2017

@coveralls

coveralls commented Sep 21, 2017

Coverage Status

Coverage decreased (-5.7%) to 93.487% when pulling 4acbf10 on trie-seg into c665d31 on master.

@codecov

codecov bot commented Sep 21, 2017

Codecov Report

Merging #81 into master will increase coverage by 0.05%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #81      +/-   ##
==========================================
+ Coverage   99.18%   99.24%   +0.05%     
==========================================
  Files          19       20       +1     
  Lines         492      530      +38     
==========================================
+ Hits          488      526      +38     
  Misses          4        4
Impacted Files | Coverage Δ
pypinyin/contrib/mmseg.py | 100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04a0f39...ecead60.

@mozillazg

Owner

mozillazg commented Sep 26, 2017

Segmentation accuracy (tested with the code from https://github.com/HIT-SCIR/scir-training-day/tree/master/2-python-practice/3-max-matching-word-segmentation):

$ python max-match.py eval.raw vocab.bin  output.dat
$ python eval.py --format=segment --mode=segment --eval=output.dat --gold=eval.dat
1909	3463	2594	55.12561363%	73.59290671%	63.03450553%

$ python max-match.py eval.raw vocab.large.bin  output.dat
$ python eval.py --format=segment --mode=segment --eval=output.dat --gold=eval.dat
2439	2624	2594	92.94969512%	94.02467232%	93.48409352%

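The columns of each output line are, in order:
number of correctly segmented words, number of output words, number of gold-standard words, precision (P), recall (R), F-score (F)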
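
As a quick sanity check, the three percentages can be recomputed from the three counts. The sketch below assumes the F value is the usual balanced F1 score, which is consistent with the numbers printed above; `prf` is a hypothetical helper name and is not part of eval.py.

```python
def prf(correct, output, gold):
    """Return (precision, recall, F1) as fractions between 0 and 1."""
    precision = correct / float(output)  # correct words / words produced
    recall = correct / float(gold)       # correct words / gold-standard words
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the two evaluation runs above.
for counts in [(1909, 3463, 2594), (2439, 2624, 2594)]:
    p, r, f = prf(*counts)
    print("P=%.8f%% R=%.8f%% F=%.8f%%" % (p * 100, r * 100, f * 100))
```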

The modified max-match.py:

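#!/usr/bin/env python
# Python 2 script: it relies on cPickle and the ``print >>`` statement.
import cPickle as pickle
import sys
from io import open

from pypinyin.contrib.mmseg import PrefixSet, Seg

pset = PrefixSet()
seg = Seg(pset)


def max_match_segment(line, dic):
    # ``dic`` is a generator, so only the first call adds words to the
    # prefix set; later calls pass an empty list and reuse what was trained.
    pset.train(list(dic))
    return list(seg.cut(line))


if __name__ == "__main__":

    # Input text to segment, read line by line.
    try:
        fpi = open(sys.argv[1], "r")
    except Exception:
        print >> sys.stderr, "failed to open file"
        sys.exit(1)

    # Pickled vocabulary of UTF-8 byte strings, decoded lazily to unicode.
    try:
        dic = pickle.load(open(sys.argv[2]))
        dic = (x.decode('utf-8') for x in dic)
    except Exception:
        print >> sys.stderr, "failed to load dict"
        sys.exit(1)

    # Write the words of each input line separated by tabs.
    with open(sys.argv[3], 'w', encoding='utf-8') as output:
        for line in fpi:
            output.write("\t".join(max_match_segment(line.strip(), dic)))
            output.write(u'\n')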

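@mozillazg mozillazg changed the title from [WIP] Add a built-in word segmentation module that segments by known words to [WIP] Add a built-in forward maximum matching word segmentation module Sep 30, 2017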

mozillazg added some commits Oct 1, 2017

@coveralls

coveralls commented Oct 1, 2017

Coverage Status

Coverage decreased (-0.3%) to 98.868% when pulling 9a26ec3 on trie-seg into c665d31 on master.

@coveralls

coveralls commented Oct 1, 2017

Coverage Status

Coverage increased (+0.06%) to 99.245% when pulling 4fab49b on trie-seg into c665d31 on master.

mozillazg added some commits Oct 1, 2017

@coveralls

coveralls commented Oct 1, 2017

Coverage Status

Coverage increased (+0.06%) to 99.245% when pulling ecead60 on trie-seg into 04a0f39 on master.

@mozillazg

Owner

mozillazg commented Oct 1, 2017

@bors-homu

Collaborator

bors-homu commented Oct 1, 2017

📋 Looks like this PR is still in progress, ignoring approval

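@mozillazg mozillazg changed the title from [WIP] Add a built-in forward maximum matching word segmentation module to Add a built-in forward maximum matching word segmentation module Oct 1, 2017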

@mozillazg mozillazg merged commit a1b52d3 into master Oct 1, 2017

5 checks passed

codecov/patch 100% of diff hit (target 99.18%)
codecov/project 99.24% (+0.05%) compared to 04a0f39
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
coverage/coveralls Coverage increased (+0.06%) to 99.245%

@mozillazg mozillazg deleted the trie-seg branch Oct 1, 2017