# Comparison of Chinese segmenters

I compare the precision of two Chinese word segmenters:
 - [Stanford NLP Chinese segmenter](https://nlp.stanford.edu/software/segmenter.html)
 - [jieba](https://github.com/fxsjy/jieba) Python package (Java and Rust implementations can also be found on GitHub)
 
The output of both segmenters was compared against the reference segmentation in this dataset (MSRA Chinese word segmentation data):
https://www.microsoft.com/en-us/download/details.aspx?id=52531

The `segmentation_distance` between two segmentations of the same sentence is calculated in two steps:
 - Convert each segmented sentence into a list of word-boundary indices, counted in characters with spaces ignored ("I am Groot" is read as "I|am|Groot" and converted into `[1,3]`; the boundary at the end of the sentence is not included).
 - Calculate the Levenshtein distance between the two lists of word boundaries. The Levenshtein distance is the minimal number of insertions, deletions or substitutions needed to convert one list into the other.
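
The two steps above can be sketched as follows. This is only an illustration of the description, not the code in `utils.py`: the names `boundary_indices`, `levenshtein` and `segmentation_distance_sketch` are hypothetical, and the real `segmentation_distance` may differ in details.

```python
from itertools import accumulate


def boundary_indices(words):
    """Character offsets of the internal word boundaries (hypothetical helper)."""
    # Cumulative word lengths give the offset after each word. The last offset
    # is the sentence length, identical for every segmentation, so it is dropped.
    return list(accumulate(len(w) for w in words))[:-1]


def levenshtein(xs, ys):
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, start=1):
        curr = [i]
        for j, y in enumerate(ys, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def segmentation_distance_sketch(seg_a, seg_b):
    """Assumed reading of the metric: Levenshtein distance of boundary lists."""
    return levenshtein(boundary_indices(seg_a), boundary_indices(seg_b))


print(boundary_indices(["I", "am", "Groot"]))
print(segmentation_distance_sketch(["I", "am", "Groot"], ["Iam", "Groot"]))
```

Under this reading, merging two words of the reference removes one boundary and costs 1, while moving a boundary by one character counts as a single substitution.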

In [82]:
import jieba
import nltk
from collections import defaultdict
from utils import segmentation_distance
from Dataset import parse_dataset

In [83]:
sentences=parse_dataset('msra-chinese-word-segmentation-data-v1/msra_bakeoff3_training.xml')

In [84]:
len(sentences)

46364

Install Stanford CoreNLP as described here: https://github.com/ellie-icekler/StanfordCoreNLP-Chinese/blob/master/README.md

Start the server with:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -preload tokenize,ssplit,pos,lemma,ner,parse -status_port 9001  -port 9001 -timeout 15000

In [62]:
from nltk.tokenize.stanford import CoreNLPParser
sttok = CoreNLPParser('http://localhost:9001')
print(list(sttok.tokenize(u'我家没有电脑。')))



['我家', '没有', '电脑', '。']


In [63]:
jieba_counts=defaultdict(int)
stanford_counts=defaultdict(int)
lengthsum=0
for n,s in enumerate(sentences):
    lengthsum+=len(s)-1
    fullsentence="".join(s)
    jieba_cut=jieba.lcut(fullsentence)
    stanford_cut=list(sttok.tokenize(fullsentence))
    jieba_counts[segmentation_distance(s,jieba_cut)]+=1
    stanford_counts[segmentation_distance(s,stanford_cut)]+=1
    #if not n%100:
    #    print(n)

In [77]:
avrg_j=sum([k*v for k,v in jieba_counts.items()])/sum(jieba_counts.values())
avrg_s=sum([k*v for k,v in stanford_counts.items()])/sum(stanford_counts.values())
avrg_w=lengthsum/len(sentences)  # mean number of internal word boundaries per sentence
print('average boundaries in a sentence: ',avrg_w)
print('average jieba distance: ',avrg_j,',   ratio:',avrg_j/avrg_w)
print('average stanford distance: ',avrg_s,',   ratio:',avrg_s/avrg_w)
print("distance: jieba    stanford")
for k in range(101):
    print(k,' :     ',jieba_counts[k],'    ',stanford_counts[k])


average boundaries in a sentence:  26.309313260288153
average jieba distance:  4.380726425675093 ,   ratio: 0.16650858128963236
average stanford distance:  3.4891294970235527 ,   ratio: 0.1326195580441136
distance: jieba    stanford
0  :      4622      4843
1  :      7516      8265
2  :      7527      8466
3  :      5991      7102
4  :      4796      5329
5  :      3683      3767
6  :      2812      2593
7  :      2062      1844
8  :      1638      1273
9  :      1283      827
10  :      916      577
11  :      690      423
12  :      577      274
13  :      440      199
14  :      344      163
15  :      250      100
16  :      207      75
17  :      180      56
18  :      147      43
19  :      97      25
20  :      90      19
21  :      83      14
22  :      63      6
23  :      46      6
24  :      49      6
25  :      39      9
26  :      19      6
27  :      25      6
28  :      11      1
29  :      22      0
30  :      14      2
31  :      12      0
32  :      10      1
33  :   

In [65]:
a=['我们', '变', '而', '以', '书会友', '，', '以', '书', '结缘', '，', '把', '欧美']

In [66]:
b=jieba.lcut(''.join(a))
c=list(sttok.tokenize(''.join(a)))

In [67]:
print(a)
print(b)
print(c)

['我们', '变', '而', '以', '书会友', '，', '以', '书', '结缘', '，', '把', '欧美']
['我们', '变而', '以书', '会友', '，', '以书', '结缘', '，', '把', '欧美']
['我们', '变', '而', '以', '书', '会友', '，', '以', '书', '结缘', '，', '把', '欧美']


In [68]:
segmentation_distance(a,b)

3

In [69]:
segmentation_distance(a,c)

1