# Chinese Word Segmentation

Exploratory notebook for CWS.

Segment a standardly written Chinese sentence into words.

For example, give the following Chinese sentence:

**星期天我们去吃重庆火锅**

It produces the output with proper segmentation:

**星期天 我们 去 吃 重庆 火锅**

In [2]:
# util libs
import numpy as np
import time
from itertools import izip
#from __future__ import division

# model libs
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
from nltk.classify.scikitlearn import SklearnClassifier

# in-house libs
import string_util
#reload(string_util)
StringUtil = string_util.StringUtil()

## Tagging

Tag for a sentence. Input a segmented sentence, and output a list of tags.

Types of tags:
- **s** single character as a word
- **b** beginning of a word
- **m** middle of a word
- **e** end of a word

Punctuations (either English or Chinese) will be tagged as **s**.

For example, the sample sentence will be tagged as following:

**星&nbsp;期&nbsp;天&nbsp;&nbsp;&nbsp;&nbsp;我&nbsp;们&nbsp;&nbsp;&nbsp;&nbsp;去&nbsp;&nbsp;&nbsp;&nbsp;吃&nbsp;&nbsp;&nbsp;重&nbsp;庆&nbsp;&nbsp;&nbsp;火&nbsp;锅 &nbsp;&nbsp;&nbsp;！**

**b&nbsp;&nbsp;&nbsp;m&nbsp;&nbsp;&nbsp;e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b&nbsp;&nbsp;e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s&nbsp;&nbsp;&nbsp;&nbsp;s&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b&nbsp;&nbsp;&nbsp;e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b&nbsp;&nbsp;&nbsp;e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s**

In [3]:
def tag_for_sentence(sentence):
    words = sentence.decode('utf-8').strip().split()
    tags = []
    for word in words:
        if len(word) > 1:
            tags.append('b')
            for char in word[1:(len(word) - 1)]:
                tags.append('m')
            tags.append('e')
        else: 
            tags.append('s')
            
    return tags

Tag for a corpus, and output the tags to a file with each line is a tag

In [4]:
def tag_for_file(input_path, output_path):
    print 'Tagging for %s and output to %s...' % (input_path, output_path)
    start = time.time()
    
    with open(output_path, "w+") as output_file:
        for line in open(input_path, "r").readlines():
            line = line.strip().decode("utf-8")
            tags = tag_for_sentence(line)
            for tag in tags:
                output_file.writelines(tag + StringUtil.NEWLINE)
    
    print 'Done. Total time taken %d seconds' % (time.time() - start)

Tag for test corpus

In [104]:
tag_for_file("data/training/test.utf8", "data/preprocessed/tag_test.utf8")

Tagging for data/training/test.utf8 and output to data/preprocessed/tag_test.utf8...
Tagging done. Total time taken 0 seconds


Tag for PKU and MSR corpora

In [70]:
tag_for_file("data/training/pku_training.utf8", "data/preprocessed/tag_pku.utf8")
tag_for_file("data/training/msr_training.utf8", "data/preprocessed/tag_msr.utf8")

Tagging for data/training/pku_training.utf8 and output to data/preprocessed/tag_pku.utf8...
Tagging done. Total time taken 11 seconds
Tagging for data/training/msr_training.utf8 and output to data/preprocessed/tag_msr.utf8...
Tagging done. Total time taken 25 seconds


## Features Extraction

Extract features for a sentence. In case of window size 5, the features are defined as following:

- **(a)** $C_{n}$, where $n$ if from -2 to 2
- **(b)** $C_{n}C_{n+1}$, where $n$ is from -2 to 1
- **(c)** $C_{-1}C_{1}$
- **(d)** $Pu$, a boolean value (0 or 1) representing if the current character is a punctuation
- **(e)** $T(C_{-2})T(C_{-1})T(C_{0})T(C_{1})T(C_{2})$, type seq of the char seq

Sentence boundary: the start-of-sentence character is defined as **< s >**, while the end-of-sentence character is defined as **< /s >**

Types are defined as following:
- **0** sentence boundary
- **1** number char
- **2** date char
- **3** English letter 
- **4** others

In [5]:
def extract_feature_for_sentence(sentence, window_size=5):
    if window_size % 2 == 0 or not window_size > 1:
        raise ValueError('Window size must be odd number and larger than 1')
    
    sentence = StringUtil.remove_whitespace(sentence)
    total_len = len(sentence)
    context_size = window_size / 2
    
    feature_dict_list = []
    
    for i, c in enumerate(sentence):
        feature_dict = dict()
        
        c_features = dict()
        c_features["c0"] = c
        
        t_features = dict()
        t_features["t0"] = StringUtil.get_character_type(c)
        
        for context_i in range(1, context_size+1):
            c_features["c_"+str(context_i)] = sentence[i-context_i] if i-context_i >=0 \
                                                else StringUtil.SENTENCE_START.decode("utf-8")
            c_features["c"+str(context_i)] = sentence[i+context_i] if i+context_i < total_len \
                                                else StringUtil.SENTENCE_END.decode("utf-8")
            
            t_features["t_"+str(context_i)] = StringUtil.get_character_type(c_features["c_"+str(context_i)])
            t_features["t"+str(context_i)] = StringUtil.get_character_type(c_features["c"+str(context_i)])
        
        # feature a
        feature_dict.update(c_features)
        
        # feature b
        for context_i in reversed(range(1, context_size+1)):
            if context_i-1 == 0:
                feature_dict["c_"+str(context_i)+"c0"] = \
                    c_features["c_"+str(context_i)] + c_features["c0"]
                    
                feature_dict["c0"+"c"+str(context_i)] = \
                    c_features["c0"] + c_features["c"+str(context_i)]    
                
            else:
                feature_dict["c_"+str(context_i)+"c_"+str(context_i-1)] = \
                    c_features["c_"+str(context_i)] + c_features["c_"+str(context_i-1)]
                    
                feature_dict["c"+str(context_i-1)+"c"+str(context_i)] = \
                    c_features["c_"+str(context_i-1)] + c_features["c_"+str(context_i)]
        
        # feature c
        feature_dict["c_1c1"] = c_features["c_1"] + c_features["c1"]
        
        # feature d
        feature_dict["p"] = "1" if StringUtil.is_punctuation(c) else "0"
        
        # feature e
        type_feature = reduce(lambda x,y: x+y, t_features.itervalues())
        feature_dict["t"] = type_feature
        
        feature_dict_list.append(feature_dict)
        
    return feature_dict_list

Quick test

In [6]:
extract_feature_for_sentence("一年1y。".decode("utf-8"), 3)

[{'c0': u'\u4e00',
  'c0c1': u'\u4e00\u5e74',
  'c1': u'\u5e74',
  'c_1': u'<s>',
  'c_1c0': u'<s>\u4e00',
  'c_1c1': u'<s>\u5e74',
  'p': '0',
  't': '012'},
 {'c0': u'\u5e74',
  'c0c1': u'\u5e741',
  'c1': u'1',
  'c_1': u'\u4e00',
  'c_1c0': u'\u4e00\u5e74',
  'c_1c1': u'\u4e001',
  'p': '0',
  't': '121'},
 {'c0': u'1',
  'c0c1': u'1y',
  'c1': u'y',
  'c_1': u'\u5e74',
  'c_1c0': u'\u5e741',
  'c_1c1': u'\u5e74y',
  'p': '0',
  't': '213'},
 {'c0': u'y',
  'c0c1': u'y\u3002',
  'c1': u'\u3002',
  'c_1': u'1',
  'c_1c0': u'1y',
  'c_1c1': u'1\u3002',
  'p': '0',
  't': '134'},
 {'c0': u'\u3002',
  'c0c1': u'\u3002</s>',
  'c1': u'</s>',
  'c_1': u'y',
  'c_1c0': u'y\u3002',
  'c_1c1': u'y</s>',
  'p': '1',
  't': '340'}]

In [6]:
def extract_feature_for_file(input_path, output_path, window_size=5):
    print 'Extracting features for %s and output to %s...' % (input_path, output_path)
    start = time.time()
    
    with open(output_path, "w+") as output_file:
        for line in open(input_path, "r").readlines():
            line = line.strip().decode("utf-8")
            feature_dict_list = extract_feature_for_sentence(line, window_size)
            for feature_dict in feature_dict_list:
                output_file.writelines(StringUtil.to_json(feature_dict) + StringUtil.NEWLINE)
    
    print 'Done. Total time taken %d seconds' % (time.time() - start)

Extract features for test corpus

In [8]:
extract_feature_for_file("data/training/test.utf8", "data/preprocessed/feature_test.utf8")

Extracting features for data/training/test.utf8 and output to data/preprocessed/feature_test.utf8...
Done. Total time taken 0 seconds


Extract features for PKU and MSR corpora with window size = 5

In [204]:
extract_feature_for_file("data/training/pku_training.utf8", "data/preprocessed/feature_pku.utf8")
extract_feature_for_file("data/training/msr_training.utf8", "data/preprocessed/feature_msr.utf8")

Extracting features for data/training/pku_training.utf8 and output to data/preprocessed/feature_pku.utf8...
Done. Total time taken 61 seconds
Extracting features for data/training/msr_training.utf8 and output to data/preprocessed/feature_msr.utf8...
Done. Total time taken 138 seconds


### Improvements: Use of External Dictionary

Additional features to use external dictionary. We want to find a word in the dictionary with maximum length that contains the current character.

Let $W$ be the matched word containing $C_{0}$, $t_{0}$ be the tag for $C_{0} in $W$, $and $L$ be the length of $W$. Then we can add the following additional features:
- **(f)** $Lt_{0}$
- **(g)** $C_{n}t_{0}$, where $n$ = -1, 0, 1

In [7]:
def load_dict(dict_path):
    ref_dict = dict()
    for line in open(dict_path, 'r').readlines():
        word = line.strip().decode('utf-8')
        ref_dict[word] = len(word)
        
    return ref_dict

In [8]:
def extract_dict_feature(sentence, curr_i, ref_dict, window_size=10):
    start_index = curr_i - (window_size - 1)
    if start_index < 0:
        start_index = 0
        
    end_index = curr_i + window_size
    if end_index > len(sentence):
        end_index = len(sentence)
    
    t0 = 's'
    L = 0
    
    for i in range(start_index, curr_i+1):
        for j in range(curr_i+1, end_index+1):
            if sentence[i:j] in ref_dict and j-i > L:
                L = j - i
                
                if j-i == 1:
                    t0 = 's'
                if i == curr_i:
                    t0 = 'b'
                elif j-1 == curr_i:
                    t0 = 'e'
                else:
                    t0 = 'm'
                
    return (L, t0)

Quick test

In [9]:
ref_dict = load_dict("data/dict/processed.utf8")

In [5]:
extract_dict_feature("争分夺秒".decode('utf-8'), 2, ref_dict)

(4, 'm')

Updated feature extraction function

In [10]:
def extract_feature_for_sentence_v2(sentence, ref_dict, window_size=5):
    if window_size % 2 == 0 or not window_size > 1:
        raise ValueError('Window size must be odd number and larger than 1')
    
    sentence = StringUtil.remove_whitespace(sentence)
    total_len = len(sentence)
    context_size = window_size / 2
    
    feature_dict_list = []
    
    for i, c in enumerate(sentence):
        feature_dict = dict()
        
        c_features = dict()
        c_features["c0"] = c
        
        t_features = dict()
        t_features["t0"] = StringUtil.get_character_type(c)
        
        for context_i in range(1, context_size+1):
            c_features["c_"+str(context_i)] = sentence[i-context_i] if i-context_i >=0 \
                                                else StringUtil.SENTENCE_START.decode("utf-8")
            c_features["c"+str(context_i)] = sentence[i+context_i] if i+context_i < total_len \
                                                else StringUtil.SENTENCE_END.decode("utf-8")
            
            t_features["t_"+str(context_i)] = StringUtil.get_character_type(c_features["c_"+str(context_i)])
            t_features["t"+str(context_i)] = StringUtil.get_character_type(c_features["c"+str(context_i)])
        
        # feature a
        feature_dict.update(c_features)
        
        # feature b
        for context_i in reversed(range(1, context_size+1)):
            if context_i-1 == 0:
                feature_dict["c_"+str(context_i)+"c0"] = \
                    c_features["c_"+str(context_i)] + c_features["c0"]
                    
                feature_dict["c0"+"c"+str(context_i)] = \
                    c_features["c0"] + c_features["c"+str(context_i)]    
                
            else:
                feature_dict["c_"+str(context_i)+"c_"+str(context_i-1)] = \
                    c_features["c_"+str(context_i)] + c_features["c_"+str(context_i-1)]
                    
                feature_dict["c"+str(context_i-1)+"c"+str(context_i)] = \
                    c_features["c_"+str(context_i-1)] + c_features["c_"+str(context_i)]
        
        # feature c
        feature_dict["c_1c1"] = c_features["c_1"] + c_features["c1"]
        
        # feature d
        feature_dict["p"] = "1" if StringUtil.is_punctuation(c) else "0"
        
        # feature e
        type_feature = reduce(lambda x,y: x+y, t_features.itervalues())
        feature_dict["t"] = type_feature
        
        matched = extract_dict_feature(sentence, i, ref_dict)
        
        # feature f
        feature_dict["Lt0"] = str(matched[0]) + matched[1]

        # feature g
        feature_dict["c_1t0"] = c_features["c_1"] + matched[1]
        feature_dict["c0t0"] = c_features["c0"] + matched[1]
        feature_dict["c1t0"] = c_features["c1"] + matched[1]

        feature_dict_list.append(feature_dict)
        
    return feature_dict_list

Quick test

In [7]:
extract_feature_for_sentence_v2("一年争分夺秒。".decode("utf-8"), ref_dict, 3)

[{'Lt0': '0s',
  'c0': u'\u4e00',
  'c0c1': u'\u4e00\u5e74',
  'c0t0': u'\u4e00s',
  'c1': u'\u5e74',
  'c1t0': u'\u5e74s',
  'c_1': u'<s>',
  'c_1c0': u'<s>\u4e00',
  'c_1c1': u'<s>\u5e74',
  'c_1t0': u'<s>s',
  'p': '0',
  't': '012'},
 {'Lt0': '0s',
  'c0': u'\u5e74',
  'c0c1': u'\u5e74\u4e89',
  'c0t0': u'\u5e74s',
  'c1': u'\u4e89',
  'c1t0': u'\u4e89s',
  'c_1': u'\u4e00',
  'c_1c0': u'\u4e00\u5e74',
  'c_1c1': u'\u4e00\u4e89',
  'c_1t0': u'\u4e00s',
  'p': '0',
  't': '124'},
 {'Lt0': '4b',
  'c0': u'\u4e89',
  'c0c1': u'\u4e89\u5206',
  'c0t0': u'\u4e89b',
  'c1': u'\u5206',
  'c1t0': u'\u5206b',
  'c_1': u'\u5e74',
  'c_1c0': u'\u5e74\u4e89',
  'c_1c1': u'\u5e74\u5206',
  'c_1t0': u'\u5e74b',
  'p': '0',
  't': '244'},
 {'Lt0': '4m',
  'c0': u'\u5206',
  'c0c1': u'\u5206\u593a',
  'c0t0': u'\u5206m',
  'c1': u'\u593a',
  'c1t0': u'\u593am',
  'c_1': u'\u4e89',
  'c_1c0': u'\u4e89\u5206',
  'c_1c1': u'\u4e89\u593a',
  'c_1t0': u'\u4e89m',
  'p': '0',
  't': '444'},
 {'Lt0': '4m

In [11]:
def extract_feature_for_file_v2(input_path, output_path, ref_dict, window_size=5):
    print 'Extracting features (v2) for %s and output to %s...' % (input_path, output_path)
    start = time.time()
    
    with open(output_path, "w+") as output_file:
        for line in open(input_path, "r").readlines():
            line = line.strip().decode("utf-8")
            feature_dict_list = extract_feature_for_sentence_v2(line, ref_dict, window_size)
            for feature_dict in feature_dict_list:
                output_file.writelines(StringUtil.to_json(feature_dict) + StringUtil.NEWLINE)
    
    print 'Done. Total time taken %d seconds' % (time.time() - start)

## Train a MaxEnt Model (nltk)

Use nltk MaxEnt model

In [11]:
import nltk
import pickle

In [12]:
def train_model(feature_path, tag_path, model_path, iteration=100):
    print 'Loading training data..'
    train_data = []
    with open(feature_path, 'r') as feature_file, open(tag_path, 'r') as tag_file:
        for feature_line, tag_line in izip(feature_file, tag_file):
            feature_json = feature_line.strip()
            tag = tag_line.strip()
            #print("{0}\t{1}".format(feature_json, tag))
            
            feature_dict = StringUtil.from_json(feature_json)
            train_data.append((feature_dict, tag))
    print 'Done'
            
    algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0] #GIS Algorithm  
    print("Start to train the model with {0}...".format(algorithm))
    start = time.time()
    
    classifier = nltk.MaxentClassifier.train(train_data, algorithm, max_iter=iteration)
    
    print('Done. Total time taken {0} seconds'.format(time.time() - start))
    
    # save the model to a file
    f = open(model_path, 'wb')
    pickle.dump(classifier, f)
    f.close()

#### Test with small dataset

In [247]:
train_model("data/preprocessed/feature_test.utf8", "data/preprocessed/tag_test.utf8", "data/model/test.pickle", 100)

Start to train the model with GIS...
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.38629        0.278
             2          -0.19100        1.000
             3          -0.08377        1.000
             4          -0.05237        1.000
             5          -0.03788        1.000
             6          -0.02961        1.000
             7          -0.02429        1.000
             8          -0.02058        1.000
             9          -0.01785        1.000
            10          -0.01576        1.000
            11          -0.01411        1.000
            12          -0.01277        1.000
            13          -0.01167        1.000
            14          -0.01074        1.000
            15          -0.00995        1.000
            16          -0.00926        1.000
            17          -0.00867        1.000
            18          -0.00815        1.000
          

In [235]:
f = open('data/model/test.pickle', 'rb')
classifier = pickle.load(f)
f.close()

feature_list = extract_feature_for_sentence("测试一！通过：333。".decode("utf-8"))
results = classifier.classify_many(feature_list)
print results

['b', 'e', 's', 's', 'b', 'e', 's', 'b', 'm', 'e', 's']


In [238]:
tests = classifier.prob_classify(feature_list[0])
print tests.samples()
print tests.logprob('b')

['m', 's', 'b', 'e']
-6.43708881328


#### Train with corpus

In [None]:
train_model("data/preprocessed/feature_pku.utf8", "data/preprocessed/tag_pku.utf8", "data/model/pku.pickle", 100)

![GIS model diagram](img/GIS nltk.PNG)

The nltk implementation is too slow. Almost 1 hour for 10 iterations. It is not practical to use in production.

## Train a MaxEnt Model (scikit-learn)

Use model from scikit-learn package for better performance

In [13]:
def train_model_sklearn(feature_path, tag_path, model_path):
    print 'Loading training data..'
    train_data = []
    with open(feature_path, 'r') as feature_file, open(tag_path, 'r') as tag_file:
        for feature_line, tag_line in izip(feature_file, tag_file):
            feature_json = feature_line.strip()
            tag = tag_line.strip()
            #print("{0}\t{1}".format(feature_json, tag))
            
            feature_dict = StringUtil.from_json(feature_json)
            train_data.append((feature_dict, tag))
    print 'Done'
            
    model = SklearnClassifier(LogisticRegression())
    print "Start to train the model..."
    start = time.time()
    
    classifier = model.train(train_data)
    
    print('Done. Total time taken {0} seconds'.format(time.time() - start))
    
    # save the model to a file
    joblib.dump(classifier, model_path, compress=0)

#### Test with small dataset

In [5]:
train_model_sklearn("data/preprocessed/feature_test.utf8", "data/preprocessed/tag_test.utf8", "data/model/test_sklearn")

Loading training data..
Done
Start to train the model...
[LibLinear]Done. Total time taken 0.0090000629425 seconds


In [6]:
classifier = joblib.load("data/model/test_sklearn", mmap_mode='r')

feature_list = extract_feature_for_sentence("测试一！通过：333。".decode("utf-8"))
results = classifier.classify_many(feature_list)
print results

['b', 'e', 's', 's', 'b', 'e', 's', 'b', 'm', 'e', 's']


#### Train with corpus

Train with PKU corpus

In [21]:
train_model_sklearn("data/preprocessed/feature_pku.utf8", "data/preprocessed/tag_pku.utf8", "data/model/pku_sklearn")

Loading training data..
Done
Start to train the model...
Done. Total time taken 647.667000055 seconds


## Model Evaluation

Dynamic programming to make sure the correct tag sequence.

List of invalid tag sequences:
- m -> s
- m -> b
- s -> m
- s -> e
- b -> s
- e -> m
- b -> b
- e -> e
- m -> m

Also, a sentence should never start with tag **m** or **e**.

In [14]:
def get_best_tag_seq(feature_list, classifier):
    
    # Map of (index, tag) -> log probability
    delta = dict()
    # Backpointers, map of (index, tag) -> previous tag
    bp = dict()
    
    for i, feature_dict in enumerate(feature_list):
        prob_distribution = classifier.prob_classify(feature_dict)
        
        if i == 0:
            for t in StringUtil.TAG_SET:
                if not t == 'm' and not t == 'e':
                    delta[(i, t)] = prob_distribution.logprob(t)

        else:
            for t in StringUtil.TAG_SET:
                max_term = max([(prob_distribution.logprob(t) + delta[(i - 1, t_1)], t_1)
                                    if not t_1 + t in StringUtil.INVALID_TAG_SEQ and (i - 1, t_1) in delta else (-np.inf, t_1)
                                        for t_1 in StringUtil.TAG_SET])
                
                delta[(i, t)] = max_term[0]
                bp[(i, t)] = max_term[1]
    
    n = len(feature_list)
    end_score, end_tag = max([(delta[(n-1, t)], t) if (n-1, t) in delta else (-np.inf, t) for t in StringUtil.TAG_SET])

    # Follow backpointers to obtain sequence with the highest score.
    tags = [end_tag]
    for i in reversed(range(0, n-1)):
        tags.append(bp[(i + 1, tags[-1])])
    return list(reversed(tags))

#### Quick test for valid sequence function

In [79]:
test_classifier = joblib.load("data/model/test_sklearn", mmap_mode='r')
feature_list = extract_feature_for_sentence("测试一！通过：333。".decode("utf-8"))
print get_best_tag_seq(feature_list, test_classifier)

['b', 'e', 's', 's', 'b', 'e', 's', 'b', 'm', 'e', 's']


#### Test tagging against another corpus

Function to output predicted tags to a file

In [15]:
def get_tags_and_output_to_file(input_file_path, result_output_path, model_path):
    classifier = joblib.load(model_path, mmap_mode='r')
        
    with open(result_output_path, "w+") as output_file:
        count = 1
        for line in open(input_file_path, "r").readlines():
            line = line.strip().decode("utf-8")
            feature_list = extract_feature_for_sentence(line)
            predicted_tags = get_best_tag_seq(feature_list, classifier)
            for tag in predicted_tags:
                output_file.writelines(tag + StringUtil.NEWLINE)

            # for evaluation purpose, only take first 1K sentences
            count += 1
            if count == 1000:
                break

Function to compare predicted tags against gold standard tags and print model accuracy

In [16]:
def print_model_accuracy(result_file_path, gold_standard_file_path):
    count = 0
    correct = 0
    with open(gold_standard_file_path, 'r') as expect_file, open(result_file_path, 'r') as result_file:
        for expect_line, result_line in izip(expect_file, result_file):
            count += 1
            if expect_line.strip() == result_line.strip():
                correct +=1

    print "Accuracy: {0}".format(float(correct)/count)

Test model trained with PKU corpus against MSR corpus

In [134]:
get_tags_and_output_to_file("data/training/msr_training.utf8", "data/result/msr.utf8", "data/model/pku_sklearn")

print_model_accuracy("data/result/msr.utf8", "data/preprocessed/tag_msr.utf8")

Accuracy: 0.911795996002


#### Comparing the model with window size = 3

Extract features for PKU and MSR corpora with window size = 3

In [103]:
extract_feature_for_file("data/training/pku_training.utf8", "data/preprocessed/feature_w3_pku.utf8", 3)
extract_feature_for_file("data/training/msr_training.utf8", "data/preprocessed/feature_w3_msr.utf8", 3)

Extracting features for data/training/pku_training.utf8 and output to data/preprocessed/feature_w3_pku.utf8...
Done. Total time taken 41 seconds
Extracting features for data/training/msr_training.utf8 and output to data/preprocessed/feature_w3_msr.utf8...
Done. Total time taken 97 seconds


In [106]:
train_model_sklearn("data/preprocessed/feature_w3_pku.utf8", "data/preprocessed/tag_pku.utf8", "data/model/pku_w3_sklearn")

Loading training data..
Done
Start to train the model...
Done. Total time taken 537.802000046 seconds


In [110]:
get_tags_and_output_to_file("data/training/msr_training.utf8", "data/result/msr_w3.utf8", "data/model/pku_w3_sklearn")

print_model_accuracy("data/result/msr_w3.utf8", "data/preprocessed/tag_msr.utf8")

Accuracy: 0.911247944808


#### Comparing the model with window size = 7

Extract features for PKU and MSR corpora with window size = 7

In [21]:
extract_feature_for_file("data/training/pku_training.utf8", "data/preprocessed/feature_w7_pku.utf8", 7)

Extracting features for data/training/pku_training.utf8 and output to data/preprocessed/feature_w7_pku.utf8...
Done. Total time taken 86 seconds


In [22]:
train_model_sklearn("data/preprocessed/feature_w7_pku.utf8", "data/preprocessed/tag_pku.utf8", "data/model/pku_w7_sklearn")

Loading training data..
Done
Start to train the model...
Done. Total time taken 770.940000057 seconds


In [23]:
get_tags_and_output_to_file("data/training/msr_training.utf8", "data/result/msr_w7.utf8", "data/model/pku_w7_sklearn")

print_model_accuracy("data/result/msr_w7.utf8", "data/preprocessed/tag_msr.utf8")

Accuracy: 0.909313646475


Comparing the accuracies:
- window size 5: **0.911795996002**
- window size 3: **0.911247944808**
- window size 7: **0.909313646475**

Looks like the model with window size **5** is the best.

#### Test model with external dictionary features

In [13]:
ref_dict = load_dict("data/dict/processed.utf8")

In [13]:
extract_feature_for_file_v2("data/training/pku_training.utf8", "data/preprocessed/feature_pku_v2.utf8", ref_dict)

Extracting features (v2) for data/training/pku_training.utf8 and output to data/preprocessed/feature_pku_v2.utf8...
Done. Total time taken 100 seconds


In [14]:
train_model_sklearn("data/preprocessed/feature_pku_v2.utf8", "data/preprocessed/tag_pku.utf8", "data/model/pku_sklearn_v2")

Loading training data..
Done
Start to train the model...
Done. Total time taken 844.114000082 seconds


In [15]:
def get_tags_and_output_to_file_v2(input_file_path, result_output_path, model_path, ref_dict):
    classifier = joblib.load(model_path, mmap_mode='r')
        
    with open(result_output_path, "w+") as output_file:
        count = 1
        for line in open(input_file_path, "r").readlines():
            line = line.strip().decode("utf-8")
            feature_list = extract_feature_for_sentence_v2(line, ref_dict)
            predicted_tags = get_best_tag_seq(feature_list, classifier)
            for tag in predicted_tags:
                output_file.writelines(tag + StringUtil.NEWLINE)

            # for evaluation purpose, only take first 1K sentences
            count += 1
            if count == 1000:
                break

In [16]:
get_tags_and_output_to_file_v2("data/training/msr_training.utf8", "data/result/msr_v2.utf8", "data/model/pku_sklearn_v2",
                               ref_dict)

print_model_accuracy("data/result/msr_v2.utf8", "data/preprocessed/tag_msr.utf8")

Accuracy: 0.920339146974


Comparing the accuracies:
- Model without external dictionary: **0.911795996002**
- Model with external dictionary: **0.920339146974**

The model with external dictionary has improved.

## Output Segmented Sentence

Output tags

In [17]:
def get_tags_for_sentence(sentence, classifier):
    feature_list = extract_feature_for_sentence(sentence)
    return get_best_tag_seq(feature_list, classifier)

Output the segmented result

In [18]:
def do_segmentation_for_partial_sentence(sentence, classifier):
    tags = get_tags_for_sentence(sentence, classifier)
    
    output = []
    total_len = len(tags)
    
    for i, t in enumerate(tags):
        output.append(sentence[i])
        if i < total_len -1:
            if t == 's' or t == 'e':
                output.append(StringUtil.SPACE)
                
    return StringUtil.EMPTY.join(output)

If the original sentence contains white spaces, take the spaces as natural segmentations

In [19]:
def do_segmentation_for_sentence(sentence, classifier):
    sentence = sentence.strip().decode("utf-8")
    segments = sentence.split()
    
    return StringUtil.SPACE.join([do_segmentation_for_partial_sentence(s, classifier) for s in segments])

### Quick test

Load a model

In [25]:
pku_classifier = joblib.load("data/model/pku_sklearn", mmap_mode='r')

Test file i/o with Chinese characters

In [23]:
result = do_segmentation_for_sentence("星期天我们去吃重庆火锅", pku_classifier)
with open("data/result/test_segmentation.utf8", "w+") as output_file:
    output_file.writelines(result.encode('utf-8'))
for line in open("data/result/test_segmentation.utf8", "r").readlines():
    print line

星期天 我们 去 吃 重庆 火锅


With V2 feature extraction:

In [20]:
def get_tags_for_sentence_v2(sentence, classifier, ref_dict):
    feature_list = extract_feature_for_sentence_v2(sentence, ref_dict)
    return get_best_tag_seq(feature_list, classifier)

In [21]:
def do_segmentation_for_partial_sentence_v2(sentence, classifier, ref_dict):
    tags = get_tags_for_sentence_v2(sentence, classifier, ref_dict)
    
    output = []
    total_len = len(tags)
    
    for i, t in enumerate(tags):
        output.append(sentence[i])
        if i < total_len -1:
            if t == 's' or t == 'e':
                output.append(StringUtil.SPACE)
                
    return StringUtil.EMPTY.join(output)

In [22]:
def do_segmentation_for_sentence_v2(sentence, classifier, ref_dict):
    sentence = sentence.strip().decode("utf-8")
    segments = sentence.split()
    
    return StringUtil.SPACE.join([do_segmentation_for_partial_sentence_v2(s, classifier, ref_dict) for s in segments])

#### Test for any sentence

In [26]:
ref_dict = load_dict("data/dict/processed.utf8")

In [27]:
v2_classifier = joblib.load("data/model/pku_sklearn_v2", mmap_mode='r')

In [29]:
print do_segmentation_for_sentence("苹果6S通用手机防尘塞金属", pku_classifier)
print do_segmentation_for_sentence_v2("苹果6S通用手机防尘塞金属", v2_classifier, ref_dict)

苹果 6S 通用 手机 防尘 塞 金属
苹果 6S 通用 手机 防尘塞 金属


In [30]:
print do_segmentation_for_sentence("男韩版宽松短袖T恤 圆领日系五分袖学生上衣", pku_classifier)
print do_segmentation_for_sentence_v2("男韩版宽松短袖T恤 圆领日系五分袖学生上衣", v2_classifier, ref_dict)

男韩版 宽松 短袖 T恤 圆领 日 系 五 分袖 学生 上衣
男 韩版 宽松 短袖 T恤 圆领 日系 五分袖 学生 上衣


In [31]:
print do_segmentation_for_sentence("美式复古豹子卡通短袖T恤潮男女原宿街头滑板宽松圆领体恤半袖TEE", pku_classifier)
print do_segmentation_for_sentence_v2("美式复古豹子卡通短袖T恤潮男女原宿街头滑板宽松圆领体恤半袖TEE", v2_classifier, ref_dict)

美式 复古 豹子 卡通 短袖 T恤潮 男女 原宿 街头 滑板 宽松 圆领 体 恤半袖TE E
美式 复古 豹子 卡通 短袖 T恤 潮 男女 原宿 街头 滑板 宽松 圆领 体恤 半袖 TEE


In [40]:
print do_segmentation_for_sentence("上海自来水来自海上", pku_classifier)
print do_segmentation_for_sentence_v2("上海自来水来自海上", v2_classifier, ref_dict)

上海 自来水 来自 海上
上海 自来水 来自 海上


### Post processing

Combine consecutive segments containing only English letters or digits into one segment.

For example: 
- **su n d i py** will be combined as a single word **sundipy**
- **01 235 8** will be combined as a single word **012358**

In [23]:
def post_processing(sentence):
    segments = sentence.split()
    results = []
    
    i = 0
    while i < len(segments):
        word = segments[i]
        if StringUtil.is_digit_or_letter(word):
            while i+1 < len(segments):
                if StringUtil.is_digit_or_letter(segments[i+1]):
                    word = word + segments[i+1]
                    i += 1
                else:
                    break
            results.append(word)
            
        else:
            results.append(word)
            
        i += 1
    
    return StringUtil.SPACE.join(results)

Quick test

In [46]:
print post_processing("EK T 男士 纯色 V 领 半袖 T恤 01 234 8".decode('utf-8'))

EKT 男士 纯色 V 领 半袖 T恤 012348


In [30]:
def do_segmentation_for_partial_sentence_v3(sentence, classifier, ref_dict):
    tags = get_tags_for_sentence_v2(sentence, classifier, ref_dict)
    
    output = []
    total_len = len(tags)
    
    for i, t in enumerate(tags):
        output.append(sentence[i])
        if i < total_len -1:
            if t == 's' or t == 'e':
                output.append(StringUtil.SPACE)
    
    return post_processing(StringUtil.EMPTY.join(output))

In [31]:
def do_segmentation_for_sentence_v3(sentence, classifier, ref_dict):
    sentence = sentence.strip().decode("utf-8")
    segments = sentence.split()
    
    return StringUtil.SPACE.join([do_segmentation_for_partial_sentence_v3(s, classifier, ref_dict) for s in segments])

In [32]:
print do_segmentation_for_sentence("sundipy男士拼接短袖T恤夏季男款韩版修身圆领休闲青年打底衫男潮", pku_classifier)
print do_segmentation_for_sentence_v3("sundipy男士拼接短袖T恤夏季男款韩版修身圆领休闲青年打底衫男潮", v2_classifier, ref_dict)

su n d i py 男士 拼接 短袖 T恤 夏季 男款 韩版 修身 圆领 休闲 青年 打底 衫 男潮
sundipy 男士 拼接 短袖 T恤 夏季 男款 韩版 修身 圆领 休闲 青年 打底衫 男潮


In [28]:
print do_segmentation_for_sentence("EKT男士纯色V领半袖T恤 韩版修身薄款体恤衫 冰丝无痕短袖夏装潮", pku_classifier)
print do_segmentation_for_sentence_v3("EKT男士纯色V领半袖T恤 韩版修身薄款体恤衫 冰丝无痕短袖夏装潮", v2_classifier, ref_dict)

EK T 男士 纯色 V领 半 袖 T恤 韩版 修身 薄款 体恤衫 冰丝无痕 短袖 夏 装潮
EKT 男士 纯色 V领 半袖T 恤 韩版 修身 薄款 体恤衫 冰丝 无痕 短袖 夏装潮


In [32]:
print do_segmentation_for_sentence("波司登男装短袖T恤夏装男士时尚休闲新款polo衫中青年免烫百搭", pku_classifier)
print do_segmentation_for_sentence_v3("波司登男装短袖T恤夏装男士时尚休闲新款polo衫中青年免烫百搭", v2_classifier, ref_dict)

波司 登 男装 短袖 T恤 夏装 男士 时尚 休闲 新款 po l o衫 中青年 免烫 百 搭
波司登 男装 短袖 T恤 夏装 男士 时尚休闲 新款 polo衫 中青年 免烫 百搭


In [24]:
print do_segmentation_for_sentence_v3("华为 Ascend P8(高配版) 快捷解锁， 超灵敏触摸屏， 超薄全铝合金", v2_classifier, ref_dict)

华为 Ascend P8 ( 高 配版 ) 快捷 解锁 ， 超灵 敏 触摸屏 ， 超薄 全 铝合金
