# Chinese Word Segmentation

Segment a standardly written Chinese sentence into words.

For example, given the following Chinese sentence:

**华为Mate9双摄2x变焦新全网通**

The segmenter produces the following segmentation:

**华为&nbsp;&nbsp;Mate9&nbsp;&nbsp;双摄&nbsp;&nbsp;2x&nbsp;&nbsp;变焦&nbsp;&nbsp;新&nbsp;&nbsp;全网通**

In [4]:
# python libs (this notebook targets Python 2)
from sklearn.externals import joblib  # removed in scikit-learn 0.23; use `import joblib` there
#from __future__ import division

# in-house libs
import tagger
#reload(tagger)
tagger = tagger.Tagger()

import feature_extractor
#reload(feature_extractor)
featureExtractor = feature_extractor.FeatureExtractor()

import maxent_model
#reload(maxent_model)
model = maxent_model.MaxEntModel()

import segmentor
#reload(segmentor)

## Tag Training Data

Assign a tag to every character of a sentence. The input is a segmented sentence, and the output is a list of tags, one per character.

Types of tags:
- **s** single character as a word
- **b** beginning of a word
- **m** middle of a word
- **e** end of a word

Punctuation marks (either English or Chinese) are tagged as **s**.

For example, the example sentence is tagged as follows:

**华 为&nbsp;&nbsp;&nbsp;&nbsp;M a&nbsp;&nbsp;t&nbsp;&nbsp;e&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;&nbsp;双 摄&nbsp;&nbsp;&nbsp;&nbsp;2 x&nbsp;&nbsp;&nbsp;&nbsp;变 焦&nbsp;&nbsp;&nbsp;&nbsp;新&nbsp;&nbsp;&nbsp;&nbsp;全 网 通**

**&nbsp;&nbsp;b e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b m m m e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b e&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b m e**
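The tagging rule above can be sketched in a few lines. This is a minimal illustration, not the in-house `tagger.Tagger` (whose code is not shown here): the names `tag_word` and `tag_sentence` are hypothetical, words are assumed to be separated by whitespace, and punctuation is detected through the Unicode general category, which is an assumption about how "punctuation" is defined.

```python
import unicodedata


def is_punctuation(ch):
    # Assumption: any character whose Unicode category starts with "P"
    # (covers both English and Chinese punctuation marks).
    return unicodedata.category(ch).startswith("P")


def tag_word(word):
    """Return the s/b/m/e tags for the characters of one word."""
    if len(word) == 1:
        return ["s"]
    return ["b"] + ["m"] * (len(word) - 2) + ["e"]


def tag_sentence(segmented):
    """Tag a whitespace-segmented sentence, one tag per character."""
    tags = []
    for token in segmented.split():
        run = ""
        for ch in token:
            if is_punctuation(ch):
                # Close the word collected so far, then tag the mark as "s".
                if run:
                    tags.extend(tag_word(run))
                    run = ""
                tags.append("s")
            else:
                run += ch
        if run:
            tags.extend(tag_word(run))
    return tags


print(" ".join(tag_sentence("华为 Mate9 双摄 2x 变焦 新 全网通")))
```

A punctuation mark attached to a word is split off first, so `你好，世界` is tagged `b e s b e`.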

In [2]:
tagger.tag_for_file("data/training/data.utf8", "data/preprocessed/tag.utf8")

Tagging for data/training/data.utf8 and output to data/preprocessed/tag.utf8...
Done. Total time taken 41 seconds


## Extract Features for Training Data

Extract features for every character of a sentence. With a window size of 5, where $C_{0}$ is the current character and $C_{n}$ is the character at offset $n$ from it, the features are defined as follows:

- **(a)** $C_{n}$, where $n$ is from -2 to 2
- **(b)** $C_{n}C_{n+1}$, where $n$ is from -2 to 1
- **(c)** $C_{-1}C_{1}$
- **(d)** $Pu$, a boolean value (0 or 1) indicating whether the current character is a punctuation mark
- **(e)** $T(C_{-2})T(C_{-1})T(C_{0})T(C_{1})T(C_{2})$, the sequence of character types over the window

Sentence boundary: the start-of-sentence character is defined as **< s >**, while the end-of-sentence character is defined as **< /s >**

Character types are defined as follows:
- **0** sentence boundary
- **1** number char
- **2** date char
- **3** English letter 
- **4** others
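Features (a) to (e) can be sketched as below. This is an illustration under stated assumptions, not the in-house `feature_extractor` module: the function names, the `name=value` feature strings, the padding with two boundary symbols on each side, and the `DATE_CHARS` set are all hypothetical choices (the list of date characters is not given above).

```python
import string
import unicodedata

BOS, EOS = "<s>", "</s>"           # sentence-boundary symbols
DATE_CHARS = set("年月日时分秒")     # assumption: the actual date chars are not listed
LETTERS = set(string.ascii_letters)


def char_type(c):
    """Map a character (or boundary symbol) to its type id, 0 to 4."""
    if c in (BOS, EOS):
        return "0"
    if c.isdigit():
        return "1"
    if c in DATE_CHARS:
        return "2"
    if c in LETTERS:
        return "3"
    return "4"


def is_punctuation(c):
    # Assumption: Unicode general category "P*" counts as punctuation.
    return unicodedata.category(c).startswith("P")


def extract_features(chars, i):
    """Return features (a)-(e) for the character at index i (window size 5)."""
    padded = [BOS, BOS] + list(chars) + [EOS, EOS]
    c = {n: padded[i + 2 + n] for n in range(-2, 3)}  # c[n] is C_n
    feats = ["C%d=%s" % (n, c[n]) for n in range(-2, 3)]                 # (a)
    feats += ["C%dC%d=%s%s" % (n, n + 1, c[n], c[n + 1])
              for n in range(-2, 2)]                                     # (b)
    feats.append("C-1C1=%s%s" % (c[-1], c[1]))                           # (c)
    feats.append("Pu=%d" % is_punctuation(c[0]))                         # (d)
    feats.append("T=" + "".join(char_type(c[n]) for n in range(-2, 3)))  # (e)
    return feats


for feature in extract_features("Mate9双", 4):
    print(feature)
```

For the character `9` in `Mate9双`, the window is `t e 9 双 </s>`, so feature (e) is the type sequence `33140`.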

Additional features use an external dictionary. For the current character, we look for the longest word in the dictionary that contains it.

Let $W$ be the matched word containing $C_{0}$, $t_{0}$ be the tag of $C_{0}$ in $W$, and $L$ be the length of $W$. Then we can add the following additional features:
- **(f)** $Lt_{0}$
- **(g)** $C_{n}t_{0}$, where $n$ = -1, 0, 1
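A minimal sketch of the dictionary lookup and of features (f) and (g) follows. The helper names, the `max_len` cap, the tie-breaking (leftmost match among equally long words), and the fallback for characters with no dictionary match (treated as a one-character word tagged `s`) are assumptions made for illustration; the in-house extractor may handle these cases differently.

```python
BOS, EOS = "<s>", "</s>"


def position_tag(offset, length):
    """Tag of the character at `offset` inside a word of `length` characters."""
    if length == 1:
        return "s"
    if offset == 0:
        return "b"
    if offset == length - 1:
        return "e"
    return "m"


def longest_match(chars, i, dictionary, max_len=8):
    """Return (L, t0) for the longest dictionary word covering chars[i]."""
    for length in range(min(max_len, len(chars)), 0, -1):
        first = max(0, i - length + 1)
        last = min(i, len(chars) - length)
        for start in range(first, last + 1):
            if chars[start:start + length] in dictionary:
                return length, position_tag(i - start, length)
    return 1, "s"  # assumption: no match is treated as a single-character word


def dictionary_features(chars, i, dictionary):
    """Return features (f) and (g) for the character at index i."""
    length, t0 = longest_match(chars, i, dictionary)
    prev_c = chars[i - 1] if i > 0 else BOS
    next_c = chars[i + 1] if i + 1 < len(chars) else EOS
    return [
        "Lt0=%d%s" % (length, t0),      # (f)
        "C-1t0=%s%s" % (prev_c, t0),    # (g), n = -1
        "C0t0=%s%s" % (chars[i], t0),   # (g), n = 0
        "C1t0=%s%s" % (next_c, t0),     # (g), n = 1
    ]


words = {"华为", "全网通", "网通"}
print(dictionary_features("华为新全网通", 4, words))
```

Here `网` is covered by both `网通` and `全网通`; the longer word wins, so $L = 3$ and $t_{0}$ is `m`.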

In [3]:
featureExtractor.extract_feature_for_file("data/training/data.utf8", "data/preprocessed/feature.utf8")

Extracting features (v2) for data/training/data.utf8 and output to data/preprocessed/feature.utf8...
Done. Total time taken 351 seconds


## Train MaxEnt Model

Train the model and save it to a file.

In [4]:
model.train("data/preprocessed/feature.utf8", "data/preprocessed/tag.utf8", "data/model/model")

Loading training data..
Done
Start to train the model...
Done. Total time taken 7751.04100013 seconds
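For readers unfamiliar with MaxEnt: for this kind of per-character classification it amounts to multinomial logistic regression over the binary features extracted earlier. The toy trainer below only illustrates that idea with plain stochastic gradient ascent. It is not the in-house `maxent_model` implementation, whose optimizer, regularization, and file formats are not shown here; all names and hyperparameters are hypothetical, and saving the model to disk is left out.

```python
import math
from collections import defaultdict

TAGS = ["s", "b", "m", "e"]


def predict_proba(weights, feats):
    """Softmax over the summed weights of the active (tag, feature) pairs."""
    scores = [sum(weights.get((t, f), 0.0) for f in feats) for t in TAGS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {t: e / total for t, e in zip(TAGS, exps)}


def train_maxent(samples, epochs=20, lr=0.5):
    """samples: list of (feature list, gold tag) pairs, one per character."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in samples:
            probs = predict_proba(weights, feats)
            for t in TAGS:
                # Gradient of the log-likelihood: observed minus expected count.
                grad = (1.0 if t == gold else 0.0) - probs[t]
                for f in feats:
                    weights[(t, f)] += lr * grad
    return weights


def predict(weights, feats):
    probs = predict_proba(weights, feats)
    return max(TAGS, key=lambda t: probs[t])


toy = [(["C0=新"], "s"), (["C0=华"], "b"), (["C0=网"], "m"), (["C0=通"], "e")]
toy_weights = train_maxent(toy)
print(predict(toy_weights, ["C0=网"]))
```

The per-tag probabilities returned by `predict_proba` are the quantities a decoder needs when it has to choose among tag sequences.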


## Use Model to do Segmentation

When using the model to tag a sentence, we need to make sure that the resulting tag sequence is valid.

List of invalid tag transitions:
- s -> m
- s -> e
- b -> s
- b -> b
- m -> s
- m -> b
- e -> m
- e -> e

Also, a sentence should never start with tag **m** or **e**.
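One common way to enforce such constraints is a Viterbi search over the per-character tag probabilities, which keeps only paths built from valid transitions. The sketch below assumes the model yields a probability for each of the four tags at each position; whether the in-house `segmentor` works this way is not shown, so the function name and input format are hypothetical. Two choices deserve a note: `m -> m` is treated as valid, since a word such as Mate9 (b m m m e) needs consecutive **m** tags, and the rule that a sentence must end with **s** or **e** is an added assumption.

```python
import math

TAGS = "sbme"
# Allowed successors of each tag; every other transition is invalid.
ALLOWED_NEXT = {"s": "sb", "b": "me", "m": "me", "e": "sb"}
ALLOWED_START = "sb"  # a sentence never starts with m or e
ALLOWED_END = "se"    # assumption: a sentence never ends inside a word


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def best_valid_tags(probs):
    """probs: one dict per character, mapping each tag to its probability."""
    if not probs:
        return []
    score = {t: (_log(probs[0][t]) if t in ALLOWED_START else float("-inf"))
             for t in TAGS}
    back = []  # back[k][t] = best predecessor of tag t at position k + 1
    for dist in probs[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            prev = max((p for p in TAGS if t in ALLOWED_NEXT[p]),
                       key=lambda p: score[p])
            new_score[t] = score[prev] + _log(dist[t])
            pointers[t] = prev
        score = new_score
        back.append(pointers)
    last = max(ALLOWED_END, key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


example = [
    {"s": 0.1, "b": 0.3, "m": 0.5, "e": 0.1},
    {"s": 0.1, "b": 0.1, "m": 0.2, "e": 0.6},
    {"s": 0.7, "b": 0.1, "m": 0.1, "e": 0.1},
]
print(best_valid_tags(example))
```

Picking the most probable tag independently at each position would give `m e s` for this input, which starts with an invalid tag; the constrained search returns the best valid sequence instead.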

Load the trained model

In [5]:
segmentor = segmentor.Segmentor(joblib.load("data/model/model", mmap_mode='r'))

Do segmentation for sentences in a file

In [9]:
segmentor.do_segmentation_for_file("data/test/taobao.utf8", "data/result/taobao_segmentation.utf8")

Done. Total time taken 0 seconds
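Once every character has a valid tag, the segmented sentence follows by cutting after each **s** or **e** tag. A minimal sketch, with a hypothetical function name and with the tags from the tagging example written out by hand rather than predicted by the model:

```python
def tags_to_words(chars, tags):
    """Group characters into words, closing a word after every s or e tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("s", "e"):
            words.append(current)
            current = ""
    if current:  # keep a trailing unfinished word rather than dropping it
        words.append(current)
    return words


sentence = "华为Mate9双摄2x变焦新全网通"
tags = list("be" "bmmme" "be" "be" "be" "s" "bme")
print(" ".join(tags_to_words(sentence, tags)))
```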


## Demo

Do segmentation for a sentence

In [7]:
print segmentor.do_segmentation_for_sentence("华为Mate9双摄2x变焦新全网通")

华为 Mate9 双摄 2x 变焦 新 全网通


In [8]:
print segmentor.do_segmentation_for_sentence("男韩版宽松短袖T恤 圆领日系五分袖学生上衣")

男 韩版 宽松 短袖 T恤 圆领 日系 五分袖 学生 上衣


In [8]:
print segmentor.do_segmentation_for_sentence("EKT男士纯色V领半袖T恤 韩版修身薄款体恤衫 冰丝无痕短袖夏装")

EKT 男士 纯色 V领 半袖 T恤 韩版 修身 薄款 体恤衫 冰丝 无痕 短袖 夏装 潮


In [16]:
print segmentor.do_segmentation_for_sentence("林珊珊 2017新款夏装圆领纯色显瘦内搭打底衫修身针织")

林珊珊 2017 新款 夏装 圆领 纯色 显瘦 内搭 打底衫 修身 针织


In [26]:
print segmentor.do_segmentation_for_sentence("日系潮牌 军事风 暗黑迷彩男士休闲POLO衫 纯棉宽松翻领")

日系 潮牌 军事 风 暗黑 迷彩 男士 休闲 POLO衫 纯棉 宽松 翻领
