Chinese Word Segmentation
========================

In this notebook, I apply the *word boundary decision (WBD)* segmentation model from the paper [Huang, Chu-Ren, et al. "A realistic and robust model for Chinese word segmentation." arXiv preprint arXiv:1905.08732 (2019)](https://arxiv.org/pdf/1905.08732.pdf) to Chinese word segmentation.

I followed the method described in "ChineseWordSegmentation.pdf".

<div class="alert alert-block alert-success">
<b>Note:</b> All essential functions are defined in "util_funcs.py".
</div>
These functions are imported here.

In [1]:
# import
from scipy import sparse
import util_funcs as uf  # essential functions 

### Input

`n_gram`: number of consecutive characters in each window. For example, with `n_gram = 4`, one window is *树高遭雷*.

`vec_size`: size of the feature vector. In this approach, the feature vector for the window above is *\[树高, 高, 高遭, 遭, 遭雷\]*, i.e. the three bi-grams of the window and its two middle characters.

`Path2Data`: path to the directory containing the input data.

`FileTrain`/`FileTest`: names of the training and test files.
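
The mapping from a 4-character window to its 5-element feature vector can be sketched as follows. This is only an illustration of the example above; the helper names `window_features` and `sliding_windows` are hypothetical and are not taken from `util_funcs.py`, whose actual implementation is not shown in this notebook.

```python
def window_features(window):
    """Return the five uni-/bi-gram features of a 4-character window ABCD.

    The result is [AB, B, BC, C, CD], matching the example in the text.
    """
    if len(window) != 4:
        raise ValueError("expected a window of exactly 4 characters")
    a, b, c, d = window
    return [a + b, b, b + c, c, c + d]


def sliding_windows(text, n_gram=4):
    """Return every run of n_gram consecutive characters in text."""
    return [text[i:i + n_gram] for i in range(len(text) - n_gram + 1)]


features = window_features("树高遭雷")
print(features)
```

Each window then yields one row of features, so a line of `k` characters contributes `k - 3` rows when `n_gram = 4`.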

In [2]:
n_gram = 4 
vec_size = 5  
Path2Data = './data/'
FileTrain = 'training.txt'
FileTest = 'test.txt'

Note: There were some decoding errors in the files. I ignored the affected characters.
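
One way to skip undecodable bytes while reading is the `errors="ignore"` option of Python's built-in `open`. The sketch below assumes the files are UTF-8 encoded, which this notebook does not state, and the helper name `read_lines_ignoring_errors` is hypothetical.

```python
import os
import tempfile


def read_lines_ignoring_errors(path, encoding="utf-8"):
    """Read a text file, silently dropping bytes that cannot be decoded."""
    with open(path, encoding=encoding, errors="ignore") as handle:
        return [line.rstrip("\n") for line in handle]


# Demo on a temporary file that contains one invalid byte (0xff).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("树高".encode("utf-8") + b"\xff" + "遭雷\n".encode("utf-8"))
    demo_path = tmp.name

demo_lines = read_lines_ignoring_errors(demo_path)
os.remove(demo_path)
print(demo_lines)
```

Dropping characters silently shifts the character windows around the damaged position, so it may be worth counting how many lines are affected before relying on this.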



### Generating feature vector and labels of training and test dataset

`corpus`: A dictionary containing all the uni-grams and bi-grams of characters in the training dataset.

<div class="alert alert-block alert-success">
<b>Note:</b> I added an "oov" entry to the corpus to handle out-of-vocabulary n-grams in the test dataset. Whenever an unseen uni-gram or bi-gram shows up in the test dataset, the index of "oov" is used.
</div>

`X_train`, `X_test`: Feature vectors of the training/test dataset

`Y_train`, `Y_test`: Labels of training/test dataset. The values are 0 or 1.
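
The role of the "oov" entry can be illustrated with a minimal sketch. The function names `build_corpus` and `encode`, and the choice of index 0 for "oov", are assumptions made for this example; the actual layout used in `util_funcs.py` may differ.

```python
def build_corpus(feature_rows):
    """Map every distinct uni-/bi-gram seen in training to an integer index."""
    corpus = {"oov": 0}  # reserved slot for n-grams never seen in training
    for row in feature_rows:
        for gram in row:
            corpus.setdefault(gram, len(corpus))
    return corpus


def encode(row, corpus):
    """Replace each n-gram by its index, falling back to the "oov" index."""
    return [corpus.get(gram, corpus["oov"]) for gram in row]


train_rows = [["树高", "高", "高遭", "遭", "遭雷"]]
corpus = build_corpus(train_rows)

# A test row that shares only its first two n-grams with the training data.
encoded = encode(["树高", "高", "高大", "大", "大雷"], corpus)
print(encoded)
```

Since the real n-grams are one or two characters long, the three-character key "oov" cannot collide with any of them.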

In [3]:
# Training dataset
corpus, X_train, Y_train = uf.Train_XY(Path2Data, FileTrain)

# Save the sparse matrix so that later runs can skip feature extraction
sparse.save_npz("SparseTrain.npz", X_train)

# Load the sparse matrix for training
# X_train = sparse.load_npz("SparseTrain.npz")

# Test dataset
X_test, Y_test = uf.Test_XY(Path2Data, FileTest, corpus)
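
How `uf.Train_XY` derives the 0/1 labels is not shown in this notebook. One plausible sketch, assuming the training file stores words separated by whitespace and that each 4-character window is labeled by the gap between its two middle characters (1 for a word boundary, 0 otherwise), is given below. The helper names and the sample segmentation are illustrative only.

```python
def gap_labels(words):
    """Label every gap between adjacent characters of a segmented line.

    1 means the gap is a word boundary, 0 means it lies inside a word.
    """
    labels = []
    for position, word in enumerate(words):
        labels.extend([0] * (len(word) - 1))  # gaps inside the word
        if position < len(words) - 1:
            labels.append(1)                  # gap before the next word
    return labels


def window_labels(words, n_gram=4):
    """Return one label per n_gram window: the label of its middle gap."""
    gaps = gap_labels(words)
    n_chars = sum(len(word) for word in words)
    middle = n_gram // 2 - 1  # offset of the middle gap inside a window
    return [gaps[start + middle] for start in range(n_chars - n_gram + 1)]


words = "中文 分词 方法".split()  # illustrative segmented line
print(gap_labels(words))
print(window_labels(words))
```

Under these assumptions the number of labels matches the number of 4-character windows in the line, so features and labels stay aligned row by row.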

### Classification

Since `X_train` is very sparse, I used a Bernoulli Naive Bayes classifier to reduce the computation time.

In [4]:
# Classification based on Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
# binarize=None: the input is presumed to consist of binary features already
nb = BernoulliNB(binarize=None)
nb.fit(X_train, Y_train)


BernoulliNB(alpha=1.0, binarize=None, class_prior=None, fit_prior=True)
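
To make the classifier's input format concrete, here is a small self-contained sketch that builds a binary sparse matrix from index rows and fits `BernoulliNB` on it. The toy data and the indicator layout (a 1 in every column whose index appears in the row) are assumptions for illustration; the matrix produced by `uf.Train_XY` may be laid out differently.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import BernoulliNB

# Toy data: each row lists the vocabulary indices that are active for a sample.
index_rows = [[0, 2], [0, 3], [1, 2], [1, 3]]
labels = np.array([1, 1, 0, 0])
n_features = 4

# Build a binary CSR matrix with a 1 at every (sample, index) pair.
row_ids = np.repeat(np.arange(len(index_rows)), [len(r) for r in index_rows])
col_ids = np.concatenate(index_rows)
data = np.ones(len(col_ids))
X = sparse.csr_matrix((data, (row_ids, col_ids)),
                      shape=(len(index_rows), n_features))

# The matrix is already binary, so no thresholding is requested.
clf = BernoulliNB(binarize=None)
clf.fit(X, labels)

pred = clf.predict(X[[0, 3]])  # first and last training samples
print(pred)
```

In this toy set, feature 0 only occurs with label 1 and feature 1 only with label 0, so the classifier recovers the labels of the two queried rows.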

### Results

In [6]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

Y_predict = nb.predict(X_test)

print(f'Accuracy: {accuracy_score(Y_test, Y_predict)*100}% \n')

print(f'f1 score: {f1_score(Y_test, Y_predict)*100}% \n')

print(f'precision score: {precision_score(Y_test, Y_predict)*100}% \n')

print(f'recall score: {recall_score(Y_test, Y_predict)*100}% \n')

Accuracy: 88.82034132841329% 

f1 score: 94.0643462821808% 

precision score: 90.16431924882629% 

recall score: 98.31701542202597% 
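
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two numbers above. The helper name `f1_from` is introduced here only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the output above, converted from percent to fractions.
precision = 0.9016431924882629
recall = 0.9831701542202597

f1 = f1_from(precision, recall)
print(f"f1 score: {f1 * 100:.4f}%")
```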

