# A semi-supervised combined word tokenizer-POS tagger for Hmong

This post introduces a semi-supervised approach to word tokenization and POS tagging that enables support for resource-poor languages.

The Hmong language is a resource-poor language [1] for which corpora of POS-tagged data were previously unavailable, precluding supervised approaches. At the same time, the Hmong language has an unusually high number of homonyms and features syllable-based spacing in its orthography, meaning that widespread ambiguity will create serious problems for unsupervised approaches. A semi-supervised approach is in order.

The approach featured here follows a relatively unusual strategy: combining word tokenization and POS tagging as a single step. Because Hmong has an orthography where spaces are placed between _syllables_ rather than words, word tokenization will be potentially non-trivial. However, a much more prominent language, Vietnamese, has the same issue, yet unlike Hmong, it is a relatively resource-rich language. This means that, with the relevant adaptations to handle a resource-poor language, approaches that work with Vietnamese should prove useful. One of these approaches is in fact combining word tokenization and POS tagging [2][3].

In this approach, word tokenization is combined with POS tagging as a sequence-labeling task where position in the word is handled using IOB tags, where B marks the first syllable of the word, I marks all other syllables of the word, and O marks everything that is not a word. Here, I combine these with POS tags using a hyphen, so that the first syllable of a noun is B-NN and the second syllable of a verb is I-VV.
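
To make the tag scheme concrete, here is a minimal sketch of how a tagged syllable sequence could be merged back into words. The function name `merge_syllables` and the example tags are illustrative assumptions and are not taken from the corpus; the sketch also assumes that an `I` tag always continues the immediately preceding token with the same POS.

```python
def merge_syllables(syllables, tags):
    """Group space-delimited syllables into (word, POS) pairs using position-POS tags."""
    words = []
    for syllable, tag in zip(syllables, tags):
        position, pos = tag.split('-', 1)
        if position == 'I' and words and words[-1][1] == pos:
            # continuation of the word currently being built
            words[-1][0].append(syllable)
        else:
            # 'B' starts a new word; 'O' (e.g. punctuation) stands alone;
            # a stray 'I' with nothing to attach to is treated like 'B'
            words.append([[syllable], pos])
    return [(' '.join(parts), pos) for parts, pos in words]


# hypothetical tagging of four space-delimited tokens
example_syllables = ['cov', 'kab', 'mob', '.']
example_tags = ['B-CL', 'B-NN', 'I-NN', 'O-PU']
merged = merge_syllables(example_syllables, example_tags)
print(merged)  # [('cov', 'CL'), ('kab mob', 'NN'), ('.', 'PU')]
```

A single predicted tag per syllable therefore carries both the word boundary and the part of speech, which is what allows the two tasks to be handled in one step.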

In my approach here, I use pretrained word embeddings. Though Hmong is a resource-poor language, the Internet has proven popular with Hmong speakers, meaning that speakers have produced thousands of forum posts on the soc.culture.hmong listserv over the past 20 years or so. These have been organized into the approximately 12-million token SCH corpus, which is available for free download here: http://listserv.linguistlist.org/pipermail/my-hm/2015-May/000028.html.

These pretrained word embeddings are created through Word2Vec and loaded as an embedding layer into a Keras-based BiLSTM model. The BiLSTM model is well suited to the word tokenization/POS tagging task, as it is designed for handling sequences where individual output values depend on neighboring values in both directions.

The model is trained on a set of eight documents—approximately 6000 (actual) words—fully tagged with the combined word position-POS tags mentioned above.

Let's begin by importing the relevant libraries.

#### Import libraries.

In [1]:
import os
import sqlite3
from itertools import groupby
import numpy as np
from pandas import DataFrame

from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense, InputLayer, Embedding, TimeDistributed, Activation
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.optimizers import Adam

Using TensorFlow backend.


#### Load existing database with POS-tagged words.

Next, we navigate to the folder containing the database file and load the database using sqlite3.

In [2]:
os.chdir(os.path.expanduser('~/python_workspace/medical_corpus_scripting/corpus/hminterface/static/hminterface'))
conn = sqlite3.Connection('hmcorpus.db')
crsr = conn.cursor()

#### Retrieve tags from database.

Next, we retrieve all of the tag types from the database using SQL and create a dictionary that converts the tags to indices that can be used in the Keras model. The result is a unique index for each combination of word position IOB tag and POS tag that is actually attested in the corpus database to date.

In [3]:
query = """SELECT DISTINCT loc, pos_label FROM types
JOIN word_loc ON word_loc.ind=types.word_loc
JOIN pos ON pos.ind=types.pos_type;"""

# set the padding tag combination first, then add tag combinations from database
tag_combinations = [('O', 'PAD')]
tag_combinations += crsr.execute(query).fetchall()

tag_indices = {'-'.join(t): i for i, t in enumerate(tag_combinations)}
print(tag_indices.items())

dict_items([('O-PAD', 0), ('B-CL', 1), ('B-NN', 2), ('O-PU', 3), ('B-FW', 4), ('B-VV', 5), ('B-PP', 6), ('I-NN', 7), ('B-QU', 8), ('I-CL', 9), ('B-LC', 10), ('I-VV', 11), ('B-AD', 12), ('B-DT', 13), ('B-CC', 14), ('I-CC', 15), ('B-CV', 16), ('I-AD', 17), ('B-RL', 18), ('B-CS', 19), ('B-PN', 20), ('I-CS', 21), ('I-FW', 22), ('B-NR', 23), ('I-NR', 24), ('I-PU', 25), ('B-PU', 26), ('B-CM', 27), ('B-ON', 28), ('I-QU', 29), ('I-PN', 30), ('B-JJ', 31)])
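
Since the model will output tag indices rather than tag strings, the mapping eventually has to be inverted. The following is a small sketch of that step, using only a subset of the mapping printed above; the names `index_tags` and `decode_tags` are hypothetical helpers, not part of the original notebook.

```python
# a small subset of the tag-to-index mapping printed above
tag_indices = {'O-PAD': 0, 'B-CL': 1, 'B-NN': 2, 'O-PU': 3, 'I-NN': 7}

# invert the mapping so that indices can be turned back into tag strings
index_tags = {index: tag for tag, index in tag_indices.items()}


def decode_tags(indices, pad_tag='O-PAD'):
    """Convert a sequence of tag indices to tag strings, dropping padding tags."""
    tags = [index_tags[i] for i in indices]
    return [t for t in tags if t != pad_tag]


decoded = decode_tags([1, 2, 7, 3, 0, 0])
print(decoded)  # ['B-CL', 'B-NN', 'I-NN', 'O-PU']
```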


#### Retrieve word tokens and tags as numerical codes.

The database is organized such that each "word" (i.e., syllable or punctuation demarcated by spaces) type is assigned its own index in the table `types`. This means that a dataframe can be created using the database data to convert between indices and word types.

In [4]:
query = """SELECT ind, type_form FROM types;"""
word_index_list = crsr.execute(query).fetchall()

# Visualize data
index_words = DataFrame(data=word_index_list, columns=['Index', 'Word_Type'])
index_words.set_index('Index', inplace=True)
print(index_words.head(15))

         Word_Type
Index             
1              tus
2              mob
3                –
4      shigellosis
5          disease
6             fact
7            sheet
8           series
9              zoo
10              li
11             cas
12               ?
13             yog
14              ib
15             tug


The following retrieves the word indices from the eight documents stored in the corpus database that we are going to use, and uses the `itertools.groupby` function to organize them in sequence as a list of sentence lists.

In [5]:
query = """SELECT doc_ind, sent_ind, word_type_ind, loc, pos_label FROM tokens
JOIN types ON tokens.word_type_ind=types.ind
JOIN word_loc ON word_loc.ind=types.word_loc
JOIN pos ON pos.ind=types.pos_type
WHERE doc_ind<=8;"""
query_words = crsr.execute(query).fetchall()
sentences_list = []
tags_list = []
for k, g in groupby(query_words, lambda x: (x[0], x[1])):
    sent = list(g)
    sentences_list.append([(w[2],) for w in sent])
    tags_list.append([(tag_indices['-'.join(w[3:])],) for w in sent])
    
# print the second sentence as a word type index sequence
print(sentences_list[1])
# print the sentence as a word type sequence, using index_words dataframe from above
print(list(index_words.at[word[0], 'Word_Type'] for word in sentences_list[1]))

[(4,), (13,), (14,), (15,), (2,), (16,), (17,), (18,), (19,), (20,), (21,), (16,), (22,)]
['shigellosis', 'yog', 'ib', 'tug', 'mob', 'los', 'ntawm', 'cov', 'kab', 'mob', 'bacteria', 'los', '.']
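
One detail worth keeping in mind is that `itertools.groupby` only merges consecutive rows that share a key, so the loop above relies on the query returning rows already ordered by document and sentence. The sketch below illustrates this with hypothetical rows in the same column order as the query; the row values and tags are made up for illustration.

```python
from itertools import groupby

# hypothetical rows: (doc_ind, sent_ind, word_type_ind, loc, pos_label)
rows = [
    (1, 1, 1, 'B', 'CL'),
    (1, 1, 2, 'B', 'NN'),
    (1, 2, 4, 'B', 'FW'),
    (1, 2, 13, 'B', 'VV'),
    (2, 1, 14, 'B', 'QU'),
]


def sentence_key(row):
    """Rows belong to the same sentence when document and sentence index match."""
    return (row[0], row[1])


# Python's sort is stable, so token order within each sentence is preserved
rows.sort(key=sentence_key)

sentences = [[row[2] for row in group] for _, group in groupby(rows, key=sentence_key)]
print(sentences)  # [[1, 2], [4, 13], [14]]
```

If the rows were not contiguous by sentence, `groupby` would silently produce several fragments for the same sentence, so making the ordering explicit is a cheap safeguard.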


#### Handling padding and out-of-vocabulary items.

The Keras model we will use below requires each element in the training input to have the same number of tokens. This means that we will need to pad every sentence that is not as long as the longest sentence in the training set. We can achieve this by adding a `0` index value to our `index_words` dataframe.

Likewise, in testing and production we will inevitably run into items that are not in the vocabulary used in training the model. This can be handled by adding a row in the `index_words` dataframe with an index value beyond the current maximum for the value "out of vocabulary." This ensures compatibility with the existing database values.

In [6]:
index_words.loc[0] = ['$PAD']
index_words.loc[index_words.index.max() + 1] = ['$OUT']
print(index_words.tail())

              Word_Type
Index                  
951    electromyography
952                 emg
953                 tom
0                  $PAD
954                $OUT


#### Split data into training and testing.

Here, we split the data into training and testing components using `sklearn.model_selection.train_test_split`. `train_test_split` splits the sentences randomly, so the training and testing portions will both contain portions of all eight documents. This means that the testing portion of the data will provide a clear indication as to whether training the model below has been successful, but we will still need to test it again later on a fully unseen document. Here, we split the data based on a common threshold: 20% of the sentences for testing and 80% for training. 

In [9]:
X_train, X_test, y_train, y_test = train_test_split(sentences_list, tags_list, test_size=0.2)

#### Replacing X_test terms.

Because we will train the model below on the `X_train` set created above, word type indices that appear in `X_test` but not in `X_train` will create trouble for the model: they point to embedding rows that the model has not been trained to process. We handle this here by replacing these indices with the out-of-vocabulary value, which is equal to `index_words.index.max()`.

In [10]:
words = set(word_value for sent in X_train for word_value in sent)

pre_sample_sentence_index = 10
X_test_pre_sample = X_test[pre_sample_sentence_index]
X_test = [[word_value if word_value in words else (index_words.index.max(),) for word_value in sent] \
          for sent in X_test]

print('Original words: ', list(index_words.at[ind[0], 'Word_Type'] for ind in X_test_pre_sample))
print('Before out-of-vocabulary conversion: ', X_test_pre_sample)
print('After out-of-vocabulary conversion:  ', X_test[pre_sample_sentence_index])

Original words:  ['*', 'qees', 'tus', 'neeg', 'uas', 'muaj', 'hom', 'kab', 'mob', 'tb', 'no', 'yuav', 'kis', 'tau', 'rau', 'lwm', 'leej', 'lwm', 'tus', '.']
Before out-of-vocabulary conversion:  [(539,), (787,), (383,), (29,), (80,), (23,), (253,), (19,), (20,), (782,), (32,), (69,), (70,), (71,), (26,), (149,), (427,), (788,), (383,), (22,)]
After out-of-vocabulary conversion:   [(539,), (954,), (383,), (29,), (80,), (23,), (253,), (19,), (20,), (782,), (32,), (69,), (70,), (71,), (26,), (149,), (427,), (954,), (383,), (22,)]
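
A quick sanity check at this point is to measure how much of the test data has been replaced by the out-of-vocabulary value. The helper `oov_rate` below is a hypothetical addition; it is applied here to the converted sample sentence printed above, flattened to plain integers.

```python
OUT_INDEX = 954  # index of the '$OUT' row added to index_words


def oov_rate(sentences, out_index=OUT_INDEX):
    """Fraction of tokens in a list of index sequences that are out-of-vocabulary."""
    tokens = [index for sent in sentences for index in sent]
    return sum(1 for index in tokens if index == out_index) / len(tokens)


# the sample sentence after out-of-vocabulary conversion, as printed above
sample_after = [539, 954, 383, 29, 80, 23, 253, 19, 20, 782,
                32, 69, 70, 71, 26, 149, 427, 954, 383, 22]

rate = oov_rate([sample_after])
print(rate)  # 0.1
```

For this sentence, two of the twenty tokens are replaced. A high rate over the whole test set would suggest that the training vocabulary is too small for the evaluation to be informative.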


#### Padding sentences.

Next, we need to pad the sentences such that each sentence has the same length. We do this by finding the longest sentence by tokens in `X_train` and using this as the `maxlen` parameter of `keras.preprocessing.sequence.pad_sequences` for each of the four arrays (`X_train`, `y_train`, `X_test`, and `y_test`).

In [11]:
LEN_MAX = len(max(X_train, key=len))

X_train = pad_sequences([[w[0] for w in line] for line in X_train], maxlen=LEN_MAX, padding='post')
y_train = pad_sequences(y_train, maxlen=LEN_MAX, padding='post')

X_test = pad_sequences([[w[0] for w in line] for line in X_test], maxlen=LEN_MAX, padding='post')
y_test = pad_sequences(y_test, maxlen=LEN_MAX, padding='post')
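
For readers unfamiliar with `pad_sequences`, the following pure-Python sketch mimics what the calls above do to a single sequence. The function `pad_post` is a hypothetical stand-in written for illustration; it assumes the Keras defaults of a padding value of `0` and `truncating='pre'`.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with the default truncating='pre'."""
    seq = list(seq)[-maxlen:]  # sequences longer than maxlen lose their first items
    return seq + [value] * (maxlen - len(seq))  # shorter ones are padded at the end


short = pad_post([4, 13, 14], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6], maxlen=5)
print(short)  # [4, 13, 14, 0, 0]
print(long)   # [2, 3, 4, 5, 6]
```

Because `LEN_MAX` is taken from `X_train` only, any longer sentence in the test data would be truncated at the front in this way rather than raising an error.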

#### Load the pretrained word embedding model.

Now, we can load the Word2Vec word embedding model pretrained on the SCH corpus.

In [12]:
word2vec_model = Word2Vec.load('word2vec_Hmong_SCH.model')

#### Populate embedding matrix.

The embedding matrix in our Keras model below will use the word embedding vectors from the Word2Vec model above. However, we want to populate our embedding matrix using only those vectors that correspond to our training set. We create a matrix that can contain the full number of word indices in the database vocabulary, plus padding and out-of-vocabulary values. We then populate the matrix with the word embeddings at row positions corresponding to the word indices.

In [13]:
maximum_vocab_size = index_words.index.max() + 1
embedding_matrix = np.zeros((maximum_vocab_size, 150))
for ind in words:
    try:
        embedding_vector = word2vec_model.wv[index_words.at[ind[0], 'Word_Type']]
    except KeyError as e:
        embedding_vector = None
    if embedding_vector is not None:
        embedding_matrix[ind[0]] = embedding_vector
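
Word types that are missing from the Word2Vec vocabulary keep an all-zero row in the matrix, so it can be useful to check how many there are. The sketch below shows the idea with small stand-in dictionaries in place of `word2vec_model.wv` and `index_words`; all of the toy values are assumptions for illustration. With the real model, `word in word2vec_model.wv` performs the same membership test.

```python
# stand-in for word2vec_model.wv: maps a word to its vector (3-dimensional here)
toy_vectors = {'tus': [0.1, 0.2, 0.3], 'mob': [0.4, 0.5, 0.6]}
# stand-in for the word types seen in training (index -> word)
toy_index_words = {1: 'tus', 2: 'mob', 4: 'shigellosis'}

# word types with no pretrained vector; their embedding rows stay at zero
missing = sorted(w for w in toy_index_words.values() if w not in toy_vectors)
coverage = 1 - len(missing) / len(toy_index_words)

print(missing)  # ['shigellosis']
print(round(coverage, 2))  # 0.67
```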

#### Create Keras model.

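Now, we create the Keras model. We use the `Sequential` model type, since the network is a simple linear stack of layers: an embedding layer, a bidirectional LSTM, and a time-distributed dense layer followed by a softmax activation.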

We use the `weights` parameter of `Embedding` to input the word embedding matrix we just created above.

We then compile the model using categorical cross-entropy as a loss, and Adam as an optimizer.

In [14]:
model = Sequential()
model.add(InputLayer(input_shape=(LEN_MAX, )))
model.add(Embedding(maximum_vocab_size, 150, weights=[embedding_matrix], trainable=False))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag_indices))))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 93, 150)           143250    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 93, 512)           833536    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 93, 32)            16416     
_________________________________________________________________
activation_1 (Activation)    (None, 93, 32)            0         
Total params: 993,202
Trainable params: 849,952
Non-trainable params: 143,250
_________________________________________________________________
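
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The arithmetic below assumes the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias); the variable names are chosen for this sketch only.

```python
vocab_size, embed_dim = 955, 150  # rows in the embedding matrix, Word2Vec vector size
lstm_units, n_tags = 256, 32      # LSTM units per direction, number of tag combinations

# one embedding vector per vocabulary row (frozen, hence non-trainable)
embedding_params = vocab_size * embed_dim

# per direction: 4 gates x (input weights + recurrent weights + bias) x units
lstm_params_one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
bilstm_params = 2 * lstm_params_one_direction

# the dense layer sees the concatenated forward and backward outputs (2 x 256 = 512)
dense_params = (2 * lstm_units) * n_tags + n_tags

trainable = bilstm_params + dense_params
total = embedding_params + trainable
print(embedding_params, bilstm_params, dense_params)  # 143250 833536 16416
print(total, trainable)                                # 993202 849952
```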


#### Train the model.

Now we train the model on `X_train`, with `y_train` converted to one-hot vectors using `keras.utils.np_utils.to_categorical`. We train for 50 epochs with a batch size of 16, and set aside 20% of our training set for validation, leaving the rest for training.

In [15]:
model.fit(X_train, to_categorical(y_train, num_classes=max(tag_indices.values()) + 1), batch_size=16, epochs=50, validation_split=0.2)

Train on 224 samples, validate on 57 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7efc20bca128>

#### Evaluate model on test set.

Now we use `evaluate` to measure the accuracy of the model on the test set. As mentioned above, the test set contains sentences from the same documents as the training set, so the results are likely to be higher than on previously unseen documents, which we address below.

In [16]:
scores = model.evaluate(X_test, to_categorical(y_test, num_classes=max(tag_indices.values()) + 1))
print("Accuracy: {result} percent".format(result=(scores[1]*100)))

Accuracy: 96.68332581788721 percent
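
One caveat when reading this figure: because the embedding layer above does not mask index `0`, the Keras `accuracy` metric is averaged over every position in the padded sequences, including the padding itself. Assuming the model learns to predict `O-PAD` for padded positions, the reported number will be higher than the accuracy over real tokens. The sketch below shows the difference on a hypothetical three-token sentence padded to length 10; `token_accuracy` is an illustrative helper, not part of Keras.

```python
def token_accuracy(y_true, y_pred, pad_index=None):
    """Share of positions tagged correctly, optionally skipping padded positions."""
    correct = total = 0
    for true_sent, pred_sent in zip(y_true, y_pred):
        for t, p in zip(true_sent, pred_sent):
            if pad_index is not None and t == pad_index:
                continue  # ignore positions that are only padding
            total += 1
            correct += (t == p)
    return correct / total


# hypothetical gold and predicted tag indices: 3 real tokens, 1 tagging error
y_true = [[1, 2, 7, 0, 0, 0, 0, 0, 0, 0]]
y_pred = [[1, 2, 2, 0, 0, 0, 0, 0, 0, 0]]

with_padding = token_accuracy(y_true, y_pred)
without_padding = token_accuracy(y_true, y_pred, pad_index=0)
print(with_padding)               # 0.9
print(round(without_padding, 3))  # 0.667
```

With the trained model, predicted indices of this shape could be obtained with `model.predict(X_test).argmax(axis=-1)` and compared against `y_test` in the same way.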


#### Evaluate on unseen data.

Finally, we evaluate our model on unseen data—a word position/POS-tagged ninth document, which we load from the database.

In [17]:
query = """SELECT doc_ind, sent_ind, word_type_ind, loc, pos_label FROM tokens
JOIN types ON tokens.word_type_ind=types.ind
JOIN word_loc ON word_loc.ind=types.word_loc
JOIN pos ON pos.ind=types.pos_type
WHERE doc_ind==9;"""
query_words = crsr.execute(query).fetchall()
sentences_list = []
tags_list = []
for k, g in groupby(query_words, lambda x: (x[0], x[1])):
    sent = list(g)
    sentences_list.append([(w[2],) for w in sent])
    tags_list.append([(tag_indices['-'.join(w[3:])],) for w in sent])

X_new = [[word_value if word_value in words else (index_words.index.max(),) for word_value in sent] \
          for sent in sentences_list]
    
X_new = pad_sequences([[w[0] for w in line] for line in X_new], maxlen=LEN_MAX, padding='post')
y_new = pad_sequences(tags_list, maxlen=LEN_MAX, padding='post')

scores = model.evaluate(X_new, to_categorical(y_new, num_classes=max(tag_indices.values()) + 1))
print("Accuracy: {result} percent".format(result=(scores[1]*100)))

Accuracy: 96.25993371009827 percent


As can be seen above, even on an unseen text, the accuracy of this model still reaches 96.26% in this case, with an input of only about 6000 tagged words.

#### Conclusion

Altogether, combining word tokenization and POS tagging successfully tackles the problem of syllable spacing in Hmong, and a BiLSTM model with pretrained Word2Vec word embeddings overcomes the limitations on available tagged data.

#### References and further reading

[1] Lewis, William D. and Phong Yang. 2012. Building MT for a Severely Under-Resourced Language: White Hmong. In _Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas_. https://pdfs.semanticscholar.org/098c/96c2ad281ac617fbe0766623834eb295ec2c.pdf

[2] Takahashi, Kanji and Kazuhide Yamamoto. 2016. Fundamental tools and resource are available for Vietnamese analysis. In _Proceedings of the 2016 International Conference on Asian Language Processing_, pp. 246–249. https://ieeexplore.ieee.org/document/7875978

[3] Nguyen, Dat Quoc, Thanh Vu, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In _Proceedings of the Australasian Language Technology Association Workshop_, pp. 108–113. https://www.aclweb.org/anthology/U17-1013/

##### Other further reading links:

Some additional inspiration for my implementation of the approach using BiLSTM above, including especially the Keras model design, can be found at https://nlpforhackers.io/lstm-pos-tagger-keras/.
