### Natcha Jangphiphatnawakit 63340500031

# HW 3 - Neural POS Tagger

In this exercise, you are going to build a set of deep learning models on part-of-speech (POS) tagging using Tensorflow 2. Tensorflow is a deep learning framwork developed by Google to provide an easier way to use standard layers and networks.

To complete this exercise, you will need to build deep learning models for POS tagging in Thai using NECTEC's ORCHID corpus. You will build one model for each of the following type:

- Neural POS Tagging with Word Embedding using Fixed / non-Fixed Pretrained weights
- Neural POS Tagging with Viterbi / Marginal CRF

Pretrained word embeddding are already given for you to use (albeit, a very bad one).

We also provide the code for data cleaning, preprocessing and some starter code for tensorflow 2 in this notebook but feel free to modify those parts to suit your needs. Feel free to use additional libraries (e.g. scikit-learn) as long as you have a model for each type mentioned above.

### Don't forget to change hardware accelrator to GPU in runtime on Google Colab ###

## 1. Setup and Preprocessing

We use POS data from [ORCHID corpus](https://www.nectec.or.th/corpus/index.php?league=pm), which is a POS corpus for Thai language.
A method used to read the corpus into a list of sentences with (word, POS) pairs have been implemented already. The example usage has shown below.
We also create a word vector for unknown word by random.

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# %tensorflow_version 2.x

In [3]:
# import shutil
# shutil.copy("/content/drive/MyDrive/FRA 501 IntroNLP&DL/Dataset/resources.zip", "/content/resources.zip")
# !unzip resources.zip

In [4]:
# !pip install python-crfsuite
# !pip install tensorflow-addons
# !pip install tf2crf

In [5]:
# %tensorflow_version 2.x

In [6]:
from resources.data.orchid_corpus import get_sentences
import numpy as np
import numpy.random
import tensorflow as tf
np.random.seed(42)

In [7]:
yunk_emb =np.random.randn(32)
train_data = get_sentences('train')
test_data = get_sentences('test')
print(train_data[1])

[('โครงการวิจัยและพัฒนา', 'NCMN'), ('อิเล็กทรอนิกส์', 'NCMN'), ('และ', 'JCRG'), ('คอมพิวเตอร์', 'NCMN')]


In [8]:
yunk_emb

array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337,
       -0.23413696,  1.57921282,  0.76743473, -0.46947439,  0.54256004,
       -0.46341769, -0.46572975,  0.24196227, -1.91328024, -1.72491783,
       -0.56228753, -1.01283112,  0.31424733, -0.90802408, -1.4123037 ,
        1.46564877, -0.2257763 ,  0.0675282 , -1.42474819, -0.54438272,
        0.11092259, -1.15099358,  0.37569802, -0.60063869, -0.29169375,
       -0.60170661,  1.85227818])

Next, we load pretrained weight embedding using pickle. The pretrained weight is a dictionary which map a word to its embedding.

In [9]:
import pickle
fp = open('resources./basic_ff_embedding.pt', 'rb')
embeddings = pickle.load(fp)
fp.close()

In [10]:
embeddings

{'พุทธเจ้าพระองค์': array([-0.01917224, -0.00415204, -0.02412283, -0.04142096,  0.04691369,
         0.03376952, -0.00270034, -0.04676848,  0.03299177,  0.03790374,
         0.0432213 , -0.01537431, -0.02517369, -0.04052844, -0.01157572,
         0.00185845, -0.00034374,  0.03099574, -0.00553056,  0.03075998,
        -0.02743803, -0.03812069, -0.02771009, -0.00890391, -0.03464903,
        -0.03346384, -0.04095409,  0.03574741,  0.04473687,  0.0170097 ,
        -0.00490531,  0.01063981], dtype=float32),
 'จุ๊บุ': array([ 0.02896592,  0.02110482,  0.03715003,  0.02296479, -0.03441135,
         0.03496312,  0.03625641, -0.02355627, -0.03617386,  0.01206947,
         0.02429886, -0.02565069, -0.02642049, -0.03778682, -0.00951525,
        -0.0446926 , -0.02631601,  0.04875654,  0.04526813,  0.0079442 ,
         0.0340622 ,  0.00625456,  0.01675535,  0.01817935, -0.03839616,
        -0.04811118,  0.03423071,  0.015117  ,  0.00746933,  0.02313724,
         0.01740095,  0.02209598], dtype=floa

In [11]:
print(len(embeddings.keys()))

701354


The given code below generates an indexed dataset(each word is represented by a number) for training and testing data. The index 0 is reserved for padding to help with variable length sequence. (Additionally, You can read more about padding here [https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/])

## 2. Prepare Data

In [12]:
word_to_idx ={}     #dictionary collect word as a key and number as value (from data in train dataset)
idx_to_word ={}     #inverse of word_to_idx
label_to_idx = {}   #dictionary collect PoS as a key and number as value (from data in train dataset)
for sentence in train_data:
    for word,pos in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
        if pos not in label_to_idx:
            label_to_idx[pos] = len(label_to_idx)+1
word_to_idx['UNK'] = len(word_to_idx)

n_classes = len(label_to_idx.keys())+1  #number of types of PoS

In [13]:
n_classes

48

This section is tweaked a little from the demo, word2features will return word index instead of features, and sent2labels will return a sequence of word indices in the sentence.

In [14]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return numpy.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [15]:
sent2features(train_data[100], embeddings)

array([ 29, 327,   5, 328])

Next we create train and test dataset, then we use tensorflow 2 to post-pad the sequence to max sequence with 0. Our labels are changed to a one-hot vector.

In [16]:
x_train = np.asarray([sent2features(sent, embeddings) for sent in train_data], dtype=object)
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

In [17]:
x_train=tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
y_train=tf.keras.preprocessing.sequence.pad_sequences(y_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
x_test=tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=102, dtype='int32', padding='post', truncating='pre', value=0.)
y_temp =[]
for i in range(len(y_train)):
    y_temp.append(np.eye(n_classes)[y_train[i]][np.newaxis,:])
y_train = np.asarray(y_temp).reshape(-1,102,n_classes)
del(y_temp)

In [18]:
print(x_train[0],x_train.shape)
print(y_train[0][1],y_train.shape)

[1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] (18500, 102)
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (18500, 102, 48)


## 3. Evaluate

Our output from tf keras is a distribution of problabilities on all possible label. outputToLabel will return an indices of maximum problability from output sequence.

evaluation_report is the same as in the demo

In [19]:
def outputToLabel(yt,seq_len):
    out = []
    for i in range(0,len(yt)):
        if(i==seq_len):
            break
        out.append(np.argmax(yt[i]))
    return out

In [20]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count) * 100
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['type'] = list(filter(lambda x: label_to_idx[x] == tag, label_to_idx))[0]
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else None
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else None
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else None
        
        eval_list.append(eval_result)

    # eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': 0, 'precision': 0, 'recall': 0, 'f_score': 0})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['type', 'tag', 'precision', 'recall', 'f_score', 'correct_count']]
    df['f_score'] = df['f_score'].astype(float)
    df.sort_values(by='f_score', inplace=True, ascending=False)
    display(df)
    print('accuracy=%.2f' % accuracy)
    
    return (df, accuracy)

## 4. Train a model

In [21]:
from keras import models 
from keras.layers import Embedding, Reshape, Activation, Input, Dense,GRU,Reshape,TimeDistributed,Bidirectional,Dropout,Masking
from keras.optimizers import Adam

The model is this section is separated to two groups

- Neural POS Tagger (4.1)
- Neural CRF POS Tagger (4.2)

## 4.1 Neural POS Tagger  (Example)

We create a simple Neural POS Tagger as an example for you. This model dosen't use any pretrained word embbeding so it need to use Embedding layer to train the word embedding from scratch.

Instead of using tensorflow.keras.models.Sequential, we use tensorflow.keras.models.Model. The latter is better as it can have multiple input/output, of which Sequential model could not. Due to this reason, the Model class is widely used for building a complex deep learning model.

In [22]:
from keras.callbacks import TensorBoard, EarlyStopping
from keras import backend as K

inputs = Input(shape=(102,), dtype='int32')
output = (Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))(inputs)
output = Bidirectional(GRU(32, return_sequences=True))(output)
output = Dropout(0.2)(output)
output = TimeDistributed(Dense(n_classes,activation='softmax'))(output)
model = models.Model(inputs, output)
model.compile(optimizer=Adam(learning_rate=0.001),  loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 102)]             0         
                                                                 
 embedding (Embedding)       (None, 102, 32)           480608    
                                                                 
 bidirectional (Bidirectiona  (None, 102, 64)          12672     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 102, 64)           0         
                                                                 
 time_distributed (TimeDistr  (None, 102, 48)          3120      
 ibuted)                                                         
                                                                 
Total params: 496,400
Trainable params: 496,400
Non-trainable

In [23]:
K.clear_session()
print('start training')
verbose = 1

weight_no_crf = './model_PoS/my_pos_no_crf_manual_save/'

callbacks_list_no_crf = [TensorBoard(log_dir='./Graph_PoS/no_crf', histogram_freq=1, write_graph=True, write_grads=False),]

train_params = [(20, 64)]
for (epochs, batch_size) in train_params:
  print("train with {} epochs and {} batch size".format(epochs, batch_size))
  model.fit(x_train, y_train, 
            epochs=epochs, 
            batch_size=batch_size, 
            verbose=verbose,
            callbacks=callbacks_list_no_crf,)

start training
train with 20 epochs and 64 batch size
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [24]:
model.save(weight_no_crf)



INFO:tensorflow:Assets written to: ./model_PoS/my_pos_no_crf_manual_save/assets


INFO:tensorflow:Assets written to: ./model_PoS/my_pos_no_crf_manual_save/assets


In [25]:
model = models.load_model(weight_no_crf)

y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
df_tagger, acc_tagger = evaluation_report(y_test, ypred)



Unnamed: 0,type,tag,precision,recall,f_score,correct_count
38,NEG,40,100.0,100.0,100.0,280
3,PUNC,4,99.961086,99.396378,99.677933,12844
0,FIXN,1,99.778516,97.8019,98.780321,3604
17,XVBM,18,97.45978,99.65368,98.544521,1151
18,XVMM,19,98.816568,96.811594,97.803807,334
19,DIBQ,20,98.281787,96.949153,97.610922,286
6,JCRG,7,98.240469,96.681097,97.454545,2010
34,CMTR@PUNC,36,100.0,93.75,96.774194,15
11,XVAM,12,96.034696,96.754057,96.393035,775
35,JCMP,37,93.518519,99.019608,96.190476,101


accuracy=92.90


## 4.2 CRF Viterbi

Your next task is to incorporate Conditional random fields (CRF) to your model.

To use the CRF layer, you need to use an extension repository for tensorflow library, call tf2crf. If you want to see the detailed implementation, you should read the official tensorflow extention of CRF (https://www.tensorflow.org/addons/api_docs/python/tfa/text).

tf2crf link :  https://github.com/xuxingya/tf2crf

For inference, you should look at crf.py at the method call and view the input/output argmunets. 
Link : https://github.com/xuxingya/tf2crf/blob/master/tf2crf/crf.py



### 4.2.1 CRF without pretrained weight
### #TODO 1
Incoperate CRF layer to your model in 4.1. CRF is quite complex compare to previous example model, so you should train it with more epoch, so it can converge.

To finish this excercise you must train the model and show the evaluation report with this model as shown in the example.

Do not forget to save this model weight.

In [26]:
# INSERT YOUR CODE HERE
from tf2crf import CRF, ModelWithCRFLoss
from keras.layers import Input, Embedding, Bidirectional, GRU, Dense

def get_model_crf():

    inputs = Input(shape=(102,), dtype='float32')
    output = (Embedding(len(word_to_idx), 32, input_length=102, mask_zero=True))(inputs)
    output = Bidirectional(GRU(32, return_sequences=True))(output)
    output = Dropout(0.2)(output)
    output = TimeDistributed(Dense(n_classes,activation='relu'))(output)
    crf = CRF(units=n_classes, dtype='float32')
    potentials = crf(output)
    base_model = models.Model(inputs, potentials)
    model = ModelWithCRFLoss(base_model, sparse_target=False)
    model.compile(Adam(learning_rate=0.001))

    base_model.summary()

    return model

In [27]:
from keras.callbacks import TensorBoard, EarlyStopping
from keras import backend as K

K.clear_session()
print('start training')
verbose = 1
model_crf = get_model_crf()

weight_crf = 'model_PoS/my_pos_crf_manual_save'

callbacks_list_crf = [
        TensorBoard(log_dir='./Graph_PoS/crf', histogram_freq=1, write_graph=True, write_grads=False),
        EarlyStopping(monitor='accuracy',
                      min_delta=0.04,
                      patience=10,
                      verbose=0)
  ]

train_params = [(50, 64)]
for (epochs, batch_size) in train_params:
  print("train with {} epochs and {} batch size".format(epochs, batch_size))
  model_crf.fit(x_train, y_train, 
                epochs=epochs, 
                batch_size=batch_size, 
                verbose=verbose,
                callbacks=callbacks_list_crf,)


start training
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 102)]             0         
                                                                 
 embedding (Embedding)       (None, 102, 32)           480608    
                                                                 
 bidirectional (Bidirectiona  (None, 102, 64)          12672     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 102, 64)           0         
                                                                 
 time_distributed (TimeDistr  (None, 102, 48)          3120      
 ibuted)                                                         
                                                                 
 crf (CRF)                   ((None, 102),    

In [28]:
model_crf.save(weight_crf)



INFO:tensorflow:Assets written to: model_PoS/my_pos_crf_manual_save\assets


INFO:tensorflow:Assets written to: model_PoS/my_pos_crf_manual_save\assets


In [29]:
def pred_label_cut(y_pred, y_test):
    y_pred_cut = []
    for i, s in enumerate(y_test):
        arr = y_test[i].copy()
        for j, w in enumerate(s):
            arr[j] = y_pred[i][j]
        y_pred_cut.append(arr.copy())
    return y_pred_cut

In [30]:
model_crf = models.load_model(weight_crf)

y_pred = model_crf.predict(x_test)
y_pred = pred_label_cut(y_pred, y_test)

df_crf, acc_crf = evaluation_report(y_test, y_pred)



Unnamed: 0,type,tag,precision,recall,f_score,correct_count
38,NEG,40,100.0,100.0,100.0,280
3,PUNC,4,99.945593,99.512459,99.728556,12859
0,FIXN,1,99.755368,99.592944,99.67409,3670
17,XVBM,18,95.892707,99.047619,97.444634,1144
6,JCRG,7,97.484277,96.921597,97.202123,2015
19,DIBQ,20,98.263889,95.932203,97.084048,283
4,CFQC,5,95.652174,98.507463,97.058824,66
34,CMTR@PUNC,36,100.0,93.75,96.774194,15
18,XVMM,19,96.253602,96.811594,96.531792,334
16,DDAC,17,95.205479,95.205479,95.205479,556


accuracy=92.99



### 4.2.2 CRF with pretrained weight

### #TODO 2

We would like you create a neural CRF POS tagger model  with the pretrained word embedding as an input and the word embedding is trainable (not fixed). To finish this excercise you must train the model and show the evaluation report with this model as shown in the example.

Please note that the given pretrained word embedding only have weights for the vocabuary in BEST corpus.

You can read more information about using predtained weight in embedding layer on Keras from the following link:

https://keras.io/examples/nlp/pretrained_word_embeddings/

Optionally, you can use your own pretrained word embedding.

#### Hint: You can get the embedding from get_embeddings function from embeddings/emb_reader.py . 

(You may want to read about Tensorflow Masking layer and Trainable parameter)

In [31]:
from resources.embeddings import emb_reader
new_embedding = emb_reader.get_embeddings()

In [32]:
word_to_vec = {}
not_found = 0

for word in word_to_idx.keys():
    if word not in new_embedding.keys(): 
        vector = np.zeros(len(list(new_embedding.values())[0]))
        not_found+=1
    else: vector = new_embedding[word]
    
    word_to_vec[word] = vector
print('vector dim:', len(list(new_embedding.values())[0]))
print('Not found:', not_found)

vector dim: 64
Not found: 11313


In [33]:
num_tokens = len(word_to_vec) + 1
embedding_dim = len(list(word_to_vec.values())[0])
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
# for word, i in embeddings.items():
for i, word in enumerate(word_to_vec.keys()):
    embedding_vector = word_to_vec.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 15019 words (0 misses)


In [34]:
# INSERT YOUR CODE HERE
from keras.initializers import Constant

def get_model_crf_pretrained():

    inputs = Input(shape=(102,), dtype='float32')
    output = (Embedding(num_tokens,
                        embedding_dim,
                        input_length=102,
                        embeddings_initializer=Constant(embedding_matrix),
                        mask_zero=True,
                        trainable=False))(inputs)
    output = Bidirectional(GRU(32, return_sequences=True))(output)
    output = Dropout(0.2)(output)
    output = TimeDistributed(Dense(n_classes,activation='relu'))(output)
    crf = CRF(units=n_classes, dtype='float32')
    potentials = crf(output)
    base_model = models.Model(inputs, potentials)
    model = ModelWithCRFLoss(base_model, sparse_target=False)
    model.compile(Adam(learning_rate=0.001))

    base_model.summary()
    
    return model

In [35]:
K.clear_session()
print('start training')
verbose = 1
model_crf_pretrained = get_model_crf_pretrained()

weight_crf_pretrained = './model_PoS/my_pos_crf_pretained_manual_save'

callbacks_list_crf_pretrained = [
        TensorBoard(log_dir='./Graph_PoS/crf_pretrained', histogram_freq=1, write_graph=True, write_grads=False),

        EarlyStopping(monitor='loss',
                      min_delta=0.05,
                      patience=5,
                      verbose=0)
  ]

train_params = [(50, 64)]
for (epochs, batch_size) in train_params:
  print("train with {} epochs and {} batch size".format(epochs, batch_size))
  model_crf_pretrained.fit(x_train, y_train, 
                epochs=epochs, 
                batch_size=batch_size, 
                verbose=verbose,
                callbacks=callbacks_list_crf_pretrained)

start training
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 102)]             0         
                                                                 
 embedding (Embedding)       (None, 102, 64)           961280    
                                                                 
 bidirectional (Bidirectiona  (None, 102, 64)          18816     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 102, 64)           0         
                                                                 
 time_distributed (TimeDistr  (None, 102, 48)          3120      
 ibuted)                                                         
                                                                 
 crf (CRF)                   ((None, 102),    

In [36]:
model_crf_pretrained.save(weight_crf_pretrained)



INFO:tensorflow:Assets written to: ./model_PoS/my_pos_crf_pretained_manual_save\assets


INFO:tensorflow:Assets written to: ./model_PoS/my_pos_crf_pretained_manual_save\assets


In [37]:
model_crf_pretrained = models.load_model(weight_crf_pretrained)

y_pred = model_crf_pretrained.predict(x_test)
y_pred = pred_label_cut(y_pred, y_test)

df_crf_pre, acc_crf_pre = evaluation_report(y_test, y_pred)



Unnamed: 0,type,tag,precision,recall,f_score,correct_count
0,FIXN,1,99.891156,99.620081,99.755435,3671
6,JCRG,7,97.9793,95.622896,96.786758,1988
3,PUNC,4,95.064895,94.660269,94.862151,12232
20,PREL,21,93.683511,93.0033,93.342166,1409
35,JCMP,37,90.47619,93.137255,91.78744,95
11,XVAM,12,88.929889,90.262172,89.591078,723
13,RPRE,14,89.977728,88.322099,89.142227,4848
22,XVAE,23,87.482517,88.660524,88.067582,1251
27,DIAC,29,91.752577,84.493671,87.973641,267
17,XVBM,18,90.655106,81.471861,85.818513,941


accuracy=80.03


### #TODO 3
Compare the result between all neural tagger models in 4.1 and 4.2.x and provide a convincing reason and example for the result of these models (which model perform better, why?)

(If you use your own weight please state so in the answer)

<b>Write your answer here :</b>

# 1. Raw Result

จากการ train model 3 แบบคือ

1. Simple PoS Tagger : model ที่มี structure ดังนี้

        inputs = Input(shape=(102,), dtype='int32')
        output = (Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))(inputs)
        output = Bidirectional(GRU(32, return_sequences=True))(output)
        output = Dropout(0.2)(output)
        output = TimeDistributed(Dense(n_classes,activation='softmax'))(output)
        model = models.Model(inputs, output)

2. CRF without Pre-Trained : model ที่มี structure เหมือน Simple PoS Tagger ที่เพิ่ม layer สุดท้ายเป็น CRF layer

3. CRF with Pre-Trained : model ที่เพิ่มการใช้ dictionary เพื่อหา vector ของคำ (ใช้ embededing vector จาก polyglot-th.pkl ซึ่ง vector มีขนาด 64)

ได้ผลลัพธ์ดังนี้

In [46]:
num_of_sample = 5
print("Simple PoS Tagger ==============================================================")
display(df_tagger.iloc[:num_of_sample])
display(df_tagger.iloc[-num_of_sample:])
print("accuracy: {}% \n".format(acc_tagger))

print("CRF without Pre-Trained ========================================================")
display(df_crf.iloc[:num_of_sample])
display(df_crf.iloc[-num_of_sample:])
print("accuracy: {}% \n".format(acc_crf))

print("CRF with Pre-Trained ===========================================================")
display(df_crf_pre.iloc[:num_of_sample])
display(df_crf_pre.iloc[-num_of_sample:])
print("accuracy: {}% \n".format(acc_crf_pre))



Unnamed: 0,type,tag,precision,recall,f_score,correct_count
38,NEG,40,100.0,100.0,100.0,280
3,PUNC,4,99.961086,99.396378,99.677933,12844
0,FIXN,1,99.778516,97.8019,98.780321,3604
17,XVBM,18,97.45978,99.65368,98.544521,1151
18,XVMM,19,98.816568,96.811594,97.803807,334


Unnamed: 0,type,tag,precision,recall,f_score,correct_count
7,NCNM,8,67.398119,51.807229,58.583106,215
9,NPRP,10,62.407407,40.166865,48.875997,337
41,DDAQ,43,100.0,11.111111,20.0,1
42,EAFF,45,,0.0,,0
43,DIAQ,46,,0.0,,0


accuracy: 92.90099660184471% 



Unnamed: 0,type,tag,precision,recall,f_score,correct_count
38,NEG,40,100.0,100.0,100.0,280
3,PUNC,4,99.945593,99.512459,99.728556,12859
0,FIXN,1,99.755368,99.592944,99.67409,3670
17,XVBM,18,95.892707,99.047619,97.444634,1144
6,JCRG,7,97.484277,96.921597,97.202123,2015


Unnamed: 0,type,tag,precision,recall,f_score,correct_count
36,ADVS,38,51.162791,56.410256,53.658537,22
9,NPRP,10,61.384335,40.166865,48.559078,337
41,DDAQ,43,0.0,0.0,,0
42,EAFF,45,0.0,0.0,,0
43,DIAQ,46,,0.0,,0


accuracy: 92.99237556754905% 



Unnamed: 0,type,tag,precision,recall,f_score,correct_count
0,FIXN,1,99.891156,99.620081,99.755435,3671
6,JCRG,7,97.9793,95.622896,96.786758,1988
3,PUNC,4,95.064895,94.660269,94.862151,12232
20,PREL,21,93.683511,93.0033,93.342166,1409
35,JCMP,37,90.47619,93.137255,91.78744,95


Unnamed: 0,type,tag,precision,recall,f_score,correct_count
39,PNTR,41,,0.0,,0
40,EITT,42,,0.0,,0
41,DDAQ,43,0.0,0.0,,0
42,EAFF,45,,0.0,,0
43,DIAQ,46,,0.0,,0


accuracy: 80.02798480824696% 



จากผลลัพธ์ภาพรวมของค่า f_score ของแต่ละ tag และ accuracy สามารถเรียงลำดับ model ที่สามารถ predict tag ได้เรียงจากมากไปน้อยเป็น Simple PoS Tagger, CRF without Pre-Trained และ CRF with Pre-Trained ตามลำดับ

โดยอันดับของ tag ที่แต่ละ model สามารถ predict ได้ดี และแย่ที่สุดใน 5 อันดับคือ 

* Good Predict

| No. | Simple PoS Tagger | CRF without Pre-Trained | CRF with Pre-Trained |
| :--:| :---------------: | :---------------------: | :------------------: |
| 1   | NEG  | NEG  | FIXN |
| 2   | PUNC | PUNC | JCRG |
| 3   | FIXN | FIXN | PUNC |
| 4   | XVBM | XVBM | PREL |
| 5   | XVMM | JCRG | JCMP |

* Bad Predict

| No. | Simple PoS Tagger | CRF without Pre-Trained | CRF with Pre-Trained |
| :--:| :---------------: | :---------------------: | :------------------: |
| 1   | NCNM | ADVS | PNTR |
| 2   | NPRP | NPRP | EITT |
| 3   | DDAQ | DDAQ | DDAQ |
| 4   | EAFF | EAFF | EAFF |
| 5   | DIAQ | DIAQ | DIAQ |

และสามารถสรุปค่า accuracy ของ training set และ testing set ของ ทั้ง 3 model

* Accuracy of Training Set and Testing Set

| model                   | Training Set | Testing Set | 
| :----------------------:| :----------: | :---------: |
| Simple PoS Tagger       | 97.18%       | 92.90% |
| CRF without Pre-Trained | 97.21%       | 92.99% |
| CRF with Pre-Trained    | 80.91%       | 80.02% |

# 2. Analysis

## 2.1 Analysis Tensorboard and Accuracy

In [3]:
%load_ext tensorboard
%tensorboard --logdir='./Graph_PoS/'

Reusing TensorBoard on port 6006 (pid 13640), started 17:34:32 ago. (Use '!kill 13640' to kill it.)

- หากดูจาก loss และ accuracy ที่เปลี่ยนแปลงในแต่ละ epoch จะเห็นว่า model Simple PoS Tagger (no_crf) และ model CRF without Pre-Trained (crf) มีการเปลี่ยนแปลงเยอะและเริ่มู่เข้าตั้งแต่ epoch แรก ๆ ต่างจาก model CRF with Pre-Trained (crf_pretrained) ที่ค่อยลู่เข้าค่อนข้างช้า ซึ่งเป็นไปได้ว่าเกิดจาก model CRF with Pre-Trained มีจำนวน parameter เกือบ 1,000,000 ตัว แต่อีกสอง model ที่เหลือมีอยู่ประมาณ 500,000 ตัว และเมื่อใช้ Optimizer ตัวเดียวกันที่ learning rate เท่ากัน (Adam lr=0.001) จะทำให้การลู่เข้าของ model ที่มี parameter มากกว่าช้ากว่า
- และเมื่อวิเคราะห์ค่า accuracy ระหว่าง training set และ testing set ของแต่ละ model ในทั้ง 3 model แล้ว พบว่ามีค่าใกล้เคียงกันจึงคาดว่า model ที่่ train ไม่ได้เกิด overfiting โดยค่า accuracy ของ model Simple PoS Tagger และ model CRF without Pre-Trained ใกล้เคียงกันที่ประมาณ 90% ส่วน model CRF with Pre-Trained จะต่ำกว่าอยู่ที่ประมาณ 80%

## 2.2 Analysis from Tag

ในการวิเคราะห์นี้จะวิเคราะห์จาก Tag ที่ model สามารถ predict ได้ โดยเปรียบเทียบระหว่าง 2 model ว่า Tag นั้น model ไหนสามารถ predict เจอมากกว่ากัน (จะให้ความสำคัญเป็นอัตราส่วนระหว่างผลต่างของจำนวนที่หาเจอกับจจำนวนที่เจอของ model ที่เจอมากกว่า -> diff/correct_count)

In [118]:
#
def diff_correct_count(df1, df2, name1, name2): 
    dict_count_df1 = {}
    dict_count_df2 = {}

    for tag in df1['type']:
        diff = df1[df1['type']==tag]['correct_count'].values[0] - df2[df2['type']==tag]['correct_count'].values[0]
        if diff > 0: dict_count_df1[tag] = diff*100/df1[df1['type']==tag]['correct_count'].values[0] 
        elif diff < 0: dict_count_df2[tag] = np.abs(diff)*100/df2[df2['type']==tag]['correct_count'].values[0] 

    dict_count_df1 = sorted(dict_count_df1.items(), key=lambda x:x[1]) 
    dict_count_df2 = sorted(dict_count_df2.items(), key=lambda x:x[1]) 

    print(name1, '>', name2)
    for tag, diff_percent in dict_count_df1:
        print('{} diff {:0.1f}% : {} {}'.format(tag, diff_percent, df1[df1['type']==tag]['correct_count'].values[0], df2[df2['type']==tag]['correct_count'].values[0]))

    print("\n=======================\n")
    print(name1, '<', name2)
    for tag, diff_percent in dict_count_df2:
        print('{} diff {:0.1f}% : {} {}'.format(tag, diff_percent, df1[df1['type']==tag]['correct_count'].values[0], df2[df2['type']==tag]['correct_count'].values[0]))

### 2.2.1 model Simple PoS Tagger VS model CRF without Pre-Trained

In [120]:
diff_correct_count(df_crf, df_tagger, "df_crf", "df_tagger")

df_crf > df_tagger
PUNC diff 0.1% : 12859 12844
JCRG diff 0.2% : 2015 2010
XVAM diff 0.8% : 781 775
XVAE diff 0.8% : 1343 1332
DDBQ diff 0.8% : 121 120
ADVN diff 0.9% : 789 782
VATT diff 1.1% : 1324 1309
VACT diff 1.3% : 7873 7770
DONM diff 1.5% : 475 468
FIXN diff 1.8% : 3670 3604
DDAN diff 2.3% : 86 84
PREL diff 2.6% : 1446 1408
NLBL diff 3.4% : 523 505
CNIT diff 7.8% : 230 212
CLTV diff 7.8% : 102 94
NCNM diff 8.9% : 236 215
DDAC diff 9.5% : 556 503
CMTR diff 11.9% : 303 267
PNTR diff 12.5% : 16 14
NTTL diff 16.7% : 84 70


df_crf < df_tagger
VSTA diff 0.4% : 3004 3015
XVBM diff 0.6% : 1144 1151
JSBR diff 0.7% : 2074 2088
RPRE diff 0.8% : 5080 5119
DIBQ diff 1.0% : 283 286
NCMN diff 1.3% : 16165 16385
FIXV diff 1.9% : 152 155
DIAC diff 2.9% : 300 309
PDMN diff 5.0% : 76 80
EITT diff 6.2% : 15 16
DCNM diff 7.0% : 714 768
PPRS diff 14.3% : 96 112
DDAQ diff 100.0% : 0 1


จากการคำนวณที่ได้ส่วนใหญ่ model CRF without Pre-Trained จะ predict Tag ถูกมากกว่า โดย Tag ที่มีอัตราส่วน predict ถูกมากที่สุด 3 อันดับคือ 

1. NTTL (Title noun) เช่น ครู, พลเอก
2. PNTR (Interrogative pronoun) เช่น ใคร, อะไร, อย่างไร
3. CMTR (Measurement classifier) เช่น กิโลกรัม, แก้ว, ชั่วโมง

จะเห็นว่าในทั้ง 3 Tag เป็น Tag ที่ต้องดูบริบทของคำข้าง ๆ หรือคำใกล้เคียง เช่น NTTL จะเป็คำขยายของคำนามที่ตามหลัง, PNTR เป็นคำนามเกี่ยวกับการถามที่อ้างอิงจากบริบททั้งประโยคว่าต้องการสื่อสารอะไร และ CMTR เป็นลักษณะนามของคำนามที่ใช้คู่กัน ซึ่งเป็นไปได้ว่า model model CRF without Pre-Trained ที่มี CRF Layer เพิ่มเข้ามา สามารถทำให้ model สามารถ predict Tag โดยอ้างอิงจาก Tag ของคำข้าง ๆ ได้เหมือนเป็นการดูบริบทส่วนหนึ่งของประโยด Tag ดังกล่าวทั้ง 3 จึงสามารถ predict ได้ดีขึ้น

### 2.2.3 model CRF without Pre-Trained และ model CRF with Pre-Trained

In [121]:
diff_correct_count(df_crf, df_crf_pre, "df_crf", "df_crf_pre")

df_crf > df_crf_pre
JCRG diff 1.3% : 2015 1988
PREL diff 2.6% : 1446 1409
RPRE diff 4.6% : 5080 4848
PUNC diff 4.9% : 12859 12232
JCMP diff 5.9% : 101 95
XVAE diff 6.9% : 1343 1251
XVAM diff 7.4% : 781 723
PPRS diff 8.3% : 96 88
NCMN diff 9.9% : 16165 14557
NLBL diff 10.3% : 523 469
DIAC diff 11.0% : 300 267
XVBM diff 17.7% : 1144 941
DDBQ diff 19.0% : 121 98
VACT diff 21.9% : 7873 6145
DCNM diff 22.4% : 714 554
DONM diff 22.7% : 475 367
XVMM diff 23.4% : 334 256
DDAN diff 26.7% : 86 63
JSBR diff 30.1% : 2074 1450
DDAC diff 30.8% : 556 385
DIBQ diff 31.8% : 283 193
ADVN diff 34.2% : 789 519
VATT diff 37.8% : 1324 823
VSTA diff 37.8% : 3004 1867
NTTL diff 39.3% : 84 51
NPRP diff 43.6% : 337 190
CMTR diff 44.2% : 303 169
CLTV diff 52.9% : 102 48
CNIT diff 56.1% : 230 101
FIXV diff 56.6% : 152 66
NCNM diff 69.9% : 236 71
PDMN diff 71.1% : 76 22
NEG diff 75.4% : 280 69
CMTR@PUNC diff 93.3% : 15 1
CFQC diff 95.5% : 66 3
EITT diff 100.0% : 15 0
PNTR diff 100.0% : 16 0
ADVI diff 100.0% : 5 0


In [124]:
len(word_to_idx)

15019

In [123]:
word_to_vec = {}
not_found = 0

for word in word_to_idx.keys():
    if word not in new_embedding.keys(): 
        vector = np.zeros(len(list(new_embedding.values())[0]))
        not_found+=1
    else: vector = new_embedding[word]
    
    word_to_vec[word] = vector
print('vector dim:', len(list(new_embedding.values())[0]))
print('Not found:', not_found)

vector dim: 64
Not found: 11313


จากผลที่ได้ model CRF without Pre-Trained สามารถ predict Tag ได้ดีกว่า model CRF with Pre-Trained แทบทั้งหมดซึ่งคิดว่าเหตุผลมาจาก pre-trained weight ที่นำมาใช้เป็น Emebedding Vector ของคำในการ train เนื่องจากพบว่า library ที่ใช้ไม่มี pre-trained weight ของคำกว่า 11,313 คำที่พบใน dataset จากคำทั้งหมดประมาณ 15,019 คำ ซึ่งเป็นจำนวนที่ค่อนข้างเยอะ ทำให้ต้องใช้ vector 0 ขนาด 64 แทน 

โดยเมื่อวิเคราะห์ประกอบกับ Tag ที่ predict ไม่ได้เลย เช่น

- ADVI (Adverb with iterative form) = เร็วๆ, เสมอๆ, ช้าๆ
- ADVP (Adverb with prefixed form) = โดยเร็ว
- ADVS (Sentential adverb) = โดยปกติ, ธรรมดา

จะเห็นว่า Tag เหล่านี้ไม่ใช่คำทั่วไปและใช้บ่อยเหมือนกับคำนามหรือคำกริยา ซึ่งอาจไม่มีใน library pre-trained weight สุดท้ายแล้ว model จึงไม่สามารถ predict ได้ ต่างจาก model CRF without Pre-Trained ใช้การ Embedding จากการ random weight ที่อาจจะให้ embedding vector ที่ไม่มีความหมายแต่ก็ยังสามารถเอามาอ้างอิงถึงคำต่าง ๆ ได้


# 3. Conculsion

จากผลลัพธ์และการวิเคราะห์คิดว่า model ที่มี CRF Layer สามารถนำมา predict Tag โดยคำนึงถึงบริบทของ Tag ในส่วนอื่นของประโยคร่วมได้ ซึ่งการบอก Part of Speech หรือ Tag จะให้มีความแม่นยำและเข้าใจในเชิงความหมายแล้ว model CRF จึงเป็น model ที่ดีกว่า ส่วนในเรื่องการใช้ pre-trained weight ทีจะให้ embedding vector ที่ปริมาณสื่อถึงความหมายบางอย่างได้ ควรให้ผลลัพธ์ที่ดีกว่าแต่ในการ train model CRF with Pre-Trained นี้มีข้อสังเกตเกี่ยวกับ library ที่ไม่มี pre-trained weight ของคำครบ ซึ่งส่งผลต่อการ train model