## Word Tokenizer exercise

In this exercise, you are going to implement a (sort of) real-world task using TensorFlow and Keras. TensorFlow is a deep learning framework developed by Google, and Keras is a frontend library built on top of TensorFlow (and Theano) that provides an easier way to use standard layers and networks.

To complete this exercise, you will need to build deep learning models for Thai word tokenization (แบ่งเว้นวรรคภาษาไทย) using NECTEC's BEST corpus, one model for each of the following types:
- Feedforward Neural Network
- One-Dimensional Convolutional Neural Network (1D-CNN)
- Recurrent Neural Network (RNN, LSTM, or GRU)

and one more model of your choice to achieve the highest score possible.

We provide code for data cleaning and starter code for Keras in this notebook, but feel free to modify those parts to suit your needs. You can also complete this exercise using only TensorFlow (without Keras) or with additional libraries (e.g. scikit-learn), as long as you have a model for each type mentioned above.

This notebook assumes you have already installed TensorFlow and Keras for Python 3 and have a GPU enabled.

In [2]:
# Run setup code
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Prepare data
# You don't need to run the following code as we already did it for you to give everyone the same dataset
# import cattern.data_utils
# cattern.data_utils.generate_best_dataset(os.getcwd()+'/data', create_val=True)

For simplicity, we are going to build the word tokenizer as a binary classification model that predicts whether each character is the beginning of a word (if it is, there is a space in front of it). The model uses no knowledge about the type of character (vowel, number, English character, etc.).

For example,

'แมวดำน่ารักมาก' -> 'แมว ดำ น่า รัก มาก'

will have these true labels:

[(แ,1), (ม,0), (ว,0), (ด,1), ( ำ,0), (น,1), (-่,0), (า,0), (ร,1), (-ั,0), (ก,0), (ม,1), (า,0), (ก,0)]

In this task, we will use only the character in question and the characters surrounding it. A more complex model could also include knowledge about each character, and you can add that too if you feel like it.
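
The labelling scheme above can be sketched in a few lines. The helper names `labels_from_segmented` and `segment_from_labels` are illustrative and not part of the notebook, and the sketch assumes words are separated by single spaces with no spaces inside the original text.

```python
def labels_from_segmented(segmented, sep=' '):
    """Return (characters, labels) for text whose words are separated by `sep`.

    A label of 1 marks the first character of a word, 0 any other character.
    """
    chars, labels = [], []
    for word in segmented.split(sep):
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return chars, labels


def segment_from_labels(chars, labels, sep=' '):
    """Rebuild segmented text by inserting `sep` before each word-initial character."""
    pieces = []
    for ch, label in zip(chars, labels):
        if label == 1 and pieces:
            pieces.append(sep)
        pieces.append(ch)
    return ''.join(pieces)


segmented = 'แมว ดำ น่า รัก มาก'
chars, labels = labels_from_segmented(segmented)
print(list(zip(chars, labels)))
print(segment_from_labels(chars, labels))
```

The second function is the step a trained model would need at prediction time: thresholded model outputs take the place of `labels`.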

In [3]:
# Map each character to its integer index
CHARS = [
  '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+',
  ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8',
  '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E',
  'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
  'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_',
  'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
  'n', 'o', 'other', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
  'z', '}', '~', 'ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', 'ช',
  'ซ', 'ฌ', 'ญ', 'ฎ', 'ฏ', 'ฐ', 'ฑ', 'ฒ', 'ณ', 'ด', 'ต', 'ถ', 'ท',
  'ธ', 'น', 'บ', 'ป', 'ผ', 'ฝ', 'พ', 'ฟ', 'ภ', 'ม', 'ย', 'ร', 'ฤ',
  'ล', 'ว', 'ศ', 'ษ', 'ส', 'ห', 'ฬ', 'อ', 'ฮ', 'ฯ', 'ะ', 'ั', 'า',
  'ำ', 'ิ', 'ี', 'ึ', 'ื', 'ุ', 'ู', 'ฺ', 'เ', 'แ', 'โ', 'ใ', 'ไ',
  'ๅ', 'ๆ', '็', '่', '้', '๊', '๋', '์', 'ํ', '๐', '๑', '๒', '๓',
  '๔', '๕', '๖', '๗', '๘', '๙', '‘', '’', '\ufeff'
]
CHARS_MAP = {v: k for k, v in enumerate(CHARS)}
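
As a small illustration of how such a map is used, the sketch below encodes a string with a fallback for unknown characters. `TOY_CHARS`, `TOY_MAP`, and `encode` are made-up names, and the four-entry vocabulary is only a stand-in for the full `CHARS` list.

```python
# Toy vocabulary for illustration only; the notebook's CHARS list is much longer.
TOY_CHARS = [' ', 'a', 'b', 'other']
TOY_MAP = {ch: idx for idx, ch in enumerate(TOY_CHARS)}
OTHER_ID = TOY_MAP['other']


def encode(text, char_map=TOY_MAP, other_id=OTHER_ID):
    """Map each character to its index, falling back to the 'other' index."""
    return [char_map.get(ch, other_id) for ch in text]


encoded = encode('ab z')
print(encoded)
```

Looking up the fallback index by name (`TOY_MAP['other']`) rather than hard-coding it keeps the encoding correct if the vocabulary order ever changes.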

In [4]:
def create_n_gram_df(df, n_pad):
  """
  Given an input dataframe, create a feature dataframe of shifted characters.
  Input:
  df: dataframe with a 'char' column of length N (new columns are added in place)
  n_pad: size of the context window (an odd number); for the character at
    position idx, the characters at positions idx - n_pad_2 .. idx + n_pad_2
    are used as features, where n_pad_2 = (n_pad - 1) / 2.

  Output:
  dataframe with N - 2 * n_pad_2 rows, in which each row contains the character,
    the n_pad_2 characters to its left, and the n_pad_2 characters to its right.
    The first and last n_pad_2 rows are dropped because their context is incomplete.
  """
  n_pad_2 = int((n_pad - 1)/2)
  for i in range(n_pad_2):
      df['char-{}'.format(i+1)] = df['char'].shift(i + 1)
      df['char{}'.format(i+1)] = df['char'].shift(-i - 1)
  return df[n_pad_2: -n_pad_2]
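
To see what the shifted columns look like, here is a toy run on seven made-up character indices. The function body is repeated (without its docstring) only so that the snippet runs on its own; `toy` and `windows` are illustrative names.

```python
import pandas as pd


def create_n_gram_df(df, n_pad):
    # Copy of the notebook's function, repeated so this example is self-contained.
    n_pad_2 = int((n_pad - 1) / 2)
    for i in range(n_pad_2):
        df['char-{}'.format(i + 1)] = df['char'].shift(i + 1)
        df['char{}'.format(i + 1)] = df['char'].shift(-i - 1)
    return df[n_pad_2: -n_pad_2]


# Seven made-up character indices; n_pad=5 gives two neighbours on each side.
toy = pd.DataFrame({'char': [10, 11, 12, 13, 14, 15, 16]})
windows = create_n_gram_df(toy, n_pad=5)

# Reorder the columns left-to-right to make each window easy to read.
ordered = windows[['char-2', 'char-1', 'char', 'char1', 'char2']]
print(ordered)
```

The shifted columns come out as floats because `shift` introduces `NaN` at the edges before those rows are sliced away, which is why the feature matrix printed later in the notebook contains values such as `112.`.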

In [5]:
def prepare_feature(best_processed_path, option='train'):
  """
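  Transform the processed BEST dataset into a feature matrix and a label array
  Input:
  best_processed_path: str, path to processed BEST dataset
  option: str, 'train', 'val', or 'test'
  Output:
  x_char: array of shape (N, 21) holding the character indices of each context window
  y: array of shape (N,) with 1 where the character starts a word, else 0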
  """
  # padding for training and testing set
  n_pad = 21
  n_pad_2 = int((n_pad - 1)/2)
  pad = [{'char': ' ', 'target': True}]
  df_pad = pd.DataFrame(pad * n_pad_2)

  df = []
  # article types in BEST corpus
  article_types = ['article', 'encyclopedia', 'news', 'novel']
  for article_type in article_types:
      df.append(pd.read_csv(os.path.join(best_processed_path, option, 'df_best_{}_{}.csv'.format(article_type, option))))
  
  df = pd.concat(df)
  # pad both ends with spaces so the first and last characters have a full context window
  df = pd.concat((df_pad, df, df_pad))

  # map characters to numbers; characters outside the predefined set map to 80, the index of 'other' in CHARS
  df['char'] = df['char'].map(lambda x: CHARS_MAP.get(x, 80))

  # Use nearby character as features
  df_with_context = create_n_gram_df(df, n_pad=n_pad)

  char_row = ['char' + str(i + 1) for i in range(n_pad_2)] + \
             ['char-' + str(i + 1) for i in range(n_pad_2)] + ['char']

  # convert pandas dataframe to numpy array to feed to the model
  # (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
  x_char = df_with_context[char_row].values
  y = df_with_context['target'].astype(int).values

  return x_char, y

Before running the following commands, note that our data is quite large: loading the whole dataset at once uses a lot of memory (~6 GB after processing and up to ~12 GB while processing). We expect you to run this on Google Cloud so that you will not run into this problem. If, for any reason, you have to run this on your PC or on a machine without enough memory, you might need to write a data generator that processes a few entries at a time and feeds them to the model during training.

For Keras, you can use [fit_generator](https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory) to cope with that.
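
One possible shape for such a generator is sketched below. It is an assumption-laden sketch, not the notebook's method: `window_batches` is a hypothetical name, and it keeps only the 1-D array of character indices in memory, building each context window per batch instead of materialising the full feature matrix. Its columns run left to right, which differs from the column order produced by `prepare_feature`, so a model would have to be trained and evaluated with the same layout.

```python
import numpy as np


def window_batches(char_ids, targets, n_pad=21, batch_size=256, pad_id=1):
    """Yield (x, y) batches forever, building each context window on the fly.

    char_ids, targets: 1-D arrays of equal length (character indices and labels).
    pad_id: index used beyond both ends of the text (1 is ' ' in CHARS_MAP).
    """
    half = (n_pad - 1) // 2
    pad = np.full(half, pad_id, dtype=char_ids.dtype)
    padded = np.concatenate([pad, char_ids, pad])
    offsets = np.arange(n_pad)  # window for character i is padded[i : i + n_pad]
    n = len(char_ids)
    while True:
        for start in range(0, n, batch_size):
            rows = np.arange(start, min(start + batch_size, n))
            x = padded[rows[:, None] + offsets[None, :]]
            yield x, targets[rows]


# Made-up data: six character indices and their labels.
ids = np.array([10, 11, 12, 13, 14, 15])
labels = np.array([1, 0, 0, 1, 0, 0])
gen = window_batches(ids, labels, n_pad=5, batch_size=4)
x_first, y_first = next(gen)
x_second, y_second = next(gen)
print(x_first)
print(x_second)
```

A generator of this kind could then be handed to `fit_generator` together with `steps_per_epoch` set to the number of batches per pass over the data.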

In [6]:
# Path to the preprocessed data
best_processed_path = 'cleaned_data'

In [None]:
# Load preprocessed BEST corpus
x_train_char, y_train = prepare_feature(best_processed_path, option='train')
x_val_char, y_val = prepare_feature(best_processed_path, option='val')
x_test_char, y_test = prepare_feature(best_processed_path, option='test')

# As a sanity check, we print out the size of the training, val, and test data.
print('Training data shape: ', x_train_char.shape)
print('Training data labels shape: ', y_train.shape)
print('Validation data shape: ', x_val_char.shape)
print('Validation data labels shape: ', y_val.shape)
print('Test data shape: ', x_test_char.shape)
print('Test data labels shape: ', y_test.shape)

In [7]:
# Print a few entries from the data to check that they look as expected.
print('First 3 features: ', x_train_char[:3])
print('First 30 class labels', y_train[:30])

First 3 features:  [[ 112.  140.  114.  148.  130.  142.   94.  142.  128.  128.    1.    1.
     1.    1.    1.    1.    1.    1.    1.    1.   97.]
 [ 140.  114.  148.  130.  142.   94.  142.  128.  128.  141.   97.    1.
     1.    1.    1.    1.    1.    1.    1.    1.  112.]
 [ 114.  148.  130.  142.   94.  142.  128.  128.  141.  109.  112.   97.
     1.    1.    1.    1.    1.    1.    1.    1.  140.]]
First 30 class labels [1 1 1 1 1 1 1 1 1 1]


In [12]:
from keras.models import Model
from keras.layers import Input, Dense, Embedding, \
    Concatenate, Flatten, SpatialDropout1D, \
    BatchNormalization, Conv1D, Maximum, ZeroPadding1D
from keras.layers import TimeDistributed
from keras.optimizers import Adam


def get_convo_nn2(no_word=101, n_gram=21, no_char=178): # no_word=200
    input1 = Input(shape=(n_gram,))
    # input2 = Input(shape=(n_gram,))

    a = Embedding(no_char, 32, input_length=n_gram)(input1)
    a = SpatialDropout1D(0.2)(a)
    '''
    a2 = Conv1D(no_word, 2, strides=1, padding="valid", activation='relu')(a)
    a2 = TimeDistributed(Dense(5, input_shape=(n_gram-1, no_word)))(a2)
    a2 = ZeroPadding1D(padding=(0, 1))(a2)

    a3 = Conv1D(no_word, 3, strides=1, padding="valid", activation='relu')(a)
    a3 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a3)
    a3 = ZeroPadding1D(padding=(0, 2))(a3)

    a4 = Conv1D(no_word, 4, strides=1, padding="valid", activation='relu')(a)
    a4 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a4)
    a4 = ZeroPadding1D(padding=(0, 3))(a4)

    a5 = Conv1D(no_word, 5, strides=1, padding="valid", activation='relu')(a)
    a5 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a5)
    a5 = ZeroPadding1D(padding=(0, 4))(a5)

    a6 = Conv1D(no_word, 6, strides=1, padding="valid", activation='relu')(a)
    a6 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a6)
    a6 = ZeroPadding1D(padding=(0, 5))(a6)

    a7 = Conv1D(no_word, 7, strides=1, padding="valid", activation='relu')(a)
    a7 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a7)
    a7 = ZeroPadding1D(padding=(0, 6))(a7)

    a8 = Conv1D(no_word, 8, strides=1, padding="valid", activation='relu')(a)
    a8 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a8)
    a8 = ZeroPadding1D(padding=(0, 7))(a8)

    a9 = Conv1D(no_word - 50, 9, strides=1, padding="valid", activation='relu')(a)
    a9 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a9)
    a9 = ZeroPadding1D(padding=(0, 8))(a9)

    a10 = Conv1D(no_word - 50, 10, strides=1, padding="valid", activation='relu')(a)
    a10 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a10)
    a10 = ZeroPadding1D(padding=(0, 9))(a10)

    a11 = Conv1D(no_word - 50, 11, strides=1, padding="valid", activation='relu')(a)
    a11 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a11)
    a11 = ZeroPadding1D(padding=(0, 10))(a11)

    a12 = Conv1D(no_word - 100, 12, strides=1, padding="valid", activation='relu')(a)
    a12 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a12)
    a12 = ZeroPadding1D(padding=(0, 11))(a12)

    a_concat = [a2, a3, a4, a5,
                a6, a7, a8, a9,
                a10, a11, a12]
    a_sum = Maximum()(a_concat)
    
    b = Embedding(12, 12, input_length=n_gram)(input2)
    b = SpatialDropout1D(0.2)(b)
    
    x = Concatenate(axis=-1)([a, a_sum])
    '''
    x = BatchNormalization()(a)

    # Flatten the per-character embeddings and classify with dense layers
    x = Flatten()(x)
    x = Dense(100, activation='relu')(x)
    x = Dense(100, activation='relu')(x) # new: second hidden layer
    out = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=input1, outputs=out)
    model.compile(optimizer=Adam(),
                  loss='binary_crossentropy',
                  metrics=['acc'])
    return model

In [11]:
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint
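# Make sure the checkpoint directory exists before ModelCheckpoint writes to it
os.makedirs('weight', exist_ok=True)
weight_path='weight/model_weight_1.h5'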
callbacks_list = [
        ReduceLROnPlateau(),
        ModelCheckpoint(
            weight_path,
            save_best_only=True,
            save_weights_only=True,
            monitor='val_loss',
            mode='min',
            verbose=1
        )
  ]

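# verbose: 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one log line per epoch.
verbose = 1
# Set to False to train without the validation split
validation_set = True
print('start training')
# Train in stages of (epochs, batch_size), increasing the batch size from stage to stage
model = get_convo_nn2()
train_params = [(10, 256), (3, 512), (3, 2048), (3, 4096), (3, 8192)]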
for (epochs, batch_size) in train_params:
  print("train with {} epochs and {} batch size".format(epochs, batch_size))
  if validation_set:
    model.fit(x_train_char, y_train, epochs=epochs, batch_size=batch_size, verbose=verbose,
              callbacks=callbacks_list,
              validation_data=(x_val_char, y_val))
  else:
    model.fit(x_train_char, y_train, epochs=epochs, batch_size=batch_size, verbose=verbose,
              callbacks=callbacks_list)


start training
train with 10 epochs and 256 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
train with 3 epochs and 512 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 2048 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 4096 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 8192 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
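
As a rough guide to what the staged schedule means in practice, the arithmetic below counts weight updates per epoch for each batch size, using the training-set size reported in the log above. The variable names are illustrative.

```python
import math

n_train = 16461637  # training samples reported in the log above
train_params = [(10, 256), (3, 512), (3, 2048), (3, 4096), (3, 8192)]

# One weight update per batch, so updates per epoch = ceil(samples / batch size).
steps = {batch_size: math.ceil(n_train / batch_size) for _, batch_size in train_params}

for epochs, batch_size in train_params:
    per_epoch = steps[batch_size]
    print('batch size {:>5}: {:>6} updates per epoch, {:>7} in this stage'.format(
        batch_size, per_epoch, per_epoch * epochs))
```

Larger batches mean far fewer updates per epoch, so the later stages take fewer, less noisy gradient steps over the same data.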


In [16]:
train_params = [(3, 2048), (3, 4096), (3, 8192)]
for (epochs, batch_size) in train_params:
  print("train with {} epochs and {} batch size".format(epochs, batch_size))
  if validation_set:
    model.fit(x_train_char, y_train, epochs=epochs, batch_size=batch_size, verbose=verbose,
              callbacks=callbacks_list,
              validation_data=(x_val_char, y_val))

train with 3 epochs and 2048 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 4096 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 8192 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [9]:
x_test_char, y_test = prepare_feature(best_processed_path, option='test')

In [8]:
from sklearn.metrics import precision_score, recall_score, f1_score
def evaluate(model):
  """
  Evaluate the model on the held-out 10 percent test split.
  Returns a tuple (F1, precision, recall) for the word-beginning class.
  """

  y_predict = model.predict(x_test_char)
  y_predict = (y_predict.ravel() > 0.5).astype(int)

  f1score = f1_score(y_test, y_predict)
  precision = precision_score(y_test, y_predict)
  recall = recall_score(y_test, y_predict)

  return f1score, precision, recall
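
To make the three scores concrete, here is a small hand computation on made-up labels. `boundary_scores` is a hypothetical helper written in plain Python, not the scikit-learn implementation, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def boundary_scores(y_true, y_pred):
    """Return (F1, precision, recall) for the positive class (1 = word start)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # boundaries found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # spurious boundaries
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return f1, precision, recall


# Made-up example: one spurious boundary (index 2) and one missed boundary (index 5).
y_true_toy = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_toy = [1, 0, 1, 1, 0, 0, 0, 0]
f1_toy, precision_toy, recall_toy = boundary_scores(y_true_toy, y_pred_toy)
print(f1_toy, precision_toy, recall_toy)
```

The return order matches `evaluate` above, so the tuples printed in the following cells read as (F1, precision, recall).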

In [18]:
evaluate(model) # old

(0.97495518445136331, 0.97053699312028929, 0.97941378574982818)

In [14]:
evaluate(model) # 2 Layer NN

(0.97809158644993555, 0.97241148881158623, 0.98383843170706664)

In [34]:
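def get_convo5_nn(no_word=100, n_gram=21, no_char=178): # no_word=200
    # Despite the name, the embedding and convolution layers below are commented out,
    # so this is a plain feedforward network applied to the raw character indices.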
    input1 = Input(shape=(n_gram,))
    # input2 = Input(shape=(n_gram,))

    # a = Embedding(no_char, 32, input_length=n_gram)(input1)
    # a = SpatialDropout1D(0.2)(a)
    a = input1
    
    #a5 = Conv1D(no_word, 5, strides=1, padding="valid", activation='relu')(a)
    #a5 = TimeDistributed(Dense(5, input_shape=(n_gram, no_word)))(a5)
    #a5 = ZeroPadding1D(padding=(0, 4))(a5)
    
    #x = Concatenate(axis=-1)([a, a5])
    #x = BatchNormalization()(a)

    #x = Flatten()(a)
    x = Dense(100, activation='relu')(a)
    #x = BatchNormalization()(x)
    x = Dense(100, activation='relu')(x)
    x = Dense(100, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=input1, outputs=out)
    model.compile(optimizer=Adam(),
                  loss='binary_crossentropy',
                  metrics=['acc'])
    return model

In [35]:
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint
weight_path_conv5='weight/model_weight_conv0.h5'
callbacks_list_conv5 = [
        ReduceLROnPlateau(),
        ModelCheckpoint(
            weight_path_conv5,
            save_best_only=True,
            save_weights_only=True,
            monitor='val_loss',
            mode='min',
            verbose=1
        )
  ]

verbose = 1
model_convo5 = get_convo5_nn()
train_params = [(10, 512), (3, 512), (3, 2048), (3, 4096), (3, 8192)]
for (epochs, batch_size) in train_params:
  print("train with {} epochs and {} batch size".format(epochs, batch_size))
  #if validation_set:
  model_convo5.fit(x_train_char, y_train, epochs=epochs, batch_size=batch_size, verbose=verbose,
            callbacks=callbacks_list_conv5,
            validation_data=(x_val_char, y_val))
  #else:
  #  model_convo5.fit(x_train_char, y_train, epochs=epochs, batch_size=batch_size, verbose=verbose,
  #            callbacks=callbacks_list_conv5)


train with 10 epochs and 512 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
train with 3 epochs and 512 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 2048 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 4096 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
train with 3 epochs and 8192 batch size
Train on 16461637 samples, validate on 2035694 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [14]:
#weight_path_conv5='weight/model_weight_1.h5'
#model_convo5 = get_convo5_nn()
#model_convo5.load_weights(weight_path_conv5)
evaluate(model_convo5)

(0.9765748053135006, 0.97086056942490806, 0.98235670449968016)

In [37]:
evaluate(model_convo5)

(0.90810695726913582, 0.90661995670143669, 0.90959884368409827)