__This notebook re-implements the paper:__ _https://arxiv.org/pdf/1912.07095.pdf_

_Diagrams of some of the blocks we will implement:_

![img1](images/truecasing-1.png)
![img2](images/truecasing-2.png)

## Utility Functions

In [45]:
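def load_data(path):
    """Read tab-separated lines (text, label) and return two parallel lists."""
    with open(path, "r", encoding="utf8") as f:
        data = f.readlines()

    X, y = [], []
    for sample in data:
        # Each line holds the input text and its label, separated by a tab.
        # Note: the label keeps its trailing newline.
        tmp = sample.split("\t")
        X.append(tmp[0])
        y.append(tmp[1])

    return X, y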

## Load Data

In [58]:
from pprint import pprint
from sklearn.model_selection import train_test_split

In [48]:
data_path = "./data/ner-fullname.txt"
X, y = load_data(data_path)

In [49]:
X[0], y[0]

('họ và tên tôi là trần hoài an huyền', 'trần hoài an huyền|sys.full_name\n')
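
Each label has the form `<entity>|<tag>` followed by a newline. As a minimal sketch (the helper name `parse_label` is hypothetical and not part of the original notebook), a label could be split into its parts and located inside the input text like this, assuming the entity appears verbatim in the text:

```python
def parse_label(text, label):
    """Split a raw label such as 'name|sys.full_name\\n' into its parts.

    Returns (entity, tag, start, end), where text[start:end] == entity,
    or start == end == -1 if the entity is not found verbatim.
    """
    # Drop the trailing newline, then split on the last "|" only.
    entity, tag = label.strip().rsplit("|", 1)
    start = text.find(entity)
    end = start + len(entity) if start >= 0 else -1
    return entity, tag, start, end


text = "họ và tên tôi là trần hoài an huyền"
label = "trần hoài an huyền|sys.full_name\n"
entity, tag, start, end = parse_label(text, label)
print(tag, text[start:end] == entity)
```

Only the first verbatim occurrence is located, so texts that repeat the name would need extra handling.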

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
pprint([sample for sample in zip(X_train[:5], y_train[:5])])

[('lều chí sình là sếp của tôi', 'lều chí sình|sys.full_name\n'),
 ('thái ngô khánh tuyền là mình', 'thái ngô khánh tuyền|sys.full_name\n'),
 ('lý a lầy là anh họ tôi', 'lý a lầy|sys.full_name\n'),
 ('nguyễn duy là tên của một đồng nghiệp', 'nguyễn duy|sys.full_name\n'),
 ('tên trên giấy tờ của tôi là trần đức giang',
  'trần đức giang|sys.full_name\n')]


## Modeling

### Truecasing Model

In [52]:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset
from tensorflow.keras.layers import TextVectorization

In [7]:
print(f"Tensorflow version: {tf.__version__}")

Tensorflow version: 2.6.0


In [8]:
print("Check GPU", end="\n\n")
print(tf.config.list_physical_devices("GPU"))

Check GPU

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [None]:
# Choose this value by inspecting your own text corpus.
# Here 90 is just a naive assumption.
MAXLEN = 90

# We need to vectorize every text in the dataset, mapping each
# character to its unique index in the vocabulary.
# Note: `tf.strings.bytes_split` splits on UTF-8 bytes, so an accented
# Vietnamese character spans several tokens.
vectorizer = TextVectorization(split=tf.strings.bytes_split,
                               output_mode="int",
                               output_sequence_length=MAXLEN)
vectorizer.adapt(Dataset.from_tensor_slices(X_train).batch(64))
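
One way to pick `MAXLEN` from the corpus rather than guessing is to look at a high percentile of the text lengths. The sketch below is an assumption about how this could be done, not part of the original notebook: the helper `byte_length_percentile` and the sample texts are hypothetical. Lengths are measured in UTF-8 bytes because the vectorizer above splits with `tf.strings.bytes_split`.

```python
import math


def byte_length_percentile(texts, q=0.99):
    """Return the q-th percentile (nearest-rank method) of UTF-8 byte lengths."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = sorted(len(t.encode("utf8")) for t in texts)
    # Nearest-rank: the smallest length covering at least a fraction q of the texts.
    rank = max(1, math.ceil(q * len(lengths)))
    return lengths[rank - 1]


# Hypothetical sample; in the notebook this would be X_train.
sample_texts = ["lý a lầy là anh họ tôi", "nguyễn duy là tên của một đồng nghiệp"]
print("characters:", max(len(t) for t in sample_texts))
print("bytes (99th percentile):", byte_length_percentile(sample_texts))
```

Because accented characters take more than one byte, the byte-based figure is larger than the character count, so a `MAXLEN` chosen from character counts could truncate inputs.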

In [44]:
text_dataset = tf.data.Dataset.from_tensor_slices(["foo", "bar", "baz"])
max_features = 5000  # Maximum vocab size.
max_len = 50  # Sequence length to pad the outputs to.

# def _chars_split(x):
#     splits = []
#     for i, _ in enumerate(x):
#         chars = np.array(list(x[i])).reshape(1, -1).squeeze()
#         print(chars)
#         splits.append(chars.tolist())
#     return splits

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    split=tf.strings.bytes_split,
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(text_dataset.batch(64))
# vectorize_layer.adapt(np.array(["foo", "bar", "baz"]))

# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

# Now, the model can map strings to integers, and you can add an embedding
# layer to map these integers to learned embeddings.
input_data = [["foo qux bar"], ["qux baz"]]
print(model.predict(input_data))
print(vectorize_layer.get_vocabulary())

[[7 2 2 1 1 1 1 1 3 4 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 3 4 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
['', '[UNK]', 'o', 'b', 'a', 'z', 'r', 'f']
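
To make the output above easier to read: index 0 is reserved for padding, index 1 for unknown tokens (`[UNK]`), and the remaining indices follow the vocabulary order. The following pure-Python sketch mimics that mapping for ASCII input. It is an illustration only, not how `TextVectorization` is implemented; the helper `encode` is hypothetical, it iterates over characters rather than bytes, and it skips the layer's default standardization.

```python
def encode(text, vocab, max_len):
    """Map each character to its vocabulary index, then pad or truncate to max_len."""
    index = {token: i for i, token in enumerate(vocab)}
    unk = index["[UNK]"]
    # Characters missing from the vocabulary (here 'q', 'u', 'x', ' ') map to [UNK].
    ids = [index.get(ch, unk) for ch in text][:max_len]
    # Index 0 is the padding value.
    return ids + [0] * (max_len - len(ids))


vocab = ["", "[UNK]", "o", "b", "a", "z", "r", "f"]
print(encode("foo qux bar", vocab, 50)[:12])
# [7, 2, 2, 1, 1, 1, 1, 1, 3, 4, 6, 0]
```

Spaces map to `[UNK]` because the vocabulary was adapted only on "foo", "bar", and "baz", none of which contains a space.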


## Training

## Evaluation