<a href="https://colab.research.google.com/github/jonnyli1125/piemanese-translator/blob/main/piemanese/tm/notebooks/character_cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Piemanese TM Experiment: Character CNN
## Overview

This notebook experiments with a character-level convolutional neural network (CNN) for the Piemanese TM.

We frame the task as binary classification: the model receives two inputs, a Piemanese word and an English word, and classifies the pair as either a true pair or a false pair.

In their original use case, character-level CNNs were robust to noisy user data containing many misspellings, and that robustness is a key requirement for a Piemanese TM.
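To make the pair-classification framing concrete, the sketch below shows one way a pair scorer could be used for lookup: score the Piemanese word against every candidate English word and keep the best one. Note that `score_pair` is a stand-in built on `difflib` string similarity, not the CNN trained in this notebook, and the candidate list and `threshold` are made up for illustration.

```python
from difflib import SequenceMatcher


def score_pair(pi_word, en_word):
    """Stand-in scorer in [0, 1]; a trained model's sigmoid output would go here."""
    return SequenceMatcher(None, pi_word.lower(), en_word.lower()).ratio()


def best_translation(pi_word, candidates, threshold=0.5):
    """Return the highest-scoring candidate, or None if no score reaches the threshold."""
    scored = [(score_pair(pi_word, en_word), en_word) for en_word in candidates]
    best_score, best_word = max(scored)
    return best_word if best_score >= threshold else None


# Hypothetical candidate vocabulary, for illustration only.
vocabulary = ['leaf', 'you', 'do', 'games', 'friends']
print(best_translation('gamez', vocabulary))
```

With a real classifier in place of `score_pair`, the same loop would turn the binary true/false decision into a ranking over candidate English words.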

## References
- https://arxiv.org/pdf/1509.01626.pdf
- https://github.com/ahmedbesbes/character-based-cnn
- https://www.youtube.com/watch?v=CNY8VjJt-iQ
- https://towardsdatascience.com/character-level-cnn-with-keras-50391c3adf33

## Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [16]:
import glob
import os.path

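def load_dataset(word_pairs_dir, file_weight=None):
    """Load tab-separated (Piemanese, English) word pairs into a DataFrame.

    Files under `true/` get label 1 and files under `false/` get label 0.
    `file_weight` maps a file's base name to the number of times each of
    its rows is repeated (default: 1).
    """
    file_weight = file_weight or {}  # avoid a mutable default argument
    rows = []
    for dirname, label in [('true', 1), ('false', 0)]:
        for filename in glob.glob(f'{word_pairs_dir}/{dirname}/*'):
            # the weight depends only on the file, so look it up once per file
            n_repeats = file_weight.get(os.path.basename(filename), 1)
            with open(filename, 'r', encoding='utf-8') as f:
                for line in f:
                    pi_word, en_word = line.strip().split('\t')
                    for _ in range(n_repeats):
                        rows.append([pi_word, en_word, label])
    return pd.DataFrame(rows, columns=['pi_word', 'en_word', 'is_pair'])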

word_pairs = load_dataset(
    'word_pairs',  # colab path
    file_weight={
        'benchmark.tsv': 25,
        'replacements.tsv': 25,
        'upweight.tsv': 50
    }
)
word_pairs

index,pi_word,en_word,is_pair
0,are,are,1
1,could,could,1
2,though,though,1
3,a,a,1
4,ya,ya,1
...,...,...,...
2095037,yus,leaf,0
2095038,yus,tdf,0
2095039,yus,you,0
2095040,yus,do,0
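The `file_weight` argument of `load_dataset` oversamples selected files by repeating each of their rows. The toy sketch below illustrates the effect on label counts using in-memory rows instead of files. The rows and the file name `other.tsv` are made up, and `repeat_rows` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def repeat_rows(rows_by_file, file_weight=None):
    """Repeat every row of a file `file_weight[filename]` times (default: once)."""
    file_weight = file_weight or {}
    repeated = []
    for filename, rows in rows_by_file.items():
        n_repeats = file_weight.get(filename, 1)
        for row in rows:
            repeated.extend([row] * n_repeats)
    return repeated


# Made-up rows: (pi_word, en_word, is_pair). 'other.tsv' is a hypothetical file name.
rows_by_file = {
    'upweight.tsv': [('yus', 'yes', 1)],
    'other.tsv': [('yus', 'leaf', 0), ('yus', 'do', 0)],
}
rows = repeat_rows(rows_by_file, file_weight={'upweight.tsv': 50})
label_counts = Counter(label for _, _, label in rows)
print(label_counts)
```

Repeating rows this way changes how often the model sees a pair during training without changing the pair itself, which is why the weighted files dominate their share of the positive class.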


In [3]:
X_pi, X_en, y = word_pairs['pi_word'], word_pairs['en_word'], word_pairs['is_pair']

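# train/valid/test split: 70/15/15
# Step 1: hold out 30% of the rows (kept temporarily in the *_test variables).
X_pi_train, X_pi_test, X_en_train, X_en_test, y_train, y_test = train_test_split(
    X_pi, X_en, y, test_size=0.3, random_state=42)
# Step 2: split the held-out 30% in half to get the validation and test sets.
X_pi_valid, X_pi_test, X_en_valid, X_en_test, y_valid, y_test = train_test_split(
    X_pi_test, X_en_test, y_test, test_size=0.5, random_state=42)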
X_pi_train

1582100      steel
1962911       nott
2055086       utub
1739595       leze
419197       games
            ...   
259178     control
1414414      swerr
131932          no
671155         the
121958     friends
Name: pi_word, Length: 1466529, dtype: object
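The 70/15/15 split above comes from two successive splits: 30% is held out first, then that holdout is halved. The pure-Python sketch below reproduces only the arithmetic on a toy range. It is not scikit-learn's implementation (which rounds sizes slightly differently), and `two_stage_split` is a hypothetical helper.

```python
import random


def two_stage_split(items, holdout_frac=0.3, test_frac_of_holdout=0.5, seed=42):
    """Shuffle, hold out `holdout_frac`, then split the holdout into valid/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    n_test = round(n_holdout * test_frac_of_holdout)
    n_train = len(shuffled) - n_holdout
    train = shuffled[:n_train]
    holdout = shuffled[n_train:]
    valid = holdout[:n_holdout - n_test]
    test = holdout[n_holdout - n_test:]
    return train, valid, test


train, valid, test = two_stage_split(range(1000))
print(len(train), len(valid), len(test))
```

Because 0.5 of 0.3 is 0.15, the validation and test sets each end up with 15% of the data while the three sets stay disjoint.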

## Model Architecture

In [4]:
from tensorflow.keras.initializers import Constant
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.layers import TextVectorization, Embedding, Concatenate, Flatten, Dense, Conv1D, MaxPool1D
from tensorflow.keras import Input, Model

In [5]:
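# sequence length: longest word in either column, rounded up to an even number
longest_pi = max(len(w) for w in word_pairs['pi_word'])
longest_en = max(len(w) for w in word_pairs['en_word'])
seq_len = int(np.ceil(max(longest_pi, longest_en) / 2) * 2)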
seq_len

20
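The expression above rounds the longest word length up to the nearest even number, so a result of 20 means the longest word has 19 or 20 characters. A minimal standalone sketch of the same rounding, using `math` instead of NumPy and a few sample words taken from the split output above:

```python
import math


def round_up_to_even(n):
    """Smallest even integer that is >= n."""
    return int(math.ceil(n / 2) * 2)


words = ['friends', 'control', 'swerr']  # sample words from the split output
longest = max(len(w) for w in words)
print(longest, round_up_to_even(longest))
```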

In [6]:
# construct keras model
pi_input = Input(shape=(1,), name='pi_input', dtype=tf.string)
en_input = Input(shape=(1,), name='en_input', dtype=tf.string)

# tokenizer
chars = list("abcdefghijklmnopqrstuvwxyz0123456789' ")
tokenizer = TextVectorization(standardize='lower', split='character', output_sequence_length=seq_len, vocabulary=chars)
pi_tokens = tokenizer(pi_input)
en_tokens = tokenizer(en_input)

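# embedding layer: frozen one-hot encoding of the V characters;
# ids 0 (padding) and 1 (out-of-vocabulary) map to all-zero vectors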
V = tokenizer.vocabulary_size() - 2
embedding_weights = np.concatenate([np.zeros((2, V)), np.diag(np.ones(V))], axis=0)
embedding = Embedding(*embedding_weights.shape, input_length=seq_len, trainable=False, embeddings_initializer=Constant(embedding_weights))
pi_embedding = embedding(pi_tokens)  # (B, S, V)
en_embedding = embedding(en_tokens)
x = Concatenate()([pi_embedding, en_embedding])  # (B, S, 2V)

# conv layers
x = Conv1D(32, 3, activation='relu')(x)
x = Conv1D(32, 3, activation='relu')(x)
x = Conv1D(32, 3, activation='relu')(x)
#x = Conv1D(32, 3, activation='relu', padding='same')(x)
#x = Conv1D(32, 3, activation='relu', padding='same')(x)
#x = Conv1D(32, 3, activation='relu', padding='same')(x)
x = MaxPool1D(3)(x)
x = Flatten()(x)

# dense layers
#x = Dense(64, activation='relu')(x)

# output layer
output = Dense(1, activation='sigmoid')(x)  # (B, 1)

model = Model(inputs=[pi_input, en_input], outputs=output, name='piemanese_tm_cnn')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[Precision(), Recall()])
model.summary()

Model: "piemanese_tm_cnn"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 pi_input (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 en_input (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 text_vectorization (TextVector  (None, 20)          0           ['pi_input[0][0]',               
 ization)                                                         'en_input[0][0]']               
                                                                                                  
 embedding (Embedding)          (None, 20, 38)       1520        ['text_vectorizati
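The tokenizer and the frozen embedding together amount to a one-hot encoding of characters. The pure-Python sketch below mimics that behaviour without TensorFlow, assuming the usual `TextVectorization` convention that id 0 is padding and id 1 is the out-of-vocabulary token, which is also why the notebook subtracts 2 from the vocabulary size. The helpers `tokenize` and `one_hot` are hypothetical names.

```python
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789' "
PAD_ID, OOV_ID = 0, 1                                  # reserved ids
CHAR_TO_ID = {c: i + 2 for i, c in enumerate(CHARS)}   # vocabulary ids start at 2


def tokenize(word, seq_len):
    """Lowercase, map characters to ids, then truncate or pad to seq_len."""
    ids = [CHAR_TO_ID.get(c, OOV_ID) for c in word.lower()][:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))


def one_hot(token_id):
    """One row of the frozen embedding matrix; all zeros for padding/OOV ids."""
    vector = [0.0] * len(CHARS)
    if token_id >= 2:
        vector[token_id - 2] = 1.0
    return vector


ids = tokenize('Ya', seq_len=4)
vectors = [one_hot(i) for i in ids]
print(ids)
print((len(CHARS) + 2) * len(CHARS))  # number of entries in the embedding matrix
```

The embedding matrix has one row per token id (38 characters plus 2 reserved ids) and one column per character, which is consistent with the 1520 parameters reported for the embedding layer in the summary.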

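The sequence length shrinks through the convolution and pooling layers because no padding is applied. The sketch below works through that arithmetic under the Keras defaults used in the model code (stride 1 and `'valid'` padding for `Conv1D`, stride equal to the pool size for `MaxPool1D`). The resulting numbers are derived here and cannot be checked against the truncated summary above.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1


def maxpool1d_valid_length(length, pool_size):
    """Output length of MaxPool1D with 'valid' padding and stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


seq_len, filters = 20, 32
length = seq_len
for _ in range(3):  # three Conv1D(32, 3) layers
    length = conv1d_valid_length(length, kernel_size=3)
length = maxpool1d_valid_length(length, pool_size=3)  # MaxPool1D(3)
flattened = length * filters  # size of the Flatten() output fed to the sigmoid layer
print(length, flattened)
```

Each kernel-size-3 convolution removes two positions, so the 20-step input is reduced to 14 steps before pooling.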
## Training

In [14]:
from google.colab import auth
auth.authenticate_user()

!gcloud config set project piemanese-translator

Updated property [core/project].


In [7]:
model.fit(
    x=[X_pi_train, X_en_train],
    y=y_train,
    validation_data=([X_pi_valid, X_en_valid], y_valid),
    batch_size=128,
    epochs=20,
    verbose=2,
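    callbacks=[
        # save the model after every epoch, named by epoch and validation loss
        ModelCheckpoint('tm_char_cnn/ckpt/{epoch:02d}-{val_loss:.4f}'),
        # stop after 3 epochs without val_loss improvement and restore the best weights
        EarlyStopping(patience=3, restore_best_weights=True),
    ])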

Epoch 1/20
11458/11458 - 434s - loss: 0.0337 - precision: 0.9243 - recall: 0.8358 - val_loss: 0.0124 - val_precision: 0.9520 - val_recall: 0.9504 - 434s/epoch - 38ms/step
Epoch 2/20
11458/11458 - 417s - loss: 0.0098 - precision: 0.9617 - recall: 0.9672 - val_loss: 0.0072 - val_precision: 0.9696 - val_recall: 0.9739 - 417s/epoch - 36ms/step
Epoch 3/20
11458/11458 - 411s - loss: 0.0066 - precision: 0.9733 - recall: 0.9804 - val_loss: 0.0070 - val_precision: 0.9579 - val_recall: 0.9938 - 411s/epoch - 36ms/step
Epoch 4/20
11458/11458 - 422s - loss: 0.0051 - precision: 0.9798 - recall: 0.9867 - val_loss: 0.0058 - val_precision: 0.9744 - val_recall: 0.9880 - 422s/epoch - 37ms/step
Epoch 5/20
11458/11458 - 397s - loss: 0.0044 - precision: 0.9832 - recall: 0.9894 - val_loss: 0.0055 - val_precision: 0.9689 - val_recall: 0.9963 - 397s/epoch - 35ms/step
Epoch 6/20
11458/11458 - 412s - loss: 0.0039 - precision: 0.9854 - recall: 0.9906 - val_loss: 0.0056 - val_precision: 0.9669 - val_recall: 0.9966

<keras.callbacks.History at 0x7f64c5cce6d0>
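The `EarlyStopping(patience=3)` callback stops training once `val_loss` has failed to improve for three consecutive epochs. The function below is a simplified sketch of that patience rule, not Keras's own implementation, applied to the six `val_loss` values visible in the log above (the log is cut off after epoch 6, so the actual stopping epoch is not shown).

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best_loss, best_epoch, wait = float('inf'), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


logged = [0.0124, 0.0072, 0.0070, 0.0058, 0.0055, 0.0056]  # val_loss, epochs 1-6
print(simulate_early_stopping(logged))
```

On the logged values the rule has not fired by epoch 6: the best epoch so far is 5, and only one epoch without improvement has passed.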

In [8]:
model.evaluate(x=[X_pi_test, X_en_test], y=y_test, batch_size=128)



[0.0033481966238468885, 0.9889034032821655, 0.9931531548500061]
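`model.evaluate` returns the loss followed by the compiled metrics in order, so the list above reads as `[loss, precision, recall]`. As a small worked example, the two metrics can be combined into an F1 score, which this notebook does not report itself:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluate() output above: [loss, precision, recall].
loss, precision, recall = 0.0033481966238468885, 0.9889034032821655, 0.9931531548500061
f1 = f1_score(precision, recall)
print(round(f1, 4))
```

For these test-set values the F1 score comes out at roughly 0.991.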

In [9]:
model.save('tm_char_cnn')

In [10]:
# zip to archive
!zip -r tm_char_cnn.zip tm_char_cnn

  adding: tm_char_cnn/ (stored 0%)
  adding: tm_char_cnn/saved_model.pb (deflated 90%)
  adding: tm_char_cnn/ckpt/ (stored 0%)
  adding: tm_char_cnn/ckpt/04-0.0058/ (stored 0%)
  adding: tm_char_cnn/ckpt/04-0.0058/saved_model.pb (deflated 90%)
  adding: tm_char_cnn/ckpt/04-0.0058/assets/ (stored 0%)
  adding: tm_char_cnn/ckpt/04-0.0058/variables/ (stored 0%)
  adding: tm_char_cnn/ckpt/04-0.0058/variables/variables.data-00000-of-00001 (deflated 19%)
  adding: tm_char_cnn/ckpt/04-0.0058/variables/variables.index (deflated 66%)
  adding: tm_char_cnn/ckpt/04-0.0058/keras_metadata.pb (deflated 94%)
  adding: tm_char_cnn/ckpt/05-0.0055/ (stored 0%)
  adding: tm_char_cnn/ckpt/05-0.0055/saved_model.pb (deflated 89%)
  adding: tm_char_cnn/ckpt/05-0.0055/assets/ (stored 0%)
  adding: tm_char_cnn/ckpt/05-0.0055/variables/ (stored 0%)
  adding: tm_char_cnn/ckpt/05-0.0055/variables/variables.data-00000-of-00001 (deflated 19%)
  adding: tm_char_cnn/ckpt/05-0.0055/variables/variables.index (deflated 
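The checkpoint directory names in the archive listing follow the format string passed to `ModelCheckpoint`. The sketch below applies the same template with plain `str.format`, using the rounded `val_loss` values from the training log as inputs. `checkpoint_path` is a hypothetical helper for illustration.

```python
CKPT_TEMPLATE = 'tm_char_cnn/ckpt/{epoch:02d}-{val_loss:.4f}'


def checkpoint_path(epoch, val_loss):
    """Fill the template: zero-padded two-digit epoch and four-decimal loss."""
    return CKPT_TEMPLATE.format(epoch=epoch, val_loss=val_loss)


print(checkpoint_path(4, 0.0058))
print(checkpoint_path(5, 0.0055))
```

Since `save_best_only` is left at its default, a checkpoint is written for every epoch, so the archive contains one such directory per completed epoch.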

In [11]:
# download locally
from google.colab import files
files.download('tm_char_cnn.zip') 

In [15]:
# copy to bucket
!gsutil -m cp -r tm_char_cnn gs://piemanese-translator-data/tm/

Copying file://tm_char_cnn/keras_metadata.pb [Content-Type=application/octet-stream]...
Copying file://tm_char_cnn/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://tm_char_cnn/ckpt/05-0.0055/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://tm_char_cnn/ckpt/04-0.0058/variables/variables.index [Content-Type=application/octet-stream]...
Copying file://tm_char_cnn/ckpt/04-0.0058/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://tm_char_cnn/ckpt/04-0.0058/keras_metadata.pb [Content-Type=application/octet-stream]...