# Model Training <a id="model-training"></a>

This notebook trains the language recognition models. These are sequence models that receive
character-level tokenized sentences and output a label (the language of the text). They are trained on the dataset
prepared with the `make_dataset` script from the TED talks data.

## Contents
- [Dataset Preparation](#dataset-preparation)
- [Model Fitting](#model-fitting)
- [Save Model](#save-model)
- [Model Evaluation](#model-evaluation)

In [None]:
import os
import sys
import boto3
import dotenv
import numpy as np
import tensorflow as tf
from datetime import date
from tensorflow import keras
import tensorflow_text as tf_text

dotenv.load_dotenv(os.path.join("..", ".env"));

## Dataset Preparation <a id="dataset-preparation"></a>

First we prepare the dataset for model training. We tokenize the text at the character level with TensorFlow Text and build the input pipeline with the `tf.data` API.

[Back to top.](#model-training)
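As a rough illustration of what character-level tokenization and padding produce, here is a minimal pure-Python sketch. It assumes tokens are Unicode code points and that 0 is the padding value. `char_tokenize` and `pad_tokens` are hypothetical helper names, not part of the pipeline in this notebook, and raising an error on overlong input is a choice made for this sketch only.

```python
from typing import List

PAD_LENGTH = 8  # Small value for illustration only; the notebook uses 128


def char_tokenize(text: str) -> List[int]:
    """Map each character to its Unicode code point, dropping newlines."""
    return [ord(ch) for ch in text.replace("\n", "")]


def pad_tokens(tokens: List[int], length: int = PAD_LENGTH, pad_value: int = 0) -> List[int]:
    """Right-pad a token list with pad_value up to the given length."""
    if len(tokens) > length:
        raise ValueError("sequence is longer than the padded length")
    return tokens + [pad_value] * (length - len(tokens))


tokens = char_tokenize("Hola\n")
padded = pad_tokens(tokens)
print(tokens)  # [72, 111, 108, 97]
print(padded)  # [72, 111, 108, 97, 0, 0, 0, 0]
```

Because 0 is reserved for padding, an embedding layer can mask those positions and ignore them downstream.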

In [None]:
# Constants and utilities
tok = tf_text.UnicodeCharTokenizer()
PAD_LENGTH: int = 128  # Length to which to pad sequences.
BATCH_SIZE: int = 1024  # Size of dataset batches
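VOCAB_SIZE: int = 256  # Vocabulary size: code points 0-255 (Latin-1, a superset of ASCII)
# NOTE: code points above 255 (e.g. Turkish "ş", U+015F) fall outside this range
# unless the dataset script already filters or replaces them.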
DATASET_PATH: str = os.path.join("..", "data", "processed-text")  # Path to dataset files

# Labels
LANGUAGES = [
    "English",
    "French",
    "Italian",
    "Portuguese",
    "Spanish",
    "Turkish"
]

LANGUAGES.sort()
NUM_CLASSES: int = len(LANGUAGES)  # Number of categories to classify

# Functions to prepare input for model
def drop_newlines(strings):
    return tf.strings.regex_replace(strings, r"\n", "")

def prepare_input(strings):
    """
    Prepare the strings for the model by dropping newlines and tokenizing them.
    """
    # Remove newline characters from strings
    strings_clean = drop_newlines(strings)
    
    # Tokenize
    tokens = tok.tokenize(strings_clean)
    
    # Return the ragged token sequences; padding is applied later in the pipeline
    return tokens

In [None]:
def make_dataset(split: str = "train"):
    """
    Make a dataset for the given data split (train or test) by making a
    TextLineDataset for each language and then interleaving them.
    """
    datasets = []
    for i, lang in enumerate(LANGUAGES):
        path = os.path.join(DATASET_PATH, split, lang)
        
        # Get TextLineDataset for this language
        ds = tf.data.TextLineDataset([
            os.path.join(path, file_name)
            for file_name in os.listdir(path)
        ])
        
        # Prepare input: tokenize, pad token sequences, add labels (int)
        # The label is bound as a default argument so each dataset keeps its own index
        ds = ds.map(prepare_input)\
            .padded_batch(1, padded_shapes=[PAD_LENGTH])\
            .unbatch()\
            .map(lambda tokens, label=i: (tokens, label))
        datasets.append(ds)
    
    # Interleave datasets
    merged = tf.data.Dataset.from_tensor_slices(datasets)\
        .interleave(lambda x: x)\
        .shuffle(1024)

    return merged
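To illustrate the idea of labeling each language's lines and then interleaving the streams, here is a small pure-Python sketch. It is an analogy, not the `tf.data` implementation: `label_stream` and `interleave_round_robin` are hypothetical helpers, the round-robin order is an assumption made for clarity, and the sample sentences are made up.

```python
from typing import Iterable, Iterator, List, Tuple


def label_stream(lines: Iterable[str], label: int) -> Iterator[Tuple[str, int]]:
    """Attach an integer class label to every line of one language."""
    for line in lines:
        yield line, label


def interleave_round_robin(streams: List[Iterator]) -> Iterator:
    """Yield one element from each stream in turn until all are exhausted."""
    active = list(streams)
    while active:
        still_active = []
        for stream in active:
            try:
                item = next(stream)
            except StopIteration:
                continue  # This stream is exhausted; drop it
            still_active.append(stream)
            yield item
        active = still_active


english = label_stream(["hello", "good morning"], 0)
french = label_stream(["bonjour"], 1)
merged = list(interleave_round_robin([english, french]))
print(merged)  # [('hello', 0), ('bonjour', 1), ('good morning', 0)]
```

Interleaving alone still produces a regular pattern of labels, which is why a shuffle step afterwards helps mix the classes within each batch.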

In [None]:
# Training and validation datasets
train_dataset = make_dataset("train")
valid_dataset = make_dataset("test")

# Preprocess data
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(128)
valid_dataset = valid_dataset.batch(BATCH_SIZE).prefetch(128)

## Model Fitting <a id="model-fitting"></a>

Now we will build and fit the models to the data.

[Back to top.](#model-training)

In [None]:
def make_model(embedding_dim: int = 32, lstm_dim: int = 32, dropout_rate: float = 0.3) -> keras.Model:
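    """
    Build the classifier: embedding -> bidirectional LSTM -> Conv1D ->
    global average pooling -> dense layers with a softmax output.
    """
    assert 0. <= dropout_rate < 1.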
    keras.backend.clear_session()

    # Input tokens and embed
    inputs = keras.layers.Input(shape=(PAD_LENGTH,), dtype=tf.int32)
    embed = keras.layers.Embedding(
        input_dim=VOCAB_SIZE, 
        output_dim=embedding_dim,
        mask_zero=True,
        input_length=PAD_LENGTH)(inputs)
    
    # Recurrent / Convolutional layers
    x = keras.layers.Bidirectional(keras.layers.LSTM(
        lstm_dim,
        activation="tanh",
        return_sequences=True))(embed)
    x = keras.layers.Dropout(dropout_rate)(x)
    
    x = keras.layers.Conv1D(
        2 * lstm_dim,
        3,
        padding="same",
        activation="tanh")(x)
    x = keras.layers.Dropout(dropout_rate)(x)
    
    # Pooling, final output
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(16, activation=keras.layers.LeakyReLU())(x)
    x = keras.layers.Dropout(dropout_rate)(x)
    out = keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    
    # Model
    model = keras.Model(inputs=inputs, outputs=out)
    model.summary()
    return model

In [None]:
model = make_model(lstm_dim=16, dropout_rate=.4)
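As a sanity check on the size of the network, the trainable parameter count can be estimated by hand and compared with the output of `model.summary()`. The sketch below uses the standard parameter-count formulas for each layer type with the values from this notebook (vocabulary 256, embedding dimension 32, `lstm_dim=16`, 6 classes). It assumes every layer uses a bias term and that the LSTM has a single bias vector per gate; the helper names are hypothetical.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One vector of size dim per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def conv1d_params(in_channels: int, filters: int, kernel_size: int) -> int:
    """One kernel of shape (kernel_size, in_channels) plus a bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_features: int, out_features: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features


VOCAB_SIZE, EMBEDDING_DIM, LSTM_DIM, NUM_CLASSES = 256, 32, 16, 6

total = (
    embedding_params(VOCAB_SIZE, EMBEDDING_DIM)      # 8192
    + 2 * lstm_params(EMBEDDING_DIM, LSTM_DIM)       # 6272 (two directions)
    + conv1d_params(2 * LSTM_DIM, 2 * LSTM_DIM, 3)   # 3104
    + dense_params(2 * LSTM_DIM, 16)                 # 528
    + dense_params(16, NUM_CLASSES)                  # 102
)
print(total)  # 18198
```

Dropout, pooling and the LeakyReLU activation add no trainable parameters, so they do not appear in the sum.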

In [None]:
# Fit parameters
EPOCHS = 2  # Number of passes over the training dataset
VERBOSE = 1  # Show a progress bar during training

In [None]:
# Compile and fit
model.compile(
    optimizer=keras.optimizers.Adam(0.02),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)


history = model.fit(
    train_dataset, 
    epochs=EPOCHS,
    validation_data=valid_dataset,
    verbose=VERBOSE
)
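To make the loss and metric concrete, here is a minimal pure-Python sketch of sparse categorical cross-entropy and accuracy on a toy batch. It assumes integer labels and already-normalized probability vectors, and it omits the numerical clipping that a real implementation applies. The function names and the toy numbers are made up for illustration.

```python
import math
from typing import List


def sparse_crossentropy(probs: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def batch_metrics(batch_probs: List[List[float]], labels: List[int]):
    """Mean loss and accuracy over a batch."""
    losses = [sparse_crossentropy(p, y) for p, y in zip(batch_probs, labels)]
    hits = sum(argmax(p) == y for p, y in zip(batch_probs, labels))
    return sum(losses) / len(losses), hits / len(labels)


batch_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # True class is 1, so this one is misclassified
]
labels = [0, 2, 1]
mean_loss, acc = batch_metrics(batch_probs, labels)
print(f"loss={mean_loss:.4f} accuracy={acc:.2f}")
```

The labels stay as plain integers here, which is what distinguishes the sparse variant from the one-hot form of the loss.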

## Save Model <a id="save-model"></a>

Now we save the model as an `h5` file and upload it to S3. Optionally, we also convert it to a `tflite` file so that the service can run inference with a lighter version of TF.

[Back to top.](#model-training)
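The uploaded `h5` file gets a dated object key, built with a date format specifier inside an f-string. A minimal sketch of that naming scheme follows; `dated_model_key` is a hypothetical helper and the date is an arbitrary example.

```python
from datetime import date


def dated_model_key(day: date, prefix: str = "models", extension: str = "h5") -> str:
    """Build an object key such as models/model-YYYY-MM-DD.h5."""
    return f"{prefix}/model-{day:%Y-%m-%d}.{extension}"


print(dated_model_key(date(2021, 3, 7)))  # models/model-2021-03-07.h5
```

Zero-padded ISO dates sort lexicographically in chronological order, so listing the keys under the prefix shows the models from oldest to newest.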

In [None]:
save_tflite = True

In [None]:
# Save model and upload to S3
model_dir = os.path.join("..", "data", "models")
os.makedirs(model_dir, exist_ok=True)

model_file = os.path.join(model_dir, "model.h5")
model.save(model_file)

s3 = boto3.client("s3")
s3.upload_file(
    model_file,
    os.getenv("S3_BUCKET"),
    f"models/model-{date.today():%Y-%m-%d}.h5"
)

if save_tflite:
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    lite_model = converter.convert()
    
    # Save lite model in the server's directory
    out_dir = os.path.join("..", "service", "src", "assets")
    os.makedirs(out_dir, exist_ok=True)
    
    out_file = os.path.join(out_dir, "model.tflite")
    with open(out_file, "wb") as f:
        f.write(lite_model)
        print(out_file)
    
    del converter, lite_model
    s3.upload_file(
        Bucket=os.getenv("S3_BUCKET"),
        Key="models/service/model.tflite",
        Filename=out_file
    )


## Model Evaluation <a id="model-evaluation"></a>

Now we spot-check the model's predictions on a few sample sentences in different languages.

[Back to top.](#model-training)

In [None]:
# Benchmark
test_sentences = [
    "Mary had a little lamb, its fleece was white as snow",
    "nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura",
    "No tengo penas ni tengo amores y asi no sufro de sinsabores",
    "Non seulement prend plaisir aux malheurs des autres, mais aussi neglige le bien-etre de sa famille"
]

prepared = prepare_input(test_sentences)
prepared = prepared.to_tensor(0, shape=(prepared.shape[0], PAD_LENGTH))
pred = model.predict(prepared)
langs = np.argmax(pred, axis=1).tolist()
conf = np.max(pred, axis=1).tolist()

for sentence, lang, confidence in zip(test_sentences, langs, conf):
    print("Sentence:")
    print(sentence)
    print()
    
    print("Predicted Language:")
    print(LANGUAGES[lang])
    print("Confidence: %.2f%%" % (100 * confidence))
    print("=" * 50)
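A handful of sentences only gives a qualitative impression. For a more systematic view, accuracy could be computed per language over a labeled set. The sketch below shows one way to do that in pure Python; `per_language_accuracy` is a hypothetical helper and the labels and predictions are made-up values, not results from this model.

```python
from collections import Counter
from typing import Dict, List


def per_language_accuracy(
    true_labels: List[int], predicted_labels: List[int], languages: List[str]
) -> Dict[str, float]:
    """Fraction of correctly classified examples for each true language."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {languages[i]: correct[i] / totals[i] for i in sorted(totals)}


languages = ["English", "French", "Italian"]
true_labels = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 1]
print(per_language_accuracy(true_labels, predicted, languages))
# {'English': 1.0, 'French': 0.5, 'Italian': 0.5}
```

A per-class breakdown like this makes it easier to see whether closely related languages are being confused with each other.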