<a href="https://colab.research.google.com/github/phoenixfin/deeplearning-notebooks/blob/main/Document_Classification_HackerRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import tensorflow as tf
import numpy as np
import random
from matplotlib import pyplot as plt

In [None]:
!wget https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt

--2020-12-15 07:18:10--  https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.110.254
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.110.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3337441 (3.2M) [text/plain]
Saving to: ‘trainingdata.txt’


2020-12-15 07:18:11 (5.29 MB/s) - ‘trainingdata.txt’ saved [3337441/3337441]



In [None]:
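data = []
with open('/content/trainingdata.txt') as f:
  # Skip the first line (assumed to be a header, not a document).
  for line in f.readlines()[1:]:
    # Each remaining line has the form "<label> <text>".
    label_text, sentence = line.split(' ', 1)
    # Assumption: labels in the file run from 1 to 8. Shift them to 0..7 so
    # that tf.one_hot(..., 8) further down gives every class its own column.
    label = int(label_text) - 1
    data.append((sentence, label))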

### Setting up the hyperparameters

The cell below lists every hyperparameter used by the model. Each of them affects training performance.

In [None]:
batch_size = 32
split = 0.2
seed = 12
vocab_size = 1000
embedding_dim = 64
max_length = 1000
num_epochs = 100
stopping_patience = 10
success_threshold = 0.9
learning_rate = 0.0005

### Data Preprocessing

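The dataset is already in memory as a list of `(sentence, label)` tuples, so instead of being loaded from a directory with
[`text_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory) it is shuffled and split by hand into a training set and a validation set. The `split` hyperparameter sets the fraction held out for validation.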

In [None]:
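def split_data(data):
    # Shuffle a copy of the data, then split it into training and validation sets.
    # random.shuffle works in place and returns None, so its result is not assigned.
    shuffled_data = list(data)
    random.Random(seed).shuffle(shuffled_data)
    training_size = int(len(shuffled_data) * (1 - split))

    train_data = shuffled_data[:training_size]
    val_data = shuffled_data[training_size:]
    return train_data, val_data

train_data, val_data = split_data(data)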

Inspect the contents of the generated dataset with a few samples
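
A minimal sketch of such a check, assuming the data is a list of `(sentence, label)` tuples as built above. The names `sample_data` and `preview_samples` are hypothetical and introduced only for illustration; the toy sentences stand in for the real `train_data`.

```python
# Hypothetical stand-in for `train_data`: a list of (sentence, label) tuples.
sample_data = [
    ("first toy document with a fairly long body of text to truncate\n", 0),
    ("second toy document\n", 3),
    ("third toy document\n", 7),
]

def preview_samples(pairs, n=3, width=40):
    """Return one short, printable line per (sentence, label) pair."""
    lines = []
    for sentence, label in pairs[:n]:
        text = sentence.strip()
        # Truncate long documents so each sample fits on one line.
        if len(text) > width:
            text = text[:width] + "..."
        lines.append(f"label={label} | {text}")
    return lines

for line in preview_samples(sample_data):
    print(line)
```

In the notebook itself, `preview_samples(train_data)` would show the first few training samples in the same way.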

Set up tokenization with the [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer provided by TensorFlow, which converts the text into sequences of integers. The vocabulary is built from the training sentences in `train_data`.

In [None]:
vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens = vocab_size,
    output_sequence_length=max_length
)

In [None]:
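# Build the vocabulary from the training sentences only.
train_text = list(zip(*train_data))[0]
vectorizer.adapt(train_text)

def tokenize(text, label):
    # Add a trailing axis so the layer receives a (batch, 1) tensor of strings.
    text = tf.expand_dims(text, -1)
    return vectorizer(text), label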
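
To make the integer encoding less opaque, here is a pure-Python sketch of the idea behind it. This is not the TensorFlow implementation: `standardize`, `build_vocab`, and `encode` are hypothetical helpers, and the standardization only roughly follows the layer's default (lowercase, strip punctuation, split on whitespace). As in the layer's integer mode, id 0 is reserved for padding and id 1 for out-of-vocabulary words, and both count toward `max_tokens`.

```python
from collections import Counter
import re

def standardize(text):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts, max_tokens):
    """Map the most frequent words to ids, keeping 0 (padding) and 1 (OOV) reserved."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text))
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(kept, start=2)}

def encode(text, vocab, length):
    """Convert text to ids, truncating or zero-padding to a fixed length."""
    ids = [vocab.get(word, 1) for word in standardize(text)][:length]
    return ids + [0] * (length - len(ids))

toy_texts = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocab(toy_texts, max_tokens=4)   # room for only two real words
print(vocab)                                   # {'the': 2, 'cat': 3}
print(encode("The dog sat!", vocab, length=5)) # [2, 1, 1, 0, 0]
```

This matches the pattern in the printed batches: frequent `1` entries are words outside the 1000-token vocabulary, and the trailing zeros are padding up to `max_length`.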

Apply tokenization to the entire dataset

In [None]:
def prepare_dataset(data_raw):
    data_raw = list(zip(*data_raw))
    sentences = tf.data.Dataset.from_tensor_slices(list(data_raw[0]))
    one_hot_labels = tf.one_hot(list(data_raw[1]), 8)
    labels = tf.data.Dataset.from_tensor_slices(one_hot_labels)
    dataset = tf.data.Dataset.zip((sentences, labels))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(tokenize)
    dataset = dataset.cache()
    return dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

train_ds = prepare_dataset(train_data)
val_ds = prepare_dataset(val_data)
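
`tf.one_hot(labels, 8)` treats each label as a class index from 0 to 7. The sketch below is a plain-Python stand-in (the helper `one_hot` is hypothetical, not the TensorFlow function) that mimics this behaviour to show why labels numbered from 1 need shifting: an index equal to the depth yields a row with no active class.

```python
def one_hot(indices, depth):
    """Plain-Python stand-in for one-hot encoding.

    Indices outside 0..depth-1 produce an all-zero row.
    """
    return [[1.0 if index == k else 0.0 for k in range(depth)]
            for index in indices]

rows = one_hot([0, 7, 8], depth=8)
for index, row in zip([0, 7, 8], rows):
    print(index, row, "sum =", sum(row))
```

A row that sums to zero contributes nothing to `categorical_crossentropy`, so a class encoded this way would be ignored silently during training.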

Check the tokenization result on a few samples

In [None]:
for token_batch, labels in train_ds.take(1):
    for i in range(5):
        print("Tokens: ", token_batch.numpy()[i][:100], '...')
        print("Labels: ", labels[i])
        print('----------------')

Tokens:  [  1 158  28   1  21 236  17  32  11  10  11  17  10  69   9  10   9  99
 100  10  73  91  66   1   1 339   2 153   5  92  45  25   1 847  64 863
 226   2  26   1 328 235  43 566 643   3   9  45  99 100   1   2  45 265
   4 328   8   1  68 185 355   2   1 271  16   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0] ...
Labels:  tf.Tensor([0. 1. 0. 0. 0. 0. 0. 0.], shape=(8,), dtype=float32)
----------------
Tokens:  [303   1   1 787 303   1  66   1 429   4 749  19 787  63   1   5   6   1
   4 614   1   5   1   8   1 449  13  27 127   1   1 102 338 170   1   1
   7  14  12   1 246   1 770   6 787   1   7   5  43 865   2 338   1  57
 794  19   2 921   1   8 209  12 852   8 261   1 854 218   1 306   5 184
   8   6   1  35  84   1   4   1   4 395   1   2 737   2 148 854   1  33
   2 158   3  71   2   1   1  66  60   1] ...
Labels:  tf.Tensor([0. 0. 0. 0. 1. 0. 0. 0.], shape=(8,), dtype=float32)

### Function setup

A few helper functions that will be needed later are defined here:

#### The `plot_graphs` function for plotting the accuracy and loss curves

In [None]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_'+metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])
    plt.show()

#### The `set_callbacks` function for configuring the training callbacks

Two callbacks are used here: one stops training when there is no significant progress, and the other stops training once the target accuracy has been reached

In [None]:
def set_callbacks(model):
    callbacks = []
    CB = tf.keras.callbacks

    # Stop when training accuracy has not improved for `stopping_patience` epochs.
    impatient = CB.EarlyStopping(
        monitor='accuracy',
        patience=stopping_patience)
    callbacks.append(impatient)

    # Stop once both training and validation accuracy exceed the threshold.
    def stopper(epoch, logs):
        if logs['accuracy'] > success_threshold and logs['val_accuracy'] > success_threshold:
            model.stop_training = True
    good_res = CB.LambdaCallback(on_epoch_end=stopper)
    callbacks.append(good_res)

    return callbacks
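
The interplay of the two stopping rules can be traced by hand on a list of per-epoch metrics. The sketch below is a simplified pure-Python model of that logic, not the Keras implementation; `first_stop` and the example histories are hypothetical, and improvement is taken to mean a strictly higher training accuracy than the best seen so far.

```python
def first_stop(history, patience, threshold):
    """Return (epoch, reason) for the first epoch at which training would stop.

    `history` is a list of (accuracy, val_accuracy) pairs, one per epoch.
    """
    best = float("-inf")
    wait = 0
    for epoch, (acc, val_acc) in enumerate(history, start=1):
        # Threshold rule: both accuracies are above the target.
        if acc > threshold and val_acc > threshold:
            return epoch, "threshold"
        # Patience rule: count epochs without improvement in training accuracy.
        if acc > best:
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, "patience"
    return None, "completed"

stalled = [(0.50, 0.40), (0.60, 0.50), (0.60, 0.50), (0.59, 0.50)]
print(first_stop(stalled, patience=2, threshold=0.9))  # (4, 'patience')

good = [(0.80, 0.70), (0.95, 0.92)]
print(first_stop(good, patience=2, threshold=0.9))     # (2, 'threshold')
```

With `stopping_patience = 10` and `success_threshold = 0.9`, a run ends either after ten consecutive epochs without a new best training accuracy or as soon as both accuracies pass 0.9, whichever comes first.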

#### The `build_model` function for building the ML model

The model is a Sequential model with an *Embedding* layer followed by two bidirectional *LSTM* layers. It ends with a fully connected layer of 8 neurons, one for each of the 8 output classes of the input text. The model is then compiled with the Adam optimizer.

In [None]:
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
        tf.keras.layers.Dense(8, activation='softmax')
    ])
    model.compile(
        loss = 'categorical_crossentropy',
        optimizer = tf.keras.optimizers.Adam(learning_rate),
        metrics = ['accuracy']
    )
    model.summary()
    return model
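
As a sanity check on the architecture, the parameter counts that `model.summary()` reports can be reproduced by hand. The sketch below uses the standard formulas for these layer types; the helper names are hypothetical, and the sizes are taken from the hyperparameters above (vocabulary of 1000, embedding and LSTM width of 64, 8 classes).

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim + units + 1) * units

def bidirectional_lstm_params(input_dim, units):
    # A forward and a backward LSTM with separate weights.
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [
    embedding_params(1000, 64),          # embedding
    bidirectional_lstm_params(64, 64),   # first BiLSTM, input is the embedding
    bidirectional_lstm_params(128, 64),  # second BiLSTM, input is 2 * 64 wide
    dense_params(128, 8),                # softmax classifier
]
total = sum(counts)
print(counts, total)  # [64000, 66048, 98816, 1032] 229896
```

The second bidirectional layer has more parameters than the first because its input is the 128-wide concatenation of the forward and backward outputs of the first.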

## Training the Model!

Now it is time to bring together everything prepared so far.
Run the code block below to train the model and plot its accuracy and loss curves right away.

In [None]:
model = build_model()
history = model.fit(
    train_ds, 
    epochs = num_epochs, 
    validation_data = val_ds, 
    callbacks = set_callbacks(model)
)
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1000, 64)          64000     
_________________________________________________________________
bidirectional (Bidirectional (None, 1000, 128)         66048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 8)                 1032      
Total params: 229,896
Trainable params: 229,896
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100

NameError: ignored