
# **Lab 1: Text Generator with an RNN**

## **Setup**

### **Import TensorFlow**

In [2]:
import tensorflow as tf
import numpy as np
import os
import time

### **Download the Shakespeare Dataset**

In [3]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt','https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


### **Load Data**

In [4]:
# Read the file in binary mode ('rb'), then decode the bytes as UTF-8
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# The length of the text is the number of characters in it
print(f'Length of text: {len(text)} characters')

Length of text: 1115394 characters


In [5]:
# Print the first 250 characters of the text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [6]:
# Identify the unique characters in the text
vocab = sorted(set(text))

# Print the number of unique characters
print(f'{len(vocab)} unique characters')

65 unique characters


## **Process the Text**

### **Vectorize the Text**

Before training, you need to convert the strings into a numerical representation. The `tf.keras.layers.StringLookup` layer can convert each character into a numeric ID. It only requires that the text be split into tokens first.

In [7]:
# Example texts
example_texts = ['abcdefg', 'xyz']

# Split the texts into Unicode characters
chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')

# Display the resulting characters
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

Now create the `tf.keras.layers.StringLookup` layer:

In [8]:
# Create a StringLookup layer that converts characters into numeric IDs
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab),  # The characters to index
    mask_token=None  # No mask token is used here
)

This layer converts tokens into character IDs:

In [9]:
# Convert the Unicode characters into numeric IDs
ids = ids_from_chars(chars)

# Display the resulting IDs
ids

<tf.RaggedTensor [[40, 41, 42, 43, 44, 45, 46], [63, 64, 65]]>
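
To make the mapping concrete, here is a minimal pure-Python sketch of what a character lookup does. It assumes that index 0 is reserved for an out-of-vocabulary `[UNK]` token and that real characters start at 1, which is consistent with the IDs shown above. The helper names `build_char_to_id` and `encode` and the toy text are illustrative only; they are not part of the notebook or of the TensorFlow API.

```python
def build_char_to_id(text):
    """Map each unique character to an integer ID, reserving 0 for unknowns."""
    vocab = sorted(set(text))
    return {ch: i + 1 for i, ch in enumerate(vocab)}


def encode(s, char_to_id):
    """Convert a string into a list of IDs; unseen characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in s]


# Toy vocabulary built from a short string (not the Shakespeare text)
char_to_id = build_char_to_id("hello world")
hello_ids = encode("hello", char_to_id)
unknown_ids = encode("z", char_to_id)

print(hello_ids)    # [4, 3, 5, 5, 6]
print(unknown_ids)  # [0]
```

The real layer does the same kind of lookup on tensors, so it can process whole batches of ragged character sequences at once.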

Because the goal of this tutorial is to generate text, it is also important to be able to invert this representation and recover readable strings from the IDs. For this you can use `tf.keras.layers.StringLookup(..., invert=True)`.

In [10]:
# Create a StringLookup layer that converts numeric IDs back into Unicode characters
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(),  # Reuse the vocabulary indexed earlier
    invert=True,  # invert=True maps from IDs back to characters
    mask_token=None  # No mask token is used here
)

This layer converts vectors of IDs back into characters and returns them as a `tf.RaggedTensor` of characters. The characters can then be joined back into strings with `tf.strings.reduce_join`:

In [11]:
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

In [12]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
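
The inverse direction can be sketched in plain Python as well. This is an illustration rather than the Keras layer: `build_id_to_char` and `decode` are hypothetical helpers, and slot 0 of the list holds the `[UNK]` placeholder to mirror the layout of the vocabulary returned by `get_vocabulary()`.

```python
def build_id_to_char(text):
    """Build the ID -> character table, with '[UNK]' stored at index 0."""
    return ["[UNK]"] + sorted(set(text))


def decode(ids, id_to_char):
    """Look up each ID and join the characters into a single string."""
    return "".join(id_to_char[i] for i in ids)


# Toy vocabulary built from a short string (not the Shakespeare text)
id_to_char = build_id_to_char("hello world")
decoded = decode([4, 3, 5, 5, 6], id_to_char)
with_unknown = decode([4, 0], id_to_char)

print(decoded)       # hello
print(with_unknown)  # h[UNK]
```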

## **Prediction**

### **Create the Training Set and Targets**

In [13]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([19, 48, 57, ..., 46,  9,  1])>

In [14]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [15]:
for ids in ids_dataset.take(10):
  print(chars_from_ids(ids).numpy().decode('utf-8'))

F
i
r
s
t
 
C
i
t
i


In [16]:
seq_length = 100

The `batch` method lets you easily convert these individual characters into sequences of the desired size.

In [17]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)


It is easier to see what this is doing if you join the tokens back into strings:

In [18]:
for seq in sequences.take(5):
    print(text_from_ids(seq).numpy())

b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
b"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
b'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
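
As a rough illustration of what `batch(seq_length+1, drop_remainder=True)` does to the stream of IDs, the sketch below chunks a plain Python list and discards the incomplete tail. The function name `chunk` is hypothetical. The last lines apply the same arithmetic to the text length reported earlier (1,115,394 characters).

```python
def chunk(items, size):
    """Split items into consecutive chunks of exactly `size`, dropping the remainder."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]


chunks = chunk(list(range(10)), 4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Same arithmetic for the Shakespeare text: chunks of seq_length + 1 characters
text_length = 1115394
seq_length = 100
num_sequences = text_length // (seq_length + 1)
leftover = text_length % (seq_length + 1)
print(num_sequences, leftover)  # 11043 51
```

Each chunk holds `seq_length + 1` characters so that it can later be split into a 100-character input and a 100-character target shifted by one position.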


For training, you need a dataset of (input, label) pairs, where both the input and the label are sequences. At each time step the input is the current character and the label is the next character. The following function takes a sequence as input, then duplicates and shifts it to align the input and label for each time step:

In [19]:
def split_input_target(sequence):
  input_text = sequence[:-1]
  target_text = sequence[1:]
  return input_text, target_text

In [20]:
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

In [21]:
dataset = sequences.map(split_input_target)

In [22]:
for input_example, target_example in dataset.take(1):
  print("Input :", text_from_ids(input_example).numpy())
  print("Target:", text_from_ids(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


### **Create Training Batches**

You used `tf.data` to split the text into manageable sequences. Before feeding this data into the model, you need to shuffle it and pack it into batches.

In [23]:
# Batch size used during training
BATCH_SIZE = 64

# Buffer size used to shuffle the dataset.
# tf.data is designed to work with possibly infinite sequences,
# so it does not try to shuffle the entire sequence in memory.
# Instead, it maintains a buffer in which it shuffles elements.
BUFFER_SIZE = 10000

# Configure the dataset: shuffle, batch, and prefetch
dataset = (
    dataset
    .shuffle(BUFFER_SIZE)  # Shuffle the sequences
    .batch(BATCH_SIZE, drop_remainder=True)  # Group into batches, dropping the final incomplete batch
    .prefetch(tf.data.experimental.AUTOTUNE)  # Prefetch so input preparation overlaps with training
)

# Display the configured dataset
dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
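
The buffer-based shuffling described in the comments above can be approximated in a few lines of plain Python. This is a simplified model of the idea, not the actual `tf.data` implementation; the name `buffered_shuffle` and the `seed` parameter are assumptions made for the illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered element and replace it with the new item
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
in_order = list(buffered_shuffle(range(20), buffer_size=1, seed=0))

print(sorted(shuffled) == list(range(20)))  # True: nothing is lost or duplicated
print(in_order == list(range(20)))          # True: a buffer of 1 cannot reorder anything
```

The sketch shows why the buffer size matters: an element can only move as far as the buffer allows, so a buffer much smaller than the dataset gives only a local shuffle.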

### **Build the Model**

In [24]:
# Number of entries in the StringLookup vocabulary (the unique characters plus [UNK])
vocab_size = len(ids_from_chars.get_vocabulary())

# Embedding dimension
embedding_dim = 256

# Number of RNN (Recurrent Neural Network) units
rnn_units = 1024

In [25]:
# Define the custom model class MyModel
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()

    # Embedding layer that converts numeric IDs into embedding vectors
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    # GRU (Gated Recurrent Unit) layer that returns both sequences and state
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)

    # Dense (fully connected) layer with vocab_size outputs, one logit per character
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs

    # Apply the embedding layer
    x = self.embedding(x, training=training)

    if states is None:
      # Get the initial state from the GRU layer when no state is passed in
      states = self.gru.get_initial_state(x)

    # Run the GRU layer
    x, states = self.gru(x, initial_state=states, training=training)

    # Run the dense layer
    x = self.dense(x, training=training)

    if return_state:
      # Return the output and the state when return_state is True
      return x, states
    else:
      # Return only the output when return_state is False
      return x

In [26]:
model = MyModel(
    vocab_size=vocab_size,  # Number of entries in the vocabulary
    embedding_dim=embedding_dim,  # Embedding dimension
    rnn_units=rnn_units  # Number of units in the GRU layer
)

### **Test the Model**

In [27]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 66) # (batch_size, sequence_length, vocab_size)


In [28]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 4022850 (15.35 MB)
Trainable params: 4022850 (15.35 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
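
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which gives each of the three gates an input kernel, a recurrent kernel, and two bias vectors; the variable names are chosen for this illustration.

```python
vocab_size = 66      # 65 unique characters plus [UNK]
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# GRU (assuming reset_after=True): 3 gates, each with an input kernel,
# a recurrent kernel, and two bias vectors
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: weight matrix plus one bias per output
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 16896 3938304 67650 4022850
```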


In [29]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices= tf.squeeze(sampled_indices, axis=-1).numpy()

In [30]:
sampled_indices

array([42, 31,  6,  7, 50, 45,  0, 41, 23, 44, 40, 33, 50, 59, 50, 26, 44,
       20, 35, 64, 29, 55, 51, 24,  3, 22, 52, 18, 15, 41, 15, 24, 20, 56,
       45, 33, 56, 56, 11,  1, 33, 26, 28, 34,  9,  8, 52, 33, 14, 20, 38,
       20,  8, 58, 39,  2, 36, 52, 57, 32,  0, 41,  8, 41, 65, 34,  0,  1,
       62, 46, 46, 34,  3, 12, 46, 27, 55, 46, 23, 21, 11, 26, 38, 15, 22,
        0, 18, 41,  3, 48, 47, 59, 37, 42, 62, 35, 55, 29, 15, 61])

Decode these IDs to see the text predicted by this untrained model:

In [31]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b"forbid that I should wish them sever'd\nWhom God hath join'd together; ay, and 'twere pity\nTo sunder "

Next Char Predictions:
 b"cR',kf[UNK]bJeaTktkMeGVyPplK!ImEBbBKGqfTqq:\nTMOU.-mTAGYG-sZ WmrS[UNK]b-bzU[UNK]\nwggU!;gNpgJH:MYBI[UNK]Eb!ihtXcwVpPBv"


### **Train Model**

### **Attach an Optimizer and a Loss Function**

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions. Because your model returns logits, you need to set the `from_logits` flag.

In [32]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [33]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 100, 66)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.1890783, shape=(), dtype=float32)


A newly initialized model should not be too sure of itself: the output logits should all have similar magnitudes. To confirm this, you can check that the exponential of the mean loss is approximately equal to the vocabulary size. A much higher loss means the model is confident in its wrong answers and is badly initialized:

In [34]:
tf.exp(example_batch_mean_loss).numpy()

65.96197
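
The reason this check works can be shown with a short calculation. If the model assigns the same probability to every one of the 66 vocabulary entries, the cross-entropy for any target is `-ln(1/66)`, so its exponential recovers the vocabulary size. A minimal sketch, with variable names chosen for the illustration:

```python
import math

vocab_size = 66
uniform_prob = 1.0 / vocab_size

# Cross-entropy of a uniform prediction: -log of the probability given to the true class
expected_loss = -math.log(uniform_prob)
recovered = math.exp(expected_loss)

print(f"{expected_loss:.3f}")  # 4.190
print(f"{recovered:.1f}")      # 66.0
```

The measured mean loss of 4.1890783 is very close to this value, which is what you would expect from an untrained model with near-uniform outputs.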

Configure the training procedure using the `tf.keras.Model.compile` method. Use `tf.keras.optimizers.Adam` with default arguments and the loss function.

In [35]:
model.compile(optimizer='adam', loss=loss)

### **Configure Checkpoints**

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [36]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
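
The `{epoch}` placeholder in the prefix is filled in for each saved checkpoint. The sketch below only mimics that substitution with `str.format` to show the resulting names; it does not call Keras, and it assumes that epoch numbers in file names start at 1.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Assumption for illustration: epochs are numbered from 1 in the file names
names = [checkpoint_prefix.format(epoch=e) for e in range(1, 4)]
base_names = [os.path.basename(n) for n in names]

print(base_names)  # ['ckpt_1', 'ckpt_2', 'ckpt_3']
```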

### **Run the Training**

In [37]:
EPOCHS = 10

In [38]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### **Generate Text**

The following class makes a single-step prediction:

In [39]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
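
The sampling step inside `generate_one_step` can be illustrated without TensorFlow. The sketch below scales a small list of logits by a temperature, masks one index with negative infinity (playing the role of `[UNK]`), converts the result to probabilities, and samples from them. The function names and the toy logits are assumptions made for this example, and the sketch does not handle the case where every index is masked.

```python
import math
import random


def masked_temperature_probs(logits, temperature=1.0, skip_ids=()):
    """Scale logits by temperature, mask skip_ids with -inf, and apply softmax."""
    scaled = [x / temperature for x in logits]
    for i in skip_ids:
        scaled[i] = -math.inf  # exp(-inf) is 0, so this index is never sampled
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.0]  # index 0 stands in for "[UNK]"
probs_t1 = masked_temperature_probs(logits, temperature=1.0, skip_ids=[0])
probs_cold = masked_temperature_probs(logits, temperature=0.5, skip_ids=[0])

rng = random.Random(0)
samples = [sample_id(probs_t1, rng) for _ in range(100)]

print(probs_t1[0])                  # 0.0
print(probs_cold[1] > probs_t1[1])  # True: lower temperature sharpens the distribution
print(0 in samples)                 # False: the masked index is never drawn
```

A temperature below 1.0 makes the output more predictable, while a temperature above 1.0 makes it more varied.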

In [41]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

In [42]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

ROMEO:
Why, that, no Bontx; with it best warn,
King Ricobonds with mine art to make
O, Nappioly, as is nothing, but to stand again,
The gokn nor play the wall: a high now being
show'd, I'll warrant their way; whiles on me
For me hath brought not one that clouds 'proot's
jowerful, that I was here.

PETRUCHIO:
Ye remeetes which we give him not, sir? what:'
Jeposter, that resposs the king's blood.

BISHOP OF ELY:
Romeo.

CAMILLO:
He had she fall and go.

KATHARINA:
We'll have bade done was, make way.

NatreLIO:
Thy folly, Right! Why that I do inhoct and house,
That rich suffer' to put on prince,
And that his own a rabs to grant her,
And swift from many orn Agot:
Preparable from this lamentations,
Which he does sinken enemy may purthes
But they were sent to but a horse!

POLIXENES:
know, yield my son: I'll have him
farther, and set old man frinch the wing.

WARWICK:
Peace, do you more.

POLIXENES:
I power,
But make my hearing of it hatket she I
to little guard, fearful, sho draw that honse

In [43]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)

tf.Tensor(
[b"ROMEO:\nWhat bring it to sat: his son--servire,\nThat standed government and Warwick, and thy father\nDispositions to see them come, which is his\nstill back me word, leisure as he was any oqued\nResent?' 'Trutching my lord, I'll but live away ty?\nFarewell: this morning spira's tows of death;\nAnd, witowords of Brancag's; even all the rack,\nWe nare smilingly be said that have ta'en time\nAnd yet run this faint, as if bearing my son,\nVervip's persupt enjoy, my uncle Clarence, else\nThoubs regard.\n\nFRIAR LAURENCE:\nDo spoken, hence? and his new deeds,\nThough the night of maze himself a haste,\nDeath neighmons of Paris; gentle people\nAction of dreadful scorn; no graced, serve\nAnd in love palters from this cold with your dreams;\nAnd when they shall resolve me joyal breath,\nAnd break our tranions or't, Sunday, ballaged,\nWith this grown better macks to deceive;\nBut what feel more beasts of enter'd,\nWe see, unuving idstence with down:\nFor on forwards of necess'd wi

### **Ekspor Model Generator**

In [44]:
tf.saved_model.save(one_step_model, 'one_step')
one_step_reloaded = tf.saved_model.load('one_step')



In [45]:
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(100):
  next_char, states = one_step_reloaded.generate_one_step(next_char, states=states)
  result.append(next_char)

print(tf.strings.join(result)[0].numpy().decode("utf-8"))

ROMEO:
Ah, tradiols me, on me, in that all, grace!
They shall be put to do:
I dreamt hew shou, and with a 
