## Notebook Summary

This notebook covers data loading and cleaning, along with the models that use keras_nlp tokenization: the LSTM, the transformer without pretraining, and the BERT encoder models.

## Set up

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install -q git+https://github.com/keras-team/keras-nlp.git --upgrade

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
def print_version(library_name):
    try:
        lib = __import__(library_name)
        version = getattr(lib, '__version__', 'Version number not found')
        print(f"{library_name} version: {version}")
    except ImportError:
        print(f"{library_name} not installed.")
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
print_version('transformers')
print_version('sklearn')
print_version('keras')
print_version('tensorflow')

transformers version: 4.42.4
sklearn version: 1.2.2
keras version: 3.4.1
tensorflow version: 2.17.0


In [None]:
import pandas as pd
import keras_nlp
import tensorflow as tf
from tensorflow import keras
import numpy as np
import random
import os

## Data

### Bible Data

In [None]:
import os

In [None]:
bi_dat = pd.DataFrame(columns = ('verse_id', 'TI', 'EN'))

In [None]:
path = "/content/drive/MyDrive/bible_data/"
file = "ti_1.1.txt"

text_file = open(os.path.join(path, file),'r')
text = text_file.read()
text_file.close()

lines = text.split('\n')

for line in lines:
  if line == '':
    continue  # skip blank lines; removing items while iterating can skip the next entry
  elif line.isdigit():
    verse_num = int(line)
    id = "11" + str(verse_num)
  else:
    verse_text = line
    bi_dat.loc[len(bi_dat.index)] = [id, verse_text, "BLANK"]


In [None]:
os.path.isfile(os.path.join(path, "ti_1."+str(verse_num_list[0])+'.txt'))

True

In [None]:
verse_num_list = np.arange(1, 41)
path = "/content/drive/MyDrive/bible_data/"

for i in verse_num_list:
  file_chk = os.path.isfile(os.path.join(path, "ti_1."+str(i)+'.txt'))
  if file_chk:
    text_file = open(os.path.join(path, "ti_1."+str(i)+'.txt'),'r')
    text = text_file.read()
    text_file.close()

    lines = text.split('\n')

    for line in lines:
      if line == '':
        continue  # skip blank lines; removing items while iterating can skip the next entry
      elif line.isdigit():
        verse_num = int(line)
        id = "1" + str(i) + str(verse_num)
      else:
        verse_text = line
        bi_dat.loc[len(bi_dat.index)] = [id, verse_text, "BLANK"]


In [None]:
for i in verse_num_list:
  file_chk = os.path.isfile(os.path.join(path, "en_1."+str(i)+'.txt'))
  if file_chk:
    text_file = open(os.path.join(path, "en_1."+str(i)+'.txt'),'r')
    text = text_file.read()
    text_file.close()

    lines = text.split('\n')

    for line in lines:
      if line == '':
        continue  # skip blank lines; removing items while iterating can skip the next entry
      elif line.isdigit():
        verse_num = int(line)
        id = "1" + str(i) + str(verse_num)
      else:
        verse_idx = bi_dat[bi_dat['verse_id'] == id]['verse_id'].index[0]
        verse_text = line
        bi_dat.loc[verse_idx, 'EN'] = verse_text
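The loading cells above all rely on one parsing rule: a line made only of digits starts a new verse, and the non-empty line after it carries the verse text. A minimal sketch of that rule as a standalone function is given here. `parse_verses` and the sample string are hypothetical names introduced for illustration, and the sketch joins several text lines into one verse, whereas the notebook adds one row per text line.

```python
def parse_verses(text):
    """Map verse number -> verse text for one chapter file.

    Assumes the layout used above: a line holding only digits,
    followed by the line(s) of verse text. Blank lines are skipped.
    """
    verses = {}
    verse_num = None
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blanks without mutating the list being iterated
        if line.isdigit():
            verse_num = int(line)
        elif verse_num is not None:
            # append this text line to the current verse (assumption: one entry per verse)
            verses[verse_num] = (verses.get(verse_num, "") + " " + line).strip()
    return verses


sample = "1\nIn the beginning\n\n2\nNow the earth\n"
parsed = parse_verses(sample)
print(parsed)
```

Keying on the verse number would also let the EN text be matched to the TI text with a dictionary lookup instead of a DataFrame filter per line.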


In [None]:
bi_dat.to_csv('/content/drive/MyDrive/bible_data/bi_dat.csv')

In [None]:
bi_dat = pd.read_csv('/content/drive/MyDrive/266_data/bible_data/bi_dat.csv', usecols=['verse_id', 'TI', 'EN', 'TI_tokenized'])
bi_dat

Unnamed: 0,verse_id,TI,EN,TI_tokenized
0,111,ኣምላኽ ብመጀመርታ ሰማይን ምድርን ፈጠረ።,In the beginning God created the heavens and t...,"['ኣምላኽ', 'ብመጀመርታ', 'ሰማይን', 'ምድርን', 'ፈጠረ']"
1,112,ምድሪ ድማ በረኻን ጥራያን ነበረት፡ ጸልማት ከኣ ኣብ ልዕሊ መዓሙቕ ነበረ...,Now the earth was formless and empty. Darkness...,"['ምድሪ', 'ድማ', 'በረኻን', 'ጥራያን', 'ነበረት', 'ጸልማት', ..."
2,113,ኣምላኽ ከኣ፥ ብርሃን ይኹን፡ በለ። ብርሃን ድማ ዀነ።,"God said, ""Let there be light,"" and there was ...","['ኣምላኽ', 'ከኣ', 'ብርሃን', 'ይኹን', 'በለ', 'ብርሃን', 'ድ..."
3,114,ኣምላኽ ድማ እቲ ብርሃን ጽቡቕ ከም ዝዀነ ረኣየ። ኣምላኽ ከኣ ነቲ ብርሃ...,"God saw the light, and saw that it was good. G...","['ኣምላኽ', 'ድማ', 'እቲ', 'ብርሃን', 'ጽቡቕ', 'ከም', 'ዝዀነ..."
4,115,ኣምላኽ ነቲ ብርሃን መዓልቲ ኣውጽኣሉ። ነቲ ጸልማት ከኣ ለይቲ ኣውጽኣሉ።...,"God called the light Day, and the darkness he ...","['ኣምላኽ', 'ነቲ', 'ብርሃን', 'መዓልቲ', 'ኣውጽኣሉ', 'ነቲ', ..."
...,...,...,...,...
921,14019,ድሕሪ ሰለስተ መዓልቲ ፈርኦን ርእስኻ ካባኻ ኪወስድ፡ ኣብ ዕጨይቲ ድማ ኪ...,"Within three more days, Pharaoh will lift up y...","['ድሕሪ', 'ሰለስተ', 'መዓልቲ', 'ፈርኦን', 'ርእስኻ', 'ካባኻ',..."
922,14020,ኰነ ድማ፡ ኣብ ሳልሰይቲ መዓልቲ፡ ንፈርኦን መዓልቲ ልደቱ ነበረ እሞ፡ ን...,"It happened the third day, which was Pharaoh''...","['ኰነ', 'ድማ', 'ኣብ', 'ሳልሰይቲ', 'መዓልቲ', 'ንፈርኦን', '..."
923,14021,ነቲ ሓለቓ ኣሰለፍቲ ሜስ ናብ ኣሰላፍነቱ መለሶ፡ ንሱ ድማ እቲ ጽዋእ ኣብ...,He restored the chief cupbearer to his positio...,"['ነቲ', 'ሓለቓ', 'ኣሰለፍቲ', 'ሜስ', 'ናብ', 'ኣሰላፍነቱ', '..."
924,14022,ነቲ ሓለቓ ሰንከትቲ እንጌራ ግና፡ ከምቲ ዮሴፍ ዝፈትሓሎም፡ ሰቐሎ።,"but he hanged the chief baker, as Joseph had i...","['ነቲ', 'ሓለቓ', 'ሰንከትቲ', 'እንጌራ', 'ግና', 'ከምቲ', 'ዮ..."


In [None]:
# get rid of punctuation TI input
ti_punc = "።፡፥፧፤፦"
tokenized_sen_ti = []

tokenized_ti_col = pd.Series()

for i in bi_dat.index:
  word_list = bi_dat.loc[i]['TI'].split()
  for word in word_list:
    no_punc_word = word.strip(ti_punc)
    tokenized_sen_ti.append(no_punc_word)
  # bi_dat.loc[i, 'TI_tokenized'] = tokenized_sen_ti
  # tokenized_sen_ti = []
  tokenized_ti_col.loc[i] = tokenized_sen_ti
  tokenized_sen_ti = []

In [None]:
bi_dat['TI_tokenized'] = tokenized_ti_col

In [None]:
# get rid of punctuation/lower EN input
en_punc = ".,;?!\\:/"  # backslash escaped explicitly; same characters as before
tokenized_sen_en = []

tokenized_en_col = pd.Series()

for i in bi_dat.index:
  word_list = bi_dat.loc[i]['EN'].split()
  for word in word_list:
    lower_word = word.lower()
    no_punc_word = lower_word.strip(en_punc)
    tokenized_sen_en.append(no_punc_word)
  # bi_dat.loc[i, 'TI_tokenized'] = tokenized_sen_ti
  # tokenized_sen_ti = []
  tokenized_en_col.loc[i] = tokenized_sen_en
  tokenized_sen_en = []

In [None]:
bi_dat['EN_tokenized'] = tokenized_en_col
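The two cleaning loops above follow the same pattern: split on whitespace, optionally lowercase, then strip punctuation from both ends of each word. A possible sketch of that pattern as one helper is shown here. `strip_tokenize` is a hypothetical name and the English sentence is a made-up example; the Tigrinya sentence is the third verse from the table above.

```python
def strip_tokenize(sentence, punctuation, lower=False):
    """Split on whitespace and strip the given characters from each word's ends."""
    words = sentence.lower().split() if lower else sentence.split()
    return [word.strip(punctuation) for word in words]


ti_punc = "።፡፥፧፤፦"
en_punc = ".,;?!\\:/"

ti_tokens = strip_tokenize("ኣምላኽ ከኣ፥ ብርሃን ይኹን፡ በለ። ብርሃን ድማ ዀነ።", ti_punc)
en_tokens = strip_tokenize("Hello, world!", en_punc, lower=True)
print(ti_tokens)
print(en_tokens)
```

With a helper like this, each column could be filled with something like `bi_dat['TI'].apply(strip_tokenize, punctuation=ti_punc)`. Note that `str.strip` only touches the ends of a word, so punctuation inside a word, or quote characters that are not in the punctuation string, stay in place.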

#### Train/Test Split

In [None]:
# create a shuffled index
shuffled_idx = np.random.permutation(len(bi_dat))

# number of training samples
num_tr = int(len(bi_dat) * 0.95)

# training and testing index
shuffled_tr = shuffled_idx[:num_tr]
shuffled_ts = shuffled_idx[num_tr:]

tr_samples_bi = bi_dat.loc[shuffled_tr]
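The split above calls `np.random.permutation` without a seed, so the train and test rows change on every run. A minimal sketch of a reproducible variant, assuming a seeded NumPy `Generator` is acceptable, follows. `split_indices` and its `seed` argument are hypothetical names that do not appear in the notebook; 926 is the row count shown in the tensor shapes for the Bible data.

```python
import numpy as np


def split_indices(n_rows, train_frac=0.95, seed=0):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_rows)."""
    rng = np.random.default_rng(seed)      # seeded generator: same shuffle every run
    shuffled = rng.permutation(n_rows)
    num_tr = int(n_rows * train_frac)
    return shuffled[:num_tr], shuffled[num_tr:]


tr_idx, ts_idx = split_indices(926)
print(len(tr_idx), len(ts_idx))
```

Fixing the seed keeps the vocabulary, which is built from the training rows only, stable between sessions.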

In [None]:
MAX_SEQUENCE_LENGTH = 15

In [None]:
# ".,;?!\:/"
ti_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "።", "፡", "፥", "፧", "፤", "፦"]
en_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", ".", ",", ";", "?", "!", "\\ ", ":", "/"]

for i in tr_samples_bi['TI_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
    # print(word)
    word = word.strip("[],' '")
    if word not in ti_vocab:
      ti_vocab.append(word)

for i in tr_samples_bi['EN_tokenized']:
  word_list = i
  for word in word_list:
    word = word.lower()
    if word not in en_vocab:
      en_vocab.append(word)

In [None]:
print(len(ti_vocab))
print(len(en_vocab))

4037
2115


In [None]:
ti_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=ti_vocab, lowercase=True
)
en_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=en_vocab, lowercase=True
)

In [None]:
# MAX_SEQUENCE_LENGTH = 15

tok_in = en_tokenizer(bi_dat['EN'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=en_tokenizer.token_to_id("[PAD]"),
    )
encoder_input = input_format(tok_in)
encoder_input

<tf.Tensor: shape=(926, 15), dtype=int32, numpy=
array([[  36,   21,  950, ...,    0,    0,    0],
       [ 237,   21,  460, ...,  458,   23,   21],
       [ 100,   60,    5, ...,   28, 1222,    4],
       ...,
       [  13, 1824,   21, ...,  142,   21, 1781],
       [ 150,   13,  964, ...,  156,    4,    0],
       [ 283,   21,  181, ...,  956,   19,    4]], dtype=int32)>

In [None]:
tok_out = ti_tokenizer(bi_dat['TI'])  # make as string

output_format = keras_nlp.layers.StartEndPacker(
      sequence_length=MAX_SEQUENCE_LENGTH + 1,
      start_value=ti_tokenizer.token_to_id("[START]"),
      end_value=ti_tokenizer.token_to_id("[END]"),
      pad_value=ti_tokenizer.token_to_id("[PAD]"),
  )
tr_output = output_format(tok_out)
tr_output

<tf.Tensor: shape=(926, 16), dtype=int32, numpy=
array([[   2,   95, 1372, ...,    0,    0,    0],
       [   2,  146,   26, ...,    4, 2937,    3],
       [   2,   95,   41, ...,    3,    0,    0],
       ...,
       [   2,  233,  417, ...,   28,  358,    3],
       [   2,  233,  417, ...,    3,    0,    0],
       [   2,   61,  417, ...,    0,    0,    0]], dtype=int32)>

In [None]:
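# convert to NumPy; decoder input and target are the packed output shifted by one position
encoder_in_arr = encoder_input.numpy()
decoder_in_arr = tr_output[:, :-1].numpy()
decoder_out_arr = tr_output[:, 1:].numpy()

# training examples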
encoder_tr = encoder_in_arr[shuffled_tr]
decoder_in_tr = decoder_in_arr[shuffled_tr]
decoder_out_tr = decoder_out_arr[shuffled_tr]

# testing examples
encoder_ts = encoder_in_arr[shuffled_ts]
decoder_in_ts = decoder_in_arr[shuffled_ts]
decoder_out_ts = decoder_out_arr[shuffled_ts]

### add UN corpus to bible

In [None]:
# load in from save
combined_df = pd.read_csv('/content/drive/MyDrive/266_data/combined_df.csv', usecols=['EN', 'TI', 'EN_tokenized', 'TI_tokenized'])

In [None]:
# bi_dat = bi_dat.drop(columns=['verse_id'])
combined_df = pd.concat([created_df2, bi_dat])

NameError: name 'created_df2' is not defined

In [None]:
combined_df.reset_index(drop=True, inplace=True)

In [None]:
combined_df

Unnamed: 0,EN,TI,TI_tokenized,EN_tokenized
0,Can you find what you want to say here,እንታይ ክትብል ከም ዝደለኻ ካብዚ መጽሓፍ ክትረክቦ ምከኣልካዶ,"[እንታይ, ክትብል, ከም, ዝደለኻ, ካብዚ, መጽሓፍ, ክትረክቦ, ምከኣልካዶ]","[can, you, find, what, you, want, to, say, here]"
1,It has several languages,ብዙሓት ቋንቋታት ኣለዎ,"[ብዙሓት, ቋንቋታት, ኣለዎ]","[it, has, several, languages]"
2,We can try to communicate this way,በዚ ኣገባብ ጌርና ክንረዳዳእ ክንፍትን ኢና,"[በዚ, ኣገባብ, ጌርና, ክንረዳዳእ, ክንፍትን, ኢና]","[we, can, try, to, communicate, this, way]"
3,What is your name,መን'ዩ ስምካ,"[መን'ዩ, ስምካ]","[what, is, your, name]"
4,I’m hungry,ጥሜት ኣለኒ,"[ጥሜት, ኣለኒ]","[i’m, hungry]"
...,...,...,...,...
1174,"Within three more days, Pharaoh will lift up y...",ድሕሪ ሰለስተ መዓልቲ ፈርኦን ርእስኻ ካባኻ ኪወስድ፡ ኣብ ዕጨይቲ ድማ ኪ...,"[ድሕሪ, ሰለስተ, መዓልቲ, ፈርኦን, ርእስኻ, ካባኻ, ኪወስድ, ኣብ, ዕ...","[within, three, more, days, pharaoh, will, lif..."
1175,"It happened the third day, which was Pharaoh''...",ኰነ ድማ፡ ኣብ ሳልሰይቲ መዓልቲ፡ ንፈርኦን መዓልቲ ልደቱ ነበረ እሞ፡ ን...,"[ኰነ, ድማ, ኣብ, ሳልሰይቲ, መዓልቲ, ንፈርኦን, መዓልቲ, ልደቱ, ነበ...","[it, happened, the, third, day, which, was, ph..."
1176,He restored the chief cupbearer to his positio...,ነቲ ሓለቓ ኣሰለፍቲ ሜስ ናብ ኣሰላፍነቱ መለሶ፡ ንሱ ድማ እቲ ጽዋእ ኣብ...,"[ነቲ, ሓለቓ, ኣሰለፍቲ, ሜስ, ናብ, ኣሰላፍነቱ, መለሶ, ንሱ, ድማ, ...","[he, restored, the, chief, cupbearer, to, his,..."
1177,"but he hanged the chief baker, as Joseph had i...",ነቲ ሓለቓ ሰንከትቲ እንጌራ ግና፡ ከምቲ ዮሴፍ ዝፈትሓሎም፡ ሰቐሎ።,"[ነቲ, ሓለቓ, ሰንከትቲ, እንጌራ, ግና, ከምቲ, ዮሴፍ, ዝፈትሓሎም, ሰቐሎ]","[but, he, hanged, the, chief, baker, as, josep..."


In [None]:
# create a shuffled index
shuffled_idx = np.random.permutation(len(combined_df))

# number of training samples
num_tr = int(len(combined_df) * 0.95)

# training and testing index
shuffled_tr = shuffled_idx[:num_tr]
shuffled_ts = shuffled_idx[num_tr:]

tr_samples_co = combined_df.loc[shuffled_tr]

In [None]:
MAX_SEQUENCE_LENGTH = 15

In [None]:
# ".,;?!\:/"
ti_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "።", "፡", "፥", "፧", "፤", "፦"]
en_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", ".", ",", ";", "?", "!", "\\ ", ":", "/"]

for i in tr_samples_co['TI_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
    # print(word)
    word = word.strip("[],' '()")
    if word not in ti_vocab:
      ti_vocab.append(word)

for i in tr_samples_co['EN_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
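    # NOTE: unlike the TI loop above, brackets and quotes are not stripped here, so the
    # vocabulary entries keep them and most EN tokens map to [UNK] (id 1) in the output below
    word = word.lower()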
    if word not in en_vocab:
      en_vocab.append(word)

In [None]:
print(len(ti_vocab))
print(len(en_vocab))

4211
2751


In [None]:
ti_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=ti_vocab, lowercase=True
)
en_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=en_vocab, lowercase=True
)

In [None]:
# MAX_SEQUENCE_LENGTH = 15

tok_in = en_tokenizer(combined_df['EN'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=en_tokenizer.token_to_id("[PAD]"),
    )
encoder_input = input_format(tok_in)
encoder_input

<tf.Tensor: shape=(1179, 15), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 4, 0],
       [1, 1, 1, ..., 1, 1, 4]], dtype=int32)>

In [None]:
tok_out = ti_tokenizer(combined_df['TI'])  # make as string

output_format = keras_nlp.layers.StartEndPacker(
      sequence_length=MAX_SEQUENCE_LENGTH + 1,
      start_value=ti_tokenizer.token_to_id("[START]"),
      end_value=ti_tokenizer.token_to_id("[END]"),
      pad_value=ti_tokenizer.token_to_id("[PAD]"),
  )
tr_output = output_format(tok_out)
tr_output

<tf.Tensor: shape=(1179, 16), dtype=int32, numpy=
array([[   2,  961, 2354, ...,    0,    0,    0],
       [   2, 2555, 3438, ...,    0,    0,    0],
       [   2,  217, 2527, ...,    0,    0,    0],
       ...,
       [   2,   80,  237, ...,   12,  171,    3],
       [   2,   80,  237, ...,    3,    0,    0],
       [   2,   17,  237, ...,    0,    0,    0]], dtype=int32)>

In [None]:
encoder_in_arr = encoder_input.numpy()
decoder_in_arr = tr_output[:, :-1].numpy()
decoder_out_arr = tr_output[:, 1:].numpy()
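The two slices above set up teacher forcing: the decoder input drops the last position and the target drops the first, so at each step the model sees the tokens so far and is asked for the next one. A toy illustration follows, assuming the special-token IDs implied by the vocabulary order (`[PAD]`=0, `[START]`=2, `[END]`=3); the other IDs are arbitrary.

```python
import numpy as np

# one packed target row: [START] w1 w2 [END] [PAD] [PAD]
packed = np.array([[2, 95, 41, 3, 0, 0]])

decoder_in = packed[:, :-1]   # what the decoder is fed
decoder_out = packed[:, 1:]   # what it should predict at each position

for seen, target in zip(decoder_in[0], decoder_out[0]):
    print(seen, "->", target)
```

This is also why the output packer uses `MAX_SEQUENCE_LENGTH + 1`: after the shift, both arrays have exactly `MAX_SEQUENCE_LENGTH` columns.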

In [None]:
# training examples
encoder_tr = encoder_in_arr[shuffled_tr]
decoder_in_tr = decoder_in_arr[shuffled_tr]
decoder_out_tr = decoder_out_arr[shuffled_tr]

# testing examples
encoder_ts = encoder_in_arr[shuffled_ts]
decoder_in_ts = decoder_in_arr[shuffled_ts]
decoder_out_ts = decoder_out_arr[shuffled_ts]

In [None]:
combined_df.to_csv('/content/drive/MyDrive/266_data/combined_df.csv')

### 5k Bible

In [None]:
bi_dat_2 = pd.DataFrame(columns = ['verse_id', 'TI', 'EN'])

In [None]:
chapters_num = np.arange(1, 6)
verse_num_list = np.arange(1, 41)
path = "/content/drive/MyDrive/266_data/bible_pull_down_2/"

for c in chapters_num:
  for i in verse_num_list:
    file_chk = os.path.isfile(os.path.join(path, "ti_" + str(c) + "." +str(i)+'.txt'))
    if file_chk:
      text_file = open(os.path.join(path, "ti_" + str(c) + "." +str(i)+'.txt'),'r')
      text = text_file.read()
      text_file.close()

      lines = text.split('\n')

      for line in lines:
        if line == '':
          continue  # skip blank lines; removing items while iterating can skip the next entry
        elif line.isdigit():
          verse_num = int(line)
          id = str(c) +'.' + str(i)+ '.' + str(verse_num)
        else:
          verse_text = line
          bi_dat_2.loc[len(bi_dat_2.index)] = [id, verse_text, "BLANK"]

In [None]:
for c in chapters_num:
  for i in verse_num_list:
    file_chk = os.path.isfile(os.path.join(path, "en_" + str(c) + '.' +str(i)+'.txt'))
    if file_chk:
      text_file = open(os.path.join(path, "en_" + str(c) + '.' + str(i)+'.txt'),'r')
      text = text_file.read()
      text_file.close()

      lines = text.split('\n')

      for line in lines:
        if line == '':
          continue  # skip blank lines; removing items while iterating can skip the next entry
        elif line.isdigit():
          verse_num = int(line)
          id = str(c) +'.' + str(i)+ '.' + str(verse_num)
        else:
          verse_idx = bi_dat_2[bi_dat_2['verse_id'] == id]['verse_id'].index[0]
          verse_text = line
          bi_dat_2.loc[verse_idx, 'EN'] = verse_text

In [None]:
# get rid of punctuation TI input
ti_punc = "።፡፥፧፤፦"
tokenized_sen_ti = []

tokenized_ti_col = pd.Series()

for i in bi_dat_2.index:
  word_list = bi_dat_2.loc[i]['TI'].split()
  for word in word_list:
    no_punc_word = word.strip(ti_punc)
    tokenized_sen_ti.append(no_punc_word)
  # bi_dat.loc[i, 'TI_tokenized'] = tokenized_sen_ti
  # tokenized_sen_ti = []
  tokenized_ti_col.loc[i] = tokenized_sen_ti
  tokenized_sen_ti = []

In [None]:
# get rid of punctuation/lower EN input
en_punc = ".,;?!\\:/()[]"  # backslash escaped explicitly; same characters as before
tokenized_sen_en = []

tokenized_en_col = pd.Series()

for i in bi_dat_2.index:
  word_list = bi_dat_2.loc[i]['EN'].split()
  for word in word_list:
    lower_word = word.lower()
    no_punc_word = lower_word.strip(en_punc)
    tokenized_sen_en.append(no_punc_word)
  # bi_dat.loc[i, 'TI_tokenized'] = tokenized_sen_ti
  # tokenized_sen_ti = []
  tokenized_en_col.loc[i] = tokenized_sen_en
  tokenized_sen_en = []

In [None]:
bi_dat_2['TI_tokenized'] = tokenized_ti_col
bi_dat_2['EN_tokenized'] = tokenized_en_col

In [None]:
bi_dat_2.to_csv('/content/drive/MyDrive/266_data/bible_data/bi_dat_2.csv')

In [None]:
bi_dat = pd.read_csv('/content/drive/MyDrive/266_data/bible_data/bi_dat.csv')
bi_dat[bi_dat['EN'] == "BLANK"]

Unnamed: 0.1,Unnamed: 0,verse_id,TI,EN,TI_tokenized
16,16,1116,17-18,BLANK,['17-18']
17,17,1116,ኣብ ልዕሊ ምድሪ ምእንቲ ኼብርሁ፡ ኣብ መዓልትን ኣብ ለይትን ድማ ኪስልጥ...,BLANK,"['ኣብ', 'ልዕሊ', 'ምድሪ', 'ምእንቲ', 'ኼብርሁ', 'ኣብ', 'መዓ..."
158,158,1111,ምድሪ ብዘላ ብሓደ ቛንቋን ብሓደ ንግግርን ነበረት።,BLANK,"['ምድሪ', 'ብዘላ', 'ብሓደ', 'ቛንቋን', 'ብሓደ', 'ንግግርን', ..."
159,159,1112,ኰነ ድማ፡ ንምብራቕ ኣቢሎም ምስ ተጓዕዙ፡ ኣብ ምድሪ ሲነኣር ጐልጐል ረኸ...,BLANK,"['ኰነ', 'ድማ', 'ንምብራቕ', 'ኣቢሎም', 'ምስ', 'ተጓዕዙ', 'ኣ..."
160,160,1113,ንሓድሕዶም ድማ፤ ክላ ግዳ፡ ጡብ ንስራሕ እሞ ብሓዊ ንድፈኖ፡ ተባሃሀሉ፡ ...,BLANK,"['ንሓድሕዶም', 'ድማ', 'ክላ', 'ግዳ', 'ጡብ', 'ንስራሕ', 'እሞ..."
161,161,1114,ሽዑ ድማ፡ ክላ ግዳ፡ ኣብ ልዕሊ ገጽ ኵላ ምድሪ ምእንቲ ፋሕ ከይንብልሲ፡...,BLANK,"['ሽዑ', 'ድማ', 'ክላ', 'ግዳ', 'ኣብ', 'ልዕሊ', 'ገጽ', 'ኵ..."
162,162,1115,እግዚኣብሄር ከኣ ነቲ ደቂ ሰብ ዝሰርሕዎ ዝነበሩ ኸተማን ግምብን ኪርኢ ወረደ።,BLANK,"['እግዚኣብሄር', 'ከኣ', 'ነቲ', 'ደቂ', 'ሰብ', 'ዝሰርሕዎ', '..."
163,163,1116,እግዚኣብሄር ድማ፡ እንሆ፡ ኵላቶም ሓደ ዝዘረባኦም ሓደ ህዝቢ እዮም። መጀ...,BLANK,"['እግዚኣብሄር', 'ድማ', 'እንሆ', 'ኵላቶም', 'ሓደ', 'ዝዘረባኦም..."
166,166,1119,እግዚኣብሄር ኣብኣ ቛንቋ ዅላ ምድሪ ፋሕፋሕ ስለ ዘበለ፡ ስም እታ ኸተማ ...,BLANK,"['እግዚኣብሄር', 'ኣብኣ', 'ቛንቋ', 'ዅላ', 'ምድሪ', 'ፋሕፋሕ',..."
190,190,1121,እግዚኣብሄር ድማ ንኣብራም በሎ፤ ካብ ምድርኻን ካብ ኣዝማድካን ካብ እንዳ...,BLANK,"['እግዚኣብሄር', 'ድማ', 'ንኣብራም', 'በሎ', 'ካብ', 'ምድርኻን'..."


In [None]:
bi_dat_2 = pd.read_csv('/content/drive/MyDrive/266_data/bi_dat_2.csv', usecols = ['TI', 'EN', 'TI_tokenized', 'EN_tokenized'])
combined_df = pd.read_csv('/content/drive/MyDrive/266_data/combined_df.csv', usecols=['EN', 'TI', 'EN_tokenized', 'TI_tokenized'])

In [None]:
dat_5k = pd.concat([combined_df, bi_dat_2])

In [None]:
dat_5k.reset_index(drop=True, inplace=True)

In [None]:
dat_5k = dat_5k[dat_5k['EN'] != "BLANK"]

In [None]:
dat_5k.reset_index(drop=True, inplace=True)
dat_5k.head()

Unnamed: 0,EN,TI,TI_tokenized,EN_tokenized
0,Can you find what you want to say here,እንታይ ክትብል ከም ዝደለኻ ካብዚ መጽሓፍ ክትረክቦ ምከኣልካዶ,"['እንታይ', 'ክትብል', 'ከም', 'ዝደለኻ', 'ካብዚ', 'መጽሓፍ', ...","['can', 'you', 'find', 'what', 'you', 'want', ..."
1,It has several languages,ብዙሓት ቋንቋታት ኣለዎ,"['ብዙሓት', 'ቋንቋታት', 'ኣለዎ']","['it', 'has', 'several', 'languages']"
2,We can try to communicate this way,በዚ ኣገባብ ጌርና ክንረዳዳእ ክንፍትን ኢና,"['በዚ', 'ኣገባብ', 'ጌርና', 'ክንረዳዳእ', 'ክንፍትን', 'ኢና']","['we', 'can', 'try', 'to', 'communicate', 'thi..."
3,What is your name,መን'ዩ ስምካ,"[""መን'ዩ"", 'ስምካ']","['what', 'is', 'your', 'name']"
4,I’m hungry,ጥሜት ኣለኒ,"['ጥሜት', 'ኣለኒ']","['i’m', 'hungry']"


In [None]:
dat_5k.to_csv('/content/drive/MyDrive/266_data/full_5k.csv')

In [None]:
dat_5k = pd.read_csv('/content/drive/MyDrive/266_data/full_5k.csv', usecols=['EN', 'TI', 'EN_tokenized', 'TI_tokenized'])

In [None]:
dat_1k = dat_5k[:1000]

##### tokenizing whole dataset for back translation

In [None]:
MAX_SEQUENCE_LENGTH = 22

In [None]:
ti_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "።", "፡", "፥", "፧", "፤", "፦", "'"]
en_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", ".", ",", ";", "?", "!", ":", "’", "'"]

for i in dat_5k['TI_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
    word = word.strip("[],' '()\"")
    if word.find("'") != -1:
      word_chk = word.split("'")
    else:
      word_chk = [word]
    for wd in word_chk:
      if wd not in ti_vocab:
        ti_vocab.append(wd)

for i in dat_5k['EN_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
    word = word.strip("[],' '.()\"")
    word = word.lower()
    if word.find("’") != -1:
      word_chk = word.split("’")
    elif word.find("''") != -1:
      word_chk = word.split("''")
    else:
      word_chk = [word]
    for wd in word_chk:
      if wd not in en_vocab:
        en_vocab.append(wd)

KeyboardInterrupt: 
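The `KeyboardInterrupt` above may point to this cell being slow: `wd not in ti_vocab` scans the whole list on every check, which gets expensive as the vocabulary grows into the thousands. One possible alternative, sketched under the assumption that the tokenized columns hold stringified Python lists (as they do after a CSV round trip), keeps a set next to the list and parses each cell with `ast.literal_eval`. `build_vocab` is a hypothetical helper, and it does not reproduce the apostrophe splitting done in the cell above.

```python
import ast


def build_vocab(tokenized_column, reserved):
    """Collect unique lowercase words in first-seen order after the reserved tokens."""
    vocab = list(reserved)
    seen = set(vocab)                      # set membership is O(1) on average
    for cell in tokenized_column:
        # cells read back from CSV are strings such as "['a', 'b']"
        words = ast.literal_eval(cell) if isinstance(cell, str) else cell
        for word in words:
            word = word.lower()
            if word and word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


reserved = ["[PAD]", "[UNK]", "[START]", "[END]"]
column = [
    "['it', 'has', 'several', 'languages']",
    "['what', 'is', 'your', 'name']",
    ["it", "is"],                          # already a list: used as-is
]
vocab = build_vocab(column, reserved)
print(vocab)
```

Parsing the cell as a list also avoids the chain of `strip` calls needed to clean up brackets and quotes after `split(',')`.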

In [None]:
ti_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=ti_vocab, lowercase=True
)
en_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=en_vocab, lowercase=True
)

In [None]:
print(len(ti_vocab))
print(len(en_vocab))

14311
4542


In [None]:
tok_in = en_tokenizer(dat_5k['EN'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=en_tokenizer.token_to_id("[PAD]"),
    )
encoder_input = input_format(tok_in)
encoder_input

<tf.Tensor: shape=(5035, 22), dtype=int32, numpy=
array([[  12,   13,   14, ...,    0,    0,    0],
       [  20,   21,   22, ...,    0,    0,    0],
       [  24,   12,   25, ...,    0,    0,    0],
       ...,
       [1985,  183,  211, ..., 4540, 1521,  713],
       [  47,  161,  189, ..., 3577,  175,   47],
       [2454,   47,  163, ...,  187,  303,    9]], dtype=int32)>

In [None]:
tok_out = ti_tokenizer(dat_5k['TI'])  # make as string

output_format = keras_nlp.layers.StartEndPacker(
      sequence_length=MAX_SEQUENCE_LENGTH + 1,
      start_value=ti_tokenizer.token_to_id("[START]"),
      end_value=ti_tokenizer.token_to_id("[END]"),
      pad_value=ti_tokenizer.token_to_id("[PAD]"),
  )
tr_output = output_format(tok_out)
tr_output

<tf.Tensor: shape=(5035, 23), dtype=int32, numpy=
array([[   2,   11,   12, ...,    0,    0,    0],
       [   2,   19,   20, ...,    0,    0,    0],
       [   2,   22,   23, ...,    0,    0,    0],
       ...,
       [   2, 4465,  910, ...,    0,    0,    0],
       [   2,  808, 3420, ...,    3,    0,    0],
       [   2, 5840,  732, ...,  412, 4487,    3]], dtype=int32)>

In [None]:
print(en_tokenizer.detokenize(encoder_input[1205]).numpy().decode('utf8'))
ti_tokenizer.detokenize(tr_output[3]).numpy().decode('utf8')

pharaoh ' ' s daughter came down to bathe at the river . her maidens walked along by the riverside . she


"[START] መን ' ዩ ስምካ [END] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"

In [None]:
encoder_in_arr = encoder_input.numpy()
decoder_in_arr = tr_output[:, :-1].numpy()
decoder_out_arr = tr_output[:, 1:].numpy()

In [None]:
### tokenizing in other direction
tok_in = ti_tokenizer(dat_5k['TI'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=ti_tokenizer.token_to_id("[PAD]"),
    )
ti_encoder_input = input_format(tok_in)
ti_encoder_input

<tf.Tensor: shape=(5035, 22), dtype=int32, numpy=
array([[  11,   12,   13, ...,    0,    0,    0],
       [  19,   20,   21, ...,    0,    0,    0],
       [  22,   23,   24, ...,    0,    0,    0],
       ...,
       [4465,  910, 1190, ...,    0,    0,    0],
       [ 808, 3420,  220, ...,    0,    0,    0],
       [5840,  732, 7146, ...,  412, 4487, 2737]], dtype=int32)>

In [None]:
tok_out = en_tokenizer(dat_5k['EN'])  # make as string

output_format = keras_nlp.layers.StartEndPacker(
      sequence_length=MAX_SEQUENCE_LENGTH + 1,
      start_value=en_tokenizer.token_to_id("[START]"),
      end_value=en_tokenizer.token_to_id("[END]"),
      pad_value=en_tokenizer.token_to_id("[PAD]"),
  )
en_tr_output = output_format(tok_out)
en_tr_output

<tf.Tensor: shape=(5035, 23), dtype=int32, numpy=
array([[   2,   12,   13, ...,    0,    0,    0],
       [   2,   20,   21, ...,    0,    0,    0],
       [   2,   24,   12, ...,    0,    0,    0],
       ...,
       [   2, 1985,  183, ..., 4540, 1521,    3],
       [   2,   47,  161, ..., 3577,  175,    3],
       [   2, 2454,   47, ...,  187,  303,    3]], dtype=int32)>

In [None]:
ti_encoder_in_arr = ti_encoder_input.numpy()
en_decoder_in_arr = en_tr_output[:, :-1].numpy()
en_decoder_out_arr = en_tr_output[:, 1:].numpy()

### processing (tokenizing and splitting) general

In [None]:
dat_1k = dat_1k.reset_index(drop=True)

In [None]:
df = dat_1k

In [None]:
dat_1k

Unnamed: 0,EN,TI,TI_tokenized,EN_tokenized
0,"and the border shall go forth to Ziphron, and ...",እቲ ዶብ ድማ ናብ ዚፍሮን ይሕለፍ እሞ ናብ ሓጻር-ዔናን ይውጻእ፡ ናይ ሰ...,"['እቲ', 'ዶብ', 'ድማ', 'ናብ', 'ዚፍሮን', 'ይሕለፍ', 'እሞ',...","['and', 'the', 'border', 'shall', 'go', 'forth..."
1,"He said, ""Put your hand inside your cloak agai...",ኢድካ ናብ ትሽትሽካ ኣእቱ፡ ከኣ በለ። ኢዱ ናብ ትሽትሹ ኣእተወ። ካብ ት...,"['ኢድካ', 'ናብ', 'ትሽትሽካ', 'ኣእቱ', 'ከኣ', 'በለ', 'ኢዱ'...","['he', 'said', '""put', 'your', 'hand', 'inside..."
2,I will have a beer,ኣነ ቢራ ክህልወኒ እዩ,"['ኣነ', 'ቢራ', 'ክህልወኒ', 'እዩ']","['i', 'will', 'have', 'a', 'beer']"
3,Isaac brought her into his mother Sarah''s ten...,ይስሃቅ ከኣ ናብ ድንኳን ኣዲኡ ሳራ ኣእተዋ። ንርብቃ ድማ ወሰዳ እሞ ሰበ...,"['ይስሃቅ', 'ከኣ', 'ናብ', 'ድንኳን', 'ኣዲኡ', 'ሳራ', 'ኣእተ...","['isaac', 'brought', 'her', 'into', 'his', 'mo..."
4,Eber lived four hundred thirty years after he ...,ዔበር ንፌሌግ ምስ ወለደ ኸኣ፡ ድሕሪኡ ኣርባዕተ ሚእትን ሰላሳን ዓመት ገ...,"['ዔበር', 'ንፌሌግ', 'ምስ', 'ወለደ', 'ኸኣ', 'ድሕሪኡ', 'ኣር...","['eber', 'lived', 'four', 'hundred', 'thirty',..."
...,...,...,...,...
995,"Reuben, Simeon, Levi, and Judah,",ሮቤል፡ ስምኦን፡ ሌውን ይሁዳን፡,"['ሮቤል', 'ስምኦን', 'ሌውን', 'ይሁዳን']","['reuben', 'simeon', 'levi', 'and', 'judah']"
996,"Butter of the herd, and milk of the flock, wit...",ጠስሚ ላምን ጸባ በጊዕን ምስ ስብሒ ገንሸልን ደዓውል ባሳንን ኣጣልን ምስ...,"['ጠስሚ', 'ላምን', 'ጸባ', 'በጊዕን', 'ምስ', 'ስብሒ', 'ገንሸ...","['butter', 'of', 'the', 'herd', 'and', 'milk',..."
997,"Yahweh our God spoke to us in Horeb, saying, Y...",እግዚኣብሄር ኣምላኽና፡ ኣብ ሆሬብ ከምዚ ኢሉ ተዛረበና፡ ኣብዚ ኸረንዚ እ...,"['እግዚኣብሄር', 'ኣምላኽና', 'ኣብ', 'ሆሬብ', 'ከምዚ', 'ኢሉ',...","['yahweh', 'our', 'god', 'spoke', 'to', 'us', ..."
998,"and said, ""I have sworn by myself, says Yahweh...",ብርእሰይ መሐልኩ፡ ይብል እግዚኣብሄር፡ እዚ ነገርዚ ኻብ እትገብር፡ ነቲ ...,"['ብርእሰይ', 'መሐልኩ', 'ይብል', 'እግዚኣብሄር', 'እዚ', 'ነገር...","['and', 'said', '""i', 'have', 'sworn', 'by', '..."


In [None]:
# create a shuffled index
shuffled_idx = np.random.permutation(len(df))

# number of training samples
num_tr = int(len(df) * 0.80)

# training and testing index
shuffled_tr = shuffled_idx[:num_tr]
shuffled_ts = shuffled_idx[num_tr:]

tr_samples_co = df.loc[shuffled_tr]

In [None]:
shuffled_tr

array([2490, 4458, 4132, ..., 4064, 2604, 1133])

In [None]:
MAX_SEQUENCE_LENGTH = 22

In [None]:
ti_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "።", "፡", "፥", "፧", "፤", "፦", "'", '"']
en_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", ".", ",", ";", "?", "!", ":", "’", "'", '"']

for i in tr_samples_co['TI_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
    word = word.strip("[],' '() \"\"")
    if word.find("'") != -1:
      word_chk = word.split("'")
    else:
      word_chk = [word]
    for wd in word_chk:
      if wd not in ti_vocab:
        ti_vocab.append(wd)

for i in tr_samples_co['EN_tokenized']:
  word_list = i
  word_list = word_list.split(',')
  for word in word_list:
    word = word.strip("[],' '.()\"")
    word = word.strip(' ""')
    word = word.lower()
    if word.find("’") != -1:
      word_chk = word.split("’")
    elif word.find("''") != -1:
      word_chk = word.split("''")
    else:
      word_chk = [word]
    for wd in word_chk:
      if wd not in en_vocab:
        en_vocab.append(wd)

In [None]:
print(len(ti_vocab))
print(len(en_vocab))

3191
1563


In [None]:
ti_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=ti_vocab, lowercase=True
)
en_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=en_vocab, lowercase=True
)

In [None]:
# MAX_SEQUENCE_LENGTH = 15

tok_in = en_tokenizer(df['EN'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=en_tokenizer.token_to_id("[PAD]"),
    )
encoder_input = input_format(tok_in)
encoder_input

<tf.Tensor: shape=(1000, 22), dtype=int32, numpy=
array([[ 160,  134,  161, ...,    0,    0,    0],
       [ 119,  423, 1383, ...,    0,    0,    0],
       [ 302,  160,  484, ...,    0,    0,    0],
       ...,
       [ 286,   55,   20, ...,   59,   20,   33],
       [  66,  181,  969, ...,   11,   11,  104],
       [ 286,   42,   51, ...,    0,    0,    0]], dtype=int32)>

In [None]:

tok_out = ti_tokenizer(df['TI'])  # make as string

output_format = keras_nlp.layers.StartEndPacker(
      sequence_length=MAX_SEQUENCE_LENGTH + 1,
      start_value=ti_tokenizer.token_to_id("[START]"),
      end_value=ti_tokenizer.token_to_id("[END]"),
      pad_value=ti_tokenizer.token_to_id("[PAD]"),
  )
tr_output = output_format(tok_out)
tr_output

<tf.Tensor: shape=(1000, 23), dtype=int32, numpy=
array([[   2,  188,  871, ...,    0,    0,    0],
       [   2, 2109, 2687, ...,    0,    0,    0],
       [   2,  605,  606, ...,    0,    0,    0],
       ...,
       [   2,  397,  695, ...,  266,   76,    3],
       [   2, 1368, 1595, ...,    0,    0,    0],
       [   2,  397,   39, ...,    0,    0,    0]], dtype=int32)>

In [None]:
encoder_in_arr = encoder_input.numpy()
decoder_in_arr = tr_output[:, :-1].numpy()
decoder_out_arr = tr_output[:, 1:].numpy()

In [None]:
# training examples
encoder_tr = encoder_in_arr[shuffled_tr]
decoder_in_tr = decoder_in_arr[shuffled_tr]
decoder_out_tr = decoder_out_arr[shuffled_tr]

# testing examples
encoder_ts = encoder_in_arr[shuffled_ts]
decoder_in_ts = decoder_in_arr[shuffled_ts]
decoder_out_ts = decoder_out_arr[shuffled_ts]

In [None]:
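# pull 250 from testing to training; this assumes the test split has more than 250 rows
# (with df = dat_1k the 80/20 split leaves only 200, so the test arrays would end up empty)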
encoder_tr = np.vstack((encoder_tr, encoder_ts[:250]))
decoder_in_tr = np.vstack((decoder_in_tr, decoder_in_ts[:250]))
decoder_out_tr = np.vstack((decoder_out_tr, decoder_out_ts[:250]))

encoder_ts = encoder_ts[250:]
decoder_in_ts = decoder_in_ts[250:]
decoder_out_ts = decoder_out_ts[250:]

## Building Models

### LSTM

In [None]:
# seq-to-seq LSTM model
encoder_inputs = keras.Input(shape=([None, len(en_vocab)]))
encoder = keras.layers.LSTM(156, return_state=True)
encoder_outputs,state_h, state_c= encoder(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = keras.Input(shape=([None, len(ti_vocab)]))
decoder_lstm = keras.layers.LSTM(156, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(len(ti_vocab) , activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

seq2seq = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="s2sLSTM",
)

In [None]:
seq2seq.summary()

In [None]:
seq2seq.compile(
    optimizer = keras.optimizers.RMSprop(0.01), loss="categorical_crossentropy", metrics=["accuracy"]
)
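The LSTM model above takes inputs of shape `(batch, time, vocab_size)` and is compiled with `categorical_crossentropy`, so the integer token IDs produced by the packers would need to be one-hot encoded before training. The notebook does not show that step; the following is a minimal NumPy sketch of one way to do it, with `one_hot` as a hypothetical helper name.

```python
import numpy as np


def one_hot(token_ids, vocab_size):
    """Turn integer IDs of shape (batch, time) into floats of shape (batch, time, vocab_size)."""
    # indexing the identity matrix picks one row per token id
    return np.eye(vocab_size, dtype="float32")[token_ids]


ids = np.array([[2, 5, 3, 0]])            # (batch=1, time=4)
encoded = one_hot(ids, vocab_size=6)      # (1, 4, 6)
print(encoded.shape)
```

One-hot tensors grow with the vocabulary size, so for the larger vocabularies in this notebook an embedding layer with `sparse_categorical_crossentropy`, as used for the transformer, is likely to be much lighter on memory.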

### Transformer Architecture

#### Keras transformer layer

In [None]:
# vocab size
# en_vocab_size = len(en_vocab)
# ti_vocab_size = len(ti_vocab)

# 5k vocab size
en_vocab_size = 4300
ti_vocab_size = 12500
# en_vocab_size = 1600
# ti_vocab_size = 3300


#define some hyperparameter values for our transformers
EMBED_DIM = 100
INTERMEDIATE_DIM = 700
NUM_HEADS = 4

In [None]:
# Encoder
keras.backend.clear_session()
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=en_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

In [None]:
# Decoder
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ti_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
    # mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.8)(x)
decoder_outputs = keras.layers.Dense(ti_vocab_size, activation="softmax",
                                     activity_regularizer = keras.regularizers.L1(0.01))(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])
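
One thing worth checking about the decoder above: the `activity_regularizer` is an L1 penalty on the output of a softmax layer. Softmax outputs are non-negative and sum to 1 at every position, so their L1 norm is fixed by the number of positions and does not depend on the weights. The NumPy sketch below illustrates this. It is not the Keras implementation; the batch size of 64 comes from the `fit` call in this section, the sequence length of 22 is an assumption, and the small vocabulary is only there to keep the example fast.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def l1_activity_penalty(activations, factor=0.01):
    # L1 penalty summed over every element of the layer output.
    return factor * np.abs(activations).sum()

rng = np.random.default_rng(0)
batch, seq_len, vocab = 64, 22, 50  # seq_len and vocab are illustrative assumptions

# Two very different sets of logits (flat versus sharply peaked distributions).
penalty_flat = l1_activity_penalty(softmax(rng.normal(size=(batch, seq_len, vocab))))
penalty_peaked = l1_activity_penalty(softmax(10 * rng.normal(size=(batch, seq_len, vocab))))

print(penalty_flat, penalty_peaked)  # both equal 0.01 * batch * seq_len
```

If Keras sums this penalty over the batch without normalising (an assumption about the installed version), it would add a constant of about 14.08 to every batch loss. That would be consistent with the losses near 18 to 21 reported for this model against roughly 4 to 8 for the BERT-encoder model, whose `Dense` layer has no regularizer. A constant has zero gradient, so it would not regularize anything; a `kernel_regularizer` on the same layer penalises the weights instead.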

In [None]:
#connect the encoder and decoder together in sequence
seq2seq = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="s2sTransformer",
)

In [None]:
seq2seq.summary()

---------

##### tracking progress
lr: 0.01, epochs: 15, val acc: 0.47

lr: 0.005, epochs: 15, val acc: 0.3564

lr: 0.0005, epochs: 15, val acc: 0.3596

**embed: 70, inter: 500, numheads: 4**

lr: 0.001, epochs: 15, val acc: 0.4479

lr: 0.003, epochs: 15, val acc: 0.4703

**embed: 70, inter: 500, numheads: 4, L1 regularization (0.01)**

lr: 0.003, epochs: 15, val acc: 0.4734

dropout: 0.8, batch size: 16, epochs: 20, val acc: 0.42

**embed: 75, inter: 220, numheads: 3**

lr: 0.001, epochs: 20, val acc: 0.4423

**embed: 100, inter: 700, numheads: 4**

lr: 0.002, epochs: 20, val acc: 0.4451

lr: 0.0005, epochs: 20, val acc: 0.44

lr: 0.003, epochs: 20, batch size: 32, val acc: 0.43




In [None]:
seq2seq.compile(
    optimizer = keras.optimizers.RMSprop(learning_rate = 0.003), loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

seq2seq.fit(x=[encoder_tr, decoder_in_tr], y=decoder_out_tr,
            batch_size = 64,
            epochs=25,
            validation_data = ([encoder_ts, decoder_in_ts], decoder_out_ts))


Epoch 1/25
67/67 ━━━━━━━━━━━━━━━━━━━━ 141s 1s/step - accuracy: 0.2760 - loss: 21.0556 - val_accuracy: 0.3597 - val_loss: 18.4165
Epoch 2/25
67/67 ━━━━━━━━━━━━━━━━━━━━ 141s 1s/step - accuracy: 0.3520 - loss: 18.9935 - val_accuracy: 0.3834 - val_loss: 18.0280
Epoch 3/25
67/67 ━━━━━━━━━━━━━━━━━━━━ 146s 1s/step - accuracy: 0.3811 - loss: 18.6317 - val_accuracy: 0.3955 - val_loss: 17.9080
Epoch 4/25
67/67 ━━━━━━━━━━━━━━━━━━━━ 144s 1s/step - accuracy: 0.3993 - loss: 18.3727 - val_accuracy: 0.4143 - val_loss: 17.7593
Epoch 5/25
67/67 ━━━━━━━━━━━━━━━━━━━━ 139s 1s/step - accuracy: 0.4144 - loss: 18.1777 - val_accuracy: 0.4255 - val_loss: 17.6791
Epoch 6/25
67/67 ━━━━━━━━━━━━━━━━━━━━ 138s 1s/step - accuracy: 0.4277 - loss: 17.9884 - val_accuracy: 0.4276 - val_loss: 17.6213
Epoch 7/25
67/67

<keras.src.callbacks.history.History at 0x7e00ad6d3460>

In [None]:
test_cases = [50, 143, 167]
for i in test_cases:
  print("-----")
  print(ti_tokenizer.detokenize(decoder_out_ts[i]).numpy().decode('utf-8'))
  test_pred = seq2seq.predict([np.array([encoder_ts[i]]), np.array([decoder_in_ts[i]])])
  print(ti_tokenizer.detokenize(test_pred.argmax(axis=2)).numpy()[0].decode('utf-8'))

-----
ነቶም ኣብ ሰዒር ዚነብሩ ኣሕዋትና ደቂ ኤሳው [UNK] ድማ ፡ ካብታ መገዲ ጐልጐልን ካብ [UNK] ካብ [UNK] - [UNK] [UNK] ። [END]
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 828ms/step
ካብ ደቂ ቅድሚ ዚነብሩ ጓኖት ኣብ ኤሳው ፡ ፡ ዚህበካ ካብ መገዲ ፡ ፡ ደቂ ፡ [UNK] ፡ ኣራም ፡ ፡ [END]
-----
ንኣብራም ድማ ምእንትኣ ጽቡቕ ገበረሉ ፡ ኣባጊዕን ኣብዑርን [UNK] ገላዉን ኣግራድን ኣንስትዮ ኣእዱግን ኣግማልን [UNK] ። [END] [PAD] [PAD] [PAD] [PAD] [PAD]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
ኣምላኽ ፡ ፡ ፡ ኰነ ፡ ነፍሲ ፡ ወሲዱ ፡ ኣግራድን ኣግማልን ዕምባባ ጤለበጊዕን ኣእዱግን ። [END] [PAD] [PAD] [PAD] [PAD] [PAD]
-----
ብዕራይ ንተባዕታይ ባርያ ወይ [UNK] ባርያ እንተ ወግኤ ፡ እቲ ዋና እቲ ብዕራይ [UNK] እቲ [UNK] ሰላሳ ሲቃል ብሩር ይሀቦ ፡ [END]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
ንዚሐርር ፡ ነቲ ወይ ክልተ ፡ እንተ ዀነ ፡ ስምኦን ቚስሊ ዚኸውን ቚስሊ ፡ ፡ ኻህን ፡ ሲቃል [UNK] ፡ ። [END]
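
The predictions above pass `decoder_in_ts[i]`, the reference translation shifted by one position, into the model, so every output token is conditioned on the correct previous tokens (teacher forcing). At inference time no reference exists, and the decoder input has to be built one token at a time from the model's own output. The sketch below shows a greedy version of that loop. `toy_predict` is a hypothetical stand-in for `seq2seq.predict`, and the token ids, vocabulary size, and sequence length are illustrative assumptions, not values from this notebook.

```python
import numpy as np

PAD_ID, START_ID, END_ID = 0, 2, 3  # hypothetical special-token ids
VOCAB_SIZE = 8
MAX_LEN = 6

def toy_predict(encoder_tokens, decoder_tokens):
    """Stand-in for a trained model: returns (MAX_LEN, VOCAB_SIZE) probabilities.

    It copies encoder token t to output position t and emits END once the
    encoder input runs into padding. A real model would also use decoder_tokens.
    """
    probs = np.full((MAX_LEN, VOCAB_SIZE), 1e-3)
    for t in range(MAX_LEN):
        target = encoder_tokens[t] if encoder_tokens[t] != PAD_ID else END_ID
        probs[t, target] = 1.0
    return probs / probs.sum(axis=-1, keepdims=True)

def greedy_decode(encoder_tokens, predict_fn):
    # Start from [START] followed by padding and fill in one token per step.
    decoder_tokens = np.full(MAX_LEN, PAD_ID)
    decoder_tokens[0] = START_ID
    output = []
    for t in range(MAX_LEN):
        probs = predict_fn(encoder_tokens, decoder_tokens)
        next_id = int(probs[t].argmax())  # only position t is trusted at step t
        if next_id == END_ID:
            break
        output.append(next_id)
        if t + 1 < MAX_LEN:
            decoder_tokens[t + 1] = next_id  # feed the prediction back in
    return output

encoder_tokens = np.array([5, 6, 7, 0, 0, 0])
decoded = greedy_decode(encoder_tokens, toy_predict)
print(decoded)
```

Translations produced this way are usually worse than teacher-forced ones, since early mistakes propagate, but they reflect what the model can do without access to the reference.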


#### BERT encoder with Keras transformer decoder

In [None]:
keras.backend.clear_session()
# vocab size
# en_vocab_size = len(en_vocab)
# ti_vocab_size = len(ti_vocab)
# en_vocab_size = 15000
ti_vocab_size = 12500
# ti_vocab_size = 3300

#define some hyperparameter values for our transformers
EMBED_DIM = 100
INTERMEDIATE_DIM = 700
NUM_HEADS = 3

In [None]:
# decoder

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, 768), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ti_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(ti_vocab_size, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)

decoder_outputs = decoder([decoder_inputs, encoded_seq_inputs])

bertseq2seq = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)
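
In this model the decoder states have `EMBED_DIM = 100` features while the precomputed BERT embeddings have 768. Cross-attention can combine the two because queries, keys, and values are each passed through their own learned projection into a shared head size before the dot product. The single-head NumPy sketch below illustrates the shapes involved. It is not the `keras_nlp` implementation; the head size of 33 (100 // 3), the sequence length of 22, and the random weights are illustrative assumptions.

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    q = decoder_states @ w_q  # (tgt_len, head_dim)
    k = encoder_states @ w_k  # (src_len, head_dim)
    v = encoder_states @ w_v  # (src_len, head_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
DEC_DIM, ENC_DIM, HEAD_DIM = 100, 768, 33  # HEAD_DIM is an illustrative choice
TGT_LEN = SRC_LEN = 22                     # assumed sequence length

decoder_states = rng.normal(size=(TGT_LEN, DEC_DIM))
encoder_states = rng.normal(size=(SRC_LEN, ENC_DIM))
w_q = rng.normal(size=(DEC_DIM, HEAD_DIM)) * 0.1
w_k = rng.normal(size=(ENC_DIM, HEAD_DIM)) * 0.1
w_v = rng.normal(size=(ENC_DIM, HEAD_DIM)) * 0.1

context, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

The key and value projections here map 768 features down to the head size, which is why the BERT embeddings can be fed in without a separate dimensionality-reduction layer.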

In [None]:
bertseq2seq.summary()

In [None]:
bert_embeddings = np.load('/content/drive/MyDrive/266_data/bert_embeddings_5k_22.npy')

bert_tr = bert_embeddings[shuffled_tr]
bert_ts = bert_embeddings[shuffled_ts]


bert_tr = np.vstack((bert_tr, bert_ts[:250]))
bert_ts = bert_ts[250:]
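
Moving the first 250 test rows into the training set only works if the same move is applied to every array that is indexed in parallel with the embeddings (the decoder inputs and targets). Whether that happens elsewhere in the notebook is not visible in this section. The sketch below wraps the move in one hypothetical helper, `move_head_to_train`, and applies it to toy arrays so the row counts can be checked.

```python
import numpy as np

def move_head_to_train(train, test, n_rows):
    # Append the first n_rows of test to train and drop them from test.
    new_train = np.concatenate([train, test[:n_rows]], axis=0)
    new_test = test[n_rows:]
    return new_train, new_test

# Toy stand-ins: 10 training rows, 6 test rows (shapes are illustrative).
toy_bert_tr, toy_bert_ts = np.zeros((10, 4, 8)), np.ones((6, 4, 8))
toy_dec_tr, toy_dec_ts = np.zeros((10, 4), dtype=int), np.ones((6, 4), dtype=int)

toy_bert_tr, toy_bert_ts = move_head_to_train(toy_bert_tr, toy_bert_ts, 2)
toy_dec_tr, toy_dec_ts = move_head_to_train(toy_dec_tr, toy_dec_ts, 2)

print(toy_bert_tr.shape, toy_bert_ts.shape, toy_dec_tr.shape, toy_dec_ts.shape)
```

Using one helper for all arrays makes it harder for the embeddings and the token arrays to drift out of alignment.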

In [None]:
##### 8/2 testing
bertseq2seq.compile(
    optimizer = keras.optimizers.RMSprop(learning_rate = 0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

bertseq2seq.fit(x=[decoder_in_tr, bert_tr], y=decoder_out_tr,
            batch_size = 80,
            epochs= 15,
            validation_data = ([decoder_in_ts, bert_ts], decoder_out_ts))


Epoch 1/15
54/54 ━━━━━━━━━━━━━━━━━━━━ 111s 2s/step - accuracy: 0.2719 - loss: 8.1011 - val_accuracy: 0.3214 - val_loss: 5.2222
Epoch 2/15
54/54 ━━━━━━━━━━━━━━━━━━━━ 135s 2s/step - accuracy: 0.3328 - loss: 5.4058 - val_accuracy: 0.3396 - val_loss: 4.6088
Epoch 3/15
54/54 ━━━━━━━━━━━━━━━━━━━━ 83s 2s/step - accuracy: 0.3468 - loss: 4.9958 - val_accuracy: 0.3683 - val_loss: 4.4594
Epoch 4/15
54/54 ━━━━━━━━━━━━━━━━━━━━ 147s 2s/step - accuracy: 0.3728 - loss: 4.7209 - val_accuracy: 0.3762 - val_loss: 4.2303
Epoch 5/15
54/54 ━━━━━━━━━━━━━━━━━━━━ 139s 2s/step - accuracy: 0.3846 - loss: 4.5361 - val_accuracy: 0.3852 - val_loss: 4.0931
Epoch 6/15
54/54 ━━━━━━━━━━━━━━━━━━━━ 142s 2s/step - accuracy: 0.3923 - loss: 4.4424 - val_accuracy: 0.3976 - val_loss: 4.0286
Epoch 7/15
54/54 ━━━━━

<keras.src.callbacks.history.History at 0x79aa28651570>
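
The `accuracy` values reported here are computed over every position in the padded sequences. If no mask reaches the metric, correctly predicted `[PAD]` positions count as hits, which can inflate the number for short sentences. The NumPy sketch below compares plain accuracy with an accuracy that ignores padded target positions. The pad id of 0 and the toy arrays are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def plain_accuracy(y_true, y_pred_probs):
    return float((y_pred_probs.argmax(axis=-1) == y_true).mean())

def masked_accuracy(y_true, y_pred_probs, pad_id):
    # Score only the positions whose target is a real token.
    pred_ids = y_pred_probs.argmax(axis=-1)
    mask = y_true != pad_id
    return float((pred_ids[mask] == y_true[mask]).mean())

PAD_ID = 0  # hypothetical pad id
y_true = np.array([[5, 6, 3, 0, 0, 0]])
pred_ids = np.array([[5, 4, 3, 0, 0, 0]])  # one real token is wrong
y_pred_probs = np.eye(8)[pred_ids]         # one-hot "probabilities" for the toy case

acc_plain = plain_accuracy(y_true, y_pred_probs)
acc_masked = masked_accuracy(y_true, y_pred_probs, PAD_ID)
print(acc_plain, acc_masked)
```

In the toy case the padded positions lift the score from two out of three real tokens to five out of six positions, which is the kind of gap worth keeping in mind when comparing runs.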

In [None]:
beg_idx = 4000
end_idx = 5000

bert_mod_pred_tr = bertseq2seq.predict([np.array(decoder_in_tr)[beg_idx:end_idx],
                                        np.array(bert_tr)[beg_idx:end_idx]])

# train data: reference Tigrinya text in the same order as bert_tr
# (the original training rows followed by the 250 rows moved over from the test split)
actual = df.loc[shuffled_tr]['TI']
actual_ts_to_tr = df.loc[shuffled_ts]['TI'][:250]
actual = np.concatenate((actual.to_numpy(), actual_ts_to_tr.to_numpy()))
# keep only the rows that were predicted; the slice stops at the end of the data,
# so no hard-coded row count is needed
actual = actual[beg_idx:end_idx]

pred_tokens = np.argmax(bert_mod_pred_tr, axis=2)

predict_string = pd.Series()
for i in range(pred_tokens.shape[0]):
  predict_string.loc[i] = ti_tokenizer.detokenize(pred_tokens[i]).numpy().decode('utf-8')

prediction_df = pd.DataFrame({'actual': actual, 'predicted': predict_string})
print(prediction_df.head())

prediction_df.to_csv('/content/drive/MyDrive/266_data/bert_mod_tr_8_3_5.csv')

9/9 ━━━━━━━━━━━━━━━━━━━━ 2s 175ms/step
                                              actual  \
0  ከምዚ ኣነ ሎሚ ኣብ ቅድሜኹም ዘንብረልኩም ዘሎኹ ዂሉ ሕጊ፡ ከምኡ ቅኑዕ ...   
1    ካባታተን ከኣ ዘይትበልዕወን እዚኣተን እየን፡ ንስርን ገምብን ወሓጥ ዓሳን፡   
2  በቲ እትፈርሆ ፍርሃት ልብኻን በቲ እትርእዮ ትርኢት ዓይንኻን ብጊሓት፡ ወ...   
3  ሰንደቕ ዕላማ ሰፈር ደቂ ኤፍሬም ድማ፡ ከከም ሰራዊቶም፡ ተጓዕዘ፡ ኣብ ል...   
4  ሙሴ ድማ ተመልሰ፡ ነተን ክልተ ጽላት ምስክር፡ በዝን በትን እተጻሕፋ፡ ብ...   

                                           predicted  
0  ንሱ እግዚኣብሄር እግዚኣብሄር እዩ በረኻ ፡ ዘሎኹ ዅሉ እዩ እዚ ንሱ ኸኣ...  
1  እዚ ድማ ፡ ካብ ፡ ፡ ንሱ ኣማልኽቲ እዩ ፡ ፡ [END] [PAD] [PA...  
2  እቲ ደም ኣብ እዚ ኣብ እንተ ፡ ፡ መስዋእቲ ፡ ወይ ፡ ዀነ ኸኣ ፡ ፡ ...  
3  ኣብ እቲ ሰፈር ደቂ ደቂ ናብ ፡ ካብ ፡ ፡ ካብ ፡ ፡ ልዕሊ ፡ ድማ ፡ ...  
4  ሙሴ ድማ ናብ ፡ ኣብ ክልተ ኸረን ኣብ ድማ ኣብ ኣብ ፡ ፡ ኣብ ፡ ድማ ...  


In [None]:
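# test data
bert_mod_pred_ts = bertseq2seq.predict([np.array(decoder_in_ts), np.array(bert_ts)])
# NOTE: bert_ts had its first 250 rows moved into the training set above. If shuffled_ts
# was not trimmed the same way, these references will not line up with the predictions.
actual = df.loc[shuffled_ts]['TI']
actual = actual.reset_index(drop=True)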
# actual.shape

pred_tokens = np.argmax(bert_mod_pred_ts, axis=2)

predict_string = pd.Series()
for i in range(pred_tokens.shape[0]):
  predict_string.loc[i] = ti_tokenizer.detokenize(pred_tokens[i]).numpy().decode('utf-8')

prediction_df = pd.DataFrame({'actual': actual, 'predicted': predict_string})
print(prediction_df.head())

prediction_df.to_csv('/content/drive/MyDrive/266_data/bert_mod_ts_1k.csv')

7/7 ━━━━━━━━━━━━━━━━━━━━ 7s 410ms/step
                                              actual  \
0  ኣምላኽ ነቲ ብርሃን መዓልቲ ኣውጽኣሉ። ነቲ ጸልማት ከኣ ለይቲ ኣውጽኣሉ።...   
1  ያእቆብ ድማ ንኤሳው እንጌራን ጸብሒ ብርስንን ገይሩ ሀቦ። ንሱ ድማ በሊዑ...   
2                                               ኮቦርታ   
3  ጐይታይ ድማ ኸምዚ ኢሉ ኣምሓለኒ፡ ካብ ኣዋልድ እዞም ኣነ ኣብ ምድሮም ዝ...   
4      ኤሳው ድማ፡ ንንቀል እሞ ንኺድ፡ ኣነ ድማ ቀቅድሜኻ ክኸይድ እየ፡ በለ።   

                                           predicted  
0  ኣምላኽ ድማ ብርሃን ይኹን ። ። [END] ብርሃን ብርሃን ፡ ኣብ ። [E...  
1  ያእቆብ ከኣ ንራሄል ምስ ሳዕ ኺሰርሕ [UNK] ፡ ። [END] ኸኣ ናብ ...  
2  ሽንቲ [END] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] ...  
3  ያእቆብ ድማ ናብ ቕንያት ምስ [UNK] እዛ ኤሳው ኣዋልደይ ኣነ ደኣ [U...  
4  ንሱ ከኣ [END] ኣነ ኣራም ፡ ኣራም ጽቡቕ ማይ እታ ዘለዋኸ ይኹን ። ...  
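
The predicted strings above keep everything the model emitted for all positions, including tokens after the first `[END]` and trailing `[PAD]` markers. For reading or scoring the output it can help to cut each string at the first `[END]`. The sketch below does that on whitespace-separated tokens. `strip_after_end` is a hypothetical helper, and the first example string is a shortened version of row 2 above.

```python
def strip_after_end(text, end_token="[END]", pad_token="[PAD]"):
    # Keep tokens up to (not including) the first end marker, then drop padding.
    tokens = text.split()
    if end_token in tokens:
        tokens = tokens[: tokens.index(end_token)]
    return " ".join(t for t in tokens if t != pad_token)

cleaned_short = strip_after_end("ሽንቲ [END] [PAD] [PAD] [PAD]")
cleaned_repeat = strip_after_end("a b [END] c d [END] [PAD]")
cleaned_no_end = strip_after_end("a b c [PAD]")

print(cleaned_short)
print(cleaned_repeat)
print(cleaned_no_end)
```

Applying this before writing the CSV would make the `predicted` column directly comparable with the `actual` column, which contains no special tokens.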


## Back translation

In [None]:
# vocab size
# en_vocab_size = len(en_vocab)
# ti_vocab_size = len(ti_vocab)
en_vocab_size = 5000
ti_vocab_size = 14500


#define some hyperparameter values for our transformers
EMBED_DIM = 250
INTERMEDIATE_DIM = 700
NUM_HEADS = 4

In [None]:
keras.backend.clear_session()
# Encoder
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=en_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

In [None]:
# Decoder
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ti_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
    # mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(ti_vocab_size, activation="softmax",
                                     activity_regularizer = keras.regularizers.L1(0.01))(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

In [None]:
seq2seq = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="s2sTransformer",
)

In [None]:
seq2seq.summary()

In [None]:
seq2seq.compile(
    optimizer = keras.optimizers.RMSprop(learning_rate = 0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

In [None]:
seq2seq.fit(x=[encoder_in_arr, decoder_in_arr], y=decoder_out_arr,
            batch_size = 64,
            epochs=20)

Epoch 1/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 19s 102ms/step - accuracy: 0.2913 - loss: 20.6724
Epoch 2/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 9s 22ms/step - accuracy: 0.3643 - loss: 18.8980
Epoch 3/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - accuracy: 0.3940 - loss: 18.5041
Epoch 4/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.4138 - loss: 18.1960
Epoch 5/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 2s 21ms/step - accuracy: 0.4354 - loss: 17.9353
Epoch 6/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - accuracy: 0.4545 - loss: 17.7060
Epoch 7/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - accuracy: 0.4771 - loss: 17.4865
Epoch 8/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - accuracy: 0.4973 - loss: 17.2712
Epoch 9/20
79/79 ━━━━━━━━

<keras.src.callbacks.history.History at 0x7bd36c63bdc0>

In [None]:
tok_in = en_tokenizer(dat_5k['EN'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=en_tokenizer.token_to_id("[PAD]"),
    )
encoder_input = input_format(tok_in)
encoder_input

In [None]:
# mod_pred_tr = seq2seq.predict([encoder_in_arr, decoder_in_arr])
beg_idx = 4800
end_idx = 6000

mod_pred_tr = seq2seq.predict([encoder_in_arr[beg_idx:end_idx],
                                        decoder_in_arr[beg_idx:end_idx]])

# English source sentences for the predicted slice
# (.loc is label-based and includes the end label, hence end_idx-1)
actual = dat_5k['EN']
actual = actual.loc[beg_idx:end_idx-1]
actual = actual.reset_index(drop=True)

pred_tokens = np.argmax(mod_pred_tr, axis=2)

predict_string = pd.Series()
for i in range(pred_tokens.shape[0]):
  predict_string.loc[i] = ti_tokenizer.detokenize(pred_tokens[i]).numpy().decode('utf-8')

prediction_df = pd.DataFrame({'en_input': actual, 'ti_translation': predict_string})
print(prediction_df.head())

prediction_df.to_csv('/content/drive/MyDrive/266_data/back_translation_5.csv')

8/8 ━━━━━━━━━━━━━━━━━━━━ 3s 444ms/step
                                            en_input  \
0  You shall rejoice in all the good which Yahweh...   
1  When you have made an end of tithing all the t...   
2  You shall say before Yahweh your God, I have p...   
3  I have not eaten of it in my mourning, neither...   
4  Look down from your holy habitation, from heav...   

                                      ti_translation  
0  ንስኻን እግዚኣብሄር ኣብ ብዅሉ ንቤትካን ዝሀበካ ዅሉ ጽቡቕ ነገር ፡ ኣብ...  
1  በታ ሳልሰይቲ ዓመት ፡ ዓመት ኣብ ፡ ኣብ ኵሉ ዓመት ዕሽር ምስ ናይ ፡ ...  
2  ኣብ ቅድሚ እግዚኣብሄር ድማ ድማ ፡ ፡ ነቲ ኻብ ኻብ ቤተይ ኣውጺኤ ፡ ከ...  
3  ኣብ መንጎኻን ካብኡ ኣይበላዕኩን ፡ ኣብ ከሎኹ ኸኣ ኣብ ንየማን ፡ ኣነ ...  
4  ካብቲ ቅዱስ ህዝቢ ፡ ካብ ሰማይ ጠምት እሞ ፡ ናብ እስራኤልን ፡ ኣብ ህ...  


In [None]:
pred_1 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_1.csv', usecols = ['en_input', 'ti_translation'])
pred_2 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_2.csv', usecols = ['en_input', 'ti_translation'])
pred_3 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_3.csv', usecols = ['en_input', 'ti_translation'])
pred_4 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_4.csv', usecols = ['en_input', 'ti_translation'])
pred_5 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_5.csv', usecols = ['en_input', 'ti_translation'])

middle_translations_df = pd.concat([pred_1, pred_2, pred_3, pred_4, pred_5])
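
The five CSV files concatenated above were produced by re-running the prediction cell with different `beg_idx` and `end_idx` values. The slice bounds can also be generated in a loop. The sketch below assumes 5035 rows (the row count of the tensor printed later in this notebook) and a chunk size of 1200 (inferred from the `4800:6000` slice); both are assumptions, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(n_rows, chunk_size):
    # Half-open (begin, end) pairs covering range(n_rows); the last chunk may be short.
    return [
        (beg, min(beg + chunk_size, n_rows))
        for beg in range(0, n_rows, chunk_size)
    ]

bounds = chunk_bounds(5035, 1200)
for part, (beg, end) in enumerate(bounds, start=1):
    print(f"back_translation_{part}.csv <- rows {beg}:{end}")
```

Clamping the end of the last chunk gives 235 rows for the fifth file, matching the final index of 234 shown in the back-translation output later in this notebook.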

In [None]:
#### ti to english translator

en_vocab_size = 5100
ti_vocab_size = 14500


#define some hyperparameter values for our transformers
EMBED_DIM = 250
INTERMEDIATE_DIM = 700
NUM_HEADS = 4

In [None]:
keras.backend.clear_session()
# Encoder
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ti_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

In [None]:
# Decoder
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=en_vocab_size,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
    # mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(en_vocab_size, activation="softmax",
                                     activity_regularizer = keras.regularizers.L1(0.01))(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

In [None]:
backseq2seq = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="s2sTransformer",
)

In [None]:
backseq2seq.summary()

In [None]:
backseq2seq.compile(
    optimizer = keras.optimizers.RMSprop(learning_rate = 0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

In [None]:
backseq2seq.fit(x=[ti_encoder_in_arr, en_decoder_in_arr], y=en_decoder_out_arr,
            batch_size = 64,
            epochs=20)

Epoch 1/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 23s 118ms/step - accuracy: 0.2518 - loss: 19.5591
Epoch 2/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 5s 14ms/step - accuracy: 0.3531 - loss: 17.8421
Epoch 3/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.3728 - loss: 17.5838
Epoch 4/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.3937 - loss: 17.3821
Epoch 5/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.4225 - loss: 17.1614
Epoch 6/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.4413 - loss: 16.9962
Epoch 7/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.4698 - loss: 16.8082
Epoch 8/20
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.4891 - loss: 16.6628
Epoch 9/20
79/79 ━━━━━━━━

<keras.src.callbacks.history.History at 0x7f8ee6e739a0>

In [None]:
tok_in = ti_tokenizer(middle_translations_df['ti_translation'])

input_format = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=ti_tokenizer.token_to_id("[PAD]"),
    )
ti_pred_input = input_format(tok_in)
ti_pred_input

<tf.Tensor: shape=(5035, 22), dtype=int32, numpy=
array([[  11,   13,   13, ...,    1,    1,    1],
       [  19,   20,   21, ...,    1,    1,    1],
       [  22,   23,   24, ...,    1,    1,    1],
       ...,
       [4465,  220, 1190, ...,    4,    1,    1],
       [ 808, 3420,  220, ...,    1,    1,    1],
       [5840,  732, 7146, ...,    4,  237, 3420]], dtype=int32)>
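
With no `start_value` or `end_value` passed, the `StartEndPacker` call above acts as a pad-or-truncate step: short sequences are filled with the pad id up to `MAX_SEQUENCE_LENGTH`, and long ones are cut. The pure-Python sketch below mimics that behaviour for a single sequence. It is an illustration and not the `keras_nlp` implementation; `pad_or_truncate` is a hypothetical helper, and the pad id of 1 is read off the trailing values in the output above.

```python
def pad_or_truncate(token_ids, sequence_length, pad_value):
    # Cut to at most sequence_length tokens, then right-pad to exactly that length.
    clipped = list(token_ids[:sequence_length])
    return clipped + [pad_value] * (sequence_length - len(clipped))

short = pad_or_truncate([11, 13, 13], sequence_length=6, pad_value=1)
long = pad_or_truncate(list(range(10)), sequence_length=6, pad_value=1)

print(short)
print(long)
```

The last row of the tensor above ends in ordinary token ids with no padding, which suggests that sentence was at least 22 tokens long and may have been truncated.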

In [None]:
beg_idx = 4800
end_idx = 6000

mod_pred_tr = backseq2seq.predict([ti_pred_input[beg_idx:end_idx],
                                        en_decoder_in_arr[beg_idx:end_idx]])

# No reference column is attached in this cell; only the back-translated
# English strings are saved.
pred_tokens = np.argmax(mod_pred_tr, axis=2)

predict_string = pd.Series()
for i in range(pred_tokens.shape[0]):
  predict_string.loc[i] = en_tokenizer.detokenize(pred_tokens[i]).numpy().decode('utf-8')

# prediction_df = pd.DataFrame({'en_input': actual, 'ti_translation': predict_string})
# print(prediction_df.head())
print(predict_string.head())
print(predict_string.tail())
predict_string.to_csv('/content/drive/MyDrive/266_data/back_translation_en_5.csv')

8/8 ━━━━━━━━━━━━━━━━━━━━ 5s 707ms/step
0    and shall surely in the the good things yahweh...
1    when you have a an end of your all your years ...
2    he shall spoke , yahweh , god , i will given a...
3    and have in take , it , my covenant , i , i ha...
4    for at from the holy people , and heaven , and...
dtype: object
230    moses moses commanded land before yahweh comma...
231    in in him in the valley in the valley of moab ...
232    moses was one hundred twenty years old when he...
233    the children of israel wept for moses in the p...
234    joshua the son of nun , very of nun son , silv...
dtype: object


In [None]:
bt_en_1 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_en_1.csv', usecols = [1])
bt_en_2 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_en_2.csv', usecols = [1])
bt_en_3 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_en_3.csv', usecols = [1])
bt_en_4 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_en_4.csv', usecols = [1])
bt_en_5 = pd.read_csv('/content/drive/MyDrive/266_data/back_translation_en_5.csv', usecols = [1])

bt_en_df = pd.concat([bt_en_1, bt_en_2, bt_en_3, bt_en_4, bt_en_5])

In [None]:
middle_translations_df['bt_en'] = bt_en_df
middle_translations_df.head()

Unnamed: 0,en_input,ti_translation,bt_en
0,Can you find what you want to say here,እንታይ ከም ከም ዝደለኻ ካብዚ መጽሓፍ ክትረክቦ ምከኣልካዶ [END] [P...,"what i find what you this to not , for [PAD] [..."
1,It has several languages,ብዙሓት ቋንቋታት ኣለዎ [END] [PAD] [PAD] [PAD] [PAD] [...,"it will given with , [PAD] [PAD] [PAD] [PAD] [..."
2,We can try to communicate this way,በዚ ኣገባብ ጌርና ክንረዳዳእ ክንፍትን ኢና [END] [PAD] [PAD] ...,we can i to destroy this way which [PAD] [PAD]...
3,What is your name,ስሙ ' ዩ ስምካ [END] [PAD] [PAD] [PAD] [PAD] [PAD]...,my name your name abraham [PAD] [PAD] [PAD] [P...
4,I’m hungry,ጥሜት ኣለኒ [END] [PAD] [PAD] [PAD] [PAD] [PAD] [P...,i have m gods . [PAD] [PAD] [PAD] [PAD] [PAD] ...
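
A quick way to put a number on how much of `en_input` survives the round trip into `bt_en` is unigram overlap after removing special tokens. The sketch below is a rough illustration and not a replacement for a proper metric such as BLEU. The helper names are hypothetical, and the hypothesis string is the visible part of row 2 above (the table truncates the rest).

```python
from collections import Counter

SPECIAL_TOKENS = ("[PAD]", "[END]", "[UNK]", "[START]")

def clean_tokens(text):
    # Drop special tokens first, then lowercase what remains.
    return [t.lower() for t in text.split() if t not in SPECIAL_TOKENS]

def unigram_overlap(reference, hypothesis):
    ref = Counter(clean_tokens(reference))
    hyp = Counter(clean_tokens(hypothesis))
    matched = sum((ref & hyp).values())  # clipped count of shared tokens
    precision = matched / max(sum(hyp.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall

reference = "We can try to communicate this way"
hypothesis = "we can i to destroy this way which [PAD] [PAD]"

precision, recall = unigram_overlap(reference, hypothesis)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here five tokens are shared, out of eight in the back translation and seven in the original. Note that the back-translation model was fed the reference English decoder input during prediction, so overlap figures computed on these outputs are likely to be optimistic.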


In [None]:
middle_translations_df.to_csv('/content/drive/MyDrive/266_data/back_translations_5k.csv', index = False)

In [None]:
bt_translations = pd.read_csv('/content/drive/MyDrive/266_data/back_translations_5k.csv')
bt_translations.head()

Unnamed: 0,en_input,ti_translation,bt_en
0,Can you find what you want to say here,እንታይ ከም ከም ዝደለኻ ካብዚ መጽሓፍ ክትረክቦ ምከኣልካዶ [END] [P...,"what i find what you this to not , for [PAD] [..."
1,It has several languages,ብዙሓት ቋንቋታት ኣለዎ [END] [PAD] [PAD] [PAD] [PAD] [...,"it will given with , [PAD] [PAD] [PAD] [PAD] [..."
2,We can try to communicate this way,በዚ ኣገባብ ጌርና ክንረዳዳእ ክንፍትን ኢና [END] [PAD] [PAD] ...,we can i to destroy this way which [PAD] [PAD]...
3,What is your name,ስሙ ' ዩ ስምካ [END] [PAD] [PAD] [PAD] [PAD] [PAD]...,my name your name abraham [PAD] [PAD] [PAD] [P...
4,I’m hungry,ጥሜት ኣለኒ [END] [PAD] [PAD] [PAD] [PAD] [PAD] [P...,i have m gods . [PAD] [PAD] [PAD] [PAD] [PAD] ...
