# Nuer-English_Translation

## Project Overview

The goal of this project is to create a translation model that translates English into Nuer (my native language). For now, the model translates only a small set of data.

## Approach

To translate English to Nuer, we build a recurrent neural network (RNN). The RNN pipeline consists of the following steps:
1. **Preprocessing**: Load and examine the data, then clean, tokenize, and pad it.
2. **Modeling**: Build, train, and test the model.
3. **Prediction**: Create specific translations from English to Nuer, then compare the output translations to the ground-truth translations.
4. **Iteration**: Iterate on the model, experimenting with different architectures.



## Import necessary packages and libraries

In [None]:
# Install the packages and libraries
%pip install numpy
%pip install tensorflow
%pip install keras


In [7]:
import os
import sys
import load_func  # local helper module (defined outside the notebook) that provides load_data
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential


In [10]:
import tensorflow as tf
print(tf.__version__)

2.17.0


In [11]:
import sys
print(sys.executable)
print(sys.path)

/Users/makuachtenygatluak/Documents/alu-machine_learning/Nuer-English_Translation/transenv/bin/python
['/Library/Frameworks/Python.framework/Versions/3.12/lib/python312.zip', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12', '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/lib-dynload', '', '/Users/makuachtenygatluak/Documents/alu-machine_learning/Nuer-English_Translation/transenv/lib/python3.12/site-packages', '/Users/makuachtenygatluak/Documents/alu-machine_learning/Nuer-English_Translation/transenv/lib/python3.12/site-packages/setuptools/_vendor']


## Dataset
- I will go through the dataset and clean it where necessary. There are two files, `english.txt` and `nuer.txt`; each line in `english.txt` has its translation on the corresponding line of `nuer.txt`. The dataset is loaded with a function defined outside the notebook (`load_func.load_data`).
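The `load_func` module itself is not shown in this notebook. The block below is a minimal sketch of what such a loader might look like, assuming one sentence per line and UTF-8 encoding; the function body and the temporary demo file are assumptions for illustration, not the project's actual helper.

```python
import os
import tempfile


def load_data(path):
    """Hypothetical stand-in for load_func.load_data: return the file's lines as a list."""
    with open(path, "r", encoding="utf-8") as f:
        # Strip only the leading/trailing whitespace of the whole file so that
        # blank lines in the middle are kept and line i of english.txt still
        # lines up with line i of nuer.txt.
        return f.read().strip().split("\n")


# Small demo on a temporary file (the sentences are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "english.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("hello\nthank you\n")
    demo_sentences = load_data(demo_path)

print(demo_sentences)  # ['hello', 'thank you']
```

Because the two files are aligned line by line, it is worth checking that both loaded lists have the same length before training.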


In [None]:
## Load the data
english_sentences = load_func.load_data('data/english.txt')
nuer_sentences = load_func.load_data('data/nuer.txt')

print(english_sentences[0])


# Check the corresponding sentences
for i in range(5):
    print("English sample:", english_sentences[i])
    print("Nuer sample:", nuer_sentences[i])
    print()

## Preprocess
- Convert the text into sequences of integers using:
1. Tokenization of the words into ids
2. Padding to make all the sequences the same length
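To make the two steps concrete, here is a simplified pure-Python sketch. It is not the Keras implementation used in the cells that follow (Keras's `Tokenizer` also filters punctuation, for example); the names `simple_tokenize` and `simple_pad` and the sample sentences are made up for illustration.

```python
from collections import Counter


def simple_tokenize(sentences):
    """Give each word an integer id, most frequent first; id 0 is left free for padding."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]
    return sequences, word_index


def simple_pad(sequences, length=None):
    """Append zeros so every sequence has the same length ('post' padding)."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    # Simplification: a sequence longer than `length` is cut at the end.
    return [seq[:length] + [0] * (length - len(seq)) for seq in sequences]


sentences = ["the dog runs", "the cat sleeps here"]
sequences, word_index = simple_tokenize(sentences)
padded = simple_pad(sequences)

print(sequences)  # [[1, 2, 3], [1, 4, 5, 6]]
print(padded)     # [[1, 2, 3, 0], [1, 4, 5, 6]]
```

The shorter sentence gets a trailing `0`, so both rows can be stacked into one rectangular array for the network.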

In [None]:
## Tokenize the data
def tokenize(x):
    """ Tokenize x
        :param x: List of sentences/strings to be tokenized
        :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer


text_sentences, text_tokenizer = tokenize(english_sentences)
# print(text_sentences[0])

for i, (sent, token_sent) in enumerate(zip(english_sentences, text_sentences)):
    print('Sequence {} in x'.format(i + 1))
    print('Original sentence:', sent)
    print('Tokenized sentence:', token_sent)
    print()

In [None]:
# Pad the data
def pad(x, length=None):
    """ Pad x
        :param x: List of sequences.
        :param length: Length to pad the sequences to. If None, use the length of the longest sequence in x.
        :return: Padded numpy array of sequences
    """
    return pad_sequences(x, maxlen=length, padding='post')


text_sentences_padded = pad(text_sentences)
for i, (token_sent, pad_sent) in enumerate(zip(text_sentences, text_sentences_padded)):
    print('Sequence {} in x'.format(i + 1))
    print('Original sentence:', token_sent)
    print('Padded sentence:', pad_sent)
    print()

In [None]:
# Preprocess the data
def preprocess(x, y):
    """ Preprocess x and y
        :param x: Feature List of sentences
        :param y: Label List of sentences
        :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_nuer_sentences, english_tokenizer, nuer_tokenizer = preprocess(english_sentences, nuer_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_nuer_sequence_length = preproc_nuer_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
nuer_vocab_size = len(nuer_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max Nuer sentence length:", max_nuer_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("Nuer vocabulary size:", nuer_vocab_size)
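The reshape inside `preprocess` only adds a trailing axis of size 1 to the padded label array; the ids themselves are unchanged. A small NumPy sketch with made-up ids shows the effect. The `num_classes` line is an assumption about how a later model might size its output layer: since id 0 is reserved for padding and word ids start at 1, a model predicting over all ids would need `len(word_index) + 1` classes.

```python
import numpy as np

# Toy padded label sequences: 2 sentences, 4 time steps each (ids are made up).
padded = np.array([[1, 2, 3, 0],
                   [1, 4, 5, 6]])

# Same operation as in preprocess(): (sentences, steps) -> (sentences, steps, 1).
labels = padded.reshape(*padded.shape, 1)

print(padded.shape)  # (2, 4)
print(labels.shape)  # (2, 4, 1)

# Hypothetical vocabulary for the toy data above.
toy_word_index = {"w1": 1, "w2": 2, "w3": 3, "w4": 4, "w5": 5, "w6": 6}
# Assumption: the output layer covers the padding id 0 plus every word id.
num_classes = len(toy_word_index) + 1
print(num_classes)  # 7
```

`np.expand_dims(padded, -1)` produces the same array and may read more clearly.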


## Models
Four model architectures are explored: a simple RNN, an RNN with embedding, a bidirectional RNN, and an encoder-decoder RNN.
- The neural network first maps the input to logits over the Nuer vocabulary, one vector of scores per output position. A `logits_to_text` function then picks the most likely word id at each position and converts the ids into the Nuer translation.

In [None]:
# Convert logits to text function

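def logits_to_text(logits, tokenizer):
    """ Turn logits of shape (sequence length, vocabulary size) into text using the tokenizer """
    # Invert the tokenizer's word -> id mapping; id 0 is reserved for padding
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    # Pick the highest-scoring word id at each position and join the words
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])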

print('`logits_to_text` function loaded.')
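As a quick sanity check, `logits_to_text` can be tried on hand-written scores before any model exists. The sketch below repeats the function so that it runs on its own; `fake_tokenizer` and the words `word_a` and `word_b` are placeholders standing in for a fitted Keras `Tokenizer` and real Nuer vocabulary.

```python
from types import SimpleNamespace

import numpy as np


def logits_to_text(logits, tokenizer):
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[p] for p in np.argmax(logits, 1)])


# Placeholder object exposing only the attribute the function reads.
fake_tokenizer = SimpleNamespace(word_index={'word_a': 1, 'word_b': 2})

# One row of scores per output position; columns are ids 0 (<PAD>), 1, 2.
logits = np.array([
    [0.1, 0.7, 0.2],    # highest score at id 1
    [0.2, 0.1, 0.7],    # highest score at id 2
    [0.9, 0.05, 0.05],  # highest score at id 0 (padding)
])

text = logits_to_text(logits, fake_tokenizer)
print(text)  # word_a word_b <PAD>
```

Note that the function expects the logits for a single sentence, so a batched model output has to be indexed first (for example `prediction[0]`).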