<a href="https://colab.research.google.com/github/kaybrian/Native_Language_Trans/blob/main/Luganda_English_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Required Libraries
First, we need to install the libraries used in this project.


In [1]:
!pip install transformers datasets tensorflow



## Data Collection and Preprocessing
Load the data from Hugging Face:
- [Luganda - English Dataset](https://huggingface.co/datasets/pkyoyetera/luganda_english_dataset)


In [5]:
from datasets import load_dataset
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Bidirectional, Attention
from tensorflow.keras.layers import Dot, Activation, Concatenate


In [6]:
# Load dataset from Hugging Face
dataset = load_dataset("pkyoyetera/luganda_english_dataset")

train_test_split_ratio = 0.2

# Split dataset into train and test sets
train_data, test_data = dataset['train'].train_test_split(test_size=train_test_split_ratio).values()


# Tokenize English sources and Luganda targets, padding or truncating every
# sequence to max_length tokens
def preprocess_data(batch, tokenizer, max_length=50):
    inputs = tokenizer(batch['English'], return_tensors="tf", max_length=max_length, padding='max_length', truncation=True)
    # text_target routes the Luganda text through the tokenizer's target-language settings
    targets = tokenizer(text_target=batch['Luganda'], return_tensors="tf", max_length=max_length, padding='max_length', truncation=True)

    return inputs.input_ids, targets.input_ids

# Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-lg")

train_input_ids, train_target_ids = preprocess_data(train_data, tokenizer)
test_input_ids, test_target_ids = preprocess_data(test_data, tokenizer)

# Convert to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_input_ids, train_target_ids))
test_dataset = tf.data.Dataset.from_tensor_slices((test_input_ids, test_target_ids))

# Batch the datasets
batch_size = 32
train_dataset = train_dataset.shuffle(len(train_data)).batch(batch_size)
test_dataset = test_dataset.batch(batch_size)

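To make the effect of `padding='max_length'` and `truncation=True` in `preprocess_data` concrete, here is a minimal pure-Python sketch of what happens to a single list of token IDs. The helper `pad_or_truncate` is hypothetical and simplified: the real tokenizer also takes care of special tokens such as `</s>`, which this sketch ignores. The pad ID 60446 is the one that appears in this notebook's tokenized output.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return token_ids cut or right-padded to exactly max_length entries."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

PAD_ID = 60446  # pad token ID of this tokenizer, as seen in the notebook's output

# A short sequence is filled up with pad IDs on the right
short = pad_or_truncate([18, 43450, 3, 0], max_length=6, pad_id=PAD_ID)
# A long sequence is cut off after max_length tokens
long = pad_or_truncate(list(range(10)), max_length=6, pad_id=PAD_ID)

print(short)  # [18, 43450, 3, 0, 60446, 60446]
print(long)   # [0, 1, 2, 3, 4, 5]
```

Every row ends up with the same length, which is what allows the whole split to be stacked into one tensor.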


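## Model Development: Building the RNN-Based Seq2Seq Model
We build an encoder-decoder architecture: a bidirectional LSTM encoder, an LSTM decoder initialized with the encoder's final states, and a dot-product attention mechanism over the encoder outputs.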



In [7]:
# Hyperparameters
embedding_dim = 256
units = 512

# Encoder
encoder_inputs = Input(shape=(None,))
encoder_embedding = Embedding(input_dim=tokenizer.vocab_size, output_dim=embedding_dim)(encoder_inputs)
encoder_lstm = Bidirectional(LSTM(units, return_sequences=True, return_state=True))
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder_lstm(encoder_embedding)

# Concatenate the forward and backward states
state_h = tf.keras.layers.Concatenate()([forward_h, backward_h])
state_c = tf.keras.layers.Concatenate()([forward_c, backward_c])
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))
decoder_embedding = Embedding(input_dim=tokenizer.vocab_size, output_dim=embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(units * 2, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)


# Attention mechanism
attention = Dot(axes=[2, 2])([decoder_outputs, encoder_outputs])
attention_weights = Activation('softmax')(attention)
context_vector = Dot(axes=[2, 1])([attention_weights, encoder_outputs])

# Concatenate context vector with decoder output
decoder_combined_context = Concatenate(axis=-1)([context_vector, decoder_outputs])

# Output layer
decoder_dense = Dense(tokenizer.vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_combined_context)

# Final model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Model summary
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #    Connected to
 input_1 (InputLayer)        [(None, None)]               0          []
 embedding (Embedding)       (None, None, 256)            15474432   ['input_1[0][0]']
 input_2 (InputLayer)        [(None, None)]               0          []
 bidirectional (Bidirection  [(None, None, 1024),         3149824    ['embedding[0][0]']
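The attention block in the model cell (two `Dot` layers around a softmax) can be easier to follow with explicit shapes. The NumPy sketch below mirrors that computation on toy tensors. It is an illustration only, not the Keras layers themselves, and the function name `dot_product_attention` and the toy sizes are assumptions.

```python
import numpy as np

def dot_product_attention(decoder_outputs, encoder_outputs):
    """Dot-product attention over encoder time steps.

    decoder_outputs: (batch, target_len, dim)
    encoder_outputs: (batch, source_len, dim)
    """
    # Similarity of every decoder step with every encoder step: (batch, target_len, source_len)
    scores = np.einsum("btd,bsd->bts", decoder_outputs, encoder_outputs)
    # Softmax over the source axis, shifted by the max for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of encoder outputs: (batch, target_len, dim)
    context = np.einsum("bts,bsd->btd", weights, encoder_outputs)
    return context, weights

rng = np.random.default_rng(0)
dec = rng.normal(size=(2, 3, 8))  # toy sizes; the notebook's model uses dim = 1024
enc = rng.normal(size=(2, 5, 8))
context, weights = dot_product_attention(dec, enc)
print(context.shape, weights.shape)  # (2, 3, 8) (2, 3, 5)
```

Each row of `weights` sums to 1, so every decoder step receives a convex combination of the encoder outputs as its context vector.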

In [5]:
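# Define training parameters
epochs = 1

# Teacher forcing: the decoder input is the target shifted right by one position,
# starting from the pad token (the decoder start token in Marian models), so the
# model predicts token t from the tokens before t instead of copying its own input
def shift_right(ids):
    start = tf.fill([tf.shape(ids)[0], 1], tf.cast(tokenizer.pad_token_id, ids.dtype))
    return tf.concat([start, ids[:, :-1]], axis=1)

# Train the model
history = model.fit([train_input_ids, shift_right(train_target_ids)],
                    train_target_ids,
                    epochs=epochs,
                    validation_data=([test_input_ids, shift_right(test_target_ids)], test_target_ids))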
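In encoder-decoder training with teacher forcing, the decoder input is usually the label sequence shifted right by one position, so that each position is predicted from the tokens before it. The NumPy sketch below shows that shift on two toy rows. The token values are made up, the IDs for `</s>` (0) and `<pad>` (60446) are the ones visible in this notebook's tokenized output, and using the pad token as the start token is an assumption based on the Marian convention.

```python
import numpy as np

def shift_right(target_ids, start_id):
    """Prepend start_id and drop the last token of every row."""
    start = np.full((target_ids.shape[0], 1), start_id, dtype=target_ids.dtype)
    return np.concatenate([start, target_ids[:, :-1]], axis=1)

PAD_ID = 60446  # pad token ID seen in this notebook; assumed to double as the start token
EOS_ID = 0      # "</s>" ID seen in the tokenized example

# Two toy label rows, already padded to the same length
labels = np.array([[11, 12, 13, EOS_ID, PAD_ID],
                   [21, 22, EOS_ID, PAD_ID, PAD_ID]])
decoder_inputs = shift_right(labels, PAD_ID)
print(decoder_inputs)
# [[60446    11    12    13     0]
#  [60446    21    22     0 60446]]
```

At position 0 the decoder sees only the start token and has to predict the first label; at position 1 it sees the first label and has to predict the second, and so on.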




In [8]:
# Evaluate the model using BLEU score
from nltk.translate.bleu_score import sentence_bleu

def evaluate_model(model, test_dataset, tokenizer):
    for inputs, targets in test_dataset.take(1):
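        # Teacher forcing: feed the targets shifted right by one position (starting from
        # the pad token) so position t is predicted without seeing the reference token t
        start = tf.fill([tf.shape(targets)[0], 1], tf.cast(tokenizer.pad_token_id, targets.dtype))
        decoder_inputs = tf.concat([start, targets[:, :-1]], axis=1)
        predictions = model.predict([inputs, decoder_inputs])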
        predicted_sentences = tokenizer.batch_decode(np.argmax(predictions, axis=-1), skip_special_tokens=True)
        reference_sentences = tokenizer.batch_decode(targets, skip_special_tokens=True)

        # BLEU Score for each sentence
        for pred, ref in zip(predicted_sentences, reference_sentences):
            print(f"Reference: {ref}")
            print(f"Prediction: {pred}")
            print(f"BLEU Score: {sentence_bleu([ref.split()], pred.split())}")

evaluate_model(model, test_dataset, tokenizer)


Reference: Emiwendo gy'emmere eya bulijjo gyalinnya mu biseera by'ekirwadde bbunansi.
Prediction: productionTHETHETHETHE BE BE BETHE BETHE nnakaaba marijuana marijuana Saints Saints Saints Saints Saints Saints Saints yeesigamya marijuana marijuana marijuana marijuana marijuana marijuana marijuana marijuana marijuana marijuana BE BE marijuana marijuana marijuana marijuanaTHETHE stressful stressful stressfulambaleambaleambaleambaleambaleambaleambale
BLEU Score: 0
Reference: awo Mukama n'alabikira Sulemaani omulundi ogw'okubiri, nga bwe yamulabikira e Gibyoni.
Prediction: bitegeeza bitegeeza Modif ModifobutaagalaTHETHE stressfulTHETHE yeesigamyaTHE marijuanaTemuli abagaanyi abagaanyi bikemo abagaanyiello Pentat Pentat Pentat testify testify marijuana marijuana testify Saints Saints Saints testify stressful Nicola machinerylts disguisedakoledde abeesiga Pentat Pentat nnakaaba booleka stressful stressful stressful stressfulambaleambaleambaleambale
BLEU Score: 0
Reference: Gavumenti etaddewo

In [9]:
import numpy as np
import tensorflow as tf

def translate_to_luganda(model, tokenizer, max_length=50):
    while True:
        # Ask user for input
        user_input = input("Enter an English statement to translate to Luganda (or 'q' to quit): ")

        # Check if user wants to quit
        if user_input.lower() == 'q':
            print("Thank you for using the translator. Goodbye!")
            break

        # Tokenize the input
        input_ids = tokenizer.encode(user_input, return_tensors="tf", max_length=max_length, padding='max_length', truncation=True)

        # Create a target sequence of the same length filled with padding token ID
        target_ids = tf.ones_like(input_ids) * tokenizer.pad_token_id

        try:
            # Predict
            output = model.predict([input_ids, target_ids])

            # Check if output is empty or all zeros
            if np.all(output == 0):
                print("Error: Model output is all zeros. This might indicate a problem with the model.")
                continue

            # Get the predicted token IDs
            predicted_ids = np.argmax(output[0], axis=-1)

            # Decode the output
            predicted_sentence = tokenizer.decode(predicted_ids, skip_special_tokens=True)

            # Check if predicted sentence is empty
            if not predicted_sentence.strip():
                print("Error: Decoded output is empty. Showing raw prediction:")
                print(predicted_ids)
                continue

            # Print the result
            print(f"English: {user_input}")
            print(f"Luganda: {predicted_sentence}")
            print(f"Raw prediction: {predicted_ids}")
            print()

        except Exception as e:
            print(f"An error occurred: {str(e)}")
            print("Model input shape:", input_ids.shape)
            print("Model output shape:", output.shape if 'output' in locals() else "N/A")



# Use the function
translate_to_luganda(model, tokenizer)

Enter an English statement to translate to Luganda (or 'q' to quit): Hey brian
English: Hey brian
Luganda: THETHETHETHETHETHETHETHETHETHETHETHETHETHETHETHETHETHEambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambale
Raw prediction: [33918 33918 33918 33918 33918 33918 33918 33918 33918 33918 33918 33918
 33918 33918 33918 33918 33918 33918 32089 32089 32089 32089 32089 32089
 32089 32089 32089 32089 32089 32089 32089 32089 32089 32089 32089 32089
 32089 32089 32089 32089 32089 32089 32089 32089 32089 32089 32089 32089
 32089 32089]

Enter an English statement to translate to Luganda (or 'q' to quit): q
Thank you for using the translator. Goodbye!
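The function above asks the model for all 50 positions in a single call, with a decoder input that consists only of padding tokens. Seq2seq models are more commonly decoded autoregressively: start from a start token, predict one token, append it, and repeat until the end-of-sequence token appears. The sketch below shows that loop in isolation. `next_token_probs` is a hypothetical callable; with this notebook's model it could wrap a call such as `model([input_ids, decoder_ids], training=False)` and return the probabilities at the last position.

```python
import numpy as np

def greedy_decode(next_token_probs, start_id, eos_id, max_length=50):
    """Generate token IDs one at a time, always taking the most probable next token.

    next_token_probs: hypothetical callable mapping the list of IDs generated so far
    to a probability vector over the vocabulary for the next position.
    """
    generated = [start_id]
    for _ in range(max_length):
        probs = next_token_probs(generated)
        next_id = int(np.argmax(probs))
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated[1:]  # drop the start token

# Toy stand-in for the model: always continue with (last ID + 1) over a 6-token vocabulary
def toy_next_token_probs(generated):
    probs = np.full(6, 0.01)
    probs[(generated[-1] + 1) % 6] = 0.95
    return probs

print(greedy_decode(toy_next_token_probs, start_id=0, eos_id=4))  # [1, 2, 3, 4]
```

The loop stops as soon as the end-of-sequence ID is produced, so the output length follows the content instead of always being `max_length`.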


In [12]:
model.summary()



In [13]:
print("Vocabulary size:", tokenizer.vocab_size)


Vocabulary size: 60447
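The vocabulary size explains most of the model's parameter count. As a quick arithmetic check, the snippet below recomputes the embedding and bidirectional LSTM parameter counts from the hyperparameters in the model cell; the results agree with the `Param #` column of the model summary (15,474,432 and 3,149,824).

```python
vocab_size = 60447   # printed by the cell above
embedding_dim = 256  # hyperparameters from the model cell
units = 512

# Embedding: one embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + units) * units + units)
bidirectional_params = 2 * lstm_params

print(embedding_params)      # 15474432
print(bidirectional_params)  # 3149824
```

The decoder embedding has the same size as the encoder embedding, because both are built with `input_dim=tokenizer.vocab_size`.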


In [17]:
import tensorflow as tf
import numpy as np

def translate_english_to_luganda(model, tokenizer, text, max_length=50, temperature=0.7):
    def print_debug_info(input_ids, output):
        print("Tokenized input:", input_ids)
        print("Decoded input:", tokenizer.decode(input_ids[0]))
        print("Model input shape:", input_ids.shape)
        print("Model output shape:", output.shape if output is not None else "N/A")

    def basic_prediction(input_ids):
        target_ids = tf.ones_like(input_ids) * tokenizer.pad_token_id
        return model.predict([input_ids, target_ids])

    def beam_search_prediction(input_ids, beam_size=3):
        encoder_input = input_ids
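        # Marian tokenizers define no BOS token (bos_token_id is None); start from the pad token instead
        decoder_input = tf.constant([[tokenizer.pad_token_id]], dtype=input_ids.dtype)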

        def decoder_step(decoder_input):
            return model([encoder_input, decoder_input], training=False)

        beam = [(decoder_input, 0)]
        for _ in range(max_length):
            candidates = []
            for seq, score in beam:
                if seq[0][-1] == tokenizer.eos_token_id:
                    candidates.append((seq, score))
                    continue
                predictions = decoder_step(seq)
                top_k = tf.math.top_k(predictions[0, -1], k=beam_size)
                for i in range(beam_size):
                    new_seq = tf.concat([seq, tf.expand_dims([top_k.indices[i]], 0)], axis=-1)
                    new_score = score + tf.math.log(top_k.values[i])
                    candidates.append((new_seq, new_score))
            beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
            if all(seq[0][-1] == tokenizer.eos_token_id for seq, _ in beam):
                break
        # Return the best sequence as a 1-D array of token IDs
        return beam[0][0][0].numpy()

    def temperature_sampling_prediction(input_ids):
        encoder_input = input_ids
        # Marian tokenizers define no BOS token; start decoding from the pad token instead
        decoder_input = tf.constant([[tokenizer.pad_token_id]], dtype=input_ids.dtype)

        for _ in range(max_length):
            probs = model([encoder_input, decoder_input], training=False)[:, -1, :]
            # The model outputs softmax probabilities, but tf.random.categorical expects logits
            logits = tf.math.log(probs + 1e-9) / temperature
            predicted_id = tf.cast(tf.random.categorical(logits, num_samples=1), decoder_input.dtype)
            decoder_input = tf.concat([decoder_input, predicted_id], axis=-1)
            if int(predicted_id[0, 0]) == tokenizer.eos_token_id:
                break
        return decoder_input

    # Tokenize input
    input_ids = tokenizer.encode(text, return_tensors="tf", max_length=max_length, padding='max_length', truncation=True)

    print_debug_info(input_ids, None)

    try:
        # Try basic prediction
        output = basic_prediction(input_ids)
        predicted_ids = np.argmax(output[0], axis=-1)

        # If basic prediction fails, try beam search
        if np.all(predicted_ids == predicted_ids[0]):
            print("Basic prediction failed. Trying beam search...")
            predicted_ids = beam_search_prediction(input_ids)

        # If beam search fails, try temperature sampling
        if np.all(predicted_ids == predicted_ids[0]):
            print("Beam search failed. Trying temperature sampling...")
            predicted_ids = temperature_sampling_prediction(input_ids)[0].numpy()

        predicted_sentence = tokenizer.decode(predicted_ids, skip_special_tokens=True)

        if not predicted_sentence.strip():
            raise ValueError("Decoded output is empty")

        print(f"English: {text}")
        print(f"Luganda: {predicted_sentence}")
        print(f"Raw prediction: {predicted_ids}")

        return predicted_sentence

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        print_debug_info(input_ids, output if 'output' in locals() else None)
        return None

model_used = model
tokenizer_used = tokenizer

result = translate_english_to_luganda(model_used, tokenizer_used, "Hello, how are you?")


Tokenized input: tf.Tensor(
[[   18 43450     3   145    49    39    10     0 60446 60446 60446 60446
  60446 60446 60446 60446 60446 60446 60446 60446 60446 60446 60446 60446
  60446 60446 60446 60446 60446 60446 60446 60446 60446 60446 60446 60446
  60446 60446 60446 60446 60446 60446 60446 60446 60446 60446 60446 60446
  60446 60446]], shape=(1, 50), dtype=int32)
Decoded input: Hello, how are you?</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Model input shape: (1, 50)
Model output shape: N/A
English: Hello, how are you?
Luganda: THETHETHETHETHETHETHETHETHETHEambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambaleambal
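Temperature sampling rescales the model's next-token distribution before drawing a token: temperatures below 1 sharpen it, temperatures above 1 flatten it. Because the model's final layer is a softmax, its outputs are probabilities, so the sketch below takes their logarithm before dividing by the temperature. This is a NumPy illustration on a made-up three-token distribution; `sample_with_temperature` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def sample_with_temperature(probs, temperature, rng):
    """Sample one token ID from a probability vector after temperature scaling."""
    # Convert probabilities to logits, scale them, then renormalize with a stable softmax
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-9) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled)), scaled

rng = np.random.default_rng(0)
probs = [0.6, 0.3, 0.1]
_, sharper = sample_with_temperature(probs, temperature=0.7, rng=rng)
_, flatter = sample_with_temperature(probs, temperature=2.0, rng=rng)
print(sharper.round(3))  # more mass on the most likely token than in probs
print(flatter.round(3))  # mass spread more evenly than in probs
```

With `temperature=0.7`, the value used as the default in `translate_english_to_luganda`, the most likely token becomes even more likely to be drawn.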