<a href="https://colab.research.google.com/github/jwmathis/transformer_nn_model/blob/main/Transformer_Model_arabic_to_english.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Model: Arabic to English Translation
* TYPE OF MODEL: This notebook uses a transformer, a significant advance in sequence-to-sequence tasks, especially in natural language processing. Its self-attention mechanism lets the model focus on different parts of the input sequence, capturing long-range dependencies more effectively than traditional RNNs and LSTMs. It also processes all positions of a sequence in parallel, which speeds up training.
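
To make the self-attention idea concrete, here is a minimal sketch of single-head scaled dot-product attention. It is an illustration only: the function name and the toy tensor sizes are assumptions, and the model in this notebook relies on PyTorch's built-in multi-head implementation instead.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention over tensors of shape (seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every position with every other position
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Each row becomes a probability distribution over the positions
    weights = torch.softmax(scores, dim=-1)
    # Each output is a weighted mix of the value vectors
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(5, 16)  # hypothetical sequence: 5 tokens, 16 features each
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape)         # torch.Size([5, 16])
print(weights.shape)     # torch.Size([5, 5])
```

Every token's weights are computed in one matrix product, which is why the whole sequence can be processed in parallel rather than step by step as in an RNN.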


### 1. Importing Dependencies

In [None]:
!pip install torchtext==0.15.2
!pip install datasets


In [None]:
import torch
import torch.nn as nn
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
import torch.optim as optim
import nltk
from nltk.tokenize import word_tokenize
from datasets import load_dataset


### 2. Data Preprocessing
* DATA PREPROCESSING: The dataset comes from Hugging Face, where it is already split into training and test sets. NLTK tokenizes the Arabic and English text, and custom-built vocabularies then convert the tokens into the numerical representations the transformer is trained on.


In [None]:
# Import the dataset from huggingface
ds = load_dataset("mohamed-khalil/ATHAR")


Generating train split: 65043 examples
Generating test split: 1000 examples

In [None]:
# Split the data
# HuggingFace already has this split into a 'train' set and a 'test' set
train_data = ds['train']
test_data = ds['test']

In [None]:
"""Defining Tokenization functions"""
nltk.download('punkt')  # downloads the Punkt models that word_tokenize() relies on
# Newer NLTK releases look for 'punkt_tab' instead; download that resource if word_tokenize() raises a LookupError

# Thin wrappers around word_tokenize(); not strictly necessary, but they give each language its own named tokenizer
def tokenize_arabic(text):
    return word_tokenize(text)

def tokenize_eng(text):
    return word_tokenize(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# Block for Testing Tokenization
arabic_text = "مرحبا بالعالم! هذا هو اختباري الأول."
arabic_tokens = tokenize_arabic(arabic_text)
print("Arabic tokens: ", arabic_tokens)

english_text = "Hello world! This is my first test."
english_tokens = tokenize_eng(english_text)
print("English tokens: ", english_tokens)

Arabic tokens:  ['مرحبا', 'بالعالم', '!', 'هذا', 'هو', 'اختباري', 'الأول', '.']
English tokens:  ['Hello', 'world', '!', 'This', 'is', 'my', 'first', 'test', '.']


In [None]:
"""Processing Functions"""
# Convert text to lowercase, tokenize it and add <sos> and <eos> to tokens
def process_text(text, tokenizer):
  tokens = tokenizer(text.lower())
  tokens = ["<sos>"] + tokens + ["<eos>"]
  return tokens

# yields processed tokens for each text in the dataset
def yield_tokens(data_iter, tokenizer):
  for text in data_iter:
    yield process_text(text, tokenizer)

# Converts tokens into numerical representations using a vocab
def numericalize(text, tokenizer, vocab):
  tokens = process_text(text, tokenizer)
  return [vocab[token] for token in tokens]

In [None]:
# Example using the above functions
mock_vocab = {
    "<pad>": 0,
    "<sos>": 1,
    "<eos>": 2,
    "<unk>": 3,
    "أحب": 4,
    "تعلم": 5,
    "البرمجة": 6
}
text = "أحب تعلم البرمجة"
numericalized_text = numericalize(text, tokenize_arabic, mock_vocab)
print(numericalized_text)  # [token indices]

[1, 4, 5, 6, 2]


In [None]:
"""Create vocabularies"""
# Prepare token iterators
train_arabic_tokens = yield_tokens((example['arabic'] for example in train_data), tokenize_arabic)
train_english_tokens = yield_tokens((example['english'] for example in train_data), tokenize_eng)

# Build vocabularies; <unk> is reserved for tokens that never appear in the training data
arabic_vocab = build_vocab_from_iterator(train_arabic_tokens, specials=["<pad>", "<sos>", "<eos>", "<unk>"])
english_vocab = build_vocab_from_iterator(train_english_tokens, specials=["<pad>", "<sos>", "<eos>", "<unk>"])

# Map out-of-vocabulary tokens to <unk> rather than <pad>, so they are not treated as padding
arabic_vocab.set_default_index(arabic_vocab["<unk>"])
english_vocab.set_default_index(english_vocab["<unk>"])

# Get vocab sizes
src_vocab_size = len(arabic_vocab)
trg_vocab_size = len(english_vocab)

# Get <pad> token index
src_pad_idx = arabic_vocab["<pad>"]
trg_pad_idx = english_vocab["<pad>"]

In [None]:
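from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class TranslationDataset(Dataset):
  def __init__(self, data, arabic_vocab, english_vocab):
    self.data = data
    self.arabic_vocab = arabic_vocab
    self.english_vocab = english_vocab

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    arabic_text = self.data[idx]['arabic']
    english_text = self.data[idx]['english']

    # Numericalize the text
    arabic_numericalized = numericalize(arabic_text, tokenize_arabic, self.arabic_vocab)
    english_numericalized = numericalize(english_text, tokenize_eng, self.english_vocab)

    return torch.tensor(arabic_numericalized), torch.tensor(english_numericalized)

# Assumed cap on tokens per sentence, chosen to match the model's max_len of 100
# so that position indices stay within the position embedding table
max_seq_len = 100

# Sentences differ in length, so pad each batch to its longest sequence.
# pad_sequence returns shape (seq_len, batch_size), the layout the model expects.
def collate_batch(batch):
  arabic_batch = [arabic[:max_seq_len] for arabic, _ in batch]
  english_batch = [english[:max_seq_len] for _, english in batch]
  arabic_batch = pad_sequence(arabic_batch, padding_value=src_pad_idx)
  english_batch = pad_sequence(english_batch, padding_value=trg_pad_idx)
  return arabic_batch, english_batch

train_dataset = TranslationDataset(train_data, arabic_vocab, english_vocab)
test_dataset = TranslationDataset(test_data, arabic_vocab, english_vocab)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_batch)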

* MODEL ARCHITECTURE: The model has separate embedding layers for the Arabic and English inputs. Position embeddings are added to account for the order of tokens within the sequence. The transformer comprises six encoder and six decoder layers, each using eight attention heads. To combat overfitting, dropout is applied throughout the model, and the output layer is a linear layer that maps the transformer's outputs to the target vocabulary for English translation. The architecture is based on the paper "Attention Is All You Need".


In [None]:
"""Defining Transformer Model"""
class Transformer(nn.Module):
    def __init__(
        self,
        embedding_size,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        num_heads,
        num_encoder_layers,
        num_decoder_layers,
        forward_expansion,
        dropout,
        max_len,
        device
    ):
        super(Transformer, self).__init__()
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embedding_size)
        self.trg_position_embedding = nn.Embedding(max_len, embedding_size)
        self.device = device
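        self.transformer = nn.Transformer(
            d_model=embedding_size,
            nhead=num_heads,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            # nn.Transformer expects the feed-forward width itself, so scale the multiplier by the embedding size
            dim_feedforward=forward_expansion * embedding_size,
            dropout=dropout,
        )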

        self.fc_out = nn.Linear(embedding_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.src_pad_idx = src_pad_idx

    def make_src_mask(self, src):
        # src shape: (src_len, N)
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        # output for Pytorch (N, src_len)
        return src_mask

    def forward(self, src, trg):
        src_seq_length, N = src.shape
        trg_seq_length, N = trg.shape

        # Position indices 0..src_len-1, repeated for every sequence in the batch
        src_positions = (
            torch.arange(0, src_seq_length).unsqueeze(1).expand(src_seq_length, N).to(self.device)
        )

        trg_positions = (
            torch.arange(0, trg_seq_length).unsqueeze(1).expand(trg_seq_length, N).to(self.device)
        )

        embed_src = self.dropout(
            (self.src_word_embedding(src) + self.src_position_embedding(src_positions))
        )

        embed_trg = self.dropout(
            (self.trg_word_embedding(trg) + self.trg_position_embedding(trg_positions))
        )

        src_padding_mask = self.make_src_mask(src)
        trg_mask = self.transformer.generate_square_subsequent_mask(trg_seq_length).to(self.device)

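        out = self.transformer(
            embed_src,
            embed_trg,
            src_key_padding_mask=src_padding_mask,
            # also hide padded source positions from the decoder's cross-attention
            memory_key_padding_mask=src_padding_mask,
            tgt_mask=trg_mask,
        )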

        return self.fc_out(out)
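
The `trg_mask` built in `forward` is a causal mask: it stops each target position from attending to later positions. The sketch below rebuilds such a mask by hand with `torch.triu` to show its effect on attention weights. The helper name `causal_mask` and the uniform toy scores are assumptions made for illustration, not part of the notebook's model.

```python
import torch

def causal_mask(size):
    # -inf above the diagonal (future positions), 0 on and below it
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

mask = causal_mask(4)

# Hypothetical attention scores where every position is equally relevant
scores = torch.zeros(4, 4)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i spreads its weight evenly over positions 0..i and gives 0 to the rest
```

Adding `-inf` before the softmax drives the weight of future positions to exactly zero, so at training time the decoder cannot copy the token it is supposed to predict.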

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
load_model = False
save_model = True

# Training Hyperparameters
num_epochs = 5
learning_rate = 3e-4
batch_size = 32

# Model Hyperparameters
embedding_size = 512
num_heads = 8
num_encoder_layers = 6
num_decoder_layers = 6
dropout = 0.10
max_len = 100
forward_expansion = 4

model = Transformer(
    embedding_size,
    src_vocab_size,
    trg_vocab_size,
    src_pad_idx,
    num_heads,
    num_encoder_layers,
    num_decoder_layers,
    forward_expansion,
    dropout,
    max_len,
    device,
).to(device)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = nn.CrossEntropyLoss(ignore_index=english_vocab["<pad>"])

* TRAINING PROCESS: Training uses a learning rate of 3e-4, a batch size of 32, and a total of five epochs. The loss function is cross-entropy loss, which ignores padding tokens so that they do not distort the loss. During each iteration of the training loop, the model processes a batch of data, performs a forward pass to generate predictions, calculates the loss, and updates the model weights through backpropagation. This is the standard training procedure found in many examples and tutorials.
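
The loop that follows gives the decoder the target without its last token (`en_batch[:-1, :]`) and scores the output against the target without its first token (`en_batch[1:, :]`). Here is a small sketch of that shift and of how `ignore_index` drops padded positions from the loss. The toy batch, the vocabulary size of 8, and the index values (`<pad>`=0, `<sos>`=1, `<eos>`=2) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical target batch of shape (seq_len, batch): <pad>=0, <sos>=1, <eos>=2
trg = torch.tensor([[1, 1],
                    [5, 7],
                    [6, 2],
                    [2, 0]])

decoder_input = trg[:-1, :]  # what the decoder sees at each step
labels = trg[1:, :]          # what it should predict at each step

print(decoder_input.T.tolist())  # [[1, 5, 6], [1, 7, 2]]
print(labels.T.tolist())         # [[5, 6, 2], [7, 2, 0]]

# Random logits stand in for model output of shape (seq_len, batch, vocab_size)
vocab_size = 8
torch.manual_seed(0)
logits = torch.randn(labels.shape[0], labels.shape[1], vocab_size)

# ignore_index=0 removes the padded label from the average
loss_fn = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
print(loss.item())
```

Of the six label positions here, only the five non-padding ones contribute to the averaged loss.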


In [None]:
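"""Training Loop"""

model.train()  # enable dropout during training
for epoch in range(num_epochs):
  for ar_batch, en_batch in train_loader:
    # The model expects batches of shape (seq_len, batch_size)
    ar_batch, en_batch = ar_batch.to(device), en_batch.to(device)

    # The decoder is fed the target without its last token
    output = model(ar_batch, en_batch[:-1, :])

    # Predictions are scored against the target shifted one step ahead
    loss = loss_fn(output.reshape(-1, output.shape[2]), en_batch[1:, :].reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Epoch: {epoch}, Loss: {loss.item()}")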

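Note: without a collate function that pads each batch to a common length, this loop fails with `RuntimeError: stack expects each tensor to be equal size, but got [30] at entry 0 and [12] at entry 1`, because the default `DataLoader` collation tries to stack sentences of different lengths.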
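
The size mismatch reported above (lengths 30 and 12) can be reproduced and resolved in isolation. This is a minimal sketch, assuming `pad_sequence` from `torch.nn.utils.rnn` and a pad index of 0; the two tensors are hypothetical stand-ins for numericalized sentences.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed pad index for this illustration

# Two hypothetical numericalized sentences with the lengths from the error message
a = torch.arange(1, 31)  # length 30
b = torch.arange(1, 13)  # length 12

try:
    torch.stack([a, b])  # what the default collation attempts
except RuntimeError as err:
    print("stack failed:", err)

# Pad the shorter sentence up to the longest one in the batch
padded = pad_sequence([a, b], padding_value=PAD_IDX)
print(padded.shape)  # torch.Size([30, 2])
```

The result has shape `(seq_len, batch_size)` because `pad_sequence` defaults to `batch_first=False`, which matches the layout that `nn.Transformer` expects by default.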

* EVALUATION: Metrics such as the BLEU score measure the quality of the translations the model produces. Depending on the results, potential improvements involve fine-tuning the hyperparameters or applying data augmentation techniques to enrich the training data.
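
As a rough sketch of how a BLEU evaluation could look, the snippet below uses NLTK's `corpus_bleu`. The sentences are invented placeholders; in practice the hypotheses would come from decoding the test set with the trained model, which this notebook does not yet do. The choice of `method1` smoothing is also an assumption.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical tokenized data: one list of reference translations per sentence
references = [
    [["the", "book", "is", "on", "the", "table"]],
    [["he", "went", "to", "the", "market", "yesterday"]],
]
# Hypothetical model outputs, tokenized the same way as the references
hypotheses = [
    ["the", "book", "is", "on", "the", "table"],
    ["he", "went", "to", "market", "yesterday"],
]

# Smoothing avoids a zero score when some n-gram order has no matches
smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"Corpus BLEU: {score:.3f}")
```

Corpus-level BLEU pools n-gram counts over all sentences before combining them, which is usually more stable than averaging per-sentence scores.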
