<a href="https://colab.research.google.com/github/joat26/NLP/blob/main/Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab Project on Recurrent Neural Networks**

This lab project is part of the Advanced Natural Language Processing course for the academic year 2024/2025. The objective is to explore and implement Recurrent Neural Network (RNN) models for text translation between English and Moroccan Darija (the Arabic dialect spoken in Morocco). Using a dataset available on Hugging Face, we build and compare several LSTM-based models to achieve effective translation.



---


## **Tasks**

---



### **1.  Dataset**

*   The chosen dataset contains paired sentences in English and Darija, sourced from Hugging Face: Darija-English Dataset (imomayiz/darija-english).
*   The dataset allows us to train, validate, and test models for accurate bidirectional text translation.




### **2.   Modeling Tasks:**

*   Baseline Model: Implement a vanilla LSTM-based model to serve as the benchmark for performance.
*   Advanced Models: Experiment with advanced LSTM variations, such as Peephole and Working Memory connections, to improve performance.
*   Hyperparameter Tuning: Evaluate the influence of parameters like learning rate, optimizer type, batch size, and initialization strategies on the model's performance.





### **3. Evaluation:**



* Performance metrics such as BLEU scores and loss functions are used to assess translation quality.
* Comparative analysis of the baseline and advanced models is conducted to highlight improvements.

## **Code**

---

This notebook demonstrates the process of building and optimizing text translation models using modern NLP tools and techniques while addressing challenges unique to Moroccan Darija.

---



In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

Import the core PyTorch modules used throughout the notebook.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim



Install PyTorch and its companion libraries with GPU support (CUDA 11.8).

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118


The code imports the Hugging Face load_dataset function and loads the "sentences" configuration of the Darija-English dataset. Each entry pairs an English sentence (eng) with its Darija counterpart written in Latin script (darija), plus a darija_ar field that, judging by its name, holds the Arabic-script form. The data arrives as a single split named "sentences"; the training, validation, and test splits are created later in the notebook. The dataset suits building and evaluating translation models, such as LSTMs or transformers, for a low-resource language like Moroccan Darija.

In [None]:
from datasets import load_dataset
dataset = load_dataset("imomayiz/darija-english", "sentences")



README.md:   0%|          | 0.00/348 [00:00<?, ?B/s]

sentences.csv:   0%|          | 0.00/6.34M [00:00<?, ?B/s]

Generating sentences split: 0 examples [00:00, ? examples/s]

The code defines a function, remove_missing_data, that checks that the darija, eng, and darija_ar fields of a dataset entry are all not None. It then applies this function to the "sentences" split with the .filter() method, removing any rows with missing values. Of the 87,785 rows in the split, 12,743 remain (the total of the split sizes printed later in the notebook). This ensures the dataset contains only complete examples for training and evaluation.

In [None]:
def remove_missing_data(example):
    return example["darija"] is not None and example["eng"] is not None and example["darija_ar"] is not None

dataset = dataset["sentences"].filter(remove_missing_data)

Filter:   0%|          | 0/87785 [00:00<?, ? examples/s]

The code prepares tokenized input (X) and target (Y) data for machine translation by iterating through the dataset. For each example, it converts the darija and eng sentences to lowercase and splits them on whitespace using .lower().split(). The tokenized Darija words are stored in X_sentence, and the English words in Y_sentence. These lists are appended to X and Y, creating two parallel datasets where X contains Darija sentences and Y contains their English translations. Because the split is on whitespace only, punctuation stays attached to the neighbouring word, so tokens such as 'ntsenna?' and 'daba.' end up as separate vocabulary entries.

In [None]:
X = []
Y = []

for sentence in dataset:
    X_sentence = []
    Y_sentence = []
    X_sentence.extend(sentence["darija"].lower().split())
    Y_sentence.extend(sentence["eng"].lower().split())

    X.append(X_sentence)
    Y.append(Y_sentence)
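
One possible refinement, shown here only as a sketch, is a tokenizer that separates punctuation from words so that 'daba' and 'daba.' share one vocabulary entry. The tokenize helper and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches runs of letters and digits (Latin-script Darija uses digits
    # such as 7 and 9 as letters); [^\w\s] matches a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Illustrative sentence, not taken from the dataset
print(tokenize("Wach nta ghadi daba?"))
```

Note that this simple pattern also splits apostrophes off (for example "don't" becomes three tokens), which may or may not be desirable for the English side.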


The code computes the size of the Darija vocabulary by first creating a set, darija_vocab, which contains all unique words in the tokenized sentences stored in X, converting each word to lowercase to ensure case-insensitivity. It then calculates the total number of unique words, num_darija, as the length of this set. Finally, it prints the vocabulary size, providing an understanding of the dataset's language variety and word diversity.

In [None]:
darija_vocab = {word.lower() for sentence in X for word in sentence}
num_darija = len(darija_vocab)

print("Darija Vocabulary size: {}".format(num_darija))

Darija Vocabulary size: 17282


The code calculates the size of the English vocabulary by creating a set, english_vocab, containing all unique words from the tokenized sentences in Y, with each word converted to lowercase to ensure case-insensitivity. It then determines the total number of unique words, num_english, as the length of this set. Finally, it prints the vocabulary size, providing insight into the diversity of English words in the dataset.

In [None]:
english_vocab = {word.lower() for sentence in Y for word in sentence}
num_english = len(english_vocab)

print("English Vocabulary size: {}".format(num_english))

English Vocabulary size: 8079


The code generates a dictionary, darija_to_ix, that maps each unique word in the Darija vocabulary to a unique numerical index. It iterates through all the words in darija_vocab and assigns an index to any word not already in the dictionary, using the current size of the dictionary (len(darija_to_ix)) as the index. This ensures that every word in the vocabulary has a distinct numeric representation, which is critical for preparing data for machine learning models. Finally, the dictionary is printed to display the word-to-index mappings.

In [None]:
darija_to_ix = {}

for word in darija_vocab:
  if word not in darija_to_ix:
    darija_to_ix[word] = len(darija_to_ix)

print(darija_to_ix)

{'jjayya': 0, 'kay7ell': 1, '9tra': 2, 'jbt': 3, 'khssna': 4, 'lbid': 5, 'ktbt': 6, 'tblagh': 7, 'makaykhdamch': 8, 'sinin': 9, 'ntsenna?': 10, 'lfransawiyin': 11, 'balik': 12, 'ftaslo9': 13, 'mamezyanach': 14, 'mkellekh': 15, 'is3d': 16, 'tlmozari3in': 17, 'botolat': 18, 'drnaha': 19, 'tajarib': 20, 'lou9id!': 21, 'm3rouf': 22, 'nktachfou': 23, 'smiytek': 24, 'barda': 25, 'modmin': 26, 'bddebt': 27, 't3lm8om': 28, 'kanchekrouk': 29, 'ncomandi': 30, 'ddlam?': 31, 'madrribin': 32, 'mata3im': 33, 'ppyano': 34, 'katayeb': 35, 'bkhizo': 36, '7kem': 37, '9ette3ti': 38, 'ja.': 39, 'lmalal': 40, 'sokkan': 41, 'b7alb7al': 42, 'fih.': 43, 'te9der': 44, 'wllaft': 45, 'l5outa': 46, 'nbki!': 47, 'mafhamtch?': 48, 'katmata3': 49, 'anshita,': 50, 'sat7': 51, 'ikhasni': 52, '8a8oma': 53, 'ma3marni': 54, 'lbagage': 55, 'momti3!': 56, '7asan': 57, 'daba.': 58, 'cheta': 59, 'sebbat': 60, 't7emmel': 61, 'b3ida?': 62, 'nf8em': 63, 'ndowwech': 64, 'rrisala?': 65, 'solde': 66, 'mafrasi': 67, 'lla': 68, 'dkh

The code creates a dictionary, eng_to_ix, which maps each unique word in the English vocabulary to a unique numerical index. It iterates through all words in the english_vocab and assigns an index to any word not already in the dictionary, using the current size of eng_to_ix as the index. This ensures that every word in the English vocabulary has a distinct numeric representation, which is essential for processing the data in machine learning models. The resulting word-to-index mapping is then printed.

In [None]:
eng_to_ix = {}

for word in english_vocab:
  if word not in eng_to_ix:
    eng_to_ix[word] = len(eng_to_ix)

print(eng_to_ix)
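
Both mappings above start numbering real words at 0, while the padding step later in the notebook also uses 0 as its fill value. A common way to avoid that collision is to reserve dedicated indices for special tokens. The sketch below assumes two hypothetical tokens, "<pad>" and "<unk>", and a hypothetical build_vocab helper; it is not the notebook's implementation.

```python
PAD_TOKEN, UNK_TOKEN = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(sentences):
    """Map each word to an index, reserving 0 for padding and 1 for unknown words."""
    word_to_ix = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for sentence in sentences:
        for word in sentence:
            if word not in word_to_ix:
                word_to_ix[word] = len(word_to_ix)
    return word_to_ix

def encode(sentence, word_to_ix):
    """Convert words to indices, falling back to the unknown-word index."""
    return [word_to_ix.get(word, word_to_ix[UNK_TOKEN]) for word in sentence]

# Toy sentences for illustration only
toy = [["salam", "labas"], ["labas", "3lik"]]
vocab = build_vocab(toy)
print(vocab)
print(encode(["salam", "bikhir"], vocab))
```

Iterating over the sentences in order, rather than over a set, also makes the word-to-index assignment reproducible between runs.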



The code splits the dataset into training, testing, and validation sets using scikit-learn's train_test_split function. It first divides the data into 80% training and 20% testing, with a fixed random seed (random_state=4) for reproducibility. It then splits the training portion again into 80% training and 20% validation, so the final proportions are roughly 64% training, 16% validation, and 20% testing. The sizes of the resulting splits are printed, showing the number of sequences in each set.

In [None]:
from sklearn.model_selection import train_test_split

SPLIT_SIZE = 0.2

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=SPLIT_SIZE, random_state=4)

X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=SPLIT_SIZE, random_state=4)

print("TRAINING DATA")
print('Number of sequences: {}'.format(len(X_train)))
print("-"*50)
print("TESTING DATA")
print('Number of sequences: {}'.format(len(X_test)))
print("-"*50)
print("VALIDATION DATA")
print('Number of sequences: {}'.format(len(X_val)))

TRAINING DATA
Number of sequences: 8155
--------------------------------------------------
TESTING DATA
Number of sequences: 2549
--------------------------------------------------
VALIDATION DATA
Number of sequences: 2039
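
The printed sizes can be checked with a little arithmetic. The sketch below assumes that train_test_split rounds a fractional test size up to the next whole example, which is consistent with the numbers above.

```python
import math

SPLIT_SIZE = 0.2
n_total = 8155 + 2549 + 2039  # sum of the three sizes printed above

# Assumption: the held-out part is rounded up, the remainder stays in training
n_test = math.ceil(SPLIT_SIZE * n_total)
n_val = math.ceil(SPLIT_SIZE * (n_total - n_test))
n_train = n_total - n_test - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} ({n / n_total:.1%})")
```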


The VanillaLSTMCell class implements a custom LSTM cell in PyTorch, defining the computations for the input gate, forget gate, output gate, and candidate cell state. Initialization sets up linear transformations for the input and hidden states, plus a learnable bias parameter for each gate. The input-side nn.Linear layers keep their default bias as well, so each gate effectively sums two bias vectors; this is redundant but harmless. In the forward pass, the gates are computed using sigmoid activations, and the candidate cell state is calculated with a tanh activation. The cell state is updated by combining the forget gate, input gate, and candidate state, while the hidden state is updated using the output gate and the updated cell state. Writing the cell by hand makes it easy to modify the gate equations in later experiments.
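
For reference, the forward pass of this cell corresponds to the standard LSTM equations below, written with the same weight names as the code. Here $x_t$ is the input embedding, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and cell states, $\sigma$ is the sigmoid, and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
\bar{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \bar{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$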

In [None]:
import torch
import torch.nn as nn

class VanillaLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(VanillaLSTMCell, self).__init__()

        # Linear transformations for each gate
        self.W_hi = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_xi = nn.Linear(input_size, hidden_size)
        self.b_i = nn.Parameter(torch.zeros(hidden_size))  # Input gate bias

        self.W_hf = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_xf = nn.Linear(input_size, hidden_size)
        self.b_f = nn.Parameter(torch.zeros(hidden_size))  # Forget gate bias

        self.W_ho = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_xo = nn.Linear(input_size, hidden_size)
        self.b_o = nn.Parameter(torch.zeros(hidden_size))  # Output gate bias

        self.W_hc = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_xc = nn.Linear(input_size, hidden_size)
        self.b_c = nn.Parameter(torch.zeros(hidden_size))  # Candidate cell bias

    def forward(self, x, hidden, cell):
        # Input, forget, and output gates
        i = torch.sigmoid(self.W_xi(x) + self.W_hi(hidden) + self.b_i)
        f = torch.sigmoid(self.W_xf(x) + self.W_hf(hidden) + self.b_f)
        o = torch.sigmoid(self.W_xo(x) + self.W_ho(hidden) + self.b_o)

        # Candidate cell state
        c_bar = torch.tanh(self.W_xc(x) + self.W_hc(hidden) + self.b_c)

        # Update the cell state and the hidden state
        cell_new = f * cell + i * c_bar
        hidden_new = o * torch.tanh(cell_new)

        return hidden_new, cell_new

The TextTrans class defines a neural network for text translation using embeddings, a custom LSTM cell (VanillaLSTMCell), and a linear output layer. It initializes word embeddings to convert input tokens into dense vectors, the custom LSTM cell to process the sequence, and a linear layer to map the LSTM's hidden state to the output vocabulary. In the forward pass, the input sentence is transformed into embeddings, and the cell processes each embedding in turn, updating its hidden and cell states. At every step the hidden state is passed through the linear layer to produce scores over the target vocabulary; the per-step scores are stacked and returned. The model therefore emits exactly one prediction per input position, so the output sequence always has the same length as the (padded) input.

In [None]:
class TextTrans(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(TextTrans, self).__init__()
        self.hidden_dim = hidden_dim

        # Embedding layer for the input words
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Custom LSTM cell
        self.lstm_cell = VanillaLSTMCell(embedding_dim, hidden_dim)

        # Output layer
        self.fc = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        # Start from zero states on the same device as the input (not hard-coded to 'cuda')
        hidden = torch.zeros(1, self.hidden_dim, device=embeds.device)
        cell = torch.zeros(1, self.hidden_dim, device=embeds.device)

        outputs = []

        for embed in embeds:
            hidden, cell = self.lstm_cell(embed, hidden, cell)
            output = self.fc(hidden)
            outputs.append(output)

        outputs = torch.stack(outputs)
        return outputs
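
To make the shapes concrete, the following sketch reproduces the same token-by-token loop on the CPU with toy sizes. It substitutes PyTorch's built-in nn.LSTMCell for the custom cell so the example is self-contained; the sizes and the input indices are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_in, vocab_out, emb_dim, hid_dim = 10, 7, 4, 3  # toy sizes (assumed)

embedding = nn.Embedding(vocab_in, emb_dim)
cell = nn.LSTMCell(emb_dim, hid_dim)  # built-in stand-in for the custom cell
fc = nn.Linear(hid_dim, vocab_out)

sentence = torch.tensor([2, 5, 1, 0, 0])  # three tokens followed by two padding zeros
hidden = torch.zeros(1, hid_dim)
state = torch.zeros(1, hid_dim)

outputs = []
for embed in embedding(sentence):  # one step per input position
    hidden, state = cell(embed.unsqueeze(0), (hidden, state))
    outputs.append(fc(hidden))
outputs = torch.stack(outputs)

print(outputs.shape)  # (sequence length, 1, target vocabulary size)
```

The first dimension of the result equals the input length, which is why input and target must be padded to the same length before the loss is computed.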

The code handles variable-length sequences by preparing padded inputs and calculating loss while ignoring padding tokens. The prepare_padded_sequence function converts sequences into index tensors using a mapping (to_ix) and pads them to a fixed maximum length (max_len) with zeros for uniformity. The masked_loss function reshapes model outputs, targets, and a mask indicating non-padded elements, applying the loss function only to non-padded tokens to ensure padding does not affect training. Additionally, the maximum sequence length (max_len) is determined from the training data to standardize input sizes. This approach ensures efficient processing of variable-length data while maintaining training accuracy.

In [None]:
import torch
import torch.nn as nn

# Convert a sequence to indices and pad it to max_len
def prepare_padded_sequence(seq, to_ix, max_len):
    idxs = [to_ix[w] for w in seq]
    idxs = torch.tensor(idxs, device='cuda')
    return nn.functional.pad(idxs, (0, max_len - len(idxs)), value=0)

# Compute the loss while ignoring padded positions
def masked_loss(tag_scores, targets, mask, loss_function):
    tag_scores = tag_scores.view(-1, tag_scores.size(-1))
    targets = targets.view(-1)
    mask = mask.view(-1)
    loss = loss_function(tag_scores[mask], targets[mask])
    return loss

# Determine the maximum sequence length
max_len = max(len(seq) for seq in X_train + Y_train)
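
As a point of comparison, nn.CrossEntropyLoss can skip padded targets by itself through its ignore_index argument. The sketch below uses random scores and made-up target indices, and assumes index 0 is used only for padding; it shows that the built-in option matches manual masking on the targets. Unlike the notebook's mask, it looks only at the targets, not at the inputs.

```python
import torch
import torch.nn as nn

PAD_IX = 0  # assumed to be reserved for padding
torch.manual_seed(0)

scores = torch.randn(5, 7)  # 5 positions, 7 target words (toy sizes)
targets = torch.tensor([3, 6, 2, PAD_IX, PAD_IX])

# Manual masking, in the spirit of masked_loss
mask = targets != PAD_IX
manual = nn.CrossEntropyLoss()(scores[mask], targets[mask])

# Built-in alternative: padded targets are excluded from the average
builtin = nn.CrossEntropyLoss(ignore_index=PAD_IX)(scores, targets)

print(manual.item(), builtin.item())
```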


The code checks if a CUDA-enabled GPU is available for PyTorch by using the torch.cuda.is_available() function, which returns True if a compatible GPU is detected and properly configured, or False otherwise. The result is printed to confirm whether GPU acceleration can be utilized for faster training and inference. If True, PyTorch models and tensors can be moved to the GPU using device='cuda' for improved performance.

In [None]:
import torch
print(torch.cuda.is_available())

True
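
A device-agnostic pattern, sketched below, lets the same code run with or without a GPU. The variable name device is a convention, not something defined elsewhere in this notebook.

```python
import torch

# Pick the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models, via .to(device)) can then be created on that device
x = torch.zeros(1, 32, device=device)
print(x.device)
```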


The code trains the TextTrans model for text translation using embeddings, a custom LSTM cell, and a linear output layer. It sets the embedding size (64), hidden size (32), vocabulary sizes, and number of epochs (50). The model is moved to the GPU, nn.CrossEntropyLoss is used as the loss function, and the Adam optimizer (learning rate 0.001) performs the parameter updates. Training processes one sentence pair at a time rather than in mini-batches: the input and target sequences are padded to max_len, and a mask keeps only the positions where both are non-zero. Because index 0 is also assigned to a real word in each vocabulary, positions holding that word are masked out together with the padding. For each pair, a forward pass generates predictions, and the masked loss is computed and backpropagated to adjust the model weights. The average loss per sentence is printed after each epoch, and tqdm displays overall progress. The recorded output covers the first 20 of the 50 epochs, at roughly ten minutes per epoch.

In [None]:
import torch.optim as optim
import torch



# Hyperparameters
EMBEDDING_DIM = 64
HIDDEN_DIM = 32
EPOCHS = 50
vocab_size_darija = len(darija_to_ix)
vocab_size_english = len(eng_to_ix)

# Initialize the model, loss function, and optimizer
model = TextTrans(EMBEDDING_DIM, HIDDEN_DIM, vocab_size_darija, vocab_size_english)
model.cuda()

loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


# Training loop
from tqdm import tqdm

for epoch in tqdm(range(EPOCHS)):
    total_loss = 0
    model.train()

    for i in range(len(X_train)):
        inputs = prepare_padded_sequence(X_train[i], darija_to_ix, max_len)
        targets = prepare_padded_sequence(Y_train[i], eng_to_ix, max_len)

        # Build the mask that excludes padded positions
        mask = (inputs != 0) & (targets != 0)

        optimizer.zero_grad()

        # Forward pass
        tag_scores = model(inputs)

        # Masked loss
        loss = masked_loss(tag_scores, targets, mask, loss_function)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f'Epoch {epoch+1}/{EPOCHS}, Loss: {total_loss/len(X_train)}')

Epoch 1/50, Loss: 6.535358639687281
Epoch 2/50, Loss: 5.46733868851536
Epoch 3/50, Loss: 4.9731728647984585
Epoch 4/50, Loss: 4.576327598097681
Epoch 5/50, Loss: 4.249413510873357
Epoch 6/50, Loss: 3.9725637200356427
Epoch 7/50, Loss: 3.7375321355973243
Epoch 8/50, Loss: 3.532585414497246
Epoch 9/50, Loss: 3.356011094723955
Epoch 10/50, Loss: 3.1992926364247376
Epoch 11/50, Loss: 3.058962855918952
Epoch 12/50, Loss: 2.9324675884438904
Epoch 13/50, Loss: 2.8203749859136953
Epoch 14/50, Loss: 2.716171477578306
Epoch 15/50, Loss: 2.6204396650292234
Epoch 16/50, Loss: 2.5390808248615344
Epoch 17/50, Loss: 2.4605989550141096
Epoch 18/50, Loss: 2.3873931226361678
Epoch 19/50, Loss: 2.321869535959926
 40%|████      | 20/50 [3:20:32<4:54:59, 589.98s/it]
Epoch 20/50, Loss: 2.265700746398096



---


## **Explanation**

---


### **Data Preparation:**


Loaded and preprocessed the Darija-English dataset, removing rows with missing fields.

Tokenized sentences and converted them into sequences of numerical values compatible with the LSTM model.

---



### **Baseline Model Implementation:**



Designed a vanilla LSTM network for text translation, built on a custom LSTM cell that reads each sentence token by token.

Trained the model with basic hyperparameters to establish a performance baseline.


---



### **Advanced Model Proposals:**

Extended the baseline by incorporating advanced LSTM cells, such as Peephole connections and Working Memory-based architectures.
Modified the equations governing the gating mechanisms in LSTM layers to enhance learning capacity.
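
As an illustration of what such a modification can look like, the sketch below implements one common formulation of a peephole LSTM cell, in which the gates also see the cell state through element-wise weights. The class name PeepholeLSTMCell and the fused weight layout are assumptions made for this example; it is not the implementation used in the project.

```python
import torch
import torch.nn as nn

class PeepholeLSTMCell(nn.Module):
    """LSTM cell whose gates also look at the cell state (peephole connections)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Fused projections; the four chunks are input, forget, output, candidate
        self.W_x = nn.Linear(input_size, 4 * hidden_size)
        self.W_h = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        # Element-wise (diagonal) peephole weights, one vector per gate
        self.p_i = nn.Parameter(torch.zeros(hidden_size))
        self.p_f = nn.Parameter(torch.zeros(hidden_size))
        self.p_o = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x, hidden, cell):
        zi, zf, zo, zc = (self.W_x(x) + self.W_h(hidden)).chunk(4, dim=-1)
        # Input and forget gates peek at the previous cell state
        i = torch.sigmoid(zi + self.p_i * cell)
        f = torch.sigmoid(zf + self.p_f * cell)
        c_bar = torch.tanh(zc)
        cell_new = f * cell + i * c_bar
        # Output gate peeks at the updated cell state
        o = torch.sigmoid(zo + self.p_o * cell_new)
        hidden_new = o * torch.tanh(cell_new)
        return hidden_new, cell_new

peephole = PeepholeLSTMCell(input_size=4, hidden_size=3)
h, c = torch.zeros(1, 3), torch.zeros(1, 3)
h, c = peephole(torch.randn(4), h, c)
print(h.shape, c.shape)
```

The forward signature matches VanillaLSTMCell, so a cell like this could be swapped into TextTrans without changing the surrounding loop.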

---



### **Hyperparameter Experiments:**

Conducted experiments by varying parameters like:

*   Optimizers (e.g., Adam, SGD).
*   Learning rates.
*   Initialization strategies.
*   Batch sizes.

Recorded and visualized results to identify optimal configurations.
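
A simple way to enumerate such experiments is a grid over the parameters listed above. The values in the sketch are placeholders chosen for illustration, not the settings used in the project.

```python
from itertools import product

# Hypothetical search space; every value here is a placeholder
grid = {
    "optimizer": ["adam", "sgd"],
    "lr": [1e-2, 1e-3],
    "batch_size": [16, 64],
    "init": ["xavier", "orthogonal"],
}

# One dictionary per combination of settings
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs), "configurations")
print(configs[0])
```

Each configuration can then be passed to a training function, and the resulting validation loss recorded for comparison.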

---



### **Evaluation and Analysis:**

Compared models using metrics such as loss reduction, BLEU scores, and translation accuracy.
Visualized results to analyze trends and trade-offs between complexity and performance.
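
To show what a BLEU score measures, the sketch below computes a simplified sentence-level variant in plain Python: clipped n-gram precision up to bigrams, a single reference, no smoothing, and a brevity penalty. Standard BLEU is normally computed over a whole corpus with n-grams up to length 4, so this is an illustration rather than a replacement for an established implementation. The example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=2):
    """Simplified single-reference BLEU with uniform weights and no smoothing."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        # Clip each n-gram count by its count in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses that are shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "i am going to the market".split()
print(bleu(reference, reference))
print(bleu(reference, "i am going to market".split()))
```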

---



### **Exploratory Data Analysis (EDA):**

Analyzed dataset characteristics like vocabulary size, sequence lengths, and translation patterns to guide model design.
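
Sequence-length statistics are particularly relevant here because every sentence is padded to the longest training sequence. The sketch below summarizes lengths for a list of tokenized sentences; the length_summary helper and the toy sentences are illustrative assumptions.

```python
from statistics import mean, median

def length_summary(sentences):
    """Summarize the token counts of a list of tokenized sentences."""
    lengths = sorted(len(s) for s in sentences)
    # Simple nearest-rank estimate of the 95th percentile
    p95 = lengths[min(len(lengths) - 1, int(0.95 * len(lengths)))]
    return {"min": lengths[0], "median": median(lengths),
            "mean": round(mean(lengths), 2), "p95": p95, "max": lengths[-1]}

# Toy sentences for illustration only
toy = [["salam"], ["kif", "dayr"], ["ana", "bikhir", "lhamdulillah"],
       ["wach", "nta", "ghadi", "l", "souk", "daba"]]
print(length_summary(toy))
```

If the 95th percentile turns out to be far below the maximum, truncating or bucketing by length may save a good deal of computation on padding.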

---

