# Dialect Classification Using LSTM and RNN

In this notebook, we will train and evaluate LSTM and RNN models for dialect classification using Arabic text data. We will preprocess the text data, convert it into numerical sequences, and then use these sequences to train our models.

## Step 1: Import Libraries
First, we import the necessary libraries for data manipulation and model training.

In [8]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

## Step 2: Load Data
Load the cleaned dataset and display the first few rows.

In [9]:
df = pd.read_csv('Data/cleaned_data.csv')

In [10]:
df.head()

| Unnamed: 0 | text | dialect |
|---|---|---|
| 0 | الجنوبمطفي لانه ناس باحزابنا شركاء مافيات المو... | LB |
| 1 | والدتكن كمان بتقلكن ساعدوني بالتعزيل وبطس الما... | LB |
| 2 | ماعمري ماجلبت مللي كنت صغيره | MA |
| 3 | الفاصله متفق معاك الفاصله اشريف اليوفي بكل حيا... | LY |
| 4 | عجبني بزاااف كنشوفوو اوزاان ثقيله ابطاال محفل ... | MA |


## Step 3: Determine Maximum Text Length
Calculate the maximum length, in whitespace-separated words, of the texts in the dataset.

## Benefit
Knowing the maximum length helps in setting the appropriate input length for the model.

In [11]:
# Get the maximum number of whitespace-separated words in the text column
max_length = df['text'].apply(lambda x: len(x.split())).max()
print(max_length)

61
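
The cell above counts whitespace-separated words, whereas a character-level encoding (such as the `ord`-based one used in Step 4) yields one entry per character. As a minimal sketch, assuming a made-up two-row DataFrame named `toy` (not part of the dataset), the two length measures can differ considerably:

```python
import pandas as pd

# Toy data invented for illustration; not taken from the dataset
toy = pd.DataFrame({"text": ["ab cd ef", "abcdefghij kl"]})

# Length in whitespace-separated words (what the cell above measures)
word_lengths = toy["text"].str.split().str.len()

# Length in characters (what a per-character encoding produces)
char_lengths = toy["text"].str.len()

print(word_lengths.max())  # 3
print(char_lengths.max())  # 13
```

Whichever unit the model input uses is the one the maximum length should be measured in.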


## Step 4: Preprocess Data
Convert the text data into numerical sequences and pad them to ensure uniform length. Also, encode the labels.

## Why?
- Text to Numerical Sequences: To convert text data into a format suitable for model training.
- Padding Sequences: To ensure that all sequences have the same length.
- Label Encoding: To convert categorical labels into numerical format.

## Benefit
Converting text to numerical sequences and padding them ensures uniform input dimensions, which is essential for training neural networks. Encoding labels makes them suitable for classification tasks.
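
To make the padding step concrete, here is a small NumPy sketch of what padding and truncation do to character-level sequences. The helper `pad_pre` is hypothetical and only imitates what I understand to be the default behaviour of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`, fill value 0); it is not the library implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # drop items from the front if the sequence is too long
        if trimmed:
            out[i, -len(trimmed):] = trimmed
    return out

# Each character becomes its Unicode code point
encoded = [[ord(ch) for ch in text] for text in ["abc", "abcdefg"]]
padded = pad_pre(encoded, maxlen=5)
print(padded)
# [[  0   0  97  98  99]
#  [ 99 100 101 102 103]]
```

Note that the longer string loses its first two characters, so a small `maxlen` can discard the beginning of long texts.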

In [12]:
# Extract the input texts and labels
max_len = 61  # Maximum sequence length (the word count found in Step 3)
texts = df['text'].values
labels = df['dialect'].values

In [13]:
# Convert each text to a sequence of Unicode code points (character-level encoding)
num_sequences = [[ord(char) for char in text] for text in texts]

# Pad sequences to a uniform length; sequences longer than max_len are truncated
padded_sequences = pad_sequences(num_sequences, maxlen=max_len)

# Convert labels to integer ids; sorting keeps the mapping stable across runs
label_mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
encoded_labels = [label_mapping[label] for label in labels]
labels = np.array(encoded_labels)
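
When a label mapping is built by enumerating a plain `set` of strings, the iteration order is not guaranteed to be the same in every Python session, so the class ids can differ between runs (which matters when a saved model is reloaded later). A minimal sketch of an order-stable mapping with an inverse lookup for decoding predictions, using toy labels and a hypothetical `inverse_mapping` name:

```python
import numpy as np

# Toy labels using dialect codes that appear in the dataset preview
labels = np.array(["LB", "MA", "LY", "MA", "LB"])

# sorted() fixes the order, so the mapping is identical on every run
label_mapping = {label: idx for idx, label in enumerate(sorted(set(labels.tolist())))}
inverse_mapping = {idx: label for label, idx in label_mapping.items()}

encoded = np.array([label_mapping[label] for label in labels.tolist()])
decoded = [inverse_mapping[i] for i in encoded.tolist()]

print(label_mapping)  # {'LB': 0, 'LY': 1, 'MA': 2}
print(encoded)        # [0 2 1 2 0]
```

The inverse mapping turns a predicted class id back into its dialect code.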


## Step 5: Train-Test Split
Split the dataset into training and testing sets.

## Why?
Train-Test Split: To evaluate the performance of the model on unseen data.
## Benefit
This helps in assessing the generalizability of the model by testing it on a separate dataset that was not used during training.

In [14]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)
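
If the dialect classes are imbalanced, a plain random split can leave rare classes under-represented in the test set. One possible variant, sketched here on toy data rather than on the notebook's arrays, passes `stratify` to `train_test_split` so that each split keeps the class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, two imbalanced classes (16 vs 4)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Class counts in each split
print(np.bincount(y_tr), np.bincount(y_te))
```

Whether stratification is needed here depends on the label distribution, which the notebook does not show.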

## Step 6: Build and Train LSTM Model
Build the LSTM model, compile it, and train it on the training data.

## Why?
LSTM Model: To capture long-term dependencies in the text data.
## Benefit
LSTM (Long Short-Term Memory) networks are effective for sequential data, capturing context over long sequences which is crucial for understanding the dialects in text.

In [15]:
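# Build the model
# vocab_size must exceed the largest input index; inputs are Unicode code points,
# and the Arabic block ends at U+06FF (1791)
vocab_size = 2000
embedding_dim = 100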
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len),
    LSTM(units=64),
    Dense(units=len(label_mapping), activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
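
Because the inputs are raw Unicode code points, every index fed to the `Embedding` layer has to be smaller than `input_dim` (here `vocab_size`). The following sketch, which uses two invented strings rather than the dataset, shows one way to check for characters that would fall outside that range:

```python
vocab_size = 2000  # same value as in the cell above

# Toy strings: one purely Arabic, one containing an emoji
texts = ["مرحبا", "hello 😀"]

# Largest code point across all characters
max_index = max(ord(ch) for text in texts for ch in text)

# Characters whose code point would not fit into the embedding table
out_of_range = sorted({ch for text in texts for ch in text if ord(ch) >= vocab_size})

print(max_index)     # 128512
print(out_of_range)  # ['😀']
```

If such characters remain after cleaning, they would need to be removed or remapped before training.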

In [None]:
# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x211b4ab9280>

In [13]:
# Save the model
model.save('Models/LSTM.keras')

## Step 7: Evaluate LSTM Model
Evaluate the LSTM model on the test data.

In [14]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Test Loss: 0.7173075675964355
Test Accuracy: 0.7424047589302063
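
Overall accuracy does not show which dialects are confused with each other. As a rough sketch, per-class accuracy can be derived from a confusion matrix; the probabilities below are invented toy values standing in for what `model.predict(X_test)` would return:

```python
import numpy as np

# Toy softmax outputs for 4 samples and 3 classes (invented numbers)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
y_true = np.array([0, 1, 2, 2])

# Predicted class = index of the highest probability
y_pred = probs.argmax(axis=1)

# Rows are true classes, columns are predicted classes
n_classes = probs.shape[1]
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# Fraction of each true class that was predicted correctly
per_class_acc = confusion.diagonal() / confusion.sum(axis=1)
print(per_class_acc)  # [1.  1.  0.5]
```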


## Step 8: Build and Train RNN Model
Build a Simple RNN model, compile it, and train it on the training data.

## Why?
RNN Model: To compare its performance with the LSTM model.
## Benefit
Simple RNNs are faster and less complex than LSTMs, making them suitable for smaller datasets or simpler tasks. Comparing both models helps in selecting the most suitable architecture for the task.
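
The difference in complexity can be quantified by counting trainable parameters in the recurrent layer. The sketch below uses the standard formulas for a simple recurrent layer and an LSTM with bias terms; the helper names are hypothetical, and the sizes match the 100-dimensional embedding and 64 units used in this notebook:

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias for one recurrent transformation."""
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    """An LSTM has four such transformations (three gates plus the cell candidate)."""
    return 4 * simple_rnn_params(input_dim, units)

embedding_dim, units = 100, 64
print(simple_rnn_params(embedding_dim, units))  # 10560
print(lstm_params(embedding_dim, units))        # 42240
```

With these sizes the LSTM layer has four times as many recurrent-layer parameters as the Simple RNN layer.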

In [15]:
# Import the SimpleRNN layer to try an RNN model
from tensorflow.keras.layers import SimpleRNN

In [16]:
# Build the RNN model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len),
    SimpleRNN(units=64),
    Dense(units=len(label_mapping), activation='softmax')
])

In [17]:
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [18]:
# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x211bf20b760>
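
The RNN cells reuse the variable name `model`, so the LSTM object is no longer available after this point, and no test-set score for the RNN is shown. One way to compare both models on the same test data is to keep them under separate names and evaluate them together. The helper `compare_models` below is a hypothetical sketch; the stub class and its scores are invented so the example runs without Keras, and real Keras models compiled with `metrics=['accuracy']` could be passed in instead:

```python
def compare_models(models, X_test, y_test):
    """Return {name: (loss, accuracy)} using each model's evaluate() method."""
    results = {}
    for name, model in models.items():
        loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
        results[name] = (loss, accuracy)
    return results

class _StubModel:
    """Stand-in with fixed, invented scores; not a real trained model."""
    def __init__(self, loss, accuracy):
        self._scores = [loss, accuracy]

    def evaluate(self, X, y, verbose=0):
        return self._scores

results = compare_models(
    {"model_a": _StubModel(0.5, 0.8), "model_b": _StubModel(0.9, 0.6)},
    X_test=None,
    y_test=None,
)

# Name of the model with the highest accuracy
best = max(results, key=lambda name: results[name][1])
print(best)  # model_a
```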

With these steps, we have trained and evaluated LSTM and RNN models for dialect classification. We converted text data into numerical sequences, padded them, and encoded the labels. We then trained the models and evaluated their performance on test data to understand their effectiveness in classifying dialects.

In conclusion, the LSTM model performs better than the RNN model, but it still does not outperform the Logistic Regression model.