**Instructions for POS Tagging Using RNNs with an Arabic Dataset**

**Dataset:**
The dataset provided is named "Assignment 2 - Arabic POS.conllu". It contains labeled data for Arabic text with Part-of-Speech (POS) tags in CoNLL-U format.

**Objective:**
Your objective is to perform Part-of-Speech (POS) tagging on Arabic text using Recurrent Neural Networks (RNNs). Specifically, you will use the Universal POS (UPOS) tags for tagging. UPOS is a standardized set of POS tags that aims to cover all languages.

**Evaluation metric:**
Accuracy

**Instructions:**
1. **Data Preprocessing:**
   - Load the provided dataset "Assignment 2 - Arabic POS.conllu". You can use the pyconll library.
   - Preprocess the data as necessary, including tokenization.

2. **Model Building:**
   - Design an RNN-based model architecture suitable for POS tagging. You may consider using recurrent layers such as LSTM or GRU.
   - Define the input and output layers of the model. The input layer should accept sequences of tokens, and the output layer should produce the predicted UPOS tags for each token.

3. **Training:**

4. **Evaluation:**

**Additional Notes:**
- Make sure to document your code thoroughly and provide clear explanations for each step.
- Feel free to explore different RNN architectures, hyperparameters, and optimization techniques to improve the model's accuracy.

### Import used libraries

In [1]:
!pip install pyconll



In [19]:
import pyconll
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Bidirectional, Dense
from keras.callbacks import EarlyStopping

### Load Dataset

In [3]:
dataset = pyconll.load_from_file("Assignment 2 - Arabic POS.conllu")

### Data splitting

In [4]:
# Extract tokens and POS tags
sentences = []
upos_tags = []

for sentence in dataset:
    tokens = []
    tags = []
    for token in sentence:
        tokens.append(token.form)
        tags.append(token.upos)
    sentences.append(tokens)
    upos_tags.append(tags)
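
The sample output further down lists surface forms such as 'والمستندات' immediately followed by their segments 'و' and 'المستندات', which suggests that multiword-token lines are being collected alongside the words they span. In CoNLL-U, those lines carry a range ID (for example `1-2`) and `_` in the UPOS column, so it may be worth checking whether `token.upos` is empty for them. The sketch below is a stdlib-only illustration of skipping such lines; the function name `read_conllu_words` and the sample text are hypothetical and not part of the assignment data.

```python
def read_conllu_words(text):
    """Return (sentences, tags), keeping only syntactic words.

    Lines whose ID is a range ("1-2", a multiword token) or a decimal
    ("5.1", an empty node) are skipped, so every kept word has a UPOS tag.
    """
    sentences, tags = [], []
    words, word_tags = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if words:
                sentences.append(words)
                tags.append(word_tags)
                words, word_tags = [], []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment line
        fields = line.split("\t")
        token_id, form, upos = fields[0], fields[1], fields[3]
        if "-" in token_id or "." in token_id:
            continue  # multiword token range or empty node
        words.append(form)
        word_tags.append(upos)
    if words:
        # Flush the last sentence if the text does not end with a blank line.
        sentences.append(words)
        tags.append(word_tags)
    return sentences, tags


# Tiny hand-written sample (hypothetical, not taken from the assignment file).
sample = "\n".join([
    "# text = وكان",
    "1-2\tوكان\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tو\tو\tCCONJ\t_\t_\t2\tcc\t_\t_",
    "2\tكان\tكان\tAUX\t_\t_\t0\troot\t_\t_",
    "",
])
sample_sentences, sample_tags = read_conllu_words(sample)
print(sample_sentences)
print(sample_tags)
```

With pyconll, a similar filter could be applied inside the loop above instead of re-parsing the file; the manual parser here is only meant to show which lines are involved.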

In [5]:
# Split data into training and testing sets
sentences_train, sentences_test, upos_tags_train, upos_tags_test = train_test_split(
    sentences, upos_tags, test_size=0.2, random_state=42)

In [6]:
sentences_train

[['كشفت',
  'الأوراق',
  'والمستندات',
  'و',
  'المستندات',
  'أن',
  'المستشفى',
  'تهرب',
  'من',
  'طلبات',
  'المستشفيات',
  'الحكومية',
  'من',
  'الدم',
  'باعتبارها',
  'ب',
  'اعتبار',
  'ها',
  'تأخذ',
  'الدم',
  'مجاناً',
  'وكان',
  'و',
  'كان',
  'يتم',
  'إبلاغ',
  'غرفة',
  'الطوارئ',
  'بوزارة',
  'ب',
  'وزارة',
  'الصحة',
  'بكميات',
  'ب',
  'كميات',
  'دم',
  'أقل',
  'مما',
  'من',
  'ما',
  'هو',
  'مدون',
  'بالأوراق',
  'ب',
  'الأوراق',
  'الرسمية',
  'لعدم',
  'ل',
  'عدم',
  'الاستعانة',
  'بكميات',
  'ب',
  'كميات',
  'الدم',
  'في',
  'الحوادث',
  'وحالات',
  'و',
  'حالات',
  'النزيف',
  'الحاد',
  '.'],
 ['ومن',
  'و',
  'من',
  'المتوقع',
  'ان',
  'يستقبل',
  'امير',
  'قطر',
  'الاربعاء',
  'في',
  'الفندق',
  'الذي',
  'يقيم',
  'فيه',
  'في',
  'ه',
  'كلا',
  'من',
  'رئيس',
  'المجلس',
  'النيابي',
  'نبيه',
  'بري',
  'ورئيس',
  'و',
  'رئيس',
  'الحكومة',
  'سليم',
  'الحص',
  '.'],
 ['من', 'هنا', '.', '.'],
 ['وتعتزم',
  'و',
  'تعتزم',
  'اله

### Cleaning and Preprocessing

Tokenization:
I don't see a need for further tokenization, because the sentences were already split into tokens in the loop above.

In [7]:
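# Convert words to indices. Index 0 is reserved for padding, so word indices start at 1.
# Note: the vocabulary is built from both the training and the test sentences.
vocabulary = set(word for sentence in sentences_train + sentences_test for word in sentence)
word_to_index = {word: i + 1 for i, word in enumerate(vocabulary)}
index_to_word = {i: word for word, i in word_to_index.items()}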

indexed_sentences_train = [[word_to_index[word] for word in sentence] for sentence in sentences_train]
indexed_sentences_test = [[word_to_index[word] for word in sentence] for sentence in sentences_test]
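
Because the vocabulary above includes test-set words, every test word gets its own index even if the model never saw it during training. One common alternative, sketched here under my own assumptions, is to build the vocabulary from the training sentences only and map anything else to a shared "unknown" index. The names `PAD_INDEX`, `UNK_INDEX`, `build_vocab`, and `encode` and the toy sentences are hypothetical.

```python
PAD_INDEX = 0  # reserved for padding
UNK_INDEX = 1  # reserved for words not seen in training


def build_vocab(train_sentences):
    """Map each training word to an index starting at 2."""
    # Sorting makes the mapping reproducible across runs.
    vocab = sorted({word for sentence in train_sentences for word in sentence})
    return {word: i + 2 for i, word in enumerate(vocab)}


def encode(sentences, word_to_index):
    """Replace words by indices, using UNK_INDEX for unseen words."""
    return [[word_to_index.get(word, UNK_INDEX) for word in sentence]
            for sentence in sentences]


# Toy data (hypothetical): the last test word does not occur in training.
toy_train = [["من", "هنا", "."], ["في", "الفندق", "."]]
toy_test = [["من", "الفندق", "الكبير"]]

toy_vocab = build_vocab(toy_train)
encoded_test = encode(toy_test, toy_vocab)
print(encoded_test)
```

With this scheme the embedding layer would need `len(toy_vocab) + 2` rows rather than `len(word_to_index) + 1`, since two indices are reserved.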

In [8]:
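# Pad the word-index sequences with zeros at the end ('post').
# Note: train and test are each padded to their own maximum sentence length.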
max_len_train = max(len(sentence) for sentence in indexed_sentences_train)
max_len_test = max(len(sentence) for sentence in indexed_sentences_test)

padded_sentences_train = pad_sequences(indexed_sentences_train, maxlen=max_len_train, padding='post')
padded_sentences_test = pad_sequences(indexed_sentences_test, maxlen=max_len_test, padding='post')


In [9]:
# Convert UPOS tags to indices
upos_to_index = {upos: i for i, upos in enumerate(set(tag for tags in upos_tags_train for tag in tags))}
index_to_upos = {i: upos for upos, i in upos_to_index.items()}

indexed_tags_train = [[upos_to_index[tag] for tag in tags] for tags in upos_tags_train]
indexed_tags_test = [[upos_to_index[tag] for tag in tags] for tags in upos_tags_test]

In [10]:
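# Pad the tag sequences the same way.
# Note: the padding value 0 is also the index of one real tag, because upos_to_index starts at 0.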
padded_tags_train = pad_sequences(indexed_tags_train, maxlen=max_len_train, padding='post')
padded_tags_test = pad_sequences(indexed_tags_test, maxlen=max_len_test, padding='post')
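
Since `upos_to_index` starts at 0 and the padding value is also 0, padded positions end up labelled with whichever real tag received index 0. A minimal sketch of one way to keep them apart is to shift tag indices by one so that 0 means "padding" only. The helpers `build_tag_index` and `pad_post` and the toy tags are hypothetical, and `pad_post` is a plain-Python stand-in for `pad_sequences(..., padding='post')`; unlike the Keras default, it truncates over-long sequences at the end.

```python
def build_tag_index(train_tags):
    """Map each tag to an index starting at 1, leaving 0 free for padding."""
    tag_set = sorted({tag for tags in train_tags for tag in tags})
    return {tag: i + 1 for i, tag in enumerate(tag_set)}


def pad_post(sequences, max_len, pad_value=0):
    """Right-pad (or cut at the end) each sequence to exactly max_len items."""
    return [seq[:max_len] + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Toy tag sequences (hypothetical).
toy_tags = [["ADP", "ADV", "PUNCT"], ["NOUN"]]
toy_tag_index = build_tag_index(toy_tags)
toy_indexed = [[toy_tag_index[tag] for tag in tags] for tags in toy_tags]
toy_padded = pad_post(toy_indexed, max_len=4)
print(toy_tag_index)
print(toy_padded)
```

If this scheme were used, the one-hot encoding and the output layer would need one extra class for the padding label.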

In [11]:
# One-hot encode the padded tag indices
one_hot_tags_train = to_categorical(padded_tags_train, num_classes=len(upos_to_index))
one_hot_tags_test = to_categorical(padded_tags_test, num_classes=len(upos_to_index))

### Modelling

#### LSTM

In [12]:
model = Sequential()
model.add(Embedding(input_dim=len(word_to_index) + 1, output_dim=100, input_length=max_len_train))
model.add(LSTM(units=128, return_sequences=True))
model.add(Dense(len(upos_to_index), activation='softmax'))



In [13]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


In [14]:
model.summary()


In [15]:
history = model.fit(padded_sentences_train, one_hot_tags_train, validation_split=0.2, epochs=10, batch_size=32)


Epoch 1/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 65s 504ms/step - accuracy: 0.8893 - loss: 0.4832 - val_accuracy: 0.9530 - val_loss: 0.1673
Epoch 2/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 63s 514ms/step - accuracy: 0.9595 - loss: 0.1392 - val_accuracy: 0.9746 - val_loss: 0.0914
Epoch 3/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 61s 496ms/step - accuracy: 0.9782 - loss: 0.0791 - val_accuracy: 0.9853 - val_loss: 0.0556
Epoch 4/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 62s 506ms/step - accuracy: 0.9895 - loss: 0.0454 - val_accuracy: 0.9913 - val_loss: 0.0367
Epoch 5/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 61s 499ms/step - accuracy: 0.9954 - loss: 0.0260 - val_accuracy: 0.9927 - val_loss: 0.0280
Epoch 6/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 61s 503ms/step - accuracy: 0.9974 - loss: 0.0158 - val_accuracy: 0.9934 - val_loss: 0.0243
Epoch 7/10

#### Evaluation

**Evaluation metric:**
Accuracy

In [16]:
loss, accuracy = model.evaluate(padded_sentences_test, one_hot_tags_test)

38/38 ━━━━━━━━━━━━━━━━━━━━ 4s 96ms/step - accuracy: 0.9863 - loss: 0.0452


In [17]:
print(f'Test Accuracy: {accuracy * 100:.2f}%')

Test Accuracy: 98.55%
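
The accuracy reported by `evaluate` is averaged over every time step, including padded positions, so short sentences in a long padded batch contribute many easy "correct" predictions. A minimal sketch of a token-level accuracy that counts only real tokens is shown below. The function name `token_accuracy` and the toy arrays are hypothetical; in the notebook, the predicted indices could be obtained with something like `model.predict(...).argmax(axis=-1)` and the lengths from `len(sentence)` for each test sentence (both are assumptions, not code from the original).

```python
def token_accuracy(true_tags, predicted_tags, lengths):
    """Accuracy over the first `length` positions of each padded row."""
    correct = 0
    total = 0
    for true_row, pred_row, length in zip(true_tags, predicted_tags, lengths):
        for true_tag, pred_tag in zip(true_row[:length], pred_row[:length]):
            correct += int(true_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0


# Toy example (hypothetical): two sentences padded to length 6 with zeros.
toy_true = [[3, 1, 2, 0, 0, 0], [4, 0, 0, 0, 0, 0]]
toy_pred = [[3, 1, 5, 0, 0, 0], [2, 0, 0, 0, 0, 0]]
toy_lengths = [3, 1]

# Accuracy over all 12 positions, padding included.
padded_accuracy = sum(
    t == p
    for true_row, pred_row in zip(toy_true, toy_pred)
    for t, p in zip(true_row, pred_row)
) / 12

# Accuracy over the 4 real tokens only.
real_accuracy = token_accuracy(toy_true, toy_pred, toy_lengths)
print(padded_accuracy, real_accuracy)
```

In this toy case the same predictions score noticeably higher when padding is counted than when it is excluded, which is why the two figures are worth reporting separately.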


### Enhancement

#### GRU

In [20]:
gru = Sequential()
gru.add(Embedding(input_dim=len(word_to_index) + 1, output_dim=100, input_length=max_len_train))
gru.add(GRU(units=128, return_sequences=True))
gru.add(Dense(len(upos_to_index), activation='softmax'))


In [21]:
gru.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [22]:
gru.summary()
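
The `summary()` output is not preserved here, but the size of the recurrent layers can be worked out by hand from the settings above (100-dimensional embeddings, 128 units). The sketch below uses the standard parameter-count formulas; the GRU formula assumes the Keras default `reset_after=True`, which adds a second bias vector per gate. The function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and one bias."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    """3 gates, each with input weights, recurrent weights, and two biases
    (assumes the Keras default reset_after=True)."""
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, n_outputs):
    """One weight per input/output pair plus one bias per output."""
    return input_dim * n_outputs + n_outputs


embedding_dim = 100  # output_dim of the Embedding layer above
units = 128          # units of the LSTM / GRU layer above

lstm_count = lstm_params(embedding_dim, units)
gru_count = gru_params(embedding_dim, units)
print(lstm_count, gru_count)
```

Under these assumptions the GRU layer has roughly three quarters of the LSTM layer's parameters; the embedding and dense layers are the same size in both models.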

In [24]:
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
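
To make the effect of `patience=3` concrete, here is a simplified plain-Python sketch of the stopping rule: training stops once the monitored loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one whose weights `restore_best_weights=True` would bring back. This is an illustration, not the Keras implementation; the function name `simulate_early_stopping` and the first list of losses are hypothetical, while the second list is copied from the GRU training log in this notebook.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epochs_run, best_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical curve that starts to get worse after epoch 3.
toy_result = simulate_early_stopping(
    [0.50, 0.30, 0.25, 0.26, 0.27, 0.28, 0.20], patience=3)

# Validation losses for epochs 1-6 of the GRU run logged in this notebook.
logged_result = simulate_early_stopping(
    [0.1072, 0.0533, 0.0320, 0.0242, 0.0218, 0.0212], patience=3)
print(toy_result, logged_result)
```

For the six logged epochs the validation loss improves every time, so under this rule the callback would not have triggered within them.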


In [25]:
history = gru.fit(padded_sentences_train, one_hot_tags_train, validation_split=0.2, epochs=10, batch_size=32, callbacks=[early_stopping])


Epoch 1/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 66s 508ms/step - accuracy: 0.8975 - loss: 0.4834 - val_accuracy: 0.9676 - val_loss: 0.1072
Epoch 2/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 64s 522ms/step - accuracy: 0.9746 - loss: 0.0885 - val_accuracy: 0.9860 - val_loss: 0.0533
Epoch 3/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 64s 521ms/step - accuracy: 0.9901 - loss: 0.0421 - val_accuracy: 0.9917 - val_loss: 0.0320
Epoch 4/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 64s 521ms/step - accuracy: 0.9960 - loss: 0.0211 - val_accuracy: 0.9933 - val_loss: 0.0242
Epoch 5/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 63s 519ms/step - accuracy: 0.9977 - loss: 0.0121 - val_accuracy: 0.9935 - val_loss: 0.0218
Epoch 6/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 64s 526ms/step - accuracy: 0.9982 - loss: 0.0080 - val_accuracy: 0.9935 - val_loss: 0.0212
Epoch 7/10

In [26]:
loss, accuracy = gru.evaluate(padded_sentences_test, one_hot_tags_test)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

38/38 ━━━━━━━━━━━━━━━━━━━━ 3s 76ms/step - accuracy: 0.9866 - loss: 0.0429
Test Accuracy: 98.58%


### Conclusion and final results


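- The GRU model scored slightly higher than the LSTM on the test set (98.58% vs. 98.55% accuracy). The gap is very small, so it may reflect run-to-run variation as much as a real difference; it could also be due to sentence lengths or language nuances. The GRU was also trained with early stopping, while the LSTM was not.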

#### Done!