
# Part-of-Speech Tagging Using an RNN Model

Part-of-speech (POS) tagging is a fundamental natural language processing task that involves assigning a grammatical category (such as noun, verb, adjective, etc.) to each word in a sentence. Recurrent Neural Networks (RNNs) can be used for POS tagging, although more advanced models like Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) are often preferred due to their ability to capture longer-range dependencies.

Here's a step-by-step guide to building a basic POS tagging model using an RNN (specifically, an LSTM) in Python with the help of libraries like TensorFlow and Keras:

1. Data Preparation:
Prepare your training data in the form of sentences with corresponding POS tags. You can use datasets like Penn Treebank or Universal Dependencies for this purpose.

2. Tokenization and Padding:
Convert your sentences into sequences of word tokens, and then pad the sequences to ensure uniform length.

3. Data Vectorization:
Convert the word tokens and POS tags into numerical vectors using techniques like one-hot encoding or word embeddings.

4. Model Architecture:
Build your RNN model using an LSTM layer. You can experiment with stacking multiple LSTM layers or combining them with other types of layers like Dense layers.

5. Compile the Model:
Compile your model using an appropriate loss function (categorical cross-entropy) and optimizer (e.g., Adam).

6. Model Training:
Train your model on the prepared training data. Monitor the training process and adjust hyperparameters as needed.

7. Model Evaluation:
Evaluate your model's performance on a separate validation or test dataset. Calculate metrics like accuracy, precision, recall, and F1-score.

8. Inference:
Use your trained model for POS tagging by feeding it new sentences. The model will output the predicted POS tags for each word.
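
Step 7 names accuracy, precision, recall, and F1-score without saying how to compute them for tag sequences. The following is a minimal sketch of a token-level computation using only the standard library. The helper name `tag_metrics` and the toy gold and predicted sequences are illustrative assumptions, not part of the original notebook, and padding is assumed to have been stripped beforehand.

```python
from collections import Counter

def tag_metrics(gold_seqs, pred_seqs):
    """Token-level accuracy plus per-tag precision, recall, and F1.

    Hypothetical helper: gold_seqs and pred_seqs are lists of tag lists
    with matching shapes and no padding positions.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            total += 1
            if g == p:
                correct += 1
                tp[g] += 1
            else:
                fp[p] += 1  # p was predicted where it should not have been
                fn[g] += 1  # g should have been predicted but was missed
    per_tag = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_tag[tag] = {'precision': precision, 'recall': recall, 'f1': f1}
    accuracy = correct / total if total else 0.0
    return accuracy, per_tag

# Toy data: one VERB is mistagged as NOUN
gold = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB', 'DET', 'NOUN']]
pred = [['DET', 'NOUN', 'NOUN'], ['PRON', 'VERB', 'DET', 'NOUN']]

accuracy, per_tag = tag_metrics(gold, pred)
print(f'accuracy = {accuracy:.3f}')
for tag, scores in per_tag.items():
    print(tag, scores)
```

Per-tag scores are worth reporting alongside accuracy because frequent tags can hide poor performance on rare ones.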

Here's a code snippet illustrating the implementation using Keras:

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Example data (replace with your own)
sentences = [['The', 'cat', 'is', 'on', 'the', 'mat'],
             ['I', 'ate', 'an', 'apple']]

pos_tags = [['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN'],
            ['PRON', 'VERB', 'DET', 'NOUN']]

# Create sorted vocabularies so the index mappings are reproducible
word_vocab = sorted({word for sent in sentences for word in sent})
tag_vocab = sorted({tag for tag_seq in pos_tags for tag in tag_seq})

# Reserve index 0 for padding (words and tags) and index 1 for unknown words
UNK_INDEX = 1
word_to_index = {word: idx + 2 for idx, word in enumerate(word_vocab)}
tag_to_index = {tag: idx + 1 for idx, tag in enumerate(tag_vocab)}
index_to_tag = {idx: tag for tag, idx in tag_to_index.items()}
index_to_tag[0] = 'PAD'

# Convert words and tags to numerical values
X = [[word_to_index[word] for word in sent] for sent in sentences]
y = [[tag_to_index[tag] for tag in tag_seq] for tag_seq in pos_tags]

# Pad sequences
max_sequence_length = max(len(seq) for seq in X)
X_padded = pad_sequences(X, maxlen=max_sequence_length, padding='post')
y_padded = pad_sequences(y, maxlen=max_sequence_length, padding='post')

# Convert tags to one-hot encoding (one extra class for padding)
num_tags = len(tag_vocab) + 1
y_onehot = np.array([to_categorical(tag_seq, num_classes=num_tags) for tag_seq in y_padded])

# Build and compile the model
model = Sequential()
model.add(Embedding(input_dim=len(word_vocab) + 2, output_dim=50))
model.add(LSTM(100, return_sequences=True))
model.add(Dense(num_tags, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_padded, y_onehot, batch_size=32, epochs=10, validation_split=0.2)

# Inference: words not seen during training map to UNK_INDEX instead of raising KeyError
test_sentence = ['The', 'dog', 'barked']
test_sequence = [word_to_index.get(word, UNK_INDEX) for word in test_sentence]
test_padded = pad_sequences([test_sequence], maxlen=max_sequence_length, padding='post')
predicted_probs = model.predict(test_padded)
# Keep only the positions of real words and map each argmax back to its tag
predicted_tags = [index_to_tag[int(np.argmax(tag_prob))]
                  for tag_prob in predicted_probs[0][:len(test_sentence)]]

print(predicted_tags)

# Nepali Text POS Tagging Using a GRU

Here's a step-by-step guide to building a Nepali POS tagging model with a GRU (Gated Recurrent Unit) in Python, using libraries like TensorFlow and Keras. Note that you'll need to prepare or obtain a Nepali POS-tagged dataset for training.

1. Data Preparation:
Prepare your Nepali POS tagged dataset. Each sentence should be tokenized into words and accompanied by their corresponding POS tags.

2. Tokenization and Padding:
Convert the Nepali sentences into sequences of word tokens, and then pad the sequences to ensure uniform length.

3. Data Vectorization:
Convert the word tokens and POS tags into numerical vectors using techniques like one-hot encoding or word embeddings.

4. Model Architecture:
Build your GRU model using the Keras library. You can experiment with different configurations and hyperparameters.

5. Compile the Model:
Compile your model using an appropriate loss function (categorical cross-entropy) and optimizer (e.g., Adam).

6. Model Training:
Train your model on the prepared training data. Monitor the training process and adjust hyperparameters as needed.

7. Model Evaluation:
Evaluate your model's performance on a separate validation or test dataset. Calculate metrics like accuracy, precision, recall, and F1-score.

8. Inference:
Use your trained model for POS tagging by feeding it new Nepali sentences. The model will output the predicted POS tags for each word.
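
Steps 2 and 3 leave two practical details open: which index the padding uses, and what happens to a word that never appeared in the training data. The sketch below shows one common convention, reserving index 0 for padding and index 1 for unknown words. The names `PAD`, `UNK`, `build_index`, and `encode` are illustrative assumptions, and the padding is done in plain Python rather than with Keras's `pad_sequences` so that the example runs without TensorFlow.

```python
PAD, UNK = 0, 1  # assumed convention: 0 = padding, 1 = unknown word

def build_index(sentences):
    """Map each distinct word to an integer, starting after PAD and UNK."""
    vocab = sorted({word for sent in sentences for word in sent})
    return {word: idx + 2 for idx, word in enumerate(vocab)}

def encode(sentence, word_to_index, max_len):
    """Convert words to indices, then truncate or right-pad to max_len."""
    ids = [word_to_index.get(word, UNK) for word in sentence][:max_len]
    return ids + [PAD] * (max_len - len(ids))

train = [['नेपाल', 'बाट', 'सुन्दर'], ['म', 'काठमाडौं']]
word_to_index = build_index(train)
max_len = max(len(sent) for sent in train)

encoded_train = [encode(sent, word_to_index, max_len) for sent in train]
# 'मा' does not occur in the training sentences, so it maps to UNK
encoded_new = encode(['नेपाल', 'मा'], word_to_index, max_len)

print(encoded_train)
print(encoded_new)
```

With this layout the embedding layer needs `len(word_to_index) + 2` rows, and a lookup at inference time can no longer fail on an unseen word.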

Here's a simplified code snippet illustrating the implementation using Keras:

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Example data (replace with your own)
nepali_sentences = [['नेपाल', 'बाट', 'सुन्दर', 'माउन्ट', 'एभरेस्ट', 'देखिन्छ'],
                    ['म', 'काठमाडौं', 'जान्न', 'गइन्']]

nepali_pos_tags = [['NOUN', 'ADP', 'ADJ', 'NOUN', 'NOUN', 'VERB'],
                   ['PRON', 'NOUN', 'VERB', 'VERB']]

# Create sorted vocabularies so the index mappings are reproducible
word_vocab = sorted({word for sent in nepali_sentences for word in sent})
tag_vocab = sorted({tag for tag_seq in nepali_pos_tags for tag in tag_seq})

# Reserve index 0 for padding (words and tags) and index 1 for unknown words
UNK_INDEX = 1
word_to_index = {word: idx + 2 for idx, word in enumerate(word_vocab)}
tag_to_index = {tag: idx + 1 for idx, tag in enumerate(tag_vocab)}
index_to_tag = {idx: tag for tag, idx in tag_to_index.items()}
index_to_tag[0] = 'PAD'

# Convert words and tags to numerical values
X = [[word_to_index[word] for word in sent] for sent in nepali_sentences]
y = [[tag_to_index[tag] for tag in tag_seq] for tag_seq in nepali_pos_tags]

# Pad sequences
max_sequence_length = max(len(seq) for seq in X)
X_padded = pad_sequences(X, maxlen=max_sequence_length, padding='post')
y_padded = pad_sequences(y, maxlen=max_sequence_length, padding='post')

# Convert tags to one-hot encoding (one extra class for padding)
num_tags = len(tag_vocab) + 1
y_onehot = np.array([to_categorical(tag_seq, num_classes=num_tags) for tag_seq in y_padded])

# Build and compile the model
model = Sequential()
model.add(Embedding(input_dim=len(word_vocab) + 2, output_dim=50))
model.add(GRU(100, return_sequences=True))
model.add(Dense(num_tags, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_padded, y_onehot, batch_size=32, epochs=10, validation_split=0.2)

# Inference: words not seen during training map to UNK_INDEX instead of raising KeyError
test_sentence = ['नेपाल', 'मा', 'राम्रो', 'ठाउँ', 'छ']
test_sequence = [word_to_index.get(word, UNK_INDEX) for word in test_sentence]
test_padded = pad_sequences([test_sequence], maxlen=max_sequence_length, padding='post')
predicted_probs = model.predict(test_padded)
# Keep only the positions of real words and map each argmax back to its tag
predicted_tags = [index_to_tag[int(np.argmax(tag_prob))]
                  for tag_prob in predicted_probs[0][:len(test_sentence)]]

print(predicted_tags)

# Nepali Text POS Tagging Using an LSTM

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Example data (replace with your own)
nepali_sentences = [['नेपाल', 'बाट', 'सुन्दर', 'माउन्ट', 'एभरेस्ट', 'देखिन्छ'],
                    ['म', 'काठमाडौं', 'जान्न', 'गइन्']]

nepali_pos_tags = [['NOUN', 'ADP', 'ADJ', 'NOUN', 'NOUN', 'VERB'],
                   ['PRON', 'NOUN', 'VERB', 'VERB']]

# Create sorted vocabularies so the index mappings are reproducible
word_vocab = sorted({word for sent in nepali_sentences for word in sent})
tag_vocab = sorted({tag for tag_seq in nepali_pos_tags for tag in tag_seq})

# Reserve index 0 for padding (words and tags) and index 1 for unknown words
UNK_INDEX = 1
word_to_index = {word: idx + 2 for idx, word in enumerate(word_vocab)}
tag_to_index = {tag: idx + 1 for idx, tag in enumerate(tag_vocab)}
index_to_tag = {idx: tag for tag, idx in tag_to_index.items()}
index_to_tag[0] = 'PAD'

# Convert words and tags to numerical values
X = [[word_to_index[word] for word in sent] for sent in nepali_sentences]
y = [[tag_to_index[tag] for tag in tag_seq] for tag_seq in nepali_pos_tags]

# Pad sequences
max_sequence_length = max(len(seq) for seq in X)
X_padded = pad_sequences(X, maxlen=max_sequence_length, padding='post')
y_padded = pad_sequences(y, maxlen=max_sequence_length, padding='post')

# Convert tags to one-hot encoding (one extra class for padding)
num_tags = len(tag_vocab) + 1
y_onehot = np.array([to_categorical(tag_seq, num_classes=num_tags) for tag_seq in y_padded])

# Build and compile the model
model = Sequential()
model.add(Embedding(input_dim=len(word_vocab) + 2, output_dim=50))
model.add(LSTM(100, return_sequences=True))
model.add(Dense(num_tags, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_padded, y_onehot, batch_size=32, epochs=10, validation_split=0.2)

# Inference: words not seen during training map to UNK_INDEX instead of raising KeyError
test_sentence = ['नेपाल', 'मा', 'राम्रो', 'ठाउँ', 'छ']
test_sequence = [word_to_index.get(word, UNK_INDEX) for word in test_sentence]
test_padded = pad_sequences([test_sequence], maxlen=max_sequence_length, padding='post')
predicted_probs = model.predict(test_padded)
# Keep only the positions of real words and map each argmax back to its tag
predicted_tags = [index_to_tag[int(np.argmax(tag_prob))]
                  for tag_prob in predicted_probs[0][:len(test_sentence)]]

print(predicted_tags)

# Nepali Text POS Tagging Using a Bi-LSTM

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Example data (replace with your own)
nepali_sentences = [['नेपाल', 'बाट', 'सुन्दर', 'माउन्ट', 'एभरेस्ट', 'देखिन्छ'],
                    ['म', 'काठमाडौं', 'जान्न', 'गइन्']]

nepali_pos_tags = [['NOUN', 'ADP', 'ADJ', 'NOUN', 'NOUN', 'VERB'],
                   ['PRON', 'NOUN', 'VERB', 'VERB']]

# Create sorted vocabularies so the index mappings are reproducible
word_vocab = sorted({word for sent in nepali_sentences for word in sent})
tag_vocab = sorted({tag for tag_seq in nepali_pos_tags for tag in tag_seq})

# Reserve index 0 for padding (words and tags) and index 1 for unknown words
UNK_INDEX = 1
word_to_index = {word: idx + 2 for idx, word in enumerate(word_vocab)}
tag_to_index = {tag: idx + 1 for idx, tag in enumerate(tag_vocab)}
index_to_tag = {idx: tag for tag, idx in tag_to_index.items()}
index_to_tag[0] = 'PAD'

# Convert words and tags to numerical values
X = [[word_to_index[word] for word in sent] for sent in nepali_sentences]
y = [[tag_to_index[tag] for tag in tag_seq] for tag_seq in nepali_pos_tags]

# Pad sequences
max_sequence_length = max(len(seq) for seq in X)
X_padded = pad_sequences(X, maxlen=max_sequence_length, padding='post')
y_padded = pad_sequences(y, maxlen=max_sequence_length, padding='post')

# Convert tags to one-hot encoding (one extra class for padding)
num_tags = len(tag_vocab) + 1
y_onehot = np.array([to_categorical(tag_seq, num_classes=num_tags) for tag_seq in y_padded])

# Build and compile the Bi-LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(word_vocab) + 2, output_dim=50))
model.add(Bidirectional(LSTM(100, return_sequences=True)))
model.add(Dense(num_tags, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_padded, y_onehot, batch_size=32, epochs=10, validation_split=0.2)

# Inference: words not seen during training map to UNK_INDEX instead of raising KeyError
test_sentence = ['नेपाल', 'मा', 'राम्रो', 'ठाउँ', 'छ']
test_sequence = [word_to_index.get(word, UNK_INDEX) for word in test_sentence]
test_padded = pad_sequences([test_sequence], maxlen=max_sequence_length, padding='post')
predicted_probs = model.predict(test_padded)
# Keep only the positions of real words and map each argmax back to its tag
predicted_tags = [index_to_tag[int(np.argmax(tag_prob))]
                  for tag_prob in predicted_probs[0][:len(test_sentence)]]

print(predicted_tags)


# Nepali Text POS Tagging Using mBERT

Using mBERT (Multilingual BERT) for Nepali text POS tagging involves fine-tuning a pre-trained mBERT model on a Nepali POS tagging dataset. Here's a general step-by-step guide to help you get started:

1. Data Preparation:
Prepare your Nepali POS tagged dataset. Each sentence should be tokenized into words and accompanied by their corresponding POS tags.

2. Tokenization and Padding:
Convert the Nepali sentences into sequences of word tokens and pad them to ensure uniform length.

3. Data Vectorization:
Convert the word tokens and POS tags into numerical vectors using techniques like one-hot encoding or word embeddings.

4. Preprocessing for mBERT:
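Tokenize your sentences with the pre-trained mBERT tokenizer. It is a multilingual WordPiece tokenizer rather than one built specifically for Nepali, so a single word may be split into several sub-tokens, and the word-level tags must be aligned with them. You will need the transformers library by Hugging Face for this.

5. Fine-Tuning mBERT:
Load a pre-trained mBERT model and fine-tune it on your Nepali POS tagging dataset. POS tagging needs a token-level classification layer on top of mBERT, giving one prediction per token rather than a single label per sentence.

6. Compile and Train:
Compile your model using an appropriate loss function (cross-entropy over the tag classes) and an optimizer with a small learning rate, as is usual for fine-tuning. Train the model on the prepared training data.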

7. Model Evaluation:
Evaluate your fine-tuned mBERT model's performance on a separate validation or test dataset. Calculate metrics like accuracy, precision, recall, and F1-score.

8. Inference:
Use your trained model for POS tagging by feeding it new Nepali sentences. The model will output the predicted POS tags for each word.
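
Steps 4 and 5 hide one detail: a WordPiece tokenizer may split one word into several sub-tokens, so word-level tags have to be aligned with sub-token positions before fine-tuning. A minimal sketch of one common alignment scheme follows: the first sub-token of each word carries the word's tag, and every other position receives an ignore label. The `word_ids` list is written by hand to stand in for what a tokenizer would report, the helper name `align_labels` is hypothetical, and `IGNORE = -100` is an assumed ignore value (it matches the default ignore index of PyTorch's cross-entropy loss).

```python
IGNORE = -100  # assumed label for positions the loss should skip

def align_labels(word_ids, word_tags, tag_to_index):
    """Give each word's tag to its first sub-token and ignore the rest.

    word_ids[i] is the index of the word that sub-token i came from,
    or None for special tokens such as [CLS], [SEP], and padding.
    """
    labels, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            labels.append(IGNORE)
        else:
            labels.append(tag_to_index[word_tags[word_id]])
        previous = word_id
    return labels

tag_to_index = {'NOUN': 0, 'PRON': 1, 'VERB': 2}
word_tags = ['PRON', 'NOUN', 'VERB']
# Hand-written stand-in for tokenizer output: [CLS], word 0 as one sub-token,
# word 1 split into three sub-tokens, word 2 split into two, then [SEP].
word_ids = [None, 0, 1, 1, 1, 2, 2, None]

labels = align_labels(word_ids, word_tags, tag_to_index)
print(labels)
```

At inference time the same mapping is read in reverse: the prediction at each word's first sub-token is taken as the tag for the whole word.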

Here's a simplified code snippet illustrating the implementation using the transformers library and Keras:

In [None]:
# Requires the Hugging Face transformers library: pip install transformers
import numpy as np
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForTokenClassification

# Example data (replace with your own)
nepali_sentences = [['नेपाल', 'बाट', 'सुन्दर', 'माउन्ट', 'एभरेस्ट', 'देखिन्छ'],
                    ['म', 'काठमाडौं', 'जान्न', 'गइन्']]

nepali_pos_tags = [['NOUN', 'ADP', 'ADJ', 'NOUN', 'NOUN', 'VERB'],
                   ['PRON', 'NOUN', 'VERB', 'VERB']]

# Create a sorted tag list so the index mapping is reproducible
tag_vocab = sorted({tag for tag_seq in nepali_pos_tags for tag in tag_seq})
tag_to_index = {tag: idx for idx, tag in enumerate(tag_vocab)}
num_tags = len(tag_vocab)

# Load the mBERT tokenizer (the fast version exposes word_ids() for label alignment)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')

# Tokenize the pre-split sentences into WordPiece sub-tokens
encodings = tokenizer(nepali_sentences, is_split_into_words=True, padding=True, truncation=True)

# Align tags with sub-tokens: label the first sub-token of each word and mark
# special tokens, padding, and continuation sub-tokens with -100 (ignored by the loss)
labels = []
for i, tag_seq in enumerate(nepali_pos_tags):
    previous_word = None
    row = []
    for word_id in encodings.word_ids(batch_index=i):
        if word_id is None or word_id == previous_word:
            row.append(-100)
        else:
            row.append(tag_to_index[tag_seq[word_id]])
        previous_word = word_id
    labels.append(row)

# Load mBERT with a token classification head (one prediction per sub-token)
model = TFBertForTokenClassification.from_pretrained('bert-base-multilingual-cased', num_labels=num_tags)

# Compile without a loss argument so the model's built-in token classification loss is used
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))

# Train the model
train_dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(2)
model.fit(train_dataset, epochs=10)

# Inference: read the prediction at the first sub-token of each word
test_sentence = ['नेपाल', 'मा', 'राम्रो', 'ठाउँ', 'छ']
test_input = tokenizer([test_sentence], is_split_into_words=True, return_tensors='tf')
logits = model(test_input).logits[0].numpy()
predicted_tags = []
previous_word = None
for position, word_id in enumerate(test_input.word_ids(batch_index=0)):
    if word_id is not None and word_id != previous_word:
        predicted_tags.append(tag_vocab[int(np.argmax(logits[position]))])
    previous_word = word_id

print(predicted_tags)