### file path:

C:\Users\Computer\Desktop\nnc_updated_ah\gc\books\book-academic-jana-siksha-2058

### how to preprocess this type of data to train postagging model using LSTM algorithms. explain step by step procedure with code

Preprocessing textual data for training a part-of-speech tagging model using LSTM algorithms involves several steps. I'll provide you with a step-by-step procedure along with Python code snippets for each step. For this example, I'll assume you're using the NLTK library for natural language processing tasks.

# 1. Data Loading and Parsing:

Load the XML data and extract the text content for preprocessing.

In [None]:
import xml.etree.ElementTree as ET

# Load the XML data
tree = ET.parse('book-academic-jana-siksha-2058.xml')
root = tree.getroot()

# Extract text content from <w> tags
text_data = ' '.join([w.text for w in root.iter('w')])
# text_data


# 2. Text Tokenization:
Tokenize the text data into words.

In [None]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

# Tokenize the text data
tokens = word_tokenize(text_data)
# tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 3. Remove Special Tags:
Remove any special tags like `< cesDoc >`, `< cesHeader >`, etc., and keep only the actual word tokens.

In [None]:
tokens = [token for token in tokens if not token.startswith('<')]
# tokens

# 4. Text Normalization:
Normalize the text by converting it to lowercase.

In [None]:
normalized_tokens = [token.lower() for token in tokens]


# 5. Create Vocabulary and Index Mapping:
Create a vocabulary and map words to unique indices.

In [None]:
vocab = set(normalized_tokens)
word_to_index = {word: index + 1 for index, word in enumerate(vocab)}  # Start index from 1
index_to_word = {index: word for word, index in word_to_index.items()}


# 6. Convert Words to Indices:
Convert tokens to corresponding indices.

In [None]:
indexed_data = [word_to_index[word] for word in normalized_tokens]


# 7. Generate Training Examples:
Generate input-output pairs for training the LSTM model. The input is a sequence of words, and the output is the corresponding part-of-speech tags.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

from nltk import pos_tag

# Perform part-of-speech tagging on the original text
pos_tags = pos_tag(tokens)

# Create a mapping from part-of-speech tags to unique indices
tag_set = set(tag for word, tag in pos_tags)
tag_to_index = {tag: index for index, tag in enumerate(tag_set)}

# Generate training examples
sequence_length = 10  # You can adjust this value as needed
X = [indexed_data[i:i+sequence_length] for i in range(len(indexed_data) - sequence_length)]
y = [tag_to_index[pos_tags[i + sequence_length][1]] for i in range(len(indexed_data) - sequence_length)]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# 8. Padding Sequences:
Pad sequences to a fixed length for training.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_padded = pad_sequences(X, maxlen=sequence_length, padding='post')
X_padded

array([[4491, 4548, 5103, ..., 4785, 5698, 3572],
       [4548, 5103, 2933, ..., 5698, 3572, 3678],
       [5103, 2933, 6642, ..., 3572, 3678, 1165],
       ...,
       [1454, 2304, 4482, ..., 2180, 4183, 5029],
       [2304, 4482, 5426, ..., 4183, 5029, 5016],
       [4482, 5426, 5029, ..., 5029, 5016, 2259]], dtype=int32)

# 9. One-Hot Encoding:
Convert part-of-speech tags to one-hot encoded vectors.

 `np.eye` function is used to create a 2D identity matrix with ones on the diagonal and zeros elsewhere.

`len(tag_set)` : Number of rows in the output matrix

In [None]:
import numpy as np

y_encoded = np.eye(len(tag_set))[y]


In [None]:
y_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# 2nd part


# 10. Train Word2Vec Model:
Train a Word2Vec model on your tokenized text data. You can use the `Word2Vec` class from the `gensim` library.

In [None]:
from gensim.models import Word2Vec

# Train Word2Vec model
w2v_model = Word2Vec(sentences=[normalized_tokens], vector_size=100, window=5, min_count=1, sg=0)


# 11. Get Word Vectors:
Once you have the trained Word2Vec model, you can obtain the dense vectors for words.

In [None]:

# Get vector for a specific word
word_vector = w2v_model.wv['म']

# Get vectors for a list of words
word_vectors = [w2v_model.wv[word] for word in normalized_tokens]


# 12. Padding Word Vectors:
If you want to maintain a fixed sequence length, you'll need to pad the word vectors similar to what you did with the integer indices in the previous example.

In [None]:
X_word2vec = [word_vectors[i:i+sequence_length] for i in range(len(word_vectors) - sequence_length)]
X_word2vec_padded = pad_sequences(X_word2vec, maxlen=sequence_length, padding='post')


Now you have the word vectors from the Word2Vec model represented as dense vectors ready for training.

# 13. Data Splitting:

divide dataset into a training set and a test set using the common 80-20 split.

training the model on the training set, it evaluates the model's performance on the test set using the evaluate method.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_word2vec_padded, y_encoded, test_size=0.2, random_state=42)

# 13. Build and Train LSTM Model:
Next, you can build an LSTM model to predict part-of-speech tags using the word vectors as input.

In [None]:
# from keras.models import Sequential
# from keras.layers import LSTM, Dense

# # Build LSTM model
# model = Sequential()
# model.add(LSTM(units=128, input_shape=(sequence_length, 100)))  # Assuming vector_size=100 in Word2Vec
# model.add(Dense(len(tag_set), activation='softmax'))

# # Compile the model
# model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# # Train the model
# model.fit(X_word2vec_padded, y_encoded, epochs=10, batch_size=32, validation_split=0.2)


This code builds and trains an LSTM model using the Word2Vec word vectors as input. You might need to adjust hyperparameters, such as the number of LSTM units, the vector size in the Word2Vec model, and the training parameters, according to your specific problem and dataset.

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Build LSTM model
model = Sequential()
model.add(LSTM(units=128, input_shape=(sequence_length, 100)))  # Assuming vector_size=100 in Word2Vec
model.add(Dense(len(tag_set), activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model on the training set
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.6415, Test accuracy: 0.8794


# All the above step are together


In [3]:
!pip install tensorflow



In [4]:
!pip install nltk gensim scikit-learn




In [5]:
!pip install --upgrade tensorflow nltk gensim scikit-learn


Collecting tensorflow
  Downloading tensorflow-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (524.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m524.1/524.1 MB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
Collecting scikit-learn
  Downloading scikit_learn-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.8/10.8 MB[0m [31m23.0 MB/s[0m eta [36m0:00:00[0m
Collecting keras<2.14,>=2.13.1 (from tensorflow)
  Downloading keras-2.13.1-py3-none-any.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m67.8 MB/s[0m eta [36m0:00:00[0m
Collecting tensorboard<2.14,>=2.13 (from tensorflow)
  Downloading tensorboard-2.13.0-py3-none-any.whl (5.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.6/5.6 MB[0m [31m48.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tensorflow-estimator<2.14,>=2.13.0 (fro

In [7]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

import xml.etree.ElementTree as ET

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag




# Step 1: Data Preprocessing
# Load and preprocess your XML data
### Load the XML data
tree = ET.parse('book-academic-jana-siksha-2058.xml')
root = tree.getroot()

### Extract text content from <w> tags
text_data = ' '.join([w.text for w in root.iter('w')])



# Tokenize and normalize the text data
### Tokenize the text data
tokens = word_tokenize(text_data)

### remove special tags
tokens = [token for token in tokens if not token.startswith('<')]

### Text normalization
normalized_tokens = [token.lower() for token in tokens]

# Create vocabulary and index mapping
vocab = set(normalized_tokens)
word_to_index = {word: index + 1 for index, word in enumerate(vocab)}  # Start index from 1
index_to_word = {index: word for word, index in word_to_index.items()}

# Convert words to indices
indexed_data = [word_to_index[word] for word in normalized_tokens]

# Define the sequence length for LSTM input
sequence_length = 10

# Step 2: Word Embedding using Word2Vec
w2v_model = Word2Vec(sentences=[normalized_tokens], vector_size=100, window=5, min_count=1, sg=0)

# Get word vectors
### Train Word2Vec model
w2v_model = Word2Vec(sentences=[normalized_tokens], vector_size=100, window=5, min_count=1, sg=0)

### Get vector for a specific word
word_vector = w2v_model.wv['म']

### Get vectors for a list of words
word_vectors = [w2v_model.wv[word] for word in normalized_tokens]

# Pad word vectors
X_word2vec = [word_vectors[i:i+sequence_length] for i in range(len(word_vectors) - sequence_length)]
X_word2vec_padded = pad_sequences(X_word2vec, maxlen=sequence_length, padding='post')
#-------------------------------------------------------------------------------------------------------------------


# Encoding Labels (e.g., one-hot encoding or label encoding):
### Perform part-of-speech tagging on the original text
pos_tags = pos_tag(tokens)

### Create a mapping from part-of-speech tags to unique indices
tag_set = set(tag for word, tag in pos_tags)
tag_to_index = {tag: index for index, tag in enumerate(tag_set)}

### generate training example
y = [tag_to_index[pos_tags[i + sequence_length][1]] for i in range(len(indexed_data) - sequence_length)]

# Encoding Labels (e.g., one-hot encoding or label encoding):
y_encoded = np.eye(len(tag_set))[y]
# -------------------------------------------------------------------------------------------------------------


# Step 3: Data Splitting
X_train, X_test, y_train, y_test = train_test_split(X_word2vec_padded, y_encoded, test_size=0.2, random_state=42)

# Step 4: Create LSTM Model
model = Sequential()
model.add(LSTM(units=128, input_shape=(sequence_length, 100)))
model.add(Dense(len(tag_set), activation='softmax'))

# Step 5: Compile Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Step 6: Train Model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Step 7: Evaluate Model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.6394, Test accuracy: 0.8794
