# Neural Part-of-Speech Tagger (POS Tagger)

The goal of this project is to build a neural part-of-speech tagger.

The data is in JSON format. Each sentence is a list of tokens, and each token is an object with the following keys:
- word: word in the particular sentence
- upos: Universal part-of-speech tag
- xpos: Language-specific part-of-speech tag

Metrics: We evaluate our models with accuracy (the fraction of positions assigned the correct tag), which is a standard metric for POS tagging.

author: Pratyush Mohit

In [497]:
import numpy as np
import json
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, LSTM, Dense, TimeDistributed, Activation
from tensorflow.keras.models import Model

import warnings
warnings.filterwarnings('ignore')

In [2]:
with open('telugu_pos (1).json', 'r') as f:
    data = json.load(f)

In [3]:
data[0:3]

[[{'word': 'మరో', 'upos': 'avy', 'xpos': 'QT_QTF'},
  {'word': 'సంగతి', 'upos': 'n', 'xpos': 'N_NN'},
  {'word': 'మీకు', 'upos': 'pn', 'xpos': 'PR_PRP'},
  {'word': 'తెలుసా', 'upos': 'avy', 'xpos': 'V_VM'},
  {'word': '?', 'upos': 'punc', 'xpos': 'RD_PUNC'}],
 [{'word': 'అందరి', 'upos': 'pn', 'xpos': 'PR_PRP'},
  {'word': 'ముందూ', 'upos': 'n', 'xpos': 'N_NST'},
  {'word': 'నా', 'upos': 'pn', 'xpos': 'PR_PRP'},
  {'word': 'తెల్లబట్ట', 'upos': 'n', 'xpos': 'N_NN'},
  {'word': 'బాధ', 'upos': 'n', 'xpos': 'N_NN'},
  {'word': 'ఎలా', 'upos': 'avy', 'xpos': 'PR_PRQ'},
  {'word': 'చెప్పుకొనేది', 'upos': 'unk', 'xpos': 'V_VM'},
  {'word': '?', 'upos': 'punc', 'xpos': 'RD_PUNC'}],
 [{'word': 'ఇట్లా', 'upos': 'avy', 'xpos': 'RB'},
  {'word': 'ఎందుకు', 'upos': 'avy', 'xpos': 'PR_PRQ'},
  {'word': 'జరుగుతోంది', 'upos': 'v', 'xpos': 'V_VM'},
  {'word': '?', 'upos': 'punc', 'xpos': 'RD_PUNC'}]]

In [4]:
data[0]

[{'word': 'మరో', 'upos': 'avy', 'xpos': 'QT_QTF'},
 {'word': 'సంగతి', 'upos': 'n', 'xpos': 'N_NN'},
 {'word': 'మీకు', 'upos': 'pn', 'xpos': 'PR_PRP'},
 {'word': 'తెలుసా', 'upos': 'avy', 'xpos': 'V_VM'},
 {'word': '?', 'upos': 'punc', 'xpos': 'RD_PUNC'}]

In [5]:
# We will create two label sets: one for upos tags and one for xpos tags

In [6]:
all_sentences = []
all_upos = []
all_xpos = []
for sentence in tqdm(data):
    current_sentence = []
    current_upos = []
    current_xpos = []
    for word in sentence:
        current_sentence.append(word['word'])
        current_upos.append(word['upos'])
        current_xpos.append(word['xpos'])
    all_sentences.append(current_sentence)
    all_upos.append(current_upos)
    all_xpos.append(current_xpos)

100%|███████████████████████████████████████████████████████████████████████████████████████| 3185/3185 [00:00<00:00, 398153.86it/s]


In [7]:
print(len(all_sentences))
print(len(all_upos))
print(len(all_xpos))

3185
3185
3185


# We are now going to build a model for predicting upos

In [209]:
train_sentences, test_sentences, train_upos, test_upos = train_test_split(all_sentences, all_upos, test_size=0.2)

In [449]:
print(len(train_sentences))
print(len(test_sentences))
print(len(train_upos))
print(len(test_upos))

2548
637
2548
637


In [450]:
train_sentences[0:5]

[['చదువు', 'తెలివిని', 'పెంచుతుంది', '.'],
 ['పోలీసుల', 'కంట', 'పడిండు', '.'],
 ['రెండో', 'ఏడు', 'నిండేలోపల', 'మెదడు', 'బాగా', 'పెరుగుతుంది', '.'],
 ['పాత', 'రకం', 'విత్తనాలు', 'ఈ', 'వ్యాధులను', 'ణూLL', '.'],
 ['కడుపులో', 'తిరుగుతూ', '.']]

In [451]:
train_upos[0:5]

[['n', 'n', 'v', 'punc'],
 ['n', 'unk', 'unk', 'punc'],
 ['adj', 'n', 'v', 'n', 'avy', 'v', 'punc'],
 ['adj', 'n', 'n', 'avy', 'n', 'unk', 'punc'],
 ['n', 'v', 'punc']]

In [452]:
words, upos = set([]), set([])
 
for sentence in train_sentences:
    for word in sentence:
        words.add(word.lower())

for tag in train_upos:
    for t in tag:
        upos.add(t)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs
 
upos2index = {t: i + 1 for i, t in enumerate(list(upos))}
upos2index['-PAD-'] = 0

In [453]:
len(words)

4970

In [454]:
len(upos)

24

In [455]:
train_sentences_x, test_sentences_x, train_upos_y, test_upos_y = [], [], [], []
 
for sentence in train_sentences:
    sentence_int = []
    for word in sentence:
        try:
            sentence_int.append(word2index[word.lower()])
        except KeyError:
            sentence_int.append(word2index['-OOV-'])
    train_sentences_x.append(sentence_int)

for sentence in test_sentences:
    sentence_int = []
    for word in sentence:
        try:
            sentence_int.append(word2index[word.lower()])
        except KeyError:
            sentence_int.append(word2index['-OOV-'])
    test_sentences_x.append(sentence_int)

for s in train_upos:
    train_upos_y.append([upos2index[t] for t in s])

for s in test_upos:
    test_upos_y.append([upos2index[t] for t in s])

In [456]:
print(train_sentences_x[0])
print(test_sentences_x[0])
print(train_upos_y[0])
print(test_upos_y[0])

[390, 3852, 2949, 1705]
[1, 1, 647, 1, 1, 3449, 1, 1437, 1705]
[6, 6, 24, 7]
[12, 6, 7, 6, 6, 12, 6, 24, 7]
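The encoding step above can be sketched in a more compact form. This is a minimal illustration, not the notebook's own code: `build_index` and `encode` are hypothetical helper names, and the toy vocabulary is made up from words that appear in the training sample. Sorting the tokens is an added assumption that makes the ids reproducible between runs, whereas iterating over a `set` directly does not guarantee a stable order.

```python
PAD, OOV = '-PAD-', '-OOV-'

def build_index(tokens, specials=(PAD, OOV)):
    """Map each distinct token to an integer id, reserving the first ids for specials."""
    index = {s: i for i, s in enumerate(specials)}
    # sorted() gives a stable order; a bare set may enumerate differently per run
    for token in sorted(set(tokens)):
        if token not in index:
            index[token] = len(index)
    return index

def encode(sentence, index):
    """Replace each token by its id, falling back to the OOV id for unseen tokens."""
    return [index.get(token, index[OOV]) for token in sentence]

# Toy vocabulary (hypothetical): two words and the full stop
toy_index = build_index(['చదువు', 'తెలివిని', '.'])
# 'మెదడు' is not in the toy vocabulary, so it is encoded as -OOV-
encoded = encode(['చదువు', 'మెదడు', '.'], toy_index)
print(encoded)
```

Using `dict.get` with a default does the same job as the `try`/`except KeyError` blocks above, in one expression.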


In [457]:
MAX_LENGTH = len(max(train_sentences_x, key=len))
print(MAX_LENGTH)

26


In [219]:
train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_upos_y = pad_sequences(train_upos_y, maxlen=MAX_LENGTH, padding='post')
test_upos_y = pad_sequences(test_upos_y, maxlen=MAX_LENGTH, padding='post')

In [220]:
print(train_sentences_x[0])
print(test_sentences_x[0])
print(train_upos_y[0])
print(test_upos_y[0])

[ 390 3852 2949 1705    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]
[   1    1  647    1    1 3449    1 1437 1705    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]
[ 6  6 24  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0]
[12  6  7  6  6 12  6 24  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0]
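To make the effect of `pad_sequences(..., padding='post')` concrete, here is a minimal pure-Python sketch of the same behaviour. The helper name `pad_post` is hypothetical. It assumes `maxlen >= 1` and mirrors the Keras default `truncating='pre'`, which means a sequence longer than `maxlen` loses items from its front. That matters here because `MAX_LENGTH` is computed from the training split only, so a longer test sentence would be cut.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items of longer ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# First row is shorter than maxlen, second row is longer (hypothetical ids)
rows = pad_post([[390, 3852, 2949, 1705], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(rows)
```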


In [221]:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)

In [222]:
cat_train_upos_y = to_categorical(train_upos_y, len(upos2index))
cat_test_upos_y = to_categorical(test_upos_y, len(upos2index))

## Model 1 - Vanilla Recurrent Neural Network

In [312]:
tf.keras.backend.clear_session()

input_layer_1 = Input(shape=(MAX_LENGTH,))
embedding_1 = Embedding(input_dim=len(word2index), output_dim=100)(input_layer_1)
rnn = SimpleRNN(100, return_sequences=True)(embedding_1)
output_1 = TimeDistributed(Dense(len(upos2index)))(rnn)
activation_1 = Activation('softmax')(output_1)

In [313]:
model1 = Model(inputs=[input_layer_1], outputs=[activation_1])

In [314]:
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [315]:
model1.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 26)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 26, 100)           497200    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 26, 100)           20100     
_________________________________________________________________
time_distributed (TimeDistri (None, 26, 25)            2525      
_________________________________________________________________
activation (Activation)      (None, 26, 25)            0         
Total params: 519,825
Trainable params: 519,825
Non-trainable params: 0
_________________________________________________________________


In [316]:
history1 = model1.fit(train_sentences_x, cat_train_upos_y, batch_size=128, epochs=40, validation_data=(test_sentences_x, cat_test_upos_y))

Epoch 1/40
...
Epoch 40/40


In [317]:
scores = model1.evaluate(test_sentences_x, cat_test_upos_y)
print(f"{model1.metrics_names[1]}: {scores[1] * 100}")

accuracy: 96.05120420455933


## Model 2 - Long Short Term Memory

In [378]:
tf.keras.backend.clear_session()

input_layer_2 = Input(shape=(MAX_LENGTH,))
embedding_2 = Embedding(input_dim=len(word2index), output_dim=128)(input_layer_2)
lstm = LSTM(256, return_sequences=True)(embedding_2)
output_2 = TimeDistributed(Dense(len(upos2index)))(lstm)
activation_2 = Activation('softmax')(output_2)

In [379]:
model2 = Model(inputs=[input_layer_2], outputs=[activation_2])

In [380]:
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [381]:
model2.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 26)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 26, 128)           636416    
_________________________________________________________________
lstm (LSTM)                  (None, 26, 256)           394240    
_________________________________________________________________
time_distributed (TimeDistri (None, 26, 25)            6425      
_________________________________________________________________
activation (Activation)      (None, 26, 25)            0         
Total params: 1,037,081
Trainable params: 1,037,081
Non-trainable params: 0
_________________________________________________________________


In [382]:
history2 = model2.fit(train_sentences_x, cat_train_upos_y, batch_size=128, epochs=40, validation_data=(test_sentences_x, cat_test_upos_y))

Epoch 1/40
...
Epoch 40/40


In [383]:
scores = model2.evaluate(test_sentences_x, cat_test_upos_y)
print(f"{model2.metrics_names[1]}: {scores[1] * 100}")

accuracy: 96.56442403793335


In [498]:
model2.save('pos_tagger.h5')

# Observations:
We have built two models to predict the upos (Universal Parts Of Speech) tag for each word in a sentence from the dataset. The first model is a simple (vanilla) RNN, which reaches an accuracy of 96.0512% on the validation data. The second model is an LSTM (unidirectional, as defined in the code above), which reaches 96.5644%. Both models end in a time-distributed dense layer, which applies the same classifier at every timestep so that each token gets its own tag prediction.
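The role of the time-distributed layer can be illustrated without Keras. The sketch below is only an illustration with made-up numbers, and `dense` and `time_distributed_dense` are hypothetical names. It shows the idea that one set of weights is reused for every position in the sequence, so the output has one score vector per token.

```python
def dense(vector, weights, bias):
    """One dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer to every timestep independently."""
    return [dense(step, weights, bias) for step in sequence]

# Hypothetical sizes: 3 timesteps, 2 hidden features, 3 output tags
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

scores = time_distributed_dense(hidden_states, weights, bias)
print(scores)
```

In the models above, the recurrent layer supplies the hidden states and a softmax turns each score vector into tag probabilities.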

Model 2 performs slightly better than model 1, i.e., the LSTM outperforms the simple RNN by about 0.5 percentage points on this split. LSTMs are designed to capture long-term dependencies, so on long sentences they tend to retain context better than a simple RNN. On a larger, more complex dataset with many long sentences, a simple RNN is likely to degrade, whereas an LSTM has more capacity to capture complex patterns in the data.
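One caveat when reading these numbers: the `Embedding` layers above are created without `mask_zero=True`, so the reported `accuracy` appears to be averaged over all 26 positions of every padded sentence, including the `-PAD-` positions, which are easy to predict. The sketch below shows how the two ways of counting can differ. `token_accuracy` is a hypothetical helper, and the predicted ids are made up for illustration.

```python
def token_accuracy(true_ids, pred_ids, pad_id=None):
    """Fraction of matching positions; positions whose true id is `pad_id` are skipped."""
    correct = total = 0
    for true_row, pred_row in zip(true_ids, pred_ids):
        for t, p in zip(true_row, pred_row):
            if pad_id is not None and t == pad_id:
                continue  # ignore padding when a pad id is given
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# One padded sentence with four real tokens; one real token is mispredicted
y_true = [[6, 6, 24, 7, 0, 0, 0, 0]]
y_pred = [[6, 6, 12, 7, 0, 0, 0, 0]]

with_padding = token_accuracy(y_true, y_pred)
without_padding = token_accuracy(y_true, y_pred, pad_id=0)
print(with_padding, without_padding)
```

With real model output, `pred_ids` would come from taking the argmax over the tag dimension of `model.predict(...)`.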

# Now we will alter the sentences a bit and evaluate the model performance

In [196]:
# We will remove all words with three or fewer characters (len() counts Unicode code points)

In [424]:
train_sentences_augmented = []
all_indices = []

for sentence in train_sentences:
    current_sentence = []
    current_index = []
    for index, word in enumerate(sentence):
        if len(word) > 3:
            current_sentence.append(word)
            current_index.append(index)
    train_sentences_augmented.append(current_sentence)
    all_indices.append(current_index)

In [458]:
train_sentences_augmented[0:5]

[['చదువు', 'తెలివిని', 'పెంచుతుంది'],
 ['పోలీసుల', 'పడిండు'],
 ['రెండో', 'నిండేలోపల', 'మెదడు', 'బాగా', 'పెరుగుతుంది'],
 ['విత్తనాలు', 'వ్యాధులను', 'ణూLL'],
 ['కడుపులో', 'తిరుగుతూ']]
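The filter keeps a word only when `len(word) > 3`. For Telugu text, `len` counts Unicode code points (consonants, vowel signs and other marks separately), not visual syllables, which is why a word such as 'కంట' is dropped. As a minimal sketch, words and tags can also be filtered together in one pass so they cannot fall out of alignment. `drop_short_words` is a hypothetical helper, not part of the notebook.

```python
def drop_short_words(sentence, tags, min_length=4):
    """Keep only (word, tag) pairs whose word has at least `min_length` code points."""
    kept = [(w, t) for w, t in zip(sentence, tags) if len(w) >= min_length]
    words = [w for w, _ in kept]
    kept_tags = [t for _, t in kept]
    return words, kept_tags

# Second training sentence from the sample shown earlier, with its upos tags
words_kept, tags_kept = drop_short_words(
    ['పోలీసుల', 'కంట', 'పడిండు', '.'],
    ['n', 'unk', 'unk', 'punc'],
)
print(words_kept, tags_kept)
```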

In [459]:
train_upos_augmented = []
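for indices, tags in zip(all_indices, train_upos):
    # keep only the tags whose words survived the length filter
    # (named `tags` so the loop does not overwrite the `upos` tag set)
    train_upos_augmented.append(list(np.array(tags)[indices]))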

In [460]:
train_upos_augmented[0:5]

[['n', 'n', 'v'],
 ['n', 'unk'],
 ['adj', 'v', 'n', 'avy', 'v'],
 ['n', 'n', 'unk'],
 ['n', 'v']]

In [469]:
words_aug, upos_aug = set([]), set([])
 
for sentence in train_sentences_augmented:
    for word in sentence:
        words_aug.add(word.lower())

for tag in train_upos_augmented:
    for t in tag:
        upos_aug.add(t)

word2index_aug = {w: i + 2 for i, w in enumerate(list(words_aug))}
word2index_aug['-PAD-'] = 0  # The special value used for padding
word2index_aug['-OOV-'] = 1  # The special value used for OOVs
 
upos2index_aug = {t: i + 2 for i, t in enumerate(list(upos_aug))}  # ids 0 and 1 are reserved
upos2index_aug['-PAD-'] = 0
upos2index_aug['-OOV-'] = 1  # used for tags that occur only on the removed words

In [462]:
len(words_aug)

4733

In [463]:
len(upos_aug)

21

In [470]:
train_sentences_x, test_sentences_x, train_upos_y, test_upos_y = [], [], [], []
 
for sentence in train_sentences:
    sentence_int = []
    for word in sentence:
        try:
            sentence_int.append(word2index_aug[word.lower()])
        except KeyError:
            sentence_int.append(word2index_aug['-OOV-'])
    train_sentences_x.append(sentence_int)

for sentence in test_sentences:
    sentence_int = []
    for word in sentence:
        try:
            sentence_int.append(word2index_aug[word.lower()])
        except KeyError:
            sentence_int.append(word2index_aug['-OOV-'])
    test_sentences_x.append(sentence_int)

for sentence in train_upos:
    sentence_int = []
    for word in sentence:
        try:
            sentence_int.append(upos2index_aug[word.lower()])
        except KeyError:
            sentence_int.append(upos2index_aug['-OOV-'])
    train_upos_y.append(sentence_int)

for sentence in test_upos:
    sentence_int = []
    for word in sentence:
        try:
            sentence_int.append(upos2index_aug[word.lower()])
        except KeyError:
            sentence_int.append(upos2index_aug['-OOV-'])
    test_upos_y.append(sentence_int)

In [471]:
MAX_LENGTH = len(max(train_sentences_x, key=len))
print(MAX_LENGTH)

26


In [472]:
train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_upos_y = pad_sequences(train_upos_y, maxlen=MAX_LENGTH, padding='post')
test_upos_y = pad_sequences(test_upos_y, maxlen=MAX_LENGTH, padding='post')

In [473]:
cat_train_upos_y = to_categorical(train_upos_y, len(upos2index))
cat_test_upos_y = to_categorical(test_upos_y, len(upos2index))

## Model 1 - Recurrent Neural Network

In [480]:
tf.keras.backend.clear_session()

input_layer_1_aug = Input(shape=(MAX_LENGTH,))
embedding_1_aug = Embedding(input_dim=len(word2index), output_dim=100)(input_layer_1_aug)
rnn_aug = SimpleRNN(100, return_sequences=True)(embedding_1_aug)
output_1_aug = TimeDistributed(Dense(len(upos2index)))(rnn_aug)
activation_1_aug = Activation('softmax')(output_1_aug)

In [481]:
model1_aug = Model(inputs=[input_layer_1_aug], outputs=[activation_1_aug])

In [482]:
model1_aug.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [483]:
model1_aug.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 26)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 26, 100)           497200    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 26, 100)           20100     
_________________________________________________________________
time_distributed (TimeDistri (None, 26, 25)            2525      
_________________________________________________________________
activation (Activation)      (None, 26, 25)            0         
Total params: 519,825
Trainable params: 519,825
Non-trainable params: 0
_________________________________________________________________


In [484]:
history1_aug = model1_aug.fit(train_sentences_x, cat_train_upos_y, batch_size=128, epochs=40, validation_data=(test_sentences_x, cat_test_upos_y))

Epoch 1/40
...
Epoch 40/40


In [485]:
scores = model1_aug.evaluate(test_sentences_x, cat_test_upos_y)
print(f"{model1_aug.metrics_names[1]}: {scores[1] * 100}")

accuracy: 92.61562824249268


## Model 2 - Long Short Term Memory

In [491]:
tf.keras.backend.clear_session()

input_layer_2_aug = Input(shape=(MAX_LENGTH,))
embedding_2_aug = Embedding(input_dim=len(word2index), output_dim=128)(input_layer_2_aug)
lstm_aug = LSTM(256, return_sequences=True)(embedding_2_aug)
output_2_aug = TimeDistributed(Dense(len(upos2index)))(lstm_aug)
activation_2_aug = Activation('softmax')(output_2_aug)

In [492]:
model2_aug = Model(inputs=[input_layer_2_aug], outputs=[activation_2_aug])

In [493]:
model2_aug.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [494]:
model2_aug.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 26)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 26, 128)           636416    
_________________________________________________________________
lstm (LSTM)                  (None, 26, 256)           394240    
_________________________________________________________________
time_distributed (TimeDistri (None, 26, 25)            6425      
_________________________________________________________________
activation (Activation)      (None, 26, 25)            0         
Total params: 1,037,081
Trainable params: 1,037,081
Non-trainable params: 0
_________________________________________________________________


In [495]:
history2_aug = model2_aug.fit(train_sentences_x, cat_train_upos_y, batch_size=128, epochs=40, validation_data=(test_sentences_x, cat_test_upos_y))

Epoch 1/40
...
Epoch 40/40


In [496]:
scores = model2_aug.evaluate(test_sentences_x, cat_test_upos_y)
print(f"{model2_aug.metrics_names[1]}: {scores[1] * 100}")

accuracy: 93.2314932346344


# Observations:
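As a part of data augmentation, we removed words with three or fewer characters and rebuilt the word and tag mappings from the filtered training sentences. We then applied the same preprocessing steps and trained the same model architectures. Note that the models are still fed the original sentences, encoded with the reduced mappings, so every removed word (including punctuation) is mapped to -OOV-.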

The validation accuracy drops by a considerable margin compared with the models trained with the full vocabulary. Comparing the two models on the altered data, the LSTM again performs better than the simple RNN: the validation accuracy is 92.6156% for the simple RNN and 93.2315% for the LSTM. On this split, the LSTM is therefore the stronger model in both settings, although the margin is small.
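One plausible contributor to the drop is the share of tokens that no longer have their own id once short words are removed from the vocabulary. The sketch below measures that share on a toy example. `oov_rate` is a hypothetical helper and the one-sentence sample is for illustration only. It is not a measurement on the actual dataset.

```python
def oov_rate(sentences, vocabulary):
    """Fraction of tokens in `sentences` that are missing from `vocabulary`."""
    tokens = [word for sentence in sentences for word in sentence]
    if not tokens:
        return 0.0
    unseen = sum(1 for word in tokens if word not in vocabulary)
    return unseen / len(tokens)

# Toy vocabulary before and after removing words of three or fewer code points
full_vocab = {'చదువు', 'తెలివిని', 'పెంచుతుంది', '.'}
filtered_vocab = {word for word in full_vocab if len(word) > 3}

sample = [['చదువు', 'తెలివిని', 'పెంచుతుంది', '.']]
rate_full = oov_rate(sample, full_vocab)
rate_filtered = oov_rate(sample, filtered_vocab)
print(rate_full, rate_filtered)
```

Running the same function on `train_sentences` and `test_sentences` with `word2index` and `word2index_aug` would show how much of the input each model sees as -OOV-.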

The same preprocessing techniques and model architecture can be used for predicting xpos tags as well.