## Home task: Recurrent neural networks (RNNs)

1) Find a text to train on (any book)<br>
2) Build train and validation sets<br>
3) Train a bidirectional language model that predicts the PoS of a word based on its `n_context = 3` neighbours on the left and `n_context = 3` neighbours on the right<br>
4) Evaluate the model

---

The novel Moby Dick by Herman Melville is used as the training text for a recurrent neural network. [Moby Dick](https://www.gutenberg.org/ebooks/2701) is included in the Gutenberg corpus available in the NLTK library.

In [None]:
import nltk
from nltk.corpus import gutenberg

In [None]:
nltk.download('gutenberg', quiet=True)

True

In [None]:
# Load the raw text of Moby Dick from the Gutenberg corpus
moby_dick = gutenberg.raw('melville-moby_dick.txt')
print(moby_dick[:500])

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teac


In [None]:
import re

def preprocess(text):
    '''
    Lowercases the text and strips bracketed spans, digits, and a few special symbols
    :param text: text to pre-process
    :type text: str
    :return: pre-processed text
    :rtype: str
    '''
    text = text.lower()
    text = re.sub(r'\[.*\]', '', text)
    text = re.sub(r'\d+', "", text)
    text = re.sub(r'["|()_]', "", text)
    return text

# Pre-process the Moby Dick text
moby_dick = preprocess(moby_dick)
print(moby_dick[:500])




etymology.

supplied by a late consumptive usher to a grammar school

the pale usher--threadbare in coat, heart, body, and brain; i see him
now.  he was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  he loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

while you take in hand to school others, and to teach them by what
name a whale-fish is t
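
One detail of `preprocess` is worth checking: the pattern `\[.*\]` is greedy, so on a line with several bracketed spans it removes everything from the first `[` to the last `]`. Since `.` does not match newlines by default, the match never crosses a line break. The sketch below compares it with a narrower pattern, `\[[^\]]*\]`, which is an assumption of this example and not part of the notebook; the sample string is made up.

```python
import re

# Made-up sample line with two bracketed spans
sample = '[note 1] the whale [note 2] swims'

# Greedy pattern used in preprocess(): matches from the first '[' to the last ']'
greedy = re.sub(r'\[.*\]', '', sample)

# Narrower alternative (assumption, not in the notebook): removes each pair separately
narrow = re.sub(r'\[[^\]]*\]', '', sample)

print(repr(greedy))
print(repr(narrow))
```

For the Moby Dick header line the two patterns behave the same, so the choice only matters for texts where brackets appear more than once on a line.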


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Use the count vectorizer to get the unique words of the text
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit([moby_dick])
vocab = vectorizer.get_feature_names_out()

print(f'Number of features: {len(vocab)}')
print('Features:')
print(vocab[:50])

Number of features: 16948
Features:
['a' 'aback' 'abaft' 'abandon' 'abandoned' 'abandonedly' 'abandonment'
 'abased' 'abasement' 'abashed' 'abate' 'abated' 'abatement' 'abating'
 'abbreviate' 'abbreviation' 'abeam' 'abed' 'abednego' 'abel' 'abhorred'
 'abhorrence' 'abhorrent' 'abhorring' 'abide' 'abided' 'abiding' 'ability'
 'abjectly' 'abjectus' 'able' 'ablutions' 'aboard' 'abode' 'abominable'
 'abominate' 'abominated' 'abomination' 'aboriginal' 'aboriginally'
 'aboriginalness' 'abortion' 'abortions' 'abound' 'abounded' 'abounding'
 'aboundingly' 'about' 'above' 'abraham']


In [None]:
# Construct dictionaries mapping words to their indexes and vice versa
word2index = vectorizer.vocabulary_
index2word = {index: word for word, index in word2index.items()}

In [None]:
# Tokenize the text with the count vectorizer's tokenizer
word_tokenize = vectorizer.build_tokenizer()
tokens = word_tokenize(moby_dick)
n_tokens = len(tokens)
print(f'Number of tokens: {n_tokens}')

# Get unique tokens
unique_tokens = sorted(set(tokens))
print(f'Number of unique tokens: {len(unique_tokens)}')

Number of tokens: 218370
Number of unique tokens: 16948


In [None]:
nltk.download('averaged_perceptron_tagger', quiet=True)

True

In [None]:
# Get PoS tags for each token
tagged = nltk.pos_tag(tokens)

# Construct contexts and targets for the dataset
n_context = 3
contexts = []
targets = []
for i in range(n_context, n_tokens - n_context):

    # Context is 3 words to the left and 3 words to the right of the target word
    left_context = [word for word, _ in tagged[i - n_context:i]]
    right_context = [word for word, _ in tagged[i + 1:i + n_context + 1]]
    contexts.append(left_context + right_context)

    # Target is the PoS of the target word
    _, target = tagged[i]
    targets.append(target)

print(f'Number of samples: {len(contexts)}')
print('First 10 contexts and targets:')
for i in range(10):
    print(f'{contexts[i]} -> [{targets[i]}]')

Number of samples: 218364
First 10 contexts and targets:
['etymology', 'supplied', 'by', 'late', 'consumptive', 'usher'] -> [DT]
['supplied', 'by', 'a', 'consumptive', 'usher', 'to'] -> [JJ]
['by', 'a', 'late', 'usher', 'to', 'a'] -> [NN]
['a', 'late', 'consumptive', 'to', 'a', 'grammar'] -> [NN]
['late', 'consumptive', 'usher', 'a', 'grammar', 'school'] -> [TO]
['consumptive', 'usher', 'to', 'grammar', 'school', 'the'] -> [DT]
['usher', 'to', 'a', 'school', 'the', 'pale'] -> [NN]
['to', 'a', 'grammar', 'the', 'pale', 'usher'] -> [NN]
['a', 'grammar', 'school', 'pale', 'usher', 'threadbare'] -> [DT]
['grammar', 'school', 'the', 'usher', 'threadbare', 'in'] -> [NN]


In [None]:
# Construct dictionaries mapping PoS tags to their indexes and vice versa
# (tags are sorted so that the mapping is the same on every run)
pos2index = {pos: index for index, pos in enumerate(sorted(set(targets)))}
index2pos = {index: pos for pos, index in pos2index.items()}

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Construct features (contexts) and labels (targets)
X = []
y = []
for context, target in zip(contexts, targets):
    X.append([word2index[word] for word in context])
    y.append(pos2index[target])

# Convert contexts to an array and targets to a one-hot representation
X = np.array(X)
y = tf.keras.utils.to_categorical(y)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

In [None]:
n_words = len(word2index)
n_pos = len(pos2index)

# Define a class for the recurrent neural network model
class PredictPosModel(tf.keras.Model):

    def __init__(self, n_context):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(
            n_words, 64, input_length=(2 * n_context)
        )
        self.bidirectional = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(128, return_sequences=False)
        )
        self.dense = tf.keras.layers.Dense(n_pos, activation='softmax')

    def call(self, inputs, training=False):
        x = self.embedding(inputs)
        x = self.bidirectional(x)
        return self.dense(x)
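
To get a feel for the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below uses the standard formulas for an embedding, an LSTM (four gates, each with input weights, recurrent weights, and a bias), and a dense layer with bias. The helper names are hypothetical, and `n_pos = 40` is a placeholder because the notebook does not print the actual number of tags.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

n_words = 16948          # vocabulary size reported above
n_pos = 40               # placeholder (assumption): the real value is len(pos2index)
embed_dim, units = 64, 128

embedding = embedding_params(n_words, embed_dim)
bilstm = 2 * lstm_params(embed_dim, units)     # forward and backward LSTM
dense = dense_params(2 * units, n_pos)         # the two directions are concatenated

print(f'Embedding: {embedding}')
print(f'Bidirectional LSTM: {bilstm}')
print(f'Dense: {dense}')
print(f'Total: {embedding + bilstm + dense}')
```

Most of the parameters sit in the embedding layer, which is typical when the vocabulary is large relative to the rest of the network.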

In [None]:
# Create an RNN model and compile it
model = PredictPosModel(n_context)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model for 20 epochs
model.fit(X_train, y_train, epochs=20, batch_size=128);

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# Compute loss and accuracy of the model on the test set
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'Test loss: {loss:.4f}')
print(f'Test accuracy: {acc:.4f}')

Test loss: 4.6058
Test accuracy: 0.3940
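
To put the test accuracy in context, it can help to compare it with a trivial baseline that always predicts the most frequent tag. The sketch below computes that baseline for a list of tags. `majority_baseline` is a hypothetical helper and the toy tag list is made up for illustration.

```python
from collections import Counter

def majority_baseline(tags):
    """Return the most frequent tag and the accuracy of always predicting it."""
    counts = Counter(tags)
    tag, count = counts.most_common(1)[0]
    return tag, count / len(tags)

# Toy tags (made up); in the notebook these would be the held-out targets
toy_tags = ['NN', 'DT', 'NN', 'IN', 'NN', 'JJ', 'DT', 'NN']
tag, acc = majority_baseline(toy_tags)
print(f'Majority tag: {tag}, baseline accuracy: {acc:.4f}')
```

In the notebook, the held-out tags could be recovered from the one-hot labels with `[index2pos[i] for i in y_test.argmax(axis=1)]` and passed to the helper.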
