# Recurrent Neural Networks

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/02/25  | Martin | Created   | Create for chapter 9. Started on text generation section | 
| 02/03/25  | Martin | Update   | Completed text generation with LSTM section | 

# Content

* [Introduction](#introduction)
* [Text Generation - LSTM](#text-generation)

# Introduction

__Recurrent Neural Networks (RNN)__ model data that is sequential in nature. "Recurrent" refers to the structure of the network: the hidden state produced at one step is fed back in as an input to the next step. At each step, the model therefore combines the current input with what it has retained from the preceding elements.

__Natural Language Processing (NLP)__ is where we train models to understand text by learning from the context in which words appear.
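
To make the recurrence concrete, here is a minimal sketch of a single-unit RNN in plain Python. The weights `w_x`, `w_h` and `b` are arbitrary illustrative values (assumptions, not taken from any trained model); the only point is that each hidden state depends on both the current input and the previous hidden state.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state.

    w_x, w_h and b are made-up illustrative weights (assumptions).
    """
    h = 0.0  # initial hidden state: nothing has been seen yet
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Same final input (0.0), different history -> different final state
states_a = rnn_forward([1.0, 0.0])
states_b = rnn_forward([-1.0, 0.0])
print(states_a[-1], states_b[-1])
```

Both sequences end with the same input, yet the final states differ in sign because the earlier element is still carried in the hidden state.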

Topics covered:

1. Text generation
2. Sentiment classification
3. Time series - stock information
4. Open-domain question answering

---

# Text Generation

Use a _Long Short-Term Memory (LSTM)_ architecture to build a text generation model.

* Standard RNN models struggle with long-range dependencies, i.e. words that appear early in the sequence contribute less and less to the model's output the further away they are (the vanishing gradient problem).
* An LSTM maintains a cell state (the "carry") alongside its hidden state, so that the signal is not lost as the sequence progresses.
* Inputs at each step: (1) the current word, (2) the previous hidden state, (3) the cell state (carry).
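
The points above can be sketched as a single-unit LSTM step in plain Python. This follows the standard LSTM equations rather than any particular library's implementation, and the weights in `params` are hypothetical values chosen so that the forget gate stays open and the input gate stays shut.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM.

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate values to write

    c = f * c_prev + i * g            # updated cell state (the "carry")
    h = o * math.tanh(c)              # updated hidden state (the step output)
    return h, c

# Illustrative weights (assumptions): forget gate saturated open, input gate shut
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.9, 0.1]:
    h, c = lstm_step(x, h, c, params)
print(c)
```

With these gate settings the cell state stays close to its initial value over the whole sequence, which is the mechanism that lets an LSTM preserve information from early steps.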

Video References

* [RNN Explained](https://www.youtube.com/watch?v=AsNTP8Kwu80&ab_channel=StatQuestwithJoshStarmer)
* [LSTM Explained](https://www.youtube.com/watch?v=YCzL96nL7j0&ab_channel=StatQuestwithJoshStarmer)

![LSTM Architecture](./images/lstm_architecture.png)

In [5]:
import os
# Set logging levels before importing TensorFlow so that they take effect
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

import random
import string

import numpy as np
import pandas as pd
import tensorflow as tf

# Keras modules for LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras import utils as ku

# Fix the seeds for reproducibility
random.seed(7)
np.random.seed(7)
tf.random.set_seed(7)

In [None]:
# Define functions that simplify the workflow
def clean_text(txt):
  """
  Removes punctuation and lowercases the text.
  Then drops any non-ASCII characters
  """
  txt = "".join(v for v in txt if v not in string.punctuation).lower()
  txt = txt.encode('utf8').decode('ascii', 'ignore')
  return txt


def get_sequence_of_tokens(corpus):
  """
  Creates n-gram sequences - a list of lists that contain the tokenised sentences.
  For each sentence, every prefix of two or more tokens is added as its own list.
  Relies on the module-level `tokenizer`
  """
  # Tokeniser: build the word -> index vocabulary
  tokenizer.fit_on_texts(corpus)
  total_words = len(tokenizer.word_index) + 1  # +1 for the padding index 0

  # Convert data to sequence of tokens
  input_sequences = []
  for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
      n_gram_sequence = token_list[:i+1]
      input_sequences.append(n_gram_sequence)
  return input_sequences, total_words


def generate_padded_sequences(input_sequences):
  """
  1. Ensure that all the sequences are of the same length by adding padding
  All padding is added to the front
  2. Separate the predictors (all tokens but the last) and labels (last word in the sequence)
  3. Convert the label into a categorical variable. Categories are all available words
  in the corpus (relies on the module-level `total_words`)
  """
  max_sequence_len = max([len(x) for x in input_sequences])
  input_sequences = np.array(pad_sequences(
    input_sequences,
    maxlen=max_sequence_len,
    padding='pre'
  ))

  predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
  label = ku.to_categorical(label, num_classes=total_words)
  return predictors, label, max_sequence_len


def generate_text(seed_text, next_words, model, max_sequence_len):
  """
  1. Apply the same preprocessing that was used for training
  2. Predict the next word
  3. Add the predicted word to the end of the seed text
  """
  for _ in range(next_words):
    # Apply the same preprocessing as in training
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences(
      [token_list],
      maxlen=max_sequence_len-1, # need to remove the label
      padding='pre'
    )

    # Predict the next word: index of the highest probability
    # (predict_classes was removed from recent TensorFlow releases)
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]

    # Convert the prediction back to actual word
    output_word = ""
    for word, index in tokenizer.word_index.items():
      if index == predicted:
        output_word = word
        break

    seed_text += " " + output_word

  return seed_text.title()
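
The generation loop is a greedy decode: at each step, take the most probable next word and append it. The sketch below shows the same loop in plain Python. `toy_predict` is a hypothetical stand-in for the trained model, and the four-word `word_index` is made up for illustration. It also builds the index-to-word lookup once, instead of scanning the vocabulary at every step.

```python
def greedy_generate(seed_words, next_words, predict_fn, word_index):
    """Append the most probable next word `next_words` times."""
    # Build the reverse lookup once instead of scanning word_index per step
    index_word = {index: word for word, index in word_index.items()}
    words = list(seed_words)
    for _ in range(next_words):
        token_ids = [word_index[w] for w in words if w in word_index]
        probs = predict_fn(token_ids)  # one probability per vocabulary index
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        words.append(index_word.get(predicted, ""))
    return " ".join(words)

# Made-up vocabulary; index 0 is reserved for padding
word_index = {"the": 1, "cat": 2, "sat": 3, "down": 4}

def toy_predict(token_ids):
    """Hypothetical stand-in for a model: favours the id after the last one."""
    probs = [0.0] * (len(word_index) + 1)
    next_id = token_ids[-1] % len(word_index) + 1
    probs[next_id] = 1.0
    return probs

print(greedy_generate(["the"], 3, toy_predict, word_index))
```

Because the choice is always the argmax, the same seed text always produces the same continuation.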

In [None]:
def create_model(max_sequence_len, total_words):
  """
  Create model with single LSTM hidden layer
  """
  input_len = max_sequence_len - 1  # the last token of each sequence is the label
  model = Sequential()

  # Embedding layer
  model.add(Embedding(total_words, 10, input_length=input_len))

  # LSTM layer
  model.add(LSTM(100))

  # Output layer
  model.add(Dense(total_words, activation='softmax'))

  model.compile(loss='categorical_crossentropy', optimizer='adam')

  return model
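
The output layer and loss used in `create_model` can be illustrated without Keras. The sketch below is plain Python, with made-up scores for a four-word vocabulary: softmax turns the scores into a probability for each word, and categorical cross-entropy is the negative log-probability assigned to the true next word.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Loss for one sample: negative log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

logits = [2.0, 0.5, -1.0, 0.0]   # made-up scores for a 4-word vocabulary
probs = softmax(logits)
label = [1, 0, 0, 0]             # the true next word is index 0
loss = categorical_crossentropy(label, probs)
print(probs, loss)
```

The loss shrinks as the probability assigned to the correct word approaches 1, which is what training pushes the model towards.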

## Loading Data

In [None]:
directory = "../data/ny_articles/"
all_headlines = []
for f in os.listdir(directory):
  article_df = pd.read_csv(directory + f)
  all_headlines.extend(list(article_df.headline.values))

all_headlines[:10]

Perform data preparation steps:

1. Remove punctuation and lowercase all words
2. _Tokenisation:_ Convert text into n-gram sequences. Each word is encoded as an integer (its index in the vocabulary built from the corpus), and an _n-gram_ here is a list of those integers covering the first n words of a headline
3. _Padding:_ Ensures that all the sequences are of the same length
4. Create the predictors and labels: the label is the last word of each sequence, i.e. the word that follows the predictor tokens

📜 __NOTE:__ Language modeling requires sequential input data: given a sequence of words/tokens, the aim is to predict the next word/token

In [None]:
# 1. Removing punctuations, lower casing
corpus = [clean_text(x) for x in all_headlines]
corpus[:10]
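
A quick check of what the cleaning step does to a headline. The function is restated here so that the snippet runs on its own, and the sample strings are made up. Note that `string.punctuation` only covers ASCII punctuation, so a curly apostrophe survives the first step and is only removed by the ASCII decode.

```python
import string

def clean_text(txt):
    # Drop ASCII punctuation and lowercase
    txt = "".join(ch for ch in txt if ch not in string.punctuation).lower()
    # Drop any character that cannot be represented in ASCII
    return txt.encode("utf8").decode("ascii", "ignore")

samples = [
    "Hello, World!",
    "Café’s New Menu",  # accented letter and curly apostrophe
]
for s in samples:
    print(repr(clean_text(s)))
```

The accented letter is dropped rather than transliterated, so "Café" becomes "caf". That is acceptable for this exercise but loses information.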

In [None]:
# 2. Tokenisation for ngram sequences
tokenizer = Tokenizer()

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]
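
The shape of the n-gram sequences is easier to see on a toy corpus. The sketch below is plain Python and does not use the Keras `Tokenizer`: the hypothetical helper `build_word_index` numbers words in first-seen order, whereas Keras orders them by frequency, so the ids differ but the prefix structure is the same.

```python
def build_word_index(corpus):
    """Assign each distinct word an integer id, starting at 1 (0 is kept for padding)."""
    word_index = {}
    for line in corpus:
        for word in line.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def prefix_sequences(token_list):
    """Return every prefix of length >= 2: [t0, t1], [t0, t1, t2], ..."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

corpus = ["the cat sat down", "the dog sat"]  # made-up sentences
word_index = build_word_index(corpus)
total_words = len(word_index) + 1  # +1 for the padding index 0

sequences = []
for line in corpus:
    tokens = [word_index[w] for w in line.split()]
    sequences.extend(prefix_sequences(tokens))

print(word_index)
print(sequences)
```

A sentence of n tokens yields n - 1 training sequences, each one token longer than the last.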

In [None]:
# 3 + 4. Padding sequences + create labels
predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
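
The padding and label split can be sketched in plain Python on a few toy sequences. `pad_pre` and `one_hot` are hypothetical helpers that mimic, in simplified form, what `pad_sequences(..., padding='pre')` and `to_categorical` do here; the token ids are made up.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to length maxlen."""
    return [[0] * (maxlen - len(seq)) + seq for seq in sequences]

def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]  # toy n-gram prefixes
total_words = 5                                # 4 words + padding index 0

max_sequence_len = max(len(seq) for seq in sequences)
padded = pad_pre(sequences, max_sequence_len)

predictors = [row[:-1] for row in padded]      # everything except the last token
labels = [one_hot(row[-1], total_words) for row in padded]

print(predictors)
print(labels)
```

Padding at the front keeps the label in the last column of every row, which is why the split can be a simple slice.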

## The model

In [None]:
model = create_model(max_sequence_len, total_words)
model.summary()
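
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the layer sizes from `create_model` (embedding dimension 10, 100 LSTM units, default biases) and a hypothetical vocabulary of 5000 words; the real `total_words` depends on the corpus.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words = 5000  # hypothetical vocabulary size, for illustration only
counts = {
    "embedding": embedding_params(total_words, 10),
    "lstm": lstm_params(10, 100),
    "dense": dense_params(100, total_words),
}
print(counts, sum(counts.values()))
```

The embedding and output layers both scale with the vocabulary size, so they dominate the total for any realistic corpus, while the LSTM layer's count is fixed by its input and unit sizes.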

In [None]:
model.fit(predictors, label, epochs=100, verbose=2)

## Testing the model

In [None]:
print(generate_text("united states", 5, model, max_sequence_len))
print(generate_text("united states", 10, model, max_sequence_len))
print(generate_text("united states", 15, model, max_sequence_len))

In [None]:
print(generate_text("president trump", 3, model, max_sequence_len))
print(generate_text("president trump", 4, model, max_sequence_len))
print(generate_text("president trump", 5, model, max_sequence_len))
print(generate_text("president trump", 8, model, max_sequence_len))

In [None]:
print(generate_text("joe biden", 3, model, max_sequence_len))
print(generate_text("joe biden", 4, model, max_sequence_len))
print(generate_text("joe biden", 5, model, max_sequence_len))
print(generate_text("joe biden", 8, model, max_sequence_len))

In [None]:
print(generate_text("india and china", 3, model, max_sequence_len))
print(generate_text("india and china", 4, model, max_sequence_len))
print(generate_text("india and china", 5, model, max_sequence_len))
print(generate_text("india and china", 8, model, max_sequence_len))

In [None]:
print(generate_text("european union", 3, model, max_sequence_len))
print(generate_text("european union", 4, model, max_sequence_len))
print(generate_text("european union", 5, model, max_sequence_len))
print(generate_text("european union", 8, model, max_sequence_len))