### Notebook and dataset backgrounder
For privacy reasons I cannot share client- or customer-specific work, but I would still like to showcase what I have built, so I use public datasets. This project is cross-published on GitHub and Kaggle.

The Internet Movie Database, also known as IMDB, is one of the best-known places on the Internet to find information about Hollywood productions, TV shows and movies.

A dataset with [25000 movie reviews from IMDB](https://keras.io/api/datasets/imdb/), each labeled by sentiment, is provided by the [Keras team](https://keras.io/). It is great for demonstrating how [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory), a type of [Recurrent Neural Network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network), can be used.

### Theory backgrounder
This project uses a neural network: a network of processing nodes (perceptrons/neurons), each of which activates or not depending on its mathematical activation function. These mathematical neurons are loosely modelled on the biological neurons we think with as humans. LSTM, the de facto standard type of RNN, is used in this project. As noted by a very good [Wikipedia article](https://en.wikipedia.org/wiki/Recurrent_neural_network), early RNNs had a vanishing gradient problem: the gradients shrink toward zero as they are propagated back through many time steps, so there is almost nothing left to train against for the earlier parts of a sequence. LSTM was designed to mitigate this.

> Unlike feedforward neural networks, which process data in a single pass, RNNs process data across multiple time steps, making them well-adapted for modelling and processing text, speech, and time series.
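
To make the vanishing gradient problem concrete, here is a minimal sketch that is not part of the notebook's model. It assumes a one-unit vanilla RNN, `h_t = tanh(w * h_{t-1} + x_t)`, where the weight `w`, the constant inputs and the function name `gradient_through_time` are all illustrative choices of mine. By the chain rule, the gradient of the latest hidden state with respect to the first one is a product of per-step factors `w * (1 - h_t**2)`, and with factors below 1 that product collapses quickly.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_t / d h_0 after each step of a scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        history.append(grad)
    return history

# Hypothetical setup: weight 0.5 and the same small input at each of 50 time steps
grads = gradient_through_time(w=0.5, inputs=[0.1] * 50)
for step in (1, 10, 50):
    print(f"gradient after {step:>2} steps: {grads[step - 1]:.3e}")
```

An LSTM avoids much of this shrinkage by carrying a separate cell state that is updated additively through gates, rather than being squashed through `tanh` at every step.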

### Tech backgrounder
During 2022 and 2023, when I originally learned how to work with TensorFlow and Keras, I explored various ways to use Keras; this project highlights my use of an LSTM-type RNN. As of 2024, Keras can run on top of JAX, TensorFlow, or PyTorch.

Here I will use TensorFlow and Keras to supply the dataset and the LSTM layer.

### Misc
The training split as loaded has 25000 reviews. In the code below, the value 20000 (`n_data_samples`) is passed to `num_words`, so it caps the vocabulary at the 20000 most frequent words rather than the number of reviews. Each review is padded or trimmed to 150 words, the model is trained on the training split, and it is validated on the first 5000 reviews of the test split. Once the model is built, I will be able to determine how many trainable parameters it has.
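
As a rough illustration of what limiting each review to a fixed length means, here is a small pure-Python sketch that mimics the default behaviour of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`) for a single sequence. The helper name `pad_or_trim` and the toy reviews are made up for this example, and it assumes `maxlen` is at least 1.

```python
def pad_or_trim(sequence, maxlen, value=0):
    """Mimic Keras pad_sequences defaults for one sequence (assumes maxlen >= 1)."""
    trimmed = sequence[-maxlen:]                   # 'pre' truncating keeps the last maxlen tokens
    padding = [value] * (maxlen - len(trimmed))    # 'pre' padding adds zeros at the front
    return padding + trimmed

short_review = [1, 14, 22, 16]       # hypothetical encoded review, shorter than maxlen
long_review = list(range(1, 11))     # hypothetical encoded review, longer than maxlen

padded_short = pad_or_trim(short_review, maxlen=6)
trimmed_long = pad_or_trim(long_review, maxlen=6)
print(padded_short)
print(trimmed_long)
```

Short reviews gain leading zeros and long reviews lose their earliest words, so every row ends up with the same length and the most recent words of a review are the ones that survive.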

In [None]:
import tensorflow as tf
# REF (properly importing pad_sequences, working like this 2024): https://stackoverflow.com/questions/72326025/cannot-import-name-pad-sequences-from-keras-preprocessing-sequence and https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences
from keras.preprocessing.sequence import pad_sequences

n_data_samples = 20000   # NOTE: passed to num_words below, so this is the vocabulary size (most frequent words kept), not a number of reviews
review_length = 150      # Maximum number of words kept per review

# REF (Keras IMDB dataset): https://keras.io/api/datasets/imdb/
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=n_data_samples)
# REF (using pad_sequences to pad/trim review length): https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences
x_train = pad_sequences(x_train, maxlen=review_length)
x_test = pad_sequences(x_test, maxlen=review_length)
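
The `num_words` argument used in the cell above caps the vocabulary, not the number of reviews. As I understand the Keras loader, any word index at or above `num_words` is replaced by the out-of-vocabulary marker (`oov_char`, which defaults to 2). The sketch below imitates that step in pure Python; `cap_vocabulary` and the toy review are hypothetical and only illustrate the idea.

```python
OOV_CHAR = 2  # assumed default `oov_char` of keras.datasets.imdb.load_data

def cap_vocabulary(encoded_review, num_words, oov_char=OOV_CHAR):
    """Replace every word index >= num_words with the out-of-vocabulary marker."""
    return [token if token < num_words else oov_char for token in encoded_review]

toy_review = [1, 14, 9999, 20, 25000, 7]   # hypothetical encoded review
capped_review = cap_vocabulary(toy_review, num_words=10000)
print(capped_review)
```

The review keeps its length; only the rare word (index 25000 here) is swapped for the marker, which is why this setting determines the size of the embedding table rather than the size of the dataset.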

In [None]:
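from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense

def build_model():
  model = Sequential()
  model.add(Input((review_length,)))           # One padded review of review_length word indices
  model.add(Embedding(n_data_samples, 32))     # Map each word index to a 32-dimensional vector; the Input layer already fixes the length, so input_length is not needed
  model.add(LSTM(100))                         # 100 LSTM units read the sequence of word vectors
  model.add(Dense(1, activation='sigmoid'))    # Single probability for the binary label
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  return model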

In [None]:
model = build_model()
model.summary()

# Deprecated because why not deprecate something that works?
#from keras.utils.layer_utils import count_params   # REF (Trainable params): https://stackoverflow.com/questions/45046525/how-can-i-get-the-number-of-trainable-parameters-of-a-model-in-keras
#trainable_count = count_params(model.trainable_weights)
#non_trainable_count = count_params(model.non_trainable_weights)

# REF (How can I get the number of trainable parameters of a model in Keras?): https://stackoverflow.com/questions/45046525/how-can-i-get-the-number-of-trainable-parameters-of-a-model-in-keras
import numpy as np
trainable_count, non_trainable_count = 0, 0

# Update 2024: Deprecation is wild in Keras
#from keras import backend as K
#trainable_count = int(np.sum([K.count_params(p) for p in set(model.trainable_weights)]))
#non_trainable_count = int(np.sum([K.count_params(p) for p in set(model.non_trainable_weights)]))

# Also get_shape() is now deprecated
#for p in model.trainable_weights:
#  trainable_count += int(np.prod(p.get_shape()))
#non_trainable_count = 0
#for p in model.non_trainable_weights:
#  non_trainable_count += int(np.prod(p.get_shape()))

# Using num_elements(), not available in Kaggle?
#for p in model.trainable_weights:
#  trainable_count += int(p.size())  # num_elements() replaced get_shape()
#non_trainable_count = 0
#for p in model.non_trainable_weights:
#  non_trainable_count += int(p.size())

# Update 2024: It works!
for p in model.trainable_weights:
  trainable_count += int(np.prod(p.shape))
for p in model.non_trainable_weights:
  non_trainable_count += int(np.prod(p.shape))

print(f'\nTotal parameters of the model: {trainable_count + non_trainable_count:,}')
print(f'Trainable parameters of the model: {trainable_count:,}')
print(f'Non-trainable parameters of the model: {non_trainable_count:,}')
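
The counts printed above can be cross-checked by hand. The sketch below assumes the layer sizes used in this notebook (a 20000-word vocabulary, 32-dimensional embeddings, 100 LSTM units, one dense output) and the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias. The helper function names are mine, not a Keras API.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

embedding_count = embedding_params(20000, 32)
lstm_count = lstm_params(32, 100)
dense_count = dense_params(100, 1)
expected_total = embedding_count + lstm_count + dense_count

print(f'Embedding: {embedding_count:,}')
print(f'LSTM:      {lstm_count:,}')
print(f'Dense:     {dense_count:,}')
print(f'Total:     {expected_total:,}')
```

Under these assumptions the embedding table dominates the total, so the vocabulary size is the main lever on model size in this build.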

In [None]:
# Validate on the first 5000 reviews of the test split; 1 epoch is enough for this demonstration
model_history = model.fit(x=x_train, y=y_train, validation_data=(x_test[:5000], y_test[:5000]), epochs=1)

In [None]:
train_accuracy = model_history.history['accuracy'][-1]   # History values are per-epoch lists, so take the last epoch
val_accuracy = model_history.history['val_accuracy'][-1]
print(f'After 1 Epoch:\nTrain Accuracy: {train_accuracy:.4f}\nValidation Accuracy: {val_accuracy:.4f}')

In [None]:
from sklearn.metrics import classification_report
# Threshold the sigmoid outputs at 0.5 to get 0/1 class labels for sklearn
# Note: x_test still includes the 5000 reviews that were used for validation
predictionset = (model.predict(x_test) >= 0.5).astype(int).ravel()
print(f'\n{classification_report(y_test, predictionset)}')
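
For readers who want to see what goes into the report printed above, here is a small pure-Python sketch of the thresholding step and of precision, recall and F1 for the positive class. The toy probabilities, toy labels and helper names are invented for illustration; they are not results from this model.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into 0/1 class labels."""
    return [1 if p >= cutoff else 0 for p in probabilities]

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_probabilities = [0.9, 0.2, 0.6, 0.4, 0.8, 0.1]   # hypothetical model outputs
toy_labels = [1, 0, 0, 1, 1, 0]                      # hypothetical true sentiments

toy_predictions = threshold(toy_probabilities)
precision, recall, f1 = precision_recall_f1(toy_labels, toy_predictions)
print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

Precision asks how many predicted positives were right, recall asks how many true positives were found, and F1 is their harmonic mean, which is what each row of the sklearn report summarises per class.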