<a href="https://colab.research.google.com/github/hacksaremeta/IS-Sentence-Completion/blob/model/is_autocomplete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Completion (TUD IS Project)

<a id="toc"></a>
## Table of contents
* 1 [Introduction](#introduction)  
* 2 [Training data](#data_preparation)  
    * 2.1 [DataManager class](#data_manager)  
    * 2.2 [DataUtils class](#data_utils)  
* 3 [Neural Network Implementation](#impl_nn)  
    * 3.1 [Data preparation](#keras_preparation)  
    * 3.2 [Creating a neural network](#keras_create_rnn)
    * 3.3 [Training the neural network](#keras_train_rnn)
    * 3.4 [Making Predictions](#keras_predict_rnn) 
* 4 [Evaluation](#evaluation)
    * 4.1 [Dataset comparison](#eval_dataset)
    * 4.2 [Data preparation comparison](#eval_preparation)
    * 4.3 [Network comparison](#eval_rnn)
    * 4.4 [Conclusion](#eval_conclusion)

<a id="introduction"></a>
## Introduction

In this project our task is to extend sentences from abstracts fetched from PubMed. To solve this problem we use a recurrent neural network (RNN), since recurrent networks process their input step by step and are therefore well suited to predicting the next element of a sequence. After training the network on sequences of words, it can be used to extend such sentences. In short<sup>[1](#fn1)</sup>, the network takes a number of words as input and predicts the next word in that sequence. We then append the predicted word to the original sequence, feed the result into the network again and repeat this process to predict an arbitrary number of words. To implement the network we used [Keras](https://keras.io/) from [TensorFlow](https://www.tensorflow.org/).
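
The predict-append-repeat loop described above can be sketched in a few lines. This is only an illustration, not the project code: `complete`, `toy_predict` and `toy_table` are hypothetical names, and a fixed lookup table stands in for the trained network.

```python
def complete(words, predict_next, max_words, end_symbols=(".", "?", "!")):
    """Repeatedly predict the next word and append it to the sequence."""
    words = list(words)
    for _ in range(max_words):
        next_word = predict_next(words)
        words.append(next_word)
        # Stop as soon as a sentence delimiter has been predicted
        if next_word in end_symbols:
            break
    return words


# Hypothetical stand-in for the trained network: looks up the last word only
toy_table = {"the": "patient", "patient": "recovered", "recovered": "."}


def toy_predict(words):
    return toy_table.get(words[-1], ".")


result = complete(["the"], toy_predict, max_words=10)
print(" ".join(result))
```

The real network replaces `toy_predict`: it receives the whole sequence as integers and returns a probability distribution from which the next word is picked.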

The task can be divided into obtaining datasets, preparing the datasets for input to the RNN, constructing and training the RNN and finally making predictions. During our testing, we try to find a good model<sup>[2](#fn2)</sup> by training the network independently on different datasets, changing preparation stages and tweaking parameters. We then compare our models in the [Evaluation](#evaluation) section.

We used the following methods from the lecture:
- Regular Expressions
- Neural Networks

<a name="fn1">1</a>: Simplified explanation, the details differ: The network calculates a function on numeric data, therefore the mentioned "words" are actually represented as vectors of integers. Likewise, the output is a floating-point vector that can be interpreted as a prediction. For details on how this network operates see [Neural Network Implementation](#impl_nn).

<a name="fn2">2</a>: (serializable) Keras representation of the neural network; The models we saved in the process contain both information about the structure of the network as well as weights of the connections (models serialized after training).

<a id="data_preparation"></a>
## Training data

Fetching data from PubMed, saving it as named datasets and loading those datasets again is handled by the [DataManager class](#data_manager).
The loaded dataset then has to be prepared for training the neural network. This includes tokenization, label and feature extraction and encoding, all of which is handled by the [DataUtils class](#data_utils).
TODO: more explanation / documentation ...

<a id="data_manager"></a>
### DataManager Class

- Provides fetch, save and load functionality for datasets, which are stored as JSON files

In [None]:
!pip install biopython

In [1]:
import os, json, logging, string
from Bio import Entrez, Medline

In [2]:
class DataManager():
    """Provides fetch, save and load functionality for datasets in json format"""
    
    def __init__(self, email, root_dir):
        self.email = email
        self.root_dir = root_dir
        self.log = logging.getLogger(self.__class__.__name__)

    def _exists_dataset(self, name):
        """Checks whether a dataset with the given name exists"""
        if not os.path.isdir(self.root_dir):
            return False
            
        for file in os.listdir(self.root_dir):
            if file.endswith(".json"):
                with open(os.path.join(self.root_dir, file), 'r') as f:
                    content = json.load(f)
                    if content["name"] == name:
                        return True
        return False

    def _fetch_papers(self, query : str, limit : int) -> 'list[dict]':
        """Retrieves data from PubMed"""
        Entrez.email = self.email
        record = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=limit))
        idlist = record["IdList"]
        self.log.info("Found %d records for %s." % (len(idlist), query.strip()))
        records = Medline.parse(Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode = "text"))
        return [r for r in records if "AB" in r]

    
    def _fetch_abstracts(self, query : str, limit : int) -> 'list[str]':
        """Retrieves abstracts from PubMed"""
        papers = self._fetch_papers(query, limit)
        list_of_abstracts = [p['AB'] for p in papers]

        return list_of_abstracts
        
    def create_dataset(self, queries : 'list[str]', name : str, limit=50, overwrite=False) -> None:
        """
        Wraps other methods in this class
        Creates a dataset from multiple queries
        Does nothing if the dataset is already present (param overwrite)
        Limits every query to <limit> results
        """
        exists_dataset = self._exists_dataset(name)
        if not exists_dataset or (exists_dataset and overwrite):
            self.log.info("Dataset does not exist, fetching from PubMed...")

            res = dict()
            res["name"] = name
            res["data"] = list()
            
            for q in queries:
                q_data = dict()
                q_data["query"] = q
                q_data["abstracts"] = self._fetch_abstracts(q, limit)
                res["data"].append(q_data)
            
            self._save_dataset(res, name)
        else:
            self.log.info("Dataset already exists, skipping fetch")

    def _save_dataset(self, dataset: dict, name : str) -> None:
        """
        Creates a file <name>.json in the dataset directory
        For JSON file structure see below
        Param dataset has a structure analogous to the JSON file
        """
        if not os.path.isdir(self.root_dir):
            os.makedirs(self.root_dir)

        with open(os.path.join(self.root_dir, name + ".json"), 'w') as f:
            json.dump(dataset, f, indent=2)
        
    def load_full_dataset(self, name : str) -> 'list[str]':
        """
        Finds the file that matches given <name> in JSON information,
        parses it, loading all abstracts into a list (one string for each abstract)
        and returns it (logs a message and returns None if the dataset doesn't exist)
        """
        if not self._exists_dataset(name):
            self.log.info("Dataset does not exist")
            return None

        with open(os.path.join(self.root_dir, name + '.json'), 'r') as file:
            json_object = json.load(file)

        # Concatenate the abstracts of all queries into one flat list
        abstract_list = []
        for item in json_object['data']:
            abstract_list.extend(item['abstracts'])
        return abstract_list

    def load_query_from_dataset(self, name : str, query : str) -> 'list[str]':
        """Like load_full_dataset but only loads abstracts for a single query"""
        if not self._exists_dataset(name):
            self.log.info("Dataset does not exist")
            return None

        with open(os.path.join(self.root_dir, name + '.json'), 'r') as file:
            data_list = json.load(file)['data']

        for entry in data_list:
            if entry['query'] == query:
                # Return the abstracts of the first matching query
                return list(entry['abstracts'])

        self.log.info("The query you are searching for does not exist in the dataset")
        return None


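    def remove_punctuation(self, name : str) -> 'list[str]':
        """
        Loads the dataset <name> and returns its abstracts followed by
        copies of the same abstracts with all punctuation removed
        """
        abstracts_list = self.load_full_dataset(name)

        # Build the translation table once and reuse it for every abstract
        table = str.maketrans('', '', string.punctuation)

        # Iterate over the loaded abstracts instead of relying on a global variable
        # The originals stay in the result because later steps tokenize punctuation
        stripped = [text.translate(table) for text in abstracts_list]
        return abstracts_list + stripped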
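
Judging from `create_dataset` and `_save_dataset`, a dataset file stores the dataset name together with one entry per query. The sketch below rebuilds that structure with made-up content (the dataset name, queries and abstracts are hypothetical) and round-trips it through a temporary directory, flattening the abstracts the same way `load_full_dataset` does.

```python
import json
import os
import tempfile

# Hypothetical dataset with the same layout that create_dataset builds
dataset = {
    "name": "example dataset",
    "data": [
        {"query": "query A", "abstracts": ["First abstract.", "Second abstract."]},
        {"query": "query B", "abstracts": ["Third abstract."]},
    ],
}

with tempfile.TemporaryDirectory() as root_dir:
    path = os.path.join(root_dir, dataset["name"] + ".json")

    # Save as <name>.json, as _save_dataset does
    with open(path, "w") as f:
        json.dump(dataset, f, indent=2)

    # Load it again and flatten the abstracts of all queries
    with open(path, "r") as f:
        loaded = json.load(f)

all_abstracts = [a for entry in loaded["data"] for a in entry["abstracts"]]
print(all_abstracts)
```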

<a id="data_utils"></a>
### DataUtils Class

[Back to TOC](#toc)
- Static class providing utility functions to prepare data for training

In [3]:
import numpy as np
import random
from typing import Any
from sklearn.model_selection import train_test_split

In [4]:
# TODO: unify method param types (all np.array instead of list)
class DataUtils():
    """Provides utility functions for data preparation"""
    
    @staticmethod
    def extract_features_and_labels(sequences : 'list[list[Any]]',
                                            train_interval : 'tuple[int, int]') -> 'tuple[list[Any], list[Any]]':
        """
        Chooses a random number l from <train_interval> (chosen for every sequence) and extracts
        features of length l from every sequence using a sliding window
        The word following each window is extracted as its label
        Returns tuple(features, labels)
        """
        features = []
        labels = []
        for s in sequences:
            l = random.randrange(*train_interval, 1)
            for i in range(l, len(s)):
                # Window of l words preceding position i is the feature
                features.append(s[i-l : i])
                
                # Word at position i (directly after the window) is the label
                labels.append(s[i])
        
        return (features, labels)
        
    @staticmethod
    def encode_data(labels : 'list[Any]', num_code_words : int) -> np.array:
        """
        One-hot encode labels using numpy to
        improve the training speed of the network
        """

        # Use numpy for better compatibility and performance
        # Data type: 8bit integers for binary numbers (0, 1)
        # Could be optimized in space by using single bits instead
        # But that adds overhead in calculation (tradeoff time - space)
        # Since we want improved training speed we just use
        # NumPy's smallest data type byte/uint8 here
        labels_encoded = np.zeros((len(labels), num_code_words), dtype=np.uint8)

        # One-hot encode
        for i, word in enumerate(labels):
            labels_encoded[i, word] = 1
            
        return labels_encoded
    
    # Uses Scikit-learn here for simplicity
    @staticmethod
    def split_data(features: np.array, labels: np.array, _test_size=0.2) -> Any:
        """
        Splits features and labels into training and validation data sets
        Returns: (features_training, features_validation, labels_training, labels_validation)
        """
        return train_test_split(features, labels, test_size=_test_size)
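
To make the sliding-window extraction concrete, here is a small sketch with a fixed window length instead of a random one. The function name `sliding_windows` and the toy sequence are hypothetical; the loop mirrors the inner loop of `extract_features_and_labels`.

```python
def sliding_windows(sequence, length):
    """Return (features, labels) for one sequence and a fixed window length."""
    features, labels = [], []
    for i in range(length, len(sequence)):
        # The window of `length` tokens before position i is the feature
        features.append(sequence[i - length:i])
        # The token at position i is the label to be predicted
        labels.append(sequence[i])
    return features, labels


# Hypothetical tokenized abstract with five words
features, labels = sliding_windows([11, 12, 13, 14, 15], length=3)

for feature, label in zip(features, labels):
    print(feature, "->", label)
```

A sequence that is not longer than the window length yields no samples at all, so very short abstracts do not contribute training data.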

<a id="impl_nn"></a>
## Neural Network Implementation (LSTM RNN)  
For the general methodology regarding Keras neural networks see [Tensorflow Docs: Text generation with an RNN](https://www.tensorflow.org/text/tutorials/text_generation), [Sanchit Tanwar: Building our first neural network in keras](https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5) and [Will Koehrsen: Recurrent Neural Networks by Example in Python](https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470).

After fetching the abstracts from PubMed, we first have to convert the input abstracts into numeric sequences that the network can work with.
Then we train the network by giving it n 'words' (features) from the sequences and having it predict the (n+1)-th word (label) in the sequence.
The predicted word is then compared to the actual word present in the training data and back-propagation is used to tweak the network weights (this is done by Keras during training). After the training process is done we can use the model to predict a number of words given an input sequence of arbitrary length.

<a id="keras_preparation"></a>
### Data preparation

[Back to TOC](#toc)

The first step in this stage is to fetch the abstracts from PubMed. For this either a single or multiple queries with a specified limit for the number of abstracts to fetch can be used. The abstracts are then processed by multiple regular expressions to remove unwanted tokens such as emails, links, abbreviations, etc. After this, we use Keras' tokenizer (which is also saved in `res/tokenizers` for making predictions later) to convert the abstracts into sequences of integers where each word is represented by a unique integer. We also one-hot encode the labels for efficiency and to simplify the prediction process later. Note that this also increases the RAM usage considerably and can be omitted if deemed necessary.

The final preparation step is to split the resulting data across the following (disjoint!) sets:

- features (vectors of words that are used for training)
- labels (vector of words that should be predicted)
- validation features
- validation labels

The features and labels are used to train the network, whereas validation features and labels are used to test the network's predictions on different data of a similar domain<sup>[3](#fn3)</sup>.

<a name="fn3">3</a>: similar domain only when using a single query to generate the dataset
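
The memory cost of one-hot encoding mentioned above can be estimated with a small sketch. The values are toy numbers, not taken from the real dataset, and `one_hot` is a hypothetical pure-Python stand-in for `DataUtils.encode_data` that uses one byte per vocabulary entry, like `uint8`.

```python
def one_hot(labels, num_code_words):
    """One-hot encode integer labels, one byte per vocabulary entry."""
    rows = []
    for word in labels:
        row = bytearray(num_code_words)  # initialised with zeros
        row[word] = 1
        rows.append(row)
    return rows


# Hypothetical toy values
labels = [3, 1, 4]
num_code_words = 1000

encoded = one_hot(labels, num_code_words)

bytes_one_hot = sum(len(row) for row in encoded)  # one byte per vocabulary entry
bytes_integer = len(labels) * 4                   # assuming 32-bit integer labels

print(f"one-hot: {bytes_one_hot} bytes, integer labels: {bytes_integer} bytes")
```

The one-hot size grows with the vocabulary size, while integer labels stay constant per sample. Keras also offers the `sparse_categorical_crossentropy` loss, which accepts integer labels directly; whether that is a better fit for this project was not tested here.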

In [5]:
import re, pickle, warnings
from tensorflow import device
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.compat.v1.logging import set_verbosity, ERROR
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Masking, Dense, Dropout, Flatten, Input

In [6]:
# Suppress TF2 GPU Warnings
set_verbosity(ERROR)

In [7]:
if __name__ == "__main__":
    # Init logging
    logging.basicConfig(level=logging.DEBUG, format='[%(levelname)s] %(name)s: %(message)s')
    log = logging.getLogger("Main")

    # Create DataManager in 'res/datasets' folder
    data_folder = os.path.join("res", "datasets")
    dman = DataManager("mymail@example.com", data_folder)


    queries = ["brain surgery[Title/Abstract]"]
    records = 200
    dataset_name = "{0} Dataset {1}".format(', '.join(queries).replace('/', '_'), str(records))
    #query = ["RNA", "mRNA", "tRNA"]
    #dataset_name = f"RNA Dataset"

    # Gather a maximum of <records> abstracts for each query
    # I would suggest around 5 - 20 abstracts in total for the small data sets
    # and maybe 500 - 5000 for the final ones but we'll have to test
    # since that depends on how long it takes to train the network
    # This only queries PubMed if the data is not already present
    dman.create_dataset(queries, dataset_name, records, overwrite=False)

    # Load the dataset
    abstracts = dman.load_full_dataset(dataset_name)
    #abstracts_mrna = dman.load_query_from_dataset(dataset_name, query)

    ab = dman.remove_punctuation(dataset_name)

    assert(len(ab) > 0)
    log.debug(f"First extracted abstract: {ab[0]}")

    # Perform some regex preprocessing to improve data quality
    # Do this sequentially for better code readability
    
    # Remove section titles often present in abstracts
    ab = [re.sub(r"(\.\s|^)(?:\w+[,:]?\s){1,2}\w+:\s", r"\1", w) for w in ab]
    
    # Separate punctuation to keep in tokenization 
    ab = [re.sub(r"([.:,;!?])", r" \1 ", w) for w in ab]
    
    # Remove URLs and Emails
    # Credit: Matthew O'Riordan
    # https://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without
    regex_url_email = (r'((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?'
                       r'[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%'
                       r'\/.\w\-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)')
    ab = [re.sub(regex_url_email, '', w) for w in ab]
    
    # Remove references like [...], (...)
    ab = [re.sub(r"(\(.*?\)|\[.*?\])", '', w) for w in ab]
        
    # Substitute special Symbols like &, /, ...
    ab = [re.sub(r"([^\s]*)&([^\s]*)", r"\1 and \2", w) for w in ab]
    ab = [re.sub(r"([^\s]*)\/([^\s]*)", r"\1 or \2", w) for w in ab]
    
    # Remove duplicate spaces
    ab = [re.sub(r"\s\s+", ' ', w) for w in ab]
    
    assert(len(ab) > 0)
    log.debug(f"First extracted abstract after RegEx preprocessing: {ab[0]}")
    
    # Tokenize abstracts
    # See https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
    # Filters slightly modified (comp. to docs) to keep punctuation
    # Lowercase has to be used for pre-trained embeddings (filters='#$%&()*+-,<=>@[\\]^_`{|}~\t\n')
    # '"%;[\\]^_`{|}~\t\n'
    tokenizer = Tokenizer(
        num_words=None, 
        filters='"#*+<>@=~_|´`\\[]{}\t\n',
        lower = True, split = ' '
    )

    tokenizer.fit_on_texts(ab)

    # Generates list of lists of integers
    # Can be reversed with the sequences_to_texts() function of the tokenizer
    sequences = tokenizer.texts_to_sequences(ab)

    assert(len(sequences) > 0)
    log.debug(f"First tokenized sequence: {sequences[0]}")

    # Prepare data for input to RNN
    # Extract features and labels
    # Number of words before prediction:
    # random number from feature_len_range
    # Use tuple(n, n+1) for fixed feature length n
    feature_len_range = (60, 61)
    features, labels = DataUtils.extract_features_and_labels(sequences, feature_len_range)

    assert(len(features) > 0 and len(labels) > 0)
    log.debug(f"First extracted feature: {tokenizer.sequences_to_texts(features)[0]} {features[0]}")
    log.debug(f"First extracted label: {tokenizer.index_word[labels[0]]} [{labels[0]}]")

    # One-hot encode data for improved training performance
    num_code_words = len(tokenizer.index_word) + 1
    labels_encoded = DataUtils.encode_data(labels, num_code_words)

    assert(len(labels_encoded) > 0)
    log.debug(f"First one-hot encoded label: [0 ... {labels_encoded[0][labels[0]]} (at index {labels[0]}) ... 0]")

    # Final log for prepared data
    log.info(f"Loaded {labels_encoded.shape[0]} sequences"
             f" with an encoded length of {labels_encoded.shape[1] * labels_encoded.itemsize} bytes per label")
    
    # Convert features to numpy array
    # This is necessary for input to the RNN
    features = np.array(features)
    
    # Split dataset into training and validation sets
    features_training, features_validation, labels_training, labels_validation = \
    DataUtils.split_data(features, labels_encoded, 0.2)
    
    assert(len(features_training) > 0 and len(features_validation) > 0 and
           len(labels_training) > 0 and len(labels_validation) > 0)
    log.info(f"Size of training data: {features_training.shape[0]} sequences")
    log.info(f"Size of validation data: {features_validation.shape[0]} sequences")
    
    log.info("Training data preparation finished")

[INFO] DataManager: Dataset already exists, skipping fetch
[DEBUG] Main: First extracted abstract: Background: Surgical resection is frequently the recommended treatment for drug-resistant temporal lobe epilepsy (TLE), yet many factors play a role in patients' perceptions of brain surgery that ultimately impact decision-making. The purpose of the current study was to explore how people with epilepsy, in their own words, experienced the overall process of consenting to surgery for drug-resistant TLE. Methods and Materials: Data was drawn from in-person, semi-structured interviews of 19 adults with drug-resistant TLE eligible to undergo epilepsy surgery. A systematic thematic analysis was performed to code, sort and compare participant responses. The mean age of these 12 (63%) women and seven (37%) men was 37.6 years (18-68 years), with average duration of epilepsy of 13 years (2-30 years). Results: Meeting the neurosurgeon and consenting to surgery represented an important treatment mil

<a id="keras_create_rnn"></a>
### Creating the neural network

[Back to TOC](#toc)

Note: We use the words 'nodes', 'neurons' and 'cells' synonymously in the context of the neural network.

Our baseline network (`res/models/model-baseline_252276_200_1_128_128_150.h5`) consists of the following layers:
- Embedding: Maps each word index of the input sequence to a vector
- LSTM: Recurrent layer consisting of LSTM cells
- Dense: For additional 'learning capacity'
- Dropout: Regularization to reduce overfitting
- Dense: Maps its input to a normalized probability distribution over the vocabulary

The parameters used for the baseline network are as follows:

    # Network params
    train_embedding = True
    embed_vec_size = 200

    num_nodes_lstm = 128
    num_nodes_dense = 128

    # Training params
    num_batch = 2800
    num_epochs = 150

In this case our input sequence is an integer vector representing a variable number of words. Since the labels are one-hot encoded, the output layer produces a probability distribution over the vocabulary, and applying argmax to it yields the index of the predicted word. We discuss the influence of the parameters in the [Network comparison](#eval_rnn) section.
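
The file name of the baseline model encodes its main parameters, following the pattern `<model_name>_<train_size>_<output_dim>_<selftrained_embeddings>_<num_nodes_lstm>_<num_nodes_dense>_<num_epochs>.h5` that this notebook uses when saving a model. The helper below is a hypothetical sketch (not part of the project) that splits such a name back into its fields.

```python
import os

FIELDS = ["train_size", "output_dim", "selftrained_embeddings",
          "num_nodes_lstm", "num_nodes_dense", "num_epochs"]


def parse_model_filename(filename):
    """Split a model file name into its name and numeric parameters."""
    stem, _extension = os.path.splitext(filename)
    # Split from the right so that underscores in the model name are kept
    name, *values = stem.rsplit("_", len(FIELDS))
    info = {"model_name": name}
    info.update({key: int(value) for key, value in zip(FIELDS, values)})
    return info


info = parse_model_filename("model-baseline_252276_200_1_128_128_150.h5")
print(info)
```

The parsed values agree with the baseline parameters listed above: an embedding vector size of 200, 128 LSTM nodes, 128 dense nodes and 150 epochs.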

In [8]:
# Network params
train_embedding = True
embed_vec_size = 250

num_nodes_lstm = 1024
num_nodes_dense = 2048

# Training params
num_batch = 128
num_epochs = 150

X_train = features_training
y_train = labels_training
X_valid = features_validation
y_valid = labels_validation

# Model save params
model_name = "model-xy-1234"

# Create directories if necessary
model_dir = os.path.join("res", "models")
if not os.path.isdir(model_dir):
    os.makedirs(model_dir)

# Filename semantics:
# <model_name>_<train_size>_<output_dim>_<selftrained_embeddings>_<num_nodes_lstm>_<num_nodes_dense>_<num_epochs>.h5
model_file = model_name + "_" + str(X_train.shape[0]) \
    + "_" + str(embed_vec_size) + "_" + str(int(train_embedding)) \
    + "_" + str(num_nodes_lstm) + "_" + str(num_nodes_dense) \
    + "_" + str(num_epochs) + ".h5"

# To make predictions later the tokenizer has to be saved
tokenizer_dir = os.path.join("res", "tokenizers")
if not os.path.isdir(tokenizer_dir):
    os.makedirs(tokenizer_dir)
    
with open(os.path.join(tokenizer_dir, model_file[:-3] + ".pkl"), "wb") as f:
    pickle.dump(tokenizer, f)

def make_model(input_dim : int,
               embed_vec_size : int, nodes_lstm : int,
               nodes_dense : int, dropout_lstm : float = 0.1,
               dropout_lstm_recurrent : float = 0.1, dropout_dense : float = 0.5) -> Any:
    """
    Creates a sequential Keras Model with given parameters
    The embeddings have to be trained
    """
    model = Sequential()

    # Embedding layer (first layer of the network)
    # input_dim: number of unique words (num_code_words)
    # input_length = None allows input sequences of variable length
    model.add(
        Embedding(input_dim = input_dim,
                  input_length = None,
                  output_dim = embed_vec_size,
                  trainable = True,
                  mask_zero = True))

    # LSTM layer (recurrent part of the network)
    model.add(LSTM(nodes_lstm, return_sequences = False, 
                   dropout = dropout_lstm, recurrent_dropout = dropout_lstm_recurrent))

    # Dense layer (connections to all nodes in previous layer)
    model.add(Dense(nodes_dense, activation = 'relu'))

    # Dropout layer (helps to prevent overfitting during training)
    model.add(Dropout(dropout_dense))

    # Output layer (converts to normalized probability distribution)
    model.add(Dense(input_dim, activation = 'softmax'))
    
    return model



def make_model_pretrained(input_dim : int,
               embed_vec_size : int, embedding_matrix : np.array,
               nodes_lstm : int, nodes_dense : int,
               dropout_lstm : float = 0.1, dropout_lstm_recurrent : float = 0.1,
               dropout_dense : float = 0.5) -> Any:
    """
    Creates a sequential Keras Model with given parameters
    Uses pre-trained embeddings
    """
    model = Sequential()

    # Allow for variable length input
    model.add(Input(shape = (None,)))
    
    # Embedding layer
    # input_dim: number of unique words (num_code_words)
    # Keras expects the initial weights as a list of arrays
    model.add(
        Embedding(input_dim = input_dim,
                  input_length = None,
                  output_dim = embed_vec_size,
                  weights = [embedding_matrix],
                  trainable = False,
                  mask_zero = True))

    # Masking layer for pre-trained embeddings
    model.add(Masking(mask_value=0.0))

    # Recurrent layer
    model.add(LSTM(nodes_lstm, return_sequences = False, 
                   dropout = dropout_lstm, recurrent_dropout = dropout_lstm_recurrent))

    # Fully connected layer
    model.add(Dense(nodes_dense, activation = 'relu'))

    # Dropout for regularization
    model.add(Dropout(dropout_dense))

    # Output layer
    model.add(Dense(input_dim, activation = 'softmax'))
    
    return model

    

# Create Keras model
if train_embedding:
    model = make_model(num_code_words,
                       embed_vec_size, num_nodes_lstm,
                       num_nodes_dense)
else:
    #model = make_model_pretrained(num_pred, num_code_words,
    #                              embed_vec_size, embedding_matrix,
    #                              num_nodes_lstm, num_nodes_dense)
    raise NotImplementedError

# Compile the model
model.compile(
    optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

In [9]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 250)         1908750   
_________________________________________________________________
lstm (LSTM)                  (None, 1024)              5222400   
_________________________________________________________________
dense (Dense)                (None, 2048)              2099200   
_________________________________________________________________
dropout (Dropout)            (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 7635)              15644115  
Total params: 24,874,465
Trainable params: 24,874,465
Non-trainable params: 0
_________________________________________________________________
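
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical, and the vocabulary size of 7635 is read off the output shape of the last layer.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of size embed_dim per word in the vocabulary
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


vocab_size, embed_dim, lstm_units, dense_units = 7635, 250, 1024, 2048

counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, lstm_units),
    dense_params(lstm_units, dense_units),
    dense_params(dense_units, vocab_size),
]
total = sum(counts)

print(counts, total)
```

The output layer accounts for well over half of all parameters, so the vocabulary size has a large influence on the model size.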


<a id="keras_train_rnn"></a>
### Training the neural network

[Back to TOC](#toc)

#### Callbacks

For simplicity, we define our best model as the one with the lowest validation loss (val_loss).
In short, the callbacks used here are invoked by Keras during training. They save the best model seen so far (ModelCheckpoint), reduce the learning rate when the training loss stops improving (ReduceLROnPlateau) and write logging data for plotting diagrams of the metrics (TensorBoard). Optionally, the training process can be stopped prematurely if the val_loss is no longer decreasing (EarlyStopping, commented out in the code below).
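
The idea behind keeping the best model and stopping early can be illustrated without Keras. The following sketch is not the Keras implementation: `track_best` is a hypothetical helper that walks over a made-up list of per-epoch validation losses, remembers the epoch with the lowest loss and stops after `patience` epochs without improvement.

```python
def track_best(val_losses, patience):
    """Return (best_epoch, last_epoch_run) for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: this is where a checkpoint would be written
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs: stop training
                return best_epoch, epoch

    return best_epoch, len(val_losses) - 1


# Made-up validation losses for illustration
val_losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.1]
best_epoch, last_epoch = track_best(val_losses, patience=3)
print(best_epoch, last_epoch)
```

In this made-up run the later improvement at the last epoch is never reached, which illustrates the trade-off of stopping early with a small patience.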

In [7]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard

In [12]:
%load_ext tensorboard

In [11]:
train_log_dir = os.path.join("logs", "training", model_file[:-3])

def make_callbacks(model_name, save=True):
    """
    Creates callbacks for reducing the learning rate, TensorBoard logging
    and (if save is set) saving the best model during training
    """
    callbacks = []
    
    #stopping = EarlyStopping(monitor='val_loss', patience = 5)
    #callbacks.append(stopping)
    
    reduce = ReduceLROnPlateau(monitor = 'loss', factor = 0.2,
                               patience = 5, min_lr = 0.0001)
    callbacks.append(reduce)
    
    board = TensorBoard(log_dir = train_log_dir, profile_batch = 0)
    callbacks.append(board)
    
    if save:
        checkpoint = ModelCheckpoint(
            os.path.join(model_dir, model_file),
            save_best_only = True,
            save_weights_only = False)
        callbacks.append(checkpoint)
        
    return callbacks

callbacks = make_callbacks(model_name)

In [None]:
# On windows use 'tensorboard.exe --logdir logs/training' instead
%tensorboard --logdir "logs/training"

#### Training

In [None]:
# Train the model and save the 'best' version
history = model.fit(X_train,  y_train,  batch_size = num_batch, 
          epochs = num_epochs, callbacks = callbacks,
          validation_data = (X_valid, y_valid))

<a id="keras_predict_rnn"></a>
### Making Predictions

[Back to TOC](#toc)

This section is the final step towards our goal. We load our Keras model and its corresponding tokenizer and use them to make predictions. The routine given here takes arbitrary user input, but for best results we recommend supplying input from the learning domain or at least the general biomedical field (as PubMed is mainly a biomedical database). It then generates words until it predicts a sentence delimiter (.?!) or a timeout occurs (in that case it falls back to a fixed number of words). Since predictions will never be 100% accurate, the model does not always predict a sentence delimiter within a reasonable amount of time. A more comprehensive solution could instead rely on something like an HMM classifier to find a suitable end of the sentence and add punctuation accordingly, skipping the prediction of punctuation entirely.

The word that is selected from the output distribution of the network can be chosen either by applying argmax to the output directly (which can cause loops) or by sampling from the probability distribution with a certain randomness (probabilities are still taken into account). We opted for the latter in this implementation.
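
The difference between the two selection strategies can be illustrated on a made-up distribution. This is a sketch using only the standard library, not the notebook's NumPy code; `apply_temperature` and `sample_index` are hypothetical helpers, and the rescaling follows the same `exp(log(p) / rand)` idea used in the prediction routine.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution; values below 1 favour likely words more strongly."""
    scaled = [math.exp(math.log(p) / temperature) if p > 0 else 0.0 for p in probs]
    total = sum(scaled)
    # Renormalize so the result sums to 1 again
    return [value / total for value in scaled]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up output distribution over a three-word vocabulary
probs = [0.5, 0.3, 0.2]

# Strategy 1: argmax always picks the most likely word
greedy = max(range(len(probs)), key=probs.__getitem__)

# Strategy 2: sample from the rescaled distribution
sharpened = apply_temperature(probs, temperature=0.4)
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
sampled = sample_index(sharpened, rng)

print(greedy, [round(p, 3) for p in sharpened], sampled)
```

With a factor below 1 the most likely word becomes even more likely, but less likely words can still be drawn, which is what helps to break out of repetitive loops.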

#### Loading a model

In [8]:
import os, pickle
from tensorflow import device
from tensorflow.keras.models import load_model
from tensorflow.compat.v1.logging import set_verbosity, ERROR

# Suppress TF2 GPU Warnings
set_verbosity(ERROR)

In [18]:
model_file = "changed-network-brain-20-more-nodes_4370_250_1_1024_2048_150.h5"

model_dir = os.path.join("res", "models")
tokenizer_dir = os.path.join("res", "tokenizers")

In [19]:
# Loading with TF2-DirectML on GPU fails on Windows
# Therefore force loading on CPU
with device('/cpu:0'):
    model = load_model(os.path.join(model_dir, model_file))

model.summary()
#model.evaluate(X_valid, y_valid, batch_size=4096, verbose=1)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 250)         379000    
_________________________________________________________________
lstm (LSTM)                  (None, 1024)              5222400   
_________________________________________________________________
dense (Dense)                (None, 2048)              2099200   
_________________________________________________________________
dropout (Dropout)            (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1516)              3106284   
Total params: 10,806,884
Trainable params: 10,806,884
Non-trainable params: 0
_________________________________________________________________


#### Predicting with the loaded model

In [20]:
from tensorflow.errors import UnimplementedError
import numpy as np
import time, warnings, logging

# Init logging
logging.basicConfig(level=logging.DEBUG, format='[%(levelname)s] %(name)s: %(message)s')
log = logging.getLogger("Main")

# Suppress np.log divide by 0 warnings
warnings.filterwarnings('ignore')

In [21]:
# Load tokenizer
with open(os.path.join(tokenizer_dir, model_file[:-3] + ".pkl"), "rb") as f:
    tokenizer = pickle.load(f)

In [None]:
# Randomness factor for picking predictions
# from probability distribution (in range (0, 1])
rand = 0.4

# Timeout for prediction in seconds
timeout = 2

# Fallback prediction length
# in case end of sentence is not predicted in time
pred_len = 10

# Get input sequence
s = [input("Enter a sentence to complete: ")]
sequences = np.array(tokenizer.texts_to_sequences(s))
# Number of tokens in the input (axis 0 is the batch dimension)
len_orig = sequences.shape[1]

# Make prediction
predictions = []
elem = ""
t_start = time.time()

# Stop on punctuation (only symbols known to the tokenizer)
end_symbols = ['.', '?', '!']
end_tokens = []
for t in end_symbols:
    try:
        t = tokenizer.word_index[t]
        end_tokens.append(t)
    except KeyError:
        continue

while elem not in end_tokens:
    try:
        pred = model.predict(sequences)[0]
    except UnimplementedError:
        # No prediction is available in this case, so stop here
        log.warning("Input contains words that are not in vocabulary. Stopping prediction.")
        break

    # Introduce random factor to prevent getting
    # stuck in prediction loop
    pred = np.exp(np.log(pred) / rand)

    # Issue: https://github.com/numpy/numpy/issues/8317
    # Solution: https://stackoverflow.com/a/53605818
    pred = np.asarray(pred).astype('float64')

    # Renormalize so that the probabilities sum to 1
    pred = pred / pred.sum()

    # Pick one word from the
    # generated probability distribution
    probs = np.random.multinomial(1, pred, 1)[0]
    elem = np.argmax(probs)

    # Separately save predictions
    predictions.append(elem)

    # Append to current sequence for new input
    sequences = np.append(sequences, [[elem]], axis=1)

    # Timeout and fallback
    elapsed = time.time() - t_start
    if elapsed >= timeout:
        log.warning(f"Could not predict end of sentence within {timeout}s. Falling back to predicting {pred_len} words.")
        # Truncate along the token axis, not the batch axis
        sequences = sequences[:, :len_orig + pred_len]
        predictions = predictions[:pred_len]
        break

# Convert model output to human-readable text
predictions = tokenizer.sequences_to_texts([predictions])[0]

log.info(f"Predicted sequence:\n{s[0]}\n{predictions}")
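
The effect of the randomness factor `rand` is easier to see in isolation. The following is a minimal sketch of the rescaling step from the loop above, applied to a made-up three-word distribution. The function names and the use of `np.random.default_rng` are assumptions for illustration and not part of the notebook.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: values < 1 sharpen it, values > 1 flatten it."""
    probs = np.asarray(probs, dtype="float64")
    # log(0) yields -inf, which exp() maps back to probability 0
    with np.errstate(divide="ignore"):
        scaled = np.exp(np.log(probs) / temperature)
    return scaled / scaled.sum()


def sample_token(probs, temperature, rng):
    """Draw one token index from the rescaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return int(rng.choice(len(scaled), p=scaled))


# Hypothetical model output for a vocabulary of three words
probs = [0.5, 0.3, 0.2]

sharp = apply_temperature(probs, 0.4)      # same factor as `rand` above
unchanged = apply_temperature(probs, 1.0)  # factor 1 keeps the distribution

rng = np.random.default_rng(0)
token = sample_token(probs, 0.4, rng)

print(sharp, unchanged, token)
```

With a factor of 0.4 the most likely word becomes even more likely, so the output stays close to the greedy choice while still leaving room to escape repeating loops.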

<a id="evaluation"></a>
## Evaluation

To evaluate our models we mainly use the val_loss (validation loss) and val_acc (validation accuracy) metrics. We define the 'best' model (across epochs) as the one with the lowest validation loss. We then change parameters of the [dataset](#eval_dataset), [data preparation](#eval_preparation) and [network and training](#eval_rnn) to improve our model with respect to the individual stages. To keep things simple and organized, not all diagrams are shown in this document. For all diagrams and the corresponding model names, check the 'doc' folder.
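
As a minimal sketch of this selection rule, the snippet below picks the epoch with the lowest validation loss from a dictionary of per-epoch metrics, in the shape that Keras exposes as `History.history`. The metric values and the helper name `best_epoch` are made up for illustration.

```python
def best_epoch(history, monitor="val_loss"):
    """Return the 1-based epoch with the lowest monitored value and its metrics."""
    values = history[monitor]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, {name: series[idx] for name, series in history.items()}


# Hypothetical per-epoch metrics (not taken from our training runs)
history = {
    "loss": [6.1, 5.2, 4.8, 4.5],
    "val_loss": [6.0, 5.6, 5.7, 5.9],
    "val_acc": [0.15, 0.20, 0.21, 0.22],
}

epoch, metrics = best_epoch(history)
print(epoch, metrics)
```

In this made-up run the lowest val_loss is reached in epoch 2 although val_acc keeps rising afterwards, which is why the two maxima and minima in the metrics tables do not have to come from the same epoch.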

The model with initial, basic parameters is our baseline model. The params are as follows: 
    
    # Network params
    train_embedding = True
    embed_vec_size = 200

    num_nodes_lstm = 128
    num_nodes_dense = 128

    # Training params
    num_batch = 2800
    num_epochs = 150
    
Additionally, the tokenizer does not convert words to lowercase and punctuation is kept. The dataset used contains 1000 abstracts and 252276 sequences are extracted as training data from it. The PubMed query used is 'clustering[ti] algorithm' and the train/validation split is 0.8/0.2.

    model-baseline_252276_200_1_128_128_150
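
The model name appears to encode the parameters listed above. The sketch below splits such a name into its fields, assuming the order label, number of sequences, embedding vector size, embedding training flag, LSTM nodes, dense nodes and epochs. This order is inferred from the baseline values and is not confirmed by the project code, and `ModelName` and `parse_model_name` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class ModelName:
    label: str
    num_sequences: int
    embed_vec_size: int
    train_embedding: bool
    num_nodes_lstm: int
    num_nodes_dense: int
    num_epochs: int


def parse_model_name(name):
    """Split a model name into its fields (field order is an assumption)."""
    stem = name[:-3] if name.endswith(".h5") else name
    # The label may contain dashes, so split the six numeric fields off the right
    label, *fields = stem.rsplit("_", 6)
    seqs, embed, train, lstm, dense, epochs = map(int, fields)
    return ModelName(label, seqs, embed, bool(train), lstm, dense, epochs)


baseline = parse_model_name("model-baseline_252276_200_1_128_128_150")
print(baseline)
```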

<table>
  <tr>
    <td><img src="doc/baseline/images/epoch_val_loss.svg" alt="baseline model val_loss" width="500"></td>
    <td><img src="doc/baseline/images/epoch_val_acc.svg" alt="baseline model val_acc" width="500"></td>
  </tr>
  <tr>
    <td>baseline model val_loss</td>
    <td>baseline model val_acc</td>
  </tr>
</table>

x-Axis scale: 5 Epochs per vertical line

Metrics:

|    Metrics        	| Baseline Model |
|:---------------------:|:--------------:|
| acc (max)           	|     23.33%     |
| loss (min)          	|      4.504     |
| val_acc (max)       	|     21.45%     |
| val_loss (min)      	|      5.553     |
| step (max val_loss) 	|       44       |
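
To get a feeling for the loss values, they can be converted to a perplexity. The sketch below assumes that the loss is a cross-entropy measured with the natural logarithm, which is the usual setup for next-word prediction but is an assumption here. The function name is illustrative.

```python
import math


def perplexity(cross_entropy):
    """Perplexity for a cross-entropy given in nats (assumed loss definition)."""
    return math.exp(cross_entropy)


baseline_val_loss = 5.553  # val_loss (min) of the baseline model
baseline_perplexity = perplexity(baseline_val_loss)
print(f"{baseline_perplexity:.0f}")
```

Under that assumption, a val_loss of 5.553 corresponds to the model being about as uncertain as if it chose uniformly among roughly 258 words at each step.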


<a id="eval_dataset"></a>
### Dataset comparison

[Back to TOC](#toc)

The dataset size has a great impact on the quality of the model. In these graphs we compare different dataset sizes for the single query 'clustering[ti] algorithm', which is also used for the baseline model with 1000 abstracts (<font color="#0A69A0">blue curve</font> in all dataset diagrams). The query used for this dataset yielded around 4000 results on PubMed.  

Other curves:

- <font color="#C83369">100 abstracts</font> (pink curve)
- <font color="#009988">200 abstracts</font> (green curve)
- <font color="#CC3311">500 abstracts</font> (red curve)
- <font color="#33A0C8">2000 abstracts</font> (light blue curve)

<table>
  <tr>
    <td><img src="doc/datasets/images/epoch_val_loss_cluster.svg" alt="Clustering Algorithm Dataset val_loss" width="500"></td>
    <td><img src="doc/datasets/images/epoch_val_acc_cluster.svg" alt="Clustering Algorithm Dataset val_acc" width="500"></td>
  </tr>
  <tr>
    <td>Clustering Algorithm Dataset val_loss</td>
    <td>Clustering Algorithm Dataset val_acc</td>
  </tr>
</table>

x-Axis scale: 10 Epochs per vertical line

We can see that the validation loss gets lower with larger dataset sizes. The dataset with 2000 abstracts achieves the lowest val_loss with 5.471 in epoch 32. With more data, the improvement of the model per epoch also rises. The validation accuracy is roughly between 20% and 27% for sizes > 100 abstracts, and the greatest val_acc (26.98%) is achieved by the dataset with only 200 abstracts after 92 epochs. We also tested models trained on small datasets with <50 abstracts. These achieve very high validation accuracies and low loss values, but their vocabulary is quite limited and they don't generalize well compared to the models trained on more data.

#### Trying different datasets

Query: 'covid-19'  
Results on PubMed: 211602  
Curves:  

- <font color="#0077BB">200 abstracts</font> (blue curve; val_loss higher than baseline; val_acc lower than baseline)
- <font color="#CC3311">500 abstracts</font> (red curve)
- <font color="#FF7043">2000 abstracts</font> (orange curve)

<table>
  <tr>
    <td><img src="doc/datasets/images/epoch_val_loss_covid.svg" alt="Covid-19 Dataset val_loss" width="500"></td>
    <td><img src="doc/datasets/images/epoch_val_acc_covid.svg" alt="Covid-19 Dataset val_acc" width="500"></td>
  </tr>
  <tr>
    <td>Covid-19 Dataset val_loss</td>
    <td>Covid-19 Dataset val_acc</td>
  </tr>
</table>

x-Axis scale: 10 Epochs per vertical line

With this more general query (significantly more results than the clustering algorithm one) we can see a stable, gradual improvement in val_loss and val_acc with dataset size. The best model in this batch is the one with 2000 abstracts, which achieves a minimum val_loss of 5.616 and a maximum val_acc of 22.32%. Overall, these models perform worse than the models trained on the clustering algorithm dataset.

<hr></hr>

Query: 'covid-19 vaccine'  
Results on PubMed: 15601  
Curves:  

- <font color="#009988">200 abstracts</font> (green curve)
- <font color="#EE3377">500 abstracts</font> (pink curve)
- <font color="#33BBEE">2000 abstracts</font> (light blue curve)

<table>
  <tr>
    <td><img src="doc/datasets/images/epoch_val_loss_covid_vaccine.svg" alt="Covid-19 vaccine Dataset val_loss" width="500"></td>
    <td><img src="doc/datasets/images/epoch_val_acc_covid_vaccine.svg" alt="Covid-19 vaccine Dataset val_acc" width="500"></td>
  </tr>
  <tr>
    <td>Covid-19 vaccine Dataset val_loss</td>
    <td>Covid-19 vaccine Dataset val_acc</td>
  </tr>
</table>

x-Axis scale: 10 Epochs per vertical line

In an effort to narrow down the query results we found this dataset to be a good middle ground.
The best model was again the one trained on 2000 abstracts: the val_loss was lowered to 'only' 5.402, while the val_acc stayed about the same at 22.31%. We suspect that more general queries decrease the performance on predictions related to specific parts of the topic but increase the performance on predictions regarding a broader spectrum of the topic. This would mean that narrowing down the query to a really specific subject does not necessarily yield better models in every case, since at some point the model is only really useful for very specific input data. A good middle ground between fitting to the data and generalization has to be found in that regard. This could be achieved by testing and evaluating the actual predictions by hand and picking the best model according to the requirements of the application.

<hr></hr>

Query: 'brain surgery[Title/Abstract]'  
Results on PubMed: 2079  
Curves:  

- <font color="#AD3318">200 abstracts</font> (red curve)
- <font color="#0077BB">500 abstracts</font> (blue curve; val_loss higher than baseline; val_acc lower than baseline)
- <font color="#D66440">1000 abstracts</font> (orange curve)
- <font color="#A0A0A0">2000 abstracts</font> (grey curve)

<table>
  <tr>
    <td><img src="doc/datasets/images/epoch_val_loss_brain_surgery.svg" alt="brain surgery Dataset val_loss" width="500"></td>
    <td><img src="doc/datasets/images/epoch_val_acc_brain_surgery.svg" alt="brain surgery Dataset val_acc" width="500"></td>
  </tr>
  <tr>
    <td>Brain surgery Dataset val_loss</td>
    <td>Brain surgery Dataset val_acc</td>
  </tr>
</table>

x-Axis scale: 10 Epochs per vertical line

Narrowing down the results even more to just over 2000, we can see that the dataset with 2000 abstracts still yields the best model with val_loss = 5.504 and val_acc = 25.78%. The val_loss of this model is slightly better than that of the baseline model and the val_acc is significantly better as well. Again, we expect this to be caused by the narrower query.

<hr></hr>

Queries: ["brain surgery[Title/Abstract]", "clustering[ti] algorithm", "covid-19 vaccine"]  
Results on PubMed: -    
Curves:  

- <font color="#33BBEE">210 abstracts</font> (light blue curve)
- <font color="#C83369">510 abstracts</font> (pink curve)
- <font color="#0A8477">999 abstracts</font> (green curve)

<table>
  <tr>
    <td><img src="doc/datasets/images/epoch_val_loss_multiple.svg" alt="Multiple query Dataset val_loss" width="500"></td>
    <td><img src="doc/datasets/images/epoch_val_acc_multiple.svg" alt="Multiple query Dataset val_acc" width="500"></td>
  </tr>
  <tr>
    <td>Multiple query Dataset val_loss</td>
    <td>Multiple query Dataset val_acc</td>
  </tr>
</table>

x-Axis scale: 10 Epochs per vertical line

Lastly, we tried combining the previous queries to possibly enable predictions for broader topics. As expected, the val_loss and val_acc values are significantly worse than those of most models we tried so far. Since our baseline network only has 256 neurons in the hidden layers in total, it is quite possible that increasing the number of LSTM and dense nodes will allow for more trainable parameters and therefore more 'learning capacity'. The best models we tested so far seem limited to about 30% val_acc and 5.4 val_loss, which might be caused by the small number of neurons 'bottlenecking' performance. With more neurons, datasets based on more general or multiple queries would probably yield considerably better models; due to our limited computing capacity and time, we will not pursue that direction further here.


<a id="eval_preparation"></a>
### Data preparation comparison

[Back to TOC](#toc)

<a id="eval_rnn"></a>
### Network comparison

[Back to TOC](#toc)

<a id="eval_conclusion"></a>
### Conclusion

[Back to TOC](#toc)

The val_acc of the 'best' models with the initial network seemed to max out at around 27-30%. To address this, in a final attempt to improve our model we opted for significantly more LSTM and dense nodes in combination with additional preprocessing and the narrow 'brain surgery[Title/Abstract]' dataset. The new network params are as follows:

    # Network params
    train_embedding = True
    embed_vec_size = 250

    num_nodes_lstm = 1024
    num_nodes_dense = 2048

    # Training params
    num_batch = 128
    num_epochs = 150
    
Curves:  

- <font color="#355C95">Baseline model (1000 abstracts)</font> (blue curve)
- <font color="#3F786D">Clustering algorithm dataset model (200 abstracts)</font> (green curve)
- <font color="#903A1D">20 abstracts</font> (red curve)
- <font color="#BA643E">200 abstracts + lowercase tokenizer</font> (orange curve)
- <font color="#5891BD">2000 abstracts</font> (light blue curve)

<img src="doc/optimization/images/epoch_val_loss_all.svg" alt="val_loss" width="1000">
val_loss of all compared models

<img src="doc/optimization/images/epoch_val_acc_all.svg" alt="val_acc" width="1000">
val_acc of all compared models

x-Axis scale: 10 Epochs per vertical line

Metrics:

|                     	| Baseline Model 	| Best Dataset 200 	| More Nodes 20 	| More Nodes 2000 	| More Nodes 200 (lowercase) 	|
|---------------------	|:--------------:	|:----------------:	|:-------------:	|:---------------:	|:--------------------------:	|
| acc (max)           	|     23.33%     	|      35.46%      	|     71.72%    	|      75.93%     	|           62.67%           	|
| loss (min)          	|      4.504     	|       2.978      	|     0.998     	|      0.942      	|            1.664           	|
| val_acc (max)       	|     21.45%     	|      26.98%      	|     48.58%    	|      54.54%     	|           41.09%           	|
| val_loss (min)      	|      5.553     	|       5.677      	|     5.037     	|      4.832      	|            5.019           	|
| step (max val_loss) 	|       44       	|        92        	|       18      	|        18       	|             10             	|
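
One way to read the table is to look at the gap between the training and validation accuracy of each model. The sketch below only computes these differences from the values in the table; the variable names are illustrative.

```python
# (acc max, val_acc max) in percent, copied from the metrics table
metrics = {
    "Baseline Model": (23.33, 21.45),
    "Best Dataset 200": (35.46, 26.98),
    "More Nodes 20": (71.72, 48.58),
    "More Nodes 2000": (75.93, 54.54),
    "More Nodes 200 (lowercase)": (62.67, 41.09),
}

# Difference between training and validation accuracy in percentage points
gaps = {name: round(acc - val_acc, 2) for name, (acc, val_acc) in metrics.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<28} {gap:6.2f}")
```

Note that the two maxima of a model are not necessarily reached in the same epoch, so these gaps are only a rough indication.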

TODO: Write evaluation when we have data preparation and network results and possibly test another model
