## 🛠 Exercises

1. Train `model_5` on all of the data in the training dataset for as many epochs as it takes until it stops improving. Since this might take a while, you might want to use:
  * [`tf.keras.callbacks.ModelCheckpoint`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to save the model's best weights only.
  * [`tf.keras.callbacks.EarlyStopping`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) to stop the model from training once the validation loss has stopped improving for ~3 epochs.
2. Check out the [Keras guide on using pretrained GloVe embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/). Can you get this working with one of our models?
  * Hint: You'll want to incorporate it with a custom token [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.
  * It's up to you whether or not you fine-tune the GloVe embeddings or leave them frozen.
3. Try replacing the TensorFlow Hub Universal Sentence Encoder pretrained embedding with the [TensorFlow Hub BERT PubMed expert](https://tfhub.dev/google/experts/bert/pubmed/2) (a language model pretrained on PubMed texts) pretrained embedding. Does this affect results?
  * Note: Using the BERT PubMed expert pretrained embedding requires an extra preprocessing step for sequences (as detailed in the [TensorFlow Hub guide](https://tfhub.dev/google/experts/bert/pubmed/2)).
  * Does the BERT model beat the results mentioned in this paper? https://arxiv.org/pdf/1710.06071.pdf 
4. What happens if you merge our `line_number` and `total_lines` features for each sequence? For example, create an `X_of_Y` feature instead? Does this affect model performance?
  * Another example: `line_number=1` and `total_lines=11` turns into `line_of_X=1_of_11`.
5. Write a function (or series of functions) to take a sample abstract string, preprocess it (in the same way our model has been trained), make a prediction on each sequence in the abstract and return the abstract in the format:
  * `PREDICTED_LABEL`: `SEQUENCE`
  * `PREDICTED_LABEL`: `SEQUENCE`
  * `PREDICTED_LABEL`: `SEQUENCE`
  * `PREDICTED_LABEL`: `SEQUENCE`
  * ...
    * You can find your own unstructured RCT abstract from PubMed or try this one from: [*Baclofen promotes alcohol abstinence in alcohol dependent cirrhotic patients with hepatitis C virus (HCV) infection*](https://pubmed.ncbi.nlm.nih.gov/22244707/).
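
For exercise 4, one way to build the combined feature is plain string formatting. The following is a minimal sketch, assuming each sample is a dictionary with `line_number` and `total_lines` keys (as produced by the preprocessing later in this notebook). The function name `make_line_of_total` and the sample values are hypothetical.

```python
def make_line_of_total(line_number, total_lines):
    """Merge two positional features into one string, e.g. 1 and 11 -> '1_of_11'."""
    return f"{line_number}_of_{total_lines}"

# Hypothetical samples mimicking the per-line dictionaries used in this notebook
samples = [
    {"line_number": 0, "total_lines": 11},
    {"line_number": 1, "total_lines": 11},
    {"line_number": 11, "total_lines": 11},
]

line_of_total = [make_line_of_total(s["line_number"], s["total_lines"]) for s in samples]
print(line_of_total)  # ['0_of_11', '1_of_11', '11_of_11']

# Each distinct string is one category; map it to an integer index
# so it can be one-hot encoded like the other positional features
vocab = {value: index for index, value in enumerate(sorted(set(line_of_total)))}
encoded = [vocab[value] for value in line_of_total]
```

Note that the merged feature has many more categories than either original feature, so the one-hot depth would need to be chosen accordingly.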

We begin with some standard imports.

In [10]:
# Required imports
import pandas as pd
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import os
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from helper_functions import calculate_results
from tensorflow.keras.layers import TextVectorization  # non-experimental path (TensorFlow 2.6+)
import string
from tensorflow.keras.utils import plot_model  # use tf.keras consistently rather than standalone keras
import tensorflow.keras.layers as layers
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping

We need the PubMed RCT dataset, so we clone the corresponding GitHub repository.

In [2]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git
!cd pubmed-rct

fatal: destination path 'pubmed-rct' already exists and is not an empty directory.


The following cell provides all the prerequisites we need to work through the exercises without interruption.

In [3]:
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"
filenames = [data_dir + filename for filename in os.listdir(data_dir)]

def get_lines(filename):
  """
  Reads filename (a text filename) and returns the lines of text as a list.

  Args:
    filename: a string containing the target filepath.

  Returns:
    A list of strings with one string per line from the target filename.
  """
  with open(filename, 'r') as file:
    return file.readlines()

def preprocess_text_with_line_numbers(filename):
  """
  Returns a list of dictionaries of abstract line data.

  Takes in filename, reads its contents and sorts through each line,
  extracting things like the target label, the text of the sentence,
  how many sentences are in the current abstract and what sentence
  number the target line is.
  """
  input_lines = get_lines(filename) # get all lines from filename
  abstract_lines = "" # create an empty abstract
  abstract_samples = [] # create an empty list of abstracts

  # Loop through each line in the target file
  for line in input_lines:
    if line.startswith("###"):  # check to see if line is an ID line
      abstract_id = line
      abstract_lines = "" # reset abstract string
    elif line.isspace():  # check to see if line is a new line
      abstract_line_split = abstract_lines.splitlines() # split abstract into separate lines

      # Iterate through each line in a single abstract and count them at the same time
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {}  # create empty dict to store data from line
        target_text_split = abstract_line.split("\t") # split target label from text
        line_data['target'] = target_text_split[0]  # get target label
        line_data['text'] = target_text_split[1].lower()  # get target text and lower it
        line_data['line_number'] = abstract_line_number # what number line does the line appear in the abstract?
        line_data['total_lines'] = len(abstract_line_split) - 1 # how many total lines are in the abstract? (start from 0)
        abstract_samples.append(line_data) # add line data to the list of abstract samples

    else: # if the above conditions aren't fulfilled, the line contains a labelled sentence
      abstract_lines += line

  return abstract_samples

train_df = pd.DataFrame(preprocess_text_with_line_numbers(data_dir+'train.txt'))
val_df = pd.DataFrame(preprocess_text_with_line_numbers(data_dir+'dev.txt'))
test_df = pd.DataFrame(preprocess_text_with_line_numbers(data_dir+'test.txt'))

train_sentences = train_df['text'].tolist()
val_sentences = val_df['text'].tolist()
test_sentences = test_df['text'].tolist()

one_hot_encoder = OneHotEncoder(sparse=False) # we want non-sparse matrix
train_labels_one_hot = one_hot_encoder.fit_transform(train_df.target.to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df.target.to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df.target.to_numpy().reshape(-1, 1))

le = LabelEncoder()
train_labels_encoded = le.fit_transform(train_df.target.to_numpy())
val_labels_encoded = le.transform(val_df.target.to_numpy())
test_labels_encoded = le.transform(test_df.target.to_numpy())

tf_hub_embedding_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        trainable=False,
                                        name='universal_sentence_encoder')

def split_chars(text):
  return " ".join(list(text))

train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]

char_length = [len(sentence) for sentence in train_sentences]

alphabet = string.ascii_lowercase + string.digits + string.punctuation
output_seq_char_len = int(np.percentile(char_length, 95))
NUM_CHAR_TOKENS = len(alphabet) + 2
char_vec = TextVectorization(max_tokens=NUM_CHAR_TOKENS,
                             output_sequence_length=output_seq_char_len,
                             name='character_vectorizer')
char_vec.adapt(train_chars)

char_vocab = char_vec.get_vocabulary()
char_embed = layers.Embedding(input_dim=len(char_vocab),  # number of different characters
                              output_dim=25,  # this is the size of the char embedding in the paper: https://arxiv.org/pdf/1612.05251.pdf (Figure 1)
                              mask_zero=True,
                              name='char_embed')

train_line_numbers_one_hot = tf.one_hot(train_df.line_number.to_numpy(), depth=15)
val_line_numbers_one_hot = tf.one_hot(val_df.line_number.to_numpy(), depth=15)
test_line_numbers_one_hot = tf.one_hot(test_df.line_number.to_numpy(), depth=15)

train_lines_total_one_hot = tf.one_hot(train_df.total_lines.to_numpy(), depth=20)
val_lines_total_one_hot = tf.one_hot(val_df.total_lines.to_numpy(), depth=20)
test_lines_total_one_hot = tf.one_hot(test_df.total_lines.to_numpy(), depth=20)

train_tribrid_data = tf.data.Dataset.from_tensor_slices((train_line_numbers_one_hot,
                                                         train_lines_total_one_hot,
                                                         train_sentences,
                                                         train_chars))
train_tribrid_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_tribrid_dataset = tf.data.Dataset.zip((train_tribrid_data, train_tribrid_labels))
train_tribrid_dataset = train_tribrid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

val_tribrid_data = tf.data.Dataset.from_tensor_slices((val_line_numbers_one_hot,
                                                       val_lines_total_one_hot,
                                                       val_sentences, 
                                                       val_chars))
val_tribrid_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_tribrid_dataset = tf.data.Dataset.zip((val_tribrid_data, val_tribrid_labels))
val_tribrid_dataset = val_tribrid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

test_tribrid_data = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot,
                                                        test_lines_total_one_hot,
                                                        test_sentences, 
                                                        test_chars))
test_tribrid_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_tribrid_dataset = tf.data.Dataset.zip((test_tribrid_data, test_tribrid_labels))
test_tribrid_dataset = test_tribrid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
train_tribrid_dataset



<PrefetchDataset element_spec=((TensorSpec(shape=(None, 15), dtype=tf.float32, name=None), TensorSpec(shape=(None, 20), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.string, name=None)), TensorSpec(shape=(None, 5), dtype=tf.float64, name=None))>
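
To make the parsing step in `preprocess_text_with_line_numbers` concrete, here is a self-contained sketch of the same logic applied to an in-memory string. The sample abstract and the function name `parse_abstracts` are made up for illustration. The file layout (a `###` ID line, tab-separated `LABEL<TAB>text` lines, and a blank line closing each abstract) is assumed from the code above.

```python
def parse_abstracts(raw_text):
    """Parse PubMed-RCT-style text into a list of per-line dictionaries."""
    samples = []
    current = []  # labelled lines of the abstract currently being read
    for line in raw_text.splitlines():
        if line.startswith("###"):      # ID line: start a new abstract
            current = []
        elif not line.strip():          # blank line: the abstract is complete
            for number, labelled in enumerate(current):
                target, text = labelled.split("\t", 1)
                samples.append({
                    "target": target,
                    "text": text.lower(),
                    "line_number": number,
                    "total_lines": len(current) - 1,  # zero-based, like line_number
                })
            current = []
        else:                           # labelled sentence
            current.append(line)
    return samples

# Made-up two-sentence abstract in the assumed format
sample = "###12345\nOBJECTIVE\tTo test a thing .\nRESULTS\tIt worked in @ patients .\n\n"
for row in parse_abstracts(sample):
    print(row)
```

Each dictionary carries the same four keys as the rows of `train_df`, so `line_number` and `total_lines` both start from 0.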

### Exercise 1: Training the model until it stops improving

With `tensorflow.keras.callbacks` we can stop training once a specified criterion has been met. In this case, the criterion is the stagnation of `val_loss`.
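
The idea behind patience-based early stopping can be sketched in plain Python. This is a simplified illustration of the `patience` and `min_delta` logic, not Keras's actual implementation. The function name `epochs_run` and the loss values are made up.

```python
def epochs_run(val_losses, patience=3, min_delta=0.01):
    """Return how many epochs run before patience-based early stopping triggers.

    An epoch counts as an improvement only if the loss drops by more than
    min_delta below the best value seen so far.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # training would stop after this epoch
    return len(val_losses)  # never triggered: all epochs run

# Made-up validation losses: improvement stalls after the third epoch
losses = [1.20, 1.05, 0.98, 0.975, 0.979, 0.981, 0.90]
print(epochs_run(losses))  # 6
```

In this example the drop to 0.975 is smaller than `min_delta`, so it does not reset the counter, and training stops before the later improvement to 0.90 is ever seen.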

In [4]:
# 1. Token inputs
token_inputs = layers.Input(shape=[], dtype=tf.string)
token_embeddings = tf_hub_embedding_layer(token_inputs)
token_outputs = layers.Dense(128, activation='relu')(token_embeddings)
token_model = tf.keras.Model(token_inputs,
                             token_outputs)

# 2. Char inputs
char_inputs = layers.Input(shape=(1,), dtype=tf.string)
char_vectors = char_vec(char_inputs)
char_embeddings = char_embed(char_vectors)
char_bi_lstm = layers.Bidirectional(layers.LSTM(24))(char_embeddings)
char_model = tf.keras.Model(char_inputs,
                            char_bi_lstm)

# 3. Line number model
line_number_inputs = layers.Input(shape=(15, ), dtype=tf.float32)
line_number_outputs = layers.Dense(32, activation='relu')(line_number_inputs)
line_number_model = tf.keras.Model(line_number_inputs,
                                   line_number_outputs)

# 4. Total lines model
total_lines_inputs = layers.Input(shape=(20, ), dtype=tf.float32)
total_lines_outputs = layers.Dense(32, activation='relu')(total_lines_inputs)
total_lines_model = tf.keras.Model(total_lines_inputs,
                                   total_lines_outputs)

# 5. Combine models 1 and 2
combined_embeddings = layers.Concatenate(name='char_token_hybrid_embedding')([token_model.output, char_model.output])
z = layers.Dense(256, activation='relu')(combined_embeddings)
z = layers.Dropout(0.5)(z)

# 6. Combine positional embedding with combined token and char embeddings
tribrid_embeddings = layers.Concatenate(name='char_token_positional_embeddings')([line_number_model.output,
                                                                                 total_lines_model.output,
                                                                                 z])

# 7. Create output layer
output_layer = layers.Dense(5, activation='softmax', name='output_layer')(tribrid_embeddings)

# 8. Get it all together
model_6 = tf.keras.Model(inputs=[line_number_model.input,
                                 total_lines_model.input,
                                 token_model.input,
                                 char_model.input],
                         outputs=output_layer,
                         name='model_6_tribrid_enhanced')

In [11]:
plot_model(model_6)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model/model_to_dot to work.


In [12]:
# Get the summary
model_6.summary()

Model: "model_6_tribrid_enhanced"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 input_1 (InputLayer)           [(None,)]            0           []                               
                                                                                                  
 character_vectorizer (TextVect  (None, 290)         0           ['input_2[0][0]']                
 orization)                                                                                       
                                                                                                  
 universal_sentence_encoder (Ke  (None, 512)         256797824   ['input_1[

In [13]:
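checkpoint_filepath = os.path.join('tmp', 'checkpoint')  # portable across operating systems
mc = ModelCheckpoint(filepath=checkpoint_filepath,
                     save_weights_only=True,
                     monitor='val_accuracy',  # judge "best" on validation data, not training data
                     mode='max',
                     save_best_only=True)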

es = EarlyStopping(monitor='val_loss',
                   min_delta=0.01,
                   patience=3)

In [14]:
# Compile the model
model_6.compile(loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2),
                optimizer='adam',
                metrics=['accuracy'])
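
The `label_smoothing=0.2` argument softens the one-hot targets before the loss is computed. As a sketch of the arithmetic, each target is blended with a uniform distribution over the classes: `y * (1 - label_smoothing) + label_smoothing / num_classes`. The helper `smooth_labels` below is hypothetical and only illustrates that formula.

```python
def smooth_labels(one_hot, label_smoothing=0.2):
    """Blend a one-hot vector with a uniform distribution over the classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0, 0], label_smoothing=0.2)
print([round(value, 2) for value in smoothed])  # [0.04, 0.04, 0.84, 0.04, 0.04]
```

With five classes, the true class target becomes 0.84 instead of 1.0, which discourages the model from becoming overconfident in a single class.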

In [16]:
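# Fit the model
# Note: steps_per_epoch and validation_steps limit each epoch to ~10% of the batches;
# omit both arguments to iterate over the full datasets every epoch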
history_6 = model_6.fit(train_tribrid_dataset,
                        steps_per_epoch=int(0.1*len(train_tribrid_dataset)),
                        epochs=20,
                        validation_data=val_tribrid_dataset,
                        validation_steps=int(0.1*len(val_tribrid_dataset)),
                        callbacks=[mc, es])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20






In [17]:
# Evaluate on whole validation dataset
model_6.evaluate(val_tribrid_dataset)



[0.9067788124084473, 0.8492652177810669]

In [18]:
# Make predictions
model_6_pred_probs = model_6.predict(val_tribrid_dataset)
model_6_pred_probs

array([[0.58933675, 0.10183831, 0.01632273, 0.26716176, 0.02534041],
       [0.57850176, 0.10878776, 0.07371843, 0.21760412, 0.02138789],
       [0.38493866, 0.09261087, 0.08450082, 0.38327268, 0.054677  ],
       ...,
       [0.02926422, 0.07242577, 0.02314727, 0.03077097, 0.84439176],
       [0.01994521, 0.3650633 , 0.05783086, 0.02192453, 0.53523606],
       [0.08063398, 0.82472867, 0.0431503 , 0.02469845, 0.02678857]],
      dtype=float32)

In [19]:
# Turn predictions into labels
model_6_preds = tf.argmax(model_6_pred_probs, axis=1)
model_6_preds

<tf.Tensor: shape=(30212,), dtype=int64, numpy=array([0, 0, 0, ..., 4, 4, 1], dtype=int64)>
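
The integer predictions can be mapped back to class names, which is also the core of exercise 5's `PREDICTED_LABEL: SEQUENCE` format. A minimal sketch follows, assuming the five PubMed RCT labels in alphabetical order (the order `LabelEncoder` assigns). In the notebook itself, `le.classes_` would supply this list. The function name `label_sentences`, the probabilities, and the sentences are made up.

```python
# Assumed class order: LabelEncoder sorts labels alphabetically
CLASS_NAMES = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]

def label_sentences(pred_probs, sentences, class_names=CLASS_NAMES):
    """Pair each sentence with the name of its highest-probability class."""
    lines = []
    for probs, sentence in zip(pred_probs, sentences):
        best_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
        lines.append(f"{class_names[best_index]}: {sentence}")
    return lines

# Made-up prediction probabilities and sentences
example_probs = [
    [0.59, 0.10, 0.02, 0.27, 0.02],
    [0.03, 0.07, 0.02, 0.03, 0.85],
]
example_sentences = ["alcohol dependence is common .", "abstinence improved in @ patients ."]

for line in label_sentences(example_probs, example_sentences):
    print(line)
```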

In [20]:
# Get all metrics
model_6_results = calculate_results(val_labels_encoded, model_6_preds)
model_6_results

{'accuracy': 84.92651926386866,
 'precision': 0.8518644675250031,
 'recall': 0.8492651926386866,
 'f1': 0.8456816522143424}
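
Note that `accuracy` is reported as a percentage while the other three values are fractions. The recall equals the accuracy divided by 100, which is what support-weighted recall always gives. The sketch below shows how such metrics can be computed in plain Python, assuming `calculate_results` uses support-weighted averaging (an assumption consistent with the output above, not confirmed by it). The function name `weighted_metrics` and the labels are made up.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy (percent) plus support-weighted precision, recall and F1 (fractions)."""
    total = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / total     # class share of all true samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"accuracy": 100 * sum(correct.values()) / total,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration
results = weighted_metrics([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(results)
```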