# Milestone Project 2: SkimLit

The purpose of this notebook is to build an NLP model to make reading medical abstracts easier.

The paper we're replicating (the source of the dataset we'll be using) is available here: https://arxiv.org/abs/1710.06071

Reading through the paper above, we see that the model architecture they use to achieve their best results is described in: https://arxiv.org/abs/1612.05251



## Confirm access to GPU


In [None]:
!nvidia-smi -L

## Get data

Since we'll be replicating the paper above (PubMed 200k RCT), let's download the dataset they used.

We can do so from the authors' GitHub: https://github.com/Franck-Dernoncourt/pubmed-rct


In [None]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

In [None]:
# Check what files are in the Pubmed_20k dataset
!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/

In [None]:
# Start our experiments using the 20k dataset with numbers replaced by "@" sign
data_dir = "/kaggle/working/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [None]:
# Check all of the filenames in the target directory
import os
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

## Preprocess the data

Now we've got some text data, it's time to become one with it.

And one of the best ways to become one with the data is to...
> Visualize, visualize, visualize

So with that in mind, let's write a function to read in all of the lines of a target text file.

In [None]:
# Create a function to read the lines of a document
def get_lines(filename):
  """
  Reads filename (a text filename) and returns the lines of text as a list.

  Args:
    filename: a string containing the target filepath.

  Returns:
    A list of strings with one string per line from the target filename.
  """
  with open(filename, "r") as f:
    return f.readlines()

In [None]:
# Let's read in the training lines
train_lines = get_lines(data_dir+"train.txt") # read the lines of the training file
train_lines[:20]

In [None]:
len(train_lines)

Let's think about how we want our data to look...

Let's try this:
```
[{'line_number': 0,
  'target': 'BACKGROUND',
  'text': "Emotional eating is associated with overeating and the development of obesity .\n",
  'total_lines': 11},
  ...]
```

Let's write a function which turns each of our datasets into the above format so we can continue to prepare our data for modelling.
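
Before writing the full function, here is a rough sketch of the idea on a tiny made-up abstract. It assumes the raw layout seen in the training lines (an ID line starting with `###`, tab-separated label and sentence lines, and a blank line closing the abstract). The sample text and the name `toy_abstract` are invented for illustration and are not taken from the dataset.

```python
# Toy abstract in the raw file layout (contents are made up for illustration)
toy_abstract = (
    "###00000001\n"
    "BACKGROUND\tSleep quality affects recovery .\n"
    "METHODS\tWe enrolled @ adults .\n"
    "RESULTS\tRecovery improved by @ % .\n"
    "\n"
)

samples = []
sentence_lines = []
for line in toy_abstract.splitlines(keepends=True):
    if line.startswith("###"):      # ID line: start a new abstract
        sentence_lines = []
    elif line.isspace():            # blank line: the abstract is complete
        for line_number, sentence_line in enumerate(sentence_lines):
            target, text = sentence_line.rstrip("\n").split("\t")
            samples.append({"line_number": line_number,
                            "target": target,
                            "text": text.lower(),
                            "total_lines": len(sentence_lines) - 1})  # zero-indexed
    else:                           # labelled sentence line
        sentence_lines.append(line)

for sample in samples:
    print(sample)
```

Each sentence ends up as one dictionary carrying its label, its lowercased text and its position within the abstract.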

In [None]:
def preprocess_text_with_line_numbers(filename):
  """
  Returns a list of dictionaries of abstract line data.

  Takes in filename, reads its contents and sorts through each line, extracting things
  like the target label, the text of the sentence, how many sentences are in the current
  abstract and what sentence number the target line is.
  """
  input_lines = get_lines(filename) # get all lines from filename
  abstract_lines = ""  # create an empty abstract
  abstract_samples = [] # create an empty list of abstracts

  # loop through each line in the target file
  for line in input_lines:
    if line.startswith("###"): #check to see if the line is an ID line
      abstract_id = line
      abstract_lines = "" #reset the abstract string if the line is an id line
    elif line.isspace(): # Check to see if line is a new line
      abstract_line_split = abstract_lines.splitlines() #split abstract into separate lines

      # Iterate through each line in a single abstract and count them at same time
      for abstract_line_number , abstract_line in enumerate(abstract_line_split):
        line_data = {} # create an empty dictionary for each line
        target_text_split = abstract_line.split("\t") # Split target labels from text
        line_data["target"] = target_text_split[0] # get the target label
        line_data["text"] = target_text_split[1].lower() #get the target text and lower it
        line_data["line_number"] = abstract_line_number # what number line does the line appear in the abstract
        line_data["total_lines"] = len(abstract_line_split) - 1 # how many total lines are there in the target abstract? (start from 0)
        abstract_samples.append(line_data) # add line data to abstract samples list

    else: #if the above conditions aren't fulfilled, the line contains a labelled sentence
      abstract_lines += line

  return abstract_samples

In [None]:
%%time
# Get data from file and preprocess it (%%time must be the first line of the cell to time the whole cell)
train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt")
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")
print(len(train_samples), len(val_samples), len(test_samples))

In [None]:
# Check the first abstract of our training data
train_samples[:10]

Now that our data is in the format of a list of dictionaries, how about we turn it into
a DataFrame to further visualize it.

In [None]:
import pandas as pd
train_df  = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
test_df = pd.DataFrame(test_samples)
train_df.head(10)

In [None]:
# Distribution of labels in training data
train_df.target.value_counts()

In [None]:
# Let's check the distribution of abstract lengths (in lines)
train_df.total_lines.plot.hist();

### Get lists of sentences

In [None]:
# Convert abstract text lines into lists
train_sentences = train_df["text"].tolist()
val_sentences = val_df["text"].tolist()
test_sentences = test_df["text"].tolist()
len(train_sentences), len(val_sentences), len(test_sentences)

In [None]:
# View the first 10 lines of training sentences
train_sentences[:10]

## Make numeric labels (ML models require numeric labels)
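
The two encodings used in this section can be sketched in plain Python. This is a minimal illustration on a handful of labels, assuming (as scikit-learn's encoders do) that classes are ordered alphabetically. The list `toy_targets` is made up.

```python
toy_targets = ["METHODS", "BACKGROUND", "RESULTS", "METHODS", "CONCLUSIONS"]

# Sorted unique labels define the class order
classes = sorted(set(toy_targets))
class_to_index = {name: index for index, name in enumerate(classes)}

# Label encoding: one integer per sample
label_encoded = [class_to_index[target] for target in toy_targets]

# One-hot encoding: one vector per sample with a single 1.0
one_hot = [[1.0 if i == class_to_index[target] else 0.0 for i in range(len(classes))]
           for target in toy_targets]

print(classes)
print(label_encoded)
print(one_hot[0])
```

In this notebook the one-hot labels feed the Keras models, while the integer labels are used by the baseline and when calculating results.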

In [None]:
# One hot encode labels
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse=False) # note: this argument was renamed to sparse_output in scikit-learn 1.2
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1,1))
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1,1))
test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1,1))

# Check what one hot encoded labels look like
train_labels_one_hot

### Label encode labels

In [None]:
# Extract labels ("target" columns) and encode them into integers
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_df["target"].to_numpy())
val_labels_encoded = label_encoder.transform(val_df["target"].to_numpy())
test_labels_encoded = label_encoder.transform(test_df["target"].to_numpy())

In [None]:
train_labels_encoded

In [None]:
# Get class names and number of classes from LabelEncoder instance
num_classes = len(label_encoder.classes_)
class_names = label_encoder.classes_
num_classes, class_names

## Starting a series of modelling experiments...

As usual, we're going to try out a bunch of different models and see which one works best.
As always, we're going to start with a baseline (a TF-IDF Multinomial Naive Bayes classifier).

## Model 0: Getting a baseline

To create our baseline, we'll use scikit-learn's Multinomial Naive Bayes classifier, with the TF-IDF (term frequency-inverse document frequency, i.e. tf-idf = tf * idf) formula to convert our words to numbers.
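
To make the formula concrete, here is a small hand calculation on a made-up three-sentence corpus. It uses the plain textbook definition, with tf = count / sentence length and idf = ln(N / document frequency). scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ from this sketch.

```python
import math

# Made-up corpus for illustration
corpus = ["the patients received the drug",
          "the control group received placebo",
          "pain scores improved"]
docs = [sentence.split() for sentence in corpus]

def tf(term, doc):
    """Term frequency: share of the words in doc that are term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of (number of docs / docs containing term)."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("the", docs[0], docs), 3))
print(round(tf_idf("drug", docs[0], docs), 3))
```

The rarer word "drug" scores higher than the frequent word "the", even though "the" appears twice in the sentence.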

> **Note:** It's common practice to use non-DL algorithms as a baseline because of their speed, and then use DL to see if you can improve upon them.

In [None]:
from sklearn.feature_extraction.text import  TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [None]:
# Create a tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels_encoded)

In [None]:
# Evaluate the model
model_0.score(val_sentences, val_labels_encoded)

In [None]:
# Make predictions using our baseline model
baseline_preds = model_0.predict(val_sentences)
baseline_preds

### Download helper functions script

In [None]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

In [None]:
from helper_functions import calculate_results

In [None]:
# Calculate baseline results
baseline_results = calculate_results(y_true=val_labels_encoded,
                                     y_pred = baseline_preds)

In [None]:
baseline_results

## Preparing our data (the text) for deep sequence models

Before we start building deeper models, we've got to create vectorization and embedding layers.  
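
Conceptually, a text vectorization layer maps each word to an integer index and pads or truncates every sentence to a fixed length. The sketch below mimics that behaviour in plain Python on made-up sentences. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary words, with the remaining words ranked by frequency, which is how Keras's `TextVectorization` lays out its vocabulary by default. The helper `vectorize` is hypothetical.

```python
from collections import Counter

toy_sentences = ["the drug reduced pain",
                 "the placebo did not reduce pain in the group"]

# Build the vocabulary: most frequent words get the lowest indices
counts = Counter(word for sentence in toy_sentences for word in sentence.split())
vocab = ["", "[UNK]"] + [word for word, _ in counts.most_common()]
word_to_index = {word: index for index, word in enumerate(vocab)}

def vectorize(sentence, output_sequence_length):
    ids = [word_to_index.get(word, 1) for word in sentence.split()]  # 1 = OOV
    ids = ids[:output_sequence_length]                               # truncate
    return ids + [0] * (output_sequence_length - len(ids))           # pad with 0

print(vectorize("the drug helped", output_sequence_length=6))
print(vectorize(toy_sentences[1], output_sequence_length=6))
```

Short sentences are padded with zeros and long ones are cut off, so every output has the same length.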


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

In [None]:
# How long is each sentence on average?
sent_lens = [len(sentence.split()) for sentence in train_sentences]
avg_sent_len = np.mean(sent_lens)
avg_sent_len

In [None]:
# What's the distribution look like?
import matplotlib.pyplot as plt
plt.hist(sent_lens, bins=20);

In [None]:
# How long of a sentence length covers 95% of examples?
output_seq_len = int(np.percentile(sent_lens, 95))
output_seq_len

In [None]:
# Maximum sequence length in the training set?
max(sent_lens)

### Text Vectorizer

In [None]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Create a text vectorizer with a capped vocabulary and a fixed output length
text_vectorizer = TextVectorization(max_tokens=68000, # how many words in the vocabulary (the OOV token is added automatically)
                                    output_sequence_length=55) # pad/truncate each sentence to 55 tokens

In [None]:
# Adapt text vectorizer to training sentences
text_vectorizer.adapt(train_sentences)

In [None]:
# Test out text vectorizer on random sentences
import random
target_sentence = random.choice(train_sentences)
print(f"Text:\n {target_sentence}")
print(f"\nLength of text: {len(target_sentence.split())}")
print(f"\nVectorized text: {text_vectorizer([target_sentence])}")

In [None]:
# How many words in our training vocabulary
rct_20k_text_vocab = text_vectorizer.get_vocabulary()
print(f"Number of words in vocab: {len(rct_20k_text_vocab)}")
print(f"Most common words in the vocab: {rct_20k_text_vocab[:5]}")
print(f"Least common words in the vocab: {rct_20k_text_vocab[-5:]}")

In [None]:
# Get the config of our text vectorizer
text_vectorizer.get_config()

### Create a custom text embedding
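
An embedding layer is essentially a lookup table with one trainable vector per vocabulary index. The following sketch, using random made-up weights and toy sizes, shows the lookup and the resulting shape. In the real layer the vectors are learned during training.

```python
import random

random.seed(42)
vocab_size, embedding_dim = 6, 4  # toy sizes, much smaller than in the notebook

# One vector of length embedding_dim per vocabulary index
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

token_ids = [2, 4, 1, 0, 0]                          # a vectorized, zero-padded sentence
embedded = [embedding_table[i] for i in token_ids]   # lookup: one vector per token

print(len(embedded), len(embedded[0]))  # sequence length, embedding dimension

# With mask_zero=True, index 0 is treated as padding; this is the mask it implies
mask = [i != 0 for i in token_ids]
print(mask)
```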

In [None]:
from tensorflow.keras  import layers

embedding = layers.Embedding(input_dim = len(rct_20k_text_vocab), # size of the vocabulary
                             output_dim = 128, # size of each embedding vector
                             mask_zero=True, # use masking to handle variable sequence lengths(save space)
                             input_length= 55, #how long is each input
                             name="token_embedding")
embedding

In [None]:
# Show example embedding
print(f"Sentence before vectorization: \n {target_sentence}\n")
vectorized_sentence = text_vectorizer([target_sentence])
print(f"Sentence after vectorization (before embedding): \n {vectorized_sentence}\n")
embedded_sentence = embedding(vectorized_sentence)
print(f"Sentence after embedding: \n {embedded_sentence}\n")
print(f"Embedded sentence shape: {embedded_sentence.shape}")

## Creating datasets (making sure our data loads as fast as possible)
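
Batching groups samples so the model processes 32 at a time, and prefetching lets the next batch be prepared while the current one is being used. The arithmetic below sketches how many batches a dataset yields and how many of them a fit call with `steps_per_epoch=int(0.1 * len(dataset))` uses per epoch. The sample count is a made-up round number, not the real dataset size.

```python
import math

num_samples = 10_000   # hypothetical dataset size
batch_size = 32

num_batches = math.ceil(num_samples / batch_size)  # the last batch may be smaller
steps_per_epoch = int(0.1 * num_batches)           # train on roughly 10% of batches

print(num_batches, steps_per_epoch)
```

Training on a fraction of the batches per epoch keeps experiments fast, at the cost of each epoch seeing only part of the data.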



In [None]:
# Turn our data into TensorFlow Datasets
train_dataset =  tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
valid_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))
test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot))

train_dataset

In [None]:
# Take the TensorSliceDatasets and turn them into prefetched datasets
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
train_dataset

## Model 1: Conv1D with token embeddings

In [None]:
from tensorflow.keras import layers

# Build the model
inputs = layers.Input(shape=(1, ), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Conv1D(filters=64, kernel_size=5,  activation="relu", padding ="same")(x)
x = layers.GlobalMaxPool1D()(x)
outputs = layers.Dense(5, activation="softmax")(x)

model_1 = tf.keras.Model(inputs, outputs, name="model_1_Conv1D")

# Compile the model
model_1.compile(loss=tf.keras.losses.categorical_crossentropy,
                metrics=["accuracy"],
                optimizer=tf.keras.optimizers.Adam())

In [None]:
model_1.summary()

In [None]:

history_model_1 = model_1.fit(train_dataset,
                              validation_data=(valid_dataset),
                              epochs=3,
                              steps_per_epoch=int(0.1 * len(train_dataset)),
                              validation_steps=int(len(valid_dataset) * 0.10))

In [None]:
# Evaluate on validation data
model_1.evaluate(valid_dataset)

In [None]:
# Make predictions (our model outputs prediction probabilities for each class)
model_1_pred_probs =  model_1.predict(valid_dataset)
model_1_preds = tf.argmax(model_1_pred_probs, axis=1)
model_1_preds

In [None]:
model_1_preds[:10]

In [None]:
# Calculate model_1 results
model_1_results= calculate_results(y_pred=model_1_preds,
                                   y_true=val_labels_encoded)

In [None]:
model_1_results, baseline_results


## Model 2: Feature extraction with pretrained token embeddings
Now let's use pretrained embeddings from TensorFlow Hub, more specifically the Universal Sentence Encoder. The code below loads the large variant: https://tfhub.dev/google/universal-sentence-encoder-large/5

In [None]:
import tensorflow_hub as hub
# Create a Keras layer using the USE pretrained layer from tensorflowhub
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-large/5",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable= False,
                                        name="USE-large")

In [None]:
# Building the model
model_2 = tf.keras.Sequential([
    sentence_encoder_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax")
])

model_2.compile(loss="categorical_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

model_2.summary()

In [None]:
history_model_2 = model_2.fit(train_dataset,
                              epochs=3,
                              validation_data=(valid_dataset),
                              steps_per_epoch=int(0.1 * len(train_dataset)),
                              validation_steps=int(0.1 * len(valid_dataset)))

In [None]:
model_2.evaluate(valid_dataset)

In [None]:
# Make predictions with feature extraction model
model_2_pred_probs = model_2.predict(valid_dataset)
model_2_preds = tf.argmax(model_2_pred_probs, axis=1)
model_2_preds[:10]

In [None]:
# Calculate results of the TF Hub pretrained embeddings model on the val set
model_2_results = calculate_results(y_true=val_labels_encoded,
                                    y_pred=model_2_preds)
model_2_results

## Model 3: Conv1D with character embeddings
The paper which we're replicating states they used a combination of token and character-level embeddings.

Previously we've created token-level embeddings, but we'll need to do similar steps for characters if we want to use char-level embeddings.

### Creating a character-level tokenizer

In [None]:
train_sentences[:5]

In [None]:
# Make function to split sentences into characters
def split_chars(text):
  return " ".join(list(text))

In [None]:
# Test splitting a non-character-level sequence into characters
split_chars(train_sentences[0])

In [None]:
# Split sequence level data splits into character level data splits
train_chars = [split_chars(sentence) for sentence in train_sentences]
valid_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence)for sentence in test_sentences]
train_chars[:5]

In [None]:
# What's the average character length?
chars_length = [len(sentence) for sentence in train_sentences]
mean_char_len = np.mean(chars_length)
mean_char_len

In [None]:
# Check the distribution of our sequences at a character level
import matplotlib.pyplot as plt
plt.hist(chars_length, bins=7)

In [None]:
# Find what character length covers 95% of sequences
output_seq_char_len = int(np.percentile(chars_length, 95))
output_seq_char_len

In [None]:
# Get all keyboard characters
import string
alphabet = string.ascii_lowercase + string.digits + string.punctuation
print(len(alphabet), alphabet)

In [None]:
# Create char-level token vectorizer instance
NUM_CHAR_TOKENS = len(alphabet) + 2 # add 2 for space and OOV token (OOV = out of vocab)
char_vectorizer = TextVectorization(max_tokens=NUM_CHAR_TOKENS,
                                    output_sequence_length = output_seq_char_len,
                                    # standardize=None,
                                    name="char_vectorizer")

In [None]:
# Adapt to training chars
char_vectorizer.adapt(train_chars)

In [None]:
# Check character vocab stats
char_vocab = char_vectorizer.get_vocabulary()
print(f"Number of different characters in character vocab: {len(char_vocab)}")
print(f"5 Most common characters: {char_vocab[:5]}")
print(f"5 least common characters: {char_vocab[-5:]}")

In [None]:
# Test out character vectorizer
random_train_chars = random.choice(train_chars)
print(f"Charified text:\n {random_train_chars}")
print(f"\nLength of random_train_chars: {len(random_train_chars.split())}")
vectorized_chars = char_vectorizer([random_train_chars])
print(f"\nVectorized chars:\n {vectorized_chars}")
print(f"\nLength of Vectorized chars: {len(vectorized_chars[0])}")

### Character level embedding

In [None]:
from tensorflow.keras import layers

char_embedding = layers.Embedding(input_dim = len(char_vocab),
                                  output_dim = 25, #this is size of the char embed in paper
                                  mask_zero=True,
                                  name = "character_embedding")

In [None]:
# Show sample char embedding
print(f"Sentence before vectorization: \n {random_train_chars}\n")
vectorized_chars = char_vectorizer([random_train_chars])
print(f"Sentence after vectorization(before embedding):\n {vectorized_chars}\n")
char_embedded = char_embedding(vectorized_chars)
print(f"Sentence after embedding:\n{char_embedded}\n")
print(f"Embedded char sentence shape:{char_embedded.shape}")

### Building a Conv1D model to fit on character embeddings

In [None]:
# Make Conv1D on char embedding
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
char_vectors = char_vectorizer(inputs)
char_embeddings = char_embedding(char_vectors)
x = layers.Conv1D(filters=64, kernel_size=5, padding="same", activation="relu")(char_embeddings)
x = layers.GlobalMaxPool1D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model_3 = tf.keras.Model(inputs=inputs, outputs=outputs)
model_3.summary()

In [None]:
model_3.compile(loss="categorical_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [None]:
#Create char level datasets
train_char_dataset = tf.data.Dataset.from_tensor_slices((train_chars, train_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
val_char_dataset = tf.data.Dataset.from_tensor_slices((valid_chars, val_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
test_char_dataset = tf.data.Dataset.from_tensor_slices((test_chars, test_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
history_model_3 = model_3.fit(train_char_dataset,
                              steps_per_epoch = int(0.1 * len(train_char_dataset)),
                              epochs=3,
                              validation_data=val_char_dataset,
                              validation_steps=int(0.1 * len(val_char_dataset)))

In [None]:
# Make predictions with the character-level model
model_3_pred_probs = model_3.predict(val_char_dataset)
model_3_preds = tf.argmax(model_3_pred_probs, axis=1)

In [None]:
# Calculate results for the Conv1D character-level model
model_3_results = calculate_results(y_true=val_labels_encoded,
                                    y_pred= model_3_preds)

In [None]:
model_3_results

## Model 4: Combining pretrained token embeddings + character embeddings (hybrid embedding layer)

1. Create a token-level embedding model (similar to `model_2`, which uses the pretrained encoder)
2. Create a character-level model (similar to `model_3`, with a slight modification)
3. Combine 1 & 2 with a concatenate layer (`layers.Concatenate`)
4. Build a series of output layers on top of 3 similar to Figure 1 and section 4.2 of https://arxiv.org/pdf/1612.05251.pdf
5. Construct a model which takes token and character-level sequences as input and produces sequence label probabilities as output
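
The shape arithmetic behind step 3 can be checked by hand. This sketch assumes the layer sizes used in the code cell that follows (a 128-unit dense layer on the token side and a bidirectional LSTM with 24 units per direction on the character side) and that the bidirectional wrapper concatenates its forward and backward outputs, which is the Keras default.

```python
token_features = 128             # Dense(128) on top of the sentence encoder
lstm_units = 24                  # units per direction in the character LSTM
char_features = 2 * lstm_units   # forward + backward outputs concatenated

hybrid_features = token_features + char_features
print(hybrid_features)

# Concatenating two per-sample feature vectors, using plain lists as stand-ins
token_vector = [0.0] * token_features
char_vector = [1.0] * char_features
hybrid_vector = token_vector + char_vector
print(len(hybrid_vector))
```

The concatenated width should match the shape reported for the concatenate layer in the model summary.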

In [None]:
# 1. Setup token inputs/model
token_inputs = layers.Input(shape=[], dtype=tf.string, name="token_input")
token_embeddings = sentence_encoder_layer(token_inputs)
token_outputs = layers.Dense(128, activation="relu")(token_embeddings)
token_model = tf.keras.Model(inputs=token_inputs,
                             outputs=token_outputs)

# 2. Setup char inputs/model
char_inputs = layers.Input(shape=(1,), dtype=tf.string, name="char_input")
char_vectors = char_vectorizer(char_inputs)
char_embeddings = char_embedding(char_vectors)
char_bi_lstm= layers.Bidirectional(layers.LSTM(24))(char_embeddings)
char_model = tf.keras.Model(inputs=char_inputs,
                            outputs=char_bi_lstm)

# 3. Concatenate token and char inputs(create hybrid token embedding)
token_char_concat = layers.Concatenate(name="token_char_hybrid")([token_model.output,
                                                                  char_model.output])

# 4. Create output layers - adding in Dropout, as discussed in the paper
combined_dropout = layers.Dropout(0.5)(token_char_concat)
combined_dense = layers.Dense(128, activation="relu")(combined_dropout)
final_dropout = layers.Dropout(0.5)(combined_dense)
output_layer = layers.Dense(num_classes, activation="softmax")(final_dropout)

# 5. Construct model with char and token inputs
model_4 = tf.keras.Model(inputs=[token_model.input, char_model.input],
                         outputs=output_layer)


In [None]:
# Get summary
model_4.summary()

In [None]:
# Plot hybrid token and character model
from tensorflow.keras.utils import plot_model
plot_model(model_4, show_shapes=True)

In [None]:
# Compile token char model
model_4.compile(loss="categorical_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

### Combining token and character data into a tf.data Dataset

In [None]:
# Combine chars and tokens into a dataset
train_char_token_data = tf.data.Dataset.from_tensor_slices((train_sentences,train_chars))
train_char_token_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_char_token_dataset = tf.data.Dataset.zip((train_char_token_data, train_char_token_labels))

# Prefetch and batch train data
train_char_token_dataset = train_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
# Combine chars and tokens into a dataset
val_char_token_data = tf.data.Dataset.from_tensor_slices((val_sentences,valid_chars))
val_char_token_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_dataset = tf.data.Dataset.zip((val_char_token_data, val_char_token_labels))

# Prefetch and batch val data
val_char_token_dataset = val_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
# Check out training char and token embedding dataset
train_char_token_dataset, val_char_token_dataset

### Fitting a model on token and character-level sequences

In [None]:
# Fit the model on tokens and chars
history_model_4 = model_4.fit(train_char_token_dataset,
                              steps_per_epoch=int(0.1 * len(train_char_token_dataset)),
                              epochs=3,
                              validation_data=val_char_token_dataset,
                              validation_steps=int(0.1 * len(val_char_token_dataset)))

In [None]:
# Evaluating the model
model_4.evaluate(val_char_token_dataset)

In [None]:
# Get the predictions
model_4_preds_probs = model_4.predict(val_char_token_dataset)

In [None]:
# Convert prediction probabilities to class labels
preds_model_4 = tf.argmax(model_4_preds_probs, axis=1)
preds_model_4

In [None]:
# Calculate the evaluation metrics
model_4_results = calculate_results(y_pred=preds_model_4,
                                    y_true=val_labels_encoded)
model_4_results

**Note**: Any engineered features used to train a model need to be available at test time. In our case, line numbers and total lines are available.

### Create positional embeddings
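
The cells in this section one-hot encode the positional features with a fixed `depth`. One consequence worth knowing: with `tf.one_hot`, any index greater than or equal to `depth` becomes an all-zero vector, so very late lines share the same representation. The plain-Python function below is a hypothetical stand-in that mimics this behaviour for illustration.

```python
def one_hot(index, depth):
    """Return a one-hot list; out-of-range indices give all zeros."""
    return [1.0 if position == index else 0.0 for position in range(depth)]

print(one_hot(2, depth=5))
print(one_hot(7, depth=5))  # index beyond depth: no position is set
```

Choosing a depth that covers most of the data keeps the vectors short while losing detail only for the rare long abstracts.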

In [None]:
# How many different line numbers are there?
train_df["line_number"].value_counts()

In [None]:
# Check the distribution of "line_number" column
train_df.line_number.plot.hist()

In [None]:
# Use TensorFlow to create one-hot-encoded tensors of our "line_number" column
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(),depth=15)
val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(),depth=15)
test_line_numbers_one_hot = tf.one_hot(test_df["line_number"].to_numpy(),depth=15)
train_line_numbers_one_hot[:10], train_line_numbers_one_hot.shape

In [None]:
train_df.head()

In [None]:
# How many total lines are there?
train_df["total_lines"].value_counts().sort_index()

In [None]:
# Check the distribution of "total_lines" column
train_df.total_lines.plot.hist()

In [None]:
# Check what "total_lines" value covers 98% of samples (to judge a depth of 20)
np.percentile(train_df.total_lines, 98)

In [None]:
# Use TensorFlow to create one-hot-encoded tensors of our "total_lines" column
train_total_lines_one_hot = tf.one_hot(train_df["total_lines"].to_numpy(),depth=20)
val_total_lines_one_hot = tf.one_hot(val_df["total_lines"].to_numpy(),depth=20)
test_total_lines_one_hot = tf.one_hot(test_df["total_lines"].to_numpy(),depth=20)
train_total_lines_one_hot[:10], train_total_lines_one_hot.shape

## Building a tribrid embedding model

1. Create a token-level model
2. Create a character-level model
3. Create a model for the "line_number" feature
4. Create a model for the "total_lines" feature
5. Combine the outputs of 1 & 2 using tf.keras.layers.Concatenate
6. Combine the outputs of 3, 4, 5 using tf.keras.layers.Concatenate
7. Create an output layer to accept the tribrid embedding and output label probabilities
8. Combine the inputs of 1, 2, 3, 4 and the outputs of 7 into a tf.keras.Model

In [None]:
# 1. Token inputs
token_inputs = layers.Input(shape=[], dtype="string", name="token_inputs")
token_embeddings = sentence_encoder_layer(token_inputs)
token_outputs = layers.Dense(128, activation="relu")(token_embeddings)
token_model = tf.keras.Model(inputs=token_inputs,
                             outputs=token_outputs)

# 2. Char inputs
char_inputs = layers.Input(shape=(1,), dtype="string", name="char_inputs")
char_vectors = char_vectorizer(char_inputs)
char_embeddings = char_embedding(char_vectors)
char_bi_lstm = layers.Bidirectional(layers.LSTM(24))(char_embeddings)
char_model = tf.keras.Model(inputs=char_inputs,
                            outputs=char_bi_lstm)

# 3. line numbers model
line_inputs = layers.Input(shape=(15,) , dtype=tf.float32, name="line_numbers_inputs")
x = layers.Dense(32, activation="relu")(line_inputs)
line_number_model = tf.keras.Model(line_inputs, x)

# 4. Total lines model
total_lines_inputs = layers.Input(shape=(20,), dtype=tf.float32, name="total_line_inputs")
y = layers.Dense(32, activation="relu")(total_lines_inputs)
total_line_model = tf.keras.Model(total_lines_inputs, y)

# 5. Concatenate token and char inputs(create hybrid token embedding)
combined_embeddings = layers.Concatenate(name="char_token_hybrid_embedding")([token_model.output,
                                                                              char_model.output])

z = layers.Dense(256, activation="relu")(combined_embeddings)
z = layers.Dropout(0.5)(z)

# 6. Combine the positional embedding with combined token and char embeddings
tribrid_embeddings = layers.Concatenate(name="char_token_positional_embedding")([line_number_model.output,
                                                                                 total_line_model.output,
                                                                                 z])

# 7. Create output layer
output_layer = layers.Dense(5, activation="softmax", name="output_layer")(tribrid_embeddings)

# 8. Put together model with all kinds of inputs
model_5 = tf.keras.Model(inputs=[line_number_model.input,
                                 total_line_model.input,
                                 token_model.input,
                                 char_model.input],
                         outputs=output_layer,
                         name="model_5_tribrid_embedding_model")

In [None]:
model_5.summary()

In [None]:
# plot model_5
from tensorflow.keras.utils import plot_model
plot_model(model_5, show_shapes=True )

What is label smoothing?

If our model gets too confident in a single class (e.g. its prediction probability is really high), it may get stuck on that class and not consider other classes.

Really confident: `[0.0, 0.0, 1.0, 0.0, 0.0]`

Label smoothing softens the one-hot training labels by moving some of the value from the true class to the other classes, which hopefully improves generalization: `[0.01, 0.01, 0.96, 0.01, 0.01]`
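
As a worked example, this sketch applies the smoothing formula used by Keras's `CategoricalCrossentropy`, `y_smooth = y * (1 - label_smoothing) + label_smoothing / num_classes`, to a one-hot label. The value 0.2 is the smoothing factor used when compiling the model in the next cell. The function `smooth_labels` is a hypothetical helper written for illustration.

```python
def smooth_labels(one_hot_label, label_smoothing):
    """Spread label_smoothing worth of probability mass evenly over all classes."""
    num_classes = len(one_hot_label)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes
            for y in one_hot_label]

smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0, 0.0], label_smoothing=0.2)
print([round(value, 2) for value in smoothed])
```

The smoothed label still sums to 1, but the true class no longer demands a probability of exactly 1.0 from the model.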

In [None]:
# Compile token, char, and positional embedding model
model_5.compile(loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2), #helps to prevent overfitting
                optimizer= "adam",
                metrics=["accuracy"])

### Create tribrid embedding datasets using tf.data

In [None]:
train_data = tf.data.Dataset.from_tensor_slices((train_line_numbers_one_hot, train_total_lines_one_hot, train_sentences, train_chars))
train_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_dataset = tf.data.Dataset.zip((train_data, train_labels))

val_data = tf.data.Dataset.from_tensor_slices((val_line_numbers_one_hot,val_total_lines_one_hot,val_sentences,valid_chars))
val_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_dataset = tf.data.Dataset.zip((val_data, val_labels))

test_data = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot, test_total_lines_one_hot, test_sentences, test_chars))
test_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_dataset = tf.data.Dataset.zip((test_data, test_labels))

# Prefetch and batch data
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
train_dataset, val_dataset

### Fitting, evaluating and making predictions with the tribrid model

In [None]:
history_model_5 =  model_5.fit(train_dataset,
                               epochs=3,
                               steps_per_epoch= int(0.1* len(train_dataset)),
                               validation_data=(val_dataset),
                               validation_steps=int(0.1 * len(val_dataset)))

In [None]:
# Make prediction with the char token pos model
model_5_pred_probs = model_5.predict(val_dataset, verbose=1)
# Convert pred probs to pred labels
model_5_preds = tf.argmax(model_5_pred_probs, axis=1)
model_5_preds

In [None]:
# Calculate the results of char token pos model
model_5_results = calculate_results(y_true=val_labels_encoded,
                                    y_pred=model_5_preds)
model_5_results

## Compare model results

In [None]:
# Combine model results into a dataframe
all_model_results = pd.DataFrame({"model_0_baseline": baseline_results,
                                 "model_1_custom_token_embedding": model_1_results,
                                 "model_2_pretrained_token_embedding": model_2_results,
                                 "model_3_custom_char_embedding": model_3_results,
                                 "model_4_hybrid_char_token_embedding": model_4_results,
                                 "model_5_pos_char_token_embedding": model_5_results})

all_model_results = all_model_results.transpose()
all_model_results

In [None]:
# Reduce the accuracy to same scale as other metrics
all_model_results["accuracy"] = all_model_results["accuracy"]/100

In [None]:
# Plot and compare all model results
all_model_results.plot(kind="bar", figsize=(10,7)).legend(bbox_to_anchor=(1.0,1.0));

In [None]:
# Sort model results by f1_score
all_model_results.sort_values("f1", ascending=True)["f1"].plot(kind="bar", figsize=(10,7))

## Save and load model

In [None]:
# Save the best performing model in SavedModel format (default)
model_5.save("skimlit_tribrid_model")

In [None]:
# Load in the best performing model
loaded_model = tf.keras.models.load_model("skimlit_tribrid_model")

In [None]:
# Make predictions with the loaded model on the validation set
loaded_model_preds_probs = loaded_model.predict(val_dataset)
loaded_preds = tf.argmax(loaded_model_preds_probs, axis=1)
loaded_preds[:10]

In [None]:
# calculate the loaded model results
loaded_model_results = calculate_results(y_true=val_labels_encoded,
                                        y_pred=loaded_preds)
loaded_model_results

In [None]:
model_5_results 

In [None]:
loaded_model.summary()