In [1]:
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU') 
for gpu_instance in physical_devices: 
    tf.config.experimental.set_memory_growth(gpu_instance, True)

In [2]:
!rm -rf pubmed-rct
!rm -rf skimlit_tribrid_model*
!rm checkpoint
!rm glove.6B*
!rm saved_weights*

rm: cannot remove 'glove.6B*': No such file or directory


# 🛠 09. Milestone Project 2: SkimLit 📄🔥 Exercises

1. Train `model_5` on all of the data in the training dataset for as many epochs as it takes to stop improving. Since this might take a while, you might want to use:
  * [`tf.keras.callbacks.ModelCheckpoint`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to save the model's best weights only.
  * [`tf.keras.callbacks.EarlyStopping`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) to stop the model from training once the validation loss has stopped improving for ~3 epochs.
2. Check out the [Keras guide on using pretrained GloVe embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/). Can you get this working with one of our models?
  * Hint: You'll want to incorporate it with a custom token [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.
  * It's up to you whether or not you fine-tune the GloVe embeddings or leave them frozen.
3. Try replacing the TensorFlow Hub Universal Sentence Encoder pretrained embedding with the [TensorFlow Hub BERT PubMed expert](https://tfhub.dev/google/experts/bert/pubmed/2) (a language model pretrained on PubMed texts) pretrained embedding. Does this affect results?
  * Note: Using the BERT PubMed expert pretrained embedding requires an extra preprocessing step for sequences (as detailed in the [TensorFlow Hub guide](https://tfhub.dev/google/experts/bert/pubmed/2)).
  * Does the BERT model beat the results mentioned in this paper? https://arxiv.org/pdf/1710.06071.pdf 
4. What happens if you merge our `line_number` and `total_lines` features for each sequence, for example by creating an `X_of_Y` feature instead? Does this affect model performance?
  * Another example: `line_number=1` and `total_lines=11` turns into `line_of_X=1_of_11`.
5. Write a function (or series of functions) to take a sample abstract string, preprocess it (in the same way our model has been trained), make a prediction on each sequence in the abstract and return the abstract in the format:
  * `PREDICTED_LABEL`: `SEQUENCE`
  * `PREDICTED_LABEL`: `SEQUENCE`
  * `PREDICTED_LABEL`: `SEQUENCE`
  * `PREDICTED_LABEL`: `SEQUENCE`
  * ...
    * You can find your own unstructured RCT abstract from PubMed or try this one from: [*Baclofen promotes alcohol abstinence in alcohol dependent cirrhotic patients with hepatitis C virus (HCV) infection*](https://pubmed.ncbi.nlm.nih.gov/22244707/).
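
Exercise 5 has no solution cell in this notebook, so the following is only a minimal sketch of the non-model part: splitting an abstract into sentences, building the positional and character features, and formatting the output. The regex sentence splitter and the `predict_fn` callable (with `dummy_predict` as a placeholder) are assumptions made for illustration; a real version would wrap `model.predict` on the one-hot encoded features and would ideally use a proper sentence segmenter. The training files also replace numbers with `@`, which this sketch does not replicate.

```python
import re


def split_chars(text):
    """Space-separate characters, matching the char-level input used in training."""
    return " ".join(list(text))


def abstract_to_samples(abstract):
    """Split a raw abstract into per-sentence feature dicts.

    Assumption: a naive regex splitter (break after ., ! or ? followed by
    whitespace). It will mis-split abbreviations such as "e.g. this".
    """
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    sentences = [part.strip() for part in parts if part.strip()]
    return [
        {
            "text": sentence,
            "chars": split_chars(sentence),
            "line_number": i,
            "total_lines": len(sentences) - 1,  # same convention as the dataset code
        }
        for i, sentence in enumerate(sentences)
    ]


def format_abstract(abstract, predict_fn):
    """Return one 'LABEL: sentence' line per sentence of the abstract."""
    samples = abstract_to_samples(abstract)
    labels = predict_fn(samples)  # hypothetical: one label string per sample
    return "\n".join(f"{label}: {sample['text']}" for label, sample in zip(labels, samples))


def dummy_predict(samples):
    """Placeholder for the trained model: labels by position only."""
    return ["BACKGROUND" if s["line_number"] == 0 else "RESULTS" for s in samples]


example = "Alcohol use is common. We ran a trial. Abstinence improved."
print(format_abstract(example, dummy_predict))
```

Keeping feature construction separate from prediction means the same `abstract_to_samples` output can be one-hot encoded and batched exactly like the training data before it reaches the model.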

## 1. Train `model_5` on all of the data in the training dataset for as many epochs as it takes to stop improving. Since this might take a while, you might want to use:
  * [`tf.keras.callbacks.ModelCheckpoint`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to save the model's best weights only.
  * [`tf.keras.callbacks.EarlyStopping`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) to stop the model from training once the validation loss has stopped improving for ~3 epochs.

### Load model

In [3]:
!rm -rf skimlit_tribrid_model
!rm skimlit_tribrid_model.zip
!wget https://storage.googleapis.com/ztm_tf_course/skimlit/skimlit_tribrid_model.zip
!unzip skimlit_tribrid_model.zip
!rm skimlit_tribrid_model.zip

rm: cannot remove 'skimlit_tribrid_model.zip': No such file or directory
--2024-11-07 19:17:34--  https://storage.googleapis.com/ztm_tf_course/skimlit/skimlit_tribrid_model.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.211.251, 216.58.209.187, 216.58.209.219, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.211.251|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 962182847 (918M) [application/zip]
Saving to: ‘skimlit_tribrid_model.zip’


2024-11-07 19:18:02 (32.3 MB/s) - ‘skimlit_tribrid_model.zip’ saved [962182847/962182847]

Archive:  skimlit_tribrid_model.zip
   creating: skimlit_tribrid_model/
  inflating: skimlit_tribrid_model/keras_metadata.pb  
   creating: skimlit_tribrid_model/assets/
 extracting: skimlit_tribrid_model/fingerprint.pb  
   creating: skimlit_tribrid_model/variables/
  inflating: skimlit_tribrid_model/variables/variables.index  
  inflating: skimlit_tribrid_model/variables/variables.data-0

In [4]:
import tensorflow as tf
model = tf.keras.models.load_model("skimlit_tribrid_model")

In [5]:
model.summary()

Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 char_inputs (InputLayer)    [(None, 1)]                  0         []                            
                                                                                                  
 token_inputs (InputLayer)   [(None,)]                    0         []                            
                                                                                                  
 char_vectorizer (TextVecto  (None, 290)                  0         ['char_inputs[0][0]']         
 rization)                                                                                        
                                                                                                  
 universal_sentence_encoder  (None, 512)                  2567978   ['token_inputs[0][0]']  

### Load dataset

In [6]:
!rm -rf pubmed-rct
!git clone --quiet https://github.com/Franck-Dernoncourt/pubmed-rct.git
!cd pubmed-rct/PubMed_200k_RCT_numbers_replaced_with_at_sign/ && unzip train.zip && cd -

Archive:  train.zip
  inflating: train.txt               
/home/jupyter/projects/Course_Tensorflow_for_Deep_Learning_Bootcamp/exercises


### Create training, validation and test datasets

#### Read data into lists

In [7]:
data_dir = "pubmed-rct/PubMed_200k_RCT_numbers_replaced_with_at_sign/"
# Create function to read the lines of a document
def get_lines(filename):
    with open(filename, "r") as f:
        return f.readlines()
train_lines, test_lines, val_lines = (get_lines(f) for f in (f"{data_dir}train.txt", f"{data_dir}test.txt", f"{data_dir}dev.txt"))
len(train_lines), len(test_lines), len(val_lines)

(2593169, 34493, 33932)

#### Preprocess lines

In [8]:
def preprocess_text_with_linenumbers(filename):
    lines = get_lines(filename)
    abstract_lines = ""
    abstract_samples = []

    for line in lines:
        if line.startswith("###"):
            abstract_lines = ""
        elif line.isspace():
            abstract_line_split = abstract_lines.splitlines()
            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                target, text = abstract_line.split("\t", maxsplit=1)
                abstract_samples.append({
                    "target": target,
                    "text": text,
                    "line_number": abstract_line_number,
                    "total_lines": len(abstract_line_split)-1
                })
        else:
            abstract_lines += line

    return abstract_samples

%time train_samples, test_samples, val_samples = (preprocess_text_with_linenumbers(file) for file in (f"{data_dir}train.txt", f"{data_dir}test.txt", f"{data_dir}dev.txt"))
len(train_samples), len(test_samples), len(val_samples)

CPU times: user 3.81 s, sys: 480 ms, total: 4.29 s
Wall time: 4.29 s


(2211861, 29493, 28932)
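
To make the output format concrete, here is a small sketch that applies the same parsing logic to a made-up abstract held in memory rather than in a file. The function name `parse_abstract_lines` and the toy lines are assumptions for illustration. Each abstract contributes one `###` line and one blank line that do not become samples, which is consistent with the gap between the raw line counts and the sample counts above (for example 34493 - 29493 = 5000 for the test file).

```python
def parse_abstract_lines(lines):
    """Same logic as preprocess_text_with_linenumbers, but on a list of lines."""
    abstract_lines = ""
    samples = []
    for line in lines:
        if line.startswith("###"):  # abstract ID line: start a new abstract
            abstract_lines = ""
        elif line.isspace():  # blank line: the abstract is complete
            split = abstract_lines.splitlines()
            for number, abstract_line in enumerate(split):
                target, text = abstract_line.split("\t", maxsplit=1)
                samples.append({
                    "target": target,
                    "text": text,
                    "line_number": number,
                    "total_lines": len(split) - 1,
                })
        else:  # labelled sentence: accumulate it
            abstract_lines += line
    return samples


# Made-up abstract in the same layout as the dataset files
toy_lines = [
    "###12345\n",
    "BACKGROUND\tFirst sentence .\n",
    "METHODS\tSecond sentence .\n",
    "RESULTS\tThird sentence .\n",
    "\n",
]
samples = parse_abstract_lines(toy_lines)
for sample in samples:
    print(sample)
```

Note that `total_lines` is the index of the last line (here 2 for a three-sentence abstract), not the sentence count.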

In [9]:
import pandas as pd

In [10]:
train_df, test_df, val_df = (pd.DataFrame(samples) for samples in (train_samples, test_samples, val_samples))
train_df.head()

| | target | text | line_number | total_lines |
|---|---|---|---|---|
| 0 | BACKGROUND | The emergence of HIV as a chronic condition me... | 0 | 10 |
| 1 | BACKGROUND | This paper describes the design and evaluation... | 1 | 10 |
| 2 | METHODS | This study is designed as a randomised control... | 2 | 10 |
| 3 | METHODS | The intervention group will participate in the... | 3 | 10 |
| 4 | METHODS | The program is based on self-efficacy theory a... | 4 | 10 |


#### Create training dataset

In [11]:
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1,1))
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(), depth=15)
train_total_lines_one_hot = tf.one_hot(train_df["total_lines"].to_numpy(), depth=20)

# Make function to split sentences into characters
def split_chars(text):
  return " ".join(list(text))

train_sentences = train_df["text"].to_numpy()
train_chars = [split_chars(sentence) for sentence in train_sentences]

train_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    train_line_numbers_one_hot,
    train_total_lines_one_hot,
    train_sentences,
    train_chars
))
train_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_char_token_pos_dataset = tf.data.Dataset.zip((
    train_char_token_pos_data,
    train_char_token_pos_labels
)).batch(1024).prefetch(tf.data.AUTOTUNE)
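
One detail of the positional encoding worth knowing: `tf.one_hot` returns an all-zero row for any index at or beyond `depth`, so every sentence at position 15 or later (and every abstract with `total_lines` of 20 or more) shares the same all-zero encoding. The sketch below reproduces that behaviour with a pure-Python stand-in (the `one_hot` helper is an illustration, not TensorFlow's implementation).

```python
def one_hot(indices, depth):
    """Pure-Python stand-in for tf.one_hot on integer indices.

    Indices outside [0, depth) produce an all-zero row.
    """
    return [[1.0 if i == index else 0.0 for i in range(depth)] for index in indices]


line_numbers = [0, 14, 15, 30]
encoded = one_hot(line_numbers, depth=15)
for number, row in zip(line_numbers, encoded):
    # A row sum of 0.0 means the position could not be represented
    print(number, sum(row))
```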

#### Create validation dataset

In [12]:
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1,1))
val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(), depth=15)
val_total_lines_one_hot = tf.one_hot(val_df["total_lines"].to_numpy(), depth=20)

# split_chars is reused from the training-dataset cell above

val_sentences = val_df["text"].to_numpy()
val_chars = [split_chars(sentence) for sentence in val_sentences]

val_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    val_line_numbers_one_hot,
    val_total_lines_one_hot,
    val_sentences,
    val_chars
))
val_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_pos_dataset = tf.data.Dataset.zip((
    val_char_token_pos_data,
    val_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)
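
Labels for the validation and test splits need the same column order as the training labels. Refitting an encoder on each split gives identical columns only when every split contains every class; encoding with the categories learned from the training labels avoids relying on that. The sketch below uses `TinyOneHotEncoder`, a hypothetical stand-in written for illustration (not scikit-learn's implementation), and a deliberately tiny validation split to show the failure mode.

```python
class TinyOneHotEncoder:
    """Minimal illustrative encoder: categories are the sorted unique labels seen in fit."""

    def fit(self, labels):
        self.categories_ = sorted(set(labels))
        return self

    def transform(self, labels):
        index = {category: i for i, category in enumerate(self.categories_)}
        width = len(index)
        return [[1.0 if index[label] == i else 0.0 for i in range(width)] for label in labels]

    def fit_transform(self, labels):
        return self.fit(labels).transform(labels)


train_labels = ["BACKGROUND", "METHODS", "RESULTS", "CONCLUSIONS", "OBJECTIVE"]
val_labels = ["METHODS", "RESULTS"]  # toy split that is missing three classes

encoder = TinyOneHotEncoder().fit(train_labels)
consistent = encoder.transform(val_labels)            # columns match training
refit = TinyOneHotEncoder().fit_transform(val_labels)  # columns depend on this split

print(len(consistent[0]), len(refit[0]))
```

With the refitted encoder, `METHODS` lands in column 0 of a two-column matrix instead of column 2 of a five-column one, which would silently mislabel every validation example.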

#### Create test dataset

In [13]:
test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1,1))
test_line_numbers_one_hot = tf.one_hot(test_df["line_number"].to_numpy(), depth=15)
test_total_lines_one_hot = tf.one_hot(test_df["total_lines"].to_numpy(), depth=20)

# split_chars is reused from the training-dataset cell above

test_sentences = test_df["text"].to_numpy()
test_chars = [split_chars(sentence) for sentence in test_sentences]

test_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    test_line_numbers_one_hot,
    test_total_lines_one_hot,
    test_sentences,
    test_chars
))
test_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_char_token_pos_dataset = tf.data.Dataset.zip((
    test_char_token_pos_data,
    test_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)

### Create Callbacks

In [14]:
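weights_dir = "saved_weights"
# Save only the best weights seen so far (judged on val_loss, the default monitor)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(weights_dir, save_best_only=True, save_weights_only=True)
# Stop when val_loss fails to improve; patience=1 is stricter than the ~3 epochs suggested in the exercise
early_stopping_callback = tf.keras.callbacks.EarlyStopping(patience=1, restore_best_weights=True)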
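
The effect of the `patience` setting can be illustrated without training anything. The sketch below is a simplified simulation of the early-stopping rule (stop once the monitored loss has not improved for `patience` consecutive epochs, assuming no minimum delta); the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


losses = [0.40, 0.35, 0.36, 0.33, 0.34, 0.345, 0.35]  # made-up values

print(early_stopping_epoch(losses, patience=1))
print(early_stopping_epoch(losses, patience=3))
```

With these values, `patience=1` stops at the first small bump and never reaches the lower loss at epoch 4, whereas `patience=3` trains through it. Restoring the best weights then rolls the model back to the best epoch seen.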

### Compile model

In [15]:
model.compile(
    loss="categorical_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

### Fit the model

In [16]:
model_history = model.fit(
    train_char_token_pos_dataset,
    epochs=100,
    validation_data=val_char_token_pos_dataset,
    callbacks=[model_checkpoint_callback, early_stopping_callback]
)
model_0_score = model.evaluate(val_char_token_pos_dataset)
model_0_score

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100


[0.27947017550468445, 0.8937508463859558]

In [17]:
model_score = model.evaluate(val_char_token_pos_dataset)



In [18]:
model_score

[0.27947017550468445, 0.8937508463859558]

## 2. Check out the [Keras guide on using pretrained GloVe embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/). Can you get this working with one of our models?
  * Hint: You'll want to incorporate it with a custom token [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.
  * It's up to you whether or not you fine-tune the GloVe embeddings or leave them frozen.

### Create TextVectorizer

In [19]:
train_sentences[:5]

array(['The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .',
       'This paper describes the design and evaluation of Positive Outlook , an online program aiming to enhance the self-management skills of gay men living with HIV .',
       'This study is designed as a randomised controlled trial in which men living with HIV in Australia will be assigned to either an intervention group or usual care control group .',
       "The intervention group will participate in the online group program ` Positive Outlook ' .",
       'The program is based on self-efficacy theory and uses a self-management approach to enhance skills , confidence and abilities to manage the psychosocial issues associated with HIV in daily life .'],
      dtype=object)

In [20]:
import numpy as np
output_seq_len = int(np.percentile([len(s.split(" ")) for s in train_sentences], 95))
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=output_seq_len)
text_ds = tf.data.Dataset.from_tensor_slices(train_sentences).batch(1024*4).prefetch(tf.data.AUTOTUNE)
vectorizer.adapt(text_ds)

In [21]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
voc[:5]

['', '[UNK]', 'the', 'of', 'and']
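
The output sequence length above is the 95th percentile of the per-sentence token counts, so the longest 5% of sentences are truncated and a few extreme outliers do not inflate the padding for everything else. A minimal sketch of that calculation, using a pure-Python linear-interpolation percentile (the default method of `np.percentile`) and made-up token counts:

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile of a list of numbers, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up token counts: twenty ordinary sentences and one very long outlier
token_counts = list(range(1, 21)) + [100]
output_seq_len = int(percentile(token_counts, 95))

print(output_seq_len, max(token_counts))
```

Here the cut-off lands at 20 tokens even though the longest sentence has 100.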

### Get GloVe 6B

In [22]:
!rm glove.6B.zip
!rm -rf glove.6B
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip
!rm glove.6B.zip

rm: cannot remove 'glove.6B.zip': No such file or directory
--2024-11-07 20:03:07--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2024-11-07 20:05:48 (5.15 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



### Create weights

In [23]:
import numpy as np

path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index keep their all-zeros row.
        # This includes the representations for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Found 400000 word vectors.
Converted 15399 words (4601 misses)
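
The hit and miss counts add up to the vectorizer's vocabulary size (15399 + 4601 = 20000), and the misses include the padding token `''` and `'[UNK]'`. The toy example below walks through the same matrix construction with a five-word vocabulary and two made-up 3-dimensional vectors; all values are illustrative only.

```python
voc = ["", "[UNK]", "the", "hiv", "randomised"]
word_index = dict(zip(voc, range(len(voc))))

# Made-up "pretrained" vectors for two of the vocabulary words
embeddings_index = {"the": [0.1, 0.2, 0.3], "hiv": [0.4, 0.5, 0.6]}

embedding_dim = 3
num_tokens = len(voc) + 2  # same +2 headroom as in the cell above
embedding_matrix = [[0.0] * embedding_dim for _ in range(num_tokens)]

hits, misses = 0, 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = list(vector)  # copy the pretrained vector into row i
        hits += 1
    else:
        misses += 1  # row i stays all zeros

print(hits, misses)
print(embedding_matrix[word_index["[UNK]"]])
```

Row `i` of the matrix corresponds to the integer `i` produced by the vectorizer, which is why the vocabulary order has to come from the same vectorizer that feeds the embedding layer.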


### Create layer

In [24]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    trainable=False,
)
embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])

### Create CNN model

In [25]:
from tensorflow.keras import layers
# Create model
inputs = layers.Input(shape=(1,), dtype="string")
x = vectorizer(inputs)
x = embedding_layer(x)
x = layers.Conv1D(64, 5, activation="relu", padding="valid")(x)
x = layers.GlobalMaxPool1D()(x)
outputs = layers.Dense(5, activation="softmax")(x)

model_1 = tf.keras.Model(inputs, outputs)

### Compile the model

In [26]:
model_1.compile(
    loss="categorical_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

### Create dataset for CNN model

In [27]:
# Training dataset
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot)).batch(2048).prefetch(tf.data.AUTOTUNE)
# Validation data
val_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
# Test data
test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)

### Fit the model

In [28]:
model_1_history = model_1.fit(
  train_dataset,
  epochs=100,
  validation_data=val_dataset,
  callbacks=[early_stopping_callback]
)
model_1_score = model_1.evaluate(val_dataset)
model_1_score

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100


[0.482700377702713, 0.819818913936615]

In [29]:
model_score, model_1_score

([0.27947017550468445, 0.8937508463859558],
 [0.482700377702713, 0.819818913936615])

## 3. Try replacing the TensorFlow Hub Universal Sentence Encoder pretrained embedding with the [TensorFlow Hub BERT PubMed expert](https://tfhub.dev/google/experts/bert/pubmed/2) (a language model pretrained on PubMed texts) pretrained embedding. Does this affect results?
  * Note: Using the BERT PubMed expert pretrained embedding requires an extra preprocessing step for sequences (as detailed in the [TensorFlow Hub guide](https://tfhub.dev/google/experts/bert/pubmed/2)).
  * Does the BERT model beat the results mentioned in this paper? https://arxiv.org/pdf/1710.06071.pdf 

### Get BERT

In [30]:
import tensorflow_hub as hub
import tensorflow_text as text

preprocess = hub.KerasLayer('https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3', trainable = False)
bert = hub.KerasLayer('https://www.kaggle.com/models/google/experts-bert/TensorFlow2/pubmed/2', trainable = False)

### Create datasets

In [31]:
import pandas as pd

data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"
# Create function to read the lines of a document
def get_lines(filename):
    with open(filename, "r") as f:
        return f.readlines()
train_lines, test_lines, val_lines = (get_lines(f) for f in (f"{data_dir}train.txt", f"{data_dir}test.txt", f"{data_dir}dev.txt"))

def preprocess_text_with_linenumbers(filename):
    lines = get_lines(filename)
    abstract_lines = ""
    abstract_samples = []

    for line in lines:
        if line.startswith("###"):
            abstract_lines = ""
        elif line.isspace():
            abstract_line_split = abstract_lines.splitlines()
            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                target, text = abstract_line.split("\t", maxsplit=1)
                abstract_samples.append({
                    "target": target,
                    "text": text,
                    "line_number": abstract_line_number,
                    "total_lines": len(abstract_line_split)-1
                })
        else:
            abstract_lines += line

    return abstract_samples

%time train_samples, test_samples, val_samples = (preprocess_text_with_linenumbers(file) for file in (f"{data_dir}train.txt", f"{data_dir}test.txt", f"{data_dir}dev.txt"))
len(train_samples), len(test_samples), len(val_samples)

train_df, test_df, val_df = (pd.DataFrame(samples) for samples in (train_samples, test_samples, val_samples))
train_df.head()

CPU times: user 546 ms, sys: 52.5 ms, total: 598 ms
Wall time: 598 ms


| | target | text | line_number | total_lines |
|---|---|---|---|---|
| 0 | OBJECTIVE | To investigate the efficacy of @ weeks of dail... | 0 | 11 |
| 1 | METHODS | A total of @ patients with primary knee OA wer... | 1 | 11 |
| 2 | METHODS | Outcome measures included pain reduction and i... | 2 | 11 |
| 3 | METHODS | Pain was assessed using the visual analog pain... | 3 | 11 |
| 4 | METHODS | Secondary outcome measures included the Wester... | 4 | 11 |


#### Training

In [32]:
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1,1))
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(), depth=15)
train_total_lines_one_hot = tf.one_hot(train_df["total_lines"].to_numpy(), depth=20)

# Make function to split sentences into characters
def split_chars(text):
  return " ".join(list(text))

train_sentences = train_df["text"].to_numpy()
train_chars = [split_chars(sentence) for sentence in train_sentences]

train_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    train_line_numbers_one_hot,
    train_total_lines_one_hot,
    train_sentences,
    train_chars
))
train_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_char_token_pos_dataset = tf.data.Dataset.zip((
    train_char_token_pos_data,
    train_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)

#### Validation

In [33]:
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1,1))
val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(), depth=15)
val_total_lines_one_hot = tf.one_hot(val_df["total_lines"].to_numpy(), depth=20)

val_sentences = val_df["text"].to_numpy()
val_chars = [split_chars(sentence) for sentence in val_sentences]

val_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    val_line_numbers_one_hot,
    val_total_lines_one_hot,
    val_sentences,
    val_chars
))
val_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_pos_dataset = tf.data.Dataset.zip((
    val_char_token_pos_data,
    val_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)

#### Test

In [34]:
test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1,1))
test_line_numbers_one_hot = tf.one_hot(test_df["line_number"].to_numpy(), depth=15)
test_total_lines_one_hot = tf.one_hot(test_df["total_lines"].to_numpy(), depth=20)

test_sentences = test_df["text"].to_numpy()
test_chars = [split_chars(sentence) for sentence in test_sentences]

test_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    test_line_numbers_one_hot,
    test_total_lines_one_hot,
    test_sentences,
    test_chars
))
test_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_char_token_pos_dataset = tf.data.Dataset.zip((
    test_char_token_pos_data,
    test_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)

### Create model

In [35]:
from tensorflow.keras import layers
# Building the tribrid model using the Functional API

# 1. Token model
input_token = layers.Input(shape=[], dtype=tf.string)
bert_inputs_token = preprocess(input_token)
bert_embedding_token = bert(bert_inputs_token)
output_token = layers.Dense(64, activation='relu')(bert_embedding_token['pooled_output'])
token_model = tf.keras.Model(input_token, output_token)

# 2. Character model (space-separated characters passed through the same BERT encoder)
input_char = layers.Input(shape=[], dtype=tf.string)
bert_inputs_char = preprocess(input_char)
bert_embedding_char = bert(bert_inputs_char)
output_char = layers.Dense(64, activation='relu')(bert_embedding_char['pooled_output'])
char_model = tf.keras.Model(input_char, output_char)

# 3. Line number model
line_num_inputs = layers.Input(shape=(15,), dtype=tf.float32, name="line_number_input")
x = layers.Dense(32, activation="relu")(line_num_inputs)
line_number_model = tf.keras.Model(line_num_inputs, x)

# 4. Total line model
total_line_inputs = layers.Input(shape=(20,), dtype=tf.float32, name="total_lines_input")
x = layers.Dense(32, activation="relu")(total_line_inputs)
total_lines_models = tf.keras.Model(total_line_inputs, x)

# Concatenating the token and char outputs (hybrid)
combined_embeddings = layers.Concatenate(name='token_char_hybrid_embedding')([token_model.output,
                                                                              char_model.output])

# Combining the positional features with the hybrid embedding (tribrid).
# Note: only total_lines_models.output is concatenated here; line_number_model.output
# is not, so the line_number input has no effect on this model's predictions.
z = layers.Concatenate(name='tribid_embeddings')([total_lines_models.output,
                                                  combined_embeddings])

# Adding a dense + dropout and creating our output layer 
dropout = layers.Dropout(0.5)(z)
x = layers.Dense(128 , activation='relu')(dropout)
output_layer = layers.Dense(5 , activation='softmax')(x)

# Packing into a model
model_2 = tf.keras.Model(inputs = [
        line_number_model.input,
        total_lines_models.input,
        token_model.input,
        char_model.input
], 
outputs = output_layer)

### Compile model

In [36]:
model_2.compile(
  loss="categorical_crossentropy",
  optimizer=tf.keras.optimizers.Adam(),
  metrics=["accuracy"]
)

### Fit the BERT model

In [37]:
early_stopping_callback = tf.keras.callbacks.EarlyStopping(patience=1, restore_best_weights=True)

bert_history = model_2.fit(
  train_char_token_pos_dataset,
  epochs=100,
  steps_per_epoch=int(len(train_char_token_pos_dataset) * 0.1),
  validation_data=val_char_token_pos_dataset,
  validation_steps=int(len(val_char_token_pos_dataset) * 0.02),
  callbacks=[early_stopping_callback]
)
model_2_score = model_2.evaluate(val_char_token_pos_dataset)
model_2_score

Epoch 1/100
Epoch 2/100


[0.4498974680900574, 0.8409572243690491]
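
The `steps_per_epoch` and `validation_steps` arguments above mean each epoch sees only a slice of the data. A quick worked calculation, assuming the 20k dataset sizes reported elsewhere in this notebook (180040 training and 30212 validation sentences) and the batch size of 32 used here:

```python
import math

train_samples, val_samples, batch_size = 180040, 30212, 32

# len() of a batched dataset is the number of batches, rounding up for the last partial batch
train_batches = math.ceil(train_samples / batch_size)
val_batches = math.ceil(val_samples / batch_size)

steps_per_epoch = int(train_batches * 0.1)   # 10% of the training batches
validation_steps = int(val_batches * 0.02)   # 2% of the validation batches

print(train_batches, steps_per_epoch)
print(val_batches, validation_steps, validation_steps * batch_size)
```

Under these assumptions the validation loss that drives early stopping is computed on only 18 batches (576 sentences) per epoch, so it is a fairly noisy signal; the final `model_2.evaluate` call uses the full validation set.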

## 4. What happens if you merge our `line_number` and `total_lines` features for each sequence, for example by creating an `X_of_Y` feature instead? Does this affect model performance?
  * Another example: `line_number=1` and `total_lines=11` turns into `line_of_X=1_of_11`.
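
As a small illustration of what merging the two features does, the sketch below builds the combined string for a few made-up rows and lists the resulting categories. The helper name `merge_position` is hypothetical. Every distinct (line number, total lines) pair becomes its own category, so the one-hot width equals the number of distinct pairs in the training data rather than the fixed 15 + 20 columns of the separate encodings.

```python
def merge_position(line_number, total_lines):
    """Combine the two positional features into a single categorical string."""
    return f"{line_number}_of_{total_lines}"


# Made-up (line_number, total_lines) rows
rows = [(0, 11), (1, 11), (0, 5), (1, 11)]
features = [merge_position(number, total) for number, total in rows]
categories = sorted(set(features))  # one one-hot column per distinct pair

print(features)
print(categories)
```

One consequence of this design is that a pair absent from the training data has no column of its own, so the encoder needs a policy for unknown categories (scikit-learn's `OneHotEncoder` exposes this through its `handle_unknown` parameter).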

### Create datasets

In [38]:
import pandas as pd

data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"
# Create function to read the lines of a document
def get_lines(filename):
    with open(filename, "r") as f:
        return f.readlines()
train_lines, test_lines, val_lines = (get_lines(f) for f in (f"{data_dir}train.txt", f"{data_dir}test.txt", f"{data_dir}dev.txt"))

def preprocess_text_with_linenumbers(filename):
    lines = get_lines(filename)
    abstract_lines = ""
    abstract_samples = []

    for line in lines:
        if line.startswith("###"):
            abstract_lines = ""
        elif line.isspace():
            abstract_line_split = abstract_lines.splitlines()
            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                target, text = abstract_line.split("\t", maxsplit=1)
                abstract_samples.append({
                    "target": target,
                    "text": text,
                    "line_number": abstract_line_number,
                    "total_lines": len(abstract_line_split)-1
                })
        else:
            abstract_lines += line

    return abstract_samples

%time train_samples, test_samples, val_samples = (preprocess_text_with_linenumbers(file) for file in (f"{data_dir}train.txt", f"{data_dir}test.txt", f"{data_dir}dev.txt"))
len(train_samples), len(test_samples), len(val_samples)

train_df, test_df, val_df = (pd.DataFrame(samples) for samples in (train_samples, test_samples, val_samples))
train_df.head()

CPU times: user 418 ms, sys: 6.85 ms, total: 425 ms
Wall time: 425 ms


| | target | text | line_number | total_lines |
|---|---|---|---|---|
| 0 | OBJECTIVE | To investigate the efficacy of @ weeks of dail... | 0 | 11 |
| 1 | METHODS | A total of @ patients with primary knee OA wer... | 1 | 11 |
| 2 | METHODS | Outcome measures included pain reduction and i... | 2 | 11 |
| 3 | METHODS | Pain was assessed using the visual analog pain... | 3 | 11 |
| 4 | METHODS | Secondary outcome measures included the Wester... | 4 | 11 |


In [39]:
# Combining the total lines and line number into a new feature! 
train_df['line_number_total'] = train_df['line_number'].astype(str) + '_of_' + train_df['total_lines'].astype(str)
val_df['line_number_total'] = val_df['line_number'].astype(str) + '_of_' + val_df['total_lines'].astype(str)

train_df.head(10)

| | target | text | line_number | total_lines | line_number_total |
|---|---|---|---|---|---|
| 0 | OBJECTIVE | To investigate the efficacy of @ weeks of dail... | 0 | 11 | 0_of_11 |
| 1 | METHODS | A total of @ patients with primary knee OA wer... | 1 | 11 | 1_of_11 |
| 2 | METHODS | Outcome measures included pain reduction and i... | 2 | 11 | 2_of_11 |
| 3 | METHODS | Pain was assessed using the visual analog pain... | 3 | 11 | 3_of_11 |
| 4 | METHODS | Secondary outcome measures included the Wester... | 4 | 11 | 4_of_11 |
| 5 | METHODS | Serum levels of interleukin @ ( IL-@ ) , IL-@ ... | 5 | 11 | 5_of_11 |
| 6 | RESULTS | There was a clinically relevant reduction in t... | 6 | 11 | 6_of_11 |
| 7 | RESULTS | The mean difference between treatment arms ( @... | 7 | 11 | 7_of_11 |
| 8 | RESULTS | Further , there was a clinically relevant redu... | 8 | 11 | 8_of_11 |
| 9 | RESULTS | These differences remained significant at @ we... | 9 | 11 | 9_of_11 |


In [40]:
# Perform one hot encoding on the train and transform the validation dataframe 
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Creating an instance 
one_hot_encoder = OneHotEncoder()

# Fitting on the training dataframe 
one_hot_encoder.fit(np.expand_dims(train_df['line_number_total'] , axis = 1))

# Transforming both train and val df 
train_line_number_total_encoded = one_hot_encoder.transform(np.expand_dims(train_df['line_number_total'] , axis =1))
val_line_number_total_encoded  = one_hot_encoder.transform(np.expand_dims(val_df['line_number_total'] , axis= 1))

# Checking the shapes 
train_line_number_total_encoded.shape , val_line_number_total_encoded.shape

((180040, 460), (30212, 460))

In [41]:
# Converting the sparse object to array 
train_line_number_total_encoded = train_line_number_total_encoded.toarray()
val_line_number_total_encoded = val_line_number_total_encoded.toarray()

# Converting the datatype to int 
train_line_number_total_encoded = tf.cast(train_line_number_total_encoded , dtype= tf.int32)
val_line_number_total_encoded = tf.cast(val_line_number_total_encoded , dtype= tf.int32)

#### Training

In [42]:
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1,1))
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(), depth=15)
train_total_lines_one_hot = tf.one_hot(train_df["total_lines"].to_numpy(), depth=20)

# Make function to split sentences into characters
def split_chars(text):
  return " ".join(list(text))

train_sentences = train_df["text"].to_numpy()
train_chars = [split_chars(sentence) for sentence in train_sentences]

train_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    train_line_number_total_encoded,
    train_sentences,
    train_chars
))
train_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_char_token_pos_dataset = tf.data.Dataset.zip((
    train_char_token_pos_data,
    train_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)

#### Validation

In [43]:
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1,1))
val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(), depth=15)
val_total_lines_one_hot = tf.one_hot(val_df["total_lines"].to_numpy(), depth=20)

val_sentences = val_df["text"].to_numpy()
val_chars = [split_chars(sentence) for sentence in val_sentences]

val_char_token_pos_data = tf.data.Dataset.from_tensor_slices((
    val_line_number_total_encoded,
    val_sentences,
    val_chars
))
val_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_pos_dataset = tf.data.Dataset.zip((
    val_char_token_pos_data,
    val_char_token_pos_labels
)).batch(32).prefetch(tf.data.AUTOTUNE)

### Create model

In [44]:
from tensorflow.keras import layers

preprocess = hub.KerasLayer('https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3', trainable = False)
bert = hub.KerasLayer('https://www.kaggle.com/models/google/experts-bert/TensorFlow2/pubmed/2', trainable = False)

# 1. Token model
input_token = layers.Input(shape=[], dtype=tf.string)
bert_inputs_token = preprocess(input_token)
bert_embedding_token = bert(bert_inputs_token)
output_token = layers.Dense(64, activation='relu')(bert_embedding_token['pooled_output'])
token_model = tf.keras.Model(input_token, output_token)

# 2. Character model (space-separated characters passed through the same BERT encoder)
input_char = layers.Input(shape=[], dtype=tf.string)
bert_inputs_char = preprocess(input_char)
bert_embedding_char = bert(bert_inputs_char)
output_char = layers.Dense(64, activation='relu')(bert_embedding_char['pooled_output'])
char_model = tf.keras.Model(input_char, output_char)

# 3. Line number model
line_num_inputs = layers.Input(shape=(460,), dtype=tf.float32, name="line_number_input")
x = layers.Dense(32, activation="relu")(line_num_inputs)
line_number_model = tf.keras.Model(line_num_inputs, x)

# Concatenating the token and char outputs (hybrid)
combined_embeddings = layers.Concatenate(name='token_char_hybrid_embedding')([token_model.output,
                                                                              char_model.output])

# Combining the merged line_number_total feature with the hybrid embedding (tribrid)
z = layers.Concatenate(name='tribid_embeddings')([line_number_model.output,
                                                  combined_embeddings])

# Adding a dense + dropout and creating our output layer 
dropout = layers.Dropout(0.5)(z)
x = layers.Dense(128 , activation='relu')(dropout)
output_layer = layers.Dense(5 , activation='softmax')(x)

# Packing into a model
model_3 = tf.keras.Model(inputs = [
        line_number_model.input,
        token_model.input,
        char_model.input
], 
outputs = output_layer)

### Compile the model

In [45]:
model_3.compile(
    loss="categorical_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

### Fit the model

In [46]:
model_3_history = model_3.fit(
    train_char_token_pos_dataset,
    epochs=100,
    steps_per_epoch=int(len(train_char_token_pos_dataset) * 0.1),
    validation_data=val_char_token_pos_dataset,
    validation_steps=int(len(val_char_token_pos_dataset) * 0.02),
    callbacks=[early_stopping_callback]
)
model_3_score = model_3.evaluate(val_char_token_pos_dataset)
model_3_score

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100


[0.2829873263835907, 0.8978882431983948]

In [47]:
!rm -rf pubmed-rct
!rm -rf skimlit_tribrid_model*
!rm checkpoint
!rm glove.6B*
!rm saved_weights*