# Word Embeddings Options

Below are three techniques for finding word embeddings, all of which involve some kind of neural network. The word embeddings method shown in SMLTAR seems to have been inspired from this:

https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/

This method, though simple, can be quite memory-intensive. I'm still looking into how to get it to work in R, but for now, there are other options which have been more widely adopted.

Both the TensorFlow and Bert models essentially run a classifier neural network model and then generate word embeddings in the process. word2vec is the only algorithm I found that only generates the word embeddings. There are also pretrained GLoVe and word2vec word embeddings out there.

word2vec is probably the best to start out with. The data cleaning functions that I found in the tutorials were very slow though, so it may be worth either doing data preprocessing in R and then putting it directly into the `build_corpus()`function or tokenizing using an [NLTK function](https://www.nltk.org/api/nltk.tokenize.html)

In [None]:
import nltk # NLTK stands for natural language toolkit--nearly every text tool you might need can be found in NLTK
import pandas as pd # Pandas is similar to tidyr/dplyr, it has most data manipulation tools
import numpy as np # I'm not sure what the R equivalent would be for numpy, numpy contains a lot of mathematical and statistical functions
import re # Regex package
# from nltk.sentiment import SentimentAnalyzer # You can import specific functions like this, so you don't have to type out the whole thing
# from nltk.sentiment.util import * # * imports everything I believe
# You won't need the next 3 lines
from google.colab import drive # Making Google Drive the working directory
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Fun with Fragrances'
# Import dataset
call_data = pd.read_csv('call_data.csv')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Fun with Fragrances


In [None]:
# Below is some data prep for the TensorFlow/BERT models.

# sklearn has most of the stats/ML tools one might need.
from sklearn.model_selection import train_test_split
# Dropping a handful of NA calls.
call_data = call_data.dropna(subset=['text'])
# Creating a label based on positive or negative difference. 0 is considered
# positive--while this can go either way, I doubt many calls have a difference of 0.
call_data['label'] = np.where(call_data['difference']>= 0, 1, 0)
# Some functions allow you to define multiple objects with commas.
X_train, X_test, Y_train, Y_test = train_test_split(
    call_data[[ "title", 'text', 'label']], # Columns included in X.
    call_data['difference'], # Column in Y.
    train_size = .9, # Proportion of training observations.
    stratify = call_data['label'], # Variable to stratify on.
    random_state=42) # Random seed.




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


In [None]:
# Most corpora are typically a collection of text documents, rather than a giant
# csv. The tutorial I'm following uses a collection of text documents, and while
# I was able to coerce the csv into as tensor for later steps, I figured there
# were enough benefits to convert our corpus back into text documents, the
# most important being memory usage. 


for i in range(len(X_train)):
    text = X_train.iloc[i, 1] 
    label = X_train.iloc[i, 2]
    title = X_train.iloc[i, 0]
    if label == 1:
      text_file = open(f"EC_Corpus/train/pos_diff/{title}", "w") # You might have to make this folders if you do this
      n = text_file.write(text)
      text_file.close()
    else:
      text_file = open(f"EC_Corpus/train/neg_diff/{title}", "w")
      n = text_file.write(text)
      text_file.close()

In [None]:
for i in range(len(X_test)):
    text = X_test.iloc[i, 1] 
    label = X_test.iloc[i, 2]
    title = X_test.iloc[i, 0]
    if label == 1:
      text_file = open(f"EC_Corpus/test/pos_diff/{title}", "w")
      n = text_file.write(text)
      text_file.close()
    else:
      text_file = open(f"EC_Corpus/test/neg_diff/{title}", "w")
      n = text_file.write(text)
      text_file.close()

In [None]:
# Experimenting with separating the speakers from what they said--ignore.
# call_data.iloc[1,19]
# r1 = re.compile(r"(?<=  ).*?(?=: )", flags=re.MULTILINE)
# x = re.findall(r1,call_data.iloc[1,19])
# r2 = re.compile(r"(?<=:).*?(?=  )", flags=re.MULTILINE)
# y = re.findall(r2,call_data.iloc[1,19])


## TensorFlow Supervised Embeddings

One of several methods to get word embeddings, probably not too much different from Word2Vec. 

I followed these tutorials:

https://www.tensorflow.org/text/guide/tf_text_intro

https://www.tensorflow.org/text/guide/word_embeddings

In [None]:


import io
import os
import re
import shutil
import string
import tensorflow as tf
%matplotlib inline
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Fun with Fragrances'

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

batch_size = 1024
seed = 123
train_ds = tf.keras.utils.text_dataset_from_directory(
    'EC_Corpus/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    'EC_Corpus/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Fun with Fragrances
Found 113148 files belonging to 2 classes.
Using 90519 files for training.
Found 113148 files belonging to 2 classes.
Using 22629 files for validation.


In [None]:
# Let's see if it label works
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])

# It would appear so

1 b'Q2 2013 Beazer Homes USA Inc. Earnings Conference Call - Final. 9852 words. 2 May 2013. CQ FD Disclosure. FNDW. English. \xc2\xa92013 by CQ Transcriptions, LLC. All rights reserved. Presentation. OPERATOR: Good morning and welcome to the Beazer Homes Earnings Conference Call for the quarter ended March 31, 2013. Today\'s calling is being recorded and a replay will be available on the Company\'s website later today. In addition, PowerPoint slides intended to accompany this call are available on the Investor Relations section of the company\'s at www.beazer.com[http://www.beazer.com]. At this point I will now turn the call to have Carey Phelps, Director, Investor Relations. You may begin. CAREY PHELPS, DIRECTOR, IR, BEAZER HOMES: Thank you, Tanya. Good morning, and welcome to the Beazer Homes conference discussing our results for the second quarter of fiscal 2013. Before we begin you should be aware that during this call we will be making forward-looking statements. Such statements i

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [None]:
embedding_layer = tf.keras.layers.Embedding(1000, 5)
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')


# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set maximum_sequence length as all samples are not of the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

In [None]:
embedding_dim=16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1)
])


In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7efb8f120990>

In [None]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [None]:
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

In [None]:
# Only for Google Colab

# try:
#   from google.colab import files
#   files.download('vectors.tsv')
#   files.download('metadata.tsv')
# except Exception:
#   pass

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## word2vec Embeddings

word2vec is the go-to for most word embedding needs, and although it is somewhat dated compared to more state-of-the-art procedures, word2vec has provided good results for relatively low complexity. [SMLTAR](https://smltar.com/embeddings.html#glove) presents word2vec is a somewhat confusing manner, as it seems to conflate the pretrained vectors produced by Google with the actual model. This has been a source of confusion for me for quite some time, but I think I understand it now. The code below doesn't involve any pretrained vectors or models, though we can access those if needed.


References:

https://radimrehurek.com/gensim/models/word2vec.html

https://code.google.com/archive/p/word2vec/

https://www.kaggle.com/code/chewzy/tutorial-how-to-train-your-custom-word-embedding


https://www.kaggle.com/code/jeffd23/visualizing-word-vectors-with-t-sne/notebook


In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None 
import numpy as np
import re
import nltk

from gensim.models import word2vec
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Fun with Fragrances'
call_data = pd.read_csv('call_data.csv')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Fun with Fragrances


In [None]:
# I didn't filter out stop words for the embeddings above, but it shouldn't be a hard fix
STOP_WORDS = nltk.corpus.stopwords.words()

call_data = call_data[['id', 'text']]
# Small sample just to test
# call_data = call_data[0:1000]

# Data cleaning function I found
def clean_sentence(val):
  regex = re.compile('([^\s\w]|_)+')
  sentence = regex.sub('', val).lower()
  sentence = sentence.split(" ")
  for word in list(sentence):
      if word in STOP_WORDS:
          sentence.remove(word) 
  sentence = " ".join(sentence)
  return sentence

In [None]:
# It's very slow, so it might be better to do this in R and then reimport it. 
def clean_dataframe(data):
  data = data.dropna(subset=['text'])
  for col in ['text']:
    data[col] = data[col].apply(clean_sentence)
    return data


In [None]:
data = clean_dataframe(call_data)


In [None]:

def build_corpus(data):
  corpus = []
  word_list=[]
  for col in ['text']:
    for sentence in data[col].iteritems():
      word_list = sentence[1].split(" ")
      corpus.append(word_list)
  return corpus

corpus = build_corpus(data)

# See the first link above for a complete list of arguments in the Word2Vec function
model = word2vec.Word2Vec(corpus,
                          size=200,
                          window=4,
                          min_count=20,
                          workers=4)
# summarize vocabulary
words = list(model.wv.vocab)
print(words)
# access vector for one word
# save model
model.save('model.bin')




AttributeError: ignored

In [None]:
from gensim.models import Word2Vec
# load model
new_model = Word2Vec.load('model.bin')


In [None]:
# To get a word vector, just use the following function
model.wv.get_vector('question')

In [None]:
# This makes a fun little two-dimensional plot of words for some kind of cluster analysis

def tsne_plot(model):
    "Creates and TSNE model and plots it"
    labels = []
    tokens = []

    for word in model.wv.vocab:
        tokens.append(model[word])
        labels.append(word)
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()


## BERT Embeddings

BERT is among the first of the latest generation of deep learning text models. While it has been [surpassed in GLUE](https://mccormickml.com/2019/11/05/GLUE/) by similar models, BERT has still been able to produce robust results for a variety of NLP tasks and has a good amount of third-party documentation to work with. 

References:
https://www.tensorflow.org/text/tutorials/classify_text_with_bert

https://colab.research.google.com/drive/1hMLd5-r82FrnFnBub-B-fVW78Px4KPX1

https://huggingface.co/docs/transformers/training

https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#1-loading-pre-trained-bert



In [3]:
# Install
!pip install -q -U "tensorflow-text==2.8.*"

[K     |████████████████████████████████| 4.9 MB 5.2 MB/s 
[K     |████████████████████████████████| 497.9 MB 3.9 kB/s 
[K     |████████████████████████████████| 462 kB 74.4 MB/s 
[K     |████████████████████████████████| 5.8 MB 5.7 MB/s 
[K     |████████████████████████████████| 1.4 MB 81.1 MB/s 
[?25h

In [4]:
# Install
!pip install -q tf-models-official==2.7.0
# For Colab, I have to restart my runtime before importing. 

[K     |████████████████████████████████| 1.8 MB 4.7 MB/s 
[K     |████████████████████████████████| 43 kB 2.3 MB/s 
[K     |████████████████████████████████| 1.3 MB 71.9 MB/s 
[K     |████████████████████████████████| 1.1 MB 58.5 MB/s 
[K     |████████████████████████████████| 238 kB 67.8 MB/s 
[K     |████████████████████████████████| 118 kB 68.5 MB/s 
[K     |████████████████████████████████| 352 kB 49.4 MB/s 
[?25h  Building wheel for seqeval (setup.py) ... [?25l[?25hdone


In [1]:
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt
from google.colab import drive # Making Google Drive the working directory
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Fun with Fragrances'

tf.get_logger().setLevel('ERROR')

Mounted at /content/gdrive
/content/gdrive/My Drive/Fun with Fragrances


In [4]:

AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 123
# Importing our corpus once again
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'EC_Corpus/train/', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
class_names = raw_train_ds.class_names
raw_train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.utils.text_dataset_from_directory(
    'EC_Corpus/train/', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

test_ds = tf.keras.utils.text_dataset_from_directory(
    'EC_Corpus/test/',
    batch_size=batch_size)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 34872 files belonging to 2 classes.
Using 27898 files for training.
Found 34872 files belonging to 2 classes.
Using 6974 files for validation.
Found 12913 files belonging to 2 classes.


In [5]:
# Start with Small BERT, then go bigger
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8' 
# All the possible BERT models
map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

# BERT preprocessing 
map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

In [6]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
bert_model = hub.KerasLayer(tfhub_handle_encoder)


def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)

In [7]:
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()

epochs = 2
steps_per_epoch = tf.data.experimental.cardinality(raw_train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

In [8]:
classifier_model = build_classifier_model()

classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)


In [9]:

history = classifier_model.fit(x=raw_train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

Epoch 1/2
Epoch 2/2


In [11]:
# Saving the model
dataset_name = 'EarningsCalls'
saved_model_path = '/{}_bert'.format(dataset_name.replace('/', '_'))

classifier_model.save(saved_model_path, include_optimizer=False)



In [20]:
# Loading the model
reloaded_model = tf.saved_model.load(saved_model_path)


In [26]:
def print_my_examples(inputs, results):
  result_for_printing = \
    [f'input: {inputs[i]:<30} : score: {results[i][0]:.6f}'
                         for i in range(len(inputs))]
  print(*result_for_printing, sep='\n')
  print()


examples = [
    'Marketing',  # this is the same sentence tried earlier
    'Our marketing strategy has been unsuccessful.',
    'The products have been performing well.',
    'Consumers aren\'t purchasing our products.'
]
original_results = tf.sigmoid(classifier_model(tf.constant(examples)))


print('Results from the model in memory:')
print_my_examples(examples, original_results)

Results from the model in memory:
input: Marketing                      : score: 0.003387
input: Our marketing strategy has been unsuccessful. : score: 0.000762
input: The products have been performing well. : score: 0.001224
input: Consumers aren't purchasing our products. : score: 0.000988

