### Import libraries


In [None]:
# scientific and numerical libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#general libraries
from pathlib import Path, PurePath
import requests
from requests.exceptions import HTTPError, ConnectionError
import re, os, sys
import logging

# NLP libraries
import nltk

# Import Tensorflow libraries 
import tensorflow as tf
from tensorflow.python.estimator.estimator import Estimator
from tensorflow.python.estimator.run_config import RunConfig
from tensorflow.python.estimator.model_fn import EstimatorSpec
from tensorflow.keras.utils import Progbar

from bert_serving.server.graph import optimize_graph
from bert_serving.server.helper import get_args_parser
from bert_serving.server.bert.tokenization import FullTokenizer
from bert_serving.server.bert.extract_features import convert_lst_to_features

In [None]:
# Add Covid19_Search_Tool/src to python path
nb_dir = os.path.split(os.getcwd())[0]
data_dir = os.path.join(nb_dir,'src')
if data_dir not in sys.path:
    sys.path.append(data_dir)

# Import local libraries
from utils import ResearchPapers
from nlp import SearchResults, WordTokenIndex, preprocess

### Load data from local folder
This requires visiting [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), downloading the data (you need a Kaggle account), then moving it to Covid19_Search_Tool/data and unzipping it there.

In [None]:
# Load metadata from the CORD-19 dataset
data_path = os.path.join(os.getcwd(), "../data","CORD-19-research-challenge")
metadata_path = os.path.join(data_path, 'metadata.csv')
metadata = pd.read_csv(metadata_path,
                               dtype={'Microsoft Academic Paper ID': str,
                                      'pubmed_id': str})

# Set the abstract to the paper title if it is null
metadata.abstract = metadata.abstract.fillna(metadata.title)
print("Number of articles BEFORE removing duplicates: %s " % len(metadata))

# Some papers are duplicated since they were collected from separate sources. Thanks Joerg Rings
duplicate_paper = ~(metadata.title.isnull() | metadata.abstract.isnull() | metadata.publish_time.isnull()) & (metadata.duplicated(subset=['title', 'abstract']))
# NOTE: dropna returns a new DataFrame; the result is not assigned, so this line has no effect
metadata.dropna(subset=['publish_time', 'journal'])
metadata = metadata[~duplicate_paper].reset_index(drop=True)
print("Number of articles AFTER removing duplicates: %s " % len(metadata))

# Create data classes for the dataset and paper
papers = ResearchPapers(metadata)
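
The duplicate filter in the cell above is easier to follow on a toy table. The sketch below uses a made-up five-row DataFrame (the values are purely illustrative) and the same mask logic: a row is dropped only when it repeats an earlier title/abstract pair and none of its key fields are missing.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are true duplicates, rows 3 and 4 have missing fields
toy = pd.DataFrame({
    "title": ["A", "A", "B", None, None],
    "abstract": ["x", "x", "y", None, None],
    "publish_time": ["2020", "2020", "2020", "2020", "2020"],
})

# A row is "complete" when title, abstract and publish_time are all present
complete = ~(toy.title.isnull() | toy.abstract.isnull() | toy.publish_time.isnull())

# Only complete rows that repeat an earlier (title, abstract) pair count as duplicates
duplicate = complete & toy.duplicated(subset=["title", "abstract"])

deduped = toy[~duplicate].reset_index(drop=True)
print(duplicate.tolist())
print(len(toy), "->", len(deduped))
```

Rows with missing fields are kept even if they look identical, because the mask refuses to call an incomplete row a duplicate.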

# Create a BERT sentence encoding search engine
From: https://towardsdatascience.com/building-a-search-engine-with-bert-and-tensorflow-c6fdc0186c8a

By: Denis Antyukhov

In this experiment, we will use a pre-trained BERT model checkpoint to build a general-purpose text feature extractor.

Such feature extractors are sometimes referred to as Natural Language Understanding (NLU) modules, because the features they extract are relevant for a wide array of downstream NLP tasks.

One use for these features is in instance-based learning, which relies on computing the similarity of the query to the training samples.

We will illustrate this by building a simple Information Retrieval system using the BERT NLU module for feature extraction.
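
As a rough illustration of similarity-based retrieval, the sketch below ranks a few stored vectors by cosine similarity to a query vector. The function name `cosine_top_k` and the three-dimensional toy vectors are assumptions made for this example; they stand in for real BERT embeddings and are not the implementation used later in the notebook.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices and scores of the k stored vectors most similar to the query."""
    # L2-normalise both sides so that a dot product equals cosine similarity
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    # Sort by descending similarity and keep the first k
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy "embeddings" with 3 dimensions (illustrative values only)
doc_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 1.0, 0.0]])
query_vec = np.array([1.0, 0.0, 0.0])

top_idx, top_scores = cosine_top_k(query_vec, doc_vecs)
print(top_idx, top_scores)
```

The "training samples" here are simply the stored vectors: nothing is fitted, and answering a query amounts to one matrix-vector product plus a sort.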

**The plan for this experiment is:**
1. getting the pre-trained BERT model checkpoint
2. extracting a sub-graph optimized for inference
3. creating a feature extractor with tf.Estimator
4. exploring vector space with t-SNE and Embedding Projector
5. implementing an Information Retrieval engine
6. accelerating search queries with math
7. building a COVID-19 research article recommendation system

### Step 1: getting the pre-trained model
We start with a pre-trained English BERT-base model checkpoint.

To configure and optimize the graph for inference we will use the bert-as-service repository, which serves BERT models to remote clients over TCP.

Having a remote BERT-server is beneficial in multi-host environments. However, in this part of the experiment we will focus on creating a local (in-process) feature extractor. This is useful if one wishes to avoid additional latency and potential failure modes introduced by a client-server architecture. Now, let us download the model and install the package.

_This task was completed via install commands in the README!_

### Step 2: optimizing the inference graph
Normally, to modify the model graph we would have to do some low-level TensorFlow programming. 

However, thanks to bert-as-service, we can configure the inference graph using a simple CLI.

There are a couple of parameters in the snippet below to look out for.

For each text sample, the BERT-base encoding layers output a tensor of shape **[sequence_len, encoder_dim],** with one vector per input token. To obtain a fixed-size representation, we need to apply some sort of pooling.

The **POOL_STRAT** parameter defines the pooling strategy applied to the **POOL_LAYER** encoding layer. The default value **REDUCE_MEAN** averages the vectors for all tokens in a sequence. This strategy works best for most sentence-level tasks when the model is not fine-tuned. Another option is NONE, in which case no pooling is applied at all. This is useful for word-level tasks such as Named Entity Recognition or POS tagging. For a detailed discussion of other options check out Han Xiao's [blog post.](https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/)
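
To make the pooling options concrete, here is a minimal NumPy sketch of what the three strategies do to a single **[sequence_len, encoder_dim]** array. The `pool` function and the toy values are hypothetical, and the way padding is excluded via a mask is an assumption of this sketch rather than a statement about the library's internals.

```python
import numpy as np

def pool(token_vecs, mask, strategy="REDUCE_MEAN"):
    """Pool a [sequence_len, encoder_dim] array according to the chosen strategy."""
    real = token_vecs[mask.astype(bool)]  # keep only non-padding token vectors
    if strategy == "REDUCE_MEAN":
        return real.mean(axis=0)          # one averaged vector per sequence
    if strategy == "REDUCE_MAX":
        return real.max(axis=0)           # element-wise maximum over tokens
    if strategy == "NONE":
        return token_vecs                 # no pooling: one vector per token
    raise ValueError("unknown pooling strategy: %s" % strategy)

# Toy sequence: two real tokens and one padding position, encoder_dim = 2
token_vecs = np.array([[1.0, 2.0],
                       [3.0, 4.0],
                       [0.0, 0.0]])
mask = np.array([1, 1, 0])

mean_vec = pool(token_vecs, mask, "REDUCE_MEAN")
max_vec = pool(token_vecs, mask, "REDUCE_MAX")
unpooled = pool(token_vecs, mask, "NONE")
print(mean_vec, max_vec, unpooled.shape)
```

With pooling the output size no longer depends on the sequence length, which is what makes the vectors directly comparable across texts.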

**SEQ_LEN** affects the maximum length of sequences processed by the model. Smaller values increase the model inference speed almost linearly.

In [None]:
# base model dir
nb_dir = os.path.split(os.getcwd())[0]
base_dir = os.path.join(nb_dir,'models')

# input dir
MODEL_DIR = os.path.join(base_dir,'uncased_L-12_H-768_A-12') #@param {type:"string"}
VOCAB_PATH = os.path.join(MODEL_DIR, "vocab.txt") #@param {type:"string"}
# output dir
GRAPH_DIR = os.path.join(base_dir,'graph') #@param {type:"string"}
# output filename (GRAPH_OUT is used by the print below and by the export cell)
GRAPH_OUT = 'extractor.pbtxt' #@param {type:"string"}
GRAPH_PATH = os.path.join(GRAPH_DIR, GRAPH_OUT)

POOL_STRAT = 'REDUCE_MEAN' #@param ['REDUCE_MEAN', 'REDUCE_MAX', "NONE"]
POOL_LAYER = '-2' #@param {type:"string"}
SEQ_LEN = '256' #@param {type:"string"}

print ("MODEL_DIR:  %s\nVOCAB_PATH: %s\nGRAPH_PATH: %s\nGRAPH_OUT:  %s\n"
       %(MODEL_DIR, VOCAB_PATH, GRAPH_PATH, GRAPH_OUT))

In [None]:
# Create an interactive tf session
sesh = tf.InteractiveSession()

tf.gfile.MkDir(GRAPH_DIR)

parser = get_args_parser()
carg = parser.parse_args(args=['-model_dir', MODEL_DIR,
                               '-graph_tmp_dir', GRAPH_DIR,
                               '-max_seq_len', str(SEQ_LEN),
                               '-pooling_layer', str(POOL_LAYER),
                               '-pooling_strategy', POOL_STRAT])

tmp_name, config = optimize_graph(carg)
graph_fout = os.path.join(GRAPH_DIR, GRAPH_OUT)

tf.gfile.Rename(
    tmp_name,
    graph_fout,
    overwrite=True
)
print("\nSerialized graph to {}".format(graph_fout))

Running the above snippet will put the BERT model graph and weights from **MODEL_DIR** into a GraphDef object, which will be serialized to the pbtxt file named by **GRAPH_OUT** inside **GRAPH_DIR**. The file will be smaller than the original model because the nodes and variables required for training will be removed. This results in a fairly portable solution: for example, the English base model only takes 389 MB after exporting.

### Step 3: creating a feature extractor
Now, we will use the serialized graph to build a feature extractor using the tf.Estimator API. We will need to define two things: **input_fn** and **model_fn**

In [None]:
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)
log.handlers = []

**input_fn** manages getting the data into the model. That includes executing the whole text preprocessing pipeline and preparing a feed_dict for BERT. 

First, each text sample is converted into a tf.Example instance containing the necessary features listed in **INPUT_NAMES**. The bert_tokenizer object contains the WordPiece vocabulary and performs the text preprocessing. After that, the examples are re-grouped by feature name in a **feed_dict**.
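
The three features can be illustrated without BERT at all. The sketch below is a deliberately simplified stand-in: it splits on whitespace instead of using WordPiece, and the `toy_features` function and the tiny vocabulary are invented for this example. It only aims to show what `input_ids`, `input_mask` and `input_type_ids` look like for a single padded sequence.

```python
def toy_features(text, vocab, seq_len):
    """Build padded input_ids / input_mask / input_type_ids for one text."""
    # Reserve two positions for the [CLS] and [SEP] markers
    tokens = ["[CLS]"] + text.lower().split()[: seq_len - 2] + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)        # 1 marks a real token
    pad = seq_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad      # pad up to the fixed length
    input_mask += [0] * pad                  # 0 marks padding
    input_type_ids = [0] * seq_len           # single-sentence input: all zeros
    return {"input_ids": input_ids,
            "input_mask": input_mask,
            "input_type_ids": input_type_ids}

# Hypothetical six-entry vocabulary; the real vocab.txt is far larger
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "covid": 4, "vaccine": 5}
features = toy_features("COVID vaccine trial", vocab, seq_len=8)
print(features)
```

Every feature has the same fixed length, so a batch of texts stacks into a rectangular integer array per feature name.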

In [None]:
INPUT_NAMES = ['input_ids', 'input_mask', 'input_type_ids']
bert_tokenizer = FullTokenizer(VOCAB_PATH)

def build_feed_dict(texts):
    
    # SEQ_LEN is stored as a string above, so cast it to int here
    text_features = list(convert_lst_to_features(
        texts, int(SEQ_LEN), int(SEQ_LEN), 
        bert_tokenizer, log, False, False))

    target_shape = (len(texts), -1)

    feed_dict = {}
    for iname in INPUT_NAMES:
        features_i = np.array([getattr(f, iname) for f in text_features])
        features_i = features_i.reshape(target_shape).astype("int32")
        feed_dict[iname] = features_i

    return feed_dict

tf.Estimators have a fun feature which makes them re-build and re-initialize the whole computational graph at each call to the predict function. 

To avoid this overhead, we will pass the predict function a generator that yields features to the model in a never-ending loop.
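
The trick can be shown in plain Python, independent of TensorFlow. In this sketch `Box`, `endless` and `fake_predict` are hypothetical stand-ins: `fake_predict` plays the role of a long-lived predict loop whose expensive setup should run only once, and the box is swapped between calls to feed it new data.

```python
class Box:
    """Mutable holder whose contents are replaced before each request."""
    def __init__(self):
        self.value = None

def endless(box):
    # Never terminates: each next() reads whatever the box holds right now
    while True:
        yield box.value

setup_calls = []

def fake_predict(source):
    """Stand-in for a predict loop: setup happens once, then items stream through."""
    setup_calls.append("graph built")  # the costly step we want to avoid repeating
    for item in source:
        yield [s.upper() for s in item]

box = Box()
predictions = fake_predict(endless(box))

box.value = ["first batch"]
out1 = next(predictions)

box.value = ["second", "batch"]
out2 = next(predictions)

print(out1, out2, len(setup_calls))
```

Because the source generator never ends, the consumer never finishes and never has to be restarted; new work is submitted by changing the box and asking for the next result.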

In [None]:
def build_input_fn(container):
    
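    def gen():
        while True:
            try:
                yield build_feed_dict(container.get())
            except Exception:
                # Retry once with the current container contents; a bare except
                # would also swallow GeneratorExit and KeyboardInterrupt
                yield build_feed_dict(container.get())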

    def input_fn():
        return tf.data.Dataset.from_generator(
            gen,
            output_types={iname: tf.int32 for iname in INPUT_NAMES},
            output_shapes={iname: (None, None) for iname in INPUT_NAMES})
    return input_fn

class DataContainer:
  def __init__(self):
    self._texts = None
  
  def set(self, texts):
    if isinstance(texts, str):
      texts = [texts]
    self._texts = texts
    
  def get(self):
    return self._texts

**model_fn** contains the specification of the model. In our case, it is loaded from the pbtxt file we saved in the previous step. 

The features are mapped explicitly to the corresponding input nodes with input_map.

In [None]:
def model_fn(features, mode):
    with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        
    output = tf.import_graph_def(graph_def,
                                 input_map={k + ':0': features[k] for k in INPUT_NAMES},
                                 return_elements=['final_encodes:0'])

    return EstimatorSpec(mode=mode, predictions={'output': output[0]})
  
estimator = Estimator(model_fn=model_fn)

Now we have everything we need to perform inference:

In [None]:
def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

def build_vectorizer(_estimator, _input_fn_builder, batch_size=128):
  container = DataContainer()
  predict_fn = _estimator.predict(_input_fn_builder(container), yield_single_examples=False)
  
  def vectorize(text, verbose=False):
    x = []
    bar = Progbar(len(text))
    for text_batch in batch(text, batch_size):
      container.set(text_batch)
      x.append(next(predict_fn)['output'])
      if verbose:
        bar.add(len(text_batch))
      
    r = np.vstack(x)
    return r
  
  return vectorize

In [None]:
bert_vectorizer = build_vectorizer(estimator, build_input_fn)

In [None]:
bert_vectorizer(64*['sample']).shape

### Step 4: exploring vector space with Projector

A standalone version of the BERT feature extractor is available in the [repository](https://github.com/gaphex/bert_experimental).

Using the vectorizer we will generate embeddings for articles from the CORD-19 dataset. (The original tutorial used the Reuters-21578 benchmark corpus, which is run first below as a reference example.)

To visualise and explore the embedding vector space in 3D we will use a dimensionality reduction technique called [t-SNE](https://distill.pub/2016/misread-tsne/).

Let's get the article embeddings first.

In [None]:
import nltk
from nltk.corpus import reuters

nltk.download("reuters")

In [None]:
type(reuters)

In [None]:
# REUTERS EXAMPLE
max_samples = 256
categories = ['wheat', 'tea', 'strategic-metal', 
              'housing', 'money-supply', 'fuel']

S, X, Y = [], [], []

for category in categories:
  print(category)
  
  sents = reuters.sents(categories=category)
  sents = [' '.join(sent) for sent in sents][:max_samples]
  X.append(bert_vectorizer(sents, verbose=True))
  Y += [category] * len(sents)
  S += sents
  
X = np.vstack(X) 
X.shape

In [None]:
with open("embeddings.tsv", "w") as fo:
  for x in X.astype('float16'):
    line = "\t".join([str(v) for v in x])
    fo.write(line + "\n")

with open("metadata.tsv", "w") as fo:
  fo.write("Label\tSentence\n")
  for y, s in zip(Y, S):
    fo.write("{}\t{}\n".format(y, s))

The interactive visualization of the generated embeddings is available on the [Embedding Projector](https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/gaphex/7262af1e151957b1e7c638f4922dfe57/raw/3b946229fc58cbefbca2a642502cf51d4f8e81c5/reuters_proj_config.json). **<--CLICK THAT TO GENERATE**

From the link you can run t-SNE yourself, or load a checkpoint using the bookmark in the lower-right corner (loading works only on Chrome).

To reproduce the input files used for this visualization, run the code above. Then download the files to your machine and upload them to the Projector.

(You can download files from the menu opened by the ">" button in the upper-left.)
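
Before uploading, it can help to check that the two files line up. The sketch below writes a tiny pair of files to a temporary folder and counts rows; the helper name `write_projector_files` and the sample data are made up for illustration. It assumes, as the cells above do, a metadata file with a header row and one row per vector, and it additionally collapses tabs and newlines inside sentences so that each record stays on one line.

```python
import os
import tempfile
import numpy as np

def write_projector_files(vectors, labels, sentences, out_dir):
    """Write embeddings.tsv and metadata.tsv; return both paths."""
    emb_path = os.path.join(out_dir, "embeddings.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(emb_path, "w") as fo:
        for vec in vectors:
            fo.write("\t".join(str(v) for v in vec) + "\n")
    with open(meta_path, "w") as fo:
        fo.write("Label\tSentence\n")
        for label, sent in zip(labels, sentences):
            # Collapse tabs/newlines so each record occupies exactly one row
            clean = " ".join(str(sent).split())
            fo.write("{}\t{}\n".format(label, clean))
    return emb_path, meta_path

def count_lines(path):
    with open(path) as fi:
        return sum(1 for _ in fi)

# Illustrative data only: two 3-dimensional vectors with labels and sentences
vectors = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype="float16")
labels = ["wheat", "tea"]
sentences = ["wheat prices\trose", "tea exports\nfell"]

out_dir = tempfile.mkdtemp()
emb_path, meta_path = write_projector_files(vectors, labels, sentences, out_dir)
n_vectors = count_lines(emb_path)
n_meta = count_lines(meta_path)
print(n_vectors, n_meta)
```

If the metadata file does not have exactly one more line than the embeddings file, some sentence probably contained a stray line break.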

### Create embeddings for CORD19 Articles

In [None]:
# Convert pandas dataframe to nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
# From: https://stackoverflow.com/questions/49088978/how-to-create-corpus-from-pandas-data-frame-to-operate-with-nltk/49104725
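def CreateCorpusFromDataFrame(corpusfolder, df):
    """Write one text file per article, named <journal>_<row index>.txt."""
    for index, r in df.iterrows():
        title = r['title']
        body = r['title']  # title only: the body text is set to the title as well
        # Skip rows without a journal (re.sub raises TypeError on a missing value)
        try:
            category = re.sub('/', '', r['journal'])  # '/' would break the file name
        except TypeError:
            continue
        fname = str(category) + '_' + str(index) + '.txt'
        # 'a+' appends, so re-running this cell adds to existing files
        with open(os.path.join(corpusfolder, fname), 'a+') as corpusfile:
            corpusfile.write(str(body) + " " + str(title))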

In [None]:
# create folder to hold CORD19 nltk
dirName = 'CORD19_nltk_title_only'
# Create target directory if it does not exist yet
os.makedirs(dirName, exist_ok=True)

# create corpus
CreateCorpusFromDataFrame(dirName,metadata)
print("Corpus created in folder: %s" % dirName)

In [None]:
# Import the corpus reader
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Create NLTK data structure (the pattern recovers the journal name from each file name)
CORD_corpus = CategorizedPlaintextCorpusReader(dirName, r'.*', cat_pattern=r'(.*)_.*\.txt$')
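
To see what a category pattern of this shape extracts, the sketch below applies the regular expression directly with Python's `re` module. The helper `category_of` and the sample file names are hypothetical; how NLTK applies the pattern internally is assumed here to be a plain match on the file name, with the first group taken as the category.

```python
import re

# Same shape as the cat_pattern above: everything before the last underscore
CAT_PATTERN = r'(.*)_.*\.txt$'

def category_of(file_id):
    """Return the category encoded in a file name, or None if it does not match."""
    match = re.match(CAT_PATTERN, file_id)
    return match.group(1) if match else None

simple = category_of("The Lancet_42.txt")
with_underscore = category_of("J_Virol_7.txt")
no_match = category_of("notes.md")
print(simple, with_underscore, no_match)
```

Because the first group is greedy, a journal name that itself contains underscores is still captured whole; only the trailing `_<index>.txt` part is cut off.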

In [None]:
# total journals
print("Total number journals: %s" % (len(metadata.journal.unique())))

# select a subset of journals, where the journal will be the tag
num_journals=8
categories=metadata['journal'].value_counts()[:num_journals].index.tolist()
print ("\nPicking most common journals:")
categories



In [None]:
#CORD19 Examples
max_samples = 5000

S, X, Y = [], [], []

for category in categories:
  print(category)
  
  sents = CORD_corpus.sents(categories=category)
  sents = [' '.join(sent) for sent in sents][:max_samples]
  X.append(bert_vectorizer(sents, verbose=True))
  Y += [category] * len(sents)
  S += sents
  
X = np.vstack(X) 
X.shape

In [None]:
# make folder in google drive to download files
location = '/content/drive/My Drive/'

with open(location + "embeddings_large.tsv", "w") as fo:
  for x in X.astype('float16'):
    line = "\t".join([str(v) for v in x])
    fo.write(line + "\n")

with open(location + "metadata_large.tsv", "w") as fo:
  fo.write("Label\tSentence\n")
  for y, s in zip(Y, S):
    fo.write("{}\t{}\n".format(y, s))

The interactive visualization of generated embeddings is available on the [Embedding Projector](https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/gaphex/7262af1e151957b1e7c638f4922dfe57/raw/3b946229fc58cbefbca2a642502cf51d4f8e81c5/reuters_proj_config.json). **<--CLICK THAT TO GENERATE**

Then go to the bottom right and load the two files written above.

In [None]:
from IPython.display import HTML

HTML("""
<video width="900" height="632" controls>
  <source src="https://storage.googleapis.com/bert_resourses/reuters_tsne_hd.mp4" type="video/mp4">
</video>
""")