# An Overview of Applications in Natural Language Processing

Paul M. Washburn

## Introduction

**Universal Sentence Embedding** is a state-of-the-art technique that emerged in 2018 and has proven effective in natural language machine learning tasks.

Both word and sentence embeddings are transfer learning techniques that convert natural language text into fixed-length dense vectors of real numbers, representing the original data in a more [contextually aware manner](https://aclweb.org/aclwiki/Distributional_Hypothesis).  Embeddings are of interest to the community because machine learning algorithms in general, and neural networks in particular, work well with the dense numeric vectors that the embedding process produces.

Over the last couple of years there has been a movement toward **Universal Embeddings** that are pre-trained and ready for use in downstream machine learning workflows.  Universal embeddings often improve results in NLP classification (and similar) tasks because they are derived from a large corpus of examples.

Downstream workflows that might benefit from universal embeddings include:

- Sentiment analysis
- Classification tasks
- Translation tasks
- Unsupervised learning
- Visualization 

The enrichment of natural language text data via embeddings has had considerable success, so we will explore the reasons for that success as well as some applications of the technology.

# Word Embeddings

Since sentence embeddings extend the concept of *Word Embeddings*, let us briefly examine what word embeddings are.

Both word and sentence embeddings are a form of [*Transfer Learning*](https://en.wikipedia.org/wiki/Transfer_learning), or the transference of knowledge learned from one domain to another (related) domain.  For example, a model built to recognize cars in a traffic video stream could be re-used as the basis for a model that recognizes trucks.

---------------------------------------------------------------
## What Is An *Embedding*?

[Google's TensorFlow documentation](https://www.tensorflow.org/guide/embedding) states: *An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.*

```python
## For example, a 300-dimensional embedding for English words could include:
blue:  (0.01359, 0.00075997, 0.24608, ..., -0.2524, 1.0048, 0.06259)
blues:  (0.01396, 0.11887, -0.48963, ..., 0.033483, -0.10007, 0.1158)
orange:  (-0.24776, -0.12359, 0.20986, ..., 0.079717, 0.23865, -0.014213)
oranges:  (-0.35609, 0.21854, 0.080944, ..., -0.35413, 0.38511, -0.070976)
```

*Embedding functions are the standard and effective way to transform such discrete input objects into useful continuous vectors.*

*The individual dimensions in these vectors have no inherent meaning.  The overall patterns of location and distance between vectors, however, are highly valuable in machine learning tasks.  For example the Euclidean distance or angle between vectors can be easily derived for a sort of nearest-neighbor analysis:*

```python
blue:  (red, 47.6°), (yellow, 51.9°), (purple, 52.4°)
blues:  (jazz, 53.3°), (folk, 59.1°), (bluegrass, 60.6°)
orange:  (yellow, 53.5°), (colored, 58.0°), (bright, 59.9°)
oranges:  (apples, 45.3°), (lemons, 48.3°), (mangoes, 50.4°)
```
---------------------------------------------------------------
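The nearest-neighbor analysis described in the quoted passage can be sketched with NumPy. This is a minimal illustration, not the method used to produce the angles listed above: the 3-dimensional vectors are invented for the example, and the helper names `angle_degrees` and `nearest_neighbors` are hypothetical.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between two vectors in degrees, derived from cosine similarity."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # clip guards against floating-point values slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

def nearest_neighbors(word, vectors, k=3):
    """Return the k (word, angle) pairs with the smallest angle to `word`."""
    others = [(w, angle_degrees(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1])[:k]

# toy 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "blue": [1.0, 0.1, 0.0],
    "red": [0.9, 0.3, 0.0],
    "jazz": [0.0, 0.2, 1.0],
}
neighbors = nearest_neighbors("blue", toy_vectors, k=2)
```

With real embeddings the same ranking is usually computed over the whole vocabulary at once with matrix operations rather than a Python loop.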

## A Brief History of Word Embeddings

Word embeddings were popularized by two projects from Google and Stanford: [word2vec](https://github.com/dav/word2vec/) and [GloVe - Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/).  Both are unsupervised approaches based on the [distributional hypothesis](https://aclweb.org/aclwiki/Distributional_Hypothesis) (words that occur in similar contexts tend to have similar meanings).  They were followed by [FastText](https://github.com/facebookresearch/fastText) and [ELMo](http://allennlp.org/elmo), which cope better with out-of-vocabulary words by drawing on character n-grams and character-level features.  These advances appear to have accelerated innovation in this space in recent months, leading to developments such as universal embeddings. This line of work leads to the topic of this presentation, [Google's Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf).

# From Word Embeddings to Sentence Embeddings

There are many competing schemes for transforming word embeddings into sentence embeddings.  The simple heuristic of averaging a sentence's word vectors is generally accepted as a strong baseline in spite of its simplicity.  However, there is a lot of exciting research in this area, including supervised, unsupervised, and ensemble approaches.
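
The averaging heuristic can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `word_vectors` dictionary of toy 3-dimensional vectors (real word embeddings typically have hundreds of dimensions) and naive whitespace tokenization; neither is taken from a specific library.

```python
import numpy as np

# toy word vectors invented for illustration; not taken from any real model
word_vectors = {
    "walden": np.array([0.2, 0.4, 0.0]),
    "pond": np.array([0.0, 0.6, 0.2]),
    "woods": np.array([0.4, 0.2, 0.4]),
}

def average_sentence_embedding(sentence, vectors, dim=3):
    """Average the vectors of the known words in a sentence."""
    tokens = sentence.lower().split()  # naive whitespace tokenization
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # no known words: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

sentence_vector = average_sentence_embedding("Walden Pond", word_vectors)
```

Because the average ignores word order, "dog bites man" and "man bites dog" receive the same vector, which is one motivation for learned sentence encoders.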

## Google's Universal Sentence Encoder

In early 2018 Google released the Universal Sentence Encoder, which was trained on a variety of data sources and tasks so that its coverage is broad enough to be generally useful. (The paper describes two encoder variants, one based on the Transformer architecture and one on a deep averaging network.) The tool makes sentence-level embeddings as easy to obtain as individual word embeddings have historically been to look up.  The embeddings returned by this free-to-use tool are approximately normalized, which is convenient for machine learning tasks.
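
Because the vectors are approximately unit length, the inner product of two embeddings approximates their cosine similarity. The sketch below assumes `demo_embeddings` stands in for a list of equal-length vectors such as an encoder might return; the values are made-up 3-dimensional placeholders, and the helper name `similarity_matrix` is hypothetical. It normalizes explicitly, so the result is a cosine similarity even when the inputs are only approximately normalized.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows of an embedding matrix.

    Assumes no row is the all-zero vector.
    """
    matrix = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms  # rescale every row to unit length
    return unit @ unit.T   # entry (i, j) is the cosine of rows i and j

# made-up stand-ins for sentence embeddings
demo_embeddings = [
    [0.6, 0.8, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
similarities = similarity_matrix(demo_embeddings)
```

Entries near 1 indicate sentences the encoder places close together, and entries near 0 indicate unrelated ones.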

# Applications

## `"Hello World"` Demonstration of Process

In [21]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split
from sklearn.base import TransformerMixin, BaseEstimator
import tensorflow as tf
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import scikitplot as skplt
from sklearn.metrics import accuracy_score, f1_score
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from yellowbrick.text import TSNEVisualizer
%matplotlib inline

In [10]:
with open("data/thoreau-walden.txt", "r") as f:
    lines = f.readlines()
    
lines = [l for l in lines if l != "\n"]
# NOTE: the length threshold is missing in the source; 1 is an assumed placeholder
lines = [l for l in lines if len(l) > 1]
lines[25:40]

[' ON THE DUTY OF CIVIL DISOBEDIENCE\n',
 'WALDEN\n',
 'Economy\n',
 'When I wrote the following pages, or rather the bulk of them, I lived\n',
 'alone, in the woods, a mile from any neighbor, in a house which I had\n',
 'built myself, on the shore of Walden Pond, in Concord, Massachusetts,\n',
 'and earned my living by the labor of my hands only. I lived there two\n',
 'years and two months. At present I am a sojourner in civilized life\n',
 'again.\n',
 'I should not obtrude my affairs so much on the notice of my readers if\n',
 'very particular inquiries had not been made by my townsmen concerning\n',
 'my mode of life, which some would call impertinent, though they do not\n',
 'appear to me at all impertinent, but, considering the circumstances,\n',
 'very natural and pertinent. Some have asked what I got to eat; if I did\n',
 'not feel lonesome; if I was not afraid; and the like. Others have been\n']

In [18]:
# FastText & CountVectorizer & TfidfVectorizer as a baseline


def fetch_universal_sentence_embeddings(messages, verbose=0):
    """Fetches universal sentence embeddings from Google's
    research paper https://arxiv.org/pdf/1803.11175.pdf.
    
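    INPUTS:
        messages: list of strings to embed
        verbose: if truthy, print each message with a preview of its embedding
    RETURNS:
        list of embeddings (one list of floats per message)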
    """
    module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]

    # Import the Universal Sentence Encoder's TF Hub module
    embed = hub.Module(module_url)

    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        message_embeddings = session.run(embed(messages))
        embeddings = list()
        for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
            if verbose:
                print("Message: {}".format(messages[i]))
                print("Embedding size: {}".format(len(message_embedding)))
                message_embedding_snippet = ", ".join(
                    (str(x) for x in message_embedding[:3]))
                print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
            embeddings.append(message_embedding)
    return embeddings


In [22]:
embeddings = fetch_universal_sentence_embeddings(lines[25:40])

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [24]:
embeddings[0][:25]

[-0.03612850233912468,
 -0.04058030992746353,
 -0.01834092102944851,
 0.019077610224485397,
 0.08532124012708664,
 -0.060324832797050476,
 2.1142602690815693e-06,
 0.0012475283583626151,
 -0.05353572219610214,
 -0.045716796070337296,
 0.02887057326734066,
 -0.06979456543922424,
 0.05494796857237816,
 0.02506193332374096,
 -0.09119703620672226,
 -0.0011676112189888954,
 -0.04826129600405693,
 0.05431929603219032,
 0.01620793342590332,
 -0.06765597313642502,
 -0.008597459644079208,
 -0.07260719686746597,
 -0.06897539645433426,
 0.09442207962274551,
 0.027198338881134987]

# Test on `ARTHUR` Data

In [32]:
# read in training data from ARTHUR competition
arthur = pd.read_csv("data/paul_arthur_data/train_df.csv")
arthur.head()



Unnamed: 0,id,target,desc_of_operations,website,bus_eff_date
0,14507,c5506,SUBMITTED BY COMMERCIAL INSURANCE SERVICES INC...,,1956-01-01 00:00:00.0
1,14514,c7720,1111 audit: This policyholder holder is a muni...,,1956-01-01 00:00:00.0
2,14543,c0251,OWN AND MAINTIAN AN IRRIGATION CANAL THAT DELI...,,1975-08-19 00:00:00.0
3,14545,c0251,Website: www.northsterling.org The North Sterl...,https://www.northsterling.org/,1956-01-01 00:00:00.0
4,14551,c5506,UNDER LA JARA TOWN GOVERNMENT IN THE WHITE PAG...,,1956-01-01 00:00:00.0


In [33]:
arthur.shape

(562988, 5)

In [51]:
arthur = arthur.drop_duplicates(subset="desc_of_operations")
arthur = arthur.loc[~arthur.desc_of_operations.isnull()]
arthur.shape

(342706, 5)

In [55]:
X, y = arthur.desc_of_operations.astype(str).tolist(), arthur.target.astype(str)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=7, stratify=y, train_size=.8)



In [56]:
embed_train = fetch_universal_sentence_embeddings(X_train)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
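
Embedding the whole training set in a single call keeps every input and output in memory at once, which may become a constraint at this data size. One possible mitigation is to embed in batches. The sketch below is an assumption-laden illustration: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the example runs without TensorFlow. Note that `fetch_universal_sentence_embeddings` as written rebuilds the module and session on every call, so in practice one would create those once and pass a callable that reuses them.

```python
import numpy as np

def embed_in_batches(messages, embed_fn, batch_size=1024):
    """Apply embed_fn to successive slices of messages and stack the results."""
    chunks = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch), dtype=float))
    if not chunks:
        # nothing to embed: return an empty 2-D array
        return np.empty((0, 0))
    return np.vstack(chunks)

def fake_embed(batch):
    """Hypothetical stand-in encoder: maps each string to a 2-d vector."""
    return [[float(len(message)), 1.0] for message in batch]

demo_matrix = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

The result is a single array with one row per message, in the original order, regardless of how the batches were split.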


In [62]:
clf = LinearSVC()
clf.fit(embed_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [64]:
from sklearn.metrics import confusion_matrix

def multiclass_confusion_matrix(y, yhat, model_name='unspecified',
                               verbose=1):
    '''
    Inputs:
    ------------------------------------------------------
    y: true labels 
    yhat: predicted labels 
    model_name: name of model for printing
    verbose: if truthy, print the matrix and metrics and return None
    
    Outputs:
    ------------------------------------------------------
    cm: confusion matrix as a pd.DataFrame with totals; rows (Test_*) are
        predicted classes, columns (Condition_*) are true classes.
        Returned only when verbose is falsy.
    '''
    # organize confusion matrix from sklearn into readable format
    sk_confusion_matrix = confusion_matrix(y, yhat).transpose()#; print(sk_confusion_matrix)
    
    # put in pd.DataFrame and add names
    cm = pd.DataFrame(sk_confusion_matrix)
    IX = ['Test_' + str(i+1) for i in cm.index]
    COLS = ['Condition_' + str(i+1) for i in cm.columns]
    cm.columns, cm.index = COLS, IX
    
    # add totals
    cm['Total'] = cm.sum(axis=1)
    cm.loc['Total'] = cm.sum(axis=0)
    
    # get performance scores
    N = cm.loc['Total', 'Total']
    TP = np.diag(cm.loc[IX, COLS]).sum()
    ACC = np.divide(TP, N)
    MCR = 1 - ACC
    
    if verbose:
        print('''
        Confusion Matrix for Model: %s
        ------------------------------------------------------''' %model_name)
        print(cm)
        print('''
        Metrics for Model: %s
        ------------------------------------------------------
        Accuracy Rate = %.5f
        Misclassification Rate = %.5f
        ''' %(model_name, ACC, MCR))
        return None

    return cm


def train_val_metrics(grid, X_train, X_val, y_train, y_val):
    # check train data
    y_pred_train, y_pred_val = grid.predict(X_train), grid.predict(X_val)

    # scikit-learn metrics take (y_true, y_pred) in that order
    train_acc, train_f1 = accuracy_score(y_train, y_pred_train), f1_score(y_train, y_pred_train, average='macro')
    print('''
    Training Accuracy = %.4f
    Training F1 Score = %.4f
    ''' %(train_acc, train_f1))

    _ = multiclass_confusion_matrix(y_train, y_pred_train)

    # requires an estimator that exposes predict_proba (LinearSVC does not)
    skplt.metrics.plot_roc_curve(y_train, grid.predict_proba(X_train))
    ax = plt.gca()
    ax.set_title('Training Results')
    plt.show()
    
    # check validation data 
    # scikit-learn metrics take (y_true, y_pred) in that order
    val_acc, val_f1 = accuracy_score(y_val, y_pred_val), f1_score(y_val, y_pred_val, average='macro')

    print('''
    Validation Accuracy = %.4f
    Validation F1 Score = %.4f
    ''' %(val_acc, val_f1))

    _ = multiclass_confusion_matrix(y_val, y_pred_val)

    skplt.metrics.plot_roc_curve(y_val, grid.predict_proba(X_val))
    ax = plt.gca()
    ax.set_title('Validation Results')
    plt.show()


In [65]:
embed_val = fetch_universal_sentence_embeddings(X_val)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [68]:
train_val_metrics(clf, embed_train, embed_val, y_train, y_val)




    Training Accuracy = 0.6212
    Training F1 Score = 0.4579
    

        Confusion Matrix for Model: unspecified
        ------------------------------------------------------
          Condition_1  Condition_2  Condition_3  Condition_4  Condition_5  \
Test_1            137            1            0            0           17   
Test_2              1          111            1            1            6   
Test_3              1            4          177            0            0   
Test_4              0            0            0           49            0   
Test_5             23            7            2            1          632   
Test_6              0            0            0            0            0   
Test_7             15          147           17            2           13   
Test_8             46            6            2            0           17   
Test_9              0            0            0            0            0   
Test_10             0            1            0   

AttributeError: 'LinearSVC' object has no attribute 'predict_proba'
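
The traceback arises because `LinearSVC` does not implement `predict_proba`, which the ROC plotting call in `train_val_metrics` needs. One possible workaround, shown here as a sketch on synthetic data rather than the ARTHUR embeddings, is to wrap the classifier in scikit-learn's `CalibratedClassifierCV`, which fits a mapping from the SVM's decision scores to probabilities. The variable names, the synthetic dataset, and the choice of `cv=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic stand-in for the embedding matrix and class labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=7
)

# LinearSVC alone exposes decision_function but not predict_proba;
# the calibration wrapper learns a mapping from scores to probabilities
calibrated_clf = CalibratedClassifierCV(LinearSVC(random_state=7), cv=3)
calibrated_clf.fit(X_demo, y_demo)

# one row per sample, one column per class
probabilities = calibrated_clf.predict_proba(X_demo)
```

A calibrated classifier of this kind could be passed to `train_val_metrics` in place of the bare `LinearSVC`, at the cost of fitting the underlying model once per cross-validation fold.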

In [None]:
# vectorize the training descriptions with TF-IDF
tfidf = TfidfVectorizer(ngram_range=(1,3), min_df=3, max_features=1000)
docs = tfidf.fit_transform(X_train) 

# derive clusters
clusters = KMeans(n_clusters=10)
clusters.fit(docs)

tsne = TSNEVisualizer()
tsne.fit(docs, ["c{}".format(c) for c in clusters.labels_])
tsne.poof()

In [None]:
tsne = TSNEVisualizer()
tsne.fit(docs, y_train)
tsne.poof()

In [46]:
class UniversalSentenceEmbeddingTransformer(TransformerMixin, BaseEstimator):

    def fetch_universal_sentence_embeddings(self, messages, verbose=0):
        """Fetches universal sentence embeddings from Google's
        research paper https://arxiv.org/pdf/1803.11175.pdf.
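        INPUTS:
            messages: list of strings to embed
            verbose: if truthy, print each message with a preview of its embedding
        RETURNS:
            list of embeddings (one list of floats per message)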
        """
        module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]

        # Import the Universal Sentence Encoder's TF Hub module
        embed = hub.Module(module_url)

        with tf.Session() as session:
            session.run([tf.global_variables_initializer(), tf.tables_initializer()])
            message_embeddings = session.run(embed(messages))
            embeddings = list()
            for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
                if verbose:
                    print("Message: {}".format(messages[i]))
                    print("Embedding size: {}".format(len(message_embedding)))
                    message_embedding_snippet = ", ".join(
                        (str(x) for x in message_embedding[:3]))
                    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
                embeddings.append(message_embedding)
        return embeddings
    
    def fit(self, X, y=None):
        """interface conforming, and allows use of fit_transform"""
        return self
    
    def transform(self, X):
        return self.fetch_universal_sentence_embeddings(messages=X)

In [47]:
## Feature union with tfidf and embeddings
from sklearn.pipeline import FeatureUnion

# specify pipeline params
params = {'clf__C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 
          'clf__random_state': [7]}

# create pipeline
pipeline = Pipeline([
    ('encode', UniversalSentenceEmbeddingTransformer()),
    ('clf', LinearSVC()) 
])

# fit pipeline
grid = GridSearchCV(pipeline, params, verbose=1, n_jobs=-1)
grid = grid.fit(X_train, y_train)



Fitting 3 folds for each of 7 candidates, totalling 21 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


TypeError: Can't convert 'text': data type not understood

# References

- [Universal Sentence Encoder (paper)](https://arxiv.org/pdf/1803.11175.pdf)
- [Google's Universal Sentence Encoder (TensorFlow Hub)](https://tfhub.dev/google/universal-sentence-encoder/2)
- [Colab Notebook with Examples](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb#scrollTo=pxe8MsCfFcy7)
- [TensorFlow Embedding Documentation](https://www.tensorflow.org/guide/embedding)
- [The Current Best of Universal Word Embeddings and Sentence Embeddings](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)
- [Introducing state of the art text classification with universal language models](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html)