## Outline of Strategy

For complex natural language understanding tasks like detecting paraphrases, duplicates, or plagiarism, the state of the art is deep neural networks built on transformers and attention, usually starting from the pretrained models released by Google, HuggingFace, and the few other organizations with the enormous compute resources needed to train these language models. Even with these resources, deep NLU is a nontrivial task. For the best performance, the pretrained models must be fine-tuned with compatible, well-prepared custom data; the correct models and hyperparameters for the use case must be carefully selected; and the interpretability of the model must be seriously considered.

I would approach this problem, depending on the urgency of the issue and the immediate, medium-term, and long-term needs of the client/business, as follows: 

- Perform a preliminary analysis of the data in spaCy, into which we can load out-of-the-box BERT and relatively quickly model our data and sketch out several different analyses to get an idea of how best to proceed with fine-tuning. SpaCy also provides the advantages of excellent visualization interfaces and relatively high interpretability in this context of deep learning.
- Migrate our model to PyTorch, where it is possible to control both the data preparation and the model fine-tuning in a much more granular way. We may find it advantageous to experiment with Siamese LSTMs, etc., and to take advantage of the many well-written and well-maintained PyTorch libraries from AllenNLP and others.
- If speed and interpretability are of paramount importance, then I would implement doc2vec/sent2vec in gensim, which has low computational complexity and is fast and parallelizable, yet has very good downstream performance on a variety of tasks. Furthermore, using t-Distributed Stochastic Neighbor Embedding, we can produce visualizations (in Tensorboard or otherwise) of which words and phrases, precisely, are most similar to each other, and thus contribute most to the result.

Some additional important considerations: 

- Customization will be important if we seek to find synonyms in any kind of specialized domain, like law or finance, where the usual distance measures may produce less than satisfactory results without the ability to add more training data efficiently. We may also be able to take advantage of hierarchical training, e.g., training image recognition simultaneously for semantic enrichment.
- SpaCy only saves the final hidden state of the model, so we're unable to inquire about its "thought process" in arriving there. This feature may become available in future versions, but for now it's a good reason to use PyTorch, especially since visualization tools for recovering and interpreting the weights of hidden layers' states are becoming more common and more revealing.
- I have not yet implemented dependency parsing, part-of-speech tagging, or many other features that can be structurally important, partially because the scope of this notebook will soon become very expansive, and partially because BERT accounts for a surprising amount of the variance. We could show spaCy dependency parsing, e.g. visually with displaCy or as a feature, but it would require a parallel-trained model, as BERT-trained models cannot also retain dependency/NER/etc. in spaCy. That's a limitation of this approach.

### SpaCy using BERT transformer

In [1]:
import numpy
import torch
import spacy
from numpy.testing import assert_almost_equal

import pandas as pd
from gensim.models import Word2Vec

In [2]:
nlp = spacy.load("en_trf_bertbaseuncased_lg")

I1125 17:45:40.600836 140283058251648 file_utils.py:39] PyTorch version 1.2.0 available.
I1125 17:45:41.085140 140283058251648 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1125 17:45:41.272373 140283058251648 modeling_bert.py:226] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1125 17:45:41.293509 140283058251648 modeling_xlnet.py:339] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [3]:
# nb: error_bad_lines was removed in pandas 2.0; there, use on_bad_lines="warn"
text = pd.read_csv(r"msr-para-train.tsv", 
                   sep="\t", header=0, error_bad_lines=False)

b'Skipping line 102: expected 5 fields, saw 6\nSkipping line 656: expected 5 fields, saw 6\nSkipping line 867: expected 5 fields, saw 6\nSkipping line 880: expected 5 fields, saw 6\nSkipping line 980: expected 5 fields, saw 6\nSkipping line 1439: expected 5 fields, saw 6\nSkipping line 1473: expected 5 fields, saw 6\nSkipping line 1822: expected 5 fields, saw 6\nSkipping line 1952: expected 5 fields, saw 6\nSkipping line 2009: expected 5 fields, saw 6\nSkipping line 2230: expected 5 fields, saw 6\nSkipping line 2506: expected 5 fields, saw 6\nSkipping line 2523: expected 5 fields, saw 6\nSkipping line 2809: expected 5 fields, saw 6\nSkipping line 2887: expected 5 fields, saw 6\nSkipping line 2920: expected 5 fields, saw 6\nSkipping line 2944: expected 5 fields, saw 6\nSkipping line 3241: expected 5 fields, saw 6\nSkipping line 3358: expected 5 fields, saw 6\nSkipping line 3459: expected 5 fields, saw 6\n'


In [5]:
# Duplicate labels so it's easier to process the data
text["label_1"] = text["Quality"]
text["label_2"] = text["Quality"]

In [6]:
# Rename the columns
text.columns = ["label", "id_1", "id_2", "doc_1", 
                "doc_2", "label_1", "label_2"]

In [7]:
text.head()

Unnamed: 0,label,id_1,id_2,doc_1,doc_2,label_1,label_2
0,1,702876,702977,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,1
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0
2,1,1330381,1330521,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,1
3,0,3344667,3344648,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0
4,1,1236820,1236712,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,1


In [16]:
"Apple".lower()

'apple'

In [84]:
# lowercase the documents, because our BERT model is the uncased version
# (cast to str first so that missing or non-string values do not raise)
text["doc_1_proc"] = text["doc_1"].astype(str).str.lower()
text["doc_2_proc"] = text["doc_2"].astype(str).str.lower()

For more fine-tuning and extensibility of a model, eventually it would be a good idea to use a framework like PyTorch, with, e.g., the HuggingFace transformers, which can be trained online, and which supports any kind of custom model you might want to write. Some state-of-the-art methods for resolving the paraphrase question include Siamese recurrent architectures (e.g. LSTMs) and other parallel models, which can be implemented in PyTorch.

However, these can be slow to train and require a lot of time to perfect. For a fast way to get a reasonable idea of the data, spaCy does a remarkably good job, especially with its integration with the PyTorch BERT model. Furthermore, it contains a mapping from the BERT wordpiece embeddings back to the spaCy token ids, which makes it more immediately interpretable than some other models. Those would require a lot of convoluted reasoning -- pun intended -- to interpret the weights of the hidden layers, and even then, the meaning may not be entirely clear.

In [22]:
# We load the uncased BERT model, so we will need to lowercase our corpus.
nlp = spacy.load("en_trf_bertbaseuncased_lg")

BERT's WordPiece tokenizer first checks whether the whole word is in the vocabulary. If not, it greedily splits the word into the longest subword pieces it can find in the vocabulary, falling back to single characters where necessary. As a result, when it sees an out-of-vocabulary word, the model can almost always represent it as a sequence of known pieces, whose vectors spaCy then aligns back to the original token. BERT thus preserves more information than models that assign every unknown token to a single catch-all category.
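To make the greedy splitting concrete, here is a minimal sketch of longest-match-first wordpiece tokenization. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions invented for illustration; the real BERT vocabulary has roughly 30,000 entries and the real tokenizer handles punctuation and casing as well.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration (not BERT's real vocabulary).
toy_vocab = {"am", "##ro", "##zi", "accused", "a", "##m", "##r", "##o", "##z", "##i"}

print(wordpiece_tokenize("amrozi", toy_vocab))
print(wordpiece_tokenize("accused", toy_vocab))
```

With this toy vocabulary, "amrozi" splits into the same three pieces that appear in the alignment table further down, while an in-vocabulary word such as "accused" is kept whole.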

BERT is a transformer encoder built on bidirectional self-attention; unlike a left-to-right decoder, it applies no causal mask. Instead, it is pretrained with a masked language modeling objective, in which a fraction of the input tokens are hidden and must be predicted from both left and right context, together with a next-sentence prediction task that asks whether two segments were adjacent in the corpus. At a high level, each token's representation is a weighted average of the representations of all tokens in the input, with the weights obtained from dot products (a measure of similarity) between learned projections of the token vectors. Each occurrence of a word therefore receives its own bidirectional, context-dependent embedding, allowing for more nuanced queries and understanding.
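The weighted-average idea can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's implementation: it assumes a single attention head and uses the toy token vectors directly as queries, keys, and values, whereas BERT applies learned projection matrices and many heads per layer. The names `softmax` and `self_attention` are ours.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention with no mask: every token sees every token."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token to every token, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The new representation is the attention-weighted average of the values.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional token vectors, invented for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attn = self_attention(tokens, tokens, tokens)
print(attn[0])        # attention weights of the first token over all three tokens
print(contextual[0])  # its context-dependent representation
```

Because no mask is applied, the first token attends to tokens on both sides of it, which is the sense in which the resulting embeddings are bidirectional.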

In [33]:
doc = nlp(text["doc_1"].tolist()[0])
print(doc)
doc.tensor.shape  # (19, 768): always one row per spaCy token

Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.


(19, 768)

In the table below, we can see that BERT has mapped the OOV token "Amrozi" to several wordpieces ("am-ro-zi"), and the word "distorting" into pieces that map roughly to morphemes or phonological syllabic units ("di-stor-ting"). We can also access the alignment via the `doc._.trf_alignment` attribute. `[CLS]` and `[SEP]` denote the beginning and end of a document and are important tokens in their own right; note that they, too, receive hidden-state vectors.

In [61]:
print("Wordpieces and last hidden states for sample 1:")
alignment_table = pd.DataFrame(
    list(zip(doc._.trf_word_pieces, doc._.trf_word_pieces_, doc._.trf_last_hidden_state)),
    columns=["wordpiece_id", "wordpiece_text", "last_hidden_state"],
)
alignment_table.head(30)

Wordpieces and last hidden states for sample 1:


Unnamed: 0,wordpiece_id,wordpiece_text,last_hidden_state
0,101,[CLS],"[-0.5703974, 0.32815552, -0.5710026, -0.379076..."
1,2572,am,"[-0.12826154, 0.013061862, -0.60954684, -1.305..."
2,3217,##ro,"[0.6386509, -0.3578455, -0.054456137, -1.19054..."
3,5831,##zi,"[0.43038714, 0.1304761, -0.43265325, -0.924443..."
4,5496,accused,"[0.1592648, 0.06319537, -0.12583436, -0.546904..."
5,2010,his,"[0.2193718, -0.12850477, -0.5328888, -0.092491..."
6,2567,brother,"[-0.008652532, -0.023832224, -0.8758689, -0.24..."
7,1010,",","[-0.54377395, 0.25359523, 0.027289549, -0.0495..."
8,3183,whom,"[-0.5229964, 0.4036232, 0.3030943, -0.41309872..."
9,2002,he,"[-0.17844011, 0.21058142, -0.4347112, -0.10529..."


In [125]:
# the sum-pooled token tensor should match the sum-pooled last hidden state
# (checked to 5 decimal places)
assert_almost_equal(doc.tensor.sum(axis=0), doc._.trf_last_hidden_state.sum(axis=0), decimal=5)
span = doc[2:4]

We can quickly obtain the cosine similarity between any two docs or doc subparts:

In [77]:
doc_b = nlp(text["doc_2"].tolist()[0]) # This is the known paraphrase of our doc
print(f"The similarity score of known synonyms: {doc.similarity(doc_b)}")
doc2 = nlp(text["doc_1"].tolist()[1]) # This is an unrelated sample
print(f"The similarity score of unrelated samples: {doc.similarity(doc2)}")

The similarity score of known synonyms: 0.9723842587615814
The similarity score of unrelated samples: 0.6042276210130809
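For intuition about what such a score measures, here is a minimal sketch of cosine similarity over sum-pooled token vectors. It assumes that the pipeline compares documents by pooling their per-token rows into one vector and taking the cosine of the angle between the results; the helper names and the tiny 2-dimensional "tensors" are invented for illustration.

```python
import math


def sum_pool(rows):
    """Collapse a (tokens x dims) matrix to one vector by summing over tokens."""
    return [sum(column) for column in zip(*rows)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy per-token vectors for three "documents", invented for illustration.
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.9, 0.1], [0.6, 0.4]]
doc_c = [[-1.0, 0.2], [0.0, -1.0]]

score_close = cosine_similarity(sum_pool(doc_a), sum_pool(doc_b))
score_far = cosine_similarity(sum_pool(doc_a), sum_pool(doc_c))
print(score_close, score_far)
```

Cosine similarity ignores vector length, so a long and a short document pointing in the same direction in embedding space still score close to 1.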


SpaCy allows us to access the values of any span tensor we desire. This is meaningfully possible because of the way BERT tokenizes and encodes a corpus. It gives us the power to directly compare sentences or subparts of sentences using various distance methods, based on their vectors, very quickly, and to see whether those methods accord with intuition.

In [127]:
# Access the tensor from Span elements (especially helpful for sentences)
assert numpy.array_equal(span.tensor, doc.tensor[2:4])

We can input any text we like and see the similarity scores for:

- the entire input
- any token in the input
- any span in the input

Therefore, we can adjust the granularity with which we examine which components contribute to the _overall_ similarity.

In [201]:
bert1 = nlp("BERT is a technologically ground-breaking natural language processing model")
bert2 = nlp("BERT is referred to as a model in many articles, however, it is more of a framework")
bert3 = nlp("Bert and Ernie were built by Don Sahlin from a simple design scribbled by Jim Henson|")
print(bert1[0].similarity(bert2[0]))  
print(bert1[0].similarity(bert3[0]))  

0.8991707
0.658642


In [202]:
# Get token-by-token similarity scores for the sentences above

sims = []
for token in bert1:
    for compare in bert2:
        sims.append((token.text, compare.text, token.similarity(compare)))
sims_df = pd.DataFrame(sims, columns=["token1", "token2", "score"])

# Sort the similarity scores in descending order

sims_df.sort_values(by=["score"], ascending=False)

Unnamed: 0,token1,token2,score
0,BERT,BERT,0.899171
208,model,framework,0.737894
20,is,is,0.684145
43,a,a,0.646669
39,a,is,0.621150
...,...,...,...
10,BERT,",",0.088770
14,BERT,is,0.084505
3,BERT,to,0.081156
114,breaking,BERT,0.065998


In [206]:
# Spans have tensors as well
span = bert3[2:4]
assert numpy.array_equal(span.tensor, bert3.tensor[2:4])

### Gensim doc2vec with earth mover's distance

If interpretability is a primary concern, then neural networks of the complexity that BERT and associated models introduce may not be the best choice, because it's much harder to recover the "reasoning" for the output. Moreover, as more models are pruned intentionally for "selective brain damage" and information is lost, it's been shown that hard-to-predict classes are disproportionately affected. This isn't always desirable. Additionally, it may not be necessary to use the expensive, computationally intense tools to do a job that traditional ones can handle just fine.

Implementations of word2vec-like shallow neural networks dramatically improve on the token-level bag-of-words model. They're more context aware because the underlying model is trained to calculate probabilities on sequences of tokens within a given window. Thus, the model retains more structural information about terms as they relate to each other in context. Sense2vec, sent2vec, and doc2vec can produce satisfactorily context-aware representations of larger subparts of a corpus for many purposes, with the benefit of being more explainable than their convolutional, and convoluted, neural counterparts.

Techniques like singular value decomposition (as used in latent semantic indexing/analysis) have a sound mathematical basis for dimensionality reduction, and make intuitive sense because each retained component explains a known share of the variance in the data.

We can ameliorate some of the word2vec family of methods' context issues to a degree by adding ngram models. Really, this echoes BERT's strategy of iteratively breaking down and encoding smaller subparts of a language to "overfit" it as closely as possible, except less extensively and precisely. However, we gain speed and economy, flexibility, and interpretability in the bargain.

In [130]:
import gensim
from gensim import corpora
from gensim.models import Phrases

from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [None]:
# to do: refit to glove model (still downloading!)
glove_model = gensim.models.KeyedVectors.load_word2vec_format('../model/text/stanford/glove/glove.6B.50d.vec')

In [85]:
text["doc_1_proc"] = list(nlp.pipe(text["doc_1_proc"].tolist()))
text["doc_2_proc"] = list(nlp.pipe(text["doc_2_proc"].tolist()))

In [94]:
concat_docs = pd.concat([text["doc_1_proc"], text["doc_2_proc"]], axis=0)
concat_docs.head()

0    (amrozi, accused, his, brother, ,, whom, he, c...
1    (yucaipa, owned, dominick, 's, before, selling...
2    (they, had, published, an, advertisement, on, ...
3    (around, 0335, gmt, ,, tab, shares, were, up, ...
4    (the, stock, rose, $, 2.11, ,, or, about, 11, ...
dtype: object

In [97]:
# build gensim corpus: one list of token strings per document,
# dropping stop words and punctuation
corpus_docs = [
    [token.text for token in doc if not token.is_stop and not token.is_punct]
    for doc in concat_docs
]

A particularly useful metric is the word mover's distance (the earth mover's distance applied to word embeddings), accessible via `doc2vec_model.wv.wmdistance`. It accounts for the problem of synonyms without literal overlap (which overlap-based metrics, e.g. Jaccard distance, cannot do) by finding the minimum total distance that the embedded words of one document must travel to match the embedded words of the other.
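The full distance requires solving an optimal transport problem, which gensim delegates to a dedicated solver. As a rough illustration of the idea only, the sketch below computes the *relaxed* word mover's distance, a cheaper lower bound in which every word simply sends all of its mass to its nearest word in the other document. This is a swapped-in simplification, not what `wmdistance` computes, and the 2-dimensional word vectors are invented for illustration.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(doc_a, doc_b, vectors):
    """Each word in doc_a (uniform weight) moves to its nearest word in doc_b."""
    cost = 0.0
    for word in doc_a:
        nearest = min(euclidean(vectors[word], vectors[other]) for other in doc_b)
        cost += nearest / len(doc_a)
    return cost


def symmetric_relaxed_wmd(doc_a, doc_b, vectors):
    # Taking the larger of the two directions gives the tighter lower bound.
    return max(relaxed_wmd(doc_a, doc_b, vectors), relaxed_wmd(doc_b, doc_a, vectors))


# Toy embeddings, invented for illustration.
toy_vectors = {
    "president": [1.0, 0.9], "leader": [0.9, 1.0],
    "speech": [0.1, 0.8], "talk": [0.2, 0.7],
    "banana": [-1.0, -1.0],
}

paraphrase = symmetric_relaxed_wmd(["president", "speech"], ["leader", "talk"], toy_vectors)
unrelated = symmetric_relaxed_wmd(["president", "speech"], ["banana"], toy_vectors)
print(paraphrase, unrelated)
```

The two "paraphrase" documents share no literal tokens, yet their distance is small because each word has a close neighbor in the other document; a Jaccard score on the same pair would report no overlap at all.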

In [153]:
doc2vec_documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus_docs)]
doc2vec_model = Doc2Vec(vector_size=50, window=4, min_count=1, workers=4, epochs=40) # instantiate the model

W1125 20:11:50.620370 140283058251648 base_any2vec.py:723] consider setting layer size to a multiple of 4 for greater performance


In [154]:
# fit the model to the vocabulary
doc2vec_model.build_vocab(doc2vec_documents)

I1125 20:12:28.039155 140283058251648 doc2vec.py:1377] collecting all words and their counts
I1125 20:12:28.040684 140283058251648 doc2vec.py:1321] PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
I1125 20:12:28.076210 140283058251648 doc2vec.py:1385] collected 12844 word types and 6916 unique tags from a corpus of 6916 examples and 79407 words
I1125 20:12:28.077234 140283058251648 word2vec.py:1647] Loading a fresh vocabulary
I1125 20:12:28.120304 140283058251648 word2vec.py:1671] effective_min_count=1 retains 12844 unique words (100% of original 12844, drops 0)
I1125 20:12:28.121589 140283058251648 word2vec.py:1677] effective_min_count=1 leaves 79407 word corpus (100% of original 79407, drops 0)
I1125 20:12:28.215401 140283058251648 word2vec.py:1736] deleting the raw counts dictionary of 12844 items
I1125 20:12:28.217094 140283058251648 word2vec.py:1739] sample=0.001 downsamples 8 most-common words
I1125 20:12:28.218403 140283058251648 word2vec.py:1742] downsampl

In [156]:
# train the model on the training docs
doc2vec_model.train(doc2vec_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

I1125 20:13:49.388809 140283058251648 base_any2vec.py:1210] training model with 4 workers on 12844 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=4
I1125 20:13:49.683892 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 3 more threads
I1125 20:13:49.690744 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 2 more threads
I1125 20:13:49.708986 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 1 more threads
I1125 20:13:49.722156 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 0 more threads
I1125 20:13:49.723231 140283058251648 base_any2vec.py:1346] EPOCH - 1 : training on 79407 raw words (83840 effective words) took 0.3s, 267215 effective words/s
I1125 20:13:49.978583 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 3 more threads
I1125 20:13:50.014454 140283058251648 base_any2vec.py:349] worker thread finish

I1125 20:13:54.483043 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 3 more threads
I1125 20:13:54.497586 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 2 more threads
I1125 20:13:54.508027 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 1 more threads
I1125 20:13:54.514766 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 0 more threads
I1125 20:13:54.516225 140283058251648 base_any2vec.py:1346] EPOCH - 14 : training on 79407 raw words (83961 effective words) took 0.4s, 235061 effective words/s
I1125 20:13:54.880956 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 3 more threads
I1125 20:13:54.893553 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 2 more threads
I1125 20:13:54.899285 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 1 more threads
I1125 20:13:54.90746

I1125 20:13:59.289099 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 2 more threads
I1125 20:13:59.306787 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 1 more threads
I1125 20:13:59.318450 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 0 more threads
I1125 20:13:59.319638 140283058251648 base_any2vec.py:1346] EPOCH - 27 : training on 79407 raw words (83937 effective words) took 0.4s, 232825 effective words/s
I1125 20:13:59.592310 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 3 more threads
I1125 20:13:59.629797 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 2 more threads
I1125 20:13:59.638214 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 1 more threads
I1125 20:13:59.644656 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 0 more threads
I1125 20:13:59.64557

I1125 20:14:03.622956 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 1 more threads
I1125 20:14:03.628952 140283058251648 base_any2vec.py:349] worker thread finished; awaiting finish of 0 more threads
I1125 20:14:03.629824 140283058251648 base_any2vec.py:1346] EPOCH - 40 : training on 79407 raw words (83932 effective words) took 0.3s, 279151 effective words/s
I1125 20:14:03.630399 140283058251648 base_any2vec.py:1382] training on a 3176280 raw words (3356651 effective words) took 14.3s, 235356 effective words/s


In [151]:
# To save the model and continue training online later, we can re-load it:
# from gensim.test.utils import get_tmpfile
# file_name = get_tmpfile("doc2vec_model")
# doc2vec_model.save(file_name)
# doc2vec_model = Doc2Vec.load(file_name)

In [112]:
dictionary = corpora.Dictionary(corpus_docs)

I1125 19:40:45.107861 140283058251648 dictionary.py:205] adding document #0 to Dictionary(0 unique tokens: [])
I1125 19:40:45.309946 140283058251648 dictionary.py:212] built Dictionary(12844 unique tokens: ['accused', 'amrozi', 'brother', 'called', 'deliberately']...) from 6916 documents (total 79407 corpus positions)


In [133]:
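# Train word2vec on the tokenized corpus. The constructor already builds the
# vocabulary and trains once; the train() call below continues training.
word2vec_model = Word2Vec(corpus_docs, size=100, window=5, min_count=1, workers=4)
word2vec_model.save("word2vec_model.model")
# Pass total_examples (the number of documents). Passing total_words=len(dictionary)
# supplies the vocabulary size (12844) instead of the corpus size (79407 words)
# and triggers the word-count warning recorded in the log below.
word2vec_model.train(corpus_docs, total_examples=word2vec_model.corpus_count, epochs=4)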

I1125 19:55:38.614854 140283058251648 word2vec.py:1588] collecting all words and their counts
I1125 19:55:38.616070 140283058251648 word2vec.py:1573] PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
I1125 19:55:38.638021 140283058251648 word2vec.py:1596] collected 12844 word types from a corpus of 79407 raw words and 6916 sentences
I1125 19:55:38.638849 140283058251648 word2vec.py:1647] Loading a fresh vocabulary
I1125 19:55:38.669111 140283058251648 word2vec.py:1671] effective_min_count=1 retains 12844 unique words (100% of original 12844, drops 0)
I1125 19:55:38.670149 140283058251648 word2vec.py:1677] effective_min_count=1 leaves 79407 word corpus (100% of original 79407, drops 0)
I1125 19:55:38.732173 140283058251648 word2vec.py:1736] deleting the raw counts dictionary of 12844 items
I1125 19:55:38.733381 140283058251648 word2vec.py:1739] sample=0.001 downsamples 8 most-common words
I1125 19:55:38.734551 140283058251648 word2vec.py:1742] downsampling leaves estimat

W1125 19:55:40.958196 140283058251648 base_any2vec.py:1362] EPOCH - 4 : supplied raw word count (79407) did not equal expected count (12844)
I1125 19:55:40.958857 140283058251648 base_any2vec.py:1382] training on a 317628 raw words (307971 effective words) took 0.8s, 367292 effective words/s
W1125 19:55:40.959497 140283058251648 base_any2vec.py:1386] under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay


(307971, 317628)

A nice feature of gensim is that we can input any list of words we like, and the trained model will infer a vector for it using the `infer_vector` method. (These words should be tokenized the same way as the original input.)

In [164]:
doc2vec_model.infer_vector("president united states gave speech today.".split())

array([ 0.8970243 ,  0.03396155,  0.03367773,  0.18531203,  0.23543327,
       -0.2886431 ,  0.11652056,  0.33937278,  0.37848   , -0.35977128,
        0.3452646 , -0.10515809,  0.11883105, -0.33633265, -0.02338399,
        0.3911174 ,  0.30608284, -0.60811317, -0.14078681,  0.04302975,
        0.17476422,  0.23118949,  0.07585598,  0.23363063, -0.07966822,
       -0.29803884,  0.07596204,  0.09144077,  0.24587053, -0.01020437,
        0.23842847, -0.28817824,  0.20111568, -0.00382623, -0.49379253,
        0.04864242, -0.18418962,  0.13911545,  0.39039007, -0.24681053,
       -0.2171303 , -0.15378155,  0.33905306,  0.22656122, -0.32394376,
        0.2269503 ,  0.07785336,  0.17612188,  0.14768776, -0.13606453],
      dtype=float32)

In [None]:
doc2vec_model.docvecs[0]
# nb: Care must be taken with OOV terms -- we need to make sure
# to choose models and methods, e.g. Laplace (additive) smoothing, that
# can deal with values they have never seen before.
# There may be artifacts here of the gensim version; see GitHub issues.
# to do: try updating gensim in separate venv

We also have the option to instantiate word2vec models with or without ngrams to quickly assess whether they are useful and which option is more advantageous.

Traditional NLP methods compute similarity in a largely context-agnostic way that nonetheless produces good results much of the time. TF-IDF improves on naive frequency counts by weighting rarer words more heavily and penalizing too-common words on a log scale. Still, it scores each vocabulary term independently, so the information it can capture is largely limited to word counts.
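As a concrete reference point, here is a minimal TF-IDF sketch using the plain, unsmoothed formulation (term frequency times the log of inverse document frequency). Libraries such as gensim and scikit-learn apply their own smoothing and normalization, so their numbers will differ; the function name `tfidf` and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Toy corpus, invented for illustration.
toy_docs = [
    ["the", "stock", "rose"],
    ["the", "shares", "jumped"],
    ["the", "ship", "sailed"],
]
weights = tfidf(toy_docs)
print(weights[0])
```

A word that occurs in every document ("the") receives a weight of zero, while a word unique to one document keeps a positive weight. Note that "stock" and "shares" are treated as unrelated terms, which is exactly the limitation that embedding-based methods address.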

Word2vec is a shallow neural network that represents each term in a corpus as a vector of weights, learned by a model that relates terms to one another within a given window. Within that window, it is agnostic to position and direction: we cannot recover word order, i.e., how close or far the context word is from the center word, or on which side it falls. Moreover, it produces one embedding per vocabulary word, which can cause problems for polysemy, unlike BERT, which is bi-directionally context sensitive. Still, with other linguistic features (e.g., syntactic dependencies, entity extraction and coreference resolution, part-of-speech tagging, ngram modeling) we can increase the model's capacity.
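The window behavior is easy to see by generating the training pairs by hand. The sketch below takes the skip-gram view, in which each center word is paired with every word inside its window; the CBOW variant (gensim's default, `sg=0`) instead averages the same context words to predict the center. The helper `context_pairs` is hypothetical and exists only for this illustration.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # The pair keeps no record of the offset j - i or its sign.
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["amrozi", "accused", "his", "brother"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

Each pair carries only the two words, not their relative position, which is why word order inside the window cannot be recovered from the trained model.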

In [141]:
# learn a word2vec model with multiword ngram expressions
    
bigram_transformer = Phrases(corpus_docs)
phrase_model = Word2Vec(bigram_transformer[corpus_docs], min_count=1)

I1125 19:58:04.797906 140283058251648 phrases.py:475] collecting all words and their counts
I1125 19:58:04.799263 140283058251648 phrases.py:482] PROGRESS: at sentence #0, processed 0 words and 0 word types
I1125 19:58:04.980337 140283058251648 phrases.py:505] collected 59101 word types from a corpus of 79407 words (unigram + bigrams) and 6916 sentences
I1125 19:58:04.981020 140283058251648 phrases.py:558] using 59101 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
I1125 19:58:04.982093 140283058251648 word2vec.py:1588] collecting all words and their counts
I1125 19:58:04.983121 140283058251648 word2vec.py:1573] PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
I1125 19:58:05.451836 140283058251648 word2vec.py:1596] collected 13162 word types from a corpus of 75506 raw words and 6916 sentences
I1125 19:58:05.452785 140283058251648 word2vec.py:1647] Loading a fresh vocabulary
I1125 19:58:05.486203 140283058251648 word2vec.py:1671

#### Visualizations

Some good visualizations for this kind of data could be t-SNE, or simply a mapping of the highest contributing similarity features.

In [None]:
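from gensim.scripts.glove2word2vec import glove2word2vec

import tensorflow as tf
# nb: tensorflow.contrib exists only in TensorFlow 1.x; in 2.x the projector
# plugin is imported with `from tensorboard.plugins import projector`
from tensorflow.contrib.tensorboard.plugins import projector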

### PyTorch modeling and visualization

## Coda: Installing transformers in spaCy, generating Gold corpus for updating BERT

We can improve our spaCy BERT model by creating a GoldParse corpus and updating the out-of-the-box transformer with our data. If the data is relatively simple, it is straightforward to create a jsonlines file of training tuples and update the model from the command line or in a script:

```sh
jq --raw-input --slurp 'split("\n") | map(split("\t")) | .[0:-1] | map( { "id_1": .[1], "doc_1": .[3], "label": .[0], "id_2": .[2], "doc_2": .[4] } )' msr-para-train.tsv
```
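The same conversion can be sketched in Python with the standard library, which may be easier to adapt than `jq`. This is an illustrative sketch, not part of the original pipeline: the function name `tsv_to_records` and the two-line sample file are invented, and the column order (label, first id, second id, first sentence, second sentence) is assumed from the DataFrame shown earlier.

```python
import csv
import io
import json


def tsv_to_records(handle):
    """Convert rows of (label, id_1, id_2, doc_1, doc_2) into dictionaries."""
    # QUOTE_NONE keeps literal double quotes inside the sentences intact.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    records = []
    for row in reader:
        if len(row) != 5:
            continue  # skip malformed rows rather than guessing at their layout
        label, id_1, id_2, doc_1, doc_2 = row
        records.append({"label": label, "id_1": id_1, "doc_1": doc_1,
                        "id_2": id_2, "doc_2": doc_2})
    return records


# A two-line stand-in for msr-para-train.tsv, invented for illustration.
sample = io.StringIO(
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    '1\t10\t11\tHe said "yes".\tHe agreed.\n'
)
records = tsv_to_records(sample)
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` sample.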

To install the spaCy compatible transformers, run:

```sh
pip install spacy-transformers && python -m spacy download en_trf_bertbaseuncased_lg
```

In [9]:
doc = nlp("this is a document for testing")
# make_doc only tokenizes the lowercased text; it does not run the transformer
nlp.make_doc(" ".join([token.lower_ for token in doc]))

this is a document for testing

In [23]:
from spacy.util import minibatch
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})

In [45]:
# Add the labels, then register the component only once: re-running this cell
# would otherwise raise E007 because 'trf_textcat' is already in the pipeline.
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
if "trf_textcat" not in nlp.pipe_names:
    nlp.add_pipe(textcat)
print(nlp.pipe_names)

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec', 'trf_textcat']

#### Example training corpus illustration

In [118]:
TRAIN_DATA = [
    (
        "We trade in over the counter corporate bonds in Europe and Asia.",
        {
            "label": "POSITIVE",
            "id": 1
        },
    ),
    (
        "I like sushi and sashimi, but not nigiri.",
        {
            "label": "NEGATIVE",
            "id": 2
           
        },
    ),
]

In [None]:
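import random

# nb: spaCy's text categorizer expects annotations shaped like
# {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}, so the illustrative
# TRAIN_DATA above would need converting to that shape first.
optimizer = nlp.resume_training()  # keep the pretrained transformer weights
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)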

SpaCy requires _tuples_ of strings and annotation dictionaries, not simply jsonlines, so depending on how we process our text, we may need to write our `jq` or Python pipeline differently.
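The `encode_line` helper used in the next cell is not defined anywhere in this notebook as shown. The version below is a hypothetical reconstruction, guessed from the shape of the printed output (a text followed by a dictionary with `label` and `id` keys); the original may well differ.

```python
def encode_line(doc, doc_id, label):
    """Hypothetical helper: pair a text with its annotation dictionary."""
    # Cast explicitly so that numpy integer types from pandas become plain ints.
    return (str(doc), {"label": int(label), "id": int(doc_id)})


line = encode_line("Yucaipa owned Dominick's before selling the chain.", 2108705, 0)
print(line)
```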

In [82]:
for i in range(10):
    doc1 = text["doc_1"].iloc[i]
    id_1 = text["id_1"].iloc[i]
    label_1 = text["label_1"].iloc[i]
    doc2 = text["doc_2"].iloc[i]
    id_2 = text["id_2"].iloc[i]
    label_2 = text["label_2"].iloc[i]
    line1 = encode_line(doc1, id_1, label_1)
    line2 = encode_line(doc2, id_2, label_2)
    print(line1)
    print(line2)

(Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence., {'label': 1, 'id': 702876}),
(Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence., {'label': 1, 'id': 702977}),
(Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion., {'label': 1, 'id': 2108705}),
(Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998., {'label': 1, 'id': 2108831}),
(They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added., {'label': 1, 'id': 1330381}),
(On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale., {'label': 1, 'id': 1330521}),
(Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57., {'label': 1, 'id': 3344667}),
(Tab shares jumped 20 cents, or 4.6%, to set a record closi