# spaCy Sentence Transformers

Creating a spaCy pipeline that includes a transformer-based sentence embedding component.

In [2]:
import spacy
import pandas as pd
import numpy as np

In [3]:
# Read in a single organization's corpus
df = pd.read_json("data/02. Data Sets/NIFA/contradictions_datasets_nifa_reports.zip", orient='records', compression='infer')
df['fulltext'] = df.text_by_page.str.join(' ')
d0 = df.iloc[0]

In [23]:
# Use this model only for its sentence splitter; every other component is disabled.
nlp_sentencer = spacy.load('en_core_web_sm')
# `senter` is not active after loading, so enable it before selecting it.
nlp_sentencer.enable_pipe('senter')
nlp_sentencer.select_pipes(enable='senter')
print(nlp_sentencer.pipeline)

# Convert the text into a spaCy `Doc`; its `sents` attribute holds the split sentences.
d0_doc = nlp_sentencer(d0.fulltext)
d0_sents = list(d0_doc.sents)
len(d0_sents)

[('senter', <spacy.pipeline.senter.SentenceRecognizer object at 0xffff66de3820>)]


613

In [24]:
# Use a baseline model to test embedding our sentences with transformers.
# This is only a test; a better model should replace it later.
#
# The model takes a moment to download.
import spacy_sentence_bert
nlp_trf = spacy_sentence_bert.load_model('en_nli_bert_base')
nlp_trf.components

[('sentence_bert',
  <spacy_sentence_bert.language.SentenceBert at 0xffff618e3b50>),
 ('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0xffff614d0440>)]

In [25]:
# The model includes a Sentencizer (rule-based), but we used a
# SentenceRecognizer (model-based) already, so we can disable it.
nlp_trf.select_pipes(enable='sentence_bert')
nlp_trf.pipeline

[('sentence_bert',
  <spacy_sentence_bert.language.SentenceBert at 0xffff618e3b50>)]

In [26]:
# Convert each sentence's text into a spaCy `Doc`, which now carries a sentence vector
d0_sent_docs = list(nlp_trf.pipe(sent.text for sent in d0_sents))

In [36]:
s = d0_sent_docs[100]
print(s.text)
print(s.vector.shape)
print(s.ents)


It is not delinquent on any Federal debt, pursuant to OMB Circular No. A-129, "Managing Federal Credit Programs," and requirements contained in OMB Memorandum —87-32, as implemented by 7 CFR Part 3. i. It will make a good-faith effort to provide and maintain a drug-free environment by prohibiting illicit drugs in the workplace, providing employees with drug-free policy statements (including penalties for noncompliance), and establishing necessary awareness programs to keep employees informed about the availability of counseling, rehabilitation, and related services (§5151-5610 of the Drug-Free Workplace Act of 1988, as implemented by 7 CFR Part 3017, Subpart F).
(768,)
()


In [37]:
premise = "The office is responsible for writing a report on the project status within 6 months."
hypothesis_contradict = "Interim project status reports are not required."
hypothesis_entail = "Interim project status reports will be required."
hypothesis_neutral = "Operations—activities or processes associated with the programs to be housed in a completed facility and those processes which are necessary to run the facility."

d_prem, d_cont, d_ent, d_neu = list(nlp_trf.pipe([premise, hypothesis_contradict, hypothesis_entail, hypothesis_neutral]))

In [39]:
print("Contradiction Similarity:", d_prem.similarity(d_cont))
print("Entailment Similarity:", d_prem.similarity(d_ent))
print("Neutral Similarity:", d_prem.similarity(d_neu))

Contradiction Similarity: 0.13807614532548218
Entailment Similarity: 0.41477016096460345
Neutral Similarity: 0.46119326156374896
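
For context on what these scores measure: as far as I know, spaCy's `Doc.similarity` defaults to the cosine similarity of the two document vectors, and this sketch assumes the sentence-BERT component does not change that behavior. The helper name `cosine_similarity` is illustrative and not part of spaCy or `spacy_sentence_bert`.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

If the assumption holds, `cosine_similarity(d_prem.vector, d_cont.vector)` should closely match `d_prem.similarity(d_cont)`. The score reflects how closely the two embeddings point in the same direction, not whether one sentence entails or contradicts the other.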


> **Notes**
> * We can mix and match pipeline components. For example, an algorithm that uses sentence embeddings and also looks at the named entities in the text could take the NER component from another spaCy model and add this sentence transformer component alongside it.
> * The sentence splitting is still unreliable and needs more cleanup.
>   * For example, our `senter` produces some sentences that are just a single number or a heading.
>   * A simple start would be to drop sentences under a certain length.
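> * There may be some merit to similarity scores, but in the test above the neutral hypothesis scored highest (0.46) and the contradictory one lowest (0.14). I think we are still far too likely to get contradictory sentences that score as very similar, and very dissimilar sentences that are simply unrelated (neutral) rather than contradictory.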
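
The length-based cleanup suggested in the notes could start as something like the sketch below. It works on plain strings, and everything in it is an assumption for illustration: the names `is_probable_sentence` and `filter_sentences`, the `MIN_TOKENS` threshold, and the rule that a sentence needs at least one alphabetic word are not from the notebook and would need tuning against the corpus.

```python
import re

# Hypothetical threshold; pick a value by inspecting the corpus.
MIN_TOKENS = 4


def is_probable_sentence(text, min_tokens=MIN_TOKENS):
    """Heuristically reject fragments such as lone numbers or short headings."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False
    # Require at least one word with two or more letters, so that
    # rows made up only of numbers are dropped as well.
    return any(re.search(r"[A-Za-z]{2,}", tok) for tok in tokens)


def filter_sentences(texts, min_tokens=MIN_TOKENS):
    """Keep only the strings that pass the heuristic, preserving order."""
    return [t for t in texts if is_probable_sentence(t, min_tokens)]
```

Applied to the notebook's data, this might look like `d0_sents = [s for s in d0_sents if is_probable_sentence(s.text)]`. Whitespace splitting is used instead of spaCy tokens only to keep the sketch free of dependencies.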