# Smarter conversations using spaCy

Created on March 21st 2020 and updated in August 2021 by Patrick Rotzetter

https://www.linkedin.com/in/rotzetter/

**Small experiment for automatic order mail processing**




# Pre-requisites

Let us import the required libraries and check that the prerequisites are met.

In [1]:
# Import libraries
import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher, PhraseMatcher
from spacy.symbols import nsubj, VERB, dobj, NOUN, root, xcomp
from spacy import displacy
from pathlib import Path
import random

In [2]:
# Let us confirm we have the right Python and spaCy versions, we never know!
from platform import python_version
print(python_version())
!pip show spacy

3.8.11
Name: spacy
Version: 3.1.1
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /Users/patrickrotzetter/opt/anaconda3/lib/python3.8/site-packages
Requires: catalogue, typer, blis, wasabi, pydantic, spacy-legacy, preshed, packaging, numpy, srsly, pathy, murmurhash, requests, cymem, setuptools, tqdm, jinja2, thinc
Required-by: spacy-transformers, en-core-web-trf, en-core-web-sm, en-core-web-lg


In [3]:
# validate libraries and models are well installed
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation:
/Users/patrickrotzetter/opt/anaconda3/lib/python3.8/site-packages/spacy

NAME              SPACY            VERSION
en_core_web_trf   >=3.1.0,<3.2.0   3.1.0   ✔
en_core_web_sm    >=3.1.0,<3.2.0   3.1.0   ✔
en_core_web_lg    >=3.1.0,<3.2.0   3.1.0   ✔
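
The SPACY column in the table above is a version range that the installed spaCy version has to fall into. As a rough illustration of what such a check involves, here is a minimal sketch in plain Python. The helper names `parse_version` and `satisfies` are hypothetical (they are not part of spaCy), and the sketch only handles plain numeric versions.

```python
# Hypothetical helpers, not part of spaCy: check a version such as "3.1.1"
# against a range such as ">=3.1.0,<3.2.0". Pre-release tags are not handled.

def parse_version(version: str) -> tuple:
    """Turn '3.1.1' into (3, 1, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies(version: str, spec: str) -> bool:
    """Return True if `version` meets every comma-separated constraint in `spec`."""
    operators = {
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        "==": lambda a, b: a == b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    current = parse_version(version)
    for constraint in spec.split(","):
        constraint = constraint.strip()
        # Try two-character operators first so ">=" is not read as ">"
        for symbol in (">=", "<=", "==", ">", "<"):
            if constraint.startswith(symbol):
                bound = parse_version(constraint[len(symbol):])
                if not operators[symbol](current, bound):
                    return False
                break
        else:
            raise ValueError(f"Unsupported constraint: {constraint!r}")
    return True


# The installed spaCy version reported above, checked against the models' range
print(satisfies("3.1.1", ">=3.1.0,<3.2.0"))
# A version outside the range, for comparison
print(satisfies("3.2.0", ">=3.1.0,<3.2.0"))
```

In practice `python -m spacy validate` remains the tool to use; the sketch only shows why 3.1.1 is accepted for models that require `>=3.1.0,<3.2.0`.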



In [4]:
# load spacy model, in this case we will use the transformer model

import en_core_web_trf
nlp = en_core_web_trf.load()

In [5]:
# Read the mail file that contains the possible dialog content
with open('ordermail.txt') as mail_file:
    text = mail_file.read().replace('\n', ' ')

print(text)


Hello,  I would like to order a notebook with 16GB and 256 GB disk, I would like to spend less than 1000 Francs, what would be the options  Thanks a lot  Patrick
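
Replacing each newline with a space leaves double spaces wherever the mail contained blank lines, and these runs show up as separate whitespace tokens in the dependency output further down. One possible alternative is to collapse all whitespace before processing. This is a minimal sketch: `normalize_whitespace` is a hypothetical helper and `raw_mail` is an invented stand-in for the file content. Whether the parse changes as a result has not been tested here.

```python
def normalize_whitespace(raw: str) -> str:
    """Collapse any run of spaces, tabs or newlines into a single space."""
    # str.split() without arguments splits on any whitespace run
    # and drops leading and trailing whitespace.
    return " ".join(raw.split())


# Invented stand-in for the content of the mail file
raw_mail = "Hello,\n\nI would like to order a notebook\nThanks a lot\n\nPatrick\n"

clean_mail = normalize_whitespace(raw_mail)
print(clean_mail)
```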


# Processing the text content with spacy

## Standard Named Entities Recognition

Let us start with named entity recognition and see what spaCy can detect.

In [6]:
# Process the mail through the standard spaCy pipeline
docMail=nlp(text)


In [7]:
# Print the detected entities and their labels
for ent in docMail.ents:
    print(ent.text, ent.label_)


16GB QUANTITY
256 GB QUANTITY
less than 1000 Francs MONEY
Patrick PERSON


We can also visualize the result directly in the text with highlighted entities.

In [8]:
#Let us visualize the result directly in the text
displacy.render(docMail, style='ent', minify=True)

## Domain specific entities

The default model does not detect ‘notebook’ and ‘disk’ as entities, but it identifies the sender as a person and the RAM and disk sizes as quantities. This is a good start, but still far from a practical solution, so let us add some domain-specific entities that will help us later on.

In [9]:
# add domain specific entities and add to the pipeline
patterns = [{"label": "CURRENCY", "pattern":  [{"lower": "francs"}]},
            {"label": "PART", "pattern":  [{"lower": "disk"}]}]


config = {
   "phrase_matcher_attr": None,
   "validate": True,
   "overwrite_ents": True,
   "ent_id_sep": "||",
}
ruler = nlp.add_pipe('entity_ruler', config=config)


In [10]:
ruler.add_patterns(patterns)

In [11]:
# Process the mail again with the added entities
docMail = nlp(text)
for ent in docMail.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)



16GB QUANTITY
256 GB QUANTITY
disk PART
Francs CURRENCY
Patrick PERSON


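#### Now the results look a bit better: the pipeline detects ‘disk’ and ‘Francs’, for example. Note that ‘Francs’ has replaced the earlier MONEY entity ‘less than 1000 Francs’, because the ruler is configured to overwrite overlapping entities.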
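
The rule-based part can also be tried on its own, without a trained model. The sketch below assumes a blank English pipeline (tokenizer only) and an invented example sentence. The names `nlp_rules` and `ruler_only` are chosen here to avoid clashing with the notebook's `nlp` and `ruler`.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp_rules = spacy.blank("en")

# Add an entity ruler with the same two patterns as above
ruler_only = nlp_rules.add_pipe("entity_ruler")
ruler_only.add_patterns([
    {"label": "CURRENCY", "pattern": [{"LOWER": "francs"}]},
    {"label": "PART", "pattern": [{"LOWER": "disk"}]},
])

# Invented example sentence; LOWER makes the match case-insensitive
doc_rules = nlp_rules("A 256 GB disk for less than 1000 Francs")
found = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(found)
```

Without the statistical NER, only the two rule-based labels are found, which makes it easy to test patterns in isolation before adding them to the full pipeline.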

In [15]:
displacy.render(docMail, style='ent', minify=True)

## Matching some specific patterns

#### Sometimes matching entities is not enough. For example, the mail asks for 16 GB of RAM, so let us see how to detect the memory size automatically.

In [12]:
matcher = PhraseMatcher(nlp.vocab)
terms = ["16 GB", "256 GB"]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(t) for t in terms]
# spaCy v3 signature: add(key, list_of_patterns)
matcher.add("MEMORY", patterns)

doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

16GB
256 GB


#### Quite cool: it detected the patterns and matched the text related to memory size. Unfortunately, we still do not know what each size refers to, so we need to start a different kind of analysis.
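
Before moving on, note that the phrase list above only matches the two sizes that were typed in. A token-based `Matcher` pattern could generalize this to any number followed by a unit. The following is a minimal sketch on a blank pipeline. The unit list and the example sentence are assumptions made for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline is enough for token-attribute patterns
nlp_tok = spacy.blank("en")

size_matcher = Matcher(nlp_tok.vocab)
# A number-like token followed by a storage unit (unit list is an assumption)
size_pattern = [{"LIKE_NUM": True}, {"LOWER": {"IN": ["mb", "gb", "tb"]}}]
size_matcher.add("MEMORY", [size_pattern])

# Invented sentence with a size that is not in the phrase list above
doc_sizes = nlp_tok("a notebook with 16 GB and 512 GB disk")

# Sort by start position so the sizes come out in reading order
found_matches = sorted(size_matcher(doc_sizes), key=lambda m: m[1])
sizes = [doc_sizes[start:end].text for _, start, end in found_matches]
print(sizes)
```

Like the phrase matcher, this finds the sizes but still does not say whether a size refers to RAM or to the disk.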

## Dependency Parsing: Identify verbs, modifiers and objects

One of spaCy's key strengths is its set of linguistic, predictive features: it predicts which tag or label most likely applies in a specific context.

Let us start by displaying the result of part-of-speech tagging and dependency analysis. As we can see below, the code is pretty simple.

In [13]:
for token in docMail:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

Hello intj be VERB [,]
, punct Hello INTJ []
  punct be VERB []
I nsubj like VERB []
would aux like VERB []
like ccomp be VERB [I, would, order]
to aux order VERB []
order xcomp like VERB [to, notebook]
a det notebook NOUN []
notebook dobj order VERB [a, with]
with prep notebook NOUN [GB, disk]
16 nummod GB NOUN []
GB pobj with ADP [16, and, GB]
and cc GB NOUN []
256 nummod GB NOUN []
GB conj GB NOUN [256]
disk pobj with ADP []
, punct be VERB []
I nsubj like VERB []
would aux like VERB []
like ccomp be VERB [I, would, spend]
to aux spend VERB []
spend xcomp like VERB [to, Francs]
less amod 1000 NUM []
than quantmod 1000 NUM []
1000 nummod Francs NOUN [less, than]
Francs dobj spend VERB [1000]
, punct be VERB []
what attr be VERB []
would aux be VERB []
be ROOT be VERB [Hello,  , like, ,, like, ,, what, would, options,  , Thanks]
the det options NOUN []
options attr be VERB [the]
  punct be VERB []
Thanks attr be VERB [lot]
a det lot NOUN []
lot npadvmod Thanks NOUN [a]
  ROOT   SPACE 

In [14]:
displacy.render(docMail, style="dep", minify=True, jupyter=True)

#### The result is quite impressive: it shows all predicted tags for each word and the dependency tree with the associated dependency labels. For example, ‘I’ is a pronoun and is the subject of the verb ‘like’.

#### Let us detect the numerical modifiers, as we will need them to identify the memory size required

In [16]:
for token in docMail:
    if token.dep_ == 'nummod':
        print(f"Numerical modifier: {token.text} --> object: {token.head}")

Numerical modifier: 16 --> object: GB
Numerical modifier: 256 --> object: GB
Numerical modifier: 1000 --> object: Francs
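
To see what `token.head` gives us without loading a model, one can build a small `Doc` by hand. The sketch below assumes a simplified, hand-written annotation of the phrase "order a notebook with 16 GB" (it is not the parser's output). The helper `nearest_direct_object` is hypothetical: it climbs the tree from a token until it reaches a direct object.

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated toy example: heads are absolute token indices,
# and the root token points to itself.
vocab = spacy.blank("en").vocab
words = ["order", "a", "notebook", "with", "16", "GB"]
heads = [0, 2, 0, 2, 5, 3]
deps = ["ROOT", "det", "dobj", "prep", "nummod", "pobj"]
toy_doc = Doc(vocab, words=words, heads=heads, deps=deps)

# Same test as in the cell above: numerical modifiers and their heads
pairs = [(t.text, t.head.text) for t in toy_doc if t.dep_ == "nummod"]
print(pairs)


def nearest_direct_object(token):
    """Climb from `token` towards the root and return the first 'dobj' found."""
    current = token
    while current.head.i != current.i:  # the root is its own head
        current = current.head
        if current.dep_ == "dobj":
            return current
    return None


# Which ordered item does the number 16 belong to in this toy tree?
owner = nearest_direct_object(toy_doc[4])
print(owner.text if owner is not None else None)
```

On a real parse the path from a number to the item it describes can look different, so this climb is only a starting point, not a general solution.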


#### This is again quite cool: we can associate quantities with different words in the text.

## Identifying the action verbs

spaCy provides all the tagging required to find the action verbs. We want to know, for example, whether the customer wants to order something or is just interested in some information. Let us iterate through all tokens in the text and search for an open clausal complement, ‘xcomp’ (see https://spacy.io/api/annotation#pos-tagging for all possible dependency tags).



In [19]:
verbs = set()
for possible_verb in docMail:
    # Keep open clausal complements whose head is a verb
    if possible_verb.dep == xcomp and possible_verb.head.pos == VERB:
        verbs.add(possible_verb)
print(verbs)

{spend, order}


#### We have now identified ‘spend’ and ‘order’ as possible actions in the text. We can do the same to find the objects or items in the text that the client refers to.

## Identifying items

Let us find possible items in the text using the dependency tag ‘dobj’ for direct objects of a verb.

In [20]:
items = set()
for possible_item in docMail:
    if possible_item.dep == dobj and possible_item.head.pos == VERB:
        items.add(possible_item)
print(items)

{notebook, Francs}


#### ‘Francs’ and ‘notebook’ have been found. Now we can think of using word similarities to find what kind of item the client is referring to. We could also use other techniques, but let us try a simple way for now. We will compare the similarity of each identified object to the word ‘laptop’: ‘notebook’ turns out to be much closer to ‘laptop’ than ‘Francs’ is.

In [35]:
import en_core_web_lg
# We use the large model because we need word vectors to compute similarities
nlpwithvectors = en_core_web_lg.load()
orderobject = nlpwithvectors("laptop")
for sub in items:
    print(f"Similarity of {sub.text} to laptop is {nlpwithvectors(sub.text).similarity(orderobject):.4f}")


Similarity of notebook to laptop is 0.8022
Similarity of Francs to laptop is 0.0016
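
By default, spaCy's `similarity` is the cosine similarity between the two vectors. The plain-Python sketch below shows that computation on made-up three-dimensional vectors. The real `en_core_web_lg` vectors have far more dimensions and different values, so the numbers here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT taken from en_core_web_lg
toy_vectors = {
    "laptop": [0.9, 0.1, 0.3],
    "notebook": [0.8, 0.2, 0.4],
    "Francs": [-0.1, 0.9, -0.2],
}

for word in ("notebook", "Francs"):
    score = cosine_similarity(toy_vectors[word], toy_vectors["laptop"])
    print(f"Similarity of {word} to laptop is {score:.4f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0, which is the pattern seen in the output above.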


#### Finally, putting it together, we can think of automatically detecting the required action verb using a heuristic. Let us assume that a verb whose similarity to ‘order’ is at least 0.8 is the right verb. We then search for the direct objects of that verb. That could look like this:

In [36]:
orderword = nlpwithvectors("order")
for verb in verbs:
    # Keep only verbs that are close enough to 'order'
    if nlpwithvectors(verb.text).similarity(orderword) >= 0.8:
        for v in verb.children:
            if v.dep == dobj:
                print(v.text)


notebook
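
The decision rule itself can be separated from spaCy, which makes the threshold easy to experiment with. This is a minimal sketch in plain Python. The function `items_for_intent` is hypothetical, and the similarity of 0.35 for ‘spend’ is a made-up value, not one computed by the model.

```python
from typing import Dict, List


def items_for_intent(verb_objects: Dict[str, List[str]],
                     similarity_to_intent: Dict[str, float],
                     threshold: float = 0.8) -> List[str]:
    """Return the direct objects of every verb whose similarity reaches the threshold."""
    selected = []
    for verb, objects in verb_objects.items():
        # Verbs without a score are treated as not similar at all
        if similarity_to_intent.get(verb, 0.0) >= threshold:
            selected.extend(objects)
    return selected


# Verbs and direct objects as found in the mail
verb_objects = {"order": ["notebook"], "spend": ["Francs"]}
# 'order' compared with itself scores 1.0; the value for 'spend' is invented
scores = {"order": 1.0, "spend": 0.35}

print(items_for_intent(verb_objects, scores))
```

Lowering the threshold far enough would also let ‘spend’ through, so the cut-off is a design choice to tune on real mails rather than a fixed rule.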
