1 - Part of speech tagging (POS)
2 - NER - named entity recognition 
3 - parsing

# 1 | Spacy

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "John watched an old movie at the cinema."
doc = nlp(s)

## 1 | Part of speech tagging (POS)

In [2]:
[(t.text, t.pos_) for t in doc]

[('John', 'PROPN'),
 ('watched', 'VERB'),
 ('an', 'DET'),
 ('old', 'ADJ'),
 ('movie', 'NOUN'),
 ('at', 'ADP'),
 ('the', 'DET'),
 ('cinema', 'NOUN'),
 ('.', 'PUNCT')]

In [3]:
spacy.explain("DET")

'determiner'

In [10]:
# The POS tags above are called *coarse-grained* tags. You can also access *fine-grained* tags
# through the *tag_* attribute. Fine-grained tags provide more detailed information about a token,
# such as a verb's tense or, if a word is a pronoun, what specific type of pronoun it is.
[(t.text, t.tag_) for t in doc] # So **NNP** refers specifically to a _singular proper noun_, and **VBD** is a _past-tense verb_


[('John', 'NNP'),
 ('watched', 'VBD'),
 ('an', 'DT'),
 ('old', 'JJ'),
 ('movie', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('cinema', 'NN'),
 ('.', '.')]

In [11]:
print(spacy.explain("IN"))

conjunction, subordinating or preposition
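
Looking up tags one at a time gets tedious. As a small sketch, the tags seen in the outputs above can be collected into a glossary with `spacy.explain`. The `tags` list and the `glossary` name are choices made for this example, and the placeholder string covers the case where `spacy.explain` has no entry for a label.

```python
import spacy

# Tags taken from the outputs above; spacy.explain() needs no trained pipeline.
tags = ["PROPN", "DET", "ADP", "NNP", "VBD", "DT", "JJ", "NN", "IN"]

# spacy.explain() returns None for labels missing from its glossary,
# so fall back to a placeholder string in that case.
glossary = {tag: spacy.explain(tag) or "(no description)" for tag in tags}

for tag, description in glossary.items():
    print(f"{tag:>5}  {description}")
```

The same lookup works for dependency labels and entity labels, which come up in the following sections.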


## 2 | NER - named entity recognition 

In [12]:
# spaCy is bundled with visualizers for both parsing and named entities
# https://spacy.io/usage/visualizers



s = "Volkswagen is developing an electric sedan which could potentially come to America next fall."
doc = nlp(s)

[(t.text, t.ent_type_) for t in doc]

[('Volkswagen', 'ORG'),
 ('is', ''),
 ('developing', ''),
 ('an', ''),
 ('electric', ''),
 ('sedan', ''),
 ('which', ''),
 ('could', ''),
 ('potentially', ''),
 ('come', ''),
 ('to', ''),
 ('America', 'GPE'),
 ('next', 'DATE'),
 ('fall', 'DATE'),
 ('.', '')]

In [13]:
# Keep only the tokens that are part of an entity (ent_type_ is an empty string otherwise).
print([(t.text, t.ent_type_) for t in doc if t.ent_type_])

[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next', 'DATE'), ('fall', 'DATE')]


In [15]:
[ent for ent in doc.ents]

[Volkswagen, America, next fall]

In [14]:
# positions of entities.
print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

[('Volkswagen', 'ORG', 0, 10), ('America', 'GPE', 75, 82), ('next fall', 'DATE', 83, 92)]
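
The character offsets index into the original string, so they can be used to slice the text or to build entity spans from external annotations. The sketch below goes in the reverse direction: it starts from the offsets printed above and turns them into entities with `Doc.char_span`. It uses a blank English pipeline (tokenizer only) so it does not depend on a trained model; the `annotations` list is simply the output above rewritten as tuples.

```python
import spacy

# Tokenizer-only pipeline; no statistical NER is involved here.
nlp_blank = spacy.blank("en")
s = "Volkswagen is developing an electric sedan which could potentially come to America next fall."
doc_offsets = nlp_blank(s)

# (start_char, end_char, label) triples copied from the output above.
annotations = [(0, 10, "ORG"), (75, 82, "GPE"), (83, 92, "DATE")]

# char_span() returns None when the offsets do not line up with token boundaries.
spans = [doc_offsets.char_span(start, end, label=label) for start, end, label in annotations]
doc_offsets.ents = [span for span in spans if span is not None]

# Slicing the raw string with the offsets gives back the entity text.
ent_rows = [(ent.text, ent.label_, s[ent.start_char:ent.end_char]) for ent in doc_offsets.ents]
print(ent_rows)
```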


In [16]:
from spacy import displacy

# We set the 'jupyter' argument to True in order to output
# the visualization directly. Otherwise, you'll get raw HTML.
displacy.render(doc, style='ent', jupyter=True)

In [17]:
# For domain-specific corpora, an NER tagger may need to be further fine-tuned. Here, 
# we may want _The Martian_ tagged as a "FILM" (assuming that's our goal).

s = "Ridley Scott directed The Martian."
doc = nlp(s)
displacy.render(doc, style='ent', jupyter=True)
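
Before fine-tuning a model, one lightweight way to get a custom label is a rule-based component. The sketch below assumes spaCy v3 and uses its built-in `entity_ruler` on a blank pipeline; the `FILM` label and the single pattern are illustrative choices, not part of any pretrained model.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical NER.
nlp_rules = spacy.blank("en")

# spaCy v3 style: add the built-in rule-based component by its factory name.
ruler = nlp_rules.add_pipe("entity_ruler")

# "FILM" is a custom label chosen for this example, not a built-in one.
ruler.add_patterns([{"label": "FILM", "pattern": "The Martian"}])

doc_rules = nlp_rules("Ridley Scott directed The Martian.")
film_ents = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(film_ents)
```

With a pretrained pipeline, the ruler is usually added with `before="ner"` so that the rules claim their spans before the statistical model runs. A fixed list of titles does not generalize to unseen films, which is where fine-tuning comes in.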

## 3 | Parsing

In [18]:
s = "She enrolled in the course at the university."
doc = nlp(s)

# Note the 'style' argument is assigned a 'dep' flag this time around.
displacy.render(doc, style='dep', jupyter=True)

In [19]:
spacy.explain('nsubj')

'nominal subject'

In [20]:
[(t.text, t.dep_) for t in doc]

[('She', 'nsubj'),
 ('enrolled', 'ROOT'),
 ('in', 'prep'),
 ('the', 'det'),
 ('course', 'pobj'),
 ('at', 'prep'),
 ('the', 'det'),
 ('university', 'pobj'),
 ('.', 'punct')]

In [21]:
# But the labels above don't show how the words are related to each other (the arcs). To get a better idea, you can print the head of each dependency.
[(t.text, t.dep_, t.head.text) for t in doc]

[('She', 'nsubj', 'enrolled'),
 ('enrolled', 'ROOT', 'enrolled'),
 ('in', 'prep', 'enrolled'),
 ('the', 'det', 'course'),
 ('course', 'pobj', 'in'),
 ('at', 'prep', 'course'),
 ('the', 'det', 'university'),
 ('university', 'pobj', 'at'),
 ('.', 'punct', 'enrolled')]
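
Because every token points to its head, the arcs can be followed programmatically. The sketch below rebuilds the parse shown above by hand (so it runs without a trained parser; the `heads` list holds the index of each token's head, copied from the output) and walks from a token up to the root. `path_to_root` is a helper written for this example, not a spaCy function.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# The parse from the output above, written out by hand.
words = ["She", "enrolled", "in", "the", "course", "at", "the", "university", "."]
heads = [1, 1, 1, 4, 2, 4, 7, 5, 1]  # index of each token's head
deps = ["nsubj", "ROOT", "prep", "det", "pobj", "prep", "det", "pobj", "punct"]
doc_manual = Doc(vocab, words=words, heads=heads, deps=deps)


def path_to_root(token):
    """Follow head links until reaching the root, whose head is itself."""
    path = [token.text]
    while token.head.i != token.i:
        token = token.head
        path.append(token.text)
    return path


university_path = path_to_root(doc_manual[7])
print(" -> ".join(university_path))

# Direct dependents of the root verb, in sentence order.
root_children = [child.text for child in doc_manual[1].children]
print(root_children)
```

On a doc produced by `nlp(s)` the same helper works unchanged, since it only relies on `token.head` and `token.i`.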

## 4 | Matchers to find patterns

In [23]:
# The general Matcher is one of multiple matcher objects
# included with spaCy.
from spacy.matcher import Matcher

# We initialize the Matcher with the pipeline's shared vocab object, which stores
# the strings and lexical attributes that patterns are matched against.
matcher = Matcher(nlp.vocab)

s = "I want to book a hotel room."
doc = nlp(s)

# Patterns are expressed as an ordered sequence of token descriptions. Here, we're
# looking for the exact token text 'book', followed by a determiner (DET) POS tag,
# then a noun (NOUN) POS tag.
# The OP key is a quantifier: it controls how many times a token description may match.

# Here, the DET POS (marked with '?') will match 0 or 1 times, and
# the NOUN POS (marked with '+') will match 1 or more times.
# See this link for more information:
# https://spacy.io/usage/rule-based-matching#quantifiers
pattern = [
  {'TEXT': 'book'}, # first token: the exact text 'book'
  {'POS': 'DET', 'OP': '?'}, # then an optional determiner
  {'POS': 'NOUN', 'OP': '+'}, # then one or more nouns
]

# We give our pattern a label and pass it to the matcher.
matcher.add('USER_INTENT', [pattern])

# Run the matcher over the doc.
matches = matcher(doc)

# For each match, the matcher returns a tuple specifying a match id, start, 
# and end of the match.
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['book a hotel', 'book a hotel room']


The code above demonstrates the Matcher but is brittle.
- What if "book" is capitalized?
- What if a user types "reserve" instead of "book"?
- How can we match on "hotel room" as a compound noun?
- What if a user types "book a flight and hotel room"?

Can you think of how you would handle these cases?
<br><br>
We could come up with more rules to match different patterns. Alternatively, we could search for keywords based on POS tags and entities (e.g. a country), present the user with several possible intentions, and let them choose one. Another option is to have several interpretation functions submit answers and select the most likely one based on what was historically accepted most often. We can also ask clarifying questions to narrow things down.
<br><br>
For example, for the last sentence, you could have a function scan through the **Doc** object's *noun_chunks* (phrases that have a noun as their head) and isolate keywords there along with potential conjunctions (e.g. "and").<br>
https://spacy.io/usage/linguistic-features#noun-chunks
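
As one possible way to address the first three questions, the sketch below matches on the lowercase form, accepts a small set of trigger verbs, and keeps only the longest match so that "hotel room" stays together. To keep it independent of a trained model, the POS tags are supplied by hand when building the `Doc`; with a real pipeline they would come from the tagger. The trigger list `["book", "reserve"]` is an assumption made for this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-labeled POS tags stand in for the tagger's predictions.
words = ["Reserve", "a", "hotel", "room", "."]
pos = ["VERB", "DET", "NOUN", "NOUN", "PUNCT"]
doc_intent = Doc(vocab, words=words, pos=pos)

intent_matcher = Matcher(vocab)
pattern = [
    {"LOWER": {"IN": ["book", "reserve"]}},  # case-insensitive trigger verbs
    {"POS": "DET", "OP": "?"},               # optional determiner
    {"POS": "NOUN", "OP": "+"},              # one or more nouns
]

# greedy="LONGEST" keeps only the longest of several overlapping matches.
intent_matcher.add("USER_INTENT", [pattern], greedy="LONGEST")

intent_matches = [doc_intent[start:end].text for _, start, end in intent_matcher(doc_intent)]
print(intent_matches)
```

A hard-coded verb list is still brittle. Matching on lemmas or on a larger synonym set would be the next step, and neither handles the "flight and hotel room" case.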

In [24]:
doc = nlp("I want to book a flight and hotel room in Berlin.")
for noun_phrase in doc.noun_chunks:
  print("phrase: {}, root head: {}".format(noun_phrase, noun_phrase.root.head))

phrase: I, root head: want
phrase: a flight and hotel room, root head: book
phrase: Berlin, root head: in


Using pure rules is a good place to start or prototype (especially if the domain is narrow with a tight set of use cases) but as our requirements get more sophisticated, we'll need to blend in other approaches such as classical models or perhaps deep learning (at the very least, maybe tune existing neural networks). spaCy's models can be updated with more examples to fine-tune predictions.<br>
https://spacy.io/usage/training<br>
<br>
We'll keep learning more approaches as the course progresses.

## 5 |  Talkin' like Yoda

In [26]:
def yodize(s: str):
  doc = nlp(s)
  for t in doc:
    if t.dep_ == "ROOT":

      # Assuming our sentence is of the form subject-verb-object, we take 
      # everything after the root (likely verb) and put it in front, and 
      # likewise take everything before the root, and put it after.
      seq = [doc[t.i + 1: -1].text, doc[0: t.i].text, t.text + '.']
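      # Uppercase only the first character; str.capitalize() would also
      # lowercase the rest and turn "Texas" into "texas".
      seq[0] = seq[0][:1].upper() + seq[0][1:]
      print(' '.join(seq))


yodize("I will fly to Texas.")

To Texas I will fly.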


In [None]:
#
# EXERCISE: Learn how to extend spaCy's NER models. Specifically, how to add new
# entity names and entity types. 
#

In [32]:
#
# EXERCISE: using doc.ents, identify and print the dates in this sentence.
# Expected output: ['Feb 13th', 'Feb 24th']
#
s = "We'll be in Osaka on Feb 13th and leave on Feb 24th."
doc = nlp(s)
[ent.text for ent in doc.ents if ent.label_ == 'DATE']

['Feb 13th', 'Feb 24th']

In [None]:
#
# EXERCISE: Read about spaCy's PhraseMatcher
# https://spacy.io/usage/rule-based-matching#phrasematcher
#
# Using the PhraseMatcher, find the start and end index of all occurrences 
# of 'Caesar Augustus' and 'Roman Empire' (case-insensitive).
#
# Expected output: [(0, 2), (15, 17)]
#
from spacy.matcher import PhraseMatcher
s = "Caesar Augustus was the founder of the Roman Principate (the first phase of the Roman Empire)."
doc = nlp(s)
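
One possible solution sketch for the exercise above, assuming spaCy v3. It uses a blank pipeline because the PhraseMatcher only needs tokenization here, and it sets `attr="LOWER"` so that matching ignores case. The variable names and the `"ROME"` match key are arbitrary choices for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only pipeline; phrase matching on LOWER needs no trained model.
nlp_tok = spacy.blank("en")

# attr="LOWER" compares lowercase token forms, making the match case-insensitive.
phrase_matcher = PhraseMatcher(nlp_tok.vocab, attr="LOWER")
terms = ["caesar augustus", "roman empire"]
phrase_matcher.add("ROME", [nlp_tok.make_doc(term) for term in terms])

text = "Caesar Augustus was the founder of the Roman Principate (the first phase of the Roman Empire)."
doc_rome = nlp_tok(text)

# Each match is (match_id, start, end), with start and end as token indices.
token_spans = sorted((start, end) for _, start, end in phrase_matcher(doc_rome))
print(token_spans)  # [(0, 2), (15, 17)]
```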


# 2 | nltk

## 1 | Part of speech tagging (POS)

## 2 | NER - named entity recognition 

## 3 | parsing