# NLP Unit 4

Unit 4 consists of the following parts:
* Finish Word Embeddings from Unit 3: start at slide 30
* Look at slides about important basic NLP tasks, such as POS tagging, NER, and parsing, using these slides: https://www.slideshare.net/GirishKhanzode/nlp-52218202
* In combination with the slides, we will also use spaCy to practice these tasks
* **This unit is organized around this ipynb notebook, and we switch back and forth between it and the slides**

# spaCy

* spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
* NLTK is geared more towards education and is considerably older.
* spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.


## Resources:
### Resource 1: Web site:  https://spacy.io/
### Resource 2: Basic usage and functions: https://spacy.io/usage/spacy-101
### Resource 3: Course:  very nice course found here: https://course.spacy.io/

## Installation:

In [1]:
# if not installed yet
!pip3 install spacy
!python3 -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [1]:
import spacy
nlp = spacy.load("en_core_web_md")

<font color='blue'>


### Slides: Basic NLP (repetition) 2-10
While the installation is running, let's look at the slides: 2-10</font>
    


In [57]:
# analyze a piece of text with the model
doc = nlp(u'This is a sentence, which has two parts.')
print(doc.text)
print(doc.lang_)
print(list(doc.sents))



Doc length: 10
This is a sentence, which has two parts.
en
[This is a sentence, which has two parts.]


### Let's just look at the spaCy course: https://course.spacy.io/ Chapter 1

<font color="red">

### First small exercise:
* create an example sentence
* create a spaCy object
* iterate over the tokens and show their part-of-speech tags, whether each token is alphabetic (`is_alpha`), etc.



## spaCy simple example (dependency parse, POS, ..)

In [4]:
docX = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in docX:
    print(token.text, token.pos_, token.dep_, list(token.ancestors), token.lemma_)

Apple PROPN nsubj [looking] Apple
is AUX aux [looking] be
looking VERB ROOT [] look
at ADP prep [looking] at
buying VERB pcomp [at, looking] buy
U.K. PROPN compound [startup, buying, at, looking] U.K.
startup NOUN dobj [buying, at, looking] startup
for ADP prep [buying, at, looking] for
$ SYM quantmod [billion, for, buying, at, looking] $
1 NUM compound [billion, for, buying, at, looking] 1
billion NUM pobj [for, buying, at, looking] billion


### spaCy can be used with many languages. In the simple way shown below, we can use it without statistical ML models. For how to use a better model for Russian, see below.

<font color='red'> Note: these simple language classes only provide tokenization and other rule-based operations; they do not tag or parse

In [5]:
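# NOTE: the Russian class imported here is never instantiated. `nlp` is still the
# English pipeline loaded above, which is why every word in the output below is tagged PROPN.
from spacy.lang.ru import Russian
doc = nlp("Привет Миру! Как твои дела? Сегодня неплохая погода.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_alpha, token.is_punct)

print("")

# Same here: German is imported, but the text is still processed by the English `nlp`.
from spacy.lang.de import German
doc = nlp("Guten Tag, wie heissen Sie denn?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_alpha, token.is_punct)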


Привет Привет PROPN True False
Миру Миру PROPN True False
! ! PUNCT False True
Как Как PROPN True False
твои твои PROPN True False
дела дела PROPN True False
? ? PUNCT False True
Сегодня Сегодня PROPN True False
неплохая неплохая PROPN True False
погода погода PROPN True False
. . PUNCT False True

Guten Guten PROPN True False
Tag Tag PROPN True False
, , PUNCT False True
wie wie PROPN True False
heissen heissen PROPN True False
Sie Sie PROPN True False
denn denn PROPN True False
? ? PUNCT False True
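
The cell above still sends both texts through the English `nlp` object, so the imported language classes are never exercised. As a minimal sketch of a pipeline that really has no statistical model, the following uses `spacy.blank("de")`, which should be equivalent to instantiating the `German` class. The names `nlp_de` and `doc_de` are only illustrative. Without a tagger, `pos_` stays empty, while rule-based attributes such as `is_alpha` and `is_punct` still work.

```python
import spacy

# Blank pipeline: tokenizer and language data only, no tagger, parser, or NER.
nlp_de = spacy.blank("de")
doc_de = nlp_de("Guten Tag, wie heissen Sie denn?")

for token in doc_de:
    # pos_ is an empty string because no tagger has run on this doc
    print(token.text, repr(token.pos_), token.is_alpha, token.is_punct)

# A blank pipeline has no components besides the tokenizer
print(nlp_de.pipe_names)
```

Comparing this with the output above makes the difference visible: the English model assigns a (wrong) PROPN tag to every foreign word, whereas the blank pipeline assigns no tag at all.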


<font color='blue'>
    
# Let's look at the next part of the slides: 11-20

<hr><hr>
    
# Start of Unit 5

<font color='red'>
    
### Quick repetition?
* What is spaCy?
* What did we look at last time with spaCy?

## Let's look at some spaCy features in detail

### Lemmatization

In [6]:
# lemmatization example
docX = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in docX:
    print(token.text, token.lemma_)

Apple Apple
is be
looking look
at at
buying buy
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion


### Part of Speech, Stopwords, is_alpha, ...

In [7]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP Xxxxx True False
is be AUX VBZ xx True True
looking look VERB VBG xxxx True False
at at ADP IN xx True True
buying buy VERB VBG xxxx True False
U.K. U.K. PROPN NNP X.X. False False
startup startup NOUN NN xxxx True False
for for ADP IN xxx True True
$ $ SYM $ $ False False
1 1 NUM CD d False False
billion billion NUM CD xxxx True False


In [8]:
print(spacy.explain("PROPN"))
print(spacy.explain("CD"))

proper noun
cardinal number


### chunking and parsing

<font color='blue'>
    
# Let's look at the dependency parsing slideset


In [9]:
print("Simple chunking:")
list(doc.noun_chunks)

Simple chunking:


[Apple, U.K. startup]

In [10]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Apple PROPN nsubj looking
is AUX aux looking
looking VERB ROOT looking
at ADP prep looking
buying VERB pcomp at
U.K. PROPN compound startup
startup NOUN dobj buying
for ADP prep buying
$ SYM quantmod billion
1 NUM compound billion
billion NUM pobj for


In [58]:
print(spacy.explain("nsubj"))
print(spacy.explain("aux"))
print(spacy.explain("prep"))
print(spacy.explain("pcomp"))
print(spacy.explain("pobj"))
print(spacy.explain("npadvmod"))
print(spacy.explain("case"))

nominal subject
auxiliary
prepositional modifier
complement of preposition
object of preposition
noun phrase as adverbial modifier
case marking


In [12]:
doc = nlp("Joe bought an apple in France this morning.")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Joe PROPN nsubj bought
bought VERB ROOT bought
an DET det apple
apple NOUN dobj bought
in ADP prep bought
France PROPN pobj in
this DET det morning
morning NOUN npadvmod bought
. PUNCT punct bought


<font color="red">

### Small exercise on dependency parsing (Goal: understand and work with *basic sentence structure*)

Write a script that
* has a sentence as input
* it gives you the basic "subject", "predicate (root)" and "object" (direct object) of the sentence
* Output is: Subject: XX, Predicate: XX, Object:XX
* print some more information about S-P-O, print their lemma, if the subject and object have a determiner, if there is a npadvmod for the predicate (root)

In [2]:
import spacy
nlp = spacy.load("en_core_web_md")

In [12]:
doc = nlp("Joe bought an apple in France this morning.")

root = subj = obj = None
for token in doc:
    print(token.text, token.dep_, token.head.text)
    if token.dep_ == "ROOT":
        root = token  # keep the Token itself, not only its text

# Compare Token objects instead of strings, so that a word occurring twice
# in the sentence cannot be mistaken for the root.
for token in doc:
    if token.head == root and token.dep_ == 'nsubj':
        subj = token
    if token.head == root and token.dep_ == 'dobj':
        obj = token

print(f"Subject: {subj}, Predicate: {root}, Object: {obj}")


Joe nsubj bought
bought ROOT bought
an det apple
apple dobj bought
in prep bought
France pobj in
this det morning
morning npadvmod bought
. punct bought
Subject: Joe, Predicate: bought, Object: apple


<font color='blue'>
    
# Let's look at the next part of the slides: 
* Semantics: 20-21
* Pragmatics: 24-25
* Challenges: 26-30
* POS: 47-49
* NER: 50-51


### Named Entities

In [60]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
print("\n")

for token in doc:
    print(token.text, token.ent_type_)

Doc length: 11
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


Apple ORG
is 
looking 
at 
buying 
U.K. GPE
startup 
for 
$ MONEY
1 MONEY
billion MONEY


In [14]:
spacy.explain("GPE")

'Countries, cities, states'

### Some internals, OOV?, the vector

In [13]:
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print("\n", token.text, token.has_vector, token.vector_norm, token.is_oov)
    print("Length of vector:", len(token.vector), token.vector[:10]) # show only first part of vector


 dog True 7.0336733 False
Length of vector: 300 [-0.40176   0.37057   0.021281 -0.34125   0.049538  0.2944   -0.17376
 -0.27982   0.067622  2.1693  ]

 cat True 6.6808186 False
Length of vector: 300 [-0.15067  -0.024468 -0.23368  -0.23378  -0.18382   0.32711  -0.22084
 -0.28777   0.12759   1.1656  ]

 banana True 6.700014 False
Length of vector: 300 [ 0.20228  -0.076618  0.37032   0.032845 -0.41957   0.072069 -0.37476
  0.05746  -0.012401  0.52949 ]

 afskfsd False 0.0 True
Length of vector: 300 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


<font color="red">

### Exercise 3: using *basic named entities and dependencies*
* Take the text from a random Wikipedia article (copy and paste)
* Split the article into sentences with NLTK
* Print all persons named in the article
* Print all locations named in the article
* Print person names that are subjects (nsubj) according to the dependency parser of the sentences.
* Find out what the sentences that mention both a location and a person are about: what are the
    ROOTs (according to the dependency parser) of those sentences?


In [16]:
## download a book -- take the first 200 lines
import urllib.request  # import the submodule explicitly; `import urllib` alone does not reliably expose it
url = "https://raw.githubusercontent.com/NSkelsey/cvf/master/war_and_peace.txt"
data = urllib.request.urlopen(url) # it's a file-like object and works just like a file
text = [line.decode('utf-8') for line in data]
text = "".join(text[:200]) # keep only the first 200 lines of the book
print(text[:1000])

﻿The Project Gutenberg EBook of War and Peace, by Leo Tolstoy

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: War and Peace

Author: Leo Tolstoy

Posting Date: January 10, 2009 [EBook #2600]
Release Date: April, 2001
[Last updated: August 22, 2012]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***




An Anonymous Volunteer





WAR AND PEACE

By Leo Tolstoy/Tolstoi





BOOK ONE: 1805





CHAPTER I


"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist--I really believe he is Antichrist--I will have nothing more
to do with you an

In [17]:
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)




﻿The Project Gutenberg EBook ORG
War and Peace WORK_OF_ART
Leo Tolstoy PERSON
the Project ORG
eBook ORG
War and Peace

 WORK_OF_ART
Leo Tolstoy PERSON
January 10, 2009 DATE
2600 MONEY
April, 2001 DATE
August 22, 2012 DATE
English LANGUAGE
Leo Tolstoy PERSON
Tolstoi





 PERSON
ONE CARDINAL
1805





 DATE
Genoa GPE
Lucca GPE
the
Buonapartes PERSON
Antichrist PERSON
July, 1805 DATE
Anna Pavlovna
Scherer PERSON
Marya Fedorovna PERSON
Prince Vasili Kuragin PERSON
first ORDINAL
Anna Pavlovna
 PERSON
some days DATE
la
 ORG
St. Petersburg GPE
French LANGUAGE
an evening TIME
tonight TIME
7 CARDINAL
10 CARDINAL
Heavens WORK_OF_ART
French NORP
Anna Pavlovna PERSON
First ORDINAL
one CARDINAL
one CARDINAL
Anna Pavlovna PERSON
English NORP
Wednesday DATE
today DATE
Novosiltsev ORG
Buonaparte PERSON
Anna Pavlovna Scherer PERSON
forty years DATE
Anna Pavlovna PERSON
Austria GPE
Austria GPE
Russia GPE
Europe LOC
one CARDINAL
earth LOC
England GPE
Alexander PERSON
Malta GPE
English 

## Visualize entities and dependencies!

In [16]:
# visualize entities

from spacy import displacy

doc_ent = nlp(u'When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.')

displacy.render([doc_ent], style='ent', jupyter=True)


In [17]:
# visualize dependencies

from spacy import displacy


doc_ent = nlp(u'When Sebastian Thrun started working on self-driving cars at Google '
u'in 2007, few people outside of the company took him seriously.')
displacy.render([doc_ent], style='dep', jupyter=True)

## Token, Span and Doc similarity


### Basic idea:

* As we know, tokens have vectors (embeddings) associated with them (when using a model that includes word vectors, such as en_core_web_md)
* The main spaCy structures Token, Span, and Doc all have vectors assigned
* Doc and Span vectors default to the average of their token vectors
* This **simple** averaging method does not work well for long documents. Why?
* With these vectors we can compute the similarity between such objects using cosine similarity
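
To make the averaging and the cosine similarity concrete, here is a small sketch in plain Python. The three-dimensional vectors in `toy_vectors` are invented for illustration only (the real vectors of en_core_web_md have 300 dimensions), and the helper names `cosine` and `average` are not part of spaCy.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero (e.g. OOV) vectors
    return dot / (norm_u * norm_v)

def average(vectors):
    """Component-wise mean, the default way a Doc or Span vector is built."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Invented toy "embeddings", NOT real spaCy vectors
toy_vectors = {
    "fries": [0.9, 0.1, 0.0],
    "gross": [0.1, 0.9, 0.0],
    "worst": [0.2, 0.8, 0.1],
    "soap":  [0.0, 0.1, 0.9],
}

doc_a = average([toy_vectors["fries"], toy_vectors["gross"]])
doc_b = average([toy_vectors["worst"], toy_vectors["fries"]])
doc_c = average([toy_vectors["soap"]])

sim_related = cosine(doc_a, doc_b)    # two "documents" about bad fries
sim_unrelated = cosine(doc_a, doc_c)  # fries vs. soap
print(sim_related, sim_unrelated)
```

Because every token contributes equally to the mean, word order is lost, and the more tokens a document has, the more the individual contributions blur into each other.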

### Example 1: Similarity between 2 documents

In [62]:
nlp = spacy.load("en_core_web_md")

# Example 1:  Determine semantic similarities
doc1 = nlp(u'the fries were gross')
doc2 = nlp(u'worst fries ever')
doc1.similarity(doc2)


0.7791662980402656

In [63]:

# Example 1b: similarity between two one-word documents
doc1a = nlp(u'large')
doc2a = nlp(u'small')
print(doc1a.similarity(doc2a))

## first Token of both documents
print(doc1a[0].similarity(doc2a[0]))

## Token vs Doc
print(doc1a[0].similarity(doc2a))


0.8343904257366469
0.8343904
0.8343904020331632


### Example 2: Similarity between Tokens

In [22]:
# similarity on the basis of tokens

doc = nlp(u"Apple and banana are similar. Pasta and hippo aren't.")
apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

assert apple.similarity(banana) > pasta.similarity(hippo)
assert apple.has_vector

print(apple.vector)


[-3.6391e-01  4.3771e-01 -2.0447e-01 -2.2889e-01 -1.4227e-01  2.7396e-01
 -1.1435e-02 -1.8578e-01  3.7361e-01  7.5339e-01 -3.0591e-01  2.3741e-02
 -7.7876e-01 -1.3802e-01  6.6992e-02 -6.4303e-02 -4.0024e-01  1.5309e+00
 -1.3897e-02 -1.5657e-01  2.5366e-01  2.1610e-01 -3.2720e-01  3.4974e-01
 -6.4845e-02 -2.9501e-01 -6.3923e-01 -6.2017e-02  2.4559e-01 -6.9334e-02
 -3.9967e-01  3.0925e-02  4.9033e-01  6.7524e-01  1.9481e-01  5.1488e-01
 -3.1149e-01 -7.9939e-02 -6.2096e-01 -5.3277e-03 -1.1264e-01  8.3528e-02
 -7.6947e-03 -1.0788e-01  1.6628e-01  4.2273e-01 -1.9009e-01 -2.9035e-01
  4.5630e-02  1.0120e-01 -4.0855e-01 -3.5000e-01 -3.6175e-01 -4.1396e-01
  5.9485e-01 -1.1524e+00  3.2424e-02  3.4364e-01 -1.9209e-01  4.3255e-02
  4.9227e-02 -5.4258e-01  9.1275e-01  2.9576e-01  2.3658e-02 -6.8737e-01
 -1.9503e-01 -1.1059e-01 -2.2567e-01  2.4180e-01 -3.1230e-01  4.2700e-01
  8.3952e-02  2.2703e-01  3.0581e-01 -1.7276e-01  3.2536e-01  5.4696e-03
 -3.2745e-01  1.9439e-01  2.2616e-01  7.4742e-02  2

### Example 3: Similarity between different object types

In [23]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.32531983166759537


In [24]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


## spaCy with the Russian language

### taken from: https://github.com/buriy/spacy-ru


In [25]:
!pip3 install pymorphy2==0.8



In [26]:
# 1.) go to the folder where the ipynb is located
!mkdir ru2
!git clone -b v2.1 https://github.com/buriy/spacy-ru.git
!cp -r ./spacy-ru/ru2/. ru2/


mkdir: cannot create directory ‘ru2’: File exists
fatal: destination path 'spacy-ru' already exists and is not an empty directory.


In [27]:
import spacy
sample_sentences = "Привет Миру! Как твои дела? Сегодня неплохая погода."
if __name__ == '__main__':
    nlp = spacy.load('ru2')
    nlp.add_pipe(nlp.create_pipe('sentencizer'), first=True)
    doc = nlp(sample_sentences)
    for s in doc.sents:
        print(list(['lemma "{}" from text "{}"'.format(t.lemma_, t.text) for t in s]))

['lemma "привет" from text "Привет"', 'lemma "мир" from text "Миру"', 'lemma "!" from text "!"']
['lemma "как" from text "Как"', 'lemma "твой" from text "твои"', 'lemma "дело" from text "дела"', 'lemma "?" from text "?"']
['lemma "сегодня" from text "Сегодня"', 'lemma "неплохой" from text "неплохая"', 'lemma "погода" from text "погода"', 'lemma "." from text "."']


In [28]:
doc = nlp(sample_sentences)
for s in doc.sents:
    print(list(['pos "{}" from text "{}"'.format(t.pos_, t.text) for t in s]))
    # label the second line as "dep", since it shows dependency labels, not POS tags
    print(list(['dep "{}" from text "{}"'.format(t.dep_, t.text) for t in s]))

['pos "NOUN" from text "Привет"', 'pos "PROPN" from text "Миру"', 'pos "PUNCT" from text "!"']
['dep "ROOT" from text "Привет"', 'dep "appos" from text "Миру"', 'dep "punct" from text "!"']
['pos "ADV" from text "Как"', 'pos "DET" from text "твои"', 'pos "NOUN" from text "дела"', 'pos "PUNCT" from text "?"']
['dep "mark" from text "Как"', 'dep "det" from text "твои"', 'dep "ROOT" from text "дела"', 'dep "punct" from text "?"']
['pos "ADV" from text "Сегодня"', 'pos "ADJ" from text "неплохая"', 'pos "NOUN" from text "погода"', 'pos "PUNCT" from text "."']
['dep "advmod" from text "Сегодня"', 'dep "amod" from text "неплохая"', 'dep "ROOT" from text "погода"', 'dep "punct" from text "."']


In [29]:
doc.ents


(Миру,)

In [30]:
doc[1:4].text

'Миру! Как'

## spaCy internals

In [73]:
nlp = spacy.load("en_core_web_md")
doc = nlp("I love coffee.")

In [75]:
print(doc.vocab)
print("Vocab size:", len(doc.vocab))
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

<spacy.vocab.Vocab object at 0x7f83e7f819e0>
Vocab size: 1340242
hash value: 3197928453018144401
string value: coffee


In [33]:
# lexeme is a vocab entry
lexeme = nlp.vocab['coffee']
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True
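
The relationship between strings, hash values, lexemes, and tokens shown in the cells above can be checked in one place. This is a small sketch that assumes only a blank English pipeline (no trained model); the variable names are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; enough to fill the vocab
doc_blank = nlp_blank("I love coffee.")

# String -> hash and hash -> string go through the shared StringStore
coffee_hash = nlp_blank.vocab.strings["coffee"]
coffee_text = nlp_blank.vocab.strings[coffee_hash]

# A lexeme is the context-independent vocab entry; a token is the word in context.
lexeme = nlp_blank.vocab["coffee"]
token = doc_blank[2]

# Both refer to the same hash, which is what spaCy stores internally instead of the string
print(coffee_text, token.orth == lexeme.orth == coffee_hash)
```

Storing each string once and passing only 64-bit hashes around is what lets Doc, Token, Span, and Lexeme share one vocabulary cheaply.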


In [77]:
# spans are slices of the doc object
from spacy.tokens import Doc, Span

span = Span(doc, 1, 3)
print(span.text)
print(span.label_)

span = Span(doc, 1, 3, label="part 2")
print(span.text)
print(span.label_)

love coffee

love coffee
part 2


# Unit 5/6: spaCy rule-based matching

Here we can define simple patterns, which we can match not only against the text, but also against the annotations (e.g. the POS tags).

This can be used for information extraction tasks.

See: https://course.spacy.io/chapter1, item 10

![alt text](screen1.png "Title")


### Basic idea
* define patterns to match in text 
* the idea is similar to regular expressions, but the patterns are simpler, and they can match not only the text but also POS-tags, etc.
* **in many real-world scenarios, where you don't have a large amount of training data to train an ML model, these rule-based approaches are still applied**

### Simple patterns
* Patterns are lists of dictionaries, and each dictionary describes *one token and its attributes*. Patterns are added to the matcher with the `matcher.add` method.



In [22]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

print(matches)

[(9528407286733565721, 1, 3)]


In [36]:
print(matches)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

[(9528407286733565721, 1, 3)]
iPhone X


In [19]:
def print_matches(pattern_name, pattern, doc):
    matcher = Matcher(nlp.vocab)
    matcher.add(pattern_name, None, pattern)
    matches = matcher(doc)
    # Iterate over the matches
    print("Matches found:")
    for match_id, start, end in matches:
        # Get the matched span
        matched_span = doc[start:end]
        print(matched_span.text)
    

In [84]:
# a pattern spanning 5 tokens
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
doc = nlp("2018 FIFA World Cup: France won! 2014 fifa world cup. The 2010 fifa world cup was in ...")

print_matches('Fifa pattern', pattern, doc)


Matches found:
2018 FIFA World Cup:
2014 fifa world cup.


<font color="red">

### What does this next pattern do?

In [25]:
# multiple conditions for one token
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
doc = nlp("I loved dogs but now I love cats more.")

print_matches('pattern2', pattern, doc)


Matches found:
loved dogs
love cats


### More complex patterns:
* `OP` specifies how often a token pattern may match: "+" matches one or more times, "?" zero or one time, "*" zero or more times, and "!" negates the pattern (the token must not match)
* Available attributes include LEMMA, POS, LOWER, TEXT, ENT_TYPE, and the operator key OP (https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)
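
The "+" operator is not used in the cells of this notebook, so here is a minimal sketch of it. It assumes spaCy 2.2.2 or later, where `matcher.add` accepts the patterns as a list in the second argument, and it uses a blank English pipeline because `LOWER` needs no trained model. The pattern name `VERY_GOOD` is made up.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only; LOWER is a purely lexical attribute
op_matcher = Matcher(nlp_blank.vocab)

# One or more "very" tokens followed by "good"
op_pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
op_matcher.add("VERY_GOOD", [op_pattern])  # assumed: list-style add() of spaCy >= 2.2.2

op_doc = nlp_blank("This is very very good and good")

# Collect the matched texts in a set, since the order of matches is not relied on here
matched_texts = {op_doc[start:end].text for _, start, end in op_matcher(op_doc)}
print(matched_texts)
```

Note that the matcher reports every span that satisfies the pattern, so overlapping matches of different lengths can appear, while the bare "good" at the end is not matched because "+" requires at least one "very".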

In [33]:
# the determiner (token 3) is optional
pattern = [
    {'ENT_TYPE': 'PERSON', 'OP': '!'},  # negation: this token must not be a PERSON entity
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
doc = nlp("I bought a smartphone. Now I'm buying apps. Arnold bought a smartphone. ")

print_matches("buy a pattern", pattern, doc)



Matches found:
I bought a smartphone
'm buying apps


### Example: combine matcher and statistical model information

In [41]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])

doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


### online demo to create patterns visually and test them:
https://explosion.ai/demos/matcher


<font color="red">
    
## Exercise 5: simple matching
Imagine you want to find (for example in all of Wikipedia) which persons and organizations appear in sentences together with "Albert Einstein".
* Write the patterns
* Use the patterns to collect the persons and organizations
* Test on a few example sentences

# Stopped here Unit 5

### PhraseMatcher: fast matching of phrases
* PhraseMatcher works like regular expressions or keyword search, but with access to the tokens!
* Takes Doc objects as patterns
* More efficient and faster than the Matcher
* Great for matching large word lists

In [42]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
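
For the "large word lists" use case mentioned above, a sketch along the following lines may help. It assumes spaCy 2.2.2 or later (list-style `add`) and uses `attr="LOWER"` so that matching ignores case. The term list and the label `DOG_BREED` are invented for illustration. `nlp.make_doc` only tokenizes, which keeps pattern creation cheap when the list is long.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")  # the tokenizer is all the PhraseMatcher needs here

# Match on the lowercase form so that "golden retriever" and "Golden Retriever" both hit
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

terms = ["Golden Retriever", "Border Collie", "Poodle"]  # hypothetical word list
term_patterns = [nlp_blank.make_doc(term) for term in terms]
phrase_matcher.add("DOG_BREED", term_patterns)

breed_doc = nlp_blank("I have a golden retriever, and my neighbour has a border collie.")

# Sort by start index so the result order is deterministic
breed_matches = sorted(phrase_matcher(breed_doc), key=lambda m: m[1])
found = [breed_doc[start:end].text for _, start, end in breed_matches]
print(found)
```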


## spaCy processing pipelines

* We already know the basic pipeline in the nlp() function: tokenization, lemmatization, POS-tagging, dependency parsing, ...
* We can exchange components or add custom ones

![alt text](pipe.png "Title")

![alt text](pipe2.png "Title")

![alt text](pipe3.png "Title")

In [3]:
# show current pipeline
print(nlp.pipe_names)


['tagger', 'parser', 'ner']


In [4]:
print(nlp.pipeline)


[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f6bf63b2f90>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f6bf64ee980>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f6bf64eede0>)]


### custom pipeline components
* will be executed automatically when nlp() is used
* you can update built-in attributes like doc.ents

In [5]:
nlp = spacy.load('en_core_web_sm') 

def compute_length(doc):
    # Do something to the doc here
    print('Doc length:', len(doc))
    return doc

if nlp.has_pipe("compute_length"):
    nlp.remove_pipe("compute_length")
    
nlp.add_pipe(compute_length, after='tagger')
# nlp.add_pipe(component, last=True)
# nlp.add_pipe(component, first=True)
print(nlp.pipe_names)

['tagger', 'compute_length', 'parser', 'ner']


In [8]:
doc = nlp("This is a test.")

Doc length: 5
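
Conceptually, a pipeline component is just a callable that receives a Doc and returns a Doc, and `nlp(text)` tokenizes the text and then applies the components in order. The sketch below imitates that loop by hand instead of calling `nlp.add_pipe`, so it does not depend on the spaCy version. `run_pipeline` and `count_punct` are hypothetical helpers, not spaCy functions, and `doc.user_data` is used only as a convenient place to store results.

```python
import spacy

nlp_blank = spacy.blank("en")

def compute_length(doc):
    # Store the number of tokens on the doc instead of printing it
    doc.user_data["length"] = len(doc)
    return doc

def count_punct(doc):
    # Count punctuation tokens using the rule-based is_punct flag
    doc.user_data["n_punct"] = sum(token.is_punct for token in doc)
    return doc

def run_pipeline(text, components):
    """Hand-written imitation of what nlp(text) does internally."""
    doc = nlp_blank.make_doc(text)   # step 1: tokenize
    for component in components:     # step 2: apply each component in order
        doc = component(doc)
    return doc

piped_doc = run_pipeline("This is a test.", [compute_length, count_punct])
print(piped_doc.user_data["length"], piped_doc.user_data["n_punct"])
```

This also shows why every component must return the doc: the return value of one component is the input of the next.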


 <font color="red">
    
## Exercise 6: simple pipeline extensions
* write a function that checks if the document is in English language (doc.lang_)
* if the document is not in English, print "wrong language, use another pipeline"
* add another function to the pipeline that checks whether the document contains a phrase of two consecutive nouns. If such phrases exist, print them.

In [21]:
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'POS': 'NOUN'}, {'POS': 'NOUN'}])


def check_english(doc):

    if not doc.lang_ == "en":
        print("Wrong Language!")
    return doc

def check_dog(doc):

    for match_id, start, end in matcher(doc):
        print(doc[start:end])
    return doc

nlp.add_pipe(check_english, after='ner')
nlp.add_pipe(check_dog, after='check_english')

doc = nlp("This is a test document retriever.")


test document
document retriever


## setting custom attributes, properties and methods

### Attribute extensions
* for the Doc, Span and Token objects
* set attributes for each of these objects
* for example: doc.\_.title = "bla"  ... the ._. is used to distinguish custom and built-in attributes

In [22]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)


In [23]:
def set_color(doc):    
    doc._.title = 'My document'
    doc[0]._.is_color = False # token
    doc[0:2]._.has_color = True # span
    return doc

    
nlp.add_pipe(set_color, after='parser')
print(nlp.pipe_names)

['tagger', 'parser', 'set_color', 'ner', 'check_english', 'check_dog']


In [50]:
doc = nlp("The sky is blue.")
print(doc._.title)
# print(doc._.is_color) # throws an exception!
print(doc[0]._.is_color)
print(doc[1]._.is_color)
print(doc[0:2]._.has_color)
print(doc[0:3]._.has_color)

Doc length: 5
My document
False
False
True
False


In [27]:
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token, Span

  
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'POS': 'NOUN'}, {'POS': 'NOUN'}])

# register the attribute before it is used; force=True allows re-running this cell
Span.set_extension('noun_phrase', default=False, force=True)



def check_english(doc):

    if not doc.lang_ == "en":
        print("Wrong Language!")
    return doc

def check_dog(doc):

    for match_id, start, end in matcher(doc):
        print(doc[start:end])
        doc[start:end]._.noun_phrase = True

    return doc

nlp.add_pipe(check_english, after='ner')
nlp.add_pipe(check_dog, after='check_english')

doc = nlp("This is a test document retriever.")
print(doc[3:5]._.noun_phrase)
print(doc[0:2]._.noun_phrase)

test document
document retriever
True
False


 <font color="red">

## Exercise 7: simple attribute
* extend the solution of exercise 6, add a Span attribute "noun_phrase" to the respective span
* Test with a few example docs if it works 

### Property extensions
* getter (and optionally setter) functions are called
* use getter functions especially when you want to set attributes on a Span!
* As I (Gerhard) understand it, these functions are not called in the nlp() function, but on-demand when the property is accessed
* The function does not modify the doc object


In [51]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color1', getter=get_is_color)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color1, '-', doc[3].text)

Doc length: 5
True - blue


In [52]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color1', getter=get_has_color)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color1, '-', doc[1:4].text)
print(doc[0:2]._.has_color1, '-', doc[0:2].text)

Doc length: 5
True - sky is blue
False - The sky
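
The on-demand behaviour of getters noted above can be checked with a call counter. This is a minimal sketch under the assumption that a blank English pipeline is sufficient; the extension name `is_color_demo` and the `calls` dictionary are made up, and `force=True` is passed only so that the cell can be re-run without an error.

```python
import spacy
from spacy.tokens import Token

calls = {"n": 0}  # counts how often the getter actually runs

def get_is_color_demo(token):
    calls["n"] += 1
    return token.text in ("red", "yellow", "blue")

Token.set_extension("is_color_demo", getter=get_is_color_demo, force=True)

nlp_blank = spacy.blank("en")
color_doc = nlp_blank("The sky is blue.")
calls_after_nlp = calls["n"]        # the getter has not run during nlp(...)

first = color_doc[3]._.is_color_demo
second = color_doc[3]._.is_color_demo
calls_after_access = calls["n"]     # one call per attribute access

print(calls_after_nlp, first, second, calls_after_access)
```

The value is recomputed on every access and nothing is stored on the doc, which is why a getter also works on a freshly created Span slice.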


### Finally: Method extensions
* Assign a function that becomes available as an object method
* Lets you pass arguments to the extension function
* Again (as I understand), the function is called only on-demand

In [53]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

Doc length: 5
True - blue
False - cloud


### spaCy https://course.spacy.io/chapter4 discusses how to change the machine learning models 
* it's a very simple process
* but we don't have time to cover it here

## End of spaCy notebook