[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W4T_Linguistic_Analysis.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install spacy stanza transformers sentencepiece
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download de_core_news_sm

# Introduction to spaCy and Stanza

*This tutorial is based on the [Advanced NLP with spaCy](https://course.spacy.io/en) course, the [spaCy documentation](https://spacy.io/usage/linguistic-features) and the [Stanza documentation](https://stanfordnlp.github.io/stanza/).*

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do:

- You create a pipeline, or load one of the ones available into an instance, usually called simply `nlp`.
- You put in raw text, and get back a [Doc](https://spacy.io/api/doc) object, that comes with a variety of annotations.

<div>
<img src="https://spacy.io/images/pipeline.svg" alt="Visualizing an NLP pipeline" style="width: 50%">
</div>

A `Doc` is a container for [Token](https://spacy.io/api/token) objects, and any slice of it is a [Span](https://spacy.io/api/span) object. Each token carries its raw text together with annotations such as the part-of-speech tag, the dependency relation, the named entity tag, and more.

<div>
<img src="https://course.spacy.io/doc_span.png" alt="Visualizing tokens, docs and spans" style="width: 40%">
</div>

In [2]:
import spacy

# Load a blank pipeline for English, i.e. only basic lexical information
# like tokenization, is_number, is_alpha, is_punct, etc.
nlp = spacy.blank('en')
print("Pipeline:", nlp.pipe_names)

# Load a small pipeline for English. This includes also a POS tagger, a dependency
# parser and a named entity recognizer trained on labeled data.
nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

def print_attributes(doc_or_span):
    print("Text\tLemma\tUPOS\tXPOS\tDEP\tShape\tIsAlpha\tIsStop\tIsPunct\tIsAscii\n" + "-"*80)
    for token in doc_or_span:
        print(
            f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token.tag_}\t"
            f"{token.dep_}\t{token.shape_}\t{token.is_alpha}\t"
            f"{token.is_stop}\t{token.is_punct}\t{token.is_ascii}"
        )

print_attributes(doc)

Pipeline: []
Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Text	Lemma	UPOS	XPOS	DEP	Shape	IsAlpha	IsStop	IsPunct	IsAscii
--------------------------------------------------------------------------------
Apple	Apple	PROPN	NNP	nsubj	Xxxxx	True	False	False	True
is	be	AUX	VBZ	aux	xx	True	True	False	True
looking	look	VERB	VBG	ROOT	xxxx	True	False	False	True
at	at	ADP	IN	prep	xx	True	True	False	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False	False	True
U.K.	U.K.	PROPN	NNP	dobj	X.X.	False	False	False	True
startup	startup	NOUN	NN	dep	xxxx	True	False	False	True
for	for	ADP	IN	prep	xxx	True	True	False	True
$	$	SYM	$	quantmod	$	False	False	False	True
1	1	NUM	CD	compound	d	False	False	False	True
billion	billion	NUM	CD	pobj	xxxx	True	False	False	True


Attributes extracted:

- **Text:** The original word text.

- **Lemma:** The base form of the word.

- **UPOS:** The simple, coarse-grained UPOS part-of-speech tag (`token.pos_`).

- **XPOS:** The detailed, fine-grained part-of-speech tag (`token.tag_`).

- **DEP:** Syntactic dependency, i.e. the relation between tokens.

- **Shape:** The word shape – capitalization, punctuation, digits.

- **IsAlpha:** Does the token consist of alphabetic characters?

- **IsStop:** Is the token part of a stop list, i.e. the most common words of the language?

- **IsPunct:** Is the token a punctuation mark?

- **IsAscii:** Does the token consist of ASCII characters?

If some of these tags look mysterious to the non-linguists among you, `spacy.explain` returns a short description of a tag or label:

In [3]:
print("quantmod:", spacy.explain("quantmod"))
print("dobj:", spacy.explain("dobj"))
print("VBZ:", spacy.explain("VBZ"))

quantmod: modifier of quantifier
dobj: direct object
VBZ: verb, 3rd person singular present


In [4]:
# A slice from the doc is a Span object

span = doc[1:5]
print_attributes(span)

Text	Lemma	UPOS	XPOS	DEP	Shape	IsAlpha	IsStop	IsPunct	IsAscii
--------------------------------------------------------------------------------
is	be	AUX	VBZ	aux	xx	True	True	False	True
looking	look	VERB	VBG	ROOT	xxxx	True	False	False	True
at	at	ADP	IN	prep	xx	True	True	False	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False	False	True


## Lemmatization

The [Lemmatizer](https://spacy.io/api/lemmatizer) is a pipeline component that provides lookup and rule-based lemmatization methods in a configurable component.

In [5]:
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
if "lemmatizer" not in nlp.pipe_names:
    config = {"mode": "rule"}
    lemmatizer = nlp.add_pipe("lemmatizer", config=config)
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])



rule
['I', 'be', 'read', 'the', 'paper', '.']


The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from [WordNet](https://wordnet.princeton.edu/).

>**💡 Interesting Fact** spaCy v3 introduced a new, experimental, machine learning-based lemmatizer that posts accuracies above 95% for many languages. This lemmatizer learns to predict lemmatization rules from a corpus of examples, removing the need to write an exhaustive set of per-language lemmatization rules. See more in this blog post: [Neural edit-tree lemmatization for spaCy](https://explosion.ai/blog/edit-tree-lemmatizer)

## Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a [Vocab](https://spacy.io/api/vocab) instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.

spaCy’s tokenization is non-destructive, which means that you’ll always be able to reconstruct the original input from the tokenized output. Whitespace information is preserved in the tokens and no information is added or removed during tokenization.
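The two points above, building a `Doc` from words plus `spaces` booleans and reconstructing the input from the tokens, can be checked with a small sketch. It uses only a blank English pipeline, so no trained package is assumed; the example strings are arbitrary.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no trained components

# Build a Doc by hand: one boolean per word says whether a space follows it.
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
manual_doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(repr(manual_doc.text))

# Round trip: joining each token's text with its trailing whitespace gives
# back the exact input string, including the double space.
text = "What's happened to me?  he thought."
doc = nlp(text)
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because whitespace is stored on the tokens rather than discarded, character offsets such as `token.idx` stay aligned with the original string.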

In spaCy, tokenization starts by splitting the text on **whitespace**. Each resulting substring is then processed by a series of splitting steps that handle prefixes, suffixes, infixes and special-case exceptions (e.g. N.Y. and U.K. are kept as single tokens).

<div>
<img src="https://spacy.io/images/tokenization.svg" alt="Tokenization example with spaCy" style="width: 60%">
</div>

Refer to the [spaCy Documentation](https://spacy.io/usage/linguistic-features#tokenization) for more details on how to define custom rules and exceptions.
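To see which of these steps produced each token, the tokenizer offers a debugging helper, `nlp.tokenizer.explain`. The sketch below runs it on an arbitrary example string with a blank English pipeline; the rule names it prints depend on the spaCy version and language data, so none are listed here.

```python
import spacy

nlp = spacy.blank("en")
text = "\"Let's go to N.Y.!\""

# explain() returns (rule_name, token_text) pairs in token order.
explained = nlp.tokenizer.explain(text)
for rule, token_text in explained:
    print(f"{token_text}\t{rule}")

# The explained tokens should line up with the regular tokenizer output.
doc = nlp(text)
explained_tokens = [token_text for _, token_text in explained]
regular_tokens = [token.text for token in doc if not token.is_space]
print(explained_tokens == regular_tokens)
```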

### Creating a custom tokenizer

We now demonstrate how a custom tokenizer can replace the default tokenizer in a spaCy pipeline. We create a `WhitespaceTokenizer` class and swap it in for the original English tokenizer:

In [6]:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print("Before substitution:", [token.text for token in doc])
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print("After substitution:", [token.text for token in doc])

Before substitution: ['What', "'s", 'happened', 'to', 'me', '?', 'he', 'thought', '.', 'It', 'was', "n't", 'a', 'dream', '.']
After substitution: ["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.']


## Morphological features

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a **lemma** (root form) is **inflected** (modified/combined) with one or more **morphological features** to create a surface form. Here are some examples:

| **Context** | **Surface** | **Lemma** | **POS** | **Morphological Features** |
|---|---|---|---|---|
| I was reading the paper | reading | read | VERB | VerbForm=Ger |
| I don’t watch the news, I read the paper | read | read | VERB | VerbForm=Fin, Mood=Ind, Tense=Pres |
| I read the paper yesterday | read | read | VERB | VerbForm=Fin, Mood=Ind, Tense=Past |

Morphological features are stored in the `token.morph` attribute of tokens:

In [7]:
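# NOTE: `nlp` is still the blank pipeline with the WhitespaceTokenizer from the
# previous cell, so no morphological features are assigned in the output below.
# Reload a trained pipeline, e.g. spacy.load("en_core_web_sm"), to get them.
doc = nlp("I was reading the paper.")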

def get_morphological_features(doc_or_span):
    print("Token\tCase\tNumber\tPerson\tPronType\n" + "-"*40)
    for token in doc_or_span:
        print(
            f"{token.text}\t{next(iter(token.morph.get('Case')), '')}\t"
            f"{next(iter(token.morph.get('Number')), '')}\t"
            f"{next(iter(token.morph.get('Person')), '')}\t"
            f"{next(iter(token.morph.get('PronType')), '')}"
        )

get_morphological_features(doc)

Token	Case	Number	Person	PronType
----------------------------------------
I				
was				
reading				
the				
paper.				


For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and **fine-grained part-of-speech** tags to produce **coarse-grained part-of-speech tags** and morphological features. For other languages with a more complex morphological system, spaCy's [`Morphologizer`](https://spacy.io/api/morphologizer) is used instead.
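The `token.morph` accessors used above can be explored without any trained pipeline by annotating a token by hand. This is a minimal sketch assuming spaCy v3, where `Token.set_morph` is available; the feature string is written manually in Universal Dependencies format and is not a model prediction.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built Doc, so the sketch does not depend on a trained pipeline.
doc = Doc(Vocab(), words=["I", "was", "reading"])

# Manually assigned features for the pronoun "I" (illustrative annotation).
doc[0].set_morph("Case=Nom|Number=Sing|Person=1|PronType=Prs")

morph = doc[0].morph
print(str(morph))          # features in FEATS string format
print(morph.to_dict())     # the same features as a dict
print(morph.get("Case"))   # get() returns a list of values for the field
print(morph.get("Tense"))  # ...which is empty when the feature is absent

# This is why the helper above uses next(iter(...), ''):
# it takes the first value if present and falls back to an empty string.
first_tense = next(iter(morph.get("Tense")), "")
print(repr(first_tense))
```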

Let's now try another example using a small German pipeline including the statistical `Morphologizer` component:

In [8]:
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?") # English: 'Where are you?'
get_morphological_features(doc)

Token	Case	Number	Person	PronType
----------------------------------------
Wo				Int
bist		Sing	2	
du	Nom	Sing	2	Prs
?				


## Word vectors and similarity

Remember last week's assignment on finding the most relevant context for a specific question? Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using algorithms like `word2vec` and models like `BERT`, and usually look like this:

```python
array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
```

The medium and large pipeline packages (that is, for example, `en_core_web_md` and `en_core_web_lg` as opposed to `en_core_web_sm`) provide word vectors for the entire vocabulary of the model. Small pipelines do not contain word vectors.

Pipeline packages that come with built-in word vectors make them available as the `Token.vector` attribute. `Doc.vector` and `Span.vector` will default to an average of their token vectors.
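To make the averaging and the comparison concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration, and `average` and `cosine` are hypothetical helpers rather than spaCy functions. spaCy's default similarity is documented as cosine similarity over such (averaged) vectors.

```python
import math

# Made-up 3-dimensional vectors; real pipelines use far more dimensions.
toy_vectors = {
    "river": [0.9, 0.1, 0.0],
    "bank": [0.5, 0.5, 0.1],
    "money": [0.0, 0.9, 0.2],
}

def average(vectors):
    """Element-wise mean of equally sized vectors (a span/doc vector)."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Vector for the two-word span "river bank" as the mean of its token vectors.
river_bank = average([toy_vectors["river"], toy_vectors["bank"]])
sim_river = cosine(river_bank, toy_vectors["river"])
sim_money = cosine(river_bank, toy_vectors["money"])
print(round(sim_river, 3), round(sim_money, 3))
```

With these toy numbers the averaged span vector stays closer to `river` than to `money`, but the `bank` component still pulls it towards `money`.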

Similarly to what we achieved in the previous lesson, we can compute a similarity score between sentences using the `similarity` method:

In [9]:
import spacy

nlp = spacy.load("en_core_web_md")  # make sure the package is installed!
doc1 = nlp("The ship was traveling alonside the river.")
doc2 = nlp("The boat sailed next to the river bank.")
doc3 = nlp("I like to make money.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
print(doc1, "<->", doc3, doc1.similarity(doc3))
print(doc2, "<->", doc3, doc2.similarity(doc3))

# Similarity of tokens and spans
make_money = doc3[3:5]
bank = doc2[7]
river = doc2[6]
print(make_money, "<->", bank, make_money.similarity(bank))
print(make_money, "<->", river, make_money.similarity(river))

The ship was traveling alonside the river. <-> The boat sailed next to the river bank. 0.8442151297581031
The ship was traveling alonside the river. <-> I like to make money. 0.33754224644780834
The boat sailed next to the river bank. <-> I like to make money. 0.573202532701966
make money <-> bank 0.3689677119255066
make money <-> river 0.10360369086265564


Notice in this example how the similarity score is high for the first two sentences because of a clear semantic overlap: both sentences contain `river`, and `sailed`/`traveling` and `boat`/`ship` are probably close in embedding space.

Comparing the first and last examples, the similarity is much lower, since they share neither words nor meaning. However, the second and third examples show a noticeably higher similarity (0.57): this is most likely due to the presence of the word `bank`, which is semantically related to `money`. This is confirmed by directly comparing the similarity of `make money` with `river` and with `bank`.

>**💡 Interesting Fact** This is due to the usage of **static word vectors** like word2vec, which are pre-trained and fixed for all tokens in the vocabulary. Tokens exhibiting **polysemy** (i.e. multiple meanings) like `bank` will have all their possible meanings "crammed" into a single vector. This is one major limitation of static word vectors, and a reason why **contextualized word vectors** produced by pretrained neural language models, in which each resulting output embedding is generated dynamically depending on the full context, have gained much traction in NLP.

## Other references

We have seen only a small part of what spaCy can do. Next week, we will take a closer look at the tagging and parsing components of the library. These are its general functionalities:

| **Name** | **Description** |
|---|---|
| Tokenization | Segmenting text into words, punctuation marks etc. |
| Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun. |
| Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Lemmatization | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences. |
| Named Entity Recognition (NER) | Labelling named “real-world” objects, like persons, companies or locations. |
| Entity Linking (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
| Similarity | Comparing words, text spans and documents and how similar they are to each other. |
| Text Classification | Assigning categories or labels to a whole document, or parts of a document. |
| Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| Training | Updating and improving a statistical model’s predictions. |
| Serialization | Saving objects to files or byte strings. |

Refer to [spaCy 101](https://spacy.io/usage/spacy-101) and the documentation for more details.

## Handling More Languages with Stanza

spaCy is a great tool for English, but what if you want to use it for other languages? The good news is that spaCy supports a wide range of languages, but the bad news is that it doesn't support all of them. In this section, we will see how to understand whether a language is supported, and what to do if it isn't.

The list of supported languages is available on the [spaCy website](https://spacy.io/usage/models#languages). If no packages are available for the language you are interested in, you can still use spaCy to process text in that language, but you will have to use the `Language` class directly, and you won't be able to use the pre-trained statistical models. This limits the functionality of spaCy, but it is still possible to use it for tokenization, lemmatization, and other basic NLP tasks.
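As a rough sketch of this fallback, the snippet below creates a blank pipeline from a language code and adds the rule-based `sentencizer`, which needs no training data. The code `ga` (Irish) is chosen purely for illustration; whether trained packages exist for a given language should be checked on the models page linked above.

```python
import spacy

# Blank pipeline: language-specific tokenization rules only, no trained models.
nlp = spacy.blank("ga")

# Rule-based sentence splitting on punctuation, no statistical model involved.
nlp.add_pipe("sentencizer")

doc = nlp("Dia duit. Conas atá tú?")
print(nlp.pipe_names)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
```

Components that rely on trained weights, such as the tagger, parser or NER, are not available in this setup.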

An alternative is the [Stanza](https://stanfordnlp.github.io/stanza/) library, a Python NLP package from the Stanford NLP Group that ships its own neural pipelines and also provides a Python interface to the Java-based [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) toolkit. Stanza supports a wide range of languages, including Chinese, Japanese, and Arabic (full list [here](https://stanfordnlp.github.io/stanza/available_models.html)). Conveniently, the Stanza authors thoroughly report the [performance](https://stanfordnlp.github.io/stanza/performance.html) of their models, so you can easily decide whether to use them or not.

Stanza's `Pipeline` is very similar to spaCy's `Language`, and it can be used in a similar way. Notice that while we had to manually download pipelines for spaCy at the beginning of the notebook, Stanza pipelines are automatically downloaded when you first use them. Here are some examples matching the ones we saw for spaCy:

**Word and Sentence Tokenization, Multiword Tokens**

In [10]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

nlp = stanza.Pipeline(lang='fr', processors='tokenize')
doc = nlp("Ça n'est pas du tout une phrase.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {word.id}\tword: {word.text}\ttoken: {word.parent.text}' for word in sentence.words], sep='\n')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


INFO:stanza:Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| mwt       | combined |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Done loading processors!
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: test
id: (5,)	text: sentence
id: (6,)	text: for
id: (7,)	text: stanza
id: (8,)	text: .
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: another
id: (4,)	text: sentence
id: (5,)	text: .


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


INFO:stanza:Loading these models for language: fr (French):
| Processor | Package  |
------------------------
| tokenize  | combined |
| mwt       | combined |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Done loading processors!


id: 1	word: Ça	token: Ça
id: 2	word: n'	token: n'
id: 3	word: est	token: est
id: 4	word: pas	token: pas
id: 5	word: de	token: du
id: 6	word: le	token: du
id: 7	word: tout	token: tout
id: 8	word: une	token: une
id: 9	word: phrase	token: phrase
id: 10	word: .	token: .


**POS Tagging and Morphological Features**

In [11]:
import stanza

nlp = stanza.Pipeline(lang='fr', processors='tokenize,mwt,pos')
doc = nlp('Ceci n’est pas une pipe.')
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


INFO:stanza:Loading these models for language: fr (French):
| Processor | Package         |
-------------------------------
| tokenize  | combined        |
| mwt       | combined        |
| pos       | combined_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Done loading processors!


word: Ceci	upos: PRON	xpos: None	feats: Gender=Masc|Number=Sing|Person=3|PronType=Dem
word: n’	upos: ADV	xpos: None	feats: Polarity=Neg
word: est	upos: AUX	xpos: None	feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: pas	upos: ADV	xpos: None	feats: Polarity=Neg
word: une	upos: DET	xpos: None	feats: Definite=Ind|Gender=Fem|Number=Sing|PronType=Art
word: pipe	upos: NOUN	xpos: None	feats: Gender=Fem|Number=Sing
word: .	upos: PUNCT	xpos: None	feats: _


**Named Entity Recognition**

In [12]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Gabriele Sarti teaches at the University of Groningen. He lives in the Netherlands.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


entity: Gabriele Sarti	type: PERSON
entity: the University of Groningen	type: ORG
entity: Netherlands	type: GPE


Find more usage examples in the [official documentation](https://stanfordnlp.github.io/stanza/index.html).

------------------------------------------------------------------------

# Text tagging and Dependency Parsing with spaCy

Here, we extend the overview of spaCy functionalities to include named entity recognition and dependency parsing. You will then learn how to use 🤗 Transformers for span labeling tasks, and see some other interesting tagging use-cases.

## Named entity recognition with spaCy

*This section is based on the [spaCy documentation](https://spacy.io/usage/linguistic-features#named-entities).*

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can **recognize various types of named entities in a document**, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`:

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for US$1B")

def print_entities(doc):
    print("Token\tStart\tEnd\tType\tExplanation\n" + "-"*80)
    for ent in doc.ents:
        print(
            f"{ent.text}\t{ent.start_char}\t{ent.end_char}\t{ent.label_}\t{spacy.explain(ent.label_)}"
        )

print_entities(doc)



Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
Apple	0	5	ORG	Companies, agencies, institutions, etc.
U.K.	27	31	GPE	Countries, cities, states
1B	47	49	MONEY	Monetary values, including unit


In [14]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

### Accessing entity annotations and labels

The standard way to access entity annotations is the `doc.ents` property, which produces a sequence of `Span` objects. The entity type is accessible either as a hash value or as a string, using the attributes `ent.label` and `ent.label_`. The `Span` object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
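These accessors can be tried without a trained model by assigning an entity by hand. The sketch below assumes spaCy v3 and a blank English pipeline; the `GPE` span is set manually for illustration and is not a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no NER component: the entity below is set by hand
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.ents = [Span(doc, 0, 2, label="GPE")]

ent = doc.ents[0]
print(ent.text)                       # text of the whole entity
print(ent.label_, ent.label)          # string label and its hash value
print(nlp.vocab.strings[ent.label])   # the hash maps back to the string
print([token.text for token in ent])  # a Span can be iterated over...
print(ent[0].text, len(ent))          # ...and indexed like a sequence
```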

You can also access token-level entity annotations using the `token.ent_iob_` and `token.ent_type_` attributes (the variants without the trailing underscore return integer IDs). `token.ent_iob_` indicates whether the token begins an entity, is inside one, or is outside any entity. If no entity type is set on a token, `token.ent_type_` returns an empty string.

**The IOB Scheme**

- I – Token is inside an entity.
- O – Token is outside an entity (i.e. no entity tag)
- B – Token is the beginning of an entity.
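To illustrate how token-level IOB tags relate to entity spans, here is a small plain-Python sketch that groups tags back into spans. `iob_to_spans` is a hypothetical helper written for this tutorial, not a spaCy function, and the tag sequences are typed in by hand.

```python
def iob_to_spans(tokens, iob_tags, ent_types):
    """Group IOB tags into (start, end, label, text) spans; end is exclusive."""
    spans = []
    start, label = None, None

    def close(end):
        # Store the currently open entity, if there is one.
        if start is not None:
            spans.append((start, end, label, " ".join(tokens[start:end])))

    for i, (tag, ent_type) in enumerate(zip(iob_tags, ent_types)):
        if tag == "I" and start is not None and ent_type == label:
            continue  # the open entity continues on this token
        close(i)  # a B, an O or an inconsistent I ends any open entity
        if tag in ("B", "I"):
            start, label = i, ent_type
        else:
            start, label = None, None
    close(len(tokens))  # flush an entity that runs to the end of the text
    return spans


tokens = ["San", "Francisco", "considers", "banning", "robots"]
iob_tags = ["B", "I", "O", "O", "O"]
ent_types = ["GPE", "GPE", "", "", ""]
print(iob_to_spans(tokens, iob_tags, ent_types))
# → [(0, 2, 'GPE', 'San Francisco')]
```

The `B` tag is what allows two adjacent entities to be told apart: without it, `B I B` and `B I I` would be indistinguishable.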

In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
print_entities(doc)

print("\n\n\n")

# token level
def print_token_iob(doc):
    print("Token\tIOB Tag\tType\tExplanation\n" + "-"*80)
    for t in doc:
        print(
            f"{t.text.ljust(10)}\t{t.ent_iob_}\t{t.ent_type_}\t{spacy.explain(t.ent_type_)}"
        )

print_token_iob(doc)

Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
San Francisco	0	13	GPE	Countries, cities, states




Token	IOB Tag	Type	Explanation
--------------------------------------------------------------------------------
San       	B	GPE	Countries, cities, states
Francisco 	I	GPE	Countries, cities, states
considers 	O		None
banning   	O		None
sidewalk  	O		None
delivery  	O		None
robots    	O		None




### Visualizing named entities

The [displaCy ENT visualizer](https://explosion.ai/demos/displacy-ent) lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a `Doc` or a list of `Doc` objects to displaCy and run `displacy.serve` to run the web server, or `displacy.render` to generate the raw markup. It works in Jupyter Notebooks!

For more details and examples, see the usage [guide on visualizing spaCy](https://spacy.io/usage/visualizers).
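Outside of a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes spaCy v3 and again uses a hand-set `GPE` entity on a blank pipeline, so the highlighted span is illustrative rather than a model prediction.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.ents = [Span(doc, 0, 2, label="GPE")]  # hand-set entity, illustration only

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(type(html).__name__, len(html))
print("GPE" in html)
```

The returned string can then be written to an `.html` file or embedded in a web page.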

In [16]:
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)


## Dependency parsing with spaCy

*This section is based on the [spaCy documentation](https://spacy.io/usage/linguistic-features#dependency-parse).*

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”. You can check whether a `Doc` object has been parsed by calling `doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has been set. It returns a boolean value. If the result is `False`, the default sentence iterator will raise an exception.

### Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of `.dep` is a hash value. You can get the string value with `.dep_`.
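The head/child/dep vocabulary can be tried on a hand-annotated `Doc`, independently of any trained parser. This sketch assumes spaCy v3, where the `Doc` constructor accepts `heads` (absolute token indices) and `deps`; the arcs mirror those printed for this sentence in this section rather than being predicted here.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Autonomous", "cars", "shift", "insurance", "liability",
         "toward", "manufacturers"]
# heads[i] is the index of token i's head; the root points to itself.
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# A blank pipeline sets no dependency labels; the hand-annotated Doc has them.
print(nlp("plain text").has_annotation("DEP"))
print(doc.has_annotation("DEP"))

# The root is the only token that is its own head.
root = [token for token in doc if token.head.i == token.i][0]
print(root.text, [child.text for child in root.children])

liability = doc[4]
print(liability.dep_, liability.head.text)
print([token.text for token in liability.subtree])
# .dep is the hash of the label, .dep_ the string it maps back to.
print(doc.vocab.strings[liability.dep] == liability.dep_)
```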

In the following, we add the fields `head.pos_`, the part of speech of the token's head, and `children`, the list of the token's immediate syntactic dependents. Notice that here we operate on single tokens rather than on full noun chunks.

In [17]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Autonomous amod cars NOUN []
cars nsubj shift VERB [Autonomous]
shift ROOT shift VERB [cars, liability, toward]
insurance compound liability NOUN []
liability dobj shift VERB [insurance]
toward prep shift VERB [manufacturers]
manufacturers pobj toward ADP []


Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest – from below:

In [18]:
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding the verb that governs the subject "cars"
verbs = set()
for possible_subject in doc:
    if (
        possible_subject.text == "cars"
        and possible_subject.dep == nsubj
        and possible_subject.head.pos == VERB
    ):
        verbs.add(possible_subject.head)
verbs

{shift}
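Since every non-root word contributes exactly one arc, matching "from below" amounts to a single pass over the words. The sketch below illustrates this in plain Python, using rows copied from the parse printed earlier in this section. The helper `heads_of` is hypothetical and not part of spaCy.

```python
# (text, dep, head text, head POS), one row per token of the example sentence
rows = [
    ("Autonomous", "amod", "cars", "NOUN"),
    ("cars", "nsubj", "shift", "VERB"),
    ("shift", "ROOT", "shift", "VERB"),
    ("insurance", "compound", "liability", "NOUN"),
    ("liability", "dobj", "shift", "VERB"),
    ("toward", "prep", "shift", "VERB"),
    ("manufacturers", "pobj", "toward", "ADP"),
]

# One arc per word, except for the root: 7 words give 6 arcs
arcs = [(head, dep, child) for child, dep, head, _ in rows if dep != "ROOT"]
print(len(arcs))

def heads_of(rows, dep, head_pos):
    """Return (child, head) pairs for arcs with the given label and head POS."""
    return [(child, head) for child, d, head, pos in rows if d == dep and pos == head_pos]

subject_verb_pairs = heads_of(rows, "nsubj", "VERB")
print(subject_verb_pairs)
```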

### Visualizing Parse Trees

We can use the same `displaCy` tool from above to visualize the parse tree:

In [19]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep', jupyter=True)

Long texts can become difficult to read when displayed in one row, so it’s often better to visualize them sentence by sentence instead. displaCy supports rendering both `Doc` and `Span` objects, as well as lists of `Doc`s or `Span`s. Instead of passing the full `Doc` to `displacy.render`, you can also pass in the list of sentences obtained from `doc.sents`. This will create one visualization for each sentence.

In [20]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = """In ancient Rome, some neighbors live in three adjacent houses. One day, Senex goes on a trip and leaves Pseudolus in charge of Hero."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", jupyter=True, options={"compact": True})

The dependency parse can be a useful tool for information extraction, especially when combined with other predictions like named entities. The following example extracts money and currency values, i.e. entities labeled as `MONEY`, and then uses the dependency parse to find the noun phrase they are referring to – for example `"Net income" → "$9.4 million"`.

In [21]:
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge noun phrases and entities for easier analysis
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of 1$ billion",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # We have an attribute or a direct object, so check for a subject
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # We have a prepositional object with a preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)

Net income --> $9.4 million
the prior year --> $2.7 million
Revenue --> twelve billion dollars
a loss --> 1$ billion


## Span Labeling with 🤗 Transformers

We are now going to see some interesting use cases for span labeling with the 🤗 Transformers library. We will use the `pipelines` covered in the week 2 tutorial, since they are the fastest way to use fine-tuned models.

In the first example, we will use a model trained to perform NER on the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/index.html) dataset. Recall that these models operate on subword tokens, which somewhat complicates mapping the predicted tags back to the original words. Luckily, this is handled automatically by the `ner` pipeline, an alias for the `TokenClassificationPipeline` we mentioned in the first tutorial, which returns the character offsets of each prediction.

In [22]:
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER")
example = "My name is Wolfgang and I live in Berlin"
ner_results = ner(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
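To make the subword issue concrete, here is a minimal sketch of one common alignment strategy: keeping the label predicted for the first subword of each word. The subword split and labels below are hypothetical and only follow the `##` continuation convention of BERT's WordPiece tokenizer; they are not actual output of `dslim/bert-base-NER`. The helper `merge_wordpieces` is likewise illustrative.

```python
def merge_wordpieces(tokens, labels):
    """Rebuild words from WordPiece tokens, keeping the first subword's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: glue it to the previous word and ignore its label
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))

# Hypothetical subword split and per-subword labels, for illustration only
tokens = ["My", "name", "is", "Wolf", "##gang"]
labels = ["O", "O", "O", "B-PER", "I-PER"]
word_level = merge_wordpieces(tokens, labels)
print(word_level)
```

In practice, the token classification pipeline offers comparable behavior through its `aggregation_strategy` argument (for instance `"first"`), so a helper like this is rarely needed.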


Let's now build a function that lets us visualize the predictions of a model using displaCy:

In [23]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
example = "My name is Wolfgang and I live in Berlin"

def render_tags(nlp, text, tags):
    doc = nlp.make_doc(text)
    for ent in tags:
        # Using char_span instead of building the Span manually allows us to use the char indexes
        # produced alongside the model's output. It returns None when the indexes do not match
        # token boundaries, so we skip those predictions.
        ent_span = doc.char_span(ent["start"], ent["end"], label=ent["entity"])
        if ent_span is not None:
            doc.set_ents([ent_span], default="unmodified")
    # This adds nice colors to the unknown entities
    options = {"colors": {k["entity"]: "linear-gradient(90deg, #aa9cfc, #fc9ce7)" for k in tags}}
    return displacy.render(doc, style="ent", jupyter=True, options=options)

render_tags(nlp, example, ner_results)
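`doc.char_span` only creates a span when the character offsets line up with spaCy's token boundaries, which is not guaranteed for offsets computed on subwords. The sketch below shows the difference between the default `"strict"` alignment and `"expand"` on a blank pipeline. The offsets 11 to 15 are chosen by hand to cover only part of a token, and `nlp_blank` is an illustrative name.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; no trained model needed
doc = nlp_blank.make_doc("My name is Wolfgang and I live in Berlin")

# Characters 11-15 cover only "Wolf", i.e. part of the token "Wolfgang"
strict_span = doc.char_span(11, 15, label="PER")
expanded_span = doc.char_span(11, 15, label="PER", alignment_mode="expand")

print(strict_span)    # no span is created under the default "strict" mode
print(expanded_span)  # the span is widened to the full token
```

Spans that come back as `None` should be skipped before calling `doc.set_ents`.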



### Span Labeling Beyond POS & NER

While POS tagging and NER are by far the two most popular tasks in which per-token labels need to be predicted, they are not the only interesting ones. In the next cell we will use a model trained to tag personal information in clinical reports, which can be used for **clinical note de-identification**:

In [24]:
note = (
    "PROCEDURE: Chest xray. COMPARISON: last seen on 1/1/2020 and also record dated of March 1st, 2019. "
    "FINDINGS: patchy airspace opacities. IMPRESSION: The results of the chest xray of January 1 2020 are the most "
    "concerning ones. The patient was transmitted to another service of UH Medical Center under the responsability of "
    "Dr. Perez. We used the system MedClinical data transmitter and sent the data on 2/1/2020, under the ID 5874233. "
    "We received the confirmation of Dr Perez. He is reachable at 567-493-1234."
)

deid = pipeline("token-classification", model="StanfordAIMI/stanford-deidentifier-base")
deid_result = deid(note)
deid_result[:3]

[{'entity': 'DATE',
  'score': 0.99990094,
  'index': 12,
  'word': '1',
  'start': 48,
  'end': 49},
 {'entity': 'DATE',
  'score': 0.9999145,
  'index': 13,
  'word': '/',
  'start': 49,
  'end': 50},
 {'entity': 'DATE',
  'score': 0.99995744,
  'index': 14,
  'word': '1',
  'start': 50,
  'end': 51}]

Let's now customize the function from before to get closer to one tag per entity, under the assumption that entities are composed of contiguous tokens:

In [25]:
def render_tags_deid(nlp, text, tags):
    doc = nlp.make_doc(text)
    for ent in tags:
        # The alignment mode parameter defaults to "strict", meaning that if the start/end indices
        # do not match the token boundaries, the span is not created. We set it to "expand" so that
        # a prediction covering part of a token is extended to the whole token.
        ent_span = doc.char_span(ent["start"], ent["end"], label=ent["entity"], alignment_mode="expand")
        if ent_span is not None:
            doc.set_ents([ent_span], default="unmodified")
    # This adds nice colors to the unknown entities
    options = {"colors": {k["entity"]: "linear-gradient(90deg, #aa9cfc, #fc9ce7)" for k in tags}}
    return displacy.render(doc, style="ent", jupyter=True, options=options)

render_tags_deid(nlp, note, deid_result)

As you can see, the model is not perfect, but it can provide a nice baseline to anonymize sensitive information in clinical reports.
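One way to get closer to a single prediction per entity before rendering is to merge neighbouring predictions that share a label. The sketch below is a hypothetical helper, `merge_adjacent`, which assumes the predictions are sorted by `start` and joins those whose character offsets touch or are separated by at most `max_gap` characters. The default of 1 allows for a single space and is an assumption of this sketch. Scores are dropped for brevity, and the input reuses the offsets of the first three predictions shown in the `deid_result[:3]` output.

```python
def merge_adjacent(tags, max_gap=1):
    """Merge consecutive predictions with the same label and (nearly) touching offsets."""
    merged = []
    for tag in tags:
        last = merged[-1] if merged else None
        if last and last["entity"] == tag["entity"] and tag["start"] - last["end"] <= max_gap:
            # Same label and contiguous offsets: extend the previous prediction
            last["end"] = tag["end"]
        else:
            merged.append({"entity": tag["entity"], "start": tag["start"], "end": tag["end"]})
    return merged

# Offsets of the first three predictions returned by the de-identification model
first_three = [
    {"entity": "DATE", "start": 48, "end": 49},
    {"entity": "DATE", "start": 49, "end": 50},
    {"entity": "DATE", "start": 50, "end": 51},
]
merged_tags = merge_adjacent(first_three)
print(merged_tags)
```

The merged dictionaries keep the `entity`, `start` and `end` keys, so they can be passed to a rendering function such as `render_tags_deid` in place of the raw predictions.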