[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W4T_Text_Tagging.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install spacy transformers sentencepiece thermostat-datasets
!python -m spacy download en_core_web_sm

# Text tagging and Dependency Parsing with spaCy

In week 3 we covered some of the tagging functionalities of spaCy (namely, part-of-speech and morphological tagging). In this notebook we will extend the overview of spaCy functionalities to include named entity recognition and dependency parsing. You will then learn how to use 🤗 Transformers for span labeling tasks, and see some other interesting tagging use-cases.

## Named entity recognition with spaCy

*This section is based on the [spaCy documentation](https://spacy.io/usage/linguistic-features#named-entities).*

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can **recognize various types of named entities in a document**, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`:

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for US$1B")

def print_entities(doc):
    print("Token\tStart\tEnd\tType\tExplanation\n" + "-"*80)
    for ent in doc.ents:
        print(
            f"{ent.text}\t{ent.start_char}\t{ent.end_char}\t{ent.label_}\t{spacy.explain(ent.label_)}"
        )

print_entities(doc)

Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
Apple	0	5	ORG	Companies, agencies, institutions, etc.
U.K.	27	31	GPE	Countries, cities, states
1B	47	49	MONEY	Monetary values, including unit


In [2]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

### Accessing entity annotations and labels

The standard way to access entity annotations is the `doc.ents` property, which produces a sequence of `Span` objects. The entity type is accessible either as a hash value or as a string, using the attributes `ent.label` and `ent.label_`. The `Span` object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the `token.ent_iob` and `token.ent_type` attributes (or their string variants `token.ent_iob_` and `token.ent_type_`). `token.ent_iob` indicates whether the token begins an entity, is inside one, or is outside any entity. If no entity type is set on a token, `token.ent_type_` returns an empty string.

**The IOB Scheme**

- I – Token is inside an entity.
- O – Token is outside an entity (i.e. no entity tag)
- B – Token is the beginning of an entity.
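To make the scheme concrete, here is a minimal sketch of how IOB tags can be grouped back into entity spans. It is plain Python and does not reflect spaCy's internal implementation. The helper name `iob_to_spans` and the combined `B-GPE` tag strings are assumptions made for illustration (spaCy exposes the IOB code and the entity type as two separate attributes).

```python
def iob_to_spans(tags):
    """Group IOB tags into (start, end, label) token spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, ent_type = tag.partition("-")
        # Close the open span on O, on a new B, or on an I of a different type
        if start is not None and (prefix != "I" or ent_type != label):
            spans.append((start, i, label))
            start, label = None, None
        # Open a span on B; a stray I without an open span is treated like B
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, ent_type
    # Close a span that runs until the end of the sequence
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical combined tags for a seven-token sentence
tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
tags = ["B-GPE", "I-GPE", "O", "O", "O", "O", "O"]
entities = [(" ".join(tokens[s:e]), label) for s, e, label in iob_to_spans(tags)]
print(entities)
```

Treating a stray `I` tag as the start of a new entity is a lenient design choice; a stricter decoder could discard such tags instead.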

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
print_entities(doc)

print("\n\n\n")

# token level
def print_token_iob(doc):
    print("Token\tIOB Tag\tType\tExplanation\n" + "-"*80)
    for t in doc:
        print(
            f"{t.text.ljust(10)}\t{t.ent_iob_}\t{t.ent_type_}\t{spacy.explain(t.ent_type_)}"
        )

print_token_iob(doc)

Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
San Francisco	0	13	GPE	Countries, cities, states




Token	IOB Tag	Type	Explanation
--------------------------------------------------------------------------------
San       	B	GPE	Countries, cities, states
Francisco 	I	GPE	Countries, cities, states
considers 	O		None
banning   	O		None
sidewalk  	O		None
delivery  	O		None
robots    	O		None


### Visualizing named entities

The [displaCy ENT visualizer](https://explosion.ai/demos/displacy-ent) lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a `Doc` or a list of `Doc` objects to displaCy and run `displacy.serve` to run the web server, or `displacy.render` to generate the raw markup. It works in Jupyter Notebooks!

For more details and examples, see the usage [guide on visualizing spaCy](https://spacy.io/usage/visualizers).

In [4]:
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")


## Dependency parsing with spaCy

*This section is based on the [spaCy documentation](https://spacy.io/usage/linguistic-features#dependency-parse).*

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”. You can check whether a `Doc` object has been parsed by calling `doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has been set. It returns a boolean value. If the result is `False`, the default sentence iterator will raise an exception.

### Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of `.dep` is a hash value. You can get the string value with `.dep_`.

In the following example, we print `head.pos_`, the part of speech of each token's head, and `children`, the immediate syntactic dependents of the token. Notice that we are now operating on single tokens rather than on full noun chunks.

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Autonomous amod cars NOUN []
cars nsubj shift VERB [Autonomous]
shift ROOT shift VERB [cars, liability, toward]
insurance compound liability NOUN []
liability dobj shift VERB [insurance]
toward prep shift VERB [manufacturers]
manufacturers pobj toward ADP []


Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest – from below:

In [2]:
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

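# Finding the verb associated with the subject "cars"
verbs = set()
for possible_subject in doc:
    # Check the dependency label too, so that "cars" is only matched when it is a subject
    is_cars_subject = possible_subject.text == "cars" and possible_subject.dep == nsubj
    if is_cars_subject and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)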
verbs

{shift}
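Because every token has exactly one head, a parse can also be stored as a plain list of head indices. The sketch below is spaCy-free and only illustrates the data structure: the `tokens`, `heads` and `deps` lists are copied by hand from the output printed earlier in this section, and the helper names `arcs` and `children` are made up for this example.

```python
# Each position holds the index of that token's head; the root points to itself,
# mirroring the `token.head == token` convention for the root in spaCy.
tokens = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]


def arcs(heads, deps):
    """Yield one (head, label, child) index triple per non-root token."""
    for child, (head, dep) in enumerate(zip(heads, deps)):
        if head != child:
            yield head, dep, child


def children(heads, i):
    """Return the indices of the immediate dependents of token i, in sentence order."""
    return [c for c, h in enumerate(heads) if h == i and c != i]


all_arcs = list(arcs(heads, deps))
shift_children = [tokens[c] for c in children(heads, 2)]

for head, dep, child in all_arcs:
    print(f"{tokens[head]} --{dep}--> {tokens[child]}")
print(shift_children)
```

A tree over n tokens has n - 1 arcs, which is why iterating over the tokens is enough to visit every arc exactly once.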

### Visualizing Parse Trees

We can use the same `displaCy` tool from above to visualize the parse tree:

In [3]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')

Long texts can become difficult to read when displayed in one row, so it’s often better to visualize them sentence-by-sentence instead. displaCy supports rendering both `Doc` and `Span` objects, as well as lists of `Doc` or `Span` objects. Instead of passing the full `Doc` to `displacy.render`, you can also pass in a list built from `doc.sents`. This will create one visualization for each sentence.

In [4]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = """In ancient Rome, some neighbors live in three adjacent houses. One day, Senex goes on a trip and leaves Pseudolus in charge of Hero."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", options={"compact": True})

### Iterating around the local tree

A few more convenience attributes are provided for iterating around the local tree from the token. `Token.lefts` and `Token.rights` attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, `Token.n_lefts` and `Token.n_rights` that give the number of left and right children.

In [5]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("bright red apples on the tree")
print("Lefts:", [token.text for token in doc[2].lefts])
print("Rights", [token.text for token in doc[2].rights])
print("# Lefts", doc[2].n_lefts)
print("# Rights", doc[2].n_rights)
displacy.render(doc)

Lefts: ['bright', 'red']
Rights ['on']
# Lefts 2
# Rights 1


You can get a whole phrase by its syntactic head using the `Token.subtree` attribute. This returns an ordered sequence of tokens. You can walk up the tree with the `Token.ancestors` attribute, and check dominance with `Token.is_ancestor`. For the default English pipelines, the parse tree is **projective**, which means that there are no crossing brackets. The tokens returned by `.subtree` are therefore guaranteed to be contiguous. This is not true for e.g. the German pipelines, which have many non-projective dependencies.

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])

Credit nmod 0 2 ['account', 'holders', 'submit']
and cc 0 0 ['Credit', 'account', 'holders', 'submit']
mortgage conj 0 0 ['Credit', 'account', 'holders', 'submit']
account compound 1 0 ['holders', 'submit']
holders nsubj 1 0 ['submit']
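The link between projectivity and contiguous subtrees can be checked on a head-index representation of the parse. The following is a minimal sketch, not spaCy's implementation: `subtree` and `is_projective` are hypothetical helpers, the first `heads` list was transcribed by hand from the parse of the sentence above (the root points to itself), and the second one is an invented non-projective example.

```python
def subtree(heads, i):
    """Return the sorted indices of token i and all of its descendants."""
    nodes = {i}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in nodes and child not in nodes:
                nodes.add(child)
                changed = True
    return sorted(nodes)


def is_projective(heads):
    """A tree is projective if every subtree covers a contiguous range of tokens."""
    for i in range(len(heads)):
        span = subtree(heads, i)
        if span != list(range(span[0], span[-1] + 1)):
            return False
    return True


tokens = ["Credit", "and", "mortgage", "account", "holders", "must", "submit", "their", "requests"]
heads = [3, 0, 0, 4, 6, 6, 6, 8, 6]  # transcribed from the spaCy parse above

subject_phrase = [tokens[j] for j in subtree(heads, 4)]
print(subject_phrase)
print(is_projective(heads))

# Invented example: token 1 attaches to token 3 across the root at index 2
crossing_heads = [2, 3, 2, 2]
print(is_projective(crossing_heads))
```

Recomputing each subtree from scratch keeps the sketch short at the cost of efficiency, which is fine for single sentences.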


Finally, the `.left_edge` and `.right_edge` attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a `Span` object for a syntactic phrase, which can then be merged into a single token with `Doc.retokenize`. Note that `.right_edge` gives a token within the subtree – so if you use it as the end-point of a range, don’t forget to +1!

In [7]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Credit and mortgage account holders NOUN nsubj submit
must AUX aux submit
submit VERB ROOT submit
their PRON poss requests
requests NOUN dobj submit


The dependency parse can be a useful tool for information extraction, especially when combined with other predictions like named entities. The following example extracts money and currency values, i.e. entities labeled as `MONEY`, and then uses the dependency parse to find the noun phrase they are referring to – for example `"Net income"→ "$9.4 million"`.

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge noun phrases and entities for easier analysis
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of 1$ billion",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # We have an attribute or a direct object, so check for a subject
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # We have a prepositional object with a preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)

Net income --> $9.4 million
the prior year --> $2.7 million
Revenue --> twelve billion dollars
a loss --> 1$ billion


## Span Labeling with 🤗 Transformers

We are now going to see some interesting use-cases for span labeling with the 🤗 Transformers library. We will use the `pipelines` covered in the week 2 tutorial, since they are the fastest way to use fine-tuned models.

In the first example, we will use a model trained to perform NER on the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/index.html) dataset. Recall that these models operate on subword tokens, which somewhat complicates mapping the predicted tags back to the original words. Luckily, this is handled for us by the `ner` pipeline, a subclass of the `TokenClassificationPipeline` we mentioned in the first tutorial, which returns the character offsets of each prediction in the original text.

In [9]:
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER")
example = "My name is Wolfgang and I live in Berlin"
ner_results = ner(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
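In the output above both entities happen to be single tokens. To illustrate what mapping subword predictions back to words involves, here is a minimal sketch that merges predictions whose character offsets touch. It is a simple heuristic and not the pipeline's own algorithm (the pipeline exposes an `aggregation_strategy` argument for grouping). The split of "Wolfgang" into two pieces and all scores below are invented for the example, and `merge_subwords` is a hypothetical helper.

```python
def merge_subwords(text, predictions):
    """Merge predictions with touching character offsets into word-level entities."""
    merged = []
    for pred in predictions:
        if merged and pred["start"] == merged[-1]["end"]:
            # No gap in the text: treat this piece as a continuation of the previous one
            merged[-1]["end"] = pred["end"]
            merged[-1]["score"] = min(merged[-1]["score"], pred["score"])
        else:
            merged.append({
                "label": pred["entity"].split("-")[-1],  # drop the B-/I- prefix
                "start": pred["start"],
                "end": pred["end"],
                "score": pred["score"],
            })
    for ent in merged:
        ent["word"] = text[ent["start"]:ent["end"]]
    return merged


example = "My name is Wolfgang and I live in Berlin"
# Hypothetical subword-level predictions (not actual model output)
subword_preds = [
    {"entity": "B-PER", "score": 0.99, "word": "Wolf", "start": 11, "end": 15},
    {"entity": "I-PER", "score": 0.97, "word": "##gang", "start": 15, "end": 19},
    {"entity": "B-LOC", "score": 0.99, "word": "Berlin", "start": 34, "end": 40},
]
word_level = merge_subwords(example, subword_preds)
print([(ent["word"], ent["label"]) for ent in word_level])
```

Keeping the minimum score of the merged pieces is one conservative choice; averaging or taking the first piece's score are equally plausible alternatives.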


Let's now build a function that lets us visualize the predictions of a model using displaCy:

In [10]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
example = "My name is Wolfgang and I live in Berlin"

def render_tags(nlp, text, tags):
    doc = nlp.make_doc(text)
    for ent in tags:
        # Using char_span instead of building the Span manually allows us to use the char indexes
        # produced alongside the model's output
        ent_span = doc.char_span(ent["start"], ent["end"], label=ent["entity"])
        # char_span returns None when the offsets do not align with token boundaries
        if ent_span is not None:
            doc.set_ents([ent_span], default="unmodified")
    # This adds nice colors to the unknown entities
    options = {"colors": {k["entity"]:"linear-gradient(90deg, #aa9cfc, #fc9ce7)" for k in tags}}
    return displacy.render(doc, style="ent", options=options)

render_tags(nlp, example, ner_results)

### Span Labeling Beyond POS & NER

While POS tagging and NER are by far the two most popular tasks in which per-token labels need to be predicted, they are not the only interesting ones. In the next code bit we will use a model trained to tag personal information in clinical reports, which can be used for **clinical note de-identification**:

In [11]:
note = (
    "Hospital Care Team Service: Orthopedics Inpatient Attending: Roger C Kelly, "
    "MD Attending phys phone: (634)743-5135 Discharge Unit: HCS843 Primary Care "
    "Physician: Hassan V Kim, MD 512-832-5025."
)

deid = pipeline("ner", model="obi/deid_bert_i2b2")
deid_result = deid(note)
deid_result[:3]


[{'entity': 'B-STAFF',
  'score': 0.99943334,
  'index': 15,
  'word': 'Roger',
  'start': 61,
  'end': 66},
 {'entity': 'I-STAFF',
  'score': 0.99912256,
  'index': 16,
  'word': 'C',
  'start': 67,
  'end': 68},
 {'entity': 'L-STAFF',
  'score': 0.9993362,
  'index': 17,
  'word': 'Kelly',
  'start': 69,
  'end': 74}]

Let's now customize the function from before to show only one tag per entity, under the assumption that entities are composed of contiguous tokens:

In [12]:
def render_tags_deid(nlp, text, tags):
    doc = nlp.make_doc(text)
    start_idx = 0
    end_idx = 0
    for i, ent in enumerate(tags):
        is_last = i == len(tags) - 1
        if ent["entity"].startswith("U"):
            start_idx = ent["start"]
            end_idx = ent["end"]
        elif ent["entity"].startswith("B"):
            start_idx = ent["start"]
            continue
        # The second part of the expression lets us have some margin of classification error with a simple
        # heuristic: if the next entity is the same as the current one, we consider it a continuation even
        # if the present one is an end tag (L). Try to remove this to see what happens.
        # An end tag in the last position always closes the entity.
        elif ent["entity"].startswith("L") and (is_last or tags[i+1]["entity"][2:] != ent["entity"][2:]):
            end_idx = ent["end"]
        else:
            continue
        # The alignment mode parameter is normally set to "strict", meaning that if the start/end indices
        # do not match the token boundaries, the span is not created. We set it to "expand" to make the matching
        # more lenient.
        ent_span = doc.char_span(start_idx, end_idx, label=ent["entity"][2:], alignment_mode="expand")
        if ent_span is not None:
            doc.set_ents([ent_span], default="unmodified")
    # This adds nice colors to the unknown entities (defined once, outside the loop)
    options = {"colors": {k["entity"][2:]:"linear-gradient(90deg, #aa9cfc, #fc9ce7)" for k in tags}}
    return displacy.render(doc, style="ent", options=options)

render_tags_deid(nlp, note, deid_result)

As you can see, the model is not perfect, but it can provide a nice baseline to anonymize sensitive information in clinical reports. 
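The grouping logic can also be inspected without spaCy or a rendering step. The sketch below decodes B/I/L/U-prefixed tags into character spans in a strict way, so it omits the lenient heuristic used in the rendering function. `bilou_to_char_spans` is a hypothetical helper, the three tag dictionaries are taken from the model output shown earlier (scores and indices dropped), and the text is the beginning of the clinical note.

```python
def bilou_to_char_spans(tags):
    """Decode B/I/L/U-prefixed tags into (start_char, end_char, label) tuples."""
    spans = []
    start = None
    for tag in tags:
        prefix, label = tag["entity"][0], tag["entity"][2:]
        if prefix == "U":
            # Single-token entity
            spans.append((tag["start"], tag["end"], label))
            start = None
        elif prefix == "B":
            # Remember where a multi-token entity begins
            start = tag["start"]
        elif prefix == "L" and start is not None:
            # Close the entity opened by the last B tag
            spans.append((start, tag["end"], label))
            start = None
        # I tags only continue an open entity, so there is nothing to do for them
    return spans


note_start = "Hospital Care Team Service: Orthopedics Inpatient Attending: Roger C Kelly, MD"
first_tags = [
    {"entity": "B-STAFF", "word": "Roger", "start": 61, "end": 66},
    {"entity": "I-STAFF", "word": "C", "start": 67, "end": 68},
    {"entity": "L-STAFF", "word": "Kelly", "start": 69, "end": 74},
]
char_spans = bilou_to_char_spans(first_tags)
print([(note_start[s:e], label) for s, e, label in char_spans])
```

Returning plain tuples makes it easy to compare decoded spans against reference annotations before any visualization.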

## Other Interesting Text Tagging Tools

In this final section, we will briefly cover [Thermostat](https://github.com/DFKI-NLP/thermostat), a library that builds on Hugging Face Transformers and Datasets to visualize interpretability metrics for text classification models.

### Token Saliency with Thermostat

The concept of **saliency** in interpretable NLP will be covered in the final lesson of the course. For now, it is sufficient to know that we can use different methods, called **feature attribution methods**, to obtain scores approximating the importance of every feature in the input (i.e. tokens in NLP) in determining the model output. Saliency scores are nowadays widely used to interpret the behavior of models, and identify issues in their inference procedure (e.g. relying on spurious features due to overfitting training examples). 

There are two fundamental differences from previous examples:
1. scores are **numeric values** representing importance for every token instead of a classification label.
2. scores are not **predictions** but **attributions**. This means that the model is not explicitly trained to predict them, but they are instead a byproduct of model inference for a different task (e.g. text classification).

The [Thermostat](https://github.com/DFKI-NLP/thermostat) library provides us with a set of models and datasets for which saliency measures were precomputed, alongside a means of visualizing them inside Jupyter. In the following example, a `bert-base-uncased` model trained to perform sentiment analysis is used alongside a feature attribution method called [Integrated Gradients](https://captum.ai/api/integrated_gradients.html) to highlight salient tokens in a movie review example for which the model predicts the `POSITIVE` label. The saliency scores are stored in the `attribution` attribute of the object.
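To give an intuition for what Integrated Gradients computes, here is a toy sketch on a tiny logistic model with three input features. It is not the Thermostat or Captum implementation: the weights, input and all-zero baseline are invented, the gradient is written analytically, and the path integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy classifier: sigmoid of a weighted sum of the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def grad(x, w, b):
    """Analytic gradient of the model output with respect to each input feature."""
    s = model(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Average the gradients along the straight path from baseline to x,
    then scale them by the input difference (midpoint Riemann sum)."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(grad(point, w, b)):
            total[i] += g
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


# Hypothetical weights, bias, input and baseline
w, b = [2.0, -1.0, 0.5], -0.2
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
delta = model(x, w, b) - model(baseline, w, b)
print(attributions)
print(sum(attributions), delta)
```

The attributions sum (approximately) to the difference between the model output for the input and for the baseline, which is a useful sanity check when experimenting with this family of methods.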

In [13]:
import thermostat

annotated_imdb = thermostat.load("imdb-bert-lig")
imdb_instance = annotated_imdb[429]
imdb_instance.heatmap


Loading Thermostat configuration: imdb-bert-lig


token_index     0        1         2         3         4         5         6   \
token        [CLS]  amazing     movie         .      some        of       the   
attribution    0.0      1.0  0.028762 -0.206694  0.067878 -0.012668 -0.038177   
text_field    text     text      text      text      text      text      text   

token_index        7         8         9   ...        39        40        41  \
token          script   writing     could  ...     great    acting         .   
attribution -0.114986 -0.173781 -0.083315  ...  0.800842 -0.108838 -0.182696   
text_field       text      text      text  ...      text      text      text   

token_index        42        43        44        45         46        47  \
token            very    poetic         .    highly  recommend         .   
attribution  0.657471  0.301335 -0.190809  0.231953   0.643599  0.061746   
text_field       text      text      text      text       text      text   

token_index     48  
token        [SEP]  
attribu

The library provides us with a tool to print the saliency in a more visual way, the `render` method of the instance, which relies on the `displacy.render` function: red hues represent a high contribution in determining the predicted label, while blue is used to mark tokens that contribute to the opposite prediction (i.e. that "make the review negative"). We can now clearly see that positive words are very important to define the positive outcome, while some expressions like "could have been better" add a slight negative sentiment to the review.

In [14]:
labels = {0: "NEGATIVE", 1: "POSITIVE"}
print(f"Predicted label: {labels[imdb_instance.predicted_label_index]}")
imdb_instance.render()

Predicted label: POSITIVE
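Since attributions are plain numeric scores rather than labels, they can also be inspected without any visualization, for instance by ranking tokens. The sketch below uses only the tokens and scores visible in the truncated heatmap excerpt printed earlier (tokens 10 to 38 are elided there and therefore missing here, and the special tokens are left out). The values were copied by hand rather than read from the Thermostat object.

```python
# Tokens 1-9 and 39-47 of the heatmap excerpt, with their attribution scores
tokens = ["amazing", "movie", ".", "some", "of", "the", "script", "writing", "could",
          "great", "acting", ".", "very", "poetic", ".", "highly", "recommend", "."]
scores = [1.0, 0.028762, -0.206694, 0.067878, -0.012668, -0.038177, -0.114986, -0.173781, -0.083315,
          0.800842, -0.108838, -0.182696, 0.657471, 0.301335, -0.190809, 0.231953, 0.643599, 0.061746]

# Sort (token, score) pairs from the most positive to the most negative attribution
ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
top_positive = [token for token, _ in ranked[:3]]

print(top_positive)
print(ranked[-1])
```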


We can look at another example, in which a BERT model is now used to predict the topic of a newspaper article summary using the same feature attribution method. Many words associated with the predicted `SPORT` topic, such as *team, captain, picks*, are now salient.

In [15]:
import thermostat

annotated_agnews = thermostat.load("ag_news-bert-lig")
agnews_instance = annotated_agnews[60]
agnews_labels = {0: "WORLD", 1: "SPORT", 2: "BUSINESS", 3: "SCIENCE & TECHNOLOGY"}
print(f"Predicted label: {agnews_labels[agnews_instance.predicted_label_index]}")
agnews_instance.render()


Loading Thermostat configuration: ag_news-bert-lig
Predicted label: SPORT


Many more pre-loaded examples are available on the [Thermostat Huggingface Demo](https://huggingface.co/spaces/nfel/Thermostat).