[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W4T_Text_Tagging.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install spacy transformers sentencepiece thermostat-datasets
!python -m spacy download en_core_web_sm

# Text tagging with spaCy and 🤗 Transformers

In week 3 we covered some of the tagging functionalities of spaCy (namely, part-of-speech and morphological tagging). In this notebook we will extend the overview of spaCy functionalities to include named entity recognition. You will then learn how to use 🤗 Transformers for span labeling tasks, and see some other interesting tagging use cases.

## Named entity recognition with spaCy

*This section is based on the [spaCy documentation](https://spacy.io/usage/linguistic-features#named-entities).*

spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

A named entity is a “real-world object” that’s assigned a name, for example a person, a country, a product or a book title. spaCy can **recognize various types of named entities in a document**, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`:

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for US$1B")

def print_entities(doc):
    print("Token\tStart\tEnd\tType\tExplanation\n" + "-"*80)
    for ent in doc.ents:
        print(
            f"{ent.text}\t{ent.start_char}\t{ent.end_char}\t{ent.label_}\t{spacy.explain(ent.label_)}"
        )

print_entities(doc)

Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
Apple	0	5	ORG	Companies, agencies, institutions, etc.
U.K.	27	31	GPE	Countries, cities, states
1B	47	49	MONEY	Monetary values, including unit


In [3]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

### Accessing entity annotations and labels

The standard way to access entity annotations is the `doc.ents` property, which produces a sequence of `Span` objects. The entity type is accessible either as a hash value or as a string, using the attributes `ent.label` and `ent.label_`. The `Span` object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the `token.ent_iob` and `token.ent_type` attributes (or their string counterparts, `token.ent_iob_` and `token.ent_type_`). `token.ent_iob` indicates whether an entity starts or continues on the token, or whether the token lies outside any entity. If no entity type is set on a token, `token.ent_type_` returns an empty string.

**The IOB Scheme**

- I – Token is inside an entity.
- O – Token is outside an entity.
- B – Token is the beginning of an entity.
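To make the scheme concrete, the following minimal sketch groups token-level IOB tags back into entity spans. It is plain Python rather than spaCy code: the helper name `iob_to_spans` is hypothetical, and the tags are written by hand to mirror the "San Francisco" example used in the next cell.

```python
def iob_to_spans(tokens, iob_tags, types):
    """Group token-level IOB annotations into (start, end, label) token spans."""
    spans = []
    start = None  # index of the first token of the entity being built
    label = None
    for i, (iob, ent_type) in enumerate(zip(iob_tags, types)):
        # An open entity continues only on an "I" tag with the same type
        continues = iob == "I" and start is not None and ent_type == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if iob == "B":
            start, label = i, ent_type
        # A stray "I" with no open entity is ignored in this sketch
    if start is not None:
        spans.append((start, len(tokens), label))
    return spans


tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
iob_tags = ["B", "I", "O", "O", "O", "O", "O"]
types = ["GPE", "GPE", "", "", "", "", ""]

for start, end, label in iob_to_spans(tokens, iob_tags, types):
    print(" ".join(tokens[start:end]), label)
```

The end index is exclusive, following the usual Python slicing convention.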

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
print_entities(doc)

print("\n\n\n")

# token level
def print_token_iob(doc):
    print("Token\tIOB Tag\tType\tExplanation\n" + "-"*80)
    for t in doc:
        print(
            f"{t.text.ljust(10)}\t{t.ent_iob_}\t{t.ent_type_}\t{spacy.explain(t.ent_type_)}"
        )

print_token_iob(doc)

Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
San Francisco	0	13	GPE	Countries, cities, states




Token	IOB Tag	Type	Explanation
--------------------------------------------------------------------------------
San       	B	GPE	Countries, cities, states
Francisco 	I	GPE	Countries, cities, states
considers 	O		None
banning   	O		None
sidewalk  	O		None
delivery  	O		None
robots    	O		None


### Setting Entity Annotation

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can’t write directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest way to set entities is to use the `doc.set_ents` function and create the new entity as a `Span`.

In [5]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
# The model didn't recognize "fb" as an entity :(
print("BEFORE:")
print_entities(doc)
print("\n\n")

# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Option 1: Modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")

# Option 2: Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

print("AFTER:")
print_entities(doc)

BEFORE:
Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------



AFTER:
Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
fb	0	2	ORG	Companies, agencies, institutions, etc.
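The consistency requirement mentioned at the start of this section can be illustrated with a small check on token-level tags. This is only a sketch of the idea in plain Python, not spaCy's internal validation, and the helper name `is_consistent` is hypothetical: it shows why editing individual tokens could leave the sequence in an invalid state, for example an `I` tag that no `B` tag precedes.

```python
def is_consistent(annotations):
    """Return True if the (iob, type) pairs form a valid IOB sequence."""
    prev_iob, prev_type = "O", ""
    for iob, ent_type in annotations:
        # An "I" tag must follow a "B" or "I" tag of the same entity type
        if iob == "I" and (prev_iob == "O" or prev_type != ent_type):
            return False
        # Tokens inside an entity must carry an entity type
        if iob in ("B", "I") and not ent_type:
            return False
        prev_iob, prev_type = iob, ent_type
    return True


print(is_consistent([("B", "ORG"), ("I", "ORG"), ("O", "")]))  # valid sequence
print(is_consistent([("O", ""), ("I", "ORG")]))  # "I" without a preceding "B"
```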


### Setting entity annotations from array

You can also assign entity annotations using the `doc.from_array` method. To do this, you should include both the `ENT_TYPE` and the `ENT_IOB` attributes in the array you’re importing from.

In [6]:
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("BEFORE:")
print_entities(doc)
print("\n\n")

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("AFTER:")
print_entities(doc)

BEFORE:
Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------



AFTER:
Token	Start	End	Type	Explanation
--------------------------------------------------------------------------------
London	0	6	GPE	Countries, cities, states
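The value `3` written into the array above is the integer code that `token.ent_iob` uses for a `B` tag (`1` stands for `I`, `2` for `O`, and `0` means that no tag is set). The sketch below, in plain Python, shows how the rows for a multi-token entity could be laid out before copying them into the array. The helper `iob_rows` is hypothetical, and the token indices assume that the sentence is split into ten tokens with "United Kingdom" at positions 7 and 8.

```python
# Integer codes used by token.ent_iob
IOB_CODES = {"": 0, "I": 1, "O": 2, "B": 3}


def iob_rows(n_tokens, spans):
    """Build one (iob_code, label) row per token from (start, end, label) spans."""
    rows = [(IOB_CODES[""], "") for _ in range(n_tokens)]
    for start, end, label in spans:
        rows[start] = (IOB_CODES["B"], label)  # first token of the entity
        for i in range(start + 1, end):
            rows[i] = (IOB_CODES["I"], label)  # remaining tokens of the entity
    return rows


# Assumed tokenization: London is a big city in the United Kingdom .
rows = iob_rows(10, [(0, 1, "GPE"), (7, 9, "GPE")])
for index, row in enumerate(rows):
    print(index, row)
```

Before the rows can be written into the `uint64` array, each label string still has to be converted to its hash with `doc.vocab.strings[label]`, as done for `"GPE"` in the cell above.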


### Visualizing named entities

The [displaCy ENT visualizer](https://explosion.ai/demos/displacy-ent) lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a `Doc` or a list of `Doc` objects to displaCy and run `displacy.serve` to run the web server, or `displacy.render` to generate the raw markup. It works in Jupyter Notebooks!

For more details and examples, see the usage [guide on visualizing spaCy](https://spacy.io/usage/visualizers).

In [7]:
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")


## Span Labeling with 🤗 Transformers

We are now going to see some interesting use cases for span labeling with the 🤗 Transformers library. We will use the `pipeline` abstraction covered in the week 2 tutorial, since it is the fastest way to use fine-tuned models.

In the first example, we will use a model trained to perform NER on the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/index.html) dataset. Recall that these models operate on subword tokens, which somewhat complicates the mapping of predicted tags to the original words. Luckily, this is handled automatically by the `ner` pipeline, which is backed by the `TokenClassificationPipeline` we mentioned in the first tutorial.

In [8]:
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER")
example = "My name is Wolfgang and I live in Berlin"
ner_results = ner(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.99901396, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
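In the output above every entity happens to be a single token. When a word is split into several WordPiece tokens, the continuation pieces are marked with a `##` prefix and need to be merged back into words. The sketch below shows one way to do this in plain Python. The helper `merge_wordpieces` and the split of "Wolfgang" into two pieces are hypothetical and only serve as an illustration; the pipeline also accepts an `aggregation_strategy` argument that groups tokens for you.

```python
def merge_wordpieces(predictions):
    """Merge consecutive WordPiece predictions into word-level entries."""
    words = []
    for pred in predictions:
        if pred["word"].startswith("##") and words:
            # Continuation piece: extend the previous word and keep its label
            prev = words[-1]
            prev["word"] += pred["word"][2:]
            prev["end"] = pred["end"]
            # Keep the lowest score as a conservative confidence estimate
            prev["score"] = min(prev["score"], pred["score"])
        else:
            words.append(dict(pred))
    return words


# Hypothetical token-level output in which the name is split into two pieces
predictions = [
    {"entity": "B-PER", "score": 0.99, "word": "Wolf", "start": 11, "end": 15},
    {"entity": "I-PER", "score": 0.95, "word": "##gang", "start": 15, "end": 19},
    {"entity": "B-LOC", "score": 0.99, "word": "Berlin", "start": 34, "end": 40},
]

for word in merge_wordpieces(predictions):
    print(word["word"], word["entity"], word["start"], word["end"])
```

Taking the label of the first piece and the minimum score is a design choice of this sketch; averaging the scores would be an equally reasonable alternative.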


Let's now build a function that lets us visualize the predictions of a model using displaCy:

In [9]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

def render_tags(nlp, text, tags):
    doc = nlp.make_doc(text)
    for ent in tags:
        # The label is stored under "entity" (per-token output) or "entity_group" (aggregated output)
        label = ent.get("entity", ent.get("entity_group"))
        # Using char_span instead of building the Span manually allows us to use the char indexes
        # produced alongside the model's output
        ent_span = doc.char_span(ent["start"], ent["end"], label=label)
        # char_span returns None when the indexes do not match token boundaries
        if ent_span is not None:
            doc.set_ents([ent_span], default="unmodified")
    return displacy.render(doc, style="ent")

render_tags(nlp, example, ner_results)

### Span Labeling Beyond POS & NER

While POS tagging and NER are by far the two most popular tasks in which per-token labels need to be predicted, they are not the only interesting ones. In the next code cell we will use a model trained to tag personal information in clinical reports, which can be used for **clinical note de-identification**:

In [10]:
note = (
    "Hospital Care Team Service: Orthopedics Inpatient Attending: Roger C Kelly, "
    "MD Attending phys phone: (634)743-5135 Discharge Unit: HCS843 Primary Care "
    "Physician: Hassan V Kim, MD 512-832-5025."
)

deid = pipeline("ner", model="obi/deid_bert_i2b2")
deid_result = deid(note)
deid_result[:3]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity': 'B-STAFF',
  'score': 0.9994332,
  'index': 15,
  'word': 'Roger',
  'start': 61,
  'end': 66},
 {'entity': 'I-STAFF',
  'score': 0.99912274,
  'index': 16,
  'word': 'C',
  'start': 67,
  'end': 68},
 {'entity': 'L-STAFF',
  'score': 0.9993361,
  'index': 17,
  'word': 'Kelly',
  'start': 69,
  'end': 74}]

Let's now customize the function from before to produce only one tag per entity, under the assumption that every entity is composed of contiguous tokens:

In [11]:
def render_tags_deid(nlp, text, tags):
    doc = nlp.make_doc(text)
    start_idx = 0
    end_idx = 0
    for i, ent in enumerate(tags):
        if ent["entity"].startswith("U"):
            start_idx = ent["start"]
            end_idx = ent["end"]
        elif ent["entity"].startswith("B"):
            start_idx = ent["start"]
            continue
        # The second part of the expression lets us have some margin of classification error with a simple
        # heuristic: if the next entity is the same as the current one, we consider it a continuation even
        # if the present one is an end tag (L). Try to remove this to see what happens.
        # An end tag in the last position always closes its entity, since no tag follows it.
        elif ent["entity"].startswith("L") and (i == len(tags) - 1 or tags[i+1]["entity"][2:] != ent["entity"][2:]):
            end_idx = ent["end"]
        else:
            continue
        # The alignment mode parameter is normally set to "strict", meaning that if the start/end indices
        # do not match the token boundaries, the span is not created. We set it to "expand" to make the matching
        # more lenient.
        ent_span = doc.char_span(start_idx, end_idx, label=ent.get("entity")[2:], alignment_mode="expand")
        doc.set_ents([ent_span], default="unmodified")
    return displacy.render(doc, style="ent")

render_tags_deid(nlp, note, deid_result)

As you can see, the model is not perfect, but it can provide a nice baseline to anonymize sensitive information in clinical reports. 
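Once the tags are grouped into spans, anonymizing the text amounts to replacing each span with a placeholder. The following sketch does this in plain Python on a shortened version of the note. The helpers `bilou_to_spans` and `redact` are hypothetical, the character offsets are recomputed for the shortened text, and the decoding is strict: it does not include the lenient heuristic used in `render_tags_deid`.

```python
def bilou_to_spans(tags):
    """Collapse B/I/L and U tags into (start, end, label) character spans."""
    spans, start = [], None
    for tag in tags:
        prefix, label = tag["entity"][0], tag["entity"][2:]
        if prefix == "U":  # single-token entity
            spans.append((tag["start"], tag["end"], label))
            start = None
        elif prefix == "B":  # first token of a multi-token entity
            start = tag["start"]
        elif prefix == "L" and start is not None:  # last token closes the entity
            spans.append((start, tag["end"], label))
            start = None
    return spans


def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, working right to left."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "Inpatient Attending: Roger C Kelly, MD"
tags = [
    {"entity": "B-STAFF", "start": 21, "end": 26},
    {"entity": "I-STAFF", "start": 27, "end": 28},
    {"entity": "L-STAFF", "start": 29, "end": 34},
]
print(redact(text, bilou_to_spans(tags)))
```

Processing the spans from right to left keeps the character offsets of the remaining spans valid while the text is being modified.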

## Other Interesting Text Tagging Tools

In this final section, we will briefly cover two more interesting tagging tools: [Thermostat](https://github.com/DFKI-NLP/thermostat), which uses Hugging Face Transformers and Datasets to visualize interpretability metrics for text classification models, and [AllenNLP](https://allennlp.org/) for Semantic Role Labeling.

### Token Saliency with Thermostat

The concept of **saliency** in interpretable NLP will be covered in the final lesson of the course. For now, it is sufficient to know that we can use different methods, called **feature attribution methods**, to obtain scores approximating the importance of every feature in the input (i.e. tokens in NLP) in determining the model output. Saliency scores are nowadays widely used to interpret the behavior of models, and identify issues in their inference procedure (e.g. relying on spurious features due to overfitting training examples). A fundamental difference from previous examples is that we obtain a **numeric score** representing importance for every token instead of a classification label.

The [Thermostat](https://github.com/DFKI-NLP/thermostat) library provides us with a set of models and datasets for which saliency measures were precomputed, alongside a means of visualizing them inside Jupyter. In the following example, a `bert-base-uncased` model trained to perform sentiment analysis is used alongside a feature attribution method called [Integrated Gradients](https://captum.ai/api/integrated_gradients.html) to highlight salient tokens in a movie review example for which the model predicts the `POSITIVE` label. The saliency scores are stored in the `attribution` attribute of the object.
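To give an intuition of what Integrated Gradients computes, here is a minimal sketch on a toy model with three input features. It is not the implementation behind the precomputed Thermostat scores: the sigmoid model, its weights and the all-zero baseline are assumptions chosen for illustration. The method averages the gradient of the output along a straight path from the baseline to the input, then scales it by the difference between input and baseline.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy classifier: sigmoid of a weighted sum of the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def gradient(x, w):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Approximate the path integral of the gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between the baseline and the input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            totals[i] += g / steps
    return [(xi - b) * t for xi, b, t in zip(x, baseline, totals)]


w = [2.0, -1.0, 0.5]  # assumed weights: one strongly positive, one negative feature
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w)
print(attributions)
# The attributions should add up to the change in the model output
print(sum(attributions), model(x, w) - model(baseline, w))
```

The two values on the last line should nearly coincide: the attributions sum to the difference between the output for the input and the output for the baseline, which is a useful sanity check for any implementation of the method.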

In [12]:
import thermostat

annotated_imdb = thermostat.load("imdb-bert-lig")
imdb_instance = annotated_imdb[429]
imdb_instance.heatmap

Reusing dataset thermostat (/home/gsarti/.cache/huggingface/datasets/thermostat/imdb-bert-lig/1.0.1/0cbe93e1fbe5b8ed0217559442d8b49a80fd4c2787185f2d7940817c67d8707b)


Loading Thermostat configuration: imdb-bert-lig


token_index     0        1         2         3         4         5         6   \
token        [CLS]  amazing     movie         .      some        of       the   
attribution    0.0      1.0  0.028762 -0.206694  0.067878 -0.012668 -0.038177   
text_field    text     text      text      text      text      text      text   

token_index        7         8         9   ...        39        40        41  \
token          script   writing     could  ...     great    acting         .   
attribution -0.114986 -0.173781 -0.083315  ...  0.800842 -0.108838 -0.182696   
text_field       text      text      text  ...      text      text      text   

token_index        42        43        44        45         46        47  \
token            very    poetic         .    highly  recommend         .   
attribution  0.657471  0.301335 -0.190809  0.231953   0.643599  0.061746   
text_field       text      text      text      text       text      text   

token_index     48  
token        [SEP]  
attribu

The library also lets us display the saliency scores visually through the `render` method of each instance, which relies on the `displacy.render` function: red hues mark tokens with a high contribution to the predicted label, while blue hues mark tokens that contribute to the opposite prediction (i.e. that "make the review negative"). We can now clearly see that positive words weigh heavily toward the positive outcome, while some expressions like "could have been better" add a slight negative sentiment to the review.

In [13]:
labels = {0: "NEGATIVE", 1: "POSITIVE"}
print(f"Predicted label: {labels[imdb_instance.predicted_label_index]}")
imdb_instance.render()

Predicted label: POSITIVE
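Besides the visualization, the scores can be ranked directly to find the tokens that matter most. The sketch below uses a few token and attribution pairs copied by hand from the heatmap printed earlier; the helper `top_k` is hypothetical and works on plain tuples rather than on the Thermostat object.

```python
# (token, attribution) pairs copied from the heatmap output of the IMDb instance
token_scores = [
    ("amazing", 1.0),
    ("movie", 0.028762),
    (".", -0.206694),
    ("great", 0.800842),
    ("acting", -0.108838),
    ("very", 0.657471),
    ("poetic", 0.301335),
    ("highly", 0.231953),
    ("recommend", 0.643599),
]


def top_k(pairs, k=3, most_negative=False):
    """Return the k pairs with the highest (or lowest) attribution score."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=not most_negative)[:k]


print("Supporting the prediction:", top_k(token_scores))
print("Opposing the prediction:", top_k(token_scores, most_negative=True))
```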


We can look at another example, in which a BERT model is now used to predict the topic of a newspaper article summary using the same feature attribution method. Many words associated with the predicted `SPORT` topic, such as *team, captain, picks*, are now salient.

In [14]:
import thermostat

annotated_agnews = thermostat.load("ag_news-bert-lig")
agnews_instance = annotated_agnews[60]
agnews_labels = {0: "WORLD", 1: "SPORT", 2: "BUSINESS", 3: "SCIENCE & TECHNOLOGY"}
print(f"Predicted label: {agnews_labels[agnews_instance.predicted_label_index]}")
agnews_instance.render()

Reusing dataset thermostat (/home/gsarti/.cache/huggingface/datasets/thermostat/ag_news-bert-lig/1.0.1/0cbe93e1fbe5b8ed0217559442d8b49a80fd4c2787185f2d7940817c67d8707b)


Loading Thermostat configuration: ag_news-bert-lig
Predicted label: SPORT


Many more pre-loaded examples are available on the [Thermostat Hugging Face demo](https://huggingface.co/spaces/nfel/Thermostat).

### Semantic Role Labeling with AllenNLP

**Semantic Role Labeling (SRL)** is the task of determining the latent predicate-argument structure of a sentence and providing representations that can answer basic questions about sentence meaning, such as who did what to whom. Since it involves highlighting agents and patients in a text, SRL can be framed very naturally as a span labeling task.

For this example, head to the [AllenNLP SRL demo](https://demo.allennlp.org/semantic-role-labeling) and try out some of their examples. Let's consider for example the sentence `Did Uriah honestly think he could beat the game in under three hours?`. The system returns the text labeled as follows:

![Frames returned by the AllenNLP SRL demo](../img/allennlp_srl.png)

We can see that for the predicates `think` and `beat`, the semantic role labeler correctly identifies an argument structure in which `ARG0` thinks about `ARG1` and `ARG0` beats `ARG1`, respectively. This can be a useful component to identify agency relations between different entities in the text.
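As a rough illustration of how such frames could be used downstream, the sketch below turns them into (agent, predicate, patient) triples. The frames are written by hand to approximate the structure described above and are not copied from the demo's output; the dictionary layout and the helper `agency_triples` are hypothetical.

```python
# Hand-written frames approximating the two predicates discussed above
frames = [
    {"V": "think", "ARG0": "Uriah", "ARG1": "he could beat the game in under three hours"},
    {"V": "beat", "ARG0": "he", "ARG1": "the game"},
]


def agency_triples(frames):
    """Extract (ARG0, predicate, ARG1) triples from frames that have both roles."""
    triples = []
    for frame in frames:
        if "ARG0" in frame and "ARG1" in frame:
            triples.append((frame["ARG0"], frame["V"], frame["ARG1"]))
    return triples


for agent, predicate, patient in agency_triples(frames):
    print(f"{agent} -> {predicate} -> {patient}")
```

Frames missing either role are skipped, so predicates without an explicit agent or patient do not produce a triple.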

>💡 **Insight**: A closely related task, **semantic frame parsing**, is currently being researched in our group to understand the societal framing of issues such as [femicides in Italy](https://www.semanticscholar.org/paper/Frame-Semantics-for-Social-NLP-in-Italian%3A-Framing-Minnema-Gemelli/63ec78548897496a6ec44cb89106b782e986f8b6). The authors focus on semantic roles defined by predicates belonging to the categories of `KILLING` and `CAUSING_HARM` to understand the agentivity and responsibility of femicide perpetrators in Italian news.