# Introduction to NLP Annotation Pipelines in Python

## Douglas Rice

*This tutorial was originally created by Burt Monroe for his prior work with the Essex Summer School. I've updated and modified it.*

In this notebook, we'll learn about doing NLP tasks in Python, including tokenization, stemming, part-of-speech tagging, named entity recognition, and dependency parsing. After completing this notebook, you should be familiar with:


1. Tokenization
2. Stemming & Lemmatization
3. Part-of-Speech Tagging
4. Named Entity Recognition
5. Dependency Parsing

# Annotation Pipelines

NLP tasks are typically organized around the concept of an annotation "pipeline" based on a given "language model." Basically, you download/install/load a given model, then pass the model and your input text to the software's annotation pipeline and receive an output object with annotated text.

For example, the pipeline for Stanford CoreNLP is depicted below. The text is first tokenized, then split into sentences, then tokens are tagged with respect to parts of speech, then the tokens are lemmatized, then the named entity recognizer is applied, and finally the dependency parser is applied. The output is an object from which all of those annotations can be accessed.

![CoreNLP Pipeline (Source: https://stanfordnlp.github.io/CoreNLP/index.html)](https://stanfordnlp.github.io/CoreNLP/assets/images/pipeline.png)
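
To make the pipeline idea concrete, here is a minimal sketch in plain Python. It is not how CoreNLP or any real library is implemented: the step functions (`tokenize`, `split_sentences`, `lowercase_forms`) and the dictionary-based document are toy stand-ins invented for illustration. The point is only the shape: each step reads the document, adds one layer of annotation, and hands it on.

```python
import re

def tokenize(doc):
    # Toy tokenizer: runs of word characters, or single punctuation marks.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc

def split_sentences(doc):
    # Toy sentence splitter: close a sentence at ".", "!" or "?".
    sentences, current = [], []
    for tok in doc["tokens"]:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences
    return doc

def lowercase_forms(doc):
    # Stand-in for a real lemmatizer: it only lowercases each token.
    doc["lower"] = [tok.lower() for tok in doc["tokens"]]
    return doc

def run_pipeline(text, steps):
    # Apply each annotation step in order to the same document object.
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc

toy_doc = run_pipeline(
    "Nikola Jokic won the award. Joel Embiid did not.",
    [tokenize, split_sentences, lowercase_forms],
)
print(toy_doc["tokens"])
print(toy_doc["sentences"])
```

Order matters here just as it does in the real pipelines: `split_sentences` relies on the `tokens` layer that `tokenize` added.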

There are many different NLP libraries. In this notebook, we will focus on spaCy, Stanza/CoreNLP, and NLTK.






## spaCy

The **spaCy** package is, by some accounts, now the "default" standard NLP pipeline, especially in industry. Unlike its Python predecessor, NLTK, spaCy is "opinionated" -- it tries to provide easy, computationally efficient access to the best available model for any given task; NLTK provides many options and more direct ability for the researcher to test and modify different models. Also in contrast to NLTK, spaCy interacts nicely with modern neural / deep learning methods.

In [None]:
!pip install spacy



### The standard spaCy pipeline (tokens, lemmas, pos, dependencies, ner, morphology, etc.)

As noted above, the first step is to load the model. With spaCy, you have a number of different models to choose from that follow a standard naming convention. The first part specifies the language ("en"), the second part specifies the capabilities ("core"), the third part specifies what it was trained on ("web" or "news"), and the last part specifies the size ("sm", "md", "lg", or "trf").

In the below, we load "en_core_web_sm", which is the "English, core capability, trained on the web, small" model.

Note that the model is loaded and assigned to the variable `nlp_spacy`. Now `nlp_spacy` behaves like a *function* (strictly speaking, it is a callable `Language` object) that says "run this model's annotation pipeline on this string." This is very standard syntax for NLP pipelines.

By default, spaCy runs *everything* supported by the given model.

In [None]:
import spacy 
import sys

nlp_spacy = spacy.load("en_core_web_sm")


We'll set up a small text and then look at what spaCy does for us.

In [None]:
annotated_doc_spacy = nlp_spacy("Joel Embiid should have been the 2022 NBA MVP, but instead Nikola Jokic won the award.")
for token in annotated_doc_spacy:
    print(token.text, token.pos_, token.dep_)

Joel PROPN compound
Embiid PROPN nsubj
should AUX aux
have AUX aux
been AUX ROOT
the DET det
2022 NUM nummod
NBA PROPN compound
MVP PROPN attr
, PUNCT punct
but CCONJ cc
instead ADV advmod
Nikola PROPN compound
Jokic PROPN nsubj
won VERB conj
the DET det
award NOUN dobj
. PUNCT punct


Notice that we are only printing a few fields in the above. The annotated object that we created (`annotated_doc_spacy`) retained much more detailed information. To access that information, we just need to specify what fields from the object we want:

In [None]:
for token in annotated_doc_spacy:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Joel Joel PROPN NNP compound Xxxx True False
Embiid Embiid PROPN NNP nsubj Xxxxx True False
should should AUX MD aux xxxx True True
have have AUX VB aux xxxx True True
been be AUX VBN ROOT xxxx True True
the the DET DT det xxx True True
2022 2022 NUM CD nummod dddd False False
NBA NBA PROPN NNP compound XXX True False
MVP MVP PROPN NNP attr XXX True False
, , PUNCT , punct , False False
but but CCONJ CC cc xxx True True
instead instead ADV RB advmod xxxx True False
Nikola Nikola PROPN NNP compound Xxxxx True False
Jokic Jokic PROPN NNP nsubj Xxxxx True False
won win VERB VBD conj xxx True False
the the DET DT det xxx True True
award award NOUN NN dobj xxxx True False
. . PUNCT . punct . False False


Beyond what we specified above, each token in the annotated document carries a host of additional attributes, which are listed below. Many attributes that are intuitively strings (e.g., the token's lemma or its part-of-speech tag) are stored internally by spaCy as "hashes" (integers). The attribute that provides the corresponding text ends in an underscore character.

* `doc`: The parent document.
* `lex`: The underlying lexeme.
* `sent`: The sentence span that this token is a part of.
* `text`:	Verbatim text content.
* `text_with_ws`:	Text content, with trailing space character if present.
* `whitespace_`:	Trailing space character if present.
* `orth`:	ID of the verbatim text content.
* `orth_`:	Verbatim text content (identical to Token.text).
* `vocab`:	The vocab object of the parent Doc.
* `tensor`:	The token’s slice of the parent Doc’s tensor.
* `head`:	The syntactic parent, or “governor”, of this token.
* `left_edge`: The leftmost token of this token’s syntactic descendants.
* `right_edge`:	The rightmost token of this token’s syntactic descendants.
* `i`:	The index of the token within the parent document.
* `ent_type`:	Named entity type. (integer)
* `ent_type_`:	Named entity type. (string)
* `ent_iob`:	IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.
* `ent_iob_`:	IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.
* `ent_kb_id`:	Knowledge base ID that refers to the named entity this token is a part of, if any. (integer)
* `ent_kb_id_`:	Knowledge base ID that refers to the named entity this token is a part of, if any. (string)
* `ent_id`:	ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.
* `ent_id_`:	ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.
* `lemma`:	Base form of the token, with no inflectional suffixes. (integer)
* `lemma_`:	Base form of the token, with no inflectional suffixes. (string)
* `norm`:	The token’s norm, i.e. a normalized form of the token text. Can be set in the language’s tokenizer exceptions. (integer)
* `norm_`:	The token’s norm, i.e. a normalized form of the token text. Can be set in the language’s tokenizer exceptions. (string)
* `lower`:	Lowercase form of the token. (integer)
* `lower_`:	Lowercase form of the token text. Equivalent to Token.text.lower(). (string)
* `shape`:	Transform of the token’s string to show orthographic features. Alphabetic characters are replaced by x or X, and numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example,"Xxxx"or"dd". (integer)
* `shape_`:	Transform of the token’s string to show orthographic features. Alphabetic characters are replaced by x or X, and numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example,"Xxxx"or"dd". (string)
* `prefix`:	Hash value of a length-N substring from the start of the token. Defaults to N=1. (integer)
* `prefix_`:	A length-N substring from the start of the token. Defaults to N=1.
* `suffix`:	Hash value of a length-N substring from the end of the token. Defaults to N=3. (integer)
* `suffix_`:	Length-N substring from the end of the token. Defaults to N=3.
* `is_alpha`:	Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().
* `is_ascii`:	Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).
* `is_digit`:	Does the token consist of digits? Equivalent to token.text.isdigit().
* `is_lower`:	Is the token in lowercase? Equivalent to token.text.islower().
* `is_upper`:	Is the token in uppercase? Equivalent to token.text.isupper().
* `is_title`:	Is the token in titlecase? Equivalent to token.text.istitle().
* `is_punct`:	Is the token punctuation?
* `is_left_punct`:	Is the token a left punctuation mark, e.g. "(" ?
* `is_right_punct`:	Is the token a right punctuation mark, e.g. ")" ?
* `is_space`:	Does the token consist of whitespace characters? Equivalent to token.text.isspace().
* `is_bracket`:	Is the token a bracket?
* `is_quote`:	Is the token a quotation mark?
* `is_currency`:	Is the token a currency symbol?
* `like_url`:	Does the token resemble a URL?
* `like_num`:	Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc.
* `like_email`:	Does the token resemble an email address?
* `is_oov`:	Is the token out-of-vocabulary (i.e. does it not have a word vector)?
* `is_stop`:	Is the token part of a “stop list”?
* `pos`:	Coarse-grained part-of-speech from the Universal POS tag set. (integer)
* `pos_`:	Coarse-grained part-of-speech from the Universal POS tag set. (string)
* `tag`:	Fine-grained part-of-speech. (integer)
* `tag_`:	Fine-grained part-of-speech. (string)
* `morph`:	Morphological analysis.
* `dep`:	Syntactic dependency relation. (integer)
* `dep_`:	Syntactic dependency relation. (string)
* `lang`:	Language of the parent document’s vocabulary. (integer)
* `lang_`:	Language of the parent document’s vocabulary. (string)
* `prob`:	Smoothed log probability estimate of token’s word type (context-independent entry in the vocabulary).
* `idx`:	The character offset of the token within the parent document.
* `sentiment`:	A scalar value indicating the positivity or negativity of the token.
* `lex_id`:	Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
* `rank`:	Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
* `cluster`:	Brown cluster ID.

As a sidebar, note that spaCy stores vocabulary strings internally as hashes, and the mapping can be looked up in either direction in Python:

In [None]:
print(annotated_doc_spacy.vocab.strings["Apple"]) # 6418411030699964375
print(annotated_doc_spacy.vocab.strings[6418411030699964375]) # "Apple"

6418411030699964375
Apple


As noted, the model here is "en_core_web_sm". Small models are more compact and computationally efficient than the medium, large, or transformer-based models, but they are less accurate and do not come with pretrained embeddings. You can see which models are available, along with exact details of each model and performance statistics, here: https://spacy.io/models/. As of this writing, there are pretrained models for 22 languages (Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, and Swedish), as well as a multi-language model.




### Named entities

As we saw above, the spaCy pipeline includes a named entity recognizer:

In [None]:
for ent in annotated_doc_spacy.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Joel Embiid 0 11 PERSON
2022 33 37 DATE
NBA MVP 38 45 ORG
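
Entity spans like these correspond to the per-token IOB codes described in the attribute list (`ent_iob_` and `ent_type_`). As a rough illustration of how token-level tags map onto spans, here is a plain-Python sketch. The `(token, iob, label)` triples in `example_tags` are entered by hand to be consistent with the three entities printed above; they are not read from the spaCy object, and `group_iob` is a hypothetical helper, not part of spaCy.

```python
def group_iob(tagged_tokens):
    """Collect (text, label) entity spans from (token, iob, label) triples."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, label in tagged_tokens:
        if iob == "B":
            # A new entity begins; close any entity still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(text)
        else:
            # Outside any entity; close the open one, if there is one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

# Hand-entered tags for the first nine tokens of the example sentence.
example_tags = [
    ("Joel", "B", "PERSON"), ("Embiid", "I", "PERSON"),
    ("should", "O", ""), ("have", "O", ""), ("been", "O", ""), ("the", "O", ""),
    ("2022", "B", "DATE"), ("NBA", "B", "ORG"), ("MVP", "I", "ORG"),
]
print(group_iob(example_tags))
```

Note that "2022" and "NBA" both carry a "B" code, which is what keeps the date and the organization from merging into a single span.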


For comparison with the other NLP pipelines discussed below and in other notebooks, we can see what named entities spaCy extracts from an excerpt of Ketanji Brown Jackson's acceptance speech.

In [None]:
kbj = "I have spent years toiling away in the relative solitude of my chambers, with just my law clerks, in isolation. So, it's been somewhat overwhelming, in a good way, to recently be flooded with thousands of notes and cards and photos expressing just how much this moment means to so many people.\nThe notes that I've received from children are particularly cute and especially meaningful because, more than anything, they speak directly to the hope and promise of America.\nIt has taken 232 years and 115 prior appointments for a Black woman to be selected to serve on the Supreme Court of the United States.\nBut we've made it. We've made it, all of us. All of us.\nAnd our children are telling me that they see now, more than ever, that, here in America, anything is possible.\nThey also tell me that I'm a role model, which I take both as an opportunity and as a huge responsibility. I am feeling up to the task, primarily because I know that I am not alone. I am standing on the shoulders of my own role models, generations of Americans who never had anything close to this kind of opportunity but who got up every day and went to work believing in the promise of America, showing others through their determination and, yes, their perseverance that good -- good things can be done in this great country -- from my grandparents on both sides who had only a grade-school education but instilled in my parents the importance of learning, to my parents who went to racially segregated schools growing up and were the first in their families to have the chance to go to college.\nI am also ever buoyed by the leadership of generations past who helped to light the way: Dr. Martin Luther King Jr., Justice Thurgood Marshall, and my personal heroine, Judge Constance Baker Motley. They, and so many others, did the heavy lifting that made this day possible. And for all of the talk of this historic nomination and now confirmation, I think of them as the true pathbreakers. 
I am just the very lucky first inheritor of the dream of liberty and justice for all.\nTo be sure, I have worked hard to get to this point in my career, and I have now achieved something far beyond anything my grandparents could've possibly ever imagined. But no one does this on their own. The path was cleared for me so that I might rise to this occasion. And in the poetic words of Dr. Maya Angelou, I do so now, while 'bringing the gifts...my ancestors gave.'  'I am the dream and the hope of the slave.' So as I take on this new role, I strongly believe that this is a moment in which all Americans can take great pride. We have come a long way toward perfecting our union. In my family, it took just one generation to go from segregation to the Supreme Court of the United States. And it is an honor -- the honor of a lifetime -- for me to have this chance to join the Court, to promote the rule of law at the highest level, and to do my part to carry our shared project of democracy and equal justice under law forward, into the future. Thank you, again, Mr. President and members of the Senate for this incredible honor."

In [None]:
kbj_ann_spacy = nlp_spacy(kbj) 
for ent in kbj_ann_spacy.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

years 13 18 DATE
thousands 192 201 CARDINAL
America 461 468 GPE
232 years 483 492 DATE
115 497 500 CARDINAL
the Supreme Court 565 582 ORG
the United States 586 603 GPE
America 742 749 GPE
Americans 1024 1033 NORP
America 1161 1168 GPE
first 1511 1516 ORDINAL
Martin Luther King Jr. 1665 1687 PERSON
Thurgood Marshall 1697 1714 PERSON
Constance Baker Motley 1747 1769 PERSON
this day 1829 1837 DATE
first 1989 1994 ORDINAL
Maya Angelou 2352 2364 PERSON
Americans 2557 2566 NORP
the Supreme Court 2710 2727 ORG
the United States 2731 2748 GPE
Court 2838 2843 ORG
Senate 3058 3064 ORG


### Morphological features

A lemma can be *inflected* with **morphological features** to produce a "surface form". Examples of morphological features include case, number, verb form, tense, person, and mood. spaCy conducts morphological analysis as part of its pipeline.

In [None]:
ann2 = nlp_spacy("I walked the dog yesterday.")
print(ann2[1],ann2[1].lemma_,ann2[1].pos_,ann2[1].tag_, [mf for mf in ann2[1].morph])
ann3 = nlp_spacy("I will walk the dog tomorrow.")
print(ann3[2],ann3[2].lemma_,ann3[2].pos_,ann3[2].tag_,[mf for mf in ann3[2].morph])
ann4 = nlp_spacy("I am walking the dog.")
print(ann4[2],ann4[2].lemma_,ann4[2].pos_,ann4[2].tag_,[mf for mf in ann4[2].morph])
ann5 = nlp_spacy("I was walking the dog.")
print(ann5[2],ann5[2].lemma_,ann5[2].pos_,ann5[2].tag_,[mf for mf in ann5[2].morph])

walked walk VERB VBD ['Tense=Past', 'VerbForm=Fin']
walk walk VERB VB ['VerbForm=Inf']
walking walk VERB VBG ['Aspect=Prog', 'Tense=Pres', 'VerbForm=Part']
walking walk VERB VBG ['Aspect=Prog', 'Tense=Pres', 'VerbForm=Part']


I invite the grammar aficionados among you to assess whether those were all as expected. The first is tagged as a finite past-tense verb. The second, which follows the modal "will", is labeled as an infinitive. The last two are labeled identically -- with the detailed POS tag VBG (gerund or present participle) -- as progressive present participles, even though the last sentence is in the past tense; there, the past tense is carried by the auxiliary "was" rather than by "walking".
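
The morphological features print as `Feature=Value` strings, which are awkward to filter on directly. A small sketch, assuming you only have the printed strings to work with: `features_to_dict` is a hypothetical helper (not a spaCy function) that turns such a list into a dictionary. The two input lists are copied from the output above.

```python
def features_to_dict(features):
    """Turn ['Tense=Past', 'VerbForm=Fin'] into {'Tense': 'Past', 'VerbForm': 'Fin'}."""
    return dict(feature.split("=", 1) for feature in features)

# Feature lists copied from the printed output for "walked" and "walking".
walked = features_to_dict(["Tense=Past", "VerbForm=Fin"])
walking = features_to_dict(["Aspect=Prog", "Tense=Pres", "VerbForm=Part"])

print(walked["Tense"])
print(walking.get("Aspect"))
# "walked" has no Aspect feature, so .get() returns None instead of raising.
print(walked.get("Aspect"))
```

With the features in dictionary form, a question like "which verbs are marked as progressive?" becomes a simple lookup.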



### Noun phrases
As part of its dependency parsing, spaCy will isolate **noun chunks** or **noun phrases**.

In [None]:
docnp = nlp_spacy("Joel Embiid should have been the 2022 NBA MVP, but instead Nikola Jokic won the award.")
for chunk in docnp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Joel Embiid Embiid nsubj been
the 2022 NBA MVP MVP attr been
Nikola Jokic Jokic nsubj won
the award award dobj won


### Dependency parse trees

Dependency parsing is a complex subject we'll discuss in more detail separately. For present purposes, I'll just show some basics.



spaCy provides a visualization tool for its dependency parse, called **displacy**, so we need to import it.

spaCy's dependency parse is a tree that can be navigated like one. Every word has exactly one head: the word that points to it via an arc. The one exception is the root of the sentence, which has no incoming arc (in spaCy, the root's head is the root itself). The example from the documentation looks like this:


In [None]:
from spacy import displacy

docnp = nlp_spacy("Autonomous cars shift insurance liability toward manufacturers")
displacy.render(docnp, style='dep', jupyter=True, options={'distance': 90})

So we iterate over words to find an arc of interest "from below." Specifically, in this example, we search for a verb that has a subject (a verb with an "nsubj" arc leading from it) like so:

In [None]:
from spacy.symbols import nsubj, VERB
# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in docnp:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

{shift}


The verb "shift" is the only verb with a subject ("cars"). (The core subject-verb-object construction is "cars shift liability.")

spaCy provides attributes that can be used to traverse the tree. For example, the attribute `lefts` contains the child nodes to the left of a given node, and `rights` contains the child nodes to the right.

In [None]:
for left in docnp[2].lefts:
  print("left", left,left.dep_)
for right in docnp[2].rights:
  print("right",right,right.dep_)

left cars nsubj
right liability dobj
right toward prep


So, "shift" has three children, the subject "cars" to its left, the direct object "liability" to its right, and the preposition "toward" to its right.
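
To see why "one head per word" is enough to define the whole tree, here is a plain-Python sketch that stores the parse as a list of head indices and derives children from it. It does not use spaCy; `children`, `lefts`, `rights`, and `subtree` are hypothetical helpers named after the spaCy attributes they imitate. The heads for "cars", "liability", and "toward" match the output above; the heads assumed for "Autonomous", "insurance", and "manufacturers" are entered by hand and are an assumption about the parse.

```python
tokens = ["Autonomous", "cars", "shift", "insurance",
          "liability", "toward", "manufacturers"]
# heads[i] is the index of token i's head; the root points to itself.
heads = [1, 2, 2, 4, 2, 2, 5]

def children(i):
    # All tokens whose head is token i (excluding the root's self-loop).
    return [j for j, h in enumerate(heads) if h == i and j != i]

def lefts(i):
    return [tokens[j] for j in children(i) if j < i]

def rights(i):
    return [tokens[j] for j in children(i) if j > i]

def subtree(i):
    """Indices of token i and all of its descendants, in sentence order."""
    found = [i]
    for child in children(i):
        found.extend(subtree(child))
    return sorted(found)

root = [i for i, h in enumerate(heads) if h == i][0]
print(tokens[root], lefts(root), rights(root))
print([tokens[j] for j in subtree(4)])
```

Because every non-root token stores exactly one head, the full set of arcs fits in a single list, and both "from below" searches (follow `heads`) and "from above" searches (scan for `children`) fall out of it.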

Dependency parsing is the basic foundation for many information extraction applications, such as political event data production.

## Stanza (formerly Stanford NLP) and CoreNLP


Stanza -- formerly StanfordNLP -- is a Python library from the Stanford NLP group. Stanza provides a wrapper to CoreNLP, the research group's Java library, and it inherits CoreNLP functionality.

The official description:

> Stanza is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, to give a syntactic structure dependency parse, and to recognize named entities. The toolkit is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism.

> Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library. You will get much faster performance if you run the software on a GPU-enabled machine.

> In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.

> To summarize, Stanza features:

> Native Python implementation requiring minimal efforts to set up;

> Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological features tagging, dependency parsing, and named entity recognition;

> Pretrained neural models supporting 66 (human) languages;

> A stable, officially maintained Python interface to CoreNLP.

> Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations. 2020.

Stanford's **CoreNLP** is generally considered pretty close to the gold-standard NLP engine. From its official page (https://stanfordnlp.github.io/CoreNLP/):

> CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish.

Stanza is now *the* way to access CoreNLP through Python.


In [None]:
!pip install stanza


### The standard Stanza pipeline (tokens, pos, lemmas, dependency parse, sentiment, ner)

Just as we did before, we need to begin by loading the model. We'll load the "en" (or English) model. You can find the full list of models [here](https://stanfordnlp.github.io/stanza/available_models.html). There are more than 60 models listed there, so you have a lot of options depending on your need.

In [None]:
import stanza

stanza.download('en')
nlp_stanza = stanza.Pipeline('en')

2022-07-13 01:42:34 INFO: Downloading default packages for language: en (English)...


2022-07-13 01:42:52 INFO: Finished downloading models and saved to /root/stanza_resources.


2022-07-13 01:42:54 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-07-13 01:42:54 INFO: Use device: cpu
2022-07-13 01:42:54 INFO: Loading: tokenize
2022-07-13 01:42:54 INFO: Loading: pos
2022-07-13 01:42:54 INFO: Loading: lemma
2022-07-13 01:42:54 INFO: Loading: depparse
2022-07-13 01:42:54 INFO: Loading: sentiment
2022-07-13 01:42:54 INFO: Loading: constituency
2022-07-13 01:42:55 INFO: Loading: ner
2022-07-13 01:42:56 INFO: Done loading processors!


Note that we downloaded the "default" processors for the NLP pipeline. You can specify *which* specific processors to use for any given task (at least if options exist in that particular language model) as well as substitute your own.

The default pipeline includes a tokenizer, a POS tagger, a lemmatizer, a dependency parser, a sentiment analyzer, a constituency parser, and a named-entity recognizer.

We apply the pipeline to our text and assign it to a Document object, which will include the annotations.

In [None]:
annotated_doc_stanza = nlp_stanza("Joel Embiid should have been the 2022 NBA MVP. Instead, Nikola Jokic won the award.")

The Document now has attributes including `sentences`, a list of Sentence objects. Sentence objects have attributes that include a list of `tokens`, `words`, entities (`ents`), `dependencies`, and `sentiment` (if there was a sentiment processor).  

In [None]:
for sentence in annotated_doc_stanza.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Joel Joel PROPN
Embiid Embiid PROPN
should should AUX
have have AUX
been be AUX
the the DET
2022 2022 NUM
NBA NBA PROPN
MVP MVP NOUN
. . PUNCT
Instead instead ADV
, , PUNCT
Nikola Nikola PROPN
Jokic Jokic PROPN
won win VERB
the the DET
award award NOUN
. . PUNCT


### Named entities and dependency parse trees

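Entities are provided as a list of dictionary-like objects, one per entity. Dependencies are provided as a list of (head word, relation, dependent word) triples, in which each word is again printed as a dictionary of its annotations.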

In [None]:
for sentence in annotated_doc_stanza.sentences:
    print(sentence.ents)
    print(sentence.dependencies)

[{
  "text": "Joel Embiid",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 11
}, {
  "text": "2022",
  "type": "DATE",
  "start_char": 33,
  "end_char": 37
}, {
  "text": "NBA",
  "type": "ORG",
  "start_char": 38,
  "end_char": 41
}]
[({
  "id": 9,
  "text": "MVP",
  "lemma": "MVP",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 0,
  "deprel": "root",
  "start_char": 42,
  "end_char": 45
}, 'nsubj', {
  "id": 1,
  "text": "Joel",
  "lemma": "Joel",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 9,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 4
}), ({
  "id": 1,
  "text": "Joel",
  "lemma": "Joel",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 9,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 4
}, 'flat', {
  "id": 2,
  "text": "Embiid",
  "lemma": "Embiid",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 1,
  "deprel": "flat",
  "start_char": 5,
  "end_char": 11
}), 

As a comparison with spaCy, let's look at the entities we've identified from Ketanji Brown Jackson's speech with Stanza, and compare those to ones identified with spaCy. First, we need to run the `kbj` speech through our Stanza pipeline. Note that this takes a minute or so to run.

In [None]:
kbj_ann_stanza = nlp_stanza(kbj)

In [None]:
for sentence in kbj_ann_stanza.sentences:
    print(sentence.ents)

[{
  "text": "years",
  "type": "DATE",
  "start_char": 13,
  "end_char": 18
}]
[{
  "text": "thousands",
  "type": "CARDINAL",
  "start_char": 192,
  "end_char": 201
}]
[{
  "text": "America",
  "type": "GPE",
  "start_char": 461,
  "end_char": 468
}]
[{
  "text": "232 years",
  "type": "DATE",
  "start_char": 483,
  "end_char": 492
}, {
  "text": "115",
  "type": "CARDINAL",
  "start_char": 497,
  "end_char": 500
}, {
  "text": "Black",
  "type": "NORP",
  "start_char": 526,
  "end_char": 531
}, {
  "text": "the Supreme Court",
  "type": "ORG",
  "start_char": 565,
  "end_char": 582
}, {
  "text": "the United States",
  "type": "GPE",
  "start_char": 586,
  "end_char": 603
}]
[]
[]
[]
[{
  "text": "America",
  "type": "GPE",
  "start_char": 742,
  "end_char": 749
}]
[]
[]
[{
  "text": "Americans",
  "type": "NORP",
  "start_char": 1024,
  "end_char": 1033
}, {
  "text": "America",
  "type": "GPE",
  "start_char": 1161,
  "end_char": 1168
}, {
  "text": "first",
  "type": "ORDINAL",
 

As a reminder, here's what we found with spaCy:

In [None]:
for ent in kbj_ann_spacy.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

years 13 18 DATE
thousands 192 201 CARDINAL
America 461 468 GPE
232 years 483 492 DATE
115 497 500 CARDINAL
the Supreme Court 565 582 ORG
the United States 586 603 GPE
America 742 749 GPE
Americans 1024 1033 NORP
America 1161 1168 GPE
first 1511 1516 ORDINAL
Martin Luther King Jr. 1665 1687 PERSON
Thurgood Marshall 1697 1714 PERSON
Constance Baker Motley 1747 1769 PERSON
this day 1829 1837 DATE
first 1989 1994 ORDINAL
Maya Angelou 2352 2364 PERSON
Americans 2557 2566 NORP
the Supreme Court 2710 2727 ORG
the United States 2731 2748 GPE
Court 2838 2843 ORG
Senate 3058 3064 ORG


## NLTK (Natural Language Toolkit)

NLTK is the longest-established NLP library. It has lots of tools for lots of NLP tasks in lots of languages, including classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It interfaces to "over 50 corpora and lexical resources such as WordNet" (many standard corpora are available directly from NLTK). It is easier to tweak / modify / extend functionality in NLTK than in spaCy, and there is a large user community, so it is easy to find lots of examples, etc. There is a free book that serves as most people's entrée into NLTK: Steven Bird, Ewan Klein, and Edward Loper, "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit", updated for Python 3 and NLTK 3: http://www.nltk.org/book.

NLTK is still widely used, but it is not integrated with neural network / word embedding approaches and definitely not as hip anymore.

There isn't a generic "pipeline" command that does a default series of sequence labeling tasks, as there is with spaCy and Stanza. You need to download and apply models/resources for different tasks.



In [None]:
!pip install nltk



### Tokenization

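There are roughly 20 tokenizers available in NLTK. The generic-sounding `sent_tokenize` command uses NLTK's recommended `punkt` sentence tokenizer; `word_tokenize` first splits the text into sentences with `punkt` and then splits each sentence into words with a Treebank-style tokenizer.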

In [None]:
import nltk
nltk.download("punkt")
from nltk import word_tokenize, sent_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
sent = "Joel Embiid should have been the 2022 NBA MVP. Instead, Nikola Jokic won the award."
tok_nltk = word_tokenize(sent)
print(tok_nltk)

['Joel', 'Embiid', 'should', 'have', 'been', 'the', '2022', 'NBA', 'MVP', '.', 'Instead', ',', 'Nikola', 'Jokic', 'won', 'the', 'award', '.']


### POS tagging

There are about a dozen different taggers available in the `nltk.tag` module. The one that seems to be used in most examples is `pos_tag`. It is applied to tokenized text, so we begin to see a pipeline forming.

In [None]:
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
tagged = pos_tag(tok_nltk)                 
print(tagged)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('Joel', 'NNP'), ('Embiid', 'NNP'), ('should', 'MD'), ('have', 'VB'), ('been', 'VBN'), ('the', 'DT'), ('2022', 'CD'), ('NBA', 'NNP'), ('MVP', 'NNP'), ('.', '.'), ('Instead', 'RB'), (',', ','), ('Nikola', 'NNP'), ('Jokic', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('award', 'NN'), ('.', '.')]
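
Because `pos_tag` returns a plain list of `(token, tag)` tuples, ordinary Python is enough to summarize or filter it. The sketch below hard-codes the tagged output shown above so that it runs without downloading any models; the variable names `tag_counts` and `proper_nouns` are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output above (hard-coded so no model is needed).
tagged = [('Joel', 'NNP'), ('Embiid', 'NNP'), ('should', 'MD'), ('have', 'VB'),
          ('been', 'VBN'), ('the', 'DT'), ('2022', 'CD'), ('NBA', 'NNP'),
          ('MVP', 'NNP'), ('.', '.'), ('Instead', 'RB'), (',', ','),
          ('Nikola', 'NNP'), ('Jokic', 'NNP'), ('won', 'VBD'), ('the', 'DT'),
          ('award', 'NN'), ('.', '.')]

# Count how often each Penn Treebank tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the tokens tagged as singular proper nouns (NNP).
proper_nouns = [token for token, tag in tagged if tag == 'NNP']

print(tag_counts.most_common(3))
print(proper_nouns)
```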


### Named entities

In turn, the tagged object can be passed to a named entity "chunker." A chunker divides the tokens into "chunks" -- non-overlapping sequences of tokens. This is also known as shallow parsing. The recommended NLTK named entity chunker is accessed through the `ne_chunk` command.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

named_ents = nltk.ne_chunk(tagged, binary=False)
print(named_ents)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


(S
  (PERSON Joel/NNP)
  (PERSON Embiid/NNP)
  should/MD
  have/VB
  been/VBN
  the/DT
  2022/CD
  (ORGANIZATION NBA/NNP)
  MVP/NNP
  ./.
  Instead/RB
  ,/,
  (PERSON Nikola/NNP Jokic/NNP)
  won/VBD
  the/DT
  award/NN
  ./.)


It correctly identifies "Nikola Jokic" as a single person, but it splits "Joel" and "Embiid" into two different people. That's not great. The `binary=False` option (which is also the default) asks for classification into types of named entities: PERSON, ORGANIZATION, GPE, etc. Setting `binary=True` instead just flags each chunk as a named entity (labeled `NE`) without assigning a type.

Note that this returns an nltk Tree object, which needs to be traversed in a tree-like way for some purposes.
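
As a rough sketch of what "tree-like" traversal looks like, the block below rebuilds part of the chunker output above by hand as an `nltk.Tree` (so it runs without the chunker models) and collects the labeled subtrees with `Tree.subtrees`. The hand-built tree and the variable names are assumptions for illustration only.

```python
from nltk.tree import Tree

# Hand-built stand-in for part of the ne_chunk output above (no models required).
named_ents = Tree('S', [
    Tree('PERSON', [('Nikola', 'NNP'), ('Jokic', 'NNP')]),
    ('won', 'VBD'),
    ('the', 'DT'),
    ('award', 'NN'),
    ('.', '.'),
])

# Every named entity is a subtree whose label is not the root label 'S';
# tokens outside any entity are plain (token, tag) tuples and are skipped.
entities = [
    (subtree.label(), ' '.join(token for token, tag in subtree.leaves()))
    for subtree in named_ents.subtrees(filter=lambda t: t.label() != 'S')
]

print(entities)
```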

Let's return to the Ketanji Brown Jackson speech example one more time and build our own pipeline from these pieces. We first apply the sentence tokenizer, then the word tokenizer, then the POS tagger, then the named entity chunker. In the Tree object output by the chunker, tokens that are *not* part of a named entity sit directly under the root as plain `(token, tag)` tuples, while each named entity is a labeled subtree. So we'll just barrel through the Tree by brute force: look at every child of the root, check whether it has a label, and output only those that do.

In [None]:
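# kbj holds the speech text loaded earlier in the notebook
for sent in nltk.sent_tokenize(kbj):
    # tokenize -> POS-tag -> chunk; each child is a labeled subtree or a (token, tag) tuple
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):  # only named-entity subtrees have a label
            print(chunk.label(), ' '.join(c[0] for c in chunk))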

GPE America
ORGANIZATION Supreme Court
GPE United States
GPE America
GPE America
PERSON Martin Luther
ORGANIZATION Justice Thurgood Marshall
ORGANIZATION Supreme Court
GPE United States
PERSON Mr.
ORGANIZATION Senate


This one misses some entities (Maya Angelou), mislabels "Justice Thurgood Marshall" as an organization, and creates an odd one ("Mr."). It's pretty clearly not working as well as the other two pipelines, spaCy and Stanza.

### Noun phrases

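Speaking of noun phrases ... noun phrase chunking in NLTK requires you to define a pattern of part-of-speech tags that you consider to be a noun phrase and then chunk the tagged tokens with a regular-expression parser (`nltk.RegexpParser`). There are lots and lots of patterns that folks consider noun phrases. A couple of them are demonstrated below. Note that the second pattern matches only the exact tag `NN`, so it skips proper nouns (`NNP`) and picks up just "award" in this sentence.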

In [None]:
sent = "Joel Embiid should have been the 2022 NBA MVP. Instead, Nikola Jokic won the award."
tagged_sent = nltk.pos_tag(nltk.word_tokenize(sent))

NPpattern1 = r"""NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"""
chunkParser = nltk.RegexpParser(NPpattern1)
chunked_sent = chunkParser.parse(tagged_sent)
print(chunked_sent)

NPpattern2 = r"""
    NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
    """
chunkParser = nltk.RegexpParser(NPpattern2)
chunked_sent = chunkParser.parse(tagged_sent)
print(chunked_sent)



(S
  (NP Joel/NNP Embiid/NNP)
  should/MD
  have/VB
  been/VBN
  the/DT
  2022/CD
  (NP NBA/NNP MVP/NNP)
  ./.
  Instead/RB
  ,/,
  (NP Nikola/NNP Jokic/NNP)
  won/VBD
  the/DT
  award/NN
  ./.)
(S
  Joel/NNP
  Embiid/NNP
  should/MD
  have/VB
  been/VBN
  the/DT
  2022/CD
  NBA/NNP
  MVP/NNP
  ./.
  Instead/RB
  ,/,
  Nikola/NNP
  Jokic/NNP
  won/VBD
  the/DT
  (NP award/NN)
  ./.)
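
As one more illustration, here is a sketch of a pattern that allows an optional determiner and any mix of numbers and adjectives before the nouns, and that accepts every noun tag (`NN`, `NNS`, `NNP`, ...). The pattern, named `NPpattern3` here, is just one reasonable choice and not a standard definition of a noun phrase. The tagged tokens are hard-coded from the `pos_tag` output earlier in this section so that no models are needed.

```python
import nltk

# Tagged tokens copied from the pos_tag output earlier in this section.
tagged_sent = [('Joel', 'NNP'), ('Embiid', 'NNP'), ('should', 'MD'), ('have', 'VB'),
               ('been', 'VBN'), ('the', 'DT'), ('2022', 'CD'), ('NBA', 'NNP'),
               ('MVP', 'NNP'), ('.', '.'), ('Instead', 'RB'), (',', ','),
               ('Nikola', 'NNP'), ('Jokic', 'NNP'), ('won', 'VBD'), ('the', 'DT'),
               ('award', 'NN'), ('.', '.')]

# Hypothetical pattern: optional determiner, any numbers/adjectives, one or more nouns.
NPpattern3 = r"NP: {<DT>?<CD|JJ>*<NN.*>+}"
chunked_sent = nltk.RegexpParser(NPpattern3).parse(tagged_sent)

# Collect the text of each NP chunk by joining the tokens of its leaves.
noun_phrases = [
    ' '.join(token for token, tag in subtree.leaves())
    for subtree in chunked_sent.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
```

On this sentence the pattern keeps "the 2022 NBA MVP" and "the award" together as whole phrases, which neither of the two patterns above does.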


### Dependency parsing

Dependency parsing is also a little convoluted in NLTK. Generally, NLTK calls the Stanford CoreNLP dependency parser. But that gets Java involved, which is probably not something you want to do. If for some reason you do, you can check out the documentation: http://www.nltk.org/api/nltk.parse.html#module-nltk.parse.corenlp. (If you just want to use CoreNLP, I recommend using it through Stanza.)
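
If you only need to inspect or store a parse, NLTK does include a `DependencyGraph` class (in `nltk.parse`) that holds a dependency tree without any Java. The sketch below builds one from a hand-written, four-column annotation (word, tag, head index, relation). The annotation is a made-up but plausible parse for illustration, not the output of any parser.

```python
from nltk.parse import DependencyGraph

# Hand-annotated parse (an assumption, not parser output).
# Columns: word, POS tag, index of the head (0 = root), dependency relation.
conll = """Jokic NNP 2 nsubj
won VBD 0 ROOT
the DT 4 det
award NN 2 obj"""

graph = DependencyGraph(conll)

# Each triple is ((head word, head tag), relation, (dependent word, dependent tag)).
triples = list(graph.triples())
for head, relation, dependent in triples:
    print(head[0], '--' + relation + '->', dependent[0])
```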