# Example: Natural Language Processing with Stanza

Natural Language Processing (NLP) is a subfield of Computer Science focused on the processing of data encoded in natural human languages. It is closely related to Computational Linguistics (a subfield of linguistics) as well as other CS subfields such as Machine Learning, Artificial Intelligence, Information Retrieval, and Knowledge Representation.

[Stanza](https://stanfordnlp.github.io/stanza/) is a Python library that uses neural network components to annotate and analyze language corpora efficiently. It offers a large library of pre-trained models and also supports model fine-tuning, training, and evaluation.


# 1. Setup

Stanza can be installed by running `pip install stanza` in the command line or by running `!pip install stanza` (note the exclamation mark) in a notebook cell.

In [None]:
import warnings
from urllib.request import urlretrieve

import stanza  # load the library

# and some additional tools
from stanza.utils.visualization.dependency_visualization import visualize_doc
from stanza.utils.visualization.ner_visualization import visualize_ner_doc

warnings.filterwarnings('ignore')


Pre-trained language models can be downloaded using:

```python
stanza.download('[LANG]')
```

Substitute `[LANG]` with the desired language's full name (*e.g.* "english") or its short code (*e.g.* "en"). An up-to-date list of available models can be found on the [Stanza website](https://stanfordnlp.github.io/stanza/available_models.html).

<div class="alert alert-block alert-info">
<b>Note:</b> Downloaded models will be saved to `~/stanza_resources` by default. To use a different location, add a `model_dir=your_path` argument to the command above (early Stanza releases named this argument `dir`).
</div>

In [None]:
# let's download the English model
stanza.download('en')


# 2. NLP Pipeline

"Pipeline" refers to the sequence of tasks involved in analyzing and understanding human language. A typical NLP pipeline will include some or all of the following tasks:

- [sentence segmentation](https://stanfordnlp.github.io/stanza/tokenize.html)
- [tokenization](https://stanfordnlp.github.io/stanza/tokenize.html)
- [multi-word token expansion](https://stanfordnlp.github.io/stanza/mwt.html)
- [part-of-speech tagging](https://stanfordnlp.github.io/stanza/pos.html)
- [lemmatization](https://stanfordnlp.github.io/stanza/lemma.html)
- [constituency parsing](https://stanfordnlp.github.io/stanza/constituency.html)
- [dependency parsing](https://stanfordnlp.github.io/stanza/depparse.html)
- [named entity recognition](https://stanfordnlp.github.io/stanza/ner.html)
- [sentiment analysis](https://stanfordnlp.github.io/stanza/sentiment.html)

Which tasks should be included in a pipeline is largely language-dependent. 

## 2.1. Running text through the pipeline

In Stanza, the `Pipeline([LANG])` command instantiates a pipeline for the specified language. By default, a new pipeline includes all available task processors, but you can choose which processors to include with the `processors` argument.

<div class="alert alert-block alert-info">
<b>Note:</b> Stanza's pipeline is CUDA-aware: it runs on a CUDA device whenever one is available and falls back to the CPU otherwise. You can force the pipeline to use the CPU by setting `use_gpu=False`.
</div>
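
As a minimal sketch of the `processors` argument, the helper below requests only tokenization, part-of-speech tagging, and lemmatization, and forces CPU execution. The name `build_pipeline` and its default processor list are assumptions made for illustration, not part of Stanza. The sketch also assumes the English models are already downloaded, and the processors a language needs (for instance `mwt`) may differ between Stanza versions.

```python
def build_pipeline(lang='en', processors='tokenize,pos,lemma', use_gpu=False):
    """Instantiate a Stanza pipeline restricted to the listed processors.

    `build_pipeline` is a hypothetical helper, not part of Stanza.
    """
    # Imported inside the function so the helper can be defined
    # before Stanza itself is installed.
    import stanza

    # `processors` is a comma-separated string of processor names.
    return stanza.Pipeline(lang, processors=processors, use_gpu=use_gpu)
```

Calling `build_pipeline()` returns a pipeline that skips parsing, named entity recognition, and sentiment analysis, which usually makes it quicker to load and run than the full default pipeline.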


In [None]:
# Let's instantiate an English pipeline, with all processors by default
en_pipeline = stanza.Pipeline('en')


After a pipeline is successfully instantiated, text can be run through it to generate annotations by passing the text to the pipeline object. 

Let's use the text of Jane Austen's *Pride and Prejudice* as our input. We can use just a paragraph (faster) or the entire text. Just run the cell of whichever option you prefer (if in doubt, choose paragraph!).

### A: Use a paragraph

In [45]:
text = """Elizabeth was excessively disappointed.
The time fixed for the beginning of their northern tour was now fast
approaching; and a fortnight only was wanting of it, when a letter
arrived from Mrs. Gardiner, which at once delayed its commencement and
curtailed its extent. Mr. Gardiner would be prevented by business from
setting out till a fortnight in July, and must be in London again
within a month; and as that left too short a period for them to go so
far, and see so much as they had proposed, or at least to see it with
the leisure and comfort they had built on, they were obliged to give up
the Lakes, and substitute a more contracted tour; and, according to the
present plan, were to go no farther northward than Derbyshire. In that
county there was enough to be seen to occupy the chief of their three
weeks; and to Mrs. Gardiner it had a peculiarly strong attraction. The
town where she had formerly passed some years of her life, and where
they were now to spend a few days, was probably as great an object of
her curiosity as all the celebrated beauties of Matlock, Chatsworth,
Dovedale, or the Peak."""


### B: Use the entire novel

In [None]:
# we can get the full text from Project Gutenberg: https://www.gutenberg.org/cache/epub/1342/pg1342.txt
# we can use the urlretrieve function to download the text and save it to a file in the current directory:
urlretrieve('https://www.gutenberg.org/cache/epub/1342/pg1342.txt', 'pride_prejudice.txt')

# next we open the file and read its contents (the Project Gutenberg file is UTF-8 encoded)
with open('pride_prejudice.txt', encoding='utf-8') as f:
    text = f.read()
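
The Project Gutenberg file wraps the novel in a header and a license footer, which would otherwise be annotated along with the story. The sketch below shows one possible way to trim them. It assumes the file uses the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines; the function name `strip_gutenberg_boilerplate` and the sample string are made up for illustration.

```python
import re

# Marker lines that Project Gutenberg places around the book text (assumed format).
START = re.compile(r'\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n')
END = re.compile(r'\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK')


def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END markers, if they are present."""
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


# A tiny made-up sample standing in for the downloaded file.
sample = (
    'The Project Gutenberg eBook of Pride and Prejudice\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n'
    'It is a truth universally acknowledged...\n'
    '*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n'
    'License text\n'
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

To apply it here, run `text = strip_gutenberg_boilerplate(text)` before passing `text` to the pipeline. If the markers are missing, the function returns the input unchanged apart from surrounding whitespace.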


In [46]:
# now we pass the text to the pipeline for analysis
document = en_pipeline(text)


## 2.2. Accessing the results

The pipeline returns a `Document` object, which can be used to access the annotations generated from the text. 

A `Document` contains a list of `Sentence` **objects**, which in turn contain a **list** of annotated `Token` and `Word` objects. In English, the latter two overlap for the most part, but in other languages a token can often be divided into multiple words; for instance, the French token `aux` is divided into the words `à` and `les`. Another type of object, the `Span`, represents annotations that cover part of a document, such as named entity mentions.

A full description of the objects generated by Stanza can be found in the [Stanza documentation](https://stanfordnlp.github.io/stanza/data_objects.html).

The words in bold above are used in their strict Python sense: an `object` is an instance of a class whose properties and methods are accessed using dot syntax, and a `list` can be used like any regular Python list.

Let's look at a few examples:


In [47]:
# Documents contain a list of sentences:
for i, sentence in enumerate(document.sentences):
    print(f'Sentence {i}: {sentence.text}\n')

# print the first sentence:
print('THE FIRST SENTENCE:')
print(document.sentences[0].text)

Sentence 0: Elizabeth was excessively disappointed.

Sentence 1: The time fixed for the beginning of their northern tour was now fast
approaching; and a fortnight only was wanting of it, when a letter
arrived from Mrs. Gardiner, which at once delayed its commencement and
curtailed its extent.

Sentence 2: Mr. Gardiner would be prevented by business from
setting out till a fortnight in July, and must be in London again
within a month; and as that left too short a period for them to go so
far, and see so much as they had proposed, or at least to see it with
the leisure and comfort they had built on, they were obliged to give up
the Lakes, and substitute a more contracted tour; and, according to the
present plan, were to go no farther northward than Derbyshire.

Sentence 3: In that
county there was enough to be seen to occupy the chief of their three
weeks; and to Mrs. Gardiner it had a peculiarly strong attraction.

Sentence 4: The
town where she had formerly passed some years of her lif

In [48]:
# Sentences contain a list of words and tokens:
for i, word in enumerate(document.sentences[0].words):
    print(f'{i}: Word: {word.text}, Token: {document.sentences[0].tokens[i].text}')


0: Word: Elizabeth, Token: Elizabeth
1: Word: was, Token: was
2: Word: excessively, Token: excessively
3: Word: disappointed, Token: disappointed
4: Word: ., Token: .


### 2.2.1. Morphology and Syntax

Word and token objects are annotated with the results of the pipeline's tasks as properties. 

Morphological properties include `lemma`, `pos` (universal part-of-speech tag), `xpos` (treebank-specific part-of-speech tag), and `feats` (morphological features such as tense, verb form, mood, number, and gender).


In [None]:
# for example, the words in the first sentence
for i, word in enumerate(document.sentences[0].words):
    print(f'{i}: Word: {word.text}')
    print(f'\tLEMMA: {word.lemma}')
    print(f'\tPOS: {word.pos}')
    print(f'\tXPOS: {word.xpos}')
    print(f'\tFEATS: {word.feats}')


0: Word: Elizabeth
	LEMMA: Elizabeth
	POS: PROPN
	XPOS: NNP
	FEATS: Number=Sing
1: Word: was
	LEMMA: be
	POS: AUX
	XPOS: VBD
	FEATS: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
2: Word: excessively
	LEMMA: excessively
	POS: ADV
	XPOS: RB
	FEATS: None
3: Word: disappointed
	LEMMA: disappointed
	POS: ADJ
	XPOS: JJ
	FEATS: Degree=Pos
4: Word: .
	LEMMA: .
	POS: PUNCT
	XPOS: .
	FEATS: None
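
The `feats` value is a single string of `Name=Value` pairs joined by `|`, or `None` when a word has no features. As a small sketch, the hypothetical helper `parse_feats` below (not part of Stanza) turns such a string into a dictionary so that individual features are easy to look up. The input strings are copied from the output above.

```python
def parse_feats(feats):
    """Convert a feature string such as 'Number=Sing|Person=3' into a dict."""
    if not feats:
        # Words without morphological features carry None.
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))


was_feats = parse_feats('Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin')
print(was_feats['Tense'])
print(parse_feats(None))
```

With a live pipeline result, the same helper could be called as `parse_feats(word.feats)`.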


Syntactic information (*e.g.* from the dependency parsing task), if available for the chosen language, is accessible in the same way via the properties `head` (the 1-based `id` of the word's syntactic head, or 0 for the root) and `deprel` (dependency relation).

In [None]:
# for example, the words in the first sentence
for i, word in enumerate(document.sentences[0].words):
    print(f'{i}: Word: {word.text}')
    print(f'\tHEAD: {word.head}')
    print(f'\tDEPREL: {word.deprel}')

0: Word: Elizabeth
	HEAD: 4
	DEPREL: nsubj
1: Word: was
	HEAD: 4
	DEPREL: cop
2: Word: excessively
	HEAD: 4
	DEPREL: advmod
3: Word: disappointed
	HEAD: 0
	DEPREL: root
4: Word: .
	HEAD: 4
	DEPREL: punct
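
Because `head` is a 1-based word id and 0 marks the root, the head's text sits at index `head - 1` in the sentence's word list. The sketch below illustrates this on plain tuples copied from the output above; `head_text` is a hypothetical helper, not a Stanza function.

```python
# (text, head, deprel) rows copied from the output above.
rows = [
    ('Elizabeth', 4, 'nsubj'),
    ('was', 4, 'cop'),
    ('excessively', 4, 'advmod'),
    ('disappointed', 0, 'root'),
    ('.', 4, 'punct'),
]


def head_text(rows, head):
    """Return the text of the head word; id 0 marks the sentence root."""
    return 'ROOT' if head == 0 else rows[head - 1][0]


edges = [(text, deprel, head_text(rows, head)) for text, head, deprel in rows]
for dependent, relation, governor in edges:
    print(f'{dependent} --{relation}--> {governor}')
```

On Stanza objects the equivalent lookup is `sentence.words[word.head - 1].text` whenever `word.head > 0`.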


Alternatively, you can print a `Word` object directly to view all its annotations in a dictionary-like (JSON) format:

<div class="alert alert-block alert-warning">
<b>Note:</b> The universal part-of-speech tag is read through the <b>pos</b> property, but the same information appears under the <b>upos</b> key in the output dictionary.
</div>

In [55]:
# for example, the first word in the first sentence
word = document.sentences[0].words[0]
print(word)

{
  "id": 1,
  "text": "Elizabeth",
  "lemma": "Elizabeth",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 4,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 9
}
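
The printed form is JSON text, so it can be loaded back into a regular Python dictionary. The sketch below does this with the text copied from the output above. With a live `Word` object, `word.to_dict()` should give the same mapping directly, without going through a string.

```python
import json

# Text copied from the printed `Word` above.
printed_word = '''
{
  "id": 1,
  "text": "Elizabeth",
  "lemma": "Elizabeth",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 4,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 9
}
'''

record = json.loads(printed_word)

# The universal POS tag is stored under the "upos" key.
print(record['upos'])

# The character offsets cover exactly the word's text.
span_length = record['end_char'] - record['start_char']
print(span_length == len(record['text']))
```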


### 2.2.2. Annotated ranges

Annotations that span multiple words, such as named entities, are stored as `Span` objects in the `Document` under a property named after the task that generated them. In addition to character offsets for their start and end points, `Span` objects contain information specific to the task that generated them.
For example, `Span` objects generated by the named entity recognition task, stored in the `entities` property of `Document`, contain a `type` value that indicates the type of entity:


In [None]:
# For example: the results of NER are stored in the `entities` property:
print('Mention text\tType\tStart-End')
for entity in document.entities:
    print(f'{entity.text}\t{entity.type}\t{entity.start_char}-{entity.end_char}')


Mention text	Type	Start-End
Elizabeth	PERSON	0-9
Gardiner	PERSON	194-202
Gardiner	PERSON	273-281
July	DATE	350-354
London	GPE	371-377
a month	DATE	391-398
Lakes	LOC	600-605
Derbyshire	GPE	719-729
three
weeks	DATE	803-814
Gardiner	PERSON	828-836
some years	DATE	915-925
a few days	DATE	972-982
Matlock	GPE	1067-1074
Chatsworth	GPE	1076-1086
Dovedale	GPE	1088-1096
Peak	LOC	1105-1109
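
Once the mentions are available, simple summaries take only a few lines. The sketch below counts mentions per entity type using `(text, type)` pairs copied by hand from the output above, and then checks that the character offsets index into the original string. With a live result, the pairs could instead be built from `document.entities`.

```python
from collections import Counter

# (mention text, entity type) pairs copied from the output above.
mentions = [
    ('Elizabeth', 'PERSON'), ('Gardiner', 'PERSON'), ('Gardiner', 'PERSON'),
    ('July', 'DATE'), ('London', 'GPE'), ('a month', 'DATE'),
    ('Lakes', 'LOC'), ('Derbyshire', 'GPE'), ('three\nweeks', 'DATE'),
    ('Gardiner', 'PERSON'), ('some years', 'DATE'), ('a few days', 'DATE'),
    ('Matlock', 'GPE'), ('Chatsworth', 'GPE'), ('Dovedale', 'GPE'),
    ('Peak', 'LOC'),
]

# Count how many mentions were found for each entity type.
type_counts = Counter(entity_type for _, entity_type in mentions)
for entity_type, count in type_counts.most_common():
    print(f'{entity_type}\t{count}')

# Offsets are positions in the input text: slicing recovers the mention.
first_sentence = 'Elizabeth was excessively disappointed.'
first_mention = first_sentence[0:9]
print(first_mention)
```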


## 2.3. Visualizing the results

Stanza includes a few simple tools for quickly visualizing some of the pipeline's results, such as the dependency parse tree:


In [None]:
visualize_doc(document, 'en')

Or the results of the named entity recognition task:

In [54]:
visualize_ner_doc(document, 'en')
