# Example: Natural Language Processing with Stanza

## Table of Contents

1. [Introduction](#1.-Introduction)  
    1.1. [Setup](#1.1.-Setup)  
    1.2. [The NLP Pipeline](#1.2.-The-NLP-Pipeline)  
    1.3. [Running the Pipeline](#1.3.-Running-the-Pipeline)  
2. [Inspecting the Results](#2.-Inspecting-the-Results)  
    2.1. [The `Document` Object](#2.1.-The-Document-Object)  
    2.2. [`Sentence` Objects](#2.2.-Sentence-Objects)  
    2.3. [`Word` and `Token` Objects](#2.3.-Word-and-Token-Objects)  
    2.4. [`Span` Objects](#2.4.-Span-Objects)  
    2.5. [`ParseTree` Objects](#2.5.-ParseTree-Objects)  
    2.6. [Visualization Tools](#2.6.-Visualization-Tools)  
3. [Exporting and Importing Results](#3.-Exporting-and-Importing-Results)  
    3.1. [Native Python Data Structures](#3.1.-Native-Python-Data-Structures)  
    3.2. [Pickled Python Objects](#3.2.-Pickled-Python-Objects)  
    3.3. [CoNLL-U Format](#3.3.-CoNLL-U-Format)  


# 1. Introduction

Natural Language Processing (NLP) is a subfield of Computer Science focused on the processing of data encoded in natural human languages. It is closely related to Computational Linguistics (a subfield of linguistics) as well as other CS subfields such as Machine Learning, Artificial Intelligence, Information Retrieval, and Knowledge Representation.

[Stanza](https://stanfordnlp.github.io/stanza/) is a Python library that uses neural network components to enable efficient annotation and analysis of language corpora. It ships with a large library of pre-trained models and also supports model fine-tuning, training, and evaluation.

## 1.1. Setup

Stanza can be installed by running `pip install stanza` on the command line or `!pip install stanza` (note the exclamation mark) in a notebook cell. Once installed, we can import it, alongside some other libraries we'll use in this notebook:


In [None]:
import pprint
import warnings

import stanza
from stanza.models.common.doc import Document
from stanza.utils.conll import CoNLL
from stanza.utils.visualization.dependency_visualization import visualize_doc
from stanza.utils.visualization.ner_visualization import visualize_ner_doc

warnings.filterwarnings('ignore')  # this prevents some annoying warnings during execution


We can ask Stanza to download pre-trained language models by calling the `download()` function with the desired language's full name (*e.g.* "english") or its short code (*e.g.* "en"). An up-to-date list of available models can be found on the [Stanza website](https://stanfordnlp.github.io/stanza/available_models.html).

<div class="alert alert-block alert-info">
<b>Note:</b> Downloaded models are saved to `~/stanza_resources` by default. To use a different location, pass a `model_dir=your_path` argument to the `stanza.download()` call below (older Stanza releases named this argument `dir`).
</div>

In [None]:
# let's download the English models
stanza.download('en')


## 1.2. The NLP Pipeline

"Pipeline" refers to the sequence of tasks involved in analyzing and understanding human language. A typical NLP pipeline will include some or all of the following tasks:

- [sentence segmentation](https://stanfordnlp.github.io/stanza/tokenize.html)
- [tokenization](https://stanfordnlp.github.io/stanza/tokenize.html)
- [multi-word token expansion](https://stanfordnlp.github.io/stanza/mwt.html)
- [part-of-speech tagging](https://stanfordnlp.github.io/stanza/pos.html)
- [lemmatization](https://stanfordnlp.github.io/stanza/lemma.html)
- [constituency parsing](https://stanfordnlp.github.io/stanza/constituency.html)
- [dependency parsing](https://stanfordnlp.github.io/stanza/depparse.html)
- [named entity recognition](https://stanfordnlp.github.io/stanza/ner.html)
- [sentiment analysis](https://stanfordnlp.github.io/stanza/sentiment.html)

## 1.3. Running the Pipeline

In Stanza, the `Pipeline` class instantiates a pipeline for the specified language. A new pipeline includes a number of task processors by default. Which processors are included depends largely on two factors: the target language and the availability of pre-trained models for a specific task. If a language doesn't use multi-word tokens, for example, the *multi-word token expansion* processor won't be used. Similarly, if a pre-trained model for a task is not available for the target language, that processor won't be included. This is most often the case for complex tasks (*dependency parsing*, *named entity recognition*, *sentiment analysis*) and [low-resource languages](https://arxiv.org/abs/2006.07264).

If necessary, one can [specify manually which processors are loaded into the pipeline](https://stanfordnlp.github.io/stanza/getting_started.html#specifying-processors) via the `processors` argument.

<div class="alert alert-block alert-info">
<b>Note:</b> Stanza's pipeline is CUDA-aware: it runs on a CUDA device (GPU) whenever one is available and falls back to the CPU otherwise. You can force the pipeline to use the CPU regardless by setting `use_gpu=False`.
</div>
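As a minimal sketch of the two options just described, the keyword arguments for a smaller, CPU-only pipeline could be assembled as shown below. The helper name `pipeline_kwargs` and the particular processor selection are illustrative assumptions, not part of Stanza; only `lang`, `processors`, and `use_gpu` are actual `Pipeline` arguments.

```python
def pipeline_kwargs(lang="en", processors=("tokenize", "mwt", "pos", "lemma"), use_gpu=False):
    """Collect keyword arguments for stanza.Pipeline (hypothetical helper)."""
    return {
        "lang": lang,
        # the processors argument can be given as a comma-separated string
        "processors": ",".join(processors),
        "use_gpu": use_gpu,
    }


kwargs = pipeline_kwargs()
print(kwargs)

# In this notebook, the pipeline itself would then be created with:
# small_pipeline = stanza.Pipeline(**kwargs)
```

Loading fewer processors shortens start-up time, but annotations from the skipped processors (for example `.sentiment` or `.entities`) will not be available on the resulting documents.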


In [87]:
# Let's instantiate an English pipeline, with all processors by default
en_pipeline = stanza.Pipeline('en')


2025-05-05 14:29:57 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-05-05 14:29:57 INFO: Downloaded file to /Users/gabep/stanza_resources/resources.json
2025-05-05 14:29:58 INFO: Loading these models for language: en (English):
| Processor    | Package                   |
--------------------------------------------
| tokenize     | combined                  |
| mwt          | combined                  |
| pos          | combined_charlm           |
| lemma        | combined_nocharlm         |
| constituency | ptb3-revised_charlm       |
| depparse     | combined_charlm           |
| sentiment    | sstplus_charlm            |
| ner          | ontonotes-ww-multi_charlm |

2025-05-05 14:29:58 INFO: Using device: cpu
2025-05-05 14:29:58 INFO: Loading: tokenize
2025-05-05 14:29:58 INFO: Loading: mwt
2025-05-05 14:29:58 INFO: Loading: pos
2025-05-05 14:29:59 INFO: Loading: lemma
2025-05-05 14:29:59 INFO: Loading: constituency
2025-05-05 14:30:00 INFO: Loading: depparse
2025-05-05 14:30:00 INFO: Loading: sentiment
2025-05-05 14:30:00 INFO: Loading: ner

After a pipeline is successfully instantiated, text can be run through it to generate annotations by passing the text to the pipeline object. In this notebook, we'll use a paragraph from Jane Austen's *Pride and Prejudice* as our input. If we wanted to run the same code with the entire novel, for example, we could just load the [text from Project Gutenberg](https://www.gutenberg.org/ebooks/1342). In order to do that we could replace the code in the next cell with the following:

```python
# urlretrieve lives in the standard library
from urllib.request import urlretrieve

# we can get the full text from Project Gutenberg: https://www.gutenberg.org/cache/epub/1342/pg1342.txt
# we can use the urlretrieve function to download the text and save it to a file in the current directory:
urlretrieve('https://www.gutenberg.org/cache/epub/1342/pg1342.txt', 'pride_prejudice.txt')

# next we open the file and read its contents
with open('pride_prejudice.txt') as f:
    text = f.read()
```


In [45]:
text = """Elizabeth was excessively disappointed.
The time fixed for the beginning of their northern tour was now fast
approaching; and a fortnight only was wanting of it, when a letter
arrived from Mrs. Gardiner, which at once delayed its commencement and
curtailed its extent. Mr. Gardiner would be prevented by business from
setting out till a fortnight in July, and must be in London again
within a month; and as that left too short a period for them to go so
far, and see so much as they had proposed, or at least to see it with
the leisure and comfort they had built on, they were obliged to give up
the Lakes, and substitute a more contracted tour; and, according to the
present plan, were to go no farther northward than Derbyshire. In that
county there was enough to be seen to occupy the chief of their three
weeks; and to Mrs. Gardiner it had a peculiarly strong attraction. The
town where she had formerly passed some years of her life, and where
they were now to spend a few days, was probably as great an object of
her curiosity as all the celebrated beauties of Matlock, Chatsworth,
Dovedale, or the Peak."""


In [46]:
# now we pass the text to the pipeline for analysis
document = en_pipeline(text)


The best way to annotate an entire corpus with Stanza is to pass all the documents to the pipeline at once, as a list, and get back a list of annotated documents as the output.

<div class="alert alert-block alert-info">
<b>Note:</b> For performance reasons, it is essential to run the pipeline on batches of documents. See the <a href="https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents" target="_blank">relevant section in the documentation</a> for details on how to run Stanza on multiple documents at once.
</div>
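The batch approach described in the Stanza documentation wraps each raw text in an empty `stanza.Document` and passes the whole list to the pipeline. The sketch below follows that pattern; the function name `annotate_corpus` is a hypothetical helper introduced here for illustration.

```python
def annotate_corpus(pipeline, texts):
    """Run a Stanza pipeline over several raw texts in a single call."""
    import stanza  # imported lazily so the function can be defined before Stanza is loaded

    # wrap each raw text in an empty Document; the pipeline fills in the annotations
    in_docs = [stanza.Document([], text=text) for text in texts]
    # passing the whole list lets the pipeline batch the work internally
    return pipeline(in_docs)


# In this notebook:
# docs = annotate_corpus(en_pipeline, ['This is a test document.', 'I wrote another document for fun.'])
# print(len(docs), 'documents annotated')
```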

# 2. Inspecting the Results

The pipeline returns a `Document` object, which holds all the annotation data generated from the target text. The data is further organized into `Sentence`, `Token`, `Word`, `Span`, and `ParseTree` objects, each with properties and methods designed to interact with the output of the pipeline's analysis. The following chart gives a full overview of this system:

<img src="https://mermaid.ink/img/pako:eNqdVd9v2jAQ_lcsP20VrQhpCeRhL-0jm6at0qQVhEx8IdYSO7KdtbTr_76zIRCS0F95AOfu-y53353tJ5ooDjSmSc6MuRFsrVkxl3PLhYbECiXJ7MdcEnxuVFIVIC05P__3hfzEFcgEen0lk1t7jdrafynN--y36g_sCH75MvpU9O9MG7jVsEtJKgskVdrHIXN6VipzRlaQqAIMOav8632GX0uU_AvaAidWES4SO6cnQnAokSQMUTLfkFKDcTWLlIiiVD5CqlVBGLlW32aziqQihzqYF_gg1NPW6h4LD5bExFh9sJldaQYdM2HsXV3r4oBBi7CiAUFhGm5ZFUvr5HQAIe2x5x5L6jisWrr6P33eBvQ_N2hYLI4wBrRguXgEjsjVxoI5uJ0Ax4C65C3muanGvoENNbhKMK1j0imVsB8gOUZoqHDnmjVA2MAtjjPfieFxftIa3loQ7_TUNyvtmiV8V1v54WQZZFaY4QZ9rQmtqUvB28RWI3wP-vTDPD6q3cfVsJsSOvNqmbbLJGO6PVPYoF77G0t0325UiFKJ1yrLoShY24jbvW166LGlwGzHmAHj7fRx9DTkPQPZYRfCJJ1smN6OS-Poe4connWsCoaqyhzuMMnFy_L05fNSu9_fWgm6Mx8lw8NsyVJ70rcCPGrhYxthv7OauyFnq26HkkzkHMWvi90z67B0QNdacBpbXcGAFqAL5l6pjzynNoMCD_UYl7lYZ_6ycCTcir-VKmqeVtU6o3HKcoNvVcmZhd31uoe4s0tfq0paGo-D0cgHofETfaDxeXARhKOr4WQ8vAyiKBqPwgHd0DiYTi-mw_ByGIXjMJhMr8bPA_roPxxcDCdBGAXBZBKMw-gqQgZwYZX-urvj3d_zf6InZN4?type=png)](https://mermaid.live/edit#pako:eNqdVd9v2jAQ_lcsP20VrQhpCeRhL-0jm6at0qQVhEx8IdYSO7KdtbTr_76zIRCS0F95AOfu-y53353tJ5ooDjSmSc6MuRFsrVkxl3PLhYbECiXJ7MdcEnxuVFIVIC05P__3hfzEFcgEen0lk1t7jdrafynN--y36g_sCH75MvpU9O9MG7jVsEtJKgskVdrHIXN6VipzRlaQqAIMOav8632GX0uU_AvaAidWES4SO6cnQnAokSQMUTLfkFKDcTWLlIiiVD5CqlVBGLlW32aziqQihzqYF_gg1NPW6h4LD5bExFh9sJldaQYdM2HsXV3r4oBBi7CiAUFhGm5ZFUvr5HQAIe2x5x5L6jisWrr6P33eBvQ_N2hYLI4wBrRguXgEjsjVxoI5uJ0Ax4C65C3muanGvoENNbhKMK1j0imVsB8gOUZoqHDnmjVA2MAtjjPfieFxftIa3loQ7_TUNyvtmiV8V1v54WQZZFaY4QZ9rQmtqUvB28RWI3wP-vTDPD6q3cfVsJsSOvNqmbbLJGO6PVPYoF77G0t0325UiFKJ1yrLoShY24jbvW166LGlwGzHmAHj7fRx9DTkPQPZYRfCJJ1smN6OS-Poe4connWsCoaqyhzuMMnFy_L05fNSu9_fWgm6Mx8lw8NsyVJ70rcCPGrhYxthv7OauyFnq26HkkzkHMWvi90z67B0QNdacBpbXcGAFqAL5l6pjzynNoMCD_UYl7lYZ_6ycCTcir-VKmqeVtU6o3HKcoNvVcmZhd31uoe4s0tfq0paGo-D0cgHofETfaDxeXARhKOr4WQ8vAyiKBqPwgHd0DiYTi-mw_ByGIXjMJhMr8bPA_roPxxcDCdBGAXBZBKMw-gqQgZwYZX-urvj3d_zf6InZN4" width=900>

A full description of the data objects generated by Stanza can be found in the [Stanza documentation](https://stanfordnlp.github.io/stanza/data_objects.html).

Let's go over each in turn and look at a few examples of how to manipulate the returned annotations.

## 2.1. The `Document` Object

The `Document` object has a few convenience properties that let us retrieve the analyzed text (`.text`), as well as the number of words (`.num_words`) and tokens (`.num_tokens`) in it:

In [91]:
# the text provided to the pipeline
print(document.text)

# the number of words in the document
print('Words: ', document.num_words)

# and the number of tokens
print('Tokens: ', document.num_tokens)


Elizabeth was excessively disappointed.
The time fixed for the beginning of their northern tour was now fast
approaching; and a fortnight only was wanting of it, when a letter
arrived from Mrs. Gardiner, which at once delayed its commencement and
curtailed its extent. Mr. Gardiner would be prevented by business from
setting out till a fortnight in July, and must be in London again
within a month; and as that left too short a period for them to go so
far, and see so much as they had proposed, or at least to see it with
the leisure and comfort they had built on, they were obliged to give up
the Lakes, and substitute a more contracted tour; and, according to the
present plan, were to go no farther northward than Derbyshire. In that
county there was enough to be seen to occupy the chief of their three
weeks; and to Mrs. Gardiner it had a peculiarly strong attraction. The
town where she had formerly passed some years of her life, and where
they were now to spend a few days, was probably as 

Within a `Document`, annotations are further stored in `Sentence` objects, which can be accessed through the `.sentences` property.

## 2.2. `Sentence` Objects

As with `Document` objects, the `.text` property of a `Sentence` returns the corresponding text. A couple of additional convenience properties include `.doc`, which is a pointer to its parent document, and `.sent_id`, which returns the unique ID given to the sentence. IDs are assigned automatically from the sentence's position in the document when parsing, unless the document was created by importing a *CoNLL-U* file, in which case they are taken from the `# sent_id` comments.

In [None]:
# the .sentences property of a Document returns a list of Sentence objects
print('Type of document.sentences: ', type(document.sentences))
print('Number of sentences: ', len(document.sentences))

# we can get at individual sentences by using indexing as usual
# let's get the first sentence
first_sentence = document.sentences[0]

# the property .text of a Sentence object returns the text of the sentence
print('First sentence text: ', first_sentence.text)

# the property .sent_id returns the sentence ID
print('First sentence ID: ', first_sentence.sent_id)

# .doc points back to the Document object
document_of_first_sentence = first_sentence.doc
print('Type of document_of_first_sentence: ', type(document_of_first_sentence))
print('Count of words in document_of_first_sentence: ', document_of_first_sentence.num_words)

# finally, let's print the text of every sentence
for i, sentence in enumerate(document.sentences):
    print(f'Sentence {i}: {sentence.text}\n')


Type of document.sentences:  <class 'list'>
Number of sentences:  5
First sentence text:  Elizabeth was excessively disappointed.
First sentence ID:  0
Type of document_of_first_sentence:  <class 'stanza.models.common.doc.Document'>
Count of words in document_of_first_sentence:  225
Sentence 0: Elizabeth was excessively disappointed.

Sentence 1: The time fixed for the beginning of their northern tour was now fast
approaching; and a fortnight only was wanting of it, when a letter
arrived from Mrs. Gardiner, which at once delayed its commencement and
curtailed its extent.

Sentence 2: Mr. Gardiner would be prevented by business from
setting out till a fortnight in July, and must be in London again
within a month; and as that left too short a period for them to go so
far, and see so much as they had proposed, or at least to see it with
the leisure and comfort they had built on, they were obliged to give up
the Lakes, and substitute a more contracted tour; and, according to the
present pl

Depending on which processors/tasks were included in our pipeline, a number of other properties are also available for sentences. If the *sentiment analysis* task was included, for example, the `.sentiment` property will return the sentiment value for the sentence, which Stanza's sentiment models report as 0 (negative), 1 (neutral), or 2 (positive):

In [None]:
# let's print the sentiment value for each sentence
for sentence in document.sentences:
    print(f'Sentence: {sentence.sent_id} | Sentiment: {sentence.sentiment}\n\t{sentence.text}\n')


Sentence: 0 | Sentiment: 0
	Elizabeth was excessively disappointed.

Sentence: 1 | Sentiment: 0
	The time fixed for the beginning of their northern tour was now fast
approaching; and a fortnight only was wanting of it, when a letter
arrived from Mrs. Gardiner, which at once delayed its commencement and
curtailed its extent.

Sentence: 2 | Sentiment: 0
	Mr. Gardiner would be prevented by business from
setting out till a fortnight in July, and must be in London again
within a month; and as that left too short a period for them to go so
far, and see so much as they had proposed, or at least to see it with
the leisure and comfort they had built on, they were obliged to give up
the Lakes, and substitute a more contracted tour; and, according to the
present plan, were to go no farther northward than Derbyshire.

Sentence: 3 | Sentiment: 1
	In that
county there was enough to be seen to occupy the chief of their three
weeks; and to Mrs. Gardiner it had a peculiarly strong attraction.

Sentence
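To make the numeric values easier to read, they can be mapped to labels. The sketch below assumes the 0/1/2 scale (negative, neutral, positive) given in the Stanza documentation; `SENTIMENT_LABELS` and `sentiment_summary` are names made up for this example, and the `SimpleNamespace` objects merely stand in for real `Sentence` objects.

```python
from collections import Counter
from types import SimpleNamespace

# assumed mapping, following the scale in the Stanza documentation
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def sentiment_summary(sentences):
    """Count sentences per sentiment label; expects objects with a .sentiment attribute."""
    return Counter(SENTIMENT_LABELS[s.sentiment] for s in sentences)


# stand-ins for the first four sentences in the output above
sample = [SimpleNamespace(sentiment=value) for value in (0, 0, 0, 1)]
print(sentiment_summary(sample))

# In this notebook: sentiment_summary(document.sentences)
```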

Similarly, the `.dependencies` property makes available the results of the *dependency parsing* task, `.entities` those of the *named entity recognition* task, and `.constituency` those of the *constituency parsing* task. We'll discuss those in more detail below.

In [110]:
print(document.sentences[0].constituency)

(ROOT (S (NP (NNP Elizabeth)) (VP (VBD was) (ADJP (RB excessively) (JJ disappointed))) (. .)))


## 2.3. `Word` and `Token` Objects

`Sentence` objects also make available lists of annotated `Token` and `Word` objects. In many languages a token can be divided into multiple words; for instance, the French token `aux` is divided into the words `à` and `les`. In English, however, tokens and words coincide for the most part:

In [94]:
# Sentences contain a list of words and tokens:
for i, word in enumerate(document.sentences[0].words):
    print(f'{i}: Word: {word.text}, Token: {document.sentences[0].tokens[i].text}')


0: Word: Elizabeth, Token: Elizabeth
1: Word: was, Token: was
2: Word: excessively, Token: excessively
3: Word: disappointed, Token: disappointed
4: Word: ., Token: .
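Pairing words and tokens by position, as in the cell above, only lines up when a sentence has no multi-word tokens. A more robust way is to go through each token's `.words` list. The sketch below uses hand-built stand-ins for the tokens of the French phrase "parle aux enfants" (no French pipeline is loaded here, so the objects and their ids are illustrative assumptions); `token_word_pairs` is a hypothetical helper.

```python
from types import SimpleNamespace


def token_word_pairs(tokens):
    """Yield (token text, [word texts]) for objects shaped like Stanza tokens."""
    for token in tokens:
        yield token.text, [word.text for word in token.words]


def make_word(text):
    """Build a minimal stand-in for a Word object."""
    return SimpleNamespace(text=text)


# hand-built stand-ins: "aux" is a multi-word token covering two words
tokens = [
    SimpleNamespace(id=(1,), text="parle", words=[make_word("parle")]),
    SimpleNamespace(id=(2, 3), text="aux", words=[make_word("à"), make_word("les")]),
    SimpleNamespace(id=(4,), text="enfants", words=[make_word("enfants")]),
]

for text, words in token_word_pairs(tokens):
    print(f"{text!r} -> {words}")

# In this notebook: token_word_pairs(first_sentence.tokens)
```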


`Token` objects make data accessible through the following properties:

<div style="width: 50%; margin: 20px;">

| Property   | Value         |
|:-----------|:--------------|
| id         | Token index in the sentence, 1-based. If the token is a multi-word token the value will have two ids in a tuple. |
| text       | Token's text. |
| words      | List of words underlying the token. |
| start_char | Character offset where the token starts. |
| end_char   | Character offset where the token ends. |
| ner        | NER tag of this token, in [BIOES format](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)). |

</div>

Let's look at some examples:


In [121]:
# Since tokens are returned as a list, we can access them by index:
# let's get the first token in the first sentence
first_token = first_sentence.tokens[0]

# and print some of its properties
print('id: ', first_token.id)
print('text: ', first_token.text)
print('start_char: ', first_token.start_char)
print('end_char: ', first_token.end_char)
print('ner: ', first_token.ner)


id:  (1,)
text:  Elizabeth
start_char:  0
end_char:  9
ner:  S-PERSON
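In the BIOES scheme, the prefix of each tag marks a token's position in an entity: `B` (beginning), `I` (inside), `E` (end), `S` (single-token entity), and `O` (outside any entity). The following sketch shows one way to turn a tag sequence into spans. The function `bioes_to_spans` and the sample tags are written by hand for illustration and are simplified (for instance, label consistency within a span is not checked).

```python
def bioes_to_spans(tags):
    """Convert BIOES tags into (start, end, type) spans; end is exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, label))
            start = None
        elif prefix == "O":
            start = None
        # an "I" tag simply continues the currently open span
    return spans


# hand-written example sequence
tokens = ["Elizabeth", "stayed", "a", "few", "days"]
tags = ["S-PERSON", "O", "B-DATE", "I-DATE", "E-DATE"]

for start, end, label in bioes_to_spans(tags):
    print(label, tokens[start:end])
```

In practice this decoding is rarely needed, since the pipeline already exposes the assembled entities through the `.entities` property.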


`Word` objects have some of the same properties, but add others related to morphology and syntax:

<div style="width: 50%; margin: 20px;">

| Property   | Value         |
|:-----------|:--------------|
| id         | Word's index in the sentence, 1-based. |
| text       | Word's text. |
| lemma      | Word's lemma. |
| pos        | Universal part-of-speech tag. |
| xpos       | Extended or treebank-specific part-of-speech. |
| feats      | Morphological features. |
| head       | Id of the syntactic head of this word in the sentence, 1-based. |
| deprel     | Dependency relation between this word and its syntactic head. |
| parent     | Pointer back to the token this word is a part of. |

</div>

Let's look at some examples:

In [None]:
# let's take the second word in the first sentence this time
second_word = first_sentence.words[1]  # <- the word is: was

# the basic properties are just as before
print('id: ', second_word.id)
print('text: ', second_word.text)

# morphological properties are more interesting
print('lemma: ', second_word.lemma)  # <- should return the infinitive of the verb
print('pos: ', second_word.pos)
print('xpos: ', second_word.xpos)
print('feats: ', second_word.feats)  # <- should return the conjugation details of the verb

# and syntactic properties are also available
print('head: ', second_word.head)
print('deprel: ', second_word.deprel)


id:  2
text:  was
lemma:  be
pos:  AUX
xpos:  VBD
feats:  Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
head:  4
deprel:  cop
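Because `id` and `head` are both 1-based, the head of a word can be looked up at position `head - 1` in the sentence's word list, while a `head` of 0 marks the root of the sentence. The sketch below illustrates this with stand-in objects carrying the values Stanza reports for the first four words of our first sentence (the final punctuation mark is left out); `head_text` is a hypothetical helper.

```python
from types import SimpleNamespace


def head_text(word, words):
    """Return the text of a word's syntactic head, or 'ROOT' when head is 0."""
    # ids are 1-based, so the word with id n sits at list position n - 1
    return "ROOT" if word.head == 0 else words[word.head - 1].text


# stand-ins for Word objects of the first sentence
words = [
    SimpleNamespace(id=1, text="Elizabeth", head=4, deprel="nsubj"),
    SimpleNamespace(id=2, text="was", head=4, deprel="cop"),
    SimpleNamespace(id=3, text="excessively", head=4, deprel="advmod"),
    SimpleNamespace(id=4, text="disappointed", head=0, deprel="root"),
]

for w in words:
    print(f"{w.text} --{w.deprel}--> {head_text(w, words)}")

# In this notebook: head_text(second_word, first_sentence.words)
```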


In [125]:
# let's do all the words in the first sentence next
for word in first_sentence.words:
    print(f'{word.id}: Word: {word.text}')
    print(f'\tLemma: {word.lemma}')
    print(f'\tPOS: {word.pos}')
    print(f'\tXPOS: {word.xpos}')
    print(f'\tFeats: {word.feats}')


1: Word: Elizabeth
	Lemma: Elizabeth
	POS: PROPN
	XPOS: NNP
	Feats: Number=Sing
2: Word: was
	Lemma: be
	POS: AUX
	XPOS: VBD
	Feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
3: Word: excessively
	Lemma: excessively
	POS: ADV
	XPOS: RB
	Feats: None
4: Word: disappointed
	Lemma: disappointed
	POS: ADJ
	XPOS: JJ
	Feats: Degree=Pos
5: Word: .
	Lemma: .
	POS: PUNCT
	XPOS: .
	Feats: None


Alternatively, you can directly print a `Word` object to view all its annotations as a Python dictionary:

<div class="alert alert-block alert-warning">
<b>Note:</b> In order to access the universal part-of-speech tag, the property name is <b>pos</b>, whereas the same information is under the <b>upos</b> key in the output dictionary.
</div>

In [None]:
print(second_word)


{
  "id": 2,
  "text": "was",
  "lemma": "be",
  "upos": "AUX",
  "xpos": "VBD",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
  "head": 4,
  "deprel": "cop",
  "start_char": 10,
  "end_char": 13
}


## 2.4. `Span` Objects

`Span` objects store annotations that can cover several words, such as the output of the *named entity recognition* task. These objects make the following properties available:

<div style="width: 50%; margin: 20px;">

| Property   | Value         |
|:-----------|:--------------|
| doc        | Pointer to the parent document of this span. |
| text       | Span's text. |
| tokens     | List of tokens that correspond to this span. |
| words      | List of words that correspond to this span. |
| type       | Type of span or entity type if NER. |
| start_char | Character offset where the span starts. |
| end_char   | Character offset where the span ends. |

</div>

`Span` objects can be accessed through properties associated with the task that generated them. For example, the results of the NER task are available via the `.entities` property of the `Document` and `Sentence` objects:

In [None]:
# here are the entities identified in the first sentence
first_sentence.entities


[{
   "text": "Elizabeth",
   "type": "PERSON",
   "start_char": 0,
   "end_char": 9
 }]

In [130]:
# next we list all the entities identified in the document
# with their corresponding type and range
for entity in document.entities:
    print(f'{entity.text} ({entity.type}, {entity.start_char}-{entity.end_char})')


Elizabeth (PERSON, 0-9)
Gardiner (PERSON, 194-202)
Gardiner (PERSON, 273-281)
July (DATE, 350-354)
London (GPE, 371-377)
a month (DATE, 391-398)
Lakes (LOC, 600-605)
Derbyshire (GPE, 719-729)
three
weeks (DATE, 803-814)
Gardiner (PERSON, 828-836)
some years (DATE, 915-925)
a few days (DATE, 972-982)
Matlock (GPE, 1067-1074)
Chatsworth (GPE, 1076-1086)
Dovedale (GPE, 1088-1096)
Peak (LOC, 1105-1109)
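A common next step is to group entities by type. The sketch below does this for a handful of the spans listed above, using stand-in objects with the same `text`, `type`, `start_char`, and `end_char` properties; `entities_by_type` is a hypothetical helper. It also shows that the character offsets index into the original text, with `end_char` being exclusive.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_type(entities):
    """Group entity texts by entity type, keeping document order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.type].append(entity.text)
    return dict(grouped)


# stand-ins for a few of the spans listed above
sample = [
    SimpleNamespace(text="Elizabeth", type="PERSON", start_char=0, end_char=9),
    SimpleNamespace(text="July", type="DATE", start_char=350, end_char=354),
    SimpleNamespace(text="London", type="GPE", start_char=371, end_char=377),
    SimpleNamespace(text="Derbyshire", type="GPE", start_char=719, end_char=729),
]
print(entities_by_type(sample))

# slicing the source text with the offsets recovers the entity text
text = "Elizabeth was excessively disappointed."
first = sample[0]
print(text[first.start_char:first.end_char])

# In this notebook: entities_by_type(document.entities)
```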


## 2.5. `ParseTree` Objects

A `ParseTree` object is a nested data structure used to represent the result of the *constituency parsing* task. Each node has two properties: `label`, which holds the bracket type of an inner node, the part-of-speech tag of a preterminal, or the text of the word for a leaf; and `children`, which holds the nodes directly below the current one (a preterminal has a single leaf child containing the word, and a leaf has no children). The tree can be accessed via the `.constituency` property of `Sentence` objects.

In [142]:
# let's get the constituency parse tree for the first sentence
print(first_sentence.constituency)

# the first node should be the ROOT
# let's verify by printing its label
print('Label: ', first_sentence.constituency.label)

# because it's a ROOT, the children should be the rest of the sentence
print('Children: ', first_sentence.constituency.children)

# we can also access its children by index
print('First child label: ', first_sentence.constituency.children[0].label)
print('First child children: ', first_sentence.constituency.children[0].children)


(ROOT (S (NP (NNP Elizabeth)) (VP (VBD was) (ADJP (RB excessively) (JJ disappointed))) (. .)))
Label:  ROOT
Children:  ((S (NP (NNP Elizabeth)) (VP (VBD was) (ADJP (RB excessively) (JJ disappointed))) (. .)),)
First child label:  S
First child children:  ((NP (NNP Elizabeth)), (VP (VBD was) (ADJP (RB excessively) (JJ disappointed))), (. .))
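Since every node exposes only `label` and `children`, the tree is easy to walk recursively. The sketch below uses a small stand-in class, `Node`, that mimics those two properties on a shortened version of the tree above; `Node`, `leaves`, and `preterminals` are all names introduced for this example rather than Stanza API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Minimal stand-in with the same two properties as a parse tree node."""
    label: str
    children: tuple = ()


def leaves(node):
    """Collect the words at the leaves, left to right."""
    if not node.children:
        return [node.label]
    return [text for child in node.children for text in leaves(child)]


def preterminals(node):
    """Collect (tag, word) pairs from the nodes directly above the leaves."""
    if len(node.children) == 1 and not node.children[0].children:
        return [(node.label, node.children[0].label)]
    return [pair for child in node.children for pair in preterminals(child)]


# (ROOT (S (NP (NNP Elizabeth)) (VP (VBD was)))), shortened from the tree above
tree = Node("ROOT", (
    Node("S", (
        Node("NP", (Node("NNP", (Node("Elizabeth"),)),)),
        Node("VP", (Node("VBD", (Node("was"),)),)),
    )),
))

print(leaves(tree))
print(preterminals(tree))

# In this notebook: leaves(first_sentence.constituency)
```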


## 2.6. Visualization Tools

Stanza includes a few simple tools designed to quickly visualize some of the analytical results of the pipeline, such as the dependency parse tree:


In [None]:
visualize_doc(document, 'en')

Or the results of the named entity recognition task:

In [54]:
visualize_ner_doc(document, 'en')


# 3. Exporting and Importing Results

Stanza has a number of functions designed to help us convert the results of running text through the pipeline into formats other than the internal `Document` object. This is particularly useful for saving those results to disk for later use.

## 3.1. Native Python Data Structures

For certain types of manipulations it is often convenient to convert the `Document` object to a native Python data structure. This can be achieved by using the `to_dict()` method of the `Document` class. The resulting data structure will have the form `List[List[Dict]]`. That is, the `Document` becomes a list, every element in that list is another list representing each `Sentence`, and the elements of those are in turn dictionaries representing each `Token`/`Word` in the sentence:


In [135]:
# let's convert our document to a Python native structure
python_structure = document.to_dict()

# The type of the new structure is a list
print('type of python_structure is: ', type(python_structure))

# That list contains lists representing sentences
print('type of python_structure[0] is: ', type(python_structure[0]))

# Each sentence list in turn contains dictionaries representing tokens/words
print('type of python_structure[0][0] is: ', type(python_structure[0][0]))

# let's look at the first sentence
pprint.pp(python_structure[0])


type of python_structure is:  <class 'list'>
type of python_structure[0] is:  <class 'list'>
type of python_structure[0][0] is:  <class 'dict'>
[{'id': 1,
  'text': 'Elizabeth',
  'lemma': 'Elizabeth',
  'upos': 'PROPN',
  'xpos': 'NNP',
  'feats': 'Number=Sing',
  'head': 4,
  'deprel': 'nsubj',
  'start_char': 0,
  'end_char': 9,
  'ner': 'S-PERSON',
  'multi_ner': ('S-PERSON',)},
 {'id': 2,
  'text': 'was',
  'lemma': 'be',
  'upos': 'AUX',
  'xpos': 'VBD',
  'feats': 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin',
  'head': 4,
  'deprel': 'cop',
  'start_char': 10,
  'end_char': 13,
  'ner': 'O',
  'multi_ner': ('O',)},
 {'id': 3,
  'text': 'excessively',
  'lemma': 'excessively',
  'upos': 'ADV',
  'xpos': 'RB',
  'head': 4,
  'deprel': 'advmod',
  'start_char': 14,
  'end_char': 25,
  'ner': 'O',
  'multi_ner': ('O',)},
 {'id': 4,
  'text': 'disappointed',
  'lemma': 'disappointed',
  'upos': 'ADJ',
  'xpos': 'JJ',
  'feats': 'Degree=Pos',
  'head': 0,
  'deprel': 'root'
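One advantage of the native structure is that it works directly with standard-library tools. As an illustration, the sketch below flattens a `List[List[Dict]]` structure into CSV text with one row per word. The helper `to_csv`, the added `sent` column, and the chosen field list are assumptions made for this example; the sample data is a shortened stand-in for `document.to_dict()`.

```python
import csv
import io


def to_csv(doc_dicts, fields=("sent", "id", "text", "lemma", "upos", "deprel")):
    """Flatten a List[List[Dict]] structure into CSV text, one row per entry."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for sent_index, sentence in enumerate(doc_dicts):
        for entry in sentence:
            # keys not listed in `fields` are ignored; missing keys become empty cells
            writer.writerow({"sent": sent_index, **entry})
    return buffer.getvalue()


# shortened stand-in for document.to_dict()
sample = [[
    {"id": 1, "text": "Elizabeth", "lemma": "Elizabeth", "upos": "PROPN", "deprel": "nsubj", "head": 4},
    {"id": 2, "text": "was", "lemma": "be", "upos": "AUX", "deprel": "cop", "head": 4},
]]
print(to_csv(sample))

# In this notebook: print(to_csv(python_structure))
```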

If we want to convert the Python data structure back to a Stanza `Document`, we can just provide it as an argument to the `Document` class:

In [136]:
document_from_python_structure = Document(python_structure)

# let's check the type of the new document
print('type of document_from_python_structure is: ', type(document_from_python_structure))

# and the number of words
print('Number of words: ', document_from_python_structure.num_words)


type of document_from_python_structure is:  <class 'stanza.models.common.doc.Document'>
Number of words:  225


## 3.2. Pickled Python Objects

*Pickling* refers to the [process of serializing a Python data structure into a byte stream](https://docs.python.org/3/library/pickle.html), which can then be saved to a file or transmitted over a network. This is a way of storing complex data structures like lists and dictionaries for later use, enabling you to reconstruct the original object when needed.
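As a quick illustration of the general mechanism (not of Stanza's internal serialization format, which is not covered here), the standard `pickle` module can round-trip a small native structure like this:

```python
import pickle

# a small native structure in the List[List[Dict]] shape used in the previous section
data = [[{"id": 1, "text": "Elizabeth", "multi_ner": ("S-PERSON",)}]]

payload = pickle.dumps(data)       # serialize the object to a byte stream
restored = pickle.loads(payload)   # rebuild an equal object from the bytes

print(type(payload))
print(restored == data)
```

Note that, as the Python documentation warns, unpickling can execute arbitrary code, so only load pickled data that comes from a source you trust.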

Stanza provides methods to *pickle* and *unpickle* a `Document`:


In [137]:
# let's pickle our document using the to_serialized method
pickled_document = document.to_serialized()

# next we save the pickled document to the hard drive
with open('document.pickle', 'wb') as file:
    file.write(pickled_document)

# A new file called "document.pickle" should now exist in the same directory as the notebook


In [138]:
# to load the document back from the hard drive
# we open the file and pass its contents to the from_serialized method
with open('document.pickle', 'rb') as file:
    unpickled_document = Document.from_serialized(file.read())

# let's test it by printing the number of words
print(unpickled_document.num_words)


225


## 3.3. CoNLL-U Format

*CoNLL-U* is a [widely used format for representing treebanks and processed NLP corpora](https://universaldependencies.org/format.html). Stanza can both consume CoNLL-U formatted data and export results to that format.
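In a CoNLL-U file, each sentence is a block of lines: comment lines starting with `#`, followed by one line per word with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), where `_` marks an unspecified value. To give a feel for the format, the sketch below parses a hand-written block covering the first four words of our first sentence. The rows are an illustration, not actual Stanza output (the DEPS and MISC columns are simply left as `_`), and `parse_conllu_sentence` is a hypothetical helper.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def parse_conllu_sentence(block):
    """Split one CoNLL-U sentence block into comment lines and word rows."""
    comments, rows = [], []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            comments.append(line)
        else:
            rows.append(dict(zip(FIELDS, line.split("\t"))))
    return comments, rows


# hand-written rows; DEPS and MISC are left unspecified ("_") for brevity
sample = "\n".join([
    "# sent_id = 0",
    "1\tElizabeth\tElizabeth\tPROPN\tNNP\tNumber=Sing\t4\tnsubj\t_\t_",
    "2\twas\tbe\tAUX\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t4\tcop\t_\t_",
    "3\texcessively\texcessively\tADV\tRB\t_\t4\tadvmod\t_\t_",
    "4\tdisappointed\tdisappointed\tADJ\tJJ\tDegree=Pos\t0\troot\t_\t_",
])

comments, rows = parse_conllu_sentence(sample)
print(comments)
for row in rows:
    print(row["id"], row["form"], row["upos"], row["head"], row["deprel"])
```

For real work, Stanza's own `CoNLL` utilities, used in the cells that follow, are the better choice, since they handle multi-word tokens and the other details of the format.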


In [139]:
# the following command will write our document to a CoNLL-U formatted file,
# 'document.conllu', in the same directory as the notebook
CoNLL.write_doc2conll(document, 'document.conllu')


In [140]:
# to load the document back from the hard drive:
document_from_conll = CoNLL.conll2doc('document.conllu')

# let's test it by printing the number of words
print(document_from_conll.num_words)


225
