[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W3T_Analysis_Spacy_Stanza.ipynb)

In [1]:
# Run in Colab to install local packages
!pip install spacy stanza
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download de_core_news_sm

# Introduction to spaCy and Stanza

*This tutorial is based on the [Advanced NLP with spaCy](https://course.spacy.io/en) course, the [spaCy documentation](https://spacy.io/usage/linguistic-features) and the [Stanza documentation](https://stanfordnlp.github.io/stanza/).*

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: 

- You create a pipeline, or load one of the available trained pipelines, into an object conventionally called `nlp`.
- You pass in raw text and get back a [Doc](https://spacy.io/api/doc) object that comes with a variety of annotations.

<div>
<img src="https://spacy.io/images/pipeline.svg" alt="Visualizing a nlp pipeline" style="width: 50%">
</div>

Docs are containers for [Token](https://spacy.io/api/token) objects, and a slice of a Doc is a [Span](https://spacy.io/api/span) object. Each token carries its raw text together with annotations such as the part-of-speech tag, the dependency relation, the named entity tag, and more.

<div>
<img src="https://course.spacy.io/doc_span.png" alt="Visualizing tokens, docs and spans" style="width: 40%">
</div>

In [2]:
import spacy

# Load a blank pipeline for English, i.e. only the tokenizer and basic lexical
# attributes such as like_num, is_alpha, is_punct, etc.
nlp = spacy.blank('en')
print("Pipeline:", nlp.pipe_names)

# Load a small pipeline for English. This also includes a POS tagger, a dependency
# parser and a named entity recognizer trained on labeled data.
nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

def print_attributes(doc_or_span):
    print("Text\tLemma\tUPOS\tXPOS\tDEP\tShape\tIsAlpha\tIsStop\tIsPunct\tIsAscii\n" + "-"*80)
    for token in doc_or_span:
        print(
            f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token.tag_}\t"
            f"{token.dep_}\t{token.shape_}\t{token.is_alpha}\t"
            f"{token.is_stop}\t{token.is_punct}\t{token.is_ascii}"
        )

print_attributes(doc)

Pipeline: []
Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Text	Lemma	UPOS	XPOS	DEP	Shape	IsAlpha	IsStop	IsPunct	IsAscii
--------------------------------------------------------------------------------
Apple	Apple	PROPN	NNP	nsubj	Xxxxx	True	False	False	True
is	be	AUX	VBZ	aux	xx	True	True	False	True
looking	look	VERB	VBG	ROOT	xxxx	True	False	False	True
at	at	ADP	IN	prep	xx	True	True	False	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False	False	True
U.K.	U.K.	PROPN	NNP	dobj	X.X.	False	False	False	True
startup	startup	VERB	VBD	dep	xxxx	True	False	False	True
for	for	ADP	IN	prep	xxx	True	True	False	True
$	$	SYM	$	quantmod	$	False	False	False	True
1	1	NUM	CD	compound	d	False	False	False	True
billion	billion	NUM	CD	pobj	xxxx	True	False	False	True


Attributes extracted:

- **Text:** The original word text.

- **Lemma:** The base form of the word.

- **UPOS:** The simple (coarse-grained) universal part-of-speech tag.

- **XPOS:** The detailed (fine-grained) part-of-speech tag.

- **DEP:** The syntactic dependency, i.e. the relation between tokens.

- **Shape:** The word shape: capitalization, punctuation, digits.

- **IsAlpha:** Does the token consist of alphabetic characters?

- **IsStop:** Is the token part of a stop list, i.e. the most common words of the language?

- **IsPunct:** Is the token a punctuation mark?

- **IsAscii:** Does the token consist of ASCII characters only?
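
To make the **Shape** column more concrete, here is a rough pure-Python sketch of how such a shape string can be derived. The function name `toy_shape` and the cap of four repeated symbols (`max_run`) are assumptions chosen to reproduce the values printed above; this is an illustration, not spaCy's own implementation.

```python
def toy_shape(text: str, max_run: int = 4) -> str:
    """Map a token to a coarse shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and symbols are kept as they are
        # Count how many times the same symbol has been seen in a row
        run_length = run_length + 1 if symbol == previous else 1
        previous = symbol
        # Long runs are truncated, so "looking" and "billion" end up with the same shape
        if run_length <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, "->", toy_shape(word))
```

Collapsing long runs keeps the number of distinct shapes small, which is what makes the shape useful as a generalizing feature.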

If some tags look mysterious to the non-linguists among you, the `spacy.explain` function can help you get a better grasp of them:

In [3]:
print("quantmod:", spacy.explain("quantmod"))
print("dobj:", spacy.explain("dobj"))
print("VBZ:", spacy.explain("VBZ"))

quantmod: modifier of quantifier
dobj: direct object
VBZ: verb, 3rd person singular present


In [4]:
# A slice from the doc is a Span object

span = doc[1:5]
print_attributes(span)

Text	Lemma	UPOS	XPOS	DEP	Shape	IsAlpha	IsStop	IsPunct	IsAscii
--------------------------------------------------------------------------------
is	be	AUX	VBZ	aux	xx	True	True	False	True
looking	look	VERB	VBG	ROOT	xxxx	True	False	False	True
at	at	ADP	IN	prep	xx	True	True	False	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False	False	True


## Lemmatization

The [Lemmatizer](https://spacy.io/api/lemmatizer) is a configurable pipeline component that provides lookup-based and rule-based lemmatization methods.

In [5]:
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
if "lemmatizer" not in nlp.pipe_names:
    config = {"mode": "rule"}
    lemmatizer = nlp.add_pipe("lemmatizer", config=config)
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])

rule
['I', 'be', 'read', 'the', 'paper', '.']


The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from [WordNet](https://wordnet.princeton.edu/).
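
The general idea of combining exception lists with part-of-speech-specific rules can be sketched in plain Python. Everything here is an assumption made for illustration: the tables `EXCEPTIONS`, `RULES` and `KNOWN_LEMMAS` are tiny made-up stand-ins, and the lookup order (exceptions first, then suffix rules) is one plausible design rather than a description of spaCy's internals.

```python
# Hypothetical miniature resources; real lemmatizers use far larger tables
EXCEPTIONS = {("was", "AUX"): "be", ("is", "AUX"): "be", ("bought", "VERB"): "buy"}
RULES = {
    "VERB": [("ing", ""), ("ing", "e"), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}
KNOWN_LEMMAS = {"read", "look", "buy", "make", "paper", "rat"}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return a lemma for `word`, given its coarse part-of-speech tag."""
    word = word.lower()
    # 1. Irregular forms are resolved through the exception list
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # 2. Otherwise try the suffix rules registered for this part of speech
    for old_suffix, new_suffix in RULES.get(pos, []):
        if word.endswith(old_suffix):
            candidate = word[: len(word) - len(old_suffix)] + new_suffix
            # Accept a candidate only if it is a known lemma
            if candidate in KNOWN_LEMMAS:
                return candidate
    # 3. Fall back to the (lowercased) surface form
    return word


print(toy_lemmatize("reading", "VERB"), toy_lemmatize("reading", "NOUN"))
```

The same surface form can receive different lemmas depending on the tag, which is why the lemmatizer needs the tagger's output but not the wider context.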

>**💡 Interesting Fact** spaCy v3 introduced a new, experimental, machine learning-based lemmatizer that reaches accuracies above 95% for many languages. This lemmatizer learns to predict lemmatization rules from a corpus of examples, removing the need to write an exhaustive set of per-language lemmatization rules. See more in this blog post: [Neural edit-tree lemmatization for spaCy](https://explosion.ai/blog/edit-tree-lemmatizer)

## Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is Unicode text, and the output is a Doc object. To construct a Doc object, you need a [Vocab](https://spacy.io/api/vocab) instance, a sequence of word strings, and optionally a sequence of `spaces` booleans, which let you keep the tokens aligned with the original string.

spaCy’s tokenization is non-destructive, which means that you’ll always be able to reconstruct the original input from the tokenized output. Whitespace information is preserved in the tokens and no information is added or removed during tokenization.
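
A minimal sketch of what "non-destructive" means in practice: given the token strings and one boolean per token saying whether a space follows it, the input can be rebuilt exactly. The helper name `rebuild_text` is hypothetical, and the sketch assumes tokens are separated by at most one plain space.

```python
def rebuild_text(words, spaces):
    """Join tokens back into the original string using per-token whitespace flags."""
    return "".join(
        word + (" " if followed_by_space else "")
        for word, followed_by_space in zip(words, spaces)
    )


words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]  # only "," is followed by a space
print(rebuild_text(words, spaces))
```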

In spaCy, splitting a text into tokens starts with **whitespace tokenization**; each whitespace-separated chunk is then split further by checking for prefixes, suffixes, infixes and tokenizer exceptions (e.g. `N.Y.` and `U.K.` are kept as single tokens).

<div>
<img src="https://spacy.io/images/tokenization.svg" alt="Tokenization example with spaCy" style="width: 60%">
</div>

Refer to the [spaCy Documentation](https://spacy.io/usage/linguistic-features#tokenization) for more details on how to define custom rules and exceptions.
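
The splitting procedure can be approximated with a toy tokenizer. This is a simplified sketch under stated assumptions: the rule tables `PREFIX_CHARS`, `SUFFIX_CHARS` and `EXCEPTIONS` are made up for this example, infixes are not handled at all, and the real spaCy tokenizer uses far richer, language-specific rules.

```python
# Hypothetical, heavily simplified rule tables
PREFIX_CHARS = set('"([')
SUFFIX_CHARS = set('")].,!?')
EXCEPTIONS = {
    "Let's": ["Let", "'s"],
    "N.Y.": ["N.Y."],
    "U.K.": ["U.K."],
}


def split_chunk(chunk):
    """Peel prefixes and suffixes off one whitespace-separated chunk."""
    prefixes, suffixes = [], []
    # Stop as soon as the remainder is empty or matches a special case
    while chunk and chunk not in EXCEPTIONS:
        if chunk[0] in PREFIX_CHARS:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk[-1] in SUFFIX_CHARS:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    middle = EXCEPTIONS.get(chunk, [chunk] if chunk else [])
    # Suffixes were collected from the outside in, so restore their order
    return prefixes + middle + suffixes[::-1]


def toy_tokenize(text):
    """Whitespace-split first, then refine every chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(toy_tokenize('"Let\'s go to N.Y.!"'))
```

Checking the exception table before every peeling step is what keeps `N.Y.` intact even though its last character looks like a suffix.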

### Creating a custom tokenizer

We will now demonstrate how a custom tokenizer can replace the default tokenizer in a spaCy pipeline. We create a class `WhitespaceTokenizer` and substitute it for the original English tokenizer:

In [6]:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print("Before substitution:", [token.text for token in doc])
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print("After substitution:", [token.text for token in doc])

Before substitution: ['What', "'s", 'happened', 'to', 'me', '?', 'he', 'thought', '.', 'It', 'was', "n't", 'a', 'dream', '.']
After substitution: ["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.']


### Using pre-tokenized text

By default, spaCy assumes that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word is followed by a space.

In [7]:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print_attributes(doc)
print("\n\n\n")
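# Caveat: in `en_core_web_sm` the tagger and parser rely on the shared `tok2vec`
# component, and UPOS tags are filled in by the `attribute_ruler`. Both are skipped
# here, so the annotations printed below are unreliable (every token is tagged NN).
# For meaningful results, apply every component of `nlp.pipeline` in order.
doc = nlp.get_pipe("tagger")(doc)
doc = nlp.get_pipe("parser")(doc)
doc = nlp.get_pipe("lemmatizer")(doc)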
print_attributes(doc)

Text	Lemma	UPOS	XPOS	DEP	Shape	IsAlpha	IsStop	IsPunct	IsAscii
--------------------------------------------------------------------------------
Hello					Xxxxx	True	False	False	True
,					,	False	False	True	True
world					xxxx	True	False	False	True
!					!	False	False	True	True




Text	Lemma	UPOS	XPOS	DEP	Shape	IsAlpha	IsStop	IsPunct	IsAscii
--------------------------------------------------------------------------------
Hello	hello		NN	nsubj	Xxxxx	True	False	False	True
,	,		NN	ROOT	,	False	False	True	True
world	world		NN	dobj	xxxx	True	False	False	True
!	!		NN	punct	!	False	False	True	True




### Aligning tokenization

spaCy’s tokenization is non-destructive and uses language-specific rules optimized for compatibility with treebank annotations. Other tools and resources can sometimes tokenize things differently – for example, `"I'm"` → `["I", "'", "m"]` instead of `["I", "'m"]`.

In situations like that, you often want to align the tokenization so that you can merge annotations from different sources together, or take vectors predicted by a pretrained BERT model and apply them to spaCy tokens. spaCy’s [`Alignment`](https://spacy.io/api/example#alignment-object) object provides the one-to-one mappings of token indices in both directions, and also accounts for cases where multiple tokens align to a single token.

In [8]:
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.data}")  # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}")  # array([1, 1, 1, 1, 2, 1, 1])   : the token "'s" refers to two tokens
print(f"b -> a, mappings: {align.y2x.data}")  # array([0, 1, 2, 3, 4, 5, 6, 7])

a -> b, lengths: [1 1 1 1 1 1 1 1]
a -> b, mapping: [0 1 2 3 4 4 5 6]
b -> a, lengths: [1 1 1 1 2 1 1]
b -> a, mappings: [0 1 2 3 4 5 6 7]


Here are some insights from the alignment information generated in the example above:

- The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they’re also identical in the input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]` (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens 4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens` (`"'s"`).
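
One simple way to compute such a mapping by hand is to compare character offsets. The sketch below assumes that both tokenizations spell out exactly the same characters once concatenated; the helper names `char_spans` and `toy_align` are hypothetical, and this is not necessarily how `Alignment.from_strings` works internally.

```python
def char_spans(tokens):
    """Return (start, end) character offsets of each token in the joined string."""
    spans, start = [], 0
    for token in tokens:
        spans.append((start, start + len(token)))
        start += len(token)
    return spans


def toy_align(a_tokens, b_tokens):
    """For every token in `a_tokens`, list the indices of overlapping `b_tokens`."""
    if "".join(a_tokens) != "".join(b_tokens):
        raise ValueError("Both tokenizations must cover the same characters")
    b_spans = char_spans(b_tokens)
    mapping = []
    for a_start, a_end in char_spans(a_tokens):
        mapping.append([
            j for j, (b_start, b_end) in enumerate(b_spans)
            if a_start < b_end and b_start < a_end  # the two spans overlap
        ])
    return mapping


other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
print(toy_align(other_tokens, spacy_tokens))
print(toy_align(spacy_tokens, other_tokens))
```

Flattening the nested lists and recording the length of each one gives the same `data` and `lengths` arrays as printed above.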

## Sentence Segmentation

A Doc object’s sentences are available via the `Doc.sents` property, a generator that yields one `Span` object per sentence. You can check whether a Doc has sentence boundaries by calling `Doc.has_annotation` with the attribute name `"SENT_START"`.

Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually the most accurate approach, but it requires a **trained pipeline** that provides accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box with spaCy’s provided trained pipelines. For social media or conversational text that doesn’t follow the same rules, your application may benefit from a custom trained or rule-based component.

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.


Other segmentation strategies, such as the rule-based `sentencizer` component, are also available; see the [documentation](https://spacy.io/usage/linguistic-features#sbd).
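
As an illustration of what a rule-based alternative to parser-based segmentation might look like, here is a toy splitter that closes a sentence after every sentence-final punctuation token. The function name and the default `end_marks` are assumptions for this sketch; it works on a plain list of token strings and knows nothing about abbreviations or quotes.

```python
def toy_split_sentences(tokens, end_marks=(".", "!", "?")):
    """Group tokens into sentences, ending a sentence after each end mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in end_marks:
            sentences.append(current)
            current = []
    # Keep trailing tokens that are not followed by an end mark
    if current:
        sentences.append(current)
    return sentences


tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", "sentence", "."]
for sentence in toy_split_sentences(tokens):
    print(" ".join(sentence))
```

A rule like this is fast and needs no trained model, but it breaks on inputs where punctuation does not mark a sentence boundary, which is where the dependency parse helps.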

## Morphological features

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a **lemma** (root form) is **inflected** (modified/combined) with one or more **morphological features** to create a surface form. Here are some examples:

| **Context** | **Surface** | **Lemma** | **POS** | **Morphological Features** |
|---|---|---|---|---|
| I was reading the paper | reading | read | VERB | VerbForm=Ger |
| I don’t watch the news, I read the paper | read | read | VERB | VerbForm=Fin, Mood=Ind, Tense=Pres |
| I read the paper yesterday | read | read | VERB | VerbForm=Fin, Mood=Ind, Tense=Past |

Morphological features are stored in the `token.morph` attribute of tokens:

In [10]:
doc = nlp("I was reading the paper.")

def get_morphological_features(doc_or_span):
    print("Token\tCase\tNumber\tPerson\tPronType\n" + "-"*40)
    for token in doc_or_span:
        print(
            f"{token.text}\t{next(iter(token.morph.get('Case')), '')}\t"
            f"{next(iter(token.morph.get('Number')), '')}\t"
            f"{next(iter(token.morph.get('Person')), '')}\t"
            f"{next(iter(token.morph.get('PronType')), '')}"
        )

get_morphological_features(doc)

Token	Case	Number	Person	PronType
----------------------------------------
I	Nom	Sing	1	Prs
was		Sing	3	
reading				
the				Art
paper		Sing		
.				


For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and **fine-grained part-of-speech** tags to produce **coarse-grained part-of-speech tags** and morphological features. For other languages with a more complex morphological system, spaCy's [`Morphologizer`](https://spacy.io/api/morphologizer) is used instead.

Let's now try another example using a small German pipeline including the statistical `Morphologizer` component:

In [11]:
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?") # English: 'Where are you?'
get_morphological_features(doc)

Token	Case	Number	Person	PronType
----------------------------------------
Wo				Int
bist		Sing	2	
du	Nom	Sing	2	Prs
?				


## Word vectors and similarity

Remember last week's assignment of finding the most relevant context for a specific question? Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using algorithms like `word2vec` or models like `BERT`, and usually look like this:

```python
array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
```

The medium and large pipeline packages (that is, for example, `en_core_web_md` and `en_core_web_lg` as opposed to `en_core_web_sm`) provide word vectors for the entire vocabulary of the model. Small pipelines do not contain word vectors.

Pipeline packages that come with built-in word vectors make them available as the `Token.vector` attribute. `Doc.vector` and `Span.vector` will default to an average of their token vectors.
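
The averaging and comparison of vectors can be sketched in a few lines of plain Python. The three-dimensional `toy_vectors` below are invented for illustration (real word vectors have hundreds of dimensions), and cosine similarity is assumed as the comparison measure, which is what spaCy's `similarity` uses by default.

```python
import math


def average_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only
toy_vectors = {
    "boat": [0.9, 0.1, 0.0],
    "ship": [0.8, 0.2, 0.1],
    "money": [0.0, 0.1, 0.9],
}

phrase_vector = average_vector([toy_vectors["boat"], toy_vectors["ship"]])
print(cosine_similarity(toy_vectors["boat"], toy_vectors["ship"]))
print(cosine_similarity(phrase_vector, toy_vectors["money"]))
```

Because a phrase vector is just an average, word order is lost and one dominant word can drive the score of a whole sentence.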

Similarly to what we achieved in the previous lesson, we can compute a similarity score between sentences using the `similarity` method:

In [12]:
import spacy

nlp = spacy.load("en_core_web_md")  # make sure the package is installed first!
doc1 = nlp("The ship was traveling alonside the river.")
doc2 = nlp("The boat sailed next to the river bank.")
doc3 = nlp("I like to make money.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
print(doc1, "<->", doc3, doc1.similarity(doc3))
print(doc2, "<->", doc3, doc2.similarity(doc3))

# Similarity of tokens and spans
make_money = doc3[3:5]
bank = doc2[7]
river = doc2[6]
print(make_money, "<->", bank, make_money.similarity(bank))
print(make_money, "<->", river, make_money.similarity(river))

The ship was traveling alonside the river. <-> The boat sailed next to the river bank. 0.8644729105970773
The ship was traveling alonside the river. <-> I like to make money. 0.642535314700431
The boat sailed next to the river bank. <-> I like to make money. 0.7221674694573358
make money <-> bank 0.5089183
make money <-> river 0.2408454


Notice in this example how the similarity score is high for the first two sentences because of a clear semantic overlap: both sentences contain `river`, and the pairs `sailed`/`traveling` and `boat`/`ship` are probably close in embedding space.

When comparing the first and third sentences, the similarity is noticeably lower, since they share neither words nor meaning. However, the third sentence is more similar to the second one than to the first: this is most likely due to the presence of the word `bank`, which is semantically related to `money`. This is confirmed by directly comparing the similarity of `make money` with `river` and with `bank`.

>**💡 Interesting Fact** This is due to the usage of **static word vectors** like word2vec, which are pre-trained and fixed for all tokens in the vocabulary. Tokens exhibiting **polysemy** (i.e. multiple meanings) like `bank` will have all their possible meanings "crammed" into a single vector. This is one major limitation of static word vectors, and a reason why **contextualized word vectors** produced by pretrained neural language models, in which each output embedding is generated dynamically depending on the full context, have gained much traction in NLP.

## Other references

We saw only a small part of what spaCy can do. Next week, we will take a closer look at the tagging and parsing components of the library. The table below lists the library's main functionalities:

| **Name** | **Description** |
|---|---|
| Tokenization | Segmenting text into words, punctuation marks etc. |
| Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun. |
| Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Lemmatization | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences. |
| Named Entity Recognition (NER) | Labelling named “real-world” objects, like persons, companies or locations. |
| Entity Linking (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
| Similarity | Comparing words, text spans and documents and how similar they are to each other. |
| Text Classification | Assigning categories or labels to a whole document, or parts of a document. |
| Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| Training | Updating and improving a statistical model’s predictions. |
| Serialization | Saving objects to files or byte strings. |

Refer to [spaCy 101](https://spacy.io/usage/spacy-101) and the documentation for more details.

## Handling More Languages with Stanza

spaCy is a great tool for English, but what if you want to use it for other languages? The good news is that spaCy supports a wide range of languages, but the bad news is that it doesn't support all of them. In this section, we will see how to understand whether a language is supported, and what to do if it isn't.

The list of supported languages is available on the [spaCy website](https://spacy.io/usage/models#languages). If no packages are available for the language you are interested in, you can still use spaCy to process text in that language, but you will have to use the `Language` class directly, and you won't be able to use the pre-trained statistical models. This limits the functionality of spaCy, but it is still possible to use it for tokenization, lemmatization, and other basic NLP tasks.

An alternative is the [Stanza](https://stanfordnlp.github.io/stanza/) library, a Python NLP package from the Stanford NLP Group that provides its own neural pipelines and also includes a Python interface to the Java [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) toolkit. Stanza supports a wide range of languages, including Chinese, Japanese, and Arabic (full list [here](https://stanfordnlp.github.io/stanza/available_models.html)). Conveniently, the Stanza authors thoroughly report the [performance](https://stanfordnlp.github.io/stanza/performance.html) of their models, so you can easily decide whether to use them or not.

Stanza's `Pipeline` is very similar to spaCy's `Language`, and it can be used in a similar way. Notice that while we had to manually download pipelines for spaCy at the beginning of the notebook, Stanza pipelines are automatically downloaded when you first use them. Here are some examples matching the ones we saw for spaCy:

**Word and Sentence Tokenization, Multiword Tokens**

In [6]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

nlp = stanza.Pipeline(lang='fr', processors='tokenize')
doc = nlp("Ça n'est pas du tout une phrase.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {word.id}\tword: {word.text}\ttoken: {word.parent.text}' for word in sentence.words], sep='\n')

2023-02-03 15:56:33 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 4.70MB/s]                    
2023-02-03 15:56:33 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2023-02-03 15:56:33 INFO: Use device: gpu
2023-02-03 15:56:33 INFO: Loading: tokenize
2023-02-03 15:56:33 INFO: Done loading processors!
2023-02-03 15:56:33 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: test
id: (5,)	text: sentence
id: (6,)	text: for
id: (7,)	text: stanza
id: (8,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: another
id: (4,)	text: sentence
id: (5,)	text: .


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 4.77MB/s]                    
2023-02-03 15:56:33 INFO: Loading these models for language: fr (French):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |

2023-02-03 15:56:33 INFO: Use device: gpu
2023-02-03 15:56:33 INFO: Loading: tokenize
2023-02-03 15:56:33 INFO: Loading: mwt
2023-02-03 15:56:33 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: 1	word: Ça	token: Ça
id: 2	word: n'	token: n'
id: 3	word: est	token: est
id: 4	word: pas	token: pas
id: 5	word: de	token: du
id: 6	word: le	token: du
id: 7	word: tout	token: tout
id: 8	word: une	token: une
id: 9	word: phrase	token: phrase
id: 10	word: .	token: .
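
The French output above shows the difference between tokens and words: the single token `du` is a multi-word token that is expanded into the two syntactic words `de` and `le`. The relationship can be mimicked with a lookup table, as in the sketch below. The table `MWT_EXPANSIONS` and the function `expand_tokens` are hypothetical and only cover this one contraction; Stanza uses a trained `mwt` processor rather than a fixed dictionary.

```python
# Hypothetical expansion table with a single entry, for illustration only
MWT_EXPANSIONS = {"du": ["de", "le"]}


def expand_tokens(tokens):
    """Return (word_id, word, parent_token) rows, expanding multi-word tokens."""
    rows = []
    word_id = 1
    for token in tokens:
        # Tokens without an entry expand to themselves
        for word in MWT_EXPANSIONS.get(token.lower(), [token]):
            rows.append((word_id, word, token))
            word_id += 1
    return rows


for word_id, word, token in expand_tokens(["pas", "du", "tout"]):
    print(f"id: {word_id}\tword: {word}\ttoken: {token}")
```

Word ids keep increasing across the expansion, which is why the sentence above has more words than tokens.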


**POS Tagging and Morphological Features**

In [3]:
import stanza

nlp = stanza.Pipeline(lang='fr', processors='tokenize,mwt,pos')
doc = nlp('Ceci n’est pas une pipe.')
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

2023-02-03 15:53:21 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 1.58MB/s]                    
2023-02-03 15:53:22 INFO: Loading these models for language: fr (French):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |

2023-02-03 15:53:22 INFO: Use device: gpu
2023-02-03 15:53:22 INFO: Loading: tokenize
2023-02-03 15:53:22 INFO: Loading: mwt
2023-02-03 15:53:22 INFO: Loading: pos
2023-02-03 15:53:23 INFO: Done loading processors!


word: Ceci	upos: PRON	xpos: None	feats: Gender=Masc|Number=Sing|Person=3|PronType=Dem
word: n’	upos: ADV	xpos: None	feats: Polarity=Neg
word: est	upos: AUX	xpos: None	feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: pas	upos: ADV	xpos: None	feats: Polarity=Neg
word: une	upos: DET	xpos: None	feats: Definite=Ind|Gender=Fem|Number=Sing|PronType=Art
word: pipe	upos: NOUN	xpos: None	feats: Gender=Fem|Number=Sing
word: .	upos: PUNCT	xpos: None	feats: _
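
The `feats` strings printed above follow the Universal Dependencies convention of `Name=Value` pairs separated by `|`, with `_` (or `None` in Stanza) standing for "no features". A small helper can turn them into dictionaries for easier filtering; the name `parse_feats` is hypothetical and not part of Stanza.

```python
def parse_feats(feats):
    """Convert 'Gender=Fem|Number=Sing' into {'Gender': 'Fem', 'Number': 'Sing'}."""
    # Missing features are encoded as None, an empty string or "_"
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


features = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(features.get("Tense"), features.get("Case", "n/a"))
```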


**Named Entity Recognition**

In [7]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Gabriele Sarti teaches at the University of Groningen. He lives in the Netherlands.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

2023-02-03 15:58:22 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 17.1MB/s]                    
2023-02-03 15:58:23 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2023-02-03 15:58:23 INFO: Use device: gpu
2023-02-03 15:58:23 INFO: Loading: tokenize
2023-02-03 15:58:23 INFO: Loading: ner
2023-02-03 15:58:23 INFO: Done loading processors!


entity: Gabriele Sarti	type: PERSON
entity: the University of Groningen	type: ORG
entity: Netherlands	type: GPE


Find more usage examples in the [official documentation](https://stanfordnlp.github.io/stanza/index.html).