# Welcome to Stanza!

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and it offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit.

In this tutorial, we will demonstrate how to set up Stanza and annotate text with its native neural network NLP models. For the use of the Python CoreNLP interface, please see other tutorials.


Original source: https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_Beginners_Guide.ipynb

# 1. Installing Stanza
Note that Stanza only supports Python 3.6 and above. Installing and importing Stanza is as simple as running the following commands:

In [None]:
!pip install Stanza

Collecting Stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)
Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171046 sha256=3bef87fc3129c45a0aaebb2e927acc96cc7b7593abe6861857dad70bdca3c4de
  Stored in directory: /root/.cache/pip/wheels/8a/4e/b6/57b01db010d17ef6ea9b40300af725ef3e210cb1acfb7ac8b6
Successfully built emoji
Installing collected packages: emoji, Stanza
Successfully installed Stanza-1.3.0 emoji-1.7.0


In [None]:
# Import the package
import stanza
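
Since Stanza requires Python 3.6 or later, a notebook can check the interpreter version before installing anything. The following is a minimal sketch; the helper name python_is_supported is hypothetical and not part of Stanza.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this tutorial


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= minimum


# Fail early with a readable message instead of a confusing install error
if not python_is_supported():
    raise RuntimeError("Stanza needs Python %d.%d or later" % MIN_PYTHON)
```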

## More Information
For common troubleshooting, please visit our troubleshooting page.

# 2. Downloading Models
You can download models with the stanza.download command. The language can be specified with either a full language name (e.g., "english"), or a short code (e.g., "en").

* By default, models will be saved to your ~/stanza_resources directory. If you want to specify your own path to save the model files, you can pass a model_dir=your_path argument.

In [None]:
# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')

# Similarly, download a (simplified) Chinese model
# Note that you can use verbose=False to turn off all printed messages
print("Downloading Chinese model...")
stanza.download('zh', verbose=False)

Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:27:18 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-10 06:27:35 INFO: Finished downloading models and saved to /root/stanza_resources.


Downloading Chinese model...
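
Downloads can take a while, so it can be handy to check whether a language folder already exists before calling stanza.download. The sketch below assumes that models for a language live in a subfolder named after its code under the resources directory, which is what the path /root/stanza_resources/en/default.zip in the logs later in this notebook suggests. The helper models_present is hypothetical, and aliases such as "zh" may map to a differently named folder.

```python
import os

# Default location named in this tutorial
DEFAULT_MODEL_DIR = os.path.join(os.path.expanduser("~"), "stanza_resources")


def models_present(lang, model_dir=DEFAULT_MODEL_DIR):
    """Return True if a non-empty folder for `lang` exists under model_dir."""
    lang_dir = os.path.join(model_dir, lang)
    return os.path.isdir(lang_dir) and bool(os.listdir(lang_dir))


# Usage idea (requires stanza to be installed and imported):
# if not models_present("en"):
#     stanza.download("en")
```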


## More Information
Pretrained models are provided for 60+ different languages. For all languages, available models and the corresponding short language codes, please check out the models page.

# 3. Processing Text
## Constructing Pipeline
To process a piece of text, you'll need to first construct a Pipeline with different Processor units. The pipeline is language-specific, so again you'll need to first specify the language (see examples).

* By default, the pipeline will include all processors, including tokenization, multi-word token expansion, part-of-speech tagging, lemmatization, dependency parsing and named entity recognition (for supported languages). However, you can always specify what processors you want to include with the processors argument.

* Stanza's pipeline is CUDA-aware: a CUDA device will be used whenever one is available, and the CPU will be used otherwise. You can force the pipeline to use the CPU regardless by setting use_gpu=False.

* Again, you can suppress all printed messages by setting verbose=False.

In [None]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en', verbose=True)

print()

# Build a Chinese pipeline with a customized processor list and force it to use CPU
# (verbose=True keeps the logging shown below; use verbose=False to silence it)
print("Building a Chinese pipeline...")
zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=True, use_gpu=False)

2022-04-10 06:28:21 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-10 06:28:22 INFO: Use device: cpu
2022-04-10 06:28:22 INFO: Loading: tokenize


Building an English pipeline...


2022-04-10 06:28:22 INFO: Loading: pos
2022-04-10 06:28:22 INFO: Loading: lemma
2022-04-10 06:28:22 INFO: Loading: depparse
2022-04-10 06:28:22 INFO: Loading: sentiment
2022-04-10 06:28:23 INFO: Loading: constituency
2022-04-10 06:28:24 INFO: Loading: ner
2022-04-10 06:28:24 INFO: Done loading processors!
2022-04-10 06:28:24 INFO: "zh" is an alias for "zh-hans"
2022-04-10 06:28:24 INFO: Loading these models for language: zh-hans (Simplified_Chinese):
| Processor | Package |
-----------------------
| tokenize  | gsdsimp |
| pos       | gsdsimp |
| lemma     | gsdsimp |
| depparse  | gsdsimp |

2022-04-10 06:28:24 INFO: Use device: cpu
2022-04-10 06:28:24 INFO: Loading: tokenize
2022-04-10 06:28:24 INFO: Loading: pos



Building a Chinese pipeline...


2022-04-10 06:28:25 INFO: Loading: lemma
2022-04-10 06:28:25 INFO: Loading: depparse
2022-04-10 06:28:26 INFO: Done loading processors!
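
The device rule described above (use the GPU only when it is both requested and available) can be written as a one-line decision. This is an illustrative sketch, not Stanza's own code; select_device is a hypothetical helper, and in practice the cuda_available flag would typically come from torch.cuda.is_available().

```python
def select_device(use_gpu=True, cuda_available=False):
    """Pick 'cuda' only if the GPU is requested AND available, else 'cpu'."""
    return "cuda" if (use_gpu and cuda_available) else "cpu"


# Walk through the three interesting combinations
for use_gpu, cuda_available in [(True, True), (True, False), (False, True)]:
    print(use_gpu, cuda_available, "->", select_device(use_gpu, cuda_available))
```

The last combination corresponds to passing use_gpu=False on a machine that does have a GPU.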


## Annotating Text
After a pipeline is successfully constructed, you can get annotations of a piece of text simply by passing the string into the pipeline object. The pipeline will return a Document object, from which detailed annotations can be accessed. For example:

In [None]:
# Processing English text
en_doc = en_nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
print(type(en_doc))

print()

# Processing Chinese text
zh_doc = zh_nlp("达沃斯世界经济论坛是每年全球政商界领袖聚在一起的年度盛事。")
print(type(zh_doc))




<class 'stanza.models.common.doc.Document'>

<class 'stanza.models.common.doc.Document'>


## More Information
For more information on how to construct a pipeline and information on different processors, please visit our pipeline page.

# 4. Accessing Annotations
Annotations can be accessed from the returned Document object.

A Document contains a list of Sentences, and a Sentence contains a list of Tokens and Words. For the most part Tokens and Words overlap, but some tokens can be divided into multiple words. For instance, the French token aux is divided into the words à and les, while in English a word and a token are equivalent. Note that dependency parses are derived over Words.

Additionally, a Span object is used to represent annotations that are part of a document, such as named entity mentions.
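
To make the Token versus Word distinction concrete, here is a small sketch built from simplified stand-in classes. These are not Stanza's own Document classes, and the second token ("enfants") is added purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str


@dataclass
class Token:
    text: str
    words: List[Word] = field(default_factory=list)


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

    @property
    def words(self):
        # Flatten the words of every token, keeping sentence order
        return [w for t in self.tokens for w in t.words]


# French "aux" is one token that expands into two words
sent = Sentence(tokens=[
    Token("aux", [Word("à"), Word("les")]),
    Token("enfants", [Word("enfants")]),
])
print([t.text for t in sent.tokens])
print([w.text for w in sent.words])
```

The sentence has two tokens but three words, which is why dependency heads refer to word positions rather than token positions.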

The following example iterates over all English sentences and words, and prints the word information one by one:

In [None]:
for i, sent in enumerate(en_doc.sentences):
    print("[Sentence {}]".format(i+1))

    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))
        
    print("")

[Sentence 1]
Barack      	Barack      	PROPN 	4	nsubj:pass  
Obama       	Obama       	PROPN 	1	flat        
was         	be          	AUX   	4	aux:pass    
born        	bear        	VERB  	0	root        
in          	in          	ADP   	6	case        
Hawaii      	Hawaii      	PROPN 	4	obl         
.           	.           	PUNCT 	4	punct       

[Sentence 2]
He          	he          	PRON  	3	nsubj:pass  
was         	be          	AUX   	3	aux:pass    
elected     	elect       	VERB  	0	root        
president   	president   	NOUN  	3	xcomp       
in          	in          	ADP   	6	case        
2008        	2008        	NUM   	3	obl         
.           	.           	PUNCT 	3	punct       
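
The head column in the output above holds the 1-based position of each word's syntactic head, with 0 marking the root. As a rough sketch, the rows for Sentence 1 can be turned into a head-to-dependents map without Stanza; the rows list below is copied by hand from that output.

```python
from collections import defaultdict

# (id, text, head, deprel) copied from Sentence 1 of the output above
rows = [
    (1, "Barack", 4, "nsubj:pass"),
    (2, "Obama", 1, "flat"),
    (3, "was", 4, "aux:pass"),
    (4, "born", 0, "root"),
    (5, "in", 6, "case"),
    (6, "Hawaii", 4, "obl"),
    (7, ".", 4, "punct"),
]

text_by_id = {wid: text for wid, text, _, _ in rows}
children = defaultdict(list)
root = None
for wid, text, head, deprel in rows:
    if head == 0:
        # head 0 marks the root of the sentence
        root = text
    else:
        # attach this word to the text of its head word
        children[text_by_id[head]].append((text, deprel))

print("root:", root)
for head_text, deps in children.items():
    print(head_text, "->", deps)
```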



The following example iterates over all extracted named entity mentions and prints out their character spans and types.

In [None]:
print("Mention text\tType\tStart-End")
for ent in en_doc.ents:
    print("{}\t{}\t{}-{}".format(ent.text, ent.type, ent.start_char, ent.end_char))

Mention text	Type	Start-End
Barack Obama	PERSON	0-12
Hawaii	GPE	25-31
2008	DATE	62-66
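
The start and end offsets index into the original input string, with the end offset exclusive. A quick way to convince yourself is to slice the raw text with the offsets printed above, as in this small sketch (the mentions list is copied by hand from that output):

```python
text = "Barack Obama was born in Hawaii.  He was elected president in 2008."

# (start_char, end_char, type) copied from the output above
mentions = [(0, 12, "PERSON"), (25, 31, "GPE"), (62, 66, "DATE")]

for start, end, label in mentions:
    # end_char is exclusive, so a plain slice recovers the mention text
    print(f"{text[start:end]!r}\t{label}")
```

Note that the offsets account for the two spaces between the sentences in the input string.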


And similarly for the Chinese text:

In [None]:
for i, sent in enumerate(zh_doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
达沃斯         	达沃斯         	PROPN 	4	nmod        
世界          	世界          	NOUN  	4	nmod        
经济          	经济          	NOUN  	4	nmod        
论坛          	论坛          	NOUN  	16	nsubj       
是           	是           	AUX   	16	cop         
每年          	每年          	DET   	10	det         
全球          	全球          	NOUN  	10	nmod        
政           	政           	PART  	9	case        
商界          	商界          	NOUN  	10	nmod        
领袖          	领袖          	NOUN  	11	nsubj       
聚           	聚           	VERB  	16	acl:relcl   
在           	在           	VERB  	11	mark        
一起          	一起          	NOUN  	11	obj         
的           	的           	PART  	11	mark:rel    
年度          	年度          	NOUN  	16	nmod        
盛事          	盛事          	NOUN  	0	root        
。           	。           	PUNCT 	16	punct       



Note that the Chinese pipeline above was built without the ner processor, so zh_doc.ents is empty and the cell below prints only the header:

In [None]:


print("Mention text\tType\tStart-End")

for zh in zh_doc.ents:
    print("{}\t{}\t{}-{}".format(zh.text, zh.type, zh.start_char, zh.end_char))

Mention text	Type	Start-End


Alternatively, you can directly print a Word object to view all its annotations as a Python dict:

In [None]:
word = en_doc.sentences[0].words[0]
print(word)

{
  "id": 1,
  "text": "Barack",
  "lemma": "Barack",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 4,
  "deprel": "nsubj:pass",
  "start_char": 0,
  "end_char": 6
}
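
Because the printed form shown above is valid JSON, it can be loaded back into a regular Python dict for further processing. The sketch below works on the printed text only and also splits the feats string, which follows the Key=Value|Key=Value convention.

```python
import json

# The printed representation shown above, pasted as a string
printed = '''{
  "id": 1,
  "text": "Barack",
  "lemma": "Barack",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 4,
  "deprel": "nsubj:pass",
  "start_char": 0,
  "end_char": 6
}'''

word_info = json.loads(printed)

# Split morphological features into a dict of their own
feats = dict(pair.split("=") for pair in word_info["feats"].split("|"))
print(word_info["upos"], feats)
```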


# NER Example Usage

Running the NERProcessor simply requires the TokenizeProcessor. After the pipeline is run, the Document will contain a list of Sentences, and the Sentences will contain lists of Tokens. Named entities can be accessed through the entities or ents properties of a Document or Sentence. Alternatively, token-level NER tags can be accessed via the ner field of each Token.

## Accessing Named Entities for Sentence and Document
Here is an example of performing named entity recognition for a piece of text and accessing the named entities in the entire document:

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")

print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

2022-04-10 06:28:27 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2022-04-10 06:28:27 INFO: Use device: cpu
2022-04-10 06:28:27 INFO: Loading: tokenize
2022-04-10 06:28:27 INFO: Loading: ner
2022-04-10 06:28:27 INFO: Done loading processors!


entity: Chris Manning	type: PERSON
entity: Stanford University	type: ORG
entity: the Bay Area	type: LOC


Instead of accessing entities in the entire document, you can also access the named entities in each sentence of the document. The following example produces the same result as the one above, by accessing entities from sentences instead of the entire document:

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

2022-04-10 06:28:27 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2022-04-10 06:28:27 INFO: Use device: cpu
2022-04-10 06:28:27 INFO: Loading: tokenize
2022-04-10 06:28:27 INFO: Loading: ner
2022-04-10 06:28:28 INFO: Done loading processors!


entity: Chris Manning	type: PERSON
entity: Stanford University	type: ORG
entity: the Bay Area	type: LOC


As can be seen in the output, Stanza correctly identifies that Chris Manning is a person, Stanford University is an organization, and the Bay Area is a location.

## Accessing Named Entity Recognition (NER) Tags for Token
It might sometimes be useful to access the BIOES NER tags for each token. Here is an example:

In [None]:
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')


2022-04-10 06:28:28 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2022-04-10 06:28:28 INFO: Use device: cpu
2022-04-10 06:28:28 INFO: Loading: tokenize
2022-04-10 06:28:28 INFO: Loading: ner
2022-04-10 06:28:29 INFO: Done loading processors!


token: Chris	ner: B-PERSON
token: Manning	ner: E-PERSON
token: teaches	ner: O
token: at	ner: O
token: Stanford	ner: B-ORG
token: University	ner: E-ORG
token: .	ner: O
token: He	ner: O
token: lives	ner: O
token: in	ner: O
token: the	ner: B-LOC
token: Bay	ner: I-LOC
token: Area	ner: E-LOC
token: .	ner: O
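
In the BIOES scheme, B, I, and E mark the beginning, inside, and end of a multi-token entity, S marks a single-token entity, and O marks tokens outside any entity. As a sketch of how token-level tags relate to the entity spans shown earlier, the hypothetical helper below groups the tags from the output above back into mentions. It assumes well-formed tag sequences and is not Stanza's own decoder.

```python
# (token, tag) pairs copied from the output above
tagged = [
    ("Chris", "B-PERSON"), ("Manning", "E-PERSON"), ("teaches", "O"),
    ("at", "O"), ("Stanford", "B-ORG"), ("University", "E-ORG"), (".", "O"),
    ("He", "O"), ("lives", "O"), ("in", "O"),
    ("the", "B-LOC"), ("Bay", "I-LOC"), ("Area", "E-LOC"), (".", "O"),
]


def bioes_to_spans(pairs):
    """Group BIOES-tagged tokens into (text, type) entity spans."""
    spans, current = [], []
    for token, tag in pairs:
        if tag == "O":
            current = []
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":            # single-token entity
            spans.append((token, label))
            current = []
        elif prefix == "B":          # begin a new entity
            current = [token]
        else:                        # "I" or "E": continue the entity
            current.append(token)
            if prefix == "E":        # end: emit the collected tokens
                spans.append((" ".join(current), label))
                current = []
    return spans


print(bioes_to_spans(tagged))
```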


## Training-Only Options
Most training-only options are documented in the argument parser of the NER tagger.
https://github.com/stanfordnlp/stanza/blob/main/stanza/models/ner_tagger.py#L32 



## More Information
For all information on different data objects, please visit our data objects page.

https://stanfordnlp.github.io/stanza/data_objects.html 

# 5. Resources
Apart from this interactive tutorial, we also provide tutorials on our website that cover a variety of use cases such as how to use different model "packages" for a language, how to use spaCy as a tokenizer, how to process pretokenized text without running the tokenizer, etc. For these tutorials please visit our Tutorials page.

Other resources that you may find helpful include:

* Stanza Homepage
https://stanfordnlp.github.io/stanza/index.html 
* FAQs
https://stanfordnlp.github.io/stanza/faq.html 
* GitHub Repo
https://github.com/stanfordnlp/stanza 
* Reporting Issues
https://github.com/stanfordnlp/stanza/issues
* Stanza System Description Paper
http://arxiv.org/abs/2003.07082

## Building A Pipeline Through Stanza

Stanza is a Python natural language analysis library created by the Stanford NLP group. It is a collection of NLP tools that can be used to create neural network pipelines for text analysis. It supports functionalities like tokenization, multi-word token expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, named entity recognition (NER), and sentiment analysis. It uses Universal Dependencies to provide consistent annotations of grammar in over 60 human languages. Additionally, it provides a Python interface to the CoreNLP Java package. This can be used to access additional functionalities like constituency parsing, coreference resolution, and linguistic pattern matching.

Source - 
https://analyticsindiamag.com/how-to-use-stanza-by-stanford-nlp-group-with-python-code/ 

Stanza provides a plethora of pre-trained NLP models for 66 human languages that we can make use of. Downloading a pre-trained model and creating a pipeline is as easy as:

In [None]:
stanza.download('en')
nlp = stanza.Pipeline('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:28:29 INFO: Downloading default packages for language: en (English)...
2022-04-10 06:28:31 INFO: File exists: /root/stanza_resources/en/default.zip.
2022-04-10 06:28:36 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:28:36 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-10 06:28:36 INFO: Use device: cpu
2022-04-10 06:28:36 INFO: Loading: tokenize
2022-04-10 06:28:36 INFO: Loading: pos
2022-04-10 06:28:37 INFO: Loading: lemma
2022-04-10 06:28:37 INFO: Loading: depparse
2022-04-10 06:28:37 INFO: Loading: sentiment
2022-04-10 06:28:38 INFO: Loading: constituency
2022-04-10 06:28:38 INFO: Loading: ner
2022-04-10 06:28:39 INFO: Done loading processors!


## Specifying the Model Package and Download Directory

By default, this downloads the default package and all processors for the specified language (English in our case) to the ~/stanza_resources directory. A language may have multiple packages trained on different datasets. For example, English packages include ewt, gum, lines, and partut; in the version 1.3.0 run logged in this notebook, the default English package is combined.

A list of all available languages and corresponding packages can be found on the Stanza models page.

To explicitly choose a package, use the package argument. To change the download location of the models, use the model_dir argument.

In [None]:
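stanza.download('en', model_dir='/models/english/')
# Note: this call still loads models from the default directory;
# point the Pipeline at /models/english/ to use the copy downloaded above
nlp = stanza.Pipeline('en')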

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:28:39 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-10 06:28:57 INFO: Finished downloading models and saved to /models/english/.
2022-04-10 06:28:57 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-10 06:28:57 INFO: Use device: cpu
2022-04-10 06:28:57 INFO: Loading: tokenize
2022-04-10 06:28:57 INFO: Loading: pos
2022-04-10 06:28:57 INFO: Loading: lemma
2022-04-10 06:28:57 INFO: Loading: depparse
2022-04-10 06:28:57 INFO: Loading: sentiment
2022-04-10 06:28:58 INFO: Loading: constituency
2022-04-10 06:28:59 INFO: Loading: ner
2022-04-10 06:28:59 INFO: Done loading processors!


## Specifying the Processors

Depending on the use case, one might need to specify a set of processors and the package to fetch each processor from. There are two ways to specify the processors argument:

* using a string of comma-separated processors
* using a dictionary of processor-package pairs

1. Only downloading the required processors:

To download only the required processors, we can pass a comma-separated string of processor names, as shown below:

In [None]:
stanza.download('hi', processors='tokenize,pos')
nlp = stanza.Pipeline('hi', processors='tokenize,pos') 

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:28:59 INFO: Downloading these customized packages for language: hi (Hindi)...
| Processor | Package |
-----------------------
| tokenize  | hdtb    |
| pos       | hdtb    |
| pretrain  | hdtb    |



Downloading https://huggingface.co/stanfordnlp/stanza-hi/resolve/v1.3.0/models/tokenize/hdtb.pt:   0%|        …

Downloading https://huggingface.co/stanfordnlp/stanza-hi/resolve/v1.3.0/models/pos/hdtb.pt:   0%|          | 0…

Downloading https://huggingface.co/stanfordnlp/stanza-hi/resolve/v1.3.0/models/pretrain/hdtb.pt:   0%|        …

2022-04-10 06:29:19 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:29:19 INFO: Loading these models for language: hi (Hindi):
| Processor | Package |
-----------------------
| tokenize  | hdtb    |
| pos       | hdtb    |

2022-04-10 06:29:19 INFO: Use device: cpu
2022-04-10 06:29:19 INFO: Loading: tokenize
2022-04-10 06:29:19 INFO: Loading: pos
2022-04-10 06:29:19 INFO: Done loading processors!


This downloads and loads the default tokenize (TokenizeProcessor) and pos (POSProcessor) processors for Hindi.
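
Conceptually, the two accepted forms of the processors argument carry the same kind of information: a set of processor names, each tied to a package. The sketch below shows one way to bring both forms into a single dict shape. The helper normalize_processors is hypothetical and is not how Stanza resolves the argument internally.

```python
def normalize_processors(processors, default_package="default"):
    """Turn either accepted form into a {processor: package} dict."""
    if isinstance(processors, str):
        # "tokenize,pos" -> every listed processor uses the default package
        names = [p.strip() for p in processors.split(",") if p.strip()]
        return {name: default_package for name in names}
    # Already a mapping of processor name to package name
    return dict(processors)


print(normalize_processors("tokenize,pos"))
print(normalize_processors({"ner": "conll02"}))
```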

2. Specifying package names for the processors:

Choosing the package name for processors can be done using the package argument.

In [None]:
stanza.download('it', processors='tokenize,mwt', package='twittiro')
nlp = stanza.Pipeline('it', processors='tokenize,mwt', package='twittiro') 

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:29:20 INFO: Downloading these customized packages for language: it (Italian)...
| Processor | Package  |
------------------------
| tokenize  | twittiro |
| mwt       | twittiro |



Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.3.0/models/tokenize/twittiro.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.3.0/models/mwt/twittiro.pt:   0%|         …

2022-04-10 06:29:26 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:29:26 INFO: Loading these models for language: it (Italian):
| Processor | Package  |
------------------------
| tokenize  | twittiro |
| mwt       | twittiro |

2022-04-10 06:29:26 INFO: Use device: cpu
2022-04-10 06:29:26 INFO: Loading: tokenize
2022-04-10 06:29:26 INFO: Loading: mwt
2022-04-10 06:29:26 INFO: Done loading processors!


This downloads and initializes the tokenize (TokenizeProcessor) and mwt (MWTProcessor) processors trained on the twittiro dataset for Italian.

We may need to specify the package for one or a few processors and keep the default package for the rest. This can be achieved using the dictionary-based processors argument.

This example shows how to download and load the NERProcessor trained on the Dutch CoNLL02 dataset, but use the default package for all other processors for Dutch.

In [None]:
stanza.download('nl', processors={'ner': 'conll02'})
nlp = stanza.Pipeline('nl', processors={'ner': 'conll02'}) 

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:29:30 INFO: Downloading these customized packages for language: nl (Dutch)...
| Processor       | Package |
-----------------------------
| tokenize        | alpino  |
| pos             | alpino  |
| lemma           | alpino  |
| depparse        | alpino  |
| ner             | conll02 |
| backward_charlm | ccwiki  |
| pretrain        | alpino  |
| forward_charlm  | ccwiki  |



Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/tokenize/alpino.pt:   0%|      …

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/pos/alpino.pt:   0%|          |…

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/lemma/alpino.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/depparse/alpino.pt:   0%|      …

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/ner/conll02.pt:   0%|          …

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/backward_charlm/ccwiki.pt:   0%…

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/pretrain/alpino.pt:   0%|      …

Downloading https://huggingface.co/stanfordnlp/stanza-nl/resolve/v1.3.0/models/forward_charlm/ccwiki.pt:   0%|…

2022-04-10 06:30:36 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:30:36 INFO: Loading these models for language: nl (Dutch):
| Processor | Package |
-----------------------
| tokenize  | alpino  |
| pos       | alpino  |
| lemma     | alpino  |
| depparse  | alpino  |
| ner       | conll02 |

2022-04-10 06:30:36 INFO: Use device: cpu
2022-04-10 06:30:36 INFO: Loading: tokenize
2022-04-10 06:30:36 INFO: Loading: pos
2022-04-10 06:30:37 INFO: Loading: lemma
2022-04-10 06:30:37 INFO: Loading: depparse
2022-04-10 06:30:38 INFO: Loading: ner
2022-04-10 06:30:39 INFO: Done loading processors!


For more granular control over the package names for the processors, we can set the package argument to None and use the dictionary-based processors argument to specify the package name for each processor.

The example shows how to use a GSD TokenizeProcessor, an HDT POSProcessor, a CoNLL03 NERProcessor, and a default LemmaProcessor for German.

In [None]:
processor_dict = {
    'tokenize': 'gsd',
    'pos': 'hdt',
    'ner': 'conll03',
    'lemma': 'default'
}
stanza.download('de', processors=processor_dict, package=None)
nlp = stanza.Pipeline('de', processors=processor_dict, package=None)


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:38:48 INFO: Downloading these customized packages for language: de (German)...
| Processor       | Package  |
------------------------------
| tokenize        | gsd      |
| mwt             | gsd      |
| pos             | hdt      |
| lemma           | gsd      |
| ner             | conll03  |
| pretrain        | hdt      |
| forward_charlm  | newswiki |
| backward_charlm | newswiki |



Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/tokenize/gsd.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/mwt/gsd.pt:   0%|          | 0.…

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/pos/hdt.pt:   0%|          | 0.…

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/lemma/gsd.pt:   0%|          | …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/ner/conll03.pt:   0%|          …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/pretrain/hdt.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/forward_charlm/newswiki.pt:   0…

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.3.0/models/backward_charlm/newswiki.pt:   …

2022-04-10 06:39:52 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:39:52 INFO: Loading these models for language: de (German):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | hdt     |
| lemma     | gsd     |
| ner       | conll03 |

2022-04-10 06:39:52 INFO: Use device: cpu
2022-04-10 06:39:52 INFO: Loading: tokenize
2022-04-10 06:39:52 INFO: Loading: mwt
2022-04-10 06:39:52 INFO: Loading: pos
2022-04-10 06:39:53 INFO: Loading: lemma
2022-04-10 06:39:53 INFO: Loading: ner
2022-04-10 06:39:54 INFO: Done loading processors!
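
The Dutch and German examples differ in one respect: with the default package argument, unlisted processors keep their defaults, while package=None loads only what the dictionary lists. The sketch below imitates that rule with a hypothetical helper, resolve_packages. The default tables are read off the loading logs above and are assumptions for illustration. The sketch also does not add supporting processors automatically, whereas the German log shows Stanza adding mwt on its own.

```python
def resolve_packages(lang_defaults, overrides, package="default"):
    """Combine per-language defaults with a processors dict."""
    # package=None means: load only the processors that are listed
    resolved = {} if package is None else dict(lang_defaults)
    for name, pkg in overrides.items():
        if pkg == "default":
            # fall back to the language default when one is known
            pkg = lang_defaults.get(name, pkg)
        resolved[name] = pkg
    return resolved


# Dutch: defaults taken from the loading log above, with one override
nl_defaults = {"tokenize": "alpino", "pos": "alpino",
               "lemma": "alpino", "depparse": "alpino"}
print(resolve_packages(nl_defaults, {"ner": "conll02"}))

# German: only the listed processors, because package=None
de_defaults = {"lemma": "gsd"}  # the log shows lemma 'default' resolving to gsd
processor_dict = {"tokenize": "gsd", "pos": "hdt",
                  "ner": "conll03", "lemma": "default"}
print(resolve_packages(de_defaults, processor_dict, package=None))
```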


## Creating & Overwriting Processors

In version 1.1, Stanza added the ability to create new Processors with the @register_processor decorator and to overwrite existing ones with the @register_processor_variant decorator.

In [None]:
from stanza.pipeline.processor import Processor, register_processor, register_processor_variant
@register_processor("lowercase")
class LowercaseProcessor(Processor):
    ''' Processor that lowercases all text '''
    _requires = set(['tokenize'])
    _provides = set(['lowercase'])
    def __init__(self, config, pipeline, use_gpu):
        pass
    def _set_up_model(self, *args):
        pass
    def process(self, doc):
        doc.text = doc.text.lower()
        for sent in doc.sentences:
            for tok in sent.tokens:
                tok.text = tok.text.lower()
            for word in sent.words:
                word.text = word.text.lower()
        return doc 

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,lowercase')
doc = nlp('''Question answering is a task where a sentence or sample of text is provided from
 which questions are asked and must be answered.''')
s = []

for sentence in doc.sentences:
    for word in sentence.words:
        s.append(word.text)
        
print(" ".join(s)) 

2022-04-10 06:42:04 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| lowercase | default  |

2022-04-10 06:42:04 INFO: Use device: cpu
2022-04-10 06:42:04 INFO: Loading: tokenize
2022-04-10 06:42:04 INFO: Loading: lowercase
2022-04-10 06:42:04 INFO: Done loading processors!


question answering is a task where a sentence or sample of text is provided from which questions are asked and must be answered .
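
The decorator in the example above follows a common registry pattern: a class decorator records the class under a name, and that name can later be used to look the class up. The sketch below illustrates the pattern in plain Python. All names in it (PROCESSOR_REGISTRY, register, run) are hypothetical, and it is not Stanza's implementation.

```python
PROCESSOR_REGISTRY = {}  # hypothetical stand-in for a library-side registry


def register(name):
    """Class decorator that records a class under `name`."""
    def decorator(cls):
        PROCESSOR_REGISTRY[name] = cls
        return cls
    return decorator


@register("lowercase")
class Lowercase:
    def process(self, text):
        return text.lower()


def run(names, text):
    # Look up each registered step by name and apply it in order
    for name in names.split(","):
        text = PROCESSOR_REGISTRY[name]().process(text)
    return text


print(run("lowercase", "Question Answering"))
```

This is why the string 'tokenize,lowercase' is enough to add the new step to a pipeline once the class has been registered.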


## Annotating a Document

After a pipeline has been created, we can annotate a string/document by simply passing it to the Pipeline object.

In [None]:
stanza.download('hi')
hi_nlp = stanza.Pipeline('hi')
hindi_doc = hi_nlp('''प्रश्न का उत्तर देना एक ऐसा कार्य है जहाँ एक वाक्य या पाठ
 का नमूना प्रदान किया जाता है जहाँ से प्रश्न पूछे जाते हैं और उसका उत्तर दिया जाना चाहिए।''')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-10 06:44:26 INFO: Downloading default packages for language: hi (Hindi)...


Downloading https://huggingface.co/stanfordnlp/stanza-hi/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-10 06:44:34 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:44:34 INFO: Loading these models for language: hi (Hindi):
| Processor | Package |
-----------------------
| tokenize  | hdtb    |
| pos       | hdtb    |
| lemma     | hdtb    |
| depparse  | hdtb    |

2022-04-10 06:44:34 INFO: Use device: cpu
2022-04-10 06:44:34 INFO: Loading: tokenize
2022-04-10 06:44:34 INFO: Loading: pos
2022-04-10 06:44:34 INFO: Loading: lemma
2022-04-10 06:44:34 INFO: Loading: depparse
2022-04-10 06:44:35 INFO: Done loading processors!


Printing each word with its lemma and POS tag:

In [None]:
for sentence in hindi_doc.sentences:
  for word in sentence.words:
    print("{:12s}\t{:12s}\t{:6s}".format(word.text,word.lemma, word.pos))

प्रश्न      	प्रश्न      	NOUN  
का          	का          	ADP   
उत्तर       	उत्तर       	NOUN  
देना        	दे          	VERB  
एक          	एक          	NUM   
ऐसा         	ऐसा         	DET   
कार्य       	कार्य       	NOUN  
है          	है          	VERB  
जहाँ        	जहाँ        	PRON  
एक          	एक          	NUM   
वाक्य       	वाक्य       	NOUN  
या          	या          	CCONJ 
पाठ         	पाठ         	NOUN  
का          	का          	ADP   
नमूना       	नमूना       	NOUN  
प्रदान      	प्रदान      	NOUN  
किया        	कर          	VERB  
जाता        	जा          	AUX   
है          	है          	AUX   
जहाँ        	जहाँ        	PRON  
से          	से          	ADP   
प्रश्न      	प्रश्न      	NOUN  
पूछे        	पूछ         	VERB  
जाते        	जा          	AUX   
हैं         	है          	AUX   
और          	और          	CCONJ 
उसका        	वह          	PRON  
उत्तर       	उत्तर       	NOUN  
दिया        	दे          	VERB  
जाना        	जा          	AUX   
चाहिए     

Printing the named entities and dependencies of each sentence (the Hindi pipeline above loads no ner processor, so the entity list is empty):

In [None]:
for sentence in hindi_doc.sentences:
    print(sentence.ents)
    print(sentence.dependencies)


[]
[({
  "id": 3,
  "text": "उत्तर",
  "lemma": "उत्तर",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Case=Nom|Gender=Masc|Number=Sing|Person=3",
  "head": 4,
  "deprel": "obj",
  "start_char": 10,
  "end_char": 15
}, 'nmod', {
  "id": 1,
  "text": "प्रश्न",
  "lemma": "प्रश्न",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Case=Acc|Gender=Masc|Number=Sing|Person=3",
  "head": 3,
  "deprel": "nmod",
  "start_char": 0,
  "end_char": 6
}), ({
  "id": 1,
  "text": "प्रश्न",
  "lemma": "प्रश्न",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Case=Acc|Gender=Masc|Number=Sing|Person=3",
  "head": 3,
  "deprel": "nmod",
  "start_char": 0,
  "end_char": 6
}, 'case', {
  "id": 2,
  "text": "का",
  "lemma": "का",
  "upos": "ADP",
  "xpos": "PSP",
  "feats": "AdpType=Post|Case=Nom|Gender=Masc|Number=Sing",
  "head": 1,
  "deprel": "case",
  "start_char": 7,
  "end_char": 9
}), ({
  "id": 4,
  "text": "देना",
  "lemma": "दे",
  "upos": "VERB",
  "xpos": "VM",
  "feats": "Case=Nom|VerbForm=Inf",
  

Stanza's online demo
The dependencies can also be visualized with the online demo, available at http://stanza.run/
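
The raw dependency output shown earlier is verbose, so a compact text view can be handy when the demo is not at hand. The sketch below is an illustration rather than part of the original notebook: it works on plain word dictionaries shaped like the ones printed above (reduced to the fields it needs), and the helper name `format_dependencies` is hypothetical. It relies on the convention that `head` holds the `id` of the governing word, with `0` marking the root. Word 4 is cut off in the output above, so its `head` and `deprel` are deliberately left out instead of guessed. With a live pipeline, similar dictionaries can probably be obtained from `word.to_dict()`, but treat that as an assumption to verify.

```python
# Word dictionaries copied by hand from the truncated output above.
# Word 4 is cut off there, so its "head" and "deprel" are left out.
words = [
    {"id": 1, "text": "प्रश्न", "upos": "NOUN", "head": 3, "deprel": "nmod"},
    {"id": 2, "text": "का", "upos": "ADP", "head": 1, "deprel": "case"},
    {"id": 3, "text": "उत्तर", "upos": "NOUN", "head": 4, "deprel": "obj"},
    {"id": 4, "text": "देना", "upos": "VERB"},
]


def format_dependencies(words):
    """Return one 'dependent<TAB>relation<TAB>head' line per attached word."""
    text_by_id = {w["id"]: w["text"] for w in words}
    lines = []
    for w in words:
        # Skip words whose attachment is unknown (e.g. truncated records).
        if "head" not in w or "deprel" not in w:
            continue
        # A head of 0 marks the root; unknown head ids fall back to "?".
        head_text = "ROOT" if w["head"] == 0 else text_by_id.get(w["head"], "?")
        lines.append(f'{w["text"]}\t{w["deprel"]}\t{head_text}')
    return lines


dep_lines = format_dependencies(words)
print("\n".join(dep_lines))
```

Looking heads up through an `id`-to-text dictionary avoids depending on list positions, which keeps the helper correct even if some words are missing from the input.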

#Biomedical & Clinical Models

Stanza also provides packages that support syntactic analysis and named entity recognition (NER) on both English biomedical literature and clinical note text. Offered packages include:

* 2 biomedical syntactic analysis pipelines, trained with human-annotated treebanks
* 1 clinical syntactic analysis pipeline, trained with silver data
* 8 biomedical NER models augmented with contextualized representations
* 2 clinical NER models, including one specialized in radiology reports

A list of all available biomedical packages, with their performance, is available in the Stanza documentation.

The Stanza biomedical models can be used in the same way as the normal NLP models.

The example below shows the code for downloading and initializing the i2b2 clinical NER model and annotating the various entities in a clinical note text.

In [None]:
stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})

doc = nlp('The patient had a dry cough and fever, they were treated with Paracetamol and saw major improvements in the next few weeks .')

# print out the entities
for ent in doc.entities:
    print(f'{ent.text}\t{ent.type}') 

2022-04-10 06:51:22 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| tokenize        | mimic   |
| pos             | mimic   |
| lemma           | mimic   |
| depparse        | mimic   |
| ner             | i2b2    |
| pretrain        | mimic   |
| forward_charlm  | mimic   |
| backward_charlm | mimic   |



2022-04-10 06:51:37 INFO: Finished downloading models and saved to /root/stanza_resources.
2022-04-10 06:51:37 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | mimic   |
| pos       | mimic   |
| lemma     | mimic   |
| depparse  | mimic   |
| ner       | i2b2    |

2022-04-10 06:51:37 INFO: Use device: cpu
2022-04-10 06:51:37 INFO: Loading: tokenize
2022-04-10 06:51:37 INFO: Loading: pos
2022-04-10 06:51:38 INFO: Loading: lemma
2022-04-10 06:51:38 INFO: Loading: depparse
2022-04-10 06:51:38 INFO: Loading: ner
2022-04-10 06:51:39 INFO: Done loading processors!


a dry cough	PROBLEM
fever	PROBLEM
Paracetamol	TREATMENT
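
For downstream use, it is often more convenient to collect the recognized entities by type than to print them one per line. The following is a small sketch under stated assumptions, not part of the original notebook: it hard-codes the three (text, type) pairs from the output above so that it runs without the clinical models, and the names `entities` and `group_by_type` are hypothetical. With the pipeline loaded, the same pairs could be built as `[(ent.text, ent.type) for ent in doc.entities]`.

```python
from collections import defaultdict

# (text, type) pairs copied by hand from the output above.
entities = [
    ("a dry cough", "PROBLEM"),
    ("fever", "PROBLEM"),
    ("Paracetamol", "TREATMENT"),
]


def group_by_type(entities):
    """Group entity texts by their type, keeping the original order."""
    grouped = defaultdict(list)
    for text, ent_type in entities:
        grouped[ent_type].append(text)
    return dict(grouped)


grouped = group_by_type(entities)
for ent_type, texts in grouped.items():
    print(f"{ent_type}: {', '.join(texts)}")
```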
