# Stanza

Stanza is a collection of accurate and efficient tools for analyzing many human languages in one place. From raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to the languages of your choosing.
Stanza is a Python natural language analysis package. Its tools can be chained in a pipeline to convert a string of human-language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, produce a syntactic dependency parse, and recognize named entities. The toolkit is designed to work in parallel across more than 60 languages, using the Universal Dependencies formalism.

Its main features are: a native Python implementation that requires minimal effort to set up; a full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition; pretrained neural models supporting 66 human languages; and a stable, officially maintained Python interface to CoreNLP.

## Installing Packages

In [2]:
pip install stanza

Collecting stanza
  Downloading stanza-1.2-py3-none-any.whl (282 kB)
Collecting torch>=1.3.0
  Downloading torch-1.8.1-cp38-cp38-win_amd64.whl (190.5 MB)
Installing collected packages: torch, stanza
Successfully installed stanza-1.2 torch-1.8.1
Note: you may need to restart the kernel to use updated packages.


## Importing Packages

In [3]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 16.1MB/s]
2021-03-29 13:20:21 INFO: Downloading default packages for language: en (English)...
Downloading http://nlp.stanford.edu/software/stanza/1.2.0/en/default.zip: 100%|█████| 411M/411M [01:52<00:00, 3.65MB/s]
2021-03-29 13:22:25 INFO: Finished downloading models and saved to C:\Users\piyush.pathak\stanza_resources.


## Implementation

In [4]:
nlp = stanza.Pipeline('en')

2021-03-29 13:22:25 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| depparse  | combined  |
| sentiment | sstplus   |
| ner       | ontonotes |

2021-03-29 13:22:25 INFO: Use device: cpu
2021-03-29 13:22:25 INFO: Loading: tokenize
2021-03-29 13:22:25 INFO: Loading: pos
2021-03-29 13:22:26 INFO: Loading: lemma
2021-03-29 13:22:26 INFO: Loading: depparse
2021-03-29 13:22:26 INFO: Loading: sentiment
2021-03-29 13:22:27 INFO: Loading: ner
2021-03-29 13:22:28 INFO: Done loading processors!


In [6]:
doc = nlp("Akshay is teaching Stanza library.")

print(doc)

[
  [
    {
      "id": 1,
      "text": "Akshay",
      "lemma": "Akshay",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "head": 3,
      "deprel": "nsubj",
      "misc": "start_char=0|end_char=6",
      "ner": "S-PERSON"
    },
    {
      "id": 2,
      "text": "is",
      "lemma": "be",
      "upos": "AUX",
      "xpos": "VBZ",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "head": 3,
      "deprel": "aux",
      "misc": "start_char=7|end_char=9",
      "ner": "O"
    },
    {
      "id": 3,
      "text": "teaching",
      "lemma": "teach",
      "upos": "VERB",
      "xpos": "VBG",
      "feats": "Tense=Pres|VerbForm=Part",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=10|end_char=18",
      "ner": "O"
    },
    {
      "id": 4,
      "text": "Stanza",
      "lemma": "Stanza",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "head": 5,
      "deprel": "compound",
 

In [7]:
doc.entities

[{
   "text": "Akshay",
   "type": "PERSON",
   "start_char": 0,
   "end_char": 6
 },
 {
   "text": "Stanza",
   "type": "ORG",
   "start_char": 19,
   "end_char": 25
 }]
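
The `start_char` and `end_char` values in the entity output are character offsets into the input string, with the end offset exclusive. A minimal sketch of how they can be used, working on plain dictionaries copied from the output above rather than on live Stanza objects (the `text` and `entities` variables and the `span_text` helper are illustrative stand-ins, not part of Stanza):

```python
# Input sentence and entity spans copied from the notebook output above.
text = "Akshay is teaching Stanza library."
entities = [
    {"text": "Akshay", "type": "PERSON", "start_char": 0, "end_char": 6},
    {"text": "Stanza", "type": "ORG", "start_char": 19, "end_char": 25},
]


def span_text(source, entity):
    """Slice the source string with the entity's half-open [start, end) offsets."""
    return source[entity["start_char"]:entity["end_char"]]


for ent in entities:
    # Each slice should reproduce the entity's surface text.
    print(ent["type"], repr(span_text(text, ent)))
```

Slicing with the offsets recovers exactly the entity text, which makes them convenient for highlighting entities in the original string.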

In [8]:
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.xpos)

Akshay Akshay NNP
is be VBZ
teaching teach VBG
Stanza Stanza NNP
library library NN
. . .
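
In the printed document, each word's `head` is the 1-based `id` of its governing word in the same sentence, and a `head` of 0 marks the root. The sketch below resolves those ids into readable triples. It uses plain dictionaries holding only the fields visible in the output above (the printed document is cut off after word 4, so for later words only the text from the word listing is used), and `dependency_triples` is a hypothetical helper, not a Stanza function.

```python
# Word texts for the sentence, in order (word ids are 1-based).
texts = ["Akshay", "is", "teaching", "Stanza", "library", "."]

# head/deprel values copied from the printed document for words 1-4.
words = [
    {"id": 1, "head": 3, "deprel": "nsubj"},
    {"id": 2, "head": 3, "deprel": "aux"},
    {"id": 3, "head": 0, "deprel": "root"},
    {"id": 4, "head": 5, "deprel": "compound"},
]


def dependency_triples(words, texts):
    """Return (dependent, relation, governor) triples; head 0 maps to 'ROOT'."""
    triples = []
    for w in words:
        governor = "ROOT" if w["head"] == 0 else texts[w["head"] - 1]
        triples.append((texts[w["id"] - 1], w["deprel"], governor))
    return triples


for dep, rel, gov in dependency_triples(words, texts):
    print(f"{dep} --{rel}--> {gov}")
```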


In [9]:
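# Select a specific package for each processor instead of the language default.
# Note: 'gsd' and 'hdt' appear to be German package names; the log below lists
# no tokenize or pos package among the English downloads.
processor_dict = {
    'tokenize': 'gsd',
    'pos': 'hdt',
    'ner': 'conll03',
    'lemma': 'default'
}
stanza.download('en', processors=processor_dict, package=None)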

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 8.38MB/s]
2021-03-29 13:22:29 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package  |
------------------------------
| lemma           | combined |
| ner             | conll03  |
| forward_charlm  | 1billion |
| backward_charlm | 1billion |

2021-03-29 13:22:29 INFO: File exists: C:\Users\piyush.pathak\stanza_resources\en\lemma\combined.pt.
Downloading http://nlp.stanford.edu/software/stanza/1.2.0/en/ner/conll03.pt: 100%|█| 80.8M/80.8M [00:56<00:00, 1.43MB/s
2021-03-29 13:23:28 INFO: File exists: C:\Users\piyush.pathak\stanza_resources\en\forward_charlm\1billion.pt.
2021-03-29 13:23:28 INFO: File exists: C:\Users\piyush.pathak\stanza_resources\en\backward_charlm\1billion.pt.
2021-03-29 13:23:28 INFO: Finished downloading models and saved to C:\Users\piyush.pathak\stanza_resources.


## Tokenization and Sentence Segmentation

In [10]:
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('Akshay is teaching Stanza. Stanza is the next revolution')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

2021-03-29 13:24:32 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-03-29 13:24:32 INFO: Use device: cpu
2021-03-29 13:24:32 INFO: Loading: tokenize
2021-03-29 13:24:32 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: Akshay
id: (2,)	text: is
id: (3,)	text: teaching
id: (4,)	text: Stanza
id: (5,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: Stanza
id: (2,)	text: is
id: (3,)	text: the
id: (4,)	text: next
id: (5,)	text: revolution
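
Each token `id` prints as a tuple such as `(1,)` because a token can cover more than one syntactic word: a multi-word token carries the ids of its first and last word, while an ordinary token carries a single id. The ids restart at 1 for every sentence, which is why the numbering resets in the output above. A small sketch, assuming only that ids arrive as tuples of integers (`format_token_id` is a hypothetical helper, not part of Stanza), that renders them in the range style used by CoNLL-U files:

```python
def format_token_id(token_id):
    """Render a token id tuple as a string: (4,) -> '4', (1, 2) -> '1-2'."""
    return "-".join(str(i) for i in token_id)


print(format_token_id((4,)))    # ordinary single-word token
print(format_token_id((1, 2)))  # multi-word token spanning words 1 and 2
```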


## Parts of Speech

In [11]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos')
doc = nlp('Barack Obama was born in Hawaii.')
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

2021-03-29 13:25:00 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |

2021-03-29 13:25:00 INFO: Use device: cpu
2021-03-29 13:25:00 INFO: Loading: tokenize
2021-03-29 13:25:00 INFO: Loading: pos
2021-03-29 13:25:00 INFO: Done loading processors!


word: Barack	upos: PROPN	xpos: NNP	feats: Number=Sing
word: Obama	upos: PROPN	xpos: NNP	feats: Number=Sing
word: was	upos: AUX	xpos: VBD	feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: born	upos: VERB	xpos: VBN	feats: Tense=Past|VerbForm=Part|Voice=Pass
word: in	upos: ADP	xpos: IN	feats: _
word: Hawaii	upos: PROPN	xpos: NNP	feats: Number=Sing
word: .	upos: PUNCT	xpos: .	feats: _
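
The `feats` column packs morphological features into one string of `Name=Value` pairs separated by `|`. When individual features are needed, it can help to split that string into a dictionary. The sketch below does this on strings copied from the output above; `parse_feats` is an illustrative helper, not a Stanza function.

```python
def parse_feats(feats):
    """Split a feature string into a dict; '_' or None means no features."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature string for "was", copied from the output above.
was = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(was["Tense"], was["VerbForm"])

# Words without features (shown as "_") yield an empty dict.
print(parse_feats("_"))
```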


## Sentiment Analysis using Stanza

In [12]:
nlp = stanza.Pipeline('en', processors='tokenize,sentiment')


2021-03-29 13:25:24 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| sentiment | sstplus  |

2021-03-29 13:25:24 INFO: Use device: cpu
2021-03-29 13:25:24 INFO: Loading: tokenize
2021-03-29 13:25:24 INFO: Loading: sentiment
2021-03-29 13:25:25 INFO: Done loading processors!


In [13]:
doc = nlp('Ram is a bad boy')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

0 0


In [14]:
doc = nlp('Ram is a good boy')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

0 2


In [15]:
doc = nlp('Ram is a boy')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

0 1
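
Stanza reports sentence sentiment as an integer: 0 for negative, 1 for neutral, and 2 for positive, which matches the three results above. A small sketch that maps the scores to labels (the `SENTIMENT_LABELS` table and `label_sentiment` helper are illustrative names, not Stanza API):

```python
# Meaning of the integer returned in sentence.sentiment.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def label_sentiment(score):
    """Translate a sentiment score into a human-readable label."""
    return SENTIMENT_LABELS.get(score, "unknown")


# Sentences and scores copied from the three cells above.
results = [("Ram is a bad boy", 0), ("Ram is a good boy", 2), ("Ram is a boy", 1)]
for sentence, score in results:
    print(f"{sentence!r}: {label_sentiment(score)}")
```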


## Lemmatization

In [16]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
doc = nlp('Akshay is teaching Stanza.')
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

2021-03-29 13:26:36 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

2021-03-29 13:26:36 INFO: Use device: cpu
2021-03-29 13:26:36 INFO: Loading: tokenize
2021-03-29 13:26:36 INFO: Loading: pos
2021-03-29 13:26:36 INFO: Loading: lemma
2021-03-29 13:26:36 INFO: Done loading processors!


word: Akshay 	lemma: Akshay
word: is 	lemma: be
word: teaching 	lemma: teach
word: Stanza 	lemma: Stanza
word: . 	lemma: .
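
In this output the lemmatizer leaves the proper nouns and the punctuation unchanged and reduces the inflected verb forms to their base forms. A quick way to see which words were actually altered is to compare each surface form with its lemma. The sketch below does so on pairs copied from the output above (the `pairs` list is a stand-in for iterating over `sent.words`):

```python
# (word, lemma) pairs copied from the lemmatization output above.
pairs = [
    ("Akshay", "Akshay"),
    ("is", "be"),
    ("teaching", "teach"),
    ("Stanza", "Stanza"),
    (".", "."),
]

# Keep only the words whose lemma differs from the surface form.
changed = [(text, lemma) for text, lemma in pairs if text != lemma]
print(changed)
```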
