
In [1]:
!pip install stanza

Collecting stanza
  Downloading https://files.pythonhosted.org/packages/e7/8b/3a9e7a8d8cb14ad6afffc3983b7a7322a3a24d94ebc978a70746fcffc085/stanza-1.1.1-py3-none-any.whl (227kB)

# Import
```
import stanza
```



# Set Pipeline
1) Language setting
```
stanza.download('de')
```
  * Korean: ko
  * English: en
  * German: de

2) processors
```
nlp = stanza.Pipeline(lang = 'de', processors = 'tokenize,mwt,pos')
```
  * **tokenize**: TokenizeProcessor
    * *Tokenizes* the text and performs *sentence segmentation*
  * **mwt**: MWTProcessor
    * Requirement: tokenize
    * Expands *multi-word tokens* predicted by TokenizeProcessor
    * Only applicable to some languages, e.g. German and French
  * **pos**: POSProcessor
    * Requirement: tokenize, mwt
    * Labels tokens with...
      * UPOS: universal POS
      * XPOS: treebank-specific POS
      * UFeats: universal morphological features
  * **lemma**: LemmaProcessor
    * Requirement: tokenize, mwt, pos
    * Generates the *word lemmas* for all words in the Document.
  * **depparse**: DepparseProcessor
    * Requirement: tokenize, mwt, pos, lemma
    * Provides an accurate *syntactic dependency parsing* analysis
  * **ner**: NERProcessor
    * Requirement: tokenize, mwt
    * Recognizes *named entities* for all token spans in the corpus.
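
The requirement lists above imply an ordering among processors. As a minimal sketch, the following expands a set of requested processors into the comma-separated string that could be passed as the `processors` argument. The `REQUIRES` table, the `ORDER` list, and the `with_requirements` helper are hypothetical names written for this tutorial and are not part of Stanza.

```python
# Hypothetical helper, not part of Stanza: the table is copied from the
# "Requirement" entries listed above.
REQUIRES = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
    "ner": ["tokenize", "mwt"],
}

# Order in which the processors appear in the list above.
ORDER = ["tokenize", "mwt", "pos", "lemma", "depparse", "ner"]


def with_requirements(requested):
    """Return the requested processors plus their prerequisites, in order."""
    needed = set()
    for name in requested:
        needed.add(name)
        needed.update(REQUIRES[name])
    return ",".join(name for name in ORDER if name in needed)


print(with_requirements(["lemma"]))  # tokenize,mwt,pos,lemma
print(with_requirements(["ner"]))    # tokenize,mwt,ner
```

The resulting string only reflects the requirements stated in this tutorial; whether a given processor is available still depends on the language.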

In [2]:
import stanza

In [3]:
stanza.download('de')
nlp = stanza.Pipeline('de') # if no processors are specified, all default processors are used

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 18.9MB/s]                    
2021-01-15 13:14:44 INFO: Downloading default packages for language: de (German)...
Downloading http://nlp.stanford.edu/software/stanza/1.1.0/de/default.zip: 100%|██████████| 607M/607M [02:34<00:00, 3.92MB/s]
2021-01-15 13:17:29 INFO: Finished downloading models and saved to /root/stanza_resources.
2021-01-15 13:17:29 INFO: Loading these models for language: de (German):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |
| sentiment | sb10k   |
| ner       | conll03 |

2021-01-15 13:17:29 INFO: Use device: cpu
2021-01-15 13:17:29 INFO: Loading: tokenize
2021-01-15 13:17:29 INFO: Loading: mwt
2021-01-15 13:17:29 INFO: Loading: pos
2021-01-15 13:17:30 INFO: Loading: lemma
2021-01-15 13:17:30 INFO: Loading: depparse
2021-01-15 1

# Data Object 1: Document
* An entire document
* A collection of ```Sentence```s and entities (represented as ```Span```s)
* Define a document:
```
doc = nlp("A document object holds the annotation of an entire document")
```
* Sentence sources:
1. https://www.dw.com/de/faktencheck-wie-verl%C3%A4sslich-ist-wikipedia/a-56212126
2. https://www.dw.com/de/david-bowie-todestag/a-56167786

## Properties
  * **text**: The raw *text(string)* of the document
  * **sentences**: The *list* of sentences in the document
  * **entities (ents)**: The *list* of entities in the document
  * **num_tokens**: The total *number* of tokens in the document
  * **num_words**: The total *number* of words in the document


In [4]:
doc = nlp("""Im Unterschied zu den neuartigen mRNA-Impfstoffen von BioNTech und Moderna wird für das Präparat von Valneva eine klassische Technologie mit inaktiven Viren verwendet, wie sie bei den meisten Influenza-Präparaten zum Einsatz kommt. Es wird derzeit in klinischen Studien in Europa getestet.""")

In [5]:
doc.text

'Im Unterschied zu den neuartigen mRNA-Impfstoffen von BioNTech und Moderna wird für das Präparat von Valneva eine klassische Technologie mit inaktiven Viren verwendet, wie sie bei den meisten Influenza-Präparaten zum Einsatz kommt. Es wird derzeit in klinischen Studien in Europa getestet.'

In [6]:
doc.sentences

[[
   {
     "id": [
       1,
       2
     ],
     "text": "Im",
     "ner": "O",
     "misc": "start_char=0|end_char=2"
   },
   {
     "id": 1,
     "text": "In",
     "lemma": "in",
     "upos": "ADP",
     "xpos": "APPR",
     "head": 3,
     "deprel": "case"
   },
   {
     "id": 2,
     "text": "dem",
     "lemma": "der",
     "upos": "DET",
     "xpos": "ART",
     "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
     "head": 3,
     "deprel": "det"
   },
   {
     "id": 3,
     "text": "Unterschied",
     "lemma": "Unterschied",
     "upos": "NOUN",
     "xpos": "NN",
     "feats": "Case=Dat|Gender=Masc|Number=Sing",
     "head": 26,
     "deprel": "obl",
     "misc": "start_char=3|end_char=14",
     "ner": "O"
   },
   {
     "id": 4,
     "text": "zu",
     "lemma": "zu",
     "upos": "ADP",
     "xpos": "APPR",
     "head": 7,
     "deprel": "case",
     "misc": "start_char=15|end_char=17",
     "ner": "O"
   },
   {
     "id": 5,
     "text": "den",


In [7]:
doc.entities

[{
   "text": "Moderna",
   "type": "MISC",
   "start_char": 67,
   "end_char": 74
 }, {
   "text": "Valneva",
   "type": "PER",
   "start_char": 101,
   "end_char": 108
 }, {
   "text": "Europa",
   "type": "LOC",
   "start_char": 273,
   "end_char": 279
 }]

In [8]:
doc.num_tokens

48

In [9]:
doc.num_words

50
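
`num_tokens` (48) and `num_words` (50) differ because a multi-word token counts once as a token but expands to several words. The difference of two is consistent with the two contractions in the text (Im, zum). A minimal sketch of that relationship follows. It uses hand-written token ids in the tuple format reported by `Token.id` rather than real pipeline output, and `count_words` is a hypothetical helper.

```python
def count_words(token_ids):
    """Count syntactic words given Token.id-style tuples.

    A one-element id such as (3,) covers one word; a two-element id such
    as (1, 2) is a multi-word token covering words 1 through 2.
    """
    total = 0
    for token_id in token_ids:
        first, last = token_id[0], token_id[-1]
        total += last - first + 1
    return total


# Hand-written ids for "Im Unterschied zu": "Im" is the multi-word token.
token_ids = [(1, 2), (3,), (4,)]
print(len(token_ids))          # number of tokens: 3
print(count_words(token_ids))  # number of words: 4
```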

# Data Object 2: Sentence
* A sentence as is segmented by the TokenizeProcessor
* A sentence contains a list of its ```Token```s, a list of all its ```Word```s, and a list of the entities in the sentence (```Span```s).
## Properties
  * **text**: The raw *text(string)* for the sentence
  * **dependencies**: The *list* of dependencies for the sentence, where each item contains head ```Word``` of the dependency relation, the type of dependency relation, and the dependent ```Word``` in that relation
  * **tokens**: The *list* of tokens in the sentence
  * **words**: The *list* of words in the sentence
  * **entities(ents)**: The *list* of entities in the sentence
  * **sentiment**: The sentiment value for the sentence, as a *string*. Available only for English, German, and Chinese. (0=negative, 1=neutral, 2=positive)
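
The numeric sentiment codes are easier to read once mapped to labels. The sketch below is a hypothetical helper (`sentiment_label` is not a Stanza function). It accepts either an integer or a numeric string, since the value may arrive in either form.

```python
# Label mapping taken from the property description above.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def sentiment_label(value):
    """Map a sentiment value (int or numeric string) to a readable label."""
    return SENTIMENT_LABELS[int(value)]


print(sentiment_label(1))    # neutral
print(sentiment_label("2"))  # positive
```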

In [10]:
sent = doc.sentences[0]

In [11]:
sent.text

'Im Unterschied zu den neuartigen mRNA-Impfstoffen von BioNTech und Moderna wird für das Präparat von Valneva eine klassische Technologie mit inaktiven Viren verwendet, wie sie bei den meisten Influenza-Präparaten zum Einsatz kommt.'

In [12]:
sent.dependencies

[({
    "id": 3,
    "text": "Unterschied",
    "lemma": "Unterschied",
    "upos": "NOUN",
    "xpos": "NN",
    "feats": "Case=Dat|Gender=Masc|Number=Sing",
    "head": 26,
    "deprel": "obl",
    "misc": "start_char=3|end_char=14"
  }, 'case', {
    "id": 1,
    "text": "In",
    "lemma": "in",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 3,
    "deprel": "case"
  }), ({
    "id": 3,
    "text": "Unterschied",
    "lemma": "Unterschied",
    "upos": "NOUN",
    "xpos": "NN",
    "feats": "Case=Dat|Gender=Masc|Number=Sing",
    "head": 26,
    "deprel": "obl",
    "misc": "start_char=3|end_char=14"
  }, 'det', {
    "id": 2,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 3,
    "deprel": "det"
  }), ({
    "id": 26,
    "text": "verwendet",
    "lemma": "verwenden",
    "upos": "VERB",
    "xpos": "VVPP",
    "feats": "VerbForm=Part",
    "head": 0,
    "deprel"

In [13]:
sent.dependencies[0] # (head Word, relation string, dependent Word)

({
   "id": 3,
   "text": "Unterschied",
   "lemma": "Unterschied",
   "upos": "NOUN",
   "xpos": "NN",
   "feats": "Case=Dat|Gender=Masc|Number=Sing",
   "head": 26,
   "deprel": "obl",
   "misc": "start_char=3|end_char=14"
 }, 'case', {
   "id": 1,
   "text": "In",
   "lemma": "in",
   "upos": "ADP",
   "xpos": "APPR",
   "head": 3,
   "deprel": "case"
 })

In [14]:
sent.tokens

[[
   {
     "id": [
       1,
       2
     ],
     "text": "Im",
     "ner": "O",
     "misc": "start_char=0|end_char=2"
   },
   {
     "id": 1,
     "text": "In",
     "lemma": "in",
     "upos": "ADP",
     "xpos": "APPR",
     "head": 3,
     "deprel": "case"
   },
   {
     "id": 2,
     "text": "dem",
     "lemma": "der",
     "upos": "DET",
     "xpos": "ART",
     "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
     "head": 3,
     "deprel": "det"
   }
 ], [
   {
     "id": 3,
     "text": "Unterschied",
     "lemma": "Unterschied",
     "upos": "NOUN",
     "xpos": "NN",
     "feats": "Case=Dat|Gender=Masc|Number=Sing",
     "head": 26,
     "deprel": "obl",
     "misc": "start_char=3|end_char=14",
     "ner": "O"
   }
 ], [
   {
     "id": 4,
     "text": "zu",
     "lemma": "zu",
     "upos": "ADP",
     "xpos": "APPR",
     "head": 7,
     "deprel": "case",
     "misc": "start_char=15|end_char=17",
     "ner": "O"
   }
 ], [
   {
     "id": 5,
     

In [15]:
sent.tokens[0]

[
  {
    "id": [
      1,
      2
    ],
    "text": "Im",
    "ner": "O",
    "misc": "start_char=0|end_char=2"
  },
  {
    "id": 1,
    "text": "In",
    "lemma": "in",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 3,
    "deprel": "case"
  },
  {
    "id": 2,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 3,
    "deprel": "det"
  }
]

In [16]:
sent.tokens[1]

[
  {
    "id": 3,
    "text": "Unterschied",
    "lemma": "Unterschied",
    "upos": "NOUN",
    "xpos": "NN",
    "feats": "Case=Dat|Gender=Masc|Number=Sing",
    "head": 26,
    "deprel": "obl",
    "misc": "start_char=3|end_char=14",
    "ner": "O"
  }
]

In [17]:
sent.words

[{
   "id": 1,
   "text": "In",
   "lemma": "in",
   "upos": "ADP",
   "xpos": "APPR",
   "head": 3,
   "deprel": "case"
 }, {
   "id": 2,
   "text": "dem",
   "lemma": "der",
   "upos": "DET",
   "xpos": "ART",
   "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
   "head": 3,
   "deprel": "det"
 }, {
   "id": 3,
   "text": "Unterschied",
   "lemma": "Unterschied",
   "upos": "NOUN",
   "xpos": "NN",
   "feats": "Case=Dat|Gender=Masc|Number=Sing",
   "head": 26,
   "deprel": "obl",
   "misc": "start_char=3|end_char=14"
 }, {
   "id": 4,
   "text": "zu",
   "lemma": "zu",
   "upos": "ADP",
   "xpos": "APPR",
   "head": 7,
   "deprel": "case",
   "misc": "start_char=15|end_char=17"
 }, {
   "id": 5,
   "text": "den",
   "lemma": "der",
   "upos": "DET",
   "xpos": "ART",
   "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Plur|PronType=Art",
   "head": 7,
   "deprel": "det",
   "misc": "start_char=18|end_char=21"
 }, {
   "id": 6,
   "text": "neuartigen",
   "lemm

In [18]:
sent.words[0]

{
  "id": 1,
  "text": "In",
  "lemma": "in",
  "upos": "ADP",
  "xpos": "APPR",
  "head": 3,
  "deprel": "case"
}

In [19]:
sent.words[1]

{
  "id": 2,
  "text": "dem",
  "lemma": "der",
  "upos": "DET",
  "xpos": "ART",
  "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
  "head": 3,
  "deprel": "det"
}

In [20]:
sent.entities

[{
   "text": "Moderna",
   "type": "MISC",
   "start_char": 67,
   "end_char": 74
 }, {
   "text": "Valneva",
   "type": "PER",
   "start_char": 101,
   "end_char": 108
 }]

In [21]:
sent.sentiment

1

# Data Object 3: Token
* A token, and a list of underlying syntactic ```Word```s.
* A multi-word token has a range ```id``` covering the words it expands to (e.g. zum = zu + dem).
## Properties
  * **id**: The index of the token in the sentence (*tuple[int]*).
    * Multi-Word Token (MWT): The index contains two elements (e.g., ```(1, 2)```)
    * One word token: The index contains a single element (e.g. ```(1, )```)
  * **text**: The *text(string)* of the token.
  * **misc**: Miscellaneous annotations with regard to the token. Used to store whether a token is a multi-word token
  * **words**: The *list* of syntactic words underlying the token.
  * **start_char**: The start character index(*int*) for the token in the raw text of the document. Useful if you want to detokenize or apply annotations back to the raw text
  * **end_char**: The end character index(*int*) for the token in the raw text of the document.  
  * **ner**: The NER tag of the token, in BIOES format
    * BIOES format
      * S: single-token entity
      * B: beginning of a multi-token entity
      * I: inside a multi-token entity
      * E: end of a multi-token entity
      * O: outside any entity
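
To make the BIOES scheme concrete, the sketch below groups per-token tags back into entity spans. It is a simplified illustration, not Stanza's own decoding: `bioes_to_spans` is a hypothetical helper, entity text is joined with single spaces, and the sample tuples are written by hand from the "Ziggy Stardust" example used later in this tutorial (the offsets of the quotation marks are derived by counting characters).

```python
def bioes_to_spans(tagged_tokens):
    """Group (text, tag, start_char, end_char) tuples into entity spans."""
    spans = []
    current = None  # entity being built: [texts, type, start_char]
    for text, tag, start, end in tagged_tokens:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "S":
            # A single-token entity is complete on its own.
            spans.append({"text": text, "type": entity_type,
                          "start_char": start, "end_char": end})
            current = None
        elif prefix == "B":
            current = [[text], entity_type, start]
        elif prefix in ("I", "E") and current is not None:
            current[0].append(text)
            if prefix == "E":
                spans.append({"text": " ".join(current[0]), "type": current[1],
                              "start_char": current[2], "end_char": end})
                current = None
    return spans


tokens = [
    ('"', "O", 48, 49),
    ("Ziggy", "B-MISC", 49, 54),
    ("Stardust", "E-MISC", 55, 63),
    ('"', "O", 63, 64),
]
print(bioes_to_spans(tokens))
```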

In [22]:
token1 = sent.tokens[0] #mwt
token1

[
  {
    "id": [
      1,
      2
    ],
    "text": "Im",
    "ner": "O",
    "misc": "start_char=0|end_char=2"
  },
  {
    "id": 1,
    "text": "In",
    "lemma": "in",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 3,
    "deprel": "case"
  },
  {
    "id": 2,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 3,
    "deprel": "det"
  }
]

In [23]:
token2 = sent.tokens[1]
token2

[
  {
    "id": 3,
    "text": "Unterschied",
    "lemma": "Unterschied",
    "upos": "NOUN",
    "xpos": "NN",
    "feats": "Case=Dat|Gender=Masc|Number=Sing",
    "head": 26,
    "deprel": "obl",
    "misc": "start_char=3|end_char=14",
    "ner": "O"
  }
]

In [24]:
token3 = sent.tokens[11] #NER
token3

[
  {
    "id": 13,
    "text": "Moderna",
    "lemma": "Moderna",
    "upos": "PROPN",
    "xpos": "NE",
    "feats": "Case=Dat|Gender=Neut|Number=Sing",
    "head": 11,
    "deprel": "conj",
    "misc": "start_char=67|end_char=74",
    "ner": "S-MISC"
  }
]

In [25]:
#property 1: id
print(token1.id)
print(token2.id)
print(token3.id)

(1, 2)
(3,)
(13,)


In [26]:
print(token1.text)
print(token2.text)
print(token3.text)

Im
Unterschied
Moderna


In [27]:
print(token1.misc)
print(token2.misc)
print(token3.misc)

start_char=0|end_char=2
start_char=3|end_char=14
start_char=67|end_char=74


In [28]:
print(token1.words)
print(token2.words)
print(token3.words)

[{
  "id": 1,
  "text": "In",
  "lemma": "in",
  "upos": "ADP",
  "xpos": "APPR",
  "head": 3,
  "deprel": "case"
}, {
  "id": 2,
  "text": "dem",
  "lemma": "der",
  "upos": "DET",
  "xpos": "ART",
  "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
  "head": 3,
  "deprel": "det"
}]
[{
  "id": 3,
  "text": "Unterschied",
  "lemma": "Unterschied",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Case=Dat|Gender=Masc|Number=Sing",
  "head": 26,
  "deprel": "obl",
  "misc": "start_char=3|end_char=14"
}]
[{
  "id": 13,
  "text": "Moderna",
  "lemma": "Moderna",
  "upos": "PROPN",
  "xpos": "NE",
  "feats": "Case=Dat|Gender=Neut|Number=Sing",
  "head": 11,
  "deprel": "conj",
  "misc": "start_char=67|end_char=74"
}]


In [29]:
print(token1.start_char)
print(token2.start_char)
print(token3.start_char)

0
3
67


In [30]:
print(token1.end_char)
print(token2.end_char)
print(token3.end_char)

2
14
74


In [31]:
print(token1.ner)
print(token2.ner)
print(token3.ner)

O
O
S-MISC


# Data Object 4: Word
* A syntactic word
* Words are generated as a result of applying ```MWTProcessor```
* The unit on which tagging, lemmatization, and parsing are based
## Properties
  * **id**: The index(*int*) of the word in the sentence. 
  * **text**: The text(*string*) of the word.
  * **lemma**: The lemma of the word
  * **upos(pos)** : The universal part-of-speech of the word
  * **xpos**: The treebank-specific part-of-speech of the word
  * **feats**: The morphological features of the word.
  * **head**: The id(*int*) of the syntactic head of the word in the sentence. 
  * **deprel**: The dependency relation(*string*) between the word and its syntactic head
  * **deps**: The combination of head and deprel that captures all syntactic dependency information
  * **misc**: Miscellaneous annotations with regard to the word. 
  * **parent**: A "back pointer" to the parent token that the word is part of. In the case of a multi-word token, a token can be the parent of multiple words. 
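
The `feats` value is a single string of `Name=Value` pairs separated by `|`, or `None` when a word has no features. A small sketch of turning it into a dictionary follows; `parse_feats` is a hypothetical helper, and the input string is copied from the output for "dem" shown in this tutorial.

```python
def parse_feats(feats):
    """Turn a UFeats string such as 'Case=Dat|Number=Sing' into a dict."""
    if feats is None:  # words without features report None
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


feats = parse_feats("Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art")
print(feats["Case"])      # Dat
print(parse_feats(None))  # {}
```

With a dictionary, a check such as `feats.get("Gender") == "Masc"` avoids substring matching against the raw string.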

In [32]:
word1 = sent.words[0]
word1

{
  "id": 1,
  "text": "In",
  "lemma": "in",
  "upos": "ADP",
  "xpos": "APPR",
  "head": 3,
  "deprel": "case"
}

In [33]:
word2 = sent.words[1]
word2

{
  "id": 2,
  "text": "dem",
  "lemma": "der",
  "upos": "DET",
  "xpos": "ART",
  "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
  "head": 3,
  "deprel": "det"
}

In [34]:
word3 = sent.words[2]
word3

{
  "id": 3,
  "text": "Unterschied",
  "lemma": "Unterschied",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Case=Dat|Gender=Masc|Number=Sing",
  "head": 26,
  "deprel": "obl",
  "misc": "start_char=3|end_char=14"
}

In [35]:
print(word1.id)
print(word2.id)
print(word3.id)

1
2
3


In [36]:
print(word1.text)
print(word2.text)
print(word3.text)

In
dem
Unterschied


In [37]:
print(word1.lemma)
print(word2.lemma)
print(word3.lemma)

in
der
Unterschied


In [38]:
print(word1.upos)
print(word2.upos)
print(word3.upos)

ADP
DET
NOUN


In [39]:
print(word1.xpos)
print(word2.xpos)
print(word3.xpos)

APPR
ART
NN


In [40]:
print(word1.feats)
print(word2.feats)
print(word3.feats)

None
Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art
Case=Dat|Gender=Masc|Number=Sing


In [41]:
print(word1.head)
print(word2.head)
print(word3.head)

3
3
26


In [42]:
print(word1.deprel)
print(word2.deprel)
print(word3.deprel)

case
det
obl


In [43]:
print(word1.deps)
print(word2.deps)
print(word3.deps)

None
None
None


In [44]:
for sent in doc.sentences:
  for word in sent.words :
    if word.deps is not None :
      print(word)
      print(word.deps)
      print()

In [45]:
print(word1.misc)
print(word2.misc)
print(word3.misc)

None
None
start_char=3|end_char=14


In [46]:
print(word1.parent)
print(word2.parent)
print(word3.parent)

[
  {
    "id": [
      1,
      2
    ],
    "text": "Im",
    "ner": "O",
    "misc": "start_char=0|end_char=2"
  },
  {
    "id": 1,
    "text": "In",
    "lemma": "in",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 3,
    "deprel": "case"
  },
  {
    "id": 2,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 3,
    "deprel": "det"
  }
]
[
  {
    "id": [
      1,
      2
    ],
    "text": "Im",
    "ner": "O",
    "misc": "start_char=0|end_char=2"
  },
  {
    "id": 1,
    "text": "In",
    "lemma": "in",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 3,
    "deprel": "case"
  },
  {
    "id": 2,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 3,
    "deprel": "det"
  }
]
[
  {
    "id": 3,
    "text": "Unterschied",
    "lemma": "Unterschied",


# Data Object 5: Span
* A range of objects (e.g. Named Entities) can be represented as a ```Span```.
## Properties
  * **text**: The *text(string)* of the span.
  * **tokens**: The list of tokens that correspond to the span.
  * **words**: The list of words that correspond to the span.
  * **type**: The entity type of the span.
  * **start_char**: The start character offset of the span in the document.
  * **end_char**: The end character offset of the span in the document.
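
Because `start_char` and `end_char` are offsets into the raw document text, a span can be recovered by slicing and annotations can be written back into the text. The sketch below illustrates this with a hand-written span dictionary that mirrors the "Ziggy Stardust" entity shown in this section. `mark_spans` and its bracket notation are hypothetical choices made for illustration.

```python
def mark_spans(text, spans):
    """Wrap each span in brackets, followed by its type in braces.

    Spans are applied from right to left so that earlier offsets stay
    valid while the string grows.
    """
    for span in sorted(spans, key=lambda s: s["start_char"], reverse=True):
        start, end = span["start_char"], span["end_char"]
        marked = "[" + text[start:end] + "]{" + span["type"] + "}"
        text = text[:start] + marked + text[end:]
    return text


text = 'Seine Musik und die seiner Alter Egos, darunter "Ziggy Stardust", bleibt jedoch zeitlos.'
spans = [{"text": "Ziggy Stardust", "type": "MISC", "start_char": 49, "end_char": 63}]

print(text[49:63])              # Ziggy Stardust
print(mark_spans(text, spans))
```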


In [47]:
doc = nlp("""Seine Musik und die seiner Alter Egos, darunter "Ziggy Stardust", bleibt jedoch zeitlos.""")

In [48]:
sent = doc.sentences[0]

In [49]:
sent.entities

[{
   "text": "Ziggy Stardust",
   "type": "MISC",
   "start_char": 49,
   "end_char": 63
 }]

In [50]:
ne = sent.entities[0]

In [51]:
ne.text

'Ziggy Stardust'

In [52]:
ne.tokens

[[
   {
     "id": 11,
     "text": "Ziggy",
     "lemma": "Ziggy",
     "upos": "PROPN",
     "xpos": "NE",
     "feats": "Case=Nom|Gender=Masc|Number=Sing",
     "head": 7,
     "deprel": "appos",
     "misc": "start_char=49|end_char=54",
     "ner": "B-MISC"
   }
 ], [
   {
     "id": 12,
     "text": "Stardust",
     "lemma": "Stardust",
     "upos": "PROPN",
     "xpos": "NE",
     "feats": "Case=Nom|Gender=Masc|Number=Sing",
     "head": 11,
     "deprel": "flat",
     "misc": "start_char=55|end_char=63",
     "ner": "E-MISC"
   }
 ]]

In [53]:
ne.words

[{
   "id": 11,
   "text": "Ziggy",
   "lemma": "Ziggy",
   "upos": "PROPN",
   "xpos": "NE",
   "feats": "Case=Nom|Gender=Masc|Number=Sing",
   "head": 7,
   "deprel": "appos",
   "misc": "start_char=49|end_char=54"
 }, {
   "id": 12,
   "text": "Stardust",
   "lemma": "Stardust",
   "upos": "PROPN",
   "xpos": "NE",
   "feats": "Case=Nom|Gender=Masc|Number=Sing",
   "head": 11,
   "deprel": "flat",
   "misc": "start_char=55|end_char=63"
 }]

In [54]:
ne.type

'MISC'

In [55]:
ne.start_char

49

In [56]:
ne.end_char

63

# Processor 1: Tokenization & Sentence Segmentation
* **name**: tokenize
* **Annotator class name**: TokenizeProcessor
* **Requirement**: - 
* **Description**: 
  * Segments a ```Document``` into ```Sentence```s, each containing a list of ```Token```s. 
  * Tokenizes the text and performs sentence segmentation
## Options
  * **tokenize_pretokenized** (default: False)
    * Assume the text is already tokenized by whitespace and split into sentences by newlines; no further tokenization is performed.
  * **tokenize_no_ssplit** (default: False)
    * Run tokenization only and disable sentence segmentation.
    * Sentences are assumed to be separated by two consecutive newlines (```\n\n```).
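
With `tokenize_pretokenized`, the input layout itself defines the tokens and sentences. The sketch below mimics that assumption in plain Python to show why punctuation stays attached to words in pretokenized mode. It is an illustration of the input convention, not Stanza's implementation, and `split_pretokenized` is a hypothetical helper.

```python
def split_pretokenized(text):
    """Split text the way tokenize_pretokenized assumes it is laid out:
    one sentence per line, tokens separated by whitespace."""
    return [line.split() for line in text.split("\n") if line.strip()]


text = ("An diesem 10. Januar jährt sich David Bowies Todestag zum fünften Mal.\n"
        'Seine Musik und die seiner Alter Egos, darunter "Ziggy Stardust", bleibt jedoch zeitlos.')
sentences = split_pretokenized(text)

print(len(sentences))   # 2
print(sentences[0][2])  # 10.
print(sentences[1][8])  # "Ziggy
```

Since no tokenizer runs in this mode, punctuation has to be separated by whitespace beforehand if it should become its own token.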


In [57]:
#tokenization and sentence segmentation - default
doc = nlp("""An diesem 10. Januar jährt sich David Bowies Todestag zum fünften Mal. Seine Musik und die seiner Alter Egos, darunter "Ziggy Stardust", bleibt jedoch zeitlos.""")
sentences = doc.sentences # list in which each element is one sentence
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} tokens ===')
  for token in sentence.tokens :
    print("id: ",token.id, end = "\t")
    print("token: " + token.text)

=== SENTENCE 1 tokens ===
id:  (1,)	token: An
id:  (2,)	token: diesem
id:  (3,)	token: 10
id:  (4,)	token: .
=== SENTENCE 2 tokens ===
id:  (1,)	token: Januar
id:  (2,)	token: jährt
id:  (3,)	token: sich
id:  (4,)	token: David
id:  (5,)	token: Bowies
id:  (6,)	token: Todestag
id:  (7, 8)	token: zum
id:  (9,)	token: fünften
id:  (10,)	token: Mal
id:  (11,)	token: .
=== SENTENCE 3 tokens ===
id:  (1,)	token: Seine
id:  (2,)	token: Musik
id:  (3,)	token: und
id:  (4,)	token: die
id:  (5,)	token: seiner
id:  (6,)	token: Alter
id:  (7,)	token: Egos
id:  (8,)	token: ,
id:  (9,)	token: darunter
id:  (10,)	token: "
id:  (11,)	token: Ziggy
id:  (12,)	token: Stardust
id:  (13,)	token: "
id:  (14,)	token: ,
id:  (15,)	token: bleibt
id:  (16,)	token: jedoch
id:  (17,)	token: zeitlos
id:  (18,)	token: .


In [58]:
#tokenization and sentence segmentation - tokenize_no_ssplit = True: no sentence segmentation is performed
nlp = stanza.Pipeline(lang = 'de', tokenize_no_ssplit = True)
doc = nlp("""An diesem 10. Januar jährt sich David Bowies Todestag zum fünften Mal.Seine Musik und die seiner Alter Egos, darunter "Ziggy Stardust", bleibt jedoch zeitlos.""")
sentences = doc.sentences # list in which each element is one sentence
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} tokens ===')
  for token in sentence.tokens :
    print("id: ",token.id, end = "\t")
    print("token: " + token.text)

=== SENTENCE 1 tokens ===
id:  (1,)	token: An
id:  (2,)	token: diesem
id:  (3,)	token: 10
id:  (4,)	token: .
id:  (5,)	token: Januar
id:  (6,)	token: jährt
id:  (7,)	token: sich
id:  (8,)	token: David
id:  (9,)	token: Bowies
id:  (10,)	token: Todestag
id:  (11, 12)	token: zum
id:  (13,)	token: fünften
id:  (14,)	token: Mal.Seine
id:  (15,)	token: Musik
id:  (16,)	token: und
id:  (17,)	token: die
id:  (18,)	token: seiner
id:  (19,)	token: Alter
id:  (20,)	token: Egos
id:  (21,)	token: ,
id:  (22,)	token: darunter
id:  (23,)	token: "
id:  (24,)	token: Ziggy
id:  (25,)	token: Stardust
id:  (26,)	token: "
id:  (27,)	token: ,
id:  (28,)	token: bleibt
id:  (29,)	token: jedoch
id:  (30,)	token: zeitlos
id:  (31,)	token: .


In [59]:
#tokenization and sentence segmentation - tokenize_pretokenized = True: sentences are split on \n
nlp = stanza.Pipeline(lang = 'de', tokenize_pretokenized = True)
doc = nlp("""An diesem 10. Januar jährt sich David Bowies Todestag zum fünften Mal.\nSeine Musik und die seiner Alter Egos, darunter "Ziggy Stardust", bleibt jedoch zeitlos.""")
sentences = doc.sentences # list in which each element is one sentence
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} tokens ===')
  for token in sentence.tokens :
    print("id: ",token.id, end = "\t")
    print("token: " + token.text)

=== SENTENCE 1 tokens ===
id:  (1,)	token: An
id:  (2,)	token: diesem
id:  (3,)	token: 10.
id:  (4,)	token: Januar
id:  (5,)	token: jährt
id:  (6,)	token: sich
id:  (7,)	token: David
id:  (8,)	token: Bowies
id:  (9,)	token: Todestag
id:  (10,)	token: zum
id:  (11,)	token: fünften
id:  (12,)	token: Mal.
=== SENTENCE 2 tokens ===
id:  (1,)	token: Seine
id:  (2,)	token: Musik
id:  (3,)	token: und
id:  (4,)	token: die
id:  (5,)	token: seiner
id:  (6,)	token: Alter
id:  (7,)	token: Egos,
id:  (8,)	token: darunter
id:  (9,)	token: "Ziggy
id:  (10,)	token: Stardust",
id:  (11,)	token: bleibt
id:  (12,)	token: jedoch
id:  (13,)	token: zeitlos.


# Processor 2: Multi-Word Token (MWT) Expansion
* **Name**: mwt
* **Annotator class name**: MWTProcessor
* **Requirement**: tokenize
* **Description**: 
  * Expands multi-word tokens into multiple words when they are predicted by the tokenizer. 
  * Each ```Token``` will correspond to one or more ```Word```s
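
The token/word relationship also works in the reverse direction: words that share a parent token collapse back into one surface token. The sketch below uses hand-written tuples for part of the sentence "Artikel zum gleichen Thema". The parent ids are assumed values in the `Token.id` tuple format, and `tokens_from_words` is a hypothetical helper.

```python
def tokens_from_words(words):
    """Rebuild the surface tokens from (word_text, parent_id, parent_text).

    Consecutive words that share the same parent token id are collapsed
    into that single parent token.
    """
    surface = []
    last_id = None
    for _word_text, parent_id, parent_text in words:
        if parent_id != last_id:
            surface.append(parent_text)
            last_id = parent_id
    return surface


# Hand-written sample: "zu" and "dem" share the parent token "zum".
words = [
    ("Artikel", (4,), "Artikel"),
    ("zu", (5, 6), "zum"),
    ("dem", (5, 6), "zum"),
    ("gleichen", (7,), "gleichen"),
    ("Thema", (8,), "Thema"),
]
print(tokens_from_words(words))  # ['Artikel', 'zum', 'gleichen', 'Thema']
```

Comparing parent ids rather than parent text keeps two genuinely repeated tokens (for example "die die") from being merged.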


In [None]:
nlp = stanza.Pipeline('de')
doc = nlp("""Jede Sprachversion von Wikipedia ist ein eigenständiges Projekt. Das heißt, Artikel zum gleichen Thema werden in verschiedenen Sprachen von unterschiedlichen Autoren geschrieben und bearbeitet. Diese können unterschiedliche Schwerpunkte setzen. Doch es kann auch dazu kommen, dass gerade politisch heikle Themen in verschiedenen Sprachen anders gedeutet werden. Ein Beispiel ist die Situation um die Halbinsel Krim, die Russland im März 2014 von der Ukraine annektiert hat.""")
sentences = doc.sentences

In [None]:
# token -> word
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} tokens ===')
  for token in sentence.tokens :
    print("token: ",token.text, end = "\t")
    print(f"words: {', '.join([word.text for word in token.words])}")

In [None]:
# extract only multi-word tokens (tokens that expand to more than one word)
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} tokens ===')
  for token in sentence.tokens:
    if len(token.words) > 1 :
      print("token: ",token.text, end = "\t")
      print(f"words: {', '.join([word.text for word in token.words])}")

In [63]:
# word -> token
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words :
    print("word: ",word.text, end = "\t")
    print("parent token: ", word.parent.text)

=== SENTENCE 1 words ===
word:  Jede	parent token:  Jede
word:  Sprachversion	parent token:  Sprachversion
word:  von	parent token:  von
word:  Wikipedia	parent token:  Wikipedia
word:  ist	parent token:  ist
word:  ein	parent token:  ein
word:  eigenständiges	parent token:  eigenständiges
word:  Projekt	parent token:  Projekt
word:  .	parent token:  .
=== SENTENCE 2 words ===
word:  Das	parent token:  Das
word:  heißt	parent token:  heißt
word:  ,	parent token:  ,
word:  Artikel	parent token:  Artikel
word:  zu	parent token:  zum
word:  dem	parent token:  zum
word:  gleichen	parent token:  gleichen
word:  Thema	parent token:  Thema
word:  werden	parent token:  werden
word:  in	parent token:  in
word:  verschiedenen	parent token:  verschiedenen
word:  Sprachen	parent token:  Sprachen
word:  von	parent token:  von
word:  unterschiedlichen	parent token:  unterschiedlichen
word:  Autoren	parent token:  Autoren
word:  geschrieben	parent token:  geschrieben
word:  und	parent token:  und
wor

In [None]:
# extract only words that belong to a multi-word token
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words :
    if word.text != word.parent.text:
      print("word: ",word.text, end = "\t")
      print("parent token: ", word.parent.text)

# Processor 3: Part-of-Speech & Morphological Features
* **Name**: pos
* **Annotator class name**: POSProcessor
* **Requirement**: tokenize, mwt
* **Description**: Labels each ```Word``` with UPOS, XPOS and UFeats, accessible through the ```Word```'s properties

In [None]:
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words:
    print("word: ", word.text, end = '\t')
    print("upos: ", word.upos, end = '\t')
    print("xpos: ", word.xpos, end = '\t')
    print("feats: ", word.feats)

In [None]:
# print only words with upos == NOUN
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words:
    if word.upos == 'NOUN' :
      print("word: ", word.text, end = '\t')
      print("feats: ", word.feats)

In [None]:
# extract only words with feats Gender=Fem and Number=Sing
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words:
    if word.feats is not None:
      features = word.feats.split("|")
      if 'Gender=Fem' in features and 'Number=Sing' in features :
        print("word: ", word.text, end = "\t")
        print("upos: ", word.upos, end = "\t")
        print("feats: ", word.feats)     

# Processor 4: Lemmatization
* **Name**: lemma
* **Annotator class name**: LemmaProcessor
* **Requirement**: tokenize, mwt, pos
* **Description**: Generates the ```Word``` lemmas for all words in the document. 

In [None]:
sentence = sentences[0]
sentence.text

In [69]:
for word in sentence.words :
  print("word: ", word.text, end = "\t")
  print("lemma: ", word.lemma)

word:  Jede	lemma:  jed
word:  Sprachversion	lemma:  Sprachversion
word:  von	lemma:  von
word:  Wikipedia	lemma:  Wikipedia
word:  ist	lemma:  sein
word:  ein	lemma:  ein
word:  eigenständiges	lemma:  eigenständig
word:  Projekt	lemma:  Projekt
word:  .	lemma:  .


In [70]:
# Lemmatize sentence by sentence
for sentence in sentences :
  print(sentence.text, end = "\t")
  print("==lemmatization===>", end = "\t")
  print(' '.join(word.lemma for word in sentence.words))

Jede Sprachversion von Wikipedia ist ein eigenständiges Projekt.	==lemmatization===>	jed Sprachversion von Wikipedia sein ein eigenständig Projekt .
Das heißt, Artikel zum gleichen Thema werden in verschiedenen Sprachen von unterschiedlichen Autoren geschrieben und bearbeitet.	==lemmatization===>	der heißen , Artikel zu der gleich Thema werden in verschieden Sprache von unterschiedlich Autor schreiben und bearbeiten .
Diese können unterschiedliche Schwerpunkte setzen.	==lemmatization===>	dies können unterschiedlich Schwerpunkt setzen .
Doch es kann auch dazu kommen, dass gerade politisch heikle Themen in verschiedenen Sprachen anders gedeutet werden.	==lemmatization===>	doch es können auch dazu kommen , dass gerade politisch heikl Thema in verschieden Sprache anders deuten werden .
Ein Beispiel ist die Situation um die Halbinsel Krim, die Russland im März 2014 von der Ukraine annektiert hat.	==lemmatization===>	ein Beispiel sein der Situation um der Halbinsel Krim , der Rußland in der 
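
One reason to lemmatize is that inflected forms collapse onto a shared key. The sketch below counts lemmas across two of the lemmatized sentences printed above (copied verbatim from that output), skipping punctuation. It uses only the standard library; the `isalpha` filter is a simplifying assumption that would also drop numbers and hyphenated words.

```python
from collections import Counter

# Lemmatized sentences 2 and 4, copied from the output above.
lemmatized = [
    "der heißen , Artikel zu der gleich Thema werden in verschieden Sprache "
    "von unterschiedlich Autor schreiben und bearbeiten .",
    "doch es können auch dazu kommen , dass gerade politisch heikl Thema in "
    "verschieden Sprache anders deuten werden .",
]

counts = Counter(
    lemma
    for line in lemmatized
    for lemma in line.split()
    if lemma.isalpha()  # assumption: keep purely alphabetic lemmas only
)

# Lemmas that occur more than once across the two sentences.
repeated = sorted(lemma for lemma, n in counts.items() if n > 1)
print(repeated)
```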

# Processor 5: Dependency Parsing
* **Name**: depparse
* **Annotator class name**: DepparseProcessor
* **Requirement**: tokenize, mwt, pos, lemma
* **Description**: Determines the syntactic head of each word in a sentence and the dependency relation between the two, accessible through the ```head``` and ```deprel``` attributes of ```Word```. A ```head``` of 0 marks the root of the sentence.
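
Since every word stores only the id of its head, the tree has to be rebuilt from those ids. The following is a minimal sketch of that step on plain tuples; the `(id, text, head, deprel)` rows are copied from the parser output for the first example sentence of this tutorial, and `children_by_head` and `subtree` are hypothetical helpers, not Stanza functions.

```python
from collections import defaultdict

# (word id, text, head id, deprel) for "Jede Sprachversion von Wikipedia
# ist ein eigenständiges Projekt."
rows = [
    (1, "Jede", 2, "det"),
    (2, "Sprachversion", 8, "nsubj"),
    (3, "von", 4, "case"),
    (4, "Wikipedia", 2, "nmod"),
    (5, "ist", 8, "cop"),
    (6, "ein", 8, "det"),
    (7, "eigenständiges", 8, "amod"),
    (8, "Projekt", 0, "root"),
    (9, ".", 8, "punct"),
]

def children_by_head(rows):
    """Group words under the id of their head (0 is the artificial root)."""
    children = defaultdict(list)
    for word_id, text, head, deprel in rows:
        children[head].append((word_id, text, deprel))
    return children

def subtree(children, head):
    """Return the ids of all words dominated by `head`, in sentence order."""
    ids = []
    for word_id, _, _ in children.get(head, []):
        ids.append(word_id)
        ids.extend(subtree(children, word_id))
    return sorted(ids)

children = children_by_head(rows)
root_text = children[0][0][1]          # the word whose head is 0
subject_ids = subtree(children, 2)     # everything below "Sprachversion"
print(root_text, subject_ids)
```

Word ids are 1-based, which is why the notebook code looks up a head with `sentence.words[word.head-1]`.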

In [71]:
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words:
      print("word id : ", word.id, end = '\t')
      print("word: ", word.text, end = "\t")
      print("head id: ", word.head, end = "\t")
      print("head: ", end = "")
      if word.head > 0 :
        print(sentence.words[word.head-1].text, end = "\t")
      else :
        print("root", end = "\t")
      print("deprel: ", word.deprel)

=== SENTENCE 1 words ===
word id :  1	word:  Jede	head id:  2	head: Sprachversion	deprel:  det
word id :  2	word:  Sprachversion	head id:  8	head: Projekt	deprel:  nsubj
word id :  3	word:  von	head id:  4	head: Wikipedia	deprel:  case
word id :  4	word:  Wikipedia	head id:  2	head: Sprachversion	deprel:  nmod
word id :  5	word:  ist	head id:  8	head: Projekt	deprel:  cop
word id :  6	word:  ein	head id:  8	head: Projekt	deprel:  det
word id :  7	word:  eigenständiges	head id:  8	head: Projekt	deprel:  amod
word id :  8	word:  Projekt	head id:  0	head: root	deprel:  root
word id :  9	word:  .	head id:  8	head: Projekt	deprel:  punct
=== SENTENCE 2 words ===
word id :  1	word:  Das	head id:  2	head: heißt	deprel:  nsubj
word id :  2	word:  heißt	head id:  0	head: root	deprel:  root
word id :  3	word:  ,	head id:  16	head: geschrieben	deprel:  punct
word id :  4	word:  Artikel	head id:  16	head: geschrieben	deprel:  nsubj:pass
word id :  5	word:  zu	head id:  8	head: Thema	deprel:  case


In [72]:
# Extract only the root and the words whose head is the root
for i, sentence in enumerate(sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for word in sentence.words:
    if word.head == 0:
      headId = word.id
  for word in sentence.words:
    if word.head == headId or word.head == 0:
        print("word id : ", word.id, end = '\t')
        print("word: ", word.text, end = "\t")
        print("head id: ", word.head, end = "\t")
        print("head: ", end = "")
        if word.head > 0 :
          print(sentence.words[word.head-1].text, end = "\t")
        else :
          print("root", end = "\t")
        print("deprel: ", word.deprel)   

=== SENTENCE 1 words ===
word id :  2	word:  Sprachversion	head id:  8	head: Projekt	deprel:  nsubj
word id :  5	word:  ist	head id:  8	head: Projekt	deprel:  cop
word id :  6	word:  ein	head id:  8	head: Projekt	deprel:  det
word id :  7	word:  eigenständiges	head id:  8	head: Projekt	deprel:  amod
word id :  8	word:  Projekt	head id:  0	head: root	deprel:  root
word id :  9	word:  .	head id:  8	head: Projekt	deprel:  punct
=== SENTENCE 2 words ===
word id :  1	word:  Das	head id:  2	head: heißt	deprel:  nsubj
word id :  2	word:  heißt	head id:  0	head: root	deprel:  root
word id :  16	word:  geschrieben	head id:  2	head: heißt	deprel:  conj
word id :  19	word:  .	head id:  2	head: heißt	deprel:  punct
=== SENTENCE 3 words ===
word id :  1	word:  Diese	head id:  5	head: setzen	deprel:  nsubj
word id :  2	word:  können	head id:  5	head: setzen	deprel:  aux
word id :  4	word:  Schwerpunkte	head id:  5	head: setzen	deprel:  obj
word id :  5	word:  setzen	head id:  0	head: root	deprel:  r

# Processor 6: Named Entity Recognition
* **Name**: ner
* **Annotator class name**: NERProcessor
* **Requirement**: tokenize, mwt
* **Description**: Recognizes named entities, which can be read at three levels:
  * Document level: the ```entities``` property of ```Document```.
  * Sentence level: the ```entities``` property of ```Sentence```.
  * Token level: the NER tag in the ```ner``` property of each ```Token```.

In [73]:
doc = nlp("""David Bowie hinterlässt ein immenses und einzigartiges Werk. In den frühen 1970er Jahren mischte er die Musikwelt in schillernden Kostümen und flammend rot gefärbtem Vokuhila-Schnitt als "Ziggy Stardust" auf.""")
doc.entities

[{
   "text": "David Bowie",
   "type": "PER",
   "start_char": 0,
   "end_char": 11
 }, {
   "text": "Ziggy Stardust",
   "type": "MISC",
   "start_char": 188,
   "end_char": 202
 }]

In [74]:
# Document level
for ent in doc.entities :
  print("entity: ", ent.text, end = "\t")
  print("type: ", ent.type)

entity:  David Bowie	type:  PER
entity:  Ziggy Stardust	type:  MISC


In [75]:
# Sentence level
for i, sentence in enumerate(doc.sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for ent in sentence.entities :
    print("entity: ", ent.text, end = "\t")
    print("type: ", ent.type)

=== SENTENCE 1 words ===
entity:  David Bowie	type:  PER
=== SENTENCE 2 words ===
entity:  Ziggy Stardust	type:  MISC


In [76]:
# Token level
for i, sentence in enumerate(doc.sentences) :
  print(f'=== SENTENCE {i+1} words ===')
  for token in sentence.tokens :
    print("token: ", token.text, end = '\t')
    print("ner: ", token.ner)

=== SENTENCE 1 words ===
token:  David	ner:  B-PER
token:  Bowie	ner:  E-PER
token:  hinterlässt	ner:  O
token:  ein	ner:  O
token:  immenses	ner:  O
token:  und	ner:  O
token:  einzigartiges	ner:  O
token:  Werk	ner:  O
token:  .	ner:  O
=== SENTENCE 2 words ===
token:  In	ner:  O
token:  den	ner:  O
token:  frühen	ner:  O
token:  1970er	ner:  O
token:  Jahren	ner:  O
token:  mischte	ner:  O
token:  er	ner:  O
token:  die	ner:  O
token:  Musikwelt	ner:  O
token:  in	ner:  O
token:  schillernden	ner:  O
token:  Kostümen	ner:  O
token:  und	ner:  O
token:  flammend	ner:  O
token:  rot	ner:  O
token:  gefärbtem	ner:  O
token:  Vokuhila	ner:  O
token:  -	ner:  O
token:  Schnitt	ner:  O
token:  als	ner:  O
token:  "	ner:  O
token:  Ziggy	ner:  B-MISC
token:  Stardust	ner:  E-MISC
token:  "	ner:  O
token:  auf	ner:  O
token:  .	ner:  O
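
The token-level tags above carry a position prefix (`B-` for the first token of an entity, `E-` for the last). As a rough sketch of how such tags relate to the entity spans in `doc.entities`, the function below merges tagged tokens back into spans. It assumes a BIOES-style scheme in which `I-` marks inside tokens and `S-` marks single-token entities; those two prefixes do not appear in the output above, and `decode_bioes` is a hypothetical helper. Joining tokens with a space is a simplification; the pipeline itself reports character offsets.

```python
def decode_bioes(tagged):
    """Merge (token, tag) pairs into (entity text, entity type) spans."""
    spans = []
    current, current_type = [], None
    for text, tag in tagged:
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "S":
            # Single-token entity.
            spans.append((text, ent_type))
            current, current_type = [], None
        elif prefix == "B":
            # Start of a multi-token entity.
            current, current_type = [text], ent_type
        elif prefix in ("I", "E") and current and ent_type == current_type:
            current.append(text)
            if prefix == "E":
                # End of the entity: emit the collected tokens.
                spans.append((" ".join(current), ent_type))
                current, current_type = [], None
        else:
            # Malformed sequence: drop the partial span.
            current, current_type = [], None
    return spans

# A shortened selection of the tagged tokens from the output above.
tagged = [
    ("David", "B-PER"), ("Bowie", "E-PER"), ("hinterlässt", "O"),
    ('"', "O"), ("Ziggy", "B-MISC"), ("Stardust", "E-MISC"), ('"', "O"),
]
spans = decode_bioes(tagged)
print(spans)
```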


# Processor 7: Sentiment Analysis
* **Name**: sentiment
* **Annotator class name**: SentimentProcessor
* **Requirement**: tokenize
* **Description**: Adds a ```sentiment``` value to each ```Sentence```.
  * 0: negative
  * 1: neutral
  * 2: positive
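
For printing, the integer scores listed above can be mapped to readable labels. This is a small sketch using a plain dictionary; `SENTIMENT_LABELS` and `sentiment_label` are hypothetical names introduced here, and the `"unknown"` fallback is an assumption for values outside the documented range.

```python
# Label mapping taken from the list above.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def sentiment_label(score):
    """Return the label for a sentiment score, or 'unknown' if unmapped."""
    return SENTIMENT_LABELS.get(score, "unknown")

labels = [sentiment_label(score) for score in (0, 1, 2)]
print(labels)
```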

In [77]:
for sentence in doc.sentences :
  print(sentence.sentiment, end = "\t")
  print(sentence.text)

1	David Bowie hinterlässt ein immenses und einzigartiges Werk.
1	In den frühen 1970er Jahren mischte er die Musikwelt in schillernden Kostümen und flammend rot gefärbtem Vokuhila-Schnitt als "Ziggy Stardust" auf.
