**Introduction to Stanza and Named Entity Recognition**

Stanza documentation: https://stanfordnlp.github.io/stanza/

More documentation with examples, geared towards beginners: https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_Beginners_Guide.ipynb

Optional reading: https://www.newfireglobal.com/learn/natural-language-understanding-tools/#informed
This article covers some of the main differences between spaCy and Stanza, with examples.

**Imports**

In [1]:
import os
import pandas as pd

In [4]:
import stanza

In [3]:
! pip install stanza

Collecting stanza
  Downloading stanza-1.6.1-py3-none-any.whl (881 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Collecting protobuf>=3.15.0 (from stanza)
  Downloading protobuf-4.24.4-cp37-abi3-macosx_10_9_universal2.whl (409 kB)
Collecting torch>=1.3.0 (from stanza)
  Downloading torch-2.1.0-cp311-none-macosx_11_0_arm64.whl (59.6 MB)
Collecting filelock (from torch>=1.3.0->stanza)
  Downloading filelock-3.12.4-py3-none-any.

In [5]:
print(os.getcwd())  # shows the current working directory
# Printing it is useful for building the file path, as we will see below.
# Make sure you are in the folder that contains the text file.

/Users/Max/Desktop/Intro to Digital Humanities/cls161_fall23/python_exercises/In_Class_Exercises/NER_Gibon_II


In [12]:
%cd /Users/Max/Desktop/Intro to Digital Humanities/cls161_fall23/python_exercises/In_Class_Exercises/NER_Gibon_II

/Users/Max/Desktop/Intro to Digital Humanities/cls161_fall23/python_exercises/In_Class_Exercises/NER_Gibon_II


**Inputs**
Stanza can handle several types of input for different tasks. For this exercise, we split the text into sentences ourselves and give Stanza a list of sentences (i.e., strings).
If we weren't splitting the text by sentence beforehand, we would give it a single string (i.e., the entire text file), and the pipeline would split that string into sentences itself as part of tokenization.
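
To make the pretokenized input concrete, here is a minimal sketch of the two shapes the Stanza documentation describes for `tokenize_pretokenized=True`: a list of token lists, or a single string with tokens separated by spaces and sentences separated by newlines. The sample sentences are shortened from the chapter used in this notebook, and the variable names are illustrative.

```python
# Two sentences, shortened from the Gibbon chapter used in this notebook.
sentences = [
    "Constantine gave them security, wealth, honours, and revenge",
    "The edict of Milan had confirmed the privilege",
]

# Shape 1: a list of sentences, each a list of tokens (split on whitespace only).
token_lists = [sentence.split() for sentence in sentences]

# Shape 2: one string, tokens separated by spaces and sentences by newlines.
pretokenized_text = "\n".join(" ".join(tokens) for tokens in token_lists)

print(token_lists[0][:4])
print(pretokenized_text)
```

Because the tokens are split on whitespace only, punctuation stays attached to the neighbouring word (`security,`). This is why tokens such as `judicature;` appear in the outputs of this notebook.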

In [7]:
# Here we load the Stanza model for English-language processing.
# We will go through what these different processors do in a bit!
# To avoid output-size issues, we pretokenize the text before giving it to the processors.
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse,ner', tokenize_pretokenized=True)

2023-10-25 15:12:30 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/tokenize/combined.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/pos/combined_charlm.pt:   0%|  …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/lemma/combined_nocharlm.pt:   0…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/depparse/combined_charlm.pt:   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/ner/ontonotes_charlm.pt:   0%| …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/pretrain/conll17.pt:   0%|     …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/pretrain/fasttextcrawl.pt:   0%…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/backward_charlm/1billion.pt:   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/forward_charlm/1billion.pt:   0…

2023-10-25 15:12:58 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
| ner       | ontonotes_charlm  |

2023-10-25 15:12:58 INFO: Using device: cpu
2023-10-25 15:12:58 INFO: Loading: tokenize
2023-10-25 15:12:58 INFO: Loading: pos
2023-10-25 15:12:59 INFO: Loading: lemma
2023-10-25 15:12:59 INFO: Loading: depparse
2023-10-25 15:12:59 INFO: Loading: ner
2023-10-25 15:12:59 INFO: Done loading processors!


In [17]:
# Build the path with an f-string from the current working directory and the file name.
# This is an example of when you can use f-strings to your advantage.
path = f'{os.getcwd()}/gibbon_decline_volume1_chap21.txt'
with open(path, encoding='utf-8', mode='r') as f:
    vol1_chap21 = f.read()
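
An alternative way to build the path is `pathlib`, which joins path parts with `/` and picks the right separator for the operating system. The sketch below writes a tiny stand-in file to a temporary folder so that it runs anywhere; in the notebook you would point it at the real chapter file instead. The sample text is a placeholder, not the real chapter.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Join the folder and the file name; Path handles the separator.
    path = Path(folder) / "gibbon_decline_volume1_chap21.txt"

    # Stand-in content so the example is self-contained (not the real chapter).
    path.write_text("Constantine gave them security. The edict of Milan.", encoding="utf-8")

    # read_text opens, reads, and closes the file in one call.
    sample_text = path.read_text(encoding="utf-8")

print(sample_text)
```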

In [18]:
vol1_chap21  # looking at the text
# Luckily for us, every sentence in this file ends with a period followed by one space,
# so we can use .split to create a list of sentences.

' Persecution of heresy — The Schism of the Donatists — The Arian Controversy — Athanasius — Distracted state of the Church and the Empire under Constantine and his sons — Toleration of Paganism \n THE grateful applause of the clergy has consecrated the memory of a prince, who indulged their passions and promoted their interest. Constantine gave them security, wealth, honours, and revenge; and the support of the orthodox faith  was considered as the most sacred and important duty of the civil magistrate. The edict of Milan, the great charter of toleration, had confirmed to each individual of the Roman world the privilege of choosing and professing his own religion. But this inestimable privilege was soon violated: with the knowledge of truth the emperor imbibed the maxims of persecution; and the sects which dissented from the Catholic church were afflicted and oppressed by the triumph of Christianity. Constantine easily believed that the heretics, who presumed to dispute his opinions o

In [19]:
chap21_sents = vol1_chap21.split('. ')
chap21_sents  # even if it is not perfect, it's good enough for our purposes
# Now we have one large list of every sentence in this chapter!

[' Persecution of heresy — The Schism of the Donatists — The Arian Controversy — Athanasius — Distracted state of the Church and the Empire under Constantine and his sons — Toleration of Paganism \n THE grateful applause of the clergy has consecrated the memory of a prince, who indulged their passions and promoted their interest',
 'Constantine gave them security, wealth, honours, and revenge; and the support of the orthodox faith  was considered as the most sacred and important duty of the civil magistrate',
 'The edict of Milan, the great charter of toleration, had confirmed to each individual of the Roman world the privilege of choosing and professing his own religion',
 'But this inestimable privilege was soon violated: with the knowledge of truth the emperor imbibed the maxims of persecution; and the sects which dissented from the Catholic church were afflicted and oppressed by the triumph of Christianity',
 'Constantine easily believed that the heretics, who presumed to dispute h
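
Splitting on `'. '` drops the final period of each sentence and leaves line breaks inside sentences. If that matters for your analysis, one possible refinement (a sketch, not what this notebook uses) is to split with a regular expression that keeps the period, then collapse the internal whitespace. The helper name `split_sentences` is made up for this example, and the sample text is adapted from the chapter. Like the simple split, it will still break on abbreviations that end in a period.

```python
import re

def split_sentences(text):
    """Split after a period followed by whitespace, keeping the period."""
    # (?<=\.) is a lookbehind: split at whitespace that comes right after a period.
    pieces = re.split(r"(?<=\.)\s+", text.strip())
    # Collapse newlines and repeated spaces inside each sentence.
    return [" ".join(piece.split()) for piece in pieces if piece]

sample = (
    "Constantine gave them security, wealth,\n honours, and revenge. "
    "The edict of Milan was the great charter of toleration."
)
for sentence in split_sentences(sample):
    print(sentence)
```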

In [20]:
# We can use list indexing to grab one sentence at a time.
# Creating a Doc can take some time, so let's start with a test sentence.
test = chap21_sents[21]  # pulling one sentence by its index

test


'That divided church was incapable of affording\n an impartial judicature; the controversy was solemnly tried\n in five successive tribunals, which were appointed by the\n emperor; and the whole proceeding, from the first appeal to\n the final sentence, lasted above three years'

In [21]:
doc = nlp(test)

In [22]:
# Stanza's output is an object called a Document (often shortened to "Doc").
# Let's see what the output looks like.
doc

[
  [
    {
      "id": 1,
      "text": "That",
      "lemma": "that",
      "upos": "DET",
      "xpos": "DT",
      "feats": "Number=Sing|PronType=Dem",
      "head": 3,
      "deprel": "det",
      "misc": "",
      "start_char": 0,
      "end_char": 4,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 2,
      "text": "divided",
      "lemma": "divide",
      "upos": "VERB",
      "xpos": "VBN",
      "feats": "Tense=Past|VerbForm=Part",
      "head": 3,
      "deprel": "amod",
      "misc": "",
      "start_char": 5,
      "end_char": 12,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 3,
      "text": "church",
      "lemma": "church",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "head": 5,
      "deprel": "nsubj",
      "misc": "",
      "start_char": 13,
      "end_char": 19,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 4,
      "text": 

In [23]:
# Let's see what type this is.
type(doc)

stanza.models.common.doc.Document
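
The printed `Document` above is a nested structure: a list of sentences, each of which is a list of per-token dictionaries. As a rough sketch of how to work with that shape in plain Python, the snippet below copies three tokens from the output above into a hand-written list and tallies their `upos` tags. In the notebook, `doc.to_dict()` should return the full structure in this form; the variable `sample_doc` is only a stand-in, and several fields are left out for brevity.

```python
from collections import Counter

# Hand-copied from the printed output above: one sentence with three tokens.
sample_doc = [
    [
        {"id": 1, "text": "That", "lemma": "that", "upos": "DET", "ner": "O"},
        {"id": 2, "text": "divided", "lemma": "divide", "upos": "VERB", "ner": "O"},
        {"id": 3, "text": "church", "lemma": "church", "upos": "NOUN", "ner": "O"},
    ]
]

# Outer loop: sentences. Inner loop: token dictionaries.
texts = [token["text"] for sentence in sample_doc for token in sentence]
upos_counts = Counter(token["upos"] for sentence in sample_doc for token in sentence)

print(texts)
print(upos_counts)
```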

In [36]:
## Take 5-10 minutes and go to the Stanza documentation linked above.
# In a group, write code to access the text, pos, ner, and upos components of our tokens.
print(*[f'word:  {word.text}\tpos: {word.pos}\tupos:  {word.upos}\txpos:  {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

word:  That	pos: DET	upos:  DET	xpos:  DT	feats: Number=Sing|PronType=Dem
word:  divided	pos: VERB	upos:  VERB	xpos:  VBN	feats: Tense=Past|VerbForm=Part
word:  church	pos: NOUN	upos:  NOUN	xpos:  NN	feats: Number=Sing
word:  was	pos: AUX	upos:  AUX	xpos:  VBD	feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word:  incapable	pos: ADJ	upos:  ADJ	xpos:  JJ	feats: Degree=Pos
word:  of	pos: ADP	upos:  ADP	xpos:  IN	feats: _
word:  affording	pos: NOUN	upos:  NOUN	xpos:  NN	feats: Number=Sing
word:  an	pos: DET	upos:  DET	xpos:  DT	feats: Definite=Ind|PronType=Art
word:  impartial	pos: ADJ	upos:  ADJ	xpos:  JJ	feats: Degree=Pos
word:  judicature;	pos: NOUN	upos:  NOUN	xpos:  NN	feats: Number=Sing
word:  the	pos: DET	upos:  DET	xpos:  DT	feats: Definite=Def|PronType=Art
word:  controversy	pos: NOUN	upos:  NOUN	xpos:  NN	feats: Number=Sing
word:  was	pos: AUX	upos:  AUX	xpos:  VBD	feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word:  solemnly	pos: ADV	upos:  ADV	xpos:

In [37]:
# Next, look at the Stanza documentation and print the named entities for our document.
print(doc.entities)

[{
  "text": "five",
  "type": "CARDINAL",
  "start_char": 110,
  "end_char": 114
}, {
  "text": "first",
  "type": "ORDINAL",
  "start_char": 209,
  "end_char": 214
}, {
  "text": "three years",
  "type": "DATE",
  "start_char": 258,
  "end_char": 269
}]
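
The `start_char` and `end_char` values can be used to slice an entity back out of the text. In this output the offsets appear to count through the tokens joined by single spaces, not through the raw string with its line breaks. That is an observation from the numbers above rather than a documented guarantee, so treat the following as a small check in plain Python, not as a description of Stanza's internals.

```python
# The test sentence as shown above, including its line breaks.
test = (
    "That divided church was incapable of affording\n an impartial judicature; "
    "the controversy was solemnly tried\n in five successive tribunals, "
    "which were appointed by the\n emperor; and the whole proceeding, "
    "from the first appeal to\n the final sentence, lasted above three years"
)

# Join the whitespace-separated tokens with single spaces.
normalized = " ".join(test.split())

# Offsets and types copied from the entity output above.
entities = [(110, 114, "CARDINAL"), (209, 214, "ORDINAL"), (258, 269, "DATE")]
for start, end, label in entities:
    print(label, repr(normalized[start:end]))
```

Slicing the raw `test` string with the same offsets would land a few characters off, because each line break plus its following space takes up two characters instead of one.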


**NER and processing multiple documents**

In [39]:
# Load a new pipeline with just the tokenize and NER processors.
nlp_ner = stanza.Pipeline(lang='en', processors='tokenize,ner', tokenize_pretokenized=True)
# tokenize_pretokenized=True is still included, since we are going to give the pipeline our large list of sentences.

2023-10-25 15:41:15 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2023-10-25 15:41:16 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2023-10-25 15:41:16 INFO: Using device: cpu
2023-10-25 15:41:16 INFO: Loading: tokenize
2023-10-25 15:41:16 INFO: Loading: ner
2023-10-25 15:41:17 INFO: Done loading processors!


In [40]:
# This is a solution for our prior output issue; it still takes a bit to run, though.
# Note: this can only be done with text that has already been split into sentences.
# Wrapping each sentence in a stanza.Document lets the pipeline process multiple sentences in one call.
Gibbon_docs_chap21sents = [stanza.Document([], text=d) for d in chap21_sents]
Gibbon_out_docs = nlp_ner(Gibbon_docs_chap21sents)
print(Gibbon_out_docs[1])

[
  [
    {
      "id": 1,
      "text": "Constantine",
      "misc": "",
      "start_char": 0,
      "end_char": 11,
      "ner": "S-PERSON",
      "multi_ner": [
        "S-PERSON"
      ]
    },
    {
      "id": 2,
      "text": "gave",
      "misc": "",
      "start_char": 12,
      "end_char": 16,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 3,
      "text": "them",
      "misc": "",
      "start_char": 17,
      "end_char": 21,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 4,
      "text": "security,",
      "misc": "",
      "start_char": 22,
      "end_char": 31,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 5,
      "text": "wealth,",
      "misc": "",
      "start_char": 32,
      "end_char": 39,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 6,
      "text": "honours,",
      "misc": "",
      "start_char": 40,
      "end_c

In [41]:
# How we access the named entities from our list of documents
for doc in Gibbon_out_docs:
    print(doc.ents)

[{
  "text": "The Schism of the Donatists",
  "type": "WORK_OF_ART",
  "start_char": 24,
  "end_char": 51
}, {
  "text": "The Arian Controversy — Athanasius — Distracted state of the Church and the Empire under Constantine",
  "type": "WORK_OF_ART",
  "start_char": 54,
  "end_char": 154
}, {
  "text": "Toleration of Paganism",
  "type": "WORK_OF_ART",
  "start_char": 170,
  "end_char": 192
}]
[{
  "text": "Constantine",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 11
}]
[{
  "text": "Milan,",
  "type": "GPE",
  "start_char": 13,
  "end_char": 19
}, {
  "text": "Roman",
  "type": "NORP",
  "start_char": 93,
  "end_char": 98
}]
[{
  "text": "Catholic",
  "type": "NORP",
  "start_char": 164,
  "end_char": 172
}, {
  "text": "Christianity",
  "type": "NORP",
  "start_char": 227,
  "end_char": 239
}]
[{
  "text": "Constantine",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 11
}]
[]
[{
  "text": "East",
  "type": "LOC",
  "start_char": 94,
  "end_char": 98
}, {
  "text": "Cons

In [43]:
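NER_Dict = {}
for doc in Gibbon_out_docs:
    for sent in doc.sentences:
        for token in sent.tokens:
            # Keep only tokens that are part of an entity (tag other than 'O').
            if token.ner != 'O':
                # A repeated token keeps only the tag from its last occurrence.
                NER_Dict[token.text] = token.ner

NER_Dict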

{'The': 'B-PERSON',
 'Schism': 'B-PERSON',
 'of': 'I-DATE',
 'the': 'I-DATE',
 'Donatists': 'S-PERSON',
 'Arian': 'S-NORP',
 'Controversy': 'I-WORK_OF_ART',
 '—': 'I-WORK_OF_ART',
 'Athanasius': 'S-PERSON',
 'Distracted': 'I-WORK_OF_ART',
 'state': 'I-WORK_OF_ART',
 'Church': 'I-WORK_OF_ART',
 'and': 'I-CARDINAL',
 'Empire': 'I-WORK_OF_ART',
 'under': 'I-WORK_OF_ART',
 'Constantine': 'S-PERSON',
 'Toleration': 'B-WORK_OF_ART',
 'Paganism': 'S-NORP',
 'Milan,': 'B-PERSON',
 'Roman': 'S-NORP',
 'Catholic': 'S-NORP',
 'Christianity': 'S-NORP',
 'East': 'S-LOC',
 'Paul': 'S-PERSON',
 'Samosata;': 'I-PERSON',
 'Montanists': 'I-PERSON',
 'Phrygia,': 'E-PERSON',
 'Novatians,': 'S-NORP',
 'Marcionites': 'S-NORP',
 'Valentinians,': 'S-NORP',
 'Gnostics': 'S-NORP',
 'Asia': 'B-LOC',
 'Egypt': 'S-GPE',
 'Manichaeans,': 'S-NORP',
 'Persia': 'S-GPE',
 'Oriental': 'S-NORP',
 'Christian': 'S-NORP',
 'Diocletian;': 'S-PERSON',
 'Two': 'B-DATE',
 'Manichaeans': 'S-NORP',
 'Constantinople;': 'S-GPE',
 '
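
The token-level dictionary above splits multi-word entities into pieces, so for counting it is often easier to work with the entity-level output from `doc.ents` shown earlier. The sketch below uses a hand-copied subset of those entities (as plain dictionaries) and counts how often each type occurs, trimming trailing punctuation such as the comma in `Milan,`. With real Stanza output you would read `ent.text` and `ent.type` from each entity instead; the list name `sample_entities` is illustrative.

```python
from collections import Counter
import string

# A few entities copied from the doc.ents output earlier in this notebook.
sample_entities = [
    {"text": "Constantine", "type": "PERSON"},
    {"text": "Milan,", "type": "GPE"},
    {"text": "Roman", "type": "NORP"},
    {"text": "Catholic", "type": "NORP"},
    {"text": "Christianity", "type": "NORP"},
    {"text": "Constantine", "type": "PERSON"},
]

# Count entity types, and count names after stripping surrounding punctuation.
type_counts = Counter(entity["type"] for entity in sample_entities)
name_counts = Counter(
    entity["text"].strip(string.punctuation) for entity in sample_entities
)

print(type_counts.most_common())
print(name_counts.most_common(2))
```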

**Quick Pandas Dataframes Intro**

Beginner Tutorials: https://pandas.pydata.org/docs/getting_started/index.html#getting-started


Udemy (https://access.tufts.edu/udemy-business) is a great resource for coding tutorials.
I especially recommend "Python for Data Science and Machine Learning Bootcamp" by Jose Portilla, which has a section on pandas for data analysis.

In [45]:
NER_Dict

{'The': 'B-PERSON',
 'Schism': 'B-PERSON',
 'of': 'I-DATE',
 'the': 'I-DATE',
 'Donatists': 'S-PERSON',
 'Arian': 'S-NORP',
 'Controversy': 'I-WORK_OF_ART',
 '—': 'I-WORK_OF_ART',
 'Athanasius': 'S-PERSON',
 'Distracted': 'I-WORK_OF_ART',
 'state': 'I-WORK_OF_ART',
 'Church': 'I-WORK_OF_ART',
 'and': 'I-CARDINAL',
 'Empire': 'I-WORK_OF_ART',
 'under': 'I-WORK_OF_ART',
 'Constantine': 'S-PERSON',
 'Toleration': 'B-WORK_OF_ART',
 'Paganism': 'S-NORP',
 'Milan,': 'B-PERSON',
 'Roman': 'S-NORP',
 'Catholic': 'S-NORP',
 'Christianity': 'S-NORP',
 'East': 'S-LOC',
 'Paul': 'S-PERSON',
 'Samosata;': 'I-PERSON',
 'Montanists': 'I-PERSON',
 'Phrygia,': 'E-PERSON',
 'Novatians,': 'S-NORP',
 'Marcionites': 'S-NORP',
 'Valentinians,': 'S-NORP',
 'Gnostics': 'S-NORP',
 'Asia': 'B-LOC',
 'Egypt': 'S-GPE',
 'Manichaeans,': 'S-NORP',
 'Persia': 'S-GPE',
 'Oriental': 'S-NORP',
 'Christian': 'S-NORP',
 'Diocletian;': 'S-PERSON',
 'Two': 'B-DATE',
 'Manichaeans': 'S-NORP',
 'Constantinople;': 'S-GPE',
 '
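
One caveat about `NER_Dict` before turning it into a DataFrame: a dictionary holds one value per key, so a token that appears several times keeps only the tag from its last occurrence. That is why common words such as `of` and `the` end up with surprising tags. If you want every tag a token received, one option (a sketch with made-up sample pairs, not the notebook's data) is to collect the tags in a `Counter` per token.

```python
from collections import Counter, defaultdict

# Illustrative (token, tag) pairs; the same token appears with different tags.
tagged_tokens = [
    ("The", "B-WORK_OF_ART"),
    ("Constantine", "S-PERSON"),
    ("The", "B-PERSON"),
    ("Constantine", "S-PERSON"),
]

# A plain dict keeps only the last tag seen for each token.
last_tag = {}
# A defaultdict of Counters keeps every tag and how often it occurred.
all_tags = defaultdict(Counter)

for token, tag in tagged_tokens:
    last_tag[token] = tag
    all_tags[token][tag] += 1

print(last_tag["The"])
print(dict(all_tags["The"]))
```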

In [46]:
ner_frame= pd.DataFrame.from_dict(NER_Dict, orient= 'index')
ner_frame

Unnamed: 0,0
The,B-PERSON
Schism,B-PERSON
of,I-DATE
the,I-DATE
Donatists,S-PERSON
...,...
"Numa,",S-PERSON
"Augustus,",S-PERSON
(173),S-CARDINAL
Christianity;,S-NORP


In [47]:
# Move the tokens out of the index into a regular column and create a new numeric index.
ner_frame.reset_index(inplace=True)


In [48]:
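# rename returns a new DataFrame, so assign it back to keep the new column names.
ner_frame = ner_frame.rename(columns={'index': 'token', 0: 'NER_tag'})
ner_frame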

Unnamed: 0,token,NER_tag
0,The,B-PERSON
1,Schism,B-PERSON
2,of,I-DATE
3,the,I-DATE
4,Donatists,S-PERSON
...,...,...
519,"Numa,",S-PERSON
520,"Augustus,",S-PERSON
521,(173),S-CARDINAL
522,Christianity;,S-NORP
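
With the tags in a DataFrame, the prefix and the entity type can be separated into their own columns, which makes counting by type straightforward. The sketch below rebuilds a four-row frame from illustrative values (not the full `NER_Dict`) so that it runs on its own; the column names `prefix` and `entity_type` are choices made for this example.

```python
import pandas as pd

# Illustrative token-to-tag pairs in the same shape as NER_Dict.
sample = {
    "Constantine": "S-PERSON",
    "Milan,": "S-GPE",
    "Roman": "S-NORP",
    "Catholic": "S-NORP",
}

frame = pd.DataFrame.from_dict(sample, orient="index").reset_index()
frame = frame.rename(columns={"index": "token", 0: "NER_tag"})

# Split "S-PERSON" into "S" and "PERSON" at the first hyphen only.
frame[["prefix", "entity_type"]] = frame["NER_tag"].str.split("-", n=1, expand=True)

# Count how many tokens fall under each entity type.
type_counts = frame["entity_type"].value_counts()
print(frame)
print(type_counts)
```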


In [38]:
# Keys follow the labels the model prints (e.g. PERSON, as in the output above).
NER_tags = {'PERSON': 'People, including fictional characters', 'NORP': 'Nationalities, religious and political groups: Jewish, Buddhist',
            'TIME': 'Time shorter than a day', 'ORG': 'Companies, agencies, institutions', 'GPE': 'Countries, cities, regions (districts)',
            'LOC': 'Geographical entities', 'PRODUCT': 'Objects, vehicles, foods', 'EVENT': 'Named battles, wars, sports events, catastrophes',
            'QUANTITY': 'Measurements: 10 kg, 200 km', 'ORDINAL': 'Numbers of order: first, third', 'CARDINAL': 'Numerals that do not fall under another type',
            'FAC': 'Buildings, airports, roads', 'LANGUAGE': 'Any named language'}
NER_tags


{'PERSON': 'People, including fictional characters',
 'NORP': 'Nationalities, religious and political groups: Jewish, Buddhist',
 'TIME': 'Time shorter than a day',
 'ORG': 'Companies, agencies, institutions',
 'GPE': 'Countries, cities, regions (districts)',
 'LOC': 'Geographical entities',
 'PRODUCT': 'Objects, vehicles, foods',
 'EVENT': 'Named battles, wars, sports events, catastrophes',
 'QUANTITY': 'Measurements: 10 kg, 200 km',
 'ORDINAL': 'Numbers of order: first, third',
 'CARDINAL': 'Numerals that do not fall under another type',
 'FAC': 'Buildings, airports, roads',
 'LANGUAGE': 'Any named language'}

In [44]:
nested_NER_prefixes= {'B':'Beginning of named entity','I':'Token is inside a named entity','O':'Corresponding word is not an entity','E':'End of named entity','S':'Named entity has only one token/element'}
nested_NER_prefixes

{'B': 'Beginning of named entity',
 'I': 'Token is inside a named entity',
 'O': 'Corresponding word is not an entity',
 'E': 'End of named entity',
 'S': 'Named entity has only one token/element'}
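
These prefixes are what allow token-level tags to be stitched back into multi-token entities. As a rough illustration of the idea (Stanza already does this for you when it builds `doc.ents`), the function below walks through a tagged sentence and joins `B`/`I`/`E` runs into one span. The function name `bioes_to_spans` and the sample tags are invented for this example.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "S":
            # Single-token entity: emit it immediately.
            spans.append((token, ent_type))
            current_tokens, current_type = [], None
        elif prefix == "B":
            # Start collecting a new multi-token entity.
            current_tokens, current_type = [token], ent_type
        elif prefix in ("I", "E") and current_type == ent_type:
            current_tokens.append(token)
            if prefix == "E":
                # End of the entity: join the collected tokens.
                spans.append((" ".join(current_tokens), ent_type))
                current_tokens, current_type = [], None
        else:
            # 'O' or an inconsistent tag: drop any partial entity.
            current_tokens, current_type = [], None
    return spans

tokens = ["Constantine", "reigned", "above", "three", "years"]
tags = ["S-PERSON", "O", "O", "B-DATE", "E-DATE"]
print(bioes_to_spans(tokens, tags))
```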

In [2]:
# Requires the imports cell (pandas as pd) and the NER_tags cell to have been run first.
nertags_frame = pd.DataFrame.from_dict(NER_tags, orient='index')
nertags_frame

In [None]:
nertags_frame.reset_index(inplace=True)

In [1]:
# rename returns a new DataFrame, so assign it back to keep the new column names.
nertags_frame = nertags_frame.rename(columns={'index': 'tag', 0: 'example'})
nertags_frame

In [None]:
bioestags_frame= pd.DataFrame.from_dict(nested_NER_prefixes, orient= 'index')
bioestags_frame

After what we have seen today, how would you describe a named entity?
Do these named entities tell you anything about the contents of this chapter?

In this notebook, what inputs did we give Stanza to process?
What outputs does Stanza produce?
How did we access the different components of this output (e.g., pos, lemma, ner)?

Do you have any remaining questions or things that feel unclear after this demo?